The rise of Platform-as-a-Service (PaaS) offerings across the services spectrum has created opportunity for lightweight, modular approaches not just for applications but also for back-end data processing work. In a few paragraphs, I'll describe an approach using the open source Pentaho Data Integration (a.k.a. Kettle) combined with AWS offerings to create a quick, lightweight ETL platform.
In this scenario, we want to transform some data stored in S3, the AWS object store. Some transformations need to run on a schedule (extracting data from a web service perhaps) and some should run on demand when objects are created in a particular S3 bucket.
The goal of our solution is a cost effective, easily understood process that supports a continuous deployment model. We chose Kettle as the ETL tool for its variety of transforms, open source license, and lightweight deployment capability: a small install footprint executable with a standard JVM without external dependencies. We chose AWS for its PaaS offerings, especially the Data Pipeline as well see shortly.
We deployed the solution entirely on AWS using:
CodeCommit– AWS' hosted Git repository. Our development team uses a slightly modified GitFlow process to work collaboratively. The master branch is pulled into the ETL environment prior to execution.
Data Pipeline– a data-centric workflow engine with a few key features:
- Managed EC2 instances – data pipeline creates and terminates EC2 instances as needed to execute transformation workloads. No scripting needed.
- S3 Integration – S3 buckets can be staged as input or output from/to the EC2 instance (and therefore the ETL process). This limits the need for ETL jobs to do the heavy lifting of moving data in/out of S3; ETL jobs simply work with local files.
- Logging – Activity logs and stdout/stderr are logged automatically to S3 and can be retrieved easily from the data pipeline console or from S3 directly.
EC2– On-demand Linux instances managed by data pipeline.
S3– Data in, data out, and notifications.
1. Create your CodeCommit repo and commit your kettle transforms and jobs.
2. Create a Linux AMI to use as your EC2 instance (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html). Install Kettle, clone your git repo.
3. Create your Data Pipeline(s):
- Select either a periodic schedule or "On Activation" for an on-demand pipeline.
- Use ShellCommandActivities to pull the latest code from your git repo and execute your Kettle jobs (kitchen.sh) or transforms (pan.sh).
- Use S3DataNodes to stage input/output from/to S3.
- Use an EC2 Resource with your AMI from step 2 as the Run On instance.
NOTE: In this pipeline, in addition to the stdout logging provided automatically by data pipeline, we explicitly created additional logs in Kettle that were moved to S3 on job completion.
4. Create a notification event in S3 to trigger a Lambda script to in turn trigger the data pipeline.
5. Create the Lamba script. Something like:
That's it! Your ETL jobs will run as scheduled (or on demand based on your S3 events). EC2 instances will launch and terminate as needed. Logs will appear on S3 for your examination (and analysis!).
Writ large, this pattern would need a little more structure and process, but just a little. And the simplicity of the pattern makes the barrier to entry and management refreshingly low.