
Build an Example Data Warehouse

Now that we know how the pieces of the ETL pipeline work, let's see how we combine them all to build a data warehouse.

Setup S3 Bucket and AWS Keys

As it runs, the ETL pipeline will store intermediate data and task metadata to Amazon S3. It will also store the final data set to S3, where it will be picked up and loaded by Redshift.

For that to happen, you'll need to create a new S3 bucket in your AWS account. You'll also need a set of AWS keys with access to that bucket and to Redshift.

First, create a new S3 bucket:

  1. Go to the S3 Management Console page in the AWS Management Console. If prompted, log in with your AWS username and password.
  2. Press the “Create Bucket” button to create a new bucket.
  3. Name your bucket, using dashes to separate words (e.g. mycompany-mortar-recs-data). Keep your bucket in the US Standard Region, where Mortar’s Hadoop clusters run, to ensure fast and free data transfer between Hadoop and S3.
  4. Press “Create Bucket” to make your new bucket.

Next, follow these instructions to generate a new set of AWS keys and add them to Mortar. Be sure to write down your keys, as you will need them in the next step.

When you've generated those AWS keys, you'll need to take one more step to add Redshift access to them. From the Users tab of the AWS IAM Console, select the IAM user you just created. Choose the "Permissions" tab at the bottom of the screen, and click "Attach User Policy":

[Image: Attach User Policy]

Scroll down inside "Select Policy Template" until you reach "Amazon Redshift Full Access" and click "Select":

[Image: Manage User Permissions]

Click "Apply Policy":

[Image: Set Permissions]

Your keys should now be able both to store data in S3 and to copy that data into Redshift.


Configure Luigi

As mentioned earlier, we will use Luigi to run our ETL pipeline.

The mortar-etl-redshift project has a directory called luigiscripts, which contains Luigi pipelines that can be run on Mortar.

The luigiscripts directory contains client.cfg.template, a Luigi configuration file. Each time you run Luigi, Mortar will expand the variables in this file (e.g. ${MORTAR_EMAIL}) to their actual values and store the result in luigiscripts/client.cfg. You should not check luigiscripts/client.cfg into source control, as it will be generated on each new run.
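Once expanded, the values in client.cfg are read by the pipeline's tasks through Luigi's standard configuration API. As a rough illustration only (the section and option names below are hypothetical; check client.cfg.template for the ones the project actually uses), a task might look up its Redshift settings like this:

import luigi.configuration

# Read the client.cfg that Mortar generates from the template before each run.
# Section and option names here are illustrative, not the project's actual ones.
config = luigi.configuration.get_config()

redshift_host = config.get('redshift', 'host')
redshift_port = config.get('redshift', 'port', '5439')  # default if unset
redshift_database = config.get('redshift', 'database')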

Before running the ETL pipeline, we need to set Secure Configuration Parameters for the values referenced in client.cfg.template. To do that, fill in the missing values below (omitting the < and >) and run:

    mortar config:set HOST=<my-endpoint.redshift.amazonaws.com>
    mortar config:set PORT=5439
    mortar config:set DATABASE=<my-database-name>
    mortar config:set USERNAME=<my-master-username>
    mortar config:set PASSWORD=<my-master-username-password>

You can get the details about your Redshift cluster from the AWS Redshift Console.


Run The ETL Pipeline

When running your code in the cloud, you need to decide how large a cluster of computers to use for your job. For this example, we have about 0.5GB of data in the cloud and we'll use a 5-node cluster. The pipeline should run to completion in under 30 minutes, including Hadoop cluster startup time.

By default this Mortar project uses AWS spot instance clusters to save money. Running this example on a 5-node spot instance cluster for 2 hours should cost you approximately $1.40 in pass-through AWS costs. Before running this job you will need to add your credit card to your account. You can do that on our Billing Page.

To run the ETL pipeline, type at the command prompt (replacing <your-bucket-name>):

mortar luigi luigiscripts/wikipedia-luigi.py \
    --input-base-path "s3://mortar-example-data/wikipedia/pagecounts-2011-07-aa" \
    --output-base-path "s3://<your-bucket-name>/wiki" \
    --table-name "pageviews"

When you run the command, you should see the following:

Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

 https://app.mortardata.com/jobs/pipeline_job_detail?job_id=some_job_id

This tells you that your job has started successfully and gives you the URL to monitor its progress. If you open that URL, you should see your Luigi job running, and should see logs start streaming into the Logs console.

[Image: Pipeline Job Details]

Here are the tasks that we're running:

  • ExtractWikipediaDataTask
  • TransformWikipediaDataTask
  • CopyToRedshiftTask
  • ShutdownClusters

Explore the ETL Pipeline

Open luigiscripts/wikipedia-luigi.py and take a look at the code to get a better idea of what's happening.

ExtractWikipediaDataTask

This class is responsible for running the extract Pig script. As the first task in the pipeline it has no dependencies, so as long as its output doesn't already exist, it will run as soon as Luigi schedules it.
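In simplified form, such a task boils down to an output target and a run method. The sketch below uses plain Luigi primitives and made-up S3 paths rather than the Mortar base classes the real project uses, so treat it as illustration only:

import luigi
from luigi.contrib.s3 import S3Target  # lives at luigi.s3 in older Luigi releases


class ExtractWikipediaDataTask(luigi.Task):
    """Extract step: no upstream dependencies."""

    input_base_path = luigi.Parameter()
    output_base_path = luigi.Parameter()

    def output(self):
        # Luigi considers the task complete once this target exists.
        return S3Target('%s/extract-marker' % self.output_base_path)

    def run(self):
        # The real project launches the extract Pig script on a Mortar
        # Hadoop cluster here; this sketch only writes a completion marker.
        with self.output().open('w') as marker:
            marker.write('extract complete\n')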

TransformWikipediaDataTask

This class is responsible for running the transform Pig script. It is very similar to ExtractWikipediaDataTask, but it also has a requires method declaring that ExtractWikipediaDataTask must finish first, so Luigi will ensure the extract task has completed before running this one.
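Continuing the sketch above (same imports and caveats), the dependency is just a requires method that returns the upstream task:

class TransformWikipediaDataTask(luigi.Task):
    """Transform step: runs only after the extract step's output exists."""

    input_base_path = luigi.Parameter()
    output_base_path = luigi.Parameter()

    def requires(self):
        # Luigi will not schedule this task until ExtractWikipediaDataTask
        # has produced its output target.
        return ExtractWikipediaDataTask(
            input_base_path=self.input_base_path,
            output_base_path=self.output_base_path)

    def output(self):
        return S3Target('%s/transform-marker' % self.output_base_path)

    def run(self):
        # The real project runs the transform Pig script here.
        with self.output().open('w') as marker:
            marker.write('transform complete\n')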

CopyToRedshiftTask

The class CopyToRedshiftTask is used to simplify the copy statement required to move data from your S3 bucket into your Redshift cluster. The config values for this task are set in the client.cfg.template file.

The columns variable is worth noting: it defines the schema of your table. Luigi will automatically create the table for you if it does not yet exist. For this example, it is formatted to match the output of the Pig script.

columns = [
    ('language_code', 'text'),
    ('wiki_type', 'text'),
    ('article_title', 'varchar(max)'),
    ('day', 'int'),
    ('hour', 'int'),
    ('pageviews', 'int'),
    ('PRIMARY KEY', '(article_title, day, hour)')]
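The Mortar project supplies CopyToRedshiftTask; present-day Luigi ships a comparable helper, luigi.contrib.redshift.S3CopyToTable, and the sketch below uses it purely to show the shape of such a task. The attribute values are placeholders, and the real CopyToRedshiftTask may differ in detail:

import luigi
from luigi.contrib.redshift import S3CopyToTable


class CopyPageviewsToRedshift(S3CopyToTable):
    """Copies the transformed data from S3 into the pageviews table."""

    # Connection details; in the Mortar project these come from client.cfg.
    host = 'my-endpoint.redshift.amazonaws.com:5439'
    database = 'my-database-name'
    user = 'my-master-username'
    password = 'my-master-username-password'

    table = 'pageviews'
    columns = [
        ('language_code', 'text'),
        ('wiki_type', 'text'),
        ('article_title', 'varchar(max)'),
        ('day', 'int'),
        ('hour', 'int'),
        ('pageviews', 'int')]

    # Credentials Redshift's COPY command uses to read from your bucket.
    aws_access_key_id = 'YOUR_AWS_ACCESS_KEY_ID'
    aws_secret_access_key = 'YOUR_AWS_SECRET_ACCESS_KEY'

    def s3_load_path(self):
        # Location of the transformed data written by the previous task.
        return 's3://<your-bucket-name>/wiki/transform'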

ShutdownClusters

This aptly named task shuts down all of your clusters that are running but idle. Notice that even though the task only shuts down clusters, it still has an output target in S3; this lets Luigi track whether the task has completed.
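In sketch form (again using the illustrative S3Target pattern from above, not the real Mortar implementation), that looks like:

class ShutdownClusters(luigi.Task):
    """Shuts down idle clusters once the rest of the pipeline has finished."""

    output_base_path = luigi.Parameter()

    def output(self):
        # A side-effect-only task still needs a target so Luigi can tell
        # whether it has already run.
        return S3Target('%s/shutdown-clusters-marker' % self.output_base_path)

    def run(self):
        # The real task calls the Mortar API to stop idle clusters here.
        with self.output().open('w') as marker:
            marker.write('clusters shut down\n')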

While you are waiting for your job to finish, jump over and do the Hadoop and Pig tutorials to get a better understanding of how they work.

Querying Your Data Warehouse

Once your Luigi job has finished, you can do some analysis with Redshift.

You can connect to Redshift using any PostgreSQL driver or client; see the official documentation for more information. Common options include the psql command-line client and JDBC/ODBC-based tools such as SQL Workbench/J.
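If you'd rather query from code, a minimal sketch using psycopg2 (an assumed choice of driver; the connection values below are the placeholders from the earlier mortar config:set step) might look like this:

import psycopg2

# Connect with the same details you supplied via `mortar config:set`.
conn = psycopg2.connect(
    host='my-endpoint.redshift.amazonaws.com',
    port=5439,
    dbname='my-database-name',
    user='my-master-username',
    password='my-master-username-password')

cur = conn.cursor()
# Quick sanity check that the COPY actually loaded rows.
cur.execute('select count(*) from pageviews;')
print(cur.fetchone()[0])

cur.close()
conn.close()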

Once connected with your tool of choice, you can run SQL queries on your data. Since you've loaded Wikipedia pageview data into your Redshift database, let's find the most popular hours of the day to surf Wikipedia:

select hour, sum(pageviews) as total_pageviews
from pageviews
group by hour
order by total_pageviews desc;

Not surprisingly, we can see that the most popular times to surf Wikipedia are the afternoon and early evening.

[Image: Redshift Results]

Now that you've run the example ETL pipeline it's time to build your own. But first, if you haven't already completed the Hadoop and Pig tutorials, you should do that now to cover the fundamentals you'll need to build your own Redshift pipeline.