
Building Your Data Warehouse

Now that you've worked through the tutorial and built an example Redshift data warehouse, it's time to build your own.

Overview

If you've already done the Build an Example Pipeline section of this tutorial, you should be familiar with the steps we're going to cover now. If you haven't, take a minute to at least read it over so that you're familiar with the different steps of building a Redshift data warehouse.

There are three things you will need to do to build your custom Redshift data warehouse:

  • Extract Your Data: You will need to decide what data you want to be able to query and how you can extract that data into S3.
  • Transform Your Data: Once you have your data, you need to determine how you would like to query it and how you can transform your extracted data into that format.
  • Customize the ETL Pipeline: The Mortar ETL pipeline described in the Wikipedia example may need a few customizations to run your Pig scripts.
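
To make these stages concrete, here is a rough sketch of the shape of such a pipeline in Python. It is illustrative only: in a real Mortar project the extract and transform steps are Pig scripts and the load step is handled by the pipeline itself, and every name below (bucket, table, file paths, credentials) is a placeholder.

    import psycopg2

    def extract():
        # Pull raw data out of your source systems and write it to S3,
        # e.g. s3://your-etl-bucket/raw/events/ (placeholder path).
        ...

    def transform():
        # Reshape the raw data into the rows you want to query, writing
        # the result to s3://your-etl-bucket/clean/events/ (placeholder).
        ...

    def load(conn):
        # Bulk-load the transformed, tab-delimited files into a Redshift
        # table with COPY. ('\\t' in Python source sends '\t' to Redshift.)
        with conn.cursor() as cur:
            cur.execute("""
                COPY events
                FROM 's3://your-etl-bucket/clean/events/'
                CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
                DELIMITER '\\t'
            """)
        conn.commit()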

In the following tutorial articles we're going to cover each of these in detail. Before doing that there are a couple of things you'll need to set up.


Set Up an S3 Bucket and AWS Keys

As in the Wikipedia example, you'll want an S3 bucket for intermediate data and AWS access keys that can reach your data in S3 and write to Redshift. You can use the same bucket and keys you created for the example; for a refresher, see the section on creating a bucket and AWS keys in the Wikipedia tutorial.
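
If you'd rather script this step than click through the AWS console, a minimal sketch using boto3 might look like the following; the bucket name is a placeholder, and since bucket names are globally unique you'll need to pick your own.

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Bucket names are globally unique; replace this placeholder with
    # your own. In us-east-1 no location configuration is needed.
    s3.create_bucket(Bucket="your-etl-bucket")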


Set Up Your Redshift Cluster

You will need to have a running Redshift cluster to complete this tutorial. If you already have a running cluster you would like to use, you can skip the remaining steps in this article.


Choosing Your Cluster

AWS charges by the hour for Redshift (see pricing). If you're unsure how large a cluster you'll need, start with the smallest Redshift cluster (node type dw2.large, cluster type Single Node) at $0.25/hour. It's easy to resize your Redshift cluster later if you need better performance. To avoid incurring extra costs, be sure to shut down your cluster when you're done with it.


Start Your Redshift Cluster

To start a Redshift cluster, follow the official AWS documentation for a step-by-step walkthrough. You will need to complete steps 1-3 and the first part of step 4. Be sure to place your cluster in the US East region for fast and free data transfer to Mortar's Hadoop clusters. You do not need to worry about creating tables or running queries; the Mortar ETL pipeline takes care of that for you.
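
If you prefer to script cluster creation rather than use the console, a minimal boto3 sketch might look like this; the cluster identifier, database name, and credentials are all placeholders.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Smallest configuration: a single dw2.large node at $0.25/hour.
    redshift.create_cluster(
        ClusterIdentifier="my-warehouse",      # placeholder
        NodeType="dw2.large",
        ClusterType="single-node",
        DBName="warehouse",                    # placeholder
        MasterUsername="awsuser",              # placeholder
        MasterUserPassword="ChangeMe123",      # placeholder
    )

    # Block until the cluster is ready to accept connections.
    redshift.get_waiter("cluster_available").wait(
        ClusterIdentifier="my-warehouse"
    )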

The AWS documentation recommends using SQL Workbench/J to connect to and query Redshift, but most other SQL clients will work as well.
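
Redshift speaks the PostgreSQL wire protocol on port 5439, so any Postgres-compatible client or driver can connect. As a quick sanity check, here is a minimal sketch using psycopg2; the endpoint, database name, and credentials are placeholders that you would replace with the values from your own cluster's configuration page.

    import psycopg2

    # All connection details below are placeholders.
    conn = psycopg2.connect(
        host="my-warehouse.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="warehouse",
        user="awsuser",
        password="ChangeMe123",
    )

    with conn.cursor() as cur:
        cur.execute("SELECT version()")
        print(cur.fetchone()[0])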