
Building Your Data Warehouse

Now that everything is ready to go, it's time to run your ETL pipeline and build your data warehouse.

Running the ETL Pipeline

Before running your pipeline, you need to pick the number of nodes you want to use. For your first run, use a cluster size of 5. This is only an initial value; you should adjust your cluster size based on how long your job takes and your requirements for acceptable run time.

To adjust your cluster size, find where the cluster_size parameter is set. In my-redshift.py, that happens where TransformDataTask is required by CopyToRedshiftTask:

def requires(self):
    return TransformDataTask(
        cluster_size=5,  # number of nodes to use for the transform step
        input_base_path=self.input_base_path,
        output_base_path=self.output_base_path)
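
If you would rather not hardcode the node count, you could expose cluster_size as a Luigi parameter and set it from the command line. The sketch below is illustrative rather than the code shipped in my-redshift.py; it assumes the TransformDataTask and CopyToRedshiftTask definitions already in your luigiscripts.

import luigi

class CopyToRedshiftTask(luigi.Task):
    # Existing parameters from my-redshift.py are assumed; only cluster_size is new.
    input_base_path = luigi.Parameter()
    output_base_path = luigi.Parameter()
    cluster_size = luigi.IntParameter(default=5)  # number of nodes in the cluster

    def requires(self):
        # Pass the parameter through instead of the hardcoded value.
        return TransformDataTask(
            cluster_size=self.cluster_size,
            input_base_path=self.input_base_path,
            output_base_path=self.output_base_path)

With the parameter exposed, you could append something like --cluster-size 10 to the mortar luigi command shown below.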

The Mortar ETL pipeline defaults to using AWS spot instances. While spot instance prices aren’t guaranteed, they’re typically around $0.14 per hour per node. On very rare occasions when the spot price goes above $0.44 per hour per node, your cluster will be automatically terminated and the job will fail. In most cases the significant cost savings of spot instances will be worth the very small chance of your job failing, but read Setting Project Defaults if you would like to change the defaults of your project.

Ok, now you’re ready to run!

mortar luigi luigiscripts/my-redshift.py \
    --input-base-path "s3://<Your input data path>" \
    --output-base-path "s3://<your-bucket>/etl" \
    --table-name "<Your Redshift table name>"

If you added or modified any tasks that require additional parameters, just add them to the end of your command.

The pipeline is idempotent: once it has run with a given output location, it will not run again unless the data in that location is deleted or a different output location is given. Running it multiple times has the same result as running it once, because once a task has completed successfully it won't be run again. Changing the output location (for example, using a date parameter: s3://<your-output-bucket>/<date-today>) causes it to behave like a new pipeline; one way to do that is sketched below.
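
As a hedged sketch (not part of my-redshift.py), a small wrapper task could derive a dated output location so each run date is treated as a fresh pipeline. The DailyRedshiftPipeline name is hypothetical, and CopyToRedshiftTask's parameter names are assumed to mirror the command-line flags above.

import datetime

import luigi

class DailyRedshiftPipeline(luigi.WrapperTask):
    # Hypothetical wrapper around the tasks in luigiscripts/my-redshift.py.
    input_base_path = luigi.Parameter()
    output_base_bucket = luigi.Parameter()
    table_name = luigi.Parameter()
    data_date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        # Build a dated output path, e.g. s3://<your-output-bucket>/etl/2015-02-11,
        # so Luigi treats each run date as a brand-new pipeline.
        dated_output = "s3://%s/etl/%s" % (
            self.output_base_bucket, self.data_date.isoformat())
        return CopyToRedshiftTask(
            input_base_path=self.input_base_path,
            output_base_path=dated_output,
            table_name=self.table_name)

Re-running with the same --data-date is still idempotent; passing a new date writes to a new location and the whole pipeline runs again.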

Querying Your Redshift Data Warehouse

Once your Luigi job has finished successfully, you should be able to query your Redshift cluster.
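
You can connect with any PostgreSQL-compatible SQL client. As a minimal sketch, here is how you might run a quick sanity check from Python using psycopg2; the host, credentials, and table name are placeholders for your own cluster.

import psycopg2

# Connect to the Redshift cluster over its PostgreSQL-compatible interface.
conn = psycopg2.connect(
    host="<your-cluster>.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    dbname="<your-database>",
    user="<your-user>",
    password="<your-password>")

with conn.cursor() as cur:
    # Count the rows the ETL pipeline loaded into your table.
    cur.execute("SELECT COUNT(*) FROM <Your Redshift table name>;")
    print(cur.fetchone()[0])

conn.close()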