Extract Your Data

The goal of this phase of the ETL pipeline is to find all of the raw data you need for your data warehouse and extract it to S3 for easy use later on in the pipeline.

Which Data

To begin, you need to determine which input data to gather for your ETL pipeline.

Start off by deciding what type of queries you will want to run on your data warehouse and what raw data you will need to power those queries. For your first iteration of building your data warehouse it can help to start with a specific set of questions you will want to ask and focus on answering those. The Mortar ETL pipeline is easy to customize and run so adding more data later isn't a problem.


Extract Your Data

Having determined which data you need, it's time to extract it.

Mortar supports reading data from Amazon S3, MongoDB, and SQL databases. Let's look at how you would extract data from each of these sources.


Amazon S3

If your data already lives in S3, data extraction is optional. In most cases it is best to leave your data where it is and read it directly in the transform stage of the ETL pipeline. However, there may be times when you still want to copy the data elsewhere (to create a specific snapshot, to avoid reading files that are still being written, etc.). Just be aware that doing so can increase your pipeline's runtime and the amount of S3 storage used.
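
If you do decide to snapshot your S3 data before transforming it, one lightweight approach (outside the Pig pipeline itself) is to copy the relevant keys to a dated prefix so the transform step reads a stable copy. Here is a minimal sketch using boto; the bucket and prefix names are placeholders, and this script is not part of the Mortar pipeline.

import datetime
import boto

SRC_BUCKET = 'my-raw-data-bucket'      # placeholder bucket name
SRC_PREFIX = 'wikipedia/pageviews/'    # placeholder prefix for the raw files
SNAPSHOT_PREFIX = 'snapshots/%s/' % datetime.date.today().isoformat()

s3 = boto.connect_s3()
bucket = s3.get_bucket(SRC_BUCKET)

# Copy every key under the raw prefix to a dated snapshot prefix so the
# transform step reads files that are no longer being modified.
for key in bucket.list(prefix=SRC_PREFIX):
    dest_name = SNAPSHOT_PREFIX + key.name[len(SRC_PREFIX):]
    bucket.copy_key(dest_name, SRC_BUCKET, key.name)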

When loading data from S3 you can use our Load Statement Generator to craft a load statement for your data format.

If you are skipping the extraction step, you will use this load statement at the start of your Transform step instead of the provided load statement that reads the extract step's output.

Your extraction code goes in pigscripts/01-extract-data.pig. Here's what that script would look like for loading the whitespace-separated Wikipedia data:

-- Load the raw, whitespace-separated pageview data
raw = load '$INPUT_PATH'
    using PigStorage(' ')
    as (
        wiki_code:chararray,
        article:chararray,
        monthly_pageviews:int,
        encoded_hourly_pageviews:chararray
    );

-- Keep only the fields needed downstream
data = foreach raw generate wiki_code, article, encoded_hourly_pageviews;

-- Remove any previous output, then store the extracted data to S3
rmf $OUTPUT_PATH/extract;
store data into '$OUTPUT_PATH/extract' using PigStorage();

MongoDB

If your data is in MongoDB, you have a number of options for extracting it. You can either connect and read directly from your MongoDB cluster or read backup data that has been stored on S3. Use our MongoDB documentation to determine the best way to connect to your instance and extract your data.

As an example, if the Wikipedia data was stored in a MongoDB collection, then the extract script in pigscripts/01-extract-data.pig would look like:

-- Connect to MongoDB and load the pageview documents with a declared schema
raw = load 'mongodb://$USERNAME:$PASSWORD@$HOST:$PORT/$DATABASE.$COLLECTION'
    using com.mongodb.hadoop.pig.MongoLoader('
        mongo_id:chararray,
        wiki_code:chararray,
        article:chararray,
        monthly_pageviews:int,
        encoded_hourly_pageviews:chararray
    ');

-- Keep only the fields needed downstream
data = foreach raw generate wiki_code, article, encoded_hourly_pageviews;

-- Remove any previous output, then store the extracted data to S3
rmf $OUTPUT_PATH/extract;
store data into '$OUTPUT_PATH/extract' using PigStorage();

SQL Database

If your data is stored in a SQL database (MySQL, PostgreSQL, SQL Server, etc.), you will need to extract it to a flat file on S3. Pig does not load data directly from databases, but Mortar offers several other options for extracting your data.

MySQL

The most efficient way to extract data from a MySQL database is via the MySQL Command Line Tool. It offers a great deal of flexibility and works well with high volumes of data.

Mortar has created a Luigi Task called ExtractMySQLData that you can drop into any Luigi pipeline to extract data from MySQL. It automatically scripts the MySQL command-line tool to set up a connection to your database, extract the data you want, and then sync that data to S3.

For an example of using the ExtractMySQLData Task to export Wikipedia data from MySQL, see the wikipedia-luigi-mysql.py script and the ExtractMySQLData documentation.
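
To give a rough sense of how the task slots into a Luigi pipeline, here is a minimal sketch. The import path, parameter names, and S3 locations below are illustrative assumptions rather than the task's actual interface, so defer to the wikipedia-luigi-mysql.py example and the ExtractMySQLData documentation for the real signature.

import luigi

# NOTE: the import path and parameter names for ExtractMySQLData are
# illustrative assumptions -- check the ExtractMySQLData documentation
# and the wikipedia-luigi-mysql.py example for the actual interface.
from mortar.luigi.mysql import ExtractMySQLData


class ExtractWikipediaData(luigi.WrapperTask):
    """Top-level task: require one MySQL extract per table we need."""

    # Placeholder S3 location for the extracted flat files.
    output_base_path = luigi.Parameter(
        default='s3://my-bucket/wikipedia/extract')

    def requires(self):
        # Hypothetical parameters: the table to dump and where to land it on S3.
        return ExtractMySQLData(
            table='pageviews',
            output_path='%s/pageviews' % self.output_base_path,
        )


if __name__ == '__main__':
    luigi.run(main_task_cls=ExtractWikipediaData)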

Other Databases (PostgreSQL, SQL Server, Oracle, etc.)

Built-In Extraction Tool

For other databases, we recommend extracting flat files using your database's built-in bulk extraction tool and uploading them to S3 in tab-delimited or CSV format. For example, PostgreSQL's COPY command and SQL Server's bcp utility can both export a table to a flat file.
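
As a rough illustration of the pattern (not part of Mortar's tooling), here is a minimal Python sketch that uses PostgreSQL's COPY, via psycopg2, to dump a table to a tab-delimited file and then uploads that file to S3 with boto. The connection details, table name, bucket, and key are placeholders.

import boto
import psycopg2

# Placeholder connection and S3 details -- substitute your own.
DB_DSN = 'host=db.example.com dbname=wikipedia user=etl password=secret'
LOCAL_FILE = '/tmp/pageviews.tsv'
S3_BUCKET = 'my-data-warehouse-bucket'
S3_KEY = 'wikipedia/extract/pageviews.tsv'

# 1. Use PostgreSQL's built-in COPY to dump the table; the default text
#    format is tab-delimited.
conn = psycopg2.connect(DB_DSN)
cur = conn.cursor()
with open(LOCAL_FILE, 'w') as f:
    cur.copy_expert('COPY pageviews TO STDOUT', f)
conn.close()

# 2. Upload the flat file to S3 so the rest of the pipeline can read it.
s3 = boto.connect_s3()
bucket = s3.get_bucket(S3_BUCKET)
key = bucket.new_key(S3_KEY)
key.set_contents_from_filename(LOCAL_FILE)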

Sqoop

Alternatively, Mortar has experimental support for extracting data from JDBC-compliant databases with Apache Sqoop. This support currently only works when running Luigi from your computer with mortar local:luigi, not on the Mortar service with mortar luigi. As such, we recommend using it only for one-time extracts and development, not for production pipelines.

Please see the Sqoop Tasks section of the Mortar Sqoop documentation for instructions on how to use Sqoop.