Sqoop

Mortar offers experimental support for extracting data from JDBC-compliant databases with Apache Sqoop.

This support is experimental—it is currently limited to:

  • The mortar local:sqoop command in the Mortar Development Framework
  • MortarSqoopTableTask Luigi tasks run with mortar local:luigi

Since Sqoop does not yet work in Luigi tasks run on the Mortar service with mortar luigi, we recommend using it only for one-time extracts and for development, not for production pipelines.

How It Works

Under the hood, Sqoop uses JDBC to connect to your database, generates any necessary SQL to access the data, and then formats the results before storing them on S3.

The mortar local flavor of Sqoop commands executes within a single process, unlike traditional Sqoop jobs, where multiple Map/Reduce tasks access the database in parallel. The benefit of this approach is that you do not have to worry about accidentally overwhelming your database server, and you do not need a running cluster to use this tool. However, exporting your data set takes longer with a single process. For this reason, we recommend using Mortar Local Sqoop commands only for smaller data sets or one-off exports.

General Information

All commands require the database type (postgres, mysql, etc.) and the name of the database to query. You will also need to supply the S3 path where you wish to store your data. In addition to command-specific arguments, you can specify the following optional arguments when running an export (an example combining several of them follows the list):

  • --host HOSTNAME The hostname where the database instance is running. You must be able to open a network connection to this host from the machine where the sqoop process is running. If unspecified, localhost is assumed.
  • --username USERNAME The username used when authenticating with the database server.
  • --password PASSWORD The password for the user when authenticating with the database server.
  • --direct Perform a direct import. This bypasses JDBC and uses the database’s native tooling to access the data, which can be considerably faster; however, support is limited to a subset of database types and requires that native access utilities be installed on the machine where the sqoop process is running.
  • --driverjar JARFILE Path to the jar file containing the JDBC driver for your particular database. This is only necessary if your database is not supported, or there are restrictions on redistributing the driver as is the case with MySQL.
  • --jdbcdriver COM.DRIVER.BAR The name of the class that represents the JDBC driver for your database. This is almost never needed.
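For example, a table export against a remote Postgres server with authentication might look something like this (the hostname and credentials below are placeholders; adjust them for your own database):

mortar local:sqoop_table postgres sitedb purchases s3://my-export-data/purchases --host db.example.com --username exportuser --password secret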

Also, note that all of the mortar sqoop commands currently have a 5 GB export limit.

Exporting an Entire Database Table

A table export is the simplest option available. It requires only the name of the table to be exported in addition to the previously mentioned options:

mortar local:sqoop_table postgres sitedb purchases s3://my-export-data/purchases

This will select all records from the purchases table in the sitedb database on a Postgres server and store them in s3://my-export-data/purchases.

Exporting the Results of a SQL Query

When you want to export a subset of your data, either to filter out unnecessary records or to select only certain columns from the table, use a SQL-query-based export. An example usage would look like:

mortar local:sqoop_query mysql sitedb "SELECT UserID, TransactionID FROM purchases WHERE SalesTaxID IS NOT NULL" s3://my-export-data/purchases-with-tax

This will execute the supplied SQL query against the sitedb database on a MySQL server and store the results in s3://my-export-data/purchases-with-tax. Note that the query must be enclosed in quotation marks to escape it from the shell. Also, while arbitrary queries are accepted, complex queries with joins, subqueries, or WHERE clauses containing a logical OR are strongly discouraged: they can lead to ambiguous behavior or an unexpected result set.

Performing an Incremental Export

When you want to export only the “latest” set of data, use the incremental command. It generates a SQL statement and uses that statement to select the data to be exported.

This command needs a way to determine which records belong to the “latest” set. If your table has an autoincrement-style, single-column primary key, or an explicit column that stores the create or update timestamp of each record, you supply that column’s name along with a value indicating where the last export left off.

If you use a primary key or a create timestamp, you do not need to change your application code, but these only allow export based upon creation: any updates will not be included in the incremental export. Using a column that tracks the last update timestamp lets you incrementally export both creations and updates, but you will need to implement the update-timestamp tracking in your application code.
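To make this concrete: with an update-timestamp column, the statement the incremental command generates is conceptually equivalent to selecting every row whose timestamp is newer than the value you supply. A rough sketch, using the table and column from the example below (the exact SQL Sqoop generates may differ):

SELECT * FROM purchases WHERE last_modified_date > '2014-02-01 00:00:00'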

An example usage for export based upon an update date:

mortar local:sqoop_incremental mysql sitedb purchases last_modified_date "2014-02-01 00:00:00" s3://my-export-data/purchases-by-day/2014-02-01

This will export all records that have been created or modified since 02/01/2014 based upon the timestamp in the last_modified_date column. An example usage for export based upon an autoincrement primary key:

mortar local:sqoop_incremental mysql sitedb purchases purchase_id 1839503928 s3://my-export-data/recent-purchases/

This will export all records that have a value of purchase_id > 1839503928.

After the export is complete, the command will display the maximum value seen in the primary key or update-date column you specified. That value may then be used as the starting point for your next incremental export, as in the example below.
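For example, if the purchase_id export above reported a maximum value of 1839601234 (an illustrative number), the next incremental run would pick up from there:

mortar local:sqoop_incremental mysql sitedb purchases purchase_id 1839601234 s3://my-export-data/recent-purchases/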

Luigi Sqoop Tasks

You can also use Sqoop from a Luigi pipeline with the MortarSqoopTableTask Luigi Task. Note: this Task is experimental, and currently only works in local mode with mortar local:luigi, not with mortar luigi. As such, we recommend using it only for one-time extracts and for development, not for production pipelines.

Here's an example Sqoop Task:

import luigi
from mortar.luigi import sqoop

JDBC_TABLE = 'table-name-to-extract-from'

class SqoopTask(sqoop.MortarSqoopTableTask):
    # table to extract from the database
    table = luigi.Parameter(default=JDBC_TABLE)

To have your transform task depend on your Sqoop extract task you would have a requires method similar to:

def requires(self):
    return [SqoopTask(path=self.input_base_path, table=JDBC_TABLE)]
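
Building on the SqoopTask defined above, a complete downstream task might look like the following sketch. TransformTask, its input_base_path parameter, and the output path are illustrative names rather than part of the Mortar API, and the sketch assumes the luigi.s3 module shipped with Luigi 1.x:

import luigi
from luigi.s3 import S3Target

class TransformTask(luigi.Task):
    # base S3 path shared with the Sqoop extract (illustrative parameter)
    input_base_path = luigi.Parameter()

    def requires(self):
        # run the Sqoop extract before this task
        return [SqoopTask(path=self.input_base_path, table=JDBC_TABLE)]

    def output(self):
        # where this task writes its results (illustrative location)
        return [S3Target('%s/transformed' % self.input_base_path)]

    def run(self):
        # your transform logic over the extracted data goes here
        pass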

All of the configuration information for your database should go in the database section of your Luigi client.cfg.template configuration file:

    [database]
    dbtype: postgres
    database: ${MYDATABASE}
    host: ${DATABASE_HOST}
    port: 5433
    username: ${DATABASE_USERNAME}
    password: ${DATABASE_PASSWORD}