Storing Data to SQL Databases

If your database has a JDBC driver, you can store data into it using Mortar. RDBMSs that support JDBC include PostgreSQL, MySQL, Microsoft SQL Server, Oracle, Greenplum, and more.

Database Store Statement

Microsoft SQL Server

If you are using Microsoft SQL Server as your JDBC platform, use the following template STORE statement:

%default DATABASE_HOST 'dbIdentifier.XXXXXXXXXXX.your-time-zone.rds.amazonaws.com:<port number>';
%default DATABASE_NAME 'MY_DATABASE';
%default DATABASE_USER 'MY_USER';
%default DATABASE_PASS 'MY_PASS';
%default DATABASE_DRIVER 'com.microsoft.sqlserver.jdbc.SQLServerDriver';
%default DATABASE_TYPE 'sqlserver';

STORE my_result INTO 'hdfs:///unused-ignore' 
USING org.apache.pig.piggybank.storage.DBStorage('$DATABASE_DRIVER',
   'jdbc:$DATABASE_TYPE://$DATABASE_HOST', 
   '$DATABASE_USER', 
   '$DATABASE_PASS', 
   'USE $DATABASE_NAME INSERT INTO my_table(my_col_1,my_col_2,my_col_3) VALUES (?,?,?)');

General JDBC

To store data into any other JDBC-compliant database, use the following template STORE statement:

%default DATABASE_HOST 'ec2-XX-XXX-XXX.compute-1.amazonaws.com';
%default DATABASE_NAME 'MY_DATABASE';
%default DATABASE_USER 'MY_USER';
%default DATABASE_PASS 'MY_PASS';
%default DATABASE_DRIVER 'org.postgresql.Driver';
%default DATABASE_TYPE 'postgresql';

STORE my_result INTO 'hdfs:///unused-ignore' 
USING org.apache.pig.piggybank.storage.DBStorage('$DATABASE_DRIVER',
   'jdbc:$DATABASE_TYPE://$DATABASE_HOST/$DATABASE_NAME', 
   '$DATABASE_USER', 
   '$DATABASE_PASS', 
   'INSERT INTO my_table(my_col_1,my_col_2,my_col_3) VALUES (?,?,?)');

You'll need to edit the %default parameters to connect to your specific database. Additionally, DATABASE_DRIVER and DATABASE_TYPE need to specify the JDBC driver class name and database vendor. The example above illustrates usage of a Postgres database driver; see the table below for possible values or consult your vendor's JDBC driver documentation. Finally, change the INSERT statement to reflect your target database table and columns. Make sure that the number of fields in the alias you store matches the number of ? placeholders in your INSERT statement.

Also, you'll need to ensure that the table exists before running the insert.
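For example, the target table for the three-column INSERT above can be created ahead of time. Here is a minimal sketch using Python's sqlite3 as a stand-in for your RDBMS; the table and column names follow the hypothetical my_table example above, and the column types should be adjusted to your vendor's SQL dialect:

```python
import sqlite3

# In-memory SQLite database as a stand-in for your real RDBMS.
conn = sqlite3.connect(":memory:")

# Create the target table before the Pig job runs its INSERTs.
# Column names mirror the hypothetical my_table example above;
# pick types appropriate to your database vendor.
conn.execute("""
    CREATE TABLE IF NOT EXISTS my_table (
        my_col_1 TEXT,
        my_col_2 TEXT,
        my_col_3 INTEGER
    )
""")

# Verify the table now exists.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)  # [('my_table',)]
conn.close()
```

Run the equivalent DDL against your production database (with your own schema and types) before launching the Mortar job.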

To connect directly to your database server from Mortar there are a few additional details you should know.

  • Your database server must be running in EC2
  • If you have trouble connecting to your server, you may need to email support@mortardata.com so that we can open that port in the firewall on your account's Mortar clusters.

JDBC Dependencies

Connecting to a JDBC database requires a driver specific to each RDBMS vendor or project, typically packaged as a Java jar file. Due to licensing restrictions, some JDBC drivers cannot be distributed by Mortar. However, in many cases these drivers can be freely downloaded from the internet and supplied with your Mortar Project by placing the necessary jar in your project's lib/ directory. The table below, which describes several common database engines, notes whether you must supply a JDBC driver with your project.

Common JDBC Platforms

Platform       Driver Name                                    Example JDBC Conn String            Driver Required
PostgreSQL     org.postgresql.Driver                          jdbc:postgresql://$dbhost/$dbname   No
MySQL          com.mysql.jdbc.Driver                          jdbc:mysql://$dbhost/$dbname        Yes
MSSQL Server   com.microsoft.sqlserver.jdbc.SQLServerDriver   jdbc:sqlserver://$dbhost:$port      Yes
Greenplum      org.postgresql.Driver                          jdbc:postgresql://$dbhost/$dbname   No

Note about downloading the Microsoft SQL Server driver: you will need to download one of the .tar files. Once expanded, you will find the jar files in the enu folder. Select the jar file that suits your SQL query tool and copy it to your project's lib/ directory.

Transactions

The underlying store function, DBStorage, does insertions within a transaction. It will send batches of 100 rows at a time to the database, but will only commit the transaction when the Hadoop task finishes successfully. This guarantees that you will not see duplicate rows inserted.

Note that there is one transaction per Map or Reduce task, not per MapReduce job. So if the MapReduce Job fails, there may be partial data stored in the target database table from tasks that succeeded.
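To make the commit behavior concrete, here is a conceptual sketch in Python using sqlite3 as a stand-in (DBStorage itself is Java, so this only illustrates the semantics). It sends rows in batches inside one open transaction, commits only when the whole task finishes, and rolls back on failure so a failed task leaves no rows behind:

```python
import sqlite3

BATCH_SIZE = 100  # DBStorage flushes 100 rows per batch

def store_task(conn, rows):
    """Insert all rows for one Map/Reduce task in a single transaction.

    Batches are sent as the task runs, but the transaction is
    committed only when the whole task finishes; any failure
    rolls everything back, so no partial task output is visible.
    """
    cur = conn.cursor()
    try:
        for i in range(0, len(rows), BATCH_SIZE):
            cur.executemany(
                "INSERT INTO my_table (my_col_1, my_col_2, my_col_3) "
                "VALUES (?, ?, ?)",
                rows[i:i + BATCH_SIZE],
            )
        conn.commit()  # one commit per task, at the very end
    except Exception:
        conn.rollback()  # a failed task leaves no rows behind
        raise

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (my_col_1 TEXT, my_col_2 TEXT, my_col_3 INTEGER)"
)
store_task(conn, [("a", "b", 1), ("c", "d", 2)])
print(conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0])  # 2
```

Because each task commits independently, a job-level failure can still leave committed data from tasks that finished before the job died, which is exactly the partial-data caveat above.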

Controlling Database Load

One thing to be careful about is load on your target database. Each Map or Reduce task will open a connection to the database. With large amounts of data (and hence a large number of tasks), this can cause significant load on your database.

To limit the number of open connections, you can trigger a reduce phase before running the STORE statement and specify the number of reducers that run. Code for that would look something like:

-- maximum number of database connections to allow
%default MAX_NUM_DATABASE_CONNECTIONS 20;

-- trigger a reduce phase with a specified number 
-- of reducers to write to the database
my_input_data = ...;
throttle_group_by = GROUP my_input_data 
                      BY some_field_with_high_cardinality 
                PARALLEL $MAX_NUM_DATABASE_CONNECTIONS;
throttle_result = FOREACH throttle_group_by GENERATE FLATTEN(my_input_data);

-- store the data out with DBStorage as above
STORE throttle_result INTO ...;