Storing Data to DynamoDB

With Mortar, you can easily store data from Hadoop into your DynamoDB table.

Storing Data Into DynamoDB

Note: your DynamoDB table must currently be located in the US-EAST-1 region.

To store data into your DynamoDB table, customize the following template STORE statement:

-- Percentage of the table's write throughput to use
-- Default: 0.5; valid range: 0.1 - 1.5
SET dynamodb.throughput.write.percent 1.0;

-- Must disable Pig's MultiQuery optimization
-- when using DynamoDBStorage
SET opt.multiquery false;

-- Dynamo table where writes should go
%default DYNAMODB_TABLE 'my_dynamo_db_table';

-- AWS access key to use for writing to DynamoDB table
%default DYNAMODB_AWS_ACCESS_KEY_ID 'SETME';

-- AWS secret key to use for writing to DynamoDB table
%default DYNAMODB_AWS_SECRET_ACCESS_KEY 'SETME';

-- Load up some input data
input_data = LOAD '$INPUT_PATH'
    USING PigStorage()
    AS (...);

-- Select exactly the fields you want to store to dynamodb.
-- MUST include your DynamoDB table's primary key.
exact_fields_to_store = FOREACH input_data
    GENERATE my_field AS name_in_dynamodb_1,
             my_field_2 AS name_in_dynamodb_2,
             my_field_3 AS name_in_dynamodb_3;

-- Store the data to DynamoDB
STORE exact_fields_to_store
 INTO 's3://somewhere-in-s3-i-can-write/but-will-not-get-written'
USING com.mortardata.pig.storage.DynamoDBStorage(
        '$DYNAMODB_TABLE',
        '$DYNAMODB_AWS_ACCESS_KEY_ID',
        '$DYNAMODB_AWS_SECRET_ACCESS_KEY');

You'll need to edit the DYNAMODB_TABLE parameter to point to your table, and set the DYNAMODB_AWS_ACCESS_KEY_ID and DYNAMODB_AWS_SECRET_ACCESS_KEY parameters to AWS keys that can write to your DynamoDB table.
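
If you launch the script with Apache Pig's standard parameter substitution, you can also supply these values at run time instead of hard-coding them (for example, -param DYNAMODB_TABLE=my_table on the Pig command line); values passed this way override the %default declarations.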

Pig requires a location in the INTO clause, but no data will be written there. Pig does, however, check that the location is accessible and empty, so best practice is to pass an S3 URL you can access that contains no data.


Schema

Any fields passed to the DynamoDBStorage STORE statement will be stored into DynamoDB. Be sure to put a FOREACH statement in front of your STORE to prune down to exactly the fields you want and to rename them to match your DynamoDB attribute names.
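
For example, here is a minimal sketch (the field names, S3 path, and user_id hash key are all hypothetical) that prunes an input relation down to just the attributes to be stored:

-- Hypothetical input with four fields; the DynamoDB table's
-- primary (hash) key is assumed to be 'user_id'
users = LOAD 's3://my-bucket/users' USING PigStorage('\t')
    AS (uid:chararray, full_name:chararray, signup_ts:long, internal_flag:int);

-- Keep only the fields to store, renamed to match the DynamoDB
-- attribute names; internal_flag is not generated, so it is dropped
to_store = FOREACH users
    GENERATE uid       AS user_id,
             full_name AS name,
             signup_ts AS signed_up_at;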

DynamoDB's data model does not allow item fields to have null or empty string values. If DynamoDBStorage sees an empty or null value, it will omit that field for the given record. The rest of the record's data will be stored as usual.
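
If you need an attribute to be present even when the source value is missing, you can substitute a sentinel value before the STORE. Here is a minimal sketch using Pig's bincond operator; the 'unknown' sentinel and field names are illustrative:

-- Replace null or empty values with a sentinel so the field
-- is not omitted from the stored item
with_defaults = FOREACH exact_fields_to_store
    GENERATE name_in_dynamodb_1,
             ((name_in_dynamodb_2 IS NULL OR name_in_dynamodb_2 == '')
                 ? 'unknown'
                 : name_in_dynamodb_2) AS name_in_dynamodb_2,
             name_in_dynamodb_3;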


Controlling Write Throughput

You can control what percentage of your DynamoDB table's provisioned write throughput will be consumed via the dynamodb.throughput.write.percent parameter. DynamoDBStorage spreads this throughput evenly across all running Hadoop tasks, and re-queries the table's current throughput as each new task starts (in case you increase or decrease write throughput on the table).
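
As an illustration (the numbers are made up): if your table has 1,000 provisioned write capacity units and the percentage is 0.5, the job targets 500 units in aggregate, so 50 concurrent tasks would each write at roughly 10 units. To leave more headroom for live application traffic, lower the setting:

-- Illustrative: consume at most ~25% of the table's
-- provisioned write throughput
SET dynamodb.throughput.write.percent 0.25;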


Metrics

As it runs, DynamoDBStorage will write out metrics via Hadoop counters. These metrics are visible in the Hadoop cluster's JobTracker or via the Mortar Job Details "Visualize" tab:

(Screenshot: DynamoDB metrics in the Visualize tab)

Collected metrics include:

  • Retries: Number of times a batch of items had to be retried, usually due to hitting a capacity threshold. If you see frequent retries, lower your dynamodb.throughput.write.percent or provision additional write capacity on your table.
  • Bytes Written: Bytes of data written to DynamoDB. (Note that this is less than consumed capacity, which also includes metadata sent with the data.)
  • Null Fields Discarded: Number of fields with a null value that were omitted from an item.
  • Empty String Fields Discarded: Number of fields with an empty string value that were omitted from an item.
  • Consumed Capacity: Total number of DynamoDB write capacity units consumed.
  • Records Written: Total number of records written to DynamoDB.