Note that your DynamoDB database must currently be located in the US-EAST-1 region.
To store data into your DynamoDB database, customize the following template STORE statement:
```pig
-- Percentage of the table's write throughput to use
-- Default: 0.5, valid range: 0.1 - 1.5
SET dynamodb.throughput.write.percent 1.0;

-- Must disable Pig's MultiQuery optimization
-- when using DynamoDBStorage
SET opt.multiquery false;

-- DynamoDB table where writes should go
%default DYNAMODB_TABLE 'my_dynamo_db_table';

-- AWS access key to use for writing to DynamoDB table
%default DYNAMODB_AWS_ACCESS_KEY_ID 'SETME';

-- AWS secret key to use for writing to DynamoDB table
%default DYNAMODB_AWS_SECRET_ACCESS_KEY 'SETME';

-- Load up some input data
input_data = LOAD '$INPUT_PATH' USING PigStorage() AS (...);

-- Select exactly the fields you want to store to DynamoDB.
-- MUST include your DynamoDB table's primary key.
exact_fields_to_store = FOREACH input_data GENERATE
    my_field   AS name_in_dynamodb_1,
    my_field_2 AS name_in_dynamodb_2,
    my_field_3 AS name_in_dynamodb_3;

-- Store the data to DynamoDB
STORE exact_fields_to_store
 INTO 's3://somewhere-in-s3-i-can-write/but-will-not-get-written'
USING com.mortardata.pig.storage.DynamoDBStorage('$DYNAMODB_TABLE', '$DYNAMODB_AWS_ACCESS_KEY_ID', '$DYNAMODB_AWS_SECRET_ACCESS_KEY');
```
You'll need to edit the DYNAMODB_TABLE parameter to point to your table, and set the DYNAMODB_AWS_ACCESS_KEY_ID and DYNAMODB_AWS_SECRET_ACCESS_KEY parameters to AWS keys that can write to your DynamoDB instance.
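For instance, the edited parameter block might look like the following (the table name and keys below are placeholders, not real credentials):

```pig
-- Hypothetical example values -- replace with your own table and keys
%default DYNAMODB_TABLE 'my_orders_table';
%default DYNAMODB_AWS_ACCESS_KEY_ID 'AKIAEXAMPLEKEY';
%default DYNAMODB_AWS_SECRET_ACCESS_KEY 'exampleSecretKey';
```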
Pig requires a location for the INTO expression, but no data will be written there. Pig will, however, check that this location is accessible and does not contain data, so the best practice is to pass an S3 URL that is accessible but holds no data.
Any fields passed to the DynamoDBStorage STORE statement will be stored into DynamoDB. Be sure to put a FOREACH statement in front of your STORE to prune down to just the fields you want and to set their field names accordingly.
DynamoDB's data model does not allow item fields to have null or empty string values. If DynamoDBStorage sees an empty or null value, it will omit that field for the given record. The rest of the record's data will be stored as usual.
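If you would rather skip incomplete records entirely than store them with fields omitted, one approach is to filter them out before the STORE. This is a sketch that reuses the `exact_fields_to_store` relation and `name_in_dynamodb_1` field name from the template above:

```pig
-- Sketch: keep only records whose primary-key field is present,
-- since DynamoDB requires every item to include its primary key.
storable = FILTER exact_fields_to_store
           BY name_in_dynamodb_1 IS NOT NULL AND name_in_dynamodb_1 != '';
```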
You can control what percentage of your DynamoDB table's provisioned write throughput will be consumed via the dynamodb.throughput.write.percent parameter. DynamoDBStorage will spread this throughput evenly across all running tasks in Hadoop, and will re-query for the current throughput as each new task starts (in case you increase or reduce write throughput on the table).
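As a concrete sketch of how the budget works (the numbers here are assumptions for illustration): with 400 provisioned write units and the parameter set to 0.25, the job targets 400 × 0.25 = 100 writes per second overall; with 4 concurrent tasks, each task gets roughly 25 writes per second.

```pig
-- Use 25% of the table's provisioned write throughput.
-- Example arithmetic (assumed numbers): 400 write units * 0.25 = 100
-- writes/second for the whole job; ~25 writes/second per task with 4 tasks.
SET dynamodb.throughput.write.percent 0.25;
```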
As it runs, DynamoDBStorage will write out metrics via Hadoop counters. These metrics are visible in the Hadoop Cluster JobTracker or via the Mortar Job Details "Visualize" tab. If the counters show that writes are being throttled, you can adjust dynamodb.throughput.write.percent or provision additional write capacity on your table.