Load Your Data into Pig

Now that you know how to connect to your Mongo database, let's load your data into Pig. If you don't have a Mortar project already set up or if you created a public project for the earlier examples but want your real code to be private, follow the instructions to set up a new private project before getting started.

Connection Strategy

Depending on which connection strategy you chose in Get Connected, follow the Direct Mongo Connection steps or the Mongodump Data in S3 steps below.


Direct Mongo Connection

Follow these steps first if you're connecting directly to your MongoDB data.

Mongo URI Configuration

The first thing to do is securely store the Mongo URI for your database. Mortar provides encrypted configuration storage for this via the mortar config command.

Store your Mongo URI in a configuration variable via:

cd <my_project_folder>
mortar config:set MONGO_URI='put your Mongo URI here'

This will store your encrypted configuration at Mortar for running both local and cloud jobs.
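Assuming the mortar config command follows the usual Heroku-style conventions, running it with no arguments should list what you've stored:

# List the configuration variables stored for this project
mortar config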

If you need help writing a Mongo URI, see the MongoDB Reference Page. Be sure to leave the collection out of your Mongo URI so you can connect to multiple collections.
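For example, a database-level URI without a collection looks like this (the hostname, port, and credentials here are made up):

mortar config:set MONGO_URI='mongodb://myuser:mypassword@ds012345.example.net:27017/mydatabase'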

Template Script

In your Mortar project, open pigscripts/my_direct_connection_template.pig. This is the template that you're going to use to connect to your data directly via a Mongo URI.

The first thing you need to do is to point the template script at the proper collections.

/******* Pig Script Parameters **********/

%default INPUT_COLLECTION 'your_input_collection';
%default OUTPUT_COLLECTION 'your_output_collection';

For the input, start off with a single, small collection to make sure everything is working properly.
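As background for the load statements you'll see shortly, MongoLoader locations take the form mongodb://host:port/database.collection, so the template most likely joins your stored URI and collection parameter with a dot, along these lines (a sketch, not necessarily the template's exact text):

-- Hypothetical: database-level URI plus collection name
mycollection = load '$MONGO_URI.$INPUT_COLLECTION' using com.mongodb.hadoop.pig.MongoLoader();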

When you're done, skip ahead to the Load Statement section below.


Mongodump Data in S3

Follow these steps first if you're loading mongodump BSON data stored in S3.

Template Script

In your Mortar project, open pigscripts/my_mongodump_template.pig. This is the template that you'll use to connect to a mongodump BSON file stored in S3.

Data Paths

The first thing you need to do is to point the template script at your data.

/******* Pig Script Parameters **********/

-- Input path to either a single BSON file
-- or a directory with multiple BSON files from the same collection
-- (Do not point to files from different collections)
%default INPUT_PATH 's3://my-input-bucket/my-folder/mycollection.bson';

%default OUTPUT_PATH 's3://my-output-bucket/my-folder/output';

This section of the script defines the parameters that tell Pig where your input data is stored and where your output should be written. Update these parameters with your S3 paths where you uploaded your collection data and where you'd like output to be stored.

For the input, start off with a single, small BSON collection to make sure everything is working properly.

If you enter a directory as your input path, Pig will read all files in that directory. This is useful if you are loading data for the same collection extracted from multiple shards.
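For example, to load every shard's dump of one collection, point the input path at the directory that holds them (a hypothetical layout):

-- Reads every BSON file under this S3 prefix
%default INPUT_PATH 's3://my-input-bucket/my-folder/mycollection/';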


Load Statement

Next, uncomment the first load statement in your template Pig script. To do that, remove the block comment symbols (/* and */) from around the load statement.

-- First time: load without schema
-- Puts all data into a single Pig Map field named document
-- (MongoLoader shown; use BSONLoader instead for mongodump data in S3)
mycollection = load '...' using com.mongodb.hadoop.pig.MongoLoader();

The above load statement is a simple one that loads your data without a specific schema in Pig. This will put all data from each row of your collection into a single field named document. This field will be a Pig Map type.
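Until you add a schema, you can still reach into that Map with Pig's # dereference operator; for example (the field name here is hypothetical):

-- Pull a single value out of the document Map
names = foreach mycollection generate document#'name' as name;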

Let’s make sure your load statement works properly. Illustrate it with:

# use this one for direct connection
mortar local:illustrate pigscripts/my_direct_connection_template.pig

# or this one for mongodump
mortar local:illustrate pigscripts/my_mongodump_template.pig

[Screenshot: illustrate results]

If your load statement is correct, you should see illustrate results similar to the screenshot above, but with your data in the document field. If you click on the document field, you'll see all of the data for your document inside the Map.


Load With a Schema

Calling the MongoLoader or BSONLoader with no arguments loads your data into a single field without a specific schema in Pig. This is a good way to get started and see what data your collection contains, but for real production work you will want to pass a schema string to the loader. Doing so allows your loader to optimize performance by only extracting the fields you need, and also makes it easier to use each individual field in your script.

The best way to set up your schema string is to start simple and add fields one at a time. To do so, first comment out the load statement you used above by adding block comment symbols (/* and */) around it.

Then, uncomment the more advanced load statement below:

-- Once that works: load with schema
mycollection = load '...' using com.mongodb.hadoop.pig.MongoLoader(
             'mongo_id:chararray',
             'mongo_id');

This new load statement takes two parameters.

For MongoLoader, the first parameter provides a schema string, telling MongoLoader which fields you'd like extracted from each document and what data types they should take in Pig. The second parameter tells MongoLoader to rename the _id field in your collection to mongo_id, since _id is not a valid field name in Pig.

Confusingly enough, if you're loading mongodump BSON data instead with the BSONLoader, those parameters are reversed.
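So the same statement for mongodump BSON data would look like this (a sketch; the load path is still elided):

-- BSONLoader: id alias first, then the schema string
mycollection = load '...' using com.mongodb.hadoop.pig.BSONLoader(
             'mongo_id',
             'mongo_id:chararray');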

We've started out by loading only one field, the document's mongo_id, and have told the loader to make it a chararray, which is Pig's name for a string data type. Run an illustrate to see what happens:

# use this one for direct connection
mortar local:illustrate pigscripts/my_direct_connection_template.pig

# or this one for mongodump
mortar local:illustrate pigscripts/my_mongodump_template.pig

[Screenshot: illustrate results]

From here, start adding the fields you want to use to your schema string one field at a time. Run an illustrate to confirm data is coming through as you expect.

Here is a fake schema string that shows how you might load each Mongo data type into Pig:

mycollection = load '$INPUT_PATH' using com.mongodb.hadoop.pig.MongoLoader(
             'mongo_id:chararray,
              my_string_field:chararray,
              my_object_id_field:chararray,
              my_int_field:int,
              my_long_field:long,
              my_float_field:float,
              my_double_field:double,
              my_date_field:datetime,
              my_embedded_object_with_data_types:tuple(embedded_field_1:int,embedded_field_2:chararray),
              my_embedded_object_no_data_types:map[],
              my_array:bag{t:(array_field:chararray)}',
             'mongo_id');

Once you have the fields you want coming through, you're ready to move to the next step of reporting on your data.


Loading More Collections

It's not necessary yet, but here's how to load additional collections when you're ready. To do so:

  1. Copy your first load statement
  2. Rename the statement's alias to a new name for your new collection (e.g. instead of mycollection use myothercollection)
  3. Change the Mongo URI or path to point to the new collection's data
  4. Change the schema to reflect the new collection's schema. You'll want to follow the same strategy as above: load without a schema first, then build up your schema.
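Putting those four steps together, two direct-connection loads might sit side by side like this (the collection names and schemas here are hypothetical):

users = load '$MONGO_URI.users' using com.mongodb.hadoop.pig.MongoLoader(
             'mongo_id:chararray,
              name:chararray',
             'mongo_id');

-- New alias, new location, new schema
orders = load '$MONGO_URI.orders' using com.mongodb.hadoop.pig.MongoLoader(
             'mongo_id:chararray,
              total:double',
             'mongo_id');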

That's all it takes to pull in data from additional collections to your Pig script.


Troubleshooting

If you have an error with your load statement(s), you will get an error message explaining what's gone wrong. Here are a few error messages you might see, along with ideas for resolving them.

You can also find deeper information about Mongo load statements and schemas on the Mortar Mongo Reference page.

AWS Permissions

Launch failed to start! 
Error: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXXXXXX, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID:

These 403 errors indicate that you're trying to request data from a bucket or object to which your AWS keys don't have access. Make sure that you've set your AWS keys correctly in Mortar. You might also try using those keys in another S3 utility like s3cmd or Transmit to ensure they have access to the bucket and object you're loading.
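For example, with s3cmd configured to use the same keys (the bucket and paths here are hypothetical):

# Confirm the keys can list the bucket and read the object
s3cmd ls s3://my-input-bucket/my-folder/
s3cmd info s3://my-input-bucket/my-folder/mycollection.bson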

Slow Load / Illustrate Hanging

If your load statement is hanging, you may be trying to download a very large BSON file from S3. Because of the way BSON objects are stored, the Mongo Hadoop adapter needs to pull the entire BSON file from S3 to use it. For very large files, this may be prohibitively slow.

As a workaround, try pointing at a local copy of the large BSON file during development. You can do this by using a file:///path/to/my/local/collection.bson file URL instead of an s3://my-bucket/my-folder/mycollection.bson S3 URL.
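For example, after downloading the file, the switch is a one-line parameter change:

-- Local development override: read the BSON file from disk instead of S3
%default INPUT_PATH 'file:///path/to/my/local/collection.bson';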