
Report on Your Data

This section continues from the previous section about loading your data into Pig. We cover how to process your data and store results back to S3 or to MongoDB.

Processing Your Data

Now that your data is loaded into Pig, you're ready to start processing and reporting on it. We'll show a simple example to get you started.

To begin processing your data, uncomment the Process Data section in your template script:

/***** Process Data *********/

-- Example processing step: count number of rows in collection
-- Replace with your processing
grouped = group mycollection all;
counted = foreach grouped generate COUNT(mycollection);

These two simple statements group and then count the rows in your collection. Run an illustrate to see how they work:

# use this one for direct connection
mortar local:illustrate pigscripts/my_direct_connection_template.pig

# or this one for mongodump
mortar local:illustrate pigscripts/my_mongodump_template.pig

Illustrate Results

You can see data being grouped and then counted in the illustrate results. The count is fairly small, as illustrate only uses a subset of rows to show what's happening in the script.
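Beyond a simple total, a common next step is to count rows per value of a field. Here is a hypothetical sketch (`status` is an assumed field name; substitute one that exists in your collection):

```pig
-- Count rows per value of a field
-- ('status' is an assumed field name; replace with a field from your collection)
by_status = group mycollection by status;
status_counts = foreach by_status generate
    group as status,
    COUNT(mycollection) as num_rows;
```

The same group-then-count pattern applies; the only change is grouping by a field instead of by `all`.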

Storing Results

Now that we have a processing step, we'll store our results back out. We can choose to store back to Mongo, Amazon S3, or a number of other destinations.

Once your script has a store statement, Pig can run it locally or on a cluster. When Pig runs a script, it works backward from the store statements to determine which operations it needs to perform. Any code not referenced, directly or indirectly, by a store statement will not be executed.
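For example, in the sketch below (the `filtered` alias and the `id` field are illustrative), only the statements leading to `counted` run; `filtered` is never referenced by a store statement, so Pig skips it entirely:

```pig
grouped = group mycollection all;
counted = foreach grouped generate COUNT(mycollection);

-- Not referenced by any store statement, so this line is never executed
filtered = filter mycollection by id is not null;

store counted into '$OUTPUT_PATH' using PigStorage();
```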

Storing Back to Mongo

To store back to Mongo, uncomment the Store Data section of the pig script:

-- Store data back to MongoDB
store counted into '$OUTPUT_MONGO_URI'
         using com.mongodb.hadoop.pig.MongoInsertStorage();

For more details about storing to Mongo, see the Mongo Load/Store Reference page.
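The `$OUTPUT_MONGO_URI` parameter should be a full MongoDB connection URI naming the target database and collection. A hypothetical example (host, credentials, database, and collection names are all placeholders):

```pig
-- Set near the top of the script; every value here is a placeholder
%default OUTPUT_MONGO_URI 'mongodb://username:password@myhost.example.com:27017/mydb.mycollection'
```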

Storing to S3

To store back to S3, make sure you've connected your Mortar account to S3 and set the OUTPUT_PATH parameter at the top of the script to a valid S3 URL where you can write data. Then, uncomment the Store Data section:

-- Remove any existing output and store data to S3
rmf $OUTPUT_PATH;
store counted into '$OUTPUT_PATH' using PigStorage();

This will store data back to your Amazon S3 account. To learn more about storing data to Amazon S3, check out our Amazon S3 doc.
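PigStorage writes tab-delimited text by default; you can pass a delimiter argument if you'd prefer a different format. A minimal sketch producing comma-separated output:

```pig
-- Store comma-separated instead of the default tab-separated output
store counted into '$OUTPUT_PATH' using PigStorage(',');
```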

Running in the Cloud

Now that your script is ready, it's time to run the job in the cloud. Before doing this, you need to pick the number of nodes to use for your job. If your data is small, start with a small cluster (3 nodes is a good choice). If you're pointed at a very large collection, start with more nodes, say 5 to 10. You can then adjust your cluster size based on how long your job takes and your own requirements for acceptable run time.

This Mortar project defaults to using AWS spot instance clusters. While spot instance prices aren’t guaranteed, they’re typically around $0.14 per hour per node. On very rare occasions when the spot price goes above $0.44 per hour per node, your cluster will be automatically terminated and the job will fail. In most cases the significant cost savings of spot instances will be worth the very small chance of your job failing, but read Setting Project Defaults if you would like to change the defaults of your project.

Note: if you already have an idle cluster running, you can save time by omitting the clustersize argument below, which will select your running cluster and start your job immediately.

Ok, now you’re ready to run!

mortar run pigscripts/my_bson_loading_template.pig --clustersize 5

After starting your job you will see output similar to:

Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

Or by running:

    mortar jobs:status some_job_id --poll

This tells you that your job has started successfully and gives you two common ways to monitor the progress of your job.

Monitoring Your Job's Progress

The earlier tweets example covered how to monitor your job's progress, but let's recap the steps.

Open up your job status page at the url displayed after you started your job.

Visualization Results

The top of the page shows your job's overall progress. Remember that there are three main stages to how your job runs:

  1. Validation - Mortar checks your script for some simple error conditions.
  2. Cluster Starting - Mortar starts a Hadoop cluster for your job. This can take 5-15 minutes.
  3. Running - Your job is running on your cluster.

Once your job starts running on your cluster, you can use the visualization tab to see how your job is broken up into Hadoop Map/Reduce jobs and to see various metrics about how your job is doing.

Job Details

On the details tab you will see various metadata about this job. Mortar keeps track of the exact code and parameters you used to run this job. As you are developing your recommendation engine it can be useful to go back to one of your previous jobs to get a sense of what you changed and how that affected your results.

If your job fails, the details tab will show the error that occurred. To diagnose errors, use the log tabs on the right to get more detailed information.

Download Your Results

Once your job has finished, it's time to take a look at your results. Open up the web url displayed after you started your job and go to the details tab.

Download Results

If you saved results to S3, you'll be able to download them directly from this tab. Otherwise, you'll need to hop into your MongoDB to grab them.

If you start saving larger files to S3, you may want to use a tool like the AWS Command Line Interface, Transmit, or s3cmd to download your results, providing the AWS keys you wrote down previously to your tool of choice.

Going Deeper

Congratulations! You've written and run a Pig job on your MongoDB data using Hadoop!

Next, pick a problem you're trying to solve with your Mongo data. Adapt your script to explore the data and solve the problem. Use the Pig Cheat Sheet, Pig Reference Page, and Mongo Reference Page to guide you as you write your script.

With Pig and Hadoop, you have the tools and the scalability to join together large data sets, run large custom reports and aggregations, and pull in your own custom code to analyze your data. Have fun, and let us know if you need any help!