Now that your data is loaded into Pig, you're ready to start processing and reporting on it. We'll show a simple example to get you started.
To begin processing your data, uncomment the Process Data section in your template script:
/***** Process Data *********/

-- Example processing step: count number of rows in collection
-- Replace with your processing
grouped = group mycollection all;
counted = foreach grouped generate COUNT(mycollection);
These two simple statements group and then count the rows in your collection. Run an illustrate to see how they work:
# use this one for direct connection
mortar local:illustrate pigscripts/my_direct_connection_template.pig

# or this one for mongodump
mortar local:illustrate pigscripts/my_mongodump_template.pig
You can see data being grouped and then counted in the illustrate results. The count is fairly small, as illustrate only uses a subset of rows to show what's happening in the script.
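Conceptually, `group mycollection all` collapses the entire relation into a single group, and `COUNT` then tallies the tuples in that group's bag. A rough Python sketch of the same semantics (the sample rows are made up for illustration):

```python
# Toy model of the two Pig statements, using made-up rows.
rows = [{"user": "a"}, {"user": "b"}, {"user": "c"}]

# grouped = group mycollection all;
# -> a single tuple: the group key "all" plus a bag holding every row
grouped = ("all", rows)

# counted = foreach grouped generate COUNT(mycollection);
# -> COUNT tallies the tuples in the bag
counted = len(grouped[1])

print(counted)  # 3
```

The key point is that `GROUP ... ALL` always yields exactly one group, so the subsequent `COUNT` produces a single-row result.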
Now that we have a processing step, we'll store our results back out. We can choose to store back to Mongo, Amazon S3, or a number of other destinations.
Once your script has a store statement, Pig will be able to run it locally or on a cluster. When Pig runs a script, it works backward from the store statements to determine which operations it needs to perform. Any alias that isn't referenced by a store statement, directly or through an intermediate alias, will not be executed.
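To make that pruning concrete, here's a small, hypothetical Python sketch of the backward walk: starting from the stored aliases, only operations reachable through the dependency chain are kept. The alias names and the toy dataflow graph are invented for illustration, not Pig's actual internals:

```python
# Each alias maps to the aliases it reads from (a toy dataflow graph).
deps = {
    "grouped": ["mycollection"],
    "counted": ["grouped"],
    "unused": ["mycollection"],   # defined in the script but never stored
}
stores = ["counted"]              # aliases named in store statements

# Walk backward from the stores to find everything that must run.
needed, stack = set(), list(stores)
while stack:
    alias = stack.pop()
    if alias not in needed:
        needed.add(alias)
        stack.extend(deps.get(alias, []))

# "unused" is never reached from a store, so it would be skipped entirely.
print(sorted(needed))  # ['counted', 'grouped', 'mycollection']
```

This is why a script with no store statement does nothing: with an empty starting set, no operation is ever reachable.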
To store back to Mongo, uncomment the Store Data section of the pig script:
-- Store data back to MongoDB
store counted into '$OUTPUT_MONGO_URI' using com.mongodb.hadoop.pig.MongoInsertStorage();
For more details about storing to Mongo, see the Mongo Load/Store Reference page.
To store back to S3, make sure you've connected your Mortar account to S3 and set the OUTPUT_PATH parameter at the top of the script to a valid S3 URL where you can write data. Then, uncomment the Store Data section:
-- Remove any existing output and store data to S3
rmf $OUTPUT_PATH;
store counted into '$OUTPUT_PATH' using PigStorage();
This will store data back to your Amazon S3 account. To learn more about storing data to Amazon S3, check out our Amazon S3 doc.
Now that your script is ready, it's time to run the job in the cloud. Before doing so, you need to pick the number of nodes to use for running your job. If your data is small, start with a small cluster (3 nodes is a good choice). If you're pointed at a very large collection, start with more nodes, say 5 to 10. You can then adjust your cluster size based on how long your job takes and your own requirements for acceptable run time.
This Mortar project defaults to using AWS spot instance clusters. While spot instance prices aren’t guaranteed, they’re typically around $0.14 per hour per node. On very rare occasions when the spot price goes above $0.44 per hour per node, your cluster will be automatically terminated and the job will fail. In most cases the significant cost savings of spot instances will be worth the very small chance of your job failing, but read Setting Project Defaults if you would like to change the defaults of your project.
Note: if you already have an idle cluster running, you can save time by omitting the clustersize argument below, which will select your running cluster and start your job immediately.
Ok, now you’re ready to run!
mortar run pigscripts/my_bson_loading_template.pig --clustersize 5
After starting your job you will see output similar to:
Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

    https://app.mortardata.com/jobs/job_detail?job_id=some_job_id

Or by running:

    mortar jobs:status some_job_id --poll
This tells you that your job has started successfully and gives you two common ways to monitor the progress of your job.
In the earlier tweets example we covered how to monitor your job's progress, but let's recap what you'll do.
Open up your job status page at the URL displayed after you started your job.
The top of the page shows your job's overall progress. Remember that there are three main stages to how your job runs.
Once your job starts running on your cluster, you can use the visualization tab to see how your job is broken up into Hadoop Map/Reduce jobs and to see various metrics about how your job is doing.
On the details tab you will see various metadata about the job. Mortar keeps track of the exact code and parameters you used to run it. As you develop your scripts, it can be useful to go back to a previous job to get a sense of what you changed and how that affected your results.
If your job fails, the details tab will show you the error that occurred. To diagnose errors, use the log tabs on the right to get more detailed information.
Once your job has finished, it's time to take a look at your results. Open the web URL displayed after you started your job and go to the details tab.
If you saved results to S3, you'll be able to download them directly from this tab. Otherwise, you'll need to hop into your MongoDB to grab them.
If you start saving larger files to S3, you may want to use a tool like the AWS Command Line Interface, Transmit, or s3cmd to download your results; provide the AWS keys you wrote down previously to your tool of choice.
Congratulations! You've written and run a Pig job on your MongoDB data using Hadoop!
Next, pick a problem you're trying to solve with your Mongo data. Adapt your script to explore the data and solve the problem. Use the Pig Cheat Sheet, Pig Reference Page, and Mongo Reference Page to guide you as you write your script.
With Pig and Hadoop, you have the tools and the scalability to join together large data sets, run large custom reports and aggregations, and pull in your own custom code to analyze your data. Have fun, and let us know if you need any help!