
Run an Example at Scale in the Cloud

The previous example demonstrated the main concepts of running a job on MongoDB data. However, we used a very small amount of data. Now, it's time to run your job at scale. Using the power of Hadoop, we'll process the full data set on a cluster of machines.

Running in the Cloud

Now that we've tested our script on small data, let's run it against the full dataset of tweets. We'll use Hadoop's parallel processing to spread the work around a cluster of machines.

When running your code in the cloud, you need to decide how large a cluster of computers to use for your job. For this example, we only have about 0.5 GB of tweets in the cloud (not quite big data), so we don't need a very large cluster. We'll use a 3-node cluster, which should finish the job in under 5 minutes once it starts up. In later tutorials we'll cover more about how Hadoop works and how to determine the right cluster size for your data and your job.

By default this Mortar project uses AWS spot instance clusters to save money. Running this example on a 3-node spot instance cluster for 1 hour should cost you approximately $0.42 in pass-through AWS costs. Before running this job you will need to add your credit card to your account. You can do that on our Billing Page.
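The $0.42 figure is simply nodes times hours times the per-node spot rate. The rate used below is inferred from that total, not an official AWS price, so treat it as a rough sanity check:

```python
# Back-of-the-envelope cluster cost estimate.
# NOTE: the ~$0.14/node-hour spot rate is inferred from the $0.42
# figure above; actual AWS spot prices fluctuate over time.
def estimated_cost(nodes, hours, rate_per_node_hour=0.14):
    return round(nodes * hours * rate_per_node_hour, 2)

print(estimated_cost(nodes=3, hours=1))  # -> 0.42
```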

When you're ready, run the job on a 3-node cluster:

mortar run pigscripts/tweets_by_time_block.pig -f params/tweets-cloud.params --clustersize 3

After running this command you will see output similar to:

Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

Or by running:

    mortar jobs:status some_job_id --poll

This tells you that your job has started successfully and gives you two common ways to monitor its progress.

While you are waiting for your cluster to start and job to finish, jump over and do the Hadoop and Pig tutorials to get a better understanding of how they work.

Monitoring Your Job's Progress

Your job goes through three main steps after you submit it:

  1. Mortar validates your pigscript, checking for simple Pig errors, Python errors, and that you have the required access to the specified S3 buckets. If the pigscript is invalid, Mortar will return an error explaining the problem. This step is done before launching a cluster to make sure you pay as little as possible.
  2. Mortar starts a Hadoop cluster of the size you specified. This stage can take 5-15 minutes. You do not pay for the time the cluster is starting.
  3. Mortar runs your job on the cluster.
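The `mortar jobs:status some_job_id --poll` command shown earlier keeps checking your job until it reaches a terminal state. Conceptually, that kind of polling loop looks something like this (a generic sketch, not Mortar's actual implementation; `fetch_status` and the status strings are stand-ins for the real API):

```python
import time

# Hypothetical terminal states for illustration only.
TERMINAL_STATES = {"success", "failure"}

def wait_for_job(fetch_status, job_id, interval=5.0):
    """Poll fetch_status(job_id) until the job reaches a terminal state.

    fetch_status is a stand-in for the real status API call; here it
    returns a status string such as "validating", "starting_cluster",
    "running", "success", or "failure".
    """
    while True:
        status = fetch_status(job_id)
        print("job %s: %s" % (job_id, status))
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval)

# Example with a faked status sequence:
states = iter(["validating", "starting_cluster", "running", "success"])
result = wait_for_job(lambda job_id: next(states), "some_job_id", interval=0)
```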

Once your job has started running on the cluster, you can get real-time feedback about its progress from the Mortar web application. Open the job status link displayed after you started your job:

Job status can be viewed on the web at:

Visualizing Results

At the top of the screen you can see the basic details of your job: its current progress, the time the job started, how long it's been running, and the number of nodes in the cluster.

Below that is the Visualization tab. This shows how Pig has compiled your pigscript into Hadoop Map/Reduce jobs. The blinking box shows you the current running Map/Reduce job, and clicking on any box will show you information and statistics for that specific job. There's a lot of information in this visualization, so take a minute to click around and see how things work.

Job Details

Click on the Details tab to see some information about your job. Mortar keeps track of every parameter value you used and links to an exact snapshot of the code used in GitHub. Once your job has finished, this page will also display a link to your output data on S3.

The remaining tabs show more detailed log information from Pig and Hadoop. Take a minute to look at those tabs and familiarize yourself with what is being shown.

Getting Results

Once your job has finished you can examine the results. In this case Pig has written our results to S3, so you’ll need to download the output files from S3. The easiest way to do that is to use the Details tab from the Job Status page shown above.

Download Results

Take a look at the results. This time the counts are higher, because we've processed the full data set:

00 - 01 10
01 - 02 6
02 - 03 2
03 - 04 4
04 - 05 6
05 - 06 7
06 - 07 6
07 - 08 10
08 - 09 15
09 - 10 6
10 - 11 4
11 - 12 2
12 - 13 3
13 - 14 7
14 - 15 12
15 - 16 7
16 - 17 12
17 - 18 8
18 - 19 12
19 - 20 14
20 - 21 8
21 - 22 7
22 - 23 16
23 - 00 10

From the set of tweets we've collected, it appears that the night time is the right time for excitement!
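You can double-check that claim programmatically. Each output line is an hour block followed by a count, so a quick script (assuming you've downloaded the results in exactly the format shown above) can pick out the busiest block:

```python
def busiest_block(lines):
    """Parse lines like '22 - 23 16' and return the (block, count) pair
    with the highest tweet count."""
    best = None
    for line in lines:
        parts = line.split()          # e.g. ['22', '-', '23', '16']
        if len(parts) != 4:
            continue                  # skip blank or malformed lines
        block, count = " ".join(parts[:3]), int(parts[3])
        if best is None or count > best[1]:
            best = (block, count)
    return best

sample = ["20 - 21 8", "21 - 22 7", "22 - 23 16", "23 - 00 10"]
print(busiest_block(sample))  # -> ('22 - 23', 16)
```

Running this over the full output above confirms that the 22 - 23 block, with 16 tweets, is the peak.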

If you haven't already completed the Hadoop and Pig tutorials, you should do that now. If you have, you're ready to connect to your own MongoDB data with Mortar!