
Run an Example Against MongoDB

Now that you have your Mortar project set up, we're going to run an example Pig script using MongoDB data.

Choose Your Own Adventure

For this example, we will work with a small collection of tweets. From our data, we'll answer an interesting question: which hour of the day do people tweet most about being excited?

You can choose to run this example via one of two connection strategies: Direct Mongo Connection or Mongodump Data in S3. Use whichever one you'd like to learn more about.

The main instructions will assume that you're using Direct Mongo Connection, but we will also provide special instructions for the alternate strategy in sections like:

Mongodump Data in S3

These sections will provide details for reading data from a mongodump backup stored on Amazon S3. We've got a mongodump backup of the tweets database in our S3 bucket that you can use for the example.


Setting up your Development Environment

Before getting into the details of our project, you should take a minute to set up your development environment. The first thing to do is pick an editor you want to use for viewing and working with code. If you don't already have a favorite editor you can review some options here.

The second thing to do is to open a terminal and change to the root directory of your Mortar project. This terminal will be used to run various commands while working with your project.

Open pigscripts/tweets_by_time_block.pig in your code editor. This is a script written in Apache Pig that searches for tweets that match a phrase and then counts the number of occurrences by hour of day. We'll cover how to write your own Pig code in a later tutorial. For now we are just going to work through the different sections of this pigscript, explaining what they do.


Registering Custom Python Code

register '../udfs/jython/timeutils.py' using jython as timeutils;

Our first line registers a custom Jython script for use in our Pig script. In Pig, custom code is packaged as a User-Defined Function (UDF). UDFs can be written in several languages (Python, Jython, Java, and Ruby); here we're using Jython.


Setting Parameters

/******* Pig Script Parameters **********/

%default INPUT_MONGO_URI 'mongodb://readonly:readonly@ds047020.mongolab.com:47020/twitter-small.tweets';
%default OUTPUT_PATH '../data/tweets_by_time_block';

Here we're setting up some parameters that indicate where we'll read input data and write output data. For this example we're reading in a very small Mongo collection of tweets and saving results back to the data subdirectory underneath our project root. We'll be able to control these parameters later via parameter files.
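A parameter file is simply a list of name=value pairs that override these %default values at run time. As a rough sketch (the actual params/tweets-local.params that ships with the project may differ), such a file could look like:

# Illustrative parameter file -- see params/tweets-local.params in your project for the real values
INPUT_MONGO_URI=mongodb://readonly:readonly@ds047020.mongolab.com:47020/twitter-small.tweets
OUTPUT_PATH=../data/tweets_by_time_block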


Loading the Data

In any Pig script, the first step is always loading data into Pig. Before we look at the load statement, here's what a (partial) example tweet looks like as a Mongo document:

{
    "_id" : ObjectId("50fdcbb0347912f10f36318f"),
    "created_at" : "Sat Dec 22 15:02:23 +0000 2012",
    "text" : "@JAlexandra_ ha I got the door! I knew I was going to when I got all dizzy.",
    "user" : {
        "utc_offset" : -21600,
    }
}

Now that you understand the data, take a look at how we load that data into Pig.

/******* Load Data **********/

tweets = LOAD '$INPUT_MONGO_URI'
         USING com.mongodb.hadoop.pig.MongoLoader('created_at:chararray, text:chararray, user:tuple(utc_offset:int)');

Here you can see how to use the Pig MongoLoader to load data directly from a Mongo database. We pass the INPUT_MONGO_URI parameter we defined earlier to provide connection details: username, password, host, port, database, and collection. We also provide a mapping from the schema of our collection in Mongo to Pig's schema. For more info on Mongo-to-Pig schema mapping, see our MongoDB reference page.

Mongodump Data in S3

To load Mongodump BSON data from Amazon S3, you'll need to comment out the "Direct Mongo Connection" load statement and uncomment the following load statement:
-- Uncomment to load tweets from a mongodump backup stored in S3
tweets = load 's3://mortar-example-data/twitter-mongo/tweets.bson'
         using com.mongodb.hadoop.pig.BSONLoader(
             'tweet_mongo_id',
             'created_at:chararray, text:chararray, user:tuple(utc_offset:int)');

Here you can see how to use the Pig BSONLoader to load data directly from mongodump backup files. The first argument to BSONLoader tells Pig what to rename the collection's _id field to, since _id is not a legal Pig field name. The second argument provides the mapping from Mongo's schema to Pig's schema.


Pig has a handy feature called "illustrate" that will help you visualize how your script works. Try it out by illustrating the "tweets" alias in this script. In your terminal, from the root directory of your Mortar project run:

mortar local:illustrate pigscripts/tweets_by_time_block.pig tweets -f params/tweets-local.params

This command runs an illustrate on the tweets_by_time_block pigscript, focusing specifically on the load statement alias tweets.

When we ran the illustrate command, we passed an argument pointing to the tweets-local.params parameter file. This tells Pig to use the parameter values in that file, which in this case point to a very small Mongo collection and write the output to a local directory.

The first time you run a mortar local command, it will take a minute or two to set up your environment: Mortar downloads all of the dependencies you need to run a Pig job into a local sandbox for your project. This one-time setup lets you run everything quickly on your own machine without having to launch a Hadoop cluster.

After the command finishes you should see output similar to:

[Screenshot: Illustrate Results]

Here you can see what the data looks like and that it's loading correctly. We will use illustrate throughout the tutorial to help visualize how data is being loaded and processed.


Performing Calculations on the Data

The next steps in the script perform the calculations necessary to answer our question. First, we filter the tweets to find ones that have the word "excite" and have a time zone provided:

filtered_tweets = filter tweets
                  by text matches '.*[Ee]xcite.*'
                     and user.utc_offset is not null;

Then we call the Jython functions we registered earlier to calculate the local time from the user's time zone offset and to bucket that time into one-hour blocks (e.g. 4:00-5:00, 5:00-6:00):

tweets_with_local_time = foreach filtered_tweets generate 
    timeutils.local_time(created_at, user.utc_offset) as created_at_local_tz_iso;

tweets_with_time_buckets = foreach tweets_with_local_time generate
    timeutils.hour_block(created_at_local_tz_iso) as hour_block;
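
We won't open the real udfs/jython/timeutils.py here, but a minimal sketch of what such Jython UDFs could look like (assuming the timestamp formats and hour-block labels shown elsewhere in this example; the actual implementation may differ) is:

# Illustrative sketch of udfs/jython/timeutils.py -- the real file may be implemented differently.
from datetime import datetime, timedelta

# The outputSchema decorator is provided by Pig's Jython UDF support;
# it declares the Pig type of each function's return value.

@outputSchema('created_at_local_tz_iso:chararray')
def local_time(created_at, utc_offset):
    # created_at looks like 'Sat Dec 22 15:02:23 +0000 2012';
    # utc_offset is the user's offset from UTC in seconds.
    utc = datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')
    return (utc + timedelta(seconds=utc_offset)).isoformat()

@outputSchema('hour_block:chararray')
def hour_block(iso_timestamp):
    # Bucket a local ISO timestamp into a one-hour block label like '08 - 09'.
    hour = datetime.strptime(iso_timestamp, '%Y-%m-%dT%H:%M:%S').hour
    return '%02d - %02d' % (hour, (hour + 1) % 24)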

And finally we group the data by hour block, count the number of tweets in each hour block, and sort them to make the output easier to read:

grouped = group tweets_with_time_buckets by hour_block;
counted = foreach grouped generate
              group as hour_block,
              COUNT(tweets_with_time_buckets.hour_block) as num_tweets;
ordered = order counted by hour_block asc;
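
If the group/count/order pattern is new to you, this rough Python analog (purely illustrative, not part of the project) shows the same counting logic that Pig runs as distributed MapReduce jobs:

# Illustrative Python analog of the GROUP / COUNT / ORDER steps above.
from collections import Counter

def count_by_hour_block(hour_blocks):
    # hour_blocks is an iterable of labels like ['08 - 09', '19 - 20', '08 - 09', ...]
    counted = Counter(hour_blocks)      # group by hour_block + COUNT
    return sorted(counted.items())      # order by hour_block asc

print(count_by_hour_block(['08 - 09', '19 - 20', '08 - 09']))
# [('08 - 09', 2), ('19 - 20', 1)]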

You'll learn how each of these operators works in an upcoming tutorial, but for now let's illustrate the entire script to get a better sense of what’s happening with the data:

mortar local:illustrate pigscripts/tweets_by_time_block.pig -f params/tweets-local.params

You'll see data flowing through each step of the script, all the way to the output, rolled-up by hour of the day.


Storing Results

We have a number of choices for where to store our results (including back to MongoDB), but for now we'll save them as flat files: under the local data directory when running locally, and in Amazon S3 when the script runs on the Mortar platform. There are also a number of file format choices, but we'll use Pig's standard PigStorage, which saves tab-delimited text files:

/******* Store Results **********/
rmf $OUTPUT_PATH/$MORTAR_EMAIL_S3_ESCAPED;
store ordered into '$OUTPUT_PATH/$MORTAR_EMAIL_S3_ESCAPED' using PigStorage();

Before running the store, we use the rmf command to remove any existing output, as Hadoop will refuse to overwrite existing data. We also append a special Mortar-provided parameter called $MORTAR_EMAIL_S3_ESCAPED to the output path, which ensures that every user of this tutorial stores to a unique location in S3 later!


Running the Script

Because the example tweet data set is quite small, you can actually run the entire script locally on your own computer.

mortar local:run pigscripts/tweets_by_time_block.pig -f params/tweets-local.params

Now that you've started your run, you'll see a lot of Pig output. This output shows how the Pig script above is being compiled and run as a series of Hadoop MapReduce jobs. Don't worry if you don't understand the logs or how Hadoop works; we'll cover this in a later tutorial.

Once your job finishes (it may take a couple of minutes) it’s time to take a look at the results.


Getting the Results

To check out your results, open up the output we just generated. You can find it under the directory data/tweets_by_time_block. The output file names (typically something like part-00000) come from Hadoop: because Hadoop is a distributed processing framework, the final output may be broken up into multiple part files. Here's what the output looks like:

00 - 01 1
03 - 04 1
04 - 05 3
05 - 06 3
06 - 07 2
07 - 08 1
08 - 09 6
09 - 10 2
13 - 14 2
14 - 15 4
15 - 16 1
16 - 17 6
17 - 18 3
18 - 19 1
19 - 20 6
20 - 21 2
21 - 22 3
22 - 23 2
23 - 00 4

The first column is our hour block (e.g. 0:00 - 1:00), and the second column is the number of tweets counted. We've got an extremely small data set, so there aren't many tweets; the 10:00 - 11:00 hour, for example, has none at all.

You've now run a Pig job on your local machine against a small MongoDB data set. With the script working on small data, the next step is to run it against the full dataset on the Mortar platform.