
Google Drive + DataHero Example
Run an Example
Create Your Own Dashboards

Run an Example

Now that you have your Mortar project set up, we're going to run an example Pig script that stores data to Google Drive. Our script will use Twitter data to decide how snobby each U.S. state is about coffee.

Setting up your Development Environment

Before getting into the details of our project, take a minute to set up your development environment. First, pick an editor you want to use for viewing and working with code; if you don't already have a favorite editor, you can review some options here. Second, open a terminal and change to the root directory of your Mortar project. You'll use this terminal to run various commands while working with your project.

Open pigscripts/coffee_tweets_google_drive.pig in your code editor. This is a script written in Apache Pig that searches for tweets that match a phrase and then counts the number of occurrences by state. We'll cover how to write your own Pig code in a later tutorial. For now we are just going to work through the different sections of this pigscript, explaining what they do.


Registering Custom Code

REGISTER '../udfs/python/twitter_places.py' USING streaming_python AS twitter_places;
REGISTER '../udfs/python/coffee.py' USING streaming_python AS coffee;

Our first two lines register custom Python scripts for use in our Pig script. In Hadoop, custom code is called a User-Defined Function (UDF). UDFs can be written in several languages, including Python, Jython, Java, and Ruby.
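
To give you a sense of what these UDFs look like, here is a minimal sketch of a function in the style of coffee.py (illustrative only; the actual file in the project may differ). It assumes Mortar's pig_util helper, whose outputSchema decorator tells Pig the type of value the function returns:

from pig_util import outputSchema

# Illustrative list of "coffee snob" phrases; the real UDF may use a different one.
COFFEE_TERMS = ['latte', 'espresso', 'macchiato', 'pour over', 'french press']

@outputSchema('is_coffee_tweet:int')
def is_coffee_tweet(text):
    # Return 1 if the tweet text mentions a snobby coffee term, else 0.
    if text is None:
        return 0
    lowered = text.lower()
    return 1 if any(term in lowered for term in COFFEE_TERMS) else 0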


Setting Parameters

%default TWEET_LOAD_PATH 's3n://twitter-gardenhose-mortar/example'
%default OUTPUT_PATH 's3n://mortar-example-output-data/$MORTAR_EMAIL_S3_ESCAPED/coffee_tweets_gdrive'

Here we're setting up some parameters that indicate where we'll find input data and where we'll write our output data. We'll write the output to S3 before writing to Google Drive. $MORTAR_EMAIL_S3_ESCAPED is a parameter Mortar defines so that all Mortar users can write to the same example output data bucket without overwriting each other.


Loading the Data

The first step in any Pig script is loading data. Before we explain that code, here's what a (partial) example tweet looks like as a JSON object:

{
    "text": "First latte in a long long time... #withdrawls #prairiechick http://t.co/FLrGnaPcVT",
    "id": 423879526847107100,
    "favorite_count": 0,
    "retweeted": false,
    "coordinates": null,
    "retweet_count": 0,
    "in_reply_to_user_id": null,
    "favorited": false,
    "user": {
        "friends_count": 274,
        "location": "Okoboji, IA",
        "time_zone": "Central Time (US & Canada)"
     },
     "created_at": "Thu Jan 16 18:08:49 +0000 2014",
     "place": {"place_type": "city", "country": "United States", "full_name": "city, KY", "country_code": "US"}
}

Now that you understand the data, take a look at how we load that data into Pig.

tweets =  LOAD '$TWEET_LOAD_PATH'
            USING org.apache.pig.piggybank.storage.JsonLoader(
            'created_at:chararray, place:map[], text:chararray');

This example loads JSON, but Pig supports many other formats.
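
To make the schema concrete, here is a small Python sketch (purely illustrative; Pig's JsonLoader performs this projection itself) showing which parts of the example tweet the three declared columns correspond to:

# A trimmed version of the example tweet shown above.
tweet = {
    "created_at": "Thu Jan 16 18:08:49 +0000 2014",
    "place": {"place_type": "city", "country_code": "US", "full_name": "city, KY"},
    "text": "First latte in a long long time...",
}

# The schema 'created_at:chararray, place:map[], text:chararray' keeps exactly
# these three top-level fields and ignores everything else. In Pig, values in
# the place map are read with the # operator, e.g. place#'country_code'.
created_at = tweet["created_at"]
place = tweet["place"]
text = tweet["text"]
print(created_at, place["country_code"], text)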

Pig has a handy feature called "illustrate" that will help you visualize how your script works. Try it out by illustrating the "tweets" alias in this script. In your terminal, from the root directory of your Mortar project run:

mortar local:illustrate pigscripts/coffee_tweets_google_drive.pig tweets

This command runs an illustrate on the coffee_tweets_google_drive pigscript, focusing specifically on the load statement alias tweets.

The first time you run a mortar local command, it will take a minute or two to set up your environment: Mortar downloads all of the dependencies you need to run a Pig job into a local sandbox for your project. This happens only once, and it lets you run everything quickly on your own machine without having to launch a Hadoop cluster.

After the command finishes you should see output similar to:

[Screenshot: illustrate results]

Here you can see what the data looks like and that it's loading correctly. We will use illustrate throughout the tutorials to help visualize how data is being loaded and processed.


Performing Calculations on the Data

The next steps in the script perform the calculations necessary to answer our question. First, we filter the tweets to those that we can extract a location from:

-- Filter to get only tweets that have a location in the US
tweets_with_place =
    FILTER tweets
        BY place IS NOT NULL
       AND place#'country_code' == 'US'
       AND place#'place_type' == 'city';
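
In Python terms (a rough equivalent, just to show what the predicate does), this filter keeps a tweet only when its place map is present and identifies a US city:

def keep_tweet(place):
    # Rough Python equivalent of the FILTER above: place must exist,
    # be in the US, and be a city-level place.
    return (place is not None
            and place.get('country_code') == 'US'
            and place.get('place_type') == 'city')

# The example tweet's place passes; a tweet with no place does not.
print(keep_tweet({"place_type": "city", "country_code": "US", "full_name": "city, KY"}))  # True
print(keep_tweet(None))                                                                   # False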

Then, we call our Python functions to parse out the state and to evaluate whether the text is coffee-snob-related:

-- Parse out the US state name from the location
-- and determine whether this is a coffee tweet.
coffee_tweets =
    FOREACH tweets_with_place
   GENERATE text,
            place#'full_name' AS place_name,
            twitter_places.us_state(place#'full_name') AS us_state,
            coffee.is_coffee_tweet(text) AS is_coffee_tweet;
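
For a rough idea of what twitter_places.us_state might do, here is a hypothetical sketch (the real UDF in the project is likely more thorough, e.g. covering all fifty states and more place formats):

from pig_util import outputSchema

# Hypothetical sketch only: map a Twitter place full_name like "city, KY" to a state name.
STATE_NAMES = {'IA': 'Iowa', 'KY': 'Kentucky', 'AR': 'Arkansas'}  # ...and so on

@outputSchema('us_state:chararray')
def us_state(full_name):
    if full_name is None or ',' not in full_name:
        return None
    abbrev = full_name.rsplit(',', 1)[1].strip().upper()
    return STATE_NAMES.get(abbrev)  # None when it isn't a recognized state abbreviation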

We then apply another filter to make sure that all of our tweets were successfully assigned a state:

-- Keep only results where we found a US state
with_state = FILTER coffee_tweets BY us_state IS NOT NULL;

And finally we group the data by state, calculate the percentage of coffee tweets in each state, and sort the output:

-- Group the tweets by US state
grouped = GROUP with_state BY us_state;

-- Calculate the percentage of coffee tweets for each state
coffee_tweets_by_state =
    FOREACH grouped
   GENERATE group as us_state,
            SUM(with_state.is_coffee_tweet) AS num_coffee_tweets,
            COUNT(with_state) AS num_tweets,
            100.0 * SUM(with_state.is_coffee_tweet) / COUNT(with_state) AS pct_coffee_tweets;

-- Order by percentage to get the largest coffee snobs at the top
ordered_output =
    ORDER coffee_tweets_by_state
       BY pct_coffee_tweets DESC;
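
To see what the SUM/COUNT arithmetic produces for one state, here is the same calculation on a few made-up rows in plain Python (illustrative only; Pig computes this per state across the whole dataset):

# Made-up (us_state, is_coffee_tweet) rows for a single state.
rows = [('IA', 1), ('IA', 0), ('IA', 0), ('IA', 1)]

num_coffee_tweets = sum(flag for _, flag in rows)            # SUM(with_state.is_coffee_tweet) -> 2
num_tweets = len(rows)                                       # COUNT(with_state)               -> 4
pct_coffee_tweets = 100.0 * num_coffee_tweets / num_tweets   # -> 50.0
print(num_coffee_tweets, num_tweets, pct_coffee_tweets)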

You'll learn how each of these operators works in an upcoming tutorial, but for now let's illustrate the entire script to get a better sense of what’s happening with the data:

mortar local:illustrate pigscripts/coffee_tweets_google_drive.pig

When we leave out the alias, illustrate will run on the entire script.


Storing Results

We want our results to end up on Google Drive so that they are ultimately accessible to DataHero, so we use the GoogleDriveStorage function:

-- Remove any existing output at this location on S3 before storing results to S3 and Google Drive
rmf $OUTPUT_PATH;
STORE ordered_output INTO '$OUTPUT_PATH'
    USING com.mortardata.pig.GoogleDriveStorage('coffee_tweets_output', 'YES_CONVERT', 'NO_OVERWRITE');

This will first store the data to S3 and then seamlessly copy it to your Google Drive as a file named coffee_tweets_output.

The YES_CONVERT argument tells it to store the data as a Google Sheet instead of a plain document, and NO_OVERWRITE tells it not to overwrite an existing file with the same name (to force an overwrite, use YES_OVERWRITE).

Before running the store, we use the rmf command to remove any existing output at the $OUTPUT_PATH location (not your Google Drive!), because Hadoop will refuse to overwrite existing data. The S3 output path also includes the Mortar-provided $MORTAR_EMAIL_S3_ESCAPED parameter, which ensures that every user of this tutorial stores to a unique location in S3.


Running the Script in the Cloud

Ordinarily you can test Pig scripts on your own machine with the mortar local commands, but when using GoogleDriveStorage you have to run in the cloud on a real Hadoop cluster: the last mile of moving data into Google Drive requires the connection you created in Setup Google Drive.

When running your code in the cloud, you need to decide how large a cluster of computers to use for your job. For this example, we have a very small number of tweets in the cloud, so we don't need a very large cluster. We'll use a 3-node cluster, which should finish the job in under 5 minutes once it starts up. In later tutorials we'll cover more about how Hadoop works and how to determine the right cluster size for your data and your job.

By default this Mortar project uses AWS spot instance clusters to save money. Running this example on a 3-node spot instance cluster for 1 hour should cost you approximately $0.42 in pass-through AWS costs. Before running this job you will need to add your credit card to your account. You can do that on our Billing Page.

When you're ready, run the job on a 3-node cluster:

mortar jobs:run pigscripts/coffee_tweets_google_drive --clustersize 3

After running this command you will see output similar to:

Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

    https://app.mortardata.com/jobs/job_detail?job_id=some_job_id

Or by running:

    mortar jobs:status some_job_id --poll

This tells you that your job has started successfully and gives you two common ways to monitor the progress of your job.

While you are waiting for your cluster to start and your job to finish, jump over and do the Hadoop and Pig tutorials to get a better understanding of how they work.


Monitoring Your Job's Progress

Your job goes through three main steps after you submit it:

  1. Mortar validates your pigscript, checking for simple Pig and Python errors and verifying that you have the required access to the specified S3 buckets. If the pigscript is invalid, Mortar returns an error explaining the problem. This step happens before a cluster is launched, so you pay as little as possible.
  2. Mortar starts a Hadoop cluster of the size you specified. This stage can take 5-15 minutes. You do not pay for the time the cluster is starting.
  3. Mortar runs your job on the cluster.

Once your job has started running on the cluster, you can get realtime feedback about its progress from the Mortar web application. Open the job status link displayed after you started your job:

Job status can be viewed on the web at:

    https://app.mortardata.com/jobs/job_detail?job_id=some_job_id

Getting Results

Once your job has finished, you should be able to see the results in S3 and, more importantly, in your Google Drive. Log in to your Google account and you should see a Google Sheets document named "coffee_tweets_output."

[Screenshot: Google Drive results]

If you open the file you should see results that look something like this:

[Screenshot: coffee tweets output]

The results show that, improbably (and likely due to the small sample size), Arkansas is winning the coffee snob wars.


Pig Version

Mortar supports two versions of Pig: 0.9 and 0.12. GoogleDriveStorage is only supported on Pig 0.12. If you open the file called .mortar-defaults, you will see where the Pig version is specified:

[DEFAULTS]
pigversion=0.12

You can also specify the Pig version when running a job from the command line using:

--pigversion 0.12
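
For example, to run this job on a 3-node cluster while explicitly requesting Pig 0.12:

mortar jobs:run pigscripts/coffee_tweets_google_drive --clustersize 3 --pigversion 0.12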