First Web Project

This quick tutorial will show you the basics of using the Mortar Web IDE to work with data in Apache Pig.

Note that you'll need a Mortar account to follow along. You can request a free trial of Mortar here.

If you're not already there, open up the Mortar Web IDE. You should see the "My First Project: Olympic Data" project. If not, click My Web Projects -> Open and choose it from the list.

This project starts us off with a very basic Pig script that loads the Olympics data file from an Amazon S3 bucket and stores its contents to an output bucket. This tutorial will guide you through finding the country that won the most medals at the 2010 Olympics. Along the way, we'll explore how to use Pig to analyze data of any size (so don't worry if you're new to Pig!).

OlympicAthletes.CSV sample data:
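(The sample rows aren't reproduced here. Based on the fields this tutorial uses later, each record includes at least an athlete's country, the Olympic year, and a total medal count; a plausible, illustrative header row might be:)

Athlete, Country, Year, Sport, Gold Medals, Silver Medals, Bronze Medals, Total Medals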


LOAD Statement

The first thing we need to do is get data into Pig. We do this by writing a LOAD statement. This statement tells Pig where to get the data and how to load it—for this job, we are loading a CSV file. We’ve created the LOAD statement using Mortar’s load statement generator. You can use this tool to quickly build a Pig load statement customized to your data.
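As a rough sketch, a generated CSV LOAD statement looks something like the following. (The S3 path and the schema here are illustrative assumptions based on the fields this tutorial uses, not the exact statement in the project.)

-- Illustrative path and schema; on platforms other than Mortar you may
-- also need to REGISTER the piggybank jar before using CSVExcelStorage.
athletes = LOAD 's3n://my-input-bucket/OlympicAthletes.csv'
           USING org.apache.pig.piggybank.storage.CSVExcelStorage()
           AS (athlete:chararray, country:chararray, year:int,
               sport:chararray, gold:int, silver:int, bronze:int, total:int);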


STORE Statement

The last thing we need to do in a Pig script is store our output data—the result of our data transformation or analysis. In this particular example, we are storing the final dataset in an Amazon S3 bucket called ‘output’. Note that before we run the ‘STORE’ command, we run ‘rmf’ to remove any existing output in that bucket. This is because Hadoop will refuse to overwrite existing data.
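The pattern looks roughly like this (the bucket path is illustrative, not the project's exact one):

-- Remove any existing output first; Hadoop refuses to overwrite it.
rmf s3n://my-output-bucket/output;
STORE athletes INTO 's3n://my-output-bucket/output' USING PigStorage('\t');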


Analyzing Data

Now that we understand how to load and store data, let’s find the country with the most medals at the 2010 Olympics!

Step 1: Filter

As a rule of thumb, filter early and filter often. We will start by filtering our athletes to find those who competed in the 2010 Olympics. This is what a generic filter statement looks like:

filtered_data1 = FILTER my_data BY field == 'Field Value';

Write a filter statement to find the athletes who competed in the 2010 Olympics.

Solution

athletes_2010 = FILTER athletes BY year == 2010; 


Step 2: Group

Our next step is to group athletes_2010 by the country they hail from. This is what a generic GROUP statement looks like:

data_grp = GROUP data BY field;

Group the athletes who competed in the 2010 Olympics by country.

Solution

athletes_by_country = GROUP athletes_2010 BY country;


Step 3: Foreach...Generate & Sum

Now that the athletes are grouped by their country, we need to aggregate the total number of medals won by country. The general strategy for summing data in Pig is:

new_data = FOREACH data_grp 
           GENERATE group AS field2, 
           SUM(data.field) AS field_sum;

Find the total number of medals per country.

Solution

medal_sum = FOREACH athletes_by_country
            GENERATE group AS country,
            SUM(athletes_2010.total) AS medal_count;


Step 4: Order by and limit

At this point, we have an alias called medal_sum: a collection of tuples pairing each country with its total medal count. To find the country with the most medals, all we need to do is order the data and output the first row. The general strategy for ordering and limiting looks like:

ordered_data = ORDER summed_data BY field_sum DESC;
alias_lim = LIMIT ordered_data 1;

Find the country with the most medals using ORDER BY and LIMIT.

Solution

order_by_medals = ORDER medal_sum BY medal_count DESC;
order_medals_limit = LIMIT order_by_medals 1;



Run

Now we’re almost ready to run our first Pig job on Hadoop. Before we do so, let’s go through a few pre-launch checks:

  1. Make sure your store statement is storing order_medals_limit so you are outputting the correct data (a sketch of the assembled script follows this list).

  2. To sanity-check your code, use 'Illustrate' to see how your data flows through each step of your Pig script. Illustrate takes a small sample of the CSV data and shows how each statement transforms it.

  3. Now you are ready to run your code! Press the green Run button. Since we're working with a small amount of data, we can use "Local Mode (No Cluster)" to run the script without starting a Hadoop cluster. This is a quick way to run scripts that use smaller amounts of data; for bigger data you'll want to run with a real Hadoop cluster.
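Putting it all together, the whole script looks roughly like the sketch below. The LOAD and STORE paths and the loaded schema are illustrative assumptions; the analysis statements in the middle are exactly the ones we wrote above.

-- LOAD/STORE paths and schema are illustrative, not the project's exact ones.
athletes = LOAD 's3n://my-input-bucket/OlympicAthletes.csv'
           USING org.apache.pig.piggybank.storage.CSVExcelStorage()
           AS (athlete:chararray, country:chararray, year:int,
               sport:chararray, gold:int, silver:int, bronze:int, total:int);

-- Keep only 2010 athletes, group by country, and total their medals.
athletes_2010 = FILTER athletes BY year == 2010;
athletes_by_country = GROUP athletes_2010 BY country;
medal_sum = FOREACH athletes_by_country
            GENERATE group AS country,
            SUM(athletes_2010.total) AS medal_count;

-- Take the single country with the highest medal count.
order_by_medals = ORDER medal_sum BY medal_count DESC;
order_medals_limit = LIMIT order_by_medals 1;

-- Clear any previous output, then store the result.
rmf s3n://my-output-bucket/output;
STORE order_medals_limit INTO 's3n://my-output-bucket/output' USING PigStorage('\t');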

Answer

USA: 97


Next Step

Check out our follow-up tutorial to see how easy it is to generate a profile of your data!