Note that you'll need a Mortar account to follow along. You can request a free trial of Mortar here.
If you're not already there, open up the Mortar Web IDE. You should see the "My First Project: Olympic Data" project. If not, click My Web Projects -> Open and choose it from the list.
This project starts us off with a very basic Pig script. The script loads the contents of the Olympics data file from an Amazon S3 bucket into Pig and stores them in an output bucket. This tutorial will guide you through the process of finding the country with the most medals at the 2010 Olympics. In the process, we’ll explore how to use Pig to analyze data of any size (so don’t worry if you’re new to Pig!).
OlympicAthletes.CSV sample data:
The first thing we need to do is get data into Pig. We do this by writing a LOAD statement. This statement tells Pig where to get the data and how to load it—for this job, we are loading a CSV file. We’ve created the LOAD statement using Mortar’s load statement generator. You can use this tool to quickly build a Pig load statement customized to your data.
The last thing we need to do in a Pig script is store our output data—the result of our data transformation or analysis. In this particular example, we are storing the final dataset in an Amazon S3 bucket called ‘output’. Note that before we run the ‘STORE’ command, we run ‘rmf’ to remove any existing output in that bucket. This is because Hadoop will refuse to overwrite existing data.
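As a rough sketch, a starter script that loads and stores the data might look like the following. The S3 paths, the column list and its order, and the use of the built-in PigStorage loader are assumptions for illustration; the LOAD statement generated in your project may use a different loader and schema. Only the athletes alias and the year, country, and total fields are taken from the steps later in this tutorial.

```pig
-- Hypothetical paths: replace them with the input file and output location from your project.
athletes = LOAD 's3n://my-input-bucket/OlympicAthletes.csv'
    USING PigStorage(',')
    AS (athlete:chararray, country:chararray, year:int, sport:chararray,
        gold:int, silver:int, bronze:int, total:int);  -- assumed column order

-- Remove any previous output first; Hadoop will not overwrite an existing location.
rmf s3n://my-output-bucket/output;

STORE athletes INTO 's3n://my-output-bucket/output' USING PigStorage(',');
```

PigStorage splits on every comma and does not interpret quoted fields, so a CSV with embedded commas or a header row may need a CSV-aware loader instead.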
Now that we understand how to load and store data, let’s find the country with the most medals at the 2010 Olympics!
As a rule of thumb, filter early and filter often. We will start by filtering our athletes to find those who competed in the 2010 Olympics. This is what a generic filter statement looks like:
filtered_data1 = FILTER my_data BY field == 'Field Value';
Write a filter statement to find the athletes who competed in the 2010 Olympics.
athletes_2010 = FILTER athletes BY year == 2010;
Our next step is to group athletes_2010 by the country they hail from. This is what a generic GROUP statement looks like:
data_grp = GROUP data BY field;
Group the athletes who competed in the 2010 Olympics by country.
athletes_by_country = GROUP athletes_2010 BY country;
Now that the athletes are grouped by their country, we need to aggregate the total number of medals won by country. The general strategy for summing data in Pig is:
new_data = FOREACH data_grp GENERATE group AS field2, SUM(data.field) AS field_sum;
Find the total number of medals per country.
medal_sum = FOREACH athletes_by_country GENERATE group AS country, SUM(athletes_2010.total) AS medal_count;
At this point, we have an alias called medal_sum, which is a collection of tuples, each holding a country and its total number of medals. To find the country with the most medals, all we need to do is order our data and output the first country. The general strategy for ordering and limiting looks like this:
ordered_data = ORDER summed_data BY field_sum DESC;
alias_lim = LIMIT ordered_data 1;
Find the country with the most medals using ORDER BY and LIMIT.
order_by_medals = ORDER medal_sum BY medal_count DESC;
order_medals_limit = LIMIT order_by_medals 1;
Now we’re almost ready to run our first Pig job on Hadoop. Before we do so, let’s go through a few pre-launch checks:
Make sure your store statement is storing order_medals_limit so you are outputting the correct data.
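Putting the steps together, the finished script might look roughly like this. The statements from FILTER through LIMIT are the ones written in this tutorial; the S3 paths, the loader, and the columns other than year, country, and total are placeholders assumed for illustration, so keep the LOAD statement and output location that your project already provides.

```pig
-- Hypothetical input path and assumed schema; use the generated LOAD statement from your project.
athletes = LOAD 's3n://my-input-bucket/OlympicAthletes.csv'
    USING PigStorage(',')
    AS (athlete:chararray, country:chararray, year:int, sport:chararray,
        gold:int, silver:int, bronze:int, total:int);

-- Keep only the athletes who competed in 2010.
athletes_2010 = FILTER athletes BY year == 2010;

-- Collect the 2010 athletes into one group per country.
athletes_by_country = GROUP athletes_2010 BY country;

-- Sum the medal totals within each country.
medal_sum = FOREACH athletes_by_country GENERATE
    group AS country,
    SUM(athletes_2010.total) AS medal_count;

-- Sort by medal count, highest first, and keep the top row.
order_by_medals = ORDER medal_sum BY medal_count DESC;
order_medals_limit = LIMIT order_by_medals 1;

-- Clear any previous output, then store the final alias (hypothetical output path).
rmf s3n://my-output-bucket/output;
STORE order_medals_limit INTO 's3n://my-output-bucket/output' USING PigStorage(',');
```

Note that the STORE statement now names order_medals_limit rather than athletes, which is the change the check above asks for.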
To conceptually check your code, you can use ‘Illustrate’ to see how your data flows through all the steps of your Pig script. This command takes a sample of data from the CSV file to preview how your code will work.
Now you are ready to run your code! Press the green Run button. Since we're working with a small amount of data, we can use "Local Mode (No Cluster)" to run the script without starting a Hadoop cluster. This is a quick way to run scripts that use smaller amounts of data; for bigger data you'll want to run with a real Hadoop cluster.
Check out our follow-up tutorial to see how easy it is to generate a profile of your data!