For the tutorial, we’re going to do some basic data analysis on this set of data on Olympic medals.
Download the Olympics data file.
In order to have a more interactive experience, we're going to use a REPL to explore the data using Pig code. To get started, make sure that you have a Mortar account (you can request a free trial here) and that you have installed the Mortar framework. Then, inside a Mortar Project run the command:
This starts up a Pig REPL that we’ll be using to learn the fundamental Pig commands.
Open up the data file and take a look. You should see some rows that look like this:
Hopefully this is a pretty easy schema to understand. Every row represents an athlete at a particular Olympic games, what country they represented that year, and what medals they won. Athletes who won no medals do not appear in this file.
The first thing we need to do is get that data into Pig. We do this by creating a LOAD statement. The syntax for LOAD looks like:
alias = LOAD 'file-name' USING Loader('argument0') AS (schema_field_0:datatype_0, schema_field_1:datatype_1);
The easiest way to write a LOAD statement is to visit our Load Statement Generator. This data is in CSV, so select that Data Format. Then choose an input path based on where you put the file. If you put the file in the same location you are running the REPL from, you can just say 'OlympicAthletes.csv'. Leave the delimiter as a comma. You'll also want to check the box that says 'Skip Header Line', because the first row of our data is headers, not a data row.
Next we need to add each field in our data (in order). It's probably easiest to use the column names given. So the first field would be 'athlete' and its type would be chararray. Country is also a chararray, year is an int, and so on. The generated LOAD statement appears at the bottom of the page.
Be warned that you need spaces on both sides of the equals sign in order for the syntax to parse (for all Pig commands). Also note that your generated LOAD statement will have line breaks so that it’s easier to read, but for this exercise we need to put the entire statement on one line.
Create a LOAD statement for our data and assign it to the alias ‘athletes’. Copy it into the REPL and hit return.
athletes = LOAD 'OlympicAthletes.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') AS (athlete:chararray, country:chararray, year:int, sport:chararray, gold:int, silver:int, bronze:int, total:int);
How do you know if that worked? We can use DUMP to display data on the screen, but we have to be careful because it will display all the rows of whatever alias we give it. In this case the data set is pretty small, so if you type
DUMP athletes;
you should pretty quickly see the data stream by. Take a look and make sure it looks the way you expected it to.
If you aren’t a fan of big chunks of data rolling across your screen, you might want to make a smaller data set before you run DUMP.
The LIMIT keyword restricts your data set down to the number of rows that you specify. It will take the first n rows of data, so if you’ve ordered it first that order will be preserved. (If you haven’t ordered it, what you get is effectively random).
Here’s an example:
alias_lim = LIMIT my_alias 30;
Make a new alias athletes_lim that contains only 10 rows of data, and run a DUMP command on it.
athletes_lim = LIMIT athletes 10; DUMP athletes_lim;
Now that your screen isn’t full of data you might also notice lines like “Starting job job_local_0002”—that’s Pig converting to MapReduce behind the scenes, and you can ignore them for now.
Let’s start by finding out something pretty basic: which country has won the most medals?
First, a little background on data structure. In Pig, a set of data entries is called a relation, and its name is called an alias (these words are often used interchangeably). Both 'athletes' and 'athletes_lim' are relations--they are effectively sets of data "rows", all of which describe the same type of thing. In SQL, this would be called a table.
A relation contains rows or entries, which in Pig are represented by tuples. A tuple is made up of fields, which sometimes might also be called columns (especially by people with a SQL background). In the relation 'athletes', the fields are 'athlete', 'country', 'year', etc.
In order to figure out which country has the most medals, we'll want to do a sum of the field 'total', while grouping by the field 'country'. To accomplish the grouping, we use the GROUP BY syntax.
data_grp_field = GROUP data BY col;
Group your data by country.
athletes_grp_country = GROUP athletes BY country;
It can be hard to visualize what the data looks like after performing a GROUP operation. The new alias will have only two fields: one that corresponds to what was being grouped on (country) and one field that contains all the data rows for any single group value.
The data structures that result are more akin to Python than to SQL. That second field is basically a list of tuples; each distinct group value has its own list of corresponding data. In Pig, this list is called a DataBag.
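To make that concrete, here is roughly what one row of athletes_grp_country looks like after the GROUP: the group value, then a DataBag of all the matching tuples (the athlete rows shown are illustrative, not actual REPL output):

```text
(USA, {(Michael Phelps, USA, 2008, Swimming, 8, 0, 0, 8), (Natalie Coughlin, USA, 2008, Swimming, 1, 2, 3, 6), ...})
```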
This ability to nest data structures makes Pig powerful, but can also be pretty confusing. One of the tools that can help is DESCRIBE.
This command will output the schema of the alias, which can aid in understanding what's happening to your data. Braces indicate a DataBag, and parentheses indicate tuples. (Notice that the entire relation is itself a DataBag.)
Run a describe on athletes_grp_country
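Running DESCRIBE athletes_grp_country; should print a schema along these lines (exact formatting can vary slightly between Pig versions):

```text
athletes_grp_country: {group: chararray, athletes: {(athlete: chararray, country: chararray, year: int, sport: chararray, gold: int, silver: int, bronze: int, total: int)}}
```

Note that the bag field is named 'athletes', after the relation that was grouped — that's why later aggregates are written as SUM(athletes.total).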
We’ve grouped our data but we still don’t have our answer. We need to use an aggregate function—that is to say, a function that will look at the data in a single field across all rows, and tell us something about it. In this case, the function called SUM will add up the number of medals.
In Pig, any time you want to add, remove, or change the data you have in an alias, you’ll use the FOREACH… GENERATE syntax.
You can use it to get rid of columns:
data = LOAD 'my-file.csv' USING PigStorage(',') AS (field1:int, field2:chararray, field3:long); new_data = FOREACH data GENERATE field1, field2;
You can use it to add or duplicate columns:
new_data = FOREACH data GENERATE field1, field2, field2 as field2_copy;
You can also use it to apply functions:
data_grp = GROUP data BY field2; new_data = FOREACH data_grp GENERATE group as field2, SUM(data.field1) as field_sum;
Calculate the number of medals won by each country in our data.
medal_sum = FOREACH athletes_grp_country GENERATE group AS country, SUM(athletes.total) as medal_count; DUMP medal_sum;
Those numbers seem a little low. What date range does our data set cover? Does it include the 2014 Olympics in Sochi? Does it go back to the 1800s?
In order to find that out we’re going to again need to use an aggregate function. This time we don't want to group over any field in the data, we want to consider all of it together. In Pig, to accomplish that we need to use GROUP ALL. This groups the entire data set into one “bin” so that you can do aggregate functions on it.
In this case, MAX and MIN are the functions we’ll want.
data_grp = GROUP my_data ALL; new_data = FOREACH data_grp GENERATE MIN(my_data.field1) as min_field1;
As an added bonus, we can actually combine those two statements using a nested FOREACH.
new_data = FOREACH (GROUP my_data ALL) GENERATE MIN(my_data.field1) as min_field1;
Find out what the min and max dates on our data are.
data_range = FOREACH (GROUP athletes ALL) GENERATE MIN(athletes.year) as min_year, MAX(athletes.year) as max_year; DUMP data_range;
What countries are represented here? We’d like to get a concise list with no duplicates of which countries appear in the data set. To do that we’ll combine our FOREACH...GENERATE syntax with DISTINCT.
Using the DISTINCT command on an alias removes duplicate rows. If we used it directly on our initial data we'd still see lots of duplicate countries, because the full rows differ: every country appears at least once per athlete. To get the distinct values of a single field, we first generate only that field.
distinct_rows = DISTINCT my_data; distinct_field = DISTINCT (FOREACH my_data GENERATE field);
Get a list of distinct countries in this data.
distinct_countries = DISTINCT (FOREACH athletes GENERATE country); DUMP distinct_countries;
You probably just got a lot of data streaming across the screen in a way that wasn’t abundantly useful. We already know we can get a smaller data set using LIMIT, but we don’t really want a random sampling of countries. We’re interested in finding out what the country with the most medals is. We can do this by using ORDER first, and then using LIMIT. The parameter 'DESC' indicates that we want the sort to be in descending order.
ordered_data = ORDER summed_data BY field_sum DESC;
Find the country with the most medals by using ORDER and then LIMIT.
ordered_medals = ORDER medal_sum BY medal_count DESC; ordered_medals_lim = LIMIT ordered_medals 1; DUMP ordered_medals_lim;
Ok, so it was the United States. You might happen to know that the U.S. does really well in swimming, which awards (approximately) a bazillion medals. So let's do that calculation again without including swimming medals. You can see in the data sample at the top of the page how an athlete's sport is denoted.
filtered_data1 = FILTER my_data BY field != 'Field Value'; filtered_data2 = FILTER my_data BY field > 12; filtered_data3 = FILTER my_data BY field1 == 0 AND field2 < 6;
Find the country with the most medals EXCLUDING swimming medals.
athletes_filter = FILTER athletes by sport != 'Swimming'; medal_sum = FOREACH (GROUP athletes_filter BY country) GENERATE group as country, SUM(athletes_filter.total) as medal_count; ordered_medals = ORDER medal_sum BY medal_count DESC; ordered_medals_lim = LIMIT ordered_medals 1; DUMP ordered_medals_lim;
Ok, so it’s still the United States, but the number got a lot smaller.
We’ve been doing a lot of aggregated statistics and ignoring the athletes, so let’s drop back to a more granular level. Let’s find all the athletes who won the same number of medals in two consecutive Olympics.
Much like SQL, Pig has a JOIN command that puts together two data sets based on common fields between them. Unlike SQL, if you want to JOIN a relation to itself you have to copy it first.
my_data_copy = FOREACH my_data GENERATE *; joined_data = JOIN my_data BY field1, my_data_copy BY field1;
You are now in a slightly weird state where joined_data has two field1s. To distinguish them, you use the syntax original_alias::field_name. For example:
readable_data = FOREACH joined_data GENERATE my_data::field1 as orig_field1, my_data_copy::field1 as copy_field1;
You can also rename the fields when you do the copy. (Note: unlike SQL, you can only JOIN using exact match. You can’t join on field1 < field2, for example.)
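For instance, renaming the fields while making the copy means you never need the :: disambiguation later — a sketch using the generic aliases from above:

```text
my_data_copy = FOREACH my_data GENERATE field1 AS field1_copy, field2 AS field2_copy;
joined_data = JOIN my_data BY field1, my_data_copy BY field1_copy;
```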
Find the athletes who have gotten the same number of medals in consecutive Olympics. (Hint: do a JOIN, then a FILTER. Note that consecutive Olympics occur four years apart.)
athletes_copy = FOREACH athletes GENERATE athlete, year as year2, total as medals2; athletes_join = JOIN athletes BY athlete, athletes_copy BY athlete; athletes_join_filtered = FILTER athletes_join BY total == medals2 AND year2 == year+4; athletes_output = FOREACH athletes_join_filtered GENERATE athletes::athlete as athlete, total, year, year2; DUMP athletes_output;
We’ve been looking at summer and winter Olympics together, but they really aren’t the same thing at all. They have completely different events and medals, and are dominated by different countries. Let’s split up our data into a winter_olympics data set and a summer_olympics data set.
We could do this by applying FILTER twice, and that would work just fine. However, Pig includes a command exactly for this purpose called SPLIT.
SPLIT my_data INTO data_set1 IF field == "Value1", data_set2 IF field == "Value2", data_set3 IF field2 < 5;
Note that after a SPLIT, a row may appear in more than one output alias (if it matches more than one condition), and a row that matches no condition won't appear in any alias at all.
Create a summer_olympics and winter_olympics data set. (Hint: you can use the ‘%’ sign for modular arithmetic in Pig, just like in other languages, and the summer Olympics happen in years divisible by 4.)
SPLIT athletes INTO summer_olympics IF year % 4 == 0, winter_olympics IF year % 4 != 0;
If you did the bonus question, you found out that the United States and Canada are actually tied on medal count for this time span. Looking at total medals is a pretty rough calculation of success--we might be able to break the tie by some sort of weighting by the kind of medal that was won. We could do that using Pig commands, but given a choice, it might be more pleasant to do that kind of computation in Python.
Save the following function in a file called olympic_udfs.py (the filename matches the REGISTER statement below):

from pig_util import outputSchema

@outputSchema('score:int')
def calculate_score(gold, silver, bronze):
    return 3 * gold + 2 * silver + bronze
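Since the scoring is plain arithmetic, you can sanity-check the function with ordinary Python outside of Pig (dropping the Pig-specific decorator, which only matters inside a Pig script):

```python
# Same weighting as the UDF: gold = 3 points, silver = 2, bronze = 1.
def calculate_score(gold, silver, bronze):
    return 3 * gold + 2 * silver + bronze

# e.g. 2 golds and 1 silver
print(calculate_score(2, 1, 0))  # 8
```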
In order to use the Python function we need to first register it:
REGISTER 'olympic_udfs.py' USING streaming_python AS udf;
Then in order to use it, we just reference the function as udf.calculate_score. So to calculate the score for an individual athlete we would do this:
athlete_score = FOREACH athletes GENERATE athlete, udf.calculate_score(gold, silver, bronze) as score;
Find out which country scores the highest in the winter Olympics by our new metric.
winter_medal_sum = FOREACH (GROUP winter_olympics BY country) GENERATE group AS country, SUM(winter_olympics.gold) as gold_sum, SUM(winter_olympics.silver) as silver_sum, SUM(winter_olympics.bronze) as bronze_sum; country_scores = FOREACH winter_medal_sum GENERATE country, udf.calculate_score(gold_sum, silver_sum, bronze_sum) as score; ordered_winter_medals = ORDER country_scores BY score DESC; ordered_winter_medals_lim = LIMIT ordered_winter_medals 1; DUMP ordered_winter_medals_lim;
For additional useful Pig commands and ongoing reference, check out the Pig Cheat Sheet.