
Learn Pig

    This hands-on tutorial will walk you through learning Pig step-by-step.

    For the tutorial, we’re going to do some basic data analysis on a data set of Olympic medals.

    Download the Olympics data file.

    In order to have a more interactive experience, we're going to use a REPL to explore the data using Pig code. To get started, make sure that you have a Mortar account (you can request a free trial here) and that you have installed the Mortar framework. Then, inside a Mortar Project run the command:

    mortar local:repl
    

    This starts up a Pig REPL that we’ll be using to learn the fundamental Pig commands.

    Just a few quick notes on the REPL:

    • Aliases are only defined for the life of the REPL. If you quit and restart you’ll have to re-create the aliases you want.
    • You can overwrite an existing alias with a new definition.

    LOAD

    Open up the data file and take a look. You should see some rows that look like this:

    [sample rows from OlympicAthletes.csv; the columns are athlete, country, year, sport, gold, silver, bronze, total]

    Hopefully this is a pretty easy schema to understand. Every row represents an athlete at a particular Olympic games, what country they represented that year, and what medals they won. Athletes who won no medals do not appear in this file.

    The first thing we need to do is get that data into Pig. We do this by creating a LOAD statement. The syntax for LOAD looks like:

    alias = LOAD 'file-name' 
        USING Loader('argument0') 
        AS (schema_field_0:datatype_0, 
            schema_field_1:datatype_1);
    

    The easiest way to write a LOAD statement is to visit our Load Statement Generator. This data is in CSV, so select that Data Format. Then choose an input path based on where you put the file. If you put the file in the same location you are running the REPL from, you can just say 'OlympicAthletes.csv'. Leave the delimiter as a comma. You'll also want to check the box that says 'Skip Header Line', because the first row of our data is headers, not a data row.

    Next we need to add each field in our data (in order). It's probably easiest to use the column names given. So the first field would be 'athlete' and its type would be chararray. Country is also a chararray, year is an int, and so on. The generated LOAD statement appears at the bottom of the page.

    Be warned that you need spaces on both sides of the equals sign in order for the syntax to parse (for all Pig commands). Also note that your generated LOAD statement will have line breaks so that it’s easier to read, but for this exercise we need to put the entire statement on one line.

    Create a LOAD statement for our data and assign it to the alias ‘athletes’. Copy it into the REPL and hit return.

    Solution

    athletes = LOAD 'OlympicAthletes.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') AS (athlete:chararray, country:chararray, year:int, sport:chararray, gold:int, silver:int, bronze:int, total:int);
    

    DUMP

    How do you know if that worked? We can use DUMP to display data on the screen, but we have to be careful because it will display all the rows of whatever we ask it to. In this case the data set is pretty small, so if you type

    DUMP athletes;
    

    you should pretty quickly see the data stream by. Take a look and make sure it looks the way you expected it to.


    LIMIT

    If you aren’t a fan of big chunks of data rolling across your screen, you might want to make a smaller data set before you run DUMP.

    The LIMIT keyword restricts your data set down to the number of rows that you specify. It will take the first n rows of data, so if you’ve ordered it first that order will be preserved. (If you haven’t ordered it, what you get is effectively random).

    Here’s an example:

    alias_lim = LIMIT my_alias 30;
    

    Make a new alias athletes_lim that contains only 10 rows of data, and run a DUMP command on it.

    Solution

    athletes_lim = LIMIT athletes 10;
    DUMP athletes_lim;
    

    Now that your screen isn’t full of data you might also notice lines like “Starting job job_local_0002”—that’s Pig converting to MapReduce behind the scenes, and you can ignore them for now.


    GROUP BY

    Let’s start by finding out something pretty basic: which country has won the most medals?

    First, a little background on data structure. In Pig, a set of data entries is called a relation, and its name is called an alias (the two words are often used interchangeably). Both 'athletes' and 'athletes_lim' are relations: they are effectively sets of data "rows", all of which describe the same type of thing. In SQL, this would be called a table.

    A relation contains rows or entries, which in Pig are represented by tuples. A tuple is made up of fields, which sometimes might also be called columns (especially by people with a SQL background). In the relation 'athletes', the fields are 'athlete', 'country', 'year', etc.
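    For instance, a single tuple in 'athletes' might look like this (the values here are made up purely for illustration, following the schema above):

```text
(Some Athlete,United States,2008,Swimming,1,0,1,2)
```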

    In order to figure out which country has the most medals, we'll want to do a sum of the field 'total', while grouping by the field 'country'. To accomplish the grouping, we use the GROUP BY syntax.

    data_grp_field = GROUP data BY col;
    

    Group your data by country.

    Solution

    athletes_grp_country = GROUP athletes BY country;
    

    DESCRIBE

    It can be hard to visualize what the data looks like after performing a GROUP operation. The new alias will have only two fields: one that corresponds to what was being grouped on (country) and one field that contains all the data rows for any single group value.

    The data structures that result are more akin to Python than to SQL. That second field is basically a list of tuples; each distinct group value has its own list of corresponding data. In Pig, this list is called a DataBag.

    This ability to nest data structures makes Pig powerful, but can also be pretty confusing. One of the tools that can help is DESCRIBE.

    DESCRIBE alias;
    

    This command will output the schema of the alias, which can aid in understanding what's happening to your data. Braces indicate a DataBag, and parentheses indicate tuples. (Notice that the entire relation is itself a DataBag.)

    Run a describe on athletes_grp_country

    Solution

    DESCRIBE athletes_grp_country;
    
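    If your LOAD statement matched the solution earlier, the output should look something like this: a 'group' field holding the country, and a bag named 'athletes' holding the full tuples for that country.

```text
athletes_grp_country: {group: chararray, athletes: {(athlete: chararray, country: chararray, year: int, sport: chararray, gold: int, silver: int, bronze: int, total: int)}}
```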

    FOREACH … GENERATE

    We’ve grouped our data but we still don’t have our answer. We need to use an aggregate function—that is to say, a function that will look at the data in a single field across all rows, and tell us something about it. In this case, the function called SUM will add up the number of medals.

    In Pig, any time you want to add, remove, or change the data you have in an alias, you’ll use the FOREACH… GENERATE syntax.

    You can use it to get rid of columns:

    data = LOAD 'my-file.csv' USING PigStorage(',') AS (field1:int, field2:chararray, field3:long);
    new_data = FOREACH data GENERATE field1, field2;
    

    You can use it to add or duplicate columns:

    new_data = FOREACH data GENERATE field1, field2, field2 as field2_copy;
    

    You can also use it to apply functions:

    data_grp  = GROUP data BY field2;
    new_data = FOREACH data_grp GENERATE group as field2, SUM(data.field3) as field_sum;
    

    Calculate the number of medals won by each country in our data.

    Solution

    medal_sum = FOREACH athletes_grp_country GENERATE group AS country, SUM(athletes.total) as medal_count;
    DUMP medal_sum;
    

    GROUP ALL

    Those numbers seem a little low. What date range does our data set cover? Does it include the 2014 Olympics in Sochi? Does it go back to the 1800s?

    In order to find that out we’re going to again need to use an aggregate function. This time we don't want to group over any field in the data, we want to consider all of it together. In Pig, to accomplish that we need to use GROUP ALL. This groups the entire data set into one “bin” so that you can do aggregate functions on it.

    In this case, MAX and MIN are the functions we’ll want.

    data_grp = GROUP my_data ALL;
    new_data = FOREACH data_grp GENERATE MIN(my_data.field1) as min_field1;
    

    As an added bonus, we can actually combine those two statements using a nested FOREACH.

    new_data = FOREACH (GROUP my_data ALL) GENERATE MIN(my_data.field1) as min_field1;
    

    Find out what the min and max dates on our data are.

    Solution

    data_range = FOREACH (GROUP athletes ALL) GENERATE MIN(athletes.year) as min_year, MAX(athletes.year) as max_year;
    DUMP data_range;
    

    DISTINCT

    What countries are represented here? We’d like to get a concise list with no duplicates of which countries appear in the data set. To do that we’ll combine our FOREACH...GENERATE syntax with DISTINCT.

    Using the DISTINCT command on an alias removes duplicate rows. Applying it directly to our initial data wouldn't help: the rows differ in fields like athlete and year, so every country would still appear many times. To get a distinct field we need to first generate only that field.

    distinct_rows = DISTINCT my_data;
    distinct_field = DISTINCT (FOREACH my_data GENERATE field);
    

    Get a list of distinct countries in this data.

    Solution

    distinct_countries = DISTINCT (FOREACH athletes GENERATE country);
    DUMP distinct_countries;
    

    ORDER

    You probably just got a lot of data streaming across the screen in a way that wasn’t abundantly useful. We already know we can get a smaller data set using LIMIT, but we don’t really want a random sampling of countries. We’re interested in finding out what the country with the most medals is. We can do this by using ORDER first, and then using LIMIT. The parameter 'DESC' indicates that we want the sort to be in descending order.

    ordered_data = ORDER summed_data BY field_sum DESC;
    

    Find the country with the most medals by using ORDER and then LIMIT.

    Solution

    ordered_medals = ORDER medal_sum BY medal_count DESC;
    ordered_medals_lim = LIMIT ordered_medals 1;
    DUMP ordered_medals_lim;
    

    Bonus questions:

    • What were the top five countries in medal count?
    • What would happen if we changed DESC to ASC?
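    If you want to check your answer to the first bonus question, one approach is simply a larger LIMIT on the same ordered alias from the solution above:

```pig
-- take the first five rows of the descending-ordered medal counts
top_five = LIMIT ordered_medals 5;
DUMP top_five;
```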

    FILTER

    Ok, so it was the United States. You might happen to know that the U.S. does really well in swimming, which awards (approximately) a bazillion medals. So let's do that calculation again without including swimming medals. You can see in the data sample at the top of the page how an athlete's sport is denoted.

    filtered_data1 = FILTER my_data BY field != 'Field Value';
    filtered_data2 = FILTER my_data BY field > 12;
    filtered_data3 = FILTER my_data BY field1 == 0 AND field2 < 6;
    

    Find the country with the most medals EXCLUDING swimming medals.

    Solution

    athletes_filter = FILTER athletes by sport != 'Swimming';
    medal_sum = FOREACH (GROUP athletes_filter BY country) GENERATE group as country, SUM(athletes_filter.total) as medal_count;
    ordered_medals = ORDER medal_sum BY medal_count DESC;
    ordered_medals_lim = LIMIT ordered_medals 1;
    DUMP ordered_medals_lim;
    

    Ok, so it’s still the United States, but the number got a lot smaller.

    Bonus question:

    • How many countries won only 1 medal?
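    One possible sketch for the bonus question uses the built-in COUNT function (which we haven't needed until now) on the medal_sum alias from earlier; the answer will differ depending on whether you use the filtered or unfiltered version of medal_sum:

```pig
-- keep only countries whose medal_count is exactly 1,
-- then count how many such countries there are
single_medal = FILTER medal_sum BY medal_count == 1;
single_count = FOREACH (GROUP single_medal ALL) GENERATE COUNT(single_medal) AS num_countries;
DUMP single_count;
```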

    JOIN

    We’ve been doing a lot of aggregated statistics and ignoring the athletes, so let’s drop back to a more granular level. Let’s find all the athletes who won the same number of medals in two consecutive Olympics.

    Much like SQL, Pig has a JOIN command that puts together two data sets based on common fields between them. Unlike SQL, if you want to JOIN a relation to itself you have to copy it first.

    my_data_copy = FOREACH my_data GENERATE *;
    joined_data = JOIN my_data BY field1, my_data_copy BY field1;
    

    You are now in a slightly weird state where joined_data has two fields named field1. To distinguish them, you need to use this syntax:

    my_data::field1
    

    So for example:

    readable_data = FOREACH joined_data GENERATE
                        my_data::field1 as orig_field1, my_data_copy::field1 as copy_field1;
    

    You can also rename the fields when you do the copy. (Note: unlike SQL, you can only JOIN using exact match. You can’t join on field1 < field2, for example.)
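    For example, renaming the fields during the copy avoids the :: disambiguation entirely. This is just a sketch with made-up field names:

```pig
-- rename during the copy so the joined fields don't collide
my_data_copy = FOREACH my_data GENERATE field1 AS field1_copy, field2 AS field2_copy;
joined_data = JOIN my_data BY field1, my_data_copy BY field1_copy;
```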

    Find the athletes who have gotten the same number of medals in consecutive Olympics. (Hint: do a JOIN, then a FILTER. Note that consecutive Olympics occur four years apart.)

    Solution

    athletes_copy = FOREACH athletes GENERATE athlete, year as year2, total as medals2;
    athletes_join = JOIN athletes BY athlete, athletes_copy BY athlete;
    athletes_join_filtered = FILTER athletes_join BY total == medals2 AND year2 == year+4;
    athletes_output = FOREACH athletes_join_filtered GENERATE athletes::athlete as athlete, total, year, year2;
    DUMP athletes_output;
    

    Bonus question:

    • Well, turns out there were a lot of them. How about the ones who won more than 3 medals twice?

    SPLIT

    We’ve been looking at summer and winter Olympics together, but they really aren’t the same thing at all. They have completely different events and medals, and are dominated by different countries. Let’s split up our data into a winter_olympics data set and a summer_olympics data set.

    We could do this by applying FILTER twice, and that would work just fine. However, Pig includes a command exactly for this purpose called SPLIT.

    SPLIT my_data INTO
                data_set1 IF field == 'Value1',
                data_set2 IF field == 'Value2',
                data_set3 IF field2 < 5;
    

    Note that a row can end up in more than one of the resulting aliases (if it matches multiple conditions), or in none of them (if it matches no condition).
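    To make that concrete, here is a hypothetical SPLIT with overlapping conditions:

```pig
SPLIT my_data INTO
    low IF field2 < 10,
    high IF field2 > 5;
-- a row with field2 == 7 satisfies both conditions and lands in both aliases;
-- if the conditions were instead field2 < 5 and field2 > 10, a row with
-- field2 == 7 would match neither and appear in no output alias
```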

    Create a summer_olympics and winter_olympics data set. (Hint: you can use the ‘%’ sign for modular arithmetic in Pig, just like in other languages, and the summer Olympics happen in years divisible by 4.)

    Solution

    SPLIT athletes INTO
                summer_olympics IF year % 4 == 0,
                winter_olympics IF year % 4 != 0;
    

    Bonus question:

    • What country has won the most winter Olympic medals? (We promise it’s interesting.)

    UDFs

    If you did the bonus question, you found out that the United States and Canada are actually tied on medal count for this time span. Looking at total medals is a pretty rough calculation of success--we might be able to break the tie by some sort of weighting by the kind of medal that was won. We could do that using Pig commands, but given a choice, it might be more pleasant to do that kind of computation in Python.

    Fortunately for that arbitrary preference, Pig allows the use of User-Defined Functions in other languages. Those languages include Java, Ruby, Python, and JavaScript, but for this example let’s download this Python file and put it in the same directory we are running our REPL in. It contains a very simple Python script that returns a score based on the types of medals won.

    from pig_util import outputSchema
    
    @outputSchema('score:int')
    def calculate_score(gold, silver, bronze):
        return 3 * gold + 2 * silver + bronze
    
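    Since the scoring logic is plain Python, you can sanity-check it outside of Pig. (The function is redefined here without the pig_util decorator, which only matters when Pig calls it.)

```python
def calculate_score(gold, silver, bronze):
    # gold is worth 3 points, silver 2, bronze 1
    return 3 * gold + 2 * silver + bronze

# two golds and a bronze (7 points) outscore three silvers (6 points)
print(calculate_score(2, 0, 1))
print(calculate_score(0, 3, 0))
```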

    In order to use the Python function we need to first register it:

    REGISTER 'olympic_udfs.py' USING streaming_python AS udf;
    

    Then in order to use it, we just reference the function as udf.calculate_score. So to calculate the score for an individual athlete we would do this:

    athlete_score = FOREACH athletes GENERATE athlete, udf.calculate_score(gold, silver, bronze) as score;
    

    Find out which country scores the highest in the winter Olympics by our new metric.

    Solution

    winter_medal_sum = FOREACH (GROUP winter_olympics BY country) GENERATE group AS country, SUM(winter_olympics.gold) as gold_sum, SUM(winter_olympics.silver) as silver_sum, SUM(winter_olympics.bronze) as bronze_sum;
    country_scores = FOREACH winter_medal_sum GENERATE country, udf.calculate_score(gold_sum, silver_sum, bronze_sum) as score;
    ordered_winter_medals = ORDER country_scores BY score DESC;
    ordered_winter_medals_lim = LIMIT ordered_winter_medals 1;
    DUMP ordered_winter_medals_lim;
    

    For additional useful Pig commands and ongoing reference, check out the Pig Cheat Sheet.