Run an Example Recommender

Now that your Mortar project is set up, you will run an example recommendation engine implementation. The goal of the example is to familiarize you with the steps required to implement a recommender.

For this example you will use a small set of data from a fictional online movie store. This store lets users buy movies and keep a wishlist of movies that they would like to purchase in the future. From this data you will generate some recommendations the store could use to help improve their site.

Setting up your Development Environment

Before getting into the details of our project, you should take a minute to set up your development environment. The first thing to do is pick an editor you want to use for viewing and working with code. If you don't already have a favorite editor you can review some options here. The second thing to do is to open a terminal and change to the root directory of your Mortar project. This terminal will be used to run various commands while working with your project.

Open pigscripts/retail-recsys.pig in your code editor. This is the top-level Pig script that runs the retail recommendation engine. We'll cover how to write your own Pig code in a later tutorial. For now, we'll work through the sections of this script and explain what each one does.


Setting Parameters

import 'recommenders.pig';

%default INPUT_PATH_PURCHASES '../data/retail/purchases.json'
%default INPUT_PATH_WISHLIST '../data/retail/wishlists.json'
%default OUTPUT_PATH '../data/retail/out'

Here we're importing the Pig file that contains the core Mortar recommendation engine code and setting up some parameters that indicate where we'll find our input data and where we'll write our output data. For this example we're using a small data set included with your project and we will write our results into a local directory as well.


Loading the Data

In any Pig script the first step is always loading data into Pig. Before we explain that code, take a look at the raw data that we're going to use for this tutorial. In your terminal, run:

head data/retail/purchases.json
head data/retail/wishlists.json

You should see some output like:

{"movie_id": "cffef2de02604b9b86ef36f81a91e583", "row_id": 0, "user_id": "c93e6253d45b42e6b8758c6078a20fdf", "purchase_price": 15, "movie_name": "the graduate"}
{"movie_id": "8206dc792d494b1e85ecc63b40de40f4", "row_id": 1, "user_id": "c93e6253d45b42e6b8758c6078a20fdf", "purchase_price": 13, "movie_name": "as good as it gets"}
{"movie_id": "d7f8b5a8e43f42d6a7d83a49ec6e9f54", "row_id": 2, "user_id": "c93e6253d45b42e6b8758c6078a20fdf", "purchase_price": 13, "movie_name": "wuthering heights"}
{"movie_id": "508f8d00ee9d4a64974e279f3084cf0b", "row_id": 3, "user_id": "b9126de3abce4d098426b16bcbdc1b85", "purchase_price": 15, "movie_name": "alice in wonderland"}
{"movie_id": "968167f5b26046e98e6f06d75cc7f587", "row_id": 4, "user_id": "b9126de3abce4d098426b16bcbdc1b85", "purchase_price": 8, "movie_name": "mary poppins"}
{"movie_id": "661d309280df471daa89e6c799a97d7d", "row_id": 5, "user_id": "b9126de3abce4d098426b16bcbdc1b85", "purchase_price": 15, "movie_name": "the mask"}
{"movie_id": "556c081324184449bcc42053a253eea2", "row_id": 6, "user_id": "8a58b55cb4a844bf8adb22372d3a3680", "purchase_price": 14, "movie_name": "the deep"}
{"movie_id": "d5c9e77ebf8f488084393611e8fded2f", "row_id": 7, "user_id": "8a58b55cb4a844bf8adb22372d3a3680", "purchase_price": 19, "movie_name": "lawrence of arabia"}
{"movie_id": "c7d5159445ec428ca31c5fc1a392d3dc", "row_id": 8, "user_id": "8a58b55cb4a844bf8adb22372d3a3680", "purchase_price": 10, "movie_name": "little big man"}
{"movie_id": "cd2089b1cc1d40a8a2e4b822c26d26cf", "row_id": 9, "user_id": "aa77411821c6405aaf17f3a2e381a413", "purchase_price": 17, "movie_name": "i know what you did last summer"}

As you can see, these files contain multiple JSON documents, one per line, each representing a single user action: purchasing a movie or adding it to a wishlist. Take a minute to familiarize yourself with the data stored in these files.
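To make the one-document-per-line layout concrete, here is a minimal Python sketch (not part of the Mortar project) that parses two of the purchase records shown above. The helper name `parse_json_lines` is hypothetical and exists only for this illustration.

```python
import json

# Two records copied from the sample purchase data shown above.
SAMPLE = """\
{"movie_id": "cffef2de02604b9b86ef36f81a91e583", "row_id": 0, "user_id": "c93e6253d45b42e6b8758c6078a20fdf", "purchase_price": 15, "movie_name": "the graduate"}
{"movie_id": "8206dc792d494b1e85ecc63b40de40f4", "row_id": 1, "user_id": "c93e6253d45b42e6b8758c6078a20fdf", "purchase_price": 13, "movie_name": "as good as it gets"}
"""


def parse_json_lines(text):
    """Parse text that holds one JSON document per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(SAMPLE)
for record in records:
    # Each record is one user action: a single movie purchase.
    print(record["user_id"], record["movie_name"], record["purchase_price"])
```

Because every line is an independent document, a file in this layout can be split at line boundaries and processed in parallel, which is part of why it suits tools such as Pig.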

Now that you understand the data, take a look at how we load that data into Pig.

/******* Load Data **********/

--Get purchase signals
purchase_input = LOAD '$INPUT_PATH_PURCHASES' USING org.apache.pig.piggybank.storage.JsonLoader(
                    'row_id: int,
                     movie_id: chararray,
                     movie_name: chararray,
                     user_id: chararray,
                     purchase_price: int');

--Get wishlist signals
wishlist_input =  LOAD '$INPUT_PATH_WISHLIST' USING org.apache.pig.piggybank.storage.JsonLoader(
                     'row_id: int,
                      movie_id: chararray,
                      movie_name: chararray,
                      user_id: chararray');

Here you can see how to use the Pig JsonLoader to load the two input files.

Pig has a handy feature called "illustrate" that will help you visualize how your script works. Try it out by illustrating the purchase_input load statement. In your terminal, run:

mortar local:illustrate pigscripts/retail-recsys.pig purchase_input -f params/retail.params

This command runs illustrate on the retail-recsys Pig script, focusing on the purchase_input alias defined by the first load statement.

The Mortar recommendation engine contains a number of parameters that can be used to tune the recommendation logic. These parameters will be covered in a later tutorial. For now, just note that all commands using the recommender require a path to a parameter file.

The first time you run a mortar local command, it will take a minute or two to set up your environment. On the first time only, Mortar downloads all of the dependencies you need to run a Pig job into a local sandbox for your project. This lets you run everything on your own machine quickly and without having to launch a Hadoop cluster.

After the command finishes you should see output similar to:

[Image: illustrate results for purchase_input]

Here you can see what the data looks like and that it's loading correctly. We will use illustrate throughout the tutorials to help visualize how data is being loaded and processed.


Creating Input Signals

The Mortar recommendation engine generates recommendations based on user interactions with items. These interactions are called signals; they tell us how users and items are related and what recommendations to make. In our retail example we use two distinct signals: purchasing a movie and adding a movie to a wishlist.

The Mortar recommendation engine requires input data in the format: user, item, weight. To get recommendations, we'll need to transform our input data to that format. Weight is an arbitrary value that we assign to an interaction to indicate how strongly it ties a user to an item.

/******* Convert Data to Signals **********/

-- Start with choosing 1 as max weight for a signal.
purchase_signals = FOREACH purchase_input GENERATE
                        user_id    as user,
                        movie_name as item,
                        1.0        as weight;


-- Start with choosing 0.5 as weight for wishlist items because that is a weaker signal than
-- purchasing an item.
wishlist_signals = FOREACH wishlist_input GENERATE
                        user_id    as user,
                        movie_name as item,
                        0.5        as weight;

user_signals = UNION purchase_signals, wishlist_signals;

Here we chose to treat purchasing a movie as twice as strong an interaction as wishlisting it. After all, actually spending your money on something is a stronger sign you want it than just expressing interest.
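The same transformation can be sketched outside Pig. The Python below is only an illustration of the user, item, weight format: the input records are made up (they reuse the tutorial's field names but are not taken from the data set), and the helper `to_signals` is a hypothetical name.

```python
# Weights mirroring the Pig code above: a purchase counts twice as much as a wishlist entry.
PURCHASE_WEIGHT = 1.0
WISHLIST_WEIGHT = 0.5


def to_signals(records, weight):
    """Map raw records to (user, item, weight) tuples."""
    return [(r["user_id"], r["movie_name"], weight) for r in records]


# Made-up inputs for illustration only.
purchases = [{"user_id": "u1", "movie_name": "the graduate"}]
wishlists = [
    {"user_id": "u1", "movie_name": "the mask"},
    {"user_id": "u2", "movie_name": "the graduate"},
]

# Like Pig's UNION, this concatenates both collections without removing duplicates.
user_signals = to_signals(purchases, PURCHASE_WEIGHT) + to_signals(wishlists, WISHLIST_WEIGHT)

for signal in user_signals:
    print(signal)
```

Keeping the weights in named constants makes it easy to experiment with a different ratio between the two signals later.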

Use illustrate again to get a better sense of what’s happening with the data.

mortar local:illustrate pigscripts/retail-recsys.pig user_signals -f params/retail.params

[Image: illustrate results for user_signals]


Generating Recommendations

The next section of code takes the input signals we generated and passes them to the Mortar recommendation engine to create our recommendations.

/******* Use Mortar recommendation engine to convert signals to recommendations **********/

item_item_recs = recsys__GetItemItemRecommendations(user_signals);
user_item_recs = recsys__GetUserItemRecommendations(user_signals, item_item_recs);

Here we're generating two different types of recommendations. Item-to-item recommendations give you a list of items to recommend for another item. In our example, this could be movies to recommend for purchase based on the current movie being viewed. User-to-item recommendations give you a list of items to recommend for a specific user. In our example, this could be movies to show a user on their personalized home page.
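This tutorial does not describe how the two recsys macros work internally, so the following Python is not Mortar's algorithm. It is a deliberately naive co-occurrence sketch, included only to show how the two output shapes differ: one ranked list per item versus one ranked list per user. The scoring rule, the function names, and the sample signals are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def strongest_signals(signals):
    """Group signals by user, keeping the strongest weight per (user, item)."""
    by_user = defaultdict(dict)
    for user, item, weight in signals:
        by_user[user][item] = max(weight, by_user[user].get(item, 0.0))
    return by_user


def item_item_recs(signals, top_n=5):
    """Naive scoring: sum min(weight_a, weight_b) over users who touched both items."""
    scores = defaultdict(lambda: defaultdict(float))
    for items in strongest_signals(signals).values():
        for a, b in permutations(items, 2):
            scores[a][b] += min(items[a], items[b])
    # Sort by descending score, breaking ties alphabetically for stable output.
    return {
        item: sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for item, related in scores.items()
    }


def user_item_recs(signals, item_recs, top_n=5):
    """Recommend items related to a user's items, skipping items the user already has."""
    result = {}
    for user, items in strongest_signals(signals).items():
        totals = defaultdict(float)
        for item, weight in items.items():
            for other, score in item_recs.get(item, []):
                if other not in items:
                    totals[other] += weight * score
        result[user] = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return result


# Made-up signals: 1.0 stands for a purchase, 0.5 for a wishlist entry.
signals = [
    ("u1", "alien", 1.0), ("u1", "psycho", 1.0),
    ("u2", "alien", 1.0), ("u2", "psycho", 0.5), ("u2", "scream", 1.0),
    ("u3", "alien", 0.5),
]

ii = item_item_recs(signals)
ui = user_item_recs(signals, ii)
print(ii["alien"])  # movies to show next to "alien"
print(ui["u3"])     # movies to show on u3's home page
```

Note that the user-level function takes the item-level results as input, mirroring how user_item_recs in the Pig script is computed from both user_signals and item_item_recs.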


Storing Recommendations

/******* Store recommendations **********/

--  If your output folder exists already, hadoop will refuse to write data to it.
rmf $OUTPUT_PATH/item_item_recs;
rmf $OUTPUT_PATH/user_item_recs;

store item_item_recs into '$OUTPUT_PATH/item_item_recs' using PigStorage();
store user_item_recs into '$OUTPUT_PATH/user_item_recs' using PigStorage();

Once we’ve generated our recommendations we need to store them somewhere. These statements tell Pig to store our recommendations into the paths that we defined above.


Running the Recommender

Because this example data set is quite small you can actually run the entire script locally on your own computer.

mortar local:run pigscripts/retail-recsys.pig -f params/retail.params

Once the run starts, you will see a lot of Pig output. This output shows how the Pig script above is compiled and run as a series of Hadoop Map/Reduce jobs. Don't worry if you don't understand the logs or how Hadoop works; we'll cover this in a later tutorial.

Once your job finishes (it may take a couple of minutes) it’s time to take a look at the results.


Evaluating the Recommendations

Determining how good your recommendations are is a tough task that requires a good familiarity with your data and your business. Ultimately the best way to do it is to run an A/B test and see how your recommendations are improving the metrics that you care about.

But early in the development process it is helpful to simply look at your results and see whether they make sense. To do that, open the item-item recommendations we just generated. You can find the output in data/retail/out/item_item_recs/part-r-00000. The output file name follows Hadoop's naming convention: because Hadoop is a distributed processing framework, the final output may be broken up into parts. Here's a portion of the item-item recommendations:

antz    the lion king                                 3.9243095   8.266502    1
antz    honey i shrunk the kids                       3.7120519   7.348985    2
antz    a connecticut yankee in king arthur's court   3.2631755   1.3863515   3
antz    heaven can wait                               3.1456258   4.5953684   4
antz    who killed roger rabbit?                      3.1211402   3.5740905   5
alien   psycho                                        3.2222044   4.5953684   1
alien   i know what you did last summer               2.986917    4.8948455   2
alien   nightmare on elm street                       2.8968637   5.500517    3
alien   tremors                                       2.873042    6.580674    4
alien   scream                                        2.7485006   5.65644     5
fargo   the godfather                                 6.8140154   2.6100628   1
fargo   butch cassidy and the sundance kid            6.49299     1.9853055   2
fargo   48 hours                                      5.010393    1.9853055   3
fargo   the untouchables                              3.1157842               4
fargo   m.                                            2.971754    1.3863515   5
speed   the terminator                                6.70754     2.4474227   1
speed   star wars                                     6.345229    2.9095397   2
speed   jurassic park                                 6.2409935   1.6858286   3
speed   top gun                                       5.823684    4.8948455   4
speed   the untouchables                              4.5288153   2.9095397   5

Each row of the item-item output has five columns. The first column is the movie being recommended for, the second is the recommended movie, the next two columns we'll talk about in a later tutorial, and the final column is the rank of the recommendation. This output shows that the movie “Antz” has five recommendations. “The Lion King” is the best recommendation. “Honey, I Shrunk the Kids” is the second best recommendation. And so on.
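If you would rather inspect the results programmatically, the five-column layout is easy to parse. The Python sketch below assumes the fields are tab-separated, which is the default delimiter for PigStorage(). It reads only the first, second, and last columns and ignores the two middle ones. The function name `parse_item_item_recs` is hypothetical, and the three rows are copied from the sample output above.

```python
from collections import defaultdict

# Rows copied from the sample output; note the empty fourth field in the last row.
SAMPLE_LINES = [
    "antz\thoney i shrunk the kids\t3.7120519\t7.348985\t2",
    "antz\tthe lion king\t3.9243095\t8.266502\t1",
    "fargo\tthe untouchables\t3.1157842\t\t4",
]


def parse_item_item_recs(lines):
    """Return {item: [recommended items ordered by rank]} from tab-separated rows."""
    recs = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        # First column: item, second: recommended item, last: rank.
        item, recommended, rank = fields[0], fields[1], int(fields[-1])
        recs[item].append((rank, recommended))
    return {item: [name for _, name in sorted(pairs)] for item, pairs in recs.items()}


parsed = parse_item_item_recs(SAMPLE_LINES)
for item, recommended in parsed.items():
    print(item, "->", recommended)
```

To read a real run, you could pass an open file handle for each part file in the output directory instead of the sample list.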

A quick look at these recommendations shows that they seem reasonable. “Antz” generates some other animated and/or kid-themed movies. “Alien” generates some adult horror/sci-fi movies. “Speed” generates some other action/adventure movies.

And that's it! You've gotten your recommendations! The next step is to run a larger example on the Mortar platform.