
Load Your Data Into Pig

Now that your data is in S3, it’s time to use it! If you don't have a Mortar project already set up or if you created a public project for your example recommendation engine but want your real engine to be private, follow the instructions here to set up a new private project before getting started.

Template Recommender

In your Mortar project open pigscripts/my-recommender.pig. This is the template that you're going to use to build your recommendation engine.

If you’ve already done the Running an Example Recommender section of this tutorial, this template should look very familiar to you. If you haven’t done that section, take a minute to read it over to become familiar with the different sections of your recommendation engine.

Data Paths


As described in the MongoDB section of Gather and Upload Data, you have a couple of options for loading your MongoDB data.

If you have chosen to read your Mongo backups from S3, follow this section and set the path to your data accordingly.

If you are connecting directly to MongoDB, skip this section and go to the Load Statements section below.

The first thing you need to do is get your data loaded into your Pig script.

import 'recommenders.pig';

%default INPUT_PATH 's3://my-bucket/input/my-data-file-or-directory'
%default OUTPUT_PATH 's3://my-bucket/output'

This section defines the parameters that tell Pig where your input data is stored and where your recommendation output should be written. Update these parameters with the S3 paths you want to use for your recommendation engine. If you have multiple data sets stored in different formats (csv, json, ...) or with different schemas, you will need a separate input path parameter for each data set. For example, the retail recommender example had one data set that recorded the purchases a user had made and another that recorded the movies the user had added to their wishlist.

If you enter a directory as your input path Pig will read all files in that directory. This is useful if you are loading data that is broken up into multiple files, e.g., daily log data.
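As an illustration, parameter definitions for the two retail data sets mentioned above might look like the following (the bucket and path names are placeholders; substitute your own):

```pig
-- Hypothetical S3 paths: one input parameter per data set.
%default PURCHASES_INPUT_PATH 's3://my-bucket/input/purchases'
%default WISHLISTS_INPUT_PATH 's3://my-bucket/input/wishlists'
%default OUTPUT_PATH 's3://my-bucket/output'
```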

Load Statements

Next, uncomment the load statement section of the Pig script. To do that, remove the block comment symbols (/* and */) from around the load statement line so that your code looks like:

/******* Load Data **********/

-- For help figuring out how to load your data visit
raw_input = load '$INPUT_PATH' using PigStorage()
              as (user: chararray, item: chararray, date_purchased: chararray);

The above load statement is a simple one that will load a 3-column tab-delimited file with user id, item id, and purchase date fields. For your recommender you’ll need to delete this load statement and replace it with one of your own. If you’re not familiar with Pig, visit the Mortar load statement generator to help generate a load statement for your data. You will need one load statement for every data set you have.
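For example, if one of your data sets were comma-delimited rather than tab-delimited, you could pass the delimiter to PigStorage. This is a sketch; the field names are placeholders, and you should replace them with your own schema:

```pig
-- Hypothetical comma-delimited purchase log; adjust the fields to match your data.
purchases = load '$INPUT_PATH'
            using PigStorage(',')
              as (user: chararray, item: chararray, date_purchased: chararray);
```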


If you are loading your MongoDB backups from S3, you will want to use a load statement like:

raw_input =  load '$INPUT_PATH'
            using com.mongodb.hadoop.pig.BSONLoader('mongo_id', '<your_schema>');

If you are connecting directly to MongoDB, you will want to set your connection string:

mortar config:set CONN=mongodb://<username>:<password>@<host>:<port>

And then use a load statement like:

%default DB '<your_database>'
%default COLLECTION '<your_collection>'

raw_input =  load '$CONN/$DB.$COLLECTION'
            using com.mongodb.hadoop.pig.MongoLoader('<your_schema>');
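As a sketch, with a schema filled in, a direct-connection load might look like this. The database, collection, and field names here are placeholders; use the fields that exist in your own collection:

```pig
%default DB 'mydb'
%default COLLECTION 'purchases'

-- Hypothetical schema; replace with the fields in your collection.
raw_input = load '$CONN/$DB.$COLLECTION'
            using com.mongodb.hadoop.pig.MongoLoader('user: chararray, item: chararray');
```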

Once you have your load statement(s), make sure they work properly by illustrating your script:

mortar local:illustrate pigscripts/my-recommender.pig -f params/my_recommender.params

Illustrate Results

If your load statement(s) are correct, you should see illustrate results similar to the screenshot above. Verify that all of your fields are loading as you expect. If you have an error with your load statement(s), you will get an error message explaining what’s gone wrong.

Now that you are successfully loading your data, it's time to generate signals.