Mortar has joined Datadog, the leading SaaS-based monitoring service for cloud applications.

Run an Example Recommender in the Cloud

The retail example demonstrated the main concepts of a Mortar recommender. However, since it processed only a very small amount of data, it's missing one important part: running your job at scale. As a rule of thumb, more data means better recommendations. Using the power of Hadoop, you can run your recommendation engine on a cluster of machines and easily handle any amount of data you have.

A Music Recommender

MongoDB

The following example reads from and writes to S3. However, we also have an example that loads and writes data from MongoDB. To run that example you must first load the Last.fm data into your database.

First, you will need to set your MongoDB connection, database, and collection strings. You can securely store these values for your project by doing:

    mortar config:set CONN=mongodb://<username>:<password>@<host>:<port>

    mortar config:set DB=<databasename>

    mortar config:set COLLECTION=<collectionname>

Now you need to load the Last.fm data. You can run pigscripts/mongo/load_lastfm_data_to_mongo.pig to do that. By default this script will load all 17 million documents in the dataset which is approximately 3 GB of data. If you would like to load only a subset of the data you can uncomment the following line:

    plays = limit plays 1000000;

To run the loading pig script:

    mortar run pigscripts/mongo/load_lastfm_data_to_mongo.pig --clustersize 10 

Once the data is loaded, you can follow the rest of the tutorial, substituting pigscripts/mongo/lastfm-recsys-online.pig for pigscripts/lastfm-recsys.pig.


DBMS

The following example reads from and writes to S3. However, we also have an example that writes data to a database. To run that example you must first create the target tables in your database:

CREATE TABLE <ii-table-name> (from_id CHARACTER VARYING NOT NULL, to_id CHARACTER VARYING,
    weight NUMERIC, raw_weight NUMERIC, rank INTEGER NOT NULL, PRIMARY KEY (from_id, rank));

CREATE TABLE <ui-table-name> (from_id CHARACTER VARYING NOT NULL, to_id CHARACTER VARYING,
    weight NUMERIC, reason_item CHARACTER VARYING, user_reason_item_weight NUMERIC,
    item_reason_item_weight NUMERIC, rank INTEGER NOT NULL, PRIMARY KEY (from_id, rank));

These CREATE statements include primary keys, which may be necessary to get appropriate query times on the final volume of data.
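To see why the (from_id, rank) primary key matters, consider the serving-time query: "give me the top-N recommendations for one item, in order." A minimal sketch below uses SQLite for illustration (the tutorial targets PostgreSQL, but the access pattern is identical); the table and sample rows mirror the ii-table schema above.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL ii table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ii_recs (
        from_id    TEXT NOT NULL,
        to_id      TEXT,
        weight     REAL,
        raw_weight REAL,
        rank       INTEGER NOT NULL,
        PRIMARY KEY (from_id, rank)
    )
""")
conn.executemany(
    "INSERT INTO ii_recs VALUES (?, ?, ?, ?, ?)",
    [
        ("dimmu borgir", "ad inferna", 0.636, 7.0, 1),
        ("dimmu borgir", "old man's child", 0.635, 539.4, 2),
        ("miley cyrus", "emily osment", 0.809, 65.5, 1),
    ],
)

# The serving-time query: fetch the top-N recommendations for one item.
# The (from_id, rank) primary key lets the database answer this with a
# single index range scan instead of scanning the whole table.
rows = conn.execute(
    "SELECT to_id FROM ii_recs WHERE from_id = ? ORDER BY rank LIMIT 2",
    ("dimmu borgir",),
).fetchall()
print([r[0] for r in rows])  # ['ad inferna', "old man's child"]
```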

After creating those tables, you will need to set your database password. You can securely store this value for your project by doing:

mortar config:set DATABASE_PASS=<password>

Lastly, you'll need to open pigscripts/dbms/lastfm-recsys-dbms.pig and set the default parameters to connect to your database.

%default DATABASE_TYPE 'postgresql'
%default DATABASE_DRIVER 'org.postgresql.Driver'
%default DATABASE_HOST '<host>:<port>'
%default DATABASE_NAME '<dbname>'
%default DATABASE_USER '<username>'
%default II_TABLE '<ii-table-name>'
%default UI_TABLE '<ui-table-name>'

This example uses PostgreSQL, but you can write to any JDBC database by setting the type and driver, and if necessary adding the driver jar to your lib directory.

Once the database and script are set up, you can follow the rest of the tutorial, substituting pigscripts/dbms/lastfm-recsys-dbms.pig for pigscripts/lastfm-recsys.pig.


Open pigscripts/lastfm-recsys.pig. This script creates music recommendations based on a data set from Last.fm that contains the number of times each user played a specific artist. This data lives in a publicly available S3 bucket and covers approximately 360,000 users.

Use illustrate to see what this data looks like. This time add the '-s' option so that illustrate will show more rows of data.

mortar local:illustrate pigscripts/lastfm-recsys.pig input_signals -f params/lastfm.params -s

Illustrate Results

As you can see, in this example our data is already in a format that we can use as signals. Users are our users, artist names are our items, and the number of times a user played an artist is the weight of the signal. This assumes that the more times a user plays an artist, the more they like that artist.
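To build intuition for how (user, artist, plays) signals become item-item weights, here is a minimal cosine-similarity sketch over a handful of hypothetical signals. This is not the Mortar recommendation pipeline itself, just an illustration of how shared listeners and play counts drive similarity between artists.

```python
import math
from collections import defaultdict

# Hypothetical sample signals in the shape illustrate shows:
# (user, artist, plays). Values are made up for illustration.
signals = [
    ("user_1", "dimmu borgir", 50),
    ("user_1", "old mans child", 30),
    ("user_2", "dimmu borgir", 10),
    ("user_2", "miley cyrus", 40),
    ("user_3", "miley cyrus", 25),
    ("user_3", "selena gomez", 20),
]

# Build an artist -> {user: weight} vector from the signals.
vectors = defaultdict(dict)
for user, artist, plays in signals:
    vectors[artist][user] = plays

def cosine(a, b):
    """Cosine similarity between two sparse user->weight vectors."""
    dot = sum(w * b[u] for u, w in a.items() if u in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

# Artists with a heavy shared listener score close to 1.0.
sim = cosine(vectors["dimmu borgir"], vectors["old mans child"])
print(round(sim, 3))
```

Artists with no listeners in common score 0.0, so the more users (and plays) two artists share, the higher their item-item weight.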


Running in the Cloud

When running your recommender in the cloud, you need to decide how large a cluster of computers to use for your job. For this example, you will use a 10-node cluster that should take about an hour and a half to run. In later tutorials we'll cover more about how Hadoop works and how to determine the right cluster size for your data and your job.

By default the Mortar recommendation engine uses AWS spot instances to save money. Running this example on a 10-node spot instance cluster for 2 hours should cost you approximately $3.40 in pass-through AWS costs. Before running this job you will need to add your credit card to your account. You can do that on our Billing Page.
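A quick back-of-the-envelope check of that figure, assuming roughly $0.17 per node-hour in pass-through spot-instance costs (derived from the ~$3.40 estimate above; actual spot prices vary with AWS demand):

```python
# Rough cluster-cost estimate; the per-node-hour rate is an assumption
# inferred from the ~$3.40 figure, not a quoted AWS price.
nodes = 10
hours = 2
cost_per_node_hour = 0.17

estimated_cost = nodes * hours * cost_per_node_hour
print(f"${estimated_cost:.2f}")  # $3.40
```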

It's a good idea to shut down your cluster when you are done with it. This can be done by passing the '--singlejobcluster' option when running your job or by going to the Clusters Page. If you forget, Mortar will automatically shut down your cluster after it has been idle for 1 hour.

mortar run pigscripts/lastfm-recsys.pig -f params/lastfm.params --clustersize 10

After running this command you will see output similar to:

Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

    https://app.mortardata.com/jobs/job_detail?job_id=some_job_id

Or by running:

    mortar jobs:status some_job_id --poll

This tells you that your job has started successfully and gives you two common ways to monitor the progress of your job.

While you are waiting for your job to finish, jump over and do the Hadoop and Pig tutorials to get a better understanding of how they work.


Monitoring Your Job's Progress

Your recommendation engine job goes through three main steps:

  1. Mortar validates your pigscript, checking for simple Pig errors, Python errors, and that you have the required access to the specified S3 buckets. If the pigscript is invalid, Mortar will return an error explaining the problem. This step is done before launching a cluster to make sure you pay as little as possible.
  2. Mortar starts a Hadoop cluster of the size you specified. This stage can take 5-15 minutes. You do not pay for the time the cluster is starting.
  3. Mortar runs your job on the cluster.
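The lifecycle above can be sketched as a simple polling loop. The real status comes from the Mortar web app or `mortar jobs:status`; the `get_job_status` stub here is purely hypothetical and just walks through the stages one call at a time.

```python
import time

# Hypothetical job lifecycle stages matching the three steps above.
STAGES = ["validating", "starting_cluster", "running", "success"]

def get_job_status(job_id, _state={"i": 0}):
    """Stub status source: advances one stage per call. A real client
    would query the job-status service instead."""
    stage = STAGES[min(_state["i"], len(STAGES) - 1)]
    _state["i"] += 1
    return stage

def wait_for_job(job_id, poll_seconds=0):
    """Poll until the job reaches a terminal state, recording each
    distinct stage seen along the way."""
    seen = []
    while True:
        status = get_job_status(job_id)
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in ("success", "failure"):
            return seen
        time.sleep(poll_seconds)

stages = wait_for_job("some_job_id")
print(stages)  # ['validating', 'starting_cluster', 'running', 'success']
```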

Once your job has started running on the cluster, you can get realtime feedback about its progress from the Mortar web application. Open the job status link displayed after you started your job:

Job status can be viewed on the web at:

    https://app.mortardata.com/jobs/job_detail?job_id=some_job_id

Job Status

At the top of the screen you can see the basic details of your job: its current progress, the time the job started, how long it's been running, and the number of nodes in the cluster.

Below that is the Visualization tab. This shows how Pig has compiled your pigscript into Hadoop Map/Reduce jobs. The blinking box shows you the current running Map/Reduce job, and clicking on any box will show you information and statistics for that specific job. There's a lot of information in this visualization, so take a minute to click around and see how things work. Don't worry if there's information you don't understand--we'll explain the important parts later.

Job Details

Click on the Details tab to see some information about your job. Mortar keeps track of every parameter value you used and links to an exact snapshot of the code used on GitHub. Once your job has finished, this page will also display a link to your output data on S3.

The remaining tabs show more detailed log information from Pig and Hadoop. Take a minute to look at those tabs and familiarize yourself with what is being shown.


Viewing your Recommendations

MongoDB

If you ran the MongoDB example, you can connect to your database and query your results once your job has finished. To see the recommendations for the three artists discussed below, run:

    db.item_item_recs.find({"item_A":"dimmu borgir"})
    db.item_item_recs.find({"item_A":"miley cyrus"})
    db.item_item_recs.find({"item_A":"yo-yo ma"})


DBMS

If you ran the DBMS example, you can connect to your database and query your results once your job has finished.


Once your job has finished, you can examine the results just as we did with the retail example. In this case Pig has written our results to S3, so you'll need to download the output files from S3. The easiest way to do that is to use the Details tab on the job status page shown above.

Download Results

Take a look at the item-item recommendations. Here are some artist recommendations from the results.

item            recommendation      weight      raw_weight  rank
dimmu borgir    ad inferna  0.6363626   6.9999876   1
dimmu borgir    old man's child 0.6347272   539.41705   2
dimmu borgir    old mans child  0.60611665  148.91302   3
dimmu borgir    hecate enthroned    0.56548685  104.47171   4
dimmu borgir    dragonlord  0.53925556  40.964005   5
dimmu borgir    apostasy    0.51258606  20.995056   6
dimmu borgir    martriden   0.5121951   21.0    7
dimmu borgir    vesania 0.5026692   271.64865   8
dimmu borgir    naglfar 0.50033504  516.794 9
dimmu borgir    celestial crown 0.5 5.0 10
...
miley cyrus miley cyrus and billy ray cyrus 0.8259267   37.95765    1
miley cyrus emily osment    0.80874276  65.465225   2
miley cyrus selena gomez    0.7576922   188.882 3
miley cyrus sara paxton 0.75432295  16.958307   4
miley cyrus nick jonas  0.73641354  21.684519   5
miley cyrus nicholas jonas  0.73333335  33.0    6
miley cyrus mitchel musso   0.73020715  54.760765   7
miley cyrus cast of camp rock   0.7088355   28.631786   8
miley cyrus steve rushton   0.7022875   16.512615   9
miley cyrus clique girlz    0.68378335  31.921757   10
...
yo-yo ma    natalie clein   0.36359763  3.9993293   1
yo-yo ma    bar scott   0.33333334  2.0 2
yo-yo ma    mela tenenbaum  0.33333334  2.0 3
yo-yo ma    moscow radio symphony orchestra 0.33333334  2.0 4
yo-yo ma    jonathan biss   0.33333334  2.0 5
yo-yo ma    小松亮太    0.3 3.0 6
yo-yo ma    méav ní mhaolchatha 0.2857143   2.0 7
yo-yo ma    juliana finch   0.2857143   2.0 8
yo-yo ma    academy of st. martin in the fields, joshua bell & michael stern    0.2857143   2.0 9
yo-yo ma    yo-yo ma & the silk road ensemble   0.27272728  6.0 10

Evaluating the Recommendations

One good way to get a sense of the quality of your recommendations is to look at very different items and see if the recommendations make sense. Above, you can see a death metal band, a pop singer, and a cellist, and in each case the recommendations make sense.

You can also see a couple of possible improvements for these recommendations:

  1. Preprocess the data to standardize artist names. "old man's child" and "old mans child" should probably be treated as the same artist.
  2. Add a filter to remove recommendations that involve the current artist just playing with someone else.
  3. Add a diversity metric so that you can get recommendations from multiple genres.
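As a minimal sketch of improvement 1 above, artist names can be normalized before signals are aggregated, so near-duplicate spellings collapse to one key. The normalization rules here (lowercase, strip punctuation, collapse whitespace) are one simple choice, not the only one.

```python
import re

def normalize_artist(name):
    """Collapse near-duplicate artist spellings to a single key."""
    name = name.lower().strip()
    name = re.sub(r"[^\w\s]", "", name)   # drop punctuation like apostrophes
    return re.sub(r"\s+", " ", name)      # collapse repeated whitespace

print(normalize_artist("Old Man's Child"))  # old mans child
print(normalize_artist("old mans child"))   # old mans child
```

With this preprocessing, the duplicate "old man's child" / "old mans child" rows in the results above would merge into a single recommendation.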

We'll cover ways that you can make these adjustments in a later tutorial, but for now just note that spot checking your results like this can provide good ideas for future improvements.

If you haven't already completed the Hadoop and Pig tutorials, you should do that now. If you have, you're ready to build your own recommendation engine!