The last thing you need to do before you are ready to run your recommendations is to uncomment the remaining portion of your script.
```
/******* Use Mortar recommendation engine to convert signals to recommendations **********/

-- Call the default Mortar recommender algorithm on your user-item data.
-- The input_signals alias needs to have the following fields: (user, item, weight:float)
item_item_recs = recsys__GetItemItemRecommendations(user_signals);
user_item_recs = recsys__GetUserItemRecommendations(user_signals, item_item_recs);

-- Store recommendations
rmf $OUTPUT_PATH/item_item_recs;
rmf $OUTPUT_PATH/user_item_recs;

store item_item_recs into '$OUTPUT_PATH/item_item_recs' using PigStorage();
store user_item_recs into '$OUTPUT_PATH/user_item_recs' using PigStorage();
```
This code takes your weighted user-item signals and generates recommendations for individual users and items. For more information about how the Mortar recommendation engine works, read Recommendation Engine Basics. If you are only interested in generating recommendations for your items (and not your users), you can delete the user_item_recs line and its corresponding store statement.
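To make the required (user, item, weight:float) shape concrete, here is a hedged sketch of how raw events might be rolled up into weighted signals. This is illustrative Python, not part of the Mortar project; the event data and per-signal weights are invented:

```python
from collections import defaultdict

# Raw events: (user, item, signal_type). The weights per signal type
# below are assumptions for illustration, not values from the tutorial.
events = [
    ("alice", "movie_1", "view"),
    ("alice", "movie_1", "purchase"),
    ("bob", "movie_2", "view"),
]
signal_weights = {"view": 1.0, "purchase": 3.0}

# Sum the weights of all signals for each (user, item) pair.
totals = defaultdict(float)
for user, item, signal in events:
    totals[(user, item)] += signal_weights[signal]

# The resulting tuples have the (user, item, weight:float) shape
# that the recommender expects of the user_signals alias.
user_signals = [(user, item, weight) for (user, item), weight in sorted(totals.items())]
print(user_signals)
```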
If you would like to store your recommendations directly into MongoDB (instead of to S3), replace the part of the script beginning at the "-- Store recommendations" comment with the following code:
```
-- Store recommendations
%default DB '<your_database>'
%default II_COLLECTION 'item_item_recs'
%default UI_COLLECTION 'user_item_recs'

store item_item_recs into '$CONN/$DB.$II_COLLECTION'
    using com.mongodb.hadoop.pig.MongoInsertStorage('','');
store user_item_recs into '$CONN/$DB.$UI_COLLECTION'
    using com.mongodb.hadoop.pig.MongoInsertStorage('','');
```
If you would like to store your recommendations directly into a SQL database (instead of to S3), replace the part of the script beginning at the "-- Store recommendations" comment with the following code:
```
-- Store recommendations
-- Definitions for PostgreSQL example; change for other providers
%default DATABASE_TYPE 'postgresql'
%default DATABASE_DRIVER 'org.postgresql.Driver'
%default DATABASE_HOST '<host>:<port>'
%default DATABASE_NAME '<dbname>'
%default DATABASE_USER '<username>'
%default II_TABLE '<ii-table-name>'
%default UI_TABLE '<ui-table-name>'

store item_item_recs into 'hdfs:///unused-ignore'
    USING org.apache.pig.piggybank.storage.DBStorage(
        '$DATABASE_DRIVER',
        'jdbc:$DATABASE_TYPE://$DATABASE_HOST/$DATABASE_NAME',
        '$DATABASE_USER',
        '$DATABASE_PASS',
        'INSERT INTO $II_TABLE(from_id,to_id,weight,raw_weight,rank) VALUES (?,?,?,?,?)');

store user_item_recs into 'hdfs:///unused-ignore'
    USING org.apache.pig.piggybank.storage.DBStorage(
        '$DATABASE_DRIVER',
        'jdbc:$DATABASE_TYPE://$DATABASE_HOST/$DATABASE_NAME',
        '$DATABASE_USER',
        '$DATABASE_PASS',
        'INSERT INTO $UI_TABLE(from_id,to_id,weight,reason_item,user_reason_item_weight,item_reason_item_weight,rank) VALUES (?,?,?,?,?,?,?)');
```
You must first create the target tables in your database:
```
CREATE TABLE <ii-table-name> (
    from_id CHARACTER VARYING NOT NULL,
    to_id CHARACTER VARYING,
    weight NUMERIC,
    raw_weight NUMERIC,
    rank INTEGER NOT NULL,
    PRIMARY KEY (from_id, rank)
);

CREATE TABLE <ui-table-name> (
    from_id CHARACTER VARYING NOT NULL,
    to_id CHARACTER VARYING,
    weight NUMERIC,
    reason_item CHARACTER VARYING,
    user_reason_item_weight NUMERIC,
    item_reason_item_weight NUMERIC,
    rank INTEGER NOT NULL,
    PRIMARY KEY (from_id, rank)
);
```
Setting reasonable primary keys and indexes is generally necessary to get acceptable query times on the large amount of resulting data.
After creating those tables, you will need to set your database password. You can securely store this value for your project by doing:
```
mortar config:set DATABASE_PASS=<password>
```
Now that your script is ready, it's time to run this job in the cloud. Before doing so, you need to pick the number of nodes to use for running your job. For your first run, use a cluster size of 10. This is only an initial value; you should adjust your cluster size based on how long your job takes and your individual requirements for acceptable run time.
The Mortar recommender project defaults to using AWS spot instances. While spot instance prices aren’t guaranteed, they’re typically around $0.14 per hour per node. On very rare occasions when the spot price goes above $0.44 per hour per node, your cluster will be automatically terminated and the job will fail. In most cases the significant cost savings of spot instances will be worth the very small chance of your job failing, but read Setting Project Defaults if you would like to change the defaults of your project.
It's a good idea to shut down your cluster when you are done with it. You can do this by passing the '--singlejobcluster' option when running your job or by going to the Clusters Page. If you forget, Mortar will automatically shut down your cluster after it has been idle for 1 hour.
Ok, now you’re ready to run!
```
mortar run pigscripts/my-recommender.pig -f params/my_recommender.params --clustersize 10
```
After starting your job you will see output similar to:
```
Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id

Job status can be viewed on the web at:

    https://app.mortardata.com/jobs/job_detail?job_id=some_job_id

Or by running:

    mortar jobs:status some_job_id --poll
```
This tells you that your job has started successfully and gives you two common ways to monitor the progress of your job.
In the example recommender tutorial we covered how to monitor your job's progress, but let's recap what you will do.
Open up your job status page at the URL displayed after you started your job.
The top of the page shows your job's overall progress. Remember that there are three main stages to how your job runs:
Once your job starts running on your cluster, you can use the visualization tab to see how your job is broken up into Hadoop Map/Reduce jobs and to see various metrics about how your job is doing.
On the details tab you will see various metadata about this job. Mortar keeps track of the exact code and parameters you used to run this job. As you are developing your recommendation engine it can be useful to go back to one of your previous jobs to get a sense of what you changed and how that affected your results.
If your job fails the details tab will show you the error that occurred. To diagnose errors you can use the log tabs on the right to get more detailed information.
To view recommendations stored in MongoDB you can connect normally and query the collections you used to store your results. Skip down to "Evaluating Your Results" below.
To view recommendations stored in a SQL database you can connect normally and query the tables you used to store your results. Skip down to "Evaluating Your Results" below.
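To sketch what querying those tables looks like, here is a minimal example using Python's standard-library sqlite3 as a portable stand-in for a real PostgreSQL connection (in practice you would use your own driver and table names; the table name and rows below are invented). The table layout mirrors the <ii-table-name> definition above:

```python
import sqlite3

# In-memory stand-in for your recommendations database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item_item_recs ("
    "  from_id TEXT NOT NULL, to_id TEXT, weight REAL,"
    "  raw_weight REAL, rank INTEGER NOT NULL,"
    "  PRIMARY KEY (from_id, rank))"
)
conn.executemany(
    "INSERT INTO item_item_recs VALUES (?, ?, ?, ?, ?)",
    [
        ("movie_1", "movie_2", 0.9, 12.0, 1),
        ("movie_1", "movie_3", 0.7, 9.0, 2),
        ("movie_2", "movie_1", 0.8, 11.0, 1),
    ],
)

# Fetch the top recommendations for one item, in rank order.
rows = conn.execute(
    "SELECT to_id FROM item_item_recs WHERE from_id = ? ORDER BY rank LIMIT 5",
    ("movie_1",),
).fetchall()
print([r[0] for r in rows])
```

Because the primary key covers (from_id, rank), this lookup-by-item, order-by-rank query is exactly the access pattern the suggested keys are designed to make fast.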
Once your job has finished, it's time to take a look at your results. Your data will be organized in the following format in the S3 bucket you used for your OUTPUT_PATH parameter.
* item_item_recs
    * part-r-00000
    * part-r-00001
    * ...
    * part-r-NNNNN
* user_item_recs
    * part-r-00000
    * part-r-00001
    * ...
    * part-r-NNNNN
Your recommendations are broken up into multiple "part" files based on how Hadoop distributes the processing of your job.
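If you want a single local file per result set after downloading, the part files can simply be concatenated in order. A minimal sketch (the directory layout and file contents here are stand-ins created for the example):

```python
import glob
import os
import tempfile

# Create a stand-in download directory with a couple of part files.
recs_dir = os.path.join(tempfile.mkdtemp(), "item_item_recs")
os.makedirs(recs_dir)
for name, body in [("part-r-00000", "movie_1\tmovie_2\n"),
                   ("part-r-00001", "movie_1\tmovie_3\n")]:
    with open(os.path.join(recs_dir, name), "w") as f:
        f.write(body)

# Concatenate every part file, in sorted order, into one combined file.
combined = os.path.join(os.path.dirname(recs_dir), "item_item_recs.tsv")
with open(combined, "w") as out:
    for part in sorted(glob.glob(os.path.join(recs_dir, "part-r-*"))):
        with open(part) as f:
            out.write(f.read())

print(sum(1 for _ in open(combined)))  # 2 lines total
```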
Open up the web URL displayed after you started your job and go to the details tab.
From this tab you can download some or all of your result files. Depending on the size of your data, you may want to use a tool like the AWS Command Line Interface, Transmit, or s3cmd to download your results, providing the AWS keys you wrote down previously to your tool of choice.
Once you have your results, open them and take a look.
The schema for the item-item recommendation output is: (from_id, to_id, weight, raw_weight, rank).
The schema for the user-item recommendation output is: (from_id, to_id, weight, reason_item, user_reason_item_weight, item_reason_item_weight, rank).
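PigStorage writes these fields tab-delimited by default, so a downloaded part file can be parsed line by line. A short sketch for the item-item output (the sample line is invented):

```python
from collections import namedtuple

# Fields match the item-item recommendation schema above.
ItemItemRec = namedtuple(
    "ItemItemRec", ["from_id", "to_id", "weight", "raw_weight", "rank"]
)

def parse_item_item_line(line):
    """Parse one tab-delimited line of item_item_recs output."""
    from_id, to_id, weight, raw_weight, rank = line.rstrip("\n").split("\t")
    return ItemItemRec(from_id, to_id, float(weight), float(raw_weight), int(rank))

rec = parse_item_item_line("movie_1\tmovie_2\t0.9\t12.0\t1")
print(rec.to_id, rec.rank)
```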
An important note on the weight fields: while a larger weight does indicate a stronger relationship, the absolute weight doesn't mean anything on its own. Absolute weight will vary based on the number of signals, the weights assigned to them, and the total number of items available. Moreover, you cannot conclude that a link with twice the weight is twice as strong; there is no guarantee of a linear relationship.
Judging the initial quality of your recommendations requires some knowledge of your domain and your data. A good approach to take is to pick a few items that are very different (a romance and a horror movie, a pop artist and an instrumentalist, an outdoor and a kitchen good, etc.) and see that the recommendations for each of those make sense.
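That spot check is easy to script: pick a handful of probe items and print their top recommendations side by side. A sketch over already-parsed (from_id, to_id, rank) tuples (the item names and data are invented):

```python
# Parsed item-item recommendations: (from_id, to_id, rank). Invented data.
recs = [
    ("romance_movie", "another_romance", 1),
    ("romance_movie", "romcom_classic", 2),
    ("horror_movie", "slasher_sequel", 1),
    ("horror_movie", "haunted_house", 2),
]

# For each probe item, list its recommendations in rank order.
probes = ["romance_movie", "horror_movie"]
for probe in probes:
    top = [to_id for from_id, to_id, rank in sorted(recs, key=lambda r: r[2])
           if from_id == probe]
    print(probe, "->", top[:3])
```

If a horror movie's top recommendations come back full of romances, that's a sign to revisit your signal weights before tuning anything else.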
Congratulations, you've just generated your first recommendations!