In the beginning we encouraged you to pick some sensible weights for your incoming user-item signals. One possible source of poor recommender results is that what seemed like sensible weights turn out not to be.
There are no golden rules, and these values don’t need to be perfect, but trying a couple of runs using different signal weights can point to where a problem might be. Maybe your customers use their wishlist as a place to store gift ideas, so those items don’t represent their preferences. Maybe your users are sharing media ironically (“look at how bad this is!”). Adjusting the signal weights can help fix these kinds of problems.
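One cheap way to compare candidate weightings before a full run is to total the weighted signals for each user-item pair under each set of weights. The sketch below is illustrative Python, not part of the recommender; the signal names, weights, and sample events are all made up for the example.

```python
from collections import defaultdict

# Hypothetical raw events: (user, item, signal_type).
events = [
    ("alice", "book_a", "purchase"),
    ("alice", "book_b", "wishlist"),
    ("bob", "book_b", "wishlist"),
    ("bob", "book_a", "view"),
]

def weighted_affinity(events, weights):
    """Sum the weight of every event for each (user, item) pair."""
    totals = defaultdict(float)
    for user, item, signal_type in events:
        totals[(user, item)] += weights.get(signal_type, 0.0)
    return dict(totals)

# Two candidate weightings; the second heavily discounts wishlist signals.
weights_v1 = {"purchase": 2.0, "wishlist": 1.0, "view": 0.5}
weights_v2 = {"purchase": 2.0, "wishlist": 0.25, "view": 0.5}

for name, weights in [("v1", weights_v1), ("v2", weights_v2)]:
    print(name, weighted_affinity(events, weights))
```

Comparing the two printouts shows which pairs move when a single weight changes, which can help narrow down a suspect signal before rerunning the full pipeline.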
If you choose to create detailed data, you can look at the link_data field in your output to get a sense of which signals are producing better results. If poor results tend to score high on a particular signal type, that signal probably needs a smaller weight.
You can also dig deeper into the signals for a particular item pair. The script pigscripts/techniques/debug-item-item-recs.pig shows an example of finding the source signals behind a recommendation. To run it:
mortar local:run pigscripts/techniques/debug-item-item-recs.pig
It may be that some part of your data-entry process is manual, or that you’ve pulled in data from various sources. If you see recommendations for both “old man’s child” and “old mans child,” that means that the signal is getting split between two different items when it should be going entirely to one. Resolving that in your original data set may be helpful to more than just the recommendation engine.
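A quick way to spot split items like these is to group item names by a normalized key. The following is a minimal Python sketch under assumed rules (lowercase, drop apostrophes, treat other punctuation as spaces); the function name is hypothetical, and rules this aggressive can also merge items that really are distinct, so review the groups before changing your source data.

```python
import re
import unicodedata

def normalize_name(name):
    """Build a comparison key for an item name."""
    text = unicodedata.normalize("NFKC", name).lower()
    text = re.sub(r"['\u2019]", "", text)   # drop straight and curly apostrophes
    text = re.sub(r"[^\w\s]", " ", text)    # other punctuation becomes a space
    return " ".join(text.split())           # collapse runs of whitespace

names = ["Old Man's Child", "old mans child", "Old Man\u2019s  Child"]
keys = {normalize_name(n) for n in names}
print(keys)  # all three variants share one key
```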
Can your users take items off a wishlist without purchasing them? Can they return items? Can they unfavorite or unfollow? It’s important to make sure that those negative signals are cancelling out the positive signals; a user who returns an item shouldn’t get the full “buy” affinity.
If your data contains a rating scale, ratings below the halfway point should not contribute any positive affinity. If a user rates something as a 1 out of 10, that user does not like the item with a strength of 0.1; that user actively dislikes it.
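One possible way to handle a rating scale is to center it on its midpoint so that low ratings become negative values. This is a sketch of that idea in Python, assuming a 1 to 10 scale and a linear mapping; the function name and the choice of a [-1, 1] output range are illustrative, not something the recommender requires.

```python
def rating_to_affinity(rating, low=1, high=10):
    """Map a rating to [-1, 1], with the scale midpoint at 0."""
    midpoint = (low + high) / 2
    half_range = (high - low) / 2
    return (rating - midpoint) / half_range

for rating in (1, 5, 6, 10):
    print(rating, round(rating_to_affinity(rating), 2))
```

With this mapping a rating of 1 comes out at the strongest negative value rather than a weak positive one.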
You can include signals with negative values (i.e., a negative sign on the weight) to reduce or eliminate positive signals.
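To illustrate how negative weights cancel positive ones, here is a small Python sketch. The signal names and weights are hypothetical, and the final step that keeps only pairs with a positive total is an assumption of this example rather than documented recommender behavior.

```python
from collections import defaultdict

# Hypothetical weights; negative entries undo an earlier positive signal.
weights = {"purchase": 2.0, "return": -2.0, "favorite": 1.0, "unfavorite": -1.0}

events = [
    ("alice", "lamp", "purchase"),
    ("alice", "lamp", "return"),      # cancels the purchase
    ("bob", "lamp", "purchase"),
    ("bob", "lamp", "favorite"),
    ("bob", "lamp", "unfavorite"),    # cancels only the favorite
]

def net_affinity(events, weights):
    """Sum signed weights per (user, item) and keep the positive totals."""
    totals = defaultdict(float)
    for user, item, signal_type in events:
        totals[(user, item)] += weights[signal_type]
    return {pair: total for pair, total in totals.items() if total > 0}

print(net_affinity(events, weights))
```

Here the returned purchase leaves no affinity behind, while the second user keeps the full purchase weight.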
When you gathered your signals, was there something that was hard to get that wasn't included? Maybe you've thought of something else that could be added into the mix. Now is the time to integrate anything that didn't get included in the first pass.
The default parameters in my_recommender.params cut off the number of recommendations generated at a conservative number for performance reasons.
If your data is sparse or you find that a lot of items don't have any recommendations, these numbers can be adjusted to return more results.
This parameter indicates how many signals from a single user will be included. Bots and users with extreme usage patterns can cause performance issues, so they are proactively screened out. If you think your users will often have more than 100 signals, increase this number.
After item-item connections have been calculated, any links below this threshold will be discarded on the premise that the relationship is weak. If there aren't enough recommendations being generated, lowering this number will help retain some of those weaker connections.
This defines the number of recommendations to be generated for each item. The larger the number, the more possible recommendations. If there are users without recommendations, increasing this number may help.
This defines the number of recommendations to be generated for each user. The larger the number, the more recommendations available for each user.
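The interaction between the link threshold and the per-item limit can be sketched in a few lines of Python. This is only an illustration of the pruning idea described above, not the recommender's implementation; the function, argument names, and sample links are hypothetical.

```python
from collections import defaultdict

def prune_links(links, min_weight, max_per_item):
    """Drop links below min_weight, then keep the strongest max_per_item per source item."""
    by_item = defaultdict(list)
    for source, target, weight in links:
        if weight >= min_weight:
            by_item[source].append((target, weight))
    return {
        source: sorted(targets, key=lambda t: t[1], reverse=True)[:max_per_item]
        for source, targets in by_item.items()
    }

# Hypothetical item-item links: (source_item, recommended_item, weight).
links = [
    ("a", "b", 0.9), ("a", "c", 0.4), ("a", "d", 0.05),
    ("e", "a", 0.08),
]

# With the higher threshold, item "e" ends up with no recommendations at all.
print(prune_links(links, min_weight=0.1, max_per_item=2))
# Lowering the threshold keeps the weak link for "e".
print(prune_links(links, min_weight=0.05, max_per_item=2))
```

In this toy data, lowering the threshold is what rescues the sparse item, while raising the per-item limit would only matter for items that already have more links than the limit.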
Bots can destroy your recommendations, particularly if you are using view data. The bot will view 10,000 web pages, creating false affinities, and it will do it again and again. Bots add noise to the data, and as an added bonus, they also make the algorithms take longer.
The first step is to add a THRESHOLD parameter to your param file. The THRESHOLD parameter represents the maximum number of user-item signals a single user can have before being considered a bot. Choose a value based on your estimate of the largest number of signals a real person could actually generate. It's OK if this drops a few human users; getting rid of the noise is more important than keeping all of the signal.
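If you are unsure what a realistic maximum looks like, it can help to look at how many signals your heaviest users actually generate. The Python sketch below counts signals per user on made-up data; the function names and the sample users are hypothetical.

```python
from collections import Counter

def signals_per_user(user_signals):
    """Count how many signals each user generated."""
    return Counter(user for user, _item in user_signals)

def largest_counts(user_signals, n=5):
    """Return the n heaviest users with their signal counts, largest first."""
    return signals_per_user(user_signals).most_common(n)

# Hypothetical data: two ordinary users and one crawler-like user.
user_signals = (
    [("alice", f"item{i}") for i in range(12)]
    + [("bob", f"item{i}") for i in range(30)]
    + [("crawler", f"item{i}") for i in range(10000)]
)
print(largest_counts(user_signals, n=3))
```

A large gap between the top few counts and everyone else suggests a threshold somewhere inside that gap.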
The second step is to use the recsys__RemoveBots macro (found in recsys_util.pig) in your pigscript.
This macro takes two arguments: your user-item signals and the threshold parameter.
The macro removes the bot signals and returns the remaining user-item signals. Use the macro after you have generated the user-item signals and before computing item-item recommendations:
-- generate user signals
--
--
-- remove bots
user_signals_no_bots = recsys__RemoveBots(user_signals, $THRESHOLD);
-- use the user signals that have been cleaned of bots
item_item_recs = recsys__GetItemItemRecommendations(user_signals_no_bots);
To run this technique as a script:
mortar local:run pigscripts/techniques/remove-bots-technique.pig -f params/technique.params
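For intuition, the filtering idea behind bot removal can be sketched in plain Python. This is not the implementation of recsys__RemoveBots; it is a minimal illustration with a hypothetical function, and treating a count exactly equal to the threshold as human is an assumption of the sketch.

```python
from collections import Counter

def remove_bots(user_signals, threshold):
    """Drop every signal from users who have more than `threshold` signals."""
    counts = Counter(user for user, _item in user_signals)
    return [(user, item) for user, item in user_signals if counts[user] <= threshold]

# Hypothetical (user, item) signals; "bot" exceeds the threshold of 2.
user_signals = [
    ("alice", "a"), ("alice", "b"),
    ("bot", "a"), ("bot", "b"), ("bot", "c"),
]
print(remove_bots(user_signals, threshold=2))
```

Note that all of a flagged user's signals are dropped, not just the ones beyond the threshold, since none of a bot's activity reflects real preferences.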