To begin, we need to determine which input data to gather for our recommendation engine.
Start by deciding which entity type your recommendation engine should recommend. For retail, you’ll usually recommend an item of merchandise to a user. For media, it’s often a video or image. For social, you’re likely to recommend other users.
After you’ve decided which entity to recommend, the next step is to make a list of all the interactions that a user in your system has with that entity. These data points are called signals—they record a meaningful interaction between a user and an item that you’d like to recommend. Note that if you are recommending users, it can be helpful to include interactions with items if you have them.
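To make the idea concrete, here is a minimal sketch of what a handful of retail signal records might look like once collected. The field names and signal types are illustrative only, not something Mortar requires:

```python
# Illustrative retail signal records: each links a user to an item
# through a meaningful interaction. Field and signal names here are
# hypothetical -- use whatever your own systems record.
from collections import Counter

signals = [
    {"user_id": "u42", "item_id": "sku-1001", "signal": "view"},
    {"user_id": "u42", "item_id": "sku-1001", "signal": "add_to_cart"},
    {"user_id": "u7",  "item_id": "sku-2003", "signal": "purchase"},
]

# Tally how often each signal type occurs -- a quick sanity check
# that an extracted data set looks the way you expect.
signal_counts = Counter(s["signal"] for s in signals)
```

Stronger signals (like a purchase) and weaker ones (like a view) can both be useful, so list them all rather than only the obvious ones.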
Your signals will be specific to your business, but some standard signals by industry are:
Retail (recommending items)
Media (recommending videos or images)
Social (recommending other users)
For each signal data set you identify, write down where your business stores that data. Often this will be a table in an application database, but some data may live in log files or in third-party tools.
It’s OK to spend a bit of time here, because a good, full list of signals can greatly improve the quality of your recommendations. That said, you can also get started with a sample set, so don’t feel you have to wait for every piece of data to be in place before you get going.
When your signal data is stored in MongoDB, you need to decide how you are going to retrieve it. There are two high-level strategies you can take:
For details on the different options, see Hadoop with Mongo.
If you choose to read your backups from S3, follow the rest of this section for instructions on getting your data to S3 and readable from Mortar. If you choose to connect directly to your database, you can skip the rest of this section and move on to Loading your data into Pig.
Having determined the input signal data that you need, now it’s time to gather that data. This process is called "extraction"—retrieving data from a source system to allow it to be processed downstream.
For this first extraction, don’t worry about making your process automated or repeatable. You just need a good copy of the data to start working from; we’ll get the process automated in a later tutorial.
General rules of thumb for this extraction are:
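For signals that live in an application database table, a one-off extraction can stay very simple. Here is a minimal sketch, with an in-memory sqlite3 database standing in for your application database and illustrative table and column names:

```python
import csv
import io
import sqlite3

def extract_table(conn, table, columns, out_fileobj):
    """Dump selected columns of a table as tab-separated lines.
    Table/column names are trusted input here, which is fine for
    a one-off extraction script you run yourself."""
    writer = csv.writer(out_fileobj, delimiter="\t", lineterminator="\n")
    for row in conn.execute("SELECT %s FROM %s" % (", ".join(columns), table)):
        writer.writerow(row)

# An in-memory database stands in for your real application database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id TEXT, item_id TEXT)")
conn.execute("INSERT INTO purchases VALUES ('u1', 'sku-9')")

out = io.StringIO()
extract_table(conn, "purchases", ["user_id", "item_id"], out)
```

Tab-separated flat files like this are easy to upload to S3 and load into Pig later.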
Your recommendation engine will pick up input data from Amazon S3: a simple, inexpensive, and near-infinitely large storage system at AWS. S3 stores data in “buckets,” which are similar to directories. Buckets contain files, which are called “objects” in S3. To learn more about S3, check out AWS’s S3 Details page.
There are three steps to get your extracted data uploaded to S3:
We’ll explore each of these in order.
If you already have an Amazon Web Services (AWS) account and a login to the AWS Management Console, you can skip this portion and move to the next step. Otherwise, we’ll need to create an AWS account where you can upload your recommendation input data.
Creating an account at AWS is very easy. To do so, visit the AWS homepage, click “Sign Up,” provide your information, and create your account. If AWS asks which products you intend to use, be sure to select AWS S3. You’ll need to provide a credit card to AWS to cover any costs you incur, but note that AWS has a very generous free usage tier to get you started, and that S3 pricing is very inexpensive.
Next, you’ll need to get your AWS Access Keys. These keys will allow you to create a new S3 bucket and upload your data to it.
There are two types of AWS Access Keys: account-level keys that provide full account access, and fine-grained (IAM) keys that provide access only to specific AWS resources. This tutorial will use account-level keys, but if you prefer IAM keys (a more involved setup), you can follow these alternate setup steps for IAM.
To get your account-level AWS Access Keys:
Note that AWS only allows two pairs of access keys to be active at a time. If you already have two active pairs of keys, you’ll need to look up the Secret Access Key for one of them from the Legacy Security Credentials page, or talk to your IT department to get them.
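However you obtain your key pair, avoid hard-coding it in scripts. A common pattern is to read the keys from the standard AWS environment variables; a minimal sketch:

```python
import os

def aws_credentials():
    """Read AWS keys from the standard environment variables
    (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) rather than
    embedding them in code checked into source control."""
    try:
        return (os.environ["AWS_ACCESS_KEY_ID"],
                os.environ["AWS_SECRET_ACCESS_KEY"])
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: %s" % missing)
```

Failing loudly when a key is missing beats silently falling back to empty credentials and getting a confusing access-denied error later.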
Now, we’re ready to upload our input data to a newly created S3 bucket. We’ll use the AWS Management Console to do this quickly and easily. (Check here for other upload options.)
First, create a new S3 bucket:
Next, upload your extracted data files into the bucket:
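If you’d rather script the upload than click through the console, the boto3 library can do the same thing. This is an optional sketch: the bucket name, prefix, and file names below are placeholders, and the actual upload call is left as a comment since it needs boto3 and valid credentials:

```python
import os

def s3_key(prefix, local_path):
    """Build an S3 object key from a prefix plus the local file name,
    using forward slashes as S3 expects. Prefix is hypothetical."""
    return "%s/%s" % (prefix.strip("/"), os.path.basename(local_path))

# To run the actual upload you would need boto3 and AWS credentials:
#
#   import boto3
#   s3 = boto3.client("s3")
#   for path in ["purchases.tsv", "views.tsv"]:  # your extracted files
#       s3.upload_file(path, "my-recsys-input", s3_key("input", path))
```

Mirroring your local file layout under a single prefix keeps all the input data for one recommendation run grouped together in the bucket.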
While you are waiting for your data to upload, you should add your AWS Access Keys to your Mortar account on the Mortar AWS Settings page:
These keys will be stored encrypted at Mortar, allowing you to access your data in S3.
When the upload finishes, your input data will be stored in Amazon S3 and ready to load into Pig in the next step.