Gather and Upload Data

Now we’re ready to move from making recommendations on example data to making recommendations on your own data. But first, you need to gather your input data and copy it to a place where you can use it to make recommendations.

Which Data?

To begin, we need to determine which input data to gather for our recommendation engine.

Start off by deciding what entity type your recommendation engine should recommend. For retail, you’ll usually recommend an item of merchandise to a user. For media, it’s often a video or image. For social, you’re likely to recommend other users.

After you’ve decided which entity to recommend, the next step is to make a list of all the interactions that a user in your system has with that entity. These data points are called signals—they record a meaningful interaction between a user and an item that you’d like to recommend. Note that if you are recommending users, it can be helpful to include interactions with items if you have them.

Your signals will be specific to your business, but some standard signals by industry are:

Retail (recommending items)

  • Purchases
  • Wishlist adds
  • Cart adds
  • Views
  • Shares
  • Favorites

Media (recommending videos or images)

  • Downloads
  • Watches
  • Shares
  • Favorites
  • Views

Social (recommending other users)

  • Shares
  • Likes
  • Follows

For each signal data set you identify, write down where your business stores that data. Often this will be a table in an application database, but some data may live in log files or in third-party tools.

It’s OK to spend a bit of time here: getting a good, full list of signals can greatly improve the quality of your recommendations. That said, you can also get started with a sample set, so don’t feel you have to wait for every piece of data to be in place before you get going.
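
To make the idea of a signal concrete, here is a minimal Python sketch of what a few extracted signal records might look like. The field names (user_id, item_id, signal, timestamp) and the values are purely hypothetical; your own data will have its own identifiers and columns.

    from collections import namedtuple
    from datetime import datetime

    # Hypothetical shape of one extracted signal record: who did what to
    # which item, and when. Your own field names will differ.
    Signal = namedtuple("Signal", ["user_id", "item_id", "signal", "timestamp"])

    examples = [
        Signal("u-1842", "sku-99321", "purchase", datetime(2014, 3, 2, 14, 5, 0)),
        Signal("u-1842", "sku-10458", "wishlist_add", datetime(2014, 3, 2, 14, 7, 30)),
        Signal("u-2216", "sku-99321", "view", datetime(2014, 3, 1, 9, 12, 11)),
    ]

    # Printed as tab-delimited rows, the format recommended for extraction below.
    for s in examples:
        print("\t".join([s.user_id, s.item_id, s.signal, s.timestamp.isoformat()]))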

Extracting Signal Data

MongoDB

When your signal data is stored in MongoDB, you have to decide how you are going to retrieve that data. There are two high-level strategies you can take:

  • Offline processing (Recommended): Store backups of your MongoDB data in S3 and read those.
  • Online processing: Connect to MongoDB and read the data directly.

For details on the different options, see Hadoop with Mongo.

If you choose to read your backups from S3, follow the rest of this section for instructions on how to get your data to S3 and readable from Mortar. If you choose to connect directly to your database you can skip the rest of this section and start Loading your data into Pig.
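
If you just want a one-off copy of a collection for this first pass, a short pymongo script can dump it to a file of JSON documents (one per line) that you can then upload to S3. This is only a sketch: the connection string, database name, and the purchases collection are assumptions, so substitute your own.

    import json
    from pymongo import MongoClient

    # Assumptions: adjust the connection string, database, and collection
    # names to match your own MongoDB deployment.
    client = MongoClient("mongodb://localhost:27017")
    db = client["myapp"]

    # Write one JSON document per line, a simple format to load later.
    with open("purchases.json", "w") as out:
        for doc in db["purchases"].find():
            # default=str turns ObjectIds and datetimes into plain strings.
            out.write(json.dumps(doc, default=str) + "\n")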


DBMS

To extract data from a database, use your database’s built-in extraction tools, such as PostgreSQL’s COPY command or MySQL’s SELECT ... INTO OUTFILE.

For more information, see Exporting Data from SQL Databases.
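
As one example of what such an export might look like, here is a rough sketch that pulls a signal table out of PostgreSQL into a tab-delimited file using psycopg2. The connection settings, table, and column names are assumptions; if you run MySQL or another database, its own export tool or an equivalent script works just as well.

    import csv
    import psycopg2

    # Assumptions: replace the connection details and the purchases table
    # and columns with your own database and signal tables.
    conn = psycopg2.connect(host="localhost", dbname="myapp", user="etl")
    cur = conn.cursor()
    cur.execute("SELECT user_id, item_id, created_at FROM purchases")

    # Write one tab-delimited row per signal, matching the format
    # recommendations later in this section.
    with open("purchases.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for row in cur:
            writer.writerow(row)

    cur.close()
    conn.close()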


Having determined which input signal data you need, it’s time to gather that data. This process is called “extraction”: retrieving data from a source system so that it can be processed downstream.

For this first extraction, don’t worry about making your process automated or repeatable. You just need a good copy of the data to start working from; we’ll get the process automated in a later tutorial.

General rules of thumb for this extraction are:

  • Get as much data as you can, within reason. The more data you can get, the better your recommendations will be. A few weeks’ to a month’s worth of data for each data set is a good target, depending on your daily volume.
  • Extract data into a simple data format. Tab-delimited text files or CSV are the best choice, as they’re easy for both humans and machines to read. JSON is also a good choice. XML works, but it can be harder to read and work with. If you have log data, feel free to keep it as-is without transformation. In general, steer clear of other, more esoteric formats, especially anything binary.
  • Cover the same timeframe. When extracting data from different systems, make sure the timeframes of the extracted data overlap. Otherwise, you may have trouble connecting data points (see the sketch after this list).
  • Don’t use compression. For this first extraction, leave your files uncompressed (no zip, gz, bzip, or tar files). If you must compress in order to successfully upload, bzip2 is the best and most Hadoop-friendly choice.
  • One file per dataset. Hadoop works best with a small number of large files, so extract one file for each dataset (even if it is large), rather than splitting datasets into smaller files.
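
One easy way to check the timeframe rule is to scan each extracted file and print the earliest and latest timestamps it contains, then eyeball whether the ranges overlap. The sketch below assumes tab-delimited files whose last column is an ISO-8601 timestamp, which may not match your layout.

    import csv
    import sys

    # Print the earliest and latest timestamp found in each extracted file.
    # Assumes tab-delimited rows whose last column is an ISO-8601 timestamp
    # (ISO-8601 strings sort correctly as plain text); adjust as needed.
    for path in sys.argv[1:]:
        timestamps = []
        with open(path) as f:
            for row in csv.reader(f, delimiter="\t"):
                if row:
                    timestamps.append(row[-1])
        if timestamps:
            print("%s: %s to %s" % (path, min(timestamps), max(timestamps)))

    # Usage: python check_timeframes.py purchases.tsv views.tsv wishlists.tsv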

Uploading Data to S3

Your recommendation engine will pick up input data from Amazon S3: a simple, inexpensive, and near-infinitely-large storage system at AWS. S3 stores data in “buckets,” which are similar to directories. Buckets contain files, which are called “objects” in S3. To learn more about S3, check out AWS’s S3 Details page.

There are four steps to get your extracted data uploaded to S3 and ready to use with Mortar:

  1. Find or create your AWS account
  2. Get your AWS access keys
  3. Upload your data to a new S3 bucket
  4. Set your AWS access keys in Mortar

We’ll explore each of these in order.

1. Find or Create an AWS Account

If you already have an Amazon Web Services (AWS) account and a login to the AWS Management Console, you can skip this portion and move to the next step. Otherwise, we’ll need to create an AWS account where you can upload your recommendation input data.

Creating an account at AWS is very easy. To do so, visit the AWS homepage, click “Sign Up,” provide your information, and create your account. If AWS asks which products you intend to use, be sure to select AWS S3. You’ll need to provide a credit card to AWS to cover any costs you incur, but note that AWS has a very generous free usage tier to get you started, and that S3 pricing is very low.

2. Get Your AWS Access Keys

Next, you’ll need to get your AWS Access Keys. These keys will allow you to create a new S3 bucket and upload your data to it.

There are two types of AWS Access Keys: account-level keys, which provide full account access, and fine-grained (IAM) keys, which provide access only to specific AWS resources. This tutorial will use account-level keys, but if you prefer the more complex IAM setup, you can follow these alternate setup steps for IAM.

To get your account-level AWS Access Keys:

  1. Go to the AWS Security Credentials page.
  2. Open the “Access Keys” section and click the “Create New Access Key” button.
  3. Expand the “Show Access Key” link, and write down your Access Key ID and Secret Access Key in a secure location.

Note that AWS allows only two pairs of access keys to be active at a time. If you already have two active pairs, you’ll need to look up the Secret Access Key for one of them on the Legacy Security Credentials page, or ask your IT department for it.

3. Upload Your Data to a New S3 Bucket

Now, we’re ready to upload our input data to a newly created S3 bucket. We’ll use the AWS Management Console to do this quickly and easily. (Check here for other upload options.)

First, create a new S3 bucket:

  1. Go to the S3 Management Console page in the AWS Management Console. If prompted, log in with your AWS username and password.
  2. Press the “Create Bucket” button to create a new bucket.
  3. Name your bucket, using dashes to separate words (e.g. mycompany-mortar-recs-data). Keep your bucket in the US Standard Region, where Mortar’s Hadoop clusters run, to ensure fast and free data transfer between Hadoop and S3.
  4. Press “Create Bucket” to make your new bucket.

Next, upload your extracted data files into the bucket (or see the scripted alternative after these steps):

  1. Click on the name of your newly created bucket in the S3 Management Console.
  2. Click the “Create Folder” button to create a new folder, and name it “input”.
  3. Click the “input” folder to open it up.
  4. Click the “Upload” button, select your files, and press “Start Upload” to upload them.
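
If you would rather script the last two steps than click through the console, something like the following boto3 sketch creates the bucket in US Standard (us-east-1) and copies each extracted file under an input/ prefix. The bucket name and file list are placeholders, and it assumes your access keys are available in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

    import boto3

    # Placeholders: use your own bucket name and extracted files.
    BUCKET = "mycompany-mortar-recs-data"
    FILES = ["purchases.tsv", "views.tsv", "wishlist_adds.tsv"]

    # boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
    # environment. Creating a bucket with no LocationConstraint places it
    # in us-east-1 (US Standard).
    s3 = boto3.client("s3")
    s3.create_bucket(Bucket=BUCKET)

    # Upload each file under the input/ prefix, mirroring the "input"
    # folder created in the console steps above.
    for path in FILES:
        s3.upload_file(path, BUCKET, "input/" + path)
        print("uploaded s3://%s/input/%s" % (BUCKET, path))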

4. Set Your AWS Access Keys in Mortar

While you are waiting for your data to upload, you should add your AWS Access Keys to your Mortar account on the Mortar AWS Settings page.

These keys will be stored encrypted at Mortar, allowing you to access your data in S3.

When the upload finishes, your input data will be stored in Amazon S3 and ready to load into Pig in the next step.