If your MongoHQ database is hosted in AWS us-east-1, use the Direct Mongo Connection strategy below. Mortar's clusters run in the AWS us-east-1 region, so they can quickly load data from and store data to your Mongo database.
If your database is hosted outside of AWS us-east-1 (in a different region or different cloud), use the Mongodump Data in S3 strategy below to connect to your data.
For this strategy, you will connect directly to your database in MongoHQ. If you have a replica set, connect to the secondary nodes to keep traffic off the primary. The connection is specified as a Mongo URI connection string.
You can create the appropriate users and find your connection string from the MongoHQ console.
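As a rough illustration of the general shape of such a connection string, the sketch below assembles a standard MongoDB URI in Python. The user, password, host, port, and database values are hypothetical placeholders, not MongoHQ values, and the helper name `build_mongo_uri` is made up for this example. `readPreference=secondary` is the standard MongoDB URI option for routing reads to secondary nodes. Whether Mortar expects further path components (such as a collection name) is not covered here, so treat the console's connection string as authoritative.

```python
from urllib.parse import quote_plus, urlencode


def build_mongo_uri(user, password, host, port, database, read_secondary=True):
    """Assemble a standard MongoDB connection URI.

    All argument values used in this example are hypothetical placeholders.
    """
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    uri = f"mongodb://{credentials}@{host}:{port}/{database}"
    if read_secondary:
        # Standard URI option that directs reads to secondary nodes.
        uri += "?" + urlencode({"readPreference": "secondary"})
    return uri


if __name__ == "__main__":
    print(build_mongo_uri("analyst", "p@ss:word", "db.example.com", 10001, "mydb"))
```

Encoding the credentials matters in practice: a password containing `@` or `:` would otherwise be misread as part of the host or port.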
This strategy lets you point Hadoop and Pig directly at MongoHQ backups stored in Amazon S3.
MongoHQ takes daily or weekly mongodump backups and stores them in your S3 bucket.
MongoHQ backups are stored tarred and gzipped in S3. In order to process them with Hadoop, they must first be un-tarred and un-gzipped.
Here are the steps to produce and download a MongoHQ backup:
Extract the downloaded archive by running: tar -zxvf mybackup.tgz
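If you would rather script the extraction than run tar by hand, the following is a minimal Python sketch of the same un-gzip and un-tar step. The function name `extract_backup` and the destination directory are hypothetical. It assumes the archive contains mongodump output, where each collection is written as a `<collection>.bson` file.

```python
import tarfile
from pathlib import Path


def extract_backup(archive_path, dest_dir):
    """Un-gzip and un-tar a backup archive, returning the extracted .bson files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" reads a gzip-compressed tar archive, like `tar -zxf`.
    with tarfile.open(archive_path, "r:gz") as archive:
        # Only extract archives you trust: member paths are not sanitized here.
        archive.extractall(dest)
    # mongodump writes one <collection>.bson file per collection.
    return sorted(dest.rglob("*.bson"))
```

Calling `extract_backup("mybackup.tgz", "mybackup")` would unpack the archive into a `mybackup` directory and return the paths of the BSON files found under it.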
Pig's Mongo BSON Loader will pick up input data from Amazon S3: a simple, inexpensive, and near-infinitely-large storage system at AWS. S3 stores data in “buckets,” which are similar to directories. Buckets contain files, which are called “objects” in S3. To learn more about S3, check out AWS’s S3 Details page.
You'll want to upload the BSON files you got from mongodump to an S3 bucket in your AWS account. That bucket must be in the US Standard region (us-east-1) for Mortar's Hadoop clusters to process it efficiently. You only need to upload the files for the collections you want to analyze; you can start with a single collection. There are three steps to get your BSON files uploaded to S3: create an AWS account, get your AWS Access Keys, and upload your data to an S3 bucket.
We’ll explore each of these in order.
If you already have an Amazon Web Services (AWS) account and a login to the AWS Management Console, you can skip this portion and move to the next step. Otherwise, we’ll need to create an AWS account where you can upload your input data.
Creating an AWS account is straightforward. Visit the AWS homepage, click “Sign Up,” provide your information, and create your account. If AWS asks which products you intend to use, be sure to select Amazon S3. You’ll need to provide a credit card to cover any costs you incur, but note that AWS has a generous free usage tier to get you started, and that S3 storage is inexpensive.
Next, you’ll need to get your AWS Access Keys. These keys will allow you to create a new S3 bucket and upload your data to it.
There are two types of AWS Access Keys: account-level keys that provide full account access and fine-grained (IAM) keys that provide access only to specific AWS resources. This tutorial will use account-level keys, but if you prefer IAM keys (more complex), you can follow these alternate setup steps for IAM.
To get your account-level AWS Access Keys:
Note that AWS only allows two pairs of access keys to be active at a time. If you already have two active pairs of keys, you’ll need to look up the Secret Access Key for one of them from the Legacy Security Credentials page, or talk to your IT department to get them.
Now, we’re ready to upload our input data to a newly created S3 bucket. We’ll use the AWS Management Console to do this quickly and easily. (Check here for other upload options.)
First, create a new S3 bucket:
Next, upload your extracted data files into the bucket:
While you are waiting for your data to upload, you should add your AWS Access Keys to your Mortar account on the Mortar AWS Settings page:
Mortar stores these keys encrypted so that you can access your data in S3 from Mortar.
When the upload finishes, your input data will be stored in Amazon S3 and ready to load into Pig.
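As one alternative to the console, the upload can be scripted. The sketch below is only an illustration under stated assumptions: `plan_uploads`, `upload`, and the `mongo-backup` key prefix are hypothetical names chosen for this example, not Mortar requirements. The planning step uses only the standard library. The upload step assumes the boto3 AWS SDK is installed and that your AWS Access Keys are available to it (for example through the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables).

```python
from pathlib import Path


def plan_uploads(local_dir, collections, key_prefix="mongo-backup"):
    """Map the .bson file for each wanted collection to an S3 object key.

    `key_prefix` is a hypothetical naming choice, not a Mortar requirement.
    """
    wanted = set(collections)
    plan = []
    for bson_file in sorted(Path(local_dir).rglob("*.bson")):
        # mongodump names each file <collection>.bson.
        if bson_file.stem in wanted:
            plan.append((bson_file, f"{key_prefix}/{bson_file.name}"))
    return plan


def upload(plan, bucket):
    """Upload each planned file; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so planning works without boto3 installed

    s3 = boto3.client("s3")
    for local_path, key in plan:
        # upload_file(Filename, Bucket, Key) copies a local file to an S3 object.
        s3.upload_file(str(local_path), bucket, key)
```

Separating planning from uploading lets you review which collections will be sent, and under which object keys, before any data leaves your machine.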