First, choose how you will connect to your MongoDB data. For a refresher on the different options, see How to Use Hadoop with MongoDB.
Here are our recommendations, based on where your database is hosted:
If your database is hosted at MongoLab or MongoHQ in the AWS us-east-1 region, we recommend connecting directly to a secondary node in your MongoDB replica set.
Otherwise, we recommend using the automated backup service that MongoLab and MongoHQ provide to store a mongodump backup in an Amazon S3 bucket in us-east-1, and then reading the data from there.
For more details, see:
If you host your own MongoDB database in the us-east-1 region of the AWS cloud, Mortar's clusters will be able to transfer data to and from your database efficiently and for free. In this case, we recommend using the direct connection strategy, and connecting to secondary nodes in your MongoDB replica sets.
To see how this is done, read the Direct Mongo Connection section of Connecting a Self-Hosted MongoDB to Mortar.
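As a rough illustration of what "connecting to a secondary" looks like at the connection-string level, here is a minimal Python sketch that assembles a standard MongoDB URI with the replicaSet and readPreference options. The hostnames, database name, credentials, replica set name, and the helper build_secondary_uri are all hypothetical placeholders; how Mortar itself consumes the URI is described in the guide referenced above, not here.

```python
from urllib.parse import quote_plus, urlencode


def build_secondary_uri(hosts, database, user, password, replica_set):
    """Build a MongoDB URI that routes reads to secondary members.

    Hypothetical helper for illustration; only the URI format and the
    replicaSet / readPreference options are standard MongoDB.
    """
    # Percent-encode credentials so characters like '@' or '/' are safe.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    options = urlencode(
        {"replicaSet": replica_set, "readPreference": "secondary"}
    )
    return f"mongodb://{credentials}@{','.join(hosts)}/{database}?{options}"


# All values below are placeholders, not real hosts or credentials.
uri = build_secondary_uri(
    hosts=["db1.example.com:27017", "db2.example.com:27017"],
    database="mydb",
    user="reader",
    password="p@ss/word",
    replica_set="rs0",
)
print(uri)
```

With readPreference=secondary, reads go only to secondary members, which keeps analytic load off the primary that serves your application.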
If you self-host in an AWS region other than us-east-1, use a different cloud provider, or run MongoDB on-premises, you can still process your data with Mortar. In fact, it's easy to do so.
Upload a recent mongodump backup to an Amazon S3 bucket in us-east-1, and then point Mortar at that data.
To do so, follow the Mongodump Data in S3 section of Connecting a Self-Hosted MongoDB to Mortar.
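The dump-and-upload step can be sketched as two commands: a mongodump run followed by a recursive copy to S3 with the AWS CLI. The Python sketch below only builds and prints those commands. The host, database, local directory, bucket, and prefix are hypothetical, as is the helper backup_commands, and the exact S3 layout Mortar expects is the one described in the guide above, which this example does not try to reproduce.

```python
import shlex


def backup_commands(mongo_host, database, dump_dir, bucket, prefix):
    """Return (dump, upload) command lists for a mongodump backup to S3.

    Hypothetical helper: it uses only the standard mongodump flags
    --host/--db/--out and `aws s3 cp --recursive --region`.
    """
    dump = ["mongodump", "--host", mongo_host, "--db", database,
            "--out", dump_dir]
    upload = ["aws", "s3", "cp", dump_dir, f"s3://{bucket}/{prefix}",
              "--recursive", "--region", "us-east-1"]
    return dump, upload


# Placeholder values; substitute your own host, database, and bucket.
dump_cmd, upload_cmd = backup_commands(
    mongo_host="db1.example.com:27017",
    database="mydb",
    dump_dir="/tmp/mongodump",
    bucket="my-backups",
    prefix="mongodump/latest",
)

# Print shell-safe versions of the commands for review.
for cmd in (dump_cmd, upload_cmd):
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the backup, each list could be passed to subprocess.run(cmd, check=True) on a machine that has mongodump and the AWS CLI installed and configured.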
Once you have a connection strategy in place, the next step is to load your data into Pig.