MongoDB was built for storing and retrieving items from large data sets, one at a time. Hadoop, on the other hand, was built for scalable data processing across huge data sets. So naturally, data processing is often better offloaded to Hadoop. Here’s why:
- The Hadoop community has created Pig, Hive, and the Cascading family of languages, all of which are expressive and high-level and let you bring your own custom code.
- Significant data processing on MongoDB can add substantial load to your production database. You often need an order of magnitude more power to process data in a timely way than you need to store it. Hadoop in the cloud handles this nicely, scaling up to meet your processing needs and back down to nothing when you're done.
- Hadoop was designed and built for fully distributed, multi-process execution, so it performs much, much better for this sort of work.
So if you want to easily write distributed jobs that perform well and don’t add load to your primary storage system, Hadoop is the way to go.
There are several options to analyze and report on your MongoDB data with Hadoop. These are all powered by the MongoDB Connector for Hadoop, which is integrated and distributed with Mortar.
The best strategy often depends on your particular use case and data. We'll explore each option here:
Connecting directly to your MongoDB database is the quickest option to get going: you simply connect using a standard Mongo URI.
To keep the load off the primary node in your replica set, most people connect to their secondary nodes. While reading from secondary nodes can mean you're running against slightly stale data (typically seconds to minutes behind your primary), it is an excellent way to read large amounts of data without affecting query performance on your primary nodes and in your application.
Alternatively, you can connect directly to your primary replica set node(s). If the amount of data you're reading is small compared to the typical traffic your database handles, any performance impact should be minor. However, if you're reading a large amount of data, especially data outside of the working set that other clients are using, you'll want to be careful not to degrade the performance of your application.
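Either way, the connection boils down to a standard Mongo URI. As a minimal sketch (the host names, database, and replica set name below are placeholders, not from this post), here is how a URI with a secondary read preference might be assembled:

```python
# Sketch: build a standard Mongo URI that prefers secondary reads.
# Hosts, database, and replica set name are placeholders.

def mongo_uri(hosts, db, replica_set, read_preference="secondary"):
    """Assemble a mongodb:// URI for the given replica set members."""
    return "mongodb://%s/%s?replicaSet=%s&readPreference=%s" % (
        ",".join(hosts), db, replica_set, read_preference)

uri = mongo_uri(["db1.example.com:27017", "db2.example.com:27017"],
                "mydb", "rs0")
# A driver (or the MongoDB Connector for Hadoop) would consume this URI as-is.
```

Pointing the URI at your secondaries with `readPreference=secondary` is what keeps the read traffic away from your primary.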
Often the simplest way to process your MongoDB data is to skip the database and read your MongoDB backups directly. If you take backups with mongodump, you can upload them to your Amazon S3 bucket and point Hadoop at them directly.
At cloud-based MongoDB providers like MongoLab and MongoHQ, mongodumps stored to Amazon S3 are already the default backup strategy. If you're hosting your own MongoDB, you only need to add an upload step and you're ready.
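The upload step can be as simple as shelling out to the AWS CLI after mongodump finishes. A hedged sketch, assuming the AWS CLI is installed and configured; the file, bucket, and key names are placeholders:

```python
import subprocess

def s3_copy_command(local_path, bucket, key):
    # Build the AWS CLI invocation that pushes a mongodump file to S3.
    # Bucket and key names are placeholders, not from the original post.
    return ["aws", "s3", "cp", local_path, "s3://%s/%s" % (bucket, key)]

def upload_dump(local_path, bucket, key):
    # Run the copy; requires AWS credentials to be configured.
    subprocess.check_call(s3_copy_command(local_path, bucket, key))

cmd = s3_copy_command("dump/mydb/users.bson", "my-backups", "mongo/users.bson")
```

Wiring `upload_dump` into the script that runs mongodump is the only extra moving part if you host MongoDB yourself.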
Using backup files keeps all of the data processing load off your MongoDB database. As long as your job can tolerate some staleness in the data, this solution works extremely well.
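What makes dumps easy to process is their format: a mongodump .bson file is just a stream of BSON documents, each prefixed with a little-endian int32 giving its total length. A minimal sketch of that framing (illustrative only, not the connector's actual reader):

```python
import struct

def split_bson(data):
    """Split a mongodump .bson byte stream into individual document blobs.

    Each BSON document begins with a little-endian int32 giving its total
    length, including the length field itself and the trailing null byte.
    """
    docs, offset = [], 0
    while offset < len(data):
        (length,) = struct.unpack_from("<i", data, offset)
        docs.append(data[offset:offset + length])
        offset += length
    return docs

# Two empty BSON documents back to back: int32 length 5, then a 0x00 terminator.
stream = b"\x05\x00\x00\x00\x00" * 2
chunks = split_bson(stream)
```

Because each document carries its own length, a dump file can be walked (and split for parallel processing) without consulting the database at all.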
In cases where you regularly run large jobs against your MongoDB cluster, it can be useful to have dedicated nodes that serve only those jobs and no other clients. The easiest way to do that is to use tag sets or hidden members to distinguish your analytics instances from the instances used by other clients.
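As a sketch of what such a member entry looks like (the host name is a placeholder), the member document you would include in a replica set reconfiguration has roughly this shape:

```python
# Sketch of a replica set member entry dedicated to analytics, mirroring the
# member document used in a replica set reconfiguration. Host is a placeholder.
analytics_member = {
    "_id": 3,
    "host": "analytics1.example.com:27017",
    "priority": 0,                 # keep this node from ever becoming primary
    "tags": {"use": "analytics"},  # target it with a matching tag-set read preference
}
# Alternatively, set "hidden": True to make the member invisible to normal
# client traffic entirely (hidden members also require priority 0).
```

Your analytics jobs then use a read preference that matches the tag, while ordinary application clients never route queries to that node.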