
How to Use Hadoop with MongoDB

Why Hadoop

MongoDB was built to store and retrieve large data sets, one item at a time. Hadoop, on the other hand, was built for scalable batch processing over huge data sets. So naturally, heavy data processing is often better offloaded to Hadoop. Here’s why:

Easier, more expressive languages

MongoDB supports native MapReduce written in JavaScript, but writing MapReduce jobs by hand is a colossal pain. MongoDB also offers the Aggregation Pipeline, but it is awkward for complex transformations and doesn't support custom code.

The Hadoop community has created Pig, Hive, and the Cascading family of languages, all of which are expressive and high-level, and let you bring your own custom code.
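To illustrate that expressiveness, here is a sketch of a per-user event count in Pig. The data location and field names are hypothetical; the equivalent logic in hand-written JavaScript MapReduce would require considerably more code.

```pig
-- Hypothetical example: count events per user in a few lines of Pig.
-- Paths and field names are placeholders.
events  = LOAD 's3://my-bucket/events' USING PigStorage('\t')
          AS (user_id:chararray, event_type:chararray);
by_user = GROUP events BY user_id;
counts  = FOREACH by_user GENERATE group AS user_id, COUNT(events) AS num_events;
STORE counts INTO 's3://my-bucket/output/event_counts';
```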

Keep the load off Mongo

If you’re doing significant data processing on MongoDB, it can add substantial load to your production database. You often need an order of magnitude more power to process data in a timely way than you need to store it. Hadoop in the cloud handles this nicely, scaling up to meet your processing needs and down to nothing when you're done.

Libraries to build on

Few popular data processing libraries are written in JavaScript, so you’ll often find yourself without access to the libraries you need, such as NumPy or NLTK. Almost any Python or Java library, by contrast, can be used directly or adapted for use with Hadoop.

Big performance improvements

Hadoop was designed and built for fully distributed, multi-process execution, so it performs much, much better for this sort of work.

So if you want to easily write distributed jobs that perform well and don’t add load to your primary storage system, Hadoop is the way to go.

How to Do It

There are several options to analyze and report on your MongoDB data with Hadoop. These are all powered by the MongoDB Connector for Hadoop, which is integrated and distributed with Mortar.

Determining the best strategy depends on your particular use case and data. We'll explore each option here:

Option #1: Connect to Mongo Directly

This is the quickest option to get going. You simply connect directly to your MongoDB database using a standard Mongo URI.

To keep the load off the primary node in your replica set, most people connect to their secondary nodes. While reading from secondary nodes can mean you're running against slightly stale data (typically seconds to minutes behind your primary), it is an excellent way to read large amounts of data without affecting query performance on your primary nodes and in your application.

Alternatively, you can connect directly to your primary replica set node(s). If the amount of data you're reading is small compared to the typical traffic your database handles, then any performance impact should be minor. However, if you are reading a large amount of data, especially data outside of the working set that other clients are using, you'll want to be careful not to degrade performance of your application.
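As a sketch, a direct connection from Pig might look like the following. This assumes the mongo-hadoop connector's Pig loader (`com.mongodb.hadoop.pig.MongoLoader`); the hostnames, credentials, database, and collection are placeholders.

```pig
-- Sketch: load a collection directly over a standard Mongo URI using
-- the MongoDB Connector for Hadoop (bundled with Mortar).
-- Hosts, credentials, and the db.collection name are placeholders.
-- readPreference=secondary routes reads to secondary nodes,
-- keeping load off the primary.
events = LOAD 'mongodb://username:password@host1.example.com:27017,host2.example.com:27017/mydb.events?readPreference=secondary'
         USING com.mongodb.hadoop.pig.MongoLoader();
```

With no arguments, the loader emits each document as a single map-typed field; a schema can also be declared if you want named, typed fields.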

Option #2: Read mongodump backup files from S3

Often the simplest way to process your MongoDB data is to skip the database and read your MongoDB backups directly. If you take backups with mongodump, you can upload them to your Amazon S3 bucket and point Hadoop at them directly.

At cloud-based MongoDB providers like MongoLab and MongoHQ, mongodumps stored to Amazon S3 are already the default backup strategy. If you're hosting your own MongoDB, you only need to add an upload step and you're ready.

Using backup files keeps all the data processing load off your MongoDB database. As long as your job can tolerate some staleness in your data, this solution works extremely well.
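A sketch of reading a mongodump backup from S3, again assuming the mongo-hadoop connector's Pig loader for BSON files (`com.mongodb.hadoop.pig.BSONLoader`); the bucket and paths are placeholders.

```pig
-- Sketch: read a mongodump .bson backup straight from S3,
-- bypassing the live database entirely.
-- Bucket and file paths are placeholders.
events = LOAD 's3://my-backups/dump/mydb/events.bson'
         USING com.mongodb.hadoop.pig.BSONLoader();
```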

Option #3: Connect to dedicated analytics nodes

In cases where you are regularly running large jobs against your MongoDB cluster, it can be useful to have dedicated nodes that serve just these jobs and no other clients. The easiest way to do that is to use tag sets or hidden members to distinguish your analytics instances from instances being used by other clients.
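Assuming you have tagged your analytics members in the replica set configuration (e.g. with a tag like `use: "analytics"`, a hypothetical name), MongoDB connection strings support a `readPreferenceTags` option that can target them. A sketch:

```pig
-- Sketch: target dedicated analytics members via replica set tags.
-- Assumes members were tagged (e.g. {use: "analytics"}) in the
-- replica set configuration; hosts and names are placeholders.
events = LOAD 'mongodb://host1.example.com:27017/mydb.events?readPreference=secondary&readPreferenceTags=use:analytics'
         USING com.mongodb.hadoop.pig.MongoLoader();
```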

Get Started!

That's a brief rundown of how to connect your MongoDB to Hadoop. This tutorial will walk you step-by-step through doing it on Mortar. We'll provide instructions and example code for each connection option above, along with special instructions for connecting to cloud Mongo providers like MongoLab.

Ready to get started? The first thing you'll want to do is quickly explore how Mortar works.