Mortar is a platform for doing data science and data engineering at scale in the cloud. As a software developer or data scientist, you can write scripts, algorithm libraries, and data pipelines that run across tens, hundreds, or thousands of nodes to process your data in parallel.
Using Mortar, you can build anything from a simple row-counting script to a machine-learning model to a full-fledged recommendation engine. The technologies you'll build upon, Hadoop and Apache Pig, are the open-source standard used at companies like Twitter, LinkedIn, and Netflix to do data science and data engineering at scale.
Mortar makes these technologies work together seamlessly and handles all of the infrastructure, operations, and deployment so you can focus on working with your data, not on the plumbing.
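To give a taste of what that code looks like, here is a minimal Pig row-counting script of the kind mentioned above. The S3 bucket, file path, and delimiter are hypothetical placeholders, not values from this tutorial:

```pig
-- Load every row of a tab-delimited file (hypothetical bucket and path)
rows = LOAD 's3n://my-bucket/data/events.tsv' USING PigStorage('\t');

-- GROUP ... ALL collapses the whole relation into a single group,
-- so COUNT yields the total number of rows
grouped = GROUP rows ALL;
total = FOREACH grouped GENERATE COUNT(rows) AS row_count;

STORE total INTO 's3n://my-bucket/output/row_count';
```

A handful of declarative statements like these is all Pig needs; Hadoop handles distributing the work across the cluster.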
We’ll dive into more detail as the tutorial progresses, but roughly speaking, there are three major pieces to working with Mortar:
One command installs the Mortar Development Framework (MDF) on your computer. It installs, integrates, and configures all of the software you need to develop Pig and Hadoop code locally.
The MDF also helps organize your code using Mortar Projects, a standard code organization scheme for Pig and Hadoop code.
Each Mortar Project is backed by a GitHub repository. When you’re ready, one command pushes your code to GitHub and then runs it on Hadoop via Mortar’s API.
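Sketched as a terminal session, the day-to-day loop looks roughly like this. The command names and flags below are illustrative assumptions about the Mortar CLI, not verbatim from this tutorial; check the CLI's built-in help for the exact invocations:

```
# Illustrative workflow (command names are assumptions)
mortar projects:create my_project            # create a Mortar Project backed by GitHub
mortar local:run pigscripts/my_script.pig    # develop and test locally via the MDF
mortar run pigscripts/my_script.pig          # push to GitHub, then run on Hadoop via Mortar's API
```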
Mortar runs your Pig code on private AWS Elastic MapReduce (EMR) Hadoop clusters, launched and destroyed at your request. Clusters run in Mortar’s AWS cloud, with all provisioning, monitoring, and operations managed by Mortar. We pass through the exact AWS costs for clusters with no upcharge to you.
Logging in to Mortar’s website, you’ll see full details about your job, including a visualization of its progress, real-time logs from Pig, and real-time logs from every machine in the Hadoop cluster. When your job finishes, it’s easy to find and download results from the Mortar site, and any error messages are fetched back from the cluster to help diagnose issues.
When your Pig code runs on Amazon EMR, it can connect to many different data storage systems. Most frequently, you’ll connect to Amazon S3 in your AWS account to load input data and store results back. Amazon EMR clusters are highly optimized to make this connection to S3 very fast.
Additionally, you can load and store data directly from a number of databases in the AWS cloud, including MongoDB, PostgreSQL, and DynamoDB.
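In practice, a Pig script expresses these connections as `LOAD` and `STORE` statements pointing at S3 URIs. A minimal sketch follows; the bucket names, schema, and cutoff timestamp are hypothetical examples, not part of this tutorial:

```pig
-- Read tab-delimited input from a hypothetical S3 bucket in your AWS account
raw = LOAD 's3n://my-input-bucket/logs/*.log'
      USING PigStorage('\t')
      AS (user_id:chararray, event:chararray, ts:long);

-- Keep only recent events (the cutoff is an arbitrary example timestamp)
recent = FILTER raw BY ts >= 1356998400L;

-- Write the filtered rows back to a hypothetical output bucket
STORE recent INTO 's3n://my-output-bucket/recent-events' USING PigStorage('\t');
```

Loaders for databases such as MongoDB and DynamoDB follow the same pattern, swapping in a different storage function in the `LOAD`/`STORE` statements.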
That’s a brief overview of how Mortar works. Now let's jump into using Google Drive and DataHero.