
How Mortar Projects Work

Mortar Projects provide a way to organize, develop, run, and share Hadoop code on Mortar. Let's dive a bit deeper into how they work.

Overview


You write code for a Mortar Project on your local computer with your favorite code editor. The Mortar Development Framework allows you to run your code without needing to install Hadoop, Pig, Python, or associated libraries to your computer.

Instead, it syncs your code and executes your commands in the cloud. Each time you perform an action on a pigscript in your Mortar Project (such as illustrate, validate, or run), the Mortar Development Framework takes a snapshot of your project's code. It sends the snapshot to a Mortar-managed repository on GitHub, and then runs that code via the Mortar API.

Because everything is run in the cloud, you only need to install git and the Mortar Development Framework. Mortar will keep the history of every job you've run, along with the exact git version of the code that ran. You can run different versions of code at the same time, and you will never lose track of what code produced what output.

Alternatively, you can use local development mode, which automatically installs everything you need to run Pig on your local machine. Everything it installs is contained within your Mortar project directory. Local mode makes illustrate operations much faster, which can speed up your development cycle. It also allows you to do local runs of Pig on small test datasets, which avoids the overhead of scheduling Hadoop jobs on a cluster.


Mortar Project Structure

Much like Ruby-on-Rails projects, Mortar Projects have a well-defined structure with a consistent place for everything.

Let's explore that structure via an example project. Open up the millionsong project code in your web browser.


Every project has the same basic structure, with a folder for each type of code you use in Mortar. This consistent structure makes it easy to organize code and share projects with other developers.

All Mortar projects have six top-level folders: pigscripts, udfs, macros, lib, luigiscripts, and params.

Pigscripts


The pigscripts folder contains Apache Pig code files.

Apache Pig is a simple, open-source data flow language for Hadoop. It is similar to SQL, allowing you to do joins, filters, aggregates, sorting, and the like. Unlike SQL though, it is written and executed step-by-step, making it easier to debug your code and write complex data flows.

Open up the top_density_songs pigscript as an example. Scroll through the Pig code--each statement builds on the statement before. If you're familiar with SQL, it should read easily: Load the data from S3, filter it, apply a Python calculation to each row, order the rows, retain the top 50 rows, store the top 50 rows back to S3.
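If you think more easily in Python than in Pig, here is a minimal sketch of the same step-by-step flow written as plain Python. It is not the pigscript itself and does not run on Hadoop. The field names (`num_segments`, `duration`) and the `density` calculation are hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, List


def density(song: Dict) -> float:
    """Hypothetical per-row calculation: segments per second of audio."""
    return song["num_segments"] / song["duration"]


def top_density_songs(songs: List[Dict], limit: int = 50) -> List[Dict]:
    # Filter: keep only rows that can be scored
    valid = [s for s in songs if s["duration"] > 0]
    # Apply a calculation to each row
    scored = [dict(s, density=density(s)) for s in valid]
    # Order: highest density first
    ordered = sorted(scored, key=lambda s: s["density"], reverse=True)
    # Retain only the top rows
    return ordered[:limit]


# "Load" step: a tiny in-memory stand-in for data a pigscript would read from S3
songs = [
    {"title": "a", "num_segments": 300, "duration": 100.0},
    {"title": "b", "num_segments": 900, "duration": 150.0},
    {"title": "c", "num_segments": 10, "duration": 0.0},
]

# "Store" step: print the rows instead of writing them back to S3
for row in top_density_songs(songs, limit=2):
    print(row["title"], row["density"])
```

Each intermediate name (`valid`, `scored`, `ordered`) plays the role of a Pig alias: you can inspect any one of them on its own, which is what makes a step-by-step data flow easy to debug.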

If you want to read a bit more about Pig, check out our Pig Help and Resources page.

User-Defined Functions (UDFs)


The udfs folder contains User-Defined Functions (UDFs) that can be applied to rows or groups of rows by Pig. UDFs can be written in Python, Jython, and Java.

Pig is a data-flow language, and it's great for standard operations like joining, filtering, and aggregating. However, when you need to do custom, domain-specific operations on your data, you can drop into a language like Python and write a user-defined function (UDF). UDFs enable rich computation (for example: linguistic analysis, sophisticated parsing, advanced math, sound analysis, etc.).

Open up the millionsong.py UDF to see a simplified example of a Python UDF. The density function you see does some text processing on a data field, extracting the number of segments in a song. This is a simple example, but UDFs can get as complex as you need.

Mortar currently supports Python UDFs written in Python 2.7, with access to the numpy, scipy, and nltk libraries for data science. It also supports UDFs written in Jython (a JVM-based implementation of Python) which trades library support for greatly improved performance. You can also use Java UDFs, which need to be compiled into jar files before they can be referenced. For more about writing UDFs, see Writing User Defined Functions in Popular Languages.
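To give a feel for the shape of a Python UDF, here is a minimal sketch. It is not the actual code in millionsong.py: the function body, the comma-separated input format, and the schema string are assumptions made for illustration. It also assumes the `outputSchema` decorator is importable from `pig_util` when the code runs under Pig, and falls back to a stand-in decorator so you can try the function in a plain Python session.

```python
try:
    # Assumed to be available when the UDF runs under Pig
    from pig_util import outputSchema
except ImportError:
    # Stand-in so the sketch also runs outside Pig; it only records the schema
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("density:double")
def density(segments_field, duration):
    """Hypothetical UDF: count the segments in a text field, per second of audio.

    segments_field is assumed to be a comma-separated string of segment
    start times, e.g. "0.0,0.5,1.2".
    """
    if not segments_field or not duration:
        return None  # Pig treats None as null
    num_segments = len([p for p in segments_field.split(",") if p.strip()])
    return num_segments / float(duration)


print(density("0.0,0.5,1.2,2.0", 2.0))
```

Returning `None` for missing input keeps bad rows from raising an error partway through a job; you can filter out the resulting nulls in your pigscript.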

Macros


The macros folder contains Pig macros: libraries of Pig functions that you can reuse in any pigscript. If you open up millionsong.pig, you'll see a few macros for loading data from the million song dataset. Whenever you want to load data in a pigscript, you can call one of these macros.

Libraries

To make it easy to bundle dependencies, you can include any needed jars in the lib folder of your project. Any jars in this location will automatically be registered and available for use in your Pig scripts or UDFs.

Luigiscripts

The luigiscripts folder contains all of the Luigi scripts for your project. Luigi is a Python workflow framework developed by Spotify. With it, you can build and manage complex pipelines of batch jobs.

For more information about how to use Luigi with Mortar, see Luigi with Mortar.

Params

By default, the params directory is empty. As you develop more complex Mortar projects, you will often end up with separate sets of parameters for different jobs and environments. The params folder is a convenient place to group those parameter sets into files.

For more information about how to use parameters with Mortar, see Local Development.
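As a rough illustration of what a parameter set looks like, here is a small sketch that reads one into a dictionary. It assumes the Pig-style parameter file layout: one `name = value` pair per line, with blank lines and `#` comment lines allowed. The parameter names and values in the sample are hypothetical.

```python
def parse_params(text: str) -> dict:
    """Parse Pig-style parameter file text into a dict of name -> value."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected name = value, got: " + raw)
        params[name.strip()] = value.strip()
    return params


# Hypothetical parameter set for a small test run
sample = """
# inputs for a quick test run
INPUT_PATH = s3n://my-bucket/sample/
OUTPUT_PATH = s3n://my-bucket/output/test
"""

print(parse_params(sample))
```

Keeping one file per job or environment means switching from a test run to a full run is a matter of pointing at a different file rather than editing the pigscript.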

The Mortar Project Manifest

The Mortar Development Framework automatically creates a file called project.manifest that describes which files and directories Mortar should sync with the cloud. By default, it includes only the standard Mortar directories necessary to run in the cloud: pigscripts, udfs, macros, and lib. This should work for almost all use cases. However, if you have extra resources that your code depends on, you can configure Mortar to sync these resources with the cloud.

  1. Put each resource in your Mortar project directory
  2. Add each resource's path relative to the Mortar project root as a line in project.manifest
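The two steps above can be sketched as a small helper. This is an illustration only, not part of the Mortar Development Framework: the function name is hypothetical, and it assumes project.manifest is a plain text file with one relative path per line.

```python
import tempfile
from pathlib import Path


def add_resource_to_manifest(project_root: str, resource: str) -> bool:
    """Append a project-relative path to project.manifest if it is not listed.

    Returns True if a line was added, False if the path was already present.
    """
    root = Path(project_root)
    # Step 1: the resource must already live inside the project directory
    if not (root / resource).exists():
        raise FileNotFoundError(resource + " not found under " + str(root))
    manifest = root / "project.manifest"
    lines = manifest.read_text().splitlines() if manifest.exists() else []
    if resource in lines:
        return False
    # Step 2: record the path, relative to the project root, on its own line
    lines.append(resource)
    manifest.write_text("\n".join(lines) + "\n")
    return True


# Demo in a throwaway directory standing in for a Mortar project
with tempfile.TemporaryDirectory() as project:
    Path(project, "project.manifest").write_text("pigscripts\nudfs\nmacros\nlib\n")
    Path(project, "resources").mkdir()
    add_resource_to_manifest(project, "resources")
    print(Path(project, "project.manifest").read_text())
```

Checking for an existing entry first keeps the manifest free of duplicate lines if you run the helper more than once.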

Syncing large files slows down every action you run, so avoid adding them to your project whenever possible. If your project does require a large file, upload it to S3 and reference its S3 URL from inside your Mortar project instead.


On the Mortar Website

You can also view your Mortar Projects, as well as the jobs and clusters you launch, on the Mortar website. The site is also where you manage your account: changing passwords, setting AWS keys, and the like.


Public vs. Private Mortar Projects

The difference between a public and a private Mortar Project is in how the project's source code is managed. Each Mortar Project is backed by a git repository, stored on GitHub in a Mortar organization. The git repository for a private Mortar Project is accessible only to the users in your Mortar account. The git repository for a public Mortar Project is public, so others may view and fork your project.


Sharing Mortar Projects

With Other Users In Your Account

Any users that you add to your account (from the Account Settings page on the Mortar website) will be able to clone and collaborate on your Mortar Project with you.

To clone an existing Mortar Project from your account, just execute the command:

mortar projects:clone MYPROJECT

With the Broader Community

Public Mortar Projects are backed by a public git repository and others will automatically be able to view and fork your project for their own use.

If you'd like to share a private Mortar Project more broadly, as we have with the millionsong project, you can still do so using GitHub.

Just create a new public repository in your personal GitHub account and push your code to it:

# add my personal git repository
git remote add personalgitrepo git@github.com:MYGITUSERNAME/myproject.git

# push it
git push personalgitrepo master

Other users on GitHub will be able to clone your repository and create their own Mortar Project to run it in their Mortar account.