You write code for a Mortar Project on your local computer with your favorite code editor. The Mortar Development Framework allows you to run your code without needing to install Hadoop, Pig, Python, or associated libraries on your computer.
Instead, it syncs your code and executes your commands in the cloud. Each time you perform an action on a pigscript in your Mortar Project (e.g., illustrate, validate, or run), the Mortar Development Framework takes a snapshot of your project's code. It sends the snapshot to a Mortar-managed repository in GitHub, and then runs that code via the Mortar API.
Because everything is run in the cloud, you only need to install git and the Mortar Development Framework. Mortar will keep the history of every job you've run, along with the exact git version of the code that ran. You can run different versions of code at the same time, and you will never lose track of what code produced what output.
Alternatively, you can use local development mode, which automatically installs everything you need to run Pig on your local machine. Everything it installs is contained within your Mortar project directory. Local mode makes illustrate commands much faster, which can speed up your development cycle. It also allows you to do local runs of Pig on small test datasets, which avoids the overhead of scheduling Hadoop jobs on a cluster.
Much like Ruby-on-Rails projects, Mortar Projects have a well-defined structure with a consistent place for everything.
Let's explore that structure via an example project. Open up the millionsong project code in your web browser.
Every project has the same basic structure, with a folder for each type of code you use in Mortar. This consistent structure makes it easy to organize code and share projects with other developers.
All Mortar projects have six top-level folders: pigscripts, udfs, macros, lib, luigiscripts, and params.
The pigscripts folder contains Apache Pig code files.
Apache Pig is a simple, open-source data flow language for Hadoop. It is similar to SQL, allowing you to do joins, filters, aggregates, sorting, and the like. Unlike SQL though, it is written and executed step-by-step, making it easier to debug your code and write complex data flows.
Open up the top_density_songs pigscript as an example. Scroll through the Pig code--each statement builds on the statement before. If you're familiar with SQL, it should read easily: Load the data from S3, filter it, apply a Python calculation to each row, order the rows, retain the top 50 rows, store the top 50 rows back to S3.
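The flow of that script can be mirrored in plain Python to make each step concrete. The sketch below is not the pigscript itself: the field names, sample rows, filter condition, and density calculation are all assumptions made up for illustration, and an in-memory list stands in for the S3 load and store.

```python
# Illustrative stand-in for a step-by-step Pig data flow.
# All field names, values, and formulas here are hypothetical.

def density(num_segments, duration):
    """Hypothetical per-row calculation: segments per second."""
    return num_segments / float(duration)

def top_density_songs(songs, limit=50):
    # FILTER: keep only rows with a usable duration
    valid = [s for s in songs if s["duration"] > 0]
    # FOREACH ... GENERATE: apply the Python calculation to each row
    scored = [dict(s, density=density(s["num_segments"], s["duration"]))
              for s in valid]
    # ORDER ... DESC: highest density first
    ordered = sorted(scored, key=lambda s: s["density"], reverse=True)
    # LIMIT: retain the top rows
    return ordered[:limit]

# LOAD: a small in-memory list stands in for data read from S3
songs = [
    {"title": "a", "num_segments": 900, "duration": 300.0},
    {"title": "b", "num_segments": 400, "duration": 100.0},
    {"title": "c", "num_segments": 10, "duration": 0.0},
]

# STORE: the real script writes to S3; here the result is kept in a variable
top = top_density_songs(songs, limit=2)
```

Each intermediate variable corresponds to one Pig statement, which is what makes this style easy to debug: you can inspect the rows after any single step.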
If you want to read a bit more about Pig, check out our Pig Help and Resources page.
The udfs folder contains User-Defined Functions (UDFs) that can be applied to rows or groups of rows by Pig. UDFs can be written in Python, Jython, and Java.
Pig is a data-flow language, and it's great for standard operations like joining, filtering, and aggregating. However, when you need to do custom, domain-specific operations on your data, you can drop into a language like Python and write a user-defined function (UDF). UDFs enable rich computation (for example: linguistic analysis, sophisticated parsing, advanced math, sound analysis, etc.).
Open up the millionsong.py UDF to see a simplified example of a Python UDF. The density function you see does some text processing on a data field, extracting the number of segments in a song. This is a simple example, but UDFs can get as complex as you need.
Mortar currently supports Python UDFs written in Python 2.7, with access to the numpy, scipy, and nltk libraries for data science. It also supports UDFs written in Jython (a JVM-based implementation of Python) which trades library support for greatly improved performance. You can also use Java UDFs, which need to be compiled into jar files before they can be referenced. For more about writing UDFs, see Writing User Defined Functions in Popular Languages.
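To give a feel for the shape of such a function, here is a minimal sketch of a density-style Python UDF. It is not the code from millionsong.py: the comma-separated input format, the parameter names, and the segments-per-second formula are assumptions. The `outputSchema` decorator is assumed to be importable from `pig_util` when the UDF runs under Pig; a no-op fallback is defined so the sketch also runs on its own.

```python
# Sketch of a Python UDF; input format and formula are assumptions.
try:
    # Assumed to be available when the UDF is executed by Pig
    from pig_util import outputSchema
except ImportError:
    # No-op fallback so the function can be run and tested outside Pig
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

@outputSchema('density:double')
def density(segments, duration):
    """Return segments per second, or None when the inputs are unusable.

    segments: hypothetical comma-separated string of segment start times
    duration: song length in seconds
    """
    if not segments or not duration:
        return None
    # Text processing: count the non-empty entries in the field
    count = len([s for s in segments.split(',') if s.strip()])
    return count / float(duration)
```

Returning None for missing or zero inputs lets Pig treat the result as a null rather than failing the whole job on one bad row.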
The macros folder contains Pig Macros: libraries of Pig functions that you can reuse in any pigscript. If you open up millionsong.pig, you'll see a few macros for loading data from the million song dataset. Whenever we want to load data in a pigscript, we can call one of these macros.
To make it easy to bundle dependencies, you can include any needed jars in the lib folder of your project. Any jars in this location will automatically be registered and available for use in your Pig scripts or UDFs.
The luigiscripts folder contains all of the Luigi scripts for your project. Luigi is a powerful and easy-to-use Python workflow framework developed by Spotify. With it, you can easily build and manage complex pipelines of batch jobs.
For more information about how to use Luigi with Mortar, see Luigi with Mortar.
By default the params directory is empty. As you develop more complex Mortar projects you will often end up having separate sets of parameters for different jobs and environments. The params folder is a convenient place to group your parameter sets into files.
For more information about how to use parameters with Mortar, see Local Development.
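As an illustration of keeping separate parameter sets per environment, the sketch below parses two small parameter files. It assumes the simple `name=value` layout with `#` comments that Pig parameter files use; the parameter names, values, and the idea of a "dev" and "prod" file are hypothetical.

```python
def parse_params(text):
    """Parse name=value lines into a dict, skipping blanks and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name, sep, value = line.partition('=')
        if not sep:
            raise ValueError("expected name=value, got: %r" % line)
        params[name.strip()] = value.strip()
    return params

# Hypothetical contents of two files kept in the params folder
dev_text = """
# small sample for quick runs
INPUT_PATH = s3n://my-bucket/sample/
OUTPUT_PATH = s3n://my-bucket/dev-output/
"""
prod_text = """
INPUT_PATH = s3n://my-bucket/full/
OUTPUT_PATH = s3n://my-bucket/prod-output/
"""

dev_params = parse_params(dev_text)
prod_params = parse_params(prod_text)
```

Both sets define the same names, so the same pigscript can run unchanged against either environment.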
The Mortar Development Framework automatically creates a file called project.manifest that describes which files and directories Mortar should sync with the cloud. By default, it includes only the standard Mortar directories necessary to run in the cloud, including lib. This should work for almost all use cases. However, if you have extra resources that your code depends on, you can configure Mortar to sync these resources with the cloud.
Because every action syncs your project to the cloud, avoid including large files in your project whenever possible. If your project does require a large file to run, upload that file to S3 and reference its S3 URL from inside your Mortar project.
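One way to spot files worth moving to S3 is to scan the project's directories for anything above a size threshold. The helper below is a hypothetical convenience script, not part of the Mortar Development Framework; the directory list and the threshold you pass in are your own choices.

```python
import os

def find_large_files(root, dirs, max_bytes):
    """Return sorted (relative_path, size) pairs for files larger than max_bytes.

    root: project directory
    dirs: sub-directories of root to scan (missing ones are skipped)
    """
    large = []
    for d in dirs:
        # os.walk yields nothing for a directory that does not exist
        for dirpath, _dirnames, filenames in os.walk(os.path.join(root, d)):
            for name in filenames:
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > max_bytes:
                    large.append((os.path.relpath(path, root), size))
    return sorted(large)
```

For example, `find_large_files(".", ["pigscripts", "udfs", "macros", "lib"], 10 * 1024 * 1024)` would list files over 10 MB in those folders of the current project.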
You can also view your Mortar Projects, as well as the jobs and clusters you launch, on the Mortar website. The site is also where you manage your account, for example changing your password or setting your AWS keys.
The difference between a public and private Mortar Project is in how the source code of the project is managed. Each Mortar Project is backed by a git repository, stored at GitHub in a Mortar Organization. The git repository for a private Mortar project has access limited to only the users in your Mortar account. The git repository for a public Mortar project is public and others may view and fork your project.
Any users that you add to your account (from the Account Settings page on the Mortar website) will be able to clone and collaborate on your Mortar Project with you.
To clone an existing Mortar Project from your account, just execute the command:
mortar projects:clone MYPROJECT
Public Mortar Projects are backed by a public git repository and others will automatically be able to view and fork your project for their own use.
If you'd like to share a private Mortar Project more broadly, as we have with the millionsong project, you can still do so using GitHub.
Just create a new public repository in your personal GitHub account and push your code to it:
# add my personal git repository
git remote add personalgitrepo git@github.com:MYGITUSERNAME/myproject.git
# push it
git push personalgitrepo master
Other users on GitHub will be able to clone your repository and create their own Mortar Project to run it in their Mortar account.