Running a Mortar Project

Once you know the basic pattern for developing Pig code with Mortar Projects, a few tips and tricks can make the process as fast and smooth as possible.

Running on the Full Dataset

We've already discussed that the way to run against your complete dataset is the jobs:run command.

mortar jobs:run pigscripts/my-sample-project.pig

Although this is probably your ultimate goal, you should avoid these kinds of runs until you are sure the script is working correctly. Illustrate is a great tool for checking that a script works, but there is more we can do to improve the development experience.
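As a quick reminder, checking the script with illustrate looks something like this (run from your project's root directory; the exact options available depend on your CLI version):

$ mortar illustrate pigscripts/my-sample-project.pig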


Small Local Data

The best way to do iterative development with Mortar is to create a small, local data set.

Why Small?

Illustrate does a good job of sampling to get you useful results quickly. However, if you are using a lot of filters and joins, you might find that you aren't seeing data in every step of the process. You may also get different data each time, which can be helpful, but sometimes makes debugging trickier.

If we instead create a small dataset, we make it more likely that illustrate will find the examples our script cares about. We also open up the possibility of doing a local run on that entire set of data and store the results to get a better picture of what's happening.

Why Local?

When using the local: commands (such as local:run), the primary cause of slowdown is pulling data down from S3. If your entire data set is local, you'll see better performance from every local: command.

Additionally, a local data set is helpful for orienting yourself to the data; it's always a good idea to take a look at your original data.
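One way to build that sample is a throwaway pigscript that reads the full S3 data and stores a small slice of it locally. Here's a sketch; the S3 path, schema, sampling rate, and row cap are all placeholders to replace with your own:

-- make_sample.pig (hypothetical): build a small local sample of the full dataset.
full_data = LOAD 's3n://my-bucket/my-input-data'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (field1: int, field2: chararray);

-- Keep roughly 1% of the records, capped at 1,000 rows.
sampled = SAMPLE full_data 0.01;
small = LIMIT sampled 1000;

-- Note: STORE writes a directory of part files at this path; LOAD reads it back the same way.
STORE small INTO '../data/data-sample.txt'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage();

Run it once (locally or on a small cluster) and then develop against the stored sample from there.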


Load From Local

Let's assume you have created a small local data set called data-sample.txt, and put it in the data folder. To load data from that folder instead of from S3, all you need to do is provide a relative path to the data.

data = LOAD '../data/data-sample.txt'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (field1: int, field2: chararray);

To make switching out data sets easier, you can make the location into a Pig parameter.

%default LOCAL_INPUT '../data/data-sample.txt'

data = LOAD '$LOCAL_INPUT'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (field1: int, field2: chararray);
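The same trick works for output. As a sketch (the LOCAL_OUTPUT parameter and the results alias are placeholders, not part of the script above):

%default LOCAL_OUTPUT '../data/output'

STORE results INTO '$LOCAL_OUTPUT'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage();

With both paths parameterized, switching between the local sample and the full S3 data only means overriding two parameters.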

Local Run

Now that your script loads a small subset of your data, you can use the local:run command to execute it without spinning up a cluster.

$ mortar local:run pigscripts/my-sample-project.pig -f params/coffee_tweets/local.large.param

By adding the -f option, we pass in a parameter file to set local paths, allowing us to read and write data from our machine without touching S3 at all. This is another good practice for speedy local development.
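A parameter file is just plain text in Pig's parameter-file format, one name=value pair per line. A minimal local parameter file, using the placeholder names from the sketches above, might look like:

# local.param (hypothetical): point the script at local paths
LOCAL_INPUT=../data/data-sample.txt
LOCAL_OUTPUT=../data/output

You might keep one such file for the local sample and another pointing at the full S3 paths, and switch between them with -f.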


Full Run with Authority

If your local run is successful, and the output data looks like what is expected, it's time to do a full run on a Hadoop cluster.

It can be hard to figure out how many nodes to run on: too few and your job takes a long time; too many and you're wasting money. There's no perfect answer to the "how many nodes" question, but a good rule of thumb is to start with a small cluster (say, 2-5 nodes) and see how the job progresses. If it's going extremely slowly, double the number of nodes and re-run. You'll probably end up doing several runs against your full data anyway, so you can try different cluster sizes until one seems about right.

To specify cluster size for your run, use the --clustersize option:

$ mortar jobs:run pigscripts/my-sample-project.pig --clustersize 5

To use an existing cluster you'll need the cluster_id, which you can find with:

$ mortar clusters

To run using that cluster, pass the --clusterid option:

$ mortar jobs:run pigscripts/my-sample-project.pig --clusterid MY_CLUSTER_ID

Saving Money with Spot Instance Clusters

If you want to save up to 70% on cluster costs, try using Spot Instance Clusters instead of the standard On-Demand Clusters. In exchange for much lower prices, Spot Instance Clusters take longer to launch and can (very rarely) disappear before you are finished using them.

To use spot clusters, just add the --spot switch to your command:

$ mortar jobs:run pigscripts/my-sample-project.pig --clustersize 3 --spot

To learn more, check out the Spot Instance Cluster documentation.


Selecting your Pig Version

Mortar supports Apache Pig versions 0.9 and 0.12. By default, Mortar commands use version 0.9. To use Apache Pig 0.12, add the --pigversion (or -g) option to your validate, describe, illustrate, or run commands:

$ mortar jobs:run pigscripts/my-sample-project.pig -g 0.12

To learn more about the Mortar supported versions of Pig, check out Pig on Mortar.


Setting Project Defaults

There are times when you may want to set a custom default value for your Mortar project. Two common cases: always using Apache Pig 0.12 instead of the default 0.9, and always running on Spot Instance Clusters instead of the standard On-Demand Clusters.

To set custom default values for a Mortar project you need to edit a file called project.properties in the root directory of your project. In this file you can set values you want to use by default for any Mortar option. Here's what it would look like to default to using Apache Pig version 0.12 and Spot Instance Clusters:

[DEFAULTS]
pigversion=0.12
spot=true

AT THIS POINT, YOU SHOULD BE ABLE TO:

  • Register a Mortar Project
  • Illustrate a pigscript in a Mortar Project
  • Run a job on a cluster using a Mortar Project
  • Create a new Mortar Project
  • REGISTER a UDF in a pigscript
  • Develop iteratively in a Mortar Project
  • Run a pigscript on local data

If you want to use Mortar projects with your own source control (without having each project be its own Git repo), see Using Your Own Source Control.

For more example projects showing different use-cases for Mortar, see Example Mortar Projects.