Mortar has joined Datadog, the leading SaaS-based monitoring service for cloud applications.

Local Development

The mortar local commands let you develop pigscripts on your own computer, speeding up your development process.

Fast Local Development

When you run a pigscript or luigiscript using Mortar, your code is uploaded to the cloud and a Hadoop cluster is created on demand to execute it. The time these steps take can slow you down when you are first developing a pigscript.

To shorten this feedback loop, several commands let you validate, illustrate, and run pigscripts on your local machine, giving you quicker results from the code you are writing. A similar command exists to run a luigiscript locally on your machine.


Tutorial

The Mortar example project includes a step-by-step tutorial for working with Mortar projects.


Local Commands

Pig

To allow for a faster pigscript development process, Mortar has the commands:

mortar local:validate SCRIPT
mortar local:illustrate SCRIPT
mortar local:run SCRIPT

which allow you to run Pig on your local machine without the overhead of a full Hadoop cluster. These commands take the same arguments as the validate, illustrate, and jobs:run commands, respectively. To see a list of arguments, enter mortar help local:run or mortar help local:illustrate.
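For example, a typical local check of a script might look like the following (the script name here is a placeholder for your own pigscript):

```shell
# Check the pigscript for syntax and type errors without running it
mortar local:validate pigscripts/my_script.pig

# Show sample data flowing through each alias in the script
mortar local:illustrate pigscripts/my_script.pig
```

Because these run entirely on your machine, you can iterate on a script many times without waiting for a cluster to start.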

The first time you run a local command for a project, it downloads and installs the latest Mortar distribution of the Pig version you are using, along with any other necessary dependencies, into the project folder. Alternatively, you can perform this installation ahead of time by running

mortar local:configure

After the initial install, local commands also check for updates to the Mortar distribution of your Pig version, so you always have the same features and bug fixes available locally as when you run on a Mortar-hosted Hadoop cluster.

Luigi

The corresponding command to run a luigiscript locally is:

mortar local:luigi SCRIPT

As with Pig, the first time you run this command, all of the dependencies you need for Luigi are installed automatically.


Working with Local Data

To keep local Pig runs as fast as possible, download a small test data set to your local machine instead of referencing remote files in locations such as S3. You can reference these files in the LOAD statements of your pigscripts using either a relative path (relative to the pigscript file) or an absolute path on your local filesystem.

With local development, you can also use a few Pig commands in your pigscripts that you cannot use on a cluster.

DUMP alias;      -- output contents of pig alias to console
DESCRIBE alias;  -- describe the schema of a pig alias
EXPLAIN alias;   -- show the logical, physical (optimized), and mapreduce plans
                 -- that pig will execute to generate this alias

DUMP is the most useful: it lets you see small outputs immediately without opening an output file. However, when you run your script in the cloud or against large data, remember to remove or comment out all DUMP commands.
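For instance, a short local debugging session might use these commands inline. This is a sketch only; the file path, schema, and aliases below are hypothetical and should be replaced with your own:

```pig
-- hypothetical local sample file; adjust the path for your project
songs = LOAD '../data/songs.tsv' USING PigStorage('\t')
        AS (artist:chararray, title:chararray, plays:int);

DESCRIBE songs;    -- print the schema of the songs alias

top_songs = ORDER songs BY plays DESC;
DUMP top_songs;    -- local-only: print the sorted records to the console
```

Once the script behaves as expected, remove the DESCRIBE and DUMP lines and replace them with a STORE statement before running on a cluster.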


Seamlessly Transition from Local Dev to Hadoop

If you use Pig parameters to specify your input files, you can run your pigscript either locally or on a Mortar Hadoop cluster without changing the code. For example, if you had a load statement such as:

data_2 = LOAD '$INPUT_DATA' USING PigStorage();

You could do a local Pig run using either:

# load my_data.txt from the Mortar project's root directory
mortar local:run pigscripts/my_script.pig --parameter INPUT_DATA=../my_data.txt

# load my_data.txt from absolute path in /tmp
mortar local:run pigscripts/my_script.pig --parameter INPUT_DATA=/tmp/my_data.txt

Then, when you're ready to run this script on a full Mortar Hadoop cluster, you can use the mortar jobs:run command and set INPUT_DATA to the S3 path of the full data set you wish to process.
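Concretely, the cloud run might look like the following (the bucket and path are placeholders for your own S3 location):

```shell
# Run on a Mortar-hosted Hadoop cluster, reading the full data set from S3
mortar jobs:run pigscripts/my_script.pig --parameter INPUT_DATA=s3n://my-bucket/my_data
```

The pigscript itself is unchanged; only the value of INPUT_DATA differs between the local and cloud runs.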

If you want to specify multiple parameters, it may be convenient to put them in parameter files. For example, create two files "local.params" and "cloud.params":

# in local.params
INPUT_DATA=../my_data.txt
OUTPUT_PATH=../output
NUMBER_PARAM=5

# in cloud.params
INPUT_DATA=s3n://my_bucket/my_data
OUTPUT_PATH=s3n://my_bucket/output
NUMBER_PARAM=10000

# run in local mode
mortar local:run pigscripts/my_script.pig -f local.params

# run on a 10-node cluster
mortar jobs:run pigscripts/my_script.pig -f cloud.params -s 10