When you run a pigscript using Mortar, your code is uploaded to the cloud and a Hadoop cluster is created on demand to execute it. The time these steps take can slow you down when you are initially developing a pigscript. To shorten this feedback loop, Mortar provides several commands that illustrate, validate, and run pigscripts on your local machine, giving you quicker results from the code you are writing.
To allow for a faster development process, Mortar has the commands:

mortar local:validate SCRIPT
mortar local:illustrate SCRIPT
mortar local:run SCRIPT

which allow you to run Pig on your local machine without the overhead of a full Hadoop cluster. These commands take the same arguments as their corresponding cloud commands, such as jobs:run. To see a list of arguments, enter mortar help local:run or mortar help local:illustrate.
The first time you run a local command for a project, it downloads and installs the latest Mortar distribution of Pig, along with any other necessary dependencies, inside the project folder. You can alternatively perform this installation ahead of time.
In addition to the first-time install, local commands also periodically check for updates to the Mortar Pig distribution, ensuring you are always up to date with the latest features and bug fixes available when you run on a Mortar-hosted Hadoop cluster.
To keep local Pig runs as fast as possible, you should download a small test data set to your local system instead of referencing remote files in locations such as S3. You can reference these files in the LOAD statements of your pigscripts using either a relative path (relative to the pigscript file) or an absolute path on your local filesystem.
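For example, a LOAD statement pointing at a local test file might look like the following sketch (the file name and schema here are hypothetical):

```pig
-- load a local test file, with a path relative to the pigscript
test_data = LOAD '../my_data.txt'
            USING PigStorage('\t')
            AS (user_id:chararray, score:int);
```

The same statement works with an absolute path such as '/tmp/my_data.txt' instead of the relative one.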
With local development, you can also use a few Pig commands in your pigscripts that you cannot use on a cluster:
DUMP alias;      -- output contents of a pig alias to the console
DESCRIBE alias;  -- describe the schema of a pig alias
EXPLAIN alias;   -- show the logical, physical (optimized), and mapreduce plans
                 -- that pig will execute to generate this alias
DUMP is the most useful of these, as it lets you see small output immediately without having to open an output file. When you want to run your script in the cloud or against large data, however, remember to remove or comment out all DUMP commands.
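One common pattern while developing locally (sketched here against a hypothetical alias) is to DUMP only a small sample of an alias via LIMIT, then comment both lines out before running on a cluster:

```pig
-- preview the first few records while developing locally;
-- comment these lines out before a cloud run
preview = LIMIT data_2 10;
DUMP preview;
```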
If you use Pig parameters to specify your input files, you can run your pigscript either locally or on a Mortar Hadoop cluster without having to change the code. For example, if you had a load statement such as:
data_2 = LOAD '$INPUT_DATA' USING PigStorage();
you could do a local Pig run using either:
# load my_data.txt from the Mortar project's root directory
mortar local:run pigscripts/my_script.pig --parameter INPUT_DATA=../my_data.txt

# load my_data.txt from an absolute path in /tmp
mortar local:run pigscripts/my_script.pig --parameter INPUT_DATA=/tmp/my_data.txt
Then, when you're ready to run this script on a full Mortar Hadoop cluster, use the mortar jobs:run command and set INPUT_DATA to the S3 path of the full data set you wish to process.
If you want to specify multiple parameters, it may be convenient to put them in parameter files. For example, create two files "local.params" and "cloud.params":
# in local.params
INPUT_PATH=../my_data.txt
OUTPUT_PATH=../output
NUMBER_PARAM=5

# in cloud.params
INPUT_PATH=s3n://my_bucket/my_data
OUTPUT_PATH=s3n://my_bucket/output
NUMBER_PARAM=10000

# run in local mode
mortar local:run pigscripts/my_script.pig -f local.params

# run on a 10-node cluster
mortar jobs:run pigscripts/my_script.pig -f cloud.params -s 10
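A pigscript written against these parameters might look like the following sketch, assuming the INPUT_PATH, OUTPUT_PATH, and NUMBER_PARAM names defined above; the same script then runs unchanged in local and cloud mode:

```pig
-- load from whichever input the active parameter file specifies
data   = LOAD '$INPUT_PATH' USING PigStorage();

-- NUMBER_PARAM controls how much data flows through (5 locally, 10000 in the cloud)
sample = LIMIT data $NUMBER_PARAM;

-- write to a local directory or an S3 bucket, depending on the parameter file
STORE sample INTO '$OUTPUT_PATH' USING PigStorage();
```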