We've already discussed that the way to do a run on a complete dataset is the jobs:run command:
$ mortar jobs:run pigscripts/my-sample-project.pig
Although this is probably your ultimate goal, you want to avoid doing these kinds of runs until you are sure the script is working correctly. Illustrate is a great tool for checking that a script is working, but there is more we can do to improve our development experience.
The best way to do iterative development with Mortar is to create a small, local data set.
Illustrate does a good job of sampling to get you useful results quickly. However, if you are using a lot of filters and joins, you might find that you aren't seeing data in every step of the process. You may also get different data each time, which can be helpful, but sometimes makes debugging trickier.
If we instead create a small dataset, we make it more likely that illustrate will find the examples our script cares about. We also open up the possibility of doing a local run on that entire set of data and storing the results to get a better picture of what's happening.
When using the mortar local commands, the primary cause of slowdown is pulling data down from S3. If your entire data set is local, you'll see increased performance on all of them.
Additionally, a local data set can be helpful in orienting yourself to the data; it's always a good idea to take a look at your original data.
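One way to build such a sample, assuming a full copy of the data already sits on your machine, is to keep every Nth line so the sample spans the whole file rather than just its beginning. The following is only a sketch: the function name sample_every_nth and the file name full-data.txt are hypothetical, and it assumes the file has no header row.

```shell
# sample_every_nth SRC DEST N
# Copy line 1, line N+1, line 2N+1, ... of SRC into DEST.
# The function name and arguments are illustrative, not part of Mortar.
sample_every_nth() {
    src=$1
    dest=$2
    step=$3
    # Create the destination folder (e.g. data/) if it does not exist yet.
    mkdir -p "$(dirname "$dest")"
    # Keep a line whenever its zero-based index is a multiple of step.
    awk -v k="$step" '(NR - 1) % k == 0' "$src" > "$dest"
}
```

For example, sample_every_nth full-data.txt data/data-sample.txt 1000 would keep roughly one line in a thousand. Because the selection is deterministic, repeated runs give the same sample, which helps when debugging.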
Let's assume you have created a small local data set called
data-sample.txt, and put it in the
data folder. To load data from that folder instead of from S3, all you need to do is provide a relative path to the data.
data = LOAD '../data/data-sample.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (field1: int, field2: chararray);
To make switching out data sets easier, you can make the location into a Pig parameter.
%default LOCAL_INPUT '../data/data-sample.txt'
data = LOAD '$LOCAL_INPUT' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (field1: int, field2: chararray);
Now that you have a small subset of your data loading, you can use the
local:run command to execute your script on the data subset without needing to spin up a cluster.
$ mortar local:run pigscripts/my-sample-project.pig
$ mortar local:run -f params/coffee_tweets/local.large.param
By adding the -f option, we pass in a parameter file to set local paths, allowing us to read and write data from our machine without touching S3 at all. This is another good practice for speedy local development.
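A Pig parameter file is a plain text file with one NAME=VALUE pair per line. As a rough sketch, a file for local runs could be created as follows. The file name params/local.param and the LOCAL_OUTPUT parameter are assumptions made for illustration; only LOCAL_INPUT appears in the script above.

```shell
# Write a hypothetical parameter file for local runs.
# LOCAL_INPUT matches the %default declared in the script;
# LOCAL_OUTPUT is an assumed extra parameter, shown for illustration only.
mkdir -p params
cat > params/local.param <<'EOF'
# Local paths, written relative in the same way as the LOAD statement
LOCAL_INPUT=../data/data-sample.txt
LOCAL_OUTPUT=../data/output
EOF
```

Passing this file with -f params/local.param would then override the %default value of LOCAL_INPUT for that run.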
If your local run is successful, and the output data looks like what is expected, it's time to do a full run on a Hadoop cluster.
It can be hard to figure out how many nodes to run on: too few and your job will take a long time; too many and it's wasted money. There isn't a perfect answer to the "how many nodes" question, but a good rule of thumb is to start with a small number of nodes (say, 2-5), and see how the job progresses. If it's going extremely slowly, double the number of nodes and re-run. Ultimately you will probably do several runs with your full data, so you'll be able to try different cluster sizes until it seems about right.
To specify cluster size for your run, use the --clustersize option:
$ mortar jobs:run pigscripts/my-sample-project.pig --clustersize 5
To use an existing cluster you'll need the
cluster_id, which you can find with:
$ mortar clusters
To run using that cluster, pass the --clusterid option:
$ mortar jobs:run pigscripts/my-sample-project.pig --clusterid MY_CLUSTER_ID
If you want to save up to 70% on cluster costs, try using Spot Instance Clusters instead of the standard On-Demand Clusters. In exchange for much lower prices, Spot Instance Clusters take longer to launch and can (very rarely) disappear before you are finished using them.
To use spot clusters, just add the
--spot switch to your command:
$ mortar jobs:run pigscripts/my-sample-project.pig --clustersize 3 --spot
To learn more, check out the Spot Instance Cluster documentation.
Mortar supports Apache Pig version 0.9 and version 0.12. By default Mortar commands will use version 0.9. To use Apache Pig 0.12, add the '--pigversion' or '-g' option to your validate, describe, illustrate, or run commands:
$ mortar jobs:run pigscripts/my-sample-project.pig -g 0.12
To learn more about the Mortar supported versions of Pig, check out Pig on Mortar.
There are times when you may want to set a custom default value for your Mortar project. Two common cases are when you always want to use Apache Pig version 0.12 instead of the default version of 0.9 or when you always want to run with Spot Instance Clusters instead of the standard On-Demand Clusters.
To set custom default values for a Mortar project you need to edit a file called project.properties in the root directory of your project. In this file you can set values you want to use by default for any Mortar option. Here's what it would look like to default to using Apache Pig version 0.12 and Spot Instance Clusters:
[DEFAULTS]
pigversion=0.12
spot=true
AT THIS POINT, YOU SHOULD BE ABLE TO:
If you want to use Mortar projects with your own source control (without having each project be its own Git repo), see Using Your Own Source Control.
For more example projects showing different use-cases for Mortar, see Example Mortar Projects.