Mortar has joined Datadog, the leading SaaS-based monitoring service for cloud applications. Read more about what this means here.

Pigscript Tasks

Pigscript Tasks run Pig (Hadoop) jobs on Mortar.

mortar-luigi

These Tasks are part of the mortar-luigi project, an open-source collection of extensions for using Mortar from within Luigi. It is installed automatically when you use the mortar local:luigi or mortar luigi commands.

MortarProjectPigscriptTask

Runs a Pig script in a Mortar project. This task is idempotent, so restarting it will pick up where your running job left off without causing a new job to run.

Example Usage

class MyBasicPigscriptTask(mortartask.MortarProjectPigscriptTask):

    # cluster size to use
    cluster_size = luigi.IntParameter(default=2)

    def token_path(self):
        """
        Luigi manages dependencies between tasks by checking for 
        the existence of files. When one task finishes it writes
        out a 'token' file that will trigger the next task in the
        dependency graph. This is the base path for
        where those tokens will be written.
        """
        return 's3://my-bucket/my-path-to-token-output'

    def requires(self):
        """
        The requires method is how you build your dependency graph
        in Luigi. Luigi will not run this task until all tasks 
        returning in this list are complete.
        """
        return []

    def script_output(self):
        """
        Any location provided here will be cleaned up 
        if your job fails. This ensures that this Task 
        will stay idempotent for future runs.
        """
        return [S3Target('s3://my-bucket/path-where-i-output-data')]

    def script(self):
        """
        Name of the script to run.
        """
        return 'my-pigscript'

The most important element is defining the script you want to run.

cluster_size, as its name implies, is the size of cluster the job should be run on. If there is an idle cluster of this size or larger the existing cluster will be used, otherwise Luigi will launch a new cluster.

The script_output method takes a list of output directories; if the script fails for some reason any partial data in these directories will be removed.

token_path is the location where the Luigi task will store its status for other tasks to find.

requires is a method that needs to be defined for every Luigi task, and it indicates dependencies for the task. This can either be other Luigi tasks, data in a specified location, or nothing.

More Options

class MyAllOptionsPigscriptTask(mortartask.MortarProjectPigscriptTask):
    # s3 path to the folder where the input data is located
    input_base_path = luigi.Parameter()

    # s3 path to the output folder
    output_base_path = luigi.Parameter()

    # cluster size to use
    cluster_size = luigi.IntParameter(default=2)

    def token_path(self):
        return self.output_path

    def default_parallel(self):
        return (self.cluster_size - 1) * mortartask.NUM_REDUCE_SLOTS_PER_MACHINE

    def script(self):
        """
        Name of the script to run.
        """
        return 'my-pigscript'


    def requires(self):
        return [S3PathTask(self.input_path)]

    def script_output(self):
        return [S3Target(self.output_path)]

    def parameters(self):
        return {'MY_PARAMETER': 'PARAM_VALUE',
                'default_parallel': self.default_parallel()}

This task uses input_base_path and output_base_path as parameters to allow the same task to be run on a recurring basis without needing to change the code. It also sets default_parallel: an option that can be used to make Pig jobs run more efficiently. It's then passed in the parameters method, which defines the same Pig parameters you would have on the command line or in a .params file.

This task uses the S3PathTask() class to look for source data in S3 in the requires method, and the S3Target class to describe where in S3 the output data will go.

Example in Context

For more complete examples of using Luigi to run Pigscripts, check out the mortar-recsys project. Note that the Luigi scripts in that project run a number of Pigscripts that share common attributes (ex. token path), so we make use of a base class to apply those attributes to all the Pigscripts in the pipeline.