Write Your Own Pig Load Function

Though Mortar provides Pig Loaders for most standard data types, there may come a time when your data doesn't fit into a standard format. Mortar provides templates that make it easy to get started writing your own Pig load function (LoadFunc) in Java.

Download Template Project

A large part of the effort of writing a Pig Loader can be getting all of your dependent jars in a row. To ease that difficulty, Mortar provides a template maven project that takes care of all the dependency management and boilerplate code. It also provides examples and test classes to serve as guidelines for a new Loader.


Compile Sample Project

If you don't already have maven installed, follow the instructions to download and install it here. Once you have maven installed, navigate to the template project directory and run the following to ensure the project compiles.

mvn clean package

Maven will download the necessary artifacts and put a jar file in the /target directory. If you see a BUILD SUCCESS message then the project has built successfully.


Write a Loader

Writing a Pig Loader involves a lot of boilerplate code. To minimize that, template Loader functions have been provided in the sample project that already have all the necessary scaffolding. TemplateLoader is the simplest of the three, with the minimal code needed to get a Pig Loader running. TemplateLoaderPushProjection implements an optimization to only load those fields that are needed for the script; this can be a source of significant performance improvement for some Loaders and data sets. Finally, TemplateLoaderStaticSchema is a Loader that returns a schema that it either knows or determines from the data, thus eliminating the need to declare a schema with each use.

TemplateLoader

LOAD 'input' USING com.mortardata.pig.TemplateLoader()
        AS (f1: chararray, f2: int);

The TemplateLoader is the simplest and most basic of the templates. All Pig Loaders extend LoadFunc, which requires that they implement four methods.

setLocation(): Pig calls this method to communicate the load location. This method may end up being called multiple times, and thus it must not do anything that would be problematic if done repeatedly. No changes needed.

getInputFormat(): This method tells Pig which InputFormat to use for reading input data. The implemented function handles .bz and .bz2 files specially to ensure that they are split correctly for better performance. No changes needed if your data is newline-delimited text.

prepareToRead(): This method is called before reading an InputSplit. No changes needed.

getNext(): Pig calls this method to get the next Tuple into the processing pipeline. Returning null indicates that the split has been fully read. The pattern of this method is to call reader.getCurrentValue() once to return a single record, parse that record, and return the relevant values in a Tuple. However, it is possible to retrieve multiple records in a single getNext() call by running reader.getCurrentValue() repeatedly.

This method must be modified to parse your data. It should return a tuple of DataByteArray objects; Pig will then be able to cast these into the correct types based on the input schema. The ExampleLoader provides a very simple example of parsing a comma-separated set of values.
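
To make the pattern concrete, here is a minimal sketch of what those four methods can look like for newline-delimited, comma-separated text. This is not the template or ExampleLoader code itself: the class name MyCsvLoader is made up for illustration, and it uses a plain TextInputFormat rather than the template's bzip-aware input handling.

// A minimal sketch of a LoadFunc for comma-separated text. The class name
// MyCsvLoader is hypothetical; the Mortar templates add bzip-aware input
// handling and other details omitted here.
package com.mortardata.pig;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyCsvLoader extends LoadFunc {
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    private RecordReader reader;

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Tell Hadoop where to read from; safe to call repeatedly.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Plain newline-delimited text, one record per line.
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;  // this split is fully read
            }
            // Parse one comma-separated line into a tuple of DataByteArrays;
            // Pig casts them to the types given in the AS (...) schema.
            String line = ((Text) reader.getCurrentValue()).toString();
            String[] fields = line.split(",");
            Tuple tuple = tupleFactory.newTuple(fields.length);
            for (int i = 0; i < fields.length; i++) {
                tuple.set(i, new DataByteArray(fields[i].getBytes("UTF-8")));
            }
            return tuple;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}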

TemplateLoaderPushProjection

LOAD 'input' USING com.mortardata.pig.TemplateLoaderPushProjection()
    AS (f1: chararray, f2: int);

One of the first optimizations that can be done to a Loader is to guarantee that it only loads the data needed for the Pig script it's being used in. The TemplateLoaderPushProjection implements the Pig interface LoadPushDown, which requires implementing two additional methods.

getFeatures(): Returns the list of operators that the Loader can support. No changes needed.

pushProjection(): This takes a list of required fields provided by Pig and does whatever processing is required for the other Loader methods to handle them correctly. This generally involves putting the list into an appropriate format, and storing it such that it can be used later. This method may not be called if all fields are required, so the Loader should not assume that it has been. No changes needed.

getNext(): This method should now be modified to only include the required fields in the return Tuple, if a required field list has been provided.

setUDFContextSignature(): This is not explicitly part of the LoadPushDown interface, but it is an essential piece. Pig has two modes of operating: the front end (planning) and the back end (distributed running). Any information that becomes available after object construction but is needed across all machines must be stored in a UDFContext object. This is the mechanism for propagating necessary fields from the front end to all of the back end instantiations. Note that in the example, the prepareToRead() method sets the requiredFields object in the udfContext, because it needs to be available to all instantiations. The reader object is not put in the udfContext, because each instantiation will have its own reader.
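
As a rough illustration of how these pieces fit together, here is a sketch of the push-projection methods, building on the hypothetical MyCsvLoader from the previous sketch rather than on the actual template. In this version, pushProjection() stores the required-field list in the UDFContext; getNext() would then read it back on the back end and emit only those fields.

// A sketch of the LoadPushDown pieces, extending the hypothetical MyCsvLoader
// from the previous example; not the Mortar template itself.
package com.mortardata.pig;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.pig.LoadPushDown;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.util.ObjectSerializer;
import org.apache.pig.impl.util.UDFContext;

public class MyProjectionCsvLoader extends MyCsvLoader implements LoadPushDown {
    private String signature;

    @Override
    public void setUDFContextSignature(String signature) {
        // The signature keys the UDFContext slot shared by the front end
        // and back end instantiations of this loader.
        this.signature = signature;
    }

    @Override
    public List<OperatorSet> getFeatures() {
        // Only projection push-down is supported.
        return Arrays.asList(OperatorSet.PROJECTION);
    }

    @Override
    public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList)
            throws FrontendException {
        // Called on the front end: store the required-field list in the
        // UDFContext so the back end instantiations can retrieve it and
        // return only those fields from getNext().
        try {
            Properties props = UDFContext.getUDFContext()
                    .getUDFProperties(getClass(), new String[] { signature });
            props.setProperty("requiredFields",
                    ObjectSerializer.serialize(requiredFieldList));
            return new RequiredFieldResponse(true);
        } catch (IOException e) {
            throw new FrontendException("could not store required fields: " + e.getMessage());
        }
    }
}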

TemplateLoaderStaticSchema

LOAD 'input' USING com.mortardata.pig.TemplateLoaderStaticSchema();

When writing a Loader for your data alone, it can be convenient for the Loader to assume the correct schema, so that the schema doesn't need to be stated every time. Or it may be the case that your data contains schema information within it that should be passed to Pig. The TemplateLoaderStaticSchema implements the LoadMetadata interface, which tells Pig that this Loader will define the metadata related to the data to be loaded. This interface requires implementing four additional methods. (Note that this template also implements push projection.)

getSchema(): Provides the schema for the data. This method should be modified to return your schema, whether it is a fixed schema or can be extracted from the data.

getStatistics(): Returns statistics about the data to be loaded if the information is available, otherwise returns null. No changes needed.

getPartitionKeys(): Returns those fields on which the data is partitioned. No changes needed.

setPartitionFilter(): Allows Pig to pass filters for fields returned in getPartitionKeys() such that the Loader will return only those values reflected in the filter. No changes needed.
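
Here is a sketch of what the LoadMetadata methods can look like for a loader with a fixed schema, again extending the hypothetical MyCsvLoader rather than showing the actual template code.

// A sketch of the LoadMetadata methods for a loader whose schema is fixed,
// extending the hypothetical MyCsvLoader; not the Mortar template itself.
package com.mortardata.pig;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.Expression;
import org.apache.pig.LoadMetadata;
import org.apache.pig.ResourceSchema;
import org.apache.pig.ResourceStatistics;
import org.apache.pig.impl.util.Utils;

public class MyFixedSchemaCsvLoader extends MyCsvLoader implements LoadMetadata {

    @Override
    public ResourceSchema getSchema(String location, Job job) throws IOException {
        // This loader always produces the same two fields; a loader for
        // self-describing data would inspect 'location' instead.
        return new ResourceSchema(Utils.getSchemaFromString("f1: chararray, f2: int"));
    }

    @Override
    public ResourceStatistics getStatistics(String location, Job job) throws IOException {
        return null;  // no statistics available
    }

    @Override
    public String[] getPartitionKeys(String location, Job job) throws IOException {
        return null;  // the data is not partitioned
    }

    @Override
    public void setPartitionFilter(Expression partitionFilter) throws IOException {
        // Nothing to do: getPartitionKeys() returned null, so Pig will not
        // push any partition filters to this loader.
    }
}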

More information on Pig Loaders can be found in O'Reilly's Programming Pig book.


Testing

The PigUnit framework aids in testing Loaders: it effectively allows you to write Pig statements, execute them within a Java JUnit test, and check the outcome. TestTemplateLoader and TestExampleLoader show examples of this: creating a data location, loading the data using the newly written loader, and then testing the output Tuples.
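
As a rough sketch of this style of test, the example below runs Pig in local mode via PigServer against the hypothetical MyCsvLoader from earlier; the project's TestTemplateLoader and TestExampleLoader classes remain the authoritative reference for how the template's tests are structured.

// A sketch of a loader test using PigServer in local mode; the loader class
// MyCsvLoader is the hypothetical example from earlier in this page.
package com.mortardata.pig;

import static org.junit.Assert.assertEquals;

import java.io.File;
import java.io.PrintWriter;
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;
import org.junit.Test;

public class TestMyCsvLoader {

    @Test
    public void loadsCommaSeparatedRecords() throws Exception {
        // Create a small comma-separated input file in a temporary location.
        File input = File.createTempFile("loader-test", ".csv");
        input.deleteOnExit();
        PrintWriter writer = new PrintWriter(input, "UTF-8");
        writer.println("hello,1");
        writer.println("world,2");
        writer.close();

        // Load the data with the new loader, running Pig in local mode.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("data = LOAD '" + input.getAbsolutePath()
                + "' USING com.mortardata.pig.MyCsvLoader() AS (f1: chararray, f2: int);");

        // Compare the printed form of each output tuple with the expected values.
        Iterator<Tuple> tuples = pig.openIterator("data");
        assertEquals("(hello,1)", tuples.next().toString());
        assertEquals("(world,2)", tuples.next().toString());
    }
}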

To run a single test in maven, set the test property.

mvn -Dtest=TestExampleLoader test

Use Loader

To use your new loader, move the compiled jar file into the udfs/java directory of your Mortar project. You can then register the jar in Pig via the REGISTER statement:

REGISTER ../udfs/java/my_jar.jar;

At this point you can use your Loader in the same way that you would use any of the provided Loader functions:

LOAD 'input' USING com.mortardata.pig.ExampleLoader()
    AS (f1: chararray, f2: int);

Additional Reference

For more info on writing your own Pig LoadFunc, see Chapter 11: Writing Load and Store Functions in the Programming Pig book.
