A large part of the effort of writing a Pig Loader is getting all your dependent jars in a row. To ease that difficulty, Mortar provides a template Maven project that takes care of all the dependency management and boilerplate code. It also provides examples and test classes to serve as guidelines for a new Loader.
If you don't already have Maven installed, follow the instructions here to download and install it. Once Maven is installed, navigate to the template project directory and run the following command to ensure the project compiles:
mvn clean install
Maven will download the necessary artifacts and put a jar file in the /target directory. If you see a
BUILD SUCCESS message, the project has built successfully.
Writing a Pig Loader involves a lot of boilerplate code. To minimize that, template Loader functions have been provided in the sample project that already have all the necessary scaffolding.
TemplateLoader is the simplest of the three, with the minimal code needed to get a Pig Loader running.
TemplateLoaderPushProjection implements an optimization to only load those fields that are needed for the script; this can be a source of significant performance improvement for some Loaders and data sets. Finally,
TemplateLoaderStaticSchema is a Loader that returns a schema that it either knows or determines from the data, thus eliminating the need to declare a schema with each use.
LOAD 'input' USING com.mortardata.pig.TemplateLoader() AS (f1: chararray, f2: int);
The TemplateLoader is the most basic of the templates. All Pig Loaders extend
LoadFunc, which requires that they implement four methods.
setLocation(): Pig calls this method to communicate the load location. This method may end up being called multiple times, and thus it must not do anything that will be problematic if done repeatedly. No changes needed.
getInputFormat(): This method tells Pig which
InputFormat to use for reading input data. The provided implementation correctly handles .bz and .bz2 files so that they split properly, improving performance. No changes needed if your data is newline-delimited text.
prepareToRead(): This method is called before reading an InputSplit. No changes needed.
getNext(): Pig calls this method to get the next Tuple into the processing pipeline. Returning null indicates that the split has been fully read. The pattern of this method is to call
reader.getCurrentValue() once to return a single record, parse that record, and return the relevant values in a Tuple. However, it is possible to retrieve multiple records in a single
getNext() call by advancing the reader with reader.nextKeyValue() more than once before returning.
This method must be modified to parse your data. It should return a tuple of DataByteArray objects; Pig will then be able to cast these into the correct types based on the input schema. The ExampleLoader provides a very simple example of parsing a comma-separated set of values.
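The getNext() pattern above can be sketched without the Pig runtime. In this sketch the class and record values are hypothetical, and plain Java strings stand in for Pig's Tuple and DataByteArray types; a real loader would wrap each field in a DataByteArray and assemble a Tuple.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the parsing step inside getNext(): take the
// single record returned by reader.getCurrentValue(), split it on commas,
// and return the fields in order.
public class GetNextSketch {
    static List<String> parseRecord(String record) {
        // -1 keeps trailing empty fields, so empty columns are preserved
        return Arrays.asList(record.split(",", -1));
    }

    public static void main(String[] args) {
        System.out.println(parseRecord("alice,42")); // prints [alice, 42]
    }
}
```

Because the fields are returned untyped, Pig can cast them according to whatever schema the script declares in its AS clause.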
LOAD 'input' USING com.mortardata.pig.TemplateLoaderPushProjection() AS (f1: chararray, f2: int);
One of the first optimizations that can be done to a Loader is to guarantee that it only loads the data needed for the Pig script it's being used in. The TemplateLoaderPushProjection implements the Pig interface
LoadPushDown, which requires implementing two additional methods.
getFeatures(): Returns the list of operators that the Loader can support. No changes needed.
pushProjection(): This takes a list of required fields provided by Pig and does whatever processing is required for the other Loader methods to handle them correctly. This generally involves putting the list into an appropriate format, and storing it such that it can be used later. This method may not be called if all fields are required, so the Loader should not assume that it has been. No changes needed.
getNext(): This method should now be modified to only include the required fields in the return Tuple, if a required field list has been provided.
setUdfContextSignature(): This is not explicitly part of the
LoadPushDown interface, but it is an essential piece. Pig has two modes of operating—front end (planning) and back end (distributed running). Any information that becomes available after object construction but is needed across all machines must be stored in a UDFContext object. This is a mechanism to propagate necessary fields from the front end to all the back end instantiations. Note that in the example, the
prepareToRead() method sets the
requiredFields object in the udfContext, because it needs to be available to all instantiations. The
reader object is not put in the udfContext, because each instantiation will have its own reader.
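The projection logic that getNext() needs can be sketched as follows. The class name is hypothetical and plain Java lists stand in for Pig tuples; the null check mirrors the rule above that pushProjection() may never be called when all fields are required.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of applying a pushed projection in getNext():
// keep only the field positions Pig said it needs.
public class ProjectionSketch {
    static List<String> project(List<String> allFields, int[] requiredIndexes) {
        // pushProjection() was never called: every field is required
        if (requiredIndexes == null) {
            return allFields;
        }
        List<String> projected = new ArrayList<>();
        for (int i : requiredIndexes) {
            projected.add(allFields.get(i));
        }
        return projected;
    }

    public static void main(String[] args) {
        List<String> fields = List.of("alice", "42", "nyc");
        System.out.println(project(fields, new int[]{0, 2})); // prints [alice, nyc]
    }
}
```

In the real Loader, the requiredIndexes would be read back from the UDFContext in prepareToRead(), since that is how the list computed on the front end reaches each back-end instantiation.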
LOAD 'input' USING com.mortardata.pig.TemplateLoaderStaticSchema()
When writing a Loader for your own data alone, it can be convenient to have the Loader assume the correct schema rather than stating it with every use. Or it may be the case that your data contains schema information within it that should be passed to Pig. The TemplateLoaderStaticSchema implements the
LoadMetadata interface, which tells Pig that this Loader will define the metadata related to the data to be loaded. This interface requires implementing four additional methods. (Note that this template also implements push projection.)
getSchema(): Provides the schema for the data. This method should be modified to return your schema, whether it is a fixed schema or can be extracted from the data.
getStatistics(): Returns statistics about the data to be loaded if the information is available, otherwise returns null. No changes needed.
getPartitionKeys(): Returns those fields on which the data is partitioned. No changes needed.
setPartitionFilter(): Allows Pig to pass filters for fields returned in
getPartitionKeys() such that the Loader will return only those values reflected in the filter. No changes needed.
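One way getSchema() can determine a schema from the data is to read a header line. This sketch is hypothetical: it only extracts (name, type) pairs from a header string, and a real implementation would then convert those pairs into the ResourceSchema object that Pig expects.

```java
import java.util.LinkedHashMap;

// Hypothetical sketch: derive ordered (field name, type) pairs from a
// header line such as "f1:chararray,f2:int", defaulting to bytearray
// when a column declares no type.
public class SchemaSketch {
    static LinkedHashMap<String, String> schemaFromHeader(String header) {
        LinkedHashMap<String, String> schema = new LinkedHashMap<>();
        for (String column : header.split(",")) {
            String[] parts = column.trim().split(":");
            schema.put(parts[0], parts.length > 1 ? parts[1] : "bytearray");
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(schemaFromHeader("f1:chararray,f2:int"));
        // prints {f1=chararray, f2=int}
    }
}
```

A fixed-schema Loader is even simpler: getSchema() can return the same hard-coded schema every time, which is what lets the LOAD statement above omit the AS clause.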
More information on Pig Loaders can be found in O'Reilly's Programming Pig book.
A PigUnit framework exists to aid in testing Loaders. This framework allows you to write Pig statements, execute them within a Java JUnit test, and check the outcome. TestTemplateLoader and TestExampleLoader show examples of this: creating a data location, loading the data using the newly written Loader, and then testing the output Tuples.
To run a single test in Maven, set the test property:
mvn -Dtest=TestExampleLoader test
To create a jar file, run
mvn clean install in the root directory of the template project. This will create a jar file in the /target directory.
In order for Mortar to access the jar, it needs to be moved to a location in S3. This can be any bucket that you are allowed to upload data to and is accessible using your IAM keys.
To tell your Pig script about the new jar, use the REGISTER statement with the S3 path to the jar.
At this point you can use your Loader in the same way that you would use any of the provided Loader functions:
LOAD 'input' USING com.mortardata.pig.ExampleLoader() AS (f1: chararray, f2: int);