Writing Java UDFs

For a performance-critical Pig UDF, Java is generally much faster than either Python or Jython. Writing a Java UDF requires building a jar on your local machine, but the speed improvement is often worthwhile.

Download Template Project

A large part of the effort of writing a Java UDF can be getting all of the jars it depends on lined up. To ease that difficulty, Mortar provides a template Maven project that takes care of the dependency management so you can begin developing UDFs immediately.


Compile Sample Project

If you don't already have Maven installed, follow the instructions here to download and install it. Once Maven is installed, navigate to the template project directory and run the following to confirm that the project compiles:

mvn clean package

Maven will download the necessary artifacts and put a jar file in the project's target directory. A BUILD SUCCESS message means the project has built successfully.


Write a UDF

All Java UDFs extend EvalFunc<T>, where T is the type the UDF returns. Your UDF may return a scalar, a map, or one of the Pig-specific data types DataBag or Tuple, depending on what you want the UDF to do.

Your UDF must implement the exec(Tuple input) method, and it should also implement outputSchema(Schema input) if it returns a DataBag or a Tuple.

exec

The exec function is where the guts of your UDF go. It takes a Tuple as input and returns a value of the declared type T.

The first thing this function needs to do is verify and extract the data it will operate on. The exec function always takes a Tuple as input, so if your UDF is meant to operate on a single String, the first lines of code in exec should verify that the tuple has exactly one value and that this value can be cast to a String.

Once you have the necessary pieces of input data verified and extracted, you can write the rest of the exec function to perform the desired action on the data.
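
The verify-then-extract step described above might look like the following sketch. It is not a complete UDF: so that it runs without Pig on the classpath, a java.util.List<Object> stands in for Pig's Tuple (both expose size() and get(int)), and the names SingleStringInput and extract are hypothetical. In a real UDF the same checks would sit at the top of exec(Tuple input).

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper; a List<Object> stands in for org.apache.pig.data.Tuple.
final class SingleStringInput {

    private SingleStringInput() {
    }

    /**
     * Verifies that the input holds exactly one field and that the field is a
     * String, then returns it. A null field is passed through as null.
     * Throws IOException on bad input, the same exception type that exec declares.
     */
    static String extract(List<Object> input) throws IOException {
        if (input == null || input.size() != 1) {
            throw new IOException("Expected exactly one argument");
        }
        Object field = input.get(0);
        if (field == null) {
            return null;
        }
        if (!(field instanceof String)) {
            throw new IOException(
                "Expected a String but got " + field.getClass().getName());
        }
        return (String) field;
    }
}
```

Whether a null field should be passed through or rejected is a design choice that depends on what the UDF is meant to do; this sketch passes it through so the caller can decide.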

outputSchema

Pig needs to know what type of object the UDF returns. For maps and scalars it works this out through reflection, but when a DataBag or a Tuple is returned, the schema must be passed back explicitly through the outputSchema method.

More information on Java UDFs can be found in the Pig Documentation, or in O'Reilly's Programming Pig book.


Testing

Because there are a few steps involved in packaging the UDF and putting it in place, you'll want to make sure it's unit tested first. For an example unit test, see TestIdentityUDF in the template project; this class runs the exec function on a test Tuple.

To run a single test in Maven, set the test property:

mvn -Dtest=TestIdentityUDF test

Use UDF

To use your new UDF, move the compiled jar file into the udfs/java directory of your Mortar project. You can then register the jar in Pig via the REGISTER statement:

REGISTER ../udfs/java/my_jar.jar;

At this point you can use your UDF in your Pig script the same way you use any of Pig's built-in functions. Remember that the UDF name needs to be fully qualified; to get around this, it's often easiest to DEFINE an alias at the beginning of your script.

DEFINE MyUDF the.full.package.path.MyUDF();

DEFINE is also the place to supply constructor arguments for your UDF, if it takes any. Pig passes these arguments as quoted strings, so the UDF's constructor must accept String parameters. For example, if the UDF takes a time parameter in minutes you might define both of the following.

DEFINE MyUDF_5 the.full.package.path.MyUDF('5');
DEFINE MyUDF_10 the.full.package.path.MyUDF('10');
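
Because the constructor receives its argument as text, a parameterized UDF usually parses and checks the value once, at construction time. The sketch below shows only that step in plain Java so that it runs without Pig on the classpath. The class name MinutesParam and its accessor are hypothetical; in a real UDF the String constructor would belong to the class that extends EvalFunc<T>.

```java
// Hypothetical stand-in for the constructor of a parameterized UDF.
final class MinutesParam {

    private final long windowMillis;

    // Receives the literal text given in DEFINE, for example "5".
    MinutesParam(String minutes) {
        int parsed = Integer.parseInt(minutes.trim());
        if (parsed <= 0) {
            throw new IllegalArgumentException(
                "minutes must be positive: " + minutes);
        }
        // Convert once here so per-record code does not repeat the work.
        this.windowMillis = parsed * 60_000L;
    }

    long getWindowMillis() {
        return windowMillis;
    }
}
```

Failing fast in the constructor means that a bad argument surfaces when the UDF is instantiated, not partway through processing records.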