A large part of the effort of writing a Java UDF can be getting all of its dependent jars lined up. To ease that difficulty, Mortar provides a template Maven project that takes care of the dependency management and lets you begin developing UDFs immediately.
If you don't already have Maven installed, follow the Maven installation instructions to download and install it. Once Maven is installed, navigate to the template project directory and run the following to ensure the project compiles.
mvn clean package
Maven will download the necessary artifacts and put a jar file in the /target directory. If you see a BUILD SUCCESS message, the project has built successfully.
All Java UDFs extend EvalFunc<T>. Your UDF may return a scalar, a map, or the Pig-specific data types of DataBag or Tuple. This depends entirely on what you want your UDF to do.
The exec(Tuple input) method must be implemented in your UDF, and the outputSchema(Schema input) method should be implemented if your UDF returns a DataBag or a Tuple.
The exec function is where the guts of your UDF go. It takes a Tuple as input and returns the declared type T.
The first thing this function needs to do is verify and extract the data it will operate on. The exec function always takes a Tuple as input, so if your UDF is meant to operate on a single String, the first lines of code in exec should verify that the tuple has exactly one value and that this value can be cast to a String.
Once you have the necessary pieces of input data verified and extracted, you can write the rest of the exec function to perform the desired action on the data.
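The verification step described above can be sketched as follows. This is a minimal illustration, not the template project's code: to keep it runnable without Pig on the classpath, a plain List<Object> stands in for the tuple's fields (Pig's Tuple exposes them through getAll()), and the class name SingleStringInput and its extract method are hypothetical.

```java
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper showing the checks a single-String UDF would run
 * at the top of exec. The List<Object> stands in for Tuple.getAll().
 */
final class SingleStringInput {

    private SingleStringInput() {
        // Static helper only; not meant to be instantiated.
    }

    /**
     * Returns the single field as a String, or null when the field is null.
     * Throws IllegalArgumentException if the field count or type is wrong.
     */
    static String extract(List<Object> fields) {
        if (fields == null || fields.size() != 1) {
            int count = (fields == null) ? 0 : fields.size();
            throw new IllegalArgumentException(
                    "Expected exactly one field but got " + count);
        }
        Object value = fields.get(0);
        if (value == null) {
            // Assumption: a null field is passed through as a null result.
            return null;
        }
        if (!(value instanceof String)) {
            throw new IllegalArgumentException(
                    "Expected a String but got " + value.getClass().getName());
        }
        return (String) value;
    }

    public static void main(String[] args) {
        // Stand-in for a one-field tuple holding a String.
        List<Object> fields = Collections.<Object>singletonList("hello");
        System.out.println(extract(fields));
    }
}
```

In a real UDF the same checks would sit at the start of exec(Tuple input), for example by calling extract(input.getAll()). Whether bad input should raise an error or simply produce a null result is a design choice for your UDF.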
Pig needs to know what type of object is being returned by the UDF. In the case of maps and scalars it accomplishes this through reflection, but when a DataBag or a Tuple is returned, that information needs to be explicitly passed back through the outputSchema method.
Because there are a few steps involved in packaging the UDF and putting it in place, you'll want to make sure it's unit tested first. For an example unit test, see TestIdentityUDF in the template project; this class runs the exec function on a test Tuple.
To run a single test in Maven, set the -Dtest option to the name of the test class:
mvn -Dtest=TestIdentityUDF test
To use your new UDF, move the compiled jar file into the udfs/java directory of your Mortar project. You can then register the jar in Pig with a REGISTER statement.
At this point you can use your UDF in your Pig script the same way you use any of Pig's built-in functions. Remember that the UDF name must be fully qualified; to avoid repeating the full name, it is often easiest to DEFINE an alias at the beginning of your script.
DEFINE MyUDF the.full.package.path.MyUDF();
This is also the place to supply the constructor arguments of your UDF, if it has any. Pig expects these arguments as quoted strings. For example, if the UDF takes a time parameter in minutes, you might define both of the following.
DEFINE MyUDF_5 the.full.package.path.MyUDF('5');
DEFINE MyUDF_10 the.full.package.path.MyUDF('10');
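Constructor arguments given in DEFINE reach the UDF's constructor as strings, so a parameterized UDF typically parses them once at construction time. The sketch below shows that parsing in isolation so it runs without Pig. The class name MinutesArgument, its default of 5 minutes, and its methods are all hypothetical; in practice the same constructors would live on the UDF class itself.

```java
/**
 * Hypothetical holder for a "minutes" argument such as the one passed
 * in DEFINE. In a real UDF these constructors would belong to the UDF class.
 */
final class MinutesArgument {

    private final int minutes;

    /** Used when no argument is given; the default of 5 is an assumption. */
    MinutesArgument() {
        this("5");
    }

    /** Parses the string form of the argument and rejects invalid values. */
    MinutesArgument(String minutes) {
        if (minutes == null) {
            throw new IllegalArgumentException("minutes must not be null");
        }
        int parsed;
        try {
            parsed = Integer.parseInt(minutes.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "minutes must be an integer but was '" + minutes + "'", e);
        }
        if (parsed <= 0) {
            throw new IllegalArgumentException(
                    "minutes must be positive but was " + parsed);
        }
        this.minutes = parsed;
    }

    /** The configured window length in minutes. */
    int minutes() {
        return minutes;
    }

    /** The same window length in milliseconds, handy for timestamp arithmetic. */
    long millis() {
        return minutes * 60_000L;
    }

    public static void main(String[] args) {
        // Mirrors an alias defined with the argument '10'.
        System.out.println(new MinutesArgument("10").millis());
    }
}
```

Parsing and validating in the constructor means a bad argument fails when the UDF is instantiated, before any rows are processed, instead of surfacing later inside exec.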