Much of the effort of writing a Java UDF can go into lining up all the jars it depends on. To ease that difficulty, Mortar provides a template Maven project that takes care of the dependency management and lets you begin developing UDFs immediately.
If you don't already have Maven installed, follow the instructions here to download and install it. Once Maven is installed, navigate to the template project directory and run the following to ensure the project compiles:
mvn clean install
Maven will download the necessary artifacts and put a jar file in the /target directory. If you see a BUILD SUCCESS message, the project has built successfully.
All Java UDFs extend EvalFunc<T>. Your UDF may return a scalar, a map, or the Pig-specific data types of DataBag or Tuple. This depends entirely on what you want your UDF to do.
The exec(Tuple input) method must be implemented in your UDF, and the outputSchema(Schema input) method should be implemented if your UDF returns a DataBag or a Tuple.
The exec function is where the guts of your UDF go. It takes a Tuple as input and returns the declared type T.
The first thing this function needs to do is verify and extract the data it will operate on. The exec function always takes a Tuple as input, so if your UDF is meant to operate on a single String, the first lines of code in exec should verify that the tuple has exactly one value and that this value can be successfully cast to a String.
Once you have the necessary pieces of input data verified and extracted, you can write the rest of the exec function to perform the desired action on the data.
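As an illustration of the verify-then-operate pattern described above, here is a minimal sketch of a UDF that upper-cases a single String. The class name UpperCaseUDF and its behavior are hypothetical and are not part of the template project; the sketch assumes the Pig jar is on the classpath, as it would be inside the template project.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical example UDF: returns its single chararray argument in upper case.
public class UpperCaseUDF extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Verify: the tuple must hold exactly one value.
        if (input == null || input.size() != 1) {
            throw new IOException("UpperCaseUDF expects exactly one argument");
        }

        // Extract: a null field is passed through as null.
        Object value = input.get(0);
        if (value == null) {
            return null;
        }

        // Verify: the value must be a String (a Pig chararray).
        if (!(value instanceof String)) {
            throw new IOException(
                "UpperCaseUDF expects a chararray but got " + value.getClass().getName());
        }

        // Operate: the actual work of the UDF.
        return ((String) value).toUpperCase();
    }
}
```

Whether bad input should raise an exception, as here, or quietly return null is a design choice; throwing fails the job, while returning null lets the script filter out bad records later.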
Pig needs to know what type of object is being returned by the UDF. In the case of maps and scalars it accomplishes this through reflection, but when a DataBag or a Tuple is returned, that information needs to be explicitly passed back through the outputSchema function.
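To make the schema declaration concrete, the following sketch shows a UDF that returns a two-field Tuple and describes it to Pig. The class name WordLengthUDF and the field aliases word_length, word, and length are hypothetical names chosen for illustration; the sketch assumes the Pig jar is on the classpath.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Hypothetical example UDF: maps a chararray to the tuple (word, length).
public class WordLengthUDF extends EvalFunc<Tuple> {

    private static final TupleFactory TUPLES = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1 || input.get(0) == null) {
            return null;
        }
        String word = (String) input.get(0);

        Tuple output = TUPLES.newTuple(2);
        output.set(0, word);
        output.set(1, word.length());
        return output;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            // Describe the two fields inside the returned tuple.
            Schema fields = new Schema();
            fields.add(new Schema.FieldSchema("word", DataType.CHARARRAY));
            fields.add(new Schema.FieldSchema("length", DataType.INTEGER));

            // Wrap them in a single field of type TUPLE.
            return new Schema(
                new Schema.FieldSchema("word_length", fields, DataType.TUPLE));
        } catch (FrontendException e) {
            throw new RuntimeException("Could not build output schema", e);
        }
    }
}
```

With a schema like this in place, a Pig script should be able to refer to the fields of the result by name rather than by position.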
Because there are a few steps involved in packaging the UDF and putting it in place, you'll want to make sure it's unit tested first. For an example unit test, see TestIdentityUDF in the template project; this class runs the exec function on a test Tuple.
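For readers without the template project at hand, a test of this kind might look roughly like the sketch below. It is not the template's own test: the class TestEchoUDF and the nested EchoUDF are hypothetical, and the sketch assumes JUnit 4 and the Pig jar are on the test classpath.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.junit.Test;

public class TestEchoUDF {

    // Hypothetical UDF under test: returns its single argument unchanged.
    // In a real project this would live in its own source file.
    public static class EchoUDF extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() != 1) {
                return null;
            }
            return (String) input.get(0);
        }
    }

    @Test
    public void execReturnsItsSingleArgument() throws IOException {
        // Build the test Tuple by hand, then call exec directly.
        Tuple input = TupleFactory.getInstance().newTuple(1);
        input.set(0, "hello");

        assertEquals("hello", new EchoUDF().exec(input));
    }

    @Test
    public void execReturnsNullForEmptyTuple() throws IOException {
        Tuple input = TupleFactory.getInstance().newTuple();

        assertNull(new EchoUDF().exec(input));
    }
}
```

Calling exec directly on a hand-built Tuple keeps the test fast, since no Pig script or cluster is involved.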
To run a single test in Maven, set the test property on the command line:
mvn -Dtest=TestIdentityUDF test
To create a jar file, run mvn clean install in the root directory of the template project. This will create a jar file in the /target directory.
In order for Mortar to access the jar, it needs to be moved to a location in S3. This can be any bucket that you are allowed to upload data to and that is accessible using your IAM keys.
To tell your Pig script about the new jar, use the REGISTER statement with the jar's location.
At this point you can use your UDF in your Pig script the same way you use any of the Pig built-in functions. Remember that the UDF name will need to be fully qualified; to get around this, it's often easiest to DEFINE an alias at the beginning of your script.
DEFINE MyUDF the.full.package.path.MyUDF();
This is also the place to define the arguments of your UDF, if it has any. For example, if the UDF takes a time parameter in minutes you might define both of the following.
DEFINE MyUDF_5 the.full.package.path.MyUDF('5');
DEFINE MyUDF_10 the.full.package.path.MyUDF('10');
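On the Java side, Pig passes arguments given in a DEFINE to the UDF's constructor as strings, so a parameterized UDF needs a constructor that accepts String arguments and parses them itself. The sketch below shows one way this could look for a time parameter in minutes. The class name TimeBucketUDF and its bucketing behavior are hypothetical, and it assumes non-negative epoch-millisecond timestamps passed as Pig longs.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical example UDF: rounds an epoch-millisecond timestamp down to
// the start of its time window, where the window length is given in minutes.
public class TimeBucketUDF extends EvalFunc<Long> {

    private final long windowMillis;

    // The constructor argument arrives as a String, so it is parsed here.
    public TimeBucketUDF(String minutes) {
        this.windowMillis = Long.parseLong(minutes) * 60_000L;
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1 || input.get(0) == null) {
            return null;
        }
        Object value = input.get(0);
        if (!(value instanceof Long)) {
            throw new IOException(
                "TimeBucketUDF expects a long but got " + value.getClass().getName());
        }

        // Integer division drops the remainder, giving the window start
        // (assumes a non-negative timestamp).
        long timestamp = (Long) value;
        return (timestamp / windowMillis) * windowMillis;
    }
}
```

Each DEFINE alias then corresponds to a separately constructed instance, so a 5-minute and a 10-minute version can be used side by side in the same script.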