
Writing Jython UDFs

Mortar supports Pig User-Defined Functions (UDFs) that run in Jython. Jython UDFs offer many of the same niceties as Python UDFs but run much faster.

Mortar's Jython

Mortar provides Jython 2.5.2 for writing User-Defined Functions (UDFs).


Why Use Jython instead of Python?

Because Jython runs in the same Java runtime environment as Pig and Hadoop, transferring data between your Jython UDFs and Pig scripts is very fast. As a result, you will want to use Jython UDFs most of the time. However, there are some specific cases where Python is preferable:

  • Mortar does not currently support accessing external libraries such as numpy, scipy, and nltk from a Jython UDF in the Web IDE. In most cases the Jython Standard Library should be enough, but it has some notable limits that you should be aware of.

  • The Jython Standard Library as of version 2.5.2 does not include a JSON parser. If you need to work with JSON, you currently have to do it in regular Python.
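
If you share helper code between Python and Jython UDF files, one way to make the missing JSON parser fail loudly is to guard the import. This is only a sketch: the names HAS_JSON and parse_json are hypothetical and are not part of Mortar or Pig.

```python
# Guarded import: the json module is absent from the Jython 2.5.2
# standard library, so the import is expected to fail there.
try:
    import json
    HAS_JSON = True
except ImportError:
    json = None
    HAS_JSON = False


def parse_json(text):
    """Parse a JSON string, or raise a clear error where json is missing."""
    if not HAS_JSON:
        raise RuntimeError('json is not available in this interpreter; '
                           'do this work in a regular Python UDF instead')
    return json.loads(text)
```

Under regular Python the helper simply delegates to json.loads; under Jython 2.5.2 it raises a descriptive error instead of an unexplained ImportError at registration time.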


Registering a Jython UDF in Pig

To register a Jython UDF file, you use Pig's REGISTER statement. For example:

REGISTER '../udfs/jython/myudfs.py' USING jython AS myudfs;

Afterward, you can call Jython UDFs from your Pig script:

myalias = FOREACH myinput GENERATE myudfs.my_jython_udf(myfield);
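
For illustration, a myudfs.py file matching the statements above might look like the following sketch. The body of my_jython_udf is hypothetical (the original does not show it), and it assumes myfield is declared as a chararray in the Pig schema. The small fallback at the top is also an assumption: it stands in for the outputSchema decorator when the file is run outside Pig, for example in a local test.

```python
# Hypothetical contents of udfs/jython/myudfs.py.

# When the file is registered in Pig, outputSchema is expected to be
# available already. This stand-in only kicks in when the name is missing,
# so the module can also be imported and tested locally.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema  # remember the declared schema
            return func
        return decorate


@outputSchema('upper_field:chararray')
def my_jython_udf(field):
    # Pig passes nulls as None; pass them through unchanged.
    if field is None:
        return None
    return field.upper()
```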

Using the @outputSchema Decorator in Jython

The easiest way to let Pig know what the output of the UDF will be is to use the @outputSchema decorator. Any Jython function that will be called from Pig should use this to define its output.

@outputSchema('value:int')
def return_one():
    return 1

Simple Types

The simple types that can be returned in an @outputSchema string are chararray, bytearray, long, int, double, datetime, and boolean:

chararray

@outputSchema('output_field_name:chararray')
def my_chararray_udf(arg0, arg1, argn):
    return 'mystr'

bytearray

@outputSchema('output_field_name:bytearray')
def my_bytearray_udf(arg0, arg1, argn):
    return 'some_random_bytes_from_python'

long

@outputSchema('output_field_name:long')
def my_long_udf(arg0, arg1, argn):
    return 10000L

int

@outputSchema('output_field_name:int')
def my_int_udf(arg0, arg1, argn):
    return 10000

double

@outputSchema('output_field_name:double')
def my_double_udf(arg0, arg1, argn):
    return 10000.0

datetime

Pig 0.12+

import datetime

@outputSchema('output_field_name:datetime')
def my_datetime_udf(arg0, arg1, argn):
    return datetime.datetime.now()

boolean

Pig 0.12+

@outputSchema('output_field_name:boolean')
def my_boolean_udf(arg0, arg1, argn):
    return True

Complex Types

You can also return a Tuple, Bag, or Map from your UDF.

Tuple

@outputSchema('output_field_name:tuple(inner_field_name_1:chararray, inner_field_name_2:int)')
def my_tuple_udf(arg0, arg1, argn):
    return ('a_str', 1000)

DataBag

@outputSchema('output_bag_field_name:bag{t:(inner_field_name_1:chararray, inner_field_name_2:int)}')
def my_bag_udf(arg0, arg1, argn):
    bag = []
    # each element in the bag must be a tuple
    bag.append(('a_str', 1000))
    bag.append(('another_str', 500))
    return bag

Map

@outputSchema('output_field_name:map[]')
def my_map_udf(arg0, arg1, argn):
    return {'sky': 'blue', 'grass': 'green', 'submarine': 'yellow'}
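
The complex-type examples above return fixed values. As a sketch of a UDF that builds a bag from its input, the hypothetical tokenize function below splits a chararray into words and returns one (word, length) tuple per word. The function name, field names, and the local outputSchema fallback are all assumptions made for illustration.

```python
# Stand-in for the decorator when running outside Pig (assumption for
# local testing only; inside Pig the existing decorator is used).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema('tokens:bag{t:(word:chararray, length:int)}')
def tokenize(text):
    # A null input produces a null bag.
    if text is None:
        return None
    # Each element of the returned list is a tuple matching the inner schema.
    return [(word, len(word)) for word in text.split()]
```

The returned list plays the role of the bag, and every element is a tuple whose fields line up with the inner schema declared in the decorator.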

Streaming Data to Jython

One idiosyncrasy that distinguishes Jython UDFs from Python UDFs is how Pig streams data to the UDF. Pig casts data lazily: it determines a field's datatype only when it can infer what that datatype is. Because Pig has no knowledge of a UDF's inner workings, every field that is not explicitly cast in the schema is passed to the UDF as a Java byte array. You then have to cast that field manually to the appropriate datatype using the Jython Standard Library.
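
A minimal sketch of such a manual cast follows. It assumes the uncast field arrives as an iterable of signed byte values, which is how a Java byte array typically appears in Jython. The helper to_text and the UDF parse_int are hypothetical names, and the outputSchema fallback exists only so the code can run outside Pig.

```python
# Stand-in for the decorator when running outside Pig (assumption).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate

# Text types differ between Jython / Python 2 and Python 3.
try:
    TEXT_TYPES = (str, unicode)
except NameError:
    TEXT_TYPES = (str,)


def to_text(value):
    """Turn an uncast field into a string (hypothetical helper)."""
    if value is None or isinstance(value, TEXT_TYPES):
        return value
    # Treat each byte as one character. Java bytes are signed, so mask
    # them to 0..255. Multi-byte UTF-8 text would need real decoding.
    return ''.join([chr(b & 0xFF) for b in value])


@outputSchema('value:int')
def parse_int(raw):
    text = to_text(raw)
    # Nulls and blank fields become null rather than raising an error.
    if text is None or text.strip() == '':
        return None
    return int(text)
```

Declaring the field's type in the Pig schema, or casting it in the Pig script before calling the UDF, avoids the need for this kind of helper.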


Example Jython UDFs

Example Jython UDFs can be found in the mortar-examples and mortar-recsys projects.