Writing Python UDFs

Mortar supports Pig User-Defined Functions that run in pure Python (with C extensions supported). Python UDFs are slower than Java or Jython UDFs, but they offer the most flexibility for using third-party libraries.

Mortar's Python

Mortar provides Python 2.7 for writing User-Defined Functions (UDFs).

Mortar's Python allows the use of Python C packages. This means that statistical packages like numpy, scipy, and scikit-learn, which don't run in Jython, work properly in Mortar.


Pre-Installed Libraries

In addition to the Python Standard Library, Mortar comes pre-installed with several Python libraries often used for data science and data pipelines:


Using Additional Libraries

Mortar supports installing any Python library that can be installed by pip, including private packages stored in your S3 bucket.

For jobs that are run on the Mortar service, you can install additional Python libraries from the Python Settings tab under My Settings.

For jobs that you run locally, define your dependencies in a file called requirements.txt in the root directory of your Mortar project.

Here are some examples of how to define a Python dependency:

  • Version 2.1 of python-dateutil

    python-dateutil==2.1
    
  • Version 2.1 or higher of python-dateutil

    python-dateutil>=2.1
    
  • Latest version of python-dateutil

    python-dateutil
    
  • Your private library stored in S3

    s3://my-bucket/my-custom-lib.tar.gz
    

Registering a Python UDF in Pig

To register a Python UDF file, you use Pig's REGISTER statement. For example:

REGISTER '../udfs/python/myudfs.py' USING streaming_python AS myudfs;

Afterward, you can call Python UDFs from your Pig script:

myalias = FOREACH myinput GENERATE myudfs.my_python_udf(myfield);
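
As a rough illustration, the registered myudfs.py file might look like the sketch below. The function body, the output field name `normalized`, and the fallback decorator are assumptions made for this example; only the `pig_util` import and the name `my_python_udf` come from this page. The sketch also assumes that a null Pig field arrives in Python as None.

```python
# myudfs.py: hypothetical contents of the file registered above
try:
    from pig_util import outputSchema  # available when run by Pig/Mortar
except ImportError:
    # Assumed stand-in so the file can be imported outside Pig,
    # for example in a local unit test. It leaves the function unchanged.
    def outputSchema(schema_str):
        def wrap(func):
            return func
        return wrap


@outputSchema('normalized:chararray')
def my_python_udf(myfield):
    # Assumption: a null field is passed in as None, so pass it through.
    if myfield is None:
        return None
    # Illustrative logic only: trim whitespace and lowercase the value.
    return myfield.strip().lower()
```

Because the UDF is an ordinary Python function, it can be exercised directly from a Python shell before being run through Pig.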

Using the @outputSchema Decorator in Python

The easiest way to let Pig know what the output of the UDF will be is to use the @outputSchema decorator. Any Python function that will be called from Pig should use this to define its output.

from pig_util import outputSchema

@outputSchema('value:int')
def return_one():
    return 1

Simple Types

The simple types that you can return in an @outputSchema string are chararray, bytearray, long, int, double, datetime, and boolean:

chararray

@outputSchema('output_field_name:chararray')
def my_chararray_udf(arg0, arg1, argn):
    return 'mystr'

bytearray

@outputSchema('output_field_name:bytearray')
def my_bytearray_udf(arg0, arg1, argn):
    return 'some_random_bytes_from_python'

long

@outputSchema('output_field_name:long')
def my_long_udf(arg0, arg1, argn):
    return 10000L

int

@outputSchema('output_field_name:int')
def my_int_udf(arg0, arg1, argn):
    return 10000

double

@outputSchema('output_field_name:double')
def my_double_udf(arg0, arg1, argn):
    return 10000.0

datetime

Requires Pig 0.12 or later.

import datetime

@outputSchema('output_field_name:datetime')
def my_datetime_udf(arg0, arg1, argn):
    return datetime.datetime.now()

boolean

Requires Pig 0.12 or later.

@outputSchema('output_field_name:boolean')
def my_boolean_udf(arg0, arg1, argn):
    return True

Complex Types

You can also return a Tuple, Bag, or Map from your UDF.

Tuple

@outputSchema('output_field_name:tuple(inner_field_name_1:chararray, inner_field_name_2:int)')
def my_tuple_udf(arg0, arg1, argn):
    return ('a_str', 1000)
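
A slightly more realistic tuple UDF could look like the following sketch, which splits a "host:port" string into a two-field tuple. The function name `parse_endpoint`, the field names, and the fallback decorator are hypothetical, and returning None for malformed input is an assumed convention rather than documented Mortar behavior.

```python
try:
    from pig_util import outputSchema  # available when run by Pig/Mortar
except ImportError:
    # Assumed stand-in so the sketch can run outside Pig.
    def outputSchema(schema_str):
        def wrap(func):
            return func
        return wrap


@outputSchema('endpoint:tuple(host:chararray, port:int)')
def parse_endpoint(value):
    # Return None (assumed to become a Pig null) for missing or malformed input.
    if value is None or ':' not in value:
        return None
    # Split on the last colon so the port is always the final piece.
    host, _, port = value.rpartition(':')
    if not port.isdigit():
        return None
    # The order of values matches the order of fields in the schema.
    return (host, int(port))
```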

Bag

@outputSchema('output_bag_field_name:bag{t:(inner_field_name_1:chararray, inner_field_name_2:int)}')
def my_bag_udf(arg0, arg1, argn):
    bag = []
    # each element in the bag must be a tuple
    bag.append(('a_str', 1000))
    bag.append(('another_str', 500))
    return bag
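
The same pattern works when the bag is built from the input. The sketch below, which is illustrative only, splits a sentence on whitespace and returns one (token, length) tuple per word. The function name `tokenize`, the field names, and the fallback decorator are assumptions made for this example.

```python
try:
    from pig_util import outputSchema  # available when run by Pig/Mortar
except ImportError:
    # Assumed stand-in so the sketch can run outside Pig.
    def outputSchema(schema_str):
        def wrap(func):
            return func
        return wrap


@outputSchema('tokens:bag{t:(token:chararray, length:int)}')
def tokenize(sentence):
    # Assumption: a null field is passed in as None.
    if sentence is None:
        return None
    # A bag is a list; every element is a tuple matching the inner schema.
    return [(word, len(word)) for word in sentence.split()]
```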

Map

@outputSchema('output_field_name:map[]')
def my_map_udf(arg0, arg1, argn):
    return {'sky': 'blue', 'grass': 'green', 'submarine': 'yellow'}
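
As a further sketch, a map UDF might turn a query string such as "a=1&b=two" into a map with string keys and string values. The function name `parse_query`, the field name `params`, and the fallback decorator are hypothetical; values are kept as strings here to mirror the example above.

```python
try:
    from pig_util import outputSchema  # available when run by Pig/Mortar
except ImportError:
    # Assumed stand-in so the sketch can run outside Pig.
    def outputSchema(schema_str):
        def wrap(func):
            return func
        return wrap


@outputSchema('params:map[]')
def parse_query(query):
    # Assumption: a null field is passed in as None.
    if query is None:
        return None
    params = {}
    for pair in query.split('&'):
        key, sep, value = pair.partition('=')
        if sep:  # skip fragments that contain no '='
            params[key] = value
    # A Python dict is returned for a Pig map.
    return params
```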

Example Python UDFs

For more examples of Python UDFs, see the mortar-examples udfs/python directory. The blog post where we announced this feature contains another example.