Using Fixed-Width Data

You can load fixed-width formatted data into Mortar, or tell Mortar to output data in fixed-width format, by using the Piggybank functions FixedWidthLoader and FixedWidthStorer.

Loading Data from Fixed-Width Files

You can use FixedWidthLoader to load your data as follows:

data = LOAD 's3n://my-s3-bucket/path/to/input'
    USING org.apache.pig.piggybank.storage.FixedWidthLoader(
            '-5, 7-10, 11-15, 16, 17-', 'SKIP_HEADER',
            'f1: int, f2: int, f3: float, f4: int, f5: chararray'
    );

The first parameter is mandatory and specifies the positions of the columns. They are 1-indexed and inclusive on both ends. "-5" means columns 1 through 5, and "17-" means 17 to the end of the line. Single-character columns at position n can be specified as either n-n or simply n.

This syntax is intended to mimic that of the Unix utility "cut".
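To make the position rules concrete, here is a minimal Python sketch of how such a column specification could be interpreted. It is an illustration only, not Piggybank's implementation; the function names parse_column_spec and slice_line are hypothetical, and the sketch returns raw strings without applying any schema.

```python
def parse_column_spec(spec):
    """Parse a spec such as '-5, 7-10, 16, 17-' into (start, end) pairs.

    Positions are 1-indexed and inclusive. An open start defaults to 1,
    and an open end is represented as None (meaning "to end of line").
    """
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            left, right = part.split("-", 1)
            start = int(left) if left.strip() else 1
            end = int(right) if right.strip() else None
        else:
            # A bare number n is shorthand for n-n.
            start = end = int(part)
        ranges.append((start, end))
    return ranges


def slice_line(line, spec):
    """Cut one line of text into raw string fields according to spec."""
    fields = []
    for start, end in parse_column_spec(spec):
        # 1-indexed inclusive [start, end] maps to the Python slice [start-1:end].
        fields.append(line[start - 1:end])
    return fields


line = "12345 6789ABCDEFGhello"
print(slice_line(line, "-5, 7-10, 11-15, 16, 17-"))
# ['12345', '6789', 'ABCDE', 'F', 'Ghello']
```

Note that position 6 is not covered by any range in this spec, so the character there (a space in the sample line) is simply dropped.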

The second parameter is optional and specifies what to do with header rows (a first row containing the titles of each column). If the parameter is set to 'SKIP_HEADER', FixedWidthLoader will skip the header row of each input file. The default behavior is to not skip the header; if you need to explicitly state this, set the parameter to 'USE_HEADER'.

The third parameter is optional and allows you to specify a schema. You could alternatively use an AS clause, but specifying it like this allows the FixedWidthLoader to perform type coercions that Pig does not natively support, such as "17" to 17.

Storing Pig Output into Fixed-Width Format

We're not sure why you'd want to do this, but we support it anyway!

You can use FixedWidthStorer to store data as follows:

STORE data INTO 's3n://my-s3-bucket/path/to/output'
    USING org.apache.pig.piggybank.storage.FixedWidthStorer(
            '-5, 7-10, 11-15, 16, 17-30', 'WRITE_HEADER'
    );

The first parameter is mandatory and specifies the positions of the columns using the same syntax as FixedWidthLoader. Note, however, that the last column must have a defined end; "17-" will cause an error.

The second parameter is optional and specifies whether or not to write a header row containing the name of each field at the top of each output file. The default behavior, 'NO_HEADER', is not to write headers. Set this parameter to 'WRITE_HEADER' to write headers. If you choose to write headers, make sure you allocate enough space in each column to fit the name of its field.
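Before running a job with 'WRITE_HEADER', it can help to check the column widths against your field names. The following Python sketch shows one way to do that check offline; the helpers column_widths and names_too_long are hypothetical and are not part of Piggybank. It also mirrors the rule that a storer column needs a defined end.

```python
def column_widths(spec):
    """Return the width of each column in a storer-style spec.

    Raises ValueError for an open-ended column such as '17-',
    since an output column needs a defined end.
    """
    widths = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            left, right = part.split("-", 1)
            if not right.strip():
                raise ValueError(f"column '{part}' has no end position")
            start = int(left) if left.strip() else 1
            end = int(right)
        else:
            start = end = int(part)
        # Both ends are inclusive, hence the +1.
        widths.append(end - start + 1)
    return widths


def names_too_long(spec, field_names):
    """Return the field names that are wider than their column."""
    return [
        name
        for name, width in zip(field_names, column_widths(spec))
        if len(name) > width
    ]


spec = "-5, 7-10, 11-15, 16, 17-30"
print(column_widths(spec))
# [5, 4, 5, 1, 14]
print(names_too_long(spec, ["f1", "f2", "f3", "f4", "f5"]))
# ['f4']
```

In this example the single-character column at position 16 is too narrow for the two-character name f4, so that column would need to be widened or the field renamed.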

If a field does not fit in the space allotted to it, a null will be written (all spaces). The exception is floats and doubles. If a float or double field is too large to fit into a column, but the column is wide enough to hold all of the digits to the left of the decimal point, the point itself, and at least one digit to the right of it, FixedWidthStorer will round the field to fit into the column.
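The overflow rule can be sketched in Python as below. This is an approximation under stated assumptions, not FixedWidthStorer's actual code: the function name fit_field is hypothetical, values are assumed to be left-aligned and padded with spaces, and rounding uses Python's built-in fixed-point formatting, which may differ from the storer's rounding in edge cases.

```python
import math


def fit_field(value, width):
    """Render value into exactly `width` characters, or all spaces if it cannot fit."""
    blank = " " * width
    if value is None:
        return blank
    text = str(value)
    if len(text) <= width:
        # Assumption: left-aligned with trailing spaces.
        return text.ljust(width)
    if isinstance(value, float) and math.isfinite(value):
        # Characters needed left of the decimal point, including a minus sign.
        int_len = len(str(int(abs(value)))) + (1 if value < 0 else 0)
        decimals = width - int_len - 1  # what is left after the point
        if decimals >= 1:
            rounded = f"{value:.{decimals}f}"
            # Rounding can add a digit (9.99 -> 10.0), so check the length again.
            if len(rounded) <= width:
                return rounded.ljust(width)
    return blank


print(repr(fit_field(3.14159, 4)))   # '3.14'
print(repr(fit_field(123.456, 5)))   # '123.5'
print(repr(fit_field(123.456, 4)))   # '    '
print(repr(fit_field(12345, 3)))     # '   '
```

The third call produces a blank because a four-character column leaves no room for a digit after the decimal point once "123." has been written.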