Using PigStorage

PigStorage is a built-in Pig function, and one of the most common functions used to load and store data in pigscripts. PigStorage can be used to parse text data with an arbitrary delimiter, or to output data in a delimited format.

Delimiter

If no argument is provided, PigStorage assumes tab-delimited format. If a delimiter argument is provided, it must be a single-byte character: any literal (e.g. 'a', '|') or known escape character (e.g. '\t', '\r') is a valid delimiter. For example, to load a space-separated file:

data = LOAD 's3n://input-bucket/input-folder' USING PigStorage(' ')
            AS (field0:chararray, field1:int);

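The schema is provided in the AS clause. If the AS clause is omitted, the fields are loaded as bytearrays and can only be referenced by position ($0, $1, and so on).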

To store data using PigStorage, the same delimiter rules apply:

STORE data INTO 's3n://output-bucket/output-folder' USING PigStorage('\t');
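
Because the delimiter is chosen independently on load and on store, the two statements can be combined to convert data from one delimiter to another. The following is a minimal sketch, reusing the placeholder bucket paths and field names from the examples above; it assumes the input is tab-delimited and that none of the values contain a '|' character.

```pig
-- Load tab-delimited input; PigStorage() with no argument defaults to tab.
data = LOAD 's3n://input-bucket/input-folder' USING PigStorage()
            AS (field0:chararray, field1:int);

-- Write the same relation back out with '|' between fields.
STORE data INTO 's3n://output-bucket/output-folder' USING PigStorage('|');
```

A record loaded as ('abc', 42) would be written as the line abc|42.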

Limitations

PigStorage is an extremely simple loader that does not handle special cases such as embedded delimiters or escaped control characters; it will split on every instance of the delimiter regardless of context. For this reason, when loading a CSV file it is recommended to use CSVExcelStorage rather than PigStorage with a comma delimiter.
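
To illustrate the difference, here is a hedged sketch of loading a CSV file that contains a quoted field with an embedded comma. The input line in the comments is made-up sample data, the field names are hypothetical, and the REGISTER path is a placeholder; the REGISTER line may be unnecessary on platforms that already provide piggybank.

```pig
-- Sample input line (illustrative only):
--   1,"Smith, John",42

-- PigStorage(',') splits on every comma, so the line above yields four
-- fields ( 1 | "Smith |  John" | 42 ) instead of three.
broken = LOAD 's3n://input-bucket/input-folder' USING PigStorage(',');

-- CSVExcelStorage ships in piggybank and understands quoted fields.
-- Assumption: the jar path below is a placeholder for your environment.
REGISTER /path/to/piggybank.jar;

people = LOAD 's3n://input-bucket/input-folder'
         USING org.apache.pig.piggybank.storage.CSVExcelStorage()
         AS (id:int, name:chararray, age:int);
```

With CSVExcelStorage, the quoted value is kept together as a single name field rather than being split at the embedded comma.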