Using CSV Data

You can load CSV data into Mortar, or tell Mortar to output data in Excel-flavoured CSV format, by using the Piggybank function CSVExcelStorage.

Loading Data from CSV files

You can use CSVExcelStorage to load your data as follows:

data = LOAD 's3n://my-s3-bucket/path/to/csv/file'
            USING org.apache.pig.piggybank.storage.CSVExcelStorage()
            AS (field1: int, field2: chararray);

Additionally, Mortar has extended CSVExcelStorage to take several parameters that the basic Piggybank version doesn't have. These parameters help tailor the loader to your type of CSV.

The first parameter specifies which character to use as a field delimiter. The default is comma (',').

The second parameter specifies how to treat quoted fields with newlines in them. These fields are valid CSV, but loading them properly is complicated and decreases performance. Therefore, the default behavior is 'NO_MULTILINE', not allowing multiline fields. If you wish to allow multiline fields, set this parameter to 'YES_MULTILINE'.

The third parameter specifies how to handle line endings when storing, and is not used for loading. Because the parameters are positional, you still have to supply a value for it, such as 'NOCHANGE', if you want to use the fourth parameter. The parameters are in this unintuitive order to maintain backwards compatibility with an older version of CSVExcelStorage in the Piggybank that was not developed by Mortar.

The fourth parameter specifies what to do with header rows (the first row of each file). For loading, set it to 'SKIP_INPUT_HEADER' to skip header rows. If this parameter is not specified, header rows will be read as ordinary data rows.

Example of using parameters to load a TSV file (CSV with tab delimiters), skipping header rows:

data = LOAD 's3n://my-s3-bucket/path/to/csv/output'
            USING org.apache.pig.piggybank.storage.CSVExcelStorage(
            '\t', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER'
            );
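
As a further illustration of the second parameter, here is a minimal sketch of loading a comma-delimited file in which a quoted field contains a newline. The alias notes, the field names id and comment, and the sample record are hypothetical, and the two-argument form is assumed to work in Mortar's version as it does in the Piggybank version:

```pig
-- Hypothetical input: one record that spans two physical lines,
-- because the quoted second field contains a newline:
--   1,"first line
--   second line"
--
-- 'YES_MULTILINE' allows such fields. The default, 'NO_MULTILINE',
-- does not allow them.
notes = LOAD 's3n://my-s3-bucket/path/to/csv/file'
            USING org.apache.pig.piggybank.storage.CSVExcelStorage(
            ',', 'YES_MULTILINE'
            )
            AS (id: int, comment: chararray);
```

Since multiline handling decreases performance, it is probably worth enabling only for files known to contain such fields.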

Storing Pig Output in CSV Format

You can store the output of a Pig script in CSV format using CSVExcelStorage as well.

Example:

STORE result INTO 's3n://my-s3-bucket/path/to/output' USING org.apache.pig.piggybank.storage.CSVExcelStorage();

Again, Mortar's version of CSVExcelStorage can optionally be passed parameters.

The first parameter specifies which character to use as a field delimiter. The default is comma (',').

The second parameter controls how to handle newlines in quoted fields in the same manner as it did for loading. The default is 'NO_MULTILINE'.

The third parameter controls how to write line endings. It may be set to 'UNIX' (line endings are LF), 'WINDOWS' (line endings are CRLF), or 'NOCHANGE' (line endings are the system default). The default is 'NOCHANGE'.

The fourth parameter specifies whether to write a header row containing the title of each field in the schema. Set it to 'WRITE_OUTPUT_HEADER' to write such a header to each file. If this parameter is not specified, no header row will be written.

Example of using parameters to store into a TSV format (CSV with tab delimiters), with Windows line-endings and with a header row:

STORE result INTO 's3n://my-s3-bucket/path/to/output'
            USING org.apache.pig.piggybank.storage.CSVExcelStorage(
            '\t', 'YES_MULTILINE', 'WINDOWS', 'WRITE_OUTPUT_HEADER'
            );

If you use 'WINDOWS' line endings, you can open Pig output in Excel on Windows. To do so, however, you will have to give the output files a ".csv" extension.

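Pig writes its output as a set of part files, which you may wish to concatenate into a single file. Alternatively, you can add the clause "PARALLEL 1" to your store statement. This tells Pig to use only a single reducer, so the output will be written to only one file. Because PARALLEL controls the number of reducers, it only has an effect when the job that writes the output includes a reduce phase.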
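
Because PARALLEL sets the number of reducers, it can also be placed on the last reduce-side operator before the store. The following is a minimal sketch of that variation, not the only way to do it. It assumes that result has a field named field1, and the alias ordered is hypothetical:

```pig
-- ORDER BY runs a reduce phase, so PARALLEL 1 restricts it to a single
-- reducer and the STORE below writes a single part file.
-- field1 is an assumed field of 'result'; substitute one of your own.
ordered = ORDER result BY field1 PARALLEL 1;

STORE ordered INTO 's3n://my-s3-bucket/path/to/output'
            USING org.apache.pig.piggybank.storage.CSVExcelStorage(
            ',', 'NO_MULTILINE', 'WINDOWS', 'WRITE_OUTPUT_HEADER'
            );
```

A single reducer has to process all of the output by itself, so this approach is best suited to results that are small enough for one task to handle.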


Mortar Project Example

For a full example in a Mortar project, clone the mortar-examples repository and look at the airline_travel pigscript.