
Using XML Data

You can load XML data into Mortar by using the Piggybank function StreamingXMLLoader.

Loading XML Data with StreamingXMLLoader

Prerequisites

To load XML data into Mortar, first ensure that your files are formatted with exactly one XML object per line. This helps Pig and Hadoop to properly split your files for parallel processing.
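For instance, a file ready for loading (using a hypothetical Entry record with made-up values) keeps each complete XML object on a single line:

```xml
<Entry id="1"><name>first</name></Entry>
<Entry id="2"><name>second</name></Entry>
```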

StreamingXMLLoader takes two string parameters: the record identifier, which marks the chunks of XML that correspond to individual Pig records, and an optional tag list, which specifies the particular tags to extract from each record.

Quick Start - No Tag Extraction

The simplest way to use the StreamingXMLLoader is with no tag extraction, providing only a record identifier:

XML_objects = LOAD 's3n://my-s3-bucket/path/to/XML/file'
                USING org.apache.pig.piggybank.storage.StreamingXMLLoader('Entry');

This loads each XML document (delimited by <Entry> tags) into an individual tuple. The entire document, including the delimiting tags, appears in the record. Each record is a single chararray, so make sure any schema you include reflects this.

This will give you a general idea about your data, but you'll be stuck with a single giant field full of unparsed XML.
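For instance, with a hypothetical input file containing the line <Entry id="1"><name>first</name></Entry>, a quick DUMP would show each entry as a single-field tuple of raw XML (the bucket path here is a placeholder):

```
XML_objects = LOAD 's3n://my-s3-bucket/path/to/XML/file'
                USING org.apache.pig.piggybank.storage.StreamingXMLLoader('Entry')
                AS (entry: chararray);

-- Each tuple holds the whole document, delimiting tags included:
DUMP XML_objects;
-- (<Entry id="1"><name>first</name></Entry>)
```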

Tag Selection

Once you know which fields you need from your data, you can provide StreamingXMLLoader with a list of tags. You pass the list to StreamingXMLLoader as a second (optional) string parameter, like this:

XML_objects = LOAD 's3n://my-s3-bucket/path/to/XML/file'
                USING org.apache.pig.piggybank.storage.StreamingXMLLoader(
                'Entry',
                'tag1, tag2, tag3'
                ) AS (
                    tag1: {(attr: map[], content:chararray)},
                    tag2: {(attr: map[], content:chararray)},
                    tag3: {(attr: map[], content:chararray)}
                );

Note the schema in the AS clause of the LOAD statement: adding one is optional, but an extremely good idea. The records that come back are somewhat complex but structurally uniform (which is more than you can say for the average XML document), so a schema will benefit you in the long run.

StreamingXMLLoader will extract these fields from your data, filling the attr map with attribute name/value pairs. The raw contents of each element (not including its enclosing start and end tags) appear as a chararray.
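As a sketch (with made-up element names and values): an extracted element like <tag1 lang="en">hello</tag1> would come back as a bag holding one tuple, with the attribute pair in the map and the element text in the chararray:

```
-- Hypothetical input element:    <tag1 lang="en">hello</tag1>
-- Resulting tag1 field (a bag):  {([lang#en], hello)}
```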

Simple Example

Let's read in a slice of the Protein Sequence Database. To load the protein entries without doing any element extraction:

proteins = LOAD 's3://mortar-example-data/protein-sequence/Protein-Sequence-DB-head.xml'
            USING org.apache.pig.piggybank.storage.StreamingXMLLoader('ProteinEntry');

That works, but let's provide a tag list so we can go after specific elements:

proteins = LOAD 's3://mortar-example-data/protein-sequence/Protein-Sequence-DB-head.xml'
            USING org.apache.pig.piggybank.storage.StreamingXMLLoader(
            'ProteinEntry',
            'protein, organism, source'
            ) AS (
                protein:  {(attr:map[], content:chararray)},
                organism: {(attr:map[], content:chararray)},
                source:   {(attr:map[], content:chararray)}
            );

We can now use the fields in the script as we please, say, to find entries that came from a human source. Here's the whole script:

proteins = LOAD 's3://mortar-example-data/protein-sequence/Protein-Sequence-DB-head.xml'
            USING org.apache.pig.piggybank.storage.StreamingXMLLoader(
            'ProteinEntry',
            'protein, organism, source'
            ) AS (
                protein:  {(attr:map[], content:chararray)},
                organism: {(attr:map[], content:chararray)},
                source:   {(attr:map[], content:chararray)}
            );

-- We know there's only one protein, organism, and source element in each entry,
-- so we can get away with flattening like this. You might need to be more clever
-- in real life, though.

proteins        =  FOREACH proteins GENERATE
                        FLATTEN(protein.content)   AS protein,
                        FLATTEN(organism.content)  AS organism,
                        FLATTEN(source.content)    AS source;

human_proteins  =  FILTER proteins BY source == 'human';
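From here, human_proteins behaves like any other Pig relation. For example, to write the results back out (the output bucket below is a placeholder, and PigStorage's delimiter is up to you):

```
STORE human_proteins INTO 's3n://my-output-bucket/human-proteins'
    USING PigStorage('\t');
```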

If you take a peek at that protein sequence data, you might notice that source elements are nested inside organism elements. By design, StreamingXMLLoader allows you to extract arbitrarily nested elements from your XML. However, a nested field that is explicitly extracted will not appear in the content chararray of the enclosing field.
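As a hypothetical sketch of that behavior: suppose an entry contains <organism>Homo sapiens<source>human</source></organism>, and the tag list names both organism and source. The source text is captured in its own field, and the organism content no longer includes the extracted <source> element:

```
-- Hypothetical input fragment:
--   <organism>Homo sapiens<source>human</source></organism>
--
-- With tag list 'organism, source':
--   organism: {([], Homo sapiens)}   -- <source>...</source> removed from content
--   source:   {([], human)}
```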