Having extracted the data that you need, it's time to perform any necessary transformations. Your transformation code goes in your project's transform Pig script. The first section of that script looks like this:
-- Load data from the extract step.
raw = load '$INPUT_PATH/extract' using PigStorage()
    as (<Your schema here>);

<Your transform logic here>
Here you need to provide two things:
The schema of your data. This needs to match the schema of the data that you stored in the previous extraction step. If you skipped the extraction step because your data already lived in S3, replace the whole load statement with the statement provided by our Load Statement Generator.
Your transform logic. This logic is driven by your data's raw format and the format you will load into Redshift. If you are having a hard time deciding what to do here, think about creating separate fields that will be easy to query or aggregate against. For example, in the Wikipedia example we took the hard-to-query/aggregate field encoded_hourly_pageviews (ex. "KH1,VO1,^J1,^K2") and transformed it into separate fields and rows for day, hour, and pageviews.
In more complex data pipelines your transform logic will also need to do things like joining separate data sources and cleaning up invalid/dirty data.
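To make the transform step concrete, here is a sketch in Python of the kind of decoding the Wikipedia example describes: splitting encoded_hourly_pageviews into separate day/hour/pageview rows. The encoding rule assumed here (first character = day, A = day 1; second character = hour, A = hour 0; remaining digits = pageviews, with characters after Z covering days 27-31) follows the Wikimedia pagecounts-ez convention and may not match your data exactly; in the actual pipeline this logic would live in your Pig script or a Pig UDF.

```python
def decode_hourly_pageviews(encoded):
    """Turn an encoded string like "KH1,VO1" into (day, hour, views) rows.

    Assumes a pagecounts-ez style encoding (an assumption, not taken
    from the Mortar docs): in each comma-separated token the first
    character is the day (A = day 1), the second is the hour
    (A = hour 0 ... X = hour 23), and the rest is the view count.
    """
    rows = []
    for token in encoded.split(','):
        day = ord(token[0]) - ord('A') + 1    # 'K' -> day 11
        hour = ord(token[1]) - ord('A')       # 'H' -> hour 7
        views = int(token[2:])
        rows.append((day, hour, views))
    return rows
```

Each encoded string becomes several easy-to-aggregate rows, e.g. "KH1,VO1" becomes one row for day 11, hour 7 and one for day 22, hour 14, each with 1 pageview.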
The second half of the transform script looks like:
grouped_data = group <Your transformed data alias here>
    by <Your grouping key(s) here>
    parallel $REDSHIFT_PARALLELIZATION;
reduced = foreach grouped_data generate
    flatten(<Your transformed data alias here>);

-- Use gzip compression
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- Remove any existing data
rmf $OUTPUT_PATH/transform;

store reduced into '$OUTPUT_PATH/transform' using PigStorage();
To speed up loading data from S3 into Redshift, we need to break our data into multiple files. The Mortar ETL pipeline will take care of setting the parallelization factor, but you need to pick a key (or keys) that can break your data up into multiple files. The goal is to pick something that generates roughly equally sized files, 5 to 10 times as many as there are nodes in your cluster (for example, a 4-node cluster should get roughly 20 to 40 files).
These numbers are rough guidelines, but you want to avoid a grouping strategy that produces a small number of very large files alongside a large number of very small ones, as that will cause reducer skew and leave your cluster underutilized.
If you are having a hard time figuring out what to use for grouping your data, just pick a field with somewhere between 100 and 1,000 distinct values.
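A quick way to sanity-check a candidate grouping key is to count its distinct values and see how evenly rows spread across them. The Python sketch below (the function name and return shape are illustrative, not part of Mortar) flags keys that fall outside the 5-10x-nodes target or that concentrate most rows in one group:

```python
from collections import Counter

def check_grouping_key(values, num_nodes):
    """Report whether a candidate grouping key will split data well.

    values:    the key's value for every row (e.g. a sampled column)
    num_nodes: number of nodes in the Redshift cluster
    Returns (distinct_count, in_target_range, max_group_fraction).
    """
    counts = Counter(values)
    distinct = len(counts)
    # Aim for 5-10x as many groups (output files) as cluster nodes.
    in_range = 5 * num_nodes <= distinct <= 10 * num_nodes
    # Fraction of rows in the largest group; values near 1.0 mean
    # one huge file plus many tiny ones, i.e. reducer skew.
    max_fraction = max(counts.values()) / len(values)
    return distinct, in_range, max_fraction
```

For a 4-node cluster, 320 rows spread evenly over 32 key values passes (32 is within 20-40, and the largest group holds only about 3% of the rows), while a key where one value covers 90% of the rows would be flagged by the high max_group_fraction.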