This quick tutorial will walk you through the basics of running an R script inside a data pipeline with Mortar.
For this example, we will craft a simple data pipeline to pull some data down from S3, run an R script to graph the data, and then upload the graph back to S3. The pipeline will also generate a URL where you can download the graph from S3.
The general process is to get an R script working locally, and then wire it up to a data pipeline. Once it is wired up, you can run it automatically as often as you like.
First, make sure that you have a Mortar account (you can request a free trial) and that you've installed the Mortar Development Framework, which is the fastest way to build and run data pipelines.
For the purposes of this tutorial we'll be working within the mortar-examples project developed by Mortar that contains several ready-to-run examples. We'll use Luigi to run a data pipeline containing an R script.
Once you've installed Mortar, run the following command (prepending your handle to generate a unique name) in the terminal to get a copy of Mortar's example project:
mortar projects:fork git@github.com:mortardata/mortar-examples.git <your-handle>-mortar-examples
Mortar projects have a standardized structure that keeps the various elements of a project (Pig scripts, UDFs, Luigi scripts) organized. R scripts live in the rscripts directory, and Luigi scripts, unsurprisingly, live in the luigiscripts directory.
The R script we'll be working with is olympics.R, a script that graphs the medals won by each country in the Olympics.
Open up olympics.R in your favorite code editor and take a look. The script takes in two parameters: the path to the input data and output_path. Let's run it once to make sure it works. Make sure you have R installed on your computer, and run:
cd <your_mortar_project_root>
Rscript rscripts/olympics.R \
    data/OlympicAthletes.csv \
    /tmp/medals_graph.png
When the script completes, you should have an Olympic medal histogram in /tmp/medals_graph.png. Open it up and take a look!
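Once this step moves into a pipeline, something other than you has to issue that Rscript command. Here is a minimal Python sketch of the idea, assuming Rscript is on your PATH. The helper names build_rscript_command and run_r_script are hypothetical and are not taken from the mortar-examples project.

```python
import subprocess


def build_rscript_command(script_path, input_path, output_path):
    """Build the argument list for the same command run by hand above."""
    return ["Rscript", script_path, input_path, output_path]


def run_r_script(script_path, input_path, output_path):
    """Run the R script; raises CalledProcessError if R exits non-zero."""
    command = build_rscript_command(script_path, input_path, output_path)
    subprocess.check_call(command)


# Show the command that would be executed, without running R.
print(" ".join(build_rscript_command(
    "rscripts/olympics.R",
    "data/OlympicAthletes.csv",
    "/tmp/medals_graph.png",
)))
```

Calling run_r_script with the same three arguments would actually execute the script. Passing the arguments as a list rather than a single shell string avoids quoting problems when a path contains spaces.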
Now we've got a locally-tested R script. But what if your company needs this graph updated every day? We don't want to get stuck manually pulling down the data, building the graph, and uploading it every day. Computers should do this for us!
So let's build a data pipeline to automate the process. We'll have the pipeline pull data from S3, run the R script on it, upload the graph back to S3, and then give us a link to the graph.
There are four steps in the pipeline, each represented by a Python class that extends luigi.Task:
CopyOlympicsDataFromS3: copies the raw data from S3 to the local disk
PlotOlympicsData: runs the R script to generate a graph
CopyOlympicMedalsGraphToS3: uploads the graph from local disk to S3
PrintS3GraphLink: generates an https link to the graph and prints it to the log file
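In Luigi, a task declares what it depends on through its requires() method, and asking for the last task causes everything upstream to run first. The plain-Python sketch below is not Luigi code. It only illustrates how such a chain resolves into a run order, assuming each of the four steps depends solely on the one before it. The REQUIRES dictionary and run_order function are illustrative names.

```python
# Each task name maps to the task it depends on (None for the first step).
# The task names come from the list above; the mapping is an assumption.
REQUIRES = {
    "CopyOlympicsDataFromS3": None,
    "PlotOlympicsData": "CopyOlympicsDataFromS3",
    "CopyOlympicMedalsGraphToS3": "PlotOlympicsData",
    "PrintS3GraphLink": "CopyOlympicMedalsGraphToS3",
}


def run_order(final_task, requires):
    """Walk back from the requested task, then return tasks in execution order."""
    order = []
    task = final_task
    while task is not None:
        order.append(task)
        task = requires[task]
    order.reverse()
    return order


# Requesting the last task pulls in all three upstream tasks.
for name in run_order("PrintS3GraphLink", REQUIRES):
    print(name)
```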
These tasks are defined in luigiscripts/olympics-luigi.py. To run the pipeline, run the following in the terminal:
cd <your_mortar_project_root>
mortar luigi luigiscripts/olympics-luigi.py \
    --s3-path "s3://mortar-example-output-data/<your-handle>/medals_graph.png"
<your-handle> should be an identifier unique to you (such as first-initial-last-name).
When you run the command, you should see the following:
Taking code snapshot... done
Sending code snapshot to Mortar... done
Requesting job execution... done
job_id: some_job_id
Job status can be viewed on the web at:
https://app.mortardata.com/jobs/pipeline_job_detail?job_id=some_job_id
This tells you that your job has started successfully and gives you the URL to monitor its progress. If you open that URL, you should see your pipeline running in the logs: first moving data from S3, then running your R script, then uploading data to S3, and finally printing out the URL for the graph.
The URL in the logs should look something like:
2014-10-12 05:42:01,811 (olympics-luigi.py:160) DOWNLOAD GRAPH AT: https://mortar-example-output-data.s3.amazonaws.com/...
Open up your URL and pull down the graph!
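The link in the log uses the bucket name as the host, followed by .s3.amazonaws.com and the object key. Below is a minimal Python sketch of how an s3:// path could be turned into a link of that shape. The function name s3_path_to_https_url and the my-handle value are hypothetical, and the actual PrintS3GraphLink task may build its link differently (for example, as a signed URL).

```python
from urllib.parse import quote, urlparse


def s3_path_to_https_url(s3_path):
    """Convert s3://bucket/key into https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(s3_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected a path of the form s3://bucket/key")
    key = parsed.path.lstrip("/")
    # quote() percent-encodes characters that are unsafe in a URL path
    # while leaving "/" separators intact.
    return "https://{}.s3.amazonaws.com/{}".format(parsed.netloc, quote(key))


# "my-handle" stands in for your own <your-handle> value.
print(s3_path_to_https_url(
    "s3://mortar-example-output-data/my-handle/medals_graph.png"
))
```

A plain link like this only downloads successfully if the object is readable by whoever opens it.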
Proceed to the next article to start working with your own R scripts. Or, learn a bit more about Luigi data pipelines.