Now that you have defined a data pipeline with a Luigi script, it's time to run your pipeline.
First we're going to try running the Luigi script on our local machine rather than offloading the full orchestration of the pipeline to Mortar.
In the terminal, just run:
cd YOUR_PROJECT_ROOT_DIRECTORY
mortar local:luigi luigiscripts/word-luigi.py --output-path s3://mortar-example-output-data/<your-handle-here>/q-words
<your-handle-here> should be an identifier unique to you (such as first initial and last name).
The --output-path argument passes the required output_path parameter to your Luigi script. Note that although parameters within the Luigi script use underscores, they must be passed on the command line using hyphens. You can use similar parameters to override the defaults in your script (for example,
You should see status messages streaming through the terminal window, like this:
(If Luigi returns an error that you can't solve, try looking at the word-luigi-solution.py script in the luigiscripts directory for a completed version of the script.)
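As a small illustration of the underscore-to-hyphen rule for parameters described above, here is a minimal sketch in plain Python. The helpers to_cli_flag and build_cli_args are hypothetical names invented for this example; they are not part of Luigi or the mortar command-line tool, and they only mimic the naming convention.

```python
def to_cli_flag(param_name: str) -> str:
    """Turn a script-side parameter name into its command-line flag."""
    # Hypothetical helper: underscores in the script become hyphens on the CLI.
    return "--" + param_name.replace("_", "-")


def build_cli_args(params: dict) -> list:
    """Flatten a {name: value} mapping into a list of CLI arguments."""
    args = []
    for name, value in params.items():
        args.extend([to_cli_flag(name), str(value)])
    return args


# The handle placeholder is kept as-is; substitute your own identifier.
example_args = build_cli_args(
    {"output_path": "s3://mortar-example-output-data/<your-handle-here>/q-words"}
)
print(" ".join(example_args))
```

The sketch only covers the naming convention; how Luigi itself parses and validates arguments is not shown here.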
When your script runs to completion, you will be able to find the output data in the S3 location specified by your output-path parameter. (You can access files in S3 using a tool such as s3cmd or via the AWS console.) Alternatively, you can run your pipeline in the cloud, after which Mortar will serve up a convenient download link for your results.
That pipeline runs just fine locally, but eventually you'll build more complex pipelines that you'll want to deploy to run in the cloud. Also, you won't want to keep your laptop running for hours while pipelines complete!
It's incredibly easy to run in the cloud. Just substitute mortar luigi for mortar local:luigi in your run command:
mortar luigi luigiscripts/word-luigi.py --output-path s3://mortar-example-output-data/<your-handle-here>/q-words2
Note that we also added a 2 to the end of the output-path parameter. Otherwise, Luigi would have detected that our results already exist at the original location and would not have rerun the job. This behavior is called idempotency, and it's one of Luigi's most useful features for complex pipelines.
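To make the idempotency behavior concrete, here is a minimal sketch in plain Python of the "skip if the output already exists" check. It is a simplification under stated assumptions, not Luigi's actual implementation: WordCountTask and run_if_needed are hypothetical names, a local temporary directory stands in for S3, and the file contents are placeholder data.

```python
import os
import tempfile


class WordCountTask:
    """Hypothetical stand-in for a task that writes a single output file."""

    def __init__(self, output_path):
        self.output_path = output_path
        self.run_count = 0  # tracks how often the work was actually done

    def complete(self):
        # The task counts as done when its output already exists.
        return os.path.exists(self.output_path)

    def run(self):
        self.run_count += 1
        with open(self.output_path, "w") as handle:
            handle.write("placeholder output\n")  # illustrative data only


def run_if_needed(task):
    """Run the task only when its output is missing; return True if it ran."""
    if task.complete():
        return False
    task.run()
    return True


workdir = tempfile.mkdtemp()  # local directory standing in for the S3 bucket
first = WordCountTask(os.path.join(workdir, "q-words"))
ran_first = run_if_needed(first)      # output missing, so the task runs
ran_again = run_if_needed(first)      # output exists, so the task is skipped
second = WordCountTask(os.path.join(workdir, "q-words2"))
ran_new_path = run_if_needed(second)  # new output path, so the task runs
print(ran_first, ran_again, ran_new_path)
```

Pointing the task at a new output path, as the q-words2 suffix does in the command above, is what makes the work run again.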
When you submit the mortar luigi command, your code will sync to the cloud, and your job will start running. Instead of seeing a stream of log messages, you'll get one link to check the status of your job in the Mortar application.
If you follow the link, you'll see the progress of both your Luigi job and any Pig jobs run as part of your pipeline:
Running in the cloud makes it easier to retrieve the output of any of your Pig scripts, should you need it. Once your Luigi job completes, click on the "Results" button next to your Pig job on the Job History page. You'll see a link to download your results directly from S3, like this:
Go ahead and download the file and open it up. If you’ve ever wondered which words beginning with “q” are the most common, now you have an answer to your … wait for it … question.