Luigi is a powerful, easy-to-use framework for building data pipelines. It was developed at Spotify, where it runs thousands of jobs per day.
Luigi pipelines automatically handle dependency resolution, checkpointing and failure recovery, parallel execution, command line integration, and much more. Pipelines are expressed as simple, easy-to-read Python that can be reused anywhere. The open source community has contributed a library of common tasks, with frequent updates and contributions.
The Mortar platform offers full support for Luigi. You can build pipelines on your computer with the Mortar Development Framework, and then deploy and run them in the cloud with one command.
Luigi scripts consist of a pipeline of actions called “Tasks” that have dependencies on each other. Unlike many other workflow tools, Luigi lets a task be anything you can write in Python, from simple data aggregation to connecting to AWS services. Luigi chains the tasks together, automates the run process, and handles failures gracefully.
Each Luigi task declares its dependencies on other tasks, requiring those to complete before it can start. Therefore, if you run the last task in a dependency chain, Luigi will run all of that task's dependencies, and all of the dependencies of those tasks, and so on, all the way down until it reaches tasks that are already complete or have no dependencies at all.
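This recursive dependency walk can be sketched in plain Python. The sketch below is a simplified, stdlib-only model of Luigi's scheduling behavior, not real Luigi code (real pipelines subclass `luigi.Task` and override `requires()`, `output()`, and `run()`); the task names mirror the diagram below and are purely illustrative:

```python
# Simplified model of Luigi-style dependency resolution (not the real API).
# Each "task" declares the tasks it requires; running the last task in the
# chain first runs every incomplete upstream task, depth-first.

class Task:
    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.complete = False  # real Luigi checks output targets instead

    def run(self, log):
        log.append(self.name)
        self.complete = True

def build(task, log):
    """Satisfy a task's dependencies recursively, then run the task."""
    if task.complete:
        return  # checkpointing: finished tasks are skipped on re-runs
    for dep in task.requires:
        build(dep, log)
    task.run(log)

# Wire up a chain shaped like the recommendation pipeline diagram.
load = Task("LoadDataFromS3")
item_item = Task("CreateItemItemTable", requires=[load])
user_item = Task("CreateUserItemTable", requires=[load])
write = Task("WriteDynamoDBTable", requires=[item_item, user_item])

order = []
build(write, order)  # running only the last task pulls in the whole chain
```

Note that the shared `LoadDataFromS3` task runs exactly once even though two downstream tasks require it, and calling `build` again is a no-op once everything is complete — the same properties Luigi's checkpointing gives you on re-runs after a failure.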
This is very similar to the dependency chain we’ll create for the recommendation system pipeline.
Luigi Task Dependency Diagram
The above diagram assumes the data for your recommender system is being loaded from S3 and the results are being written to DynamoDB. The same pipeline adapts easily to other data stores. With MongoDB, the bottom task "Load data from S3" becomes "Load data from MongoDB", and because MongoDB supports creating collections on the fly, the three tasks "Create Item-Item Table", "Create User-Item Table", and "Write DynamoDB Table" collapse into a single "Write MongoDB Collection" task. With a SQL database, "Load data from S3" becomes "Load data into S3 from Database" and "Write DynamoDB Table" becomes "Write Database Tables". In every case, the principles of how Luigi works are the same regardless of the specific tasks in the pipeline.
To guarantee production stability, Mortar runs Luigi from an integration-tested fork at github.com/mortardata/luigi. We regularly pull in and release updates as they pass unit and integration tests.
Mortar has also open-sourced the mortar-luigi project, a collection of extensions for using Mortar from within Luigi, including Tasks for running Pigscripts, managing clusters, running shell scripts, and interacting with databases like DynamoDB.
There are a number of features of Luigi that aren’t used by Mortar, but which you can read about in the Luigi docs.
In addition to running pipelines, Luigi was originally designed to run Hadoop jobs directly: it has native Python MapReduce support built into its luigi.hadoop.JobTask class. Mortar does not use Luigi's Hadoop support; instead we use the MortarProjectPigscriptTask Task to run and manage Pig jobs on Hadoop, and Mortar handles all of the heavy lifting on an Elastic MapReduce cluster for you.
To learn more about Luigi, take a look at our Luigi documentation and tutorials.
Next we’ll run an example Luigi pipeline using Mortar.