Luigi is a powerful, easy-to-use framework for building data pipelines. It was developed at Spotify, where it runs thousands of jobs per day.
Luigi pipelines automatically handle dependency resolution, checkpointing and failure recovery, parallel execution, command line integration, and much more. Pipelines are expressed as simple, easy-to-read Python that can be reused anywhere. The open source community maintains a library of common tasks and updates it frequently.
The Mortar platform offers full support for Luigi. You can build pipelines on your computer with the Mortar Development Framework, and then deploy and run them in the cloud with one command.
A Luigi script defines a pipeline of actions called “Tasks” that depend on one another. Unlike many other workflow tools, in Luigi a task can be anything you can write in Python, from simple data aggregation to connecting to AWS services. Luigi chains the tasks together, automates the run process, and recovers from failures without redoing work that has already completed.
Each Luigi task declares the tasks it depends on, and those must complete before it can start. So if you run the last task in a dependency chain, Luigi will first run all of the tasks it depends on, and all the dependencies of those tasks, and so on down the chain until it reaches tasks whose dependencies are already satisfied.
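The resolution order described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Luigi's API or its scheduler: the ToyTask class, the build function, and the task names are all hypothetical. The method names requires, complete, and run are chosen to mirror the methods of the same names on a real Luigi Task.

```python
class ToyTask:
    """Toy stand-in for a pipeline task (hypothetical; not luigi.Task)."""

    def __init__(self, name, requires=()):
        self.name = name
        self._requires = list(requires)
        self.done = False

    def requires(self):
        # The tasks that must be complete before this one can start.
        return self._requires

    def complete(self):
        # This toy model tracks completion with a simple in-memory flag.
        return self.done

    def run(self, log):
        # A real task would do work here; the toy just records its name.
        log.append(self.name)
        self.done = True


def build(task, log):
    """Run everything `task` depends on (depth first), then `task` itself."""
    if task.complete():
        return  # already satisfied, so nothing upstream needs to run
    for dep in task.requires():
        build(dep, log)
    task.run(log)


# A small dependency chain: report needs clean and load, load needs clean,
# and clean needs extract.
extract = ToyTask("extract")
clean = ToyTask("clean", requires=[extract])
load = ToyTask("load", requires=[clean])
report = ToyTask("report", requires=[clean, load])

run_order = []
build(report, run_order)
print(run_order)  # ['extract', 'clean', 'load', 'report']
```

Asking for only the last task, report, is enough to run the whole chain, and each task runs once even though clean is required twice. Calling build(report, ...) a second time runs nothing, because every task already reports itself complete. Luigi behaves in a similar spirit, but by default it decides completeness by checking whether a task's outputs exist rather than by an in-memory flag.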
To guarantee production stability, Mortar runs Luigi from an integration-tested fork at github.com/mortardata/luigi. We regularly pull in and release updates as they pass unit and integration tests.
Mortar has also open-sourced the mortar-luigi project, a collection of extensions for using Mortar from within Luigi. It contains a large collection of useful Tasks for things like running Pigscripts, managing clusters, running shell scripts, and interacting with databases like DynamoDB.
Luigi has a number of features that Mortar doesn't use, which you can read about in the Luigi docs.
In addition to running pipelines, Luigi was originally designed to run Hadoop jobs directly. It has native Python MapReduce support built into its luigi.hadoop.JobTask class. Mortar does not use Luigi’s Hadoop support; instead we use the MortarProjectPigscriptTask Task to run and manage Pig jobs on Hadoop. Mortar then handles all of the heavy lifting on an Elastic MapReduce cluster for you.
To learn more about Luigi, take a look at our Luigi documentation and tutorials.
Now that we've covered the basic components you'll need to use in this tutorial, let's get started building an example data warehouse.