2. Foursquare
• 35 million users
• Nearly 4 billion check-ins
• More than 5 million check-ins per day
• Database of 50 million points of interest
• Hundreds of GB of log data per day
3. Tools We Use
• Hive
o Ad hoc analytics, data dumping ground
• Raw MapReduce
o Hundreds of MapReduce jobs in our codebase
• Pig
o Fits between structured Hive and free-form MapReduce
• Vertica
o Low latency analytics
4. Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
5. Cron - Problems
• Brittle
• Hard to reason about / visualize
• Spend a lot of time waiting
• Difficult to tell what succeeded or failed
• No one likes writing Bash scripts
6. Oozie
XML-based workflow engine, with support for Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g., "Run this Hive query, then run these two MapReduce jobs in parallel"
Coordinators launch recurring workflows at a given frequency, when dependent data is available
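The DAG idea in the quoted example can be sketched in a few lines of Python. This is only an illustration of dependency ordering, not Oozie's XML format or API. The step names (hive_query, mapreduce_a, mapreduce_b) and the helper execution_waves are made up for this sketch.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each step maps to the set of steps it depends on.
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Group steps into waves; steps in the same wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(workflow), start=1):
        print(number, wave)
```

The Hive query lands in the first wave on its own, and both MapReduce jobs land together in the second.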
8. Oozie - Problems
• Workflows are all-or-nothing
o Cannot re-run only the step that failed
o Very little code reuse
• Little to no extensibility
• Limited control flow
• Extremely verbose
• Difficult to test
• No one likes writing XML
9. Luigi
• Python framework for batch processing jobs
• Created by Spotify, open-sourced Sept. 2012
• Tasks are units of work that produce Targets
• Tasks can depend on one or more other Tasks
• A Task is only run once all of the Tasks it depends on are done
• Tasks are idempotent
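The Task / Target model above can be illustrated with a toy, standard-library-only sketch. It does not use Luigi's real classes: Task, FetchText, and build here are simplified stand-ins invented for this example, and only the WordCount name and date parameter come from the slides. A Task counts as done when its output file exists, so a second build does no work.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a unit of work; NOT the real luigi.Task class."""

    def __init__(self, workdir, date):
        self.workdir = workdir
        self.date = date

    def requires(self):
        return []  # Tasks this Task depends on

    def output(self):
        # The "Target": a file path derived from class name and parameter.
        return os.path.join(self.workdir, f"{type(self).__name__}-{self.date}.txt")

    def complete(self):
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class FetchText(Task):  # hypothetical upstream Task
    def run(self):
        with open(self.output(), "w") as f:
            f.write("to be or not to be\n")


class WordCount(Task):
    def requires(self):
        return [FetchText(self.workdir, self.date)]

    def run(self):
        with open(self.requires()[0].output()) as f:
            words = f.read().split()
        with open(self.output(), "w") as f:
            f.write(f"{len(words)}\n")


def build(task, ran=None):
    """Run dependencies first, skipping any Task whose output already exists."""
    ran = [] if ran is None else ran
    if task.complete():
        return ran
    for dep in task.requires():
        build(dep, ran)
    task.run()
    ran.append(type(task).__name__)
    return ran


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(build(WordCount(workdir, "2013-06-01")))  # both Tasks run
        print(build(WordCount(workdir, "2013-06-01")))  # nothing left to do
```

Deriving completeness from the output file is what makes re-running safe: a finished step is skipped, and a failed step is the only one that runs again.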
11. Luigi - Running the Task
$ python word-count.py WordCount --date 2013-06-01
12. Luigi - Scheduler
Central scheduler ensures each Task is only run by a single worker.
A task is uniquely identified by its class name and its Parameters, e.g.
WordCount(date=2013-06-01)
Retries failed Tasks after a configured timeout
Emails someone when a Task fails
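A rough sketch of the identity rule, assuming an in-memory scheduler. This is not Luigi's scheduler code: task_id, ToyScheduler, and the worker names are hypothetical, and retries and email are left out. It only shows how a key built from the class name plus Parameters lets a central process hand each Task to one worker.

```python
def task_id(class_name, **params):
    """Build an identifier such as WordCount(date=2013-06-01)."""
    args = ", ".join(f"{name}={value}" for name, value in sorted(params.items()))
    return f"{class_name}({args})"


class ToyScheduler:
    """Hypothetical in-memory scheduler: the first worker to claim a Task owns it."""

    def __init__(self):
        self._owners = {}

    def claim(self, worker, class_name, **params):
        key = task_id(class_name, **params)
        # setdefault records the worker only if nobody has claimed the key yet.
        owner = self._owners.setdefault(key, worker)
        return owner == worker


if __name__ == "__main__":
    scheduler = ToyScheduler()
    print(task_id("WordCount", date="2013-06-01"))
    print(scheduler.claim("worker-1", "WordCount", date="2013-06-01"))
    print(scheduler.claim("worker-2", "WordCount", date="2013-06-01"))
```

The same class with a different date yields a different identifier, so it is treated as a separate Task that another worker can pick up.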
16. Luigi - Advantages over Cron
• Explicit dependencies
• No wasted time waiting
• Easy to tell what has failed
• Avoid duplicate work / partial failures
17. Luigi - Advantages over Oozie
• Explicit dependencies between workflows
• Easier to write
• Vastly more extensible
• Code reuse
• Can easily re-run individual steps
18. Thank you!
Check out Luigi:
https://github.com/spotify/luigi
Drop me a line:
Joe Ennever
jennever@foursquare.com