5. A Workflow Engine
• Oozie executes workflow defined as DAG of jobs
• The job type includes: Map-Reduce/Pig/Hive/Any script/
Custom Java Code etc
M/R
streaming
job
M/R
start job fork join
Pig MORE
decision
job
M/R ENOUGH
job
FS
end Java
job
6. A Scheduler
• Oozie executes workflow based on:
– Time Dependency (Frequency)
– Data Dependency
Oozie Server
Check
WS API Oozie Data Availability
Coordinator
Oozie
Oozie Workflow
Client Hadoop
7. REST-API for Hadoop Components
• Direct access to Hadoop components
– Emulates the command line through REST
API.
• Supported Products:
– Pig
– Map Reduce
8. Three Questions …
Do you need Oozie?
Q1 : Do you have multiple jobs with
dependency?
Q2 : Does your job start based on time or data
availability?
Q3 : Do you need monitoring and operational
support for your jobs?
If any one of your answers is YES,
then you should consider Oozie!
9. What Oozie is NOT
• Oozie is not a resource scheduler
• Oozie is not for off-grid scheduling
o Note: Off-grid execution is possible through
SSH action.
• If you want to submit your job occasionally,
Oozie is an option.
o Oozie provides REST API based submission.
11. Oozie in Apache
• Y! internal usages:
– Total number of user : 375
– Total number of processed jobs ≈ 750K/
month
• External downloads:
– 2500+ in last year from GitHub
– A large number of downloads maintained by
3rd party packaging.
16. Manageability
• Email action
• Query Pig Stats/Hadoop Counters
– Runtime control of Workflow based on stats
– Application-level control using the stats
17. Challenges : Queue Starvation
• Which Queue?
– Not a Hadoop queue issue.
– Oozie internal queue to process the Oozie
sub-tasks.
– Oozie’s main execution engine.
• User Problem :
– Job’s kill/suspend takes very long time.
18. Challenges : Queue Starvation
Technical Problem:
• Before execution, every task acquires lock on the job id.
• Specialhigh-priority tasks (such as Kill or Suspend)
couldn’t get the lock and therefore, starve.
In Queue J1 J2
J1 J1 J2 J1(H) J2 J1
Starvation for High Priority Task!
19. Challenges : Queue Starvation
Resolution:
• Add the high priority task in both the interrupt list and normal queue.
• Before de-queue, check if there is any task in the interrupt list for the
same job id. If there is one, execute that first.
In Queue J1 J2
J1 J1 J2 J1(H) J2 J1
finds a task in interrupt queue
In Interrupt List
J1(H)
20. Oozie Futures
• Easy adoption
– Modeling tool
– IDE integration
– Modular Configurations
• Allow job notification through JMS
• Event-based data processing
• Prioritization
– By user, system level.
21. Take Away ..
• Oozie is
– In Apache!
– Reliable and feature-rich.
– Growing fast.
22. Q&A
Mohammad K Islam
kamrul@yahoo-inc.com
http://incubator.apache.org/oozie/
23. Who needs Oozie?
• Multiple jobs that have sequential/
conditional/parallel dependency
• Need to run job/Workflow periodically.
• Need to launch job when data is available.
• Operational requirements:
– Easy monitoring
– Reprocessing
– Catch-up
24. Challenges : Queue Starvation
Problem:
• Consider queue with tasks of type T1 and T2. Max Concurrency = 2.
• Over-provisioned task (marked by red) is pushed back to the queue.
• At high load, it gets penalized in favor of same type, but later arrival
of tasks .
In Queue Running C (T1) C (T2)
T1 T2 T1 T1 T1 T2 T1 012 01
Starvation!
T1 cannot execute and is pushed to head of queue
25. Challenges : Queue Starvation
Resolution:
• Before de-queuing any task, check its concurrency.
• If violated, skip and get the next task.
In Queue Running C (T1) C (T2)
T1 T2 T1 T1 T1 T2 T1 012 01 2
Enqueue T2 now T1 cannot execute, so skip by one normallyfront
T1 now executes node to