Oozie Evolution
Gateway to Hadoop Eco-System


             Mohammad Islam
Agenda

•    What is Oozie?
•    What is in the Next Release?
•    Challenges
•    Future Works
•    Q&A
Oozie in Hadoop Eco-System

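[Figure: Hadoop stack diagram. Oozie is drawn across the stack, with HCatalog, Pig, Sqoop, and Hive in one layer, Map-Reduce beneath them, and HDFS at the bottom.]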
Oozie : The Conductor
A Workflow Engine
•  Oozie executes workflows defined as a DAG of jobs
•  Job types include Map-Reduce, Pig, Hive, any script, custom Java code, etc.

[Figure: example workflow DAG with the nodes start, M/R job, fork, M/R streaming job, Pig job, join, decision (branches labeled MORE and ENOUGH), M/R job, Java job, FS job, and end.]
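
The idea of a workflow as a DAG of jobs can be illustrated with a minimal sketch. It is not Oozie's implementation: the job names and the `run_workflow` helper are hypothetical, decision nodes are left out, and the standard-library `graphlib` module (Python 3.9+) stands in for the workflow engine.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "mr_job": set(),
    "mr_streaming_job": {"mr_job"},               # fork: both branches wait on mr_job
    "pig_job": {"mr_job"},
    "java_job": {"mr_streaming_job", "pig_job"},  # join: waits for both branches
    "fs_job": {"java_job"},
}

def run_workflow(dag, run_job):
    """Run every job after all of its dependencies; return the run order."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    while sorter.is_active():
        # Jobs returned together have no unmet dependencies, so a real
        # engine could run them in parallel; here they run one by one.
        for job in sorted(sorter.get_ready()):
            run_job(job)
            order.append(job)
            sorter.done(job)
    return order

order = run_workflow(workflow, lambda job: None)
print(order)
```

The two forked jobs become ready at the same time, and the job after the join only becomes ready once both have finished.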
A Scheduler
•  Oozie executes workflows based on:
   –  Time Dependency (Frequency)
   –  Data Dependency

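[Figure: architecture diagram. The Oozie Client reaches the Oozie Server through the WS API. The server contains the Oozie Coordinator, which checks data availability, and the Oozie Workflow engine, which talks to Hadoop.]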
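
A rough sketch of how time and data dependencies combine: a run is eligible only when its scheduled time has passed and its input data is present. The function names, the hourly frequency, and the in-memory availability check are assumptions made for illustration, not the coordinator's actual logic.

```python
from datetime import datetime, timedelta

def nominal_times(start, end, frequency):
    """Yield each scheduled (nominal) time in [start, end)."""
    t = start
    while t < end:
        yield t
        t += frequency

def ready_runs(start, end, frequency, now, data_available):
    """Return nominal times whose time and data dependencies are both met."""
    return [t for t in nominal_times(start, end, frequency)
            if t <= now and data_available(t)]

# Hypothetical availability check: only the first two hours of input have landed.
start = datetime(2011, 11, 1, 0)
landed = {start, start + timedelta(hours=1)}

runs = ready_runs(
    start,
    start + timedelta(hours=6),
    timedelta(hours=1),
    now=start + timedelta(hours=3),
    data_available=lambda t: t in landed,
)
print(runs)
```

Here four nominal times have passed, but only the two with landed input qualify; the others wait for their data.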
REST-API for Hadoop Components

•  Direct access to Hadoop components
  –  Emulates the command line through the REST API.
•  Supported Products:
  –  Pig
  –  Map Reduce
Three Questions …
 Do you need Oozie?


Q1: Do you have multiple jobs with dependencies?
Q2: Does your job start based on time or data availability?
Q3: Do you need monitoring and operational support for your jobs?
   If any one of your answers is YES,
   then you should consider Oozie!
What Oozie is NOT

•  Oozie is not a resource scheduler

•  Oozie is not for off-grid scheduling
   o  Note: Off-grid execution is possible through
   SSH action.

•  If you want to submit your job occasionally,
   Oozie is an option.
    o  Oozie provides REST API based submission.
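
As a hedged illustration of REST-based submission, the sketch below only builds the HTTP request and does not send it. The `/v1/jobs` path, the `action=start` parameter, and the XML configuration body follow the Oozie Web Services API documentation and should be checked against the deployed version. The host, port, user name, HDFS path, and the `build_submit_request` helper are hypothetical.

```python
import urllib.request
from xml.sax.saxutils import escape

def build_submit_request(oozie_url, properties):
    """Build (but do not send) a POST request that submits and starts a job."""
    props = "".join(
        f"<property><name>{escape(k)}</name><value>{escape(v)}</value></property>"
        for k, v in properties.items()
    )
    body = f'<?xml version="1.0" encoding="UTF-8"?><configuration>{props}</configuration>'
    return urllib.request.Request(
        url=f"{oozie_url}/v1/jobs?action=start",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/xml;charset=UTF-8"},
        method="POST",
    )

# Hypothetical server address and application path.
req = build_submit_request(
    "http://localhost:11000/oozie",
    {
        "user.name": "alice",
        "oozie.wf.application.path": "hdfs://namenode/user/alice/app",
    },
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the submission against a running server.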
Oozie in Apache
Main Contributors
Oozie in Apache

•  Y! internal usage:
  –  Total number of users: 375
  –  Total number of processed jobs ≈ 750K/month
•  External downloads:
  –  2500+ in the last year from GitHub
  –  A large number of downloads through third-party packaging.
Oozie Usage (contd.)

•  User Community:
  –  Membership
    •  Y! internal – 286
    •  External – 163
  –  Messages (approximate)
    •  Y! internal – 7/day
    •  External – 8/day
Next Release …

•  Integration with Hadoop 0.23

•  HCatalog integration
  –  Non-polling approach
Usability

•    Script Action
•    Distcp Action
•    Suspend Action
•    Mini-Oozie for CI
     –  Like Mini-cluster
•  Support multiple versions
     –  Pig, Distcp, Hive etc.
Reliability

•  Auto-retry at the WF action level

•  High-Availability
  –  Hot-Warm through ZooKeeper
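
Auto-retry at the action level can be pictured with the following sketch. It assumes that only transient failures are retried and that the retry count and interval are configurable; `TransientError`, `run_with_retry`, and the parameter names are hypothetical and do not describe Oozie's configuration.

```python
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def run_with_retry(action, max_retries=3, interval=0.0, sleep=time.sleep):
    """Run action(); on TransientError retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return action()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the failure
            attempt += 1
            sleep(interval)

# Demo: an action that fails twice and then succeeds.
calls = []

def flaky_action():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("temporary failure")
    return "ok"

result = run_with_retry(flaky_action, max_retries=3)
print(result, len(calls))
```

Errors of any other type propagate immediately, so permanent failures are not retried.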
Manageability

•  Email action

•  Query Pig Stats/Hadoop Counters
  –  Runtime control of Workflow based on stats
  –  Application-level control using the stats
Challenges : Queue Starvation

•  Which Queue?
  –  Not a Hadoop queue issue.
  –  Oozie internal queue to process the Oozie
     sub-tasks.
  –  Oozie’s main execution engine.
•  User Problem :
  –  Killing or suspending a job takes a very long time.
Challenges : Queue Starvation
Technical Problem:
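•  Before execution, every task acquires a lock on the job id.
•  Special high-priority tasks (such as Kill or Suspend) couldn't get the lock and therefore starved.

[Figure: "In Queue" shows tasks J1, J1, J2, J1(H), J2, where J1(H) is a high-priority task for job J1; the right-hand side shows jobs J1 and J2, with J1 marked. Caption: Starvation for High Priority Task!]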
Challenges : Queue Starvation
Resolution:
•  Add the high-priority task to both the interrupt list and the normal queue.
•  Before de-queuing, check whether the interrupt list holds a task for the same job id. If it does, execute that task first.

[Figure: "In Queue" shows tasks J1, J1, J2, J1(H), J2; "In Interrupt List" shows J1(H). When a J1 task is de-queued, the engine finds a task in the interrupt list for J1.]
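
The resolution above can be sketched as follows. This is a simplified single-threaded model, not Oozie's code: locking is omitted, and the `Task` and `CommandQueue` names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(eq=False)  # identity-based equality, so duplicate tasks stay distinct
class Task:
    job_id: str
    name: str

class CommandQueue:
    def __init__(self):
        self.queue = deque()     # normal FIFO queue
        self.interrupts = {}     # job id -> deque of high-priority tasks
        self.already_run = set() # tasks executed early via the interrupt list

    def add(self, task, high_priority=False):
        self.queue.append(task)  # every task goes into the normal queue
        if high_priority:        # high-priority tasks also go to the interrupt list
            self.interrupts.setdefault(task.job_id, deque()).append(task)

    def poll(self):
        """Return the next task to execute, or None if the queue is empty."""
        while self.queue:
            task = self.queue.popleft()
            if task in self.already_run:   # its queue copy is now stale
                self.already_run.discard(task)
                continue
            pending = self.interrupts.get(task.job_id)
            if pending:
                urgent = pending.popleft()
                if urgent is not task:
                    self.queue.appendleft(task)   # normal task keeps its place
                    self.already_run.add(urgent)  # skip urgent's queue copy later
                return urgent
            return task
        return None

# The queue from the slide: J1, J1, J2, J1(H), J2.
q = CommandQueue()
q.add(Task("J1", "a"))
q.add(Task("J1", "b"))
q.add(Task("J2", "c"))
q.add(Task("J1", "kill"), high_priority=True)
q.add(Task("J2", "d"))

names = []
while (t := q.poll()) is not None:
    names.append(t.name)
print(names)
```

The kill task for J1 runs as soon as any J1 task reaches the head of the queue, instead of waiting behind every earlier J1 task.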
Oozie Futures

•  Easy adoption
  –  Modeling tool
  –  IDE integration
  –  Modular Configurations
•  Allow job notification through JMS
•  Event-based data processing
•  Prioritization
  –  By user, system level.
Take Away ..

•  Oozie is
  –  In Apache!
  –  Reliable and feature-rich.
  –  Growing fast.
Q&A




                  Mohammad K Islam
               kamrul@yahoo-inc.com
      http://incubator.apache.org/oozie/
Who needs Oozie?

•  Multiple jobs that have sequential, conditional, or parallel dependencies
•  Need to run job/Workflow periodically.
•  Need to launch job when data is available.
•  Operational requirements:
  –  Easy monitoring
  –  Reprocessing
  –  Catch-up
Challenges : Queue Starvation
Problem:
•  Consider a queue with tasks of type T1 and T2. Max Concurrency = 2.
•  An over-provisioned task (marked in red) is pushed back to the queue.
•  At high load, it is penalized in favor of later-arriving tasks of the same type.

[Figure: "In Queue" holds T1, T2, T1, T1, T1, T2, T1. The "Running" panel has counters C(T1) stepping 0, 1, 2 and C(T2) stepping 0, 1. Caption: Starvation! T1 cannot execute and is pushed to head of queue.]
Challenges : Queue Starvation
Resolution:
            •  Before de-queuing any task, check its concurrency.
            •  If violated, skip and get the next task.


[Figure: "In Queue" holds T1, T2, T1, T1, T1, T2, T1. The "Running" panel has counters C(T1) stepping 0, 1, 2 and C(T2) stepping 0, 1, 2. Callouts: "Enqueue T2 now"; "T1 cannot execute, so skip by one node to front"; "T1 now executes normally".]
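
A minimal sketch of the concurrency check described above, assuming one shared limit for all task types and a plain counter of running tasks. The `poll` function and its parameters are hypothetical and single-threaded.

```python
from collections import Counter, deque

def poll(queue, running, max_concurrency):
    """Remove and return the first queued task type that is under its limit.

    Tasks whose type is already at max_concurrency are skipped and stay
    in place, so they are not re-queued behind later arrivals.
    """
    for i in range(len(queue)):
        task_type = queue[i]
        if running[task_type] < max_concurrency:
            del queue[i]
            running[task_type] += 1
            return task_type
    return None  # every queued task would violate its concurrency limit

# The queue from the slide, with two T1 tasks and one T2 task already running.
queue = deque(["T1", "T2", "T1", "T1", "T1", "T2", "T1"])
running = Counter({"T1": 2, "T2": 1})

first = poll(queue, running, max_concurrency=2)   # T1 is skipped, T2 runs
second = poll(queue, running, max_concurrency=2)  # both types are at the limit
running["T1"] -= 1                                # one T1 task finishes
third = poll(queue, running, max_concurrency=2)   # the head T1 can run now
print(first, second, third)
```

Because skipped tasks keep their position, the T1 task at the head of the queue is the first to run once a slot frees up.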

Nov 2011 HUG: Oozie
