Oozie towards zero downtime
Hadoop Summit 2015/04/15
Purshotam Shah purushah@yahoo-inc.com
Ryota Egashira egashira@yahoo-inc.com
Agenda
● Introduction
■ Scale at Yahoo
■ Use Cases
■ Why zero downtime matters
● Architectural Overview
● Technical Challenge
■ Security
■ Log Streaming
■ HCatalog Integration in HA
● Experiences
● Future Work
2 Yahoo Confidential & Proprietary
Why Oozie?
The Problem
▪ Doing something on the grid often required multiple steps
› MapReduce job
› Pig job
› Streaming job
› HDFS operation (mkdir, chmod, etc)…
▪ Multiple ad-hoc solutions existed
› custom job control
› cron…
The Need
▪ Workflow scheduler with better support for grid jobs (native integration with Hadoop)
› orchestrate dependency between jobs
› execute at specific time or on data availability
› retry jobs in the event of failures (reliable)
▪ Common framework for communication and execution of production process
› sync (clocked dataset) awareness
› async (unspecified freq) data awareness
A server-based workflow scheduling system to manage Hadoop jobs
Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2,000+ projects
255 monthly users
28.9 million Hadoop Jobs monthly (Jan 2015, total)
72% from Oozie (including launcher jobs)
108,000 workflow jobs daily (Feb 2015, one busy cluster)
Between 1 and 8 actions per workflow (avg. 4 actions/workflow)
Extreme use case: 100-200 workflow jobs submitted per minute
1,700 coordinator jobs daily (Feb 2015, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
67% of workflow jobs kicked off from coordinators
60 bundle jobs daily (Feb 2015, one busy cluster)
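As a rough back-of-the-envelope check, the daily figures above can be turned into average per-minute rates. The sketch below uses only the numbers quoted on this slide; the constant and function names are illustrative, and the result is an even-spread average, not a measured peak.

```python
# Figures quoted on the slide (Feb 2015, one busy cluster).
WORKFLOWS_PER_DAY = 108_000
AVG_ACTIONS_PER_WORKFLOW = 4
MINUTES_PER_DAY = 24 * 60


def average_rates() -> dict:
    """Return average per-minute rates, assuming load is spread evenly."""
    actions_per_day = WORKFLOWS_PER_DAY * AVG_ACTIONS_PER_WORKFLOW
    return {
        "workflows_per_min": WORKFLOWS_PER_DAY / MINUTES_PER_DAY,
        "actions_per_day": actions_per_day,
        "actions_per_min": actions_per_day / MINUTES_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in average_rates().items():
        print(f"{name}: {value:,.0f}")
```

The even-spread average comes out below the 100-200 submissions per minute quoted as the extreme case, which is consistent with load arriving in bursts rather than uniformly.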
Hadoop Jobs on the Platform: Job distribution (Jan 2015) (chart)
Y! business processed by Oozie
Ad Exchange
Ad Latency
Search Advertising
Content Agility
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Event Pipeline
Use Case - Data pipeline
Number of actions created hourly (chart; x-axis: time of day, from midnight through AM and PM)
Number of actions created per minute (chart)
SCALE
▪ At certain points in time, the 5-min, 15-min, 30-min, hourly,
daily, and monthly coordinator jobs all collide, producing a
burst of coordinator actions that a single host
can't handle.
▪ We noticed processing delays, and customers
complained about slowness.
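The collision described above can be illustrated with a small sketch. It assumes, purely for illustration, that every coordinator frequency is aligned to midnight; monthly jobs are left out because a month is not a fixed number of minutes. The names are hypothetical and not part of Oozie.

```python
from typing import List

# Coordinator frequencies from the slide, in minutes (daily = 1440).
# Assumption: all of them are aligned to minute 0 (midnight).
FREQUENCIES_MIN = [5, 15, 30, 60, 1440]


def classes_firing(minute: int) -> List[int]:
    """Return the frequency classes that materialize actions at this minute."""
    return [f for f in FREQUENCIES_MIN if minute % f == 0]


def busiest_minutes(horizon_min: int) -> List[int]:
    """Return the minutes in [0, horizon_min) where the most classes collide."""
    counts = {m: len(classes_firing(m)) for m in range(horizon_min)}
    peak = max(counts.values())
    return [m for m, c in counts.items() if c == peak]


if __name__ == "__main__":
    print(classes_firing(5))      # only the 5-minute jobs
    print(classes_firing(60))     # top of the hour: several classes at once
    print(busiest_minutes(1440))  # minute(s) of the day with the largest burst
```

Under these assumptions the top of every hour is already a burst, and midnight is the worst case because every class fires together.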
Why Downtime matters? Downtime needed
▪ Oozie upgrade (major release > 1 per quarter, minor > 1 per month)
▪ Dependent Hadoop projects upgrade (YARN, HDFS, Hive, HBase, Pig, HCatalog, etc.)
▪ Configuration error / change
▪ Hardware error / upgrade
Why Downtime matters? Customers
Revenue-impacting applications need to run all the time, with no delay!
Why Downtime matters? Ops
Ops are under pressure to minimize downtime
Solution: High Availability
● Definition: failure of a component != failure of the entire system
o achieved by removing single points of failure
● Requirement: transparency to users
o Users should not need to know whether it's HA or not
o No change in API or usage pattern
Architecture
(Diagram labels: Load Balancer, Oozie Server 1 … Oozie Server n, RDB, Hadoop Cluster, ZooKeeper / Curator; arrows: submit request, request redirection, inter-server communication)
Architectural Overview: Database
● Oozie stores most of its state in a database
o (submitted jobs, workflow definitions, etc.)
● An Oracle database (2 racks) in HA (hot-warm) is used.
● ZooKeeper (via Curator) is used for coordination
Architectural Overview: Access
● Users and client programs need a single address to
connect to
o Web UI, REST/Java API,
JobTracker/ResourceManager callbacks, etc
● A virtual IP (VIP) is used as the user-facing URL.
Architectural Overview: Security
● We use Kerberos and some internal security systems for communication
among components.
Architectural Overview: Authentication
(Diagram: the architecture diagram annotated with the security used on each link. Labels: Load Balancer, Oozie Server 1 … Oozie Server n, RDB, Hadoop Cluster, ZooKeeper / Curator; submit request, request redirection, inter-server communication for log streaming etc., ZooKeeper for lock and management. Security annotations: HTTPS + Kerberos / cookie-based auth; HTTPS + Kerberos; Kerberos.)
Technical Challenge: Log Streaming
● Each Oozie server only has access to its own logs
● Jobs can execute on any server
o Job execution can switch among servers
● Users need to see one sequential log rather than separate server1 and
server2 logs.
Architecture: Log Streaming
1. User request comes to server1
2. Server1 fetches the server list from ZK and calls the other servers to fetch logs
3. Logs from all servers are merged using the log timestamp
4. Log is displayed to user
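The merge by log timestamp in step 3 can be sketched as a k-way merge over per-server log streams. This is a minimal illustration, not Oozie's implementation: it assumes each server's lines are already in time order and that every line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp; `parse_ts` and `merge_logs` are hypothetical names.

```python
import heapq
from datetime import datetime
from typing import Iterable, List


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp (assumed format: 2015-04-15 10:00:01,123)."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def merge_logs(*per_server_logs: Iterable[str]) -> List[str]:
    """Merge already-sorted per-server logs into one stream ordered by timestamp."""
    return list(heapq.merge(*per_server_logs, key=parse_ts))


if __name__ == "__main__":
    server1 = [
        "2015-04-15 10:00:01,000 INFO action started on server1",
        "2015-04-15 10:00:05,000 INFO action ended on server1",
    ]
    server2 = ["2015-04-15 10:00:03,000 INFO callback handled on server2"]
    for line in merge_logs(server1, server2):
        print(line)
```

`heapq.merge` reads its inputs lazily, so a merge like this does not need to hold every server's full log in memory before it starts producing output.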
Caveat: Log Streaming
● If an Oozie server goes down, any logs from it will be unavailable
Technical Challenge: HCatalog Integration
• Hive Metastore (HCatalog): manages metadata for datasets
– Oozie registers for datasets with HCatalog
– Oozie receives notifications from HCatalog through JMS (e.g., ActiveMQ)
– Oozie starts the job immediately after the data becomes ready
(Diagram: Oozie Server 1, Oozie Server 2, JMS (e.g., ActiveMQ), job)
1. Register topic
2. Push notification <New Partition>
3. Notify new partition
Technical Challenge: Hive Metastore Integration
• Oozie maintains an in-memory list of datasets that need
notification.
• A notification comes to only one server.
• When a notification comes to one server, Oozie needs to
invalidate the cache on all other servers.
• This is done by a periodic task on each server that checks
the status of the job for each dataset and, if the job is no
longer waiting, removes the dataset from the cache.
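The periodic cleanup described above might look roughly like the sketch below. It is an assumption-laden illustration: `DependencyCache`, the `"WAITING"` status string, and the `job_status` lookup (standing in for a read of the shared database) are hypothetical names, not Oozie classes.

```python
from typing import Callable, Dict, List, Set


class DependencyCache:
    """Per-server in-memory map of datasets to the jobs waiting on them."""

    def __init__(self, job_status: Callable[[str], str]) -> None:
        # job_status is a hypothetical lookup of a job's current state,
        # e.g. backed by the database shared by all servers.
        self._job_status = job_status
        self._waiting: Dict[str, Set[str]] = {}

    def register(self, dataset: str, job_id: str) -> None:
        self._waiting.setdefault(dataset, set()).add(job_id)

    def datasets(self) -> List[str]:
        return sorted(self._waiting)

    def purge_stale(self) -> List[str]:
        """Periodic task: drop datasets whose jobs are no longer waiting."""
        removed = []
        for dataset in list(self._waiting):
            still_waiting = {
                job for job in self._waiting[dataset]
                if self._job_status(job) == "WAITING"
            }
            if still_waiting:
                self._waiting[dataset] = still_waiting
            else:
                del self._waiting[dataset]
                removed.append(dataset)
        return sorted(removed)
```

Because each server runs the same check against shared state, a server that never received the notification still drops the entry on its next pass, at the cost of a delay of up to one check interval.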
Technical Challenge: Hive Metastore Integration
(Diagram: Oozie Server 1, Oozie Server 2, JMS (e.g., ActiveMQ), job)
2. Register topic
3. Push notification <New Partition>
4. Notify new partition
Periodic check: remove registration
Challenges
● Distributed Job ID
o Maintain a distributed sequence number for the job ID using
Apache Curator + ZooKeeper
● ZooKeeper Failure Handling
o Oozie servers automatically shut down when ZooKeeper is
down
● Sharelib
o Support sharelib update in HA
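The distributed job ID idea can be sketched as several servers drawing sequence numbers from one shared counter. In the sketch below the counter is an in-process stand-in protected by a thread lock; in the deployment described on the slide that role is played by Apache Curator + ZooKeeper. `SharedSequence`, `OozieLikeServer`, and the ID layout are hypothetical and chosen only for illustration.

```python
import threading


class SharedSequence:
    """In-process stand-in for a counter shared by all servers."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def next(self) -> int:
        # Atomic increment: no two callers ever see the same value.
        with self._lock:
            self._value += 1
            return self._value


class OozieLikeServer:
    """Hypothetical server that builds job IDs from the shared sequence."""

    def __init__(self, name: str, sequence: SharedSequence) -> None:
        self.name = name
        self._sequence = sequence

    def new_job_id(self, suffix: str = "W") -> str:
        # Illustrative layout only: zero-padded sequence number plus a suffix.
        return f"{self._sequence.next():07d}-{suffix}"


if __name__ == "__main__":
    seq = SharedSequence()
    a, b = OozieLikeServer("server1", seq), OozieLikeServer("server2", seq)
    print(a.new_job_id(), b.new_job_id(), a.new_job_id())
```

The point of the design is that uniqueness comes from the shared counter rather than from anything local to a server, so the ID does not depend on which server handled the request.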
More Challenges
• SLA support
– Oozie has an in-memory data structure to track SLA status
for each job (start/duration/end met/miss and
notifications)
– Add a check of SLA status against the database
– Use a ZK lock to synchronize updates to the same job
from multiple servers.
• Distributed Locks
– Reentrant distributed lock using Apache Curator +
ZooKeeper
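To make "reentrant" concrete: the same owner can take the lock again without blocking on itself, and the lock is only released once every acquire has been matched by a release. The sketch below layers that behaviour over a plain non-reentrant lock. It is a single-process illustration of the semantics only, not the Curator recipe; `ReentrantLock` and the owner strings are hypothetical.

```python
import threading


class ReentrantLock:
    """Reentrancy layered over a non-reentrant lock, keyed by owner identity."""

    def __init__(self, primitive) -> None:
        # primitive: any object with acquire(blocking) and release(),
        # here a threading.Lock standing in for a distributed lock.
        self._primitive = primitive
        self._owner = None
        self._holds = 0

    def acquire(self, owner: str, blocking: bool = True) -> bool:
        if self._owner == owner:
            # Same owner re-enters: just count the extra hold.
            self._holds += 1
            return True
        if not self._primitive.acquire(blocking):
            return False
        self._owner = owner
        self._holds = 1
        return True

    def release(self, owner: str) -> None:
        if self._owner != owner:
            raise RuntimeError("lock is not held by this owner")
        self._holds -= 1
        if self._holds == 0:
            # Last matching release frees the underlying lock.
            self._owner = None
            self._primitive.release()
```

A nested update path, for example an SLA check that runs inside a job update which already holds the job's lock, can then take the lock again safely instead of deadlocking against itself.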
Experiences
• HA has been running on all
production grids for more than 7
months at Yahoo!
– Stable!
Issues
– ZooKeeper down
(when upgrading ZK quorum hardware)
– Servers going out of sync
(during upgrade, sharelib)
Benefits
▪ Zero downtime for applications
▪ Rolling upgrade (zero downtime)
› Maintenance upgrade
› Configuration upgrade
▪ No more materialization delay
Workflow Job Submission Throughput
Future work
• Faster job fail-over
– Currently, fail-over waits for a thread (Recovery Service) to pick
up non-progressing jobs every few minutes
– An Oozie server should immediately notice when another
server is down and fail over its jobs (e.g., using a ZK
watcher)
• Improve log streaming
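The watcher-based fail-over proposed above can be sketched as a membership registry that calls back as soon as a server disappears, instead of waiting for the next periodic scan. Everything here is a hypothetical in-process stand-in: `Membership` plays the role of ZooKeeper membership plus a watcher, and `JobTable` with its per-job owner is an illustrative assumption, not how Oozie tracks jobs.

```python
from typing import Callable, Dict, List


class Membership:
    """Stand-in for server membership with a watcher callback."""

    def __init__(self) -> None:
        self._servers = set()
        self._watchers: List[Callable[[str, List[str]], None]] = []

    def watch(self, callback: Callable[[str, List[str]], None]) -> None:
        self._watchers.append(callback)

    def join(self, server: str) -> None:
        self._servers.add(server)

    def leave(self, server: str) -> None:
        # Fires immediately when a server goes away.
        self._servers.discard(server)
        for callback in self._watchers:
            callback(server, sorted(self._servers))


class JobTable:
    """Hypothetical mapping of job ID to the server currently driving it."""

    def __init__(self, owners: Dict[str, str]) -> None:
        self.owners = dict(owners)

    def fail_over(self, dead: str, survivors: List[str]) -> None:
        if not survivors:
            return
        orphaned = sorted(j for j, s in self.owners.items() if s == dead)
        # Spread the orphaned jobs round-robin over the surviving servers.
        for i, job in enumerate(orphaned):
            self.owners[job] = survivors[i % len(survivors)]
```

In this sketch the delay between a server going down and its jobs being picked up is bounded by how quickly the departure is detected rather than by a polling interval.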
Acknowledgement
Robert Kanter
Olga L. Natkovich
Rohini Palaniswamy
Michelle Chiang
Jacob Tolar
Sumeet Singh

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 

Recently uploaded (20)

Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 

Oozie towards zero downtime

  • 1. Oozie towards zero downtime Hadoop Summit 2015/04/15 Purshotam Shah purushah@yahoo-inc.com Ryota Egashira egashira@yahoo-inc.com
  • 2. Agenda
      ● Introduction
        ■ Scale at Yahoo
        ■ Use Cases
        ■ Why zero downtime matters?
      ● Architectural Overview
      ● Technical Challenge
        ■ Security
        ■ Log Streaming
        ■ HCatalog Integration in HA
      ● Experiences
      ● Future Work
  • 3. Why Oozie? A server-based workflow scheduling system to manage Hadoop jobs
      The Problem
      ▪ Doing something on the grid often required multiple steps
        ▪ MapReduce job
        ▪ Pig job
        ▪ Streaming job
        ▪ HDFS operation (mkdir, chmod, etc.)
      ▪ Multiple ad-hoc solutions existed
        ▪ custom job control
        ▪ cron
      The Need
      ▪ Workflow scheduler with better support for grid jobs (native integration with Hadoop)
        ▪ orchestrate dependencies between jobs
        ▪ execute at a specific time or on data availability
        ▪ retry jobs in the event of failures (reliable)
      ▪ Common framework for communication and execution of production processes
        ▪ sync (clocked dataset) awareness
        ▪ async (unspecified frequency) data awareness
  • 4. Scale at Yahoo
      ▪ Deployed on all clusters (production, non-production)
        ▪ One instance per cluster
      ▪ 75 products / 2000+ projects
      ▪ 255 monthly users
      ▪ 28.9 million Hadoop jobs monthly (Jan 2015, total)
        ▪ 72% from Oozie (including launcher jobs)
      ▪ 108,000 workflow jobs daily (Feb 2015, one busy cluster)
        ▪ Between 1 and 8 actions; avg. 4 actions/workflow
        ▪ Extreme use case: 100-200 workflow jobs submitted per minute
      ▪ 1,700 coordinator jobs daily (Feb 2015, one busy cluster)
        ▪ Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
        ▪ 67% of workflow jobs kicked off from coordinators
      ▪ 60 bundle jobs daily (Feb 2015, one busy cluster)
  • 5. Hadoop Jobs on the Platform: job distribution (Jan 2015) (chart)
  • 6. Y! business processed by Oozie: Ad Exchange Ad Latency Search Advertising Content Agility Content Optimization Content Personalization Flickr Video Audience Targeting Behavioral Targeting Partner Targeting Retargeting Web Targeting Advertisement Content Targeting
  • 7. Y! business processed by Oozie: Anti Spam Content Retargeting Research Dashboards & Reports Forecasting Email Data Intelligence Data Management Audience Event Pipeline
  • 8. Use Case: Data pipeline (diagram)
  • 9. Number of actions created hourly (chart; x-axis: time of day, midnight to midnight)
  • 10. Number of actions created per minute (chart)
  • 11. Scale
      ▪ At certain points in time, the 5-minute, 15-minute, 30-minute, hourly, daily, and monthly coordinator jobs all fire together, producing a burst of coordinator actions that a single host cannot handle.
      ▪ We observed processing delays, and customers complained about slowness.
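The collision described on slide 11 can be illustrated with a little arithmetic. The sketch below is only an illustration: the frequency list is a hypothetical mix (monthly jobs are left out because they do not fit a single-day model), and it counts how many distinct frequencies come due at each minute after midnight.

```python
# Hypothetical mix of coordinator frequencies, in minutes (5 min up to daily).
FREQUENCIES_MIN = [5, 15, 30, 60, 24 * 60]


def frequencies_firing(minute_of_day, frequencies=FREQUENCIES_MIN):
    """Count how many frequencies are due at the given minute after midnight."""
    return sum(1 for f in frequencies if minute_of_day % f == 0)


def busiest_minutes(frequencies=FREQUENCIES_MIN):
    """Return (peak_count, minutes) for one day: the minutes where most frequencies coincide."""
    counts = {m: frequencies_firing(m, frequencies) for m in range(24 * 60)}
    peak = max(counts.values())
    return peak, [m for m, c in counts.items() if c == peak]
```

With this mix, every frequency coincides at midnight, and four of the five coincide at the top of each hour, which is where a single server would see its largest bursts.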
  • 12. Why Downtime Matters? Downtime needed: Oozie upgrade (major release: > 1 per quarter; minor release: > 1 per month)
  • 13. Why Downtime Matters? Downtime needed: upgrade of dependent Hadoop projects (YARN, HDFS, Hive, HBase, Pig, HCatalog, etc.)
  • 14. Why Downtime Matters? Downtime needed: configuration error / change
  • 15. Why Downtime Matters? Downtime needed: hardware error / upgrade
  • 16. Why Downtime Matters? Customers: revenue-impacting applications need to run all the time, with no delay!
  • 17. Why Downtime Matters? Ops: under pressure to minimize downtime
  • 18. Solution: High Availability
      ● Definition: failure of a component != failure of the entire system
        o achieved by removing single points of failure
      ● Requirement: transparency to users
        o Users should not need to know whether it is HA or not
        o No change in API or usage pattern
  • 19. Architecture (diagram labels): Load Balancer; Oozie Server 1 to Oozie Server n; ZooKeeper (Curator); RDB; Hadoop Cluster; submit request; request redirection; inter-server communication
  • 20. Architectural Overview: Database
      ● Oozie stores most of its state in a database
        o (submitted jobs, workflow definitions, etc.)
      ● An Oracle database in HA (2 racks, hot-warm) is used.
      ● ZooKeeper (Curator) is used for coordination.
  • 21. Architectural Overview: Access
      ● Users and client programs need a single address to connect to
        o Web UI, REST/Java API, JobTracker/ResourceManager callbacks, etc.
      ● A virtual IP (VIP) is used as the user-facing URL.
  • 22. Architectural Overview: Security
      ● We use Kerberos and some internal security systems for communication among components.
  • 23. Architectural Overview: Authentication (diagram labels)
      ▪ Components: Load Balancer; Oozie Server 1 to Oozie Server n; ZooKeeper (Curator); RDB; Hadoop Cluster
      ▪ submit request; request redirection
      ▪ Inter-server communication for log streaming, etc.
      ▪ ZooKeeper for locks and management
      ▪ Security: https + Kerberos / cookie-based auth (appears twice)
      ▪ Security: https + Kerberos
      ▪ Security: Kerberos (appears twice)
  • 24. Technical Challenge: Log Streaming
      ● Each Oozie server only has access to its own logs
      ● Jobs can execute on any server
        o Job execution can switch among servers
      ● Users need to see one sequential log rather than separate server1 and server2 logs.
  • 25. Architecture: Log Streaming
      1. The user request comes to server1
      2. Server1 fetches the server list from ZooKeeper
      3. Server1 calls all other servers to fetch their logs and merges them by log timestamp
      4. The merged log is displayed to the user
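The merge in step 3 can be sketched as a k-way merge keyed on the timestamp at the start of each line. This is a minimal illustration, not Oozie's implementation: the line layout (`YYYY-MM-DD HH:MM:SS,mmm message`) and the function names are assumptions, and each server's log is assumed to be already in time order.

```python
import heapq
from datetime import datetime

# Assumed log4j-style timestamp prefix, e.g. "2015-04-15 10:00:00,123".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TIMESTAMP_LENGTH = 23


def timestamp_of(line):
    """Parse the timestamp prefix of one log line."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)


def merge_server_logs(per_server_lines):
    """Merge several per-server logs (each sorted by time) into one sequential log."""
    return list(heapq.merge(*per_server_lines, key=timestamp_of))
```

`heapq.merge` reads each input lazily and keeps only one pending line per server in memory, which matters when the logs are streamed rather than loaded whole.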
  • 26. Caveat: Log Streaming
      ● If an Oozie server goes down, any logs from it will be unavailable
  • 27. Technical Challenge: HCatalog Integration
      • Hive Metastore (HCatalog): manages metadata for datasets
        – Oozie registers for a dataset with HCatalog
        – Oozie receives notifications from HCatalog through JMS (e.g., ActiveMQ)
        – Oozie starts the job immediately after the data becomes ready
      • Diagram labels: Register Topic; Push notification <New Partition>; Notify New Partition; JMS (e.g., ActiveMQ); Job; Oozie Server 1; Oozie Server 2
  • 28. Technical Challenge: Hive Metastore Integration
      • Oozie maintains an in-memory list of the datasets that need notification.
      • Each notification reaches only one server.
      • Because a notification reaches only one server, Oozie needs to invalidate the cache on all other servers.
      • This is done by a periodic task on each server that checks the job status for each dataset and, if the job is no longer waiting, removes the dataset from the cache.
  • 29. Technical Challenge: Hive Metastore Integration (diagram labels): 2. Register Topic; 3. Push notification <New Partition>; 4. Notify New Partition; Periodic check; Remove registration; JMS (e.g., ActiveMQ); Job; Oozie Server 1; Oozie Server 2
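The periodic check from slides 28 and 29 can be sketched as follows. This is an illustration under assumptions, not Oozie's code: the cache is modeled as a plain dictionary from dataset partition to the ID of the action waiting on it, and `job_status` is a hypothetical lookup standing in for a query against the shared database.

```python
WAITING = "WAITING"


def prune_dataset_cache(cache, job_status):
    """Drop cached datasets whose job is no longer waiting.

    cache: dict mapping dataset partition -> ID of the action waiting on it.
    job_status: callable returning the current status string for an action ID.
    Returns the list of datasets removed, in cache order.
    """
    removed = []
    for dataset in list(cache):  # copy the keys so entries can be deleted while looping
        if job_status(cache[dataset]) != WAITING:
            del cache[dataset]
            removed.append(dataset)
    return removed
```

Each server would run this on a timer against its own cache, so a server that never saw the notification still converges once the shared job status changes.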
  • 30. Challenges
      ● Distributed Job ID
        o Maintain a distributed sequence number for job IDs using Apache Curator + ZooKeeper
      ● ZooKeeper Failure Handling
        o Oozie servers automatically shut down when ZooKeeper is down
      ● Sharelib
        o Support sharelib updates in HA
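Why job IDs need a shared sequence can be shown with a small simulation. Everything here is a stand-in: `SharedSequence` is an in-process counter playing the role of the number kept in ZooKeeper, and the ID layout and the `oozie-demo` system name are illustrative assumptions rather than the exact format Oozie produces.

```python
import threading


class SharedSequence:
    """In-process stand-in for a sequence number shared by all servers."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def next(self):
        # The lock plays the role of the atomic increment a coordination service provides.
        with self._lock:
            value = self._value
            self._value += 1
            return value


def new_job_id(sequence, start_stamp, system_id, kind):
    """Build an ID from the shared sequence; the layout is illustrative only."""
    return f"{sequence.next():07d}-{start_stamp}-{system_id}-{kind}"
```

If each server kept its own counter instead, two servers could hand out the same number for different jobs; drawing from one sequence rules that out.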
  • 31. More Challenges
      • SLA support
        – Oozie has an in-memory data structure to track SLA status for each job (start/duration/end met/miss and notifications)
        – add a check of SLA status against the database
        – use a ZooKeeper lock to synchronize updates on the same job from multiple servers
      • Distributed Locks
        – Reentrant distributed lock using Apache Curator + ZooKeeper
  • 32. Experiences
      • HA has been running on all production grids at Yahoo! for more than 7 months
        – Stable!
  • 33. Issues
      – ZooKeeper down (when upgrading ZooKeeper quorum hardware)
      – Servers going out of sync (during upgrade, sharelib)
  • 34. Benefits
      ▪ Zero downtime for applications
      ▪ Rolling upgrade (zero downtime)
        › Maintenance upgrade
        › Configuration upgrade
      ▪ No more materialization delay
  • 35. Workflow Job Submission Throughput (chart)
  • 36. Future work
      • Faster job fail-over
        – currently, jobs wait for a thread (Recovery Service) that picks up non-progressing jobs every few minutes
        – an Oozie server should notice immediately when another server is down and fail over its jobs (e.g., using a ZooKeeper watcher)
      • Improve log streaming
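The polling approach that slide 36 wants to improve on can be sketched as a scan for jobs that have not progressed recently. This is a simplified illustration: the dictionary of last-modified times and the function name are hypothetical stand-ins for whatever the Recovery Service reads from the database.

```python
from datetime import datetime, timedelta


def stalled_jobs(last_modified, now, older_than):
    """Return IDs of jobs whose last update is older than the threshold, sorted by ID.

    last_modified: dict mapping job ID -> datetime of the job's last update.
    """
    cutoff = now - older_than
    return sorted(job for job, stamp in last_modified.items() if stamp < cutoff)
```

With a scan like this, a job orphaned by a failed server waits up to the scan interval plus the staleness threshold before another server picks it up, which is the delay a watcher-based design would avoid.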
  • 37. Acknowledgement: Robert Kanter, Olga L. Natkovich, Rohini Palaniswamy, Michelle Chiang, Jacob Tolar, Sumeet Singh