Hadoop at Yahoo!
Ready for Business
Arun C. Murthy
Hadoop Team
acm@yahoo-inc.com
@acmurthy
Existential Angst – Who Am I?
•  Yahoo
–  Lead, Hadoop Map-Reduce development team
•  Apache Hadoop
–  Full time contributor since April, 2006
–  Long-term Committer
–  Member of Apache Hadoop Project Management Committee
2
Outline
•  Hadoop is mission critical for Yahoo
•  Making Hadoop enterprise-ready for Yahoo
3
Hadoop at Yahoo!
•  Hadoop is mission critical for Yahoo
•  Making Hadoop enterprise-ready for Yahoo
4
Hadoop at Yahoo! - Scale of Operation
5
Washington
25000 nodes
Nebraska
9000 nodes
Virginia
10000 nodes
The Team - Hadoop Development
6
Hadoop Contributions
7
[Figure: Hadoop Patches. Number of patches (axis 0 to 4,500) by month, Feb-06 to Oct-10, broken out by contributor: yahoo, powerset, other, facebook, cloudera]
Hadoop at Yahoo!
8
[Figure: Availability SLA (%) for Production, Research, and Sandbox clusters. Plotted values: 99.85, 99.47, 99.69; axis range 99.2 to 99.9]
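To put the availability figures in operational terms, here is a minimal sketch that converts an availability percentage into hours of downtime per year. It assumes a 365-day year and treats each percentage as a yearly average; neither assumption comes from the slide, and the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (100.0 - availability_pct) / 100.0


# The three values plotted on the slide.
for pct in (99.85, 99.69, 99.47):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.1f} h/year")
```

Under these assumptions, 99.85% availability allows roughly 13 hours of downtime per year, while 99.47% allows roughly 46 hours.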
Hadoop Usage at Yahoo!
[Figure: growth in Thousands of Servers and Petabytes through the stages Research, Science Impact, Daily Production, and “Behind every click” (Today)]
•  44K Hadoop Servers
•  170 PB Raw Hadoop Storage
•  1M+ Monthly Hadoop Jobs
9
Research to Mission Critical
Research workloads
•  Search
•  Advertising Modeling
•  Machine Learning
•  WebMap (production)
Revenue Systems
•  Strong Security
•  Improved SLAs
•  Small Jobs
Increased user base
•  Partitioned Namespaces
•  All data storage and processing
•  Mainstream
Timeline: 2006/2007, 2008, 2009, 2010
10
Application Patterns
ETL / Warehouse
•  Data Processing and Aggregations
•  Data co-located in a shared environment
•  Batch processing of Data
•  Processing 100 Billion events per day
Analytics & Sciences
•  Modeling and Machine Learning Algorithms
•  Weekly/Monthly run of algorithms
Nearline Production
•  Derive insights from the production data
•  Feedback for optimizations in the production environments
•  Nearline production optimizations
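For a sense of the ETL / Warehouse load, the "100 Billion events per day" figure can be turned into an average per-second rate. This is a back-of-the-envelope sketch only: it assumes events arrive evenly over the day, which the slide does not claim, and real traffic would have peaks above this average.

```python
EVENTS_PER_DAY = 100_000_000_000  # "100 Billion events per day" from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: uniform arrival over 24 hours (average rate, not peak).
avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_events_per_second:,.0f} events/s on average")
```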
11
Getting there…
•  Hadoop is mission critical for Yahoo
•  Making Hadoop enterprise-ready for Yahoo
12
Crossing the Chasm
•  Hadoop grew rapidly, charting new territory in features, abstractions, APIs, scale, …
–  Small team
–  Small number of early customers who needed a new platform
•  Today: dramatic growth in customer base
–  New requirements and expectations
•  Choices/tradeoffs in approaches – past and future
–  Scale
–  Backward Compatibility
–  Security
–  SLAs & Predictability
13
Geoffrey A Moore*
Evolution of Hadoop at Yahoo!
14
•  Utilization at Scale
•  Security
•  Multi-tenancy
•  Super-size
[Figure: release timeline with date marks 04/09, 09/09, 04/10, 09/10, 04/11; releases hadoop-0.20, yhadoop-0.20, 20.S, Fred, hadoop-next; features CapacityScheduler, Security, Multi-Tenancy, HDFS Federation; legend: Yahoo Hadoop, Apache Hadoop]
4400+ patches on hadoop-0.20!
Utilization at Scale
15
Motivation
•  Exploit shared storage
–  Unified namespace
•  Provide compute elasticity
–  Stop relying on private clusters (Hadoop on Demand)
•  Higher utilization at massive scale
16
CapacityScheduler
•  Resource allocation in shared, multi-tenant cluster
•  A cluster is funded by several organizations
•  Each organization gets queue allocations based on their funding
–  Guaranteed capacity
–  Control who can submit jobs to their queues
–  Set job priorities within their queues
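The funding-based allocation described above can be sketched as a proportional split of cluster slots across queues. This is an illustration only, not the actual CapacityScheduler implementation: the function name, the queue names, and the funding numbers are all hypothetical, and the rounding rule (leftover slots go to the largest fractional remainders) is an assumption chosen to keep the total exact.

```python
from typing import Dict


def guaranteed_slots(funding: Dict[str, float], total_slots: int) -> Dict[str, int]:
    """Split cluster slots across queues in proportion to each organization's funding."""
    total = sum(funding.values())
    if total <= 0:
        raise ValueError("total funding must be positive")
    # Exact (fractional) share per queue.
    raw = {q: total_slots * f / total for q, f in funding.items()}
    # Floor each share, then give leftover slots to the largest remainders.
    slots = {q: int(r) for q, r in raw.items()}
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(raw, key=lambda q: raw[q] - slots[q], reverse=True)
    for q in by_remainder[:leftover]:
        slots[q] += 1
    return slots


# Hypothetical organizations and funding shares, for illustration only.
print(guaranteed_slots({"search": 50, "ads": 30, "research": 20}, total_slots=1000))
```

Each queue's result is its guaranteed capacity in this sketch; the slide's other controls (who may submit jobs, per-queue job priorities) are not modeled here.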
17
CapacityScheduler - Benefits
•  Improved utilization and latency
•  Almost dedicated hardware via virtual clusters
•  Significantly better utilization of excess capacity
–  Mix SLA critical and ad-hoc jobs
•  Predictable behavior
18
[Figure: Normalized Throughput (scale 0.00 to 3.00) for Job throughput, InputBytes throughput, and OutputBytes throughput, Hadoop 20 vs. Hadoop 18; annotation: 936 GB/hr]
[Figure: Slot Utilization (%) (scale 0% to 80%) for MapSlot Utilization and ReduceSlot Utilization, Hadoop 20]
Security
19
Motivation
•  Revenue bearing applications
•  Strong security for data on multi-tenant clusters
–  Enable sharing clusters between disjoint kinds of users
•  Auditing
–  Access to data
–  Access and change management
20
Secure Hadoop
•  Kerberos based strong authentication
–  Client-based authentication introduced in hadoop-0.16 (2007)
–  Authenticate RPC and HTTP connections
•  Multiple man years of development
•  Integration with existing security mechanisms in Yahoo
•  Authorization
–  Use HDFS Authorization
–  Add MapReduce Authorization
21
Multi-Tenancy
22
Motivation
•  Ever growing demand
–  Consolidation for economies of scale and operability
–  Several clusters of 4k nodes each
•  Growing demand for stability
–  Isolation for applications
–  Shield framework from poorly designed or rogue applications
23
Fred
•  Limits
–  Plug uptime vulnerabilities in the framework
–  Enforce best practices
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
•  Shield clusters from poorly written applications
–  NameNode exposed to applications performing too many metadata operations from the backend tasks
–  JobTracker (JT) exposed to applications with too many Counters
•  Shield users from each other
–  Isolation
•  Metrics and Monitoring
24
Super-Sized Hadoop
25
Motivation
•  Massive storage and processing
–  Hardware gets more capable per dollar
–  (4k 2011 nodes) = (12k 2009 nodes)
–  Continued consolidation for economics and operability
26
HDFS Federation
•  Redefine the meaning of an HDFS cluster
–  Scale horizontally by having multiple NameNodes per cluster
•  Striping – Already in production
–  Shared storage pool
–  Shared namespace
•  Striping – Mount tables in production
–  Helps availability
–  Better isolation
•  72 PB raw storage per cluster
–  6000 nodes per cluster
–  12TB raw, per node
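The per-cluster storage figure follows directly from the node count and per-node capacity. A short check, assuming decimal units (1 PB = 1000 TB); the usable-capacity line further assumes HDFS's default replication factor of 3, which the slide does not state for these clusters.

```python
NODES_PER_CLUSTER = 6000
RAW_TB_PER_NODE = 12
TB_PER_PB = 1000  # assumption: decimal units

raw_pb_per_cluster = NODES_PER_CLUSTER * RAW_TB_PER_NODE / TB_PER_PB
print(f"raw: {raw_pb_per_cluster} PB")  # raw: 72.0 PB

# Assumption: default HDFS replication factor of 3 (not stated on the slide).
REPLICATION_FACTOR = 3
usable_pb_per_cluster = raw_pb_per_cluster / REPLICATION_FACTOR
print(f"usable at 3x replication: {usable_pb_per_cluster} PB")
```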
27
Availability
•  Mission critical system
•  HDFS
–  Faster HDFS restarts
•  Full cluster restart in 75min (down from 3-4 hrs)
•  NN bounce in 15 minutes
•  Part of the problem is the NameNode’s size – Federation will help
–  Steps towards automated failover
•  Backup NN
•  Move state off the NN server so we can fail over easily
–  Federation will significantly improve NN isolation, availability, & stability
•  Availability for Map-Reduce framework and jobs
–  Continued operation across HDFS restarts
28
Conclusions
•  Yahoo Hadoop is behind every click at Yahoo!
–  Stable, scalable and secure
–  The most tested and reliable version of Hadoop – 4400 patches!
•  Yahoo continues to be the primary contributor to Apache Hadoop
29
Questions?
30
Thanks!

Yahoo! - Arun Murthy - Hadoop World 2010

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Yahoo! - Arun Murthy - Hadoop World 2010

  • 1. Hadoop at Yahoo! Ready for Business Arun C. Murthy Hadoop Team acm@yahoo-inc.com @acmurthy
  • 2. Existential Angst – Who Am I? • Yahoo – Lead, Hadoop Map-Reduce development team • Apache Hadoop – Full time contributor since April, 2006 – Long-term Committer – Member of Apache Hadoop Project Management Committee
  • 3. Outline • Hadoop is mission critical for Yahoo • Making Hadoop enterprise-ready for Yahoo
  • 4. Hadoop at Yahoo! • Hadoop is mission critical for Yahoo • Making Hadoop enterprise-ready for Yahoo
  • 5. Hadoop at Yahoo! - Scale of Operation • Washington: 25000 nodes • Nebraska: 9000 nodes • Virginia: 10000 nodes
  • 6. The Team - Hadoop Development
  • 8. Hadoop at Yahoo! [Bar chart: Availability SLA for the Production, Research, and Sandbox clusters; plotted values 99.85, 99.47, and 99.69 on an axis from 99.2 to 99.9]
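To put the availability figures on this slide in operational terms, the sketch below converts an availability percentage into the downtime it allows over a period. The function name and the 30-day period are assumptions made for illustration; the slide does not say over what window the SLA is measured, and the chart residue does not show which value belongs to which cluster type.

```python
def downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime allowed over period_days at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - availability_percent) / 100


# The three values plotted on the slide; the 30-day window is an assumption.
for availability in (99.85, 99.47, 99.69):
    print(f"{availability}% -> {downtime_minutes(availability):.1f} min per 30 days")
```

Under these assumptions, each tenth of a percentage point of availability corresponds to roughly 43 minutes of downtime per 30 days.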
  • 9. Hadoop Usage at Yahoo! Research Science Impact Daily Production “Behind every click” Today [Chart axes: Thousands of Servers, Petabytes] 44K Hadoop Servers 170 PB Raw Hadoop Storage 1M+ Monthly Hadoop Jobs
  • 10. Research to Mission Critical Research workloads • Search • Advertising Modeling • Machine Learning • WebMap (production) Revenue Systems • Strong Security • Improved SLAs • Small Jobs Increased user base • Partitioned Namespaces • All data storage and processing • Mainstream [Timeline: 2006/2007, 2008, 2009, 2010]
  • 11. Application Patterns ETL / Warehouse: • Data Processing and Aggregations • Data co-located in a shared environment • Batch processing of Data • Processing 100 Billion events per day Analytics & Sciences: • Modeling and Machine Learning Algorithms • Weekly/Monthly run of algorithms Nearline Production: • Derive insights from the production data • Feedback for optimizations in the production environments • Nearline production optimizations
  • 12. Getting there… • Hadoop is mission critical for Yahoo • Making Hadoop enterprise-ready for Yahoo
  • 13. Crossing the Chasm • Hadoop grew rapidly charting new territories in features, abstractions, APIs, scale, … – Small team – Small number of early customers who needed a new platform • Today: dramatic growth in customer base – New requirements and expectations • Choices/tradeoffs in approaches – past and future – Scale – Backward Compatibility – Security – SLAs & Predictability Geoffrey A Moore*
  • 14. Evolution of Hadoop at Yahoo! • Utilization at Scale • Security • Multi-tenancy • Super-size [Timeline figure with Yahoo Hadoop and Apache Hadoop tracks; dates marked: 04/09, 09/09, 04/10, 09/10, 04/11; labels: hadoop-0.20, yhadoop-0.20, 20.S, hadoop-next, CapacityScheduler, Security, Multi-Tenancy, Fred, HDFS Federation] 4400+ patches on hadoop-0.20!
  • 15. Utilization at Scale [Timeline figure: the Yahoo Hadoop / Apache Hadoop timeline from slide 14]
  • 16. Motivation • Exploit shared storage – Unified namespace • Provide compute elasticity – Stop relying on private clusters (Hadoop on Demand) • Higher utilization at massive scale
  • 17. CapacityScheduler • Resource allocation in shared, multi-tenant cluster • A cluster is funded by several organizations • Each organization gets queue allocations based on their funding – Guaranteed capacity – Control who can submit jobs to their queues – Set job priorities within their queues
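As a rough illustration of the funding-based queue model on this slide, the following Python sketch splits a cluster's slots among organizations in proportion to their funding. The function name, the organization names, and all numbers are hypothetical. This is not the CapacityScheduler's own algorithm or configuration format; it models only the guaranteed-capacity idea and leaves out excess-capacity sharing, submission controls, and priorities.

```python
def guaranteed_capacity(total_slots, funding):
    """Split total_slots among queues in proportion to funding.

    funding maps an organization name to an integer number of funding units
    (a hypothetical unit chosen for this sketch).
    """
    total_funding = sum(funding.values())
    if total_funding <= 0:
        raise ValueError("funding must sum to a positive amount")
    # Give every queue the floor of its proportional share first.
    shares = {org: total_slots * units // total_funding
              for org, units in funding.items()}
    # Hand out the slots lost to rounding, largest remainder first.
    leftover = total_slots - sum(shares.values())
    by_remainder = sorted(
        funding,
        key=lambda org: (total_slots * funding[org]) % total_funding,
        reverse=True,
    )
    for org in by_remainder[:leftover]:
        shares[org] += 1
    return shares


# Hypothetical organizations funding a 1000-slot cluster.
print(guaranteed_capacity(1000, {"search": 50, "ads": 30, "research": 20}))
```

The largest-remainder step is a design choice of this sketch so that the shares always add up to the cluster size.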
  • 18. CapacityScheduler - Benefits • Improved utilization and latency • Almost dedicated hardware via virtual clusters • Significantly better utilization of excess capacity – Mix SLA critical and ad-hoc jobs • Predictable behavior [Chart: Normalized Throughput (Job throughput, InputBytes throughput, OutputBytes throughput), Hadoop 20 vs. Hadoop 18; annotation: 936 GB/hr] [Chart: Slot Utilization (%) for MapSlot and ReduceSlot, Hadoop 20]
  • 19. Security [Timeline figure: the Yahoo Hadoop / Apache Hadoop timeline from slide 14]
  • 20. Motivation • Revenue bearing applications • Strong security for data on multi-tenant clusters – Enable sharing clusters between disjoint kinds of users • Auditing – Access to data – Access and change management
  • 21. Secure Hadoop • Kerberos based strong authentication – Client-based authentication introduced in hadoop-0.16 (2007) – Authenticate RPC and HTTP connections • Multiple man years of development • Integration with existing security mechanisms in Yahoo • Authorization – Use HDFS Authorization – Add MapReduce Authorization
  • 22. Multi-Tenancy [Timeline figure: the Yahoo Hadoop / Apache Hadoop timeline from slide 14]
  • 23. Motivation • Ever growing demand – Consolidation for economies of scale and operability – Several clusters of 4k nodes each • Growing demand for stability – Isolation for applications – Shield framework from poorly designed or rogue applications
  • 24. Fred • Limits – Plug uptime vulnerabilities in the framework – Enforce best practices http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/ • Shield clusters from poorly written applications – NameNode exposed to applications performing too many metadata operations from the backend tasks – JT exposed to with Counters • Shield users from each other – Isolation • Metrics and Monitoring
  • 25. Super-Sized Hadoop [Timeline figure: the Yahoo Hadoop / Apache Hadoop timeline from slide 14]
  • 26. Motivation • Massive storage and processing – Hardware gets more capable per dollar – (4k 2011 nodes) = (12k 2009 nodes) – Continued consolidation for economics and operability
  • 27. HDFS Federation • Redefine the meaning of an HDFS cluster – Scale horizontally by having multiple NameNodes per cluster • Striping – Already in production – Shared storage pool – Shared namespace • Striping – Mount tables in production – Helps availability – Better isolation • 72 PB raw storage per cluster – 6000 nodes per cluster – 12 TB raw, per node
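Two points on this slide lend themselves to a short sketch: the raw-storage arithmetic, and the idea of a mount table that maps parts of a namespace to different NameNodes. The Python below is a minimal illustration under stated assumptions: the mount points, the NameNode names, and the longest-prefix lookup rule are all hypothetical and are not taken from the HDFS Federation implementation.

```python
# Raw storage per cluster, using the figures on the slide (decimal units).
NODES_PER_CLUSTER = 6000
RAW_TB_PER_NODE = 12
raw_pb_per_cluster = NODES_PER_CLUSTER * RAW_TB_PER_NODE / 1000  # 1 PB = 1000 TB


def resolve_namenode(path, mount_table):
    """Return the NameNode serving path, using the longest matching mount point."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")
        # Match whole path components only, so "/data" does not cover "/database".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.rstrip("/")):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return mount_table[best]


# Hypothetical mount table: each entry names the NameNode owning that subtree.
mount_table = {"/user": "nn1", "/data": "nn2", "/data/ads": "nn3"}
print(raw_pb_per_cluster)
print(resolve_namenode("/data/ads/2010/10", mount_table))
```

Splitting the namespace this way means a problem on one NameNode affects only the subtrees mounted on it, which is one reading of the "better isolation" bullet above.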
  • 28. Availability • Mission critical system • HDFS – Faster HDFS restarts • Full cluster restart in 75 min (down from 3-4 hrs) • NN bounce in 15 minutes • Part of the problem is the NameNode’s size – Federation will help – Steps towards automated failover • Backup NN • Move state off the NN server so we can failover easily – Federation will significantly improve NN isolation, availability, & stability • Availability for Map-Reduce framework and jobs – Continued operation across HDFS restarts
  • 29. Conclusions • Yahoo Hadoop is behind every click at Yahoo! – Stable, scalable and secure – The most tested and reliable version of Hadoop – 4400 patches! • Yahoo continues to be the primary contributor to Apache Hadoop