Arun Murthy presented on how Hadoop has become mission-critical at Yahoo. Hadoop runs there at massive scale, with over 44,000 servers and 170 petabytes of storage. Yahoo has contributed over 4,400 patches to make Hadoop more scalable, secure, and reliable for large-scale production use, focusing on multi-tenancy, high availability, and support for very large clusters via HDFS federation. Hadoop now backs nearly all of Yahoo's computing and storage needs, including many of its major services and revenue systems.
Yahoo! - Arun Murthy - Hadoop World 2010
1. Hadoop at Yahoo!
Ready for Business
Arun C. Murthy
Hadoop Team
acm@yahoo-inc.com
@acmurthy
2. Existential Angst – Who Am I?
• Yahoo
– Lead, Hadoop Map-Reduce development team
• Apache Hadoop
– Full-time contributor since April 2006
– Long-term Committer
– Member of Apache Hadoop Project Management Committee
3. Outline
• Hadoop is mission critical for Yahoo
• Making Hadoop enterprise-ready for Yahoo
4. Hadoop at Yahoo!
• Hadoop is mission critical for Yahoo
• Making Hadoop enterprise-ready for Yahoo
5. Hadoop at Yahoo! - Scale of Operation
• Washington: 25,000 nodes
• Nebraska: 9,000 nodes
• Virginia: 10,000 nodes
9. Hadoop Usage at Yahoo!
[Chart: thousands of servers and petabytes of storage over time — from research "science impact" to daily production, "behind every click", today]
• 44K Hadoop servers
• 170 PB raw Hadoop storage
• 1M+ monthly Hadoop jobs
10. Research to Mission Critical
• 2006/2007 – Research workloads
– Search
– Advertising modeling
– Machine learning
• 2008 – WebMap (production)
• 2009 – Revenue systems
– Strong security
– Improved SLAs
– Small jobs
• 2010 – Increased user base
– Partitioned namespaces
– All data storage and processing
– Mainstream
11. Application Patterns
• ETL / Warehouse
– Data processing and aggregations
– Data co-located in a shared environment
– Batch processing of data
– Processing 100 billion events per day
• Analytics & Sciences
– Modeling and machine-learning algorithms
– Weekly/monthly runs of algorithms
• Nearline Production
– Derive insights from the production data
– Feedback for optimizations in the production environments
– Nearline production optimizations
12. Getting there…
• Hadoop is mission critical for Yahoo
• Making Hadoop enterprise-ready for Yahoo
13. Crossing the Chasm
• Hadoop grew rapidly, charting new territory in features,
abstractions, APIs, scale, …
– Small team
– Small number of early customers who needed a new platform
• Today: dramatic growth in customer base
– New requirements and expectations
• Choices/tradeoffs in approaches – past and future
– Scale
– Backward Compatibility
– Security
– SLAs & Predictability
* After Geoffrey A. Moore, "Crossing the Chasm"
14. Evolution of Hadoop at Yahoo!
• Utilization at scale
• Security
• Multi-tenancy
• Super-size

Release timeline, 04/09 – 04/11 (milestones at 04/09, 09/09, 04/10, 09/10, 04/11), moving from Apache Hadoop to Yahoo Hadoop:
hadoop-0.20 → yhadoop-0.20 (CapacityScheduler) → 20.S (Security) → Fred (Multi-Tenancy) → hadoop-next (HDFS Federation)
4400+ patches on hadoop-0.20!
16. Motivation
• Exploit shared storage
– Unified namespace
• Provide compute elasticity
– Stop relying on private clusters (Hadoop on Demand)
• Higher utilization at massive scale
17. CapacityScheduler
• Resource allocation in shared, multi-tenant cluster
• A cluster is funded by several organizations
• Each organization gets queue allocations based on their funding
– Guaranteed capacity
– Control who can submit jobs to their queues
– Set job priorities within their queues
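The queue model above maps directly to CapacityScheduler configuration. A minimal sketch, using the 0.20-era property names: the queue names (`search`, `ads`) and the capacity split are hypothetical examples.

```xml
<!-- capacity-scheduler.xml (Hadoop 0.20-era CapacityScheduler):
     each funding organization gets a queue with a guaranteed share
     of the cluster's task slots -->
<configuration>
  <property>
    <name>mapred.capacity-scheduler.queue.search.capacity</name>
    <value>60</value> <!-- 'search' org funded 60% of the cluster -->
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.ads.capacity</name>
    <value>40</value> <!-- 'ads' org funded the remaining 40% -->
  </property>
</configuration>
```

Per-queue submission control (who may submit jobs, who may administer) is configured separately via queue ACLs, so each organization controls access to its own queue.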
20. Motivation
• Revenue-bearing applications
• Strong security for data on multi-tenant clusters
– Enable sharing clusters between disjoint kinds of users
• Auditing
– Access to data
– Access and change management
21. Secure Hadoop
• Kerberos based strong authentication
– Client-based authentication introduced in hadoop-0.16 (2007)
– Authenticate RPC and HTTP connections
• Multiple man years of development
• Integration with existing security mechanisms in Yahoo
• Authorization
– Use HDFS Authorization
– Add MapReduce Authorization
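At its core, switching on the Kerberos-based authentication described above is cluster-wide configuration. A minimal sketch using the two real `core-site.xml` switches from the security-enabled 0.20 line onward; a full deployment additionally needs per-daemon Kerberos principals and keytabs, which are omitted here.

```xml
<!-- core-site.xml: enable Kerberos authentication and service-level
     authorization across the cluster -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value> <!-- default is "simple", i.e. no authentication -->
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value> <!-- enforce ACLs on Hadoop RPC protocols -->
  </property>
</configuration>
```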
23. Motivation
• Ever growing demand
– Consolidation for economies of scale and operability
– Several clusters of 4k nodes each
• Growing demand for stability
– Isolation for applications
– Shield framework from poorly designed or rogue applications
24. Fred
• Limits
– Plug uptime vulnerabilities in the framework
– Enforce best practices
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
• Shield clusters from poorly written applications
– NameNode exposed to applications performing too many metadata
operations from the backend tasks
– JobTracker exposed to jobs with excessive Counters
• Shield users from each other
– Isolation
• Metrics and Monitoring
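The framework-protecting limits above correspond to JobTracker-side configuration. A sketch under stated assumptions: the two properties below are from the 0.20/1.x line, the values are illustrative, and the full set of limits Fred enforced was larger than this.

```xml
<!-- mapred-site.xml: cap per-job resource usage so a single poorly
     written job cannot destabilize the JobTracker -->
<configuration>
  <property>
    <name>mapreduce.job.counters.limit</name>
    <value>120</value> <!-- max user-defined counters per job -->
  </property>
  <property>
    <name>mapred.jobtracker.maxtasks.per.job</name>
    <value>200000</value> <!-- reject jobs with more tasks than this -->
  </property>
</configuration>
```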
26. Motivation
• Massive storage and processing
– Hardware gets more capable per dollar
– 4,000 nodes in 2011 ≈ 12,000 nodes in 2009
– Continued consolidation for economies of scale and operability
27. HDFS Federation
• Redefine the meaning of a HDFS cluster
– Scale horizontally by having multiple NameNodes per cluster
• Striping – Already in production
– Shared storage pool
– Shared namespace
• Striping – Mount tables in production
– Helps availability
– Better isolation
• 72 PB raw storage per cluster
– 6000 nodes per cluster
– 12TB raw, per node
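The mount tables mentioned above are client-side configuration that stitches multiple NameNode namespaces into one view. A minimal sketch using the real `fs.viewfs.mounttable` properties; the cluster name `clusterX` and the NameNode hosts are hypothetical, and in the 0.20 line the default-filesystem key was `fs.default.name` rather than `fs.defaultFS`.

```xml
<!-- core-site.xml (client side): present a federated cluster as a
     single namespace via a viewfs mount table -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://clusterX</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./user</name>
    <value>hdfs://nn1.example.com:8020/user</value> <!-- served by NameNode 1 -->
  </property>
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./data</name>
    <value>hdfs://nn2.example.com:8020/data</value> <!-- served by NameNode 2 -->
  </property>
</configuration>
```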
28. Availability
• Mission critical system
• HDFS
– Faster HDFS restarts
• Full cluster restart in 75min (down from 3-4 hrs)
• NN bounce in 15 minutes
• Part of the problem is the NameNode’s size – Federation will help
– Steps towards automated failover
• Backup NN
• Move state off the NN server so we can failover easily
– Federation will significantly improve NN isolation, availability, & stability
• Availability for Map-Reduce framework and jobs
– Continued operation across HDFS restarts
29. Conclusions
• Yahoo Hadoop is behind every click at Yahoo!
– Stable, scalable and secure
– The most tested and reliable version of Hadoop – 4400 patches!
• Yahoo continues to be the primary contributor to Apache Hadoop