2. Crossing the Chasm (Geoffrey A. Moore)
- Apache Hadoop grew rapidly, charting new territory in features, abstractions, APIs, scale, fault tolerance, multi-tenancy, operations, …
- A small number of early customers needed a new platform
- Providing Hadoop as a service made adoption easy
- Today: dramatic growth in adoption and customer base
  - Growth of the Hadoop stack and applications
  - New requirements and expectations: mission critical
(Chasm diagram: Early Adopters → Early Majority → Late Majority)
3. Crossing the Chasm: Overview
- How the chasm is being crossed:
  - Security
  - SLAs & predictability
  - Scalability
  - Availability & data integrity
  - Backward compatibility
  - Quality & testing
- Fundamental architectural improvements: Federation & MR.next
  - Adapt to changing, sometimes unforeseen, needs
  - Fuel innovation and rapid development
- The community effect
4. Security
- Early gains
  - No authorization or authentication requirements at first
  - Added permissions; client-side userid passed to the server (0.16)
    - Addresses accidental deletes by another user
  - Service authorization (0.18, 0.20)
- Issues: stronger authorization required
  - Shared clusters with multiple tenants
  - Critical data
  - New categories of users (financial); SOX compliance
- Our response
  - Authentication using Kerberos (0.20.203)
  - A 10 person-year effort by Yahoo!
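The Kerberos work above surfaces to operators as a small set of configuration switches. A minimal sketch of what enabling secure mode looks like in `core-site.xml` — the two property names are the standard Hadoop security settings, but the surrounding principal/keytab configuration that a real secure cluster needs is omitted here:

```xml
<!-- core-site.xml: switch from the default "simple" mode (trust the
     client-supplied userid) to Kerberos authentication, and turn on
     service-level authorization (ACLs controlling which users and
     groups may talk to which Hadoop services). -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>   <!-- default is "simple" -->
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>       <!-- policies live in hadoop-policy.xml -->
  </property>
</configuration>
```

The first property addresses the "passed client-side userid" weakness described on the slide; the second is the service authorization added in 0.18/0.20.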
5. SLAs and Predictability
- Issue: customers uncomfortable with shared clusters
  - Customers traditionally plan for peaks with dedicated hardware
  - Dedicated clusters had poor utilization
- Response: Capacity Scheduler (0.20)
  - Guaranteed capacities in a multi-tenant shared cluster, almost like dedicated hardware
  - Each organization is given queue(s) with a guaranteed capacity; the organization
    - controls who is allowed to submit jobs to its queues
    - sets the priorities of jobs within its queues
    - creates sub-queues (0.21) for finer-grained control within its capacity
  - Unused capacity is given to tasks in other queues
    - Better than a private cluster: access to unused capacity during a crunch
  - Resource limits for tasks deal with misbehaved apps
- Response: FairShare Scheduler (0.20)
  - Focus is fair share of resources, but it does have pools
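The guaranteed-capacity model above is expressed as per-queue percentages in the Capacity Scheduler's configuration. A sketch using the 0.20-era property names (the queue names "search" and "ads" are made up for illustration; queues themselves are declared separately via `mapred.queue.names`, and per-queue submit ACLs control who may submit):

```xml
<!-- capacity-scheduler.xml: two organizations share one cluster, each
     with a guaranteed slice. When one queue is idle, its unused slots
     can be borrowed by jobs in the other queue. -->
<configuration>
  <property>
    <name>mapred.capacity-scheduler.queue.search.capacity</name>
    <value>60</value>   <!-- percent of cluster guaranteed to "search" -->
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.ads.capacity</name>
    <value>40</value>   <!-- percent of cluster guaranteed to "ads" -->
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.search.supports-priority</name>
    <value>true</value> <!-- jobs inside the queue honor priorities -->
  </property>
</configuration>
```

This is why the slide calls it "almost like dedicated hardware": each tenant's percentage is a floor, not a ceiling.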
6. Scalability
- Early gains
  - Simple design allowed rapid improvements: single master, namespace in RAM, simpler locking
  - Cluster size improvements: 1K → 2K → 4K nodes
  - Vertical scaling: tuned GC + efficient memory usage
  - Archive file system reduces files and blocks (0.20)
- Current issues
  - Growth of files and storage is limited by the single NN (0.20); only an issue for very large clusters
  - JobTracker does not scale beyond 30K tasks; needs a redesign
- Our response
  - RW locks in the NN (0.22)
  - MR.next: complete rewrite of the MR servers (JT, TT) for 100K tasks (0.23)
  - Federation: horizontal scaling of the namespace to a billion files (0.23)
  - A NN that keeps only part of the namespace in memory: a trillion files (0.23.x)
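Why does keeping the namespace in RAM cap the file count? A back-of-envelope sketch, assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the file count and blocks-per-file ratio below are illustrative, not measurements:

```java
// Rough estimate of NameNode heap needed to hold a namespace in RAM.
// Each file plus its blocks is a handful of Java objects on the NN heap,
// so heap grows linearly with the number of files and blocks.
public class NameNodeHeapEstimate {
    static long estimateHeapBytes(long files, double blocksPerFile, long bytesPerObject) {
        long objects = (long) (files * (1 + blocksPerFile)); // file inodes + block objects
        return objects * bytesPerObject;
    }

    public static void main(String[] args) {
        // 100M files averaging 1.5 blocks each, ~150 bytes per object:
        long heap = estimateHeapBytes(100_000_000L, 1.5, 150);
        System.out.println(heap / (1L << 30) + " GiB"); // → "34 GiB"
    }
}
```

At a billion files the same arithmetic lands in the hundreds of gigabytes, which is why federation (many NNs, each holding a slice of the namespace) and a partial-namespace NN are the responses on this slide.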
7. HDFS Availability & Data Integrity: Early Gains
- Simple design, Java, storage fault tolerance
  - Java saved us from the pointer errors that lead to data corruption
  - Simplicity: a subset of POSIX; random writers not supported
  - Storage: rely on the OS's file system rather than raw disk
  - Storage fault tolerance: multiple replicas, active monitoring
- Single NameNode master
  - Persistent state: multiple copies + checkpoints
  - Restart on failure
- How well did it work?
  - Lost 650 blocks out of 329M on 10 clusters with 20K nodes in 2009
    - 82% from an abandoned-open-file append bug (fixed in 0.21)
    - 15% were files created with a single replica (data reliability not needed)
    - 3% due to roughly 7 bugs that were then fixed (0.21)
  - Over the last 18 months: 22 failures on 25 clusters
    - Only 8 would have benefited from HA failover!! (0.23 failures per cluster-year)
  - The NN is very robust and can take a lot of abuse
  - The NN is resilient against overload caused by misbehaving apps
8. HDFS Availability & Data Integrity: Response
- Data integrity
  - Append/flush/sync redesign (0.21)
  - The pipeline recruits new replicas rather than just removing them on failures (0.23)
- Improving availability of the NN
  - Faster HDFS restarts: NN bounce in 20 minutes (0.23)
  - Federation allows smaller NNs (0.23)
  - Federation will significantly improve NN isolation, and hence availability (0.23)
- Why did we wait this long for an HA NN?
  - The failure rates did not demand making this a high priority
  - Failover requires corner cases to be correctly addressed
    - Correct fencing of shared state during failover is critical; getting it wrong can corrupt data and reduce availability!!
  - Many factors impact availability, not just failover
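To make the fencing concern concrete, here is a minimal sketch (hypothetical code, not HDFS internals) of one standard technique: epoch-based fencing. The shared edit store remembers the highest epoch it has promised, so after a failover a "zombie" old active NN that still thinks it is in charge gets its stale writes rejected instead of corrupting the log:

```java
// Sketch of epoch-based fencing for shared namenode state.
// Each new active writer obtains a fresh, strictly larger epoch;
// the store rejects writes carrying any older epoch.
class FencedEditLog {
    private long promisedEpoch = 0;
    private final StringBuilder edits = new StringBuilder();

    /** A newly elected active NN calls this to obtain its epoch. */
    synchronized long becomeActive() {
        return ++promisedEpoch;
    }

    /** Append an edit; returns false if the writer has been fenced out. */
    synchronized boolean write(long epoch, String edit) {
        if (epoch < promisedEpoch) return false; // stale (old active) writer
        edits.append(edit).append('\n');
        return true;
    }

    synchronized String contents() {
        return edits.toString();
    }
}
```

The hard part alluded to on the slide is exactly the `epoch < promisedEpoch` check: without it, an old active that pauses (GC, network partition) and wakes up after failover can still scribble on shared state.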
9. HDFS Availability & Data Integrity: Response: HA NN
- Active work has started on an HA NN (failover)
  - HA NN detailed design (HDFS-1623)
  - Community effort: HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073
- HA prototype work
  - Backup NN (0.21)
  - Avatar NN (Facebook)
  - HA NN prototype using Linux HA (Yahoo!)
  - HA NN prototype with Backup NN and block report replicator (eBay)
- HA is the highest priority for 0.23.x
10. MapReduce: Fault Tolerance and Availability
- Early gains: fault tolerance of tasks and compute nodes
- Current issue: loss of the job queue if the JobTracker is restarted
- Our response: MR.next is designed with fault tolerance and availability
  - HA Resource Manager (0.23.x)
  - Loss of the Resource Manager: degraded mode; recover via restart or failover
    - Apps continue with their current resources
    - An App Manager can reschedule with its current resources
    - New apps cannot be submitted or launched; new resources cannot be allocated
  - Loss of an App Manager: recovers; the app is restarted and its state is recovered
  - Loss of tasks and nodes: recovers, as in old MapReduce
11. Backwards Compatibility
- Early gains
  - Early success stemmed from a philosophy of "ship early and often", resulting in changing APIs
  - Data and metadata compatibility were always maintained
  - The early customers paid the price; current customers reap the benefits of more mature interfaces
- Issues
  - Increased adoption leads to increased expectations of backwards compatibility
12. Backward Compatibility: Response
- Interface classification: audience and stability tags (0.21)
  - Patterned on enterprise-quality software process
- Evolve interfaces but maintain backward compatibility
  - Newer, forward-looking interfaces added; old interfaces maintained
- Test for compatibility
  - Run old jars of automation tests, and real Yahoo! applications
- Applications are adopting higher abstractions (Pig, Hive)
  - Insulates them from the lower primitive interfaces
- Wire compatibility (HADOOP-7347)
  - Maintain compatibility with the current protocol (Java serialization)
  - Adapters for addressing future discontinuities, e.g. a serialization or protocol change
  - Moved to Protocol Buffers for the data transfer protocol
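The "audience and stability tags" above are Java annotations on each interface (in Hadoop they live in `org.apache.hadoop.classification` as `InterfaceAudience` and `InterfaceStability`). The following self-contained sketch mimics the pattern with stand-in annotations and a hypothetical API class, so it runs without Hadoop on the classpath:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-ins for Hadoop's audience/stability tags, for illustration only.
public class InterfaceTags {
    @Documented @Retention(RetentionPolicy.RUNTIME)
    @interface Public {}     // safe for any downstream user to depend on

    @Documented @Retention(RetentionPolicy.RUNTIME)
    @interface Evolving {}   // may change incompatibly between minor releases

    // Hypothetical API class carrying the tags; tools and reviewers can
    // then enforce rules like "Public+Stable APIs must not break".
    @Public @Evolving
    static class FileContext {
        String resolve(String path) { return "/" + path; }
    }

    public static void main(String[] args) {
        boolean isPublic = FileContext.class.isAnnotationPresent(Public.class);
        System.out.println("public API: " + isPublic); // → "public API: true"
    }
}
```

The value of the scheme is social as much as technical: a user who programs against a `Public`/`Stable` interface knows it will be maintained, while `Private` or `Evolving` interfaces are fair game for change.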
13. Testing & Quality
- Nightly testing
  - Against 1200 automated tests on 30 nodes
  - Against live data and live applications
- QE certification for release
  - Large variety and scale of tests on 500 nodes
  - Performance benchmarking
  - QE HIT integration testing of the whole stack
- Release testing
  - Sandbox cluster: 3 clusters, each with 400-1K nodes
    - Major releases: 2 months of testing on actual data; all production projects must sign off
  - Research clusters: 6 clusters (4K nodes) running non-revenue production jobs
    - Major releases: minimum 2 months before moving to production
    - 0.25 to 0.5 million jobs per week; if a release clears research, it is mostly fine in production
- Release
  - Production clusters: 11 clusters (4.5K nodes); revenue generating, stricter SLAs
14. Fundamental Architecture Changes that Cut Across Several Issues
(Diagram: coupled, one-to-one layering — job manager/resource scheduler over compute resources; HDFS Namesystem over storage resources)
- HDFS storage is mostly a separate layer already, but with one customer: one NN
  - Federation generalizes the layer
- MapReduce compute-resource scheduling is tightly coupled to MapReduce job management
  - MapReduce.next separates the layers
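The MR.next separation described above can be sketched in code: a generic resource layer that knows nothing about MapReduce, with per-application managers on top that decide what to run in the resources they are granted. All names here are hypothetical illustrations of the layering, not the real MR.next API:

```java
import java.util.ArrayList;
import java.util.List;

// Generic lower layer: hands out leases on node resources ("containers")
// without knowing what will run in them.
interface ResourceScheduler {
    List<Container> allocate(String appId, int numContainers);
}

class Container {
    final String node;
    Container(String node) { this.node = node; }
}

// Toy scheduler that round-robins containers across four nodes.
class SimpleScheduler implements ResourceScheduler {
    private int next = 0;
    public List<Container> allocate(String appId, int n) {
        List<Container> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new Container("node-" + (next++ % 4)));
        return out;
    }
}

// Application-specific layer: one manager per MapReduce job decides what
// to launch in each granted container. An MPI or HBase manager could sit
// on the same scheduler, which is the point of the separation.
class MapReduceAppManager {
    private final ResourceScheduler scheduler;
    MapReduceAppManager(ResourceScheduler s) { this.scheduler = s; }
    int runMaps(int mapTasks) {
        return scheduler.allocate("mr-job-1", mapTasks).size();
    }
}
```

Because only `MapReduceAppManager` knows about MapReduce, the lower layer can host other frameworks and even multiple MR library versions side by side, as the next slide elaborates.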
15. Fundamental Architecture Changes that Cut Across Several Issues (continued)
(Diagram: layered, one-to-many — a generic resource scheduler over compute resources hosting MR apps with different MR library versions, MPI apps, and HBase; multiple federated HDFS Namesystems, including alternate NN implementations and MR tmp storage, over storage resources)
- Scalability, isolation, availability
- Generic lower layer: first-class support for new applications on top (MR tmp, HBase, MPI, …)
- Layering facilitates faster development of new work
  - A NN that caches the namespace: a few months of work
  - New implementations of the MR App Manager
- Compatibility: support multiple versions of MR
  - Tenants upgrade at their own pace, which is crucial for shared clusters
16. The Community Effect
- Some projects are done entirely by teams at Yahoo!, FB, or Cloudera, but several projects are joint work
  - Yahoo! & FB on NN scalability and concurrency, especially in the face of misbehaved apps
  - Edits log v2 and refactoring of the edits log (Cloudera and Yahoo!/Hortonworks): HDFS-1073, 2003, 1557, 1926
  - NN HA (Yahoo!/Hortonworks, Cloudera, FB, eBay): HDFS-1623, 1971, 1972, 1973, 1974, 1975, 2005
  - Features to support HBase: FB, Cloudera, Yahoo!, and the HBase community
- Expect to see rapid improvements in the very near future
  - Further scalability: a NN that caches part of the namespace
  - Improved I/O performance: DN performance improvements
  - Wire compatibility: wire protocols, operational improvements
  - New App Managers for MR.next
  - Continued improvement of management and operability
17. Hadoop is Successfully Crossing the Chasm
- Hadoop is used in enterprises for revenue-generating applications
- Apache Hadoop is improving at a rapid rate
  - Addressing many issues, including HA
  - Fundamental design improvements to fuel innovation
  - The might of a large, growing developer community
- Battle tested on large clusters and a variety of applications
  - At Yahoo!, Facebook, and the many other Hadoop customers
- Data integrity has been a focus from the early days
- A level of testing that even the large commercial vendors cannot match!
- Can you trust your data to anything less?
18. Q & A
Hortonworks @ Hadoop Summit:
- 1:45pm: Next Generation Apache Hadoop MapReduce (Community track), by Arun Murthy
- 2:15pm: Introducing HCatalog (Hadoop Table Manager) (Community track), by Alan Gates
- 4:00pm: Large Scale Math with Hadoop MapReduce (Applications and Research track), by Tsz-Wo Sze
- 4:30pm: HDFS Federation and Other Features (Community track), by Suresh Srinivas and Sanjay Radia
Sanjay Radia: I have been on the Hadoop project at Yahoo! for the last 4 years, and these 4 years have been a blast. I am very proud to be a Y! employee; Yahoo!'s leadership has made Hadoop the success it is today (and was forward thinking in spinning out Hortonworks). I am very excited to be part of the team at Hortonworks and to continue growing Apache Hadoop as an open-source project.
Early customers at Yahoo!: a key idea was to provide Hadoop as a service; otherwise adoption would not have been so rapid. The customer starts focusing on building his application on Hadoop rather than figuring out how to get the capex and convince IT to deploy Hadoop.
Some early decisions, why we made them, and how well they worked. Choices and tradeoffs were based on immediate customer needs, time, resources, and operational needs. Goal: gain acceptance, evolve, and grow the customer base. Then the fundamental architectural improvements that cut across various issues, concluding with the community effect.
Stronger security was driven by multiple tenants in shared clusters. Security will be critical for new enterprise users, especially those with financial data. A 10 person-year effort, done entirely by Yahoo!, and a major milestone for Apache Hadoop.
Early customers were happy with a new platform that let them solve problems they could not solve otherwise. But many new customers… Unused capacity is especially effective if applications peak at different times; mix production and non-production jobs. Fairness is not a goal: customers are guaranteed the capacity they have paid for. The NN under federation also provides isolation.
Scaling is enough to meet the needs of very large customers like FB and Y!; other customers should be more than okay. Before you need more namespace, it will be there: a NN that stores only a partial namespace in memory.
Data: can I read what I wrote, and is the service available? When I asked one of the original authors of GFS if there were any decisions they would revisit, the answer was random writers. Simplicity is key. On raw disk: file systems take time to stabilize, and by relying on the OS's file system we can take advantage of ext4, XFS, or ZFS.
Pipeline recovery: useful for slow appenders.
Certification happens once we are ready for release. HDFS (over 92% coverage): balancer, block replication, corruption, fsck, security, command lines, viewfs, faults and injection, checkpointing; Federation testing is ready to move to release testing. MR: distributed cache, capacity scheduler, speculative execution, block listing, decommissioning, UI testing; failures and injections; scale testing with 4 TTs per node. Performance: sort, scan, compression, GridMix v3, SLive, … HIT integration covers the whole stack (HDFS, MR, Oozie, Pig, Hive, HCat). Sandbox: customer application validation; release vote for 23.0. Research: a larger variety of jobs and load than the production clusters.
So far I have been explaining how we have been addressing specific issues. But we have also made fundamental changes to the architecture of both HDFS and MR, changes that cut across issues.
For HDFS the change was fairly simple. For MR it is a redesign: allocation of compute resources becomes a general layer, separating the scheduler from the app/job manager.
New append, security, Federation, MR.next: Yahoo!. Raid, HighTide: FB. HBase improvements: FB & Cloudera.
The fundamental design improvements will accelerate the rate of improvement. If there is a feature that customers need, it will be provided quickly.