    High Availability for the HDFS NameNode
    Phase 2
    Aaron T. Myers and Todd Lipcon | Cloudera HDFS Team
    October 2012




Introductions / who we are
    • Software engineers on Cloudera’s HDFS engineering team
    • Committers/PMC Members for Apache Hadoop at ASF
    • Main developers on HDFS HA
          •   Responsible for ~80% of the code for all phases of HA
              development
    •   Have helped numerous customers set up and troubleshoot HA
        HDFS clusters this year


Outline
    • HDFS HA Phase 1
        • How did it work? What could it do?
        • What problems remained?
    • HDFS HA Phase 2: Automatic failover
    • HDFS HA Phase 2: Quorum Journal




HDFS HA Phase 1 Review
    HDFS-1623: completed March 2012




HDFS HA Development Phase 1
    • Completed March 2012 (HDFS-1623)
    • Introduced the StandbyNode, a hot backup for the HDFS
      NameNode.
    • Relied on shared storage to synchronize namespace state
        •   (e.g. a NAS filer appliance)
    • Allowed operators to manually trigger failover to the Standby
    • Sufficient for many HA use cases: avoided planned downtime
      for hardware and software upgrades, planned machine/OS
      maintenance, configuration changes, etc.
HDFS HA Architecture Phase 1
    • Parallel block reports sent to Active and Standby NameNodes
    • NameNode state shared by locating edit log on NAS over NFS
    • Fencing of shared resources/data
          •   Critical that only a single NameNode is Active at any point in time
    •   Client failover done via client configuration
          •   Each client is configured with the addresses of both NNs and tries
              both to find the active (see the configuration sketch below)
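
     To make that concrete, the following is a minimal sketch of an HA-aware client
     configuration, expressed in Java rather than hdfs-site.xml; the nameservice ID,
     hostnames, and port are placeholder values.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class HaClientConfigSketch {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Logical nameservice and the two NameNodes behind it (placeholder names).
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            // Proxy provider that tries each NN in turn until it finds the active one.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            // Clients address the logical nameservice, not a specific NameNode host.
            conf.set("fs.defaultFS", "hdfs://mycluster");
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Connected to " + fs.getUri());
          }
        }

     In practice the same keys live in hdfs-site.xml and core-site.xml; setting them
     programmatically here just keeps the sketch self-contained.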


HDFS HA Architecture Phase 1
     (architecture diagram)

Fencing and NFS
    •   Must avoid split-brain syndrome
          •   Both nodes think they are active and try to write to the same file; the
               metadata becomes corrupt and requires manual intervention before a restart
    •   Configure a fencing script
          •   Script must ensure that prior active has stopped writing
          •   STONITH: shoot-the-other-node-in-the-head
          •   Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
    •   The fencing script must succeed for the failover to succeed (a configuration
        sketch follows below)
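
     For illustration only, here is how the fencing hook might look when set in code;
     the shell script path is a hypothetical placeholder for a site-specific script
     (for example, one that calls the filer's API), and in practice these keys are
     set in hdfs-site.xml.

        import org.apache.hadoop.conf.Configuration;

        public class FencingConfigSketch {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Fencing methods are tried in order (newline-separated) until one succeeds:
            // sshfence logs into the prior active host and kills the NameNode process;
            // shell(...) runs a site-specific script, e.g. one that revokes the old
            // active's filer access. The script path below is a made-up placeholder.
            conf.set("dfs.ha.fencing.methods",
                "sshfence\nshell(/usr/local/bin/fence_storage.sh)");
            conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
            System.out.println(conf.get("dfs.ha.fencing.methods"));
          }
        }

     Phase 2's quorum journal, covered later in the deck, makes this kind of custom
     fencing configuration unnecessary.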


Shortcomings of Phase 1
    •   Insufficient to protect against unplanned downtime
          • Manual failover only: requires an operator to step in quickly after
            a crash
          • Various studies indicated this accounts for a minority of downtime, but
            it is still important to address
    •   Requirement of a NAS device made deployment
        complex, expensive, and error-prone

    (we always knew this was just the first phase!)

HDFS HA Development Phase 2
     •   Multiple new features for high availability
           •   Automatic failover, based on Apache ZooKeeper
           •   Remove dependency on NAS (network-attached storage)

     •   Address new HA use cases
           •   Avoid unplanned downtime due to software or hardware faults
           •   Deploy in filer-less environments
           •   Completely stand-alone HA with no external hardware or software
               dependencies
                 •   no Linux-HA, filers, etc

Automatic Failover Overview
     HDFS-3042: completed May 2012




Automatic Failover Goals
     •   Automatically detect failure of the Active NameNode
           •   Hardware, software, network, etc.
     •   Do not require operator intervention to initiate failover
           •   Once failure is detected, process completes automatically
     •   Support manually initiated failover as first-class
           •   Operators can still trigger failover without having to stop Active
     •   Do not introduce a new SPOF
           •   All parts of auto-failover deployment must themselves be HA

Automatic Failover Architecture
     • Automatic failover requires ZooKeeper
         • Not required for manual failover
     • ZK makes it easy to:
         • Detect failure of Active NameNode
         • Determine which NameNode should become the Active NN




Automatic Failover Architecture
     • Introduce new daemon in HDFS: ZooKeeper Failover Controller
     • In an auto failover deployment, run two ZKFCs
          • One per NameNode, on that NameNode machine
     • ZooKeeper Failover Controller (ZKFC) is responsible for:
          • Monitoring health of associated NameNode
          • Participating in leader election of NameNodes
          • Fencing the other NameNode if it wins election


Automatic Failover Architecture
     (architecture diagram)

ZooKeeper Failover Controller Details
     •   When a ZKFC is started, it:
          • Begins checking the health of its associated NN via RPC
          • As long as the associated NN is healthy, attempts to create
            an ephemeral znode in ZK
          • One of the two ZKFCs will succeed in creating the znode
            and transition its associated NN to the Active state
           • The other ZKFC transitions its associated NN to the Standby
             state and begins monitoring the ephemeral znode (this election
             flow is sketched below)
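
     A minimal sketch of that election step using the plain ZooKeeper Java client;
     the real ZKFC (its ActiveStandbyElector) also handles session expiry, retries,
     and fencing, so treat the lock path and payload here as illustrative placeholders.

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class ZkfcElectionSketch implements Watcher {
          // Placeholder lock path; the real ZKFC manages its own znodes under /hadoop-ha.
          private static final String LOCK = "/hadoop-ha/mycluster/ActiveStandbyElectorLock";
          private final ZooKeeper zk;

          public ZkfcElectionSketch(String zkQuorum) throws Exception {
            zk = new ZooKeeper(zkQuorum, 5000, this);
          }

          /** Try to become active by creating an ephemeral lock znode. */
          public void joinElection(byte[] localNnInfo) throws Exception {
            try {
              zk.create(LOCK, localNnInfo, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
              // Won the election: transition the local NameNode to Active.
            } catch (KeeperException.NodeExistsException e) {
              // Lost: transition the local NameNode to Standby and watch the lock.
              zk.exists(LOCK, true);
            }
          }

          @Override
          public void process(WatchedEvent event) {
            // When the lock znode disappears (the old active died), re-run joinElection().
          }
        }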

What happens when…
     • … a NameNode process crashes?
          • Its associated ZKFC notices the health failure of the NN and
            quits the active/standby election by removing its znode
     • … a whole NameNode machine crashes?
         • ZKFC process crashes with it and the ephemeral znode is
           deleted from ZK



What happens when…
     • … the two NameNodes are partitioned from each other?
         • Nothing happens: Only one will still have the znode
      • … ZooKeeper crashes (or is down for an upgrade)?
         • Nothing happens: active stays active




Fencing Still Required with ZKFC
     • Tempting to think ZooKeeper means no need for fencing
     • Consider the following scenario:
         • Two NameNodes: A and B, each with associated ZKFC
         • ZKFC A process crashes, ephemeral znode removed
         • NameNode A process is still running
         • ZKFC B notices znode removed
         • ZKFC B wants to transition NN B to Active, but without
           fencing NN A, both NNs would be active simultaneously
Auto-failover recap
     •   New daemon ZooKeeperFailoverController monitors the
         NameNodes
           • Automatically triggers fail-overs
           • No need for operator intervention




             Fencing and dependency on NFS storage still a pain


Removing the NAS dependency
     HDFS-3077: completed October 2012




Shared Storage in HDFS HA
     •   The Standby NameNode synchronizes the namespace by
         following the Active NameNode’s transaction log
            • Each operation (e.g. mkdir(/foo)) is written to the log by the Active
            • The StandbyNode periodically reads all new edits and applies
              them to its own metadata structures (sketched below)
     •   Reliable shared storage is required for correct operation
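
     The tailing loop itself is conceptually simple. Below is a minimal sketch with
     made-up interfaces (SharedJournal, Namespace, and EditOp are hypothetical
     stand-ins, not the real HDFS edit log classes) just to show the shape of the loop.

        import java.util.List;

        /** Hypothetical sketch of the Standby NameNode's edit-tailing loop. */
        public class EditTailerSketch {
          interface EditOp { long txId(); }                              // e.g. mkdir(/foo)
          interface SharedJournal { List<EditOp> readEditsSince(long txId); }
          interface Namespace { void apply(EditOp op); }

          private final SharedJournal journal;
          private final Namespace namespace;
          private long lastAppliedTxId;

          EditTailerSketch(SharedJournal journal, Namespace namespace, long startTxId) {
            this.journal = journal;
            this.namespace = namespace;
            this.lastAppliedTxId = startTxId;
          }

          /** Periodically pull new transactions and replay them into the namespace. */
          public void run() throws InterruptedException {
            while (true) {
              for (EditOp op : journal.readEditsSince(lastAppliedTxId)) {
                namespace.apply(op);
                lastAppliedTxId = op.txId();
              }
              Thread.sleep(2000);   // tail interval; the real interval is configurable
            }
          }
        }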




Shared Storage in “Phase 1”
     • Operator configures a traditional shared storage device (eg SAN
       or NAS)
     • Mount the shared storage via NFS on both Active and Standby
       NNs
     • Active NN writes to a directory on NFS, while Standby reads it




Shortcomings of NFS-based approach
     •   Custom hardware
           •   Lots of our customers don’t have SAN/NAS available in their datacenter
           •   Costs money, time and expertise
           •   Extra “stuff” to monitor outside HDFS
           •   We just moved the SPOF, didn’t eliminate it!
     •   Complicated
           •   Storage fencing, NFS mount options, multipath networking, etc
           •   Organizationally complicated: dependencies on storage ops team
     •   NFS issues
           •   Buggy client implementations, little control over timeout behavior, etc
Primary Requirements for Improved Storage
     • No special hardware (PDUs, NAS)
     • No custom fencing configuration
          •   Too complicated == too easy to misconfigure
     •   No SPOFs
          • punting to filers isn’t a good option
          • need something inherently distributed




Secondary Requirements
     •   Configurable failure toleration
           •   Configure N nodes to tolerate (N-1)/2 failures (e.g. 3 nodes
               tolerate 1 failure, 5 nodes tolerate 2)
     •   Making N bigger (within reasonable bounds) shouldn’t hurt
         performance. Implies:
           • Writes done in parallel, not pipelined
           • Writes should not wait on slowest replica
     •   Locate replicas on existing hardware investment (eg share with
         JobTracker, NN, SBN)

Operational Requirements
     •   Should be operable by existing Hadoop admins. Implies:
           • Same metrics system (“hadoop metrics”)
           • Same configuration system (xml)
           • Same logging infrastructure (log4j)
           • Same security system (Kerberos-based)
     • Allow existing ops to easily deploy and manage the new feature
     • Allow existing Hadoop tools to monitor the feature
           •   (eg Cloudera Manager, Ganglia, etc)

Our solution: QuorumJournalManager
     •   QuorumJournalManager (client)
           • Plugs into JournalManager abstraction in NN (instead of existing
             FileJournalManager)
           • Provides edit log storage abstraction
     •   JournalNode (server)
           • Standalone daemon running on an odd number of nodes
           • Provides actual storage of edit logs on local disks
           • Could run inside other daemons in the future


Architecture
     (architecture diagram)

Commit protocol
     • NameNode accumulates edits locally as they are logged
     • On logSync(), sends accumulated batch to all JNs via Hadoop
       RPC
      • Waits for a success ACK from a majority of nodes (sketched below)
         • Majority commit means that a single lagging or crashed replica
           does not impact NN latency
         • Latency @ NN = median(Latency @ JNs)
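
     A sketch of that majority-ACK write, using a hypothetical JournalChannel
     interface in place of the real Hadoop RPC proxies. The point is that logSync()
     returns as soon as a quorum has acknowledged, so a single slow or dead
     JournalNode never blocks the commit.

        import java.util.List;
        import java.util.concurrent.CountDownLatch;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        public class QuorumCommitSketch {
          /** Hypothetical stand-in for the RPC channel to one JournalNode. */
          interface JournalChannel {
            void sendEdits(long firstTxId, byte[] batch) throws Exception;
          }

          private final List<JournalChannel> journals;   // e.g. 3 or 5 JournalNodes
          private final ExecutorService pool = Executors.newCachedThreadPool();

          QuorumCommitSketch(List<JournalChannel> journals) { this.journals = journals; }

          /** Returns true once a majority of JournalNodes have accepted the batch. */
          public boolean logSync(long firstTxId, byte[] batch, long timeoutMs)
              throws InterruptedException {
            int majority = journals.size() / 2 + 1;
            CountDownLatch acks = new CountDownLatch(majority);
            for (JournalChannel jn : journals) {
              pool.submit(() -> {
                try {
                  jn.sendEdits(firstTxId, batch);   // sent in parallel, not pipelined
                  acks.countDown();
                } catch (Exception e) {
                  // A lagging or crashed JN simply never ACKs; it cannot block the commit.
                }
              });
            }
            return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
          }
        }

     With three JournalNodes, the NN effectively waits for the second-fastest ACK,
     which is why Latency @ NN tracks the median of the JN latencies.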



JN Fencing
     • How do we prevent split-brain?
     • Each instance of QJM is assigned a unique epoch number
         •   provides a strong ordering between client NNs
         •   Each IPC contains the client’s epoch
         •   JN remembers on disk the highest epoch it has seen
          •   Any request from an earlier epoch is rejected; any from a newer
               one is recorded on disk (sketched below)
         •   Distributed Systems folks may recognize this technique from
             Paxos and other literature
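
     On the JournalNode side the rule is small. Here is a sketch of the check, with
     persistEpoch() standing in (hypothetically) for durably writing the value to
     the JN's local disk.

        import java.io.IOException;

        /** Sketch of the epoch check a JournalNode applies to incoming requests. */
        public class EpochCheckSketch {
          private long promisedEpoch;   // loaded from local disk at startup

          synchronized void checkRequest(long requestEpoch) throws IOException {
            if (requestEpoch < promisedEpoch) {
              // Request from a fenced-out (older) NameNode: reject it.
              throw new IOException("Epoch " + requestEpoch
                  + " is less than the promised epoch " + promisedEpoch);
            }
            if (requestEpoch > promisedEpoch) {
              // A newer NameNode has taken over: durably record the new epoch first.
              persistEpoch(requestEpoch);
              promisedEpoch = requestEpoch;
            }
            // requestEpoch == promisedEpoch: accept and process the request.
          }

          private void persistEpoch(long epoch) throws IOException {
            // Hypothetical: write and fsync the epoch to a file on the JN's local disk.
          }
        }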

Fencing with epochs
     • Fencing is now implicit
     • The act of becoming active causes any earlier active NN to be
       fenced out
           •   Since a quorum of nodes has accepted the new active, any other
               IPC by an earlier epoch number can’t get quorum
     •   Eliminates confusing and error-prone custom fencing
         configuration


Segment recovery
     • In normal operation, a minority of JNs may be out of sync
     • After a crash, all JNs may have different numbers of txns (last batch
       may or may not have arrived at each)
           •   eg JN1 was down, JN2 crashed right before NN wrote txnid 150:
                 • JN1: has no edits
                 • JN2: has edits 101-149
                 • JN3: has edits 101-150
     •   Before becoming active, we need to come to consensus on this last
         batch: was it committed or not?
           •   Use the well-known Paxos algorithm to solve consensus

Other implementation features
     •   Hadoop Metrics
           • lag, percentile latencies, etc from perspective of JN, NN
           • metrics for queued txns, % of time each JN fell behind, etc, to
             help suss out a slow JN before it causes problems
     •   Security
           •   full Kerberos and SSL support: edits can be optionally encrypted
               in-flight, and all access is mutually authenticated



Testing
     •   Randomized fault test
            • Runs all communications in a single thread, with deterministic
              ordering and fault injection driven by a seed (illustrated below)
           • Caught a number of really subtle bugs along the way
           • Run as an MR job: 5000 fault tests in parallel
           • Multiple CPU-years of stress testing: found 2 bugs in Jetty!
     •   Cluster testing: 100-node, MR, HBase, Hive, etc
           •   Commit latency in practice: within same range as local disks
               (better than one of two local disks, worse than the other one)
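
     This is not the actual test harness, just a sketch of the core idea behind it:
     every source of nondeterminism is driven by a single seeded Random, so any
     failing run can be replayed exactly by re-running with the same seed.

        import java.util.Random;

        /** Sketch of seed-driven fault injection: same seed => same event order and faults. */
        public class RandomFaultSketch {
          private final Random random;

          RandomFaultSketch(long seed) { this.random = new Random(seed); }

          /** Called around each simulated RPC; throws to simulate a failed call. */
          void maybeInjectFault(String rpcName) {
            if (random.nextDouble() < 0.05) {          // 5% fault rate (arbitrary choice)
              throw new RuntimeException("Injected fault in " + rpcName);
            }
          }

          public static void main(String[] args) {
            long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
            System.out.println("Running with seed " + seed);  // log the seed so failures can be replayed
            RandomFaultSketch faults = new RandomFaultSketch(seed);
            for (int i = 0; i < 100; i++) {
              try {
                faults.maybeInjectFault("sendEdits(txid=" + i + ")");
                // ... deliver the message to the single-threaded simulation here ...
              } catch (RuntimeException e) {
                System.out.println(e.getMessage());
              }
            }
          }
        }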

Deployment and Configuration
     •   Most customers running 3 JNs (tolerate 1 failure)
           •   1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager
           •   Optionally run 2 more (eg on bastion/gateway nodes) to tolerate 2
               failures
     •   Configuration:
           •   dfs.namenode.shared.edits.dir:
                qjournal://nn1.company.com:8485,nn2.company.com:8485,jt.company.com:8485/my-journal
           •   dfs.journalnode.edits.dir:
               /data/1/hadoop/journalnode/
           •   dfs.ha.fencing.methods:
               shell(/bin/true)    (fencing not required!)
Status
     • Merged into Hadoop development trunk in early October
     • Available in CDH4.1
     • Deployed at several customer/community sites with good
       success so far
         •   Planned rollout to 20+ production HBase clusters within the
             month




Conclusion




HA Phase 2 Improvements
     • Runs an active NameNode and a hot Standby NameNode
     • Automatically triggers seamless failover using Apache
       ZooKeeper
     • Stores shared metadata on QuorumJournalManager: a fully
       distributed, redundant, low latency journaling system.

     •   All improvements available now in HDFS trunk and CDH4.1


Backup Slides




Why not BookKeeper?
     •   Pipelined commit instead of quorum commit
           •   Unpredictable latency
     • Research project
     • Not “Hadoopy”
           •   Their own IPC system, no security, different configuration, no
               metrics
     •   External
           •   Feels like “two systems” to ops/deployment instead of just one
     •   Nevertheless: it’s pluggable and BK is an additional option.
Epoch number assignment
     •   On startup:
           •   NN -> JN: getEpochInfo()
                 •   JN: respond with current promised epoch
           • NN: set epoch = max(promisedEpoch) + 1
           • NN -> JN: newEpoch(epoch)
                  •   JN: if the proposed epoch is still higher than promisedEpoch,
                      remember it and ACK; otherwise NACK
            •   If the NN receives ACKs from a quorum of nodes, then it has uniquely
                claimed that epoch (see the sketch below)
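
     A sketch of that negotiation from the NameNode side, again using a hypothetical
     JournalChannel interface rather than the real RPC protocol:

        import java.util.List;

        /** Sketch of how a new QJM client claims a unique epoch number. */
        public class EpochAssignmentSketch {
          /** Hypothetical stand-in for the RPC channel to one JournalNode. */
          interface JournalChannel {
            long getPromisedEpoch() throws Exception;        // getEpochInfo()
            boolean newEpoch(long epoch) throws Exception;   // true = ACK, false = NACK
          }

          /** Returns the claimed epoch, or -1 if a quorum could not be reached. */
          static long claimEpoch(List<JournalChannel> journals) {
            long maxPromised = 0;
            for (JournalChannel jn : journals) {
              try {
                maxPromised = Math.max(maxPromised, jn.getPromisedEpoch());
              } catch (Exception e) {
                // An unreachable JN simply does not contribute to the max.
              }
            }
            long epoch = maxPromised + 1;

            int acks = 0;
            for (JournalChannel jn : journals) {
              try {
                if (jn.newEpoch(epoch)) acks++;
              } catch (Exception e) {
                // Treat an unreachable JN as a NACK.
              }
            }
            // A majority of ACKs means no other NameNode can ever claim this epoch.
            return (acks >= journals.size() / 2 + 1) ? epoch : -1;
          }
        }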
