HDFS
High Availability
Suresh Srinivas - Hortonworks
Aaron T. Myers - Cloudera
Overview
• Part 1 – Suresh Srinivas (Hortonworks)
  − HDFS Availability and Reliability – what is the record?
  − HA Use Cases
  − HA Design
• Part 2 – Aaron T. Myers (Cloudera)
  − NN HA Design Details
         Automatic failure detection and NN failover
         Client-NN connection failover
  − Operations and Admin of HA
  − Future Work



Availability, Reliability and Maintainability
Reliability = MTBF/(1 + MTBF)
• Probability a system performs its functions without failure for
  a desired period of time
Maintainability = 1/(1+MTTR)
• Probability that a failed system can be restored within a given
  timeframe
Availability = MTTF/MTBF
• Probability that a system is up when requested for use
• Depends on both Reliability and Maintainability

Mean Time To Failure (MTTF): Average time a system runs before it fails
Mean Time To Repair/Restore (MTTR): Average time to repair failed system
Mean Time Between Failures (MTBF): Average time between successive failures = MTTR + MTTF
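
As a worked example of the definitions above, the sketch below computes availability from MTTF and MTTR using MTBF = MTTR + MTTF. The input values are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours


# Hypothetical system: 999 hours of uptime per failure, 1 hour to restore.
example = availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"{example:.3f}")
```

Holding MTTF fixed, shrinking MTTR is the only way to raise the result, which is why the rest of the talk focuses on restart and failover time.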

Current HDFS Availability & Data Integrity
• Simple design for Higher Reliability
  − Storage: Rely on Native file system on the OS rather than use raw disk
  − Single NameNode master
          Entire file system state is in memory
  − DataNodes simply store and deliver blocks
          All sophisticated recovery mechanisms in NN
• Fault Tolerance
  − Design assumes disks, nodes and racks fail
  − Multiple replicas of blocks
          active monitoring and replication
          DNs actively monitor for block deletion and corruption
  − Restart/migrate the NameNode on failure
          Persistent state: multiple copies + checkpoints
          Functions as Cold Standby
  − Restart/replace the DNs on failure
  − DNs tolerate individual disk failures

How Well Did HDFS Work?

• Data Reliability
  − Lost 19 out of 329 Million blocks on 10 clusters with 20K nodes in 2009
  − 7-9’s of reliability
  − Related bugs fixed in Hadoop 0.20 and 0.21.
• NameNode Availability
  − 18-month study: 22 failures on 25 clusters, or 0.58 failures per year per cluster
  − Only 8 would have benefitted from HA failover!! (0.23 failures per cluster year)
  − NN is very reliable
          Resilient against overload caused by misbehaving apps
• Maintainability
  − Large clusters see failure of one DataNode/day and more frequent disk failures
  − Maintenance once in 3 months to repair or replace DataNodes
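
The headline figures on this slide can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: the helper names are made up here, and the rate calculation assumes every one of the 25 clusters was observed for the full 18 months, which the slide does not state.

```python
import math


def nines(lost: int, total: int) -> int:
    """Count the leading 9s in the surviving fraction 1 - lost/total."""
    return math.floor(-math.log10(lost / total))


def failures_per_cluster_year(failures: int, clusters: int, months: float) -> float:
    """Failures divided by total observed cluster-years."""
    return failures / (clusters * months / 12.0)


block_nines = nines(lost=19, total=329_000_000)
nn_rate = failures_per_cluster_year(failures=22, clusters=25, months=18)
print(block_nines)
print(round(nn_rate, 3))
```

Under that assumption the NameNode rate comes out just under 0.59, which the slide reports as 0.58.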

Why NameNode HA?
• NameNode is highly reliable (high MTTF)
  − But Availability is not the same as Reliability
• NameNode MTTR depends on
  − Restarting NameNode daemon on failure
          Operator restart – (failure detection + manual restore) time
          Automatic restart – 1-2 minutes
  − NameNode Startup time
          Small/medium cluster 1-2 minutes
          Very large cluster – 5-15 minutes
• Affects applications that have real-time requirements
• For higher HDFS Availability
  − Need redundant NameNode to eliminate SPOF
  − Need automatic failover to reduce MTTR and improve Maintainability
  − Need Hot standby to reduce MTTR for very large clusters
          Cold standby is sufficient for small clusters
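
To see why MTTR dominates for large clusters, expected unplanned downtime per year can be approximated as (failures per year) x (MTTR). The sketch below uses the 0.58 failures per cluster-year from the study and the 15-minute upper bound for a very large cluster's startup. The 1-minute hot-standby failover time is a hypothetical figure for comparison, not a number from the talk, and failure-detection time is ignored.

```python
def expected_downtime_minutes_per_year(failures_per_year: float,
                                       mttr_minutes: float) -> float:
    """Rough estimate: each failure costs one MTTR of downtime."""
    return failures_per_year * mttr_minutes


FAILURES_PER_YEAR = 0.58  # NameNode failure rate reported in the 18-month study

# Cold restart of a very large cluster (upper bound of the 5-15 minute range).
cold_restart = expected_downtime_minutes_per_year(FAILURES_PER_YEAR, 15.0)
# Hot standby with an assumed (hypothetical) 1-minute failover.
hot_failover = expected_downtime_minutes_per_year(FAILURES_PER_YEAR, 1.0)

print(round(cold_restart, 2), round(hot_failover, 2))
```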


NameNode HA – Initial Goals

• Support for Active and a single Standby
  − Active and Standby with manual failover
         Standby could be cold/warm/hot
         Addresses downtime during upgrades – main cause of unavailability
  − Active and Standby with automatic failover
         Hot standby
         Addresses downtime during upgrades and other failures
• Backward compatible configuration
• Standby performs checkpointing
  − Secondary NameNode not needed
• Management and monitoring tools
• Design philosophy – choose data integrity over service availability


High Level Use Cases
• Planned downtime
  − Upgrades
  − Config changes
  − Main reason for downtime
• Unplanned downtime
  − Hardware failure
  − Server unresponsive
  − Software failures
  − Occurs infrequently
Supported failures
• Single hardware failure
  − Double hardware failure not supported
• Some software failures
  − Same software failure affects both active and standby



High Level Design
• Service monitoring and leader election outside NN
  − Similar to industry standard HA frameworks
• Parallel Block reports to both Active and Standby NN
• Shared or non-shared NN file system state
• Fencing of shared resources/data
  − DataNodes
  − Shared NN state (if any)
• Client failover
  − Client side failover (based on configuration or ZooKeeper)
  − IP Failover


Design Considerations
• Sharing state between Active and Hot Standby
  − File system state and Block locations
• Automatic Failover
  − Monitoring Active NN and performing failover on failure
• Making a NameNode active during startup
  − Reliable mechanism for choosing only one NN as active and the other as
    standby
• Prevent data corruption on split brain
  − Shared Resource Fencing
         DataNodes and shared storage for NN metadata
  − NameNode Fencing
         when shared resource cannot be fenced
• Client failover
  − Clients connect to the new Active NN during failover


Failover Control Outside NN
• Similar to industry standard HA frameworks
• HA daemon outside NameNode
  − Simpler to build
  − Immune to NN failures
• Daemon manages resources
  − Resources – OS, HW, Network etc.
  − NameNode is just another resource
• Performs
  − Active NN election during startup
  − Automatic Failover
  − Fencing
         Shared resources
         NameNode
[Diagram: a Failover Controller coordinates through ZooKeeper and issues actions (start, stop, failover, monitor, …) to resources and shared resources]
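Architecture
[Diagram: three ZooKeeper (ZK) nodes provide leader election for two Failover Controllers, one Active and one Standby. Each Failover Controller monitors the health of its NN and sends it commands. The Active NN and Standby NN share editlogs (with fencing). DataNodes (DN) send block reports to both NNs.]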
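First Phase – Hot Standby
[Diagram: the Active NN and Standby NN share editlogs on shared NFS storage, which itself needs to be HA. Failover between the NNs is manual. The links between the DataNodes (DN) and the NNs are labeled "Block Reports" and "DN fencing".]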
HA Design Details


Client Failover Design Details

• Smart clients (client side failover)
  − Users use one logical URI, client selects correct NN to connect to
  − Clients know which operations are idempotent, therefore safe to retry on
    a failover
  − Clients have configurable failover/retry strategies
• Current implementation
  − Client configured with the addresses of all NNs
• Other implementations in the future (more later)
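
The smart-client behaviour described above can be sketched as follows. This is a simplified illustration and not the HDFS client code: `StandbyError`, `call_with_failover`, and `fake_rpc` are hypothetical names, and the NameNodes are simulated by a plain function.

```python
from typing import Callable, Sequence


class StandbyError(Exception):
    """Raised by a (simulated) NameNode that is not currently active."""


def call_with_failover(namenodes: Sequence[str],
                       rpc: Callable[[str], str],
                       idempotent: bool,
                       max_attempts: int = 4) -> str:
    """Cycle through the configured NameNode addresses until one answers."""
    last_error = None
    for attempt in range(max_attempts):
        target = namenodes[attempt % len(namenodes)]
        try:
            return rpc(target)
        except StandbyError as err:
            # The request was rejected before running, so trying the
            # other NameNode is always safe.
            last_error = err
        except ConnectionError as err:
            # The outcome is unknown; only idempotent calls may be retried.
            if not idempotent:
                raise
            last_error = err
    raise RuntimeError("no active NameNode reachable") from last_error


def fake_rpc(address: str) -> str:
    """Simulated cluster in which host1 is the standby and host2 is active."""
    if address == "host1.example.com:8020":
        raise StandbyError(address)
    return f"ok from {address}"


result = call_with_failover(
    ["host1.example.com:8020", "host2.example.com:8020"],
    fake_rpc,
    idempotent=True,
)
print(result)
```

The split between the two exception types reflects the slide's point that only operations known to be idempotent are safe to retry across a failover.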




Client Failover Configuration Example
...
<property>
 <name>dfs.namenode.rpc-address.name-service1.nn1</name>
 <value>host1.example.com:8020</value>
</property>
<property>
 <name>dfs.namenode.rpc-address.name-service1.nn2</name>
 <value>host2.example.com:8020</value>
</property>
<property>
 <name>dfs.namenode.http-address.name-service1.nn1</name>
 <value>host1.example.com:50070</value>
</property>
...
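
For context, the sketch below renders the two RPC addresses from the slide together with the companion settings a client typically needs to resolve the logical name. The three extra property names (`dfs.nameservices`, `dfs.ha.namenodes.<nameservice>`, `dfs.client.failover.proxy.provider.<nameservice>`) are taken from the Apache Hadoop HA documentation for later 2.x releases; they are an assumption here and may differ from the version this talk describes, so check them against your release.

```python
import xml.etree.ElementTree as ET

NAMESERVICE = "name-service1"

properties = {
    # Assumed companion settings (not shown on the slide).
    "dfs.nameservices": NAMESERVICE,
    f"dfs.ha.namenodes.{NAMESERVICE}": "nn1,nn2",
    f"dfs.client.failover.proxy.provider.{NAMESERVICE}":
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    # Addresses as given on the slide.
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn1": "host1.example.com:8020",
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn2": "host2.example.com:8020",
}


def to_hdfs_site(props: dict) -> str:
    """Render a name/value mapping as Hadoop-style <configuration> XML."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_hdfs_site(properties)
print(xml_text)
```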



Automatic Failover Design Details
• Automatic failover requires ZooKeeper
  − Not required for manual failover
  − ZK makes it easy to:
         Detect failure of the active NN
         Determine which NN should become the Active NN
• On both NN machines, run another daemon
  − ZKFailoverController (ZooKeeper Failover Controller)
• Each ZKFC is responsible for:
  − Health monitoring of its associated NameNode
  − ZK session management / ZK-based leader election
• See HDFS-2185 and HADOOP-8206 for more details
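
The division of labour between health monitoring and leader election can be illustrated with a toy model. Everything below is hypothetical: `FakeZooKeeper` is an in-memory stand-in for a single ephemeral lock node, not the ZooKeeper API, and the real ZKFC also fences the previous active before promoting its NameNode, which this sketch leaves out.

```python
class FakeZooKeeper:
    """In-memory stand-in for a ZK ensemble holding one ephemeral lock node."""

    def __init__(self) -> None:
        self.lock_owner = None

    def try_acquire(self, session_id: str) -> bool:
        """Take the lock if it is free; report whether the caller holds it."""
        if self.lock_owner is None:
            self.lock_owner = session_id
        return self.lock_owner == session_id

    def release(self, session_id: str) -> None:
        """Model an ephemeral node vanishing when its owner leaves."""
        if self.lock_owner == session_id:
            self.lock_owner = None


class FailoverController:
    """Toy failover controller: one per NameNode."""

    def __init__(self, name: str, zk: FakeZooKeeper) -> None:
        self.name = name
        self.zk = zk
        self.state = "standby"

    def tick(self, namenode_healthy: bool) -> str:
        """Run one monitoring cycle and return the resulting NN state."""
        if not namenode_healthy:
            self.zk.release(self.name)  # drop out of the election
            self.state = "standby"
        elif self.zk.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"
        return self.state


zk = FakeZooKeeper()
fc1 = FailoverController("nn1", zk)
fc2 = FailoverController("nn2", zk)
states = [
    (fc1.tick(True), fc2.tick(True)),   # nn1 wins the initial election
    (fc1.tick(False), fc2.tick(True)),  # nn1 becomes unhealthy, nn2 takes over
]
print(states)
```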


                                       17
Automatic Failover Design Details (cont)




Ops/Admin: Shared Storage

• To share NN state, need shared storage
  − Needs to be HA itself to avoid just shifting SPOF
  − Many come with IP fencing options
  − Recommended mount options:
         tcp,soft,intr,timeo=60,retrans=10
• Still configure local edits dirs, but shared dir is special
• Work is currently underway to do away with shared storage
  requirement (more later)




Ops/Admin: NN fencing
• Critical for correctness that only one NN is active at a time
• Out of the box
  − RPC to active NN to tell it to go to standby (graceful failover)
  − SSH to active NN and `kill -9` the NN process
• Pluggable options
  − Many filers have protocols for IP-based fencing options
  − Many PDUs have protocols for IP-based plug-pulling (STONITH)
         Nuke the node from orbit. It’s the only way to be sure.
• Configure extra options if available to you
  − Will be tried in order during a failover event
  − Escalate the aggressiveness of the method
  − Fencing is critical for correctness of NN metadata
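
The "tried in order, escalating" behaviour can be sketched as a loop over fencing methods that stops at the first success. This is an illustration under assumptions, not Hadoop's fencing code: the method names and the three simulated functions are hypothetical, and no real RPC, SSH, or PDU call is made.

```python
from typing import Callable, Sequence, Tuple

FenceMethod = Tuple[str, Callable[[str], bool]]


def fence(target: str, methods: Sequence[FenceMethod]) -> str:
    """Try each method in order; return the name of the first that succeeds."""
    for name, method in methods:
        try:
            if method(target):
                return name
        except Exception:
            # Treat an error as a failed attempt and escalate to the next method.
            continue
    # Data integrity over availability: without fencing, do not fail over.
    raise RuntimeError(f"could not fence {target}; refusing to fail over")


def graceful_rpc(host: str) -> bool:
    return False  # simulated: the old active NN does not respond


def ssh_kill(host: str) -> bool:
    return True  # simulated: the NN process was killed over SSH


def pdu_power_off(host: str) -> bool:
    return True  # simulated STONITH; not reached in this run


used = fence(
    "host1.example.com",
    [("graceful", graceful_rpc), ("sshfence", ssh_kill), ("stonith", pdu_power_off)],
)
print(used)
```

Raising when every method fails mirrors the design philosophy stated earlier in the talk: choose data integrity over service availability.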


Ops/Admin: Automatic Failover
• Deploy ZK as usual (3 or 5 nodes) or reuse existing ZK
  − ZK daemons have light resource requirements
  − OK to collocate 1 on each NN, many collocate 3rd on the YARN RM
  − Advisable to configure ZK daemons with dedicated disks for isolation
  − Fine to use the same ZK quorum as for HBase, etc.
• Fencing methods still required
  − The ZKFC that wins the election is responsible for performing fencing
  − Fencing script(s) must be configured and work from the NNs
• Admin commands which manually initiate failovers still work
  − But rather than coordinating the failover themselves, use the ZKFCs
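
The advice to run 3 or 5 ZK nodes follows from majority-quorum arithmetic: an ensemble stays available only while more than half of its servers are up. A minimal sketch (the function name is made up for this example):

```python
def tolerated_failures(ensemble_size: int) -> int:
    """A majority quorum of n servers survives the loss of (n - 1) // 2 of them."""
    return (ensemble_size - 1) // 2


for n in (3, 4, 5):
    print(n, tolerated_failures(n))
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is why odd sizes are the usual choice.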



Ops/Admin: Monitoring
• New NN metrics
  − Size of pending DN message queues
  − Seconds since the standby NN last read from shared edit log
  − DN block report lag
  − All measurements of standby NN lag – monitor/alert on all of these
• Monitor shared storage solution
  − Volumes fill up, disks go bad, etc
  − Should configure paranoid edit log retention policy (default is 2)
• Canary-based monitoring of HDFS is a good idea
  − Pinging both NNs not sufficient
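
Alerting on the standby-lag metrics can be sketched as a simple threshold check. The field names and limits below are hypothetical placeholders chosen for this example; they are not the actual NameNode metric names, and suitable thresholds depend on the cluster.

```python
from dataclasses import dataclass


@dataclass
class StandbyLag:
    pending_dn_messages: int
    seconds_since_last_edit_read: float
    block_report_lag_seconds: float


# Hypothetical alert thresholds; tune for your own cluster.
LIMITS = StandbyLag(
    pending_dn_messages=10_000,
    seconds_since_last_edit_read=120.0,
    block_report_lag_seconds=600.0,
)


def alerts(sample: StandbyLag, limits: StandbyLag = LIMITS) -> list:
    """Return the names of all lag metrics that exceed their limit."""
    fired = []
    for field in ("pending_dn_messages",
                  "seconds_since_last_edit_read",
                  "block_report_lag_seconds"):
        if getattr(sample, field) > getattr(limits, field):
            fired.append(field)
    return fired


print(alerts(StandbyLag(50, 300.0, 30.0)))
```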



Ops/Admin: Hardware
• Active/Standby NNs should be on separate racks
• Shared storage system should be on separate rack
• Active/Standby NNs should have close to the same hardware
  − Same amount of RAM – need to store the same things
  − Same # of processors - need to serve same number of clients
• All the same recommendations still apply for NN
  − ECC memory, 48GB
  − Several separate disks for NN metadata directories
  − Redundant disks for OS drives, probably RAID 5 or mirroring
  − Redundant power



Future Work
• Other options to share NN metadata
  − Journal daemons with list of active JDs stored in ZK (HDFS-3092)
  − Journal daemons with quorum writes (HDFS-3077)


• More advanced client failover/load shedding
  − Serve stale reads from the standby NN
  − Speculative RPC
  − Non-RPC clients (IP failover, DNS failover, proxy, etc.)
  − Less client-side configuration (ZK, custom DNS records, HDFS-3043)


• Even Higher HA
  − Multiple standby NNs

Q&A

• HA design: HDFS-1623
  − First released in Hadoop 2.0.0-alpha
• Auto failover design: HDFS-3042 / HDFS-2185
  − First released in Hadoop 2.0.1-alpha
• Community effort



                       25

Mais conteúdo relacionado

Mais procurados

Policy-driven, Platform-aware Nova Scheduler
Policy-driven, Platform-aware Nova SchedulerPolicy-driven, Platform-aware Nova Scheduler
Policy-driven, Platform-aware Nova SchedulerRam (Ramki) Krishnan
 
Os rtos.ppt
Os rtos.pptOs rtos.ppt
Os rtos.pptrahul km
 
Real Time Kernels
Real Time KernelsReal Time Kernels
Real Time KernelsArnav Soni
 
vSAN Beyond The Basics
vSAN Beyond The BasicsvSAN Beyond The Basics
vSAN Beyond The BasicsSumit Lahiri
 
Protecting Linux Workloads with PlateSpin Disaster Recovery
Protecting Linux Workloads with PlateSpin Disaster RecoveryProtecting Linux Workloads with PlateSpin Disaster Recovery
Protecting Linux Workloads with PlateSpin Disaster RecoveryNovell
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...peknap
 
VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...
VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...
VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...VMworld
 
RTOS implementation
RTOS implementationRTOS implementation
RTOS implementationRajan Kumar
 
Gnu linux for safety related systems
Gnu linux for safety related systemsGnu linux for safety related systems
Gnu linux for safety related systemsDTQ4
 
Galvin-operating System(Ch1)
Galvin-operating System(Ch1)Galvin-operating System(Ch1)
Galvin-operating System(Ch1)dsuyal1
 
Blue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQL
Blue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQLBlue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQL
Blue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQLBlue Medora
 
STN Event 12.8.09 - Chris Vain Powerpoint Presentation
STN Event 12.8.09 - Chris Vain Powerpoint PresentationSTN Event 12.8.09 - Chris Vain Powerpoint Presentation
STN Event 12.8.09 - Chris Vain Powerpoint Presentationmcini
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Heechul Yun
 
Galvin-operating System(Ch5)
Galvin-operating System(Ch5)Galvin-operating System(Ch5)
Galvin-operating System(Ch5)dsuyal1
 
Real time Operating System
Real time Operating SystemReal time Operating System
Real time Operating SystemTech_MX
 
vSAN Performance and Resiliency at Scale
vSAN Performance and Resiliency at ScalevSAN Performance and Resiliency at Scale
vSAN Performance and Resiliency at ScaleSumit Lahiri
 

Mais procurados (20)

Policy-driven, Platform-aware Nova Scheduler
Policy-driven, Platform-aware Nova SchedulerPolicy-driven, Platform-aware Nova Scheduler
Policy-driven, Platform-aware Nova Scheduler
 
Os rtos.ppt
Os rtos.pptOs rtos.ppt
Os rtos.ppt
 
Rtos
RtosRtos
Rtos
 
Rt linux-lab1
Rt linux-lab1Rt linux-lab1
Rt linux-lab1
 
Real Time Kernels
Real Time KernelsReal Time Kernels
Real Time Kernels
 
vSAN Beyond The Basics
vSAN Beyond The BasicsvSAN Beyond The Basics
vSAN Beyond The Basics
 
Protecting Linux Workloads with PlateSpin Disaster Recovery
Protecting Linux Workloads with PlateSpin Disaster RecoveryProtecting Linux Workloads with PlateSpin Disaster Recovery
Protecting Linux Workloads with PlateSpin Disaster Recovery
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
 
VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...
VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...
VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster base...
 
RTOS implementation
RTOS implementationRTOS implementation
RTOS implementation
 
Gnu linux for safety related systems
Gnu linux for safety related systemsGnu linux for safety related systems
Gnu linux for safety related systems
 
Galvin-operating System(Ch1)
Galvin-operating System(Ch1)Galvin-operating System(Ch1)
Galvin-operating System(Ch1)
 
RT linux
RT linuxRT linux
RT linux
 
Blue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQL
Blue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQLBlue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQL
Blue Medora Oracle Enterprise Manager (EM12c) Plug-in for PostgreSQL
 
STN Event 12.8.09 - Chris Vain Powerpoint Presentation
STN Event 12.8.09 - Chris Vain Powerpoint PresentationSTN Event 12.8.09 - Chris Vain Powerpoint Presentation
STN Event 12.8.09 - Chris Vain Powerpoint Presentation
 
Real-Time Operating Systems
Real-Time Operating SystemsReal-Time Operating Systems
Real-Time Operating Systems
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...
 
Galvin-operating System(Ch5)
Galvin-operating System(Ch5)Galvin-operating System(Ch5)
Galvin-operating System(Ch5)
 
Real time Operating System
Real time Operating SystemReal time Operating System
Real time Operating System
 
vSAN Performance and Resiliency at Scale
vSAN Performance and Resiliency at ScalevSAN Performance and Resiliency at Scale
vSAN Performance and Resiliency at Scale
 

Destaque

Impact of Soft Errors in Silicon on Reliability and Availability of Servers
Impact of Soft Errors in Silicon on Reliability and Availability of ServersImpact of Soft Errors in Silicon on Reliability and Availability of Servers
Impact of Soft Errors in Silicon on Reliability and Availability of ServersIshwar Parulkar
 
High Availability Infrastructure for Cloud Computing
High Availability Infrastructure for Cloud ComputingHigh Availability Infrastructure for Cloud Computing
High Availability Infrastructure for Cloud ComputingBob Rhubart
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 
HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High AvailabilityHortonworks
 
Improving substation reliability & availability
Improving substation reliability & availability Improving substation reliability & availability
Improving substation reliability & availability Vincent Wedelich, PE MBA
 
Cloud Computing - Benefits and Challenges
Cloud Computing - Benefits and ChallengesCloud Computing - Benefits and Challenges
Cloud Computing - Benefits and ChallengesThoughtWorks Studios
 
BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64
BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64 BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64
BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64 Linaro
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Slides cloud computing
Slides cloud computingSlides cloud computing
Slides cloud computingHaslina
 
Cloud computing simple ppt
Cloud computing simple pptCloud computing simple ppt
Cloud computing simple pptAgarwaljay
 

Destaque (12)

Impact of Soft Errors in Silicon on Reliability and Availability of Servers
Impact of Soft Errors in Silicon on Reliability and Availability of ServersImpact of Soft Errors in Silicon on Reliability and Availability of Servers
Impact of Soft Errors in Silicon on Reliability and Availability of Servers
 
High Availability Infrastructure for Cloud Computing
High Availability Infrastructure for Cloud ComputingHigh Availability Infrastructure for Cloud Computing
High Availability Infrastructure for Cloud Computing
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High Availability
 
Improving substation reliability & availability
Improving substation reliability & availability Improving substation reliability & availability
Improving substation reliability & availability
 
System dependability
System dependabilitySystem dependability
System dependability
 
Availability and reliability
Availability and reliabilityAvailability and reliability
Availability and reliability
 
Cloud Computing - Benefits and Challenges
Cloud Computing - Benefits and ChallengesCloud Computing - Benefits and Challenges
Cloud Computing - Benefits and Challenges
 
BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64
BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64 BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64
BUD17-209: Reliability, Availability, and Serviceability (RAS) on ARM64
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Slides cloud computing
Slides cloud computingSlides cloud computing
Slides cloud computing
 
Cloud computing simple ppt
Cloud computing simple pptCloud computing simple ppt
Cloud computing simple ppt
 

Semelhante a Hadoop Summit 2012 | HDFS High Availability

HDFS NameNode High Availability
HDFS NameNode High AvailabilityHDFS NameNode High Availability
HDFS NameNode High AvailabilityDataWorks Summit
 
HDFS - What's New and Future
HDFS - What's New and FutureHDFS - What's New and Future
HDFS - What's New and FutureDataWorks Summit
 
Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...
Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...
Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...Cloudera, Inc.
 
Strata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureStrata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureCloudera, Inc.
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesJason TC HOU (侯宗成)
 
Apache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling OutApache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling OutSander Temme
 
[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage Solutions[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage SolutionsPerforce
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systexJames Chen
 
Less14 br concepts
Less14 br conceptsLess14 br concepts
Less14 br conceptsAmit Bhalla
 
Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009) Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009) PostgreSQL Experts, Inc.
 
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...SQLExpert.pl
 
Presentation st9900 virtualization - emea - primary disk
Presentation   st9900 virtualization - emea - primary diskPresentation   st9900 virtualization - emea - primary disk
Presentation st9900 virtualization - emea - primary diskxKinAnx
 
Embracing Database Diversity: The New Oracle / MySQL DBA - UKOUG
Embracing Database Diversity: The New Oracle / MySQL DBA -   UKOUGEmbracing Database Diversity: The New Oracle / MySQL DBA -   UKOUG
Embracing Database Diversity: The New Oracle / MySQL DBA - UKOUGKeith Hollman
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Chris Nauroth
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomMicrosoft Singapore
 

Semelhante a Hadoop Summit 2012 | HDFS High Availability (20)

HDFS NameNode High Availability
HDFS NameNode High AvailabilityHDFS NameNode High Availability
HDFS NameNode High Availability
 
HDFS - What's New and Future
HDFS - What's New and FutureHDFS - What's New and Future
HDFS - What's New and Future
 
Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...
Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...
Hadoop World 2011: HDFS Name Node High Availablity - Aaron Myers, Cloudera & ...
 
Strata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureStrata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and Future
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network Issues
 
Apache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling OutApache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling Out
 
[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage Solutions[NetApp] Simplified HA:DR Using Storage Solutions
[NetApp] Simplified HA:DR Using Storage Solutions
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 
Less14 br concepts
Less14 br conceptsLess14 br concepts
Less14 br concepts
 
Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009) Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009)
 
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
 
Presentation st9900 virtualization - emea - primary disk
Presentation   st9900 virtualization - emea - primary diskPresentation   st9900 virtualization - emea - primary disk
Presentation st9900 virtualization - emea - primary disk
 
Embracing Database Diversity: The New Oracle / MySQL DBA - UKOUG
Embracing Database Diversity: The New Oracle / MySQL DBA -   UKOUGEmbracing Database Diversity: The New Oracle / MySQL DBA -   UKOUG
Embracing Database Diversity: The New Oracle / MySQL DBA - UKOUG
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
Google Compute and MapR
Google Compute and MapRGoogle Compute and MapR
Google Compute and MapR
 
les12.pdf
les12.pdfles12.pdf
les12.pdf
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicom
 

Hadoop Summit 2012 | HDFS High Availability

  • 1. HDFS High Availability. Suresh Srinivas (Hortonworks), Aaron T. Myers (Cloudera)
  • 2. Overview • Part 1 – Suresh Srinivas (Hortonworks) − HDFS Availability and Reliability – what is the record? − HA Use Cases − HA Design • Part 2 – Aaron T. Myers (Cloudera) − NN HA Design Details ▪ Automatic failure detection and NN failover ▪ Client-NN connection failover − Operations and Admin of HA − Future Work
  • 3. Availability, Reliability and Maintainability Reliability = MTBF/(1 + MTBF) • Probability that a system performs its functions without failure for a desired period of time Maintainability = 1/(1 + MTTR) • Probability that a failed system can be restored within a given timeframe Availability = MTTF/MTBF • Probability that a system is up when requested for use • Depends on both Reliability and Maintainability Mean Time To Failure (MTTF): Average time the system runs before it fails Mean Time To Repair/Restore (MTTR): Average time to repair a failed system Mean Time Between Failures (MTBF): Average time between successive failures = MTTR + MTTF
  • 4. Current HDFS Availability & Data Integrity • Simple design for Higher Reliability − Storage: Rely on the native file system of the OS rather than use raw disk − Single NameNode master ▪ Entire file system state is in memory − DataNodes simply store and deliver blocks ▪ All sophisticated recovery mechanisms are in the NN • Fault Tolerance − Design assumes disks, nodes and racks fail − Multiple replicas of blocks ▪ Active monitoring and replication ▪ DNs actively monitor for block deletion and corruption − Restart/migrate the NameNode on failure ▪ Persistent state: multiple copies + checkpoints ▪ Functions as Cold Standby − Restart/replace the DNs on failure − DNs tolerate individual disk failures
  • 5. How Well Did HDFS Work? • Data Reliability − Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009 − Seven 9s of reliability − Related bugs fixed in Hadoop 0.20 and 0.21 • NameNode Availability − 18-month study: 22 failures on 25 clusters – 0.58 failures per year per cluster − Only 8 would have benefitted from HA failover (0.23 failures per cluster-year) − NN is very reliable ▪ Resilient against overload caused by misbehaving apps • Maintainability − Large clusters see failure of one DataNode per day and more frequent disk failures − Maintenance once in 3 months to repair or replace DataNodes
  • 6. Why NameNode HA? • NameNode is highly reliable (high MTTF) − But Availability is not the same as Reliability • NameNode MTTR depends on − Restarting the NameNode daemon on failure ▪ Operator restart – (failure detection + manual restore) time ▪ Automatic restart – 1-2 minutes − NameNode startup time ▪ Small/medium cluster – 1-2 minutes ▪ Very large cluster – 5-15 minutes • Affects applications that have real-time requirements • For higher HDFS Availability − Need a redundant NameNode to eliminate the SPOF − Need automatic failover to reduce MTTR and improve Maintainability − Need a Hot standby to reduce MTTR for very large clusters ▪ Cold standby is sufficient for small clusters
  • 7. NameNode HA – Initial Goals • Support for Active and a single Standby − Active and Standby with manual failover ▪ Standby could be cold/warm/hot ▪ Addresses downtime during upgrades – main cause of unavailability − Active and Standby with automatic failover ▪ Hot standby ▪ Addresses downtime during upgrades and other failures • Backward compatible configuration • Standby performs checkpointing − Secondary NameNode not needed • Management and monitoring tools • Design philosophy – choose data integrity over service availability
  • 8. High Level Use Cases • Planned downtime − Upgrades − Config changes − Main reason for downtime • Unplanned downtime − Hardware failure − Server unresponsive − Software failures − Occurs infrequently Supported failures: • Single hardware failure − Double hardware failure not supported • Some software failures − Same software failure affects both active and standby
  • 9. High Level Design • Service monitoring and leader election outside NN − Similar to industry standard HA frameworks • Parallel Block reports to both Active and Standby NN • Shared or non-shared NN file system state • Fencing of shared resources/data − DataNodes − Shared NN state (if any) • Client failover − Client side failover (based on configuration or ZooKeeper) − IP Failover
  • 10. Design Considerations • Sharing state between Active and Hot Standby − File system state and Block locations • Automatic Failover − Monitoring Active NN and performing failover on failure • Making a NameNode active during startup − Reliable mechanism for choosing only one NN as active and the other as standby • Prevent data corruption on split brain − Shared Resource Fencing ▪ DataNodes and shared storage for NN metadata − NameNode Fencing ▪ When shared resource cannot be fenced • Client failover − Clients connect to the new Active NN during failover
  • 11. Failover Control Outside NN • Similar to Industry Standard HA frameworks • HA daemon outside NameNode − Simpler to build − Immune to NN failures • Daemon manages resources − Resources – OS, HW, Network etc. − NameNode is just another resource • Performs − Active NN election during startup − Automatic Failover − Fencing ▪ Shared resources ▪ NameNode [Diagram: a Failover Controller, coordinated through ZooKeeper, applies actions (start, stop, failover, monitor, …) to resources and shared resources]
  • 12. Architecture [Diagram: three ZK nodes provide leader election for two Failover Controllers, one Active and one Standby; each controller monitors the health of its NN and sends it commands; the Active NN and the Standby NN share editlogs (with fencing); the DNs send block reports to both NNs]
  • 13. First Phase – Hot Standby [Diagram: the Active NN and the Standby NN share editlogs on shared NFS storage, which itself needs to be HA; failover is manual; the DNs send block reports to both NNs; DN fencing]
  • 15. Client Failover Design Details • Smart clients (client side failover) − Users use one logical URI, client selects correct NN to connect to − Clients know which operations are idempotent, therefore safe to retry on a failover − Clients have configurable failover/retry strategies • Current implementation − Client configured with the addresses of all NNs • Other implementations in the future (more later)
  • 16. Client Failover Configuration Example
      ...
      <property>
        <name>dfs.namenode.rpc-address.name-service1.nn1</name>
        <value>host1.example.com:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.name-service1.nn2</name>
        <value>host2.example.com:8020</value>
      </property>
      <property>
        <name>dfs.namenode.http-address.name-service1.nn1</name>
        <value>host1.example.com:50070</value>
      </property>
      ...
  • 17. Automatic Failover Design Details • Automatic failover requires ZooKeeper − Not required for manual failover − ZK makes it easy to: ▪ Detect failure of the active NN ▪ Determine which NN should become the Active NN • On both NN machines, run another daemon − ZKFailoverController (ZooKeeper Failover Controller) • Each ZKFC is responsible for: − Health monitoring of its associated NameNode − ZK session management / ZK-based leader election • See HDFS-2185 and HADOOP-8206 for more details
  • 18. Automatic Failover Design Details (cont)
  • 19. Ops/Admin: Shared Storage • To share NN state, need shared storage − Needs to be HA itself to avoid just shifting the SPOF − Many shared storage products come with IP fencing options − Recommended mount options: ▪ tcp,soft,intr,timeo=60,retrans=10 • Still configure local edits dirs, but shared dir is special • Work is currently underway to do away with the shared storage requirement (more later)
  • 20. Ops/Admin: NN fencing • Critical for correctness that only one NN is active at a time • Out of the box − RPC to active NN to tell it to go to standby (graceful failover) − SSH to active NN and 'kill -9' the NN • Pluggable options − Many filers have protocols for IP-based fencing options − Many PDUs have protocols for IP-based plug-pulling (STONITH) ▪ Nuke the node from orbit. It’s the only way to be sure. • Configure extra options if available to you − Will be tried in order during a failover event − Escalate the aggressiveness of the method − Fencing is critical for correctness of NN metadata
  • 21. Ops/Admin: Automatic Failover • Deploy ZK as usual (3 or 5 nodes) or reuse existing ZK − ZK daemons have light resource requirements − OK to collocate 1 on each NN, many collocate the 3rd on the YARN RM − Advisable to configure ZK daemons with dedicated disks for isolation − Fine to use the same ZK quorum as for HBase, etc. • Fencing methods still required − The ZKFC that wins the election is responsible for performing fencing − Fencing script(s) must be configured and work from the NNs • Admin commands which manually initiate failovers still work − But rather than coordinating the failover themselves, they use the ZKFCs
  • 22. Ops/Admin: Monitoring • New NN metrics − Size of pending DN message queues − Seconds since the standby NN last read from shared edit log − DN block report lag − All are measurements of standby NN lag – monitor/alert on all of these • Monitor shared storage solution − Volumes fill up, disks go bad, etc. − Should configure paranoid edit log retention policy (default is 2) • Canary-based monitoring of HDFS is a good idea − Pinging both NNs is not sufficient
  • 23. Ops/Admin: Hardware • Active/Standby NNs should be on separate racks • Shared storage system should be on separate rack • Active/Standby NNs should have close to the same hardware − Same amount of RAM – need to store the same things − Same # of processors – need to serve same number of clients • All the same recommendations still apply for NN − ECC memory, 48GB − Several separate disks for NN metadata directories − Redundant disks for OS drives, probably RAID 5 or mirroring − Redundant power
  • 24. Future Work • Other options to share NN metadata − Journal daemons with list of active JDs stored in ZK (HDFS-3092) − Journal daemons with quorum writes (HDFS-3077) • More advanced client failover/load shedding − Serve stale reads from the standby NN − Speculative RPC − Non-RPC clients (IP failover, DNS failover, proxy, etc.) − Less client-side configuration (ZK, custom DNS records, HDFS-3043) • Even Higher HA − Multiple standby NNs
  • 25. QA • HA design: HDFS-1623 − First released in Hadoop 2.0.0-alpha • Auto failover design: HDFS-3042 / HDFS-2185 − First released in Hadoop 2.0.1-alpha • Community effort
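
The availability formula on slide 3 (Availability = MTTF/MTBF, with MTBF = MTTR + MTTF) can be made concrete with the numbers quoted on slides 5 and 6. The following is a minimal sketch, not part of the original deck: the 0.58 failures per cluster-year and the 15-minute restart of a very large cluster come from the slides, while the 1-minute hot-standby failover and the 120-minute operator restore are illustrative assumptions, as are the function names.

```python
HOURS_PER_YEAR = 365 * 24


def availability(failures_per_year: float, mttr_minutes: float) -> float:
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR (all in hours)."""
    mtbf_hours = HOURS_PER_YEAR / failures_per_year  # mean time between failures
    mttr_hours = mttr_minutes / 60.0                 # mean time to repair
    mttf_hours = mtbf_hours - mttr_hours             # mean time to failure
    return mttf_hours / mtbf_hours


def downtime_minutes_per_year(failures_per_year: float, mttr_minutes: float) -> float:
    """Expected unplanned downtime per year: failure count times repair time."""
    return failures_per_year * mttr_minutes


if __name__ == "__main__":
    failure_rate = 0.58  # failures per cluster-year (slide 5)
    scenarios = [
        ("hot standby failover (assumed 1 min)", 1.0),
        ("NN restart on a very large cluster (15 min, slide 6)", 15.0),
        ("operator detects and restores (assumed 120 min)", 120.0),
    ]
    for label, mttr in scenarios:
        a = availability(failure_rate, mttr)
        d = downtime_minutes_per_year(failure_rate, mttr)
        print(f"{label}: availability={a:.7f}, downtime={d:.1f} min/year")
```

Because the failure rate is fixed in all three scenarios, only MTTR changes the result, which is the point of slide 6: the NameNode is already reliable, so HA improves availability mainly by shortening repair time. The sketch counts unplanned failures only; planned downtime such as upgrades, which slide 7 names as the main cause of unavailability, is not modeled.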

Editor's Notes

  1. Data: can I read what I wrote? Is the service available? When I asked one of the original authors of GFS whether there were any decisions they would revisit, the answer was random writers. Simplicity is key. Raw disk: file systems take time to stabilize, so we can take advantage of ext4, xfs or zfs.