Considerations when implementing HA in DMF

DMF UG meeting
Gerald Hofer

Presented by Susheel Ghokhale
What is High Availability?

According to Wikipedia:

"High availability is a system design approach and associated
service implementation that ensures a prearranged level of
operational performance will be met during a contractual
measurement period."


The keywords are "system" and "measurement".
What is High Availability?

MTTF: Mean Time To Failure
MTTR: Mean Time To Repair

                      MTTF
    Availability = -----------
                   MTTF + MTTR

• So:
  o   Increase MTTF (better hardware)
  o   Decrease MTTR (redundant hardware + software)
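As a worked example of the formula (numbers purely illustrative): a server with an MTTF of 10,000 hours that takes 4 hours to repair after each failure:

```shell
# Availability = MTTF / (MTTF + MTTR), with illustrative numbers
mttf=10000   # hours between failures
mttr=4       # hours to repair
availability=$(awk -v f="$mttf" -v r="$mttr" 'BEGIN { printf "%.5f", f / (f + r) }')
echo "Availability: $availability"   # 0.99960, i.e. roughly 3.5 hours of downtime per year
```

Halving the MTTR has the same effect on availability as doubling the MTTF, which is why redundancy (fast repair) competes so well with better hardware.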
System Design Considerations

[Diagram: clients on the Ethernet client network connect to a single DMF server, which attaches via dual F/C to a RAID array]

Reasonably Highly Available, Most of the Time
Redundant Hardware

For the components with the lowest MTBF:

• Disks
   o mirrored or RAID 5/6
   o external RAID
       is usually designed with no single point of failure:
       redundant HBAs, RAID controllers and cables
• Tape drives
• Tape libraries
• Power supplies
Single DMF system
What remains that can affect availability?
• Hardware
   o CPU
   o memory
   o backplane
• Software
   o kernel
   o applications
   o need for updates
• External factors
   o Environment
   o Power
• The human factor / Murphy
   o administrators
   o service personnel
System Design Considerations

[Diagram: clients on the Ethernet client network connect to two nodes, Node 1 and Node 2, joined by a private heartbeat network; both nodes attach to the shared RAID and together act as the "DMF Server"]

Node 2 takes over when Node 1 fails
What is High Availability?

The paradox is that adding more components and making a
system more complex can decrease the system availability.

A simple single physical system with redundant hardware can
potentially achieve the highest availability.

But this ignores the fact that a single physical system
needs to be brought down for patching, upgrades and testing.
Considerations

How long does it take to fix a problem?
• how long to identify the problem
• how long to get a spare part
• how long does it take to reboot

Frequency of planned outages
 • updates

How much is dependent on the DMF system?
• An archive/backup system - low impact
• A DMF/NFS server for several thousand cluster cores with
  time-sensitive applications - very high impact
History

SGI has a long tradition of supporting High Availability software
• Failsafe
   o IRIX
• SGI InfiniteStorage Cluster Manager
   o SLES9 and SLES10
• SGI Heartbeat
   o SLES 10
• SLE/HAE - Novell High Availability Environment for SLES
   o SLES 11
SGI InfiniteStorage Cluster Manager
 •    based on Red Hat Cluster Manager
 •    SLES9 and SLES10
 •    inflexible, needs a shutdown for configuration changes
 •    still running at QUT Creative Industries as a DMF/Samba
      server
       o lots of problems in the beginning
       o the system has since matured
       o has been "planned to retire soon" for about a year
metal:~ # clustat
Cluster Status - QUT_CI                                  01:48:05
Cluster Quorum Incarnation #1
Shared State: Shared Raw Device Driver v1.2

 Member                Status
 ------------------ ----------
 indie             Active
 metal              Active <-- You are here

 Service          Status Owner (Last) Last Transition Chk Tmout Restarts
 -------------- -------- ---------------- --------------- --- ----- --------
 QUT_CI_Fileser started metal                        20:19:03 Dec 18 0 400   0

metal:~ # uptime
 1:39am up 65 days 5:31, 1 user, load average: 1.10, 1.12, 1.09
SGI Heartbeat
• SGI build of the Linux-HA Heartbeat package that is a
  product of the community High Availability Linux Project
• SGI specific modules and changes
   o cxfs, xvm, tmf, openvault, DMF, L1 and L2 controller
• SLES 10
• Heartbeat v2
• in active use at several sites in Australia
   o QUT HPC, UQ HPC, JCU, DERM
• flexible but hard to configure and administer (XML, cryptic
  command line)
<resources>
   <group id="dmfGroup">
      <primitive class="ocf" provider="sgi" type="lxvm" id="local_xvm">
       <instance_attributes id="1123e13b-69b0-4190-9638-4053acf2c23d">
        <attributes>
          <nvpair name="volnames" value="DMFback,DMFhome,DMFjournals,DMFmove,DMFspool,DMFtmp,DMFcache,archive,home,orac,pkg,scratch,static,work" id="55931f76-c083-4a22-998b-c949746e4dcb"/>
          <nvpair name="physvols" value="DMFback,DMFback_45,DMFcache_19,DMFcache_20,DMFcache_21,DMFcache_4,DMFcache_5,DMFcache_6,DMFcache_7,DMFcache_8,DMFhome,DMFjournals,DMFmove,DMFmove_46,DMFspool,DMFtmp,archive_34,home_0,home_1,home_2,home_3,orac_32,pkg_26,pkg_27,pkg_28,pkg_29,scratch_39,scratch_40,scratch_41,scratch_42,scratch_43,scratch_44,static_33,work_22,work_23,work_24,work_25,work_35A,work_36A,work_37A,work_38A,www_30" id="842b3da6-1ead-40b7-8e8b-e94a283702a3"/>
[...]
SLE/HAE - Novell High Availability
Environment for SLES
• Novell product with support
• Based on open source components
   o Pacemaker, OpenAIS, Corosync
• SLES11SP1 and up
• SGI is adding extensions
   o cxfs, xvm, tmf, openvault, DMF, L1 and L2 controller
• Very flexible
• Much easier configuration and administration
   o powerful CLI
   o working GUI
• First installations in the region (without DMF)
   o Korea, UQ EBI mirror
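To illustrate the difference from the Heartbeat XML above, the same kind of resource can be defined in a few crm shell lines. This is a sketch only (resource and group names are hypothetical, and the commands require a running Pacemaker cluster), not a tested configuration:

```shell
# Define a DMF primitive with a monitor operation and group it with
# its volumes, using the crm shell instead of hand-edited XML.
crm configure primitive dmf ocf:sgi:dmf \
    op monitor interval="60s" timeout="120s"
crm configure group dmfGroup local_xvm dmf
crm configure show    # dumps the whole configuration as readable text, not XML
```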
Architecture

• Resource Layer
  o Resource Agents (RA)
• Resource Allocation Layer
  o Cluster Resource Manager (CRM)
  o Cluster Information Base (CIB)
  o Policy Engine (PE)
  o Local Resource Manager (LRM)
• Messaging and Infrastructure Layer
  o Corosync, OpenAIS
Resource Layer

• Instead of resources being started during boot, they are
  started by HA
• The resource agents (RA) are usually scripts
• The most common script standard is OCF
   o a set of actions with defined exit codes
   o start, stop, monitor
• LSB scripts
   o normal init.d scripts
   o not all init.d scripts use correct exit codes
Resources example
 Resource Group: haGroup
   local_xvm (ocf::sgi:lxvm):         Started nimbus
   _dmf_home (ocf::heartbeat:Filesystem): Started nimbus
   _dmf_journals        (ocf::heartbeat:Filesystem): Started nimbus
   _dmf_spool (ocf::heartbeat:Filesystem): Started nimbus
   _HPC_home (ocf::heartbeat:Filesystem): Started nimbus
   tmf (ocf::sgi:tmf): Started nimbus
   dmf (ocf::sgi:dmf): Started nimbus
   nfs (lsb:nfsserver):       Started nimbus
   ip_public (ocf::sgi:sgi-nfsserver):       Started nimbus
   ip_nas0 (ocf::sgi:sgi-nfsserver):         Started nimbus
   ip_nas1 (ocf::sgi:sgi-nfsserver):         Started nimbus
   vsftpd     (lsb:vsftpd): Started nimbus
   Mediaflux (ocf::sgi:Mediaflux_ha):          Started nimbus
   ip_ib1     (ocf::sgi:sgi-ib-nfsserver): Started nimbus
   ip_ib2     (ocf::sgi:sgi-ib-nfsserver): Started nimbus
   ip_ib0     (ocf::sgi:sgi-ib-nfsserver): Started nimbus
 Clone Set: stonith-l2network-set
   stonith-l2network:0 (stonith:l2network): Started stratus
   stonith-l2network:1 (stonith:l2network): Started nimbus
Resource Allocation Layer
Most complex layer
• Local Resource Manager (LRM)
   o start/stop/monitor the different supported scripts
• Cluster Information Base (CIB)
   o in memory XML of configuration and status
• Designated Coordinator (DC)
   o one of the nodes in the cluster is the boss
• Policy Engine (PE)
   o If something changes in the cluster, a new state is
     calculated based on the rules and state in the CIB
   o Only the DC can make the changes
   o The PE is running on all nodes to speed up failover
• Cluster Resource Manager (CRM)
   o binds all the components together and provides the
     communication path
   o serializes the access to the CIB
Messaging and Infrastructure Layer

• "I am alive" signals
• communication channel used to send updates to the other nodes
2 node cluster
• Most DMF HA installations are two-node clusters
• DMF can only run once -> active/passive
• storage and tapes are accessible from both nodes
• only one node is allowed to write to the file systems and to
  tapes
• Who is the boss?
• If the HA system is not sure a resource has been stopped
  on a node, there is no safe way to use the resource on a
  different node

STONITH - shoot the other node in the head

• A reliable way to kill the other node
STONITH implementation

• STONITH devices are implemented as resources
• The stonithd daemon hides the complexity

Different physical implementations
• Remote power board
• L1/L2
• BMC

The implementation needs to make sure that when it reports
successful completion, the targeted system has no way of
running any resources any more.
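A BMC/IPMI STONITH device, for example, is configured like any other resource. All values here are hypothetical and this is a configuration sketch, not a tested setup:

```shell
# STONITH resource for node1 via its BMC, using the external/ipmi plugin.
crm configure primitive stonith-node1 stonith:external/ipmi \
    params hostname="node1" ipaddr="192.168.1.101" \
           userid="admin" passwd="secret" interface="lan"
# A node must never run its own STONITH device, or it could
# never be shot while malfunctioning:
crm configure location stonith-node1-placement stonith-node1 -inf: node1
```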
Real world implementation
• Set up single DMF server
• Test it - and fix all problems
• Then convert the system to HA

Most problems with HA during the installation phase are related
to the fact that the base system was not working correctly in the
first place.

• System does not shut down cleanly
  o maybe only under load
• System needs user intervention after a boot
  o Switch is not using 'portfast' and the interface is not
    configured properly after boot
Problems - monitoring timeouts
• Resources and the other node are periodically monitored
• When a resource or the other node faults, HA usually
  initiates a failover, and sometimes STONITHs the other
  node
• In some cases high utilization can cause monitoring to fail
  without a real problem
• This causes unplanned and unnecessary failover events
• Usually not a big problem for NFS (service interruption of a
  few minutes)
• Potentially more of an issue for other services (FTP, SMB,
  backup)

After a new installation every failover incident needs to be
analysed, and appropriate changes and improvements
implemented. It can take a while to bed down an HA system.
Experience helps.
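One common mitigation is to lengthen the monitor interval and timeout on the affected resource so that load spikes are ridden out. Resource name and values here are illustrative, and the commands assume a running Pacemaker cluster:

```shell
crm configure show dmf   # inspect the resource's current op definitions
crm configure edit dmf   # e.g. raise: op monitor interval="120s" timeout="180s"
```

The trade-off is slower detection of real faults, so the values need tuning per site.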
Problems - crash dumps

If a node crashes, the best way to diagnose the problem is
to capture a crash dump.

In an HA environment the surviving node will detect the crash
and issue a STONITH. Because the node was in KDB or in the
middle of writing a crash dump, this reset destroys vital
debugging information.

Manual intervention might be required to capture these dumps.
Problems - syslog

Most of the information about the state of HA is logged to
syslog. If a node is reset, you usually lose the last few
syslog entries, as that node had no time to write the
information out to disk. This makes it hard to debug why a node
failed.

The solution is to also write the syslog to the other node over
the network.
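With rsyslog, for example, one forwarding rule per node sends a copy of every message to the peer; the peer hostname below is hypothetical and this is a sketch only:

```shell
# On node1: forward a copy of every syslog message to node2 over
# the private network (UDP, port 514). Mirror this on node2.
echo '*.*   @node2-private:514' >> /etc/rsyslog.conf
service syslog restart
```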
Problems - configuration changes
Changing the configuration on the running system is possible
and can be done safely.
 • unmanage resources
 • make changes and verify normal operation
 • manage resources
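With the crm shell the sequence above looks roughly like this (the group name is hypothetical; sketch only):

```shell
crm resource unmanage dmfGroup   # HA stops controlling/restarting the resources
# ... apply changes, restart services by hand, verify normal operation ...
crm resource cleanup dmfGroup    # clear recorded faults before handing back
crm resource manage dmfGroup     # HA takes control again
```

The cleanup step matters: leftover fault records are exactly the trap described below that triggers a failover the moment the resources are managed again.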

But there are some traps:
• Administrators forget they are working on an HA system and
  restart a service without unmanaging the resources
    o if monitoring is active, that can trigger a failover
• Administrators add a new file system and forget to add the
  mount point on the second system
    o when the service tries to start there, it fails, and it
      fails over to the other system again
• Faults were not cleaned up correctly after the maintenance
    o when the system is managed again, it fails over
Problems - STONITH matches

It can happen that the system gets into a state where the nodes
constantly shoot each other: every time both nodes boot and
form a cluster, one of the nodes gets shot.

This usually only happens because of configuration problems.

There are easy ways to break the loop and fix the
configuration. Experience helps.
Why do you want to stay away from
HA?
One of the system administration rules is (or should be):

"Keep it simple"

Because HA adds more complexity, there are more ways
Murphy can strike. So overall your availability can be negatively
affected.
Why not use just a spare system (cold
spare) and avoid HA
A cold spare system can reduce the time to get spare parts.

But there is still a considerable downtime to activate and test
the cold spare system.

The biggest problem is that usually the procedures to use a
cold spare are infrequently tested (they need a downtime) and
changes and updates to the running system may invalidate
these procedures.

Because a cold spare already ties up most of the hardware, it
often makes more sense to invest in HA and ensure that the
spare is always usable and configured correctly.

In some cases you are better off with a good support contract.
Reasons to consider HA
Most of the time the DMF system is a central component, and
any outage affects the availability of other systems as well.

Because in HA the failure of one system is a normal operation,
these fault scenarios are constantly tested. This assures that
when a fault occurs, everything works as expected. On single
systems the fault scenario is rarely tested and often
uncovers problems at the most inconvenient time.

In HA systems both nodes are already booted, and the
standby node is ready to take over volumes and start services. A
failover event is much faster than a reboot of a system.

Updates and upgrades can be done online with minimal
interruption.
References

Novell SLE/HA
http://www.novell.com/documentation/sle_ha/pdfdoc/book_sleha/book_sleha.pdf
http://www.clusterlabs.org/wiki/Documentation
http://ourobengr.com/high-availability-in-37-easy-steps.odp

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Último (20)

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Considerations when implementing HA in DMF

  • 1. Considerations when implementing HA in DMF DMF UG meeting Gerald Hofer Presented by Susheel Ghokhale
  • 2. What is High Availability? According to Wikipedia: "High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period." The keywords are "system" and "measurement".
  • 3. What is High Availability?
       MTTF: Mean Time To Failure
       MTTR: Mean Time To Repair
       Availability = MTTF / (MTTF + MTTR)
       So:
       o Increase MTTF (better hardware)
       o Decrease MTTR (redundant hardware + software)
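To make the formula concrete, here is a small worked example (the MTTF and MTTR figures are invented for illustration):

```shell
# Availability = MTTF / (MTTF + MTTR)
# Assumed figures: MTTF = 10,000 h (hardware quality), MTTR = 4 h (repair speed)
mttf=10000
mttr=4
awk -v f="$mttf" -v r="$mttr" 'BEGIN { printf "%.4f%%\n", 100 * f / (f + r) }'
# prints 99.9600% -- cutting MTTR to 2 h lifts this to 99.9800%
```

Note how the two levers on the slide map directly onto the formula: better hardware raises MTTF, redundancy shrinks MTTR.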
  • 4. System Design Considerations Client Network Ethernet DMF Server Dual F/C RAID Reasonably Highly Available, Most of the Time
  • 5. Redundant Hardware
       For the components with the lowest MTBF:
       • Disks
         o mirrored or RAID 5/6
         o external RAID
            is usually designed for no single point of failure
            HBAs
            redundant RAID controllers
            cables
       • Tape drives
       • Tape libraries
       • Power supplies
  • 6. Single DMF system
       What remains that can affect availability?
       • Hardware
         o CPU
         o memory
         o backplane
       • Software
         o kernel
         o applications
         o need for updates
       • External factors
         o environment
         o power
       • The human factor / Murphy
         o administrators
         o service personnel
  • 7. System Design Considerations Node 1 Ethernet Client Network Private Network Ethernet Node 2 RAID “DMF Server” Node 2 takes over when Node 1 fails
  • 8. What is High Availability? The paradox is that adding more components and making a system more complex can decrease the system's availability. A simple single physical system with redundant hardware can potentially achieve the highest availability. But this ignores the fact that a single physical system needs to be brought down for patching, upgrades, and testing.
  • 9. Considerations
       How long does it take to fix a problem?
       • how long to identify the problem
       • how long to get a spare part
       • how long it takes to reboot
       Frequency of planned outages
       • updates
       How much depends on the DMF system?
       • an archive/backup system - low impact
       • a DMF/NFS server for several thousand cluster cores with time-sensitive applications - very high impact
  • 10. History
        SGI has a long tradition of supporting high availability software:
        • FailSafe
          o IRIX
        • SGI InfiniteStorage Cluster Manager
          o SLES 9 and SLES 10
        • SGI Heartbeat
          o SLES 10
        • SLE/HAE - Novell High Availability Environment for SLES
          o SLES 11
  • 11. SGI InfiniteStorage Cluster Manager
        • based on Red Hat Cluster Manager
        • SLES 9 and SLES 10
        • inflexible, needs a shutdown for configuration changes
        • still running at QUT Creative Industries as a DMF/Samba server
          o lots of problems in the beginning
          o the system has now matured
          o planned to be retired soon (for about a year now?)

        metal:~ # clustat
        Cluster Status - QUT_CI                                      01:48:05
        Cluster Quorum Incarnation #1
        Shared State: Shared Raw Device Driver v1.2

        Member             Status
        ------------------ ----------
        indie              Active
        metal              Active                 <-- You are here

        Service        Status   Owner (Last)     Last Transition Chk Tmout Restarts
        -------------- -------- ---------------- --------------- --- ----- --------
        QUT_CI_Fileser started  metal            20:19:03 Dec 18   0   400        0

        metal:~ # uptime
          1:39am  up 65 days  5:31,  1 user,  load average: 1.10, 1.12, 1.09
  • 12. SGI Heartbeat
        • SGI build of the Linux-HA Heartbeat package, a product of the community High Availability Linux project
        • SGI-specific modules and changes
          o CXFS, XVM, TMF, OpenVault, DMF, L1 and L2 controllers
        • SLES 10
        • Heartbeat v2
        • in active use at several sites in Australia
          o QUT HPC, UQ HPC, JCU, DERM
        • flexible, but hard to configure and administer (XML, cryptic command line)

        <resources>
          <group id="dmfGroup">
            <primitive class="ocf" provider="sgi" type="lxvm" id="local_xvm">
              <instance_attributes id="1123e13b-69b0-4190-9638-4053acf2c23d">
                <attributes>
                  <nvpair name="volnames" value="DMFback,DMFhome,DMFjournals,DMFmove,DMFspool,DMFtmp,DMFcache,archive,home,orac,pkg,scratch,static,work" id="55931f76-c083-4a22-998b-c949746e4dcb"/>
                  <nvpair name="physvols" value="DMFback,DMFback_45,DMFcache_19,DMFcache_20,DMFcache_21,DMFcache_4,DMFcache_5,DMFcache_6,DMFcache_7,DMFcache_8,DMFhome,DMFjournals,DMFmove,DMFmove_46,DMFspool,DMFtmp,archive_34,home_0,home_1,home_2,home_3,orac_32,pkg_26,pkg_27,pkg_28,pkg_29,scratch_39,scratch_40,scratch_41,scratch_42,scratch_43,scratch_44,static_33,work_22,work_23,work_24,work_25,work_35A,work_36A,work_37A,work_38A,www_30" id="842b3da6-1ead-40b7-8e8b-e94a283702a3"/>
        [...]
  • 13. SLE/HAE - Novell High Availability Environment for SLES
        • Novell product with support
        • based on open source components
          o Pacemaker, OpenAIS, Corosync
        • SLES 11 SP1 and up
        • SGI is adding extensions
          o CXFS, XVM, TMF, OpenVault, DMF, L1 and L2 controllers
        • very flexible
        • much easier configuration and administration
          o powerful CLI
          o working GUI
        • first installations in the region (without DMF)
          o Korea, UQ EBI mirror
  • 15. Architecture
        • Resource Layer
          o Resource Agents (RA)
        • Resource Allocation Layer
          o Cluster Resource Manager (CRM)
          o Cluster Information Base (CIB)
          o Policy Engine (PE)
          o Local Resource Manager (LRM)
        • Messaging and Infrastructure Layer
          o Corosync, OpenAIS
  • 16. Resource Layer
        • Instead of being started during boot, resources are started by HA
        • Resource agents (RA) are usually scripts
        • The most common script standard is OCF
          o a set of actions with defined exit codes
          o start, stop, monitor
        • LSB scripts
          o normal init.d scripts
          o not all init.d scripts use correct exit codes
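As a rough sketch of the OCF contract described above, here is a toy agent written as shell functions (the "dummy" resource and its state file are invented; real agents implement more actions, such as meta-data and validate-all):

```shell
# Toy OCF-style resource agent: a set of actions with defined exit codes.
# 0 = OCF_SUCCESS, 7 = OCF_NOT_RUNNING (what monitor returns when stopped).
STATE="/tmp/dummy-ra.state"   # hypothetical state file standing in for a daemon

dummy_start()   { touch "$STATE"; }                            # bring resource up
dummy_stop()    { rm -f "$STATE"; }                            # stop must be idempotent
dummy_monitor() { if [ -f "$STATE" ]; then return 0; else return 7; fi }

dummy_start
dummy_monitor && echo "monitor: running (rc=0)"
dummy_stop
dummy_monitor || echo "monitor: not running (rc=$?)"
```

The HA stack calls these actions and interprets the exit codes; a script that lies about them (as some init.d scripts do) will confuse the cluster.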
  • 17. Resources example Resource Group: haGroup local_xvm (ocf::sgi:lxvm): Started nimbus _dmf_home (ocf::heartbeat:Filesystem): Started nimbus _dmf_journals (ocf::heartbeat:Filesystem): Started nimbus _dmf_spool (ocf::heartbeat:Filesystem): Started nimbus _HPC_home (ocf::heartbeat:Filesystem): Started nimbus tmf (ocf::sgi:tmf): Started nimbus dmf (ocf::sgi:dmf): Started nimbus nfs (lsb:nfsserver): Started nimbus ip_public (ocf::sgi:sgi-nfsserver): Started nimbus ip_nas0 (ocf::sgi:sgi-nfsserver): Started nimbus ip_nas1 (ocf::sgi:sgi-nfsserver): Started nimbus vsftpd (lsb:vsftpd): Started nimbus Mediaflux (ocf::sgi:Mediaflux_ha): Started nimbus ip_ib1 (ocf::sgi:sgi-ib-nfsserver): Started nimbus ip_ib2 (ocf::sgi:sgi-ib-nfsserver): Started nimbus ip_ib0 (ocf::sgi:sgi-ib-nfsserver): Started nimbus Clone Set: stonith-l2network-set stonith-l2network:0 (stonith:l2network): Started stratus stonith-l2network:1 (stonith:l2network): Started nimbus
  • 18. Resource Allocation Layer
        The most complex layer:
        • Local Resource Manager (LRM)
          o start/stop/monitor the different supported scripts
        • Cluster Information Base (CIB)
          o in-memory XML of configuration and status
        • Designated Coordinator (DC)
          o one of the nodes in the cluster is the boss
        • Policy Engine (PE)
          o if something changes in the cluster, a new state is calculated based on the rules and state in the CIB
          o only the DC can make the changes
          o the PE runs on all nodes to speed up failover
        • Cluster Resource Manager (CRM)
          o binds all the components together and provides the communication path
          o serializes access to the CIB
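On a running Pacemaker cluster, the CIB and the state the DC has computed can be inspected with the standard tools (these commands assume you are on a cluster node):

```shell
# Dump the live XML CIB -- useful as a backup before configuration changes
cibadmin --query > /tmp/cib-backup.xml

# One-shot text snapshot of node and resource status
crm_mon -1
```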
  • 19. Messaging and Infrastructure Layer • "I am alive" signals • communication to send updates to the other nodes
  • 20. Two-node cluster
        • Most DMF HA installations are two-node clusters
        • DMF can only run once -> active/passive
        • storage and tapes are accessible from both nodes
        • only one node is allowed to write to the file system and to tapes
        • Who is the boss?
        • If the HA system is not sure a resource has been stopped on a node,
          there is no safe way to use the resource on a different node
        STONITH - shoot the other node in the head
        • a reliable way to kill the other node
  • 21. STONITH implementation
        • STONITH devices are implemented as resources
        • A stonithd daemon hides the complexity
        Different physical implementations:
        • remote power board
        • L1/L2 controllers
        • BMC
        The implementation must make sure that when it reports successful
        completion, the targeted system has no way of running any resources
        any more.
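As an illustration, configuring a BMC/IPMI-based STONITH device with the crm shell might look like this (the plugin name, addresses, and credentials are assumptions; use whatever matches your hardware):

```shell
# Fencing device for node2, driven over its BMC via IPMI
crm configure primitive stonith-node2 stonith:external/ipmi \
    params hostname=node2 ipaddr=10.0.0.12 userid=admin passwd=secret \
    op monitor interval=60s

# Never let a node run its own fencing device
crm configure location loc-stonith-node2 stonith-node2 -inf: node2
```

The location constraint matters: a node that hosts the only resource able to fence itself cannot be fenced reliably.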
  • 22.
  • 23. Real world implementation
        • Set up a single DMF server
        • Test it - and fix all problems
        • Then convert the system to HA
        Most problems with HA during the installation phase come down to the
        base system not working correctly in the first place:
        • the system does not shut down cleanly
          o maybe only under load
        • the system needs user intervention after a boot
          o the switch is not using 'portfast' and the interface is not configured properly after boot
  • 24. Problems - monitoring timeouts
        • Resources and the other node are monitored periodically
        • On a fault of a resource or the other node, HA usually initiates a failover, and sometimes STONITHs the other node
        • In some cases high utilization can cause monitoring to fail without a real problem
        • This causes unplanned and unnecessary failover events
        • Usually not a big problem for NFS (a service interruption of a few minutes)
        • Potentially more of an issue for other services (FTP, SMB, backup)
        After a new installation, every failover incident needs to be analysed
        and appropriate changes and improvements implemented. It can take a
        while to bed down an HA system. Experience helps.
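One common mitigation is to make the monitor operation more forgiving. With the crm shell that might look like this (the resource name and the interval/timeout values are illustrative only; pick values from your own failover analysis):

```shell
# Monitor the NFS service every 2 minutes and allow 90 s before declaring a
# failure, so a transient load spike does not trigger an unnecessary failover
crm configure primitive nfs lsb:nfsserver \
    op monitor interval=120s timeout=90s
```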
  • 25. Problems - crash dumps If a node crashes, the best way to diagnose the problem is to capture a crash dump. In an HA environment the surviving node will detect the crash and issue a STONITH. Because the node was in KDB or in the middle of writing a crash dump, this reset destroys vital debugging information. Manual intervention might be required to capture these dumps.
  • 26. Problems - syslog Most of the information about the state of HA is logged to syslog. If a node is reset, you usually lose the last few syslog entries, as that node had no time to write the information out to disk. This makes it hard to debug why a node failed. The solution is to also write syslog to the other node over the network.
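With the syslog-ng shipped on SLES this can be a short configuration addition (the peer hostname "node2" and the source name "src" are assumptions based on the SLES default config):

```shell
# Append a network destination so every message is also sent to the peer node
cat >> /etc/syslog-ng/syslog-ng.conf <<'EOF'
destination d_peer { udp("node2" port(514)); };
log { source(src); destination(d_peer); };
EOF
# The peer needs a matching udp() source enabled; restart syslog-ng on both nodes.
```

After a STONITH reset, the last messages from the dead node can then be read on the survivor.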
  • 27. Problems - configuration changes
        Changing the configuration on the running system is possible and can be done safely:
        • unmanage the resources
        • make changes and verify normal operation
        • manage the resources
        But there are some traps:
        • administrators forget they are working on an HA system and restart a service without unmanaging the resources
          o if monitoring is active, that can mean a failover
        • administrators add a new file system and forget to add the mount point on the second system
          o when the service tries to start there it fails, and it fails over to the other system again
        • faults were not cleaned up correctly after the maintenance
          o when the system is managed again, it fails over
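The safe sequence above maps to crm shell commands roughly like this (the resource name "dmf" is an assumption):

```shell
crm resource unmanage dmf   # HA stops acting on the resource
# ... perform maintenance, restart services by hand, verify operation ...
crm resource cleanup dmf    # clear faults/fail counts recorded during the work
crm resource manage dmf     # hand control back to HA
```

The cleanup step addresses the last trap on the slide: stale fault records can cause an immediate failover the moment the resource is managed again.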
  • 28. Problems - STONITH matches It can happen that the system gets into a state where the nodes constantly shoot each other: every time both nodes boot and form a cluster, one of them gets shot. This usually only happens because of configuration problems, and there are easy ways to break the loop and fix the configuration. Experience helps.
  • 29. Why would you want to stay away from HA? One of the rules of system administration is (or should be): "Keep it simple." Because HA adds more complexity, there are more ways Murphy can strike, so your overall availability can be negatively affected.
  • 30. Why not just use a spare system (cold spare) and avoid HA? A cold spare system can reduce the time to get spare parts, but there is still considerable downtime to activate and test it. The biggest problem is that the procedures for using a cold spare are usually tested infrequently (they need a downtime), and changes and updates to the running system may invalidate them. Since a cold spare already ties up most of the hardware, it makes sense to invest in HA instead and ensure the spare is always usable and configured correctly. In some cases you are better off with a good support contract.
  • 31. Reasons to consider HA Most of the time the DMF system is a central component, so any outage affects the availability of other systems as well. Because in HA the failure of one system is a normal operation, these fault scenarios are constantly tested. This ensures that when a fault occurs, everything works as expected. On single systems the fault scenario is rarely tested and often uncovers problems at the most inconvenient time. In an HA system both nodes are already booted and the standby node is ready to steal volumes and start services; a failover event is much faster than a reboot of a system. Updates and upgrades can be done online with minimal interruption.

Editor's Notes

  1. When looking at high availability you must always look at the whole system (hardware, software, procedures, service). The measurement methodology is important, as it can be very different between sites.
  2. For large systems with lots of LUNs, probing the FC paths through a SAN can take a very long time and significantly increase the boot time. So bringing the system back to resolve a software fault, or to apply a simple patch that requires a reboot, can take a long time.
  3. History is relevant, as user experiences might be based on old versions. So if you hear stories about HA, you need to be aware of which version is being discussed.