This presentation explains the difference between multisite high availability (aka metro cluster) and disaster recovery. The general concepts apply to any product, but the presentation is tailored to VMware technologies.
4. Business Continuity - Definition
• Business continuity encompasses planning and preparation to ensure that an
organization can continue to operate in case of serious incidents or disasters and
is able to recover to an operational state within a reasonably short period.
Business continuity therefore includes three key elements:
– Resilience (High Availability) : critical business functions and the supporting
infrastructure must be designed in such a way that they are materially unaffected by
relevant disruptions, for example through the use of redundancy and spare capacity;
– Recovery (Disaster Recovery): arrangements have to be made to recover or restore
critical and less critical business functions that fail for some reason.
– Contingency: the organization establishes a generalized capability and readiness to
cope effectively with whatever major incidents and disasters occur, including those that
were not, and perhaps could not have been, foreseen. Contingency preparations
constitute a last-resort response if resilience and recovery arrangements should prove
inadequate in practice.
Source: https://en.wikipedia.org/wiki/Business_continuity
WHAT IS NOT MENTIONED IN WIKIPEDIA
– Mitigation (Disaster Avoidance): the organization can complement contingency planning
with mitigation planning, acting proactively before a disaster strikes.
5. Business Continuity - Terminology
• General concepts and terminology
– Business Continuity – must be based on BIA (Business Impact Analysis)
• RPO (Recovery Point Objective), RTO (Recovery Time Objective) – Infrastructure level
• WRT (Work Recovery Time) – Application level
• MTD (Maximum Tolerable Downtime) = RTO + WRT – Business level
– High Availability, Disaster Recovery, Disaster Avoidance
– Availability Zones, Regions
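The relation MTD = RTO + WRT above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names and the sample durations are illustrative assumptions, not values from the presentation.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD (business level) = RTO (infrastructure level) + WRT (application level)."""
    return rto + wrt


def meets_bia(rto: timedelta, wrt: timedelta, mtd_limit: timedelta) -> bool:
    """True if the planned RTO and WRT fit into the MTD taken from the BIA."""
    return maximum_tolerable_downtime(rto, wrt) <= mtd_limit


# Hypothetical numbers: 1 h to restore infrastructure, 30 min to verify applications.
planned = maximum_tolerable_downtime(timedelta(hours=1), timedelta(minutes=30))
```

With these sample values the planned downtime is 90 minutes, so it fits a 2-hour MTD from the BIA but not a 1-hour one.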
Less than ~60 km
More than ~60 km
High Availability
7. Business Continuity / High Availability
• High Availability technologies
– Self-initiated failover without human intervention
– A master node or software arbiter is required
– A multisite HA solution requires a third site for the arbiter
• VMware HA Cluster Solutions
– Local vSphere High Availability Cluster (vSphere HA)
– Multisite vSphere Metro Storage Cluster (vMSC)
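Why a multisite HA solution needs an arbiter in a third site can be illustrated with a small tie-breaker model. This is a simplified sketch under stated assumptions: the function name, the "preferred site" rule and the conservative "nobody serves without the witness" rule are illustrative, and real storage arrays implement vendor-specific logic.

```python
def sites_allowed_to_serve(a_up, b_up, ab_link, a_sees_witness, b_sees_witness,
                           preferred="A"):
    """Return the set of sites that may keep serving I/O.

    Assumed rules (not taken from any specific vendor):
    - Both sites up and connected: both serve (normal operation).
    - Otherwise the witness grants ownership to exactly one reachable site,
      trying the preferred site first, so two sites never serve in isolation.
    - If no surviving site can reach the witness, nobody serves.
    """
    if a_up and b_up and ab_link:
        return {"A", "B"}
    reachable = {"A": a_up and a_sees_witness, "B": b_up and b_sees_witness}
    other = "B" if preferred == "A" else "A"
    for site in (preferred, other):
        if reachable[site]:
            return {site}
    return set()
```

Without the third vote, two surviving sites that lose the inter-site link cannot tell a partition from a peer failure, which is the split-brain case the arbiter resolves.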
8. Single site vSphere HA Cluster
Single Site Shared Storage
FC, iSCSI, NFS, VSAN
We all know that, right?
Local vSphere HA Cluster (single clustered
system in single availability zone)
• Protection against
• Physical server failure (ESXi Hosts monitoring)
• OS failure on top of ESXi (Guest OS monitoring)
• App failure on top of ESXi (App Monitoring)
• System Requirements
• Shared local storage (Fibre Channel, SAS, iSCSI,
NFS, SDS like VSAN)
• Flat L2 Networks for VMs
• Software arbiter - Master node of HA Cluster
9. Multisite vSphere Metro Storage Cluster
Multisite Shared Storage
FC, iSCSI, NFS, VSAN
Not so common in the field, but a very popular topic.
Multisite vSphere Metro Storage Cluster
(single clustered system over two availability zones)
• Protection against
• Various Storage Array Failures
• Whole Single Site Storage Array Failure
• Complete Site Failure
• Anticipated disaster (Disaster Avoidance)
• System Requirements
• Shared stretched storage Volumes / LUNs distributed
across two storage arrays and visible/mounted to
ESXi
• Third zone is required to host the arbiter (storage witness)
• Flat L2 Networks for VMs
Distributed LUN across two storage systems
Storage System A Storage System B
Storage Witness
10. Business Continuity / High Availability - HA2
Metro Storage Cluster (vMSC)
– Advantages
• Positive impact on RTO during single storage or site failure
– faster disaster recovery because VMs are automatically restarted without human interaction
• Higher Protection (redundancy) against specific infrastructure failures
– Protection against single storage array failure
– Protection against complete site failure
• Non-disruptive Disaster Avoidance
– VM workloads vMotion between availability zones
– VMs do not need to be restarted = higher VM availability SLA can be achieved
• Operational Simplicity
– Design, Implement, Test and Forget. Then pray that it will work when needed.
– Schedule periodic tests to be sure it really works.
– Disadvantages
• Single stretched fault zone
• Complex clustering techniques highly dependent on particular storage vendor
• No test plan - it can be tested only by real failure simulation
– Business critical application owners will not accept real failures.
• App start order and dependencies cannot be enforced = negative impact on WRT and MTD
• Third site is required for software arbiter (arbiter, witness, tie-breaker)
12. Business Continuity / Disaster Recovery
• SRM - VMware DR technology = human initiated failovers – human arbiter
– Should be implemented between regions but can be implemented between
availability zones as well
– Only two regions are required because human arbiter can run recovery from
anywhere without split brain
– Can be implemented for more regions – N : M
– Independent Fault Zones - Data Replication and L3 network are the only
common denominators among sites
– Network connectivity should be L3 (routed) to mitigate fault propagation
(broadcast storms, unknown unicast flooding, etc.)
– All infrastructure services have to be duplicated in each region (NTP, DNS,
Active Directory, vCenter, etc.)
– DR orchestration = Application Dependencies (start order) can and should be
specified
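The start-order idea behind DR orchestration can be sketched as a topological sort over application dependencies. This is only an illustration using Python's standard `graphlib` module (Python 3.9+); the VM names are hypothetical and this is not how SRM stores or evaluates its recovery plans.

```python
from graphlib import TopologicalSorter


def start_order(depends_on):
    """Return VM names in an order where every VM starts after its dependencies.

    depends_on maps a VM name to the set of VMs that must be running first.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical three-tier service: web needs app, app needs db.
order = start_order({"web": {"app"}, "app": {"db"}, "db": set()})
```

For this chain the database is started first and the web tier last; a circular dependency is reported as an error instead of producing an arbitrary order.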
13. Business Continuity / Disaster Recovery
• DR (VMware SRM)
– Advantages
• Positive impact on WRT
– VMs restart in priority order with application dependencies respected – RunBook (SRM Recovery Plan)
• Independence from failures in other regions
• Mitigation of false positive failures and unnecessary failovers
– Human initiation of DR failover – business approval required
• DR tests without impact on production
– Detailed report of performed DR tests
– Disadvantages
• Higher RTO
– Have to wait for human interaction (Business approval before failover)
– Storage replication has to be broken and volumes / LUNs have to be mounted to ESXi hosts on the recovery site
– All VMs in a single recovery plan are started in parallel, but only 10 recovery plans can be executed concurrently
• Operational and Business overhead
– BIA must exist
– Protection Groups and Recovery Plans have to be defined based on BIA
– Recovery Plans have to be tested
– Operational personnel have to be trained
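The effect of the limit of 10 concurrently executed recovery plans on RTO, mentioned above, can be approximated by grouping plans into batches. This is a rough sketch: the fixed-batch model and the function name are assumptions, and the real scheduler may start a waiting plan as soon as a slot frees up.

```python
def recovery_batches(plans, max_concurrent=10):
    """Split recovery plans into consecutive batches of at most max_concurrent.

    max_concurrent defaults to the limit of 10 quoted on the slide.
    """
    return [plans[i:i + max_concurrent] for i in range(0, len(plans), max_concurrent)]


# Hypothetical environment with 25 recovery plans.
batches = recovery_batches([f"plan-{n:02d}" for n in range(1, 26)])
```

Under this simplification, 25 plans run in three batches (10, 10 and 5), so the total recovery time is roughly the sum of the slowest plan in each batch.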
15. Business Continuity / Disaster Avoidance
• Disaster Avoidance is a preventive failover to another availability zone to
avoid an anticipated disaster
• Failover with service disruption
– Option 1: SRM fail-over
• Two independent vCenters in two independent SSO domains
• Graceful VM shutdown
• VM restart in the correct order in another region / availability zone
• Failover without service disruption
– Option 1: vSphere Metro Storage Cluster (vMSC)
• Stretched LUN / datastore across availability zones (storage vendor specific technology)
• VMware VM vMotion (CPU, RAM)
– Option 2: vMotion without shared storage
• VMware vMotion within a single vCenter or across two vCenters in a single SSO domain
• VMware VM vMotion (CPU, RAM)
• VMware Storage vMotion, shared-nothing (vDisk)
– Option 3: SRM cross vCenter vMotion without shared storage
• Two independent vCenters in two different SSO domains
• VMware VM vMotion (CPU, RAM)
• VMware Storage vMotion, shared-nothing (vDisk)
16. Multisite High Availability (Metro Cluster) or Disaster Recovery?
Infrastructure Design Qualities
• Availability <= High Availability
• Manageability
• Scalability
• Performance
• Security
• Recoverability <= Disaster Recovery
• Cost
17. Multisite HA (Metro Cluster) or Disaster Recovery?
• vSphere Metro Storage Cluster (vMSC) is a High Availability solution that is great for
– Protection against complete storage system failure
– Non-disruptive Disaster Avoidance between availability zones
– Protection against complete site failure with low RTO but unpredictable WRT and MTD
• but Metro HA (vMSC) is not real Disaster Recovery because of
– Workload restart order unpredictability
– Single system (fault zone) stretched across sites
– Very hard to test
– Shorter distance protection (< ~60km)
• The real VMware Disaster Recovery solution is SRM
– Predictable recovery plans
– Testable recovery plans without impact on production
– Longer distance protection (> ~60km)
• So, what technology should I use?
– It always depends on business requirements (BIA) and what you want to achieve
– Stretched Metro HA Cluster (vMSC) for HA2 and Disaster Avoidance
– SRM for Disaster Recovery
– Both solutions can be used together – vSphere Metro Storage Cluster protected by SRM
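The rules of thumb on this slide can be condensed into a small decision helper. It is a sketch only: the parameter names are made up for illustration, the ~60 km boundary is the approximate figure used in this presentation, and a real decision has to come from the BIA.

```python
def recommend(distance_km, needs_nondisruptive_avoidance=False,
              needs_ordered_testable_recovery=False, metro_limit_km=60):
    """Suggest vMSC, SRM, both, or nothing, following the slide's rules of thumb."""
    choices = []
    # Metro cluster: HA and non-disruptive disaster avoidance over shorter distances.
    if needs_nondisruptive_avoidance and distance_km < metro_limit_km:
        choices.append("vMSC")
    # SRM: predictable, testable recovery plans and longer-distance protection.
    if needs_ordered_testable_recovery or distance_km >= metro_limit_km:
        choices.append("SRM")
    return choices
```

An empty result means the stated requirements do not point to either technology, which mirrors the "it always depends on business requirements" answer above.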
21. Multisite vSphere Metro Storage Cluster
Physical Infrastructure Logical Design
[Diagram: DC A contains Storage Array A (Controller A1, Controller A2), FC SW A1 and A2, ESXi A1 and A2, ETH SW A1 and A2, and Router A. DC B contains Storage Array B (Controller B1, Controller B2), FC SW B1 and B2, ESXi B1 and B2, ETH SW B1 and B2, and Router B. The sites are connected by an Ethernet DCI and a Fibre Channel DCI. DC C contains Router C and the Arbiter / Witness / Tie-Breaker.]
22. Multisite vSphere Metro Storage Cluster
vMSC Logical Design – Uniform Mode – Active/Active storage
[Diagram: DC A holds Storage Array A (Controller A1, Controller A2), ESXi A1 and VM A; DC B holds Storage Array 02 (Controller B1, Controller B2), ESXi B1 and VM B; DC C holds the Arbiter / Witness / Tie-Breaker. The vSphere Metro Storage Cluster (vMSC) uses VMFS Datastore 01 on Distributed Storage Volume 01 with Coherent Cache, provided by the Storage Metro Cluster (Active/Active).]
• LUN Active on Storage A and Passive on Storage B
• Paths Active everywhere (a special multipathing driver is required to identify optimal paths to storage targets where the LUN is active)
• Legend: Active Optimized Local Path, Active Optimized Remote Path
23. Multisite vSphere Metro Storage Cluster
vMSC Logical Design – Non-Uniform Mode – Active/Active storage
[Diagram: DC A holds Storage Array A (Controller A1, Controller A2), ESXi A1 and VM A; DC B holds Storage Array 02 (Controller B1, Controller B2), ESXi B1 and VM B; DC C holds the Arbiter / Witness / Tie-Breaker. The vSphere Metro Storage Cluster (vMSC) uses VMFS Datastore 01 on Distributed Storage Volume 01, provided by the Storage Metro Cluster (Active/Active).]
• LUN Active on Storage A and Passive on Storage B
• LUN Paths Active Optimized in DC A; LUN Paths Active Optimized in DC B
• Legend: Active Optimized Local Path
24. Multisite vSphere Metro Storage Cluster
vMSC Logical Design – Uniform Mode – ALUA storage
[Diagram: DC A holds Storage Array A (Controller A1, Controller A2), ESXi A1 and VM A; DC B holds Storage Array 02 (Controller B1, Controller B2), ESXi B1 and VM B; DC C holds the Arbiter / Witness / Tie-Breaker. The vSphere Metro Storage Cluster (vMSC) uses VMFS Datastore 01 on Distributed Storage Volume 01, provided by the Storage Metro Cluster (ALUA).]
• LUN Active on Storage A and Passive on Storage B
• LUN Paths Active Optimized to DC A and Active Non-Optimized to DC B
• Legend: Active Optimized Local Path, Active Optimized Remote Path, Active Non-Optimized Local Path, Active Non-Optimized Remote Path
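The four path types in the ALUA legend above suggest a simple preference order. The sketch below ranks paths as optimized before non-optimized and local before remote; that ranking, the `Path` type and the path names are assumptions for illustration, and actual path selection depends on the multipathing plug-in and the storage vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str        # hypothetical label, e.g. "ctrl-A1"
    optimized: bool  # ALUA state: Active Optimized vs Active Non-Optimized
    local: bool      # target controller is in the same site as the host


def preferred_paths(paths):
    """Return all paths sharing the best rank (optimized first, then local)."""
    if not paths:
        return []

    def rank(p):
        # False sorts before True, so optimized and local paths rank lowest (best).
        return (not p.optimized, not p.local)

    best = min(rank(p) for p in paths)
    return [p for p in paths if rank(p) == best]
```

If the optimized local path fails, the same function falls back to the optimized remote path before considering any non-optimized path.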
26. VMware SRM Terminology
• SRM - Site Recovery Manager
• Data Replication types
– HBR – Host Based Replication (asynchronous replication with a 15 min delta => RPO)
– SBR – Storage Based Replication (synchronous or asynchronous replication; synchronous => I/O write
performance impact)
• SRM Constructs
– Protection Group = group of VMs to protect as a single business service
– Recovery Plan = RunBook describing how the VMs in a Protection Group have to be started
• Failover and Failback process
– Failover
– Failover-test
– Re-protect
– Failback
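The replication types listed above can be matched against a required RPO. This is a minimal sketch: the 15-minute HBR figure is the one quoted on this slide, synchronous SBR is assumed to give an RPO of zero, and the asynchronous SBR figure is a parameter because it depends on the storage array. The function name is hypothetical.

```python
def replication_candidates(required_rpo_min, hbr_rpo_min=15, sbr_async_rpo_min=None):
    """List replication types whose achievable RPO is within the required RPO."""
    # Synchronous SBR is assumed to lose no acknowledged writes (RPO 0),
    # at the cost of write latency.
    candidates = ["SBR sync"]
    if sbr_async_rpo_min is not None and sbr_async_rpo_min <= required_rpo_min:
        candidates.append("SBR async")
    if hbr_rpo_min <= required_rpo_min:
        candidates.append("HBR")
    return candidates
```

For example, a required RPO of 5 minutes leaves only synchronous SBR under these assumptions, while 15 minutes or more also admits HBR.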
27. SRM Logical Design
[Diagram: SRM logical design across two sites, DC1 (ANT), shown as Site A Datacenter, and DC2 (BUD), shown as Site B Datacenter. Each site contains vCenter Server, SRM with SRA, vSphere Replication, ESXi hosts (esx-01, esx-02, esx-X) with vRA, the VM workload, and a SAN with Replicated LUNs and Non-Replicated LUNs (LUN01, LUN02, LUNX). The DC1 side also shows Authentication and the vSphere Client with the SRM Plug-in.]