MPLS Conference 2016 - Data Center Virtualisation - 11 March
Tackling Disaster in an SCM Environment
1. Tackling Disaster in a Sunrise Clinical
Manager Environment:
Disaster Restart or
Disaster Recovery?
Ziaul Mannan - Sr. Technical DBA
Howard Goldberg - Director, Clinical Systems and Support
2. Yale-New Haven Hospital
New Haven, Connecticut
944 Bed Tertiary Teaching Facility
2600 Medical Staff
7550 Employees
100% CPOE
Average 350,000 orders monthly
Average Daily Census is 724
7-Time “Most Wired” and 3-Time “Most Wireless”
Hospital, per Hospitals and Health Networks
4. Future Clinical Cancer Center At
Yale-New Haven Hospital
112 inpatient beds
Outpatient treatment rooms
Expanded operating rooms
Infusion suites
Diagnostic imaging services
Therapeutic radiology
Specialized Women's Cancer Center
Yale-New Haven Breast Center/GYN Oncology Center
Tentative Completion Date: 2009
5. Problem
• After the events of 9/11, the hospital realized it
needed redundant data centers with the ability to
provide “zero” downtime.
• Implemented SCM with server clusters and EMC
SAN situated in data centers on opposite ends
of the hospital campus.
7. Goals
• Provide 24x7x365 uptime
• Minimize downtime
• Faster recovery in a DR situation
• Database must be consistent
– Or it won’t come up
Challenges
• Build a redundant system across data centers 2 km apart
• Overcome the limitations of clustering solutions
• Design a single system that provides both redundancy and DR
8. YNHH Production Environment
• SCM 4.5 SP3 RU3, migrating to SCM 4.5 with
SMM on 10/03/06
– Total users defined – 10,000
– Users logged on at peak hours ~450
– SCM Reports, HL7 Interfaces, CDS, Multum
– No CDR
– Total disk for data: 700 GB (all servers)
– Total disk for archive: 500 GB
9. YNHH Production Environment
• MS SQL Server
– SQL Server 2000 Enterprise Edition, Build 2195, SP4
– Master and Enterprise databases on dedicated servers,
both clustered
– MSCS and EMC SRDF/CE used as the clustering
solution
• OS and Hardware
– Windows 2000 Advanced Server SP4
– Local SCSI disks and EMC disks on Symmetrix
10. YNHH Production Environment
– Distributed SCM Environment
• Master Server (MSCS cluster using EMC
SRDF/CE) ~125 GB DB
• Enterprise Server (MSCS + EMC SRDF/CE)
• HL7 Server (MSCS + EMC SRDF/CE)
• Reports Server (MSCS + EMC SRDF/CE)
• CDS Server (MSCS + EMC SRDF/CE)
• Multum Server (MSCS + EMC SRDF/CE)
• Compaq Servers - DL760 G2, DL380 G3, DL560
• 2–8 CPUs, 3–8 GB RAM
11. YNHH SCM Production Environment
[Architecture diagram: SCM client workstations connect over the
enterprise LAN to the clustered servers – Master (XAMASTERP),
Enterprise (XAENTERP), HL7 Interface (XAHL7P), Reports (XAREPORTP),
Notification/CDS/Order Generation (XACOGNSP), and Multum (XAMULTUMP) –
each an MSCS + SRDF/CE pair, with MSMQ, domain controllers, and the
Executive/Manager servers (XAAPPS2P) supporting the SunriseXA services.]
13. Disaster Recovery vs. Disaster
Restart
• Disaster Recovery
– The DR process restores database objects from the last good backup
– The recovery process restores and recovers data
– Difficult to coordinate recoveries across database systems
– Long restart time, and data loss can be high
• Disaster Restart
– Disaster restart is inherent in every DBMS
– Remote disaster restart is possible using remote mirroring (SRDF)
– Remote restart has no formal recovery
– A remote disaster restart behaves like restart after a local power failure
– Short restart time and low data loss
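The data-loss contrast above can be sketched in a few lines. This is an illustrative model, not code from the talk: recovery loses everything committed after the last good backup, while restart loses at most the replication lag (zero for synchronous SRDF).

```python
# Hypothetical sketch of the recovery-vs-restart data-loss windows.

def recovery_data_loss(disaster_time_s: float, last_backup_time_s: float) -> float:
    """Recovery restores the last good backup, losing everything after it."""
    return disaster_time_s - last_backup_time_s

def restart_data_loss(replication_lag_s: float) -> float:
    """Restart resumes from the mirror, losing at most the replication lag."""
    return replication_lag_s

# Disaster at t=10h with a nightly backup taken at t=2h, vs. an async
# mirror lagging 5 seconds, vs. a synchronous mirror.
print(recovery_data_loss(10 * 3600, 2 * 3600))  # 28800 s (8 hours) at risk
print(restart_data_loss(5.0))                   # 5.0 s at risk
print(restart_data_loss(0.0))                   # synchronous: 0.0 s at risk
```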
14. Microsoft Cluster Service (MSCS)
• MSCS is the clustering extension to Windows Server
Enterprise and Datacenter editions
• MSCS is a loosely coupled cluster system
• Provides H/W and OS redundancy, but no disk redundancy
• On a failure, disks and resources fail over to the other
node
• Failover can occur due to manual failover, H/W failure or
application failure
• Relatively quick return to service in the event of failure
• MSCS provides improved availability, increased
scalability, and simplified management of groups of systems
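The failover behavior described above can be sketched as a simple simulation. All names here (group and node labels, the heartbeat threshold) are hypothetical, chosen only to mirror the resource groups that appear later in this deck:

```python
# Illustrative sketch of MSCS-style failover: when a node misses enough
# heartbeats it is declared dead, and its resource groups (disks, IP
# addresses, services) are moved to a surviving node.

HEARTBEAT_TIMEOUT = 3  # missed heartbeats before a node is declared dead

def fail_over(groups: dict, missed_heartbeats: dict) -> dict:
    """Return new group ownership after reassigning groups from dead nodes."""
    dead = {n for n, m in missed_heartbeats.items() if m >= HEARTBEAT_TIMEOUT}
    live = sorted(n for n in missed_heartbeats if n not in dead)
    new_owners = {}
    for group, owner in groups.items():
        if owner in dead and live:
            new_owners[group] = live[0]   # move group to a surviving node
        else:
            new_owners[group] = owner     # owner is healthy; leave in place
    return new_owners

groups = {"QuorumGrp": "Node1", "SQLGrp": "Node1", "FShareGrp": "Node2"}
missed = {"Node1": 5, "Node2": 0}         # Node1 has missed 5 heartbeats
print(fail_over(groups, missed))
# {'QuorumGrp': 'Node2', 'SQLGrp': 'Node2', 'FShareGrp': 'Node2'}
```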
16. Limitations of MSCS
• With SCSI, all servers must be within 40 meters of one
another
• Each must be less than 20 meters from the storage
• With Fibre Channel connections this distance can be
increased
• Does not provide disk redundancy
• It is not a fault-tolerant, closely coupled system
• Not a solution for disaster recovery
17. SRDF
• Symmetrix Remote Data Facility / Cluster Enabler is a
disaster-restartable business continuance solution
based on Symmetrix from EMC Corporation
• SRDF is a configuration of multiple Symmetrix arrays
• SRDF duplicates data from the production (source) site to a
secondary recovery (target) site transparently to
users, applications, databases and host processors
• If the primary site fails, data at the secondary site is
current up to the last I/O
• Used for disaster recovery, remote backup, data center
migration, and decision-support solutions
19. SRDF/CE Overview
• Software extension for MSCS
• Cluster nodes can be geographically separated by
distances of up to 60 km
• Provides failover for MSCS-handled failures as well as
site disasters, Symmetrix failures, and total
communication failures (IP + SRDF links lost)
• Up to 64 MSCS clusters per Symmetrix pair
• Protects data from the following types of failure:
– Storage failures
– System failures
– Site failures
21. SRDF/CE modes of operation
• Active/Passive
– Cluster of 2 nodes or more
– Processing is done on one node (the active node)
– Processing is picked up by a remaining node (or
nodes) only when the active node fails
– Half of the H/W is normally idle
– On failure the application restarts with full
performance
• Active/Active
– Cluster of 2 nodes or more
– All nodes run application software
22. • When a node fails, work is transferred to a remaining
node (or nodes)
• The node that picks up the work processes the load of both systems
• The extra load may cause performance degradation
Other generic cluster types:
• Shared-nothing: no cluster resources shared
between nodes
• Shared-something: some resources shared between cluster nodes
23. SRDF/CE in YNHH SCM Production
Environment
[Diagram: clients on the enterprise LAN/WAN reach two cluster nodes
(Host A / Node 1 and Host B / Node 2) joined by a private heartbeat
interconnect running 20 km over single-mode FDDI. Each node attaches
via UWD SCSI or FC-AL to its local Symmetrix; the two Symmetrix arrays
hold R1/R2 device pairs linked by a bi-directional SRDF interconnect.]
24. SRDF/CE Over MSCS
• SRDF/CE protects against more failure scenarios than
MSCS alone
• It overcomes the distance limitations of MSCS
• Cluster nodes can be geographically separated by
distances of up to 60 km (network round-trip latency of
less than 300 ms)
• An ideal solution for dealing with disaster
• Critical information available in minutes
• System restart, not recovery, when disaster happens
25. SRDF/CE and MSCS
Common Recovery Behavior
1. LAN link failure
2. Heartbeat link failure
3. SRDF link failure
4. Host NIC failure
5. Server failure
6. Application software failure
7. Host bus adapter failure
8. Symmetrix array failure
SRDF/CE Unique Behavior
The geographic separation and disaster tolerance of
SRDF/CE cause unique behavior and provide
recovery alternatives
27. Complete Site Failure and
Recovery
• Site (Server and Symmetrix) failure (5+8)
– A site failure occurs when both the server and the Symmetrix fail
from natural disaster or human error
• Total Communication Failure (1+2+3) – Split-Brain?
– Occurs when all communication between Node 1 and Node 2 is
lost
– In this type of failure, both nodes remain operational; this is
referred to as split-brain
– A potential cause of logical data corruption, as each side
assumes the other side is dead and begins processing new
transactions against its own copy of the data
– Two separate and irreconcilable copies of the data are created
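The quorum rule that prevents this can be sketched in a few lines. This is an illustrative model (node names are hypothetical): when all links are lost, only the node that owns the quorum resource keeps running, so only one copy of the data continues to accept writes.

```python
# Hedged sketch of quorum arbitration under a total communication failure:
# the node that does NOT own QuorumGrp takes itself offline, avoiding two
# divergent, irreconcilable copies of the data (split-brain).

def surviving_nodes(nodes: set, quorum_owner: str,
                    total_comm_failure: bool) -> set:
    """Return the nodes that keep running."""
    if not total_comm_failure:
        return set(nodes)        # normal operation: all nodes stay up
    return {quorum_owner}        # only the quorum owner survives

print(surviving_nodes({"Node1", "Node2"}, "Node1", total_comm_failure=True))
# {'Node1'}  -> Node2 goes offline; split-brain prevented
```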
29. Response to complete site failure
• Site Failure – Quorum Retained
– Site failure occurs at Node 2
– QuorumGrp and SQLGrp continue running on Node 1
– Manual intervention required to bring FShareGrp online on
Node 1
• Site Failure – Quorum Lost
– Site failure occurs at Node 1
– Site failure causes SQLGrp and QuorumGrp to go offline
– With QuorumGrp offline, W2K takes the whole cluster offline
– Manual intervention required to bring the cluster online
30. • Total Communications Failure
– A total communications failure causes the node without the
QuorumGrp to go offline
– This prevents split-brain
– Manual intervention required to bring FShareGrp online
– EMC recommends against automatic site failover, to prevent
split-brain
31. Benefits
• Disaster recovery solution
• Disaster restart provides short restart time and low data
loss
• Ensures data integrity
• SRDF/CE overcomes limitations in traditional cluster
solutions like MSCS
32. Disadvantages
• Cost
• Complex setup
• Lots of disks
• Fail-back needs to be planned and takes longer than failover
• Synchronous SRDF disaster restart
– Data must be written to both Symmetrix arrays
– Consistent, reliable data
– More I/O overhead
• Asynchronous SRDF disaster restart
– Data is written asynchronously to the secondary Symmetrix
– May incur data loss
– Faster I/O
• Both sites are in the same city, so prone to a regional disaster
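The synchronous/asynchronous trade-off above can be made concrete with a small model. This is illustrative only (not EMC code; the latency figures are invented): synchronous replication pays the remote round trip on every write but loses nothing, while asynchronous replication acknowledges locally and risks the writes still in flight.

```python
# Illustrative model of the sync-vs-async SRDF trade-off.

def write_latency_ms(local_ms: float, remote_rtt_ms: float,
                     synchronous: bool) -> float:
    """Host-visible latency for one write."""
    # Sync: the host waits for the remote Symmetrix to acknowledge too.
    return local_ms + (remote_rtt_ms if synchronous else 0.0)

def writes_at_risk(pending_remote_writes: int, synchronous: bool) -> int:
    """Writes that can be lost if the primary site fails right now."""
    return 0 if synchronous else pending_remote_writes

print(write_latency_ms(1.0, 4.0, synchronous=True))    # 5.0 ms per write
print(write_latency_ms(1.0, 4.0, synchronous=False))   # 1.0 ms per write
print(writes_at_risk(250, synchronous=True))           # 0 writes lost
print(writes_at_risk(250, synchronous=False))          # up to 250 writes lost
```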
33. Conclusions
• In our DR test the following failure scenarios were tested:
– Server failure
– O/S failure
– HBA/channel failure
– Application failure
– Public LAN failure
– Private LAN failure
– Complete IP communication failure (public LAN and private LAN)
• All tests passed
• We have achieved uptime (excluding scheduled outages)
of almost 100% over the last 3 years
• 2 unplanned failovers so far, due to Windows instability
34. References
• EMC SRDF/Cluster Enabler for MSCS v2.1 Product
Guide, P/N 300-001-286 REV A02, EMC
Corporation, Hopkinton, MA 01748-9103, 2006
• GeoSpan Implementation, John Toner, EMC
Corporation, 2003
Contact Information
Ziaul Mannan: Ziaul.Mannan@ynhh.org
Howard Goldberg: Howard.Goldberg@ynhh.org