Tackling Disaster in a Sunrise Clinical Manager Environment …
Disaster Restart OR Disaster Recovery?

Ziaul Mannan – Sr. Technical DBA
Howard Goldberg – Director, Clinical Systems and Support

Yale-New Haven Hospital
New Haven, Connecticut

 944-Bed Tertiary Teaching Facility
 2,600 Medical Staff
 7,550 Employees
 100% CPOE
 Average of 350,000 orders monthly
 Average Daily Census of 724
 7-Time “Most Wired” and 3-Time “Most Wireless” Hospital by Hospitals and Health Networks
Future Clinical Cancer Center at Yale-New Haven Hospital
 112 inpatient beds
 Outpatient treatment rooms
 Expanded operating rooms
 Infusion suites
 Diagnostic imaging services
 Therapeutic radiology
 Specialized Women's Cancer Center
 Yale-New Haven Breast Center/GYN Oncology Center
 Tentative Completion Date: 2009
Problem
• After the events of 9/11, the hospital realized it needed redundant data centers able to provide “zero” downtime.
• Implemented SCM with server clusters and an EMC SAN located in data centers on opposite ends of the hospital campus.
Goals
•   Provide 24x7x365 uptime
•   Minimize downtime
•   Recover faster in a DR situation
•   The database must be consistent
     – Or it won’t come up (a consistency-check sketch follows this list)
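A minimal sketch of the consistency check one might run after a failover, assuming SQL Server 2000's osql command-line tool; the database name SunriseDB is illustrative, not the actual SCM database name:

    REM Hypothetical post-failover consistency check (osql, SQL Server 2000).
    REM "SunriseDB" is an illustrative database name.
    osql -E -S XAMASTERP -Q "DBCC CHECKDB('SunriseDB') WITH NO_INFOMSGS"

With NO_INFOMSGS, informational messages are suppressed, so any output indicates allocation or consistency errors that would keep the database from coming up cleanly.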

Challenges
• Build a redundant system across data centers 2 km apart
• Overcome the limitations of clustering solutions
• Design a system that provides both redundancy and a DR solution
YNHH Production Environment
• SCM 4.5 SP3 RU3, migrating to SCM 4.5 with SMM (10/03/06)
  –   Total users defined – 10,000
  –   Users logged on at peak hours ~450
  –   SCM Reports, HL7 interfaces, CDS, Multum
  –   No CDR
  –   Total disk for data: 700 GB (all servers)
  –   Total disk for archive: 500 GB
YNHH Production Environment
• MS SQL Server
  – SQL Server 2000 EE Build 2195: SP4 (a verification sketch follows this slide)
  – Master and Enterprise each on their own servers, both clustered
  – MSCS and EMC SRDF/CE used as the clustering solution
• OS and Hardware
  – Windows 2000 Advanced Server SP4
  – Local SCSI and EMC disks on Symmetrix
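A quick, hedged way to confirm that build and service-pack level from the command line (standard osql; the -S target is the Master virtual server named in this deck):

    REM Build 2195 with ProductLevel "SP4" corresponds to SQL Server 2000 SP4.
    osql -E -S XAMASTERP -Q "SELECT @@VERSION, SERVERPROPERTY('ProductLevel')"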
YNHH Production Environment
– Distributed SCM Environment
  • Master Server (MSCS cluster using EMC SRDF/CE) ~ 125 GB DB
  • Enterprise Server (MSCS + EMC SRDF/CE)
  • HL7 Server (MSCS + EMC SRDF/CE)
  • Reports Server (MSCS + EMC SRDF/CE)
  • CDS Server (MSCS + EMC SRDF/CE)
  • Multum Server (MSCS + EMC SRDF/CE)
  • Compaq servers – DL760G2, DL380G3, DL560
  • 2-8 CPUs, 3-8 GB RAM
YNHH SCM Production Environment

[Architecture diagram: SCM client workstations reach the clustered SunriseXA services; MSMQ domain controllers YNHORG2 and YNHORG4 support messaging. Clustered servers: SCM Master Active DB (XAMASTER1PA / XAMASTERP / XAMASTERCL1); Enterprise Server (XAENTER1PA / XAENTERP / XAENTERCL1); HL7 Interface Executive Server with MSMQ (XAHL71PA / XAHL7P / XAHL7CL1); HL7 Interface Manager Server (XAAPPS2P); Notification, CDS and Order Generation Server (XACOGNS1PA / XACOGNSP / XACOGNSCL1); Multum Server (XAMULTUM1PA / XAMULTUMP / XAMULTUMCL1); Report Server (XAREPORT1P / XAREPORTP / XAREPORTCL1).]
Solutions/Tools
• Disaster Restart
• Microsoft Cluster Service (MSCS)
• EMC SRDF/CE
Disaster Recovery vs. Disaster Restart
• Disaster Recovery
  –   The DR process restores database objects to the last good backup
  –   The recovery process restores and recovers data
  –   Difficult to coordinate recoveries across database systems
  –   Long restart time, and data loss can be high

• Disaster Restart
  –   Disaster restart is inherent in every DBMS
  –   Remote disaster restart is possible using remote mirroring (SRDF)
  –   A remote restart involves no formal recovery
  –   A remote disaster is handled like a local system power failure
  –   Short restart time and low data loss (the two paths are contrasted in the sketch after this list)
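To make the contrast concrete, a hedged sketch using SQL Server 2000's osql tool; the database name and backup paths are illustrative only:

    REM Disaster *recovery*: manually restore the last good backup chain.
    REM Names and paths below are hypothetical.
    osql -E -Q "RESTORE DATABASE SunriseDB FROM DISK = 'E:\bkp\sunrise_full.bak' WITH NORECOVERY"
    osql -E -Q "RESTORE LOG SunriseDB FROM DISK = 'E:\bkp\sunrise_log.trn' WITH RECOVERY"

    REM Disaster *restart*: nothing is restored. When the instance starts
    REM against the mirrored disks, SQL Server's automatic crash recovery
    REM rolls the log forward and rolls back in-flight transactions.
    net start MSSQLSERVER

In a cluster, MSCS starts the SQL Server service rather than net start, but the restart path is the same: crash recovery, not restore.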
Microsoft Cluster Service (MSCS)
• MSCS is the clustering extension to Windows Server Enterprise and Datacenter editions
• MSCS is a loosely coupled cluster system
• Provides H/W and OS redundancy, but no disk redundancy
• On a failure, disks and resources fail over to the other node
• Failover can be triggered manually or by H/W or application failure
• Relatively quick return to service in the event of failure
• MSCS provides improved availability, increased scalability, and simplified management of groups of systems (a manual-failover sketch follows)
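For illustration, a manual group failover with the Windows 2000 cluster.exe utility might look like the sketch below; the cluster and group names are taken from this deck, while the node name NODE2 is written here purely for illustration:

    REM Move the SQL group to the other node, then list group states.
    REM Verify names in your own cluster before use.
    cluster /cluster:XAMASTERCL1 group "SQLGrp" /moveto:NODE2
    cluster /cluster:XAMASTERCL1 group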
A typical two-node MSCS cluster
Limitations of MSCS
• With SCSI, all servers must be within 40 meters of one another
• Each server must be less than 20 meters from the storage
• With Fibre Channel connections these distances can be increased
• Does not provide disk redundancy
• It is not a fault-tolerant, closely coupled system
• Not a solution for disaster recovery
SRDF
• Symmetrix Remote Data Facility / Cluster Enabler is a disaster-restartable business continuance solution based on Symmetrix from EMC Corporation
• SRDF is a configuration of multiple Symmetrix arrays
• SRDF duplicates data from the production (source) site to a secondary recovery (target) site transparently to users, applications, databases and host processors
• If the primary site fails, data at the secondary site is current up to the last I/O
• Used for disaster recovery, remote backup, data center migration, and data center decision solutions (see the SYMCLI sketch below)
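Day to day, SRDF pairs are driven through EMC Solutions Enabler's SYMCLI; a hedged sketch, with ProdDG as a hypothetical device-group name:

    REM "ProdDG" is an illustrative SRDF device group name.
    REM Show R1/R2 pair states and the replication mode:
    symrdf -g ProdDG query
    REM After a primary-site loss, make the remote (R2) side writable:
    symrdf -g ProdDG failover
    REM Once the primary returns, resynchronize and return to the R1 side:
    symrdf -g ProdDG failback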
Basic SRDF Configuration
SRDF/CE Overview
• Software extension for MSCS
• Cluster nodes can be geographically separated by up to 60 km
• Provides failover for MSCS-handled failures as well as site disasters, Symmetrix failures, and total communication failures (IP + SRDF links lost)
• Up to 64 MSCS clusters per Symmetrix pair
• Protects data from the following types of failure:
   – Storage failures
   – System failures
   – Site failures
A Geographically Distributed 2-Node SRDF/CE Cluster
SRDF/CE modes of operation
• Active/Passive
   – Cluster of 2 nodes or more
   – Processing is done on one node (the active node)
   – Processing is picked up by a remaining node (or nodes) only when the active node fails
   – Half of the H/W is normally idle
   – On failure the application restarts at full performance
• Active/Active
   – Cluster of 2 nodes or more
   – All nodes run application software
   – When a node fails, work is transferred to a remaining node (or nodes)
   – The node that picks up the work processes the load of both systems
   – The extra load may cause performance degradation

Other generic types of clusters:

• Shared-nothing: no cluster resources are shared between nodes
• Shared-something: some resources are shared between cluster nodes
SRDF/CE in YNHH SCM Production Environment

[Diagram: clients on the enterprise LAN/WAN; Host A (Node 1) and Host B (Node 2) joined by a private interconnect (heartbeat connector) running 20 km over single-mode FDDI; each host attaches to its local Symmetrix via UWD SCSI or FC-AL; the two Symmetrix arrays each hold R1 and R2 volumes linked by a bi-directional SRDF interconnect.]
SRDF/CE Over MSCS
• SRDF/CE protects against more failure scenarios than MSCS alone can
• It overcomes the distance limitations of MSCS
• Cluster nodes can be geographically separated by up to 60 km (network round-trip latency under 300 ms)
• An ideal solution for dealing with disaster
• Critical information is available within minutes
• Provides system restart, not recovery, when disaster strikes
SRDF/CE and MSCS Common Recovery Behavior
1. LAN link failure
2. Heartbeat link failure
3. SRDF link failure
4. Host NIC failure
5. Server failure
6. Application software failure
7. Host bus adapter failure
8. Symmetrix array failure

SRDF/CE Unique Behavior
The geographic separation and disaster tolerance of SRDF/CE cause unique behavior and provide recovery alternatives.
SRDF/CE failover operation
Complete Site Failure and Recovery
• Site (server and Symmetrix) failure (5+8)
   – A site failure occurs when both the server and the Symmetrix fail, from natural disaster or human error
• Total communication failure (1+2+3) – split-brain?
   – Occurs when all communication between Node 1 and Node 2 is lost
   – In this type of failure both nodes remain operational; the condition is referred to as split-brain
   – A potential cause of logical data corruption, as each side assumes the other side is dead and begins processing new transactions against its copy of the data
   – Two separate and irreconcilable copies of the data are created
Complete Site Failure
Response to complete site failure
• Site Failure
   – Site failure occurs at Node 2
   – QuorumGrp and SQLGrp continue running on Node 1
   – Manual intervention is required to bring FShareGrp online on Node 1
• Site Failure – Quorum Lost
   –   Site failure occurs at Node 1
   –   The failure causes SQLGrp and QuorumGrp to go offline
   –   With QuorumGrp offline, W2K takes the whole cluster offline
   –   Manual intervention is required to bring the cluster online
• Total Communications Failure
   – A total communications failure causes the node without the QuorumGrp to go offline
   – This prevents split-brain
   – Manual intervention is required to bring FShareGrp online (a sketch follows this list)
   – EMC recommends against automatic site failover, to prevent split-brain
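That manual step might be performed with cluster.exe, sketched below; the group name comes from the slide above, and the command assumes the operator has first confirmed the remote site is genuinely down:

    REM Bring the file-share group online on the surviving node only
    REM after verifying this is a true site failure, not split-brain.
    cluster group "FShareGrp" /online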
Benefits
• Disaster recovery solution
• Disaster restart provides short restart time and low data loss
• Ensures data integrity
• SRDF/CE overcomes limitations in traditional cluster solutions like MSCS
Disadvantages
•   Cost
•   Complex setup
•   Lots of disks
•   Fail-back must be planned and takes longer than failover
•   Synchronous SRDF disaster restart (mode commands are sketched after this list)
    – Data must be written to both Symmetrix arrays
    – Consistent, reliable data
    – More I/O overhead
•   Asynchronous SRDF disaster restart
    – Data is written asynchronously to the secondary Symmetrix
    – May incur data loss
    – Faster I/O
•   Both sites are in the same city, so both are exposed to a regional disaster
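The synchronous/asynchronous choice above is a SYMCLI mode setting; a hedged sketch with a hypothetical device-group name (asynchronous mode assumes SRDF/A support on the arrays and Solutions Enabler version):

    REM Synchronous: each write is acknowledged by both Symmetrix arrays.
    symrdf -g ProdDG set mode sync
    REM Asynchronous: writes are shipped in cycles; faster host I/O,
    REM but the R2 copy can trail and data may be lost in a disaster.
    symrdf -g ProdDG set mode async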
Conclusions
• In our DR testing the following failure scenarios were exercised:
   –   Server failure
   –   O/S failure
   –   HBA/channel failure
   –   Application failure
   –   Public LAN failure
   –   Private LAN failure
   –   Complete IP communication failure (public LAN and private LAN)
• All tests passed
• We have achieved uptime of almost 100% (excluding scheduled outages) over the last 3 years
• Only 2 unplanned failovers so far, both due to Windows instability
References
• EMC SRDF/Cluster Enabler for MSCS v2.1 Product Guide, P/N 300-001-286 REV A02, EMC Corporation, Hopkinton, MA 01748-9103, 2006
• GeoSpan Implementation, John Toner, EMC Corporation, 2003


Contact Information
Ziaul Mannan: Ziaul.Mannan@ynhh.org
Howard Goldberg: Howard.Goldberg@ynhh.org

THANK YOU!

Questions?
