Mmckeown hadr that_conf


High Availability and
Disaster Recovery
in Windows Azure
MIKE MCKEOWN
BLOG: HTTP://WWW.MICHAELMCKEOWN.COM
TWITTER: NWOEKCM
LINKEDIN: WWW.LINKEDIN.COM/PUB/MIKE-MCKEOWN/20/B73/389/
CLOUD SOLUTIONS ARCHITECT - ADITI TECHNOLOGIES

CORPORATE OVERVIEW OF ADITI
- TRUSTED, RESPECTED, TECHNOLOGY SERVICES LEADER
2012 Partner of the year
Windows Azure , Finalist
Windows Azure SI, Finalist
Windows Azure , Winner
 Best companies to work for
 Top 10 IT Workplace
 Global Cloud MVPs
 Top 50 Cloud influencers
 1:114 hiring ratio
The Best ‘OF’
Vendor Award
 52% of our customers rate us 5/5.
 45 + active customers.
 1200+ engagements.
 1600 people, globally
 18 years, 12 locations

You might be from Wisconsin if…
• You have been both frostbitten and sunburned all in the same week.
• You owe more money on your snowmobile than you do on your family car.
• You consider a six pack of beer and a bug-zapper quality entertainment.
• You go to your family reunion looking to meet new women.
• You learned to drive a tractor before the training wheels were off your bike.
• You think that John Deere Green, Ford Blue, and Primer Gray are the three primary colors.
• Your school loses half its student body during deer season.
• The blue book value of your truck goes up and down depending on how much gas it has in it.

Agenda
 High Availability (HA) and Disaster Recovery (DR)
 Definitions
 Service Level Agreements (SLAs)
 Designing for Failure
 HA/DR Architectures
 Failover Demo – Azure Traffic Manager
 Tips and Best Practices

Introduction to HA/DR
• High Availability (HA) includes a Disaster Recovery (DR) plan
• Cloud failure is inevitable
• Proper management means recognizing failures fast to minimize their effects
• Define tolerance thresholds and an associated strategy
• Consider budget and strategic location of resources
• Cloud provides affordable and easily configurable geo-redundancy
• Azure builds resiliency into some of its services
• For other services, you must build it in yourself

What is your Cloud HA/DR strategy?

1. HA = Flat tire and spare donut tire
With the spare tire, the car continues to run
• Can’t reach top speeds
• Can’t maneuver as well
Example of Azure HA:
• An instance of a Web role crashes due to a fault on its rack
• The app keeps running on its other instances, so the SLA still holds

High Availability Definitions
1. Fault Tolerance
• Detects and maneuvers around failed elements to continue and return correct results within a specific timeframe
• Uses one or more design strategies: app redundancy, data replication, or degraded functionality (e.g. an order processing system)
2. Availability
• HA systems are measured by their % availability, counting planned and unplanned service outages for users
• Azure Availability SLA
• Techniques can improve availability so the system stays available during problems
• Redundant and reliable design

Redundancy in Windows Azure
• Windows Azure Storage with 2x replicas
• Azure SQL Database built-in 2x backup servers
• Windows Azure Caching with high availability enabled
• Multi-instance Windows Azure Web Sites and Cloud Services
• Failover with Windows Azure Traffic Manager

Reliability in Windows Azure
• Auto recovery of crashed/nonresponsive instances
• Fault domain to scatter instances across racks
• Virtual machine availability set to allocate VMs across fault domains
• Upgrade domain to avoid shutting down all instances at the same time
• Handle transient errors using the Transient Fault Handling Application Block
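
Sketch of the transient-error idea: retry with exponential backoff. The Transient Fault Handling Application Block is a .NET library; this Python version is not its API, only an illustration of the pattern. `TransientError`, `retry`, and `make_flaky` are hypothetical names made up for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""


def retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on each attempt; a little jitter keeps many
            # clients from retrying at exactly the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


def make_flaky(failures):
    """Build a demo operation that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated transient fault")
        return state["calls"]

    return operation
```

Usage: `retry(make_flaky(2))` succeeds on the third call. Only errors you know to be transient should be retried; anything else should fail fast.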

High Availability Definitions (continued)
3. Scalability
 Meet increased demand with consistent results in acceptable time windows
 Horizontal scale out (dynamic) vs vertical scale up (restart)

What does HA require?
 Strategies to absorb outage of key components
 No single points of failure
 Multiple web servers and data replication
 Graceful failover when individual components fail (and they will)
 Backup components and systems

2. DR = Bad Car Crash
• Entire data center is down and there is no connection to the database
• Network goes down and you can’t connect to on-premises machines

Disaster Recovery
• Processes, policies, and procedures to restore critical systems after a catastrophic event
• Application failure, data corruption (including human error), network down, failure of a connected service, DC down
• A DR plan is part of a good HA strategy
• Invest time and resources to continually plan, prepare, rehearse, document, train, and update processes
• One point of responsibility
• Real World DR Plan – Dilbert Technical Services
• Establish RPO and RTO and know your SLAs

Recovery Point Objective (RPO)
• How much data can you lose and still be okay after rollback?
• How consistent does the data need to be after a rollback?
• A larger RPO fits less critical data and costs less
• A smaller RPO fits more critical data and costs more

Recovery Time Objective (RTO)
• How much time can it take to recover after the disaster?
• A larger RTO fits less critical systems and costs less
• A smaller RTO fits more critical systems and costs more
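
One way to make RPO and RTO concrete is to score a rehearsed (or real) incident against them. A minimal sketch: the function name, the timestamps, and the 15-minute and 2-hour targets are all made up for illustration.

```python
from datetime import datetime, timedelta


def recovery_report(last_good_copy, disaster, service_restored, rpo, rto):
    """Compare one incident against the RPO and RTO targets."""
    data_loss_window = disaster - last_good_copy  # work lost since the last good copy
    downtime = service_restored - disaster        # time users were without service
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical drill: last replica sync 5 minutes before the failure,
# service back 2.5 hours after it.
report = recovery_report(
    last_good_copy=datetime(2013, 8, 12, 9, 55),
    disaster=datetime(2013, 8, 12, 10, 0),
    service_restored=datetime(2013, 8, 12, 12, 30),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=2),
)
print(report["rpo_met"], report["rto_met"])  # prints: True False
```

In this drill the data loss is within target but the recovery took too long, so the fix is a faster failover, not more frequent replication.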

What’s in a Hot Dog?
• Animal organs
• Kidney, liver, hearts, etc.
• Reproductive organs?
• Plastic, glass, bugs, and animal bones
• Mechanically Separated Meats
• “A paste-like meat product produced by forcing bones, with attached edible meat, under high pressure through a sieve to separate the bone from the edible meat tissue.”
• SLAs are like hot dogs

Service Level Agreements
• The closer to 100% (more 9’s), the more uptime, but at higher cost and maintenance
• Azure has non-cumulative monthly SLAs

Compounding of SLAs
Effective availability considers the SLAs of each dependent service and their cumulative effect on the total system availability
• Windows Azure Compute (2 instances) = 99.95%
• SQL Azure Database = 99.9%
• Windows Azure Storage = 99.9%
• Total allowed downtime, expressed per year
• 4.38 hours + 8.76 hours + 8.76 hours = 21.9 hours
• Effective availability: 99.75%
• Is that good enough for your app?
• Can the effective availability of the SLAs meet the RPO and RTO of your app?
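
The arithmetic on this slide can be checked in a few lines. The 4.38 and 8.76 hour figures are what 99.95% and 99.9% allow over an 8,760-hour year. The sketch assumes the app needs all three services up at the same time and that their outages are independent; the function and dictionary names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def effective_availability(slas):
    """Multiply the availabilities of services that must all be up at once."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime an availability figure permits over the period."""
    return (1.0 - availability) * period_hours


slas = {"compute": 0.9995, "sql": 0.999, "storage": 0.999}
combined = effective_availability(slas.values())

print(round(combined * 100, 2))                    # 99.75
print(round(allowed_downtime_hours(combined), 1))  # 21.9
```

Adding the per-service downtimes, as the slide does, is a close approximation of the product; either way the combined figure is lower than any single SLA.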

Azure HA/DR Architecture Concepts
• Failure Design
• Multi-Site Data Backup/Recovery Strategies
• Immediately or eventually consistent systems
• Fabric Controller (FC) and Fault Domains
• PaaS and IaaS
• Windows Azure Traffic Manager

Design For Failure
 Large scale failures in any Cloud are rare but will happen
 Cloud Data Centers don’t magically remove failures
 Fabric Controller helps to quickly recover from problems in one DC
 Understand RPO/RTO requirements to design for failures
 Balance cost and complexity of HADR efforts against risk(s) you’re willing to bear
 Cloud has made DR and HA remarkably easy and affordable
 Multiple configurations possible with a few clicks
 Application owners are ultimately responsible for failure management
 Owners of DR Plans and HA strategy

Multi-Site Data Recovery Approaches
1. Azure Data Sync Services (PaaS)
• Recommended between Azure SQL Database instances only
• 5-minute minimum replication interval
• If you need a lower RPO, you need to build it yourself
• Creates clutter in synced databases
2. SQL Server Merge Replication (IaaS)
• Two SQL Server databases (IaaS VMs) in two different regions
• An update in DB A also goes to DB B, and vice versa
• Synchronous transactional operations lock tables and affect performance
3. SQL Server 2012 AlwaysOn Availability Groups (IaaS)
• Two SQL Server databases (IaaS VMs) in different regions
• Immediate replication between the master and its replicas
• Non-transactional, so no locking or performance degradation

1. Azure Data Sync Services
• SQL Azure Database only (pure PaaS)
• 5-minute minimum replication interval
• Transactional and blocking
• One-way or two-way
• Not recommended with SQL Server
[Diagram: two Azure SQL Database instances connected by Data Sync Services]

2. SQL Server Merge Replication/Azure IaaS VMs
• Two databases in two different regions in IaaS VMs
• An update in DB A also goes to DB B, and vice versa
• Synchronous transactional operations lock tables and affect performance
[Diagram: Azure IaaS VM and SQL Server 1 (SQL Server Database A) and Azure IaaS VM and SQL Server 2 (SQL Server Database B), with transactional sync from A to B and from B to A]

3. SQL Server 2012 Always-On Availability Groups
• Two databases in two different regions in IaaS VMs
• Immediate replication between the master and its replicas
• Non-transactional, so no locking or performance degradation
[Diagram: two Azure IaaS VMs running SQL Server 2012, one with the master DB and one with the replica DB, linked by AlwaysOn (non-blocking) synchronization]

Consistency Models
 Immediately consistent systems
 Traditional Synchronous pattern of all at once
 Can hurt performance with locking/blocking
 Possibly lose something at failure and recovery
 The “C” in ACID
 Transactional consistency to all affected data based upon rules, triggers, constraints
 Eventually consistent systems
 Asynchronous patterns using durable queues
 Nothing lost in recovery
 The ability to recreate system after failure
 Improves fault tolerance in systems
 Customer may not need to see immediate updates
 Posts to Twitter/Facebook
 DB may have some inconsistencies at any point in time
 All nodes eventually consistent when all updates are done
 Both have a role in HADR based upon RTO and RPO
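
The eventually consistent pattern can be sketched as a primary that acknowledges writes at once and a replica that catches up from a queue. This is a toy model: the class and method names are hypothetical, and the in-memory `deque` only stands in for a durable queue, which in a real system must survive a crash.

```python
from collections import deque


class ReplicatedStore:
    """Primary acknowledges writes at once; the replica catches up from a queue."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # stand-in for a durable queue

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replicate later, without blocking

    def replicate(self, max_items=None):
        """Drain up to max_items queued updates into the replica."""
        applied = 0
        while self.pending and (max_items is None or applied < max_items):
            key, value = self.pending.popleft()
            self.replica[key] = value
            applied += 1
        return applied

    def is_consistent(self):
        return not self.pending and self.primary == self.replica


store = ReplicatedStore()
store.write("post:1", "hello")
store.write("post:2", "world")
print(store.is_consistent())  # False: the replica has not caught up yet
store.replicate()
print(store.is_consistent())  # True: all queued updates have been applied
```

The queue depth at the moment of failure is the data at risk, so it maps directly onto the RPO.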

Fault Domains - PaaS
“A fault domain is a set of hardware components – computers, switches, and more – that share a single point of failure.”
• You can’t control FDs; they are assigned by Azure
• Fault Domains do not span data centers
• FC provisions multiple role instances across Fault Domains
• FC monitors Fault Domains to reduce localized failures
• Upon failure, FC enforces the SLA and re-provisions instances
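
Why a single instance is not protected: a toy placement model. This is not the Fabric Controller's actual algorithm, just a round-robin spread across an assumed two fault domains, with hypothetical function and instance names.

```python
def place_instances(instance_count, fault_domains=2):
    """Round-robin role instances across fault domains (illustrative only)."""
    placement = {fd: [] for fd in range(fault_domains)}
    for i in range(instance_count):
        placement[i % fault_domains].append(f"instance-{i}")
    return placement


def survivors(placement, failed_domain):
    """Instances still running after one fault domain is lost."""
    return [
        name
        for fd, names in placement.items()
        if fd != failed_domain
        for name in names
    ]


print(survivors(place_instances(2), failed_domain=0))  # ['instance-1']
print(survivors(place_instances(1), failed_domain=0))  # []
```

With two or more instances, losing one rack still leaves something serving traffic; with one instance, it does not.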

Fault Domains – IaaS Virtual Machines
• VM Availability Sets
• Different Fault Domains/Racks
• Azure locates VMs in different fault domains to prevent localized failure
• Required for 99.95% VM SLA
• Ex. Web & SQL Server

Windows Azure Traffic Manager (WATM)
• Automated routing, by one of three policies:
1. Failover
2. Performance
3. Round-robin
• Gives users a new DNS prefix
• Key point – you decide whether your failover domain is dormant or active while NOT in failover mode
• WATM rolls over regardless of whether the site is up or down
• You need to manage whether the failover domain is active or dormant
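
The three routing policies differ only in how they pick an endpoint. The sketch below models that choice; it is not how WATM is implemented (the real service answers DNS queries), and the endpoint names, health flags, and latency numbers are invented for the example.

```python
from itertools import count


def failover(endpoints):
    """Return the first healthy endpoint in priority order."""
    for ep in endpoints:
        if ep["healthy"]:
            return ep["name"]
    return None


def performance(endpoints):
    """Return the healthy endpoint with the lowest latency for this caller."""
    healthy = [ep for ep in endpoints if ep["healthy"]]
    return min(healthy, key=lambda ep: ep["latency_ms"])["name"] if healthy else None


def round_robin(endpoints, counter):
    """Rotate across healthy endpoints; counter is any iterator of ints."""
    healthy = [ep for ep in endpoints if ep["healthy"]]
    return healthy[next(counter) % len(healthy)]["name"] if healthy else None


endpoints = [  # listed in priority order; all values are hypothetical
    {"name": "primary", "healthy": False, "latency_ms": 20},
    {"name": "secondary", "healthy": True, "latency_ms": 80},
    {"name": "tertiary", "healthy": True, "latency_ms": 45},
]

print(failover(endpoints))     # secondary
print(performance(endpoints))  # tertiary
```

Note that none of the policies starts a dormant site for you; the selection only helps if the chosen endpoint is actually able to take load.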

HA/DR Types and Terms
 Mostly PaaS concepts with a bit of IaaS
 Example : home phone
1. Cold
 Backup has nothing active, pre-loaded, or updated
 Least expensive and slowest recovery
 Ex. Have to go out and buy new home phone
2. Warm or Passive
 Backup has some parts loaded/current and others made active upon failure
 Ex. Home phone at house but still packed and not charged
3. Hot or Active
 Backup is loaded and ready to receive load upon failure but not active
 Ex. Home phone with charged battery but not plugged into home circuit
4. Highly Available
 Backup is loaded and active and receiving load as part of normal processing
 Most expensive and quickest recovery
 Ex. Home phone with charged battery and plugged into home phone circuit

Single Region Deployment
[Diagram: bullet text not recoverable]

Cold DR
[Diagram: bullet text not recoverable]

Warm DR
[Diagram: deployments spread across Fault Domain #1 and Fault Domain #2]

Hot DR – Option 2
[Diagram: deployments spread across Fault Domain #1 and Fault Domain #2]

High Availability
[Diagram: deployments spread across Fault Domain #1 and Fault Domain #2]

Demo: HA using Azure Traffic Manager

HA/DR Checklist for Risk Mitigation
1. Conduct a risk assessment for each application
 Each can have different requirements.
 Some applications are more critical than others
 Justify extra cost to architect them for disaster recovery
 Use this information to define the RTO and RPO for each application.
2. Design for failure starting with the application architecture
3. Implement best practices for high availability
 Balancing cost, complexity, and risk
4. Implement disaster recovery plans and processes.
5. Establish backup strategies for all reference and transactional data.
6. Consider failures that span the module level all the way to a complete Cloud outage.
7. Choose a multi-site disaster recovery architecture.

General HA Best Practices
• Avoid single points of failure
• Always place (at least) one of each component (load balancers, app servers, databases, …) in at least two regions or fault domains
• Maintain sufficient capacity to absorb region/fault domain failures
• Reserved Instances (hot) – guarantee capacity is available in a separate region/cloud
• Replicate data across clouds/regions for failover
• Set up monitoring, alerts, and operations to identify and automate problem resolution or the failover process
• Design stateless applications for resilience to reboot/relaunch
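
"Sufficient capacity to absorb a failure" can be turned into a number. A back-of-the-envelope sketch, assuming load spreads evenly over the surviving regions; the function name and the figures in the usage line are made up.

```python
import math


def instances_per_region(peak_load, capacity_per_instance, regions, tolerated_failures=1):
    """Instances each region needs so the survivors can carry the full peak load."""
    surviving = regions - tolerated_failures
    if surviving < 1:
        raise ValueError("need at least one surviving region")
    load_per_surviving_region = peak_load / surviving
    return math.ceil(load_per_surviving_region / capacity_per_instance)


# Hypothetical: 1000 requests/s at peak, 100 requests/s per instance.
print(instances_per_region(1000, 100, regions=2))  # 10
print(instances_per_region(1000, 100, regions=3))  # 5
```

With two regions, each must be able to carry the whole load alone, which is the cost of the active/active option; a third region cuts the per-region headroom in half.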

Summary
• Plan and design for failure
• Work with business and IT - RPO and RTO
• Understand cumulative SLAs
• Implement correct HA/DR Architectures
• Best Practices and Checklist
• Start with some DR strategy and improve continually

Resources
 Disaster Recovery and High Availability for Windows Azure Applications
 Mike McKeown and Hanu Kommalapati
http://msdn.microsoft.com/en-us/library/dn251004.aspx
 Contingency Planning Guide for Information Technology Systems
 National Institute of Standards and Technology
https://www.fismacenter.com/sp800-34.pdf
 Failsafe: Guidance for Resilient Cloud Architectures
 Marc Mercuri, Ulrich Homann, and Andrew Townhill
http://msdn.microsoft.com/en-us/library/windowsazure/jj853352.aspx
 Business Continuity for Windows Azure
 Patrick Wickline, Adam Skewgar, Walter Myers III
http://msdn.microsoft.com/en-us/library/windowsazure/hh873027.aspx

AUGUST 11TH – 13TH 2014
SAME PLACE, SAME TIME
