3. High Availability and Disaster Recovery in Windows Azure
MIKE MCKEOWN
BLOG: HTTP://WWW.MICHAELMCKEOWN.COM
TWITTER: NWOEKCM
LINKEDIN: WWW.LINKEDIN.COM/PUB/MIKE-MCKEOWN/20/B73/389/
CLOUD SOLUTIONS ARCHITECT - ADITI TECHNOLOGIES
4. CORPORATE OVERVIEW OF ADITI
- TRUSTED, RESPECTED, TECHNOLOGY SERVICES LEADER
2012 Partner of the Year, Windows Azure: Finalist
2011 Partner of the Year, Windows Azure SI: Finalist
2010 Partner of the Year, Windows Azure: Winner
Best companies to work for
Top 10 IT Workplace
Global Cloud MVPs
Top 50 Cloud influencers
1:114 hiring ratio
The Best ‘OF’
Vendor Award
52% of our customers rate us 5/5.
45 + active customers.
1200+ engagements.
1600 people, globally
18 years, 12 locations
5. You might be from Wisconsin if…
You have been both frostbitten and sunburned all in the same week.
You owe more money on your snowmobile than you do on your family car.
You consider a six pack of beer and a bug-zapper quality entertainment
You go to your family reunion looking to meet new women
You learned to drive a tractor before the training wheels were off your bike.
You think that John Deere Green, Ford Blue, and Primer Gray are the three primary colors.
Your school loses half its student body during deer season.
The blue book value of your truck goes up and down depending on how much gas it has in it.
6. Agenda
High Availability (HA) and Disaster Recovery (DR)
Definitions
Service Level Agreements (SLAs)
Designing for Failure
HA/DR Architectures
Failover Demo – Azure Traffic Manager
Tips and Best Practices
7. Introduction to HA/DR
High Availability (HA) includes a Disaster Recovery (DR) plan
Cloud failure is inevitable
Proper management means fast recognition to minimize effects
Define tolerance thresholds and an associated strategy
Consider budget and strategic location of resources
Cloud provides affordable and easily configurable geo-redundancy
Azure builds resiliency into some of its services
Others you must build it in yourself
9. 1. HA = Flat tire and spare donut tire
With spare tire car continues to run
Can’t reach top speeds
Can’t maneuver as well
Example of Azure HA:
An instance of a Web role
crashes due to a fault on its rack
SLA allows app to keep running
10. High Availability Definitions
1. Fault Tolerance
Detects and maneuvers around failed elements to continue and return the correct results within a specific timeframe
Use one or more design strategies: app redundancy, data replication, or degraded functionality (e.g. an order processing system)
2. Availability
HA systems are measured by the % of time they are available, counting planned and unplanned service outages for users
Azure Availability SLA
Techniques can improve availability so the system stays available during problems
Redundant and reliable design
11. Redundancy in Windows Azure
• Windows Azure Storage with 2x replicas
• Azure SQL Database built-in 2x backup servers
• Windows Azure Caching with high availability enabled
• Multi-instance Windows Azure Web Sites and Cloud Services
• Failover with Windows Azure Traffic Manager
12. Reliability in Windows Azure
• Auto recovery of crashed/nonresponsive instances
• Fault domain to scatter instances across racks
• Virtual machine availability set to allocate VMs across fault domains
• Upgrade domain to avoid shutting down all instances at the same time
• Handle transient errors using the Transient Fault Handling Application Block
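The transient-error bullet above can be illustrated with a minimal retry sketch. This is not the Transient Fault Handling Application Block (a .NET library); it is a plain Python illustration of the same idea, retrying with exponential backoff. `TransientError`, `call_with_retry`, and all parameter values are hypothetical names and defaults chosen for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            # The base delay doubles on each attempt; the random jitter keeps
            # many clients from retrying in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Only errors known to be temporary (throttling, a dropped connection) should be mapped to the retryable type; retrying a permanent failure just delays the error.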
13. High Availability Definitions
3. Scalability
Meet increased demand with consistent results in acceptable time windows
Horizontal scale out (dynamic) vs vertical scale up (restart)
15. What does HA require?
Strategies to absorb outage of key components
No single points of failure
Multiple web servers and data replication
Graceful failover when individual components fail (and they will)
Backup components and systems
16. 2. DR = Bad Car Crash
Entire data center is down, with no connection to the database
Network goes down and you can’t connect to on-premises machines
17. Disaster Recovery
Process, policies, and procedures to restore critical systems after a catastrophic event
Application failure, data corruption (human error also), network down, failure of connected service, DC down
A DR Plan is a part of a good HA strategy
Invest time and resources to continually plan, prepare, rehearse, document, train, and update processes
One point of responsibility
Real World DR Plan – Dilbert Technical Services
Establish RPO and RTO and know your SLAs
18. Recovery Point Objective (RPO)
[Diagram label: Disaster]
How much data can you lose and still be okay after rollback?
How consistent does data need to be after a rollback?
Larger RPO: data is less critical, lower cost
Smaller RPO: data is more critical, higher cost
20. What’s in a Hot Dog?
Animal organs
Kidney, liver, hearts, etc.
Reproductive organs?
Plastic, glass, bugs, and animal bones
Mechanically Separated Meats
“A paste-like meat product produced by forcing bones, with attached edible meat, under high pressure through a sieve to separate the bone from the edible meat tissue.”
SLAs are like hot dogs
21. Service Level Agreements
The closer to 100% (more 9’s), the more uptime, but the higher the cost and maintenance
Azure has non-cumulative monthly SLAs
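To make "more 9's" concrete, here is a small sketch that converts an SLA percentage into the downtime it still permits. The 30-day month and the function name `allowed_downtime_minutes` are assumptions made for illustration; actual SLA terms define their own measurement periods.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def allowed_downtime_minutes(sla_percent, period_hours=HOURS_PER_MONTH):
    """Minutes of downtime a given SLA percentage still permits in the period."""
    return (1 - sla_percent / 100) * period_hours * 60


for sla in (99.0, 99.9, 99.95, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} minutes per month")
```

Each extra nine cuts the permitted downtime by a factor of ten, which is why cost and maintenance effort climb so quickly.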
22. Compounding of SLAs
Effective availability: considers the SLA of each dependent service and their cumulative effect on total system availability
Windows Azure Compute (2 instances) = 99.95%
SQL Azure Database = 99.9%
Windows Azure Storage = 99.9%
Total allowed downtime per year
4.38 hours + 8.76 hours + 8.76 hours = 21.9 hours
Effective availability: 99.75%
Is that good enough for your app?
Can the effective availability of the SLAs meet the RPO and RTO of your app?
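The compounding arithmetic can be checked with a short sketch. It assumes the three services are serial dependencies (the app is up only when all of them are up) and that their failures are independent, so the availabilities multiply. The SLA figures are the ones quoted on the slide; the variable names are illustrative.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760

# SLA figures as quoted on the slide.
slas = {
    "Windows Azure Compute (2 instances)": 99.95,
    "SQL Azure Database": 99.9,
    "Windows Azure Storage": 99.9,
}

# Serial dependencies: multiply the individual availabilities.
effective = prod(p / 100 for p in slas.values())
downtime_hours = (1 - effective) * HOURS_PER_YEAR

print(f"Effective availability: {effective * 100:.2f}%")
print(f"Allowed downtime: {downtime_hours:.1f} hours per year")
```

Multiplying the availabilities and adding the per-service downtime hours agree to within rounding here, because the overlap between outages is tiny at these percentages.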
24. Azure HA/DR Architecture Concepts
Failure Design
Multi-Site Data Backup/Recovery Strategies
Immediately or eventually consistent systems
FC and Fault Domains
PaaS and IaaS
Windows Azure Traffic Manager
25. Design For Failure
Large scale failures in any Cloud are rare but will happen
Cloud Data Centers don’t magically remove failures
Fabric Controller helps to quickly recover from problems in one DC
Understand RPO/RTO requirements to design for failures
Balance cost and complexity of HADR efforts against risk(s) you’re willing to bear
Cloud has made DR and HA remarkably easy and affordable
Multiple configurations possible with a few clicks
Application owners are ultimately responsible for failure management
Owners of DR Plans and HA strategy
26. Multi-Site Data Recovery Approaches
1. Azure Data Sync Services (PaaS)
Recommended between Azure SQL Database instances only
5-minute minimum replication interval
If you need a lower RPO, you need to build it yourself
Creates clutter in synced databases
2. SQL Server Merge Replication (IaaS)
Two SQL Server databases (IaaS VMs) in two different regions
An update in DB A goes to DB B also, and vice versa
Synchronous transactional operations lock tables and affect performance
3. SQL Server 2012 AlwaysOn Availability Groups (IaaS)
Two SQL Server databases (IaaS VMs) in different regions
Immediate replication in master and its replicas
Non-transactional so no locking or performance degradation
27. 1. Azure Data Sync Services
SQL Azure Database only (pure PaaS)
5-minute minimum replication interval
Transactional and blocking
One way or two way
Not recommended with SQL Server
[Diagram: Azure SQL Database <-> Data Sync Services <-> Azure SQL Database]
28. 2. SQL Server Merge Replication/Azure IaaS VMs
Two databases in two different Regions in IaaS VMs
An update in DB A goes to DB B, and vice versa
Synchronous transactional operations lock tables and affect performance
[Diagram: Azure IaaS VM with SQL Server 1 (Database A) and Azure IaaS VM with SQL Server 2 (Database B), with transactional sync from A to B and from B to A]
29. 3. SQL Server 2012 AlwaysOn Availability Groups
Two databases in two different Regions in IaaS VMs
Immediate replication in master and its replicas
Non-transactional so no locking or performance degradation
[Diagram: two Azure IaaS VMs running SQL Server 2012, a Master DB and a Replica DB, linked by AlwaysOn (non-blocking) synchronization]
30. Consistency Models
Immediately consistent systems
Traditional Synchronous pattern of all at once
Can hurt performance with locking/blocking
Possibly lose something at failure and recovery
The “C” in ACID
Transactional consistency to all affected data based upon rules, triggers, constraints
Eventually consistent systems
Asynchronous patterns using durable queues
Nothing lost in recovery
The ability to recreate system after failure
Improves fault tolerance in systems
Customer may not need to see immediate updates
Posts to Twitter/Facebook
DB may have some inconsistencies at any point in time
All nodes eventually consistent when all updates are done
Both have a role in HADR based upon RTO and RPO
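A toy sketch of the eventually consistent pattern described above: writes land on a primary at once and reach a secondary later through a queue. The in-memory `deque` only stands in for a durable queue, and the class and method names are hypothetical; a real system would use a persistent queue service so that pending updates survive a failure.

```python
from collections import deque


class Replica:
    """A trivial key-value store standing in for one copy of the data."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class EventuallyConsistentStore:
    """Writes hit the primary at once; the secondary catches up from a queue."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = deque()  # stand-in for a durable queue

    def write(self, key, value):
        self.primary.apply(key, value)
        self.pending.append((key, value))  # replicated later, not now

    def drain(self):
        """Apply queued updates in order; afterwards both replicas agree."""
        while self.pending:
            self.secondary.apply(*self.pending.popleft())


store = EventuallyConsistentStore()
store.write("order-1", "placed")
store.write("order-1", "shipped")
lagging = dict(store.secondary.data)  # secondary has not seen the writes yet
store.drain()
print(lagging, store.secondary.data)
```

Between `write` and `drain` the two replicas disagree, which is the window of inconsistency the slide mentions; once the queue is empty, they converge.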
31. Fault Domains - PaaS
“A fault domain is a set of hardware components – computers, switches, and more – that share a single point of failure.”
Can’t control FDs – given by Azure
Fault Domains do not span data centers
FC provisions multiple role instances across Fault Domains
FC monitors Fault Domains to reduce localized failures
Upon failure FC enforces SLA and re-provisions instances
32. Fault Domains – IaaS Virtual Machines
“A fault domain is a set of hardware components – computers, switches, and more – that share a single point of failure.”
VM Availability Sets
Different Fault Domains/Racks
Azure locates VMs in different fault domains to prevent localized failure
Required for 99.95% VM SLA
Ex. Web & SQL Server
33. Windows Azure Traffic Manager (WATM)
Automated priority of routing
1. Failover
2. Performance
3. Round-robin
Gives a new DNS prefix for users
Key point – You decide if your failover domain is dormant or active while NOT in failover mode
WATM rolls over regardless of whether the site is up or down
You need to manage if failover domain is active or dormant
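The three routing methods listed above can be sketched as simple selection rules over a set of endpoints. This is a conceptual simulation only, not the Traffic Manager API: the `Endpoint` type, its fields, the endpoint names, and the latency figures are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Endpoint:
    name: str
    healthy: bool
    latency_ms: float  # hypothetical measured latency from the caller


def failover(endpoints):
    """Return the first healthy endpoint in priority (list) order."""
    return next((e for e in endpoints if e.healthy), None)


def performance(endpoints):
    """Return the healthy endpoint with the lowest latency."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.latency_ms, default=None)


def make_round_robin():
    """Return a picker that cycles through the healthy endpoints."""
    counter = count()

    def pick(endpoints):
        healthy = [e for e in endpoints if e.healthy]
        return healthy[next(counter) % len(healthy)] if healthy else None

    return pick


endpoints = [
    Endpoint("us-west", healthy=False, latency_ms=40),
    Endpoint("us-east", healthy=True, latency_ms=90),
    Endpoint("europe", healthy=True, latency_ms=70),
]
print(failover(endpoints).name)     # first healthy endpoint in priority order
print(performance(endpoints).name)  # lowest-latency healthy endpoint
```

In this sketch an unhealthy primary is simply skipped; whether the secondary site is already running (active) or has to be started (dormant) remains the application owner's decision, as the slide notes.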
35. HA/DR Types and Terms
Mostly PaaS concepts with a bit of IaaS
Example: home phone
1. Cold
Backup has nothing active, pre-loaded, or updated
Least expensive and slowest recovery
Ex. Have to go out and buy new home phone
2. Warm or Passive
Backup has some parts loaded/current and others made active upon failure
Ex. Home phone at house but still packed and not charged
3. Hot or Active
Backup is loaded and ready to receive load upon failure but not active
Ex. Home phone with charged battery but not plugged into home circuit
4. Highly Available
Backup is loaded and active and receiving load as part of normal processing
Most expensive and quickest recovery
Ex. Home phone with charged battery and plugged into home phone circuit
44. HA/DR Checklist for Risk Mitigation
1. Conduct a risk assessment for each application
Each can have different requirements.
Some applications are more critical than others
Justify extra cost to architect them for disaster recovery
Use this information to define the RTO and RPO for each application.
2. Design for failure starting with the application architecture
3. Implement best practices for high availability
Balancing cost, complexity, and risk
4. Implement disaster recovery plans and processes.
5. Establish backup strategies for all reference and transactional data.
6. Consider failures that span the module level all the way to a complete Cloud outage.
7. Choose a multi-site disaster recovery architecture.
45. General HA Best Practices
Avoid single points of failure
Always place (at least) one of each component (load balancers, app servers, databases, …) in at least two regions or fault domains
Maintain sufficient capacity to absorb region/fault domain failures
Reserved Instances (hot) – guarantee capacity is available in a separate region/cloud
Replicate data across clouds/regions for failover
Set up monitoring, alerts, and operations to identify problems and automate problem resolution or the failover process
Design stateless applications for resilience to reboot / relaunch
46. Summary
Plan and design for failure
Work with business and IT - RPO and RTO
Understand cumulative SLAs
Implement correct HA/DR Architectures
Best Practices and Checklist
Start with some DR strategy and improve continually
47. Resources
Disaster Recovery and High Availability for Windows Azure Applications
Mike McKeown and Hanu Kommalapati
http://msdn.microsoft.com/en-us/library/dn251004.aspx
Contingency Planning Guide for Information Technology Systems
National Institute of Standards and Technology
https://www.fismacenter.com/sp800-34.pdf
Failsafe: Guidance for Resilient Cloud Architectures
Marc Mercuri, Ulrich Homann, and Andrew Townhill
http://msdn.microsoft.com/en-us/library/windowsazure/jj853352.aspx
Business Continuity for Windows Azure
Patrick Wickline, Adam Skewgar, Walter Myers III
http://msdn.microsoft.com/en-us/library/windowsazure/hh873027.aspx