This document discusses Exchange Server disaster recovery and high availability solutions, including cluster continuous replication (CCR), standby continuous replication (SCR), local continuous replication (LCR), and single copy clusters (SCC). It explains how each solution works, when to use each one, the advantages and disadvantages of CCR versus SCC, and the basics of how continuous replication functions in Exchange. It also covers transport dumpster redelivery, lost log resilience, and circular logging.
1. Disaster Recovery and Mailbox High Availability Solutions
2. Agenda
Solutions for Disaster Recovery
Mailbox Server High Availability
CCR and SCR: Better Together
Why CCR? Why not SCC?
Continuous Replication Demystified
4. Solutions for Disaster Recovery
Deleted Item Retention – default 14 days
Deleted Mailbox Retention – default 30 days (a retention configuration sketch follows this list)
Mailbox Service and Data Recovery
Server Recovery
Setup /m:RecoverServer
Setup /recoverCMS
Database portability
Dial tone portability
Continuous replication
Backup and Restore
Legacy streaming ESE backups
Volume Shadow Copy Service (VSS) backups
Recovery Storage Groups, alternate restores
Edge Transport Server Cloned Configuration
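The two retention windows above are configurable per mailbox database; a minimal Exchange Management Shell sketch (the database identity is hypothetical):

# Keep deleted items for 14 days and deleted mailboxes for 30 days
Set-MailboxDatabase "MBX01\SG1\DB1" -ItemRetention 14.00:00:00 -MailboxRetention 30.00:00:00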
5. Solutions for Disaster Recovery
Augment built-in solutions with other processes
Configuration Management
Server build standardization
Server build documentation
Change management
Release management
Proactive monitoring
Detailed recovery plans
Regular integrity checks
Regular practice drills
6. Server Recovery
Setup /m:recoverServer
All roles except Edge
Fresh install and ImportEdgeConfig for Edge
All custom settings on Client Access server must be recreated
Restrictions: Can’t use this for…
repairing a failed setup
migrating between different operating systems
recovering or un-clustering a clustered mailbox server
Setup /recoverCMS
For CCR and SCC only
Restrictions: Can’t use this for…
changing from CCR to SCC or vice versa
migrating between different operating systems
clustering a standalone Mailbox server
splitting or merging clustered Exchange environments
Does not trigger Transport Dumpster redelivery
Windows Server 2003 clustering has a dependency on the PDC Emulator
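For reference, a minimal sketch of both recovery modes, run from the installation media on the replacement server (the CMS name and IP address are illustrative):

Setup.com /m:RecoverServer
Setup.com /RecoverCMS /CMSName:EXCMS01 /CMSIPAddress:10.0.1.50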
7. Data Recovery
Switch to a replicated copy (Activation)
Passive copy (LCR/CCR)
Target copy (SCR)
Restore from backup
Same server
Database portability on an alternate server (a portability sketch follows this list)
Database portability from Windows Server 2003 to Windows Server 2008 has an initial performance impact
Dial tone and data merge using RSG
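A minimal database portability sketch in the Exchange Management Shell, assuming the database files have already been restored to the alternate server (all identities are hypothetical):

# Allow the restored files to overwrite the target database, then mount it
Set-MailboxDatabase "MBX02\SG1\DB1" -AllowFileRestore:$true
Mount-Database "MBX02\SG1\DB1"
# Rehome the user mailboxes to the new database (configuration only; no data is moved)
Get-Mailbox -Database "MBX01\SG1\DB1" | Move-Mailbox -ConfigurationOnly -TargetDatabase "MBX02\SG1\DB1"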
9. Mailbox Server High Availability
Built-in features for various levels of availability
Local Continuous Replication (LCR) – data availability
Single Copy Cluster (SCC) – service availability
Cluster Continuous Replication (CCR) – data and service availability
Standby Continuous Replication (SCR) – disaster recovery and site resilience
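A minimal sketch of enabling the replication-based options from the Exchange Management Shell (the storage group identity and standby server name are hypothetical; SCR requires Exchange 2007 SP1):

# LCR: add a local passive copy of a storage group (default copy paths)
Enable-StorageGroupCopy "MBX01\SG1"
# SCR: replicate the same storage group to a standby server
Enable-StorageGroupCopy "MBX01\SG1" -StandbyMachine SCR01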
15. CCR and SCR: Better Together
CCR provides high availability for Mailbox data and services within the datacenter
SCR replicates data remotely to provide site resilience for the Mailbox data
[Diagram: Datacenter A (CCR) replicating via SCR to Datacenter B]
17. CCR local / SCR to remote Site
[Diagram: the CCR cluster is local in Datacenter A; the SCR target is in remote Datacenter B]
18. CCR/SCR vs SCC/Sync – 2 sites
[Diagram: Datacenter A and Datacenter B under a physical corruption scenario]
CCR/SCR: log corruption is detected immediately on replication at both targets; recover with Setup /recoverCMS and play the logs forward.
SCC with synchronous replication: physical corruption goes undetected and is replicated (e.g., surfacing one month later). On a site failure in the primary site, if the corruption was not detected and corrected from a test failover, or on a full storage or site failure where the corruption is detected, Exchange disaster recovery or 3rd-party failover must recover from backup (VSS clones).
20. Why CCR? Why not SCC?
CCR SCC
Single Point None when stretched across Data, Storage and Site single points of failure
sites or combined with SCR for Potential for massive data loss on single failure:
of Failure • Storage device failures can lose collocated backups
site resiliency
• Hardware replication can propagate physical errors
• Storage failure requires activation of remote copy if
one exists
• Requires two VSS clones plus a remote copy of data
to achieve RPO equal to CCR
Simplicity Simple setup Shared storage
• No special storage Storage configuration before and after forming
configuration cluster
Built-in Site Resilience Complex storage stack
Same technology and Complex deployment to get RTO/RPO of 1 CCR
redundancy model for intra- cluster
and inter-site protection
20
21. Why CCR? Why not SCC?
Backups
CCR: Backups off the passive copy eliminate or reduce the backup window
SCC: Backups must be taken off the active copy
TCO
CCR: Reduced TCO
• Cheaper hardware
• No special storage expertise required
• In-the-box solution
• Integrated management
• Single operations team
• Reduced backup cost
SCC: Higher TCO
• Additional products needed to achieve the equivalent combined RTO/RPO
• Separate management tools for HA operations may be required
• Higher-end servers and storage required
• Storage expertise needed
Large Mailboxes
CCR: Great RTO/RPO, simplicity, no maintenance window, and reduced TCO → improved support for larger mailboxes
SCC: Higher TCO and long recovery times constrain mailbox size
22. Why CCR? Why not SCC?
CCR column: stretched CCR, or CCR + SCR. SCC column: SCC + SCR/3rd-party replication + 2 VSS clones (what is needed to approach the combined RTO/RPO of one CCR cluster).
RTO
Server failure: CCR ~2 minutes; SCC ~2 minutes
Data or LUN failure: CCR ~2 minutes; SCC 15 minutes – 1 hour
Full storage failure: CCR ~2 minutes; SCC ~15 minutes with synchronous replication, days with VSS clones only
Site failure: CCR ~2 minutes for stretched CCR, 30-60 minutes for CCR + SCR; SCC ~15 minutes with synchronous replication, days with VSS clones only
RPO
Server failure: CCR 0 for mail* (appointment, contact, task, and draft items can be lost); SCC 0 (uses the same copy of data)
Physical corruption, DB: CCR 0; SCC hours to days with synchronous replication, point in time with VSS
Physical corruption, logs: CCR 0 (must reseed the passive copy); SCC N/A if the log is not needed, same as DB if needed
DB LUN dies: CCR 0; SCC 0 with synchronous replication, point in time with VSS clones
Log LUN dies: CCR 0 for mail* (appointment, contact, task, and draft items can be lost); SCC 0 with synchronous replication, point in time with VSS clones
Full storage failure: CCR 0 for mail*; SCC 0 with synchronous replication, hours to days with VSS clones only
Site failure: CCR same as server failure for stretched CCR, 1 log** for CCR + SCR; SCC 0 with synchronous replication, hours to days with VSS clones
* Assumes following best practice guidance for the Transport Dumpster
** Assumes replication is keeping up
23. Why CCR? Why not SCC?
Logical Corruption (corruption caused by the application)
Logical corruption is replicated by all replication solutions
SCR with lagged replay can mitigate it if detected early
SCC: no mechanism to detect database corruption on a copy replicated by 3rd-party solutions
SCC: no mechanism to detect log corruption on a copy replicated by 3rd-party solutions
Physical Corruption
With hardware-based replication, the deeper stack can lead to corruption caused by:
HBA driver/firmware
Multi-path driver
Server hardware
FC switch firmware
Storage controller firmware/OS
Target storage controller firmware/OS
25. Basic Replication Pipeline
[Diagram: the Store writes to the source DB and the source log directory; the Log Copier copies closed logs to the inspector directory; the Log Inspector validates them into the replica log directory; the Log Replayer replays them into the target DB]
26. Continuous Replication Basics
When the current log file is closed, it is copied to the replication target by the Replication service
Replication service:
at source: creates read-only shares for the log directory
at target: reads from the shares and pulls a copy of the log file
contains a ReplicaInstance for each storage group
Configuration is discovered from Active Directory (every 30 seconds for LCR/CCR, every 3 minutes for SCR)
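A quick way to observe replication health per storage group from the Exchange Management Shell (the identity is hypothetical; the property names shown are from the SP1 output and may vary by version):

Get-StorageGroupCopyStatus "MBX01\SG1" | Format-List SummaryCopyStatus,CopyQueueLength,ReplayQueueLength,LastInspectedLogTime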
27. Continuous Replication Basics
Communication is done via logs, the registry, the cluster database, and RPC
Logs: replicate database changes and backup status
Registry: used in LCR and SCR; also used in CCR for checkpointing the current log generation value for loss calculation
Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
RPCs: the target Replication service makes RPCs into the Store to coordinate log truncation
28. Lost Log Resilience (LLR)
Designed to minimize the need to reseed after a lossy failover
Normally, database changes are written to the log file prior to the database, and the database can be updated as soon as the change is logged
LLR modifies this behavior by delaying updates to the database until one or more additional log generations have been created
Utilizes a new log stream marker called the waypoint:
Minimum log required to prevent database divergence
No modifications after the waypoint have been written to the database
29. Log Stream Markers
Committed: Log generation 20
Checkpoint: Log generation 2
Waypoint: Log generation 10
What this means:
Only logs 2-10 are needed
Logs 11-20 can be discarded
Initiating FILE DUMP mode...
Database: priv1.edb
...
State: Dirty Shutdown
Log Required: 2-10 (0x2-0xA)
Log Committed: 0-20 (0x0-0x14)
...
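A header dump like the one above can be produced with the ESE utilities against a dismounted database; a sketch (the file path is hypothetical):

eseutil /mh D:\SG1\priv1.edb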
30. NodeA and NodeB: lossy failover and recovery
[Diagram: log generations 1-21 on NodeA and NodeB, with the checkpoint at generation 4 and the waypoint at generation 12]
Healthy CCR
NodeA fails and a failover to NodeB occurs
Validate that the database can mount: logs lost < AutoDatabaseMountDial
Logs are generated on NodeB (beyond generation 21)
NodeA recovers and performs a divergence check
NodeA performs an incremental reseed and copies logs
Healthy CCR
31. When Do I Need A Full Reseed?
Rarely
Lost log past the current waypoint
Admin accepted a large amount of loss by running Restore-StorageGroupCopy
Automatic mount while LLR was “not honored”
Automatic lossy mount with a “stale” loss window calculation
Log corruption prior to log replay
ESE cannot skip over logs
Database files modified outside of the Store or Replication service
E.g., offline defrag, eseutil /r
(A full reseed sketch follows this list.)
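A full reseed is performed with Update-StorageGroupCopy; a minimal sketch, assuming a hypothetical identity (depending on service pack, replication may resume automatically after seeding):

Suspend-StorageGroupCopy "MBX01\SG1"
# Delete the existing passive copy files and reseed from the active copy
Update-StorageGroupCopy "MBX01\SG1" -DeleteExistingFiles
Resume-StorageGroupCopy "MBX01\SG1"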
32. Transport Dumpster
Hub Transport servers retain messages that have been delivered to the destination mailbox until a size or time limit is reached
The transport dumpster is maintained per storage group, per Hub Transport server, for servers in the same Active Directory site as the storage group
Transport dumpster statistics:
Get-StorageGroupCopyStatus -DumpsterStatistics
Output:
DumpsterServersNotAvailable: {HUB1}
DumpsterStatistics: {HUB2(2/25/2009 10:20:37 PM; 2; 1032KB)}
34. How much data loss can the transport dumpster mitigate?
18 MB of dumpster per storage group on 8 Hub Transport servers = 144 MB per storage group
[20 MB per user per 10-hour day] × [100 users per SG] = 200 MB of message traffic per SG in one hour
Putting the above two together: 60 min × 144 MB / 200 MB ≈ 43.2 minutes' worth of data
In 43.2 minutes, 144+ logs are created per SG
Customize the transport dumpster size/time limit:
Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
No time window guarantees: if there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for the destination storage group(s) on a given Hub Transport server
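To verify the limits configured with Set-TransportConfig above (a quick check):

Get-TransportConfig | Format-List MaxDumpsterSizePerStorageGroup,MaxDumpsterTime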
35. When CCR detects a lossy failover:
Expands the loss window by 12 hours back and 4 hours forward
Finds all Hub Transport servers in the local Active Directory site
Requests transport dumpster redelivery from all detected servers
New servers are not added to the redelivery list
Inaccessible servers: CCR retries the same request every 30 seconds until the configured MaxDumpsterTime
If multiple lossy failovers take place, the new loss window is added to the previous one
Restore-StorageGroupCopy on LCR is a one-time request, with no retries
Redelivery is not triggered as part of Setup /recoverCMS
There is no other way to redeliver messages from the transport dumpster
36. Redundant Networks
Used for log shipping and seeding in CCR
Enabled with Enable-ContinuousReplicationHostName
Seeding over a specific network:
Update-StorageGroupCopy -DataHostNames:Host1,Host2
Get-ClusteredMailboxServerStatus reports:
OperationalReplicationHostNames:
FailedReplicationHostNames:
InUseReplicationHostNames:
Watch out for a misconfigured hosts file
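A hedged sketch of enabling a redundant log shipping network on a CCR node; the node name, host name, and address are hypothetical, and the parameter set should be verified against your service pack's documentation:

# Register a dedicated replication host name for the target node
Enable-ContinuousReplicationHostName -TargetMachine NODEB -HostName NODEB-REPL -IPv4Address 10.0.1.12
# Direct seeding traffic over the redundant network
Update-StorageGroupCopy "EXCMS01\SG1" -DataHostNames:NODEB-REPL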
37. Circular Logging
One configuration setting with two consumers:
Store service: requires the database to be dismounted and re-mounted for the setting to take effect
Replication service: picks up the new setting dynamically
In CCR, it's no big deal to switch between on/off/on
In some configurations, logs are deleted prematurely
Example: turn off circular logging, then enable LCR without a dismount/mount of the database
ESE is still doing log truncation with circular logging logic
Logs will get truncated before making it to the LCR copy
To be safe, follow this recipe (sketched below):
Suspend, dismount, change the setting, mount, resume
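A minimal Exchange Management Shell sketch of that recipe (storage group and database identities are hypothetical):

Suspend-StorageGroupCopy "MBX01\SG1"
Dismount-Database "MBX01\SG1\DB1"
# Change the setting (here: turning circular logging off)
Set-StorageGroup "MBX01\SG1" -CircularLoggingEnabled:$false
Mount-Database "MBX01\SG1\DB1"
Resume-StorageGroupCopy "MBX01\SG1"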
DB portability between different OS versions – watch out for the performance impact! An upgrade of the operating system underneath an Exchange database results in an update to the OS Version value in the database header. This update triggers the rebuilding of internal database indexes. When using database portability to move a database from a Mailbox server running Windows Server 2003 to a Mailbox server running Windows Server 2008, the Extensible Storage Engine (ESE) will detect the operating system upgrade and take the following actions:
- During the first database mount operation, all secondary indexes are discarded. A secondary index is used to provide a specific view of the mailbox data (for example, when messages in a mail folder are sorted using Outlook in Online Mode). The database will not be mounted and available to clients until this initial operation is complete. The amount of time it takes is largely dependent on the size of the database: the larger the database, the longer the mount operation will take.
- Secondary indexes will be rebuilt on demand as Outlook users sort their views in Online Mode. In environments with large or extremely large databases, the on-demand rebuilding of indexes will initially result in high processor and disk utilization.
This illustrates why we believe CCR is a better solution than SCC. When you lose a database in SCC, you can recover by restoring a VSS clone, but that clone is a point-in-time restore (it could be 10 minutes old, 30 minutes old, etc., depending on how frequently backups occur).
Storage failures for SCC can involve both the storage holding the data and the storage hosting the VSS clones. Typically this is the same storage, so when it fails, you need to use remote data to recover.
This is a summary showing the RTO and RPO for these two solutions. You can see that to achieve the same RTO/RPO as CCR, an SCC solution also needs to be extended with replication technology, as well as hardware-based VSS clones (at least two).

RTO for Data/LUN failure is 15 minutes – 1 hour: while 3rd-party solutions can activate a VSS clone quickly, the Exchange server still has to be brought up, and recovery (playing the logs forward) still has to be run once the clone has been activated. This can take several minutes to over an hour, depending on the log backup regimen.

RPO: for CCR, the normal RPO can't really be measured by time; the items that can be lost are the items that don't go through Transport. If a deployment has synchronous replication and no geo-clustering, then activating the copy is a manual DR process (expose the LUNs, go through the Exchange DR/database portability steps). The Exchange server may or may not be pre-built, depending on the SLA and how much idle hardware a customer can afford. Geo-clustered synchronous replication solutions are almost always failed over manually (automatic failover between sites is a big deal for customers, and they prefer to hit the "big red button"). RTO is typically ~15 minutes if all works correctly.

RPO when the log LUN dies: if the log LUN dies, the DB becomes unclean. Jet can't shut down, and all un-flushed writes to the DB are lost, leaving the DB in a bad state. As a result, recovery must be run, but it can't be, since the log LUN is dead; thus, the DB is also lost. If the logs have been synchronously replicated and the replicated copy of the logs is good, it can be used to recover the DB. However, if the log LUN was lost because of physical corruption in the logs, which gets replicated to the log LUN's replicated copy, then the only option is to recover from a backup.
- Polls and uses file system notifications to detect a new log in a directory
- The Log Inspector verifies that the log is safe to replay (3rd-party sync replication cannot provide this type of replicated data verification for logs):
  - Checksum
  - Is this log for this log stream?
  - Recopy on failure
If the shares already exist, they will not be re-created. If permissions on the shares are messed up, remove the shares manually and cycle the Replication service. Different ReplicaInstance types cannot co-exist.
Log Required indicates that some transactions haven't been committed (some pages may have been written to disk; others may not have been). The checkpoint is the minimum log that is needed in order to perform recovery. The waypoint is the maximum log needed for recovery, i.e., the last log file with log records that have potentially been written to the physical database. The committed generation is the last log file generated by ESE for the particular storage group.
Logically speaking, the dumpster is a property of the storage group, not the storage group copy. Loss calculation: now minus the last log inspected. Dumpster resubmit request: 12 hours before the loss and 1 hour after the loss window.

No, it cannot grab extra space: every SG has a maximum dumpster size dedicated to that specific SG. Messages are stored only once but are counted against multiple SGs if they happen to be in each SG's dumpster. Example: Msg1 is delivered to both SG1 and SG2, so it counts against the dumpster quota for both SGs. Say SG1 receives lots of messages and has to drop Msg1 from its dumpster; Msg1 is still on the Hub server because it is included in SG2's dumpster. When a dumpster resubmit request comes for SG1, Msg1 will get resubmitted because it happens to be on the server. This is not guaranteed, though.
- This traffic amounts to around 3 (= 20/6 ≈ 3.3) logs per minute per SG, given 1 MB log files.