This document discusses Exchange Server disaster recovery and high availability solutions, including cluster continuous replication (CCR), standby continuous replication (SCR), local continuous replication (LCR), and single copy clusters (SCC). It explains how each solution works, when to use each one, the advantages and disadvantages of CCR versus SCC, and the basics of how continuous replication functions in Exchange. It also covers transport dumpster redelivery, lost log resilience, and circular logging.
1. Disaster Recovery and Mailbox High Availability Solutions
2. Agenda
Solutions for Disaster Recovery
Mailbox Server High Availability
CCR and SCR: Better Together
Why CCR? Why not SCC?
Continuous Replication Demystified
4. Solutions for Disaster Recovery
Deleted Item Retention – default 14 days
Deleted Mailbox Retention – default 30 days (a retention configuration sketch follows this list)
Mailbox Service and Data Recovery
Server Recovery
Setup /m:RecoverServer
Setup /recoverCMS
Database portability
Dial tone portability
Continuous replication
Backup and Restore
Legacy streaming ESE backups
Volume Shadow Copy Service (VSS) backups
Recovery Storage Groups, alternate restores
Edge Transport Server Cloned Configuration
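The two retention windows above are configurable per mailbox database; a minimal Exchange Management Shell sketch (the database identity is hypothetical):

# Keep deleted items for 14 days and deleted mailboxes for 30 days
Set-MailboxDatabase "MBX01\SG1\DB1" -ItemRetention 14.00:00:00 -MailboxRetention 30.00:00:00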
5. Solutions for Disaster Recovery
Augment built-in solutions with other processes
Configuration Management
Server build standardization
Server build documentation
Change management
Release management
Proactive monitoring
Detailed recovery plans
Regular integrity checks
Regular practice drills
6. Server Recovery
Setup /m:recoverServer
All roles except Edge
Fresh install and ImportEdgeConfig for Edge
All custom settings on Client Access server must be recreated
Restrictions: Can’t use this for…
repairing a failed setup
migrating between different operating systems
recovering or un-clustering a clustered mailbox server
Setup /recoverCMS
For CCR and SCC only
Restrictions: Can’t use this for…
changing from CCR to SCC or vice versa
migrating between different operating systems
clustering a standalone Mailbox server
splitting or merging clustered Exchange environments
Does not trigger Transport Dumpster redelivery
Windows Server 2003 clustering has a dependency on the PDC Emulator
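For reference, a minimal sketch of both recovery modes, run from the installation media on the replacement server (the CMS name and IP address are illustrative):

Setup.com /m:RecoverServer
Setup.com /RecoverCMS /CMSName:EXCMS01 /CMSIPAddress:10.0.1.50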
7. Data Recovery
Switch to a replicated copy (Activation)
Passive copy (LCR/CCR)
Target copy (SCR)
Restore from backup
Same server
Database portability on an alternate server (a portability sketch follows this list)
Database portability from Windows Server 2003 to Windows Server 2008 has an initial performance impact
Dial tone and data merge using RSG
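A minimal database portability sketch in the Exchange Management Shell, assuming the database files have already been restored to the alternate server (all identities are hypothetical):

# Allow the restored files to overwrite the target database, then mount it
Set-MailboxDatabase "MBX02\SG1\DB1" -AllowFileRestore:$true
Mount-Database "MBX02\SG1\DB1"
# Rehome the user mailboxes to the new database (configuration only; no data is moved)
Get-Mailbox -Database "MBX01\SG1\DB1" | Move-Mailbox -ConfigurationOnly -TargetDatabase "MBX02\SG1\DB1"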
9. Mailbox Server High Availability
Built-in features for various levels of availability
Local Continuous Replication (LCR) – data availability
Single Copy Cluster (SCC) – service availability
Cluster Continuous Replication (CCR) – data and service availability
Standby Continuous Replication (SCR) – disaster recovery and site resilience
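A minimal sketch of enabling the replication-based options from the Exchange Management Shell (the storage group identity and standby server name are hypothetical; SCR requires Exchange 2007 SP1):

# LCR: add a local passive copy of a storage group (default copy paths)
Enable-StorageGroupCopy "MBX01\SG1"
# SCR: replicate the same storage group to a standby server
Enable-StorageGroupCopy "MBX01\SG1" -StandbyMachine SCR01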
15. CCR and SCR: Better Together
CCR provides high availability for Mailbox data and services within the datacenter
SCR replicates data remotely to provide site resilience for the Mailbox data
[Diagram: Datacenter A (CCR) replicating via SCR to Datacenter B]
17. CCR local / SCR to remote Site
[Diagram: the CCR cluster is local in Datacenter A; the SCR target is in remote Datacenter B]
18. CCR/SCR vs SCC/Sync – 2 sites
[Diagram: Datacenter A and Datacenter B under a physical corruption scenario]
CCR/SCR: log corruption is detected immediately on replication at both targets; recover with Setup /recoverCMS and play the logs forward.
SCC with synchronous replication: physical corruption goes undetected and is replicated (e.g., surfacing one month later). On a site failure in the primary site, if the corruption was not detected and corrected from a test failover, or on a full storage or site failure where the corruption is detected, Exchange disaster recovery or 3rd-party failover must recover from backup (VSS clones).
20. Why CCR? Why not SCC?
CCR SCC
Single Point None when stretched across Data, Storage and Site single points of failure
sites or combined with SCR for Potential for massive data loss on single failure:
of Failure • Storage device failures can lose collocated backups
site resiliency
• Hardware replication can propagate physical errors
• Storage failure requires activation of remote copy if
one exists
• Requires two VSS clones plus a remote copy of data
to achieve RPO equal to CCR
Simplicity Simple setup Shared storage
• No special storage Storage configuration before and after forming
configuration cluster
Built-in Site Resilience Complex storage stack
Same technology and Complex deployment to get RTO/RPO of 1 CCR
redundancy model for intra- cluster
and inter-site protection
20
21. Why CCR? Why not SCC?
Backups
CCR: Backups off the passive copy eliminate or reduce the backup window
SCC: Backups must be taken off the active copy
TCO
CCR: Reduced TCO
• Cheaper hardware
• No special storage expertise required
• In-the-box solution
• Integrated management
• Single operations team
• Reduced backup cost
SCC: Higher TCO
• Additional products needed to achieve the equivalent combined RTO/RPO
• Separate management tools for HA operations may be required
• Higher-end servers and storage required
• Storage expertise needed
Large Mailboxes
CCR: Great RTO/RPO, simplicity, no maintenance window, and reduced TCO → improved support for larger mailboxes
SCC: Higher TCO and long recovery times constrain mailbox size
22. Why CCR? Why not SCC?
CCR column: stretched CCR, or CCR + SCR. SCC column: SCC + SCR/3rd-party replication + 2 VSS clones (what is needed to approach the combined RTO/RPO of one CCR cluster).
RTO
Server failure: CCR ~2 minutes; SCC ~2 minutes
Data or LUN failure: CCR ~2 minutes; SCC 15 minutes – 1 hour
Full storage failure: CCR ~2 minutes; SCC ~15 minutes with synchronous replication, days with VSS clones only
Site failure: CCR ~2 minutes for stretched CCR, 30-60 minutes for CCR + SCR; SCC ~15 minutes with synchronous replication, days with VSS clones only
RPO
Server failure: CCR 0 for mail* (appointment, contact, task, and draft items can be lost); SCC 0 (uses the same copy of data)
Physical corruption, DB: CCR 0; SCC hours to days with synchronous replication, point in time with VSS
Physical corruption, logs: CCR 0 (must reseed the passive copy); SCC N/A if the log is not needed, same as DB if needed
DB LUN dies: CCR 0; SCC 0 with synchronous replication, point in time with VSS clones
Log LUN dies: CCR 0 for mail* (appointment, contact, task, and draft items can be lost); SCC 0 with synchronous replication, point in time with VSS clones
Full storage failure: CCR 0 for mail*; SCC 0 with synchronous replication, hours to days with VSS clones only
Site failure: CCR same as server failure for stretched CCR, 1 log** for CCR + SCR; SCC 0 with synchronous replication, hours to days with VSS clones
* Assumes following best practice guidance for the Transport Dumpster
** Assumes replication is keeping up
23. Why CCR? Why not SCC?
Logical Corruption (corruption caused by the application)
Logical corruption is replicated by all replication solutions
SCR with lagged replay can mitigate it if detected early
SCC: no mechanism to detect database corruption on a copy replicated by 3rd-party solutions
SCC: no mechanism to detect log corruption on a copy replicated by 3rd-party solutions
Physical Corruption
With hardware-based replication, the deeper stack can lead to corruption caused by:
HBA driver/firmware
Multi-path driver
Server hardware
FC switch firmware
Storage controller firmware/OS
Target storage controller firmware/OS
25. Basic Replication Pipeline
[Diagram: the Store writes to the source DB and the source log directory; the Log Copier copies closed logs to the inspector directory; the Log Inspector validates them into the replica log directory; the Log Replayer replays them into the target DB]
26. Continuous Replication Basics
When the current log file is closed, it is copied to the replication target by the Replication service
Replication service:
at source: creates read-only shares for the log directory
at target: reads from the shares and pulls a copy of the log file
contains a ReplicaInstance for each storage group
Configuration is discovered from Active Directory (every 30 seconds for LCR/CCR, every 3 minutes for SCR)
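A quick way to observe replication health per storage group from the Exchange Management Shell (the identity is hypothetical; the property names shown are from the SP1 output and may vary by version):

Get-StorageGroupCopyStatus "MBX01\SG1" | Format-List SummaryCopyStatus,CopyQueueLength,ReplayQueueLength,LastInspectedLogTime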
27. Continuous Replication Basics
Communication is done via logs, the registry, the cluster database, and RPC
Logs: replicate database changes and backup status
Registry: used in LCR and SCR; also used in CCR for checkpointing the current log generation value for loss calculation
Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
RPCs: the target Replication service makes RPCs into the Store to coordinate log truncation
28. Lost Log Resilience (LLR)
Designed to minimize the need to reseed after a lossy failover
Normally, database changes are written to the log file prior to the database, and the database can be updated as soon as the change is logged
LLR modifies this behavior by delaying updates to the database until one or more additional log generations have been created
Utilizes a new log stream marker called the waypoint:
Minimum log required to prevent database divergence
No modifications after the waypoint have been written to the database
29. Log Stream Markers
Committed: Log generation 20
Checkpoint: Log generation 2
Waypoint: Log generation 10
What this means:
Only logs 2-10 are needed
Logs 11-20 can be discarded
Initiating FILE DUMP mode...
Database: priv1.edb
...
State: Dirty Shutdown
Log Required: 2-10 (0x2-0xA)
Log Committed: 0-20 (0x0-0x14)
...
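A header dump like the one above can be produced with the ESE utilities against a dismounted database; a sketch (the file path is hypothetical):

eseutil /mh D:\SG1\priv1.edb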
30. NodeA and NodeB: lossy failover and recovery
[Diagram: log generations 1-21 on NodeA and NodeB, with the checkpoint at generation 4 and the waypoint at generation 12]
Healthy CCR
NodeA fails and a failover to NodeB occurs
Validate that the database can mount: logs lost < AutoDatabaseMountDial
Logs are generated on NodeB (beyond generation 21)
NodeA recovers and performs a divergence check
NodeA performs an incremental reseed and copies logs
Healthy CCR
31. When Do I Need A Full Reseed?
Rarely
Lost log past the current waypoint
Admin accepted a large amount of loss by running Restore-StorageGroupCopy
Automatic mount while LLR was “not honored”
Automatic lossy mount with a “stale” loss window calculation
Log corruption prior to log replay
ESE cannot skip over logs
Database files modified outside of the Store or Replication service
E.g., offline defrag, eseutil /r
(A full reseed sketch follows this list.)
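A full reseed is performed with Update-StorageGroupCopy; a minimal sketch, assuming a hypothetical identity (depending on service pack, replication may resume automatically after seeding):

Suspend-StorageGroupCopy "MBX01\SG1"
# Delete the existing passive copy files and reseed from the active copy
Update-StorageGroupCopy "MBX01\SG1" -DeleteExistingFiles
Resume-StorageGroupCopy "MBX01\SG1"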
32. Transport Dumpster
Hub Transport servers retain messages that have been delivered to the destination mailbox until a size or time limit is reached
The transport dumpster is maintained per storage group, per Hub Transport server, for servers in the same Active Directory site as the storage group
Transport dumpster statistics:
Get-StorageGroupCopyStatus -DumpsterStatistics
Output:
DumpsterServersNotAvailable: {HUB1}
DumpsterStatistics: {HUB2(2/25/2009 10:20:37 PM; 2; 1032KB)}
34. How much data loss can the transport dumpster mitigate?
18 MB of dumpster per storage group on 8 Hub Transport servers = 144 MB per storage group
[20 MB per user per 10-hour day] × [100 users per SG] = 200 MB of message traffic per SG in one hour
Putting the above two together: 60 min × 144 MB / 200 MB ≈ 43.2 minutes' worth of data
In 43.2 minutes, 144+ logs are created per SG
Customize the transport dumpster size/time limit:
Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
No time window guarantees: if there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for the destination storage group(s) on a given Hub Transport server
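To verify the limits configured with Set-TransportConfig above (a quick check):

Get-TransportConfig | Format-List MaxDumpsterSizePerStorageGroup,MaxDumpsterTime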
35. When CCR detects a lossy failover:
Expands the loss window by 12 hours back and 4 hours forward
Finds all Hub Transport servers in the local Active Directory site
Requests transport dumpster redelivery from all detected servers
New servers are not added to the redelivery list
Inaccessible servers: CCR retries the same request every 30 seconds until the configured MaxDumpsterTime
If multiple lossy failovers take place, the new loss window is added to the previous one
Restore-StorageGroupCopy on LCR is a one-time request, with no retries
Redelivery is not triggered as part of Setup /recoverCMS
There is no other way to redeliver messages from the transport dumpster
36. Redundant Networks
Used for log shipping and seeding in CCR
Enabled with Enable-ContinuousReplicationHostName
Seeding over a specific network:
Update-StorageGroupCopy -DataHostNames:Host1,Host2
Get-ClusteredMailboxServerStatus reports:
OperationalReplicationHostNames:
FailedReplicationHostNames:
InUseReplicationHostNames:
Watch out for a misconfigured hosts file
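A hedged sketch of enabling a redundant log shipping network on a CCR node; the node name, host name, and address are hypothetical, and the parameter set should be verified against your service pack's documentation:

# Register a dedicated replication host name for the target node
Enable-ContinuousReplicationHostName -TargetMachine NODEB -HostName NODEB-REPL -IPv4Address 10.0.1.12
# Direct seeding traffic over the redundant network
Update-StorageGroupCopy "EXCMS01\SG1" -DataHostNames:NODEB-REPL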
37. Circular Logging
One configuration setting with two consumers:
Store service: requires the database to be dismounted and re-mounted for the setting to take effect
Replication service: picks up the new setting dynamically
In CCR, it's no big deal to switch between on/off/on
In some configurations, logs are deleted prematurely
Example: turn off circular logging, then enable LCR without a dismount/mount of the database
ESE is still doing log truncation with circular logging logic
Logs will get truncated before making it to the LCR copy
To be safe, follow this recipe (sketched below):
Suspend, dismount, change the setting, mount, resume
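A minimal Exchange Management Shell sketch of that recipe (storage group and database identities are hypothetical):

Suspend-StorageGroupCopy "MBX01\SG1"
Dismount-Database "MBX01\SG1\DB1"
# Change the setting (here: turning circular logging off)
Set-StorageGroup "MBX01\SG1" -CircularLoggingEnabled:$false
Mount-Database "MBX01\SG1\DB1"
Resume-StorageGroupCopy "MBX01\SG1"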
DB portability between different OS versions – watch out for the performance impact! An upgrade of the operating system underneath an Exchange database results in an update to the OS Version value in the database header. This update triggers the rebuilding of internal database indexes. When using database portability to move a database from a Mailbox server running Windows Server 2003 to a Mailbox server running Windows Server 2008, the Extensible Storage Engine (ESE) will detect the operating system upgrade and take the following actions:
- During the first database mount operation, all secondary indexes are discarded. A secondary index is used to provide a specific view of the mailbox data (for example, when messages in a mail folder are sorted using Outlook in Online Mode). The database will not be mounted and available to clients until this initial operation is complete. The amount of time it takes is largely dependent on the size of the database: the larger the database, the longer the mount operation will take.
- Secondary indexes will be rebuilt on demand as Outlook users sort their views in Online Mode. In environments with large or extremely large databases, the on-demand rebuilding of indexes will initially result in high processor and disk utilization.
This illustrates why we believe CCR is a better solution than SCC. When you lose a database in SCC, you can recover by restoring a VSS clone, but that clone is a point-in-time restore (it could be 10 minutes old, 30 minutes old, etc., depending on how frequently backups occur).
Storage failures for SCC can involve both the storage holding the data and the storage hosting the VSS clones. Typically this is the same storage, so when it fails, you need to use remote data to recover.
This is a summary showing the RTO and RPO for these two solutions. You can see that to achieve the same RTO/RPO as CCR, an SCC solution also needs to be extended with replication technology, as well as hardware-based VSS clones (at least two).

RTO for Data/LUN failure is 15 minutes – 1 hour: while 3rd-party solutions can activate a VSS clone quickly, the Exchange server still has to be brought up, and recovery (playing the logs forward) still has to be run once the clone has been activated. This can take several minutes to over an hour, depending on the log backup regimen.

RPO: for CCR, the normal RPO can't really be measured by time; the items that can be lost are the items that don't go through Transport. If a deployment has synchronous replication and no geo-clustering, then activating the copy is a manual DR process (expose the LUNs, go through the Exchange DR/database portability steps). The Exchange server may or may not be pre-built, depending on the SLA and how much idle hardware a customer can afford. Geo-clustered synchronous replication solutions are almost always failed over manually (automatic failover between sites is a big deal for customers, and they prefer to hit the "big red button"). RTO is typically ~15 minutes if all works correctly.

RPO when the log LUN dies: if the log LUN dies, the DB becomes unclean. Jet can't shut down, and all un-flushed writes to the DB are lost, leaving the DB in a bad state. As a result, recovery must be run, but it can't be, since the log LUN is dead; thus, the DB is also lost. If the logs have been synchronously replicated and the replicated copy of the logs is good, it can be used to recover the DB. However, if the log LUN was lost because of physical corruption in the logs, which gets replicated to the log LUN's replicated copy, then the only option is to recover from a backup.
- Polls and uses file system notifications to detect a new log in a directory
- The Log Inspector verifies that the log is safe to replay (3rd-party sync replication cannot provide this type of replicated data verification for logs):
  - Checksum
  - Is this log for this log stream?
  - Recopy on failure
If the shares already exist, they will not be re-created. If permissions on the shares are messed up, remove the shares manually and cycle the Replication service. Different ReplicaInstance types cannot co-exist.
Log Required indicates that some transactions haven't been committed (some pages may have been written to disk; others may not have been). The checkpoint is the minimum log that is needed in order to perform recovery. The waypoint is the maximum log needed for recovery, i.e., the last log file with log records that have potentially been written to the physical database. The committed generation is the last log file generated by ESE for the particular storage group.
Logically speaking, the dumpster is a property of the storage group, not the storage group copy. Loss calculation: now minus the last log inspected. Dumpster resubmit request: 12 hours before the loss and 1 hour after the loss window.

No, it cannot grab extra space: every SG has a maximum dumpster size dedicated to that specific SG. Messages are stored only once but are counted against multiple SGs if they happen to be in each SG's dumpster. Example: Msg1 is delivered to both SG1 and SG2, so it counts against the dumpster quota for both SGs. Say SG1 receives lots of messages and has to drop Msg1 from its dumpster; Msg1 is still on the Hub server because it is included in SG2's dumpster. When a dumpster resubmit request comes for SG1, Msg1 will get resubmitted because it happens to be on the server. This is not guaranteed, though.
- This traffic amounts to around 3 (= 20/6 ≈ 3.3) logs per minute per SG, given 1 MB log files.