OpenStack Tokyo Talk: Application Data Protection Service
1. OpenStack Summit Tokyo 2015
Cloud DR Orchestration: Beyond volume replication
Wang Hao, Software Engineer, Huawei IT Product Line
Eran Gampel, Cloud Chief Architect, Huawei European Research Center
Oshrit Feder, IBM Research - Haifa
2. Agenda
Why do we need disaster recovery?
Replication in Cinder
Hypervisor-based DR
ADPaaS: Project Smaug
Demo
3. Why do we need disaster recovery?
Customers want 24x7 service availability
Hardware Failures
Human Error
Accidents and Natural Disasters
5. Status of Replication in Cinder
Icehouse summit: design summit on volume replication
Juno release: first implementation merged upstream, with support in the IBM Storwize/SVC driver
Liberty release: version 2 of replication, improved to be more widely usable by other backend devices; no driver supports it yet
6. Use Case of Replication
The main use of volume replication is resiliency in the presence of failures.
[Diagram: OpenStack with Cinder in DC#1 and DC#2; data replication between the storage backends of the two data centers]
10. Replication Solution Types
Case in point: Hardware vs. Hypervisor
[Diagram: at the hardware level, volumes are replicated directly between source and target storage arrays; at the hypervisor level, IO mirroring and a replication agent in the hypervisor replicate from the source VM to the target]
11. Another choice: Hypervisor DR
[Diagram: production site and DR site, each with a DR Manager, OpenStack, a hypervisor host, storage, and a Virtual Replication Gateway (VRG), connected over a WAN. On the production site an IO Mirror captures writes from the protected VMs; on the DR site a Write Agent applies the replicated data. Legend: OpenStack® component, new component, vendor component, protected VM, control path, data path]
12. Hypervisor DR: IO Mirroring
[Diagram: IO flow from the Guest OS on the production site to the DR site]
Production site: the IO Mirror captures IO commands, writes them as normal (returning the write ACK and IO completion to the guest), and places them in the IO replication queue; the VRG performs IO forwarding, compression, and encryption.
DR site: the VRG caches, decompresses, and decrypts the stream; the Write Agent parses the IO, writes it, and returns the write ACK.
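The forwarding step above can be sketched as a round trip; this is an illustrative Python model (the function names and the use of zlib for compression are assumptions, not the actual VRG implementation, and encryption is elided):

```python
import zlib

def production_vrg_forward(io_payload: bytes) -> bytes:
    """Production-site VRG: compress (and, in a real system, encrypt)
    the captured IO before forwarding it over the WAN."""
    return zlib.compress(io_payload)

def dr_vrg_receive(wire_data: bytes) -> bytes:
    """DR-site VRG: decompress (and decrypt) the stream so the
    Write Agent can parse and apply the IO."""
    return zlib.decompress(wire_data)

# A captured write from the IO Mirror makes the round trip intact:
captured_io = b"write sector=42 data=..."
assert dr_vrg_receive(production_vrg_forward(captured_io)) == captured_io
```

The point of the sketch is that compression and encryption happen at the VRG on both sides, so the IO Mirror and Write Agent only ever see plain IO.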
13. Hypervisor DR: IO Mirroring State Machine
States: Setup connection with vRG → CBT data replication → Consistency check → Queue data replication → Finished/Stop
Transitions: queue overflow switches back to CBT data replication; CBT done switches to queue data replication; a host abnormal restart or a swap (re-protect) triggers a consistency check.
15. Hypervisor DR: HW (Array) vs. Hypervisor
Comparison criteria between HW array replication and hypervisor replication:
Multi-vendor / hardware agnostic
No impact on compute performance
No special network/storage privileges
No special admin skillset required
Transparent deduplication
Virtualization-ready
Cross-VM consistency grouping support
Cross-array consistency group support
16. Multiple Use Cases, Multiple Protection Plans
Users need to be able to choose the right protection plan
Vendors need a way to plug in different implementations
22. Case in point: Typical 3-tier Cloud App
[Diagram: a project with a router connecting three networks: a Web Net (security group, Web Srv 1 and Web Srv 2), an App Net (App Server), and a DB Net (DB Server); the servers boot from images and some attach volumes]
25. Smaug: Mission Statement
Formalize Application Data Protection in OpenStack
APIs, Services, Plugins, …
Be able to protect any resource in OpenStack (as well as its dependencies)
Allow Diversity of vendor solutions, capabilities and
implementations without compromising usability
26. Smaug: Highlights
Open Architecture
Vendors create plugins that implement Protection mechanisms for different
OpenStack resources
User perspective: Protect App Deployment
Configure and manage custom protection plans on the deployed resources
(topology, VMs, volumes, images, …)
Admin perspective: Define Protectable Resources
Decide what plugins protect which resources, what is available for the user
Decide where users can protect their resources
27. Smaug: Application Data Protection as a Service
What is protected? (Protected Resources)
How to protect? (Protection Plans)
Where to protect? (Protection Banks)
What was protected? (Protection Transactions)
Who protects? (Protection Providers)
[Diagram: the Plan API, Protection Resource API, Protection Transaction API, and Bank API front a pluggable Plan Enforcer Service, which orchestrates the Resource Protection Service backed by Resource Protection Plugins and a Bank/Vault]
28. Overview
What is protected? (Protected Resources): VM, Image, Volume, Topology
How to protect? (Protection Plans): a Protection Plan has a Name, ID, Protected Resource, Trigger (Manual, Time, or Event), Retries, Bank, and Options
Who protects? (Protection Providers): per-resource protection plugins (VM, Image, Volume, Topology) implement the Protection API — Protect, Restore, Verify, OptionSchema, ResultsSchema; a Volume Protection Plugin can work via Backup, Replication, or Snapshot
Where to protect? (Protection Banks): a Bank/Vault backed by Swift, S3, Cinder, Nova, …
What was protected? (Protection Transactions): a Ledger of ProtectionTransactions; plugins implement the Bank API (Read, Write)
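The plugin contract on this slide (Protect, Restore, Verify, plus the Bank's Read/Write) might look roughly like this in Python; the class and method names follow the slide, but the signatures and the toy snapshot plugin are assumptions, not actual Smaug code:

```python
import abc

class ProtectionPlugin(abc.ABC):
    """Hypothetical sketch of a Smaug resource protection plugin."""

    @abc.abstractmethod
    def protect(self, resource_id: str, options: dict) -> dict: ...

    @abc.abstractmethod
    def restore(self, resource_id: str, options: dict) -> dict: ...

    @abc.abstractmethod
    def verify(self, resource_id: str) -> bool: ...

class Bank:
    """Hypothetical bank: key/value storage backed by e.g. Swift or S3."""
    def __init__(self):
        self._vault = {}
    def write(self, key: str, value: bytes):
        self._vault[key] = value
    def read(self, key: str) -> bytes:
        return self._vault[key]

class SnapshotVolumePlugin(ProtectionPlugin):
    """Toy volume plugin that 'protects' by storing data in the bank."""
    def __init__(self, bank: Bank):
        self.bank = bank
    def protect(self, resource_id, options):
        self.bank.write(resource_id, b"snapshot-data")
        return {"resource": resource_id, "status": "protected"}
    def restore(self, resource_id, options):
        return {"data": self.bank.read(resource_id)}
    def verify(self, resource_id):
        return resource_id in self.bank._vault

bank = Bank()
plugin = SnapshotVolumePlugin(bank)
print(plugin.protect("vol-1", {})["status"])  # protected
print(plugin.verify("vol-1"))  # True
```

The open-architecture idea is that vendors ship their own `ProtectionPlugin` implementations while the Plan Enforcer and Bank APIs stay stable.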
29. Help us Build Smaug – Join the project
https://launchpad.net/smaug
IRC (gampel)
eran.gampel@huawei.com
oshritf@il.ibm.com
30. Demo Time
Video -- Application DR with IBM Cloud Manager
References
Paris summit talk & demo
European FP7 ORBIT Research project
IBM Cloud Manager with OpenStack
Service continuity
Hardware can fail, sometimes
People make mistakes, sometimes
Natural Calamities, or cataclysmic events (like fire, tornado, etc.)
Replication is for critical data and has a relatively short lifespan
Backup has a longer lifespan, but is snapshot-based, so your RPO is not as good
Cinder replication workflow:
1. The cloud admin creates a volume type with capabilities:replication="<is> True"
2. End users use this volume type to create volumes
3. The Cinder scheduler chooses a backend that supports replication
4. According to the configuration in cinder.conf, the driver chooses a replication target device, creates the replica, and sets up replication between the two volumes
5. If replication is enabled in the driver, the replication status is updated in the driver's periodic report task
6. When a disaster happens, the cloud admin fails over a replicating volume to its secondary via the "failover_replication" API (promoting the replica)
7. Users can then use those volumes in the secondary data center with its storage; as part of the fail-back process, replication is re-enabled between the primary and secondary volumes
8. The cloud admin can also enable/disable replication on a replication-capable volume for some use cases, like maintenance
9. The cloud admin can also query a volume for a list of configured replication targets
Users can test replication by creating a volume with --source-replica
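The scheduling step above (matching the replication extra spec against backend capabilities) can be illustrated with a minimal sketch; this is not actual Cinder scheduler code, and the backend names are made up:

```python
def supports_replication(capabilities: dict) -> bool:
    """The volume type's extra spec capabilities:replication="<is> True"
    matches backends that report replication=True in their capabilities."""
    return capabilities.get("replication") is True

# Illustrative capability reports from two hypothetical backends
backends = {
    "lvm-1": {"replication": False},
    "storwize-1": {"replication": True},
}

eligible = [name for name, caps in backends.items()
            if supports_replication(caps)]
print(eligible)  # ['storwize-1']
```

Only the backends in `eligible` would be candidates for placing a volume of the replicated type.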
IO Mirror state machine:
CBT (Changed Block Tracking) replication: based on a bitmap of changed blocks
Queue replication: in this state, the user can create a snapshot of the replication data
Start: set up the connection with the Virtual Replication Gateway, then perform the initial replication
After a normal host restart, data queued during shutdown is written to disk using the CBT bitmap
CBT data replication: once the CBT bitmap is clear, proceed to queue-based replication
If the queue overflows, switch back to CBT
On a host abnormal restart or a swap (re-protect), do a consistency check and then CBT data replication
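The transitions above can be sketched as a small table-driven state machine; the state and event names are paraphrased from the slide, not taken from any real implementation:

```python
# Hypothetical sketch of the IO Mirror state machine (names paraphrased)
TRANSITIONS = {
    ("setup_connection", "connected"): "cbt_replication",
    ("cbt_replication", "cbt_done"): "queue_replication",
    ("queue_replication", "queue_overflow"): "cbt_replication",
    ("queue_replication", "abnormal_restart"): "consistency_check",
    ("queue_replication", "swap_reprotect"): "consistency_check",
    ("consistency_check", "check_done"): "cbt_replication",
    ("queue_replication", "stop"): "finished",
}

def next_state(state: str, event: str) -> str:
    # Events with no defined transition leave the state unchanged
    return TRANSITIONS.get((state, event), state)

state = "setup_connection"
for event in ["connected", "cbt_done", "queue_overflow", "cbt_done"]:
    state = next_state(state, event)
print(state)  # queue_replication
```

Modeling the overflow path as an explicit transition back to CBT replication captures the key property of the design: queue-based replication is the steady state, and CBT is the catch-up mechanism.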
Install and configure a hypervisor with replication capabilities.
The DR admin creates a Protected Group for VMs in the dashboard
The DR admin can define the Protection Policy (encryption, compression, RPO, etc.)
When the admin creates the protected group, replication starts and the IO Mirror sends IO data to the VRG
The DR admin creates a Recovery Plan for fail-over, replication testing, and fail-back
When disaster happens, the DR admin chooses the fail-over recovery plan, using a snapshot or the newest data at the DR site
The DR admin can use re-protect to swap the production site and DR site; the system will then replicate data from the new production site to the new DR site
If fail-back is needed, the DR admin chooses the recovery plan that makes the data consistent between the production site and the DR site
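The Protected Group and Protection Policy the DR admin configures could be modeled minimally like this; the field names are illustrative, not a product API:

```python
from dataclasses import dataclass, field

@dataclass
class ProtectionPolicy:
    # RPO in seconds: how much data loss is tolerable
    rpo_seconds: int = 300
    compression: bool = True
    encryption: bool = True

@dataclass
class ProtectedGroup:
    name: str
    vm_ids: list = field(default_factory=list)
    policy: ProtectionPolicy = field(default_factory=ProtectionPolicy)

# A group of two web VMs protected with a tight 60-second RPO
group = ProtectedGroup("web-tier", ["vm-1", "vm-2"],
                       ProtectionPolicy(rpo_seconds=60))
print(group.policy.rpo_seconds)  # 60
```

Keeping the policy separate from the group mirrors the workflow above: the same policy (RPO, compression, encryption) can be reused across protected groups.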
So… what do we need?
Is data only storage?
If it were so, we would just need Data Protection.
For example… (move slide)
We start by defining the API and the service frameworks