In today's sophisticated IT Cloud world, how do I fuse multiple technologies, products, and clouds together to create an integrated High Availability, Disaster Recovery, and Business Continuity (HA/DR/BC) IT solution in 2012? This session complements product-specific and overview HA/DR/BC sessions by providing a proven, product-agnostic methodology to architect such a solution, including petabyte-level considerations. We provide a pragmatic, industry-proven, step-by-step methodology and toolset that you can use directly with clients to a) crisply elicit and distill HA/DR/BC requirements, b) efficiently organize and map those requirements, c) design an integrated, multi-product, phased-approach IT HA/DR/BC solution that properly combines backup/restore software, tape, tape libraries, de-dup, point-in-time and continuous disk replication, and storage virtualization products, and d) provide a template to clearly communicate the solution and gain consensus
across multiple levels of operations and management. John Sing is the author of three IBM Redbooks, including
SG24-6547-03, IBM System Storage Planning for Business Continuity. My only request when referencing this material in your work is that you give full credit to me, John Sing, and IBM as the authors of this material, research, and methodology. That having been said, please spread the good word.
1. Architect’s Guide to Designing Integrated
Multi-Product HA-DR-BC Solutions
John Sing, Executive Strategy, IBM Session E10
1
2. John Sing • 31 years of experience with IBM in high end servers, storage, and
software
– 2009 - Present: IBM Executive Strategy Consultant: IT Strategy and Planning, Enterprise
Large Scale Storage, Internet Scale Workloads and Data Center Design, Big Data Analytics,
HA/DR/BC
– 2002-2008: IBM IT Data Center Strategy, Large Scale Systems, Business Continuity,
HA/DR/BC, IBM Storage
– 1998-2001: IBM Storage Subsystems Group - Enterprise Storage Server Marketing
Manager, Planner for ESS Copy Services (FlashCopy, PPRC, XRC, Metro Mirror, Global
Mirror)
– 1994-1998: IBM Hong Kong, IBM China Marketing Specialist for High-End Storage
– 1989-1994: IBM USA Systems Center Specialist for High-End S/390 processors
– 1982-1989: IBM USA Marketing Specialist for S/370, S/390 customers (including VSE and
VSE/ESA)
• singj@us.ibm.com
• IBM colleagues may access my webpage:
– http://snjgsa.ibm.com/~singj/
• You may follow my daily IT research blog
– http://www.delicious.com/atsf_arizona
2
3. Agenda
• Understand today’s challenges and best
practices
– for IT High Availability and IT Business Continuity
• What has changed? What is the same?
• Strategies for:
– Requirements, design, implementation
• Step by step approach
– Essential role of automation
– Accommodating petabyte scale
– Exploiting Cloud
2012 Cloud deployment options
3
4. Agenda
1. Solving Today’s HA-DR-BC Challenges
2. Guiding HA-DR-BC Principles to mitigate chaos
3. Traditional Workloads vs. Internet Scale Workloads
4. Master Vision and Best Practices Methodology
4
5. Recovering today’s real-time massive streaming workflows is challenging
Chart in public domain: IEEE Massive File Storage presentation, author: Bill Kramer, NCSA: http://storageconference.org/2010/Presentations/MSST/1.Kramer.pdf:
5
7. Inter-disciplinary: many options, including many non-traditional alternatives for user deployments, workload hosting, and recovery models
• Traditional alternatives
• Non-traditional alternatives:
– The Cloud, the Developing World
– Other platforms
– Other vendors
Illustrative Cloud examples only; no endorsement is implied or expressed
7
8. Finally, we have this 'little' problem regarding Mobile proliferation
Clayton Christensen, Harvard Business School: http://en.wikipedia.org/wiki/Disruptive_innovation
• From an IT standpoint, we are clearly seeing the "consumerization of IT"
• The key is to recognize and exploit the hyper-pace reality of BYOD's associated data
• Not just the technology
• Also the recovery model ("cloud"), the business model, and the required ecosystem
8
9. So how do we affordably architect HA / BC / DR in 2012?
9
10. What has remained the same?
(Continued good Guiding Principles that mitigate
HA/DR/BC chaos)
Storage Efficiency • Service Management • Data Protection
10
11. The Business Process is still the Recoverable Unit
(Diagram: business processes A through G run on applications 1, 2, and 3, built on WebSphere, MQSeries, DB2, SQL, analytics, and reporting, which in turn run on the IT infrastructure.)
1. An error occurs on a storage device that corrupts a database
2. The error impacts the ability of two or more applications to share critical data
3. The loss of both applications affects two distinctly different business processes
IT Business Continuity must correspondingly recover at the business process level
11
12. Cloud does not change the business process; it is still the recovery unit
(Diagram: the same business processes A through G and application stack as the previous chart, with part of the workload deployed in a Cloud.)
1. Data is input to the cloud
2. A Cloud provider outage occurs
3. The loss of Cloud output affects two distinctly different business processes
Cloud is simply another deployment option, but it doesn't change the fundamental HA/BC approach
12
13. When can Cloud recovery provide extremely fast time to project completion?
• Where entire business process recoverable units can be out-sourced to a Cloud provider
– Production example: out-sourcing production, or backup/restore, or an integrated standalone application, to a provider
– Cloud application-as-a-service (AaaS) example: Salesforce.com, etc.
(Diagram: business processes A through G on the application and infrastructure stack, as in the previous charts.)
13
14. The trick to leveraging Cloud is:
Understanding that Cloud is simply another
(albeit powerful) deployment choice
Good news:
Fundamental principles for HA/DR/BC haven’t changed
It’s only the deployment options that have changed
14
15. Still true: synergistic overlap of valid data protection techniques
IT Data Protection:
1. High Availability: fault-tolerant, failure-resistant, streamlined infrastructure with an affordable cost foundation
2. Continuous Operations: non-disruptive backups and system maintenance coupled with continuous availability of applications
3. Disaster Recovery: protection against unplanned outages such as disasters through reliable, predictable recovery
Protection of critical business data • Operations continue after a disaster • Recovery is predictable and reliable • Costs are predictable and manageable
15
16. Four Stages of Data Center Efficiency: (pre-req’s for HA/BC/DR)
April 2012
http://www-935.ibm.com/services/us/igs/smarterdatacenter.html
http://public.dhe.ibm.com/common/ssi/ecm/en/rlw03007usen/RLW03007USEN.PDF
16
17. Still true: Timeline of an IT Recovery
Telecom bandwidth is still the major delimiter for any fast recovery.
(Timeline diagram: production is running when an outage occurs; the recovery proceeds through assessment, physical facilities, telecom network, operating system, and management control data, then operations, network, and applications staff execute hardware, operating system, and data integrity recovery, then application and transaction integrity recovery, until "Now we're done!")
• Recovery Point Objective (RPO): how much data must be recreated?
• Recovery Time Objective (RTO) of hardware data integrity, followed by the RTO of transaction integrity
17
18. Still true: value of Automation for real-time failover
(The same recovery timeline as the previous chart, compressed by automation: assessment, physical facilities, telecom network, operating system, management control data, hardware recovery, then application and transaction recovery.)
• Recovery Point Objective (RPO): how much data must be recreated?
• RTO of hardware recovery, followed by the RTO of transaction integrity
Value of automation: reliability, repeatability, scalability, frequent testing
18
19. Still true: organize High Availability, Business Continuity technologies by balancing recovery time objective with cost / value
Recovery from a disk image vs. recovery from a tape copy:
• BC Tier 7 – Add server or storage replication with end-to-end automated server recovery
• BC Tier 6 – Add real-time continuous data replication, server or storage
• BC Tier 5 – Add application/database integration to backup/restore
• BC Tier 4 – Add point-in-time replication to backup/restore
• BC Tier 3 – VTL, data de-dup, remote vault
• BC Tier 2 – Tape libraries + automation
• BC Tier 1 – Restore from tape
(Vertical axis: Cost / Value. Horizontal axis, Recovery Time Objective, guidelines only: 15 min., 1-4 hr., 4-8 hr., 8-12 hr., 12-16 hr., 24 hr., days.)
19
20. Still true: Replication Technology Drives RPO
For example:
(Axis: weeks, days, hours, minutes, seconds before the outage = Recovery Point; seconds, minutes, hours, days, weeks after = Recovery Time.)
• Tape backup / periodic replication
• Asynchronous replication
• Synchronous replication / HA
20
21. Still true: Recovery Automation Drives Recovery Time
For example:
(Axis: weeks, days, hours, minutes, seconds before the outage = Recovery Point; seconds, minutes, hours, days, weeks after = Recovery Time. From fastest to slowest: end-to-end automated recovery, clustering, storage automation, manual tape restore.)
Recovery Time includes:
– Fault detection
– Recovering data
– Bringing applications back online
– Network access
(A worked RPO/RTO sketch follows this chart.)
21
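To make the relationship on charts 20 and 21 concrete, here is a minimal, hypothetical sketch in Python: the RPO follows from how far behind the chosen replication technology leaves the copy, while the RTO is the sum of the recovery steps listed above, which automation compresses. All step names, durations, and the automation factor are illustrative assumptions, not figures from this material.

```python
# Hypothetical illustration only: RPO comes from replication lag,
# RTO from summing the recovery steps; all numbers are made up.
replication_lag_minutes = {
    "tape_backup": 24 * 60,         # roughly the last nightly backup
    "asynchronous_replication": 5,  # seconds to minutes behind
    "synchronous_replication": 0,   # no committed data lost
}

recovery_steps_minutes = {          # fault detection .. network access
    "fault_detection": 30,
    "recover_data": 120,
    "bring_applications_online": 90,
    "network_access": 60,
}

def rto(steps, automation_factor=1.0):
    """Total recovery time; automation compresses every manual step."""
    return sum(duration * automation_factor for duration in steps.values())

print("RPO with async replication: %d min" % replication_lag_minutes["asynchronous_replication"])
print("RTO, manual recovery      : %d min" % rto(recovery_steps_minutes))
print("RTO, automated recovery   : %d min" % rto(recovery_steps_minutes, automation_factor=0.1))
```

The point of the sketch is the structure, not the numbers: better replication shrinks only the first quantity, and only automation shrinks the second.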
22. Still true: "ideal world" construct for IT High Availability and Business Continuity
Business processes drive strategies, and they are integral to the Continuity of Business Operations. A company cannot be resilient without having strategies for alternate workspace, staff members, call centers and communications channels.
(Diagram: a Resilience Program Management construct. Business prioritization and integration into IT are sustained by awareness, regular validation, change management, and quarterly management briefings. The program phases are: risk assessment; business impact analysis of outages, threats, and vulnerabilities; estimate of current recovery capabilities and RTO/RPO; program design; strategy; validation; implementation. The program covers crisis team design, business resumption, disaster recovery, and the high availability program, measured by a maturity model, ROI, and roadmap. High Availability design spans: 1. People, 2. Processes, 3. Plans, 4. Strategies, 5. Networks, 6. Platforms, 7. Facilities, across servers, storage, data replication, and database and software design.)
Source: IBM STG, IBM Global Services
22
23. The 2012 Bottom line: IT Business Continuity Planning Steps
For today's real-world environment, we need a faster way than even this simplified 2007 version:
1. Collect information for prioritization
2. Vulnerability, risk assessment, scope
3. Define BC targets based on scope
4. Solution option design and evaluation
5. Recommend solutions and products
6. Recommend strategy and roadmap
2012 key #1: to streamline this "ideal" process, you need a basic Data Strategy
2012 key #2: exploit Workload type
(Background: the same resilience program management construct as the previous chart.)
23
24. Streamlined BC Actions, 2005 version (Input, Step, Output)
1. Collect info for prioritization. Input: scope, resources, business processes, key performance indicators, IT inventory. Output: business impact, component effect on business processes.
2. Vulnerability / risk assessment. Input: list of vulnerabilities. Output: defined vulnerabilities.
3. Define desired HA/BC targets based on scope. Input: existing BC capability, targets, and success rate. Output: defined BC baseline, KPIs, targets, architecture, decision and success criteria.
4. Solution design and evaluation. Input: technologies and solution options. Output: business process segments and solutions.
5. Recommend solutions and products. Input: generic solutions that meet criteria. Output: recommended IBM solutions and benefits.
6. Recommend strategy and roadmap. Input: budget, major project milestones, resource availability, business process priority. Output: baseline Business Continuity strategy, roadmap, benefits, challenges, financial implications and justification.
24
25. Streamlined BC Actions, 2012 version
The same six steps, inputs, and outputs as the 2005 version, with two additions:
1. Collect info for prioritization (do a basic HA/DR Data Strategy here)
2. Vulnerability / risk assessment
3. Define desired HA/BC targets based on scope
4. Solution design and evaluation
5. Recommend solutions and products (exploit Workload Type here)
6. Recommend strategy and roadmap
25
26. How do we get there in 2012?
Bottom line #1: have a basic Data Strategy
Bottom line #2: Exploit Workload type
Storage Efficiency • Service Management • Data Protection
26
27. i.e. #1: It's all about the Data
Now, what do I mean by that?
27
28. What is a basic Data Strategy? Specify data usage over its lifespan
(Diagram: applications create data; information and data management; information archive / retain / delete. Frequency of access and use declines over time. A lifecycle sketch follows this chart.)
28
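As a sketch of what "specify data usage over its lifespan" could look like in practice, here is a hypothetical helper that maps a dataset's age since last access to a lifecycle action. The thresholds and actions are illustrative assumptions only, not recommendations from this material.

```python
from datetime import datetime, timedelta

# Hypothetical lifecycle policy: frequency of access declines over time,
# so older, colder data moves toward archive / retain / delete.
POLICY = [
    (timedelta(days=30),   "active: keep on primary storage (frequently accessed)"),
    (timedelta(days=365),  "manage: move to lower-cost / nearline storage"),
    (timedelta(days=2555), "archive: retain per compliance requirements"),
]

def lifecycle_action(last_access: datetime, now: datetime = None) -> str:
    """Return the lifecycle stage for a dataset based on its last access time."""
    now = now or datetime.utcnow()
    age = now - last_access
    for threshold, action in POLICY:
        if age <= threshold:
            return action
    return "delete: past retention, eligible for disposal"

# Example: a dataset last touched 400 days ago falls into the archive band
print(lifecycle_action(datetime.utcnow() - timedelta(days=400)))
```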
29. Data strategy = collecting information, prioritizing, vulnerability/risk, scope
Business processes drive strategies, and they are integral to the Continuity of Business Operations. A company cannot be resilient without having strategies for alternate workspace, staff members, call centers and communications channels.
(Diagram: the same resilience program management construct as chart 22, with the Data Strategy overlaid on the early phases: risk assessment, business impact analysis, prioritization, and scope.)
Source: IBM STG, IBM Global Services
29
30. Data Strategy Defined
Data Strategy: relationship to the Business and IT Strategies
(Diagram: the Business Strategy, with its business scope, distinct competencies, business governance, organization, infrastructure, and process, drives the IT Strategy, with its technology scope, system IT competencies, IT governance, IT infrastructure and processes. Business Strategies, the IT Strategy, the Data Strategy, the Enterprise IT Architecture, and the IT Infrastructure layer on one another; the IT infrastructure comprises people, process, technology, data, structure, skills, and tools.)
30
31. Data Strategy Defined
The role of the basic "Data Strategy" for HA / BC purposes
• Define major data types "good enough"
– i.e. by major application, by business line...
– An ongoing journey
• For each data type (you have to know your data):
– Usage
– Performance and measurement
– Security
– Availability
– Criticality
– Organizational role
– Who manages it
– What standards apply to this data: what type of storage it is deployed on, what database, what virtualization
• Be pragmatic (and have a basic strategy for it)
– Create a basic, "good enough" data strategy for HA/BC purposes
– Acquire tools that help you know your data
(Sidebar: Business Strategies, IT Strategy, Data Strategy, Enterprise IT Architecture, IT Infrastructure: people, data, process, technology, structure. An example inventory record follows this chart.)
31
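One pragmatic way to capture a "good enough" data strategy entry per data type is a simple inventory record. The sketch below (hypothetical, in Python) mirrors the bullets on this chart; the field names and the example values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class DataTypeRecord:
    """One 'good enough' data-strategy entry for a major data type."""
    name: str               # by major application or business line
    usage: str
    performance: str        # performance and measurement
    security: str
    availability: str       # e.g. target RTO/RPO class
    criticality: str
    organizational_role: str
    owner: str              # who manages it
    storage_standard: str   # what type of storage it is deployed on
    database: str
    virtualization: str

# Hypothetical example entry
order_db = DataTypeRecord(
    name="Order management database",
    usage="OLTP, 24x7",
    performance="Sub-second response, measured hourly",
    security="Customer PII, encrypted at rest",
    availability="RPO = minutes, RTO < 2 hours",
    criticality="Mission critical",
    organizational_role="Revenue capture",
    owner="DBA team with line-of-business sponsor",
    storage_standard="Replicated enterprise disk",
    database="DB2",
    virtualization="Storage virtualization pool A",
)
print(order_db.name, "->", order_db.availability)
```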
32. Here's the major difference for 2012: there are two major types of workloads
• HA, Business Continuity, Disaster Recovery characteristics. Traditional IT workloads: HA/DR/BC can be done "agnostic / after the fact" using replication. Internet Scale workloads: HA/DR/BC must be "designed into the software stack from the beginning".
• Data Strategy. Traditional IT: use traditional tools/concepts to understand / know the data; storage/server virtualization and pooling. Internet Scale: a proven Open Source toolset to implement failure tolerance and redundancy in the application stack.
• Automation. Traditional IT: end-to-end automation of server / storage virtualization. Internet Scale: end-to-end automation of the application software stack providing failure tolerance.
• Commonality. Both: apply the master vision and lessons learned from internet scale data centers.
32
33. Choices for high availability and replication architectures
(Diagram: a production site and a geographic site, each with a geographic load balancer, site load balancer, web server clusters, application / DB server clusters, and disk storage. Options between the sites: workload balancer, application or database replication, server replication, and storage replication, plus local backup, point-in-time image, and tape backup at other site(s).)
33
34. Comparing IT BC architectural methods
(Diagram: the same production site / geographic site architecture as the previous chart, with workload balancer, application / database replication, server replication, and storage replication options, plus local backup, replication, point-in-time image, and tape at multiple sites.)
• Application / database / file system replication / workload balancer (file system, DB, and application aware)
– Typically requires the least bandwidth
– May be required if the scale of storage is very large (i.e. internet scale)
– Span of consistency is that application, database, or file system only
– Well understood by database, application, and file system administrators
– Can be a more complex implementation; must be implemented for each application
• Replication – Server (traditional IT)
– Well understood by operating system administrators
– Storage and application independent; uses server cycles
– Span of recovery is limited to that server platform
• Replication – Storage (traditional IT) (file system, DB, and application agnostic)
– Can provide common recovery across multiple application stacks and multiple server platforms
– Usually requires more bandwidth
– Requires a storage replication skill set
34
36. Internet Scale Workload Characteristics - 1
• Embarrassingly parallel Internet workload
– Immense data sets, but relatively independent records being processed
• Example: billions of web pages, billions of log / cookie / click entries
– Web requests from different users are essentially independent of each other
• Creating natural units of data partitioning and concurrency (i.e. very low inter-process communication; see the sketch after this chart)
• Lends itself well to cluster-level scheduling / load-balancing
– Independence = peak server performance is not important
– What's important is the aggregate throughput of 100,000s of servers
• Workload churn
– Well-defined, stable high-level APIs (i.e. simple URLs)
– Software release cycles on the order of every couple of weeks
• Means Google's entire core of search services was rewritten in 2 years
– Great for rapid innovation
• Expect significant software re-writes to fix problems on an ongoing basis
– New products emerge hyper-frequently
• Often with workload-altering characteristics, example = YouTube
36
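A minimal sketch of the "natural units of data partitioning and concurrency" point above: independent log or click records can be hash-partitioned by user so each worker processes its bucket with no inter-process communication. The records, the number of workers, and the partitioning key are hypothetical.

```python
from collections import defaultdict
import hashlib

# Hypothetical, independent click/log records: no record depends on another
records = [
    {"user": "u1", "url": "/a"}, {"user": "u2", "url": "/b"},
    {"user": "u3", "url": "/a"}, {"user": "u1", "url": "/c"},
]

NUM_WORKERS = 3

def partition(record):
    """Stable hash of the user id -> worker index (a natural unit of concurrency)."""
    digest = hashlib.md5(record["user"].encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS

buckets = defaultdict(list)
for rec in records:
    buckets[partition(rec)].append(rec)

for worker, recs in sorted(buckets.items()):
    # Each worker handles its bucket independently; what matters is
    # the aggregate throughput across all workers, not any single one.
    print("worker %d processes %d records" % (worker, len(recs)))
```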
37. Internet Scale Workload Characteristics - 2
• Platform homogeneity
– A single company owns, has the technical capability for, and runs the entire platform end-to-end, including an ecosystem
– Most Web applications are more homogeneous than traditional IT
– With an immense number of independent worldwide users
• Fault-free operation via application middleware (1% - 2% of all Internet requests fail*)
– Some type of failure every few hours, including software bugs
– All hidden from users by fault-tolerant middleware (users can't tell the difference between the Internet being down and your system being down)
– Means the hardware and software don't have to be perfect; hence 99% is good enough
• Immense scale:
– The workload can't be held within one server, or within a maximum-size, tightly-clustered, memory-shared SMP
– Requires clusters of 1,000s to 10,000s of servers with corresponding PBs of storage, network, power, cooling, and software
– The scale of compute power also makes possible apps such as Google Maps, Google Translate, Amazon Web Services EC2, Facebook, etc.
*The Data Center as a Computer: Introduction to Warehouse Scale Computing, p.81, Barroso, Holzle
http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006
37
38. IT architecture at internet scale
• Internet scale architectures' fundamental assumptions:
– Distributed aggregation of data
– High Availability / failure tolerance functionality is in software on the server
– Time to Market is everything
• Breakage = "OK" if I can insulate it from the user
– Affordability is everything
– Use open source software wherever possible
– Expect that something somewhere in the infrastructure will always be broken
– The infrastructure is designed top-to-bottom to address this
• Criteria: cost, plus the extremes of scale, parallelism, performance, real time, and time to market
• All other criteria are driven off of these
38
39. For Internet Scale workloads: an Open Source based internet-scale software stack
Example shown is the 2003-2008 Google version:
1. Google File System Architecture – GFS II
2. Google Database – Bigtable
3. Google Computation – MapReduce
4. Google Scheduling – GWQ
Reliability and redundancy are all in the "application stack"; the OS or HW doesn't do any of the redundancy
39
40. Internet-scale HA/DR/BC IT infrastructure for Internet Scale workloads
(Diagram: input from the Internet and your customers flows into racks of servers; each red block is an inexpensive server, plenty of power for its portion of the workflow.)
40
41. Warehouse Scale Computer programmer productivity framework example
• Hadoop – overall name of the software stack
• HDFS – Hadoop Distributed File System
• MapReduce – software compute framework (Map = queries, Reduce = aggregates answers; a small sketch follows this chart)
• Hive – Hadoop-based data warehouse
• Pig – Hadoop-based language
• HBase – non-relational database for fast lookups
• Flume – populates Hadoop with data
• Oozie – workflow processing system
• Whirr – libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.
• Avro – data serialization
• Mahout – data mining
• Sqoop – connectivity to non-Hadoop data stores
• BigTop – packaging / interop of all Hadoop components
http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond
41
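To illustrate the "Map = queries, Reduce = aggregates answers" split named above, here is a small local simulation of the MapReduce pattern in plain Python. It is a sketch of the programming model only, not the Hadoop API, and the sample documents are invented.

```python
from collections import defaultdict

# Local, single-process simulation of MapReduce:
# map emits (key, value) pairs, the shuffle groups them by key,
# and reduce aggregates each group.
documents = ["storage replication storage", "backup restore backup backup"]

def map_phase(doc):
    """Map: emit (word, 1) for every word in the document."""
    for word in doc.split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce: aggregate the counts for one word."""
    return (word, sum(counts))

# Shuffle: group the intermediate pairs by key
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))
# [('backup', 3), ('replication', 1), ('restore', 1), ('storage', 2)]
```

On a real cluster the same mapper/reducer pair would run distributed over data in HDFS, which is where the stack's built-in redundancy and scale come from.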
42. Summary: two major types of approaches, depending on workload type
• HA, Business Continuity, Disaster Recovery characteristics. Traditional IT workloads: HA/DR/BC can be done "agnostic / after the fact" using replication. Internet Scale workloads: HA/DR/BC must be "designed into the software stack from the beginning".
• Data Strategy. Traditional IT: use traditional tools/concepts to understand / know the data; storage/server virtualization and pooling. Internet Scale: a proven Open Source toolset to implement failure tolerance and redundancy in the application stack.
• Automation. Traditional IT: end-to-end automation of server / storage virtualization. Internet Scale: end-to-end automation of the application software stack providing failure tolerance.
• Commonality. Both: apply the master vision and lessons learned from internet scale data centers.
42
44. Key strategy: segment data into logical storage pools by appropriate Data Protection characteristics (animated chart)
From mission critical to lower cost:
• Continuous Availability (CA): end-to-end automation enhances RDR
– RTO = near continuous, RPO = as small as possible (Tier 7)
– Priority = uptime, with high-value justification
• Rapid Data Recovery (RDR): enhance backup/restore
– For data that requires it
– RTO = minutes to (approximate range) 2 to 6 hours
– BC Tiers 6, 4
– Balanced priorities = uptime and cost/value
• Backup/Restore (B/R): assure an efficient foundation
– Standardize the base backup/restore foundation
– Provide universal 24-hour to 12-hour (approx.) recovery capability
– Address requirements for archival, compliance, green energy
– Priority = cost
Enabled by virtualization. Know and categorize your data; this provides the foundation for affordable data protection. (A pool-classification sketch follows this chart.)
44
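As a sketch of how the three pools above might be applied, here is a hypothetical classifier that assigns a dataset to Continuous Availability, Rapid Data Recovery, or Backup/Restore from its required RTO. The thresholds simply restate the approximate ranges on this chart; the dataset names and RTO values are invented.

```python
def protection_pool(required_rto_hours: float) -> str:
    """Map a business RTO requirement to a data protection pool.

    Thresholds restate the approximate ranges on this chart:
    near-continuous -> CA (Tier 7), minutes to ~6 hours -> RDR (Tiers 6, 4),
    otherwise the ~12-24 hour backup/restore foundation.
    """
    if required_rto_hours <= 0.25:
        return "Continuous Availability (CA)"
    if required_rto_hours <= 6:
        return "Rapid Data Recovery (RDR)"
    return "Backup/Restore (B/R)"

# Hypothetical datasets and their required RTOs in hours
datasets = {"payments ledger": 0.1, "order history": 4, "marketing archive": 24}
for name, rto in datasets.items():
    print("%-18s -> %s" % (name, protection_pool(rto)))
```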
46. Consolidated virtualized systems become the Recoverable Units for IT Business Continuity (Virtualization)
(Diagram: a virtualized IT infrastructure supporting the business processes.)
Virtualized systems become the resource pools that enable the recoverability
46
47. High Availability, Business Continuity: a step-by-step virtualization journey
Balancing recovery time objective with cost / value; recovery from a disk image vs. recovery from a tape copy:
• BC Tier 7 – Add server or storage replication with end-to-end automated server recovery
• BC Tier 6 – Add real-time continuous data replication, server or storage
• BC Tier 5 – Add application/database integration to backup/restore
• BC Tier 4 – Add point-in-time replication to backup/restore
• BC Tier 3 – VTL, data de-dup, remote vault
• BC Tier 2 – Tape libraries + automation
• BC Tier 1 – Restore from tape
(Vertical axis: Cost / Value. Horizontal axis, Recovery Time Objective: 15 min., 1-4 hr., 4-8 hr., 8-12 hr., 12-16 hr., 24 hr., days.)
Foundation: storage pools
47
48. Storage Pools: apply the appropriate storage technology, and add automated failover to servers and replicated storage
• Real-time replication (storage or server or software)
• Periodic point-in-time replication: file system, point-in-time disk, VTL to VTL with de-dup
• Removable media: foundation backup/restore, physical or electronic transport
• Petabyte unstructured: due to usage and large scale, typically uses application-level intelligent redundancy / failure-toleration design, or file, application, or disk-to-disk periodic replication
48
49. Methodology: Traditional IT HA / BC / DR in stages, from the bottom up
• Foundation: standardized, automated tape backup (Tiers 1, 2)
• Foundation: electronic vaulting, automation, tape library (Tier 3): VTL, de-dup, and remote replication at the tape level (e.g. IBM ProtecTier, IBM Virtual Tape Library, IBM Tivoli Storage Manager backup/restore)
• Add: point-in-time copy, disk to disk, tiered storage (Tier 4) (e.g. IBM FlashCopy, SnapShot; IBM XIV, SVC, DS, SONAS; IBM Tivoli Storage Productivity Center 5.1)
(Diagram: SAN, disk, and VTL/de-dup at two sites; cost rises as the Recovery Time Objective shortens.)
49
50. Methodology: traditional IT HA / BC / DR in stages, from the bottom up (continued)
• Foundation: standardized, automated tape backup (Tiers 1, 2)
• Foundation: electronic vaulting, automation, tape library (Tier 3)
• Add: point-in-time copy, disk to disk for backup/restore (Tier 4)
• Automate applications and database for replication and automation (Tier 5): application integration (e.g. server virtualization, Tivoli FlashCopy Manager)
• Consolidate and implement real-time data availability (Tier 6): data replication between sites; if storage-based, Metro Mirror, Global Mirror, Hitachi UR on XIV, SVC, DS, or other storage, with TPC 5.1
• End-to-end automated site failover of servers, storage, and applications (Tier 7): automated application failover and dynamic server / storage integration (e.g. VMware, PowerHA on Power)
(Diagram: applications, servers, SAN, disk, and VTL/de-dup at two sites with data replication between them; cost rises as the Recovery Time Objective shortens.)
50
51. Technology Deployments in Cloud
(Diagram: a continuum of cloud deployment options, numbered 1 through 5 from private to public, serving Enterprises A, B, and C and their users:
• Private Cloud: client-managed implementation in the enterprise data center; internal or partner cloud services
• Managed Private Cloud: co-lo operated, in the enterprise data center
• Hosted Private Cloud: co-lo owned and operated; consumption models including client-owned and provider-owned assets; delivery options including client premise and hosted; Strategic Outsourcing clients with standardized services
• Shared Cloud Services: standardized, multi-tenant service; pay-per-usage model with provider-owned assets
• Public Cloud Services: provider-owned assets; pay-per-usage; finer granularity in the multi-tenancy model; supporting compute-centric workloads with persistent storage and compute cloud.)
51
52. Cloud as remote site deployment options
Production on premises, recovery in the Cloud:
• Real-time replication (storage or server or software)
• Periodic point-in-time replication: file system, point-in-time disk, VTL to VTL with de-dup, point-in-time copies, physical or electronic transport
• Petabyte unstructured: petabyte-level storage typically uses intelligent file or application replication, due to its large scale and usage patterns
52
53. Virtualized storage and data strategy: automated failover to a remote cloud
• Real-time replication (storage or server or software)
• Periodic point-in-time replication: file system, point-in-time disk, VTL to VTL with de-dup, point-in-time copies, physical or electronic transport, removable media
• Disk-to-disk replication
• Petabyte unstructured: petabyte-level storage typically uses intelligent file or application replication, due to its large scale and usage patterns
53
55. Cloud provider responsibility for HA and BC
Your production in the Cloud, recovery by the Cloud provider:
• Real-time replication (storage or server or software)
• Periodic point-in-time replication: file system, point-in-time disk, VTL to VTL with de-dup, point-in-time copies, physical or electronic transport
• Petabyte unstructured: petabyte-level storage typically uses intelligent file or application replication, due to its large scale and usage patterns
55
56. Today's world: High Availability, Business Continuity is a step-by-step data strategy / workload journey, with Cloud deployment if needed
Balancing recovery time objective with cost / value; recovery from a disk image vs. recovery from a tape copy:
• BC Tier 7 – Add server or storage replication with end-to-end automated server recovery
• BC Tier 6 – Add real-time continuous data replication, server or storage
• BC Tier 5 – Add application/database integration to backup/restore
• BC Tier 4 – Add point-in-time replication to backup/restore
• BC Tier 3 – VTL, data de-dup, remote vault
• BC Tier 2 – Tape libraries + automation
• BC Tier 1 – Restore from tape
(Vertical axis: Cost / Value. Horizontal axis, Recovery Time Objective: 15 min., 1-4 hr., 4-8 hr., 8-12 hr., 12-16 hr., 24 hr., days.)
Foundation: Data Strategy, Workload Types
56
57. Step by Step Virtualization, High Availability, Business Continuity data strategy, with Cloud deployment if needed
Balancing recovery time objective with cost / value; recovery from a disk image vs. recovery from a tape copy:
Continuous Availability:
• BC Tier 7 – Add server or storage replication with end-to-end automated server recovery
Rapid Data Recovery:
• BC Tier 6 – Add real-time continuous data replication, server or storage
• BC Tier 5 – Add application/database integration to backup/restore
• BC Tier 4 – Add point-in-time replication to backup/restore
Backup/Restore:
• BC Tier 3 – VTL, data de-dup, remote vault
• BC Tier 2 – Tape libraries + automation
• BC Tier 1 – Restore from tape
(Vertical axis: Cost / Value. Horizontal axis, Recovery Time Objective: 15 min., 1-4 hr., 4-8 hr., 8-12 hr., 12-16 hr., 24 hr., days.)
Foundation: Data Strategy, Workload types
57
58. Summary – IT High Availability / Business Continuity Best Practices 2012
• Continuous Availability: implement BC Tier 7; standardize the use of Continuous Availability automated failover
• Rapid Data Recovery: implement Tier 6, standardizing the high-volume data replication method; implement Tier 4, standardizing the use of disk-to-disk and point-in-time disk copy
• Backup / Restore: implement Tier 3; consolidate and standardize backup/restore methods; implement tape VTL, data de-dup, server / storage virtualization / management tools, and basic automation
• Production foundation: Backup/Restore Tiers 1, 2; storage and server virtualization and consolidation; understand my data; define the scope of recovery; data strategy; workload types
• Recovery foundation: replicated Backup/Restore Tiers 1, 2; SAN and server virtualization and consolidation; implement remote sites (Tiers 1, 2)
58
59. Summary
• Understand today's best practices for IT High Availability and IT Business Continuity (Data Strategy, Workload types)
• What has changed? What is the same?
– Principles for requirements = no change
• Data Strategy
– Deployment for true internet-scale workloads: application-level redundancy
• Strategies for:
– Requirements, design, implementation
– In-house vs. out-sourcing (Cloud deployment options)
• Step by step approach
– Automation and virtualization are essential
– Segment workloads: traditional vs. petabyte scale
– Exploiting Cloud
59
There are three primary aspects of providing business continuity for key applications and business processes: High Availability, Continuous Operations, and Disaster Recovery. Generally, the higher in the organization, the simpler the term to use. Senior execs are responsible for setting vision and strategy; mid-level management is more responsible for implementation. So you can get in the door with just "Business Continuity" at the senior level, but you need BC plus HA, CO, and DR to get in at the Manager or Director level. "Business Continuity" was preferred by senior IT executives and line-of-business titles. Lower IT titles preferred more detailed naming that spelled out the solution components; they wanted to make it relevant to their more limited responsibilities.
High Availability is the ability to provide access to applications. High availability is often provided by clustering solutions that work with operating systems, coupled with hardware infrastructure that has no single points of failure. If a server that is running an application suffers a failure, the application is picked up by another server in the cluster, and users see minimal or no interruption. Today's servers and storage systems are also built with fault-tolerant architectures to minimize application outages due to hardware failures. In addition, there are many aspects of security embedded in the hardware, from servers to storage to network components, to help protect against unauthorized access. You can think of high availability as resilient IT infrastructure that masks failures, and thus continues to provide access to applications.
Continuous Operations: Sometimes you must take important applications down for purposes of updating files or taking backups. Fortunately, great progress has been made in recent years in technology for online backups, but even with these advances, sometimes applications must be taken down as planned outages for maintenance or upgrading of servers or storage. You can think of continuous operations as the ability to keep things running when everything is working right, where you do not have to take applications down merely to do scheduled backups or planned maintenance.
Disaster Recovery is the ability to recover a data center at a different site if a disaster destroys the primary site or otherwise renders it inoperable. The characteristics of a disaster recovery solution are that processing resumes at a different site, and on different hardware. (A non-disaster problem, such as corruption of a key customer database, may indeed be a catastrophe for a business, but it is not a disaster, in this sense of the term, unless processing must be resumed at a different location and on different hardware.) You can think of disaster recovery as the ability to recover from unplanned outages at a different site, something you do after something has gone wrong.
Fortunately, some of the solutions that you can implement as preparedness for disaster recovery can also help with High Availability and with Continuous Operations. In this way, your investment in disaster recovery can help your operations even if you never suffer a disaster. The goal of business continuity is to protect critical business data, to make key applications available, and to enable operations to continue after a disaster. This must be done in such a way that recovery time is both predictable and reliable, and such that costs are predictable and manageable.
This animated chart is used to organize "who does what" in a recovery, and to define Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Hardware (servers, storage) can only handle the blue portion of the recovery. All the other necessary processes are important; they are just outside the ability of the hardware/servers/storage to control. Hence they should be acknowledged as important, but treated as supplemental topics to be discussed with the Services team, and thus outside the scope of a storage-only or Tivoli-only discussion. It's good to use this chart to help the audience visually organize who does what, in what order, in a recovery.
This animation shows that the previous timeline still applies today. Automation simply makes the multiple steps of the Timeline of an IT Recovery consistent. Automation also provides an affordable way to handle testing and compliance of the Data Protection solution.
In summary, the animation shows the storage pool concept – mapped to the different technologies: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
This slide shows that technology only addresses RPO (i.e. how current is the data?). As we improve the technology, we improve RPO. Notice that RTO (Recovery Time Objective) is not driven by technology. (Next chart)
Here we see that automation drives the RTO (recovery time objective). Automation is what affects the RTO, because it addresses all the non-technology factors that take time.
First, let’s review important IBM 2009 messaging.
Rework title – All Information has a lifespan based on business value
Client Issue: How will technologies evolve to meet the needs of business continuity planning? Strategic Planning Assumption: Data replication for disaster recovery will increase in large enterprises from 25 percent in 2004 to 75 percent by 2006 (0.7 probability).
Example of Application / Database replication: DB2 Queue Replication URL: http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0503aschoff/
*The Data Center as a Computer: Introduction to Warehouse Scale Computing, p.81 Barroso, Holzle http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006
Speed of Decision Making: Data volumes have a major effect on "time to analysis" (i.e., the elapsed time between data reception, analysis, presentation, and decision-maker activities). There are four architectural options (i.e., CEP, OLTP/ODS, EDW, and big data), and big data is most appropriate when addressing slow decision cycles that are based on large data volumes. CEP's requirement for processing hundreds or thousands of transactions per second requires that the decision making be automated using models or business rules. OLTP and ODS support the operational reporting function in which decisions are made at human speed and based on recent data. The EDW — with the time to integrate data from disparate operational systems, process transformations, and compute aggregations — supports historic trend analysis and forecasting. Big data analysis enables the analysis of large volumes of data — larger than can be processed within the EDW — and so supports long-term/strategic and one-off transactional and behavioral analysis.
Processing Complexity: Processing complexity is the inverse of the speed of decision making. In general, CEP has a relatively simple processing model, although CEP often includes the application of behavioral models and business rules that require complex processing on historic data occurring in the EDW or big data analytics phases of the data-processing pipeline. The requirement to process unstructured data at real-time speeds — for example, in surveillance and intelligence applications — is changing this model. Processing complexity increases through OLTP, ODS, and EDW. Two trends are emerging: OLTP is beginning to include an analytics component within the business process and to utilize in-database analytics, and the EDW is exploiting the increasing computational power of the database engine. Processing complexities, and the associated data volumes, are so high within the big data analytics phase that parallel processing is the preferred architectural and algorithmic pattern.
Transactional Data Volumes: Transactional data volume is the amount of data (either the number of records/events or event size) processed within a single transaction or analysis operation. Modern internet-scale IT architectures process a huge number of discrete base events to compute sophisticated, high-value output. OLTP is similarly concerned with transactional or atomic events. Analysis, with its requirement to process many records simultaneously, starts with ODS, and its complexity grows within the EDW. Big data analytics — with the requirement to model long-term trends and customer behavior on Web clickstream data — processes even larger transactional data volumes.
Data Structure: The prevalence of non-structured data (semi-, quasi-, and unstructured) increases as the data-processing pipeline is traversed from CEP to big data. The EDW layer is increasingly becoming more heterogeneous as other, often non-structured, data sources are required by the analysis being undertaken. This is having a corresponding effect on processing complexity. The mining of structured data is advanced, and systems and products are optimized for this form of analysis. The mining of non-structured data (e.g., text analytics and image processing) is less well understood, computationally expensive, and often not integrated into the many commercially available analysis tools and packages. One of the primary uses of big data analysis is processing Web clickstream data, which is quasi-structured.
In addition, the data is not stored within databases; rather, it is collected and stored within files. Some examples of non-structured data that fit the big data definition include: log files, clickstream data, shopping cart data, social media data, call or support center logs, and telephone call data records (CDRs). There is an increasing requirement to process unstructured data at real-time speeds — for example in surveillance and intelligence applications — so this class of data is becoming more important in CEP processing.
Flexibility of Processing/Analysis: Data management stakeholders understand the processing and scheduling requirements of transactional processing and operational reporting. The stakeholder's ability to build analysis models is well proven. Peaks and troughs commonly occur across various time intervals (e.g., overnight batch processing window or peak holiday period), but these variations have been studied through trending and forecasting. Big data analysis and a growing percentage of EDW processing are ad hoc or one-off in nature. Data relationships may be poorly understood and require experimentation to refine the analysis. Big data analysis models ("analytic heroes") are continually being challenged by new or refined models ("challengers") to see which has better performance or yields better accuracy. The flexibility of such processing is high, and conversely, the governance that can be applied to such processing is low.
Throughput: Throughput, a measure of the degree of simultaneous execution of transactions, is high in transactional and reporting processing. The high data volumes and complex processing that characterize big data analysis are often hardware constrained and have a low concurrency. The scheduling of big data analysis processing is not time-critical. Big data analysis is therefore not suitable for real-time or near-real-time requirements.
Source for graphic: "InfoSphere Streams Architecture", Mike Spicer, Chief Architect, InfoSphere Streams, June 2, 2011. Source for quote: Dr. Steve Pratt, CenterPoint Energy, May 25, 2011, IBM Smarter Computing Summit, "Managing the Information Explosion" with Brian Truskowski, between 8:20 and 20:40, http://centerlinebeta.net/smarter-computing-palm-springs/index.html
http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond
A Hadoop "stack" is made up of a number of components. They include:
• Hadoop Distributed File System (HDFS): the default storage layer in any given Hadoop cluster.
• Name Node: the node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and if any nodes fail.
• Secondary Node: a backup to the Name Node, it periodically replicates and stores data from the Name Node should it fail.
• Job Tracker: the node in a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data.
• Slave Nodes: the grunts of any Hadoop cluster, slave nodes store data and take direction to process it from the Job Tracker.
In addition to the above, the Hadoop ecosystem is made up of a number of complementary sub-projects. NoSQL data stores like Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop.
Following is a guide to Hadoop's components:
• Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
• MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The "Map" function divides a query into multiple parts and processes data at the node level. The "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.
• Hive: Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as Microstrategy, Tableau, Revolution Analytics, etc.
• Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
• HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
• Flume: Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
• Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
• Whirr: Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market.
• Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
• Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model.
• Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
• BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
Understanding your data, and categorizing it by recovery time, is essential in order to build a cost-justifiable, affordable solution. Finally, not every client can justify near-continuous availability or rapid data recovery solutions. A balance between the priorities of uptime and cost, in concert with the needs of the business, is always necessary. For example, many clients may find that the appropriate cost/recovery time equation is that it is not necessary for the data at the remote site to be within seconds; the requirement is only for the data at the remote site to be no more than 12 hours old. These types of recoveries do not require ongoing, real-time consistent update of data at a remote site. Rather, only a periodic point-in-time copy needs to be made (on disk, or on tape for the lower tiers), and then the copies are simply replicated to a remote site. Server and workload restart is semi-automated or manual.
Data center complexity has reached crisis levels and is continuing to increase, thereby limiting improvement and growth. Businesses spend a large fraction of their IT budgets on data center resource management rather than on valuable applications and business processes. IT management costs are the dominant IT cost component today and have increased over the past ten years in rough proportion to increasing scale-out sprawl.
Basic forces will drive continuing increases in IT complexity. The number of systems deployed will continue to grow rapidly, driven largely by new applications (for Web 2.0, surveillance, operational asset management, and so on) and by improving hardware price/performance and utilization (more systems per server). The diversity of IT products will increase as competing suppliers continue to introduce new applications, systems, and management software products. The coupling of IT components is extensive and increasing, driven by application tiering, growing SOA usage, and advances in high-performance standard networks.
The resulting increase in IT complexity will further exacerbate the current IT management cost crisis. Managing the increasing IT complexity and scale-out sprawl with traditional IT management software will be increasingly difficult and costly. New approaches to Data Center Architectures are needed to simplify IT management and enable growth.
In summary, the animation shows the storage pool concept – mapped to the different technologies: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
Here is another way of showing the same step by step, incremental-improve concept. It’s a ‘big picture’ positioning the various kinds of technologies that can be deployed, step by step, to provide IT BC solutions – starting from low and moving to the high end of the cost curve. Click to show each one of the steps to come up. Note how the icons show where the data flows, through different types of technologies that we will discuss further today.
Building upon the previous chart, we continue clicking to show enhancements to Rapid Data Recovery capabilities, followed by Continuous Availability capabilities. (This chart starts from where the previous chart left off.)
This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
This is an optional chart, showing the typical Information Availability System Storage technologies that we would apply to the various pools of storage (click) including the fact that large unstructured data probably needs to be recovered using file system or application involvement.
In summary, the animation shows the storage pool concept – mapped to the different technologies: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
In summary, the animation shows the storage pool concept – mapped to the general categories: (click) Backup/Restore (click) Rapid Data Recovery (click) Continuous Availability
Here's yet another way to look at this process. Each step of the process that we've reviewed is shown here in a step-by-step, build-up project visualization. In this case, we show how the Timeline of an IT Recovery is improved at each step.