IBM Services – Continuous Availability
Availability in a Cloud-Native
World. Guidelines for mere
mortals.
v1.6 Tuesday, February 26, 2019
Haytham Elkhoja
Global Tech Leader and Chief Architect
IBM Services Continuous Availability – IBM Services
haytham.elkhoja@ibm.com
@haythamelkhoja
Herbie Pearthree
Chief Technical Officer, Senior Technical Staff Member
IBM Services Continuous Availability – IBM Services
hpear3@us.ibm.com
@herbiepear3
What you should aim for.
[Diagram: Users; Traffic; Data Replication; Session Replication]
*Cloud vendors can be substituted with Cloud regions. Same principles apply.
Definition
Cloud-Native Apps are
born on the cloud,
scale on the cloud,
consume the cloud,
resilient on the cloud,
and perform on the cloud.
Definition
Microservices
Definition
Availability. Everything breaks,
you should plan on it. Business
must be active in multiple availability
zones to mitigate failures (fires,
floods and fools).
It also allows zero downtime for planned changes and
minimizes maintenance windows.
Definition
Availability in a Cloud Native
World.
Cloud Native and Microservices
- Parallel, agile, polyglot development.
- Choose the right tool for the job.
- Microservices and Loosely-Coupled Components.
- Pet vs Cattle.
Continuous Availability / Always On / Zero Downtime
- First impression, last impression.
- Cost of downtime, there are 8,760 hours in a year, make them count.
- Availability, resilience, performance and scalability go hand in hand.
- Blue Green and canary deployments per region/cloud for non-disruptive change management.
- Redirect users to their closest region/cloud, right cloud/region for the right job.
- No HA and stretched clustering = no failure domains.
- 3 regions/clouds cheaper than 2.
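To make the 8,760 hours count, it helps to translate an availability target into the downtime it allows. A minimal sketch; the function name and the sample targets are illustrative and not from the slides.

```python
HOURS_PER_YEAR = 8_760  # 365 days x 24 hours, as quoted above


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Five nines leaves a little over five minutes a year, which is hard to meet if every failure needs a manual recovery.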
Definition
Achieving availability in a cloud
native world requires
1. Good Sense
2. Portability
3. Scalability
4. Resiliency
Here are some guidelines we
picked up in the field,
in no specific order.
1. Good Sense
Guidelines
Guideline
Embrace tradeoffs. There is no
silver bullet. Availability comes
from good architectures.
Guideline
Formulate SLAs, SLOs, SLIs and
error budgets.
Example:
- SLI = HTTP Error Codes
- SLO = at most 1% of requests per month may return HTTP 500
- SLA = Penalty for every additional HTTP 500 ($ or refunds)
- Error budget = the failures the SLO still tolerates (here, that 1% of requests)
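The example above can be turned into simple arithmetic. This sketch treats the SLI as the fraction of requests answered with an HTTP 500 and the error budget as the number of failed requests the 1% SLO still allows. The function names and the request counts are hypothetical.

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests that returned an HTTP 500."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_max_error_rate: float = 0.01) -> int:
    """Failed requests still allowed before the SLO is breached.

    A negative result means the budget is overspent.
    """
    allowed = round(total_requests * slo_max_error_rate)
    return allowed - failed_requests


# Hypothetical month: 2,000,000 requests, 12,500 of them HTTP 500s.
sli = error_rate(2_000_000, 12_500)
remaining = error_budget_remaining(2_000_000, 12_500)
print(f"SLI (error rate): {sli:.4%}")
print(f"Error budget remaining: {remaining} requests")
```

When the remaining budget goes negative, the SLO is breached and the SLA penalty starts to apply.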
Guideline
Remember your high school
calculus.
MeanTimeToRepair = MeanTimeToDetect + MeanTimeToTriage + MeanTimeToRestore
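A quick worked example of the sum above. The availability formula used here (uptime divided by uptime plus downtime) is the standard steady-state one and is an addition to the slide; all numbers are hypothetical.

```python
def mean_time_to_repair(detect: float, triage: float, restore: float) -> float:
    """MTTR as the sum of its three phases (all in minutes)."""
    return detect + triage + restore


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Standard steady-state availability: uptime / (uptime + downtime)."""
    return mtbf / (mtbf + mttr)


# Hypothetical incident profile, in minutes.
mttr = mean_time_to_repair(detect=5, triage=15, restore=40)
# Hypothetical mean time between failures: 30 days, in minutes.
availability = steady_state_availability(mtbf=30 * 24 * 60, mttr=mttr)
print(f"MTTR: {mttr} min, availability: {availability:.4%}")
```

Shaving minutes off detection or triage improves availability just as much as a faster restore.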
Guideline
Distributed computing is full of
fallacies such as networks are
reliable. They’re not, and neither
are disks.
Guideline
Speaking of fallacies here’s a
bunch:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
Guideline
Bleeding edge is an attitude.
Technology is changing every day.
What you knew yesterday is
already legacy (or deprecated).
Guideline
Understand consistency.
Consistency
Weak
• After a write, reads may or may not see it.
• Best effort only.
• Memcache, VoIP, live video streaming.
Eventual
• After a write, reads will eventually see it.
• Write will happen... Eventually.
• Object Storage, SMTP, DNS.
• Asynchronous data replication.
Strong
• After a write, reads will see it.
• Don’t continue unless commit.
• Filesystems, RDBMS.
• Synchronous data replication.
Guideline
Make CAP theorem decisions
early on.
Knowing that partition tolerance cannot be sacrificed,
pick consistency or availability.
Consistency
All distributed nodes have a single up-to-date copy of all data at all times.
Availability
Every request receives a success or failure response.
Partition-tolerance
System continues to run despite arbitrary message loss or failure of part of the system.
[Figure: CAP triangle (C, A, P), pick two. Example databases shown: Cassandra, CouchDB, HBase, etc.; MongoDB, Redis, etc.; Oracle, DB2, MySQL, etc.]
Distributed systems data persistence decisions
C+A
To have consistent and available data,
partition tolerance must be sacrificed.
This means that data can only be consistent
in a single place at any moment in time.
C+P
To ensure data consistency and partition
tolerance, availability must be sacrificed.
This means that data is accessible only if
all data nodes are available.
A+P
To ensure availability and partition
tolerance, consistency must be sacrificed.
This means some data nodes aren’t necessarily
in sync in case of a networking disruption.
Guideline
Love DevOps? Wait till you meet
SRE.
https://landing.google.com/sre/
“SRE is what happens when you ask a
software engineer to design an
operations team.”
Guideline
Database versioning and
backward-compatible schemas
are not optional, but compulsory.
Guideline
Design for feedback. Measure
every single detail via KPIs and
SLIs. Capture metrics and logs.
There’s no such thing as too many
logs.
Guideline
Timestamp every breath you
make. Thank yourself later.
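One way to timestamp everything is to stamp each event with a UTC ISO 8601 time at the moment it is created. A minimal sketch; the helper name and the sample fields are hypothetical.

```python
import json
from datetime import datetime, timezone


def timestamped_event(name: str, **fields) -> str:
    """Serialize an event as one JSON line with a UTC ISO 8601 timestamp."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),  # always UTC, never local time
        "event": name,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


print(timestamped_event("order_created", order_id=42, region="eu-west"))
```

Using UTC everywhere keeps events comparable across regions and clouds.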
Guideline
Synthetic automated monitoring
helps you understand what your
digital users experience, far beyond
typical platform monitoring. Do it
from multiple locations.
Guideline
Continuous tinkering is healthy
even when random. Use
randomness to spoon-feed
yourself with discoveries.
Guideline
Reduce uncertainty with
GameDays, then aim to regularly
induce failure in your production
environment.
Guideline
Bypass failures altogether.
Recovery leads to a mediocre,
sometimes catastrophic
experience.
2. Portability
Guidelines
Guideline
Architect your application to be
cloud, infrastructure and OS
agnostic.
Guideline
Keep up with the times.
Containerize or Serverlessize
your app.
Guideline
Twelve-factor app development and
design methods help you achieve
application and cloud mobility.
Guideline
Rely on dependency managers to
keep your app clean and lean.
Guideline
Environment variables should be
bootstrapped. No strings
attached.
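One reading of this guideline: configuration is read from the environment once at startup, with nothing hard-coded. A minimal sketch; the variable names DATABASE_URL, PORT and DEBUG and the Settings class are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int
    debug: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables; fail fast if one is missing."""
    env = os.environ if env is None else env
    return Settings(
        database_url=env["DATABASE_URL"],  # required: raises KeyError if absent
        port=int(env.get("PORT", "8080")),  # optional, with a default
        debug=env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    )


# Demo with an explicit mapping; in production call load_settings() with no argument.
print(load_settings({"DATABASE_URL": "postgres://db.internal/app", "PORT": "9000"}))
```

Passing the mapping in makes the same code testable without touching the real environment.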
Guideline
Got Syslog? Feed the logs using
stdout and stderr.
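The app writes to stdout and stderr and leaves routing to the platform. A minimal sketch with the standard logging module; sending INFO to stdout and warnings and above to stderr is an assumption about the split, and the logger name is hypothetical.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below a given level."""

    def __init__(self, max_level: int):
        super().__init__()
        self.max_level = max_level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno <= self.max_level


def build_logger(name: str = "app") -> logging.Logger:
    """INFO goes to stdout, WARNING and above to stderr; no log files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")

    out = logging.StreamHandler(sys.stdout)
    out.addFilter(MaxLevelFilter(logging.INFO))
    out.setFormatter(fmt)

    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.WARNING)
    err.setFormatter(fmt)

    logger.addHandler(out)
    logger.addHandler(err)
    return logger


log = build_logger()
log.info("service started")         # written to stdout
log.warning("replica lag is high")  # written to stderr
```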
Guideline
Delegate responsibilities.
Whatever as a Service.
Somebody, somewhere has done
it better.
3. Scalability
Guidelines
Guideline
Love thy neighbor. Configure
resource requests and limits.
Throttle API requests.
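One common way to throttle API requests is a token bucket. The slides do not name an algorithm, so this choice is an assumption, and the class and parameter names are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(25))
print(f"accepted {accepted} of 25 burst requests")
```

A rejected request would typically be answered with HTTP 429 so that well-behaved clients back off.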
Guideline
Religiously steer clear of IP
addresses. DNS and service
discovery are your best friends.
Guideline
GitOps. Everything should be
versioned, ephemeral and
reproducible. This includes
configuration files and
Infrastructure as Code.
Guideline
Actions performed by humans
hundreds of times won’t be
performed the same way each
time, even with the best
intentions. Automate.
Guideline
Most times, it might make sense
to cache data and return it, but
manage your TTLs.
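A minimal sketch of a read-through cache with a time to live (TTL), assuming an in-process dictionary is enough for illustration. The class name and the loader function are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Return cached values until they are older than `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Any, Tuple[Any, float]] = {}  # key -> (value, expires_at)

    def get(self, key: Any, loader: Callable[[Any], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: serve from cache
        value = loader(key)  # missing or expired: reload from the source
        self._store[key] = (value, now + self.ttl)
        return value


def slow_lookup(key: str) -> str:
    """Stand-in for a database or remote call."""
    return key.upper()


cache = TTLCache(ttl_seconds=30)
print(cache.get("user:1", slow_lookup))  # loads from the source
print(cache.get("user:1", slow_lookup))  # served from the cache
```

A short TTL keeps data fresh; a long one protects the backend. Pick it per data type rather than globally.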
4. Resiliency
Guidelines
Guideline
Share-nothing. Cluster-nothing.
Stretch-nothing.
[Figure: three architectures compared, each drawn with DB nodes, disks and networking: Share Everything; Share Disks and Networking; Share Nothing.]
Guideline
Deploy to multi active clouds (or
regions). Resilient clouds don’t
mean resilient apps.
Guideline
Adopt region affinity using Global
Load Balancers to resolve traffic
to the nearest region.
Use anycast for legacy IP
communication.
Guideline
Embrace asynchronous events
and eventual data consistency.
Guideline
Write anywhere and everywhere.
Peer to Peer database-level
replication. Shard or Read/Query
if you can’t.
Guideline
Aim for stateless, but maintain
sessions, if you must.
Guideline
Design for failure.
Handle SIGTERM like a champ,
and expect SIGKILL, which cannot be caught.
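SIGTERM can be trapped so the process finishes in-flight work before exiting. SIGKILL cannot be caught by any process, so the only defense is to never leave state half-written. A minimal sketch; the class and function names are hypothetical.

```python
import signal
import threading
from typing import Callable, Optional


class GracefulShutdown:
    """Turn SIGTERM/SIGINT into a flag the main loop can check."""

    def __init__(self) -> None:
        self.stop_event = threading.Event()

    def install(self) -> None:
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle)
        signal.signal(signal.SIGINT, self.handle)

    def handle(self, signum, frame) -> None:
        # Do as little as possible here: just request a stop.
        self.stop_event.set()


def run_worker(shutdown: GracefulShutdown, do_work: Callable[[], None],
               max_iterations: Optional[int] = None) -> int:
    """Process units of work until a stop is requested; return the count done."""
    done = 0
    while not shutdown.stop_event.is_set():
        if max_iterations is not None and done >= max_iterations:
            break
        do_work()  # each unit should be small enough to finish before the kill timeout
        done += 1
    return done


# Typical wiring (not executed here):
#   shutdown = GracefulShutdown()
#   shutdown.install()
#   run_worker(shutdown, handle_one_request)
```

Orchestrators usually send SIGTERM first and SIGKILL after a grace period, so the loop has to drain within that window.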
Guideline
Fail gracefully and inform your
customers what’s up (or down). Pun intended.
Guideline
Rolling update strategies for
zero-downtime deployments.
Account for the time the application needs to start up.
- Deploy by adding an instance, then removing an old one.
- Deploy by removing an instance, then adding a new one.
- Deploy by updating instances as fast as possible.
Guideline
Are we there yet? Implement
readiness and liveness probes, and
circuit breakers.
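A minimal sketch of the circuit-breaker part only: after a number of consecutive failures the breaker opens and fails fast, then lets one trial call through once a timeout has passed. The thresholds, class names and the failing backend are hypothetical.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def flaky_backend() -> str:
    raise ConnectionError("backend down")


now = [0.0]  # fake clock so the demo runs instantly
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_backend)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except ConnectionError:
        outcomes.append("backend-error")
now[0] = 11.0  # the reset timeout has passed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

Failing fast keeps a struggling dependency from tying up threads and connections in every caller.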
Guideline
You don’t choose Chaos Monkey.
Chaos Monkey chooses you.
“Chaos Engineering is the discipline of
experimenting on a distributed system in
order to build confidence in the system's
capability to withstand turbulent
conditions in production.”
https://principlesofchaos.org
Guideline
When pursuing Chaos
Engineering, start small and
observe and learn.
[Chart, e.g. latency attack: latency (ms) plotted against # of instances. Start here, with few instances.]
I. Plan an experiment
II. Contain the Blast Radius
III. Scale or Squash
How to conduct Chaos Engineering attacks:
• Test (latency, DNS, leap seconds, disk fill, kill
processes, etc…).
• Expected results?
• Observed results.
• Document.
Remember to start small and gradually increase blast radius.
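Containing the blast radius can be as simple as attacking a small random share of instances and widening it step by step. A minimal sketch; the fleet, the percentages and the function name are hypothetical, and the actual fault injection is left out.

```python
import math
import random
from typing import List, Optional, Sequence


def pick_targets(instances: Sequence[str], blast_radius_pct: float,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Choose a random subset of instances to attack; at least one if any exist."""
    rng = rng or random.Random()
    count = max(1, math.ceil(len(instances) * blast_radius_pct / 100))
    return rng.sample(list(instances), min(count, len(instances)))


fleet = [f"instance-{n}" for n in range(100)]  # hypothetical fleet
for pct in (1, 5, 25):  # start small, then widen the radius
    targets = pick_targets(fleet, pct, random.Random(42))
    print(f"{pct}% blast radius -> inject latency into {len(targets)} instances")
```

Between steps, compare expected and observed results before deciding to scale the experiment or squash it.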
Guideline
Data patterns differ. Not all data
are created equal.
[Figure: data replication patterns for Messaging, BPM, CEP and APP workloads]
- Active standby or active/query
- Hot standby, or configured active/active for fast switchover
- Multi-master or peer-to-peer, write anywhere
- Data distribution: filter and push
- Data warehouse: integration and federation
- Data through messaging: filter and push distribution
Result should look
something like this.
[Architecture diagram spanning PUBLIC NETWORK, CLOUD NETWORK and ENTERPRISE NETWORK. Components: USER; GLOBAL LOAD BALANCER; FIREWALL; APPLICATION CLOUD SITE 1, 2 and 3, each with MICROSERVICE APPLICATION 1, MICROSERVICE APPLICATION 2 and a NOSQL DATABASE; TRANSFORMATION & CONNECTIVITY; ENTERPRISE DATABASE in DATACENTER 1 and DATACENTER 2. Legend: Application, Infrastructure, Data Store, Security, Devops, User, Scalable. Callouts 1 to 6 refer to the steps below.]
3-Active Microservices Systems of Engagement w/Active-Active Enterprise SoR
1. Global Load Balancer responds to the DNS request and points the user to the best responding site
2. User request is sent to the best site to consume the business service application
3. Cloud Native Microservice #1 (using a circuit breaker) connects to the best Enterprise SoR
4. Cloud Native Microservice #2 performs CRUD on the NoSQL database in its site
5. NoSQL database replication set performs the operation on each of its peers
6. Enterprise SoR replication set performs CRUD on its peer
[Diagram annotations: 99.99%, 99.999%]
IBM Services - Guidelines for Achieving Continuous Availability in a Cloud-Native World

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

IBM Services - Guidelines for Achieving Continuous Availability in a Cloud-Native World

  • 1. IBM Services – Continuous Availability Availability in a Cloud-Native World. Guidelines for mere mortals.v1.6 Tuesday, February 26, 2019 Haytham Elkhoja Global Tech Leader and Chief Architect IBM Services Continuous Availability – IBM Services haytham.elkhoja@ibm.com @haythamelkhoja Herbie Pearthree Chief Technical Officer, Senior Technical Staff Member IBM Services Continuous Availability – IBM Services hpear3@us.ibm.com @herbiepear3
  • 2. What you should aim for.
  • 3. [Diagram: Users; Traffic; Data Replication; Session Replication.] *Cloud vendors can be substituted with cloud regions. The same principles apply.
  • 4. Definition Cloud-Native Apps are born on the cloud, scale on the cloud, consume the cloud, resilient on the cloud, and perform on the cloud.
  • 6. Definition Availability. Everything breaks, you should plan on it. Business must be active in multi-availability zones to mitigate failures3 (fires, floods and fools). It also allows zero downtime for planned changes and minimizes maintenance windows.
  • 7. Definition Availability in a Cloud Native World. Cloud Native and Microservices - Parallel, agile, polyglot development. - Choose the right tool for the job. - Microservices and Loosely-Coupled Components. - Pets vs. Cattle. Continuous Availability / Always On / Zero Downtime - First impression, last impression. - Cost of downtime: there are 8,760 hours in a year, make them count. - Availability, resilience, performance and scalability go hand in hand. - Blue-green and canary deployments per region/cloud for non-disruptive change management. - Redirect users to their closest region/cloud; right cloud/region for the right job. - No HA and stretched clustering = no failure domains. - 3 regions/clouds cheaper than 2.
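The 8,760 hours quoted above can be turned into a downtime allowance for any availability target. The following is a minimal sketch of that arithmetic; the function name and the list of targets are illustrative assumptions, not part of the deck.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours, the figure quoted on the slide


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, that an availability target allows."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * 60 * unavailable_fraction


# Illustrative targets (assumed for this example).
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Each extra nine cuts the allowance by a factor of ten, which is why the architecture needed for 99.999% tends to look very different from the one needed for 99.9%.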
  • 8. Definition Achieving availability in a cloud native world requires 1. Good Sense 2. Portability 3. Scalability 4. Resiliency
  • 9. Here are some guidelines we picked up in the field, in no specific order.
  • 11. Guideline Embrace tradeoffs. There is no silver bullet. Availability comes from good architectures.
  • 12. Guideline Formulate SLAs, SLOs, SLIs and error budgets. Example: - SLI = rate of HTTP error codes - SLO = at most 1% HTTP 500s allowed every month - SLA = penalty for every additional HTTP 500 ($ or refunds) - Error budget = the amount of failure the SLO allows (here, that 1%)
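As a rough illustration of the example above, an error budget can be tracked as the number of failed requests the SLO still tolerates in the current month. This is a sketch under assumptions: the class name, field names and request counts are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_error_ratio: float  # e.g. 0.01 for "at most 1% HTTP 500s per month"
    total_requests: int     # requests served so far this month
    failed_requests: int    # the SLI: responses that returned HTTP 500

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates for the traffic seen so far."""
        return self.slo_error_ratio * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; a negative value means the SLO is breached."""
        return self.allowed_failures - self.failed_requests

    @property
    def slo_met(self) -> bool:
        return self.failed_requests <= self.allowed_failures


# Hypothetical month: one million requests, 4,000 of them HTTP 500s.
budget = ErrorBudget(slo_error_ratio=0.01, total_requests=1_000_000, failed_requests=4_000)
print(budget.slo_met, round(budget.remaining))
```

A team could use the remaining budget to decide whether a risky release goes out this month or waits.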
  • 13. Guideline Remember your high school calculus. MeanTimeToRepair = MeanTimeToDetect + MeanTimeToTriage + MeanTimeToRestore
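A worked instance of the formula above, using made-up durations (the 5, 15 and 40 minutes are assumptions for illustration only):

```python
from datetime import timedelta


def mean_time_to_repair(detect: timedelta, triage: timedelta, restore: timedelta) -> timedelta:
    """MeanTimeToRepair = MeanTimeToDetect + MeanTimeToTriage + MeanTimeToRestore."""
    return detect + triage + restore


mttr = mean_time_to_repair(
    detect=timedelta(minutes=5),    # hypothetical: an alert fires after 5 minutes
    triage=timedelta(minutes=15),   # hypothetical: on-call finds the failing component
    restore=timedelta(minutes=40),  # hypothetical: service is brought back
)
print(mttr)  # 1:00:00
```

Splitting the total this way shows where the time goes: in this made-up case, faster detection would help far less than faster restoration.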
  • 14. Guideline Distributed computing is full of fallacies such as networks are reliable. They’re not, and neither are disks.
  • 15. Guideline Speaking of fallacies, here’s a bunch: - The network is reliable. - Latency is zero. - Bandwidth is infinite. - The network is secure. - Topology doesn't change. - There is one administrator. - Transport cost is zero. - The network is homogeneous.
  • 16. Guideline Bleeding edge is an attitude. Technology is changing every day. What you knew yesterday is already legacy (or deprecated).
  • 17. Guideline Understand consistency. Consistency Weak • After a write, reads may or may not see it. • Best effort only. • Memcache, VoIP, live video streaming. Eventual • After a write, reads will eventually see it. • Write will happen... Eventually. • Object Storage, SMTP, DNS. • Asynchronous data replication. Strong • After a write, reads will see it. • Don’t continue unless commit. • Filesystems, RDBMS. • Synchronous data replication.
  • 18. Guideline Make CAP Theorem decisions early on, knowing that Partition Tolerance cannot be sacrificed. Pick Consistency or Availability. Consistency: All distributed nodes have a single up-to-date copy of all data at all times. Availability: Every request receives a success or failure response. Partition-tolerance: System continues to run despite arbitrary message loss or failure of part of the system. [Diagram: C, A, P, pick two. Examples shown: Cassandra, CouchDB, HBase etc.; MongoDB, Redis etc.; Oracle, DB2, MySQL etc.] Distributed systems data persistence decisions: C+A To have consistent and available data, partition tolerance must be sacrificed. This means that data can only be consistent in a single place at any moment in time. C+P To ensure data consistency and partition tolerance, availability must be sacrificed. This means that data is accessible only if all data nodes are available. A+P To ensure availability and partition tolerance, consistency must be sacrificed. This means some data nodes aren’t necessarily in sync in case of a networking disruption.
  • 19. Guideline Love DevOps? Wait till you meet SRE. https://landing.google.com/sre/ “SRE is what happens when you ask a software engineer to design an operations team. ”
  • 20. Guideline Database versioning and backward-compatible schemas are not optional, but compulsory.
  • 21. Guideline Design for feedback. Measure every single detail via KPIs and SLIs. Capture metrics and logs. There’s no such thing as too many logs.
  • 22. Guideline Timestamp every breath you make. Thank yourself later.
  • 23. Guideline Synthetic automated monitoring helps you understand what your digital users experience, which typical platform monitoring does not show. Do it from multiple locations.
  • 24. Guideline Continuous tinkering is healthy even when random. Use randomness to spoon-feed yourself with discoveries.
  • 25. Guideline Reduce uncertainty with GameDays, then aim to regularly induce failure in your production environment.
  • 26. Guideline Bypass failures altogether. Recovery leads to a mediocre, sometimes catastrophic experience.
  • 28. Guideline Architect your application to be cloud, infrastructure and OS agnostic.
  • 29. Guideline Keep up with the times. Containerize or Serverlessize your app.
  • 30. Guideline Twelve-factor app development and design methods help you achieve application and cloud mobility.
  • 31. Guideline Rely on dependency managers to keep your app clean and lean.
  • 32. Guideline Environment variables should be bootstrapped. No strings attached.
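One way to read the guideline above is that configuration is loaded from the environment once at startup, with nothing hard-coded in the application. A minimal sketch follows; the variable names `DATABASE_URL` and `REQUEST_TIMEOUT_S`, the default value and the example URL are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables at startup."""
    return Settings(
        # Required: a missing variable raises KeyError and fails fast at boot.
        database_url=env["DATABASE_URL"],
        # Optional: falls back to an assumed default of 2.5 seconds.
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "2.5")),
    )


# Usage with an explicit mapping, so the example does not depend on the real environment.
settings = load_settings({"DATABASE_URL": "postgres://db.example.internal/app"})
print(settings)
```

Passing the mapping in as a parameter keeps the same code usable in tests and in any cloud, since only the environment changes between deployments.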
  • 33. Guideline Got Syslog? Feed the logs using stdout and stderr.
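A small sketch of what feeding logs through stdout and stderr might look like, with a UTC timestamp on every line. The logger name, the format string and the split at WARNING level are assumptions of this example; the platform's log collector is expected to pick the streams up.

```python
import logging
import sys
import time


def build_logger(name: str = "app") -> logging.Logger:
    """Send INFO and below to stdout, WARNING and above to stderr, all with UTC timestamps."""
    formatter = logging.Formatter(
        "%(asctime)sZ %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    formatter.converter = time.gmtime  # timestamps in UTC rather than local time

    out = logging.StreamHandler(sys.stdout)
    out.addFilter(lambda record: record.levelno < logging.WARNING)
    out.setFormatter(formatter)

    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.WARNING)
    err.setFormatter(formatter)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [out, err]  # replace any handlers added earlier
    logger.propagate = False
    return logger


log = build_logger("checkout")  # "checkout" is a hypothetical service name
log.info("order accepted")
log.warning("payment provider slow")
```

The application never opens a log file or talks to syslog itself, which keeps it portable across container runtimes.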
  • 34. Guideline Delegate responsibilities. Whatever as a Service. Somebody, somewhere has done it better.
  • 36. Guideline Love thy neighbor. Configure resource requests and limits. Throttle API requests.
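Throttling API requests is often done with a token bucket. The sketch below is one possible in-process version, not something the deck prescribes; the class name, the rate and the capacity are assumptions.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the time elapsed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject the request, e.g. with HTTP 429


bucket = TokenBucket(rate_per_s=5, capacity=10)  # hypothetical limits
print(bucket.allow())
```

Resource requests and limits on the platform side play the same role for CPU and memory that this bucket plays for request volume.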
  • 37. Guideline Religiously steer clear from IP addresses. DNS and service discovery are your best friends.
  • 38. Guideline GitOps. Everything should be versioned, ephemeral and reproducible. This includes configuration files and Infrastructure as Code.
  • 39. Guideline Actions performed by humans hundreds of times won’t be performed the same way each time, even with the best intentions. Automate.
  • 40. Guideline Most times, it might make sense to cache data and return it, but manage your TTLs.
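Managing TTLs can be as simple as storing an expiry time next to each cached value. This is a minimal single-process sketch; the class name, the 30-second TTL and the loader function are assumptions for illustration.

```python
import time


class TTLCache:
    """Cache values for `ttl_s` seconds, then reload them on the next read."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: serve from cache
        value = loader()  # missing or expired: go back to the source
        self._store[key] = (value, now + self.ttl_s)
        return value


cache = TTLCache(ttl_s=30)  # hypothetical TTL
price = cache.get("sku-42", loader=lambda: 19.99)  # hypothetical key and loader
print(price)
```

A short TTL bounds how stale a response can be, while a long one shields the backend; the right value depends on how much staleness the data can tolerate.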
  • 42. Guideline Share-nothing. Cluster-nothing. Stretch-nothing. [Diagram of DB, Disk and Networking components, labeled: Share Everything; Share Disks and Networking; Share Nothing.]
  • 43. Guideline Deploy to multi active clouds (or regions). Resilient clouds don’t mean resilient apps.
  • 44. Guideline Adopt region affinity using Global Load Balancers to resolve traffic to the nearest region. Use anycast for legacy IP communication.
  • 45. Guideline Embrace asynchronous events and eventual data consistency.
  • 46. Guideline Write anywhere and everywhere. Peer to Peer database-level replication. Shard or Read/Query if you can’t.
  • 47. Guideline Aim for stateless, but maintain sessions, if you must.
  • 48. Guideline Design for failure. Handle SIGTERM like a champ, because SIGKILL cannot be caught or handled.
  • 49. Guideline Fail gracefully and inform your customers what’s up (or down). Pun intended.
  • 50. Guideline Rolling update strategies for zero-downtime deployments, accounting for the time the application needs to start up: deploy by adding an instance, then removing an old one; deploy by removing an instance, then adding a new one; or deploy by updating instances as fast as possible.
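A process can catch SIGTERM and finish in-flight work, whereas SIGKILL cannot be caught, so cleanup has to happen before the platform's grace period runs out. The sketch below shows one way to do this; the worker loop and the function names are assumptions of the example.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; the work loop decides when it is safe to stop."""
    shutdown_requested.set()


def install_handlers():
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(work_items, handle):
    """Process items until done or until a shutdown is requested."""
    done = []
    for item in work_items:
        if shutdown_requested.is_set():
            break  # stop taking new work; already finished items stay finished
        done.append(handle(item))
    return done


if __name__ == "__main__":
    install_handlers()
    print(run_worker(range(3), lambda n: n * n))
```

The handler only sets a flag, which keeps it safe to run at any point; closing connections and flushing state would happen after the loop exits.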
  • 51. Guideline Are we there yet? Implement readiness, liveness probes and circuit-breakers.
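Readiness and liveness probes are configured on the platform, but a circuit breaker usually lives in application code. The following is a minimal sketch of the pattern, with an assumed failure threshold and reset timeout; production libraries offer more states and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
print(breaker.call(lambda: "ok"))
```

Failing fast while the circuit is open gives the struggling dependency room to recover and lets the caller fall back to a cached or degraded response.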
  • 52. Guideline You don’t choose Chaos Monkey. Chaos Monkey chooses you. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.” https://principlesofchaos.org
  • 53. Guideline When pursuing Chaos Engineering, start small, observe and learn. [Chart, e.g. a latency attack: number of instances against latency (ms); start here, then increase radius.] I. Plan an experiment II. Contain the Blast Radius III. Scale or Squash How to conduct Chaos Engineering attacks: • Test (latency, DNS, leap seconds, disk fill, kill processes, etc.). • Expected results? • Observed results. • Document. Remember to start small and gradually increase the blast radius.
  • 54. Guideline Data patterns differ. Not all data are created equal. Messaging BPM CEP APP Active standby or active/query Hot standby or configured active/active for fast switchover Multi-master or peer-to-peer write anywhere Data distribution filter and push Data warehouse integration and federation Data through messaging filter and push distribution
  • 56. 3-Active Microservices Systems of Engagement w/Active-Active Enterprise SoR. [Architecture diagram. Public network: user, global load balancers. Cloud network: application cloud sites 1, 2 and 3, each with microservice application 1, microservice application 2 and a NoSQL database; firewalls; transformation & connectivity. Enterprise network: datacenter 1 and datacenter 2, each with an enterprise database. Legend: Application, Infrastructure, Data Store, Security, DevOps, User, Scalable. Availability labels: 99.99%, 99.999%.] 1. Global Load Balancer responds to DNS request and points user to best responding site 2. User request is sent to best site to consume the business service application 3. Cloud Native Microservice #1 (using circuit breaker) connects to best Enterprise SoR 4. Cloud Native Microservice #2 performs CRUD on NoSQL Database in site 5. NoSQL database replication set performs operation on each of its peers 6. Enterprise SoR replication set performs CRUD on its peer