SlideShare uma empresa Scribd logo
1 de 87
Baixar para ler offline
The 7 quests of resilient software design
A guide for the adventurous software engineer

Uwe Friedrichsen – codecentric AG – 2012-2017
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.slideshare.net/ufried
https://medium.com/@ufried
 @ufried
You want to do resilient software design ...
... and you expect everything to be like this
But somehow it feels more like that ...
... or even that
What the **** went wrong?
The road to resilience is a twisted one
“7 quests you must complete!”
Quest #1
Understand the business case
“How much money will we earn with it?”
“Does it improve our velocity?”
Resilience is not about making money
Resilience is about not losing money
Lack of resilient software design
Reduced system availability
Users cannot do what they intend to do
Less transactions per time period
Immediate lost revenue
Users get annoyed
Churn rate increases
Delayed lost revenue
Due to non-determinism
of distributed systems
This is at most your
resilience budget
Quest #2
Embrace distributed systems
Everything fails, all the time.
-- Werner Vogels
If X then Y
What we learned in our IT education
If X then maybe Y



This changes
everything!
What we need for distributed systems












We are good at this (due to how our brains work)
Inside process thinking
Reasoning about
deterministic behavior
Designing a complicated system












We are poor at that (due to how our brains work)
Reasoning about
non-deterministic behavior
Across process thinking
Designing a complex system
Yet, we usually use deterministic thinking
to reason about distributed systems
Failures in distributed systems ...

•  Crash failure
•  Omission failure
•  Timing failure
•  Response failure
•  Byzantine failure
... turn seemingly simple issues into very hard ones
Time & Ordering

Leslie Lamport

"Time, clocks, and the
ordering of events in
distributed systems"
Consensus

Leslie Lamport

”The part-time
parliament”
(Paxos)
CAP

Eric A. Brewer

"Towards robust
distributed systems"
Faulty processes

Leslie Lamport,
Robert Shostak,
Marshall Pease

"The Byzantine
generals problem"
Consensus

Michael J. Fischer,
Nancy A. Lynch,
Michael S. Paterson

"Impossibility of
distributed consensus
with one faulty
process” (FLP)
Impossibility

Nancy A. Lynch

”A hundred
impossibility proofs
for distributed
computing"
Embrace distributed systems

•  Distributed systems introduce non-determinism regarding
•  Execution completeness
•  Message ordering
•  Communication timing
•  You will be affected by this at the application level
•  Don’t expect your infrastructure to hide all effects from you
•  Better have a plan to detect and recover from inconsistencies
But do I really need to care?
(The system, I am working on, is not a distributed system)
(Almost) every system is a distributed system
-- Chas Emerick


http://www.infoq.com/presentations/problems-distributed-systems
… and it’s getting “worse”


•  Cloud-based systems
•  Microservices
•  Zero Downtime
•  Mobile & IoT
•  Social Web
Quest #3
Avoid the “100% available” trap
The “100% available” trap, version #1

You: “How should the application respond if a technical failure occurs?”

Business owner: “This must not happen! It is your responsibility to make

 
sure that this will not happen.”
The “100% available” trap, version #2

You: “How do you handle the situation if the service you call does not

 
respond (or does not respond timely)?”

Developer 1: “We did not implement any extra measures. The other service

 
is so important and thus needs to be so highly available that it is

 
not worth any extra effort.”

Developer 2: “Actually, if that service should be down, we would not be able

 
to do anything useful anyway. Thus, it just needs to be up.”
The question is not, if a failure will happen
The question is, when a failure will happen
A short note about availability

Assume a service availability of 99,5% (incl. planned downtimes)

•  10 services involved in a request à 95,1% probability of success
•  50 services involved in a request à 77,8% probability of success
Quest #4
Establish the ops-dev feedback loop
The big wall between Dev and Ops
In a distributed environment, you cannot solve
availability issues on an infrastructure level only
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous improvement cycle
of resilient software design
Dev is where you
implement your
resilience measures
Build
Measure
Learn
Ops is where your
resilience measures
take effect
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous improvement cycle
of resilient software design
Dev is where you
implement your
resilience measures
Build
Measure
Learn
Ops is where your
resilience measures
take effect
All developer activities towards
improving robustness are basically
“shooting at the dark” which is neither
effective not sustainable
Having a wall between Dev and Ops
breaks the cycle required to implement
effective robustness measures
For effective resilient software design
you need a working ops-dev feedback loop
Establishing the feedback loop


•  Adopt DevOps
•  Adopt Site Reliability Engineering (SRE)
•  Or do it your own way if you know a better way ...
•  ... but make sure you establish the required feedback loops!
Quest #5
Master functional design
Without proper functional design
nothing else matters
Isolation

•  System must not fail as a whole
•  Split system in parts and isolate parts against each other
•  Avoid cascading failures
•  Foundation of resilient software design
Bulkhead

•  Bulkheads implement the “parts” that need to be isolated
•  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
•  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ...
•  Shaping good bulkheads is a pure functional design issue (and extremely hard)
Hmm, sound easy. Why should that be hard?
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
How do we avoid this …
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. availability is broken by design
... and this ...
Service
Service
Service
 Service
Service
Service
Service
Service
Service
Service
Service
Service
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

i.e. the isolation is broken by design
... without ending up with this?
Let us apply our well-known best practices

•  Divide & conquer a.k.a. functional decomposition
•  DRY (Don’t Repeat Yourself)
•  Design for reusability
•  Layered architecture
•  …
Unfortunately ...
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
... this usually leads to this …
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. availability is broken by design
... and this ...
Service
Service
Service
 Service
Service
Service
Service
Service
Service
Service
Service
Service
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

i.e. the isolation is broken by design
... and in the end also often to this.
Welcome to distributed hell!
Caches to the rescue!
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
CacheofB
Break tight service coupling
by caching data/responses
of downstream service
Caches to the rescue?
Do you really think
that copying stale data all over your system
is a suitable measure
to fix an inherently broken design? *

* Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
We have to re-learn design
for distributed system
No silver bullet
Yet, a few guiding thoughts ...
Foundations of design

•  “High cohesion, low coupling” & “separation of concerns”
•  “Crucial across process boundaries
•  Still poorly understood issue
•  Start with
•  Understanding organizational boundaries
•  Understanding use cases and flows
•  Identifying functional domains (à DDD)
•  Finding areas that change independently
•  Do not start with a data model!
Short activation paths

•  Long activation paths affect availability
•  Increase likelihood of failures
•  Minimize remote calls per request
•  Need to balance opposing forces
•  Avoid monolith à clear separation of concerns
•  Minimize requests à cluster functionality & data
•  Caches can sometimes help, but stale data as trade-off
Be (extremely) wary of reusability

•  Reusability increases coupling
•  Reusability usually leads to bad service design
•  Reusability compromises availability
•  Reusability rarely pays
•  Do not strive for reusable services
•  Strive for replaceable services instead
•  Try to tackle reusability issues with libraries
Quest #6
Know your toolbox
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Redundancy
Stateless
Idempotency
Escalation
Zero downtime
deployment
Location
transparency
Relaxed
temporal
constraints
Fallback
Shed load
Share load
Marked data
 Queue for
resources
Bounded queue
Finish work in
progress
Fresh work
before stale
Deferrable work
Communication
paradigm
Isolation
Bulkhead
System level
Monitor
Watchdog
Heartbeat
Either level
Voting
Synthetic
transaction
Leaky
bucket
Routine
checks
Health
check
Fail fast
Let sleeping dogs lie
Small releases
Hot deployments
Routine maintenance
Backup request
Anti-fragility
Diversity
 Jitter
Error
injection
Spread the news
Anti-entropy
Backpressure
Retry
Limit retries
Rollback
Roll-forward
Checkpoint
 Safe point
Failover
Read repair
Error
handler
Reset
Restart
Reconnect
Fail silently
Default value
Node level
Timeout
Circuit breaker
Complete
parameter
checking
Checksum
Statically
Dynamically
Confinement
Acknowledgement
Using resilience patterns

•  Patterns are options, not obligations
•  Don’t pick too many patterns
•  Each pattern increases complexity
•  Complexity is the enemy of robustness
•  Each pattern costs money in dev & ops
•  You only have a limited resilience budget
•  Look for complementary patterns
How other people did it
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Escalation
Communication
paradigm
Isolation
System level
Monitor
Heartbeat
Either level
Hot deployments
Restart
 Let it crash!
Node level
Actor
Messaging
Erlang (Akka)

Core patterns
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Fallback
Share load
Bounded
queue
Communication
paradigm
Isolation
System level
Monitor
Either level
Error
injection
Retry
Limit retries
Node level
Circuit breaker
Timeout
Zero downtime
deployment
Canary releases
Redundancy
Several variants
(Micro)service
Request/
response
Netflix

Core patterns
Quest #7
Preserve the collective memory
We face a new generation of developers
every 5 years
We loose our collective memory
every 5 years *


* Mean time until a topic discussion in the community starts over form scratch
Time working in IT
Growth of
knowledge
Depth of
insights
What do we do to compensate this effect?
We look for the new & shiny stuff ...
... as anything not new must be useless crap!
We need to rediscover our insights
every 5 years
In IT, we suffer from
continuous collective amnesia
and we are even proud of it
How can we become better?
Wrap-up
The 7 quests at a glance
Wrap-up

•  The road to resilient software design is a twisted one!
•  Most challenges are only indirectly related to RSD
•  Most challenges are not coding related
•  Mastering functional design is extremely hard ...
•  ... while learning the patterns is relatively easy
•  How do we preserve our collective memory?
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.slideshare.net/ufried
https://medium.com/@ufried
 @ufried

Mais conteúdo relacionado

Mais procurados

Application Resilience Patterns
Application Resilience PatternsApplication Resilience Patterns
Application Resilience Patterns
Kiran Sama
 

Mais procurados (20)

Application Resilience Patterns
Application Resilience PatternsApplication Resilience Patterns
Application Resilience Patterns
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
 
Microservices architecture
Microservices architectureMicroservices architecture
Microservices architecture
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
 
Design patterns for microservice architecture
Design patterns for microservice architectureDesign patterns for microservice architecture
Design patterns for microservice architecture
 
Introduction to Microservices
Introduction to MicroservicesIntroduction to Microservices
Introduction to Microservices
 
Micro-services architecture
Micro-services architectureMicro-services architecture
Micro-services architecture
 
Introduction to Microservices
Introduction to MicroservicesIntroduction to Microservices
Introduction to Microservices
 
Kubernetes Concepts And Architecture Powerpoint Presentation Slides
Kubernetes Concepts And Architecture Powerpoint Presentation SlidesKubernetes Concepts And Architecture Powerpoint Presentation Slides
Kubernetes Concepts And Architecture Powerpoint Presentation Slides
 
Microservices, Kubernetes and Istio - A Great Fit!
Microservices, Kubernetes and Istio - A Great Fit!Microservices, Kubernetes and Istio - A Great Fit!
Microservices, Kubernetes and Istio - A Great Fit!
 
Microservice Architecture
Microservice ArchitectureMicroservice Architecture
Microservice Architecture
 
Microservices Part 3 Service Mesh and Kafka
Microservices Part 3 Service Mesh and KafkaMicroservices Part 3 Service Mesh and Kafka
Microservices Part 3 Service Mesh and Kafka
 
Saga about distributed business transactions in microservices world
Saga about distributed business transactions in microservices worldSaga about distributed business transactions in microservices world
Saga about distributed business transactions in microservices world
 
Microservices
MicroservicesMicroservices
Microservices
 
Service mesh
Service meshService mesh
Service mesh
 
Cloud Migration: Moving Data and Infrastructure to the Cloud
Cloud Migration: Moving Data and Infrastructure to the CloudCloud Migration: Moving Data and Infrastructure to the Cloud
Cloud Migration: Moving Data and Infrastructure to the Cloud
 
Aws migration case study_blr_meetup
Aws migration case study_blr_meetupAws migration case study_blr_meetup
Aws migration case study_blr_meetup
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Monoliths and Microservices
Monoliths and Microservices Monoliths and Microservices
Monoliths and Microservices
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton
 

Semelhante a The 7 quests of resilient software design

DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
JAXLondon_Conference
 
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
Amazon Web Services Korea
 

Semelhante a The 7 quests of resilient software design (20)

DevOps and Microservice
DevOps and MicroserviceDevOps and Microservice
DevOps and Microservice
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack risk
 
Microsoft Microservices
Microsoft MicroservicesMicrosoft Microservices
Microsoft Microservices
 
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
 
Cloud Native Future
Cloud Native FutureCloud Native Future
Cloud Native Future
 
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
 
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
 
AWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWSAWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWS
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
 
Software Architecture for Agile Development
Software Architecture for Agile DevelopmentSoftware Architecture for Agile Development
Software Architecture for Agile Development
 
The Road to SaaS
The Road to SaaSThe Road to SaaS
The Road to SaaS
 
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
 
No silver bullet
No silver bulletNo silver bullet
No silver bullet
 
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
 
Software Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableSoftware Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuable
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBay
 
devops, platforms and devops platforms
devops, platforms and devops platformsdevops, platforms and devops platforms
devops, platforms and devops platforms
 
devops, platforms and devops platforms
devops, platforms and devops platformsdevops, platforms and devops platforms
devops, platforms and devops platforms
 

Mais de Uwe Friedrichsen

Deep learning - a primer
Deep learning - a primerDeep learning - a primer
Deep learning - a primer
Uwe Friedrichsen
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
Uwe Friedrichsen
 
Excavating the knowledge of our ancestors
Excavating the knowledge of our ancestorsExcavating the knowledge of our ancestors
Excavating the knowledge of our ancestors
Uwe Friedrichsen
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"
Uwe Friedrichsen
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservices
Uwe Friedrichsen
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader context
Uwe Friedrichsen
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of IT
Uwe Friedrichsen
 

Mais de Uwe Friedrichsen (20)

Deep learning - a primer
Deep learning - a primerDeep learning - a primer
Deep learning - a primer
 
Life after microservices
Life after microservicesLife after microservices
Life after microservices
 
The hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerThe hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developer
 
Digitization solutions - A new breed of software
Digitization solutions - A new breed of softwareDigitization solutions - A new breed of software
Digitization solutions - A new breed of software
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
 
Excavating the knowledge of our ancestors
Excavating the knowledge of our ancestorsExcavating the knowledge of our ancestors
Excavating the knowledge of our ancestors
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservices
 
Watch your communication
Watch your communicationWatch your communication
Watch your communication
 
Life, IT and everything
Life, IT and everythingLife, IT and everything
Life, IT and everything
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader context
 
Production-ready Software
Production-ready SoftwareProduction-ready Software
Production-ready Software
 
Towards complex adaptive architectures
Towards complex adaptive architecturesTowards complex adaptive architectures
Towards complex adaptive architectures
 
Conway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITConway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective IT
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of IT
 
The Next Generation (of) IT
The Next Generation (of) ITThe Next Generation (of) IT
The Next Generation (of) IT
 
Why resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesWhy resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudes
 
No stress with state
No stress with stateNo stress with state
No stress with state
 
Resilience with Hystrix
Resilience with HystrixResilience with Hystrix
Resilience with Hystrix
 
Self healing data
Self healing dataSelf healing data
Self healing data
 

Último

Último (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

The 7 quests of resilient software design

  • 1. The 7 quests of resilient software design A guide for the adventurous software engineer Uwe Friedrichsen – codecentric AG – 2012-2017
  • 2. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried
  • 3. You want to do resilient software design ...
  • 4. ... and you expect everything to be like this
  • 5. But somehow it feels more like that ...
  • 6. ... or even that
  • 7. What the **** went wrong?
  • 8. The road to resilience is a twisted one
  • 9. “7 quests you must complete!”
  • 12. “How much money will we earn with it?”
  • 13. “Does it improve our velocity?”
  • 14. Resilience is not about making money Resilience is about not losing money
  • 15. Lack of resilient software design Reduced system availability Users cannot do what they intend to do Less transactions per time period Immediate lost revenue Users get annoyed Churn rate increases Delayed lost revenue Due to non-determinism of distributed systems This is at most your resilience budget
  • 18. Everything fails, all the time. -- Werner Vogels
  • 19. If X then Y What we learned in our IT education If X then maybe Y This changes everything! What we need for distributed systems We are good at this (due to how our brains work) Inside process thinking Reasoning about deterministic behavior Designing a complicated system We are poor at that (due to how our brains work) Reasoning about non-deterministic behavior Across process thinking Designing a complex system
  • 20. Yet, we usually use deterministic thinking to reason about distributed systems
  • 21. Failures in distributed systems ... •  Crash failure •  Omission failure •  Timing failure •  Response failure •  Byzantine failure
  • 22. ... turn seemingly simple issues into very hard ones Time & Ordering Leslie Lamport "Time, clocks, and the ordering of events in distributed systems" Consensus Leslie Lamport ”The part-time parliament” (Paxos) CAP Eric A. Brewer "Towards robust distributed systems" Faulty processes Leslie Lamport, Robert Shostak, Marshall Pease "The Byzantine generals problem" Consensus Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson "Impossibility of distributed consensus with one faulty process” (FLP) Impossibility Nancy A. Lynch ”A hundred impossibility proofs for distributed computing"
  • 23. Embrace distributed systems •  Distributed systems introduce non-determinism regarding •  Execution completeness •  Message ordering •  Communication timing •  You will be affected by this at the application level •  Don’t expect your infrastructure to hide all effects from you •  Better have a plan to detect and recover from inconsistencies
  • 24. But do I really need to care? (The system, I am working on, is not a distributed system)
  • 25. (Almost) every system is a distributed system -- Chas Emerick http://www.infoq.com/presentations/problems-distributed-systems
  • 26. … and it’s getting “worse” •  Cloud-based systems •  Microservices •  Zero Downtime •  Mobile & IoT •  Social Web
  • 28. Avoid the “100% available” trap
  • 29. The “100% available” trap, version #1 You: “How should the application respond if a technical failure occurs?” Business owner: “This must not happen! It is your responsibility to make sure that this will not happen.”
  • 30. The “100% available” trap, version #2 You: “How do you handle the situation if the service you call does not respond (or does not respond timely)?” Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.” Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
  • 31. The question is not, if a failure will happen The question is, when a failure will happen
  • 32. A short note about availability Assume a service availability of 99,5% (incl. planned downtimes) •  10 services involved in a request à 95,1% probability of success •  50 services involved in a request à 77,8% probability of success
  • 34. Establish the ops-dev feedback loop
  • 35. The big wall between Dev and Ops
  • 36. In a distributed environment, you cannot solve availability issues on an infrastructure level only
  • 37. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect
  • 38. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect All developer activities towards improving robustness are basically “shooting at the dark” which is neither effective not sustainable Having a wall between Dev and Ops breaks the cycle required to implement effective robustness measures
  • 39. For effective resilient software design you need a working ops-dev feedback loop
  • 40. Establishing the feedback loop •  Adopt DevOps •  Adopt Site Reliability Engineering (SRE) •  Or do it your own way if you know a better way ... •  ... but make sure you establish the required feedback loops!
  • 43. Without proper functional design nothing else matters
  • 44. Isolation •  System must not fail as a whole •  Split system in parts and isolate parts against each other •  Avoid cascading failures •  Foundation of resilient software design
  • 45. Bulkhead •  Bulkheads implement the “parts” that need to be isolated •  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”) •  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ... •  Shaping good bulkheads is a pure functional design issue (and extremely hard)
  • 46. Hmm, sound easy. Why should that be hard?
  • 47. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design How do we avoid this …
  • 48. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  • 49. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... without ending up with this?
  • 50. Let us apply our well-known best practices •  Divide & conquer a.k.a. functional decomposition •  DRY (Don’t Repeat Yourself) •  Design for reusability •  Layered architecture •  …
  • 52. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design ... this usually leads to this …
  • 53. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  • 54. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... and in the end also often to this.
  • 56. Caches to the rescue!
  • 57. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design CacheofB Break tight service coupling by caching data/responses of downstream service
  • 58. Caches to the rescue?
  • 59. Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design? * * Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
  • 60. We have to re-learn design for distributed system
  • 62. Yet, a few guiding thoughts ...
  • 63. Foundations of design •  “High cohesion, low coupling” & “separation of concerns” •  “Crucial across process boundaries •  Still poorly understood issue •  Start with •  Understanding organizational boundaries •  Understanding use cases and flows •  Identifying functional domains (à DDD) •  Finding areas that change independently •  Do not start with a data model!
  • 64. Short activation paths •  Long activation paths affect availability •  Increase likelihood of failures •  Minimize remote calls per request •  Need to balance opposing forces •  Avoid monolith à clear separation of concerns •  Minimize requests à cluster functionality & data •  Caches can sometimes help, but stale data as trade-off
  • 65. Be (extremely) wary of reusability •  Reusability increases coupling •  Reusability usually leads to bad service design •  Reusability compromises availability •  Reusability rarely pays •  Do not strive for reusable services •  Strive for replaceable services instead •  Try to tackle reusability issues with libraries
  • 68. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Bulkhead System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement
  • 69. Using resilience patterns •  Patterns are options, not obligations •  Don’t pick too many patterns •  Each pattern increases complexity •  Complexity is the enemy of robustness •  Each pattern costs money in dev & ops •  You only have a limited resilience budget •  Look for complementary patterns
  • 71. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Escalation Communication paradigm Isolation System level Monitor Heartbeat Either level Hot deployments Restart Let it crash! Node level Actor Messaging Erlang (Akka) Core patterns
  • 72. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Fallback Share load Bounded queue Communication paradigm Isolation System level Monitor Either level Error injection Retry Limit retries Node level Circuit breaker Timeout Zero downtime deployment Canary releases Redundancy Several variants (Micro)service Request/ response Netflix Core patterns
  • 75. We face a new generation of developers every 5 years
  • 76. We loose our collective memory every 5 years * * Mean time until a topic discussion in the community starts over form scratch
  • 77. Time working in IT Growth of knowledge Depth of insights
  • 78. What do we do to compensate this effect?
  • 79. We look for the new & shiny stuff ...
  • 80. ... as anything not new must be useless crap!
  • 81. We need to rediscover our insights every 5 years
  • 82. In IT, we suffer from continuous collective amnesia and we are even proud of it
  • 83. How can we become better?
  • 85. The 7 quests at a glance
  • 86. Wrap-up •  The road to resilient software design is a twisted one! •  Most challenges are only indirectly related to RSD •  Most challenges are not coding related •  Mastering functional design is extremely hard ... •  ... while learning the patterns is relatively easy •  How do we preserve our collective memory?
  • 87. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried