1. Resisting to the Shocks
Resilience Patterns in an unstable world!
STEFANO FAGO (Extended Version from Meetup
Crafted Software: 7th Edition 11th
October 2018)
2. Resilience?
The concept of Resilience has multiple
definitions; the definition we will use is:
… The Capacity to Recover Quickly from
Difficulties; Toughness. ...
3. What is a Resilient System?
<< ...it is a system that on the outside seems complex but is characterized
by a simpler modular structure made up of components that, when
necessary, can detach and reconfigure themselves: this prevents the
problems of one part from cascading onto the others... >>
[A. Zolli - http://resiliencethebook.com/]
A Resilient System is characterized by:
– dynamicity
– modularity
– diversity
– decoupling
– integrated shock absorbers
4. Why have a Resilient System?
● ...because having a 24/7, 99.99999% system... is Cool!?!
● ...because I'm ... an Incredible Software Engineer!?!
● ...because I don't want my Business to lose money!
<< ...Many systems are built to pass QA testing rather than to survive
the world after launch... >>
[Michael Nygard - https://pragprog.com/book/mnee2/release-it-second-edition]
5. Fallacies Of Distributed Computing
● The network is reliable
● Latency is zero
● Bandwidth is infinite
● The network is secure
● Topology doesn't change
● There is one administrator
● Transport cost is zero
● The network is homogeneous
[https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing]
[https://www.rgoarchitects.com/Files/fallacies.pdf]
6. Murphy's Laws for Resilience
● If there is anything that can break in the system, it will break!
● If there is something that can break the System, there is at least one
Customer who will find it!
● Under Pressure... things get worse!
● The size matters but ... You'll be wrong anyway!
<< ...the three most frequent types of failures we observed were due to: 1)
Inbound request pattern changes, including overload and bad actors 2)
Resource exhaustion such as CPU, memory, io_loop, or networking resources
3) Dependency failures, including infrastructure, data store, and
downstream services … >> [UBER Engineering]
7. Fragility? Some Causes...
● Usage of proprietary protocols and software
● Deployment of proprietary systems to a large number of computers
that cannot be properly assessed in terms of security vulnerabilities
or other potential misuses
● Single points of failure
● Inter-dependence of services
● Systems that can easily be influenced by pressure groups
● Weak architecture
● Missing fallback-scenarios, graceful degradation
https://devopsagenda.techtarget.com/opinion/Why-software-resilience-should-be-the-real-goal-of-DevOps
8. Resiliency isn't Reliability...
● Reliability: The target at which software designers have always aimed:
perfect operation all the time. Reliability is the planned outcome.
● Resiliency: The ability of an app to recover from certain types of failure
and yet remain functional from the customer perspective. Resilience is
how you achieve the outcome.
https://cabforward.com/the-difference-between-reliable-and-resilient-software/
9. Resilience in Distributed Systems:
What does it imply?
● 100% Trap: not IF it will break but ... WHEN it will break!
<< ...the normal state of operation is partial failure... >> [Adrian Hornsby]
● It is not a perfect feature!
<< ...it is impossible for a system to have all three properties of consistency,
resilience and partition-tolerance... >> [Architectural Design for
Resilience - Dong Liu, Ralph Deters, and W.J. Zhang (2010)]
● It implies complexity, it does not reduce it!
● It requires studying, measuring and understanding the business
objectives!
10. Resilience in Distributed Systems:
Base Elements
● Isolation: break the system down into parts, give each part autonomy, and
avoid the propagation of failures
● Low Coupling: complementary to Isolation, it contributes to the
non-propagation of failures; the Components are ignorant of each other
● Communication Methods: they condition how to model the domain and the
recovery mechanisms, and they can be heterogeneous (Sync, Async, Location
Transparency, Message Passing, Streaming, ...)
● Mitigate Failures: anticipate unavoidable failures and adopt both system and
application recovery mechanisms
11. Resilience in Distributed Systems:
Isolation is important
...using an intuitive point of view...
FAILURE & CHANGE [Mark Hibberd - https://www.youtube.com/watch?v=_VftQXWDkfk]
14. Patterns of Resilience: Bulkhead
Isolate! Don't Propagate!
● Redundancy of Systems and Resources: where possible, multiply a critical
resource to be readily replaceable
● Categorized Resource Allocation: Classify Resources and break them down into
measurable and manipulable reference pools
Warning: Redundancy and Pools may vary over time and some of them are
affected by more than one factor
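A minimal sketch of Categorized Resource Allocation, assuming Python and a semaphore per pool. The names `Bulkhead`, `payments` and `reports` and the pool sizes are hypothetical, chosen only for illustration:

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one resource pool so overload stays local."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of waiting when the pool is exhausted.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical pools: exhausting "reports" cannot starve "payments".
payments = Bulkhead("payments", max_concurrent=2)
reports = Bulkhead("reports", max_concurrent=1)
```

The pool sizes are placeholders: as the warning above says, they have to be measured and revisited over time.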
15. Patterns of Resilience: Queueing
Take Your Time!
● Deferrable Work : postpone a non-urgent activity
● Bounded Queue/ Load-Levelling Queue: load-absorbers for request or
traffic spikes
● BackPressure/Throttling: queue overload management policies to avoid
indefinite growth
WARNING: Asynchrony makes coordination complex, and the approach must be
refined using measurements taken from the real system
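A small sketch of a Bounded Queue with a simple overload policy, assuming Python's standard `queue` module. The names `requests` and `submit` and the size 3 are hypothetical; rejecting new work when the queue is full is only one of the possible policies:

```python
import queue

# Bounded: the queue absorbs a spike of up to 3 pending jobs, no more.
requests = queue.Queue(maxsize=3)


def submit(job) -> bool:
    """Try to enqueue a job; return False when the queue is full."""
    try:
        requests.put_nowait(job)
        return True
    except queue.Full:
        # Overload policy (an assumption): shed the request and tell the caller.
        return False


# A spike of 5 jobs: the first 3 are queued, the last 2 are rejected.
accepted = [submit(n) for n in range(5)]
```

A `False` result is the signal a caller can use to slow down or retry later, which is where BackPressure starts.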
16. Patterns of Resilience: Timeout
Stop Waiting: Fail Fast & Don't Propagate!
● Make the duration of an activity predictable
● Set Timing Goals, measure, refine according to reality
WARNING: The goals may be specific to one resource and should not impact the
others; how should timeout errors be handled?
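One possible answer to the question above, sketched in Python with `concurrent.futures`: bound the wait and return a fallback value. The names `call_with_timeout` and `slow_dependency` and the durations are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Wait at most timeout_s seconds for fn(); otherwise return fallback."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted,
        # the caller simply stops waiting for it.
        future.cancel()
        return fallback


def slow_dependency():
    time.sleep(0.3)  # stands in for a slow remote call
    return "slow answer"
```

Note that the worker thread keeps running after the timeout, so the slow resource is still consumed; the caller only stops waiting.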
17. Patterns of Resilience: Retry
If you fail once, try again!
Some failures are temporary or recoverable...
...Trying again requires deciding: the number of attempts and the growing
delay between the retries (backoff)
https://aws.amazon.com/it/blogs/architecture/exponential-backoff-and-jitter/
WARNING: Assumes you have analyzed the idempotence of the activities involved
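A sketch of retry with exponential backoff and "full jitter" (the variant described in the AWS article linked above), assuming Python. The function name `retry`, its defaults, and the choice of `ConnectionError` as the only retryable failure are assumptions for illustration:

```python
import random
import time


def retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
          sleep=time.sleep):
    """Call operation(); on ConnectionError retry with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # attempts exhausted: let the failure surface
            # The upper bound doubles on every attempt, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time between 0 and the bound.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the delay can be replaced in tests; `operation` must be safe to execute more than once.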
18. Patterns of Resilience: Fallback/Fail Silent
Don't Fail... Degrade gracefully!
Do not fail with destructive actions but with approximation or alternative
actions
● Default Value/Derived Value
● Alternative Actions/Invocations
● Caching
WARNING: You need to incorporate the related business conditions!
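A sketch combining Caching and Default Value, assuming Python. The function `recommendations`, the `fetch` callable and the default list are hypothetical; whether a stale or default answer is acceptable is exactly the business condition mentioned above:

```python
DEFAULT_RECOMMENDATIONS = ["bestsellers"]  # hypothetical static default


def recommendations(user_id, fetch, cache):
    """Return fresh data when possible, otherwise a degraded answer."""
    try:
        fresh = fetch(user_id)
    except Exception:
        # Degrade: last known value first, static default otherwise.
        return cache.get(user_id, DEFAULT_RECOMMENDATIONS)
    cache[user_id] = fresh  # remember the last good answer
    return fresh
```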
19. Patterns of Resilience: Limiter
No Stress, Know Your Limit!
● Rate-Limiter
● Concurrency-Limiter
● Adaptive Resource Sizing
● BackPressure/Throttling
WARNING: These policies should not replace the effort to understand
Resource-Sizing: use appropriate algorithms and refine them against real data
for the different use-cases.
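A sketch of a Rate-Limiter using a token bucket, one common algorithm among several, assuming Python. The class name `TokenBucket` and its parameters are hypothetical:

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` calls/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to the elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

The `clock` parameter is there so the limiter can be driven by a fake clock in tests.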
20. Patterns of Resilience: Circuit Breaker
Don't do it if it hurts!
Interrupt a pathological situation with controlled and immediate failure. The
state of failure is revoked according to indices or time conditions.
WARNING: Defining the parameters that trigger the failure state and the
recovery can be difficult, and you need to study the consequences on the
critical path of execution of the services.
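A sketch of a Circuit Breaker with a failure count as the index and a time condition for recovery, assuming Python. The names `CircuitBreaker` and `CircuitOpenError` and the default thresholds are hypothetical; it is not thread-safe:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```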
21. Patterns of Resilience: Decoupling By Events
Describe in terms of the things that happen (Event), not the things that
do the work (Command)
Isolate/Decouple components, Model with Domains, accept failures with
notifications allowing the recovery of the components / sub-systems
● Event-Sourcing / CQRS / Message-Passing
● SAGA ( alternative to 2PC)
WARNING: Asynchronous Activities and Domain Modeling make the system safer
but more complex. Queues and listener networks can be abused. There is a
tradeoff between Transactionality and Compensating Activities.
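A sketch of the compensation side of a SAGA, assuming Python and an in-process sequence of steps. The function `run_saga` and the step names in the usage are hypothetical; a real saga would coordinate through events or messages between services:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    Both elements are zero-argument callables. If an action fails, the
    compensations of the already completed steps run in reverse order.
    Returns True when every action succeeded, False otherwise.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True
```

For example, with steps "reserve stock", "charge card" and a failing "ship order", the saga would refund the card and then release the stock. Compensations can fail too, which this sketch does not handle.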
22. Patterns of Resilience: Chaos Engineering
<< ...Chaos Engineering is the discipline of experimenting on a distributed
system in order to build confidence in the system’s capability to withstand
turbulent conditions in production... >>
https://www.oreilly.com/ideas/chaos-engineering
● Implementing Testing in Production, with realistic data and volumes!
● Having the infrastructure for continuous experiments of ... Chaos!
● Learn from every failure / Always invent new failures!
WARNING: Complex Startup, specific Skills, get products and <<...don't use the
term Chaos Engineering, use Continuous limited scope disaster recovery
instead. You might actually get a budget that way...>> [Russ Miles]
23. From Resilient to (auto)Recoverable
Target for Architectural Maturity [Bilgin Ibryam]
24. From Resilient to (auto)Recoverable
At first sight you might think of adopting these patterns only as an
application solution but...
… it is in this context that DevOps practices and tools become an integral
part of a broader vision
– containers and container orchestration
– artifacts life cycle
– distribution policies for certificates, configurations and artifacts
– monitoring & metrics
WARNING: adopting DevOps implies complexity, skills, organization and <<
...application safety and correctness, in a distributed system is still the
responsibility, of the application... >> [Christian Posta]
25. From Resilient to (auto)Recoverable
In order to be suitable for automation (in cloud native) environments a service
must be:
– Idempotent for restarts (a service can be killed and started multiple times).
– Idempotent for scaling up/down (a service can be autoscaled to multiple
instances).
– Idempotent service producer (other services may retry calls).
– Idempotent service consumer (the service or the mesh can retry outgoing
calls).
If your service always behaves the same way when the above actions are
performed one or multiple times, then the platform will be able to recover your
services from failures without human intervention.
[https://www.infoq.com/articles/microservices-post-kubernetes - Bilgin
Ibryam]
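A sketch of an idempotent service producer, assuming Python and a client-supplied idempotency key. The class `PaymentService`, its in-memory store and the key format are hypothetical; a real service would persist the keys:

```python
class PaymentService:
    """Replays the stored result when the same request is retried."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first call
        self.charged = 0    # total actually charged, for illustration

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # A retry of a request already processed: no second charge.
            return self._results[idempotency_key]
        self.charged += amount
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result
```

With this shape, a caller or a mesh that retries the same call does not charge twice.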
26. Remember that ...
● Distributed systems are different because they fail often / Extract services
● Writing robust distributed systems costs more than writing robust single-
machine systems. / Robust, open source distributed systems are much less
common than robust, single-machine systems
● If you can fit your problem in memory, it’s probably trivial / “It’s slow” is the
hardest problem you’ll ever debug
● Implement backpressure throughout your system /Find ways to be partially
available
● Metrics are the only way to get your job done : Use percentiles, not averages
● Learn to estimate your capacity / Exploit data-locality / Writing cached data
back to persistent storage is bad
● Feature flags are how infrastructure is rolled out / Use the CAP theorem to
critique systems
https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
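Why "Use percentiles, not averages": a small worked example, assuming Python, made-up latency samples and a nearest-rank percentile (one of several definitions). The helper `percentile` is hypothetical:

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [10] * 98 + [2000, 3000]

mean_ms = statistics.mean(latencies_ms)  # about 59.8 ms
p50_ms = percentile(latencies_ms, 50)    # 10 ms
p99_ms = percentile(latencies_ms, 99)    # 2000 ms
```

The mean describes a request that never happened: the typical one takes 10 ms, while the slowest 2 in 100 take seconds.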
27. Resilience & Performance Anti-Patterns
Are you in doubt? Is the system getting complicated? It may be useful to
compare the design of the system, its services or the resilience patterns used
with the following performance anti-patterns!
● N+1 Calls
● N+1 Query
● Payload flood
● Granularity
● Tight-Coupling
● Inefficient Service Flow
● Dependencies
28. Reality will change again but ...
...do not waste money! Be Resilient and
Recoverable!
Thank You All!!!