Architecting for failures in
micro services:
patterns and lessons learned
Bhakti Mehta
@bhakti_mehta
Introduction
• Platform@Atlassian
• In the past Platform Lead at BlueJeans Network
• Worked at Sun Microsystems/Oracle for 13 years
• Committer to numerous open source projects
including GlassFish Application Server
My recent book
Previous book
What you will learn
• Path to micro services
• Challenges at scale
• Lessons learned, tips and practices to prevent
cascading failures
• Resilience planning at various stages
• Real world examples
Path to micro services
• Advantages
–Simplicity
–Isolation of problems
–Scale up and scale down
–Easy deployment
–Clear separation of concerns
–Heterogeneity and polyglotism
Sounds great!!
In reality……..
Monoliths to Micro services
Path to micro services
• Disadvantages
–Not a free lunch!
–Distributed systems prone to failures
–Eventual consistency
–More effort in terms of deployment and release management
–Challenges in testing services that evolve independently (regression tests, etc.)
Resilient system
• Processes transactions, even when there are transient
impulses or persistent stresses
• Functions even when there are component failures
disrupting normal processing
• Accepts failures will happen
• Designs for crumple zones
Kinds of failures
• Challenges at scale
• Integration point failures
• Network errors
• Semantic errors
• Slow responses
• Outright hang
• GC issues
Challenges at scale
Anticipate failures at scale
• Anticipate growth
• Design for next order of magnitude
• Design for 10x, plan to rewrite for 100x

Architecting for failures
The more you sweat on the field
the less you bleed in war!!!
Resiliency planning Stage 1
• When developing code
• Avoiding Cascading failures
• Circuit breaker
• Timeouts
• Retry
• Bulkhead
• Cache optimizations
• Avoid malicious clients
• Rate limiting
Resiliency planning Stage 2
• Planning for dealing with failures before deploy
• load test
• a/b test
• longevity

Resiliency planning Stage 3
• Watching out for failures after deploy
• health check
• metrics
Cascading failures
Cascading failures
Caused by Chain reactions
For example
One node in a load balance group fails
Others need to pick up work
Eventually performance can degenerate
Cascading failures with aggregation
Timeouts pattern
Timeouts
• Clients may prefer a definite response over waiting indefinitely:
• failure
• success
• job queued for later
All aggregation requests to microservices should have
reasonable timeouts set
Types of Timeouts
• Connection timeout
• Maximum time allowed to establish a connection before an
error is raised
• Socket timeout
• Maximum period of inactivity between two packets once the
connection is established
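A minimal sketch of setting both timeouts with the JDK's HttpURLConnection, where the read timeout plays the role of the socket timeout. The class name TimeoutConfig, the URL and the millisecond values are assumptions for illustration, not taken from the talk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class TimeoutConfig {
    // Hypothetical values; choose them per dependency from measured latencies.
    static final int CONNECT_TIMEOUT_MILLIS = 500;
    static final int READ_TIMEOUT_MILLIS = 2000;

    /** Prepares (without connecting yet) an HTTP connection with both timeouts set. */
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Connection timeout: max time to establish the connection.
            conn.setConnectTimeout(CONNECT_TIMEOUT_MILLIS);
            // Socket (read) timeout: max idle time while waiting for data.
            conn.setReadTimeout(READ_TIMEOUT_MILLIS);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without explicit values both timeouts default to 0, which the JDK interprets as "wait forever".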
Timeouts pattern
• Timeouts + Retries go together
• Transient failures can be remedied with fast retries
• However, network problems can last for a while, so
retries may keep failing
Retry pattern
• Retry requests that fail because of network failures,
timeouts or server errors
• Helps with transient network errors such as dropped
connections or server failover
Retry pattern
• If one of the services is slow or malfunctioning and
other services keep retrying then the problem
becomes worse
• Solution
• Exponential back off
• Circuit breaker pattern
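Exponential back off can be sketched in a few lines of plain Java. The helper name Retry.withBackoff and its parameters are hypothetical; the sketch assumes failures surface as runtime exceptions and that the operation is safe to repeat.

```java
import java.util.function.Supplier;

final class Retry {
    /**
     * Calls the action up to maxAttempts times. After each failed attempt it
     * sleeps, doubling the delay every time (base, 2x base, 4x base, ...).
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T withBackoff(Supplier<T> action, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        long delay = baseDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;              // out of attempts, give up
                }
                sleep(delay);
                delay *= 2;             // exponential back off
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

A production version would usually also add random jitter to the delay so that many clients do not retry at the same instant.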
Circuit breaker pattern
A circuit breaker is an electrical device, installed in an
electrical panel, that monitors and controls the current
(in amps) flowing through a circuit
Circuit breaker pattern
• Safety device
• If a power surge occurs in the electrical wiring, the
breaker trips
• It flips from “On” to “Off” and shuts off electrical power
through that breaker
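The software analogue of this safety device can be sketched as a small state machine: closed (calls flow), open (calls fail fast), half open (a trial call is allowed). The class below is a simplified illustration, not the speaker's implementation; the threshold, the open interval and the injected clock are assumptions.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures before tripping
    private final long openMillis;       // how long to stay open before a trial call
    private final LongSupplier clock;    // injected so the timing can be tested
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    /** Runs the action unless the breaker is open; uses the fallback otherwise. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();       // open: fail fast, do not touch the service
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (state != State.OPEN) {
            return true;
        }
        if (clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;     // let a trial call through
            return true;
        }
        return false;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;          // trip the breaker
            openedAt = clock.getAsLong();
        }
    }
}
```

In normal use the clock would be System::currentTimeMillis. This sketch may let more than one trial call through under concurrency; a stricter version would track the single in-flight trial.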
Bulkhead
Bulkhead
• Avoiding chain reactions by isolating failures
• Helps prevent cascading failures
Bulkhead
• An example of bulkhead could be isolating the
database dependencies per service
• Similarly other infrastructure components can be
isolated such as cache infrastructure
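Besides isolating infrastructure per service, the same idea can be applied inside a process by capping the number of concurrent calls to each dependency. The semaphore-based class below is a sketch of that in-process variant, which the slides do not cover; the name Bulkhead and the limit are assumptions.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /**
     * Runs the action if a permit is free; otherwise returns the rejected
     * result immediately instead of queueing behind a slow dependency.
     */
    <T> T call(Supplier<T> action, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get();
        }
        try {
            return action.get();
        } finally {
            permits.release();   // always hand the permit back
        }
    }
}
```

With one Bulkhead instance per dependency, a slow database can exhaust only its own permits, so calls to the cache or to other services keep flowing.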
Rate limiting
Rate Limiting
• Restricting the number of requests that can be made
by a client
• Client can be identified based on the access token
used
• Additionally clients can be identified based on IP
address
Rate Limiting
• With JAX-RS, rate limiting can be implemented as a
filter
• The filter checks the access count for a client and
accepts the request if it is within the limit
• Otherwise it rejects the request with a 429 (Too Many Requests) status
• Code at https://github.com/bhakti-mehta/samples/tree/master/ratelimiting
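The counting logic behind such a filter can be sketched in plain Java. This is not the code from the linked repository: the class RateLimiter, the fixed-window policy and the injected clock are assumptions, and in a JAX-RS application the statusFor check would be called from a request filter with the client's access token or IP address as the key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

final class RateLimiter {
    static final int OK = 200;
    static final int TOO_MANY_REQUESTS = 429;

    /** Request count for one client within the current time window. */
    private static final class Window {
        final long start;
        int count;
        Window(long start) { this.start = start; }
    }

    private final int limit;             // max requests per window per client
    private final long windowMillis;
    private final LongSupplier clock;    // injected so the timing can be tested
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    RateLimiter(int limit, long windowMillis, LongSupplier clock) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns true and counts the request if the client is within its limit. */
    boolean tryAcquire(String clientId) {
        long now = clock.getAsLong();
        boolean[] allowed = {false};
        // compute() runs atomically per key, so the check and the increment cannot interleave.
        windows.compute(clientId, (id, w) -> {
            if (w == null || now - w.start >= windowMillis) {
                w = new Window(now);     // start a fresh window
            }
            if (w.count < limit) {
                w.count++;
                allowed[0] = true;
            }
            return w;
        });
        return allowed[0];
    }

    /** The HTTP status a filter would answer with for this request. */
    int statusFor(String clientId) {
        return tryAcquire(clientId) ? OK : TOO_MANY_REQUESTS;
    }
}
```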
Cache optimizations
• Stores response information related to requests in a
temporary storage for a specific period of time
• Ensures that server is not burdened processing those
requests in future when responses can be fulfilled from
the cache
Cache optimizations
• Getting from first level cache
• Getting from second level cache
• Getting from the DB
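The lookup order above can be sketched as a read-through cache. The class TwoLevelCache is hypothetical: the second level is a plain Map standing in for a shared cache, the database is a Function, and expiry after a specific period is left out to keep the sketch short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class TwoLevelCache<K, V> {
    private final Map<K, V> firstLevel = new ConcurrentHashMap<>(); // in-process
    private final Map<K, V> secondLevel;     // stand-in for a shared cache
    private final Function<K, V> database;   // stand-in for the DB lookup

    TwoLevelCache(Map<K, V> secondLevel, Function<K, V> database) {
        this.secondLevel = secondLevel;
        this.database = database;
    }

    V get(K key) {
        V value = firstLevel.get(key);        // 1. first level cache
        if (value != null) {
            return value;
        }
        value = secondLevel.get(key);         // 2. second level cache
        if (value == null) {
            value = database.apply(key);      // 3. the DB
            if (value == null) {
                return null;                  // not found anywhere
            }
            secondLevel.put(key, value);
        }
        firstLevel.put(key, value);           // populate on the way back
        return value;
    }
}
```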
Dealing with latencies in
response
• Have a timeout for the aggregation service
• Dispatch requests in parallel and collect responses
• Associate a priority with all the responses collected
Handling partial failures best
practices
• One service calls another which can be slow or
unavailable
• Never block indefinitely waiting for the service
• Try to return partial results
• Provide a caching layer and return cached data
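One way to combine these practices with the aggregation timeout: dispatch all calls in parallel, bound each with a timeout, fall back to cached data, and return whatever is available. The sketch uses CompletableFuture from the JDK (Java 9 or later for completeOnTimeout); the class Aggregator and the service names in the usage are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class Aggregator {
    /**
     * Runs every call in parallel. A call that fails or exceeds the timeout is
     * replaced by its cached value if one exists, and omitted otherwise, so the
     * caller gets a partial result instead of blocking indefinitely.
     */
    static Map<String, String> aggregate(Map<String, Supplier<String>> calls,
                                         Map<String, String> cached,
                                         long timeoutMillis) {
        Map<String, CompletableFuture<String>> futures = new LinkedHashMap<>();
        // Dispatch everything first so the calls overlap.
        calls.forEach((name, call) -> futures.put(name,
                CompletableFuture.supplyAsync(call)
                        .completeOnTimeout(null, timeoutMillis, TimeUnit.MILLISECONDS)
                        .exceptionally(error -> null)));

        Map<String, String> result = new LinkedHashMap<>();
        futures.forEach((name, future) -> {
            String value = future.join();     // returns within the timeout
            if (value == null) {
                value = cached.get(name);     // slow or failed: try the cache
            }
            if (value != null) {
                result.put(name, value);
            }
        });
        return result;
    }
}
```

A timed-out call keeps running in the background in this sketch; cancelling the underlying request is left out.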
Logging
• Complex distributed systems introduce many points
of failure
• Logging helps link events/transactions between
various components that make an application or a
business service
• ELK stack
• Splunk, syslog
• Loggly
• LogEntries
Logging best practices
• Use a detailed, consistent pattern across service
logs
• Obfuscate sensitive data
• Identify caller or initiator as part of logs
• Do not log payloads by default
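A small sketch of these practices: one consistent key=value pattern, the caller identified on every line, and secrets masked before they reach the log. The class LogLine and the field names are assumptions; a real service would plug the same idea into its logging framework.

```java
final class LogLine {
    /** Masks a secret, keeping only the last four characters for correlation. */
    static String mask(String secret) {
        if (secret == null || secret.length() <= 4) {
            return "****";
        }
        return "****" + secret.substring(secret.length() - 4);
    }

    /** Formats every log line with the same fields in the same order. */
    static String format(String requestId, String caller, String service, String message) {
        return "requestId=" + requestId
                + " caller=" + caller
                + " service=" + service
                + " msg=" + message;
    }
}
```

Passing the same requestId through every service makes it possible to link the events of one transaction across components.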
Best practices when designing
APIs for mobile clients
• Avoid chattiness
• Use aggregator pattern
Thoughts of the on call person paged at 3 am debugging an
issue
Resilience planning Stage 2
• Before deploy
• Load testing
• Longevity testing
• Capacity planning
Load testing
• Ensure that you test for load on APIs
• Plan for longevity testing
Capacity Planning
• Anticipate growth
• Design for handling exponential growth
Resilience planning Stage 3
• After deploy
• Health check
• Metrics and Monitoring
• Phased rollout of features
Health Check
Health Check
• Memory
• CPU
• Threads
• Error rate
• If any check exceeds its threshold, send an alert
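The threshold check can be sketched as a comparison of readings against limits. The class HealthCheck, the reading names and the limits are assumptions; the JVM sampling uses standard java.lang.management calls, while the error rate would come from the service's own request metrics.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HealthCheck {
    /** Returns one alert message per reading that exceeds its threshold. */
    static List<String> evaluate(Map<String, Double> readings,
                                 Map<String, Double> thresholds) {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, Double> reading : readings.entrySet()) {
            Double limit = thresholds.get(reading.getKey());
            if (limit != null && reading.getValue() > limit) {
                alerts.add(reading.getKey() + " above threshold");
            }
        }
        return alerts;
    }

    /** Samples a few JVM-level readings for the checks above. */
    static Map<String, Double> sampleJvm() {
        Runtime rt = Runtime.getRuntime();
        Map<String, Double> readings = new LinkedHashMap<>();
        // Fraction of the maximum heap currently in use.
        readings.put("memory",
                (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory());
        // System load average; negative when the platform cannot report it.
        readings.put("loadAverage",
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
        readings.put("threads",
                (double) ManagementFactory.getThreadMXBean().getThreadCount());
        return readings;
    }
}
```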
Metrics and Monitoring
Metrics
• Response times, throughput
• Identify slow running DB queries
• GC rate and pause duration
• Garbage collection can cause slow responses
• Monitor unusual activity
Metrics
• Load average
• Uptime
• Log sizes
• Response times
Monitoring
[Diagram: monitoring server, production environment, checks, alerts, email]
Rollout of new features
• Phasing rollout of new features
• Have a way to turn features off if not behaving as
expected
• Alerts and more alerts!
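A phased rollout with an off switch can be sketched as a percentage flag per feature. The class FeatureFlags and the hash-based bucketing are assumptions for illustration; real systems usually load the percentages from configuration so they can change without a deploy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FeatureFlags {
    // feature name -> percentage of users (0..100) who get the feature
    private final Map<String, Integer> rolloutPercent = new ConcurrentHashMap<>();

    /** Sets the share of users who see the feature, clamped to 0..100. */
    void rollout(String feature, int percent) {
        rolloutPercent.put(feature, Math.max(0, Math.min(100, percent)));
    }

    /** Off switch for a feature that is not behaving as expected. */
    void turnOff(String feature) {
        rolloutPercent.put(feature, 0);
    }

    /** Unknown features are off; a given user always lands in the same bucket. */
    boolean isEnabled(String feature, String userId) {
        int percent = rolloutPercent.getOrDefault(feature, 0);
        int bucket = Math.floorMod((feature + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }
}
```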
Real world examples
• Netflix's Simian Army induces failures of services and
even datacenters during the working day to test both
the application's resilience and monitoring.
• Latency Monkey to simulate slow running requests
• Wiremock to mock services
• Saboteur to create deliberate network mayhem
Takeaway
• Inevitability of failures
• Expect systems will fail
• Failure prevention
• Automate
Questions
• Twitter: @bhakti_mehta
• Email: bmehta@atlassian.com

More Related Content

What's hot

Performance testing virtualized systems v5
Performance testing virtualized systems v5Performance testing virtualized systems v5
Performance testing virtualized systems v5
Mentora
 
Exploiting Active Directory Administrator Insecurities
Exploiting Active Directory Administrator InsecuritiesExploiting Active Directory Administrator Insecurities
Exploiting Active Directory Administrator Insecurities
Priyanka Aash
 
Perfmon And Profiler 101
Perfmon And Profiler 101Perfmon And Profiler 101
Perfmon And Profiler 101
Quest Software
 
PHP North-East - Automated Deployment
PHP North-East - Automated DeploymentPHP North-East - Automated Deployment
PHP North-East - Automated Deployment
Michael Peacock
 

What's hot (20)

Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014
 
Resolving problems & high availability
Resolving problems & high availabilityResolving problems & high availability
Resolving problems & high availability
 
Training Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed cachingTraining Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed caching
 
Decomposing the monolith into embeddable microservices using OWIN, WebHooks, ...
Decomposing the monolith into embeddable microservices using OWIN, WebHooks, ...Decomposing the monolith into embeddable microservices using OWIN, WebHooks, ...
Decomposing the monolith into embeddable microservices using OWIN, WebHooks, ...
 
Performance management
Performance managementPerformance management
Performance management
 
Performance testing virtualized systems v5
Performance testing virtualized systems v5Performance testing virtualized systems v5
Performance testing virtualized systems v5
 
Deploying PHP apps on the cloud
Deploying PHP apps on the cloudDeploying PHP apps on the cloud
Deploying PHP apps on the cloud
 
Cloud computing Fundamentals - behind the hood of cloud platforms
Cloud computing Fundamentals - behind the hood of cloud platformsCloud computing Fundamentals - behind the hood of cloud platforms
Cloud computing Fundamentals - behind the hood of cloud platforms
 
EXPERIENCE WITH MYSQL HA SOLUTION AND GROUP REPLICATION
EXPERIENCE WITH MYSQL HA SOLUTION AND GROUP REPLICATIONEXPERIENCE WITH MYSQL HA SOLUTION AND GROUP REPLICATION
EXPERIENCE WITH MYSQL HA SOLUTION AND GROUP REPLICATION
 
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
 
Top 5 Java Performance Metrics, Tips & Tricks
Top 5 Java Performance Metrics, Tips & TricksTop 5 Java Performance Metrics, Tips & Tricks
Top 5 Java Performance Metrics, Tips & Tricks
 
Webinar Slides: Real-Time Replication vs. ETL - How Analytics Requires New Te...
Webinar Slides: Real-Time Replication vs. ETL - How Analytics Requires New Te...Webinar Slides: Real-Time Replication vs. ETL - How Analytics Requires New Te...
Webinar Slides: Real-Time Replication vs. ETL - How Analytics Requires New Te...
 
What Are Your Servers Doing While You’re Sleeping?
What Are Your Servers Doing While You’re Sleeping?What Are Your Servers Doing While You’re Sleeping?
What Are Your Servers Doing While You’re Sleeping?
 
Exploiting Active Directory Administrator Insecurities
Exploiting Active Directory Administrator InsecuritiesExploiting Active Directory Administrator Insecurities
Exploiting Active Directory Administrator Insecurities
 
Eric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New ContextsEric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New Contexts
 
Perfmon And Profiler 101
Perfmon And Profiler 101Perfmon And Profiler 101
Perfmon And Profiler 101
 
Make Drupal Run Fast - increase page load speed
Make Drupal Run Fast - increase page load speedMake Drupal Run Fast - increase page load speed
Make Drupal Run Fast - increase page load speed
 
Neoload
Neoload Neoload
Neoload
 
... No it's Apache Kafka!
... No it's Apache Kafka!... No it's Apache Kafka!
... No it's Apache Kafka!
 
PHP North-East - Automated Deployment
PHP North-East - Automated DeploymentPHP North-East - Automated Deployment
PHP North-East - Automated Deployment
 

Viewers also liked

Bbc jan13 ftth_households
Bbc jan13 ftth_householdsBbc jan13 ftth_households
Bbc jan13 ftth_households
Bailey White
 
Ecce de-gids nl
Ecce de-gids nlEcce de-gids nl
Ecce de-gids nl
swaipnew
 
Splunk Dynamic lookup
Splunk Dynamic lookupSplunk Dynamic lookup
Splunk Dynamic lookup
Splunk
 

Viewers also liked (20)

Bbc jan13 ftth_households
Bbc jan13 ftth_householdsBbc jan13 ftth_households
Bbc jan13 ftth_households
 
Ecce de-gids nl
Ecce de-gids nlEcce de-gids nl
Ecce de-gids nl
 
AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance Challenges
 
Deploying services: automation with docker and ansible
Deploying services: automation with docker and ansibleDeploying services: automation with docker and ansible
Deploying services: automation with docker and ansible
 
Cloud Foundry Logging and Metrics
Cloud Foundry Logging and MetricsCloud Foundry Logging and Metrics
Cloud Foundry Logging and Metrics
 
Adaptive Content Show & Tell - Austin Content
Adaptive Content Show & Tell - Austin ContentAdaptive Content Show & Tell - Austin Content
Adaptive Content Show & Tell - Austin Content
 
Secure Yourself, Practice what we preach - BSides Austin 2015
Secure Yourself, Practice what we preach - BSides Austin 2015Secure Yourself, Practice what we preach - BSides Austin 2015
Secure Yourself, Practice what we preach - BSides Austin 2015
 
Urban legends - PJ Hagerty - Codemotion Amsterdam 2017
Urban legends - PJ Hagerty - Codemotion Amsterdam 2017Urban legends - PJ Hagerty - Codemotion Amsterdam 2017
Urban legends - PJ Hagerty - Codemotion Amsterdam 2017
 
Splunk Dynamic lookup
Splunk Dynamic lookupSplunk Dynamic lookup
Splunk Dynamic lookup
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Honey Potz - BSides SLC 2015
Honey Potz - BSides SLC 2015Honey Potz - BSides SLC 2015
Honey Potz - BSides SLC 2015
 
Tubular Labs - Using Elastic to Search Over 2.5B Videos
Tubular Labs - Using Elastic to Search Over 2.5B VideosTubular Labs - Using Elastic to Search Over 2.5B Videos
Tubular Labs - Using Elastic to Search Over 2.5B Videos
 
Bsides Delhi Security Automation for Red and Blue Teams
Bsides Delhi Security Automation for Red and Blue TeamsBsides Delhi Security Automation for Red and Blue Teams
Bsides Delhi Security Automation for Red and Blue Teams
 
Micro Services - Small is Beautiful
Micro Services - Small is BeautifulMicro Services - Small is Beautiful
Micro Services - Small is Beautiful
 
Data Visualization on the Tech Side
Data Visualization on the Tech SideData Visualization on the Tech Side
Data Visualization on the Tech Side
 
Resume
ResumeResume
Resume
 
Demystifying Security Analytics: Data, Methods, Use Cases
Demystifying Security Analytics: Data, Methods, Use CasesDemystifying Security Analytics: Data, Methods, Use Cases
Demystifying Security Analytics: Data, Methods, Use Cases
 
Heterogenous Persistence
Heterogenous PersistenceHeterogenous Persistence
Heterogenous Persistence
 
DevOps Offerings at WhiteHedge
DevOps Offerings at WhiteHedgeDevOps Offerings at WhiteHedge
DevOps Offerings at WhiteHedge
 
Docker experience @inbotapp
Docker experience @inbotappDocker experience @inbotapp
Docker experience @inbotapp
 

Similar to Architecting for Failures in micro services: patterns and lessons learned

A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
confluent
 
Goal Driven Performance Optimization, Peter Zaitsev
Goal Driven Performance Optimization, Peter ZaitsevGoal Driven Performance Optimization, Peter Zaitsev
Goal Driven Performance Optimization, Peter Zaitsev
Fuenteovejuna
 
Goal driven performance optimization (Пётр Зайцев)
Goal driven performance optimization (Пётр Зайцев)Goal driven performance optimization (Пётр Зайцев)
Goal driven performance optimization (Пётр Зайцев)
Ontico
 
Performance and Scalability Tuning
Performance and Scalability TuningPerformance and Scalability Tuning
Performance and Scalability Tuning
Andres March
 

Similar to Architecting for Failures in micro services: patterns and lessons learned (20)

Resilience planning and how the empire strikes back
Resilience planning and how the empire strikes backResilience planning and how the empire strikes back
Resilience planning and how the empire strikes back
 
Expect the unexpected: Anticipate and prepare for failures in microservices b...
Expect the unexpected: Anticipate and prepare for failures in microservices b...Expect the unexpected: Anticipate and prepare for failures in microservices b...
Expect the unexpected: Anticipate and prepare for failures in microservices b...
 
Resilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes BackResilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes Back
 
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
 
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming ApplicationsMetrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
 
Performance Testing
Performance TestingPerformance Testing
Performance Testing
 
Microservices for java architects it-symposium-2015-09-15
Microservices for java architects it-symposium-2015-09-15Microservices for java architects it-symposium-2015-09-15
Microservices for java architects it-symposium-2015-09-15
 
Kafka PPT.pptx
Kafka PPT.pptxKafka PPT.pptx
Kafka PPT.pptx
 
Goal Driven Performance Optimization, Peter Zaitsev
Goal Driven Performance Optimization, Peter ZaitsevGoal Driven Performance Optimization, Peter Zaitsev
Goal Driven Performance Optimization, Peter Zaitsev
 
Cloud Architecture & Distributed Systems Trivia
Cloud Architecture & Distributed Systems TriviaCloud Architecture & Distributed Systems Trivia
Cloud Architecture & Distributed Systems Trivia
 
Roman Rehak: 24/7 Database Administration + Database Mail Unleashed
Roman Rehak: 24/7 Database Administration + Database Mail UnleashedRoman Rehak: 24/7 Database Administration + Database Mail Unleashed
Roman Rehak: 24/7 Database Administration + Database Mail Unleashed
 
Performance Testing Overview
Performance Testing OverviewPerformance Testing Overview
Performance Testing Overview
 
Goal driven performance optimization (Пётр Зайцев)
Goal driven performance optimization (Пётр Зайцев)Goal driven performance optimization (Пётр Зайцев)
Goal driven performance optimization (Пётр Зайцев)
 
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
 
Performance and Scalability Tuning
Performance and Scalability TuningPerformance and Scalability Tuning
Performance and Scalability Tuning
 
SQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should KnowSQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should Know
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management
 
JMeter
JMeterJMeter
JMeter
 
WebLogic Stability; Detect and Analyse Stuck Threads
WebLogic Stability; Detect and Analyse Stuck ThreadsWebLogic Stability; Detect and Analyse Stuck Threads
WebLogic Stability; Detect and Analyse Stuck Threads
 

More from Bhakti Mehta

Con fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhaktiCon fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhakti
Bhakti Mehta
 

More from Bhakti Mehta (7)

Reliability teamwork
Reliability teamworkReliability teamwork
Reliability teamwork
 
Let if flow: Java 8 Streams puzzles and more
Let if flow: Java 8 Streams puzzles and moreLet if flow: Java 8 Streams puzzles and more
Let if flow: Java 8 Streams puzzles and more
 
Real world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsReal world RESTful service development problems and solutions
Real world RESTful service development problems and solutions
 
Think async
Think asyncThink async
Think async
 
Fight empire-html5
Fight empire-html5Fight empire-html5
Fight empire-html5
 
50 tips50minutes
50 tips50minutes50 tips50minutes
50 tips50minutes
 
Con fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhaktiCon fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhakti
 

Recently uploaded

VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 

Recently uploaded (20)

UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 

Architecting for Failures in micro services: patterns and lessons learned

  • 1. Architecting for failures in micro services: patterns and lessons learned Bhakti Mehta @bhakti_mehta
  • 2. Introduction • Platform@Atlassian • In the past Platform Lead at BlueJeans Network • Worked at Sun Microsystems/Oracle for 13 years • Committer to numerous open source projects including GlassFish Application Server
  • 5. What you will learn • Path to micro services • Challenges at scale • Lessons learned, tips and practices to prevent cascading failures • Resilience planning at various stages • Real world examples
  • 6. Path to micro services • Advantages –Simplicity –Isolation of problems –Scale up and scale down –Easy deployment –Clear separation of concerns –Heterogeneity and polyglotism
  • 10. Path to micro services • Disadvantages –Not a free lunch! –Distributed systems prone to failures –Eventual consistency –More effort in terms of deployments, release managements – Challenges in testing the various services evolving independently, regression tests etc
  • 11. Resilient system • Processes transactions, even when there are transient impulses, persistent stresses • Functions even when there are component failures disrupting normal processing • Accepts failures will happen • Designs for crumple zones
  • 12. Kinds of failures • Challenges at scale • Integration point failures • Network errors • Semantic errors. • Slow responses • Outright hang • GC issues
  • 13.
  • 15. Anticipate failures at scale • Anticipate growth • Design for next order of magnitude • Design for 10x plan to rewrite for 100x 

  • 17. The more you sweat on the field the less you bleed in war!!!
  • 18. Resiliency planning Stage 1 • When developing code • Avoiding Cascading failures • Circuit breaker • Timeouts • Retry • Bulkhead • Cache optimizations • Avoid malicious clients • Rate limiting
  • 19. Resiliency planning Stage 2 • Planning for dealing with failures before deploy • load test • a/b test • longevity

  • 20. Resiliency planning Stage 3 • Watching out for failures after deploy • health check • metrics
  • 21.
  • 23. Cascading failures Caused by Chain reactions For example One node in a load balance group fails Others need to pick up work Eventually performance can degenerate
  • 27. Timeouts • Clients may prefer a response • failure • success • job queued for later All aggregation requests to microservices should have reasonable timeouts set
  • 28. Types of Timeouts • Connection timeout • Max time before connection can be established or Error • Socket timeout • Max time of inactivity between two packets once connection is established
  • 29. Timeouts pattern • Timeouts + Retries go together • Transient failures can be remedied with fast retries • However problems in network can last for a while so probability of retries failing
  • 30. Retry pattern • Retry for failures in case of network failures, timeouts or server errors • Helps transient network errors such as dropped connections or server fail over
  • 31. Retry pattern • If one of the services is slow or malfunctioning and other services keep retrying then the problem becomes worse • Solution • Exponential back off • Circuit breaker pattern
  • 32. Circuit breaker pattern Circuit breaker A circuit breaker is an electrical device used in an electrical panel that monitors and controls the amount of amperes (amps) being sent through
  • 33. Circuit breaker pattern • Safety device • If a power surge occurs in the electrical wiring, the breaker will trip. • Flips from “On” to “Off” and shuts electrical power from that breaker
  • 35. Bulkhead • Avoiding chain reactions by isolating failures • Helps prevent cascading failures
  • 36. Bulkhead • An example of bulkhead could be isolating the database dependencies per service • Similarly other infrastructure components can be isolated such as cache infrastructure
  • 38. Rate Limiting • Restricting the number of requests that can be made by a client • Client can be identified based on the access token used • Additionally clients can be identified based on IP address
  • 39. Rate Limiting • With JAX-RS, rate limiting can be implemented as a filter • The filter checks the access count for a client and accepts the request if it is within the limit • Otherwise it returns a 429 (Too Many Requests) error • Code at https://github.com/bhakti-mehta/samples/tree/master/ratelimiting
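The check such a filter performs can be sketched independently of JAX-RS. The Python example below uses a fixed-window counter per client; it is an assumption about one possible counting scheme and is not the code in the linked repository. The class name and the returned status codes as plain integers are illustrative.

```python
import time


class FixedWindowRateLimiter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # client_id -> (window_start, request_count)

    def check(self, client_id):
        """Return 200 if the request is within the limit, else 429."""
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # window expired: start a fresh one
        if count >= self.limit:
            self._windows[client_id] = (start, count)
            return 429  # Too Many Requests
        self._windows[client_id] = (start, count + 1)
        return 200
```

Here `client_id` would be derived from the access token or the IP address, as the previous slide suggests.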
  • 40. Cache optimizations • Stores response information related to requests in temporary storage for a specific period of time • Ensures that the server is not burdened with processing those requests again when responses can be served from the cache
  • 41. Cache optimizations • Getting from first level cache • Getting from second level cache • Getting from the DB
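The lookup order on this slide (first level cache, then second level cache, then the DB) can be sketched in Python as follows. The `TtlCache` class, the `lookup` helper, and the idea of refilling faster levels on the way back are assumptions for illustration; real deployments would typically use an in-process cache and a shared cache service.

```python
import time


class TtlCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl_seconds)


def lookup(key, first_level, second_level, load_from_db):
    """Return (value, source). None values are not cacheable in this sketch."""
    value = first_level.get(key)
    if value is not None:
        return value, "first-level cache"
    value = second_level.get(key)
    if value is not None:
        first_level.put(key, value)  # refill the faster level
        return value, "second-level cache"
    value = load_from_db(key)
    second_level.put(key, value)
    first_level.put(key, value)
    return value, "db"
```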
  • 42. Dealing with latencies in response • Have a timeout for the aggregation service • Dispatch requests in parallel and collect responses • Associate a priority with all the responses collected
  • 43. Handling partial failures best practices • One service calls another which can be slow or unavailable • Never block indefinitely waiting for the service • Try to return partial results • Provide a caching layer and return cached data
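An aggregation service that dispatches requests in parallel, applies one overall timeout, and returns partial results might look like this Python sketch. The `aggregate` function and the shape of its return value are assumptions; the priority handling mentioned on the slides is not modeled here.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def aggregate(calls, timeout_seconds):
    """Run each named call in parallel; return (results, missing_names).

    calls maps a service name to a zero-argument callable. Services that
    fail or do not answer within timeout_seconds are listed as missing
    instead of blocking the whole response.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(calls)))
    futures = {pool.submit(fn): name for name, fn in calls.items()}
    done, not_done = wait(futures, timeout=timeout_seconds)

    results, missing = {}, []
    for future in done:
        name = futures[future]
        if future.exception() is None:
            results[name] = future.result()
        else:
            missing.append(name)  # the service answered with an error
    for future in not_done:
        future.cancel()  # only effective if the call has not started yet
        missing.append(futures[future])

    pool.shutdown(wait=False)  # do not block the response on slow calls
    return results, sorted(missing)
```

The caller can then fill the `missing` entries from a caching layer or simply omit them from the response.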
  • 44. Logging • Complex distributed systems introduce many points of failure • Logging helps link events/transactions between the various components that make up an application or a business service • ELK stack • Splunk, syslog • Loggly • LogEntries
  • 45. Logging best practices • Use a detailed, consistent pattern across service logs • Obfuscate sensitive data • Identify the caller or initiator as part of the logs • Do not log payloads by default
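A consistent, caller-aware log line with sensitive fields masked could be produced along these lines. This Python sketch is only one possible layout; the field names, the `SENSITIVE_KEYS` list, and the `format_event` helper are hypothetical.

```python
# Hypothetical list of keys whose values must never appear in logs.
SENSITIVE_KEYS = {"password", "access_token", "authorization"}


def format_event(service, caller, request_id, message, fields=None):
    """Build one key=value log line with the same layout for every service."""
    parts = [
        f"service={service}",
        f"caller={caller}",          # who initiated the request
        f"request_id={request_id}",  # links events across services
        f'msg="{message}"',
    ]
    for key in sorted(fields or {}):
        # Mask sensitive values instead of logging them.
        value = "***" if key.lower() in SENSITIVE_KEYS else fields[key]
        parts.append(f"{key}={value}")
    return " ".join(parts)
```

Passing the same `request_id` through every downstream call is what lets a log aggregator stitch one transaction together across components.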
  • 46. Best practices when designing APIs for mobile clients • Avoid chattiness • Use aggregator pattern
  • 47. Thoughts of the on call person paged at 3 am debugging an issue
  • 48. Resilience planning Stage 2 • Before deploy • Load testing • Longevity testing • Capacity planning
  • 49. Load testing • Ensure that you test for load on APIs • Plan for longevity testing
  • 50. Capacity Planning • Anticipate growth • Design for handling exponential growth
  • 51. Resilience planning Stage 3 • After deploy • Health check • Metrics and Monitoring • Phased rollout of features
  • 53. Health Check • Memory • CPU • Threads • Error rate • If any of the checks exceed a threshold send alert
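The threshold rule on this slide reduces to comparing each reading with its limit. A minimal Python sketch follows; the metric names and the `evaluate_health` helper are hypothetical, and how readings are collected or alerts delivered is left out.

```python
def evaluate_health(readings, thresholds):
    """Return the sorted names of checks whose reading exceeds its threshold.

    Checks with no reading are skipped here; a real health check would
    probably treat a missing reading as a problem of its own.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in readings and readings[name] > limit
    )
```

A non-empty result is the condition under which an alert would be sent.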
  • 55. Metrics • Response times, throughput • Identify slow running DB queries • GC rate and pause duration • Garbage collection can cause slow responses • Monitor unusual activity
  • 56. Metrics • Load average • Uptime • Log sizes • Response times
  • 58. Rollout of new features • Phasing rollout of new features • Have a way to turn features off if not behaving as expected • Alerts and more alerts!
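A phased rollout with an off switch is often implemented as a percentage-based feature flag. The Python sketch below assumes a hash-based bucketing scheme; the `is_enabled` function and its parameters are illustrative and not taken from the talk.

```python
import hashlib


def is_enabled(feature, user_id, rollout_percent, kill_switch=False):
    """Decide whether a feature is on for a user.

    Each (feature, user) pair hashes to a stable bucket in 0..99, so a
    user who gets the feature at 10% keeps it when the rollout grows.
    """
    if kill_switch:
        return False  # turn the feature off for everyone at once
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

In practice `rollout_percent` and `kill_switch` would come from configuration that can be changed without a redeploy.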
  • 59. Real-world examples • Netflix's Simian Army induces failures of services and even datacenters during the working day to test both the application's resilience and monitoring • Latency Monkey to simulate slow-running requests • WireMock to mock services • Saboteur to create deliberate network mayhem
  • 60. Takeaway • Inevitability of failures • Expect systems will fail • Failure prevention • Automate
  • 63. Questions • Twitter: @bhakti_mehta • Email: bmehta@atlassian.com