Architectural Commandments for Building & Running Microservices at Scale
1. Confidential, Dynatrace, LLC
Architectural commandments for building & running
microservices at scale
Brian Wilson, Product Specialist, Dynatrace
@emperorwilson
Join our Podcast Series bit.ly/pureperf
8. Monolithic Code
public double getQuote(String type) {
    double quote = 0;
    for (Product product : products) {
        quote += product.getValue();
    }
    return quote;
}
N+1 Call Pattern
Works well within 1
process
9. N+1 Call Pattern
Product Service
Quote Service
1 call to Quote Service
= 44 calls to product
service
18. Granularity
Doc Processor Doc Transformer Doc Signer
Doc Encryption
Doc Shipment
Document Encryption is carved out as a separate
service. May not be the best option to run it as a
separate service
Documents
21. WPO (Web Performance Optimization)
taught us optimizing resource dependencies
when loading a web page by analyzing
Resource Waterfalls
22. Especially useful when page loads get very
complex and overloaded:
3rd party dependencies, non-optimized
resources, wrong cache settings, loading too
much data too early, …
23. SFPO (Service Flow Performance Optimization)
has to teach us how to optimize (micro)service
dependencies through Service Flows
24. Especially useful to identify: inefficient 3rd party services, recursive
call chains, N+1 Query Patterns, loading too much data, no data
caching, … -> sounds very familiar to WPO
43. 26.7s Load Time
5kB Payload
33! Service Calls
99kB - 3kB for each call!
171! Total SQL Count
Architecture Violation
Direct access to DB from frontend service
Single search query end-to-end
44. The fixed end-to-end use case
2.5s (vs 26.7)
5kB Payload
1! (vs 33!) Service Call
5kB (vs 99) Payload!
3! (vs 171) Total
SQL Count
46. Infrastructure Utilization
Is the load on microservices equally load
balanced?
When do you scale up/down?
• CPU
• Memory
• Load
Use automation process to scale up/down
Original recording can be found here - https://info.dynatrace.com/apm_dtm_all_17q2_wc_microservices_en_registration.html
PurePerformance Podcast:
http://bit.ly/pureperf
Or
http://www.spreaker.com/user/pureperformance
Today, we are going to look at three important areas to focus on when moving from monolith to microservices. Most of the data we’re going to look at today comes from experiences shared with us by our customers. Not just stories our customers related to us, but, especially in the anti-patterns area, events we see occur over and over again based on data they share with us in our free trial.
First, we’ll look at some common anti-patterns and how to avoid them. Some of these anti-patterns are a bit newer, but many of them are the same old common problems that everybody insists on migrating to their microservices environments.
Next, we’ll look at some important considerations for Continuous Deployment and take a look at a real use case.
And the last area we’ll look at is Infrastructure Utilization of your microservices environment.
Let’s start with the Anti-Patterns. We’ll cover 6 of them today.
Download the github repo with this microservice app:
https://github.com/Dynatrace-Reinhard-Pilz/dt-micro
Dynatrace Free Trial: http://bit.ly/dtsaastrial
AppMon Free Trial: http://bit.ly/dtpersonal
In order to get screenshots of some of the problem patterns presented here, we set up our own simple microservices environment to re-create them. You can recreate the environment and try these out yourself – we have it in a github repository and I’ll share the link at the end.
This environment runs on a single host, and what you’ll see is that with the right tools and the right frame of mind, you can very easily detect these problems very early on.
Application is a controller that spins up multiple processes
To make this actually microservices, we have the registry/router service
Each service registers itself with the registry on startup
The Service client, whether web request or request from another service, hits the router, the router sends the request to the proper service
Created this with Spring Boot to make this easy
Spring Boot offers a rich set of technologies with which we can easily integrate
We can deploy the same binaries to each instance and control what they’re doing through configuration.
Download from github and try these out yourself
Example of the code if you want to display - probably hard to read in most situations.
While we’ve set a lot of this up with Spring Boot, the anti-patterns I’m about to discuss have nothing specifically to do with Spring Boot. These anti-patterns are true regardless of what technology you are using. It could be Java, .NET, Node, or anything else. These are not technology specific, but rather architecturally based. In fact, microservices quite often span multiple languages and technologies. That’s part of what makes them great. Anti-patterns, however, are not great.
Let’s start with the N+1 Call pattern. For both this and the next, I wish they had been called the 1+N pattern as it more accurately describes what’s going on; however, N+1 is already ingrained.
Let’s start with this getQuote function, left over from the monolithic code. In this monolithic example, you’re making a call to an API called getQuote. The main function of getQuote is to go through the list of products and sum up the value of their prices.
This works fine when running in a single, monolithic-type process: you can iterate through the products and prices because all the info is in cache, you’re just accessing local memory, and it’s all fast. Overhead is very minimal when something like this is set up properly in a single process.
So, what typically happens when everybody gets all excited and moves this to microservices?
You end up with something like this. This is a screenshot of a transaction flow. We see a web request on the left side making a call to the quote service. The quote service makes a call to the product service. The product service retrieves the product price from the database, which it then passes back to the quote service. The quote service sums up all the prices and sends the response back to the client. Everybody is happy because we have microservices and we can scale. We can even automate.
There a few problems with this, though.
One call to the quote service results in 44 calls to the product service.
Product service has very minimal business logic built in it, so it’s only handling the price for one product at a time. This quote has 44 items in it, so 44 calls to the product service
Adds a lot of overhead
To Network
To Product service, because there’s no telling what kind of request the quote service, or any other new service, will introduce.
Also, separate queries are being made for each product – more on that next
Quote service is waiting for all the data and then has to process it, which could tie it up from servicing more calls from clients.
Better way:
Take some of the business logic from the quote service and move it to the product service.
Product service should be more intelligent to take the full list of products, sum the prices and return the total to the quote service.
This also frees up the load on the Quote service, making it more responsive to the end clients.
Reason this might have been designed like this:
Quote service and product service are typically different teams, maybe different management.
Product service not talking to its customers, so they’re just writing the most basic of functions.
Quote service not talking to Product service team to let them how they’re going to use them – this leads to unintended abuse of the product service because it’s too simple
Communicate to build better services and avoid the N+1 call problem.
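Sketched in code, the fix looks like this. The `ProductClient` interface and its method names below are hypothetical, not from the talk; the point is the shape of the call pattern, one batched request instead of one request per item.

```java
import java.util.List;

// Hypothetical client interface -- names are illustrative, not from the talk.
interface ProductClient {
    double getPrice(String productId);             // old API: one network call per product
    double getTotalPrice(List<String> productIds); // batched API: one call for the whole quote
}

class QuoteService {
    private final ProductClient products;

    QuoteService(ProductClient products) { this.products = products; }

    // Anti-pattern: 1 call to getQuote fans out into N calls to the product service.
    double getQuoteNaive(List<String> productIds) {
        double quote = 0;
        for (String id : productIds) {
            quote += products.getPrice(id); // one network round trip per item
        }
        return quote;
    }

    // Better: push the summing into the product service and make a single call.
    double getQuoteBatched(List<String> productIds) {
        return products.getTotalPrice(productIds); // one round trip total
    }
}
```

With a 44-item quote, the naive version makes 44 remote calls; the batched version makes one.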
Even with these improvements, we still see a lot going on with the database…
The N+1 query problem is very similar to the N+1 Call pattern; however, this one involves the database. A very simple example: if your application has to get the employment start date for all of your employees, the application first makes a call for all employee IDs, then, for each employee ID, makes a query to get the start date.
N+1 query is very similar to N+1 call, but they are separate problems. Fixing one doesn’t fix the other. Being aware of the pattern, though can help you to avoid introducing it anywhere.
In this call trace screenshot – 1 transaction instance – recursive calls of the quote service to the product service, which makes a DB acquisition call and a single product query for each individual item in the quote. It’s easy to spot the N+1 problem visually.
Though the query itself is fast, you’re adding network load, connection constraints, and load on the DB. Even if there’s no problem right now, chances are conditions will arise where this will blow up in your face.
Going back to our transaction flow, we see one single call into the quote service results in 87 calls to the database. So, in this one spectacularly horrible example, we have the quote service making 44 calls to the Product Service, and the product Service making 87 calls to the database.
There are a few things to consider:
If you eliminate the N+1 call pattern on the product service, there’s a good chance you’ll eliminate the N+1 query pattern to the DB, but not necessarily.
Though the product service may handle the quote service intelligently, you can still end up executing a single query for price for each individual item.
Write a better query.
Or
Think about leveraging an in-memory cache like memcached. Multiple product service instances are using the cache to get the data:
Much faster
Eliminates network calls to the db
Consider that when you move from a monolithic to microservices architecture, where you have all of this scaling capability, one of the immediate impacts is that you, at least initially, blow away all of your caching strategies. Microservices doesn’t mean that you should stop using a caching strategy.
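One way to "write a better query" is to replace the per-item lookup with a single set-based query. A minimal sketch; the table and column names (`product`, `id`, `price`) are assumptions for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: replace N single-row lookups with one set-based query.
class PriceQueries {
    // Anti-pattern: executed once per product id -> N round trips to the DB.
    static String singleRowQuery() {
        return "SELECT price FROM product WHERE id = ?";
    }

    // Better: one parameterized IN-list query covering the whole quote.
    static String batchQuery(List<String> ids) {
        String placeholders = ids.stream()
                                 .map(id -> "?")
                                 .collect(Collectors.joining(", "));
        return "SELECT id, price FROM product WHERE id IN (" + placeholders + ")";
    }
}
```

The batched statement keeps the query parameterized (no SQL injection risk) while collapsing N round trips into one.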
Payload Flood
Architecture follows a hierarchy model where the top-level service does its part of the work, fires and forgets the entire payload to the next service that does its tiny bit of work, and so on down the line.
In the end, you have a big data stream and unlike the waterfall in this picture, you don’t have gravity to get the payload from microservice to microservice.
Also, you have the client back at the top who you have to get the data back to.
Our example app:
Created a small set of services to create a big report
Document service gets the initial dataset from the DB, fires and forgets to the doc processor
The doc processor runs its tiny bit of code, sends the entire payload to the transformer, and so on down the line.
On the surface, doesn’t look bad.
You can scale any tier
Since the document service, and each service below it, fires and forgets, it’s free to work on the next request.
But, if we look into the detail…
We see the trade-off.
Huge amount of data being transferred among the services
Some might push back and say they have a very robust network and network payloads are not a problem
This defines an environmental condition which must be true in order for the services to work.
What happens if:
Due to unforeseen circumstances, there’s a temporary restriction on network?
Your microservice gets deployed to a different environment.
Your company expands to multiple public, private or hybrid data centers which are not all created equal.
You can’t control the network.
It’s like the movie Speed: the bus had to drive around the streets of LA at a speed of over 50mph – otherwise it blows up. It’s a condition for their survival. Same thing goes with your services – if the network slows down, your process blows up.
Another problem when dealing with a lot of this kind of data is that the data is the result of serialized objects. And serialized objects eventually need to get deserialized. This consumes CPU, and if you’re really unlucky, you block resources and create synchronization issues.
And I want to stress, this is all one request. You’re switching from a monolithic application to microservices to scale well. You have to think about the maximum number of transactions you want to support per second and estimate, based on these numbers, what you can actually support.
To fix this, in our example app, we got rid of the hierarchical model and replaced it with a parent/child model where more intelligence was built into the document service, which in turn orchestrates the data between the different tiers.
A single large payload is no longer moved between tiers. Instead, the document service sends only the data each service needs in order to do its bit of work.
Also, we can run the job in parallel because the doc processor, transformer and signer don’t depend on each other’s work in order to do their jobs.
This may look like a disadvantage because now the document service has to stay aware the entire time, orchestrating, and it will not be free to handle the next request.
However, this can be overcome by still leveraging a fire & forget type call, where it monitors a queue that the other services send a message to when they’re done. This then allows not only parallel processing, but allows it to be asynchronous.
With this change, we gain the following performance improvements:
No longer tied to an environmental network condition
Reduces Network payload
Run parts of the job in parallel
Run parts of the job asynchronously.
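A rough sketch of the parent/child model described above: the document service keeps the dataset and fans the independent steps out in parallel. The `CompletableFuture`-based approach and the payload types are illustrative assumptions, not the actual app’s code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Sketch: the document service orchestrates, handing each worker only the
// input it needs, and runs the independent steps concurrently.
class DocumentOrchestrator {
    String processAll(String doc,
                      Function<String, String> processor,
                      Function<String, String> transformer,
                      Function<String, String> signer) {
        // Processor, transformer and signer don't depend on each other's
        // output, so each runs on its own thread with its own (small) input.
        CompletableFuture<String> processed   = CompletableFuture.supplyAsync(() -> processor.apply(doc));
        CompletableFuture<String> transformed = CompletableFuture.supplyAsync(() -> transformer.apply(doc));
        CompletableFuture<String> signed      = CompletableFuture.supplyAsync(() -> signer.apply(doc));
        // The orchestrator combines the small results instead of shipping one
        // huge payload down a chain of services.
        return processed.join() + "|" + transformed.join() + "|" + signed.join();
    }
}
```

In a real system the `join()` calls would be replaced by the queue-based fire-and-forget completion the talk mentions, so the orchestrator stays fully asynchronous.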
The next anti-pattern we constantly come across is the concept of granularity and too-tight coupling.
A great candidate for a microservice is one that both has a very well defined API and, as a component itself, doesn’t require calls to any other services.
Looking at a transaction flow of our document service, in order to illustrate the concept, we created a step called encryption. In this example, every step of the document creation workflow makes at least 1 call to Doc Encryption. Since it’s a well defined API and doesn’t require calls to other services, and you can scale it, it looks like a good idea. Also, you don’t have to maintain encryption code and keys on each service. Looks like a good plan
Let’s look at why breaking an API call like this into a microservice is not a good idea.
In this setup, there are a lot of REST calls to the encryption service. The service does not consume a lot of CPU, but there are a lot of calls over the network to it.
A colleague saw something like this in an engagement. There were a lot of calls to a fast service, but there was a lot of network overhead. When he asked why they needed to separate out the service, they said "so we can scale and spin up multiple instances when we need to." They ended their discussion when my colleague asked how many instances they were running, to which they answered, "we’ve only ever needed one." So, that begs the question: if you have a service that is only running 1 instance and doesn’t have to scale, why are you breaking it out and adding the cost of running a service external to the other services that use it?
Additionally, architects should look at the ideas and try to figure out if there’s a better way. In the specific example of encryption, we can just make the inter-tier calls with SSL instead of having to make a call to encryption, simplifying everything. Keep a lookout for ways to simplify.
Too Tight Coupling:
99% of the calls to Journey service make a call to Check Destination.
This means, basically, that for every call to the Journey service, the Journey service has to make a network call to the Check Destination service, and the Check Destination service has to be running and responsive. If 90, 99, or 100% of calls are going to another service, you are making things too complicated. You are creating a nano-service. Can anybody make a good case for nano-services?
You are introducing complexity and adding two more points of failure – network problems and CheckDestination availability.
If there’s this tight of a coupling between services, you’ve split too much. Either join them back together, or, see if you can leverage caching.
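If the services stay separate, a local cache in front of the remote check removes most of the network hops. A minimal sketch; `DestinationChecker` and the remote call are hypothetical, and real code would add expiry/eviction (e.g. via a library like Caffeine or memcached):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: cache the CheckDestination answer locally so ~99% of Journey
// requests no longer trigger a network call to the nano-service.
class DestinationChecker {
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();
    private final Function<String, Boolean> remoteCheck; // the remote service call
    int remoteCalls = 0; // counter, exposed only to illustrate the savings

    DestinationChecker(Function<String, Boolean> remoteCheck) {
        this.remoteCheck = remoteCheck;
    }

    boolean isValid(String destination) {
        // Only the first lookup for a destination goes over the network.
        return cache.computeIfAbsent(destination, d -> {
            remoteCalls++;
            return remoteCheck.apply(d);
        });
    }
}
```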
+- 10 years of WPO to learn from
Steve Souders wrote the book High Performance Web Sites in 2007.
Thanks to people like Steve, Pat Meenan, Paul Irish, Nicole Sullivan, Tammy Everts and many more, and all the people in the trenches toiling away at WPO, we have a wealth of knowledge that we can apply to service flows.
This knowledge includes both problem patterns, some of which translate from web/browser patterns to services, as well as the concept of visually analyzing the performance in flows and waterfalls to identify problem patterns.
Browser waterfalls help us highlight the problems we have – they make them very easy to spot, especially in very complex web pages – visualization is key.
WPO helps with minifying and combining JS and CSS files to reduce round trips, optimizing images, ensuring proper use of browser caching, loading critical elements first instead of one large bulk request, etc.
We like to call this Service Flow Performance Optimization, or SFPO
We can apply a lot of these learnings to optimize our service flows: caching, bulking, …
Teach us how to optimize microservices dependencies - visualize it.
Like WPO, when we get into especially complex service flows, visualizing them is key.
We can use these flows to identify all the things in the box – like WPO waterfall. Similar patterns and parallels to WPO
Another way to view flows is in a waterfall/PurePath type view, just like browser waterfalls.
This allows us to visually see what services are called by an initiating call. We can easily see how much time, what kind of time, and how much network time was spent where.
This makes it very easy to spot patterns…
Without even looking at the details, this picture should raise a concern in anybody.
Recursive call chain – easy to detect when you can see it, just like WPO
Don’t just focus on your own service and its immediate neighbors; somebody has to look at the whole thing – it can get huge and out of control.
If you just look at your part of the front, it looks great
If you look at the big picture, you’ll find that there is a lot more complexity involved. Do you know who is dependent on your service? Do you know what the services you are dependent on are dependent on? Is there a service 5 layers down that is critical to your existence?
Understand who your customers are and who is dependent on you.
In Monolithic code, this is easy. Microservices complicate the picture exponentially. You have no good way to know unless you are monitoring who is making calls into you.
Service back traces clearly display all of the services that depend on your service.
Armed with this info:
You know who’s at risk if you make changes
You know who to collaborate with when coming up with changes (think back to the reason N+1 Call pattern happens)
By collaborating with your dependents, you’ll write a much better, more performant, and more useful service
So, now that we’ve covered a bunch of anti-patterns, let’s continue on to the topic of Continuous Deployment.
So, now that you’ve taken all the necessary steps to ensure your services are as performant and well tuned as possible, what you need to always consider is that you are not deploying a big "GA" version. The next version might already be in the pipeline, a few days away from production, and another version is being conceptualized right behind that. The version you are pushing out right now is impermanent.
So, when you first deploy, everything is great. You have multiple consumers using your microservice, and since this is the first time it’s being used, they’re all using the same version. Everybody is happy. However, your service is so popular and everybody wants to use it that you’re forced to update it and add new functionality. When you do that, you run the risk of this.
And since you don’t want to have any downtime when you deploy, you do this all live. Consumer 2 has already changed in tandem to take advantage of the new functionality, but consumer 1, not needing the new functionality, didn’t change and now they’re broken. 50% of your users are now failing.
So, to avoid this issue, you have to be prepared to run multiple versions of your service. This could be for a short time or a long time. It depends on how long you want to support the old version and how long other consumers need to make their changes. If consumer 1 is a mobile app, for example, you’ll need longer support than if these consumers were other internal services.
There are a few ways you can do this. If you’re clever enough, you can introduce compatibility layers where either of the consumers can talk to the new service and that new service has a backward-compatible protocol layer.
However, you don’t have to worry about this if you deploy multiple versions of the service. To do this, you need more than just a unique identifier for the service like "document service". You need to add a version to this. So, the consumer should always be looking for, let’s say, the Document Service with a specific version. We’d ideally suggest each service has a minimum and maximum supported version definition, and when talking versions, semantic versioning should be used – major, minor, patch at minimum. They should be meaningful. Most people would expect a change in the major version to indicate a change in the API, and therefore a likely incompatibility with consumers running an older version, whereas minor/patch changes should still be compatible with older versions.
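As a sketch, the semantic-version compatibility rule might look like this: same major version means no API break, and the provider must be at least as new as what the consumer requires. The parsing and the `satisfies` rule are illustrative assumptions, not a standard library API.

```java
// Minimal semantic-version holder with a compatibility check for service lookup.
class SemVer {
    final int major, minor, patch;

    SemVer(String v) {
        String[] parts = v.split("\\.");
        major = Integer.parseInt(parts[0]);
        minor = Integer.parseInt(parts[1]);
        patch = Integer.parseInt(parts[2]);
    }

    // A provider satisfies a consumer if the major versions match (no API
    // change) and the provider is at least as new as the required version.
    boolean satisfies(SemVer required) {
        if (major != required.major) return false; // major bump = API change
        if (minor != required.minor) return minor > required.minor;
        return patch >= required.patch;
    }
}
```

A registry could use this check to hand each consumer an instance of "Document Service" that is compatible with the version it asked for.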
This concept extends when we move to the database. Service v2 may introduce changes that require changes in the database schema, and this in turn may break calls made from microservice v1. So, again, 50% of your users are getting an error page. We can’t maintain 2 databases, so what do we do about this?
You basically have to inject some type of mediator, or gatekeeper, between the services and the database. The gatekeeper is the only one who can talk to the database. Whoever wants to talk to the DB has to talk to the gatekeeper. The gatekeeper runs on a specific version and has the compatibility layer built into it. Each version of the service can talk to the gatekeeper, and the gatekeeper, in turn, will create the queries compatible with the database.
So, this works nicely; however, let’s keep in mind the N+1 problems. You have to make your gatekeeper clever. Don’t just create O/R-mapping services. A gatekeeper that is only offering an object-oriented API to create statements to execute on the DB is a bad idea. Sooner or later you’ll run into N+1 again. Instead, options would include singling out functionality from the services and moving it to the gatekeeper to make the gatekeeper more intelligent, or perhaps your services don’t even interact with the database. Instead, you leverage a third-party caching mechanism that takes care of interacting with the database and knows how to distribute the cache among the multiple instances.
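A minimal sketch of the gatekeeper’s compatibility layer; the versions, schema, and queries below are invented for illustration, not from the talk:

```java
// Sketch: only the gatekeeper talks to the database, and it translates each
// service version's request into a query matching the current schema.
class DbGatekeeper {
    // Imagined scenario: v1 services still expect the old "fullname" field,
    // but the schema was migrated to separate first/last name columns.
    String queryFor(int serviceMajorVersion) {
        if (serviceMajorVersion <= 1) {
            // Compatibility layer: synthesize the old shape from the new schema.
            return "SELECT first_name || ' ' || last_name AS fullname FROM customer";
        }
        return "SELECT first_name, last_name FROM customer";
    }
}
```

Both service versions keep working against one database because the gatekeeper, not each service, owns the schema knowledge.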
Most platforms support tags. It’s very important to use them and to monitor the entities as well. Monitoring has to be able to see these tags. By seeing tags in monitoring, you’ll know which version of your service has an issue.
Let’s pivot to a real use case from one of our customers.
This was a search service for an online sports club in Europe. Users could go on, search for local soccer clubs, and go there.
It started as a 2 person project, it was used a little bit and had a little bit of success.
In 2014, they decided to expand the service to different cities in Europe.
As they did this, they saw an increase in users to the site. And, as you can imagine, as the users increased, they saw a significant increase in response time.
They, predictably, also started seeing a drop-off in users, and response time got worse.
They had a monolithic .NET app that connected to a SQL Server in the back end. In April 2015, the response time was decent. The next month, when they expanded, the response time increased. Not terrible, but not great. But what they saw is that the application was CPU bound and they could not scale it vertically.
So, they thought "hey – microservices and cloud will save the day!" Don’t we all? So, they moved the frontend logic into the public cloud and the backend search service into containers. The idea was to be able to host these containers in the public cloud, deploying the front end where they need it globally, with the ability to scale the back end as needed.
So, they quickly modified the app to break it out into microservices. They could now scale and their problems were solved.
On go-live date with the new architecture, everything looked good at 7 AM, when not many folks were online yet! Response time was acceptable, users were mostly satisfied, and the bounce rate was OK.
By noon – when the real traffic started to come in – the picture was completely different. User experience across the globe was bad. Response time jumped from 2.5s to 25s and the bounce rate tripled from 20% to 60%.
The backend service itself was well tested. The problem was that they never looked at what happens under load "end-to-end". It turned out that the frontend had direct access to the database to execute the initial query when somebody executed a search. The returned list of search result IDs was then iterated over in a loop. For every element, a "micro" service call was made to the backend, which resulted in 33! service invocations for this particular use case, where the search returned 33 items. N+1! Lots of wasted traffic and resources, as these Key Architectural Metrics show us:
33 service calls – N+1 call problem
99KB payload per call
171 queries – N+1 query problem
So, they went back to the drawing board. They made the front end more intelligent. They re-architected the backend and got rid of the N+1 problems. Payload went down as a result, eliminating the payload flood.
They fixed the problem by understanding the end-to-end use cases and then defining backend service APIs that provided the data the frontend really needed. This reduced roundtrips, eliminated the architectural regression, and improved performance and scalability.
So, now that you’ve taken care of your performance issues and have made sure that you can deploy safely and compatibly, you are faced with another very important concept - how do you utilize your infrastructure optimally.
Balancing the load of your microservices is one of the keys to properly utilizing your infrastructure. With so many instances running, keeping an eye on balancing is much more difficult. And if your services aren’t balanced, you:
Run the risk of the overloaded instances encountering performance problems
Waste money by running too many instances. If you can spread your load evenly, you can likely reduce the number of instances running. Somebody has to pay for all of the compute you are using.
Balancing is more than just balancing throughput. Make sure CPU and Memory consumption is balanced as well.
If you can get a handle on your balancing, you can more easily set parameters for scaling.
Take time to identify and establish criteria along the dimensions of CPU, Memory and Load for when you are to scale up.
More importantly, identify the same thresholds for when you can comfortably scale down. Not scaling down defeats the purpose of microservices and containers and costs a lot of money.
Once you scale up or down, make sure the load balances across the new configuration.
Automate all of this. Why do this manually?
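The scale-up/scale-down decision described above can be sketched as an explicit rule over CPU, memory and load. The threshold numbers below are placeholders for illustration, not recommendations:

```java
// Sketch of an automated scaling decision along the three dimensions
// from the slides: CPU, memory, and load (requests per second per instance).
class AutoScaler {
    enum Action { SCALE_UP, SCALE_DOWN, HOLD }

    static Action decide(double cpuPct, double memPct, double reqPerSec, int instances) {
        // Scale up if any dimension is hot.
        if (cpuPct > 80 || memPct > 85 || reqPerSec / instances > 500) {
            return Action.SCALE_UP;
        }
        // Scale down only when every dimension is comfortably cold --
        // and never below one instance.
        if (instances > 1 && cpuPct < 30 && memPct < 40 && reqPerSec / instances < 100) {
            return Action.SCALE_DOWN;
        }
        return Action.HOLD;
    }
}
```

After each scaling action, re-check that the load actually balances across the new set of instances, as the slides stress.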
One thing a lot of people don’t realize is that monitoring your infrastructure in a monolithic app is very different than monitoring in a microservices architecture.
You might be monitoring all of the hosts in your environment, but even if all of the hosts are green, it doesn’t mean your system is running well. It just means that none of your hosts or processes have a problem, and that’s at least good.
Your red instances might indicate a non-critical full disk partition, in which case, who really cares except the infrastructure team? With the green hosts, your host and process may be healthy, but your code might be terrible. We need to see all the way through from infrastructure to service/code performance.
Also, most of this type of monitoring only shows how you are doing right now. You have to look at behavior over time to know what impacts what.
Historical and trend data is very important.
You need deployments to show up in your monitoring, so that if you see strange behavior, you can tie it to the change. If you can’t see that, you’re stuck. You also need to be able to compare performance to the same hour last week to know if something is out of the ordinary.
Time-based monitoring is a huge improvement, but it still leaves a lot out. We don’t know what impacts what, and what dependencies there are.
Understand your dependencies – know which services are running on which processes on which hosts in which data centers. Know how services talk to each other, processes interact with each other.
In a monolith, if a single server is having a problem, you’ll know which services are impacted. You know which dependents might be impacted.
In a microservices environment, your hosts can number well over 100, even into the thousands, and processes and service instances into the tens of thousands. There’s no way you’re going to know what all is dependent on each other. And if you figure it out today, tomorrow it will change. Dependency monitoring/mapping is very important.
This means you have to be able to map this in your monitoring. If a single server goes down in a microservices environment, who are the immediate neighbors that are impacted? Who are the services further down the line that might be impacted? Think of that tip-of-the-iceberg slide we looked at earlier. If one of those nodes out on the far left had a problem, are we really going to know that it was the one to impact the node on the far right?
If you have 5 machines with high CPU, you’re likely going to look for root cause on those machines. If you know the dependencies, you’ll be able to see that the problem is a network issue on a downstream component.
Green & Red lights are good. Timeline metrics, especially with deployment markers, are good, but in the microservices world, you have to monitor the dependencies as well.
[For Docker crowds]
It’s important to monitor your entire Docker stack. Monitor the containers, the hosts they run on, the instance counts, throughput, etc. Utilize dependency mapping to be able to look at your Docker components as both an ecosystem as well as individual components.
Important to monitor each container as if it was a host – traffic, CPU, memory
Also important to monitor the hosts the containers are running on.
Here is also where you can see the load distribution. If I’m running 10 containers, but my load is not balanced like the picture here, there’s a good chance I may be able to spin down to 6 or 7 containers if I balance the load. That’ll save us money.
And last but not least, in order to know what to do here, you have to monitor your instances. Chart the load distribution, chart the number of instances, chart the resource utilization across your services. If you don’t take the time to monitor your system, you won’t know what’s going on and you will end up paying much more for the compute resources you are using. The point of moving to, say, Docker and the cloud is to improve performance and save money. You can only do this if you are monitoring performance and resource utilization. Also, it’s important not to set-and-forget this. Review your monitoring strategies and collected data often to see if you can optimize better.
A final thought here – when you move to a microservices architecture, the components that keep your business logic up and running are as important as the business logic itself.
Lessons Learned!
N+1 patterns are like cockroaches – they’ll outlive us all
Approach N+1 patterns like crack – don’t do it.