Architectural Commandments for Building & Running Microservices at Scale
1. Confidential, Dynatrace, LLC
Architectural commandments for building & running
microservices at scale
Brian Wilson, Product Specialist, Dynatrace
@emperorwilson
Join our Podcast Series bit.ly/pureperf
8. Monolithic Code
public double getQuote(String type) {
    double quote = 0;
    for (Product product : products) {
        quote += product.getValue();
    }
    return quote;
}
N+1 Call Pattern
Works well within 1
process
9. N+1 Call Pattern
Product Service
Quote Service
1 call to Quote Service
= 44 calls to product
service
18. Granularity
Doc Processor Doc Transformer Doc Signer
Doc Encryption
Doc Shipment
Document Encryption is carved out as a separate
service. May not be the best option to run it as a
separate service
Documents
21. WPO (Web Performance Optimization)
taught us optimizing resource dependencies
when loading a web page by analyzing
Resource Waterfalls
22. Especially useful when page loads get very
complex and overloaded:
3rd party dependencies, non-optimized
resources, wrong cache settings, loading too
much data too early, …
23. SFPO (Service Flow Performance Optimization)
has to teach us how to optimize (micro)service
dependencies through Service Flows
24. Especially useful to identify: inefficient 3rd party services, recursive
call chains, N+1 Query Patterns, loading too much data, no data
caching, … -> sounds very familiar to WPO
43. 26.7s Load Time
5kB Payload
33! Service Calls
99kB - 3kB for each call!
171! Total SQL Count
Architecture Violation
Direct access to DB from frontend service
Single search query end-to-end
44. The fixed end-to-end use case
2.5s (vs 26.7)
5kB Payload
1! (vs 33!) Service Call
5kB (vs 99) Payload!
3! (vs 171) Total
SQL Count
46. Infrastructure Utilization
Is the load on microservices equally load
balanced?
When do you scale up/down?
• CPU
• Memory
• Load
Use automation process to scale up/down
Original recording can be found here - https://info.dynatrace.com/apm_dtm_all_17q2_wc_microservices_en_registration.html
PurePerformance Podcast:
http://bit.ly/pureperf
Or
http://www.spreaker.com/user/pureperformance
Today, we are going to look at three important areas to focus on when moving from monolith to microservices. Most of the data we’re going to look at today comes from experiences shared with us by our customers. Not just stories our customers related to us, but, especially in the anti-patterns area, events we see occur over and over again based on data they share with us in our free trial.
First, we’ll look at some common anti-patterns and how to avoid them. Some of these anti-patterns are a bit newer, but many of them are the same old common problems that everybody insists on migrating to their microservices environments.
Next, we’ll look at some important considerations for Continuous Deployment and take a look at a real use case.
And the last area we’ll look at is Infrastructure Utilization of your microservices environment.
Let’s start with the Anti-Patterns. We’ll cover 6 of them today.
Download the github repo with this microservice app:
https://github.com/Dynatrace-Reinhard-Pilz/dt-micro
Dynatrace Free Trial: http://bit.ly/dtsaastrial
AppMon Free Trial: http://bit.ly/dtpersonal
In order to get screenshots of some of the problem patterns presented here, we set up our own simple microservices environment to re-create them. You can recreate the environment and try these out yourself – we have it in a github repository and I’ll share the link at the end.
This environment runs on a single host, and what you’ll see is that with the right tools and the right frame of mind, you can very easily detect these problems very early on.
Application is a controller that spins up multiple processes
To make this actually microservices, we have the registry/router service
Each service registers itself with the registry on startup
The Service client, whether web request or request from another service, hits the router, the router sends the request to the proper service
Created this with Spring Boot to make this easy
Spring Boot offers a rich set of technologies with which we can easily integrate
We can deploy the same binaries to each instance and control what they’re doing through configuration.
Download from github and try these out yourself
Example of the code if you want to display - probably hard to read in most situations.
While we’ve set a lot of this up with Spring Boot, the anti-patterns I’m about to discuss have nothing specifically to do with Spring Boot. These anti-patterns are true regardless of what technology you are using. It could be Java, .NET, Node, or anything else. These are not technology specific, but rather architecturally based. In fact, microservices quite often span multiple languages and technologies. That’s part of what makes them great. Anti-patterns, however, are not great.
Let’s start with the N+1 Call pattern. For both this and the next, I wish they had been called the 1+N pattern as it more accurately describes what’s going on; however, N+1 is already ingrained.
Let’s start with this getQuote function, left over from the monolithic code. In this monolithic example, you’re making a call to an API called getQuote. The main function of getQuote is to go through the list of products and sum up the value of their prices.
This works fine when running in a single, monolithic-type process: you can iterate through the products and prices because all the info is in cache, you’re just accessing local memory, and it’s all fast. Overhead is very minimal when something like this is set up properly in a single process.
So, what typically happens when everybody gets all excited and moves this to microservices?
You end up with something like this. This is a screenshot of a transaction flow. We see a web request on the left side making a call to the quote service. The quote service makes a call to the product service. The product service retrieves the product price from the database, which it then passes back to the quote service. The quote service sums up all the prices and sends the response back to the client. Everybody is happy because we have microservices and we can scale. We can even automate.
There a few problems with this, though.
One call to the quote service results in 44 calls to the product service.
Product service has very minimal business logic built in it, so it’s only handling the price for one product at a time. This quote has 44 items in it, so 44 calls to the product service
Adds a lot of overhead
To Network
To Product service, because there’s no telling what kind of request the quote service, or any other new service, will introduce.
Also, separate queries are being made for each product – more on that next
Quote service is waiting for all the data and then has to process it, which could tie it up from servicing more calls from clients.
Better way:
Take some of the business logic from the quote service and move it to the product service.
Product service should be more intelligent to take the full list of products, sum the prices and return the total to the quote service.
This also frees up the load on the Quote service, making it more responsive to the end clients.
Reason this might have been designed like this:
Quote service and product service are typically different teams, maybe different management.
Product service not talking to its customers, so they’re just writing the most basic of functions.
Quote service not talking to Product service team to let them how they’re going to use them – this leads to unintended abuse of the product service because it’s too simple
Communicate to build better services and avoid the N+1 call problem.
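Sketched in code, the fix looks like this. The `ProductClient` interface and its method names below are hypothetical, not from the talk; the point is the shape of the call pattern, one batched request instead of one request per item.

```java
import java.util.List;

// Hypothetical client interface -- names are illustrative, not from the talk.
interface ProductClient {
    double getPrice(String productId);             // old API: one network call per product
    double getTotalPrice(List<String> productIds); // batched API: one call for the whole quote
}

class QuoteService {
    private final ProductClient products;

    QuoteService(ProductClient products) { this.products = products; }

    // Anti-pattern: 1 call to getQuote fans out into N calls to the product service.
    double getQuoteNaive(List<String> productIds) {
        double quote = 0;
        for (String id : productIds) {
            quote += products.getPrice(id); // one network round trip per item
        }
        return quote;
    }

    // Better: push the summing into the product service and make a single call.
    double getQuoteBatched(List<String> productIds) {
        return products.getTotalPrice(productIds); // one round trip total
    }
}
```

With a 44-item quote, the naive version makes 44 remote calls; the batched version makes one.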
Even with these improvements, we still see a lot going on with the database…
The N+1 query problem is very similar to the N+1 Call pattern; however, this one involves the database. A very simple example: if your application has to get the employment start date for all of your employees, the application first makes a call for all employee IDs, then, for each employee ID, makes a query to get the start date.
N+1 query is very similar to N+1 call, but they are separate problems. Fixing one doesn’t fix the other. Being aware of the pattern, though can help you to avoid introducing it anywhere.
In this call trace screenshot – 1 transaction instance – recursive calls of the quote service to the product service, which makes a DB acquisition call and a single product query for each individual item in the quote. It’s easy to spot the N+1 problem visually.
Though the query itself is fast, you’re adding network load, connection constraints, and load on the DB. Even if there’s no problem right now, chances are conditions will arise where this will blow up in your face.
Going back to our transaction flow, we see one single call into the quote service results in 87 calls to the database. So, in this one spectacularly horrible example, we have the quote service making 44 calls to the Product Service, and the product Service making 87 calls to the database.
There are a few things to consider:
If you eliminate the N+1 call pattern on the product service, there’s a good chance you’ll eliminate the N+1 query pattern to the DB, but not necessarily.
Though the product service may handle the quote service intelligently, you can still end up executing a single query for price for each individual item.
Write a better query.
Or
Think about leveraging an in-memory cache like memcached. Multiple product service instances are using the cache to get the data:
Much faster
Eliminates network calls to the db
Consider that when you move from a monolithic to microservices architecture, where you have all of this scaling capability, one of the immediate impacts is that you, at least initially, blow away all of your caching strategies. Microservices doesn’t mean that you should stop using a caching strategy.
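One way to "write a better query" is to replace the per-item lookup with a single set-based query. A minimal sketch; the table and column names (`product`, `id`, `price`) are assumptions for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: replace N single-row lookups with one set-based query.
class PriceQueries {
    // Anti-pattern: executed once per product id -> N round trips to the DB.
    static String singleRowQuery() {
        return "SELECT price FROM product WHERE id = ?";
    }

    // Better: one parameterized IN-list query covering the whole quote.
    static String batchQuery(List<String> ids) {
        String placeholders = ids.stream()
                                 .map(id -> "?")
                                 .collect(Collectors.joining(", "));
        return "SELECT id, price FROM product WHERE id IN (" + placeholders + ")";
    }
}
```

The batched statement keeps the query parameterized (no SQL injection risk) while collapsing N round trips into one.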
Payload Flood
Architecture follows a hierarchy model where the top-level service does its part of the work, fires and forgets the entire payload to the next service that does its tiny bit of work, and so on down the line.
In the end, you have a big data stream and unlike the waterfall in this picture, you don’t have gravity to get the payload from microservice to microservice.
Also, you have the client back at the top who you have to get the data back to.
Our example app:
Created a small set of services to create a big report
Document service gets the initial dataset from the DB, fires and forgets to the doc processor
The doc processor runs its tiny bit of code, sends the entire payload to the transformer, and so on down the line.
On the surface, doesn’t look bad.
You can scale any tier
Since the document service, and each service below it, fires and forgets, it’s free to work on the next request.
But, if we look into the detail…
We see the trade-off.
Huge amount of data being transferred among the services
Some might push back and say they have a very robust network and network payloads are not a problem
This defines an environmental condition which must be true in order for the services to work.
What happens if:
Due to unforeseen circumstances, there’s a temporary restriction on network?
Your microservice gets deployed to a different environment.
Your company expands to multiple public, private or hybrid data centers which are not all created equal.
You can’t control the network.
It’s like the movie Speed: the bus had to drive around the streets of LA at a speed of over 50mph – otherwise it blows up. It’s a condition for their survival. Same thing goes with your services – if the network slows down, your process blows up.
Another problem when dealing with a lot of this kind of data is that the data is the result of serialized objects. And serialized objects eventually need to get deserialized. This consumes CPU, and if you’re really unlucky, you block resources and create synchronization issues.
And I want to stress, this is all one request. You’re switching from a monolithic application to microservices to scale well. You have to think about the maximum number of transactions you want to support per second and estimate, based on these numbers, what you can actually support.
To fix this, in our example app, we got rid of the hierarchical model and replaced it with a parent/child model where more intelligence was built into the document service, which in turn orchestrates the data between the different tiers.
A single large payload is no longer moved between tiers. Instead, the document service sends only the data each service needs in order to do its bit of work.
Also, we can run the job in parallel because the doc processor, transformer and signer don’t depend on each other’s work in order to do their jobs.
This may look like a disadvantage because now the document service has to stay aware the entire time, orchestrating, and it will not be free to handle the next request.
However, this can be overcome by still leveraging a fire & forget type call, where it monitors a queue that the other services send a message to when they’re done. This then allows not only parallel processing, but allows it to be asynchronous.
With this change, we gain the following performance improvements:
No longer tied to an environmental network condition
Reduces Network payload
Run parts of the job in parallel
Run parts of the job asynchronously.
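A rough sketch of the parent/child model described above: the document service keeps the dataset and fans the independent steps out in parallel. The `CompletableFuture`-based approach and the payload types are illustrative assumptions, not the actual app’s code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Sketch: the document service orchestrates, handing each worker only the
// input it needs, and runs the independent steps concurrently.
class DocumentOrchestrator {
    String processAll(String doc,
                      Function<String, String> processor,
                      Function<String, String> transformer,
                      Function<String, String> signer) {
        // Processor, transformer and signer don't depend on each other's
        // output, so each runs on its own thread with its own (small) input.
        CompletableFuture<String> processed   = CompletableFuture.supplyAsync(() -> processor.apply(doc));
        CompletableFuture<String> transformed = CompletableFuture.supplyAsync(() -> transformer.apply(doc));
        CompletableFuture<String> signed      = CompletableFuture.supplyAsync(() -> signer.apply(doc));
        // The orchestrator combines the small results instead of shipping one
        // huge payload down a chain of services.
        return processed.join() + "|" + transformed.join() + "|" + signed.join();
    }
}
```

In a real system the `join()` calls would be replaced by the queue-based fire-and-forget completion the talk mentions, so the orchestrator stays fully asynchronous.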
The next anti-pattern we constantly come across is the concept of granularity and too-tight coupling.
A great candidate for a microservice is one that both has a very well defined API and, as a component itself, doesn’t require calls to any other services.
Looking at a transaction flow of our document service, in order to illustrate the concept, we created a step called encryption. In this example, every step of the document creation workflow makes at least 1 call to Doc Encryption. Since it’s a well defined API and doesn’t require calls to other services, and you can scale it, it looks like a good idea. Also, you don’t have to maintain encryption code and keys on each service. Looks like a good plan
Let’s look at why breaking an API call like this into a microservice is not a good idea.
In this setup, there are a lot of REST calls to the encryption service. The service does not consume a lot of CPU, but there are a lot of calls over the network to it.
A colleague saw something like this in an engagement. There were a lot of calls to a fast service, but there was a lot of network overhead. When he asked why they needed to separate out the service, they said "so we can scale and spin up multiple instances when we need to." They ended their discussion when my colleague asked how many instances they were running, to which they answered, "we’ve only ever needed one." So, that begs the question: if you have a service that is only running 1 instance and doesn’t have to scale, why are you breaking it out and adding the cost of running a service external to the other services that use it?
Additionally, architects should look at the ideas and try to figure out if there’s a better way. In the specific example of encryption, we can just make the inter-tier calls with SSL instead of having to make a call to encryption, simplifying everything. Keep a lookout for ways to simplify.
Too Tight Coupling:
99% of the calls to Journey service make a call to Check Destination.
This means, basically, that for every call to the Journey service, the Journey service has to make a network call to the Check Destination service, and the Check Destination service has to be running and responsive. If 90, 99, or 100% of calls are going to another service, you are making things too complicated. You are creating a nano-service. Can anybody make a good case for nano-services?
You are introducing complexity and adding two more points of failure – network problems and CheckDestination availability.
If there’s this tight of a coupling between services, you’ve split too much. Either join them back together, or, see if you can leverage caching.
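If the services stay separate, a local cache in front of the remote check removes most of the network hops. A minimal sketch; `DestinationChecker` and the remote call are hypothetical, and real code would add expiry/eviction (e.g. via a library like Caffeine or memcached):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: cache the CheckDestination answer locally so ~99% of Journey
// requests no longer trigger a network call to the nano-service.
class DestinationChecker {
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();
    private final Function<String, Boolean> remoteCheck; // the remote service call
    int remoteCalls = 0; // counter, exposed only to illustrate the savings

    DestinationChecker(Function<String, Boolean> remoteCheck) {
        this.remoteCheck = remoteCheck;
    }

    boolean isValid(String destination) {
        // Only the first lookup for a destination goes over the network.
        return cache.computeIfAbsent(destination, d -> {
            remoteCalls++;
            return remoteCheck.apply(d);
        });
    }
}
```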
+- 10 years of WPO to learn from
Steve Souders wrote the book High Performance Web Sites in 2007.
Thanks to people like Steve, Pat Meenan, Paul Irish, Nicole Sullivan, Tammy Everts and many more, and all the people in the trenches toiling away at WPO, we have a wealth of knowledge that we can apply to service flows.
This knowledge includes both problem patterns, some of which translate from web/browser patterns to services, as well as the concept of visually analyzing the performance in flows and waterfalls to identify problem patterns.
Browser waterfalls help us highlight the problems we have – they make them very easy to spot, especially in very complex web pages – visualization is key.
WPO helps with minifying and combining JS and CSS files to reduce round trips, optimizing images, ensuring proper use of browser caching, loading critical elements first instead of one large bulk request, etc.
We like to call this Service Flow Performance Optimization, or SFPO
We can apply a lot of these learnings to optimize our service flows: caching, bulking, …
Teach us how to optimize microservices dependencies - visualize it.
Like WPO, when we get into especially complex service flows, visualizing them is key.
We can use these flows to identify all the things in the box – like WPO waterfall. Similar patterns and parallels to WPO
Another way to view flows is in a waterfall/PurePath type view, just like browser waterfalls.
This allows us to visually see what services are called by an initiating call. We can easily see how much time, what kind of time, and how much network time was spent where.
This makes it very easy to spot patterns…
Without even looking at the details, this picture should raise a concern in anybody.
Recursive call chain – easy to detect when you can see it, just like WPO
Don’t just focus on your own service and its immediate neighbors; somebody has to look at the whole thing – it can get huge and out of control.
If you just look at your part of the front, it looks great
If you look at the big picture, you’ll find that there is a lot more complexity involved. Do you know who is dependent on your service? Do you know what the services you are dependent on are dependent on? Is there a service 5 layers down that is critical to your existence?
Understand who your customers are and who is dependent on you.
In Monolithic code, this is easy. Microservices complicate the picture exponentially. You have no good way to know unless you are monitoring who is making calls into you.
Service back traces clearly display all of the services that depend on your service.
Armed with this info:
You know who’s at risk if you make changes
You know who to collaborate with when coming up with changes (think back to the reason N+1 Call pattern happens)
By collaborating with your dependents, you’ll write a much better, more performant, and more useful service
So, now that we’ve covered a bunch of anti-patterns, let’s continue on to the topic of Continuous Deployment.
So, now that you’ve taken all the necessary steps to ensure your services are as performant and well tuned as possible, what you need to always consider is that you are not deploying a big "GA" version. The next version might already be in the pipeline, a few days away from production, and another version is being conceptualized right behind that. The version you are pushing out right now is impermanent.
So, when you first deploy, everything is great. You have multiple consumers using your microservice, and since this is the first time it’s being used, they’re all using the same version. Everybody is happy. However, your service is so popular and everybody wants to use it that you’re forced to update it and add new functionality. When you do that, you run the risk of this.
And since you don’t want to have any downtime when you deploy, you do this all live. Consumer 2 has already changed in tandem to take advantage of the new functionality, but consumer 1, not needing the new functionality, didn’t change and now they’re broken. 50% of your users are now failing.
So, to avoid this issue, you have to be prepared to run multiple versions of your service. This could be for a short time or a long time. It depends on how long you want to support the old version and how long other consumers need to make their changes. If consumer 1 is a mobile app, for example, you’ll need longer support than if these consumers were other internal services.
There are a few ways you can do this. If you’re clever enough, you can introduce compatibility layers where either of the consumers can talk to the new service and that new service has a backward-compatible protocol layer.
However, you don’t have to worry about this if you deploy multiple versions of the service. To do this, you need more than just a unique identifier for the service like "document service". You need to add a version to this. So, the consumer should always be looking for, let’s say, the Document Service with a specific version. We’d ideally suggest each service has a minimum and maximum supported version definition, and when talking versions, semantic versioning should be used – major, minor, patch at minimum. They should be meaningful. Most people would expect a change in the major version to indicate a change in the API, and therefore a likely incompatibility with consumers running an older version, whereas minor/patch changes should still be compatible with older versions.
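As a sketch, the semantic-version compatibility rule might look like this: same major version means no API break, and the provider must be at least as new as what the consumer requires. The parsing and the `satisfies` rule are illustrative assumptions, not a standard library API.

```java
// Minimal semantic-version holder with a compatibility check for service lookup.
class SemVer {
    final int major, minor, patch;

    SemVer(String v) {
        String[] parts = v.split("\\.");
        major = Integer.parseInt(parts[0]);
        minor = Integer.parseInt(parts[1]);
        patch = Integer.parseInt(parts[2]);
    }

    // A provider satisfies a consumer if the major versions match (no API
    // change) and the provider is at least as new as the required version.
    boolean satisfies(SemVer required) {
        if (major != required.major) return false; // major bump = API change
        if (minor != required.minor) return minor > required.minor;
        return patch >= required.patch;
    }
}
```

A registry could use this check to hand each consumer an instance of "Document Service" that is compatible with the version it asked for.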
This concept extends when we move to the database. Service v2 may introduce changes that require changes in the database schema, and this in turn may break calls made from microservice v1. So, again, 50% of your users are getting an error page. We can’t maintain 2 databases, so what do we do about this?
You basically have to inject some type of mediator, or gatekeeper, between the services and the database. The gatekeeper is the only one who can talk to the database. Whoever wants to talk to the DB has to talk to the gatekeeper. The gatekeeper runs on a specific version and has the compatibility layer built into it. Each version of the service can talk to the gatekeeper, and the gatekeeper, in turn, will create the queries compatible with the database.
So, this works nicely; however, let’s keep in mind the N+1 problems. You have to make your gatekeeper clever. Don’t just create O/R-mapping services. A gatekeeper that is only offering an object-oriented API to create statements to execute on the DB is a bad idea. Sooner or later you’ll run into N+1 again. Instead, options would include singling out functionality from the services and moving it to the gatekeeper to make the gatekeeper more intelligent, or perhaps your services don’t even interact with the database. Instead, you leverage a third-party caching mechanism that takes care of interacting with the database and knows how to distribute the cache among the multiple instances.
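A minimal sketch of the gatekeeper’s compatibility layer; the versions, schema, and queries below are invented for illustration, not from the talk:

```java
// Sketch: only the gatekeeper talks to the database, and it translates each
// service version's request into a query matching the current schema.
class DbGatekeeper {
    // Imagined scenario: v1 services still expect the old "fullname" field,
    // but the schema was migrated to separate first/last name columns.
    String queryFor(int serviceMajorVersion) {
        if (serviceMajorVersion <= 1) {
            // Compatibility layer: synthesize the old shape from the new schema.
            return "SELECT first_name || ' ' || last_name AS fullname FROM customer";
        }
        return "SELECT first_name, last_name FROM customer";
    }
}
```

Both service versions keep working against one database because the gatekeeper, not each service, owns the schema knowledge.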
Most platforms support tags. It’s very important to use them and to monitor the entities as well. Monitoring has to be able to see these tags. By seeing tags in monitoring, you’ll know which version of your service has an issue.
Let’s pivot to a real use case from one of our customers.
This was a search service for an online sports club in Europe. Users could go on, search for local soccer clubs, and go there.
It started as a 2 person project, it was used a little bit and had a little bit of success.
In 2014, they decided to expand the service to different cities in Europe.
As they did this, they saw an increase in users to the site. And, as you can imagine, as the users increased, they saw a significant increase in response time.
They, predictably, also started seeing a drop-off in users, and response time got worse.
They had a monolithic .NET app that connected to a SQL Server in the back end. In April 2015, the response time was decent. The next month, when they expanded, the response time increased. Not terrible, but not great. But what they saw is that the application was CPU bound and they could not scale it vertically.
So, they thought "hey – microservices and cloud will save the day!" Don’t we all? So, they moved the frontend logic into the public cloud and the backend search service into containers. The idea was to be able to host these containers in the public cloud, deploying the front end where they need it globally, with the ability to scale the back end as needed.
So, they quickly modified the app to break it out into microservices. They could now scale and their problems were solved.
On go-live date with the new architecture, everything looked good at 7 AM, when not many folks were online yet! Response time was acceptable, users were mostly satisfied, and the bounce rate was OK.
By noon – when the real traffic started to come in – the picture was completely different. User experience across the globe was bad. Response time jumped from 2.5s to 25s and the bounce rate tripled from 20% to 60%.
The backend service itself was well tested. The problem was that they never looked at what happens under load "end-to-end". It turned out that the frontend had direct access to the database to execute the initial query when somebody executed a search. The returned list of search result IDs was then iterated over in a loop. For every element, a "micro" service call was made to the backend, which resulted in 33! service invocations for this particular use case, where the search returned 33 items. N+1! Lots of wasted traffic and resources, as these Key Architectural Metrics show us:
33 service calls – N+1 call problem
99KB payload per call
171 queries – N+1 query problem
So, they went back to the drawing board. They made the front end more intelligent. They re-architected the backend and got rid of the N+1 problems. Payload went down as a result, eliminating the payload flood.
They fixed the problem by understanding the end-to-end use cases and then defining backend service APIs that provided the data the frontend really needed. This reduced roundtrips, eliminated the architectural regression, and improved performance and scalability.
So, now that you’ve taken care of your performance issues and have made sure that you can deploy safely and compatibly, you are faced with another very important concept - how do you utilize your infrastructure optimally.
Balancing the load of your microservices is one of the keys to properly utilizing your infrastructure. With so many instances running, keeping an eye on balancing is much more difficult. And if your services aren’t balanced, you:
Run the risk of the overloaded instances encountering performance problems
Waste money by running too many instances. If you can spread your load evenly, you can likely reduce the number of instances running. Somebody has to pay for all of the compute you are using.
Balancing is more than just balancing throughput. Make sure CPU and Memory consumption is balanced as well.
If you can get a handle on your balancing, you can more easily set parameters for scaling.
Take time to identify and establish criteria along the dimensions of CPU, Memory and Load for when you are to scale up.
More importantly, identify the same thresholds for when you can comfortably scale down. Not scaling down defeats the purpose of microservices and containers and costs a lot of money.
Once you scale up or down, make sure the load balances across the new configuration.
Automate all of this. Why do this manually?
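The scale-up/scale-down decision described above can be sketched as an explicit rule over CPU, memory and load. The threshold numbers below are placeholders for illustration, not recommendations:

```java
// Sketch of an automated scaling decision along the three dimensions
// from the slides: CPU, memory, and load (requests per second per instance).
class AutoScaler {
    enum Action { SCALE_UP, SCALE_DOWN, HOLD }

    static Action decide(double cpuPct, double memPct, double reqPerSec, int instances) {
        // Scale up if any dimension is hot.
        if (cpuPct > 80 || memPct > 85 || reqPerSec / instances > 500) {
            return Action.SCALE_UP;
        }
        // Scale down only when every dimension is comfortably cold --
        // and never below one instance.
        if (instances > 1 && cpuPct < 30 && memPct < 40 && reqPerSec / instances < 100) {
            return Action.SCALE_DOWN;
        }
        return Action.HOLD;
    }
}
```

After each scaling action, re-check that the load actually balances across the new set of instances, as the slides stress.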
One thing a lot of people don’t realize is that monitoring your infrastructure in a monolithic app is very different than monitoring in a microservices architecture.
You might be monitoring all of the hosts in your environment, but even if all of the hosts are green, it doesn’t mean your system is running well. It just means that none of your hosts or processes have a problem, and that’s at least good.
Your red instances might indicate a non-critical full disk partition, in which case, who really cares except the infrastructure team? With the green hosts, your host and process may be healthy, but your code might be terrible. We need to see all the way through from infrastructure to service/code performance.
Also, most of this type of monitoring only shows how you are doing right now. You have to look at behavior over time to know what impacts what.
Historical and trend data is very important.
You need deployments to show up in your monitoring, so that if you see strange behavior, you can tie it to the change. If you can’t see that, you’re stuck. You also need to be able to compare performance to the same hour last week to know if something is out of the ordinary.
Time-based monitoring is a huge improvement, but it still leaves a lot out. We don’t know what impacts what, and what dependencies there are.
Understand your dependencies – know which services are running on which processes on which hosts in which data centers. Know how services talk to each other, processes interact with each other.
In a monolith, if a single server is having a problem, you’ll know which services are impacted. You know which dependents might be impacted.
In a microservices environment, your hosts can number well over 100, even into the thousands, and processes and service instances into the tens of thousands. There’s no way you’re going to know what all is dependent on each other. And if you figure it out today, tomorrow it will change. Dependency monitoring/mapping is very important.
This means you have to be able to map this in your monitoring. If a single server goes down in a microservices environment, who are the immediate neighbors that are impacted? Who are the services further down the line that might be impacted? Think of that tip-of-the-iceberg slide we looked at earlier. If one of those nodes out on the far left had a problem, are we really going to know that it was the one to impact the node on the far right?
If you have 5 machines with high CPU, you’re likely going to look for root cause on those machines. If you know the dependencies, you’ll be able to see that the problem is a network issue on a downstream component.
Green & Red lights are good. Timeline metrics, especially with deployment markers, are good, but in the microservices world, you have to monitor the dependencies as well.
[For Docker crowds]
It’s important to monitor your entire Docker stack. Monitor the containers, the hosts they run on, the instance counts, throughput, etc. Utilize dependency mapping to be able to look at your Docker components as both an ecosystem as well as individual components.
Important to monitor each container as if it was a host – traffic, CPU, memory
Also important to monitor the hosts the containers are running on.
Here is also where you can see the load distribution. If I’m running 10 containers, but my load is not balanced like the picture here, there’s a good chance I may be able to spin down to 6 or 7 containers if I balance the load. That’ll save us money.
And last but not least, in order to know what to do here, you have to monitor your instances. Chart the load distribution, chart the number of instances, chart the resource utilization across your services. If you don’t take the time to monitor your system, you won’t know what’s going on and you will end up paying much more for the compute resources you are using. The point of moving to, say, Docker and the cloud is to improve performance and save money. You can only do this if you are monitoring performance and resource utilization. Also, it’s important not to set-and-forget this. Review your monitoring strategies and collected data often to see if you can optimize better.
A final thought here – when you move to a microservices architecture, the components that keep your business logic up and running are as important as the business logic itself.
Lessons Learned!
N+1 patterns are like cockroaches – they’ll outlive us all
Approach N+1 patterns like crack – don’t do it.