Running our software on a 2000-core cluster: Lessons learnt
Structure For each problem: Symptoms, Method of investigation, Cause, Action taken, Morale
Background Pretty simple: Distributing embarrassingly parallel computations on a cluster Distribution fabric is RabbitMQ Publish tasks to queue Pull results from queue Computational listeners on cluster nodes Tasks are “fast” (~1s cpu time) or “slow” (~15min cpu time) Tasks are split into parts (usually 160) Also parts share the same data chunk – it’s stored in memcached and task input contains the “shared data id” Requirements: 95% utilization for slow tasks, “as much as we can” for fast ones.
RabbitMQ starts refusing connections to some clients when there are  too many of them.
Investigation Eventually turned out RabbitMQ supports  max ~400 connections per process on Windows.
Solution In RabbitMQ: Establish a cluster of RabbitMQs 2 “eternal” connections per client, 512 connections per instance, 1600 clients → ~16 instances suffice. Instances start on the same IP, on consecutive ports (5672, 5673, ...) In code: Make both submitter and consumer scan ports until success
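A minimal sketch of that port scan, assuming the standard RabbitMQ.Client ConnectionFactory; the host name, port range and error handling here are illustrative, not our production code.

using System;
using RabbitMQ.Client;

// Try the consecutive ports of the RabbitMQ cluster until one instance accepts us.
static IConnection ConnectToCluster(string host, int firstPort, int instanceCount)
{
    for (int port = firstPort; port < firstPort + instanceCount; port++)
    {
        try
        {
            var factory = new ConnectionFactory { HostName = host, Port = port };
            return factory.CreateConnection();   // success: keep this "eternal" connection
        }
        catch (Exception)
        {
            // this instance is full (or down) - try the next port
        }
    }
    throw new InvalidOperationException("No RabbitMQ instance accepted the connection");
}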
Morale Capacity planning! If there’s a resource, plan how much of it you’ll need and with what pattern of usage. Otherwise you’ll exhaust it sooner or later. Network bandwidth Network latency Connections Threads Memory Whatever
RabbitMQConsumer uses a legacy component which can’t run concurrent instances in the same directory
Solution Create a temporary directory and Directory.SetCurrentDirectory() into it at startup. Problem: the temp directories pile up.
Solution At startup, clean up unused temp directories. How to know if it is unused? Create a lock file in the directory At startup, try removing lock files and dirs Problem Races: several instances want to delete the same file All but one crash! Several solutions with various kinds of races, “fixed” by try/ignore band-aid… Just wrap the whole “clean-up” block in a try/ignore! That’s it.
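A sketch of that final shape, assuming a hypothetical baseTempPath; the point is that the whole clean-up is one best-effort block, so losing a race with another instance just means skipping this round of clean-up.

using System.IO;

static void CleanUpTempDirs(string baseTempPath)
{
    try
    {
        foreach (var dir in Directory.GetDirectories(baseTempPath))
        {
            File.Delete(Path.Combine(dir, "lock"));   // throws while a live instance still holds the lock file
            Directory.Delete(dir, recursive: true);   // only reached for unused directories
        }
    }
    catch
    {
        // non-critical: whoever survives the race cleans up, the rest just move on
    }
}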
Morale If it’s non-critical, wrap the whole thing with try/ignore Even if you think it will never fail It will (maybe in the future, after someone changes the code…) Thinking “it won’t” is unneeded complexity Low-probability errors will happen The chance per occasion is small, but the occasions are many 0.001 probability of error, 2000 occasions = 87% that at least 1 failure occurs
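For the record, the arithmetic behind that figure: P(at least one failure) = 1 − (1 − 0.001)^2000 = 1 − 0.999^2000 ≈ 1 − e^−2 ≈ 0.865, i.e. roughly 87%.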
Then the thing started working. Kind of. We asked for 1000 tasks “in flight”, and got only about 125.
Gateway is highly CPU loaded (perhaps that’s the bottleneck?)
Solution Eliminate data compression It was unneeded – 160 compressions of <1kb-sized data per task (1 per subtask)! Eliminate unneeded deserialization Eliminate Guid.NewGuid() per subtask It’s not nearly as cheap as one might think Especially if there’s 160 of them per task Turn on server GC
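For reference, server GC in the full .NET Framework is a one-line switch in the host process’s app.config:

<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>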
Solution (ctd.) There was support for our own throttling and round-robining in code We didn’t actually need it! (needed before, but not now) Eliminated both Result Oops, RabbitMQ crashed!
Cause 3 queues per client Remember “Capacity planning”? A RabbitMQ queue is an exhaustible resource Didn’t even remove unneeded queues Long to explain, but Didn’t actually need them in this scenario RabbitMQ is not ok with several thousand queues rabbitmqctl list_queues took an eternity
Solution Have 2 queues per JOB and no cancellation queues Just purge the request queue OK unless several jobs share their request queue (we don’t use that option).
And then it worked Compute nodes at 100% cpu Cluster saturates quickly and stays saturated Cluster fully loaded
Morale Eliminate bloat – Complexity kills Even if “We’ve got feature X” sounds cool Round-robining and throttling Cancellation queues Compression
Morale Rethink what is CPU-cheap O(1) is not enough You’re going to compete with 2000 cores You’re going to do this “cheap” stuff a zillion times
Morale Rethink what is CPU-cheap 1 task = avg. 600ms of computation for 2000 cores Split into 160 parts 160 Guid.NewGuid() 160 gzip compressions of 1kb data 160 publishes to RabbitMQ 160*N serializations/deserializations It’s not cheap at all, compared to 600ms Esp. compared to 30ms, if you’re aiming at 95% scalability
And then we tried short tasks ~1000x shorter
Oh well. The tasks are really short, after all…
And we started getting a whole lot of memcached misses.
Investigation Have we put so much into memcached that it evicted the tasks?
Log: Key XXX not found
> echo “GET XXX” | telnet 123.45.76.89 11211
YYYYYYYY
Nope, it’s still there.
Solution Retry until ok (with exponential back-off)
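Roughly the shape of that retry loop (a sketch; FetchFromMemcached is a hypothetical helper and the delay constants are illustrative):

using System;
using System.Threading;

static byte[] GetWithRetry(string key)
{
    var delay = TimeSpan.FromMilliseconds(50);
    while (true)
    {
        var value = FetchFromMemcached(key);   // hypothetical memcached lookup, null on a miss
        if (value != null) return value;
        Thread.Sleep(delay);                   // back off exponentially, capped at 5 s
        delay = TimeSpan.FromMilliseconds(Math.Min(delay.TotalMilliseconds * 2, 5000));
    }
}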
Desperately retrying Blue: Fetching from memcached Orange: Computing Oh.
Investigation Memcached can’t be down for that long, right? Right. Look into code… We cached the MemcachedClient objects to avoid creating them per each request because this is oh so slow
Investigation There was a bug in the memcached client library (Enyim) It took too long to discover that a server is back online Our “retries” were not actually retrying They were stumbling on Enyim’s cached “server is down”.
Solution Do not cache the MemcachedClient objects Result: That helped. No more misses.
Morale Eliminate bloat – Complexity kills I think we’ve already talked of this one. Smart code is bad because you don’t know what it’s actually doing
Then we saw that memcached gets take 200ms each
Investigation Memcached can’t be that slow, right? Right. Then who is slow? Who is between us and memcached? Right, Enyim. Creating those non-cached Client objects
Solution Write own fat-free “memcached client” Just a dozen lines of code The protocol is very simple. Nothing stands between us and memcached (well, except for the OS TCP stack) Result: That helped. Now gets take ~2ms.
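The text protocol really is that simple. A sketch of such a get (ASCII values only, single-key case, no error handling):

using System.IO;
using System.Net.Sockets;
using System.Text;

// Minimal memcached "get" over the plain text protocol.
static string Get(string host, int port, string key)
{
    using (var tcp = new TcpClient(host, port))
    using (var stream = tcp.GetStream())
    {
        var request = Encoding.ASCII.GetBytes("get " + key + "\r\n");
        stream.Write(request, 0, request.Length);

        var reader = new StreamReader(stream, Encoding.ASCII);
        var header = reader.ReadLine();               // "VALUE <key> <flags> <bytes>" or "END" on a miss
        if (header == null || !header.StartsWith("VALUE")) return null;

        int length = int.Parse(header.Split(' ')[3]);
        var data = new char[length];
        reader.ReadBlock(data, 0, length);            // the data block; the trailing "\r\nEND\r\n" is ignored
        return new string(data);
    }
}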
Morale Eliminate bloat – Complexity kills Should I say more?
And this is how well we scaled these short tasks. About 5 1-second tasks/s.  Terrific for a 2000-core cluster.
Investigation These stripes are almost parallel! Because tasks are round-robined to nodes in the same order. And this round-robiner’s not keeping up. Who’s that? RabbitMQ. We must have hit RabbitMQ limits ORLY? We push 160 messages per 1 task that takes 0.25ms on 2000 cores. Capacity planning?
Investigation And we also have 16 RabbitMQs. And there’s just 1 queue. Every queue lives on 1 node. 15/16 = 93.75% of pushes and pulls are indirect.
Solution Don’t split these short tasks into parts. Result: That helped. ~76 tasks/s submitted to RabbitMQ.
And then this An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. (during connection to the Gateway) Spurious program crashes in Enyim code under load
Solution Update Enyim to latest version. Result: Didn’t help.
Solution Get rid of Enyim completely. (also implement put() – another 10 LOC) Result: That helped No more crashes Postfactum: Actually I forgot to destroy the Enyim client objects 
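The matching put() is the same kind of sketch (ASCII values, flags and expiry hard-coded to 0):

using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

// Minimal memcached "set": "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n", expect "STORED".
static void Put(string host, int port, string key, string value)
{
    using (var tcp = new TcpClient(host, port))
    using (var stream = tcp.GetStream())
    {
        var payload = Encoding.ASCII.GetBytes(
            "set " + key + " 0 0 " + value.Length + "\r\n" + value + "\r\n");
        stream.Write(payload, 0, payload.Length);

        var reply = new StreamReader(stream, Encoding.ASCII).ReadLine();
        if (reply != "STORED")
            throw new InvalidOperationException("memcached set failed: " + reply);
    }
}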
Morale Third-party libraries can fail They’re written by humans Maybe by humans who didn’t test them under these conditions (i.e. a large number of connections occupied by the rest of the program) YOU can fail (for example, misuse a library) You’re a human Do not fear replacing a library with an easy piece of code Of course if it is easy (for memcached it, luckily, was) “Why did they write a complex library?” Because it does more, but maybe not what you need.
But we’re still there at 76 tasks/s.
Solution A thorough and blind CPU hunt in Client and Gateway. Didn’t want to launch a profiler on the cluster nodes because RDP was laggy and I was lazy (Most probably this was a mistake)
Solution Fix #1 Special-case optimization for TO-0 tasks: Unneeded deserialization and splitting in Gateway (don’t split them at all) Result Gateway CPU load drops 2x Scalability doesn’t improve
Solution Fix #2 Eliminate task GUID generation in Client Parallelize submission of requests To spread WCF serialization CPU overhead over cores Turn on Server GC Result Now it takes 14 instead of 20s to push 1900 tasks to Gateway (130/s). Still not quite there.
Look at the cluster load again Where do these pauses come from? They appear consistently on every run.
Where do these pauses come from? What can pause a .NET application? The Garbage Collector The OS (swap in/out) What’s common between these runs? ~Number of tasks in memory at pauses
Where did the memory go? Node with Client had 98-99% physical memory occupied. By whom? SQL Server: >4Gb MS HPC Server: Another few Gb No wonder.
Solution Turn off HPC Server on this node. Result: These pauses got much milder
Still don’t know what this is. About 170 tasks/s. Only using 1248 cores. Why? We don’t know yet.
Morale Measure your application. Eliminate interference from others. The interference can be drastic. Do not place a latency-sensitive component together with anything heavy (throughput-sensitive) like SQL Server.
But scalability didn’t improve much.
How do we understand why it’s so bad? Eliminate interference.
What interference is there? “Normalizing” tasks Deserialize Extract data to memcached Serialize Let us remove it (prepare tasks, then shoot like a machine gun). Result: almost same – 172 tasks/s (Unrealistic but easier for further investigation)
So how long does it take to submit a task? (now that it’s the only thing we’re doing) Client: “Oh, quite a lot!” Gateway: “Not much.” 1 track = 1 thread. Before BeginExecute: start orange bar, after BeginExecute: end bar.
Duration of these bars Client: “Usually and consistently about 50ms.” Gateway:  “Usually a couple ms.”
Very suspicious What are those 50ms? Too round of a number. Perhaps some protocol is enforcing it? What’s our protocol?
What’s our protocol? tcp, right? var client = new CloudGatewayClient("BasicHttpBinding_ICloudGateway"); Oops.
Solution Change to NetTcpBinding Don’t remember which is which :( Still looks strange, but much better.
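Conceptually the fix lives in the WCF client configuration; the endpoint names and addresses below are illustrative, not our real config:

<!-- before: the endpoint was silently going over HTTP -->
<endpoint name="BasicHttpBinding_ICloudGateway"
          address="http://gateway:8080/CloudGateway"
          binding="basicHttpBinding"
          contract="ICloudGateway" />

<!-- after: switch to the TCP binding (and pass its name to CloudGatewayClient) -->
<endpoint name="NetTcpBinding_ICloudGateway"
          address="net.tcp://gateway:8081/CloudGateway"
          binding="netTcpBinding"
          contract="ICloudGateway" />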
About 340 tasks/s.  Only using 1083 of >1800 cores!  Why? We don’t know yet.
Morale Double-check your configuration. Measure the “same” thing in several ways. Time to submit a task, from POV of client and gateway
Here comes the dessert. “Tools matter” Already shown how pictures (and drawing tools) matter. We have a logger. “Greg” = “Global Registrator”. Most of the pictures wouldn’t be possible without it. Distributed (client/server) Accounts for machine clock offset Output is sorted on “global time axis” Lots of smart “scalability” tricks inside
Tools matter And it didn’t work quite well, for quite a long time.  Here’s how it failed: Ate 1-2Gb RAM Output was not sorted Logged events with a 4-5min lag
Tools matter Here’s how its failures mattered: Had to wait several minutes to gather all the events from a run. Sometimes not all of them were even gathered After the problems were fixed, the “experiment roundtrip” (change, run, collect data, analyze) sped up at least 2x-3x.
Tools matter Too bad it was on the last day of cluster availability.
Why was it so buggy? The problem ain’t that easy (as it seemed). Lots of clients (~2000) Lots of messages 1 RPC request per message = unacceptable Don’t log a message until clock synced with the client machine Resync clock periodically Log messages in order of global time, not order of arrival Anyone might (and does) fail or come back online at any moment Must not crash Must not overflow RAM Must be fast
How does it work? Client buffers messages and sends them to server in batches (client initiates). Messages marked with client’s local timestamp. Server buffers messages from each client. Periodically client and server calibrate clocks (server initiates). Once a client machine is calibrated, its messages go to the global buffer with transformed timestamps. Messages stay in the global buffer for 10s (“if a message is earliest for 10s, it will remain earliest”) Global buffer (windowSize): Add(time, event), PopEarliest() : (time, event)
So, the tricks were: Limit the global buffer (drop messages if it’s full) “Dropping message”…“Dropped 10000,20000… messages”…”Accepting again after dropping N” Limit the send buffer on client Same Use compression for batches (unused actually) Ignore (but log) errors like failed calibration, failed send, failed receive, failed connect etc Retry after a while Send records to server in bounded batches If I’ve got 1mln records to say, I shouldn’t keep the connection busy for a long time (num.concurrent connections is a resource!). Cut into batches of 10000. Prefer polling to blocking because it’s simpler
So, the tricks were: Prefer “negative feedback” style Wake up, see what’s wrong, fix Not: “react to every event, preserving invariants” Much harder, sometimes impossible. Network performance tricks: TCP NO_DELAY whenever possible Warm up the connection before calibrating Calibrate N times, average until confidence interval reached (actually, precise calibration is theoretically possible only if network latencies are symmetric, which they aren’t…)
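For reference, the NO_DELAY trick is a single property on the .NET socket; it disables Nagle batching so the small calibration packets go out immediately (serverHost/serverPort are illustrative):

using System.Net.Sockets;

var client = new TcpClient();
client.NoDelay = true;                     // TCP_NODELAY: don't batch small writes
client.Connect(serverHost, serverPort);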
And the bugs were: Client called server even if it had nothing to say. Impact: *lots* of unneeded connections. Fix: Check, poll.
And the bugs were: “Pending records” per-client buffer was unbounded. Impact: Server ate memory if it couldn’t sync clock Reason: Code duplication. Should have abstracted away “Bounded buffer”. Fix: Bound.
And the bugs were: If couldn’t calibrate with client at 1st attempt, never calibrated. Impact: Well… Esp. given the previous bug. Reason: try{loop}/ignore instead of loop{try/ignore} Meta reason: too complex code, mixed levels of abstraction Mixed what’s being “tried” with how it’s being managed (failures handled) Fix: change to loop{try/ignore}. Meta fix: Go through all code, classify methods into “spaghetti” and “flat logic”. Extract logic from spaghetti.
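The difference in shape, as a sketch (Calibrate and interval are hypothetical stand-ins):

// Buggy shape: the first failed attempt kills the whole periodic process.
try
{
    while (true) { Calibrate(); Thread.Sleep(interval); }
}
catch { /* ignored - and calibration never runs again */ }

// Fixed shape: each attempt may fail, the loop keeps going.
while (true)
{
    try { Calibrate(); }
    catch { /* log and ignore this attempt */ }
    Thread.Sleep(interval);
}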
And the bugs were: No calibration with a machine in scenario “Start client A, start client B, kill client A” Impact: Very bad. Reason: If client couldn’t establish a calibration TCP listener, it wouldn’t try again (“someone else’s listening, not my job”). Then that guy dies and whose job is it now? Meta reason: One-time global initialization for a globally periodic process (init; loop{action}). Global conditions change and initialization is needed again. Fix: Transform to loop{init; action} – periodically establish listener (ignore failure).
And the bugs were: Events were not coming out in order. Impact: Not critical by itself, but casts doubt on the correctness of everything. If this doesn’t work, how can we be sure that we even get all messages? All in all, very bad. Reason: ??? And they were also coming out with a huge lag. Impact: Dramatic (as already said).
The case of the lagging events There were many places where they could lag. That’s already very bad by itself… On client? (repeatedly failing to connect to server) On server? (repeatedly failing to read from client) In per-client buffer? (failing to calibrate / to notice that calibration is done) In global buffer? (failing to notice that this event has “expired” its 10s)
The case of the lagging events Meta fix: More internal logging Didn’t help. This logging was invisible because done with Trace.WriteLine and viewed with DbgView, which doesn’t work between sessions My fault – didn’t cope with this. Only failed under large load from many machines (the worst kind of error…) But could have helped. Log/assert everything If things were fine where you expect them to be, there’d be no bugs.But there are.
The case of the lagging events Investigation by sequential elimination of reasons. The most suspicious thing was the “time-buffered queue”. A complex piece of mud. “Kind of” a priority queue with tracking times and sleeping/blocking on “pop” Looked right and passed tests, but felt uncomfortable Rewrote it.
The case of the lagging events Rewrote it. Polling instead of blocking: “What’s the earliest event? Has it been here for 10s yet?” A classic priority queue “from the book” Peek minimum, check expiry → pop or not. That’s it. Now the queue definitely worked correctly. But events still lagged.
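A sketch of the rewritten queue under these assumptions (locking omitted; SortedDictionary stands in for a priority queue, since the .NET of that era had no built-in one):

using System;
using System.Collections.Generic;

// Time-buffered queue, polling style: an event becomes visible only after sitting here for windowSize.
class TimeBufferedQueue<T>
{
    readonly SortedDictionary<DateTime, Queue<T>> buckets = new SortedDictionary<DateTime, Queue<T>>();
    readonly TimeSpan windowSize;

    public TimeBufferedQueue(TimeSpan windowSize) { this.windowSize = windowSize; }

    public void Add(DateTime time, T item)
    {
        Queue<T> bucket;
        if (!buckets.TryGetValue(time, out bucket)) buckets[time] = bucket = new Queue<T>();
        bucket.Enqueue(item);
    }

    // Poll this periodically: pops the earliest event only once it has "expired" its window.
    public bool TryPopEarliest(DateTime now, out T item)
    {
        item = default(T);
        if (buckets.Count == 0) return false;

        var earliest = default(KeyValuePair<DateTime, Queue<T>>);
        foreach (var pair in buckets) { earliest = pair; break; }   // first key = earliest timestamp

        if (now - earliest.Key < windowSize) return false;          // not expired yet, keep polling
        item = earliest.Value.Dequeue();
        if (earliest.Value.Count == 0) buckets.Remove(earliest.Key);
        return true;
    }
}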
The case of the lagging events What remained? Only a walk through the code.
The case of the lagging events A while later…
The case of the lagging events A client has 3 associated threads. (1 per batch of records) Thread that reads them into the per-client buffer. (1 per client) Thread that pulls from the per-client buffer and writes calibrated events to the global buffer (after calibration is done). (1 per machine) Calibration thread.
The case of the lagging events A client has 3 associated threads. And they were created in ThreadPool. And ThreadPool creates no more than 2 new threads/s.
The case of the lagging events So we have 2000 clients on 250 machines. A couple thousand threads. Not a big deal, OS can handle more. And they’re all doing IO. That’s what an OS is for. Created at a rate of 2 per second. 4-5 minutes pass before the calibration thread is created in pool for the last machine!
The case of the lagging events Fix: Start a new thread without ThreadPool. And suddenly everything worked.
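The shape of the fix (PullAndCalibrate is a hypothetical stand-in for the per-client work):

// Before: queued to the ThreadPool, which injects new threads at roughly 2 per second.
ThreadPool.QueueUserWorkItem(_ => PullAndCalibrate(client));

// After: a dedicated thread per client, started immediately.
new Thread(() => PullAndCalibrate(client)) { IsBackground = true }.Start();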
The case of the lagging events Why did it take so long to find? Unreproducible on less than a dozen machines Bad internal debugging tools (Trace.WriteLine) And lack of understanding of their importance Too complex architecture Too many places can fail, need to debug all at once
The case of the lagging events Morale: Functional abstractions leak in non-functional ways. Thread pool functional abstraction = “Do something soon” Know how exactly they leak, or don’t use them. “Soon, but no sooner than 2/s”
Greg again Rewrote it nearly from scratch Calibration now also initiated by client Server only accepts client connections and moves messages around the queues Pattern “Move responsibility to client” – server now does a lot less calibration-related bookkeeping Pattern “Eliminate dependency cycles / feedback loops” Now server doesn’t care at all about failure of client Pattern “Do one thing and do it well” Just serve requests. Don’t manage workflow. It’s now easier for server to throttle the number of concurrent requests of any kind
The good parts OK, lots of things were broken. Which weren’t? Asynchronous processing We’d be screwed if not for the recent “fully asynchronous” rewrite “Concurrent synchronous calls” are a very scarce resource Reliance on a fault-tolerant abstraction: Messaging We’d be screwed if RabbitMQ didn’t handle the failures for us Good measurement tools We’d be blindfolded without the global clock-synced logging and drawing tools Good deployment scripts We’d be in a configuration hell if we did that manually Reasonably low coupling We’d have much longer experiment roundtrips if we ran tests on “the real thing” (Huge Legacy Program + HPC Server + everything) It was not hard to do independent performance optimizations of all the component layers involved (and there were not too many layers)
Morales
Morales Tools matter Would be helpless without the graphs  Would have done much more if the logger was fixed earlier… Capacity planning How much of X will you need for 2000 cores? Complexity kills Problems are everywhere, and if they’re also complex, then you can’t fix them Rethink “CPU cheap” Is it cheap compared to what 2000 cores can do? Abstractions leak Do not rely on a functional abstraction when you have non-functional requirements Everything fails Especially you Planning to have failures is more robust than planning how exactly to fight them There are no “almost improbable errors”: probabilities accumulate Explicitly ignore failures in non-critical code Code that does this is larger but simpler to understand than code that doesn’t Think where to put responsibility for what Difference in ease of implementation may be dramatic
That’s all.
