In this talk I will walk you through the performance tuning steps that I took to serve 1.2M JSON requests per second from a 4 vCPU c5 instance, using a simple API server written in C.
At the start of the journey the server was capable of a very respectable 224k req/s with its default configuration. Along the way I made extensive use of tools like FlameGraph and bpftrace to measure, analyze, and optimize the entire stack, from the application framework, to the network driver, all the way down to the kernel.
I began this wild adventure without any prior low-level performance optimization experience, but once I started going down the performance tuning rabbit hole, there was no turning back. Fueled by my curiosity, willingness to learn, and relentless persistence, I was able to boost performance by over 400% and reduce p99 latency by almost 80%.
Extreme HTTP Performance Tuning: 1.2M API req/s on a 4 vCPU EC2 Instance
1. Brought to you by
Extreme HTTP Performance Tuning:
1.2M API req/s on a 4 vCPU EC2 Instance
Marc Richards
Chief Problem Solver at Talawah Solutions
2. Marc Richards
Chief Problem Solver at Talawah Solutions
● Based in Kingston, Jamaica
● Cloud Computing Consultant for almost a decade
● Solutions Architect / DevOps Engineer / Performance Engineer
● No low-level systems performance tuning experience before this project!
3. Demystifying Systems Performance Tuning
● You don't need to be a kernel developer or a wizard sysadmin.
● FlameGraph and bpftrace have changed the game.
● New eBPF-based tools coming out will only make things easier!
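A sketch of that measure-and-visualize workflow (assumes perf, bpftrace, and Brendan Gregg's FlameGraph scripts are installed; the sampling rate and duration here are illustrative, not values from the talk):

```shell
# Sample on-CPU stacks across all CPUs at 99 Hz for 30 seconds
sudo perf record -F 99 -a -g -- sleep 30
# Fold the stacks and render an interactive SVG flame graph
# (both scripts ship with the FlameGraph repo)
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg

# Quick bpftrace one-liner: count syscalls by process name
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```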
4. Overview
● I accidentally fell down this optimization rabbit hole.
● Started with a simple, high-performance API server written in C.
● Used FlameGraph and bpftrace to analyze and optimize the entire stack.
5. Overview
● Cloud: AWS
● Hardware: 4 vCPU c5n.xlarge** (server) / 16 vCPU c5n.4xlarge (client)
● Benchmark: Techempower JSON Serialization test
● Server: Techempower libreactor implementation
** In order to minimize inconsistencies at the platform level, I did the final benchmark run on a c5n.9xlarge that was restricted to 4 vCPUs using the EC2 CPU Options feature.
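The CPU Options restriction from the footnote is set at launch time. A hedged AWS CLI sketch of what that can look like (the AMI and key name are placeholders, not values from the talk):

```shell
# Launch a c5n.9xlarge that exposes only 4 vCPUs (2 cores x 2 threads)
# via the EC2 CPU Options feature
aws ec2 run-instances \
  --instance-type c5n.9xlarge \
  --cpu-options CoreCount=2,ThreadsPerCore=2 \
  --image-id ami-0123456789abcdef0 \
  --key-name my-key
```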
6. Blog post with even more details
https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/
7. Optimizations
Optimization                                  Gain   Req/s
Ground Zero                                   -      224k
Application Optimizations                     55%    347k
Disabling Speculative Execution Mitigations   28%    446k
Disabling Syscall Auditing / Blocking         11%    495k
Disabling iptables / netfilter                22%    603k
Perfect Locality                              38%    834k
Interrupt Optimizations                       28%    1.06M
The Case of the Nosy Neighbor                 6%     1.12M
The Battle Against the Spin Lock              2%     1.15M
This Goes to Twelve                           4%     1.20M
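Several of the middle rows are system-level switches rather than code changes. A hedged sketch of what those toggles can look like (exact mechanisms vary by kernel and distro; the blog post linked above has the authoritative steps):

```shell
# Speculative-execution mitigations: disable via a kernel boot parameter
# (append to GRUB_CMDLINE_LINUX and reboot)
mitigations=off

# Syscall auditing: stop audit from evaluating rules on every syscall
sudo auditctl -a never,task

# iptables/netfilter: prevent the conntrack module from ever loading, so no
# per-packet netfilter hooks run
echo 'install nf_conntrack /bin/true' | sudo tee /etc/modprobe.d/disable-conntrack.conf
```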
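The gains in the table compound multiplicatively: each row's req/s is the previous row's times (1 + gain). A quick arithmetic check that the table is internally consistent:

```shell
# Compound the per-step gains on top of the 224k baseline
awk 'BEGIN {
  r = 224                                   # baseline, thousands of req/s
  split("55 28 11 22 38 28 6 2 4", g, " ")  # per-step gains in percent
  for (i = 1; i <= 9; i++) r *= 1 + g[i] / 100
  printf "%.0fk req/s\n", r                 # lands on the ~1.20M final row
}'
```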
9. Ground Zero
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 1.14ms
90.00% 1.21ms
99.00% 1.26ms
99.99% 1.32ms
2243551 requests in 10.00s, 331.64MB read
Requests/sec: 224,353.73
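The output above follows wrk's report format. The slides don't name the exact load-generator build, but an invocation matching the reported 16 threads, 256 connections, and 10-second duration (with `--latency` enabling the distribution report) would look something like:

```shell
# 16 threads, 256 connections, 10 seconds, latency distribution enabled;
# URL taken from the slide
wrk --latency -t16 -c256 -d10s http://server.tfb:8080/json
```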
10. * I modified nginx.conf to send back a hardcoded JSON response. This is not a part of the Techempower implementation.
25. The Case of the Nosy Neighbor
+
The Battle Against the Spin Lock
26. The Case of the Nosy Neighbor
Someone, somewhere was spying on all my packets (kinda)
● dev_queue_xmit_nit() -> packet_rcv()
● packet_rcv() implicates AF_PACKET
● sudo ss --packet --processes -> (("dhclient",pid=3191,fd=5))
● My (extreme) solution was to disable dhclient after boot
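As an illustrative way to attribute that per-packet overhead (not taken verbatim from the slides), a bpftrace one-liner can count packet_rcv() calls and capture the kernel stacks leading to it; dev_queue_xmit_nit() showing up in those stacks points at the AF_PACKET tap on the transmit path:

```shell
# Count packet_rcv() hits, keyed by the kernel stack that got us there
sudo bpftrace -e 'kprobe:packet_rcv { @stacks[kstack] = count(); }'
```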
27. The Case of the Nosy Neighbor
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 218.00us
90.00% 254.00us
99.00% 285.00us
99.99% 341.00us
11279049 requests in 10.00s, 1.53GB read
Requests/sec: 1,127,894.86
29. The Battle Against the Spin Lock
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 212.00us
90.00% 246.00us
99.00% 276.00us
99.99% 338.00us
11551707 requests in 10.00s, 1.57GB read
Requests/sec: 1,155,162.15
37. Next Steps
● Next gen kernel: 5.10 LTS
● Next gen technologies: io_uring
● Next gen instances: ARM vs Intel vs AMD
● Driving performance from the bottom-up using Rust, Java, etc
38. Brought to you by
Marc Richards
https://talawah.io/contact
@talawahtech