Slides of our CoNEXT'19 presentation of "RSS++: load and state-aware receive side scaling", a technique to ensure good load-balancing among the multiple cores of a server for networking applications.
RSS++
1. RSS++: load and state-aware receive side scaling
Tom Barbette, Georgios P. Katsikas, Gerald Q. Maguire Jr., and Dejan Kostić
[Figure: a many-core server (CORE 1, 2, 3, ...) fed by a 100G NIC]
2. Networking today
[Charts: Ethernet standard speed (Gbps) per year and number of CPU cores per year, both rising steeply between 1980 and 2020]
How to dispatch dozens of millions of packets per second to many cores?
Data from Karl Rupp / Creative Commons Attribution 4.0 International Public License
3. Sharding
Key-Value Stores: Minos [NSDI'19], Herd [SIGCOMM'14], MICA [NSDI'14], Chronos [SoCC'12], CPHASH [PPoPP'12]
Packet Processing / NFV: Metron [NSDI'18], NetBricks [OSDI'16], SNF [PeerJ'16], FastClick [ANCS'15], Megapipe [OSDI'12]
Network Stacks: ClickNF [ATC'18], StackMap [ATC'16], mTCP [NSDI'14], F-Stack [Tencent Cloud '13], Affinity-Accept [EuroSys'12]
Hello SOTA!
How to dispatch dozens of millions of packets per second to many cores?
5. Sharding's problem: high imbalance
⇒ Underutilization and high tail latency
6. RSS++: Rebalance groups of flows from time to time
• Much better load spreading
• Much lower latency
⇒ Latency reduced by 30%
⇒ Tail latency reduced by 5×
7. RSS++: Rebalance groups of flows from time to time
• Much better load spreading
• Much lower latency
• Opportunity to release 6 cores for other applications
⇒ 1/3 of resources freed
15. RSS++
Rebalance some RSS buckets from time to time.
RSS++ strikes the right balance between perfect load spreading and sharding, by migrating the RSS indirection buckets based upon the output of an optimization algorithm to even the load.
16. RSS++
RSS++ strikes the right balance between perfect load spreading and sharding, by migrating the RSS indirection buckets based upon the output of an optimization algorithm to even the load.
It handles stateful use-cases with a new per-bucket flow-table algorithm that migrates the state with the buckets.
19. RSS++ overview
[Figure: RSS++ overview. The NIC hashes each incoming packet into the indirection table, whose entries (e.g. 2, 2, 1, 2, 1, ...) map buckets to cores. Per-bucket counter tables (e.g. 2421, 2622, 1231, ..., 502, ..., 3112 packets) and the per-core CPU load (e.g. 90% vs. 40%) feed RSS++, whose balancing timer (1–10 Hz) rewrites the indirection table through the ethtool API under Linux or through DPDK APIs. Packets are counted by an XDP [CoNEXT'18] BPF program (Linux) or by an in-app function call (DPDK); CPU load comes from the kernel CPU load or from useful cycles / application cycles, respectively. The resulting per-bucket fractional loads (12%, 27%, 8%, 46%, 36%) drive the balancing.]
Greedy iterative approach
⇒ In 85% of the cases, a single run is enough to be within a 0.5% squared-imbalance margin, in 25 µs
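Below is a minimal, illustrative sketch (not the authors' actual program) of how an XDP/BPF program could maintain per-bucket counters like the ones above. It assumes the bucket index is approximated by a software hash of the IPv4 header, since plain XDP does not expose the NIC-computed RSS hash; NB_BUCKETS and all identifiers are hypothetical.

/* xdp_bucket_count.c -- per-bucket packet counting sketch (assumption-laden) */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define NB_BUCKETS 512                    /* indirection-table size (assumed) */

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, NB_BUCKETS);
    __type(key, __u32);
    __type(value, __u64);
} bucket_counters SEC(".maps");

SEC("xdp")
int count_buckets(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* Cheap software hash of the addresses; a Toeplitz hash with the NIC's
     * key would be needed to match the hardware bucket choice exactly. */
    __u32 bucket = (ip->saddr ^ ip->daddr ^ ip->protocol) % NB_BUCKETS;

    __u64 *cnt = bpf_map_lookup_elem(&bucket_counters, &bucket);
    if (cnt)
        (*cnt)++;                          /* per-CPU counter, no lock needed */

    return XDP_PASS;                       /* counting only, never drops */
}

char _license[] SEC("license") = "GPL";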
20. Stateful use-cases: state migration
• RSS++ migrates some RSS buckets
⇒ Packets from migrated flows need to find their state
[Figure: Core 1 and Core 2 each own their flow table (#1 and #2); after a bucket migrates, in which table is its state? (???)]
21. Stateful use-cases: state migration
• RSS++ migrates some RSS buckets
⇒ Packets from migrated flows need to find their state
• Possible approach: a shared flow table
22. Stateful use-cases: state migration
• RSS++ migrates some RSS buckets
⇒ Packets from migrated flows need to find their state
• Possible approach: a shared flow table
• RSS++ (DPDK implementation only): a per-bucket flow table (see figure)
[Figure: the packet hash indexes the indirection table (1, 2, 1, 2, 1, ...); a flow-pointer table maps each bucket to its own hash table (#1, #2, #3, ...). When a bucket moves from Core 2 to Core 3, the new core (nearly never) queues its packets until the previous core has finished handling all packets of that bucket.]
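As an illustration only, here is a plain-C sketch of that per-bucket flow-table idea; the actual RSS++ code lives in FastClick/DPDK, and every identifier below (bucket_entry, lookup_bucket, NB_BUCKETS) is invented for the example.

/* per_bucket_flow_table.c -- sketch of per-bucket state: one table per RSS bucket */
#include <stdint.h>

#define NB_BUCKETS 512                 /* indirection-table size (assumed)  */

struct flow_table;                     /* any single-threaded hash table    */

struct bucket_entry {
    struct flow_table *table;          /* state of all flows of this bucket */
    int owner_core;                    /* core currently serving the bucket */
};

static struct bucket_entry buckets[NB_BUCKETS];

/* Datapath: one indexed load, no lock, because only the bucket's owner
 * core ever touches that bucket's table. */
static inline struct flow_table *lookup_bucket(uint32_t rss_hash)
{
    return buckets[rss_hash % NB_BUCKETS].table;
}

/* Migration: only owner_core changes hands; the table pointer (and thus the
 * flow state) follows the bucket.  The new owner defers processing of the
 * bucket until the previous owner has drained the packets it had already
 * received for it, which prevents reordering. */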
25. Evaluation
Load imbalance of packet-based methods
Packet-based methods have a very good balance!
15 Gbps trace (~80K active flows/s) replayed towards the DUT
[Plot dimensions: flow-awareness vs. fine-grained load balancing]
26. Evaluation
Load imbalance of RSS
15 Gbps trace (~80K active flows/s) replayed towards the DUT
[Plot dimensions: flow-awareness vs. fine-grained load balancing]
27. Evaluation
Load imbalance of stateful methods
Without migration, the other approaches cannot do much good!
15 Gbps trace (~80K active flows/s) replayed towards the DUT
[Plot dimensions: flow-awareness vs. fine-grained load balancing]
28. Evaluation: Load imbalance of RSS++
×12 (avg. ~×5)
15 Gbps trace (~80K active flows/s) replayed towards the DUT
[Plot dimensions: flow-awareness vs. fine-grained load balancing]
29. Service chain at 100G: FW+NAT
15 Gbps trace (~80K active flows/s) accelerated up to 100 Gbps, 39K-rule firewall
RSS is not able to fully utilize new cores
30. Service chain at 100G: FW+NAT
15 Gbps trace (~80K active flows/s) accelerated up to 100 Gbps, 39K-rule firewall
RSS++ shows linear improvement with the number of cores
RSS is not able to fully utilize new cores
31. Service chain at 100G: FW+NAT
15 Gbps trace (~80K active flows/s) accelerated up to 100 Gbps, 39K-rule firewall
Sharing state between cores leads to poor performance
RSS++ shows linear improvement with the number of cores
RSS is not able to fully utilize new cores
32. Conclusion
State-aware NIC-assisted scheduling to solve a problem that will only get worse
– No dispatching cores
– Sharded approach (no OS scheduling)
A new state migration technique
– Minimal state "transfer"
– No lock in the datapath
Up to 14× lower 95th-percentile latency, no drops, and with 25%–37% fewer cores
Linux (via kernel API + a small patch) and DPDK implementations, fully available, with all experiment scripts
33. Thanks!
github.com/rsspp/
In the paper:
– How the solver works
– More evaluations
  > Particularly tail-latency studies
  > Comparison with Metron's traffic-class dispatching
– More state of the art
– Future work
– Discussions about use in other contexts:
  > KVS load-balancing
  > Dispatching using multiple cores in a pipeline
  > NUMA
– Trace analysis
This work is supported by SSF and ERC
36. Solutions for RSS's imbalance
• Sprayer [HotNets'18] / RPCValet [SOSP'19]
– Forget about flows, do per-packet dispatching
⇒ Stateful use-cases are dead
⇒ Even stateless ones are sometimes inefficient
• Metron [NSDI'18]
– Compute traffic classes, and split/merge classes among cores
⇒ Misses load-awareness; traffic classes may not be as uniform as hashing
• Affinity-Accept [EuroSys'12]
– Redirect connections in software to other cores, and re-program some RSS entries when they contain mostly redirected connections
⇒ Load imbalance at best as good as « SW Stateful Load » ⇒ we need migration
⇒ Software dispatching to some extent
37. SOTA: Intra-server LB
Dispatcher cores
Shinjuku*, Shenango
⇒ Still need RSS++ to dispatch to the many dispatching cores needed for 100G
⇒ Inefficient
Shuffling layer
ZygOS*, Affinity-Accept, Linux
⇒ Why pay for cache misses when the NIC can do it?
⇒ Do not support migration ⇒ high imbalance
*But we lose the mixing of multiple applications on a single core
38. Our contributions
• We solve the packet dispatching problem by migrating the RSS indirection buckets between shards based upon the output of an optimization algorithm
– Without the need for dispatching cores
• Dynamically scale the number of cores
⇒ Avoids the typical 25% over-provisioning
⇒ Order-of-magnitude lower tail latency
• Compensate for occasional state migration with a new stateful per-bucket flow-table algorithm:
– Prevents packet reordering during migration
– 20% more efficient than a shared flow table
⇒ Stateful, near-perfect intra-server load-balancing, even at the speed of 100 Gbps links
42. RSS++ algorithm
[Figure: worked example. CPU 1 is at 40% load and CPU 2 at 90%, i.e. 25% below and above the 65% average. Per-bucket counting tables (e.g. 502, 2421, 2622, 1231, 3112 packets) give each bucket a fractional load: a bucket carrying 31% of a 40%-loaded core's packets weighs 31% × 40% = 12% of a CPU; other buckets weigh 27%, 8%, 46%, 36%. The RSS++ problem solver then moves buckets in the indirection table so that the loads become 54% and 76%, i.e. −11%/+11% around the average.]
⇒ In 85% of the cases, a single run is enough to be within a 0.5% squared-imbalance margin, in 25 µs
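As an illustration of the fractional-load computation shown in the figure (31% of a 40%-loaded core's packets accounts for 12% of a CPU), here is a hedged C sketch; compute_fractional_loads and all sizes are invented for the example.

/* fractional_load.c -- per-bucket fractional load, as sketched on the slide */
#include <stdint.h>

#define NB_BUCKETS 512                 /* indirection-table size (assumed) */
#define MAX_CORES  18                  /* cores of the testbed (assumed)   */

static double fractional_load[NB_BUCKETS];

void compute_fractional_loads(const uint64_t *bucket_pkts, /* packets per bucket */
                              const int *bucket_core,      /* indirection table  */
                              const double *core_load)     /* CPU load in [0..1] */
{
    uint64_t core_pkts[MAX_CORES] = {0};

    /* Total packets handled by each core over the last balancing period. */
    for (int b = 0; b < NB_BUCKETS; b++)
        core_pkts[bucket_core[b]] += bucket_pkts[b];

    /* A bucket's fractional load is its share of the core's packets
     * multiplied by that core's CPU load, e.g. 31% x 40% = 12%. */
    for (int b = 0; b < NB_BUCKETS; b++) {
        int c = bucket_core[b];
        double share = core_pkts[c] ? (double)bucket_pkts[b] / core_pkts[c] : 0.0;
        fractional_load[b] = share * core_load[c];
    }
}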
43. Solver
If you like math, go to the paper.
We use a greedy, non-optimal approach because:
⇒ We don't care about the optimal solution
⇒ The state of the art showed resolution times that are too slow for multi-way number partitioning
44. Greedy approach
1. Sort buckets by descending fractional load
2. Sort underloaded cores by ascending load
3. Dispatch the most loaded buckets to underloaded cores, allowing over-moves by a threshold
4. Restart up to 10 times using different thresholds to find an inflection point
⇒ In 85% of the cases, a single run is enough to be within a 0.5% squared-imbalance margin, in 25 µs (a simplified sketch of one pass follows below)
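A simplified, non-authoritative C sketch of one pass of steps 1–3 above (the real solver also sorts the underloaded cores and retries with up to 10 thresholds); all identifiers are invented for the example.

/* greedy_rebalance.c -- one pass of greedy bucket rebalancing */
#include <stdlib.h>

struct bucket { int id; int core; double load; };   /* load = fractional load */

static int by_load_desc(const void *a, const void *b)
{
    double d = ((const struct bucket *)b)->load - ((const struct bucket *)a)->load;
    return (d > 0) - (d < 0);
}

/* core_load[c] is the current CPU load of core c; avg is the target average. */
void greedy_rebalance(struct bucket *buckets, int nb_buckets,
                      double *core_load, int nb_cores,
                      double avg, double threshold)
{
    /* 1. Most loaded buckets first, so the big chunks are placed early. */
    qsort(buckets, nb_buckets, sizeof(*buckets), by_load_desc);

    for (int b = 0; b < nb_buckets; b++) {
        int src = buckets[b].core;
        if (core_load[src] <= avg + threshold)
            continue;                              /* source core not overloaded */

        /* 2. Pick the currently least loaded core as destination. */
        int dst = 0;
        for (int c = 1; c < nb_cores; c++)
            if (core_load[c] < core_load[dst])
                dst = c;

        /* 3. Move the bucket unless it would over-move beyond the threshold. */
        if (core_load[dst] + buckets[b].load > avg + threshold)
            continue;

        core_load[src] -= buckets[b].load;
        core_load[dst] += buckets[b].load;
        buckets[b].core = dst;                     /* migrate the bucket */
    }
}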
45. Stateful use-cases: state migration
• RSS++ migrates some RSS buckets
⇒ Packets from migrated flows need to find their state
• Possible approach: a shared, as-efficient-as-possible hash table
[Figure: CPU 1 and CPU 2 both go through the indirection table (2, 2, 1, 2, 1, ...) into a single shared hash table and collide ("BANG").]
46. RSS++: Rebalance some RSS buckets from time to time
30% lower average latency
4–5× lower standard deviation and tail latency
56. Evaluation: State migration
Forwarding 1500-byte UDP packets from 1024 concurrent flows of 1000 packets, classified either in a single thread-safe Cuckoo hash table or in per-bucket hash tables
57. Evaluation: Firewall only
Trace accelerated up to 100 Gbps
• RSS cannot always fully utilize more cores due to load imbalance
• Even for a stateless case, a packet-based approach is harmful to the cache
58. Evaluation: 39K-rule firewall at 100G
Trace accelerated up to 100 Gbps
• Even for a stateless case, a packet-based approach is harmful to the cache
• We need hardware dispatching
67. Watch RSS live!
The Internet is not random:
• Buckets have up to ×1000 imbalance
• There is stickiness over time
⇒ The solution at t0 is mostly valid at t1
69. Why sharding in Linux?
• Unpublished result of a 3-second sampling
• Still much to do to take real advantage of sharding
70. Multiple applications
• To keep all the advantages of sharding, one should slightly modify our implementation to use a set of RSS queues per application, and exchange cores through a common pool of available cores
• Another idea would be to combine slow applications on one core, and reduce the polling problem
72. Background noise
• A small amount of background noise will make the load appear higher, so buckets will get evicted
• A high background noise would require modifying the algorithm to subtract it from the capacity of a CPU (i.e. note that a CPU is at 60% out of 70% usable load); otherwise the « bucket fractional load » will be disproportionate to the load of the other cores
(Do not say the names if the chair introduces me)
(Otherwise, after "joint work", do not mention myself)
Hundred-gigabit NICs are becoming a commodity in datacenters.
Those NICs have to dispatch dozens of millions of packets to many-core CPUs.
CLICK
And both those numbers, the Ethernet speeds and the number of cores, are increasing dramatically.
So the question that I'll address in this talk, [how to …], which is already a problem today, will be even more of a problem tomorrow.
If we look at the recent SOTA in high-speed software networking, a lot of recent work in key-value stores CLICK and in packet processing and network function virtualization advocates the use of sharding, as do all recent network stacks, which are sharded. CLICK
So what is this sharding about? To answer that, I'll show you our sharded testbed. We have a computer with 18 cores and a hundred-gigabit NIC. We configure the NIC so it dispatches packets to 18 queues, one per core.
On each core, we run an instance of the application, in our case iperf 2. The application is pinned to the core, and that's the idea of sharding. The computer is divided into independent shards; one can almost consider each core as a different server. The advantage of this is that we avoid any shared data structure, any contention between CPU cores.
If there were no problem with sharding, we would not have a paper today. CLICK
So to showcase the problem, we run an iperf client that will request 100 TCP flows. CLICK
One important point: the NIC dispatches packets to the cores using RSS, basically hashing packets so that packets of the same flow go to the same core.
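For readers of these notes, a tiny sketch of what RSS conceptually does in the NIC (simplified; RETA_SIZE and the hash are placeholders, real hardware uses a Toeplitz hash of the 5-tuple):

/* rss_dispatch.c -- conceptual RSS dispatching: hash -> indirection table -> queue */
#include <stdint.h>

#define RETA_SIZE 512                       /* indirection-table size (assumed)   */
static uint16_t reta[RETA_SIZE];            /* bucket -> RX queue (== core here)  */

static inline uint16_t rss_dispatch(uint32_t flow_hash)
{
    /* Same flow => same hash => same bucket => same core. */
    return reta[flow_hash % RETA_SIZE];
}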
Sprayer [Hugo Sadok, 2018, HotNets].
We see again, now that the load is higher, that RSS is still not able to fully utilize the new cores, even with 6 more cores.
20% more efficient, an order of magnitude lower latency with a high number of cores
With this, I will thank you for listening and will be happy to take any questions you may have.
25: no joke
Do this in an animation
One library for NIC-driven scheduling:
With multiple scheduling strategies, one of them being RSS++
Two « integrations »:
– Linux, reading packets using an XDP BPF program, and writing the indirection table using the ethtool API
– DPDK, counting packets through function calls and programming the NIC with DPDK's API
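As an illustration of the DPDK integration point mentioned above, a hedged sketch of pushing a new indirection table through DPDK's RETA API (the counterpart of what ethtool -X does for the Linux integration). It assumes a DPDK version where the group-size constant is RTE_ETH_RETA_GROUP_SIZE; the helper name is invented.

/* apply_reta.c -- write a bucket->queue mapping into the NIC's RSS indirection table */
#include <string.h>
#include <rte_ethdev.h>

int apply_indirection(uint16_t port, const uint16_t *bucket_to_queue,
                      uint16_t reta_size)
{
    struct rte_eth_rss_reta_entry64 reta[reta_size / RTE_ETH_RETA_GROUP_SIZE];
    memset(reta, 0, sizeof(reta));

    for (uint16_t i = 0; i < reta_size; i++) {
        uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
        uint16_t off = i % RTE_ETH_RETA_GROUP_SIZE;
        reta[grp].mask |= 1ULL << off;            /* mark this entry as updated   */
        reta[grp].reta[off] = bucket_to_queue[i]; /* queue == core when sharded   */
    }
    return rte_eth_dev_rss_reta_update(port, reta, reta_size);
}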
20% more efficient
Order of magnitude better latency
« Controlling Parallelism in a Multicore Software Router »
TODO: limit at 100G
TODO: make in multiple graphs
TODO: numbers
If we look at the number of packets received by each bucket, and map it as per a default indirection table, we can see that the number of packets received by each core is very disproportionate.
Moreover, we see that the load of each bucket is not completely random: some buckets tend to be highly loaded, or stay loaded for some time.
⇒ So what we propose in RSS++ is to migrate a few of those overloaded buckets from time to time, to even the load between all CPUs.