This is the talk I gave at Papers We Love #20 (Singapore) on the academic paper "The Linux Scheduler: a Decade of Wasted Cores".
The video of this talk can be found here: https://engineers.sg/v/758
Here are some relevant links:
Paper: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
Reference Slides: http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf
Reference summary: https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
The Linux Scheduler: a Decade of Wasted Cores
1. The Linux Scheduler: a Decade of Wasted Cores
Authored by:
1. Jean-Pierre Lozi (Université Nice Sophia Antipolis)
2. Baptiste Lepers (École Polytechnique Fédérale de Lausanne)
3. Justin Funston (University of British Columbia)
4. Fabien Gaud (Coho Data)
5. Vivien Quéma (Grenoble Institute of Technology)
6. Alexandra Fedorova (University of British Columbia)
EuroSys Conference (18-21 April 2016)
Papers We Love #20 (30 May 2016) By: Yeo Kheng Meng (yeokm1@gmail.com)
3. Some history
• Everybody wants ↑ CPU performance
• Before 2004:
• ↓ transistor size
• ↓ power of each transistor (Dennard Scaling)
• ↑ CPU frequency -> ↑ CPU performance
• ~2005-2007 to present:
• End of Dennard Scaling
• Increased use of multicores to ↑ CPU performance
• But did Linux properly take advantage of these cores?
4. Objective/Invariant of a Linux scheduler
• Balance load evenly across CPU cores to maximise resource usage
• No idle CPU cores while other cores have waiting threads
5. Test setup
• AMD Bulldozer Opteron 6272 (Socket G34) + 512 GB RAM
• 8 NUMA nodes x 8 cores (64 hardware threads)
• NUMA: Non-Uniform Memory Access
• Cores have faster access to the local memory of their own node than to remote (foreign) memory
• Each Opteron NUMA node has its own last-level (L3) cache shared by its cores
• Total RAM split into 64 GB chunks, one per node
• Linux kernel versions up to 4.3
• TPC-H benchmark
• TPC: Transaction Processing Performance Council
• TPC-H: complex database queries and data modification
http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/
6. What is the problem?
• Idle cores exist despite other cores being overloaded
• Performance bugs in the Linux Kernel
7. What are the bugs?
1. Group imbalance (Mar 2011)
2. Scheduling Group Construction (Nov 2013)
3. Overload-on-Wakeup (Dec 2009)
4. Missing Scheduling Domains (Feb 2015)
8. First, some concepts
• Thread weight
• Higher priority -> higher weight
• Decided by Linux from the thread's priority
• Timeslice
• Time allocated to each thread to run on the CPU within a certain interval
• CPU cycles are divided in proportion to each thread's weight (see the sketch after this list)
• Runtime
• Cumulative time the thread has spent on the CPU
• Once runtime > timeslice, the thread is preempted
• Runqueue
• Queue of threads waiting to be executed by the CPU
• Sorted by runtime
• Implemented as a red-black tree
• Completely Fair Scheduler (CFS)
• Linux's scheduler; uses Weighted Fair Queuing to schedule threads
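As a rough sketch of the timeslice rule above (the real kernel derives weights from nice values via fixed tables; this just uses the slides' simplified formula), here is a small C program that reproduces the example table on the following slides:

```c
#include <stdio.h>

/* Threads A-E with the weights used in the CFS example below. */
struct thread { const char *name; int weight; };

int main(void) {
    struct thread threads[] = {
        {"A", 10}, {"B", 20}, {"C", 40}, {"D", 50}, {"E", 80},
    };
    int n = sizeof threads / sizeof threads[0];
    double interval = 1.0;          /* scheduling interval: 1 second */

    int total = 0;                  /* total weight of runnable threads */
    for (int i = 0; i < n; i++)
        total += threads[i].weight;

    /* timeslice = (weight / total) * interval */
    for (int i = 0; i < n; i++)
        printf("Thread %s: weight %2d -> timeslice %.2f s\n",
               threads[i].name, threads[i].weight,
               (double)threads[i].weight / total * interval);
    return 0;
}
```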
9. Completely Fair Scheduler (Single-Core)
Time interval: 1 second

Thread  Weight  Timeslice = (Weight / Total) * Interval  Assigned timeslice
A       10      10 / 200 * 1                             0.05
B       20      20 / 200 * 1                             0.10
C       40      40 / 200 * 1                             0.20
D       50      50 / 200 * 1                             0.25
E       80      80 / 200 * 1                             0.40
Total   200

Runqueue (sorted by runtime): A 0, B 0, C 0, D 0, E 0
Running on CPU core: Thread A
Time elapsed: 0 s
10. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 9.)
Runqueue (sorted by runtime): B 0, C 0, D 0, E 0
Thread A finishes its 0.05 s timeslice; Thread B runs next.
Time elapsed: 0 s -> 0.05 s
11. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 9.)
Runqueue (sorted by runtime): C 0, D 0, E 0, A 0.05
Thread B finishes its 0.10 s timeslice; Thread C runs next.
Time elapsed: 0.05 s -> 0.15 s
12. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 9.)
Runqueue (sorted by runtime): D 0, E 0, A 0.05, B 0.10
Thread C finishes its 0.20 s timeslice; Thread D runs next.
Time elapsed: 0.15 s -> 0.35 s
13. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 9.)
Runqueue (sorted by runtime): E 0, A 0.05, B 0.10, C 0.20
Thread D finishes its 0.25 s timeslice; Thread E runs next.
Time elapsed: 0.35 s -> 0.60 s
14. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 9.)
Runqueue (sorted by runtime): A 0.05, B 0.10, C 0.20, D 0.25
Thread E runs its 0.40 s timeslice.
Time elapsed: 0.60 s -> 1.00 s
15. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 9.)
Runqueue (sorted by runtime): A 0.05, B 0.10, C 0.20, D 0.25, E 0.40
All threads have used their timeslices; the 1-second interval is over.
Time elapsed: 1.00 s
16. Completely Fair Scheduler (Single-Core)
A new thread F (weight 50) joins; the total weight is now 250, so timeslices are recomputed.
Time interval: 1 second

Thread  Weight  Timeslice = (Weight / Total) * Interval  Assigned timeslice
A       10      10 / 250 * 1                             0.04
B       20      20 / 250 * 1                             0.08
C       40      40 / 250 * 1                             0.16
D       50      50 / 250 * 1                             0.20
E       80      80 / 250 * 1                             0.32
F       50      50 / 250 * 1                             0.20
Total   250

Runqueue (sorted by runtime): F 0, A 0.05, B 0.10, C 0.20, D 0.25, E 0.40
Running on CPU core: Thread F (lowest runtime)
Time elapsed: 0 s
17. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 16.)
Runqueue (sorted by runtime): A 0.05, B 0.10, C 0.20, D 0.25, E 0.40
Thread F finishes its 0.20 s timeslice; Thread A runs next.
Time elapsed: 0 s -> 0.20 s
18. Completely Fair Scheduler (Single-Core)
(Same timeslice table as slide 16.)
Runqueue (sorted by runtime): B 0.10, C 0.20, F 0.20, D 0.25, E 0.40
Thread A finishes its 0.04 s timeslice; Thread B runs next.
Time elapsed: 0.20 s -> 0.24 s
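The walkthrough above can be condensed into a toy scheduler loop. A minimal sketch, assuming the slides' simplified model: plain (not virtual) runtime, precomputed timeslices, and a linear scan standing in for the kernel's red-black tree. It reproduces the exact run order A, B, C, D, E:

```c
#include <stdio.h>

struct thread { const char *name; double timeslice, runtime; };

int main(void) {
    /* The five threads from slide 9, with their assigned timeslices. */
    struct thread rq[] = {
        {"A", 0.05, 0}, {"B", 0.10, 0}, {"C", 0.20, 0},
        {"D", 0.25, 0}, {"E", 0.40, 0},
    };
    int n = sizeof rq / sizeof rq[0];
    double elapsed = 0.0, interval = 1.0;

    while (elapsed < interval) {
        /* Pick the thread with the lowest accumulated runtime
         * (the leftmost node of CFS's red-black tree). */
        int next = 0;
        for (int i = 1; i < n; i++)
            if (rq[i].runtime < rq[next].runtime)
                next = i;
        printf("t=%.2f s: run thread %s for %.2f s\n",
               elapsed, rq[next].name, rq[next].timeslice);
        rq[next].runtime += rq[next].timeslice; /* charge the slice */
        elapsed += rq[next].timeslice;          /* preempt when used up */
    }
    return 0;
}
```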
19. What about multi-cores? (Global runqueue)
(Diagram: CPU Cores 0-3 all sharing a single global runqueue.)
Problems
• Context switching requires access to the runqueue
• Only one core can access/manipulate the runqueue at any one time
• Other cores must wait to get new threads
20. What about multi-cores? (Per-core runqueue)
(Diagram: CPU Cores 0-3, each with its own runqueue.)
Scheduler objectives in a multi-runqueue system
1. Runqueues have to be load-balanced periodically (every 4 ms) and when threads are added/awoken
2. No runqueue should have a high proportion of the high-priority threads
3. Should not balance every time there is a change in a queue
• Load balancing is computationally heavy
• Moving threads across runqueues causes cache misses
• DO IT LESS OFTEN, DO IT BETTER
4. No idle cores should be allowed -> emergency load balancing when a core goes idle
21. Naïve runqueue load-balancing algorithms
• Balance runqueues by the same number of threads?
• Ignores thread priority; some threads are more important than others
• Balance runqueues by total thread weight?
• Some high-priority threads sleep a lot
• Scenario: one sleepy high-priority thread alone in a queue
• -> Waste of CPU resources

Core 0 runqueue: Thread A (W=80, 25% CPU utilisation). Total weight = 80
Core 1 runqueue: Thread B (W=25, 60%), Thread C (W=25, 40%), Thread D (W=10, 50%), Thread E (W=20, 50%). Total weight = 80
22. Slightly improved load-balancing algorithm
• Concept of "load"
• load(thread) = thread weight * average % CPU utilisation
• Balance runqueues by total load (see the sketch below)

Before balancing:
Core 0 runqueue: Thread A (W=80, 25%) -> load 20. Total load = 20
Core 1 runqueue: Thread B (W=25, 60%) -> load 15, Thread C (W=25, 40%) -> load 10, Thread D (W=10, 50%) -> load 5, Thread E (W=20, 50%) -> load 10. Total load = 40

After balancing (Thread E moves to Core 0):
Core 0 runqueue: Thread A -> load 20, Thread E -> load 10. Total load = 30
Core 1 runqueue: Thread B -> load 15, Thread C -> load 10, Thread D -> load 5. Total load = 30
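A minimal C sketch to check the arithmetic of this example, using the load formula above (the thread names, weights and utilisations are the ones on this slide):

```c
#include <stdio.h>

struct thread { const char *name; int weight; double util; };

/* load = weight * average CPU utilisation, as defined on this slide */
static double load(struct thread t) { return t.weight * t.util; }

int main(void) {
    struct thread a = {"A", 80, 0.25};                  /* Core 0 */
    struct thread core1[] = {                           /* Core 1 */
        {"B", 25, 0.60}, {"C", 25, 0.40}, {"D", 10, 0.50}, {"E", 20, 0.50},
    };
    double total0 = load(a), total1 = 0;
    for (int i = 0; i < 4; i++)
        total1 += load(core1[i]);
    printf("total load: Core 0 = %.0f, Core 1 = %.0f\n", total0, total1);

    /* Moving Thread E (load 10) from Core 1 to Core 0 equalises them. */
    printf("after moving E: Core 0 = %.0f, Core 1 = %.0f\n",
           total0 + load(core1[3]), total1 - load(core1[3]));
    return 0;
}
```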
23. Pseudocode of the load-balancing algorithm
• A scheduling group (SG) is a subset of a scheduling domain (SD)
• An SG comprises one or more CPU cores
1. For each SD, from the lowest level of the hierarchy to the highest:
2. Select the designated core in the SD to run the algorithm (the first idle core, else core 0)
3. Compute the average load of every SG in the SD
4. Select the SG with the highest average load
5. If that SG's average load > the current SG's, balance the load by stealing work from it
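A C sketch of the walk these steps describe, under the slides' simplifying assumption that a group is summarised by its average load. The struct names and the steal_work helper are illustrative, not the kernel's actual API:

```c
#include <stddef.h>
#include <stdio.h>

struct sched_group  { double avg_load; };
struct sched_domain {
    struct sched_group *groups;   /* scheduling groups in this domain */
    size_t ngroups;
    struct sched_domain *parent;  /* next (larger) level, or NULL */
};

/* Illustrative stand-in for thread migration: move half the gap. */
static void steal_work(struct sched_group *from, struct sched_group *to) {
    double delta = (from->avg_load - to->avg_load) / 2;
    from->avg_load -= delta;
    to->avg_load   += delta;
}

/* Steps 1-5 above, run by the designated core whose group is `local`.
 * A real implementation recomputes `local` at each level; one level
 * is enough for this sketch. */
static void load_balance(struct sched_domain *sd, struct sched_group *local) {
    for (; sd != NULL; sd = sd->parent) {
        struct sched_group *busiest = NULL;
        for (size_t i = 0; i < sd->ngroups; i++) {
            struct sched_group *sg = &sd->groups[i];
            if (sg != local && (!busiest || sg->avg_load > busiest->avg_load))
                busiest = sg;     /* step 4: highest average load */
        }
        if (busiest && busiest->avg_load > local->avg_load)
            steal_work(busiest, local);   /* step 5 */
    }
}

int main(void) {
    struct sched_group cores[] = { {100}, {400} };  /* a 2-core domain */
    struct sched_domain pair = { cores, 2, NULL };
    load_balance(&pair, &cores[0]);                 /* run on core 0 */
    printf("after balancing: %.0f vs %.0f\n",
           cores[0].avg_load, cores[1].avg_load);
    return 0;
}
```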
25. Load balancing hierarchical order (Level 1)
Scheduling domain hierarchy:
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
(Diagram: the 8 NUMA nodes laid out as 0 4 5 1 / 6 2 3 7, with edges between directly-connected nodes.)
Balancing happens between pairs of cores, e.g. Core 0 balances with Core 1, Core 2 with Core 3, ..., Core 62 with Core 63
Scheduling Domains: CPU pairs
Number of SDs: 32
Scheduling Groups: individual CPU cores
26. Load balancing hierarchical order (Level 2)
The first pair of every node balances with the other core pairs in the same node.
Scheduling Domains: NUMA nodes
Number of SDs: 8
Scheduling Groups: CPU pairs
27. Load balancing hierarchical order (Level 3)
Node 0 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {0, 1, 2, 4, 6}
Scheduling Domains: directly-connected nodes. Number of SDs: 8. Scheduling Groups: NUMA nodes
28. Load balancing hierarchical order (Level 3)
Node 1 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {1, 0, 3, 4, 5, 7}
29. Load balancing hierarchical order (Level 3)
Node 2 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {2, 0, 3, 4, 5, 6, 7}
30. Load balancing hierarchical order (Level 3)
Node 3 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {3, 1, 2, 4, 5, 7}
31. Load balancing hierarchical order (Level 3)
Node 4 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {4, 0, 1, 2, 3, 5, 6}
32. Load balancing hierarchical order (Level 3)
Node 5 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {5, 1, 2, 3, 4, 7}
33. Load balancing hierarchical order (Level 3)
Node 6 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {6, 0, 2, 4, 7}
34. Load balancing hierarchical order (Level 3)
Node 7 balances with the nodes one hop away and steals threads from the heaviest node.
Nodes in current domain: {7, 1, 2, 3, 5, 6}
35. Load balancing hierarchical order (Level 4)
The first scheduling group is constructed by picking the first core's node (Node 0) and the nodes directly connected to it.
The second scheduling group is constructed by picking the first node not covered by the first group (Node 3) and the nodes directly connected to it.
Scheduling Domains: all nodes. Number of SDs: 1. Scheduling Groups: sets of directly-connected nodes
36. Bug 1: Group Imbalance
• When a core tries to steal work from another SG, it compares the average load of the SGs instead of looking at each individual core
• Load is only transferred if the average load of the target SG > that of the current SG
• Averages don't account for spread
37. Bug 1 Example Scenario
(Each of the four cores is currently running one thread; the runqueues below hold the waiting threads.)
NUMA node 0:
Core 0 runqueue: empty. Total load = 0
Core 1 runqueue: A (load 1000). Total load = 1000
Average load = 500 -> "balanced"
NUMA node 1:
Core 2 runqueue: B, C, D, E (load 125 each). Total load = 500
Core 3 runqueue: F, G, H, I (load 125 each). Total load = 500
Average load = 500 -> "balanced"
The loads of the individual runqueues are unbalanced -> averages do not tell the true story
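A minimal sketch of the comparison above, plus the alternative the paper's fix uses (comparing minimum per-core loads instead of averages, so an idle core anywhere becomes visible):

```c
#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

int main(void) {
    double node0[] = {0, 1000};    /* Core 0 and Core 1 runqueue loads */
    double node1[] = {500, 500};   /* Core 2 and Core 3 runqueue loads */

    /* Comparing averages: 500 vs 500, so no load is transferred. */
    double avg0 = (node0[0] + node0[1]) / 2;
    double avg1 = (node1[0] + node1[1]) / 2;
    printf("averages: %.0f vs %.0f -> looks balanced\n", avg0, avg1);

    /* Comparing minimums instead: 0 vs 500 exposes the idle Core 0. */
    double min0 = min2(node0[0], node0[1]);
    double min1 = min2(node1[0], node1[1]);
    if (min0 < min1)
        printf("minimums: %.0f vs %.0f -> node 0 has spare capacity, "
               "steal from node 1\n", min0, min1);
    return 0;
}
```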
41. Bug 1: Actual Scenario
• 1 lighter-load "make" process with 64 threads
• 2 heavier-load "R" processes of 1 thread each
• The heavy R threads run on cores in nodes 0 and 4, skewing those nodes' average loads upward
• The other cores in nodes 0 and 4 are thus underloaded
• The other nodes are overloaded
42. Bug 1: Solution Results
• The "make" process sped up by 13%
• No impact on the R threads
(Visualisation: per-core load, before vs after the fix)
43. Bug 2: Scheduling Group Construction
• Occurs with core pinning: running programs on a certain subset of cores
• No load balancing happens when threads are pinned on nodes two hops apart
44. Bug 2: Actual scenario
An application is pinned on nodes 1 and 2 (two nodes that are two hops apart).
45. Bug 2: Actual scenario
1. The app is started and spawns multiple threads on the first core (Core 16) of Node 2.
48. Bug 2: Actual scenario
3. Load is balanced across the nodes one hop away from Node 2, but cannot be transferred to them because of the core pinning. Load has not yet been transferred to Node 1.
Scheduling Domain: nodes directly connected to Node 2
Scheduling Groups: {2}, {0}, {3}, {4}, {5}, {6}, {7}
49. Bug 2: Actual scenario
4. Node 2's threads cannot be stolen by Node 1: the two nodes appear together in both scheduling groups, which also have the same average loads.
Cause: the scheduling groups at this level are constructed from the perspective of Core/Node 0
Scheduling Domains: all nodes in the machine
Scheduling Groups: {0, 1, 2, 4, 6}, {1, 2, 3, 4, 5, 7}
51. Bug 2: Solution
Construct the other SG from the perspective of a "leader" node (Node 1) that is not in the previous SG.
(Scheduling domain of Node 1: {1, 0, 3, 4, 5, 7})
52. Bug 2: Solution
(Scheduling domains: {1, 0, 3, 4, 5, 7} and {2, 0, 3, 4, 5, 6, 7})
Nodes 1 and 2 are now in different scheduling groups, so Node 1 can steal load from Node 2.
53. New Level 4 balancing situation (node leader 0 vs node leader 3)
First SG: Node 0 and its directly-connected nodes; second SG: the first uncovered node (Node 3) and its directly-connected nodes.
Scheduling Domains: all nodes. Number of SDs: 8 (one per leader node). Scheduling Groups: sets of directly-connected nodes
54. New Level 4 balancing situation (node leader 1 vs node leader 2)
First SG: Node 1 and its directly-connected nodes; second SG: the first uncovered node (Node 2) and its directly-connected nodes.
55. New Level 4 balancing situation (node leader 2 vs node leader 1)
First SG: Node 2 and its directly-connected nodes; second SG: the first uncovered node (Node 1) and its directly-connected nodes.
56. New Level 4 balancing situation (node leader 3 vs node leader 0)
First SG: Node 3 and its directly-connected nodes; second SG: the first uncovered node (Node 0) and its directly-connected nodes.
57. New Level 4 balancing situation (node leader 4 vs node leader 7)
First SG: Node 4 and its directly-connected nodes; second SG: the first uncovered node (Node 7) and its directly-connected nodes.
58. New Level 4 balancing situation (node leader 5 vs node leader 0)
First SG: Node 5 and its directly-connected nodes; second SG: the first uncovered node (Node 0) and its directly-connected nodes.
59. New Level 4 balancing situation (node leader 6 vs node leader 1)
First SG: Node 6 and its directly-connected nodes; second SG: the first uncovered node (Node 1) and its directly-connected nodes.
60. New Level 4 balancing situation (node leader 7 vs node leader 0)
First SG: Node 7 and its directly-connected nodes; second SG: the first uncovered node (Node 0) and its directly-connected nodes.
61. Bug 2: Solution and Results
• Construct scheduling groups from the perspective of each core (see the sketch below)
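A C sketch of that construction. The adjacency matrix is reconstructed from the per-node domains on slides 27-34, and the rule (a leader node plus its one-hop neighbours, then the first uncovered node plus its neighbours) follows slides 35 and 53-60. Run from Node 1's perspective it reproduces the groups on slide 52:

```c
#include <stdio.h>

#define N 8

/* adj[a][b] = 1 if nodes a and b are one hop apart; reconstructed
 * from the per-node domains on slides 27-34. */
static const int adj[N][N] = {
    /* 0 */ {0,1,1,0,1,0,1,0},
    /* 1 */ {1,0,0,1,1,1,0,1},
    /* 2 */ {1,0,0,1,1,1,1,1},
    /* 3 */ {0,1,1,0,1,1,0,1},
    /* 4 */ {1,1,1,1,0,1,1,0},
    /* 5 */ {0,1,1,1,1,0,0,1},
    /* 6 */ {1,0,1,0,1,0,0,1},
    /* 7 */ {0,1,1,1,0,1,1,0},
};

/* Build the top-level scheduling groups from `leader`'s perspective:
 * each group is a seed node plus its one-hop neighbours, seeded first
 * by the leader, then by the first node not yet covered. */
static void build_groups(int leader) {
    int covered[N] = {0};
    for (int g = 0; ; g++) {
        int seed = (g == 0) ? leader : -1;
        for (int i = 0; seed < 0 && i < N; i++)
            if (!covered[i])
                seed = i;
        if (seed < 0)
            break;                    /* every node is covered */
        printf("group %d: {%d", g, seed);
        covered[seed] = 1;
        for (int i = 0; i < N; i++)
            if (adj[seed][i]) { printf(", %d", i); covered[i] = 1; }
        printf("}\n");
    }
}

int main(void) {
    build_groups(1);   /* Node 1's view: the two groups on slide 52 */
    return 0;
}
```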
62. Bug 3: Overload-on-Wakeup
• Scenario:
1. Thread A is running on a core in Node X
2. Thread A sleeps
3. Node X gets busy
4. The sleeping thread A wakes up
5. The scheduler only wakes it up on a core in Node X, even if cores in other nodes are idle
• Rationale: maximise cache reuse
(Diagram: heavy threads pile up on Thread A's old core in Node X; on wakeup, A is queued behind them while other cores run lighter work.)
63. Bug 3: Actual scenario
• 64 worker threads of TPC-H + threads from other processes
• A thread stays on an overloaded core despite the existence of idle cores
64. Bug 3: Solution and results
• Wake the thread up on the core that has been idle the longest (see the sketch below)
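A minimal sketch of that wakeup rule, assuming each core records a timestamp when it goes idle (the array and function names are illustrative, not the kernel's):

```c
#include <stdio.h>

#define NCORES 8

/* idle_since[i] == 0 means core i is busy; otherwise it holds an
 * (illustrative) timestamp of when the core went idle. */
static long idle_since[NCORES];

/* Pick the core that has been idle the longest; -1 if none is idle. */
static int pick_wakeup_core(void) {
    int best = -1;
    for (int i = 0; i < NCORES; i++) {
        if (idle_since[i] == 0)
            continue;                         /* busy core */
        if (best == -1 || idle_since[i] < idle_since[best])
            best = i;                         /* idle for longer */
    }
    return best;
}

int main(void) {
    idle_since[3] = 1000;                     /* idle since t=1000 */
    idle_since[6] = 400;                      /* idle since t=400: longest */
    printf("wake the thread on core %d\n", pick_wakeup_core());
    return 0;
}
```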
65. Bug 4: Missing scheduling domains
• A regression from a code refactoring
Issue: when a core is disabled and then re-enabled using the /proc interface, load balancing between NUMA nodes is no longer performed.
Bug: an incorrect update of a global variable representing the number of scheduling domains (sched_domains) in the machine.
Cause: when a core is disabled, this variable is set to the number of domains inside a single NUMA node. As a consequence, the main scheduling loop (line 1 of Algorithm 1 in the paper, i.e. step 1 of the pseudocode on slide 23) exits earlier than expected.
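A toy model of that failure, with illustrative names (the kernel's actual variables differ): a too-small global level count makes the balancing loop stop before it ever reaches the NUMA-level domains.

```c
#include <stdio.h>

#define ALL_LEVELS  4   /* core pairs, node, one-hop nodes, all nodes */
#define NODE_LEVELS 2   /* the levels that exist inside a single node */

static int nr_domain_levels = ALL_LEVELS;  /* illustrative global */

/* The balancing loop walks domains from the lowest level upward. */
static void load_balance_all(void) {
    for (int lvl = 0; lvl < nr_domain_levels; lvl++)
        printf("  balancing at level %d\n", lvl);
}

/* Buggy hotplug path: recomputes the global from one node's domains
 * only, so the NUMA levels silently disappear. */
static void reenable_core_buggy(void) { nr_domain_levels = NODE_LEVELS; }

/* The fix: regenerate the full scheduling-domain hierarchy. */
static void reenable_core_fixed(void) { nr_domain_levels = ALL_LEVELS; }

int main(void) {
    reenable_core_buggy();
    puts("after buggy re-enable:");
    load_balance_all();             /* never reaches the NUMA levels */
    reenable_core_fixed();
    puts("after fixed re-enable:");
    load_balance_all();
    return 0;
}
```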
66. Bug 4: Actual scenario
• The vertical blue lines represent the cores considered by Core 0 for each (failed) load-balancing call
• There is one load-balancing call every 4 ms
• Core 0 only considers its sibling core and the cores on the same node for load balancing, even though the cores of Node 1 are overloaded
67. Bug 4: Solution and Results
• Fix the regression: regenerate the scheduling domains when a core is re-enabled
68. Lessons learned and possible solutions
• Issues:
• Performance bugs are hard to detect; these lasted for years!
• Visualisation tools are important for identifying such issues
• Scheduling designs/assumptions must adapt to hardware changes
• Newer scheduling algorithms/optimisations keep coming out of research
• Possible long-term solution:
• -> A more modular scheduler instead of a monolithic one
• These bugs, accumulated over the years, led to the decade of wasted cores!