The Linux Scheduler:
a Decade of Wasted Cores
Authored by:
1. Jean-Pierre Lozi (Université Nice Sophia Antipolis)
2. Baptiste Lepers (École Polytechnique Fédérale de Lausanne)
3. Justin Funston (University of British Columbia)
4. Fabien Gaud (Coho Data)
5. Vivien Quéma (Grenoble Institute of Technology)
6. Alexandra Fedorova (University of British Columbia)
EuroSys Conference (18 – 21 April 2016)
Paper: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
Reference Slides: http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf
Reference summary: https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
1
Papers We Love #20 (30 May 2016) By: Yeo Kheng Meng (yeokm1@gmail.com)
This presentation is best viewed with the
animations enabled
2
Some history
3
• Everybody wants ↑ CPU performance
• Before 2004:
• ↓ transistor size
• ↓ power of each transistor (Dennard Scaling)
• ↑ CPU frequency -> ↑ CPU performance
• ~2005-2007 to present:
• End of Dennard Scaling
• Increased use of multicores to ↑ CPU performance
• But did Linux properly take advantage of these cores?
Objective/Invariant of a Linux scheduler
• Balance load evenly across CPU cores to maximise resource use
• No CPU core should be idle while other cores have waiting threads
4
Test setup
• AMD Bulldozer Opteron 6272 (Socket G34) + 512GB RAM
• 8 NUMA nodes x 8 cores each (64 in total)
• NUMA: Non-Uniform Memory Access
• Cores access their own node's local memory faster than memory on other (foreign) nodes
• Each Opteron NUMA node also has faster access to its own last-level (L3) cache
• Total RAM split into 64GB chunks, one per node
• Linux Kernel up to 4.3
• TPC-H benchmark
• Transaction Processing Performance Council
• TPC-H: Complex database queries and data modification
5
http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/
What is the problem?
• Idle cores exist despite other cores being overloaded
• Performance bugs in the Linux Kernel
6
What are the bugs?
1. Group imbalance (Mar 2011)
2. Scheduling Group Construction (Nov 2013)
3. Overload-on-Wakeup (Dec 2009)
4. Missing Scheduling Domains (Feb 2015)
7
First some concepts
8
• Thread weight
• Higher priority -> Higher weight
• Decided by Linux
• Timeslice:
• Time allocated for each thread to run on the CPU within a given time interval
• CPU time is divided in proportion to each thread's weight (see the sketch after this list)
• Runtime:
• Cumulative time the thread has spent on the CPU
• Once runtime > timeslice, the thread is preempted
• Runqueue
• Queue of threads waiting to be executed by CPU
• Queue sorted by runtime
• Implemented as red-black tree
• Completely Fair Scheduler (CFS)
• Linux’s scheduler, which schedules threads using Weighted Fair Queuing
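To make the timeslice idea above concrete, here is a minimal C sketch (not the kernel code; the thread names and weights are the made-up values used on the next slides) that divides a 1-second interval among threads in proportion to their weights:

#include <stdio.h>

struct thread { const char *name; int weight; };

int main(void) {
    /* example threads and weights from the CFS slides that follow */
    struct thread threads[] = {
        {"A", 10}, {"B", 20}, {"C", 40}, {"D", 50}, {"E", 80}
    };
    const double interval = 1.0;               /* scheduling interval in seconds */
    size_t n = sizeof threads / sizeof threads[0];
    int total = 0;

    for (size_t i = 0; i < n; i++)
        total += threads[i].weight;            /* total weight = 200 */

    for (size_t i = 0; i < n; i++) {
        /* timeslice = (weight / total weight) * interval */
        double slice = (double)threads[i].weight / total * interval;
        printf("Thread %s: weight %2d -> timeslice %.2f s\n",
               threads[i].name, threads[i].weight, slice);
    }
    return 0;
}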
Completely Fair Scheduler (Single-Core)
9
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 200 * 1 | 0.05
B | 20 | 20 / 200 * 1 | 0.10
C | 40 | 40 / 200 * 1 | 0.20
D | 50 | 50 / 200 * 1 | 0.25
E | 80 | 80 / 200 * 1 | 0.40
Total | 200
Runqueue sorted by Runtime: A (0), B (0), C (0), D (0), E (0)
CPU Core: Thread A
Time elapsed (s): 0
Completely Fair Scheduler (Single-Core)
10
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 200 * 1 | 0.05
B | 20 | 20 / 200 * 1 | 0.10
C | 40 | 40 / 200 * 1 | 0.20
D | 50 | 50 / 200 * 1 | 0.25
E | 80 | 80 / 200 * 1 | 0.40
Total | 200
Runqueue sorted by Runtime: B (0), C (0), D (0), E (0)
CPU Core: Thread A -> Thread B
Time elapsed (s): 0 -> 0.05
Completely Fair Scheduler (Single-Core)
11
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 200 * 1 | 0.05
B | 20 | 20 / 200 * 1 | 0.10
C | 40 | 40 / 200 * 1 | 0.20
D | 50 | 50 / 200 * 1 | 0.25
E | 80 | 80 / 200 * 1 | 0.40
Total | 200
Runqueue sorted by Runtime: C (0), D (0), E (0), A (0.05)
CPU Core: Thread B -> Thread C
Time elapsed (s): 0.05 -> 0.15
Completely Fair Scheduler (Single-Core)
12
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 200 * 1 | 0.05
B | 20 | 20 / 200 * 1 | 0.10
C | 40 | 40 / 200 * 1 | 0.20
D | 50 | 50 / 200 * 1 | 0.25
E | 80 | 80 / 200 * 1 | 0.40
Total | 200
Runqueue sorted by Runtime: D (0), E (0), A (0.05), B (0.10)
CPU Core: Thread C -> Thread D
Time elapsed (s): 0.15 -> 0.35
Completely Fair Scheduler (Single-Core)
13
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 200 * 1 | 0.05
B | 20 | 20 / 200 * 1 | 0.10
C | 40 | 40 / 200 * 1 | 0.20
D | 50 | 50 / 200 * 1 | 0.25
E | 80 | 80 / 200 * 1 | 0.40
Total | 200
Runqueue sorted by Runtime: E (0), A (0.05), B (0.10), C (0.20)
CPU Core: Thread D -> Thread E
Time elapsed (s): 0.35 -> 0.60
Completely Fair Scheduler (Single-Core)
14
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 200 * 1 | 0.05
B | 20 | 20 / 200 * 1 | 0.10
C | 40 | 40 / 200 * 1 | 0.20
D | 50 | 50 / 200 * 1 | 0.25
E | 80 | 80 / 200 * 1 | 0.40
Total | 200
Runqueue sorted by Runtime: A (0.05), B (0.10), C (0.20), D (0.25)
CPU Core: Thread E
Time elapsed (s): 0.60 -> 1.00
Completely Fair Scheduler (Single-Core)
15
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 200 * 1 | 0.05
B | 20 | 20 / 200 * 1 | 0.10
C | 40 | 40 / 200 * 1 | 0.20
D | 50 | 50 / 200 * 1 | 0.25
E | 80 | 80 / 200 * 1 | 0.40
Total | 200
Runqueue sorted by Runtime: A (0.05), B (0.10), C (0.20), D (0.25), E (0.40)
CPU Core: (empty)
Time elapsed (s): 1.00
Completely Fair Scheduler (Single-Core)
16
Time interval: 1 second
New thread F (weight 50) joins the runqueue, so timeslices are recomputed with total weight 250.
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 250 * 1 | 0.04
B | 20 | 20 / 250 * 1 | 0.08
C | 40 | 40 / 250 * 1 | 0.16
D | 50 | 50 / 250 * 1 | 0.20
E | 80 | 80 / 250 * 1 | 0.32
F | 50 | 50 / 250 * 1 | 0.20
Total | 250
Runqueue sorted by Runtime: F (0), A (0.05), B (0.10), C (0.20), D (0.25), E (0.40)
CPU Core: Thread F is selected next (lowest runtime)
Time elapsed (s): 0
Completely Fair Scheduler (Single-Core)
17
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 250 * 1 | 0.04
B | 20 | 20 / 250 * 1 | 0.08
C | 40 | 40 / 250 * 1 | 0.16
D | 50 | 50 / 250 * 1 | 0.20
E | 80 | 80 / 250 * 1 | 0.32
F | 50 | 50 / 250 * 1 | 0.20
Total | 250
Runqueue sorted by Runtime: A (0.05), B (0.10), C (0.20), D (0.25), E (0.40)
CPU Core: Thread F -> Thread A
Time elapsed (s): 0 -> 0.20
Completely Fair Scheduler (Single-Core)
18
Time interval: 1 second
Timeslice table (Thread | Weight | (Weight / Total) * Interval | Assigned Timeslice):
A | 10 | 10 / 250 * 1 | 0.04
B | 20 | 20 / 250 * 1 | 0.08
C | 40 | 40 / 250 * 1 | 0.16
D | 50 | 50 / 250 * 1 | 0.20
E | 80 | 80 / 250 * 1 | 0.32
F | 50 | 50 / 250 * 1 | 0.20
Total | 250
Runqueue sorted by Runtime: B (0.10), C (0.20), F (0.20), D (0.25), E (0.40)
CPU Core: Thread A -> Thread B
Time elapsed (s): 0.20 -> 0.24
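A minimal C sketch of the single-core walkthrough above (not the kernel code): always run the thread with the smallest accumulated runtime and requeue it once its timeslice is used up. The real CFS keeps runnable threads in a red-black tree keyed on runtime; a linear scan over the five example threads stands in for it here.

#include <stdio.h>

#define NTHREADS 5

int main(void) {
    const char *name[NTHREADS] = {"A", "B", "C", "D", "E"};
    double timeslice[NTHREADS] = {0.05, 0.10, 0.20, 0.25, 0.40};
    double runtime[NTHREADS]   = {0};          /* accumulated CPU time per thread */
    double elapsed = 0.0;

    for (int step = 0; step < NTHREADS; step++) {
        /* pick the "leftmost" thread: the one with the smallest runtime */
        int next = 0;
        for (int i = 1; i < NTHREADS; i++)
            if (runtime[i] < runtime[next])
                next = i;

        /* run it for one full timeslice, then put it back with updated runtime */
        runtime[next] += timeslice[next];
        elapsed       += timeslice[next];
        printf("t=%.2f s: thread %s ran %.2f s (runtime now %.2f s)\n",
               elapsed, name[next], timeslice[next], runtime[next]);
    }
    return 0;
}

Running it reproduces the order A, B, C, D, E with the elapsed times 0.05, 0.15, 0.35, 0.60 and 1.00 s shown on the slides.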
What about multi-cores? (Global runqueue)
19
CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3
Global Runqueue
…
…
…
…
…
Problems
• Context Switching requires access to runqueue
• Only one core can access/manipulate runqueue at any one time
• Other cores must wait to get new threads
What about multi-cores? (Per-core runqueue)
20
CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3
Core 0 Runqueue
…
…
…
…
…
Core 1 Runqueue
…
…
…
…
…
Core 2 Runqueue
…
…
…
…
…
Core 3 Runqueue
…
…
…
…
…
Scheduler Objectives in multi-runqueue system
1. Runqueues have to be load-balanced periodically (every 4ms) and when threads are created or awoken
2. No runqueue should have a high proportion of high-priority threads
3. Should not balance every time there is a change in queue
• Load-balancing is computationally-heavy
• Moving threads across different runqueues will cause cache misses
• DO IT LESS OFTEN, DO IT BETTER
4. No idle cores should be allowed -> Emergency load-balancing when one core goes idle
Naïve runqueue load-balancing algorithms
• Balance runqueues by the same number of threads?
• Ignores thread priority; some threads are more important than others
• Balance runqueues by total thread weight?
• Some high-priority threads can sleep a lot
• Scenario: one sleepy high-priority thread alone in a queue
• -> Waste of CPU resources
21
Core 0 Runqueue: Thread A (W=80, 25%) | Total Weight = 80
Core 1 Runqueue: Thread B (W=25, 60%), Thread C (W=25, 40%), Thread D (W=10, 50%), Thread E (W=20, 50%) | Total Weight = 80
Slightly improved load-balancing algorithm
• Concept of “load”
• load(thread) = thread weight * average % CPU utilisation
• Balance runqueues by total load
22
Before balancing (thread : load):
Core 0 Runqueue: Thread A (W=80, 25%) : 20
Core 1 Runqueue: Thread B (W=25, 60%) : 15, Thread C (W=25, 40%) : 10, Thread D (W=10, 50%) : 5, Thread E (W=20, 50%) : 10

After balancing by total load (thread : load):
Core 0 Runqueue: Thread A (W=80, 25%) : 20, Thread E (W=20, 50%) : 10
Core 1 Runqueue: Thread B (W=25, 60%) : 15, Thread C (W=25, 40%) : 10, Thread D (W=10, 50%) : 5
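A minimal C sketch of the load metric and a single balancing step, using the made-up threads from this slide (not the kernel code). Which of the equal-load threads gets moved is arbitrary: the slide moves Thread E, this sketch happens to move Thread C, and either way both runqueues end up with a total load of 30.

#include <stdio.h>

struct thread { const char *name; int weight; double util; int queue; };

/* load = thread weight * average CPU utilisation */
static double load(const struct thread *t) { return t->weight * t->util; }

int main(void) {
    struct thread threads[] = {
        {"A", 80, 0.25, 0}, {"B", 25, 0.60, 1}, {"C", 25, 0.40, 1},
        {"D", 10, 0.50, 1}, {"E", 20, 0.50, 1},
    };
    int n = sizeof threads / sizeof threads[0];
    double qload[2] = {0, 0};

    for (int i = 0; i < n; i++)
        qload[threads[i].queue] += load(&threads[i]);
    printf("before: core0=%.0f core1=%.0f\n", qload[0], qload[1]);

    /* move the thread from the heavier queue that best evens out the totals */
    int from = qload[0] > qload[1] ? 0 : 1, to = 1 - from;
    int best = -1;
    double best_gap = qload[from] - qload[to];
    for (int i = 0; i < n; i++) {
        if (threads[i].queue != from) continue;
        double l = load(&threads[i]);
        double gap = (qload[from] - l) - (qload[to] + l);
        if (gap < 0) gap = -gap;
        if (gap < best_gap) { best_gap = gap; best = i; }
    }
    if (best >= 0) {
        threads[best].queue = to;
        qload[from] -= load(&threads[best]);
        qload[to]   += load(&threads[best]);
        printf("move thread %s: core0=%.0f core1=%.0f\n",
               threads[best].name, qload[0], qload[1]);
    }
    return 0;
}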
Pseudocode of load-balancing algorithm
23
• Scheduling group (SG) is a subset of a scheduling domain (SD)
• An SG comprises one or more CPU cores
1. For each SD, from the lowest level of the hierarchy to the highest
2. Select the designated core in the SD to run the algorithm (first idle core, or core 0)
11. Compute the average load of every SG in the SD
13. Select the SG with the highest average load
15. If that SG's average load > the current SG's, balance load by stealing work from that SG (see the sketch below)
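A small, self-contained C sketch of the steps above (not the kernel code): one scheduling domain with two scheduling groups of two runqueues each, where the designated core compares average group loads and steals only from a heavier group. The group names and load numbers are made up.

#include <stdio.h>

struct group { const char *name; double rq_load[2]; };

/* lines 11-13: average load of a scheduling group over its runqueues */
static double avg_load(const struct group *g) {
    return (g->rq_load[0] + g->rq_load[1]) / 2.0;
}

int main(void) {
    struct group local  = {"local",  {100.0, 100.0}};
    struct group remote = {"remote", {400.0, 200.0}};

    double local_avg  = avg_load(&local);
    double remote_avg = avg_load(&remote);
    printf("average load: local=%.0f remote=%.0f\n", local_avg, remote_avg);

    /* line 15: steal only if the busiest other group is heavier than ours */
    if (remote_avg > local_avg)
        printf("steal work from the %s group\n", remote.name);
    else
        printf("already balanced, nothing to steal\n");
    return 0;
}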
24
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
balances with
Load balancing hierarchical order
25
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Between pairs of cores
Eg. Core 0 balances with Core 1, Core 2 with Core 3,…, Core 62 with Core 63
Load balancing hierarchical order (Level 1)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: CPU pairs
Number of SDs: 32
Scheduling Groups: CPU cores
26
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First pair of every node balances with other cores in the same node
Load balancing hierarchical order (Level 2)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: NUMA nodes
Number of SDs: 8
Scheduling Groups: CPU pairs
27
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 0 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {0, 1, 2, 4, 6}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
28
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 1 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {1, 0, 3, 4, 5, 7}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
29
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 2 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {2, 0, 3, 4, 5, 6, 7}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
30
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 3 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {3, 1, 2, 4, 5, 7}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
31
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 4 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {4, 0, 1, 2, 3, 5, 6}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
32
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 5 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {5, 1, 2, 3, 4, 7}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
33
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 6 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {6, 0, 2, 4, 7}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
34
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
Node 7 balances with nodes one hop away, steals threads from heaviest node
Nodes in current domain: {7, 1, 2, 3, 5, 6}
Load balancing hierarchical order (Level 3)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Directly-connected nodes
Number of SDs: 8
Scheduling Groups: NUMA nodes
35
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking first core/node 0.
Second scheduling group is constructed by picking first node (Node 3) not covered in first group
Load balancing hierarchical order (Level 4)
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 1
Scheduling Groups: Directly-connected nodes
Bug 1: Group Imbalance
• When a core tries to steal work from
another SG, it compares the average load
of the SG instead of looking at each core.
• Load is only transferred if average load of
target SG > current SG
• Averages don’t account for spread
36
NUMA node 0
NUMA node 1
Bug 1 Example Scenario
37
CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3
Core 0 Runqueue
Total = 0
Core 1 Runqueue
A: Load = 1000
Total = 1000
Core 2 Runqueue
B: Load = 125
C: Load = 125
D: Load = 125
E: Load = 125
Total = 500
Core 3 Runqueue
F: Load = 125
G: Load = 125
H: Load = 125
I : Load = 125
Total = 500
Thread running… Thread running… Thread running… Thread running…
Average Load = 500 | Average Load = 500
Balanced | Balanced
Loads of the individual runqueues are unbalanced -> averages do not tell the true story
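A small C sketch of this scenario using the per-core loads from the slide: comparing the groups' average loads reports "balanced", while comparing the groups' minimum loads (the fix shown on the next slides) exposes the idle core.

#include <stdio.h>

int main(void) {
    double node0[2] = {0.0, 1000.0};    /* runqueue loads of Core 0 and Core 1 */
    double node1[2] = {500.0, 500.0};   /* runqueue loads of Core 2 and Core 3 */

    double avg0 = (node0[0] + node0[1]) / 2, avg1 = (node1[0] + node1[1]) / 2;
    double min0 = node0[0] < node0[1] ? node0[0] : node0[1];
    double min1 = node1[0] < node1[1] ? node1[0] : node1[1];

    printf("averages: node0=%.0f node1=%.0f -> %s\n", avg0, avg1,
           avg1 > avg0 ? "steal from node 1" : "looks balanced, nothing stolen");
    printf("minimums: node0=%.0f node1=%.0f -> %s\n", min0, min1,
           min1 > min0 ? "steal from node 1" : "looks balanced, nothing stolen");
    return 0;
}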
NUMA node 0
Bug 1 Solution: Compare minimum loads
38
NUMA node 1
CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3
Core 0 Runqueue Core 1 Runqueue
A: Load = 1000
Core 2 Runqueue
B: Load = 125
C: Load = 125
D: Load = 125
E: Load = 125
Core 3 Runqueue
F: Load = 125
G: Load = 125
H: Load = 125
I : Load = 125
Thread running… Thread running… Thread running… Thread running…
Minimum load = 0 Minimum load = 500
NUMA node 0
Bug 1 Solution: Compare minimum loads
39
NUMA node 1
CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3
Core 0 Runqueue
D: Load = 125
E: Load = 125
Core 1 Runqueue
A: Load = 1000
Core 2 Runqueue
B: Load = 125
C: Load = 125
Core 3 Runqueue
F: Load = 125
G: Load = 125
H: Load = 125
I : Load = 125
Thread running… Thread running… Thread running… Thread running…
Minimum load = 250 Minimum load = 250
NUMA node 0
Bug 1 Solution: Compare minimum loads
40
NUMA node 1
CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3
Core 0 Runqueue
D: Load = 125
E: Load = 125
Core 1 Runqueue
A: Load = 1000
Core 2 Runqueue
B: Load = 125
C: Load = 125
I : Load = 125
Core 3 Runqueue
F: Load = 125
G: Load = 125
H: Load = 125
Thread running… Thread running… Thread running… Thread running…
Bug 1: Actual Scenario
• 1 lighter load “make” process with 64 threads
• 2 heavier load “R” processes of 1 thread each
• The heavier R threads run on cores in nodes 0 and 4, skewing up those nodes' average loads
• Other cores in nodes 0 and 4 are thus left underloaded
• Cores on the other nodes are overloaded
41
Bug 1 Solution Results
• Speed of “make” process increased by 13%
• No impact on the R threads
42
Before vs After
Bug 2: Scheduling Group Construction
• Occurs with core pinning: running programs on a certain subset of cores
• No load balancing happens when threads are pinned on nodes 2 hops apart
43
44
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Actual scenario
An application is pinned on nodes 1 and 2.
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
45
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Bug 2: Actual scenario
1. App is started and spawns multiple threads on first core (Core 16) on Node 2
46
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Actual scenario
2. Load is balanced across first pair
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Core 16-17 pair
Scheduling Groups: Cores {16}, {17}
47
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Actual scenario
3. Load is balanced across the entire node
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Node 2
Scheduling Groups: Cores {16, 17}, {18, 19}, {20, 21}, {22, 23}
48
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Actual scenario
4. Load is balanced across nodes one hop away, but it cannot be transferred due to core pinning.
Load is not transferred to Node 1 yet.
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: Nodes directly connected to Node 2
Scheduling Groups: Nodes {2}, {0}, {3}, {4}, {5}, {6}, {7}
49
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Actual scenario
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
5. Node 2’s threads cannot be stolen by Node 1: both nodes appear in both scheduling groups, so the groups have the same average load.
Cause: Scheduling Groups at this level are constructed from the perspective of Core/Node 0
Scheduling Domains: All nodes in machine
Scheduling Groups: {0, 1, 2, 4, 6}, {1, 2, 3, 4, 5, 7}
50
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Solution
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Construct the first SG for Level 4 from the perspective of “leader” Node 2
(Scheduling Group: {2, 0, 3, 4, 5, 6, 7})
51
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Solution
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Construct the other SG from the perspective of the other “leader” (Node 1) not covered by the previous SG
(Scheduling Group: {1, 0, 3, 4, 5, 7})
52
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
Bug 2: Solution
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
(Scheduling Groups: {1, 0, 3, 4, 5, 7}, {2, 0, 3, 4, 5, 6, 7})
Nodes 1 and 2 are now in different scheduling groups, so Node 1 can now steal load from Node 2
53
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 0 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 3) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 0 vs Node leader 3
54
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 1 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 2) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 1 vs Node leader 2
55
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 2 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 1) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 2 vs Node leader 1
56
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 3 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 0) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 3 vs Node leader 0
57
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 4 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 7) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 4 vs Node leader 7
58
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 5 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 0) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 5 vs Node leader 0
59
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 6 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 1) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 6 vs Node leader 1
60
Node 0 Node 4 Node 5 Node 1
Node 6 Node 2 Node 3 Node 7
balances with
First scheduling group constructed by picking Node 7 and directly connected nodes.
Second scheduling group is constructed by picking first node (Node 0) not covered in first group
New Level 4 Balancing situation
Scheduling domain hierarchy
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes
Scheduling Domains: All nodes
Number of SDs: 8
Scheduling Groups: Directly-connected nodes
Node leader 7 vs Node leader 0
Bug 2: Solution and Results
• Construct scheduling groups from the perspective of each core/node
61
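A runnable C sketch of the fix (not the kernel code): build the two machine-level scheduling groups from the perspective of a given leader node, using the node adjacency implied by the earlier "one hop away" slides. With leader 0 (the old, fixed-perspective construction) nodes 1 and 2 end up in both groups; with leader 2 they end up in different groups and can balance with each other.

#include <stdio.h>

#define NNODES 8

/* hop1[a][b] = 1 if nodes a and b are one hop apart (taken from the Level 3 slides) */
static const int hop1[NNODES][NNODES] = {
    {0,1,1,0,1,0,1,0},  /* node 0 */
    {1,0,0,1,1,1,0,1},  /* node 1 */
    {1,0,0,1,1,1,1,1},  /* node 2 */
    {0,1,1,0,1,1,0,1},  /* node 3 */
    {1,1,1,1,0,1,1,0},  /* node 4 */
    {0,1,1,1,1,0,0,1},  /* node 5 */
    {1,0,1,0,1,0,0,1},  /* node 6 */
    {0,1,1,1,0,1,1,0},  /* node 7 */
};

static void build_groups(int leader, int group[2][NNODES]) {
    int covered[NNODES] = {0};

    /* first group: the leader node plus every node one hop away from it */
    for (int n = 0; n < NNODES; n++)
        if (n == leader || hop1[leader][n])
            group[0][n] = covered[n] = 1;

    /* second group: the first uncovered node plus its one-hop neighbours */
    for (int n = 0; n < NNODES; n++)
        if (!covered[n]) {
            for (int m = 0; m < NNODES; m++)
                group[1][m] = (m == n || hop1[n][m]);
            break;
        }
}

static void print_groups(int leader) {
    int group[2][NNODES] = {{0}};
    build_groups(leader, group);
    printf("leader node %d:", leader);
    for (int g = 0; g < 2; g++) {
        printf(" {");
        for (int n = 0; n < NNODES; n++)
            if (group[g][n])
                printf(" %d", n);
        printf(" }");
    }
    printf("\n");
}

int main(void) {
    print_groups(0);  /* old code: groups always built from node 0's perspective   */
    print_groups(2);  /* fix: node 2's machine-level domain builds them from node 2 */
    return 0;
}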
Bug 3: Overload-on-Wakeup
• Scenario
1. Thread A is running on Node X
2. Thread A goes to sleep on a core in Node X
3. Node X gets busy
4. The sleeping thread A wakes up
5. The scheduler only wakes it up on a core in Node X, even if cores on other nodes are idle
• Rationale: Maximise cache reuse
62
Diagram: a core in Node X runs several heavy threads; Thread A sleeps there, then wakes up on the same overloaded core while other cores run lighter work.
Bug 3: Actual scenario
• 64 worker threads of TPC-H + threads from other processes
• Thread stays on overloaded core despite existence of idle cores
63
Bug 3: Solution and results
• Wake the thread on the core that has been idle the longest
64
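A minimal C sketch of this fix (not the kernel code; the core count and idle timestamps are made up): on wakeup, pick the core that has been idle the longest, and fall back to the old behaviour when no core is idle.

#include <stdio.h>

#define NCORES 4

int main(void) {
    /* idle_since[i] < 0 means core i is busy; otherwise it is the time (s) at
     * which the core went idle -- a smaller value means idle for longer */
    double now = 100.0;
    double idle_since[NCORES] = {-1.0, -1.0, 42.0, 97.0};

    int target = -1;
    for (int i = 0; i < NCORES; i++)
        if (idle_since[i] >= 0 &&
            (target < 0 || idle_since[i] < idle_since[target]))
            target = i;

    if (target >= 0)
        printf("wake thread on core %d (idle for %.0f s)\n",
               target, now - idle_since[target]);
    else
        printf("no idle core: wake on a core of the node where the thread last ran\n");
    return 0;
}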
Bug 4: Missing scheduling domains
65
• Regression from a refactoring process
Issue:
When a core is disabled and then re-enabled using the /proc interface, load balancing
between any NUMA nodes is no longer performed.
Bug:
The bug is due to an incorrect update of a global variable representing the number of
scheduling domains (sched_domains) in the machine.
Cause:
When a core is disabled, this variable is set to the number of domains inside a NUMA node.
As a consequence, the main scheduling loop (line 1 of Algorithm 1) exits earlier than
expected.
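A toy C illustration of the cause described above (not the kernel code): if the global count of scheduling-domain levels is wrongly reset to the number of levels inside a node, the loop of Algorithm 1 stops before it ever reaches the NUMA levels.

#include <stdio.h>

int main(void) {
    const char *level[4] = {"core pair", "node", "one-hop nodes", "all nodes"};
    int nlevels = 4;            /* correct value for this machine */
    int after_bug = 2;          /* value left behind after disabling/re-enabling a core */

    for (int lvl = 0; lvl < after_bug; lvl++)       /* line 1 of Algorithm 1 */
        printf("balancing level: %s\n", level[lvl]);
    printf("levels %d..%d never balanced -> no load balancing between NUMA nodes\n",
           after_bug, nlevels - 1);
    return 0;
}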
Bug 4: Actual scenario
• The vertical blue lines represent the cores considered by Core 0 for each (failed) load balancing call.
• There is one load balancing call every 4ms.
• We can see that Core 0 only considers its sibling core and cores on the same node for load
balancing, even though cores of Node 1 are overloaded.
66
Bug 4: Solution and Results
• Fix the regression: Regenerate Scheduling domains when a core is re-enabled
67
Lessons learned and possible solutions
• Issues:
• Performance bugs hard to detect. Bugs lasted for years!
• Importance of visualisation tools to help identify issues
• Scheduling designs/assumptions must adapt to hardware changes
• Newer scheduling algorithms/optimisations keep coming out of research
• Possible long-term solution:
• -> Increase the modularity of the scheduler instead of keeping it monolithic
• Bugs over the years led to the decade of wasted cores!
68
Mais conteúdo relacionado

Mais procurados

CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016] CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016]
IO Visor Project
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
emBO_Conference
 

Mais procurados (20)

CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016] CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016]
 
SFO15-302: Energy Aware Scheduling: Progress Update
SFO15-302: Energy Aware Scheduling: Progress UpdateSFO15-302: Energy Aware Scheduling: Progress Update
SFO15-302: Energy Aware Scheduling: Progress Update
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
U boot-boot-flow
U boot-boot-flowU boot-boot-flow
U boot-boot-flow
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
Linux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emptionLinux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emption
 
Linux scheduler
Linux schedulerLinux scheduler
Linux scheduler
 
Scheduling in Android
Scheduling in AndroidScheduling in Android
Scheduling in Android
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
 
Linux Kernel I/O Schedulers
Linux Kernel I/O SchedulersLinux Kernel I/O Schedulers
Linux Kernel I/O Schedulers
 
ACPI Debugging from Linux Kernel
ACPI Debugging from Linux KernelACPI Debugging from Linux Kernel
ACPI Debugging from Linux Kernel
 
Memory management in Linux kernel
Memory management in Linux kernelMemory management in Linux kernel
Memory management in Linux kernel
 
Linux Kernel Module - For NLKB
Linux Kernel Module - For NLKBLinux Kernel Module - For NLKB
Linux Kernel Module - For NLKB
 
Embedded linux network device driver development
Embedded linux network device driver developmentEmbedded linux network device driver development
Embedded linux network device driver development
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
“Vision and AI DSPs for Ultra-High-End and Always-On Applications,” a Present...
“Vision and AI DSPs for Ultra-High-End and Always-On Applications,” a Present...“Vision and AI DSPs for Ultra-High-End and Always-On Applications,” a Present...
“Vision and AI DSPs for Ultra-High-End and Always-On Applications,” a Present...
 
Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introduction
 
Linux PCI device driver
Linux PCI device driverLinux PCI device driver
Linux PCI device driver
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device drivers
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 

Destaque

Docker & Badoo: 
никогда не останавливайся на достигнутом
Docker & Badoo: 
никогда не останавливайся на достигнутомDocker & Badoo: 
никогда не останавливайся на достигнутом
Docker & Badoo: 
никогда не останавливайся на достигнутом
Anton Turetsky
 
Cgroup resource mgmt_v1
Cgroup resource mgmt_v1Cgroup resource mgmt_v1
Cgroup resource mgmt_v1
sprdd
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
Open-NFP
 

Destaque (20)

Linux O(1) Scheduling
Linux O(1) SchedulingLinux O(1) Scheduling
Linux O(1) Scheduling
 
LCU13: Power-efficient scheduling, and the latest news from the kernel summit
LCU13: Power-efficient scheduling, and the latest news from the kernel summitLCU13: Power-efficient scheduling, and the latest news from the kernel summit
LCU13: Power-efficient scheduling, and the latest news from the kernel summit
 
Intro to cluster scheduler for Linux containers
Intro to cluster scheduler for Linux containersIntro to cluster scheduler for Linux containers
Intro to cluster scheduler for Linux containers
 
Scheduling In Linux
Scheduling In LinuxScheduling In Linux
Scheduling In Linux
 
Process scheduling linux
Process scheduling linuxProcess scheduling linux
Process scheduling linux
 
3. CPU virtualization and scheduling
3. CPU virtualization and scheduling3. CPU virtualization and scheduling
3. CPU virtualization and scheduling
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Real Time Operating System
Real Time Operating SystemReal Time Operating System
Real Time Operating System
 
Live model transformations driven by incremental pattern matching
Live model transformations driven by incremental pattern matchingLive model transformations driven by incremental pattern matching
Live model transformations driven by incremental pattern matching
 
Generic and Meta-Transformations for Model Transformation Engineering
Generic and Meta-Transformations for Model Transformation EngineeringGeneric and Meta-Transformations for Model Transformation Engineering
Generic and Meta-Transformations for Model Transformation Engineering
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
 
Operating system 11.10.2016 adarsh bang
Operating system 11.10.2016 adarsh bangOperating system 11.10.2016 adarsh bang
Operating system 11.10.2016 adarsh bang
 
Jireh ict
Jireh ictJireh ict
Jireh ict
 
Docker & Badoo: 
никогда не останавливайся на достигнутом
Docker & Badoo: 
никогда не останавливайся на достигнутомDocker & Badoo: 
никогда не останавливайся на достигнутом
Docker & Badoo: 
никогда не останавливайся на достигнутом
 
React native
React nativeReact native
React native
 
Cgroup resource mgmt_v1
Cgroup resource mgmt_v1Cgroup resource mgmt_v1
Cgroup resource mgmt_v1
 
Non-Uniform Memory Access ( NUMA)
Non-Uniform Memory Access ( NUMA)Non-Uniform Memory Access ( NUMA)
Non-Uniform Memory Access ( NUMA)
 
Linux container, namespaces & CGroup.
Linux container, namespaces & CGroup. Linux container, namespaces & CGroup.
Linux container, namespaces & CGroup.
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
Executive Information Security Training
Executive Information Security TrainingExecutive Information Security Training
Executive Information Security Training
 

Semelhante a The Linux Scheduler: a Decade of Wasted Cores

101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filters
Acácio Oliveira
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU go
Kristofferson A
 
CIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docx
CIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docxCIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docx
CIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docx
clarebernice
 

Semelhante a The Linux Scheduler: a Decade of Wasted Cores (20)

Run Run Trema Test
Run Run Trema TestRun Run Trema Test
Run Run Trema Test
 
Java concurrency - Thread pools
Java concurrency - Thread poolsJava concurrency - Thread pools
Java concurrency - Thread pools
 
CPU Performance Enhancements
CPU Performance EnhancementsCPU Performance Enhancements
CPU Performance Enhancements
 
OpenStack Tempest and REST API testing
OpenStack Tempest and REST API testingOpenStack Tempest and REST API testing
OpenStack Tempest and REST API testing
 
Accurate Synchronization of EtherCAT Systems Using Distributed Clocks
Accurate Synchronization of EtherCAT Systems Using Distributed ClocksAccurate Synchronization of EtherCAT Systems Using Distributed Clocks
Accurate Synchronization of EtherCAT Systems Using Distributed Clocks
 
Physical design-complete
Physical design-completePhysical design-complete
Physical design-complete
 
Threads and multi threading
Threads and multi threadingThreads and multi threading
Threads and multi threading
 
XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.
XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.
XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Fundamentals of Complete Crash and Hang Memory Dump Analysis (Revision 2)
Fundamentals of Complete Crash and Hang Memory Dump Analysis (Revision 2)Fundamentals of Complete Crash and Hang Memory Dump Analysis (Revision 2)
Fundamentals of Complete Crash and Hang Memory Dump Analysis (Revision 2)
 
GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel Computing
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filters
 
Multi Threading
Multi ThreadingMulti Threading
Multi Threading
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
 
Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Intro To .Net Threads
Intro To .Net ThreadsIntro To .Net Threads
Intro To .Net Threads
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU go
 
CIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docx
CIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docxCIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docx
CIS3110 Winter 2016CIS3110 (Operating Systems) Assig.docx
 
Fundamentals of Complete Crash and Hang Memory Dump Analysis
Fundamentals of Complete Crash and Hang Memory Dump AnalysisFundamentals of Complete Crash and Hang Memory Dump Analysis
Fundamentals of Complete Crash and Hang Memory Dump Analysis
 

Mais de yeokm1

Mais de yeokm1 (20)

I became a Private Pilot and this is my story
I became a Private Pilot and this is my storyI became a Private Pilot and this is my story
I became a Private Pilot and this is my story
 
What's inside a Cessna 172 and flying a light plane
What's inside a Cessna 172 and flying a light planeWhat's inside a Cessna 172 and flying a light plane
What's inside a Cessna 172 and flying a light plane
 
Speaking at Tech meetups/conferences for Junior Devs
Speaking at Tech meetups/conferences for Junior DevsSpeaking at Tech meetups/conferences for Junior Devs
Speaking at Tech meetups/conferences for Junior Devs
 
Reflections on Trusting Trust for Go
Reflections on Trusting Trust for GoReflections on Trusting Trust for Go
Reflections on Trusting Trust for Go
 
Meltdown and Spectre
Meltdown and SpectreMeltdown and Spectre
Meltdown and Spectre
 
Gentoo on a 486
Gentoo on a 486Gentoo on a 486
Gentoo on a 486
 
BLE Localiser (Full) for iOS Dev Scout
BLE Localiser (Full) for iOS Dev ScoutBLE Localiser (Full) for iOS Dev Scout
BLE Localiser (Full) for iOS Dev Scout
 
BLE Localiser for iOS Conf SG 2017
BLE Localiser for iOS Conf SG 2017BLE Localiser for iOS Conf SG 2017
BLE Localiser for iOS Conf SG 2017
 
Repair Kopitiam Specialty Tools (Part 2): Short Circuit Limiter
 Repair Kopitiam Specialty Tools (Part 2): Short Circuit Limiter Repair Kopitiam Specialty Tools (Part 2): Short Circuit Limiter
Repair Kopitiam Specialty Tools (Part 2): Short Circuit Limiter
 
PCB Business Card (Singapore Power)
PCB Business Card (Singapore Power)PCB Business Card (Singapore Power)
PCB Business Card (Singapore Power)
 
SP Auto Door Unlocker
SP Auto Door UnlockerSP Auto Door Unlocker
SP Auto Door Unlocker
 
SP IoT Doorbell
SP IoT DoorbellSP IoT Doorbell
SP IoT Doorbell
 
Distance Machine Locker
Distance Machine LockerDistance Machine Locker
Distance Machine Locker
 
A Science Project: Building a sound card based on the Covox Speech Thing
A Science Project: Building a sound card based on the Covox Speech ThingA Science Project: Building a sound card based on the Covox Speech Thing
A Science Project: Building a sound card based on the Covox Speech Thing
 
A Science Project: Swift Serial Chat
A Science Project: Swift Serial ChatA Science Project: Swift Serial Chat
A Science Project: Swift Serial Chat
 
The slide rule
The slide ruleThe slide rule
The slide rule
 
Windows 3.1 (WFW) on vintage and modern hardware
Windows 3.1 (WFW) on vintage and modern hardwareWindows 3.1 (WFW) on vintage and modern hardware
Windows 3.1 (WFW) on vintage and modern hardware
 
Repair Kopitiam Circuit Breaker Training
Repair Kopitiam Circuit Breaker TrainingRepair Kopitiam Circuit Breaker Training
Repair Kopitiam Circuit Breaker Training
 
A2: Analog Malicious Hardware
A2: Analog Malicious HardwareA2: Analog Malicious Hardware
A2: Analog Malicious Hardware
 
Getting Started with Raspberry Pi
Getting Started with Raspberry PiGetting Started with Raspberry Pi
Getting Started with Raspberry Pi
 

Último

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

The Linux Scheduler: a Decade of Wasted Cores

  • 1. The Linux Scheduler: a Decade of Wasted CoresAuthored by: 1. Jean-Pierre Lozi (Université Nice Sophia Antipolis) 2. Baptiste Lepers (École Polytechnique Fédérale de Lausanne) 3. Justin Funston (University of British Columbia) 4. Fabien Gaud (Coho Data) 5. Vivien Quéma (Grenoble Institute of Technology) 6. Alexandra Fedorova (University of British Columbia) Eurosys Conference (18 – 21 April 2016) Paper Paper: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf Reference Slides: http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf Reference summary: https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/ 1 Papers We Love #20 (30 May 2016) By: Yeo Kheng Meng (yeokm1@gmail.com)
  • 2. This presentation is best viewed with the animations enabled 2
  • 3. Some history 3 • Everybody wants ↑ CPU performance • Before 2004: • ↓ transistor size • ↓ power of each transistor (Dennard Scaling) • ↑ CPU frequency -> ↑ CPU performance • ~2005-2007 to present: • End of Dennard Scaling • Increased use of multicores to ↑ CPU performance • But did Linux properly take advantage of these cores?
  • 4. Objective/Invariant of a Linux scheduler • Load balance evenly on CPU cores to maximise resources • No idle CPU cores if some cores have waiting threads 4
  • 5. Test setup • AMD Bulldozer Opteron 6272 (Socket G34) + 512GB RAM • 8 NUMA nodes x 8 core (64 threads) • NUMA: Non Uniform Memory Access • Cores within nodes have faster access to local memory closer to them compared to foreign memory • Each Opteron NUMA node has faster access to its last-level (L3) cache • Total RAM spit into 64GB RAM chunks among nodes • Linux Kernel up to 4.3 • TPC-H benchmark • Transaction Processing Performance Council • TPC-H: Complex database queries and data modification 5 http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/
  • 6. What is the problem? • Idle cores exist despite other cores being overloaded • Performance bugs in the Linux Kernel 6
  • 7. What are the bugs? 1. Group imbalance (Mar 2011) 2. Scheduling Group Construction (Nov 2013) 3. Overload-on-Wakeup (Dec 2009) 4. Missing Scheduling Domains (Feb 2015) 7
  • 8. First some concepts 8 • Thread weight • Higher priority -> Higher weight • Decided by Linux • Timeslice: • Time allocated for each thread to run on CPU in a certain time interval/timeout • CPU cycles divided in proportion to thread’s weight • Runtime: • Accumulative thread time on CPU. • Once runtime > timeslice, thread is preempted. • Runqueue • Queue of threads waiting to be executed by CPU • Queue sorted by runtime • Implemented as red-black tree • Completely Fair Scheduler (CFS) • Linux’s scheduler based on Weighted Fair Queuing to schedule threads
  • 9. Completely Fair Scheduler (Single-Core) 9 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 200 * 1 0.05 B 20 20 / 200 * 1 0.10 C 40 40 / 200 * 1 0.20 D 50 50 / 200 * 1 0.25 E 80 80 / 200 * 1 0.40 Total 200 Time interval: 1 second CPU Core Time elapsed (s) Sorted threads Runtime A 0 B 0 C 0 D 0 E 0 Runqueue sorted by Runtime Thread A 0
  • 10. Completely Fair Scheduler (Single-Core) 10 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 200 * 1 0.05 B 20 20 / 200 * 1 0.10 C 40 40 / 200 * 1 0.20 D 50 50 / 200 * 1 0.25 E 80 80 / 200 * 1 0.40 Total 200 Time interval: 1 second CPU Core Time elapsed (s) Sorted threads Runtime B 0 C 0 D 0 E 0 Runqueue sorted by Runtime Thread A Thread B 0.050
  • 11. Completely Fair Scheduler (Single-Core) 11 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 200 * 1 0.05 B 20 20 / 200 * 1 0.10 C 40 40 / 200 * 1 0.20 D 50 50 / 200 * 1 0.25 E 80 80 / 200 * 1 0.40 Total 200 Time interval: 1 second Sorted threads Runtime C 0 D 0 E 0 A 0.05 Runqueue sorted by Runtime CPU Core Time elapsed (s) Thread B Thread C 0.150.05
  • 12. Completely Fair Scheduler (Single-Core) 12 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 200 * 1 0.05 B 20 20 / 200 * 1 0.10 C 40 40 / 200 * 1 0.20 D 50 50 / 200 * 1 0.25 E 80 80 / 200 * 1 0.40 Total 200 Time interval: 1 second Sorted threads Runtime D 0 E 0 A 0.05 B 0.10 Runqueue sorted by Runtime CPU Core Time elapsed (s) Thread C Thread D 0.350.15
  • 13. Completely Fair Scheduler (Single-Core) 13 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 200 * 1 0.05 B 20 20 / 200 * 1 0.10 C 40 40 / 200 * 1 0.20 D 50 50 / 200 * 1 0.25 E 80 80 / 200 * 1 0.40 Total 200 Time interval: 1 second Sorted threads Runtime E 0 A 0.05 B 0.10 C 0.20 Runqueue sorted by Runtime CPU Core Time elapsed (s) Thread D Thread E 0.600.35
  • 14. Completely Fair Scheduler (Single-Core) 14 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 200 * 1 0.05 B 20 20 / 200 * 1 0.10 C 40 40 / 200 * 1 0.20 D 50 50 / 200 * 1 0.25 E 80 80 / 200 * 1 0.40 Total 200 Time interval: 1 second Sorted threads Runtime A 0.05 B 0.10 C 0.20 D 0.25 Runqueue sorted by Runtime CPU Core Time elapsed (s) Thread E 1.000.60
  • 15. Completely Fair Scheduler (Single-Core) 15 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 200 * 1 0.05 B 20 20 / 200 * 1 0.10 C 40 40 / 200 * 1 0.20 D 50 50 / 200 * 1 0.25 E 80 80 / 200 * 1 0.40 Total 200 Time interval: 1 second Sorted threads Runtime A 0.05 B 0.10 C 0.20 D 0.25 E 0.40 Runqueue sorted by Runtime CPU Core Time elapsed (s) 1.00
  • 16. Completely Fair Scheduler (Single-Core) 16 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 250 * 1 0.04 B 20 20 / 250 * 1 0.08 C 40 40 / 250 * 1 0.16 D 50 50 / 250 * 1 0.20 E 80 80 / 250 * 1 0.32 F 50 50 / 250 * 1 0.20 Total 250 Time interval: 1 second Sorted threads Runtime F 0 A 0.05 B 0.10 C 0.20 D 0.25 E 0.40 Runqueue sorted by Runtime CPU Core Time elapsed (s) 0 Thread F
  • 17. Completely Fair Scheduler (Single-Core) 17 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 250 * 1 0.04 B 20 20 / 250 * 1 0.08 C 40 40 / 250 * 1 0.16 D 50 50 / 250 * 1 0.20 E 80 80 / 250 * 1 0.32 F 50 50 / 250 * 1 0.20 Total 250 Time interval: 1 second Sorted threads Runtime A 0.05 B 0.10 C 0.20 D 0.25 E 0.40 Runqueue sorted by Runtime CPU Core Time elapsed (s) Thread F 0.200 Thread A
  • 18. Completely Fair Scheduler (Single-Core) 18 Thread name Weight Timeslice calculation (Weight / Total) * Interval Assigned Timeslice A 10 10 / 250 * 1 0.04 B 20 20 / 250 * 1 0.08 C 40 40 / 250 * 1 0.16 D 50 50 / 250 * 1 0.20 E 80 80 / 250 * 1 0.32 F 50 50 / 250 * 1 0.20 Total 250 Time interval: 1 second Sorted threads Runtime B 0.10 C 0.20 F 0.20 D 0.25 E 0.40 Runqueue sorted by Runtime CPU Core Time elapsed (s) Thread A 0.240.20 Thread B
  • 19. What about multi-cores? (Global runqueue) 19 CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3 Global Runqueue … … … … … Problems • Context Switching requires access to runqueue • Only one core can access/manipulate runqueue at any one time • Other cores must wait to get new threads
  • 20. What about multi-cores? (Per-core runqueue) 20 CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3 Core 0 Runqueue … … … … … Core 1 Runqueue … … … … … Core 2 Runqueue … … … … … Core 3 Runqueue … … … … … Scheduler Objectives in multi-runqueue system 1. Runqueues has to be periodically load-balanced (4ms) or when threads are added/awoken 2. No runqueue should have high-proportion of high-priority threads 3. Should not balance every time there is a change in queue • Load-balancing is computationally-heavy • Moving threads across different runqueues will cause cache misses • DO IT FEWER, DO IT BETTER 4. No idle cores should be allowed -> Emergency load-balancing when one core goes idle
  • 21. Naïve runqueue load-balancing algorithms • Balance runqueues by same number of threads? • Ignores thread-priority, some threads more important than others • Balance runqueues by thread weights? • Some high priority threads can sleep a lot • Scenario: One sleepy high priority thread in a queue • -> Waste of CPU resources 21 Core 0 Runqueue Thread A (W= 80, 25%) - - - Total Weight = 80 Core 1 Runqueue Thread B (W=25, 60%) Thread C (W=25, 40%) Thread D (W=10, 50%) Thread E (W=20, 50%) Total Weight = 80
  • 22. Slightly improved load-balancing algorithm • Concept of “load” • 𝑙𝑜𝑎𝑑 𝑡ℎ𝑟𝑒𝑎𝑑 = 𝑇ℎ𝑟𝑒𝑎𝑑 𝑊𝑒𝑖𝑔ℎ𝑡 ∗ 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 % 𝐶𝑃𝑈 𝑢𝑡𝑖𝑙𝑖𝑠𝑎𝑡𝑖𝑜𝑛 • Balance runqueues by total load 22 Core 0 Runqueue Thread load Thread A (W=80, 25%) 20 - - - - - - Core 1 Runqueue Thread load Thread B (W=25, 60%) 15 Thread C (W=25, 40%) 10 Thread D (W=10, 50%) 5 Thread E (W=20, 50%) 10 Core 0 Runqueue Thread load Thread A (W=80, 25%) 20 Thread E (W=20, 50%) 10 - - - - Core 1 Runqueue Thread load Thread B (W=25, 60%) 15 Thread C (W=25, 40%) 10 Thread D (W=10, 50%) 5 - -
  • 23. Pseudocode of load-balancing algorithm 23 • Scheduling group (SG) is a subset of Scheduling domain (SD) • SG comprises of CPU core(s) 1. For each SD, from lowest hierarchy to highest 2. Select designated core in SD to run algorithm (first idle core or core 0) 11. Compute average loads of all SG in SD 13. Select SG with highest average load 15. If load of other SG > current SG, balance load by stealing work from the other SG
  • 24. 24 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes balances with Load balancing hierarchal order
  • 25. 25 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Between pairs of cores Eg. Core 0 balances with Core 1, Core 2 with Core 3,…, Core 62 with Core 63 Load balancing hierarchal order (Level 1) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: CPU pairs Number of SDs: 32 Scheduling Groups: CPU cores
  • 26. 26 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First pair of every node balances with others cores in same node Load balancing hierarchal order (Level 2) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: NUMA nodes Number of SDs: 8 Scheduling Groups: CPU pairs
  • 27. 27 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 0 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {0, 1, 2, 4, 6} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 28. 28 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 1 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {1, 0, 3, 4, 5, 7} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 29. 29 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 2 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {2, 0, 3, 4, 5, 6, 7} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 30. 30 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 3 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {3, 1, 2, 4, 5, 7} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 31. 31 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 4 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {4, 0, 1, 2, 3, 5, 6} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 32. 32 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 5 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {5, 1, 2, 3, 4, 7} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 33. 33 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 6 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {6, 0, 2, 4, 7} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 34. 34 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with Node 7 balances with nodes one hop away, steals threads from heaviest node Nodes in current domain: {7, 1, 2, 3, 5, 6} Load balancing hierarchal order (Level 3) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Directly-connected nodes Number of SDs: 8 Scheduling Groups: NUMA nodes
  • 35. 35 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with The first scheduling group is constructed by picking the first core's node (Node 0) and its directly-connected nodes. The second scheduling group is constructed by picking the first node (Node 3) not covered by the first group, plus its directly-connected nodes Load balancing hierarchal order (Level 4) Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 1 Scheduling Groups: Directly-connected nodes
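The construction just described can be sketched in a few lines of Python. The one-hop adjacency table below is reconstructed from the per-node domains shown on the previous slides, so treat it as an assumed rendering of the machine's topology rather than an authoritative description; the greedy loop mirrors the "first node, then first uncovered node" rule.

```python
# Sketch of the original top-level ("all nodes") group construction: start from
# node 0 and repeatedly take the first node not yet covered together with its
# one-hop neighbours. ONE_HOP is an assumption reconstructed from the slides.
ONE_HOP = {
    0: {1, 2, 4, 6},          1: {0, 3, 4, 5, 7},
    2: {0, 3, 4, 5, 6, 7},    3: {1, 2, 4, 5, 7},
    4: {0, 1, 2, 3, 5, 6},    5: {1, 2, 3, 4, 7},
    6: {0, 2, 4, 7},          7: {1, 2, 3, 5, 6},
}

def build_top_level_groups(start_node=0):
    groups, covered, node = [], set(), start_node
    while len(covered) < len(ONE_HOP):
        group = {node} | ONE_HOP[node]       # the node plus its one-hop neighbours
        groups.append(sorted(group))
        covered |= group
        uncovered = [n for n in sorted(ONE_HOP) if n not in covered]
        if uncovered:
            node = uncovered[0]              # next "leader" = first uncovered node
    return groups

print(build_top_level_groups(0))   # [[0, 1, 2, 4, 6], [1, 2, 3, 4, 5, 7]]
```

Note that Nodes 1 and 2 end up in both resulting groups, which will matter for bug 2 (Scheduling Group Construction) later.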
  • 36. Bug 1: Group Imbalance • When a core tries to steal work from another SG, it compares the average load of that SG instead of looking at each of its cores • Load is only transferred if the average load of the target SG > that of the current SG • Averages don't account for the spread across cores (see the sketch after the example below) 36
  • 37. NUMA node 0 NUMA node 1 Bug 1 Example Scenario 37 CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3 Core 0 Runqueue Total = 0 Core 1 Runqueue A: Load = 1000 Total = 1000 Core 2 Runqueue B: Load = 125 C: Load = 125 D: Load = 125 E: Load = 125 Total = 500 Core 3 Runqueue F: Load = 125 G: Load = 125 H: Load = 125 I: Load = 125 Total = 500 Thread running… Thread running… Thread running… Thread running… Average Load = 500 Average Load = 500 Balanced Balanced The loads of the individual runqueues are unbalanced -> averages do not tell the true story
  • 38. NUMA node 0 NUMA node 1 Bug 1 Solution: Compare minimum loads 38 CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3 Core 0 Runqueue Core 1 Runqueue A: Load = 1000 Core 2 Runqueue B: Load = 125 C: Load = 125 D: Load = 125 E: Load = 125 Core 3 Runqueue F: Load = 125 G: Load = 125 H: Load = 125 I: Load = 125 Thread running… Thread running… Thread running… Thread running… Minimum load = 0 Minimum load = 500
  • 39. NUMA node 0 NUMA node 1 Bug 1 Solution: Compare minimum loads 39 CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3 Core 0 Runqueue D: Load = 125 E: Load = 125 Core 1 Runqueue A: Load = 1000 Core 2 Runqueue B: Load = 125 C: Load = 125 Core 3 Runqueue F: Load = 125 G: Load = 125 H: Load = 125 I: Load = 125 Thread running… Thread running… Thread running… Thread running… Minimum load = 250 Minimum load = 250
  • 40. NUMA node 0 NUMA node 1 Bug 1 Solution: Compare minimum loads 40 CPU Core 0 CPU Core 1 CPU Core 2 CPU Core 3 Core 0 Runqueue D: Load = 125 E: Load = 125 Core 1 Runqueue A: Load = 1000 Core 2 Runqueue B: Load = 125 C: Load = 125 I: Load = 125 Core 3 Runqueue F: Load = 125 G: Load = 125 H: Load = 125 Thread running… Thread running… Thread running… Thread running…
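A minimal sketch of the comparison that goes wrong, using the per-core loads from the example above (illustrative numbers only): the buggy balancer compares the groups' average loads, while the fix compares the load of each group's least-loaded core.

```python
# Per-core runqueue loads from the example: node 0 has an idle core plus one
# very heavy thread, node 1 has two moderately loaded cores.
node0 = {"core0": 0, "core1": 1000}
node1 = {"core2": 500, "core3": 500}

def average(group):
    return sum(group.values()) / len(group)

def minimum(group):
    return min(group.values())

# Buggy check: the averages are equal (500 vs 500), so node 0 steals nothing
# and core 0 stays idle.
print("steal by average?", average(node1) > average(node0))   # False

# Fixed check: node 1's least-loaded core (500) is busier than node 0's
# least-loaded core (0), so node 0 steals threads until the minima even out.
print("steal by minimum?", minimum(node1) > minimum(node0))   # True
```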
  • 41. Bug 1: Actual Scenario • 1 lighter-load “make” process with 64 threads • 2 heavier-load “R” processes of 1 thread each • The heavy R threads run on cores in nodes 0 and 4, inflating those nodes' average load • The other cores in nodes 0 and 4 are therefore left underloaded • The other nodes are overloaded 41
  • 42. Bug 1 Solution Results • Speed of the “make” process increased by 13% • No impact on the R threads 42 vs Before After
  • 43. Bug 2: Scheduling Group Construction • Occurs with core pinning: running programs on a certain subset of cores • No load balancing when threads are pinned to nodes 2 hops apart 43
  • 44. 44 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Actual scenario An application is pinned on nodes 1 and 2. Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes
  • 45. 45 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Bug 2: Actual scenario 1. The app is started and spawns multiple threads on the first core (Core 16) of Node 2
  • 46. 46 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Actual scenario 2. Load is balanced across the first pair Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Core 16-17 pair Scheduling Groups: Cores {16}, {17}
  • 47. 47 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Actual scenario 3. Load is balanced across the entire node Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Node 2 Scheduling Groups: Cores {16, 17}, {18, 19}, {20, 21}, {22, 23}
  • 48. 48 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Actual scenario 4. Load balancing runs across the nodes one hop away, but no threads can be transferred there because of the core pinning; Node 1 is not in this domain, so no load reaches it yet Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: Nodes directly connected to Node 2 Scheduling Groups: Nodes {2}, {0}, {3}, {4}, {5}, {6}
  • 49. 49 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Actual scenario Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes 5. Node 2's threads cannot be stolen by Node 1: both nodes belong to both scheduling groups at this level, so the groups' average loads look the same and no load is stolen between them. Cause: the scheduling groups at this level are constructed from the perspective of Core/Node 0 Scheduling Domains: All nodes in the machine Scheduling Groups: {0, 1, 2, 4, 6}, {1, 2, 3, 4, 5, 7}
  • 50. 50 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Solution Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Construct the first SG of Level 4 from the perspective of “leader” Node 2: it spans Node 2 and its directly-connected nodes {2, 0, 3, 4, 5, 6, 7}
  • 51. 51 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Solution Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Construct the other SG from the perspective of the next “leader”, Node 1, the first node not in the previous SG: it spans Node 1 and its directly-connected nodes {1, 0, 3, 4, 5, 7}
  • 52. 52 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 Bug 2: Solution Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Groups: {2, 0, 3, 4, 5, 6, 7}, {1, 0, 3, 4, 5, 7} Nodes 1 and 2 are now in different scheduling groups, so Node 1 can steal load from Node 2
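The essence of the bug and of the fix can be checked with a tiny membership test. The sketch below is illustrative only; the group sets are copied from these slides. The structural point: at this level, load only migrates between scheduling groups, so two nodes can balance with each other only if some group contains one of them but not the other.

```python
def separated(groups, a, b):
    """True if some scheduling group contains one of the two nodes but not the other."""
    return any((a in g) != (b in g) for g in groups)

# Original construction (always from Node 0's perspective): Nodes 1 and 2 are
# members of BOTH groups, so an app pinned to {1, 2} can never be rebalanced.
old_groups = [{0, 1, 2, 4, 6}, {1, 2, 3, 4, 5, 7}]
print(separated(old_groups, 1, 2))    # False -> no balancing between Nodes 1 and 2

# Fixed construction from Node 2's perspective ("leader" Node 2, then Node 1):
# Node 2 sits only in the first group and Node 1 only in the second.
new_groups = [{2, 0, 3, 4, 5, 6, 7}, {1, 0, 3, 4, 5, 7}]
print(separated(new_groups, 1, 2))    # True -> Node 1 can now steal from Node 2
```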
  • 53. 53 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 0 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 3) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 0 vs Node leader 3
  • 54. 54 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 1 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 2) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 1 vs Node leader 2
  • 55. 55 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 2 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 1) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 2 vs Node leader 1
  • 56. 56 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 3 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 0) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 3 vs Node leader 0
  • 57. 57 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 4 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 7) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 4 vs Node leader 7
  • 58. 58 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 5 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 0) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 5 vs Node leader 0
  • 59. 59 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 6 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 1) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 6 vs Node leader 1
  • 60. 60 Node 0 Node 4 Node 5 Node 1 Node 6 Node 2 Node 3 Node 7 balances with First scheduling group constructed by picking Node 7 and directly connected nodes. Second scheduling group is constructed by picking first node (Node 0) not covered in first group New Level 4 Balancing situation Scheduling domain hierarchy 1. 2 cores 2. 1 node 3. Directly-connected nodes 4. All nodes Scheduling Domains: All nodes Number of SDs: 8 Scheduling Groups: Directly-connected nodes Node leader 7 vs Node leader 0
  • 61. Bug 2: Solution and Results • Construct scheduling groups from the perspective of the core running the load balancer, rather than always from Core 0 61
  • 62. Bug 3: Overload-on-Wakeup • Scenario 1. Thread A is running on a core in Node X 2. Thread A goes to sleep on that core 3. Node X gets busy 4. Thread A wakes up 5. The scheduler wakes it only on a core in Node X, even if other nodes have idle cores • Rationale: maximise cache reuse 62 (Diagram: Thread A queued behind several heavy threads on a busy core of Node X, while the other cores run other work)
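A rough sketch of the wakeup placement that causes the problem. The node layout, runqueue lengths, and the "pick the least-loaded core of the home node" rule are all simplifying assumptions for illustration; the real code path is more involved.

```python
# Buggy wakeup path (sketch): only cores of the node where the thread slept are
# considered, so the woken thread can land on a busy core while another node
# has idle cores. Layout and queue lengths are made-up example values.
cores_by_node = {"X": [0, 1], "Y": [2, 3]}
runqueue_len = {0: 3, 1: 4, 2: 0, 3: 0}      # Node X busy, Node Y entirely idle

def wake_up_buggy(home_node):
    candidates = cores_by_node[home_node]            # cache-reuse rationale
    return min(candidates, key=lambda c: runqueue_len[c])

print("thread A woken on core", wake_up_buggy("X"))  # core 0, despite idle cores 2 and 3
```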
  • 63. Bug 3: Actual scenario • 64 worker threads of TPC-H + threads from other processes • Thread stays on overloaded core despite existence of idle cores 63
  • 64. Bug 3: Solution and results • Wake the thread on the core that has been idle for the longest time 64
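A matching sketch of the selection rule in the fix: among the idle cores of the machine, pick the one that has been idle the longest. The timestamps below are made-up values, and the real kernel fix tracks idleness differently, so this only illustrates the rule stated on the slide.

```python
import time

# core id -> time at which the core became idle (None means currently busy).
idle_since = {0: None, 1: None, 2: time.time() - 30.0, 3: time.time() - 5.0}

def pick_wakeup_core(idle_since):
    idle_cores = {c: t for c, t in idle_since.items() if t is not None}
    if not idle_cores:
        return None                  # nobody idle: fall back to the old policy
    # The smallest "idle since" timestamp is the core idle for the longest time.
    return min(idle_cores, key=idle_cores.get)

print("wake thread on core", pick_wakeup_core(idle_since))   # core 2 (idle ~30 s)
```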
  • 65. Bug 4: Missing scheduling domains 65 • A regression introduced during code refactoring Issue: When a core is disabled and then re-enabled using the /proc interface, load balancing between NUMA nodes is no longer performed. Bug: An incorrect update of a global variable representing the number of scheduling domains (sched_domains) in the machine. Cause: When a core is disabled, this variable is set to the number of domains inside a NUMA node, so the main scheduling loop (line 1 of Algorithm 1) exits earlier than expected.
  • 66. Bug 4: Actual scenario • The vertical blue lines represent the cores considered by Core 0 for each (failed) load balancing call. • There is one load balancing call every 4ms. • We can see that Core 0 only considers its sibling core and cores on the same node for load balancing, even though cores of Node 1 are overloaded. 66
  • 67. Bug 4: Solution and Results • Fix the regression: regenerate the scheduling domains when a core is re-enabled 67
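A conceptual Python sketch of the regression and of the fix. The variable name and the list of levels are invented for illustration; the real code manipulates the kernel's scheduling-domain structures. The point is that disabling a core shrinks the number of domain levels the balancing loop walks, and nothing restored it when the core came back online.

```python
# Conceptual sketch of bug 4 (not kernel code): the main balancing loop walks
# the domain hierarchy up to a globally tracked number of levels.
DOMAIN_LEVELS = ["core pair", "node", "directly-connected nodes", "all nodes"]
active_levels = len(DOMAIN_LEVELS)

def disable_core():
    global active_levels
    active_levels = 2                        # only the levels inside one NUMA node remain

def reenable_core_buggy():
    pass                                     # bug: scheduling domains never regenerated

def reenable_core_fixed():
    global active_levels
    active_levels = len(DOMAIN_LEVELS)       # fix: rebuild the full domain hierarchy

def levels_walked_by_balancer():
    return DOMAIN_LEVELS[:active_levels]

disable_core(); reenable_core_buggy()
print(levels_walked_by_balancer())           # ['core pair', 'node'] -> no NUMA balancing
disable_core(); reenable_core_fixed()
print(levels_walked_by_balancer())           # full hierarchy restored
```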
  • 68. Lessons learned and possible solutions • Issues: • Performance bugs are hard to detect: these lasted for years! • Visualisation tools are important for identifying such issues • Scheduling designs and assumptions must adapt to hardware changes • Newer scheduling algorithms and optimisations keep coming out of research • Possible long-term solution: • -> a more modular scheduler instead of a monolithic one • These bugs, accumulated over the years, led to a decade of wasted cores! 68