1. Brought to you by
Why User-Mode Threads
Are Good for Performance
Ron Pressler
Architect, Java Platform Group, Oracle
2. Why?
• Why do anything?
• Why do this rather than that?
Copyright © 2022, Oracle and/or its affiliates
4. Concurrency and Parallelism
• Parallelism:
• Speed up a task by splitting it into cooperating subtasks scheduled onto multiple available computing resources.
• Performance measure: latency (time duration)
• Concurrency:
• Schedule available computing resources to multiple largely independent tasks that compete over them.
• Performance measure: throughput (tasks/time unit)
7. Little’s Law
In any stable system, with long-term averages:
λ — arrival rate = exit rate = throughput
W — duration inside
L — no. of items inside
[Diagram: items arrive at rate λ, spend duration W inside the system, and L items are inside at any time]
8. Little’s Law
In any stable system, with long-term averages: L = λW
λ — arrival rate = exit rate = throughput
W — duration inside
L — no. of items inside
9. Little’s Law
L̄ = λ̄W
L̄ — Capacity (long-term average)
λ̄ — throughput (long-term average)
10. Little’s Law
L̄ = λ̄W
L̄ — Capacity
λ (momentary) > λ̄ (long-term average)
11. Little’s Law
L̄ = λ̄W
L̄ — Capacity
λ (momentary) < λ̄ (long-term average)
12. Little’s Law — subsystems
L = λW for the whole system S; for any subsystem S′: L′ = λ′W′
13. Little’s Law — subsystems and queues
L = λW for the whole system S; for a subsystem S′ with a queue in front: L′ = λ′W′
n — average queue length
14. Little’s Law — subsystems and queues
n — average queue length
Enlarging the system to S′ by including the queue:
λ′ = λ
L′ = L + n
W′ = W + n(W/L)
15. Little’s Law — subsystems and queues
n — average queue length
Including the queue: λ′ = λ, L′ = L + n, W′ = W + n(W/L)
Little’s Law still holds for the enlarged system:
L′ = λ′W′
L + n = λ(W + n(W/L))
L(1 + n/L) = λW(1 + n/L)
L = λW
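The derivation above can be sanity-checked numerically; a minimal sketch, with illustrative numbers assumed (they are not from the talk):

```java
// Check that adding a queue of average length n in front of a system
// leaves Little's Law intact (illustrative numbers assumed).
public class LittleQueue {
    public static void main(String[] args) {
        double lambda = 1000;   // arrival rate: 1000 req/s
        double w = 0.010;       // 10 ms spent inside the system
        double l = lambda * w;  // items inside: L = λW
        double n = 5;           // average queue length

        // Enlarged system S′ that includes the queue:
        double lambdaPrime = lambda;      // λ′ = λ (the same flow passes through)
        double lPrime = l + n;            // L′ = L + n
        double wPrime = w + n * (w / l);  // W′ = W + n(W/L), since the wait is n/λ

        // Little's Law must hold for S′ as well: L′ = λ′W′
        System.out.println(lPrime + " vs " + lambdaPrime * wPrime);
    }
}
```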
16. Little’s Law — other applications
Caching:
P(cache hit) = p, P(cache miss) = 1 − p
L1 = pλW1, L2 = (1 − p)λW2
L = L1 + L2 = λ(pW1 + (1 − p)W2)
Interference:
x — interference coefficient
L = λ(1 + xL)W
L = λW / (1 − xλW) = λW(1 + xλW + (xλW)² + …)
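The two applications above, worked with assumed illustrative numbers (hit rate, service times, and interference coefficient are my own examples, not the talk's):

```java
// Worked examples of the caching and interference forms of Little's Law
// (all numbers are assumed for illustration).
public class Applications {
    public static void main(String[] args) {
        double lambda = 1000;  // req/s

        // Caching: a hit rate p blends a fast and a slow service time.
        double p = 0.9, wHit = 0.001, wMiss = 0.020;
        double lCache = lambda * (p * wHit + (1 - p) * wMiss);  // L = λ(pW1 + (1−p)W2)

        // Interference: each in-flight item slows the others by a factor x.
        double w = 0.010, x = 1e-5;
        double lInterf = lambda * w / (1 - x * lambda * w);     // L = λW / (1 − xλW)

        System.out.println(lCache + " items (caching), "
                + lInterf + " items (interference)");
    }
}
```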
17. Capacity and throughput
λ̄ = L̄ / W
L̄ — Capacity
λ̄ — Maximum throughput
W — Average latency
18. CPU capacity
λ̄ = L̄_CPU / W_CPU
L̄_CPU — #cores
λ̄ — Maximum throughput
W_CPU — Average CPU consumption per request
Example: W_CPU = 100μs = 1×10⁻⁴ s (> 1500 cache misses), L̄_CPU = 30 cores ⟹ λ̄ = 300,000 req/s
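The example is just the capacity formula as arithmetic; a minimal sketch:

```java
// Maximum throughput when the CPU is the bottleneck: λ̄ = L̄_CPU / W_CPU,
// using the 30-core, 100 μs-per-request example above.
public class CpuCapacity {
    public static void main(String[] args) {
        int cores = 30;                       // L̄_CPU: the CPU "capacity" is #cores
        double wCpu = 100e-6;                 // W_CPU: 100 μs of CPU work per request
        double maxThroughput = cores / wCpu;  // λ̄ = L̄_CPU / W_CPU
        System.out.printf("%.0f req/s%n", maxThroughput);  // prints "300000 req/s"
    }
}
```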
19. Capacity and throughput
λ̄ = L̄ / W
L̄ — Capacity
λ̄ — Maximum throughput
W — Average latency
[Diagram: a load balancer splits an incoming rate λ into λ/3 across three servers]
20. Thread-per-Request — Thread Capacity
Thread-per-Request: a request consumes a thread (either new or borrowed from a pool) for its duration, W
Request: the domain’s unit of concurrency
Thread: the software unit of concurrency
21. Thread-per-Request — Thread Capacity
#threads = L = λW
22. Thread-per-Request — Thread Capacity, with parallel fanout
c — average fanout, i.e. average number of threads consumed by a request
#threads = cL = cλ(W/c) = λW
⇒ For the purpose of estimating #threads, we can consider W to be the sum of all fanout latencies, even if they’re done in parallel.
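The fanout estimate above can be sketched as arithmetic, with assumed illustrative numbers (the rate, latency, and fanout are not from the talk):

```java
// Estimating thread demand for thread-per-request under parallel fanout:
// c threads per request, each leg taking W/c, still gives λW threads total.
public class ThreadDemand {
    public static void main(String[] args) {
        double lambda = 3000;  // req/s (assumed)
        double w = 0.050;      // 50 ms: sum of all fanout latencies (assumed)
        int c = 4;             // average fanout: threads consumed per request (assumed)

        double inFlight = lambda * (w / c);  // requests in flight if legs run in parallel
        double threads = c * inFlight;       // #threads = cL = cλ(W/c) = λW
        System.out.println(Math.round(threads));  // prints 150
    }
}
```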
23. CPU vs. Thread Capacity with thread-per-request
L̄ = λ̄W = (L̄_CPU / W_CPU)·W = L̄_CPU·(W / W_CPU)
W_CPU = 100μs (> 1500 cache misses), W = 10ms ⟹ 100 threads per core
W_CPU = 50μs (> 800 cache misses), W = 50ms ⟹ 1,000 threads per core
W_CPU = 10μs (> 150 cache misses), W = 100ms ⟹ 10,000 threads per core
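The threads-per-core figures above are simply W / W_CPU; a quick check:

```java
// Threads needed per core to saturate the CPU in the thread-per-request
// style: L̄ / L̄_CPU = W / W_CPU, for the three cases on the slide.
public class ThreadsPerCore {
    public static void main(String[] args) {
        double[][] cases = { {100e-6, 10e-3}, {50e-6, 50e-3}, {10e-6, 100e-3} };
        for (double[] c : cases) {
            double wCpu = c[0], w = c[1];  // CPU time vs. total latency per request
            System.out.printf("W_CPU=%.0fus, W=%.0fms -> %.0f threads/core%n",
                    wCpu * 1e6, w * 1e3, w / wCpu);
        }
    }
}
```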
24. Java Is Made of Threads
• Exceptions
• Thread Locals
• Debugger
• Profiler (JFR)
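Virtual threads (JEP 425) are what make thread-per-request affordable at the capacities computed above; a minimal sketch, assuming JDK 21+ (or JDK 19+ with preview features enabled):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Thread-per-request with virtual threads: one cheap thread per task
// instead of a scarce pooled OS thread.
public class VirtualThreads {
    public static void main(String[] args) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {       // far more tasks than cores
                int request = i;
                executor.submit(() -> handle(request));  // one virtual thread per request
            }
        }  // close() waits for all submitted tasks to complete
    }

    static void handle(int request) {
        // Blocking parks the virtual thread, freeing its carrier OS thread.
        try { Thread.sleep(10); } catch (InterruptedException e) { }
    }
}
```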
28. The Impact of Context Switching
W = n(μ + t), so the throughput ratio vs. free context switches is λ1/λ2 = W2/W1 = n(μ + t) / nμ = 1 + t/μ
where:
n — avg. number of operations (waits)
μ — avg. wait duration
t — avg. context-switch duration
Example: μ = 20μs (quite fast, for a wait), t = 1μs (quite slow, for a switch) ⟹ 5% impact
• Context-switching affects throughput by means of duration, not capacity
• Virtual threads have a faster context-switch than OS threads
• Structured concurrency allows waiting for a set of operations with one context-switch
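The 5% figure follows directly from the formula; a quick check:

```java
// Throughput impact of context switching: each wait of μ pays an extra
// switch cost t, so throughput drops by the factor 1 + t/μ.
public class SwitchImpact {
    public static void main(String[] args) {
        double mu = 20e-6;  // average wait duration (quite fast, as waits go)
        double t = 1e-6;    // average context-switch cost (quite slow, as switches go)
        double factor = 1 + t / mu;  // λ1/λ2 = W2/W1 = 1 + t/μ
        System.out.printf("%.0f%% impact%n", (factor - 1) * 100);  // prints "5% impact"
    }
}
```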
29. Not cooperative, but no time-sharing yet
• Non-cooperative scheduling is more composable…
• … but people overestimate the importance of time-sharing in servers
• Whether time-sharing can be effective when #threads is #cores×10⁵ requires study
30. Summary
• Virtual threads allow higher throughput for the thread-per-request style — the style harmonious with the platform — by drastically increasing the request capacity of the server.
• We can juggle more balls not by adding hands but by enlarging the arc.
• Context-switching costs could be important, but they aren’t the main reason for the throughput increase.
31. Brought to you by
Ron Pressler
ron.pressler@oracle.com
@pressron
• JEP 425: Virtual Threads (Preview)
• The Design of User Mode Threads in Java [video] (why not async/await)