2. Thank You
✦ Stefan Marr, Mattias De Wael
✦ Presenters
✦ Authors
✦ Program Committee
✦ Co-chair & Organizer: Theo D’Hondt
✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard
✦ Voters
3. Announcements
✦ Program at:
✦ http://soft.vub.ac.be/races/program/
✦ Strict timekeepers
✦ Dinner?
✦ Recording
4. 9:00 Lightning and Welcome
9:10 Unsynchronized Techniques for Approximate Parallel Computing
9:35 Programming with Relaxed Synchronization
9:50 (Relative) Safety Properties for Relaxed Approximate Programs
10:05 Break
10:35 Nondeterminism is unavoidable, but data races are pure evil
11:00 Discussion
11:45 Lunch
1:15 How FIFO is Your Concurrent FIFO Queue?
1:35 The case for relativistic programming
1:55 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
2:15 Does Better Throughput Require Worse Latency?
2:30 Parallel Sorting on a Spatial Computer
2:50 Break
3:25 Dancing with Uncertainty
3:45 Beyond Expert-Only Parallel Programming
4:00 Discussion
4:30 Wrap up
6. Expandable Array
[Diagram: a variable a points to an expandable-array object with fields length = 4, next = 2, and values]
append(o), executed concurrently by two threads:
  c = a;
  i = c.next;
  if (c.length <= i)
    n = expand c;
    a = n; c = n;
  c.values[i] = o;
  c.next = i + 1;
12. Expandable Array (as above): two threads execute append(o) concurrently.
Data Race!
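To make the "Data Race!" concrete, here is a minimal C++ translation of the slide's append(o) (my own sketch, not part of the deck; the fixed 4096-slot array and the two test threads are invented for illustration). Both threads can read the same next, store into the same values slot, and write back the same next + 1, so appends are silently lost.

// Sketch: the unsynchronized append(o) from the slides, run from two threads.
#include <cstdio>
#include <thread>

struct ExpandableArray {
    int length = 4096;        // large enough that 'expand' never triggers here
    int next = 0;
    int values[4096] = {};

    void append(int o) {      // direct translation of the slide's pseudocode
        int i = next;         // both threads may read the same i
        if (length <= i) {
            // n = expand c; a = n; c = n;   (omitted: never reached below)
        }
        values[i] = o;        // the later writer overwrites the earlier one
        next = i + 1;         // both store i + 1: one append disappears
    }
};

int main() {
    ExpandableArray a;
    std::thread t1([&a] { for (int k = 0; k < 1000; ++k) a.append(1); });
    std::thread t2([&a] { for (int k = 0; k < 1000; ++k) a.append(2); });
    t1.join();
    t2.join();
    std::printf("next = %d, expected 2000\n", a.next);   // usually prints less
}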
16. Towards Approximate Computing: Programming with Relaxed Synchronization
Renganarayanan et al., IBM Research, RACES'12, Oct. 21, 2012
[Figure: spectrum from the computing model today toward relaxed synchronization and, ultimately, the human brain]
              Computing model today    Relaxed Synchronization / Human Brain
Computation   Precise                  Less Precise
Data          Accurate                 Less Accurate, less up-to-date, possibly corrupted
Hardware      Reliable                 Variable
18. Nondeterminism is Unavoidable, but Data Races are Pure Evil
Hans-J. Boehm, HP Labs
• Much low-level code is inherently nondeterministic, but
• Data races
  – Are forbidden by C/C++/OpenMP/Posix language standards.
  – May break code now or when you recompile.
  – Don't improve scalability significantly, even if the code still works.
  – Are easily avoidable in C11 & C++11.
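The last bullet refers to the C11/C++11 memory model; the following is a minimal sketch of the idiom it enables (mine, not from the talk). Sharing a plain bool this way would be a data race and undefined behaviour; declaring it std::atomic makes the same handshake well defined.

// Sketch: a race-free publish/consume handshake using C++11 atomics.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> done{false};   // a plain 'bool' here would be a data race

int main() {
    std::thread worker([] {
        // ... produce some result ...
        done.store(true, std::memory_order_release);    // publish it
    });
    while (!done.load(std::memory_order_acquire)) {
        // spin until the worker's store becomes visible
    }
    std::printf("worker finished\n");
    worker.join();
}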
19. How FIFO is Your Concurrent FIFO Queue?
Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer
University of Salzburg
Semantically correct (and therefore "slow") FIFO queues vs. semantically relaxed (and thereby "fast") FIFO queues.
Semantically relaxed FIFO queues can appear more FIFO than semantically correct FIFO queues.
20. A Case for Relativistic Programming
• Alter ordering requirements
(Causal, not Total)
• Don’t Alter correctness requirements
• High performance, Highly scalable
• Easy to program
Philip W. Howard and Jonathan Walpole
22. Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of three in throughput.
Table 1: Hypothetical figures if tradeoff were linear

Version                                          A           B
Core count                                       10          10
Best-possible inter-core latency                 200 µs      200 µs
Mean observed latency in application             1,000 µs    3,000 µs
Normalized latency (observed / best possible)    5           15
App. operations/sec. (1 core)                    1,000       1,000
App. operations/sec. (10 cores)                  2,500       7,500
Normalized throughput (vs. perfect scaling)      0.25        0.75
Latency / Throughput                             20          20
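A quick check of how the normalized rows follow from the definitions above (derived from the table, not stated in the abstract): for Version A, normalized latency = 1,000 µs / 200 µs = 5, normalized throughput = 2,500 / (10 × 1,000) = 0.25, and 5 / 0.25 = 20; for Version B, 3,000 / 200 = 15, 7,500 / 10,000 = 0.75, and 15 / 0.75 = 20.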
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must retry.
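As an illustration of the first two techniques above (a sketch of mine, not from the paper; the shared counter stands in for any single-word shared datum): the locked version makes waiters block until the updater releases the lock, while the lock-free version prepares a new value and publishes it with a compare-and-swap, retrying if another thread changed the word first.

// Sketch: the same increment protected by a mutex, then done lock-free with CAS.
#include <atomic>
#include <mutex>

struct LockedCounter {
    std::mutex m;
    long value = 0;
    void increment() {
        std::lock_guard<std::mutex> guard(m);   // waiters block until release
        ++value;                                // change is visible once unlocked
    }
};

struct LockFreeCounter {
    std::atomic<long> value{0};
    void increment() {
        long old = value.load();
        // Attempt to publish old + 1 atomically; on failure 'old' is reloaded
        // with the current value and the update is retried.
        while (!value.compare_exchange_weak(old, old + 1)) {
        }
    }
};

int main() {
    LockedCounter a;
    LockFreeCounter b;
    a.increment();
    b.increment();
}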
Taking turns, broadcasting changes: Low latency
Dividing into sections, round-robin: High throughput
throughput -> parallel -> distributed/replicated -> latency
David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM
23. Parallel Sorting on a Spatial Computer
Max Orhai, Andrew P. Black
Spatial computing offers insights into:
• the costs and constraints of communication in large parallel computer arrays
• how to design algorithms that respect these costs and constraints