Welcome to
RACES’12
Thank You
✦ Stefan Marr, Mattias De Wael
✦ Presenters
✦ Authors
✦ Program Committee
✦ Co-chair & Organizer: Theo D’Hondt
✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard
✦ Voters
Announcements
✦ Program at: http://soft.vub.ac.be/races/program/
✦ Strict timekeepers
✦ Dinner?
✦ Recording
9:00 Lightning and Welcome
9:10 Unsynchronized Techniques for Approximate Parallel Computing
9:35 Programming with Relaxed Synchronization
9:50 (Relative) Safety Properties for Relaxed Approximate Programs
10:05 Break
10:35 Nondeterminism is unavoidable, but data races are pure evil
11:00 Discussion
11:45 Lunch
1:15 How FIFO is Your Concurrent FIFO Queue?
1:35 The case for relativistic programming
1:55 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
2:15 Does Better Throughput Require Worse Latency?
2:30 Parallel Sorting on a Spatial Computer
2:50 Break
3:25 Dancing with Uncertainty
3:45 Beyond Expert-Only Parallel Programming
4:00 Discussion
4:30 Wrap up
Lightning
Expandable Array
[Figure: an expandable array reached through a shared reference a, with fields length = 4, next = 2, and values; two threads run append(o) on it concurrently.]
append(o)
  c = a;
  i = c.next;
  if (c.length <= i)
    n = expand c;
    a = n; c = n;
  c.values[i] = o;
  c.next = i + 1;
Both threads can read the same next index before either writes it back, so they store into the same slot and one appended element is lost: a data race!
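To make the interleaving concrete, here is a minimal C++ sketch of the slide's unsynchronized append. The Node struct, capacities, and iteration counts are illustrative additions, not from the slide, and the data races are intentional, so the program's behavior is formally undefined; it exists only to show the bug.

// Sketch of the slide's racy append; illustrative names, deliberate data races.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

struct Node {
    int length;                // capacity of values
    int next;                  // index of the next free slot
    std::vector<int> values;
    explicit Node(int cap) : length(cap), next(0), values(cap) {}
};

Node* a = new Node(4);         // shared reference, as on the slide

void append(int o) {           // deliberately unsynchronized, mirroring the pseudocode
    Node* c = a;
    int i = c->next;
    if (c->length <= i) {      // expand c
        Node* n = new Node(c->length * 2);
        std::copy(c->values.begin(), c->values.end(), n->values.begin());
        n->next = c->next;
        a = n;                 // old node leaked intentionally, for brevity
        c = n;
    }
    c->values[i] = o;          // two threads can pick the same i ...
    c->next = i + 1;           // ... and both publish next = i + 1
}

int main() {
    std::thread t1([] { for (int k = 0; k < 100000; ++k) append(1); });
    std::thread t2([] { for (int k = 0; k < 100000; ++k) append(2); });
    t1.join();
    t2.join();
    std::printf("next = %d (200000 appends issued)\n", a->next);
}

Run with two threads, the final next is typically smaller than the number of appends issued, because both threads can read the same slot index before either publishes the increment.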
Towards Approximate Computing: Programming with Relaxed Synchronization
Renganarayanan et al., IBM Research, RACES'12, Oct. 21, 2012
[Figure: a spectrum from the computing model today to the human brain, with relaxed synchronization in between]
             Computing model today    Human Brain
Computation  Precise                  Less precise
Data         Accurate                 Less accurate, less up-to-date, possibly corrupted
Hardware     Reliable                 Variable
(Relative) Safety Properties
for Relaxed Approximate
Programs
Michael Carbin and Martin Rinard
Nondeterminism is Unavoidable, but Data Races are Pure Evil
Hans-J. Boehm, HP Labs
• Much low-level code is inherently nondeterministic, but
• Data races
  – Are forbidden by the C/C++/OpenMP/Posix language standards.
  – May break code now or when you recompile.
  – Don't improve scalability significantly, even if the code still works.
  – Are easily avoidable in C11 & C++11.
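The last bullet refers to the atomics added in C11 and C++11. A minimal C++11 sketch of the usual repair, using a simple publish/consume pattern whose names and values are illustrative, not from the slide:

// C++11: replacing a racy flag with std::atomic removes the data race.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> ready{false};   // atomic flag: concurrent access is well defined
int payload = 0;                  // ordinary data, published via the flag

int main() {
    std::thread producer([] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) {}  // wait for the flag
        std::printf("%d\n", payload);                       // guaranteed to see 42
    });
    producer.join();
    consumer.join();
}

Declaring the flag std::atomic removes the data race, and the release/acquire pair guarantees the consumer sees the payload written before the flag.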
How FIFO is Your Concurrent FIFO Queue?
Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer
University of Salzburg
semantically correct and therefore "slow" FIFO queues
vs.
semantically relaxed and thereby "fast" FIFO queues
Semantically relaxed FIFO queues can appear more FIFO than semantically correct FIFO queues.
A Case for Relativistic Programming
• Alter ordering requirements (causal, not total)
• Don't alter correctness requirements
• High performance, highly scalable
• Easy to program
Philip W. Howard and Jonathan Walpole
Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering
Trey Cain and Mikko Lipasti, IBM Research, RACES'12, Oct. 21, 2012
§ From the RACES website:
  – "an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether."
§ A hardware developer's perspective:
  – Constraints of Legacy Code
    • What if we want to apply this principle, but have no control over the applications that are running on a system?
  – Can one build a coherence protocol that avoids synchronizing cores as much as possible?
    • For example, by allowing each core to use stale versions of cache lines as long as possible
    • While maintaining architectural correctness, i.e. we will not break existing code
    • If we do that, what will happen?
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept. It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.

Table 1: Hypothetical figures if tradeoff were linear
                                                       Version A    Version B
Core count                                             10           10
Best-possible inter-core latency                       200 µs       200 µs
Mean observed latency in application                   1,000 µs     3,000 µs
Normalized latency (observed / best possible)          5            15
App. operations/sec. (1 core)                          1,000        1,000
App. operations/sec. (10 cores)                        2,500        7,500
Normalized throughput (normalized to perfect scaling)  0.25         0.75
Latency / Throughput                                   20           20
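Working the definitions above through the table's own numbers (an illustrative check, not part of the original text): for Version A, normalized latency = 1,000 µs / 200 µs = 5 and normalized throughput = 2,500 / (10 × 1,000) = 0.25, giving a ratio of 5 / 0.25 = 20; for Version B, 3,000 / 200 = 15 and 7,500 / 10,000 = 0.75, again a ratio of 20. The equal ratios are what makes the hypothetical tradeoff linear.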
A Progression of Techniques Trading Throughput for Latency
As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock, can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must retry.
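As a concrete illustration of the two styles just described, here is a minimal C++ sketch applied to a shared counter; this is not the authors' code, only the generic pattern each bullet names.

// Two ways to protect a shared counter: a mutex, and a lock-free CAS loop.
#include <atomic>
#include <mutex>

int counter_locked = 0;
std::mutex counter_mutex;

void increment_with_lock() {
    std::lock_guard<std::mutex> guard(counter_mutex);  // obtain the lock first
    ++counter_locked;                                   // update while holding it
}

std::atomic<int> counter_lockfree{0};

void increment_lock_free() {
    int observed = counter_lockfree.load();
    // Prepare the updated value, then attempt to store it with compare-and-swap;
    // on failure, observed is refreshed with the current value and we retry.
    while (!counter_lockfree.compare_exchange_weak(observed, observed + 1)) {
        // loop until no other thread changed the word in the meantime
    }
}

The mutex version lets waiters see an update as soon as the lock is released; the lock-free version prepares a new value and retries the compare-and-swap until no other thread has changed the word in the meantime.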
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data
structure is organized so that any potential races are
confined to a single word. An updating thread need not
lock the structure in advance. Instead, it prepares an
updated value, then uses an atomic instruction (such as
Compare-And-Swap) to attempt to store the value into
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams,
Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to
multicore programming, pushed along by the changing
trajectory of hardware technology and system architecture,
we are seeing an explosion of techniques for crossing the
chasm between sequential and parallel data structures and
algorithms. In considering a spectrum of techniques for
moderating application access to shared data on multicore
and manycore systems, we have observed that as
application synchronization latency gets closer to hardware
inter-core latency, throughput decreases. The spectrum of
techniques we looked at includes: locks and mutexes, lock-
free approaches based on atomic instructions, RCU, and
(non-deterministic) race-and-repair. Below we present
definitions of our notion of synchronization latency and
throughput, and describe our observation in greater detail.
We conclude by wondering whether there is a fundamental
law relating latency to throughput:
Algorithms that improve application-level throughput
worsen inter-core application-level latency.
We believe that such a law would be of great utility as a
unification that would provide a common perspective from
which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as
follows:
• Throughput is the amount of application-level work
performed in unit time, normalized to the amount of
work that would be accomplished with perfect linear
scaling. In other words, a throughput of 1.0 would be
achieved by a system that performed N times as much
work per unit time with N cores as it did with one core.
This formulation reflects how well an application
exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on
one core to observe a change effected by a thread on
another core, normalized to the best latency possible for
the given platform. This formulation isolates the latency
inherent in the algorithms and data structures from the
latency arising out of the platform (operating system,
processor, storage hierarchy, communication network,
etc.). As an example of algorithm-and-data-structure-
imposed latency, if one chooses to replicate a data
structure, it will take additional time to update the
replicas. The best possible latency for a given platform
can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that
is apparent to an application.
Table 1 presents some fictional numbers in order to
illustrate the concept: It describes two versions of the same
application, A and B, running on a hypothetical system.
The numbers are consistent with a linear version of the
proposed law, because Version B sacrifices a factor of three
in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Table 1: Hypothetical figures
if tradeoff were linear
Version
Core count
Best-possible inter-core
latency
Mean observed latency in
application
Normalized latency
(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput
(normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading
Throughput for Latency
As techniques have evolved for improving performance,
each seems to have offered more throughput at the expense
of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the
simplest method for protecting shared data [1]. In this
style, each thread obtains a shared lock (or mutex) on a
data structure before accessing or modifying it. Latency
is minimized because a waiter will observe any changes
as soon as the updating thread releases the lock.
However, the overhead required to obtain a lock, and the
processing time lost while waiting for a lock can severely
limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working; if it was changed, the updater must retry. (A sketch also appears after this list.)
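To illustrate the lock-based style, here is a minimal sketch of our own (not code from the paper or from [1]) that protects a shared counter with a POSIX mutex; a thread waiting on the lock observes the update as soon as the holder releases it.

#include <pthread.h>
#include <stdio.h>

/* Shared state protected by a single mutex (lock-based style). */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);   /* wait here if another thread holds the lock */
        counter++;                           /* critical section: update shared data       */
        pthread_mutex_unlock(&counter_lock); /* release; waiters now observe the change    */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter); /* 200000 */
    return 0;
}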
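Similarly, a hedged sketch of the lock-free style, using standard C11 atomics rather than any particular library: all potential races are confined to a single atomic word, and a failed compare-and-swap simply retries.

#include <stdatomic.h>
#include <stdio.h>

/* All races are confined to this single atomic word (lock-free style). */
static _Atomic long counter = 0;

static void increment(void) {
    long old = atomic_load(&counter);
    for (;;) {
        long updated = old + 1;  /* prepare the updated value */
        /* Attempt to store it; this succeeds only if no other thread changed
           the word in the meantime. On failure, 'old' is reloaded with the
           current value and we retry. */
        if (atomic_compare_exchange_weak(&counter, &old, updated))
            break;
    }
}

int main(void) {
    for (int i = 0; i < 5; i++)
        increment();
    printf("counter = %ld\n", atomic_load(&counter)); /* 5 */
    return 0;
}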
Taking turns, broadcasting changes: Low latency
Dividing into sections, round-robin: High throughput
throughput -> parallel -> distributed/replicated -> latency
David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM
Saturday 4 May 13
spatial computing offers insights into:
• the costs and constraints of communication in large parallel computer arrays
• how to design algorithms that respect these costs and constraints
parallel sorting on a spatial computer
Max Orhai, Andrew P. Black
Saturday 4 May 13
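For intuition only (this is not Orhai and Black's algorithm, merely a classic nearest-neighbour sort sketched in C), odd-even transposition sort shows what respecting the costs and constraints of communication can look like: every comparison involves only adjacent cells, so data moves at most one position per phase.

#include <stdio.h>

/* Odd-even transposition sort: each phase touches only adjacent pairs,
   so on a spatial array of cells every comparison is a nearest-neighbour
   exchange (sequential simulation shown here). */
static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void odd_even_sort(int v[], int n) {
    for (int phase = 0; phase < n; phase++) {
        /* Even phases compare pairs (0,1),(2,3),...; odd phases (1,2),(3,4),... */
        for (int i = phase % 2; i + 1 < n; i += 2)
            if (v[i] > v[i + 1])
                swap(&v[i], &v[i + 1]);
    }
}

int main(void) {
    int v[] = { 5, 2, 9, 1, 7, 3 };
    int n = sizeof v / sizeof v[0];
    odd_even_sort(v, n);
    for (int i = 0; i < n; i++)
        printf("%d ", v[i]);   /* 1 2 3 5 7 9 */
    printf("\n");
    return 0;
}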
Dancing with Uncertainty
Sasa Misailovic, Stelios Sidiroglou and Martin Rinard
Saturday 4 May 13
Sea Change In Linux-Kernel Parallel Programming
In 2006, Linus Torvalds noted that since 2003, the Linux kernel community's grasp of concurrency had improved to the point that patches were often correct at first submission
Why the improvement?
–Not programming language: C before, during, and after
–Not synchronization primitives: Locking before, during, and after
–Not a change in personnel: Relatively low turnover
–Not born parallel programmers: Remember Big Kernel Lock!
So what was it?
–Stick around for the discussion this afternoon and find out!!!
Paul E. McKenney: Beyond Expert-Only Parallel Programming?
Saturday 4 May 13
Welcome and Lightning Intros

  • 2. Thank You ✦ Stefan Marr, Mattias De Wael ✦ Presenters ✦ Authors ✦ Program Committee ✦ Co-chair & Organizer: Theo D’Hondt ✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard ✦ Voters Saturday 4 May 13
  • 3. Announcements ✦ Program at: ✦ http://soft.vub.ac.be/races/program/ ✦ Strict timekeepers ✦ Dinner? ✦ Recording Saturday 4 May 13
  • 4. 9:00 Lightning and Welcome 9:10 Unsynchronized Techniques for Approximate Parallel Computing 9:35 Programming with Relaxed Synchronization 9:50 (Relative) Safety Properties for Relaxed Approximate Programs 10:05 Break 10:35 Nondeterminism is unavoidable, but data races are pure evil 11:00 Discussion 11:45 Lunch 1:15 How FIFO is Your Concurrent FIFO Queue? 1:35 The case for relativistic programming 1:55 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models 2:15 Does Better Throughput Require Worse Latency? 2:30 Parallel Sorting on a Spatial Computer 2:50 Break 3:25 Dancing with Uncertainty 3:45 Beyond Expert-Only Parallel Programming 4:00 Discussion 4:30 Wrap up Saturday 4 May 13
  • 6. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Saturday 4 May 13
  • 7. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Saturday 4 May 13
  • 8. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Saturday 4 May 13
  • 9. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Saturday 4 May 13
  • 10. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Saturday 4 May 13
  • 11. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Saturday 4 May 13
  • 12. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Data Race! Saturday 4 May 13
  • 13. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Data Race! Saturday 4 May 13
  • 14. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Data Race! Saturday 4 May 13
  • 15. 2 4length next values a Expandable  Array append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; append(o) c = a; i = c.next; if (c.length <= i) n = expand c; a = n; c = n; c.values[i] = o; c.next = i + 1; Data Race! Saturday 4 May 13
  • 16. Hardware Towards Approximate Computing: Programming with Relaxed Synchronization Precise Less Precise Accurate Less Accurate, less up- to-date, possibly corrupted Reliable Variable Computation Data Computing model today Human Brain Relaxed Synchronization Renganarayanan et al, IBM Research, RACES’12, Oct. 21, 2012 Saturday 4 May 13
  • 17. (Relative) Safety Properties for Relaxed Approximate Programs Michael Carbin and Martin Rinard Saturday 4 May 13
  • 18. Nondeterminism  is  Unavoidable, but  Data  Races  are  Pure  Evil Hans-­‐J.  Boehm,  HP  Labs   • Much  low-­‐level  code  is  inherently nondeterminisBc,  but • Data  races –Are  forbidden  by  C/C++/OpenMP/Posix  language   standards. –May  break  code  now  or  when  you  recompile. Data Races –Don’t  improve  scalability  significantly,  even   if  the  code  sBll  works. –Are  easily  avoidable  in  C11  &  C++11. Saturday 4 May 13
  • 19. How FIFO isYour Concurrent FIFO Queue? Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer University of Salzburg semantically correct and therefore “slow” FIFO queues semantically relaxed and thereby “fast” FIFO queues Semantically relaxed FIFO queues can appear more FIFO than semantically correct FIFO queues. vs. Saturday 4 May 13
  • 20. A Case for Relativistic Programming • Alter ordering requirements (Causal, not Total) • Don’t Alter correctness requirements • High performance, Highly scalable • Easy to program Philip W. Howard and Jonathan Walpole Saturday 4 May 13
  • 21. IBM Research © 2012 IBM Corporation1 Cain and Lipasti RACES’12 Oct 21, 2012 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering § From the RACES website: – “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.” § A hardware developer’s perspective: – Constraints of Legacy Code • What if we want to apply this principle, but have no control over the applications that are running on a system? – Can one build a coherence protocol that avoids synchronizing cores as much as possible? • For example by allowing each core to use stale versions of cache lines as long as possible • While maintaining architectural correctness; i.e. we will not break existing code • If we do that, what will happen? Trey Cain and Mikko Lipasti Saturday 4 May 13
  • 22. Does Better Throughput Require Worse Latency? Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Introduction As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock- free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput: Algorithms that improve application-level throughput worsen inter-core application-level latency. We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches. Throughput and Latency For this proposal, we define throughput and latency as follows: • Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores. • Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure- imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application. Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput. Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Version Core count Best-possible inter-core latency Mean observed latency in application Normalized latency (observed / best possible) App-operations/sec. ( 1 core) App.-operations/sec. 
( 10 cores) Normalized throughput (normalized to perfect scaling) Latency / Throughput A B 10 10 200 µs 200 µs 1,000 µs 3,000 µs 5 15 1,000 1,000 2,500 7,500 0.25 0.75 20 20 A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency: • Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput. • Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Introduction As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock- free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput: Algorithms that improve application-level throughput worsen inter-core application-level latency. We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches. Throughput and Latency For this proposal, we define throughput and latency as follows: • Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores. • Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. 
This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure- imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application. Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput. Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Version Core count Best-possible inter-core latency Mean observed latency in application Normalized latency (observed / best possible) App-operations/sec. ( 1 core) App.-operations/sec. ( 10 cores) Normalized throughput (normalized to perfect scaling) Latency / Throughput A B 10 10 200 µs 200 µs 1,000 µs 3,000 µs 5 15 1,000 1,000 2,500 7,500 0.25 0.75 20 20 A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency: • Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput. • Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Introduction As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock- free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. 
Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput: Algorithms that improve application-level throughput worsen inter-core application-level latency. We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches. Throughput and Latency For this proposal, we define throughput and latency as follows: • Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores. • Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure- imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application. Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput. Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Version Core count Best-possible inter-core latency Mean observed latency in application Normalized latency (observed / best possible) App-operations/sec. ( 1 core) App.-operations/sec. ( 10 cores) Normalized throughput (normalized to perfect scaling) Latency / Throughput A B 10 10 200 µs 200 µs 1,000 µs 3,000 µs 5 15 1,000 1,000 2,500 7,500 0.25 0.75 20 20 A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency: • Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput. • Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. 
Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Introduction As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock- free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput: Algorithms that improve application-level throughput worsen inter-core application-level latency. We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches. Throughput and Latency For this proposal, we define throughput and latency as follows: • Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores. • Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure- imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application. Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput. Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Version Core count Best-possible inter-core latency Mean observed latency in application Normalized latency (observed / best possible) App-operations/sec. 
( 1 core) App.-operations/sec. ( 10 cores) Normalized throughput (normalized to perfect scaling) Latency / Throughput A B 10 10 200 µs 200 µs 1,000 µs 3,000 µs 5 15 1,000 1,000 2,500 7,500 0.25 0.75 20 20 A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency: • Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput. • Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Introduction As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock- free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput: Algorithms that improve application-level throughput worsen inter-core application-level latency. We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches. Throughput and Latency For this proposal, we define throughput and latency as follows: • Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores. • Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. 
This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure- imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application. Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput. Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Version Core count Best-possible inter-core latency Mean observed latency in application Normalized latency (observed / best possible) App-operations/sec. ( 1 core) App.-operations/sec. ( 10 cores) Normalized throughput (normalized to perfect scaling) Latency / Throughput A B 10 10 200 µs 200 µs 1,000 µs 3,000 µs 5 15 1,000 1,000 2,500 7,500 0.25 0.75 20 20 A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency: • Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput. • Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Introduction As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock- free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. 
Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput: Algorithms that improve application-level throughput worsen inter-core application-level latency. We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches. Throughput and Latency For this proposal, we define throughput and latency as follows: • Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores. • Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure- imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application. Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput. Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Version Core count Best-possible inter-core latency Mean observed latency in application Normalized latency (observed / best possible) App-operations/sec. ( 1 core) App.-operations/sec. ( 10 cores) Normalized throughput (normalized to perfect scaling) Latency / Throughput A B 10 10 200 µs 200 µs 1,000 µs 3,000 µs 5 15 1,000 1,000 2,500 7,500 0.25 0.75 20 20 A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency: • Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput. • Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. 
Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Introduction As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock- free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput: Algorithms that improve application-level throughput worsen inter-core application-level latency. We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches. Throughput and Latency For this proposal, we define throughput and latency as follows: • Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores. • Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure- imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application. Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput. Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Table 1: Hypothetical figures if tradeoff were linear Version Core count Best-possible inter-core latency Mean observed latency in application Normalized latency (observed / best possible) App-operations/sec. 
Taking turns, broadcasting changes: Low latency
Dividing into sections, round-robin: High throughput
throughput -> parallel -> distributed/replicated -> latency
David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM
Saturday 4 May 13
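For concreteness, the normalized figures in Table 1 above follow directly from the stated definitions; a quick check using the numbers as quoted (10 cores, 200 µs best-possible inter-core latency, 1,000 application operations/sec on one core):

Version A: normalized latency = 1,000 µs / 200 µs = 5; normalized throughput = 2,500 / (10 x 1,000) = 0.25; latency / throughput = 5 / 0.25 = 20.
Version B: normalized latency = 3,000 µs / 200 µs = 15; normalized throughput = 7,500 / (10 x 1,000) = 0.75; latency / throughput = 15 / 0.75 = 20.

The constant ratio of 20 is what makes the hypothetical tradeoff linear: tripling normalized latency buys a tripling of normalized throughput.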
  • 23. spatial computing offers insights into:
    • the costs and constraints of communication in large parallel computer arrays
    • how to design algorithms that respect these costs and constraints
    parallel sorting on a spatial computer
    Max Orhai, Andrew P. Black
    Saturday 4 May 13
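As a generic illustration of an algorithm that respects such constraints (my own sketch in C, not the algorithm from the Orhai and Black talk): odd-even transposition sort uses only compare-exchanges between neighbouring cells, the strictly local communication a spatial computer provides.

#include <stddef.h>

/* Swap neighbouring elements i and i+1 if they are out of order (a purely local step). */
static void compare_exchange(int *a, size_t i)
{
    if (a[i] > a[i + 1]) {
        int tmp = a[i];
        a[i] = a[i + 1];
        a[i + 1] = tmp;
    }
}

/* Odd-even transposition sort: n phases of nearest-neighbour compare-exchanges. */
void odd_even_transposition_sort(int *a, size_t n)
{
    for (size_t phase = 0; phase < n; phase++) {
        /* even phases pair (0,1), (2,3), ...; odd phases pair (1,2), (3,4), ... */
        for (size_t i = phase % 2; i + 1 < n; i += 2)
            compare_exchange(a, i);
    }
}

On a spatial array every compare_exchange within a phase can run in parallel; the sequential loop here only mirrors the data movement.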
  • 24. Dancing with Uncertainty Sasa Misailovic, Stelios Sidiroglou and Martin Rinard Saturday 4 May 13
  • 25. Sea Change In Linux-Kernel Parallel Programming
    In 2006, Linus Torvalds noted that since 2003, the Linux kernel community's grasp of concurrency had improved to the point that patches were often correct at first submission.
    Why the improvement?
    – Not programming language: C before, during, and after
    – Not synchronization primitives: Locking before, during, and after
    – Not a change in personnel: Relatively low turnover
    – Not born parallel programmers: Remember Big Kernel Lock!
    So what was it?
    – Stick around for the discussion this afternoon and find out!!!
    Paul E. McKenney: Beyond Expert-Only Parallel Programming?
    Saturday 4 May 13