2. Overview
• Introduction
  • Synchronization
  • Non-blocking Synchronization
• Is Non-blocking Synchronization performance-beneficial for Parallel Applications?
• NOBLE: A Non-blocking Synchronization Interface
  • How can we make non-blocking synchronization accessible to the parallel programmer?
• Lock-free Skip Lists
• Conclusions, Future Work
6. Non-blocking Synchronization
Lock-Free Synchronization
Optimistic approach
• Assumes it is alone and prepares the operation, which later takes effect (unless interfered with) in one atomic step, using hardware atomic primitives
• Interference is detected via shared memory
• Retries until it is not interfered with by other operations
• Can cause starvation
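The optimistic pattern above can be sketched in C11 as a push onto a shared stack: the node is prepared privately, then published with a single compare-and-swap, and the attempt is repeated if another thread interfered. This is a minimal illustration, not code from the talk; the names `node_t`, `top` and `push` are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type for the illustration. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Shared top-of-stack pointer (hypothetical name). */
static _Atomic(node_t *) top = NULL;

/* Lock-free push: prepare the node, then publish it in one atomic step. */
static void push(node_t *n)
{
    assert(n != NULL);
    node_t *old = atomic_load(&top);
    do {
        /* Prepare the operation, assuming nobody else interferes. */
        n->next = old;
        /* The CAS fails if `top` changed in the meantime (interference seen
           through shared memory); `old` is then refreshed and we retry. */
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}
```

A thread that keeps losing the CAS to others keeps retrying, which is the starvation risk named in the last bullet.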
7. Example: Shared Queue
The usual approach is to implement operations using retry loops.
Here's an example:
  type Qtype = record v: valtype; next: pointer to Qtype end
  shared var Tail: pointer to Qtype;
  local var old, new: pointer to Qtype
  procedure Enqueue (input: valtype)
    new := (input, NIL);
    repeat old := Tail
    until CAS2(&Tail, &(old->next), old, NIL, new, new)
[Figure: the old, Tail and new pointers before and after the CAS2]
8. Non-blocking Synchronization
Lock-Free Synchronization
• Avoids problems that locks have
• Fast
• Starvation? (not in the context of HPC)
Wait-Free Synchronization
• Always finishes in a finite number of its own steps
• Complex algorithms
• Memory consuming
• Less efficient on average than lock-free
10. Non-blocking Synchronisation
Synchronisation:
• An alternative approach to synchronisation, introduced 25 years ago
• Many theoretical results
Evaluation:
• Micro-benchmarks show better performance than mutual exclusion in real or simulated multiprocessor systems.
11. Practice
Non-blocking synchronization is still not used in practical applications.
Non-blocking solutions are often:
• complex
• equipped with non-standard or unclear interfaces
• impractical
12. Practice
Question:
"How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?"
13. Answers
How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?
• The identification of the basic locking operations that parallel programmers use in their applications.
• The efficient non-blocking implementation of these synchronisation operations.
• The architectural implications for the design of non-blocking synchronisation.
• Comparison of the lock-based and lock-free versions of the respective applications.
14. Applications
• Ocean: simulates eddy currents in an ocean basin.
• Radiosity: computes the equilibrium distribution of light in a scene using the radiosity method.
• Volrend: renders 3D volume data into an image using a ray-casting method.
• Water: evaluates forces and potentials that occur over time between water molecules.
• Spark98: a collection of sparse matrix kernels. Each kernel performs a sequence of sparse matrix-vector product operations using matrices that are derived from a family of three-dimensional finite element earthquake applications.
15. Removing Locks in Applications
• Many locks are "simple locks": CAS, FAA and LL/SC can be used to implement a non-blocking version.
• Many critical sections contain shared floating-point variables: floating-point synchronization primitives are needed. A Double-Fetch-and-Add primitive was designed.
• Large critical sections: efficient non-blocking implementations of large ADTs are used.
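One common way to obtain a floating-point fetch-and-add on hardware that only offers integer CAS is a retry loop over the 64-bit pattern of the double. The sketch below illustrates that general technique only; it is not the Double-Fetch-and-Add design referred to on the slide, and the names `atomic_double_t`, `to_bits`, `from_bits` and `double_fetch_and_add` are assumptions for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t),
              "sketch assumes a 64-bit double");

/* Hypothetical type: the bit pattern of a double held in a 64-bit atomic. */
typedef _Atomic uint64_t atomic_double_t;

static uint64_t to_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

static double from_bits(uint64_t u)
{
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

/* Atomically adds delta to *target and returns the previous value. */
static double double_fetch_and_add(atomic_double_t *target, double delta)
{
    uint64_t old_bits = atomic_load(target);
    uint64_t new_bits;
    do {
        /* Compute the new value from the snapshot we read. */
        new_bits = to_bits(from_bits(old_bits) + delta);
        /* Publish only if nobody changed the value since the snapshot;
           on failure old_bits is refreshed and the sum is recomputed. */
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));
    return from_bits(old_bits);
}
```

With such a primitive, a critical section that only accumulates into a shared floating-point variable can drop its lock entirely.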
20. NOBLE: Brings Non-blocking closer to Practice
Create a non-blocking inter-process communication interface with the following properties:
• Attractive functionality
• Programmer friendly
• Easy to adapt existing solutions
• Efficient
• Portable
• Adaptable for different programming languages
22. Using NOBLE
• First create a global variable handling the shared data object, for example a stack:
Globals
  #include <noble.h>
  ...
  NBLStack* stack;
• Create the stack with the appropriate implementation:
Main
  stack=NBLStackCreateLF(10000);
  ...
• When some thread wants to do some operation:
Threads
  NBLStackPush(stack, item);
  or
  item=NBLStackPop(stack);
24. Using NOBLE
• To change the synchronization mechanism, only one line of code has to be changed!
Globals
  #include <noble.h>
  ...
  NBLStack* stack;
Main
  stack=NBLStackCreateLB();
  ...
  NBLStackFree(stack);
Threads
  NBLStackPush(stack, item);
  or
  item=NBLStackPop(stack);
27. Did our Work have any Impact?
1) Industry has initiated contacts and uses a test version of NOBLE.
2) Freeware developers have shown interest.
3) Interest from research organisations.
NOBLE is freely available for research and educational purposes.
28. A Lock-Free Skip list
Presented as part of: H. Sundell, Ph. Tsigas, "Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems", 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS '03), May 2003 (TR 2002). Best Paper Award.
A very similar lock-free skip list algorithm will be presented this August at the ACM Symposium on Principles of Distributed Computing (PODC 2004): "Lock-Free Linked Lists and Skip Lists", Mikhail Fomitchev, Eric Ruppert.
29. Randomized Algorithm: Skip Lists
William Pugh: "Skip Lists: A Probabilistic Alternative to Balanced Trees", 1990
• Layers of ordered lists with different densities, which achieves a tree-like behavior
• Time complexity: O(log₂ N), probabilistic!
[Figure: skip list from Head to Tail holding keys 1 to 7; the layers are labelled 50%, 25%, …]
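The layer densities in the figure (50%, 25%, …) come from choosing each node's height with repeated coin flips. A minimal sketch of that choice follows, assuming a promotion probability of 1/2 and a hypothetical cap `MAX_LEVEL`; neither the cap nor the function name `random_level` is taken from the talk.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVEL 16  /* hypothetical cap on the number of layers */

/* Returns a height in [1, MAX_LEVEL]; each extra layer is reached with
   probability 1/2, so about half the nodes stop at layer 1, a quarter
   at layer 2, and so on. */
static int random_level(void)
{
    int level = 1;
    while (level < MAX_LEVEL && rand() < RAND_MAX / 2)
        level++;
    assert(level >= 1 && level <= MAX_LEVEL);
    return level;
}
```

Because the heights are random rather than rebalanced, the logarithmic search cost holds in expectation, which is what "probabilistic" means on the slide.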
30. Our Lock-Free Concurrent Skip List
• Define node state to depend on the insertion status at the lowest level as well as a deletion flag
• Insert from the lowest level going upwards
• Set the deletion flag. Delete from the highest level going downwards
[Figure: skip list nodes 1 to 7, each with a deletion flag D, and a node p with levels 1 to 3 being inserted and deleted]
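31. Concurrent Insert vs. Delete operations
Problem: with a concurrent Delete and Insert, both nodes are deleted!
[Figure: list with nodes 1, 2, 3, 4; Delete performs step a), Insert performs step b)]
Solution (Harris et al): Use bit 0 of the pointer to mark deletion status
[Figure: list with nodes 1, 2 *, 3, 4 and steps a), b), c); * denotes a marked pointer]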
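The marking trick can be sketched in C11 as follows: because nodes are at least 2-byte aligned, bit 0 of a `next` pointer is free to carry the deletion mark, and an insert after a marked node fails at its CAS because the marked value no longer equals the successor it read. This is a sketch of the general technique under assumed names (`node_t`, `get_marked`, `try_mark_deleted`, `try_insert_after`), not the authors' implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    int key;
    _Atomic(struct node *) next;
} node_t;

static_assert(_Alignof(node_t) >= 2, "bit 0 of a node pointer must be free");

static node_t *get_marked(node_t *p)   { return (node_t *)((uintptr_t)p | 1u); }
static node_t *get_unmarked(node_t *p) { return (node_t *)((uintptr_t)p & ~(uintptr_t)1); }
static bool    is_marked(node_t *p)    { return ((uintptr_t)p & 1u) != 0; }

/* Logical deletion: set bit 0 of n->next. Returns true for the one caller
   that set the mark, false if the node was already marked. */
static bool try_mark_deleted(node_t *n)
{
    node_t *succ = atomic_load(&n->next);
    while (!is_marked(succ)) {
        if (atomic_compare_exchange_weak(&n->next, &succ, get_marked(succ)))
            return true;
        /* CAS failed: succ now holds the current value; re-check the mark. */
    }
    return false;
}

/* Insert n right after pred; fails if pred is marked or its successor changed. */
static bool try_insert_after(node_t *pred, node_t *n)
{
    node_t *succ = atomic_load(&pred->next);
    if (is_marked(succ))
        return false;
    atomic_store(&n->next, succ);
    return atomic_compare_exchange_strong(&pred->next, &succ, n);
}
```

Even if the insert had read the successor just before the mark was set, its CAS would still fail, because the stored pointer now differs in bit 0.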
32. Dynamic Memory Management
Problem: System memory allocation functionality is blocking!
Solution (lock-free), IBM freelists:
• Pre-allocate a number of nodes, link them into a dynamic stack structure, and allocate/reclaim using CAS
[Figure: Head points to the free nodes Mem 1, Mem 2, …, Mem n; Allocate takes a node from the head, Reclaim returns the node Used 1 to the head]
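A freelist of this kind can be sketched in C11 as a pre-allocated pool linked into a stack, with allocation and reclamation each done by one CAS on the head. The pool size and the names `mem_node_t`, `free_head`, `pool_init`, `allocate` and `reclaim` are assumptions for the illustration; as written, `allocate` is exposed to the ABA problem and would need additional protection in a real multi-threaded setting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 4  /* hypothetical number of pre-allocated nodes */

typedef struct mem_node {
    _Atomic(struct mem_node *) next;
    int payload;
} mem_node_t;

static mem_node_t pool[POOL_SIZE];
static _Atomic(mem_node_t *) free_head = NULL;

/* Reclaim: push the node back onto the free stack with CAS. */
static void reclaim(mem_node_t *n)
{
    assert(n != NULL);
    mem_node_t *old = atomic_load(&free_head);
    do {
        atomic_store(&n->next, old);
    } while (!atomic_compare_exchange_weak(&free_head, &old, n));
}

/* Allocate: pop the first free node with CAS; NULL when the pool is empty. */
static mem_node_t *allocate(void)
{
    mem_node_t *old = atomic_load(&free_head);
    while (old != NULL) {
        mem_node_t *next = atomic_load(&old->next);
        if (atomic_compare_exchange_weak(&free_head, &old, next))
            break;  /* the node is now owned by the caller */
        /* CAS failed: old holds the current head; try again. */
    }
    return old;
}

/* Link all pre-allocated nodes into the free stack. */
static void pool_init(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        reclaim(&pool[i]);
}
```

No call into the system allocator happens after `pool_init`, so no thread can block inside memory management.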
33. The ABA problem
Problem: Because of concurrency (pre-emption in particular), the same pointer value does not always mean the same node (i.e. the CAS succeeds even though the node has changed)!
[Figure: Step 1 and Step 2 of an ABA scenario on a linked structure]
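The scenario can be replayed deterministically in one thread by writing out the interleaving by hand. In the sketch below, "thread 1" starts a pop and is pre-empted, "thread 2" pops two nodes and pushes the first one back, and thread 1's stale CAS then succeeds and installs a node that is no longer in the stack. The names `aba_demo`, `push` and `pop` are assumptions, and the sequential `push`/`pop` merely stand in for the other thread's operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    int id;
    struct node *next;
} node_t;

static _Atomic(node_t *) head = NULL;

/* Sequential stand-ins for thread 2's operations (the interleaving is
   simulated in a single thread, so no retry loop is needed here). */
static void push(node_t *n)
{
    n->next = atomic_load(&head);
    atomic_store(&head, n);
}

static node_t *pop(void)
{
    node_t *n = atomic_load(&head);
    if (n != NULL)
        atomic_store(&head, n->next);
    return n;
}

/* Returns true if thread 1's stale CAS succeeded. */
static bool aba_demo(node_t *a, node_t *b)
{
    atomic_store(&head, NULL);
    push(b);
    push(a);                                /* stack: a -> b */

    /* Thread 1 begins a pop: reads the head and its successor. */
    node_t *t1_old  = atomic_load(&head);   /* a */
    node_t *t1_next = t1_old->next;         /* b */
    assert(t1_old == a && t1_next == b);

    /* Thread 1 is pre-empted; thread 2 pops a and b, then reuses a. */
    pop();
    pop();
    push(a);                                /* stack: a ; b is gone */

    /* Thread 1 resumes: head equals a again, so the CAS succeeds and
       wrongly makes the removed node b the new head. */
    return atomic_compare_exchange_strong(&head, &t1_old, t1_next);
}
```

The CAS compares only the pointer value, so it cannot tell that the structure changed underneath it.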
34. The ABA problem
Solution: (Valois et al) Add reference counting to each node, in order to prevent nodes that are of interest to some thread from being reclaimed until all threads have left the node
[Figure: New Step 2, with reference counts on the nodes (* marks referenced nodes): the CAS fails!]
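35. Helping Scheme
• Threads need to traverse safely
[Figure: a traversal reaching node 2 *, which is marked for deletion, between nodes 1 and 4]
• Need to remove marked-to-be-deleted nodes while traversing: Help!
• Find the previous node, finish the deletion and continue traversing from the previous node
[Figure: nodes 1, 2 *, 4 after helping]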
36. Overlapping operations on shared data
Example: Insert operation, which of 2 or 3 gets inserted?
[Figure: list 1 → 4 with two concurrent operations, Insert 2 and Insert 3]
Solution: Compare-And-Swap atomic primitive:
  CAS(p: pointer to word, old: word, new: word): boolean
    atomic do
      if *p = old then
        *p := new;
        return true;
      else return false;
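In C11 the same contract is available through `atomic_compare_exchange_strong`. The sketch below wraps it in a `cas` helper that mirrors the pseudocode for node pointers and uses it for the racing inserts from the example; `node_t`, `cas` and `try_insert_between` are names assumed for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    int key;
    _Atomic(struct node *) next;
} node_t;

/* Same contract as the CAS pseudocode, specialised to node pointers:
   writes new_value to *p and returns true only if *p still equals old. */
static bool cas(_Atomic(node_t *) *p, node_t *old, node_t *new_value)
{
    return atomic_compare_exchange_strong(p, &old, new_value);
}

/* Try to link n between pred and the successor the caller observed. */
static bool try_insert_between(node_t *pred, node_t *succ, node_t *n)
{
    assert(pred != NULL && n != NULL);
    atomic_store(&n->next, succ);
    return cas(&pred->next, succ, n);
}
```

If both inserters observed 1 → 4, only the first CAS on node 1's `next` succeeds; the other sees that the pointer no longer equals 4, re-reads the list and retries, so neither insert is lost.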
37. Experiments
• 1-30 threads on platforms with different levels of real concurrency
• 10000 Insert vs. DeleteMin operations by each thread; 100 vs. 1000 initial inserts
• Compared with other implementations:
  • Lotan and Shavit, 2000
  • Hunt et al., "An Efficient Algorithm for Concurrent Priority Queue Heaps", 1996
41. Lessons Learned
• The non-blocking synchronization paradigm can be suitable and beneficial for large-scale parallel applications.
• Experimental, reproducible work. Many results claimed by simulation are not consistent with what we observed.
• Applications gave us nice problems to look at and do theoretical work on. (IPDPS 2003 Algorithmic Best Paper Award)
• NOBLE helped programmers to trust our implementations.
42. Future Work
• Extend NOBLE for loosely coupled systems.
• Extend the set of data structures supported by NOBLE based on the needs of the applications.
• Reactive synchronisation