This paper presents an algorithm and a data structure for scalable dynamic synchronization in fine-grained parallelism. The algorithm supports the full generality of phasers with dynamic, two-phase, and point-to-point synchronization. It retains the scalability of classical tree barriers, but provides unbounded dynamicity by employing a tailor-made insertion tree data structure.
It is the first completely documented implementation strategy for a scalable phaser synchronization construct. Our evaluation shows that it can be used as a drop-in replacement for classic barriers without harming performance, even though it is more complex and still leaves room for performance optimizations. Furthermore, our approach overcomes performance and scalability limitations present in other phaser proposals.
Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism
1. Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism. Stefan Marr, S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter. Software Languages Lab, Vrije Universiteit Brussel
2. Agenda Introduction Barriers, Phasers Insertion Tree Phasers Insertion Tree Phaser Algorithm Evaluation Summary 9/26/10 2
3. Barriers: Synchronizing parallel activities. High productivity: easy to get right. Mostly for scientific computing. Many-core evolution: synchronizing dynamic and irregular problems requires low-overhead dynamic hierarchical barriers. [Figure: barrier diagram, labels t1p1 through t3p3]
4. Phasers: Extension of X10 clocks. Clocks: dynamic two-phase barrier for fork/join parallelism. Registration modes for barrier: enables expression of producer/consumer relation. Single statements: executed only by a single thread, avoids duplicated barrier operations. [Figure: phaser diagram, labels t1p1 through t3p3]
5. Hierarchical Phasers: Shirako & Sarkar in Proc. of IEEE IPDPS 2010 [1]. Array access; list access. First scalable implementation strategy. Predefined tree structure: degree, i.e., tree arity; max. number of tiers, i.e., height. Composed from phasers. Problematic: non-dynamic structure; two-phase support incomplete; leaves design decisions open. [Figure: phaser tree with Tier 0 (phaser), Tier 1 (sub-phasers), and Tier 2 (leaves), receiving signals from activities A1 through A8]
6. Open Questions with Hierarchical Phasers: Dynamic tree construction, or on initialization? Tradeoffs for atomic operations, overhead of joining/leaving phaser. How are operations synchronized? Tradeoffs for overheads and restrictions on parallelism. Garbage collection problem for dropped participants: keeps list of synchronization objects incl. dropped participants. After reaching max. #participants: is the tree rebalanced? (Hint at it for dropped nodes.) Two-phase barrier support does not hide latency for original phasers.
8. Design Goal: Support for full generality of phaser properties: two-phase support; signal-only/wait-only for producers/consumers; single statement; full dynamicity (fine-grained hierarchical fork/join). Adaptation of existing, scalable approaches: dissemination barrier not adaptable; remaining are tree-based approaches.
9. Insertion Tree. Goals: stable, i.e., minimized tree modifications; avoid inconsistent synchronization information; maximizing parallel operations. Solution: insertion tree; inverted tree; no removal; complete smallest subtree first.
17. Determining the Insertion Point
    def getNextInsertNode(tree):
        result = tree.lastNode
        i = tree.numLeaves
        while i % 2 == 0:
            result = result.parent
            i = i // 2
        return result
    # this is for 2-ary trees
    # is adaptable for n-ary trees, too
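The insertion-point rule on this slide can be tried out with a small, self-contained sketch. Only the walk in get_next_insert_node follows the slide; the Node and InsertionTree classes, the insert_leaf procedure (a new helper node takes the place of the insertion point and adopts it together with the new leaf), and the depth helper are assumptions made for illustration and are not taken from the talk or the paper. The sketch is single-threaded and covers 2-ary trees only.

```python
class Node:
    """Tree node; leaves stand for participants, inner nodes for helpers (assumed layout)."""
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


class InsertionTree:
    """Hypothetical container holding the fields the slide's pseudocode reads."""
    def __init__(self):
        self.root = None
        self.last_node = None   # most recently inserted leaf
        self.num_leaves = 0


def get_next_insert_node(tree):
    """Walk up from the last leaf while the leaf count is even (rule from the slide)."""
    if tree.num_leaves == 0:
        raise ValueError("empty tree has no insertion point")
    result = tree.last_node
    i = tree.num_leaves
    while i % 2 == 0:
        result = result.parent
        i //= 2
    return result


def insert_leaf(tree, name):
    """Assumed insertion step: pair the insertion point with the new leaf under a new helper."""
    leaf = Node(name)
    if tree.num_leaves > 0:
        sibling = get_next_insert_node(tree)
        helper = Node("helper")
        helper.parent = sibling.parent
        if sibling.parent is None:
            tree.root = helper                      # insertion point was the root
        else:
            siblings = sibling.parent.children
            siblings[siblings.index(sibling)] = helper
        helper.children = [sibling, leaf]
        sibling.parent = helper
        leaf.parent = helper
    else:
        tree.root = leaf
    tree.last_node = leaf
    tree.num_leaves += 1
    return leaf


def depth(node):
    """Number of edges between a node and the root."""
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d


if __name__ == "__main__":
    tree = InsertionTree()
    leaves = [insert_leaf(tree, f"A{k}") for k in range(1, 9)]
    print([depth(leaf) for leaf in leaves])
```

Under these assumptions, existing nodes are never removed or moved relative to their subtree, and the smallest incomplete subtree is filled before a new level is added, which matches the "no removal" and "complete smallest subtree first" goals stated on slide 9.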
18. Synchronization Tree* [Figure: phaser node (phase: 0); helper nodes, each with a phase counter and a wait-only flag (wo); participant nodes A1 through A4, each with a phase counter and a resume flag (rsmd)] *) Simplified; leaves out registration modes.
38. Use as Drop-In Replacement for SPLASH-2: Speedup compared to TmcSpinBarrier
39. Summary: Scalable and efficient approach to phasers: documents implementation; based on fully dynamic insertion tree; overcomes limitations of existing approaches; usable as drop-in replacement. Future work: scalability beyond 59 cores; optimization for other memory architectures. Stefan Marr, IEEE HPCC 2010, Insertion Tree Phasers
40. Questions? Implementation: http://barriers.googlecode.com/ (MIT license). [Figure: synchronization tree at phase 1 with helper nodes h1:L and h1:R and participants A1 through A4]
41. References [1] Shirako, Jun & Sarkar, Vivek: Hierarchical Phasers for Scalable Synchronization and Reductions in Dynamic Parallelism. In: Proc. of IEEE IPDPS (2010).
Editor's Notes
Shirako et al.; X10: Vijay Saraswat
Shirako + Sarkar
So I went to the whiteboard, drew a tree, and figured out how to do it slightly differently.
How to build a tree to synchronize dynamic parallelism?
Example tree as in the paper; briefly explain the different properties and that they are aggregations of the subtree.
In the general case: propagate the phase count minimum up the tree. While doing this, wait for racing values by checking that the value found is the one expected from the last visited node; if it is not, wait until it is, which means the racing activity has passed.
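The first half of this note, propagating the phase count minimum up the tree, can be illustrated with a deliberately simplified sketch. It is sequential, so the waiting for racing values described in the note is left out entirely; it also ignores registration modes and the wait-only and resume flags. The names SyncNode and signal are hypothetical and not taken from the talk or the implementation.

```python
class SyncNode:
    """Node of a simplified synchronization tree (assumed layout).

    Leaves stand for participants; inner nodes hold the minimum phase
    of their subtree, as on the synchronization tree slide.
    """
    def __init__(self, children=()):
        self.phase = 0
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def signal(participant):
    """Advance one participant and push the subtree minimum towards the root.

    Sequential simplification: no atomics and no handling of racing
    activities, which a real concurrent version would need.
    """
    participant.phase += 1
    node = participant.parent
    while node is not None:
        new_phase = min(child.phase for child in node.children)
        if new_phase == node.phase:
            break               # another subtree is still behind; stop here
        node.phase = new_phase
        node = node.parent


if __name__ == "__main__":
    a1, a2, a3, a4 = SyncNode(), SyncNode(), SyncNode(), SyncNode()
    left, right = SyncNode([a1, a2]), SyncNode([a3, a4])
    phaser = SyncNode([left, right])
    for participant in (a1, a2, a3):
        signal(participant)
    print(phaser.phase)   # still the old phase: a4 has not signalled yet
    signal(a4)
    print(phaser.phase)   # the root advances once every participant has signalled
```

In this simplified form the root's phase only advances when the last participant of the slowest subtree signals, which is the aggregation property the notes describe.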