Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism
Stefan Marr, S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter
Software Languages Lab, Vrije Universiteit Brussel
Slides dated 9/26/10

Agenda
  - Introduction
    - Barriers, Phasers
  - Insertion Tree Phasers
    - Insertion Tree Phaser Algorithm
  - Evaluation
  - Summary

Barriers
  - Synchronizing parallel activities
    - High productivity: easy to get right
    - Mostly for scientific computing
  - Many-core evolution
    - Synchronizing dynamic and irregular problems
    - Requires low-overhead dynamic hierarchical barriers
  [Figure: activities t1 to t3 across phases p1 to p3 (labels t1p1 through t3p3)]

Phasers
  - Extension of X10 clocks
    - Clocks: dynamic two-phase barrier for fork/join parallelism
  - Registration modes for barrier
    - Enables expression of producer/consumer relation
  - Single statements
    - Executed only by a single thread, avoids duplicated barrier operations
  [Figure: labels t1p1, t1p2, t2p2, t3p2, t2p3, t3p3]

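The two-phase idea mentioned above (announce arrival first, block later) can be sketched in a few lines. This is not the phaser implementation from the talk: it is a flat barrier for a fixed number of parties built on a condition variable, and the names TwoPhaseBarrier, signal, wait, and run_demo are hypothetical, chosen only for this illustration.

```python
import threading

class TwoPhaseBarrier:
    """Hypothetical flat two-phase barrier for a fixed number of parties."""

    def __init__(self, parties):
        self._parties = parties
        self._arrived = 0
        self._phase = 0
        self._cond = threading.Condition()

    def signal(self):
        """Non-blocking: announce arrival and return the phase being left."""
        with self._cond:
            phase = self._phase
            self._arrived += 1
            if self._arrived == self._parties:
                # Last arrival: reset the count and open the next phase.
                self._arrived = 0
                self._phase += 1
                self._cond.notify_all()
            return phase

    def wait(self, phase):
        """Block until the phase returned by signal() has been completed."""
        with self._cond:
            while self._phase == phase:
                self._cond.wait()

def run_demo(parties=3, phases=3):
    barrier = TwoPhaseBarrier(parties)
    log = []  # list.append is atomic in CPython

    def worker(name):
        for p in range(phases):
            log.append(("signalled", p, name))
            ticket = barrier.signal()
            # Work that does not depend on other threads could overlap here.
            barrier.wait(ticket)
            log.append(("resumed", p, name))

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Splitting the barrier into signal and wait lets a thread do independent work between the two calls, which is where a two-phase barrier can hide latency that a single blocking call cannot.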
Hierarchical Phasers
  - Shirako & Sarkar in Proc. of IEEE IPDPS 2010 [1]
  - First scalable implementation strategy
  - Predefined tree structure
    - Degree, i.e., tree arity
    - Max. number of tiers, i.e., height
    - Composed from phasers
  - Problematic
    - Non-dynamic structure
    - Two-phase support incomplete
    - Leaves design decisions open
  [Figure: phaser tree with Tier 0 (phaser), Tier 1 (nodes labeled "sub"), and Tier 2 (leaves, nodes labeled "sub" and "sig") for participants A1 to A8; annotated "Array access" and "List access"]

Open Questions with Hierarchical Phasers
  - Dynamic tree construction, or on initialization?
    - Tradeoffs for atomic operations, overhead of joining/leaving phaser
  - How are operations synchronized?
    - Tradeoffs for overheads and restrictions on parallelism
  - Garbage collection problem for dropped participants
    - Keeps list of synchronization objects incl. dropped participants
  - After reaching max. #participants
    - Is the tree rebalanced? (Hinted at for dropped nodes)
  - Two-phase barrier support does not hide latency for original phasers

Insertion Tree Phasers

Design Goal
  - Support for full generality of Phaser properties
    - Two-phase support
    - Signal-only/wait-only for producers/consumers
    - Single statement
    - Full dynamicity: fine-grained hierarchical fork/join
  - Adaptation of existing, scalable approaches
    - Dissemination barrier not adaptable
    - Remaining are tree-based approaches

Insertion Tree (1/2)
  - Goals
    - Stable, i.e., minimized tree modifications
    - Avoid inconsistent synchronization information
    - Maximizing parallel operations
  - Solution: Insertion Tree
    - Inverted tree
    - No removal
    - Complete smallest subtree first

Insertion Tree (2/2)
  [Figure sequence, 7 steps: the tree grows as leaves are inserted]
  - empty tree
  - leaf 1
  - helper h1; leaves 1, 2
  - helpers h2, h1; leaves 1, 2, 3
  - helpers h2, h1, h3; leaves 1 to 4
  - helpers h4, h2, h1, h3; leaves 1 to 5
  - helpers h4, h2, h6, h1, h3, h5, h7; leaves 1 to 8

Determining the Insertion Point

    def getNextInsertNode(tree):
        # Assumes the tree already holds at least one leaf.
        result = tree.lastNode
        i = tree.numLeaves
        # Each factor of 2 in the leaf count marks a completed subtree:
        # climb one level past it.
        while i % 2 == 0:
            result = result.parent
            i = i // 2
        return result

    # This is for 2-ary trees;
    # it is adaptable for n-ary trees, too.

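To see how the insertion point drives the growth of the tree, the following sketch builds a binary insertion tree leaf by leaf. Only getNextInsertNode, lastNode, numLeaves, and parent come from the slide; Node, InsertionTree, addLeaf, show, and the children lists (kept purely for printing) are hypothetical additions, and the way a new helper node is spliced in above the insertion point is an assumption that reproduces the helper labels of the figure sequence.

```python
class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []  # kept only so the tree can be printed

class InsertionTree:
    def __init__(self):
        self.root = None
        self.lastNode = None
        self.numLeaves = 0
        self.numHelpers = 0

def getNextInsertNode(tree):
    result = tree.lastNode
    i = tree.numLeaves
    while i % 2 == 0:
        result = result.parent
        i = i // 2
    return result

def addLeaf(tree, label):
    """Insert a leaf; nothing is ever removed or moved below the insertion point."""
    leaf = Node(label)
    if tree.numLeaves == 0:
        tree.root = leaf
    else:
        target = getNextInsertNode(tree)
        tree.numHelpers += 1
        helper = Node("h%d" % tree.numHelpers, parent=target.parent)
        if target.parent is None:
            tree.root = helper  # the tree grows upward at the root
        else:
            siblings = target.parent.children
            siblings[siblings.index(target)] = helper
        helper.children = [target, leaf]
        target.parent = helper
        leaf.parent = helper
    tree.lastNode = leaf
    tree.numLeaves += 1
    return leaf

def show(node):
    if not node.children:
        return node.label
    return "%s(%s)" % (node.label, " ".join(show(c) for c in node.children))

tree = InsertionTree()
shapes = []
for n in range(1, 9):
    addLeaf(tree, str(n))
    shapes.append(show(tree.root))
```

After five insertions shapes[4] is "h4(h2(h1(1 2) h3(3 4)) 5)", and after eight it is a complete binary tree rooted at h4. Existing subtrees are never restructured, which matches the "no removal" and "complete smallest subtree first" goals.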
Synchronization Tree*
  - Phaser: phase
  - Helper nodes: phase counter, wait-only flag (wo)
  - Participant nodes: phase counter, resume flag (rsmd)
  - *) simplified, leaves out registration modes
  [Figure: tree for participants A1 to A4, all phase counters at 0]

Announcing Synchronization
  [Figure sequence, 6 steps, participants A1 to A4: all phase counters start at 0 and change to 1 step by step, first at the participant nodes (flagged rsmd) and finally at the phaser phase]
  - Synchronization reached. Continue to next phase.

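The propagation shown in this sequence can be sketched as taking the minimum phase count of a node's children and pushing it toward the root. The sketch below is a single-threaded simplification: it leaves out the resume and wait-only flags, registration modes, and the handling of racing updates that a concurrent implementation needs. PhaseNode and signal are hypothetical names, and the tree shape (root h2 over h1 and h3) is taken from the four-leaf insertion tree.

```python
class PhaseNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.phase = 0
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def signal(leaf):
    """Advance a participant's phase and push the new minimum toward the root."""
    leaf.phase += 1
    node = leaf.parent
    while node is not None:
        minimum = min(child.phase for child in node.children)
        if minimum == node.phase:
            break  # this node is unchanged, so its ancestors are unchanged too
        node.phase = minimum
        node = node.parent

root = PhaseNode("h2")
h1 = PhaseNode("h1", root)
h3 = PhaseNode("h3", root)
a1, a2 = PhaseNode("A1", h1), PhaseNode("A2", h1)
a3, a4 = PhaseNode("A3", h3), PhaseNode("A4", h3)

trace = []
for leaf in (a1, a2, a3, a4):
    signal(leaf)
    trace.append((leaf.name, h1.phase, h3.phase, root.phase))
```

In this run the root only reaches phase 1 after the fourth signal, which corresponds to the "synchronization reached" step. Most signals stop after touching one or two nodes, so participants in different subtrees rarely contend for the same node.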
Dropping Participants
  [Figure sequence, 6 steps, participants A1 to A4: wait-only flags (wo) appear on more nodes from step to step, the labels h1:R and h1:L are shown, and the phaser phase goes from 0 to 1]
  - Synchronization reached. Continue to next phase.

Adding New Participants
  [Figure sequence, 4 steps, participants A1 to A4: phase counters at 8 and 9, with additional counters at 8 appearing in the second step; annotations "-1", "+1", and "propagate phase count"]

Evaluation

Two-Phase Barrier Operation
  [Chart not preserved]

Overhead: Two-Phase vs. Classic
  [Chart not preserved]

Use as Drop-In Replacement for SPLASH-2: Speedup compared to TmcSpinBarrier
  [Chart not preserved]

Summary
  - Scalable and efficient approach to Phasers
    - Documents implementation
    - Based on fully dynamic insertion tree
    - Overcomes limitations of existing approaches
    - Usable as drop-in replacement
  - Future work
    - Scalability beyond 59 cores
    - Optimization for other memory architectures
Stefan Marr, IEEE HPCC 2010, Insertion Tree Phasers

Questions?
  - Implementation: http://barriers.googlecode.com/ (MIT license)

References
  [1] Shirako, Jun & Sarkar, Vivek: Hierarchical Phasers for Scalable Synchronization and Reductions in Dynamic Parallelism. In: Proc. of IEEE IPDPS (2010).

Editor's Notes

  1. Shirako et al.; X10: Vijay Saraswat
  2. Shirako + Sarkar
  3. So I went to the whiteboard, drew a tree, and figured out how to do it slightly differently
  4. How to build a tree to synchronize dynamic parallelism?
  5-17. Example tree like in the paper; briefly the different properties, and that they are aggregations of the subtree
  18-21. In the general case:
      - propagate the phase count minimum up the tree
      - while doing this, wait for racing values: check that the value found is the one expected from the last visited node; if it is not, wait until it is, at which point the racing activity has passed