Master Thesis Defense
November 21, 2011
Advisor: Dr. Barbara Chapman
Debjyoti Majumder
Department of Computer Science
University of Houston
 Coarray Fortran Overview
 Research Objective
 Enabling Remote Direct Memory Access
 Runtime Implementation
 Runtime Optimizations
 Performance Evaluation
 Conclusion and Future Work
2
Coarray Fortran Overview
3
 Part of the Fortran 2008 standard
 Enables parallel programming in Fortran
 Suitable for both shared and distributed
memory systems
4
[Figure: Partitioned Global Address Space (PGAS): Processor A, Processor B, and Processor C above a shared Memory]
 Single Program Multiple Data (SPMD): each executing process is called an "image"
 Declaring a variable with square brackets [*] makes it a coarray, e.g. real :: var_a[*]
 Square brackets with an image index access data on a remote image, e.g.
var_a[2] = 10 writes 10 to var_a on image 2
5
PROGRAM HELLOWORLD
  integer :: A_coarray(10)[*]
  A_coarray(:) = 10*this_image()       ! this_image() is an intrinsic
  sync all                             ! barrier
  if (this_image() == 1) then
    A_coarray(:) = A_coarray(:)[2]     ! remote memory read; [2] is the cosubscript
  end if
  if (this_image() == 1) then
    print *, "Array (on image 1):", A_coarray(:)
  end if
END PROGRAM
Output: 20 20 20 20 ..
6
7
 Common memory model for SMP & cluster
 Clear distinction between local memory and
remote memory
8
Research Objective
9
 Implement the runtime library for CAF, which
will enable remote direct memory access.
 Explore techniques for performance
optimization.
 Evaluate correctness and performance.
10
 My mentor, Deepak Eachempati, designed the
runtime API and implemented part of the
runtime library.
11
12
Enabling Remote Direct Memory Access
13
 Challenge: memory pages get swapped in and out
by the OS virtual memory management.
 The network hardware must ensure that the target
memory is available
14
1. Automatic hardware-assisted virtual
memory management
The NIC tracks changes in the page table by
accessing the virtual memory subsystem of the
OS
E.g. Quadrics
15
2. Passive, pinning-based
The NIC prevents the OS from swapping out a portion
of the memory (pinned memory)
Requires rendezvous or bounce buffers
E.g. InfiniBand
16
3. Network with no hardware support for
remote memory access.
Must use message passing under the hood
E.g. Ethernet
17
18
 Active Messages execute a handler function
◦ Helper thread
◦ Hardware interrupt
◦ Dedicated NIC processor runs the handler code
19
Provides:
 Data transfer functions
 Synchronizations
 Thin layer on top of sophisticated network
hardware
 Firehose algorithm on pinning based network
 Active message on ethernet
20
 Helper threads instead of active message
 Rendezvous or Bounce buffer – No firehose
 Zero-copy host-assisted protocol on certain
networks
◦ Convert read to write
21
 Similar performance in most cases.
 GASNet has several optimization parameters.
 In GASNet the entire shared memory segment
must be created during program
initialization.
22
 ARMCI has a hard limit on the amount of
shared memory on pinning-based networks.
 GASNet has finer control over non-blocking
communication.
 ARMCI has better documentation.
23
Runtime Implementation
24
25
 A large block of shared memory is created during
program initialization
 Starting address is broadcast to all images
 Default size 20MB per image
 Can be changed using cafrun option or
environment variable
26
Same on
all images
27
Static/Global Coarrays
Common Slot
Size: 20MB
FEB: Empty
Coarray 2MB
FEB: Full
Common Slot
Size: 18MB
FEB: Empty
Pointer 3MB
FEB: Full
Common Slot
Size: 15MB
FEB: Empty
Coarray 10MB
FEB: Full
Common Slot
Size 5MB
Common Slot
Size 8MB
Size: 2MB
FEB: Empty
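The shrinking common slot in the figure (20MB, then 18MB, 15MB, and 5MB as a 2MB coarray, a 3MB pointer target, and a 10MB coarray are carved out) can be modelled with a small slot list. This is only a sketch in Python, not the UHCAF runtime code: the class name `SlotHeap`, the first-fit policy, and the merge-on-free rule are assumptions made for illustration.

```python
class SlotHeap:
    """Toy model of a shared segment split into full and empty slots (sizes in MB)."""

    def __init__(self, size_mb=20):
        # Each slot is [offset, size, full]; the segment starts as one empty slot.
        self.slots = [[0, size_mb, False]]

    def allocate(self, size_mb):
        """First fit (assumed policy): carve the request from the first empty slot that fits."""
        for i, (offset, size, full) in enumerate(self.slots):
            if not full and size >= size_mb:
                self.slots[i] = [offset, size_mb, True]
                if size > size_mb:
                    # The remainder stays behind as a smaller empty slot.
                    self.slots.insert(i + 1, [offset + size_mb, size - size_mb, False])
                return offset
        raise MemoryError("no empty slot is large enough")

    def free(self, offset):
        """Mark a slot empty and merge it with neighbouring empty slots."""
        for slot in self.slots:
            if slot[0] == offset and slot[2]:
                slot[2] = False
                break
        else:
            raise ValueError("offset is not allocated")
        merged = []
        for slot in self.slots:
            if merged and not merged[-1][2] and not slot[2]:
                merged[-1][1] += slot[1]
            else:
                merged.append(slot)
        self.slots = merged

    def empty_sizes(self):
        return [size for _, size, full in self.slots if not full]


heap = SlotHeap(20)
coarray_a = heap.allocate(2)    # empty space left: 18
pointer_p = heap.allocate(3)    # empty space left: 15
coarray_b = heap.allocate(10)   # empty space left: 5
```

If every image performs the same allocations in the same order, the returned offsets match on all images, so a remote address can be formed from the broadcast starting address of the target image plus the local offset.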
28
Operation       GASNet             ARMCI
Read            gasnet_get         ARMCI_Get
Write           gasnet_put         ARMCI_Put
Strided Read    gasnet_gets_bulk   ARMCI_GetS
Strided Write   gasnet_puts_bulk   ARMCI_PutS
29
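[Figure: sync images with per-image counter arrays, initially all zero. Image 1 executes sync images((/2,3/)); images 2 and 3 execute sync images(1); image 4 does not take part. Counters are set to 1 on the partner images, and an image waits while the counter it is checking is still unset.]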
30
 Same algorithm, Remote Atomic Read Modify
Write function instead of Active Message.
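The counter handshake suggested by the sync images figure can be sketched as follows. This is a single-process Python model under stated assumptions, not the runtime's code: the names `Image`, `notify`, and `try_complete` are hypothetical, and the plain increment stands in for whatever delivers the update remotely (an active message handler or a remote atomic read-modify-write).

```python
class Image:
    def __init__(self, image_id, num_images):
        self.image_id = image_id
        # counters[j] counts notifications received from image j + 1.
        self.counters = [0] * num_images


def notify(images, me, partners):
    """First half of sync images: bump my slot in each partner's counter array."""
    for p in partners:
        images[p - 1].counters[me - 1] += 1  # stands in for the remote update


def try_complete(images, me, partners):
    """Second half: succeed only once every partner has notified me."""
    mine = images[me - 1].counters
    if any(mine[p - 1] == 0 for p in partners):
        return False  # a real runtime would keep waiting here
    for p in partners:
        mine[p - 1] -= 1  # consume the notification so the counter can be reused
    return True


images = [Image(i, 4) for i in range(1, 5)]
notify(images, 1, [2, 3])                          # image 1: sync images((/2,3/))
image1_done_early = try_complete(images, 1, [2, 3])  # False: nobody has answered yet
notify(images, 2, [1])                             # image 2: sync images(1)
image2_done = try_complete(images, 2, [1])         # True: image 1 already notified
notify(images, 3, [1])                             # image 3: sync images(1)
image3_done = try_complete(images, 3, [1])
image1_done = try_complete(images, 1, [2, 3])      # True once both partners arrived
```

Image 4 never appears in any partner list, so its counters stay at zero and it is not delayed by the other three images.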
31
 Coarray A[2:2,3:4,0:*] on 6 images
Image index (this_image()) for each cosubscript triple, first cosubscript fixed at 2:
                 second = 3   second = 4
third = 0            1            2
third = 1            3            4
third = 2            5            6
this_image(A) on image 6: 2 4 2
this_image(A,2) on image 6: 4
image_index(A,(/2,4,0/)) on any image: 2
lcobound(A): 2 3 0
lcobound(A,2): 3
ucobound(A): 2 4 2
ucobound(A,2): 4
32
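The values above follow from ordinary column-major index arithmetic over the codimensions. The Python sketch below reproduces them for A[2:2,3:4,0:*]; the function names and the `extents` argument (the sizes of all codimensions except the last, unbounded one) are illustrative choices, not part of any runtime API.

```python
def this_image_cosubscripts(image, lcobounds, extents):
    """Cosubscripts of a 1-based image index; the first codimension varies fastest."""
    index = image - 1
    subs = []
    for lower, extent in zip(lcobounds[:-1], extents):
        subs.append(lower + index % extent)
        index //= extent
    subs.append(lcobounds[-1] + index)  # the last codimension is unbounded (0:*)
    return subs


def image_index(cosubscripts, lcobounds, extents):
    """Inverse mapping: 1-based image index for a cosubscript triple."""
    index, stride = 0, 1
    for sub, lower, extent in zip(cosubscripts[:-1], lcobounds[:-1], extents):
        index += (sub - lower) * stride
        stride *= extent
    index += (cosubscripts[-1] - lcobounds[-1]) * stride
    return index + 1


LCOBOUNDS = [2, 3, 0]   # lower cobounds of A[2:2,3:4,0:*]
EXTENTS = [1, 2]        # sizes of the first two codimensions (2:2 and 3:4)

subs_on_image_6 = this_image_cosubscripts(6, LCOBOUNDS, EXTENTS)
index_of_2_4_0 = image_index([2, 4, 0], LCOBOUNDS, EXTENTS)
```

With 6 images the last cosubscript runs from 0 to 2, which is why the upper cobounds are 2 4 2.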
 uhcaf for compilation
◦ uhcaf helloworld.f90 –layer=armci
 cafrun for execution
◦ cafrun –np 4 ./a.out –log-levels=DEBUG
33
 Tracing options provide timing, debugging
and memory information
34
 Contains small CAF programs to verify
correctness.
 Automatic shell script for compilation,
execution and output verification.
35
Runtime Optimizations
36
37
integer :: A_coarray(100,100)[*], B_coarray(100,100)[*]
X(:,:) = A_coarray(:,:)[2]           ! cache line: A_coarray, B_coarray
y(:,:) = B_coarray(2:4, 10:20)[2]    ! local copy
A_coarray(:,:)[2] = 5                ! remote write + local write
sync all                             ! nonblocking refetch
 Exploit spatial locality of coarrays
 Use nonblocking communication to prefetch
 Write through
 Converts strided remote access into
contiguous
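A minimal Python model of such a write-through cache is sketched below. It is an illustration under assumptions rather than the UHCAF implementation: `RemoteImage` and `CoarrayCache` are hypothetical names, the cache is indexed by flat element position, and the cache is simply invalidated at a synchronization point instead of being refetched with nonblocking communication.

```python
class RemoteImage:
    """Stands in for another image's coarray; counts remote operations."""

    def __init__(self, data):
        self.data = list(data)
        self.gets = 0
        self.puts = 0

    def get(self, start, count):
        self.gets += 1
        return self.data[start:start + count]

    def put(self, start, values):
        self.puts += 1
        self.data[start:start + len(values)] = values


class CoarrayCache:
    def __init__(self, remote, line_size):
        self.remote = remote
        self.line_size = line_size
        self.lines = {}  # line number -> locally cached elements

    def read(self, index):
        line = index // self.line_size
        if line not in self.lines:
            # Miss: fetch the whole contiguous line, not only the requested element.
            self.lines[line] = self.remote.get(line * self.line_size, self.line_size)
        return self.lines[line][index % self.line_size]

    def write(self, index, value):
        # Write through: update the remote copy, then the local copy if it is cached.
        self.remote.put(index, [value])
        line = index // self.line_size
        if line in self.lines:
            self.lines[line][index % self.line_size] = value

    def invalidate(self):
        # Simplification: drop cached lines at a sync point (no refetch is modelled).
        self.lines.clear()


remote = RemoteImage(range(16))
cache = CoarrayCache(remote, line_size=8)
strided = [cache.read(i) for i in (0, 2, 4, 6)]  # four strided reads, one remote get
```

The strided read touches four elements but issues a single contiguous get, which is the effect the last bullet describes.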
38
Without NB Put: Remote Write, then Computation, then Synchronization
With NB Put: Remote NB Write, then Computation, then Wait on NB Handle, then Synchronization
40
"Automatic Nonblocking Communication for Partitioned Global Address Space Programs" by
Wei-Yu Chen, Dan Bonachea, Costin Iancu, and Katherine Yelick (Berkeley)
IMAGE 1
Integer :: A[*]
A[2] = 10
sync all
IMAGE 2
Integer :: A[*]
sync all
X = A
?
41
 Need to check :
◦ Source overwrite – Always ensure local completion
◦ Read Conflict:
A[2]=10
x=A[2]
◦ Write Conflict
A[2]=10
A[2]=5
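The read and write conflict checks can be sketched as a table of outstanding nonblocking puts that is consulted before each new remote access. The Python model below is a hedged illustration: `PendingPuts` and its methods are hypothetical names, "waiting" is modelled by removing the entry and counting it, and source overwrite is not modelled because local completion is assumed to be ensured when the put is issued.

```python
class PendingPuts:
    """Tracks nonblocking puts that have been issued but not yet waited on."""

    def __init__(self):
        self.pending = []  # (image, start, end) with end exclusive
        self.waits = 0

    def _overlapping(self, image, start, end):
        return [p for p in self.pending
                if p[0] == image and p[1] < end and start < p[2]]

    def _wait(self, puts):
        for p in puts:
            self.pending.remove(p)  # stands in for waiting on the put's handle
            self.waits += 1

    def before_get(self, image, start, end):
        # Read conflict: A[2] = 10 followed by x = A[2] must observe the new value.
        self._wait(self._overlapping(image, start, end))

    def issue_put(self, image, start, end):
        # Write conflict: A[2] = 10 followed by A[2] = 5 must land in program order.
        self._wait(self._overlapping(image, start, end))
        self.pending.append((image, start, end))

    def sync(self):
        # At a synchronization point every outstanding put has to complete.
        self._wait(list(self.pending))


tracker = PendingPuts()
tracker.issue_put(2, 0, 4)    # A(1:4)[2] = ...
tracker.before_get(3, 0, 4)   # different image: no wait needed
tracker.before_get(2, 2, 3)   # overlaps the pending put: wait first
```

Accesses to other images or to non-overlapping ranges proceed without waiting, so unrelated communication still overlaps with computation.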
42
Performance Evaluation
43
 Measures:
◦ Put & Get Latency
◦ Put & Get Bandwidth
◦ Bidirectional Bandwidth
◦ Noncontiguous Bandwidth
◦ Broadcast Bandwidth
CAF, ARMCI, GASNet, UPC, Global Arrays, MPI(1sided)
Based on Asma Farjallah’s (intern at Total)
microbenchmark suite.
44
 Hardware:
◦ Dual Core AMD Opteron
◦ 2.4GHz
◦ 2011 MB
◦ openSUSE 11.3
45
46
47
48
Loop 200 times
26 Gets (from different coarrays)
Barrier
Computation
End Loop.
49
Loop 200 times
26 Puts
Computation
Barrier
End Loop.
Number of Nodes: 330
Peak Performance: 29.5 TFLOPS
Cores per Node: 8 (2 Intel Nehalem quad-core)
Operating Frequency: 2.8 GHz
Memory per Node: 24 GB
Interconnect: QDR InfiniBand
Interconnect Bandwidth: 40 Gbps (both up and down)
50
 Intel MPI (version 12) flags
◦ -O3
◦ -fp-model precise (for IEEE floating point)
 UHCAF flags
◦ -O3
◦ Uses the GASNet layer
◦ GASNET_VIS_AMPIPE=1 (for non-contiguous transfers)
 1 process per node (no SMP)
51
52
 Input: Two 3D matrices with timings of
reflected waves
 Computes 2-way wave equation
 The input matrices are divided among
images, and the halo cells are exchanged
after each iteration
53
[Figure: Speedup with respect to 4 processes vs. number of processes]
Number of Processes   IntelMPI   UHCAF
8                     2.04       2.49
16                    4.44       4.94
32                    8.35       9.18
64                    16.57      19.16
128                   33.78      39.06
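Because the speedups are reported relative to a 4-process run, ideal scaling at p processes is p/4, and the ratio of measured to ideal speedup gives a relative parallel efficiency. The short Python calculation below works this out from the chart values. It assumes that the first data series read off the chart belongs to IntelMPI and the second to UHCAF, following the legend order; the function name `relative_efficiency` is illustrative.

```python
PROCESSES = [8, 16, 32, 64, 128]
BASELINE = 4  # speedups on the chart are relative to a 4-process run

# Assumed series assignment: legend order "IntelMPI UHCAF".
SPEEDUP = {
    "IntelMPI": [2.04, 4.44, 8.35, 16.57, 33.78],
    "UHCAF": [2.49, 4.94, 9.18, 19.16, 39.06],
}


def relative_efficiency(speedup, nprocs, baseline=BASELINE):
    """Measured speedup divided by the ideal speedup nprocs / baseline."""
    return speedup / (nprocs / baseline)


for name, values in SPEEDUP.items():
    row = [f"{relative_efficiency(s, p):.3f}" for s, p in zip(values, PROCESSES)]
    print(name, " ".join(row))
```

Values above 1 mean the run scaled better than linearly relative to the 4-process baseline, which can happen when the per-process working set starts to fit in cache.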
54
55
 Tilted Transverse Isotropic Wave Equation
 Models anisotropic media
 Input: Six 3D matrices with timing data
56
Conclusion and Future Work
57
 Using CAF is much easier than using external
libraries
 Apart from increasing productivity, CAF
provides better performance.
 UHCAF is the first efficient open-source CAF
compiler.
58
 Extensions:
◦ Broadcast
◦ Reduction
◦ Teams
◦ Parallel IO
◦ Fault Tolerance
59
 Use pthreads on SMP instead of
ARMCI/GASNet
 Make the memory management and other
data structures more scalable
 Build another API over MPI 1-sided operations
(when MPI 3.0 is available)
60
61

Editor's Notes

  1. 2) The goal is to make minimal changes to the language to introduce parallelism. 3) Multiple copies of the same program execute independently, as in MPI. 4) Because it is a partitioned global address space language.
  2. Image 1, 2 & 3 may be on shared memory or on remote memory. CAF runtime must provide the abstraction of partitioned global address space.
  3. Based on Open64. Compiles C, C++, Fortran, OpenMP & CAF. The Cray Fortran 95 frontend was modified to support CAF syntax. The coarrays are lowered in the backend to utilize compiler optimizations. Coarray operations are converted into runtime calls.
  4. Advantage: better utilization of memory
  5. Total performs seismic explorations to find oil both on land and sea. Dynamites are exploded on the surface. The reflected sound waves are captured using geophones and hydrophones.