Defense_Presentation

Master Thesis Defense
November 21, 2011
Advisor: Dr. Barbara Chapman
Debjyoti Majumder
Department of Computer Science
University of Houston

 Coarray Fortran Overview
 Research Objective
 Enabling Remote Direct Memory Access
 Runtime Implementation
 Runtime Optimizations
 Performance Evaluation
 Conclusion and Future Work

 Part of the Fortran 2008 standard
 Enables parallel programming in Fortran
 Suitable for both shared and distributed memory systems
[Figure: Partitioned Global Address Space, showing Processors A, B, and C and Memory]

 Single Program Multiple Data (SPMD); each executing process is called an 'image'
 Declaring a variable with square brackets [*] makes it a coarray; e.g. real :: var_a[*]
 Square brackets with an image index are used to access data on a remote image; e.g. var_a[2] = 10 writes 10 to var_a on image 2

program helloworld
  implicit none
  integer :: A_coarray(10)[*]
  A_coarray(:) = 10 * this_image()     ! this_image() is an intrinsic
  sync all                             ! barrier
  if (this_image() == 1) then
    A_coarray(:) = A_coarray(:)[2]     ! remote memory read; [2] is the cosubscript
    print *, "Array (on image 1):", A_coarray(:)
  end if
end program helloworld
Output (on image 1): 20 20 20 20 ...

 Common memory model for SMP & cluster
 Clear distinction between local memory and remote memory

 Implement the runtime library for CAF, which will enable remote direct memory access.
 Explore techniques for performance optimization.
 Evaluate correctness and performance.

 My mentor, Deepak Eachempati, designed the runtime API and implemented part of the runtime library.

 Challenge: the OS virtual memory manager can swap memory pages in and out, so a remote address is not guaranteed to be resident.
 The network hardware must ensure that the target memory is available.

1. Automatic hardware-assisted virtual memory management
   The NIC tracks changes in the page table by accessing the virtual memory subsystem of the OS
   E.g. Quadrics

2. Passive, pinning-based
   The NIC prevents the OS from swapping out a portion of the memory (pinned memory)
   Requires rendezvous or bounce buffers
   E.g. InfiniBand

3. Network with no hardware support for remote memory access
   Must use message passing under the hood
   E.g. Ethernet

 Active Messages execute a handler function on the target, using one of:
◦ Helper thread
◦ Hardware interrupt
◦ Dedicated NIC processor that runs the handler code

Provides:
 Data transfer functions
 Synchronizations
 Thin layer on top of sophisticated network hardware
 Firehose algorithm on pinning-based networks
 Active messages on Ethernet

 Helper threads instead of active messages
 Rendezvous or bounce buffers; no Firehose
 Zero-copy host-assisted protocol on certain networks
◦ Converts reads to writes

 Similar performance in most cases.
 GASNet has several optimization parameters.
 In GASNet, the entire shared memory segment must be created during program initialization.

 ARMCI has a hard limit on the amount of shared memory on pinning-based networks.
 GASNet has finer control over non-blocking communication.
 ARMCI has better documentation.

 A large chunk of shared memory is created during program initialization
 Its starting address is broadcast to all images
 Default size is 20MB per image
 Can be changed using a cafrun option or an environment variable

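[Figure: Static/Global Coarrays. The shared segment starts as a single Common Slot (Size: 20MB, FEB: Empty) and is split as allocations arrive:
  Coarray 2MB (FEB: Full), leaving Common Slot 18MB (FEB: Empty)
  Pointer 3MB (FEB: Full), leaving Common Slot 15MB (FEB: Empty)
  Coarray 10MB (FEB: Full), leaving Common Slot 5MB
Remaining labels in the figure: Common Slot Size 8MB; Size: 2MB, FEB: Empty]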
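A minimal sketch of the slot bookkeeping in the figure, written in Python purely for illustration. It assumes a first-fit policy that carves each allocation off an empty common slot, and it reads "FEB" as a full/empty flag; both are assumptions. The names `Slot`, `SharedHeap`, and `allocate` are hypothetical and are not the UHCAF runtime API.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class Slot:
    name: str     # "Common Slot", "Coarray" or "Pointer"
    offset: int   # byte offset from the start of the shared segment
    size: int     # slot size in bytes
    full: bool    # assumed meaning of FEB: True = in use, False = empty


class SharedHeap:
    """Toy model of one image's shared segment (20MB by default)."""

    def __init__(self, size=20 * MB):
        self.slots = [Slot("Common Slot", 0, size, False)]

    def allocate(self, name, size):
        """First fit: split the first empty slot that is large enough."""
        for i, slot in enumerate(self.slots):
            if not slot.full and slot.size >= size:
                pieces = [Slot(name, slot.offset, size, True)]
                rest = slot.size - size
                if rest > 0:
                    pieces.append(Slot("Common Slot", slot.offset + size, rest, False))
                self.slots[i:i + 1] = pieces
                return slot.offset
        raise MemoryError("no empty slot is large enough")

    def layout(self):
        return [(s.name, s.size // MB, "Full" if s.full else "Empty") for s in self.slots]


heap = SharedHeap()
heap.allocate("Coarray", 2 * MB)
heap.allocate("Pointer", 3 * MB)
heap.allocate("Coarray", 10 * MB)
print(heap.layout())
```

If every image performs the same allocations in the same order, a slot has the same offset everywhere, so a remote address could be formed as the target image's broadcast starting address plus that offset. That is one plausible reason for broadcasting the starting address, not a confirmed detail of the implementation.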

Operation       GASNet             ARMCI
Read            gasnet_get         ARMCI_Get
Write           gasnet_put         ARMCI_Put
Strided Read    gasnet_gets_bulk   ARMCI_GetS
Strided Write   gasnet_puts_bulk   ARMCI_PutS

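[Figure: sync images example on 4 images. Each image holds an array of counters, initially all 0. Image 1 executes sync images((/2,3/)); images 2 and 3 execute sync images(1); image 4 does not take part. Counter entries are set to 1 by the partner images, and an image waits while an expected entry is still unset (?).]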

 Same algorithm, but using a remote atomic read-modify-write function instead of an Active Message.
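A rough model of the counter-based sync images scheme, assuming each image first posts a notification to every partner and then waits for the matching notification from each of them. Python threads stand in for images, and a lock-protected increment stands in for the Active Message handler or the remote atomic read-modify-write. The class and method names are hypothetical.

```python
import threading


class SyncImagesModel:
    """Toy model: threads play the role of images."""

    def __init__(self, num_images):
        # flags[i][j] counts notifications that image j has posted to image i.
        # Images are numbered from 1, so row 0 and column 0 are unused.
        self.flags = [[0] * (num_images + 1) for _ in range(num_images + 1)]
        self.cond = threading.Condition()

    def sync_images(self, me, partners):
        with self.cond:
            # Step 1: post a notification on every partner (the remote increment).
            for p in partners:
                self.flags[p][me] += 1
            self.cond.notify_all()
            # Step 2: wait for each partner's notification, then consume it.
            for p in partners:
                while self.flags[me][p] == 0:
                    self.cond.wait()
                self.flags[me][p] -= 1


def run_example():
    model = SyncImagesModel(4)
    calls = {1: [2, 3], 2: [1], 3: [1]}  # image 4 does not take part
    threads = [
        threading.Thread(target=model.sync_images, args=(me, partners), daemon=True)
        for me, partners in calls.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    finished = not any(t.is_alive() for t in threads)
    return finished, model.flags
```

Because every image posts before it waits, the three calls cannot deadlock one another, and consuming a notification after the wait leaves the counters ready for the next sync images call.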

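 Coarray A[2:2,3:4,0:*] on 6 images

this_image() for each cosubscript combination (cosubscript 1 is always 2):
                     cosubscript 2 = 3   cosubscript 2 = 4
cosubscript 3 = 0            1                   2
cosubscript 3 = 1            3                   4
cosubscript 3 = 2            5                   6

this_image(A) on image 6:                  2 4 2
this_image(A,2) on image 6:                4
image_index(A,(/2,4,0/)) on any image:     2
lcobound(A):                               2 3 0
lcobound(A,2):                             3
ucobound(A):                               2 4 2
ucobound(A,2):                             4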
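The mapping between cosubscripts and image indices above follows column-major ordering over the codimensions. A small Python sketch of that arithmetic is given below; the function names are illustrative rather than the runtime's, and `ucobounds` lists the upper cobounds of every codimension except the last, which is open-ended. Unlike the Fortran intrinsic, the sketch does not return 0 for invalid cosubscripts.

```python
def image_index(lcobounds, ucobounds, cosubscripts):
    """Return the 1-based image index for a set of cosubscripts."""
    index, stride = 0, 1
    for k, (lo, sub) in enumerate(zip(lcobounds, cosubscripts)):
        index += (sub - lo) * stride
        if k < len(ucobounds):  # the final codimension has no fixed extent
            stride *= ucobounds[k] - lo + 1
    return index + 1


def cosubscripts_of(lcobounds, ucobounds, image):
    """Inverse mapping: the cosubscripts that select a given image."""
    rest = image - 1
    subs = []
    for k, lo in enumerate(lcobounds):
        if k < len(ucobounds):
            extent = ucobounds[k] - lo + 1
            subs.append(lo + rest % extent)
            rest //= extent
        else:
            subs.append(lo + rest)  # last codimension absorbs the remainder
    return subs


# Coarray A[2:2,3:4,0:*]
LCO = [2, 3, 0]
UCO = [2, 4]

print(image_index(LCO, UCO, [2, 4, 0]))
print(cosubscripts_of(LCO, UCO, 6))
```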

 uhcaf for compilation
◦ uhcaf helloworld.f90 -layer=armci
 cafrun for execution
◦ cafrun -np 4 ./a.out -log-levels=DEBUG

 Tracing options provide timing, debugging and memory information

 Contains small CAF programs to verify correctness.
 Automated shell script for compilation, execution and output verification.

integer :: A_coarray(100,100)[*], B_coarray(100,100)[*]
X(:,:) = A_coarray(:,:)[2]           ! fills a cache line (cache lines: A_coarray, B_coarray)
y(:,:) = B_coarray(2:4, 10:20)[2]    ! local copy
A_coarray(:,:)[2] = 5                ! remote write + local write
sync all                             ! nonblocking refetch

 Exploit spatial locality of coarrays
 Use nonblocking communication to prefetch
 Write through
 Converts strided remote access into contiguous access
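The caching idea can be sketched as follows, under several simplifying assumptions: a Python dictionary stands in for remote memory, each cache line holds a whole coarray fetched as one contiguous transfer, and the refetch at a synchronization point is done synchronously rather than with nonblocking calls. The class `CoarrayCache` and its methods are hypothetical names.

```python
class CoarrayCache:
    """Toy write-through cache for remote coarrays."""

    def __init__(self, remote):
        self.remote = remote     # remote[image][name] -> list (stand-in for remote memory)
        self.lines = {}          # (image, name) -> local copy of the whole coarray
        self.remote_gets = 0
        self.remote_puts = 0

    def _fetch(self, image, name):
        # One contiguous transfer of the whole coarray.
        self.lines[(image, name)] = list(self.remote[image][name])
        self.remote_gets += 1

    def get(self, image, name, index):
        """Read an element or a (possibly strided) slice from the local copy."""
        if (image, name) not in self.lines:
            self._fetch(image, name)
        return self.lines[(image, name)][index]

    def put(self, image, name, index, value):
        """Write through: update remote memory and the local copy together."""
        self.remote[image][name][index] = value
        self.remote_puts += 1
        if (image, name) in self.lines:
            self.lines[(image, name)][index] = value

    def sync(self):
        """At a synchronization point, refetch every cached coarray."""
        for image, name in list(self.lines):
            self._fetch(image, name)


remote = {2: {"B": list(range(100))}}
cache = CoarrayCache(remote)
strided = cache.get(2, "B", slice(10, 20, 3))   # strided read served by one contiguous fetch
single = cache.get(2, "B", 5)                   # served from the local copy
```

The strided read costs a single contiguous transfer, and later reads of the same coarray cost none until the next synchronization point.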

Without NB Put: Remote Write -> Computation -> Synchronization
With NB Put: Remote NB Write -> Computation -> Wait on NB Handle -> Synchronization
"Automatic Nonblocking Communication for Partitioned Global Address Space Programs" by Wei-Yu Chen, Dan Bonachea, Costin Iancu, Katherine Yelick from Berkeley

IMAGE 1              IMAGE 2
integer :: A[*]      integer :: A[*]
A[2] = 10
sync all             sync all
                     X = A      ! ?

 Need to check:
◦ Source overwrite: always ensure local completion
◦ Read conflict:
    A[2] = 10
    x = A[2]
◦ Write conflict:
    A[2] = 10
    A[2] = 5
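One way to picture the three checks is a table of pending nonblocking puts, sketched below in Python. The network is simulated: a put stays "in flight" until something waits on it. The class `NonblockingPuts` and its methods are hypothetical, and the policy shown (snapshot the source, wait on a conflicting handle, drain everything at a synchronization) is an assumption about how such checks could be arranged, not the runtime's actual code.

```python
import copy


class NonblockingPuts:
    """Toy tracker for pending nonblocking puts."""

    def __init__(self, remote):
        self.remote = remote     # remote[(image, name)] -> value (stand-in for remote memory)
        self.pending = {}        # (image, name) -> value still in flight
        self.forced_waits = 0    # how often a conflict forced an early wait

    def _wait(self, key):
        # Completing the handle makes the value visible remotely.
        self.remote[key] = self.pending.pop(key)

    def put_nb(self, image, name, value):
        key = (image, name)
        if key in self.pending:          # write conflict: keep puts to one location ordered
            self._wait(key)
            self.forced_waits += 1
        # Local completion: snapshot the source so the caller may overwrite it.
        self.pending[key] = copy.copy(value)

    def get(self, image, name):
        key = (image, name)
        if key in self.pending:          # read conflict: the earlier put must land first
            self._wait(key)
            self.forced_waits += 1
        return self.remote[key]

    def sync_all(self):
        # Every outstanding handle is waited on before the synchronization.
        for key in list(self.pending):
            self._wait(key)


remote = {(2, "A"): 0}
nb = NonblockingPuts(remote)
nb.put_nb(2, "A", 10)
in_flight_value = remote[(2, "A")]   # the put has not landed yet
read_back = nb.get(2, "A")           # read conflict forces the wait
```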

 Measures:
◦ Put & Get Latency
◦ Put & Get Bandwidth
◦ Bidirectional Bandwidth
◦ Noncontiguous Bandwidth
◦ Broadcast Bandwidth
CAF, ARMCI, GASNet, UPC, Global Arrays, MPI (one-sided)
Based on the microbenchmark suite by Asma Farjallah (intern at Total).

 Hardware:
◦ Dual Core AMD Opteron
◦ 2.4GHz
◦ 2011 MB
◦ openSUSE 11.3

Loop 200 times
  26 Gets (from different coarrays)
  Barrier
  Computation
End Loop

Loop 200 times
  26 Puts
  Computation
  Barrier
End Loop

Number of Nodes          330
Peak Performance         29.5 TFLOPS
Cores per Node           8 (2 Intel Nehalem quad-core)
Operating Frequency      2.8 GHz
Memory per Node          24 GB
Interconnect             QDR InfiniBand
Interconnect Bandwidth   40 Gbps (both up and down)

 Intel MPI (version 12) flags
◦ -O3
◦ -fp-model precise (for IEEE floating point)
 UHCAF flags
◦ -O3
◦ Uses GASNet layer
◦ GASNET_VIS_AMPIPE=1 (for non-contiguous transfers)
 1 process per node (no SMP)

 Input: two 3D matrices with timings of reflected waves
 Solves the two-way wave equation
 The input matrices are divided among images, and the halo cells are exchanged after each iteration
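The halo exchange pattern can be illustrated with a much smaller case than the 3D application: a 1D array split into equal blocks, each padded with one ghost cell per side. The Python sketch below is only a model (lists stand in for coarrays, and the function names are hypothetical). In CAF, each assignment in `exchange_halos` would correspond to a remote read from, or write to, a neighbouring image.

```python
def split_with_halos(data, num_images):
    """Give each image an equal block plus one ghost cell on each side.

    Assumes len(data) is divisible by num_images.
    """
    n = len(data) // num_images
    return [[0.0] + data[i * n:(i + 1) * n] + [0.0] for i in range(num_images)]


def exchange_halos(blocks):
    """Copy each neighbour's edge cell into the local ghost cells."""
    for me, block in enumerate(blocks):
        if me > 0:
            block[0] = blocks[me - 1][-2]    # left ghost <- left neighbour's last interior cell
        if me < len(blocks) - 1:
            block[-1] = blocks[me + 1][1]    # right ghost <- right neighbour's first interior cell


blocks = split_with_halos([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 2)
exchange_halos(blocks)
print(blocks)
```

Only interior cells are read and only ghost cells are written, so the exchange result does not depend on the order in which images are processed.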

[Chart: Speedup w.r.t. 4 processes vs. Number of Processes]
Number of Processes   IntelMPI   UHCAF
8                     2.04       2.49
16                    4.44       4.94
32                    8.35       9.18
64                    16.57      19.16
128                   33.78      39.06
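The speedup values above can be turned into parallel efficiency relative to the 4-process baseline, where the ideal speedup at P processes is P/4. The short Python calculation below assumes the first run of five data labels belongs to IntelMPI and the second to UHCAF, following the legend order; that assignment is an assumption about the chart.

```python
BASELINE = 4
processes = [8, 16, 32, 64, 128]
speedup = {
    # Assumed series assignment, following the legend order.
    "IntelMPI": [2.04, 4.44, 8.35, 16.57, 33.78],
    "UHCAF": [2.49, 4.94, 9.18, 19.16, 39.06],
}


def efficiency(series):
    """Measured speedup divided by the ideal speedup P / BASELINE."""
    return [s / (p / BASELINE) for s, p in zip(series, processes)]


for name, series in speedup.items():
    print(name, ["%.2f" % e for e in efficiency(series)])
```

An efficiency above 1 means the run scaled better than linearly relative to its own 4-process baseline.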

 Tilted Transverse Isotropic Wave Equation
 Models anisotropic media
 Input: Six 3D matrices with timing data


 Using CAF is much easier than using external libraries.
 Apart from increasing productivity, CAF provides better performance.
 UHCAF is the first efficient open-source CAF compiler.

 Extensions:
◦ Broadcast
◦ Reduction
◦ Teams
◦ Parallel IO
◦ Fault Tolerance

 Use pthreads on SMP instead of ARMCI/GASNet
 Make the memory management and other data structures more scalable
 Build another API over MPI one-sided operations (when MPI 3.0 is available)
