Defense_Presentation
1. Master Thesis Defense
November 21, 2011
Advisor: Dr. Barbara Chapman
Debjyoti Majumder
Department of Computer Science
University of Houston
2. Coarray Fortran Overview
Research Objective
Enabling Remote Direct Memory Access
Runtime Implementation
Runtime Optimizations
Performance Evaluation
Conclusion and Future Work
3. Coarray Fortran Overview
4. Part of the Fortran 2008 standard
Enables parallel programming in Fortran
Suitable for both shared and distributed memory systems
[Figure: Partitioned Global Address Space, showing Processors A, B, and C over Memory]
5. Single Program Multiple Data (SPMD); each executing process is called an 'image'
Declaring a variable with square brackets [*] makes it a coarray; e.g. real :: var_a[*]
Square brackets with an image ID are used to access data on a remote image; e.g. var_a[2] = 10 writes 10 to var_a on image 2
6. PROGRAM HELLOWORLD
  integer :: A_coarray(10)[*]
  A_coarray(:) = 10*this_image()      ! this_image() is an intrinsic
  sync all                            ! barrier
  if (this_image() == 1) then
    A_coarray(:) = A_coarray(:)[2]    ! remote memory read; [2] is the cosubscript
  end if
  if (this_image() == 1) then
    print *, "Array (on image 1):", A_coarray(:)
  end if
END PROGRAM
Output: 20 20 20 20 ..
8. Common memory model for SMP & cluster
Clear distinction between local memory and remote memory
9. Research Objective
10. Implement the runtime library for CAF, which will enable remote direct memory access.
Explore techniques for performance optimization.
Evaluate correctness and performance.
11. My mentor, Deepak Eachempati, designed the runtime API and implemented part of the runtime library.
13. Enabling Remote Direct Memory Access
14. Challenge: Memory pages are swapped in and out by the OS virtual memory manager.
The network hardware needs a way to ensure that the memory is available.
15. 1. Automatic hardware-assisted virtual memory management
The NIC tracks changes in the page table by accessing the virtual memory subsystem of the OS
E.g. Quadrics
16. 2. Passive, pinning-based
The NIC prevents the OS from swapping out a portion of the memory (pinned memory)
Requires rendezvous or bounce buffers
E.g. InfiniBand
17. 3. Network with no hardware support for remote memory access
Must use message passing under the hood
E.g. Ethernet
19. Active messages execute a handler function, which can be run by:
◦ A helper thread
◦ A hardware interrupt
◦ A dedicated NIC processor that runs the handler code
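The helper-thread option above can be illustrated with a small simulation. This is a minimal sketch and not the GASNet or ARMCI implementation: the `Image` class, `put_handler`, and `run_demo` are hypothetical names, and a Python queue stands in for the network.

```python
import queue
import threading


class Image:
    """One simulated image; only its helper thread touches its memory."""

    def __init__(self, size):
        self.memory = [0] * size
        self.inbox = queue.Queue()  # stands in for the network (assumption)
        self.helper = threading.Thread(target=self._serve, daemon=True)
        self.helper.start()

    def _serve(self):
        # Helper thread: take each active message and run its handler locally.
        while True:
            handler, args, done = self.inbox.get()
            if handler is None:  # shutdown request
                done.set()
                break
            handler(self, *args)
            done.set()

    def send_active_message(self, handler, *args):
        """Queue a handler for execution on this image; return a completion event."""
        done = threading.Event()
        self.inbox.put((handler, args, done))
        return done

    def shutdown(self):
        self.send_active_message(None).wait()
        self.helper.join()


def put_handler(image, offset, values):
    # Runs on the target image: copy the payload into its memory.
    image.memory[offset:offset + len(values)] = values


def run_demo():
    target = Image(size=8)
    # The sender never writes target.memory directly; the handler does.
    target.send_active_message(put_handler, 2, [10, 10, 10]).wait()
    result = list(target.memory)
    target.shutdown()
    return result
```

The sender only describes the work (handler plus arguments). The target's helper thread performs the write, which is why this approach also works on networks without remote memory access support.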
20. Provides:
Data transfer functions
Synchronization
Thin layer on top of sophisticated network hardware
Firehose algorithm on pinning-based networks
Active messages on Ethernet
21. Helper threads instead of active messages
Rendezvous or bounce buffers; no Firehose
Zero-copy host-assisted protocol on certain networks
◦ Converts a read into a write
22. Similar performance in most cases.
GASNet has several optimization parameters.
In GASNet the entire shared memory segment must be created during program initialization.
23. ARMCI has a hard limit on the amount of shared memory on pinning-based networks.
GASNet has finer control over non-blocking communication.
ARMCI has better documentation.
24. Runtime Implementation
26. A large chunk of shared memory is created during program initialization
The starting address is broadcast to all images
Default size is 20 MB per image
Can be changed using a cafrun option or an environment variable
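One reason to broadcast the starting addresses is that an image can then compute where a coarray lives on another image. The following is a minimal sketch of that translation, assuming every coarray sits at the same offset inside each image's segment. That symmetric layout, the function name `translate`, and the example base addresses are assumptions made for illustration, not details taken from the slides.

```python
SEGMENT_SIZE = 20 * 1024 * 1024  # default segment size from the slide: 20 MB per image

# Hypothetical segment start addresses, as each image would learn them from the broadcast.
SEGMENT_BASE = {1: 0x7F0000000000, 2: 0x7F1000000000}


def translate(local_addr, my_image, target_image):
    """Map an address in my segment to the matching address on target_image.

    Assumes symmetric allocation: the same offset on every image.
    """
    offset = local_addr - SEGMENT_BASE[my_image]
    if not 0 <= offset < SEGMENT_SIZE:
        raise ValueError("address is outside the shared segment")
    return SEGMENT_BASE[target_image] + offset
```

For example, `translate(SEGMENT_BASE[1] + 4096, 1, 2)` yields `SEGMENT_BASE[2] + 4096`, so no per-coarray address exchange is needed at access time.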
33. uhcaf for compilation
◦ uhcaf helloworld.f90 –layer=armci
cafrun for execution
◦ cafrun –np 4 ./a.out –log-levels=DEBUG
34. Tracing options provide timing, debugging, and memory information
35. Contains small CAF programs to verify correctness.
Automatic shell script for compilation, execution, and output verification.
36. Runtime Optimizations
37. integer :: A_coarray(100,100)[*], B_coarray(100,100)[*]
X(:,:) = A_coarray(:,:)[2]           ! remote read fills the cache line (A_coarray, B_coarray)
y(:,:) = B_coarray(2:4, 10:20)[2]    ! served from the local copy
A_coarray(:,:)[2] = 5                ! remote write + local write
sync all                             ! nonblocking refetch
38. Exploit spatial locality of coarrays
Use nonblocking communication to prefetch
Write through
Converts strided remote accesses into contiguous ones
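The caching ideas above can be sketched as a small software cache. This is an illustrative model and not the UHCAF runtime: `RemoteImage`, `CoarrayCache`, and `LINE_SIZE` are hypothetical, and the refetch in `sync` is done synchronously here, whereas the slides describe a nonblocking refetch.

```python
LINE_SIZE = 4  # elements per cache line (illustrative value)


class RemoteImage:
    """Stands in for the memory of a remote image."""

    def __init__(self, data):
        self.data = list(data)
        self.fetches = 0  # number of line transfers served

    def read_line(self, line_no):
        self.fetches += 1
        start = line_no * LINE_SIZE
        return self.data[start:start + LINE_SIZE]

    def write(self, index, value):
        self.data[index] = value


class CoarrayCache:
    """Local cache of remote data with write-through puts."""

    def __init__(self, remote):
        self.remote = remote
        self.lines = {}  # line number -> local copy of that line

    def get(self, index):
        line_no, pos = divmod(index, LINE_SIZE)
        if line_no not in self.lines:
            # Miss: fetch the whole line, so neighbours arrive with it.
            self.lines[line_no] = self.remote.read_line(line_no)
        return self.lines[line_no][pos]

    def put(self, index, value):
        # Write through: update the remote image and any cached copy.
        self.remote.write(index, value)
        line_no, pos = divmod(index, LINE_SIZE)
        if line_no in self.lines:
            self.lines[line_no][pos] = value

    def sync(self):
        # After a barrier other images may have written; refetch cached lines.
        for line_no in list(self.lines):
            self.lines[line_no] = self.remote.read_line(line_no)
```

Reading elements 0 through 3 costs a single line transfer in this model, and a `put` keeps the local copy consistent without invalidating it.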
39. Without NB Put
◦ Remote Write
◦ Computation
◦ Synchronization
With NB Put
◦ Remote NB Write
◦ Computation
◦ Wait on NB Handle
◦ Synchronization
"Automatic Nonblocking Communication for Partitioned Global Address Space Programs" by Wei-Yu Chen, Dan Bonachea, Costin Iancu, Katherine Yelick from Berkeley
40. IMAGE 1
integer :: A[*]
A[2] = 10
sync all
IMAGE 2
integer :: A[*]
sync all
X = A
Question: does image 2 read 10 into X if the write to A[2] is nonblocking?
41. Need to check:
◦ Source overwrite: always ensure local completion
◦ Read conflict:
A[2] = 10
x = A[2]
◦ Write conflict:
A[2] = 10
A[2] = 5
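The three checks above can be sketched with a table of pending nonblocking puts. This is a simplified model under stated assumptions: the class `NonblockingPuts` is hypothetical, remote memory is a plain dictionary, and conflicts are detected only for an identical (image, offset) key rather than for overlapping address ranges.

```python
class NonblockingPuts:
    """Tracks nonblocking puts that have been issued but not yet delivered."""

    def __init__(self, remote_memory):
        self.remote = remote_memory  # dict: (image, offset) -> delivered values
        self.pending = {}            # dict: (image, offset) -> values in flight

    def put_nb(self, image, offset, values):
        key = (image, offset)
        if key in self.pending:
            # Write conflict: finish the earlier put so writes land in program order.
            self.wait(key)
        # Source overwrite: copy the payload so the caller may reuse its buffer
        # immediately (local completion).
        self.pending[key] = list(values)

    def get(self, image, offset):
        key = (image, offset)
        if key in self.pending:
            # Read conflict: the pending put must complete before the read.
            self.wait(key)
        return self.remote.get(key)

    def wait(self, key):
        # Complete one put by delivering its payload.
        self.remote[key] = self.pending.pop(key)

    def wait_all(self):
        # Called before synchronization, e.g. at a barrier.
        for key in list(self.pending):
            self.wait(key)
```

Calling `wait_all` before a barrier corresponds to the "Wait on NB Handle" step that precedes synchronization in the nonblocking put sequence.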
42. Performance Evaluation
43. Measures:
◦ Put & Get Latency
◦ Put & Get Bandwidth
◦ Bidirectional Bandwidth
◦ Noncontiguous Bandwidth
◦ Broadcast Bandwidth
CAF, ARMCI, GASNet, UPC, Global Arrays, MPI (one-sided)
Based on the microbenchmark suite by Asma Farjallah (intern at Total).
49. Number of Nodes: 330
Peak Performance: 29.5 TFLOPS
Cores per Node: 8 (2 Intel Nehalem quad-core)
Operating Frequency: 2.8 GHz
Memory per Node: 24 GB
Interconnect: QDR InfiniBand
Interconnect Bandwidth: 40 Gbps (both up and down)
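The peak performance figure can be cross-checked from the other values in the table. The sketch below assumes 4 double-precision floating-point operations per cycle per core, which is the usual figure for Nehalem but is not stated on the slide.

```python
nodes = 330
cores_per_node = 8
clock_ghz = 2.8
flops_per_cycle = 4  # assumption: double-precision FLOPs per cycle per Nehalem core

# nodes * cores * GHz * FLOPs/cycle gives GFLOPS; divide by 1000 for TFLOPS.
peak_tflops = nodes * cores_per_node * clock_ghz * flops_per_cycle / 1000
print(f"{peak_tflops:.2f} TFLOPS")
```

Under that assumption the result is about 29.57 TFLOPS, consistent with the 29.5 TFLOPS listed.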
50. Intel MPI (version 12) Flags
◦ -O3
◦ -fp-model precise (for IEEE fp)
UHCAF Flags
◦ -O3
◦ Uses GASNet layer
◦ GASNET_VIS_AMPIPE=1 (for non-contiguous)
1 process per node (no SMP)
51. Input: Two 3D matrices with timings of reflected waves
Computes the 2-way wave equation
The input matrices are divided among images, and the halo cells are exchanged after each iteration
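The halo exchange described above can be sketched in one dimension. This is a simplified illustration and not the application's code: the real input is 3D, the function names are hypothetical, and the data length is assumed to divide evenly among the images.

```python
def split_with_halos(data, num_images):
    """Give each image a block laid out as [left halo] + interior + [right halo]."""
    chunk = len(data) // num_images  # assumes len(data) is a multiple of num_images
    return [
        [0.0] + data[i * chunk:(i + 1) * chunk] + [0.0]
        for i in range(num_images)
    ]


def exchange_halos(blocks):
    """Copy each neighbour's boundary interior cell into the local halo cell."""
    for i, block in enumerate(blocks):
        if i > 0:
            block[0] = blocks[i - 1][-2]   # last interior cell of the left neighbour
        if i < len(blocks) - 1:
            block[-1] = blocks[i + 1][1]   # first interior cell of the right neighbour


blocks = split_with_halos([float(v) for v in range(6)], num_images=3)
exchange_halos(blocks)
```

Only halo cells are written and only interior cells are read, so the order in which images are processed does not affect the result. In CAF each of these copies would be a remote read or write on a coarray.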
56. Conclusion and Future Work
57. Using CAF is much easier than using external libraries
Apart from increasing productivity, CAF provides better performance.
UHCAF is the first efficient open-source CAF compiler.
59. Use pthreads on SMP instead of ARMCI/GASNet
Make the memory management and other data structures more scalable
Build another API over MPI one-sided operations (when MPI 3.0 is available)
2) The goal is to introduce parallelism with minimal changes to the language.
3) Multiple copies of the same program execute independently, as in MPI.
4) Because it is a partitioned global address space language.
Images 1, 2, and 3 may be on shared memory or on remote memory. The CAF runtime must provide the abstraction of a partitioned global address space.
Based on Open64. Compiles C, C++, Fortran, OpenMP, and CAF.
The Cray Fortran 95 front end was modified to support CAF syntax.
The coarrays are lowered in the back end to take advantage of compiler optimizations.
Coarray operations are converted into runtime calls.
Advantage: better utilization of memory.
Total performs seismic exploration to find oil both on land and at sea. Dynamite is detonated on the surface, and the reflected sound waves are captured using geophones and hydrophones.