1.1 Contributions
Specifically, this paper makes the following contributions:

• An in-depth analysis of design alternatives for a PGAS communication subsystem using MPI. We present a total of four design alternatives: MPI-RMA (RMA), MPI Two-Sided (TS), MPI Two-Sided with Multi-threading (MT), and MPI Two-Sided with Dynamic Process Management (DPM).

• A novel approach which uses a combination of two-sided semantics and an automatic, user-transparent split of MPI communicators in order to provide asynchronous progress ranks (PR) for designing scalable PGAS communication subsystems.

• An implementation of the TS, MT, and PR approaches and their integration with ComEx - the communication runtime for Global Arrays. We also perform an in-depth evaluation on a spectrum of software including communication benchmarks, application kernels, and a full application, NWChem [18].

• Our performance evaluation reveals that the proposed PR approach outperforms each of the other MPI based approaches. We achieve a speedup of 2.17x on NWChem, 1.31x on sparse matrix-vector multiply, and 1.14x on graph community detection, using up to 2K processes on two high-end systems.

3.2 First Design: MPI-RMA
MPI-RMA 2.0 supports the one-sided operations get, put and accumulate, in addition to supporting atomic memory operations and active and passive synchronization modes. MPI-RMA 3.0 has several features which facilitate the design and implementation of scalable communication runtime systems for PGAS models. It allows non-blocking RMA requests, request-based transfers, window-based memory allocation, data type communication, and multiple synchronization modes. MPI-RMA 3.0 is semantically complete and suitable for designing scalable PGAS communication subsystems.

MPI-RMA has completely different semantics than the popularly used send/receive and collective communication interface. An important implication is that an optimal design of MPI-RMA needs a completely different approach than two-sided semantics. Since MPI-RMA has achieved low acceptance in comparison to two-sided semantics [10], most vendors choose to only provide a compatibility port due to resource limitations. At the same time, PGAS communication runtimes such as ComEx and GASNet [17] are tailored to serve the needs of their respective PGAS models. As an example, UPC [17] and Co-Array Fortran [22] need active messages for linked data structures, which are not well supported by MPI-RMA. Similarly, Global Futures [8] - an extension of Global Arrays to perform locality driven execution - needs active messages for scaling and minimizing data movement.

3.3 Second Design: MPI Two-Sided
A possible design of ComEx using MPI send/receive semantics can be done by carefully mapping RMA operations onto MPI two-sided semantics. In this design, every process must service requests for data while at the same time performing computation and initiating communication requests on behalf of the calling process. As a result, this design is never allowed to make synchronous requests; all operations must be non-blocking, otherwise deadlock is inevitable. Furthermore, operations must remain non-blocking to facilitate progress even while the ComEx library has made a request of its own and is servicing requests. ComEx does not provide an explicit progress function, so progress can only be made when any other ComEx function is called. The performance of a compute-intensive large-message application would certainly degrade using this design, unless asynchronous progress could be made.

MPI send/receive primitives typically use eager and rendezvous protocols for transferring small and large messages, respectively. For high performance interconnects such as InfiniBand [27], Cray Gemini [28] and Blue Gene/Q [29], the eager protocol involves a copy by both the sender and the receiver, while large messages use a zero-copy mechanism such as Remote Direct Memory Access (RDMA). Figure 1 shows these communication protocols. Variants such as Receiver initiated [2] and Write based [3] protocols exist in literature and practice.

Figure 1: Typical Communication Protocols in MPI. The left side shows the eager protocol, and the right side shows the rendezvous protocol with RTS (request to send) and FIN (finish) messages.

For large put and accumulate messages requiring a rendezvous protocol, the sending process will not complete the transfer until the target process has initiated a receive. Unfortunately, the target process cannot make progress on requests unless it also has called into the ComEx library. We consider design issues such as the above while mapping one-sided semantics onto two-sided semantics in the following sections.

3.3.1 Put/Accumulate Operations
In PGAS models like Global Arrays, blocking and non-blocking put operations can be designed using MPI_Send and MPI_Isend primitives, respectively, issued from the source process. Due to the semantics of send/receive, the destination process must initiate a receive to complete the pair-wise send/receive operation. In the case of accumulate, the destination process must also perform a local accumulate as soon as it has received the data. Figure 3 illustrates this design.

3.3.2 Get Operations
An MPI get operation can be designed as a request to get plus a receive, initiated by the initiator. The source process (the remote process which owns the memory from where the data is to be read) participates in the get operation implicitly by servicing the get request. A possible implementation would use a combination of MPI probe, receive and send, in that order. Figure 3 illustrates this design.

[Figure 3: One-sided communication protocols designed over two-sided semantics, showing RTS, Data, and FIN exchanges between processes Pi and Pj.]

3.3.3 Atomic Memory Operations
The AMOs are used, for example, for load balancing and collective synchronization barriers in these applications. AMOs can be implemented by a simple extension of the Get operation: in addition to servicing a get request, the remote process also performs an atomic operation and provides the original value before the increment. The accumulate operations do not need to return the original value to the initiator.

3.3.4 Synchronization
ComEx supports location consistency with an active mode of synchronization [11]. A ComEx barrier is implicitly both a communication barrier (fence) as well as a control barrier, e.g. MPI_Barrier. While synchronizing, all other external requests are also serviced, and a process can exit a synchronization phase only after it has received a termination message from every other process. A collective operation such as barrier/allreduce cannot be used for memory synchronization since it does not provide pair-wise causality.

3.3.5 Collective Operations
ComEx does not attempt to reimplement the already highly optimized MPI collective operations such as allreduce. However, since this design requires all operations to be non-blocking, processes must continue servicing external requests while taking part in a collective.

3.4 Third Design: MPI Send/Receive with Multi-threading
Multi-threading support is a feature which allows multiple threads to make MPI calls with different threading modes. It is an important feature in the multi-core era to facilitate hierarchical decomposition of data and computation on deep memory hierarchies. Shared address space programming models such as OpenMP provide efficient support for multi-core/many-core architectures. The MPI thread multiple mode allows invocation of MPI routines from any thread. The computation model can be broadly classified in terms of symmetric and asymmetric work being performed by the threads. The symmetric model may require different thread support levels, depending upon algorithm design. As an example, a stencil computation can be performed using a thread multiple model (each thread reads/updates its individual edge) or a thread serialized model (one thread coalesces reads/updates and sends them out). Figure 4 illustrates this design.

2.2 Communication Runtime for Exascale (ComEx)
The Communication Runtime for Exascale (ComEx), a successor of the Aggregate Remote Memory Copy Interface (ARMCI), uses native interfaces for facilitating one-sided communication primitives in Global Arrays. ComEx
has been designed to use Openfabrics Verbs (OFA) for InfiniBand [26] and RoCE Interconnects, Distributed Memory Applications (DMAPP) for Cray Gemini Interconnect [30, 28], and PAMI
for x86, PERCS, and Blue Gene/Q Interconnects [29]. The specification is being extended to support multi-threading, group aware
communication, non-cache-coherent architectures and generic
active messages. ComEx provides abstractions for RMA operations such as get, put and atomic memory operations and provides
location consistency [11]. Figure 2 shows the software ecosystem
of Global Arrays and ComEx.
Algorithm 1 abstracts the communication and computation aspects of NWChem. It uses several PGAS abstractions: GET, ACCUMULATE, and NEXTTASK, which is a load-balancing counter based on fetch-and-add. The algorithm is entirely dependent on the set of tasks (t) which a process gets during execution.
As a result, there is little temporal locality in NWChem. A process only uses indices(i) corresponding to the task id(t) to index
into distributed data structures and does not address the processes
directly. The scalability of NWChem is dependent upon the acceleration of these routines. NWChem uses Global Arrays [20] which
itself uses the native ports of ComEx, when available. As an example, the GET operation and NEXTTASK are accelerated on native
ports using RDMA and network-based atomics, respectively. The
non-native ports based on MPI two-sided must provide progress for
these operations, as they cannot be accelerated using RDMA (without invoking a protocol) or network atomics. The MPI-RMA based
port [10] can provide the acceleration; however, it typically provides a sub-optimal implementation and performance, as reported by Dinan et al. [10].
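NEXTTASK is, at bottom, a fetch-and-add on a globally shared counter. As a rough sketch of the load-balancing pattern only - plain Python threads stand in for MPI processes and a lock stands in for the network atomic; none of these names come from ComEx:

```python
import threading

class TaskCounter:
    """Stands in for a fetch-and-add counter such as NEXTTASK."""
    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def fetch_and_add(self, inc=1):
        # Atomically return the current value and advance it,
        # mimicking a network-based atomic serviced remotely.
        with self._lock:
            old = self._value
            self._value += inc
            return old

def worker(counter, num_tasks, done):
    # Each "process" repeatedly grabs the next available task id.
    t = counter.fetch_and_add()
    while t < num_tasks:
        done.append(t)          # stand-in for the FockMatrix/Accumulate work
        t = counter.fetch_and_add()

counter = TaskCounter()
done = []                        # list.append is thread-safe in CPython
threads = [threading.Thread(target=worker, args=(counter, 100, done))
           for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
# Every task id 0..99 is claimed exactly once across all workers.
assert sorted(done) == list(range(100))
```

Because each task id is handed out exactly once, idle workers naturally steal the remaining work - the dynamic load balancing NWChem relies on.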
PGAS Models using an MPI Runtime: Design Alternatives and Performance Evaluation
Jeff Daily, Abhinav Vishnu, Bruce Palmer, Hubertus van Dam
ComEx (Communication Runtime for Exascale):
• User-transparent split of the world communicator
• One master rank per shared memory domain
• For performance reasons, uses two-sided semantics
• Asynchronous progress by master rank
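A minimal sketch of the user-transparent split described above - plain Python with no MPI; the node size and the rule "lowest rank on each shared memory domain becomes the progress rank" are assumptions for illustration, mirroring the role of TranslateRankstoPR in Algorithm 3:

```python
def split_ranks(world_size, ranks_per_node):
    """Partition world ranks into progress ranks (one 'master' per
    shared memory domain) and compute ranks."""
    progress, compute = [], []
    for r in range(world_size):
        if r % ranks_per_node == 0:      # lowest rank on the node (assumption)
            progress.append(r)
        else:
            compute.append(r)
    return progress, compute

def translate_rank_to_pr(target, ranks_per_node):
    """Map an RMA target rank to the progress rank serving its node."""
    return (target // ranks_per_node) * ranks_per_node

progress, compute = split_ranks(8, 4)
assert progress == [0, 4] and compute == [1, 2, 3, 5, 6, 7]
# An RMA targeting rank 6 is routed to the progress rank of its node:
assert translate_rank_to_pr(6, 4) == 4
```

In an actual MPI implementation the same partition would be realized with a communicator split keyed on the shared memory domain, which is what makes the scheme transparent to the application.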
The upcoming section provides details of our proposed approach, which is subsequently referred to as the Progress Rank (PR) based approach for the rest of this paper. We abstract the primary computation and communication structure of NWChem and focus on the main elements relevant to this paper.

Figure 2: Software Ecosystem of Global Arrays and ComEx. Native implementations are available for Cray, IBM, and IB systems, but not for Kcomputer and Tianhe-1A.

Coupled Cluster Singles and Doubles (CCSD) and Triples (T) require N^6 and N^7 computation, respectively.

• PR approach consistently outperforms other approaches
• 2.17x speed-up on 1008 processors over other MPI designs

[Figure: Get/Put bandwidth (MB/s), time (ms), and relative speedup to RMA for the RMA, PR, and MT implementations on Gemini and InfiniBand.]
Protocols using Twoused primitives in designing parallel applications and
In NWChem data structures and does not address the processes Figureceiver.One-Sided Communication zero copy ical sections, a possibility is to use dynamic process management
into distributed and the TC algorithm (2) presented higher level 4:
However, for large messages, a
2.2 CommunicationMPI with Multi-threading inTranslation of communication
Runtime whichapproach
libraries. The two-sided semantics require of the accel(Not scalability the computation Process Sided is used in MPI with a rendezvous protocol. Hence,based ⇡we ·explore 5.the PERFORMANCE EVALUATION
directly. The Shown: Dynamictime an each
next section.
prohibitive to predictof NWChem is dependent uponimplicit synchroniza- Communication Protocols in
Tput m G,
We present a performance evaluation of the approaches discussed
for Exascale (ComEx)
eration tion between sender and receiver where the [21] which
operations in the PR approach.
and
are on
code, itof these routines. NWChemhow Global Arraysmessages are matched which is equivalent to the performance of the native ports. the left,
is difficult to predict
edges each
Management.) tag uses manyidentifier)an exam- (MT). Protocols for Get, PutruntimeAccumulate(ComEx) isUsing a
PNNL Institutional Computing Cluster (PIC) is a 604 node clusitself uses the a combinationComEx, when available. As and communicator
using native ports of of
(message
The Communication Communication a successor in the previous section using a set of communication benchmarks, a
for Exascale
One-Sided
middle, and analysis, Tget for large messages is initiates a similar to
vertex has in its adjacency list, especially for natural aaccelerated on native a wildcard similar right, respectively. Process Piexpected3.5 requestto[20]. Design: typical nodefull application with NWChem.
graphs, which
to be(ARMCI)
ter with each node consisting of two sixteen-core AMD Interlagos
N2#
Fourth graph kernel, a SpMV kernel, and with with two
ple, the(group of processes). MPI allows receiver to specify
GET operation and NEXTTASK are
Left: A
of the Aggregate Remote Memory Copy Interface
nativeComEx is handled interfaces for facilitating one-sided .
ports. uses message get transfer is impacted by
process Pj which Small native using Two-sided thePj,1commu- Table 2 shows the various design alternatives considered in this paasynchronously by thread copies
Protocols
follow a power-law distribution. and to receive aimportant respectively. and a wildcard
ports using(allowingHence, it is message with provide The
processors where the compute nodes are connected using a QLogic
tag RDMA it network-based atomics, to any tag)
DynamicPR rankswhether they were considered for evaluation.
Process Management yellow
on both sides. primitives in Global Arrays.combination of a send
InfiniBand QDR network. We have used MVAPICH2 for the MPI
a mechanism fornon-native ports based on it to receive a message from any source).
asynchronous progress two-sided mustto using MPI forAcc*
source (allowing MPI in addition provide progress
nication The protocol for get uses a As an example, ComEx per and indicates (blue and
Get*
Put*
Communication using MPI 3DPM4.is an MPI
Global#Arrays#
Algorithm 3: ComEx
and receive on each side as shown inProtocols in Hence: feature which allows an MPI process to spawn new
Algorithms and
these operations,Pas they cannot be accelerated*using RDMA (with- * P *
on this system. This N
system
N3
Pi* P
two-sided semantics. Pi* Pa* protocol) or networkPi* Pj* Pk MPI-RMA based j Designing a communication runtime
multi-threading
k*
k
circles). Blue is not considered for evaluation because libraryperformance evaluation.1# is referred to as IB for the# rest
PR is responsible
N1#
out invoking j
atomics. The
Tget = TRDM AGet +4· source address s, target address d, message size m, DPM, a new inter-communicator can
where is the memory copy cost on each
of the
MPI Input: Progress is that the lock(s) used The Using
is a non-trivial task. withprimary such as InfiniBandprocesses dynamically. TS implementation
The networks reason Ranks. and Gemport [10] can provide the acceleration. However, they typically proside. RDMA-enabled target process
[26]
be created which itfor used for communication. An advantage of
be performing RMA operations
For large put andvide sub-optimal implementation and performance as reported by by the ini [30] Protocolsabstracted of ⇡rperformanceimpact of
accumulate messages requiring a rendezvous proprogress engine are for Get, Put, and portabil- can is many orders of magnitude slower than rest of the implemen(for 1µs, hence the
provide a RDMAGet latency
that alleviates a need
Procedure PUT(s, d, m, r) such an approach tationsitconsidered in this paper. As an example, even on a modDinan et al. will
tocol, the sending process [10]. not complete the transfer until the
ity), which results in non-deterministic performance observed with iseratelythe memory to use multi-threading,
on sized system with NWChem, we by green perfor- 5.2 Simple Communication Benchmarks
hosted new procan be non-trivial on the latency for small messages.
ComEx#
IProbe* Send*
IProbe*
IProbe* Send*
r1 asynchronous thread isand yet it provides asynchronous progress by spawning noticed 10-15x
TranslateRankstoPR(r);inAccumulate are on the left,
Send*
The purpose of the simple communication benchmarks is to unthe Recv* design. Since the
MT
frequently
target process has initiated a receive. Unfortunately, the target promance degradation processes. native approach. The TS
in comparison
Recv*
(Communica<on#Run<me#For#Exascale)#
Recv*
user-level explicit process to the
cesses.
Algorithm on requests unless
derstand the raw performance of N #
communication primitives when
Recv*
cess cannot make progress1: Self-Consistent Field it also has called into
vokingThe FetchAndAdd operation is translatedcommunicator send on
MPI_Iprobe (even on a separate as an irecv and than
N4#
Local*Acc* middle,if m < right, respectively. the
approach requires
intervention for RMA progress
Send*
p
andrankthen to perform a recv of the
the processes are well synchronized. Figures 7 and 8 show the
the initiator side with a PR
buf needing
InlineDataWithHeader(m); makes it user-level processother approaches.
Procedure SCF(m, n)
which
Right: A very slow in comparison to the
Input: my rank m, total tasks n
Get communication bandwidth performance. As expected, NAT
request,Process P initiates . a request to the
a local compute, Send(buf . . initial value back toapproach isDPM approach is not(x) number of processes
and a send of the r1); A possible
The to spawn a few evaluated because dynamic process mani
…"
Data: current task ID t, data indices i, data d
initiator. By using two-sided semantics, our approach cannotand to useagement is not supported onprogress. The original on
communicates with PR systems considered for
per node take
them for asynchronous the high-end ranks
Scalable#Protocols#Layer#
else
the progress on the NIC such as the availadvantage of hardware atomics rank Pk for and spawned processes would then attach to the sameof the implementations, we
the ones
t
m;
evaluation in this paper. For the rest shared membuf InfiniBand [27]. The accumulate
PrepareHeader(m);
ComEx MPI
other process/thread pinning
nodes for RMA requests.
while t < n do
able for Cray Gemini [30] and
Key:*
Communica)on*
Computa)on*
Message*
used
over-subscription.
RMA targeting Pj . .Pj of theory region in order havethe spawned processes to with noprogress on
andput koperation
P
for
make
Send(buf . . r1);
T
Tsend
i
CalculateIndices(t);
operations are bounded by the performance
On-node requests are performed Figure 5: Translationput communication operations in the PR
behalf of
Tacc
d
Get(i);
Send(d . . s);
and the reside within the .same shared the processes within its shared memory domain. This
localacc function. For large accumulates, we expect the
of Tsend + Tlocalacc
Blue#
Tamo
Tsend + Tlocalacc + T PR
5.1 Experimental Testbed
We have used two production systems for performance evaluation. NERSC Hopper is a Cray XE6 system with 6,384 compute nodes, each made up of two twelve-core AMD MagnyCours processors. The compute nodes are connected in a 3D torus topology with the Cray Gemini Interconnect. We used the default Cray MPI library on this system, which is referred to as Hopper for the rest of the performance evaluation. The PNNL Institutional Computing Cluster (PIC) is a 604 node cluster with each node consisting of two sixteen-core AMD Interlagos processors, where the compute nodes are connected using a QLogic InfiniBand QDR network. We have used MVAPICH2 for the MPI library on this system, which is referred to as IB for the rest of the performance evaluation.

Figure 6: One-Sided Communication Protocols using Two-Sided Communication Protocols in MPI with Progress Ranks. Protocols for Get, Put, and Accumulate are on the left, middle, and right, respectively. Process Pi initiates a request to the progress rank Pk for the RMA targeting Pj; Pj and Pk reside within the same shared memory domain. Left: A typical node with two PR ranks (blue and yellow circles). The blue PR is responsible for performing RMA operations on the memory hosted by the green user-level processes. Right: A user-level process Pi communicates with PR ranks on other nodes for RMA requests. On-node requests are performed using shared memory.
References
[1] M. Valiev, E.J. Bylaska, N. Govind, K. Kowalski, T.P. Straatsma, H.J.J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T.L. Windus, and W.A. de Jong. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, 181(9):1477-1489, 2010.
[2] Scott Pakin. Receiver-initiated message passing over RDMA networks. In IPDPS, pages 1-12, 2008.
[3] Sayantan Sur, Hyun-Wook Jin, Lei Chai, and Dhabaleswar K. Panda. RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In PPoPP, pages 32-39, 2006.

Acknowledgments
This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Triangle Counting (TC), among other graph algorithms, exhibits irregular communication patterns and can be designed using PGAS models. In graphs with R-MAT structure such as twitter and facebook, it is frequently important to detect communities. An important method to detect communities is by finding cliques in the graphs. Since CLIQUE is an NP-complete problem, a popular heuristic is to calculate cliques with a size of three, which is equivalent to finding triangles in a graph. We show an example of community detection in natural (power-law) graphs, where the algorithm needs to calculate the number of triangles in a given graph. The edges are easily distributed using a compressed sparse row (CSR) format. The number of vertices are divided equally among the processes.

Algorithm 2: Triangle Counting
Procedure TC(p, m, v, e)
Input: job size p, my rank m, graph G=(v,e)
Data: vertex ID t
t ← m * v/p;
while t < (m + 1) * v/p do
  n ← NeighborList(t);
  for vn ∈ n do
    nn ← NeighborList(vn);
    CalculateIfTriangleExists(t, vn, nn);
  end
  t ← t + 1;
end
Procedure NeighborList(v)
Input: vertex v
f ← Get(v);
return f
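The triangle test in Algorithm 2 can be written directly against a CSR-style adjacency structure. A small sequential sketch, with plain Python dictionaries standing in for the distributed Get calls (the helper names and the counting convention are illustrative only):

```python
def neighbor_list(adj, v):
    # Stand-in for the Get-based NEIGHBORLIST: fetch v's adjacency row.
    return adj[v]

def count_triangles(adj, my_rank, nprocs):
    """Count triangle incidences for the vertex block owned by my_rank
    (vertices are divided equally among processes, as in Algorithm 2)."""
    nv = len(adj)
    lo, hi = my_rank * nv // nprocs, (my_rank + 1) * nv // nprocs
    count = 0
    for t in range(lo, hi):
        n = neighbor_list(adj, t)
        for vn in n:
            nn = neighbor_list(adj, vn)
            # A triangle (t, vn, w) exists if w neighbors both t and vn.
            count += sum(1 for w in nn if w in n)
    return count

# A 4-clique contains 4 triangles; this ordered scheme counts each one
# 6 times (once per ordered pair of its vertices).
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
total = sum(count_triangles(adj, r, 2) for r in range(2))
assert total == 24
```

In the distributed version each `neighbor_list` call on a non-local vertex becomes a GET, which is why the kernel's runtime is dominated by get latency.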
As shown in Algorithm 2, the NEIGHBORLIST function translates to GET. Since the computation is negligible, the runtime is bound by the time for the GET function. Other kernels, such as the Sparse Matrix-Vector Multiply (SpMV) kernel, which are frequently used in scientific applications and graph algorithms, exhibit similar structure. A PGAS incarnation of the SpMV kernel is largely dependent on the performance of the GET routine. An understanding from the motivating examples can help us in designing scalable communication runtime systems for PGAS models using MPI. We begin with a possible design choice using MPI-RMA, followed by other choices.

Performance Results

5.2 Simple Communication Benchmarks
The purpose of the simple communication benchmarks is to understand the raw performance of communication primitives when the processes are well synchronized. Figures 7 and 8 show the Get communication bandwidth performance. As expected, NAT provides the best performance for Gemini and IB. The PR implementation on Hopper is based on MPI/uGNI, while the native implementation is based on DMAPP [28], so the difference in peak bandwidth for PR and NAT can be attributed to the use of different communication libraries. The RMA implementation provides sub-optimal performance on all message sizes in comparison to the NAT and PR implementations. The MT implementation performs poorly in comparison to the PR implementation, primarily due to lock contention. On IB, the PR and RMA implementations perform similarly. The MT implementation uses the default threading mode, and it consistently performs sub-optimally in comparison to other approaches. Similar trends are observed for Put communication primitives in Figure 9, with a drop at 8 Kbytes for the RMA, PR and MT implementations due to the change from eager to rendezvous protocol at that message size.

[Figure: Get bandwidth (MB/s) vs. message size (16 bytes to 256 Kbytes) on Gemini and InfiniBand, and Put bandwidth on Gemini, for the MT, RMA, PR, and NAT implementations.]
[Figure: SpMV time (ms) on Gemini and InfiniBand and TC performance at 512-4,096 processes for the MT, RMA, PR, and NAT implementations; NWChem CCSD(T) results (Total, Acc, Get, AddPatch, SCF) for PR relative to RMA on PIC-1,020, PIC-2,040, and Hopper-1,008. MT did not finish execution for these runs.]

Figure 7: Gemini, Get Performance

MPI-RMA can be implemented either using native communication interfaces which leverage RDMA offloading, or by utilizing an accelerating asynchronous thread in conjunction with send/receive semantics. Either of these cases requires significant effort for scalable design and implementation. Dinan et al. have presented an in-depth performance evaluation of Global Arrays using MPI-RMA [10]. Specifically, Dinan reports that MPI-RMA implementations perform 40-50% worse than comparable native ports on Blue Gene/P, Cray XT5 and InfiniBand with NWChem. This observation implies that vendors are not providing optimal implementations on high-end systems. Unfortunately, although MPI-RMA is semantically complete as a backend for PGAS models, sub-optimal implementations require us to consider alternative MPI features for designing PGAS communication subsystems.

It is expected that the NAT implementation provides the best possible performance. For the rest of the sections, we only compare the performance of the MT, PR and RMA implementations, since they provide a fair comparison against each other.

For more information on the science you see here, please contact:
Jeff Daily
Pacific Northwest National Laboratory
MS-IN: K7-90, P.O. Box 999, Richland, WA 99352
(509) 372-6548
jeff.daily@pnnl.gov

5.3 Sparse Matrix Vector Multiply
SpMV (A · y = b) is an important kernel which is used in scientific applications and graph algorithms such as PageRank. Here, we have considered a block CSR format of the A matrix and a one-dimensional RHS vector (y), which are allocated and distributed using Global Arrays. The data distribution is owner-computes, resulting in local accumulates to b. The y vector is distributed evenly among processes. A request for non-local patches of y uses the get communication primitive. The sparse matrix used in this calculation corresponds to a Laplacian operator using a 7-point stencil and 3D orthogonal grid.
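At its core, this kernel is a CSR matrix-vector product in which accesses to non-local patches of y become get operations. A plain sequential sketch (no Global Arrays; the tiny example matrix is made up for illustration):

```python
def csr_spmv(indptr, indices, data, y):
    """b = A * y with A stored in CSR form (row pointers, column
    indices, non-zero values)."""
    b = [0.0] * (len(indptr) - 1)
    for row in range(len(b)):
        # In the distributed version, y[col] for a non-local column
        # would be fetched with a get; here y is fully local.
        for k in range(indptr[row], indptr[row + 1]):
            b[row] += data[k] * y[indices[k]]
    return b

# A = [[2, 1],
#      [0, 3]] in CSR form:
indptr, indices, data = [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0]
assert csr_spmv(indptr, indices, data, [1.0, 1.0]) == [3.0, 3.0]
```

The owner-computes distribution means each process runs this loop over its own row block and accumulates into its local portion of b, so communication cost is dominated by the gets for remote y entries.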
Table 2: Different Approaches Considered in this paper and their implementations.

Approach               Symbol   Implemented   Evaluated
Native                 NAT      Yes           Yes
MPI-RMA                RMA      Yes [10]      Yes
MPI Two-Sided          TS       Yes           No
MPI Two-Sided + MT     MT       Yes           Yes
MPI Two-Sided + DPM    DPM      Yes [19]      No
MPI Progress Rank      PR       Yes           Yes
Figure 2: Software Ecosystem of Global Arrays and ComEx.
Native implementations are available for Cray, IBM, and IB
systems, but not for K computer and Tianhe-1A.

3. EXPLORATION SPACE
In this section, we begin with a description of two example PGAS
algorithms which motivate our design choices. We then present a
thorough exploration of design alternatives for using MPI as a
communication runtime for PGAS models. We first suggest the
semantically matching choice of using MPI-RMA before considering
the use of two-sided protocols. While considering two-sided
protocols, the limitations of each approach are discussed, which
motivates more complex approaches. For the rest of the paper, the
MPI two-sided and send/receive semantics are used interchangeably.

3.1 Motivating Examples
Self-consistent field (SCF) is a module from NWChem [25] which
we use as a motivating example to understand the asynchronous
nature of communication and computation within PGAS models.
NWChem is a computational quantum chemistry package which
provides multiple modules for energy calculation, varying in space
and time complexities. The SCF module is less computationally
expensive relative to other NWChem modules, requiring Θ(N^4)
computation and N^2 data movement, where N is the number of
basis sets. Higher accuracy methods such as

Procedure PROGRESS()
while running do
    flag ← Iprobe();
    if flag then
        header ← Recv();
        switch header.messageType do
            case PUT
                if IsDataInline(header) then
                    CopyInlineData(header);
                else
                    Recv(header.d . . . header.r1);
                end
                break;
            case GET
                Send(header.s . . . header.r1);
                break;
            case ACC
                LocalAcc(header.d);
                break;
            case FADD
                counter ← LocalFAdd(header.d);
                Send(counter . . . header.r1);
                break;
        endsw
    end
end
Algorithm 4: ComEx PR Progress

Algorithm 4 shows the progress function executed by the progress
ranks. The protocol processing is abstracted to hide details such as
a LocalAccumulate function, which can use high performance
libraries/intrinsics directly. The split of the processes into compute
ranks and progress ranks is shown in the accompanying figure. As
shown there, a simple configuration change would allow
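The dispatch structure of the PROGRESS procedure can be simulated without MPI, replacing Iprobe/Recv with an in-memory queue. This is only a sketch of the control flow; the header fields and function name are invented stand-ins, not ComEx internals.

```python
# Sketch of the Algorithm 4 progress loop: poll for a request header,
# then dispatch on the message type (PUT/GET/ACC/FADD). MPI
# Iprobe/Recv are replaced by a deque so the flow is self-contained.

from collections import deque

def progress(inbox, memory, replies):
    """Serve PUT/GET/ACC/FADD requests until the inbox drains."""
    while inbox:                              # stand-in for `while running`
        header = inbox.popleft()              # Iprobe() + Recv()
        kind, addr = header["type"], header["addr"]
        if kind == "PUT":                     # payload rides inline here
            memory[addr] = header["data"]
        elif kind == "GET":
            replies.append(memory[addr])      # Send(header.s .. header.r1)
        elif kind == "ACC":
            memory[addr] += header["data"]    # LocalAcc(header.d)
        elif kind == "FADD":
            replies.append(memory[addr])      # old value back to requester
            memory[addr] += header["data"]    # LocalFAdd(header.d)

memory = {0: 10, 1: 0}
replies = []
inbox = deque([
    {"type": "PUT", "addr": 1, "data": 5},
    {"type": "ACC", "addr": 0, "data": 2},
    {"type": "FADD", "addr": 0, "data": 1},
    {"type": "GET", "addr": 1},
])
progress(inbox, memory, replies)
```

Note that FADD returns the value held before the increment, matching the fetch-and-add semantics described for the Get extension earlier.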
3.3 Second Design: MPI Two-Sided

Procedure TC(p, m, v, e)
Input: job size p, my rank m, graph G=(v,e)
Data: vertex ID t
Algorithm 2: Triangle Counting
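Algorithm 2 counts triangles by fetching a vertex's neighbor list (a Get on the distributed graph), then fetching each neighbor's list and checking for common neighbors. A local sketch, with the remote Get reduced to a dict lookup and an invented function name:

```python
# Sketch of Algorithm 2: for each owned vertex t, fetch its neighbor
# list, then fetch each neighbor's list and count common neighbors.
# NeighborList's Get(v) becomes a lookup on a local adjacency map.

def count_triangles(adj, my_vertices):
    """Count common-neighbor hits seen from each owned vertex."""
    found = 0
    for t in my_vertices:
        n = adj[t]                   # NeighborList(t): remote Get in ComEx
        for vn in n:
            nn = adj[vn]             # NeighborList(vn)
            # CalculateIfTriangleExists(t, vn, nn):
            # every common neighbor of t and vn closes a triangle
            found += len(set(n) & set(nn))
    return found

# Complete graph on 3 vertices: one triangle. Scanning every vertex
# sees it once per directed edge, so the raw count is 6.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
total = count_triangles(adj, list(adj))
```

When the vertex set is partitioned across ranks, each rank only iterates its own `my_vertices`, and the per-triangle multiplicity is corrected after a global reduction.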
• An in-depth analysis of design alternatives for a PGAS
communication subsystem using MPI. We present a total of four
design alternatives: MPI-RMA (RMA), MPI Two-Sided (TS), MPI
Two-Sided with Multi-threading (MT), and MPI Two-Sided with
Dynamic Process Management (DPM).

2.1.2 MPI-RMA
MPI-RMA provides interfaces for get, put, and atomic memory
operations. MPI-RMA 3.0 allows for explicit request handles, for
window memory to be allocated by the underlying runtime, and for
windows that allocate shared memory. MPI-RMA provides multiple
synchronization modes: active, where the RMA source and target
participate in the synchronization, and passive, where only the
initiator of an RMA operation is involved in synchronization.
MPI-RMA can be implemented either using native communication
interfaces which leverage RDMA offloading, or by utilizing
optimized two-sided send/receive semantics. Either of these cases
requires significant effort for a scalable design and implementation.
The availability of generic RMA makes MPI-RMA useful for
designing PGAS communication subsystems. However, there are no
known implementations of MPI-RMA 3.0 on high-end systems (a
reference implementation of MPI-RMA 3.0 within MPICH [1] is
available). The implementations of previous specifications (such as
MPI 2.0) perform poorly in comparison to native implementations,
as shown by Dinan et al. [10].

A communication fence is implemented as a control barrier prior to
entering an MPI collective; otherwise, entering into a synchronous
collective operation would certainly cause deadlock.

As an improvement over the previous send/receive design, progress
can be made using an asynchronous thread, as shown in Figure 4.
In our proposed design of ComEx on MPI multi-threading (MT),
the asynchronous thread calls MPI_Iprobe after it has finished
servicing the send requests. We hide the non-blocking requirement
of this design by queuing requests and only testing the head of the
queue for completion before servicing the next item in the queue.
• A novel approach which uses a combination of two-sided
semantics and an automatic, user-transparent split of MPI
communicators to act as asynchronous progress ranks (PR) for
designing scalable and fast communication protocols.

3.3.6 Location Consistency
The location consistency semantics required in ComEx can be
achieved by using the buffer reuse semantics of MPI: invoking wait
on a request handle can provide similar semantics. MPI guarantees
that messages are ordered between all process pairings; by using the
same MPI tag for all communications, in conjunction with the
exclusively non-blocking send and recv requests, a series of Get,
Put and Accumulate operations is executed in the same order as
initiated by a given process.

The rest of the paper is organized as follows: In section 2, we
present some background for our work. In section 3, we present
various alternatives when using MPI for designing ComEx. In
section 4, we present our proposed design, and in section 5 a
performance evaluation. We present related work in section 6 and
conclude in section 7.

Algorithm 3 shows the pseudocode executed by the compute
processes.

Procedure GET(s, d, m, r)
Input: source address s, target address d, message size m,
target process r
r1 ← TranslateRankstoPR(r);
buf ← PrepareHeader(m);
Send(buf . . . r1);
handle ← Irecv(d . . . r1);
Wait(handle);

Procedure ACC(s, d, m, r)
r1 ← TranslateRankstoPR(r);
if m < threshold then
    buf ← InlineDataWithHeader(m);
    Send(buf . . . r1);
else
    buf ← PrepareHeader(m);
    Send(buf . . . r1);
    Send(d . . . s);
end

Procedure FADD(s, d, m, r)
r1 ← TranslateRankstoPR(r);
buf ← InlineDataWithHeader(m);
Send(buf . . . r1);
Recv(d . . . r1);
Algorithm 3: ComEx PR Get, Accumulate, and FetchAndAdd

Figure 3: One-Sided Communication Protocols using Two-Sided
Semantics in MPI. Protocols for Get, Put, and Accumulate are on
the left, middle, and right, respectively. Process Pi initiates a
request to the owner of the remote memory, Pj.
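The ACC procedure above chooses between inlining a small payload with the request header (one send) and sending the header followed by the bulk data (two sends), echoing MPI's eager/rendezvous split. A sketch of that client-side choice; the threshold value and message shapes here are invented for illustration:

```python
# Sketch of Algorithm 3's ACC send path: payloads below a threshold
# are inlined in the request header; larger payloads go as a separate
# message after the header.

THRESHOLD = 64  # bytes; an arbitrary stand-in for the inline cutoff

def acc_messages(payload):
    """Return the sequence of sends a compute rank would issue."""
    if len(payload) < THRESHOLD:
        # Small payload: rides inside the header, a single send.
        return [{"type": "ACC", "inline": True, "data": payload}]
    # Large payload: header first, then the bulk data as a second send.
    return [{"type": "ACC", "inline": False, "size": len(payload)}, payload]

small_msgs = acc_messages(bytes(8))
large_msgs = acc_messages(bytes(128))
```

Inlining saves a round of matching on the progress-rank side for small accumulates, at the cost of one extra copy into the header buffer.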
• Implementation of the TS, MT, and PR approaches and their
integration with ComEx - the communication runtime for Global
Arrays. We perform an in-depth evaluation on a spectrum of
communication benchmarks, application kernels, and a full
application, NWChem.

2. BACKGROUND
In this section, we introduce the various features of MPI which
influence our design decisions.

2.1 Message Passing Interface
MPI [13, 12] is a programming model which provides a CSP
execution model with send/receive semantics and a Bulk
Synchronous Parallel (BSP) model with MPI-RMA. In addition,
MPI provides a rich set of collective communication primitives and
derived data types. The MPI specification is being extended to
support multi-threading, group aware communication,
non-cache-coherent architectures and generic non-SPMD execution
models. ComEx has been designed to use OpenFabrics Verbs
(OFA) for InfiniBand [26] and RoCE interconnects, Distributed
Memory Applications (DMAPP) for the Cray Gemini interconnect
[30, 28], and PAMI [29] for x86, PERCS, and Blue Gene/Q
interconnects.

2.1.1 Semantics

2.1.3 Multi-Threading
One of the most important features of MPI is supporting
multithreaded communication. MPI supports multiple thread levels
(single, funneled, serialized, and multiple). The multiple mode is
least restrictive: it allows an arbitrary number of threads to make
MPI calls simultaneously, and it is the most commonly used
threaded model in MPI. In the design section, we explore the
possibility of using the thread multiple mode as an option for PGAS
communication. To eliminate the non-determinism that results from
locking in the critical path, the asynchronous thread has to
frequently relinquish the lock by using sched_yield; if sched_yield
is not used, the resulting performance is non-deterministic. Using a
separate communicator for the communication between each
passive process-thread and targeting thread-process reduces locking
overhead in the MPI runtime. However, even with this optimization,
it is not possible to completely remove locking from the critical
path.

3.3.3 Other Atomic Memory Operations
Atomic Memory Operations (AMOs) such as fetch-and-add are
critical operations in scaling applications such as NWChem [18].

Table 1 shows the translation of ComEx communication primitives
to their corresponding two-sided primitives. The purpose of this
equivalence is to qualitatively analyze the performance degradation
in comparison to the native ports. For small messages, the Put
primitive is expected to use the eager MPI protocol, which involves
a copy by each of the sender and the receiver.

Motivating Applications
The above algorithm is entirely dependent on the set of tasks (t)
which a process gets during execution. As a result, there is little
temporal locality in NWChem. A process only uses the indices (i)
corresponding to the task id (t).

Primary Issue: Communication Progress
The primary problem with MPI two-sided is the general need for
progress for all operations, but especially for ACCUMULATE and
NEXTTASK, a load-balancing counter based on fetch-and-add.
Process Pi initiates a request to process Pj, which is handled
asynchronously by thread Pj,1.
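NEXTTASK is described above as a load-balancing counter built on fetch-and-add. The sketch below shows the idea with threads standing in for ranks; the `TaskCounter` class and its lock are illustrative only, since ComEx performs the fetch-and-add on a remote process rather than under a local lock.

```python
# Sketch of a NEXTTASK-style counter: each call atomically fetches
# the current task id and advances it, so callers draw disjoint task
# ids without any other coordination.

import threading

class TaskCounter:
    """Atomic fetch-and-increment of a shared task id."""
    def __init__(self, start=0):
        self._next = start
        self._lock = threading.Lock()

    def next_task(self):
        with self._lock:          # stands in for a remote fetch-and-add
            t = self._next
            self._next += 1
            return t

# Two workers drain 50 tasks each; every id is handed out exactly once.
counter = TaskCounter()
got = []
def worker():
    for _ in range(50):
        got.append(counter.next_task())
workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The atomicity is exactly what makes the counter useful for dynamic load balancing: no two ranks can ever receive the same task id.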
What is the best way to design a partitioned global address space
(PGAS) communication subsystem given that MPI is our only
choice?

Contributions:

Design: Exploration Space

Solution: Progress Ranks for Shared Memory Domains

Evaluation of Approach
3.3.1 Put

3.3.2 Get