1.1 Contributions

Specifically, this paper makes the following contributions:

• An in-depth analysis of design alternatives for a PGAS communication subsystem using MPI. We present a total of four design alternatives: MPI-RMA (RMA), MPI Two-Sided (TS), MPI Two-Sided with Multi-threading (MT), and MPI Two-Sided with Dynamic Process Management (DPM). In the next section, we discuss each of these alternatives in detail.

• A novel approach which uses a combination of two-sided semantics and an automatic, user-transparent split of MPI communicators to act as asynchronous progress ranks (PR) for designing scalable and fast communication protocols.

• Implementation of the TS, MT, and PR approaches and their integration with ComEx - the communication runtime for Global Arrays. We perform an in-depth evaluation on a spectrum of software including communication benchmarks, application kernels, and a full application, NWChem [18].

Our performance evaluation reveals that the PR approach outperforms each of the other MPI-based approaches. We achieve a speedup of 2.17x on NWChem, 1.31x on sparse matrix-vector multiply, and 1.14x on graph community detection, using up to 2K processes on two high-end systems.

The rest of the paper is organized as follows: In section 2, we present some background for our work. In section 3, we present various alternatives when using MPI for designing ComEx. In section 4, we present our proposed design, and we present a performance evaluation in section 5. We present related work in section 6 and conclude in section 7.

3.2 First Design: MPI-RMA

MPI-RMA 2.0 supports the one-sided operations get, put and atomic memory operations in addition to supporting active and passive synchronization modes. MPI-RMA 3.0 has several features which facilitate the design and implementation of scalable communication runtime systems for PGAS models. It allows non-blocking RMA requests, request-based transfers, window-based memory allocation, data type communication, and multiple synchronization modes. MPI-RMA 3.0 is semantically complete and suitable for designing scalable PGAS communication subsystems.

MPI-RMA has completely different semantics than the popularly used send/receive and collective communication interface. An important implication is that an optimal design of MPI-RMA needs a completely different approach than two-sided semantics. Since MPI-RMA has achieved low acceptance in comparison to two-sided semantics [10], most vendors choose to only provide a compatibility port due to resource limitations. At the same time, PGAS communication runtimes such as ComEx and GASNet [17] are tailored to serve the needs of their respective PGAS models. As an example, UPC [17] and Co-Array Fortran [22] need active messages for linked data structures, which are not well supported by MPI-RMA. Similarly, Global Futures [8] - an extension of Global Arrays to perform locality-driven execution - needs active messages for scaling and minimizing data movement.

MPI-RMA can be implemented either using native communication interfaces which leverage RDMA offloading, or by utilizing an accelerating asynchronous thread in conjunction with send/receive semantics. Either of these cases requires significant effort for scalable design and implementation. Dinan et al. have presented an in-depth performance evaluation of Global Arrays using MPI-RMA [10]. Specifically, Dinan reports that MPI-RMA implementations perform 40-50% worse than comparable native ports on Blue Gene/P, Cray XT5 and InfiniBand with NWChem. This observation implies that vendors are not providing optimal implementations on high-end systems. Unfortunately, although MPI-RMA is semantically complete as a backend for PGAS models, sub-optimal implementations require us to consider alternative MPI features for designing PGAS communication subsystems.

3.3 Second Design: MPI Two-Sided

A possible design of ComEx using MPI send/receive semantics can be done by carefully optimizing RMA operations using MPI two-sided semantics. In this design, every process must service requests for data while at the same time performing computation and initiating communication requests on behalf of the calling process. As a result, this design is never allowed to make synchronous requests; all operations must be non-blocking. Otherwise, deadlock is inevitable. Furthermore, the ComEx operations must also be non-blocking to facilitate progress while servicing requests. ComEx does not provide an explicit progress function, so progress can only be made when any other ComEx function is called. We consider design issues such as these while mapping one-sided semantics onto two-sided semantics in the following sections.

MPI send/receive primitives typically use eager and rendezvous protocols for transferring small and large messages, respectively. For high performance interconnects such as InfiniBand [27], Cray Gemini [28] and Blue Gene/Q [29], the eager protocol involves a copy by both the sender and the receiver, while large messages use a zero-copy mechanism such as Remote Direct Memory Access (RDMA). Figure 1 shows these communication protocols.

Figure 1: Typical Communication Protocols in MPI. The left side shows the eager protocol; the right side shows the rendezvous protocol with RTS (request to send), data, and FIN (finish) messages. Variants such as Receiver-initiated [23] and Write-based [25] protocols exist in the literature and in practice.

3.3.1 Put/Accumulate Operations

In PGAS models like Global Arrays, blocking and non-blocking Put operations can be designed using MPI_Send and MPI_Isend primitives, respectively, issued from the source process. Due to the semantics of send/receive, the destination process must initiate a receive to complete the pair-wise send/receive operation. In the case of accumulate, the destination process must also perform a local accumulate after receiving the data. Figure 3 illustrates this design.
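For illustration only (not the ComEx source), the sketch below shows how a non-blocking Put could be layered on MPI_Isend under this design. The header layout, the tag, and the assumption that the payload follows the header in a single message are hypothetical.

/* Minimal sketch: a Put expressed as one MPI_Isend of header + payload. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define PUT_TAG 100            /* hypothetical tag reserved for Put requests */

typedef struct {
    void  *target_addr;        /* remote address the owner should write to */
    size_t nbytes;             /* payload size in bytes */
} put_header_t;

/* Initiates a Put of 'nbytes' from 'src' to 'target_addr' owned by 'owner'.
 * The caller keeps the staging buffer alive until the request completes. */
MPI_Request put_nb(const void *src, void *target_addr, size_t nbytes,
                   int owner, MPI_Comm comm, void **staging_out)
{
    put_header_t hdr = { target_addr, nbytes };
    char *staging = malloc(sizeof hdr + nbytes);   /* header + payload */
    memcpy(staging, &hdr, sizeof hdr);
    memcpy(staging + sizeof hdr, src, nbytes);

    MPI_Request req;
    MPI_Isend(staging, (int)(sizeof hdr + nbytes), MPI_BYTE, owner,
              PUT_TAG, comm, &req);
    *staging_out = staging;    /* freed by the caller after MPI_Wait */
    return req;
}

The destination process would have to post a matching receive and, for an accumulate, apply the payload to its local memory, which is exactly the progress obligation discussed next.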
3.3.2 Get Operations

An MPI get operation can be designed such that the remote process (the process which owns the memory from where the data is to be read) participates in the get operation implicitly by servicing the get request. A possible implementation would use a combination of MPI probe, receive and send, in that order. Figure 3 illustrates this design.

Figure 3: One-Sided Communication Protocols using Two-Sided Communication Protocols in MPI. Protocols for Get, Put and Accumulate are on the left, middle, and right, respectively.

3.3.3 Other Atomic Memory Operations

Atomic Memory Operations (AMOs) such as fetch-and-add are critical operations in scaling applications such as NWChem [18]. The AMOs are used, for example, for load balancing in these applications. AMOs can be implemented by a simple extension of the Get operation: in addition to servicing a get request, the remote process also performs an atomic operation and provides the original value before the increment. The accumulate operations do not need to return the original value to the initiator.

3.3.4 Synchronization

ComEx supports location consistency with an active mode of synchronization [11]. A ComEx synchronous barrier is both a communication barrier (fence) as well as a control barrier, e.g. as in MPI_Barrier. This can be achieved by using termination messages. While synchronizing, all other external requests are also serviced. It is important to note that each process needs to receive a termination message from every other process. A collective operation such as a barrier/allreduce cannot be used for memory synchronization, since it does not provide pair-wise causality.

3.3.5 Collective Operations

ComEx does not attempt to reimplement the already highly optimized MPI collective operations such as allreduce. However, since this design requires all operations to be non-blocking, entering into a synchronous collective operation would certainly cause deadlock. The design must therefore perform a control barrier, while continuing to service requests, prior to entering an MPI collective.

3.4 Third Design: MPI Send/Receive with Multi-threading

Multi-threading support is a feature which allows multiple threads to make MPI calls with different threading modes. It is an important feature in the multi-core era to facilitate hierarchical decomposition of data and computation on deep memory hierarchies. Shared address space programming models such as OpenMP provide efficient support for multi-core/many-core architectures. MPI thread multiple mode [2] allows invocation of MPI routines from any thread. The computation model can be broadly classified in terms of symmetric and asymmetric work being performed by the threads; the symmetric model may require different thread support levels, depending upon algorithm design. As an example, a stencil computation can be performed using a thread multiple model (each thread reads/updates its individual edge) or a thread serialized model (one thread coalesces reads/updates and sends them out).

As an improvement over the previous send/receive design, progress can be made using an asynchronous thread, as shown in Figure 4. In our proposed design of ComEx on MPI multi-threading (MT), the asynchronous thread calls MPI_Iprobe after it has finished servicing the pending requests, and a separate communicator is used for the thread-process pair, which reduces locking overhead in the MPI runtime. However, even with this optimization it is not possible to completely remove locking from the critical path: the lock(s) used by the MPI progress engine are abstracted for portability, which results in non-deterministic performance with the asynchronous thread design. Since the asynchronous thread frequently invokes MPI_Iprobe (even on a separate communicator), it has to frequently relinquish the lock by using sched_yield; at the same time, if sched_yield is not used, the resulting performance is non-deterministic. To eliminate the non-determinism caused by locking in these critical sections, a possibility is to use dynamic process management, which we consider next.

Figure 4: One-Sided Communication Protocols using Two-Sided Communication Protocols in MPI with Multi-threading (MT). Protocols for Get, Put and Accumulate are on the left, middle, and right, respectively. A request from process Pi to process Pj is handled asynchronously by thread Pj,1.
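A minimal sketch of the MT idea, under stated assumptions: request MPI_THREAD_MULTIPLE, run an asynchronous progress thread that probes a dedicated communicator, and yield between probes. The tag handling and the simple byte-buffer service routine are illustrative, not the ComEx implementation.

/* MT sketch: progress thread with MPI_Iprobe and sched_yield. */
#include <mpi.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

static MPI_Comm progress_comm;     /* separate communicator for the thread */
static volatile int running = 1;

static void service_request(const MPI_Status *st)
{
    int count = 0;
    MPI_Get_count(st, MPI_BYTE, &count);
    char *buf = malloc(count > 0 ? (size_t)count : 1);
    MPI_Recv(buf, count, MPI_BYTE, st->MPI_SOURCE, st->MPI_TAG,
             progress_comm, MPI_STATUS_IGNORE);
    /* ... decode the request header and perform the put/get/accumulate ... */
    free(buf);
}

static void *progress_thread(void *arg)
{
    (void)arg;
    while (running) {
        int flag = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, progress_comm, &flag, &st);
        if (flag)
            service_request(&st);
        else
            sched_yield();   /* relinquish the MPI runtime lock between probes */
    }
    return NULL;
}

int start_mt_runtime(int *argc, char ***argv)
{
    int provided;
    MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        return -1;                                  /* MT design not possible */
    MPI_Comm_dup(MPI_COMM_WORLD, &progress_comm);   /* isolate probe traffic */
    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);
    return 0;
}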

2.2 Communication Runtime for Exascale (ComEx)

The Communication Runtime for Exascale (ComEx) is a successor of the Aggregate Remote Memory Copy Interface (ARMCI). ComEx uses native interfaces for facilitating one-sided communication primitives in Global Arrays. As an example, ComEx has been designed to use Openfabrics Verbs (OFA) for InfiniBand [26] and RoCE Interconnects, Distributed Memory Applications (DMAPP) for Cray Gemini Interconnect [30, 28], and PAMI
for x86, PERCS, and Blue Gene/Q Interconnects [29]. The specification is being extended to support multi-threading, group aware
communication, non-cache-coherent architectures and generic
active messages. ComEx provides abstractions for RMA operations such as get, put and atomic memory operations and provides
location consistency [11]. Figure 2 shows the software ecosystem
of Global Arrays and ComEx.

We abstract the primary computation and communication structure in NWChem and focus on the main elements relevant to this paper. Algorithm 1 abstracts the communication and computation aspects of NWChem. It uses several PGAS abstractions: GET, ACCUMULATE, and NEXTTASK, which is a load-balancing counter based on fetch-and-add. The algorithm is entirely dependent on the set of tasks (t) which a process gets during execution. As a result, there is little temporal locality in NWChem. A process only uses the indices (i) corresponding to the task ID (t) to index
into distributed data structures and does not address the processes
directly. The scalability of NWChem is dependent upon the acceleration of these routines. NWChem uses Global Arrays [20] which
itself uses the native ports of ComEx, when available. As an example, the GET operation and NEXTTASK are accelerated on native
ports using RDMA and network-based atomics, respectively. The
non-native ports based on MPI two-sided must provide progress for
these operations, as they cannot be accelerated using RDMA (without invoking a protocol) or network atomics. The MPI-RMA based
port [10] can provide the acceleration. However, they typically provide sub-optimal implementation and performance as reported by
Dinan et al. [10].
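Because NEXTTASK is a simple fetch-and-add on a shared counter, the native and MPI-RMA based ports can offload it to the network. For illustration only (not the paper's implementation), such a counter can be expressed with MPI-RMA; hosting the counter on rank 0 and using a lock-all passive-target epoch are assumptions made here.

/* Illustrative NEXTTASK-style counter built on MPI-3 RMA atomics. */
#include <mpi.h>

static MPI_Win  task_win;
static long    *task_counter;     /* window memory; counter lives on rank 0 */

void nexttask_init(MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Aint size = (rank == 0) ? (MPI_Aint)sizeof(long) : 0;
    MPI_Win_allocate(size, sizeof(long), MPI_INFO_NULL, comm,
                     &task_counter, &task_win);
    if (rank == 0) *task_counter = 0;
    MPI_Win_lock_all(0, task_win);     /* passive-target access epoch */
    MPI_Barrier(comm);                 /* counter initialized before first use */
}

long nexttask(void)
{
    long one = 1, previous = 0;
    /* Atomically fetch the old value and add one at rank 0, displacement 0. */
    MPI_Fetch_and_op(&one, &previous, MPI_LONG, 0, 0, MPI_SUM, task_win);
    MPI_Win_flush(0, task_win);        /* complete the operation at the target */
    return previous;
}

void nexttask_finalize(void)
{
    MPI_Win_unlock_all(task_win);
    MPI_Win_free(&task_win);
}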

Communications Runtime for Exascale (ComEx): PGAS Models using an MPI Runtime: Design Alternatives and Performance Evaluation
Jeff Daily, Abhinav Vishnu, Bruce Palmer, Hubertus van Dam

ComEx (Communication Runtime for Exascale) - Scalable Protocols Layer:

• User-transparent split of the world communicator
• One master rank per shared memory domain
• For performance reasons, uses two-sided semantics
• Asynchronous progress by master rank
The upcoming section provides details of our proposed approach, which is subsequently referred to as the Progress Rank (PR) based approach for the rest of this paper. A simple configuration change allows a user-defined number of progress ranks on a node, without any source code change in the application.
Figure 2: Software Ecosystem of Global Arrays and ComEx. Native implementations are available for Cray, IBM, and IB systems, but not for K computer and Tianhe-1A.

• PR approach consistently outperforms the other approaches
• 2.17x speed-up on 1008 processors over the other MPI designs

[Performance charts (legend: RMA, PR, MT): relative speedup over RMA and time (ms), including the triangle-calculation Get breakdown.]

3.5 Fourth Design: MPI Two-Sided with Dynamic Process Management

DPM is an MPI feature which allows an MPI process to spawn new processes dynamically. Using DPM, a new inter-communicator can be created which is used for communication. An advantage of such an approach is that it alleviates the need for multi-threading, and yet it provides asynchronous progress by spawning new processes. A possible approach is to spawn a small number (x) of processes per node and to use them for asynchronous progress. The original ranks on a node and the spawned processes would then attach to the same shared memory region in order to have the spawned processes make progress on RMA requests. This approach is very similar to the approach proposed by Krishnan et al. [19].

Unfortunately, dynamic process management is not available on most high-end systems. As an example, the Cray Gemini system used in our evaluation does not support dynamic process management even though the system has been in production use for two years. DPM requires support from the process manager; however, many process managers do not support it since it is not commonly used in MPI applications.

In our proposed design, ComEx performs an automatic, user-transparent split of the world communicator so that one rank per shared memory domain acts as a progress rank. Figure 5 shows the separation of the processes into compute ranks and progress ranks. Each progress rank (PR) is responsible for performing RMA operations on the memory hosted by the user-level processes within its shared memory domain, making asynchronous progress on their behalf. A user-level process communicates with the PR rank of the target's shared memory domain for off-node RMA requests; on-node requests are performed using shared memory.

Figure 5: Translation of communication operations in the PR approach. Left: a typical node with two PR ranks (blue and yellow circles); the blue PR is responsible for performing RMA operations on the memory hosted by the green user-level processes. Right: a user-level process Pi communicates with PR rank Pk for the RMA targeting Pj; Pj and Pk reside within the same shared memory domain.

Figure 6: One-Sided Communication Protocols using Two-Sided Communication Protocols in MPI with Progress Ranks. Protocols for Get, Put, and Accumulate are on the left, middle, and right, respectively. Process Pi initiates a request to the progress rank Pk for the RMA targeting Pj.

Table 1 shows the translation of ComEx communication primitives to their corresponding MPI two-sided primitives with respect to time complexity. The purpose of this equivalence is to qualitatively analyze the performance degradation in comparison to the native ports.

Table 1: Translation of ComEx communication primitives to their corresponding MPI two-sided primitives.
  Put:  Tput = Tsend
  Acc:  Tacc = Tsend + Tlocalacc
  AMO:  Tamo = Tsend + Tlocalacc + Trecv
  Get:  Tget = Tsend + Trecv

For small messages, the Put primitive is expected to use the eager MPI protocol, which involves a copy by each of the sender and the receiver. For large messages, a zero-copy mechanism is used in MPI with a rendezvous protocol; hence Tput ≈ m · G, which is equivalent to the performance of the native ports. Using a similar analysis, Tget for large messages is expected to be similar to the native ports. Small-message get transfers are impacted by copies on both sides: the protocol for get uses a combination of a send and a receive on each side, hence

  Tget = TRDMAGet + 4β,

where β is the memory copy cost on each side. RDMA-enabled networks such as InfiniBand [26] and Gemini [30] provide an RDMA get latency of about 1 µs, hence the impact of the memory copies can be non-trivial on the latency for small messages.

For large put and accumulate messages requiring a rendezvous protocol, the sending process will not complete the transfer until the target process has initiated a receive. Unfortunately, the target process cannot make progress on requests unless it has also called into the ComEx library by making a request of its own. The performance of a compute-intensive large-message application would therefore degrade using this design, unless asynchronous progress could be made.

The FetchAndAdd operation is translated as an irecv and a send on the initiator side, with a PR rank needing to perform a recv of the request, a local compute, and a send of the initial value back to the initiator. By using two-sided semantics, our approach cannot take advantage of hardware atomics support on the NIC, such as those available for Cray Gemini [30] and InfiniBand [27]. The accumulate and put operations are bounded by the performance of Tsend + Tlocalacc, i.e. the local accumulate function. For large accumulates, we expect the performance to be similar to a native port implementation, since there are no known network hardware implementations of arbitrary-size accumulates.
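To make the copy overhead in the get cost model above concrete, here is a small worked illustration; the value of β is an assumption chosen only for illustration, not a measurement from this work.

% Worked illustration (beta is assumed, not measured):
% T_{RDMAGet} \approx 1\,\mu s as quoted above; assume \beta = 0.25\,\mu s per copy.
\[
  T_{get} = T_{RDMAGet} + 4\beta \approx 1\,\mu s + 4 \times 0.25\,\mu s = 2\,\mu s
\]
% i.e. under this assumption the four memory copies roughly double the
% small-message get latency relative to a native RDMA get.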
5. PERFORMANCE EVALUATION

We present a performance evaluation of the approaches discussed in the previous sections using a set of communication benchmarks, a graph kernel, an SpMV kernel, and a full application, NWChem. Table 2 shows the various design alternatives considered in this paper and indicates whether they were considered for evaluation. The TS approach is not evaluated because it is many orders of magnitude slower than the rest of the implementations considered in this paper: it requires explicit user-level intervention for RMA progress, and even on a moderately sized system with NWChem we noticed a 10-15x performance degradation in comparison to the native approach. The DPM approach is not evaluated because dynamic process management is not supported on the high-end systems considered for evaluation in this paper. For the rest of the implementations, we used process/thread pinning with no over-subscription.

5.1 Experimental Testbed

We have used two production systems for performance evaluation. NERSC Hopper is a Cray XE6 system with 6,384 compute nodes, each made up of two twelve-core AMD MagnyCours processors; the compute nodes are connected in a 3D torus topology with the Cray Gemini Interconnect. We used the default Cray MPI library on this system, which is referred to as Hopper for the rest of the paper. The PNNL Institutional Computing Cluster (PIC) is a 604 node cluster with each node consisting of two sixteen-core AMD Interlagos processors, where the compute nodes are connected using a QLogic InfiniBand QDR network. We used MVAPICH2 for the MPI library on this system, which is referred to as IB for the rest of the performance evaluation.
Acknowledgments

This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

References

[1] M. Valiev, E.J. Bylaska, N. Govind, K. Kowalski, T.P. Straatsma, H.J.J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T.L. Windus, and W.A. de Jong. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, 181(9):1477-1489, 2010.

[2] Scott Pakin. Receiver-initiated message passing over RDMA networks. In IPDPS, pages 1-12, 2008.

[3] Sayantan Sur, Hyun-Wook Jin, Lei Chai, and Dhabaleswar K. Panda. RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In PPoPP, pages 32-39, 2006.


Triangle Counting (TC), among other graph algorithms, exhibits irregular communication patterns and can be designed using PGAS models. In graphs with R-MAT structure, such as the twitter and facebook graphs, it is frequently important to detect communities. An important method to detect communities is by finding cliques in the graphs. Since CLIQUE is an NP-complete problem, a popular heuristic is to calculate cliques with a size of three, which is equivalent to finding triangles in a graph. We show an example of community detection in natural (power-law) graphs, where the algorithm needs to calculate the number of triangles in a given graph. The edges are easily distributed using a compressed sparse row (CSR) format, and the vertices are divided equally among the processes.

Algorithm 2: Triangle Counting

Procedure TC(p, m, v, e)
Input: job size p, my rank m, graph G=(v,e)
Data: vertex ID t
  t <- m * v/p;
  while t < (m + 1) * v/p do
    n <- NeighborList(t);
    for vn in n do
      nn <- NeighborList(vn);
      CalculateIfTriangleExists(t, vn, nn);
    end
  end

Procedure NeighborList(v)
Input: vertex v
  f <- Get(v);
  return f

As shown in Algorithm 2, the NEIGHBORLIST function translates to GET. Since the computation is negligible, the runtime is bound by the time for the GET function. Other kernels, such as the Sparse Matrix-Vector Multiply (SpMV) kernel, which are frequently used in scientific applications and graph algorithms, exhibit similar structure; a PGAS incarnation of the SpMV kernel is largely dependent on the performance of the GET routine. An understanding of these motivating examples can help us in designing scalable communication runtime systems for PGAS models using MPI. We begin with a possible design choice using MPI-RMA, followed by other choices.
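For illustration only (not the paper's code), the following serial sketch shows the per-vertex work of Algorithm 2 over a small CSR graph; in the distributed PGAS version, each NeighborList call on a vertex owned by another process would become a GET.

/* Serial sketch of the triangle-counting kernel over a CSR graph. */
#include <stdio.h>

/* Tiny example graph in CSR form (undirected, 4 vertices, sorted lists). */
static const int row_ptr[] = {0, 2, 5, 8, 10};
static const int col_idx[] = {1, 2,  0, 2, 3,  0, 1, 3,  1, 2};
#define NVERT 4

static int count_common(const int *a, int alen, const int *b, int blen)
{
    int i = 0, j = 0, common = 0;
    while (i < alen && j < blen) {
        if (a[i] == b[j]) { common++; i++; j++; }
        else if (a[i] < b[j]) i++;
        else j++;
    }
    return common;
}

int main(void)
{
    long incidences = 0;
    for (int t = 0; t < NVERT; t++) {                 /* vertices owned locally */
        const int *n = &col_idx[row_ptr[t]];          /* NeighborList(t) */
        int nlen = row_ptr[t + 1] - row_ptr[t];
        for (int k = 0; k < nlen; k++) {
            int vn = n[k];
            const int *nn = &col_idx[row_ptr[vn]];    /* NeighborList(vn): a GET
                                                         when vn is remote */
            int nnlen = row_ptr[vn + 1] - row_ptr[vn];
            incidences += count_common(n, nlen, nn, nnlen);
        }
    }
    /* Each triangle is counted six times in this double loop. */
    printf("triangles: %ld\n", incidences / 6);
    return 0;
}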
5.2 Simple Communication Benchmarks

The purpose of the simple communication benchmarks is to understand the raw performance of the communication primitives when the processes are well synchronized. Figures 7 and 8 show the Get communication bandwidth performance. As expected, NAT provides the best performance for Gemini and IB. The PR implementation on Hopper is based on MPI/uGNI, while the native implementation is based on DMAPP [28], so the difference in peak bandwidth for PR and NAT can be attributed to the use of different communication libraries. The RMA implementation provides sub-optimal performance at all message sizes in comparison to the NAT and PR implementations. The MT implementation performs poorly in comparison to the PR implementation, primarily due to lock contention. On IB, the PR and RMA implementations perform similarly, and the MT implementation, which uses the default threading mode, consistently performs sub-optimally in comparison to the other approaches. Similar trends are observed for the Put communication primitives in Figure 9, with a drop at 8 Kbytes for the RMA, PR and MT implementations due to the change from the eager to the rendezvous protocol at that message size.

Figure 7: Gemini, Get Performance. [Figures 7-9 plot Get and Put bandwidth (MB/s) versus message size (16 bytes to 256 Kbytes) for the NAT, RMA, PR and MT implementations on Gemini and InfiniBand.]

It is expected that the NAT implementation provides the best possible performance. For the rest of the sections, we only compare the performance of the MT, PR and RMA implementations, since they provide a fair comparison against each other.

[Application results: SpMV and Triangle Counting times on Gemini and InfiniBand at 512, 1,024, 2,048 and 4,096 processes, and NWChem CCSD(T) speedup of PR relative to RMA (Total, Acc, Get, AddPatch, SCF, CCSD(T)) on PIC-2,040, Hopper-1,008 and PIC-1,020. MT did not finish execution for these runs.]

5.3 Sparse Matrix Vector Multiply

SpMV (A · y = b) is an important kernel which is used in scientific applications and in graph algorithms such as PageRank. Here, we have considered a block CSR format for the A matrix and a one-dimensional RHS vector (y), which are allocated and distributed using Global Arrays. The data distribution is owner-computes, resulting in local accumulates to b. The vector y is distributed evenly among processes, and a request for a non-local patch of y uses the get communication primitive. The sparse matrix used in this calculation corresponds to a Laplacian operator using a 7-point stencil on a 3D orthogonal grid.
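For illustration only, the serial sketch below shows the structure of this kernel over a CSR matrix; in the PGAS version, reading y for a column owned by another process would be a get, and b is updated with local accumulates.

/* Serial sketch of SpMV (A*y = b) over a CSR matrix. */
#include <stdio.h>

#define N 4
static const int    row_ptr[] = {0, 2, 4, 6, 8};
static const int    col_idx[] = {0, 1, 0, 1, 2, 3, 2, 3};
static const double val[]     = {2, -1, -1, 2, 2, -1, -1, 2};

int main(void)
{
    const double y[N] = {1, 2, 3, 4};
    double b[N] = {0};

    for (int i = 0; i < N; i++) {                 /* rows owned by this rank */
        double acc = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += val[k] * y[col_idx[k]];        /* y[col]: a GET if remote */
        b[i] += acc;                              /* owner-computes accumulate */
    }

    for (int i = 0; i < N; i++)
        printf("b[%d] = %g\n", i, b[i]);
    return 0;
}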
This result affirms that optimized two-sided communication semantics are sufficient for implementing a one-sided communication subsystem, while suggesting that native one-sided support can be readily improved in the future using the existing MPI interface.

For more information on the science you see here, please contact:
Jeff Daily
Pacific Northwest National Laboratory
P.O. Box 999, MS-IN: K7-90
Richland, WA 99352
(509) 372-6548
jeff.daily@pnnl.gov

Algorithm 1: Self-Consistent Field

Procedure SCF(m, n)
Input: my rank m, total tasks n
Data: current task ID t, data indices i, data d
  t <- m;
  while t < n do
    i <- CalculateIndices(t);
    d <- Get(i);
    FockMatrix(t, i, d);
    Accumulate(t, i, d);
    t <- NextTask();
  end

Table 2: Different approaches considered in this paper and their implementations.
  Approach              Symbol  Implemented  Evaluated
  Native                NAT     Yes          Yes
  MPI-RMA               RMA     Yes [10]     Yes
  MPI Two-Sided         TS      Yes          No
  MPI Two-Sided + MT    MT      Yes          Yes
  MPI Two-Sided + DPM   DPM     Yes [19]     No
  MPI Progress Rank     PR      Yes          Yes

3. EXPLORATION SPACE

In this section, we begin with a description of two example PGAS algorithms which motivate our design choices. We then present a thorough exploration of design alternatives for using MPI as a communication runtime for PGAS models. We first suggest the semantically matching choice of using MPI-RMA before considering the use of two-sided protocols. While considering two-sided protocols, the limitations of each approach are discussed, which motivates more complex approaches. For the rest of the paper, the terms MPI two-sided and send/receive semantics are used interchangeably.

3.1 Motivating Examples

Self-consistent field (SCF) is a module from NWChem [25] which we use as a motivating example to understand the asynchronous nature of communication and computation within PGAS models. NWChem is a computational quantum chemistry package which provides multiple modules for energy calculation varying in space and time complexity. The self-consistent field (SCF) module is less computationally expensive relative to other NWChem modules, requiring Θ(N^4) computation and N^2 data movement, where N is the number of basis sets. Higher accuracy methods such as Coupled Cluster, Singles and Doubles (CCSD) and Triples (T) require N^6 and N^7 computation, respectively.

Algorithm 4: ComEx PR Progress
Input: source address s, target address d, message size m, target process r

Procedure PROGRESS()
  while running do
    flag <- Iprobe();
    if flag then
      header <- Recv();
      switch header.messageType do
        case PUT
          if IsDataInline(header) then
            CopyInlineData(header);
          else
            Recv(header.d ... header.r1);
          end
          break;
        case GET
          Send(header.s ... header.r1);
          break;
        case ACC
          LocalAcc(header.d);
          break;
        case FADD
          counter <- LocalFAdd(header.d);
          Send(counter ... header.r1);
        end
      endsw
    end
  end
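A minimal sketch of the progress loop shape in Algorithm 4, expressed with MPI. The request header layout, the tags, the double-typed accumulate, and the assumption that the address in the header is directly usable by the progress rank (e.g. via an attached shared memory segment) are all illustrative assumptions.

/* Sketch of a PR rank's progress loop: probe, receive a header, dispatch. */
#include <mpi.h>

enum { MSG_PUT, MSG_GET, MSG_ACC, MSG_FADD };
#define REQ_TAG  200   /* hypothetical tag for request headers  */
#define DATA_TAG 201   /* hypothetical tag for payload messages */

typedef struct {
    int    type;       /* one of MSG_* */
    void  *addr;       /* address within this shared memory domain */
    int    nbytes;     /* payload size */
    int    reply_to;   /* rank expecting data or the old value */
} request_t;

void progress_loop(MPI_Comm comm, volatile int *running)
{
    while (*running) {
        int flag = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, REQ_TAG, comm, &flag, &st);
        if (!flag) continue;

        request_t h;
        MPI_Recv(&h, sizeof h, MPI_BYTE, st.MPI_SOURCE, REQ_TAG, comm,
                 MPI_STATUS_IGNORE);

        switch (h.type) {
        case MSG_PUT:   /* payload follows in a second message */
            MPI_Recv(h.addr, h.nbytes, MPI_BYTE, h.reply_to, DATA_TAG, comm,
                     MPI_STATUS_IGNORE);
            break;
        case MSG_GET:   /* send the requested bytes back to the initiator */
            MPI_Send(h.addr, h.nbytes, MPI_BYTE, h.reply_to, DATA_TAG, comm);
            break;
        case MSG_ACC: { /* receive payload, then accumulate locally */
            double tmp[1024];                 /* assumes payload fits staging */
            int n = h.nbytes / (int)sizeof(double);
            MPI_Recv(tmp, h.nbytes, MPI_BYTE, h.reply_to, DATA_TAG, comm,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < n; i++) ((double *)h.addr)[i] += tmp[i];
            break;
        }
        case MSG_FADD: { /* return the old value, then increment */
            long old = *(long *)h.addr;
            *(long *)h.addr += 1;
            MPI_Send(&old, 1, MPI_LONG, h.reply_to, DATA_TAG, comm);
            break;
        }
        }
    }
}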


2. BACKGROUND

In this section, we introduce the various features of MPI and ComEx which influence our design decisions.

2.1 Message Passing Interface

MPI [13, 12] is a programming model which provides a CSP execution model with send/receive semantics and a Bulk Synchronous Parallel (BSP) model with MPI-RMA. In addition, MPI provides a rich set of collective communication primitives, derived data types, multi-threading, and dynamic process management.

2.1.1 Semantics

Send/receive and collective communication are the most commonly used primitives in designing parallel applications and higher level libraries. The two-sided semantics require an implicit synchronization between sender and receiver, where messages are matched using a combination of a tag (message identifier) and a communicator (group of processes). MPI allows a receiver to specify a wildcard tag (allowing it to receive a message with any tag) and a wildcard source (allowing it to receive a message from any source). Send/receive models are frequently combined with collective communication and generic non-SPMD execution.

2.1.2 MPI-RMA

MPI-RMA provides interfaces for get, put, and atomic memory operations. MPI-RMA 3.0 allows for explicit request handles, for window memory to be allocated by the underlying runtime, and for windows that allocate shared memory. MPI-RMA provides multiple synchronization modes: active, where the source and the target window owner participate in the synchronization, and passive, where only the initiator of an RMA operation is involved in synchronization. The availability of generic RMA operations and synchronization mechanisms makes MPI-RMA useful for designing PGAS communication subsystems. However, implementations of MPI-RMA 3.0 are not yet generally available on high-end systems (a reference implementation of MPI-RMA 3.0 within MPICH is available). Implementations of previous MPI specifications (such as MPI 2.0) are available, but they perform poorly in comparison to native implementations, as shown by Dinan et al. [10].

2.1.3 Multi-Threading

One of the most important features of MPI is support for multi-threaded communication. MPI supports multiple thread levels (single, funneled, serialized, and multiple). The multiple mode is the least restrictive and allows an arbitrary number of threads of a process to make MPI calls simultaneously; in general, it is the most commonly used threaded model in MPI. In the design section, we explore the possibility of using the thread multiple mode as an option for PGAS communication.

3.3.6 Location Consistency

The location consistency semantics required in ComEx can be achieved by using the buffer re-use semantics of MPI: a wait on a request handle provides the required re-use guarantee, and messages between the same pair of processes are ordered by using the same MPI tag for all communications, guaranteeing that a series of operations on the same area of remote memory are executed in the same order as initiated by a given process.

3.3.7 Progress

The primary problem with MPI two-sided is the general need for communication progress for all operations, but especially for the operations serviced on behalf of other processes. The progress function is invoked to make progress on outstanding send and recv requests. The non-blocking requirement of this design is met by queuing requests and only testing the head of the queue for completion before servicing the next item in the queue.

4.2 Primary Details

Algorithm 3 shows the pseudocode executed by the compute processes. It shows the protocols for each of the Get, Put, Accumulate and FetchAndAdd primitives. Algorithm 4 shows the progress function executed exclusively by the progress ranks. The protocol processing is abstracted to hide details such as the LocalAccumulate function, which can use high performance libraries/intrinsics directly.

Algorithm 3: ComEx
Input: source address s, target address d, message size m, target process r

Procedure PUT(s, d, m, r)
  r1 <- TranslateRankstoPR(r);
  if m < eager threshold then
    buf <- InlineDataWithHeader(m);
    Send(buf ... r1);
  else
    buf <- PrepareHeader(m);
    Send(buf ... r1);
    Send(d ... s);
  end

Procedure GET(s, d, m, r)
  r1 <- TranslateRankstoPR(r);
  handle <- Irecv(d ... r1);
  buf <- PrepareHeader(m);
  Send(buf ... r1);
  Wait(handle);

Procedure ACC(s, d, m, r)
  r1 <- TranslateRankstoPR(r);
  if m < eager threshold then
    buf <- InlineDataWithHeader(m);
    Send(buf ... r1);
  else
    buf <- PrepareHeader(m);
    Send(buf ... r1);
    Send(d ... s);
  end

Procedure FADD(s, d, m, r)
  r1 <- TranslateRankstoPR(r);
  buf <- InlineDataWithHeader(m);
  Send(buf ... r1);
  Recv(d ... r1);
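A minimal sketch of the compute-side GET protocol from Algorithm 3, expressed with MPI. The header struct, the tags, and the translate_rank_to_pr() mapping are illustrative assumptions, not the ComEx implementation.

/* Sketch of a compute rank issuing a GET to its target's progress rank. */
#include <mpi.h>

#define REQ_TAG  200
#define DATA_TAG 201

typedef struct {
    int   type;       /* request kind, e.g. GET */
    void *src;        /* remote source address */
    int   nbytes;     /* transfer size */
    int   reply_to;   /* rank that should receive the data */
} request_t;

/* Hypothetical mapping from a user rank to its progress rank. */
static int translate_rank_to_pr(int rank, int ranks_per_node)
{
    return (rank / ranks_per_node) * ranks_per_node;   /* node-leader rank */
}

void comex_like_get(void *src, void *dst, int nbytes, int target_rank,
                    int my_rank, int ranks_per_node, MPI_Comm comm)
{
    int r1 = translate_rank_to_pr(target_rank, ranks_per_node);

    /* Post the receive for the reply first, so the PR's send always matches. */
    MPI_Request handle;
    MPI_Irecv(dst, nbytes, MPI_BYTE, r1, DATA_TAG, comm, &handle);

    request_t hdr = { /*type=GET*/ 1, src, nbytes, my_rank };
    MPI_Send(&hdr, sizeof hdr, MPI_BYTE, r1, REQ_TAG, comm);

    MPI_Wait(&handle, MPI_STATUS_IGNORE);   /* blocks until the data arrives */
}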

[Poster panel headings: Motivating Applications · Contributions · Design: Exploration Space · Solution: Progress Ranks for Shared Memory Domains · Evaluation of Approach]

What is the best way to design a partitioned global address space (PGAS) communication subsystem given that MPI is our only choice?

MINDCTI Revenue Release Quarter One 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Sc13 comex poster

PGAS Models using an MPI Runtime: Design Alternatives and Performance Evaluation
Jeff Daily, Abhinav Vishnu, Bruce Palmer, Hubertus van Dam
Pacific Northwest National Laboratory

What is the best way to design a partitioned global address space (PGAS) communication subsystem given that MPI is our only choice?
Motivating Applications

Self-consistent field (SCF) is a module from NWChem [1] that we use as a motivating example to understand the asynchronous nature of communication and computation within PGAS models. NWChem is a computational quantum chemistry package which provides multiple modules for energy calculation, varying in space and time complexity. The SCF module is less computationally expensive than other NWChem modules, requiring Θ(N^4) computation and N^2 data movement, where N is the number of basis sets. Higher-accuracy methods such as Coupled Cluster, Singles and Doubles (CCSD) and Triples (T) require N^6 and N^7 computation, respectively.

Algorithm 1: Self-Consistent Field
  Procedure SCF(m, n)
  Input: my rank m, total tasks n
  Data: current task ID t, data indices i, data d
  t <- m
  while t < n do
    i <- CalculateIndices(t)
    d <- Get(i)
    FockMatrix(t, i, d)
    Accumulate(t, i, d)
    t <- NextTask()
  end

SCF uses several PGAS abstractions: GET, ACCUMULATE, and NEXTTASK, a load-balancing counter based on fetch-and-add (a sketch of such a counter expressed on MPI appears below, after Algorithm 2). The algorithm is entirely dependent on the set of tasks (t) a process receives during execution, so there is little temporal locality in NWChem: a process only uses the indices (i) corresponding to the task ID (t) to index into distributed data structures and does not address processes directly. The scalability of NWChem depends on the acceleration of these routines. NWChem uses Global Arrays, which itself uses the native ports of ComEx when available; for example, GET and NEXTTASK are accelerated on native ports using RDMA and network-based atomics, respectively. Non-native ports based on MPI two-sided semantics must provide progress for these operations, since they cannot be accelerated with RDMA (without invoking a protocol) or network atomics. An MPI-RMA based port can provide the acceleration, but such ports typically deliver sub-optimal implementations and performance, as reported by Dinan et al.

Triangle Counting (TC), among other graph algorithms, exhibits irregular communication patterns and can be designed using PGAS models. In graphs with R-MAT structure, such as Twitter and Facebook, it is frequently important to detect communities. An important method for detecting communities is finding cliques. Since CLIQUE is an NP-complete problem, a popular heuristic is to count cliques of size three, which is equivalent to finding triangles in the graph. The edges are distributed using a compressed sparse row (CSR) format, and the vertices are divided equally among the processes.

Algorithm 2: Triangle Counting
  Procedure TC(p, m, v, e)
  Input: job size p, my rank m, graph G=(v,e)
  Data: vertex ID t
  t <- m * v/p
  while t < (m + 1) * v/p do
    n <- NeighborList(t)
    for vn in n do
      nn <- NeighborList(vn)
      CalculateIfTriangleExists(t, vn, nn)
    end
  end

  Procedure NeighborList(v)
  Input: vertex v
  f <- Get(v)
  return f

As shown in Algorithm 2, the NEIGHBORLIST function translates to GET. Since the computation is negligible, the runtime is bound by the time spent in GET. Other kernels such as Sparse Matrix-Vector Multiply (SpMV), which are frequently used in scientific applications and graph algorithms such as PageRank, exhibit a similar structure: a PGAS incarnation of SpMV is largely dependent on the performance of the GET routine. An understanding of these motivating examples helps in designing scalable PGAS communication runtime systems using MPI.
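The NEXTTASK abstraction above is a fetch-and-add based load-balancing counter. As an illustration only (not the ComEx or Global Arrays implementation, which uses network atomics on native ports), the minimal C sketch below shows how such a shared task counter could be built on MPI-3 one-sided atomics; the window layout and the next_task helper name are assumptions for this example.

    #include <mpi.h>
    #include <stdio.h>

    /* Shared task counter hosted in an MPI window on rank 0. Each call
       atomically fetches the current task ID and increments it, in the
       spirit of a NEXTTASK-style load-balancing counter. */
    static long next_task(MPI_Win win)
    {
        long one = 1, task;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Fetch_and_op(&one, &task, MPI_LONG, 0 /*rank*/, 0 /*disp*/, MPI_SUM, win);
        MPI_Win_unlock(0, win);
        return task;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long *counter = NULL;
        MPI_Win win;
        /* Rank 0 hosts the counter; all other ranks expose no memory. */
        MPI_Win_allocate(rank == 0 ? sizeof(long) : 0, sizeof(long),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &counter, &win);
        if (rank == 0) {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            *counter = 0;                         /* initialize inside an epoch */
            MPI_Win_unlock(0, win);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        long total_tasks = 100, t;                /* example task count */
        while ((t = next_task(win)) < total_tasks) {
            /* process task t: CalculateIndices / Get / FockMatrix / Accumulate */
            printf("rank %d got task %ld\n", rank, t);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }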
Background: MPI and ComEx

MPI provides a CSP-style execution model with send/receive semantics and a Bulk Synchronous Parallel model with MPI-RMA, together with a rich set of collective communication primitives and derived data types. Send/receive and collective communication are the most commonly used primitives in designing parallel applications and higher-level libraries. The two-sided semantics require an implicit synchronization between sender and receiver in which messages are matched using a tag (message identifier) and a communicator (group of processes); the receiver may specify a wildcard tag (to receive a message with any tag) and a wildcard source (to receive a message from any source).

MPI-RMA provides interfaces for get, put, and atomic memory operations. MPI-RMA 3.0 allows explicit request handles, window memory allocated by the underlying runtime, and windows that allocate shared memory. It offers multiple synchronization modes: active target, where both the initiator and the target participate in the synchronization, and passive target, where only the initiator of an RMA operation is involved. The availability of generic RMA operations and synchronization mechanisms makes MPI-RMA useful for designing PGAS communication subsystems. MPI also supports multi-threaded communication with several thread levels (single, funneled, serialized, and multiple); the multiple mode is the least restrictive, allowing an arbitrary number of threads to make MPI calls simultaneously, and is the most commonly used threaded model.

The Communication Runtime for Exascale (ComEx) is the successor of the Aggregate Remote Memory Copy Interface (ARMCI) and provides the one-sided communication primitives used by Global Arrays. ComEx is designed to use OpenFabrics Verbs (OFA) for InfiniBand and RoCE interconnects, DMAPP for the Cray Gemini interconnect, and PAMI for x86, PERCS, and Blue Gene/Q interconnects. The specification is being extended to support multi-threading, group-aware communication, non-cache-coherent architectures, and generic active messages. ComEx provides abstractions for RMA operations such as get, put, and atomic memory operations, and provides location consistency.

Figure 2: Software ecosystem of Global Arrays and ComEx. Native implementations are available for Cray, IBM, and InfiniBand systems, but not for the K computer and Tianhe-1A.
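To make the MPI-RMA primitives concrete, the sketch below allocates a window and performs a passive-target put and get; it is an illustration of the interface the first design maps onto, not the ComEx implementation, and the buffer sizes and neighbor-rank choice are assumptions.

    #include <mpi.h>
    #include <stdio.h>

    #define N 8

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *base;
        MPI_Win win;
        /* Every rank exposes N doubles through an RMA window. */
        MPI_Win_allocate(N * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
        for (int i = 0; i < N; i++) base[i] = rank;
        MPI_Barrier(MPI_COMM_WORLD);

        int target = (rank + 1) % size;
        double out[N], in[N];
        for (int i = 0; i < N; i++) out[i] = 100.0 * rank + i;

        /* Passive-target put: the target makes no matching call. */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(out, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);      /* put is complete at origin and target */

        /* Passive-target get in a separate epoch reads back the value just put. */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Get(in, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);      /* get data is valid after unlock */

        printf("rank %d read %.1f from rank %d\n", rank, in[0], target);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }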
Design: Exploration Space

We explore four design alternatives for building a PGAS communication subsystem on MPI: MPI-RMA (RMA), MPI Two-Sided (TS), MPI Two-Sided with Multi-threading (MT), and MPI Two-Sided with Dynamic Process Management (DPM). We first consider the semantically matching choice of MPI-RMA and then two-sided protocols; the limitations of each approach motivate the more complex ones. MPI two-sided and send/receive semantics are used interchangeably below.

First Design: MPI-RMA. MPI-RMA 2.0 supports the one-sided operations get, put, and atomic memory operations with active and passive synchronization modes. MPI-RMA 3.0 adds several features that facilitate scalable PGAS communication runtimes: non-blocking RMA requests, request-based transfers, window-based memory allocation, data-type communication, and multiple synchronization modes, making it semantically complete and suitable for designing scalable PGAS communication subsystems. However, MPI-RMA has completely different semantics than the widely used send/receive and collective interface, so an optimal MPI-RMA design requires a completely different approach from two-sided semantics. Because MPI-RMA has achieved low acceptance compared to two-sided semantics, most vendors provide only a compatibility port. Dinan et al. report that MPI-RMA implementations of Global Arrays perform 40-50% worse than comparable native ports on Blue Gene/P, Cray XT5, and InfiniBand with NWChem, which implies that vendors are not providing optimal implementations on high-end systems. Although MPI-RMA is semantically complete as a backend for PGAS models, sub-optimal implementations force us to consider alternative MPI features. In addition, PGAS runtimes such as ComEx and GASNet are tailored to the needs of their respective models: UPC and Co-Array Fortran need active messages for linked data structures, and Global Futures, an extension of Global Arrays for locality-driven execution, needs active messages for scaling and minimizing data movement, none of which are well supported by MPI-RMA.

Second Design: MPI Two-Sided (TS). MPI send/receive typically uses eager and rendezvous protocols for small and large messages, respectively. On high-performance interconnects such as InfiniBand, Cray Gemini, and Blue Gene/Q, the eager protocol involves a copy by both the sender and the receiver, while large messages use a zero-copy mechanism such as RDMA.

Figure 1: Typical communication protocols in MPI. The left side shows the eager protocol; the right side shows the rendezvous protocol with RTS (request to send) and FIN (finish) messages. Variants such as receiver-initiated [2] and RDMA-read-based rendezvous [3] protocols exist in the literature and in practice.

In the TS design, every process must service requests for data while performing computation and initiating communication on behalf of the calling process, so no operation may block. Put and accumulate are issued with MPI_Send/MPI_Isend from the source, and the destination must initiate a receive (and, for accumulate, perform a local accumulate) to complete the operation. A get is serviced implicitly by the process that owns the memory, using a combination of MPI probe, receive, and send. Atomic memory operations, used for load balancing, are a simple extension of get: the remote process performs the atomic operation on behalf of the initiator and sends back the original value for fetch-and-add; accumulates do not need to return the original value. A ComEx barrier is both a communication fence and a control barrier: each process must receive a termination message from every other process before leaving the synchronization phase, since a collective such as barrier/allreduce cannot be used for memory synchronization (it does not provide pair-wise causality). ComEx does not reimplement highly optimized MPI collectives such as allreduce, but because all operations must be non-blocking, entering a synchronous collective directly would cause deadlock, so a control barrier with continued progress precedes any MPI collective. Location consistency is achieved through the buffer-reuse semantics of MPI by using the same tag for all communications, which guarantees that operations on the same area of remote memory execute in the order issued by a given process. The primary problem with MPI two-sided is the general need for progress for all operations, especially dependent operations such as Get and FetchAndAdd: without asynchronous progress, the performance of a compute-intensive large-message application would certainly degrade.

Figure 3: One-sided communication protocols implemented with two-sided MPI. Protocols for Get, Put, and Accumulate are on the left, middle, and right, respectively.

A sketch of a get built from two-sided primitives follows.
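The get-over-send/receive protocol just described can be sketched as follows. This is an illustrative fragment under assumed conventions (the header struct, tag values, and function names are not the ComEx wire format): the initiator pre-posts the receive for the reply before sending the request, and the owner services requests from a non-blocking probe loop.

    #include <mpi.h>

    #define TAG_REQ   1   /* request header       */
    #define TAG_DATA  2   /* data reply for a get */

    /* Hypothetical request header: which bytes of the owner's memory to read. */
    typedef struct { long offset; int bytes; } get_req_t;

    /* Initiator side: emulate a one-sided get with two-sided primitives. */
    static void ts_get(void *local_buf, long remote_offset, int bytes,
                       int owner, MPI_Comm comm)
    {
        get_req_t req = { remote_offset, bytes };
        MPI_Request rreq;

        /* Pre-post the receive for the reply, then send the request header. */
        MPI_Irecv(local_buf, bytes, MPI_BYTE, owner, TAG_DATA, comm, &rreq);
        MPI_Send(&req, sizeof(req), MPI_BYTE, owner, TAG_REQ, comm);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    }

    /* Owner side: one iteration of the service loop; 'exposed' is the memory
       region this process exposes for one-sided access. A full design would
       issue the reply with a non-blocking send and queue the request. */
    static void ts_service_once(char *exposed, MPI_Comm comm)
    {
        int flag;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQ, comm, &flag, &st);
        if (!flag) return;                 /* nothing to service, keep computing */

        get_req_t req;
        MPI_Recv(&req, sizeof(req), MPI_BYTE, st.MPI_SOURCE, TAG_REQ, comm, &st);
        MPI_Send(exposed + req.offset, req.bytes, MPI_BYTE,
                 st.MPI_SOURCE, TAG_DATA, comm);
    }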
Third Design: MPI Send/Receive with Multi-threading (MT). Multi-threading support allows multiple threads to make MPI calls under different threading modes and is important in the multi-core era for hierarchical decomposition of data and computation on deep memory hierarchies; the thread-multiple mode allows MPI routines to be invoked from any thread. As an improvement over the plain send/receive design, communication progress can be made by an asynchronous thread: in our MT design of ComEx, the asynchronous thread calls MPI_Iprobe after it has finished servicing outstanding send and receive requests, and a separate communicator is used for each thread-process pairing to reduce locking overhead in the MPI runtime. Even with this optimization, locking cannot be completely removed from the critical path. Designing a communication runtime with multi-threading is non-trivial: the locks used by the MPI progress engine are abstracted away for performance and portability, which results in non-deterministic performance with asynchronous-thread designs. The asynchronous thread must frequently relinquish the lock using sched_yield; without it, performance is non-deterministic, and because the MT approach frequently invokes MPI_Iprobe (even on a separate communicator), the user-level process becomes very slow in comparison to the other approaches. Even on a moderately sized system running NWChem, we observed 10-15x performance degradation relative to the native approach.

Figure 4: One-sided communication protocols using two-sided MPI with multi-threading (MT). Protocols for Get, Put, and Accumulate are on the left, middle, and right, respectively. Process Pi initiates a request to process Pj, which is handled asynchronously by thread Pj,1.

Fourth Design: MPI Two-Sided with Dynamic Process Management (DPM). To eliminate the non-determinism caused by locking in the critical sections, a possibility is dynamic process management, an MPI feature that lets a process spawn new processes at runtime and communicate with them through a new inter-communicator. This avoids multi-threading while still providing asynchronous progress through explicit user-level processes: a few spawned processes per node would attach to the same shared memory region and make progress on behalf of the processes within their shared memory domain, similar to the approach proposed by Krishnan et al. The DPM approach is not evaluated because dynamic process management is not supported on the high-end systems considered in this paper; for example, the Cray Gemini system used in our evaluation does not support it even after two years in production. DPM requires support from the process manager, and many process managers do not provide it since DPM is not commonly used in MPI applications. For the remaining implementations, we used over-subscription.

Table 2: Approaches considered in this paper and their implementations.
  Approach              Symbol  Implemented  Evaluated
  Native                NAT     Yes          Yes
  MPI-RMA               RMA     Yes          Yes
  MPI Two-Sided         TS      Yes          No
  MPI Two-Sided + MT    MT      Yes          Yes
  MPI Two-Sided + DPM   DPM     Yes          No
  MPI Progress Rank     PR      Yes          Yes

A sketch of an asynchronous progress thread in the MT style follows.
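As an illustration of the MT design (a sketch only, with an assumed tag, shutdown flag, and service stub rather than the ComEx implementation), the fragment below initializes MPI with MPI_THREAD_MULTIPLE and runs a pthread that polls MPI_Iprobe on a dedicated communicator so remote requests are serviced while the main thread computes. A real implementation would use proper synchronization instead of a volatile flag.

    #include <mpi.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>

    #define TAG_REQ 1

    static volatile int shutting_down = 0;   /* set by the main thread at the end */

    /* Asynchronous progress thread: poll for incoming request headers and
       service them, yielding frequently so the main thread can enter MPI. */
    static void *progress_thread(void *arg)
    {
        MPI_Comm service_comm = *(MPI_Comm *)arg;   /* dedicated service channel */
        while (!shutting_down) {
            int flag;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQ, service_comm, &flag, &st);
            if (flag) {
                int bytes;
                MPI_Get_count(&st, MPI_BYTE, &bytes);
                char *hdr = malloc(bytes);
                MPI_Recv(hdr, bytes, MPI_BYTE, st.MPI_SOURCE, TAG_REQ,
                         service_comm, MPI_STATUS_IGNORE);
                /* ... decode header and perform the put/get/accumulate locally ... */
                free(hdr);
            } else {
                sched_yield();   /* relinquish the MPI runtime lock to the main thread */
            }
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

        MPI_Comm service_comm;
        MPI_Comm_dup(MPI_COMM_WORLD, &service_comm);   /* per-pairing communicator */

        pthread_t tid;
        pthread_create(&tid, NULL, progress_thread, &service_comm);

        /* ... application computation and ComEx-style calls on MPI_COMM_WORLD ... */

        MPI_Barrier(MPI_COMM_WORLD);
        shutting_down = 1;
        pthread_join(tid, NULL);
        MPI_Comm_free(&service_comm);
        MPI_Finalize();
        return 0;
    }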
Solution: Progress Ranks for Shared Memory Domains

Our proposed approach, referred to as the progress-rank (PR) based approach, performs a user-transparent split of the world communicator and dedicates one master rank per shared memory domain. For performance reasons it uses two-sided semantics; asynchronous progress is provided by the master (progress) rank on behalf of the compute ranks in its shared memory domain. A simple configuration change allows a user-defined number of progress ranks per node, without any source code change in the application.

Figure 5: Translation of communication operations in the PR approach. Left: a typical node with two progress ranks (blue and yellow); the blue progress rank performs RMA operations on the memory hosted by the green user-level processes. Right: a user-level process Pi communicates with progress rank Pk for RMA targeting Pj; Pj and Pk reside within the same shared memory domain.

Algorithm 3 shows the pseudocode executed by compute processes for the Get, Put, Accumulate, and FetchAndAdd primitives; Algorithm 4 shows the progress function executed exclusively by the progress ranks. Each primitive first translates the target rank to its progress rank (TranslateRankstoPR). A small put or accumulate inlines the data with the request header in a single send, while a large one sends the header followed by the data. A get posts an irecv for the reply and sends the request header. FetchAndAdd is translated as an irecv and a send on the initiator side, with the progress rank receiving the request, performing a local compute, and sending the initial value back to the initiator. The progress rank loops on MPI_Iprobe, receives a header, and dispatches on the message type: PUT copies inline data or receives the payload, GET sends the requested data back, ACC performs a local accumulate, and FADD performs a local fetch-and-add and returns the counter. Protocol processing is abstracted behind functions such as LocalAccumulate, which can use high-performance libraries or intrinsics directly; requests are queued, and only the head of the queue is tested for completion before the next item is serviced. On-node requests are performed using shared memory.

Figure 6: One-sided communication protocols using two-sided MPI with progress ranks. Protocols for Get, Put, and Accumulate are on the left, middle, and right, respectively. Process Pi initiates a request to progress rank Pk for RMA targeting Pj.

By using two-sided semantics, this approach cannot take advantage of hardware atomics on the NIC, such as those available for Cray Gemini and InfiniBand. The accumulate and put operations are bounded by the performance of the send and the local-accumulate function; for large accumulates we expect performance similar to a native port, since there are no known network hardware implementations of arbitrary-size accumulates. Table 1 summarizes the translation of ComEx primitives to MPI two-sided primitives with respect to time.

Table 1: Translation of ComEx communication primitives to their corresponding MPI two-sided primitives.
  Put:          Tput ≈ Tsend
  Accumulate:   Tacc ≈ Tsend + Tlocalacc
  Atomic (AMO): Tamo ≈ Tsend + Tlocalacc + Trecv (the initial value is returned)
  Get:          Tget ≈ Tsend + Trecv

A minimal sketch of the communicator split and the progress-rank service loop follows.
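The C sketch below illustrates the two building blocks of the PR approach under assumed conventions: the message types, header layout, tags, exactly one progress rank per node, and at least two ranks per node are assumptions for this example, not the ComEx implementation. MPI_Comm_split_type discovers the shared memory domain and designates its lowest rank as the progress rank, which then runs an Iprobe-based service loop in the spirit of Algorithm 4.

    #include <mpi.h>

    enum { MSG_PUT = 0, MSG_GET = 1, MSG_ACC = 2, MSG_FADD = 3, MSG_STOP = 4 };
    #define TAG_HDR  7   /* request headers   */
    #define TAG_DATA 8   /* payloads/replies  */

    /* Hypothetical request header exchanged between compute and progress ranks. */
    typedef struct { int type; long offset; int bytes; } hdr_t;

    /* Service loop run only by progress ranks (in the spirit of Algorithm 4).
       'heap' stands in for the memory exposed on behalf of the compute ranks. */
    static void progress_loop(char *heap, MPI_Comm comm)
    {
        for (;;) {
            int flag;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_HDR, comm, &flag, &st);
            if (!flag) continue;

            hdr_t h;
            MPI_Recv(&h, sizeof(h), MPI_BYTE, st.MPI_SOURCE, TAG_HDR, comm, &st);
            switch (h.type) {
            case MSG_STOP:
                return;
            case MSG_PUT:   /* receive the payload directly into the exposed region */
                MPI_Recv(heap + h.offset, h.bytes, MPI_BYTE,
                         st.MPI_SOURCE, TAG_DATA, comm, MPI_STATUS_IGNORE);
                break;
            case MSG_GET:   /* send the requested bytes back to the initiator */
                MPI_Send(heap + h.offset, h.bytes, MPI_BYTE,
                         st.MPI_SOURCE, TAG_DATA, comm);
                break;
            default:        /* MSG_ACC / MSG_FADD: LocalAccumulate etc., elided */
                break;
            }
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Discover this rank's shared memory domain (node). */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        /* Lowest rank on each node acts as the progress rank; every rank on
           the node learns its world rank via a broadcast on node_comm. */
        int progress_world = world_rank;
        MPI_Bcast(&progress_world, 1, MPI_INT, 0, node_comm);

        if (node_rank == 0) {
            static char heap[1 << 20];        /* stand-in for the exposed region */
            progress_loop(heap, MPI_COMM_WORLD);
        } else if (node_rank == 1) {
            /* One compute rank per node: emulate a get, then stop its progress rank. */
            char buf[8];
            hdr_t get = { MSG_GET, 0, (int)sizeof(buf) };
            MPI_Request rr;
            MPI_Irecv(buf, sizeof(buf), MPI_BYTE, progress_world, TAG_DATA,
                      MPI_COMM_WORLD, &rr);
            MPI_Send(&get, sizeof(get), MPI_BYTE, progress_world, TAG_HDR,
                     MPI_COMM_WORLD);
            MPI_Wait(&rr, MPI_STATUS_IGNORE);

            hdr_t stop = { MSG_STOP, 0, 0 };
            MPI_Send(&stop, sizeof(stop), MPI_BYTE, progress_world, TAG_HDR,
                     MPI_COMM_WORLD);
        }

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }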
Evaluation of Approach

We evaluate the approaches from the previous section using a set of communication benchmarks, a triangle-counting graph kernel, an SpMV kernel, and the full NWChem application on two production systems. NERSC Hopper is a Cray XE6 with 6,384 compute nodes, each with two twelve-core AMD MagnyCours processors, connected in a 3D torus over the Cray Gemini interconnect; we used the default Cray MPI library (referred to as Gemini/Hopper). The PNNL Institutional Computing cluster (PIC) has 604 nodes, each with two sixteen-core AMD Interlagos processors, connected with a QLogic InfiniBand QDR network; we used MVAPICH2 (referred to as IB).

The simple communication benchmarks measure the raw performance of the Get and Put primitives when the processes are well synchronized (Figures 7-9: bandwidth versus message size, 16 bytes to 256 KB, for the MT, RMA, PR, and NAT implementations on Gemini and InfiniBand). As expected, the native (NAT) implementation provides the best possible performance, so the remaining comparisons are among MT, PR, and RMA, which compare fairly against each other. PR provides the best performance of the MPI-based designs on both Gemini and IB. The PR implementation on Hopper runs over MPI/uGNI while the native implementation uses DMAPP, so the difference in peak bandwidth between PR and NAT can be attributed to the different communication libraries. The RMA implementation is sub-optimal at all message sizes compared to NAT and PR, and the MT implementation performs poorly compared to PR, primarily due to lock contention. On IB, the PR and RMA implementations perform similarly, while MT (using the default threading mode) is consistently sub-optimal. Similar trends hold for Put, with a drop at 8 KB for the RMA, PR, and MT implementations due to the switch from the eager to the rendezvous protocol at that message size.

SpMV (A · y = b) is an important kernel used in scientific applications and graph algorithms such as PageRank. We use a block-CSR A matrix and a one-dimensional right-hand-side vector y, allocated and distributed using Global Arrays. The data distribution is owner-computes, resulting in local accumulates to b; y is distributed evenly among the processes, and requests for non-local patches of y use the get primitive. The sparse matrix corresponds to a Laplacian operator on a 3D orthogonal grid with a 7-point stencil. For both SpMV and triangle counting, PR outperforms the other MPI-based designs (timing charts: Gemini SpMV, InfiniBand SpMV, and InfiniBand TC at 512 to 4,096 processes). For NWChem, the CCSD(T) results are reported as PR speedup relative to RMA for the Total, Accumulate, Get, AddPatch, SCF, and CCSD(T) phases on PIC (1,020 and 2,040 processes) and Hopper (1,008 processes); MT did not finish execution for these runs.

Overall, the PR approach consistently outperforms the other MPI-based approaches: we achieve speedups of 2.17x on NWChem (on 1,008 processors), 1.31x on sparse matrix-vector multiplication, and 1.14x on graph community detection, using up to 2K processes on the two systems. A sketch of how the Get bandwidth benchmark can be expressed for the RMA variant follows.
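For reference, the Get bandwidth measurement for the RMA variant can be expressed roughly as below. This is a generic sketch, not the benchmark code behind the reported numbers: the message sizes, iteration count, flush-per-get strategy, and the requirement of at least two ranks are assumptions.

    #include <mpi.h>
    #include <stdio.h>

    #define MAX_BYTES (256 * 1024)
    #define ITERS 1000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* run with at least two ranks */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *base;
        static char buf[MAX_BYTES];
        MPI_Win win;
        /* Each rank exposes MAX_BYTES of window memory to read from. */
        MPI_Win_allocate(MAX_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0) {
            /* Rank 0 measures Get bandwidth against rank 1 per message size. */
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
            for (int bytes = 16; bytes <= MAX_BYTES; bytes *= 4) {
                double t0 = MPI_Wtime();
                for (int i = 0; i < ITERS; i++) {
                    MPI_Get(buf, bytes, MPI_BYTE, 1, 0, bytes, MPI_BYTE, win);
                    MPI_Win_flush(1, win);       /* complete the get at the origin */
                }
                double secs = MPI_Wtime() - t0;
                printf("%7d bytes  %10.2f MB/s\n",
                       bytes, (double)bytes * ITERS / secs / 1e6);
            }
            MPI_Win_unlock(1, win);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }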
Here, (509) 372-6548 we have considered a block CSR format of A matrix and a onedimensional RHS vector (y), which are allocated and distributed jeff.daily@pnnl.gov PR Message* Procedure TC(p, m, v, e) Input: job size p, my rank m, graph G=(v,e) Data: vertex ID t t m; while t < n do i CalculateIndices(t); d Get(i); Approach Symbol Implemented Evaluated FockMatrix(t, i, d); Native NAT Yes Yes Accumulate(t,RMA i, d); Yes [10] MPI-RMA Yes t Two-sided NextTask(); MPI TS Yes No end MPI Two-sided + MT MT Yes Yes Algorithm 4: ComEx PR Progress Blue# IB#Input: sourceCray# s, target address d, message sizeMPI# address m, Gene# target process r Procedure PROGRESS() while running do Figure 2: Software Ecosystem of Global Arrays and ComEx. f lag Iprobe(); Native implementations are available for Cray, IBM, and IB if flag then systems, but not for Kcomputer and Tianhe-1A. header Recv(); switch header.messageT ype do 3. EXPLORATION SPACE case PUT if IsDataInline(header) then In this section, we begin with a description of two example PGAS CopyInlineData(header); algorithms which motivate our design choices. We then present a else thorough exploration of design alternatives for using MPI as a comRecv(header.d . . header.r1); munication runtime for PGAS models. We first. suggest the semanend tically matching choice of using MPI-RMA before considering the break; use of two-sided protocols. While considering two-sided protocols, end the limitations of each case GET are discussed which motivate more approach complex approaches. For Send(header.spaper, the MPI two-sided the rest of the . . . header.r1); and send/receive semantics are used interchangeably. break; end case ACC 3.1 Motivating Examples LocalAcc(header.d); Self-consistent field (SCF) is a module from NWChem [25] which break; we use as a motivating example to understand the asynchronous end nature of communication and computation within PGAS models. case FADD NWChem is a computational quantum chemistry package which counter LocalFAdd(header.d); provides multiple modules for energy calculation varying in space Send(counter . . . header.r1); and time complexities. The self-consistency field (SCF) module end is less computationally expensive relative to other NWChem modendsw ules, requiring ⇥(N 4 ) computation and N 2 data movement, where end split of the number of basis sets. Higher accuracy methods such As N is the processes in compute ranks and progress ranks. as shown in the end figure, a simple configuration change would allow 3.3 Algorithm 2: Triangle Counting Procedure SCF(m, n) Input: my rank m, total tasks n Data: current task ID t, data indices i, data d RMA Computa)on* Algorithm 1: Self-Consistent Field MT Communica)on* Shared*Memory* Key:* Shared*Memory* Shared*Memory* •  An in-depth analysis of design alternatives for a PGAS communication subsystem using MPI. We present a total of four design alternatives: MPI-RMA (RMA), MPI Two-Sided (TS), MPI Two-Sided with else semantics collective operation would certainly cause Multi-threading (MT), and MPI be implemented either using nativeing into a synchronousareasufficient for implementing one-sidedcollective in thetoabsence or individual point-to-point communication). Two-Sided with deadlock. The two-sided native implementation.perform a should continue lective One-Sided Communication buf PrepareHeader(m); MPI-RMA can communicaof design must then This result affirm Protocols using Two- Sided 2.1.2 MPI-RMA Send(buf . . . 
r1); tion interfaces which leverage RDMA offloading, or by utilizing MPI-RMA provides interfaces for get, put, and atomic memory Send* addition to Send* Send* communication fence system procurementarequirements of optimized two-sided commu-improvement over the previous send/receive design, progress in control barrier prior to As an Dynamic Process Management (DPM). thread in conjunction with collective. while suggesting that one-sided communication is madeCommunication3.0 allows for explicit request handles, for Send(d . . . s); nication can be operations. MPI-RMA Protocols in MPI. an accelerating asynchronous RMA entering an MPI using6: One-Sided Communicationshown inunderlying runtime, an asynchronous thread as Protocols Figure Two4. In readily improved in the future using the existing MPI interface Figure request window memory to be allocated by the using end send/receive semantics. Either of these cases require significant our proposed design of for ProtocolsPut and Progress Ranks. Protocols ComEx on MPI multi-threading (MT), the sided and for windows that allocate in MPI with MPI-RMA provides Communication Get, shared memory. based on ourIProbe* proposed approach. •  A novel approach which uses a design and implementation. of twoProcedure GET(s, d, m, r) effort for scalable combination Dinan et al. have IProbe* IProbe* Protocols for Get, Put, and Accumulate arehas the left, midRecv* asynchronous thread calls MPI_Iprobe after it target, where the RMA multiple synchronization modes: active on finished servr1 TranslateRankstoPR(r); Recv* presented an in-depth performance evaluation of Global Arrays 3.3.6 Location Consistency Recv* Accumulate windowseparate initiates amiddle, for areProcess Pparticipate the synchronization The rest of Send* the paper is organizedRecv* as follows: In section 2, we dle, and right, respectively. a on the left, in request to the owners ing the sendsource and target requests. We handle Irecv(d . . . r1); sided semantics and an automatic, user-transparent consistency semantics required inourComExIncan be 3, weLocal*Acc*progressright, target,useRMAonly the communicatorP each using MPI-RMA [10]. Specifically, Dinan reports that MPI-RMA The location rank P respectively.initiator of andRMA operation for the targeting thread-process. This P . P an reside present some background for work. section present and between and where buf PrepareHeader(m); communication passive process-thread and implementations perform 40-50% worse than comparable native using the buffer reuse semantics of MPI - invoking achieved by various alternatives when using MPI for designing ComEx. In sec- within the same shared memory domain.availability of generic RMA is involved in synchronization. The Send(buf . . . r1); reduces locking overhead in the MPI runtime. ports on to act as asynchronous split of MPI communicators Blue Gene/P, Cray XT5 and InfiniBand withwait on a requestKey:* 4, weCommunica)on* re-use and present aMessage* (Not Shown:synchronization mechanismsHowever, even a NWChem. handle canpresent our similar Computa)on* provide proposed design semantics performance the operations and MPI-RMA.) tion makes MPI-RMA Wait(handle); This observation implies that vendors are not providing optimal MPI. In addition, section 5. We present related work in section 6 and optimization, it is notPGAS communication subsystems. 
For more information on the science you see here, please contact:
Jeff Daily
Pacific Northwest National Laboratory
MS-IN: K7-90, P.O. Box 999, Richland, WA 99352
(509) 372-6548
jeff.daily@pnnl.gov

Acknowledgments
This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

References
[1] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. Van Dam, D. Wang, E. Apra, T. L. Windus, and W. A. de Jong. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, 181(9):1477-1489, 2010.
[2] S. Pakin. Receiver-initiated message passing over RDMA networks. In IPDPS, pages 1-12, 2008.
[3] S. Sur, H.-W. Jin, L. Chai, and D. K. Panda. RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In PPoPP, pages 32-39, 2006.