2. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
3. Shared Memory: all memory within a system is directly
addressable (ignoring access restrictions) by each process [or
thread]
Single- and multi-CPU desktops & laptops
Multi-threaded apps
GPGPU *
MPI *
Distributed Memory: memory available to a given node within
a system is unique and distinct from that of its peers
MPI
Google MapReduce / Hadoop
6. Bandwidth (FSB, HT, Nehalem, CUDA, …)
Bandwidth limits are frequently hit with high-level languages (MATLAB)
Capacity – cost & availability
High-density chips are $$$ (if even available)
Memory limits on individual systems
Distributed computing addresses both bandwidth and
capacity with multiple systems
MPI is the glue used to connect multiple distributed
processes together
7. Custom iterative SENSE reconstruction
3 x 8 coils x 400 x 320 x 176 x 8 [complex float]
Profile data (img space)
Estimate (img<->k space)
Acquired data (k space)
> 4GB data touched during each iteration
16, 32 channel data here or on the way…
Trzasko, Josh ISBI 2009 #1349, “Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms”
M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro
8. [Flowchart: distributed reconstruction pipeline]
Stages: FTx of incoming DATA → place each view into the correct x-Ky-Kz space (AP & LP) → "traditional" 2D SENSE unfold (AP & LP, using CAL data after FTyz) → homodyne correction → GW correction (Y, Z) → GW correction (X) → MIP → store RESULT → DICOM
Legend: pre-loaded data; real-time data; MPI communication; root node vs. worker nodes
9. [Diagram: cluster hardware]
Root node: 3.6GHz P4, 16GB RAM, 500GB HDD, 1Gb Eth to the site intranet, 1Gb Eth to the cluster switch, 2x8Gb IB
Worker nodes (x7): 3.6GHz P4, 16GB RAM, 80GB HDD, 1Gb Eth, 2x8Gb IB
16-Port Gigabit Ethernet Switch: x7 file-system connections (1Gig Ethernet)
24-Port Infiniband Switch: x7x2 MPI interconnects; 16Gb/s bandwidth per node (2x 8Gb/s connections)
Key: cluster hardware / MRI system / external hardware
12. [Diagram: shared vs. distributed memory]
Left: a single Host running one OS; Process A contains Threads 1…N, with Process B alongside; all communication is memory transfers.
Right: Hosts I…N, each with its own OS (OS I…OS N) and its own process (A, B, C); communication between hosts is network transfers.
13. [Diagram: as above, with multiple processes per host]
Left: the single shared-memory host, unchanged.
Right: Hosts I…N now each run several processes (A & D, B & E, C & F); memory transfers within a host, network transfers between hosts.
14. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
15. Message Passing Interface is…
“a library specification for message-passing” [1]
Available in many implementations on multiple
platforms *
A set of functions for moving messages between
different processes without a shared memory
environment
Low-level*; no concept of overall computing tasks
to be performed
[1] http://www.mcs.anl.gov/research/projects/mpi/
16. MPI-1
Version 1.0 draft standard 1994
Version 1.1 in 1995
Version 1.2 in 1997
Version 1.3 in 2008
MPI-2
Added:
▪ 1-sided communication
▪ Dynamic “world” sizes; spawn / join
Version 2.0 in 1997
Version 2.1 in 2008
MPI-3
In progress
Enhanced fault handling
Forward compatibility preserved
17. MPI is the de facto standard for distributed computing
Freely available
Open source implementations exist
Portable
Mature
From a discussion of why MPI is dominant [1]:
[…] 100s of languages have come and gone.
Good stuff must have been created [… yet] it is broadly accepted in the field
that they’re not used.
MPI has a lock.
OpenMP is accepted, but a distant second.
There are substantial barriers to the introduction of new languages and
language constructs.
Economic, ecosystem related, psychological, a catch-22 of widespread
use, etc.
Any parallel language proposal must come equipped with reasons why it will
overcome those barriers.
[1] http://www.ieeetcsc.org/newsletters/2006-01/why_all_mpi_discussion.html
18. MPI itself is just a specification. We want an implementation
MPICH, MPICH2
Widely portable
MVAPICH, MVAPICH2
Infiniband-centric; MPICH/MPICH2 based
OpenMPI
Plug-in architecture; many run-time options
And more:
IntelMPI
HP-MPI
MPI for IBM Blue Gene
MPI for Cray
Microsoft MPI
MPI for SiCortex
MPI for Myrinet Express (MX)
MPICH2 over SCTP
19. Without MPI:
Start all of the processes across a bank of machines
(shell scripting + ssh)
socket(), bind(), listen(), accept() or connect() for each
link
send(), read() on individual links
Raw byte interfaces; no discrete messages
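For a sense of scale, here is a minimal sketch of what just one side of one raw TCP link costs before any message framing exists; the port number is hypothetical and error handling is omitted. MPI hides all of this, for every one of the ~N^2/2 links.

/* raw_link.c - hedged sketch of ONE raw link (server side) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main (void)
{
    struct sockaddr_in addr;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);                  /* hypothetical port */
    bind(lfd, (struct sockaddr *) &addr, sizeof(addr));
    listen(lfd, 1);
    int cfd = accept(lfd, NULL, NULL);            /* one link; repeat per peer */
    char buf[4096];
    ssize_t got = read(cfd, buf, sizeof(buf));    /* raw bytes; may be a PARTIAL message */
    printf("got %zd bytes\n", got);
    close(cfd);
    close(lfd);
    return 0;
}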
23. Each process owns its data – there is no “our”
Makes many things simpler; no mutexes, condition
variables, semaphores, etc.; memory-access-order race
conditions go away
Every message is an explicit copy
I have the memory I sent from; you have the memory you
received into
Even when running in a “shared memory” environment
Synchronization comes along for free
I won’t get your message (or data) until you choose to
send it
Programming to MPI first can make it easier to scale
out later
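A minimal sketch of those two points in action (the values are made up): rank 1 blocks until rank 0 chooses to send, and afterwards each rank holds its own copy.

/* own.c - sketch: every message is an explicit copy; run with -np 2 */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, mine[4] = {0, 0, 0, 0};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        mine[0] = 42;                              /* my data; no lock needed */
        MPI_Send(mine, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        mine[0] = 7;                               /* safe: rank 1 has its own copy */
    } else if (rank == 1) {
        /* blocks here -- synchronization comes along for free */
        MPI_Recv(mine, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", mine[0]);   /* prints 42, not 7 */
    }
    MPI_Finalize();
    return 0;
}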
24. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
25. Download / decompress the MPICH source:
http://www.mcs.anl.gov/research/projects/mpich2/
Supports: C / C++ / Fortran
Requires Python >= 2.2
./configure
make install
Installs into /usr/local by default, or use
--prefix=<chosen path>
Make sure <prefix>/bin is in PATH
Make sure <prefix>/share/man is in MANPATH
27. Set up passwordless ssh to the workers
Start the daemons with mpdboot -n <N>
Requires ~/.mpd.conf to exist on each host
▪ Contains (the same on each host):
▪ MPD_SECRETWORD=<some gibberish string>
▪ Permissions set to 600 (r/w access for owner only)
Requires ./mpd.hosts to list the other host names
▪ Unless run as mpdboot -n 1 (run on current host only)
▪ Will not accept the current host in the list (implicit)
Check for running daemons with mpdtrace
For details: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
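A minimal sketch of the two files, with hypothetical worker host names and a made-up secret word:

# ~/.mpd.conf -- identical on every host; chmod 600
MPD_SECRETWORD=some-gibberish-string

# ./mpd.hosts -- on the launching host; the current host is implicit
worker1
worker2
worker3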
29. Use mpicc / mpicxx as the C/C++ compiler
Wrapper script around the C/C++ compilers detected
during install
▪ $ mpicc --show
gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt
$ mpicc -o hello hello.c
Use mpiexec -np <nproc> <app> <args> to launch
$ mpiexec -np 4 ./hello
30. /* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i, rank, nodes;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < nodes; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}

$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
31. [Diagram: lifecycle of a threaded app: ./threaded_app]
main() begins; pthread_create(func()) spawns a thread within the threaded_app process; the main thread and func() both do work against the same memory; pthread_exit() / pthread_join() rejoin; exit().
32. [Diagram: lifecycle of an MPI app: mpiexec -np 4 ./mpi_app]
mpd launches the jobs. Each rank (0, 1, …; rank 2 not shown) is a full process: main(), MPI_Init(), MPI_Bcast(), do work on local memory, MPI_Allreduce(), MPI_Finalize(), exit(). Each MPI_* call is a point of MPI communication between the ranks.
33. /* hello.c */
#include <stdio.h>
#include <mpi.h>

int
main (int argc, char * argv[])
{
    int i;
    int rank;
    int nodes;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < nodes; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}
34. MPICH2 comes with mpe by default (unless disabled
during configure)
Multiple tracing / logging options to track MPI traffic
Enabled through -mpe=<option> at compile time
MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
MacPro:code$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
Writing logfile....
Enabling the Default clock synchronization...
Finished writing logfile ./hello.clog2.
39. int MPI_Send(
void *buf,
memory location to send from
int count,
number of elements (of type datatype) at buf
MPI_Datatype datatype,
MPI_INT, MPI_FLOAT, etc…
Or custom datatypes; strided vectors; structures, etc.
int dest,
rank (within the communicator comm) of the destination for this message
int tag,
used to distinguish this message from other messages
MPI_Comm comm )
communicator for this transfer
often MPI_COMM_WORLD
40. int MPI_Recv(
void *buf,
memory location to receive data into
int count,
number of elements (of type datatype) available to receive into at buf
MPI_Datatype datatype,
MPI_INT, MPI_FLOAT, etc…
Or custom datatypes; strided vectors; structures, etc.
Typically matches the sending datatype, but doesn’t have to…
int source,
rank (within the communicator comm) of the source for this message
can also be MPI_ANY_SOURCE
int tag,
used to distinguish this message from other messages
can also be MPI_ANY_TAG
MPI_Comm comm,
communicator for this transfer
often MPI_COMM_WORLD
MPI_Status *status )
structure describing the received message, including:
actual count (can be smaller than the passed count)
source (useful if used with source = MPI_ANY_SOURCE)
tag (useful if used with tag = MPI_ANY_TAG)
42. $ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
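sr.c itself isn't shown; this is a hedged reconstruction consistent with the output above and with the behavior explored on the next slides: both ranks MPI_Send first, then MPI_Recv.

/* sr.c - hedged reconstruction; run with mpiexec -np 2 */
#include <stdio.h>
#include <mpi.h>

#ifndef SENDSIZE
#define SENDSIZE 1                /* bytes; overridden with -DSENDSIZE=... */
#endif

int main (int argc, char * argv[])
{
    int rank, peer;
    static char sbuf[SENDSIZE], rbuf[SENDSIZE];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;
    sbuf[0] = (char) rank;
    /* Both ranks send BEFORE either receives: this only completes while the
       standard-mode send is eager (below the buffering threshold) */
    MPI_Send(sbuf, SENDSIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, SENDSIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &status);
    printf("%i sent %i; received %i\n", rank, rank, (int) rbuf[0]);
    MPI_Finalize();
    return 0;
}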
44. $ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 2 ./sr
^C
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
45. 3.4 Communication Modes
The send call described in Section Blocking send is blocking: it does not return until the message data
and envelope have been safely stored away so that the sender is free to access and overwrite the send
buffer. The message might be copied directly into the matching receive buffer, or it might be copied
into a temporary system buffer.
Message buffering decouples the send and receive operations. A blocking send can complete as soon
as the message was buffered, even if no matching receive has been executed by the receiver. On the
other hand, message buffering can be expensive, as it entails additional memory-to-memory
copying, and it requires the allocation of memory for buffering. MPI offers the choice of several
communication modes that allow one to control the choice of the communication protocol.
The send call described in Section Blocking send used the standard communication mode. In this
mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer
outgoing messages. In such a case, the send call may complete before a matching receive is
invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer
outgoing messages, for performance reasons. In this case, the send call will not complete until a
matching receive has been posted, and the data has been moved to the receiver.
Thus, a send in standard mode can be started whether or not a matching receive has been posted. It
may complete before a matching receive is posted. The standard mode send is non-local: successful
completion of the send operation may depend on the occurrence of a matching receive.
http://www.mpi-forum.org/docs/mpi-11-html/node40.html#Node40
46. [Diagram: eager vs. rendezvous transfers between Process 1 and Process 2]
Small message: Process 1 does an eager send and returns immediately; Process 2's eager receive later requests & receives the small message.
Large message: Process 1 issues a rendezvous request and blocks until completion; Process 2 matches the rendezvous request, requests the large message, and receives the rendezvous data.
Key: user activity vs. MPI activity.
47. MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)
Sends are “local” – they return independent of any remote activity
Message buffer can be touched immediately after the call returns
Requires a user-provided buffer, provided via MPI_Buffer_attach()
Forces an “eager”-like message transfer from the sender’s perspective
User can wait for completion by calling MPI_Buffer_detach()
MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)
Won’t return until the matching receive is posted
Forces a “rendezvous”-like message transfer
Can be used to guarantee synchronization without additional MPI_Barrier() calls
MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)
Erroneous if the matching receive has not been posted
Performance tweak (on some systems) when the user can guarantee the matching receive is posted
MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)
Non-blocking, immediate return once the send/receive request is posted
Requires an MPI_[Test|Wait][|all|any|some] call to guarantee completion
Send/receive buffers should not be touched until completed
MPI_Request * argument used for eventual completion
The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to
receive any send mode.
49. $ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 4 ./sr2
0 sent 0; received 3
2 sent 2; received 1
1 sent 1; received 0
3 sent 3; received 2
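sr2.c isn't shown either; this is a hedged sketch of a ring exchange consistent with the output and with the speaker notes' mention of MPI_Isend / MPI_Irecv. Posting the receive up front means it cannot deadlock at any SENDSIZE.

/* sr2.c - hedged sketch: immediate-mode ring exchange */
#include <stdio.h>
#include <mpi.h>

#ifndef SENDSIZE
#define SENDSIZE (0x1 << 14)
#endif

int main (int argc, char * argv[])
{
    int rank, nodes, to, from;
    static char sbuf[SENDSIZE], rbuf[SENDSIZE];
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    to   = (rank + 1) % nodes;
    from = (rank + nodes - 1) % nodes;
    sbuf[0] = (char) rank;
    /* Both requests posted up front; the buffers must not be touched
       between posting and MPI_Waitall() completion */
    MPI_Irecv(rbuf, SENDSIZE, MPI_CHAR, from, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sbuf, SENDSIZE, MPI_CHAR, to,   0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    printf("%i sent %i; received %i\n", rank, rank, (int) rbuf[0]);
    MPI_Finalize();
    return 0;
}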
50. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
51. Task parallelism
Each process handles a unique kind of task
▪ Example: multi-image uploader (with resize/recompress)
▪ Thread 1: GUI / user interaction
▪ Thread 2: file reader & decompression
▪ Thread 3: resize & recompression
▪ Thread 4: network communication
Can be used in a grid with a pipeline of separable tasks
to be performed on each data set
▪ Resample / warp volume
▪ Segment volume
▪ Calculate metrics on segmented volume
52. Data parallelism
Each process handles a portion of the entire data
Often used with large data sets
▪ [task 0… | … task 1 … | … | … task n]
Frequently used in MPI programming
Each process is “doing the same thing,” just on a
different subset of the whole
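A minimal sketch of the pattern (the global size and the per-element work are made up): every rank runs the same program on its own block, and a collective combines the partial results.

/* dp.c - sketch: block-distributed work plus a global reduction */
#include <stdio.h>
#include <mpi.h>

#define NTOTAL 1000000            /* hypothetical global problem size */

int main (int argc, char * argv[])
{
    int i, lo, hi, rank, nodes;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    lo = rank * NTOTAL / nodes;            /* [task 0 | task 1 | ... | task n] */
    hi = (rank + 1) * NTOTAL / nodes;

    for (i = lo; i < hi; i++)
        local += 1e-6 * i;                 /* "doing the same thing" on a subset */

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}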
53. Layout is crucial in high-performance computing
BW efficiency; cache efficiency
Even more important in distributed: poor layout → extra communication
[Diagram: "block" data distribution of a 3D volume across Nodes 0-7]
x is the contiguous dimension; z is the slowest dimension
Each node has a contiguous portion of z (see the slab sketch below)
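A sketch of the slab arithmetic implied by the figure (the volume extent is hypothetical): each node owns a contiguous range of z planes.

/* slab.c - sketch: who owns which z planes in a block distribution */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, nodes, z0, z1;
    int nz = 176;                              /* hypothetical z extent */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    z0 = rank * nz / nodes;                    /* first z plane owned */
    z1 = (rank + 1) * nz / nodes;              /* one past the last */
    printf("node %i: z planes [%i, %i)\n", rank, z0, z1);
    MPI_Finalize();
    return 0;
}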
54. [Flowchart: the distributed reconstruction pipeline, revisited]
Stages: FTx of incoming DATA → place each view into the correct x-Ky-Kz space (AP & LP) → "traditional" 2D SENSE unfold (AP & LP, using CAL data after FTyz) → homodyne correction → GW correction (Y, Z) → GW correction (X) → MIP → display RESULT → DICOM
Legend: pre-loaded data; real-time data; MPI communication; root node vs. worker nodes
55. Completely separable problems:
Add 1 to everyone
Multiply each a[i] * b[i]
Inseparable problems: [?]
Max of a vector
Sort a vector
MIP of a volume
1D FFT of a volume
2D FFT of a volume
3D FFT of a volume
[Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
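The max turns out to be separable after all: local maximums, then a max of maximums (as the speaker notes put it). A minimal sketch with made-up data:

/* dmax.c - sketch: max of a distributed vector via MPI_MAX */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i, rank, nodes;
    float chunk[4096], lmax, gmax;             /* each rank's local portion */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 4096; i++)                 /* made-up data */
        chunk[i] = (float) ((rank * 4096 + i) % 997);
    lmax = chunk[0];
    for (i = 1; i < 4096; i++)                 /* local max: no communication */
        if (chunk[i] > lmax) lmax = chunk[i];
    MPI_Allreduce(&lmax, &gmax, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);
    if (rank == 0) printf("global max = %g\n", gmax);
    MPI_Finalize();
    return 0;
}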
57. Derived datatypes
MPI_Type_vector()
Enables communication of sub-sets without packing
Combined with DMA, permits zero-copy transposes, etc.
Other collectives
MPI_Reduce
MPI_Scatter
MPI_Gather
MPI-2 (MPICH2, MVAPICH2)
One-sided (DMA) communication
▪ MPI_Put()
▪ MPI_Get()
Dynamic world size
▪ Ability to spawn new processes during run
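A sketch of MPI_Type_vector() in use (array sizes made up): rank 0 sends one column of a row-major array without packing it first; the receiver lands it in a contiguous buffer simply by receiving plain floats.

/* col.c - sketch: strided column via a derived datatype; run with -np 2 */
#include <stdio.h>
#include <mpi.h>

#define NY 8
#define NX 16

int main (int argc, char * argv[])
{
    int i, rank;
    float a[NY * NX], col[NY];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* NY blocks of 1 float, stride NX floats apart = one column */
    MPI_Type_vector(NY, 1, NX, MPI_FLOAT, &column);
    MPI_Type_commit(&column);
    if (rank == 0) {
        for (i = 0; i < NY * NX; i++) a[i] = (float) i;
        MPI_Send(&a[3], 1, column, 1, 0, MPI_COMM_WORLD);    /* column x=3 */
    } else if (rank == 1) {
        /* strided on the wire, contiguous on arrival: types needn't match */
        MPI_Recv(col, NY, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("col[1] = %g\n", col[1]);                     /* prints 19 */
    }
    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}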
58. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
59. Take time on the algorithm & data layout
Minimize traffic between nodes / separate the problem
▪ FTx into xKyKz in the SENSE example
Cache-friendly (linear, efficient) access patterns
Overlap processing and communication
MPI_Isend() / MPI_Irecv() with multiple work buffers
While actively transferring one, process the other
Larger messages will hit a higher BW (in general)
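A sketch of that double-buffer pattern (the block count, size, and the process() step are all hypothetical): the consumer overlaps work on one buffer with the transfer of the next block into the other.

/* pipe.c - sketch: overlap processing and communication; run with -np 2 */
#include <stdio.h>
#include <mpi.h>

#define NBLK 8
#define BLK  4096

static double process (float * b)              /* hypothetical work step */
{
    double s = 0.0; int i;
    for (i = 0; i < BLK; i++) s += b[i];
    return s;
}

int main (int argc, char * argv[])
{
    int rank, n;
    static float buf[2][BLK];
    double total = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                            /* producer */
        for (n = 0; n < NBLK; n++)
            MPI_Send(buf[0], BLK, MPI_FLOAT, 1, n, MPI_COMM_WORLD);
    } else if (rank == 1) {                     /* consumer */
        MPI_Irecv(buf[0], BLK, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
        for (n = 0; n < NBLK; n++) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* block n has arrived */
            if (n + 1 < NBLK)                   /* post receive for block n+1 */
                MPI_Irecv(buf[(n + 1) % 2], BLK, MPI_FLOAT, 0, n + 1,
                          MPI_COMM_WORLD, &req);
            total += process(buf[n % 2]);       /* work overlaps the transfer */
        }
        printf("total = %g\n", total);
    }
    MPI_Finalize();
    return 0;
}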
60. Profile
VTune (Intel; Linux / Windows)
Shark (Mac)
MPI profiling with -mpe=mpilog
Avoid “premature optimization” (Knuth)
Implementation time & effort vs. runtime
performance
Use derived datatypes rather than packing
Using a debugger with MPI is hard
Build in your own debugging messages from the start
61. If you might need MPI, build to MPI.
Works well in shared memory environments
▪ It’s getting better all the time
Encourages memory locality in NUMA architectures
▪ Nehalem, AMD
Portable, reusable, open-source
Can be used in conjunction with threads / OpenMP /
TBB / CUDA / OpenCL – the “hybrid model of parallel
programming”
Messaging paradigm can create “less obfuscated”
code than threads / OpenMP
62. Homogeneous nodes
Private network
Shared filesystem; ssh communication
Password-less SSH
High-bandwidth private interconnect
MPI communication exclusively
GbE, 10GbE
Infiniband
Consider using Rocks
CentOS / RHEL based
Built for building clusters
Rapid network boot based install/reinstall of nodes
http://www.rocksclusters.org/
63. MPI documents
http://www.mpi-forum.org/docs/
MPICH2
http://www.mcs.anl.gov/research/projects/mpich2
http://lists.mcs.anl.gov/pipermail/mpich-discuss/
OpenMPI
http://www.open-mpi.org/
http://www.open-mpi.org/community/lists/ompi.php
MVAPICH[1|2] (Infiniband-tuned distribution)
http://mvapich.cse.ohio-state.edu/
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/
Rocks
http://www.rocksclusters.org/
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/
Books:
Pacheco, Peter S., Parallel Programming with MPI
Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
Gropp, W., Using MPI-2
66. This is the painting operation for one RGBA pixel (in) onto another (out).
We can do red and blue together, as we know they won't collide, and we can mask out the unwanted results.
Post-multiply masks are applied in the shifted position to minimize the number of shift operations.
Note: we're using pre-multiplied colors & painting onto an opaque background.

#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

inline void
blendPreToStatic(const uint32_t& in,
                 uint32_t& out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x00000080u) ++alpha;
    out = A | RGB &
          (in +
           (
            (
             (alpha * (out & RB) & RB_8OFF) |
             (alpha * (out & G) & G_8OFF)
            ) >> 8
           )
          );
}
67. OUT = A | RGB &
          (IN +
           (
            (
             (ALPHA * (OUT & RB) & RB_8OFF) |
             (ALPHA * (OUT & G) & G_8OFF)
            ) >> 8
           )
          );
68. For cases where there is no overlap between
the four output pixels for four input pixels, we
can use vectorized (SSE2) code
128-bit wide registers; load four 32-bit RGBA
values, use the same approach as previously
(R|B and G) in two registers to perform four
paints at once
69. inline
void
blend4PreToStatic(uint32_t ** in,
                  uint32_t * out)  // Paints in (quad-word) onto out
{
    __m128i rb, g, a, a_, o, mask_reg;  // Registers

    rb = _mm_loadu_si128((__m128i *) out);  // Load destination (unaligned -- may not be on a 128-bit boundary)
    a_ = _mm_load_si128((__m128i *) *in);  // We make sure the input is on a 128-bit boundary before this call
    *in += 4; _mm_prefetch((char *) (*in + 28), _MM_HINT_T0);  // Fetch the two-cache-lines-out memory
    mask_reg = _mm_set1_epi32(0x0000FF00);  // Set green mask (x4)
    g = _mm_and_si128(rb, mask_reg);  // Mask to greens (x4)
    mask_reg = _mm_set1_epi32(0x00FF00FF);  // Set red and blue mask (x4)
    rb = _mm_and_si128(rb, mask_reg);  // Mask to red and blue
    rb = _mm_slli_epi32(rb, 8);  // << 8 ; g is already biased by 256 in 16-bit spacing
    a = _mm_srli_epi32(a_, 24);  // >> 24 ; The four alpha values, shifted to the lower 8 bits of each word
    mask_reg = _mm_slli_epi32(a, 16);  // << 16 ; A copy of the four alpha values, shifted to bits [16-23] of each word
    a = _mm_or_si128(a, mask_reg);  // We now have the alpha value at both bits [0-7] and [16-23] of each word
    // These steps add one to transparency values >= 0x80
    o = _mm_srli_epi16(a, 7);  // Now the high bit is the low bit
70. // We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
    // to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
    // storing the upper 16 of the 32-bit result. (This is the operation that is available, so that's why we're
    // doing it in this fashion!)
    rb = _mm_mulhi_epu16(rb, a);
    g = _mm_mulhi_epu16(g, a);
    g = _mm_slli_epi32(g, 8);  // Move green into the correct location.
    // R and B, both the lower 8 bits of their 16 bits, don't need to be shifted
    o = _mm_set1_epi32(0xFF000000);  // Opaque alpha value
    o = _mm_or_si128(o, g);
    o = _mm_or_si128(o, rb);  // o now has the background's contribution to the output color
    mask_reg = _mm_set1_epi32(0x00FFFFFF);
    g = _mm_and_si128(mask_reg, a_);  // Removes alpha from the foreground color
    o = _mm_add_epi32(o, g);  // Add foreground and background contributions together
    _mm_storeu_si128((__m128i *) out, o);  // Unaligned store
}
73. MPI_Init(3) MPI MPI_Init(3)
NAME
MPI_Init - Initialize the MPI execution environment
SYNOPSIS
int MPI_Init( int *argc, char ***argv )
INPUT PARAMETERS
argc - Pointer to the number of arguments
argv - Pointer to the argument vector
THREAD AND SIGNAL SAFETY
This routine must be called by one thread only. That thread is called
the main thread and must be the thread that calls MPI_Finalize .
NOTES
The MPI standard does not say what a program can do before an MPI_INIT
or after an MPI_FINALIZE . In the MPICH implementation, you should do
as little as possible. In particular, avoid anything that changes the
external state of the program, such as opening files, reading standard
input or writing to standard output.
74. MPI_Barrier(3) MPI MPI_Barrier(3)
NAME
MPI_Barrier - Blocks until all processes in the communicator have
reached this routine.
SYNOPSIS
int MPI_Barrier( MPI_Comm comm )
INPUT PARAMETER
comm - communicator (handle)
NOTES
Blocks the caller until all processes in the communicator have called
it; that is, the call returns at any process only after all members of
the communicator have entered the call.
75. MPI_Finalize(3) MPI MPI_Finalize(3)
NAME
MPI_Finalize - Terminates MPI execution environment
SYNOPSIS
int MPI_Finalize( void )
NOTES
All processes must call this routine before exiting. The number of
processes running after this routine is called is undefined; it is best
not to perform much more than a return rc after calling MPI_Finalize .
76. MPI_Comm_size(3) MPI MPI_Comm_size(3)
NAME
MPI_Comm_size - Determines the size of the group associated with a
communicator
SYNOPSIS
int MPI_Comm_size( MPI_Comm comm, int *size )
INPUT PARAMETER
comm - communicator (handle)
OUTPUT PARAMETER
size - number of processes in the group of comm (integer)
77. MPI_Comm_rank(3) MPI MPI_Comm_rank(3)
NAME
MPI_Comm_rank - Determines the rank of the calling process in the com-
municator
SYNOPSIS
int MPI_Comm_rank( MPI_Comm comm, int *rank )
INPUT ARGUMENT
comm - communicator (handle)
OUTPUT ARGUMENT
rank - rank of the calling process in the group of comm (integer)
78. MPI_Send(3) MPI MPI_Send(3)
NAME
MPI_Send - Performs a blocking send
SYNOPSIS
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm)
INPUT PARAMETERS
buf - initial address of send buffer (choice)
count - number of elements in send buffer (nonnegative integer)
datatype
- datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)
NOTES
This routine may block until the message is received by the destination
process.
79. MPI_Recv(3) MPI MPI_Recv(3)
NAME
MPI_Recv - Blocking receive for a message
SYNOPSIS
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Status *status)
OUTPUT PARAMETERS
buf - initial address of receive buffer (choice)
status - status object (Status)
INPUT PARAMETERS
count - maximum number of elements in receive buffer (integer)
datatype
- datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)
NOTES
The count argument indicates the maximum length of a message; the
actual length of the message can be determined with MPI_Get_count .
80. MPI_Isend(3) MPI MPI_Isend(3)
NAME
MPI_Isend - Begins a nonblocking send
SYNOPSIS
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
buf - initial address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype
- datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)
OUTPUT PARAMETER
request
- communication request (handle)
81. MPI_Irecv(3) MPI MPI_Irecv(3)
NAME
MPI_Irecv - Begins a nonblocking receive
SYNOPSIS
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
buf - initial address of receive buffer (choice)
count - number of elements in receive buffer (integer)
datatype
- datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)
OUTPUT PARAMETER
request
- communication request (handle)
82. MPI_Bcast(3) MPI MPI_Bcast(3)
NAME
MPI_Bcast - Broadcasts a message from the process with rank "root" to
all other processes of the communicator
SYNOPSIS
int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root,
MPI_Comm comm )
INPUT/OUTPUT PARAMETER
buffer - starting address of buffer (choice)
INPUT PARAMETERS
count - number of entries in buffer (integer)
datatype
- data type of buffer (handle)
root - rank of broadcast root (integer)
comm - communicator (handle)
83. MPI_Allreduce(3) MPI MPI_Allreduce(3)
NAME
MPI_Allreduce - Combines values from all processes and distributes the
result back to all processes
SYNOPSIS
int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
INPUT PARAMETERS
sendbuf
- starting address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype
- data type of elements of send buffer (handle)
op - operation (handle)
comm - communicator (handle)
OUTPUT PARAMETER
recvbuf
- starting address of receive buffer (choice)
84. MPI_Type_create_hvector(3) MPI MPI_Type_create_hvector(3)
NAME
MPI_Type_create_hvector - Create a datatype with a constant stride
given in bytes
SYNOPSIS
int MPI_Type_create_hvector(int count,
int blocklength,
MPI_Aint stride,
MPI_Datatype oldtype,
MPI_Datatype *newtype)
INPUT PARAMETERS
count - number of blocks (nonnegative integer)
blocklength
- number of elements in each block (nonnegative integer)
stride - number of bytes between start of each block (address integer)
oldtype
- old datatype (handle)
OUTPUT PARAMETER
newtype
- new datatype (handle)
85. mpicc(1) MPI mpicc(1)
NAME
mpicc - Compiles and links MPI programs written in C
DESCRIPTION
This command can be used to compile and link MPI programs written in C.
It provides the options and any special libraries that are needed to
compile and link MPI programs.
It is important to use this command, particularly when linking pro-
grams, as it provides the necessary libraries.
COMMAND LINE ARGUMENTS
-show - Show the commands that would be used without running them
-help - Give short help
-cc=name
- Use compiler name instead of the default choice. Use this
only if the compiler is compatible with the MPICH library (see
below)
-config=name
- Load a configuration file for a particular compiler. This
allows a single mpicc command to be used with multiple compil-
ers.
[…]
86. mpiexec(1) MPI mpiexec(1)
NAME
mpiexec - Run an MPI program
SYNOPSIS
mpiexec args executable pgmargs [ : args executable pgmargs ... ]
where args are command line arguments for mpiexec (see below), exe-
cutable is the name of an executable MPI program, and pgmargs are com-
mand line arguments for the executable. Multiple executables can be
specified by using the colon notation (for MPMD - Multiple Program Mul-
tiple Data applications). For example, the following command will run
the MPI program a.out on 4 processes:
mpiexec -n 4 a.out
The MPI standard specifies the following arguments and their meanings:
-n <np>
- Specify the number of processes to use
-host <hostname>
- Name of host on which to run processes
-arch <architecture name>
- Pick hosts with this architecture type
[…]
Editor's Notes
NUMA: NUMA is a distinction within shared-memory systems, e.g. AMD HyperTransport or Intel QPI vs. northbridge w/ FSB.
GPGPU: sort of; transfers into and out of GPU memory are from the main shared system memory; transfers within GPU memory by GPU kernels are shared memory within their own private (GPU) memory space.
Distributed systems: comprised of multiple nodes. Each node typically == an individual "computer".
MPI can be used on shared-memory systems; modern implementations use the fastest transfer mechanism between each set of peers.
Some scale better. CPUs keep getting faster, either through GHz or # of cores; memory BW has not kept up.
STREAM benchmark:
  name   kernel                bytes/iter  FLOPS/iter
  COPY:  a(i) = b(i)           16          0
  SCALE: a(i) = q*b(i)         16          1
  SUM:   a(i) = b(i) + c(i)    24          1
  TRIAD: a(i) = b(i) + q*c(i)  24          2
8-million-element double-precision arrays (~64MB arrays); ICC 10; -xP.
CPU manufacturers are focused on improving this, and have really sped things up with Nehalem. … what about Nehalem?
Examples of tasks that hit BW walls: highly tuned inner loops (few ops per element, running over a large volume); masking operations (multiply each element from one volume by a mask in another volume); max / min / mean / std operations; MIPs.
Still an issue on new systems; likely to continue to be an issue. Nehalem is NUMA as well; another layer of complexity -> can control somewhat via binding (numactl; through the task manager in Windows).
This is not to say the 8 processors are useless; on programs where the inner-loop operation does more work, the scaling can be close to ideal, e.g. sin(x).
Front-side bus, QuickPath Interconnect, HyperTransport.
High-level languages: need to finish one operation (A += B) before doing the next operation (A = A*A).
MPI is the de facto standard for parallel programs on distributed-memory systems, from Blue Gene to off-the-shelf Linux clusters.
1GB 1333 DDR3: $95 ($800); 2GB 1333 DDR3: $155 ($620); 4GB 1333 DDR3: $322 ($644); 8GB 1333 DDR3 chips: $3410 ($3410).
Nehalem again makes this more confusing; the memory bus clock changes based on the # of modules…
Also one of the key points that CUDA is focused on; 3 of the 8 called-out improvements in the latest rev focus on efficient / improved memory BW usage.
Needs for large data sets in image processing are real and here now.
2 s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set). (This is not the iterative reconstruction.)
*MY* taxonomy.
SETI@home 1999 – 2005; now part of BOINC. 1.7 PFlops > 1.4 PFlops (RoadRunner).
Grid: jobs ~independent and asynchronous; Hadoop / MapReduce; cycle stealing.
ScaleMP: up to 32 processors (128 cores) and 4TB shared memory.
Cluster computing: a distributed "process" starts on multiple machines concurrently; typically cookie-cutter (although support for different architectures is possible in MPI); significant communication between nodes during processing; massive simulations; applications sensitive to timings.
Folding@Home: loosely coupled collective (GRID), tightly coupled within a client (MPI); also Grid+GPU. 4.6 PFlops.
More taxonomy:
Grid: loosely connected; nodes "unaware" of other nodes. Works great for "batch" problems. Different architectures; different implementations (CPU, GPU, … PS3 and Nvidia clients for Folding@Home). Wildly varying performance between nodes is "easily" accommodated. Fail-over is almost "automatic". Sun Grid Engine; MapReduce / Hadoop. Can be a cycle-stealing background process.
Cluster: tightly connected; nodes in tight communication with each other. Failures are hard to handle – intermediate results often saved; MPI-2. Usually homogeneous nodes; varying performance can cause severe performance loss if not accounted for carefully. MPI. (SGE / other schedulers.)
We will be focusing on clusters; this is where MPI is used.
Network transfers (even on fast networks) are expensive compared to memory transactions
Number of bi-directional links for N nodes = N*(N-1)/2 = 15 for 6 nodes; 28 for 8; ~N^2/2.
Managing this yourself is complicated and time-consuming!
>>> This is what MPI simplifies for us. So what is MPI?
ANL = Argonne National Laboratory.
* Although available on many platforms, it has a unix heritage, and is most natural to use in unix-y (Mac, Linux, Sun) environments. (OpenMPI ships standard on Macs w/ Leopard.)
Low-level: there are some functions that operate on the datatype (reduce operations) – but most "just" shuffle bytes around.
MPI is everywhere in high-performance computing, but why?
>>> So what does MPI do for you? Why should you use it? Look at the complexity of setting up a distributed system again.
Can also provide profiling; MPI can use different communication for different sets of peers (e.g. SMP, Infiniband, TCP/IP).
You could (almost) write any MPI program with these 4 calls; much different from pthreads w/ mutexes, OpenMP, GPU, etc.; communication provides synchronization by its nature; no dealing with "locks" on "shared" variables, etc.
BUT: need to be sure each node is initializing variables correctly…
Getting back to what MPI is a little more… Even though most MPI programs could be written with just a few MPI commands, there are quite a few available….
Linux / Mac instructions; Leopard already has OpenMPI installed, in /usr/bin/mpi[cc|cxx|run].
Not familiar with the Windows version; see the Windows portion of http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
Fortran and C++ if desired. Supports shared-mem and tcp channels.
MPD = multiprocessing daemon; used to start one daemon per host; these daemons are used to start the actual jobs. Talk a little more about MPD.
MPD = multiprocessing daemon; launched (and left running) on each node that wants to be ready to participate in an MPI execution. Other options (mpirun) exist, but mpd is fast for starting new jobs (as opposed to new ssh sessions created each time a job is run).
MPD = multiprocessing daemon
MPI_Init() must be called in every program that will use MPI calls.
Caveat: printing to stdout (stderr) from different nodes works, but it is not guaranteed to be synchronized. On click: note 2 printed before 1 (even though 2 occurred after 1, as enforced by MPI_Barrier()); fflush() does not fix it… send all IO to one process for printout.
Now is a good time to discuss what actually happens when an MPI parallel job is run (in contrast to a threaded job).
I want to look a little more into what actually happens when a parallel MPI program runs. Let's start by looking at how a parallel threaded app runs.
Threads are spawned at runtime as requested by the program. Multiple threads may be spawned and joined over the course of a program. Each thread has access to memory to do its work (whatever it may be). main() is only entered and exited once.
Multiprocessing daemons are already running and know about each other. NOT SHOWING RANK 2. Each rank is a full program; starts in main; exits from main. MPI_Bcast() / MPI_Allreduce() are included here as a way to show communication between nodes. Program logic during execution determines who does what.
Only the portions highlighted are different between the nodes; however, every line – the full program – is executed on each node; tests are performed at run time to select the different code to run on each node.
This is different from threaded apps, where common (global) code sections (initializations, etc.) are really only run once. As long as init << parallel work, not a big performance issue.
(Doesn't have to be done this way, but this is the typical way; different executables can be run as different processes, if desired.)
MPE is useful for understanding what MPI is doing
Black sections between barriers are the printf calls ~20usec each
More of a debug tool
Transpose; ~88MB data set in 70 ms (1.2GB/s)
320x176x320 -> 384x256x384 ~ 0.5 sec (14 GFlops just for the FFTs). Light pink = fftw library!!! Custom labels in the profile: 2d – tp – 1d – pad – 1d – tp – pad – 2d. On click: info boxes.
Let’s get back to MPI programming by examining the two basic building blocks for any MPI program: MPI_Send & MPI_Recv.
You can make communicators that include only a subset of the active nodes; useful for doing “broadcasts” within a subset, etc.
Tag can be used to separate classes of messages; up to the user. Can be used with zero-length messages to communicate something via the tag alone, e.g. “ready” or “complete”.
It’s important to note that the types don’t have to be exactly the same; a strided vector could be received into / sent from a contiguous vector.
Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
Threshold is 16kB
We can see that sends can complete before the matching receives are posted, but not vice-versa. (Timing enforced by message passing; no mutexes required!)
Threshold is 16kB
“Small” messages get sent into pre-allocated (within the MPI library) buffers; allows the sender to return quicker; less traffic; etc. “Eager.”
“Large” messages get sent only once the receiver has posted the receive request (with the receive buffer). “Rendezvous.”
Most of these also have _init modes to create a persistent request that can be started with MPI_Start[all]() and completed with MPI_[Test|Wait][any|all|some].
* By basic, I mean excluding things like broadcasts, scatters, reduces; all of which have some send action included within them.
Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
5 – 10 us for MPI_Isend / MPI_Irecv to return; the transfer took 1.7 ms (147MB/s).
It’s important to look at the work you need to speed up and understand which approach will do better for you. Multiple separable tasks, each of ~the same difficulty, work well with task parallelism.
Data parallelism works well with large data sets. Load balancing can become an issue if relative workloads aren’t known a priori.
It’s important to consider how to split the data in a data-parallel system.
Suppose you know you want to do MIPs across Z repeatedly; ignoring everything else, you would want to lay out with z available locally (but not necessarily contiguous; SSE instructions for maximums don’t want to work along the four contiguous elements, but between a pair of elements in two four-value sets; have z as your next-to-fastest dimension).
Other examples of distributions are cyclic and block-cyclic; also higher-dimension splitting (into a grid, for example).
People are really doing this… We’re really doing this…
Data is split along x immediately after FTx; distributed to all nodes. Calibration scan taken earlier. (This is not the iterative reconstruction.)
GW is done one dimension at a time; requires the data along that dimension to be local, so we transpose before the GWx correction.
2 s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set).
It’s important to look at your problem and determine where it can be separated out. In general, MPI works better if you can separate at the large scale, rather than the fine scale. SIMD is an example of fine-scale parallelism.
Are each of these separable?
Max: can do local maximums, and then a max of maximums.
Sort: parallel bitonic system; out of scope here; 55 ms for ¼ qsort; 85 ms for the full parallel sort; ~220 ms for one qsort of the full vector (1 mega-element ints).
1D FFT: fine as long as the FFT is not along the split dimension (assuming the time of a single 1D FFT is small enough that you won’t try to split it up).
2D FFTs: easy as long as not split along the FFTs.
3D FFTs: perform along the contiguous dims; swap for the final ("transposed input/output" options in the fftw3 MPI implementation).
320x176x320 -> 384x256x384 ~ 0.5 sec (14 GFlops just for the FFTs). Light pink = fftw library. Custom labels in the profile: 2d – tp – 1d – pad – 1d – tp – pad – 2d. On click: info boxes.
One-sided communication opens up race-condition concerns again, but gains some latency / BW because of reduced negotiation.
Efficient: make use of all data on a cache line when you read it; and only read it once
Donald Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”
There are some packages out there (OpenMPI w/ Eclipse; TotalView) to help with debugging MPI. Errors on other nodes can cause the one you’re debugging to receive a signal to exit.
You can build a cluster virtually just to see how things work…
All of these mailing lists are active, and wonderful places to get help (After you’ve read the Docs & FAQ!)