2. We’ll discuss the following today
• Background of Heterogeneous Computing
• Message Passing Interface (MPI)
• Vector Addition Example (MPI Implementation)
• More implementation details of MPI
3. Background
• Heterogeneous Computing System (HCS)
• High Performance Computing & its uses
• Supercomputer vs. HCS
• Why use Heterogeneous Computers in HCS?
• MPI is the predominant message passing system for clusters
4. Introduction to MPI
• MPI stands for Message Passing Interface
• Predominant API
• Runs on virtually any hardware platform
• Programming Model – Distributed Memory Model
• Supports Explicit Parallelism
• Multiple Languages supported
5. Reasons for using MPI
• Standardization
• Portability
• Performance Opportunities
• Functionality
• Availability
6. MPI Model
• Flat view of the cluster to the programmer
• SPMD Programming Model
• No Global Memory
• Inter-process Communication is possible & required
• Process Synchronization Primitives
9. Format of MPI Calls
• Case Sensitivity
• C – Yes
• Fortran – No
• Name Restrictions
• MPI_*
• PMPI_* (profiling interface)
• Error Handling
• Handled via return parameter
10. Groups & Communicators
• Groups – Ordered set of processes
• Communicators – Handle to a group of processes
• Most MPI routines require a communicator as an argument
• MPI_COMM_WORLD – Predefined communicator that includes all processes
• Rank – Unique ID
25. Point-to-Point Operations
• Typically involve two, and only two, different MPI processes
• Different types of send and receive routines
• Synchronous send
• Blocking send / blocking receive
• Non-blocking send / non-blocking receive
• Buffered send
• Combined send/receive
• "Ready" send
• Send/Receive Routines not tightly coupled
26. Buffering
• Why is buffering required?
• It is Implementation Dependent
• Opaque to the programmer and managed by the MPI library
• Advantages
• Can exist on the sending side, the receiving side, or both
• Improves program performance
• Disadvantages
• A finite resource that can be easy to exhaust
• Often mysterious and not well documented
27. Blocking vs. Non-blocking
• Blocking
• Send will only return after it's safe to modify the application buffer
• Receive returns after the data has arrived and is ready for use by the application
• Synchronous communication is possible; asynchronous communication is also possible
• Non-blocking
• Send/receive return almost immediately
• Unsafe to modify our variables till we know the operation has completed
• Only asynchronous communication is possible
• Primarily used to overlap computation with communication to get a performance gain (see the sketch below)
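The notes don't reproduce a non-blocking example, so here is a minimal sketch in C, assuming a simple ring exchange (the buffer size and tag are arbitrary): MPI_Isend/MPI_Irecv return almost immediately, independent work can proceed, and MPI_Waitall confirms completion before the buffers are touched again.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double out[128], in[128];
    for (int i = 0; i < 128; i++) out[i] = rank;   /* fill the send buffer */

    /* exchange buffers with neighbors in a ring */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Request reqs[2];
    MPI_Isend(out, 128, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(in,  128, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation can run here while messages are in flight ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* only now are the buffers safe */
    MPI_Finalize();
    return 0;
}
```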
28. Order and Fairness
• Order
• MPI guarantees that messages will not overtake each other
• Order rules do not apply if there are multiple threads participating in the communication operations
• Fairness
• MPI does not guarantee fairness - it's up to the programmer to prevent "operation starvation"
30. Collective Communication Routines (contd.)
• Scope
• Must involve all processes within the scope of a communicator
• Unexpected behavior, including program failure, can occur if even one task in the communicator doesn't participate
• Programmer's responsibility to ensure that all processes within a communicator participate in any collective operations.
• Collective communication functions are highly optimized
31. Groups & Communicators (additional details)
• Group
• Represented within system memory as an object
• Only accessible as a handle
• Always associated with a communicator object
• Communicator
• Represented within system memory as an object.
• In the simplest sense, the communicator is an extra "tag" that must be included with MPI calls
• Inter-group and Intra-group communicators available
• From the programmer's perspective, a group and a communicator are one
32. Primary Purposes of Group and Communicator Objects
1. Allow you to organize tasks, based upon function, into task groups.
2. Enable collective communications operations across a subset of related tasks.
3. Provide the basis for implementing user-defined virtual topologies.
4. Provide for safe communications.
33. Programming Considerations and Restrictions
• Groups/communicators are dynamic
• Processes may be in more than one group/communicator
• MPI provides over 40 routines related to groups, communicators, and virtual topologies.
• Typical usage (see the sketch after this list):
• Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group
• Form new group as a subset of global group using MPI_Group_incl
• Create new communicator for new group using MPI_Comm_create
• Determine new rank in new communicator using MPI_Comm_rank
• Conduct communications using any MPI message passing routine
• When finished, free up new communicator and group (optional) using MPI_Comm_free and MPI_Group_free
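A sketch of that sequence in C; the even-rank subgroup is an illustrative choice, not from the slides.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Group world_group;                 /* 1. handle of the global group */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    int n = (world_size + 1) / 2;          /* 2. subset: the even ranks */
    int ranks[n];
    for (int i = 0; i < n; i++) ranks[i] = 2 * i;
    MPI_Group even_group;
    MPI_Group_incl(world_group, n, ranks, &even_group);

    MPI_Comm even_comm;                    /* 3. communicator for the new group */
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) {      /* ranks outside the group get MPI_COMM_NULL */
        int new_rank;                      /* 4. rank in the new communicator */
        MPI_Comm_rank(even_comm, &new_rank);
        printf("world rank %d -> even rank %d\n", world_rank, new_rank);
        /* 5. communications on even_comm would go here */
        MPI_Comm_free(&even_comm);         /* 6. free the communicator ... */
    }
    MPI_Group_free(&even_group);           /* ... and the groups */
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}
```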
34. Virtual Topologies
• Mapping/ordering of MPI processes into a geometric "shape"
• Similar to CUDA Grid/Block 2D/3D structure
• They are only virtual
• Two main types
• Cartesian (grid)
• Graph
• Virtual topologies are built upon MPI communicators and groups.
• Must be "programmed" by the application developer.
35. Why use Virtual Topologies?
• Convenience
• Useful for applications with specific communication patterns
• Communication Efficiency
• Penalty avoided on some hardware architectures for communication between distant nodes
• Process mapping may be optimized based on physical characteristics of the machine
• The MPI implementation decides whether the virtual topology is ignored or not
HCS
Systems that use more than one kind of processor
So far we have discussed programming on a system with one host and one device
HPC & uses
Using more than one computer as a part of a cluster to get things done faster.
A computer cluster is just a bunch of computers connected to a local network or LAN
Uses
Stock Prediction & Trading
Rendering a very high-resolution picture (400,000,000 pixels)
Evolutionary Algorithms
SC vs. HCS
SC
Good only for specialized problems
Requires vast sums of money and specialized expertise to use
HCS
Can be managed without a lot of expense or expertise
Why use Heterogeneous Computers in HCS?
For better energy efficiency
Usage of GPUs in clusters started in 2009, so it's relatively new
Effectiveness of this approach – many such clusters appear in the Green 500 list
Green 500 – list of the most energy-efficient/greenest supercomputers in the world
Message Passing Interface
Originally designed for distributed memory architectures (1980s to early '90s)
Predominant API
Wiped out other APIs that came before it
Runs on virtually any hardware platform
Distributed Memory
Shared Memory
Hybrid
Programming Model
Regardless of the underlying physical architecture of the machine
Explicit Parallelism
Programmer is responsible for identification and implementation of parallelism using algorithms and MPI Constructs
Languages
C, C++ and Fortran
Standardization
It’s supported on all HPC platforms like
MVAPICH – Linux Cluster
Open MPI – Linux Cluster
IBM MPI – BG/Q Cluster – part of IBM's Blue Gene series
Portability
No source code modifications needed when porting between platforms, provided the platform supports the MPI standard
Performance Opportunities
Vendors can tune it further for their native hardware
Functionality
Over 430 routines in MPI 3
Most programs use fewer than a dozen routines
Availability
Variety of implementations available, both vendor and public domain
SPMD Programming Model
Each process computes part of the output
Flat view of the cluster
Instead of having a node concept, MPI just has processes.
All processes are given a flat index, like the global index in OpenCL
Programming is similar to CUDA & OpenCL
No Global Memory
No such thing
No shared memory between nodes
Inter-process Communication is possible
Since there is no global memory, any data transfer has to be done via inter-process communication (IPC) using MPI constructs
Process Synchronization Primitives
We use MPI Collectives to provide Synchronization
Header File
The header file is mandatory (mpi.h in C)
In Fortran, the mpi_f08 module (USE mpi_f08) is preferred over the include file
The highlighted portions are where we will use the MPI Constructs
0 – MPI_THREAD_SINGLE
Only one thread will execute
1 – MPI_THREAD_FUNNELED
Process may be multi-threaded
However, only the main thread will make MPI calls (funneled through main)
2 – MPI_THREAD_SERIALIZED
Process may be multi-threaded
Multiple threads may make MPI calls, but only one at a time
Concurrent calls are serialized
3 – MPI_THREAD_MULTIPLE
Multiple threads may call MPI with no restrictions
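A minimal sketch of requesting one of these levels with MPI_Init_thread; the library reports the level it can actually grant.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* request level 2 (serialized); 'provided' holds what was granted */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    if (provided < MPI_THREAD_SERIALIZED)
        printf("Requested thread level not available; got %d\n", provided);
    MPI_Finalize();
    return 0;
}
```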
Format of MPI Calls
Case Sensitivity
C – yes
Fortran - No
Name Restrictions
Prefixes starting with MPI_* and PMPI_* (profiling interface)
Error Handling
Default behavior of an MPI Call is to abort if there is an error
Good news – you will probably never see anything other than success
Bad news – errors are a pain to debug
The default handler can be overridden
How errors are displayed to the user is implementation dependent
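A hedged sketch of overriding the default handler so that errors come back as return codes instead of aborting; the deliberately invalid destination rank is just a way to provoke an error.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* replace the default MPI_ERRORS_ARE_FATAL handler */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int x = 0;   /* rank 'size' does not exist, so this send must fail */
    int err = MPI_Send(&x, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        printf("MPI error: %s\n", msg);
    }
    MPI_Finalize();
    return 0;
}
```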
MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.
Rank
Unique identifier assigned by the system to a process when the process initializes
Sometimes called a task ID
Ranks are contiguous and begin at 0
MPI_Init (&argc,&argv)
Initializes the MPI execution environment
Must be called in every MPI Program
Should be called only once and before any other MPI function
May be used to pass the command-line arguments to all processes
Passing them is not required by the standard and is implementation dependent
MPI_Comm_size (comm,&size)
Returns total no. of MPI Processes in specified communicator
The result is returned in the size parameter
Needed because the number of allocated processes might not be the same as the number of processes requested
MPI_Comm_rank (comm,&rank)
Returns the task ID (rank) of the calling process
Will be an integer between 0 and n-1 within the MPI_COMM_WORLD communicator
If the process is associated with another communicator, it will have a unique rank within each of these communicators as well
MPI_Abort (comm,errorcode)
Terminates all MPI Processes associated with a communicator
Communicator is ignored in most implementations and all processes are terminated
MPI_Get_processor_name (&name,&resultlength)
Returns the processor name and its length
May not be the same as the host name; it is implementation dependent
MPI_Get_version (&version,&subversion)
Returns the version and subversion of the MPI standard implemented by the library
MPI_Initialized (&flag)
Indicates whether MPI_Init has been called.
MPI_Wtime ()
Returns elapsed wall clock time in seconds(double precision)
MPI_Wtick ()
Returns the resolution of MPI_Wtime in seconds
For example, if the clock is implemented by the hardware as a counter that is incremented every millisecond, the value returned by MPI_WTICK should be 10^-3
MPI_Finalize ()
Terminates MPI Execution environment
Should be the last MPI Routine called
No other MPI Routines may be called after it
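A minimal sketch that exercises the environment-management routines just described:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                 /* must come before any other MPI call */

    int size, rank, version, subversion, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's task ID */
    MPI_Get_processor_name(name, &namelen);
    MPI_Get_version(&version, &subversion);

    double t0 = MPI_Wtime();
    /* ... the real work would go here ... */
    double elapsed = MPI_Wtime() - t0;

    printf("rank %d of %d on %s (MPI %d.%d, tick %g s, elapsed %g s)\n",
           rank, size, name, version, subversion, MPI_Wtick(), elapsed);

    MPI_Finalize();                         /* must be the last MPI routine called */
    return 0;
}
```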
Explain every line
Main will be executed by all the processes
np = no. of processes == gridDim.x * blockDim.x
pid == blockIdx.x * blockDim.x + threadIdx.x
We request n processes when we begin program execution; we use MPI_Comm_size to verify whether we actually got the requested number of processes.
If the system does not have enough resources, we don't get enough processes for our program. We check whether we have enough and abort if we don't have at least 3.
We print the error message from only one process
We abort all the processes linked with the communicator
If the number of processes is sufficient, we get into the real execution of the program
Control flow is used to specialize one of the processes
Process np-1 acts as the server == host
Processes 0 through np-2 act as the compute nodes == devices
If you are a compute node, you only receive a section of the input for computation
Once all the processes are complete, we clean up data structures and release all resources by calling MPI_Finalize.
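The slide's code isn't reproduced in this transcript, so here is a hedged reconstruction of the main structure just described. The names vecadd_server and vecadd_compute are placeholders; their bodies are sketched in the server and compute sections further below.

```c
#include <stdio.h>
#include <mpi.h>

/* placeholder roles; bodies are sketched in the later sections */
void vecadd_server(int np)  { (void)np; /* distribute, wait, collect */ }
void vecadd_compute(int np) { (void)np; /* receive, add, send back   */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int np, pid;
    MPI_Comm_size(MPI_COMM_WORLD, &np);    /* did we get the processes we asked for? */
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    if (np < 3) {                          /* need at least 2 compute + 1 server */
        if (pid == 0)                      /* print the error from one process only */
            fprintf(stderr, "Need at least 3 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);      /* abort everyone in the communicator */
    }

    if (pid == np - 1)                     /* specialize one process as the server */
        vecadd_server(np);
    else                                   /* ranks 0..np-2 are compute nodes */
        vecadd_compute(np);

    MPI_Finalize();                        /* clean up MPI data structures */
    return 0;
}
```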
This is used by one process to send data to another process
Very easy to use…As a beginner, you don’t need to know too much about the implementation to actually use it.
*buf – starting address of the send buffer, i.e. the location from which data has to be copied
Count
No. of elements in the buffer
Note: elements, not bytes
If we have a buffer of type double, then its size is going to be more than the size of a buffer of type int, even though the count is the same
Datatype
Datatype of the elements in the buffer
Dest – process id of the target process
Tag
Message tag (integer)
Has to be non-negative
Comm
Communicator
Similar to the send data interface
Status
Output parameter
Status of the received message
This is a 2-step process where the send has to be called by one process and the receive has to be called by the other
In CUDA, it’s one step with 2 directions
Host to Device
Device to Host
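A minimal sketch of the two-step pattern (the tag, datatype, and value are arbitrary choices):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double value = 3.14;
    if (rank == 0 && size > 1) {
        /* step 1: the sender calls MPI_Send */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* step 2: the receiver calls the matching MPI_Recv */
        MPI_Status status;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %f\n", value);
    }
    MPI_Finalize();
    return 0;
}
```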
This is the server code
Only the (np-1)th process will execute this.
The server is going to do the I/O and distribute the data to the compute nodes
Eventually it will collect the output from all the compute nodes and do I/O again
Q = why is MPI_Comm_Size called here again?
A – slightly cleaner code, as the number of parameters is reduced
We are going to allocate memory for the entire input and output.
Program will abort if there isn’t enough memory available.
In a real program, we would be reading from the Input / disk to populate the data
Here we just fill the input vectors with random data
We initialize the pointers to these input vectors
We then go into the for loop, where each iteration sends a chunk from vector A and a section from vector B to a compute process
We start from 0 and go up to the number of compute nodes == np-2 (because the last rank is used for the server process)
Once we send a section to a compute process, we increment the pointers into the input vectors by the section's size so that we can send the next section to the subsequent process.
For extremely large input sizes, we may have to further parallelize this server process
Perhaps by having more than 1 server process
Once data is distributed to all the compute processes, the server process waits till all the compute processes are done with their processing
Once everyone finishes their work, all processes are released from the barrier
Now the server process will collect the data from all the processes using MPI_Recv
Blocks caller until all group members have called it
It returns only after all group members have entered the call
As the name suggests, this is called barrier synchronization, which is similar to __syncthreads() in CUDA.
Once the data has been copied from the compute processes, I/O is performed by the server process.
After the I/O and before the program exits, the memory allocated on the heap is released.
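Putting those steps together, a hedged sketch of the server body (continuing the placeholder from the main sketch above; the section size and tags are illustrative choices, not from the slides):

```c
#include <stdlib.h>   /* malloc, free, rand */
#include <mpi.h>

void vecadd_server(int np) {
    int nodes = np - 1;                 /* ranks 0..np-2 are compute processes */
    int sec = 1024;                     /* section size per compute process */
    int N = sec * nodes;                /* total number of elements */

    float *A = malloc(N * sizeof *A);
    float *B = malloc(N * sizeof *B);
    float *C = malloc(N * sizeof *C);
    if (!A || !B || !C)                 /* abort if there isn't enough memory */
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* a real program would read from disk; here we use random data */
    for (int i = 0; i < N; i++) { A[i] = rand() % 100; B[i] = rand() % 100; }

    /* send one section of A and one section of B to each compute process */
    float *pa = A, *pb = B;
    for (int node = 0; node < nodes; node++) {
        MPI_Send(pa, sec, MPI_FLOAT, node, 0, MPI_COMM_WORLD);
        MPI_Send(pb, sec, MPI_FLOAT, node, 1, MPI_COMM_WORLD);
        pa += sec;                      /* advance to the next section */
        pb += sec;
    }

    MPI_Barrier(MPI_COMM_WORLD);        /* wait until every compute process is done */

    /* collect the output sections back into C */
    for (int node = 0; node < nodes; node++)
        MPI_Recv(C + node * sec, sec, MPI_FLOAT, node, 2,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... output I/O would happen here ... */
    free(A); free(B); free(C);
}
```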
Here we show the code for the compute process
A total of np-1 processes execute the compute code
By program design, we identify the (np-1)th process as the server; hence we call MPI_Comm_size
Now we allocate memory for a section of data(not the whole)
Immediately go into MPI_Recv to receive the data from server
We then compute the output
Similar to how we do it in CUDA, we should expect barrier synchronization
And we see the barrier synchronization as expected
Now once all the compute processes are done with the computation, they send the data back to the server process
They then free the local memory allocations
Finally, as shown in the main program before, before main exits it uses the MPI_Finalize() call to clean up all MPI data structures and returns successfully
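And a matching hedged sketch of the compute body; the section size and tags must agree with the server's choices above.

```c
#include <stdlib.h>   /* malloc, free */
#include <mpi.h>

void vecadd_compute(int np) {
    int sec = 1024;                     /* must match the server's section size */

    /* memory for a section of the data, not the whole vectors */
    float *a = malloc(sec * sizeof *a);
    float *b = malloc(sec * sizeof *b);
    float *c = malloc(sec * sizeof *c);
    if (!a || !b || !c)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* receive this process's sections of A and B from the server (rank np-1) */
    MPI_Recv(a, sec, MPI_FLOAT, np - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(b, sec, MPI_FLOAT, np - 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int i = 0; i < sec; i++)       /* the actual vector addition */
        c[i] = a[i] + b[i];

    MPI_Barrier(MPI_COMM_WORLD);        /* matches the server's barrier */

    /* send the output section back to the server */
    MPI_Send(c, sec, MPI_FLOAT, np - 1, 2, MPI_COMM_WORLD);

    free(a); free(b); free(c);          /* free the local allocations */
}
```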
Typically involve two, and only two, different MPI processes
One is performing send and the other is doing the matching receive operation
Different types of send and receive routines
6 types of send routines and 3 types of receive routines
Send/Receive Routines not tightly coupled
Any type of send can be used with any type of receive routine
Blocking – notes
Point 1
"Safe" means the modification will not affect the data to be sent
It does not mean that the data was actually received – it may still be in the system buffer
Point 3
A handshake occurs with the receive task to confirm a safe send
Point 4
If a system buffer is used
Order – pt 1
If a sender sends two messages (Message 1 and Message 2) in succession to the same destination, and both match the same receive, the receive operation will receive Message 1 before Message 2.
If a receiver posts two receives (Receive 1 and Receive 2), in succession, and both are looking for the same message, Receive 1 will receive the message before Receive 2.
Fairness
Task 0 sends a message to task 2. However, task 1 sends a competing message that matches task 2's receive. Only one of the sends will complete (if there is no buffering).
Synchronization - processes wait until all members of the group have reached the synchronization point.
Data Movement - broadcast, scatter/gather, all to all.
Collective Computation (reductions) - one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
Collective communication functions are highly optimized
Using them usually leads to better performance as well as readability and productivity
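A minimal sketch touching all three categories (the values are arbitrary):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* data movement: the root broadcasts a value to every process */
    int n = (rank == 0) ? 42 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* synchronization: every process waits here */
    MPI_Barrier(MPI_COMM_WORLD);

    /* collective computation: sum one value from each process at the root */
    int sum = 0;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("broadcast %d, sum of ranks = %d\n", n, sum);

    MPI_Finalize();
    return 0;
}
```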
From the programmer's perspective, a group and a communicator are one
The group routines are primarily used to specify which processes should be used to construct a communicator.
Groups/communicators are dynamic
created and destroyed during program execution
Processes may be in more than one group/communicator
They will have a unique rank within each group/communicator.
They are only Virtual
No relation between physical structure of machine and process topology
Useful for applications with specific communication patterns
Cartesian topology might prove convenient for an application that requires 4-way nearest neighbor communications for grid based data.
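A hedged sketch of building a 2-D Cartesian topology and locating those 4-way nearest neighbors; the grid shape is chosen by MPI_Dims_create.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* let MPI factor the processes into a 2-D, non-periodic grid */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    /* ranks may be reordered, so query the rank in the new communicator */
    int grid_rank, coords[2];
    MPI_Comm_rank(grid, &grid_rank);
    MPI_Cart_coords(grid, grid_rank, 2, coords);

    /* 4-way nearest neighbors; MPI_PROC_NULL at the grid edges */
    int up, down, left, right;
    MPI_Cart_shift(grid, 0, 1, &up, &down);
    MPI_Cart_shift(grid, 1, 1, &left, &right);
    printf("rank %d at (%d,%d): up=%d down=%d left=%d right=%d\n",
           grid_rank, coords[0], coords[1], up, down, left, right);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```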
Tell them to see the example sketched above