Programming a Heterogeneous Computing Cluster
Presented by Aashrith H. Govindraj
We’ll discuss the following today
• Background of Heterogeneous Computing
• Message Passing Interface (MPI)
• Vector Addition Example (MPI Implementation)
• More implementation details of MPI
Background
• Heterogeneous Computing System (HCS)
• High Performance Computing & its uses
• Supercomputer vs. HCS
• Why use Heterogeneous Computers in HCS?
• MPI is the predominant message passing system for clusters
Introduction to MPI
• MPI stands for Message Passing Interface
• Predominant API
• Runs on virtually any hardware platform
• Programming Model – Distributed Memory Model
• Supports Explicit Parallelism
• Multiple Languages supported
Reasons for using MPI
• Standardization
• Portability
• Performance Opportunities
• Functionality
• Availability
MPI Model
• Flat view of the cluster to the programmer
• SPMD Programming Model
• No Global Memory
• Inter-process Communication is possible & required
• Process Synchronization Primitives
MPI Program Structure
• Required Header File
• C – mpi.h
• Fortran – mpif.h
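For reference, a minimal C skeleton of this program structure (not from the slides; the printed message is illustrative):

#include <mpi.h>      /* required MPI header for C */
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);               /* start the MPI environment */

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                       /* shut down MPI; last MPI call */
    return 0;
}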
MPI Thread Support
• Level 0 – MPI_THREAD_SINGLE (only one thread executes)
• Level 1 – MPI_THREAD_FUNNELED (only the main thread makes MPI calls)
• Level 2 – MPI_THREAD_SERIALIZED (multiple threads, but one MPI call at a time)
• Level 3 – MPI_THREAD_MULTIPLE (no restrictions)
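A sketch of how a thread-support level is requested with MPI_Init_thread; asking for the funneled level here is an arbitrary illustrative choice:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;
    /* Ask for level 1 (funneled: only the main thread makes MPI calls). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED)
        printf("Requested thread level not available; got %d\n", provided);

    MPI_Finalize();
    return 0;
}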
Format of MPI Calls
• Case Sensitivity
• C – Yes
• Fortran – No
• Name Restrictions
• MPI_*
• PMPI_* (Profiling interface)
• Error Handling
• Handled via return parameter
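A hedged sketch of return-code error handling; switching MPI_COMM_WORLD from the default abort handler to MPI_ERRORS_RETURN, and the deliberately invalid destination rank, are illustrative choices rather than anything shown on the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* By default MPI aborts on error; ask for return codes instead. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately invalid destination rank to provoke an error code. */
    int rc = MPI_Send(NULL, 0, MPI_INT, 9999, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        printf("MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}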
Groups & Communicators
Groups – Ordered set of processes
Communicators – Handle to a group of processes
Most MPI Routines require a communicator as an argument
MPI_COMM_WORLD – Predefined communicator that includes all processes
Rank – Unique ID
Environment Management Routines
• MPI_Init (&argc,&argv)
• MPI_Comm_size (comm,&size)
• MPI_Comm_rank (comm,&rank)
• MPI_Abort (comm,errorcode)
• MPI_Get_processor_name (&name,&resultlength)
Environment Management Routines (contd.)
• MPI_Get_version (&version,&subversion)
• MPI_Initialized (&flag)
• MPI_Wtime ()
• MPI_Wtick ()
• MPI_Finalize ()
• Fortran – Extra parameter ierr in all functions except the time functions
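A short sketch that exercises the environment-management routines above (the output format is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size, rank, version, subversion, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);
    MPI_Get_version(&version, &subversion);

    double t0 = MPI_Wtime();              /* wall-clock timestamp */
    /* ... work would go here ... */
    double elapsed = MPI_Wtime() - t0;

    printf("rank %d/%d on %s, MPI %d.%d, tick %g s, elapsed %g s\n",
           rank, size, name, version, subversion, MPI_Wtick(), elapsed);

    MPI_Finalize();
    return 0;
}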
Vector Addition Example
Vector Addition Example (contd.)
MPI Sending Data
MPI Receiving Data
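The MPI_Send / MPI_Recv slides themselves are not captured in this transcript; below is a minimal sketch of the pair of calls the speaker notes describe (buffer size and tag are assumed values):

#include <mpi.h>

#define N 8
#define TAG 0

void exchange_example(int rank)
{
    double buf[N];

    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = i;   /* fill the send buffer */
        /* rank 0 sends N doubles to rank 1 */
        MPI_Send(buf, N, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* rank 1 receives the matching message from rank 0 */
        MPI_Recv(buf, N, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, &status);
    }
}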
Vector Addition Example (contd.)
(The example code on these slides is not captured here; the speaker notes walk through it step by step.)
MPI Barriers
• int MPI_Barrier (comm)
• comm – communicator
• This is very similar to barrier synchronization in CUDA – __syncthreads()
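A tiny illustrative use of MPI_Barrier (the timing around it is an assumption, not from the slides):

#include <mpi.h>
#include <stdio.h>

void sync_point(int rank)
{
    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* no process continues until all arrive */
    if (rank == 0)
        printf("all ranks reached the barrier after %g s\n", MPI_Wtime() - t0);
}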
Vector Addition Example (contd.)
(Remaining example slides; a consolidated sketch of the whole program follows.)
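The vector-addition code is not captured in this transcript, so the following is a consolidated sketch reconstructed from the speaker notes: rank np-1 acts as the server that distributes sections of A and B, and the remaining ranks add their sections and send the results back. Vector length, tag, and variable names are assumptions, not the original slide code:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define VECTOR_SIZE 1024            /* assumed total length */
#define TAG 0

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int np, pid;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    if (np < 3) {                                 /* need 1 server + at least 2 compute */
        if (pid == 0) fprintf(stderr, "need at least 3 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int workers = np - 1;                         /* rank np-1 is the server */
    int section = VECTOR_SIZE / workers;          /* assume it divides evenly */

    if (pid == np - 1) {
        /* ---- server process: owns the full vectors, does the "I/O" ---- */
        float *A = malloc(VECTOR_SIZE * sizeof(float));
        float *B = malloc(VECTOR_SIZE * sizeof(float));
        float *C = malloc(VECTOR_SIZE * sizeof(float));
        for (int i = 0; i < VECTOR_SIZE; i++) {   /* random input, as in the talk */
            A[i] = rand() / (float)RAND_MAX;
            B[i] = rand() / (float)RAND_MAX;
        }
        for (int w = 0; w < workers; w++) {       /* one section per compute process */
            MPI_Send(A + w * section, section, MPI_FLOAT, w, TAG, MPI_COMM_WORLD);
            MPI_Send(B + w * section, section, MPI_FLOAT, w, TAG, MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);              /* wait until all workers are done */
        for (int w = 0; w < workers; w++)         /* collect the partial results */
            MPI_Recv(C + w * section, section, MPI_FLOAT, w, TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("C[0] = %f\n", C[0]);              /* stand-in for the output I/O */
        free(A); free(B); free(C);
    } else {
        /* ---- compute process: works on one section only ---- */
        float *a = malloc(section * sizeof(float));
        float *b = malloc(section * sizeof(float));
        float *c = malloc(section * sizeof(float));
        MPI_Recv(a, section, MPI_FLOAT, np - 1, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b, section, MPI_FLOAT, np - 1, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < section; i++)
            c[i] = a[i] + b[i];                   /* the actual vector addition */
        MPI_Barrier(MPI_COMM_WORLD);              /* match the server's barrier */
        MPI_Send(c, section, MPI_FLOAT, np - 1, TAG, MPI_COMM_WORLD);
        free(a); free(b); free(c);
    }

    MPI_Finalize();
    return 0;
}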
Point-to-Point Operations
• Typically involve two, and only two, different MPI tasks
• Different types of send and receive routines
• Synchronous send
• Blocking send / blocking receive
• Non-blocking send / non-blocking receive
• Buffered send
• Combined send/receive
• "Ready" send
• Send/Receive Routines not tightly coupled
Buffering
• Why is buffering required?
• It is implementation dependent
• Opaque to the programmer and managed by the MPI library
• Advantages
• Can exist on the sending side, the receiving side, or both
• Improves program performance
• Disadvantages
• A finite resource that can be easy to exhaust
• Often mysterious and not well documented
Blocking vs. Non-blocking
Blocking:
• Send returns only after it is safe to modify the application buffer
• Receive returns after the data has arrived and is ready for use by the application
• Synchronous communication is possible
• Asynchronous communication is also possible
Non-blocking:
• Send/Receive return almost immediately
• Unsafe to modify our variables until we know the send operation has completed
• Only asynchronous communication is possible
• Primarily used to overlap computation with communication for a performance gain
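A hedged sketch of the non-blocking pattern described above: post MPI_Isend / MPI_Irecv, do unrelated work, and touch the buffers only after MPI_Waitall (buffer size and the neighbour rank are assumptions):

#include <mpi.h>

#define N 1024
#define TAG 1

/* Exchange data with a neighbour while doing unrelated work in between. */
void overlap_example(int neighbour, double *sendbuf, double *recvbuf)
{
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, N, MPI_DOUBLE, neighbour, TAG, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, neighbour, TAG, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch sendbuf/recvbuf goes here ... */

    /* Only after the waits is it safe to reuse sendbuf or read recvbuf. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}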
Order and Fairness
• Order
• MPI guarantees that messages will not overtake each other
• Order rules do not apply if there are multiple threads participating in the
communication operations
• Fairness
• MPI does not guarantee fairness - it's up to the programmer to prevent
"operation starvation"
Types of Collective Communication Routines
• Synchronization – processes wait until all members of the group have reached the synchronization point
• Data Movement – broadcast, scatter/gather, all-to-all
• Collective Computation (reductions) – one member collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data
Collective Communication Routines (contd.)
• Scope
• Must involve all processes within the scope of a communicator
• Unexpected behavior, including program failure, can occur if even one task in the
communicator doesn't participate
• Programmer's responsibility to ensure that all processes within a communicator
participate in any collective operations.
• Collective communication functions are highly optimized
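A short sketch of the collective style these slides describe, using MPI_Bcast and MPI_Reduce; the broadcast value and the reduced quantity are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = (rank == 0) ? 100 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* every rank must call this */

    int partial = rank * n, total = 0;              /* illustrative per-rank value */
    MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}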
Groups & Communicators (additional details)
• Group
• Represented within system memory as an object
• Only accessible as a handle
• Always associated with a communicator object
• Communicator
• Represented within system memory as an object.
• In the simplest sense, the communicator is an extra "tag" that must be included with
MPI calls
• Inter-group and Intra-group communicators available
• From the programmer's perspective, a group and a communicator are one
Primary Purposes of Group and Communicator Objects
1. Allow you to organize tasks, based upon function, into task groups.
2. Enable collective communication operations across a subset of related tasks.
3. Provide the basis for implementing user-defined virtual topologies.
4. Provide for safe communications.
Programming Considerations and Restrictions
• Groups/communicators are dynamic
• Processes may be in more than one group/communicator
• MPI provides over 40 routines related to groups, communicators, and virtual topologies.
• Typical usage (see the sketch below):
• Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group
• Form new group as a subset of global group using MPI_Group_incl
• Create new communicator for new group using MPI_Comm_create
• Determine new rank in new communicator using MPI_Comm_rank
• Conduct communications using any MPI message passing routine
• When finished, free up the new communicator and group (optional) using MPI_Comm_free and MPI_Group_free
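A sketch that follows the typical-usage steps above; the choice of the even-numbered ranks as the subgroup is purely illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* 1. Extract the handle of the global group. */
    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* 2. Form a new group from the even-numbered ranks (illustrative subset). */
    int nmembers = (world_size + 1) / 2;
    int members[nmembers];
    for (int i = 0; i < nmembers; i++) members[i] = 2 * i;
    MPI_Group even_group;
    MPI_Group_incl(world_group, nmembers, members, &even_group);

    /* 3. Create a communicator for the new group (MPI_COMM_NULL elsewhere). */
    MPI_Comm even_comm;
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) {
        /* 4. Rank within the new communicator; 5. communicate as usual. */
        int even_rank;
        MPI_Comm_rank(even_comm, &even_rank);
        printf("world rank %d is rank %d in the even communicator\n",
               world_rank, even_rank);
        /* 6. Free the communicator when finished. */
        MPI_Comm_free(&even_comm);
    }
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);

    MPI_Finalize();
    return 0;
}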
Virtual Topologies
• Mapping/ordering of MPI processes into a geometric "shape"
• Similar to CUDA Grid / Block 2D/3D structure
• They are only virtual
• Two Main Types
• Cartesian (grid)
• Graph
• Virtual topologies are built upon MPI communicators and groups.
• Must be "programmed" by the application developer.
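A sketch of a Cartesian virtual topology built with MPI_Cart_create; the 2D shape, non-periodic boundaries, and the neighbour query are assumptions for illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI pick a balanced 2D grid shape for 'size' processes. */
    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    int periods[2] = {0, 0};                 /* non-periodic in both directions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);         /* ranks may be reordered by MPI */
    MPI_Cart_coords(cart, cart_rank, 2, coords);

    int left, right;                         /* MPI_PROC_NULL at grid edges */
    MPI_Cart_shift(cart, 1, 1, &left, &right);
    printf("rank %d at (%d,%d); row neighbours %d and %d\n",
           cart_rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}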
Why use Virtual Topologies?
• Convenience
• Useful for applications with specific communication patterns
• Communication Efficiency
• Penalty avoided on some hardware architectures for communication between distant nodes
• Process mapping may be optimized based on physical characteristics of the machine
• The MPI implementation decides whether the virtual topology is used or ignored
Phew! … All done!
Thank You!
ANY QUESTIONS?

Speaker Notes

1. HCS: systems that use more than one kind of processor. So far we have discussed programming on a system with one host and one device. HPC and its uses: using more than one computer as part of a cluster to get things done faster; a computer cluster is just a group of computers connected to a local network (LAN). Uses include stock prediction and trading, rendering very high-resolution pictures (400,000,000 pixels), and evolutionary algorithms. Supercomputer vs. HCS: a supercomputer is good only for specialized problems and requires vast sums of money and specialized expertise to use, while an HCS can be managed without a lot of expense or expertise. Why use heterogeneous computers in an HCS? For better energy efficiency; GPU usage in clusters started in 2009, so it is relatively new, and the effectiveness of this approach shows in the many such clusters on the Green 500 list (the list of the most energy-efficient supercomputers in the world).
2. Message Passing Interface: originally designed for distributed memory architectures (1980s to early 1990s). Predominant API – it displaced the APIs that came before it. Runs on virtually any hardware platform: distributed memory, shared memory, or hybrid; the programming model is distributed memory regardless of the underlying physical architecture of the machine. Explicit parallelism: the programmer is responsible for identifying and implementing parallelism using algorithms and MPI constructs. Languages: C, C++ and Fortran.
3. Standardization: it is supported on all HPC platforms, e.g. MVAPICH (Linux clusters), Open MPI (Linux clusters), IBM MPI (BG/Q clusters, part of IBM's Blue Gene series). Portability: no source code modifications are needed when porting between platforms, as long as the platform supports the MPI standard. Performance opportunities: vendors can tune it further for their native hardware. Functionality: over 430 routines in MPI-3, though most programs use fewer than a dozen. Availability: a variety of implementations, both vendor-supplied and public domain.
4. SPMD programming model: each process computes part of the output. Flat view of the cluster: instead of a node concept, MPI simply has processes, each given a flat index much like a global index in OpenCL, so programming feels similar to CUDA and OpenCL. No global memory: there is no shared memory between nodes. Inter-process communication is possible: since there is no global memory, any data transfer has to be done via IPC using MPI constructs. Process synchronization primitives: MPI collectives are used to provide synchronization.
5. Header file: the header file is mandatory; in Fortran, the USE mpi_f08 module is preferred over the include file. The highlighted portions are where we will use the MPI constructs.
6. Level 0: only one thread will execute. Level 1: the process may be multi-threaded, but only the main thread will make MPI calls (funneled through main). Level 2: the process may be multi-threaded and multiple threads may make MPI calls, but only one at a time – concurrent calls are serialized. Level 3: multiple threads may call MPI with no restrictions.
7. Format of MPI calls. Case sensitivity: C – yes, Fortran – no. Name restrictions: the prefixes MPI_* and PMPI_* (profiling interface) are reserved. Error handling: the default behavior of an MPI call is to abort if there is an error. Good news – you will probably never see anything other than success; bad news – it is a pain to debug. The default handler can be overridden, and how errors are displayed to the user is implementation dependent.
8. (After the second point) MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Rank: a unique identifier assigned by the system to a process when the process initializes; sometimes called a task ID. Ranks are contiguous and begin at 0.
9. MPI_Init(&argc,&argv): initializes the MPI execution environment; must be called in every MPI program, only once, and before any other MPI function. It may be used to pass the command-line arguments to all processes, though this is not required by the standard and is implementation dependent. MPI_Comm_size(comm,&size): returns the total number of MPI processes in the specified communicator; needed because the number of allocated processes might not be the same as the number requested. MPI_Comm_rank(comm,&rank): the task ID of the process; an integer between 0 and n-1 within the MPI_COMM_WORLD communicator. If the process is associated with other communicators, it also has a unique rank within each of them. MPI_Abort(comm,errorcode): terminates all MPI processes associated with a communicator; in most implementations the communicator is ignored and all processes are terminated. MPI_Get_processor_name(&name,&resultlength): returns the processor name and its length; this may not be the same as the host name – it is implementation dependent.
10. MPI_Get_version(&version,&subversion): returns the version and subversion of the MPI standard implemented by the library. MPI_Initialized(&flag): indicates whether MPI_Init has been called. MPI_Wtime(): returns elapsed wall-clock time in seconds (double precision). MPI_Wtick(): returns the resolution of MPI_Wtime in seconds; for example, if the clock is implemented by the hardware as a counter that is incremented every millisecond, the value returned by MPI_Wtick should be 10^-3. MPI_Finalize(): terminates the MPI execution environment; it should be the last MPI routine called, and no other MPI routines may be called after it.
11. Explain every line: main will be executed by all the processes. np = number of processes, analogous to gridDim.x * blockDim.x in CUDA; pid is analogous to blockIdx.x * blockDim.x + threadIdx.x. We request n processes when we begin program execution and use MPI_Comm_size to verify whether we actually got the requested number; if the system does not have enough resources, we may get fewer processes than requested. We check that we have enough and abort if we don't have at least 3. We print the error message from only one process, and we abort all the processes linked with the communicator.
12. If the number of processes is sufficient, we get into the real execution of the program. Control flow is used to specialize one of the processes: process np-1 acts as the server (analogous to the host), and the remaining processes act as compute nodes (analogous to the device). A compute node receives only a section of the input for computation. Once all the processes are complete, we clean up data structures and release all resources before calling MPI_Finalize.
13. This is used by one process to send data to another process. It is very easy to use – as a beginner, you don't need to know much about the implementation to use it. *buf – starting address of the send buffer, i.e. the location from which data is copied. count – number of elements in the buffer (elements, not bytes: a buffer of doubles is larger than a buffer of ints even when the count is the same). datatype – datatype of the elements in the buffer. dest – process ID (rank) of the target process. tag – message tag (a non-negative integer). comm – communicator.
14. Similar to the send interface. status – output parameter giving the status of the received message. This is a two-step process where the send has to be called by one process and the receive by the other; in CUDA it is one step with two directions (host to device, device to host).
15. This is the server code; only the (np-1)th process executes it. The server does the I/O and distributes the data to the compute nodes, and eventually collects the output from all the compute nodes and does I/O again. Q: why is MPI_Comm_size called here again? A: it makes the code a little cleaner, since the number of parameters is reduced. We allocate memory for the entire input and output; the program aborts if there isn't enough memory available.
16. In a real program we would read from input/disk to populate the data; here we just fill the input vectors with random data. We initialize pointers into these input vectors, then enter a for loop where each iteration sends a chunk of vector A and a section of vector B to a compute process. We go from 0 up to np-2 (because the last process is used as the server). Once we send a section to a compute process, we advance the pointers into the input vectors by the section size so we can send the next section to the next process. For extremely large input sizes we may have to parallelize this server process further, perhaps by having more than one server process.
17. Once the data is distributed to all the compute processes, the server process waits until all of them are done with their processing. Once everyone finishes, all processes are released from the barrier, and the server process collects the data from all the processes using MPI_Recv.
18. Blocks the caller until all group members have called it; it returns only after all group members have entered the call. As the name suggests, this is barrier synchronization, similar to __syncthreads() in CUDA.
19. Once the data has been copied back from the compute processes, the server process performs the I/O. After the I/O, and before the program exits, the memory allocated on the heap is released.
20. Here we show the code for the compute process; in total, np-1 processes execute the compute code. By program design the (np-1)th process is the server, which is why we call MPI_Comm_size. We then allocate memory for a section of the data (not the whole).
21. We immediately go into MPI_Recv to receive the data from the server, and then compute the output. Similar to how we do it in CUDA, we should expect barrier synchronization.
22. And we see the barrier synchronization as expected. Once all the compute processes are done with the computation, they send the data back to the server process and free their local memory allocations. Finally, as shown in the main program earlier, before main exits it uses the MPI_Finalize() call to clean up all MPI data structures and returns successfully.
23. Typically involve two, and only two, different MPI tasks: one performs the send and the other performs the matching receive operation. There are different types of send and receive routines (six types of send routines and three types of receive routines), and they are not tightly coupled: any type of send can be used with any type of receive routine.
24. Blocking: (1) "safe" means that modification will not affect the data to be sent; it does not mean the data was actually received – it may still be in a system buffer. (3) A handshake occurs with the receive task to confirm a safe send. (4) If a system buffer is used.
25. Order (point 1): if a sender sends two messages (Message 1 and Message 2) in succession to the same destination, and both match the same receive, the receive operation will receive Message 1 before Message 2. If a receiver posts two receives (Receive 1 and Receive 2) in succession, and both are looking for the same message, Receive 1 will receive the message before Receive 2. Fairness: task 0 sends a message to task 2, but task 1 sends a competing message that matches task 2's receive; only one of the sends will complete (if there is no buffering).
26. Synchronization – processes wait until all members of the group have reached the synchronization point. Data movement – broadcast, scatter/gather, all-to-all. Collective computation (reductions) – one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
27. Collective communication functions are highly optimized; using them usually leads to better performance as well as better readability and productivity.
28. From the programmer's perspective, a group and a communicator are one. The group routines are primarily used to specify which processes should be used to construct a communicator.
29. Groups/communicators are dynamic – created and destroyed during program execution. Processes may be in more than one group/communicator and will have a unique rank within each.
30. They are only virtual: there is no necessary relation between the physical structure of the machine and the process topology.
31. Useful for applications with specific communication patterns; a Cartesian topology might prove convenient for an application that requires 4-way nearest-neighbor communication on grid-based data. (Point the audience to the example.)