When developing an application for Cray XK7 systems, optimizing compute kernels is only a small part of maximizing scaling and performance. Programmers must also consider the effect of the GPU's distinct address space and the PCIe bus on application scalability. Without such considerations, applications rapidly become limited by transfers to and from the GPU and fail to scale to large numbers of nodes. This paper demonstrates methods for optimizing GPU-to-GPU communication and presents XK7 results for these methods.
This presentation was originally given at CUG 2013.
Optimizing GPU to GPU Communication on Cray XK7
Jeff Larkin
What Amdahl says about GPU communication
If you make your GPU computation infinitely fast, performance will be bound by your communication.
GPU-to-GPU communication has:
- Higher latency (an additional hop over PCIe)
- Lower bandwidth (limited by the lowest-bandwidth link)
G2G communication cannot be an afterthought when running at scale.
How do GPUs communicate?
Via MPI: data from GPU0 passes through CPU0 (MPI_Send()), across the interconnect to CPU1 (MPI_Recv()), and on to GPU1. But what really happens here?
Unified Virtual Addressing
- One address space for all CPU and GPU memory
- The physical memory location can be determined from a pointer value
- Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
- Supported on Tesla GPUs starting with Fermi, for 64-bit applications on Linux and Windows TCC
(Diagram: host buffers reach the pinned fabric buffer via memcpy; GPU buffers reach it via cudaMemcpy.)
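Because UVA places host and device allocations in one address space, a library can ask the CUDA runtime which kind of memory a given pointer refers to. A minimal sketch, assuming a CUDA 4.x-era runtime (the `where_is` helper is hypothetical, and this requires a CUDA-capable system to run):

```c
/* Sketch: how a UVA-aware library might classify a user pointer.
 * Assumes UVA is active (64-bit Linux, Fermi-class Tesla or newer). */
#include <cuda_runtime.h>

static const char *where_is(const void *ptr)
{
    struct cudaPointerAttributes attr;
    /* For memory CUDA has never seen, this returns an error; such a
     * pointer can be treated as plain host memory. */
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess)
        return "host (not registered with CUDA)";
    return attr.memoryType == cudaMemoryTypeDevice ? "device" : "host";
}
```

This single-pointer-value interface is what lets a CUDA-aware MPI accept either host or device buffers through the same `MPI_Send`/`MPI_Recv` signatures.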
MPI+CUDA

No UVA and regular MPI:
  // MPI rank 0
  cudaMemcpy(s_buf_h, s_buf_d, size, …);
  MPI_Send(s_buf_h, size, …);
  // MPI rank n-1
  MPI_Recv(r_buf_h, size, …);
  cudaMemcpy(r_buf_d, r_buf_h, size, …);

With UVA and CUDA-aware MPI:
  // MPI rank 0
  MPI_Send(s_buf_d, size, …);
  // MPI rank n-1
  MPI_Recv(r_buf_d, size, …);

CUDA-aware MPI makes MPI+CUDA easier.
CUDA-Aware MPI Libraries May
- Use RDMA to remove the CPU from the picture entirely
- Stage GPU buffers through CPU memory automatically
- Pipeline messages
All the programmer needs to know is that they have passed a GPU pointer to MPI; the library developer can optimize the rest.
(Diagram: GPU0-to-GPU1 transfer paths, with and without staging through CPU0 and CPU1.)
GPU-Awareness in Cray MPI
- Cray began supporting GPU-awareness in version 5.6.3
  - Functions on the XK7, but not optimally performing
  - Expected to work very well on the XC30
- Must be explicitly enabled via the run-time environment variable MPICH_RDMA_ENABLED_CUDA
- Works with both CUDA and OpenACC
- Version 5.6.4 adds a pipelining feature that should help large messages, enabled with MPICH_G2G_PIPELINE
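A batch-script fragment showing how these variables might be set (the `aprun` line is illustrative and commented out; the rank count and program name are hypothetical):

```shell
# Fragment of an XK7 batch script.
# Requires Cray MPT 5.6.3+ for GPU-awareness, 5.6.4+ for pipelining.
export MPICH_RDMA_ENABLED_CUDA=1   # MPI may now be handed GPU pointers
export MPICH_G2G_PIPELINE=64       # pipeline large G2G messages
# aprun -n 64 -N 1 ./my_app        # launch: one rank per XK7 node
```

Both variables must be set before launch; enabling pipelining without MPICH_RDMA_ENABLED_CUDA has no effect on GPU buffers.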
OMB Latency
- Host-to-host will always have the lowest latency (fewest hops)
- Staging through host memory explicitly adds significant latency
- The GPU-aware library falls in the middle
Note: 2 nodes on separate blades.
OMB Bandwidth
- Once again, H2H wins out (probably by a difference of latency)
- Direct RDMA suffers badly with this benchmark
- Setting MPICH_G2G_PIPELINE=64 pipelines messages and opens up more concurrency
(Chart annotation: hand-pipelined performance came in at approximately this level.)
OMB Bandwidth, Varying Pipeline Depths
- OMB sends messages in a window of 64, so that depth is naturally optimal
- Counter-intuitively, no intermediate values seemed to help
- Additionally tried varying chunk sizes, with no benefit
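The pipelining idea rests on simple chunking arithmetic: a large G2G message is split into fixed-size chunks so that the device-to-host copy of one chunk can overlap the network transfer of another. A small self-contained sketch of that arithmetic (the helper names and chunk sizes are illustrative, not Cray MPI internals):

```c
#include <stddef.h>

/* Number of pipeline chunks needed to cover a message of `size` bytes
 * when each chunk carries at most `chunk` bytes. */
size_t pipeline_chunks(size_t size, size_t chunk)
{
    return (size + chunk - 1) / chunk;
}

/* Byte count of chunk `i`; every chunk is full-sized except possibly
 * the last, which carries the remainder. */
size_t pipeline_chunk_len(size_t size, size_t chunk, size_t i)
{
    size_t off = i * chunk;
    return (off + chunk <= size) ? chunk : size - off;
}
```

With a pipeline depth of N, up to N such chunks can be in flight at once, which is why the benchmark's window size of 64 matched MPICH_G2G_PIPELINE=64 so well.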
Optimizing Performance of a Message
- MPI vendors know how to optimize performance for an interconnect
  - Different approaches for different message sizes
  - Multiple algorithms
- Unfortunately, on the XK7, this may not be the optimal approach
MPI Lacks the Ability to Express Dependencies
- One way to negate the cost of G2G communication is to overlap it with computation
- Restructuring the order of computation may allow such overlapping
- Some communication patterns have a natural concurrency that isn't easily exploited
(Diagram: a domain exchanging cells with its N, S, E, and W neighbors.)
Exploiting Communication Concurrency
For each direction (North, East, South, West), the exchange is its own independent pipeline: pack, D2H copy, transfer, H2D copy, unpack. This cannot be expressed in GPU-aware MPI today.
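One way to express this concurrency by hand is to give each halo direction its own CUDA stream, so the four pack/copy/exchange/unpack pipelines proceed independently. A structural sketch only, not tested code: the `pack_halo`/`unpack_halo` kernels, buffer arrays, counts, and `nbr[]` neighbor ranks are all hypothetical and assumed to be defined elsewhere.

```c
/* Hypothetical halo exchange: one CUDA stream per direction (N, E, S, W). */
void exchange_halos(void)
{
    cudaStream_t stream[4];
    for (int d = 0; d < 4; ++d)
        cudaStreamCreate(&stream[d]);

    /* Launch all four pack kernels and D2H copies; they can overlap. */
    for (int d = 0; d < 4; ++d) {
        pack_halo<<<grid, block, 0, stream[d]>>>(d_field, d_sendbuf[d], d);
        cudaMemcpyAsync(h_sendbuf[d], d_sendbuf[d], bytes[d],
                        cudaMemcpyDeviceToHost, stream[d]);
    }

    /* As each direction's buffer lands on the host, exchange it and
     * start the return trip without waiting on the other directions. */
    for (int d = 0; d < 4; ++d) {
        cudaStreamSynchronize(stream[d]);
        MPI_Sendrecv(h_sendbuf[d], counts[d], MPI_DOUBLE, nbr[d], d,
                     h_recvbuf[d], counts[d], MPI_DOUBLE, nbr[d], d,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpyAsync(d_recvbuf[d], h_recvbuf[d], bytes[d],
                        cudaMemcpyHostToDevice, stream[d]);
        unpack_halo<<<grid, block, 0, stream[d]>>>(d_field, d_recvbuf[d], d);
    }
    for (int d = 0; d < 4; ++d)
        cudaStreamSynchronize(stream[d]);
}
```

Using a nonblocking exchange (MPI_Isend/MPI_Irecv) per direction, or one host thread per direction, would loosen this further; the blocking MPI_Sendrecv shown here still serializes the host-side exchanges.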
Talks Related to this Optimization
- HOMME: Matt Norman, CUG 2012
- S3D: John Levesque, CUG 2013
Summary
- Optimizing kernels will only take you so far as you scale; communication cannot be an afterthought.
- GPU-aware MPI libraries are becoming available
  - Easier to program
  - Can optimize the performance of individual message transfers
- Some communication patterns have a natural concurrency that can be exploited to make communication "free", but this takes additional effort.