GPU: Introduction to High-Performance Processing

Universidade Federal de Viçosa, Departamento de Informática
Introduction to High-Performance Processing with GPUs (Graphics Processing Units)
Ricardo Ferreira [email_address]

Contents

1. Computer Architecture

1.1 Pipeline
D(i) + A(i) * B(i)
Multiplication: 4 cycles. Addition: 3 cycles.
Without pipelining, each element of the vector takes 7 cycles to compute.
With a pipeline, one operation completes per cycle once the pipeline is full.
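The cycle counts above can be turned into a quick back-of-the-envelope comparison. This is a minimal sketch assuming an idealized pipeline with no stalls; the function names and the vector length of 10,000 are illustrative choices, not taken from the slide.

```python
MUL_CYCLES = 4  # multiplication latency, from the slide
ADD_CYCLES = 3  # addition latency, from the slide


def cycles_unpipelined(n):
    """Each element pays the full multiply-then-add latency."""
    return n * (MUL_CYCLES + ADD_CYCLES)


def cycles_pipelined(n):
    """Idealized pipeline: the first result takes the full latency,
    then one further result completes every cycle."""
    if n == 0:
        return 0
    return (MUL_CYCLES + ADD_CYCLES) + (n - 1)


n = 10_000  # illustrative vector length
print(cycles_unpipelined(n))  # 70000
print(cycles_pipelined(n))    # 10006
```

For long vectors the pipelined cost approaches one cycle per element, which is where the "1 operation per cycle" figure comes from.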

1.2 Scalar operations
x = y * w
z = x + 3
The resources are underused: the addition needs x, so it cannot start until the multiplication has finished.

1.3 Scheduling
Static: done by the compiler
Dynamic: done by the processor (1963)
Fortran 1957, C 1969

1.4 Memory
(Figure labels: before, after)

1.5 Cache
Time = 1 ns + miss rate * 10 ns = 1 ns + 2% * 10 ns = 1.2 ns
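The same arithmetic as a small function, to make it easy to try other miss rates. The parameter names are illustrative; the 1 ns hit time, 10 ns miss penalty and 2% miss rate are the slide's numbers.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: every access pays the hit time,
    and a fraction miss_rate of accesses also pays the miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns


t = average_access_time_ns(hit_time_ns=1.0, miss_rate=0.02, miss_penalty_ns=10.0)
print(round(t, 3))  # 1.2
```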

Amdahl's Law
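The bullet points of this slide did not survive extraction. As a reminder of the standard statement of the law (not a reconstruction of the slide's own text), a minimal sketch: if a fraction p of the run time can be parallelized over n processors, the speedup is 1 / ((1 - p) + p / n). The function name and the sample values are illustrative.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a program whose
    parallel_fraction of run time is spread over `processors` units."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative values: 90% parallel code on 10 processors.
print(round(amdahl_speedup(0.9, 10), 2))  # 5.26

# However many processors are added, the serial 10% caps the speedup at 10x.
print(amdahl_speedup(0.9, 1_000_000) < 10.0)  # True
```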

Processors
ALU, control, scheduling, branches, fetch, memories, caches

Branches
if ( x > lim )
if ( y < th )
if ( z = (a/b*d) )
The processor has to fetch instructions ahead of time: which one comes next?
About 20% of the instructions in typical code are branches.

Parallel Computing

Architectures

GPU
Architecture, CUDA, Languages, Examples

CPU versus GPU
(Figure labels: ALU, cache, control)

Fermi: the new generation

Applications

CUDA (Compute Unified Device Architecture)

History

CUDA

Vector computation: CPU version
Suppose N is very large, for example 10,000 elements.
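The slide's listing was not preserved. As a stand-in, here is a minimal sequential sketch, assuming the computation is the same update y(i) = a * x(i) + y(i) that the Fortran kernel later in the deck performs; the function name and the input values are illustrative.

```python
def saxpy_cpu(a, x, y):
    """Sequential version: one loop iteration per element, one after another."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


n = 10_000
x = [1.0] * n
y = [2.0] * n
saxpy_cpu(3.0, x, y)
print(y[0], y[-1])  # 5.0 5.0
```

Every iteration is independent of the others, which is what makes this loop a natural candidate for one GPU thread per element.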

Threads, Blocks and Vectors
Block = 2, thread = 3, block size = 4
Index = (block - 1) * block size + thread = (2 - 1) * 4 + 3 = 7
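The index formula can be checked for every thread of a small launch. This sketch uses the slide's 1-based numbering (as in CUDA Fortran); the function name is illustrative.

```python
def global_index(block, thread, block_size):
    """1-based global element index for a 1-based block and thread number."""
    return (block - 1) * block_size + thread


print(global_index(block=2, thread=3, block_size=4))  # 7

# Two blocks of four threads cover elements 1..8 exactly once each.
indices = [global_index(b, t, 4) for b in (1, 2) for t in (1, 2, 3, 4)]
print(indices)  # [1, 2, 3, 4, 5, 6, 7, 8]
```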

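Fortran version
! Kernel definition
attributes(global) subroutine ksaxpy( n, a, x, y )
  real, dimension(*) :: x, y
  real, value :: a
  integer, value :: n
  integer :: i
  ! global 1-based index of the element handled by this thread
  i = (blockidx%x-1) * blockdim%x + threadidx%x
  ! guard: the last block may extend past the end of the vectors
  if( i <= n ) y(i) = a * x(i) + y(i)
end subroutine

! Host subroutine
subroutine solve( n, a, x, y )
  real, device, dimension(*) :: x, y
  real :: a
  integer :: n
  ! call the kernel with 64 threads per block; the block count is
  ! rounded up so that every element is covered when n is not a multiple of 64
  call ksaxpy<<<(n+63)/64, 64>>>( n, a, x, y )
end subroutine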
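The kernel's guard `if( i <= n )` only matters when n is not a multiple of the block size, and in that case the number of blocks has to be rounded up, because the integer division n/64 truncates. A minimal Python sketch of the launch, assuming the block count is rounded up; `launch_saxpy` is a hypothetical helper that runs the "threads" one after another and is not part of CUDA Fortran.

```python
def launch_saxpy(n, a, x, y, block_size=64):
    """Simulate the grid: every (block, thread) pair runs the kernel body once."""
    num_blocks = (n + block_size - 1) // block_size  # round up
    idle = 0
    for block in range(1, num_blocks + 1):           # blockidx%x
        for thread in range(1, block_size + 1):      # threadidx%x
            i = (block - 1) * block_size + thread
            if i <= n:                               # same guard as the kernel
                y[i - 1] = a * x[i - 1] + y[i - 1]
            else:
                idle += 1                            # thread past the end does nothing
    return num_blocks, idle


n = 100
x = [1.0] * n
y = [2.0] * n
num_blocks, idle = launch_saxpy(n, 3.0, x, y)
print(num_blocks, idle)  # 2 28
print(n // 64)           # 1: truncating would leave elements 65..100 untouched
```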

Programming model

Kernel

Threads, Blocks and Grids
Example: a kernel running on 72 threads
2D grid with dimensions 3×2×1: 6 blocks
2D blocks with dimensions 4×3×1: 12 threads each
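The thread count in this example is just the product of the grid and block dimensions. A short sketch that also enumerates every (block, thread) coordinate pair; it uses 0-based coordinates as in CUDA C (CUDA Fortran counts from 1), and the variable names are illustrative.

```python
from itertools import product

grid_dim = (3, 2, 1)   # blocks per grid in x, y, z
block_dim = (4, 3, 1)  # threads per block in x, y, z


def volume(dim):
    """Number of cells in a 3D shape."""
    x, y, z = dim
    return x * y * z


blocks = volume(grid_dim)
threads_per_block = volume(block_dim)
print(blocks, threads_per_block, blocks * threads_per_block)  # 6 12 72

# Every thread is identified by its block coordinates plus its thread coordinates.
block_coords = list(product(*(range(d) for d in grid_dim)))
thread_coords = list(product(*(range(d) for d in block_dim)))
all_threads = [(b, t) for b in block_coords for t in thread_coords]
print(len(all_threads))  # 72
```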

Warps, Threads, Blocks & Processors

Directives (in C)

Fortran

Variables

Data transfer in Fortran

Basic program in C for CUDA

Array of multiprocessors (SMs)

Warps and Threads

Memory model

Shared Memory
- Shared by the threads of one block; 16 KB
- Access is far faster than global memory (a global memory access costs roughly 400 to 600 clock cycles)
- Allows the threads of a block to collaborate
- Each thread of the block copies data from global memory to shared memory
- After the transfer, call __syncthreads()
- The kernel operates on the data in shared memory
- Each thread of the block copies data from shared memory back to global memory
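The load, synchronize, compute, store sequence above can be imitated on the CPU. This is only an analogy: Python threads stand in for the threads of one block, a list stands in for shared memory, and `threading.Barrier` plays the role of `__syncthreads()`. The block size of 8 and the reversal task are illustrative choices.

```python
import threading

BLOCK_SIZE = 8  # hypothetical block size for the illustration


def run_block(global_in):
    """Reverse one block's worth of data through a staged 'shared' buffer."""
    shared = [None] * BLOCK_SIZE             # stands in for per-block shared memory
    global_out = [None] * BLOCK_SIZE
    barrier = threading.Barrier(BLOCK_SIZE)  # plays the role of __syncthreads()

    def thread_body(t):
        shared[t] = global_in[t]             # 1. global -> shared
        barrier.wait()                       # 2. wait until every thread has loaded
        value = shared[BLOCK_SIZE - 1 - t]   # 3. use data loaded by another thread
        global_out[t] = value                # 4. result -> global

    threads = [threading.Thread(target=thread_body, args=(t,))
               for t in range(BLOCK_SIZE)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return global_out


print(run_block(list(range(BLOCK_SIZE))))  # [7, 6, 5, 4, 3, 2, 1, 0]
```

Without the barrier, a thread could read a slot of the shared buffer before its neighbour has written it, which is exactly the hazard `__syncthreads()` prevents.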

Matrix multiplication code
! start the module containing the matmul kernel
module mmul_mod
  use cudafor
contains
  ! mmul_kernel computes A*B into C where
  ! A is NxM, B is MxL, C is then NxL
  attributes(global) subroutine mmul_kernel( A, B, C, N, M, L )
    real :: A(N,M), B(M,L), C(N,L)
    integer, value :: N, M, L
    integer :: i, j, kb, k, tx, ty
    ! submatrices stored in shared memory
    real, shared :: Asub(16,16), Bsub(16,16)
    ! the value of C(i,j) being computed
    real :: Cij

(continued)
    ! Get the thread indices
    tx = threadidx%x
    ty = threadidx%y
    ! This thread computes C(i,j) = sum(A(i,:) * B(:,j))
    i = (blockidx%x-1) * 16 + tx
    j = (blockidx%y-1) * 16 + ty
    Cij = 0.0
    ! Do the k loop in chunks of 16, the block size
    do kb = 1, M, 16
      ! Fill the submatrices
      ! Each of the 16x16 threads in the thread block
      ! loads one element of Asub and Bsub
      Asub(tx,ty) = A(i,kb+ty-1)
      Bsub(tx,ty) = B(kb+tx-1,j)
      ! Wait until all elements are filled
      call syncthreads()

(continued)
      ! Multiply the two submatrices
      ! Each of the 16x16 threads accumulates the
      ! dot product for its element of C(i,j)
      do k = 1,16
        Cij = Cij + Asub(tx,k) * Bsub(k,ty)
      enddo
      ! Synchronize to make sure all threads are done
      ! reading the submatrices before overwriting them
      ! in the next iteration of the kb loop
      call syncthreads()
    enddo
    ! Each of the 16x16 threads stores its element
    ! to the global C array
    C(i,j) = Cij
  end subroutine mmul_kernel
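The tiling scheme of the kernel can be checked on the CPU. The sketch below mirrors the kernel's structure with 0-based indices: the loops over `tx` and `ty` stand in for the threads of a block, and the two places where the kernel calls `syncthreads()` are marked in comments. It is sequential Python, not GPU code; `TILE = 2` is an illustrative size (the kernel uses 16) and `matmul_naive` is a hypothetical reference used only for comparison.

```python
TILE = 2  # illustrative tile size; the kernel above uses 16


def matmul_tiled(A, B, n, m, l):
    """C = A * B with A n-by-m and B m-by-l; n, m, l must be multiples of TILE."""
    assert n % TILE == 0 and m % TILE == 0 and l % TILE == 0
    C = [[0.0] * l for _ in range(n)]
    for bi in range(0, n, TILE):          # block row    (blockidx%x)
        for bj in range(0, l, TILE):      # block column (blockidx%y)
            acc = [[0.0] * TILE for _ in range(TILE)]  # one Cij per thread
            for kb in range(0, m, TILE):  # chunks of the inner dimension
                # Each "thread" (tx, ty) loads one element of each tile.
                Asub = [[A[bi + tx][kb + ty] for ty in range(TILE)]
                        for tx in range(TILE)]
                Bsub = [[B[kb + tx][bj + ty] for ty in range(TILE)]
                        for tx in range(TILE)]
                # (first syncthreads: tiles are complete)
                for tx in range(TILE):
                    for ty in range(TILE):
                        for k in range(TILE):
                            acc[tx][ty] += Asub[tx][k] * Bsub[k][ty]
                # (second syncthreads: tiles may now be overwritten)
            for tx in range(TILE):
                for ty in range(TILE):
                    C[bi + tx][bj + ty] = acc[tx][ty]
    return C


def matmul_naive(A, B, n, m, l):
    """Reference triple loop, used only to check the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(l)]
            for i in range(n)]


A = [[float(i + k) for k in range(4)] for i in range(2)]      # 2 x 4
B = [[float(k * j + 1) for j in range(6)] for k in range(4)]  # 4 x 6
print(matmul_tiled(A, B, 2, 4, 6) == matmul_naive(A, B, 2, 4, 6))  # True
```

Each tile of A and B is read from "global" storage once per block and then reused TILE times from the small buffers, which is the point of staging data in shared memory.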

On the CPU (host)
  ! The host routine to drive the matrix multiplication
  subroutine mmul( A, B, C )
    real, dimension(:,:) :: A, B, C
    ! allocatable device arrays
    real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
    ! dim3 variables to define the grid and block shapes
    type(dim3) :: dimGrid, dimBlock
    integer :: N, M, L
    ! Get the array sizes
    N = size( A, 1 )
    M = size( A, 2 )
    L = size( B, 2 )
    ! Allocate the device arrays
    allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )

(continued)
    ! Copy A and B to the device
    Adev = A(1:N,1:M)
    Bdev(:,:) = B(1:M,1:L)
    ! Create the grid and block dimensions:
    ! one 16x16 block per 16x16 tile of C, which is NxL
    ! (assumes N, M and L are multiples of 16)
    dimGrid = dim3( N/16, L/16, 1 )
    dimBlock = dim3( 16, 16, 1 )
    call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
    ! Copy the results back and free up memory
    C(1:N,1:L) = Cdev
    deallocate( Adev, Bdev, Cdev )
  end subroutine mmul
end module mmul_mod

Tips

Fermi

2 warps per SM
(Figure: two warp schedulers, each with its own instruction dispatch unit, issuing instructions from two different warps at a time)
Warp Scheduler + Inst Dispatch Unit | Warp Scheduler + Inst Dispatch Unit
Warp 8 inst 11 | Warp 9 inst 11
Warp 2 inst 42 | Warp 3 inst 33
Warp 14 inst 95 | Warp 15 inst 95
: | :
Warp 8 inst 12 | Warp 9 inst 12
Warp 14 inst 96 | Warp 3 inst 34
Warp 2 inst 43 | Warp 15 inst 96

Double Precision

Cache
