Can FPGAs compete with GPUs?
John Romein, Bram Veenboer
GPU Technology Conference (GTC’19)
March 18-21, 2019
2
Outline
● explain FPGA
  – hardware
● FPGA vs. GPU
  – programming models (OpenCL)
  – case studies
    ● matrix multiplication
    ● radio-astronomical imaging
  – lessons learned
● answer the question in the title
● analyze performance & energy efficiency
3
What is a Field-Programmable Gate Array (FPGA)?
● configurable processor
● collection of
  – multipliers / adders
  – registers
  – memories
  – logic
  – transceivers
  – clocks
  – interconnect
4
What is an FPGA program?
● connect elements
  – performs fixed function
● data-flow engine
● Hardware Description Language (HDL)
  – Verilog, VHDL
  – difficult
5
FPGA vs GPU
• FPGA advantages
– high energy efficiency
• no instruction decoding etc.
• use/move as few bits as possible
– configurable I/O
• high bandwidth
• e.g., many times 100 GbE
• GPU advantages
– easier to program
– more flexible
– short compilation times
6
New Intel (Altera) FPGA technologies
1) high-level programming language (OpenCL)
2) hard Floating-Point Units
3) tight integration with CPU cores
➔ simple tasks: less programming effort than HDL
➔ allows complex HPC applications
7
FPGA ≠ GPU
• hardware
– GPU: use hardware
– FPGA: create hardware
• execution model
– GPU: instructions
– FPGA: data flow
8
Common language: OpenCL
• OpenCL
– similar to CUDA
– C + explicit parallelism + synchronization + software-managed cache
– offload to GPU/FPGA
• GPU programs not suitable for FPGAs
– FPGA: data-flow engine
9
FPGA: data-flow pipeline example
• create hardware for complex multiply-add
C.real += A.real * B.real;
C.real += -A.imag * B.imag;
C.imag += A.real * B.imag;
C.imag += A.imag * B.real;
• needs four FPUs
• new input enters every cycle
[Diagram: the four FPUs chained into a data-flow pipeline, taking Ar, Ai, Br, Bi as inputs and producing Cr, Ci]
10
Local memory
• GPU: use memory, FPGA: create memory
float tmp[128] __attribute__((…));
• properties:
– registers or memory blocks
– #banks
– bank width
– bank address selection bits
– #read ports
– #write ports
– single/double pumped
– with/without arbiter
– replication factor
• well designed program: few resources, stall-free
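For illustration, a hedged sketch of how such properties can be requested; the values below are made up for the example, and only the attribute names follow Intel FPGA SDK for OpenCL conventions:

// illustrative only: 4 banks of 8 bytes, 2 read ports, 1 write port, single-pumped,
// implemented in on-chip memory blocks
float tmp[128] __attribute__((memory,
                              numbanks(4),
                              bankwidth(8),
                              numreadports(2),
                              numwriteports(1),
                              singlepump));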
11
Running multiple kernels
• GPU: consecutively
• FPGA: concurrently
– channels (OpenCL extension)
channel float2 my_channel __attribute__((depth(256)));
– requires less memory bandwidth
– all kernels resident → must fit
[Diagram: on the GPU, kernel_1 and kernel_2 run one after the other and exchange data through device memory; on the FPGA, both kernels run concurrently, connected directly by a channel]
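A minimal sketch of two kernels connected by a channel, assuming the Intel channels extension; the kernel names and the float2 payload are illustrative:

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float2 my_channel __attribute__((depth(256)));

__kernel void producer(__global const float2 *in, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    write_channel_intel(my_channel, in[i]);   // blocks when the channel is full
}

__kernel void consumer(__global float2 *out, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    out[i] = read_channel_intel(my_channel);  // blocks when the channel is empty
}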
12
Parallel constructs in OpenCL
• work groups, work items (CUDA: thread blocks, threads)
– GPU: parallel
– FPGA: default: one work item/cycle; not useful
• FPGA:
– compiler auto-parallelizes whole kernel
– #pragma unroll
• create hardware for concurrent loop iterations
– replicate kernels
#pragma unroll
for (int i = 0; i < 3; i ++)
  a[i] = i;

[Diagram: the unrolled loop instantiates hardware for iterations 0, 1 and 2 side by side; the four-FPU complex multiply-add pipeline from the earlier slide reappears as an illustration]
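Kernel replication itself is expressed with attributes. A hedged sketch follows; the replication count and kernel body are placeholders, but the attributes match those used in the gridder kernel later in this deck:

__attribute__((max_global_work_dim(0)))   // single-work-item kernel
__attribute__((autorun))                  // starts by itself, no host enqueue needed
__attribute__((num_compute_units(4)))     // instantiate four identical copies
__kernel void worker()
{
  int id = get_compute_id(0);             // 0..3: which copy this is
  // ... each copy would read from its own channel, e.g. my_channel[id]
}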
13
More GPU/FPGA differences
• code outside critical loop:
– GPU: not an issue
– FPGA: can use many resources
• Shift registers
– FPGA: single cycle
– GPU: expensive
init_once();
for (int i = 0; i < 100000; i ++)
  do_many_times();

for (int i = 256; i > 0; i --)
  a[i] = a[i - 1];
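On the FPGA this pattern is typically written with an unrolled shift so the whole register moves in one clock cycle; a minimal sketch, with array size and names chosen for illustration:

// push new_value into a 257-element shift register
void shift_in(float shift_reg[257], float new_value)
{
  #pragma unroll
  for (int i = 256; i > 0; i --)
    shift_reg[i] = shift_reg[i - 1];   // with unroll, all moves happen in the same cycle
  shift_reg[0] = new_value;            // insert the newest element
}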
14
Resource optimizations
• compiler feedback
– after minutes, not hours
– HTML-based reports
• indispensable tool
15
Frequency optimization
● Compiler (place & route) determines Fmax
  – unlike GPUs
● Full FPGA: longest path limits Fmax
● HDL: fine-grained control
● OpenCL: one clock for full design
● FPGA: 450 MHz; BSP: 400 MHz; 1 DSP: 350 MHz
● Little compiler feedback
● Recompile with random seeds
● Compile kernels in isolation to find frequency limiters
  – Tedious, but useful
16
Applications
● dense matrix multiplication
● radio-astronomical imager
● signal processing (FIR filter, FFT, correlations, beam forming)
17
Dense matrix multiplication
• complex<float>
• horizontal & vertical memory accesses (→ coalesce)
• reuse of data (→ cache)
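As a point of reference, a naive, untiled formulation of the problem; this is purely illustrative, not the kernel used in the study, and assumes complex<float> is stored as an OpenCL float2 (.x = real, .y = imag), row-major, size N×N:

__kernel void matmul_naive(__global const float2 *A,
                           __global const float2 *B,
                           __global float2 *C,
                           unsigned N)
{
  unsigned row = get_global_id(1);
  unsigned col = get_global_id(0);
  float2 sum = (float2) (0.0f, 0.0f);

  for (unsigned k = 0; k < N; k++) {
    float2 a = A[row * N + k];   // horizontal access (unit stride)
    float2 b = B[k * N + col];   // vertical access (stride N → coalescing/caching matter)
    sum.x += a.x * b.x - a.y * b.y;
    sum.y += a.x * b.y + a.y * b.x;
  }
  C[row * N + col] = sum;
}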
18
Matrix multiplication on FPGA
• hierarchical approach
  – systolic array of Processing Elements (PE)
  – each PE computes 32x32 submatrix
[Diagram: the A and B matrices are read from memory and fed through reorder/repeat stages into an array of Processing Elements (16 PEs in the figure); collect stages gather the results and write the C matrix]
19
Matrix multiplication performance
• Arria 10:
– uses 89% of the DSPs, 40% on-chip memory
– clock (288 MHz) at 64% of peak (450 MHz)
– nearly stall free
Type   Device                    Performance (TFlop/s)   Power (W)   Efficiency (GFlop/W)
FPGA   Intel Arria 10            0.774                   37          20.9
GPU    NVIDIA Titan X (Pascal)   10.1                    263         38.4
GPU    AMD Vega FE               9.73                    265         36.7
20
Image-Domain Gridding for Radio-Astronomical Imaging
● see talk S9306 (Astronomical Imaging on GPUs)
21
Image-Domain Gridding algorithm
#pragma parallel
for s = 1...S :
complex<float> subgrid[P ][N ×N ];
for i = 1...N ×N :
float offset = compute_offset(s, i);
for t = 1...T :
float index = compute_index(s, i, t);
for c = 1...C :
float scale = scales[c];
float phase = offset - (index × scale);
complex<float> phasor = {cos(phase), sin(phase)};
#pragma unroll
for p = 1...P : // 4 polarizations
complex<float> visibility = visibilities[t][c][p];
subgrid[p][i] += cmul(phasor, visibility);
apply_aterm(subgrid);
apply_taper(subgrid);
apply_ifft(subgrid);
store(subgrid);
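The cmul() in the pseudocode is an ordinary complex multiply. A minimal sketch in OpenCL C, assuming complex<float> is carried as a float2 with the real part in .x and the imaginary part in .y; this helper is illustrative, not taken from the talk:

float2 cmul(float2 a, float2 b)
{
  return (float2) (a.x * b.x - a.y * b.y,    // real part
                   a.x * b.y + a.y * b.x);   // imaginary part
}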
22
FPGA gridding design
● Create a dataflow network:
  – Use all available DSPs
  – Every DSP performs a useful computation every cycle
  – no device memory use (except kernel args)
23
FPGA OpenCL gridder kernel

__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(NR_GRIDDERS)))
__kernel void gridder()
{
  int gridder = get_compute_id(0);
  float8 subgrid[NR_PIXELS];

  for (unsigned short pixel = 0; pixel < NR_PIXELS; pixel ++) {
    subgrid[pixel] = 0;
  }

#pragma ivdep
  for (unsigned short vis_major = 0; vis_major < NR_VISIBILITIES; vis_major += UNROLL_FACTOR) {
    float8 visibilities[UNROLL_FACTOR] __attribute__((register));

    for (unsigned short vis_minor = 0; vis_minor < UNROLL_FACTOR; vis_minor++) {
      visibilities[vis_minor] = read_channel_intel(visibilities_channel[gridder]);
    }

    for (unsigned short pixel = 0; pixel < NR_PIXELS; pixel++) {
      float8 pixel_value = subgrid[pixel];
      float8 phasors = read_channel_intel(phasors_channel[gridder]); // { cos(phase), sin(phase) }

#pragma unroll
      for (unsigned short vis_minor = 0; vis_minor < UNROLL_FACTOR; vis_minor++) {
        pixel_value.even += phasors[vis_minor] * visibilities[vis_minor].even + -phasors[vis_minor] * visibilities[vis_minor].odd;
        pixel_value.odd  += phasors[vis_minor] * visibilities[vis_minor].odd  +  phasors[vis_minor] * visibilities[vis_minor].even;
      }

      subgrid[pixel] = pixel_value;
    }
  }

  for (unsigned short pixel = 0; pixel < NR_PIXELS; pixel ++) {
    write_channel_intel(pixel_channel[gridder], subgrid[pixel]);
  }
}
24
Sine/cosine optimization
● compiler-generated: 8 DSPs
● limited-precision lookup table: 1 DSP
  – more DSPs available → replicate Φ: 14× → 20×
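A minimal sketch of the idea behind a limited-precision lookup, assuming a power-of-two table holding {cos, sin} samples of one period; the table size, name and interface are illustrative, not taken from the talk:

#define TABLE_SIZE 1024   /* illustrative size, must be a power of two */

/* phase in radians; table[i] = { cos(2*pi*i/TABLE_SIZE), sin(2*pi*i/TABLE_SIZE) } */
float2 approx_sincos(float phase, __constant float2 *table)
{
  int i = (int) floor(phase * (TABLE_SIZE / (2.0f * M_PI_F)));
  return table[(unsigned) i & (TABLE_SIZE - 1)];   /* wraps around for any phase */
}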
25
FPGA resource usage + Fmax
                ALUTs   FFs   RAMs   DSPs         MLABs   ɸ    Fmax (MHz)
gridding-ip     43%     31%   64%    1439 (95%)   71%     14   258
degridding-ip   47%     35%   72%    1441 (95%)   78%     14   254
gridding-lu     27%     32%   61%    1498 (99%)   57%     20   256
degridding-lu   33%     38%   73%    1503 (99%)   69%     20   253

● almost all of the 1518 available DSPs are used
● Fmax < 350 MHz
26
Experimental setup
● compare Intel Arria 10 FPGA to comparable CPU and GPU
● CPU and GPU implementations are both optimized

Type   Device                 #FPUs   Peak           Bandwidth   TDP    Process
CPU    Intel Xeon E5-2697v3   224     1.39 TFlop/s   68 GB/s     145W   28nm (TSMC)
FPGA   Nallatech 385A         1518    1.37 TFlop/s   34 GB/s     75W    20nm (TSMC)
GPU    NVIDIA GTX 750 Ti      640     1.39 TFlop/s   88 GB/s     60W    28nm (TSMC)
27
Throughput and energy-efficiency comparison
● throughput: number of visibilities processed per second (Mvisibilities/s)
● energy efficiency: number of visibilities processed per Joule (Mvisibilities/J)
● similar peak performance, so what causes the performance differences?
28
Performance analysis
● CPU: sin/cos in software takes 80% of total runtime
● GPU: sin/cos overlapped with other computations
● FPGA: sin/cos overlapped, but cannot use all DSPs for FMAs
29
FPGAs vs GPUs: lessons learned (1/3)
● dataflow vs. imperative
  – rethink your algorithm
  – different program code
● on FPGAs:
  – frequent use of OpenCL extensions
  – think about resource usage, occupancy, timing
  – less use of off-chip memory
● parallelism:
  – both: kernel replication and vectorization
  – FPGA: pipelining and loop unrolling
30
FPGAs vs GPUs: lessons learned (2/3)
● FPGA ≠ GPU, yet: same optimizations …
  – exploit parallelism
  – maximize FPU utilization
    ● hide latency
  – optimize memory performance
    ● hide latency → prefetch
    ● maximize bandwidth → avoid bank conflicts, unit-stride access (coalescing)
    ● reuse data (caching)
● … but achieved in very different ways!
● + architecture-specific optimizations
31
FPGAs vs GPUs: lessons learned (3/3)
● much simpler than HDL
  – FPGAs accessible to wider audience
● long learning curve
  – small performance penalty
● yet not as easy as GPUs
  – distribute resources in complex dataflow
  – optimize for high clock
● tools still maturing
32
Current/future work
• Stratix 10 versus Volta/Turing
– Stratix 10 imager port not trivial ← different optimizations for new routing architecture
• 100 GbE FPGA support
– send/receive UDP packets in OpenCL
– BSP firmware development
• streaming data applications
– signal processing (filtering, correlations, beam forming, ...)
• OpenCL FFT library
33
Conclusions
• complex applications on FPGAs now possible
– high-level language
– hard FPUs
– tight integration with CPUs
• GPUs vs FPGAs
– 1 language; no (performance) portability
– very different architectures
– yet many optimizations similar
34
So can FPGAs compete with GPUs?
• depends on application
– compute → GPUs
– energy efficiency → GPUs
– I/O → FPGA
– flexibility → CPU/GPU
– programming ease → CPU, then GPU, then FPGA
FPGA not far behind; CPU way behind
35
Acknowledgements
• This work was funded by
– Netherlands eScience Center (Triple-A 2)
– EU H2020 FETHPC (DEEP-EST, Grant Agreement nr. 754304)
– NWO Netherlands Foundation for Scientific Research (DAS-5)
• Other people
– Suleyman Demirsoy (Intel), Johan Hiddink (NLeSC), Atze v.d. Ploeg (NLeSC), Daniël v.d. Schuur (ASTRON), Merijn Verstraaten (NLeSC), Ben v. Werkhoven (NLeSC)