2. 2
CUDA 6 Performance Report
CUDART CUDA Runtime Library
cuFFT Fast Fourier Transforms Library
cuBLAS Complete BLAS Library
cuSPARSE Sparse Matrix Library
cuRAND Random Number Generation (RNG) Library
NPP Performance Primitives for Image & Video Processing
Thrust Templated Parallel Algorithms & Data Structures
math.h C99 floating-point Library
Included in the CUDA Toolkit (free download):
developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries:
developer.nvidia.com/gpu-accelerated-libraries
3. 3
CUDA 6: 2x Faster GPU Kernel Launches
[Bar chart: launch latency (usec), CUDA 5 vs. CUDA 6: back-to-back launches 1.5x faster, launch-and-synchronize 2.0x faster]
Performance may vary based on OS version and motherboard configuration • CUDA 5.0 and CUDA 6.0 on Tesla K20
a<<<...>>>; b<<<...>>>;               // back-to-back launches
a<<<...>>>; cudaDeviceSynchronize();  // launch and synchronize
Dynamic Parallel Kernel Launches
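Filled out as a complete host program, the two timed launch patterns look like the following sketch (kernel bodies and launch configurations are illustrative, not from the benchmark itself; requires a CUDA-capable GPU):

```cuda
#include <cuda_runtime.h>

__global__ void a(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void b(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    float *d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));

    // Back-to-back launches: both kernels are queued asynchronously;
    // the host does not wait between them.
    a<<<1, 256>>>(d_x);
    b<<<1, 256>>>(d_x);

    // Launch and synchronize: the host blocks until the kernel
    // (and all work queued before it) has finished.
    a<<<1, 256>>>(d_x);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```

The launch-latency benchmark above times exactly these two patterns, so lowering per-launch overhead benefits any code that issues many small kernels.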
4. 4
cuFFT: Multi-dimensional FFTs
Real and complex
Single- and double-precision data types
1D, 2D and 3D batched transforms
Flexible input and output data layouts
New in
CUDA 6
XT interface supports dual-GPU cards
(Tesla K10, GeForce GTX690, …)
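A minimal single-GPU sketch of the batched interface described above (transform size and batch count are illustrative; error checking omitted):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;  // 1D transform size
    const int batch = 8;    // number of transforms in the batch

    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * n * batch);

    // Plan a batch of 1D single-precision complex-to-complex FFTs.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    // Execute the forward transforms in place on the device.
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```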
5. 5
cuFFT: up to 700 GFLOPS
Performance may vary based on OS version and motherboard configuration
[Charts: GFLOPS vs. log2(transform size): single precision peaks near 700 GFLOPS, double precision near 300 GFLOPS]
• cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device
1D Complex, Batched FFTs: used in audio processing and as a foundation for 2D and 3D FFTs
6. 6
cuFFT: Consistently High Performance
[Charts: GFLOPS vs. transform size from 1 to 100,000,000: single precision up to ~700 GFLOPS, double precision up to ~300 GFLOPS]
• cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on device
Performance may vary based on OS version and motherboard configuration
1D Complex, Batched FFTs: used in audio processing and as a foundation for 2D and 3D FFTs
8. 8
cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double, complex, and double-complex precision
Host- and device-callable interfaces
New in
CUDA 6
XT Interface for Level 3 BLAS
Distributed computations across multiple GPUs
Out-of-core streaming to GPU, no upper limit on matrix size
“Drop-in” BLAS intercepts CPU BLAS calls, streams to GPU
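As a sketch of the standard device-pointer interface, a single-precision GEMM (matrix contents and sizes are illustrative; error checking omitted; requires a CUDA-capable GPU):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;  // m = n = k
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, sizeof(float) * n * n);
    cudaMalloc(&d_B, sizeof(float) * n * n);
    cudaMalloc(&d_C, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C (column-major, no transpose)
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```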
9. 9
cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision
[Bar chart: GFLOPS for GEMM, SYMM, TRSM, and SYRK in single, single-complex, double, and double-complex precision]
Performance may vary based on OS version and motherboard configuration
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
10. 10
cuBLAS: ZGEMM 5x Faster than MKL
[Line chart: ZGEMM GFLOPS vs. matrix dimension (m=n=k) up to 4096, cuBLAS vs. MKL]
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
Performance may vary based on OS version and motherboard configuration
11. 11
cuBLAS-XT: Multi-GPU Performance Scaling
New in CUDA 6
16K x 16K SGEMM on Tesla K10
[Bar chart: 1 x K10: 2.2 TFLOPS, 2 x K10: 4.2 TFLOPS, 3 x K10: 6.0 TFLOPS, 4 x K10: 7.9 TFLOPS]
• cuBLAS-XT 6.0 on K10, ECC ON, input and output data on host
Performance may vary based on OS version and motherboard configuration
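The XT interface takes host pointers and tiles the computation across the selected GPUs, with no upper limit on matrix size. A minimal sketch (device IDs, sizes, and allocation strategy are illustrative; error checking omitted):

```cuda
#include <cublasXt.h>
#include <cstdlib>

int main() {
    const size_t n = 16384;  // 16K x 16K, as in the benchmark
    float *A = (float *)malloc(sizeof(float) * n * n);
    float *B = (float *)malloc(sizeof(float) * n * n);
    float *C = (float *)malloc(sizeof(float) * n * n);

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    // Select the GPUs to distribute the GEMM across.
    int devices[2] = {0, 1};
    cublasXtDeviceSelect(handle, 2, devices);

    // Host-resident matrices are tiled and streamed to the GPUs
    // automatically; the call blocks until C is complete.
    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```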
12. 12
cuSPARSE: Sparse linear algebra routines
Optimized sparse linear algebra BLAS routines: matrix-vector, matrix-matrix, triangular solve
Support for a variety of formats (CSR, COO, block variants)
[Figure: sparse matrix-vector multiply, y = alpha * op(A) * x + beta * y]
New in
CUDA 6
Many improvements to triangular solvers, incomplete-LU, and Cholesky preconditioners
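A sketch of the CSR matrix-vector routine through the CUDA-6-era API (cusparseScsrmv, since superseded in later toolkits); the wrapper function and its arguments are illustrative:

```cuda
#include <cusparse_v2.h>

// y = alpha * A * x + beta * y, with A an m x n matrix in CSR format.
// d_val, d_rowPtr, d_colInd, d_x, d_y are assumed to hold valid,
// device-resident data.
void spmv(cusparseHandle_t handle, int m, int n, int nnz,
          const float *d_val, const int *d_rowPtr, const int *d_colInd,
          const float *d_x, float *d_y) {
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    const float alpha = 1.0f, beta = 0.0f;
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_rowPtr, d_colInd, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
}
```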
13. 13
cuSPARSE: 5x Faster than MKL
[Bar chart: Sparse Matrix x Dense Vector (SpMV) speedup over MKL across test matrices]
• Average of s/c/d/z routines
• cuSPARSE 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
Performance may vary based on OS version and motherboard configuration
14. 14
cuRAND: Random Number Generation
Generating high-quality random numbers in parallel is hard
Don’t do it yourself, use a library!
Pseudo- and Quasi-RNGs
Supports several output distributions
Statistical test results in documentation
New in
CUDA 6 Mersenne Twister 19937
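A minimal sketch of the host-side API using the MT19937 generator mentioned above (buffer size and seed are illustrative; error checking omitted):

```cuda
#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24;
    double *d_data;
    cudaMalloc(&d_data, n * sizeof(double));

    // Mersenne Twister 19937, new in CUDA 6.
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MT19937);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Fill the device buffer with uniformly distributed doubles.
    curandGenerateUniformDouble(gen, d_data, n);

    curandDestroyGenerator(gen);
    cudaFree(d_data);
    return 0;
}
```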
15. 15
cuRAND: Up to 75x Faster vs. Intel MKL
[Bar chart: Gsamples/sec for Sobol32 and MRG32k3a under uniform, normal, and log-normal distributions, cuRAND vs. MKL]
• cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHz
Performance may vary based on OS version and motherboard configuration
16. 16
cuRAND: High Performance RNGs
[Bar chart: Gsamples/sec under uniform, normal, and log-normal distributions for pseudo-random generators (XORWOW, Philox, MRG32k3a, MTGP32) and quasi-random generators (Sobol32, Scrambled Sobol32, Sobol64, Scrambled Sobol64)]
Performance may vary based on OS version and motherboard configuration
• cuRAND 6.0 on K40m, ECC ON, double-precision input and output data on device
17. 17
NPP: NVIDIA Performance Primitives
Over 5000 image and signal processing routines:
color transforms, geometric transforms, move operations, linear
filters, image & signal statistics, image & signal arithmetic, JPEG
building blocks, image segmentation
New in
CUDA 6
Over 500 new routines, including:
median filter, BGR/YUV conversion, 3D LUT color conversion,
improvements to JPEG primitives, plus many more
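As a sketch of calling one of the signal-arithmetic primitives, element-wise addition of two device-resident signals (buffer size is illustrative; error checking omitted):

```cuda
#include <npps.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    Npp32f *d_a, *d_b, *d_sum;
    cudaMalloc(&d_a, n * sizeof(Npp32f));
    cudaMalloc(&d_b, n * sizeof(Npp32f));
    cudaMalloc(&d_sum, n * sizeof(Npp32f));

    // Element-wise signal addition on the device: d_sum = d_a + d_b
    nppsAdd_32f(d_a, d_b, d_sum, n);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_sum);
    return 0;
}
```

The image primitives follow the same pattern, with additional step-pitch and region-of-interest parameters.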
18. 18
NPP Speedup vs. Intel IPP
Performance may vary based on OS version and motherboard configuration
• NPP 6.0 on K40m, input and output data on device
• IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
[Bar chart: image set (8-bit RGB) 5.7x, image set channel (8-bit RGB) 14.4x, image resize (8-bit RGB) 17.8x, image Gaussian filter (32-bit float) 6.3x, color conversion 8-bit YUV422 to 8-bit RGB 12.9x, JPEG 8x8 forward DCT 28.5x]
19. 19
Thrust: CUDA C++ Template Library
Host and Device Containers that mimic the C++ STL
Optimized Algorithms for sort, reduce, scan, etc.
OpenMP Backend for portability
Also available on github: thrust.github.com
Allows applications and prototypes to be built quickly
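A minimal sketch of the container and algorithm style described above (vector size and fill are illustrative):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main() {
    // Fill a host vector with random integers, then copy it to the GPU;
    // the assignment performs the host-to-device transfer.
    thrust::host_vector<int> h(32 << 20);  // 32M elements
    thrust::generate(h.begin(), h.end(), rand);
    thrust::device_vector<int> d = h;

    // Sort and reduce run on the device with tuned implementations.
    thrust::sort(d.begin(), d.end());
    long long sum = thrust::reduce(d.begin(), d.end(), 0LL);

    return sum == 0 ? 1 : 0;
}
```

Switching the backend to OpenMP (via a compile-time flag) lets the same code run on multicore CPUs.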
20. 20
Thrust Performance vs. Intel TBB
Performance may vary based on OS version and motherboard configuration
• Thrust v1.7.1 on K40m, ECC ON, input and output data on device
• TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
[Bar chart: Thrust vs. TBB on 32M integers, speedups for reduce, transform, scan, and sort]
[Bar chart: Thrust sort vs. TBB on 32M samples: char 43.7x, short 24.8x, int 13.3x, long 5.2x, float 15.0x, double 5.8x]
21. 21
math.h: C99 floating-point library + extras
CUDA math.h is industry proven, high performance, accurate
• Basic: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float and double, all rounding modes)
• Exponentials: exp, exp2, log, log2, log10, ...
• Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
• Special functions: lgamma, tgamma, erf, erfc
• Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
• Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv], ...
New in
CUDA 6
• Over 80 new SIMD instructions
• Useful for video processing: _v*2, _v*4
• Cylindrical Bessel functions: cyl_bessel_i{0,1}
• Reciprocal hypotenuse (1/hypot): rhypot
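A device kernel exercising a few of the extras listed above (the kernel and its arguments are illustrative):

```cuda
#include <cuda_runtime.h>

// Each thread evaluates several math.h extras on one input value.
__global__ void math_demo(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = in[i];
        double s, c;
        sincospi(x, &s, &c);         // sin(pi*x) and cos(pi*x) in one call
        out[i] = rsqrt(1.0 + s * s)  // reciprocal square root
               + rhypot(s, c)        // 1/hypot(s, c), new in CUDA 6
               + normcdf(x);         // standard normal CDF
    }
}
```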