This document provides an overview of OpenCL libraries for GPU programming. It discusses specialized GPU libraries like clFFT for fast Fourier transforms and Random123 for random number generation. It also covers general GPU libraries like Bolt, OpenCV, and ArrayFire. ArrayFire is highlighted as it provides a flexible array data structure and hundreds of parallel functions across domains like image processing, machine learning, and linear algebra. It supports JIT compilation and data-parallel constructs like GFOR to improve performance.
6. Library Types
● Specialized GPU Libs
○ Targeted at a specific set of operators (functionality)
○ Optimized for specific systems
○ C-like interface
○ Raw pointer interface
● General GPU Libs
○ Manage GPU resources using containers
○ Applicable to a large set of applications and domains
○ Portable across multiple architectures
○ Higher level functions
○ C++ interface (supports templates)
7. Specialized GPU Libraries
● Fast Fourier Transforms
○ clFFT
● Random Number Generation
○ Random123
● Linear Algebra
○ clBLAS
○ MAGMA
● Signal and Image Processing
○ OpenCLIPP
8. Specialized GPU Libraries
● C Interface
○ Use pointers to reference data
● Memory management is the programmer's responsibility
● Mimic existing libraries
○ clBLAS ≈ BLAS
○ MAGMA ≈ BLAS + LAPACK
○ clFFT ≈ FFTW
● Simplifies GPU integration of specialized scientific libraries
○ Still requires setting up the GPU
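The C-style, raw-pointer convention described above can be sketched on the host as follows. This is a minimal illustration in plain C++, not clBLAS or MAGMA code: `saxpy_raw` and `saxpy_demo` are hypothetical names, the argument order mimics the reference BLAS routine SAXPY (n, alpha, x, incx, y, incy), and the caller owns both buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical host-side routine in the style of BLAS SAXPY: y = alpha * x + y.
// The routine only sees raw pointers, a length and strides.
void saxpy_raw(int n, float alpha, const float* x, int incx, float* y, int incy)
{
    assert(n >= 0 && incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i)
        y[static_cast<std::size_t>(i) * incy] += alpha * x[static_cast<std::size_t>(i) * incx];
}

// The caller allocates, fills and frees the buffers itself.
float saxpy_demo()
{
    const int n = 4;
    float* x = static_cast<float*>(std::malloc(n * sizeof(float)));
    float* y = static_cast<float*>(std::malloc(n * sizeof(float)));
    assert(x != nullptr && y != nullptr);
    for (int i = 0; i < n; ++i) { x[i] = static_cast<float>(i); y[i] = 1.0f; }
    saxpy_raw(n, 2.0f, x, 1, y, 1);   // y becomes {1, 3, 5, 7}
    const float last = y[n - 1];
    std::free(x);                     // omitting these calls leaks memory
    std::free(y);
    return last;
}
```

A GPU library with this kind of interface would additionally need device buffers and a command queue set up by the caller, which the sketch leaves out.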
9. clFFT
● 1D, 2D and 3D transforms
● CPU and GPU backends
● Supports
○ Real and complex data types
○ Single and double-precision
○ Execution of multiple transformations concurrently
10. Random123
● Counter-based RNG
● Passes the TestU01 SmallCrush, Crush and BigCrush test batteries
● Four RNG families
○ Threefry
○ Philox
○ AESNI
○ ARS
● Not suitable for cryptography
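The counter-based idea can be sketched in a few lines: each random value is a pure function of a key and a counter, so no generator state has to be carried between calls. The sketch below is an assumption-laden illustration, not Random123 code: `mix64`, `cbrng` and `cbrng_uniform` are hypothetical names, and the mixing step is the SplitMix64 finalizer standing in for the Threefry or Philox rounds.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit mixing function (the SplitMix64 finalizer). It is only a stand-in
// for the stronger Threefry/Philox rounds and is used here for illustration.
inline std::uint64_t mix64(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Counter-based generator: the output depends only on (key, counter).
// No state is carried between calls, so any thread can compute the
// n-th value of any stream directly.
inline std::uint64_t cbrng(std::uint64_t key, std::uint64_t counter)
{
    return mix64(mix64(counter) ^ key);
}

// Map the top 53 bits of the output to a double in [0, 1).
inline double cbrng_uniform(std::uint64_t key, std::uint64_t counter)
{
    return static_cast<double>(cbrng(key, counter) >> 11) * (1.0 / 9007199254740992.0);
}
```

In a GPU kernel, the key could identify the stream and the counter could be derived from the work-item index, so that work-items never share or synchronize generator state.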
11. MAGMA & clBLAS
● Implements many popular linear algebra routines
● Supports
○ Real and complex data types
○ Single and double-precision
12. OpenCLIPP
● Supports multiple image types
● Similar to Intel IPP
● Primitives
○ Arithmetic and logic
○ LUT
○ Morphology
○ Transform
○ Resize
○ Histogram
○ Many more…
● C and C++ interface
14. Bolt
● GPU library which resembles C++ STL
○ STL like data structures
○ Iterators
○ Fully interoperable with OpenCL
● Parallel vector operation methods
○ Reductions
○ Sorting
○ Prefix-Sum
● Customizable GPU kernels using functors
● Some functions only supported on AMD GPUs
15. Bolt - Data Structures
● Built around the device_vector
● Supports the same data types as C++
○ device_vector<float> data(2e6);
● Useful when performing multiple operations on a vector
● Can be passed into STL algorithms
○ Allows interoperability
○ Data transfer will be costly
16. Bolt - Algorithms
● Uses a C++ STL like interface
○ Pass the begin and end iterators
● Accepts functors, which allow you to run custom operations on OpenCL devices
● Multiple backends
○ OpenCL, C++AMP, and TBB
○ Not all algorithms implemented across all backends
● Works on vector and device_vector
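The calling pattern described above (begin and end iterators plus a functor) is the one used by the C++ standard library. The sketch below uses only the standard library on the host and does not call Bolt; `Saxpy`, `saxpy`, `reduce_sum`, `prefix_sum` and `sorted_desc` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// User-defined functor: the custom operation applied to each pair of elements.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// Transform with a functor: out[i] = a * x[i] + y[i].
std::vector<float> saxpy(float a, const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), Saxpy{a});
    return out;
}

// Reduction: sum of all elements.
float reduce_sum(const std::vector<float>& v)
{
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

// Inclusive prefix sum.
std::vector<float> prefix_sum(const std::vector<float>& v)
{
    std::vector<float> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Sorting in descending order with a standard comparison functor.
std::vector<float> sorted_desc(std::vector<float> v)
{
    std::sort(v.begin(), v.end(), std::greater<float>());
    return v;
}
```

A library with an STL-like interface keeps these call shapes and changes where the work runs, which is why the functor has to be something the device backend can compile.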
17. OpenCV
● Open source computer vision library
● C++ interface with many language wrappers
● Hundreds of CV functions
19. ArrayFire - Data Structures
● Built around a flexible data structure named "array"
○ Lightweight wrapper around the data on the compute device
○ Manages the data and basic metadata such as size, type and dimensions
● You can transfer data into an array using constructors
● Column major
float hA[6] = {0, 1, 2, 3, 4, 5};
array A(2, 3, hA);
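Column-major storage means the host buffer is read one column at a time. A minimal host-side sketch of that layout, in plain C++ without ArrayFire (`at_colmajor` and `column_sum` are hypothetical helper names):

```cpp
#include <cassert>
#include <cstddef>

// Column-major lookup: element (row, col) of a matrix with `rows` rows,
// stored column after column in a flat buffer.
inline float at_colmajor(const float* data, std::size_t rows, std::size_t row, std::size_t col)
{
    assert(row < rows);
    return data[row + col * rows];
}

// Sum of one column; the elements of a column are contiguous in memory.
inline float column_sum(const float* data, std::size_t rows, std::size_t col)
{
    float s = 0.0f;
    for (std::size_t r = 0; r < rows; ++r)
        s += at_colmajor(data, rows, r, col);
    return s;
}
```

With hA = {0, 1, 2, 3, 4, 5} read as a 2 x 3 matrix, the columns are (0, 1), (2, 3) and (4, 5), so element (0, 1) is 2 and element (1, 2) is 5.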
20. ArrayFire - Indexing
#include <arrayfire.h>
#include <af/utils.h>
using namespace af;
void af_example()
{
    float f[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    array a(2, 4, f); // 2 rows x 4 cols array initialized with f values (column major)
    array sumSecondCol = sum(a(span, 1)); // reduce-sum over the second column: 4 + 8
    print(sumSecondCol); // 12
}
21. ArrayFire Example - swap R and B
Using ArrayFire:
array tmp = img(span,span,0); // save the R channel
img(span,span,0) = img(span,span,2); // R channel gets values of B
img(span,span,2) = tmp; // B channel gets values of R
Can also do it this way:
array swapped = join(2, img(span,span,2),  // blue
                        img(span,span,1),  // green
                        img(span,span,0)); // red
Or simply:
array swapped = img(span,span,seq(2,-1,0));
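What the channel swap does to memory can be sketched on the host. The sketch assumes a rows x cols x 3 image stored column-major, so each channel is one contiguous plane of rows * cols values; `swap_red_blue` is a hypothetical helper, not an ArrayFire function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Swap the first (R) and third (B) channel planes of a rows x cols x 3 image
// stored column-major, i.e. as three contiguous planes of rows * cols values.
void swap_red_blue(std::vector<float>& img, std::size_t rows, std::size_t cols)
{
    const std::size_t plane = rows * cols;
    assert(img.size() == 3 * plane);
    const auto offset = static_cast<std::ptrdiff_t>(plane);
    std::swap_ranges(img.begin(), img.begin() + offset, img.begin() + 2 * offset);
}
```

The green plane in the middle is untouched, which matches the three-line version above that only reads and writes channels 0 and 2.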
22. ArrayFire Functions
Using ArrayFire:
array img = loadimage("image.jpg", false); // load grayscale image from disk to device
array img_T = img.T(); // transpose
33. Image smoothing
ArrayFire
array S = bilateral(I, sigma_r, sigma_c);
array M = meanshift(I, sigma_r, sigma_c, iter);
array R = medfilt(img, 3, 3);
// Gaussian blur
array gker = gaussiankernel(ncols, ncols);
array res = convolve(img, gker);
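The Gaussian blur above is a kernel generation step followed by a 2D convolution. A minimal host-side sketch of both steps in plain C++ follows. It is an illustration under stated assumptions, not ArrayFire's implementation: `gaussian_kernel` and `convolve2d` are hypothetical names, `sigma` is a free parameter of the sketch, and the image border is handled by zero padding.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized n x n Gaussian kernel (n odd), stored column-major.
std::vector<float> gaussian_kernel(std::size_t n, float sigma)
{
    assert(n % 2 == 1 && sigma > 0.0f);
    const long N = static_cast<long>(n), half = N / 2;
    std::vector<float> k(n * n);
    float total = 0.0f;
    for (long c = -half; c <= half; ++c)
        for (long r = -half; r <= half; ++r) {
            const float w = std::exp(-static_cast<float>(r * r + c * c) / (2.0f * sigma * sigma));
            k[static_cast<std::size_t>((r + half) + (c + half) * N)] = w;
            total += w;
        }
    for (float& w : k) w /= total;   // weights sum to 1
    return k;
}

// Same-size 2D convolution; pixels outside the image count as zero.
std::vector<float> convolve2d(const std::vector<float>& img, std::size_t rows, std::size_t cols,
                              const std::vector<float>& ker, std::size_t n)
{
    assert(img.size() == rows * cols && ker.size() == n * n && n % 2 == 1);
    const long R = static_cast<long>(rows), C = static_cast<long>(cols);
    const long N = static_cast<long>(n), half = N / 2;
    std::vector<float> out(rows * cols, 0.0f);
    for (long c = 0; c < C; ++c)
        for (long r = 0; r < R; ++r) {
            float acc = 0.0f;
            for (long kc = -half; kc <= half; ++kc)
                for (long kr = -half; kr <= half; ++kr) {
                    const long rr = r - kr, cc = c - kc;   // flipped kernel: true convolution
                    if (rr < 0 || cc < 0 || rr >= R || cc >= C) continue;   // zero padding
                    acc += img[static_cast<std::size_t>(rr + cc * R)]
                         * ker[static_cast<std::size_t>((kr + half) + (kc + half) * N)];
                }
            out[static_cast<std::size_t>(r + c * R)] = acc;
        }
    return out;
}
```

Because the kernel weights sum to 1, a constant image stays constant in the interior; values near the border drop because of the zero padding.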
34. FFT
ArrayFire
array R1 = fft2(I); // 2D FFT; see also fft and fft3
array R2 = fft2(I, M, N); // fft2 with padding
array R3 = ifft2(fft2(I, M, N) * fft2(K, M, N)); // convolve using fft2
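The last line relies on the convolution theorem: pad both inputs to a common size, multiply their spectra element-wise, and transform back. A 1D host-side sketch of the same idea follows. It uses a naive O(n^2) DFT in place of a real FFT so that it stays short; `dft` and `fft_convolve` are hypothetical names, and nothing here is ArrayFire or clFFT code.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(n^2) discrete Fourier transform. With inverse = true the sign of
// the exponent flips and the result is scaled by 1/n.
cvec dft(const cvec& x, bool inverse)
{
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi * static_cast<double>(k)
                               * static_cast<double>(t) / static_cast<double>(n);
            acc += x[t] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        out[k] = inverse ? acc / static_cast<double>(n) : acc;
    }
    return out;
}

// Linear convolution via the convolution theorem: zero-pad both signals to
// a.size() + b.size() - 1, multiply the spectra, then transform back.
std::vector<double> fft_convolve(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(!a.empty() && !b.empty());
    const std::size_t n = a.size() + b.size() - 1;
    cvec fa(n), fb(n);                       // value-initialized to zero
    for (std::size_t i = 0; i < a.size(); ++i) fa[i] = a[i];
    for (std::size_t i = 0; i < b.size(); ++i) fb[i] = b[i];
    cvec A = dft(fa, false);
    const cvec B = dft(fb, false);
    for (std::size_t k = 0; k < n; ++k) A[k] *= B[k];
    const cvec r = dft(A, true);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = r[i].real();
    return out;
}
```

The padded length plays the role of M and N in the slide: without padding, the product of spectra gives a circular convolution rather than a linear one.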
35. ArrayFire Capabilities
● Hundreds of parallel functions for multi-disciplinary work
○ Image processing
○ Machine learning
○ Graphics
○ Sets
● Support for multiple languages
○ C/C++, Fortran, Java and R
● Linux, Windows, Mac OS X
36. ArrayFire Capabilities
● OpenGL based graphics
● JIT
○ Combine multiple operations into one kernel
● GFOR - data parallel loop
○ Allows concurrent execution over multiple data sets (for example images)
37. ArrayFire Functions
● Supports hundreds of parallel functions
○ Building blocks
■ Reductions
■ Scan
■ Set operations
■ Sorting
■ Statistics
■ Basic matrix manipulation
Images taken from:
http://technogems.blogspot.com/2011/06/sorting-included-files-by-importance.html
http://www.cmsoft.com.br/tutorialOpenCL/CLMatrixMultExplanationSubMatrixes.png
38. ArrayFire Functions
● Hundreds of highly-optimized parallel functions
○ Signal/image processing
■ Convolution
■ FFT
■ Histograms
■ Interpolation
■ Connected components
○ Linear Algebra
■ Matrix multiply
■ Linear system solving
■ Factorization
39. GFOR: What is it?
• Data-parallel for loop, e.g.
Serial matrix-vector multiplications (3 kernel launches):
for (i = 0; i < 3; i++)
    C(span,span,i) = A(span,span,i) * B;
Parallel matrix-vector multiplications (1 kernel launch):
gfor (array i, 3)
    C(span,span,i) = A(span,span,i) * B;
40. Example: Matrix Multiply
• Data-parallel for loop, e.g.
for (i = 0; i < 3; i++)
    C(span,span,i) = A(span,span,i) * B;
[Figure: iteration i = 1 computes C(,,1) = A(,,1) * B]
Serial matrix-vector multiplications (3 kernel launches)
41. Example: Matrix Multiply
• Data-parallel for loop, e.g.
for (i = 0; i < 3; i++)
    C(span,span,i) = A(span,span,i) * B;
[Figure: iteration i = 1 computes C(,,1) = A(,,1) * B; iteration i = 2 computes C(,,2) = A(,,2) * B]
Serial matrix-vector multiplications (3 kernel launches)
42. Example: Matrix Multiply
• Data-parallel for loop, e.g.
for (i = 0; i < 3; i++)
    C(span,span,i) = A(span,span,i) * B;
[Figure: iteration i = 1 computes C(,,1) = A(,,1) * B; iteration i = 2 computes C(,,2) = A(,,2) * B; iteration i = 3 computes C(,,3) = A(,,3) * B]
Serial matrix-vector multiplications (3 kernel launches)
44. Example: Matrix Multiply
[Figure: simultaneous iterations i = 1:3 compute C(,,1:3) = A(,,1:3) * B]
Think of GFOR as compiling 1 stacked kernel with all iterations.
gfor (array i, 3)
C(span,span,i) = A(span,span,i) * B;
Parallel matrix multiplications (1 kernel launch)
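The difference between the serial loop and the stacked version can be sketched on the host. This is plain C++ and only illustrates the data dependencies, not how GFOR is implemented: `matmul`, `serial_slices` and `batched` are hypothetical names, and all matrices are assumed column-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One slice: C = A * B with A (m x k), B (k x n), C (m x n), all column-major.
void matmul(const float* A, const float* B, float* C,
            std::size_t m, std::size_t k, std::size_t n)
{
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < m; ++r) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) acc += A[r + p * m] * B[p + c * k];
            C[r + c * m] = acc;
        }
}

// Serial version: one call (on a GPU, one kernel launch) per slice.
std::vector<float> serial_slices(const std::vector<float>& A, const std::vector<float>& B,
                                 std::size_t m, std::size_t k, std::size_t n, std::size_t batch)
{
    assert(A.size() == m * k * batch && B.size() == k * n);
    std::vector<float> C(m * n * batch);
    for (std::size_t i = 0; i < batch; ++i)
        matmul(&A[i * m * k], B.data(), &C[i * m * n], m, k, n);
    return C;
}

// Stacked version: one flat loop over every output element of every slice.
// No iteration depends on another, which is the property that lets a
// data-parallel loop run all slices in a single launch.
std::vector<float> batched(const std::vector<float>& A, const std::vector<float>& B,
                           std::size_t m, std::size_t k, std::size_t n, std::size_t batch)
{
    assert(m > 0 && n > 0 && A.size() == m * k * batch && B.size() == k * n);
    std::vector<float> C(m * n * batch);
    for (std::size_t idx = 0; idx < C.size(); ++idx) {
        const std::size_t r = idx % m, c = (idx / m) % n, i = idx / (m * n);
        float acc = 0.0f;
        for (std::size_t p = 0; p < k; ++p) acc += A[i * m * k + r + p * m] * B[p + c * k];
        C[idx] = acc;
    }
    return C;
}
```

Both functions produce the same values; the second only makes explicit that the slice index is one more independent dimension of the output.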
45. JIT Code Generation
● Run time kernel generation
● Combines multiple element-wise operations into one kernel
● Reduces kernel launching overhead
● Intermediate data not allocated
● Improves cache performance
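The effect of fusing element-wise operations can be sketched on the host. The code below does not show ArrayFire's code generator; it is a hand-written comparison, with hypothetical function names, of evaluating sqrt(x*x + y*y) as four separate passes versus one fused pass.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: each element-wise operation is its own pass (on a GPU, its own
// kernel) and writes a full-size temporary array.
std::vector<float> hypot_unfused(const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<float> xx(n), yy(n), sum(n), out(n);
    for (std::size_t i = 0; i < n; ++i) xx[i] = x[i] * x[i];
    for (std::size_t i = 0; i < n; ++i) yy[i] = y[i] * y[i];
    for (std::size_t i = 0; i < n; ++i) sum[i] = xx[i] + yy[i];
    for (std::size_t i = 0; i < n; ++i) out[i] = std::sqrt(sum[i]);
    return out;
}

// Fused: the same expression in one pass, with no intermediate arrays.
std::vector<float> hypot_fused(const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = std::sqrt(x[i] * x[i] + y[i] * y[i]);
    return out;
}
```

The fused version reads each input once and allocates only the output, which corresponds to the fewer launches, missing intermediates and better cache behavior listed above.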
46. Success Stories
Field                     Application                 Speedup
Academia                  Power Systems Simulations   35x
Finance                   Option Pricing              52x
Government                Radar Image Formation       45x
Life Sciences             Pathology Advances          > 100x
Manufacturing             Tomography of Vegetation    10x
Media & Computer Vision   Digital Holography          17x
Oil & Gas                 Ground Water Simulations    > 20x
47. Future capabilities
● We are interested in Big Data applications
● Create capabilities for
○ Streaming video
○ Large number of images
○ Machine learning
○ Data analysis
○ Dynamic data
● Faster rendering utilities for Big Data
48. Comments on Open Source
● https://github.com/arrayfire-community
49. Q & A
Speaker: Oded Green (oded@arrayfire.com)
Engineers:
Umar Arshad (umar@ArrayFire.com)
Pavan Yalamanchili (pavan@ArrayFire.com)
Sales:
Scott Blakeslee (scott@ArrayFire.com)