Productive OpenCL Programming
An Introduction to OpenCL Libraries
● We make code run faster
○ Started in 2007 by Georgia Tech researchers
○ 1000s of paying customers
● We build an acceleration library
○ for really cool science, engineering, and finance applications
○ for mobile computing
Libraries are Great!
Eliminate Hidden Costs
Library Types
● Specialized GPU Libs
○ Targeted at a specific set of operators (functionality)
○ Optimized for specific systems
○ C-like interface
○ Raw pointer interface
● General GPU Libs
○ Manage GPU resources using containers
○ Applicable to a large set of applications and domains
○ Portable across multiple architectures
○ Higher level functions
○ C++ interface (supports templates)
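The two library styles above can be contrasted with a small host-only sketch. Both function names (`saxpy_raw`, `axpy`) are hypothetical and belong to no library named in these slides; `std::vector` merely stands in for a device container.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Specialized-library style: C-like signature, raw pointers, caller owns all memory.
// Computes y = a * x + y over n floats (the classic BLAS "saxpy" shape).
void saxpy_raw(std::size_t n, float a, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// General-library style: templated, container-based, result storage managed for the caller.
template <typename T>
std::vector<T> axpy(T a, const std::vector<T>& x, const std::vector<T>& y) {
    assert(x.size() == y.size());
    std::vector<T> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i] + y[i];
    return out;
}
```

The raw-pointer version leaves sizes and lifetimes to the caller; the container version carries them along with the data.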
Specialized GPU Libraries
● Fast Fourier Transforms
○ clFFT
● Random Number Generation
○ Random123
● Linear Algebra
○ clBLAS
○ MAGMA
● Signal and Image Processing
○ OpenCLIPP
Specialized GPU Libraries
● C Interface
○ Use pointers to reference data
● Memory management is the programmer's responsibility
● Mimic existing libraries
○ clBLAS ≈ BLAS
○ MAGMA ≈ BLAS + LAPACK
○ clFFT ≈ FFTW
● Simplifies GPU integration of specialized scientific libraries
○ Still requires setting up the GPU
clFFT
● 1D, 2D and 3D transforms
● CPU and GPU backends
● Supports
○ Real and complex data types
○ Single and double-precision
○ Execution of multiple transformations concurrently
Random123
● Counter-based RNG
● Passed SmallCrush, Crush and BigCrush tests
● Four RNG families
○ Threefry
○ Philox
○ AESNI
○ ARS
● Not suitable for cryptography
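A counter-based RNG is a pure function of a key and a counter, so any element of the stream can be computed independently. The sketch below is a toy illustration only: it uses the SplitMix64 finalizer as the mixing function, not Philox, Threefry, AESNI or ARS, and `cbrng` and `to_unit` are hypothetical names rather than Random123's API. It has not been run through the Crush test batteries.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit bijective mixer (the SplitMix64 step), used here as a stand-in
// for a real counter-based round function.
inline std::uint64_t mix64(std::uint64_t z) {
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Stateless generator: the output depends only on (key, counter).
inline std::uint64_t cbrng(std::uint64_t key, std::uint64_t counter) {
    return mix64(mix64(key) ^ counter);
}

// Map 64 random bits to a double in [0, 1) using the top 53 bits.
inline double to_unit(std::uint64_t r) {
    return static_cast<double>(r >> 11) * (1.0 / 9007199254740992.0);
}
```

Because there is no hidden state, a GPU thread can use its own index as the counter and draw its numbers without coordinating with other threads.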
MAGMA & clBLAS
● Implements many popular linear algebra routines
● Supports
○ Real and complex data types
○ Single and double-precision
OpenCLIPP
● Supports multiple image types
● Similar to Intel IPP
● Primitives
○ Arithmetic and logic
○ LUT
○ Morphology
○ Transform
○ Resize
○ Histogram
○ Many more…
● C and C++ interface
General-Purpose GPU Libraries
● Bolt
● OpenCV
● ArrayFire
Images taken from:
http://wordlesstech.com/2012/10/12/leatherman-oht-multi-tool/
Bolt
● GPU library which resembles C++ STL
○ STL like data structures
○ Iterators
○ Fully interoperable with OpenCL
● Parallel vector operation methods
○ Reductions
○ Sorting
○ Prefix-Sum
● Customizable GPU kernels using functors
● Some functions only supported on AMD GPUs
Bolt - Data Structures
● Built around the device_vector
● Supports the same data types as C++
○ device_vector<float> data(2e6);
● Useful when performing multiple operations on a vector
● Can be passed into STL algorithms
○ Always interoperable
○ Data transfer will be costly
Bolt - Algorithms
● Uses a C++ STL like interface
○ Pass the begin and end iterators
● Accepts functors, which let you run custom operations on OpenCL devices
● Multiple backends
○ OpenCL, C++AMP, and TBB
○ Not all algorithms implemented across all backends
● Works on vector and device_vector
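The iterator-plus-functor shape described above is the same one the C++ standard library uses. The sketch below runs on the host with `std::` algorithms as a stand-in; it makes no Bolt calls, and the helper names (`Saxpy`, `apply_saxpy`, `total`, `prefix_sum`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// User-defined functor: the kind of object an STL-style GPU library
// would turn into a device kernel.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// Transform: pass begin/end iterators plus the functor.
std::vector<float> apply_saxpy(float a, const std::vector<float>& x,
                               const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), Saxpy{a});
    return out;
}

// Reduction over a range.
float total(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

// Inclusive prefix sum over a range.
std::vector<int> prefix_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}
```

Sorting follows the same pattern with `std::sort(v.begin(), v.end())`.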
OpenCV
● Open source computer vision library
● C++ interface with many language wrappers
● Hundreds of CV functions
OpenCV ArrayFire Interop
● Helper Functions
○ https://github.com/arrayfire-community/arrayfire_opencv.git
Mat R; Rodrigues(poses(Rect(0, 0, 1, 3)), R);
af::array af_R = mat_to_array(R);
ArrayFire - Data Structures
● Built around a flexible data structure named "array"
○ Lightweight wrapper around the data on the compute device
○ Manages the data and basic metadata such as size, type and dimensions
● You can transfer data into an array using constructors
● Column major
float hA[6] = {0, 1, 2, 3, 4, 5};
array A(2, 3, hA);
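Column-major order means the elements of one column sit next to each other in memory. A minimal host-side sketch of the index arithmetic follows; the helper names are hypothetical and no ArrayFire call is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element (row, col) in a column-major array with `rows` rows.
inline std::size_t col_major_index(std::size_t row, std::size_t col, std::size_t rows) {
    assert(row < rows);
    return row + col * rows;
}

// Sum of one column: its `rows` elements are contiguous in memory.
inline float column_sum(const std::vector<float>& data, std::size_t rows, std::size_t col) {
    assert((col + 1) * rows <= data.size());
    float total = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) total += data[col_major_index(r, col, rows)];
    return total;
}
```

With hA = {0, 1, 2, 3, 4, 5} and 2 rows, element (0, 1) is hA[2] = 2 and element (1, 2) is hA[5] = 5.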
ArrayFire - Indexing
#include <arrayfire.h>
#include <af/utils.h>
using namespace af;
void af_example()
{
    float f[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    array a(2, 4, f); // 2 rows x 4 cols, filled column by column from f
    array sumSecondCol = sum(a(span, 1)); // reduce-sum over the second column (4 + 8)
    print(sumSecondCol); // 12
}
ArrayFire Example - swap R and B
Using ArrayFire:
array tmp = img(span,span,0); // save the R channel
img(span,span,0) = img(span,span,2); // R channel gets values of B
img(span,span,2) = tmp; // B channel gets values of R
Can also do it this way:
array swapped = join(2, img(span,span,2), // blue
                        img(span,span,1), // green
                        img(span,span,0)); // red
Or simply:
array swapped = img(span,span,seq(2,-1,0));
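For comparison, here is what the channel swap amounts to on a plain host buffer. This sketch assumes a planar rows x cols x 3 image in which each channel is one contiguous block (which is how a column-major 3-D array lays out its third dimension); `swap_red_blue` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Swap channel planes 0 (R) and 2 (B) of a planar rows x cols x 3 image in place.
void swap_red_blue(std::vector<float>& img, std::size_t rows, std::size_t cols) {
    const std::size_t plane = rows * cols;
    assert(img.size() == 3 * plane);
    const std::ptrdiff_t p = static_cast<std::ptrdiff_t>(plane);
    std::swap_ranges(img.begin(), img.begin() + p, img.begin() + 2 * p);
}
```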
Using ArrayFire:
array img = loadimage("image.jpg", false); // load grayscale image from disk to device
array img_T = img.T(); // transpose
ArrayFire Functions
[Figure panels: Original, Grayscale, Box filter blur, Gaussian blur, Image Negative]
Erosion
ArrayFire
// erode an image, 8-neighbor connectivity
array mask8 = constant(1, 3, 3);
array img_out8 = erode(img_in, mask8);
// erode an image, 4-neighbor connectivity
const float h_mask4[] = { 0.0, 1.0, 0.0,
                          1.0, 1.0, 1.0,
                          0.0, 1.0, 0.0 };
array mask4 = array(3, 3, h_mask4);
array img_out4 = erode(img_in, mask4);
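What the two masks do can be seen in a small host-only sketch of grayscale erosion: each output pixel is the minimum over the neighbours selected by the non-zero mask entries. It assumes a row-major image and skips neighbours that fall outside the image; ArrayFire's border handling may differ, and `erode3x3` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Grayscale erosion of a row-major rows x cols image with a 3x3 mask.
// Assumption: neighbours outside the image are ignored.
std::vector<float> erode3x3(const std::vector<float>& img, int rows, int cols,
                            const float (&mask)[9]) {
    assert(rows > 0 && cols > 0);
    assert(img.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    std::vector<float> out(img.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float best = std::numeric_limits<float>::infinity();
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (mask[(dr + 1) * 3 + (dc + 1)] == 0.0f) continue;  // not in the structuring element
                    const int rr = r + dr, cc = c + dc;
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                    best = std::min(best, img[rr * cols + cc]);
                }
            }
            out[r * cols + c] = best;
        }
    }
    return out;
}
```

With the all-ones mask a pixel is lowered by any of its 8 neighbours; with the cross-shaped mask only the 4 edge-adjacent neighbours count, so diagonal dark pixels have no effect.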
Filtering
ArrayFire
array R = convolve(img, ker); // 1, 2 and 3d convolution filter
array R = convolve(fcol, frow, img); // Separable convolution
array R = filter(img, ker); // 2d correlation filter
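The difference between a convolution filter and a correlation filter is only whether the kernel is flipped. A 1-D host-side sketch, assuming an odd-length kernel, same-size output and zero padding at the borders; the function names are hypothetical and do not mirror ArrayFire's signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Correlation: slide the kernel over the signal as-is.
std::vector<float> correlate1d(const std::vector<float>& sig, const std::vector<float>& ker) {
    assert(ker.size() % 2 == 1);
    const int n = static_cast<int>(sig.size());
    const int m = static_cast<int>(ker.size());
    const int h = m / 2;  // kernel centre
    std::vector<float> out(sig.size(), 0.0f);
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < m; ++k) {
            const int j = i + k - h;
            if (j >= 0 && j < n) out[i] += sig[j] * ker[k];  // samples outside count as zero
        }
    }
    return out;
}

// Convolution: the same operation with the kernel reversed.
std::vector<float> convolve1d(const std::vector<float>& sig, std::vector<float> ker) {
    std::reverse(ker.begin(), ker.end());
    return correlate1d(sig, ker);
}
```

Convolving a unit impulse reproduces the kernel, while correlating it yields the kernel reversed; for symmetric kernels such as a Gaussian the two coincide.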
Histograms
ArrayFire
int nbins = 256;
array hist = histogram(img,nbins);
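A histogram with nbins bins can be sketched on the host as follows. The explicit [lo, hi] range, the equal-width bins and the rule that a value equal to hi lands in the last bin are assumptions of this sketch, not a description of ArrayFire's binning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count values into `nbins` equal-width bins over [lo, hi].
// Values outside the range (and NaNs) are skipped.
std::vector<unsigned> histogram(const std::vector<float>& values, int nbins,
                                float lo, float hi) {
    assert(nbins > 0 && hi > lo);
    std::vector<unsigned> counts(static_cast<std::size_t>(nbins), 0u);
    const float scale = static_cast<float>(nbins) / (hi - lo);
    for (float v : values) {
        if (!(v >= lo && v <= hi)) continue;
        int bin = static_cast<int>((v - lo) * scale);
        if (bin >= nbins) bin = nbins - 1;  // v == hi belongs to the last bin
        ++counts[static_cast<std::size_t>(bin)];
    }
    return counts;
}
```

For 8-bit image data, `histogram(pixels, 256, 0.0f, 255.0f)` would be the natural call with this helper.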
Transforms
ArrayFire
array half = resize(0.5, img);
array rot90 = rotate(img, af::Pi/2);
array warped = approx2(img, xLocations, yLocations);
Image smoothing
ArrayFire
array S = bilateral(I, sigma_r, sigma_c);
array M = meanshift(I, sigma_r, sigma_c, iter);
array R = medfilt(img, 3, 3);
// Gaussian blur
array gker = gaussiankernel(ncols, ncols);
array res = convolve(img, gker);
FFT
ArrayFire
array R1 = fft2(I); // 2d fft. check fft, fft3
array R2 = fft2(I, M, N); // fft2 with padding
array R3 = ifft2(fft2(I, M, N) * fft2(K, M, N)); // convolve using fft2
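The last line relies on the convolution theorem: multiplying two spectra element by element and transforming back yields a circular convolution. The 1-D sketch below checks that idea with a naive O(n^2) DFT instead of an FFT library; `dft` and `circular_convolve` are hypothetical helpers. Zero-padding both inputs to at least the sum of their lengths minus one, as the padded fft2 call does in 2-D, turns the circular result into an ordinary linear convolution.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive DFT. sign = -1 gives the forward transform, +1 the unscaled inverse.
cvec dft(const cvec& x, int sign) {
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                sign * two_pi * static_cast<double>(k * t) / static_cast<double>(n);
            acc += x[t] * std::polar(1.0, angle);
        }
        out[k] = acc;
    }
    return out;
}

// Circular convolution: transform, multiply pointwise, transform back, rescale.
std::vector<double> circular_convolve(const std::vector<double>& a,
                                      const std::vector<double>& b) {
    assert(!a.empty() && a.size() == b.size());
    const std::size_t n = a.size();
    const cvec fa = dft(cvec(a.begin(), a.end()), -1);
    const cvec fb = dft(cvec(b.begin(), b.end()), -1);
    cvec prod(n);
    for (std::size_t i = 0; i < n; ++i) prod[i] = fa[i] * fb[i];
    const cvec back = dft(prod, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = back[i].real() / static_cast<double>(n);
    return out;
}
```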
ArrayFire Capabilities
● Hundreds of parallel functions for multi-disciplinary work
○ Image processing
○ Machine learning
○ Graphics
○ Sets
● Support for multiple languages
○ C/C++, Fortran, Java and R
● Linux, Windows, Mac OS X
ArrayFire Capabilities
● OpenGL based graphics
● JIT
○ Combine multiple operations into one kernel
● GFOR - data parallel loop
○ Allows concurrent execution over multiple data sets (for example images)
ArrayFire Functions
● Supports hundreds of parallel functions
○ Building blocks
■ Reductions
■ Scan
■ Set operations
■ Sorting
■ Statistics
■ Basic matrix manipulation
Images taken from:
http://technogems.blogspot.com/2011/06/sorting-included-files-by-importance.html
http://www.cmsoft.com.br/tutorialOpenCL/CLMatrixMultExplanationSubMatrixes.png
ArrayFire Functions
● Hundreds of highly-optimized parallel functions
○ Signal/image processing
■ Convolution
■ FFT
■ Histograms
■ Interpolation
■ Connected components
○ Linear Algebra
■ Matrix multiply
■ Linear system solving
■ Factorization
GFOR: What is it?
• Data-Parallel for loop, e.g.
Serial matrix-vector multiplications (3 kernel launches):
for (i = 0; i < 3; i++)
    C(span,span,i) = A(span,span,i) * B;
Parallel matrix-vector multiplications (1 kernel launch):
gfor (array i, 3)
    C(span,span,i) = A(span,span,i) * B;
Example: Matrix Multiply
• Data-Parallel for loop, e.g.
for (i = 0; i < 3; i++)
    C(span,span,i) = A(span,span,i) * B;
Serial matrix-vector multiplications (3 kernel launches)
[Diagram: iterations i = 1, 2, 3 run one after another; iteration i computes C(,,i) = A(,,i) * B]
Example: Matrix Multiply
gfor (array i, 3)
    C(span,span,i) = A(span,span,i) * B;
Parallel matrix multiplications (1 kernel launch)
[Diagram: simultaneous iterations i = 1:3; C(,,1) = A(,,1) * B, C(,,2) = A(,,2) * B, C(,,3) = A(,,3) * B]
Example: Matrix Multiply
gfor (array i, 3)
    C(span,span,i) = A(span,span,i) * B;
Parallel matrix multiplications (1 kernel launch)
[Diagram: simultaneous iterations i = 1:3; C(,,1:3) = A(,,1:3) * B]
Think of GFOR as compiling 1 stacked kernel with all iterations.
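The property GFOR exploits is that no iteration reads what another iteration writes. The host-only sketch below spells out the batched product over column-major slices, assuming that `*` in the slide code denotes a matrix product, as the captions say. `batched_matmul` is a hypothetical name, and the outer loop here is an ordinary serial loop, not GFOR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C(:,:,s) = A(:,:,s) * B for every slice s, all arrays column-major.
// A and C hold `batch` slices of n x n; B is a single n x n matrix.
void batched_matmul(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t n, std::size_t batch) {
    assert(A.size() == n * n * batch);
    assert(B.size() == n * n);
    assert(C.size() == n * n * batch);
    for (std::size_t s = 0; s < batch; ++s) {  // iterations are independent of each other
        const double* a = A.data() + s * n * n;
        double* c = C.data() + s * n * n;
        for (std::size_t col = 0; col < n; ++col) {
            for (std::size_t row = 0; row < n; ++row) {
                double acc = 0.0;
                for (std::size_t k = 0; k < n; ++k) acc += a[row + k * n] * B[k + col * n];
                c[row + col * n] = acc;
            }
        }
    }
}
```

Since slice s touches only its own part of A and C, the outer loop can be reordered or executed all at once, which is what makes a single stacked kernel possible.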
JIT Code Generation
● Run time kernel generation
● Combines multiple element-wise operations into one kernel
● Reduces kernel launching overhead
● Intermediate data not allocated
● Improves cache performance
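The effect of fusing element-wise operations can be shown on the host, with each loop standing in for one kernel launch. This is only an analogy for what a JIT does at run time; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two passes over memory and a full-size temporary (two "kernels").
std::vector<float> mul_add_unfused(const std::vector<float>& a, const std::vector<float>& b,
                                   const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) tmp[i] = a[i] * b[i];
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = tmp[i] + c[i];
    return out;
}

// Fused: one pass, no intermediate array (one "kernel").
std::vector<float> mul_add_fused(const std::vector<float>& a, const std::vector<float>& b,
                                 const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i] + c[i];
    return out;
}
```

The fused version reads each input once and never writes the intermediate product to memory, which is where the savings in launches, allocations and cache traffic come from.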
Success Stories
Field                    Application                Speedup
Academia                 Power Systems Simulations  35x
Finance                  Option Pricing             52x
Government               Radar Image Formation      45x
Life Sciences            Pathology Advances         > 100x
Manufacturing            Tomography of Vegetation   10x
Media & Computer Vision  Digital Holography         17x
Oil & Gas                Ground Water Simulations   > 20x
Future capabilities
● We are interested in Big Data applications
● Create capabilities for
○ Streaming video
○ Large number of images
○ Machine learning
○ Data analysis
○ Dynamic data
● Faster rendering utilities for Big Data
Comments on Open Source
● https://github.com/arrayfire-community
Q & A
Speaker: Oded Green (oded@arrayfire.com)
Engineers:
Umar Arshad (umar@ArrayFire.com)
Pavan Yalamanchili (pavan@ArrayFire.com)
Sales:
Scott Blakeslee (scott@ArrayFire.com)
Look us up
www.ArrayFire.com
For language wrappers and examples
https://github.com/ArrayFire

Productive OpenCL Programming An Introduction to OpenCL Libraries with ArrayFire COO Oded Green

  • 1. An Introduction to OpenCL Libraries Productive OpenCL Programming
  • 2. ● We make code run faster ○ Started in 2007 by Georgia Tech researchers ○ 1000s of paying customers
  • 3. ● We build an acceleration library ○ for really cool science, engineering, and finance applications ○ for mobile computing
  • 6. Library Types ● Specialized GPU Libs ○ Targeted at a specific set of operators (functionality) ○ Optimized for specific systems ○ C-like interface ○ Raw pointer interface ● General GPU Libs ○ Manage GPU resources using containers ○ Applicable to a large set of applications and domains ○ Portable across multiple architectures ○ Higher level functions ○ C++ interface (supports templates)
  • 7. Specialized GPU Libraries ● Fast Fourier Transforms ○ clFFT ● Random Number Generation ○ Random123 ● Linear Algebra ○ clBLAS ○ MAGMA ● Signal and Image Processing ○ OpenCLIPP
  • 8. Specialized GPU Libraries ● C Interface ○ Use pointers to reference data ● Memory management is programmer responsibility ● Mimic existing libraries ○ clBLAS ≈ BLAS ○ MAGMA ≈ BLAS + LAPACK ○ clFFT ≈ FFTW ● Simplifies GPU integration of specialized scientific libraries ○ Still requires setting up the GPU
  • 9. clFFT ● 1D, 2D and 3D transforms ● CPU and GPU backends ● Supports ○ Real and complex data types ○ Single and double-precision ○ Execution of multiple transformations concurrently
  • 10. Random123 ● Counter-based RNG ● Passed SmallCrush, Crush and BigCrush tests ● Four RNG families ○ Threefry ○ Philox ○ AESNI ○ ARS ● Not suitable for cryptography
  • 11. Magma & clBLAS ● Implements many popular linear algebra routines ● Supports ○ Real and complex data types ○ Single and double-precision
  • 12. OpenCLIPP ● Supports multiple image types ● Similar to Intel IPP ● Primitives ○ Arithmetic and logic ○ LUT ○ Morphology ○ Transform ○ Resize ○ Histogram ○ Many more… ● C and C++ interface
  • 13. General-Purpose GPU Libraries ● Bolt ● OpenCV ● ArrayFire Images taken from: http://wordlesstech.com/2012/10/12/leatherman-oht-multi-tool/
  • 14. Bolt ● GPU library which resembles C++ STL ○ STL like data structures ○ Iterators ○ Fully interoperable with OpenCL ● Parallel vector operation methods ○ Reductions ○ Sorting ○ Prefix-Sum ● Customizable GPU kernels using functors ● Some functions only supported on AMD GPUs
  • 15. Bolt - Data Structures ● Built around the device_vector ● Supports the same data types as C++ ○ device_vector<float> data(2e6); ● Useful when performing multiple operations on a vector ● Can be passed into STL algorithms ○ Interoperability is always available ○ But the data transfer will be costly
  • 16. Bolt - Algorithms ● Uses a C++ STL like interface ○ Pass the begin and end iterators ● Accept functors which allow you to run custom operations on OpenCL devices ● Multiple backends ○ OpenCL, C++AMP, and TBB ○ Not all algorithms implemented across all backends ● Works on vector and device_vector
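The iterator-plus-functor calling style described on slides 14 to 16 is the same one the C++ standard library uses. The sketch below uses only standard-library host algorithms, not Bolt itself; `Saxpy`, `saxpy_sum`, `prefix_sum` and `sorted_desc` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// A user-defined functor: the kind of custom operation that can be handed
// to an algorithm through its last argument.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// Transform with a functor, then reduce: returns the sum of a*x[i] + y[i].
inline float saxpy_sum(float a, const std::vector<float>& x,
                       const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), Saxpy{a});
    return std::accumulate(out.begin(), out.end(), 0.0f);
}

// Inclusive prefix sum over a begin/end iterator pair.
inline std::vector<int> prefix_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Sort a copy in descending order using a comparison functor.
inline std::vector<int> sorted_desc(std::vector<int> v) {
    std::sort(v.begin(), v.end(), std::greater<int>());
    return v;
}
```

A library that resembles the STL lets code like this move to the device mostly by changing the container and the namespace of the algorithm, which is the productivity argument the slides make.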
  • 17. OpenCV ● Open source computer vision library ● C++ interface with many language wrappers ● Hundreds of CV functions
  • 18. OpenCV ArrayFire Interop ● Helper Functions ○ https://github.com/arrayfire-community/arrayfire_opencv.git
      Mat R;
      Rodrigues(poses(Rect(0, 0, 1, 3)), R);
      af::array af_R = mat_to_array(R);
  • 19. ArrayFire - Data Structures ● Built around a flexible data structure named "array" ○ Lightweight wrapper around the data on the compute device ○ Manages the data and basic metadata such as size, type and dimensions ● You can transfer data into an array using constructors ● Column major
      float hA[6] = {0, 1, 2, 3, 4, 5};
      array A(2, 3, hA);
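"Column major" means that consecutive host values fill the first column before moving to the second. The host-side sketch below shows the index arithmetic; `ColMajor` is a hypothetical helper type for illustration and is not part of ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side view of a column-major 2-D buffer.
struct ColMajor {
    std::size_t rows, cols;
    std::vector<float> data;  // element (r, c) lives at data[r + c * rows]

    float at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r + c * rows];
    }

    // Sum of one column: a contiguous run of `rows` elements.
    float column_sum(std::size_t c) const {
        assert(c < cols);
        float s = 0.0f;
        for (std::size_t r = 0; r < rows; ++r) s += data[r + c * rows];
        return s;
    }
};
```

With the 2 x 3 buffer {0, 1, 2, 3, 4, 5} from the slide, the first column holds 0 and 1, the second 2 and 3, and the third 4 and 5.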
  • 20. ArrayFire - Indexing
      #include <arrayfire.h>
      #include <af/utils.h>
      void af_example()
      {
          float f[8] = {1, 2, 4, 8, 16, 32, 64, 128};
          array a(2, 4, f);                      // 2 rows x 4 col array initialized with f values
          array sumSecondCol = sum(a(span, 1));  // reduce-sum over the second column
          print(sumSecondCol);                   // 12
      }
  • 21. ArrayFire Example - swap R and B
      Using ArrayFire:
        array tmp = img(span,span,0);         // save the R channel
        img(span,span,0) = img(span,span,2);  // R channel gets values of B
        img(span,span,2) = tmp;               // B channel gets value of R
      Can also do it this way:
        array swapped = join(2, img(span,span,2),   // blue
                                img(span,span,1),   // green
                                img(span,span,0));  // red
      Or simply:
        array swapped = img(span,span,seq(2,-1,0));
  • 22. ArrayFire Functions
      Using ArrayFire:
        array img = loadimage("image.jpg", false);  // load grayscale image from disk to device
        array img_T = img.T();                      // transpose
  • 28. Erosion (ArrayFire)
      // erode an image, 8-neighbor connectivity
      array mask8 = constant(1, 3, 3);
      array img_out = erode(img_in, mask8);
      // erode an image, 4-neighbor connectivity
      const float h_mask4[] = { 0.0, 1.0, 0.0,
                                1.0, 1.0, 1.0,
                                0.0, 1.0, 0.0 };
      array mask4 = array(3, 3, h_mask4);
      array img_out = erode(img_in, mask4);
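To make the difference between the two masks concrete, here is a minimal host-side sketch of erosion with a 3 x 3 mask. It is an assumption-laden illustration, not ArrayFire's implementation: `erode3x3` is a hypothetical name, the image is stored row-major for brevity, and neighbors that fall outside the image are simply skipped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Each output pixel is the minimum of the input pixels that lie under the
// nonzero entries of the mask, centered on that pixel.
inline std::vector<float> erode3x3(const std::vector<float>& img,
                                   int rows, int cols,
                                   const float (&mask)[9]) {
    assert(rows > 0 && cols > 0);
    assert(img.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<float> out(img.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float m = std::numeric_limits<float>::max();
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (mask[(dr + 1) * 3 + (dc + 1)] == 0.0f) continue;
                    const int rr = r + dr, cc = c + dc;
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                    m = std::min(m, img[rr * cols + cc]);
                }
            }
            out[r * cols + c] = m;
        }
    }
    return out;
}
```

With a single dark pixel in a bright image, the all-ones mask (8-neighbor) darkens the full 3 x 3 block around it, while the cross-shaped mask (4-neighbor) darkens only the pixel and its four edge neighbors.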
  • 30. Filtering (ArrayFire)
      array R = convolve(img, ker);         // 1, 2 and 3d convolution filter
      array R = convolve(fcol, frow, img);  // Separable convolution
      array R = filter(img, ker);           // 2d correlation filter
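The slide lists convolution and correlation as separate filters. In one dimension the difference can be sketched as below; `convolve1d` and `correlate1d` are hypothetical host-side helpers, both return the "full" output of length len(x) + len(h) - 1, and where index 0 is placed in the correlation output is a convention that libraries may choose differently.

```cpp
#include <cstddef>
#include <vector>

// Full 1-D convolution: y[n] = sum over k of x[k] * h[n - k].
inline std::vector<float> convolve1d(const std::vector<float>& x,
                                     const std::vector<float>& h) {
    if (x.empty() || h.empty()) return {};
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < h.size(); ++j)
            y[i + j] += x[i] * h[j];
    return y;
}

// Correlation slides the kernel without flipping it, which gives the same
// values as convolving with the reversed kernel.
inline std::vector<float> correlate1d(const std::vector<float>& x,
                                      const std::vector<float>& h) {
    const std::vector<float> flipped(h.rbegin(), h.rend());
    return convolve1d(x, flipped);
}
```

For a symmetric kernel such as a Gaussian the two operations give the same result; they differ for asymmetric kernels such as the difference kernel {1, 0, -1}.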
  • 31. Histograms (ArrayFire)
      int nbins = 256;
      array hist = histogram(img, nbins);
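What a histogram call computes can be sketched on the host as follows. `histogram_host` is a hypothetical helper; the explicit value range [lo, hi) is an assumption added for the sketch, since the slide does not say how the library chooses its bin edges.

```cpp
#include <cassert>
#include <vector>

// Count values into `nbins` equal-width bins covering [lo, hi).
// Values outside the range are ignored.
inline std::vector<unsigned> histogram_host(const std::vector<float>& values,
                                            int nbins, float lo, float hi) {
    assert(nbins > 0 && hi > lo);
    std::vector<unsigned> bins(nbins, 0u);
    const float width = (hi - lo) / nbins;
    for (float v : values) {
        if (v < lo || v >= hi) continue;
        int b = static_cast<int>((v - lo) / width);
        if (b >= nbins) b = nbins - 1;  // guard against rounding at the top edge
        ++bins[b];
    }
    return bins;
}
```

For an 8-bit grayscale image, 256 bins over [0, 256) give one bin per intensity level.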
  • 32. Transforms (ArrayFire)
      array half = resize(0.5, img);
      array rot90 = rotate(img, af::Pi/2);
      array warped = approx2(img, xLocations, yLocations);
  • 33. Image smoothing (ArrayFire)
      array S = bilateral(I, sigma_r, sigma_c);
      array M = meanshift(I, sigma_r, sigma_c, iter);
      array R = medfilt(img, 3, 3);
      // Gaussian blur
      array gker = gaussiankernel(ncols, ncols);
      array res = convolve(img, gker);
  • 34. FFT (ArrayFire)
      array R1 = fft2(I);                                // 2d fft. check fft, fft3
      array R2 = fft2(I, M, N);                          // fft2 with padding
      array R3 = ifft2(fft2(I, M, N) * fft2(K, M, N));   // convolve using fft2
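The last line of the slide relies on the convolution theorem: pad both inputs to a common size, multiply their spectra, and transform back. The 1-D host-side sketch below illustrates why the padding matters. It is not ArrayFire code: `dft` and `convolve_via_dft` are hypothetical names, and a naive O(n^2) DFT is used in place of a real FFT to keep the example short.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive DFT. sign = -1 gives the forward transform, sign = +1 the
// unnormalized inverse.
inline std::vector<cplx> dft(const std::vector<cplx>& x, int sign) {
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * two_pi *
                static_cast<double>((k * t) % n) / static_cast<double>(n);
            acc += x[t] * cplx(std::cos(angle), std::sin(angle));
        }
        X[k] = acc;
    }
    return X;
}

// Linear convolution through the frequency domain. Both inputs are
// zero-padded to len(a) + len(b) - 1 so the circular convolution that the
// DFT computes does not wrap around.
inline std::vector<double> convolve_via_dft(const std::vector<double>& a,
                                            const std::vector<double>& b) {
    if (a.empty() || b.empty()) return {};
    const std::size_t n = a.size() + b.size() - 1;
    std::vector<cplx> pa(n), pb(n);
    for (std::size_t i = 0; i < a.size(); ++i) pa[i] = a[i];
    for (std::size_t i = 0; i < b.size(); ++i) pb[i] = b[i];
    std::vector<cplx> A = dft(pa, -1);
    const std::vector<cplx> B = dft(pb, -1);
    for (std::size_t i = 0; i < n; ++i) A[i] *= B[i];
    const std::vector<cplx> c = dft(A, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = c[i].real() / static_cast<double>(n);
    return out;
}
```

The M and N arguments on the slide play the role of the padded length n here, applied to each of the two image dimensions.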
  • 35. ArrayFire Capabilities ● Hundreds of parallel functions for multi-disciplinary work ○ Image processing ○ Machine learning ○ Graphics ○ Sets ● Support for multiple languages ○ C/C++, Fortran, Java and R ● Linux, Windows, Mac OS X
  • 36. ArrayFire Capabilities ● OpenGL based graphics ● JIT ○ Combine multiple operations into one kernel ● GFOR - data parallel loop ○ Allows concurrent execution over multiple data sets (for example images)
  • 37. ArrayFire Functions ● Supports hundreds of parallel functions ○ Building blocks ■ Reductions ■ Scan ■ Set operations ■ Sorting ■ Statistics ■ Basic matrix manipulation Images taken from: http://technogems.blogspot.com/2011/06/sorting-included-files-by-importance.html http://www.cmsoft.com.br/tutorialOpenCL/CLMatrixMultExplanationSubMatrixes.png
  • 38. ArrayFire Functions ● Hundreds of highly-optimized parallel functions ○ Signal/image processing ■ Convolution ■ FFT ■ Histograms ■ Interpolation ■ Connected components ○ Linear Algebra ■ Matrix multiply ■ Linear system solving ■ Factorization
  • 39. GFOR: What is it? • Data-Parallel for loop, e.g.
      Serial matrix-vector multiplications (3 kernel launches):
        for (i = 0; i < 3; i++)
            C(span,span,i) = A(span,span,i) * B;
      Parallel matrix-vector multiplications (1 kernel launch):
        gfor (array i, 3)
            C(span,span,i) = A(span,span,i) * B;
  • 40. Example: Matrix Multiply • Data-Parallel for loop, e.g.
      for (i = 0; i < 3; i++)
          C(span,span,i) = A(span,span,i) * B;
      Serial matrix-vector multiplications (3 kernel launches)
      [Diagram] iteration i = 1: C(,,1) = A(,,1) * B
  • 41. Example: Matrix Multiply • Data-Parallel for loop, e.g.
      for (i = 0; i < 3; i++)
          C(span,span,i) = A(span,span,i) * B;
      Serial matrix-vector multiplications (3 kernel launches)
      [Diagram] iteration i = 1: C(,,1) = A(,,1) * B
      [Diagram] iteration i = 2: C(,,2) = A(,,2) * B
  • 42. Example: Matrix Multiply • Data-Parallel for loop, e.g.
      for (i = 0; i < 3; i++)
          C(span,span,i) = A(span,span,i) * B;
      Serial matrix-vector multiplications (3 kernel launches)
      [Diagram] iteration i = 1: C(,,1) = A(,,1) * B
      [Diagram] iteration i = 2: C(,,2) = A(,,2) * B
      [Diagram] iteration i = 3: C(,,3) = A(,,3) * B
  • 43. Example: Matrix Multiply
      gfor (array i, 3)
          C(span,span,i) = A(span,span,i) * B;
      Parallel matrix multiplications (1 kernel launch)
      [Diagram] simultaneous iterations i = 1:3
        C(,,1) = A(,,1) * B
        C(,,2) = A(,,2) * B
        C(,,3) = A(,,3) * B
  • 44. Example: Matrix Multiply
      gfor (array i, 3)
          C(span,span,i) = A(span,span,i) * B;
      Parallel matrix multiplications (1 kernel launch)
      [Diagram] simultaneous iterations i = 1:3: C(,,1:3) = A(,,1:3) * B
      Think of GFOR as compiling 1 stacked kernel with all iterations.
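The property GFOR depends on is that every iteration touches only its own slice. A minimal host-side sketch of the computation in the slides, written as an ordinary loop, is given below. It is not ArrayFire code: `matmul_nn` and `batched_matmul` are hypothetical helpers, the matrices are assumed square and column-major, and the `*` in the slides is taken to mean a matrix product as the slide captions state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for column-major n x n matrices.
inline void matmul_nn(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < n; ++row) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[row + k * n] * B[k + col * n];
            C[row + col * n] = acc;
        }
    }
}

// C(:,:,i) = A(:,:,i) * B for every slice i. No iteration reads or writes
// another iteration's slice, so the iterations could run in any order, or
// all at once in a single launch.
inline std::vector<float> batched_matmul(const std::vector<float>& A,
                                         const std::vector<float>& B,
                                         std::size_t n, std::size_t batch) {
    assert(n > 0 && batch > 0);
    assert(A.size() == n * n * batch && B.size() == n * n);
    std::vector<float> C(A.size());
    for (std::size_t i = 0; i < batch; ++i)
        matmul_nn(&A[i * n * n], B.data(), &C[i * n * n], n);
    return C;
}
```

A loop whose iterations depended on each other's results (for example, a running sum across slices) would not have this property and could not be collapsed in the same way.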
  • 45. JIT Code Generation ● Run time kernel generation ● Combines multiple element wise operations into one kernel ● Reduces kernel launching overhead ● Intermediate data not allocated ● Improves cache performance
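The benefit of combining element-wise operations can be sketched on the host by comparing a version with one pass per operation against a single fused pass. This is only an analogy for what a JIT would generate, under the assumption that each separate pass corresponds to one kernel launch; `scaled_unfused` and `scaled_fused` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: out = 0.5 * (a * b + a) as three passes with two temporaries,
// analogous to three kernel launches with intermediate buffers.
inline std::vector<float> scaled_unfused(const std::vector<float>& a,
                                         const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> t1(n), t2(n), out(n);
    for (std::size_t i = 0; i < n; ++i) t1[i] = a[i] * b[i];
    for (std::size_t i = 0; i < n; ++i) t2[i] = t1[i] + a[i];
    for (std::size_t i = 0; i < n; ++i) out[i] = 0.5f * t2[i];
    return out;
}

// Fused: the same expression in one pass. No intermediate buffers are
// allocated and each input element is read once.
inline std::vector<float> scaled_fused(const std::vector<float>& a,
                                       const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = 0.5f * (a[i] * b[i] + a[i]);
    return out;
}
```

Both functions compute the same values; the fused form saves the temporaries and the extra trips through memory, which is where the cache and launch-overhead gains listed on the slide come from.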
  • 46. Success Stories
      Field                     Application                 Speedup
      Academia                  Power Systems Simulations   35x
      Finance                   Option Pricing              52x
      Government                Radar Image Formation       45x
      Life Sciences             Pathology Advances          > 100x
      Manufacturing             Tomography of Vegetation    10x
      Media & Computer Vision   Digital Holography          17x
      Oil & Gas                 Ground Water Simulations    > 20x
  • 47. Future capabilities ● We are interested in Big Data applications ● Create capabilities for ○ Streaming video ○ Large number of images ○ Machine learning ○ Data analysis ○ Dynamic data ● Faster rendering utilities for Big Data
  • 48. Comments on Open Source ● https://github.com/arrayfire-community
  • 49. Q & A ● Speaker: Oded Green (oded@arrayfire.com) ● Engineers: Umar Arshad (umar@ArrayFire.com), Pavan Yalamanchili (pavan@ArrayFire.com) ● Sales: Scott Blakeslee (scott@ArrayFire.com)
  • 50. Look us up: www.ArrayFire.com ● For language wrappers and examples: https://github.com/ArrayFire