OpenCL Overview

                                  Ofer Rosenberg

Contributors: Tim Mattson (Intel), Aaftab Munshi (Apple)
Agenda

• OpenCL intro
  – GPGPU in a nutshell
  – OpenCL roots
  – OpenCL status
• OpenCL 1.0 deep dive
  –   Programming Model
  –   Framework API
  –   Language
  –   Embedded Profile & Extensions
• Summary




GPGPU in a nutshell

                        Disclaimer:
       1. GPGPU is a lot of things to a lot of people.
           This is my view & vision on GPGPU…

     2. GPGPU is a huge subject, and this is a Nutshell.
      I recommend Kayvon’s lecture at http://s08.idav.ucdavis.edu/




GPGPU in a nutshell

• Below is a very artificial example to explain GPGPU.
• A simple program, with a "for" loop which takes two buffers and adds
  them into a third buffer
  (did we mention the word artificial yet?)

 #include <stdio.h>
 ...
 void main (int argc, char* argv[])
 {
     ...
     for (i=0 ; i < iBuffSize; i++)
          C[i] = A[i] + B[i];
     ...
 }
GPGPU in a nutshell

• The example after an expert visit:
   – Dual-threaded to support a dual-core CPU
   – SSE2 intrinsics perform a vectorized add

 #include <stdio.h>
 ...
 void main (int argc, char* argv[])
 {
   ...
   /* split the work between two threads (schematic: _beginthread
      actually takes a single void* argument block) */
   _beginthread(vec_add, 0, A, B, C, iBuffSize/2);
   _beginthread(vec_add, 0, &A[iBuffSize/2], &B[iBuffSize/2],
                &C[iBuffSize/2], iBuffSize/2);
   ...
 }

 void vec_add (const float *A, const float *B, float *C,
               int iBuffSize)
 {
     __m128 vA, vB, vC;
     int i;
     for (i=0 ; i < iBuffSize/4; i++)
     {
         vA = _mm_load_ps(&A[i*4]);  /* load 4 floats */
         vB = _mm_load_ps(&B[i*4]);
         vC = _mm_add_ps (vA, vB);   /* add 4 floats at once */
         _mm_store_ps (&C[i*4], vC);
     }
     _endthread();
 }
Traditional GPGPU…
• Write in graphics language and use the GPU
• Highly effective, but:
   – The developer needs to learn another (not intuitive) language
   – The developer was limited by the graphics language




GPGPU reloaded

• CUDA was a disruptive technology
   – Write C on the GPU
   – Extend to non-traditional usages
   – Provide synchronization mechanisms
• OpenCL deepens and extends the revolution
• GPGPU is now used in games to enhance the standard GFX pipe
   – Physics
   – Advanced Rendering

C version:

 #include <stdio.h>
 ...
 void main (int argc, char* argv[])
 {
     ...
     for (i=0 ; i < iBuffSize; i++)
          C[i] = A[i] + B[i];
     ...
 }

OpenCL version:

 __kernel void vec_add (__global const float4 *a,
                        __global const float4 *b,
                        __global float4 *c)
 {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
 }
GPGPU also in Games…

[Figure: before/after screenshots – left: a non-interactive point light, with jagged edge artifacting; right: dynamic light affecting character & environment, with clean edge details]
A new type of programming…

“The way the processor industry is going is to add more and more cores, but
  nobody knows how to program those things. I mean, two, yeah; four, not
  really; eight, forget it.”
         Steve Jobs, NY Times interview, June 10 2008

         What About GPU’s ?
                  NVIDIA G80: 16 Cores, 8 HW threads per core
                  Larrabee: XX Cores, Y HW threads per core


“Basically it lets you use graphics processors to do computation,” he said. “It’s
  way beyond what Nvidia or anyone else has, and it’s really simple.”
         Steve Jobs on OpenCL, NY Times interview, June 10 2008

          http://bits.blogs.nytimes.com/2008/06/10/apple-in-parallel-turning-the-pc-world-upside-down/




OpenCL in a nutshell
• OpenCL is:
    – An open standard managed by the Khronos Group (cross-IHV, cross-OS)
          – Influenced & guided by Apple
          – Spec 1.0 approved Dec '08
    – A system for executing short "enhanced C" routines (kernels) across devices
          – Built around heterogeneous platforms – Host & Devices
          – Devices: CPU, GPU, Accelerator (FPGA)
    – Skewed towards GPU HW
          – Samplers, vector types, etc.
    – Offers hybrid execution capability


• OpenCL is not:
    – OpenGL, or any other 3D graphics
      language
    – Meant to replace C/C++ (don’t write the
      entire application in it…)

  Khronos WG Key contributors

  Apple, NVidia, AMD, Intel, IBM, RapidMind, Electronic Arts
  (EA), 3DLABS, Activision Blizzard, ARM, Barco, Broadcom,
  Codeplay, Ericsson, Freescale, HI, Imagination Technologies,
  Kestrel Institute, Motorola, Movidia, Nokia, QNX, Samsung,
  Seaweed, Takumi, TI and Umeå University.



The Roots of OpenCL
• Apple has history in GPGPU…
    – Developed a framework called "Core Image" (& Core Video)
    – Based on OpenGL for GPU operation and optimized SSE code for CPU
• Feb 15th 2007 – NVIDIA introduces CUDA
    – Stream programming language – C with extensions for GPU
    – Supported by any GeForce 8 GPU
    – Works on XP & Vista (CUDA 2.0)
    – Amazing adoption rate
        – 40 university courses worldwide
        – 100+ applications/articles
• Apple & NVIDIA cooperate to create OpenCL – Open Compute Language
OpenCL Status

• Apple Submitted the OpenCL 1.0 specification draft to Khronos (owner of OpenGL)
• June 16th 2008 - Khronos established the “Compute Working Group”
     – Members: AMD, Apple, Ardites, ARM, Blizzard, Broadcom, Codeplay, EA, Ericsson, Freescale, Hi Corp., IBM, Imagination
       Technologies, Intel, Kestrel Institute, Movidia, Nokia, Nvidia, Qualcomm, Rapid Mind, Samsung, Takumi and TI.
• Dec. 1st 2008 – OpenCL 1.0 ratification
• Apple is expected to release “Snow Leopard” Mac OS (containing OpenCL 1.0) by end of 2009
• Apple has already begun porting code to OpenCL




Agenda

• OpenCL intro
    – GPGPU in a nutshell
    – OpenCL roots
    – OpenCL status
• OpenCL 1.0 deep dive
    –   Programming Model
    –   Framework API
    –   Language
    –   Embedded Profile & Extensions
• Summary




OpenCL from 10,000 feet…

• The standard defines two major elements:
    – The Framework/Software Stack (spec chapters 2-5)
    – The Language (spec chapter 6)

 __kernel void dot_product (__global const float4 *a,
                            __global const float4 *b,
                            __global float *c)
 {
     int tid = get_global_id(0);
     c[tid] = dot(a[tid], b[tid]);
 }

[Figure: kernels written in the Language (chapter 6) run through the OpenCL Framework (chapters 2-5), which sits below the Application]
OpenCL Platform Model

• The basic platform is composed of a Host and one or more Devices
• Each device is made of a few compute units (well, cores…)
• Each compute unit is made of a few processing elements (virtual scalar processors)

Under OpenCL the CPU is also a compute device
Compute Device Memory Model
• Compute Device – CPU or GPU
• Compute Unit = Core
• Compute Kernel
    – A function written in OpenCL C
    – Mapped to Work Item(s)
• Work-item
    – A single copy of the compute kernel,
      running on one data element
    – In Data Parallel mode, kernel execution
      contains multiple work-items
    – In Task Parallel mode, kernel execution
      contains a single work-item
• Four Memory Types:
    –   Global : default for images/buffers
    –   Constant : global const variables
    –   Local : shared between work-items
    –   Private : kernel internal variables




Execution Model

• Host defines a command queue and associates it with a context (devices, kernels, memory,
  etc).
• Host enqueues commands to the command queue
• Kernel execution commands launch work-items: i.e. a kernel for each point in an abstract
  Index Space
• Work items execute together as a work-group.


[Figure: a 2D index space of size (Gx, Gy), divided into work-groups of size (Sx, Sy). Work-group (wx, wy) contains work-items with local IDs (sx, sy) running from (0, 0) to (Sx-1, Sy-1); each work-item's global ID is (wx*Sx + sx, wy*Sy + sy)]
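As a minimal sketch of how a kernel observes this index space (the kernel name and output buffer are illustrative; the get_* calls are the standard OpenCL C work-item functions):

 __kernel void show_ids (__global int *out)
 {
     size_t gx = get_global_id(0);    // global ID in dimension 0
     size_t wx = get_group_id(0);     // work-group ID
     size_t sx = get_local_id(0);     // local ID inside the group
     size_t Sx = get_local_size(0);   // work-group size
     // The relation from the figure: gx == wx*Sx + sx
     out[gx] = (int)(wx * Sx + sx);
 }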
Programming Model

• Data Parallel, SPMD
  – Work-items in a work-group run the same program
  – Update data structures in parallel, using the work-item ID to select data
    and guide execution (see the sketch after this list)
• Task Parallel
  – One work-item per work-group … for coarse-grained task-level
    parallelism.
  – Native function interface: a trap-door to run arbitrary code from an
    OpenCL command-queue.
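A small sketch of the data-parallel pattern (kernel and argument names are illustrative):

 // Every work-item runs the same program; the ID both selects the
 // data and guides execution (here, a bounds guard).
 __kernel void scale (__global float *data, float factor, int n)
 {
     int gid = get_global_id(0);
     if (gid < n)
         data[gid] *= factor;
 }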




Compilation Model

• OpenCL uses a dynamic (runtime) compilation model (like DirectX and OpenGL)
• Static compilation:
   – The code is compiled from source to machine code at a specific point in the
     past (when the developer compiled it using the IDE)
• Dynamic compilation:
   – Also known as runtime compilation
   – Step 1: the code is compiled to an Intermediate Representation (IR), which is usually the
     assembly of a virtual machine. This step is known as offline compilation, and it's done by
     the front-end compiler
   – Step 2: the IR is compiled to machine code for execution. This step is much shorter. It
     is known as online compilation, and it's done by the back-end compiler
• In dynamic compilation, step 1 is usually done only once, and the IR is stored. The
  app loads the IR and performs step 2 during the app's runtime (hence the term…)
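A minimal sketch of this two-step flow in host code, assuming a context with a single device (error handling omitted; the caching policy is illustrative, the calls are standard OpenCL 1.0):

 // First run: compile from source (front-end + back-end)
 cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
 clBuildProgram(prog, 0, NULL, NULL, NULL, NULL);

 // Retrieve the built binary so later runs can skip step 1
 size_t bin_size;
 clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                  sizeof(bin_size), &bin_size, NULL);
 unsigned char *bin = malloc(bin_size);
 clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
 // ... store 'bin'; next time use clCreateProgramWithBinary() instead ...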




OpenCL Framework overview

• The Application (written in C++, C#, Java, …) contains kernels to accelerate,
  written in OpenCL C:

 __kernel void dot_product (__global const float4 *a,
                            __global const float4 *b,
                            __global float *c)
 {
     int tid = get_global_id(0);
     c[tid] = dot(a[tid], b[tid]);
 }

• OpenCL Framework (Platform API + Runtime API): allows applications to use a host
  and one or more OpenCL devices as a single heterogeneous parallel computer system
• OpenCL Runtime: allows the host program to manipulate contexts once they have
  been created
• Front-end compiler: compiles from source into a common binary intermediate (IR)
  that contains the OpenCL kernels
• Back-end compiler: compiles from the common intermediate binary into a
  device-specific binary, with device-specific optimizations

Note: the CPU is both the Host and a compute device
Agenda

• OpenCL intro
  – GPGPU in a nutshell
  – OpenCL roots
  – OpenCL status
• OpenCL 1.0 deep dive
  –   Programming Model
  –   Framework API
  –   Language
  –   Embedded Profile & Extensions
• Summary




The Platform Layer

• Query the platform layer
    – clGetPlatformInfo
• Query devices (by type)
    – clGetDeviceIDs
• For each device, query the device configuration
    – clGetDeviceInfo
• Create contexts using the devices found by the "get" functions
• The context is the central element used by the runtime layer to manage:
    – Command queues
    – Memory objects
    – Programs
    – Kernels

clGetDeviceIDs (cl_device_type device_type, …)

 cl_device_type               Description
 CL_DEVICE_TYPE_CPU           An OpenCL device that is the host processor. The host
                              processor runs the OpenCL implementation and is a
                              single- or multi-core CPU.
 CL_DEVICE_TYPE_GPU           An OpenCL device that is a GPU. By this we mean that the
                              device can also be used to accelerate a 3D API such as
                              OpenGL or DirectX.
 CL_DEVICE_TYPE_ACCELERATOR   Dedicated OpenCL accelerators (for example the IBM CELL
                              Blade). These devices communicate with the host processor
                              using a peripheral interconnect such as PCIe.
 CL_DEVICE_TYPE_DEFAULT       The default OpenCL device in the system.
 CL_DEVICE_TYPE_ALL           All OpenCL devices available in the system.

clGetDeviceInfo (cl_device_id device, cl_device_info param_name, …)

 cl_device_info                      Description
 CL_DEVICE_TYPE                      The OpenCL device type: CL_DEVICE_TYPE_CPU,
                                     CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR,
                                     CL_DEVICE_TYPE_DEFAULT, or a combination of the above.
 CL_DEVICE_MAX_COMPUTE_UNITS         The number of parallel compute cores on the OpenCL
                                     device. The minimum value is one.
 CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS  Maximum dimensions that specify the global and local
                                     work-item IDs used by the data-parallel execution model
                                     (refer to clEnqueueNDRangeKernel). The minimum value is 3.
 CL_DEVICE_MAX_WORK_GROUP_SIZE       Maximum number of work-items in a work-group executing a
                                     kernel using the data-parallel execution model (refer to
                                     clEnqueueNDRangeKernel).
 CL_DEVICE_MAX_CLOCK_FREQUENCY       Maximum configured clock frequency of the device in MHz.
 CL_DEVICE_ADDRESS_BITS              Device address space size, as an unsigned integer value
                                     in bits. Currently supported values are 32 or 64.
 CL_DEVICE_MAX_MEM_ALLOC_SIZE        Max size of a memory object allocation in bytes. The
                                     minimum value is max(1/4 of CL_DEVICE_GLOBAL_MEM_SIZE,
                                     128*1024*1024).

And many more: 40+ parameters

clCreateContext: compute_device[0] … compute_device[3] → Context
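A minimal sketch of this flow, following the signatures shown above (single device, error handling omitted):

 cl_device_id dev;
 cl_uint num, max_cu;
 cl_int err;

 // Query devices by type, then query one of the 40+ device parameters
 clGetDeviceIDs(CL_DEVICE_TYPE_GPU, 1, &dev, &num);
 clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                 sizeof(max_cu), &max_cu, NULL);

 // Create a context using the device found by the "get" functions
 cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);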
OpenCL Runtime

Everything in the OpenCL Runtime happens within a context.

• A context holds the OpenCL Programs, the Kernels compiled from them, Memory
  Objects (Images and Buffers), and Command Queues (in-order or out-of-order,
  per device)
• Each kernel in a compiled program gets a kernel handler: a function plus an
  argument list (arg[0], arg[1], …) that can be enqueued to a command queue on
  one of the devices
• The flow: compile code → create data & arguments → send to execution

 #define g_c __global const

 __kernel void dot_prod (g_c float4 *a,
                         g_c float4 *b,
                         __global float *c)
 {
     int tid = get_global_id(0);
     c[tid] = dot(a[tid], b[tid]);
 }

 __kernel void buff_add (g_c float4 *a,
                         g_c float4 *b,
                         __global float4 *c)
 {
     int tid = get_global_id(0);
     c[tid] = a[tid] + b[tid];
 }
 ...
OpenCL “boot” process
Platform Layer
                    Query Platform

                     Query Devices

                    Create Context

Runtime
                 Create Command Queue

                 Create Memory Object

 Compiler
                    Create Program

                     Build Program

                     Create Kernel

                    Set Kernel Args

                    Enqueue Kernel

OpenCL C Programming Language in a Nutshell

• Derived from ISO C99
• A few restrictions:
     – Recursion
     – Function pointers
     – Functions in C99 standard headers
• New data types
     – New scalar types
     – Vector types
     – Image types
• Address space qualifiers
• Synchronization objects
     – Barrier
• Built-in functions
• IEEE 754 compliant, with a few exceptions

 __global float4 *color;       // An array of float4
 typedef struct {
      float a[3];
      int b[2];
 } foo_t;
 __global image2d_t texture;   // A 2D texture image

 __kernel void stam_dugma(
       __global float *output,
       __global float *input,
       __local float *tile)
 {
      // private variables
      int in_x, in_y;
      const unsigned int lid = get_local_id(0);

      // declares a pointer p in the __private address space
      // that points to an int object in the __global address space
      __global int *p;
 }
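A short sketch of the new vector types (float4 literals, swizzles, and element-wise arithmetic are core OpenCL C; the variable names are illustrative):

 float4 pos = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
 float2 xy  = pos.xy;          // swizzle: select two components
 pos.w      = 0.0f;            // single component access
 float4 two = pos + pos;       // element-wise vector arithmetic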




Agenda

• OpenCL intro
  – GPGPU in a nutshell
  – OpenCL roots
  – OpenCL status
• OpenCL 1.0 deep dive
  –   Programming Model
  –   Framework API
  –   Language
  –   Embedded Profile & Extensions
• Summary




OpenCL Extensions

• As in OpenGL, OpenCL supports specification extensions
• An extension is an optional feature which might be supported by a device,
  but is not part of the "Core features" (Khronos term)
   – The application is required to query the device, using the
     CL_DEVICE_EXTENSIONS parameter, as sketched below
• Two types of extensions:
   – Extensions approved by the Khronos OpenCL working group
      – Use "KHR" in function/enum names, etc.
      – Might be promoted to required core features in future versions of OpenCL
   – Vendor-specific extensions
• The specification already provides some KHR extensions
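A minimal sketch of that query (the fixed-size buffer is illustrative; cl_khr_fp64 is the KHR name of the double-precision extension):

 char ext[2048];
 clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);
 if (strstr(ext, "cl_khr_fp64")) {
     /* double precision is available on this device */
 }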




OpenCL 1.0 KHR Extensions

• Double-precision floating point
   – Supports double as a data type and extends the built-in functions to support it
• Selecting the rounding mode
   – Adds to the mandatory "round to nearest": "round to nearest even", "round to
     zero", "round to positive infinity", "round to negative infinity"
• Atomic functions (for 32-bit integers, for local memory, for 64-bit)
• Writing to 3D image memory objects
   – The OpenCL core mandates only reads
• Byte-addressable stores
   – In the OpenCL core, writes through pointers are limited to 32-bit granularity
• Half floating point
   – Adds a 16-bit floating point type
OpenCL 1.0 Embedded Profile

• A "relaxed" version for embedded devices (as in OpenGL ES)
• No 64-bit integers
• No 3D images
• Reduced requirements on samplers
   – No CL_FILTER_LINEAR for float/half
   – Fewer addressing modes
• Not IEEE 754 compliant in some functions
   – Example: minimum accuracy for atan() is >= 5 ULP
• Reduced set of minimal device requirements
   – Image height/width: 2048 instead of 8192
   – Number of samplers: 8 instead of 16
   – Local memory size: 1K instead of 16K
   – More…
Agenda

• OpenCL intro
  – GPGPU in a nutshell
  – OpenCL roots
  – OpenCL status
• OpenCL 1.0 deep dive
  –   Programming Model
  –   Framework API
  –   Language
  –   Embedded Profile & Extensions
• Summary




OpenCL Unique Features

As a summary, here are some unique features of OpenCL:
• An open standard for cross-OS, cross-platform, heterogeneous processing
   – Khronos owned
• Creates a unified, flat system model where the GPU, CPU and other
  devices are treated (almost) the same
• Includes data & task parallelism
   – Extends GPGPU beyond the traditional usages
• Supports native functions (C/C++ interop)
• Derived from ISO C99, with additional types, functions, etc. (and some
  restrictions)
• IEEE 754 compliant (with a few exceptions)
Backups




Building OpenCL Code

1. Creating OpenCL programs
   – From source: receives an array of strings → clCreateProgramWithSource()
   – From binaries: receives an array of binaries → clCreateProgramWithBinary()
       – Intermediate Representation, or
       – Device-specific executable
2. Building the programs → clBuildProgram()
   – The developer can define a subgroup of devices to build on
3. Creating kernel objects → clCreateKernel()
   – Single kernel: according to the kernel name
   – All kernels in a program

OpenCL supports a dynamic compilation scheme – the application uses "create from
source" the first time, and then uses "create from binaries" on subsequent runs.

 cl_program clCreateProgramWithSource (cl_context context,
                                       cl_uint count,
                                       const char **strings,
                                       const size_t *lengths,
                                       cl_int *errcode_ret)

 cl_program clCreateProgramWithBinary (cl_context context,
                                       cl_uint num_devices,
                                       const cl_device_id *device_list,
                                       const size_t *lengths,
                                       const void **binaries,
                                       cl_int *binary_status,
                                       cl_int *errcode_ret)

 cl_int clBuildProgram (cl_program program,
                        cl_uint num_devices,
                        const cl_device_id *device_list,
                        const char *options,
                        void (*pfn_notify)(cl_program, void *user_data),
                        void *user_data)

 cl_kernel clCreateKernel (cl_program program,
                           const char *kernel_name,
                           cl_int *errcode_ret)
Some order here, please…

• OpenCL defines a command queue, which is created on a single device
   – within the scope of a context, of course…
• Commands are enqueued to a specific queue
   – Kernel execution
   – Memory operations
• Events
   – Each command can be created with an event associated to it
   – Each command's execution can be made dependent on a list of pre-created events
• Two types of queues
   – In-order queue: commands are executed in the order of issuing
   – Out-of-order queue: command execution depends only on the completion of its
     event list
• A few queues can be created on the same device
• Commands can depend on events created on other queues/contexts

[Figure: two contexts; Context 1 holds queues on devices 1 and 2 (a mix of
in-order and out-of-order queues), Context 2 holds queues on devices 1 and 3.
In the example:
   – C3 from Q1,2 depends on C1 & C2 from Q1,2
   – C1 from Q1,4 depends on C2 from Q1,2
   – In Q1,4, C3 depends on C2]
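A sketch of an event dependency across queues (queue and kernel names are illustrative; the event parameters are the standard OpenCL 1.0 enqueue arguments):

 cl_event e1, e2;

 // Enqueue a producer on one queue, and make a consumer on another
 // queue wait for it via the event wait list
 clEnqueueNDRangeKernel(q1, k_producer, 1, NULL, gsz, NULL, 0, NULL, &e1);
 clEnqueueNDRangeKernel(q2, k_consumer, 1, NULL, gsz, NULL, 1, &e1, &e2);

 clWaitForEvents(1, &e2);   // host-side sync on completion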
Memory Objects

• OpenCL defines Memory Objects (Buffers/Images)
   – Reside in Global Memory
   – Object is defined in the scope of a context
   – Memory Objects are the only way to pass a large amount of data between the Host & the devices.
• Two mechanisms to sync Memory Objects:
• Transactions - Read/Write
   – Read - take a “snapshot” of the buffer/image to Host memory
   – Write – overwrites the buffer/image with data from the Host
   – Can be blocking or non-blocking (app needs to sync on event)
• Mapping
   – Similar to the DX lock/map commands
   – Passes ownership of the buffer/image to the host, and back to the device
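A minimal sketch of the two mechanisms, assuming an existing queue q and a buffer buf of size bytes (illustrative names; the calls are OpenCL 1.0):

 // Transaction: a blocking read takes a "snapshot" into host memory
 clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, host_ptr, 0, NULL, NULL);

 // Mapping: the host takes ownership of the region, then returns it
 cl_int err;
 void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                              0, bytes, 0, NULL, NULL, &err);
 /* ... the host reads/writes through p ... */
 clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);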




Executing Kernels

• A kernel is executed by enqueueing it to a specific command queue
• The app must set the kernel arguments before enqueueing
   – Setting the arguments is done one by one
   – The kernel's argument list is preserved after enqueueing
   – This enables changing only the required arguments before enqueueing again
• There are two separate enqueueing APIs – data parallel & task parallel
• In data-parallel enqueueing, the app specifies the global & local work sizes
   – Global: the overall work-items to be executed, described by an N-dimensional
     index space
   – Local: the breakdown of the global size to fit the specific device (can be
     left NULL)

 cl_int clSetKernelArg (cl_kernel kernel,
                        cl_uint arg_index,
                        size_t arg_size,
                        const void *arg_value)

 cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue,
                                cl_kernel kernel,
                                cl_uint work_dim,
                                const size_t *global_work_offset,
                                const size_t *global_work_size,
                                const size_t *local_work_size,
                                cl_uint num_events_in_wait_list,
                                const cl_event *event_wait_list,
                                cl_event *event)

 cl_int clEnqueueTask (cl_command_queue command_queue,
                       cl_kernel kernel,
                       cl_uint num_events_in_wait_list,
                       const cl_event *event_wait_list,
                       cl_event *event)
Reality Check – Apple compilation scheme

• The OCL compilation process on Snow Leopard (OS X 10.6):
   – Step 1: compile the OpenCL compute program to LLVM IR
     (Intermediate Representation), using Apple's OpenCL front-end
   – Step 2: compile the LLVM IR for the target device
• The NVIDIA GPU device compiles the LLVM IR in two steps:
   – LLVM IR to PTX (the CUDA IR)
   – PTX to the target GPU (G80/G92/G200 binaries)
• The CPU device uses the LLVM x86 back-end to compile directly to x86 binary code
• So what is LLVM? Next slide…
The LLVM Project

• LLVM – Low Level Virtual Machine
• An open-source compiler infrastructure that is:
   – Multi-language (front-ends such as CLang, GCC, GLSL+)
   – Cross-platform/architecture (back-ends such as x86, PPC, ARM, MIPS)
   – Cross-OS

All front-ends compile into the common LLVM IR; the back-ends compile the IR to each target.
OCL C Data Types

• The OpenCL C programming language supports all ANSI C data types
• In addition, the following scalar types are supported:

 Type        Description
 half        A 16-bit float. The half data type must conform to the IEEE 754-2008
             half precision storage format.
 size_t      The unsigned integer type of the result of the sizeof operator
             (32-bit or 64-bit).
 ptrdiff_t   A signed integer type that is the result of subtracting two pointers
             (32-bit or 64-bit).
 intptr_t    A signed integer type with the property that any valid pointer to void
             can be converted to this type, then converted back to pointer to void,
             and the result will compare equal to the original pointer.
 uintptr_t   An unsigned integer type with the property that any valid pointer to
             void can be converted to this type, then converted back to pointer to
             void, and the result will compare equal to the original pointer.

• And the following vector types, where n can be 2, 4, 8 or 16:
  charn, ucharn, shortn, ushortn, intn, uintn, longn, ulongn, floatn
Address Space Qualifiers

• The OpenCL memory model defines 4 memory spaces
• Accordingly, OpenCL C defines 4 qualifiers:
    – __global, __local, __constant, __private
• Best explained on a piece of code:

 // Variables outside the scope of kernels must be global
 __global float4 *color;      // An array of float4 elements
 typedef struct {
      float a[3];
      int b[2];
 } foo_t;
 __global image2d_t texture;  // A 2D texture image

 // Variables passed to the kernel can be of any type
 __kernel void stam_dugma(
       __global float *output,
       __global float *input,
       __local float *tile)
 {
      // Internal variables are private unless specified otherwise
      int in_x, in_y;
      const unsigned int lid = get_local_id(0);

      // ... and here is an example of the "specified otherwise":
      // a pointer p in the __private address space
      // that points to an int object in address space __global
      __global int *p;
 }
Built-in functions

• The spec defines over 80 built-in functions which must be supported
• The built-in functions are divided into the following types:
   – Work-item functions: get_work_dim, get_global_id, etc.
   – Math functions: acos, asin, atan, ceil, hypot, ilogb, …
   – Integer functions: abs, add_sat, mad_hi, max, mad24, …
   – Common functions (float only): clamp, min, max, radians, step, …
   – Geometric functions: cross, dot, distance, normalize, …
   – Relational functions: isequal, isgreater, isfinite, …
   – Vector data load & store functions: vloadn, vstoren, …
   – Image read & write functions: read_imagef, read_imagei, …
   – Synchronization functions: barrier
   – Memory fence functions: read_mem_fence, …
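A small sketch of the synchronization built-in at work, using the classic local-memory tile pattern (the kernel is illustrative; barrier and CLK_LOCAL_MEM_FENCE are the spec'd names):

 __kernel void reverse_tile (__global float *data, __local float *tile)
 {
     int gid = get_global_id(0);
     int lid = get_local_id(0);
     int sz  = get_local_size(0);

     tile[lid] = data[gid];           // each work-item loads one element
     barrier(CLK_LOCAL_MEM_FENCE);    // wait until the whole group has loaded
     data[gid] = tile[sz - 1 - lid];  // safe: the tile is fully populated
 }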




Vector Addition – Kernel Code

__kernel void
vec_add (__global const float4 *a,
         __global const float4 *b, __global float4 *c)
{
     int gid = get_global_id(0);
     c[gid] = a[gid] + b[gid];
}
Vector Addition – Host Code
void delete_memobjs(cl_mem *memobjs, int n)
{
    int i;
    for (i=0; i<n; i++)
    clReleaseMemObject(memobjs[i]);
}

int exec_vec_add_kernel(const char *program_source, int n, void *srcA,
                        void *srcB, void *dst)
{
    cl_context context;
    cl_command_queue cmd_queue;
    cl_device_id *devices;
    cl_program program;
    cl_kernel kernel;
    cl_mem memobjs[3];
    size_t global_work_size[1], local_work_size[1], cb;
    cl_int err;
    // create the OpenCL context on a GPU device
    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                      NULL, NULL, NULL);
    // get the list of GPU devices associated with context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);




Vector Addition – Host Code
  devices = malloc(cb);
  clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
  // create a command-queue
  cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
  free(devices);

  // allocate the buffer memory objects
  memobjs[0] = clCreateBuffer(context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(cl_float4) * n, srcA, NULL);
  memobjs[1] = clCreateBuffer(context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(cl_float4) * n, srcB, NULL);
  memobjs[2] = clCreateBuffer(context,
                              CL_MEM_READ_WRITE,
                              sizeof(cl_float4) * n, NULL, NULL);

  // create the program
  program = clCreateProgramWithSource(context,
                             1, (const char**)&program_source, NULL, NULL);
  // build the program
  err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
  // create the kernel
  kernel = clCreateKernel(program, "vec_add", NULL);




Vector Addition – Host Code
    // set the args values
    err = clSetKernelArg(kernel, 0,
                         sizeof(cl_mem), (void *) &memobjs[0]);
    err |= clSetKernelArg(kernel, 1,
                         sizeof(cl_mem), (void *) &memobjs[1]);
    err |= clSetKernelArg(kernel, 2,
                         sizeof(cl_mem), (void *) &memobjs[2]);

    // set work-item dimensions
    global_work_size[0] = n;
    local_work_size[0]= 1;

    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 global_work_size, local_work_size,
                                 0, NULL, NULL);
    // read the output buffer back to host memory
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE,
                              0, n * sizeof(cl_float4), dst,
                              0, NULL, NULL);

    // release memory objects, kernel, program, queue and context
    delete_memobjs(memobjs, 3); clReleaseKernel(kernel);
    clReleaseProgram(program); clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context); return 0;
}




Sources

• OpenCL at Khronos
  – http://www.khronos.org/opencl/


• “Beyond Programmable Shading” workshop at SIGGRAPH 2008
  – http://s08.idav.ucdavis.edu/


• Same workshop at SIGGRAPH Asia 2008
  – http://sa08.idav.ucdavis.edu/





Разработка кросс-платформенного кода между iPhone &lt; -> Windows с помощью o...Разработка кросс-платформенного кода между iPhone &lt; -> Windows с помощью o...
Разработка кросс-платформенного кода между iPhone &lt; -> Windows с помощью o...Yandex
 
ELC-E Linux Awareness
ELC-E Linux AwarenessELC-E Linux Awareness
ELC-E Linux AwarenessPeter Griffin
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaKazuaki Ishizaki
 
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdfJunZhao68
 
"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from IntelEdge AI and Vision Alliance
 
Devoxx 2015 - Building the Internet of Things with Eclipse IoT
Devoxx 2015 - Building the Internet of Things with Eclipse IoTDevoxx 2015 - Building the Internet of Things with Eclipse IoT
Devoxx 2015 - Building the Internet of Things with Eclipse IoTBenjamin Cabé
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
Using GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaUsing GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaTim Ellison
 

Semelhante a Open CL For Haifa Linux Club (20)

Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班
 
Baiscs of OpenGL
Baiscs of OpenGLBaiscs of OpenGL
Baiscs of OpenGL
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
Implementing OpenCL support in GEGL and GIMP
Implementing OpenCL support in GEGL and GIMPImplementing OpenCL support in GEGL and GIMP
Implementing OpenCL support in GEGL and GIMP
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
OneAPI dpc++ Virtual Workshop 9th Dec-20
OneAPI dpc++ Virtual Workshop 9th Dec-20OneAPI dpc++ Virtual Workshop 9th Dec-20
OneAPI dpc++ Virtual Workshop 9th Dec-20
 
Разработка кросс-платформенного кода между iPhone &lt; -> Windows с помощью o...
Разработка кросс-платформенного кода между iPhone &lt; -> Windows с помощью o...Разработка кросс-платформенного кода между iPhone &lt; -> Windows с помощью o...
Разработка кросс-платформенного кода между iPhone &lt; -> Windows с помощью o...
 
Deep Learning Edge
Deep Learning Edge Deep Learning Edge
Deep Learning Edge
 
LEGaTO Integration
LEGaTO IntegrationLEGaTO Integration
LEGaTO Integration
 
ELC-E Linux Awareness
ELC-E Linux AwarenessELC-E Linux Awareness
ELC-E Linux Awareness
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for Java
 
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
 
"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel
 
Devoxx 2015 - Building the Internet of Things with Eclipse IoT
Devoxx 2015 - Building the Internet of Things with Eclipse IoTDevoxx 2015 - Building the Internet of Things with Eclipse IoT
Devoxx 2015 - Building the Internet of Things with Eclipse IoT
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Using GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaUsing GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with Java
 

Mais de Ofer Rosenberg

Introduction To GPUs 2012
Introduction To GPUs 2012Introduction To GPUs 2012
Introduction To GPUs 2012Ofer Rosenberg
 
Intel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOFIntel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOFOfer Rosenberg
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & FutureOfer Rosenberg
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup WorkshopOfer Rosenberg
 

Mais de Ofer Rosenberg (8)

HSA Introduction
HSA IntroductionHSA Introduction
HSA Introduction
 
GPU Ecosystem
GPU EcosystemGPU Ecosystem
GPU Ecosystem
 
The GPGPU Continuum
The GPGPU ContinuumThe GPGPU Continuum
The GPGPU Continuum
 
From fermi to kepler
From fermi to keplerFrom fermi to kepler
From fermi to kepler
 
Introduction To GPUs 2012
Introduction To GPUs 2012Introduction To GPUs 2012
Introduction To GPUs 2012
 
Intel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOFIntel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOF
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & Future
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
 

Open CL For Haifa Linux Club

  • 8. GPGPU also in Games… [Before/after screenshots: a non-interactive point light vs. a dynamic light that affects character & environment; jagged-edge artifacting vs. clean edge details]
  • 9. A new type of programming… "The way the processor industry is going is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it." – Steve Jobs, NY Times interview, June 10 2008. What about GPUs? NVIDIA G80: 16 cores, 8 HW threads per core; Larrabee: XX cores, Y HW threads per core. "Basically it lets you use graphics processors to do computation," he said. "It's way beyond what Nvidia or anyone else has, and it's really simple." – Steve Jobs on OpenCL, NY Times interview, June 10 2008. http://bits.blogs.nytimes.com/2008/06/10/apple-in-parallel-turning-the-pc-world-upside-down/
  • 10. OpenCL in a nutshell
  • OpenCL is:
   – An open standard managed by the Khronos Group (cross-IHV, cross-OS), influenced & guided by Apple
   – Spec 1.0 approved Dec '08
   – A system for executing short "enhanced C" routines (kernels) across devices
   – Built around heterogeneous platforms – a host & devices – where devices can be CPUs, GPUs or accelerators (e.g. FPGAs)
   – Skewed towards GPU HW – samplers, vector types, etc.
   – Offers hybrid execution capability
  • OpenCL is not:
   – OpenGL, or any other 3D graphics language
   – Meant to replace C/C++ (don't write the entire application in it…)
  Khronos WG key contributors: Apple, NVIDIA, AMD, Intel, IBM, RapidMind, Electronic Arts (EA), 3DLABS, Activision Blizzard, ARM, Barco, Broadcom, Codeplay, Ericsson, Freescale, HI, Imagination Technologies, Kestrel Institute, Motorola, Movidia, Nokia, QNX, Samsung, Seaweed, Takumi, TI and Umeå University.
  • 11. The Roots of OpenCL
  • Apple has history in GPGPU…
   – Developed a framework called "Core Image" (& Core Video)
   – Based on OpenGL for GPU operation and optimized SSE code for the CPU
  • Feb 15th 2007 – NVIDIA introduces CUDA
   – Stream programming language – C with extensions for the GPU
   – Supported by any GeForce 8 GPU; works on XP & Vista (CUDA 2.0)
   – Amazing adoption rate – 40 university courses worldwide, 100+ applications/articles
  • Apple & NVIDIA cooperate to create OpenCL – Open Compute Language
  • 12. OpenCL Status
  • Apple submitted the OpenCL 1.0 specification draft to Khronos (owner of OpenGL)
  • June 16th 2008 – Khronos established the "Compute Working Group"
   – Members: AMD, Apple, Ardites, ARM, Blizzard, Broadcom, Codeplay, EA, Ericsson, Freescale, Hi Corp., IBM, Imagination Technologies, Intel, Kestrel Institute, Movidia, Nokia, Nvidia, Qualcomm, Rapid Mind, Samsung, Takumi and TI.
  • Dec. 1st 2008 – OpenCL 1.0 ratification
  • Apple is expected to release the "Snow Leopard" Mac OS (containing OpenCL 1.0) by the end of 2009, and has already begun porting code to OpenCL
  • 13. Agenda: OpenCL intro (GPGPU in a nutshell, OpenCL roots, OpenCL status); OpenCL 1.0 deep dive (Programming Model, Framework API, Language, Embedded Profile & Extensions); Summary
  • 14. OpenCL from 10,000 feet…
  • The standard defines two major elements:
   – The Framework/Software Stack (spec chapters 2-5), which sits between the application and the devices
   – The Language (spec chapter 6), in which kernels such as this one are written:

      __kernel void dot_product (__global const float4 *a,
                                 __global const float4 *b,
                                 __global float *c)
      {
        int tid = get_global_id(0);
        c[tid] = dot(a[tid], b[tid]);
      }
  • 15. OpenCL Platform Model
  • The basic platform is composed of a host and a few devices
  • Each device is made of a few compute units (well, cores…)
  • Each compute unit is made of a few processing elements (virtual scalar processors)
  • Under OpenCL the CPU is also a compute device
  • 16. Compute Device Memory Model
  • Compute device – CPU or GPU; compute unit = core
  • Compute kernel – a function written in OpenCL C, mapped to work-item(s)
  • Work-item – a single copy of the compute kernel, running on one data element
   – In data-parallel mode, a kernel execution contains multiple work-items
   – In task-parallel mode, a kernel execution contains a single work-item
  • Four memory types:
   – Global: default for images/buffers
   – Constant: global const variables
   – Local: shared between work-items in a work-group
   – Private: kernel internal variables
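  To make the four memory spaces concrete, here is a minimal kernel sketch (not from the deck; kernel name and arguments are illustrative) that stages global data through local memory:

      // Hypothetical kernel touching all four address spaces
      __kernel void scale_tiled(__global const float *in,   // global memory
                                __global float *out,
                                __constant float *factor,   // constant memory
                                __local float *tile)        // local memory, shared per work-group
      {
          int gid = get_global_id(0);
          int lid = get_local_id(0);      // private variable (default space)
          tile[lid] = in[gid];            // stage into local memory
          barrier(CLK_LOCAL_MEM_FENCE);   // all work-items in the group now see the tile
          out[gid] = tile[lid] * factor[0];
      }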
  • 17. Execution Model
  • The host defines a command queue and associates it with a context (devices, kernels, memory, etc.)
  • The host enqueues commands to the command queue
  • Kernel execution commands launch work-items: i.e. a kernel instance for each point in an abstract index space
  • Work-items execute together as a work-group
  • [Index-space diagram: for a 2D range of size Gx × Gy split into work-groups of size Sx × Sy, the work-item with local ID (sx, sy) in work-group (wx, wy) has global ID (wx·Sx + sx, wy·Sy + sy); corners of a group run from (sx, sy) = (0,0) to (Sx-1, Sy-1)]
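  The ID relationship in the diagram can be verified from inside a kernel; a small sketch of ours, not part of the deck:

      __kernel void check_ids(__global int *ok)
      {
          // global_id == group_id * local_size + local_id, per dimension
          size_t gx = get_group_id(0) * get_local_size(0) + get_local_id(0);
          ok[get_global_id(0)] = (gx == get_global_id(0));
      }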
  • 18. Programming Model
  • Data parallel, SPMD
   – Work-items in a work-group run the same program
   – Update data structures in parallel, using the work-item ID to select data and guide execution
  • Task parallel
   – One work-item per work-group … for coarse-grained task-level parallelism
   – Native function interface: a trap-door to run arbitrary code from an OpenCL command-queue
  (the two models map to two enqueue calls, sketched below)
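  A hedged host-side sketch (variable names are ours, error handling omitted) of the two enqueue styles:

      size_t global[1] = { 1024 };
      // Data parallel: one work-item per point of the index space
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, NULL);
      // Task parallel: a single work-item (equivalent to a 1x1x1 NDRange)
      clEnqueueTask(queue, kernel, 0, NULL, NULL);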
  • 19. Compilation Model
  • OpenCL uses a dynamic (runtime) compilation model (like DirectX and OpenGL)
  • Static compilation: the code is compiled from source to machine code at a specific point in the past (when the developer compiled it using the IDE)
  • Dynamic compilation (also known as runtime compilation):
   – Step 1: the code is compiled to an Intermediate Representation (IR), usually the assembly of a virtual machine. This step is known as offline compilation and is done by the front-end compiler
   – Step 2: the IR is compiled to machine code for execution. This step is much shorter; it is known as online compilation and is done by the back-end compiler
  • In dynamic compilation, step 1 is usually done only once, and the IR is stored. The app loads the IR and performs step 2 during the app's runtime (hence the term…)
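  A minimal sketch of the runtime half of this model (assuming a valid context `ctx` and device `dev`; error checks trimmed); when the online build fails, the build log is fetched the same way:

      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
      err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  // online compile for 'dev'
      if (err != CL_SUCCESS) {
          char log[4096];
          clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                                sizeof(log), log, NULL);
          printf("build log:\n%s\n", log);
      }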
  • 20. OpenCL Framework overview
  • The application (written in C++, C#, Java, …) hands kernels to accelerate (OpenCL C) to the framework:

      __kernel void dot (__global const float4 *a,
                         __global const float4 *b,
                         __global float *c)
      {
        int tid = get_global_id(0);
        c[tid] = dot(a[tid], b[tid]);
      }

  • OpenCL Framework: allows applications to use a host and one or more OpenCL devices as a single heterogeneous parallel computer system; it exposes a Platform API and a Runtime API
  • OpenCL Runtime: allows the host program to manipulate contexts once they have been created
  • Front-end compiler: compiles from source into a common binary intermediate (IR) that contains the OpenCL kernels
  • Back-end compiler: compiles from the general intermediate binary into a device-specific binary, with device-specific optimizations
  • Note: the CPU is both the host and a compute device
  • 21. Agenda: OpenCL intro; OpenCL 1.0 deep dive (Programming Model, Framework API, Language, Embedded Profile & Extensions); Summary
  • 22. The Platform Layer
  • Query the platform layer – clGetPlatformInfo
  • Query devices (by type) – clGetDeviceIDs(cl_device_type device_type, …), where the type is one of:
   – CL_DEVICE_TYPE_CPU: an OpenCL device that is the host processor; the host processor runs the OpenCL implementation and is a single- or multi-core CPU
   – CL_DEVICE_TYPE_GPU: an OpenCL device that is a GPU, meaning the device can also be used to accelerate a 3D API such as OpenGL or DirectX
   – CL_DEVICE_TYPE_ACCELERATOR: dedicated OpenCL accelerators (for example the IBM CELL Blade); these devices communicate with the host processor using a peripheral interconnect such as PCIe
   – CL_DEVICE_TYPE_DEFAULT: the default OpenCL device in the system
   – CL_DEVICE_TYPE_ALL: all OpenCL devices available in the system
  • For each device, query its configuration – clGetDeviceInfo(cl_device_id device, cl_device_info param_name, …); 40+ parameters, for example:
   – CL_DEVICE_TYPE: the OpenCL device type (CPU, GPU, ACCELERATOR, DEFAULT, or a combination of these)
   – CL_DEVICE_MAX_COMPUTE_UNITS: the number of parallel compute cores on the device (minimum value is one)
   – CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: maximum number of dimensions for the global and local work-item IDs used by the data-parallel execution model (minimum value is 3; refer to clEnqueueNDRangeKernel)
   – CL_DEVICE_MAX_WORK_GROUP_SIZE: maximum number of work-items in a work-group executing a kernel with the data-parallel execution model
   – CL_DEVICE_MAX_CLOCK_FREQUENCY: maximum configured clock frequency of the device in MHz
   – CL_DEVICE_ADDRESS_BITS: device address space size in bits (32 or 64)
   – CL_DEVICE_MAX_MEM_ALLOC_SIZE: max size of a memory object allocation in bytes; at least max(1/4 of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
  • Create contexts using the devices found by the "get" functions – clCreateContext
  • The context is the central element used by the runtime layer to manage command queues, memory objects, programs and kernels
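  Putting the platform layer together – a minimal sketch of ours (single GPU device assumed, error handling omitted):

      cl_platform_id platform;
      cl_device_id dev;
      cl_uint n;
      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, &n);

      cl_uint units;                       // one of the 40+ queryable parameters
      clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(units), &units, NULL);

      cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);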
  • 23. OpenCL Runtime
  • Everything in the OpenCL runtime happens within a context
  • Within the context live the runtime objects:
   – Programs – OpenCL C source compiled into kernel images, with a program handler per device
   – Kernels – a compiled function plus its argument list (arg[0], arg[1], …)
   – Memory objects – buffers and images
   – Command queues – per device, in-order or out-of-order
  • Typical flow: compile code → create data & arguments → send to execution

      #define g_c __global const
      __kernel void dot_prod (g_c float4 *a, g_c float4 *b, __global float *c)
      {
        int tid = get_global_id(0);
        c[tid] = dot(a[tid], b[tid]);
      }
      __kernel void buff_add (g_c float4 *a, g_c float4 *b, __global float4 *c)
      {
        int tid = get_global_id(0);
        c[tid] = a[tid] + b[tid];
      }
  • 24. OpenCL "boot" process
  • Platform layer: Query Platform → Query Devices → Create Context
  • Runtime: Create Command Queue → Create Memory Objects
  • Compiler: Create Program → Build Program → Create Kernel
  • Runtime: Set Kernel Args → Enqueue Kernel
  (slides 43-45 walk through this exact sequence in code)
  • 25. OpenCL C Programming Language in a Nutshell
  • Derived from ISO C99, with a few restrictions: no recursion, no function pointers, no functions from the C99 standard headers
  • New data types – new scalar types, vector types, image types
  • Address space qualifiers
  • Synchronization objects – barrier
  • Built-in functions
  • IEEE 754 compliant, with a few exceptions

      __global float4 *color;              // An array of float4
      typedef struct {
        float a[3];
        int b[2];
      } foo_t;
      __global image2d_t texture;          // A 2D texture image

      __kernel void stam_dugma(__global float *output,
                               __global float *input,
                               __local float *tile)
      {
        // private variables
        int in_x, in_y;
        const unsigned int lid = get_local_id(0);
        // declares a pointer p in the __private space
        // that points to an int object in __global space
        __global int *p;
      }
  • 26. Agenda: OpenCL intro; OpenCL 1.0 deep dive (Programming Model, Framework API, Language, Embedded Profile & Extensions); Summary
  • 27. OpenCL Extensions
  • As in OpenGL, OpenCL supports specification extensions
  • An extension is an optional feature which might be supported by a device, but is not part of the "core features" (a Khronos term)
   – The application is required to query the device using the CL_DEVICE_EXTENSIONS parameter (a query sketch follows below)
  • Two types of extensions:
   – Extensions approved by the Khronos OpenCL working group – they use "KHR" in function/enum names and might be promoted to required core features in future versions of OpenCL
   – Vendor-specific extensions
  • The specification already provides some KHR extensions
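  Checking for an extension before relying on it might look like this minimal sketch (device variable and buffer size are assumptions; strstr matches substrings, so a production check would tokenize):

      char ext[2048];
      clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);
      if (strstr(ext, "cl_khr_fp64") != NULL) {
          // the device advertises the double-precision extension
      }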
  • 28. OpenCL 1.0 KHR Extensions
  • Double-precision floating point – supports double as a data type and extends the built-in functions to cover it
  • Selecting rounding mode – adds to the mandatory "round to nearest": "round to nearest even", "round to zero", "round to positive infinity", "round to negative infinity"
  • Atomic functions (for 32-bit integers, for local memory, for 64-bit)
  • Writing to 3D image memory objects – core OpenCL mandates only reads
  • Byte-addressable stores – in core OpenCL, writes through pointers are limited to 32-bit granularity
  • Half floating point – adds a 16-bit floating-point type
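  In kernel code, a KHR extension is switched on with a pragma before use; a sketch of ours assuming the fp64 extension is present:

      #pragma OPENCL EXTENSION cl_khr_fp64 : enable

      __kernel void scale64(__global double *x)
      {
          int gid = get_global_id(0);
          x[gid] = x[gid] * 0.5;   // double math is legal only with the extension enabled
      }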
  • 29. OpenCL 1.0 Embedded Profile
  • A "relaxed" version for embedded devices (as in OpenGL ES)
  • No 64-bit integers, no 3D images
  • Reduced requirements on samplers – no CL_FILTER_LINEAR for float/half, fewer addressing modes
  • Not IEEE 754 compliant for some functions – example: minimum accuracy for atan() >= 5 ULP
  • Reduced set of minimal device requirements:
   – Image height/width: 2048 instead of 8192
   – Number of samplers: 8 instead of 16
   – Local memory size: 1K instead of 16K
   – More…
  • 30. Agenda: OpenCL intro; OpenCL 1.0 deep dive; Summary
  • 31. OpenCL Unique Features
  • As a summary, here are some unique features of OpenCL:
  • An open standard for cross-OS, cross-platform, heterogeneous processing – Khronos owned
  • Creates a unified, flat system model where the GPU, CPU and other devices are treated (almost) the same
  • Includes data & task parallelism – extending GPGPU beyond the traditional usages
  • Supports native functions (C/C++ interop)
  • Derived from ISO C99 with additional types, functions, etc. (and some restrictions)
  • IEEE 754 compliant
  • 32. Backups
  • 33. Building OpenCL Code
  • 1. Creating OpenCL programs – clCreateProgramWithSource() / clCreateProgramWithBinary()
   – From source: receives an array of strings
   – From binaries: receives an array of binaries – either an intermediate representation or a device-specific executable
  • 2. Building the programs – clBuildProgram()
   – The developer can define a subgroup of devices to build for
  • 3. Creating kernel objects – clCreateKernel()
   – A single kernel by kernel name, or all kernels in a program
  • OpenCL supports a dynamic compilation scheme – the application uses "create from source" the first time, then "create from binaries" on subsequent runs

      cl_program clCreateProgramWithSource (cl_context context, cl_uint count,
                                            const char **strings, const size_t *lengths,
                                            cl_int *errcode_ret)
      cl_program clCreateProgramWithBinary (cl_context context, cl_uint num_devices,
                                            const cl_device_id *device_list,
                                            const size_t *lengths,
                                            const unsigned char **binaries,
                                            cl_int *binary_status, cl_int *errcode_ret)
      cl_int clBuildProgram (cl_program program, cl_uint num_devices,
                             const cl_device_id *device_list, const char *options,
                             void (*pfn_notify)(cl_program, void *user_data),
                             void *user_data)
      cl_kernel clCreateKernel (cl_program program, const char *kernel_name,
                                cl_int *errcode_ret)
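  The source-then-binary caching scheme described above can be sketched with clGetProgramInfo; a hedged single-device sketch of ours (the disk-cache step is elided, names are illustrative):

      // After the first successful clBuildProgram, fetch the device binary...
      size_t bin_size;
      clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                       sizeof(bin_size), &bin_size, NULL);
      unsigned char *bin = malloc(bin_size);
      clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
      // ...store 'bin' on disk; on the next run, skip the front-end:
      cl_program cached = clCreateProgramWithBinary(ctx, 1, &dev, &bin_size,
                              (const unsigned char **)&bin, NULL, &err);
      clBuildProgram(cached, 1, &dev, NULL, NULL, NULL);  // back-end compile only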
  • 34. Some order here, please…
  • OpenCL defines a command queue, which is created on a single device – within the scope of a context, of course…
  • Commands are enqueued to a specific queue – kernel execution, memory operations
  • A few queues can be created on the same device
  • Two types of queues:
   – In-order queue: commands are executed in the order of issuing
   – Out-of-order queue: command execution depends only on completion of its event list
  • Events:
   – Each command can be created with an event associated with it
   – Each command execution can be made dependent on a list of pre-created events
   – Commands can depend on events created on other queues/contexts
  • In the slide's example (two contexts with queues Q1,1…Q1,4 and Q2,1, Q2,2 over devices D1-D3, commands C1-C4):
   – C3 from Q1,2 depends on C1 & C2 from Q1,2
   – C1 from Q1,4 depends on C2 from Q1,2
   – In Q1,4, C3 depends on C2
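  A hedged host-side sketch of event dependencies (names are ours, error checks omitted): kernelB will not start until kernelA's event completes, even on an out-of-order queue:

      cl_command_queue q = clCreateCommandQueue(ctx, dev,
                              CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
      cl_event evA;
      clEnqueueNDRangeKernel(q, kernelA, 1, NULL, global, NULL, 0, NULL, &evA);
      // kernelB waits on evA despite the out-of-order queue
      clEnqueueNDRangeKernel(q, kernelB, 1, NULL, global, NULL, 1, &evA, NULL);
      clFinish(q);   // block until everything enqueued has completed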
  • 35. Memory Objects
  • OpenCL defines memory objects (buffers/images)
   – They reside in global memory
   – An object is defined in the scope of a context
   – Memory objects are the only way to pass a large amount of data between the host & the devices
  • Two mechanisms to sync memory objects:
  • Transactions – read/write
   – Read: takes a "snapshot" of the buffer/image into host memory
   – Write: overwrites the buffer/image with data from the host
   – Can be blocking or non-blocking (the app then needs to sync on an event)
  • Mapping – similar to a DX lock/map command
   – Passes ownership of the buffer/image to the host, and back to the device on unmap
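  The two mechanisms side by side, as a minimal sketch (queue, buffer and sizes are assumptions):

      // Transaction: a blocking read copies the buffer into host memory
      clEnqueueReadBuffer(q, buf, CL_TRUE, 0, nbytes, host_ptr, 0, NULL, NULL);

      // Mapping: the host gets a pointer into the object, then returns ownership
      float *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                                    0, nbytes, 0, NULL, NULL, &err);
      /* ... inspect p on the host ... */
      clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);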
  • 36. Executing Kernels
  • A kernel is executed by enqueueing it to a specific command queue
  • The app must set the kernel arguments before enqueueing
   – Setting the arguments is done one by one
   – The kernel's argument list is preserved after enqueueing, so only the arguments that changed need to be set before enqueueing again
  • There are two separate enqueueing APIs – data parallel & task parallel
  • In data-parallel enqueueing, the app specifies the global & local work sizes
   – Global: the overall work-items to be executed, described by an N-dimensional range
   – Local: the breakdown of the global size to fit the specific device (can be left NULL)

      cl_int clSetKernelArg (cl_kernel kernel, cl_uint arg_index,
                             size_t arg_size, const void *arg_value)
      cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue, cl_kernel kernel,
                                     cl_uint work_dim, const size_t *global_work_offset,
                                     const size_t *global_work_size,
                                     const size_t *local_work_size,
                                     cl_uint num_events_in_wait_list,
                                     const cl_event *event_wait_list, cl_event *event)
      cl_int clEnqueueTask (cl_command_queue command_queue, cl_kernel kernel,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list, cl_event *event)
  • 37. Reality Check – Apple compilation scheme
  • The OpenCL compilation process on Snow Leopard (OS X 10.6):
   – Step 1: the OpenCL front-end (Apple) compiles the OpenCL compute program to LLVM IR (intermediate representation)
   – Step 2: the IR is compiled for the target device
  • The NVIDIA GPU device compiles the LLVM IR in two steps: LLVM IR to PTX (the CUDA IR), then PTX to the target GPU binary (G80, G92, G200)
  • The CPU device uses the LLVM x86 back-end (LLVM project) to compile directly to x86 binary code
  • So what is LLVM? Next slide…
  • 38. The LLVM Project
  • LLVM – Low Level Virtual Machine
  • An open-source compiler infrastructure:
   – Multi-language (front-ends such as Clang, GCC, GLSL+)
   – Cross-platform/architecture (back-ends such as x86, PPC, ARM, MIPS), all meeting at the common LLVM IR
   – Cross-OS
  • 39. OCL C Data Types
  • The OpenCL C programming language supports all ANSI C data types
  • In addition, the following scalar types are supported:
   – half: a 16-bit float; must conform to the IEEE 754-2008 half-precision storage format
   – size_t: the unsigned integer type of the result of the sizeof operator (32-bit or 64-bit)
   – ptrdiff_t: a signed integer type that is the result of subtracting two pointers (32-bit or 64-bit)
   – intptr_t: a signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer
   – uintptr_t: an unsigned integer type with the same pointer round-trip property
  • And vector types are supported, with lengths of 2, 4, 8 or 16 elements
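  A short kernel sketch of ours showing scalar and vector types together; float4 arithmetic operates component-wise:

      __kernel void saxpy4(__global float4 *y, __global const float4 *x)
      {
          size_t gid = get_global_id(0);   // size_t scalar type
          float4 a = (float4)(2.0f);       // broadcast a scalar to all 4 lanes
          y[gid] = a * x[gid] + y[gid];    // component-wise multiply-add
      }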
  • 40. Address Space Qualifiers
  • The OpenCL memory model defines 4 memory spaces; accordingly, OpenCL C defines 4 qualifiers: __global, __local, __constant, __private
  • Best explained on a piece of code:

      __global float4 *color;        // An array of float4 elements.
      typedef struct {               // Variables outside the scope of
        float a[3];                  // kernels must be global.
        int b[2];
      } foo_t;
      __global image2d_t texture;    // A 2D texture image.

      __kernel void stam_dugma(      // Variables passed to the kernel
        __global float *output,      // can be of any space.
        __global float *input,
        __local float *tile)
      {
        int in_x, in_y;              // Internal variables are private
        const unsigned int lid = get_local_id(0);   // unless specified otherwise.
        // And here's an example of the "specified otherwise":
        // declares a pointer p in the __private address space
        // that points to an int object in the __global address space.
        __global int *p;
      }
  • 41. Built-in functions
  • The spec specifies over 80 built-in functions which must be supported, divided into the following groups:
   – Work-item functions – get_work_dim, get_global_id, etc…
   – Math functions – acos, asin, atan, ceil, hypot, ilogb
   – Integer functions – abs, add_sat, mad_hi, max, mad24
   – Common functions (float only) – clamp, min, max, radians, step
   – Geometric functions – cross, dot, distance, normalize
   – Relational functions – isequal, isgreater, isfinite
   – Vector data load & store functions – vloadn, vstoren
   – Image read & write functions – read_imagef, read_imagei
   – Synchronization functions – barrier
   – Memory fence functions – read_mem_fence
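  Several of these groups in one small kernel sketch (ours, not from the deck):

      __kernel void norm_rows(__global const float4 *v, __global float *len)
      {
          int gid = get_global_id(0);          // work-item function
          float4 x = v[gid];
          float d = length(x);                 // geometric function
          len[gid] = isfinite(d) ? d : 0.0f;   // relational function
      }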
  • 42. Vector Addition – Kernel Code

      // Note: despite the slide title, the kernel is named dot_product and the
      // host code on the next slides allocates a float (not float4) output
      // buffer, so the body that type-checks is the dot-product form used
      // elsewhere in the deck:
      __kernel void dot_product (__global const float4 *a,
                                 __global const float4 *b,
                                 __global float *c)
      {
        int gid = get_global_id(0);
        c[gid] = dot(a[gid], b[gid]);
      }
  • 43. Vector Addition – Host Code

      void delete_memobjs(cl_mem *memobjs, int n)
      {
        int i;
        for (i=0; i<n; i++)
          clReleaseMemObject(memobjs[i]);
      }

      int exec_dot_product_kernel(const char *program_source,
                                  int n, void *srcA, void *srcB, void *dst)
      {
        cl_context context;
        cl_command_queue cmd_queue;
        cl_device_id *devices;
        cl_program program;
        cl_kernel kernel;
        cl_mem memobjs[3];
        size_t global_work_size[1];
        size_t local_work_size[1];
        size_t cb;
        cl_int err;

        // create the OpenCL context on a GPU device
        context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

        // get the size of the list of GPU devices associated with the context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
  • 44. Vector Addition – Host Code

        devices = malloc(cb);
        clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

        // create a command-queue on the first device
        cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
        free(devices);

        // allocate the buffer memory objects
        memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    sizeof(cl_float4) * n, srcA, NULL);
        memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    sizeof(cl_float4) * n, srcB, NULL);
        memobjs[2] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                    sizeof(cl_float) * n, NULL, NULL);

        // create the program
        program = clCreateProgramWithSource(context, 1,
                                            (const char**)&program_source, NULL, NULL);
        // build the program
        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
        // create the kernel
        kernel = clCreateKernel(program, "dot_product", NULL);
  • 45. Vector Addition – Host Code

        // set the args values
        err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
        err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
        err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

        // set work-item dimensions
        global_work_size[0] = n;
        local_work_size[0] = 1;

        // execute kernel
        err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                     global_work_size, local_work_size,
                                     0, NULL, NULL);

        // read the output buffer (blocking)
        err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                                  n * sizeof(cl_float), dst, 0, NULL, NULL);

        // release kernel, program, and memory objects
        delete_memobjs(memobjs, 3);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(cmd_queue);
        clReleaseContext(context);
        return 0;
      }
  • 46. Sources
  • OpenCL at Khronos – http://www.khronos.org/opencl/
  • "Beyond Programmable Shading" workshop at SIGGRAPH 2008 – http://s08.idav.ucdavis.edu/
  • The same workshop at SIGGRAPH Asia 2008 – http://sa08.idav.ucdavis.edu/