Optimizing OpenCL™ on CPUs

                                                                     Ofer Rosenberg
                                                               Visual Computing Software

                                                                           © Intel Corporation, 2010
* OpenCL is a trademark of Apple Inc., used by permission by Khronos
OpenCL and Heterogeneous Computing
• OpenCL is a platform API which supports a uniform programming
  environment across devices
   – Enables heterogeneous parallel computations
   – Unique in its ability to coordinate CPUs, GPUs, etc.

• The goal of using OpenCL should be to make the best use of all the available
  resources (CPUs, GPUs) from within a single program:
   – One program that runs well (i.e. reasonably close to “hand-tuned” performance) on
     a heterogeneous mixture of processors.
   – Intel’s new generation of processors brings a new level of integration between CPU & GPU

   2
Writing OpenCL for the CPU
• The challenge: unleashing the performance of modern CPUs
  – Multi-core / SMT
  – Vector units (SSE ISA)

• OpenCL is a great framework for harnessing Intel CPUs
  – Intuitive, easy & maintainable
  – Unleashes the performance of Intel CPUs
      – Multi-core
      – Utilizes the vector units (SSE ISA)
      – Close to hand-tuned code!
  – Performance-portable code
      – Forward compatibility between CPU generations
      – Aspires to compatibility across devices

  3
OpenCL view of Core™ i7

[Figure: OpenCL Platform Model* shown next to the Core™ i7 cache hierarchy
 (per-core L1/L2 caches, shared L3)]

Core™ i7 975
• 8 Compute Units
   –   Quad Core + SMT
• 4/8/16 Processing Elements per Unit
   –   128-bit XMM registers
   –   The data type determines the # of elements
• (32K L1 + 256K L2) per core, 8M L3 shared
   –   Not part of the OpenCL platform model, but useful to know

                             * Taken from the OpenCL 1.1 Specification, Rev 33
     4
Mapping the OpenCL Data-Parallel Execution Model

• Implicit SIMD data parallelism (i.e. shader-style):
  – Write the kernel as a “scalar program”
  – Use vector data types sized naturally to the algorithm
  – The kernel is automatically mapped to SIMD compute resources and
    cores by the compiler/runtime/hardware.

• Explicit SIMD data parallelism:
  – The kernel defines one stream of instructions
  – Parallelism comes from source-level wide vector types
       – Size vector types to match the native HW width
  – The programmer hints on the vector data type using attributes
       – vec_type_hint(typen)

                   The CPU supports both mappings
   5
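The contrast between the two mappings can be sketched as two tiny hypothetical kernels (this is device code, so it builds only under an OpenCL runtime; the kernel and buffer names are assumptions, not code from the deck):

```c
// Implicit SIMD: a scalar-style kernel. The compiler/runtime packs
// work-items across SSE lanes and spreads workgroups across cores.
__kernel void scale_implicit(__global const float *in,
                             __global float *out, float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];            /* one element per work-item */
}

// Explicit SIMD: one work-item processes a whole float4, sized to
// match the 128-bit XMM registers, with a vec_type_hint attribute.
__kernel __attribute__((vec_type_hint(float4)))
void scale_explicit(__global const float4 *in,
                    __global float4 *out, float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];            /* four elements per work-item */
}
```

In the implicit style the work size equals the element count; in the explicit style it shrinks by the vector width.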
Example: N-body Simulation
       Given N bodies with an initial position xi and velocity vi,
       the force fij on body i caused by body j is given by the following
       (G is the gravitational constant):

           fij = G · (mi · mj / |rij|²) · (rij / |rij|)

       where mi and mj are the masses of bodies i and j, respectively, and
           rij = xj − xi

       The total force Fi on body i is the sum of fij over all j ≠ i, and
       the acceleration is computed as ai = Fi/mi

   6
NBody – from C to CL

[Code listing (image): the sequential C implementation]

  7
NBody – from C to CL

[Annotated C code (image)]

                       Outer loop: calculate the new
                       position & speed of all bodies

                       Inner loop: calculate the
                       influence of all the bodies on [i]

                        Calculate [i]’s new position &
                        speed

  8
NBody – from C to CL

[Code listing (image): the same computation as an OpenCL C kernel]

  9
NBody – from C to CL

[Annotated OpenCL C kernel (image)]

                                                    The kernel handles one body – i

                                                   Inner loop: calculate the
                                                   influence of all the bodies on [i]

                                                    Calculate [i]’s new position &
                                                    speed

       What does the developer gain from converting to OpenCL C?

  10
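Since the kernel listing itself is an image, here is a hypothetical OpenCL C kernel with the annotated structure (identifiers such as `epsSqr` and the double-buffered `newPos` argument are assumptions; as device code it builds only under an OpenCL runtime):

```c
__kernel void nbody(__global const float4 *pos,   /* xyz = position, w = mass */
                    __global float4 *newPos,
                    __global float4 *vel,
                    int numBodies, float deltaTime, float epsSqr)
{
    int i = get_global_id(0);                 /* this work-item: body i      */
    float4 myPos = pos[i];
    float4 acc = (float4)(0.0f);

    for (int j = 0; j < numBodies; ++j) {     /* influence of all bodies on i */
        float4 r = pos[j] - myPos;
        float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
        float invDist = rsqrt(distSqr + epsSqr);
        acc += pos[j].w * (invDist * invDist * invDist) * r;
    }

    float4 v = vel[i] + acc * deltaTime;      /* [i]'s new speed...          */
    float4 p = myPos + v * deltaTime;         /* ...and new position         */
    p.w = myPos.w;                            /* preserve the mass in .w     */
    newPos[i] = p;
    vel[i] = v;
}
```

The gain over the C version: the outer loop disappears, because the N-D Range enumerates the bodies and the runtime distributes them across cores and SIMD lanes.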
NBody Performance
Results from Intel’s internal OpenCL implementation:*

[Bar chart, built up progressively across slides 11–14: relative performance by
 code version; markers on the chart include x1, x1.45, x3.1, x5.6, x15, x25,
 x29 and +14%]

• Porting C to CL – Implicit Data Parallelism
       – “Shader-style” code
       – Benefits from multi-core/SMT

• Implicit Data Parallelism with Vectorization
       – Intel’s internal implementation
       – Cross-workitem vectorization/packing
       – Benefits from SSE (128-bit registers)

• Explicit Data Parallelism
       – Hand-tuned OpenCL C code

*   Results measured on Core™ i7 975, 3.3 GHz, 6GB DDR3.
    Results depend on the algorithm/code being run.

                      The OpenCL Explicit version is x25 faster than naïve C *
              The Explicit version is only 14% slower than highly optimized code *
       11–14
Implicit Data Parallelism on the CPU
On Intel’s internal implementation:
• One workitem runs on a single SSE lane
• Workitems are packed into SSE registers as part of the
  Intel OpenCL compilation process
    –   One code stream is mapped to one SSE lane
    –   All operations/calls are vectorized
• Vector data types inside the kernel code are
  scalarized and mapped to a single SSE lane

• The vectorizer generates a workgroup
    –   Further workgroup-level optimizations
• A workgroup is executed on a compute unit (HW
  thread)

• The kernel is executed over an N-D Range, which is
  divided into workgroups
• Several workgroups run concurrently on all
  compute units (HW threads)

[Diagram: OCL workitem → one SSE lane; workgroup → one HW thread with its
 L1/L2; N-D Range → all cores, sharing L3]

        15
Overview of the Vectorization Stage

            __kernel void program(float4* pos, int numBodies, float deltaTime)
            {
                float myPos = gid;
                float refPos = numBodies + deltaTime;
                float4 r = pos[refPos - myPos];
                float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
                float invDist = sqrt(distSqr + epsSqr);
                float invDistCube = invDist * invDist * invDist;
                float4 acc = invDistCube * r;
                float4 oldVel = vel[gid];
                float newPos = myPos.w;
            }

                                                OpenCL kernel code
                                                (illustrative fragment)

  16
Overview of the Vectorization Stage (cont.)

[Animation, slides 17–21: the same kernel is shown running as multiple work
 items, visualized graphically, then scalarized (vector data types split into
 scalar operations per SSE lane), then vectorized across work items into
 vector instructions]

  17–21
Overview of the Vectorization Stage (cont.)

                                         [Final frame: a reduced number of
                                          kernel invocations]

       Vectorization enables the developer to exploit the CPU vector units
                         with Implicit Data Parallelism
  22
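The effect of cross-workitem packing can be demonstrated with SSE intrinsics: four scalar "work-items" collapse into a single vector multiply. This is a conceptual illustration under assumed function names, not the vectorizer's actual output:

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Scalar version: one "work-item" per element. */
static void scale_scalar(const float *in, float *out, float k, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = k * in[i];
}

/* Packed version: four work-items per SSE instruction, as the
   vectorizer would emit; n is assumed to be a multiple of 4. */
static void scale_sse(const float *in, float *out, float k, int n)
{
    __m128 vk = _mm_set1_ps(k);                     /* broadcast k       */
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);            /* load 4 lanes      */
        _mm_storeu_ps(out + i, _mm_mul_ps(v, vk));  /* 4 mults at once   */
    }
}
```

Both produce identical results; the packed loop simply runs a quarter of the iterations.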
Explicit Data Parallelism on the CPU
• A workitem is executed solely on a single
  compute unit (HW thread)
• Vector operations are mapped to SSE
  instructions
• Vectors which are wider than the physical
  register are processed by interleaving

• A workgroup is executed on a single compute
  unit (HW thread)
• The Barrier & Fence built-ins impose some
  penalty (context saving)

• The kernel is executed over an N-D Range, which
  is divided into workgroups
• Several workgroups run concurrently on all
  compute units (HW threads)

• Optimization hints here:
  http://www.khronos.org/developers/library/2009-
  hotchips/Intel_OpenCL-and-CPUs.pdf

[Diagram: OCL workitem and workgroup → one HW thread with its L1/L2;
 N-D Range → all cores, sharing L3]

       23
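The "interleaving" point can be illustrated the same way: a `float8` operation has no single 128-bit equivalent, so on SSE hardware it becomes two back-to-back register-wide instructions. A conceptual sketch with assumed names:

```c
#include <xmmintrin.h>

/* An OpenCL float8 multiply on 128-bit hardware: the compiler
   interleaves two SSE operations, one per register-sized half. */
static void mul_float8(const float a[8], const float b[8], float out[8])
{
    __m128 lo = _mm_mul_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b));     /* lanes 0..3 */
    __m128 hi = _mm_mul_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4)); /* lanes 4..7 */
    _mm_storeu_ps(out,     lo);
    _mm_storeu_ps(out + 4, hi);
}
```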
Demo




 24
Summary
• OpenCL provides the developer with the tools to unleash the
  performance of CPUs
  – Multi-core/SMT, vector units (SSE)
  – Forward compatibility

• OpenCL supports two forms of data parallelism; both map well to
  Intel architectures

• Implicit Data Parallelism has the best chance of mapping onto a
  diverse range of hardware
  – Requires advanced compilation techniques

• Intel® OpenCL SDK “Alpha” software will be available by the end of 2010
  on whatif.intel.com

  25
References

• s09.idav.ucdavis.edu — slides from the SIGGRAPH 2009
  course “Beyond Programmable Shading”

• Tim Mattson, “OpenCL, Heterogeneous Computing and the
  CPU”, OpenCL Workshop at Hot Chips 2009.
  http://www.khronos.org/developers/library/2009-
  hotchips/Intel_OpenCL-and-CPUs.pdf

• Fatahalian, K., Houston, M., “GPUs: a closer look”,
  Communications of the ACM, October 2008, vol. 51, no. 10.
  graphics.stanford.edu/~kayvonf/papers/fatahalianCACM.pdf

 26
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE,
  EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED
  BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH
  PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED
  WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES
  RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
  PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR
  USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to
  change without notice.
• Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which
  may cause the product to deviate from published specifications. Current characterized errata are available
  on request.
• Any code names featured are used internally within Intel to identify products that are in development and
  not yet publicly announced for release. Customers, licensees and other third parties are not authorized by
  Intel to use code names in advertising, promotion or marketing of any product or services and any such use
  of Intel's internal code names is at the sole risk of the user.
• Performance tests and ratings are measured using specific computer systems and/or components and
  reflect the approximate performance of Intel products as measured by those tests. Any difference in
  system hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside, the Intel logo, and Intel Core are trademarks of Intel Corporation in the United States and
  other countries.
• OpenCL is a trademark of Apple Inc., used by permission by Khronos.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2010 Intel Corporation. All rights reserved.


       27

Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup WorkshopOfer Rosenberg
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™
Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™
Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™Yury Gorbachev
 
Unleash performance through parallelism - Intel® Math Kernel Library
Unleash performance through parallelism - Intel® Math Kernel LibraryUnleash performance through parallelism - Intel® Math Kernel Library
Unleash performance through parallelism - Intel® Math Kernel LibraryIntel IT Center
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Pradeep Singh
 
Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Kynetics
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...Edge AI and Vision Alliance
 
Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Deepak Kumar
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights LandingAndrey Vladimirov
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Intel Xeon Phi Hotchips Architecture Presentation
Intel Xeon Phi Hotchips Architecture PresentationIntel Xeon Phi Hotchips Architecture Presentation
Intel Xeon Phi Hotchips Architecture PresentationChris O'Neal
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Fisnik Kraja
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learningAmer Ather
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalJunli Gu
 

Similar to Intel's Presentation in SIGGRAPH OpenCL BOF (20)

Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™
Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™
Enabling Cross-platform Deep Learning Applications with Intel OpenVINO™
 
Unleash performance through parallelism - Intel® Math Kernel Library
Unleash performance through parallelism - Intel® Math Kernel LibraryUnleash performance through parallelism - Intel® Math Kernel Library
Unleash performance through parallelism - Intel® Math Kernel Library
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
 
Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights Landing
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Intel Xeon Phi Hotchips Architecture Presentation
Intel Xeon Phi Hotchips Architecture PresentationIntel Xeon Phi Hotchips Architecture Presentation
Intel Xeon Phi Hotchips Architecture Presentation
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learning
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
 
FPGA @ UPB-BGA
FPGA @ UPB-BGAFPGA @ UPB-BGA
FPGA @ UPB-BGA
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 

Intel's Presentation in SIGGRAPH OpenCL BOF

  • 1. Optimizing OpenCL on CPUs. Ofer Rosenberg, Visual Computing Software. © Intel Corporation, 2010. * OpenCL is a trademark of Apple Inc., used by permission by Khronos
  • 2. OpenCL and Heterogeneous Computing. OpenCL is a platform API which supports a uniform programming environment across devices: it enables heterogeneous parallel computations and is unique in its ability to coordinate CPUs, GPUs, etc. The goal of using OpenCL should be to make the best use of all the available resources (CPUs, GPUs) from within a single program: one program that runs well (i.e. reasonably close to "hand-tuned" performance) on a heterogeneous mixture of processors. Intel's new generation of processors brings a new level of integration between CPU & GPU.
  • 3. Writing OpenCL for the CPU. The challenge is unleashing the performance of modern CPUs: multi-core/SMT and vector units (SSE ISA). OpenCL is a great framework to harness Intel CPUs: it is intuitive, easy & maintainable; it unleashes the performance of Intel CPUs (multi-core, vector units via the SSE ISA, close to hand-tuned code!); and it yields performance-portable code (forward compatibility between CPU generations, and it aspires to compatibility between devices).
  • 4. OpenCL view of Core™ i7 (OpenCL Platform Model*). Core™ i7 975: 8 compute units (quad core + SMT); 4/8/16 processing elements per unit (128-bit XMM registers; the data type determines the number of elements); 32K L1 + 256K L2 per core, 8M L3 shared (caches are not part of the OpenCL platform model, but useful). * Taken from OpenCL 1.1 Specification, Rev 33
  • 5. Mapping the OpenCL Data-Parallel Execution Model. Implicit SIMD data parallelism (i.e. shader-style): write the kernel as a "scalar program"; use vector data types sized naturally to the algorithm; the kernel is automatically mapped to SIMD compute resources and cores by the compiler/runtime/hardware. Explicit SIMD data parallelism: the kernel defines one stream of instructions; parallelism comes from source-level wide vector types; size vector types to match the native HW width; the programmer hints on the vector data type using attributes: vec_type_hint(typen). The CPU supports both mappings.
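The two mappings can be sketched in plain C (a hypothetical illustration, not OpenCL C; `kernel_scalar` and `kernel_explicit4` are invented names): the implicit style writes a scalar body and lets the tooling pack invocations into SIMD lanes, while the explicit style writes one 4-wide instruction stream sized to a 128-bit register.

```c
#include <stddef.h>

/* Implicit style: a "scalar program" invoked once per work item.
   The compiler/runtime is free to pack such invocations into SIMD lanes. */
static float kernel_scalar(float x) {
    return x * x + 1.0f;
}

/* Explicit style: one instruction stream that processes 4 elements at a time,
   mirroring a float4 kernel sized to a 128-bit SSE register. */
static void kernel_explicit4(const float in[4], float out[4]) {
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = in[lane] * in[lane] + 1.0f;
}
```

Both produce identical results; the difference is who owns the vector shape: the compiler (implicit) or the programmer (explicit).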
  • 6. Example: N-body Simulation. Given N bodies with an initial position xi and velocity vi, the force fij on body i caused by body j is fij = G * mi * mj / |rij|^2 * (rij / |rij|), where G is the gravitational constant, mi and mj are the masses of bodies i and j, respectively, and rij = xj - xi. The acceleration is computed as ai = Fi / mi, where Fi is the total force on body i.
  • 7. NBody – from C to CL
  • 8. NBody – from C to CL. External loop: calculate the new position & speed for all bodies. Internal loop: calculate the influence of all the bodies on body [i]. Then calculate [i]'s new position & speed.
  • 9. NBody – from C to CL
  • 10. NBody – from C to CL. The kernel handles one body, i. Internal loop: calculate the influence of all the bodies on [i]. Then calculate [i]'s new position & speed. What does the developer gain from converting to OpenCL C?
  • 11. NBody Performance. Results from Intel's internal OpenCL implementation,* comparing three code versions against a naive C port (x1 baseline): porting C to CL with implicit data parallelism ("shader-style" code; benefits from multi-core/SMT); implicit data parallelism with vectorization (Intel's internal implementation; cross-workitem vectorization/packing; benefits from 128-bit SSE registers); and explicit data parallelism (hand-tuned OpenCL C code). * Results measured on Core™ i7 975, 3.3 GHz, 6GB DDR3; results depend on the algorithm/code running.
  • 12. NBody Performance (chart build-up). The first bar is revealed: x5.6 over the naive C baseline.
  • 13. NBody Performance (chart build-up). Further bars are revealed: x15, with relative factors of X1.45 and X3.1 annotated on the chart.
  • 14. NBody Performance (complete chart). The explicit version reaches x25 and the highly optimized reference x29 (+14%). The OpenCL explicit version is x25 faster than naive C, and only 14% slower than highly optimized code.*
  • 15. Implicit Data Parallelism on the CPU (Intel's internal OpenCL CPU implementation). Workitem level: one workitem runs on a single SSE lane; workitems are packed into SSE registers as part of the Intel OpenCL compilation process (one code stream is mapped to one SSE lane; all operations/calls are vectorized; vector data types inside the kernel code are scalarized and mapped to a single SSE lane); the vectorizer generates a workgroup, with further workgroup-level optimizations. Workgroup level: a workgroup is executed on a compute unit (HW thread). N-D Range level: the kernel is executed over an N-D Range, which is divided into workgroups; several workgroups run concurrently on all compute units (HW threads).
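The cross-workitem packing can be mimicked in plain C (illustration only, with invented names; the real transform happens inside Intel's OpenCL compiler on SSE registers): eight scalar work-item invocations collapse into two 4-lane steps.

```c
/* A scalar "work item" body: one invocation per element. */
static float wi_scalar(float p) {
    return 2.0f * p + 1.0f;
}

/* The same work over a workgroup of 8 items after cross-workitem packing:
   4 items advance in lockstep per step, mirroring the 4 float lanes of a
   128-bit SSE register. Returns the number of packed steps executed. */
static int run_packed(const float in[8], float out[8]) {
    int steps = 0;
    for (int base = 0; base < 8; base += 4) {      /* one "SIMD" step */
        for (int lane = 0; lane < 4; ++lane)       /* lanes in lockstep */
            out[base + lane] = 2.0f * in[base + lane] + 1.0f;
        ++steps;
    }
    return steps;
}
```

The invocation count drops from 8 scalar calls to 2 packed steps, which is the "reduced number of invocations" effect described on slide 22.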
  • 16. Overview of the Vectorization Stage. OpenCL kernel code (simplified excerpt, as shown on the slide):
    __kernel void program(float4* pos, int numBodies, float deltaTime) {
      float myPos = gid;
      float refPos = numBodies + deltaTime;
      float4 r = pos[refPos - myPos];
      float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
      float invDist = sqrt(distSqr + epsSqr);
      float invDistCube = invDist * invDist * invDist;
      float4 acc = invDistCube * r;
      float4 oldVel = vel[gid];
      float newPos = myPos.w;
    }
  • 17. Overview of the Vectorization Stage. The same kernel code, considered as multiple work items. Next: Visualize
  • 18. Overview of the Vectorization Stage. Graphic visualization of the kernel's vector instructions across multiple work items. Next: Scalarize
  • 19. Overview of the Vectorization Stage. Scalarizing the code: the kernel's vector operations are broken down into per-lane scalar operations. Next: Vectorize
  • 20. Overview of the Vectorization Stage. Vectorizing the code: scalar operations from multiple work items are packed into vector instructions.
  • 21. Overview of the Vectorization Stage. Vectorizing the code (continued).
  • 22. Overview of the Vectorization Stage. The result is a reduced number of invocations: vectorization enables the developer to exploit the CPU vector units with implicit data parallelism.
  • 23. Explicit Data Parallelism on the CPU. Workitem level: a workitem is executed solely on a single compute unit (HW thread); vector operations are mapped to SSE instructions; vectors which are wider than the physical register are processed by interleaving. Workgroup level: a workgroup is executed on a single compute unit (HW thread); barrier & fence built-ins impose some penalty (context saving). N-D Range level: the kernel is executed over an N-D Range, which is divided into workgroups; several workgroups run concurrently on all compute units (HW threads). Optimization hints: http://www.khronos.org/developers/library/2009-hotchips/Intel_OpenCL-and-CPUs.pdf
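The interleaving point can be sketched in plain C (hypothetical illustration): an 8-wide vector addition, wider than the 128-bit physical register, is executed as two consecutive 4-wide halves.

```c
/* One float8 addition executed as two interleaved 4-wide (SSE-sized) ops:
   lanes 0..3 in the first pass, lanes 4..7 in the second. */
static void add_float8(const float a[8], const float b[8], float out[8]) {
    for (int half = 0; half < 2; ++half)        /* one 128-bit op per half */
        for (int lane = 0; lane < 4; ++lane) {
            int i = half * 4 + lane;
            out[i] = a[i] + b[i];
        }
}
```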
  • 25. Summary & Conclusions. OpenCL provides the developer with the tools to unleash the performance of CPUs: multi-threading/SMT, vector units (SSE), and forward compatibility. OpenCL supports two forms of data parallelism, and both map well to Intel architectures. Implicit data parallelism has the best chance of mapping onto a diverse range of hardware, but requires advanced compilation techniques. Intel® OpenCL SDK "Alpha" software will be available by the end of 2010 on whatif.intel.com.
  • 26. References. s09.idav.ucdavis.edu for slides from a SIGGRAPH 2009 course titled "Beyond Programmable Shading". Tim Mattson, "OpenCL, Heterogeneous Computing and the CPU", OpenCL Workshop at Hot Chips 2009: http://www.khronos.org/developers/library/2009-hotchips/Intel_OpenCL-and-CPUs.pdf. Fatahalian, K., Houston, M., "GPUs: a closer look", Communications of the ACM, October 2008, vol. 51, no. 10: graphics.stanford.edu/~kayvonf/papers/fatahalianCACM.pdf
  • 27. Legal Disclaimer • INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. • Intel may make changes to specifications and product descriptions at any time, without notice. • All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. • Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. • Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. • Intel, Intel Inside, the Intel logo, and Intel Core are trademarks of Intel Corporation in the United States and other countries. 
• OpenCL is a trademark of Apple Inc., used by permission by Khronos. • *Other names and brands may be claimed as the property of others. • Copyright © 2010 Intel Corporation. All rights reserved.