2. STATE OF GPU COMPUTING
Today’s Challenges
Separate address spaces
Copies
Can’t share pointers
New language required for compute kernel
EX: OpenCL™ runtime API
Compute kernel compiled separately from
host code
Emerging Solution
HSA Hardware
Single address space
Coherent
Virtual
Fast access from all components
Can share pointers
Bring GPU computing to
existing, popular programming models
Single-source, fully supported by
compiler
HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power-efficient: high compute density per mm and per watt
• But: they can be hard to program
PCIe
Proof points: Green500 — the top of the list is now heterogeneous computing solutions. Components = CPU cores, 3rd-party IP, GPU cores.
SPIR = Standard Portable Intermediate Representation. Standard compiler intermediate language. More detail later. At this point, note that SPIR is a high-level IR and HSAIL is a low-level IR.
Familiar to folks used to GPUs. A grid contains work-groups; a work-group contains work-items. The programmer specifies the global dims and the work-group size; the work-group size can often be determined automatically, but it can also be tuned for peak performance. Wavefront: a hardware-specific concept, similar to the "vector width" of a machine (for example, SSE = 128 bits, AVX = 256 bits).
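As a sketch of the arithmetic behind those terms (illustrative Java, not the HSA runtime API; the method and variable names here are mine), the number of work-groups needed to cover the global dims rounds up when the work-group size does not divide them evenly:

```java
public class GridMath {
    // Work-groups needed to cover one global dimension, rounding up
    // when the size is not an exact multiple of the group size.
    static int numGroups(int globalSize, int groupSize) {
        return (globalSize + groupSize - 1) / groupSize;
    }

    public static void main(String[] args) {
        // Example: a 1920x1080 image with 16x16 work-groups.
        int gx = numGroups(1920, 16);   // 120 work-groups in x
        int gy = numGroups(1080, 16);   // 68 work-groups in y (1080/16 = 67.5, rounded up)
        System.out.println(gx + " x " + gy + " work-groups, "
                + (gx * gy * 16 * 16) + " work-items launched");
        // -> 120 x 68 work-groups, 2088960 work-items launched
    }
}
```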
We want to run a Sobel edge-detect filter on Peaches the dog. The equation represents the kernel that is run on each pixel in the image. Note that the equation examines surrounding pixels to determine the rate of change, i.e., an edge.
To use the HSA parallel execution model, we map the image to a 2D grid. Each pixel is mapped to a work-item, and the grid covers all the pixels in the image. The same kernel is run for each work-item in the grid, but each work-item gets its own unique (x, y) coordinate. Using that coordinate, a work-item can read the values of the surrounding pixels. The key observation is that the programmer writes the code for a single pixel in what looks like a single thread, and the execution model exposes a huge amount of parallelism.
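A minimal sketch of the per-work-item logic in plain Java (not actual HSA kernel code; the image layout, the clamp-to-edge border handling, and the standard 3x3 Sobel coefficients are my assumptions):

```java
public class SobelSketch {
    // Read a pixel, clamping coordinates at the image border.
    static double pix(double[][] img, int x, int y) {
        int w = img[0].length, h = img.length;
        return img[Math.max(0, Math.min(h - 1, y))][Math.max(0, Math.min(w - 1, x))];
    }

    // The "kernel": each work-item gets its own (x, y) and reads the
    // surrounding pixels to estimate the local gradient.
    static double sobelAt(double[][] img, int x, int y) {
        double gx = -pix(img, x-1, y-1) + pix(img, x+1, y-1)
                  - 2*pix(img, x-1, y) + 2*pix(img, x+1, y)
                  - pix(img, x-1, y+1) + pix(img, x+1, y+1);
        double gy = -pix(img, x-1, y-1) - 2*pix(img, x, y-1) - pix(img, x+1, y-1)
                  + pix(img, x-1, y+1) + 2*pix(img, x, y+1) + pix(img, x+1, y+1);
        return Math.sqrt(gx * gx + gy * gy); // gradient magnitude: large at an edge
    }

    public static void main(String[] args) {
        // A vertical edge: left half dark, right half bright.
        double[][] img = {{0,0,1,1},{0,0,1,1},{0,0,1,1},{0,0,1,1}};
        // "Launch the grid": one work-item per pixel, same kernel everywhere.
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                System.out.printf("%.1f ", sobelAt(img, x, y));
    }
}
```

The nested loops stand in for the grid launch; on an HSA implementation each `sobelAt` call would be an independent work-item.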
Work-groups provide an additional, optional optimization opportunity. HSA supports special "group" memory that is shared by all work-items in a work-group and is typically very high-performance. In this case, there is read-sharing between neighboring pixels, so we pick a relatively large square work-group so that each work-item can pull neighboring pixel values from group memory.
Destination operand comes first. Uses actual registers. Atomics are platform-level.
“Spill” = register spill.
Some tension between programmability and HW implementation.
Explain how SIMD hardware uses masks to run both halves of an if-then-else statement.
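A plain-Java sketch of that masking behavior (the lane values and the branch body are invented for illustration): both sides of the branch execute over every lane, and the per-lane mask decides which result is committed.

```java
public class MaskDemo {
    // Run "result = (x >= 0) ? x * 2 : -x" the way masked SIMD hardware
    // would: both sides execute in all lanes, the mask picks the result.
    static int[] run(int[] x) {
        int n = x.length;
        int[] result = new int[n];
        boolean[] mask = new boolean[n];
        // 1. Evaluate the condition in every lane -> per-lane mask.
        for (int lane = 0; lane < n; lane++) mask[lane] = x[lane] >= 0;
        // 2. Execute the "then" side in all lanes; commit where mask is true.
        for (int lane = 0; lane < n; lane++)
            if (mask[lane]) result[lane] = x[lane] * 2;
        // 3. Execute the "else" side; commit where mask is false.
        for (int lane = 0; lane < n; lane++)
            if (!mask[lane]) result[lane] = -x[lane];
        return result;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(new int[]{5, -3, 8, -1})));
        // -> [10, 3, 16, 1]
    }
}
```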
The virtual ISA is SIMT (one thread), but kernels can use cross-lane operations. Cross-lane code is non-portable. Other cross-lane operations: countlane, countuplane, masklane, sendlane, receivelane.
HSA supports IEEE-defined precisions.
Key points: this is in Java, standard Java 8. Uses the Java Stream class and a lambda function. The Player class has a pointer (an object reference) to a parent Team. Stream supports parallel execution, including both multi-core and GPU accelerators. This is the goal of HSA: make GPU programming as easy as multi-core CPU programming, and easier than multi-core + vector-ISA programming. Note that all functions have been inlined.
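The slide's exact source is not reproduced in these notes, so the following is a sketch in the same spirit (class and field names are assumed, not the slide's code): a lambda over a parallel Stream that dereferences an object reference (`p.team`), which an HSA-enabled JVM could run on the GPU unchanged because heap pointers are shared.

```java
import java.util.Arrays;
import java.util.List;

public class StreamDemo {
    static class Team { final String name; Team(String n) { name = n; } }
    static class Player {
        final Team team;          // object reference to the parent Team
        int score;
        Player(Team t, int s) { team = t; score = s; }
    }

    // The lambda is the "kernel": one invocation per element. parallelStream
    // may run on multiple cores -- or, on an HSA-enabled JVM, on the GPU --
    // with no change to the source.
    static List<Player> runKernel(List<Player> players) {
        players.parallelStream()
               .forEach(p -> p.score += p.team.name.length()); // deref of team
        return players;
    }

    public static void main(String[] args) {
        Team t = new Team("HSA");
        List<Player> ps = runKernel(Arrays.asList(new Player(t, 1), new Player(t, 2)));
        ps.forEach(p -> System.out.println(p.score)); // prints 4 then 5
    }
}
```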
This is the HSAIL code for the Java forEach loop body only; we are not translating the entire program into HSAIL. Line 4: kernarg signature. Line 5: returns the coordinate of this work-item, so we know which player to operate on. Line 10: multiply, destination first. Line 19: dereference of the team variable.
"Faraway accelerator" -> separate address spaces, no shared pointers, and the perf/power cost of copies. With shared virtual memory and platform atomics, HSA systems look more like CPUs with fast vector units than like discrete accelerators.