Explore practical elements, such as performance profiling, debugging, and porting advice. Get an overview of advanced programming topics, like common design patterns, SIMD lane interoperability, data conversions, and more.
2. Jefferson Amstutz, Dmitry Babokin, Pete Brubaker
Contributions by Jon Kennedy, Jeff Rous, Arina Neshlyaeva
SIGGRAPH 2019 | LOS ANGELES | 28 JULY - 1 AUGUST
Advanced SIMD Programming with the
Intel® ISPC Compiler
https://ispc.github.io/
Epic Chaos Demo - Image courtesy of Epic Games® | Intel® OSPRay
4. ISPC : A Brief Recap
Intel® OSPRay : Disney’s Moana Island Scene: over 15 billion instanced primitives rendered interactively
5. • Exploiting Parallelism is essential for obtaining peak
performance on modern computing hardware
• Task Parallelism : Multithreading - Utilize all the cores
• SIMD Parallelism : SIMD Programming - Utilize all the vector
units
• Learning intrinsics is time consuming, and not always accessible
to every programmer.
• Make it easier to get all the FLOPs without being a ninja
programmer
• Reduce the development cost by working with a high level
language
Why ISPC?
ISPC : A Brief Recap
6. • The Intel SPMD Program Compiler
• SPMD == Single Program, Multiple Data programming model
• It’s a compiler and a language for writing vector (SIMD) code.
• Open-source, LLVM-based language and compiler for many SIMD architectures.
• Generates high performance vector code targeting many vector ISAs.
• Cross platform support (Windows/Linux/MacOS/PS4/Xbox/ARM AARCH64)
• The language is C based
• Simple to use and easy to integrate with existing codebase.
• ISPC is not an “autovectorizing” compiler!
• Vectors are built into the type system, not discovered
• The programmer explicitly specifies vector or scalar variables
What is ISPC?
ISPC : A Brief Recap
7. ISPC : A Brief Recap
• C based, so it’s easy to read and
understand
• Code looks sequential, but executes
in parallel
• Easily mixes scalar and vector
computation
• Explicit vectorization using two new
keywords, uniform and varying
• Vector iteration via foreach keyword
https://ispc.godbolt.org/z/sOpQ8Z
What does the language look like?
It is basically shader programming for the CPU!
8. • The ISPC compiler produces everything required for very simple
integration into application code.
• C/C++ header file
• Contains the API/function call for each kernel you have written
• Contains any data structures defined in your ISPC kernel and
required by the application code.
• Object files to link against
• No bulky runtime or verbose API
ISPC : A Brief Recap
Easy integration
9. • Programmers no longer need to know the ISA to write good vector code.
• More accessible to programmers who aren’t familiar with SIMD intrinsics.
• More programmers are able to fully utilize the CPU in different areas of application
development.
• Reduced development cost
• It’s easier to develop and maintain. Simple integration. It looks like scalar code.
• Increased optimization reach
• Supporting a new ISA is as easy as changing a command line option and recompiling.
• Increased performance over scalar code
• SSE : ~3-4x; AVX2 : ~5-6x
• YMMV ☺
ISPC : A Brief Recap
Why is this good?
10. Advanced ISPC
Vector Loops
Epic Chaos Demo - Image courtesy of Epic Games®
11. Vector Loops
• foreach is a convenience mechanism:
• It is a SIMD for loop that iterates in
SIMD-width-sized chunks
• Unmasked main body for when all SIMD
lanes are enabled
• Masked tail body for when some SIMD lanes
are disabled
• Foreach can be N dimensional, where each
dimensional index is a varying
• For loop
• A for loop with a varying index will use
masking in the loop body
• Safe, but with a slight cost
• A for loop with a uniform index will have no
masking
• The user will need to add a tail body
https://ispc.godbolt.org/z/r1eflk
foreach(…) vs for(…)
12. Vector Loops
foreach example
https://ispc.godbolt.org/z/00eIcH
Unmasked Main Body
Masked Tail Body
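The split into an unmasked main body and a masked tail body can be sketched in plain scalar C++. This is only an emulation of the idea, not what the ISPC compiler emits: the lane count `kWidth` and the function name `scale_foreach_style` are hypothetical, and the inner lane loops stand in for single vector instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lane count standing in for the target's SIMD width.
constexpr std::size_t kWidth = 4;

// Scales every element of `data` in the shape a foreach loop takes:
// full-width chunks first, then one partially masked chunk for the remainder.
// Returns the number of chunks that needed a mask (0 or 1).
inline int scale_foreach_style(std::vector<float>& data, float factor) {
    const std::size_t n = data.size();
    const std::size_t full = n - (n % kWidth);

    // Unmasked main body: every lane is active, so no mask test is needed.
    for (std::size_t base = 0; base < full; base += kWidth) {
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            data[base + lane] *= factor;
        }
    }

    // Masked tail body: lanes past the end of the array are switched off.
    int masked_chunks = 0;
    if (full < n) {
        ++masked_chunks;
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            const bool active = (full + lane) < n;
            if (active) {
                data[full + lane] *= factor;
            }
        }
    }
    return masked_chunks;
}
```

With 10 elements and a width of 4, two chunks run unmasked and the last two elements are handled by the masked tail. With 8 elements no tail is needed.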
13. Vector Loops
• Serializes over each active SIMD lane
• Many Uses :
• Atomic operations
• Custom reductions
• Calls to uniform functions
• …
https://ispc.godbolt.org/z/i18Lux
Unreal Engine 4.23, Chaos Physics ISPC Source
foreach_active
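As a rough illustration of what "serializes over each active SIMD lane" means, the following scalar C++ sketch walks an execution mask and performs one scalar update per enabled lane, which is the pattern behind a custom reduction. The names `kLanes` and `accumulate_active` are hypothetical and the mask is modelled as an array of booleans. It is not the ISPC implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // hypothetical SIMD width

// Adds each active lane's value into a shared total, one lane at a time,
// mirroring how foreach_active visits only the lanes enabled in the mask.
// Returns how many lanes were visited.
inline int accumulate_active(const std::array<int, kLanes>& values,
                             const std::array<bool, kLanes>& mask,
                             int& shared_total) {
    int visited = 0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        if (!mask[lane]) {
            continue;  // inactive lanes are skipped entirely
        }
        shared_total += values[lane];  // scalar ("uniform") work for this lane
        ++visited;
    }
    return visited;
}
```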
14. Vector Loops
• Loop over each unique value in a varying only once
• Execution mask enabled for all SIMD lanes with the same value
https://ispc.godbolt.org/z/r49y7i
foreach_unique
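The grouping behaviour described above can be emulated in scalar C++: for each distinct value in a vector, build the mask of lanes that hold it, so the loop body would run once per value with that mask enabled. The names below are hypothetical. The visiting order (first occurrence first) is an assumption of this sketch, not something the slides state about ISPC.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kLaneCount = 4;  // hypothetical SIMD width

// Returns one (value, lane mask) pair per distinct value in `v`.
// Bit i of the mask is set when lane i holds that value.
inline std::vector<std::pair<int, unsigned>> unique_groups(
        const std::array<int, kLaneCount>& v) {
    std::vector<std::pair<int, unsigned>> groups;
    unsigned done = 0;  // bit i is set once lane i belongs to a group
    for (std::size_t lane = 0; lane < kLaneCount; ++lane) {
        if (done & (1u << lane)) {
            continue;  // this lane's value was already handled
        }
        unsigned mask = 0;
        for (std::size_t other = lane; other < kLaneCount; ++other) {
            if (v[other] == v[lane]) {
                mask |= 1u << other;
            }
        }
        done |= mask;
        groups.emplace_back(v[lane], mask);
    }
    return groups;
}
```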
15. Vector Loops
Naïve ports to uniform code paths can miss opportunities
Axis of parallelization
Try looking for a new axis of parallelization
https://ispc.godbolt.org/z/GF7myA
Scalar
Vector
16. Vector Loops
• ISPC supports multiple axes of
parallelization within a kernel
• HLSL/GLSL/CL only support 1
• User controlled
• Provides optimization opportunities
https://github.com/ispc/ispc/blob/master/examples/sgemm/SGEMM_kernels.ispc
Multiple axes of parallelisation
17. Advanced ISPC
Structures and Pointers
Intel® OSPRay : Gramophone rendered in Pixar’s usdview
18. Structures and Pointers
struct vec3f {
float x, y, z;
};
struct Ray {
vec3f origin;
vec3f direction;
float tnear;
float tfar;
};
Uniform Ray
uniform Ray r;
Varying Ray
varying Ray r;
Uniform vs. Varying structures
19. Structures and Pointers
struct vec3f {
float x, y, z;
};
struct PerspRay {
uniform vec3f origin;
vec3f direction;
float tnear;
float tfar;
};
Uniform PerspRay
uniform PerspRay r;
Varying PerspRay
varying PerspRay r;
Uniform vs. Varying structures
20. • Pointers are complex
• The variability is specified like ‘const’ in C/C++
uniform float * varying vPtr;
• Variability: 2 parts
• The pointer itself
• Single pointer? Different pointer per SIMD lane?
• Default: varying
• The item pointed-to
• Scalar value? Vector value?
• Default: uniform
• Be explicit and specify the variability so it’s correct and clear to the reader
Structures and Pointers
ISPC pointers
21. Structures and Pointers
Pointer -> Data
uniform float * uniform uPtr2u; one pointer -> one float
varying float * uniform uPtr2v; one pointer -> one float per SIMD lane
uniform float * varying vPtr2u; one pointer per SIMD lane -> one float each
varying float * varying vPtr2v; one pointer per SIMD lane -> one float per SIMD lane each
ISPC pointers
22. Advanced ISPC
Memory Access
Epic Chaos Demo - Image courtesy of Epic Games®
23. Memory Access
Uniform vs. Varying data layout
struct vec3f
{
float x;
float y;
float z;
};
uniform vec3f uPos { x, y, z }
Memory layout: x y z x y z …
varying vec3f vPos { x x x x, y y y y, z z z z }
Memory layout: x x x x y y y y …
24. Memory Access
Complex data layout
struct Ray {
vec3f origin;
vec3f direction;
float tnear;
float tfar;
};
uniform Ray uRay
{
origin { x y z }
direction { x y z }
tnear
tfar
}
varying Ray vRay
{
origin { x y z, one value per SIMD lane }
direction { x y z, one value per SIMD lane }
tnear, one value per SIMD lane
tfar, one value per SIMD lane
}
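To make the two layouts concrete, here is a small C++ sketch that models a uniform struct as one scalar per member and a varying struct as one array of lanes per member. The type names and the gang size `kGangSize` are assumptions for illustration. The sketch mirrors the member ordering only and says nothing about the alignment a real ISPC target would use.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical gang size; a real target fixes this (for example 4 lanes).
constexpr std::size_t kGangSize = 4;

// "uniform vec3f": one x, one y, one z.
struct Vec3fUniform {
    float x, y, z;
};

// "varying vec3f": every member is widened to one value per lane,
// giving the x x x x y y y y z z z z ordering.
struct Vec3fVarying {
    float x[kGangSize];
    float y[kGangSize];
    float z[kGangSize];
};

struct RayUniform {
    Vec3fUniform origin;
    Vec3fUniform direction;
    float tnear;
    float tfar;
};

struct RayVarying {
    Vec3fVarying origin;
    Vec3fVarying direction;
    float tnear[kGangSize];
    float tfar[kGangSize];
};

// The varying form holds exactly kGangSize copies of every scalar member.
static_assert(sizeof(RayUniform) == 8 * sizeof(float), "8 floats per ray");
static_assert(sizeof(RayVarying) == kGangSize * sizeof(RayUniform),
              "one ray per lane");
static_assert(offsetof(Vec3fVarying, y) == kGangSize * sizeof(float),
              "all x lanes come before the first y lane");
```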
25. Memory Access
• ISPC will automatically transpose your array of structures (AoS) data to structures of
arrays (SoA) and back
• Useful for block copying uniform structs into varyings
• It will just work!
• But there may be faster alternatives?
Data transposition
https://ispc.godbolt.org/z/4_p44L
26. Memory Access
• Vector reads/writes to non-contiguous
memory
• AVX2 onwards supports an optimised
gather instruction
• AVX512 supports an optimised scatter
instruction
• ISPC will use these if available
• ISPC will emit performance warnings when it
finds gather/scatters
#pragma ignore warning(perf)
• Gather performance has improved over
successive generations
• But there can be faster alternatives,
especially if there is cacheline locality
• aos_to_soa3()/aos_to_soa4() helpers
• Good for packed float3/float4 data types
• shuffle()
• Load a vector register from memory and
swizzle the data
• You will need to experiment on your dataset.
• The fastest form of gather is no gather –
read contiguous memory where possible!
Scatter/Gather
27. Memory Access
• It's best to use SoA or AoSoA layouts with
ISPC
• Re-arranging data is not always easy
• Transposing the input data can be
faster than using gather/scatter
instructions.
• When to transpose?
• If the algorithm is cheap, it's best to
convert the data into a temporary
buffer, do the work then convert back.
• Otherwise transpose live data on the
way in/out of the kernel.
AOS to SOA
Transpose
Array of Structures
(AoS)
Structure of Arrays
(SoA)
Hybrid Array of Structures of Arrays
(AoSoA)
28. Memory Access
• There are stdlib functions,
aos_to_soa3/4.
• They assume arrays of
vec3/vec4 input data.
• What about strided data?
• You can write your own
transpose functions using
the stdlib.
• Use loads, shuffles, inserts, etc.
AOS to SOA
Vector Load Vector Load Vector Load
Vector Store Vector Store Vector Store
Shuffle
Shuffle
29. Memory Access
AOS to SOA example
https://ispc.godbolt.org/z/NwLihI
Unreal Engine 4.23, Chaos Physics ISPC Source
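For readers who want to see the transposition itself, the following is a minimal scalar C++ sketch of converting packed float3 data from AoS to SoA and back. It is not the ISPC stdlib routine and does not use loads or shuffles. The types `Float3` and `Float3SoA` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures element: x y z x y z ...
struct Float3 {
    float x, y, z;
};

// Structure of Arrays container: x x x ... y y y ... z z z ...
struct Float3SoA {
    std::vector<float> x, y, z;
};

// Transposes AoS input into three contiguous component arrays.
inline Float3SoA transpose_to_soa(const std::vector<Float3>& aos) {
    Float3SoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Float3& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// Transposes the component arrays back into packed structures.
inline std::vector<Float3> transpose_to_aos(const Float3SoA& soa) {
    assert(soa.x.size() == soa.y.size() && soa.y.size() == soa.z.size());
    std::vector<Float3> aos(soa.x.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        aos[i] = Float3{soa.x[i], soa.y[i], soa.z[i]};
    }
    return aos;
}
```

A cheap kernel would call `transpose_to_soa` once into a temporary buffer, do its work on contiguous component arrays, then call `transpose_to_aos` on the way out.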
30. DRAM
Memory Access
• Allows writes to memory to occur bypassing the cache
• Avoids cacheline reads and cache pollution
• Useful when bandwidth limited
• Not always faster than normal stores
• Never read the memory straight after the write
• It won’t be in cache and will be slow…
• Write full cachelines to avoid partial writes
• Used for techniques such as :
• Texture writes
• Geometry transformations
• Compression
• …
• Experiment with your dataset.
• What about streaming loads?
• Unless the memory was specifically allocated with the
write combining flag, they won’t do anything
Streaming stores
Normal Write
Cache Hierarchy Write Combine Buffer
Streaming Store
31. Memory Access
Streaming stores example
https://ispc.godbolt.org/z/bKOJ1m
32. Memory Access
• Loads and stores can be aligned or unaligned
(default)
• There are specific instructions for each type
• Historically this had a performance impact
• Unaligned loads/stores may straddle cachelines
• Newer Intel architectures have reduced/removed
this impact
• Alignment needs to be the register width
• SSE: 16 bytes, AVX2: 32 bytes, AVX512: 64 bytes
• Simple to enable in ISPC
• --opt=force-aligned-memory
• Try it – YMMV!
Aligned memory
Figure: an unaligned load straddling two cachelines vs. an aligned load within a single cacheline
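Before enabling forced aligned memory access, the application has to guarantee that its buffers really are aligned to the register width. A minimal C++ sketch of how that might be checked on the application side follows. The helper `is_aligned` and the type `Avx2Block` are hypothetical names, and 32 bytes is used as the AVX2 width quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `ptr` sits on an `alignment`-byte boundary.
// `alignment` must be a power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// A block sized and aligned for one 32-byte (AVX2-width) register.
struct alignas(32) Avx2Block {
    float lanes[8];
};
```

An address 4 bytes into such a block fails the 32-byte check, which is the case where an aligned load instruction would fault.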
33. Advanced ISPC
Control Flow
Intel® OSPRay : Richtmyer–Meshkov volume shown with shadows and ambient occlusion
34. Control Flow
Divergent control flow
Control flow divergence can be costly
Consider this :
Execution Mask: 1 1 1 1 -> 0 1 0 1 and 1 0 1 0
Divergent branch causes both expensive
operations to be executed
Now consider this :
Execution Mask: 1 1 1 1 -> 1 1 1 1 and 0 0 0 0
Uniform branch causes a single
expensive operation to be executed
https://ispc.godbolt.org/z/XM0MEw
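The cost difference can be illustrated with a small C++ model: given the per-lane outcome of a condition, count how many of the two branch bodies the whole gang has to execute. The function name and the 4-lane width are assumptions of this sketch, which models the behaviour rather than reproducing ISPC code generation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kGangWidth = 4;  // hypothetical SIMD width

// Returns how many branch bodies (then and/or else) must be executed
// for the given per-lane condition: 2 when lanes disagree, 1 otherwise.
inline int branch_bodies_executed(const std::array<bool, kGangWidth>& cond) {
    bool any_true = false;
    bool any_false = false;
    for (bool c : cond) {
        if (c) {
            any_true = true;
        } else {
            any_false = true;
        }
    }
    return (any_true ? 1 : 0) + (any_false ? 1 : 0);
}
```

A mask of 0 1 0 1 makes the gang pay for both expensive operations, while 1 1 1 1 pays for only one.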
35. Control Flow
Unmasked Functions
• Avoids masked operations
• Useful if you want to use a different execution
mask
Unmasked Blocks
• An optimisation
• Avoids masked operations
• Useful when you know there are no side
effects
Unmasked
https://ispc.godbolt.org/z/i18Lux
36. Advanced ISPC
Interfacing Tricks
Epic Chaos Demo - Image courtesy of Epic Games®
37. Interfacing Tricks
• Input data is generally an array of
uniforms
• These can be copied directly to varyings
by using a varying index
• Such as programIndex
• They can be cast to a varying pointer and
dereferenced
• Applications can pass in ‘fake’ varyings
which still generates SIMD code
Mapping input data to ispc varyings
https://ispc.godbolt.org/z/-hbfO1
38. Interfacing Tricks
• Just like normal C/C++ code, there are times when you need to call external code
• ISPC supports this for any external function using ‘C’ linkage
Calling back to C
https://ispc.godbolt.org/z/P5XcuT
39. Advanced ISPC
Choosing the Right Target
Epic Chaos Demo - Image courtesy of Epic Games®
40. Choosing the Right Target
• ISPC offers limited decoupling of SIMD width
from the ISA
• “Double Pumped”
• Vector instructions executed twice to
emulate double width registers
• Can be effective at hiding latency
• sse4-i32x8, avx2-i32x16, etc
• “Half Pumped”
• Vector instructions executed with
narrower SIMD width registers
• Use a richer ISA for performance
gains
• avx512skx-i32x8
• Avoids platform specific AVX512
power scaling
• As simple as changing the command line
• --target=...
• Experiment to find the best targets for your
workload
Asymmetrical SIMD register width and target SIMD ISA
https://ispc.godbolt.org/z/4EhA2A
41. Choosing the Right Target
ISPC supports compiling to multiple targets
at once
• Currently, only 1 target per ISA
• Auto dispatch will choose the highest
supported compiled target that a platform
supports, at runtime
• Manual dispatch will be coming in a future
release…
Compile for all of the main targets
• SSE4, AVX2, AVX512
• This will allow the best performing ISA to run
on your system
• Unreal Engine and OSPRay compile for all of
the main targets by default.
Auto dispatch : multi-target compilation
--target=sse4-i32x4,avx2-i32x8,avx512skx-i32x16
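The selection rule behind auto dispatch ("the highest supported compiled target") can be sketched as follows. This is purely illustrative: ISPC generates its own dispatch code, and the enum `Isa`, the function `pick_target`, and the way CPU capability is passed in are all assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

// Ordered from least to most capable, matching the three main targets.
enum class Isa { SSE4 = 0, AVX2 = 1, AVX512 = 2 };

// Picks the most capable compiled target that does not exceed what the CPU
// supports. Returns false when no compiled target can run on this CPU.
inline bool pick_target(const std::vector<Isa>& compiled, Isa cpu_max,
                        Isa& chosen) {
    bool found = false;
    for (Isa candidate : compiled) {
        if (candidate > cpu_max) {
            continue;  // the CPU cannot execute this target
        }
        if (!found || candidate > chosen) {
            chosen = candidate;
            found = true;
        }
    }
    return found;
}
```

Compiling for all three targets means an AVX2-only machine still gets the AVX2 path, while a build containing only SSE4 and AVX512 would fall back to SSE4 on that machine.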
42. Advanced ISPC
ISPC StdLib
Intel® OSPRay : OSPRay’s path tracer supports physically-based materials and a common principled material
43. ISPC STDLIB
Use ISPC stdlib
ISPC provides a rich stdlib of operations:
• Logical operators
• Bit ops
• Math
• Clamping and Saturated Arithmetic
• Transcendental Operations
• RNG (Not the fastest!)
• Mask/Cross-lane Operations
• Reductions
• And that’s not all!
https://github.com/ispc/ispc/blob/master/stdlib.ispc
44. Advanced ISPC
Floating Point Determinism
Epic Chaos Demo - Image courtesy of Epic Games®
45. To increase floating point precision/determinism :
• Don’t use `--opt=fast-math`
• Do use `--opt=disable-fma`
• But, there will be a performance penalty
Floating Point Determinism
A Quick note!
46. Advanced ISPC
Debugging and Optimizing ISPC Kernels
Epic Chaos Demo - Image courtesy of Epic Games®
47. • Compile ISPC kernels with -g
• Visual Studio, gdb, lldb, etc.
work as expected
• View registers, uniform and
varying data
• Visual Studio Code ISPC
Plugin available
• Syntax highlights, Auto-
complete stdlib, Real-time
validation
Debugging ISPC Kernels
Debugging
48. • The best way to check for performance deltas when optimising code is to
benchmark it
• Sometimes the code of interest is too small, so you need a microbenchmark
• A small ISPC kernel run many times, ideally on real data
• Caution as the results may not be representative of the final gains
• ISPC git repo will soon contain a microbenchmark `ispc-bench`
• Based on google benchmark
• Simple to use and augment
• ISPC Dev team are looking for contributions to help improve ISPC
Optimising ISPC kernels
Benchmarking
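A microbenchmark in the sense described above can be as simple as timing many runs of a small kernel and averaging. The sketch below uses only the C++ standard library and is not the `ispc-bench` harness or Google Benchmark. The function name `average_nanoseconds` is hypothetical, and the callable stands in for a call into an exported ISPC kernel.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `kernel` `iterations` times and returns the mean wall-clock time
// per call in nanoseconds (0.0 when `iterations` is zero).
template <typename Kernel>
double average_nanoseconds(Kernel&& kernel, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        kernel();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return iterations ? elapsed.count() / static_cast<double>(iterations)
                      : 0.0;
}
```

Timing the same kernel before and after a change, on real data, gives the performance delta. As noted above, results from such a small loop may not carry over to the full application.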
49. Optimising ISPC kernels
ISPC is supported by the Compiler Explorer
• Simply copy and paste your kernels into a browser
• Try different command line arguments
• Look for optimization opportunities in the ASM code
• Experiment with all of the example code from this presentation
• Now supports using ispc (trunk)
Godbolt Compiler Explorer
http://ispc.godbolt.org/
50. Optimising ISPC kernels
• LLVM-MCA provides static code uOp/cycle counts
• Doesn’t accurately report the cost of memory ops, but still useful
Godbolt Compiler Explorer : llvm-mca
https://ispc.godbolt.org/z/etmC_T
51. Optimising ISPC kernels
• Profile your ispc kernels looking for hotspots
• Compile the kernels with -g for debugging symbols
• ISPC heavily inlines, so use ‘noinline’ to target hotspot functions
VTune
https://software.intel.com/en-us/vtune
52. Advanced ISPC
ISPC Roadmap
Intel® OSPRay : Disney’s Moana Island Scene: over 15 billion instanced primitives rendered interactively
53. ISPC Roadmap
ISPC v1.12
• ARM support
• Cross compilation support
(iOS/Android/Switch/Xbox/PS4)
• Noinline keyword
• Performance improvements
ISPC v1.next
• Performance improvements
• Future hardware support
• Manual dispatch
ISPC roadmap
File an issue on github – let us know what you need!
Submit a patch – show us what you need!
54. Advanced ISPC
ISPC Resources
Intel® OSPRay : OSPRay’s path tracer supports physically-based materials and a common principled material
55. ISPC Resources
ISPC Home Page
• https://ispc.github.io/ispc.html
ISPC Origins
• https://pharr.org/matt/blog/2018/04/18/ispc-origins.html
ISPC on Intel® Developer Zone
• https://software.intel.com/en-us/search/site/language/en?query=ispc
Visual Studio Code ISPC Plugin
• https://marketplace.visualstudio.com/items?itemName=intel-corporation.ispc
ISPC Compiler Explorer
• https://ispc.godbolt.org/
Intel® Intrinsics Guide
• https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Agner Fog Instruction Tables
• https://www.agner.org/optimize/instruction_tables.pdf
uOps Latency, Throughput and Port Usage Information
• http://uops.info/
ISPC Github
• https://github.com/ispc/ispc/
Intel® OSPRay
• https://www.ospray.org/
Unreal Engine
• https://www.unrealengine.com/en-US/
ISPC Texture Compressor
• https://github.com/GameTechDev/ISPCTextureCompressor
ISPC DX12 nBodies Sample
• https://github.com/GameTechDev/ISPC-DirectX-Graphics-Samples
SPIRV to ISPC Project
• https://github.com/GameTechDev/SPIRV-Cross
ISPC in Unreal Engine Blog Post
• https://software.intel.com/en-us/articles/unreal-engines-new-chaos-physics-system-screams-with-in-depth-intel-cpu-optimizations
ISPC on the web