Learn how Intel worked with Pixar Animation Studios* and Sony Imageworks* to realize dynamic SIMD code generation of Open Shading Language shader networks, achieving 3-9x speedups with Intel® AVX-512.
Unleashing Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Inside the Open Shading Language | SIGGRAPH 2018 Tech Session
1. Presenter: Stephen Friedman (Pixar Animation Studios)
Authors: Alex Wells (Intel),
Max Liani & Stephen Friedman (Pixar Animation Studios),
Larry Gritz (Sony Pictures Imageworks)
Contributors: Steena Monteiro & Louis Feng (Intel)
August 16, 2018
3. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
6. Shading Networks
Develop reusable shading nodes
Connect nodes to define complex
materials
Production shading networks can grow very large, to hundreds or thousands of nodes.
7. C++ Shader Limitations
Lack of context at compile time
Input parameters unknown
Geometry being shaded unknown
Mode of shading unknown
Surrounding shading network unknown
Branchy testing required
Lack of portability
Requires “Performance Ninjas”
Image Credit: Ninja Working AT Desk from Vector.me (by Hector Gomez)
8. Open Shading Language
Developed by Sony Pictures Imageworks*
C-like DSL for programmable shading
API to connect shaders into networks
Open source
http://github.com/imageworks/OpenShadingLanguage
Sci-Tech Award* in 2017
Logo owned by the Academy of Motion Picture Arts and Sciences.
*Other names and brands may be claimed as the property of others.
9. Poster images (c) Sony Pictures*, Paramount*, Warner Brothers*, Disney*, Fox*, Universal*
10. Example OSL Shader (marble)
shader marble (color Cin = .5,
float freq = 1.0,
output color Cout = 0)
{
float sum = 0;
float freqVal = freq;
point Pshad = transform ("object", P);
for (int i = 0; i < 6; i++)
{
sum = sum + 1/freqVal * abs(.5 - noise(4 * freqVal * Pshad));
freqVal = 2 * freqVal;
}
Cout = Cin * sum;
}
Shader Globals (input set by renderer)
Library Calls
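For readers more at home in C++, the octave loop of the marble shader above can be transliterated directly; `fake_noise` below is only a smooth stand-in for OSL's library `noise()` (the real implementation is Perlin noise), so the numeric results are illustrative only:

```cpp
#include <cassert>
#include <cmath>

// Stand-in for OSL's noise(): any smooth function works for
// illustrating the octave accumulation; this is NOT Perlin noise.
inline float fake_noise(float x) { return 0.5f + 0.5f * std::sin(x); }

// C++ transliteration of the marble octave loop: each octave doubles
// the frequency and scales down the contribution accordingly.
float marble_sum(float p, float freq) {
    float sum = 0.0f;
    float freqVal = freq;
    for (int i = 0; i < 6; ++i) {
        sum += 1.0f / freqVal * std::fabs(0.5f - fake_noise(4.0f * freqVal * p));
        freqVal = 2.0f * freqVal;
    }
    return sum;
}
```

The final output color is then simply `Cin * marble_sum(P, freq)`, matching the shader's `Cout = Cin * sum;`.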
11. oslc (offline compiler)
Shader written in OSL -> Intermediate OSO (instructions + operands)
OSL Runtime Library
Renderer (Pixar's RenderMan*, Autodesk Arnold*, Blender*):
Scene Management, Ray Tracing/Path Tracing, Light Integration
OSL Runtime:
Build Shading Network (callbacks)
Render Time Optimization with LLVM* JIT (Just-In-Time Compilation)
Execute Shading Network (per point) -> Optimized x86
Query Outputs
12. Complexity
Image (c) 21st Century Fox
173 billion shader invocations
> 16 hours to execute (single-threaded)
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
13. OSL Render Time Optimization
280M ops -> 2.68M (-99.0%)
161M symbols -> 1.9M (-98.8%)
150,612 empty shader instances, 63% optimized away
Image (c) 21st Century Fox
99% reduction of operations
Can outperform precompiled C++ shaders
(mostly because of Render Time optimization)
14. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
15. OSL Scalability Limitations
Single sample execution
limited opportunities to leverage SIMD
high execution cost
Block vectorization using Intel® Streaming SIMD Extensions (Intel® SSE)
only 4-wide
very limited support (noise functions, texturing)
No benefits from modern 8/16-wide CPUs
Image (c) Pixar Animation Studios
17. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
20. Interface to Renderer
Shading System:
execute(ShaderGlobals, ...): submit single point
symbol_address(...): query results
New "batched" interface:
execute_batch(ShaderGlobalsBatch, ...): submit batch of points
symbol_batch_accessor(...): query batch of results
ShaderGlobalsBatch:
Uniform: context *'s, Raytype, ...
Queue of Varying: Surface Position, Incident Ray, Surface Normal, ...
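The split between uniform and varying shader globals, and the queue-like batch, might be sketched roughly as follows; all type and field names here (WidthT, the push interface) are illustrative assumptions, not OSL's actual declarations:

```cpp
#include <array>
#include <cassert>

constexpr int WidthT = 16;  // SIMD batch width (AVX-512: 16 floats)

struct Vec3 { float x, y, z; };

// Uniform data: shared by every point in the batch.
struct UniformShaderGlobals {
    const void *context = nullptr;  // renderer context pointer
    int raytype = 0;
};

// Varying data: one slot per SIMD lane, stored structure-of-arrays.
struct VaryingShaderGlobals {
    std::array<Vec3, WidthT> P;  // surface position
    std::array<Vec3, WidthT> I;  // incident ray
    std::array<Vec3, WidthT> N;  // surface normal
};

// Queue-like batch: the renderer pushes varying points until full,
// then submits the whole batch with execute_batch(...).
struct ShaderGlobalsBatch {
    UniformShaderGlobals uniform;
    VaryingShaderGlobals varying;
    int size = 0;

    bool full() const { return size == WidthT; }

    void push(const Vec3 &P, const Vec3 &I, const Vec3 &N) {
        assert(!full());
        varying.P[size] = P;
        varying.I[size] = I;
        varying.N[size] = N;
        ++size;
    }
};
```

The design point is that uniform fields are stored once per batch, while varying fields get one SOA slot per lane.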
22. Accessing Varying Data
Inspired by techniques from Intel’s SIMD Data Layout Templates
void my_callback(ConstWideAccessor<float> wScale,
                 ConstWideAccessor<Matrix44> wM,
                 WideAccessor<Vec3> wVS,
                 MaskedAccessor<Vec3> wVT) {
    for (int i = 0; i < wVS.width; ++i) {
        Vec3 V = wVS[i];
        float F = wScale[i];
        Matrix44 M = wM[i];
        wVS[i] = V*F;
        wVT[i] = transform(M, V);
    }
}
Array subscript returns a proxy object to that lane
Accessors are a transparent AOS view of SOA data
Extract data from a lane of the SOA
Skips assignment if lane masked off
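One way such an accessor can be realized, sketched here under assumed names (the real OSL/SDLT types differ): `operator[]` returns a lane proxy whose conversion operator gathers an AOS value out of the SOA arrays, and whose assignment operator scatters it back, silently skipping masked-off lanes:

```cpp
#include <cassert>
#include <cstdint>

constexpr int WidthT = 8;

struct Vec3 { float x, y, z; };

// SOA storage for a 3-float vector across WidthT lanes.
struct WideVec3 {
    float x[WidthT], y[WidthT], z[WidthT];
};

// Masked accessor: looks like an array of Vec3, but is backed by SOA
// storage, and writes honor a per-lane mask.
class MaskedVec3Accessor {
    WideVec3 &data_;
    uint32_t mask_;
public:
    static constexpr int width = WidthT;
    MaskedVec3Accessor(WideVec3 &d, uint32_t mask) : data_(d), mask_(mask) {}

    class LaneProxy {
        WideVec3 &d_; int lane_; bool active_;
    public:
        LaneProxy(WideVec3 &d, int lane, bool active)
            : d_(d), lane_(lane), active_(active) {}
        // Extract: AOS value assembled from the SOA arrays.
        operator Vec3() const { return {d_.x[lane_], d_.y[lane_], d_.z[lane_]}; }
        // Assign: skipped entirely when the lane is masked off.
        LaneProxy &operator=(const Vec3 &v) {
            if (active_) { d_.x[lane_] = v.x; d_.y[lane_] = v.y; d_.z[lane_] = v.z; }
            return *this;
        }
    };

    LaneProxy operator[](int lane) {
        return LaneProxy(data_, lane, (mask_ >> lane) & 1u);
    }
};
```

The caller programs against plain `Vec3` values while the backing store stays SIMD-friendly.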
23. BatchedRendererServices
Uniform texture binning: texture("MyTex", u, v); has no overhead.
Varying texture file example:
if (layer == 1)
file = "r.tex";
else if (layer == 2)
file = "g.tex";
else if (layer == 3)
file = "b.tex";
texture(file, u, v);
layer = 3 3 1 2 1 1 2 1 2 2 2 2 3 3 3 1
JIT'd binning groups lanes by file into a mask per unique file, then issues one masked call each:
texture("r.tex", ...);
texture("g.tex", ...);
texture("b.tex", ...);
Full flexibility
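The binning step itself is simple to sketch: group the active lanes by their filename so that each unique texture is invoked once with a lane mask. Names and types here are illustrative, not the actual BatchedRendererServices API:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

constexpr int WidthT = 16;

// Given a batch of per-lane texture filenames and the currently active
// lane mask, build one mask per unique filename so the texture system
// is called once per bin instead of once per lane.
std::map<std::string, uint32_t>
bin_texture_files(const std::array<std::string, WidthT> &file,
                  uint32_t active_mask) {
    std::map<std::string, uint32_t> bins;
    for (int lane = 0; lane < WidthT; ++lane)
        if ((active_mask >> lane) & 1u)
            bins[file[lane]] |= (1u << lane);  // set this lane's bit in its bin
    return bins;
}
```

Each resulting (filename, mask) pair maps to one masked `texture(...)` call, which is why the uniform case degenerates to a single unmasked call with no overhead.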
24. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
25. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• …
• Utilizing SIMD as a Language Author
• Uniform Computation Optimization
• Control Flow Management
• Mapping OSL to Assembly
• OSL Library SIMD Implementations
• …
26. Uniform vs Varying Variables
In some other high-performance shading languages, it is left to the programmer to identify which variables and parameters are uniform or varying, using keywords:
• Pixar’s RenderMan* Shading Language (RSL)
• https://renderman.pixar.com/resources/RenderMan_20/shadingLanguage.html
• Intel SPMD Program Compiler
• https://ispc.github.io/
• OpenMP*
• https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
Furthermore, multiple versions of functions may need to be created to handle different
combinations of uniform vs. varying.
varying float patctx = 0; /* initialize the context */
varying float f = gridpattern("concentric", patctx);
for (uniform int j = 0; j < height; j++)
#pragma omp declare simd uniform(a) linear(b:1)
somefunc(float a, float * b, float c);
27. Hide Uniform vs. Varying from the User
• Goals:
• No changes to shader source/.oso files.
• Allow shader authors to leverage SIMD hardware without added complexity.
• Leverage common operations across batches of shading points.
• Implications:
• Can't add new keywords (uniform, varying, forall)
• Must leverage uniform computations when possible for:
• data layout
• control flow
• code generation
• New interfaces to allow renderers to leverage uniform computations
28. Hiding Uniform vs. Varying: Is It Possible?
• We can leverage domain specific restrictions
• No external library functions
• User functions are part of the shader source/.oso
• Well defined contracts of varying/uniform in OSL library and
RendererServices
YES WE CAN!
29. Identifying Uniform vs Varying
• Variables are uniform until proven varying
• Varying is proven by tracing dependence from known-varying shader globals
point Pshad = transform ("object", P);
for (int i = 0; i < 6; i++)
{
sum = sum + 1/freqVal * abs(.5 - noise(4 * freqVal * Pshad));
freqVal = 2 * freqVal;
}
Cout = Cin * sum;
}
P is a varying Shader Global
freqVal and i are uniform because they have no dependency on a varying Shader Global
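The "uniform until proven varying" analysis can be sketched as a fixed-point propagation over the instruction list: a symbol becomes varying as soon as any of its inputs is varying. This is a simplified illustration (the real analysis also handles control flow, loops, and connected parameters):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One .oso-style instruction: a result symbol plus argument symbols.
struct Op {
    std::string result;
    std::vector<std::string> args;
};

// Symbols start uniform; anything (transitively) fed by a varying
// shader global, e.g. "P", is promoted to varying.
std::set<std::string>
find_varying(const std::vector<Op> &ops, std::set<std::string> varying) {
    bool changed = true;
    while (changed) {  // iterate to a fixed point
        changed = false;
        for (const Op &op : ops) {
            if (varying.count(op.result)) continue;
            for (const std::string &a : op.args) {
                if (varying.count(a)) {
                    varying.insert(op.result);
                    changed = true;
                    break;
                }
            }
        }
    }
    return varying;
}
```

Run on the marble example, seeding only P as varying: Pshad, sum, and Cout are promoted, while freqVal (which depends only on the uniform parameter freq) stays uniform.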
30. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• …
• Utilizing SIMD as a Language Author
• Uniform Computation Optimization
• Control Flow Management
• Mapping OSL to Assembly
• OSL Library SIMD Implementations
• …
35. The Starting Line: marble single sample execute
95% of the time is spent in the OSL library
"_2" is the JIT'd marble shader
36. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• …
• Utilizing SIMD as a Language Author
• Uniform Computation Optimization
• Control Flow Management
• Mapping OSL to Assembly
• OSL Library SIMD Implementations
• …
38. The Finish Line: marble batch execute
Wide version of noise:
4x speedup
JIT of marble.osl:
13.2x speedup
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with
Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific
instruction sets covered by this notice.
39. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
56. Configurations

Config 1:
  Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  Core(s) per socket: 20; Socket(s): 2
  Memory: 192GB, DDR4-2666 MHz (12 x 16GB)
  CPU power policy: Performance
  Hyperthreading: Enabled; Turbo Boost: Enabled
  L1d/L1i: 32K/32K; L2: 1024K; L3: 28160K
  OS: RHEL 7.4
  BIOS: SE5C620.86B.00.01.0009.101920170742

Config 2:
  Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  Core(s) per socket: 20; Socket(s): 2
  Memory: 192GB, DDR4-2666 MHz (12 x 16GB)
  CPU power policy: powersave
  Hyperthreading: Enabled; Turbo Boost: Enabled
  L1d/L1i: 32K/32K; L2: 1024K; L3: 28160K
  OS: CentOS Linux release 7.3.1611 (Core)
  BIOS: SE5C620.86B.01.00.0412.020920172159

Config 3:
  Model name: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
  Core(s) per socket: 18; Socket(s): 2
  Memory: 128GB, DDR4-2400 MHz (8 x 16GB)
  CPU power policy: Performance
  Hyperthreading: Enabled; Turbo Boost: Enabled
  L1d/L1i: 32K/32K; L2: 256K; L3: 46080K
  OS: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  BIOS: GRRFSDP1.86B0271.R00.1510301446

• All non-interactive tests run on a single socket of these configurations
• Expected environment in render farms
57. OSL Shaders
• Concrete - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/concrete.osl
  • Modifications:
    < float grain=noise("gabor",p,8,"bandwidth",4,"anisotropic",2,"direction",vector(SandDensity,0,0));
    ---
    > float grain=noise("gabor",p,8);
• Leopard - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/leopard.osl
• Diamond plate - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/diamondplateshader.osl
• Thread - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/Threads.osl
• Donut - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/TheDonutShader.osl
• Oak - https://renderman.pixar.com/forum/download.php
  • Pixar's RenderMan* examples ./scenes/pattern/osl/shaders/oak.osl
• Marble - https://renderman.pixar.com/forum/download.php
  • Pixar's RenderMan* examples ./scenes/pattern/osl/shaders/marble.osl
59. Minimizing Masked Instructions
surface test_conditional_masking(output color ResultRGB = 0) {
if (P[0] > 0.5) {
if (P[1] > 0.5) {
float powB = pow(P[2], 5.3);
float g;
float inv_g;
if (powB > 0.75) {
inv_g = 1.0/P[1];
inv_g = inv_g*inv_g;
g = smoothstep(P[0],P[2],inv_g);
} else {
float in_red = P[0];
float inv_r = 1.0/in_red;
g = noise("perlin",inv_r);
}
ResultRGB[1] = g;
}
}
}
Implicit read of output ResultRGB
Read with conditional mask that is not a subset of the last write
Read with logical mask that is not a subset of a write
Track assignments
Require masking of assignments
Track logical mask
Editor's Notes
Not rendering final color, just figuring out diffuse color, surface reflection, or any other property of a material.
A Renderer does ray-tracing and light integration but needs these properties from a material given the position of the ray/object intersection in space and the incoming ray orientation, surface normal and other input values. We refer to this process as Shading.
Historically the definition of each node was done in a static programming language (like C++) and execution flow through the graph to produce the requested outputs from the graph.
Source code, nodes instantiation and nodes connections is described using the OSL ShadingSystem API.
Note the different parameters that the "2d Texture Placement" node has to handle; the underlying code has to be able to handle any combination of those settings, and the resulting code is often not well optimized.
Designed for physically based rendering
patterns
compute radiance closures (BxDFs), not view-dependent final colors
no ray tracing, sampling, integrations, light loops (these are in the renderer)
Efficient execution
JIT to machine code, extensive runtime optimization
Shading networks with lazy evaluation
It enables artists at all levels of technical proficiency to create physically plausible materials for efficient production rendering.
Wide industry adoption
Input & Output parameters
Shader Globals (how renderer passes in position, surface normal, ray direction, etc.) for the sample they want evaluated by the shader network.
NOTE: single precision floating point, 32 bits
Not frames per second, but hours or days per frame.
We will rely on LLVM to lower vector operations to capabilities of underlying architecture, which for AVX512 is pretty simple.
Worst case is a loop is generated or multiple instructions are issued to satisfy the logical vector operations
Split globals into different structs based on uniformity
ShaderGlobalBatched contains both the Uniform & Varying Shader Globals
Provides a queue like interface to set the values of the next varying instance and push it into the queue.
Renderer needs to use new interfaces and support wide callbacks
For VaryingShaderGlobals as well as callbacks through the renderer, we want a SIMD friendly data layout. This layout isn’t always convenient to code against, and we would rather just program against the original Vec3 or other existing data types
For VaryingShaderGlobals as well as callbacks through the renderer.
Accessors just look like an array of the data type, but under the hood is a Wide SOA version backing it.
NOTE: assignment to a MaskedAccessor will transparently apply the mask skipping assignment to the data lane.
All textures parameters in a batch could all be varying
Texture Subsystem doesn’t want to deal with every option being different per data lane.
One more complexity for artists to deal with; they may not get it right, causing sub-optimal performance.
So we know exactly which functions depend on ShaderGlobals that may be varying and which don’t.
This is true for RendererServices as well.
Because we don’t optimize until after a shader network is built, we can actually follow variables through connected parameters all the way back to their origin (another thing you can’t do in a traditional programming language).
The upshot is that we can follow the dependency chains to automatically identify all variables whose values are in some way dependent upon a varying ShaderGlobal.
This includes implicit temporaries that exist in chains of operations.
NOTE: analysis of loops is more complex, because break, continue, return, or exit that happen in a non-uniform conditional branch can cause a loop control to be promoted from uniform to varying.
We can blend together results from branches with a bit mask.
LLVM has a “Select” operation to choose the original
System to track logical masks based on varying conditional operations
Added analysis to identify which stores need to be masked
During generation of LLVM IR, keep a stack of non-uniform conditional results (masks are <16 x i1>)
When a mask is pushed onto the stack, it is first combined with the mask already on the top of the stack.
Handle the "else" of a conditional by tracking a "negated" flag in the stack instead of negating the mask.
When blending, we can just reverse the order of the blend instead of issuing extra instructions to negate the mask.
This gets much more complicated for loop control, early return, break, continue, and exit operations.
We added support to hook up the profiling from the OSL runtime to LLVM with Intel JIT profiling enabled, so we could actually see the dynamic code generation from inside VTune.
Special Build of LLVM
-DLLVM_USE_INTEL_JITEVENTS=1
Modified OSL to enable debug info in the LLVM JIT module
emit LLVM debug info and locations as OSL operations are generated
Now we can:
profile
map OSL to assembly with VTune
We can run GDB with full line table and callstack through inlined OSL function calls viewing the OSL shader and assembly
Non batched execution of marble.osl.
Just wanted to highlight that fully SIMD version of the LLVM IR alone would not help too much. We need everything possible in SIMD including the built-in library calls
The scalar version of this perlin noise computation had actually been optimized to perform block vectorization within the algorithm using SSE2 intrinsics.
To perform outer-loop vectorization we needed to remove these intrinsics and revert to the original C++ version of the algorithm; to avoid hurting the original version's performance, we made a new helper function "perlin_scalar".
Now the wide version is very similar, except its data types are our WideAccessors, which, via an array subscript, can import/export the data type to and from the underlying SOA data layout.
We explicitly declare our outer loop to be SIMD using an OpenMP 4 #pragma and specify the width; this tells the compiler "I the programmer declare that each iteration of this loop can operate in parallel and unordered." Now the compiler can emit SIMD code and know it is legal, because "we said it was"; no better logic than that!
Inside the loop we export the data for the current lane, perform the scalar computation, then import the results for the lane.
Also note that the actual scalar computation is not aware of our data layout or our outer loop.
Once it's all inlined, the compiler can produce strikingly good code for multiple target ISAs (SSE2, AVX, AVX2, AVX-512, etc.)
Inspect the optimization reports from the compiler to check on success and quality of code generation.
If the compiler ran into issues when vectorizing it will tell you there. (example: could vectorize but would be inefficient)
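The wide-wrapper pattern described in these notes might look roughly like this; `scalar_op` is a stand-in for the real scalar helper (e.g. "perlin_scalar"), and the exact pragma spelling is an assumption about the build:

```cpp
#include <cassert>
#include <cmath>

constexpr int WidthT = 16;

// SOA storage: one float per SIMD lane.
struct WideFloat { float v[WidthT]; };

// Stand-in scalar computation; the real code would call the original
// scalar noise implementation, unaware of any data layout.
inline float scalar_op(float x) { return 0.5f - std::fabs(0.5f - x); }

// Wide version: the outer loop is declared SIMD so the compiler may
// vectorize across lanes; each iteration is independent and unordered.
void wide_op(const WideFloat &in, WideFloat &out) {
#pragma omp simd simdlen(16)
    for (int lane = 0; lane < WidthT; ++lane) {
        float x = in.v[lane];        // export the lane from SOA
        out.v[lane] = scalar_op(x);  // scalar computation, layout-agnostic
    }
}
```

Without OpenMP enabled the pragma is simply ignored and the loop still computes the same results, just without the vectorization guarantee.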
To ensure proper inlining, on the Intel® C++ Compiler we can use "#pragma forceinline recursive".
Note the other “wide” functions may also be vectorized and enjoy the reduced overhead of being called up to 16 times fewer than the non-batched interface needs to call them.
Intent is to show high performance potential but exemplify how that scales down with batch utilization.
As batch utilization is partially in the renderer's hands, renderers should work to improve their batches to reach top performance.
The actual shader could be taking more/less paths because we skip code blocks when lanes are completely masked off.
As we increase active SIMD lanes, the chance of skipping a branch of code goes down.
So skipping branches is more effective with small batch sizes, and we are likely executing more code blocks of a shader as the active number of SIMD lanes increases.
This can cause non-linear performance vs. active SIMD lanes for a shader.
Sometimes a shader can be slower at low batch utilization, but there is usually a point at which batching becomes more profitable.
We might be able to better optimize the used OSL library API’s to take a different code path when batch utilization is low, which could have the effect of improving low batch utilization performance
Concrete uses expensive gabor noise and enjoys a hyper speedup from the improved implementation
100% batch utilization of 16 points.
When I run to convergence, the frame takes 2:20 with scalar OSL and 1:02 with AVX-512.
Discuss batch utilization from the Renderer’s usage vs. batch utilization within shaders due to control flow divergence.
Discuss the ray bounces hitting disparate shading networks with possibly smaller and smaller numbers of rays. We can see that as the number of bounces is decreased, the batch utilization increases and so does the Shading System's speedup.
Discuss batch utilization from the Renderer’s usage vs. batch utilization within shaders due to control flow divergence.
quality of code generation for IA
reducing JIT compile time (pain point)
If your renderer doesn’t use OSL, you might want to consider adding support.
Not all renderers are capable of generating batches of material requests, your renderer might need rework to operate on batches. OSL with Scalable SIMD execution can be your shading system providing good ROI for updating the rest of your renderer.
We track the place in the logical mask stack at which each assignment happens for each operation.
When a symbol is read, we can compare the current logical mask and determine if it is a subset of the mask of any operations that wrote to the symbol.
If it is not a subset, then that assignment operation will need to masked.
For assignment operations that have been identified earlier as "requires masking", just use the mask on the top of the stack to select the correct value.
"select" is the LLVM IR instruction we use to blend values together based on a mask.
IMPORTANT NOTE: We don't need to execute masked versions of smoothstep or noise; we just need to mask the assignment of their results.
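Per lane, the select-based masked assignment amounts to the following scalar-equivalent sketch (the JIT emits a single vector select over all lanes rather than a loop):

```cpp
#include <cassert>
#include <cstdint>

constexpr int WidthT = 8;

// Lane-by-lane equivalent of LLVM's "select" on a wide value: lanes
// with a set mask bit take the newly computed value, the rest keep
// the previous contents of the destination. The computation producing
// src runs unmasked; only this assignment is masked.
void masked_assign(float dst[WidthT], const float src[WidthT], uint32_t mask) {
    for (int lane = 0; lane < WidthT; ++lane)
        dst[lane] = ((mask >> lane) & 1u) ? src[lane] : dst[lane];
}
```

This is why functions like smoothstep or noise never need masked variants: their full-width results are simply blended into the destination under the current mask.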