Learn how Intel worked with Pixar Animation Studios* and Sony Imageworks* to realize dynamic SIMD code generation of Open Shading Language shader networks, achieving 3-9x speedups with Intel® AVX-512.
Unleashing Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Inside the Open Shading Language | SIGGRAPH 2018 Tech Session
1. Presenter: Stephen Friedman (Pixar Animation Studios)
Authors: Alex Wells (Intel),
Max Liani & Stephen Friedman (Pixar Animation Studios),
Larry Gritz (Sony Pictures Imageworks)
Contributors: Steena Monteiro & Louis Feng (Intel)
August 16, 2018
3. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
6. Shading Networks
Develop reusable shading nodes
Connect nodes to define complex
materials
Production shading networks can grow very large, to hundreds or thousands of nodes.
7. C++ Shader Limitations
Lack of context at compile time
Input parameters unknown
Geometry being shaded unknown
Mode of shading unknown
Surrounding shading network unknown
Branchy testing required
Lack of portability
Requires “Performance Ninjas”
Image Credit: Ninja Working AT Desk from Vector.me (by Hector Gomez)
8. Open Shading Language
Developed by Sony Pictures Imageworks*
C-like DSL for programmable shading
API to connect shaders into networks
Open source
http://github.com/imageworks/OpenShadingLanguage
Sci-Tech Award* in 2017
Logo owned by the Academy of Motion Picture Arts and Sciences.
*Other names and brands may be claimed as the property of others.
9. Poster images (c) Sony Pictures*, Paramount*, Warner Brothers*, Disney*, Fox*, Universal*
10. Example OSL Shader (marble)
shader marble (color Cin = .5,
float freq = 1.0,
output color Cout = 0)
{
float sum = 0;
float freqVal = freq;
point Pshad = transform ("object", P);
for (int i = 0; i < 6; i++)
{
sum = sum + 1/freqVal * abs(.5 - noise(4 * freqVal * Pshad));
freqVal = 2 * freqVal;
}
Cout = Cin * sum;
}
Shader Globals (input set by renderer)
Library Calls
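For readers more at home in C++, the octave loop of the marble shader above can be transliterated directly; `fake_noise` below is only a smooth stand-in for OSL's library `noise()` (the real implementation is Perlin noise), so the numeric results are illustrative only:

```cpp
#include <cassert>
#include <cmath>

// Stand-in for OSL's noise(): any smooth function works for
// illustrating the octave accumulation; this is NOT Perlin noise.
inline float fake_noise(float x) { return 0.5f + 0.5f * std::sin(x); }

// C++ transliteration of the marble octave loop: each octave doubles
// the frequency and scales down the contribution accordingly.
float marble_sum(float p, float freq) {
    float sum = 0.0f;
    float freqVal = freq;
    for (int i = 0; i < 6; ++i) {
        sum += 1.0f / freqVal * std::fabs(0.5f - fake_noise(4.0f * freqVal * p));
        freqVal = 2.0f * freqVal;
    }
    return sum;
}
```

The final output color is then simply `Cin * marble_sum(P, freq)`, matching the shader's `Cout = Cin * sum;`.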
11. oslc (offline compiler)
Shader written in OSL -> Intermediate OSO (instructions + operands)
OSL Runtime Library
Renderer (Pixar's RenderMan*, Autodesk Arnold*, Blender*):
Scene Management, Ray Tracing/Path Tracing, Light Integration
OSL Runtime:
Build Shading Network (callbacks)
Render Time Optimization with LLVM* JIT (Just-In-Time Compilation)
Execute Shading Network (per point) -> Optimized x86
Query Outputs
12. Complexity
Image (c) 21st Century Fox
173 billion shader invocations
> 16 hours to execute (single-threaded)
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
13. OSL Render Time Optimization
280M ops -> 2.68M (-99.0%)
161M symbols -> 1.9M (-98.8%)
150,612 empty shader instances, 63% optimized away
Image (c) 21st Century Fox
99% reduction of operations
Can outperform precompiled C++ shaders
(mostly because of Render Time optimization)
14. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
15. OSL Scalability Limitations
Single sample execution
limited opportunities to leverage SIMD
high execution cost
Block vectorization using Intel® Streaming SIMD Extensions (Intel® SSE)
only 4-wide
very limited support (noise functions, texturing)
No benefits from modern 8/16-wide CPUs
Image (c) Pixar Animation Studios
17. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
20. Interface to Renderer
Shading System:
execute(ShaderGlobals, ...): submit single point
symbol_address(...): query results
New "batched" interface:
execute_batch(ShaderGlobalsBatch, ...): submit batch of points
symbol_batch_accessor(...): query batch of results
ShaderGlobalsBatch:
Uniform: context *'s, Raytype, ...
Queue of Varying: Surface Position, Incident Ray, Surface Normal, ...
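The split between uniform and varying shader globals, and the queue-like batch, might be sketched roughly as follows; all type and field names here (WidthT, the push interface) are illustrative assumptions, not OSL's actual declarations:

```cpp
#include <array>
#include <cassert>

constexpr int WidthT = 16;  // SIMD batch width (AVX-512: 16 floats)

struct Vec3 { float x, y, z; };

// Uniform data: shared by every point in the batch.
struct UniformShaderGlobals {
    const void *context = nullptr;  // renderer context pointer
    int raytype = 0;
};

// Varying data: one slot per SIMD lane, stored structure-of-arrays.
struct VaryingShaderGlobals {
    std::array<Vec3, WidthT> P;  // surface position
    std::array<Vec3, WidthT> I;  // incident ray
    std::array<Vec3, WidthT> N;  // surface normal
};

// Queue-like batch: the renderer pushes varying points until full,
// then submits the whole batch with execute_batch(...).
struct ShaderGlobalsBatch {
    UniformShaderGlobals uniform;
    VaryingShaderGlobals varying;
    int size = 0;

    bool full() const { return size == WidthT; }

    void push(const Vec3 &P, const Vec3 &I, const Vec3 &N) {
        assert(!full());
        varying.P[size] = P;
        varying.I[size] = I;
        varying.N[size] = N;
        ++size;
    }
};
```

The design point is that uniform fields are stored once per batch, while varying fields get one SOA slot per lane.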
22. Accessing Varying Data
Inspired by techniques from Intel’s SIMD Data Layout Templates
void my_callback(ConstWideAccessor<float> wScale,
                 ConstWideAccessor<Matrix44> wM,
                 WideAccessor<Vec3> wVS,
                 MaskedAccessor<Vec3> wVT) {
    for (int i = 0; i < wVS.width; ++i) {
        Vec3 V = wVS[i];
        float F = wScale[i];
        Matrix44 M = wM[i];
        wVS[i] = V*F;
        wVT[i] = transform(M, V);
    }
}
Array subscript returns a proxy object to that lane
Accessors are a transparent AOS view of SOA data
Extract data from a lane of the SOA
Skips assignment if lane masked off
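One way such an accessor can be realized, sketched here under assumed names (the real OSL/SDLT types differ): `operator[]` returns a lane proxy whose conversion operator gathers an AOS value out of the SOA arrays, and whose assignment operator scatters it back, silently skipping masked-off lanes:

```cpp
#include <cassert>
#include <cstdint>

constexpr int WidthT = 8;

struct Vec3 { float x, y, z; };

// SOA storage for a 3-float vector across WidthT lanes.
struct WideVec3 {
    float x[WidthT], y[WidthT], z[WidthT];
};

// Masked accessor: looks like an array of Vec3, but is backed by SOA
// storage, and writes honor a per-lane mask.
class MaskedVec3Accessor {
    WideVec3 &data_;
    uint32_t mask_;
public:
    static constexpr int width = WidthT;
    MaskedVec3Accessor(WideVec3 &d, uint32_t mask) : data_(d), mask_(mask) {}

    class LaneProxy {
        WideVec3 &d_; int lane_; bool active_;
    public:
        LaneProxy(WideVec3 &d, int lane, bool active)
            : d_(d), lane_(lane), active_(active) {}
        // Extract: AOS value assembled from the SOA arrays.
        operator Vec3() const { return {d_.x[lane_], d_.y[lane_], d_.z[lane_]}; }
        // Assign: skipped entirely when the lane is masked off.
        LaneProxy &operator=(const Vec3 &v) {
            if (active_) { d_.x[lane_] = v.x; d_.y[lane_] = v.y; d_.z[lane_] = v.z; }
            return *this;
        }
    };

    LaneProxy operator[](int lane) {
        return LaneProxy(data_, lane, (mask_ >> lane) & 1u);
    }
};
```

The caller programs against plain `Vec3` values while the backing store stays SIMD-friendly.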
23. BatchedRendererServices
Uniform texture binning: texture("MyTex", u, v); has no overhead.
Varying texture file example:
if (layer == 1)
file = "r.tex";
else if (layer == 2)
file = "g.tex";
else if (layer == 3)
file = "b.tex";
texture(file, u, v);
layer = 3 3 1 2 1 1 2 1 2 2 2 2 3 3 3 1
JIT'd binning groups lanes by file into a mask per unique file, then issues one masked call each:
texture("r.tex", ...);
texture("g.tex", ...);
texture("b.tex", ...);
Full flexibility
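The binning step itself is simple to sketch: group the active lanes by their filename so that each unique texture is invoked once with a lane mask. Names and types here are illustrative, not the actual BatchedRendererServices API:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

constexpr int WidthT = 16;

// Given a batch of per-lane texture filenames and the currently active
// lane mask, build one mask per unique filename so the texture system
// is called once per bin instead of once per lane.
std::map<std::string, uint32_t>
bin_texture_files(const std::array<std::string, WidthT> &file,
                  uint32_t active_mask) {
    std::map<std::string, uint32_t> bins;
    for (int lane = 0; lane < WidthT; ++lane)
        if ((active_mask >> lane) & 1u)
            bins[file[lane]] |= (1u << lane);  // set this lane's bit in its bin
    return bins;
}
```

Each resulting (filename, mask) pair maps to one masked `texture(...)` call, which is why the uniform case degenerates to a single unmasked call with no overhead.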
24. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
25. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• …
• Utilizing SIMD as a Language Author
• Uniform Computation Optimization
• Control Flow Management
• Mapping OSL to Assembly
• OSL Library SIMD Implementations
• …
26. Uniform vs Varying Variables
In some other high-performance shading languages, it is left to the programmer to identify which variables and parameters are uniform or varying, using keywords:
• Pixar’s RenderMan* Shading Language (RSL)
• https://renderman.pixar.com/resources/RenderMan_20/shadingLanguage.html
• Intel SPMD Program Compiler
• https://ispc.github.io/
• OpenMP*
• https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
Furthermore, multiple versions of functions may need to be created to handle different
combinations of uniform vs. varying.
varying float patctx = 0; /* initialize the context */
varying float f = gridpattern("concentric", patctx);
for (uniform int j = 0; j < height; j++)
#pragma omp declare simd uniform(a) linear(b:1)
somefunc(float a, float * b, float c);
27. Hide Uniform vs. Varying from the User
• Goals:
• No changes to shader source/.oso files.
• Allow shader authors to leverage SIMD hardware without added complexity.
• Leverage common operations across batches of shading points.
• Implications:
• Can't add new keywords (uniform, varying, forall)
• Must leverage uniform computations when possible for:
• data layout
• control flow
• code generation
• New interfaces to allow renderers to leverage uniform computations
28. Hiding Uniform vs. Varying: Is It Possible?
• We can leverage domain specific restrictions
• No external library functions
• User functions are part of the shader source/.oso
• Well defined contracts of varying/uniform in OSL library and
RendererServices
YES WE CAN!
29. Identifying Uniform vs Varying
• Variables are uniform until proven varying
• Varying is proven by tracing dependence from known-varying shader globals
point Pshad = transform ("object", P);
for (int i = 0; i < 6; i++)
{
sum = sum + 1/freqVal * abs(.5 - noise(4 * freqVal * Pshad));
freqVal = 2 * freqVal;
}
Cout = Cin * sum;
}
P is a varying Shader Global
freqVal and i are uniform because they have no dependency on a varying Shader Global
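The "uniform until proven varying" analysis can be sketched as a fixed-point propagation over the instruction list: a symbol becomes varying as soon as any of its inputs is varying. This is a simplified illustration (the real analysis also handles control flow, loops, and connected parameters):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One .oso-style instruction: a result symbol plus argument symbols.
struct Op {
    std::string result;
    std::vector<std::string> args;
};

// Symbols start uniform; anything (transitively) fed by a varying
// shader global, e.g. "P", is promoted to varying.
std::set<std::string>
find_varying(const std::vector<Op> &ops, std::set<std::string> varying) {
    bool changed = true;
    while (changed) {  // iterate to a fixed point
        changed = false;
        for (const Op &op : ops) {
            if (varying.count(op.result)) continue;
            for (const std::string &a : op.args) {
                if (varying.count(a)) {
                    varying.insert(op.result);
                    changed = true;
                    break;
                }
            }
        }
    }
    return varying;
}
```

Run on the marble example, seeding only P as varying: Pshad, sum, and Cout are promoted, while freqVal (which depends only on the uniform parameter freq) stays uniform.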
30. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• …
• Utilizing SIMD as a Language Author
• Uniform Computation Optimization
• Control Flow Management
• Mapping OSL to Assembly
• OSL Library SIMD Implementations
• …
35. The Starting Line: marble single sample execute
95% of the time is spent in the OSL library
"_2" is the JIT'd marble shader
36. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• …
• Utilizing SIMD as a Language Author
• Uniform Computation Optimization
• Control Flow Management
• Mapping OSL to Assembly
• OSL Library SIMD Implementations
• …
38. The Finish Line: marble batch execute
Wide version of noise:
4x speedup
JIT of marble.osl:
13.2x speedup
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with
Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific
instruction sets covered by this notice.
39. Agenda
Taking Advantage of Modern SIMD in a Domain Specific Language
• Open Shading Language (OSL) Overview
• How Modern SIMD Can Improve OSL
• Utilizing SIMD as a Shading/Renderer/Language Author
• Reaping the Benefits of SIMD
• Moving Forward
56. Configurations

Config 1:
  Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  Core(s) per socket: 20; Socket(s): 2
  Memory: 192GB, DDR4-2666 MHz (12 x 16GB)
  CPU power policy: Performance
  Hyperthreading: Enabled; Turbo Boost: Enabled
  L1d/L1i: 32K/32K; L2: 1024K; L3: 28160K
  OS: RHEL 7.4
  BIOS: SE5C620.86B.00.01.0009.101920170742

Config 2:
  Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  Core(s) per socket: 20; Socket(s): 2
  Memory: 192GB, DDR4-2666 MHz (12 x 16GB)
  CPU power policy: powersave
  Hyperthreading: Enabled; Turbo Boost: Enabled
  L1d/L1i: 32K/32K; L2: 1024K; L3: 28160K
  OS: CentOS Linux release 7.3.1611 (Core)
  BIOS: SE5C620.86B.01.00.0412.020920172159

Config 3:
  Model name: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
  Core(s) per socket: 18; Socket(s): 2
  Memory: 128GB, DDR4-2400 MHz (8 x 16GB)
  CPU power policy: Performance
  Hyperthreading: Enabled; Turbo Boost: Enabled
  L1d/L1i: 32K/32K; L2: 256K; L3: 46080K
  OS: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  BIOS: GRRFSDP1.86B0271.R00.1510301446

• All non-interactive tests run on a single socket of these configurations
• Expected environment in render farms
57. OSL Shaders
• Concrete - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/concrete.osl
  • Modifications:
    < float grain=noise("gabor",p,8,"bandwidth",4,"anisotropic",2,"direction",vector(SandDensity,0,0));
    ---
    > float grain=noise("gabor",p,8);
• Leopard - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/leopard.osl
• Diamond plate - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/diamondplateshader.osl
• Thread - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/Threads.osl
• Donut - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/TheDonutShader.osl
• Oak - https://renderman.pixar.com/forum/download.php
  • Pixar's RenderMan* examples ./scenes/pattern/osl/shaders/oak.osl
• Marble - https://renderman.pixar.com/forum/download.php
  • Pixar's RenderMan* examples ./scenes/pattern/osl/shaders/marble.osl
59. Minimizing Masked Instructions
surface test_conditional_masking(output color ResultRGB = 0) {
if (P[0] > 0.5) {
if (P[1] > 0.5) {
float powB = pow(P[2], 5.3);
float g;
float inv_g;
if (powB > 0.75) {
inv_g = 1.0/P[1];
inv_g = inv_g*inv_g;
g = smoothstep(P[0],P[2],inv_g);
} else {
float in_red = P[0];
float inv_r = 1.0/in_red;
g = noise("perlin",inv_r);
}
ResultRGB[1] = g;
}
}
}
Implicit read of output ResultRGB
Read with conditional mask that is not a subset of the last write
Read with logical mask that is not a subset of a write
Track assignments
Require masking of assignments
Track logical mask
Editor's Notes
Not rendering final color, just figuring out diffuse color, surface reflection, or any other property of a material.
A Renderer does ray-tracing and light integration but needs these properties from a material given the position of the ray/object intersection in space and the incoming ray orientation, surface normal and other input values. We refer to this process as Shading.
Historically the definition of each node was done in a static programming language (like C++) and execution flow through the graph to produce the requested outputs from the graph.
Source code, nodes instantiation and nodes connections is described using the OSL ShadingSystem API.
Note the different parameters that the "2d Texture Placement" node has to handle; the underlying code has to be able to handle any combination of those settings, and the resulting code is often not well optimized.
Designed for physically based rendering
patterns
compute radiance closures (BxDFs), not view-dependent final colors
no ray tracing, sampling, integrations, light loops (these are in the renderer)
Efficient execution
JIT to machine code, extensive runtime optimization
Shading networks with lazy evaluation
It enables artists at all levels of technical proficiency to create physically plausible materials for efficient production rendering.
Wide industry adoption
Input & Output parameters
Shader Globals (how renderer passes in position, surface normal, ray direction, etc.) for the sample they want evaluated by the shader network.
NOTE: single precision floating point, 32 bits
Not frames per second, but hours or days per frame.
We will rely on LLVM to lower vector operations to capabilities of underlying architecture, which for AVX512 is pretty simple.
Worst case is a loop is generated or multiple instructions are issued to satisfy the logical vector operations
Split globals into different structs based on uniformity
ShaderGlobalBatched contains both the Uniform & Varying Shader Globals
Provides a queue like interface to set the values of the next varying instance and push it into the queue.
Renderer needs to use new interfaces and support wide callbacks
For VaryingShaderGlobals as well as callbacks through the renderer, we want a SIMD friendly data layout. This layout isn’t always convenient to code against, and we would rather just program against the original Vec3 or other existing data types
For VaryingShaderGlobals as well as callbacks through the renderer.
Accessors just look like an array of the data type, but under the hood is a Wide SOA version backing it.
NOTE: assignment to a MaskedAccessor will transparently apply the mask skipping assignment to the data lane.
All textures parameters in a batch could all be varying
Texture Subsystem doesn’t want to deal with every option being different per data lane.
One more complexity for artists to deal with; they may not get it right, causing sub-optimal performance.
So we know exactly which functions depend on ShaderGlobals that may be varying and which don’t.
This is true for RendererServices as well.
Because we don’t optimize until after a shader network is built, we can actually follow variables through connected parameters all the way back to their origin (another thing you can’t do in a traditional programming language).
The upshot is that we can follow the dependency chains to automatically identify all variables whose values are in some way dependent upon a varying ShaderGlobal.
This includes implicit temporaries that exist in chains of operations.
NOTE: analysis of loops is more complex, because break, continue, return, or exit that happen in a non-uniform conditional branch can cause a loop control to be promoted from uniform to varying.
We can blend together results from branches with a bit mask.
LLVM has a “Select” operation to choose the original
System to track logical masks based on varying conditional operations
Added analysis to identify which stores need to be masked
During generation of LLVM IR, keep a stack of non-uniform conditional results (masks are <16 x i1>)
When a mask is pushed onto the stack, it is first combined with the mask already on the top of the stack.
Handle the "else" of a conditional by tracking a "negated" flag in the stack instead of negating the mask.
When blending, we can just reverse the order of the blend instead of issuing extra instructions to negate the mask.
This gets much more complicated for loop control, early return, break, continue, and exit operations.
We added support to hook up the profiling from the OSL runtime to LLVM with Intel JIT profiling enabled, so we could actually see the dynamic code generation from inside VTune.
Special Build of LLVM
-DLLVM_USE_INTEL_JITEVENTS=1
Modified OSL to enable debug info in the LLVM JIT module
emit LLVM debug info and locations as OSL operations are generated
Now we can:
profile
map OSL to assembly with VTune
We can run GDB with full line table and callstack through inlined OSL function calls viewing the OSL shader and assembly
Non batched execution of marble.osl.
Just wanted to highlight that fully SIMD version of the LLVM IR alone would not help too much. We need everything possible in SIMD including the built-in library calls
The scalar version of this perlin noise computation had actually been optimized to perform block vectorization within the algorithm using SSE2 intrinsics.
To perform outer-loop vectorization we needed to remove these intrinsics and revert to the original C++ version of the algorithm; to avoid hurting the original version's performance, we made a new helper function "perlin_scalar".
Now the wide version is very similar, except its data types are our WideAccessors, which, via an array subscript, can import/export the data type to and from the underlying SOA data layout.
We explicitly declare our outer loop to be SIMD using an OpenMP 4 #pragma and specify the width; this tells the compiler "I the programmer declare that each iteration of this loop can operate in parallel and unordered." Now the compiler can emit SIMD code and know it is legal, because "we said it was"; no better logic than that!
Inside the loop we export the data for the current lane, perform the scalar computation, then import the results for the lane.
Also note that the actual scalar computation is not aware of our data layout or our outer loop.
Once it's all inlined, the compiler can produce strikingly good code for multiple target ISAs (SSE2, AVX, AVX2, AVX-512, etc.)
Inspect the optimization reports from the compiler to check on success and quality of code generation.
If the compiler ran into issues when vectorizing it will tell you there. (example: could vectorize but would be inefficient)
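The wide-wrapper pattern described in these notes might look roughly like this; `scalar_op` is a stand-in for the real scalar helper (e.g. "perlin_scalar"), and the exact pragma spelling is an assumption about the build:

```cpp
#include <cassert>
#include <cmath>

constexpr int WidthT = 16;

// SOA storage: one float per SIMD lane.
struct WideFloat { float v[WidthT]; };

// Stand-in scalar computation; the real code would call the original
// scalar noise implementation, unaware of any data layout.
inline float scalar_op(float x) { return 0.5f - std::fabs(0.5f - x); }

// Wide version: the outer loop is declared SIMD so the compiler may
// vectorize across lanes; each iteration is independent and unordered.
void wide_op(const WideFloat &in, WideFloat &out) {
#pragma omp simd simdlen(16)
    for (int lane = 0; lane < WidthT; ++lane) {
        float x = in.v[lane];        // export the lane from SOA
        out.v[lane] = scalar_op(x);  // scalar computation, layout-agnostic
    }
}
```

Without OpenMP enabled the pragma is simply ignored and the loop still computes the same results, just without the vectorization guarantee.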
To ensure proper inlining, on the Intel® C++ Compiler we can use "#pragma forceinline recursive".
Note the other “wide” functions may also be vectorized and enjoy the reduced overhead of being called up to 16 times fewer than the non-batched interface needs to call them.
Intent is to show high performance potential but exemplify how that scales down with batch utilization.
As batch utilization is partially in the renderer's hands, renderers should work to improve their batches to reach top performance.
The actual shader could be taking more/less paths because we skip code blocks when lanes are completely masked off.
As we increase active SIMD lanes, the chance of skipping a branch of code goes down.
So skipping branches is more effective with small batch sizes, and we are likely executing more code blocks of a shader as the active number of SIMD lanes increases.
This can cause non-linear performance vs. active SIMD lanes for a shader.
Sometimes a shader can be slower at low batch utilization, but there is usually a point at which batching becomes more profitable.
We might be able to better optimize the used OSL library API’s to take a different code path when batch utilization is low, which could have the effect of improving low batch utilization performance
Concrete uses expensive gabor noise and enjoys a hyper speedup from the improved implementation
100% batch utilization of 16 points.
When I run to convergence, the frame takes 2:20 with scalar OSL and 1:02 with AVX-512.
Discuss batch utilization from the Renderer’s usage vs. batch utilization within shaders due to control flow divergence.
Discuss the ray bounces hitting disparate shading networks with possibly smaller and smaller numbers of rays. We can see that as the number of bounces is decreased, the batch utilization increases and so does the Shading System's speedup.
Discuss batch utilization from the Renderer’s usage vs. batch utilization within shaders due to control flow divergence.
quality of code generation for IA
reducing JIT compile time (pain point)
If your renderer doesn’t use OSL, you might want to consider adding support.
Not all renderers are capable of generating batches of material requests, your renderer might need rework to operate on batches. OSL with Scalable SIMD execution can be your shading system providing good ROI for updating the rest of your renderer.
We track the place in the logical mask stack at which each assignment happens for each operation.
When a symbol is read, we can compare the current logical mask and determine if it is a subset of the mask of any operations that wrote to the symbol.
If it is not a subset, then that assignment operation will need to masked.
For assignment operations that have been identified earlier as "requires masking", just use the mask on the top of the stack to select the correct value.
"select" is the LLVM IR instruction we use to blend values together based on a mask.
IMPORTANT NOTE: We don't need to execute masked versions of smoothstep or noise; we just need to mask the assignment of their results.
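Per lane, the select-based masked assignment amounts to the following scalar-equivalent sketch (the JIT emits a single vector select over all lanes rather than a loop):

```cpp
#include <cassert>
#include <cstdint>

constexpr int WidthT = 8;

// Lane-by-lane equivalent of LLVM's "select" on a wide value: lanes
// with a set mask bit take the newly computed value, the rest keep
// the previous contents of the destination. The computation producing
// src runs unmasked; only this assignment is masked.
void masked_assign(float dst[WidthT], const float src[WidthT], uint32_t mask) {
    for (int lane = 0; lane < WidthT; ++lane)
        dst[lane] = ((mask >> lane) & 1u) ? src[lane] : dst[lane];
}
```

This is why functions like smoothstep or noise never need masked variants: their full-width results are simply blended into the destination under the current mask.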