Holy smoke! Faster Particle Rendering using DirectCompute by Gareth Thomas
1. HOLY SMOKE!
FASTER PARTICLE RENDERING USING DIRECTCOMPUTE
AMD AND MICROSOFT DEVELOPER DAY, JUNE 2014, STOCKHOLM
GARETH THOMAS
2ND JUNE 2014
2. | FASTER PARTICLE RENDERING USING DIRECTCOMPUTE | AMD AND MICROSOFT GAME DEVELOPER DAY - JUNE 2 2014, STOCKHOLM
PLAN FOR TODAY
Simulation Overview
Collisions
Sorting
Tiled Rendering
Conclusions
3.
OVERVIEW
Why use the GPU for simulation?
‒Highly parallel workload
‒Free your CPU to do other cool stuff
‒Leverage compute
‒ Take advantage of the Local Data Store (LDS)
‒ Asynchronous compute on some platforms
MOTIVATION
4.
OVERVIEW
Emit
Simulate
Sort
Render
‒ Rasterize billboards
‒ Tiled Rendering using DirectCompute
HOW TO BUILD A GPU PARTICLE SYSTEM
5.
SIMULATION OVERVIEW
HOW THE SIMULATION FITS TOGETHER
Simulate Compute Shader
Update Particles. Add alive ones to Alive List, add dead ones to Dead List
Dead List
Persistent list of particle indices
Alive List
List of alive particle indices. Rebuilt each frame by Simulation CS
Emit Compute Shader
Reads free indices from dead list. Writes new particle data into global array
Particle Array
Persistent array of particle data
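The buffer layout above can be sketched on the CPU as follows. This is a minimal illustration, not the demo's code: the class name `ParticlePool`, the `age`/`lifetime` fields and the Python lists standing in for GPU buffers are all assumptions made for the example.

```python
class ParticlePool:
    """CPU stand-in for the three GPU buffers: particle array, dead list, alive list."""

    def __init__(self, capacity):
        self.particles = [None] * capacity      # persistent particle data
        self.dead_list = list(range(capacity))  # persistent list of free indices
        self.alive_list = []                    # rebuilt every frame by simulate()

    def emit(self, count, lifetime):
        """Emit pass: take free indices from the dead list and write new particles."""
        for _ in range(min(count, len(self.dead_list))):
            index = self.dead_list.pop()
            self.particles[index] = {"age": 0.0, "lifetime": lifetime}

    def simulate(self, dt):
        """Simulate pass: age particles, then route each index to alive or dead list."""
        self.alive_list = []
        for index, particle in enumerate(self.particles):
            if particle is None:
                continue  # slot is already on the dead list
            particle["age"] += dt
            if particle["age"] >= particle["lifetime"]:
                self.particles[index] = None
                self.dead_list.append(index)
            else:
                self.alive_list.append(index)
```

Emission can never exceed the pool capacity because it only consumes indices that the dead list hands out.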
6.
COLLISIONS
Can no longer use CPU-side physics engine for collisions
Use depth buffer [Tchou11]
‒ Project particle into screen space and read depth buffer
‒ Project particle into view space
‒ Transform depth buffer value into view space and compare depths
Generate collision response
‒ Use G-buffer normals
‒ Or take multiple depth samples to reconstruct the normal
A GPU-BASED SOLUTION
[Figure: view-space diagram with labels P(n), P(n+1), thickness and the Z axis]
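The depth test and response described above might look like the sketch below. It assumes a standard (non-reversed) Direct3D-style perspective depth buffer when converting a stored depth value to view space; the function names, the `restitution` parameter and the default values are illustrative and not taken from the talk.

```python
def depth_to_view_z(depth, near_z, far_z):
    """Convert a [0, 1] depth-buffer value to view-space Z.

    Assumes a standard left-handed perspective projection (depth 0 = near plane).
    """
    return (near_z * far_z) / (far_z - depth * (far_z - near_z))


def collides_with_depth(particle_view_z, scene_view_z, thickness):
    """True when the particle lies inside the assumed-thickness shell behind the surface."""
    return scene_view_z < particle_view_z < scene_view_z + thickness


def collision_response(velocity, normal, restitution=0.5):
    """Reflect velocity about the surface normal and damp it (hypothetical response)."""
    approach = sum(v * n for v, n in zip(velocity, normal))
    if approach >= 0.0:
        return tuple(velocity)  # already moving away from the surface
    return tuple((v - 2.0 * approach * n) * restitution
                 for v, n in zip(velocity, normal))
```

The normal passed to `collision_response` would come from the G-buffer or from neighbouring depth samples, as the slide lists.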
7.
COLLISIONS
Only collides against geometry in the depth buffer
Particles would collide against depth buffer even if they are behind the geometry
‒ Use a thickness value to assume particles are in free space behind geometry
Particles don’t collide when they are off screen
‒ Particles at rest on the floor fall through and disappear once that floor has gone off-screen
‒ Put particles to sleep in the simulation once they have come to rest
‒ Use G-buffer to mark parts of the scene that particles can sleep on (static objects)
Not Multi-GPU Friendly!
‒ Switch off depth buffer collisions in MGPU mode
PROBLEMS WITH USING THE DEPTH BUFFER
Fallen through world!
8.
7 3 6 8 1 4 2 5
for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2)
{
for( compareDist=subArraySize/2; compareDist>0; compareDist/=2)
{
// Begin: GPU part of the sort
for each element n
n = selectBitonic(n, n^compareDist);
// End: GPU part of the sort
}
}
BITONIC SORT
9.
7 3 6 8 1 4 2 5
for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2) // subArraySize == 2
{
for( compareDist=subArraySize/2; compareDist>0; compareDist/=2) // compareDist == 1
{
// Begin: GPU part of the sort
for each element n
n = selectBitonic(n, n^compareDist);
// End: GPU part of the sort
}
}
BITONIC SORT (PASS 1)
10.
3 7 8 6 1 4 5 2
for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2) // subArraySize == 4
{
for( compareDist=subArraySize/2; compareDist>0; compareDist/=2) // compareDist == 2
{
// Begin: GPU part of the sort
for each element n
n = selectBitonic(n, n^compareDist);
// End: GPU part of the sort
}
}
BITONIC SORT (PASS 2)
11.
3 6 8 7 5 4 1 2
for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2) // subArraySize == 4
{
for( compareDist=subArraySize/2; compareDist>0; compareDist/=2) // compareDist == 1
{
// Begin: GPU part of the sort
for each element n
n = selectBitonic(n, n^compareDist);
// End: GPU part of the sort
}
}
BITONIC SORT (PASS 3)
12.
3 6 7 8 5 4 2 1
for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2) // subArraySize == 8
{
for( compareDist=subArraySize/2; compareDist>0; compareDist/=2) // compareDist == 4
{
// Begin: GPU part of the sort
for each element n
n = selectBitonic(n, n^compareDist);
// End: GPU part of the sort
}
}
BITONIC SORT (PASS 4)
13.
3 4 2 1 5 6 7 8
for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2) // subArraySize == 8
{
for( compareDist=subArraySize/2; compareDist>0; compareDist/=2) // compareDist == 2
{
// Begin: GPU part of the sort
for each element n
n = selectBitonic(n, n^compareDist);
// End: GPU part of the sort
}
}
BITONIC SORT (PASS 5)
14.
2 1 3 4 5 6 7 8
for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2) // subArraySize == 8
{
for( compareDist=subArraySize/2; compareDist>0; compareDist/=2) // compareDist == 1
{
// Begin: GPU part of the sort
for each element n
n = selectBitonic(n, n^compareDist);
// End: GPU part of the sort
}
}
BITONIC SORT (PASS 6)
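The six passes walked through above can be reproduced with a small CPU sketch. The slides do not define selectBitonic, so the direction rule used here (ascending when `n & subArraySize` is zero) is an assumption chosen because it matches the intermediate arrays shown on each slide; the function name `bitonic_sort` and the recorded `passes` list are illustrative.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list ascending and record the array after each pass."""
    data = list(values)
    size = len(data)
    assert size and size & (size - 1) == 0, "length must be a power of two"
    passes = []
    sub_array_size = 2
    while sub_array_size <= size:
        compare_dist = sub_array_size // 2
        while compare_dist > 0:
            # On the GPU this loop body is one dispatch, one thread per element.
            result = list(data)
            for n in range(size):
                partner = n ^ compare_dist
                ascending = (n & sub_array_size) == 0
                # The lower index keeps the minimum in ascending runs,
                # and the maximum in descending runs.
                keep_min = (n < partner) == ascending
                pick = min if keep_min else max
                result[n] = pick(data[n], data[partner])
            data = result
            passes.append(list(data))
            compare_dist //= 2
        sub_array_size *= 2
    return data, passes
```

For the eight-element example this yields six passes, which is log2(8) * (log2(8) + 1) / 2, each one a fully parallel compare-and-select over the array.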
15.
Sorted Alive List
Vertex Shader
Read Particle Buffer
Geometry Shader
Expand one point to four. Billboard in view space.
Pixel Shader
Texturing and tinting. Depth fade for soft particles.
Particle Pool
RENDERING
16.
Sorted Alive List
Vertex Shader
Read particle buffer and billboard in view space
Pixel Shader
Texturing and tinting. Depth fade for soft particles.
Particle Pool
Index Buffer
RENDERING
17.
RENDERING
The alive particle count is only available on the GPU
‒ Use Indirect API
DrawInstancedIndirect( GPU-args ) for Geometry Shader billboards
‒ D3DPT_POINTLIST with no VB, IB or IA
‒ VertexId = Particle index
‒ VertexCountPerInstance = NumParticles
‒ InstanceCount = 1
‒ Geometry Shader expands the point into four vertices and a two-triangle strip per billboard
Or better still: DrawIndexedInstancedIndirect( GPU-args )
‒ D3DPT_TRIANGLELIST, use IB
‒ VertexId / 4 = Particle index
‒ VertexId % 4 = Billboard corner index
‒ IndexCountPerInstance = NumParticles * 6
‒ InstanceCount = 1
RASTERIZATION – FOR OLD SCHOOL GPU PARTICLE SYSTEMS
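The indexed variant above relies on a static index buffer with six indices per billboard. A minimal sketch of that buffer and of the vertex-ID decoding follows; the corner ordering within each quad is an assumption, and the helper names are hypothetical. The argument tuple follows the field order Direct3D 11 documents for indexed indirect draws.

```python
def build_quad_index_buffer(max_particles):
    """Six indices (two triangles) per particle, four unique vertices per particle."""
    indices = []
    for particle in range(max_particles):
        base = particle * 4
        # Assumed corner order: 0-1-2 and 2-1-3 form the billboard's two triangles.
        indices += [base, base + 1, base + 2, base + 2, base + 1, base + 3]
    return indices


def decode_vertex_id(vertex_id):
    """Return (particle index, billboard corner index) for a vertex ID."""
    return vertex_id // 4, vertex_id % 4


def indexed_indirect_args(num_particles):
    """Arguments a compute shader would write for DrawIndexedInstancedIndirect.

    Order: IndexCountPerInstance, InstanceCount, StartIndexLocation,
    BaseVertexLocation, StartInstanceLocation.
    """
    return (num_particles * 6, 1, 0, 0, 0)
```

Because the index pattern never changes, the buffer can be built once for the maximum particle count and only the indirect arguments change per frame.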
18.
RENDERING
Overdraw from large particles kills game performance!
‒ Get artists to throttle back on the VFX
Optimizations
‒ Tightly fit polygons around texture [Persson09]
‒ Render to smaller buffer [Cantlay07]
‒ Sorting issues
‒ Loss of fidelity
PROBLEMS WITH RASTERIZATION
19.
TILED RENDERING
Inspired by Forward+ [Harada12]
‒ Screen-space binning of particles instead of point lights!
Use a 32x32 thread group to shade a 32x32 pixel tile in screen space
‒ Cull particles (just like Forward+)
‒ Sort particles
‒ Per pixel/thread
‒ Evaluate colour of each particle
‒ Blend together
‒ Composite back onto scene
OVERVIEW
20.
TILED RENDERING
[Figure: particles 1, 2 and 3 overlapping a row of tiles, giving per-tile index lists [1], [1,2,3], [2,3]]
Divide screen into tiles
Build index lists of intersecting particles per tile
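Building the per-tile index lists can be sketched as below. For simplicity this version bins by each particle's screen-space bounding square, which is a conservative stand-in for the frustum test used in the talk; the function name, the `(centre_x, centre_y, radius)` tuple layout in pixels and the 0-based indices are assumptions.

```python
def bin_particles(particles, screen_w, screen_h, tile_size=32):
    """Return (per-tile index lists in row-major order, tiles per row)."""
    tiles_x = (screen_w + tile_size - 1) // tile_size
    tiles_y = (screen_h + tile_size - 1) // tile_size
    tile_lists = [[] for _ in range(tiles_x * tiles_y)]
    for index, (cx, cy, radius) in enumerate(particles):
        # Range of tiles touched by the particle's bounding square, clamped to screen.
        x0 = max(int((cx - radius) // tile_size), 0)
        x1 = min(int((cx + radius) // tile_size), tiles_x - 1)
        y0 = max(int((cy - radius) // tile_size), 0)
        y1 = min(int((cy + radius) // tile_size), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                tile_lists[ty * tiles_x + tx].append(index)
    return tile_lists, tiles_x
```

With three particles laid out like the figure, a 96x32 screen produces the lists `[0]`, `[0, 1, 2]` and `[1, 2]`, the 0-based equivalent of the lists shown on the slide.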
21.
TILED RENDERING
View space asymmetric frustum generated per tile
Use camera’s near plane
Use camera’s far plane
Or calculate far plane from depth buffer
[Figure: per-tile frusta labelled Tile0, Tile1, Tile2, Tile3]
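One way to build such a per-tile frustum and test a particle's bounding sphere against it is sketched below. It assumes a left-handed view space with the camera at the origin looking down +Z and tile bounds given in normalised device coordinates; the function names and parameters are hypothetical, not the demo's.

```python
import math


def tile_frustum_planes(ndc_x0, ndc_x1, ndc_y0, ndc_y1, tan_half_fov_x, tan_half_fov_y):
    """Four inward-facing side planes through the camera origin, as unit normals."""
    def normalised(x, y, z):
        length = math.sqrt(x * x + y * y + z * z)
        return (x / length, y / length, z / length)

    return [
        normalised(1.0, 0.0, -ndc_x0 * tan_half_fov_x),   # left
        normalised(-1.0, 0.0, ndc_x1 * tan_half_fov_x),   # right
        normalised(0.0, 1.0, -ndc_y0 * tan_half_fov_y),   # bottom
        normalised(0.0, -1.0, ndc_y1 * tan_half_fov_y),   # top
    ]


def sphere_in_tile(centre, radius, planes, near_z, far_z):
    """Conservative sphere-vs-frustum test in view space."""
    if centre[2] + radius < near_z or centre[2] - radius > far_z:
        return False
    return all(sum(n * c for n, c in zip(plane, centre)) >= -radius
               for plane in planes)
```

Passing the tile's maximum depth-buffer value (converted to view space) as `far_z` rejects particles that are fully hidden behind opaque geometry, which is the optional far-plane refinement listed above.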
22.
TILED RENDERING
23.
TILED RENDERING
numthreads[ 32, 32, 1 ]
Culling 1024 particles in parallel
Add to LDS index list
Write out to memory
‒ Particle count
‒ Particle indices
THREAD GROUP VIEW
24.
TILED RENDERING
TILE COMPLEXITY
25.
TILED RENDERING
Cannot sort global list of particles
‒ Because 1024 particles get culled in parallel they get added to visible list in arbitrary order
Need to sort particles per-tile
‒ This is a good thing!
‒ Only need to sort a subset of the global list
‒ Sort in a single pass in LDS instead of multiple passes through main memory
PER TILE BITONIC SORT
26.
TILED RENDERING
numthreads[ 32, 32, 1 ] 1 thread = 1 pixel in screen space
Set accumulation colour to float4( 0, 0, 0, 0 )
For each particle in tile (back to front)
‒ Evaluate particle contribution
‒ UV generation & radius check
‒ Texture lookup
‒ Normal generation and lighting
‒ Manually blend
‒ Colour = ( srcA x srcCol ) + ( invSrcA x destCol )
‒ Alpha = srcA + ( invSrcA x destA )
‒ Write result to screen size UAV
EVALUATING TILE COLOUR
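The manual blend above, run for a single pixel, can be sketched as follows. Each particle sample is assumed to be an `(r, g, b, a)` tuple that has already been textured and lit; the function name is illustrative.

```python
def blend_back_to_front(samples):
    """Accumulate (r, g, b, a) samples ordered from farthest to nearest."""
    colour = [0.0, 0.0, 0.0]
    alpha = 0.0
    for r, g, b, src_a in samples:
        inv_src_a = 1.0 - src_a
        # Colour = ( srcA x srcCol ) + ( invSrcA x destCol )
        colour = [src_a * src + inv_src_a * dest
                  for src, dest in zip((r, g, b), colour)]
        # Alpha = srcA + ( invSrcA x destA )
        alpha = src_a + inv_src_a * alpha
    return (colour[0], colour[1], colour[2], alpha)
```

The accumulated colour is premultiplied by the accumulated alpha, which is what allows it to be composited onto the scene afterwards.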
27.
TILED RENDERING
numthreads[ 32, 32, 1 ] 1 thread = 1 pixel in screen space
Set accumulation colour to float4( 0, 0, 0, 0 )
For each particle in tile (front to back)
‒ Evaluate particle contribution
‒ UV generation & radius check
‒ Texture lookup
‒ Normal generation and lighting
‒ Manually blend [Bavoil08]
‒ Colour = ( invDestA x srcA x srcCol ) + destCol
‒ Alpha = srcA + ( invSrcA x destA )
‒ if ( accumulation alpha > threshold ): set accumulation alpha = 1 and bail
‒ Write result to screen size UAV
EVALUATING TILE COLOUR – IMPROVED!!!
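A per-pixel sketch of the front-to-back variant with its early-out follows. Samples are assumed to be `(r, g, b, a)` tuples ordered nearest first, and the default `alpha_threshold` of 0.99 is an illustrative value, not one quoted in the talk.

```python
def blend_front_to_back(samples, alpha_threshold=0.99):
    """Accumulate (r, g, b, a) samples ordered from nearest to farthest."""
    colour = [0.0, 0.0, 0.0]
    alpha = 0.0
    for r, g, b, src_a in samples:
        # Colour = ( invDestA x srcA x srcCol ) + destCol
        weight = (1.0 - alpha) * src_a
        colour = [dest + weight * src for src, dest in zip((r, g, b), colour)]
        # Alpha = srcA + ( invSrcA x destA )
        alpha = src_a + (1.0 - src_a) * alpha
        if alpha > alpha_threshold:
            alpha = 1.0
            break  # pixel is effectively opaque; skip the remaining particles
    return (colour[0], colour[1], colour[2], alpha)
```

For the same particles this produces the same colour as blending back to front, but a pixel covered by dense smoke stops evaluating particles as soon as it saturates.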
28.
TILED RENDERING
Bin particles into 8x8 grid
For each particle
‒ For each bin
‒ Test particle against bin
‒ Add particle if visible
UAV0 for particle indices (size = 8 x 8 x maxparticles)
‒ Array split into 64 bins using offsets
UAV1 for storing particle count per bin (size = 8 x 8)
‒ 1 element per bin
‒ Use InterlockedAdd() to bump bin’s counter
COARSE CULLING
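The two-buffer layout above can be sketched on the CPU like this. The bin test here is a simple screen-space circle-versus-rectangle check, used as a stand-in for whatever visibility test the real shader performs; the function name and the `(centre_x, centre_y, radius)` tuple layout are assumptions.

```python
def coarse_cull(particles, screen_w, screen_h, max_particles, bins_x=8, bins_y=8):
    """Return (index buffer split into bins by offset, per-bin counters)."""
    bin_w = screen_w / bins_x
    bin_h = screen_h / bins_y
    indices = [0] * (bins_x * bins_y * max_particles)  # stands in for UAV0
    counts = [0] * (bins_x * bins_y)                   # stands in for UAV1
    for particle_index, (cx, cy, radius) in enumerate(particles):
        for by in range(bins_y):
            for bx in range(bins_x):
                # Closest point on the bin rectangle to the particle centre.
                nearest_x = min(max(cx, bx * bin_w), (bx + 1) * bin_w)
                nearest_y = min(max(cy, by * bin_h), (by + 1) * bin_h)
                if (cx - nearest_x) ** 2 + (cy - nearest_y) ** 2 > radius ** 2:
                    continue
                bin_index = by * bins_x + bx
                # InterlockedAdd returns the previous counter value: that is the slot.
                slot = counts[bin_index]
                counts[bin_index] += 1
                indices[bin_index * max_particles + slot] = particle_index
    return indices, counts
```

Sizing each bin's slice to `max_particles` means no bin can overflow, at the cost of memory, which matches the 8 x 8 x maxparticles sizing on the slide.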
29.
TILED RENDERING
COMPUTE SHADER SETUP
Simulation
‒ numthreads[256, 1, 1], 1 thread per particle
‒ Shader output: Updated particle data
Coarse Culling
‒ numthreads[256, 1, 1], 1 thread per particle
‒ LDS: Per-bin frustum planes
‒ Shader output: Per-bin particle indices
Tile Culling and Sorting
‒ numthreads[32, 32, 1], 1 thread per particle
‒ LDS: Per-tile particle indices and distances
‒ Shader output: Per-tile sorted particle indices
Tile Rendering
‒ numthreads[32, 32, 1], 1 thread per pixel
‒ LDS: Particle data (position, radius, colour etc)
‒ Shader output: Screen space colour buffer
30.
Mode frame time (ms)*
Rasterization 5.2
Tiled 3.4
*AMD Radeon R9 290X @ 1080p
Breakdown frame time (ms)*
Simulation 0.50
Coarse Culling 0.06
Tile Culling and Sorting 0.37
Tiled Rendering 1.86
PERFORMANCE RESULTS
Default View, ~35K particles
31.
Mode frame time (ms)*
Rasterization 27.3
Tiled 6.2
*AMD Radeon R9 290X @ 1080p
PERFORMANCE RESULTS
In Smoke View, ~35K particles
32.
CONCLUSIONS
Depth buffer collisions
‒ Great bang-for-buck
‒ Not perfect!
Bitonic sort
‒ Good fit for sorting on the GPU
Tiled Rendering
‒ Faster than rasterization
‒ Great for combatting heavy overdraw
‒ More predictable behaviour
Future work
‒ Add arbitrary geometry for OIT
‒ Volume tracing
33.
QUESTIONS?
Demo with full source coming soon
http://developer.amd.com/tools/graphics-development/amd-radeon-sdk/
34.
REFERENCES
[Tchou11] Chris Tchou, “Halo Reach Effects Tech”, GDC 2011
[Persson09] Emil Persson, http://www.humus.name/index.php?page=News&ID=266
[Cantlay07] Iain Cantlay, “High-Speed, Off-Screen Particles”, GPU Gems 3, 2007
[Harada12] Takahiro Harada et al., “Forward+: Bringing Deferred Lighting to the Next Level”, Short Papers, Eurographics 2012
[Bavoil08] Louis Bavoil et al., “Order Independent Transparency with Dual Depth Peeling”, 2008