1Spring 2011 – Beyond Programmable Shading
Graphics with GPU Compute APIs
Mike Houston, AMD / Stanford
Aaron Lefohn, Intel / University of Washington
2Spring 2011 – Beyond Programmable Shading
What’s In This Talk?
• Brief review of last lecture
• Advanced usage patterns of GPU compute languages
• Rendering use cases for GPU compute languages
– Histograms (for shadows, tone mapping, etc.)
– Deferred rendering
– Writing new graphics pipelines (sort of)
3Spring 2011 – Beyond Programmable Shading
Remember:
Figure by Kayvon Fatahalian
4Spring 2011 – Beyond Programmable Shading
Definitions: Execution
• Task
– A logically related set of instructions executed in a single execution context
(aka shader, instance of a kernel, task)
• Concurrent execution
– Multiple tasks that may execute simultaneously
(because they are logically independent)
• Parallel execution
– Multiple tasks whose execution contexts are guaranteed to be live simultaneously
(because you want them to be for locality, synchronization, etc)
5Spring 2011 – Beyond Programmable Shading
Synchronization
• Synchronization
– Restricting when tasks are permitted to execute
• Granularity of permitted synchronization determines at which
granularity system allows user to control scheduling
6Spring 2011 – Beyond Programmable Shading
GPU Compute Languages Review
• “Write code from within two nested concurrent/parallel loops”
• Abstracts
– Cores, execution contexts, and SIMD ALUs
• Exposes
– Parallel execution contexts on same core
– Fast R/W on-core memory shared by the execution contexts on same core
• Synchronization
– Fine grain: between execution contexts on same core
– Very coarse: between large sets of concurrent work
– No medium-grain synchronization “between function calls” like task
systems provide
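As a concrete illustration of the two exposed features above, the sketch below stages data in on-core (groupshared) memory and uses a group barrier for fine-grain synchronization. This is a minimal DX11 Compute Shader example written for this review; the buffer names and group size are placeholders, not code from the lecture.
// Each work-group stages a block of input in fast on-core memory,
// synchronizes, then every work-item can read its neighbors' staged data.
#define GROUP_SIZE 256
StructuredBuffer<float>   gInput  : register(t0);   // off-chip, read-only
RWStructuredBuffer<float> gOutput : register(u0);   // off-chip, read/write
groupshared float sTile[GROUP_SIZE];                // fast on-core R/W memory
[numthreads(GROUP_SIZE, 1, 1)]
void StageAndProcess(uint3 dtid : SV_DispatchThreadID,
                     uint  gi   : SV_GroupIndex)
{
    sTile[gi] = gInput[dtid.x];                     // each work-item loads one element
    GroupMemoryBarrierWithGroupSync();              // fine-grain sync within the group
    float neighbor = sTile[(gi + 1) % GROUP_SIZE];  // read another work-item's staged data
    gOutput[dtid.x] = 0.5f * (sTile[gi] + neighbor);
}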
7Spring 2011 – Beyond Programmable Shading
GPU Compute Pseudocode
void myWorkGroup()
{
parallel_for(i = 0 to NumWorkItems - 1)
{
… GPU Kernel Code … (This is where you write GPU compute code)
}
}
void main()
{
concurrent_for( i = 0 to NumWorkGroups - 1)
{
myWorkGroup();
}
sync;
}
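In DX11 Compute Shader terms, the body of the parallel_for above is the kernel and the concurrent_for is the host-side dispatch of work-groups. A minimal sketch, with illustrative names that are not from the slides:
// Inner parallel_for: NumWorkItems work-items per work-group.
#define NUM_WORK_ITEMS 64
RWStructuredBuffer<float> gData : register(u0);
[numthreads(NUM_WORK_ITEMS, 1, 1)]
void MyKernel(uint3 dtid : SV_DispatchThreadID)
{
    // ... GPU kernel code for one work-item goes here ...
    gData[dtid.x] *= 2.0f;
}
// Outer concurrent_for: the host launches the grid of work-groups with
//   context->Dispatch(NumWorkGroups, 1, 1);
// and the trailing "sync" corresponds to later work consuming gData.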
8Spring 2011 – Beyond Programmable Shading
DX CS/OCL/CUDA Execution Model
• Fundamental unit is work-item
– Single instance of “kernel” program
(i.e., “task” using the definitions in this talk)
– Each work-item executes in single SIMD lane
• Work items collected in work-groups
– Work-group scheduled on single core
– Work-items in a work-group
– Execute in parallel
– Can share R/W on-chip scratchpad memory
– Can wait for other work-items in work-group
• Users launch a grid of work-groups
– Spawn many concurrent work-groups
void f(...) {
int x = ...;
...;
...;
if(...) {
...
}
}
Figure by Tim Foley
9Spring 2011 – Beyond Programmable Shading
When to Use GPU Compute vs. Pixel Shader?
• Use GPU compute language if your algorithm needs on-chip memory
– Reduce bandwidth by building local data structures
• Otherwise, use pixel shader
– All mapping, decomposition, and scheduling decisions automatic
– (Easier to reach peak performance)
10Spring 2011 – Beyond Programmable Shading
Conventional Thread Parallelism on GPUs
• Also called “persistent threads”
• “Expert” usage model for GPU compute
– Defeat abstractions over cores, execution contexts, and SIMD functional units
– Defeat system scheduler, load balancing, etc.
– Code not portable between architectures
11Spring 2011 – Beyond Programmable Shading
Conventional Thread Parallelism on GPUs
• Execution
– Two-level parallel execution model
– Lower level: parallel execution of M identical tasks on M-wide SIMD functional unit
– Higher level: parallel execution of N different tasks on N execution contexts
• What is abstracted?
– Nothing (other than automatic mapping to SIMD lanes)
• Where is synchronization allowed?
– Lower-level: between any task running on same SIMD functional unit
– Higher-level: between any execution context
12Spring 2011 – Beyond Programmable Shading
Why Persistent Threads?
• Enable alternate programming models that require different
scheduling and synchronization rules than the default model
provides
• Example alternate programming models
– Task systems (esp. nested task parallelism)
– Producer-consumer rendering pipelines
– (See references at end of this slide deck for more details)
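One common way to build such models is the skeleton below: launch only enough work-groups to fill the machine, then have each work-item loop, claiming tasks from a global queue with atomics rather than relying on the hardware dispatcher. This is an illustrative persistent-threads sketch; the queue layout and names are assumptions, not code from the talk.
struct Task { uint payload; };
StructuredBuffer<Task>   gTasks     : register(t0);
RWStructuredBuffer<uint> gQueueHead : register(u0);   // gQueueHead[0] = next unclaimed task
RWStructuredBuffer<uint> gResults   : register(u1);
cbuffer QueueCB : register(b0) { uint gTaskCount; };
[numthreads(64, 1, 1)]
void PersistentKernel()
{
    uint taskIndex;
    InterlockedAdd(gQueueHead[0], 1, taskIndex);       // atomically claim a task
    while (taskIndex < gTaskCount)                     // loop until the queue is drained
    {
        gResults[taskIndex] = gTasks[taskIndex].payload * 2u;  // "process" the task
        InterlockedAdd(gQueueHead[0], 1, taskIndex);   // claim the next task
    }
}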
13Spring 2011 – Beyond Programmable Shading
“Sample Distribution Shadow Maps,”
Lauritzen et al., SIGGRAPH 2010
[Comparison images: without Compute Shader vs. with Compute Shader]
Sample Distribution Shadow Maps
14Spring 2011 – Beyond Programmable Shading
Sample Distribution Shadow Maps
• Algorithm Overview
– Render depth buffer from eye
– Analyze depth buffer with Compute Shader:
– Build histogram of pixel positions
– Analyze histogram to optimize shadow rendering
– Render shadow maps
• Compute Shader to build histogram (from pixel data)
– Sequential and parallel code in single kernel (many fork-joins)
– Liberal use of atomics to shared (on-chip) and global (off-chip) memory
– Gather and scatter to arbitrary locations in off-chip and on-chip memory
• Compute Shader to analyze histogram
– Sequential and parallel code in single kernel (many fork-joins)
15Spring 2011 – Beyond Programmable Shading
Building Histogram in DX11 CS
[numthreads(BLOCK_DIM, BLOCK_DIM, 1)]
void ScatterHistogram(uint3 groupId : SV_GroupID,
uint3 groupThreadId : SV_GroupThreadID,
uint groupIndex : SV_GroupIndex)
{
// Initialize local histogram in parallel
// Parallelism:
// - Within threadgroup: SIMD lanes map to histogram bins
// - Between threadgroups: Each threadgroup has own histogram
localHistogram[groupIndex] = emptyBin();
GroupMemoryBarrierWithGroupSync();
. . .
16Spring 2011 – Beyond Programmable Shading
Building Histogram in DX11 CS
// Build histogram in parallel
// Parallelism:
// - Within threadgroup: SIMD lanes map to pixels in image tile
// - Between threadgroups: Each threadgroup maps to image tile
// Read and compute surface data
uint2 globalCoords = groupId.xy * TILE_DIM + groupThreadId.xy;
SurfaceData data = ComputeSurfaceDataFromGBuffer(globalCoords);
// Bin based on view space Z
// Scatter data to the right bin in our local (on-chip) histogram
int bin = int(ZToBin(data.positionView.z));
InterlockedAdd(localHistogram[bin].count, 1U);
InterlockedMin(localHistogram[bin].bounds.minTexCoordX, data.texCoordX);
InterlockedMax(localHistogram[bin].bounds.maxTexCoordX, data.texCoordX);
//… (more atomic min/max operations for other values in histogram bin) …
GroupMemoryBarrierWithGroupSync();
. . .
17Spring 2011 – Beyond Programmable Shading
Building Histogram in DX11 CS
// Use per-threadgroup scalar code to atomically merge all on-chip histograms into
// single histogram in global memory.
// Parallelism
// - Within threadgroup: SIMD lanes map to histogram elements
// - Between threadgroups: Each threadgroup writing to single global histogram
uint i = groupIndex;
if (localHistogram[i].count > 0) {
InterlockedAdd(gHistogram[i].count, localHistogram[i].count);
InterlockedMin(gHistogram[i].bounds.minTexCoordX, localHistogram[i].bounds.minTexCoordX);
InterlockedMin(gHistogram[i].bounds.minTexCoordY, localHistogram[i].bounds.minTexCoordY);
InterlockedMin(gHistogram[i].bounds.minLightSpaceZ, localHistogram[i].bounds.minLightSpaceZ);
InterlockedMax(gHistogram[i].bounds.maxTexCoordX, localHistogram[i].bounds.maxTexCoordX);
InterlockedMax(gHistogram[i].bounds.maxTexCoordY, localHistogram[i].bounds.maxTexCoordY);
InterlockedMax(gHistogram[i].bounds.maxLightSpaceZ, localHistogram[i].bounds.maxLightSpaceZ);
}
}
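The three snippets above rely on declarations that the slides omit. A plausible sketch of what they might look like is given below; the bin layout, bin count, and ZToBin mapping are assumptions inferred from the code, not the actual SDSM source (which, for example, may bin depth non-linearly).
struct HistogramBounds {
    uint minTexCoordX, minTexCoordY, minLightSpaceZ;
    uint maxTexCoordX, maxTexCoordY, maxLightSpaceZ;
};
struct HistogramBin {
    uint count;
    HistogramBounds bounds;
};
#define NUM_BINS (BLOCK_DIM * BLOCK_DIM)             // one bin per work-item in this sketch
groupshared HistogramBin localHistogram[NUM_BINS];   // per-threadgroup, on-chip
RWStructuredBuffer<HistogramBin> gHistogram : register(u0);  // single global histogram
HistogramBin emptyBin()
{
    HistogramBin b;
    b.count = 0;
    b.bounds.minTexCoordX = 0xFFFFFFFF;   b.bounds.maxTexCoordX = 0;
    b.bounds.minTexCoordY = 0xFFFFFFFF;   b.bounds.maxTexCoordY = 0;
    b.bounds.minLightSpaceZ = 0xFFFFFFFF; b.bounds.maxLightSpaceZ = 0;
    return b;
}
cbuffer BinCB : register(b0) { float gMinZ; float gMaxZ; };  // assumed depth bounds
int ZToBin(float viewZ)   // linear binning between assumed near/far bounds
{
    return clamp(int(NUM_BINS * (viewZ - gMinZ) / (gMaxZ - gMinZ)), 0, NUM_BINS - 1);
}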
18Spring 2011 – Beyond Programmable Shading
Optimization: Moving farther away
from basic data-parallelism
• Problem: 1:1 mapping between workgroups and image tiles
– Flushes local memory to global memory more times than necessary
– Would like larger workgroups but limited to 1024 workitems per group
• Solution
– Use the largest workgroups possible (1024 workitems)
– Launch fewer workgroups: find the sweet spot that fills all threads on all cores
(to maximize latency hiding) while minimizing writes to global memory
– Loop over multiple image tiles within a single compute shader
• Take-away
– “Invoke just enough parallel work to fill the SIMD lanes, threads, and cores of
the machine to achieve sufficient latency hiding”
– The abstraction is broken because this optimization exposes the number of
hardware resources
19Spring 2011 – Beyond Programmable Shading
Building Histogram in DX11 CS
// Build histogram in parallel
// Parallelism:
// - Within threadgroup: SIMD lanes map to pixels in image tile
// - Between threadgroups: Each threadgroup maps to image tile
uint2 tileStart = groupId.xy * TILE_DIM + groupThreadId.xy;
for (uint tileY = 0; tileY < TILE_DIM; tileY += BLOCK_DIM) {
for (uint tileX = 0; tileX < TILE_DIM; tileX += BLOCK_DIM) {
// Read and compute surface data
uint2 globalCoords = tileStart + uint2(tileX, tileY);
SurfaceData data = ComputeSurfaceDataFromGBuffer(globalCoords);
// Bin based on view space Z
// Scatter data to the right bin in our local (on-chip) histogram
int bin = int(ZToBin(data.positionView.z));
InterlockedAdd(localHistogram[bin].count, 1U);
InterlockedMin(localHistogram[bin].bounds.minTexCoordX, data.texCoordX);
//… (more atomic min/max ops for other values in histogram bin) …
}}
GroupMemoryBarrierWithGroupSync();
. . .
20Spring 2011 – Beyond Programmable Shading
SW Pipeline 1: Particle Rasterizer
• Mock-up particle rendering pipeline with render-target-read
– Written by 2 people over the course of 1 week
– Runs ~2x slower than D3D rendering pipeline (but has glass jaws)
[Comparison images: without volumetric shadow vs. with volumetric shadow]
21Spring 2011 – Beyond Programmable Shading
Tiled Particle Rasterizer in DX11 CS
[numthreads(RAST_THREADS_X, RAST_THREADS_Y, 1)]
void RasterizeParticleCS(uint3 groupId : SV_GroupID,
uint3 groupThreadId : SV_GroupThreadID,
uint groupIndex : SV_GroupIndex)
{
uint i = 0; // For all particles..
while (i < mParticleCount) {
GroupMemoryBarrierWithGroupSync();
const uint particlePerIter = min(mParticleCount - i, NT_X * NT_Y);
// Vertex shader and primitive assembly
// Parallelism: SIMD lanes map over particles.
if (groupIndex < particlePerIter) {
const uint particleIndex = i + groupIndex;
// … read vertex data for this particle from memory,
// construct screen-facing quad, test if particle intersects tile,
// use atomics to on-chip memory to append to list of particles
}
GroupMemoryBarrierWithGroupSync();
. . .
22Spring 2011 – Beyond Programmable Shading
Tiled Particle Rasterizer in DX11 CS
// Find all particles that intersect this pixel
// Parallelism: SIMD lanes map over pixels in image tile
for (uint n = 0; n < gVisibileParticlePerIter; n++) {
if (ParticleIntersectsPixel(gParticles[n], fragmentPos)) {
float dx, dy;
ComputeInterpolants(gParticles[n], fragmentPos, dx, dy);
float3 viewPos = BilinearInterp3(gParticles[n].viewPos, dx, dy);
float3 entry, exit, t;
if (IntersectParticle(viewPos, gParticles[n], entry, exit, t)) {
// Run pixel shader on this particle
// Read-modify-write framebuffer held in global off-chip memory
}
}
}
i += particlePerIter;
}
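The rasterizer snippets above similarly depend on shared state that the slides do not show. A hedged sketch of the kind of declarations they imply (the names mirror the code; the layout is an assumption):
struct TileParticle {
    float3 viewPos[4];        // quad corners in view space (bilinearly interpolated later)
    // ... plus whatever else the per-pixel shading needs ...
};
groupshared TileParticle gParticles[RAST_THREADS_X * RAST_THREADS_Y];
groupshared uint         gVisibileParticlePerIter;   // count of particles hitting this tile
// Inside the "vertex shader" phase, a particle that intersects the tile is
// appended to the shared list roughly like this:
//     uint slot;
//     InterlockedAdd(gVisibileParticlePerIter, 1, slot);
//     gParticles[slot] = builtParticle;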
23Spring 2011 – Beyond Programmable Shading
SW Pipeline 1: Particle Rasterizer
• Usage
– Atomics to on-chip memory
– Gather/scatter to on-chip and off-chip memory
– Latency hiding of off-chip memory accesses
• Lesson learned
– The programmer productivity of these programming models is impressive
– This pipeline is statically scheduled (from a SW perspective) but underlying
hardware scheduler is dynamically scheduling threadgroups
– Needs dynamic SW scheduling to achieve more stable / higher performance
24Spring 2011 – Beyond Programmable Shading
Software Pipeline 2: OptiX
• NVIDIA interactive ray-tracing library
– Started as research project, product announced 2009
• Custom rendering pipeline
– Implemented entirely with CUDA using “persistent threads” usage pattern
– Users define geometry, materials, lights in C-for-CUDA
– PTX intermediate layer used to synthesize optimized code
• Other custom pipelines in the research world
– Stochastic rasterization, particles, conservative rasterization, …
25Spring 2011 – Beyond Programmable Shading
Deferred Rendering
(Slides by Andrew Lauritzen)
(Possibly the most important use of ComputeShader)
26Spring 2011 – Beyond Programmable Shading
Overview
• Forward shading
• Deferred shading and lighting
• Tile-based deferred shading
27Spring 2011 – Beyond Programmable Shading
Forward Shading
• Do everything we need to shade a pixel
– for each light
– Shadow attenuation (sampling shadow maps)
– Distance attenuation
– Evaluate lighting and accumulate
• Multi-pass requires resubmitting scene geometry
– Not a scalable solution
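In a pixel shader, the loop above looks roughly like the sketch below; the light structure, interpolants, and the simple Lambert term are placeholders chosen for illustration, and shadow-map sampling is only indicated by a comment.
struct Light   { float3 positionView; float3 color; float attenuationEnd; };
struct Surface { float3 positionView : POSITION_VIEW;
                 float3 normalView   : NORMAL_VIEW;
                 float3 albedo       : ALBEDO; };
StructuredBuffer<Light> gLights : register(t0);
cbuffer LightCB : register(b0) { uint gLightCount; };
float4 ForwardPS(Surface surface) : SV_Target
{
    float3 lit = float3(0, 0, 0);
    for (uint i = 0; i < gLightCount; ++i) {          // for each light
        Light light    = gLights[i];
        float3 toLight = light.positionView - surface.positionView;
        float  dist    = length(toLight);
        float  atten   = saturate(1.0f - dist / light.attenuationEnd);  // distance attenuation
        float  nDotL   = saturate(dot(surface.normalView, toLight / max(dist, 1e-5f)));
        // (shadow attenuation from a shadow map would also be sampled here)
        lit += atten * nDotL * light.color * surface.albedo;            // evaluate and accumulate
    }
    return float4(lit, 1.0f);
}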
28Spring 2011 – Beyond Programmable Shading
Forward Shading Problems
• Ineffective light culling
– Object space at best
– Trade-off with shader permutations/batching
• Memory footprint of all inputs
– Everything must be resident at the same time (!)
• Shading small triangles is inefficient
– Covered earlier in this course: [Fatahalian 2010]
29Spring 2011 – Beyond Programmable Shading
Conventional Deferred Shading
• Store lighting inputs in memory (G-buffer)
– for each light
– Use rasterizer to scatter light volume and cull
– Read lighting inputs from G-buffer
– Compute lighting
– Accumulate lighting with additive blending
• Reorders computation to extract coherence
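A hedged sketch of one such per-light pass, assuming a simple G-buffer that stores view-space position, normal, and albedo (the layout and names are illustrative, not from the talk). One quad or light volume is drawn per light, with additive (ONE, ONE) blending into the accumulation target.
Texture2D<float4> gGBufferPosition : register(t0);   // view-space position
Texture2D<float4> gGBufferNormal   : register(t1);   // view-space normal
Texture2D<float4> gGBufferAlbedo   : register(t2);
cbuffer PerLightCB : register(b0) {
    float3 gLightPositionView;
    float3 gLightColor;
    float  gLightAttenuationEnd;
};
float4 DeferredLightPS(float4 svPos : SV_Position) : SV_Target
{
    int3 coords = int3(svPos.xy, 0);
    float3 position = gGBufferPosition.Load(coords).xyz;   // G-buffer read (repeated per light!)
    float3 normal   = gGBufferNormal.Load(coords).xyz;
    float3 albedo   = gGBufferAlbedo.Load(coords).rgb;
    float3 toLight = gLightPositionView - position;
    float  dist    = length(toLight);
    float  atten   = saturate(1.0f - dist / gLightAttenuationEnd);
    float  nDotL   = saturate(dot(normal, toLight / max(dist, 1e-5f)));
    return float4(atten * nDotL * gLightColor * albedo, 0.0f);   // additively blended
}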
30Spring 2011 – Beyond Programmable Shading
Modern Implementation
• Cull with screen-aligned quads
– Cover light extents with axis-aligned bounding box
– Full light meshes (spheres, cones) are generally overkill
– Can use oriented bounding box for narrow spot lights
– Use conservative single-direction depth test
– Two-pass stencil is more expensive than it is worth
– Depth bounds test on some hardware, but not batch-friendly
31Spring 2011 – Beyond Programmable Shading
Lit Scene (256 Point Lights)
32Spring 2011 – Beyond Programmable Shading
Deferred Shading Problems
• Bandwidth overhead when lights overlap
– for each light
– Use rasterizer to scatter light volume and cull
– Read lighting inputs from G-buffer (overhead)
– Compute lighting
– Accumulate lighting with additive blending (overhead)
• Not doing enough work to amortize overhead
33Spring 2011 – Beyond Programmable Shading
Improving Deferred Shading
• Reduce G-buffer overhead
– Access fewer things inside the light loop
– Deferred lighting / light pre-pass
• Amortize overhead
– Group overlapping lights and process them together
– Tile-based deferred shading
34Spring 2011 – Beyond Programmable Shading
Tile-Based Deferred Rendering
Parallel_for over lights
Atomically append lights that affect tile to shared list
Barrier
Parallel_for over pixels in tile
Evaluate all selected lights at each pixel
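A condensed DX11 CS sketch of this structure follows. It mirrors the shape of the published tiled demo but is heavily simplified: the G-buffer layout, light structure, list size, and per-tile culling test are assumptions made for illustration (the culling test here is a pass-through placeholder).
#define TILE_DIM        16
#define MAX_TILE_LIGHTS 256
struct Light { float3 positionView; float3 color; float attenuationEnd; };
StructuredBuffer<Light> gLights          : register(t0);
Texture2D<float4>       gGBufferPosition : register(t1);   // view-space position
Texture2D<float4>       gGBufferNormal   : register(t2);
Texture2D<float4>       gGBufferAlbedo   : register(t3);
RWTexture2D<float4>     gOutput          : register(u0);
cbuffer TileCB : register(b0) { uint gLightCount; };
groupshared uint sTileLightIndices[MAX_TILE_LIGHTS];   // lights affecting this tile
groupshared uint sTileLightCount;
bool LightIntersectsTile(Light light, uint2 tileId)
{
    // Placeholder: a real version tests the light sphere against the tile's
    // frustum, built from the tile bounds and the per-tile min/max depth.
    return true;
}
[numthreads(TILE_DIM, TILE_DIM, 1)]
void TiledDeferredCS(uint3 groupId : SV_GroupID,
                     uint3 dtid    : SV_DispatchThreadID,
                     uint  gi      : SV_GroupIndex)
{
    if (gi == 0) sTileLightCount = 0;
    GroupMemoryBarrierWithGroupSync();
    // Parallel_for over lights: work-items stride across the light list.
    for (uint i = gi; i < gLightCount; i += TILE_DIM * TILE_DIM) {
        if (LightIntersectsTile(gLights[i], groupId.xy)) {
            uint slot;
            InterlockedAdd(sTileLightCount, 1, slot);      // atomically append to shared list
            if (slot < MAX_TILE_LIGHTS) sTileLightIndices[slot] = i;
        }
    }
    GroupMemoryBarrierWithGroupSync();                     // barrier
    // Parallel_for over pixels in tile: one work-item per pixel, one G-buffer read.
    float3 position = gGBufferPosition[dtid.xy].xyz;
    float3 normal   = gGBufferNormal[dtid.xy].xyz;
    float3 albedo   = gGBufferAlbedo[dtid.xy].rgb;
    float3 lit = float3(0, 0, 0);
    for (uint n = 0; n < min(sTileLightCount, (uint)MAX_TILE_LIGHTS); ++n) {
        Light light    = gLights[sTileLightIndices[n]];
        float3 toLight = light.positionView - position;
        float  dist    = length(toLight);
        float  atten   = saturate(1.0f - dist / light.attenuationEnd);
        lit += atten * saturate(dot(normal, toLight / max(dist, 1e-5f))) * light.color * albedo;
    }
    gOutput[dtid.xy] = float4(lit, 1.0f);
}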
35Spring 2011 – Beyond Programmable Shading
Tile-Based Deferred Shading
• Goal: amortize overhead
– Large reduction in bandwidth requirements
• Use screen tiles to group lights
– Use tight tile frusta to cull non-intersecting lights
– Reduces number of lights to consider
– Read G-buffer once and evaluate all relevant lights
– Reduces bandwidth of overlapping lights
• See [Andersson 2009] for more details
36Spring 2011 – Beyond Programmable Shading
Lit Scene (1024 Point Lights)
37Spring 2011 – Beyond Programmable Shading
Tile-Based Light Culling
38Spring 2011 – Beyond Programmable Shading
Quad-Based Light Culling
39Spring 2011 – Beyond Programmable Shading
[Chart: Light Culling Only at 1080p. Frame time (ms) vs. number of point lights (16 to 1024) for Quad and Tiled culling on ATI 5870 and NVIDIA 480. Annotations: slope ~0.5 µs / light; slope ~7 µs / light; tile setup dominates.]
40Spring 2011 – Beyond Programmable Shading
[Chart: Total Performance at 1080p. Frame time (ms) vs. number of point lights (16 to 1024) for Deferred Shading, Deferred Lighting, and Tiled on NVIDIA 480 and ATI 5870. Annotations: deferred lighting slightly faster, but trends similarly; slope ~4 µs / light; slope ~20 µs / light; few lights overlap.]
41Spring 2011 – Beyond Programmable Shading
Anti-aliasing
• Multi-sampling with deferred rendering requires some work
– Regular G-buffer couples visibility and shading
• Handle multi-frequency shading in user space
– Store G-buffer at sample frequency
– Only apply per-sample shading where necessary
– Offers additional flexibility over forward rendering
42Spring 2011 – Beyond Programmable Shading
Identifying Edges
• Forward MSAA causes redundant work
– It applies to all triangle edges, even for continuous, tessellated surfaces
• Want to find surface discontinuities
– Compare sample depths to depth derivatives
– Compare (shading) normal deviation over samples
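A hedged sketch of such an edge test, reading a multisampled G-buffer that stores the shading normal and view-space depth per sample. The thresholds and layout are assumptions; the approach described above compares against depth derivatives rather than the fixed relative threshold used here.
#define MSAA_SAMPLES 4
Texture2DMS<float4, MSAA_SAMPLES> gGBufferNormalDepthMS : register(t0);  // xyz = normal, w = view z
bool RequiresPerSampleShading(uint2 coords)
{
    float4 s0 = gGBufferNormalDepthMS.Load(coords, 0);
    bool edge = false;
    [unroll]
    for (uint s = 1; s < MSAA_SAMPLES; ++s) {
        float4 sn = gGBufferNormalDepthMS.Load(coords, s);
        edge = edge || (abs(sn.w - s0.w) > 0.1f * abs(s0.w));   // depth discontinuity
        edge = edge || (dot(sn.xyz, s0.xyz) < 0.95f);           // shading-normal deviation
    }
    return edge;
}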
43Spring 2011 – Beyond Programmable Shading
Per-Sample Shading Visualization
44Spring 2011 – Beyond Programmable Shading
Deferred Rendering Conclusions
• Deferred shading is a useful rendering tool
– Decouples shading from visibility
– Allows efficient user-space scheduling and culling
• Tile-based methods win going forward
– ComputeShader/OpenCL/CUDA implementations save a lot of bandwidth
– Fastest and most flexible
– Enable efficient MSAA
45Spring 2011 – Beyond Programmable Shading
Summary for GPU Compute Languages
• GPU compute languages
– “Easy” way to exploit compute capability of GPUs (easier than 3D APIs)
– The performance benefit over pixel shaders comes when using on-core
R/W memory to save off-chip bandwidth
– Increasingly used as “just another tool in the real-time graphics
programmer’s toolkit”
– Deferred rendering
– Shadows
– Post-processing
– …
– The current languages have a lot of rough edges and limitations.
46Spring 2011 – Beyond Programmable Shading
Backup
47Spring 2011 – Beyond Programmable Shading
Future Work
• Hierarchical light culling
– Straightforward but would need lots of small lights
• Improve MSAA memory usage
– Irregular/compressed sample storage?
– Revisit binning pipelines?
– Sacrifice higher resolutions for better AA?
48Spring 2011 – Beyond Programmable Shading
Acknowledgements
• Microsoft and Crytek for the scene assets
• Johan Andersson from DICE
• Craig Kolb, Matt Pharr, and others in the Advanced Rendering
Technology team at Intel
• Nico Galoppo, Anupreet Kalra and Mike Burrows from Intel
49Spring 2011 – Beyond Programmable Shading
References
• [Andersson 2009] Johan Andersson, “Parallel Graphics in Frostbite - Current & Future”,
http://s09.idav.ucdavis.edu/
• [Fatahalian 2010] Kayvon Fatahalian, “Evolving the Direct3D Pipeline for Real-Time Micropolygon Rendering”, http://bps10.idav.ucdavis.edu/
• [Hoffman 2009] Naty Hoffman, “Deferred Lighting Approaches”,
http://www.realtimerendering.com/blog/deferred-lighting-approaches/
• [Stone 2009] Adrian Stone, “Deferred Shading Shines. Deferred Lighting? Not So Much.”,
http://gameangst.com/?p=141
50Spring 2011 – Beyond Programmable Shading
Questions?
• Full source and demo available at:
– http://visual-computing.intel-research.net/art/publications/deferred_rendering/
51Spring 2011 – Beyond Programmable Shading
Quad-Based Light Culling
52Spring 2011 – Beyond Programmable Shading
Deferred Lighting / Light Pre-Pass
• Goal: reduce G-buffer overhead
• Split diffuse and specular terms
– Common concession is monochromatic specular
• Factor out constant terms from summation
– Albedo, specular amount, etc.
• Sum inner terms over all lights
53Spring 2011 – Beyond Programmable Shading
Deferred Lighting / Light Pre-Pass
• Resolve pass combines factored components
– Still best to store all terms in G-buffer up front
– Better SIMD efficiency
• Incremental improvement for some hardware
– Relies on pre-factoring lighting functions
– Ability to vary resolve pass is not particularly useful
• See [Hoffman 2009] and [Stone 2009]
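A sketch of the resolve pass that this factoring implies, assuming a common light pre-pass layout with monochromatic specular accumulated in the alpha channel (the buffer layout and names are illustrative, not from the talk):
// Resolve pass: combine the factored-out surface terms with the per-light sums
// accumulated by the light pre-pass.
Texture2D<float4> gLightAccum : register(t0);   // rgb = diffuse sum, a = mono specular sum
Texture2D<float4> gAlbedoSpec : register(t1);   // rgb = albedo,      a = specular amount
float4 ResolvePS(float4 svPos : SV_Position) : SV_Target
{
    int3 coords = int3(svPos.xy, 0);
    float4 accum = gLightAccum.Load(coords);
    float4 surf  = gAlbedoSpec.Load(coords);
    // color = albedo * sum(diffuse_i) + specularAmount * sum(specular_i)
    return float4(surf.rgb * accum.rgb + surf.a * accum.a, 1.0f);
}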
54Spring 2011 – Beyond Programmable Shading
MSAA with Quad-Based Methods
• Mark pixels for per-sample shading
– Stencil still faster than branching on most hardware
– Probably gets scheduled better
• Shade in two passes: per-pixel and per-sample
– Unfortunately, duplicates culling work
– Scheduling is still a problem
55Spring 2011 – Beyond Programmable Shading
Per-Sample Scheduling
• Lack of spatial locality causes hardware scheduling inefficiency
56Spring 2011 – Beyond Programmable Shading
MSAA with Tile-Based Methods
• Handle per-pixel and per-sample in one pass
– Avoids duplicate culling work
– Can use branching, but incurs scheduling problems
– Instead, reschedule per-sample pixels
– Shade sample 0 for the whole tile
– Pack a list of pixels that require per-sample shading
– Redistribute threads to process additional samples
– Scatter per-sample shaded results
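Inside the tiled compute shader, that rescheduling can be sketched roughly as the fragment below. The names, packed layout, and MSAA_SAMPLES are assumptions; it is meant to continue a tiled kernel like the one sketched earlier (reusing dtid, gi, TILE_DIM, and the RequiresPerSampleShading test), after sample 0 has been shaded for this work-item's pixel.
#define MSAA_SAMPLES 4
groupshared uint sPerSamplePixels[TILE_DIM * TILE_DIM];   // packed pixel indices
groupshared uint sPerSamplePixelCount;                    // zero-initialized by one lane + barrier
// ... after shading sample 0 for pixel dtid.xy (group index gi) ...
if (RequiresPerSampleShading(dtid.xy)) {
    uint slot;
    InterlockedAdd(sPerSamplePixelCount, 1, slot);        // pack this pixel into the shared list
    sPerSamplePixels[slot] = gi;
}
GroupMemoryBarrierWithGroupSync();
// Redistribute: the whole group strides over (pixel, extra-sample) pairs.
uint workCount = sPerSamplePixelCount * (MSAA_SAMPLES - 1);
for (uint w = gi; w < workCount; w += TILE_DIM * TILE_DIM) {
    uint packedPixel = sPerSamplePixels[w / (MSAA_SAMPLES - 1)];
    uint sampleIndex = 1 + (w % (MSAA_SAMPLES - 1));
    // ... shade (packedPixel, sampleIndex) and scatter the result to the output ...
}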
57Spring 2011 – Beyond Programmable Shading
Tile-Based MSAA at 1080p, 1024 Lights
[Chart: frame time (ms, 0 to 35) for Crytek Sponza and a 2009 game scene on ATI 5870 and NVIDIA 480, comparing No MSAA, 4x MSAA (Branching), and 4x MSAA (Packed).]
58Spring 2011 – Beyond Programmable Shading
[Chart: 4x MSAA Performance at 1080p. Frame time (ms) vs. number of point lights (16 to 1024) for Deferred Shading, Deferred Lighting, and Tiled on ATI 5870 and NVIDIA 480. Annotations: slope ~5 µs / light; slope ~35 µs / light; tiled takes less of a hit from MSAA; deferred lighting even less compelling.]
Mais conteúdo relacionado

Mais procurados

Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Unity Technologies
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility bufferWolfgang Engel
 
How data rules the world: Telemetry in Battlefield Heroes
How data rules the world: Telemetry in Battlefield HeroesHow data rules the world: Telemetry in Battlefield Heroes
How data rules the world: Telemetry in Battlefield HeroesElectronic Arts / DICE
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14AMD Developer Central
 
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandPG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandAMD Developer Central
 
Optimize your game with the Profile Analyzer - Unite Copenhagen 2019
Optimize your game with the Profile Analyzer - Unite Copenhagen 2019Optimize your game with the Profile Analyzer - Unite Copenhagen 2019
Optimize your game with the Profile Analyzer - Unite Copenhagen 2019Unity Technologies
 
Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion CullingIntel® Software
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Johan Andersson
 
Optimizing Games for Mobiles
Optimizing Games for MobilesOptimizing Games for Mobiles
Optimizing Games for MobilesSt1X
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Johan Andersson
 
OpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesOpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesNarann29
 
Scene Graphs & Component Based Game Engines
Scene Graphs & Component Based Game EnginesScene Graphs & Component Based Game Engines
Scene Graphs & Component Based Game EnginesBryan Duggan
 
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...AMD Developer Central
 
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadOpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadTristan Lorach
 

Mais procurados (20)

Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility buffer
 
How data rules the world: Telemetry in Battlefield Heroes
How data rules the world: Telemetry in Battlefield HeroesHow data rules the world: Telemetry in Battlefield Heroes
How data rules the world: Telemetry in Battlefield Heroes
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl HilleslandPG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
PG-4034, Using OpenGL and DirectX for Heterogeneous Compute, by Karl Hillesland
 
Optimize your game with the Profile Analyzer - Unite Copenhagen 2019
Optimize your game with the Profile Analyzer - Unite Copenhagen 2019Optimize your game with the Profile Analyzer - Unite Copenhagen 2019
Optimize your game with the Profile Analyzer - Unite Copenhagen 2019
 
Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion Culling
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
 
Optimizing Games for Mobiles
Optimizing Games for MobilesOptimizing Games for Mobiles
Optimizing Games for Mobiles
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
 
OpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesOpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering Techniques
 
Low-level Graphics APIs
Low-level Graphics APIsLow-level Graphics APIs
Low-level Graphics APIs
 
Hair in Tomb Raider
Hair in Tomb RaiderHair in Tomb Raider
Hair in Tomb Raider
 
Scene Graphs & Component Based Game Engines
Scene Graphs & Component Based Game EnginesScene Graphs & Component Based Game Engines
Scene Graphs & Component Based Game Engines
 
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
 
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadOpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
 

Semelhante a Example uses of gpu compute models

[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio Owen Wu
 
Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Alexander Dolbilov
 
DIANNE - A distributed deep learning framework on OSGi - Tim Verbelen
DIANNE - A distributed deep learning framework on OSGi - Tim VerbelenDIANNE - A distributed deep learning framework on OSGi - Tim Verbelen
DIANNE - A distributed deep learning framework on OSGi - Tim Verbelenmfrancis
 
Minko - Targeting Flash/Stage3D with C++ and GLSL
Minko - Targeting Flash/Stage3D with C++ and GLSLMinko - Targeting Flash/Stage3D with C++ and GLSL
Minko - Targeting Flash/Stage3D with C++ and GLSLMinko3D
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
BitSquid Tech: Benefits of a data-driven renderer
BitSquid Tech: Benefits of a data-driven rendererBitSquid Tech: Benefits of a data-driven renderer
BitSquid Tech: Benefits of a data-driven renderertobias_persson
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
Gdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glGdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glchangehee lee
 
Unite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platformsUnite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platformsナム-Nam Nguyễn
 
Using the BizTalk Mapper
Using the BizTalk MapperUsing the BizTalk Mapper
Using the BizTalk MapperDaniel Toomey
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...AMD Developer Central
 

Semelhante a Example uses of gpu compute models (20)

[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
 
Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)
 
DIANNE - A distributed deep learning framework on OSGi - Tim Verbelen
DIANNE - A distributed deep learning framework on OSGi - Tim VerbelenDIANNE - A distributed deep learning framework on OSGi - Tim Verbelen
DIANNE - A distributed deep learning framework on OSGi - Tim Verbelen
 
Minko - Targeting Flash/Stage3D with C++ and GLSL
Minko - Targeting Flash/Stage3D with C++ and GLSLMinko - Targeting Flash/Stage3D with C++ and GLSL
Minko - Targeting Flash/Stage3D with C++ and GLSL
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
BitSquid Tech: Benefits of a data-driven renderer
BitSquid Tech: Benefits of a data-driven rendererBitSquid Tech: Benefits of a data-driven renderer
BitSquid Tech: Benefits of a data-driven renderer
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Gdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glGdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_gl
 
Unite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platformsUnite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platforms
 
Using the BizTalk Mapper
Using the BizTalk MapperUsing the BizTalk Mapper
Using the BizTalk Mapper
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
 
GPU Computing with CUDA
GPU Computing with CUDAGPU Computing with CUDA
GPU Computing with CUDA
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 

Mais de Pedram Mazloom

PowerVR performance recommendations
PowerVR performance recommendationsPowerVR performance recommendations
PowerVR performance recommendationsPedram Mazloom
 
Nvidia cuda programming_guide_1.0
Nvidia cuda programming_guide_1.0Nvidia cuda programming_guide_1.0
Nvidia cuda programming_guide_1.0Pedram Mazloom
 
Higher dimension raster 5d
Higher dimension raster 5dHigher dimension raster 5d
Higher dimension raster 5dPedram Mazloom
 
Gdc11 lighting used in BF3
Gdc11 lighting used in BF3Gdc11 lighting used in BF3
Gdc11 lighting used in BF3Pedram Mazloom
 
Deferred rendering using compute shader
Deferred rendering using compute shaderDeferred rendering using compute shader
Deferred rendering using compute shaderPedram Mazloom
 
GPU performance analysis
GPU performance analysisGPU performance analysis
GPU performance analysisPedram Mazloom
 

Mais de Pedram Mazloom (9)

Swarm ai
Swarm aiSwarm ai
Swarm ai
 
PowerVR performance recommendations
PowerVR performance recommendationsPowerVR performance recommendations
PowerVR performance recommendations
 
Physics artifacts
Physics artifactsPhysics artifacts
Physics artifacts
 
Open gl tips
Open gl tipsOpen gl tips
Open gl tips
 
Nvidia cuda programming_guide_1.0
Nvidia cuda programming_guide_1.0Nvidia cuda programming_guide_1.0
Nvidia cuda programming_guide_1.0
 
Higher dimension raster 5d
Higher dimension raster 5dHigher dimension raster 5d
Higher dimension raster 5d
 
Gdc11 lighting used in BF3
Gdc11 lighting used in BF3Gdc11 lighting used in BF3
Gdc11 lighting used in BF3
 
Deferred rendering using compute shader
Deferred rendering using compute shaderDeferred rendering using compute shader
Deferred rendering using compute shader
 
GPU performance analysis
GPU performance analysisGPU performance analysis
GPU performance analysis
 

Último

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Último (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Example uses of gpu compute models

  • 1. 1Spring 2011 – Beyond Programmable Shading Graphics with GPU Compute APIs Mike Houston, AMD / Stanford Aaron Lefohn, Intel / University of Washington
  • 2. 2Spring 2011 – Beyond Programmable Shading What’s In This Talk? • Brief review of last lecture • Advanced usage patterns of GPU compute languages • Rendering uses cases for GPU Computing Languages – Histograms (for shadows, tone mapping, etc) – Deferred rendering – Writing new graphics pipelines (sort of )
  • 3. 3Spring 2011 – Beyond Programmable Shading Remember: Figure by Kayvon Fatahalian
  • 4. 4Spring 2011 – Beyond Programmable Shading Definitions: Execution • Task – A logically related set of instructions executed in a single execution context (aka shader, instance of a kernel, task) • Concurrent execution – Multiple tasks that may execute simultaneously (because they are logically independent) • Parallel execution – Multiple tasks whose execution contexts are guaranteed to be live simultaneously (because you want them to be for locality, synchronization, etc)
  • 5. 5Spring 2011 – Beyond Programmable Shading Synchronization • Synchronization – Restricting when tasks are permitted to execute • Granularity of permitted synchronization determines at which granularity system allows user to control scheduling
  • 6. 6Spring 2011 – Beyond Programmable Shading GPU Compute Languages Review • “Write code from within two nested concurrent/parallel loops” • Abstracts – Cores, execution contexts, and SIMD ALUs • Exposes – Parallel execution contexts on same core – Fast R/W on-core memory shared by the execution contexts on same core • Synchronization – Fine grain: between execution contexts on same core – Very coarse: between large sets of concurrent work – No medium-grain synchronization “between function calls” like task systems provide
  • 7. 7Spring 2011 – Beyond Programmable Shading GPU Compute Pseudocode void myWorkGroup() { parallel_for(i = 0 to NumWorkItems - 1) { … GPU Kernel Code … (This is where you write GPU compute code) } } void main() { concurrent_for( i = 0 to NumWorkGroups - 1) { myWorkGroup(); } sync; }
  • 8. 8Spring 2011 – Beyond Programmable Shading DX CS/OCL/CUDA Execution Model • Fundamental unit is work-item – Single instance of “kernel” program (i.e., “task” using the definitions in this talk) – Each work-item executes in single SIMD lane • Work items collected in work-groups – Work-group scheduled on single core – Work-items in a work-group – Execute in parallel – Can share R/W on-chip scratchpad memory – Can wait for other work-items in work-group • Users launch a grid of work-groups – Spawn many concurrent work-groups void f(...) { int x = ...; ...; ...; if(...) { ... } } Figure by Tim Foley
  • 9. 9Spring 2011 – Beyond Programmable Shading When Use GPU Compute vs Pixel Shader? • Use GPU compute language if your algorithm needs on-chip memory – Reduce bandwidth by building local data structures • Otherwise, use pixel shader – All mapping, decomposition, and scheduling decisions automatic – (Easier to reach peak performance)
  • 10. 10Spring 2011 – Beyond Programmable Shading Conventional Thread Parallelism on GPUs • Also called “persistent threads” • “Expert” usage model for GPU compute – Defeat abstractions over cores, execution contexts, and SIMD functional units – Defeat system scheduler, load balancing, etc. – Code not portable between architectures
  • 11. 11Spring 2011 – Beyond Programmable Shading • Execution – Two-level parallel execution model – Lower level: parallel execution of M identical tasks on M-wide SIMD functional unit – Higher level: parallel execution of N different tasks on N execution contexts • What is abstracted? – Nothing (other than automatic mapping to SIMD lanes) • Where is synchronization allowed? – Lower-level: between any task running on same SIMD functional unit – Higher-level: between any execution context Conventional Thread Parallelism on GPUs
  • 12. 12Spring 2011 – Beyond Programmable Shading Why Persistent Threads? • Enable alternate programming models that require different scheduling and synchronization rules than the default model provides • Example alternate programming models – Task systems (esp. nested task parallelism) – Producer-consumer rendering pipelines – (See references at end of this slide deck for more details)
  • 13. 13Spring 2011 – Beyond Programmable Shading “Sample Distribution Shadow Maps,” Lauritzen et al., SIGGRAPH 2010 Without ComputeShader With ComputeShader Sample Distribution Shadow Maps
  • 14. 14Spring 2011 – Beyond Programmable Shading Sample Distribution Shadow Maps • Algorithm Overview – Render depth buffer from eye – Analyze depth buffer with Compute Shader: – Build histogram of pixel positions – Analyze histogram to optimize shadow rendering – Renders shadow maps • Compute Shader to build histogram (from pixel data) – Sequential and parallel code in single kernel (many fork-joins) – Liberal use of atomic to shared (on-chip) and global (off-chip) memory – Gather and scatter to arbitrary locations in off-chip and on-chip memory • Compute Shader to analyze histogram – Sequential and parallel code in single kernel (many fork-joins)
  • 15. 15Spring 2011 – Beyond Programmable Shading Building Histogram in DX11 CS [numthreads(BLOCK_DIM, BLOCK_DIM, 1)] void ScatterHistogram(uint3 groupId : SV_GroupID, uint3 groupThreadId : SV_GroupThreadID, uint groupIndex : SV_GroupIndex) { // Initialize local histogram in parallel // Parallelism: // - Within threadgroup: SIMD lanes map to histogram bins // - Between threadgroups: Each threadgroup has own histogram localHistogram[groupIndex] = emptyBin(); GroupMemoryBarrierWithGroupSync(); . . .
  • 16. 16Spring 2011 – Beyond Programmable Shading Building Histogram in DX11 CS // Build histogram in parallel // Parallelism: // - Within threadgroup: SIMD lanes map to pixels in image tile // - Between threadgroups: Each threadgroup maps to image tile // Read and compute surface data uint2 globalCoords = groupId.xy * TILE_DIM + groupThreadId.xy; SurfaceData data = ComputeSurfaceDataFromGBuffer(globalCoords); // Bin based on view space Z // Scatter data to the right bin in our local (on-chip) histogram int bin = int(ZToBin(data.positionView.z)); InterlockedAdd(localHistogram[bin].count, 1U); InterlockedMin(localHistogram[bin].bounds.minTexCoordX, data.texCoordX); InterlockedMax(localHistogram[bin].bounds.maxTexCoordX, data.texCoordX); //… (more atomic min/max operations for other values in histogram bin) … GroupMemoryBarrierWithGroupSync(); . . .
  • 17. 17Spring 2011 – Beyond Programmable Shading Building Histogram in DX11 CS // Use per-threadgroup scalar code to atomically merge all on-chip histograms into // single histogram in global memory. // Parallelism // - Within threadgroup: SIMD lanes map to histogram elements // - Between threadgroups: Each threadgroup writing to single global histogram uint i = groupIndex; if (localHistogram[i].count > 0) { InterlockedAdd(gHistogram[i].count, histogram[i].count); InterlockedMin(gHistogram[i].bounds.minTexCoordX, histogram[i].bounds.minTexCoordX ); InterlockedMin(gHistogram[i].bounds.minTexCoordY, histogram[i].bounds.minTexCoordY ); InterlockedMin(gHistogram[i].bounds.minLightSpaceZ, histogram[i].bounds.minLightSpaceZ); InterlockedMax(gHistogram[i].bounds.maxTexCoordX, histogram[i].bounds.maxTexCoordX ); InterlockedMax(gHistogram[i].bounds.maxTexCoordY, histogram[i].bounds.maxTexCoordY ); InterlockedMax(gHistogram[i].bounds.maxLightSpaceZ, histogram[i].bounds.maxLightSpaceZ); } }
  • 18. 18Spring 2011 – Beyond Programmable Shading Optimization: Moving farther away from basic data-parallelism • Problem---1:1 mapping between workgroups and image tiles – Flushes local memory to global memory more times than necessary – Would like larger workgroups but limited to 1024 workitems per group • Solution – Use the largest workgroups possible (1024 workitems) – Launch fewer workgroups. Find sweet spot that fills all threads on all cores to maximize latency hiding but minimizes the writes to global memory – Loop over multiple image tiles within a single compute shader • Take-away – “Invoke just enough parallel work to fill the SIMD lanes, threads, and cores of the machine to achieve sufficient latency hiding” – The abstraction is broken because this optimization exposes the number of hardware resources 
  • 19. 19Spring 2011 – Beyond Programmable Shading Building Histogram in DX11 CS // Build histogram in parallel // Parallelism: // - Within threadgroup: SIMD lanes map to pixels in image tile // - Between threadgroups: Each threadgroup maps to image tile uint2 tileStart = groupId.xy * TILE_DIM + groupThreadId.xy; for (uint tileY = 0; tileY < TILE_DIM; tileY += BLOCK_DIM) { for (uint tileX = 0; tileX < TILE_DIM; tileX += BLOCK_DIM) { // Read and compute surface data uint2 globalCoords = groupId.xy * TILE_DIM + groupThreadId.xy; SurfaceData data = ComputeSurfaceDataFromGBuffer(globalCoords); // Bin based on view space Z // Scatter data to the right bin in our local (on-chip) histogram int bin = int(ZToBin(data.positionView.z)); InterlockedAdd(localHistogram[bin].count, 1U); InterlockedMin(localHistogram[bin].bounds.minTexCoordX, data.texCoordX); … (more atomic min/max ops for other values in histogram bin) … }} GroupMemoryBarrierWithGroupSync(); . . .
  • 20. 20Spring 2011 – Beyond Programmable Shading SW Pipeline 1: Particle Rasterizer • Mock-up particle rendering pipeline with render-target-read – Written by 2 people over the course of 1 week – Runs ~2x slower than D3D rendering pipeline (but has glass jaws) Without Volumetric Shadow With Volumetric Shadow
  • 21. 21Spring 2011 – Beyond Programmable Shading Tiled Particle Rasterizer in DX11 CS [numthreads(RAST_THREADS_X, RAST_THREADS_Y, 1)] void RasterizeParticleCS(uint3 groupId : SV_GroupID, uint3 groupThreadId : SV_GroupThreadID, uint groupIndex : SV_GroupIndex) { uint i = 0; // For all particles.. while (i < mParticleCount) { GroupMemoryBarrierWithGroupSync(); const uint particlePerIter = min(mParticleCount - i, NT_X * NT_Y); // Vertex shader and primitive assembly // Parallelism: SIMD lanes map over particles. if (groupIndex < particlePerIter) { const uint particleIndex = i + groupIndex; // … read vertex data for this particle from memory, // construct screen-facing quad, test if particle intersects tile, // use atomics to on-chip memory to append to list of particles } GroupMemoryBarrierWithGroupSync(); . . .
  • 22. 22Spring 2011 – Beyond Programmable Shading Tiled Particle Rasterizer in DX11 CS // Find all particles that intersect this pixel // Parallelism: SIMD lanes map over pixels in image tile for (n = 0; n < gVisibileParticlePerIter; n++) { if (ParticleIntersectsPixel(gParticles[n], fragmentPos)) { float dx, dy; ComputeInterpolants(gParticles[n], fragmentPos, dx, dy); float3 viewPos = BilinearInterp3(gParticles[n].viewPos, dx, dy); float3 entry, exit, t; if (IntersectParticle(viewPos, gParticles[n], entry, exit, t)) { // Run pixel shader on this particle // Read-modify-write framebuffer held in global off-chip memory } } } i += particlePerIter; }
23Spring 2011 – Beyond Programmable Shading
SW Pipeline 1: Particle Rasterizer
• Usage
– Atomics to on-chip memory
– Gather/scatter to on-chip and off-chip memory
– Latency hiding of off-chip memory accesses
• Lesson learned
– The programmer productivity of these programming models is impressive
– This pipeline is statically scheduled (from a SW perspective), but the underlying hardware scheduler is dynamically scheduling threadgroups
– It would need dynamic SW scheduling to achieve more stable and higher performance
24Spring 2011 – Beyond Programmable Shading
Software Pipeline 2: OptiX
• NVIDIA interactive ray-tracing library
– Started as a research project; product announced in 2009
• Custom rendering pipeline
– Implemented entirely with CUDA using the “persistent threads” usage pattern
– Users define geometry, materials, and lights in C-for-CUDA
– PTX intermediate layer used to synthesize optimized code
• Other custom pipelines in the research world
– Stochastic rasterization, particles, conservative rasterization, …
25Spring 2011 – Beyond Programmable Shading
Deferred Rendering
(Slides by Andrew Lauritzen)
(Possibly the most important use of ComputeShader)
26Spring 2011 – Beyond Programmable Shading
Overview
• Forward shading
• Deferred shading and lighting
• Tile-based deferred shading
27Spring 2011 – Beyond Programmable Shading
Forward Shading
• Do everything we need to shade a pixel
– for each light
– Shadow attenuation (sampling shadow maps)
– Distance attenuation
– Evaluate lighting and accumulate
• Multi-pass rendering requires resubmitting scene geometry
– Not a scalable solution
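As a hedged illustration of the per-light loop above, written as a forward pixel-shader helper; gLights, NUM_LIGHTS, ShadowAttenuation(), DistanceAttenuation(), and EvaluateBRDF() are illustrative names, not from the original material (SurfaceData follows the histogram slide).

float3 ForwardShadePixel(SurfaceData surface)
{
    float3 lit = float3(0, 0, 0);
    for (uint i = 0; i < NUM_LIGHTS; ++i)
    {
        float shadow = ShadowAttenuation(gLights[i], surface.positionView);   // shadow-map sample
        float atten  = DistanceAttenuation(gLights[i], surface.positionView); // distance falloff
        lit += shadow * atten * EvaluateBRDF(gLights[i], surface);            // evaluate and accumulate
    }
    return lit;
}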
28Spring 2011 – Beyond Programmable Shading
Forward Shading Problems
• Ineffective light culling
– Object-space culling at best
– Trade-off with shader permutations/batching
• Memory footprint of all inputs
– Everything must be resident at the same time (!)
• Shading small triangles is inefficient
– Covered earlier in this course: [Fatahalian 2010]
29Spring 2011 – Beyond Programmable Shading
Conventional Deferred Shading
• Store lighting inputs in memory (G-buffer)
– for each light
– Use the rasterizer to scatter the light volume and cull
– Read lighting inputs from the G-buffer
– Compute lighting
– Accumulate lighting with additive blending
• Reorders computation to extract coherence
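A hedged sketch of the per-light pass: the rasterizer scatters a light volume (or quad), and this pixel shader reads the G-buffer, computes lighting, and relies on additive blending to accumulate. ReconstructViewPos(), EvaluateLight(), gCurrentLight, and the G-buffer layout are illustrative assumptions, not the demo's actual code.

Texture2D<float4> gGBufferNormals : register(t0);
Texture2D<float4> gGBufferAlbedo  : register(t1);
Texture2D<float>  gGBufferDepth   : register(t2);

float4 DeferredLightPS(float4 svPos : SV_Position) : SV_Target
{
    uint2 coords = uint2(svPos.xy);
    float3 normal  = gGBufferNormals.Load(int3(coords, 0)).xyz;
    float3 albedo  = gGBufferAlbedo.Load(int3(coords, 0)).xyz;
    float  depth   = gGBufferDepth.Load(int3(coords, 0));
    float3 viewPos = ReconstructViewPos(svPos.xy, depth);

    // One light per pass; output is additively blended into the accumulation target
    return float4(EvaluateLight(gCurrentLight, viewPos, normal, albedo), 1.0f);
}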
30Spring 2011 – Beyond Programmable Shading
Modern Implementation
• Cull with screen-aligned quads
– Cover light extents with an axis-aligned bounding box
– Full light meshes (spheres, cones) are generally overkill
– Can use an oriented bounding box for narrow spot lights
– Use a conservative single-direction depth test
– Two-pass stencil is more expensive than it is worth
– Depth bounds test exists on some hardware, but it is not batch-friendly
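A hedged, simplified sketch of computing a conservative screen-space quad for a point light by projecting the corners of its view-space bounding box; the demo may well use a tighter sphere-projection bound, and LightScreenBounds plus the row-vector mul convention are illustrative only.

// Conservative NDC-space AABB for a point light.
// Assumes the box stays in front of the near plane; lights crossing the near
// plane need special-casing (e.g., fall back to a full-screen quad).
float4 LightScreenBounds(float3 centerView, float radius, float4x4 projection)
{
    float2 minNdc = float2( 1,  1);
    float2 maxNdc = float2(-1, -1);
    for (uint i = 0; i < 8; ++i)
    {
        // Corner of the view-space AABB enclosing the light sphere
        float3 corner = centerView + radius * float3(
            (i & 1) ? 1 : -1, (i & 2) ? 1 : -1, (i & 4) ? 1 : -1);
        float4 clip = mul(float4(corner, 1.0f), projection);   // row-vector convention
        float2 ndc  = clip.xy / clip.w;
        minNdc = min(minNdc, ndc);
        maxNdc = max(maxNdc, ndc);
    }
    return float4(minNdc, maxNdc);   // (minX, minY, maxX, maxY) in NDC
}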
31Spring 2011 – Beyond Programmable Shading
Lit Scene (256 Point Lights)
32Spring 2011 – Beyond Programmable Shading
Deferred Shading Problems
• Bandwidth overhead when lights overlap
– for each light
– Use the rasterizer to scatter the light volume and cull
– Read lighting inputs from the G-buffer ← overhead
– Compute lighting
– Accumulate lighting with additive blending ← overhead
• Not doing enough work to amortize the overhead
33Spring 2011 – Beyond Programmable Shading
Improving Deferred Shading
• Reduce G-buffer overhead
– Access fewer things inside the light loop
– Deferred lighting / light pre-pass
• Amortize overhead
– Group overlapping lights and process them together
– Tile-based deferred shading
34Spring 2011 – Beyond Programmable Shading
Tile-Based Deferred Rendering
Parallel_for over lights
    Atomically append lights that affect this tile to a shared list
Barrier
Parallel_for over pixels in tile
    Evaluate all selected lights at each pixel
35Spring 2011 – Beyond Programmable Shading
Tile-Based Deferred Shading
• Goal: amortize overhead
– Large reduction in bandwidth requirements
• Use screen tiles to group lights
– Use tight tile frusta to cull non-intersecting lights
– Reduces the number of lights to consider
– Read the G-buffer once and evaluate all relevant lights
– Reduces the bandwidth cost of overlapping lights
• See [Andersson 2009] for more details
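A hedged sketch of how the pseudocode on the previous slide maps onto a DX11 compute shader, including the tile-frustum light culling described above. The Light and Frustum types, ComputeTileFrustum(), LightIntersectsFrustum(), ShadePixel(), gNumLights, and MAX_LIGHTS_PER_TILE are illustrative names, not the published demo source; TILE_DIM and ComputeSurfaceDataFromGBuffer() follow the earlier slides.

StructuredBuffer<Light> gLights      : register(t0);
RWTexture2D<float4>     gFramebuffer : register(u0);

groupshared uint sTileLightIndices[MAX_LIGHTS_PER_TILE];
groupshared uint sTileLightCount;

[numthreads(TILE_DIM, TILE_DIM, 1)]
void TiledDeferredCS(uint3 groupId : SV_GroupID,
                     uint3 dispatchThreadId : SV_DispatchThreadID,
                     uint  groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0) sTileLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Parallel_for over lights: SIMD lanes stride over the light list and
    // append intersecting lights to the shared per-tile list.
    Frustum tileFrustum = ComputeTileFrustum(groupId.xy);
    for (uint i = groupIndex; i < gNumLights; i += TILE_DIM * TILE_DIM)
    {
        if (LightIntersectsFrustum(gLights[i], tileFrustum))
        {
            uint listIndex;
            InterlockedAdd(sTileLightCount, 1, listIndex);   // assumes MAX_LIGHTS_PER_TILE is large enough
            sTileLightIndices[listIndex] = i;
        }
    }
    GroupMemoryBarrierWithGroupSync();   // Barrier

    // Parallel_for over pixels in tile: read the G-buffer once, then
    // evaluate every light in the shared list at this pixel.
    SurfaceData data = ComputeSurfaceDataFromGBuffer(dispatchThreadId.xy);
    float3 lit = float3(0, 0, 0);
    for (uint n = 0; n < sTileLightCount; ++n)
        lit += ShadePixel(gLights[sTileLightIndices[n]], data);

    gFramebuffer[dispatchThreadId.xy] = float4(lit, 1.0f);
}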
36Spring 2011 – Beyond Programmable Shading
Lit Scene (1024 Point Lights)

37Spring 2011 – Beyond Programmable Shading
Tile-Based Light Culling

38Spring 2011 – Beyond Programmable Shading
Quad-Based Lighting Culling
39Spring 2011 – Beyond Programmable Shading
Light Culling Only at 1080p
(Chart: frame time in ms vs. number of point lights from 16 to 1024, for Quad and Tiled culling on ATI 5870 and NVIDIA 480. Annotations: “Slope ~ 0.5 µs / light”, “Slope ~ 7 µs / light”, “Tile setup dominates”.)
40Spring 2011 – Beyond Programmable Shading
Total Performance at 1080p
(Chart: frame time in ms vs. number of point lights from 16 to 1024, for Deferred Shading, Deferred Lighting, and Tiled on ATI 5870 and NVIDIA 480. Annotations: “Deferred lighting slightly faster, but trends similarly”, “Slope ~ 4 µs / light”, “Slope ~ 20 µs / light”, “Few lights overlap”.)
41Spring 2011 – Beyond Programmable Shading
Anti-aliasing
• Multi-sampling with deferred rendering requires some work
– A regular G-buffer couples visibility and shading
• Handle multi-frequency shading in user space
– Store the G-buffer at sample frequency
– Only apply per-sample shading where necessary
– Offers additional flexibility over forward rendering
42Spring 2011 – Beyond Programmable Shading
Identifying Edges
• Forward MSAA causes redundant work
– It applies to all triangle edges, even for continuous, tessellated surfaces
• Want to find surface discontinuities
– Compare sample depths to depth derivatives
– Compare (shading) normal deviation over samples
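A hedged sketch of that per-pixel edge test, assuming the G-buffer is stored per sample in multisampled resources. DEPTH_EPSILON, NORMAL_EPSILON, MSAA_SAMPLES, GetMaxDepthDerivative(), and the resource names are illustrative, not the demo's exact heuristics.

Texture2DMS<float>  gGBufferDepthMS   : register(t3);
Texture2DMS<float4> gGBufferNormalsMS : register(t4);

bool RequiresPerSampleShading(uint2 coords)
{
    float  depth0  = gGBufferDepthMS.Load(coords, 0);
    float3 normal0 = gGBufferNormalsMS.Load(coords, 0).xyz;

    // Depth variation the surface slope alone would explain across this pixel
    // (e.g., from stored or reconstructed screen-space depth derivatives).
    float depthThreshold = DEPTH_EPSILON * GetMaxDepthDerivative(coords);

    bool edge = false;
    for (uint s = 1; s < MSAA_SAMPLES; ++s)
    {
        float  depthS  = gGBufferDepthMS.Load(coords, s);
        float3 normalS = gGBufferNormalsMS.Load(coords, s).xyz;
        // A depth discontinuity larger than the slope predicts, or a large
        // shading-normal deviation between samples, marks an edge pixel.
        edge = edge || abs(depthS - depth0) > depthThreshold
                    || dot(normalS, normal0) < NORMAL_EPSILON;
    }
    return edge;
}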
43Spring 2011 – Beyond Programmable Shading
Per-Sample Shading Visualization
44Spring 2011 – Beyond Programmable Shading
Deferred Rendering Conclusions
• Deferred shading is a useful rendering tool
– Decouples shading from visibility
– Allows efficient user-space scheduling and culling
• Tile-based methods win going forward
– ComputeShader/OpenCL/CUDA implementations save a lot of bandwidth
– Fastest and most flexible
– Enable efficient MSAA
45Spring 2011 – Beyond Programmable Shading
Summary for GPU Compute Languages
• GPU compute languages
– An “easy” way to exploit the compute capability of GPUs (easier than the 3D APIs)
– The performance benefit over pixel shaders comes when using on-core R/W memory to save off-chip bandwidth
– Increasingly used as “just another tool in the real-time graphics programmer’s toolkit”
– Deferred rendering
– Shadows
– Post-processing
– …
– The current languages still have a lot of rough edges and limitations
46Spring 2011 – Beyond Programmable Shading
Backup
47Spring 2011 – Beyond Programmable Shading
Future Work
• Hierarchical light culling
– Straightforward, but would need lots of small lights
• Improve MSAA memory usage
– Irregular/compressed sample storage?
– Revisit binning pipelines?
– Sacrifice higher resolutions for better AA?
48Spring 2011 – Beyond Programmable Shading
Acknowledgements
• Microsoft and Crytek for the scene assets
• Johan Andersson from DICE
• Craig Kolb, Matt Pharr, and others in the Advanced Rendering Technology team at Intel
• Nico Galoppo, Anupreet Kalra, and Mike Burrows from Intel
49Spring 2011 – Beyond Programmable Shading
References
• [Andersson 2009] Johan Andersson, “Parallel Graphics in Frostbite - Current & Future”, http://s09.idav.ucdavis.edu/
• [Fatahalian 2010] Kayvon Fatahalian, “Evolving the Direct3D Pipeline for Real-Time Micropolygon Rendering”, http://bps10.idav.ucdavis.edu/
• [Hoffman 2009] Naty Hoffman, “Deferred Lighting Approaches”, http://www.realtimerendering.com/blog/deferred-lighting-approaches/
• [Stone 2009] Adrian Stone, “Deferred Shading Shines. Deferred Lighting? Not So Much.”, http://gameangst.com/?p=141
50Spring 2011 – Beyond Programmable Shading
Questions?
• Full source and demo available at:
– http://visual-computing.intel-research.net/art/publications/deferred_rendering/
51Spring 2011 – Beyond Programmable Shading
Quad-Based Light Culling
52Spring 2011 – Beyond Programmable Shading
Deferred Lighting / Light Pre-Pass
• Goal: reduce G-buffer overhead
• Split the diffuse and specular terms
– A common concession is monochromatic specular
• Factor out constant terms from the summation
– Albedo, specular amount, etc.
• Sum the inner terms over all lights (see the factored form below)
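As a hedged illustration of the factoring, using a simple Lambert plus Blinn-Phong model (not necessarily the exact shading model in the demo), where $k_d$ is the albedo, $k_s$ the specular amount, $c_i$ the light color, $a_i$ the combined shadow and distance attenuation, and $\alpha$ the specular exponent:

    L = k_d \sum_{i} c_i \, a_i \, (\mathbf{n}\cdot\mathbf{l}_i)
      + k_s \sum_{i} c_i \, a_i \, (\mathbf{n}\cdot\mathbf{h}_i)^{\alpha}

The light pre-pass accumulates only the two per-light sums; the resolve pass multiplies them by the per-pixel constants $k_d$ and $k_s$ (the second sum collapses to a single scalar with monochromatic specular).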
53Spring 2011 – Beyond Programmable Shading
Deferred Lighting / Light Pre-Pass
• Resolve pass combines the factored components
– Still best to store all terms in the G-buffer up front
– Better SIMD efficiency
• Incremental improvement for some hardware
– Relies on pre-factoring the lighting functions
– The ability to vary the resolve pass is not particularly useful
• See [Hoffman 2009] and [Stone 2009]
54Spring 2011 – Beyond Programmable Shading
MSAA with Quad-Based Methods
• Mark pixels for per-sample shading
– Stencil is still faster than branching on most hardware
– Probably gets scheduled better
• Shade in two passes: per-pixel and per-sample
– Unfortunately, this duplicates culling work
– Scheduling is still a problem
55Spring 2011 – Beyond Programmable Shading
Per-Sample Scheduling
• Lack of spatial locality causes hardware scheduling inefficiency
56Spring 2011 – Beyond Programmable Shading
MSAA with Tile-Based Methods
• Handle per-pixel and per-sample shading in one pass
– Avoids duplicate culling work
– Can use branching, but that incurs scheduling problems
• Instead, reschedule per-sample pixels (see the sketch below)
– Shade sample 0 for the whole tile
– Pack a list of pixels that require per-sample shading
– Redistribute threads to process additional samples
– Scatter per-sample shaded results
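A hedged sketch of that rescheduling step inside the tiled compute shader, after sample 0 has been shaded for every pixel in the tile. sPerSamplePixels, sNumPerSamplePixels (assumed zeroed earlier), PackCoords()/UnpackCoords(), ShadeSample(), MSAA_SAMPLES, and gFramebufferMS are illustrative names; RequiresPerSampleShading() refers to the edge test sketched earlier.

RWTexture2DArray<float4> gFramebufferMS : register(u1);

groupshared uint sPerSamplePixels[TILE_DIM * TILE_DIM];
groupshared uint sNumPerSamplePixels;

    // Pack the pixels of this tile that need per-sample shading into a shared list.
    if (RequiresPerSampleShading(myPixelCoords))
    {
        uint listIndex;
        InterlockedAdd(sNumPerSamplePixels, 1, listIndex);
        sPerSamplePixels[listIndex] = PackCoords(myPixelCoords);
    }
    GroupMemoryBarrierWithGroupSync();

    // Redistribute: work-items stride over (pixel, sample) pairs so neighboring
    // SIMD lanes shade neighboring samples, restoring scheduling efficiency.
    uint totalWork = sNumPerSamplePixels * (MSAA_SAMPLES - 1);
    for (uint w = groupIndex; w < totalWork; w += TILE_DIM * TILE_DIM)
    {
        uint2 coords = UnpackCoords(sPerSamplePixels[w / (MSAA_SAMPLES - 1)]);
        uint  sample = 1 + (w % (MSAA_SAMPLES - 1));
        float3 lit = ShadeSample(coords, sample);                     // evaluate the tile's light list for this sample
        gFramebufferMS[uint3(coords, sample)] = float4(lit, 1.0f);    // scatter the per-sample result
    }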
57Spring 2011 – Beyond Programmable Shading
Tile-Based MSAA at 1080p, 1024 Lights
(Chart: frame time in ms, 0–35, for Crytek Sponza and a 2009 game scene on ATI 5870 and NVIDIA 480, comparing No MSAA, 4x MSAA (Branching), and 4x MSAA (Packed).)
58Spring 2011 – Beyond Programmable Shading
4x MSAA Performance at 1080p
(Chart: frame time in ms vs. number of point lights from 16 to 1024, for Deferred Shading, Deferred Lighting, and Tiled on ATI 5870 and NVIDIA 480. Annotations: “Slope ~ 5 µs / light”, “Slope ~ 35 µs / light”, “Tiled takes less of a hit from MSAA”, “Deferred lighting even less compelling”.)