3. CS 354 3
My Office Hours
Tuesday, before class
Painter (PAI) 5.35
8:45 a.m. to 9:15 a.m.
Thursday, after class
ACE 6.302
11:00 a.m. to 12:00 p.m.
Randy’s office hours
Monday & Wednesday
11:00 a.m. to 12:00 p.m.
Painter (PAI) 5.33
4. CS 354 4
Last time, this time
Last lecture, we discussed
Acceleration structures
This lecture
Graphics Performance Analysis
Projects
Project 4 on ray tracing is posted on Piazza
Due May 2, 2012
Get started!
5. CS 354 5
Daily Quiz
On a sheet of paper:
• Write your EID, name, and date
• Write #1, #2, #3 followed by its answer

#1 Multiple choice: Which is NOT a bounding volume representation?
a) sphere
b) axis-aligned bounding box
c) object-aligned bounding box
d) bounding graph point
e) convex polyhedron

#2 True or False: Volume rendering can be accelerated by the GPU by drawing blended slices of the volume.

#3 True or False: Placing objects within a uniform grid is easier than placing objects within a KD tree.
6. CS 354 6
Graphics Performance Analysis
Generating synthetic images by computer is computationally and bandwidth intensive
Achieving interactive rates is key
60 frames/second ≈ real-time interactivity
Worth optimizing
Entertainment and intuition tied to interactivity
How do we think about graphics
performance analysis?
7. CS 354 7
Framing Amdahl’s Law
Assume a workload with two parts
First part is A%
Second part is B%
Such that A% + B% = 100%
If we have a technique to speed up the
second part by N times
But have no speedup for the first part
What overall speedup can we expect?
8. CS 354 8
Amdahl’s Equation
Assume A% + B% = 100%
If the un-optimized effort is 100%, the optimized effort should be smaller:

OptimizedEffort = A% + B% / N

Speedup is the ratio of UnoptimizedEffort to OptimizedEffort:

Speedup = 100% / (A% + B% / N) = 1 / ((1 − B) + B / N)
9. CS 354 9
Who was Amdahl?
Gene Amdahl
CPU architect for IBM in 1960s
Helped design IBM’s System/360 mainframe
architecture
Left IBM to found Amdahl Corporation
Building IBM-compatible mainframes
Why?
Evaluating whether to invest in parallel
processing or not
10. CS 354 10
Parallelization
Broadly speaking, computer tasks can be broken
into two portions
Sequential sub-tasks
Naturally requires steps to be done in a particular order
Examples: text layout, entropy decoding
Parallel sub-tasks
Problem splits into lots of independent chunks of work
Chunks of work can be done by separate processing units
simultaneously: parallelization
Examples: tracing rays, shading pixels, transforming
vertices
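To make the split concrete, here is a minimal C++ sketch (not from the lecture) of farming out a parallel sub-task, shading pixels, to independent worker threads; shadeRange and the placeholder shading math are hypothetical stand-ins.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for per-pixel shading work.
    void shadeRange(std::vector<float>& pixels, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            pixels[i] *= 0.5f; // placeholder shading computation
    }

    int main() {
        std::vector<float> pixels(1920 * 1200);
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        std::size_t chunk = pixels.size() / n;
        for (unsigned t = 0; t < n; ++t) {
            std::size_t begin = t * chunk;                                  // each thread owns an
            std::size_t end = (t + 1 == n) ? pixels.size() : begin + chunk; // independent chunk
            workers.emplace_back(shadeRange, std::ref(pixels), begin, end);
        }
        for (auto& w : workers) w.join(); // chunks execute simultaneously, then rejoin
    }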
11. CS 354 11
Serial Work Sandwiching Parallel Work
12. CS 354 12
Example of Amdahl’s Law
Say a task is 50% serial and 50% parallel
Consider using 4 parallel processors on the
parallel portion
Speedup: 1.6x
Consider using 40 parallel processors on the parallel
portion
Speedup: 1.951x
Consider the limit:

lim (N→∞) 1 / (0.5 + 0.5/N) = 2
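A minimal sketch (assuming nothing beyond the formula above) that reproduces these numbers:

    #include <cstdio>

    // Amdahl's Law: speedup = 1 / (A + B/N), with serial fraction A,
    // parallel fraction B = 1 - A, and N parallel processors.
    double amdahlSpeedup(double a, double b, double n) {
        return 1.0 / (a + b / n);
    }

    int main() {
        std::printf("N = 4:  %.3fx\n", amdahlSpeedup(0.5, 0.5, 4.0));  // 1.600x
        std::printf("N = 40: %.3fx\n", amdahlSpeedup(0.5, 0.5, 40.0)); // 1.951x
        // As N grows without bound, the speedup approaches 1/A = 2x.
    }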
14. CS 354 14
Pessimism about Parallelism?
Amdahl’s Law can instill pessimism about
parallel processing
If the serial work percentage is high, adding
parallel units yields little benefit
Assumes fixed “problem” size
So workload stays same size even as parallel
execution resources are added
So why do GPUs offer hundreds of cores
then?
15. CS 354 15
Gustafson's Law
Observation
by John Gustafson
With N parallel units, bigger problems can be attacked
Great example
Increasing rendering resolution
Was 640x480 pixels, now 1920x1200
More parallel units means more pixels can be
processed simultaneously
Supporting rendering resolutions previously unattainable
Problem size improvement, where A is the serial fraction:
problemScale = N − A(N − 1)
16. CS 354 16
Example
Say a task is 50% serial and 50% parallel
Consider using 4 parallel processors on the
parallel portion
Problem scales up: 2.5x
Consider 100 parallel processors
Problem scales up: 50.5x
Also consider heterogeneous nature of graphics
processing units
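A matching sketch of Gustafson's scaled problem size, again just encoding the slide's formula:

    #include <cstdio>

    // Gustafson's Law: problemScale = N - A(N - 1), with serial fraction A
    // and N parallel processors.
    double gustafsonScale(double a, double n) {
        return n - a * (n - 1.0);
    }

    int main() {
        std::printf("N = 4:   %.1fx\n", gustafsonScale(0.5, 4.0));   // 2.5x
        std::printf("N = 100: %.1fx\n", gustafsonScale(0.5, 100.0)); // 50.5x
    }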
17. CS 354 17
Coherent Work vs.
Incoherent Work
Not all parallel work is created equal
Coherent work = “adjacent” chunks of work
performing similar operations and memory
accesses
Example: camera rays, pixel shading
Allows sharing control of instruction execution
Good for caches
Incoherent work = “adjacent” chunks of work
performing dissimilar operations and memory
accesses
Examples: reflection, shadow, and refraction rays
Bad for caches
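An illustrative C++ contrast (a CPU analogy, not GPU code): both loops do the same shading work, but the first walks memory in order, like camera rays marching across adjacent pixels, while the second chases a scatter table, like incoherent secondary rays. The shade function is a hypothetical stand-in.

    #include <cstddef>
    #include <vector>

    float shade(float x) { return x * 0.5f; } // hypothetical stand-in

    // Coherent: adjacent work items perform the same operation on adjacent
    // memory, so caches and SIMD units stay busy.
    void coherentPass(const std::vector<float>& in, std::vector<float>& out) {
        for (std::size_t i = 0; i < in.size(); ++i)
            out[i] = shade(in[i]);
    }

    // Incoherent: adjacent work items jump around memory, defeating the cache.
    void incoherentPass(const std::vector<float>& in,
                        const std::vector<std::size_t>& scatter,
                        std::vector<float>& out) {
        for (std::size_t i = 0; i < in.size(); ++i)
            out[i] = shade(in[scatter[i]]);
    }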
18. CS 354 18
Coherent vs. Incoherent Rays
Figure: camera rays and light rays are coherent; reflected rays are incoherent.
19. CS 354 19
Keeping Work Coherent?
How do we keep work coherent?
Pipelines
Be careful: pipelines can introduce latency
Data structures
SPMD (or SIMD) execution
Single Program, Multiple Data
To exploit Single Instruction, Multiple Data (SIMD)
units
Bundling “adjacent” work elements helps cache and
memory access efficiency
21. A Simplified Graphics Pipeline
CS 354 21
Application
(application / OpenGL API boundary)
Vertex batching & assembly
Triangle assembly
Triangle clipping
NDC to window space
Triangle rasterization
Fragment shading
Depth testing (reads/writes the depth buffer)
Color update (writes the framebuffer)
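A hedged sketch of what the application actually submits across that API boundary; vbo, program, and vertexCount are assumed to have been created during setup, and a current OpenGL context is assumed.

    // Everything below the API boundary (vertex batching through color
    // update) is then carried out by the driver and GPU.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);                          // source of vertex data
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
    glEnableVertexAttribArray(0);
    glUseProgram(program);                                       // vertex + fragment shaders
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);                  // batch enters the pipeline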
22. CS 354 22
Another View of the Graphics Pipeline
3D Application or Game
OpenGL API (the CPU-GPU boundary)
GPU Front End → Vertex Assembly → Vertex Shader → Primitive Assembly → Geometry Program → Clipping, Setup, and Rasterization → Fragment Shader → Raster Operations
Memory paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access (all through the Memory Interface)
Legend: stages are either programmable or fixed-function
OpenGL 3.3
23. CS 354 23
Modeling Pipeline Efficiency
Rate of processing for sequential tasks
Assume three tasks
Run time is the sum of each operation's time: A + B + C
Rate of processing in a pipeline
Assume three tasks, treated as stages
Performance gated by slowest operation
Three operations in pipeline: A, B, C
Run time = max(A,B,C)
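A minimal sketch with made-up stage times showing why pipelining helps:

    #include <algorithm>
    #include <cstdio>

    int main() {
        double a = 2.0, b = 3.0, c = 1.0;        // per-item stage times (ms)
        double sequential = a + b + c;           // 6 ms per item without pipelining
        double pipelined = std::max({a, b, c});  // 3 ms per item once the pipe is
                                                 // full, gated by the slowest stage
        std::printf("sequential: %.1f ms/item, pipelined: %.1f ms/item\n",
                    sequential, pipelined);
    }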
24. CS 354 24
Hardware Clocks
Heart beat of hardware
Measured in frequency
Hertz (Hz) = cycles per second
Megahertz, gigahertz = million, billion Hz
Faster clocks = faster computation and
data transfer
So why not simply raise clocks?
High clocks consume more power
Circuits are only rated to a maximum clock
speed before becoming unreliable
25. CS 354 25
Clock Domains
Given chip may have multiple clocks running
Three key domains (GPU-centric)
Graphics clock—for fixed-function units
Example uses: rasterization, texture filtering, blending
Optimize for throughput, not latency
Can often instance more units instead of raising clocks
Processor clock—for programmable shader units
Example: shader instruction execution
Generally higher than graphics clock
Because optimized for latency rather than throughput
Memory clock—for talking to external memory
Depends on speed rating of external memory
Other domains too
Display clock, PCI-Express bus clock
Generally not crucial to rendering performance
26. CS 354 26
3D Pipeline Programmable
Domains run on Unified Hardware
Unified Streaming Processor Array (SPA) architecture
means same capabilities for all domains
Plus tessellation + compute (not shown below)
GPU Front End → Vertex Assembly → Vertex Program → Primitive Assembly → Primitive Program → Clipping, Setup, and Rasterization → Fragment Program → Raster Operations
The vertex, primitive, and fragment programs can be unified hardware!
Memory paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access (all through the Memory Interface)
27. CS 354 27
Memory Bandwidth
Raw memory bandwidth
Physical clock rate
Example: 3 GHz
Memory bus width
64-bit, 128-bit, 192-bit, 256-bit, 384-bit
Wider buses are faster, but routing all those wires is more expensive
Signaling rate
Double data rate (DDR) means signals are sent on the rising and
falling clock edges
The quoted (logical) memory clock rate often includes the signaling rate
Computing raw memory bandwidth
bandwidth = physicalClock × signalPerClock × busWidth
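Plugging in illustrative numbers (a 3 GHz physical clock, DDR signaling, and a 256-bit bus, all chosen here only as an example):

    #include <cstdio>

    int main() {
        double physicalClockHz = 3e9;        // 3 GHz physical memory clock
        double signalsPerClock = 2.0;        // DDR: rising and falling edges
        double busWidthBytes = 256.0 / 8.0;  // 256-bit bus = 32 bytes per transfer
        double bandwidth = physicalClockHz * signalsPerClock * busWidthBytes;
        std::printf("raw bandwidth: %.0f GB/s\n", bandwidth / 1e9); // 192 GB/s
    }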
28. CS 354 28
Latency vs. Throughput
Achieved bandwidth is raw bandwidth reduced by imperfect memory utilization
Unrealistic to expect 100% utilization
GPUs generally achieve much better utilization than CPUs
Trade-off
Maximizing throughput (utilization) increases
latency
Minimizing latency reduces utilization
30. CS 354 30
GeForce Peak
Memory Bandwidth Trends
Chart: peak memory bandwidth in gigabytes per second (0 to 200), plotting raw bandwidth and effective raw bandwidth with compression, each with an exponential trend line, across GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX; the 128-bit memory interface gives way to a 256-bit interface.
31. CS 354 31
Effective GPU
Memory Bandwidth
Compression schemes
Lossless depth and color (when multisampling)
compression
Lossy texture compression (S3TC / DXTC)
Typically assumes 4:1 compression
Avoidance of useless work
Early killing of fragments (Z cull)
Avoiding useless blending and texture fetches
Very clever memory controller designs
Combining memory accesses for improved coherency
Caches for texture fetches
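Where the 4:1 figure comes from, as a quick check: S3TC/DXTC encodes each 4x4 texel block at a fixed size, so a 16-byte block (e.g. DXT5) replaces 64 bytes of RGBA8 texels.

    #include <cstdio>

    int main() {
        int texelsPerBlock = 4 * 4;                 // S3TC works on 4x4 blocks
        int uncompressedBytes = texelsPerBlock * 4; // RGBA8 = 4 bytes/texel -> 64 bytes
        int compressedBytes = 16;                   // one 128-bit DXT5 block
        std::printf("compression: %d:1\n",
                    uncompressedBytes / compressedBytes); // 4:1
    }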
32. CS 354 32
Other Metrics
Host bandwidth
Vertex pulling
Vertex transformation
Triangle rasterization and setup
Fragment shading rate
Shader instruction rate
Raster (blending) operation rate
Early Z reject rate
33. CS 354 33
Kepler GeForce GTX 680
High-level Block Diagram
8 Streaming
Multiprocessors
(SMX)
1536 CUDA Cores
8 Geometry Units
4 Raster Units
128 Texture units
32 Raster operations
256-bit GDDR5
memory
34. CS 354 34
Kepler Streaming Multiprocessor
8 more copies of this
35. CS 354 35
Prior Generation Streaming
Multiprocessor (SM)
Multi-processor
execution unit (Fermi)
32 scalar processor
cores
Warp is a unit of
thread execution of up
to 32 threads
Two workloads
Graphics
Vertex shader
Tessellation
Geometry shader
Fragment shader
Compute
36. CS 354 36
Power Gating
Computer architecture has hit the “power wall”
Low-power operation is at a premium
Battery-powered devices
Thermal constraints
Economic constraints
Power Management (PM) works to reduce
power by
Lowering clocks when performance isn't required
Disabling idle hardware units
Avoids leakage power
37. CS 354 37
Scene Graph Labor
High-level division of scene graph labor
Four pipeline stages
App (application)
Code that manipulates/modifies the scene graph in response to
user input or other events
Isect (intersection)
Geometric queries such as collision detection or picking
Cull
Traverse the scene graph to find the nodes to be rendered
Best example: eliminating objects that are out of view
Optimize the ordering of nodes
Sort objects to minimize graphics hardware state changes
Draw
Communicating drawing commands to the hardware
Generally through graphics API (OpenGL or Direct3D)
Can map well to multi-processor CPU systems
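A minimal single-threaded sketch of the four stages; every type and stage function here is a hypothetical stand-in for a real scene graph API. On a multi-processor system the stages can instead be pipelined across CPUs, as the next slide shows.

    #include <vector>

    struct Scene {};
    struct Camera {};
    struct DrawList { std::vector<int> visibleNodes; };

    void app(Scene&) {}                        // react to input/events, modify the graph
    void isect(const Scene&) {}                // collision detection, picking
    DrawList cull(const Scene&, const Camera&) // frustum cull, sort to minimize
    { return {}; }                             //   graphics state changes
    void draw(const DrawList&) {}              // issue OpenGL commands

    int main() {
        Scene scene;
        Camera camera;
        for (int frame = 0; frame < 3; ++frame) { // per-frame loop
            app(scene);
            isect(scene);
            draw(cull(scene, camera));
        }
    }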
38. CS 354 38
App-cull-draw Threading
App-cull-draw processing on one CPU core
App-cull-draw processing on multiple CPUs
39. CS 354 39
Scene Graph Profiling
Scene graph should help provide insight
into performance
Process statistics
What’s going on?
Time stamps
Database statistics
How complex is the scene in any frame?
40. CS 354 40
Example:
Depth Complexity Visualization
How many pixels are being rendered?
Pixels can be rasterized by multiple objects
Depth complexity is the average number of times a
pixel or color sample is updated per frame
(Yellow and black indicate higher depth complexity.)
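One classic way to build such a visualization (an assumption here, not necessarily how this image was made) is to count rasterizations per pixel in the OpenGL stencil buffer; drawScene() is a hypothetical stand-in, and a stencil-capable framebuffer is assumed.

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 0, ~0u);       // every fragment passes the stencil test
    glStencilOp(GL_KEEP, GL_INCR, GL_INCR); // increment the count once per fragment
    drawScene();                            // hypothetical scene draw
    // The stencil buffer now holds each pixel's depth complexity; read it
    // back or remap it to colors (e.g. yellow for high counts) for display.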
41. CS 354 41
Example:
Heads-up Display of Statistics
Process statistics
How long is
everything taking?
Database statistics
What is being rendered?
Overlaying on the active scene is often valuable
Dynamic update
42. CS 354 42
Benchmarking
Synthetic benchmarks focus on rendering
particular operations in isolation
Example: what is blended pixel performance?
Application benchmarks
Try to reflect what a real application would do
43. CS 354 43
Tips for Interactive
Performance Analysis
Vary things you can control
Change window resolution
Make it smaller and see whether performance improves
Null driver analysis
Skip the actual rendering calls
What if the driver were "infinitely" fast?
Use occlusion queries to monitor how many
samples (pixels) actually get rendered (see the sketch below)
Keep data on the GPU
Let GPU do Direct Memory Access (DMA)
Keep from swapping textures and buffers
Easier now that multi-gigabyte graphics cards are available
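A minimal occlusion-query sketch for the tip above, using the core OpenGL query API; drawObject() is a hypothetical stand-in. Note that fetching GL_QUERY_RESULT immediately stalls until the GPU finishes, so real code usually checks the result a frame later.

    GLuint query;
    glGenQueries(1, &query);
    glBeginQuery(GL_SAMPLES_PASSED, query);
    drawObject();                           // the draw call being measured
    glEndQuery(GL_SAMPLES_PASSED);

    GLuint samples = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples); // blocks until available
    // If samples is zero (or tiny), this object is a candidate for culling.
    glDeleteQueries(1, &query);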
44. CS 354 44
Next Class
Next lecture
Surfaces
Programmable tessellation
Reading
None
Project 4
Project 4 is a simple ray tracer
Due Wednesday, May 2, 2012