1. AMD AND MICROSOFT DEVELOPER DAY, JUNE 2014, STOCKHOLM
STEPHAN HODES
DEVELOPER TECHNOLOGY ENGINEER, AMD
GCN PERFORMANCE „FTW“
2. | GCN PERFORMANCE „FTW“ | AMD AND MICROSOFT GAME DEVELOPER DAY - JUNE 2 2014, STOCKHOLM2
AGENDA
GCN architecture explained
Top 10: GCN Performance Advice
Questions
AMD GRAPHICS CORE NEXT
What is GCN?
‒Non VLIW architecture
‒ Less dependent on manual vectorization of shaders
‒ Susceptible to register pressure
‒Architecture used in:
‒ AMD discrete GPUs since 2012 (HD7700 and better)
‒ Kabini and Kaveri APUs
‒ Future AMD hardware
‒ New consoles
GCN Hardware is required for Mantle
‒ DirectX 12 API support
PRODUCT SPECIFICATIONS
AMD RADEON™ R9 290 SERIES
                          R9 290X                              R9 290
Compute Units             44                                   40
Engine Clock              Up to 1 GHz                          Up to 950 MHz
Compute Performance       5.6 TFLOPS                           4.9 TFLOPS
Memory Configuration      4GB GDDR5 / 512-bit                  4GB GDDR5 / 512-bit
Memory Speed              5.0 Gbps                             5.0 Gbps
AMD TrueAudio Technology  Yes                                  Yes
API Support               DirectX® 11.2, OpenGL 4.3, Mantle    DirectX® 11.2, OpenGL 4.3, Mantle
GCN COMPUTE UNIT – SPECIFICS
Non VLIW instruction set architecture
4 [16-lane] Vector ALU (SIMD)
‒ One wavefront is 64 threads
‒ 1 SP (Single-Precision) op: 4 clocks
‒ 1 DP (Double-Precision) ADD: 8 clocks
‒ 1 DP MUL/FMA & Transcendental: 16 clocks
‒ 64KB Vector GPRs
1 fully programmable scalar ALU
‒ Shared by all threads of a wavefront
‒ Used for flow control, pointer arithmetic, etc.
‒ 8KB Scalar GPRs, scalar data cache, etc.
[Diagram: GCN compute unit. Blocks shown: Scheduler, Branch & Message Unit, Scalar Unit, Scalar Registers (SGPRs, 8KB), Vector Units (4x SIMD-16), Vector Registers (VGPRs, 4x 64KB), Local Data Share (LDS, 64KB), Texture Filter Units (4), Texture Fetch Load / Store Units (16), L1 Cache (16KB)]
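As a rough cross-check, the compute performance figures in the product table can be reproduced from the compute unit layout above. This is only a sketch: the function names are made up for illustration, and it assumes each lane retires one fused multiply-add per clock, counted as 2 FLOPs.

```python
def peak_sp_tflops(compute_units, clock_ghz,
                   simds_per_cu=4, lanes_per_simd=16,
                   flops_per_lane_per_clock=2):
    """Peak single-precision TFLOPS, assuming one FMA (2 FLOPs) per lane per clock."""
    lanes = compute_units * simds_per_cu * lanes_per_simd
    return lanes * flops_per_lane_per_clock * clock_ghz / 1000.0


def clocks_per_wavefront_op(wave_size=64, lanes_per_simd=16):
    """Clocks a 16-lane SIMD needs to push one 64-thread wavefront through one SP op."""
    return wave_size // lanes_per_simd


if __name__ == "__main__":
    print("R9 290X:", round(peak_sp_tflops(44, 1.0), 1), "TFLOPS")
    print("R9 290: ", round(peak_sp_tflops(40, 0.95), 1), "TFLOPS")
    print("Clocks per SP wavefront op:", clocks_per_wavefront_op())
```

The same 64 / 16 division explains the "1 SP op: 4 clocks" figure listed for the vector ALUs.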
GCN COMPUTE UNIT – SPECIFICS
Distributed programmable scheduler (up to 2560 threads)
‒ Each compute unit can execute instructions from multiple kernels
‒ Separate decode/issue for:
‒ 1 Vector Arithmetic Logic Unit (ALU)
‒ 1 Scalar ALU or Scalar Memory Read or 1 Branch/Message
‒ 1 Vector memory access (Read/Write/Atomic)
‒ 1 Local Data Share operation (LDS)
‒ 1 Export or Global Data Share operation (GDS)
Plus 1 Special/Internal – [no functional unit]
(s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio)
GCN COMPUTE UNIT – SPECIFICS
64KB Local Data Share (LDS)
‒ 32 banks, with conflict resolution
‒ Bandwidth amplification
16KB read/write L1 vector data cache
Texture Units (utilize L1)
‒ 16 Load/Store units
‒ 4 Filter units
1 Branch & Message Unit
‒ Executes branch instructions
(as dispatched by Scalar Unit)
GCN COMPUTE UNIT – LATENCY HIDING
Up to 10 Wavefronts/SIMD
‒ Used to hide latency
‒ Round Robin scheduling
‒ Independent kernels
‒ Often limited by GPR or LDS usage
[Figure: timeline in clocks for Batch 1 to Batch 4. Each batch alternates between Stall and Runnable phases until it is Done.]
GCN COMPUTE UNIT – REGISTER PRESSURE
Vector GPRs
‒ 64KB / 64 threads / 4 Byte / 10 wavefronts = 25.6 VGPR/thread => Max 24 VGPR per thread
Scalar GPRs
‒ 8KB / 4 SIMD / 4 Byte / 10 wavefronts = 51.2 SGPR/wavefronts => Max 48 SGPR per wavefront
LDS
‒ 32KB/threadgroup and threadgroup size 64 => 2 wavefronts/CU max.
‒ 32KB/threadgroup and threadgroup size 256 => 8 wavefronts/CU max.
‒ 16KB/threadgroup and threadgroup size 256 => 16 wavefronts/CU max.
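The limits above can be turned into a small occupancy estimator. This is a simplified sketch rather than an official calculator: the function names are hypothetical, and the allocation granularities (4 VGPRs, 8 SGPRs) are assumptions chosen to match the rounding from 25.6 to 24 and from 51.2 to 48 shown above.

```python
MAX_WAVES_PER_SIMD = 10
SIMDS_PER_CU = 4
WAVE_SIZE = 64
BYTES_PER_REGISTER = 4


def _round_up(value, granularity):
    """Round value up to the next multiple of granularity."""
    return -(-value // granularity) * granularity


def waves_per_simd_vgpr(vgprs_per_thread, file_bytes=64 * 1024, granularity=4):
    """Wavefronts per SIMD allowed by the 64KB vector register file of one SIMD."""
    budget = file_bytes // (WAVE_SIZE * BYTES_PER_REGISTER)  # 256 VGPRs per lane
    allocated = _round_up(vgprs_per_thread, granularity)     # assumed granularity
    return min(MAX_WAVES_PER_SIMD, budget // allocated)


def waves_per_simd_sgpr(sgprs_per_wave, file_bytes=8 * 1024, granularity=8):
    """Wavefronts per SIMD allowed by the 8KB scalar register file of one CU."""
    budget = file_bytes // (SIMDS_PER_CU * BYTES_PER_REGISTER)  # 512 SGPRs per SIMD
    allocated = _round_up(sgprs_per_wave, granularity)          # assumed granularity
    return min(MAX_WAVES_PER_SIMD, budget // allocated)


def waves_per_cu_lds(lds_bytes_per_group, group_size, lds_bytes=64 * 1024):
    """Wavefronts per CU allowed by the LDS usage of each thread group."""
    groups = lds_bytes // lds_bytes_per_group
    waves_per_group = -(-group_size // WAVE_SIZE)
    return min(MAX_WAVES_PER_SIMD * SIMDS_PER_CU, groups * waves_per_group)


if __name__ == "__main__":
    for vgprs in (24, 25, 32, 64, 128):
        print(vgprs, "VGPRs ->", waves_per_simd_vgpr(vgprs), "waves/SIMD")
    print("32KB LDS, 256 threads ->", waves_per_cu_lds(32 * 1024, 256), "waves/CU")
```

Real occupancy is the minimum of all three limits; a profiler remains the authoritative source.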
GCN SHADER OPTIMIZATION STRATEGIES
Try reducing GPR count if you are slightly over a waves-per-SIMD threshold
‒ Deep nesting
‒ Local array declarations
‒ Long-lived temporary variables
Reducing GPRs is not always optimal
‒ The shader compiler might use GPRs to reduce latency
‒ A high number of threads/CU can thrash your caches
// Variant using more VGPRs (v6 to v9): all four loads are issued before the first wait
image_load v6, v[35:38], s[4:11]
v_mov_b32 v3, v35
image_load v7, v[3:6], s[4:11]
v_mov_b32 v38, v36
image_load v8, v[37:40], s[4:11]
v_mov_b32 v3, v37
image_load v9, v[3:6], s[4:11]
s_waitcnt vmcnt(2)
v_min_f32 v6, v6, v7
s_waitcnt vmcnt(1)
v_min_f32 v6, v6, v8
s_waitcnt vmcnt(0)
v_min_f32 v40, v6, v9
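// Variant using fewer VGPRs (v6 and v7 only): results are consumed before further loads are issued
image_load v6, v[35:38], s[4:11]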
v_mov_b32 v3, v35
image_load v7, v[3:6], s[4:11]
v_mov_b32 v38, v36
v_mov_b32 v3, v37
s_waitcnt vmcnt(0)
v_min_f32 v6, v6, v7
image_load v7, v[37:40], s[4:11]
s_waitcnt vmcnt(0)
v_min_f32 v6, v6, v7
image_load v7, v[3:6], s[4:11]
s_waitcnt vmcnt(0)
v_min_f32 v6, v6, v7
Always profile your changes!
http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/codexl/
http://developer.amd.com/community/blog/2014/05/16/codexl-game-developers-analyze-hlsl-gcn
Top 10 Performance Advice
TOP 10 PERFORMANCE ADVICE
1. Use the power of DirectCompute
‒ Thread group size should be multiple of 64
‒ 256 is often a good choice.
‒ Don't underestimate the benefits of LDS
‒ Use asynchronous compute
‒ Don't switch between Compute/Rasterization too frequently
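The multiple-of-64 rule follows from the wavefront size: a thread group is split into 64-thread wavefronts, and a partially filled wavefront still occupies a full SIMD slot. A minimal sketch (the helper name is hypothetical) that reports how many wavefronts a group needs and how well it fills them:

```python
def wavefront_usage(group_x, group_y=1, group_z=1, wave_size=64):
    """Return (wavefronts needed, fraction of lanes doing useful work)."""
    threads = group_x * group_y * group_z
    waves = -(-threads // wave_size)  # ceiling division
    utilization = threads / (waves * wave_size)
    return waves, utilization


if __name__ == "__main__":
    for dims in ((8, 8, 1), (16, 16, 1), (10, 10, 1)):
        waves, utilization = wavefront_usage(*dims)
        print(dims, "->", waves, "wavefront(s),", f"{utilization:.0%} lanes used")
```

A 10x10 group, for example, needs two wavefronts but leaves 28 of the 128 lanes idle.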
TOP 10 PERFORMANCE ADVICE
2. Don't over-tessellate
‒ Small triangles result in poor quad occupancy
‒ Use [maxtessfactor(X)] in Hull Shader declaration
‒ Recommended value is 15 or less
‒ Implement culling in Hull Shader
‒ Use Adaptive Tessellation
‒ Distance Adaptive
‒ Screen Space Adaptive
‒ Orientation Adaptive
Especially when rendering shadow maps!
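One possible shape for a distance-adaptive factor is sketched below. The linear falloff, the function name, and the near/far parameters are assumptions made for illustration only; the cap of 15 is the recommended maximum from the list above. In a real Hull Shader this logic would live in the patch constant function.

```python
def distance_adaptive_factor(distance, near, far, max_factor=15.0, min_factor=1.0):
    """Tessellation factor that falls linearly from max_factor at `near` to min_factor at `far`."""
    if far <= near:
        raise ValueError("far must be greater than near")
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)  # clamp so the factor never leaves [min_factor, max_factor]
    return max_factor + (min_factor - max_factor) * t


if __name__ == "__main__":
    for d in (5.0, 10.0, 60.0, 110.0, 500.0):
        print(d, "->", distance_adaptive_factor(d, near=10.0, far=110.0))
```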
TOP 10 PERFORMANCE ADVICE
3. Keep your pipeline short
‒ Avoid large expansion in the Geometry Shader
‒ Often a Vertex Shader-only solution can replace Geometry Shader usage
‒ Bokeh expansion
‒ Point sprites
‒ Disable tessellation pipeline if unused
4. Pack shader stage output
‒ Limit Vertex and Domain Shader output size to 4 float4/int4 attributes for best performance.
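// Suboptimal: five separate attributes
struct PS_INPUT
{
    float3 vPosition;
    float3 vNormal;
    float2 vTexcoord1;
    float2 vTexcoord2;
    float2 vTexcoord3;
};
// Good: the same 12 floats packed into three float4 attributes
struct PS_INPUT
{
    float4 vPositionTexcoord1U; // xyz = position, w = texcoord1.u
    float4 vNormalTexcoord1V;   // xyz = normal, w = texcoord1.v
    float4 vTexcoords23;        // xy = texcoord2, zw = texcoord3
};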
TOP 10 PERFORMANCE ADVICE
5. Update your Data using map/unmap
‒ Avoid MAP_WRITE_DISCARD
‒ Prefer MAP_WRITE_NO_OVERWRITE
‒ Avoid UpdateSubresource
‒ Prefer Map and/or CopyResource instead
‒ UpdateSubresource is ok for small (<=4KB) updates
‒ CopyResource introduces GPU stalls
‒ Don't use the updated resource immediately
‒ Using data without copying it to local memory first can sometimes improve performance
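A common way to favour MAP_WRITE_NO_OVERWRITE is to treat a dynamic buffer as a ring and append each update at a fresh offset, falling back to a discard only when the buffer wraps. The sketch below models just the bookkeeping. The class is hypothetical, and the returned strings merely stand in for the D3D11 map flags; no graphics API is called.

```python
class DynamicBufferCursor:
    """Tracks the write offset in a dynamic buffer and picks a map mode per update."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0

    def allocate(self, size):
        """Return (start offset, map mode) for an update of `size` bytes."""
        if size > self.capacity:
            raise ValueError("update larger than buffer")
        if self.offset + size > self.capacity:
            # Out of untouched space: start over and ask for fresh memory.
            self.offset = size
            return 0, "WRITE_DISCARD"
        start = self.offset
        self.offset += size
        # Region not yet referenced by in-flight draws, so no synchronization is needed.
        return start, "WRITE_NO_OVERWRITE"


if __name__ == "__main__":
    cursor = DynamicBufferCursor(capacity=256)
    for _ in range(4):
        print(cursor.allocate(100))
```

With a large enough buffer, discards become rare and most updates take the no-overwrite path.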
TOP 10 PERFORMANCE ADVICE
6. Use flow control with care
‒ Flow control has little overhead
‒ Skipping data fetches usually is good
‒ Avoid non-coherent codepaths within a wavefront
‒ Watch out for GPR pressure caused by loops and deeply nested branches
v_cmp_gt_f32 r0,r1 // a > b, establish VCC
s_mov_b64 s0,exec // Save current exec mask
s_and_b64 exec,vcc,exec // Do "if"
s_cbranch_vccz label0 // Branch if all lanes fail
v_sub_f32 r2,r0,r1 // result = a - b
v_mul_f32 r2,r2,r0 // result = result * a
label0:
s_andn2_b64 exec,s0,exec // Do "else" (s0 & !exec)
s_cbranch_execz label1 // Branch if all lanes fail
v_sub_f32 r2,r1,r0 // result = b - a
v_mul_f32 r2,r2,r1 // result = result * b
label1:
s_mov_b64 exec,s0 // Restore exec mask
// Branching code example
float fn0(float a, float b)
{
    if (a > b)
        return ((a - b) * a);
    else
        return ((b - a) * b);
}
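To see why non-coherent code paths within a wavefront are costly, the exec-mask handling of fn0 can be modelled on the CPU. This is an illustrative simulation only, with a hypothetical function name: it mirrors the structure of the ISA listing (mask, run "if" side, invert mask, run "else" side) and counts how many of the two sides the wavefront actually has to execute.

```python
def fn0_wavefront(a, b):
    """Evaluate fn0 per lane with exec-mask style predication.

    Returns (per-lane results, number of branch sides executed by the wavefront).
    """
    if len(a) != len(b):
        raise ValueError("a and b must have one value per lane")
    lanes = range(len(a))
    saved_exec = [True] * len(a)            # s_mov_b64 s0, exec
    vcc = [a[i] > b[i] for i in lanes]      # v_cmp_gt_f32
    result = [0.0] * len(a)
    sides_executed = 0

    if_mask = [saved_exec[i] and vcc[i] for i in lanes]
    if any(if_mask):                        # skipped when every lane fails
        sides_executed += 1
        for i in lanes:
            if if_mask[i]:
                result[i] = (a[i] - b[i]) * a[i]

    else_mask = [saved_exec[i] and not if_mask[i] for i in lanes]
    if any(else_mask):                      # skipped when no lane is left
        sides_executed += 1
        for i in lanes:
            if else_mask[i]:
                result[i] = (b[i] - a[i]) * b[i]

    return result, sides_executed


if __name__ == "__main__":
    coherent = fn0_wavefront([2.0] * 64, [1.0] * 64)
    divergent = fn0_wavefront([2.0, 1.0] * 32, [1.0, 2.0] * 32)
    print("coherent wavefront executes", coherent[1], "side(s)")
    print("divergent wavefront executes", divergent[1], "side(s)")
```

A coherent wavefront pays for one side of the branch; a single diverging lane is enough to make the whole wavefront pay for both.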
TOP 10 PERFORMANCE ADVICE
7. Pack your G-Buffer using RGBA16_UINT
‒ Fetches from RGBA16 are full rate (without filtering)
‒ Bilinear fetches to RGBA16 are half rate
‒ Exports to RGBA16_INT are full rate (without blending)
Caution: Blended exports to RGBA16_INT are ¼ speed
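As an illustration of what packing into 16-bit unsigned channels can look like, the sketch below stores two 8-bit quantized values in one 16-bit channel. Which G-Buffer attributes get paired this way is an assumption left to the renderer, and the function names are hypothetical; in a shader the same bit operations would be applied before export and after fetch.

```python
def pack_unorm8x2(high, low):
    """Quantize two values in [0, 1] to 8 bits each and pack them into one 16-bit integer."""
    def quantize(x):
        x = min(max(x, 0.0), 1.0)
        return int(x * 255.0 + 0.5)
    return (quantize(high) << 8) | quantize(low)


def unpack_unorm8x2(packed):
    """Inverse of pack_unorm8x2: return the two values as floats in [0, 1]."""
    return ((packed >> 8) & 0xFF) / 255.0, (packed & 0xFF) / 255.0


if __name__ == "__main__":
    channel = pack_unorm8x2(0.25, 0.75)
    print(hex(channel), unpack_unorm8x2(channel))
```

Packed integer data of this kind cannot be filtered or blended meaningfully, which fits the unfiltered, unblended full-rate cases listed above.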
8. Depth buffer: don’t render after read
‒ Binding a depth buffer as a texture will decompress it, which makes subsequent Z ops more expensive.
‒ Critical for shadow map atlas rendering!
‒ Consider exporting depth to G-Buffer
TOP 10 PERFORMANCE ADVICE
9. Batch, Batch, Batch!
‒ Add support for geometry instancing
‒ Pool & batch your updates
‒ Less important with Mantle/DirectX12
‒ Reduces Drawcall overhead
‒ Allows better scheduling
10. (DX11) Prefer engine threading over Deferred Contexts
‒ Deferred contexts are a software feature
‒ … or move to Mantle/DirectX12
TOP 10 PERFORMANCE ADVICE
Bonus Advice
Avoid LDS bank conflicts
‒ Accessing LDS from different threads with addresses that are a multiple of 32 DWORDs apart will cause bank conflicts
‒ Unless it is the same address
Don't use gather with offsets
‒ This will result in 4 image_gather4 instructions
// Gather without offsets
float4 PsExample( PsInput Input ) : SV_Target
{
    return tex.GatherCmpRed(
        g_SamplePointCmp,
        Input.vTex,
        Input.depth );
}
// Resulting ISA: a single gather instruction
image_gather4_c_lz v0, v[2:5], s[4:11], s[12:15]
s_waitcnt vmcnt(0)
// Gather with offsets
float4 PsExample( PsInput Input ) : SV_Target
{
    return tex.GatherCmpRed(
        g_SamplePointCmp,
        Input.vTex,
        Input.depth,
        int2(0,0),
        int2(1,0),
        int2(0,1),
        int2(1,1) );
}
// Resulting ISA: four gather instructions
image_gather4_c_lz v4, v[12:15], s[4:11], s[12:15]
v_mov_b32 v11, 1
image_gather4_c_lz_o v5, v[11:14], s[4:11], s[12:15]
v_mov_b32 v11, 0x00000100
image_gather4_c_lz_o v7, v[11:14], s[4:11], s[12:15]
v_mov_b32 v11, 0x00000101
image_gather4_c_lz_o v0, v[11:14], s[4:11], s[12:15]
s_waitcnt vmcnt(0)
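The LDS bank-conflict rule above can be checked with a small model. It assumes the bank of an access is its DWORD address modulo 32 and that threads reading the same address are served together; real hardware details such as how many lanes are processed per clock are ignored, and the function name is hypothetical.

```python
from collections import defaultdict


def worst_bank_conflict(dword_addresses, num_banks=32):
    """Largest number of distinct addresses that map to the same LDS bank.

    1 means conflict-free; N means some bank has to serve N different addresses.
    """
    per_bank = defaultdict(set)
    for address in dword_addresses:
        per_bank[address % num_banks].add(address)  # identical addresses collapse
    return max((len(addresses) for addresses in per_bank.values()), default=0)


if __name__ == "__main__":
    print("stride 1: ", worst_bank_conflict([i for i in range(32)]))
    print("stride 32:", worst_bank_conflict([i * 32 for i in range(32)]))
    print("stride 33:", worst_bank_conflict([i * 33 for i in range(32)]))
    print("same addr:", worst_bank_conflict([7] * 32))
```

In this model a stride of 32 DWORDs puts every access in one bank, while padding the stride to 33 spreads the accesses across all banks again.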
Questions?
Stephan.Hodes@amd.com