PPU Optimisation Lesson

Engineer Learning Session #1
Optimisation Tips for PowerPC

Ben Hanke
April 16th 2009

Established Common Sense
 “90% of execution time is spent in 10% of the code”
 “Programmers are really bad at knowing what really needs
optimizing so should be guided by profiling”
 “Why should I bother optimizing X? It only runs N times per
frame”
 “Processors are really fast these days so it doesn’t matter”
 “The compiler can optimize that better than I can”

Alternative View
 The compiler isn’t always as smart as you think it is.
 Really bad things can happen in the most innocent looking code
because of huge penalties inherent in the architecture.
 A generally sub-optimal code base will nickel-and-dime you for a
big chunk of your frame rate.
 It’s easier to write more efficient code up front than to go and find
it all later with a profiler.

PPU Hardware Threads
 We have two of them running on alternate cycles
 If one stalls, other thread runs
 Not multi-core:
Shared execution units
Shared access to memory and cache
Most registers duplicated
 Ideal usage:
Threads filling in stalls for each other without thrashing cache

PS3 Cache
 Level 1 Instruction Cache
32 KB 2-way set associative, 128 byte cache line
 Level 1 Data Cache
32 KB 4-way set associative, 128 byte cache line
Write-through to L2
 Level 2 Data and Instruction Cache
512 KB 8-way set associative
Write-back
128 byte cache line
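
Worked arithmetic for the figures above, as a rough sketch: sets = capacity / (ways x line size). The helper and constant names are made up for illustration, and the stride remark assumes simple modulo indexing, which the slides do not state.

```cpp
#include <cstddef>

// Hypothetical helper: number of sets = capacity / (ways * line size).
constexpr std::size_t CacheSets(std::size_t bytes, std::size_t ways, std::size_t lineBytes)
{
    return bytes / (ways * lineBytes);
}

constexpr std::size_t kLineBytes = 128;
constexpr std::size_t kL1ISets = CacheSets(32 * 1024, 2, kLineBytes);   // 128 sets
constexpr std::size_t kL1DSets = CacheSets(32 * 1024, 4, kLineBytes);   // 64 sets
constexpr std::size_t kL2Sets  = CacheSets(512 * 1024, 8, kLineBytes);  // 512 sets

// ASSUMPTION: with plain modulo indexing, addresses this many bytes apart
// compete for the same L1 data cache set (only 4 ways available).
constexpr std::size_t kL1DSetStride = kL1DSets * kLineBytes;            // 8192 bytes
```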

Cache Miss Penalties
 L1 cache miss = 40 cycles
 L2 cache miss = ~1000 cycles!
 In other words, random reads from memory are excruciatingly
expensive!
 Reading data with large strides – very bad
 Consider smaller data structures, or group data that will be read
together
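
A minimal sketch of "group data that will be read together", assuming a loop that only needs one field. FatEnemy, EnemyHealthArray and both functions are hypothetical names, not from the original deck.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "fat" layout: the loop only needs m_Health, but each element
// is 144 bytes (with a 4-byte int), so every read lands on a different
// 128-byte cache line.
struct FatEnemy
{
    int  m_Health;
    char m_RarelyUsed[140]; // stand-in for names, debug info, etc.
};

// Alternative: keep the field that is read together in its own tight array.
struct EnemyHealthArray
{
    std::vector<int> m_Health; // 32 four-byte ints fit in one 128-byte line
};

inline int TotalHealthFat(const std::vector<FatEnemy>& enemies)
{
    int total = 0;
    for (std::size_t i = 0; i < enemies.size(); ++i)
        total += enemies[i].m_Health; // large stride between reads
    return total;
}

inline int TotalHealthPacked(const EnemyHealthArray& enemies)
{
    int total = 0;
    for (std::size_t i = 0; i < enemies.m_Health.size(); ++i)
        total += enemies.m_Health[i]; // sequential reads
    return total;
}
```

Both functions return the same total; only the memory access pattern differs.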

Virtual Functions
 What happens when you call a virtual function?
 What does this code do?
virtual void Update() {}
 May touch cache line at vtable address unnecessarily
 Consider batching by type for better iCache pattern
 If you know the type, maybe you don’t need the virtual – save
touching memory to read the function address
 Even better – maybe the data you actually want to manipulate
can be kept close together in memory?
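
One possible shape for "batch by type, drop the virtual", sketched under the assumption that the concrete types are known up front. Soldier, Turret, World and UpdateAll are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical concrete types, stored by value in per-type arrays.
struct Soldier { float m_PosX;  float m_Speed; };
struct Turret  { float m_Angle; float m_TurnRate; };

// Non-virtual updates: the type is known, so no vtable read is needed and
// the compiler is free to inline the calls.
inline void Update(Soldier& s, float dt) { s.m_PosX  += s.m_Speed    * dt; }
inline void Update(Turret&  t, float dt) { t.m_Angle += t.m_TurnRate * dt; }

struct World
{
    std::vector<Soldier> m_Soldiers; // contiguous, one type per array
    std::vector<Turret>  m_Turrets;
};

inline void UpdateAll(World& world, float dt)
{
    // Each loop runs the same code over data that sits together in memory.
    for (std::size_t i = 0; i < world.m_Soldiers.size(); ++i)
        Update(world.m_Soldiers[i], dt);
    for (std::size_t i = 0; i < world.m_Turrets.size(); ++i)
        Update(world.m_Turrets[i], dt);
}
```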

Spot the Stall

int SlowFunction(int & a, int & b)
{
a = 1;
b = 2;
return a + b;
}

Method 1

Q: When will this work and when won’t it?

inline int FastFunction(int & a, int & b)
{
a = 1;
b = 2;
return a + b;
}

Method 2

int FastFunction(int * __restrict a,
int * __restrict b)
{
*a = 1;
*b = 2;
return *a + *b; // we promise that a != b
}

__restrict Keyword

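 __restrict only works with pointers, not references (which
sucks).
 Aliasing is only a concern between pointers of the same type; under strict
aliasing rules the compiler may already assume that pointers to unrelated types
do not overlap.
 Can be applied to the implicit this pointer in member functions.
 Put it after the closing parenthesis of the parameter list, where const would go.
 Stops the compiler worrying that you passed a class data member
to the function.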
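
A small sketch of __restrict on the implicit this pointer. Accumulator is a hypothetical class; the qualifier position shown is the one GCC and Clang accept, and other compilers may differ.

```cpp
// Hypothetical class for illustration. __restrict is a compiler extension,
// not standard C++.
class Accumulator
{
public:
    Accumulator() : m_Total(0) {}

    // The qualifier sits after the parameter list, where const would go.
    // It promises that 'values' does not point at this object's members,
    // so the compiler may keep m_Total in a register for the whole loop.
    void AddAll(const int* __restrict values, int count) __restrict
    {
        for (int i = 0; i < count; ++i)
            m_Total += values[i];
    }

    int Total() const { return m_Total; }

private:
    int m_Total;
};
```

The promise is the caller's to keep: passing a pointer to the object's own member breaks it.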

Load-Hit-Store

 What is it?

 Write followed by read
 PPU store queue

 Average latency 40 cycles

 Snoop bits 52 through 61 (implications?)

True LHS
 Type casting between register files:
float floatValue = (float)intValue; // integer register -> float register, via memory
float posX = vPosition.X(); // vector register -> float register, via memory

 Data member as loop counter
while( m_Counter-- ) {}

 Aliasing:
void swap( int & a, int & b ) { int t = a; a = b; b = t; }

Workaround: Data member as loop counter

int counter = m_Counter; // load into register
while( counter-- ) // register will decrement
{
doSomething();
}
m_Counter = 0; // Store to memory just once

Workarounds
 Keep data in the same domain as much as possible
 Reorder your writes and reads to allow space for latency
 Consider using word flags instead of many packed bools
Load flags word into register
Perform logical bitwise operations on flags in register
Store new flags value

e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) |
newcanSeeFlag | kAngry;
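
A sketch of the load / modify / store pattern for a flags word. The bit assignments and the FlaggedBaddy type are assumptions for illustration; unlike the one-liner above, this version only sets the angry bit when the player is visible.

```cpp
#include <cstdint>

// Hypothetical bit assignments.
constexpr std::uint32_t kAlive        = 1u << 0;
constexpr std::uint32_t kCanSeePlayer = 1u << 1;
constexpr std::uint32_t kAngry        = 1u << 2;
constexpr std::uint32_t kWearingHat   = 1u << 3;

struct FlaggedBaddy
{
    std::uint32_t m_Flags;

    // canSee is expected to be exactly 0 or 1.
    void SetCanSeePlayer(std::uint32_t canSee)
    {
        const std::uint32_t initialFlags = m_Flags;      // load flags word into a register
        const std::uint32_t mask = 0u - canSee;          // 0x00000000 or 0xFFFFFFFF
        const std::uint32_t newBits = mask & (kCanSeePlayer | kAngry);
        m_Flags = (initialFlags & ~kCanSeePlayer) | newBits; // store new flags value once
    }
};
```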

False LHS Case
 Store queue snooping only compares bits 52 through 61
 So false LHS occurs if you read from address while different item
in queue matches addr & 0xFFC.
 Writing to and reading from memory 4KB apart
 Writing to and reading from memory on sub-word boundary, e.g.
packed short, byte, bool
 Write a bool then read a nearby one -> ~40 cycle stall
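
The comparison described above can be modelled in a few lines. MaySnoopMatch is a hypothetical helper for reasoning about addresses, not a hardware or SDK API.

```cpp
#include <cstdint>

// Bits 52 through 61 in PowerPC numbering (bit 0 is the most significant of
// 64) correspond to the 0xFFC mask quoted on the slide.
constexpr std::uintptr_t kSnoopMask = 0xFFC;

// Hypothetical helper: true if a load from loadAddr would be matched against
// a queued store to storeAddr under that comparison.
inline bool MaySnoopMatch(std::uintptr_t storeAddr, std::uintptr_t loadAddr)
{
    return (storeAddr & kSnoopMask) == (loadAddr & kSnoopMask);
}
```

Addresses 4 KB apart match, as do neighbouring bytes inside the same word; adjacent words do not.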

Example
struct BaddyState
{
bool m_bAlive;
bool m_bCanSeePlayer;
bool m_bAngry;
bool m_bWearingHat;
}; // 4 bytes

Where might we stall?
if ( m_bAlive )
{
m_bCanSeePlayer = LineOfSightCheck( this, player );
if ( m_bCanSeePlayer && !m_bAngry )
{
m_bAngry = true;
StartShootingPlayer();
}
}

Workaround
if ( m_bAlive ) // load, compare
{
const bool bAngry = m_bAngry; // load
const bool bCanSeePlayer = LineOfSightCheck( this, player );
m_bCanSeePlayer = bCanSeePlayer; // store
if ( bCanSeePlayer && !bAngry ) // compare registers
{
m_bAngry = true; // store
StartShootingPlayer();
}
}

Loop + Singleton Gotcha
What happens here?

for( int i = 0; i < enemyCount; ++i )
{
EnemyManager::Get().DispatchEnemy();
}

Workaround

EnemyManager & enemyManager = EnemyManager::Get();
for( int i = 0; i < enemyCount; ++i )
{
enemyManager.DispatchEnemy();
}

Branch Hints
 Branch mis-prediction can hurt your performance
 24-cycle penalty to flush the instruction queue
 If you know a certain branch is rarely taken, you can use a static
branch hint, e.g.
if ( __builtin_expect( bResult, 0 ) )

 Far better to eliminate the branch!
Use logical bitwise operations to mask results
This is far easier and more applicable in SIMD code
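
A minimal sketch of masking a result instead of branching on it, in scalar integer code. The helper names are hypothetical, and whether the comparison itself compiles without a branch depends on the compiler.

```cpp
#include <cstdint>

// Turns a bool into an all-zeros or all-ones mask.
inline std::uint32_t MaskFromBool(bool b)
{
    return 0u - static_cast<std::uint32_t>(b); // 0x00000000 or 0xFFFFFFFF
}

// Returns a where mask bits are set, b elsewhere.
inline std::uint32_t Select(std::uint32_t mask, std::uint32_t a, std::uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

// Example use: minimum of two unsigned values without an if/else.
inline std::uint32_t MinU32(std::uint32_t a, std::uint32_t b)
{
    return Select(MaskFromBool(a < b), a, b);
}
```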

Floating Point Branch Elimination
 __fsel, __fsels – Floating point select

// Branching version
float min( float a, float b )
{
return ( a < b ) ? a : b;
}

// Branch-free version: __fsels( x, y, z ) returns y if x >= 0, otherwise z
float min( float a, float b )
{
return __fsels( a - b, b, a );
}
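
To check select-based logic off-target, a portable stand-in can mirror the select semantics. FselRef and the helpers built on it are hypothetical names; the stand-in may itself compile to a branch, so it is for verifying logic, not for speed.

```cpp
// Portable stand-in for the select semantics: y when x >= 0, otherwise z.
inline float FselRef(float x, float y, float z)
{
    return (x >= 0.0f) ? y : z;
}

// The same argument pattern as the branch-free min above.
inline float MinSel(float a, float b) { return FselRef(a - b, b, a); }

// Swapping the last two arguments gives max.
inline float MaxSel(float a, float b) { return FselRef(a - b, a, b); }

// Clamp built from the two selects.
inline float ClampSel(float v, float lo, float hi)
{
    return MinSel(MaxSel(v, lo), hi);
}
```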

Microcoded Instructions
 Single instruction -> several, fetched from ROM, pipeline bubble
 Common example of one to avoid: shift by a variable amount.
int a = b << c;
 Minimum 11 cycle latency.
 If you know range of values, can be better to switch to a fixed shift!
switch( c )
{
case 1: a = b << 1; break;
default: break;
}

Loop Unrolling
 Why unroll loops?
Fewer branches
Better concurrency, instruction pipelining, hide latency
More opportunities for compiler to optimise
Only works if code can actually be interleaved, e.g. inline functions, no
inter-iteration dependencies
 How many times is enough?
On average about 4 – 6 times works well
Best for loops where num iterations is known up front
 Need to think about spare - iterationCount % unrollCount

Picking up the Spare
 If you can, artificially pad your data with safe values to keep as multiple of
unroll count. In this example, you might process up to 3 dummy items in
the worst case.

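for ( int i = 0; i < numElements; i += 4 )
{
InlineTransformation( pElements[ i+0 ] );
InlineTransformation( pElements[ i+1 ] );
InlineTransformation( pElements[ i+2 ] );
InlineTransformation( pElements[ i+3 ] );
}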

 If you can’t pad, run for numElements & ~3 instead and run to
completion in a second loop.
for ( ; i < numElements; ++i )
{
InlineTransformation( pElements[ i ] );
}

Alternative method (pros and cons - one branch but longer code generated).

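switch( numElements & 3 )
{
case 3: InlineTransformation( pElements[ i+2 ] ); // fall through
case 2: InlineTransformation( pElements[ i+1 ] ); // fall through
case 1: InlineTransformation( pElements[ i+0 ] ); // fall through
case 0: break;
}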
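
Putting the pieces together, a self-contained sketch of the unrolled loop plus the second loop for the spare. The body of InlineTransformation here is a placeholder (it doubles the value), and numElements is assumed to be non-negative.

```cpp
// Hypothetical stand-in for the slide's InlineTransformation.
inline void InlineTransformation(float& element) { element *= 2.0f; }

inline void TransformAll(float* pElements, int numElements)
{
    int i = 0;
    const int unrolledEnd = numElements & ~3; // largest multiple of 4 <= numElements

    // Main loop: four independent transformations per iteration.
    for ( ; i < unrolledEnd; i += 4 )
    {
        InlineTransformation( pElements[ i+0 ] );
        InlineTransformation( pElements[ i+1 ] );
        InlineTransformation( pElements[ i+2 ] );
        InlineTransformation( pElements[ i+3 ] );
    }

    // Spare: at most 3 leftover elements.
    for ( ; i < numElements; ++i )
    {
        InlineTransformation( pElements[ i ] );
    }
}
```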

Loop unrolled… now what?
 If you unrolled your loop 4 times, you might be able to use SIMD
 Use AltiVec intrinsics – align your data to 16 bytes
 128 bit registers - operate on 4 32-bit values in parallel
 Most SIMD instructions have 1 cycle throughput
 Consider using SOA data instead of AOS
AOS: Arrays of interleaved posX, posY, posZ structures
SOA: A structure of arrays for each field dimensioned for all elements
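
A plain C++ sketch of the two layouts, with no AltiVec intrinsics so it builds anywhere. All names are hypothetical. Real SIMD code would also need the arrays aligned to 16 bytes, which std::vector does not guarantee.

```cpp
#include <cstddef>
#include <vector>

// AOS: one interleaved structure per element.
struct PositionAOS { float posX, posY, posZ; };

// SOA: one array per field, each dimensioned for all elements.
struct PositionsSOA
{
    std::vector<float> posX;
    std::vector<float> posY;
    std::vector<float> posZ;
};

inline PositionsSOA ToSOA(const std::vector<PositionAOS>& in)
{
    PositionsSOA out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out.posX.push_back(in[i].posX);
        out.posY.push_back(in[i].posY);
        out.posZ.push_back(in[i].posZ);
    }
    return out;
}

inline void TranslateX(PositionsSOA& p, float dx)
{
    const std::size_t count = p.posX.size();
    const std::size_t unrolledEnd = count & ~static_cast<std::size_t>(3);
    std::size_t i = 0;
    for ( ; i < unrolledEnd; i += 4 )
    {
        // Four adjacent floats: the shape one 128-bit SIMD add would cover.
        p.posX[i+0] += dx;
        p.posX[i+1] += dx;
        p.posX[i+2] += dx;
        p.posX[i+3] += dx;
    }
    for ( ; i < count; ++i )
        p.posX[i] += dx;
}
```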
