Engineer Learning Session #1
Optimisation Tips for PowerPC

Ben Hanke
April 16th 2009
Playstation 3 PPU
Optimized Playstation 3 PPU
Consoles with PowerPC
Established Common Sense
   “90% of execution time is spent in 10% of the code”
   “Programmers are really bad at knowing what really needs
    optimizing so should be guided by profiling”
   “Why should I bother optimizing X? It only runs N times per
    frame”
   “Processors are really fast these days so it doesn’t matter”
   “The compiler can optimize that better than I can”
Alternative View
   The compiler isn’t always as smart as you think it is.
   Really bad things can happen in the most innocent looking code
    because of huge penalties inherent in the architecture.
   A generally sub-optimal code base will nickel-and-dime you for a
    big chunk of your frame rate.
   It’s easier to write more efficient code up front than to go and find
    it all later with a profiler.
PPU Hardware Threads
   We have two of them running on alternate cycles
   If one stalls, other thread runs
   Not multi-core:
      Shared execution units
      Shared access to memory and cache
      Most registers duplicated
   Ideal usage:
      Threads filling in stalls for each other without thrashing cache
PS3 Cache
   Level 1 Instruction Cache
      32 KB 2-way set associative, 128 byte cache line
   Level 1 Data Cache
      32 KB 4-way set associative, 128 byte cache line
      Write-through to L2
   Level 2 Data and Instruction Cache
      512 KB 8-way set associative
      Write-back
      128 byte cache line
Cache Miss Penalties
   L1 cache miss = 40 cycles
   L2 cache miss = ~1000 cycles!
   In other words, random reads from memory are excruciatingly
    expensive!
   Reading data with large strides – very bad
   Consider smaller data structures, or group data that will be read
    together
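As a rough sketch of grouping data that is read together, the layout below splits a large object so that the per-frame loop walks a tightly packed array instead of striding over cold fields. The `EnemyFat`, `EnemyHot` and `Integrate` names are hypothetical and not from the original slides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "fat" object: the update loop only needs pos and vel, but every
// element drags its cold fields through the cache as well (216 bytes each).
struct EnemyFat
{
    float pos[3];
    float vel[3];
    char  name[64];      // cold: rarely read
    char  aiState[128];  // cold: rarely read
};

// Hot data only: 24 bytes, so five elements fit in one 128-byte cache line.
struct EnemyHot
{
    float pos[3];
    float vel[3];
};

// Integrates positions over a densely packed array of hot data.
inline void Integrate(std::vector<EnemyHot>& enemies, float dt)
{
    for (std::size_t i = 0; i < enemies.size(); ++i)
        for (int axis = 0; axis < 3; ++axis)
            enemies[i].pos[axis] += enemies[i].vel[axis] * dt;
}
```

The cold fields would live in a parallel array indexed the same way, so they are only touched by the code that needs them.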
Virtual Functions
   What happens when you call a virtual function?
   What does this code do?
         virtual void Update() {}
   May touch cache line at vtable address unnecessarily
   Consider batching by type for better iCache pattern
   If you know the type, maybe you don’t need the virtual – save
    touching memory to read the function address
   Even better – maybe the data you actually want to manipulate
    can be kept close together in memory?
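One way to read the batching advice above is to keep one array per concrete type and update each array with a non-virtual function. This is only a sketch under that assumption; `Bullet`, `Particle` and the update functions are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical concrete types, stored by value in one array per type.
struct Bullet   { float x; float vx; };
struct Particle { float age; };

// Non-virtual update over a homogeneous array: no vtable read per object,
// and the same code runs back to back, which is kinder to the iCache.
inline void UpdateBullets(std::vector<Bullet>& bullets, float dt)
{
    for (std::size_t i = 0; i < bullets.size(); ++i)
        bullets[i].x += bullets[i].vx * dt;
}

inline void UpdateParticles(std::vector<Particle>& particles, float dt)
{
    for (std::size_t i = 0; i < particles.size(); ++i)
        particles[i].age += dt;
}
```

Because each array holds objects by value, the data being manipulated also ends up close together in memory.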
Data Hazards Ahead
Spot the Stall


int SlowFunction(int & a, int & b)
{
   a = 1;
   b = 2;
   return a + b;
}
Method 1


Q: When will this work and when won’t it?

inline int FastFunction(int & a, int & b)
{
   a = 1;
   b = 2;
   return a + b;
}
Method 2


int FastFunction(int * __restrict a,
                 int * __restrict b)
{
   *a = 1;
   *b = 2;
   return *a + *b; // we promise that a != b
}
__restrict Keyword


   __restrict is only reliably supported on pointers; support on references
    depends on the compiler (which sucks).
   Under strict aliasing rules, the compiler only has to assume aliasing
    between pointers to the same type.
   Can be applied to the implicit this pointer in member functions.
      Put it after the closing parenthesis of the parameter list.
      Stops the compiler worrying that you passed a class data member
       to the function.
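A minimal sketch of `__restrict` on the implicit `this` pointer, assuming a GCC-compatible compiler that accepts the qualifier after the parameter list. The `Accumulator` class is hypothetical.

```cpp
// Hypothetical class used only to show where the qualifier goes.
class Accumulator
{
public:
    Accumulator() : m_Total(0) {}

    // The trailing __restrict applies to `this`: we promise that `values`
    // does not point at m_Total, so the running total can stay in a register
    // instead of being stored and reloaded on every iteration.
    void AddAll(const int* __restrict values, int count) __restrict
    {
        for (int i = 0; i < count; ++i)
            m_Total += values[i];
    }

    int Total() const { return m_Total; }

private:
    int m_Total;
};
```

If the promise is broken, for example by passing the address of a member of the same object, the behaviour is undefined.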
Load-Hit-Store

   What is it?
      Write followed shortly by a read of the same address
      PPU store queue
   Average latency 40 cycles
   Store queue snoops address bits 52 through 61 (implications?)
True LHS
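   Type casting between register files (the value travels via memory):
      float floatValue = (float)intValue; // integer register -> float register
      float posY = vPosition.X();         // vector register -> float register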


   Data member as loop counter
      while( m_Counter-- ) {}


   Aliasing:
      void swap( int & a, int & b ) { int t = a; a = b; b = t; }
Workaround: Data member as loop counter

int counter = m_Counter; // load into register
while( counter-- ) // register will decrement
{
    doSomething();
}
m_Counter = 0; // Store to memory just once
Workarounds
   Keep data in the same domain as much as possible
   Reorder your writes and reads to allow space for latency
   Consider using word flags instead of many packed bools
      Load flags word into register
      Perform logical bitwise operations on flags in register
      Store new flags value

    e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) | newCanSeeFlag | kAngry;
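The load, modify, store pattern for a flags word might look like the sketch below. `kCanSeePlayer` and `kAngry` come from the slide; the bit values, the other two constants and the `UpdateFlags` function are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical bit assignments; the slide does not give values.
const std::uint32_t kAlive        = 1u << 0;
const std::uint32_t kCanSeePlayer = 1u << 1;
const std::uint32_t kAngry        = 1u << 2;
const std::uint32_t kWearingHat   = 1u << 3;

// Takes the flags word by value (already in a register), clears and re-sets
// the "can see player" bit, sets "angry", and returns the new word so the
// caller can store it back to memory exactly once.
inline std::uint32_t UpdateFlags(std::uint32_t initialFlags, bool canSeePlayer)
{
    // Converts the bool to either 0 or kCanSeePlayer without a branch.
    const std::uint32_t newCanSeeFlag =
        static_cast<std::uint32_t>(canSeePlayer) * kCanSeePlayer;

    return (initialFlags & ~kCanSeePlayer) | newCanSeeFlag | kAngry;
}
```

A caller would write `m_Flags = UpdateFlags(m_Flags, canSee);`, which is one load and one store regardless of how many flags change.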
False LHS Case
   Store queue snooping only compares address bits 52 through 61
    (IBM numbering, where bit 63 is the least significant bit)
   So false LHS occurs if you read from an address while a different item
    in the queue matches addr & 0xFFC.
   Writing to and reading from memory a multiple of 4 KB apart
   Writing to and reading from memory on sub-word boundary, e.g.
    packed short, byte, bool
   Write a bool then read a nearby one -> ~40 cycle stall
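The comparison described above can be modelled in a few lines. This is a sketch of the `addr & 0xFFC` rule from the slide, not of the hardware itself; `SnoopKey` and `MayFalselyAlias` are hypothetical helper names.

```cpp
#include <cstdint>

// Keeps only the address bits the store queue is said to compare
// (bits 2..11 counting from the least significant bit, i.e. 0xFFC).
inline std::uintptr_t SnoopKey(std::uintptr_t address)
{
    return address & 0xFFCu;
}

// True when a load from loadAddress would look like a hit against a queued
// store to storeAddress, whether or not the full addresses really match.
inline bool MayFalselyAlias(std::uintptr_t storeAddress, std::uintptr_t loadAddress)
{
    return SnoopKey(storeAddress) == SnoopKey(loadAddress);
}
```

Two addresses exactly 4 KB apart share a key, and so do two bytes inside the same 4-byte word, which is why adjacent packed bools collide.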
Example
struct BaddyState
{
    bool m_bAlive;
    bool m_bCanSeePlayer;
    bool m_bAngry;
    bool m_bWearingHat;
}; // 4 bytes
Where might we stall?
if ( m_bAlive )
{
    m_bCanSeePlayer = LineOfSightCheck( this, player );
    if ( m_bCanSeePlayer && !m_bAngry )
    {
        m_bAngry = true;
        StartShootingPlayer();
    }
}
Workaround
if ( m_bAlive ) // load, compare
{
    const bool bAngry = m_bAngry; // load
    const bool bCanSeePlayer = LineOfSightCheck( this, player );
    m_bCanSeePlayer = bCanSeePlayer; // store
    if (bCanSeePlayer && !bAngry ) // compare registers
    {
        m_bAngry = true; // store
        StartShootingPlayer();
    }
}
Loop + Singleton Gotcha
What happens here?

 for( int i = 0; i < enemyCount; ++i )
 {
     EnemyManager::Get().DispatchEnemy();
 }
Workaround

EnemyManager & enemyManager = EnemyManager::Get();
for( int i = 0; i < enemyCount; ++i )
{
    enemyManager.DispatchEnemy();
}
Branch Hints
   Branch mis-prediction can hurt your performance
   24-cycle penalty to flush the instruction queue
   If you know a certain branch is rarely taken, you can use a static
    branch hint, e.g.
      if ( __builtin_expect( bResult, 0 ) )

   Far better to eliminate the branch!
      Use logical bitwise operations to mask results
      This is far easier and more applicable in SIMD code
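As an illustration of masking results with bitwise operations, the sketch below selects the smaller of two integers without an `if`. `SelectMin` is a hypothetical name, and whether the comparison itself compiles to branch-free code depends on the compiler and target.

```cpp
// Branch-free integer minimum using a mask.
inline int SelectMin(int a, int b)
{
    // (a < b) is 0 or 1; negating it gives 0 or -1 (all bits set).
    const int mask = -static_cast<int>(a < b);

    // Keeps a where the mask is all ones, b where it is all zeros.
    return (a & mask) | (b & ~mask);
}
```

The same compare, mask, and merge shape is what SIMD select operations do across several lanes at once.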
Floating Point Branch Elimination
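   __fsel, __fsels: floating point select, __fsel( x, y, z ) returns y if x >= 0, otherwise z

    // Branching version
    float min( float a, float b )
    {
        return ( a < b ) ? a : b;
    }
    // Branch-free version
    float min( float a, float b )
    {
        return __fsels( a - b, b, a ); // a - b >= 0 ? b : a
    }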
Microcoded Instructions
   Single instruction -> several, fetched from ROM, pipeline bubble
   Common example of one to avoid: variable shift (shift amount held in a register).
        int a = b << c;
   Minimum 11 cycle latency.
   If you know the range of values, it can be better to switch to a fixed (immediate) shift!
       switch( c )
       {
            case 1: a = b << 1; break;
            case 2: a = b << 2; break;
            case 3: a = b << 3; break;
            default: break;
       }
Loop Unrolling
   Why unroll loops?
      Fewer branches
      Better concurrency, instruction pipelining, hide latency
      More opportunities for compiler to optimise
      Only works if code can actually be interleaved, e.g. inline functions, no
      inter-iteration dependencies
   How many times is enough?
      On average about 4 – 6 times works well
      Best for loops where num iterations is known up front
   Need to think about the spare iterations: iterationCount % unrollCount
Picking up the Spare
   If you can, artificially pad your data with safe values to keep as multiple of
    unroll count. In this example, you might process up to 3 dummy items in
    the worst case.

     for ( int i = 0; i < numElements; i += 4 )
     {
          InlineTransformation( pElements[ i+0 ] );
          InlineTransformation( pElements[ i+1 ] );
          InlineTransformation( pElements[ i+2 ] );
          InlineTransformation( pElements[ i+3 ] );
     }
Picking up the Spare
   If you can’t pad, run the unrolled loop up to numElements & ~3 instead
    and finish the remaining elements in a second loop.
    for ( ; i < numElements; ++i )
    {
          InlineTransformation( pElements[ i ] );
    }
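Putting the unrolled loop and the clean-up loop together gives something like the sketch below. `InlineTransformation` here is a hypothetical stand-in that doubles a float, and `TransformAll` is an assumed wrapper name.

```cpp
// Hypothetical stand-in for the real per-element work.
inline void InlineTransformation(float& value)
{
    value *= 2.0f;
}

inline void TransformAll(float* pElements, int numElements)
{
    // Largest multiple of 4 that does not exceed numElements.
    const int unrolledCount = numElements & ~3;

    int i = 0;
    for ( ; i < unrolledCount; i += 4 )
    {
        InlineTransformation( pElements[ i+0 ] );
        InlineTransformation( pElements[ i+1 ] );
        InlineTransformation( pElements[ i+2 ] );
        InlineTransformation( pElements[ i+3 ] );
    }

    // Picks up the 0 to 3 spare elements.
    for ( ; i < numElements; ++i )
    {
        InlineTransformation( pElements[ i ] );
    }
}
```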
Picking up the Spare
Alternative method (pros and cons - one branch but longer code generated).


 // i == numElements & ~3 here; the cases fall through on purpose
 switch( numElements & 3 )
 {
 case 3: InlineTransformation( pElements[ i+2 ] );
 case 2: InlineTransformation( pElements[ i+1 ] );
 case 1: InlineTransformation( pElements[ i+0 ] );
 case 0: break;
 }
Loop unrolled… now what?
   If you unrolled your loop 4 times, you might be able to use SIMD
   Use AltiVec intrinsics – align your data to 16 bytes
   128 bit registers - operate on 4 32-bit values in parallel
   Most SIMD instructions have 1 cycle throughput
   Consider using SOA data instead of AOS
      AOS: Arrays of interleaved posX, posY, posZ structures
      SOA: A structure of arrays for each field dimensioned for all elements
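The difference between the two layouts can be sketched in plain C++ without any AltiVec intrinsics. `PositionAOS`, `PositionsSOA` and `RaiseAll` are hypothetical names, and `std::vector` makes no 16-byte alignment guarantee, so real SIMD code would need an aligned allocation.

```cpp
#include <cstddef>
#include <vector>

// AOS: x, y and z are interleaved, so touching only y still strides over x and z.
struct PositionAOS
{
    float x, y, z;
};

// SOA: one contiguous array per field, sized for all elements.
struct PositionsSOA
{
    std::vector<float> x, y, z;

    explicit PositionsSOA(std::size_t count) : x(count), y(count), z(count) {}
};

// Adds height to every y value. The loop reads and writes one dense array,
// which is the shape that lets four consecutive floats share a 128-bit register.
inline void RaiseAll(PositionsSOA& positions, float height)
{
    for (std::size_t i = 0; i < positions.y.size(); ++i)
        positions.y[i] += height;
}
```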
Example: FloatToHalf
Example: FloatToHalf4
Questions?

PPU Optimisation Lesson

  • 1. Engineer Learning Session #1 Optimisation Tips for PowerPC Ben Hanke April 16th 2009
  • 5. Established Common Sense  “90% of execution time is spent in 10% of the code”  “Programmers are really bad at knowing what really needs optimizing so should be guided by profiling”  “Why should I bother optimizing X? It only runs N times per frame”  “Processors are really fast these days so it doesn’t matter”  “The compiler can optimize that better than I can”
  • 6. Alternative View
      • The compiler isn’t always as smart as you think it is.
      • Really bad things can happen in the most innocent-looking code because of huge penalties inherent in the architecture.
      • A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate.
      • It’s easier to write more efficient code up front than to go and find it all later with a profiler.
  • 7. PPU Hardware Threads
      • We have two of them running on alternate cycles
      • If one stalls, the other thread runs
      • Not multi-core:
          • Shared execution units
          • Shared access to memory and cache
          • Most registers duplicated
      • Ideal usage:
          • Threads filling in stalls for each other without thrashing cache
  • 8. PS3 Cache
      • Level 1 Instruction Cache
          • 32 KB 2-way set associative, 128 byte cache line
      • Level 1 Data Cache
          • 32 KB 4-way set associative, 128 byte cache line
          • Write-through to L2
      • Level 2 Data and Instruction Cache
          • 512 KB 8-way set associative
          • Write-back
          • 128 byte cache line
  • 9. Cache Miss Penalties
      • L1 cache miss = 40 cycles
      • L2 cache miss = ~1000 cycles!
      • In other words, random reads from memory are excruciatingly expensive!
      • Reading data with large strides is very bad
      • Consider smaller data structures, or group data that will be read together
  • 10. Virtual Functions
      • What happens when you call a virtual function?
      • What does this code do?
            virtual void Update() {}
      • May touch cache line at vtable address unnecessarily
      • Consider batching by type for better iCache pattern
      • If you know the type, maybe you don’t need the virtual: save touching memory to read the function address
      • Even better: maybe the data you actually want to manipulate can be kept close together in memory?
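One way to read the batching advice above, as a minimal sketch rather than the presenter's own code: keep each concrete type in its own contiguous array and update each array in a tight, non-virtual loop. The `Rocket`, `Mine`, and `World` types are hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical game types. Neither has a vtable, so Update() is a direct call.
struct Rocket
{
    float m_Position;
    float m_Velocity;
    void Update(float dt) { m_Position += m_Velocity * dt; }
};

struct Mine
{
    float m_FuseSeconds;
    void Update(float dt) { m_FuseSeconds -= dt; }
};

// One contiguous array per type instead of one array of base-class pointers.
struct World
{
    std::vector<Rocket> m_Rockets;
    std::vector<Mine>   m_Mines;

    void UpdateAll(float dt)
    {
        // Each loop runs one function over adjacent memory: no vtable
        // lookup, and the instruction cache sees one code path at a time.
        for (std::size_t i = 0; i < m_Rockets.size(); ++i)
            m_Rockets[i].Update(dt);
        for (std::size_t i = 0; i < m_Mines.size(); ++i)
            m_Mines[i].Update(dt);
    }
};
```

Whether this layout pays off depends on how many distinct types there are and how they are accessed elsewhere, so it is worth confirming with a profiler.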
  • 12. Spot the Stall
        int SlowFunction(int & a, int & b)
        {
            a = 1;
            b = 2;
            return a + b;
        }
  • 13. Method 1
      Q: When will this work and when won’t it?
        inline int FastFunction(int & a, int & b)
        {
            a = 1;
            b = 2;
            return a + b;
        }
  • 14. Method 2
        int FastFunction(int * __restrict a, int * __restrict b)
        {
            *a = 1;
            *b = 2;
            return *a + *b; // we promise that a != b
        }
  • 15. __restrict Keyword
      • __restrict only works with pointers, not references (which sucks).
      • Aliasing only applies to identical types.
      • Can be applied to the implicit this pointer in member functions.
          • Put it after the closing parenthesis of the parameter list, in the same position as a const qualifier.
          • Stops the compiler worrying that you passed a class data member to the function.
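As a minimal sketch of the member-function case described above: `Accumulator` and `AddAll` are hypothetical names, and `__restrict` is a compiler extension, so this assumes a compiler that accepts it both on pointer parameters and as a member function qualifier (GCC documents this form).

```cpp
#include <cassert>

// Hypothetical class for illustration.
struct Accumulator
{
    int m_Total;

    // The trailing __restrict qualifies the implicit this pointer: we promise
    // that pValues never points at this object's own members, so the compiler
    // is free to keep m_Total in a register for the whole loop.
    void AddAll(const int * __restrict pValues, int count) __restrict
    {
        for (int i = 0; i < count; ++i)
        {
            m_Total += pValues[i];
        }
    }
};
```

Breaking the promise (for example, passing `&acc.m_Total` as `pValues`) is undefined behaviour, so the qualifier belongs only where the caller can guarantee there is no overlap.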
  • 16. Load-Hit-Store
      • What is it?
      • Write followed by read
      • PPU store queue
      • Average latency 40 cycles
      • Snoop bits 52 through 61 (implications?)
  • 17. True LHS
      • Type casting between register files:
            float floatValue = (float)intValue;
            float posY = vPosition.X();
      • Data member as loop counter:
            while( m_Counter-- ) {}
      • Aliasing:
            void swap( int & a, int & b ) { int t = a; a = b; b = t; }
  • 18. Workaround: Data member as loop counter
        int counter = m_Counter; // load into register
        while( counter-- )       // register will decrement
        {
            doSomething();
        }
        m_Counter = 0;           // Store to memory just once
  • 19. Workarounds
      • Keep data in the same domain as much as possible
      • Reorder your writes and reads to allow space for latency
      • Consider using word flags instead of many packed bools
          • Load flags word into register
          • Perform logical bitwise operations on flags in register
          • Store new flags value, e.g.
                m_Flags = ( initialFlags & ~kCanSeePlayer ) | newcanSeeFlag | kAngry;
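The load, modify, store pattern for a flags word could look like the sketch below. The flag names follow the slide, but their bit values, the `Baddy` struct, and `UpdateAwareness` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignments for the flags named on the slide.
const std::uint32_t kAlive        = 1u << 0;
const std::uint32_t kCanSeePlayer = 1u << 1;
const std::uint32_t kAngry        = 1u << 2;
const std::uint32_t kWearingHat   = 1u << 3;

struct Baddy
{
    std::uint32_t m_Flags;

    // canSeePlayer stands in for the result of a line-of-sight test.
    void UpdateAwareness(bool canSeePlayer)
    {
        const std::uint32_t initialFlags = m_Flags;   // one load
        if ((initialFlags & kAlive) == 0)
            return;

        // Build the new value entirely in registers.
        const std::uint32_t newCanSeeFlag = canSeePlayer ? kCanSeePlayer : 0u;
        std::uint32_t newFlags = (initialFlags & ~kCanSeePlayer) | newCanSeeFlag;
        if (canSeePlayer)
            newFlags |= kAngry;

        m_Flags = newFlags;                           // one store
    }
};
```

Because the member is read once and written once, no later read in the function can hit a store that is still in the queue.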
  • 20. False LHS Case
      • Store queue snooping only compares bits 52 through 61
      • So a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC:
          • Writing to and reading from memory 4 KB apart (or any multiple of 4 KB)
          • Writing to and reading from memory on a sub-word boundary, e.g. packed short, byte, bool
      • Write a bool then read a nearby one -> ~40 cycle stall
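The partial address comparison can be modelled in a few lines. This is only an illustration of the `addr & 0xFFC` rule quoted above, not a description of the hardware; `MayHitPendingStore` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Mask taken from the slide: bits 52 through 61 in PowerPC numbering
// (bit 63 is the least significant bit), i.e. addr & 0xFFC.
const std::uintptr_t kStoreQueueCompareMask = 0xFFC;

// Hypothetical helper: true if a load from loadAddr would be matched
// against a pending store to storeAddr by that partial comparison.
inline bool MayHitPendingStore(std::uintptr_t storeAddr, std::uintptr_t loadAddr)
{
    return (storeAddr & kStoreQueueCompareMask)
        == (loadAddr  & kStoreQueueCompareMask);
}
```

Under this model, addresses 0x10000 and 0x11000 (4 KB apart) match, as do 0x10000 and 0x10001 (two bytes in the same word), while 0x10000 and 0x10004 (adjacent words) do not.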
  • 21. Example
        struct BaddyState
        {
            bool m_bAlive;
            bool m_bCanSeePlayer;
            bool m_bAngry;
            bool m_bWearingHat;
        }; // 4 bytes
  • 22. Where might we stall?
        if ( m_bAlive )
        {
            m_bCanSeePlayer = LineOfSightCheck( this, player );
            if ( m_bCanSeePlayer && !m_bAngry )
            {
                m_bAngry = true;
                StartShootingPlayer();
            }
        }
  • 23. Workaround
        if ( m_bAlive ) // load, compare
        {
            const bool bAngry = m_bAngry; // load
            const bool bCanSeePlayer = LineOfSightCheck( this, player );
            m_bCanSeePlayer = bCanSeePlayer; // store
            if ( bCanSeePlayer && !bAngry ) // compare registers
            {
                m_bAngry = true; // store
                StartShootingPlayer();
            }
        }
  • 24. Loop + Singleton Gotcha
      What happens here?
        for( int i = 0; i < enemyCount; ++i )
        {
            EnemyManager::Get().DispatchEnemy();
        }
  • 25. Workaround
        EnemyManager & enemyManager = EnemyManager::Get();
        for( int i = 0; i < enemyCount; ++i )
        {
            enemyManager.DispatchEnemy();
        }
  • 26. Branch Hints
      • Branch mis-prediction can hurt your performance
      • 24-cycle penalty to flush the instruction queue
      • If you know a certain branch is rarely taken, you can use a static branch hint, e.g.
            if ( __builtin_expect( bResult, 0 ) )
      • Far better to eliminate the branch!
          • Use logical bitwise operations to mask results
          • This is far easier and more applicable in SIMD code
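Masking results with bitwise operations, as suggested above, might look like this for scalar integers. `SelectInt` and `MinInt` are hypothetical helper names, and whether the compiler emits truly branch-free code for the comparison should be checked in the disassembly.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free select: returns a when condition is true, otherwise b.
inline std::int32_t SelectInt(bool condition, std::int32_t a, std::int32_t b)
{
    // 0 - 1 wraps to 0xFFFFFFFF (all ones); 0 - 0 stays 0.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(condition);
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);

    // Keep the bits of a where the mask is set, the bits of b elsewhere.
    return static_cast<std::int32_t>((ua & mask) | (ub & ~mask));
}

inline std::int32_t MinInt(std::int32_t a, std::int32_t b)
{
    return SelectInt(a < b, a, b);
}
```

The same mask-and-merge idea carries over to SIMD code, where a vector compare produces the mask directly.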
  • 27. Floating Point Branch Elimination
      • __fsel, __fsels: floating point select
        // With a branch:
        float min( float a, float b )
        {
            return ( a < b ) ? a : b;
        }
        // Branch-free:
        float min( float a, float b )
        {
            return __fsels( a - b, b, a );
        }
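For readers without a PowerPC toolchain, the select semantics can be sketched portably. `SelectGE0` is a hypothetical stand-in that returns its second argument when the first is greater than or equal to zero and its third otherwise, which is the behaviour the `min` example above relies on; on PPU the `__fsels` intrinsic would take its place.

```cpp
#include <cassert>

// Portable stand-in for the select semantics: (x >= 0) ? y : z.
// This version may compile to a branch; the intrinsic would not.
inline float SelectGE0(float x, float y, float z)
{
    return (x >= 0.0f) ? y : z;
}

// a - b >= 0 means a >= b, so the smaller value is b; otherwise it is a.
inline float MinSelect(float a, float b) { return SelectGE0(a - b, b, a); }

// Same test, opposite picks.
inline float MaxSelect(float a, float b) { return SelectGE0(a - b, a, b); }

// Clamp built from the two selects, with no explicit comparisons.
inline float ClampSelect(float v, float lo, float hi)
{
    return MinSelect(MaxSelect(v, lo), hi);
}
```

Note that the subtraction can overflow or lose precision for extreme inputs, so this form is best suited to values of known, moderate range.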
  • 28. Microcoded Instructions
      • Single instruction -> several, fetched from ROM, pipeline bubble
      • Common example of one to avoid: shift by a variable amount (the shift count is in a register rather than an immediate).
            int a = b << c;
      • Minimum 11 cycle latency.
      • If you know the range of values, it can be better to switch to a fixed shift!
            switch( c )
            {
                case 1: a = b << 1; break;
                case 2: a = b << 2; break;
                case 3: a = b << 3; break;
                default: break;
            }
  • 29. Loop Unrolling
      • Why unroll loops?
          • Fewer branches
          • Better concurrency, instruction pipelining, hide latency
          • More opportunities for compiler to optimise
          • Only works if code can actually be interleaved, e.g. inline functions, no inter-iteration dependencies
      • How many times is enough?
          • On average about 4 to 6 times works well
          • Best for loops where the number of iterations is known up front
      • Need to think about the spare: iterationCount % unrollCount
  • 30. Picking up the Spare
      • If you can, artificially pad your data with safe values to keep it a multiple of the unroll count. In this example, you might process up to 3 dummy items in the worst case.
        for ( int i = 0; i < numElements; i += 4 )
        {
            InlineTransformation( pElements[ i+0 ] );
            InlineTransformation( pElements[ i+1 ] );
            InlineTransformation( pElements[ i+2 ] );
            InlineTransformation( pElements[ i+3 ] );
        }
  • 31. Picking up the Spare
      • If you can’t pad, run the unrolled loop for numElements & ~3 instead and run to completion in a second loop.
        for ( ; i < numElements; ++i )
        {
            InlineTransformation( pElements[ i ] );
        }
  • 32. Picking up the Spare
      Alternative method (pros and cons: one branch but longer code generated).
        switch( numElements & 3 )
        {
            case 3: InlineTransformation( pElements[ i+2 ] ); // fall through
            case 2: InlineTransformation( pElements[ i+1 ] ); // fall through
            case 1: InlineTransformation( pElements[ i+0 ] ); // fall through
            case 0: break;
        }
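Putting the unrolled loop and the second clean-up loop together gives something like the sketch below. `InlineTransformation` here is a hypothetical stand-in (it doubles a float) for whatever per-element work the real loop does, and `TransformAll` is an invented wrapper name.

```cpp
#include <cassert>

// Hypothetical stand-in for the slide's InlineTransformation.
inline void InlineTransformation(float & element)
{
    element *= 2.0f;
}

void TransformAll(float * pElements, int numElements)
{
    assert(numElements >= 0);

    // Largest multiple of 4 that fits; the unrolled loop stops here.
    const int unrolledEnd = numElements & ~3;

    int i = 0;
    for ( ; i < unrolledEnd; i += 4 )
    {
        InlineTransformation( pElements[ i+0 ] );
        InlineTransformation( pElements[ i+1 ] );
        InlineTransformation( pElements[ i+2 ] );
        InlineTransformation( pElements[ i+3 ] );
    }

    // Second loop picks up the 0 to 3 spare elements.
    for ( ; i < numElements; ++i )
    {
        InlineTransformation( pElements[ i ] );
    }
}
```

With 7 elements, the unrolled loop handles indices 0 to 3 in one pass and the second loop handles indices 4 to 6.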
  • 33. Loop unrolled… now what?
      • If you unrolled your loop 4 times, you might be able to use SIMD
      • Use AltiVec intrinsics; align your data to 16 bytes
      • 128-bit registers: operate on 4 32-bit values in parallel
      • Most SIMD instructions have 1 cycle throughput
      • Consider using SOA data instead of AOS
          • AOS: Arrays of interleaved posX, posY, posZ structures
          • SOA: A structure of arrays for each field dimensioned for all elements
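The two layouts can be contrasted in plain C++ as a rough sketch. The type and function names are hypothetical, and `std::vector` is used only for brevity: real AltiVec code would also need the 16-byte alignment mentioned above, which `std::vector` does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AOS: one structure per element, fields interleaved in memory.
struct PositionAOS
{
    float posX, posY, posZ;
};

// SOA: one array per field, each dimensioned for all elements.
struct PositionsSOA
{
    std::vector<float> posX, posY, posZ;
};

// Hypothetical conversion helper.
PositionsSOA ToSOA(const std::vector<PositionAOS> & aos)
{
    PositionsSOA soa;
    soa.posX.reserve(aos.size());
    soa.posY.reserve(aos.size());
    soa.posZ.reserve(aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i)
    {
        soa.posX.push_back(aos[i].posX);
        soa.posY.push_back(aos[i].posY);
        soa.posZ.push_back(aos[i].posZ);
    }
    return soa;
}

// Touches only posY. In SOA form these floats are contiguous, so four
// consecutive values could map onto one 128-bit vector register.
void ApplyGravity(PositionsSOA & positions, float dropPerFrame)
{
    for (std::size_t i = 0; i < positions.posY.size(); ++i)
    {
        positions.posY[i] -= dropPerFrame;
    }
}
```

In the AOS layout the same loop would stride over posX and posZ as well, pulling unused data into the cache.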