Optimizing Lua for Consoles
Allan J. Murphy
Senior Software Design Engineer
Advanced Technology Group, Microsoft
Introduction
• What do I know about Lua?
• Part of Microsoft’s ATG
• Performance reviews, developer visits
• Working with actual title performance
• Including Lua loading from light to very heavy
About Console Cores
Lua Usage
• Lua is commonly used in console games
  • Low memory footprint
  • Lightweight processing
  • Many sets of bindings to C++
• 360 and PS3 have 3.2 GHz CPUs
• Lua should run just fine, right?
• Sadly, like most other converted code, not so
Lua Performance
• Is performance a problem?
• Level of Lua usage in console games varies
  • Depends on genre of game, in part
• Sparse use, e.g. complex AI behaviors only
  • Could be a couple of milliseconds
• Highly integrated, all the way down into the engine and renderer
  • Could be the major bound on frame rate on CPU
• Lua is not always easily parallelizable
  • Or at least, parallel implementations are uncommon
• So yes, Lua performance really is important
Performance of Ported Code
• Code ported to PS3 or 360 CPU may surprise
  • And not in a good way
• A naïve 360 port can be 10x slower than Windows
  • Lua tasks can be in that range
• But the processor is 3.2 GHz, why the slowdown?
  • CPU cores are cut down to reduce cost
  • Memory system is lower spec
  • Cheap slow memory, smaller caches, no L3
Console Performance Penalties
In-Order Penalties
• Where is code penalized?
• Memory access
  • L2 cache miss
• CPU core missing out-of-order execution hardware
  • Load-Hit-Store
  • Branch mispredict
  • Expensive instructions
L2 Cache Miss
• Memory is slow
  • An L2 miss is 610 cycles
  • An L2 hit is 40 cycles
  • An L1 hit is 5 cycles
  • Factor of 15 difference between L2 hit and miss
• Cache line is 128 bytes
  • Typically loading double the line size of x86
  • Easy to waste memory throughput
• Poor memory use is heavily penalized
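To put the miss cost in wall-clock terms, here is a rough worked calculation using the 610-cycle and 3.2 GHz figures from the slides. The 30 frames-per-second target is an assumption chosen for illustration, not a number from the talk.

```latex
\[
t_{\text{miss}} = \frac{610\ \text{cycles}}{3.2\times10^{9}\ \text{cycles/s}} \approx 190.6\ \text{ns},
\qquad
N_{\text{frame}} = \frac{(3.2\times10^{9}/30)\ \text{cycles}}{610\ \text{cycles/miss}} \approx 1.75\times10^{5}\ \text{misses}
\]
```

Under that assumption, roughly 175,000 serialized L2 misses on one core would consume an entire frame before any useful work is done.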
Load-Hit-Store (LHS)
• LHS occurs when the CPU stores to a memory address…
  • … then loads from it very shortly after
• In-order hardware is unable to alter instruction flow to avoid it
• No store-forwarding hardware in CPU
• No instructions for moving data between register sets
• LHS most often caused in code by:
  • Changing register set, e.g. casts, combining math types
  • Parameters passed by reference
  • Pointer aliasing
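A minimal C sketch of the "parameters passed by reference" case listed above. The function names are hypothetical and not from the talk. Whether a given compiler actually emits the store-then-load pattern depends on its alias analysis, so treat this as an illustration of the shape of the problem rather than a guaranteed reproduction.

```c
#include <assert.h>
#include <stddef.h>

/* LHS-prone shape: the running total lives behind a pointer. Because
   `values` and `total` might alias, the compiler may store *total on
   every iteration and load it back on the next one. */
void sum_by_reference(const float *values, size_t count, float *total)
{
    assert(total != NULL && (values != NULL || count == 0));
    *total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        *total += values[i];   /* store to *total, reload it next pass */
    }
}

/* Friendlier shape: accumulate in a local, which can stay in a
   floating-point register, and hand the result back by value. */
float sum_by_value(const float *values, size_t count)
{
    assert(values != NULL || count == 0);
    float total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

Both functions compute the same result; the difference is only in how much traffic goes through memory between dependent operations.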
Branch Mispredict
• Branches prevent compiler scheduling around penalties
  • Given other penalties, this can be very important
• Mispredicting a branch on console is costly
• Mispredict causes CPU to:
  • Discard instructions it has fetched, thinking it needed them
  • Pay a 23-24 cycle penalty as the correct instructions are fetched
• Branch prediction normally does a good job
  • But in some cases this penalty can be high
How Does This Affect the Lua VM?
How Does This Affect the Lua VM?
• Console CPU cores penalize Lua in several ways:
  • LHS on data handling
  • L2 miss on table access
  • L2 miss on garbage collection and free list maintenance
  • Branch mispredict on VM main loop
• Interesting aside
  • Work to avoid in-order core issues and L2 miss…
  • … improves performance on out-of-order cores anyway
Data Handling, LHS & Memory Access
Data Handling, LHS & Memory Access
• Lua keeps all basic types internally as a union
  • 4 byte value represents bool, pointer, numeric data…
  • Type field
  • Results in 64 bit structure
• Issues
  • Enum has only 9 values, but is stored in 32 bits
  • No way to pass this structure in registers
  • Pass value as int, LHS when you need float, and vice versa
  • Storing on stack incurs extra instructions and memory access
Data Handling, LHS & Memory Access
• Not a very easy problem to solve elegantly
• Poor solution: just bear the cost
  • Doesn’t seem good enough on a performance-starved CPU
• Unpalatable solution: don’t use the union
  • Pass int and float parts through registers at all times
  • Solves memory and LHS issues
  • Not very pretty though
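The two options can be sketched in C as follows. This is a simplified stand-in, not Lua's actual `TValue` definition: the type and function names are hypothetical, the number type is assumed to be single-precision `float`, and a `uint32_t` handle stands in for a 32-bit console pointer so that the 64-bit layout holds on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type tags; the slides mention 9 values, only a few shown. */
enum { T_NIL = 0, T_BOOL = 1, T_NUMBER = 2, T_OBJECT = 3 };

/* Union layout as described: a 4-byte payload plus a 32-bit type field. */
typedef union {
    int32_t  b;       /* boolean */
    float    n;       /* number (assumed single precision) */
    uint32_t handle;  /* stand-in for a 32-bit pointer */
} Payload;

typedef struct {
    Payload payload;
    int32_t type;     /* 32 bits spent on a handful of tag values */
} TaggedValue;

/* Union route: operands arrive through memory, so on the console cores
   the float must be loaded from wherever the value was last stored. */
float add_numbers_union(const TaggedValue *a, const TaggedValue *b)
{
    assert(a->type == T_NUMBER && b->type == T_NUMBER);
    return a->payload.n + b->payload.n;
}

/* "Unpalatable" route: tag and float part travel as separate scalar
   arguments, which a calling convention can keep in registers. */
float add_numbers_split(int32_t a_type, float a_num,
                        int32_t b_type, float b_num)
{
    assert(a_type == T_NUMBER && b_type == T_NUMBER);
    return a_num + b_num;
}
```

The split version is noisier at every call site, which matches the "not very pretty" verdict above, but no value has to round-trip through memory just to change register set.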
getTable() & L2 Miss
getTable() & L2 Miss
• Much of Lua’s data is stored in tables
  • Even simple field access goes through the table system
• For some sequentially indexed data…
  • … access goes through separate small array storage
• Commonly…
  • … value lookup is done via hash table
getTable() & L2 Miss
[Diagram: the Lua Table struct points to an Array Part (a run of TValues) and to a Hash Table of "Key & TValue" nodes chained through nextPtr. A Branch selects between the two parts, and each pointer hop (table struct, array part, hash node, chained node) is marked as an L2 Miss.]
getTable() & L2 Miss
• Likely several L2 misses just to get to value
• Several possible improvements
• Abandon small sequential array
  • Save space, which improves caching
  • We don’t have the large caches and fast memory of a desktop
  • Drop branching and logic for handling small array
  • Main hash table works for sequential case anyway
  • Focus effort on optimizing one mechanism, not two
getTable() & L2 Miss
• Compact hash table to improve L2 performance
  • Store table of 2 entries since typical list depth is 1.2
• Make hash table contiguous
  • Drop next pointers
  • Store types as 4 bits, packed separately from values
  • Bulk together in groups of 28, i.e. one cache line in size
  • Drops data size by 62.5%, L2 miss should drop similarly
• Make hash collision mechanism just advance in array
  • Collision should be much less expensive
  • Means hash function can be simpler, i.e. faster
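A sketch of the two ideas above: packing 28 four-byte values with their 4-bit type tags into one 128-byte cache line (28 × 4 + 28 × 0.5 = 126 bytes, padded to 128), and resolving collisions by simply advancing in the array. All names, the table size, the trivial hash (low bits of the key), and the use of key 0 as the empty marker are assumptions made for illustration. The slides do not say how keys are stored, so the probing table here is kept separate from the packed group.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUP_SLOTS 28u

/* One cache line: 112 bytes of values + 14 bytes of tags + 2 pad = 128. */
typedef struct {
    uint32_t values[GROUP_SLOTS];
    uint8_t  types[GROUP_SLOTS / 2u];  /* two 4-bit tags per byte */
    uint8_t  pad[2];
} ValueGroup;

unsigned group_get_type(const ValueGroup *g, unsigned slot)
{
    assert(slot < GROUP_SLOTS);
    uint8_t byte = g->types[slot / 2u];
    return (slot & 1u) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0Fu);
}

void group_set_type(ValueGroup *g, unsigned slot, unsigned type)
{
    assert(slot < GROUP_SLOTS && type < 16u);
    uint8_t *byte = &g->types[slot / 2u];
    if (slot & 1u)
        *byte = (uint8_t)((*byte & 0x0Fu) | (type << 4)); /* high nibble */
    else
        *byte = (uint8_t)((*byte & 0xF0u) | type);        /* low nibble */
}

/* Contiguous table with no next pointers: a collision just steps to the
   following slot (linear probing). Size must be a power of two. */
#define TABLE_SLOTS 8u
#define EMPTY_KEY   0u

typedef struct {
    uint32_t keys[TABLE_SLOTS];
    uint32_t values[TABLE_SLOTS];
} ProbeTable;

int probe_insert(ProbeTable *t, uint32_t key, uint32_t value)
{
    assert(key != EMPTY_KEY);
    for (uint32_t step = 0; step < TABLE_SLOTS; ++step) {
        uint32_t slot = (key + step) & (TABLE_SLOTS - 1u);
        if (t->keys[slot] == EMPTY_KEY || t->keys[slot] == key) {
            t->keys[slot] = key;
            t->values[slot] = value;
            return 1;
        }
    }
    return 0;  /* table full */
}

int probe_find(const ProbeTable *t, uint32_t key, uint32_t *out)
{
    for (uint32_t step = 0; step < TABLE_SLOTS; ++step) {
        uint32_t slot = (key + step) & (TABLE_SLOTS - 1u);
        if (t->keys[slot] == key) { *out = t->values[slot]; return 1; }
        if (t->keys[slot] == EMPTY_KEY) return 0;  /* hit a gap: absent */
    }
    return 0;
}
```

Because a colliding entry usually lands in the adjacent slot, a probe tends to stay within the cache line that was already loaded, which is the point of dropping the chained nodes.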
Garbage Collection & L2 Miss
Garbage Collection & L2 Miss
• Default garbage collector
  • Works via mark and sweep system
• On console, this is very expensive
  • Each free block record examined incurs an L2 miss, i.e. 610 cycles
  • Typically only a flag per block record is examined
  • But an L2 miss loads a 128 byte cache line
  • Throughput is wasted, loaded data is unused
  • L2 miss massively dominates total time
Garbage Collection & L2 Miss
• Consider supporting with custom block allocator
  • Histogram allocation requests
  • Tune block allocator sizes to spikes in histogram
• Block allocator…
  • Keeps a bitmask of allocated chunks
  • Chunks are fixed size
  • Good allocator size is a multiple of 1024 records: a 1024-bit mask fills one 128-byte L2 cache line
  • Reduces memory fragmentation
  • When full, falls back to normal allocator
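A minimal sketch of such a block allocator, assuming one pool of 1024 fixed-size chunks tracked by a 1024-bit mask (128 bytes). The names and the 32-byte chunk size are hypothetical; in practice the chunk size would come from the allocation histogram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_CHUNKS 1024u  /* 1024 mask bits = 128 bytes = one cache line */
#define CHUNK_SIZE  32u    /* hypothetical size picked from a histogram */

typedef struct {
    uint32_t      used[POOL_CHUNKS / 32u];          /* 1 bit per chunk */
    unsigned char storage[POOL_CHUNKS * CHUNK_SIZE];
} BlockPool;

void pool_init(BlockPool *p)
{
    memset(p->used, 0, sizeof p->used);
}

/* Returns a free chunk, or NULL when the pool is full so the caller can
   fall back to the normal allocator. */
void *pool_alloc(BlockPool *p)
{
    for (unsigned w = 0; w < POOL_CHUNKS / 32u; ++w) {
        if (p->used[w] == UINT32_MAX) continue;     /* word fully used */
        for (unsigned bit = 0; bit < 32u; ++bit) {
            uint32_t mask = (uint32_t)1 << bit;
            if (!(p->used[w] & mask)) {
                p->used[w] |= mask;
                return p->storage + (size_t)(w * 32u + bit) * CHUNK_SIZE;
            }
        }
    }
    return NULL;
}

/* Returns 1 if ptr belonged to this pool and was released, 0 otherwise
   (in which case the caller should pass it to the normal allocator). */
int pool_free(BlockPool *p, void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)p->storage;
    if (addr < base || addr >= base + sizeof p->storage) return 0;
    size_t index = (size_t)(addr - base) / CHUNK_SIZE;
    p->used[index / 32u] &= ~((uint32_t)1 << (index % 32u));
    return 1;
}
```

Scanning for free chunks touches only the mask, so the whole pool's occupancy can be examined from a single cache line instead of one miss per block record.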
Branch Mispredict & Lua VM
Branch Mispredict & Lua VM
• Lua is typically interpreted on consoles
  • No JITting, since the security model forbids executing data as code
  • Precompiled code possible, but some disadvantages
• VM main loop typically does:
  • Pick up opcode
  • Jump through huge switch to code to execute opcode
  • Pick up data required by opcode
  • Execute
  • Back to top
Branch Mispredict & Lua VM
• Problem…
  • The VM loop is mispredict-mungous
• Switch statement is implemented using bctr instruction
  • Loads unknown & unpredictable value from memory (opcode)
  • Then branches on it
• Simple branch prediction hardware on core:
  • Has 6 bit global history and 2 bit prediction scheme
  • Doesn’t have much of a chance in this case
  • Mispredict penalty grows linearly with opcode count
Branch Mispredict & Lua VM
• There are many code perturbations that seem hopeful
  • Tree of ifs derived from popularity of opcodes
  • ‘Direct threading’
  • Preloading ctr register
• Sadly, the best route is to branch less
  • Statistical analysis of opcode sequences
  • For example, 35% of opcode pairs are getTable-getTable
  • Idea: build super-opcode processing which drops branches
  • Remove other branches on opcode
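The super-opcode idea can be illustrated with a toy interpreter. This is not the Lua VM: the instruction format, the opcode names, and the model of a "table" as an offset into a flat `heap` array are all assumptions made to keep the sketch short. It fuses a get-get pair into one opcode and counts how many times the dispatch switch is entered.

```c
#include <assert.h>

/* Toy opcodes; OP_GET_GET is the fused "super-opcode". */
enum { OP_GET, OP_GET_GET, OP_HALT };

typedef struct {
    int op;
    int a;   /* first field offset */
    int b;   /* second field offset (used by OP_GET_GET only) */
} Instr;

/* Runs `code` starting from object reference `acc`. Each pass through
   the switch is one indirect branch the predictor has to guess. */
int run(const Instr *code, const int *heap, int acc, int *dispatches)
{
    *dispatches = 0;
    for (const Instr *pc = code;; ++pc) {
        ++*dispatches;
        switch (pc->op) {
        case OP_GET:            /* acc = acc.field_a */
            acc = heap[acc + pc->a];
            break;
        case OP_GET_GET:        /* acc = acc.field_a.field_b, one dispatch */
            acc = heap[acc + pc->a];
            acc = heap[acc + pc->b];
            break;
        case OP_HALT:
            return acc;
        default:
            assert(!"unknown opcode");
            return acc;
        }
    }
}
```

For a chained access such as `a.b.c`, the fused form performs the same two lookups but enters the switch once instead of twice, so there is one fewer unpredictable indirect branch per pair.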
Summary
Summary
• Console cores and memory punish Lua performance
  • Four areas mentioned above
  • But other smaller areas too
• LHS, branch mispredict and L2 miss are your enemy
  • In particular, L2 miss is never to be underestimated
• Improving performance requires care and thought
  • But there are gains to be found