Optimizing Lua For Consoles - Allan Murphy (Microsoft)
2. Optimizing Lua for Consoles. Allan J. Murphy, Senior Software Design Engineer, Advanced Technology Group, Microsoft.
3. Introduction: What do I know about Lua? I am part of Microsoft’s ATG, doing performance reviews and developer visits and working with actual title performance, including Lua loading from light to very heavy.
5. Lua Usage: Lua is commonly used in console games: low memory footprint, lightweight processing, many sets of bindings to C++. The 360 and PS3 have 3.2 GHz CPUs, so Lua should run just fine, right? Sadly, like most other converted code, not so.
6. Lua Performance: Is performance a problem? The level of Lua usage in console games varies and depends, in part, on the genre of game. Sparse use (e.g. complex AI behaviors only) could be a couple of milliseconds. Highly integrated use (all the way down into the engine and renderer) could be the major bound on frame rate on the CPU. Lua is not always easily parallelizable, or at least parallel implementations are uncommon. So yes, Lua performance really is important.
7. Performance of Ported Code: Code ported to the PS3 or 360 CPU may surprise, and not in a good way. A naïve 360 port can be 10x slower than Windows, and Lua tasks can be in that range. But the processor is 3.2 GHz, so why the slowdown? The CPU cores are cut down to reduce cost, and the memory system is lower spec: cheap slow memory, smaller caches, no L3.
9. In-Order Penalties: Where is code penalized? Memory access: L2 cache miss. CPU core missing out-of-order execution hardware: load-hit-store, branch mispredict, expensive instructions.
10. L2 Cache Miss: Memory is slow. An L2 miss is 610 cycles, an L2 hit is 40 cycles, and an L1 hit is 5 cycles, so there is a factor of 15 difference between an L2 hit and a miss. The cache line is 128 bytes, typically double the line size of x86, so it is easy to waste memory throughput. Poor memory use is heavily penalized.
11. Load-Hit-Store (LHS): An LHS occurs when the CPU stores to a memory address and then loads from it very shortly after. In-order hardware is unable to alter instruction flow to avoid it, there is no store-forwarding hardware in the CPU, and there are no instructions for moving data between register sets. LHS is most often caused in code by: changing register set (e.g. casts, combining math types); parameters passed by reference; pointer aliasing.
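The three LHS causes listed on this slide can be illustrated with a minimal C sketch. The function names are hypothetical, and whether a given compiler really emits a store followed by a dependent load here depends on the compiler and its flags, so treat this as an illustration of the pattern rather than a measurement.

```c
#include <stddef.h>

/* Parameter passed by reference, plus possible pointer aliasing: the
   compiler cannot prove that total does not point into values[], so it
   may store *total to memory and reload it on every iteration. On an
   in-order core without store forwarding, that pair is an LHS candidate. */
void sum_by_reference(const int *values, size_t count, int *total)
{
    *total = 0;
    for (size_t i = 0; i < count; ++i)
        *total += values[i];
}

/* Same result, but the running sum can live in a register for the whole
   loop and is handed back by value. */
int sum_in_register(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

/* Changing register set: with no instruction for moving data between
   integer and floating-point registers, this cast has to go through
   memory (store as int, load as float) on the cores described above. */
float scale_index(int index, float step)
{
    return (float)index * step;
}
```

A caller would prefer `sum_in_register` in hot loops and keep values in one numeric type where possible, so that casts like the one in `scale_index` happen rarely.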
12. Branch Mispredict: Branches prevent the compiler from scheduling around penalties; given the other penalties, this can be very important. Mispredicting a branch on console is costly: the CPU discards the instructions it has fetched, thinking it needed them, and pays a 23-24 cycle penalty while the correct instructions are fetched. Branch prediction normally does a good job, but in some cases this penalty can be high.
14. How Does This Affect the Lua VM? Console CPU cores penalize Lua in several ways: LHS on data handling; L2 miss on table access; L2 miss on garbage collection and free list maintenance; branch mispredict on the VM main loop. Interesting aside: work to avoid in-order core issues and L2 misses improves performance on out-of-order cores anyway.
16. Data Handling, LHS & Memory Access: Lua keeps all basic types internally as a union: a 4-byte value (representing bool, pointer, numeric data…) plus a type field, which results in a 64-bit structure. Issues: the type enum has only 9 values but is stored in 32 bits; there is no way to pass this structure in registers; passing the value as an int incurs an LHS when you need a float, and vice versa; storing on the stack incurs extra instructions and memory accesses.
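The layout described on this slide can be sketched in C roughly as follows. The names are hypothetical and do not match Lua's own headers, and the sketch assumes a build where the number type is a 4-byte float and pointers are 32 bits, as the slide's 64-bit total implies.

```c
#include <stdint.h>

/* Nine type tags, as the slide counts them (names are illustrative). */
enum {
    TYPE_NIL, TYPE_BOOLEAN, TYPE_LIGHTUSERDATA, TYPE_NUMBER, TYPE_STRING,
    TYPE_TABLE, TYPE_FUNCTION, TYPE_USERDATA, TYPE_THREAD
};

/* 4-byte payload: the same bits are read as bool, number or pointer. */
typedef union {
    int32_t  boolean;
    float    number;    /* assumption: number type configured as float */
    uint32_t pointer;   /* stands in for a 32-bit console pointer */
} Payload;

/* Payload plus a tag stored in a full 32 bits gives a 64-bit structure. */
typedef struct {
    Payload value;
    int32_t type;
} TaggedValue;

TaggedValue make_number(float n)
{
    TaggedValue v;
    v.value.number = n;
    v.type = TYPE_NUMBER;
    return v;
}

/* Reading the payload as a float after it was written through memory is
   where the int/float register-set change can surface as an LHS. */
float to_number(const TaggedValue *v)
{
    return v->type == TYPE_NUMBER ? v->value.number : 0.0f;
}
```

Only 4 bits of the 32-bit `type` field are needed to distinguish nine tags, which is the waste the slide points at.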
17. Data Handling, LHS & Memory Access: Not a very easy problem to solve elegantly. Poor solution: just bear the cost, which doesn’t seem good enough on a performance-starved CPU. Unpalatable solution: don’t use the union; pass the int and float parts through registers at all times. This solves the memory and LHS issues, but it is not very pretty.
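One way to picture the "unpalatable" option is a function that receives each value as three scalars instead of one union. This is only a sketch under assumptions: the tag values, the function name and the calling shape are all hypothetical, and the slide does not say how the split was actually implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tags for this sketch only. */
enum { TAG_BOOLEAN = 1, TAG_NUMBER = 3 };

/* Each value arrives as (tag, integer part, float part). The integer
   parts can stay in integer registers and the float parts in float
   registers, so nothing is written to memory as one type and read back
   as the other. Only the part that matches the tag is meaningful. */
bool values_equal(int32_t tag_a, int32_t int_a, float flt_a,
                  int32_t tag_b, int32_t int_b, float flt_b)
{
    if (tag_a != tag_b)
        return false;
    if (tag_a == TAG_NUMBER)
        return flt_a == flt_b;
    return int_a == int_b;
}
```

The cost is visible in the signature: six parameters for two values, which is the "not very pretty" part.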
19. getTable() & L2 Miss: Much of Lua’s data is stored in tables, and even simple field access goes through the table system. Some sequentially indexed data goes through separate small array storage; more commonly, value lookup is done via the hash table.
20. getTable() & L2 Miss: [Diagram: the Lua Table struct branches to either the array part (a run of TValues) or the hash table (Key & TValue nodes chained by nextPtr); each hop is marked as a potential L2 miss.]
21. getTable() & L2 Miss: There are likely several L2 misses just to get to a value. Several improvements are possible. Abandon the small sequential array: this saves space, which improves caching (we don’t have the large caches and fast memory of a desktop), and it drops the branching and logic for handling the small array. The main hash table works for the sequential case anyway, so effort can focus on optimizing one mechanism, not two.
22. getTable() & L2 Miss: Compact the hash table to improve L2 performance. Store a table of 2 entries, since the typical list depth is 1.2. Make the hash table contiguous and drop the next pointers. Store types as 4 bits, packed separately from the values, and bulk them together in groups of 28, i.e. one cache line in size. This drops data size by 62.5%, and L2 misses should drop similarly. Make the hash collision mechanism just advance in the array: collisions should be much less expensive, which means the hash function can be simpler, i.e. faster.
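The packing arithmetic on this slide (28 four-byte values plus 28 four-bit tags is 126 bytes, which fits a 128-byte cache line) can be sketched in C as below. Everything beyond that arithmetic is an assumption made for illustration: the struct and function names, treating even slots as keys and odd slots as values, using the key modulo the pair count as the hash, and leaving out deletion.

```c
#include <stdint.h>

#define GROUP_SLOTS 28
#define GROUP_PAIRS (GROUP_SLOTS / 2)   /* assumption: slots used as key/value pairs */

enum { TYPE_EMPTY = 0, TYPE_NUMBER = 3 };   /* hypothetical 4-bit tags */

typedef struct {
    uint32_t values[GROUP_SLOTS];       /* 112 bytes of 4-byte payloads */
    uint8_t  types[GROUP_SLOTS / 2];    /* 14 bytes: two 4-bit tags per byte */
    uint8_t  pad[2];                    /* round up to one 128-byte line */
} SlotGroup;

unsigned get_type(const SlotGroup *g, unsigned slot)
{
    uint8_t byte = g->types[slot / 2];
    return (slot & 1u) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0Fu);
}

void set_type(SlotGroup *g, unsigned slot, unsigned type)
{
    uint8_t *byte = &g->types[slot / 2];
    if (slot & 1u)
        *byte = (uint8_t)((*byte & 0x0Fu) | ((type & 0x0Fu) << 4));
    else
        *byte = (uint8_t)((*byte & 0xF0u) | (type & 0x0Fu));
}

/* Insert or update; a collision simply advances to the next pair.
   Returns the pair index used, or -1 if the group is full. */
int insert_pair(SlotGroup *g, uint32_t key, uint32_t value)
{
    unsigned start = key % GROUP_PAIRS;
    for (unsigned i = 0; i < GROUP_PAIRS; ++i) {
        unsigned pair = (start + i) % GROUP_PAIRS;
        unsigned key_slot = pair * 2;
        if (get_type(g, key_slot) == TYPE_EMPTY || g->values[key_slot] == key) {
            g->values[key_slot] = key;
            set_type(g, key_slot, TYPE_NUMBER);
            g->values[key_slot + 1] = value;
            set_type(g, key_slot + 1, TYPE_NUMBER);
            return (int)pair;
        }
    }
    return -1;
}

/* Returns the pair index holding key, or -1 if it is absent. */
int find_pair(const SlotGroup *g, uint32_t key)
{
    unsigned start = key % GROUP_PAIRS;
    for (unsigned i = 0; i < GROUP_PAIRS; ++i) {
        unsigned pair = (start + i) % GROUP_PAIRS;
        unsigned key_slot = pair * 2;
        if (get_type(g, key_slot) == TYPE_EMPTY)
            return -1;
        if (g->values[key_slot] == key)
            return (int)pair;
    }
    return -1;
}
```

Because a probe only ever moves forward inside the same contiguous group, a colliding lookup stays within the cache line that the first probe already loaded, instead of following a next pointer to another line.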
24. Garbage Collection & L2 Miss: The default garbage collector works via a mark and sweep system. On console this is very expensive: each free block record examined incurs an L2 miss, i.e. 610 cycles. Typically only a flag per block record is examined, but the L2 miss loads a 128-byte cache line, so throughput is wasted and the loaded data goes unused. L2 misses massively dominate the total time.
25. Garbage Collection & L2 Miss: Consider supporting the collector with a custom block allocator. Histogram the allocation requests and tune the block allocator sizes to the spikes in the histogram. The block allocator: keeps a bitmask of allocated chunks; uses fixed-size chunks; works well when its size is a multiple of 1024 records (a 1024-bit mask fills one 128-byte L2 cache line); reduces memory fragmentation; falls back to the normal allocator when full.
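A minimal sketch of such a block allocator in C, under stated assumptions: the 32-byte chunk size is a placeholder for whatever the allocation histogram suggests, the names are hypothetical, `malloc`/`free` stand in for the "normal allocator", and alignment beyond 4 bytes is not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_COUNT 1024    /* 1024 mask bits = 128 bytes = one cache line */
#define CHUNK_SIZE  32      /* placeholder: choose from the histogram */

typedef struct {
    uint32_t      used[CHUNK_COUNT / 32];           /* 1 bit per chunk */
    unsigned char storage[CHUNK_COUNT * CHUNK_SIZE];
} BlockPool;

/* First-fit scan of the bitmask; requests that do not fit a chunk, or
   that arrive when the pool is full, fall back to malloc. */
void *pool_alloc(BlockPool *pool, size_t size)
{
    if (size <= CHUNK_SIZE) {
        for (unsigned word = 0; word < CHUNK_COUNT / 32; ++word) {
            if (pool->used[word] == 0xFFFFFFFFu)
                continue;                           /* all 32 chunks taken */
            for (unsigned bit = 0; bit < 32; ++bit) {
                if (!(pool->used[word] & (1u << bit))) {
                    pool->used[word] |= 1u << bit;
                    return pool->storage + (size_t)(word * 32 + bit) * CHUNK_SIZE;
                }
            }
        }
    }
    return malloc(size);
}

/* Pointers inside the pool clear their mask bit; anything else came
   from the fallback allocator and is handed back to free. */
void pool_free(BlockPool *pool, void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)pool->storage;
    if (p >= base && p < base + sizeof pool->storage) {
        size_t index = (size_t)(p - base) / CHUNK_SIZE;
        pool->used[index / 32] &= ~(1u << (index % 32));
    } else {
        free(ptr);
    }
}
```

The point of the design is that checking or freeing 1024 chunks touches a single 128-byte mask, rather than one block record (and one potential L2 miss) per chunk.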
27. Branch Mispredict & Lua VM: Lua is typically interpreted on consoles. There is no JITting, since the security model forbids executing data; precompiled code is possible, but has some disadvantages. The VM main loop typically does: pick up the opcode; jump through a huge switch to the code that executes the opcode; pick up the data required by the opcode; execute; back to top.
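The loop shape listed on this slide can be shown with a toy interpreter in C. This is not Lua's VM: the opcodes are invented, the machine is stack-based for brevity (Lua's is register-based), and there is no stack bounds checking.

```c
#include <stdint.h>

/* Toy opcodes for illustration only. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

int32_t run(const int32_t *code)
{
    int32_t stack[16];
    int sp = 0;

    for (;;) {
        int32_t op = *code++;           /* pick up opcode */
        switch (op) {                   /* one indirect jump per opcode */
        case OP_PUSH:
            stack[sp++] = *code++;      /* pick up data required by opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
        default:
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }                                   /* back to top */
}
```

For example, the program `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT}` computes (2 + 3) * 4. Every trip around the loop goes through the same `switch`, so the target of that one jump changes with each opcode.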
28. Branch Mispredict & Lua VM: Problem… the VM loop is mispredict-mungous. The switch statement is implemented using the bctr instruction: the code loads an unknown and unpredictable value from memory (the opcode), then branches on it. The simple branch prediction hardware on the core has a 6-bit global history and a 2-bit prediction scheme, so it doesn’t have much of a chance in this case. The mispredict penalty grows linearly with opcode count.
29. Branch Mispredict & Lua VM: There are many code perturbations that seem hopeful: a tree of ifs derived from the popularity of opcodes; ‘direct threading’; preloading the ctr register. Sadly, the best route is to branch less. Use statistical analysis of opcode sequences: for example, 35% of opcode pairs are getTable-getTable. Idea: build super-opcode processing which drops branches, and remove other branches on opcode.
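The two steps on this slide, counting opcode pairs and then fusing a frequent pair into a super-opcode, can be sketched in C as follows. The opcode set, the fused `OP_GETTABLE2` opcode and both function names are hypothetical, and the sketch works on a bare opcode trace with operands left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy opcode set; OP_GETTABLE2 is an invented fused opcode that stands
   for two consecutive OP_GETTABLE instructions. */
enum { OP_MOVE, OP_GETTABLE, OP_CALL, OP_GETTABLE2, OPCODE_COUNT };

/* Statistical analysis: counts[a][b] is incremented each time opcode b
   immediately follows opcode a in the trace. */
void count_opcode_pairs(const uint8_t *ops, size_t n,
                        uint32_t counts[OPCODE_COUNT][OPCODE_COUNT])
{
    for (size_t i = 0; i + 1 < n; ++i) {
        assert(ops[i] < OPCODE_COUNT && ops[i + 1] < OPCODE_COUNT);
        counts[ops[i]][ops[i + 1]]++;
    }
}

/* Super-opcode rewrite: each adjacent GETTABLE, GETTABLE pair becomes a
   single GETTABLE2, so the interpreter dispatches once instead of twice.
   out must have room for n opcodes; returns the rewritten length. */
size_t fuse_gettable_pairs(const uint8_t *ops, size_t n, uint8_t *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 1 < n && ops[i] == OP_GETTABLE && ops[i + 1] == OP_GETTABLE) {
            out[written++] = OP_GETTABLE2;
            ++i;                        /* second half of the pair consumed */
        } else {
            out[written++] = ops[i];
        }
    }
    return written;
}
```

Each fused pair removes one trip through the dispatch branch, which is where the saving comes from; which pairs are worth fusing would be decided from the counts gathered on real scripts.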
31. Summary: Console cores and memory punish Lua performance, in the four areas mentioned above but in other smaller areas too. LHS, branch mispredict and L2 miss are your enemy; in particular, L2 miss is never to be underestimated. Improving performance requires care and thought, but there are gains to be found.