This talk covers the work Intel and Epic Games have done together to enable improved performance of UE4 on Intel platforms, including DirectX 12 and Android. Many techniques presented are general and apply to all games and engines.
1. Jeff Rous – Graphics Software Engineer, Intel
Twitter: @jeff_rous
Optimization Deep Dive:
Unreal Engine 4 on Intel
2. Intel Software – Developer Relations Division Intel Confidential
Overview
Rationale
Intel Graphics Roadmap/Details
How We Measured
Common Pain Points
Shader Optimizations
Optimizing for DX12
VR Tips and Tricks
Android x86/x64 and ASTC Support
3. Why Work Together?
Benefits all games that use the engine
UE4 runs on more hardware
Intel has 18% GPU share. 4 of the top 10 most popular GPUs are Intel. (Steam)
Optimizations help everyone – high end to phone
Common goals
APIs like DX12 and Vulkan are going to power tomorrow’s games
Virtual reality an important new segment
Android is a large market and key for Epic and Intel
4. Intel® HD Graphics: Roadmap
Sandy Bridge (2011), Intel® 2nd Gen Core™ Processor: 32nm; Feature Level 10.1; up to 12 EUs
Ivy Bridge (2012), Intel® 3rd Gen Core™ Processor: 22nm; Feature Level 11.0; up to 16 EUs
Haswell (2013), Intel® 4th Gen Core™ Processor: Feature Level 11.1; DX Extensions; GT3 (40 EUs); EDRAM; Iris Pro™, Iris™ brands
Broadwell (2014), Intel® 5th Gen Core™ Processor: 14nm; Feature Level 11.2; up to 48 EUs
Skylake (2015-16), Intel® 6th Gen Core™ Processor: Feature Level 12.0; GT4 (72 EUs); GT3e 15/28W; DX12 HW
Up to 30X faster graphics over last 5 years
5. Intel® HD Graphics: EDRAM
Basic facts
Located on the same package as the CPU
64-128MB
Bandwidth – 50 GB/sec each way (100 GB/sec total BW)
Acts as a 4th level cache
Just works: no API required to take advantage of it
Bandwidth Saving
Increasing compute requires more bandwidth
EDRAM helps to reduce BW consumption and improve EU efficiency
Just works, but efficiency can be improved by re-using frame data
[Diagram labels: CPU Package; Intel 6th Gen Core™ chip; CPU Cores; Ring-bus; LL$; Gfx Core; EDRAM; System Memory]
6. How We Measured – Intel GPA
Use ToggleDrawEvents command
Frame debugging and live mode
Experiment!
7. How We Measured
ProfileGPU command
Stat commands
Windows Performance Analyzer
Intel Extreme Tuning Utility
8. Intel Pain Points – Memory Bandwidth
Memory bandwidth is at a premium with integrated graphics
Gbuffers are memory hungry. UE4 is configurable: you can change the format and eliminate or even combine channels. Scaling the resolution of the gbuffers is good up to a point.
9. Intel Pain Points – Dense Geometry
With sub-pixel or very dense meshes, vertex shader execution can’t be covered by pixel shader execution, which starves the hardware. Use LOD where possible.
The clipper can become the bottleneck in the worst cases. Use frustum culling on bounding boxes at the very least, and occlusion culling for hidden objects.
10. A Word About Power
Intel graphics are typically found in low power systems.
Less CPU usage leaves more power for graphics.
11. Shaders – Local Memory
64 byte cache lines benefit from loop unrolling a great deal.
Avoid small loads in tight loops
12. Shaders – Unused Attributes
Often shaders are bound with large structures full of constants that go unused. This is not cache friendly.
Depth passes are especially bad, outputting values not used by a null pixel shader.
In UE4, make use of r.ShaderPipelines for depth passes.
In DX12, make liberal use of the DENY_*_ROOT_ACCESS root signature flags to limit resource-shader visibility.
13. Shaders – Branching and Sampling
Using lots of temporaries can starve the hardware.
Branching is expensive if loads are inside the conditional blocks.
Group loads as early in the shader as possible to help cover latency.
15. DX12 Performance – Fast Clear
Specify the optional D3D12_CLEAR_VALUE when calling CreateCommittedResource
Intel hardware has a fast clear path for 1 bit per pixel clear values, e.g. (1,0,1,0)
When clearing, use the color specified up front for maximum performance.
~9% performance gain on Elemental Demo on DX12!
In the engine today
16. DX12 Performance – Root Signature
Blueprint of resources available
Root constants
Root descriptors
Descriptor tables
Constants that sit directly in the root are copied to each invocation of the shader (pushed) rather than read from memory when used (pulled)
Can significantly speed up shader execution
Automatically handled by driver in DX11
17. VR Tips and Tricks
Simple techniques to take advantage of an under-utilized resource, the CPU!
Easily adds realism to your VR scenes without much incremental GPU work.
Min spec defined for high end VR.
Effects can be scaled up easily through Blueprints.
18. VR Tips and Tricks - Destruction
Simulates dynamic fracturing of meshes into smaller pieces.
Typical destruction workloads consist of a few seconds of a lot of simulation time followed by a return to the baseline.
Better CPUs can keep pieces around longer and fracture more for more realism.
19. VR Tips and Tricks - Cloth
Dynamic mesh simulation that responds to the player, wind or other environmental factors.
Typical cloth workloads include player capes or flags. Simulated every frame.
Easy to scale - More cloth systems means more CPU usage
20. Android x86/x64 Support
Native apps reduce CPU load, startup times and power consumption
Supported in UE4 today through editor menu
Requires source build
Package as fat or separated APKs
OpenGL ES 3.1 + AEP for best quality
ASTC textures
Deferred renderer
Supported on latest Intel tablets
21. Fast ASTC compression
Next gen format (OpenGL ES, Vulkan)
Very good compression on RGB/RGBA for variety of block sizes
UE4 now has support for Intel’s fast texture compressor for ASTC
44x speed improvement
Quality comparable to ARM compressor
UE4 uses Intel’s BC6H/BC7 compressors already
Released with 4.13
22. ASTC Quality Comparison
Zoomed in portion of a 2048x2048 normal map
Original: 12 MB; ETC1: 2 MB; ASTC 6x6: 1.8 MB
23. What’s Next?
Intel Compiler Support - 4.14
VTune Amplifier Support – Event based CPU sampling using itt_notify framework. Gives deep insight into what the engine is doing at all times. Future release.
VR Sample showing off techniques to take advantage of extra CPU cycles.
24. Wrap up
Intel and Epic have worked together on key technologies that enable developers to make their best games.
Take advantage of scaling features in UE4 – Epic has done a lot of work to support lower end hardware.
Test on Intel hardware early. UE4 is powerful but it can easily bring down a high end system. With proper optimization, UE4 games run really well on Intel hardware.
We’re going to give specific advice and pointers on how to apply these learnings both to UE4 and to other engines. A lot of what we’re talking about is common between engines and hardware.
Vulkan – Beta driver released for Windows. Open source driver available for Linux.
Kaby Lake launched for mobile. Desktop coming soon.
Intel GPA is a tool that helps developers identify where their apps are slow on Intel graphics. Contains both a live mode and a frame debugger. These help narrow down whether you’re bottlenecked in shadows, geometry, post processing etc. ToggleDrawEvents is a console command in UE4 that turns on annotations to help identify where in the scene you are.
Close-ups of a portion of the GPA window.
ProfileGPU is a UE4 command that gives all kinds of good info about what the engine is doing graphics wise.
Stat commands give useful information about what each component system of UE4 is doing. For example, you can see how many draw calls are done with stat d3d11rhi.
Windows Performance Analyzer (WPA) is a Microsoft tool that creates graphs and data tables generated from event trace log (ETL) files for analysis. It can be used to measure the whole system including CPU, GPU, memory and IO while UE4 is running. Useful for noticing trends over time and finding system bottlenecks.
The Intel Extreme Tuning Utility gives you real time data about what your system is doing with regards to frequency, temperature and power. With it you can determine how the different parts of the CPU and GPU interact with each other and in which situations you become power limited. Especially useful for games on laptops. We’ll talk about it a bit later.
Memory bandwidth – Using RAM as graphics memory rather than dedicated GDDR. Shared with CPU.
Gbuffers – Gbuffers are memory hungry. For example, the actual pixel formats can be anything from 32 to 128 bit. We’ve seen cases where the game was 64-bit by default and no one bothered to change it. Changing it led to a huge gain in perf. Half float is good for mobile; most cases won’t need full 32-bit precision. UE4 is configurable: you can change the format and eliminate or even combine channels. Scaling the resolution of the gbuffers with the ScreenPercentage console variable is good up to a point.
The pictures show what a few of the gbuffers look like for a particular frame. Each frame generates 5 full resolution gbuffers and 1 depth buffer. That’s a lot of bandwidth.
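To put a rough number on that bandwidth, here is a small Python sketch of the arithmetic. It is only an illustration: the 1920x1080 resolution, the assumption that all 5 gbuffers share one pixel format, and the 32-bit depth buffer are assumptions made here, not figures from the talk, and it counts a single write of each target per frame (reads during lighting come on top).

```python
def gbuffer_bytes_per_frame(width, height, bits_per_pixel,
                            num_targets=5, depth_bits=32):
    """Bytes written once per frame for the gbuffers plus one depth buffer.

    Assumes every gbuffer uses the same pixel format (an assumption for
    illustration; a real engine mixes formats).
    """
    pixels = width * height
    color_bytes = pixels * num_targets * bits_per_pixel // 8
    depth_bytes = pixels * depth_bits // 8
    return color_bytes + depth_bytes


def scaled_resolution(width, height, screen_percentage):
    """Resolution after applying a percentage scale (integer math)."""
    return (width * screen_percentage // 100,
            height * screen_percentage // 100)


# Hypothetical 1080p frame with 64-bit vs 32-bit gbuffer formats.
full_64 = gbuffer_bytes_per_frame(1920, 1080, 64)
full_32 = gbuffer_bytes_per_frame(1920, 1080, 32)

# Same frame rendered at 80% resolution with 32-bit formats.
w, h = scaled_resolution(1920, 1080, 80)
scaled_32 = gbuffer_bytes_per_frame(w, h, 32)

print(f"64-bit, 100%: {full_64 / 1e6:.1f} MB per frame")
print(f"32-bit, 100%: {full_32 / 1e6:.1f} MB per frame")
print(f"32-bit,  80%: {scaled_32 / 1e6:.1f} MB per frame")
```

Under these assumptions, halving the format size saves more than dropping the resolution to 80%, and the two can be combined.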
Meshes – Dense and even sub pixel meshes are something we see all the time early in development. Artists are let loose without a graphics engineer to rein them in. Tens of millions of triangles per frame with lots of sub pixel meshes look super pretty but will run like a slide show in reality. What happens at the hardware level is that with so many vertices, you can’t get the pixel shaders running in parallel, causing a bubble in the pipeline. Using level of detail on your meshes is a good way to solve this problem.
Culling – Don’t assume sending the whole scene to the rasterizer is efficient. You burn bandwidth and power sending down everything that gets trivially rejected. The worst cases are games like GTA5, where everything for the whole city is sent down, leaving the GPU to figure it out. It will do the job, but you won’t like how it performs. In these cases, the hardware clipper can become the bottleneck. Using frustum culling on bounding boxes and occlusion culling for hidden objects helps a lot. Intel also has a software occlusion culling sample available on the Intel Developer Zone that uses the CPU. Check it out. There’s a link at the end of this presentation for it.
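As a minimal sketch of frustum culling on bounding boxes, the Python below tests an axis-aligned box against a set of planes. This is a generic textbook test, not UE4's implementation; the function names, the plane convention (a point is inside when n·p + d >= 0) and the box-shaped stand-in for a view frustum are all assumptions chosen for illustration.

```python
def aabb_outside_plane(box_min, box_max, plane):
    """True if the whole box lies on the negative side of the plane.

    plane is (nx, ny, nz, d); a point p is inside when n.p + d >= 0.
    """
    nx, ny, nz, d = plane
    # Pick the box corner that lies farthest along the plane normal.
    px = box_max[0] if nx >= 0 else box_min[0]
    py = box_max[1] if ny >= 0 else box_min[1]
    pz = box_max[2] if nz >= 0 else box_min[2]
    # If even that corner is behind the plane, the whole box is.
    return nx * px + ny * py + nz * pz + d < 0


def is_visible(box_min, box_max, planes):
    """Conservative test: may keep a few boxes that are actually outside."""
    return not any(aabb_outside_plane(box_min, box_max, p) for p in planes)


# Hypothetical stand-in for a frustum: x, y in [-10, 10], z in [1, 100].
planes = [
    (1, 0, 0, 10), (-1, 0, 0, 10),    # left, right
    (0, 1, 0, 10), (0, -1, 0, 10),    # bottom, top
    (0, 0, 1, -1), (0, 0, -1, 100),   # near, far
]

print(is_visible((-1, -1, 5), (1, 1, 6), planes))   # fully inside
print(is_visible((20, 0, 5), (22, 1, 6), planes))   # fully outside
print(is_visible((9, 0, 5), (12, 1, 6), planes))    # straddles an edge
```

Objects that fail this test never need to be submitted, so they cost no bandwidth and never reach the hardware clipper.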
Power – In PC games this is often an overlooked aspect because you’re running full out all of the time. When you’ve got the CPU and GPU in one package like Intel does, you get into a situation where the more CPU you use the less available power the GPU has to work with. This can have a big effect on performance. If you’re running beyond 60fps, you’re burning power churning out frames that aren’t used. Cap frame rate at vsync to save battery.
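The frame cap idea can be sketched as a simple time budget. The helper below is hypothetical (the name and the 60 fps default are assumptions for illustration) and only computes how long a frame loop could idle; in practice the cap usually comes from vsync or an engine setting rather than hand-written sleeps.

```python
def idle_seconds_for_cap(frame_work_seconds, target_fps=60.0):
    """Time left in the frame budget after the frame's work is done.

    Returns 0.0 when the frame already took longer than the budget.
    """
    budget = 1.0 / target_fps
    return max(0.0, budget - frame_work_seconds)


# A frame that finishes in 10 ms could idle for the rest of a 60 fps budget
# instead of immediately starting a frame that will never be displayed.
print(idle_seconds_for_cap(0.010))
# A 20 ms frame is already over budget, so there is nothing to idle.
print(idle_seconds_for_cap(0.020))
```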
This is a picture of the Intel Extreme Tuning Utility. With it you can measure the effects of power on the system, which is especially useful for laptops and ultrabooks. For example we can see here that using a lot of CPU will cause the processor graphics frequency to not be smooth over time. This causes frame rate to drop because the CPU is taking some power away for its work, leaving less for graphics. Temperature is also an important metric to pay attention to, maybe even more so than power. If the CPU gets into a thermal throttling situation, performance will be drastically reduced. Intel has a power SDK available on the Intel Developer Zone for developers wanting to understand how power is affecting their game. For example it will tell you when you’re on AC or battery so you can adjust accordingly.
Demo to show this a bit later.
Shader local memory – Intel CPUs have 64-byte cache lines for both the CPU and GPU. Taking this into consideration will improve your hit rate a lot and improve performance on graphics workloads. For example, you could unroll a tight inner loop that loads 16 bytes per iteration 4 times and do the load once, as shown in the picture. In general you want to avoid small loads in tight loops because they aren’t cache friendly.
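The arithmetic behind that unrolling example, as a short Python sketch. The 4096-byte buffer is an assumption picked for illustration; the 64-byte cache line and 16-byte load come from the notes above.

```python
CACHE_LINE_BYTES = 64


def loads_per_cache_line(bytes_per_load):
    """How many loads of this size it takes to cover one cache line."""
    return CACHE_LINE_BYTES // bytes_per_load


def load_count(total_bytes, bytes_per_load):
    """Number of load operations needed to read a buffer (rounded up)."""
    return -(-total_bytes // bytes_per_load)


# Four 16-byte iterations cover exactly one 64-byte cache line, which is why
# unrolling by 4 lets the loop issue one full-line load instead.
print(loads_per_cache_line(16))

# Hypothetical 4096-byte buffer: small loads vs full-line loads.
print(load_count(4096, 16), load_count(4096, 64))
```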
r.ShaderPipelines creates shader permutations that remove attributes. In depth passes this is okay and doesn’t cause a permutation explosion because there aren’t many depth-only shaders. It’s also where most of the attribute usage problems come from. This is new in 4.13.
Will talk a bit more about the DX12 specifics later.
Temporary usage – Using lots of temporaries can starve the hardware because the banks are shared between a number of hardware threads. Not common but has been seen on Infiltrator workload. Number of temporaries can be inferred from the shader dump in Intel GPA.
Branching - Consider compiling different versions of your shaders to reduce branches where it makes sense. Use constant buffers as the source for conditionals because loads from the L3 cache are faster than from a texture lookup.
Sampling – In general, grouping your loads early in the shader is the best policy. The compiler does give you some help but in cases like conditionals, it can’t always do it. Pulling loads out of branches if your else conditionals are rare sometimes makes sense. Be sure to test!
The picture depicts what an internal execution unit looks like on Intel hardware. Each EU has 7 hardware threads. Each thread has 128 SIMD8 32-bit general purpose registers plus some architecture specific registers (the ARF in the picture). The hardware threads share the two FPUs, the branch unit and the send unit. Typical EU counts vary per processor SKU with the highest being 72.
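The figures in that description imply a register file size per EU, which can be worked out directly. This Python sketch uses only the numbers quoted above; the constant names are labels chosen here.

```python
THREADS_PER_EU = 7          # hardware threads per EU
REGISTERS_PER_THREAD = 128  # general purpose registers per thread
SIMD_WIDTH = 8              # lanes per register (SIMD8)
BYTES_PER_LANE = 4          # 32-bit lanes

# One SIMD8 register of 32-bit lanes.
register_bytes = SIMD_WIDTH * BYTES_PER_LANE

# General purpose register space per hardware thread and per EU.
grf_bytes_per_thread = REGISTERS_PER_THREAD * register_bytes
grf_bytes_per_eu = THREADS_PER_EU * grf_bytes_per_thread


def hardware_threads(eu_count):
    """Total hardware threads for a part with the given EU count."""
    return eu_count * THREADS_PER_EU


print(register_bytes, grf_bytes_per_thread, grf_bytes_per_eu)
print(hardware_threads(72))  # the largest EU count mentioned above
```

Each thread's register space is fixed, so a shader that needs many temporaries is limited by it, which ties back to the note on temporary usage.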
Root signatures are a new concept in DX12 and are a big area of performance regression between DX11 and DX12. Previously, the driver knew all of the state before a draw happened and could optimize based on it. A root signature is a blueprint of resources available and includes Root constants, Root descriptors, and Descriptor tables. They can be thought of as similar to a C call with function arguments. The values passed in are in registers rather than read from memory.
Constants that sit directly in root are copied to each invocation of the shader (pushed) rather than read from memory when used (pulled). This is where the problem begins. DX11 could determine which values needed to be pushed vs pulled. Now in DX12, this is up to the application to determine.
One of the first things a typical game does when being ported from DX11 to DX12 is to create one large root signature for all of the frame’s data. This is actually bad on Intel. Using a larger root signature wastes memory bandwidth because it’s copied to each invocation of the shader, whether or not it is used. This can be helped by using the root signature visibility flags keeping in mind that data that isn’t needed is wasted. In general, we recommend using a different root signature as your scene changes, for example one for the base pass and one for post processing. So more than one but less than one per draw. Using these suggestions can significantly speed up shader execution and is an active area of investigation in UE4.
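One way to reason about "one big root signature" versus several smaller ones is to count root signature size in DWORDs. The sketch below uses the costs documented for Direct3D 12 (1 DWORD per 32-bit root constant, 2 per root descriptor, 1 per descriptor table, 64 DWORDs maximum). The function name and the three example layouts are hypothetical and are not taken from UE4.

```python
ROOT_SIGNATURE_LIMIT_DWORDS = 64  # Direct3D 12 maximum root signature size


def root_signature_dwords(root_constants, root_descriptors, descriptor_tables):
    """Size of a root signature in DWORDs.

    Each 32-bit root constant costs 1 DWORD, each root descriptor 2 DWORDs,
    and each descriptor table 1 DWORD.
    """
    return root_constants * 1 + root_descriptors * 2 + descriptor_tables * 1


# Hypothetical layouts: one catch-all signature vs per-pass signatures.
catch_all = root_signature_dwords(32, 8, 4)
base_pass = root_signature_dwords(4, 2, 2)
post_process = root_signature_dwords(2, 0, 1)

for name, size in [("catch-all", catch_all),
                   ("base pass", base_pass),
                   ("post process", post_process)]:
    print(f"{name}: {size} of {ROOT_SIGNATURE_LIMIT_DWORDS} DWORDs")
```

In this made-up example the per-pass signatures carry a fraction of what the catch-all one does, which is the saving the advice above is aiming for.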
Min spec i5-4590, GTX 970 for Oculus and Vive.
This feature allows for dynamic fracturing of meshes into smaller pieces. It is part of the physics family and as such simulation is run on a separate thread. Typical destruction workloads consist of a few seconds of a lot of simulation time followed by a return to the baseline.
Good target for improved realism with proper content. It’s also easy to scale up by fracturing meshes more and removing fractured chunks after a longer length of time on a more powerful CPU. Since destruction is done through the physics engine on worker threads, the CPU won’t become the rendering bottleneck until quite a few systems are going at once.
Be sure to test this. Players can get into a situation where you have a Matrix hallway like scene with pieces flying off walls all over the place. This can cause a spike in CPU usage and end up causing dropped frames, which is bad.
This feature allows for dynamic simulation of meshes that respond to the player, wind or other environmental factors. It is another member of the physics family where simulation is run on a separate thread. Typical cloth workloads include player capes or flags. Cloth is simulated every frame, even if the player is not looking at it because the simulation results determine if it shows up in the player's view.
Good target for improved realism with proper content. For example, if your levels have flags, or your characters wear loose clothes, these would be good usages. Cloth simulation uses the CPU about the same amount from frame to frame assuming more systems aren’t added. It’s easily predictable and you can tune the amount you’re using to fit the available headroom.
In this way, more cloth means higher simulation times and greater CPU usage. Since physics is done on worker threads, unless you have so many systems that the CPU becomes the bottleneck, the number of systems and complexity can be scaled up as needed.
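Because cloth cost is roughly constant per system from frame to frame, fitting it to the available headroom is simple division. The helper below is a hypothetical sketch; the 3 ms headroom and 0.25 ms per-system cost are made-up numbers for illustration, and real costs need to be measured on the target CPU.

```python
def max_cloth_systems(headroom_ms, cost_per_system_ms):
    """How many cloth systems fit in the given per-frame CPU headroom."""
    if cost_per_system_ms <= 0:
        raise ValueError("cost_per_system_ms must be positive")
    return int(headroom_ms // cost_per_system_ms)


# Made-up numbers: 3 ms of spare worker-thread time per frame and
# 0.25 ms of simulation per cloth system.
print(max_cloth_systems(3.0, 0.25))
# A CPU with twice the headroom can run twice as many systems.
print(max_cloth_systems(6.0, 0.25))
```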
OGLES 3.1 runs with deferred renderer, not the mobile forward path
2048x2048 RGB Normal Map, with mips – 17 MB uncompressed
Top mip: Original 12 MB, ETC 2 MB, ASTC 1.8 MB
Mip-maps add ~43% size
The ASTC texture is RG channels only; the Z-component is derived in the shader
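The top-mip sizes above follow from the block layouts of the formats. A short Python sketch of the calculation, assuming 3 bytes per pixel for the uncompressed RGB original, 8 bytes per 4x4 block for ETC1, and 16 bytes per block for ASTC (the function names are chosen here for illustration):

```python
def uncompressed_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed mip level (RGB8 by default)."""
    return width * height * bytes_per_pixel


def block_compressed_bytes(width, height, block_w, block_h, bytes_per_block):
    """Size of one block-compressed mip level; partial blocks round up."""
    blocks_x = -(-width // block_w)
    blocks_y = -(-height // block_h)
    return blocks_x * blocks_y * bytes_per_block


def mip_chain_bytes(width, height, bytes_per_pixel=3):
    """Uncompressed size of the full mip chain down to 1x1."""
    total = 0
    while True:
        total += uncompressed_bytes(width, height, bytes_per_pixel)
        if width == 1 and height == 1:
            return total
        width, height = max(1, width // 2), max(1, height // 2)


MIB = 1024 * 1024
original = uncompressed_bytes(2048, 2048)
etc1 = block_compressed_bytes(2048, 2048, 4, 4, 8)
astc_6x6 = block_compressed_bytes(2048, 2048, 6, 6, 16)

print(f"Original: {original / MIB:.1f}")
print(f"ETC1:     {etc1 / MIB:.1f}")
print(f"ASTC 6x6: {astc_6x6 / MIB:.1f}")
print(f"With mips, uncompressed: {mip_chain_bytes(2048, 2048) / 1e6:.1f}")
```

Larger ASTC block sizes spend the same 16 bytes on more pixels, so the size falls as the block grows, at the cost of quality.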
Lots of scaling effort made by Epic for reduced quality effects. Take advantage of them. Check out the scalability settings in the engine.