LCU14-307: Advanced Toolchain Usage (Parts 1 & 2)
---------------------------------------------------
Speaker: M. Collison, M. Kuvyrkov & W. Newton
Date: September 17, 2014
---------------------------------------------------
★ Session Summary ★
This set of sessions goes into detail on many toolchain topics to help attendees get the most out of their toolchain. Topics covered include: inline assembly; Link Time Optimization (LTO); Feedback Directed Optimization (FDO); proper code annotation for promoting vectorization, avoiding false sharing, and memory aliasing (restrict keyword usage); optimization levels and what they mean; demystifying -march, -mfpu, -mcpu, -mtune, --with-mode; linking options; libatomic usage; and debugging binaries compiled with optimizations.
---------------------------------------------------
★ Resources ★
Zerista: http://lcu14.zerista.com/event/member/137754
Google Event: https://plus.google.com/u/0/events/csb19bbpqh43888ghud0si92p40
Video: https://www.youtube.com/watch?v=E0troMIh1Go&list=UUIVqQKxCyQLJS6xvSmfndLA
Etherpad: http://pad.linaro.org/p/lcu14-307
---------------------------------------------------
★ Event Details ★
Linaro Connect USA - #LCU14
September 15-19th, 2014
Hyatt Regency San Francisco Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
2. Part 1
● GCC optimization levels
● Using random compiler options
● Toolchain defaults by vendor
● How to select target flags
● Feedback directed optimization
● Link-time optimization
Part 2
● Inline assembly
● Auto-vectorization
● Minimizing global symbols
● Section garbage collection
● GNU symbol hash
4. Optimization Level 0
● -O0 performs no optimization; it is equivalent to providing no
optimization option at all
● THIS IS THE DEFAULT
5. Optimization Level 1 (-O or -O1)
● Enables basic optimizations that attempt to reduce code size and
execution time
● Debugging of generated code is minimally affected
● Important optimizations enabled:
● Dead code and store elimination on trees and RTL
● Basic loop optimizations
● Register allocation
● If conversion
● Convert conditional jumps into “branch-less equivalents”
● Constant propagation
● Eliminate redundant jumps to jumps
6. Optimization Level 2 (-O2)
● Enables all optimizations from the -O1 level
● Adds more aggressive optimizations at expense of debuggability
● Important optimizations enabled:
● Global CSE, constant and copy propagation
● Global implies within an entire function not across function boundaries
● Instruction scheduling to take advantage of processor pipeline
● Inlining of small functions
● Interprocedural constant propagation
● Reorder basic blocks to improve cache locality
● Partial redundancy elimination
7. Optimization Level 3 (-O3)
● All optimizations enabled by -O2
● Optimizes more aggressively to reduce execution time at the
expense of code size
● (Potentially) Inline any function
● Loop vectorization to utilize SIMD instructions
● Function cloning to make interprocedural constant propagation more powerful
● Loop unswitching
8. Optimize for Code Size (-Os)
● Enables all -O2 optimizations that do not increase code size
● Disables the following -O2 optimizations:
● Optimizations that align the start of functions, loops, branch targets and labels
● Reordering of basic blocks
9. Optimize for Debugging (-Og)
● Enables optimizations that do not interfere with debugging
● Debugging (“-g”) must still be enabled
● I use “-Og” and “-g” for the edit-compile-debug cycle
10. Recommendation for Optimization Options
● Use -Og and -g for edit-compile-debug cycle
● Use -O2 where both code size and execution speed are important
● Use -O3 when execution speed is the primary requirement
● Use -Os when code size is the primary requirement
11. But I’m experienced, I /know/ the good flags!
● 3 years ago I spent 3 days finding the best combination of GCC
flags for my project / board / benchmark
● -O2 -funroll-loops -fno-schedule-insns --param <some>=<thing>
● … 3 major versions of the compiler later …
● Why does simple -Os outperform my custom-tuned options?
● I thought loop unrolling makes loops go faster.
● I saw “-fno-schedule-insns” on the internet.
● I hand-tuned --param <some>=<thing>
12. But I’m experienced, I /know/ the good flags!
● Feature flags (these are OK)
● -std=c++11 -- language standard
● -fno-common -- language feature
● -mcpu=cortex-a15 -mfpu=neon-vfpv4 -- target feature
● Compatibility flags (not OK, please fix your code)
● -fno-strict-aliasing
● -fsigned-char
● Optimization flags (not OK, please use -Og/-Os/-O2/-O3/-Ofast)
● -f<optimization>
● -fno-<optimization>
● --param <some>=<thing>
14. How to select target flags
● -mcpu=CPU -mfpu=FPU
● -mcpu=cortex-a15 -mfpu=neon-vfpv4
● -mcpu=FOO is the same as -march=<FOO’s arch> -mtune=FOO
● [-mcpu is preferred option]
● Using ABI options requires a matching set of libraries (multilib)
● There is always a default multilib for the default ABI options
● Linaro toolchains have a single -- default -- multilib per toolchain
● MEANING OF MULTILIB: a set of libraries, not libraries for multiple ABIs.
● For different ABI configurations use different Linaro toolchain
packages (or build your own with cbuild2!)
15. Feedback Directed Optimization
Feedback directed optimization provides information to the compiler
which is then used for making optimization decisions.
● Branch probabilities
● Inlining
● Hot/cold code reordering and partitioning (not on ARM)
The information used is generated by profiling, which can be done by
one of two methods.
● gprof style code instrumentation
● Statistical profiling with hardware counters
16. Using code instrumentation
1. Build the code with appropriate options to add profiling
instrumentation
-fprofile-generate=dir, where dir is the output directory
2. Run the application with a representative workload.
3. Rebuild the code with profile generated by the run.
-fprofile-use=dir, where dir is the same directory as before
This results in two build types, the slower instrumented build and the
final optimized build.
17. Performance
In this example I used the Opus 1.1 codec encoder test and gcc
4.8.3 on x86_64.
Build Type Run Time Relative Run Time
Default 27.727s 100%
Instrumented 34.008s 123%
Optimized 24.301s 88%
18. Build Time
Instrumenting and optimizing based on profiles also adds some
overhead to compile times.
Build Type Build Time Relative Build Time
Default 42.410s 100%
Instrumented 55.508s 131%
Optimized 70.544s 166%
19. AutoFDO
A new method of feedback directed optimization developed by
Google. Uses perf to generate profiles using optimized binaries
with debug information.
https://github.com/google/autofdo
1. Build a standard optimized build (with debug info).
2. Run the application with perf record branch profiling.
3. Convert profile with autofdo tool.
4. Build with -fauto-profile.
Only supported in Google’s gcc branch, not on master. Provides
around 70-80% of the performance benefits of the instrumentation
method but profiling overhead is only around 2%.
20. Link Time Optimization (LTO)
● Allows optimizations that work on an entire file to work across the
entire application
● Works by saving the compiler IL in object files and using the IL to
optimize at “link-time”
● Enabled with "-flto"
● -fuse-linker-plugin allows LTO to be applied to object files in libraries (assuming proper
linker support)
● Limitation: Use same command line options when compiling
source files
● gcc -O2 -flto -c a.c
● gcc -O2 -flto -c b.c
● gcc -o a.out a.o b.o -flto
● LTO is production ready in gcc 4.9
21. Part 1
● GCC optimization levels
● Using random compiler options
● Toolchain defaults by vendor
● How to select target flags
● Feedback directed optimization
● Link-time optimization
Part 2
● Inline assembly
● Auto-vectorization
● Minimizing global symbols
● Section garbage collection
● GNU symbol hash
22. Inline Assembly
● Using instructions compiler does not know about
● Are you sure -- check latest built-ins / intrinsics!
● Privileged instructions
● Syscall / interrupt instructions
● Basic Asm
● asm ("INSN1");
● Limited use; all operands must already be in specific registers
● See docs: https://gcc.gnu.org/onlinedocs/gcc/Basic-Asm.html
● Extended Asm
● asm ("TEMPLATE" : "OUTPUTS" : "INPUTS" : "CLOBBERS");
● See glibc or linux kernel for inspiration
● See docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
23. Inline Assembly -- statements
● "asm (:::);" is just another normal statement
● GCC optimizes asm statements just like any other statement
● The programmer is responsible for specifying ALL effects of the asm
● "asm volatile (:::);"
● The number of executions, not presence in code, is guaranteed.
24. Inline Assembly -- variables
Wrong
int func (int arg) {
asm ("insn r0"); // I know the ABI
return arg;
}
Correct
int func (int _arg) {
register int arg asm ("r0") = _arg;
asm ("insn %0" : "+r" (arg));
return arg;
}
25. Auto-vectorization
Vectorization performs multiple iterations of a loop (or repeated
operation) using vector instructions that operate on multiple data
items simultaneously. gcc is capable of identifying code that can
be vectorized and applying this transformation.
Compiler flags to enable this optimization:
● -O3
● -ftree-vectorize
26. Auto-vectorization Example
A simple loop to vectorize:
#define SIZE (1UL << 16)
void test1(double *a, double *b)
{
for (size_t i = 0; i < SIZE; i++)
a[i] += b[i];
}
27. Auto-vectorization Example
What code is generated by gcc -std=c99 -O2 -mfpu=neon?
test1:
movs r3, #0
.L3:
fldd d16, [r0]
fldmiad r1!, {d17}
faddd d16, d16, d17
adds r3, r3, #1
cmp r3, #65536
fstmiad r0!, {d16}
bne .L3
bx lr
28. Auto-vectorization Example
What code is generated by gcc -std=c99 -O3 -mfpu=neon?
The code is unchanged. Why did we not see any vectorization? gcc
provides -ftree-vectorizer-verbose to help.
test.c:9: note: not vectorized: no vectype for stmt: _7 = *_6;
scalar_type: double
ARMv7 NEON does not support vectorizing double precision
operations so gcc cannot vectorize the loop.
29. Auto-vectorization Example
So how about we switch to float. Does it vectorize?
No. What do we get from -ftree-vectorizer-verbose?
test.c:8: note: not vectorized: relevant stmt not supported: _11 = _7 +
_10;
test.c:8: note: bad operation or unsupported loop bound.
NEON does not support full IEEE 754, so gcc won’t use it.
30. Auto-vectorization Example
If we know that our data does not contain any problematic values
(denormals or non-default NaNs) and we can deal with the other
restrictions (round to nearest, no traps) we can tell gcc NEON is
OK with -funsafe-math-optimizations.
Finally, we see vector instructions!
32. Auto-vectorization Example
That’s still quite a lot of code, how can we improve it? Use the
restrict keyword to annotate that the two arrays do not alias
(overlap).
#define SIZE (1UL << 16)
void test1(float * restrict a, float * restrict b)
{
for (size_t i = 0; i < SIZE; i++)
a[i] += b[i];
}
34. Auto-vectorization Example
gcc is expending a lot of instructions making sure the pointers are
aligned to an 8-byte boundary. Often this can be guaranteed by the
allocator or data structure layout.
void test1(float * restrict a_, float * restrict b_)
{
float *a = __builtin_assume_aligned(a_, 8);
float *b = __builtin_assume_aligned(b_, 8);
for (size_t i = 0; i < SIZE; i++)
a[i] += b[i];
}
35. Auto-vectorization Example
Now we have something that looks fairly optimal.
test1:
add r3, r0, #262144
.L3:
vld1.64 {d16-d17}, [r0:64]
vld1.64 {d18-d19}, [r1:64]!
vadd.f32 q8, q8, q9
vst1.64 {d16-d17}, [r0:64]!
cmp r0, r3
bne .L3
bx lr
36. Auto-vectorization Tips
● Use the right types
● Understand the implications for mathematical operations
● Use restrict annotations where possible
● Use vector aligned pointers where possible and annotate them
● Use countable loop conditions e.g. i < n
● Don’t do control flow in the loop e.g. break, function calls
● Experiment with -ftree-vectorizer-verbose
37. Minimizing Global Symbols
Reducing the number of global symbols in shared objects is
beneficial for a number of reasons.
● Reduced startup time
● Faster function calls
● Smaller disk and memory footprint
There are a number of ways to achieve this goal:
● Make as many functions as possible static
● Use a version script to force symbols local
● Use -fvisibility=hidden and symbol attributes
● Use ld -Bsymbolic
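The version-script approach above can be sketched like this (library, file, and symbol names are made up for illustration):

```shell
cat > libdemo.c <<'EOF'
int helper(int x) { return x + 1; }         /* meant to be internal   */
int api_entry(int x) { return helper(x); }  /* the public entry point */
EOF
# The version script keeps api_entry global and forces everything
# else local.
cat > demo.map <<'EOF'
{
  global: api_entry;
  local: *;
};
EOF
gcc -O2 -fPIC -shared -o libdemo.so libdemo.c -Wl,--version-script=demo.map
# 'helper' no longer appears in the dynamic symbol table:
nm -D libdemo.so
```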
38. -Bsymbolic
-Bsymbolic binds global references within a shared library to
definitions within the shared library where possible, bypassing the
PLT for functions. -Bsymbolic-functions behaves similarly but
applies only to functions.
This breaks symbol preemption and pointer comparison so cannot
be applied without a certain amount of care. -Bsymbolic-functions
is safer as comparison of function pointers is rarer than
comparison of data pointers.
39. -Bsymbolic Example
lib1.c:
int func1(int a)
{
return 1 + func2(a);
}
lib2.c:
int func2(int a)
{
return a*2;
}
43. -Bsymbolic Example
DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
00008f14 R_ARM_RELATIVE *ABS*
00008f18 R_ARM_RELATIVE *ABS*
00009028 R_ARM_RELATIVE *ABS*
00009014 R_ARM_GLOB_DAT __cxa_finalize
00009018 R_ARM_GLOB_DAT _ITM_deregisterTMCloneTable
0000901c R_ARM_GLOB_DAT __gmon_start__
00009020 R_ARM_GLOB_DAT _Jv_RegisterClasses
00009024 R_ARM_GLOB_DAT _ITM_registerTMCloneTable
0000900c R_ARM_JUMP_SLOT __cxa_finalize
00009010 R_ARM_JUMP_SLOT __gmon_start__
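The lib1.c/lib2.c example can be reproduced on any host like this (a forward declaration of func2 is added so it builds cleanly under newer C defaults; the library names are made up). With -Bsymbolic-functions, the call from func1 to func2 is bound at link time, so func2 loses its jump-slot relocation:

```shell
cat > lib1.c <<'EOF'
int func2(int a);
int func1(int a)
{
    return 1 + func2(a);
}
EOF
cat > lib2.c <<'EOF'
int func2(int a)
{
    return a*2;
}
EOF
gcc -O2 -fPIC -shared -o libplain.so lib1.c lib2.c
gcc -O2 -fPIC -shared -o libsym.so lib1.c lib2.c -Wl,-Bsymbolic-functions
# Compare the dynamic relocation records of the two builds:
objdump -R libplain.so
objdump -R libsym.so
```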
44. Section Garbage Collection
ld is capable of dropping any unused input sections from the final
link. It does this by following references between sections from an
entry point; unreferenced sections are removed (or garbage
collected).
● Compile with -ffunction-sections and -fdata-sections
● Link with --gc-sections
● Only helps on projects that contain some redundancy
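The two-step recipe above, with a stand-in source file (gc.c is made up):

```shell
cat > gc.c <<'EOF'
#include <stdio.h>
void used(void)   { puts("used"); }
void unused(void) { puts("unused"); }   /* never referenced */
int main(void) { used(); return 0; }
EOF
# Each function gets its own section...
gcc -O2 -ffunction-sections -fdata-sections -c gc.c
# ...and the linker drops the sections nothing references.
gcc -Wl,--gc-sections -o gc gc.o
nm gc   # note: the 'unused' symbol is gone from the final binary
```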
45. GNU Symbol Hash
Dynamic objects contain a hash to map symbol names to
addresses. The GNU hash feature implemented in ld and glibc
performs considerably better than the standard ELF hash.
● Fast hash function with good collision avoidance
● A Bloom filter to quickly check whether a symbol is in the hash
● Symbols sorted for cache locality
Creation of a GNU hash section can be enabled by passing
--hash-style=gnu or --hash-style=both to ld. The Android dynamic
linker does not currently support GNU hash sections!
46. More about Linaro Connect: connect.linaro.org
Linaro members: www.linaro.org/members
More about Linaro: www.linaro.org/about/