LCU14-307: Advanced Toolchain Usage 
M.Collison, M. Kuvyrkov & W. Newton, LCU14 
LCU14 BURLINGAME
Part 1 
● GCC optimization levels 
● Using random compiler options 
● Toolchain defaults by vendor 
● How to select target flags 
● Feedback directed optimization 
● Link-time optimization 
Part 2 
● Inline assembly 
● Auto-vectorization 
● Minimizing global symbols 
● Section garbage collection 
● GNU symbol hash
GCC Optimization Levels 
● Optimization Level 0 
● Optimization Level 1 (-O1) 
● Optimization Level 2 (-O2) 
● Optimization Level 3 (-O3) 
● Code Size Optimization (-Os) 
● Optimize for debugging (-Og)
Optimization Level 0 
● -O0 disables optimization; it is equivalent to providing no optimization option at all 
● THIS IS THE DEFAULT
Optimization Level 1 (-O or -O1) 
● Enables basic optimizations that attempt to reduce code size and 
execution time 
● Debugging of generated code is minimally affected 
● Important optimizations enabled: 
● Dead code and store elimination on trees and RTL 
● Basic loop optimizations 
● Register allocation 
● If conversion 
● Convert conditional jumps into “branch-less equivalents” 
● Constant propagation 
● Eliminate redundant jumps to jumps
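A tiny illustration (our example, not from the slides) of two of these passes; comparing `gcc -S` output at -O0 and -O1 shows the dead store and the arithmetic disappear:

```c
#include <assert.h>

/* Illustrative only: at -O1, constant propagation folds x through the
 * arithmetic and dead-store elimination deletes the write to "unused",
 * so the function compiles down to returning the constant 42. */
int fold_me(void)
{
    int x = 6;
    int unused = x * 100;   /* dead store: never read again */
    (void)unused;
    return x * 7;           /* constant-propagated to 42 */
}
```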
Optimization Level 2 (-O2) 
● Enables all optimizations from the -O1 level 
● Adds more aggressive optimizations at the expense of debuggability 
● Important optimizations enabled: 
● Global CSE, constant and copy propagation 
● “Global” means within an entire function, not across function boundaries 
● Instruction scheduling to take advantage of processor pipeline 
● Inlining of small functions 
● Interprocedural constant propagation 
● Reorder basic blocks to improve cache locality 
● Partial redundancy elimination
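A sketch (ours, not the presenters') of what global CSE buys: the repeated subexpression below spans two basic blocks, so it is out of reach at -O1 but computed only once at -O2:

```c
#include <assert.h>

/* Illustrative only: at -O2, global CSE notices that a * b is computed
 * on both paths through the function and evaluates it once, before the
 * branch, even though the two uses sit in different basic blocks. */
int cse_demo(int a, int b, int flag)
{
    if (flag)
        return a * b + 1;   /* same subexpression ... */
    return a * b - 1;       /* ... reused here after CSE */
}
```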
Optimization Level 3 (-O3) 
● All optimizations enabled by -O2 
● Optimizes more aggressively to reduce execution time at the 
expense of code size 
● (Potentially) Inline any function 
● Loop vectorization to utilize SIMD instructions 
● Function cloning to make interprocedural constant propagation more powerful 
● Loop unswitching
Optimize for Code Size (-Os) 
● Enables all -O2 optimizations that do not increase code size 
● Disables the following -O2 optimizations: 
● Optimizations that align the start of functions, loops, branch targets and labels 
● Reordering of basic blocks
Optimize for Debugging (-Og) 
● Enables optimizations that do not interfere with debugging 
● Debugging (“-g”) must still be enabled 
● I use “-Og” and “-g” for the edit-compile-debug cycle
Recommendation for Optimization Options 
● Use -Og and -g for edit-compile-debug cycle 
● Use -O2 where both code size and execution speed are important 
● Use -O3 when execution speed is the primary requirement 
● Use -Os when code size is the primary requirement
But I’m experienced, I /know/ the good flags! 
● 3 years ago I spent 3 days finding the best combination of GCC 
flags for my project / board / benchmark 
● -O2 -funroll-loops -fno-schedule-insns --param <some>=<thing> 
● … 3 major versions of the compiler later … 
● Why does simple -Os outperform my custom-tuned options? 
● I thought loop unrolling made loops go faster. 
● I saw “-fno-schedule-insns” on the internet. 
● I hand-tuned --param <some>=<thing>
But I’m experienced, I /know/ the good flags! 
● Feature flags (these are OK) 
● -std=c++11 -- language standard 
● -fno-common -- language feature 
● -mcpu=cortex-a15 -mfpu=neon-vfpv4 -- target feature 
● Compatibility flags (not OK, please fix your code) 
● -fno-strict-aliasing 
● -fsigned-char 
● Optimization flags (not OK, please use -Og/-Os/-O2/-O3/-Ofast) 
● -f<optimization> 
● -fno-<optimization> 
● --param <some>=<thing>
So many defaults (AArch32) 
● Linaro (cross) 
● -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a9 
● Ubuntu (native) 
● -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8 
● Debian armhf (native) 
● -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8 
● Debian armel (native) 
● -marm -march=armv4t -mfloat-abi=soft -mtune=arm7tdmi 
● Fedora (native) 
● -marm -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8 
● CodeSourcery (cross) 
● -marm -march=armv5te -mfloat-abi=soft -mtune=arm1026ejs 
● other multilibs available
How to select target flags 
● -mcpu=CPU -mfpu=FPU 
● -mcpu=cortex-a15 -mfpu=neon-vfpv4 
● -mcpu=FOO is the same as -march=<FOO’s arch> -mtune=FOO 
● [-mcpu is preferred option] 
● Using ABI options requires a matching set of libraries (multilib) 
● There is always a default multilib for the default ABI options 
● Linaro toolchains have a single -- default -- multilib per toolchain 
● MEANING OF MULTILIB: set of libraries, not libraries for multiple ABIs. 
● For different ABI configurations use different Linaro toolchain 
packages (or build your own with cbuild2!)
Feedback Directed Optimization 
Feedback directed optimization provides information to the compiler 
which is then used for making optimization decisions. 
● Branch probabilities 
● Inlining 
● Hot/cold code reordering and partitioning (not on ARM) 
The information used is generated by profiling, which can be done by 
one of two methods. 
● gprof style code instrumentation 
● Statistical profiling with hardware counters
Using code instrumentation 
1. Build the code with appropriate options to add profiling 
instrumentation 
-fprofile-generate=dir, where dir is the output directory 
2. Run the application with a representative workload. 
3. Rebuild the code with profile generated by the run. 
-fprofile-use=dir, where dir is the same directory as before 
This results in two build types, the slower instrumented build and the 
final optimized build.
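To make the three steps concrete, here is a minimal sketch (file name, function names and workload are our invention). The branch below is heavily biased, which is exactly the kind of fact the profile run records:

```c
#include <assert.h>

/* hot.c -- toy candidate for feedback directed optimization.
 * Sketch of the three steps, with an arbitrary profile directory
 * (a main() driving run_workload() is omitted here):
 *   gcc -O2 -fprofile-generate=./prof -o hot hot.c   # 1. instrumented build
 *   ./hot                                            # 2. representative run
 *   gcc -O2 -fprofile-use=./prof -o hot hot.c        # 3. optimized rebuild
 * The profile shows the x == 0 branch is taken almost always, so the
 * rebuild lays that side out as the fall-through fast path. */
int mostly_zero(int x)
{
    if (x == 0)                     /* ~99% taken in the workload below */
        return 1;
    return x * x;
}

int run_workload(void)
{
    int sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += mostly_zero(i % 100 == 99 ? i : 0);  /* mostly x == 0 */
    return sum;
}
```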
Performance 
In this example I used the Opus 1.1 codec encoder test and gcc 
4.8.3 on x86_64. 
Build Type Run Time Relative Run Time 
Default 27.727s 100% 
Instrumented 34.008s 123% 
Optimized 24.301s 88%
Build Time 
Instrumenting and optimizing based on profiles also adds some 
overhead to compile times. 
Build Type Build Time Relative Build Time 
Default 42.410s 100% 
Instrumented 55.508s 131% 
Optimized 70.544s 166%
AutoFDO 
A new method of feedback directed optimization developed by 
Google. Uses perf to generate profiles using optimized binaries 
with debug information. 
https://github.com/google/autofdo 
1. Build a standard optimized build (with debug info). 
2. Run the application with perf record branch profiling. 
3. Convert profile with autofdo tool. 
4. Build with -fauto-profile. 
Only supported in Google’s gcc branch, not on master. Provides 
around 70-80% of the performance benefit of the instrumentation 
method, but the profiling overhead is only around 2%.
Link Time Optimization (LTO) 
● Allows optimizations that work on an entire file to work across the 
entire application 
● Works by saving the compiler IL in object files and using the IL to 
optimize at “link-time” 
● Enabled with “-flto” 
● -fuse-linker-plugin allows LTO to be applied to object files in libraries (assuming proper 
linker support) 
● Limitation: use the same command-line options when compiling all 
source files 
● gcc -O2 -flto -c a.c 
● gcc -O2 -flto -c b.c 
● gcc -o a.out a.o b.o -flto 
● LTO is production ready in gcc 4.9
Part 2 
● Inline assembly 
● Auto-vectorization 
● Minimizing global symbols 
● Section garbage collection 
● GNU symbol hash
Inline Assembly 
● Using instructions the compiler does not know about 
● Are you sure -- check latest built-ins / intrinsics! 
● Privileged instructions 
● Syscall / interrupt instructions 
● Basic Asm 
● asm (“INSN1”); 
● Limited use; all operands must already be in specific registers 
● See docs: https://gcc.gnu.org/onlinedocs/gcc/Basic-Asm.html 
● Extended Asm 
● asm (“TEMPLATE” : “OUTPUTS” : “INPUTS” : “CLOBBERS”); 
● See glibc or linux kernel for inspiration 
● See docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
Inline Assembly -- statements 
● “asm (:::);” is just another normal statement 
● GCC optimizes asm statements just like any other statements 
● Programmer is responsible for specifying ALL effects of asm 
● “asm volatile (:::);” 
● Number of executions, not presence in code, is guaranteed.
Inline Assembly -- variables 
Wrong 
int func(int arg) { 
    asm("insn r0");    // I know the ABI 
    return arg; 
} 
Correct 
int func(int _arg) { 
    register int arg asm("r0") = _arg; 
    asm("insn %0" : "+r" (arg)); 
    return arg; 
}
Auto-vectorization 
Vectorization performs multiple iterations of a loop (or repeated 
operation) using vector instructions that operate on multiple data 
items simultaneously. gcc is capable of identifying code that can 
be vectorized and applying this transformation. 
Compiler flags to enable this optimization: 
● -O3 
● -ftree-vectorize
Auto-vectorization Example 
A simple loop to vectorize: 
#define SIZE (1UL << 16) 
void test1(double *a, double *b) 
{ 
    for (size_t i = 0; i < SIZE; i++) 
        a[i] += b[i]; 
}
Auto-vectorization Example 
What code is generated by gcc -std=c99 -O2 -mfpu=neon? 
test1: 
movs r3, #0 
.L3: 
fldd d16, [r0] 
fldmiad r1!, {d17} 
faddd d16, d16, d17 
adds r3, r3, #1 
cmp r3, #65536 
fstmiad r0!, {d16} 
bne .L3 
bx lr
Auto-vectorization Example 
What code is generated by gcc -std=c99 -O3 -mfpu=neon? 
The code is unchanged. Why did we not see any vectorization? gcc 
provides -ftree-vectorizer-verbose to help. 
test.c:9: note: not vectorized: no vectype for stmt: _7 = *_6; 
scalar_type: double 
ARMv7 NEON does not support vectorizing double precision 
operations so gcc cannot vectorize the loop.
Auto-vectorization Example 
So how about we switch to float. Does it vectorize? 
No. What do we get from -ftree-vectorizer-verbose? 
test.c:8: note: not vectorized: relevant stmt not supported: _11 = _7 + 
_10; 
test.c:8: note: bad operation or unsupported loop bound. 
NEON does not support full IEEE 754, so gcc won’t use it.
Auto-vectorization Example 
If we know that our data does not contain any problematic values 
(denormals or non-default NaNs) and we can deal with the other 
restrictions (round to nearest, no traps) we can tell gcc NEON is 
OK with -funsafe-math-optimizations. 
Finally, we see vector instructions!
Auto-vectorization Example 
test1: 
    add r3, r1, #16 
    add r2, r0, #16 
    cmp r0, r3 
    it cc 
    cmpcc r1, r2 
    ite cs 
    movcs r3, #1 
    movcc r3, #0 
    bcc .L5 
    add r3, r0, #262144 
.L4: 
    vld1.32 {q9}, [r1]! 
    vld1.32 {q8}, [r0] 
    vadd.f32 q8, q9, q8 
    vst1.32 {q8}, [r0]! 
    cmp r0, r3 
    bne .L4 
    bx lr 
.L5: 
    flds s15, [r0] 
    fldmias r1!, {s14} 
    fadds s15, s14, s15 
    adds r3, r3, #1 
    cmp r3, #65536 
    fstmias r0!, {s15} 
    bne .L5 
    bx lr
Auto-vectorization Example 
That’s still quite a lot of code, how can we improve it? Use the 
restrict keyword to annotate that the two arrays do not alias 
(overlap). 
#define SIZE (1UL << 16) 
void test1(float * restrict a, float * restrict b) 
{ 
    for (size_t i = 0; i < SIZE; i++) 
        a[i] += b[i]; 
}
Auto-vectorization Example 
Well, that was unexpected! 
[~60-instruction listing elided: gcc emits a NEON-vectorized main loop 
(vld1.32 / vadd.f32 / vst1.64) plus a large scalar prologue and 
epilogue that handle pointer alignment and leftover iterations]
Auto-vectorization Example 
gcc is expending a lot of instructions making sure the pointers are 
aligned to an 8 byte boundary. Often this can be guaranteed by the 
allocator or data structure layout. 
void test1(float * restrict a_, float * restrict b_) 
{ 
    float *a = __builtin_assume_aligned(a_, 8); 
    float *b = __builtin_assume_aligned(b_, 8); 
    for (size_t i = 0; i < SIZE; i++) 
        a[i] += b[i]; 
}
Auto-vectorization Example 
Now we have something that looks fairly optimal. 
test1: 
add r3, r0, #262144 
.L3: 
vld1.64 {d16-d17}, [r0:64] 
vld1.64 {d18-d19}, [r1:64]! 
vadd.f32 q8, q8, q9 
vst1.64 {d16-d17}, [r0:64]! 
cmp r0, r3 
bne .L3 
bx lr
Auto-vectorization Tips 
● Use the right types 
● Understand the implications for mathematical operations 
● Use restrict annotations where possible 
● Use vector aligned pointers where possible and annotate them 
● Use countable loop conditions e.g. i < n 
● Don’t do control flow in the loop e.g. break, function calls 
● Experiment with -ftree-vectorizer-verbose
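Pulling the tips together into one loop (our sketch, names invented; 16-byte alignment is assumed to be guaranteed by the caller's allocator):

```c
#include <assert.h>
#include <stddef.h>

/* Combines the tips: a vectorizable element type, restrict-qualified
 * pointers (no aliasing), an alignment annotation, a countable trip
 * count, and no control flow or calls in the loop body. */
void scale_add(float *restrict dst_, const float *restrict src_, size_t n)
{
    float *dst = __builtin_assume_aligned(dst_, 16);
    const float *src = __builtin_assume_aligned(src_, 16);
    for (size_t i = 0; i < n; i++)      /* countable: i < n */
        dst[i] += 2.0f * src[i];
}
```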
Minimizing Global Symbols 
Reducing the number of global symbols in shared objects is 
beneficial for a number of reasons. 
● Reduced startup time 
● Faster function calls 
● Smaller disk and memory footprint 
There are a number of ways to achieve this goal: 
● Make as many functions as possible static 
● Use a version script to force symbols local 
● Use -fvisibility=hidden and symbol attributes 
● Use ld -Bsymbolic
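A sketch of the -fvisibility=hidden approach (file and function names invented):

```c
#include <assert.h>

/* demo.c -- compile with, e.g.:
 *   gcc -O2 -fPIC -fvisibility=hidden -shared -o libdemo.so demo.c
 * Every symbol then defaults to hidden; only the function explicitly
 * marked "default" appears in the .so's dynamic symbol table. */
static int helper(int x)            /* static: file-local regardless */
{
    return x * 2;
}

int internal_entry(int x)           /* hidden by -fvisibility=hidden */
{
    return helper(x) + 1;
}

__attribute__((visibility("default")))
int public_entry(int x)             /* the one exported symbol */
{
    return internal_entry(x) + 1;
}
```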
-Bsymbolic 
-Bsymbolic binds global references within a shared library to 
definitions within the shared library where possible, bypassing the 
PLT for functions. -Bsymbolic-functions behaves similarly but 
applies only to functions. 
This breaks symbol preemption and pointer comparison so cannot 
be applied without a certain amount of care. -Bsymbolic-functions 
is safer as comparison of function pointers is rarer than 
comparison of data pointers.
-Bsymbolic Example 
lib1.c: 
int func2(int a); 
int func1(int a) 
{ 
    return 1 + func2(a); 
} 
lib2.c: 
int func2(int a) 
{ 
    return a * 2; 
}
-Bsymbolic Example 
gcc -O2 -shared -o lib.so lib1.o lib2.o 
00000540 <func1>: 
540: b508 push {r3, lr} 
542: f7ff ef7e blx 440 <_init+0x38> 
546: 3001 adds r0, #1 
548: bd08 pop {r3, pc} 
54a: bf00 nop 
0000054c <func2>: 
54c: 0040 lsls r0, r0, #1 
54e: 4770 bx lr
-Bsymbolic Example 
DYNAMIC RELOCATION RECORDS 
OFFSET TYPE VALUE 
00008f14 R_ARM_RELATIVE *ABS* 
00008f18 R_ARM_RELATIVE *ABS* 
0000902c R_ARM_RELATIVE *ABS* 
00009018 R_ARM_GLOB_DAT __cxa_finalize 
0000901c R_ARM_GLOB_DAT _ITM_deregisterTMCloneTable 
00009020 R_ARM_GLOB_DAT __gmon_start__ 
00009024 R_ARM_GLOB_DAT _Jv_RegisterClasses 
00009028 R_ARM_GLOB_DAT _ITM_registerTMCloneTable 
0000900c R_ARM_JUMP_SLOT __cxa_finalize 
00009010 R_ARM_JUMP_SLOT __gmon_start__ 
00009014 R_ARM_JUMP_SLOT func2
-Bsymbolic Example 
gcc -O2 -shared -Wl,-Bsymbolic-functions -o liblib.so lib1.o lib2.o 
0000052c <func1>: 
52c: b508 push {r3, lr} 
52e: f000 f803 bl 538 <func2> 
532: 3001 adds r0, #1 
534: bd08 pop {r3, pc} 
536: bf00 nop 
00000538 <func2>: 
538: 0040 lsls r0, r0, #1 
53a: 4770 bx lr
-Bsymbolic Example 
DYNAMIC RELOCATION RECORDS 
OFFSET TYPE VALUE 
00008f14 R_ARM_RELATIVE *ABS* 
00008f18 R_ARM_RELATIVE *ABS* 
00009028 R_ARM_RELATIVE *ABS* 
00009014 R_ARM_GLOB_DAT __cxa_finalize 
00009018 R_ARM_GLOB_DAT _ITM_deregisterTMCloneTable 
0000901c R_ARM_GLOB_DAT __gmon_start__ 
00009020 R_ARM_GLOB_DAT _Jv_RegisterClasses 
00009024 R_ARM_GLOB_DAT _ITM_registerTMCloneTable 
0000900c R_ARM_JUMP_SLOT __cxa_finalize 
00009010 R_ARM_JUMP_SLOT __gmon_start__
Section Garbage Collection 
ld is capable of dropping any unused input sections from the final 
link. It does this by following references between sections starting 
from the entry point; unreferenced sections are then removed (or 
garbage collected). 
● Compile with -ffunction-sections and -fdata-sections 
● Link with --gc-sections 
● Only helps on projects that contain some redundancy
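A toy example (ours, names invented) of when garbage collection helps:

```c
#include <assert.h>

/* gc.c -- sketch of section garbage collection:
 *   gcc -O2 -ffunction-sections -fdata-sections -c gc.c
 *   gcc -Wl,--gc-sections -Wl,--print-gc-sections -o app gc.o
 * -ffunction-sections puts each function in its own .text.<name>
 * section; --gc-sections then discards .text.dead because nothing
 * reachable from the entry point refers to it. */
int dead(int x)      /* referenced by nothing: garbage collected */
{
    return x + 1;
}

int live(int x)      /* called from main elsewhere in the program: kept */
{
    return x - 1;
}
```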
GNU Symbol Hash 
Dynamic objects contain a hash to map symbol names to 
addresses. The GNU hash feature implemented in ld and glibc 
performs considerably better than the standard ELF hash. 
● Fast hash function with good collision avoidance 
● Bloom filters to quickly check for symbol in a hash 
● Symbols sorted for cache locality 
Creation of a GNU hash section can be enabled by passing 
--hash-style=gnu or --hash-style=both to ld. The Android dynamic 
linker does not currently support GNU hash sections!
More about Linaro Connect: connect.linaro.org 
Linaro members: www.linaro.org/members 
More about Linaro: www.linaro.org/about/

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions Presentation
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 

LCU14 307- Advanced Toolchain Usage (parts 1&2)

● Function cloning to make interprocedural constant propagation more powerful 
● Loop unswitching
Optimize for Code Size (-Os) 
● Enables all -O2 optimizations that do not increase code size 
● Disables the following -O2 optimizations: 
● Optimizations that align the start of functions, loops, branch targets and labels 
● Reordering of basic blocks
Optimize for Debugging (-Og) 
● Enables optimizations that do not interfere with debugging 
● Debugging (“-g”) must still be enabled 
● I use “-Og” and “-g” for the edit-compile-debug cycle
Recommendation for Optimization Options 
● Use -Og and -g for the edit-compile-debug cycle 
● Use -O2 where both code size and execution speed are important 
● Use -O3 when execution speed is the primary requirement 
● Use -Os when code size is the primary requirement
But I’m experienced, I /know/ the good flags! 
● 3 years ago I spent 3 days finding the best combination of GCC flags for my project / board / benchmark 
● -O2 -funroll-loops -fno-schedule-insns --param <some>=<thing> 
● … 3 major versions of the compiler later … 
● Why does simple -Os now outperform my custom-tuned options? 
● I thought loop unrolling made loops go faster. 
● I saw “-fno-schedule-insns” on the internet. 
● I hand-tuned --param <some>=<thing>
But I’m experienced, I /know/ the good flags! 
● Feature flags (these are OK) 
● -std=c++11 -- language standard 
● -fno-common -- language feature 
● -mcpu=cortex-a15 -mfpu=neon-vfpv4 -- target feature 
● Compatibility flags (not OK, please fix your code) 
● -fno-strict-aliasing 
● -fsigned-char 
● Optimization flags (not OK, please use -Og/-Os/-O2/-O3/-Ofast) 
● -f<optimization> 
● -fno-<optimization> 
● --param <some>=<thing>
So many defaults (AArch32) 
● Linaro (cross) 
● -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a9 
● Ubuntu (native) 
● -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8 
● Debian armhf (native) 
● -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8 
● Debian armel (native) 
● -marm -march=armv4t -mfloat-abi=soft -mtune=arm7tdmi 
● Fedora (native) 
● -marm -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8 
● CodeSourcery (cross) 
● -marm -march=armv5te -mfloat-abi=soft -mtune=arm1026ejs 
● other multilibs available
How to select target flags 
● -mcpu=CPU -mfpu=FPU 
● -mcpu=cortex-a15 -mfpu=neon-vfpv4 
● -mcpu=FOO is the same as -march=<FOO’s arch> -mtune=FOO 
● [-mcpu is the preferred option] 
● Using ABI options requires a matching set of libraries (multilib) 
● There is always a default multilib for the default ABI options 
● Linaro toolchains have a single -- default -- multilib per toolchain 
● MEANING OF MULTILIB: a set of libraries, not libraries for multiple ABIs. 
● For different ABI configurations use different Linaro toolchain packages (or build your own with cbuild2!)
Feedback Directed Optimization 
Feedback directed optimization provides information to the compiler which is then used for making optimization decisions. 
● Branch probabilities 
● Inlining 
● Hot/cold code reordering and partitioning (not on ARM) 
The information used is generated by profiling, which can be done by one of two methods. 
● gprof-style code instrumentation 
● Statistical profiling with hardware counters
Using code instrumentation 
1. Build the code with appropriate options to add profiling instrumentation: 
-fprofile-generate=dir, where dir is the output directory 
2. Run the application with a representative workload. 
3. Rebuild the code with the profile generated by the run: 
-fprofile-use=dir, where dir is the same directory as before 
This results in two build types: the slower instrumented build and the final optimized build.
Performance 
In this example I used the Opus 1.1 codec encoder test and gcc 4.8.3 on x86_64. 
Build Type      Run Time   Relative Run Time 
Default         27.727s    100% 
Instrumented    34.008s    123% 
Optimized       24.301s    88%
Build Time 
Instrumenting and optimizing based on profiles also adds some overhead to compile times. 
Build Type      Build Time   Relative Build Time 
Default         42.410s      100% 
Instrumented    55.508s      131% 
Optimized       70.544s      166%
AutoFDO 
A new method of feedback directed optimization developed by Google. Uses perf to generate profiles using optimized binaries with debug information. 
https://github.com/google/autofdo 
1. Build a standard optimized build (with debug info). 
2. Run the application with perf record branch profiling. 
3. Convert the profile with the autofdo tool. 
4. Build with -fauto-profile. 
Only supported in Google’s gcc branch, not on master. 
Provides around 70-80% of the performance benefit of the instrumentation method, but profiling overhead is only around 2%.
Link Time Optimization (LTO) 
● Allows optimizations that work on an entire file to work across the entire application 
● Works by saving the compiler IL in object files and using the IL to optimize at “link-time” 
● Enabled with “-flto” 
● -fuse-linker-plugin allows LTO to be applied to object files in libraries (assuming proper linker support) 
● Limitation: use the same command line options when compiling source files 
● gcc -O2 -flto -c a.c 
● gcc -O2 -flto -c b.c 
● gcc -o a.out a.o b.o -flto 
● LTO is production ready in gcc 4.9
Part 1 Part 2 
● GCC optimization levels 
● Using random compiler options 
● Toolchain defaults by vendor 
● How to select target flags 
● Feedback directed optimization 
● Link-time optimization 
● Inline assembly 
● Auto-vectorization 
● Minimizing global symbols 
● Section garbage collection 
● GNU symbol hash
Inline Assembly 
● Using instructions the compiler does not know about 
● Are you sure -- check latest built-ins / intrinsics! 
● Privileged instructions 
● Syscall / interrupt instructions 
● Basic Asm 
● asm (“INSN1”); 
● Limited use; all operands must already be in specific registers 
● See docs: https://gcc.gnu.org/onlinedocs/gcc/Basic-Asm.html 
● Extended Asm 
● asm (“TEMPLATE” : “OUTPUTS” : “INPUTS” : “CLOBBERS”); 
● See glibc or the linux kernel for inspiration 
● See docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
Inline Assembly -- statements 
● “asm (:::);” is just another normal statement 
● GCC optimizes asm statements just like any other statements 
● The programmer is responsible for specifying ALL effects of the asm 
● “asm volatile (:::);” 
● The number of executions, not presence in code, is guaranteed.
Inline Assembly -- variables 
Wrong: 
int func (int arg) 
{ 
  asm ("insn r0"); // I know ABI 
  return arg; 
} 
Correct: 
int func (int _arg) 
{ 
  int arg asm("r0") = _arg; 
  asm ("insn %0" : "+r" (arg)); 
  return arg; 
}
Auto-vectorization 
Vectorization performs multiple iterations of a loop (or repeated operation) using vector instructions that operate on multiple data items simultaneously. gcc is capable of identifying code that can be vectorized and applying this transformation. 
Compiler flags to enable this optimization: 
● -O3 
● -ftree-vectorize
Auto-vectorization Example 
A simple loop to vectorize: 
#define SIZE (1UL << 16) 
void test1(double *a, double *b) 
{ 
  for (size_t i = 0; i < SIZE; i++) 
    a[i] += b[i]; 
}
Auto-vectorization Example 
What code is generated by gcc -std=c99 -O2 -mfpu=neon? 
test1: 
        movs    r3, #0 
.L3: 
        fldd    d16, [r0] 
        fldmiad r1!, {d17} 
        faddd   d16, d16, d17 
        adds    r3, r3, #1 
        cmp     r3, #65536 
        fstmiad r0!, {d16} 
        bne     .L3 
        bx      lr
Auto-vectorization Example 
What code is generated by gcc -std=c99 -O3 -mfpu=neon? The code is unchanged. Why did we not see any vectorization? gcc provides -ftree-vectorizer-verbose to help. 
test.c:9: note: not vectorized: no vectype for stmt: _7 = *_6; scalar_type: double 
ARMv7 NEON does not support vectorizing double precision operations so gcc cannot vectorize the loop.
Auto-vectorization Example 
So how about we switch to float. Does it vectorize? No. What do we get from -ftree-vectorizer-verbose? 
test.c:8: note: not vectorized: relevant stmt not supported: _11 = _7 + _10; 
test.c:8: note: bad operation or unsupported loop bound. 
NEON does not support full IEEE 754, so gcc won’t use it.
Auto-vectorization Example 
If we know that our data does not contain any problematic values (denormals or non-default NaNs) and we can deal with the other restrictions (round to nearest, no traps), we can tell gcc NEON is OK with -funsafe-math-optimizations. Finally, we see vector instructions!
Auto-vectorization Example 
[Assembly listing: the loop body is now vectorized (vld1.32 / vadd.f32 / vst1.32), but it is accompanied by a scalar fallback loop and runtime checks on the two pointers.]
Auto-vectorization Example 
That’s still quite a lot of code, how can we improve it? Use the restrict keyword to annotate that the two arrays do not alias (overlap). 
#define SIZE (1UL << 16) 
void test1(float * restrict a, float * restrict b) 
{ 
  for (size_t i = 0; i < SIZE; i++) 
    a[i] += b[i]; 
}
Auto-vectorization Example 
Well, that was unexpected! 
[Large assembly listing: a vectorized inner loop surrounded by extensive scalar prologue and epilogue code that peels iterations until the pointers are aligned.]
Auto-vectorization Example 
gcc is expending a lot of instructions making sure the pointers are aligned to an 8 byte boundary. Often this can be guaranteed by the allocator or data structure layout. 
void test1(float * restrict a_, float * restrict b_) 
{ 
  float *a = __builtin_assume_aligned(a_, 8); 
  float *b = __builtin_assume_aligned(b_, 8); 
  for (size_t i = 0; i < SIZE; i++) 
    a[i] += b[i]; 
}
Auto-vectorization Example 
Now we have something that looks fairly optimal. 
test1: 
        add      r3, r0, #262144 
.L3: 
        vld1.64  {d16-d17}, [r0:64] 
        vld1.64  {d18-d19}, [r1:64]! 
        vadd.f32 q8, q8, q9 
        vst1.64  {d16-d17}, [r0:64]! 
        cmp      r0, r3 
        bne      .L3 
        bx       lr
Auto-vectorization Tips 
● Use the right types 
● Understand the implications for mathematical operations 
● Use restrict annotations where possible 
● Use vector aligned pointers where possible and annotate them 
● Use countable loop conditions e.g. i < n 
● Don’t do control flow in the loop e.g. break, function calls 
● Experiment with -ftree-vectorizer-verbose
Minimizing Global Symbols 
Reducing the number of global symbols in shared objects is beneficial for a number of reasons. 
● Reduced startup time 
● Faster function calls 
● Smaller disk and memory footprint 
There are a number of ways to achieve this goal: 
● Make as many functions as possible static 
● Use a version script to force symbols local 
● Use -fvisibility=hidden and symbol attributes 
● Use ld -Bsymbolic
-Bsymbolic 
-Bsymbolic binds global references within a shared library to definitions within the shared library where possible, bypassing the PLT for functions. -Bsymbolic-functions behaves similarly but applies only to functions. 
This breaks symbol preemption and pointer comparison, so it cannot be applied without a certain amount of care. -Bsymbolic-functions is safer as comparison of function pointers is rarer than comparison of data pointers.
-Bsymbolic Example 
lib1.c: 
int func1(int a) 
{ 
  return 1 + func2(a); 
} 
lib2.c: 
int func2(int a) 
{ 
  return a*2; 
}
-Bsymbolic Example 
gcc -O2 -shared -o lib.so lib1.o lib2.o 
00000540 <func1>: 
 540: b508       push {r3, lr} 
 542: f7ff ef7e  blx 440 <_init+0x38> 
 546: 3001       adds r0, #1 
 548: bd08       pop {r3, pc} 
 54a: bf00       nop 
0000054c <func2>: 
 54c: 0040       lsls r0, r0, #1 
 54e: 4770       bx lr
-Bsymbolic Example 
DYNAMIC RELOCATION RECORDS 
OFFSET   TYPE             VALUE 
00008f14 R_ARM_RELATIVE   *ABS* 
00008f18 R_ARM_RELATIVE   *ABS* 
0000902c R_ARM_RELATIVE   *ABS* 
00009018 R_ARM_GLOB_DAT   __cxa_finalize 
0000901c R_ARM_GLOB_DAT   _ITM_deregisterTMCloneTable 
00009020 R_ARM_GLOB_DAT   __gmon_start__ 
00009024 R_ARM_GLOB_DAT   _Jv_RegisterClasses 
00009028 R_ARM_GLOB_DAT   _ITM_registerTMCloneTable 
0000900c R_ARM_JUMP_SLOT  __cxa_finalize 
00009010 R_ARM_JUMP_SLOT  __gmon_start__ 
00009014 R_ARM_JUMP_SLOT  func2
-Bsymbolic Example 
gcc -O2 -shared -Wl,-Bsymbolic-functions -o liblib.so lib1.o lib2.o 
0000052c <func1>: 
 52c: b508       push {r3, lr} 
 52e: f000 f803  bl 538 <func2> 
 532: 3001       adds r0, #1 
 534: bd08       pop {r3, pc} 
 536: bf00       nop 
00000538 <func2>: 
 538: 0040       lsls r0, r0, #1 
 53a: 4770       bx lr
-Bsymbolic Example 
DYNAMIC RELOCATION RECORDS 
OFFSET   TYPE             VALUE 
00008f14 R_ARM_RELATIVE   *ABS* 
00008f18 R_ARM_RELATIVE   *ABS* 
00009028 R_ARM_RELATIVE   *ABS* 
00009014 R_ARM_GLOB_DAT   __cxa_finalize 
00009018 R_ARM_GLOB_DAT   _ITM_deregisterTMCloneTable 
0000901c R_ARM_GLOB_DAT   __gmon_start__ 
00009020 R_ARM_GLOB_DAT   _Jv_RegisterClasses 
00009024 R_ARM_GLOB_DAT   _ITM_registerTMCloneTable 
0000900c R_ARM_JUMP_SLOT  __cxa_finalize 
00009010 R_ARM_JUMP_SLOT  __gmon_start__
Section Garbage Collection 
ld is capable of dropping any unused input sections from the final link. It does this by following references between sections from an entry point; un-referenced sections are removed (or garbage collected). 
● Compile with -ffunction-sections and -fdata-sections 
● Link with --gc-sections 
● Only helps on projects that contain some redundancy
GNU Symbol Hash 
Dynamic objects contain a hash to map symbol names to addresses. The GNU hash feature implemented in ld and glibc performs considerably better than the standard ELF hash. 
● Fast hash function with good collision avoidance 
● Bloom filters to quickly check for a symbol in a hash 
● Symbols sorted for cache locality 
Creation of a GNU hash section can be enabled by passing --hash-style=gnu or --hash-style=both to ld. 
The Android dynamic linker does not currently support GNU hash sections!
More about Linaro Connect: connect.linaro.org 
Linaro members: www.linaro.org/members 
More about Linaro: www.linaro.org/about/