1
PRAGMATIC
OPTIMIZATION
IN MODERN PROGRAMMING
DEMYSTIFYING A COMPILER
Created by Marina (geek) Kolpakova for UNN / 2015-2016
2
COURSE TOPICS
Ordering optimization approaches
Demystifying a compiler
Mastering compiler optimizations
Modern computer architectures concepts
3
OUTLINE
Compilation trajectory
Intermediate language
Dealing with local variables
Link-time and whole-program optimization
Optimization levels
Compiler optimization taxonomies
Classic
Scope
Code pattern
How to get feedback from the optimizer?
Compiler optimization challenges
Summary
4
EXECUTABLE GENERATION PHASES
Pre-processing. Pre-process, but don't compile.
gcc -E test.cc
cl /E test.cc
Compilation. Compile, but don't assemble.
gcc -S test.cc
cl /FA test.cc
Assembling. Assemble, but don't link.
gcc -c test.cc
cl /c test.cc
Linking. Link object files to generate the executable.
gcc test.cc
cl test.cc
5 . 1
COMPILATION TRAJECTORY
Lexical Analysis
scans the source code as a stream of characters
converting it into lexemes (tokens).
Syntax Analysis
takes the tokens, produced by lexical analysis, as input
and generates a syntax tree. Source code grammar
(syntactical correctness) is checked here.
5 . 2
COMPILATION TRAJECTORY
Semantic Analysis
checks whether the constructed syntax tree follows the
language rules (including the type checking).
Intermediate Code Generation
builds a program representation for some abstract
machine. It is in between the high-level language and the
target machine language.
5 . 3
COMPILATION TRAJECTORY
Code Optimization
does optimization of the intermediate code (eg,
redundancy elimination).
Code Generation
takes an optimized representation of the intermediate
code and maps it to the target machine language.
6
FRONTEND AND BACKEND
Only a backend is required to support a new machine
Only a frontend is required to support a new language
Most optimizations resemble each other across all targets
and can be applied in between the frontend and backend
7 . 1
INTERMEDIATE LANGUAGE
Optimization techniques become much easier to conduct on
the level of intermediate code. Modern compilers usually use
2 levels of intermediate representation (IR).
7 . 2
INTERMEDIATE LANGUAGE
High Level IR
is close to the source and can be easily generated from
the source code. Some code optimizations are possible.
It is not very suitable for target machine optimization.
Low Level IR
is close to the target machine and used for machine-
dependent optimizations: register allocation, instruction
selection, peephole optimization.
7 . 3
INTERMEDIATE LANGUAGE
Language-specific, to be used for JIT compilation later:
Java bytecode, .NET CIL, NVIDIA PTX.
Language independent, like three-(four-)address code
(similar to a classic RISC ISA).
a = b + c * d + c * d;
Three-Address Code (TAC)
r1 = c * d;
r2 = b + r1;
r3 = r2 + r1;
a = r3
Here r1, r2, r3 are abstract registers.
7 . 4
THREE-ADDRESS CODE
Quadruples have four fields
Op arg1 arg2 result
* c d r1
+ b r1 r2
+ r2 r1 r3
= r3 a
Triples or indirect triples have three fields
Op arg1 arg2
* c d
+ b (0)
+ (1) (0)
= a (2)
7 . 5
INTERMEDIATE LANGUAGE
Provides a frontend-independent code representation.
GNU Compiler Collection: GENERIC and GIMPLE
-fdump-tree-all -fdump-tree-optimized
-fdump-tree-ssa -fdump-rtl-all
clang and other LLVM-based compilers: LLVM IL
-emit-llvm
Visual Studio cl.exe: CIL (C Intermediate Language)
7 . 6
INTERMEDIATE LANGUAGE
uint32_t gray2rgba_v1(uint8_t c)
{
return c + (c<<8) + (c<<16) + (c<<24);
}
$ clang -Os -S -emit-llvm test.c -o test.ll
$ cat test.ll
define i32 @gray2rgba_v1(i8 zeroext %c) #0 {
%1 = zext i8 %c to i32
%2 = mul i32 %1, 16843009
ret i32 %2
}
The corresponding x86-64 assembly:
gray2rgba_v1:
movzbl %dil, %eax
imull $16843009, %eax, %eax
ret
8 . 1
DEALING WITH LOCAL VARIABLES
The compiler doesn't care how many local variables are used in the code:
register allocation is done after the IR transformations.
for( ; j <= roi.width - 4; j += 4 )
{
uchar t0 = tab[src[j]];
uchar t1 = tab[src[j+ 1]];
dst[j] = t0;
dst[j+1] = t1;
t0 = tab[src[j+2]];
t1 = tab[src[j+3]];
dst[j+2] = t0;
dst[j+3] = t1;
}
8 . 2
DEALING WITH LOCAL VARIABLES
.lr.ph4: ; preds = %0, %.lr.ph4
%indvars.iv5 = phi i64 [ %indvars.iv.next6, %.lr.ph4 ], [ 0, %0 ]
%6 = getelementptr inbounds i8* %src, i64 %indvars.iv5
%7 = load i8* %6, align 1, !tbaa !1
%8 = zext i8 %7 to i64
%9 = getelementptr inbounds i8* %tab, i64 %8
%10 = load i8* %9, align 1, !tbaa !1
%11 = or i64 %indvars.iv5, 1
%12 = getelementptr inbounds i8* %src, i64 %11
%13 = load i8* %12, align 1, !tbaa !1
%14 = zext i8 %13 to i64
%15 = getelementptr inbounds i8* %tab, i64 %14
%16 = load i8* %15, align 1, !tbaa !1
%17 = getelementptr inbounds i8* %dst, i64 %indvars.iv5
store i8 %10, i8* %17, align 1, !tbaa !1
%18 = getelementptr inbounds i8* %dst, i64 %11
store i8 %16, i8* %18, align 1, !tbaa !1
%19 = or i64 %indvars.iv5, 2
// ...
%28 = zext i8 %27 to i64
%29 = getelementptr inbounds i8* %tab, i64 %28
%30 = load i8* %29, align 1, !tbaa !1
%31 = getelementptr inbounds i8* %dst, i64 %19
store i8 %24, i8* %31, align 1, !tbaa !1
%32 = getelementptr inbounds i8* %dst, i64 %25
store i8 %30, i8* %32, align 1, !tbaa !1
%indvars.iv.next6 = add nuw nsw i64 %indvars.iv5, 4
%33 = trunc i64 %indvars.iv.next6 to i32
%34 = icmp sgt i32 %33, %1
br i1 %34, label %..preheader_crit_edge, label %.lr.ph4
9 . 1
LINK-TIME OPTIMIZATION (LTO)
Perform inter-procedural optimizations during linking.
Most compilers support this feature:
clang (-flto)
gcc (-flto), starting with 4.9
cl.exe (/GL, /LTCG)
...
9 . 2
WHOPR: WHOLE PROGRAM OPTIMIZATION
1. Compile each source file separately, add extra information
to the object file
2. Analyze the information collected from all object files
3. Perform a second optimization phase to generate object files
4. Link the final binary
Eliminates even more redundant code
Compilation is better parallelized on multi-core systems
10 . 1
OPTIMIZATION LEVELS
-O0 (the default) No optimization
generates unoptimized code but has the fastest
compilation time.
-O1 Moderate optimization
optimizes reasonably well but does not degrade
compilation time significantly.
-O2 Full optimization
generates highly optimized code and has the slowest
compilation time.
10 . 2
OPTIMIZATION LEVELS
-O3 Aggressive optimization
employs more aggressive automatic inlining of
subprograms within a unit and attempts to vectorize.
-Os Optimizes with focus on program size
enables all -O2 optimizations that do not typically
increase code size. It also performs further
optimizations designed to reduce code size.
10 . 3
ENABLED OPTIMIZATIONS: GCC -O0
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O0 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fcommon -fdelete-null-pointer-checks
-fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fident -finline-atomics
-fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno
-fmerge-debug-strings -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic
-fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
-fsched-stalled-insns-dep -fshow-column -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls
-ftrapping-math -ftree-coalesce-vars -ftree-cselim -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
-ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables -fverbose-asm
-fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store
-mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2
-mtls-direct-seg-refs
10 . 4
ENABLED OPTIMIZATIONS: GCC -O1
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O1 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg
-fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fdefer-pop -fdelete-null-pointer-checks -fdwarf2-cfi-asm
-fearly-inlining -feliminate-unused-debug-types -fforward-propagate -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique
-fguess-branch-probability -fident -fif-conversion -fif-conversion2 -finline -finline-atomics -finline-functions-called-once -fipa-profile
-fipa-pure-const -fipa-reference -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts
-fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -fpeephole
-fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic
-fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep
-fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-volatile-bitfields
-fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop
-ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im
-ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
-ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387
-malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp
-mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
10 . 5
ENABLED OPTIMIZATIONS: GCC -O2
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O2 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves
-fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks
-fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations
-fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident
-fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions-called-once -finline-small-functions
-fipa-cp -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots
-fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants
-fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole
-fpeephole2 -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions
-frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
-fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2
-fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow
-fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce
-ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse
-ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
-ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-switch-conversion
-ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64
-m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc
-mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper
10 . 6
ENABLED OPTIMIZATIONS: GCC -O3
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O3 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves
-fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fdefer-pop
-fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types
-fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload -fgcse-lm -fgnu-runtime -fgnu-unique
-fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics
-finline-functions -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-profile -fipa-pure-const
-fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts
-fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
-foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays -free
-freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop
-fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
-fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap
-fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields
-fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch
-ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
-ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-loop-vectorize
-ftree-parallelize-loops= -ftree-partial-pre -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
-ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time
-funswitch-loops -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops
-mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80
-mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper
11
CLASSIC COMPILER OPTIMIZATION TAXONOMY
Machine independent
Applicable across a broad
range of machines
1. Eliminate redundant
computations, dead code
2. Reduce running time and
space
3. Decrease ratio of
overhead to real work
4. Specialize code on a
context
5. Enable other optimizations
Machine dependent
Capitalize on specific
machine properties
1. Manage or hide latency
2. Manage resources
(registers, stack)
3. Improve mapping from IR
to concrete machine
4. Use some exotic
instructions (e.g. VLDM)
12 . 1
SCOPE COMPILER OPTIMIZATION TAXONOMY
Interprocedural optimizations
consider the whole translation unit, involve analysis of
data flow and dependency graphs.
Intraprocedural optimizations
consider the whole procedure, involve analysis of
data flow and dependency graphs.
12 . 2
SCOPE COMPILER OPTIMIZATION TAXONOMY
Global optimizations
consider the innermost code block together with its context.
Loop optimizations belong to this scope.
Local optimizations
consider a single block, the analysis is limited to it.
Peephole optimizations
map one or more consecutive operators from the IR to
machine code.
12 . 3
INTERPROCEDURAL OPTIMIZATIONS (IPO)
Look at all routines in a translation unit in order to make
optimizations across routine boundaries, including but not
limited to inlining and cloning.
Also called Interprocedural Analysis (IPA).
Compiler can move, optimize, restructure and delete code
between procedures
and even different source files, if LTO is enabled
Inlining: replacing a subroutine call with a copy of
its body
Cloning: specializing a copy of a subroutine for a
particular call site
13
PATTERN COMPILER OPTIMIZATION TAXONOMY
Dependency chains (linear code)
Branches
Loop bodies
Single loop
Loop and branch
Multi-loop
Functional calls to subroutines
14 . 1
HOW TO GET OPTIMIZATION FEEDBACK?
Check the wall-time of your application
If a compiler has done its job well, you'll see performance
improvements
Dump an assembly of your code (or/and IL)
Ensure instruction and register scheduling
Check for extra operations and register spills
See compiler optimization report
All the compilers have some support for it
Some of them are able to generate very detailed reports
about loop unrolling, auto-vectorization, VLIW slots
scheduling, etc
14 . 2
COMMONLY CONSIDERED METRICS
Wall(-clock) time
is a human perception of the span of time from the
start to the completion of a task.
Power consumption
is the electrical energy which is consumed to complete a
task.
Processor time (or runtime)
is the total execution time during which a processor was
dedicated to a task (i.e. executes instructions of that
task).
14 . 3
DUMPING ASSEMBLY
Assembly is a must-have for checking the compiler,
but it is rarely used to write low-level code.
$ gcc code.c -S -o asm.s
Assembly writing is the least portable optimization
Inline assembly limits compiler optimizations
Assembly does not give overwhelming speedup nowadays
Sometimes it is needed to overcome compiler bugs and
optimization limitations
14 . 4
EXAMPLE: GCC FEEDBACK OPTIONS
Enables optimization information printing
-fopt-info
-fopt-info-<optimized/missed/note/all>
-fopt-info-all-<ipa/loop/inline/vec/optall>
-fopt-info=filename
Controls the amount of debugging output the scheduler
prints on targets that use instruction scheduling
-fopt-info -fsched-verbose=n
Controls the amount of output from auto-vectorizer
-ftree-vectorizer-verbose=n
14 . 5
EXAMPLES: GCC FEEDBACK OPTIONS
Outputs all optimization info to stderr.
gcc -O3 -fopt-info
Outputs missed optimization report from all the passes
to missed.txt
gcc -O3 -fopt-info-missed=missed.txt
Outputs information about missed optimizations as
well as optimized locations from all the inlining passes
to inline.txt.
gcc -O3 -fopt-info-inline-optimized-missed=inline.txt
14 . 6
GCC FEEDBACK EXAMPLE
./src/box.cc:193:9: note: loop vectorized
./src/box.cc:193:9: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:193:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:96:9: note: loop vectorized
./src/box.cc:96:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:51:9: note: loop vectorized
./src/box.cc:51:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:193:9: note: loop with 7 iterations completely unrolled
./src/box.cc:32:13: note: loop with 7 iterations completely unrolled
./src/box.cc:96:9: note: loop with 15 iterations completely unrolled
./src/box.cc:51:9: note: loop with 15 iterations completely unrolled
./src/box.cc:584:9: note: loop vectorized
./src/box.cc:584:9: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:584:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:482:9: note: loop vectorized
./src/box.cc:482:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:463:5: note: loop vectorized
./src/box.cc:463:5: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:463:5: note: loop peeled for vectorization to enhance alignment
15 . 1
POINTER ALIASING
void twiddle1(int *xp, int *yp)
{
*xp += *yp;
*xp += *yp;
}
void twiddle2(int *xp, int *yp)
{
*xp += 2* *yp;
}
ARE THEY ALWAYS EQUAL?
15 . 2
POINTER ALIASING
What if..
int main(int argc, char** argv)
{
int i = 5, j = 5;
twiddle1(&i, &i);
twiddle2(&j, &j);
printf("twiddle1 result is %d\n", i);
printf("twiddle2 result is %d\n", j);
}
twiddle1 result is 20 while twiddle2 result is 15
15 . 3
POINTER ALIASING
Aliasing refers to the situation where the same memory
location can be accessed by using different names.
void twiddle1(int *xp, int *yp)
{
*xp += *yp;
*xp += *yp;
}
void twiddle2(int *xp, int *yp)
{
*xp += 2* *yp;
}
15 . 4
STRICT ALIASING ASSUMPTION
Strict aliasing is an assumption, made by a C (or C++)
compiler, that dereferencing pointers to objects of different
types will never refer to the same memory location.
This assumption enables more aggressive optimization (gcc
makes it starting from -O2), but the programmer must
follow the strict aliasing rules to get correct code.
15 . 5
STRICT ALIASING ASSUMPTION
void check(int32_t *h, int64_t *k)
{
*h = 5; *k = 6;
printf("%d\n", *h);
}
int main(void)
{
int64_t k;
check((int32_t *)&k, &k);
}
$ gcc -O1 test.c -o test ; ./test results in 6
$ gcc -O2 test.c -o test ; ./test results in 5
15 . 6
POINTER ALIASING: MISSED OPPORTUNITIES
The compiler freely schedules arithmetic,
but often has to preserve the order of memory accesses
Compiler is limited in redundancy elimination
Compiler is limited in loop unrolling
Compiler is limited in auto-vectorization
16 . 1
FUNCTION CALLS
int callee();
int caller()
{
return callee() + callee();
}
int callee();
int caller()
{
return 2*callee();
}
ARE THEY EQUAL?
16 . 2
FUNCTION CALLS
int callee(int i);
int caller()
{
int s=0, i=0;
for ( ; i < N ; i++)
s += callee(i);
return s;
}
int callee(int i);
int caller()
{
int s0=0, s1=0, i=0;
for ( ; i < N ; i+=2)
{
s0+=callee(i);
s1+=callee(i+1);
}
return s0 + s1;
}
ARE THEY EQUAL?
16 . 3
PURE FUNCTIONS
Pure function is a function for which both of the following
statements are true:
1. The function always evaluates the same result having been
given the same argument value(s). The function result must
not depend on any hidden information or state that may
change while program execution proceeds or between
different executions of the program, as well as on any
external input from I/O devices.
2. Evaluation of the result does not cause any semantically
observable side effect or output, such as mutation of
mutable objects or output to I/O devices.
16 . 4
PURE FUNCTIONS
Pure functions are much easier to optimize. Expressing
ideas in code as pure functions simplifies the compiler's life.
Most functions from math.h are not pure (they set/clear
floating point flags and conditions, throw floating point
exceptions)
Use the constexpr keyword in C++11 to hint to the compiler
that a function could be evaluated at compile time
Use the static keyword to help the compiler see all the
usages of a function (and perform aggressive inlining, or
even deduce whether the function is pure or not)
Neither constexpr nor static guarantees that a
function is pure, but they give the compiler some hints.
16 . 5
FUNCTIONAL CALLS: MISSED OPPORTUNITIES
If the compiler fails to inline a function body:
it is limited in redundancy elimination
there is some overhead on function calls
inlining is crucial for functional calls from loops
many other optimizations aren't performed for this
fragment of code
loop unrolling
auto-vectorization
etc
potential bloating of code and stack
17
FLOATING POINT
Floating point arithmetic is not associative,
so A+(B+C) != (A+B)+C
A compiler is very conservative about floating point!
void associativityCheck (void)
{
double x = 3.1415926535897931;
double a = 1.0e15;
double b = -(1.0e15 - 1.0);
printf("%f %f\n", x*(a + b), x*a + x*b );
}
$ gcc check.c -o check ; ./check
3.141593 3.000000
This situation is known as catastrophic cancellation
18
MORE OPTIMIZATION CHALLENGES
Branches inside a loop
Exceptions
Accesses to storage type global variables
Inline assembly
volatile keyword
19 . 1
SUMMARY
Source code goes through lexical, syntax and semantic
analysis, as well as IR generation and optimization, before
target machine code is produced
The frontend/backend split simplifies compiler development
An intermediate language makes compiler optimizations
reusable across a broad range of languages and targets
IL can be language-specific or language-independent
Triples and Quadruples are widely used as
language-independent IR
All the compiler optimizations are done on IR
Register allocation goes after IR optimization,
local-variable reuse is pointless nowadays!
19 . 2
SUMMARY
LTO allows optimizations to be performed during linking
WHOPR allows the whole binary to be optimized globally
Compiler usually supports multiple optimization levels
Compiler optimizations are split into machine-dependent
and machine-independent groups
By scope compiler optimizations are split into
interprocedural, intraprocedural, global, local and
peephole
The most common targets are dependency chains,
branches, loops
Compiler optimization is a multi-phase iterative process
Performing one optimization often enables many others
Most optimizations need a certain order of application
19 . 3
SUMMARY
Checking wall-time, assembly or optimizer's report are the
most common ways to get optimization feedback
Wall time is the most important metric to optimize
Assembly is a must-have for checking the compiler,
but it is rarely used to write low-level code
Inspect optimizer's report to demystify its "habits"
Stick to the strict aliasing rule
Clean code is not enough.. Write pure code!
Compilers are usually very conservative when
optimizing floating point math
20
THE END
MARINA KOLPAKOVA / 2015-2016
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...
 
ocelot
ocelotocelot
ocelot
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions Framework
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- MulticoreLec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
 
Debug generic process
Debug generic processDebug generic process
Debug generic process
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software security
 

Semelhante a Pragmatic Optimization in Modern Programming - Demystifying the Compiler

07 140430-ipp-languages used in llvm during compilation
07 140430-ipp-languages used in llvm during compilation07 140430-ipp-languages used in llvm during compilation
07 140430-ipp-languages used in llvm during compilationAdam Husár
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
Demystify eBPF JIT Compiler
Demystify eBPF JIT CompilerDemystify eBPF JIT Compiler
Demystify eBPF JIT CompilerNetronome
 
Specialized Compiler for Hash Cracking
Specialized Compiler for Hash CrackingSpecialized Compiler for Hash Cracking
Specialized Compiler for Hash CrackingPositive Hack Days
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudAndrea Righi
 
Chapter Eight(3)
Chapter Eight(3)Chapter Eight(3)
Chapter Eight(3)bolovv
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)Sławomir Zborowski
 
Debugging Python with gdb
Debugging Python with gdbDebugging Python with gdb
Debugging Python with gdbRoman Podoliaka
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CSusumu Tokumoto
 
Directive-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingDirective-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingRuymán Reyes
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet FiltersKernel TLV
 
C programming language tutorial
C programming language tutorial C programming language tutorial
C programming language tutorial javaTpoint s
 
JVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixJVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixCodemotion Tel Aviv
 
Efficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesNetronome
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeDmitri Nesteruk
 
The true story_of_hello_world
The true story_of_hello_worldThe true story_of_hello_world
The true story_of_hello_worldfantasy zheng
 

Semelhante a Pragmatic Optimization in Modern Programming - Demystifying the Compiler (20)

07 140430-ipp-languages used in llvm during compilation
07 140430-ipp-languages used in llvm during compilation07 140430-ipp-languages used in llvm during compilation
07 140430-ipp-languages used in llvm during compilation
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Demystify eBPF JIT Compiler
Demystify eBPF JIT CompilerDemystify eBPF JIT Compiler
Demystify eBPF JIT Compiler
 
Specialized Compiler for Hash Cracking
Specialized Compiler for Hash CrackingSpecialized Compiler for Hash Cracking
Specialized Compiler for Hash Cracking
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
 
Chapter Eight(3)
Chapter Eight(3)Chapter Eight(3)
Chapter Eight(3)
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)
 
Debugging Python with gdb
Debugging Python with gdbDebugging Python with gdb
Debugging Python with gdb
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for C
 
Directive-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingDirective-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous Computing
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet Filters
 
C programming language tutorial
C programming language tutorial C programming language tutorial
C programming language tutorial
 
JVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixJVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, Wix
 
Efficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit Arches
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
Handout#10
Handout#10Handout#10
Handout#10
 
Programar para GPUs
Programar para GPUsProgramar para GPUs
Programar para GPUs
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
 
The true story_of_hello_world
The true story_of_hello_worldThe true story_of_hello_world
The true story_of_hello_world
 

Último

Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 

Último (20)

Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 

Pragmatic Optimization in Modern Programming - Demystifying the Compiler

7 . 1
INTERMEDIATE LANGUAGE
Optimization techniques become much easier to conduct at the
level of intermediate code.
Modern compilers usually use 2 levels of intermediate
representation (IR).
7 . 2
INTERMEDIATE LANGUAGE
High Level IR
is close to the source and can be easily generated from the
source code. Some code optimizations are possible. It is not
very suitable for target machine optimization.
Low Level IR
is close to the target machine and used for machine-dependent
optimizations: register allocation, instruction selection,
peephole optimization.
7 . 3
INTERMEDIATE LANGUAGE
Language-specific, to be used for JIT compilation later: Java
byte code, .NET CLI, NVIDIA PTX.
Language independent, like three-(four-)address code (similar
to a classic RISC ISA).
a = b + c * d + c * d;
Three-Address Code (TAC)
r1 = c * d;
r2 = b + r1;
r3 = r2 + r1;
a = r3;
Here each rN is an abstract register.
7 . 4
THREE-ADDRESS CODE
Quadruples have four fields:
Op   arg1  arg2  result
*    c     d     r1
+    b     r1    r2
+    r2    r1    r3
=    r3          a
Triples or indirect triples have three fields:
Op   arg1  arg2
*    c     d
+    b     (0)
+    (1)   (0)
=    (2)
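The quadruple table maps directly onto C: each statement has at most
one operator, with explicit temporaries playing the role of the
abstract registers. A small sketch of ours (not from the slides):

```c
#include <assert.h>

/* a = b + c * d + c * d, expanded by hand into three-address form.
 * Each local variable below corresponds to one quadruple. */
int eval_tac(int b, int c, int d)
{
    int r1 = c * d;   /* (0):  *  c    d    -> r1 */
    int r2 = b + r1;  /* (1):  +  b    (0)  -> r2 */
    int r3 = r2 + r1; /* (2):  +  (1)  (0)  -> r3 */
    return r3;        /*       =  (2)       -> a  */
}
```

Note that the common subexpression c * d appears only once: the TAC
form already makes the redundancy explicit and easy to eliminate.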
7 . 5
INTERMEDIATE LANGUAGE
Provides a frontend-independent code representation.
GNU Compiler Collection: GENERIC and GIMPLE
-fdump-tree-all -fdump-tree-optimized -fdump-tree-ssa
-fdump-rtl-all
clang and other LLVM-based compilers: LLVM IR
-emit-llvm
Visual Studio cl.exe: CIL (C Intermediate Language)
7 . 6
INTERMEDIATE LANGUAGE
uint32_t gray2rgba_v1(uint8_t c) {
    return c + (c<<8) + (c<<16) + (c<<24);
}
$ clang -Os -S -emit-llvm test.c -o test.ll
$ cat test.ll
define i32 @gray2rgba_v1(i8 zeroext %c) #0 {
  %1 = zext i8 %c to i32
  %2 = mul i32 %1, 16843009
  ret i32 %2
}
gray2rgba_v1:
  movzbl %dil, %eax
  imull $16843009, %eax, %eax
  ret
8 . 1
DEALING WITH LOCAL VARIABLES
The compiler doesn't care how many variables are used in the
code: register allocation is done after IR transformations.
for( ; j <= roi.width - 4; j += 4 )
{
    uchar t0 = tab[src[j]];
    uchar t1 = tab[src[j+1]];
    dst[j] = t0;
    dst[j+1] = t1;
    t0 = tab[src[j+2]];
    t1 = tab[src[j+3]];
    dst[j+2] = t0;
    dst[j+3] = t1;
}
8 . 2
DEALING WITH LOCAL VARIABLES
.lr.ph4:                              ; preds = %0, %.lr.ph4
  %indvars.iv5 = phi i64 [ %indvars.iv.next6, %.lr.ph4 ], [ 0, %0 ]
  %6 = getelementptr inbounds i8* %src, i64 %indvars.iv5
  %7 = load i8* %6, align 1, !tbaa !1
  %8 = zext i8 %7 to i64
  %9 = getelementptr inbounds i8* %tab, i64 %8
  %10 = load i8* %9, align 1, !tbaa !1
  %11 = or i64 %indvars.iv5, 1
  %12 = getelementptr inbounds i8* %src, i64 %11
  %13 = load i8* %12, align 1, !tbaa !1
  %14 = zext i8 %13 to i64
  %15 = getelementptr inbounds i8* %tab, i64 %14
  %16 = load i8* %15, align 1, !tbaa !1
  %17 = getelementptr inbounds i8* %dst, i64 %indvars.iv5
  store i8 %10, i8* %17, align 1, !tbaa !1
  %18 = getelementptr inbounds i8* %dst, i64 %11
  store i8 %16, i8* %18, align 1, !tbaa !1
  %19 = or i64 %indvars.iv5, 2
  ; ...
  %28 = zext i8 %27 to i64
  %29 = getelementptr inbounds i8* %tab, i64 %28
  %30 = load i8* %29, align 1, !tbaa !1
  %31 = getelementptr inbounds i8* %dst, i64 %19
  store i8 %24, i8* %31, align 1, !tbaa !1
  %32 = getelementptr inbounds i8* %dst, i64 %25
  store i8 %30, i8* %32, align 1, !tbaa !1
  %indvars.iv.next6 = add nuw nsw i64 %indvars.iv5, 4
  %33 = trunc i64 %indvars.iv.next6 to i32
  %34 = icmp sgt i32 %33, %1
  br i1 %34, label %..preheader_crit_edge, label %.lr.ph4
9 . 1
LINK-TIME OPTIMIZATION (LTO)
Performs inter-procedural optimizations during linking.
Most compilers support this feature:
clang (-flto)
gcc (-flto), starting with 4.9
cl.exe (/GL, /LTCG)
...
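A minimal sketch (our own, with hypothetical file names) of the kind
of opportunity LTO creates: inlining a function defined in another
translation unit.

```c
/* In a real project square() would live in lib.c and its caller in
 * main.c, built roughly as:
 *
 *   gcc -flto -c lib.c
 *   gcc -flto -c main.c
 *   gcc -flto lib.o main.o   # square() may now be inlined cross-file
 *
 * Without -flto the compiler sees one file at a time and must emit a
 * real call. Both parts are merged here so the sketch is runnable. */
int square(int x) { return x * x; }

int twice_area(int side)
{
    return 2 * square(side); /* candidate for cross-TU inlining */
}
```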
9 . 2
WHOPR: WHOLE PROGRAM OPTIMIZATION
1. Compile each source file separately, adding extra
   information to the object file
2. Analyze the information collected from all object files
3. Perform a second optimization phase to generate object files
4. Link the final binary
Eliminates even more redundant code
Compilation is better optimized for multi-core systems
10 . 1
OPTIMIZATION LEVELS
-O0 (the default)
No optimization: generates unoptimized code but has the
fastest compilation time.
-O1 Moderate optimization
optimizes reasonably well but does not degrade compilation
time significantly.
-O2 Full optimization
generates highly optimized code and has the slowest
compilation time.
10 . 2
OPTIMIZATION LEVELS
-O3 Aggressive optimization
employs more aggressive automatic inlining of subprograms
within a unit and attempts to vectorize.
-Os Optimizes with a focus on program size
enables all -O2 optimizations that do not typically increase
code size. It also performs further optimizations designed to
reduce code size.
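A small example of ours to make the levels concrete: the same loop is
compiled very differently at -O0 and at higher levels.

```c
/* At -O0 this loop is compiled exactly as written; at -O1 and above
 * GCC and Clang typically simplify it aggressively, and at -O3 they
 * may replace it with a closed form or vectorize it. The exact
 * output depends on the compiler version; compare `gcc -O0 -S` with
 * `gcc -O3 -S` to see the difference. */
int sum_upto(int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += i;
    return s;
}
```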
10 . 3
ENABLED OPTIMIZATIONS: GCC -O0
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O0 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations
-fasynchronous-unwind-tables -fauto-inc-dec -fcommon
-fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining
-feliminate-unused-debug-types -ffunction-cse -fgcse-lm
-fgnu-runtime -fgnu-unique -fident -finline-atomics
-fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots
-fivopts -fkeep-static-consts -fleading-underscore -fmath-errno
-fmerge-debug-strings -fpeephole -fprefetch-loop-arrays
-freg-struct-return -fsched-critical-path-heuristic
-fsched-dep-count-heuristic -fsched-group-heuristic
-fsched-interblock -fsched-last-insn-heuristic
-fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
-fsched-stalled-insns-dep -fshow-column -fsigned-zeros
-fsplit-ivs-in-unroller -fstack-protector
-fstrict-volatile-bitfields -fsync-libcalls -ftrapping-math
-ftree-coalesce-vars -ftree-cselim -ftree-forwprop
-ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
-ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop
-ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables
-fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64
-m80387 -malign-stringops -mavx256-split-unaligned-load
-mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387
-mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4
-mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
10 . 4
ENABLED OPTIMIZATIONS: GCC -O1
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O1 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations
-fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg
-fcombine-stack-adjustments -fcommon -fcompare-elim
-fcprop-registers -fdefer-pop -fdelete-null-pointer-checks
-fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types
-fforward-propagate -ffunction-cse -fgcse-lm -fgnu-runtime
-fgnu-unique -fguess-branch-probability -fident -fif-conversion
-fif-conversion2 -finline -finline-atomics
-finline-functions-called-once -fipa-profile -fipa-pure-const
-fipa-reference -fira-hoist-pressure -fira-share-save-slots
-fira-share-spill-slots -fivopts -fkeep-static-consts
-fleading-underscore -fmath-errno -fmerge-constants
-fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
-fpeephole -fprefetch-loop-arrays -freg-struct-return
-fsched-critical-path-heuristic -fsched-dep-count-heuristic
-fsched-group-heuristic -fsched-interblock
-fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
-fsched-spec-insn-heuristic -fsched-stalled-insns-dep
-fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller
-fsplit-wide-types -fstack-protector -fstrict-volatile-bitfields
-fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
-ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop
-ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts
-ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert
-ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
-ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc
-ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter
-funit-at-a-time -funwind-tables -fverbose-asm
-fzero-initialized-in-bss -m128bit-long-double -m64 -m80387
-malign-stringops -mavx256-split-unaligned-load
-mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387
-mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4
-mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
10 . 5
ENABLED OPTIMIZATIONS: GCC -O2
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O2 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations
-fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg
-fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim
-fcprop-registers -fcse-follow-jumps -fdefer-pop
-fdelete-null-pointer-checks -fdevirtualize
-fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining
-feliminate-unused-debug-types -fexpensive-optimizations
-fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime
-fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads
-fident -fif-conversion -fif-conversion2 -findirect-inlining
-finline -finline-atomics -finline-functions-called-once
-finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const
-fipa-reference -fipa-sra -fira-hoist-pressure
-fira-share-save-slots -fira-share-spill-slots
-fisolate-erroneous-paths-dereference -fivopts
-fkeep-static-consts -fleading-underscore -fmath-errno
-fmerge-constants -fmerge-debug-strings -fmove-loop-invariants
-fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen
-fpartial-inlining -fpeephole -fpeephole2 -fprefetch-loop-arrays
-free -freg-struct-return -freorder-blocks
-freorder-blocks-and-partition -freorder-functions
-frerun-cse-after-loop -fsched-critical-path-heuristic
-fsched-dep-count-heuristic -fsched-group-heuristic
-fsched-interblock -fsched-last-insn-heuristic
-fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
-fsched-stalled-insns-dep -fschedule-insns2 -fshow-column
-fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller
-fsplit-wide-types -fstack-protector -fstrict-aliasing
-fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls
-fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
-ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
-ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce
-ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
-ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
-ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop
-ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
-ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge
-ftree-ter -ftree-vrp -funit-at-a-time -funwind-tables
-fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64
-m80387 -malign-stringops -mavx256-split-unaligned-load
-mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387
-mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4
-mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
-mvzeroupper
10 . 6
ENABLED OPTIMIZATIONS: GCC -O3
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O3 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations
-fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg
-fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim
-fcprop-registers -fcrossjumping -fcse-follow-jumps -fdefer-pop
-fdelete-null-pointer-checks -fdevirtualize
-fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining
-feliminate-unused-debug-types -fexpensive-optimizations
-fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload
-fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability
-fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2
-findirect-inlining -finline -finline-atomics -finline-functions
-finline-functions-called-once -finline-small-functions -fipa-cp
-fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference
-fipa-sra -fira-hoist-pressure -fira-share-save-slots
-fira-share-spill-slots -fisolate-erroneous-paths-dereference
-fivopts -fkeep-static-consts -fleading-underscore -fmath-errno
-fmerge-constants -fmerge-debug-strings -fmove-loop-invariants
-fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen
-fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning
-fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks
-freorder-blocks-and-partition -freorder-functions
-frerun-cse-after-loop -fsched-critical-path-heuristic
-fsched-dep-count-heuristic -fsched-group-heuristic
-fsched-interblock -fsched-last-insn-heuristic
-fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
-fsched-stalled-insns-dep -fschedule-insns2 -fshow-column
-fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller
-fsplit-wide-types -fstack-protector -fstrict-aliasing
-fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls
-fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
-ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
-ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce
-ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
-ftree-loop-distribute-patterns -ftree-loop-if-convert
-ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
-ftree-loop-vectorize -ftree-parallelize-loops= -ftree-partial-pre
-ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc
-ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr
-ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter
-ftree-vrp -funit-at-a-time -funswitch-loops -funwind-tables
-fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64
-m80387 -malign-stringops -mavx256-split-unaligned-load
-mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387
-mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4
-mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
-mvzeroupper
  • 25. 11 CLASSIC COMPILER OPTIMIZATION TAXONOMY Machine independent Applicable across a broad range of machines 1. Eliminate redundant computations, dead code 2. Reduce running time and space 3. Decrease ratio of overhead to real work 4. Specialize code on a context 5. Enable other optimizations Machine dependent Capitalize on specific machine properties 1. Manage or hide latency 2. Manage resources (registers, stack) 3. Improve mapping from IR to concrete machine 4. Use some exotic instructions (e.g. VLDM )
  • 26. 12 . 1 SCOPE COMPILER OPTIMIZATION TAXONOMY Interprocedural optimizations consider the whole translation unit, involve analysis of data flow and dependency graphs. Intraprocedural optimizations consider the whole procedure, involve analysis of data flow and dependency graphs.
  • 27. 12 . 2 SCOPE COMPILER OPTIMIZATION TAXONOMY Global optimizations consider the innermost code block with its context. Loop optimizations belong to this group. Local optimizations consider a single block; the analysis is limited to it. Peephole optimizations map one or more consecutive operators from the IR to machine code.
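As a sketch of what a local optimization sees, consider common-subexpression elimination within a single basic block (the function name and values below are illustrative):

```c
/* Local common-subexpression elimination (CSE), a sketch.
 * Within one basic block the compiler can compute a + b once
 * and reuse it, instead of evaluating it twice as written. */
int local_cse(int a, int b) {
    int x = (a + b) * 2;   /* a + b computed here ...                  */
    int y = (a + b) * 3;   /* ... may be reused here by the compiler   */
    return x + y;          /* equal to (a + b) * 5                     */
}
```

Because everything happens inside one block, the analysis needs no data flow information from the rest of the procedure.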
  • 28. 12 . 3 INTERPROCEDURAL OPTIMIZATIONS (IPO) Look at all routines in a translation unit in order to make optimizations across routine boundaries, including but not limited to inlining and cloning. Also called Interprocedural Analysis (IPA). The compiler can move, optimize, restructure and delete code between procedures and even different source files, if LTO is enabled Inlining — replacing a subroutine call with a copy of its body Cloning — specializing a copy of a subroutine for a particular call site
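A hedged sketch of what inlining and cloning buy (the names below are illustrative): once `scale` is inlined, or cloned for a constant `k`, each call site can be folded down to a single multiply.

```c
/* IPA sketch: `scale` is small and static, so the compiler may
 * inline it, or emit constant-propagated clones (GCC names them
 * scale.constprop.*) specialized for k == 2 and k == 3. */
static int scale(int x, int k) { return x * k; }

int ipo_caller(int x) {
    /* After inlining these become x*2 + x*3, foldable to x*5. */
    return scale(x, 2) + scale(x, 3);
}
```

With LTO enabled the same transformation can cross source-file boundaries, not just routine boundaries.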
  • 29. 13 PATTERN COMPILER OPTIMIZATION TAXONOMY Dependency chains (linear code) Branches Loop bodies Single loop Loop and branch Multi-loop Function calls to subroutines
  • 30. 14 . 1 HOW TO GET OPTIMIZATION FEEDBACK? Check the wall-time of your application If a compiler has done its job well, you'll see performance improvements Dump the assembly of your code (or/and IL) Ensure instruction and register scheduling Check for extra operations and register spills See the compiler optimization report All the compilers have some support for it Some of them are able to generate very detailed reports about loop unrolling, auto-vectorization, VLIW slot scheduling, etc.
  • 31. 14 . 2 COMMONLY CONSIDERED METRICS Wall(-clock) time is a human perception of the span of time from the start to the completion of a task. Power consumption is the electrical energy which is consumed to complete a task. Processor time (or runtime) is the total execution time during which a processor was dedicated to a task (i.e. executes instructions of that task).
  • 32. 14 . 3 DUMPING ASSEMBLY Assembly is a must-have for checking the compiler, but it is rarely used to write low-level code. $ gcc code.c -S -o asm.s Assembly writing is the least portable optimization Inline assembly limits compiler optimizations Assembly does not give an overwhelming speedup nowadays Sometimes it is needed to overcome compiler bugs and optimization limitations
  • 33. 14 . 4 EXAMPLE: GCC FEEDBACK OPTIONS Enables optimization information printing -fopt-info -fopt-info-<optimized/missed/note/all> -fopt-info-all-<ipa/loop/inline/vec/optall> -fopt-info=filename Controls the amount of debugging output the scheduler prints on targets that use instruction scheduling -fsched-verbose=n Controls the amount of output from the auto-vectorizer -ftree-vectorizer-verbose=n
  • 34. 14 . 5 EXAMPLES: GCC FEEDBACK OPTIONS Outputs all optimization info to stderr. gcc -O3 -fopt-info Outputs missed optimization report from all the passes to missed.txt gcc -O3 -fopt-info-missed=missed.txt Outputs information about missed optimizations as well as optimized locations from all the inlining passes to inline.txt. gcc -O3 -fopt-info-inline-optimized-missed=inline.txt
  • 35. 14 . 6 GCC FEEDBACK EXAMPLE ./src/box.cc:193:9: note: loop vectorized ./src/box.cc:193:9: note: loop versioned for vectorization because of possible aliasing ./src/box.cc:193:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:96:9: note: loop vectorized ./src/box.cc:96:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:51:9: note: loop vectorized ./src/box.cc:51:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:193:9: note: loop with 7 iterations completely unrolled ./src/box.cc:32:13: note: loop with 7 iterations completely unrolled ./src/box.cc:96:9: note: loop with 15 iterations completely unrolled ./src/box.cc:51:9: note: loop with 15 iterations completely unrolled ./src/box.cc:584:9: note: loop vectorized ./src/box.cc:584:9: note: loop versioned for vectorization because of possible aliasing ./src/box.cc:584:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:482:9: note: loop vectorized ./src/box.cc:482:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:463:5: note: loop vectorized ./src/box.cc:463:5: note: loop versioned for vectorization because of possible aliasing ./src/box.cc:463:5: note: loop peeled for vectorization to enhance alignment
  • 36. 15 . 1 POINTER ALIASING void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; } void twiddle2(int *xp, int *yp) { *xp += 2* *yp; } ARE THEY ALWAYS EQUAL?
  • 37. 15 . 2 POINTER ALIASING What if.. int main(int argc, char** argv) { int i = 5, j = 5; twiddle1(&i, &i); twiddle2(&j, &j); printf("twiddle1 result is %d\n" , i); printf("twiddle2 result is %d\n" , j); } twiddle1 result is 20 while twiddle2 result is 15
  • 38. 15 . 3 POINTER ALIASING Aliasing refers to the situation where the same memory location can be accessed by using different names. void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; } void twiddle2(int *xp, int *yp) { *xp += 2* *yp; }
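One pragmatic remedy is C99 `restrict` (a sketch; `twiddle3` and its demo are hypothetical names): it promises the compiler that `xp` and `yp` never alias, so the two updates may legally be merged.

```c
/* With restrict the programmer asserts that xp and yp never alias,
 * so the compiler may merge the two read-modify-write sequences.
 * Calling twiddle3 with aliasing pointers is undefined behavior. */
void twiddle3(int *restrict xp, int *restrict yp) {
    *xp += *yp;
    *xp += *yp;   /* may be compiled as a single *xp += 2 * *yp */
}

int twiddle3_demo(void) {
    int i = 5, j = 7;
    twiddle3(&i, &j);  /* distinct objects: the promise holds */
    return i;          /* 5 + 7 + 7 = 19 */
}
```

The burden of correctness moves to the programmer: the compiler does not check the promise.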
  • 39. 15 . 4 STRICT ALIASING ASSUMPTION Strict aliasing is an assumption, made by a C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location. This assumption enables more aggressive optimization (gcc assumes it from -O2 upwards), but the programmer has to follow the strict aliasing rules to get code working correctly.
  • 40. 15 . 5 results in results in STRICT ALIASING ASSUMPTION void check(int32_t *h, int64_t *k) { *h = 5; *k = 6; printf("%d\n", *h); } int main() { int64_t k; check((int32_t *)&k, &k); } gcc -O1 test.c -o test ; ./test 6 gcc -O2 test.c -o test ; ./test 5
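When you genuinely need to reinterpret bytes across types, a `memcpy`-based sketch stays within the strict aliasing rules (the function name is illustrative; compilers typically lower the copy to a single move):

```c
#include <stdint.h>
#include <string.h>

/* Strict-aliasing-safe type punning: copy the bytes instead of
 * dereferencing a casted pointer. The memcpy is optimized away. */
static float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);   /* well-defined, unlike *(float *)&u */
    return f;
}
```

Unlike the `(int32_t *)&k` cast above, this keeps its meaning at every optimization level.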
  • 41. 15 . 6 POINTER ALIASING: MISSED OPPORTUNITIES Compiler freely schedules arithmetic, but often preserves the order of memory dereferencing Compiler is limited in redundancy elimination Compiler is limited in loop unrolling Compiler is limited in auto-vectorization
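The loop-level cost can be seen in a sketch (illustrative function): without `restrict` the compiler must either emit a runtime overlap check ("loop versioned for vectorization because of possible aliasing", as in the report above) or keep scalar code.

```c
/* Without restrict, dst may overlap a or b, so the compiler has to
 * version the loop or stay scalar; with restrict it can vectorize
 * and unroll freely. */
void vadd(float *restrict dst, const float *restrict a,
          const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

float vadd_demo(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, d[4];
    vadd(d, a, b, 4);      /* the three arrays are distinct */
    return d[0] + d[2];    /* 5 + 5 = 10 */
}
```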
  • 42. 16 . 1 FUNCTION CALLS int callee(); int caller() { return callee() + callee(); } int callee(); int caller() { return 2*callee(); } ARE THEY EQUAL?
  • 43. 16 . 2 FUNCTION CALLS int callee(int i); int caller() { int s=0, i=0; for ( ; i < N ; i++) s += callee(i); return s; } int callee(int i); int caller() { int s0=0, s1=0, i=0; for ( ; i < N ; i+=2) { s0+=callee(i); s1+=callee(i+1); } return s0 + s1; } ARE THEY EQUAL?
  • 44. 16 . 3 PURE FUNCTIONS A pure function is a function for which both of the following statements are true: 1. The function always evaluates the same result having been given the same argument value(s). The function result must not depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, as well as on any external input from I/O devices. 2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.
  • 45. 16 . 4 PURE FUNCTIONS Pure functions are much easier to optimize. Expressing ideas in code as pure functions simplifies the compiler's life. Most functions from math.h are not pure (they set/clear floating point flags and conditions, throw floating point exceptions) Use the constexpr keyword in C++11 to hint the compiler that a function could be evaluated at compile time Use the static keyword to help the compiler see all the usages of a function (and perform aggressive inlining, or even deduce whether the function is pure or not) Neither constexpr nor static guarantees that a function is pure, but they give the compiler some hints.
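GCC and Clang also accept explicit purity annotations (a sketch; the function names are illustrative): `__attribute__((const))` promises the result depends only on the arguments, while `__attribute__((pure))` additionally allows reads of global memory.

```c
/* Purity hints, a GCC/Clang extension. With the attribute the
 * compiler may CSE repeated calls:
 *   square(x) + square(x)  ->  2 * square(x)               */
__attribute__((const)) static int square(int x) { return x * x; }

int sum_of_squares_twice(int x) {
    return square(x) + square(x);   /* one call may suffice */
}
```

Like `static` and `constexpr`, these are hints the programmer vouches for; lying to the compiler here produces silently wrong code.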
  • 46. 16 . 5 FUNCTION CALLS: MISSED OPPORTUNITIES If the compiler fails to inline a function body: it is limited in redundancy elimination there is some overhead on each function call inlining is crucial for function calls from loops many other optimizations aren't performed for this fragment of code loop unrolling auto-vectorization etc. potential bloating of code and stack
  • 47. 17 FLOATING POINT Floating point arithmetic is not associative, so A+(B+C) != (A+B)+C A compiler is very conservative about floating point! void associativityCheck (void) { double x = 3.1415926535897931; double a = 1.0e15; double b = -(1.0e15 - 1.0); printf("%f %f\n", x*(a + b), x*a + x*b ); } $ gcc check.c -o check ; ./check 3.141593 3.000000 Such a situation is known as catastrophic cancellation
  • 48. 18 MORE OPTIMIZATION CHALLENGES Branches inside a loop Exceptions Accesses to storage type global variables Inline assembly volatile keyword
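For the first of these, a loop-invariant branch can be hoisted by hand when the compiler's unswitching (`-funswitch-loops`, enabled at -O3 above) does not fire; a sketch with illustrative names:

```c
/* Loop unswitching by hand: the branch on `negate` is loop-invariant,
 * so the branchy loop can be split into two branch-free loops that
 * unroll and vectorize more easily. */
void copy_buf(int *dst, const int *src, int n, int negate) {
    if (negate) {
        for (int i = 0; i < n; i++) dst[i] = -src[i];
    } else {
        for (int i = 0; i < n; i++) dst[i] = src[i];
    }
}

int unswitch_demo(void) {
    int src[4] = {1, 2, 3, 4}, dst[4];
    copy_buf(dst, src, 4, 1);
    return dst[0] + dst[3];   /* -1 + -4 = -5 */
}
```

The trade-off is the usual one: each hoisted branch doubles the amount of loop code.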
  • 49. 19 . 1 SUMMARY Source code goes through lexical, syntax and semantic analysis, as well as IR generation and optimization, before producing target machine code Backend and frontend separation simplifies compiler development An intermediate language makes compiler optimizations reusable across a broad range of languages and targets IL can be language-specific or language-independent Triples and quadruples are widely used as language-independent IRs All the compiler optimizations are done on IR Register allocation goes after IR optimization, so local-variable reuse is pointless nowadays!
  • 50. 19 . 2 SUMMARY LTO allows optimizations during linking WHOPR allows globally optimizing the whole binary A compiler usually supports multiple optimization levels Compiler optimizations are split into machine-dependent and machine-independent groups By scope, compiler optimizations are split into interprocedural, intraprocedural, global, local and peephole The most common targets are dependency chains, branches and loops Compiler optimization is a multi-phase iterative process Performing one optimization enables many others Most optimizations need a certain order of application
  • 51. 19 . 3 SUMMARY Checking wall-time, assembly or the optimizer's report are the most common ways to get optimization feedback Wall time is the most important metric to optimize Assembly is a must-have for checking the compiler, but it is rarely used to write low-level code Inspect the optimizer's report to demystify its "habits" Stick to the strict aliasing rules Clean code is not enough.. Write pure code! Compilers are usually very conservative optimizing floating point math