This document discusses compiler optimizations. It begins with an outline of topics including compilation trajectory, intermediate languages, optimization levels, and optimization techniques. It then provides more details on each phase of compilation, how compilers use intermediate representations to perform optimizations, and specific optimizations like common subexpression elimination, constant propagation, and instruction scheduling.
OUTLINE
Compilation trajectory
Intermediate language
Dealing with local variables
Link-time and whole-program optimization
Optimization levels
Compiler optimization taxonomies
Classic
Scope
Code pattern
How to get optimization feedback?
Compiler optimization challenges
Summary
EXECUTABLE GENERATION PHASES
Pre-processing. Pre-process, but don't compile.
gcc -E test.cc
cl /E test.cc
Compilation. Compile, but don't assemble.
gcc -S test.cc
cl /FA test.cc
Assembling. Assemble, but don't link.
gcc -c test.cc
cl /c test.cc
Linking. Link object files to generate the executable.
gcc test.cc
cl test.cc
COMPILATION TRAJECTORY
Lexical Analysis
scans the source code as a stream of characters,
converting it into lexemes (tokens).
Syntax Analysis
takes the tokens, produced by lexical analysis, as input
and generates a syntax tree. Source code grammar
(syntactical correctness) is checked here.
COMPILATION TRAJECTORY
Semantic Analysis
checks whether the constructed syntax tree follows the
language rules (including the type checking).
Intermediate Code Generation
builds a program representation for some abstract
machine. It is in between the high-level language and the
target machine language.
COMPILATION TRAJECTORY
Code Optimization
performs optimization of the intermediate code (e.g.,
redundancy elimination).
Code Generation
takes an optimized representation of the intermediate
code and maps it to the target machine language.
FRONTEND AND BACKEND
Only a backend is required for new machine support
Only a frontend is required for new language support
Most optimizations resemble each other across all targets
and can be applied between the frontend and the backend
INTERMEDIATE LANGUAGE
Optimization techniques become much easier to conduct on
the level of intermediate code. Modern compilers usually use
2 levels of intermediate representation (IR).
INTERMEDIATE LANGUAGE
High Level IR
is close to the source and can be easily generated from
the source code. Some code optimizations are possible.
It is not very suitable for target machine optimization.
Low Level IR
is close to the target machine and used for machine-
dependent optimizations: register allocation, instruction
selection, peephole optimization.
INTERMEDIATE LANGUAGE
Language-specific, to be used for JIT compilation later:
Java bytecode, .NET CIL, NVIDIA PTX.
Language independent, like three-(four-)address code
(similar to a classic RISC ISA).
a = b + c * d + c * d;
Three-Address Code (TAC)
r1 = c * d;
r2 = b + r1;
r3 = r2 + r1;
a = r3;
Here r1, r2, and r3 are abstract registers.
THREE-ADDRESS CODE
Quadruples have four fields:
Op  arg1  arg2  result
 *   c     d     r1
 +   b     r1    r2
 +   r2    r1    r3
 =   r3          a
Triples (or indirect triples) have three fields; results are
referenced by row index:
      Op  arg1  arg2
(0)    *   c     d
(1)    +   b     (0)
(2)    +   (1)   (0)
(3)    =   a     (2)
INTERMEDIATE LANGUAGE
Provides a frontend-independent code representation.
GNU Compiler Collection: GENERIC and GIMPLE
-fdump-tree-all -fdump-tree-optimized
-fdump-tree-ssa -fdump-rtl-all
clang and other LLVM-based compilers: LLVM IR
-emit-llvm
Visual Studio cl.exe: CIL (C Intermediate Language)
INTERMEDIATE LANGUAGE
uint32_t gray2rgba_v1(uint8_t c)
{
return c + (c<<8) + (c<<16) + (c<<24);
}
$ clang -Os -S -emit-llvm test.c -o test.ll
$ cat test.ll
define i32 @gray2rgba_v1(i8 zeroext %c) #0 {
%1 = zext i8 %c to i32
%2 = mul i32 %1, 16843009
ret i32 %2
}
gray2rgba_v1:
movzbl %dil, %eax
imull $16843009, %eax, %eax
ret
DEALING WITH LOCAL VARIABLES
The compiler doesn't care how many local variables are used in
the code; register allocation is done after the IR transformations.
for( ; j <= roi.width - 4; j += 4 )
{
uchar t0 = tab[src[j]];
uchar t1 = tab[src[j+ 1]];
dst[j] = t0;
dst[j+1] = t1;
t0 = tab[src[j+2]];
t1 = tab[src[j+3]];
dst[j+2] = t0;
dst[j+3] = t1;
}
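The hand-unrolled loop above reuses t0 and t1, but that buys nothing. A minimal sketch of the straightforward form (same tab/src/dst buffers) typically compiles to equally good code, because the compiler renames values into fresh IR temporaries anyway:
for( ; j < roi.width; j++ )
    dst[j] = tab[src[j]]; /* the compiler introduces as many IR temporaries as it needs */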
LINK-TIME OPTIMIZATION (LTO)
Perform interprocedural optimizations during linking.
Most compilers support this feature:
clang (-flto)
gcc (-flto), starting with 4.9
cl.exe (/GL, /LTCG)
...
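A minimal usage sketch with gcc (the -flto flag is as listed above; a.c and b.c are hypothetical file names):
$ gcc -O2 -flto -c a.c
$ gcc -O2 -flto -c b.c
$ gcc -O2 -flto a.o b.o -o app
Note that -flto must also be passed at the link step so the deferred interprocedural optimization actually runs.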
WHOPR: WHOLE PROGRAM OPTIMIZATION
1. Compile each source file separately, adding extra information
to the object file
2. Analyze the information collected from all object files
3. Perform a second optimization phase to generate object files
4. Link the final binary
Eliminates even more redundant code
Compilation is better optimized for multi-core systems
OPTIMIZATION LEVELS
-O0 (the default) No optimization
generates unoptimized code but has the fastest
compilation time.
-O1 Moderate optimization
optimizes reasonably well but does not degrade
compilation time significantly.
-O2 Full optimization
generates highly optimized code and has the slowest
compilation time.
OPTIMIZATION LEVELS
-O3 Aggressive optimization
employs more aggressive automatic inlining of
subprograms within a unit and attempts to vectorize.
-Os Optimizes with focus on program size
enables all -O2 optimizations that do not typically
increase code size. It also performs further
optimizations designed to reduce code size.
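To see exactly which optimization flags a given level enables, gcc can report them (a quick check; the output is long):
$ gcc -O2 -Q --help=optimizers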
CLASSIC COMPILER OPTIMIZATION TAXONOMY
Machine independent: applicable across a broad range of machines
1. Eliminate redundant computations, dead code
2. Reduce running time and space
3. Decrease the ratio of overhead to real work
4. Specialize code to a context
5. Enable other optimizations
Machine dependent: capitalize on specific machine properties
1. Manage or hide latency
2. Manage resources (registers, stack)
3. Improve the mapping from IR to the concrete machine
4. Use some exotic instructions (e.g., VLDM)
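As a machine-independent example, the redundant c * d below is computed only once by common subexpression elimination, echoing the TAC example earlier (a minimal sketch; the function name is illustrative):
int cse_example(int b, int c, int d)
{
    /* c * d is materialized once into an IR temporary
       (r1 in the TAC example) and reused for the second addition */
    return b + c * d + c * d;
}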
SCOPE COMPILER OPTIMIZATION TAXONOMY
Interprocedural optimizations
consider the whole translation unit and involve analysis of
data-flow and dependency graphs.
Intraprocedural optimizations
consider a whole procedure and involve analysis of
data-flow and dependency graphs.
SCOPE COMPILER OPTIMIZATION TAXONOMY
Global optimizations
consider the innermost code block together with its context.
Loop optimizations belong to this group.
Local optimizations
consider a single block; the analysis is limited to it.
Peephole optimizations
map one or more consecutive operations from the IR to
machine code.
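A typical peephole/instruction-selection example (illustrative; the exact output depends on target and compiler): a multiplication by a small constant is mapped to a cheaper instruction.
int dbl(int x)
{
    return x * 2; /* x86-64 gcc/clang commonly emit: lea eax, [rdi+rdi] */
}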
INTERPROCEDURAL OPTIMIZATIONS (IPO)
Look at all routines in a translation unit in order to make
optimizations across routine boundaries, including but not
limited to inlining and cloning.
Also called Interprocedural Analysis (IPA).
The compiler can move, optimize, restructure, and delete code
between procedures,
and even across different source files, if LTO is enabled
Inlining — replacing a subroutine call with a copy of the
subroutine's body
Cloning — optimizing the logic in a copied subroutine for a
particular call site
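A minimal sketch of what inlining enables (names are illustrative): once the call is replaced with the body, constant propagation finishes the job.
static int square(int x) { return x * x; } /* static: all uses are visible */

int get_nine(void)
{
    return square(3); /* after inlining, folds to the constant 9 */
}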
PATTERN COMPILER OPTIMIZATION TAXONOMY
Dependency chains (linear code)
Branches
Loop bodies
Single loop
Loop and branch
Multi-loop
Function calls to subroutines
HOW TO GET OPTIMIZATION FEEDBACK?
Check the wall-time of your application
If a compiler has done its job well, you'll see performance
improvements
Dump an assembly of your code (or/and IL)
Verify instruction scheduling and register allocation
Check for extra operations and register spills
See compiler optimization report
All the compilers have some support for it
Some of them are able to generate very detailed reports
about loop unrolling, auto-vectorization, VLIW slots
scheduling, etc
COMMONLY CONSIDERED METRICS
Wall(-clock) time
is a human perception of the span of time from the
start to the completion of a task.
Power consumption
is the electrical energy which is consumed to complete a
task.
Processor time (or runtime)
is the total execution time during which a processor was
dedicated to a task (i.e. executes instructions of that
task).
DUMPING ASSEMBLY
Reading assembly is a must-have for checking the compiler,
but assembly is rarely used to write low-level code.
$ gcc code.c -S -o asm.s
Assembly writing is the least portable optimization
Inline assembly limits compiler optimizations
Assembly does not give overwhelming speedup nowadays
Sometimes it is needed to overcome compiler bugs and
optimization limitations
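Besides compiling with -S, a built binary can be disassembled directly; objdump is the standard binutils tool for this (a.out is gcc's default output name):
$ gcc -O2 code.c
$ objdump -d a.out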
EXAMPLE: GCC FEEDBACK OPTIONS
Enables optimization information printing
-fopt-info
-fopt-info-<optimized/missed/note/all>
-fopt-info-all-<ipa/loop/inline/vec/optall>
-fopt-info=filename
Controls the amount of debugging output the scheduler
prints on targets that use instruction scheduling
-fopt-info -fsched-verbose=n
Controls the amount of output from auto-vectorizer
-ftree-vectorizer-verbose=n
EXAMPLES: GCC FEEDBACK OPTIONS
Outputs all optimization info to stderr.
gcc -O3 -fopt-info
Outputs missed optimization report from all the passes
to missed.txt
gcc -O3 -fopt-info-missed=missed.txt
Outputs information about missed optimizations as
well as optimized locations from all the inlining passes
to inline.txt.
gcc -O3 -fopt-info-inline-optimized-missed=inline.txt
GCC FEEDBACK EXAMPLE
./src/box.cc:193:9: note: loop vectorized
./src/box.cc:193:9: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:193:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:96:9: note: loop vectorized
./src/box.cc:96:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:51:9: note: loop vectorized
./src/box.cc:51:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:193:9: note: loop with 7 iterations completely unrolled
./src/box.cc:32:13: note: loop with 7 iterations completely unrolled
./src/box.cc:96:9: note: loop with 15 iterations completely unrolled
./src/box.cc:51:9: note: loop with 15 iterations completely unrolled
./src/box.cc:584:9: note: loop vectorized
./src/box.cc:584:9: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:584:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:482:9: note: loop vectorized
./src/box.cc:482:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:463:5: note: loop vectorized
./src/box.cc:463:5: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:463:5: note: loop peeled for vectorization to enhance alignment
POINTER ALIASING
void twiddle1(int *xp, int *yp)
{
*xp += *yp;
*xp += *yp;
}
void twiddle2(int *xp, int *yp)
{
*xp += 2* *yp;
}
ARE THEY ALWAYS EQUAL?
POINTER ALIASING
What if..
int main(int argc, char** argv)
{
int i = 5, j = 5;
twiddle1(&i, &i);
twiddle2(&j, &j);
printf("twiddle1 result is %dn" , i);
printf("twiddle2 result is %dn" , j);
}
twiddle1 result is 20 while twiddle2 result is 15
POINTER ALIASING
Aliasing refers to the situation where the same memory
location can be accessed by using different names.
void twiddle1(int *xp, int *yp)
{
*xp += *yp;
*xp += *yp;
}
void twiddle2(int *xp, int *yp)
{
*xp += 2* *yp;
}
STRICT ALIASING ASSUMPTION
Strict aliasing is an assumption, made by a C (or C++)
compiler, that dereferencing pointers to objects of different
types will never refer to the same memory location.
This assumption enables more aggressive optimization (gcc
makes it starting from -O2), but the programmer has to
follow the strict aliasing rules to get correct code.
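A minimal sketch of the classic violation and a well-defined alternative (the function name is illustrative; assumes int and float have the same size):
#include <string.h>

int float_bits(float f)
{
    /* int *ip = (int *)&f; *ip would read a float object through
       an int lvalue, violating strict aliasing */
    int bits;
    memcpy(&bits, &f, sizeof bits); /* well-defined way to inspect the bits */
    return bits;
}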
POINTER ALIASING: MISSED OPPORTUNITIES
Compiler freely schedules arithmetic,
but often preserves the order of memory dereferences
Compiler is limited in redundancy elimination
Compiler is limited in loop unrolling
Compiler is limited in auto-vectorization
(see the restrict sketch below for a way to lift these limits)
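C99's restrict qualifier promises the compiler that the pointed-to objects do not alias, re-enabling these optimizations (a minimal sketch reusing the earlier example):
void twiddle3(int * restrict xp, int * restrict yp)
{
    /* with aliasing ruled out, the compiler may load *yp once and
       combine the two additions, exactly as in twiddle2 */
    *xp += *yp;
    *xp += *yp;
}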
FUNCTION CALLS
int callee();
int caller()
{
return callee() + callee();
}
int callee();
int caller()
{
return 2*callee();
}
ARE THEY EQUAL?
FUNCTION CALLS
int callee(int i);
int caller()
{
int s=0, i=0;
for ( ; i < N ; i++)
s += callee(i);
return s;
}
int callee(int i);
int caller()
{
int s0=0, s1=0, i=0;
for ( ; i < N ; i+=2 ) /* assumes N is even */
{
s0+=callee(i);
s1+=callee(i+1);
}
return s0 + s1;
}
ARE THEY EQUAL?
PURE FUNCTIONS
Pure function is a function for which both of the following
statements are true:
1. The function always evaluates to the same result when
given the same argument value(s). The function result must
not depend on any hidden information or state that may
change while program execution proceeds or between
different executions of the program, as well as on any
external input from I/O devices.
2. Evaluation of the result does not cause any semantically
observable side effect or output, such as mutation of
mutable objects or output to I/O devices.
PURE FUNCTIONS
Pure functions are much easier to optimize. Expressing
ideas in code as pure functions simplifies the compiler's life.
Most functions from math.h are not pure (they set/clear
floating-point flags and conditions, raise floating-point
exceptions)
Use the constexpr keyword (C++11) to hint to the compiler
that a function can be evaluated at compile time
Use the static keyword to help the compiler see all the
uses of a function (and perform aggressive inlining, or
even deduce whether the function is pure)
Neither constexpr nor static guarantees that a function
is pure, but they give the compiler useful hints (see the
sketch below).
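GCC and clang also accept explicit purity annotations; the attributes below are real, while the function signatures are illustrative:
__attribute__((pure))  int sum(const int *a, int n); /* reads memory, no side effects */
__attribute__((const)) int square(int x);            /* result depends on arguments only */

constexpr int cube(int x) { return x * x * x; }      /* C++11: evaluable at compile time */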
FUNCTION CALLS: MISSED OPPORTUNITIES
If the compiler fails to inline a function body:
it is limited in redundancy elimination
there is some overhead on every function call
inlining is crucial for function calls inside loops
many other optimizations aren't performed for that
fragment of code
loop unrolling
auto-vectorization
etc
Conversely, aggressive inlining risks bloating the code
and the stack.
FLOATING POINT
Floating-point arithmetic is not associative,
so A + (B + C) != (A + B) + C.
A compiler is very conservative about floating point!
void associativityCheck (void)
{
double x = 3.1415926535897931;
double a = 1.0e15;
double b = -(1.0e15 - 1.0);
printf("%f %fn", x*(a + b), x*a + x*b );
}
$ gcc check.c -o check ; ./check
3.141593 3.000000
This situation is known as catastrophic cancellation.
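gcc's -ffast-math relaxes this conservatism (permitting reassociation, among other things) at the cost of strict IEEE semantics; rerunning the example above with it may fold both expressions to the same code:
$ gcc -ffast-math check.c -o check ; ./check
Use such flags only where the resulting numeric differences are acceptable.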
SUMMARY
Source code goes through lexical, syntax, and semantic
analysis, as well as IR generation and optimization, before
the target machine code is produced
The backend/frontend split simplifies compiler development
An intermediate language makes compiler optimizations
reusable across a broad range of languages and targets
An IL can be language-specific or language-independent
Triples and quadruples are widely used as
language-independent IRs
All the compiler optimizations are done on the IR
Register allocation goes after IR optimization;
local-variable reuse is pointless nowadays!
SUMMARY
LTO allows performing optimizations during linking
WHOPR allows globally optimizing the whole binary
Compilers usually support multiple optimization levels
Compiler optimizations are split into machine-dependent
and machine-independent groups
By scope, compiler optimizations are split into
interprocedural, intraprocedural, global, local, and
peephole
The most common targets are dependency chains,
branches, and loops
Compiler optimization is a multi-phase iterative process
Performing one optimization enables many others
Most optimizations need a certain order of application
SUMMARY
Checking wall-time, assembly, or the optimizer's report is the
most common way to get optimization feedback
Wall time is the most important metric to optimize
Reading assembly is a must-have for checking the compiler,
but assembly is rarely used to write low-level code
Inspect the optimizer's report to demystify its "habits"
Stick to the strict aliasing rules
Clean code is not enough... write pure code!
Compilers are usually very conservative when
optimizing floating-point math