Rechkov. Lomonosov Report
1. Introduction Assembler as a natural language Anomaly detection
Detecting abnormal executable files using
binary code mining
Rechkov Anton
TU Berlin Germany & TTI SFU Russia
21st March 2012
Rechkov Anton Lomonosov Scholarship Report 21st March 2012 1 / 31
Malware evolution
Encrypted (ciphered)
The virus code is stored in encrypted form.
Oligomorphic
The decryptor is generated by randomly selecting each of its pieces
from several predefined alternatives.
Polymorphic
Each replication encrypts the malware body and modifies the decryptor.
Metamorphic
An obfuscation engine rewrites the entire virus body.
Modern detection techniques
Signature analysis
Searching the code for a predetermined pattern.
Emulation
Unpacking and analyzing malware by emulating its code, followed by
signature analysis.
Behavioral analysis
Analysis of the flow graph of functions.
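As a rough illustration of signature analysis, the following sketch scans a byte buffer for a fixed pattern that may contain wildcard positions. The function name `find_signature`, the use of `None` as a wildcard, and the sample bytes are assumptions made for this example; they are not taken from the report.

```python
from typing import List, Optional, Sequence


def find_signature(data: bytes, pattern: Sequence[Optional[int]]) -> List[int]:
    """Return every offset in `data` where `pattern` matches.

    A pattern element is a byte value (0-255) or None, which matches any byte.
    """
    hits = []
    last_start = len(data) - len(pattern)
    for start in range(last_start + 1):
        if all(p is None or data[start + i] == p for i, p in enumerate(pattern)):
            hits.append(start)
    return hits


# Hypothetical buffer and signature: push ebp; mov ebp, esp; sub esp, <any imm8>.
code = bytes([0x90, 0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10, 0xC3])
signature = [0x55, 0x8B, 0xEC, 0x83, 0xEC, None]
matches = find_signature(code, signature)
```

A fixed pattern like this one stops matching as soon as the code is re-encrypted or rewritten, which is the weakness that the polymorphic and metamorphic techniques exploit.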
Code modification
Obfuscation
Transformation of executable program code that preserves functionality but
complicates analysis and understanding of its algorithms.
Deobfuscation
Eliminating irrelevant code using:
algebraic models
formal grammars
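To make the idea of irrelevant code concrete, here is a minimal sketch that strips two kinds of junk from an instruction list: `nop` and cancelling `push r` / `pop r` pairs. It is plain pattern rewriting, a much simpler substitute for the algebraic models and formal grammars named on the slide. The tuple representation and the name `strip_junk` are assumptions for illustration.

```python
from typing import List, Tuple

Instruction = Tuple[str, ...]  # (mnemonic, operand, ...)


def strip_junk(code: List[Instruction]) -> List[Instruction]:
    """Drop `nop` and cancelling `push r` / `pop r` pairs in a single pass."""
    result: List[Instruction] = []
    for ins in code:
        if ins == ("nop",):
            continue
        if result and ins[0] == "pop" and result[-1] == ("push",) + ins[1:]:
            # The pair leaves the registers and the stack pointer unchanged.
            result.pop()
            continue
        result.append(ins)
    return result


# Hypothetical obfuscated fragment: two useful instructions padded with junk.
obfuscated = [
    ("mov", "eax", "1"),
    ("nop",),
    ("push", "ebx"),
    ("nop",),
    ("pop", "ebx"),
    ("add", "eax", "2"),
]
clean = strip_junk(obfuscated)
```

Because the output list is used as a stack, nested pairs such as `push eax; push ebx; pop ebx; pop eax` are removed as well.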
Outline
1 Assembler as a natural language
Binary code mining
Natural language processing
Stochastic models
2 Anomaly detection
Table of Contents
1 Assembler as a natural language
Binary code mining
Natural language processing
Stochastic models
2 Anomaly detection
Preparation
Code generator lexemes
Anomaly detection by neural networks
Anomaly detection by probability model
Binary code mining
Structure of a compiler
Common compiler scheme
Code generator engine:
Machine code generator
Optimizers:
interprocedural optimization (IPO),
profile-guided optimization (PGO),
high-level optimizations
Mutation code generator / obfuscator
Common code generator features
High-level optimizations
Unique intermediate language
Pre-optimization in the intermediate representation
Code generation
Code templates from intermediate to target code
Number of instruction types used
Machine-dependent optimizer
Instruction costs
Testing the theory
Experiment
Determine instruction sequences
Compile source code with several compilers
Compare the resulting distributions
Compilers
⇒ MSVC
⇒ LLVM
⇒ GCC
⇒ Intel C++ Compiler
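The comparison step of the experiment can be sketched as follows: count runs of consecutive instructions in two listings and measure how far apart the two frequency distributions are. The choice of total variation distance, the run length of 3, and the two sample listings are assumptions for illustration; the report does not say which measure was used.

```python
from collections import Counter
from typing import Dict, List, Tuple

Ngram = Tuple[str, ...]


def ngram_distribution(mnemonics: List[str], n: int = 3) -> Dict[Ngram, float]:
    """Relative frequency of every run of `n` consecutive instructions."""
    grams = [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}


def total_variation(p: Dict[Ngram, float], q: Dict[Ngram, float]) -> float:
    """Total variation distance: 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical listings of the same function as two compilers might emit it.
compiler_a = ["push", "mov", "sub", "mov", "add", "mov", "pop", "ret"]
compiler_b = ["push", "mov", "sub", "lea", "xor", "mov", "pop", "ret"]

dist_a = ngram_distribution(compiler_a)
dist_b = ngram_distribution(compiler_b)
distance = total_variation(dist_a, dist_b)
```

In this toy input the two listings share only the prologue and epilogue runs, so four of the six runs on each side have no counterpart and the distance comes out at two thirds.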
XTEA distribution test
Frequency of words in the binary (figure panels): (a) LLVM, (b) MSVC, (c) Intel C++, (d) GCC.
Mean distribution of optimized binaries
Natural language processing
Text Mining
Language detection
Author detection
Text Classification
Document clustering
Stochastic models
Neural networks
Advantages
+ effective with a small number of training vectors
+ assesses proximity to all samples
Disadvantages
- the model must be predetermined:
words are defined manually
redundant elements are analyzed manually
retraining is limited
Probability model
Advantages
+ words are defined automatically
+ training on positive vectors only
+ unified training (flexible retraining)
Disadvantages
- requires a large training set
- errors in estimating the distribution
- computational complexity
Preparation
Collecting statistical samples
Python
Building the list of maximal repeated sequences
Disassembly
String searching
Matlab
Stochastic models
Code generator lexemes
From disassembly to lexemes
Lexeme
sequences of 3 to 6 instructions
unknown bytes are ignored
maximal repeated sequences
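A minimal sketch of this lexeme definition, assuming the disassembler output is a list of mnemonics in which `None` stands for bytes that could not be decoded. Dropping those entries before forming sequences is one possible reading of "ignored", and the name `candidate_lexemes` is hypothetical. Filtering the result down to maximal repeats is left out for brevity.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Lexeme = Tuple[str, ...]


def candidate_lexemes(
    instructions: List[Optional[str]],
    min_len: int = 3,
    max_len: int = 6,
    min_repeats: int = 2,
) -> Dict[Lexeme, int]:
    """Count instruction sequences of 3 to 6 mnemonics that occur repeatedly.

    `None` marks bytes the disassembler could not decode; such entries are
    dropped before the sequences are formed.
    """
    known = [ins for ins in instructions if ins is not None]
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(known) - n + 1):
            counts[tuple(known[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_repeats}


# Hypothetical stream with one undecodable entry between two prologues.
stream = ["push", "mov", "sub", None, "push", "mov", "sub", "ret"]
lexemes = candidate_lexemes(stream)
```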
Lexeme analysis
Suffix tree example
Suffix tree:
memory-efficient,
string searching faster than O(N^2),
fast detection of maximal repeats in strings
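A full suffix tree is too long to show here, so the sketch below finds the longest repeated instruction run with a naive sorted list of suffixes instead. This is a simpler substitute, not the structure used in the report: it gives up the memory and speed advantages listed above in exchange for brevity. The function name `longest_repeat` is an assumption.

```python
from typing import List, Sequence, Tuple


def longest_repeat(tokens: Sequence[str]) -> Tuple[str, ...]:
    """Longest token run occurring at least twice (occurrences may overlap).

    After sorting, the suffixes that share the longest common prefix are
    adjacent, so only neighbouring suffixes need to be compared. Building and
    sorting all suffixes this way takes quadratic memory; a suffix tree can
    answer the same question in linear time.
    """
    suffixes: List[Tuple[str, ...]] = sorted(
        tuple(tokens[i:]) for i in range(len(tokens))
    )
    best: Tuple[str, ...] = ()
    for left, right in zip(suffixes, suffixes[1:]):
        common = 0
        while common < min(len(left), len(right)) and left[common] == right[common]:
            common += 1
        if common > len(best):
            best = left[:common]
    return best


# Hypothetical mnemonic stream with one repeated pair.
repeat = longest_repeat(["mov", "add", "mov", "add", "ret"])
```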
Anomaly detection by neural networks
Radial basis networks
Neural network architecture
no need to choose the number of hidden layers
no convergence pathologies
fast convergence through a combination of learning algorithms
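For readers unfamiliar with the architecture, here is a small radial basis network in plain Python: one hidden layer of Gaussian units with fixed centers and a linear output trained by the delta rule. The class name, the choice of the training points as centers, the width, and the XOR toy data are all assumptions for illustration; the report does not describe its actual network configuration or training procedure.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


class RBFNetwork:
    """Single hidden layer of Gaussian units followed by a linear output."""

    def __init__(self, centers: List[Vector], width: float) -> None:
        self.centers = centers
        self.width = width
        self.weights = [0.0] * len(centers)

    def _hidden(self, x: Vector) -> List[float]:
        # Gaussian activation of each hidden unit for input x.
        return [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * self.width ** 2))
            for c in self.centers
        ]

    def predict(self, x: Vector) -> float:
        return sum(w * h for w, h in zip(self.weights, self._hidden(x)))

    def fit(self, xs: List[Vector], ys: List[float],
            rate: float = 0.5, epochs: int = 200) -> None:
        # Delta rule on the output weights only; the centers stay fixed.
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                hidden = self._hidden(x)
                error = y - sum(w * h for w, h in zip(self.weights, hidden))
                self.weights = [w + rate * error * h
                                for w, h in zip(self.weights, hidden)]


# Toy data (XOR), used only to show that the network separates the classes.
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 0.0]
net = RBFNetwork(centers=inputs, width=0.5)
net.fit(inputs, targets)
labels = [round(net.predict(x)) for x in inputs]
```

Only the output weights are learned here, which is why there is a single hidden layer and no layer count to choose.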
Compiler detection
Compiler detection testing
Anomaly detection by probability model
Multivariate Gamma
Using a set of bivariate and trivariate Gamma distributions:
Gamma distribution is assumed
Sample proximity
Fast training
Figure: empirical and theoretical PDF of an element (curves: Gamma PDF, empirical PDF; x axis: X, from −0.02 to 0.12; y axis: PDF, from 0 to 40)
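The comparison of an empirical and a theoretical PDF can be sketched for the univariate case as follows. The slide refers to bivariate and trivariate Gamma distributions; this example deliberately simplifies to one dimension, and the method-of-moments fit is an assumption, since the report does not state how the parameters were estimated. The sample values are made up.

```python
import math
from statistics import mean, pvariance
from typing import List, Tuple


def fit_gamma(samples: List[float]) -> Tuple[float, float]:
    """Method-of-moments estimate of the Gamma shape k and scale theta."""
    mu = mean(samples)
    var = pvariance(samples, mu=mu)
    return mu * mu / var, var / mu


def gamma_pdf(x: float, shape: float, scale: float) -> float:
    """Gamma density, evaluated in log space to avoid overflow for large shape.

    The boundary x = 0 is treated as density 0 for simplicity.
    """
    if x <= 0.0:
        return 0.0
    log_pdf = (
        (shape - 1.0) * math.log(x)
        - x / scale
        - math.lgamma(shape)
        - shape * math.log(scale)
    )
    return math.exp(log_pdf)


# Hypothetical relative frequencies of one lexeme across training binaries.
frequencies = [0.01, 0.02, 0.03, 0.04]
shape, scale = fit_gamma(frequencies)
density_at_mean = gamma_pdf(0.025, shape, scale)
```

A new binary whose frequency for this lexeme falls where the fitted density is very low would then count as far from the training samples.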
Probability model testing
Error graphs of compiler probabilities based on the coefficient of the minimal value, $P^{p}_{i} = P^{\min}_{i} \cdot 10^{\mathit{coef}}$
Figure, left panel: false positive GCC O0; false negative Clang, Intel, GCC O2, MS (x axis: coeff for min value, 0 to 10; y axis: error, 0 to 1)
Figure, right panel: false positive MS; false negative LLVM (x axis: coeff for min value, 0 to 20; y axis: error, 0 to 1)
Probability model testing
Problem of existing zero elements
Figure (two panels with the same legend): false positive GCC O2; false negative Clang, Intel, GCC O0, MS (x axis: coeff for min value, 0 to 10; y axis: error, 0 to 1)
Conclusion
Proposed a connection between natural language and assembler
Developed algorithms for lexical analysis of assembler language
Developed experimental stochastic models:
based on neural networks
based on a probability model
Implemented lexical analysis of assembler language
Approximate false positive error rates of compiler detection:
27%
10-15%
Questions?