13. Standard way
•
•
•
•
Hex editor
Researcher
Brain (equally important as the Hex editor)
Basic knowledge how data can be organized
(in brain)
• Analysis of the executable file that
manipulates with data format
14. 3 ways of researching
• Dynamic analysis – use reverse engineering of
application that manipulate with data format
• Static analysis – try to extract header,
structures, fields and try to find relationship
between data
• Statistic analysis – when have some amount
of samples and use statistics of changes and
ranges of values
15. Breaking the rules
• There is the rule RTFM (Read The F**king
Manual)
• Nobody likes it
• I’m not exception
• First of all I start my research, and second – try
to find related works and generalize ideas
from them
16. Definitions
• Field – some value used to describe Data
Format
• Structure – way for organizing various fields
• Data – information for representing which
Data Format is developed
• Header – common structure before data, may
contain substructures
18. Main Idea
• Data format is developed by human (not pets or
aliens)
• Sometimes looks like that no human works on it…
• Format developers use common data
organization concepts and similar thoughts when
creating new data formats
• If we find regularities in data format organization
rules we can automate searching of them
19. Three types of data format
• File formats – multimedia formats, database
formats, internal formats for exchanging
between program components and etc.
• Protocols – network protocols, hardware
device interaction protocols, protocols of
interaction between driver and user space
application and etc.
• Structures in memory – OS structures,
application structures and etc.
21. Tasks to solve
•
•
•
•
•
•
•
Extract header, separate it from data
Find field boundaries
Find structures and substructures
Find types of fields
Detect bit and byte ordering
Determine semantics of fields
Interpret goal of field – it’s semantics
22. Bit order
• There are most significant bit – MSB and least
significant bit - LSB
• MSB is a left-most bit – commonly used – value is
10010101b – 95h - 149
• LSB is a left-most bit - value is 10101001b – A9h 169
23. Byte order
• Little-endian – Intel byte order
• Big-endian – network byte order
• Middle-endian or mixed-endian – mixed byte order –
when length of value more then default processor
machine word
25. Types of fields
• Service fields – for describing Data Format
(size of structure and etc.)
• Common fields – “fields from life” (time, date
and etc.)
• Specific fields – we can find range for that
type (bit flags and etc.)
• Application specific – can be interpret only by
application, we should use dynamic analysis
26. Field Size and Levels of field
interpretation
• Commonly field is a byte sequence
• Sometimes field is a bit sequence
•
•
•
•
Fixed size in bytes (1, 2, 3, 4, …)
Fixed size in bits (1, 2, 3, 4, …)
Variable size in bytes
Variable size in bits
27. Basic field types
•
•
•
•
•
•
Hex value – by default
Decimal value
Character value (up to 4 symbols)
String (ASCII or Unicode)
Float value
Bits value
29. Text
• Usually fixed size, remaining space after text
filled with 0
• Variable sized text ended with 0, or there are
size of this text field
• Usually text string contain their meanings
30. Offset
• Offset to another field
• Offset to data
• Pointer to another structure (may be child or
parent)
• Pointer to another instance of such structure
(next or previous in linked list)
• Can be relative and absolute
31. Export in PE-EXE
Module ImageBase
IMAGE_NT_HEADERS
+ImageBase
Signature DWORD
IMAGE_DOS_HEADER
IMAGE_EXPORT_DIRECTORY
+ImageBase
...
...
FileHeader
IMAGE_FILE_HEADER
+01Ch
AddressOfFunctions
e_lfanew DWORD
+03Ch
OptionalHeader
IMAGE_OPTIONAL_HEADER
+020h
AddressOfNames
IMAGE_DIRECTORY_ENTRY
_EXPORT
+024h
AddressOfNameOrdinals
...
+078h
...
Names (Offsets)
...
(Elem size - dword)
Offset FuncName X
Offset FuncName X+1
Offset FuncName X+2
...
+ImageBase
+ImageBase
Name Ordinals
Index
=
Index
Function Names
+ImageBase
...
‘LoadLibraryA’,0
‘LoadLibraryExA’,0
‘LoadLibraryExW’,0
...
...
...
(Elem size - word)
Index FuncAddr X
Index FuncAddr X+1
Index FuncAddr X+2
...
Find
LoadLibraryExA
+ImageBase
Functions Addresses
...
(Elem size - dword)
FuncAddr X
FuncAddr X+1
FuncAddr X+2
...
LoadLibraryExA EP:
mov edi,edi
push ebp
...
+ImageBase
32. Size
•
•
•
•
Size of data
Size of structure
Size of substructure
Size of field
• Can be not only in bytes, but also in bits,
words (2 byte), double words, paragraphs (16
byte) and etc.
33. Quantity
• Quantity of substructures
• Quantity of elements
• IMAGE_NT_HEADERS.IMAGE_FILE_HEADER.
NumberOfSections
• IMAGE_EXPORT_DIRECTORY.
NumberOfFunctions
34. ID
• Identifier of data format, often called magic
number
• ID of structure
• ID of substructure
• Often ID is value consist of char symbols
• Often ID looks like “magic” - BE BA FE CA
40. Projects, articles and authors
• Protocol Informatics Project – based on “Network Protocol Analysis
using Bioinformatics Algorithms” by Marshall A. Beddoe
• Discoverer - “Discoverer: Automatic Protocol Reverse Engineering
from Network Traces” by Weidong Cui, Jayanthkumar Kannan,
Helen J. Wang
• Laika – “Digging For Data Structures” by Anthony Cozzie, Frank
Stratton, Hui Xue, and Samuel T. King
• RolePlayer – “Protocol-Independent Adaptive Replay of Application
Dialog” by Weidong Cuiy, Vern Paxsonz, Nicholas C. Weaverz, Randy
H. Katzy
• Tupni – “Tupni: Automatic Reverse Engineering of Input Formats”
by Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, Luiz
Irun-Briz
45. Applications
•
•
•
•
•
•
•
•
RE any program
RE undocumented/proprietary file formats
RE undocumented/proprietary network protocols
RE undocumented structures in memory
Fuzzing
Examination of protocol implementation
Replay network interaction
Zero-day vulnerability signature generation