Byte-wise approximate matching has become an important field in computer science, with both practical value and theoretical significance. This talk uses six cases to define and describe the concept of approximate matching rigorously: identicalness, containment, cross-sharing, similarity, approximate containment and approximate cross-sharing. Based on these concepts, one can propose a theoretical framework that consists of a family of approximate matching, searching and clustering problems. Algorithmic solutions to the matching problems and their challenges will be briefed, along with theoretical analysis. The framework also incorporates elements of our previous work on the document fingerprinting problem and on the mathematical evaluation of similarity digest schemes { TLSH, ssdeep, sdhash }. In the end, we will discuss applications in various security disciplines.
2. Copyright 2011 Trend Micro Inc.
Agenda
• Background
• Byte-wise Approximate Matching : 6 Cases
• A Framework : Theory, Algorithms, Technologies
• A Few Algorithms and Analysis
• Practical Applications
• Q & A
Classification 9/10/2015 2
Background
• A Problem in DLP (Data Loss Prevention):
– In 2005, when designing the DLP system for my startup, I had to
solve this problem:
– S = {d1, d2,…, dn} is a bag of sensitive documents. Given any
document T and 0<δ≤1, find a document d ∊ S such that RLV(d,T)≥
δ.
• where RLV(s,t) is a function to measure the relevance of two documents.
• Two challenges: how to construct RLV? How to make the search scalable?
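The problem above can be sketched in Python. RLV is left abstract in the talk; as an illustrative stand-in (my assumption, not the talk's construction), this sketch uses the Jaccard similarity of word 3-gram sets:

```python
# A minimal sketch of the DLP search problem. RLV here is the Jaccard
# similarity of word 3-gram sets -- one possible choice; the talk
# leaves RLV unspecified.

def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def rlv(s, t):
    a, b = shingles(s), shingles(t)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_relevant(S, T, delta):
    """Return a sensitive document d in S with RLV(d, T) >= delta, if any."""
    for d in S:
        if rlv(d, T) >= delta:
            return d
    return None
```

Even this naive linear scan illustrates the scalability challenge: every query compares against all n documents in S.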
Background
• In early 2014, I studied fuzzy hashing :
– a family of similarity preserving hashing techniques & tools
– Example: TLSH, ssdeep, sdhash
– Problem: Given two binary strings s1 and s2, measure the similarity
by SIM(H(s1), H(s2)) = s.
• H is a hash function that preserves string similarity.
• SIM is a function to measure similarity of two hash values
• A challenge: how to evaluate their pros & cons?
Background
• In early 2014, the NIST specification NIST.SP.800-168 introduced a novel
concept of approximate matching:
– To replace the concept of binary similarity matching.
– Four use cases are used to describe this concept:
• Similarity Detection: to identify different versions of a document.
• Cross Correlation: to identify a common object between two documents.
• Embedded Object Detection: to identify a given object inside a document.
• Fragment Detection: to identify the presence of traces/fragments of a
known document in a network stream.
• In 2013, I noticed that eDiscovery has a near-deduplication
problem that requires grouping similar documents together.
Byte-wise Approximate Matching : 6 Cases
• Based on our practice in DLP & malware analysis, we extend the NIST
cases to 6 cases
– that can be described rigorously.
• Conceptual description :
Byte-wise Approximate Matching : 6 Cases
• Intuitive description with binary strings:
Byte-wise Approximate Match: A Rigorous Definition
• Let us start with a few concepts:
– A string s is β-nontrivial if Len(s) ≥β.
• In practice, set β=64.
• This excludes trivial substrings of only a few bytes.
– Let SS(β) = { s | string s is β-nontrivial }
– SSIM(s1, s2) measures similarity between two nontrivial strings.
• Definition 1: Given R[1,…,n], T[1,…,m] ∊ SS(β), we introduce six problems to describe
byte-wise approximate matching:
1. R and T are identical if R and T are the same in bytes, i.e., R=T. This is the problem of
identicalness. We denote it as EM1.
2. If SSIM(R,T) > 0, R and T are similar. This is the problem of similarity. We denote it as AM1.
3. R contains T if there is a β-nontrivial substring R[p, …,q] such that T=R[p, …,q]. This is the
problem of containment. We denote it as EM2.
4. R has a β-nontrivial substring r that is similar to T, i.e., SSIM(r, T) > 0. This is the
problem of approximate containment. We denote it as AM2.
5. R and T are cross-sharing if there exist one or multiple pairs of β-nontrivial substrings <R[p, …,
q], T[u,…,v]> such that R[p,…, q]= T[u,…,v]. This is the problem of cross-sharing. We denote it
as EM3.
6. R and T have two sets of β-nontrivial substrings {r1, r2,…, rn} and {t1, t2,…, tn} respectively such
that rk and tk are similar for k ∊ {1,…,n}. This is the problem of approximate cross-sharing.
We denote it as AM3.
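The three exact problems (EM1-EM3) can be sketched as a toy Python example, using a deliberately small β so that short example strings are nontrivial (the talk sets β = 64 in practice):

```python
# A toy sketch of the three exact cases (EM1-EM3) on byte strings.
BETA = 4  # the talk uses beta = 64 in practice; 4 keeps examples short

def em1_identical(R, T):
    """EM1: R and T are byte-for-byte identical."""
    return R == T

def em2_contains(R, T):
    """EM2: R contains T, i.e. T is a beta-nontrivial substring of R."""
    return len(T) >= BETA and T in R

def em3_cross_sharing(R, T):
    """EM3: R and T share at least one common beta-nontrivial substring."""
    subs = {R[i:i + BETA] for i in range(len(R) - BETA + 1)}
    return any(T[i:i + BETA] in subs for i in range(len(T) - BETA + 1))
```

The approximate counterparts (AM1-AM3) replace byte equality with SSIM, which the following slides instantiate via fuzzy hashing.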
Byte-wise Approximate Match: A Rigorous Definition
• Definition 2: Given R[1,..,n] and T[1,…,m] ∊ SS(β), if any case
in definition 1 is true, we say R and T are byte-wise relevant.
This is a novel relationship.
– We denote this as BR(R,T)= 1, otherwise BR(R,T)= 0.
• Definition 3: Let X , Y ∊ { EM1,EM2, EM3 ,AM1, AM2, AM3}. If
problem X is a special case of problem Y , we denote this as X ↪ Y.
[Diagram: the special-case relations (↪) among the six problems EM1, EM2, EM3, AM1, AM2, AM3]
A Framework of Theory, Algorithms and Technologies
• S = an object space.
• R = a relationship between objects in S.
• Three problems of interest:
1. Matching: Given G1, G2 ∊ S, one determines if R(G1, G2) = 1.
2. Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B
such that R(o, b) = 1.
3. Clustering: Given a bag B of objects, partition B into a set of
groups { G1, G2,…,Gm} based on R.
• Given the byte-wise relevance BR , we need solutions for
– Matching
– Searching
– Clustering
A Framework of Theory, Algorithms and Technologies
• Matching problems & the solutions:
A Framework of Theory, Algorithms and Technologies
• Searching problem for the relationship BR :
– B is a bag of β-nontrivial strings. Given T ∊ SS(β), find s ∊ B such
that BR(T, s)=1.
A Framework of Theory, Algorithms and Technologies
• How to solve the searching problem?
– Brute force approach: for every s ∊ B, we evaluate BR(T, s). This does not
scale when B has, say, 1 million strings.
• It is a lazy idea!
– Candidate selection approach:
• STEP 1: select a few candidates {s1, s2,…,sm} quickly
• STEP 2: evaluate each BR(T, sk).
– How to select “good” candidates?
• String tokenizer: extract tokens from each string from B.
• Indexer: index the tokens along with the string ID to create an index DB, the FP-DB.
• Searcher: given T, generate its tokens {FP1, FP2,…,FPq} and use them to search for
possible candidates in the FP-DB. Then evaluate BR(T, s) for each candidate s.
– NOTE:
• This is similar to a keyword based search engine where the keywords are the
tokens.
• A special kind of token is the string fingerprint.
– Other tokens include k-grams, k-subsequences, blocks and chunks .
» Fingerprints are generated from special blocks.
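The candidate-selection pipeline above can be sketched as follows; the fixed-width k-gram tokenizer is one illustrative choice of token among those listed (k-grams, k-subsequences, blocks, chunks):

```python
# A minimal candidate-selection sketch: tokenize strings into k-grams,
# index token -> string IDs (the "FP-DB"), then answer a query by
# looking up the query's tokens and verifying only the candidates.
from collections import defaultdict

K = 8  # token width; an illustrative choice

def tokens(s):
    return {s[i:i + K] for i in range(len(s) - K + 1)}

def build_index(B):
    fp_db = defaultdict(set)
    for sid, s in enumerate(B):
        for tok in tokens(s):
            fp_db[tok].add(sid)
    return fp_db

def search(T, B, fp_db, br):
    """STEP 1: select candidates via the index; STEP 2: verify with BR."""
    candidates = set()
    for tok in tokens(T):
        candidates |= fp_db.get(tok, set())
    return [B[sid] for sid in candidates if br(T, B[sid])]
```

STEP 2 still evaluates BR, but only on the few strings that share at least one token with T, instead of the whole bag B.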
A Framework of Theory, Algorithms and Technologies
• Architecture of candidate selection based approach:
A Framework of Theory, Algorithms and Technologies
• A clustering problem based on the relationship BR :
• Given a bag B of β-nontrivial strings, partition B into a set of groups { G1,
G2,…,Gm} based on BR.
A Framework of Theory, Algorithms and Technologies
• A solution to clustering problem:
• A graph-based approach: if BR(s, t) = 1, Node(s) and Node(t) are connected.
• A group is a connected component of the graph G(V, E) where V = Node(B).
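A minimal sketch of this graph-based clustering, using union-find so the graph never needs to be materialized (an implementation choice, not prescribed by the talk):

```python
# Cluster a bag B: connect s and t when BR(s, t) = 1, then each
# connected component is a group. Union-find tracks components
# without building the graph explicitly.

def cluster(B, br):
    parent = list(range(len(B)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(B)):
        for j in range(i + 1, len(B)):
            if br(B[i], B[j]):
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, s in enumerate(B):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```

Note this makes O(|B|²) BR evaluations; in practice the candidate-selection index from the searching slides can prune the pairs considered.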
A Framework of Theory, Algorithms and Technologies
• The framework can be summarized as follows:
A Few Algorithms and Analysis
• Let us go quickly over the theory and algorithm for two matching
problems :
– Similarity AM1
– Cross-sharing EM3
• What is similarity? How to measure it?
– A traditional approach is to compare two strings directly, such as the
LCS method (longest common subsequence).
– The popular fuzzy hash {ssdeep, sdhash and TLSH} use different
algorithms for measuring the similarity.
A Few Algorithms and Analysis
• A fuzzy hash can be summarized in the following 3 steps:
A Few Algorithms and Analysis
• ssdeep:
– STEP 1: split the string into a sequence of consecutive chunks.
– STEP 2: hash each chunk into 6 bits and place the values sequentially into
an 80-byte container.
– STEP 3: Use Levenshtein distance to measure the similarity.
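The split points in STEP 1 are content-defined, so an insertion shifts at most the surrounding chunks. A simplified sketch follows; the window width, trigger modulus and toy hash are illustrative assumptions, while real ssdeep uses an Adler-style rolling hash with an adaptive block size:

```python
# A simplified sketch of ssdeep's STEP 1: content-defined chunking
# with a rolling hash. A chunk boundary is declared wherever the
# rolling hash of the last WIN bytes hits a trigger value.
MOD = 16   # expected chunk length (illustrative, not ssdeep's value)
WIN = 7    # rolling window width

def split_chunks(data):
    chunks, start, window = [], 0, []
    for i, b in enumerate(data):
        window.append(b)
        if len(window) > WIN:
            window.pop(0)
        h = sum(window)             # toy rolling hash over the window
        if h % MOD == MOD - 1:      # trigger value => chunk boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend only on local content, two versions of a file tend to share most of their chunks, and hence most of their 6-bit chunk hashes.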
A Few Algorithms and Analysis
• sdhash:
– STEP 1: Select a few 64-grams of higher entropy values.
– STEP 2: Generate a hash for each block and insert all hashes into one or
more 256-byte Bloom filters.
– STEP 3: Use Hamming distance to measure the similarity.
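STEP 2 can be sketched as follows. sdhash derives bit positions from a SHA-1 of each selected 64-byte block; the number of bits set per block (k = 5 here) and the exact position derivation below are illustrative assumptions:

```python
# A toy sketch of sdhash's STEP 2: hash selected 64-byte blocks into a
# 256-byte (2048-bit) Bloom filter.
import hashlib

FILTER_BITS = 2048  # 256 bytes

def insert_block(bloom, block, k=5):
    digest = hashlib.sha1(block).digest()
    for i in range(k):
        # fold 16 bits of the digest into a bit position in [0, 2048)
        pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % FILTER_BITS
        bloom[pos // 8] |= 1 << (pos % 8)

def make_filter(blocks):
    bloom = bytearray(FILTER_BITS // 8)
    for b in blocks:
        insert_block(bloom, b)
    return bloom
```

Two strings sharing many selected blocks set many of the same bits, so the filters' bit overlap (compared in STEP 3) reflects shared content.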
A Few Algorithms and Analysis
• TLSH:
– STEP 1: For every 5-gram, select 6 triplets out of the total 10 (= C(5,3)).
– STEP 2: Generate a hash for each triplet and map them into a 32-byte
container.
– STEP 3: Use a heuristic diff algorithm to measure the similarity.
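STEP 1 can be sketched as follows. The 6 selected triplets are taken here as those containing the newest byte of each 5-byte window, which matches the 6-of-10 count, though TLSH's exact selection rule and its Pearson bucket hash are simplified away:

```python
# A sketch of TLSH's STEP 1: slide a 5-byte window over the string,
# form 6 of the C(5,3) = 10 byte triplets per window, and count each
# triplet into one of the buckets. The bucket hash is a toy stand-in.
from itertools import combinations

BUCKETS = 128

def triplet_buckets(data):
    counts = [0] * BUCKETS
    for i in range(4, len(data)):
        win = data[i - 4:i + 1]                 # 5-gram ending at byte i
        for a, b in combinations(range(4), 2):  # 6 pairs of the 4 older bytes
            trip = (win[a], win[b], win[4])     # each triplet keeps the newest byte
            counts[hash(trip) % BUCKETS] += 1   # toy bucket hash
        # the 4 triplets not containing the newest byte are skipped
    return counts
```

The bucket counts are then quantized into the 32-byte container, which STEP 3 compares with a heuristic distance.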
A Few Algorithms and Analysis
• Summary of Three Fuzzy Hashing Algorithms:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy
values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest which
preserves similarity to a certain degree.
• ssdeep model: a sequence of chunks is mapped into an 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or multiple 256-byte
bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.
A Few Algorithms and Analysis
• Three approaches for measuring similarity … ssdeep, sdhash
& TLSH all use digest comparison.
• The 1st model plays a critical role in similarity comparison.
• Let us focus on various 1st models today,
• based on a unified format.
• The 2nd model saves space but further reduces accuracy.
A Few Algorithms and Analysis
• Unified format for 1st model:
– A string is described as a collection of tokens (aka, features)
organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:
A Few Algorithms and Analysis
• 4 categories of tokens :
– k-grams where k is as small as 3,4,…
– k-subsequences: any subsequence with length k. The triplet in TLSH
is an example.
– Chunks: whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
• 8 ways (2 data structures × 4 token types) to describe a string for similarity:
A Few Algorithms and Analysis
• Evaluate a fuzzy hash based on follows:
– Data Structure:
• Bag: a bag ignores the order of tokens. It is good at handling content swapping.
• Sequence: a sequence organizes tokens in an order. This is weak for handling
content swapping.
– Tokens:
• k-grams: Due to the small k ( 3,4,5,…), this fine granularity is good at handling
fragmentation.
• k-subsequences: Due to the small k (3, 4, 5, …), this fine granularity is good at handling
fragmentation.
• Chunks: This approach takes account of every byte in raw granularity. It should be
OK at handling containment and cross-sharing.
• Blocks: Depending on the selection function, blocks do not take account of
every byte, but they may represent a string more efficiently, which is good
for generating similarity digests. Due to the fixed block length, they are
good at handling containment and cross-sharing.
• M2.4 leads to a novel fuzzy hashing scheme : TSFP
– It has some capabilities beyond existing schemes.
– I will not introduce it today due to limited time.
A Few Algorithms and Analysis
• Let us investigate EM3 :
– The cross-sharing problem.
• What is cross-sharing ? And how to measure it?
• Given a string, any two substrings follow one out of three cases:
A Few Algorithms and Analysis
• Cross-sharing … some examples :
A Few Algorithms and Analysis
• Definition 1: Given T∊ SS(β), let Ω(T)= { s | s is a β-nontrivial substring
of T}.
– Ω(T) is the set of all β-nontrivial substrings of T.
• Definition 2: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). If there exists
a bijective mapping F: SR → ST such that F(r) = t and r = t (as strings), we
say that SR and ST are canonical with F.
A Few Algorithms and Analysis
Theorem 1: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T), SR and ST are
canonical with F: SR → ST. ∀ r1, r2 ∊ SR, one of the following cases holds:
NOTE: we are only interested in cases 1-3.
A Few Algorithms and Analysis
Definition 3: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). SR and ST are
canonical with F: SR → ST.
– ∀ r1 and r2 ∊ SR which are not identical, if only case 1 holds, we say that SR
and ST are translative.
– ∀ r1 and r2 ∊ SR which are not identical, if one of cases 1-3 holds, we say that SR
and ST are weakly translative.
– Let L(SR, ST) = Σ<r,t> Len(r), the total length of the matched substrings, for measuring SR × ST.
A Few Algorithms and Analysis
• Definition 4: Given R, T∊ SS(β), we define two measurements for cross-
sharing between R and T at two levels:
– L1(R, T) = Max { L(SR, ST) | SR and ST are translative }
– L2 (R, T) = Max { L(SR, ST) | SR and ST are weakly translative }
• Definition 5: Given T ∊ SS(β), if its β-grams (i.e., β-length substrings)
are pairwise distinct, T is β-nonrepetitive.
– This is to measure how random T is.
• Theorem 2 : Given R,T ∊ SS(β), we have:
– L1(R, T) ≤ L2(R, T)
– If both R and T are β-nonrepetitive, L1(R, T) = L2(R, T)
A Few Algorithms and Analysis
Algorithm (identify cross-shared substrings)
– INPUT: R[1, …, m] and T[1, …, n].
– SUMMARY:
1. Use a rolling hash function H(x) to slide a β-width window along the string R, generating
m-β+1 hash values. Store them in a hash table HT with separate chaining, using linked
lists to resolve collisions. The nodes in the linked lists of HT save the offsets where
the hash values were created.
2. Starting from the first offset of T, use the same rolling hash to slide a β-width window
along T and look each value up in the hash table HT. If there is no match, continue to the
next offset; otherwise extend the match maximally, then continue from an offset near the
end of the matched substring of T.
– OUTPUT: SR, ST and the mapping F: SR → ST.
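The two steps can be sketched with a Rabin-Karp style polynomial rolling hash standing in for H; a dict of offset lists replaces the chained hash table HT:

```python
# A sketch of the cross-sharing algorithm: hash every beta-width
# window of R into a table of offsets (step 1), then slide the same
# rolling hash along T and extend each verified hit maximally (step 2).
BASE, PRIME = 257, (1 << 61) - 1

def window_hashes(s, beta):
    """Polynomial rolling hashes of every beta-width window of s."""
    power = pow(BASE, beta - 1, PRIME)
    out, h = [], 0
    for i in range(len(s)):
        if i >= beta:
            h = (h - s[i - beta] * power) % PRIME  # drop the oldest byte
        h = (h * BASE + s[i]) % PRIME              # add the newest byte
        if i >= beta - 1:
            out.append(h)                          # hash of s[i-beta+1 : i+1]
    return out

def cross_shared(R, T, beta):
    ht = {}  # step 1: hash value -> list of offsets in R
    for off, h in enumerate(window_hashes(R, beta)):
        ht.setdefault(h, []).append(off)
    pairs, th, j = [], window_hashes(T, beta), 0
    while j < len(th):  # step 2: scan T
        hit = next((p for p in ht.get(th[j], [])
                    if R[p:p + beta] == T[j:j + beta]), None)  # rule out collisions
        if hit is None:
            j += 1
            continue
        L = beta  # extend the match maximally
        while hit + L < len(R) and j + L < len(T) and R[hit + L] == T[j + L]:
            L += 1
        pairs.append((R[hit:hit + L], T[j:j + L]))
        j += L  # continue near the end of the matched substring of T
    return pairs
```

The returned pairs give one candidate <SR, ST> with its mapping F; per Theorem 3 below, its total length L(SR, ST) brackets the two cross-sharing measurements.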
A Few Algorithms and Analysis
Theorem 3: <SR, ST> is the output of the algorithm above. If ST = {}, L1(R,T) = L2(R,T) = 0;
otherwise, we have
– SR and ST are weakly translative;
– L1(R, T) ≤ L(SR, ST) ≤ L2(R, T);
– If R is β-nonrepetitive, L(SR, ST) = L2(R, T);
– If T is β-nonrepetitive, we have (a) L1(R, T) = L(SR, ST); (b) SR and ST are translative.
A Few Algorithms and Analysis
• Let me summarize what has been done:
Practical Applications
• This framework can be applied to the following areas.
– E-Discovery
• Grouping near-duplicate documents
• Comparing near-duplicate documents
– Digital forensic analysis
• Identifying similar objects or files
– Anti-plagiarism
• Copy detection
– Source code governance
– Malware analysis
– Spam filtering
– DLP
– etc
Q&A
• Thank you for your attention.
• Do you have any questions?
• Email: liwei_ren@trendmicro.com
• Home page: http://pitt.academia.edu/LiweiRen for external
talks.