SlideShare uma empresa Scribd logo
1 de 41
Baixar para ler offline
Copyright 2011 Trend Micro Inc. 1
Bytewise Approximate Match: Theory,
Algorithms and Applications
Liwei Ren, Ph.D, Trend Micro
EISIC 2015, Manchester, UK, Sept, 2015
Trend Micro
Copyright 2011 Trend Micro Inc.
Agenda
• Background
• Byte-wise Approximate Matching : 6 Cases
• A Framework : Theory, Algorithms, Technologies
• A Few Algorithms and Analysis
• Practical Applications
• Q & A
Classification 9/10/2015 2
Copyright 2011 Trend Micro Inc.
Background
• A Problem in DLP (Data Loss Prevention):
– In 2005, when designing the DLP system for my startup, I had to
solve this problem:
– S = {d1, d2,…, dn} is a bag of sensitive documents. Given any
document T and 0<δ≤1, find a document d ∊ S such that RLV(d,T)≥
δ.
• where RLV(s,t) is a function to measure the relevance of two documents.
• Two challenges: how to construct RLV ? How to make search scalable?
Classification 9/10/2015 3
Copyright 2011 Trend Micro Inc.
Background
• In early 2014, I studied fuzzy hashing :
– a family of similarity preserving hashing techniques & tools
– Example: TLSH, ssdeep, sdhash
– Problem: Given two binary strings s1 and s2, measure the similarity
by SIM(H(s1), H(s2)) = s.
• H is a hash function that preserves string similarity.
• SIM is a function to measure similarity of two hash values
• A challenge: how to evaluate pros & cons between them?
Classification 9/10/2015 4
Copyright 2011 Trend Micro Inc.
Background
• In early 2014, the NIST specification document NIST.SP.800-168
introduces a novel concept of approximate matching :
– To replace the concept of binary similarity matching.
– Four use cases are used to describe this concept:
• Similarity Detection: to identify different versions of a document.
• Cross Correlation: to identify a common object between two documents.
• Embedded Object Detection: to identify a given object inside a document.
• Fragment Detection: to identify the presence of traces/fragments of a
known document in network stream.
• In 2013, I noticed that eDiscovery has a near deduplication
problem that needs to group similar documents together.
5
Copyright 2011 Trend Micro Inc.
Background
• I started this effort in 2014:
6
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
• We extend these NIST cases to 6 cases due to our practice in
DLP & malware analysis.
– that can be described rigorously.
• Conceptual description :
7
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
8
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
9
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
• Intuitive description with binary strings:
10
Copyright 2011 Trend Micro Inc.
Byte-wise Approximate Match: A Rigorous Definition
• Let us start with a few concepts:
– A string s is β-nontrivial if Len(s) ≥β.
• In practice, set β=64.
• This is excluding the triviality of substrings such as substrings of a few bytes.
– Let SS(β)={ s | string s is β-nontrivial } !!!
– SSIM(s1, s2) measures similarity between two nontrivial strings.
• Definition 1: Given R[1,..,n], T[1,…,m] ∊ SS(β), we introduce six problems to describe
byte-wise approximate matching:
1. R and T are identical if R and T are the same in bytes, i.e., R=T. This is the problem of
identicalness. We denote it as EM1.
2. If SSIM(R,T) > 0, R and T are similar. This is the problem of similarity. We denote it as AM1.
3. R contains T if there is a β-nontrivial substring R[p, …,q] such that T=R[p, …,q]. This is the
problem of containment. We denote it as EM2.
4. R has a β-nontrivial substring r that is similar to T. This is the problem of approximate
containment. We denote it as AM2.
5. R and T are cross-sharing if there exist one or multiple pairs of β-nontrivial substrings <R[p, …,
q], T[u,…,v]> such that R[p,…, q]= T[u,…,v]. This is the problem of cross-sharing. We denote it
as EM3.
6. R and T have two sets of β-nontrivial substrings {r1, r2,…, rn} and {t1, t2,…, tn} respectively such
that rk and tk are similar for k ∊ {1,…,n}. This is the problem of approximate cross-sharing.
We denote it as AM3.
11
Copyright 2011 Trend Micro Inc.
Byte-wise Approximate Match: A Rigorous Definition
• Definition 2: Given R[1,..,n] and T[1,…,m] ∊ SS(β), if any case
in definition 1 is true, we say R and T are byte-wise relevant.
This is a novel relationship.
– We denote this as BR(R,T)= 1, otherwise BR(R,T)= 0.
• Definition 3: Let X , Y ∊ { EM1,EM2, EM3 ,AM1, AM2, AM3}. If
problem X is a special case of problem Y , we denote this as X ↪ Y.
12
EM1 EM2 EM3
AM1 AM2 AM3
↪ ↪
↪ ↪
↪
↪
↪
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• S = An object space S.
•R = A relationship for objects in S.
• Three problems of interest:
1. Matching: Given G1 , G2 ∊ S, one determines if R (G1,G2) =1.
2. Searching : B ⊆ S is a bag of objects . Given o ∊ S , find b ∊ B
such that R (o, b )=1.
3. Clustering: Given a bag B of objects, partition B into a set of
groups { G1, G2,…,Gm} based on R.
• Given the byte-wise relevance BR , we need solutions for
– Matching
– Searching
– Clustering
13
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• Matching problems & the solutions:
Classification 9/10/2015 14
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• Searching problem for the relationship BR :
– B is a bag of β-nontrivial strings. Given T ∊ SS(β), find s ∊ B such
that BR(T, s)=1.
Classification 9/10/2015 15
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• How to solve the searching problem?
– Brute force approach : for every s ∊ B, we evaluate BR(T, s). What about
when B has 1 million strings .
• It is a lazy idea!
– Candidate selection approach:
• STEP 1: select a few candidates {s1, s2,…,sm} quickly
• STEP 2: evaluate each BR(T, sk).
– How to select “good” candidates?
• String tokenizer: extract tokens from each string from B.
• Indexer : index the tokens along with the string ID to create a index DB as FP-DB.
• Searcher : given T, generate tokens {FP1, FP2,…,FPq} , we use them to search
possible candidates from FP-DB. Then we evaluate BR(T, s) for each candidate s.
– NOTE:
• This is similar to a keyword based search engine where the keywords are the
tokens.
• A special token is string fingerprint.
– Other tokens include k-grams, k-subsequences, blocks and chunks .
» Fingerprints are generated from special blocks.
16
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• Architecture of candidate selection based approach:
17
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• A clustering problem based on the relationship BR :
• Given a bag B of β-nontrivial strings, partition B into a set of groups { G1,
G2,…,Gm} based on BR.
Classification 9/10/2015 18
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• A solution to clustering problem:
• A graph based approach: if BR(s,t)=1, Node(s) and Node(t) are connected.
• A group is a connected sub-graph of the G(V,E) where V=Node(B).
Classification 9/10/2015 19
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• The framework can be summarized as follows:
Classification 9/10/2015 20
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Let us go quickly over the theory and algorithm for two matching
problems :
– Similarity AM1
– Cross-sharing EM3
• What is similarity? How to measure it?
– A traditional approach is to compare two strings directly such as the
LCS method ( largest common subsequence).
– The popular fuzzy hash {ssdeep, sdhash and TLSH} use different
algorithms for measuring the similarity.
Classification 9/10/2015 21
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Fuzzy hash can be summarized in the following 3 steps:
Classification 9/10/2015 22
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• ssdeep:
– STEP 1: split the string into a sequence of consecutive chunks.
– STEP 2: hash each chunk into 6 bits and place them into a 80-byte
container sequentially.
– STEP 3: Use Levenshtein distance to measure the similarity.
Classification 9/10/2015 23
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• sdhash:
– STEP 1: Select a few 64-grams of higher entropy values.
– STEP 2: Generate a hash for each and them all hashes into one or
more 256-byte bloom filters.
– STEP 3: Use Hamming distance to measure the similarity.
Classification 9/10/2015 24
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• TLSH:
– STEP 1: For every 5-gram, select 6 triplets out of total 10 (=C5
3).
– STEP 2: Generate a hash for each triplet and map them into a 32-byte
container.
– STEP 3: Use a heuristic diff algorithm to measure the similarity.
Classification 9/10/2015 25
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Summary of Three Fuzzy Hashing Algorithms:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy
values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest which
is able to preserve similarity to certain degree.
• ssdeep model: a sequence of chunks is mapped into a 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or multiple 256-byte
bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.
Classification 9/10/2015 26
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Three approaches for measuring similarity … {ssdeep, sdhash
& TLSH use digest comparison.
Classification 9/10/2015 27
• 1st model plays critical role for similarity comparison.
• Let focus on discussing various 1st models today.
• Based on a unified format.
• 2nd model saves space but further reduces accuracy.
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Unified format for 1st model:
– A string is described as a collection of tokens (aka, features)
organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:
Classification 9/10/2015 28
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• 4 categories of tokens :
– k-grams where k is as small as 3,4,…
– k-subsequences: any subsequence with length k. The triplet in TLSH
is an example.
– Chunks: whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
• 8 ways to describe a string for similarity:
Classification 9/10/2015 29
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Evaluate a fuzzy hash based on follows:
– Data Structure:
• Bag: a bag ignores the order of tokens. It is good at handling content swapping.
• Sequence: a sequence organizes tokens in an order. This is weak for handling
content swapping.
– Tokens:
• k-grams: Due to the small k ( 3,4,5,…), this fine granularity is good at handling
fragmentation.
• k-sequences: Due to the small k ( 3,4,5,…), this fine granularity is good at handling
fragmentation .
• Chunks: This approach takes account of every byte in raw granularity. It should be
OK at handling containment and cross sharing
• Blocks: Depending on different selection functions, even though it does not take
account of every byte, but it may present a string more efficiently and that is good
for generating similarity digests. Due to the nature of fixed length blocks, it is good
at handling containment and cross sharing.
• M2.4 leads to a novel fuzzy hashing scheme : TSFP
– It has some capabilities beyond existing schemes.
– I am not introduce it today due to limited time.
30
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Let us investigate EM3 :
– The cross-sharing problem.
• What is cross-sharing ? And how to measure it?
• Given a string, any two substrings follow one out of three cases:
31
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Cross-sharing … some examples :
32
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Definition 1: Given T∊ SS(β), let Ω(T)= { s | s is a β-nontrivial substring
of T}.
– Ω(T) is the set of all β-nontrivial substrings of T.
• Definition 2: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). If there exists
a bijective mapping F: SR  ST such that F(r)=t and r=t, we say that SR
and ST are canonical with F.
33
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Theorem 1: Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T), SR and ST are
canonical with F: SR  ST. ∀ r1 , r2 ∊ SR, one of following cases holds:
34
NOTE: we are only interested in case 1-3.
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Definition 3 : Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). SR and ST are
canonical with F: SR  ST.
– ∀ r1 and r2 ∊ SR which are not identical, if only case 1 holds, we say that SR
and ST are translative.
– ∀ r1 and r2 ∊ SR which are not identical, if one of case 1 -3 holds, we say that SR
and ST are weakly translative.
– Let L(SR, ST) = 𝐋𝐞𝐧(𝐫)<𝐫,𝐭> for measuring SR × ST .
35
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Definition 4: Given R, T∊ SS(β), we define two measurements for cross-
sharing between R and T at two levels:
– L1(R, T) = Max { L(SR, ST) | SR and ST are translative }
– L2 (R, T) = Max { L(SR, ST) | SR and ST are weakly translative }
• Definition 5 : Given T∊ SS(β), if its β-grams (i.e., β-length substrings)
are different to each other, T is β-nonrepetitive.
– This is to measure how random T is.
• Theorem 2 : Given R,T ∊ SS(β), we have:
– L1(R, T) ≤ L2(R, T)
– If both R and T are β-nonrepetitive, L1(R, T) = L2(R, T)
36
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Algorithm ( identify cross-shared substrings)
– INPUT: INPUT: R[1, …, m] and T[1, …, n].
– SUMMARY:
1. Use a rolling hash function H(x) to slide a β-width window along the string R for generating
m+1- β hash values. We store them into a hash table HT with separate chaining using
linked-list to resolve collisions. The nodes in the linked-lists of the hash table HT save the
offsets where hash values are created.
2. From the first offset of T with a the same rolling hash to slide the window of β-width along T,
do match with hash table H. If not matched, continue the next offset, otherwise try to
match the maximum, then continue from an offset around the end of the matched substring
of T.
– OUTPUT: SR ST .
37
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Theorem 3: SR×ST is from Algorithm 1. If ST ={}, L1(R,T)=L2(R,T) = 0,
otherwise, we have
– SR and ST are weakly translative ;
– L1(R, T) ≤ L(SR, ST) ≤ L2 (R, T).
– If R is β–nonrepetitive, L(SR, ST) = L2(R, T).
– If T is β–nonrepetitive, we have (a) L1(R, T) = L(SR, ST) ; (b) SR and ST are translative; .
38
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Let me summarize what have been done:
Classification 9/10/2015 39
Copyright 2011 Trend Micro Inc.
Practical Applications
• This framework can be applied to the following areas.
– E-Discovery
• Grouping near duplicate documents
• Comparing of near duplicate documents
– Digital forensic analysis
• Identifying similar objects or files
– Anti-plagiarism
• Copy detection
– Source code governance
– Malware analysis
– Spam filtering
– DLP
– etc
40
Copyright 2011 Trend Micro Inc.
Q&A
• Thank you for your attention.
• Do you have any questions?
• Email: liwei_ren@trendmicro.com
• Home page: http://pitt.academia.edu/LiweiRen for external
talks.
41

Mais conteúdo relacionado

Mais procurados

Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all thatZhibo Xiao
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introductionYueshen Xu
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with ContextSteffen Staab
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...Innovation Quotient Pvt Ltd
 
A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)Alexander Decker
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015rusbase
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Innovation Quotient Pvt Ltd
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Mit203 analysis and design of algorithms
Mit203  analysis and design of algorithmsMit203  analysis and design of algorithms
Mit203 analysis and design of algorithmssmumbahelp
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphJoint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphFedorNikolaev
 

Mais procurados (20)

Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
 
A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Author Topic Model
Author Topic ModelAuthor Topic Model
Author Topic Model
 
Mit203 analysis and design of algorithms
Mit203  analysis and design of algorithmsMit203  analysis and design of algorithms
Mit203 analysis and design of algorithms
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphJoint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
 

Semelhante a Bytewise Approximate Match: Theory, Algorithms and Applications

Bytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringBytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringLiwei Ren任力偉
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool EvaluationLiwei Ren任力偉
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsLiwei Ren任力偉
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsLiwei Ren任力偉
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsLiwei Ren任力偉
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Rich Heimann
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown BagDataTactics
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniquesPoonam Kshirsagar
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential CompressionLiwei Ren任力偉
 
Exploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's trafficExploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's trafficSandipan Biswas
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Experfy
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...SAIL_QU
 
Relational machine-learning
Relational machine-learningRelational machine-learning
Relational machine-learningBhushan Kotnis
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.pptArumugam90
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Jonathon Hare
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...Madan Golla
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTINGFAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTINGIJNSA Journal
 

Semelhante a Bytewise Approximate Match: Theory, Algorithms and Applications (20)

Bytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringBytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clustering
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical Problems
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting Tools
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential Compression
 
Exploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's trafficExploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's traffic
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
 
Relational machine-learning
Relational machine-learningRelational machine-learning
Relational machine-learning
 
Clique and sting
Clique and stingClique and sting
Clique and sting
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.ppt
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
TINET_FRnOG_2008_public
TINET_FRnOG_2008_publicTINET_FRnOG_2008_public
TINET_FRnOG_2008_public
 
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTINGFAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
 

Mais de Liwei Ren任力偉

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇Liwei Ren任力偉
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural NetworkLiwei Ren任力偉
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维Liwei Ren任力偉
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究Liwei Ren任力偉
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based SecurityLiwei Ren任力偉
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemLiwei Ren任力偉
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsLiwei Ren任力偉
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Liwei Ren任力偉
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsLiwei Ren任力偉
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemLiwei Ren任力偉
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsLiwei Ren任力偉
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyLiwei Ren任力偉
 
Securing Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the CloudSecuring Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the CloudLiwei Ren任力偉
 

Mais de Liwei Ren任力偉 (19)

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇
 
企业安全市场综述
企业安全市场综述 企业安全市场综述
企业安全市场综述
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
 
聊一聊大明朝的火器
聊一聊大明朝的火器聊一聊大明朝的火器
聊一聊大明朝的火器
 
防火牆們的故事
防火牆們的故事防火牆們的故事
防火牆們的故事
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维
 
硅谷的那点事儿
硅谷的那点事儿硅谷的那点事儿
硅谷的那点事儿
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究
 
世纪猜想
世纪猜想世纪猜想
世纪猜想
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based Security
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillators
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problem
 
Math stories
Math storiesMath stories
Math stories
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and Solutions
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) Technology
 
Securing Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the CloudSecuring Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the Cloud
 

Último

SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 

Último (20)

SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 

Bytewise Approximate Match: Theory, Algorithms and Applications

  • 1. Copyright 2011 Trend Micro Inc. 1 Bytewise Approximate Match: Theory, Algorithms and Applications Liwei Ren, Ph.D, Trend Micro EISIC 2015, Manchester, UK, Sept, 2015 Trend Micro
  • 2. Copyright 2011 Trend Micro Inc. Agenda • Background • Byte-wise Approximate Matching : 6 Cases • A Framework : Theory, Algorithms, Technologies • A Few Algorithms and Analysis • Practical Applications • Q & A Classification 9/10/2015 2
  • 3. Copyright 2011 Trend Micro Inc. Background • A Problem in DLP (Data Loss Prevention): – In 2005, when designing the DLP system for my startup, I had to solve this problem: – S = {d1, d2,…, dn} is a bag of sensitive documents. Given any document T and 0<δ≤1, find a document d ∊ S such that RLV(d,T)≥ δ. • where RLV(s,t) is a function to measure the relevance of two documents. • Two challenges: how to construct RLV ? How to make search scalable? Classification 9/10/2015 3
  • 4. Copyright 2011 Trend Micro Inc. Background • In early 2014, I studied fuzzy hashing : – a family of similarity preserving hashing techniques & tools – Example: TLSH, ssdeep, sdhash – Problem: Given two binary strings s1 and s2, measure the similarity by SIM(H(s1), H(s2)) = s. • H is a hash function that preserves string similarity. • SIM is a function to measure similarity of two hash values • A challenge: how to evaluate pros & cons between them? Classification 9/10/2015 4
  • 5. Copyright 2011 Trend Micro Inc. Background • In early 2014, the NIST specification document NIST.SP.800-168 introduces a novel concept of approximate matching : – To replace the concept of binary similarity matching. – Four use cases are used to describe this concept: • Similarity Detection: to identify different versions of a document. • Cross Correlation: to identify a common object between two documents. • Embedded Object Detection: to identify a given object inside a document. • Fragment Detection: to identify the presence of traces/fragments of a known document in network stream. • In 2013, I noticed that eDiscovery has a near deduplication problem that needs to group similar documents together. 5
  • 6. Copyright 2011 Trend Micro Inc. Background • I started this effort in 2014: 6
  • 7. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases • We extend these NIST cases to 6 cases due to our practice in DLP & malware analysis. – that can be described rigorously. • Conceptual description : 7
  • 8. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases 8
  • 9. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases 9
  • 10. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases • Intuitive description with binary strings: 10
  • 11. Copyright 2011 Trend Micro Inc. Byte-wise Approximate Match: A Rigorous Definition • Let us start with a few concepts: – A string s is β-nontrivial if Len(s) ≥β. • In practice, set β=64. • This is excluding the triviality of substrings such as substrings of a few bytes. – Let SS(β)={ s | string s is β-nontrivial } !!! – SSIM(s1, s2) measures similarity between two nontrivial strings. • Definition 1: Given R[1,..,n], T[1,…,m] ∊ SS(β), we introduce six problems to describe byte-wise approximate matching: 1. R and T are identical if R and T are the same in bytes, i.e., R=T. This is the problem of identicalness. We denote it as EM1. 2. If SSIM(R,T) > 0, R and T are similar. This is the problem of similarity. We denote it as AM1. 3. R contains T if there is a β-nontrivial substring R[p, …,q] such that T=R[p, …,q]. This is the problem of containment. We denote it as EM2. 4. R has a β-nontrivial substring r that is similar to T. This is the problem of approximate containment. We denote it as AM2. 5. R and T are cross-sharing if there exist one or multiple pairs of β-nontrivial substrings <R[p, …, q], T[u,…,v]> such that R[p,…, q]= T[u,…,v]. This is the problem of cross-sharing. We denote it as EM3. 6. R and T have two sets of β-nontrivial substrings {r1, r2,…, rn} and {t1, t2,…, tn} respectively such that rk and tk are similar for k ∊ {1,…,n}. This is the problem of approximate cross-sharing. We denote it as AM3. 11
  • 12. Copyright 2011 Trend Micro Inc. Byte-wise Approximate Match: A Rigorous Definition • Definition 2: Given R[1,..,n] and T[1,…,m] ∊ SS(β), if any case in definition 1 is true, we say R and T are byte-wise relevant. This is a novel relationship. – We denote this as BR(R,T)= 1, otherwise BR(R,T)= 0. • Definition 3: Let X , Y ∊ { EM1,EM2, EM3 ,AM1, AM2, AM3}. If problem X is a special case of problem Y , we denote this as X ↪ Y. 12 EM1 EM2 EM3 AM1 AM2 AM3 ↪ ↪ ↪ ↪ ↪ ↪ ↪
  • 13. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • S = An object space S. •R = A relationship for objects in S. • Three problems of interest: 1. Matching: Given G1 , G2 ∊ S, one determines if R (G1,G2) =1. 2. Searching : B ⊆ S is a bag of objects . Given o ∊ S , find b ∊ B such that R (o, b )=1. 3. Clustering: Given a bag B of objects, partition B into a set of groups { G1, G2,…,Gm} based on R. • Given the byte-wise relevance BR , we need solutions for – Matching – Searching – Clustering 13
  • 14. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • Matching problems & the solutions: Classification 9/10/2015 14
  • 15. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • Searching problem for the relationship BR : – B is a bag of β-nontrivial strings. Given T ∊ SS(β), find s ∊ B such that BR(T, s)=1. Classification 9/10/2015 15
  • 16. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • How to solve the searching problem? – Brute force approach : for every s ∊ B, we evaluate BR(T, s). What about when B has 1 million strings . • It is a lazy idea! – Candidate selection approach: • STEP 1: select a few candidates {s1, s2,…,sm} quickly • STEP 2: evaluate each BR(T, sk). – How to select “good” candidates? • String tokenizer: extract tokens from each string from B. • Indexer : index the tokens along with the string ID to create a index DB as FP-DB. • Searcher : given T, generate tokens {FP1, FP2,…,FPq} , we use them to search possible candidates from FP-DB. Then we evaluate BR(T, s) for each candidate s. – NOTE: • This is similar to a keyword based search engine where the keywords are the tokens. • A special token is string fingerprint. – Other tokens include k-grams, k-subsequences, blocks and chunks . » Fingerprints are generated from special blocks. 16
  • 17. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • Architecture of candidate selection based approach: 17
  • 18. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • A clustering problem based on the relationship BR : • Given a bag B of β-nontrivial strings, partition B into a set of groups { G1, G2,…,Gm} based on BR. Classification 9/10/2015 18
  • 19. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • A solution to clustering problem: • A graph based approach: if BR(s,t)=1, Node(s) and Node(t) are connected. • A group is a connected sub-graph of the G(V,E) where V=Node(B). Classification 9/10/2015 19
  • 20. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • The framework can be summarized as follows: Classification 9/10/2015 20
  • 21. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Let us go quickly over the theory and algorithm for two matching problems : – Similarity AM1 – Cross-sharing EM3 • What is similarity? How to measure it? – A traditional approach is to compare two strings directly such as the LCS method ( largest common subsequence). – The popular fuzzy hash {ssdeep, sdhash and TLSH} use different algorithms for measuring the similarity. Classification 9/10/2015 21
  • 22. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Fuzzy hash can be summarized in the following 3 steps: Classification 9/10/2015 22
  • 23. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • ssdeep: – STEP 1: split the string into a sequence of consecutive chunks. – STEP 2: hash each chunk into 6 bits and place them into a 80-byte container sequentially. – STEP 3: Use Levenshtein distance to measure the similarity. Classification 9/10/2015 23
  • 24. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • sdhash: – STEP 1: Select a few 64-grams of higher entropy values. – STEP 2: Generate a hash for each and them all hashes into one or more 256-byte bloom filters. – STEP 3: Use Hamming distance to measure the similarity. Classification 9/10/2015 24
  • 25. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • TLSH: – STEP 1: For every 5-gram, select 6 triplets out of total 10 (=C5 3). – STEP 2: Generate a hash for each triplet and map them into a 32-byte container. – STEP 3: Use a heuristic diff algorithm to measure the similarity. Classification 9/10/2015 25
  • 26. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Summary of Three Fuzzy Hashing Algorithms: – Using a first model to describe a binary string with selected features: • ssdeep model: a string is a sequence of chunks (split from the string). • sdhash model: a string is a bag of 64-byte blocks (selected with entropy values). • TLSH model: a string is a bag of triplets (selected from all 5-grams). – Using a second model to map the selected features into a digest which is able to preserve similarity to certain degree. • ssdeep model: a sequence of chunks is mapped into a 80-byte digest. • sdhash model: a bag of blocks is mapped into one or multiple 256-byte bloom filter bitmaps. • TLSH model: a bag of triplets is mapped into a 32-byte container. Classification 9/10/2015 26
  • 27. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Three approaches for measuring similarity … {ssdeep, sdhash & TLSH use digest comparison. Classification 9/10/2015 27 • 1st model plays critical role for similarity comparison. • Let focus on discussing various 1st models today. • Based on a unified format. • 2nd model saves space but further reduces accuracy.
  • 28. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Unified format for 1st model: – A string is described as a collection of tokens (aka, features) organized by a data structure: • ssdeep: a sequence of chunks. • sdhash: a bag of 64-byte blocks with high entropy values. • TLSH: a bag of selected triplets. – Two types of data structures: sequence, bag. – Three types of tokens: chunks, blocks, triplets. • Analogical comparison: Classification 9/10/2015 28
  • 29. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • 4 categories of tokens : – k-grams where k is as small as 3,4,… – k-subsequences: any subsequence with length k. The triplet in TLSH is an example. – Chunks: whole string is split into non-overlapping chunks. – Blocks: selected substrings of fixed length. • 8 ways to describe a string for similarity: Classification 9/10/2015 29
  • 30. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Evaluate a fuzzy hash based on follows: – Data Structure: • Bag: a bag ignores the order of tokens. It is good at handling content swapping. • Sequence: a sequence organizes tokens in an order. This is weak for handling content swapping. – Tokens: • k-grams: Due to the small k ( 3,4,5,…), this fine granularity is good at handling fragmentation. • k-sequences: Due to the small k ( 3,4,5,…), this fine granularity is good at handling fragmentation . • Chunks: This approach takes account of every byte in raw granularity. It should be OK at handling containment and cross sharing • Blocks: Depending on different selection functions, even though it does not take account of every byte, but it may present a string more efficiently and that is good for generating similarity digests. Due to the nature of fixed length blocks, it is good at handling containment and cross sharing. • M2.4 leads to a novel fuzzy hashing scheme : TSFP – It has some capabilities beyond existing schemes. – I am not introduce it today due to limited time. 30
  • 31. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Let us investigate EM3 : – The cross-sharing problem. • What is cross-sharing ? And how to measure it? • Given a string, any two substrings follow one out of three cases: 31
  • 32. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Cross-sharing … some examples : 32
  • 33. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Definition 1: Given T∊ SS(β), let Ω(T)= { s | s is a β-nontrivial substring of T}. – Ω(T) is the set of all β-nontrivial substrings of T. • Definition 2: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). If there exists a bijective mapping F: SR  ST such that F(r)=t and r=t, we say that SR and ST are canonical with F. 33
  • 34. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Theorem 1: Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T), SR and ST are canonical with F: SR  ST. ∀ r1 , r2 ∊ SR, one of following cases holds: 34 NOTE: we are only interested in case 1-3.
  • 35. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Definition 3 : Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). SR and ST are canonical with F: SR  ST. – ∀ r1 and r2 ∊ SR which are not identical, if only case 1 holds, we say that SR and ST are translative. – ∀ r1 and r2 ∊ SR which are not identical, if one of case 1 -3 holds, we say that SR and ST are weakly translative. – Let L(SR, ST) = 𝐋𝐞𝐧(𝐫)<𝐫,𝐭> for measuring SR × ST . 35
  • 36. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Definition 4: Given R, T∊ SS(β), we define two measurements for cross- sharing between R and T at two levels: – L1(R, T) = Max { L(SR, ST) | SR and ST are translative } – L2 (R, T) = Max { L(SR, ST) | SR and ST are weakly translative } • Definition 5 : Given T∊ SS(β), if its β-grams (i.e., β-length substrings) are different to each other, T is β-nonrepetitive. – This is to measure how random T is. • Theorem 2 : Given R,T ∊ SS(β), we have: – L1(R, T) ≤ L2(R, T) – If both R and T are β-nonrepetitive, L1(R, T) = L2(R, T) 36
  • 37. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Algorithm ( identify cross-shared substrings) – INPUT: INPUT: R[1, …, m] and T[1, …, n]. – SUMMARY: 1. Use a rolling hash function H(x) to slide a β-width window along the string R for generating m+1- β hash values. We store them into a hash table HT with separate chaining using linked-list to resolve collisions. The nodes in the linked-lists of the hash table HT save the offsets where hash values are created. 2. From the first offset of T with a the same rolling hash to slide the window of β-width along T, do match with hash table H. If not matched, continue the next offset, otherwise try to match the maximum, then continue from an offset around the end of the matched substring of T. – OUTPUT: SR ST . 37
  • 38. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Theorem 3: SR×ST is from Algorithm 1. If ST ={}, L1(R,T)=L2(R,T) = 0, otherwise, we have – SR and ST are weakly translative ; – L1(R, T) ≤ L(SR, ST) ≤ L2 (R, T). – If R is β–nonrepetitive, L(SR, ST) = L2(R, T). – If T is β–nonrepetitive, we have (a) L1(R, T) = L(SR, ST) ; (b) SR and ST are translative; . 38
  • 39. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Let me summarize what have been done: Classification 9/10/2015 39
  • 40. Copyright 2011 Trend Micro Inc. Practical Applications • This framework can be applied to the following areas. – E-Discovery • Grouping near duplicate documents • Comparing of near duplicate documents – Digital forensic analysis • Identifying similar objects or files – Anti-plagiarism • Copy detection – Source code governance – Malware analysis – Spam filtering – DLP – etc 40
  • 41. Copyright 2011 Trend Micro Inc. Q&A • Thank you for your attention. • Do you have any questions? • Email: liwei_ren@trendmicro.com • Home page: http://pitt.academia.edu/LiweiRen for external talks. 41