2. About Me
• Data Scientist at Trend Micro
• PhD at Monash University
• Data Mining consultant for NASA and FAA
• Data Scientist at Mailfrontier
• Inventor of TLSH
• Adjunct Professor at University of Queensland
3. This Talk
What?
• TLSH tools for processing malware
• Data derived from Malware Bazaar
Why?
• Label new / unknown samples
How?
• Clustering Malware Bazaar using standard ML tools (HAC-T / DBSCAN)
• Visualization of clusters (from Malware Bazaar)
4. Quick Intro to TLSH
• Trend Micro Locality Sensitive Hash
• pip install py-tlsh
• Open source code at https://github.com/trendmicro/tlsh
• Fuzzy Hash
• With advantages from Machine Learning
• Works with Sklearn, Jupyter Notebooks and DBSCAN
• Adopted by VirusTotal
• Adopted by Malware Bazaar
• A part of the STIX standard
5. What do TLSH digests look like?
chrome.exe
SHA256:c70b8cbb2ac962b343535454e4f2bcb3e48d83a04792c64bc768d59b3c1bf403
T11c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db
T1c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db
chrome.exe
SHA256:723aa4a407160bd99430de690f1f0d34af4a6622e2c44fe95be3bda3d7c344b3
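The two digests on this slide can be compared directly. The following is a minimal sketch, assuming the py-tlsh package from slide 4 is installed; `tlsh.diff` returns a distance score where 0 means identical digests and larger values mean less similar files. The helper name `digest_distance` is illustrative and not part of the library.

```python
try:
    import tlsh  # installed with: pip install py-tlsh
except ImportError:  # keep the sketch importable without the package
    tlsh = None

# The two TLSH digests shown on the slide for the chrome.exe samples.
DIGEST_A = "T11c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db"
DIGEST_B = "T1c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db"


def digest_distance(a, b):
    """Return the TLSH distance between two digests, or None if py-tlsh is missing."""
    if tlsh is None:
        return None
    # Normalise case so digests copied from slides are parsed consistently.
    return tlsh.diff(a.upper(), b.upper())


if __name__ == "__main__":
    print("distance:", digest_distance(DIGEST_A, DIGEST_B))
```

The two SHA256 values on the slide share nothing visible, while the two TLSH digests agree in most positions; the distance score turns that visual similarity into a number that can be thresholded.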
8. Malware Bazaar
As of 17 Sept 2021, Malware Bazaar (https://bazaar.abuse.ch/) has a dataset with
• 389,300 samples
• 323,709 samples that have a label
We have clustered this dataset and found 16,452 clusters:
https://github.com/trendmicro/tlsh/tree/master/tlshCluster/malbaz
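To illustrate how digests can be clustered with the standard tools named in this talk, here is a minimal sketch using scikit-learn's `DBSCAN` over a precomputed distance matrix. It is not the script that produced the published clusters. The `hamming` function and the toy digests are stand-ins so the example runs without py-tlsh; in practice a TLSH distance function would be passed instead, and `eps` would need tuning for that scale.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def hamming(a, b):
    """Stand-in distance: number of differing characters (NOT the TLSH score)."""
    return sum(x != y for x, y in zip(a, b))


def cluster_digests(digests, distance, eps, min_samples=2):
    """Cluster digests with DBSCAN over a precomputed pairwise distance matrix."""
    n = len(digests)
    matrix = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = distance(digests[i], digests[j])
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(matrix).tolist()


# Toy digests (hypothetical): two tight groups and one outlier.
toy = ["aaaa0000", "aaaa0001", "aaaa0011", "ffff9999", "ffff9998", "12345678"]
labels = cluster_digests(toy, hamming, eps=2)
print(labels)  # -1 marks samples DBSCAN treats as noise
```

A full pairwise matrix grows quadratically with the number of samples, so this form suits small batches only; a dataset of 389,300 samples needs an approach that avoids materialising every pair.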
11. Demo (1)
• Clustered Malware Bazaar
• Cluster output and pattern file from 2021-09-17 provided at
  https://github.com/trendmicro/tlsh/tree/master/tlshCluster/malbaz
• Use this to predict the malware family of Malware Bazaar samples from 2021-09-18
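The prediction step can be pictured as a nearest-cluster lookup with a distance cut-off. The sketch below is an assumption about the general idea, not the repository's implementation: the slides do not describe the format of the cluster output or pattern file, so the `(representative, label)` pairs, the `threshold` value, and the `hamming` stand-in distance are all hypothetical.

```python
def hamming(a, b):
    """Stand-in distance so the sketch runs without py-tlsh (NOT the TLSH score)."""
    return sum(x != y for x, y in zip(a, b))


def predict_family(digest, clusters, distance, threshold):
    """Return the label of the nearest cluster, or None if none is close enough.

    `clusters` is a list of (representative_digest, label) pairs.
    """
    best_label, best_dist = None, None
    for representative, label in clusters:
        d = distance(digest, representative)
        if best_dist is None or d < best_dist:
            best_label, best_dist = label, d
    if best_dist is None or best_dist > threshold:
        return None  # not associated with any known cluster
    return best_label


# Hypothetical cluster representatives and labels, for illustration only.
known = [("aaaa0000", "FamilyA"), ("ffff9999", "FamilyB")]
print(predict_family("aaaa0011", known, hamming, threshold=3))
print(predict_family("12345678", known, hamming, threshold=3))
```

Returning `None` for distant samples matters: a new sample that matches no historical cluster should stay unlabelled instead of inheriting the label of whichever cluster happens to be least far away.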
14. Demo (1): Predicting Signature
• Difficult task, as there are 592 distinct signatures in Malware Bazaar
• Associated 164 / 246 samples to clusters
• We split the predictions into 3 categories
  • Correct signature: 132/164
  • Incorrect: 13/164
  • Inconclusive: 19/164
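The counts above are easier to compare as rates. This short calculation uses only the numbers on the slide; the "correct among conclusive" figure is an extra derived ratio, not one stated in the talk.

```python
total_samples = 246
associated = 164
outcomes = {"correct": 132, "incorrect": 13, "inconclusive": 19}

# The three categories should account for every associated sample.
assert sum(outcomes.values()) == associated

coverage = associated / total_samples
print(f"associated with a cluster: {coverage:.1%}")
for name, count in outcomes.items():
    print(f"{name}: {count / associated:.1%}")

# Accuracy when only predictions that reached a verdict are counted.
conclusive = outcomes["correct"] + outcomes["incorrect"]
print(f"correct among conclusive: {outcomes['correct'] / conclusive:.1%}")
```

About two thirds of the new samples (66.7%) were associated with a cluster; of those, 80.5% received the correct signature, 7.9% an incorrect one, and 11.6% were inconclusive.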
15. Demo (1): Uses in the SOC
• Automatic labelling of unknown samples
• Scalable
• Suitable for automation
• Associates unknown samples with similar historical samples
• Understand scope of the threat
• YARA rules
• …
→ Take suitable action
16. Demo (2)
• Understanding Clustering
• Dendrograms for malware
• See https://github.com/trendmicro/tlsh/blob/master/tlshCluster/malbaz.ipynb
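For readers without the notebook at hand, the sketch below shows one common way to build a dendrogram from pairwise distances with SciPy. It is not taken from the notebook. The toy digests and `hamming` distance are stand-ins for TLSH digests and the TLSH score, and average linkage is an arbitrary choice, since the slides do not say which linkage HAC-T uses.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform


def hamming(a, b):
    """Stand-in distance: number of differing characters (NOT the TLSH score)."""
    return sum(x != y for x, y in zip(a, b))


# Toy digests (hypothetical): two tight groups and one outlier.
toy = ["aaaa0000", "aaaa0001", "aaaa0011", "ffff9999", "ffff9998", "12345678"]
matrix = np.array([[hamming(a, b) for b in toy] for a in toy], dtype=float)

# linkage() expects the condensed (upper-triangle) form of the distance matrix.
tree = linkage(squareform(matrix), method="average")

# no_plot=True returns the layout without needing matplotlib;
# drop it to draw the dendrogram.
layout = dendrogram(tree, labels=toy, no_plot=True)
print(layout["ivl"])  # leaf labels in left-to-right order

# Cutting the tree at a distance threshold yields flat clusters.
flat = fcluster(tree, t=3, criterion="distance")
print(flat.tolist())
```

Reading the tree from the leaves upward shows which samples merge at small distances (likely variants of one another) and which only join near the root.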
17. Digging Deeper
• Why TLSH is the way that it is
• Why it uses Kskip-Ngrams
• Comparison of TLSH with other Similarity Digests
• Comparison of Clustering Methods
19. Kskip Ngrams
Data:
A B C D E F G H I J
Ngram Features (N=4)
ABCD BCDE CDEF DEFG EFGH FGHI GHIJ
Kskip-Ngram N=4 K=2
AB AC AD BC BD BE CD CE CF DE DF DG EF EG EH FG FH FI GH GI GJ HI HJ IJ
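The two feature lists on this slide can be reproduced in a few lines. This is a reading of the slide's example and not TLSH's internal feature extraction: `ngrams` slides a window of length 4 over the data, and `skip_pairs` pairs each character with each of the next three characters, which yields exactly the pairs listed. Both function names are illustrative.

```python
def ngrams(data, n):
    """Contiguous n-grams from a sliding window of length n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def skip_pairs(data, window):
    """Pair each character with each of the next window-1 characters."""
    pairs = []
    for i in range(len(data)):
        for j in range(i + 1, min(i + window, len(data))):
            pairs.append(data[i] + data[j])
    return pairs


data = "ABCDEFGHIJ"
print(" ".join(ngrams(data, 4)))
print(" ".join(skip_pairs(data, 4)))
```

The point of the skipped variants is robustness: inserting or changing one character destroys every plain 4-gram that covers it, while many of the skipped pairs around it survive.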
20. Selecting K and N for Kskip-Ngrams
Computational Complexity (low score is good)
              N=3  N=4  N=5  N=6  N=7  N=8  …
K=5            21
K=4            15   35
K=3            10   20   35
K=2             6   10   15   21
K=1             3    4    5    6    7
K=0 (Ngram)     1    1    1    1    1    1
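The slide does not state how these scores are computed. One formula that reproduces every printed value is the binomial coefficient C(N + K - 1, K): the number of ways to pick an N-character feature from a window of N + K characters while keeping the first character fixed. The check below is an inference under that assumption, with each row read as starting at N=3.

```python
from math import comb


def skipgram_count(n, k):
    """Features per window position under the assumed formula C(n + k - 1, k)."""
    return comb(n + k - 1, k)


# Values as printed on the slide, keyed by K, for N = 3, 4, 5, ...
slide = {
    0: [1, 1, 1, 1, 1, 1],
    1: [3, 4, 5, 6, 7],
    2: [6, 10, 15, 21],
    3: [10, 20, 35],
    4: [15, 35],
    5: [21],
}

for k, row in slide.items():
    computed = [skipgram_count(n, k) for n in range(3, 3 + len(row))]
    print(f"K={k}: {computed}  matches slide: {computed == row}")
```

Under this reading, every filled cell has N + K of at most 8, so the table covers windows of up to 8 characters, and the cost of each extra skip grows quickly as N increases.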
29. Types of Clustering
• Similarity of the files
• Fuzzy Hashes
• Feature based
• Deep Learning
• YARA Rules
• Apply a pattern (Smart pattern)
• Sandbox / behavioural analysis
• …
30. Fuzzy Hashes
• Cryptographic Hashes:
  • Any change completely changes the hash
  • Useful for collecting evidence
• Fuzzy Hashes:
  • Have the convenience of cryptographic hashes
  • Can measure the similarity between files
  • Speed and scale
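The first point, that any change completely changes a cryptographic hash, is easy to demonstrate with the standard library. The example inputs are arbitrary; only the behaviour of SHA-256 is being shown.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"  # one byte changed

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(modified).hexdigest()

# Count the hex positions at which the two digests disagree.
differing = sum(a != b for a, b in zip(h1, h2))
print(h1)
print(h2)
print(f"{differing} of {len(h1)} hex digits differ")
```

That property makes SHA-256 useful as evidence that a file is exactly the one recorded, and useless for asking how close two files are. A fuzzy hash is designed for the second question: small edits tend to produce a nearby digest.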
31. Potential Issues with Clustering
• Scale
  • Does the method scale up to 10 million / 100 million files?
• Access to the file
  • Does the method need to process the file?
• Manual effort
• Packers
  • Multiple malware families may use the same packer
  • Some methods will distinguish; other methods will not
32.
Category    Technique             Speed / Scale  Access to file  Manual effort  Can separate families that share a packer
Similarity  Fuzzy Hash            Fast           No              No             No
            Feature based ML      Slow           Yes             Features       No
            Deep Learning         Slow           Yes             Network        ?
            YARA rules            Medium         Yes             Yes            Yes
            Smart Pattern         Fast           Yes             Yes            Yes
            Sandbox / Behavioral  Slow           Yes             No             Yes
33. Clustering Solutions
• Use multiple methods of clustering
• Split clustering / categorization into phases
  1. Large scale / quick / cheap
     • Fuzzy hashes (TLSH) are ideal
  2. When needed, use more expensive methods
     • Extensive security knowledge required
     • Sandboxes
     • Smart Patterns
     • YARA rules
     • Deep Learning
     • etc.
34. Conclusion
• Get the tools
  • pip install py-tlsh
  • Open Source (Apache license)
  • https://github.com/trendmicro/tlsh
• Fuzzy Hashes / TLSH / Telfhash are really useful tools
  • Working with huge databases
• Use standard dev-ops / ML tools for malware
  • Jupyter notebooks
  • Sklearn
  • DBSCAN
  • Dendrograms for visualizing clustering