TLSH for the SOC
Jonathan Oliver
About Me
• Data Scientist at Trend Micro
• PhD at Monash University
• Data Mining consultant for NASA and FAA
• Data Scientist at Mailfrontier
• Inventor of TLSH
• Adjunct Professor at University of Queensland
This Talk
What?
• TLSH Tools for processing malware
• Data derived from Malware Bazaar
Why?
• Label new / unknown samples
How?
• Clustering Malware Bazaar using standard ML tools
• (HAC-T / DBSCAN)
• Visualization of clusters (from Malware Bazaar)
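A minimal sketch of the DBSCAN idea named above, in plain Python so it runs without dependencies. It is an illustration, not the talk's implementation: the function name `dbscan`, the toy integer data and the `eps` / `min_samples` values are assumptions. With scikit-learn, `sklearn.cluster.DBSCAN(metric="precomputed")` fed a matrix of pairwise TLSH distances would play the same role.

```python
from collections import deque

def dbscan(items, dist, eps, min_samples):
    """Density-based clustering over an arbitrary pairwise distance.

    Returns one label per item: 0, 1, ... for clusters and -1 for noise.
    """
    n = len(items)
    # Neighbourhood of each item (includes the item itself, distance 0).
    neighbors = [
        [j for j in range(n) if dist(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing
    return labels

# Toy stand-in (assumption): integers with absolute difference as distance.
toy_items = [1, 2, 3, 50, 51, 52, 200]
toy_labels = dbscan(toy_items, lambda a, b: abs(a - b), eps=2, min_samples=2)
print(toy_labels)
```

For malware, `items` would be TLSH digests and `dist` a TLSH distance function; the isolated value in the toy data ends up labelled as noise (-1).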
Quick Intro to TLSH
• Trend Micro Locality Sensitive Hash
• pip install py-tlsh
• Open source code at https://github.com/trendmicro/tlsh
• Fuzzy Hash
• With advantages from Machine Learning
• Works with Sklearn, Jupyter Notebooks and DBSCAN
• Adopted by VirusTotal
• Adopted by Malware Bazaar
• A part of the STIX standard
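A short sketch of how the package from `pip install py-tlsh` is typically called, using its `tlsh.hash` and `tlsh.diff` functions. The helper name `tlsh_distance` and the toy byte strings are assumptions made for illustration; the import is guarded so the sketch still loads where the package is not installed.

```python
import hashlib

try:
    import tlsh  # provided by `pip install py-tlsh`
except ImportError:  # keep the sketch importable without the package
    tlsh = None

def tlsh_distance(a: bytes, b: bytes):
    """Return the TLSH distance between two byte strings, or None if unavailable."""
    if tlsh is None:
        return None
    h1, h2 = tlsh.hash(a), tlsh.hash(b)
    # Inputs that are too short or too uniform do not yield a usable digest.
    if not h1 or not h2 or "TNULL" in (h1, h2):
        return None
    return tlsh.diff(h1, h2)

# Toy data (assumption): 2 KiB of varied bytes and a lightly edited copy.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
edited = original[:1000] + b"patched!" + original[1000:]
print(tlsh_distance(original, edited))
```

A small distance is expected for the lightly edited copy; the exact value depends on the data and the library version, so none is quoted here.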
What does a TLSH look like?
chrome.exe
SHA256:c70b8cbb2ac962b343535454e4f2bcb3e48d83a04792c64bc768d59b3c1bf403
T11c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db
T1c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db
chrome.exe
SHA256:723aa4a407160bd99430de690f1f0d34af4a6622e2c44fe95be3bda3d7c344b3
Distance Calculation
T11c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db
T1c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db
1 1 3 3 3
Total Distance = 11
0-30 Very Close Match
31-60 Close Match
61-100 Possible Match
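The bands above translate directly into a lookup. This is a minimal sketch; the function name `describe_distance` and the wording returned for distances above 100 (which the slide does not cover) are assumptions.

```python
def describe_distance(distance: int) -> str:
    """Map a TLSH distance to the band names used on the slide."""
    if distance < 0:
        raise ValueError("TLSH distances are non-negative")
    if distance <= 30:
        return "Very Close Match"
    if distance <= 60:
        return "Close Match"
    if distance <= 100:
        return "Possible Match"
    # Assumption: the slide lists no band beyond 100.
    return "Outside listed bands"

# The two chrome.exe digests above have a total distance of 11.
print(describe_distance(11))
```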
Malware Bazaar
As of 17 Sept 2021, Malware Bazaar (https://bazaar.abuse.ch/) has a dataset with
• 389300 samples
• 323709 samples have a label
We have clustered this dataset and found 16452 clusters
https://github.com/trendmicro/tlsh/tree/master/tlshCluster/malbaz
Use Cases / Motivation
Typical Use Case
Demo (1)
• Clustered Malware Bazaar
• Cluster output and pattern file from 2021-09-17 provided at
• https://github.com/trendmicro/tlsh/tree/master/tlshCluster/malbaz
• Use this to predict the malware family of Malware Bazaar 2021-09-18
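The prediction step can be pictured as a nearest-neighbour lookup with a distance cut-off. The sketch below is a simplification and not the cluster / pattern-file mechanism shipped in the repository: `predict_label`, the threshold, the family names and the integer stand-ins for digests are all hypothetical.

```python
def predict_label(sample, labelled, dist, threshold):
    """Return the label of the closest known item within `threshold`, else None."""
    best_label, best_distance = None, None
    for item, label in labelled:
        d = dist(sample, item)
        if d <= threshold and (best_distance is None or d < best_distance):
            best_label, best_distance = label, d
    return best_label  # None means the sample was not associated with anything

# Toy stand-in (assumption): integers instead of TLSH digests.
known = [(10, "familyA"), (12, "familyA"), (80, "familyB")]
toy_dist = lambda a, b: abs(a - b)

for sample in (11, 78, 40):
    print(sample, predict_label(sample, known, toy_dist, threshold=5))
```

In practice `dist` would be a TLSH distance and `known` would hold historical samples or cluster representatives with their labels.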
Demo (1): Predicting Signature
• Difficult task as there are 592 distinct signatures in Malware Bazaar
• Associated 164 / 246 samples to clusters.
• We split the predictions into 3 categories
• Correct Signature 132/164
• Incorrect 13/164
• Inconclusive 19/164
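As a quick check of the arithmetic, the counts above can be turned into rates. Variable names are illustrative only; the numbers are the ones on the slide.

```python
total_new = 246   # samples from 2021-09-18
associated = 164  # samples that were associated to a cluster
outcomes = {"Correct Signature": 132, "Incorrect": 13, "Inconclusive": 19}

# The three categories should account for every associated sample.
assert sum(outcomes.values()) == associated

coverage = associated / total_new
shares = {name: count / associated for name, count in outcomes.items()}

print(f"associated to clusters: {coverage:.1%}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

This works out to roughly 66.7% of new samples associated to a cluster, of which about 80.5% were correct, 7.9% incorrect and 11.6% inconclusive.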
Demo (1): Uses in the SOC
• Automatic labelling of unknown samples
• Scalable
• Suitable for Automation
• Associates unknown samples with similar historical samples
• Understand scope of the threat
• YARA rules
• …
⇒ Take suitable action
Demo (2)
• Understanding Clustering
• Dendrograms for malware
• See https://github.com/trendmicro/tlsh/blob/master/tlshCluster/malbaz.ipynb
Digging Deeper
• Why TLSH is the way that it is.
• Why it uses kskip-grams
• Comparison of TLSH with other Similarity Digests
• Comparison of Clustering Methods
Why K-skip-grams?
• Work on short strings / files
• Hard to attack
Kskip Ngrams
Data:
A B C D E F G H I J
Ngram Features (N=4)
ABCD BCDE CDEF DEFG EFGH FGHI GHIJ
Kskip-Ngram N=4 K=2
AB AC AD BC BD BE CD CE CF DE DF DG EF EG EH FG
FH FI GH GI GJ HI HJ IJ
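The two feature lists above can be reproduced with a few lines of Python. This is a sketch of the slide's example only, not of the TLSH feature extraction itself; the function names are illustrative.

```python
def ngrams(data: str, n: int) -> list[str]:
    """Contiguous substrings of length n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def skip_pairs(data: str, window: int) -> list[str]:
    """Pairs made of a character and each later character in the same window."""
    return [
        data[i] + data[j]
        for i in range(len(data))
        for j in range(i + 1, min(i + window, len(data)))
    ]

data = "ABCDEFGHIJ"
print(" ".join(ngrams(data, 4)))      # the "Ngram Features (N=4)" row
print(" ".join(skip_pairs(data, 4)))  # the "Kskip-Ngram N=4 K=2" row
```

Ten characters give 7 contiguous 4-grams but 24 skip pairs, which is why skip-grams still yield useful features on short inputs.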
Selecting K and N for Kskip-Ngrams
Computational Complexity (low score is good)
K=5 21
K=4 15 35
K=3 10 20 35
K=2 6 10 15 21
K=1 3 4 5 6 7
K=0 (Ngram) 1 1 1 1 1 1
N=3 N=4 N=5 N=6 N=7 N=8 …
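One possible reading of this table, which is an assumption and not stated on the slide: if each row is right-aligned so that its last value sits under N=8, every entry equals the binomial coefficient C(N-1, K). The sketch below checks that reading against the numbers shown; `slide_rows` and `row` are illustrative names.

```python
from math import comb

# Values as they appear on the slide, keyed by K.
slide_rows = {
    5: [21],
    4: [15, 35],
    3: [10, 20, 35],
    2: [6, 10, 15, 21],
    1: [3, 4, 5, 6, 7],
    0: [1, 1, 1, 1, 1, 1],
}

def row(k: int, last_n: int = 8) -> list[int]:
    """C(N-1, k) for the rightmost len(slide_rows[k]) columns (assumed alignment)."""
    count = len(slide_rows[k])
    return [comb(n - 1, k) for n in range(last_n - count + 1, last_n + 1)]

for k in sorted(slide_rows, reverse=True):
    print(f"K={k}", row(k), row(k) == slide_rows[k])
```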
Kskip-Ngram versus Ngrams
GAN-like experiment
Real World Data
Adversarial Agent
Discriminator
Match
No Match
Selecting K and N for Kskip-Ngrams
Adversarial Agent (Search Width = 15)
(low score is good)
K=5 7.5
K=4 11.3
K=3 13.7
K=2 16.1
K=1 16.0
K=0 (Ngram) 25.4 31.2 32 43.4 57.4
N=3 N=4 N=5 N=6 N=7 N=8 …
Selecting K and N for Kskip-Ngrams
Accuracy
Comparing LSH / Similarity Digests
Ref: Martín-Perez et al., “Bringing order to approximate matching: Classification and attacks on similarity digest algorithms”
Metric Trees for Nearest Neighbor Search
Nodes contain
(item, distance)
Metric Trees: Do not work for (bounded) Similarity Measures
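Metric trees store each item together with its distance to a reference point and use the triangle inequality to skip comparisons. The sketch below shows that pruning step with a single pivot rather than a full tree; it is a simplified illustration, and the function name and the toy integer data are assumptions. A score that does not obey the triangle inequality would make the skip unsafe.

```python
def nearest_with_pivot(query, items, pivot, dist):
    """Nearest-neighbour search that prunes with one pivot.

    Returns (best_item, best_distance, number_of_full_comparisons).
    """
    # Stored at build time: (item, distance to pivot).
    stored = [(x, dist(x, pivot)) for x in items]
    d_query_pivot = dist(query, pivot)

    best, best_d, evaluated = None, float("inf"), 0
    for x, d_x_pivot in stored:
        # Triangle inequality: dist(query, x) >= |d(query, pivot) - d(x, pivot)|.
        if abs(d_query_pivot - d_x_pivot) >= best_d:
            continue  # x cannot beat the current best, skip the comparison
        evaluated += 1
        d = dist(query, x)
        if d < best_d:
            best, best_d = x, d
    return best, best_d, evaluated

# Toy stand-in (assumption): integers with absolute difference as the metric.
toy_result = nearest_with_pivot(12, [10, 11, 50, 90, 100], pivot=0,
                                dist=lambda a, b: abs(a - b))
print(toy_result)
```

In the toy run only two of the five candidates need a full distance computation; the rest are ruled out by the bound.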
Comparing Clustering Approaches
Types of Clustering
• Similarity of the files
• Fuzzy Hashes
• Feature based
• Deep Learning
• YARA Rules
• Apply a pattern (Smart pattern)
• Sandbox / behavioural analysis
• …
Fuzzy Hashes
• Cryptographic Hashes:
• Any change completely changes the hash
• Useful for collecting evidence
• Fuzzy Hashes:
• Have the convenience of cryptographic hashes
• Can measure the Similarity between files
• Speed and Scale
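The first point, that any change completely changes a cryptographic hash, is easy to see with the standard library. This is an illustrative sketch; the helper name and the toy byte strings are assumptions.

```python
import hashlib

def differing_hex_digits(a: bytes, b: bytes) -> int:
    """Count positions where the SHA-256 hex digests of a and b differ."""
    ha = hashlib.sha256(a).hexdigest()
    hb = hashlib.sha256(b).hexdigest()
    return sum(x != y for x, y in zip(ha, hb))

# Toy data (assumption): two inputs that differ only in their final byte.
original = b"MZ" + bytes(range(256)) * 4
edited = original[:-1] + b"\x00"

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(edited).hexdigest())
print(differing_hex_digits(original, edited), "of 64 hex digits differ")
```

The two digests share no useful structure, so they can prove that files are identical but say nothing about how similar two different files are. A fuzzy hash is designed so that small edits produce a small distance instead.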
Potential Issues with Clustering
• Scale
• Does the method scale up to 10 million / 100 million files?
• Access to the file
• Does the method need to process the file?
• Manual effort
• Packers
• Multiple malware families may use the same packer
• Some methods will distinguish; other methods will not
Category / Technique | Speed / Scale | Access to file | Manual effort | Can separate families that share a packer
Similarity / Fuzzy Hash | Fast | No | No | No
Feature based ML | Slow | Yes | Features | No
Deep Learning | Slow | Yes | Network | ?
YARA rules | Medium | Yes | Yes | Yes
Smart Pattern | Fast | Yes | Yes | Yes
Sandbox / Behavioral | Slow | Yes | No | Yes
Clustering Solutions
• Use multiple methods of clustering
• Split clustering / categorization into phases
1. Large scale / quick / cheap
• Fuzzy hashes (TLSH) are ideal
2. When needed, use more expensive methods
• Extensive security knowledge required
• Sandboxes
• Smart Patterns
• YARA rules
• Deep Learning
• etc
Conclusion
• Get the tools.
• pip install py-tlsh
• Open Source (Apache license)
• https://github.com/trendmicro/tlsh
• Fuzzy Hashes / TLSH / Telfhash are really useful tools
• Working with huge databases
• Use standard dev-ops / ML tools for malware
• Jupyter notebooks
• Sklearn
• DBSCAN
• Dendrograms for visualizing clustering
Resources
• TLSH
• https://github.com/trendmicro/tlsh
• Papers on TLSH
• http://tlsh.org/papers.html
• Malware Bazaar
• https://bazaar.abuse.ch/
Thanks to University of Queensland
