SlideShare uma empresa Scribd logo
1 de 20
Machine Learning for Malware
Classification and Clustering
Phil Roth, Data Scientist
1
• PhD in particle astrophysics
• Switched to making images from radar data
• Switched to solving security problems with data
Phil Roth
Data Scientist
2
Outline
• Malware Detection
• Boosted Decision Trees
• Malware Features
• Evaluating Performance
• Bringing a Human into the Loop
3
The Problem: Antivirus
The security industry has declared antivirus as dead, but
there is no widely accepted replacement.
Machine Learning can be that replacement.
4
The Problem: Antivirus
• Antivirus uses signatures, heuristics, and hand crafted rules
that do not scale well
• Using polymorphism and obfuscation, malware authors can
circumvent rules based detection techniques
5
The Solution: Machine Learning
Machine Learning uses statistical techniques to learn
patterns from large datasets
6
Two Steps:
• Feature Extraction
• Boundary Learning
Machine Learning Advantages
• Automation
• Deep Insights
• Scalability
• Generalization
7
Machine Learning Challenges
• Requires labels
• Requires large data sets
• Security field requires very low tolerance for errors
8
Boosted Decision Trees
Basically, it’s a game of 20 questions
Source: https://en.wikipedia.org/wiki/Decision_tree_learning
A tree showing survival of passengers
on the Titanic ("sibsp" is the number
of spouses or siblings aboard). The
figures under the leaves show the
probability of survival and the
percentage of observations in the
leaf.
9
Boosted Decision Trees
• The trees are built by choosing “questions” that
maximize the discrimination between two classes
• The model is called “boosted” because misclassified
samples are given higher weight in future tree building
10
Why Boosted Decision Trees?
Proven results in security and physics
References:
https://www.kaggle.com/c/malware-classification/
http://arxiv.org/pdf/1511.04317.pdf
http://jmlr.org/proceedings/papers/v42/chen14.pdf
11
Malware Features
The extracted features determine your
model’s performance, but there is a tradeoff
Complicated Explainable
12
Complicated Features
Byte frequency and byte
entropy features form a
binary fingerprint that inform
the model
13
Explainable Features
Lists of capabilities don’t greatly help the model classify a
sample, but they can provide more insight to an analyst.
This sample can:
• Record keystrokes
• Send/receive network traffic
• Modify registry
14
Evaluating Performance
We must be careful not to learn from “future” information:
time
time
Train Data
Test Data
Model Train Times
Patterns learned here….
... should not inform classifications here
15
Bringing Humans in the Loop
Amazon built an entire tool (Mechanical Turk) to cheaply
generate labels from human intuition:
Are these products related?
16
Bringing Humans in the Loop
Our labels are more expensive to obtain, and so choosing
what samples to label is even more important.
Is this binary malicious?
Active Learning can help!
17
Bringing Humans in the Loop
When new data arrives, Active Learning tells analysts
which labels would be most helpful.
18
Integration
• Our malware classifier model has been integrated into
our stealthy sensor and Hunt Platform
• Ask the other friendly Endgamers here for a demo!
19
Thanks!
proth@endgame.com
@mrphilroth
20

Mais conteúdo relacionado

Mais procurados

Semantics aware malware detection ppt
Semantics aware malware detection pptSemantics aware malware detection ppt
Semantics aware malware detection ppt
Manish Yadav
 

Mais procurados (20)

Cognitive Computing in Security with AI
Cognitive Computing in Security with AI Cognitive Computing in Security with AI
Cognitive Computing in Security with AI
 
Machine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and ClusteringMachine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and Clustering
 
Machine Learning in Malware Detection
Machine Learning in Malware DetectionMachine Learning in Malware Detection
Machine Learning in Malware Detection
 
Semantics aware malware detection ppt
Semantics aware malware detection pptSemantics aware malware detection ppt
Semantics aware malware detection ppt
 
"Быстрое обнаружение вредоносного ПО для Android с помощью машинного обучения...
"Быстрое обнаружение вредоносного ПО для Android с помощью машинного обучения..."Быстрое обнаружение вредоносного ПО для Android с помощью машинного обучения...
"Быстрое обнаружение вредоносного ПО для Android с помощью машинного обучения...
 
Malware classification using Machine Learning
Malware classification using Machine LearningMalware classification using Machine Learning
Malware classification using Machine Learning
 
Malware Detection using Machine Learning
Malware Detection using Machine Learning	Malware Detection using Machine Learning
Malware Detection using Machine Learning
 
Labeling the virus share malware dataset lessons learned
Labeling the virus share malware dataset  lessons learnedLabeling the virus share malware dataset  lessons learned
Labeling the virus share malware dataset lessons learned
 
Malware Classification and Analysis
Malware Classification and AnalysisMalware Classification and Analysis
Malware Classification and Analysis
 
Malware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesMalware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning Techniques
 
Detecting Evasive Malware in Sandbox
Detecting Evasive Malware in SandboxDetecting Evasive Malware in Sandbox
Detecting Evasive Malware in Sandbox
 
Nguyen Huu Trung - Building a web vulnerability scanner - From a hacker’s view
Nguyen Huu Trung - Building a web vulnerability scanner - From a hacker’s viewNguyen Huu Trung - Building a web vulnerability scanner - From a hacker’s view
Nguyen Huu Trung - Building a web vulnerability scanner - From a hacker’s view
 
Worst-Case Scenario: Being Detected without Knowing You are Detected
Worst-Case Scenario: Being Detected without Knowing You are DetectedWorst-Case Scenario: Being Detected without Knowing You are Detected
Worst-Case Scenario: Being Detected without Knowing You are Detected
 
Malware Dectection Using Machine learning
Malware Dectection Using Machine learningMalware Dectection Using Machine learning
Malware Dectection Using Machine learning
 
An Example of use the Threat Modeling Tool (FFRI Monthly Research Nov 2016)
An Example of use the Threat Modeling Tool (FFRI Monthly Research Nov 2016)An Example of use the Threat Modeling Tool (FFRI Monthly Research Nov 2016)
An Example of use the Threat Modeling Tool (FFRI Monthly Research Nov 2016)
 
STRIDE Variants and Security Requirements-based Threat Analysis (FFRI Monthly...
STRIDE Variants and Security Requirements-based Threat Analysis (FFRI Monthly...STRIDE Variants and Security Requirements-based Threat Analysis (FFRI Monthly...
STRIDE Variants and Security Requirements-based Threat Analysis (FFRI Monthly...
 
The VTC experience
The VTC experienceThe VTC experience
The VTC experience
 
Introduction to penetration testing
Introduction to penetration testingIntroduction to penetration testing
Introduction to penetration testing
 
Whittaker How To Break Software Security - SoftTest Ireland
Whittaker How To Break Software Security - SoftTest IrelandWhittaker How To Break Software Security - SoftTest Ireland
Whittaker How To Break Software Security - SoftTest Ireland
 
Test Strategies & Common Mistakes
Test Strategies & Common MistakesTest Strategies & Common Mistakes
Test Strategies & Common Mistakes
 

Semelhante a Machine Learning for Malware Classification and Clustering

Webinar on Functional Safety Analysis using Model-based System Analysis
Webinar on Functional Safety Analysis using Model-based System AnalysisWebinar on Functional Safety Analysis using Model-based System Analysis
Webinar on Functional Safety Analysis using Model-based System Analysis
Deepak Shankar
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
Rod Soto
 
BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...
BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...
BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...
BlueHat Security Conference
 
First Principles Vulnerability Assessment
First Principles Vulnerability AssessmentFirst Principles Vulnerability Assessment
First Principles Vulnerability Assessment
Manuel Brugnoli
 

Semelhante a Machine Learning for Malware Classification and Clustering (20)

Web applications security conference slides
Web applications security  conference slidesWeb applications security  conference slides
Web applications security conference slides
 
Cybersecurity and Generative AI - for Good and Bad vol.2
Cybersecurity and Generative AI - for Good and Bad vol.2Cybersecurity and Generative AI - for Good and Bad vol.2
Cybersecurity and Generative AI - for Good and Bad vol.2
 
Battista Biggio @ ICML 2015 - "Is Feature Selection Secure against Training D...
Battista Biggio @ ICML 2015 - "Is Feature Selection Secure against Training D...Battista Biggio @ ICML 2015 - "Is Feature Selection Secure against Training D...
Battista Biggio @ ICML 2015 - "Is Feature Selection Secure against Training D...
 
Webinar on Functional Safety Analysis using Model-based System Analysis
Webinar on Functional Safety Analysis using Model-based System AnalysisWebinar on Functional Safety Analysis using Model-based System Analysis
Webinar on Functional Safety Analysis using Model-based System Analysis
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
 
BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...
BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...
BlueHat Seattle 2019 || The good, the bad & the ugly of ML based approaches f...
 
Malware Collection and Analysis via Hardware Virtualization
Malware Collection and Analysis via Hardware VirtualizationMalware Collection and Analysis via Hardware Virtualization
Malware Collection and Analysis via Hardware Virtualization
 
First Principles Vulnerability Assessment
First Principles Vulnerability AssessmentFirst Principles Vulnerability Assessment
First Principles Vulnerability Assessment
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
Securing your Machine Learning models
Securing your Machine Learning modelsSecuring your Machine Learning models
Securing your Machine Learning models
 
AI & ML in Cyber Security - Why Algorithms Are Dangerous
AI & ML in Cyber Security - Why Algorithms Are DangerousAI & ML in Cyber Security - Why Algorithms Are Dangerous
AI & ML in Cyber Security - Why Algorithms Are Dangerous
 
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malwareDefcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
 
Machine Learning & Predictive Maintenance
Machine Learning &  Predictive MaintenanceMachine Learning &  Predictive Maintenance
Machine Learning & Predictive Maintenance
 
Controlling Access to IBM i Systems and Data
Controlling Access to IBM i Systems and DataControlling Access to IBM i Systems and Data
Controlling Access to IBM i Systems and Data
 
Deep learning in manufacturing predicting and preventing manufacturing defect...
Deep learning in manufacturing predicting and preventing manufacturing defect...Deep learning in manufacturing predicting and preventing manufacturing defect...
Deep learning in manufacturing predicting and preventing manufacturing defect...
 
New Horizons SCYBER Presentation
New Horizons SCYBER PresentationNew Horizons SCYBER Presentation
New Horizons SCYBER Presentation
 
Expand Your Control of Access to IBM i Systems and Data
Expand Your Control of Access to IBM i Systems and DataExpand Your Control of Access to IBM i Systems and Data
Expand Your Control of Access to IBM i Systems and Data
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and Bad
 
Foutse_Khomh.pptx
Foutse_Khomh.pptxFoutse_Khomh.pptx
Foutse_Khomh.pptx
 
Rise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupRise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetup
 

Mais de EndgameInc

Mais de EndgameInc (7)

Filar seymour oreilly_bot_story_
Filar seymour oreilly_bot_story_Filar seymour oreilly_bot_story_
Filar seymour oreilly_bot_story_
 
Hunting before a Known Incident
Hunting before a Known IncidentHunting before a Known Incident
Hunting before a Known Incident
 
Hardware-Assisted Rootkits & Instrumentation
Hardware-Assisted Rootkits & InstrumentationHardware-Assisted Rootkits & Instrumentation
Hardware-Assisted Rootkits & Instrumentation
 
Hunting on the Cheap
Hunting on the CheapHunting on the Cheap
Hunting on the Cheap
 
​Dynamic Detection of Malicious Behavior
​Dynamic Detection of Malicious Behavior​Dynamic Detection of Malicious Behavior
​Dynamic Detection of Malicious Behavior
 
Extracting the Malware Signal from Internet Noise
Extracting the Malware Signal from Internet NoiseExtracting the Malware Signal from Internet Noise
Extracting the Malware Signal from Internet Noise
 
Worst-Case Scenario: Being Detected without Knowing You are Detected
Worst-Case Scenario: Being Detected without Knowing You are DetectedWorst-Case Scenario: Being Detected without Knowing You are Detected
Worst-Case Scenario: Being Detected without Knowing You are Detected
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 

Machine Learning for Malware Classification and Clustering

  • 1. Machine Learning for Malware Classification and Clustering Phil Roth, Data Scientist 1
  • 2. • PhD in particle astrophysics • Switched to making images from radar data • Switched to solving security problems with data Phil Roth Data Scientist 2
  • 3. Outline • Malware Detection • Boosted Decision Trees • Malware Features • Evaluating Performance • Bringing a Human into the Loop 3
  • 4. The Problem: Antivirus The security industry has declared antivirus as dead, but there is no widely accepted replacement. Machine Learning can be that replacement. 4
  • 5. The Problem: Antivirus • Antivirus uses signatures, heuristics, and hand crafted rules that do not scale well • Using polymorphism and obfuscation, malware authors can circumvent rules based detection techniques 5
  • 6. The Solution: Machine Learning Machine Learning uses statistical techniques to learn patterns from large datasets 6 Two Steps: • Feature Extraction • Boundary Learning
  • 7. Machine Learning Advantages • Automation • Deep Insights • Scalability • Generalization 7
  • 8. Machine Learning Challenges • Requires labels • Requires large data sets • Security field requires very low tolerance for errors 8
  • 9. Boosted Decision Trees Basically, it’s a game of 20 questions Source: https://en.wikipedia.org/wiki/Decision_tree_learning A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. 9
  • 10. Boosted Decision Trees • The trees are built by choosing “questions” that maximize the discrimination between two classes • The model is called “boosted” because misclassified samples are given higher weight in future tree building 10
  • 11. Why Boosted Decision Trees? Proven results in security and physics References: https://www.kaggle.com/c/malware-classification/ http://arxiv.org/pdf/1511.04317.pdf http://jmlr.org/proceedings/papers/v42/chen14.pdf 11
  • 12. Malware Features The extracted features determine your model’s performance, but there is a tradeoff Complicated Explainable 12
  • 13. Complicated Features Byte frequency and byte entropy features form a binary fingerprint that inform the model 13
  • 14. Explainable Features Lists of capabilities don’t greatly help the model classify a sample, but they can provide more insight to an analyst. This sample can: • Record keystrokes • Send/receive network traffic • Modify registry 14
  • 15. Evaluating Performance We must be careful not to learn from “future” information: time time Train Data Test Data Model Train Times Patterns learned here…. ... should not inform classifications here 15
  • 16. Bringing Humans in the Loop Amazon built an entire tool (Mechanical Turk) to cheaply generate labels from human intuition: Are these products related? 16
  • 17. Bringing Humans in the Loop Our labels are more expensive to obtain, and so choosing what samples to label is even more important. Is this binary malicious? Active Learning can help! 17
  • 18. Bringing Humans in the Loop When new data arrives, Active Learning tells analysts which labels would be most helpful. 18
  • 19. Integration • Our malware classifier model has been integrated into our stealthy sensor and Hunt Platform • Ask the other friendly Endgamers here for a demo! 19

Notas do Editor

  1. Dive right into train versus test data.