ember
an open source
malware classifier and
dataset
whoami
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
Learned ML at IceCube
Applying it at Endgame
whoami
Hyrum Anderson
Technical Director of Data Science
@drhyrum
Open datasets push ML research
forward
source: https://twitter.com/benhamner/status/938123380074610688
Datasets cited in NIPS papers over time
One example: MNIST
MNIST: http://yann.lecun.com/exdb/mnist/
Database of 70k (60k/10k
training/test split) images of
handwritten digits
“MNIST is the new unit test” –Ian
Goodfellow
Even when the dataset can no
longer effectively measure
performance improvements, it’s
still useful as a sanity check.
Another example: CIFAR 10/100
CIFAR-10:
Database of 60k (50k/10k training/test
split) images of 10 different classes
CIFAR-100:
60k images of 100 different classes
CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
Security lacks these datasets
2014 Corporate Blog
2015 RSA FloorTalk
Reasons security lacks these
datasets
Personally identifiable information
Communicating vulnerabilities to attackers
Intellectual property
Existing Security Datasets
Mike Sconzo’s SecRepo: http://www.secrepo.com/
DGA Detection
Domain generation algorithms create large numbers of domain names to serve as
rendezvous for C&C servers.
Datasets available:
Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/
DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
Network Intrusion Detection
Unsupervised learning problem looking for anomalous network events. (To me, this
turns into an alert ordering problem)
Datasets available:
DARPA Datasets:
https://www.ll.mit.edu//ideval/data/1998data.html
https://www.ll.mit.edu//ideval/data/1999data.html
KDD Cup 1999:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
OLD!!!!
Static Classification of Malware
Basically the antivirus problem solved with machine learning.
Datasets available:
Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/
VirusShare [Malicious Only]: https://virusshare.com/
Microsoft Malware Challenge [Malicious Only. Headers Stripped]:
https://www.kaggle.com/c/malware-classification
Static Classification of Malware
Benign and malicious samples can
be distributed in a feature space
(using attributes like file size and
number of imports)
Goal is to predict samples that we
haven’t seen yet
Static Classification of Malware
A YARA rule can divide these two
classes. But a simple rule won’t be
generalizable.
Static Classification of Malware
A machine learning model can
define a better boundary that
makes more accurate predictions
There are so many options for
machine learning algorithms. How
do we know which one is best?
Endgame Malware BEnchmark for Research
“MNIST for malware”
ember
“I know... But, if I tried to avoid
the name of every Javascript
framework, there wouldn’t be
any names left.”
Endgame Malware BEnchmark for Research
An open source collection of 1.1 million PE File sha256 hashes that were
scanned by VirusTotal sometime in 2017.
The dataset includes metadata, derived features from the PE files, a model
trained on those features, and accompanying code.
It does NOT include the files themselves.
ember
The dataset is divided into a 900k training set and a
200k testing set
Training set includes 300k each of benign, malicious, and
unlabeled samples
data
Training set data appears
chronologically prior to the test data
Date metadata allows:
• Chronological cross validation
• Quantifying model performance
degradation over time
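A minimal sketch of chronological cross validation, assuming each record carries the "appeared" month shown in the metadata. The records below are made up for illustration, and the helper name chronological_folds is hypothetical, not part of the ember codebase.

```python
from collections import defaultdict

# Hypothetical records mimicking the metadata keys shown in this talk
# ("sha256", "appeared", "label"); the values are made up for illustration.
records = [
    {"sha256": "a1", "appeared": "2017-01", "label": 0},
    {"sha256": "b2", "appeared": "2017-03", "label": 1},
    {"sha256": "c3", "appeared": "2017-02", "label": 1},
    {"sha256": "d4", "appeared": "2017-04", "label": 0},
    {"sha256": "e5", "appeared": "2017-05", "label": 1},
]


def chronological_folds(rows, n_folds):
    """Yield (train, test) pairs where every train row predates every test row."""
    # "YYYY-MM" strings sort correctly as plain text.
    months = sorted({r["appeared"] for r in rows})
    by_month = defaultdict(list)
    for r in rows:
        by_month[r["appeared"]].append(r)
    # Use the last n_folds months as successive test periods.
    for test_month in months[-n_folds:]:
        train = [r for m in months if m < test_month for r in by_month[m]]
        test = by_month[test_month]
        yield train, test


folds = list(chronological_folds(records, n_folds=2))
for train, test in folds:
    print(len(train), "train rows before", test[0]["appeared"])
```

Scoring a model on each successive test month is one way to see how quickly performance degrades as the training data ages.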
data
7 JSON Lines files containing extracted features
data
[proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
-rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
[proth@proth-mbp data]$ cd ember
[proth@proth-mbp ember]$ ls -lh
total 9.2G
-rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
-rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
The first three keys of each line are metadata
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
{
"sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
"appeared": "2006-12",
"label": 0,
The rest of the keys are feature categories
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys"
[
"byteentropy",
"exports",
"general",
"header",
"histogram",
"imports",
"section",
"strings"
]
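The same split between metadata and feature categories can be sketched in Python. The sample record is made up and heavily shortened, and the helper name split_record is hypothetical; only the three metadata key names come from the output above.

```python
import io
import json

# A made-up, heavily shortened record with the same top-level metadata keys
# as the dataset; real lines carry much larger feature payloads.
sample_line = json.dumps({
    "sha256": "0" * 64,
    "appeared": "2017-01",
    "label": 1,
    "histogram": [0] * 256,
    "general": {"size": 1024},
})

METADATA_KEYS = ("sha256", "appeared", "label")


def split_record(line):
    """Return (metadata, raw_features) dictionaries for one JSON line."""
    record = json.loads(line)
    metadata = {k: record[k] for k in METADATA_KEYS}
    features = {k: v for k, v in record.items() if k not in METADATA_KEYS}
    return metadata, features


# io.StringIO stands in for open("train_features_0.jsonl").
with io.StringIO(sample_line + "\n") as handle:
    for line in handle:
        metadata, features = split_record(line)
        print(metadata["appeared"], sorted(features))
```

Reading line by line keeps memory use flat, which matters when a single file is over a gigabyte.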
features
Two kinds of features:
Calculated from raw bytes
Calculated by parsing the PE file format
with LIEF
https://lief.quarkslab.com/
https://lief.quarkslab.com/doc/Intro.html
https://github.com/lief-project/LIEF
features
Raw features are calculated from
the bytes and the lief object
Vectorized features are calculated
from the raw features
features
• Byte Histogram (histogram)
A simple counting of how many times each byte occurs
• Byte Entropy Histogram (byteentropy)
Sliding window entropy calculation
Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
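A simplified sketch of both byte-level features, using only the standard library. The window and step sizes are assumed values, and this version only reports one entropy value per window; the cited paper goes further and bins (entropy, byte value) pairs into a two-dimensional histogram, which is not reproduced here.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Count how many times each of the 256 byte values occurs."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields integers 0..255
        counts[value] += 1
    return counts


def shannon_entropy(chunk):
    """Entropy in bits per byte (0.0 to 8.0) of one chunk of bytes."""
    total = len(chunk)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in Counter(chunk).values())


def sliding_entropy(data, window=2048, step=1024):
    """Entropy of each window; window and step sizes are assumed values."""
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]


data = bytes(range(256)) * 16 + b"\x00" * 4096  # 4 KiB varied, then 4 KiB zeros
hist = byte_histogram(data)
entropies = sliding_entropy(data)
print(hist[0], hist[255], [round(e, 2) for e in entropies])
```

High-entropy windows often indicate packed or encrypted regions, which is why the entropy profile is informative for a classifier.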
features
• Section Information (section)
Entry section and a list of all sections, with name, size, entropy, and other information
given for each
features
• Import Information (imports)
Each library imported from along with imported function names
• Export Information (exports)
Exported function names
features
• String Information (strings)
Number of strings, average length, character histogram, number of strings that
match various patterns like URLs, MZ header, or registry keys
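A rough sketch of how such string statistics could be computed. The definition of a string (five or more printable ASCII bytes), the regular expressions, and the output key names are all assumptions made for illustration, not a copy of the ember feature extractor.

```python
import re

# Assumed definition of a "string": 5 or more consecutive printable ASCII bytes.
PRINTABLE = re.compile(rb"[\x20-\x7e]{5,}")
URL = re.compile(rb"https?://", re.IGNORECASE)
REGISTRY = re.compile(rb"HKEY_")
MZ = re.compile(rb"MZ")


def string_features(data):
    """Summarize the printable strings found in a blob of bytes."""
    strings = PRINTABLE.findall(data)
    count = len(strings)
    avg_len = sum(len(s) for s in strings) / count if count else 0.0
    return {
        "numstrings": count,
        "avlength": avg_len,
        "urls": len(URL.findall(data)),
        "registry": len(REGISTRY.findall(data)),
        "MZ": len(MZ.findall(data)),
    }


blob = b"MZ\x90\x00\x03http://example.com/a\x00\x01HKEY_LOCAL_MACHINE\x00ab\x00"
features = string_features(blob)
print(features)
```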
features
• General Information (general)
Number of imports, exports, symbols and whether the file has relocations,
resources, or a signature
features
• Header Information (header)
Details about the machine the file was compiled on: versions of linkers, images,
operating system, etc.
vectorization
After downloading the dataset, feature vectorization is a necessary
step before model training
The ember codebase defines how each feature is hashed into a
vector using scikit-learn tools (the FeatureHasher class)
Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
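The idea behind feature hashing can be illustrated with the standard library alone. This is a stand-in for the concept, not scikit-learn's implementation: FeatureHasher uses a different hash function (MurmurHash3), and the "library:function" token format and the vector length of 16 are assumptions chosen to keep the example small.

```python
import hashlib


def hash_strings(tokens, n_features=16):
    """Map a variable-length list of strings onto a fixed-length vector."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        index = value % n_features          # which bucket the token lands in
        sign = 1 if digest[8] & 1 else -1   # signed updates reduce collision bias
        vector[index] += sign
    return vector


# Made-up import table: library name mapped to imported function names.
imports = {"KERNEL32.dll": ["CreateFileA", "WriteFile"], "USER32.dll": ["MessageBoxA"]}
tokens = [f"{lib.lower()}:{func}" for lib, funcs in imports.items() for func in funcs]
vector = hash_strings(tokens)
print(tokens)
print(vector)
```

The payoff is that every file yields a vector of the same length no matter how many imports, exports, or sections it has, at the cost of occasional collisions between unrelated tokens.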
model
Gradient boosted decision tree model trained with
LightGBM on labeled samples
Model training took 3 hours on my 2015 MacBook
Pro i7
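import lightgbm as lgb
from ember import read_vectorized_features  # helper from the ember repo

data_dir = "ember"  # assumed path to the extracted, vectorized dataset
X_train, y_train = read_vectorized_features(data_dir, subset="train")
train_rows = (y_train != -1)  # keep labeled rows only; -1 marks unlabeled samples
lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)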
model
Ember Model Performance:
ROC AUC: 0.9991123269999999
Threshold: 0.871
False Positive Rate: 0.099%
False Negative Rate: 7.009%
Detection Rate: 92.991%
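These rates are tied together: the detection rate is one minus the false negative rate (92.991% + 7.009% = 100%). A small sketch of how they follow from model scores and a threshold, using made-up scores and assuming a sample is flagged when its score is at or above the threshold:

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return fp / negatives, fn / positives


# Made-up scores for illustration; 0 = benign, 1 = malicious.
scores = [0.05, 0.20, 0.90, 0.40, 0.95, 0.99, 0.60, 0.88]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

fpr, fnr = rates_at_threshold(scores, labels, threshold=0.871)
detection_rate = 1.0 - fnr
print(f"FPR {fpr:.1%}  FNR {fnr:.1%}  detection rate {detection_rate:.1%}")
```

Raising the threshold trades false positives for false negatives, so a reported detection rate only means something alongside the false positive rate it was measured at.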
disclaimer
This model is NOT MalwareScore
MalwareScore:
is better optimized
has better features
performs better
is constantly updated with new data
is the best option for protecting your endpoints (in my totally biased opinion)
code
https://github.com/endgameinc/ember
The ember repo makes
it easy to:
• Vectorize features
• Train the model
• Make predictions on
new PE files
notebook
The Jupyter notebook will
reproduce the graphics from
this talk from the extracted
dataset
suggestions
To beat the benchmark model performance:
Use feature selection techniques to eliminate misleading features
Do feature engineering to find better features
Optimize LightGBM model parameters with grid search
Incorporate information from unlabeled samples into training
suggestions
To further research in the field of ML for static malware
detection:
Quantify model performance degradation through time
Build and compare the performance of featureless neural network
based models (need independent access to samples)
An adversarial network could create or modify PE files to bypass
ember model classification
demo time!
ember
Highlight: “Evidently, despite increased model size and computational
burden, featureless deep learning models have yet to eclipse the
performance of models that leverage domain knowledge via parsed
features.”
Read the paper:
https://arxiv.org/abs/1804.04637
ember
Download the data:
https://pubdata.endgame.com/ember/ember_dataset.tar.bz2
Download the code:
https://github.com/endgameinc/ember
THANK YOU!
Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum

Mais conteúdo relacionado

Mais procurados

Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.ASHOK KUMAR
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)SwatiTripathi44
 
Deep learning with keras
Deep learning with kerasDeep learning with keras
Deep learning with kerasMOHITKUMAR1379
 
Skin lesion detection from dermoscopic images using Convolutional Neural Netw...
Skin lesion detection from dermoscopic images using Convolutional Neural Netw...Skin lesion detection from dermoscopic images using Convolutional Neural Netw...
Skin lesion detection from dermoscopic images using Convolutional Neural Netw...Adrià Romero López
 
Malware Dectection Using Machine learning
Malware Dectection Using Machine learningMalware Dectection Using Machine learning
Malware Dectection Using Machine learningShubham Dubey
 
Machine Learning SPPU Unit 1
Machine Learning SPPU Unit 1Machine Learning SPPU Unit 1
Machine Learning SPPU Unit 1Amruta Aphale
 
Deep Learning - A Literature survey
Deep Learning - A Literature surveyDeep Learning - A Literature survey
Deep Learning - A Literature surveyAkshay Hegde
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognitionYUNG-KUEI CHEN
 
Scene classification using Convolutional Neural Networks - Jayani Withanawasam
Scene classification using Convolutional Neural Networks - Jayani WithanawasamScene classification using Convolutional Neural Networks - Jayani Withanawasam
Scene classification using Convolutional Neural Networks - Jayani WithanawasamWithTheBest
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning BasicsSuresh Arora
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
Find and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedInFind and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedInDaniel Tunkelang
 
Melanoma Skin Cancer Detection using Image Processing and Machine Learning
Melanoma Skin Cancer Detection using Image Processing and Machine LearningMelanoma Skin Cancer Detection using Image Processing and Machine Learning
Melanoma Skin Cancer Detection using Image Processing and Machine Learningijtsrd
 
Introduction to Convolutional Neural Networks
Introduction to Convolutional Neural NetworksIntroduction to Convolutional Neural Networks
Introduction to Convolutional Neural NetworksHannes Hapke
 

Mais procurados (20)

Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)
 
Deep learning with keras
Deep learning with kerasDeep learning with keras
Deep learning with keras
 
Skin lesion detection from dermoscopic images using Convolutional Neural Netw...
Skin lesion detection from dermoscopic images using Convolutional Neural Netw...Skin lesion detection from dermoscopic images using Convolutional Neural Netw...
Skin lesion detection from dermoscopic images using Convolutional Neural Netw...
 
Malware Dectection Using Machine learning
Malware Dectection Using Machine learningMalware Dectection Using Machine learning
Malware Dectection Using Machine learning
 
Machine Learning SPPU Unit 1
Machine Learning SPPU Unit 1Machine Learning SPPU Unit 1
Machine Learning SPPU Unit 1
 
Deep Learning - A Literature survey
Deep Learning - A Literature surveyDeep Learning - A Literature survey
Deep Learning - A Literature survey
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognition
 
Scene classification using Convolutional Neural Networks - Jayani Withanawasam
Scene classification using Convolutional Neural Networks - Jayani WithanawasamScene classification using Convolutional Neural Networks - Jayani Withanawasam
Scene classification using Convolutional Neural Networks - Jayani Withanawasam
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Supervised learning
  Supervised learning  Supervised learning
Supervised learning
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Find and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedInFind and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedIn
 
Abusive Language Detection.pptx
Abusive Language Detection.pptxAbusive Language Detection.pptx
Abusive Language Detection.pptx
 
Sms spam-detection
Sms spam-detectionSms spam-detection
Sms spam-detection
 
Melanoma Skin Cancer Detection using Image Processing and Machine Learning
Melanoma Skin Cancer Detection using Image Processing and Machine LearningMelanoma Skin Cancer Detection using Image Processing and Machine Learning
Melanoma Skin Cancer Detection using Image Processing and Machine Learning
 
Introduction to Convolutional Neural Networks
Introduction to Convolutional Neural NetworksIntroduction to Convolutional Neural Networks
Introduction to Convolutional Neural Networks
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
 

Semelhante a Ember

PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019Masashi Shibata
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoffmrphilroth
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challengesMarc Borowczak
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profilerIhor Bobak
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material Bryan Yang
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoDatabricks
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
 
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...Amazon Web Services Korea
 
Automate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBoxAutomate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBoxAxel de Romblay
 

Semelhante a Ember (20)

PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
MLBox
MLBoxMLBox
MLBox
 
MLBox 0.8.2
MLBox 0.8.2 MLBox 0.8.2
MLBox 0.8.2
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoff
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challenges
 
OpenML 2019
OpenML 2019OpenML 2019
OpenML 2019
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profiler
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
 
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
 
Automate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBoxAutomate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBox
 

Último

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Último (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Ember

  • 1. ember an open source malware classifier and dataset
  • 3. whoami Hyrum Anderson Technical Director of Data Science @drhyrum
  • 4. Open datasets push ML research forward source: https://twitter.com/benhamner/status/938123380074610688 Datasets cited in NIPS papers over time
  • 5. One example: MNIST MNIST: http://yann.lecun.com/exdb/mnist/ Database of 70k (60k/10k training/test split) images of handwritten digits “MNIST is the new unit test” –Ian Goodfellow Even when the dataset can no longer effectively measure performance improvements, it’s still useful as a sanity check.
  • 6. Another example: CIFAR 10/100 CIFAR-10: Database of 60k (50k/10k training/test split) images of 10 different classes CIFAR-100: 60k images of 100 different classes CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
  • 7. Security lacks these datasets 2014 Corporate Blog 2015 RSA FloorTalk
  • 8. Reasons security lacks these datasets Personally identifiable information Communicating vulnerabilities to attackers Intellectual property
  • 10. DGA Detection Domain generation algorithms create large numbers of domain names to serve as rendezvous for C&C servers. Datasets available: Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/ DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
  • 11. Network Intrusion Detection Unsupervised learning problem looking for anomalous network events. (To me, this turns into an alert ordering problem) Datasets available: DARPA Datasets: https://www.ll.mit.edu//ideval/data/1998data.html https://www.ll.mit.edu//ideval/data/1999data.html KDD Cup 1999: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html OLD!!!!
  • 12. Static Classification of Malware Basically the antivirus problem solved with machine learning. Datasets available: Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/ VirusShare [Malicious Only]: https://virusshare.com/ Microsoft Malware Challenge [Malicious Only. Headers Stripped]: https://www.kaggle.com/c/malware-classification
  • 13. Static Classification of Malware Benign and malicious samples can be distributed in a feature space (using attributes like file size and number of imports). The goal is to predict labels for samples that we haven’t seen yet.
  • 14. Static Classification of Malware A YARA rule can divide these two classes. But a simple rule won’t generalize.
  • 15. Static Classification of Malware A machine learning model can define a better boundary that makes more accurate predictions. There are so many options for machine learning algorithms. How do we know which one is best?
  • 16. Endgame Malware BEnchmark for Research “MNIST for malware” ember
  • 17. “I know... But, if I tried to avoid the name of every Javascript framework, there wouldn’t be any names left.”
  • 18. ember Endgame Malware BEnchmark for Research An open source collection of 1.1 million PE file SHA-256 hashes that were scanned by VirusTotal sometime in 2017. The dataset includes metadata, derived features from the PE files, a model trained on those features, and accompanying code. It does NOT include the files themselves.
  • 19. data The dataset is divided into a 900k training set and a 200k testing set. The training set includes 300k each of benign, malicious, and unlabeled samples.
  • 20. data Training set data appears chronologically prior to the test data. Date metadata allows: • Chronological cross validation • Quantifying model performance degradation over time
  • 21. data 7 JSON Lines files containing extracted features: [proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2 -rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2 [proth@proth-mbp data]$ cd ember [proth@proth-mbp ember]$ ls -lh total 9.2G -rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl -rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
  • 22. data The first three keys of each line are metadata: [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4 { "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2", "appeared": "2006-12", "label": 0,
  • 23. data The rest of the keys are feature categories: [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys" [ "byteentropy", "exports", "general", "header", "histogram", "imports", "section", "strings" ]
  • 24. features Two kinds of features: those calculated from raw bytes, and those calculated by parsing the PE file format with LIEF. https://lief.quarkslab.com/ https://lief.quarkslab.com/doc/Intro.html https://github.com/lief-project/LIEF
  • 25. features Raw features are calculated from the bytes and the LIEF object. Vectorized features are calculated from the raw features.
  • 26. features • Byte Histogram (histogram): a simple count of how many times each byte value occurs • Byte Entropy Histogram (byteentropy): sliding window entropy calculation. Details in Section 2.1.1 of [Saxe, Berlin 2015]: https://arxiv.org/pdf/1508.03096.pdf
  • 27. features • Section Information (section) Entry section and a list of all sections, with name, size, entropy, and other information given for each
  • 28. features • Import Information (imports) Each library imported from along with imported function names • Export Information (exports) Exported function names
  • 29. features • String Information (strings) Number of strings, average length, character histogram, number of strings that match various patterns like URLs, MZ header, or registry keys
  • 30. features • General Information (general) Number of imports, exports, symbols and whether the file has relocations, resources, or a signature
  • 31. features • Header Information (header) Details about the machine the file was compiled on, and versions of linkers, images, and operating system, etc.
  • 32. vectorization After downloading the dataset, feature vectorization is a necessary step before model training. The ember codebase defines how each feature is hashed into a vector using scikit-learn tools (the FeatureHasher class). Feature vectorizing took 20 hours on my 2015 MacBook Pro i7.
  • 33. model Gradient Boosted Decision Tree model trained with LightGBM on labeled samples. Model training took 3 hours on my 2015 MacBook Pro i7. import lightgbm as lgb X_train, y_train = read_vectorized_features(data_dir, subset="train") train_rows = (y_train != -1) lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows]) lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
  • 34. model Ember Model Performance: ROC AUC: 0.9991123269999999 Threshold: 0.871 False Positive Rate: 0.099% False Negative Rate: 7.009% Detection Rate: 92.991%
  • 35. disclaimer This model is NOT MalwareScore. MalwareScore: is better optimized, has better features, performs better, is constantly updated with new data, and is the best option for protecting your endpoints (in my totally biased opinion).
  • 36. code https://github.com/endgameinc/ember The ember repo makes it easy to: • Vectorize features • Train the model • Make predictions on new PE files
  • 37. notebook The Jupyter notebook will reproduce the graphics from this talk from the extracted dataset
  • 38. suggestions To beat the benchmark model performance: use feature selection techniques to eliminate misleading features; do feature engineering to find better features; optimize LightGBM model parameters with grid search; incorporate information from unlabeled samples into training.
  • 39. suggestions To further research in the field of ML for static malware detection: quantify model performance degradation through time; build and compare the performance of featureless neural network based models (this needs independent access to samples); use an adversarial network to create or modify PE files that bypass ember model classification.
  • 41. ember Highlight: “Evidently, despite increased model size and computational burden, featureless deep learning models have yet to eclipse the performance of models that leverage domain knowledge via parsed features.” Read the paper: https://arxiv.org/abs/1804.04637
  • 43. ember Download the data: https://pubdata.endgame.com/ember/ember_dataset.tar.bz2 Download the code: https://github.com/endgameinc/ember THANK YOU! Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum
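
Sketch for the chronological split (slides 20 and 22): each record carries an "appeared" month such as "2006-12", so a date-based train/test split can be made with a plain string comparison, because zero-padded YYYY-MM strings sort in date order. The helper name `split_by_month` and the cutoff month used below are hypothetical, chosen only for illustration; this is not code from the ember repo.

```python
import json


def split_by_month(jsonl_lines, cutoff):
    """Split JSON Lines records on their "appeared" month.

    Records that appeared strictly before `cutoff` (a "YYYY-MM" string)
    go to the training list; everything else goes to the test list.
    """
    train, test = [], []
    for line in jsonl_lines:
        record = json.loads(line)
        # Zero-padded "YYYY-MM" strings compare in chronological order.
        if record["appeared"] < cutoff:
            train.append(record)
        else:
            test.append(record)
    return train, test


# Tiny made-up records in the same shape as the metadata keys on slide 22.
lines = [
    '{"sha256": "a", "appeared": "2017-01", "label": 0}',
    '{"sha256": "b", "appeared": "2017-11", "label": 1}',
    '{"sha256": "c", "appeared": "2017-12", "label": -1}',
]
train, test = split_by_month(lines, cutoff="2017-11")
```

Moving the cutoff forward month by month gives one simple way to approach the chronological cross validation mentioned on slide 20.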
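
Sketch for the raw-byte features (slide 26): a minimal standard-library illustration of counting byte values and computing Shannon entropy over sliding windows. The function names, the 2048-byte window, and the 1024-byte step are assumptions for illustration, not a statement of what the ember codebase does. It is also a simplification: the byteentropy feature described in [Saxe, Berlin 2015] bins entropy jointly with byte values, while this sketch returns only one entropy value per window.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Count how many times each of the 256 possible byte values occurs."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields integers 0..255
        counts[value] += 1
    return counts


def shannon_entropy(window):
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not window:
        return 0.0
    total = len(window)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(window).values()
    )


def windowed_entropy(data, window=2048, step=1024):
    """Entropy of each sliding window; sizes are illustrative assumptions."""
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [
        shannon_entropy(data[start:start + window])
        for start in range(0, len(data) - window + 1, step)
    ]
```

Low-entropy windows tend to be padding or plain text, and windows near 8 bits per byte tend to be compressed or encrypted content, which is why the entropy profile is a useful static signal.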
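
Sketch for the performance numbers (slide 34): the reported threshold, false positive rate, false negative rate, and detection rate are related by simple counting. The code below assumes the threshold is picked to keep the false positive rate at or below a target such as 0.1%, which is consistent with the 0.099% on the slide but is an assumption here. The function names and the toy scores are hypothetical.

```python
def threshold_at_fpr(scores, labels, target_fpr=0.001):
    """Lowest threshold that keeps the false positive rate <= target_fpr.

    A sample is flagged as malicious when its score is strictly greater
    than the threshold. Labels are 0 (benign) and 1 (malicious).
    """
    benign = sorted(
        (s for s, y in zip(scores, labels) if y == 0), reverse=True
    )
    allowed = int(target_fpr * len(benign))  # benign samples we may flag
    if allowed >= len(benign):
        return float("-inf")
    # Using the (allowed+1)-th highest benign score as the threshold means
    # at most `allowed` benign samples score strictly above it.
    return benign[allowed]


def rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate, detection rate)."""
    pairs = list(zip(scores, labels))
    n_benign = sum(1 for _, y in pairs if y == 0)
    n_malicious = sum(1 for _, y in pairs if y == 1)
    false_pos = sum(1 for s, y in pairs if y == 0 and s > threshold)
    false_neg = sum(1 for s, y in pairs if y == 1 and s <= threshold)
    fnr = false_neg / n_malicious
    return false_pos / n_benign, fnr, 1.0 - fnr


# Toy data: four benign and four malicious samples with made-up scores.
scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.95, 0.25, 0.99]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold = threshold_at_fpr(scores, labels, target_fpr=0.25)
fpr, fnr, detection = rates(scores, labels, threshold)
```

The detection rate is just one minus the false negative rate, which matches the slide: 7.009% and 92.991% sum to 100%.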