SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
Anomaly/Novelty Detection
with scikit-learn
Alexandre Gramfort
Telecom ParisTech - CNRS LTCI
alexandre.gramfort@telecom-paristech.fr
GitHub : @agramfort Twitter : @agramfort
Alexandre Gramfort Anomaly detection with scikit-learn
What’s the problem?
2
Objective: Spot the red apple
Alexandre Gramfort Anomaly detection with scikit-learn
What’s the problem?
3
“An outlier is an observation in a data set which appears to
be inconsistent with the remainder of that set of data.”
Johnson 1992
“An outlier is an observation which deviates so much from
the other observations as to arouse suspicions that it was
generated by a different mechanism.”
Hawkins 1980
Outlier/Anomaly
Alexandre Gramfort Anomaly detection with scikit-learn
Types of AD
4
• Supervised AD
• Labels available for both normal data and anomalies
• Similar to rare class mining / imbalanced classification
• Semi-supervised AD (Novelty Detection)
• Only normal data available to train
• The algorithm learns on normal data only
• Unsupervised AD (Outlier Detection)
• no labels, training set = normal + abnormal data
• Assumption: anomalies are very rare
Alexandre Gramfort Anomaly detection with scikit-learn
Types of AD
5
• Supervised AD
• Labels available for both normal data and anomalies
• Similar to rare class mining / imbalanced classification
• Semi-supervised AD (Novelty Detection)
• Only normal data available to train
• The algorithm learns on normal data only
• Unsupervised AD (Outlier Detection)
• no labels, training set = normal + abnormal data
• Assumption: anomalies are very rare
Alexandre Gramfort Anomaly detection with scikit-learn
ML Taxonomy
6
Machine Learning
SupervisedUnsupervised
Regression Classif.
“Prediction”
Clustering Dim. Red. Anomaly
Novelty
Detection
Alexandre Gramfort Anomaly detection with scikit-learn
Applications
7
• Fraud detection
• Network intrusion
• Finance
• Insurance
• Maintenance
• Medicine (unusual symptoms)
• Measurement errors (from sensors)
Any application where
looking at unusual
observations is
relevant
Alexandre Gramfort Anomaly detection with scikit-learn
Big picture
9
Look for samples that are in
low density regions, isolated
Look for a region of the space
that is small in volume but
contains most of the samples
Density based approach (KDE,
Gaussian Ellipse, GMM)
Kernel methods
Nearest neighbors
Trees / Partitioning
Novelty detection via density estimates
>>> from sklearn.mixture import GaussianMixture
>>> gmm = GaussianMixture(n_components=1).fit(X)
>>> log_dens = gmm.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
Low density regions
Novelty detection via density estimates
>>> from sklearn.mixture import GaussianMixture
>>> gmm = GaussianMixture(n_components=2).fit(X)
>>> log_dens = gmm.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
Novelty detection via density estimates
>>> from sklearn.neighbors import KernelDensity
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)
>>> log_dens = kde.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
Kernel methods
Nearest neighbors
Trees / Partitioning
Our toy dataset
Kernel Approach
http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html
>>> est = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Nearest neighbors
Trees / Partitioning
Nearest Neighbors (NN) Approach
https://github.com/scikit-learn/scikit-learn/pull/5279
>>> est = LocalOutlierFactor(n_neighbors=5)
Local Outlier Factor (LOF)
https://en.wikipedia.org/wiki/Local_outlier_factor
Trees / Partitioning
Partitioning / Tree Approach
http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html
>>> est = IsolationForest(n_estimators=100)
Isolation Forest
An anomaly can be isolated with a
very shallow random tree
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Network intrusions
https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/
bench_isolation_forest.py
Alexandre Gramfort Anomaly detection with scikit-learn
Some caveats
25
• How to set model hyperparameters?
• How to evaluate performance in the unsupervised setup?
• In any AD method there is a notion of metric/similarity
between samples, e.g. Euclidian distance. Unclear how to define
it (think continuous, categorical features etc.)
http://scikit-learn.org/stable/modules/outlier_detection.html
Alexandre Gramfort
alexandre.gramfort@telecom-paristech.frContact:
GitHub : @agramfort Twitter : @agramfort
Questions?
1 position to work on Scikit-Learn and Scipy stack available !
Thanks @ngoix & @albertthomas88 for the work

Mais conteúdo relacionado

Mais procurados

Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningKuppusamy P
 
Anomaly Detection in Seasonal Time Series
Anomaly Detection in Seasonal Time SeriesAnomaly Detection in Seasonal Time Series
Anomaly Detection in Seasonal Time SeriesHumberto Marchezi
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detectionShantanuDeosthale
 
Anomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation ForestsAnomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation ForestsTuri, Inc.
 
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationAnomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationImpetus Technologies
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnSarah Guido
 
Adaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAdaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAndrea Dal Pozzolo
 
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaUnsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaPyData
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksZak Jost
 
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningAnomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningQuantUniversity
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reductionmrizwan969
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detectionguest0edcaf
 
Anomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-EncodersAnomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-EncodersGianmario Spacagna
 

Mais procurados (20)

Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Anomaly Detection: A Survey
Anomaly Detection: A SurveyAnomaly Detection: A Survey
Anomaly Detection: A Survey
 
Anomaly Detection in Seasonal Time Series
Anomaly Detection in Seasonal Time SeriesAnomaly Detection in Seasonal Time Series
Anomaly Detection in Seasonal Time Series
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Anomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation ForestsAnomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation Forests
 
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationAnomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
Adaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAdaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud Detection
 
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaUnsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
 
Xgboost
XgboostXgboost
Xgboost
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial Networks
 
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningAnomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Terminology Machine Learning
Terminology Machine LearningTerminology Machine Learning
Terminology Machine Learning
 
Anomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-EncodersAnomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-Encoders
 

Mais de agramfort

MNE sapien labs 2019
MNE sapien labs 2019MNE sapien labs 2019
MNE sapien labs 2019agramfort
 
MAIN Conf Talk: Learning representations from neural signals
MAIN Conf Talk: Learning representations from neural signalsMAIN Conf Talk: Learning representations from neural signals
MAIN Conf Talk: Learning representations from neural signalsagramfort
 
SfN 2018: Machine learning and signal processing for neural oscillations
SfN 2018: Machine learning and signal processing for neural oscillationsSfN 2018: Machine learning and signal processing for neural oscillations
SfN 2018: Machine learning and signal processing for neural oscillationsagramfort
 
ICML 2018 Reproducible Machine Learning - A. Gramfort
ICML 2018 Reproducible Machine Learning - A. GramfortICML 2018 Reproducible Machine Learning - A. Gramfort
ICML 2018 Reproducible Machine Learning - A. Gramfortagramfort
 
MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.agramfort
 
Teaching ML with scikit-learn at Telecom ParisTech
Teaching ML with scikit-learn at Telecom ParisTechTeaching ML with scikit-learn at Telecom ParisTech
Teaching ML with scikit-learn at Telecom ParisTechagramfort
 
Paris machine learning meetup 17 Sept. 2013
Paris machine learning meetup 17 Sept. 2013Paris machine learning meetup 17 Sept. 2013
Paris machine learning meetup 17 Sept. 2013agramfort
 

Mais de agramfort (7)

MNE sapien labs 2019
MNE sapien labs 2019MNE sapien labs 2019
MNE sapien labs 2019
 
MAIN Conf Talk: Learning representations from neural signals
MAIN Conf Talk: Learning representations from neural signalsMAIN Conf Talk: Learning representations from neural signals
MAIN Conf Talk: Learning representations from neural signals
 
SfN 2018: Machine learning and signal processing for neural oscillations
SfN 2018: Machine learning and signal processing for neural oscillationsSfN 2018: Machine learning and signal processing for neural oscillations
SfN 2018: Machine learning and signal processing for neural oscillations
 
ICML 2018 Reproducible Machine Learning - A. Gramfort
ICML 2018 Reproducible Machine Learning - A. GramfortICML 2018 Reproducible Machine Learning - A. Gramfort
ICML 2018 Reproducible Machine Learning - A. Gramfort
 
MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.
 
Teaching ML with scikit-learn at Telecom ParisTech
Teaching ML with scikit-learn at Telecom ParisTechTeaching ML with scikit-learn at Telecom ParisTech
Teaching ML with scikit-learn at Telecom ParisTech
 
Paris machine learning meetup 17 Sept. 2013
Paris machine learning meetup 17 Sept. 2013Paris machine learning meetup 17 Sept. 2013
Paris machine learning meetup 17 Sept. 2013
 

Último

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 

Último (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 

Anomaly/Novelty detection with scikit-learn

  • 1. Anomaly/Novelty Detection with scikit-learn Alexandre Gramfort Telecom ParisTech - CNRS LTCI alexandre.gramfort@telecom-paristech.fr GitHub : @agramfort Twitter : @agramfort
  • 2. Alexandre Gramfort Anomaly detection with scikit-learn What’s the problem? 2 Objective: Spot the red apple
  • 3. Alexandre Gramfort Anomaly detection with scikit-learn What’s the problem? 3 “An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.” Johnson 1992 “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” Hawkins 1980 Outlier/Anomaly
  • 4. Alexandre Gramfort Anomaly detection with scikit-learn Types of AD 4 • Supervised AD • Labels available for both normal data and anomalies • Similar to rare class mining / imbalanced classification • Semi-supervised AD (Novelty Detection) • Only normal data available to train • The algorithm learns on normal data only • Unsupervised AD (Outlier Detection) • no labels, training set = normal + abnormal data • Assumption: anomalies are very rare
  • 5. Alexandre Gramfort Anomaly detection with scikit-learn Types of AD 5 • Supervised AD • Labels available for both normal data and anomalies • Similar to rare class mining / imbalanced classification • Semi-supervised AD (Novelty Detection) • Only normal data available to train • The algorithm learns on normal data only • Unsupervised AD (Outlier Detection) • no labels, training set = normal + abnormal data • Assumption: anomalies are very rare
  • 6. Alexandre Gramfort Anomaly detection with scikit-learn ML Taxonomy 6 Machine Learning SupervisedUnsupervised Regression Classif. “Prediction” Clustering Dim. Red. Anomaly Novelty Detection
  • 7. Alexandre Gramfort Anomaly detection with scikit-learn Applications 7 • Fraud detection • Network intrusion • Finance • Insurance • Maintenance • Medicine (unusual symptoms) • Measurement errors (from sensors) Any application where looking at unusual observations is relevant
  • 8.
  • 9. Alexandre Gramfort Anomaly detection with scikit-learn Big picture 9 Look for samples that are in low density regions, isolated Look for a region of the space that is small in volume but contains most of the samples
  • 10. Density based approach (KDE, Gaussian Ellipse, GMM) Kernel methods Nearest neighbors Trees / Partitioning
  • 11. Novelty detection via density estimates >>> from sklearn.mixture import GaussianMixture >>> gmm = GaussianMixture(n_components=1).fit(X) >>> log_dens = gmm.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7) Low density regions
  • 12. Novelty detection via density estimates >>> from sklearn.mixture import GaussianMixture >>> gmm = GaussianMixture(n_components=2).fit(X) >>> log_dens = gmm.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
  • 13. Novelty detection via density estimates >>> from sklearn.neighbors import KernelDensity >>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X) >>> log_dens = kde.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
  • 18. Nearest Neighbors (NN) Approach https://github.com/scikit-learn/scikit-learn/pull/5279 >>> est = LocalOutlierFactor(n_neighbors=5)
  • 19. Local Outlier Factor (LOF) https://en.wikipedia.org/wiki/Local_outlier_factor
  • 21. Partitioning / Tree Approach http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html >>> est = IsolationForest(n_estimators=100)
  • 22. Isolation Forest An anomaly can be isolated with a very shallow random tree
  • 23.
  • 25. Alexandre Gramfort Anomaly detection with scikit-learn Some caveats 25 • How to set model hyperparameters? • How to evaluate performance in the unsupervised setup? • In any AD method there is a notion of metric/similarity between samples, e.g. Euclidian distance. Unclear how to define it (think continuous, categorical features etc.)
  • 27.
  • 28. Alexandre Gramfort alexandre.gramfort@telecom-paristech.frContact: GitHub : @agramfort Twitter : @agramfort Questions? 1 position to work on Scikit-Learn and Scipy stack available ! Thanks @ngoix & @albertthomas88 for the work