SlideShare uma empresa Scribd logo
1 de 23
Baixar para ler offline
Barry Becker, ESI-Group
VISUALIZATION OF
ENHANCED SPARK
INDUCED NAIVE BAYES
CLASSIFIER
Overview
• What the Naïve Bayes Classifier is
• How it can be visualized, and why you would want to
• Limitations of the version in spark, how to get around them
• Demo and use cases
What is a Naïve Bayes Classifier?
• A probabilistic model that can be used to predict a
categorical value (the target)
• Induced using training data and evaluated with test data
Inducer
Classifier
Training Data
New Data Predictions
What is a Naïve Bayes Classifier?
• Advantages
– Fast and simple
– Easy to visualize
• Disadvantage
– Assumes features are independent
Use Case: Detecting Poisonous Mushrooms
Expert determines edibility based on
cap shape, odor, gill spacing, stalk root,
veil type, veil color, ring type, spore print color,
habitat, gill size, stalk shape, … many more….
for 8,124 cases
https://www.pinterest.com/pin/239253798926772284
Bayes’ Rule
P(Edible | X) = P(X | Edible) P(Edible) / P(X)
Where X is
Odor = none and
Spore print color = chocolate and
Ring type = evanescent
Bayes Rule Demo
Understand
model with
Visualization
Everything you see is
based on counts
How the
Model
Makes
Predictions
Multiplies conditional
probabilities to come
up with a posterior
probability
How the
Model
Makes
Predictions
Multiplies conditional
probabilities to come
up with a posterior
probability
Limitations of Spark Naïve Bayes
Out-of-the-box version is only for document classification!
Document Classification Approach
• Each feature represents a word
• For each feature,
determine frequency for each class
More General Approach
• Each feature is a column
• For each value of each feature,
determine class distribution
=* …**
Prediction:
Not Spam
loan viagra zebra
=**
Prediction:
Poisonousevanescent none chocolate
Ring Type: Odor: SP Color:
spam
not spam
edible
poisonous
Limitations of Spark Naïve Bayes
All columns must be non-negative integers
– But what if there are nulls?
– But what if there are strings?
– But what if there are floating point values?
– But what if there are dates?
– What if target is continuous?
How to Handle String Columns?
• First replace Nulls with a special Null value
• Use StringIndexers to map string values to
integer indices
• Lastly, use IndexToString transformer to restore
the predicted value back to a string.
How to Handle Continuous Columns?
• Use MDLP Discretization to intelligently bin
continuous (number or date) columns with
respect to the target
– Entropy based binning makes distributions of
adjacent bins as different as possible
• Replace Nulls with NaN so they can be in a
separate bin.
Example:
Job Type
from
Application
MDLP applied to a continuous column
MDLP Binning
30 Uniform Bins
HistogramView
MDLP applied to continuous column
MDLP Binning
30 Uniform Bins
SpinogramView
MDLP Binning of Continuous Features
Splits seek to maximize the information, as measured by
entropy
Recursively split bins until
– Information gain is too small
– Bins are getting too small (avoids overfitting)
Open Source
https://github.com/sramirez/spark-MDLP-discretization
What if continuous Target?
Automatically bin
a continuous target
into 3 equal weight bins
Mineset Demo
Pixel-Perfect Rendering
Anti-aliased rendering
Pixel-perfect rendering
Pixel-Perfect Rendering
Anti-aliased rendering
Pixel-perfect rendering
Conclusion
• Spark Naïve Bayes classifier can be applied to
any data as long as some modifications made
• MDLP is useful for binning continuous features
• Visualization of model provides insight, trust,
and ability to answer what If questions
https://cloud.esi-group.com/analytics
Barry Becker - bbe@esi-group.com

Mais conteúdo relacionado

Semelhante a Visualization of Enhanced Spark Induced Naive Bayes Classifier with Barry Becker

Robust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labelsRobust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labels
Kimin Lee
 
Recommender Systems! @ASAI 2011
Recommender Systems! @ASAI 2011Recommender Systems! @ASAI 2011
Recommender Systems! @ASAI 2011
Ernesto Mislej
 
Revisiting evolutionary information filtering
Revisiting evolutionary information filteringRevisiting evolutionary information filtering
Revisiting evolutionary information filtering
Manolis Vavalis
 
Introduction to sampling
Introduction to samplingIntroduction to sampling
Introduction to sampling
Situo Liu
 

Semelhante a Visualization of Enhanced Spark Induced Naive Bayes Classifier with Barry Becker (20)

Robust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labelsRobust inference via generative classifiers for handling noisy labels
Robust inference via generative classifiers for handling noisy labels
 
Data Science 101
Data Science 101Data Science 101
Data Science 101
 
classfeb24.ppt
classfeb24.pptclassfeb24.ppt
classfeb24.ppt
 
classfeb24.ppt
classfeb24.pptclassfeb24.ppt
classfeb24.ppt
 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
Clusteryanam
ClusteryanamClusteryanam
Clusteryanam
 
Recommender Systems! @ASAI 2011
Recommender Systems! @ASAI 2011Recommender Systems! @ASAI 2011
Recommender Systems! @ASAI 2011
 
Revisiting evolutionary information filtering
Revisiting evolutionary information filteringRevisiting evolutionary information filtering
Revisiting evolutionary information filtering
 
What is Machine Learning?
What is Machine Learning?What is Machine Learning?
What is Machine Learning?
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Cluster_saumitra.ppt
Cluster_saumitra.pptCluster_saumitra.ppt
Cluster_saumitra.ppt
 
PandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled EnsemblesPandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled Ensembles
 
Creativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceCreativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data Science
 
Introduction to sampling
Introduction to samplingIntroduction to sampling
Introduction to sampling
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
Kevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data MiningKevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data Mining
 
Clustering
ClusteringClustering
Clustering
 

Mais de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Último (20)

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 

Visualization of Enhanced Spark Induced Naive Bayes Classifier with Barry Becker

  • 1. Barry Becker, ESI-Group VISUALIZATION OF ENHANCED SPARK INDUCED NAIVE BAYES CLASSIFIER
  • 2. Overview • What the Naïve Bayes Classifier is • How it can be visualized, and why you would want to • Limitations of the version in spark, how to get around them • Demo and use cases
  • 3. What is a Naïve Bayes Classifier? • A probabilistic model that can be used to predict a categorical value (the target) • Induced using training data and evaluated with test data Inducer Classifier Training Data New Data Predictions
  • 4. What is a Naïve Bayes Classifier? • Advantages – Fast and simple – Easy to visualize • Disadvantage – Assumes features are independent
  • 5. Use Case: Detecting Poisonous Mushrooms Expert determines edibility based on cap shape, odor, gill spacing, stalk root, veil type, veil color, ring type, spore print color, habitat, gill size, stalk shape, … many more…. for 8,124 cases https://www.pinterest.com/pin/239253798926772284
  • 6. Bayes’ Rule P(Edible | X) = P(X | Edible) P(Edible) / P(X) Where X is Odor = none and Spore print color = chocolate and Ring type = evanescent Bayes Rule Demo
  • 10. Limitations of Spark Naïve Bayes Out-of-the-box version is only for document classification! Document Classification Approach • Each feature represents a word • For each feature, determine frequency for each class More General Approach • Each feature is a column • For each value of each feature, determine class distribution =* …** Prediction: Not Spam loan viagra zebra =** Prediction: Poisonousevanescent none chocolate Ring Type: Odor: SP Color: spam not spam edible poisonous
  • 11. Limitations of Spark Naïve Bayes All columns must be non-negative integers – But what if there are nulls? – But what if there are strings? – But what if there are floating point values? – But what if there are dates? – What if target is continuous?
  • 12. How to Handle String Columns? • First replace Nulls with a special Null value • Use StringIndexers to map string values to integer indices • Lastly, use IndexToString transformer to restore the predicted value back to a string.
  • 13. How to Handle Continuous Columns? • Use MDLP Discretization to intelligently bin continuous (number or date) columns with respect to the target – Entropy based binning makes distributions of adjacent bins as different as possible • Replace Nulls with NaN so they can be in a separate bin.
  • 15. MDLP applied to a continuous column MDLP Binning 30 Uniform Bins HistogramView
  • 16. MDLP applied to continuous column MDLP Binning 30 Uniform Bins SpinogramView
  • 17. MDLP Binning of Continuous Features Splits seek to maximize the information, as measured by entropy Recursively split bins until – Information gain is too small – Bins are getting too small (avoids overfitting) Open Source https://github.com/sramirez/spark-MDLP-discretization
  • 18. What if continuous Target? Automatically bin a continuous target into 3 equal weight bins
  • 22. Conclusion • Spark Naïve Bayes classifier can be applied to any data as long as some modifications made • MDLP is useful for binning continuous features • Visualization of model provides insight, trust, and ability to answer what If questions