SlideShare a Scribd company logo
1 of 45
1
About Presenter
Alok Singh
Alok Singh is a Principal Engineer at the IBM CODAIT
(Center for Open-Source Data and AI Technologies). He
has built and architected multiple analytical
frameworks and implemented machine learning
algorithms along with various data science use cases.
His interest is in creating Big Data and scalable
machine learning software and algorithms. He has also
created many Data Science based applications.
– Interests: Big Data, Scalable Machine Learning Software, and Algorithms.
– Contact:
url- http://codait.org
email- singh_alok@hotmail.com
github -https://github.com/aloknsingh
2
3
Agenda
• IBM and Data Science
• Why this talk?
• Tools and Technology
• Overview and flow of workshop
• Scikit-learn based ML pipeline
• Various Model evaluation techniques
• Iterative Model training
• Hands on tutorial for building predictive models for imbalance datasets
• Setup
• Tutorial
• Q/A
4
IBM Commitments to Open
Source AI and Data Science
Center for Open Source
Data and AI Technologies
CODAIT
codait.org
codait (French)
= coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
5
Center for Open Source
Data and AI Technologies
Code - Build and improve practical frameworks to
enable more developers to realize immediate
value
Content – Showcase solutions to complex and
real world AI problems
Community – Bring developers and data
scientists to engage with IBM
Improving Enterprise AI lifecycle in Open Source
Gather
Data
Analyze
Data
Machine
Learning
Deep
Learning
Deploy
Model
Maintain
Model
Python
Data Science
Stack
Fabric for
Deep Learning
(FfDL)
Mleap +
PFA
Scikit-LearnPandas
Apache
Spark
Apache
Spark
Jupyter
Model
Asset
eXchange
Keras +
Tensorflow
CODAIT
codait.org
6
A L L Y O U R T O O L S I N O N E P L A C E
IBM Watson Studio Data Science Experience is an environment
that brings together everything that a Data Scientist needs. It
includes the most popular Open Source tools and IBM unique
value-add functionalities with community and social features,
integrated as a first class citizen to make Data Scientists more
successful. datascience.ibm.com
IBM Watson Studio for Data Science Experience
IBM Watson Studio
Click here to watch video
9
Why This Talk for building
Predictive Models for
Imbalance Datasets?
Challenges in building ML model
10
• Non Representative data
• Insufficient data
• Poor quality data
• Imbalance data set
• Irrelevant features
• Over fitting the model on data
• Under fitting of the model on data
• Whether model will generalize or not
Imbalance Dataset: A Story …
11
We get real life labeled data sets from our clients and we are asked to build a classifier to
predict if someone will buy our products.
We clean our data and build latest state of art ML model for it.
We run cross validation and feature engineering to select best model.
We evaluate model and it doesn’t perform good on test dataset but in theory it should perform
good.
We explore more and finds that the number of labeled samples where someone buys the
product is very less say 1 % and our classifier didn’t learn since it thought that those product
that was bought was noise.
What we will do in this workshop …
12
• Data Set Description.
• Exploratory Analysis to understand the data.
• Use various preprocessing to clean and prepare the data.
• Use naive XGBoost to run the classification.
• Use cross validation to get the model.
• Plot, precision recall curve and ROC curve.
• We will then tune it and use weighted positive samples to improve classification
performance.
• We will also talk about the following advanced techniques.
• Oversampling of minority class and Undersampling of majority class.
• SMOTE algorithms.
13
Tools and Technologies
Scikit-Learn
14
• Library of ML models and utilities
• Has many out of box ML models and utilities that works out of box
• Python is now very popular among data-scientists.
• XGBoost allows one to tune various parameters.
• Has pipeline, estimators and transformers concept that allows one to build big ML apps
• Well Documented and a lot of examples and support group.
Why and What of XGBoost.
15
• XGBoost is extreme gradient boosting algorithm based on trees and tends to perform very
good out of the box compare to other ML algorithms.
• XGBoost is popular amongst data-scientist and one of the most common ML algorithms
used in Kaggle Competitions.
• XGBoost allows one to tune various parameters.
• XGBoost allows parallel processing.
• Works very well with Python and Scikit Learn.
16
XGBoost
Source: Official Tutorial => https://xgboost.readthedocs.io/en/latest/tutorials/model.html
17
XGBoost (continue …)
Source: Official Tutorial => https://xgboost.readthedocs.io/en/latest/tutorials/model.html
18
XGBoost (continue …)
Source: From the XGBoost Author=> https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
19
Overview of the this tutorial
Scikit-based ML flow
Read
input
Data Exploration Preprocessing
Feature
Engineering
XGBoost
based
iterative
model
training
Scoring/Genera
lization
Metrics for Model Performance.
21
To come up with the best model, we should evaluate model performance for comparison among
various models. There are many ways to evaluate the model performance for classification.
Before we go into model training we will understand various model evaluation criteria
1) Accuracy Score
2) Confusion Matrix
3) ROC curve
4) Precision Recall Curve
22
Confusion Matrix
23
Receiver Operating Characteristics (ROC)
By FeanDoe - Modified version from Walber's Precision and Recall https://commons.wikimedia.org/wiki/File:Precisionrecall.svg, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=65826093
24
ROC (continue…)
Source:Wikipedia By Sharpr - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=44059691
25
Precision-Recall Curve(PR Curve)
By Walber [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons
26
PR Curve (Continue…)
F1-Score
27
• Since there is the inverse relationship between precision and
recall .
• You increase one , other will go down and vice versa.
• And hence if we want to do justice to both the metrics, we
usually use the combination of both .
• We use harmonic mean to combine them.
Which Metrics to use for Imbalance Datasets?
28
• We will use precision recall with varying threshold to find out the best threshold
• But we should note that Precision and recall have inverse relationship and hence if we want high
precision then we should be fine with low recall and vice versa.
• There are many applications where high precision might be desired. For example if we train a model
to detect safe website for kids. It's ok to reject many safe website (low recall) as long as we correctly
reject bad website (high precision)
• However for our applications, i.e bank client's CD subscription prediction, higher recall is desirable.
Since it is perfectly fine to have false client i.e model predict client will accept the subscription but
will actually decline it, as long as we predict almost all the client who is likely to accept the
subscription and thus improving the balance sheet.
High Precision Example (ML Model for Net-Nanny)
29
High Recall Example (ML Model for Potential Customer)
30
31
Model Training
Model Training Iteration
1.Candidate
Model
2.Train
3.Evaluate
4. Adjust
First Attempt to XGBoost Model Training
33
• We will split dataset into training and test datasets
• We will first create a simple XGBoost model with using traditional ML techniques.
• We will evaluate models using various metrics we discussed.
• Analyze the reports and concludes the model is not good.
Strategies for better Classifier for the Imbalance Dataset.
34
1. Use the weighted class i.e give higher weights to minority positive class i.e 'yes’ label.
2. Over-Sampling of the minority class and under-sampling of the majority class.
3. Use the SMOTE (Synthetic Minority Oversampling Techniques).
Weighted Class classifier
35
Since at training time, classifier doesn’t learn from minority samples
The natural approach would be to some how increase the number of minority samples
One way to do is to give more weight to minority class at the time of evaluation of the node
split.
Weighted Class classifier (continue…)
36
Minority Over-Sampling and Majority Under-sampling
37
• Idea is similar to Weighted class
• Instead of giving higher weights, we just replicate a subset of random minority samples
• Also we can under samples the majority class i.e don’t take all the samples from majority
class
SMOTE algorithm (Synthetic Minority Oversampling
Technique)
38
• Intuitively, just replicating the minority sample sounds cheating.
• So better approach is to have new synthetic samples created by interpolating and
extrapolating the two or more samples.
• The candidate samples for interpolation can come from majority or minority or the
combination of them.
• The suggestion by SMOTE is to use the samples on the decision boundary .
• SMOTE Paper by V Chawla and etal: https://arxiv.org/pdf/1106.1813.pdf
SMOTE (continue…)
SMOTE(continue…)
40
• Nearest neighbor minority samples
synthesis.
• For the center red circle for K=2, we
have two neighbors and we have
possibility of two synthesis samples for
dup_size=1,
• We randomly picks one of the small
new red samples.
41
Demo for predictive classifier
for imbalance datasets.
Checklist
42
1) Look the package installation guide accompanying this tutorial.
2) Clone the github:
https://github.com/IBM/xgboost-financial-predictions/
3) Data: data/bank.csv
4) Notebook:
notebooks/predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb
5) Slides: https://github.com/aloknsingh/ds_xgboost_clf_4_imbalance_data/slides
6) You can use one of the following platforms
a) Your laptop
b) IBM Watson Studio
Goto Notebook…
43
44
45

More Related Content

What's hot

Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksBICA Labs
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsRebecca Bilbro
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Aseda Owusua Addai-Deseh
 
Day 2 (Lecture 1): Introduction to Statistical Machine Learning and Applications
Day 2 (Lecture 1): Introduction to Statistical Machine Learning and ApplicationsDay 2 (Lecture 1): Introduction to Statistical Machine Learning and Applications
Day 2 (Lecture 1): Introduction to Statistical Machine Learning and ApplicationsAseda Owusua Addai-Deseh
 
Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...Edureka!
 
Machine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our WorldMachine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our WorldKen Tabor
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsDarius Barušauskas
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Marina Santini
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingTed Xiao
 
Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)Prakhar Rastogi
 
DutchMLSchool. Machine Learning End-to-End
DutchMLSchool. Machine Learning End-to-EndDutchMLSchool. Machine Learning End-to-End
DutchMLSchool. Machine Learning End-to-EndBigML, Inc
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine LearningHayim Makabee
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRaveen Perera
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? HackerEarth
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelSri Ambati
 
Ways to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performanceWays to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performanceMala Deep Upadhaya
 
Barga DIDC'14 Invited Talk
Barga DIDC'14 Invited TalkBarga DIDC'14 Invited Talk
Barga DIDC'14 Invited TalkRoger Barga
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its ApplicationsDr Ganesh Iyer
 
Machine learning_ Replicating Human Brain
Machine learning_ Replicating Human BrainMachine learning_ Replicating Human Brain
Machine learning_ Replicating Human BrainNishant Jain
 

What's hot (20)

Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big Models
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
 
Day 2 (Lecture 1): Introduction to Statistical Machine Learning and Applications
Day 2 (Lecture 1): Introduction to Statistical Machine Learning and ApplicationsDay 2 (Lecture 1): Introduction to Statistical Machine Learning and Applications
Day 2 (Lecture 1): Introduction to Statistical Machine Learning and Applications
 
Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...
 
Machine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our WorldMachine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our World
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)
 
DutchMLSchool. Machine Learning End-to-End
DutchMLSchool. Machine Learning End-to-EndDutchMLSchool. Machine Learning End-to-End
DutchMLSchool. Machine Learning End-to-End
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ?
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
 
Ways to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performanceWays to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performance
 
Barga DIDC'14 Invited Talk
Barga DIDC'14 Invited TalkBarga DIDC'14 Invited Talk
Barga DIDC'14 Invited Talk
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
 
Machine learning_ Replicating Human Brain
Machine learning_ Replicating Human BrainMachine learning_ Replicating Human Brain
Machine learning_ Replicating Human Brain
 

Similar to Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance datasets

ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...Alok Singh
 
DutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveDutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveBigML, Inc
 
artificggggggggggggggialintelligence.pdf
artificggggggggggggggialintelligence.pdfartificggggggggggggggialintelligence.pdf
artificggggggggggggggialintelligence.pdftt4765690
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learningShareDocView.com
 
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
DutchMLSchool. Introduction to Machine Learning with the BigML PlatformDutchMLSchool. Introduction to Machine Learning with the BigML Platform
DutchMLSchool. Introduction to Machine Learning with the BigML PlatformBigML, Inc
 
Supervised learning
Supervised learningSupervised learning
Supervised learningankit_ppt
 
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f..."Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...Edge AI and Vision Alliance
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Sri Ambati
 
Net campus2015 antimomusone
Net campus2015 antimomusoneNet campus2015 antimomusone
Net campus2015 antimomusoneDotNetCampus
 
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATAPREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATADotNetCampus
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10Roger Barga
 
Machine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroMachine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroSi Krishan
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox Animesh Singh
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptxgdgsurrey
 
DutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive SectorDutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive SectorBigML, Inc
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
 

Similar to Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance datasets (20)

ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
 
DutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveDutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business Perspective
 
artificggggggggggggggialintelligence.pdf
artificggggggggggggggialintelligence.pdfartificggggggggggggggialintelligence.pdf
artificggggggggggggggialintelligence.pdf
 
Ai in finance
Ai in financeAi in finance
Ai in finance
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learning
 
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
DutchMLSchool. Introduction to Machine Learning with the BigML PlatformDutchMLSchool. Introduction to Machine Learning with the BigML Platform
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f..."Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Software Design principales
Software Design principalesSoftware Design principales
Software Design principales
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
 
Net campus2015 antimomusone
Net campus2015 antimomusoneNet campus2015 antimomusone
Net campus2015 antimomusone
 
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATAPREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
 
Machine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroMachine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An Intro
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
 
DutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive SectorDutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive Sector
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 

Recently uploaded

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 

Recently uploaded (20)

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 

Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance datasets

  • 1. 1
  • 2. About Presenter Alok Singh Alok Singh is a Principal Engineer at the IBM CODAIT (Center for Open-Source Data and AI Technologies). He has built and architected multiple analytical frameworks and implemented machine learning algorithms along with various data science use cases. His interest is in creating Big Data and scalable machine learning software and algorithms. He has also created many Data Science based applications. – Interests: Big Data, Scalable Machine Learning Software, and Algorithms. – Contact: url- http://codait.org email- singh_alok@hotmail.com github -https://github.com/aloknsingh 2
  • 3. 3 Agenda • IBM and Data Science • Why this talk? • Tools and Technology • Overview and flow of workshop • Scikit-learn based ML pipeline • Various Model evaluation techniques • Iterative Model training • Hands on tutorial for building predictive models for imbalance datasets • Setup • Tutorial • Q/A
  • 4. 4 IBM Commitments to Open Source AI and Data Science
  • 5. Center for Open Source Data and AI Technologies CODAIT codait.org codait (French) = coder/coded https://m.interglot.com/fr/en/codait CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission 5
  • 6. Center for Open Source Data and AI Technologies Code - Build and improve practical frameworks to enable more developers to realize immediate value Content – Showcase solutions to complex and real world AI problems Community – Bring developers and data scientists to engage with IBM Improving Enterprise AI lifecycle in Open Source Gather Data Analyze Data Machine Learning Deep Learning Deploy Model Maintain Model Python Data Science Stack Fabric for Deep Learning (FfDL) Mleap + PFA Scikit-LearnPandas Apache Spark Apache Spark Jupyter Model Asset eXchange Keras + Tensorflow CODAIT codait.org 6
  • 7. A L L Y O U R T O O L S I N O N E P L A C E IBM Watson Studio Data Science Experience is an environment that brings together everything that a Data Scientist needs. It includes the most popular Open Source tools and IBM unique value-add functionalities with community and social features, integrated as a first class citizen to make Data Scientists more successful. datascience.ibm.com IBM Watson Studio for Data Science Experience
  • 8. IBM Watson Studio Click here to watch video
  • 9. 9 Why This Talk for building Predictive Models for Imbalance Datasets?
  • 10. Challenges in building ML model 10 • Non Representative data • Insufficient data • Poor quality data • Imbalance data set • Irrelevant features • Over fitting the model on data • Under fitting of the model on data • Whether model will generalize or not
  • 11. Imbalance Dataset: A Story … 11 We get real life labeled data sets from our clients and we are asked to build a classifier to predict if someone will buy our products. We clean our data and build latest state of art ML model for it. We run cross validation and feature engineering to select best model. We evaluate model and it doesn’t perform good on test dataset but in theory it should perform good. We explore more and finds that the number of labeled samples where someone buys the product is very less say 1 % and our classifier didn’t learn since it thought that those product that was bought was noise.
  • 12. What we will do in this workshop … 12 • Data Set Description. • Exploratory Analysis to understand the data. • Use various preprocessing to clean and prepare the data. • Use naive XGBoost to run the classification. • Use cross validation to get the model. • Plot, precision recall curve and ROC curve. • We will then tune it and use weighted positive samples to improve classification performance. • We will also talk about the following advanced techniques. • Oversampling of minority class and Undersampling of majority class. • SMOTE algorithms.
  • 14. Scikit-Learn 14 • Library of ML models and utilities • Has many out of box ML models and utilities that works out of box • Python is now very popular among data-scientists. • XGBoost allows one to tune various parameters. • Has pipeline, estimators and transformers concept that allows one to build big ML apps • Well Documented and a lot of examples and support group.
  • 15. Why and What of XGBoost. 15 • XGBoost is extreme gradient boosting algorithm based on trees and tends to perform very good out of the box compare to other ML algorithms. • XGBoost is popular amongst data-scientist and one of the most common ML algorithms used in Kaggle Competitions. • XGBoost allows one to tune various parameters. • XGBoost allows parallel processing. • Works very well with Python and Scikit Learn.
  • 16. 16 XGBoost Source: Official Tutorial => https://xgboost.readthedocs.io/en/latest/tutorials/model.html
  • 17. 17 XGBoost (continue …) Source: Official Tutorial => https://xgboost.readthedocs.io/en/latest/tutorials/model.html
  • 18. 18 XGBoost (continue …) Source: From the XGBoost Author=> https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
  • 19. 19 Overview of the this tutorial
  • 20. Scikit-based ML flow Read input Data Exploration Preprocessing Feature Engineering XGBoost based iterative model training Scoring/Genera lization
  • 21. Metrics for Model Performance. 21 To come up with the best model, we should evaluate model performance for comparison among various models. There are many ways to evaluate the model performance for classification. Before we go into model training we will understand various model evaluation criteria 1) Accuracy Score 2) Confusion Matrix 3) ROC curve 4) Precision Recall Curve
  • 23. 23 Receiver Operating Characteristics (ROC) By FeanDoe - Modified version from Walber's Precision and Recall https://commons.wikimedia.org/wiki/File:Precisionrecall.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=65826093
  • 24. 24 ROC (continue…) Source:Wikipedia By Sharpr - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=44059691
  • 25. 25 Precision-Recall Curve(PR Curve) By Walber [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons
  • 27. F1-Score 27 • Since there is the inverse relationship between precision and recall . • You increase one , other will go down and vice versa. • And hence if we want to do justice to both the metrics, we usually use the combination of both . • We use harmonic mean to combine them.
  • 28. Which Metrics to use for Imbalance Datasets? 28 • We will use precision recall with varying threshold to find out the best threshold • But we should note that Precision and recall have inverse relationship and hence if we want high precision then we should be fine with low recall and vice versa. • There are many applications where high precision might be desired. For example if we train a model to detect safe website for kids. It's ok to reject many safe website (low recall) as long as we correctly reject bad website (high precision) • However for our applications, i.e bank client's CD subscription prediction, higher recall is desirable. Since it is perfectly fine to have false client i.e model predict client will accept the subscription but will actually decline it, as long as we predict almost all the client who is likely to accept the subscription and thus improving the balance sheet.
  • 29. High Precision Example (ML Model for Net-Nanny) 29
  • 30. High Recall Example (ML Model for Potential Customer) 30
  • 33. First Attempt to XGBoost Model Training 33 • We will split dataset into training and test datasets • We will first create a simple XGBoost model with using traditional ML techniques. • We will evaluate models using various metrics we discussed. • Analyze the reports and concludes the model is not good.
  • 34. Strategies for better Classifier for the Imbalance Dataset. 34 1. Use the weighted class i.e give higher weights to minority positive class i.e 'yes’ label. 2. Over-Sampling of the minority class and under-sampling of the majority class. 3. Use the SMOTE (Synthetic Minority Oversampling Techniques).
  • 35. Weighted Class classifier 35 Since at training time, classifier doesn’t learn from minority samples The natural approach would be to some how increase the number of minority samples One way to do is to give more weight to minority class at the time of evaluation of the node split.
  • 36. Weighted Class classifier (continue…) 36
  • 37. Minority Over-Sampling and Majority Under-sampling 37 • Idea is similar to Weighted class • Instead of giving higher weights, we just replicate a subset of random minority samples • Also we can under samples the majority class i.e don’t take all the samples from majority class
  • 38. SMOTE algorithm (Synthetic Minority Oversampling Technique) 38 • Intuitively, just replicating the minority sample sounds cheating. • So better approach is to have new synthetic samples created by interpolating and extrapolating the two or more samples. • The candidate samples for interpolation can come from majority or minority or the combination of them. • The suggestion by SMOTE is to use the samples on the decision boundary . • SMOTE Paper by V Chawla and etal: https://arxiv.org/pdf/1106.1813.pdf
  • 40. SMOTE(continue…) 40 • Nearest neighbor minority samples synthesis. • For the center red circle for K=2, we have two neighbors and we have possibility of two synthesis samples for dup_size=1, • We randomly picks one of the small new red samples.
  • 41. 41 Demo for predictive classifier for imbalance datasets.
  • 42. Checklist 42 1) Look the package installation guide accompanying this tutorial. 2) Clone the github: https://github.com/IBM/xgboost-financial-predictions/ 3) Data: data/bank.csv 4) Notebook: notebooks/predict_bank_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb 5) Slides: https://github.com/aloknsingh/ds_xgboost_clf_4_imbalance_data/slides 6) You can use one of the following platforms a) Your laptop b) IBM Watson Studio
  • 44. 44
  • 45. 45

Editor's Notes

  1. 4 Slides time:3-4 mins
  2. The IBM Watson Studio Data Science Experience is an environment that brings together everything that a Data Scientist needs. It includes the most popular Open Source tools and IBM unique value-add functionalities with community and social features, integrated as a first class citizen to make Data Scientists more successful. The URL for DSX is datascience.ibm.com We believe that a winning platform for data science needs to integrate with both open source and value-add tooling. We also plan on integrating popular tools from strategic partners so that we can better serve the needs of our clients and better fit into their tooling preferences and underlying infrastructure. The goal is to have all of the tools that a Data Scientist needs in one place, thereby making it easier for data science teams to learn, create, and collaborate.
  3. IBM Watson Studios Data Science Experience has been developed based on interviews with hundreds of Data Scientists and the feedback of thousands of users. Watch this design video at https://www.youtube.com/watch?v=1HjzkLRdP5k to learn more about how and why we created the Data Science Experience.
  4. For context on why XGBoost and class imbalances are a relevant topic, it’s important to know that most of the real life data set contains very small amount of samples of interest i.e in our case, the number of people who purchased Bank’s term deposit financial product are very little compare to the number of people who didn’t purchase. And when we build supervised machine learning model on this kind of dataset. Our model doesn’t perform good on the dataset since while training it thinks that since it is predicting most of the negative samples right, it is doing great. However, as the user, we are mostly interested in the positive samples. We will see now to solve those problems.
  5. All those terms will be explained in later slides in details
  6. 4 Slides => 3 mins
  7. Time: 1 min Explain: XGBoost is ensemble method . Lets consider the CART algorithms, Here we build a tree
  8. Explain: However in the XGBoost instead of one tree or many tree as in random forest we have a sequence of trees which iteratively refine itself.
  9. Mathematically it is the refined of each steps by series of function approximate for previous errors Note that at each steps we change complexity of tree Recommend to go through the whole tutorial.
  10. In this section , we will look into various aspect of the workshop that we will cover , also we will go in details in some of the concepts that will be used heavily in the notebooks.
  11. Time (3 m) Explain: how we will go through each steps in shorts
  12. Explain(3 min) Talk about history Each of 3 curves and how it relates How shifting in 1 curve effect 3 curve. Lets see what is ROC curve? We can get the TP, TN, FN, FP from confusion matrix as explained in previous section. ML models also gives us the probability of a sample being a positive or negative. This allows us to make the better decision. For ROC curve, we use TPR : True Positive Rate (aka Recall) = sensitivity i.e number of samples, model correctly classifies a positive sample as positive. FPR : False Positive Rate (aka 1- Specificity)  i.e number of samples, model incorrectly classifies a negative sample as positive. obviously, we would like to have higher TPR and lower FPR.
  13. Each of 3 curves and how it relates How shifting in 1 curve effect 3 curve For example, consider two predicted positive samples, one with 0.55 probability and other with 0.95 probability. These samples are considered positive because by default the threshold is 0.5. If we use the threshold of 0.6 then first sample will be negative and the second will be positive. In short, the values of TP, FP, TN, FN will change depending on the threshold. Thus we see that by adjusting the threshold, one can change the model's performance and for binary classification this adjustment is very often done. When we adjust the threshold, the value of TPR and FPR will also change, as they are function of TP, FP, TN, FN and when we plot TPR vs FPR for various threhold, we get the ROC curve. For a model to have best performance, we would like to have model rank a randomly chosen positive sample higher than a randomly chosen negative samples and this is what we call Area Under the Curve (AUC) metric.
  14. Time: 3min for 2 slides Why PR is better than ROC for certain data Explain the difference from ROC and definition of PR curve. Examples like documents retrieval and others more details in notebooks We have a good measure for measuring model performance using ROC curve. However, when we have the class imbalances, ROC curve doesn't perform good. To understand that see the equation for TPR and FPR and in the case of class imbalances, the number of positive samples are quite few and the number of FN relative to TP is quite more and hence if there is any change in TP or FP, that will be very small compare to TN and FN and hence even if I get get improvement ROC curve won't be able to capture it. Hence we use another metrics which is precision and is defined as TP/(TP+FP). We note that since it is the odd ratio of TP and FP both of them are sensitive to changes in positive samples.This matrix is very useful for measuring the performance in case of class imbalance.
  15. An Example of plot Inverse relationship between p and r So there must be some middle point for best
  16. Key Idea: Precision vs Recall which to choose based on threshold? Answer: depends on use case or application Why Recall better in our application
  17. Explain: now we know how to measure our models’ performance lets go on and see how we plan to train our model.
  18. Time (5-10 min) TBD Create 1 more flow chart for iterative model training rectangle.
  19. TBD: Expand this slides to 3 more slides Explain(5-10min)
  20. You give more weights to minority samples at the time of evaluation. Recall from XGBoost that When we split the tree’s node at each iterations, for calculating entropy of the split. In this case, we give more weights to the minority samples
  21. Since recall is important for our case, Curves at the top for varying weights gives us recall for various weights. Various threadhold is on x-axis and recall is on y axis. First we choose the right threshold to balance recall and precision and then chose the best weights. Note: higher weight doesn’t necessarily means better
  22. Similar to weighted samples but just replicate (minority) or remove(majority) the samples
  23. Time: 3 slides == 3 min
  24. SMOTE: Lets see this in the dataset with two dimensions We interpolate the minority samples*big red circles) so that to produce small red circles as new synthetic samples. The number of new samples between two samples is controlled by ‘dup_size’ parameter.
  25. -Nearest neighbor minority samples synthesis.
  26. NOTE: At this time 25 mins should have passed Next section 50 mins
  27. Notes: Use IBM Watson for illustration If internet is bad use, Anaconda Jupyter lab
  28. Finish in 50 mins
  29. Time : 10 min
  30. QA