2. Trinity College Dublin, The University of Dublin
Overview previous lecture
• Binary classification
• Evaluation
• Overfitting
• Cross-validation
• Imbalanced datasets
• Multiclass classification
Overview lecture
• Classification algorithms
• K-nearest neighbour (KNN)
• Decision tree
• Support Vector Machines (SVM)
• Data projection (introduction)
Binary classification – evaluation metrics
Imbalanced datasets: e.g., a dataset whose class column reads 1 0 0 0 0 (far more 0s than 1s)
Binary classification task: is this a number five or not?
- 10 digits, each occurring equally often in the dataset
- Ideal chance level of a multiclass classifier: 1/10 = 0.1 = 10% (the probability of guessing the exact digit)
- Ideal chance level of a binary classifier (is it a 5 or not?) is trickier. For example, a classifier that always returns ‘not a 5’ would be 90% correct (as 90% of the digits are not a 5). So a good classifier should do better than that. But better at what? Precision, recall, or both?
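The "always ‘not a 5’" argument can be sketched in a few lines; the 1000-sample dataset below is hypothetical, chosen only so each digit occurs equally often:

```python
# Sketch: a trivial "always 'not a 5'" classifier on a balanced
# 10-digit dataset (the 1000-sample size is hypothetical).
labels = [d for d in range(10) for _ in range(100)]  # 100 of each digit 0-9

# The dummy classifier predicts "not a 5" for every instance.
correct = sum(1 for d in labels if d != 5)
accuracy = correct / len(labels)
print(accuracy)  # 0.9 -> 90% correct without learning anything

# But its recall for the positive class ("is a 5") is zero:
true_fives = sum(1 for d in labels if d == 5)
detected_fives = 0  # it never predicts "5"
recall = detected_fives / true_fives
print(recall)  # 0.0
```

High accuracy with zero recall is exactly why accuracy alone is misleading on imbalanced data.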
Precision vs. recall – confusion matrix
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
Accuracy = (3 + 5) / (3 + 5 + 1 + 2) = 8/11 ≈ 0.73
Precision: “3 out of 4 of my predictions were correct. I made one mistake. I could have been more precise!”
Recall: “I detected 3 out of 5 elements. I missed 2 of them!”
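The slide's numbers follow directly from the confusion-matrix counts (TP = 3, TN = 5, FP = 1, FN = 2, as read off the figure); a minimal sketch:

```python
# Metrics behind the slide's numbers, read off the confusion matrix:
# TP = 3, TN = 5, FP = 1, FN = 2.
TP, TN, FP, FN = 3, 5, 1, 2

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 8/11, about 0.73
precision = TP / (TP + FP)                  # 3/4: "3 out of 4 of my predictions were correct"
recall = TP / (TP + FN)                     # 3/5: "I detected 3 out of 5 elements"

print(round(accuracy, 2), precision, recall)  # 0.73 0.75 0.6
```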
Binary classification – evaluation metrics
Imbalanced datasets:
[Figure from Géron 2019]
Binary classification – evaluation metrics
[Figure from Géron 2019: the precision/recall trade-off]
Binary classification – evaluation metrics
ROC: Receiver operating characteristic
[Figure from Géron 2019]
Binary classification – evaluation metrics
F1-Score = harmonic mean of precision and recall
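A minimal sketch of the F1 computation, using the precision (3 out of 4) and recall (3 out of 5) from the confusion-matrix slide:

```python
# F1 as the harmonic mean of precision and recall, using the values
# from the confusion-matrix slide (precision = 3/4, recall = 3/5).
precision, recall = 0.75, 0.6

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.667

# The harmonic mean punishes imbalance: high precision cannot make up
# for very low recall (or vice versa).
print(round(2 * 0.99 * 0.01 / (0.99 + 0.01), 3))  # 0.02
```

This is why F1 is a better single-number summary than accuracy for imbalanced problems.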
Multiclass classification – evaluation metrics
[Figure from Géron 2019: confusion matrices, actual class vs. predicted class. One view: a great classification result! The other: lots of instances are misclassified as ‘8’.]
Baseline – real vs. ideal
- The smaller the dataset, the higher the chance that a random classifier gets it right purely by chance
- So classification results should be compared to a baseline (or chance level) calculated by taking the sample size (N) into account
https://www.discovermagazine.com/mind/machine-learning-exceeding-chance-level-by-chance
Baseline – real vs. ideal – intuition
- N: number of coin tosses; x̄: average number of heads
- Large dataset (N = 10000): 0 1 0 0 1 1 0 1 0 1 1 0 1 0 0 1 … -> P(‘1’) = 50%: half of the time we get ‘1’. A small fluctuation (one extra ‘1’) is a small change in the overall balance between classes (50% -> 50.01%)
- Small dataset (N = 10): 0 1 0 1 1 1 1 1 0 0 -> P(‘1’) = 50%, but the same small fluctuation is a large change in the overall balance between classes (50% -> 60%)
https://www.discovermagazine.com/mind/machine-learning-exceeding-chance-level-by-chance
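The coin-toss intuition can be sketched numerically; the 200 repeated datasets in the simulation are an arbitrary illustrative choice:

```python
import random

# One extra 'head' shifts the observed class balance by 1/N,
# so small datasets fluctuate much more around the true 50%.
def balance_shift(n):
    """Change (in percentage points) caused by one extra positive label."""
    return 100 / n

print(balance_shift(10000))  # 0.01 -> 50% becomes 50.01%
print(balance_shift(10))     # 10.0 -> 50% becomes 60%

# Empirical check: the spread of the observed proportion of heads over
# 200 simulated datasets (an arbitrary number) shrinks as N grows.
random.seed(0)
def spread(n, trials=200):
    props = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(trials)]
    return max(props) - min(props)

print(spread(10) > spread(10000))  # True: small N -> larger fluctuations
```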
Classification in Python
[Code figure from Géron 2019: X is the data matrix (features); y is the class (‘five’ or ‘not a five’)]
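Since the slide's code figure does not survive extraction, here is a sketch in the same spirit (not the slide's exact code), assuming scikit-learn's small built-in digits dataset as a stand-in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# X is the data matrix (features); y is the class ('five' or 'not a five').
X, y = load_digits(return_X_y=True)
y_five = (y == 5)

X_train, X_test, y_train, y_test = train_test_split(X, y_five, random_state=42)

clf = SGDClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy: compare it against the ~90% baseline!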
16. Trinity College Dublin, The University of Dublin
Types of classification
17
https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Each type may require different methods
Binary Multiclass Imbalanced
e.g., medical diagnosis
Anomaly detection
•Logistic Regression
•k-Nearest Neighbors
•Decision Trees
•Support Vector Machine
•Naive Bayes
•k-Nearest Neighbors
•Decision Trees
•Naive Bayes
•Random Forest
•Gradient Boosting
Binary (one vs. all, one vs. one)
•Support Vector Machine
•Logistic regression
17. Trinity College Dublin, The University of Dublin 18
K-nearest neighbours (KNN)
https://www.analyticssteps.com/blogs/how-does-k-nearest-neighbor-works-machine-learning-classification-problem
https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
New instance (Red)
Neighbourhood: k=5:
5 green, 0 blue -> selecting green class
18. Trinity College Dublin, The University of Dublin 19
K-nearest neighbours (KNN)
https://www.analyticssteps.com/blogs/how-does-k-nearest-neighbor-works-machine-learning-classification-problem
https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
New instance (Red)
Neighbourhood: k=5:
3 green, 2 blue -> selecting green class
Step 1: labelled data
Step 2: calculate distance
between new instance and k-
nearest neighbours
Step 3: Count! What’s the
most frequent class in the
neighbourhood?
19. Trinity College Dublin, The University of Dublin 20
K-nearest neighbours (KNN)
Algorithm:
Given a dataset
For each new instance
Find neighbourhood based on feature space
Select most frequent class in the neighbourhood
Pros:
- Simple
- Applies to non-linear data
- There is no need for difficult model fit and tuning
Cons (basic version):
- The model needs to store large amounts of data
- Slow at generating predictions
- Slower and heavier with increasing dataset size
20. Trinity College Dublin, The University of Dublin 21
K-nearest neighbours (KNN)
Example: Is a bike damaged?
Based on:
- Feature 1: average speed
- Feature 2: how much was it
used in the last 24h
How much was it used (hours)
Average speed
21. Trinity College Dublin, The University of Dublin 22
K-nearest neighbours (KNN)
Example: Is a bike damaged?
Based on:
- Feature 1: average speed
- Feature 2: how much was it
used in the last 24h
Imbalanced classification
How much was it used (hours)
Average speed
22. Trinity College Dublin, The University of Dublin 23
Decision tree
How much was it used (hours)
Average speed
Used > 3h
Used > 6h
Avg speed > 15km/h
Yes
No
No
No
Yes
Yes
23. Trinity College Dublin, The University of Dublin 24
Decision tree
Optimal split at every iteration?
We need to select a metric! -> homogeneity of the target variable in the subsets (e.g., entropy,
information gain)
24. Trinity College Dublin, The University of Dublin 25
Decision tree
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
25. Trinity College Dublin, The University of Dublin 26
Decision tree
- Simple to understand and visualise
- It can handle both numerical and categorical data
- It works with little data
- Not great classification results
- Unstable (small changes in the data may result in big changes in the decision tree
- A Random forest runs many decision trees on subsamples of the data. The combination of many
trees leads to better classification results. However, that is a computationally expensive process (it
takes time).
26. Trinity College Dublin, The University of Dublin 27
Support Vector Machine (SVM)
Linear Binary SVM Classification
- Scenario where the two classes are linearly
separable
- The solid line in the plot on the right represents
the decision boundary of an SVM classifier
- This line separates the two classes + stays as far
away from the closest training instances as
possible
27. Trinity College Dublin, The University of Dublin 28
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
28. Trinity College Dublin, The University of Dublin 29
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
29. Trinity College Dublin, The University of Dublin 30
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
30. Trinity College Dublin, The University of Dublin 31
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
Soft-margin classification. It limits the margin violations, but they are indeed possible and
tolerated. How much are they tolerated? That is decided by setting the parameter C.
- Small C: wider margin, lots of data-points between the margins
- Large C: smaller margin with fewer margin violations.
- A very large C would not be good (too specific to this dataset, too sensitive to the
outliers)
31. Trinity College Dublin, The University of Dublin 32
Support Vector Machine (SVM)
http://www.mlfactor.com/svm.html
32. Trinity College Dublin, The University of Dublin 33
Support Vector Machine (SVM)
- Some datasets are not even close
to being linearly separable.
- One approach is to use
polynomial features
e.g., x2 = (x1)2
x3 = (x1)3
33. Trinity College Dublin, The University of Dublin 34
Data projection
x1
x2
Y ∈ {green,blue}
X: [x1, x2]
Xproj = X - [2,0]
Xproj = [x1, x2] - [2,0]
Xproj = [x1-2, x2]
xproj1
xproj2
34. Trinity College Dublin, The University of Dublin 35
Data projection
x1
x2
Y ∈ {green,blue}
X: [x1, x2]
Xproj = X - [2,3]
Xproj = [x1, x2] - [2,3]
Xproj = [x1-2, x2-3]
xproj1
xproj2
35. Trinity College Dublin, The University of Dublin 36
Data projection
A projection is a transformation of data points from one axis system to another
x1
x2
xproj1
xproj2
xproj1
xproj2
36. Trinity College Dublin, The University of Dublin 37
Data projection
x1
x2
x1
x2
Bad projection Good projection
37. Trinity College Dublin, The University of Dublin 38
x1
x2
Good projection
Data projection
LDA: Linear Discriminant Analysis
Find the axis that:
- Maximises the variance of the class
means (between-class)
- Minimises the within-class variance
38. Trinity College Dublin, The University of Dublin 39
x1
x2
Good projection
Data projection
xproj
Perfect separability between classes
39. Trinity College Dublin, The University of Dublin 40
Data projection
x1
x2
Y ∈ {green,blue}
x2
x1
X: [x1, x2] Sometimes it is easier to look at things from a different angle,
instead of searching for a complicated solution
Notas do Editor
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,