2. Trinity College Dublin, The University of Dublin
Overview previous lecture
• Binary classification
• Evaluation
• Overfitting
• Cross-validation
• Imbalanced datasets
• Multiclass classification
Overview lecture
• Classification algorithms
• K-nearest neighbour (KNN)
• Decision tree
• Support Vector Machines (SVM)
• Data projection (introduction)
Binary classification – evaluation metrics
Imbalanced datasets: e.g., a dataset whose class column reads 1 0 0 0 0 (far more 0s than 1s)
Binary classification task: is this a number five or not?
- 10 digits, each occurring equally often in the dataset
- Ideal chance level of a multiclass classifier: 1/10 = 0.1 = 10% (the probability of guessing the exact digit)
- Ideal chance level of a binary classifier (is it a 5 or not?) is trickier. For example, a classifier that always returns ‘not a 5’ would be 90% correct (as 90% of the digits are not a 5). So a good classifier should do better than that. But better at what? Precision, recall, or both?
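The "always ‘not a 5’" argument can be sketched in a few lines; the 1000-sample dataset below is hypothetical, chosen only so each digit occurs equally often:

```python
# Sketch: a trivial "always 'not a 5'" classifier on a balanced
# 10-digit dataset (the 1000-sample size is hypothetical).
labels = [d for d in range(10) for _ in range(100)]  # 100 of each digit 0-9

# The dummy classifier predicts "not a 5" for every instance.
correct = sum(1 for d in labels if d != 5)
accuracy = correct / len(labels)
print(accuracy)  # 0.9 -> 90% correct without learning anything

# But its recall for the positive class ("is a 5") is zero:
true_fives = sum(1 for d in labels if d == 5)
detected_fives = 0  # it never predicts "5"
recall = detected_fives / true_fives
print(recall)  # 0.0
```

High accuracy with zero recall is exactly why accuracy alone is misleading on imbalanced data.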
Precision vs. recall – confusion matrix
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
Accuracy = (3 + 5) / (3 + 5 + 1 + 2) = 8/11 ≈ 0.73
Precision: “3 out of 4 of my predictions were correct. I made one mistake. I could have been more precise!”
Recall: “I detected 3 out of 5 elements. I missed 2 of them!”
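The slide's numbers follow directly from the confusion-matrix counts (TP = 3, TN = 5, FP = 1, FN = 2, as read off the figure); a minimal sketch:

```python
# Metrics behind the slide's numbers, read off the confusion matrix:
# TP = 3, TN = 5, FP = 1, FN = 2.
TP, TN, FP, FN = 3, 5, 1, 2

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 8/11, about 0.73
precision = TP / (TP + FP)                  # 3/4: "3 out of 4 of my predictions were correct"
recall = TP / (TP + FN)                     # 3/5: "I detected 3 out of 5 elements"

print(round(accuracy, 2), precision, recall)  # 0.73 0.75 0.6
```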
Binary classification – evaluation metrics
Imbalanced datasets:
[Figure from Géron 2019]
Binary classification – evaluation metrics
[Figure from Géron 2019: the precision/recall trade-off]
Binary classification – evaluation metrics
ROC: Receiver operating characteristic
[Figure from Géron 2019]
Binary classification – evaluation metrics
F1-Score = harmonic mean of precision and recall
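A minimal sketch of the F1 computation, using the precision (3 out of 4) and recall (3 out of 5) from the confusion-matrix slide:

```python
# F1 as the harmonic mean of precision and recall, using the values
# from the confusion-matrix slide (precision = 3/4, recall = 3/5).
precision, recall = 0.75, 0.6

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.667

# The harmonic mean punishes imbalance: high precision cannot make up
# for very low recall (or vice versa).
print(round(2 * 0.99 * 0.01 / (0.99 + 0.01), 3))  # 0.02
```

This is why F1 is a better single-number summary than accuracy for imbalanced problems.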
Multiclass classification – evaluation metrics
[Figure from Géron 2019: confusion matrices, actual class vs. predicted class. One view: a great classification result! The other: lots of instances are misclassified as ‘8’.]
Baseline – real vs. ideal
- The smaller the dataset, the higher the chance that a random classifier gets it right purely by chance
- So classification results should be compared to a baseline (or chance level) calculated by taking the sample size (N) into account
https://www.discovermagazine.com/mind/machine-learning-exceeding-chance-level-by-chance
Baseline – real vs. ideal – intuition
- N: number of coin tosses; x̄: average number of heads
- Large dataset (N = 10000): 0 1 0 0 1 1 0 1 0 1 1 0 1 0 0 1 … -> P(‘1’) = 50%: half of the time we get ‘1’. A small fluctuation (one extra ‘1’) is a small change in the overall balance between classes (50% -> 50.01%)
- Small dataset (N = 10): 0 1 0 1 1 1 1 1 0 0 -> P(‘1’) = 50%, but the same small fluctuation is a large change in the overall balance between classes (50% -> 60%)
https://www.discovermagazine.com/mind/machine-learning-exceeding-chance-level-by-chance
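The coin-toss intuition can be sketched numerically; the 200 repeated datasets in the simulation are an arbitrary illustrative choice:

```python
import random

# One extra 'head' shifts the observed class balance by 1/N,
# so small datasets fluctuate much more around the true 50%.
def balance_shift(n):
    """Change (in percentage points) caused by one extra positive label."""
    return 100 / n

print(balance_shift(10000))  # 0.01 -> 50% becomes 50.01%
print(balance_shift(10))     # 10.0 -> 50% becomes 60%

# Empirical check: the spread of the observed proportion of heads over
# 200 simulated datasets (an arbitrary number) shrinks as N grows.
random.seed(0)
def spread(n, trials=200):
    props = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(trials)]
    return max(props) - min(props)

print(spread(10) > spread(10000))  # True: small N -> larger fluctuations
```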
Classification in Python
[Code figure from Géron 2019: X is the data matrix (features); y is the class (‘five’ or ‘not a five’)]
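Since the slide's code figure does not survive extraction, here is a sketch in the same spirit (not the slide's exact code), assuming scikit-learn's small built-in digits dataset as a stand-in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# X is the data matrix (features); y is the class ('five' or 'not a five').
X, y = load_digits(return_X_y=True)
y_five = (y == 5)

X_train, X_test, y_train, y_test = train_test_split(X, y_five, random_state=42)

clf = SGDClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy: compare it against the ~90% baseline!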
16. Trinity College Dublin, The University of Dublin
Types of classification
17
https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Each type may require different methods
Binary Multiclass Imbalanced
e.g., medical diagnosis
Anomaly detection
•Logistic Regression
•k-Nearest Neighbors
•Decision Trees
•Support Vector Machine
•Naive Bayes
•k-Nearest Neighbors
•Decision Trees
•Naive Bayes
•Random Forest
•Gradient Boosting
Binary (one vs. all, one vs. one)
•Support Vector Machine
•Logistic regression
17. Trinity College Dublin, The University of Dublin 18
K-nearest neighbours (KNN)
https://www.analyticssteps.com/blogs/how-does-k-nearest-neighbor-works-machine-learning-classification-problem
https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
New instance (Red)
Neighbourhood: k=5:
5 green, 0 blue -> selecting green class
18. Trinity College Dublin, The University of Dublin 19
K-nearest neighbours (KNN)
https://www.analyticssteps.com/blogs/how-does-k-nearest-neighbor-works-machine-learning-classification-problem
https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
New instance (Red)
Neighbourhood: k=5:
3 green, 2 blue -> selecting green class
Step 1: labelled data
Step 2: calculate distance
between new instance and k-
nearest neighbours
Step 3: Count! What’s the
most frequent class in the
neighbourhood?
19. Trinity College Dublin, The University of Dublin 20
K-nearest neighbours (KNN)
Algorithm:
Given a dataset
For each new instance
Find neighbourhood based on feature space
Select most frequent class in the neighbourhood
Pros:
- Simple
- Applies to non-linear data
- There is no need for difficult model fit and tuning
Cons (basic version):
- The model needs to store large amounts of data
- Slow at generating predictions
- Slower and heavier with increasing dataset size
20. Trinity College Dublin, The University of Dublin 21
K-nearest neighbours (KNN)
Example: Is a bike damaged?
Based on:
- Feature 1: average speed
- Feature 2: how much was it
used in the last 24h
How much was it used (hours)
Average speed
21. Trinity College Dublin, The University of Dublin 22
K-nearest neighbours (KNN)
Example: Is a bike damaged?
Based on:
- Feature 1: average speed
- Feature 2: how much was it
used in the last 24h
Imbalanced classification
How much was it used (hours)
Average speed
22. Trinity College Dublin, The University of Dublin 23
Decision tree
How much was it used (hours)
Average speed
Used > 3h
Used > 6h
Avg speed > 15km/h
Yes
No
No
No
Yes
Yes
23. Trinity College Dublin, The University of Dublin 24
Decision tree
Optimal split at every iteration?
We need to select a metric! -> homogeneity of the target variable in the subsets (e.g., entropy,
information gain)
24. Trinity College Dublin, The University of Dublin 25
Decision tree
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
25. Trinity College Dublin, The University of Dublin 26
Decision tree
- Simple to understand and visualise
- It can handle both numerical and categorical data
- It works with little data
- Not great classification results
- Unstable (small changes in the data may result in big changes in the decision tree
- A Random forest runs many decision trees on subsamples of the data. The combination of many
trees leads to better classification results. However, that is a computationally expensive process (it
takes time).
26. Trinity College Dublin, The University of Dublin 27
Support Vector Machine (SVM)
Linear Binary SVM Classification
- Scenario where the two classes are linearly
separable
- The solid line in the plot on the right represents
the decision boundary of an SVM classifier
- This line separates the two classes + stays as far
away from the closest training instances as
possible
27. Trinity College Dublin, The University of Dublin 28
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
28. Trinity College Dublin, The University of Dublin 29
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
29. Trinity College Dublin, The University of Dublin 30
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
30. Trinity College Dublin, The University of Dublin 31
Support Vector Machine (SVM)
“Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow”, Aurélien Géron, 2019
Soft-margin classification. It limits the margin violations, but they are indeed possible and
tolerated. How much are they tolerated? That is decided by setting the parameter C.
- Small C: wider margin, lots of data-points between the margins
- Large C: smaller margin with fewer margin violations.
- A very large C would not be good (too specific to this dataset, too sensitive to the
outliers)
31. Trinity College Dublin, The University of Dublin 32
Support Vector Machine (SVM)
http://www.mlfactor.com/svm.html
32. Trinity College Dublin, The University of Dublin 33
Support Vector Machine (SVM)
- Some datasets are not even close
to being linearly separable.
- One approach is to use
polynomial features
e.g., x2 = (x1)2
x3 = (x1)3
33. Trinity College Dublin, The University of Dublin 34
Data projection
x1
x2
Y ∈ {green,blue}
X: [x1, x2]
Xproj = X - [2,0]
Xproj = [x1, x2] - [2,0]
Xproj = [x1-2, x2]
xproj1
xproj2
34. Trinity College Dublin, The University of Dublin 35
Data projection
x1
x2
Y ∈ {green,blue}
X: [x1, x2]
Xproj = X - [2,3]
Xproj = [x1, x2] - [2,3]
Xproj = [x1-2, x2-3]
xproj1
xproj2
35. Trinity College Dublin, The University of Dublin 36
Data projection
A projection is a transformation of data points from one axis system to another
x1
x2
xproj1
xproj2
xproj1
xproj2
36. Trinity College Dublin, The University of Dublin 37
Data projection
x1
x2
x1
x2
Bad projection Good projection
37. Trinity College Dublin, The University of Dublin 38
x1
x2
Good projection
Data projection
LDA: Linear Discriminant Analysis
Find the axis that:
- Maximises the variance of the class
means (between-class)
- Minimises the within-class variance
38. Trinity College Dublin, The University of Dublin 39
x1
x2
Good projection
Data projection
xproj
Perfect separability between classes
39. Trinity College Dublin, The University of Dublin 40
Data projection
x1
x2
Y ∈ {green,blue}
x2
x1
X: [x1, x2] Sometimes it is easier to look at things from a different angle,
instead of searching for a complicated solution
Notas do Editor
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different metrics could be used to define what “best” means, such as information gain (entropy)
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,
Mention that the main challenge is always to determine those axes (features). Not just 2D, multidimensional. It could be age, height,