Introduction to Machine
Learning
(Supervised learning)
Dmytro Fishman (dmytro@ut.ee)
This is an introduction to the topic
This is an introduction to the topic
We will try to provide a beautiful scenery
“We love you, Mummy!”
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
“We love you, Mummy!”
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
“We love you, Mummy!”
Petal
Sepal
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
“We love you, Mummy!”
Petal
Sepal
Word 1 25
Word 2 23
Word 3 12
… …
Petal
Sepal
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
Word 1 25
Word 2 23
Word 3 12
… …
“We love you, Mummy!”
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Big Data
Astronomical?
Youtubical?
Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Genomical?
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Big Data
Astronomical? Genomical?
Youtubical?
1 Exabyte/year 2-40 Exabyte/year
1-2 Exabyte/year
Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Big Data
Astronomical? Genomical?
Youtubical?
1 Exabyte/year 2-40 Exabyte/year
1-2 Exabyte/year
Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
1 Exabyte = 10^12 MB
There is a lot of data
produced nowadays
But there are also a vast number of potential ways to use this
data
Supervised Learning
Benign Malignant
Skin cancer example
Malignant?
Tumour size
Benign Malignant
Skin cancer example
Yes(1)
No(0)
Malignant?
Tumour size
Benign Malignant
Skin cancer example
Yes(1)
No(0)
Malignant?
Tumour size
Benign Malignant
Skin cancer example
Yes(1)
No(0)
Malignant?
Tumour size
Yes(1)
Benign Malignant
Skin cancer example
No(0)
Malignant?
Tumour size
Yes(1)
No(0)
Benign Malignant
Skin cancer example
Malignant?
Tumour size
Age
Tumour size
Age
Malignant?
Tumour size
Age
Malignant?
Tumour size
Age
Malignant?
Other features:
Lesion type
Lesion configuration
Texture
Location
Distribution
…
Potentially infinitely many features!
Classification task
Predicting discrete value output using previously labeled
examples
Binary classification
Classification task
Predicting discrete value output using previously labeled
examples
also binary classification
Classification task
Predicting discrete value output using previously labeled
examples
also binary classification
Every time you have to distinguish between TWO
CLASSES it is a binary classification
Classification task
Multiclass classification
Predicting discrete value output using previously labeled
examples
Housing price prediction
Housing price prediction
Size in m2
Price in
1000’s ($)
400
100
200
300
100 200 300 400 500
Housing price prediction
Size in m2
Price in
1000’s ($)
400
100
200
300
100 200 300 400 500
Housing price prediction
Size in m2
Price in
1000’s ($)
400
100
200
300
100 200 300 400 500
Price?
Housing price prediction
Size in m2
Price in
1000’s ($)
400
100
200
300
100 200 300 400 500
Price?
Housing price prediction
Size in m2
Price in
1000’s ($)
400
100
200
300
100 200 300 400 500
Price
Housing price prediction
Size in m2
Price in
1000’s ($)
400
100
200
300
100 200 300 400 500
Price
Regression task
Size in m2
Price in
1000’s ($)
400
100
200
300
100 200 300 400 500
Price
Malignant?
Tumour size
Yes(1)
No(0)
Benign Malignant
Malignant?
VS
Classification Regression
Supervised Learning
Size in m2
Pricein1000’s($)
400
100
200
300
100 200 300 400 500
Price?
You are running a company which has two problems,
namely:

Q:
a. Both problems are examples of classification problems
b. The first one is a classification task and the second one a
regression problem
c. The first one is a regression problem and the second one a
classification task
d. Both problems are regression problems
1. For each user in the database predict if this user will
continue using your company’s product or will move to
competitors (churn).
2. Predict the profit of your company at the end of this year
based on previous records.
How would you approach these problems?
You are running a company which has two problems,
namely:

Q:
a. Both problems are examples of classification problems
b. The first one is a classification task and the second one a
regression problem
c. The first one is a regression problem and the second one a
classification task
d. Both problems are regression problems
1. For each user in the database predict if this user will
continue using your company’s product or will move to
competitors (churn).
2. Predict the profit of your company at the end of this year
based on previous records.
How would you approach these problems?
Unsupervised Learning
examples slides
Clustering with google queries
In contrast to the first category, we don't have labels for our
classes (the graphs with two features from the previous examples
turn into unlabelled ones)
Gene expression clustering
Quiz question: "Of the following examples, which would you
address using an unsupervised learning algorithm?"
Tumour size
Age
Supervised Learning
Tumour size
Age
Unsupervised Learning
Tumour size
Age
Unsupervised Learning
Is there any interesting hidden structure in this data?
Tumour size
Age
Unsupervised Learning
Is there any interesting hidden structure in this data?
What does this hidden structure correspond to?
Gene expression
Gene expression
Two interesting
groups of
species
Gene expression
Two interesting
groups of genes
Q1: Some telecommunication company wants to segment
their customers into distinct groups in order to send
appropriate subscription offers, this is an example of ...
Q2: You are given data about seismic activity in Japan, and
you want to predict a magnitude of the next earthquake,
this is an example of ...
Q3: Assume you want to perform supervised learning and
to predict number of newborns according to size of storks'
population (http://www.brixtonhealth.com/storksBabies.pdf),
it is an example of ...
Q4: Discriminating between spam and ham e-mails is a
classification task, true or false?
Q1: Some telecommunication company wants to segment
their customers into distinct groups in order to send
appropriate subscription offers, this is an example of
clustering
Q2: You are given data about seismic activity in Japan, and
you want to predict a magnitude of the next earthquake,
this is an example of ...
Quiz: Assume you want to perform supervised learning and
to predict number of newborns according to size of storks'
population (http://www.brixtonhealth.com/storksBabies.pdf),
it is an example of ...
Quiz: Discriminating between spam and ham e-mails is a
classification task, true or false?
Q1: Some telecommunication company wants to segment
their customers into distinct groups in order to send
appropriate subscription offers, this is an example of
clustering
Q2: You are given data about seismic activity in Japan, and
you want to predict a magnitude of the next earthquake,
this is an example of regression
Q3: Assume you want to perform supervised learning and
to predict number of newborns according to size of storks'
population (http://www.brixtonhealth.com/storksBabies.pdf),
it is an example of ...
Quiz: Discriminating between spam and ham e-mails is a
classification task, true or false?
Q3: Assume you want to perform supervised learning and
to predict number of newborns according to size of storks'
population (http://www.brixtonhealth.com/storksBabies.pdf),
it is an example of stupidity regression
Q4: Discriminating between spam and ham e-mails is a
classification task, true or false?
Q2: You are given data about seismic activity in Japan, and
you want to predict a magnitude of the next earthquake,
this is an example of regression
Q1: Some telecommunication company wants to segment
their customers into distinct groups in order to send
appropriate subscription offers, this is an example of
clustering
Q4: Discriminating between spam and ham e-mails is a
classification task.
Q3: Assume you want to perform supervised learning and
to predict number of newborns according to size of storks'
population (http://www.brixtonhealth.com/storksBabies.pdf),
it is an example of stupidity regression
Q2: You are given data about seismic activity in Japan, and
you want to predict a magnitude of the next earthquake,
this is an example of regression
Q1: Some telecommunication company wants to segment
their customers into distinct groups in order to send
appropriate subscription offers, this is an example of
clustering
MNIST dataset
(10000 images)
Instance Label
28px
28px
3
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
3
MNIST dataset
(10000 images)
In total 784 pixel values
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
3
Pixel values Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Pixel values Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Feature
Pixel values Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Feature
Pixel values Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Feature
Pixel values Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values. Features are also sometimes
referred to as dimensions
Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values. Features are also sometimes
referred to as dimensions
These images are 784-dimensional
Pixel values Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Data is loaded.
What should we do now?
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Data is loaded.
What should we do now?
We would like to build a tool that would
be able to automatically recognise
handwritten images
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 1
Instances
MNIST dataset
(10000 images)
Instance Label
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 155 255 255 255 155 0
255 255 255 255 255 255 255
255 155 78 78 155 255 255
255 0 0 0 0 155 255
28px
28px
(0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …)
could be downloaded from: http://yann.lecun.com/exdb/mnist/
3
In total 784 pixel values
Data is loaded.
What should we do now?
We would like to build a tool that would
be able to automatically recognise
handwritten images
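For readers who want to follow along in code, here is a minimal sketch of loading MNIST as 784-dimensional feature vectors. It assumes scikit-learn is available and uses its fetch_openml helper, which hosts a copy of the same data (the deck's link points to Yann LeCun's original files, and the deck works with a 10,000-image subset):

import numpy as np
from sklearn.datasets import fetch_openml

# Download (on first call) the MNIST digits as flat 784-dimensional vectors.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
print(X.shape)   # (70000, 784): one row per image, one pixel value per column
print(y[:5])     # labels are the digit characters '0'..'9'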
Let’s get to the first algorithm
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
What about computing their pixel-wise difference?
OR
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
Σ_i^784 |A_i - B_i|    vs    Σ_i^784 |A_i - C_i|
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
Σ_i^784 |A_i - B_i|    vs    Σ_i^784 |A_i - C_i|
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
Σ_i^784 |A_i - B_i|    vs    Σ_i^784 |A_i - C_i|
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
Σ_i^784 |A_i - B_i| = 137.03    Σ_i^784 |A_i - C_i| = 107.38
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
Σ_i^784 |A_i - B_i| = 137.03    Σ_i^784 |A_i - C_i| = 107.38
A is more similar to C than B
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
Σ_i^784 |A_i - B_i| = 137.03    Σ_i^784 |A_i - C_i| = 107.38
A is more similar (closer) to C than B
Σ_i^784 |A_i - B_i| = 137.03    Σ_i^784 |A_i - C_i| = 107.38
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
A is more similar (closer) to C than B
Σ_i^784 |A_i - B_i| = 137.03    Σ_i^784 |A_i - C_i| = 107.38
How to quantitatively say which of
these pairs are more similar?
& &
A B CA
OR
A is more similar (closer) to C than B
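As an illustration (not from the slides), the pixel-wise comparison above fits in a few lines of NumPy; a, b are assumed to be flattened 28x28 images:

import numpy as np

def l1_distance(a, b):
    # Sum of absolute pixel-wise differences between two flattened 28x28 images.
    return np.sum(np.abs(a.astype(float) - b.astype(float)))

# The pair with the smaller sum is the more similar one, e.g.
# l1_distance(A, C) < l1_distance(A, B) means A is closer to C than to B.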
Instance Label
?
Dataset
For each new instance
We asked our friend to
write a bunch of new
digits so that we can
have something to
recognise, here is the
first one of them
Instance Label
?
Dataset
For each new instance
Instance Label
?
1.Compute pixel-wise
distance to all training
examples
For each new instance Dataset
Instance Label
?
1.Compute pixel-wise
distance to all training
examples
For each new instance Dataset
Instance Label
?
1.Compute pixel-wise
distance to all training
examples
For each new instance Dataset
Instance Label
?
1.Compute pixel-wise
distance to all training
examples
For each new instance Dataset
Instance Label
?
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
For each new instance Dataset
Instance Label
?
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
For each new instance Dataset
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
Nearest Neighbour
classifier
For each new instance Dataset
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
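A minimal sketch of this Nearest Neighbour procedure in NumPy, assuming X_train holds the 784-dimensional training vectors and y_train their labels:

import numpy as np

def nn_predict(X_train, y_train, x_new):
    # 1. Compute the pixel-wise (L1) distance to all training examples
    distances = np.abs(X_train.astype(float) - x_new.astype(float)).sum(axis=1)
    # 2. Find the closest training example
    closest = np.argmin(distances)
    # 3. Report its label
    return y_train[closest]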
Advantages of NN
Disadvantages of NN
For each new instance
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
Advantages of NN
Disadvantages of NN
Very easy to implement
For each new instance
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
Advantages of NN
Disadvantages of NN
Very easy to implement
Very slow classification time
For each new instance
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
Advantages of NN
Disadvantages of NN
Very easy to implement
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
For each new instance
Very slow classification time
Curse of dimensionality
Remember we said
that our instances are
784 dimensional?
Curse of dimensionality
Remember we said
that our instances are
784 dimensional?
This is a lot!
http://cs231n.github.io/classification/
http://cs231n.github.io/classification/
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
Advantages of NN
Disadvantages of NN
Very easy to implement
Very slow classification time
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
For each new instance
For each test example
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
Advantages of NN
Disadvantages of NN
Fast training time O(C)
Very easy to implement
Very slow classification time
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
NN is rarely used in
practice
For each test example
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report its label
Advantages of NN
Disadvantages of NN
Fast training time O(C)
Very easy to implement
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
Can we find a better
algorithm?
Very slow classification time
VS
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Back to binary classification
*for a sec
VS
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Back to binary classification
*for a sec
pixel #213
pixel #213
VS
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Back to binary classification
*for a sec
pixel #213
> 163 <= 163
pixel #213
VS
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Back to binary classification
*for a sec
pixel #213
> 163 <= 163
pixel #216
pixel #216
> 30 <= 30
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
pixel #216
VS
Back to binary classification
*for a sec
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision tree
Instances
pixel #216
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
pixel #216
> 30 <= 30
VS
Back to binary classification
*for a sec
pixel #213
> 163 <= 163
Split
VS
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Back to binary classification pixel #213
> 163 <= 163
pixel #216
pixel #216
> 30 <= 30
VS
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Back to binary classification
*for a sec
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
How do you know which
features to use for best splits?
Split
VS
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Back to binary classification
*for a sec
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
How do you know which
features to use for best splits?
Split
Using various goodness metrics
such as information gain or gini
impurity to define “best”
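For concreteness, here is a minimal sketch (an illustration, not taken from the slides) of the Gini impurity that such splits try to reduce:

import numpy as np

def gini_impurity(labels):
    # 1 - sum_k p_k^2: 0 for a pure node, larger for more mixed classes.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A candidate split is scored by the weighted impurity of its two child nodes;
# the tree greedily picks the feature/threshold that lowers impurity the most.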
Decision (classification) tree algorithm
1.Construct a decision
tree based on training
examples
Decision (classification) tree algorithm
1.Construct a decision
tree based on training
examples
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision (classification) tree algorithm
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Instance Label
?
2.Make corresponding
comparisons
1.Construct a decision
tree based on training
examples
#213
For each new instance
Decision (classification) tree algorithm
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Instance Label
?
2.Make corresponding
comparisons
1.Construct a decision
tree based on training
examples
#213
#216
For each new instance
Decision (classification) tree algorithm
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Instance Label
6
3. Report label
1.Construct a decision
tree based on training
examples
2.Make corresponding
comparisons
#213
#216
For each new instance
Decision (classification) tree algorithm
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Instance Label
6 Depth=2
Once the tree is constructed,
a maximum of 2 comparisons
would be needed to test a new
example
3. Report label
1.Construct a decision
tree based on training
examples
2.Make corresponding
comparisons
#213
#216
For each new instance
Decision (classification) tree algorithm
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Instance Label
6 Depth=2
In general decision trees are
*always faster than the NN
algorithm
3. Report label
1.Construct a decision
tree based on training
examples
2.Make corresponding
comparisons
*remember, shit happens
#213
#216
For each new instance
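In practice you would rarely build the tree by hand; here is a hedged scikit-learn sketch, assuming the X_train, y_train, X_test arrays from the MNIST data above:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini")  # or criterion="entropy" for information gain
tree.fit(X_train, y_train)           # 1. construct the tree from training examples
predictions = tree.predict(X_test)   # 2.-3. run the comparisons and report labels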
Can we find a better
algorithm?
Disadvantages of NN
Very slow classification time
Suffers from
the curse of dimensionality
Can we find a better
algorithm?
Disadvantages of NN
Disadvantages of DT
Very slow classification time Very slow classification time
Suffers from
the curse of dimensionality
Can we find a better
algorithm?
Disadvantages of NN
Disadvantages of DT
Also suffers from
the curse of dimensionality
Very slow classification time Very slow classification time
Suffers from
the curse of dimensionality
Is there a way to break the curse?
Is there a way to break the curse?
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision tree algorithm is non-parametric and
deterministic
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision tree algorithm is non-parametric and
deterministic
The shape of the
tree is determined
by data not our
choice
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision tree algorithm is non-parametric and
deterministic
This means that we will always have the same
output given the same input…
The shape of the
tree is determined
by data not our
choice
The shape of the
tree is determined
by data not our
choice
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision tree algorithm is non-parametric and
deterministic
This means that we will always have the same
output given the same input…
Are all input dimensions equally important
for classification?
The shape of the
tree is determined
by data not our
choice
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision tree algorithm is non-parametric and
deterministic
This means that we will always have the same
output given the same input…
How about building a lot of trees from
random parts of the data and then merging
their predictions?
Are all input dimensions equally important
for classification?
The shape of the
tree is determined
by data not our
choice
pixel #213
> 163 <= 163
pixel #216
> 30 <= 30
Decision tree algorithm is non-parametric and
deterministic
This means that we will always have the same
output given the same input…
How about building a lot of trees from
random parts of the data and then merging
their predictions?
Are all input dimensions equally important
for classification?
Random forest algorithm
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Random forest algorithm
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Random forest algorithm
Randomly discard some rows
Randomly discard some rows and columns
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Random forest algorithm
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Random forest algorithm
Build a decision tree
based on remaining data
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
Build a decision tree
based on remaining data
Repeat N times until N trees
are constructed
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
pixel #214
> 253 <= 253
pixel #216
> 30 <= 30
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
pixel #214
> 253 <= 253
pixel #216
> 30 <= 30
Instance Label
?
For each new instance
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
pixel #214
> 253 <= 253
pixel #216
> 30 <= 30
Instance Label
?
For each new instance Use all constructed trees to
generate predictions
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
pixel #214
> 253 <= 253
pixel #216
> 30 <= 30
Instance Label
?
For each new instance Predictions
Tree #2
Tree #1
Tree #3
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
pixel #214
> 253 <= 253
pixel #216
> 30 <= 30
Instance Label
For each new instance Predictions
Tree #2
Tree #1
Tree #3?
Average 2/3
Random forest algorithm
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
pixel #214
> 253 <= 253
pixel #216
> 30 <= 30
Instance Label
For each new instance Predictions
Tree #2
Tree #1
Tree #36
Average 2/3 = 66.6%
Random forest algorithm
Instance Label
6
For each new instance Predictions
Tree #2
Tree #1
Tree #3
Average 2/3 = 66.6%
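A minimal scikit-learn sketch of the same idea, assuming the X_train/y_train split from before; each tree sees a bootstrap sample of the rows and a random subset of features at each split, and the trees vote:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)  # build 100 randomised trees
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)               # majority vote over the trees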
Quiz time
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
pixel #213
> 163 <= 163
pixel #214
> 253 <= 253
pixel #216
> 30 <= 30
Q1:
Which classification algorithm(s) has(ve) the following
weaknesses:
• It takes more time to train the classifier than to
classify a new instance
• It suffers from the curse of dimensionality
A. Nearest neighbour algorithm
B. Decision tree
C. Random forest algorithm
D. None of the above
E. All of the above
Q1:
A. Nearest neighbour algorithm
B. Decision tree
C. Random forest algorithm
D. None of the above
E. All of the above
• It takes more time to train the classifier than to
classify a new example
• It suffers from the curse of dimensionality
pixel #213
> 163 <= 163
pixel #216
> 0 = 0
Which classification algorithm(s) has(ve) the following
weaknesses:
Q2:
A. Prohibitively slow running time at
training given a lot of data
B. Highly biased classification due to the
prevalence of one of the classes
C. High classification error due to
excessively complex classifier
D. Poor performance of the classifier
trained on data with large number of
features
E. None of the above
Which of the following statements best defines the curse of
dimensionality
Q2:
Which of the following statements best defines the curse of
dimensionality
A. Prohibitively slow running time at
training given a lot of data
B. Highly biased classification due to the
prevalence of one of the classes
C. High classification error due to
excessively complex classifier
D. Poor performance of the classifier
trained on data with large number
of features
E. None of the above
Q3:
Which of the following algorithms would you prefer if you
had to classify instances from low-dimensional data?
A. Nearest neighbour algorithm
B. Decision tree algorithm
C. Random forest algorithm
D. All mentioned would cope
E. None of the above are suitable
A. Nearest neighbour algorithm
B. Decision tree algorithm
C. Random forest algorithm
D. All mentioned would cope
E. None of the above are suitable
Q3:
Which of the following algorithm(s) would you prefer if you
had to classify instances from low-dimensional data?
Support Vector
Machine
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 202 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Let us go primitive, and focus only on two
pixels
Feature vectors Labels
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6
0 0 0 0 0 0 0 0 59 163 254 202 254 194 112 18 0 0 0 0 … 3
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6
Instances
Let us go primitive, and focus only on two
pixels
It does not really matter which ones. I will take these two
because we got used to them already :)
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Let us go primitive, and focus only on two
pixels
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Now, let's visualise them on a 2-D plot: Pixel #213 on the x-axis, Pixel #215 on the y-axis, both running from 0 to 254
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Now, let's visualise them on a 2-D plot: Pixel #213 on the x-axis, Pixel #215 on the y-axis, both running from 0 to 254
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Now, let's visualise them on a 2-D plot: Pixel #213 on the x-axis, Pixel #215 on the y-axis, both running from 0 to 254
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Now, let's visualise them on a 2-D plot: Pixel #213 on the x-axis, Pixel #215 on the y-axis, both running from 0 to 254
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Support Vector Machine (SVM): scatter plot of Pixel #213 (x-axis) vs Pixel #215 (y-axis), both 0 to 254
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
A
B C
Is it A, B or C?
Support Vector Machine (SVM)
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Support Vector Machine (SVM)
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Margin
Support Vector Machine (SVM)
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Margin
Support Vector Machine (SVM)
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Support Vector Machine (SVM)
Closest points that define hyper-
plane are called support vectors
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
3. The larger the distance
from the hyper-plane to
the instance, the more
confident the classifier is
about its prediction
more
confidence
Support Vector Machine (SVM)
Closest points that define hyper-
plane are called support vectors
Features Labels
254 254 3
254 193 6
254 0 6
163 202 3
227 84 6
Instances
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Support Vector Machine (SVM)
3. The larger the distance
from the hyper-plane to
the instance, the more
confident the classifier is
about its prediction
Closest points that define hyper-
plane are called support vectors
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Support Vector Machine (SVM)
3. The larger the distance
from the hyper-plane to
the instance, the more
confident the classifier is
about its prediction
Closest points that define hyper-
plane are called support vectors
Instance Label
?
For each new instance
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Support Vector Machine (SVM)
3. The larger the distance
from the hyper-plane to
the instance, the more
confident the classifier is
about its prediction
Closest points that define hyper-
plane are called support vectors
Instance Label
?
For each new instance
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Support Vector Machine (SVM)
3. The larger the distance
from the hyper-plane to
the instance, the more
confident the classifier is
about its prediction
Closest points that define hyper-
plane are called support vectors
Instance Label
6
For each new instance
Pixel#215
Pixel #213
254
2540
0
1.Identify the right hyper-
plane
2. Maximise the distance
between nearest points
and a hyper-plane
Support Vector Machine (SVM)
3. The larger the distance
from the hyper-plane to
the instance, the more
confident the classifier is
about its prediction
Closest points that define hyper-
plane are called support vectors
Instance Label
6
For each new instance
Pixel#215
Pixel #213
254
2540
0
Support Vector Machine (SVM)
What should we do now?
y
x
254
2540
0
Support Vector Machine (SVM)
Let’s make another dimension
z = a*x² + b*y²
y
x
254
2540
0
Support Vector Machine (SVM)
Let’s make another dimension
z = a*x² + b*y²
z
x
2540
0
y
x
254
2540
0
Support Vector Machine (SVM)
Let’s make another dimension
z = a*x² + b*y²
z
x
2540
0
y
x
254
2540
0
Support Vector Machine (SVM)
Let’s make another dimension
z = a*x² + b*y²
z
x
2540
0
y
x
254
2540
0
Support Vector Machine (SVM)
Let’s make another dimension
z = a*x² + b*y²
z
x
2540
0
This transformation is called a kernel trick
and function z is the kernel
y
x
254
2540
0
Support Vector Machine (SVM)
Let’s make another dimension
z = a*x² + b*y²
z
x
2540
0
This transformation is called a kernel trick
and function z is the kernel
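A hedged sketch of fitting a kernelised SVM in scikit-learn (the RBF kernel is one common choice; the mapping to the higher-dimensional space happens implicitly, which is why it is a "trick"):

from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, gamma="scale")  # kernelised SVM
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)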
Wow, wow, wow, hold on!
How does this actually work?
For each test example
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report it’s label
Advantages of NN
Disadvantages of NN
Fast training time O(C)
Very easy to implement
Very slow classification time
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
Comparison with SVM
Disadvantages of SVM
Very slow classification time
Suffers from
the curse of dimensionality
For each test example
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report it’s label
Advantages of NN
Disadvantages of NN
Fast training time O(C)
Very easy to implement
Very slow classification time
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
Comparison with SVM
Disadvantages of SVM
Very slow classification time
Suffers from
the curse of dimensionality
For each test example
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report it’s label
Advantages of NN
Disadvantages of NN
Fast training time O(C)
Very easy to implement
Very slow classification time
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
Comparison with SVM
Disadvantages of SVM
Very slow classification time
Suffers from
the curse of dimensionality
For each test example
Instance Label
3
1.Compute pixel-wise
distance to all training
examples
2. Find the closest
training example
3. Report it’s label
Advantages of NN
Disadvantages of NN
Fast training time O(C)
Very easy to implement
Very slow classification time
Suffers from
the curse of dimensionality
Could be a good choice for
low-dimensional problems
Comparison with SVM
Disadvantages of SVM
Very slow classification time
Suffers from
the curse of dimensionality
It might be tricky to choose
the right kernel
Quiz time
Q:
How would you approach a multi-class classification task using
SVM?
Q:
How would you approach a multi-class classification task using
SVM?
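One common answer (an assumption here, since the slides do not spell it out) is to train one binary SVM per class, for example one-vs-rest:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One "this digit vs all other digits" SVM per class; the most confident one wins.
multiclass_svm = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_train, y_train)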
Pixel#215
Pixel #213
254
2540
0
Support Vector Machine (SVM)
Support Vector Machine (SVM)
Support Vector Machine (SVM)
Support Vector Machine (SVM)
100% accurate!
100% accurate!
accuracy =
correctly classified instances
total number of instances
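A small worked check of this formula (illustration only):

import numpy as np

def accuracy(y_true, y_pred):
    # correctly classified instances / total number of instances
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

print(accuracy([3, 6, 6, 3], [3, 6, 1, 3]))  # 3 out of 4 correct -> 0.75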
100% accurate!
Can we trust this model?
100% accurate!
Can we trust this model?
Consider the following example:
100% accurate!
Can we trust this model?
Consider the following example:
Whatever happens,
predict 0
100% accurate!
Can we trust this model?
Consider the following example:
Whatever happens,
predict 0
Accuracy = 49/50
100% accurate!
Can we trust this model?
Consider the following example:
Whatever happens,
predict 0
Accuracy = 98%
100% accurate!
Can we trust this model?
Consider the following example:
Count
Histogram could help you figure
out if your dataset is unbalanced
100% accurate!
Can we trust this model?
Consider the following example:
What if my data is unbalanced?
Count
Histogram could help you figure
out if your dataset is unbalanced
100% accurate!
Can we trust this model?
Consider the following example:
There are few ways, we are
going to discuss them later Count
What if my data is unbalanced?
Histogram could help you figure
out if your dataset is unbalanced
100% accurate!
Can we trust this model?
In our case data is balanced:
100% accurate!
100% accurate!
Can we trust this model?
We have balanced data:
100% accurate!
Can we trust this model?
We have balanced data:
100% accurate!
Can we trust this model?
We have balanced data:
100% accurate!
Can we trust this model?
We have balanced data:
😒
So, what happened?
100% accurate!
Training the model
Feature#2
Feature #1
Let’s add more examples
Training the model
Feature#2
Feature #1
Training the model
Still linearly separable
Feature#2
Feature #1
Still linearly separable
Training the model
Feature#2
Feature #1
Training the model
Feature#2
Feature #1
Training the model
Feature#2
Feature #1
How about now?
Feature#2
Feature #1
Training the model
Feature#2
Feature #1
Simple; not perfect fit Complicated; ideal fit
Which model should we use?
Training the model
Feature#2
Feature #1
Feature#2
Feature #1
Simple; not perfect fit Complicated; ideal fit
Training the model
Which model should we use?
Feature#2
Feature #1
Feature#2
Feature #1
Simple; not perfect fit Complicated; ideal fit
Training the model
Which model should we use?
Feature#2
Feature #1
Feature#2
Feature #1
Simple; not perfect fit Complicated; ideal fit
Training the model
Which model should we use?
Feature#2
Feature #1
Feature#2
Feature #1
So, what happened?
Overfitting
100% accurate!
So, what happened?
Too general
model
Just right! Overfitting
100% accurate!
So, what happened?
Too general
model
Just right! Overfitting
We should split our data into train and test sets
100% accurate!
Split into train and test
Split into train and test
Normally we would
split data into 80%
train and 20% test
sets
Split into train and test
Normally we would
split data into 80%
train and 20% test
sets
As we have a lot of
data we can afford
50/50 ratio
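A minimal sketch of such a split with scikit-learn (80/20 here; stratify keeps the class proportions the same in both parts):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)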
Split into train and test
Can we do better than 90%?
Normally we would
split data into 80%
train and 20% test
sets
As we have a lot of
data we can afford
50/50 ratio
Parameter tuning
Model
hyper-parameter
Pixel#215
Pixel #213
254
2540
0
Model
hyper-parameter
Pixel#215
Pixel #213
254
2540
0
C = 1
Pixel#215
Pixel #213
254
2540
0
In red are areas where
penalty is applied to
instances close to the line
C = 1
Pixel#215
Pixel #213
254
2540
0
In red are areas where
penalty is applied to
instances close to the line
In green are areas where
no penalty is applied
C = 1
Pixel#215
Pixel #213
254
2540
0
In red are areas where
penalty is applied to
instances close to the line
In green are areas where
no penalty is applied
Total amount of penalty applied to the classifier is called loss
Classifiers try to minimise loss by adjusting their parameters
C = 1
Pixel#215
Pixel #213
254
2540
0
In red are areas where
penalty is applied to
instances close to the line
In green are areas where
no penalty is applied
Total amount of penalty applied to the classifier is called loss
Classifiers try to minimise loss by adjusting their parameters
C = 1
This instance increases
the penalty
Pixel#215
Pixel #213
254
2540
0
Total amount of penalty applied to the classifier is called loss
Classifiers try to minimise loss by adjusting their parameters
Now it is in a green area
In red are areas where
penalty is applied to
instances close to the line
In green are areas where
no penalty is applied
C = 1
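For reference, a minimal NumPy sketch of the hinge loss that soft-margin SVMs minimise; labels are assumed to be -1/+1 and scores are signed distances from the hyper-plane:

import numpy as np

def hinge_loss(scores, labels):
    # 0 penalty for points on the correct side and outside the margin (green area);
    # points inside the margin or misclassified (red area) are penalised.
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))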
Parameter tuning
Algorithm | Hyper-parameters
K-nearest neighbour | K, the number of neighbours (1,…,100)
Decision Tree | split metric ('gini', 'information gain')
Random Forest | number of trees (3,…,100, more is better), split metric ('gini', 'information gain')
SVM | C (10^-5,…,10^2) and gamma (10^-15,…,10^2)
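A hedged sketch of the kind of search described next: try several values of C and keep the one with the best validation score (X_val/y_val are an assumed held-out validation split):

from sklearn.svm import SVC

best_C, best_score = None, 0.0
for C in [1e-5, 1e-3, 1e-1, 1.0, 10.0, 100.0]:
    model = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    score = model.score(X_val, y_val)   # accuracy on the validation set
    if score > best_score:
        best_C, best_score = C, score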
Let’s try different C maybe
our score will improve
Let’s try different C maybe
our score will improve
Nope…
Let’s try different C maybe
our score will improve
Fail again…
Let’s try different C maybe
our score will improve
It is getting depressing…
Let’s try different C maybe
our score will improve
Hurrah!
Let’s try different C maybe
our score will improve
Hurrah!
You may not have noticed but…
Let’s try different C maybe
our score will improve
Hurrah!
You may not have noticed but…
We are overfitting again…
The whole
dataset 100%
Training 60%
The whole
dataset 100%
Training 60%
For fitting initial model
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
The whole
dataset 100%
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
5/7
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
5/7
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
5/7
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
5/7
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
7/7
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
7/7
Testing 20%
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
7/7
Testing 20%
For one shot evaluation
of trained model
5/5
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
7/7
Testing 20%
For one shot evaluation
of trained model
5/5
But what happens when you overfit
validation set?
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
Testing 20%
For one shot evaluation
of trained model
5/5 You're doing great!
🙂
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
Testing 20%
For one shot evaluation
of trained model
5/5 You're doing great!
🙂
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
Testing 20%
For one shot evaluation
of trained model
4/5 You're doing great!
🙂
The whole
dataset 100%
Training 60%
For fitting initial model
Validation 20%
For parameter tuning &
performance evaluation
Testing 20%
For one shot evaluation
of trained model
4/5 You're doing great!
🙂 😒
The whole dataset 100%
Cross Validation (CV) Algorithm
Training data 80%
Cross Validation (CV) Algorithm
Test 20%
Training data 80%
Cross Validation (CV) Algorithm
20%20%20% 20%
Training data 80%
Cross Validation (CV) Algorithm
Training data 80%
Cross Validation (CV) Algorithm
20%20%20% 20%
Train on 60% of data Validate on
20%
20%20%20% 20%
20%20%20% 20%
Training data 80%
Cross Validation (CV) Algorithm
TrainTrainTrain Val
Train on 60% of data Validate on
20%
Cross Validation (CV) Algorithm
0.75
20%20%20% 20%
Training data 80%
TrainTrainTrain Val
Train on 60% of data Validate on
20%
Cross Validation (CV) Algorithm
0.75
ValTrainTrain Train 0.85
20%20%20% 20%
Training data 80%
TrainTrainTrain Val
20%20%20% 20%
Training data 80%
Cross Validation (CV) Algorithm
0.75
ValTrainTrain Train 0.85
TrainTrainTrain Val
TrainValTrain Train 0.91
Cross Validation (CV) Algorithm
0.75
0.85
TrainTrainVal Train
0.91
0.68
20%20%20% 20%
Training data 80%
ValTrainTrain Train
TrainTrainTrain Val
TrainValTrain Train
TrainTrainVal Train
20%20%20% 20%
Training data 80%
ValTrainTrain Train
TrainTrainTrain Val
TrainValTrain Train
Cross Validation (CV) Algorithm
0.75
0.85
0.91
0.68
MEAN (0.75, 0.85, 0.91, 0.68) = ?
TrainTrainVal Train
20%20%20% 20%
Training data 80%
ValTrainTrain Train
TrainTrainTrain Val
TrainValTrain Train
Cross Validation (CV) Algorithm
0.75
0.85
0.91
0.68
MEAN (0.75, 0.85, 0.91, 0.68) ≈ 0.80
TrainTrainVal Train
20%20%20% 20%
Training data 80%
ValTrainTrain Train
TrainTrainTrain Val
TrainValTrain Train
Cross Validation (CV) Algorithm
0.75
0.85
0.91
0.68
MEAN (0.75, 0.85, 0.91, 0.68) ≈ 0.80
Choose the best model/parameters based on
this estimate and then apply it to the test set
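A minimal scikit-learn sketch of this 4-fold cross-validation (illustrative; the actual fold scores will of course differ):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = cross_val_score(SVC(C=1.0), X_train, y_train, cv=4)  # 4 folds: train on 3, validate on 1
print(scores, scores.mean())  # e.g. folds like 0.75, 0.85, 0.91, 0.68 average to about 0.80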
Machine Learning pipeline
Raw Data
Machine Learning pipeline
Raw Data Preprocessing
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Choose a model
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Choose a model
Find best
parameters
using CV
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Choose a model
Find best
parameters
using CV
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Choose a model
Find best
parameters
using CV
Train the model
on the whole
training set
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Choose a model
Find best
parameters
using CV
Train the model
on the whole
training set
Evaluate final
model on
the test set
test set
Machine Learning pipeline
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Choose a model
Find best
parameters
using CV
Train the model
on the whole
training set
Evaluate final
model on
the test set
test set
Machine Learning pipeline
Report your results
Raw Data Preprocessing
Feature
extraction
Split into
train & test
test set
Choose a model
Find best
parameters
using CV
Train the model
on the whole
training set
Evaluate final
model on
the test set
test set
Machine Learning pipeline
Report your results
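Putting the last few steps together, here is a hedged end-to-end sketch with scikit-learn: preprocessing, CV-based parameter search on the training set, then a one-shot evaluation on the held-out test set:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
search = GridSearchCV(pipeline,
                      {"svm__C": [0.1, 1, 10], "svm__gamma": [1e-3, 1e-2]},
                      cv=5)                      # find best parameters using CV
search.fit(X_train, y_train)                     # refit on the whole training set
print(search.score(X_test, y_test))              # evaluate the final model on the test set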
Problem
A machine learning algorithm usually
corresponds to a combination of the
following 3 elements
The choice of a specific mapping function family F (K-NN,
SVM, DT, RF, Neural Networks etc.).
A machine learning algorithm usually
corresponds to a combination of the
following 3 elements
Way to evaluate the quality of a function f out of F. Ways of
saying how bad/good this function f is doing in classifying
real world objects.
The choice of a specific mapping function family F (K-NN,
SVM, DT, RF, Neural Networks etc.).
A machine learning algorithm usually
corresponds to a combination of the
following 3 elements
a way to search for a better function f out of F. How to
choose parameters so that the performance of f would
improve.
Way to evaluate the quality of a function f out of F. Ways of
saying how bad/good this function f is doing in classifying
real world objects.
The choice of a specific mapping function family F (K-NN,
SVM, DT, RF, Neural Networks etc.).
https://github.com/sugyan/tensorflow-mnist
References
• Machine Learning by Andrew Ng (https://www.coursera.org/learn/machine-
learning)
• Introduction to Machine Learning by Pascal Vincent given at Deep Learning
Summer School, Montreal 2015 (http://videolectures.net/
deeplearning2015_vincent_machine_learning/)
• Welcome to Machine Learning by Konstantin Tretyakov delivered at AACIMP
Summer School 2015 (http://kt.era.ee/lectures/aacimp2015/1-intro.pdf)
• Stanford CS class: Convolutional Neural Networks for Visual Recognition by
Andrej Karpathy (http://cs231n.github.io/)
• Data Mining Course by Jaak Vilo at University of Tartu (https://courses.cs.ut.ee/
MTAT.03.183/2017_spring/uploads/Main/DM_05_Clustering.pdf)
• Machine Learning Essential Concepts by Ilya Kuzovkin (https://
www.slideshare.net/iljakuzovkin)
• From the brain to deep learning and back by Raul Vicente Zafra and Ilya
Kuzovkin (http://www.uttv.ee/naita?id=23585&keel=eng)
www.biit.cs.ut.ee www.ut.ee www.quretec.ee
You, guys, rock!

1 Supervised learning

  • 37. Regression task (housing price example: price in 1000’s ($) as a function of size in m2).
  • 38. Supervised Learning: Classification (e.g. benign vs malignant tumour given tumour size) VS Regression (e.g. predicting a house price from its size in m2).
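A toy sketch contrasting the two supervised settings above; the numbers below are made up purely for illustration and are not from the slides.

# Regression vs classification on made-up toy data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous price (in 1000's $) from house size (m2)
sizes = np.array([[100], [200], [300], [400], [500]])
prices = np.array([100, 180, 250, 330, 400])
reg = LinearRegression().fit(sizes, prices)
print("Predicted price for 250 m2:", reg.predict([[250]]))

# Classification: predict a discrete label (0 = benign, 1 = malignant)
tumour_size = np.array([[1.0], [1.5], [2.0], [3.5], [4.0], [5.0]])
malignant = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(tumour_size, malignant)
print("Predicted class for tumour size 3.0:", clf.predict([[3.0]]))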
  • 39. Q: You are running a company which has two problems, namely: 1. For each user in the database, predict if this user will continue using your company’s product or will move to competitors (churn). 2. Predict the profit of your company at the end of this year based on previous records. How would you approach these problems? a. Both problems are examples of classification problems b. The first one is a classification task and the second one a regression problem c. The first one is a regression problem and the second one a classification task d. Both problems are regression problems
  • 40. (The first problem is a classification task and the second a regression problem, i.e. option b.)
  • 41. Unsupervised Learning examples: clustering of Google queries, gene expression clustering. In contrast to supervised learning, we do not have labels for our classes (the two-feature plots from the previous examples become unlabelled). Quiz question: “Of the following examples, which would you address using an unsupervised learning algorithm?”
  • 44. Unsupervised Learning (tumour size vs age, without labels): Is there any interesting hidden structure in this data?
  • 45. What does this hidden structure correspond to?
  • 56–60. Quiz:
Q1: Some telecommunication company wants to segment their customers into distinct groups in order to send appropriate subscription offers; this is an example of clustering.
Q2: You are given data about seismic activity in Japan, and you want to predict the magnitude of the next earthquake; this is an example of regression.
Q3: Assume you want to perform supervised learning and to predict the number of newborns according to the size of the storks’ population (http://www.brixtonhealth.com/storksBabies.pdf); it is an example of stupidity regression.
Q4: Discriminating between spam and ham e-mails is a classification task.
  • 62–71. MNIST dataset (10000 images, can be downloaded from http://yann.lecun.com/exdb/mnist/). Each instance is a 28px × 28px image of a handwritten digit, flattened into a vector of 784 pixel values, e.g. (0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 …) with label 3. A few instances and their labels:
0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … → 3
0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … → 6
0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … → 1
0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … → 8
0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … → 1
Each row is an instance; its 784 pixel values are the features (together forming a feature vector). Features are also sometimes referred to as dimensions, so these images are 784-dimensional.
  • 72–73. Data is loaded. What should we do now? We would like to build a tool that would be able to automatically recognise handwritten digits. Let’s get to the first algorithm.
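A minimal sketch of loading MNIST as feature vectors and labels. Assumption: instead of parsing the raw files from yann.lecun.com, we fetch the standard 70000-image MNIST via OpenML and keep a 10000-image subset to match the slides.

from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:10000], y[:10000]      # 10000 instances, as in the slides

print(X.shape)   # (10000, 784) -> 784 pixel values (features/dimensions) per image
print(y[:5])     # string labels such as '5', '0', '4', ...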
  • 74–83. How can we quantitatively say which of these pairs of images is more similar: A & B, or A & C? What about computing their pixel-wise difference? Σ(i=1..784) |Ai − Bi| = 137.03 versus Σ(i=1..784) |Ai − Ci| = 107.38, so A is more similar (closer) to C than to B.
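A small sketch of the pixel-wise (L1, or Manhattan) distance used above. Assumption: image_a, image_b, image_c are hypothetical names for flattened 784-pixel images scaled to [0, 1], which is why the distances quoted on the slide are around 100 rather than tens of thousands.

import numpy as np

def l1_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of absolute pixel-wise differences between two flattened images."""
    return float(np.sum(np.abs(a - b)))

# usage: the smaller the distance, the more similar the images
# d_ab = l1_distance(image_a, image_b)
# d_ac = l1_distance(image_a, image_c)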
  • 84–92. Nearest Neighbour classifier. We asked our friend to write a bunch of new digits so that we would have something to recognise. For each new instance (label unknown): 1. Compute the pixel-wise distance to all training examples in the dataset. 2. Find the closest training example. 3. Report its label (here: 3).
  • 93–96. Advantages of NN: very easy to implement; could be a good choice for low-dimensional problems. Disadvantages of NN: very slow classification time; suffers from the curse of dimensionality.
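A from-scratch sketch of the three steps above; X_train, y_train and x_new are hypothetical names for the training images, their labels and a new flattened image. (scikit-learn's KNeighborsClassifier(n_neighbors=1) does the same thing.)

import numpy as np

def nn_predict(x_new: np.ndarray, X_train: np.ndarray, y_train: np.ndarray):
    # 1. Compute the pixel-wise (L1) distance to all training examples
    distances = np.abs(X_train - x_new).sum(axis=1)
    # 2. Find the closest training example
    closest = np.argmin(distances)
    # 3. Report its label
    return y_train[closest]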
  • 97–99. Curse of dimensionality: remember we said that our instances are 784-dimensional? This is a lot!
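An illustrative sketch (not from the slides) of why high dimensionality hurts distance-based methods: for random points, the nearest and farthest distances become almost indistinguishable as the number of dimensions grows, so "the closest example" carries less and less information.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 784):
    points = rng.random((1000, d))                          # 1000 random points in d dimensions
    dists = np.abs(points - points[0]).sum(axis=1)[1:]      # L1 distances to the first point
    print(f"d={d}: nearest/farthest distance ratio = {dists.min() / dists.max():.2f}")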
  • 102–104. NN recap. Advantages: fast training time O(C); very easy to implement; could be a good choice for low-dimensional problems. Disadvantages: very slow classification time; suffers from the curse of dimensionality. NN is rarely used in practice. Can we find a better algorithm?
  • 105–113. Back to binary classification (for a sec): 3 VS 6, using the same feature vectors but with labels 3, 6, 6, 3, 6. We can split the instances on individual pixel values, e.g. pixel #213 > 163 vs <= 163, and within a branch split again on pixel #216 > 30 vs <= 30. Such a hierarchy of splits is a decision tree. How do you know which features to use for the best splits? Using various goodness metrics, such as information gain or Gini impurity, to define “best”.
  • 114–120. Decision (classification) tree algorithm. 1. Construct a decision tree based on the training examples (e.g. split on pixel #213 > 163 / <= 163, then on pixel #216 > 30 / <= 30). Then, for each new instance: 2. Make the corresponding comparisons. 3. Report the label (here: 6). With depth = 2, once the tree is constructed at most 2 comparisons are needed to test a new example, so in general decision trees are *always faster than the NN algorithm (*remember, shit happens).
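A minimal sketch of training and using a decision tree on the same kind of data; X_train, y_train and x_new are the same hypothetical names as in the NN sketch above, and the depth limit and criterion are illustrative choices.

from sklearn.tree import DecisionTreeClassifier

# 1. Construct a decision tree from the training examples
tree = DecisionTreeClassifier(max_depth=2, criterion="gini")   # or "entropy" for information gain
tree.fit(X_train, y_train)

# 2.-3. For a new instance: follow the learned comparisons and report the label
print(tree.predict(x_new.reshape(1, -1)))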
  • 121–125. Can we find a better algorithm? Disadvantages of NN: very slow classification time; suffers from the curse of dimensionality. Disadvantages of DT: also suffers from the curse of dimensionality. Is there a way to break the curse?
  • 126–131. The decision tree algorithm is non-parametric and deterministic: the shape of the tree is determined by the data, not by our choice, which means that we will always get the same output given the same input. But are all input dimensions equally important for classification? How about building a lot of trees from random parts of the data and then merging their predictions? This is the Random forest algorithm.
  • 132. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm
  • 133. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm Randomly discard some rows
  • 134. Randomly discard some rows and columns Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm
  • 135. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm Build a decision tree based on remaining data pixel #213 > 163 <= 163 pixel #216 > 0 = 0
  • 136. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 Build a decision tree based on remaining data Repeat N times until N trees are constructed
  • 137. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163
  • 138. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 254 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30
  • 139. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label ? For each new instance
  • 140. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label ? For each new instance Use all constructed trees to generate predictions
  • 141. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label ? For each new instance Predictions Tree #2 Tree #1 Tree #3
  • 142. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label ? For each new instance Predictions Tree #1 Tree #2 Tree #3 Average 2/3
  • 143. Random forest algorithm pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30 Instance Label 6 For each new instance Predictions Tree #1 Tree #2 Tree #3 Average 2/3 = 66.6%
  • 144. Random forest algorithm Instance Label 6 For each new instance Predictions Tree #2 Tree #1 Tree #3 Average 2/3 = 66.6% Quiz time pixel #213 > 163 <= 163 pixel #216 > 0 = 0 pixel #213 > 163 <= 163 pixel #214 > 253 <= 253 pixel #216 > 30 <= 30
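A minimal sketch of the procedure walked through above (my code, assuming scikit-learn and numpy; the bundled digits dataset stands in for the slides' handwritten digits): each tree sees a random subset of rows and columns, and the forest's answer is the majority vote over the trees.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = load_digits(return_X_y=True)
    n_trees, n_rows, n_cols = 10, 500, 20

    trees, col_subsets = [], []
    for _ in range(n_trees):
        rows = rng.choice(len(X), size=n_rows, replace=True)        # randomly keep some rows
        cols = rng.choice(X.shape[1], size=n_cols, replace=False)   # ... and some columns
        trees.append(DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]))
        col_subsets.append(cols)

    def forest_predict(x):
        # Ask every tree for its label and report the most common answer.
        votes = [t.predict(x[cols].reshape(1, -1))[0] for t, cols in zip(trees, col_subsets)]
        return max(set(votes), key=votes.count)

    print(forest_predict(X[0]), y[0])

In practice sklearn.ensemble.RandomForestClassifier implements this idea (with a few refinements) out of the box.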
  • 145. Q1: Which classification algorithm(s) has (have) the following weaknesses: • It takes more time to train the classifier than to classify a new instance • It suffers from the curse of dimensionality A. Nearest neighbour algorithm B. Decision tree C. Random forest algorithm D. None of the above E. All of the above
  • 146. Q1: A. Nearest neighbour algorithm B. Decision tree C. Random forest algorithm D. None of the above E. All of the above • It takes more time to train the classifier than to classify a new example • It suffers from the curse of dimensionality pixel #213 > 163 <= 163 pixel #216 > 0 = 0 Which classification algorithm(s) has (have) the following weaknesses:
  • 147. Q2: A. Prohibitively slow running time at training given a lot of data B. Highly biased classification due to the prevalence of one of the classes C. High classification error due to an excessively complex classifier D. Poor performance of a classifier trained on data with a large number of features E. None of the above Which of the following statements best defines the curse of dimensionality?
  • 148. Q2: Which of the following statements best defines the curse of dimensionality? A. Prohibitively slow running time at training given a lot of data B. Highly biased classification due to the prevalence of one of the classes C. High classification error due to an excessively complex classifier D. Poor performance of a classifier trained on data with a large number of features E. None of the above
  • 149. Q3: Which of the following algorithms would you prefer if you had to classify instances from low-dimensional data? A. Nearest neighbour algorithm B. Decision tree algorithm C. Random forest algorithm D. All mentioned would cope E. None of the above are suitable
  • 150. A. Nearest neighbour algorithm B. Decision tree algorithm C. Random forest algorithm D. All mentioned would cope E. None of the above are suitable Q3: Which of the following algorithm(s) would you prefer if you had to classify instances from low-dimensional data?
  • 152. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 202 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Let us go primitive, and focus only on two pixels
  • 153. Feature vectors Labels 0 0 0 0 0 0 0 31 132 254 253 254 213 82 0 0 0 0 0 0 … 3 0 0 0 0 0 0 0 25 142 254 254 193 30 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 123 254 87 0 0 0 0 0 0 0 0 0 … 6 0 0 0 0 0 0 0 0 59 163 254 202 254 194 112 18 0 0 0 0 … 3 0 0 0 0 0 0 0 0 19 227 254 84 0 0 0 0 0 0 0 0 … 6 Instances Let us go primitive, and focus only on two pixels It does not really matter which ones. I will take these two because we have gotten used to them already :)
  • 154. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Let us go primitive, and focus only on two pixels
  • 155. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let's visualise them on a 2-D plot (axes: Pixel #215 vs Pixel #213, 0-254)
  • 156. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let's visualise them on a 2-D plot (axes: Pixel #215 vs Pixel #213, 0-254)
  • 157. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let's visualise them on a 2-D plot (axes: Pixel #215 vs Pixel #213, 0-254)
  • 158. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Now, let's visualise them on a 2-D plot (axes: Pixel #215 vs Pixel #213, 0-254)
  • 159. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances Support Vector Machine (SVM) (axes: Pixel #215 vs Pixel #213, 0-254)
  • 160. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane A B C Is it A, B or C? Support Vector Machine (SVM)
  • 161. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Support Vector Machine (SVM)
  • 162. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Margin Support Vector Machine (SVM)
  • 163. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Margin Support Vector Machine (SVM)
  • 164. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Support Vector Machine (SVM) The closest points, which define the hyperplane, are called support vectors
  • 165. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane 3. The larger the distance from the hyperplane to an instance, the more confident the classifier is about its prediction more confidence Support Vector Machine (SVM) The closest points, which define the hyperplane, are called support vectors
  • 166. Features Labels 254 254 3 254 193 6 254 0 6 163 202 3 227 84 6 Instances (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Support Vector Machine (SVM) 3. The larger the distance from the hyperplane to an instance, the more confident the classifier is about its prediction The closest points, which define the hyperplane, are called support vectors
  • 167. (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Support Vector Machine (SVM) 3. The larger the distance from the hyperplane to an instance, the more confident the classifier is about its prediction The closest points, which define the hyperplane, are called support vectors Instance Label ? For each new instance
  • 168. (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Support Vector Machine (SVM) 3. The larger the distance from the hyperplane to an instance, the more confident the classifier is about its prediction The closest points, which define the hyperplane, are called support vectors Instance Label ? For each new instance
  • 169. (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Support Vector Machine (SVM) 3. The larger the distance from the hyperplane to an instance, the more confident the classifier is about its prediction The closest points, which define the hyperplane, are called support vectors Instance Label 6 For each new instance
  • 170. (axes: Pixel #215 vs Pixel #213, 0-254) 1. Identify the right hyperplane 2. Maximise the distance between the nearest points and the hyperplane Support Vector Machine (SVM) 3. The larger the distance from the hyperplane to an instance, the more confident the classifier is about its prediction The closest points, which define the hyperplane, are called support vectors Instance Label 6 For each new instance
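A minimal sketch of a linear SVM on the two-pixel toy features from the table above (my code, assuming scikit-learn; the new instance is made up for illustration): fit the maximum-margin hyperplane, inspect the support vectors, and use the signed distance to the hyperplane as a confidence score.

    import numpy as np
    from sklearn.svm import SVC

    # Pixel #213 and Pixel #215 values from the slide's table, labels 3 and 6.
    X = np.array([[254, 254], [163, 202],            # class "3"
                  [254, 193], [254, 0], [227, 84]])  # class "6"
    y = np.array([3, 3, 6, 6, 6])

    clf = SVC(kernel="linear").fit(X, y)

    print(clf.support_vectors_)            # the closest points, which define the hyperplane
    new_instance = np.array([[240, 60]])   # a hypothetical new digit's two pixel values
    print(clf.predict(new_instance))           # predicted label (here most likely 6)
    print(clf.decision_function(new_instance)) # larger magnitude = more confident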
  • 171. Support Vector Machine (SVM) (axes: Pixel #215 vs Pixel #213, 0-254) What should we do now?
  • 172. Support Vector Machine (SVM) (axes: y vs x, 0-254) Let's make another dimension: z = a·x² + b·y²
  • 173. Support Vector Machine (SVM) (axes: y vs x, 0-254) Let's make another dimension: z = a·x² + b·y² (axes: z vs x)
  • 174. Support Vector Machine (SVM) (axes: y vs x, 0-254) Let's make another dimension: z = a·x² + b·y² (axes: z vs x)
  • 175. Support Vector Machine (SVM) (axes: y vs x, 0-254) Let's make another dimension: z = a·x² + b·y² (axes: z vs x)
  • 176. Support Vector Machine (SVM) (axes: y vs x, 0-254) Let's make another dimension: z = a·x² + b·y² (axes: z vs x) This transformation is called the kernel trick and the function z is the kernel
  • 177. Support Vector Machine (SVM) (axes: y vs x, 0-254) Let's make another dimension: z = a·x² + b·y² (axes: z vs x) This transformation is called the kernel trick and the function z is the kernel Wow, wow, wow, hold on! How does this actually work?
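A minimal sketch of the idea (my code; the ring-shaped toy data and the choice a = b = 1 are mine, not the slides'): points that are not linearly separable in (x, y) become separable once the dimension z = a·x² + b·y² is added, and a kernel lets the SVM get the same effect without building z explicitly.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    angles = rng.uniform(0, 2 * np.pi, 200)
    inner = np.c_[np.cos(angles[:100]), np.sin(angles[:100])] * 1.0   # class 0: small circle
    outer = np.c_[np.cos(angles[100:]), np.sin(angles[100:])] * 3.0   # class 1: big circle
    X = np.vstack([inner, outer])
    y = np.array([0] * 100 + [1] * 100)

    a, b = 1.0, 1.0
    z = a * X[:, 0] ** 2 + b * X[:, 1] ** 2     # the extra dimension from the slide
    X3 = np.c_[X, z]

    print(SVC(kernel="linear").fit(X, y).score(X, y))    # poor: not linearly separable in 2-D
    print(SVC(kernel="linear").fit(X3, y).score(X3, y))  # ~1.0: separable once z is added
    print(SVC(kernel="rbf").fit(X, y).score(X, y))       # a kernel achieves this implicitly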
  • 178. For each test example Instance Label 3 1. Compute the pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality
  • 179. For each test example Instance Label 3 1. Compute the pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality
  • 180. For each test example Instance Label 3 1. Compute the pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality
  • 181. For each test example Instance Label 3 1. Compute the pixel-wise distance to all training examples 2. Find the closest training example 3. Report its label Advantages of NN Disadvantages of NN Fast training time O(C) Very easy to implement Very slow classification time Suffers from the curse of dimensionality Could be a good choice for low-dimensional problems Comparison with SVM Disadvantages of SVM Very slow classification time Suffers from the curse of dimensionality It might be tricky to choose the right kernel
  • 183. Q: How would you approach a multiclass classification task using SVM?
  • 184. Q: How would you approach a multiclass classification task using SVM? (axes: Pixel #215 vs Pixel #213, 0-254)
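One common answer (my sketch, not the slide's): train one binary SVM per class ("one-vs-rest") and pick the class whose SVM is most confident; scikit-learn's SVC also handles multiclass problems out of the box (internally one-vs-one).

    from sklearn.datasets import load_digits
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)     # 10 classes: digits 0-9

    ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
    print(ovr.predict(X[:5]), y[:5])        # one binary SVM per digit behind the scenes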
  • 188. Support Vector Machine (SVM) 100% accurate!
  • 190. accuracy = correctly classified instances / total number of instances 100% accurate!
  • 191. Can we trust this model? 100% accurate!
  • 192. Can we trust this model? Consider the following example: 100% accurate!
  • 193. Can we trust this model? Consider the following example: Whatever happens, predict 0 100% accurate!
  • 194. Can we trust this model? Consider the following example: Whatever happens, predict 0 Accuracy = 49/50 100% accurate!
  • 195. Can we trust this model? Consider the following example: Whatever happens, predict 0 Accuracy = 98% 100% accurate!
  • 196. Can we trust this model? Consider the following example: A histogram of class counts could help you figure out if your dataset is unbalanced 100% accurate!
  • 197. Can we trust this model? Consider the following example: What if my data is unbalanced? A histogram of class counts could help you figure out if your dataset is unbalanced 100% accurate!
  • 198. Can we trust this model? Consider the following example: There are a few ways, we are going to discuss them later What if my data is unbalanced? A histogram of class counts could help you figure out if your dataset is unbalanced 100% accurate!
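A minimal sketch of the point above (the 49-out-of-50 numbers come from the slide; the rest is illustrative): a classifier that always predicts 0 looks 98% accurate on unbalanced labels, and a quick class count, the "histogram", reveals why that number means little.

    import numpy as np

    y_true = np.array([0] * 49 + [1])     # 49 instances of class 0, 1 instance of class 1
    y_pred = np.zeros_like(y_true)        # "whatever happens, predict 0"

    accuracy = (y_pred == y_true).mean()
    print(accuracy)                        # 0.98, yet the model is useless for class 1
    print(np.bincount(y_true))             # [49  1]: the dataset is unbalanced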
  • 199. Can we trust this model? In our case data is balanced: 100% accurate!
  • 200. 100% accurate! Can we trust this model? We have balanced data:
  • 201. 100% accurate! Can we trust this model? We have balanced data:
  • 202. 100% accurate! Can we trust this model? We have balanced data:
  • 203. 100% accurate! Can we trust this model? We have balanced data: 😒
  • 205. Training the model Feature#2 Feature #1 Let’s add more examples
  • 207. Training the model Still linearly separable Feature#2 Feature #1
  • 208. Still linearly separable Training the model Feature#2 Feature #1
  • 211. Feature#2 Feature #1 Training the model Feature#2 Feature #1
  • 212. Simple; not a perfect fit Complicated; ideal fit Which model should we use? Training the model Feature#2 Feature #1 Feature#2 Feature #1
  • 213. Simple; not a perfect fit Complicated; ideal fit Training the model Which model should we use? Feature#2 Feature #1 Feature#2 Feature #1
  • 214. Simple; not a perfect fit Complicated; ideal fit Training the model Which model should we use? Feature#2 Feature #1 Feature#2 Feature #1
  • 215. Simple; not a perfect fit Complicated; ideal fit Training the model Which model should we use? Feature#2 Feature #1 Feature#2 Feature #1
  • 217. So, what happened? Too general model Just right! Overfitting 100% accurate!
  • 218. So, what happened? Too general model Just right! Overfitting We should split our data into train and test sets 100% accurate!
  • 219. Split into train and test
  • 220. Split into train and test Normally we would split data into 80% train and 20% test sets
  • 221. Split into train and test Normally we would split data into 80% train and 20% test sets As we have a lot of data we can afford 50/50 ratio
  • 222. Split into train and test Can we do better than 90%? Normally we would split data into 80% train and 20% test sets As we have a lot of data we can afford 50/50 ratio
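A minimal sketch of the split described above (my code, assuming scikit-learn; the digits dataset is a stand-in): hold part of the data out so the model is evaluated on examples it has never seen. test_size=0.2 gives the usual 80/20 split; with plenty of data, 0.5 is also an option, as the slide notes.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = SVC().fit(X_train, y_train)      # train only on the training portion
    print(clf.score(X_test, y_test))       # accuracy on unseen data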
  • 227. (axes: Pixel #215 vs Pixel #213, 0-254) In red are areas where a penalty is applied to instances close to the line C = 1
  • 228. (axes: Pixel #215 vs Pixel #213, 0-254) In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied C = 1
  • 229. (axes: Pixel #215 vs Pixel #213, 0-254) In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied The total amount of penalty applied to the classifier is called the loss Classifiers try to minimise the loss by adjusting their parameters C = 1
  • 230. (axes: Pixel #215 vs Pixel #213, 0-254) In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied The total amount of penalty applied to the classifier is called the loss Classifiers try to minimise the loss by adjusting their parameters C = 1 This instance increases the penalty
  • 231. (axes: Pixel #215 vs Pixel #213, 0-254) The total amount of penalty applied to the classifier is called the loss Classifiers try to minimise the loss by adjusting their parameters Now it is in a green area In red are areas where a penalty is applied to instances close to the line In green are areas where no penalty is applied C = 1
  • 232. Parameter tuning. Algorithm and its hyper-parameters: K-nearest neighbour: K, the number of neighbours (1, …, 100). Decision Tree: metric ('gini', 'information gain'). Random Forest: number of trees (3, …, 100; more is usually better) and metric ('gini', 'information gain'). SVM: C (10⁻⁵, …, 10²) and gamma (10⁻¹⁵, …, 10²)
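The same table written as search spaces (my sketch; the parameter names follow scikit-learn's conventions, and the ranges are the ones from the slide, reasonable defaults rather than hard rules).

    import numpy as np

    search_spaces = {
        "KNeighborsClassifier":   {"n_neighbors": list(range(1, 101))},
        "DecisionTreeClassifier": {"criterion": ["gini", "entropy"]},    # "entropy" = information gain
        "RandomForestClassifier": {"n_estimators": list(range(3, 101)),  # more trees usually helps
                                   "criterion": ["gini", "entropy"]},
        "SVC":                    {"C": np.logspace(-5, 2, 8),           # 1e-5 ... 1e2
                                   "gamma": np.logspace(-15, 2, 18)},    # 1e-15 ... 1e2
    }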
  • 233.
  • 234. Let’s try different C maybe our score will improve
  • 235. Let’s try different C maybe our score will improve Nope…
  • 236. Let’s try different C maybe our score will improve Fail again…
  • 237. Let’s try different C maybe our score will improve It is getting depressive…
  • 238. Let’s try different C maybe our score will improve Hurrah!
  • 239. Let’s try different C maybe our score will improve Hurrah! You may not have noticed but…
  • 240. Let’s try different C maybe our score will improve Hurrah! You may not have noticed but… We are overfitting again…
  • 243. Training 60% For fitting initial model The whole dataset 100%
  • 244. Training 60% For fitting initial model Validation 20% The whole dataset 100%
  • 245. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  • 246. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  • 247. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  • 248. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 5/7
  • 249. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7
  • 250. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7 Testing 20%
  • 251. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7 Testing 20% For one shot evaluation of trained model 5/5
  • 252. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation 7/7 Testing 20% For one shot evaluation of trained model 5/5 But what happens when you overfit the validation set?
  • 253. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one shot evaluation of trained model 5/5 You're doing great! 🙂
  • 254. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one shot evaluation of trained model 5/5 You're doing great! 🙂
  • 255. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one shot evaluation of trained model 4/5 You're doing great! 🙂
  • 256. The whole dataset 100% Training 60% For fitting initial model Validation 20% For parameter tuning & performance evaluation Testing 20% For one shot evaluation of trained model 4/5 You're doing great! 🙂 😒
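A minimal sketch of the 60/20/20 split above (my code, assuming scikit-learn; the list of C values is illustrative): fit on the training part, tune on the validation part, and touch the test set only once at the very end.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)

    # First put 20% aside as the untouched test set ...
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    # ... then split the remaining 80% into 60% train / 20% validation (0.25 * 0.8 = 0.2).
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    best_C, best_score = None, -1.0
    for C in [0.01, 0.1, 1, 10, 100]:        # tune on the validation set only
        score = SVC(C=C).fit(X_train, y_train).score(X_val, y_val)
        if score > best_score:
            best_C, best_score = C, score

    final = SVC(C=best_C).fit(X_train, y_train)
    print(final.score(X_test, y_test))        # the one-shot evaluation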
  • 257. The whole dataset 100% Cross Validation (CV) Algorithm
  • 258. Training data 80% Cross Validation (CV) Algorithm Test 20%
  • 259. Training data 80% Cross Validation (CV) Algorithm
  • 260. 20%20%20% 20% Training data 80% Cross Validation (CV) Algorithm
  • 261. Training data 80% Cross Validation (CV) Algorithm 20%20%20% 20% Train on 60% of data Validate on 20% 20%20%20% 20%
  • 262. 20%20%20% 20% Training data 80% Cross Validation (CV) Algorithm TrainTrainTrain Val Train on 60% of data Validate on 20%
  • 263. Cross Validation (CV) Algorithm 0.75 20%20%20% 20% Training data 80% TrainTrainTrain Val Train on 60% of data Validate on 20%
  • 264. Cross Validation (CV) Algorithm 0.75 ValTrainTrain Train 0.85 20%20%20% 20% Training data 80% TrainTrainTrain Val
  • 265. 20%20%20% 20% Training data 80% Cross Validation (CV) Algorithm 0.75 ValTrainTrain Train 0.85 TrainTrainTrain Val TrainValTrain Train 0.91
  • 266. Cross Validation (CV) Algorithm 0.75 0.85 TrainTrainVal Train 0.91 0.68 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train
  • 267. TrainTrainVal Train 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train Cross Validation (CV) Algorithm 0.75 0.85 0.91 0.68 MEAN (0.75, 0.85, 0.91, 0.68) = ?
  • 268. TrainTrainVal Train 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train Cross Validation (CV) Algorithm 0.75 0.85 0.91 0.68 MEAN (0.75, 0.85, 0.91, 0.68) ≈ 0.80
  • 269. TrainTrainVal Train 20%20%20% 20% Training data 80% ValTrainTrain Train TrainTrainTrain Val TrainValTrain Train Cross Validation (CV) Algorithm 0.75 0.85 0.91 0.68 MEAN (0.75, 0.85, 0.91, 0.68) ≈ 0.80 Choose the best model/parameters based on this estimate and then apply it to the test set
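A minimal sketch of the 4-fold procedure drawn above (my code, assuming scikit-learn; the fold scores on the slides are illustrative, so this just shows the mechanics): split the training data into folds, rotate which fold plays the validation role, and average the scores.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    scores = cross_val_score(SVC(C=1), X_train, y_train, cv=4)  # 4 folds = 4 validation scores
    print(scores, scores.mean())   # pick the model/parameters with the best mean CV score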
  • 272. Raw Data Preprocessing Machine Learning pipeline
  • 274. Raw Data Preprocessing Feature extraction Split into train & test Machine Learning pipeline
  • 275. Raw Data Preprocessing Feature extraction Split into train & test test set Machine Learning pipeline
  • 276. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Machine Learning pipeline
  • 277. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Machine Learning pipeline
  • 278. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Machine Learning pipeline
  • 279. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Machine Learning pipeline
  • 280. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Evaluate final model on the test set test set Machine Learning pipeline
  • 281. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Evaluate final model on the test set test set Machine Learning pipeline Report your results
  • 282. Raw Data Preprocessing Feature extraction Split into train & test test set Choose a model Find best parameters using CV Train the model on the whole training set Evaluate final model on the test set test set Machine Learning pipeline Report your results Problem
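A minimal end-to-end sketch of the pipeline above (my code, assuming scikit-learn; the scaling step and the parameter grid are illustrative choices, not the slides'): preprocess, split, choose a model, tune it with cross-validation on the training set, refit on the whole training set, and evaluate once on the held-out test set.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Raw data, preprocessing and feature extraction are already done for this toy dataset.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    pipe = Pipeline([("scale", StandardScaler()),    # preprocessing
                     ("svm", SVC())])                # chosen model
    grid = GridSearchCV(pipe,
                        {"svm__C": [0.1, 1, 10, 100],
                         "svm__gamma": [1e-4, 1e-3, 1e-2]},
                        cv=5)                        # find the best parameters using CV
    grid.fit(X_train, y_train)                       # then refit the best model on all training data

    print(grid.best_params_)
    print(grid.score(X_test, y_test))                # evaluate the final model on the test set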
  • 283. A machine learning algorithm usually corresponds to a combination of the following 3 elements: The choice of a specific mapping function family F (K-NN, SVM, DT, RF, Neural Networks etc.).
  • 284. A machine learning algorithm usually corresponds to a combination of the following 3 elements: A way to evaluate the quality of a function f from F: a way of saying how badly or well this function f is doing at classifying real-world objects. The choice of a specific mapping function family F (K-NN, SVM, DT, RF, Neural Networks etc.).
  • 285. A machine learning algorithm usually corresponds to a combination of the following 3 elements: A way to search for a better function f in F: how to choose parameters so that the performance of f improves. A way to evaluate the quality of a function f from F: a way of saying how badly or well this function f is doing at classifying real-world objects. The choice of a specific mapping function family F (K-NN, SVM, DT, RF, Neural Networks etc.).
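One concrete reading of those three elements (my mapping, for illustration only): scikit-learn's SGDClassifier makes them explicit, with a linear function family, a loss that scores how badly a candidate function does, and stochastic gradient descent as the search for a better one.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier

    X, y = load_digits(return_X_y=True)
    clf = SGDClassifier(loss="hinge",     # (2) how we measure how bad f is (hinge = linear SVM loss)
                        penalty="l2",
                        max_iter=1000)    # (3) SGD iterations: the search for a better f
    clf.fit(X, y)                         # (1) the family F here: linear functions of the pixels
    print(clf.score(X, y))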
  • 287.
  • 288. References • Machine Learning by Andrew Ng (https://www.coursera.org/learn/machine-learning) • Introduction to Machine Learning by Pascal Vincent given at Deep Learning Summer School, Montreal 2015 (http://videolectures.net/deeplearning2015_vincent_machine_learning/) • Welcome to Machine Learning by Konstantin Tretyakov delivered at AACIMP Summer School 2015 (http://kt.era.ee/lectures/aacimp2015/1-intro.pdf) • Stanford CS class: Convolutional Neural Networks for Visual Recognition by Andrej Karpathy (http://cs231n.github.io/) • Data Mining Course by Jaak Vilo at University of Tartu (https://courses.cs.ut.ee/MTAT.03.183/2017_spring/uploads/Main/DM_05_Clustering.pdf) • Machine Learning Essential Concepts by Ilya Kuzovkin (https://www.slideshare.net/iljakuzovkin) • From the brain to deep learning and back by Raul Vicente Zafra and Ilya Kuzovkin (http://www.uttv.ee/naita?id=23585&keel=eng)
  • 290.