A short introduction to statistical learning 
Nathalie Villa-Vialaneix 
nathalie.villa@toulouse.inra.fr 
http://www.nathalievilla.org 
Axe “Apprentissage et Processus” 
October 15th, 2014 - Unité MIA-T, INRA, Toulouse 
Outline 
1 Introduction 
Background and notations 
Underfitting / Overfitting 
Consistency 
2 SVM 
Background
Purpose: predict $Y$ from $X$;
What we have: $n$ observations of $(X, Y)$: $(x_1, y_1), \dots, (x_n, y_n)$;
What we want: estimate unknown $Y$ from new $X$: $x_{n+1}, \dots, x_m$.
$X$ can be:
- numeric variables;
- or factors;
- or a combination of numeric variables and factors.
$Y$ can be:
- a numeric variable ($Y \in \mathbb{R}$) $\Rightarrow$ (supervised) regression [régression];
- a factor $\Rightarrow$ (supervised) classification [discrimination].
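In R, the class of the target variable already announces which of the two tasks one is in; a minimal sketch on two standard base-R datasets (chosen here purely for illustration):

data(trees); class(trees$Volume)   # "numeric" -> (supervised) regression
data(iris);  class(iris$Species)   # "factor"  -> (supervised) classification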
Basics
From $(x_i, y_i)_i$, definition of a machine $\Phi_n$ s.t.:
$$\hat{y}_{\text{new}} = \Phi_n(x_{\text{new}})$$
- if $Y$ is numeric, $\Phi_n$ is called a regression function [fonction de régression];
- if $Y$ is a factor, $\Phi_n$ is called a classifier [classifieur];
- $\Phi_n$ is said to be trained or learned from the observations $(x_i, y_i)_i$.
Desirable properties
- accuracy to the observations: predictions made on known data are close to observed values;
- generalization ability: predictions made on new data are also accurate.
Conflicting objectives!!
Underfitting/Overfitting [sous/sur-apprentissage]
[Sequence of figures, one per slide; a numerical illustration follows below:]
- Function $x \to y$ to be estimated
- Observations we might have
- Observations we do have
- First estimation from the observations: underfitting
- Second estimation from the observations: accurate estimation
- Third estimation from the observations: overfitting
- Summary
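The sequence of figures can be reproduced numerically; a minimal R sketch (the noisy sine curve and the polynomial degrees are assumptions for illustration, not the slides' actual figures):

set.seed(42)
x <- runif(30, 0, 1)
y <- sin(2 * pi * x) + rnorm(30, sd = 0.3)   # noisy observations of an unknown function
f1 <- lm(y ~ x)                  # degree 1: too simple, underfits
f2 <- lm(y ~ poly(x, 4))         # degree 4: close to the true function
f3 <- lm(y ~ poly(x, 20))        # degree 20: chases the noise, overfits
plot(x, y, pch = 19)
xs <- seq(0, 1, length.out = 200)
lines(xs, predict(f1, data.frame(x = xs)), col = 2)
lines(xs, predict(f2, data.frame(x = xs)), col = 3)
lines(xs, predict(f3, data.frame(x = xs)), col = 4)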
Errors
training error (measures the accuracy to the observations)
- if $y$ is a factor: misclassification rate
  $$\frac{\sharp\{\hat{y}_i \neq y_i,\ i = 1, \dots, n\}}{n}$$
- if $y$ is numeric: mean square error (MSE)
  $$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
  or root mean square error (RMSE) or pseudo-$R^2$: $1 - \mathrm{MSE}/\mathrm{Var}((y_i)_i)$
test error: a way to prevent overfitting (estimates the generalization error) is the simple validation (see the sketch below):
1. split the data into training/test sets (usually 80%/20%);
2. train $\Phi_n$ from the training dataset;
3. calculate the test error from the remaining data.
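A minimal R sketch of the simple validation scheme with the 80%/20% split from the slide; the dataset and the linear model are illustrative assumptions:

set.seed(1)
data(trees)
n <- nrow(trees)
train <- sample(n, round(0.8 * n))                 # 80% training / 20% test
fit <- lm(Volume ~ Girth + Height, data = trees[train, ])
pred <- predict(fit, newdata = trees[-train, ])
obs <- trees$Volume[-train]
mse <- mean((pred - obs)^2)                        # test MSE
c(MSE = mse, RMSE = sqrt(mse), pseudoR2 = 1 - mse / var(obs))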
Example
[Sequence of figures: Observations; Training/Test datasets; Training/Test errors; Summary]
Consistency in the parametric/non parametric case
Example in the parametric framework (linear methods)
- an assumption is made on the form of the relation between $X$ and $Y$:
  $$Y = \beta^T X + \epsilon$$
- $\beta$ is estimated from the observations $(x_1, y_1), \dots, (x_n, y_n)$ by a given method which calculates a $\hat{\beta}_n$.
The estimation is said to be consistent if $\hat{\beta}_n \xrightarrow{n \to +\infty} \beta$, under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.
Example in the nonparametric framework
- the form of the relation between $X$ and $Y$ is unknown:
  $$Y = \Phi(X) + \epsilon$$
- $\Phi$ is estimated from the observations $(x_1, y_1), \dots, (x_n, y_n)$ by a given method which calculates a $\Phi_n$.
The estimation is said to be consistent if $\Phi_n \xrightarrow{n \to +\infty} \Phi$, under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.
Consistency from the statistical learning perspective [Vapnik, 1995]
Question: Are we really interested in estimating $\Phi$ or... rather in having the smallest prediction error?
Statistical learning perspective: a method that builds a machine $\Phi_n$ from the observations is said to be (universally) consistent if, given a risk function $R : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which calculates an error),
$$\mathbb{E}\left(R(\Phi_n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right),$$
for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.
Definitions: $L^* = \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right)$ and $L_\Phi = \mathbb{E}\left(R(\Phi(X), Y)\right)$.
Desirable properties from a mathematical perspective
Simplified framework: $X \in \mathcal{X}$ and $Y \in \{-1, 1\}$ (binary classification)
Learning process: choose a machine $\Phi_n$ in a class of functions $\mathcal{C} \subset \{\Phi : \mathcal{X} \to \mathbb{R}\}$ (e.g., $\mathcal{C}$ is the set of all functions that can be built using an SVM).
Error decomposition
$$L_{\Phi_n} - L^* \leq \left( L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \right) + \left( \inf_{\Phi \in \mathcal{C}} L_\Phi - L^* \right)$$
with
- $\inf_{\Phi \in \mathcal{C}} L_\Phi - L^*$ is the richness of $\mathcal{C}$ (i.e., $\mathcal{C}$ must be rich to ensure that this term is small);
- $L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \leq 2 \sup_{\Phi \in \mathcal{C}} |\hat{L}_n^\Phi - L_\Phi|$, where $\hat{L}_n^\Phi = \frac{1}{n} \sum_{i=1}^{n} R(\Phi(x_i), y_i)$, is the generalization capability of $\mathcal{C}$ (i.e., in the worst case, the empirical error must be close to the true error: $\mathcal{C}$ must not be too rich to ensure that this term is small).
Outline 
1 Introduction 
Background and notations 
Underfitting / Overfitting 
Consistency 
2 SVM 
Basic introduction
Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1, 1\}$
A training set is given: $(x_1, y_1), \dots, (x_n, y_n)$
SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002].
Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
Optimal margin classification
[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, support vectors highlighted]
$w$ is chosen such that:
- $\min_w \|w\|^2$ (the margin is the largest),
- under the constraints: $y_i(\langle w, x_i \rangle + b) \geq 1$, $1 \leq i \leq n$ (the separation between the two classes is perfect).
$\Rightarrow$ ensures a good generalization capability.
Soft margin classification
[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, support vectors and slack variables]
$w$ is chosen such that:
- $\min_{w, \xi} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ (the margin is the largest),
- under the constraints: $y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes is almost perfect).
$\Rightarrow$ allowing a few errors improves the richness of the class.
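The trade-off driven by $C$ can be seen directly in software; a minimal sketch with the e1071 package (used later in these slides), on a two-class subset of iris chosen here only for illustration:

library(e1071)
d <- iris[iris$Species %in% c("versicolor", "virginica"), ]
d$Species <- droplevels(d$Species)
# small C: slack is cheap, wide margin -> typically many support vectors
# large C: errors heavily penalized  -> typically fewer support vectors
for (C in c(0.01, 1, 100)) {
  fit <- svm(Species ~ ., data = d, kernel = "linear", cost = C)
  cat("cost =", C, "-> number of support vectors:", fit$tot.nSV, "\n")
}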
Non linear SVM
[Figure: original space $\mathcal{X}$ mapped to feature space $\mathcal{H}$ by a (non linear) map $\Psi$]
$w \in \mathcal{H}$ is chosen such that $(P_{C,\mathcal{H}})$:
- $\min_{w, \xi} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i$ (the margin in the feature space is the largest),
- under the constraints: $y_i(\langle w, \Psi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes in the feature space is almost perfect).
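A concrete feature map helps to see what $\Psi$ does; a minimal R sketch for the degree-2 polynomial kernel on $\mathbb{R}^2$ (this kernel is an assumption for illustration, not one used in the slides), whose explicit feature map is $\Psi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$:

Psi <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
u <- c(1, 2); v <- c(3, -1)
(sum(u * v))^2           # kernel evaluation in the original space: (u'v)^2
sum(Psi(u) * Psi(v))     # dot product in the feature space: same value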
SVM from different points of view
A regularization problem: $(P_{C,\mathcal{H}})$ is equivalent to
$$(P_{\lambda,\mathcal{H}}): \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n} \sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|_{\mathcal{H}}^2}_{\text{penalization term}},$$
where $f_w(x) = \langle \Psi(x), w \rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (hinge loss function).
[Figure: errors versus $\hat{y}$ for $y = 1$; blue: hinge loss; green: misclassification error.]
A dual problem: $(P_{C,\mathcal{H}})$ is equivalent to
$$(D_{C,\mathcal{X}}): \max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
with $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$, where $K(x_i, x_j) = \langle \Psi(x_i), \Psi(x_j) \rangle_{\mathcal{H}}$.
There is no need to know $\Psi$ and $\mathcal{H}$:
- choose a function $K$ with a few good properties;
- use it as the dot product in $\mathcal{H}$: $\forall\, u, v \in \mathcal{X}$, $K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$.
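The figure just described (hinge loss against the 0/1 loss) can be redrawn in a few lines; a minimal R sketch:

yhat <- seq(-2, 2, length.out = 200)   # predicted score, true label y = 1
hinge <- pmax(0, 1 - yhat)             # hinge loss: max(0, 1 - yhat * y)
zeroone <- as.numeric(yhat < 0)        # misclassification error (0/1 loss)
plot(yhat, hinge, type = "l", col = "blue", xlab = expression(hat(y)), ylab = "loss")
lines(yhat, zeroone, col = "green")
legend("topright", lty = 1, col = c("blue", "green"),
       legend = c("hinge loss", "misclassification error"))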
Which kernels?
Minimum properties that a kernel should fulfil:
- symmetry: $K(u, u') = K(u', u)$;
- positivity: $\forall N \in \mathbb{N}$, $\forall (\alpha_i) \subset \mathbb{R}^N$, $\forall (x_i) \subset \mathcal{X}^N$, $\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \geq 0$.
[Aronszajn, 1950]: there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ and a function $\Psi : \mathcal{X} \to \mathcal{H}$ such that:
$$\forall\, u, v \in \mathcal{X}, \quad K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$$
Examples
- the Gaussian kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = e^{-\gamma \|x - x'\|^2}$ (it is universal on every bounded subset of $\mathbb{R}^d$);
- the linear kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = x^T x'$ (it is not universal).
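Both properties can be checked numerically on any sample; a minimal R sketch for the Gaussian kernel (the value $\gamma = 0.5$ and the random points are arbitrary choices for illustration):

gamma <- 0.5
gk <- function(x1, x2) exp(-gamma * sum((x1 - x2)^2))   # Gaussian kernel
X <- matrix(rnorm(20), ncol = 2)                        # 10 random points in R^2
K <- outer(1:10, 1:10, Vectorize(function(i, j) gk(X[i, ], X[j, ])))
all.equal(K, t(K))                         # symmetry
min(eigen(K, symmetric = TRUE)$values)     # >= 0 up to rounding: positivity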
In summary, what does the solution look like?
$$\Phi_n(x) = \sum_i \alpha_i y_i K(x_i, x)$$
where only a few $\alpha_i \neq 0$. The $x_i$ such that $\alpha_i \neq 0$ are the support vectors!
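A minimal sketch with e1071 showing where the $\alpha_i y_i$ and the support vectors live in a fitted object; the two-class iris subset is illustrative, and note that the sign of the decision values depends on e1071's factor level ordering:

library(e1071)
d <- iris[iris$Species %in% c("versicolor", "virginica"), ]
d$Species <- droplevels(d$Species)
fit <- svm(Species ~ ., data = d, kernel = "linear", scale = FALSE)
fit$index        # which observations are the support vectors
head(fit$coefs)  # the alpha_i * y_i of the expansion (support vectors only)
# decision values recomputed from the expansion sum_i alpha_i y_i <x_i, x> - rho:
X <- as.matrix(d[, 1:4])
dec <- X %*% t(fit$SV) %*% fit$coefs - fit$rho
head(cbind(by_hand = dec,
           by_predict = attr(predict(fit, d, decision.values = TRUE),
                             "decision.values")))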
I'm almost dead with all this stuff on my mind!!! What in practice?

data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))
library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear", cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#    cost
#     0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4), kernel = "linear")
# Parameters:
#    SVM-Type: C-classification
#  SVM-Kernel: linear
#        cost: 0.5
#       gamma: 0.25
# Number of Support Vectors: 21
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1), cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#    gamma cost
#      0.5    4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1), cost = 2^(2:4))
# Parameters:
#    SVM-Type: C-classification
#  SVM-Kernel: radial
#        cost: 4
#       gamma: 0.5
# Number of Support Vectors: 32
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

... and more can be found on my website: http://nathalievilla.org/learning.html