CS229 Lecture notes 
Andrew Ng 
Part VII 
Regularization and model 
selection 
Suppose we are trying to select among several different models for a learning 
problem. For instance, we might be using a polynomial regression model 
h(x) = g(θ0 + θ1x + θ2x^2 + · · · + θkx^k), and wish to decide if k should be 
0, 1, . . . , or 10. How can we automatically select a model that represents 
a good tradeoff between the twin evils of bias and variance1? Alternatively, 
suppose we want to automatically choose the bandwidth parameter τ for 
locally weighted regression, or the parameter C for our ℓ1-regularized SVM. 
How can we do that? 
For the sake of concreteness, in these notes we assume we have some 
finite set of models M = {M1, . . . ,Md} that we’re trying to select among. 
For instance, in our first example above, the model Mi would be an i-th 
order polynomial regression model. (The generalization to infinite M is not 
hard.2) Alternatively, if we are trying to decide between using an SVM, a 
neural network or logistic regression, then M may contain these models. 
1Given that we said in the previous set of notes that bias and variance are two very 
different beasts, some readers may be wondering if we should be calling them “twin” evils 
here. Perhaps it’d be better to think of them as non-identical twins. The phrase “the 
fraternal twin evils of bias and variance” doesn’t have the same ring to it, though. 
2If we are trying to choose from an infinite set of models, say corresponding to the 
possible values of the bandwidth τ ∈ R+, we may discretize τ and consider only a finite 
number of possible values for it. More generally, most of the algorithms described here 
can all be viewed as performing optimization search in the space of models, and we can 
perform this search over infinite model classes as well. 
1 Cross validation 
Let’s suppose we are, as usual, given a training set S. Given what we know 
about empirical risk minimization, here's what might initially seem like an 
algorithm, resulting from using empirical risk minimization for model selection: 
1. Train each model Mi on S, to get some hypothesis hi. 
2. Pick the hypothesis with the smallest training error. 
This algorithm does not work. Consider choosing the order of a poly- 
nomial. The higher the order of the polynomial, the better it will fit the 
training set S, and thus the lower the training error. Hence, this method will 
always select a high-variance, high-degree polynomial model, which we saw 
previously is often a poor choice. 
Here’s an algorithm that works better. In hold-out cross validation 
(also called simple cross validation), we do the following: 
1. Randomly split S into Strain (say, 70% of the data) and Scv (the remain- 
ing 30%). Here, Scv is called the hold-out cross validation set. 
2. Train each model Mi on Strain only, to get some hypothesis hi. 
3. Select and output the hypothesis hi that had the smallest error ˆεScv(hi) 
on the hold out cross validation set. (Recall, ˆεScv(h) denotes the empir- 
ical error of h on the set of examples in Scv.) 
By testing on a set of examples Scv that the models were not trained on, 
we obtain a better estimate of each hypothesis hi’s true generalization error, 
and can then pick the one with the smallest estimated generalization error. 
Usually, somewhere between 1/4 and 1/3 of the data is used in the hold-out 
cross validation set, and 30% is a typical choice. 
Optionally, step 3 in the algorithm may also be replaced with selecting 
the model Mi according to argmini ˆεScv(hi), and then retraining Mi on the 
entire training set S. (This is often a good idea, with one exception being 
learning algorithms that are very sensitive to perturbations of the initial 
conditions and/or data. For these methods, Mi doing well on Strain does not 
necessarily mean it will also do well on Scv, and it might be better to forgo 
this retraining step.) 
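As a concrete illustration, here is a minimal Python/numpy sketch of hold-out cross validation, including the optional retraining step. The model objects, their fit/predict interface, and the use of 0/1 classification error for ˆε are illustrative assumptions rather than something specified in the notes; for a regression problem you would substitute squared error.

```python
import numpy as np

def holdout_select(models, X, y, train_frac=0.7, seed=0):
    """Hold-out cross validation: split S, train each model on S_train,
    pick the model with the smallest empirical error on S_cv, then
    (optionally) retrain the winner on all of S."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    perm = rng.permutation(m)
    n_train = int(train_frac * m)
    train_idx, cv_idx = perm[:n_train], perm[n_train:]

    cv_errors = []
    for model in models:                       # each model M_i is assumed to
        model.fit(X[train_idx], y[train_idx])  # expose fit(X, y) and predict(X)
        preds = model.predict(X[cv_idx])
        cv_errors.append(np.mean(preds != y[cv_idx]))  # 0/1 error on S_cv

    best = models[int(np.argmin(cv_errors))]
    best.fit(X, y)                             # optional retraining on all of S
    return best, cv_errors
```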
The disadvantage of using hold out cross validation is that it “wastes” 
about 30% of the data. Even if we were to take the optional step of retraining
the model on the entire training set, it’s still as if we’re trying to find a good 
model for a learning problem in which we had 0.7m training examples, rather 
than m training examples, since we’re testing models that were trained on 
only 0.7m examples each time. While this is fine if data is abundant and/or 
cheap, in learning problems in which data is scarce (consider a problem with 
m = 20, say), we’d like to do something better. 
Here is a method, called k-fold cross validation, that holds out less 
data each time: 
1. Randomly split S into k disjoint subsets of m/k training examples each. 
Let’s call these subsets S1, . . . , Sk. 
2. For each model Mi, we evaluate it as follows: 
For j = 1, . . . , k 
Train the model Mi on S1 ∪ · · · ∪ Sj−1 ∪ Sj+1 ∪ · · · ∪ Sk (i.e., train 
on all the data except Sj) to get some hypothesis hij . 
Test the hypothesis hij on Sj , to get ˆεSj (hij). 
The estimated generalization error of model Mi is then calculated 
as the average of the ˆεSj (hij)’s (averaged over j). 
3. Pick the model Mi with the lowest estimated generalization error, and 
retrain that model on the entire training set S. The resulting hypothesis 
is then output as our final answer. 
A typical choice for the number of folds to use here would be k = 10. 
While the fraction of data held out each time is now 1/k—much smaller 
than before—this procedure may also be more computationally expensive 
than hold-out cross validation, since we now need to train each model k 
times. 
While k = 10 is a commonly used choice, in problems in which data is 
really scarce, sometimes we will use the extreme choice of k = m in order 
to leave out as little data as possible each time. In this setting, we would 
repeatedly train on all but one of the training examples in S, and test on that 
held-out example. The resulting m = k errors are then averaged together to 
obtain our estimate of the generalization error of a model. This method has 
its own name; since we’re holding out one training example at a time, this 
method is called leave-one-out cross validation. 
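A sketch of k-fold cross validation under the same assumptions as before (fit/predict model objects, 0/1 error) is given below; setting k = m reproduces leave-one-out cross validation.

```python
import numpy as np

def kfold_select(models, X, y, k=10, seed=0):
    """k-fold cross validation for model selection; k = m gives leave-one-out."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    folds = np.array_split(rng.permutation(m), k)      # disjoint subsets S_1, ..., S_k

    cv_errors = []
    for model in models:
        fold_errors = []
        for j in range(k):
            test_idx = folds[j]
            train_idx = np.concatenate([folds[l] for l in range(k) if l != j])
            model.fit(X[train_idx], y[train_idx])      # train on everything except S_j
            preds = model.predict(X[test_idx])
            fold_errors.append(np.mean(preds != y[test_idx]))
        cv_errors.append(np.mean(fold_errors))         # average of the k fold errors

    best = models[int(np.argmin(cv_errors))]
    best.fit(X, y)                                     # retrain the winner on all of S
    return best, cv_errors
```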
Finally, even though we have described the different versions of cross vali- 
dation as methods for selecting a model, they can also be used more simply to 
evaluate a single model or algorithm. For example, if you have implemented
some learning algorithm and want to estimate how well it performs for your 
application (or if you have invented a novel learning algorithm and want to 
report in a technical paper how well it performs on various test sets), cross 
validation would give a reasonable way of doing so. 
2 Feature Selection 
One special and important case of model selection is called feature selection. 
To motivate this, imagine that you have a supervised learning problem where 
the number of features n is very large (perhaps n ≫ m), but you suspect that 
there is only a small number of features that are “relevant” to the learning 
task. Even if you use a simple linear classifier (such as the perceptron) over 
the n input features, the VC dimension of your hypothesis class would still be 
O(n), and thus overfitting would be a potential problem unless the training 
set is fairly large. 
In such a setting, you can apply a feature selection algorithm to reduce the 
number of features. Given n features, there are 2^n possible feature subsets 
(since each of the n features can either be included or excluded from the 
subset), and thus feature selection can be posed as a model selection problem 
over 2^n possible models. For large values of n, it's usually too expensive to 
explicitly enumerate over and compare all 2^n models, and so typically some 
heuristic search procedure is used to find a good feature subset. The following 
search procedure is called forward search: 
1. Initialize F = ∅. 
2. Repeat { 
(a) For i = 1, . . . , n if i ∉ F, let Fi = F ∪ {i}, and use some ver- 
sion of cross validation to evaluate features Fi. (I.e., train your 
learning algorithm using only the features in Fi, and estimate its 
generalization error.) 
(b) Set F to be the best feature subset found on step (a). 
} 
3. Select and output the best feature subset that was evaluated during the 
entire search procedure.
The outer loop of the algorithm can be terminated either when F = 
{1, . . . , n} is the set of all features, or when |F| exceeds some pre-set thresh- 
old (corresponding to the maximum number of features that you want the 
algorithm to consider using). 
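The forward search loop can be written compactly. In the sketch below, the `evaluate` callback is a hypothetical stand-in for "train the learning algorithm using only the features in Fi and estimate its generalization error by cross validation"; only the greedy search logic itself comes from the procedure above.

```python
def forward_search(evaluate, n, max_features=None):
    """Greedy forward feature search over the feature indices 0, ..., n-1."""
    max_features = n if max_features is None else max_features
    F = set()
    best_subset, best_err = set(), float("inf")

    while len(F) < max_features:
        candidates = [i for i in range(n) if i not in F]
        # (a) evaluate F union {i} for every feature i not yet in F
        errors = {i: evaluate(F | {i}) for i in candidates}
        best_i = min(errors, key=errors.get)
        F = F | {best_i}                    # (b) keep the best single addition
        if errors[best_i] < best_err:       # remember the best subset seen overall
            best_subset, best_err = set(F), errors[best_i]

    # output the best feature subset evaluated during the entire search
    return best_subset, best_err
```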
The algorithm described above is one instantiation of wrapper model 
feature selection, since it is a procedure that “wraps” around your learning 
algorithm, and repeatedly makes calls to the learning algorithm to evaluate 
how well it does using different feature subsets. Aside from forward search, 
other search procedures can also be used. For example, backward search 
starts off with F = {1, . . . , n} as the set of all features, and repeatedly deletes 
features one at a time (evaluating single-feature deletions in a similar manner 
to how forward search evaluates single-feature additions) until F = ∅. 
Wrapper feature selection algorithms often work quite well, but can be 
computationally expensive given that they need to make many calls to 
the learning algorithm. Indeed, complete forward search (terminating when 
F = {1, . . . , n}) would take about O(n^2) calls to the learning algorithm. 
Filter feature selection methods give heuristic, but computationally 
much cheaper, ways of choosing a feature subset. The idea here is to compute 
some simple score S(i) that measures how informative each feature xi is about 
the class labels y. Then, we simply pick the k features with the largest scores 
S(i). 
One possible choice of the score would be to define S(i) to be (the absolute 
value of) the correlation between xi and y, as measured on the training data. 
This would result in our choosing the features that are the most strongly 
correlated with the class labels. In practice, it is more common (particularly 
for discrete-valued features xi) to choose S(i) to be the mutual information 
MI(xi, y) between xi and y: 
MI(xi, y) = Σ_{xi ∈ {0,1}} Σ_{y ∈ {0,1}} p(xi, y) log [ p(xi, y) / ( p(xi) p(y) ) ]. 
(The equation above assumes that xi and y are binary-valued; more generally 
the summations would be over the domains of the variables.) The probabil- 
ities above p(xi, y), p(xi) and p(y) can all be estimated according to their 
empirical distributions on the training set. 
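For binary features and labels, this empirical mutual-information score can be computed directly from counts. The sketch below is an illustrative implementation (not from the notes), using the usual convention that 0 log 0 = 0:

```python
import numpy as np

def mutual_information_scores(X, y):
    """Empirical MI(x_i, y) for each binary feature column of X and binary labels y."""
    m, n = X.shape
    scores = np.zeros(n)
    for i in range(n):
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_xy = np.mean((X[:, i] == a) & (y == b))  # empirical p(x_i = a, y = b)
                p_x = np.mean(X[:, i] == a)                # empirical p(x_i = a)
                p_y = np.mean(y == b)                      # empirical p(y = b)
                if p_xy > 0:                               # treat 0 log 0 as 0
                    mi += p_xy * np.log(p_xy / (p_x * p_y))
        scores[i] = mi
    return scores

def top_k_features(X, y, k):
    """Filter selection: keep the k features with the largest scores S(i)."""
    return np.argsort(mutual_information_scores(X, y))[::-1][:k]
```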
To gain intuition about what this score does, note that the mutual infor- 
mation can also be expressed as a Kullback-Leibler (KL) divergence: 
MI(xi, y) = KL (p(xi, y)||p(xi)p(y)) 
You’ll get to play more with KL-divergence in Problem set #3, but infor- 
mally, this gives a measure of how different the probability distributions
p(xi, y) and p(xi)p(y) are. If xi and y are independent random variables, 
then we would have p(xi, y) = p(xi)p(y), and the KL-divergence between the 
two distributions will be zero. This is consistent with the idea that if xi and y 
are independent, then xi is clearly very “non-informative” about y, and thus 
the score S(i) should be small. Conversely, if xi is very “informative” about 
y, then their mutual information MI(xi, y) would be large. 
One final detail: Now that you’ve ranked the features according to their 
scores S(i), how do you decide how many features k to choose? Well, one 
standard way to do so is to use cross validation to select among the possible 
values of k. For example, when applying naive Bayes to text classification— 
a problem where n, the vocabulary size, is usually very large—using this 
method to select a feature subset often results in increased classifier accuracy. 
3 Bayesian statistics and regularization 
In this section, we will talk about one more tool in our arsenal for our battle 
against overfitting. 
At the beginning of the quarter, we talked about parameter fitting using 
maximum likelihood (ML), and chose our parameters according to 
θML = arg max_θ Π_{i=1}^{m} p(y(i)|x(i); θ). 
Throughout our subsequent discussions, we viewed θ as an unknown param- 
eter of the world. This view of θ as being constant-valued but unknown 
is taken in frequentist statistics. In this frequentist view of the world, θ 
is not random—it just happens to be unknown—and it’s our job to come up 
with statistical procedures (such as maximum likelihood) to try to estimate 
this parameter. 
An alternative way to approach our parameter estimation problems is to 
take the Bayesian view of the world, and think of θ as being a random 
variable whose value is unknown. In this approach, we would specify a 
prior distribution p(θ) on θ that expresses our “prior beliefs” about the 
parameters. Given a training set S = {(x(i), y(i))}_{i=1}^{m}, when we are asked to 
make a prediction on a new value of x, we can then compute the posterior
distribution on the parameters 
p(θ|S) = p(S|θ) p(θ) / p(S) 
       = [ Π_{i=1}^{m} p(y(i)|x(i), θ) ] p(θ) / ∫_θ ( Π_{i=1}^{m} p(y(i)|x(i), θ) ) p(θ) dθ      (1) 
In the equation above, p(y(i)|x(i), θ) comes from whatever model you’re using 
for your learning problem. For example, if you are using Bayesian logistic re- 
gression, then you might choose p(y(i)|x(i), θ) = h(x(i))^{y(i)} (1 − h(x(i)))^{1−y(i)}, 
where h(x(i)) = 1/(1 + exp(−θ^T x(i))).3 
When we are given a new test example x and asked to make a prediction 
on it, we can compute our posterior distribution on the class label using the 
posterior distribution on θ: 
p(y|x, S) = ∫_θ p(y|x, θ) p(θ|S) dθ      (2) 
In the equation above, p(θ|S) comes from Equation (1). Thus, for example, 
if the goal is to predict the expected value of y given x, then we would 
output4 
E[y|x, S] = ∫_y y p(y|x, S) dy 
The procedure that we’ve outlined here can be thought of as doing “fully 
Bayesian” prediction, where our prediction is computed by taking an average 
with respect to the posterior p(θ|S) over θ. Unfortunately, in general it is 
computationally very difficult to compute this posterior distribution. This is 
because it requires taking integrals over the (usually high-dimensional) θ as 
in Equation (1), and this typically cannot be done in closed-form. 
Thus, in practice we will instead approximate the posterior distribution 
for θ. One common approximation is to replace our posterior distribution for 
θ (as in Equation 2) with a single point estimate. The MAP (maximum 
a posteriori) estimate for θ is given by 
θMAP = arg max_θ [ Π_{i=1}^{m} p(y(i)|x(i), θ) ] p(θ).      (3) 
3Since we are now viewing θ as a random variable, it is okay to condition on its value, 
and write “p(y|x, θ)” instead of “p(y|x; θ).” 
4The integral below would be replaced by a summation if y is discrete-valued.
Note that this is the same formula as for the ML (maximum likelihood) 
estimate for θ, except for the prior p(θ) term at the end. 
In practical applications, a common choice for the prior p(θ) is to assume 
that θ ∼ N(0, τ^2 I). Using this choice of prior, the fitted parameters θMAP 
will have a smaller norm than the parameters selected by maximum likelihood. (See 
Problem Set #3.) In practice, this causes the Bayesian MAP estimate to be 
less susceptible to overfitting than the ML estimate of the parameters. For 
example, Bayesian logistic regression turns out to be an effective algorithm for 
text classification, even though in text classification we usually have n ≫ m.
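As a rough illustration of this last point, maximizing the MAP objective in Equation (3) with the prior θ ∼ N(0, τ^2 I) is equivalent to maximizing the log-likelihood minus an L2 penalty ||θ||^2 / (2τ^2). The sketch below does this for logistic regression by plain gradient ascent; the step size, iteration count, and scaling by m are arbitrary illustrative choices, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_logistic_regression(X, y, tau=1.0, lr=0.1, iters=1000):
    """MAP estimate for logistic regression with prior theta ~ N(0, tau^2 I)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad_loglik = X.T @ (y - h)       # gradient of sum_i log p(y(i)|x(i), theta)
        grad_logprior = -theta / tau**2   # gradient of log p(theta) for N(0, tau^2 I)
        theta += lr * (grad_loglik + grad_logprior) / m
    return theta
```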
