Ch.5 machine learning basics

Chapter 5 Machine Learning Basics
JIN HO LEE
2018-10-6

Contents
• 5.1 Learning Algorithms
• 5.2 Capacity, Overfitting and Underfitting
• 5.3 Hyperparameters and Validation Sets
• 5.4 Estimators, Bias and Variance
• 5.5 Maximum Likelihood Estimation
• 5.6 Bayesian Statistics
• 5.7 Supervised Learning Algorithms
• 5.8 Unsupervised Learning Algorithms
• 5.9 Stochastic Gradient Descent
• 5.10 Building a Machine Learning Algorithm
• 5.11 Challenges Motivating Deep Learning

5.1 Learning Algorithms
Definition
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.

5.1.1 The Task, T
Many kinds of tasks can be solved with machine learning. Some of the
most common machine learning tasks include the following:
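• Classification: the task of finding a function that assigns each input to one of k categories. If the input has dimension n, the problem is to find a function $f : \mathbb{R}^n \to \{1, \dots, k\}$.
• Regression: the task of finding a function that maps each input to a real number. If the input has dimension n, the problem is to find a function $f : \mathbb{R}^n \to \mathbb{R}$.
• Transcription: the task of converting a relatively unstructured representation of data into discrete textual form. Examples include music transcription and OCR (Optical Character Recognition).
• Machine translation: translating text from one language into another.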

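• Anomaly detection: the task of flagging unusual or atypical events. Credit card fraud detection is an example.
• Synthesis and sampling: the task of generating new examples. VAEs and GANs are representative examples.
• Denoising: given a corrupted example $\tilde{x} \in \mathbb{R}^n$, the task of recovering the clean example $x \in \mathbb{R}^n$, or more generally of learning the conditional distribution $p(x \mid \tilde{x})$.
• Density estimation or probability mass function estimation: the task of learning, from the given data, a probability density function (if x is continuous) or a probability mass function (if x is discrete), that is, a function $p_{\text{model}} : \mathbb{R}^n \to \mathbb{R}$.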

5.1.2 The Performance Measure, P
• To evaluate a machine learning algorithm we need a quantitative measure of its performance. Accuracy and error rate are the typical performance measures. Regression problems often use the MSE (mean squared error), binary classification often uses cross-entropy, and classification with three or more classes usually uses cross-entropy computed on softmax outputs.

5.1.3 The Experience, E
• Machine learning algorithms are broadly divided into unsupervised learning and supervised learning. Most of the algorithms covered in this book learn from a dataset.
• Supervised learning algorithms learn from a dataset in which each example has a label or target. The goal is to predict the label accurately when new unlabeled data arrives, using what was learned from the existing dataset. Given a random vector x and an associated value or vector y, the algorithm learns $p(y \mid x)$.
• Unsupervised learning algorithms learn useful properties of a dataset containing many features. Unsupervised deep learning algorithms aim to learn the entire probability distribution that generated the dataset.

• In contrast, some machine learning algorithms do not learn from a fixed dataset. Reinforcement learning algorithms interact with an environment and learn from the feedback they receive.
• A common way to describe a dataset is with a design matrix. A design matrix stores one example per row, with the rows stacked vertically, and each column corresponds to a feature. For example, the Iris dataset has 150 examples with 4 features each, so $X \in \mathbb{R}^{150 \times 4}$ and $X_{i,1}$ is the sepal length of the i-th plant.
• Some datasets, such as photographs of different sizes, cannot be arranged as a design matrix. How to handle such cases is covered in Chapters 9 and 10.
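The row/column convention of a design matrix can be sketched in NumPy as follows. This is only an illustration: the numbers are made-up placeholders rather than real Iris measurements, and the variable names are assumptions of this sketch.

```python
import numpy as np

# Hypothetical measurements for 3 flowers (rows) and 4 features (columns).
# The numbers are illustrative placeholders, not actual Iris data.
X = np.array([
    [5.0, 3.5, 1.5, 0.2],
    [6.5, 3.0, 4.5, 1.5],
    [7.0, 3.2, 6.0, 2.0],
])

num_examples, num_features = X.shape  # (m, n)
first_feature = X[:, 0]               # column 0: first feature of every example
second_example = X[1]                 # row 1: all features of one example
```

Indexing a column gives one feature across all examples, while indexing a row gives one complete example.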

5.1.4 Example: Linear Regression
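• Linear regression takes an input vector $x \in \mathbb{R}^{n+1}$ (with $x_0 = 1$) and looks for a vector $w \in \mathbb{R}^{n+1}$ that predicts a scalar $y \in \mathbb{R}$ well. The prediction $\hat{y}$ of y is defined as
$$\hat{y} = w^\top x = w_0 + w_1 x_1 + \cdots + w_n x_n.$$
Here $w \in \mathbb{R}^{n+1}$ is called the parameter vector and each $w_i$ is called a weight.
• To solve linear regression, define the task T as follows:
T : to predict y from x by outputting $\hat{y} = w^\top x$.
• As the performance measure P, take the mean squared error, which we minimize on the training set:
$$\mathrm{MSE}_{\text{train}} = \frac{1}{m} \left\| \hat{y}^{(\text{train})} - y^{(\text{train})} \right\|_2^2.$$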

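• To find the minimum of $\mathrm{MSE}_{\text{train}}$, set $\nabla_w \mathrm{MSE}_{\text{train}} = 0$:
$$
\begin{aligned}
0 &= \nabla_w \mathrm{MSE}_{\text{train}} \\
  &= \nabla_w \frac{1}{m} \left\| \hat{y}^{(\text{train})} - y^{(\text{train})} \right\|_2^2 \\
  &= \frac{1}{m} \nabla_w \left\| X^{(\text{train})} w - y^{(\text{train})} \right\|_2^2 .
\end{aligned}
$$
Multiplying both sides by m gives
$$
\begin{aligned}
0 &= \nabla_w \left( X^{(\text{train})} w - y^{(\text{train})} \right)^\top \left( X^{(\text{train})} w - y^{(\text{train})} \right) \\
  &= \nabla_w \left( w^\top X^{(\text{train})\top} X^{(\text{train})} w - 2 w^\top X^{(\text{train})\top} y^{(\text{train})} + y^{(\text{train})\top} y^{(\text{train})} \right) \\
  &= 2 X^{(\text{train})\top} X^{(\text{train})} w - 2 X^{(\text{train})\top} y^{(\text{train})} .
\end{aligned}
$$
Therefore, provided $X^{(\text{train})\top} X^{(\text{train})}$ is invertible,
$$w = \left( X^{(\text{train})\top} X^{(\text{train})} \right)^{-1} X^{(\text{train})\top} y^{(\text{train})} .$$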
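The closed-form solution above can be checked numerically. The sketch below is a minimal illustration on synthetic, noise-free data; the function name `fit_linear_regression` and the data are assumptions of this example, and it solves the linear system instead of forming the inverse explicitly.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X w = X^T y for w."""
    # np.linalg.solve avoids forming the explicit inverse of X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic, noise-free data: y = 1 + 2*x1 - 3*x2 (the first column is x0 = 1).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X = np.hstack([np.ones((50, 1)), features])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w

w = fit_linear_regression(X, y)
mse_train = np.mean((X @ w - y) ** 2)
```

Because the targets are generated without noise, the recovered `w` should match `true_w` up to floating-point error and the training MSE should be essentially zero.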

5.2 Capacity, Overfitting and Underfitting
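• We want an algorithm built from the given data to also perform well on new, previously unseen inputs. This ability is called generalization.
• When training a machine learning model, the data is usually split into training data and test data. The error measured on the training data is the training error, and the error measured on the test data is the test error.
• In the linear regression example of Section 5.1, we trained by minimizing the training error:
$$\frac{1}{m^{(\text{train})}} \left\| X^{(\text{train})} w - y^{(\text{train})} \right\|_2^2 .$$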

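• Of course, we must also consider the test error $\frac{1}{m^{(\text{test})}} \left\| X^{(\text{test})} w - y^{(\text{test})} \right\|_2^2$. The relationship between training error and test error is the subject of statistical learning theory.
• The training and test data are generated by a probability distribution over datasets called the data-generating process. The data is assumed to satisfy the i.i.d. assumptions: the examples are independent of each other, and the training data and test data are identically distributed. The distribution that generates the data is called the data-generating distribution and is denoted $p_{\text{data}}$.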

• For a machine learning algorithm to perform well, we need to do two things:
1. make the training error small, and
2. make the gap between training error and test error small.
Items 1 and 2 correspond to the problems of underfitting and overfitting. Both are governed by the model's capacity: a model with low capacity may struggle to fit the training set (underfitting), while a model with high capacity may memorize the training set (overfitting).

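• Informally, a model's capacity is its ability to fit a wide variety of functions. For example, a neural network with many layers and many units per layer has a large capacity.
• The hypothesis space is the set of functions that the learning algorithm is allowed to select. For example, the hypothesis space of linear regression is the set of all linear functions. For a given dataset, enlarging the hypothesis space increases the capacity.
• The Vapnik-Chervonenkis dimension (VC dimension) of a binary classifier f is the largest size $|X|$ of a set X such that, for every subset $S \subset X$, f can separate S from its complement $S^C$. For example, for a set $X = \{x_1, x_2, x_3\}$ of three non-collinear points in the two-dimensional Euclidean plane, there are 8 ordered pairs $(A, A^C)$ of a subset A and its complement, and every one of them can be separated by a straight line.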

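• However, for the four points $\{(0, 0), (0, 1), (1, 0), (1, 1)\}$ in the plane, a single straight line (a linear classifier in 2-dimensional Euclidean space) cannot separate $\{(0, 0), (1, 1)\}$ from $\{(0, 1), (1, 0)\}$. In fact, no set of four points in the plane can be separated in every possible way by a line, so the VC dimension of a line in the two-dimensional Euclidean plane is 3. In general, the VC dimension of a linear classifier in n-dimensional Euclidean space is n + 1.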

5.2.1 The No Free Lunch Theorem
• The no free lunch theorem for machine learning (Wolpert, 1996) states
that, averaged over all possible data generating distributions, every
classification algorithm has the same error rate when classifying previously
unobserved points. In other words, in some sense, no machine learning
algorithm is universally any better than any other.

5.2.2 Regularization
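• Our goal is an algorithm that works well not only on the given data but also on new data. One way to get there is to give the algorithm a preference for some solutions over others instead of fitting the given data as closely as possible; this is called regularization. The function that expresses this preference is called a regularizer.
• For example, using the regularizer $\Omega(w) = w^\top w$ in linear regression gives the cost function
$$J(w) = \mathrm{MSE}_{\text{train}} + \lambda w^\top w .$$
Here $\lambda$ is a constant that controls how strongly we regularize, and it is a value we must choose ourselves. See Chapter 7 for details.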
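The effect of the penalty can be sketched with the closed-form minimizer of this cost. The derivation in the comment (setting the gradient of $\frac{1}{m}\|Xw - y\|_2^2 + \lambda w^\top w$ to zero) and the function name `fit_ridge` are assumptions of this example, not something stated on the slide.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/m)||Xw - y||^2 + lam * w^T w in closed form."""
    m, n = X.shape
    # Setting the gradient to zero gives (X^T X + m*lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=30)

w_unregularized = fit_ridge(X, y, lam=0.0)
w_regularized = fit_ridge(X, y, lam=1.0)
```

With a larger `lam` the solution is pulled toward zero, so the norm of `w_regularized` is smaller than that of `w_unregularized`.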

5.3 Hyperparameters and Validation Sets
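• A machine learning algorithm has several settings that control its behavior. These settings are called hyperparameters. The algorithm does not learn them itself, so they are set when the model is built. Examples of hyperparameters are the learning rate, the error function, the number of epochs and the number of layers.
• To guard against overfitting the training set, we can hold out a validation set, which is used to choose hyperparameters and is not used to fit the parameters. Typically 80% of the data is used as the training set and 20% as the validation set.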
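A minimal sketch of such an 80/20 split is shown below. The helper `train_validation_split` is a hypothetical function written for this example (libraries such as scikit-learn offer their own utilities), and the toy arrays are placeholders.

```python
import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    num_validation = int(round(validation_fraction * len(X)))
    val_idx, train_idx = indices[:num_validation], indices[num_validation:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_train, y_train, X_val, y_val = train_validation_split(X, y)
```

Shuffling before splitting keeps both subsets representative when the original ordering of the data is not random.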

5.4 Estimators, Bias and Variance
5.4.1 Point Estimation
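• Let $\{x^{(1)}, \dots, x^{(m)}\}$ be a set of i.i.d. data points. A point estimator or statistic is any function of the data:
$$\hat{\theta}_m = g(x^{(1)}, \dots, x^{(m)}).$$
• From the frequentist point of view, the true parameter $\theta$ is fixed but unknown, and we must estimate it from the data. Because the estimate is a function of randomly drawn data, $\hat{\theta}$ is a random variable.
• In other cases we are interested in the relationship between an input x and an output y. We can then estimate this relationship with a function f. The function estimator is denoted $\hat{f}$.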

5.4.2 Bias
Definition
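The bias of an estimator $\hat{\theta}_m$ is defined as
$$\operatorname{bias}(\hat{\theta}_m) = E(\hat{\theta}_m) - \theta,$$
where the expectation is over the data (seen as m samples from a random variable) and $\theta$ is the true underlying value of the parameter. An estimator $\hat{\theta}_m$ is said to be unbiased if $\operatorname{bias}(\hat{\theta}_m) = 0$ (that is, $E(\hat{\theta}_m) = \theta$). An estimator $\hat{\theta}_m$ is said to be asymptotically unbiased if $\lim_{m \to \infty} \operatorname{bias}(\hat{\theta}_m) = 0$ (that is, $\lim_{m \to \infty} E(\hat{\theta}_m) = \theta$).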

Example : Bernoulli Distribution
• Consider a set of samples $\{x^{(1)}, \dots, x^{(m)}\}$ that are i.i.d. according to a Bernoulli distribution with mean $\theta$:
$$P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{1 - x^{(i)}} .$$
We define an estimator as follows:
$$\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} .$$

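Then we have
$$
\begin{aligned}
\operatorname{bias}(\hat{\theta}_m) &= E[\hat{\theta}_m] - \theta \\
&= E\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} E\left[ x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \sum_{x^{(i)}=0}^{1} \left( x^{(i)} \theta^{x^{(i)}} (1 - \theta)^{1 - x^{(i)}} \right) - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \theta - \theta = 0 .
\end{aligned}
$$
Since $\operatorname{bias}(\hat{\theta}_m) = 0$, we say that our estimator $\hat{\theta}_m$ is unbiased.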
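The unbiasedness of the sample mean can also be observed empirically. The following is a rough simulation sketch; the true mean `theta`, the dataset size `m` and the number of repetitions are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3           # true Bernoulli mean (chosen arbitrarily for the demo)
m = 20                # samples per dataset
num_datasets = 100_000

# Each row is one dataset of m i.i.d. Bernoulli(theta) draws.
samples = rng.binomial(1, theta, size=(num_datasets, m))
theta_hat = samples.mean(axis=1)          # one estimate per dataset
empirical_bias = theta_hat.mean() - theta
```

Averaged over many datasets, the estimates cluster around `theta`, so `empirical_bias` should be close to zero, up to simulation noise.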

5.4.4 Trading off Bias and Variance to Minimize MSE
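• The mean squared error (MSE) can be written as
$$
\begin{aligned}
\mathrm{MSE} &= E\left[ (\hat{\theta}_m - \theta)^2 \right] \\
&= \operatorname{bias}(\hat{\theta}_m)^2 + \operatorname{Var}(\hat{\theta}_m) .
\end{aligned}
$$
The MSE therefore accounts for both bias and variance: at a fixed MSE, a larger bias means a smaller variance, and vice versa.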
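The decomposition can be verified numerically. The sketch below uses a deliberately biased, hypothetical estimator (a shrunken sample mean of Gaussian data); the estimator and all constants are assumptions chosen only to make both terms nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                      # true mean (arbitrary choice for the demo)
data = rng.normal(loc=theta, scale=1.0, size=(50_000, 10))

# A deliberately biased, hypothetical estimator: shrink the sample mean toward 0.
theta_hat = 0.8 * data.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
bias = theta_hat.mean() - theta
variance = theta_hat.var()       # population variance (ddof=0)
```

When the expectation is replaced by an average over the simulated datasets, `mse` equals `bias**2 + variance` up to floating-point error.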

5.4.5 Consistency
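• We want the estimator to converge to the true value as the amount of data grows. That is, we want
$$\operatorname*{plim}_{m \to \infty} \hat{\theta}_m = \theta .$$
Here plim denotes convergence in probability, defined as follows:
$$\forall \varepsilon > 0, \quad P(|\hat{\theta}_m - \theta| > \varepsilon) \to 0 \text{ as } m \to \infty .$$
Almost sure convergence of a sequence of random variables $X^{(1)}, X^{(2)}, \dots$ to a value x occurs when $p(\lim_{m \to \infty} X^{(m)} = x) = 1$.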

5.5 Maximum Likelihood Estimation
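• Consider a set of m examples $\mathbb{X} = \{x^{(1)}, \dots, x^{(m)}\}$ drawn independently from the unknown data-generating distribution $p_{\text{data}}(x)$.
• Let $p_{\text{model}}(x; \theta)$ be a parametric family of probability distributions over the same space, parametrized by $\theta$. The maximum likelihood estimator for $\theta$ is then defined as
$$
\begin{aligned}
\theta_{\text{ML}} &= \operatorname*{argmax}_{\theta} \; p_{\text{model}}(\mathbb{X}; \theta) \\
&= \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} p_{\text{model}}(x^{(i)}; \theta) .
\end{aligned}
$$
Since the logarithm is increasing, we can write
$$\theta_{\text{ML}} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}; \theta). \tag{5.58}$$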

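Since the examples in $\mathbb{X} = \{x^{(1)}, \dots, x^{(m)}\}$ are independent, $p_{\text{model}}(\mathbb{X}; \theta) = \prod_{i} p_{\text{model}}(x^{(i)}; \theta)$, and therefore
$$\theta_{\text{ML}} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}; \theta) = \operatorname*{argmax}_{\theta} \log p_{\text{model}}(\mathbb{X}; \theta). \tag{5.58'}$$
Dividing by m does not change the argmax, so the right-hand side can be written as an expectation with respect to the empirical distribution $\hat{p}_{\text{data}}$:
$$\theta_{\text{ML}} = \operatorname*{argmax}_{\theta} \; E_{x \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x; \theta). \tag{5.59}$$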

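The KL divergence is defined as
$$D_{\text{KL}}(\hat{p}_{\text{data}} \| p_{\text{model}}) = E_{x \sim \hat{p}_{\text{data}}} \left[ \log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x) \right] .$$
In this expression $E_{x \sim \hat{p}_{\text{data}}} [\log \hat{p}_{\text{data}}(x)]$ does not depend on the model, so to reduce the KL divergence we must reduce $-E_{x \sim \hat{p}_{\text{data}}} [\log p_{\text{model}}(x)]$. The cross-entropy is defined as
$$H(\hat{p}_{\text{data}}, p_{\text{model}}) = H(\hat{p}_{\text{data}}) + D_{\text{KL}}(\hat{p}_{\text{data}} \| p_{\text{model}}) = -E_{x \sim \hat{p}_{\text{data}}} [\log p_{\text{model}}(x)] .$$
Since $H(\hat{p}_{\text{data}})$ is fixed, minimizing the KL divergence is the same problem as minimizing the cross-entropy.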
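The relation between entropy, KL divergence and cross-entropy can be checked on a small discrete example. The two distributions below are hypothetical and the helper functions are written for this sketch only; all probabilities are positive so that the logarithms are finite.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

# Two hypothetical distributions over three outcomes (all entries positive).
p_data = np.array([0.5, 0.3, 0.2])
p_model = np.array([0.4, 0.4, 0.2])
```

Here `cross_entropy(p_data, p_model)` equals `entropy(p_data) + kl_divergence(p_data, p_model)`, and since the entropy term does not involve `p_model`, changing the model moves both quantities by the same amount.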

5.5.1 Conditional Log-Likelihood and Mean Squared Error
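• The conditional maximum likelihood estimator for $\theta$ is defined as
$$\theta_{\text{ML}} = \operatorname*{argmax}_{\theta} P(Y \mid X; \theta) .$$
If the examples are assumed to be i.i.d., then this can be decomposed into
$$\theta_{\text{ML}} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta) .$$
• To recover the result of Section 5.1.4, define $p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2)$. The function $\hat{y}(x; w)$ gives the prediction of the mean of the Gaussian. (Equation (5.69) uses $\hat{y} = w^\top x$ to derive equation (5.71). That is, $\hat{y}(x; w)$ is a function, and $p(y \mid x)$ is the distribution with mean $\hat{y}(x; w)$ and variance $\sigma^2$.)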

Example: Linear Regression as Maximum Likelihood
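• In Section 5.1.4 we mapped an input x to a prediction $\hat{y}$ by minimizing the MSE. This time we approach the problem from the point of view of maximum likelihood estimation: instead of producing a single prediction $\hat{y}$, we model the conditional distribution $p(y \mid x)$.
• Since we assumed the examples are i.i.d., applying the distribution $p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2)$ to (5.63) gives
$$
\begin{aligned}
\theta_{\text{ML}} &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log \left( \sqrt{\frac{1}{2 \pi \sigma^2}} \exp \left( -\frac{1}{2 \sigma^2} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 \right) \right) .
\end{aligned}
$$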

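Expanding the logarithm gives
$$
\begin{aligned}
\theta_{\text{ML}} &= \operatorname*{argmax}_{\theta} \left( -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\| \hat{y}^{(i)} - y^{(i)} \|^2}{2 \sigma^2} \right) \\
&= \operatorname*{argmax}_{\theta} \left( - \sum_{i=1}^{m} \frac{\| \hat{y}^{(i)} - y^{(i)} \|^2}{2 \sigma^2} \right) \\
&= \operatorname*{argmin}_{\theta} \sum_{i=1}^{m} \| \hat{y}^{(i)} - y^{(i)} \|^2 ,
\end{aligned}
$$
where $\hat{y}^{(i)}$ is the output of the linear regression on the i-th input $x^{(i)}$ and m is the number of training examples.
Therefore, under the above assumption on $p(y \mid x)$, maximizing the log-likelihood and minimizing the mean squared error give the same result.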

5.5.2 Properties of Maximum Likelihood
Under appropriate conditions, the maximum likelihood estimator has the
property of consistency (see Sec. 5.4.5 above), meaning that as the
number of training examples approaches infinity, the maximum likelihood
estimate of a parameter converges to the true value of the parameter.
These conditions are:
• The true distribution pdata must lie within the model family pmodel(·; θ).
Otherwise, no estimator can recover pdata.
• The true distribution pdata must correspond to exactly one value of θ.
Otherwise, maximum likelihood can recover the correct pdata, but will not
be able to determine which value of θ was used by the data generating
process.

Example
Source: http://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/MLE.pdf
Suppose that X is a discrete random variable with the following probability mass function, where 0 ≤ θ ≤ 1 is a parameter:
X      0      1      2           3
P(X)   2θ/3   θ/3    2(1 − θ)/3  (1 − θ)/3
The following 10 independent observations were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of θ?

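Solution: In the sample (3, 0, 2, 1, 3, 2, 1, 0, 2, 1), the values 0, 1, 2 and 3 occur 2, 3, 3 and 2 times respectively, so the likelihood is
$$
\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} P(X_i \mid \theta) \\
&= \left( \frac{2\theta}{3} \right)^2 \left( \frac{\theta}{3} \right)^3 \left( \frac{2(1 - \theta)}{3} \right)^3 \left( \frac{1 - \theta}{3} \right)^2 .
\end{aligned}
$$
Let us look at the log likelihood function:
$$
\begin{aligned}
l(\theta) = \log L(\theta) &= \sum_{i=1}^{n} \log P(X_i \mid \theta) \\
&= 2 \left( \log \frac{2}{3} + \log \theta \right) + 3 \left( \log \frac{1}{3} + \log \theta \right) + 3 \left( \log \frac{2}{3} + \log(1 - \theta) \right) + 2 \left( \log \frac{1}{3} + \log(1 - \theta) \right) \\
&= 5 \log \theta + 5 \log(1 - \theta) + C ,
\end{aligned}
$$
where C is a constant that does not depend on θ.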

Then
$$\frac{dl(\theta)}{d\theta} = \frac{5}{\theta} - \frac{5}{1 - \theta} = 0$$
implies that $\hat{\theta} = 0.5$.
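The result can be cross-checked by evaluating the log-likelihood on a grid of candidate values. This is a brute-force sketch written for illustration; the helper names `pmf` and `log_likelihood` and the grid resolution are assumptions of this example.

```python
import numpy as np

observations = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

def pmf(x, theta):
    """Probability mass function from the table in the example."""
    table = {0: 2 * theta / 3, 1: theta / 3,
             2: 2 * (1 - theta) / 3, 3: (1 - theta) / 3}
    return table[x]

def log_likelihood(theta):
    return sum(np.log(pmf(x, theta)) for x in observations)

# Evaluate on a grid strictly inside (0, 1) to avoid log(0).
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]
```

The grid maximizer agrees with the analytic solution $\hat{\theta} = 0.5$ up to floating-point error.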

5.6 Bayesian Statistics
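• Suppose we have a data sample $\{x^{(1)}, \dots, x^{(m)}\}$. Using Bayes' rule, $p(\theta \mid x^{(1)}, \dots, x^{(m)})$ can be written as
$$p(\theta \mid x^{(1)}, \dots, x^{(m)}) = \frac{p(x^{(1)}, \dots, x^{(m)} \mid \theta) \, p(\theta)}{p(x^{(1)}, \dots, x^{(m)})} .$$
• Compared with MLE, Bayesian estimation differs in two important ways. First, whereas the MLE approach makes predictions using a point estimate of $\theta$, the Bayesian approach makes predictions using the full distribution over $\theta$. For example, after observing m examples, the (m + 1)-th sample is predicted as
$$p(x^{(m+1)} \mid x^{(1)}, \dots, x^{(m)}) = \int p(x^{(m+1)} \mid \theta) \, p(\theta \mid x^{(1)}, \dots, x^{(m)}) \, d\theta .$$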

The second important difference between the Bayesian approach to
estimation and the maximum likelihood approach is due to the
contribution of the Bayesian prior distribution. The prior has an influence
by shifting probability mass density towards regions of the parameter space
that are preferred a priori. In practice, the prior often expresses a
preference for models that are simpler or more smooth. Critics of the
Bayesian approach identify the prior as a source of subjective human
judgment impacting the predictions.

Example: Bayesian Linear Regression
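In linear regression, the predicted value $y \in \mathbb{R}$ for an input $x \in \mathbb{R}^n$ is given in parametrized form by a vector $w \in \mathbb{R}^n$:
$$\hat{y} = w^\top x .$$
Given m training samples $(X^{(\text{train})}, y^{(\text{train})})$, the prediction vector is $\hat{y}^{(\text{train})} = X^{(\text{train})} w$. If $y^{(\text{train})}$ follows a Gaussian distribution around this prediction with identity covariance, we can write
$$
\begin{aligned}
p(y^{(\text{train})} \mid X^{(\text{train})}, w) &= \mathcal{N}(y^{(\text{train})}; X^{(\text{train})} w, I) \\
&= \sqrt{\frac{1}{(2\pi)^m \det(I)}} \exp \left( -\frac{1}{2} (y^{(\text{train})} - X^{(\text{train})} w)^\top (y^{(\text{train})} - X^{(\text{train})} w) \right) \\
&\propto \exp \left( -\frac{1}{2} (y^{(\text{train})} - X^{(\text{train})} w)^\top (y^{(\text{train})} - X^{(\text{train})} w) \right) .
\end{aligned}
$$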

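From now on we write $(X^{(\text{train})}, y^{(\text{train})})$ simply as $(X, y)$. Assume that the prior distribution $p(w)$ is Gaussian:
$$p(w) = \mathcal{N}(w; \mu_0, \Lambda_0) \propto \exp \left( -\frac{1}{2} (w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0) \right) .$$
The posterior distribution can then be written as
$$
\begin{aligned}
p(w \mid X, y) &\propto p(y \mid X, w) \, p(w) \\
&\propto \exp \left( -\frac{1}{2} (y - Xw)^\top (y - Xw) \right) \times \exp \left( -\frac{1}{2} (w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0) \right) \qquad (1)
\end{aligned}
$$
where
$$
\begin{aligned}
(y - Xw)^\top (y - Xw) &= (y^\top - w^\top X^\top)(y - Xw) \\
&= y^\top y - y^\top X w - w^\top X^\top y + w^\top X^\top X w .
\end{aligned}
$$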

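Note that $y \in \mathbb{R}^{m \times 1}$, $X \in \mathbb{R}^{m \times n}$, $w \in \mathbb{R}^n$. Since $y^\top X w, \; w^\top X^\top y \in \mathbb{R}$ and $(y^\top X w)^\top = w^\top X^\top y$, we have $y^\top X w = w^\top X^\top y$. It implies that
$$
\begin{aligned}
\exp \left( -\frac{1}{2} (y - Xw)^\top (y - Xw) \right) &= \exp \left( -\frac{1}{2} \left( y^\top y - y^\top X w - w^\top X^\top y + w^\top X^\top X w \right) \right) \\
&= \exp \left( -\frac{1}{2} y^\top y \right) \exp \left( -\frac{1}{2} \left( -2 y^\top X w + w^\top X^\top X w \right) \right) \\
&\propto \exp \left( -\frac{1}{2} \left( -2 y^\top X w + w^\top X^\top X w \right) \right) \qquad (2)
\end{aligned}
$$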

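The covariance matrix $\Lambda_0$ is symmetric, i.e. $(\Lambda_0^{-1})^\top = (\Lambda_0^\top)^{-1} = \Lambda_0^{-1}$.
Then we have $(w^\top \Lambda_0^{-1} \mu_0)^\top = \mu_0^\top (\Lambda_0^{-1})^\top w = \mu_0^\top \Lambda_0^{-1} w \in \mathbb{R}$, so the two cross terms are equal. It implies that
$$
\begin{aligned}
\exp \left( -\frac{1}{2} (w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0) \right) &= \exp \left( -\frac{1}{2} \left( w^\top \Lambda_0^{-1} w - w^\top \Lambda_0^{-1} \mu_0 - \mu_0^\top \Lambda_0^{-1} w + \mu_0^\top \Lambda_0^{-1} \mu_0 \right) \right) \\
&= \exp \left( -\frac{1}{2} \mu_0^\top \Lambda_0^{-1} \mu_0 \right) \exp \left( -\frac{1}{2} \left( -2 \mu_0^\top \Lambda_0^{-1} w + w^\top \Lambda_0^{-1} w \right) \right) \\
&\propto \exp \left( -\frac{1}{2} \left( -2 \mu_0^\top \Lambda_0^{-1} w + w^\top \Lambda_0^{-1} w \right) \right) \qquad (3)
\end{aligned}
$$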

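By (1), (2) and (3), we have
$$p(w \mid X, y) \propto \exp \left( -\frac{1}{2} \left( -2 y^\top X w + w^\top X^\top X w + w^\top \Lambda_0^{-1} w - 2 \mu_0^\top \Lambda_0^{-1} w \right) \right) .$$
We now define $\Lambda_m = (X^\top X + \Lambda_0^{-1})^{-1}$ and $\mu_m = \Lambda_m (X^\top y + \Lambda_0^{-1} \mu_0)$.
Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:
$$
\begin{aligned}
p(w \mid X, y) &\propto \exp \left( -\frac{1}{2} (w - \mu_m)^\top \Lambda_m^{-1} (w - \mu_m) + \frac{1}{2} \mu_m^\top \Lambda_m^{-1} \mu_m \right) \\
&\propto \exp \left( -\frac{1}{2} (w - \mu_m)^\top \Lambda_m^{-1} (w - \mu_m) \right) .
\end{aligned}
$$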

• Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set $\mu_0$ to 0. If we set $\Lambda_0 = \frac{1}{\alpha} I$, then $\mu_m$ gives the same estimate of $w$ as does frequentist linear regression with a weight decay penalty of $\alpha w^\top w$. One difference is that the Bayesian estimate is undefined if $\alpha$ is set to zero: we are not allowed to begin the Bayesian learning process with an infinitely wide prior on $w$. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of $w$ are, rather than providing only the estimate $\mu_m$.
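The posterior parameters and the connection to weight decay can be sketched numerically. In the example below, the function name `posterior`, the synthetic data and the value of `alpha` are assumptions made for illustration; the ridge solution is computed for the cost $\|Xw - y\|_2^2 + \alpha w^\top w$.

```python
import numpy as np

def posterior(X, y, mu_0, Lambda_0):
    """Return (mu_m, Lambda_m) for unit-variance Gaussian noise and a Gaussian prior."""
    Lambda_0_inv = np.linalg.inv(Lambda_0)
    Lambda_m = np.linalg.inv(X.T @ X + Lambda_0_inv)
    mu_m = Lambda_m @ (X.T @ y + Lambda_0_inv @ mu_0)
    return mu_m, Lambda_m

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=40)

alpha = 0.5
mu_m, Lambda_m = posterior(X, y, mu_0=np.zeros(3), Lambda_0=np.eye(3) / alpha)

# Ridge (weight decay) solution of ||Xw - y||^2 + alpha * w^T w, for comparison.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
```

With a zero-mean isotropic prior, the posterior mean `mu_m` coincides with `w_ridge`, while `Lambda_m` additionally describes the uncertainty around that estimate.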

5.6.1 Maximum A Posteriori (MAP) Estimation
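• Let us revisit Bayes' theorem. It follows from the product rule of probability, $P(X \cap Y) = P(X \mid Y) P(Y)$: since $P(X \cap Y) = P(Y \cap X)$, we obtain $P(X \mid Y) P(Y) = P(Y \mid X) P(X)$.
• Here $p(x \mid \theta)$ is called the likelihood, $p(\theta)$ the prior and $p(\theta \mid x)$ the posterior. Respectively, they describe how probable the observations are under $\theta$, what we believe about $\theta$ before seeing the data, and how probable $\theta$ is given the data. Since $p(x)$ does not depend on $\theta$, we obtain
$$\operatorname*{argmax}_{\theta} \; p(\theta \mid x) = \operatorname*{argmax}_{\theta} \frac{p(x \mid \theta) \, p(\theta)}{p(x)} = \operatorname*{argmax}_{\theta} \; p(x \mid \theta) \, p(\theta) .$$
Because the logarithm is an increasing function, we define $\theta_{\text{MAP}}$ as follows:
Definition (Maximum A Posteriori (MAP))
$$\theta_{\text{MAP}} = \operatorname*{argmax}_{\theta} \; p(\theta \mid x) = \operatorname*{argmax}_{\theta} \; \log p(x \mid \theta) + \log p(\theta)$$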

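Example
For a linear regression model whose weights $w$ follow the Gaussian prior distribution $\mathcal{N}\left( w; 0, \frac{1}{\lambda} I \right)$, the log-prior term is a weight decay penalty similar to (5.18).
Sketch of Proof.
Suppose that
$$p(w) = \mathcal{N}\left( w; 0, \frac{1}{\lambda} I \right) = \sqrt{\frac{\lambda^n}{(2\pi)^n}} \exp \left( -\frac{\lambda}{2} w^\top w \right) .$$
Then we have
$$\log p(w) = -\frac{\lambda}{2} w^\top w + \text{constant} .$$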

5.9 Stochastic Gradient Descent
• The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as
$$J(\theta) = E_{x, y \sim \hat{p}_{\text{data}}} L(x, y, \theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)$$
where L is the per-example loss $L(x, y, \theta) = -\log p(y \mid x; \theta)$.
For these additive cost functions, gradient descent requires computing
$$\nabla_{\theta} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} L(x^{(i)}, y^{(i)}, \theta) .$$

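• The insight of SGD is that the gradient is an expectation. An expectation can be approximated using a small set of samples. Concretely, at every step of the algorithm we draw a minibatch of examples $B = \{x^{(1)}, \dots, x^{(m')}\}$ and estimate the gradient as
$$g = \frac{1}{m'} \nabla_{\theta} \sum_{i=1}^{m'} L(x^{(i)}, y^{(i)}, \theta) .$$
The SGD algorithm then updates the parameters with learning rate $\varepsilon$ as follows:
$$\theta \leftarrow \theta - \varepsilon g .$$
SGD is covered in more detail in Part II.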
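The update rule can be sketched for a simple case. The example below applies minibatch SGD to linear regression with a squared-error per-example loss; the function name, the learning rate, the batch size and the synthetic data are all assumptions chosen for this illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.05, batch_size=16,
                          num_steps=2000, seed=0):
    """Minimize the mean squared error with minibatch SGD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_steps):
        batch = rng.choice(m, size=batch_size, replace=False)  # minibatch B
        residual = X[batch] @ theta - y[batch]
        g = 2.0 / batch_size * X[batch].T @ residual   # gradient estimate
        theta = theta - learning_rate * g              # theta <- theta - eps * g
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta      # noise-free targets, so SGD can converge exactly
theta = sgd_linear_regression(X, y)
```

Each step touches only `batch_size` examples, so the cost per update does not grow with the size of the training set.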

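5.10 Building a Machine Learning Algorithm
• Most deep learning algorithms follow the same recipe: prepare a dataset, define a cost function and a model, and optimize.
For example, linear regression combines a dataset consisting of X and y, the cost function
$$J(w, b) = -E_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x),$$
and the model $p_{\text{model}}(y \mid x) = \mathcal{N}(y; x^\top w + b, 1)$. The optimum can be found by using the normal equations to solve for the point where the gradient is 0. Adding a regularization term for weight decay helps prevent overfitting to the training set:
$$J(w, b) = \lambda \|w\|_2^2 - E_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x).$$
Unsupervised learning has no labels, so a suitable cost function must be built from the dataset X alone. For example, PCA uses the reconstruction error
$$J(w) = E_{x \sim \hat{p}_{\text{data}}} \|x - r(x; w)\|_2^2$$
where $r(x) = w^\top x \, w$ is the reconstruction function and $w$ is constrained to have unit norm.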
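The reconstruction cost for a single unit-norm direction can be sketched as follows. The data is hypothetical, and the example relies on the standard fact that the minimizing direction is the eigenvector of $X^\top X$ with the largest eigenvalue; the helper name `reconstruction_error` is an assumption of this sketch.

```python
import numpy as np

def reconstruction_error(X, w):
    """Mean of ||x - (w^T x) w||^2 over the rows of X, for unit-norm w."""
    codes = X @ w                      # w^T x for every example
    reconstructions = np.outer(codes, w)
    return np.mean(np.sum((X - reconstructions) ** 2, axis=1))

rng = np.random.default_rng(0)
# Hypothetical 2-D data stretched along the first axis.
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

# The minimizer is the eigenvector of X^T X with the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
w_best = eigenvectors[:, -1]           # eigh sorts eigenvalues in ascending order
w_other = eigenvectors[:, 0]
```

Projecting onto `w_best` loses less information than projecting onto `w_other`, which is what the cost $J(w)$ measures.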
