2. Contents
• Maximum Likelihood learning
• Gradient descent-based approach
• Markov Chain Monte Carlo sampling
• Contrastive Divergence
• Further topics for discussion:
– Result biasing of Contrastive Divergence
– Product of Experts
– High-dimensional data considerations
3. Maximum Likelihood learning
• Given:
– Probability model: $p(x; \Theta) = \frac{1}{Z(\Theta)} f(x; \Theta)$
• $\Theta$ - the model parameters
• $Z(\Theta)$ - the partition function, defined as $Z(\Theta) = \int f(x; \Theta)\,dx$
– Training data: $X = \{x_k\}_{k=1}^{K}$
• Aim:
– Find $\Theta$ that maximizes the likelihood of the training data:
$p(X; \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)} f(x_k; \Theta)$
– Or, equivalently, that minimizes the negative log of the likelihood:
$E(X; \Theta) = K \log Z(\Theta) - \sum_{k=1}^{K} \log f(x_k; \Theta)$
• Toy example (implemented in the sketch below): $f(x; \Theta) = \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, $\Theta = \{\mu, \sigma\}$
– Known result: $Z(\Theta) = \sigma\sqrt{2\pi}$
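As a concrete anchor for the notation, here is a minimal NumPy sketch of the toy example; Python/NumPy, the variable names and the synthetic data are illustrative assumptions, not part of the slides. It evaluates $E(X; \Theta)$ directly from $f$ and the known $Z$.

# Sketch (assumed, not from the slides): the toy Gaussian model in NumPy.
import numpy as np

def f(x, mu, sigma):
    """Unnormalised model f(x; Theta) = exp(-(x - mu)^2 / (2 sigma^2))."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

def Z(mu, sigma):
    """Partition function, known in closed form for this toy model."""
    return sigma * np.sqrt(2.0 * np.pi)

def E(X, mu, sigma):
    """Negative log-likelihood E(X; Theta) = K log Z(Theta) - sum_k log f(x_k; Theta)."""
    K = len(X)
    return K * np.log(Z(mu, sigma)) - np.sum(np.log(f(X, mu, sigma)))

# Example: E is smaller for parameters that fit the data better.
X = np.random.default_rng(0).normal(1.0, 2.0, size=1000)   # synthetic training data (illustrative)
print(E(X, mu=1.0, sigma=2.0), E(X, mu=0.0, sigma=1.0))

In practice one would compute $\log f$ directly rather than exponentiating and then taking the log, but the direct form mirrors the definitions above.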
4. Maximum Likelihood learning
• Method:
– $\frac{\partial E(X; \Theta)}{\partial \Theta} = 0$ at the minimum. Expanding the gradient (and normalizing by the constant $K$, which does not move the minimum):
$\frac{\partial E(X; \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial \log f(x_i; \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_X$
where $\langle \cdot \rangle_X$ is the expectation of $\cdot$ given the data distribution $X$.
– Let's assume that there is no linear solution, i.e. these stationarity equations cannot be solved in closed form…
• Toy example (where a closed-form solution does exist; checked numerically in the sketch below):
$\frac{\partial E(X; \Theta)}{\partial \Theta} = \frac{\partial \log(\sigma\sqrt{2\pi})}{\partial \Theta} + \left\langle \frac{\partial}{\partial \Theta} \frac{(x-\mu)^2}{2\sigma^2} \right\rangle_X$
$\frac{\partial E(X; \Theta)}{\partial \mu} = -\left\langle \frac{x-\mu}{\sigma^2} \right\rangle_X = 0 \;\Rightarrow\; \mu = \langle x \rangle_X$
$\frac{\partial E(X; \Theta)}{\partial \sigma} = \frac{1}{\sigma} - \left\langle \frac{(x-\mu)^2}{\sigma^3} \right\rangle_X = 0 \;\Rightarrow\; \sigma = \sqrt{\left\langle (x-\mu)^2 \right\rangle_X}$
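A quick numerical check of the closed-form toy result; this is a sketch only, with synthetic data standing in for $X$ (an assumption, not from the slides).

# Sketch (assumed): the stationarity conditions give the usual closed-form estimates.
import numpy as np

X = np.random.default_rng(1).normal(3.0, 0.5, size=10_000)      # synthetic training data (illustrative)

mu_ml = X.mean()                                # mu = <x>_X
sigma_ml = np.sqrt(np.mean((X - mu_ml) ** 2))   # sigma = sqrt(<(x - mu)^2>_X)

print(mu_ml, sigma_ml)   # close to the generating parameters (3.0, 0.5)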
5. Gradient descent-based approach
– Move a fixed step size, $\eta$, in the direction of steepest gradient. (Not a line search – see why later.)
– This gives the following parameter update equation (applied to the toy example in the sketch below):
$\Theta_{t+1} = \Theta_t - \eta \frac{\partial E(X; \Theta_t)}{\partial \Theta_t} = \Theta_t - \eta \left( \frac{\partial \log Z(\Theta_t)}{\partial \Theta_t} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_X \right)$
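A minimal sketch of this update for the toy Gaussian, where both gradient terms are available in closed form; the step size $\eta$, the iteration count and the synthetic data are arbitrary illustrative choices, not prescriptions from the slides.

# Sketch (assumed): fixed-step gradient descent on the toy Gaussian, using the
# per-sample gradients derived on the previous slide.
import numpy as np

X = np.random.default_rng(2).normal(-1.0, 1.5, size=5_000)   # synthetic training data (illustrative)
mu, sigma = 0.0, 1.0        # initial Theta (illustrative)
eta = 0.1                   # fixed step size (illustrative)

for t in range(200):
    dE_dmu = -np.mean((X - mu) / sigma ** 2)                      # dE/dmu
    dE_dsigma = 1.0 / sigma - np.mean((X - mu) ** 2 / sigma ** 3)  # dE/dsigma
    mu, sigma = mu - eta * dE_dmu, sigma - eta * dE_dsigma

print(mu, sigma)   # converges towards the ML estimates (roughly -1.0 and 1.5)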
6. Gradient descent-based approach
– Recall $Z(\Theta) = \int f(x; \Theta)\,dx$. Sometimes this integral will be algebraically intractable.
– This means we can calculate neither $E(X; \Theta)$ nor $\frac{\partial \log Z(\Theta)}{\partial \Theta}$ (hence no line search).
– However, with some clever substitution:
$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x; \Theta)\,dx = \frac{1}{Z(\Theta)} \int \frac{\partial f(x; \Theta)}{\partial \Theta}\,dx = \frac{1}{Z(\Theta)} \int f(x; \Theta) \frac{\partial \log f(x; \Theta)}{\partial \Theta}\,dx$
$= \int p(x; \Theta) \frac{\partial \log f(x; \Theta)}{\partial \Theta}\,dx = \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{p(x; \Theta)}$
– so
$\Theta_{t+1} = \Theta_t - \eta \left( \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{p(x; \Theta_t)} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_X \right)$
where $\left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{p(x; \Theta)}$ can be estimated numerically (checked for the toy model in the sketch below).
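The substitution can be checked numerically for the toy Gaussian, where exact samples from $p(x; \Theta)$ are available. This sketch (an assumption, not from the slides) compares the Monte Carlo estimate of $\langle \partial \log f / \partial \sigma \rangle_{p(x;\Theta)}$ with $\partial \log Z / \partial \sigma = 1/\sigma$.

# Sketch (assumed): verify d log Z / d sigma = <d log f / d sigma>_{p(x; Theta)}
# for the toy Gaussian, where d log f / d sigma = (x - mu)^2 / sigma^3.
import numpy as np

mu, sigma = 0.5, 2.0                                       # illustrative parameters
rng = np.random.default_rng(3)
samples = rng.normal(mu, sigma, size=100_000)              # exact samples x ~ p(x; Theta)

mc_estimate = np.mean((samples - mu) ** 2 / sigma ** 3)    # <d log f / d sigma>_{p(x; Theta)}
print(mc_estimate, 1.0 / sigma)                            # both close to 0.5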
7. Markov Chain Monte Carlo sampling
– To estimate $\left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{p(x; \Theta)}$ we must draw samples from $p(x; \Theta)$.
– Since $Z(\Theta)$ is unknown, we cannot draw samples directly, e.g. by inverting the cumulative distribution.
– Markov Chain Monte Carlo (MCMC) methods turn random samples into samples from the proposed distribution, without knowing $Z(\Theta)$.
– Metropolis algorithm (sketched in code below):
• Perturb the samples, e.g. $x'_k = x_k + \mathrm{randn}(\mathrm{size}(x_k))$
• Reject $x'_k$ if $\frac{p(x'_k; \Theta)}{p(x_k; \Theta)} < \mathrm{rand}(1)$ - note that $Z(\Theta)$ cancels in this ratio, so only $f$ need ever be evaluated.
• Repeat the cycle for all samples until the distribution stabilizes.
– Stabilization takes many cycles, and there is no accurate criterion for determining when it has occurred.
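A minimal sketch of one Metropolis cycle (and its repetition) in NumPy, assuming a unit-variance Gaussian perturbation as on the slide; the target parameters, sample count and cycle count are illustrative assumptions. Only the unnormalised $f$ is evaluated.

# Sketch (assumed): Metropolis sampling using only the unnormalised f(x; Theta).
import numpy as np

def f(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

def metropolis_cycle(x, mu, sigma, rng):
    """Perturb every sample with Gaussian noise and accept/reject each move."""
    x_prop = x + rng.standard_normal(x.shape)                          # x'_k = x_k + randn(size(x_k))
    accept = f(x_prop, mu, sigma) / f(x, mu, sigma) >= rng.random(x.shape)  # Z(Theta) cancels in the ratio
    return np.where(accept, x_prop, x)

# Example: start from arbitrary points and run many cycles towards p(x; Theta).
rng = np.random.default_rng(4)
x = rng.uniform(-10, 10, size=5_000)
for _ in range(1_000):                 # "many cycles"; no reliable stopping criterion exists
    x = metropolis_cycle(x, mu=1.0, sigma=0.7, rng=rng)
print(x.mean(), x.std())               # approach 1.0 and 0.7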
8. Markov Chain Monte Carlo sampling
– Let us use the training data, $X$, as the starting point for our MCMC sampling.
– Our parameter update equation becomes:
$\Theta_{t+1} = \Theta_t - \eta \left( \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^{\infty}_{\Theta_t}} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^{0}_{\Theta_t}} \right)$
– Notation (see the helper sketch below): $X^{0}_{\Theta}$ - the training data, $X^{n}_{\Theta}$ - the training data after $n$ cycles of MCMC, $X^{\infty}_{\Theta}$ - samples from the proposed distribution with parameters $\Theta$.
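A sketch of this notation in code: $X^{n}_{\Theta}$ is obtained by running $n$ Metropolis cycles starting from the training data, and $X^{\infty}_{\Theta}$ is approximated by taking $n$ large. The helper functions repeat the previous sketch so this block runs on its own; all names and constants are illustrative assumptions.

# Sketch (assumed): X^n_Theta = training data after n cycles of MCMC under Theta.
import numpy as np

def f(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

def metropolis_cycle(x, mu, sigma, rng):
    x_prop = x + rng.standard_normal(x.shape)
    accept = f(x_prop, mu, sigma) / f(x, mu, sigma) >= rng.random(x.shape)
    return np.where(accept, x_prop, x)

def mcmc_from_data(X, mu, sigma, n, rng):
    """Return X^n_Theta: the training data X after n cycles of MCMC under Theta."""
    x = X.copy()                      # X^0_Theta is the training data itself
    for _ in range(n):
        x = metropolis_cycle(x, mu, sigma, rng)
    return x

rng = np.random.default_rng(6)
X = rng.normal(2.0, 0.8, size=5_000)                               # X^0_Theta (illustrative data)
X_inf = mcmc_from_data(X, mu=0.0, sigma=1.0, n=500, rng=rng)       # ~ X^infinity_Theta for large n
print(X.mean(), X_inf.mean())                                      # data mean (~2.0) vs. model mean (~0.0)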
9. Contrastive divergence
– Let us make the number of MCMC cycles per iteration small, say even 1.
– Our parameter update equation is now (demonstrated in the sketch below):
$\Theta_{t+1} = \Theta_t - \eta \left( \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^{1}_{\Theta_t}} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^{0}_{\Theta_t}} \right)$
– Intuition: one MCMC cycle is enough to move the data from the target distribution towards the proposed distribution, and so suggests the direction in which the proposed distribution should move to better model the training data.
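A minimal CD-1 sketch for the toy Gaussian, reusing the Metropolis cycle from the earlier sketch and the closed-form $\partial \log f / \partial \Theta$; $\eta$, the iteration count and the synthetic data are arbitrary illustrative assumptions, not prescriptions from the slides.

# Sketch (assumed): CD-1 learning of the toy Gaussian -- one Metropolis cycle
# per parameter update, always restarted from the training data.
import numpy as np

def f(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

def metropolis_cycle(x, mu, sigma, rng):
    x_prop = x + rng.standard_normal(x.shape)
    accept = f(x_prop, mu, sigma) / f(x, mu, sigma) >= rng.random(x.shape)
    return np.where(accept, x_prop, x)

def grads(x, mu, sigma):
    """<d log f / d mu> and <d log f / d sigma>, averaged over the samples x."""
    return np.mean((x - mu) / sigma ** 2), np.mean((x - mu) ** 2 / sigma ** 3)

rng = np.random.default_rng(5)
X0 = rng.normal(2.0, 0.8, size=5_000)      # training data X^0 (illustrative)
mu, sigma = 0.0, 1.0                        # initial Theta (illustrative)
eta = 0.05                                  # step size (illustrative)

for t in range(2_000):
    X1 = metropolis_cycle(X0, mu, sigma, rng)      # X^1_Theta: one MCMC cycle from the data
    g1_mu, g1_sigma = grads(X1, mu, sigma)         # <d log f / d Theta>_{X^1}
    g0_mu, g0_sigma = grads(X0, mu, sigma)         # <d log f / d Theta>_{X^0}
    mu -= eta * (g1_mu - g0_mu)
    sigma -= eta * (g1_sigma - g0_sigma)

print(mu, sigma)   # should drift towards the data statistics (roughly 2.0 and 0.8); CD-1 may retain a small bias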
10. Contrastive divergence bias
– We assume:
$\frac{\partial E(X; \Theta)}{\partial \Theta} \approx \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^{1}_{\Theta}} - \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^{0}_{\Theta}}$
– ML learning is equivalent to minimizing $X^{0}_{\Theta} \,\|\, X^{\infty}_{\Theta}$, where $P \| Q = \int p(x) \log \frac{p(x)}{q(x)}\,dx$ (the Kullback-Leibler divergence).
– CD instead attempts to minimize $X^{0}_{\Theta} \,\|\, X^{\infty}_{\Theta} - X^{1}_{\Theta} \,\|\, X^{\infty}_{\Theta}$, since
$\frac{\partial}{\partial \Theta}\left( X^{0}_{\Theta} \,\|\, X^{\infty}_{\Theta} - X^{1}_{\Theta} \,\|\, X^{\infty}_{\Theta} \right) = \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^{1}_{\Theta}} - \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^{0}_{\Theta}} - \frac{\partial X^{1}_{\Theta}}{\partial \Theta} \frac{\partial \left( X^{1}_{\Theta} \,\|\, X^{\infty}_{\Theta} \right)}{\partial X^{1}_{\Theta}}$
– Usually $\frac{\partial X^{1}_{\Theta}}{\partial \Theta} \frac{\partial \left( X^{1}_{\Theta} \,\|\, X^{\infty}_{\Theta} \right)}{\partial X^{1}_{\Theta}} \approx 0$, but the neglected term can sometimes bias the results.
– See “On Contrastive Divergence Learning”, Carreira-Perpinan & Hinton, AISTATS 2005, for more details.