Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.




                                    Gaussians
                                      Andrew W. Moore
                                           Professor
                                  School of Computer Science
                                  Carnegie Mellon University
                                        www.cs.cmu.edu/~awm
                                          awm@cs.cmu.edu
                                            412-268-7599



Copyright © Andrew W. Moore                                                                                    Slide 1




                              Gaussians in Data Mining
    •     Why we should care
    •     The entropy of a PDF
    •     Univariate Gaussians
    •     Multivariate Gaussians
    •     Bayes Rule and Gaussians
    •     Maximum Likelihood and MAP using
          Gaussians


Copyright © Andrew W. Moore                                                                                    Slide 2




Why we should care
    • Gaussians are as natural as Orange Juice
      and Sunshine
    • We need them to understand Bayes Optimal
      Classifiers
    • We need them to understand regression
    • We need them to understand neural nets
    • We need them to understand mixture
      models
    •…
                                (You get the idea)
Copyright © Andrew W. Moore                                            Slide 3




                 The "box" distribution

                     p(x) = 1/w   if |x| ≤ w/2
                          = 0     if |x| > w/2

             [figure: uniform density of height 1/w on the interval (-w/2, w/2)]




Copyright © Andrew W. Moore                                            Slide 4




                 The "box" distribution

                     p(x) = 1/w   if |x| ≤ w/2
                          = 0     if |x| > w/2

             [figure: uniform density of height 1/w on the interval (-w/2, w/2)]

                     E[X] = 0        Var[X] = w²/12
Copyright © Andrew W. Moore                                                              Slide 5




                                  Entropy of a PDF

                                            ∞
             Entropy of X = H[X] = −        ∫ p(x) log p(x) dx        (natural log: ln, i.e. log base e)
                                          x=−∞

             The larger the entropy of a distribution…
             …the harder it is to predict
             …the harder it is to compress it
             …the less spiky the distribution
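The integral above is easy to approximate numerically. The sketch below (an illustration added to these notes, not part of the original deck) evaluates H[X] for the box distribution of width w, for which the integral works out to log w:

```python
import numpy as np

def differential_entropy(p, lo, hi, n=1_000_001):
    """Approximate H[X] = -integral of p(x) ln p(x) dx over [lo, hi] (natural log)."""
    x = np.linspace(lo, hi, n)
    px = p(x)
    safe = np.where(px > 0, px, 1.0)      # treat 0 * log 0 as 0
    return float(np.sum(-px * np.log(safe)) * (x[1] - x[0]))

w = 2.0
box = lambda x: np.where(np.abs(x) <= w / 2, 1.0 / w, 0.0)
H_box = differential_entropy(box, -w / 2, w / 2)
print(H_box, np.log(w))                    # both ≈ 0.6931
```

The same routine works for any density given on a grid, so the entropy comparisons on the following slides are easy to reproduce.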



Copyright © Andrew W. Moore                                                              Slide 6




                 The "box" distribution

                     p(x) = 1/w   if |x| ≤ w/2
                          = 0     if |x| > w/2

             [figure: uniform density of height 1/w on the interval (-w/2, w/2)]

                        ∞                          w/2
             H[X] = −   ∫ p(x) log p(x) dx = −     ∫ (1/w) log(1/w) dx = log w
                      x=−∞                       x=−w/2


Copyright © Andrew W. Moore                                                                     Slide 7




         Unit variance box distribution

                     p(x) = 1/w   if |x| ≤ w/2
                          = 0     if |x| > w/2

                     E[X] = 0        Var[X] = w²/12

             [figure: uniform density of height 1/(2√3) on the interval (-√3, √3)]

    If w = 2√3 then Var[X] = 1 and H[X] = log(2√3) ≈ 1.242
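Both numbers are easy to check (a sketch added here, not in the original slides): draw uniformly from (-√3, √3), compare the sample variance with 1, and evaluate log(2√3):

```python
import numpy as np

rng = np.random.default_rng(0)
w = 2 * np.sqrt(3)                         # the width that gives unit variance
samples = rng.uniform(-w / 2, w / 2, size=1_000_000)

var = samples.var()                        # should be ≈ w**2 / 12 = 1
H = np.log(w)                              # box entropy: H[X] = log w
print(var, H)                              # ≈ 1.0 and ≈ 1.2425
```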

Copyright © Andrew W. Moore                                                                     Slide 8




                 The "hat" distribution

                     p(x) = (w − |x|)/w²   if |x| ≤ w
                          = 0              if |x| > w

                     E[X] = 0        Var[X] = w²/6

             [figure: triangular density peaking at height 1/w at x = 0, with support (-w, w)]




Copyright © Andrew W. Moore                                         Slide 9




      Unit variance hat distribution

                     p(x) = (w − |x|)/w²   if |x| ≤ w
                          = 0              if |x| > w

                     E[X] = 0        Var[X] = w²/6

             [figure: triangular density peaking at height 1/√6 at x = 0, with support (-√6, √6)]

      If w = √6 then Var[X] = 1 and H[X] ≈ 1.396
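This too can be verified numerically (an added illustration). For the hat density the entropy integral evaluates in closed form to log w + 1/2, which at w = √6 gives ≈ 1.396:

```python
import numpy as np

w = np.sqrt(6)                             # the width that gives unit variance
x = np.linspace(-w, w, 1_000_001)
dx = x[1] - x[0]
p = (w - np.abs(x)) / w**2                 # hat density on (-w, w)

var = float(np.sum(x**2 * p) * dx)         # E[X] = 0, so Var = E[X^2] ≈ 1
mask = p > 0
H = float(np.sum(-p[mask] * np.log(p[mask])) * dx)
print(var, H)                              # ≈ 1.0 and ≈ 1.396
```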

Copyright © Andrew W. Moore                                        Slide 10




           The "2 spikes" distribution (Dirac deltas)

                     p(x) = ( δ(x+1) + δ(x−1) ) / 2

                     E[X] = 0        Var[X] = 1

             [figure: two spikes of weight 1/2 each, at x = -1 and x = 1]

                               ∞
                    H[X] = −   ∫ p(x) log p(x) dx = −∞
                             x=−∞


Copyright © Andrew W. Moore                                                                                 Slide 11




                      Entropies of unit-variance distributions

                              Distribution      Entropy
                              Box               1.242
                              Hat               1.396
                              2 spikes          −∞
                              ???               1.4189    ← largest possible entropy of
                                                            any unit-variance distribution



Copyright © Andrew W. Moore                                                                                 Slide 12




              Unit variance Gaussian

                     p(x) = (1/√(2π)) · exp(−x²/2)

                     E[X] = 0        Var[X] = 1

                               ∞
                    H[X] = −   ∫ p(x) log p(x) dx = 1.4189
                             x=−∞


Copyright © Andrew W. Moore                                                                 Slide 13
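The mystery value 1.4189 is ½·log(2πe), the closed-form entropy of the unit-variance Gaussian. The quick check below (an added illustration) confirms it both symbolically and by numeric integration:

```python
import numpy as np

H_closed = 0.5 * np.log(2 * np.pi * np.e)     # entropy of N(0,1) in closed form

x = np.linspace(-10, 10, 1_000_001)           # tails beyond ±10 are negligible
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
H_num = float(np.sum(-p * np.log(p)) * dx)

print(H_closed, H_num)                        # both ≈ 1.4189
```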




                      General Gaussian

                     p(x) = (1/(√(2π)·σ)) · exp( −(x − μ)² / (2σ²) )

                     E[X] = μ        Var[X] = σ²

             [figure: bell-shaped curve with μ = 100, σ = 15]
Copyright © Andrew W. Moore                                                                 Slide 14




                      General Gaussian
                      (also known as the normal distribution or bell-shaped curve)

                     p(x) = (1/(√(2π)·σ)) · exp( −(x − μ)² / (2σ²) )

                     E[X] = μ        Var[X] = σ²

             [figure: bell-shaped curve with μ = 100, σ = 15]

       Shorthand: We say X ~ N(μ,σ²) to mean "X is distributed as a Gaussian
       with parameters μ and σ²".
       In the above figure, X ~ N(100, 15²)
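The shorthand translates directly into code (an added sketch; μ = 100, σ = 15 are just the figure's example values). Note that numpy parameterizes by the standard deviation, not the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=15.0, size=1_000_000)   # X ~ N(100, 15^2)

print(x.mean(), x.std())                                # ≈ 100 and ≈ 15
```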


Copyright © Andrew W. Moore                                                                    Slide 15




                                        The Error Function
             Assume X ~ N(0,1)
             Define ERF(x) = P(X<x) = Cumulative Distribution of X

                              x
             ERF(x) =         ∫ p(z) dz
                            z=−∞

                                          x
                      = (1/√(2π)) ·       ∫ exp(−z²/2) dz
                                        z=−∞




Copyright © Andrew W. Moore                                                                    Slide 16




                          Using The Error Function
             Assume X ~ N(μ,σ²)

             P(X<x | μ,σ²) = ERF( (x − μ) / σ )

             (We standardize by dividing by the standard deviation σ, not the variance σ².)
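In standard-library terms (an added sketch): what these slides call ERF is the standard normal CDF, usually written Φ, which relates to the textbook erf function by Φ(z) = ½(1 + erf(z/√2)). The general case standardizes by σ, the standard deviation:

```python
import math

def phi(z):
    """Standard normal CDF -- what these slides call ERF(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2), standardizing by sigma (the std dev)."""
    return phi((x - mu) / sigma)

print(phi(0.0))                   # 0.5
print(normal_cdf(130, 100, 15))   # Φ(2) ≈ 0.9772
```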




Copyright © Andrew W. Moore                                   Slide 17




                   The Central Limit Theorem
    • If (X1, X2, …, Xn) are i.i.d. continuous random
      variables
                                                     n
    • Then define z = f(x1, x2, …, xn) = (1/n) ·     Σ xi
                                                    i=1

    • As n → ∞, p(z) approaches a Gaussian with mean
      E[Xi] and variance Var[Xi]/n

                       Somewhat of a justification for assuming
                                   Gaussian noise is common
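A quick simulation of the theorem (an added illustration; uniform(0,1) is an arbitrary choice of Xi): averages of n draws cluster around E[Xi] = 1/2 with variance Var[Xi]/n = 1/(12n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000
z = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)   # sample means

print(z.mean(), z.var(), 1 / (12 * n))   # mean ≈ 0.5, var ≈ 1/(12n)
```

A histogram of z (not shown) looks strikingly bell-shaped even at modest n.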
Copyright © Andrew W. Moore                                   Slide 18




Other amazing facts about
                           Gaussians
    • Wouldn’t you like to know?




    • We will not examine them until we need to.




Copyright © Andrew W. Moore                                                                      Slide 19




                                  Bivariate Gaussians

           Write r.v. X = (X, Y)ᵀ. Then define X ~ N(μ, Σ) to mean

                     p(x) = (1 / (2π ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

         Where the Gaussian's parameters are…

                     μ = ( μx )        Σ = ( σ²x   σxy )
                         ( μy )            ( σxy   σ²y )

           Where we insist that Σ is symmetric non-negative definite




Copyright © Andrew W. Moore                                                                      Slide 20




                                  Bivariate Gaussians

           Write r.v. X = (X, Y)ᵀ. Then define X ~ N(μ, Σ) to mean

                     p(x) = (1 / (2π ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

         Where the Gaussian's parameters are…

                     μ = ( μx )        Σ = ( σ²x   σxy )
                         ( μy )            ( σxy   σ²y )

            Where we insist that Σ is symmetric non-negative definite

            It turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a
            resulting property of Gaussians, not a definition)*

                                                                   *This note rates 7.4 on the pedanticness scale


Copyright © Andrew W. Moore                                                                                         Slide 21




    Evaluating p(x): Step 1

                     p(x) = (1 / (2π ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

       1.     Begin with vector x

              [figure: the point x and the mean μ in the plane]




Copyright © Andrew W. Moore                                                                                         Slide 22




    Evaluating p(x): Step 2

                     p(x) = (1 / (2π ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

       1.     Begin with vector x
       2.     Define δ = x − μ

              [figure: the vector δ drawn from μ to x]




Copyright © Andrew W. Moore                                                                               Slide 23




    Evaluating p(x): Step 3

                     p(x) = (1 / (2π ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

       1.     Begin with vector x
       2.     Define δ = x − μ
       3.     Count the number of contours crossed of the
              ellipsoids formed by Σ⁻¹.
              D = this count = sqrt(δᵀΣ⁻¹δ)
              = Mahalanobis distance between x and μ

              [figure: elliptical contours around μ, defined by sqrt(δᵀΣ⁻¹δ) = constant]




Copyright © Andrew W. Moore                                                                               Slide 24




    Evaluating p(x): Step 4

                     p(x) = (1 / (2π ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

       1.     Begin with vector x
       2.     Define δ = x − μ
       3.     Count the number of contours crossed of the
              ellipsoids formed by Σ⁻¹.
              D = this count = sqrt(δᵀΣ⁻¹δ)
              = Mahalanobis distance between x and μ
       4.     Define w = exp(−D²/2)

              [figure: the curve w = exp(−D²/2) plotted against D²]

              x close to μ in squared Mahalanobis space gets a large weight;
              far away gets a tiny weight

Copyright © Andrew W. Moore                                                                                     Slide 25




    Evaluating p(x): Step 5

                     p(x) = (1 / (2π ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

       1.     Begin with vector x
       2.     Define δ = x − μ
       3.     Count the number of contours crossed of the
              ellipsoids formed by Σ⁻¹.
              D = this count = sqrt(δᵀΣ⁻¹δ)
              = Mahalanobis distance between x and μ
       4.     Define w = exp(−D²/2)
       5.     Multiply w by 1/(2π ||Σ||^(1/2)) to ensure ∫ p(x) dx = 1
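The five steps translate directly into code (a sketch added here; the example μ and Σ are arbitrary). The normalizer is written in its general m-dimensional form (2π)^(m/2)·||Σ||^(1/2), which reduces to 2π·||Σ||^(1/2) in the bivariate case:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Evaluate the multivariate Gaussian density following the slides' five steps."""
    delta = x - mu                                    # step 2: offset from the mean
    D2 = float(delta @ np.linalg.inv(sigma) @ delta)  # step 3: squared Mahalanobis distance
    w = np.exp(-D2 / 2)                               # step 4: unnormalized weight
    m = len(mu)
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(sigma))
    return w / norm                                   # step 5: normalize so p integrates to 1

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
print(gaussian_pdf(np.array([1.0, -0.5]), mu, sigma))
print(gaussian_pdf(mu, mu, sigma))       # the peak: 1 / (2π ||Σ||^(1/2))
```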




Copyright © Andrew W. Moore                                                                                     Slide 26




                               Example

     [figure: contour plot of a bivariate Gaussian]

     Observe: the mean, the principal axes, the implication of the
     off-diagonal covariance term, and the zone of maximum gradient of p(x).

     Common convention: show the contour corresponding to 2 standard
     deviations from the mean

Copyright © Andrew W. Moore                                            Slide 27




                               Example




Copyright © Andrew W. Moore                                            Slide 28




Example




       In this example, x and y are almost independent




Copyright © Andrew W. Moore                                       Slide 29




                                 Example




       In this example, x and “x+y” are clearly not independent




Copyright © Andrew W. Moore                                       Slide 30




Example




       In this example, x and “20x+y” are clearly not independent




Copyright © Andrew W. Moore                                                                                     Slide 31




                              Multivariate Gaussians

        Write r.v. X = (X1, X2, …, Xm)ᵀ. Then define X ~ N(μ, Σ) to mean

                     p(x) = (1 / ((2π)^(m/2) ||Σ||^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

           Where the Gaussian's parameters have…

                     μ = ( μ1 )        Σ = ( σ²1   σ12   …   σ1m )
                         ( μ2 )            ( σ12   σ²2   …   σ2m )
                         (  ⋮ )            (  ⋮     ⋮    ⋱    ⋮  )
                         ( μm )            ( σ1m   σ2m   …   σ²m )

       Where we insist that Σ is symmetric non-negative definite
       Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition)
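The "resulting property" is easy to see empirically (an added sketch with an arbitrary 3-dimensional example): sample from N(μ, Σ) and compare the sample mean and covariance with the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
print(X.mean(axis=0))            # ≈ mu
print(np.cov(X, rowvar=False))   # ≈ Sigma
```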

Copyright © Andrew W. Moore                                                                                     Slide 32




                               General Gaussians

                     μ = ( μ1 )        Σ = ( σ²1   σ12   …   σ1m )
                         ( μ2 )            ( σ12   σ²2   …   σ2m )
                         (  ⋮ )            (  ⋮     ⋮    ⋱    ⋮  )
                         ( μm )            ( σ1m   σ2m   …   σ²m )

             [figure: tilted elliptical contours in the (x1, x2) plane]
Copyright © Andrew W. Moore                                                                    Slide 33




                              Axis-Aligned Gaussians

                     μ = (μ1, μ2, …, μm)ᵀ        Σ = diag( σ²1, σ²2, σ²3, …, σ²m−1, σ²m )

      Xi ⊥ Xj for i ≠ j   (the components are independent)

             [figure: axis-aligned elliptical contours in the (x1, x2) plane]
Copyright © Andrew W. Moore                                                                    Slide 34




Spherical Gaussians

$$\mu=\begin{pmatrix}\mu_1\\ \mu_2\\ \vdots\\ \mu_m\end{pmatrix}\qquad
\Sigma=\sigma^2 I=\begin{pmatrix}\sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma^2\end{pmatrix}$$

X_i ⊥ X_j for i ≠ j

[Figure: circular contours in the (x1, x2) plane]
Copyright © Andrew W. Moore                                                                        Slide 35




Degenerate Gaussians

$$\mu=\begin{pmatrix}\mu_1\\ \mu_2\\ \vdots\\ \mu_m\end{pmatrix}\qquad |\Sigma|=0$$

(The covariance matrix is singular: all of the probability mass lies on a lower-dimensional subspace.)

“What’s so wrong with clipping one’s toenails in public?”

[Figure: contours collapsed onto a line in the (x1, x2) plane]
Copyright © Andrew W. Moore                                                                        Slide 36




Where are we now?
    • We’ve seen the formulae for Gaussians
    • We have an intuition of how they behave
    • We have some experience of “reading” a
      Gaussian’s covariance matrix

    • Coming next:
                                Some useful tricks with Gaussians


Copyright © Andrew W. Moore                                               Slide 37




Subsets of variables

Write

$$X=\begin{pmatrix}X_1\\ X_2\\ \vdots\\ X_m\end{pmatrix}\ \text{as}\ X=\begin{pmatrix}U\\ V\end{pmatrix}\ \text{where}\ U=\begin{pmatrix}X_1\\ \vdots\\ X_{m(u)}\end{pmatrix},\quad V=\begin{pmatrix}X_{m(u)+1}\\ \vdots\\ X_m\end{pmatrix}$$

This will be our standard notation for breaking an m-dimensional distribution into subsets of variables.
Copyright © Andrew W. Moore                                               Slide 38




Gaussian Marginals are Gaussian

[Operation: (U; V)  →[Marginalize]→  U]

Write X = (X₁, X₂, …, X_m)ᵀ as X = (U; V) where U = (X₁, …, X_{m(u)})ᵀ and V = (X_{m(u)+1}, …, X_m)ᵀ.

IF

$$\begin{pmatrix}U\\ V\end{pmatrix}\sim N\!\left(\begin{pmatrix}\mu_u\\ \mu_v\end{pmatrix},\ \begin{pmatrix}\Sigma_{uu} & \Sigma_{uv}\\ \Sigma_{uv}^T & \Sigma_{vv}\end{pmatrix}\right)$$

THEN U is also distributed as a Gaussian:

U ~ N(μu, Σuu)
Copyright © Andrew W. Moore                                                   Slide 39
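The marginalization fact can be checked empirically. This sketch (with made-up numbers, not from the slides) samples a 3-dimensional Gaussian and confirms that simply dropping the V coordinate leaves the remaining coordinates distributed as N(μu, Σuu):

```python
import numpy as np

# Hypothetical joint over (U, V): first two coordinates = U, last = V.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 3.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])
X = rng.multivariate_normal(mu, Sigma, size=200_000)

U = X[:, :2]                  # marginalize: just drop the V column
print(U.mean(axis=0))         # ~ [1, -2]       (= mu_u)
print(np.cov(U.T))            # ~ [[4, 1],
                              #    [1, 3]]      (= Sigma_uu)
```

Note that the marginal parameters are read straight off the joint ones; no integral needs to be computed.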




Gaussian Marginals are Gaussian (continued)

Same statement as the previous slide, with two annotations:
“This fact is not immediately obvious” … “Obvious, once we know it’s a Gaussian (why?)”
Copyright © Andrew W. Moore                                                   Slide 40




Gaussian Marginals are Gaussian (continued)

How would you prove this?

p(u) = ∫ᵥ p(u, v) dv = (snore…)

i.e. one can grind through the marginalization integral directly.
Copyright © Andrew W. Moore                                                              Slide 41




Linear Transforms remain Gaussian

[Operation: X  →[Multiply by matrix A]→  AX]

Assume X is an m-dimensional Gaussian r.v.:

X ~ N(μ, Σ)

Define Y to be a p-dimensional r.v. thusly (note p ≤ m):

Y = AX

…where A is a p × m matrix. Then…

Y ~ N(Aμ, AΣAᵀ)

Note: the “subset” result is a special case of this result.
Copyright © Andrew W. Moore                                                              Slide 42
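A quick numerical sanity check of the transform rule; the matrix A, the mean, and the covariance below are arbitrary illustrative values of mine, not from the slides:

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],      # p = 2, m = 3  (p <= m)
              [0.0,  2.0, 1.0]])

rng = np.random.default_rng(1)
Y = rng.multivariate_normal(mu, Sigma, size=200_000) @ A.T   # Y = A X

print(Y.mean(axis=0))     # ~ A @ mu = [-1, 7]
print(np.cov(Y.T))        # ~ A @ Sigma @ A.T
```

The empirical moments of Y match Aμ and AΣAᵀ up to sampling noise.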




                                                                                                    21
Adding samples of 2 independent Gaussians is Gaussian

[Operation: X, Y  →[+]→  X + Y]

if X ~ N(μx, Σx) and Y ~ N(μy, Σy) and X ⊥ Y
then X + Y ~ N(μx + μy, Σx + Σy)

Why doesn’t this hold if X and Y are dependent?
Which of the below statements is true?
• If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance
• If X and Y are dependent, then X+Y might be non-Gaussian
Copyright © Andrew W. Moore                                                             Slide 43
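An illustrative check of the addition rule in one dimension (the specific means and variances are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.normal(loc=3.0, scale=2.0, size=n)     # N(3, 2^2)
Y = rng.normal(loc=-1.0, scale=1.5, size=n)    # N(-1, 1.5^2), drawn independently
Z = X + Y

print(Z.mean())   # ~ 3 + (-1)    = 2
print(Z.var())    # ~ 2^2 + 1.5^2 = 6.25
```

As a hint for the quiz above: the second statement is the true one. For instance, with X ~ N(0,1) and Y = S·X for an independent random sign S, Y is also N(0,1), but X+Y equals zero half the time and 2X otherwise, so it cannot be Gaussian.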




Conditional of Gaussian is Gaussian

[Operation: (U; V)  →[Conditionalize]→  U | V]

IF

$$\begin{pmatrix}U\\ V\end{pmatrix}\sim N\!\left(\begin{pmatrix}\mu_u\\ \mu_v\end{pmatrix},\ \begin{pmatrix}\Sigma_{uu} & \Sigma_{uv}\\ \Sigma_{uv}^T & \Sigma_{vv}\end{pmatrix}\right)$$

THEN U | V ~ N(μu|v, Σu|v) where

$$\mu_{u|v}=\mu_u+\Sigma_{uv}\Sigma_{vv}^{-1}(V-\mu_v)$$

$$\Sigma_{u|v}=\Sigma_{uu}-\Sigma_{uv}\Sigma_{vv}^{-1}\Sigma_{uv}^{T}$$
Copyright © Andrew W. Moore                                                             Slide 44
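The conditioning rule is easy to wrap in a small helper. This is a sketch, not part of the deck; the numeric check plugs in the (w, y) example that the following slides work through, where the conditional standard deviation comes out just under 808:

```python
import numpy as np

def condition_gaussian(mu_u, mu_v, Suu, Suv, Svv, v):
    """Conditioning a joint Gaussian:
    mu_{u|v}    = mu_u + Suv Svv^-1 (v - mu_v)
    Sigma_{u|v} = Suu  - Suv Svv^-1 Suv^T
    """
    K = Suv @ np.linalg.inv(Svv)        # regression coefficients of U on V
    return mu_u + K @ (v - mu_v), Suu - K @ Suv.T

# (w, y) ~ N((2977, 76), [[849^2, -967], [-967, 3.68^2]]), observe y = 80
mu_c, Sig_c = condition_gaussian(
    np.array([2977.0]), np.array([76.0]),
    np.array([[849.0 ** 2]]), np.array([[-967.0]]), np.array([[3.68 ** 2]]),
    np.array([80.0]))
print(mu_c, np.sqrt(Sig_c))
```

Note that only the conditional mean depends on the observed v; the conditional covariance is fixed in advance.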




The conditioning rule, applied to a concrete example:

IF

$$\begin{pmatrix}w\\ y\end{pmatrix}\sim N\!\left(\begin{pmatrix}2977\\ 76\end{pmatrix},\ \begin{pmatrix}849^2 & -967\\ -967 & 3.68^2\end{pmatrix}\right)$$

THEN w | y ~ N(μw|y, Σw|y) where

$$\mu_{w|y}=2977-\frac{967\,(y-76)}{3.68^2}$$

$$\Sigma_{w|y}=849^2-\frac{967^2}{3.68^2}\approx 808^2$$
Copyright © Andrew W. Moore                                                                                    Slide 45




The same example, with the resulting densities pictured:

[Figure: the marginal P(w) together with the conditionals P(w|m=76) and P(w|m=82)]
Copyright © Andrew W. Moore                                                                                    Slide 46




The same example once more, with observations:

Note: when the given value of v is μv, the conditional mean of u is μu.
Note: the conditional mean is a linear function of v.
Note: the conditional variance is independent of the given value of v.
Note: the conditional variance can only be equal to or smaller than the marginal variance.

[Figure: the marginal P(w) together with the conditionals P(w|m=76) and P(w|m=82)]
Copyright © Andrew W. Moore                                                                                  Slide 47




Gaussians and the chain rule

[Operation: U | V, V  →[Chain Rule]→  (U; V)]

Let A be a constant matrix.

IF U | V ~ N(AV, Σu|v) and V ~ N(μv, Σvv)

THEN (U; V) ~ N(μ, Σ), with

$$\mu=\begin{pmatrix}A\mu_v\\ \mu_v\end{pmatrix}\qquad
\Sigma=\begin{pmatrix}A\Sigma_{vv}A^T+\Sigma_{u|v} & A\Sigma_{vv}\\ (A\Sigma_{vv})^T & \Sigma_{vv}\end{pmatrix}$$
Copyright © Andrew W. Moore                                                                                  Slide 48
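The block matrix above can be sanity-checked by sampling. In this sketch (scalar case, with numbers of my own choosing) we draw V, then U given V, and compare the empirical joint covariance with the theoretical one:

```python
import numpy as np

rng = np.random.default_rng(3)
A, mu_v, Svv, Su_v = 2.0, 1.0, 4.0, 0.5     # V ~ N(1, 4); U | V ~ N(2V, 0.5)
V = rng.normal(mu_v, np.sqrt(Svv), size=200_000)
U = rng.normal(A * V, np.sqrt(Su_v))        # one conditional draw per sampled V

print(np.cov(U, V))
# theory: [[A^2*Svv + Su_v, A*Svv],
#          [A*Svv,          Svv  ]] = [[16.5, 8], [8, 4]]
```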




Available Gaussian tools

• Marginalize:  IF (U; V) ~ N((μu; μv), [[Σuu, Σuv], [Σuvᵀ, Σvv]])  THEN  U ~ N(μu, Σuu)

• Multiply:  IF X ~ N(μ, Σ) AND Y = AX  THEN  Y ~ N(Aμ, AΣAᵀ)

• Add:  if X ~ N(μx, Σx) and Y ~ N(μy, Σy) and X ⊥ Y  then  X + Y ~ N(μx + μy, Σx + Σy)

• Conditionalize:  IF (U; V) ~ N((μu; μv), [[Σuu, Σuv], [Σuvᵀ, Σvv]])  THEN  U | V ~ N(μu|v, Σu|v)
  where μu|v = μu + Σuv Σvv⁻¹ (V − μv)  and  Σu|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ

• Chain Rule:  IF U | V ~ N(AV, Σu|v) and V ~ N(μv, Σvv)  THEN  (U; V) ~ N(μ, Σ)
  with μ = (Aμv; μv)  and  Σ = [[AΣvvAᵀ + Σu|v, AΣvv], [(AΣvv)ᵀ, Σvv]]
Copyright © Andrew W. Moore                                                                                        Slide 49




                                        Assume…
    • You are an intellectual snob
    • You have a child




Copyright © Andrew W. Moore                                                                                        Slide 50




Intellectual snobs with children
    • …are obsessed with IQ
• In the world as a whole, IQs are drawn from
  a Gaussian N(100, 15²)




Copyright © Andrew W. Moore                            Slide 51




                                    IQ tests
    • If you take an IQ test you’ll get a score that,
      on average (over many tests) will be your
      IQ
    • But because of noise on any one test the
      score will often be a few points lower or
      higher than your true IQ.

SCORE | IQ ~ N(IQ, 10²)


Copyright © Andrew W. Moore                            Slide 52




Assume…
    • You drag your kid off to get tested
    • She gets a score of 130
    • “Yippee” you screech and start deciding how
      to casually refer to her membership of the
      top 2% of IQs in your Christmas newsletter.

P(X<130 | μ=100, σ²=15²) =
P(X<2 | μ=0, σ²=1) =
Φ(2) ≈ 0.977




Copyright © Andrew W. Moore                                      Slide 53




Assume…
• You drag your kid off to get tested
• She gets a score of 130
• “Yippee” you screech and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.

You are thinking: Well sure the test isn’t accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence “score=130” is, of course, 130.

P(X<130 | μ=100, σ²=15²) = P(X<2 | μ=0, σ²=1) = Φ(2) ≈ 0.977

Can we trust this reasoning?
Copyright © Andrew W. Moore                                      Slide 54




Maximum Likelihood IQ
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130

IQ^mle = arg max_iq p(s = 130 | iq)


    • The MLE is the value of the hidden parameter that
      makes the observed data most likely
    • In this case
IQ^mle = 130

Copyright © Andrew W. Moore                                                                Slide 55




                                                  BUT….
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130

IQ^mle = arg max_iq p(s = 130 | iq)


    • The MLE is the value of the hidden parameter that
      makes the observed data most likely
• In this case

IQ^mle = 130

This is not the same as “The most likely value of the parameter given the observed data”.

Copyright © Andrew W. Moore                                                                Slide 56




What we really want:
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
    • S=130

    • Question: What is
      IQ | (S=130)?



             Called the Posterior
              Distribution of IQ

Copyright © Andrew W. Moore                                            Slide 57




Which tool or tools?
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130

• Question: What is IQ | (S=130)?

[The toolbox from Slide 49: Marginalize, Multiply, Add, Conditionalize, Chain Rule]
Copyright © Andrew W. Moore                                            Slide 58




Plan
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130

• Question: What is IQ | (S=130)?

S | IQ, IQ  →[Chain Rule]→  (S; IQ)  →[Swap]→  (IQ; S)  →[Conditionalize]→  IQ | S

Copyright © Andrew W. Moore                                                                                        Slide 59




Working…

IQ ~ N(100, 15²);  S | IQ ~ N(IQ, 10²);  S = 130
Question: What is IQ | (S=130)?

Tools needed:

• Conditionalize:  IF (U; V) ~ N((μu; μv), [[Σuu, Σuv], [Σuvᵀ, Σvv]])
  THEN U | V is Gaussian with μu|v = μu + Σuv Σvv⁻¹ (V − μv)

• Chain Rule:  IF U | V ~ N(AV, Σu|v) and V ~ N(μv, Σvv)
  THEN (U; V) ~ N(μ, Σ) with Σ = [[AΣvvAᵀ + Σu|v, AΣvv], [(AΣvv)ᵀ, Σvv]]
Copyright © Andrew W. Moore                                                                                        Slide 60
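Carrying the two steps through with the slide's numbers gives the posterior in closed form. A minimal sketch in plain Python (variable names assumed), where the scalar case needs no matrices:

```python
import math

# From the slides: IQ ~ N(100, 15^2), S | IQ ~ N(IQ, 10^2), observed S = 130
mu0, var0 = 100.0, 15.0 ** 2     # prior mean and variance of IQ
var_noise = 10.0 ** 2            # test-score noise variance
s = 130.0                        # observed score

# Chain Rule with A = 1: the joint of (S, IQ) has
var_s = var0 + var_noise         # Var[S] = 225 + 100 = 325
cov = var0                       # Cov[S, IQ] = 225

# Swap, then Conditionalize (U = IQ, V = S):
mu_post = mu0 + (cov / var_s) * (s - mu0)   # ≈ 120.8
var_post = var0 - cov ** 2 / var_s          # ≈ 69.2

print(mu_post, math.sqrt(var_post))         # posterior mean and std dev
```

Note the posterior mean lands between the prior mean (100) and the observed score (130), weighted by the two variances.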




Your pride and joy’s posterior IQ
    • If you did the working, you now have
      p(IQ|S=130)
    • If you have to give a single most likely IQ
      given the score, you should give

                IQ_MAP = argmax_iq  p(iq | s = 130)

    • where MAP means “Maximum A Posteriori”

Copyright © Andrew W. Moore                                               Slide 61
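Since the posterior IQ | (S=130) is itself Gaussian, its mode — the MAP estimate — equals its posterior mean, so the answer is pulled from the raw score 130 back toward the prior mean 100. A small sketch (numbers assumed from the slides' IQ example) checks this by brute force:

```python
import math

# Posterior from the Conditionalize rule: IQ | S=130 ~ N(mu_post, sd_post^2)
mu_post = 100 + (225 / 325) * (130 - 100)
sd_post = math.sqrt(225 - 225 ** 2 / 325)

def log_posterior(iq):
    # Gaussian log-density of IQ | S=130, additive constants dropped
    return -0.5 * ((iq - mu_post) / sd_post) ** 2

# Brute-force argmax over a fine grid from 90 to 150 in steps of 0.01
iq_map = max((i / 100 for i in range(9000, 15001)), key=log_posterior)
print(round(iq_map, 2), round(mu_post, 2))   # both ≈ 120.77
```

So the MAP answer is about 120.8, not the maximum-likelihood answer of 130.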




                              What you should know
    • Know the Gaussian PDF formula off by heart
    • Understand the workings of the formula for
      a Gaussian
    • Be able to understand the Gaussian tools
      described so far
    • Have a rough idea of how you could prove
      them
    • Be happy with how you could use them

Copyright © Andrew W. Moore                                        Slide 62




