Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
1. UAI2010 Tutorial, July 8, Catalina Island.
Non-Gaussian Methods for Learning Linear Structural Equation Models
Shohei Shimizu and Yoshinobu Kawahara
Osaka University
Special thanks to Aapo Hyvärinen, Patrik O. Hoyer and Takashi Washio.
2. Abstract
• Linear structural equation models (linear SEMs) can be used to model the data generating processes of variables.
• We review a new approach to learning or estimating linear SEMs.
• The new estimation approach utilizes the non-Gaussianity of data for model identification and uniquely estimates a much wider variety of models.
3. Outline
• Part I. Overview (70 min.): Shohei
• Break (10 min.)
• Part II. Recent advances (40 min.): Yoshi
– Time series
– Latent confounders
4. Motivation (1/2)
• Suppose that data X was randomly generated from either of the following two data generating processes:

Model 1 (x1 → x2):
x1 = e1
x2 = b21 x1 + e2

Model 2 (x2 → x1):
x1 = b12 x2 + e1
x2 = e2

where e1 and e2 are latent variables (disturbances, errors).
• We want to estimate or identify which model generated the data X based on the data X only.
5. Motivation (2/2)
• We want to identify which model generated the data X based on the data X only.
• If e1 and e2 are Gaussian, it is well known that we cannot identify the data generating process: Models 1 and 2 equally fit the data.
• If e1 and e2 are non-Gaussian, an interesting result is obtained: we can identify which of Models 1 and 2 generated the data.
• This tutorial reviews how such non-Gaussian methods work.
7. Basic problem setup (1/3)
• Assume that the data generating process of continuous observed variables x_i is graphically represented by a directed acyclic graph (DAG).
– Acyclicity means that there are no directed cycles.
[Figure: an example of a directed acyclic graph (DAG) and an example of a directed cyclic graph over x1, x2, x3 with external influences e1, e2, e3; in the DAG, x3 is a parent of x1, etc.]
8. Basic problem setup (2/3)
• Further assume linear relations between the variables x_i.
• Then we obtain a linear acyclic SEM (Wright, 1921; Bollen, 1989):

x_i = Σ_{j: parents of i} b_ij x_j + e_i, or x = Bx + e

where
– The e_i are continuous latent variables that are not determined inside the model, which we call external influences (disturbances, errors).
– The e_i are of non-zero variance and are independent.
– The 'path-coefficient' matrix B = [b_ij] corresponds to a DAG.
9. Example of linear acyclic SEMs
• A three-variable linear acyclic SEM:

x1 = 1.5 x3 + e1
x2 = −1.3 x1 + e2
x3 = e3

or, in matrix form,

[x1]   [ 0    0   1.5][x1]   [e1]
[x2] = [−1.3  0    0 ][x2] + [e2]
[x3]   [ 0    0    0 ][x3]   [e3]
        (the matrix is B)

• B corresponds to the data-generating DAG (x3 → x1 with coefficient 1.5, x1 → x2 with coefficient −1.3):

b_ij = 0 ⇔ no directed edge from x_j to x_i
b_ij ≠ 0 ⇔ a directed edge from x_j to x_i
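A minimal sketch (not from the tutorial; numpy assumed) of generating data from this SEM: since x = Bx + e, we can draw the external influences and solve x = (I − B)^{-1} e. The uniform distribution is just an arbitrary non-Gaussian choice.

```python
import numpy as np

rng = np.random.default_rng(0)

B = np.array([[0.0,  0.0, 1.5],   # x1 = 1.5*x3 + e1
              [-1.3, 0.0, 0.0],   # x2 = -1.3*x1 + e2
              [0.0,  0.0, 0.0]])  # x3 = e3 (exogenous)

n = 10_000
e = rng.uniform(-1, 1, size=(3, n))    # non-Gaussian external influences
x = np.linalg.solve(np.eye(3) - B, e)  # x = (I - B)^{-1} e, one column per sample

print(np.cov(x))  # covariance of the observed variables implied by B and cov(e)
```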
10. Assumption of acyclicity
• Acyclicity ensures the existence of an ordering of the variables x_i that makes B lower-triangular with zeros on the diagonal (Bollen, 1989).

Original order (x1, x2, x3):

[x1]   [ 0    0   1.5][x1]   [e1]
[x2] = [−1.3  0    0 ][x2] + [e2]
[x3]   [ 0    0    0 ][x3]   [e3]

Permuted order (x3, x1, x2):

[x3]   [ 0    0    0 ][x3]   [e3]
[x1] = [1.5   0    0 ][x1] + [e1]
[x2]   [ 0  −1.3   0 ][x2]   [e2]   (B_perm is lower-triangular)

• The ordering is x3 < x1 < x2: x3 may be an ancestor of x1 and x2, but not vice versa.
11. Assumption of independence between external influences
• It implies that there are no latent confounders (Spirtes et al., 2000).
– A latent confounder f is a latent variable that is a parent of two or more observed variables.
• Such a latent confounder f makes the external influences dependent (Part II).
[Figure: f is a parent of both x1 and x2; absorbing f into the disturbances e1', e2' makes them dependent.]
12. Basic problem setup (3/3): Learning linear acyclic SEMs
• Assume that the data X is randomly sampled from a linear acyclic SEM (with no latent confounders):

x = Bx + e

• Goal: estimate the path-coefficient matrix B by observing the data X only!
– B corresponds to the data-generating DAG.
14. Under what conditions is B identifiable?
• 'B is identifiable' ≡ 'B is uniquely determined or estimated from p(x)'.
• Linear acyclic SEM: x = Bx + e.
– B and p(e) induce p(x).
– If p(x) is different for different B, then B is uniquely determined.
15. Conventional estimation principle: Causal Markov condition
• If the data-generating model is a (linear) acyclic SEM, the causal Markov condition holds:
– Each observed variable x_i is independent of its non-descendants in the DAG conditional on its parents (Pearl & Verma, 1991):

p(x) = Π_{i=1}^{p} p(x_i | parents of x_i)
16. Conventional methods based on causal Markov condition
• Methods based on conditional independencies (Spirtes & Glymour, 1991)
– Many linear acyclic SEMs give the same set of conditional independencies and equally fit the data.
• Scoring methods based on Gaussianity (Chickering, 2002)
– Many linear acyclic SEMs give the same Gaussian distribution and equally fit the data.
• In many cases, the path-coefficient matrix B is not uniquely determined.
17. Example
• Two models with Gaussian e1 and e2:

Model 1 (x1 → x2):
x1 = e1
x2 = 0.8 x1 + e2

Model 2 (x2 → x1):
x1 = 0.8 x2 + e1
x2 = e2

with E(e1) = E(e2) = 0 and var(x1) = var(x2) = 1.

• Both introduce no conditional independence: cov(x1, x2) = 0.8 ≠ 0.
• Both induce the same Gaussian distribution:

[x1]     ( [0]   [ 1   0.8] )
[x2] ~ N ( [0] , [0.8   1 ] )
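A quick numerical illustration (not from the slides; numpy assumed) that the two models induce the same covariance, and hence, being zero-mean Gaussian, the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model 1: x1 = e1, x2 = 0.8*x1 + e2, with var(x1) = var(x2) = 1
e1 = rng.normal(0.0, 1.0, n)
e2 = rng.normal(0.0, np.sqrt(1 - 0.8**2), n)
x1, x2 = e1, 0.8 * e1 + e2
print(np.cov(x1, x2))  # approx [[1, 0.8], [0.8, 1]]

# Model 2: x2 = e2', x1 = 0.8*x2 + e1', with var(x1) = var(x2) = 1
e2p = rng.normal(0.0, 1.0, n)
e1p = rng.normal(0.0, np.sqrt(1 - 0.8**2), n)
x2, x1 = e2p, 0.8 * e2p + e1p
print(np.cov(x1, x2))  # the same covariance: the direction is unidentifiable
```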
19. A new direction: Non-Gaussian approach
• Non-Gaussian data arises in many applications:
– Neuroinformatics (Hyvärinen et al., 2001); bioinformatics (Sogawa et al., ICANN2010); social sciences (Micceri, 1989); economics (Moneta, Entner, et al., 2010)
• Utilize non-Gaussianity for model identification:
– Bentler (1983); Mooijaart (1985); Dodge and Rousson (2001)
• The path-coefficient matrix B is uniquely estimated if the e_i are non-Gaussian:
– Shimizu, Hoyer, Hyvärinen and Kerminen (JMLR, 2006)
20. Illustrative example: Gaussian vs non-Gaussian
[Figure: scatter plots of (x1, x2) under Model 1 (x1 = e1, x2 = 0.8 x1 + e2) and Model 2 (x1 = 0.8 x2 + e1, x2 = e2), with Gaussian and with non-Gaussian (uniform) external influences; E(e1) = E(e2) = 0, var(x1) = var(x2) = 1. With Gaussian external influences the two models produce the same scatter; with uniform ones the scatter plots differ.]
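A rough simulation of this contrast (illustrative, not the tutorial's code; numpy assumed): generate Model 1 with Gaussian and with uniform external influences, regress in both directions, and check whether the residual is independent of the regressor. Here corr(z², r²) serves as a crude dependence measure; it is zero in expectation when z and r are independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def dep(z, r):
    """Crude dependence check: correlation of squares (0 under independence)."""
    return np.corrcoef(z**2, r**2)[0, 1]

def residual(y, z):
    """Residual of the least-squares simple regression of y on z."""
    return y - (np.cov(y, z)[0, 1] / np.var(z)) * z

for name, sampler in [("gaussian", rng.standard_normal),
                      ("uniform", lambda size: rng.uniform(-1, 1, size))]:
    e1, e2 = sampler(n), sampler(n)
    x1, x2 = e1, 0.8 * e1 + e2        # true direction: x1 -> x2
    print(name,
          dep(x1, residual(x2, x1)),  # correct direction: approx 0
          dep(x2, residual(x1, x2)))  # wrong direction: approx 0
                                      # only in the Gaussian case
```

In the Gaussian case both directions pass the check (for jointly Gaussian variables, uncorrelatedness implies independence), so the direction cannot be identified; in the uniform case only the true direction does.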
21. Linear Non-Gaussian Acyclic Model: LiNGAM (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)
• Non-Gaussian version of the linear acyclic SEM:

x_i = Σ_{j: parents of i} b_ij x_j + e_i, or x = Bx + e

where
– the external influence variables e_i (disturbances, errors) are
• of non-zero variance,
• non-Gaussian and mutually independent.
22. Identifiability of the LiNGAM model
• The LiNGAM model can be shown to be identifiable:
– B is uniquely estimated.
• To see the identifiability, it is helpful to review independent component analysis (ICA) (Hyvärinen et al., 2001).
23. Independent Component Analysis (ICA) (Jutten & Hérault, 1991; Comon, 1994)
• The observed random vector x is modeled by

x_i = Σ_{j=1}^{p} a_ij s_j, or x = As

where
– the mixing matrix A = [a_ij] is square and of full column rank;
– the latent variables s_i (independent components) are of non-zero variance, non-Gaussian and mutually independent.
• Then A is identifiable up to permutation P and scaling D of the columns:

A_ica = APD
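A minimal sketch of this model in code (assumes a recent scikit-learn): mix two non-Gaussian sources and recover the mixing matrix up to permutation and scaling of the columns.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 10_000

S = rng.uniform(-1, 1, size=(n, 2))   # non-Gaussian independent components s
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # true mixing matrix
X = S @ A.T                           # observations x = As

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
ica.fit(X)
print(ica.mixing_)  # columns match A's up to reordering and rescaling
```

Comparing `ica.mixing_` with A column by column shows exactly the permutation and scaling indeterminacy A_ica = APD.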
24. Estimation of ICA
• Most estimation methods estimate W = A^{−1} (Hyvärinen et al., 2001):

x = As = W^{−1}s

• Most of the methods minimize the mutual information (or an approximation of it) of the estimated independent components: ŝ = W_ica x.
• W is estimated up to permutation P and scaling D of the rows:

W_ica = PDW (= PDA^{−1})

• Consistent and computationally efficient algorithms:
– Fixed point (FastICA) (Hyvärinen, 1999); gradient-based (Amari, 1998)
– Semiparametric: no specific distributional assumption
26. Identifiability of LiNGAM (1/3): ICA achieves half of identification
• The LiNGAM model is ICA.
– The observed variables are linear combinations of the non-Gaussian independent external influences e_i:

x = Bx + e ⇔ x = (I − B)^{−1}e = Ae = W^{−1}e   (W = I − B)

• ICA gives W_ica = PDW = PD(I − B).
– P: unknown permutation matrix
– D: unknown scaling (diagonal) matrix
• We need to determine P and D to identify B.
27. Identifiability of LiNGAM (2/3): No permutation indeterminacy (1/6)
• ICA gives W_ica = PDW = PD(I − B).
– P: permutation matrix; D: scaling (diagonal) matrix
• We want to find a permutation matrix P' that cancels the permutation P, i.e., P'P = I, so that P'W_ica = P'PDW = DW.
• We can show (Shimizu et al., UAI05) (illustrated in the next slides):
– If P'P = I, i.e., no permutation is made on the rows of DW, then P'W_ica = DW has no zeros in the diagonal (obvious by definition).
– If P'P ≠ I, i.e., any non-identical permutation is made on the rows of DW, then P'W_ica has a zero in the diagonal.
28. Identifiability of LiNGAM (2/3): No permutation indeterminacy (2/6)
• By definition, W = I − B has all unities in the diagonal.
– The diagonal elements of B are all zeros.
• Acyclicity ensures the existence of an ordering of the variables that makes B lower-triangular, and then W = I − B is also lower-triangular.
• So, without loss of generality, W can be assumed to be lower-triangular:

    [1 0 0]
W = [* 1 0]
    [* * 1]

No zeros in the diagonal!
29. Identifiability of LiNGAM (2/3): No permutation indeterminacy (3/6)
• Premultiplying W by a scaling (diagonal) matrix D does NOT change the zero/non-zero pattern of W:

    [1 0 0]         [d11  0    0 ]
W = [* 1 0]    DW = [ *  d22   0 ]
    [* * 1]         [ *   *   d33]

No zeros in the diagonal!
30. Identifiability of LiNGAM (2/3): No permutation indeterminacy (4/6)
• Any other permutation of the rows of DW changes the zero/non-zero pattern of DW and brings a zero into the diagonal:

     [d11  0    0 ]              [ *  d22   0 ]
DW = [ *  d22   0 ]    P12 DW =  [d11  0    0 ]
     [ *   *   d33]              [ *   *   d33]

Exchanging the 1st and 2nd rows: zero in the diagonal!
31. Identifiability of LiNGAM (2/3): No permutation indeterminacy (5/6)
• Similarly for any other permutation of the rows of DW:

     [d11  0    0 ]              [ *   *   d33]
DW = [ *  d22   0 ]    P13 DW =  [ *  d22   0 ]
     [ *   *   d33]              [d11  0    0 ]

Exchanging the 1st and 3rd rows: zero in the diagonal!
32. Identifiability of LiNGAM (2/3): No permutation indeterminacy (6/6)
• We can find the correct P' by finding the P' that gives no zeros in the diagonal of P'W_ica (Shimizu et al., UAI05).
• Thus, we can solve the permutation indeterminacy and obtain:

P'W_ica = P'PDW = DW = D(I − B)
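An illustrative sketch of this step (assumes scipy; the search over permutations is cast as an assignment problem, as on slide 38 below): scramble W = I − B for the running example by an arbitrary PD, then recover DW and finally B.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# W = I - B for the running example (x3 -> x1 -> x2)
W = np.array([[1.0, 0.0, -1.5],
              [1.3, 1.0,  0.0],
              [0.0, 0.0,  1.0]])
PD = np.array([[0.0, 2.0,  0.0],   # an arbitrary permutation-and-scaling
               [0.0, 0.0, -1.0],
               [0.5, 0.0,  0.0]])
W_ica = PD @ W                     # what ICA would hand us

# Row permutation maximizing sum_i log|diagonal|: any permutation leaving
# a zero on the diagonal gets essentially infinite cost and is never chosen.
cost = -np.log(np.abs(W_ica) + 1e-12)
rows, cols = linear_sum_assignment(cost)
DW = np.zeros_like(W_ica)
DW[cols] = W_ica[rows]             # place row i of W_ica at position cols[i]
print(DW)                          # equals D(I - B): non-zero diagonal

# Scaling step (next slide): divide each row by its diagonal element
B_hat = np.eye(3) - DW / np.diag(DW)[:, None]
print(B_hat)                       # recovers B exactly
```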
33. Identifiability of LiNGAM (3/3): No scaling indeterminacy
• Now we have: P'W_ica = D(I − B).
• Then D = diag(P'W_ica).
• Divide each row of P'W_ica by its corresponding diagonal element to get I − B, i.e., B:

diag(P'W_ica)^{−1} P'W_ica = D^{−1}D(I − B) = I − B
34. Estimation of LiNGAM model
1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
35. Two estimation algorithms
• ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)
• DirectLiNGAM algorithm (Shimizu, Hyvärinen, Kawahara & Washio, UAI09)
• Both estimate an ordering of the variables that makes the path-coefficient matrix B lower-triangular:

x_perm = B_perm x_perm + e_perm, with B_perm lower-triangular (a full DAG, possibly with redundant edges)

– Acyclicity ensures the existence of such an ordering.
[Figure: a full DAG over x3, x1, x2 containing redundant edges.]
36. Once such an ordering is found…
• Many existing (covariance-based) methods can do:
– Pruning the redundant path coefficients
• Sparse methods like the adaptive lasso (Zou, 2006)
– Finding significant path coefficients
• Testing, bootstrapping (Shimizu et al., 2006; Hyvärinen et al., 2010)
[Figure: the full DAG over x3, x1, x2 is pruned to the true DAG; the corresponding entries of the lower-triangular B_perm become zeros.]
37. 1. Outline of ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvärinen & Kerminen, JMLR, 2006)
1. Estimate B by ICA + permutation
2. Prune redundant edges
[Figure: a fully connected estimate over x1, x2, x3 (with coefficients such as b13, b23) is pruned to remove redundant edges.]
38. ICA-LiNGAM algorithm (1/2): Step 1. Estimation of B
1. Perform ICA (here, FastICA (Hyvärinen, 1999)) to obtain an estimate of W_ica = PDW = PD(I − B).
2. Find a permutation P' that makes the diagonal elements of P'Ŵ_ica as large (non-zero) as possible in absolute value:

P̂' = argmin_{P'} Σ_i 1 / |(P'Ŵ_ica)_ii|   (Hungarian algorithm, Kuhn, 1955)

3. Normalize each row of P̂'Ŵ_ica; then we get an estimate of I − B, and hence B̂.
39. ICA-LiNGAM algorithm (2/2): Step 2. Pruning
• Find an ordering of the variables that makes the estimated B as close to lower-triangular as possible.
– Find a permutation matrix Q that minimizes the sum of the squared elements in the upper triangular part of the permuted B̂:

Q̂ = argmin_Q Σ_{i≤j} (QB̂Q^T)_{ij}²

– An approximate algorithm exists for large numbers of variables (Hoyer et al., ICA06).
[Figure: small estimated coefficients (e.g., 0.1 and −0.01) left in the upper-triangular part are set to zero.]
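Putting steps 1 and 2 together, a compact end-to-end sketch (illustrative only; assumes scikit-learn and scipy; the published JMLR 2006 algorithm differs in detail, and the pruning step here is just a brute-force permutation search that is only feasible for small p):

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import FastICA

def ica_lingam(X):
    p = X.shape[1]
    # Step 1a: ICA gives W_ica = PD(I - B) up to row permutation and scaling
    ica = FastICA(n_components=p, whiten="unit-variance", random_state=0).fit(X)
    W_ica = np.linalg.pinv(ica.mixing_)
    # Step 1b: Hungarian algorithm; permutation with non-zero diagonal
    rows, cols = linear_sum_assignment(-np.log(np.abs(W_ica) + 1e-12))
    W_dw = np.zeros_like(W_ica)
    W_dw[cols] = W_ica[rows]
    # Step 1c: normalize rows to get I - B, hence an estimate of B
    B = np.eye(p) - W_dw / np.diag(W_dw)[:, None]
    # Step 2: ordering Q minimizing the squared upper-triangular mass of QBQ^T
    order = min(itertools.permutations(range(p)),
                key=lambda q: np.sum(np.triu(B[np.ix_(q, q)]) ** 2))
    return B, list(order)

# Demo on data simulated from the example SEM x3 -> x1 -> x2
rng = np.random.default_rng(0)
B_true = np.array([[0.0, 0.0, 1.5], [-1.3, 0.0, 0.0], [0.0, 0.0, 0.0]])
e = rng.uniform(-1, 1, size=(3, 20_000))
X = np.linalg.solve(np.eye(3) - B_true, e).T

B_est, order = ica_lingam(X)
print(np.round(B_est, 2))  # should be close to B_true, plus small spurious entries
print(order)               # should be the causal ordering [2, 0, 1], i.e. x3 < x1 < x2
```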
40. Basic properties of ICA-LiNGAM algorithm
• ICA-LiNGAM algorithm = ICA + permutations
– Computationally efficient with the help of well-developed ICA techniques.
• Potential problems:
– ICA is an iterative search method:
• It may get stuck in a local optimum if the initial guess or step size is badly chosen.
– The permutation algorithms are not scale-invariant:
• They may provide different estimates for different scales of the variables.
41. Estimation of LiNGAM model
1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
42. 2. DirectLiNGAM algorithm (Shimizu, Hyvärinen, Kawahara & Washio, UAI2009)
• An alternative estimation method without ICA.
– It estimates an ordering of the variables that makes the path-coefficient matrix B lower-triangular:

x_perm = B_perm x_perm + e_perm, with B_perm lower-triangular (a full DAG, possibly with redundant edges)

• Many existing (covariance-based) methods can then do further pruning or find significant path coefficients (Zou, 2006; Shimizu et al., 2006; Hyvärinen et al., 2010).
43. Basic idea (1/2): An exogenous variable can be at the top of a right ordering
• An exogenous variable x_j is a variable with no parents (Bollen, 1989); here, x3.
– The corresponding row of B has all zeros.
• So, an exogenous variable can be at the top of an ordering that makes B lower-triangular with zeros on the diagonal:

[x3]   [ 0    0    0 ][x3]   [e3]
[x1] = [1.5   0    0 ][x1] + [e1]
[x2]   [ 0  −1.3   0 ][x2]   [e2]

[Figure: the DAG x3 → x1 → x2 with coefficients 1.5 and −1.3.]
44. Basic idea (2/2): Regress the exogenous variable out
• Compute the residuals r_i^(3) (i = 1, 2) of regressing the other variables x_i on the exogenous x3:

r_i^(3) = x_i − {cov(x_i, x3)/var(x3)} x3

– The residuals r_1^(3) and r_2^(3) form a LiNGAM model:

[r_1^(3)]   [ 0   0][r_1^(3)]   [e1]
[r_2^(3)] = [−1.3 0][r_2^(3)] + [e2]

– The ordering of the residuals is equivalent to that of the corresponding original variables.
• The residual r_1^(3) being exogenous implies that x1 can be at the second top.
45. Outline of DirectLiNGAM
• Iteratively find exogenous variables until all the variables are ordered:
1. Find an exogenous variable, here x3.
– Put x3 at the top of the ordering.
– Regress x3 out.
2. Find an exogenous residual, here r_1^(3).
– Put x1 at the second top of the ordering.
– Regress r_1^(3) out.
3. Put x2 at the third top of the ordering and terminate.
The estimated ordering is x3 < x1 < x2.
[Figure: Steps 1-3 on the DAG x3 → x1 → x2, with residuals r^(3) and r^(3,1).]
46. Identification of an exogenous variable (two-variable case)
i) x1 is exogenous (x1 = e1):

x1 = e1
x2 = b21 x1 + e2   (b21 ≠ 0)

Regressing x2 on x1:

r_2^(1) = x2 − {cov(x2, x1)/var(x1)} x1 = x2 − b21 x1 = e2

x1 and r_2^(1) are independent.

ii) x1 is NOT exogenous:

x1 = b12 x2 + e1   (b12 ≠ 0)
x2 = e2

Regressing x2 on x1:

r_2^(1) = x2 − {cov(x2, x1)/var(x1)} x1 = x2 − {b12 var(x2)/var(x1)} x1

x1 and r_2^(1) are NOT independent.
47. Need to use the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953)
Darmois-Skitovitch theorem:
Define two variables x1 and x2 as

x1 = Σ_j a_1j e_j,   x2 = Σ_j a_2j e_j

where the e_j are independent random variables. If there exists a non-Gaussian e_i for which a_1i a_2i ≠ 0, then x1 and x2 are dependent.

Applying it to case ii), where x1 = b12 x2 + e1 (b12 ≠ 0) is NOT exogenous and x2 = e2: regressing x2 on x1 gives

r_2^(1) = x2 − {cov(x2, x1)/var(x1)} x1 = x2 − {b12 var(e2)/var(x1)} x1

Both x1 and r_2^(1) are linear combinations of e1 and e2 with non-zero coefficients on the non-Gaussian e1, so by the theorem x1 and r_2^(1) are NOT independent.
48. Identification of an exogenous variable (p-variable case)
• Lemma 1: x_j and all its residuals r_i^(j) = x_i − {cov(x_i, x_j)/var(x_j)} x_j (i ≠ j) are independent ⇔ x_j is exogenous.
• In practice, we can identify an exogenous variable by finding the variable that is most independent of its residuals.
49. Independence measures
• Evaluate independence between a variable and a residual by a nonlinear correlation: corr{x_j, g(r_i^(j))} with g = tanh.
• Taking the sum over all the residuals, we get:

T = Σ_{i≠j} [ |corr{x_j, g(r_i^(j))}| + |corr{g(x_j), r_i^(j)}| ]

• More sophisticated measures can be used as well (Bach & Jordan, 2002; Gretton et al., 2005; Kraskov et al., 2004).
– A kernel-based independence measure (Bach & Jordan, 2002) often gives more accurate estimates (Sogawa et al., IJCNN10).
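A simplified DirectLiNGAM sketch built around such a measure (illustrative only; numpy assumed; the UAI09 algorithm and its published implementation differ in detail, e.g. in the exact independence measure used):

```python
import numpy as np

def residuals(X, j):
    """Residuals of regressing every column of X on column j (columns assumed centered)."""
    xj = X[:, j]
    coef = X.T @ xj / (xj @ xj)    # least-squares slopes cov(x_i, x_j)/var(x_j)
    return X - np.outer(xj, coef)  # column j itself becomes ~0 and is ignored

def dep(a, b, g=np.tanh):
    """Nonlinear correlation |corr(a, g(b))| + |corr(g(a), b)| as a dependence score."""
    return abs(np.corrcoef(a, g(b))[0, 1]) + abs(np.corrcoef(g(a), b)[0, 1])

def direct_lingam_order(X):
    X = X - X.mean(axis=0)
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        scores = {}
        for j in remaining:        # the most exogenous variable is the one
            R = residuals(X, j)    # least dependent on its own residuals
            scores[j] = sum(dep(X[:, j], R[:, i]) for i in remaining if i != j)
        j = min(scores, key=scores.get)
        order.append(j)
        X = residuals(X, j)        # regress the chosen variable out
        remaining.remove(j)
    return order + remaining       # the last variable closes the ordering

# Demo on the example SEM x3 -> x1 -> x2 with uniform disturbances
rng = np.random.default_rng(0)
B = np.array([[0.0, 0.0, 1.5], [-1.3, 0.0, 0.0], [0.0, 0.0, 0.0]])
e = rng.uniform(-1, 1, size=(3, 20_000))
X = np.linalg.solve(np.eye(3) - B, e).T
print(direct_lingam_order(X))      # should print [2, 0, 1], i.e. x3 < x1 < x2
```

Given the ordering, the path coefficients can then be estimated by ordinary covariance-based regression and pruned, as on slide 36.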
50. Important properties of DirectLiNGAM
• DirectLiNGAM repeats:
– least-squares simple linear regression, and
– evaluation of pairwise independence between each variable and its residuals.
• There are no algorithmic parameters like step size, initial guesses, or convergence criteria.
• Convergence to the right solution is guaranteed in a fixed number of steps (the number of variables) if the data strictly follows the model.
51. Estimation of LiNGAM model: Summary (1/2)
• Two estimation algorithms:
– ICA-LiNGAM: estimation using ICA
• Pros: fast.
• Cons: possible local optima; not scale-invariant.
– DirectLiNGAM: alternative estimation without ICA
• Pros: guaranteed convergence; scale-invariant.
• Cons: less fast.
– Cf. neither needs faithfulness (Shimizu et al., JMLR, 2006; Hoyer, personal comm., July, 2010).
52. Estimation of LiNGAM model: Summary (2/2)
• Experimental comparison of the two algorithms (Sogawa et al., IJCNN2010):
• Sample size: both need a sample size of at least 1000 for more than 10 variables.
• Scalability: both can analyze 100 variables. The performance depends on the sample size etc., of course!
• Scale invariance: ICA-LiNGAM is less robust to changes in the scales of the variables.
• Local optima?:
– For fewer than 10 variables, ICA-LiNGAM is often a bit better.
– For more than 10 variables, DirectLiNGAM is often better, perhaps because the problem of local optima becomes more serious.
54. Testing testable assumptions
• Non-Gaussianity of the external influences e_i:
– Gaussianity tests
• Tests could detect violations of some assumptions:
– Local tests:
• Independence of the external influences e_i
• Conditional independencies between observed variables x_i (causal Markov condition)
• Linearity
– Overall fit of the model assumptions:
• Chi-square test using 3rd- and/or 4th-order moments (Shimizu & Kano, 2008)
• Still under development.
55. Reliability evaluation
• We need to evaluate the statistical reliability of LiNGAM results:
– Sample fluctuations
– Smaller non-Gaussianity makes the model closer to being NOT identifiable.
• Reliability evaluation by bootstrapping (Komatsu, Shimizu & Shimodaira, ICANN2010; Hyvärinen et al., JMLR, 2010):
– If either the sample size or the magnitude of non-Gaussianity is small, LiNGAM would give very different results across bootstrap samples.
60. Final summary of Part I
• Use of non-Gaussianity in linear SEMs is
useful for model identification.
• Non-Gaussian data is encountered in many
applications.
• The non-Gaussian approach can be a good
option.
• Links to codes and papers:
http://homepage.mac.com/shoheishimizu/lingampapers.html
62. Q. My data is Gaussian. LiNGAM will not be useful.
• A. You're right. Try Gaussian methods.
• Comment: Hoyer et al. (UAI2008) showed to what extent one can identify the model for a mixture of Gaussian and non-Gaussian external influence variables.
63. Q. I applied LiNGAM, but the result is not reasonable given background knowledge.
• A. You might first want to check the following:
– Some model assumptions might be violated.
→ Try other extensions of LiNGAM or non-parametric methods such as PC or FCI (Spirtes et al., 2000).
– Small sample size or small non-Gaussianity.
→ Try the bootstrap to see if the result is reliable.
– The background knowledge might be wrong.
64. Q. Relation to the causal Markov condition?
• A. The following three estimation principles are equivalent (Zhang & Hyvärinen, ECML09; Hyvärinen et al., JMLR, 2010):
1. Maximize independence between the external influences e_i.
2. Minimize the sum of the entropies of the external influences e_i.
3. The causal Markov condition (each variable is independent of its non-descendants in the DAG conditional on its parents) AND maximization of independence between the parents of each variable and its corresponding external influence e_i.
65. Q. I am a psychometrician and am more interested in latent factors.
• A. Shimizu, Hoyer, and Hyvärinen (2009) propose a LiNGAM for latent factors:

f = Bf + d   (LiNGAM for latent factors)
x = Gf + e   (measurement model)
66. Others
• Q. Prior knowledge?
– It is possible to incorporate prior knowledge. The accuracy of DirectLiNGAM is often greatly improved even if the amount of prior knowledge is not so large (Inazumi et al., LVA/ICA2010).
• Q. Sparse LiNGAM?
– Zhang et al. (ICA09) and Hyvärinen et al. (JMLR, 2010).
– ICA + adaptive lasso (Zou, 2006).
• Q. A Bayesian approach?
– Hoyer and Hyttinen (UAI09); Henao and Winther (NIPS09).
• Q. Can the idea be applied to discrete variables?
– One proposal by Peters et al. (AISTATS2010).
– Comment: if your discrete variables are close to continuous, e.g., ordinal scales with many points, LiNGAM might work.
67. Q. Nonlinear extensions?
• A. Several nonlinear SEMs have been proposed (DAG; no latent confounders):

1. x_i = Σ_j f_ij(parent j of x_i) + e_i   (Imoto et al., 2002)
2. x_i = f_i(parents of x_i) + e_i   (Hoyer et al., NIPS08)
3. x_i = f_{i,2}^{−1}( f_{i,1}(parents of x_i) + e_i )   (Zhang et al., UAI09)

• For two-variable cases, unique identification is possible except for several combinations of nonlinearities and distributions (Hoyer et al., NIPS08; Zhang & Hyvärinen, UAI09).
68. Nonlinear extensions (continued)
• Proposals aiming at computational efficiency (Mooij et al., ICML09; Tillman et al., NIPS09; Zhang & Hyvärinen, ECML09; UAI09).
• Pros:
– Nonlinear models are more general than linear models.
• Cons:
– Computationally demanding.
• Currently: at most 7 or 8 variables.
• Perhaps the assumption of Gaussian external influences might help: Imoto et al. (2002) analyze 100 variables.
– It is more difficult to allow other possible violations of the LiNGAM assumptions, latent confounders etc.
69. Q. My data follows neither such linear SEMs nor such nonlinear SEMs as you have talked about.
• A. Try non-parametric methods with x_i = f_i(parents of x_i, e_i), e.g.:
– DAG: PC (Spirtes & Glymour, 1991)
– DAG with latent confounders: FCI (Spirtes et al., 1995).
• Probably you will get a (probably large) class of equivalent models rather than a single model, but that would be the best you can currently do.
70. Q. Deterministic relations?
• A. LiNGAM is not applicable.
• See Daniusis et al. (UAI2010) for a method to
analyze deterministic relations.
72. Very brief review of linear SEMs and causality
See Pearl (2000) for details.
73. We can give a causal interpretation to each path coefficient b_ij (Pearl, 2000).
• Example:

x1 = e1
x2 = b21 x1 + e2
x3 = b32 x2 + e3

[Figure: the chain x1 → x2 → x3 with coefficients b21 and b32.]

• Causal interpretation of b32: if you change x2 from a constant c0 to a constant c1, then E(x3) will change by b32(c1 − c0).
74. 'Change x2 from c0 to c1'
• It means 'fix x2 at c0 regardless of the variables that previously determined x2, and change the constant from c0 to c1' (Pearl, 2000).
• 'Fix x2 at c0 regardless of the variables that determine x2' is denoted by do(x2 = c0).
– In the SEM, 'replace the function determining x2 with the constant c0':

Original model:          Mutilated model with do(x2 = c0):
x1 = e1                  x1 = e1
x2 = b21 x1 + e2         x2 = c0
x3 = b32 x2 + e3         x3 = b32 x2 + e3

(Assumption of invariance: the other functions don't change.)
75. Average causal effect: Definition (Rubin, 1974; Pearl, 2000)
• The average causal effect of x2 on x3 when changing x2 from c0 to c1:

E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0)) = b32(c1 − c0)

– It is computed based on the mutilated models with do(x2 = c1) or do(x2 = c0).
• Average causal effects can be computed based on the estimated path-coefficient matrix B.
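A tiny numerical check of this definition (illustrative; numpy assumed): simulate the mutilated models for the chain x1 → x2 → x3 of slide 73 and compare with b32(c1 − c0).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b32 = -0.8      # path coefficient of x2 -> x3 (b21 plays no role once x2 is fixed)
c0, c1 = 0.0, 2.0

def mean_x3_do(c):
    x2 = np.full(n, c)          # mutilation: the equation for x2 is replaced by x2 = c
    e3 = rng.uniform(-1, 1, n)  # zero-mean external influence on x3
    x3 = b32 * x2 + e3
    return x3.mean()

print(mean_x3_do(c1) - mean_x3_do(c0))  # approx b32 * (c1 - c0)
print(b32 * (c1 - c0))                  # -1.6
```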
76. Cf. Conducting experiments with random assignment is a method to estimate average causal effects
• Example: a drug trial.
• Random assignment of subjects into two groups fixes x2 to c1 or c0 regardless of the variables that determine x2:
– Experimental group with x2 = c1 (take the drug)
– Control group with x2 = c0 (placebo)

E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0)) = E(x3) for the experimental group − E(x3) for the control group