UAI2010 Tutorial, July 8, 2010, Catalina Island

Non-Gaussian Methods for Learning Linear Structural Equation Models

Shohei Shimizu and Yoshinobu Kawahara
Osaka University

Special thanks to Aapo Hyvärinen, Patrik O. Hoyer and Takashi Washio.
2 Abstract
• Linear structural equation models (linear SEMs) can be used to model the data-generating processes of variables.
• We review a new approach to learning, or estimating, linear SEMs.
• The new estimation approach utilizes the non-Gaussianity of data for model identification and uniquely estimates a much wider variety of models.
3 Outline
• Part I. Overview (70 min.): Shohei
• Break (10 min.)
• Part II. Recent advances (40 min.): Yoshi
  – Time series
  – Latent confounders
4 Motivation (1/2)
• Suppose that data X was randomly generated from either of the following two data-generating processes:

  Model 1: x1 = e1               Model 2: x1 = b12 x2 + e1
           x2 = b21 x1 + e2               x2 = e2

  where e1 and e2 are latent variables (disturbances, errors).
• We want to estimate or identify which model generated the data X, based on the data X only.
5 Motivation (2/2)
• We want to identify which model generated the data X based on the data X only.
• If x1 and x2 are Gaussian, it is well known that we cannot identify the data-generating process:
  – Models 1 and 2 equally fit the data.
• If x1 and x2 are non-Gaussian, an interesting result is obtained: we can identify which of Models 1 and 2 generated the data.
• This tutorial reviews how such non-Gaussian methods work.
Problem formulation
7 Basic problem setup (1/3)
• Assume that the data-generating process of continuous observed variables xi is graphically represented by a directed acyclic graph (DAG).
  – Acyclicity means that there are no directed cycles.

  [Figures: an example directed acyclic graph (DAG) and an example directed cyclic graph over x1, x2, x3 with external influences e1, e2, e3; in the DAG, x3 is a parent of x1, etc.]
8 Basic problem setup (2/3)
• Further assume linear relations of the variables xi.
• Then we obtain a linear acyclic SEM (Wright, 1921; Bollen, 1989):

  xi = Σ_{j: parents of i} bij xj + ei   or   x = Bx + e

  – The ei are continuous latent variables that are not determined inside the model, which we call external influences (disturbances, errors).
  – The ei are of non-zero variance and are independent.
  – The 'path-coefficient' matrix B = [bij] corresponds to a DAG.
9 Example of linear acyclic SEMs
• A three-variable linear acyclic SEM:

    [x1]   [ 0    0   1.5 ] [x1]   [e1]         x1 =  1.5 x3 + e1
    [x2] = [-1.3  0   0   ] [x2] + [e2]    or   x2 = -1.3 x1 + e2
    [x3]   [ 0    0   0   ] [x3]   [e3]         x3 =  e3
            `------ B ----'

• B corresponds to the data-generating DAG (x3 → x1 → x2, with coefficients 1.5 and -1.3):
  – bij = 0 ⇔ no directed edge from xj to xi
  – bij ≠ 0 ⇔ a directed edge from xj to xi
10 Assumption of acyclicity
• Acyclicity ensures the existence of an ordering of the variables xi that makes B lower-triangular with zeros on the diagonal (Bollen, 1989).

  Original ordering (x1, x2, x3):

    [x1]   [ 0    0   1.5 ] [x1]   [e1]
    [x2] = [-1.3  0   0   ] [x2] + [e2]        B is not lower-triangular
    [x3]   [ 0    0   0   ] [x3]   [e3]

  Permuted ordering (x3, x1, x2):

    [x3]   [ 0    0    0 ] [x3]   [e3]
    [x1] = [ 1.5  0    0 ] [x1] + [e1]         B_perm is lower-triangular
    [x2]   [ 0   -1.3  0 ] [x2]   [e2]

• The ordering is x3 < x1 < x2: x3 may be an ancestor of x1 and x2, but not vice versa.
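A quick check (NumPy sketch, same B as in the earlier sketch) that permuting the variables into the causal order x3 < x1 < x2 makes B strictly lower-triangular:

```python
import numpy as np

B = np.array([[ 0.0, 0.0, 1.5],
              [-1.3, 0.0, 0.0],
              [ 0.0, 0.0, 0.0]])

order = [2, 0, 1]                    # causal ordering x3 < x1 < x2 (0-indexed)
B_perm = B[np.ix_(order, order)]     # permute rows and columns consistently

# Everything on or above the diagonal is zero.
assert np.allclose(np.triu(B_perm), 0.0)
print(B_perm)                        # [[0, 0, 0], [1.5, 0, 0], [0, -1.3, 0]]
```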
11 Assumption of independence between external influences
• It implies that there are no latent confounders (Spirtes et al., 2000).
  – A latent confounder f is a latent variable that is a parent of two or more observed variables.
• Such a latent confounder f makes the external influences dependent (Part II):

  [Figure: a latent variable f pointing to both x1 and x2 (with disturbances e1', e2'), contrasted with the confounder-free model x1 → x2 with independent e1, e2.]
12 Basic problem setup (3/3): Learning linear acyclic SEMs
• Assume that data X is randomly sampled from a linear acyclic SEM (with no latent confounders): x = Bx + e.
• Goal: estimate the path-coefficient matrix B by observing data X only!
  – B corresponds to the data-generating DAG.
Problems: 
Identifiability problems of 
conventional methods
14 Under what conditions is B identifiable?
• 'B is identifiable' ≡ 'B is uniquely determined or estimated from p(x)'.
• Linear acyclic SEM: x = Bx + e
  – B and p(e) induce p(x).
  – If p(x) is different for different B, then B is uniquely determined.
15 Conventional estimation principle: Causal Markov condition
• If the data-generating model is a (linear) acyclic SEM, the causal Markov condition holds:
  – Each observed variable xi is independent of its non-descendants in the DAG conditional on its parents (Pearl & Verma, 1991):

    p(x) = Π_{i=1}^{p} p(xi | parents of xi)
16 Conventional methods based on the causal Markov condition
• Methods based on conditional independencies (Spirtes & Glymour, 1991):
  – Many linear acyclic SEMs give the same set of conditional independences and equally fit the data.
• Scoring methods based on Gaussianity (Chickering, 2002):
  – Many linear acyclic SEMs give the same Gaussian distribution and equally fit the data.
• In many cases, the path-coefficient matrix B is not uniquely determined.
17 Example
• Two models with Gaussian e1 and e2:

  Model 1: x1 = e1               Model 2: x1 = 0.8 x2 + e1
           x2 = 0.8 x1 + e2               x2 = e2

  with E(e1) = E(e2) = 0 and var(x1) = var(x2) = 1.
• Both introduce no conditional independence: cov(x1, x2) = 0.8 ≠ 0.
• Both induce the same Gaussian distribution:

    [x1]      ( [0]   [ 1    0.8] )
    [x2] ~  N ( [0] , [0.8   1  ] )
A solution: 
Non-Gaussian approach
19 A new direction: Non-Gaussian approach
• Non-Gaussian data arise in many applications:
  – Neuroinformatics (Hyvarinen et al., 2001); Bioinformatics (Sogawa et al., ICANN2010); Social sciences (Micceri, 1989); Economics (Moneta, Entner, et al., 2010)
• Utilize non-Gaussianity for model identification:
  – Bentler (1983); Mooijaart (1985); Dodge and Rousson (2001)
• The path-coefficient matrix B is uniquely estimated if the ei are non-Gaussian.
  – Shimizu, Hoyer, Hyvarinen and Kerminen (JMLR, 2006)
20 Illustrative example: Gaussian vs non-Gaussian
• Model 1: x1 = e1, x2 = 0.8 x1 + e2.
• Model 2: x1 = 0.8 x2 + e1, x2 = e2.
• With E(e1) = E(e2) = 0 and var(x1) = var(x2) = 1.

  [Figure: scatterplots of (x1, x2) under both models. With Gaussian disturbances the two models produce identical-looking scatterplots; with non-Gaussian (uniform) disturbances the two models' scatterplots differ visibly.]
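A simulation sketch of this picture (NumPy; the nonlinear correlation corr(x, tanh(r)) used below is borrowed from a later slide and is only a rough dependence indicator, not a formal test). With uniform disturbances, fitting the wrong direction leaves a residual that is dependent on the regressor; with Gaussian disturbances both directions look independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100000

def simulate_model1(gaussian):
    # Model 1: x1 = e1, x2 = 0.8*x1 + e2, scaled so var(x1) = var(x2) = 1.
    if gaussian:
        e1, e2 = rng.standard_normal(n), 0.6 * rng.standard_normal(n)
    else:  # zero-mean uniform with the same variances (var = a^2/3 for U(-a, a))
        e1 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)
        e2 = rng.uniform(-np.sqrt(3.0 * 0.36), np.sqrt(3.0 * 0.36), n)
    x1 = e1
    x2 = 0.8 * x1 + e2
    return x1, x2

def dependence(x, y):
    # Regress y on x; heuristically measure dependence of x and the residual.
    r = y - np.cov(x, y)[0, 1] / np.var(x) * x
    return abs(np.corrcoef(x, np.tanh(r))[0, 1])

for gaussian in (True, False):
    x1, x2 = simulate_model1(gaussian)
    print("gaussian" if gaussian else "uniform ",
          "true direction:", round(dependence(x1, x2), 4),
          "wrong direction:", round(dependence(x2, x1), 4))
# Gaussian: both near zero. Uniform: the wrong direction is consistently larger.
```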
21 Linear Non-Gaussian Acyclic Model: LiNGAM
(Shimizu, Hyvarinen, Hoyer & Kerminen, JMLR, 2006)
• Non-Gaussian version of the linear acyclic SEM:

  xi = Σ_{j: parents of i} bij xj + ei   or   x = Bx + e

  – The external influence variables ei (disturbances, errors) are:
    • of non-zero variance;
    • non-Gaussian and mutually independent.
22 Identifiability of the LiNGAM model
• The LiNGAM model can be shown to be identifiable:
  – B is uniquely estimated.
• To see the identifiability, it is helpful to review independent component analysis (ICA) (Hyvarinen et al., 2001).
23 Independent Component Analysis (ICA)
(Jutten & Herault, 1991; Comon, 1994)
• The observed random vector x is modeled by

  xi = Σ_{j=1}^{p} aij sj   or   x = As

  – The mixing matrix A = [aij] is square and of full column rank.
  – The latent variables si (independent components) are of non-zero variance, non-Gaussian and mutually independent.
• Then A is identifiable up to permutation P and scaling D of the columns:

  A_ica = APD
24 Estimation of ICA
• Most estimation methods estimate W = A^{-1} (Hyvarinen et al., 2001):

  x = As = W^{-1}s

• Most of the methods minimize the mutual information (or an approximation of it) of the estimated independent components:

  ŝ = W_ica x

• W is estimated up to permutation P and scaling D of the rows:

  W_ica = PDW (= PDA^{-1})

• Consistent and computationally efficient algorithms:
  – Fixed point (FastICA) (Hyvarinen, 99); gradient-based (Amari, 98)
  – Semiparametric: no specific distributional assumption
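A minimal sketch of this step with scikit-learn's FastICA (an assumption: any ICA implementation with the same contract would do; components_ plays the role of W_ica and is recovered only up to row permutation and scaling):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Two non-Gaussian (uniform) sources mixed by a known square matrix A.
S = rng.uniform(-1.0, 1.0, size=(5000, 2))
A = np.array([[1.0, 0.5],
              [0.2, 1.0]])
X = S @ A.T                       # observed data, one row per sample

ica = FastICA(n_components=2, random_state=0)
ica.fit(X)
W_ica = ica.components_           # estimate of A^{-1} up to row permutation/scaling

# W_ica @ A should be close to PD: one dominant entry per row and column.
print(np.round(W_ica @ A, 2))
```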
Back to LiNGAM model
26 Identifiability of LiNGAM (1/3): ICA achieves half of identification
• The LiNGAM model is ICA:
  – The observed variables xi are linear combinations of the non-Gaussian, independent external influences ei:

    x = Bx + e ⇔ x = (I − B)^{-1}e = Ae = W^{-1}e,   where W = I − B

• ICA gives W_ica = PDW = PD(I − B).
  – P: unknown permutation matrix
  – D: unknown scaling (diagonal) matrix
• Need to determine P and D to identify B.
27 Identifiability of LiNGAM (2/3): No permutation indeterminacy (1/6)
• ICA gives W_ica = PDW = PD(I − B).
  – P: permutation matrix; D: scaling (diagonal) matrix
• We want to find a permutation matrix P' that cancels the permutation P, i.e., P'P = I:

  P' W_ica = P'PDW = DW

• We can show (Shimizu et al., UAI05) (illustrated in the next slides):
  – If P'P = I, i.e., no permutation is made on the rows of DW, then P' W_ica has no zero in the diagonal (obvious by definition).
  – If P'P ≠ I, i.e., any nonidentical permutation is made on the rows of DW, then P' W_ica has a zero in the diagonal.
28 Identifiability of LiNGAM (2/3): No permutation indeterminacy (2/6)
• By definition, W = I − B has all unities in the diagonal:
  – The diagonal elements of B are all zeros.
• Acyclicity ensures the existence of an ordering of the variables that makes B lower-triangular, and then W = I − B is also lower-triangular.
• So, without loss of generality, W can be assumed to be lower-triangular:

      [1  0  0]
  W = [*  1  0]      ← no zeros in the diagonal!
      [*  *  1]
29 Identifiability of LiNGAM (2/3): No permutation indeterminacy (3/6)
• Premultiplying W by a scaling (diagonal) matrix D does NOT change the zero/non-zero pattern of W:

      [1  0  0]           [d11   0    0 ]
  W = [*  1  0]     DW =  [ *   d22   0 ]      ← no zeros in the diagonal!
      [*  *  1]           [ *    *   d33]
30 Identifiability of LiNGAM (2/3): No permutation indeterminacy (4/6)
• Any other permutation of the rows of DW changes the zero/non-zero pattern of DW and brings a zero into the diagonal:

       [d11   0    0 ]   exchanging 1st            [ *   d22   0 ]
  DW = [ *   d22   0 ]   and 2nd rows  →  P12 DW = [d11   0    0 ]   ← zero in the diagonal!
       [ *    *   d33]                             [ *    *   d33]
31 Identifiability of LiNGAM (2/3): No permutation indeterminacy (5/6)
• Any other permutation of the rows of DW changes the zero/non-zero pattern of DW and brings a zero into the diagonal:

       [d11   0    0 ]   exchanging 1st            [ *    *   d33]
  DW = [ *   d22   0 ]   and 3rd rows  →  P13 DW = [ *   d22   0 ]   ← zero in the diagonal!
       [ *    *   d33]                             [d11   0    0 ]
32 Identifiability of LiNGAM (2/3): No permutation indeterminacy (6/6)
• We can find the correct P' by finding the P' that gives no zero on the diagonal of P' W_ica (Shimizu et al., UAI05).
• Thus, we can solve the permutation indeterminacy and obtain:

  P' W_ica = P'PDW = DW = D(I − B)
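A numeric illustration of this argument (NumPy sketch, reusing the three-variable B from the earlier sketches): scale the rows of W = I − B by an unknown diagonal D and try every row permutation; only the identity permutation leaves a zero-free diagonal.

```python
import numpy as np
from itertools import permutations

B = np.array([[ 0.0, 0.0, 1.5],
              [-1.3, 0.0, 0.0],
              [ 0.0, 0.0, 0.0]])
W = np.eye(3) - B                           # unities on the diagonal

rng = np.random.default_rng(0)
DW = np.diag(rng.uniform(0.5, 2.0, 3)) @ W  # unknown row scaling D applied to W

for p in permutations(range(3)):
    diag = np.diag(DW[list(p), :])          # diagonal after permuting the rows
    print(p, "zero on diagonal:", bool(np.any(np.isclose(diag, 0.0))))
# Only p = (0, 1, 2), the identity permutation, prints False.
```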
33 Identifiability of LiNGAM (3/3): No scaling indeterminacy
• Now we have: P' W_ica = D(I − B).
• Then, D = diag(P' W_ica).
• Divide each row of P' W_ica by its corresponding diagonal element to get I − B, i.e., B:

  diag(P' W_ica)^{-1} P' W_ica = D^{-1} D(I − B) = I − B
34 
Estimation of LiNGAM model 
1. ICA-LiNGAM algorithm 
2. DirectLiNGAM algorithm
35 Two estimation algorithms
• ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvarinen & Kerminen, JMLR, 2006)
• DirectLiNGAM algorithm (Shimizu, Hyvarinen, Kawahara & Washio, UAI09)
• Both estimate an ordering of the variables that makes the path-coefficient matrix B lower-triangular:

  x_perm = B_perm x_perm + e_perm,   with B_perm lower-triangular

  – Acyclicity ensures the existence of such an ordering.
  [Figure: the ordering corresponds to a full DAG over x1, x2, x3, possibly containing redundant edges.]
36 Once such an ordering is found…
• Many existing (covariance-based) methods can do:
  – Pruning of the redundant path coefficients
    • Sparse methods like weighted lasso (Zou, 2006)
  – Finding significant path coefficients
    • Testing, bootstrapping (Shimizu et al., 2006; Hyvarinen et al., 2010)

  [Figure: the full DAG implied by the ordering is pruned back to the sparse data-generating DAG over x1, x2, x3.]
37 1. Outline of the ICA-LiNGAM algorithm
(Shimizu, Hoyer, Hyvarinen & Kerminen, JMLR, 2006)
1. Estimate B by ICA + permutation
2. Pruning

  [Figure: step 1 yields a dense graph over x1, x2, x3 including redundant edges such as b13 and b23; step 2 prunes them.]
38 ICA-LiNGAM algorithm (1/2): Step 1. Estimation of B
1. Perform ICA (here, FastICA (Hyvarinen, 1999)) to obtain an estimate Ŵ_ica of W_ica = PDW = PD(I − B).
2. Find a permutation P' that makes the diagonal elements of P'Ŵ_ica as large (non-zero) as possible in absolute value:

   P̂' = argmin_{P'} Σ_i 1 / |(P'Ŵ_ica)_ii|   (Hungarian algorithm, Kuhn, 1955)

3. Normalize each row of P̂'Ŵ_ica; then we get an estimate of I − B, and hence B̂.
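A sketch of Step 1's permutation and normalization (assuming NumPy/SciPy; scipy.optimize.linear_sum_assignment is a standard Hungarian-algorithm solver). W_ica could come from the FastICA sketch earlier.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ica_lingam_step1(W_ica):
    """Recover an estimate of B from an ICA unmixing estimate W_ica ~ PD(I - B)."""
    p = W_ica.shape[0]
    # Hungarian algorithm: pick one entry per row and column minimizing
    # sum_i 1/|w|, i.e. making the permuted diagonal large in absolute value.
    cost = 1.0 / np.maximum(np.abs(W_ica), 1e-12)
    rows, cols = linear_sum_assignment(cost)
    PW = np.empty_like(W_ica)
    PW[cols, :] = W_ica[rows, :]            # chosen entries land on the diagonal
    W_hat = PW / np.diag(PW)[:, None]       # normalize each row by its diagonal
    return np.eye(p) - W_hat                # B_hat = I - W_hat
```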
39 ICA-LiNGAM algorithm (2/2): Step 2. Pruning
• Find an ordering of the variables that makes the estimated B as close to lower-triangular as possible:
  – Find a permutation matrix Q that minimizes the sum of the squared elements in the upper-triangular part of the permuted B̂:

    Q̂ = argmin_Q Σ_{i≤j} [(Q B̂ Qᵀ)_ij]²

  – An approximate algorithm is available for large numbers of variables (Hoyer et al., ICA06).

  [Figure: small estimated coefficients in the upper-triangular part (e.g., 0.1, -0.01) are set to zero, leaving the lower-triangular structure.]
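A brute-force sketch of the Step 2 permutation search (NumPy; feasible only for small numbers of variables, which is why the approximate algorithm cited above matters):

```python
import numpy as np
from itertools import permutations

def find_causal_order(B_hat):
    """Ordering that makes B_hat as close to lower-triangular as possible."""
    p = B_hat.shape[0]
    best_perm, best_score = None, np.inf
    for perm in permutations(range(p)):
        Bp = B_hat[np.ix_(perm, perm)]
        score = np.sum(np.triu(Bp) ** 2)    # squared mass on/above the diagonal
        if score < best_score:
            best_perm, best_score = perm, score
    return best_perm

# Coefficients that violate the found ordering can then be pruned to zero.
```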
40 Basic properties 
of ICA-LiNGAM algorithm 
• ICA-LiNGAM algorithm = ICA + permutations 
– Computationally efficient with the help of well-developed 
ICA techniques. 
• Potential problems 
– ICA is an iterative search method: 
• May get stuck in a local optimum if the initial guess or step 
size is badly chosen. 
– The permutation algorithms are not scale-invariant: 
• May provide different estimates for different scales of 
variables.
41 
Estimation of LiNGAM model 
1. ICA-LiNGAM algorithm 
2. DirectLiNGAM algorithm
42 2. DirectLiNGAM algorithm
(Shimizu, Hyvarinen, Kawahara & Washio, UAI2009)
• An alternative estimation method without ICA:
  – Estimates an ordering of the variables that makes the path-coefficient matrix B lower-triangular.
  [Figure: the ordering corresponds to a full DAG over x1, x2, x3, possibly containing redundant edges.]
• Many existing (covariance-based) methods can then do further pruning or find significant path coefficients (Zou, 2006; Shimizu et al., 2006; Hyvarinen et al., 2010).
43 Basic idea (1/2): An exogenous variable can be at the top of a right ordering
• An exogenous variable xj is a variable with no parents (Bollen, 1989); here, x3.
  – The corresponding row of B has all zeros.
• So an exogenous variable can be at the top of an ordering that makes B lower-triangular with zeros on the diagonal:

    [x3]   [ 0    0    0 ] [x3]   [e3]
    [x1] = [ 1.5  0    0 ] [x1] + [e1]        (DAG: x3 → x1 → x2)
    [x2]   [ 0   -1.3  0 ] [x2]   [e2]
44 Basic idea (2/2): Regress the exogenous variable out
• Compute the residuals ri(3) (i = 1, 2) of regressing the other variables xi on the exogenous x3:
  – The residuals r1(3) and r2(3) form a LiNGAM model.
  – The ordering of the residuals is equivalent to that of the corresponding original variables:

    [x3]   [ 0    0    0 ] [x3]   [e3]         [r1(3)]   [ 0    0 ] [r1(3)]   [e1]
    [x1] = [ 1.5  0    0 ] [x1] + [e1]    →    [r2(3)] = [-1.3  0 ] [r2(3)] + [e2]
    [x2]   [ 0   -1.3  0 ] [x2]   [e2]

• r1(3) is exogenous in the residual model, which implies that x1 can be at the second top.
45 Outline of DirectLiNGAM
• Iteratively find exogenous variables until all the variables are ordered:
  1. Find an exogenous variable, here x3.
     – Put x3 at the top of the ordering.
     – Regress x3 out.
  2. Find an exogenous residual, here r1(3).
     – Put x1 at the second top of the ordering.
     – Regress r1(3) out (producing r2(3,1)).
  3. Put x2 at the third top of the ordering and terminate.
• The estimated ordering is x3 < x1 < x2.
46-1 Identification of an exogenous variable (two-variable case)
• Case i) x1 IS exogenous: x1 = e1, x2 = b21 x1 + e2 (b21 ≠ 0).
  Regressing x2 on x1:

    r2(1) = x2 − {cov(x2, x1) / var(x1)} x1 = x2 − b21 x1 = e2

  x1 and r2(1) are independent.
• Case ii) x1 is NOT exogenous: x1 = b12 x2 + e1 (b12 ≠ 0), x2 = e2.
  Regressing x2 on x1:

    r2(1) = x2 − {cov(x2, x1) / var(x1)} x1
          = {1 − b12 cov(x2, x1) / var(x1)} e2 − {cov(x2, x1) / var(x1)} e1

  x1 and r2(1) are NOT independent.
46-2 Need to use the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953)
• Case ii) x1 is NOT exogenous: x1 = b12 x2 + e1 (b12 ≠ 0), x2 = e2.
  Regressing x2 on x1:

    r2(1) = x2 − {cov(x2, x1) / var(x1)} x1
          = {1 − b12 cov(x2, x1) / var(x1)} e2 − {cov(x2, x1) / var(x1)} e1

  x1 and r2(1) are NOT independent.
• Darmois-Skitovitch theorem: define two variables x1 and x2 as

    x1 = Σ_{j=1}^{p} a1j ej,   x2 = Σ_{j=1}^{p} a2j ej,

  where the ej are independent random variables. If there exists a non-Gaussian ei for which a1i a2i ≠ 0, then x1 and x2 are dependent.
47 Identification of an exogenous variable (p-variable case)
• Lemma 1: xj and all of its residuals ri(j) = xi − {cov(xi, xj) / var(xj)} xj (i ≠ j) are independent ⇔ xj is exogenous.
• In practice, we can identify an exogenous variable by finding the variable that is most independent of its residuals.
48 Independence measures
• Evaluate independence between a variable and a residual by a nonlinear correlation: corr{xj, g(ri(j))} with g = tanh.
• Taking the sum over all the residuals, we get:

  T(xj) = Σ_{i≠j} [ corr{xj, g(ri(j))} + corr{g(xj), ri(j)} ]

• More sophisticated measures can be used as well (Bach & Jordan, 2002; Gretton et al., 2005; Kraskov et al., 2004).
  – A kernel-based independence measure (Bach & Jordan, 2002) often gives more accurate estimates (Sogawa et al., IJCNN10).
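A compact sketch of DirectLiNGAM built on this measure (NumPy; summing absolute values of the nonlinear correlations is one simple choice among several, and kernel measures would typically be more accurate as noted above). On the X generated in the earlier sketch, direct_lingam_order should typically return [2, 0, 1], i.e. x3 < x1 < x2.

```python
import numpy as np

def residual(xi, xj):
    """Residual of least-squares simple regression of xi on xj."""
    return xi - np.cov(xi, xj)[0, 1] / np.var(xj) * xj

def dependence_score(xj, others):
    """T(xj): summed nonlinear correlations between xj and its residuals."""
    T = 0.0
    for xi in others:
        r = residual(xi, xj)
        T += abs(np.corrcoef(xj, np.tanh(r))[0, 1])
        T += abs(np.corrcoef(np.tanh(xj), r)[0, 1])
    return T

def direct_lingam_order(X):
    """X: array of shape (p, n). Returns an estimated causal ordering."""
    X = X.copy()
    remaining = list(range(X.shape[0]))
    order = []
    while len(remaining) > 1:
        # The most exogenous variable is the most independent of its residuals.
        scores = [dependence_score(X[j], [X[i] for i in remaining if i != j])
                  for j in remaining]
        j = remaining[int(np.argmin(scores))]
        order.append(j)
        remaining.remove(j)
        for i in remaining:                 # regress the found variable out
            X[i] = residual(X[i], X[j])
    order.append(remaining[0])
    return order
```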
49 Important properties of DirectLiNGAM
• DirectLiNGAM repeats:
  – Least-squares simple linear regression
  – Evaluation of pairwise independence between each variable and its residuals
• No algorithmic parameters like step size, initial guesses, or convergence criteria.
• Guaranteed convergence to the right solution in a fixed number of steps (the number of variables) if the data strictly follow the model.
50 Estimation of the LiNGAM model: Summary (1/2)
• Two estimation algorithms:
  – ICA-LiNGAM: estimation using ICA
    • Pros: fast.
    • Cons: possible local optima; not scale-invariant.
  – DirectLiNGAM: alternative estimation without ICA
    • Pros: guaranteed convergence; scale-invariant.
    • Cons: less fast.
  – Cf. Neither needs faithfulness (Shimizu et al., JMLR, 2006; Hoyer, personal comm., July 2010).
51 Estimation of the LiNGAM model: Summary (2/2)
• Experimental comparison of the two algorithms (Sogawa et al., IJCNN2010):
  – Sample size: both need a sample size of at least 1000 for more than 10 variables.
  – Scalability: both can analyze 100 variables. The performances depend on the sample size etc., of course!
  – Scale invariance: ICA-LiNGAM is less robust to changes in the scales of the variables.
  – Local optima:
    • For fewer than 10 variables, ICA-LiNGAM is often a bit better.
    • For more than 10 variables, DirectLiNGAM is often better, perhaps because the problem of local optima becomes more serious.
52 
Testing and 
Reliability evaluation
53 Testing testable assumptions
• Non-Gaussianity of the external influences ei:
  – Gaussianity tests
• Could detect violations of some assumptions:
  – Local tests:
    • Independence of the external influences ei
    • Conditional independencies between the observed variables x (causal Markov condition)
    • Linearity
  – Overall fit of the model assumptions:
    • Chi-square test using 3rd- and/or 4th-order moments (Shimizu & Kano, 2008)
• Still under development
54 Reliability evaluation
• Need to evaluate the statistical reliability of LiNGAM results:
  – Sample fluctuations
  – Smaller non-Gaussianity makes the model closer to being NOT identifiable.
• Reliability evaluation by bootstrapping (Komatsu, Shimizu & Shimodaira, ICANN2010; Hyvarinen et al., JMLR, 2010):
  – If either the sample size is small or the magnitude of non-Gaussianity is small, LiNGAM will give very different results across bootstrap samples.
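A sketch of the bootstrap idea (NumPy; direct_lingam_order from the earlier sketch is used as a stand-in estimator, and any LiNGAM estimator could be plugged in):

```python
import numpy as np
from collections import Counter

def bootstrap_orders(X, estimator, n_boot=100, seed=0):
    """Re-estimate the causal ordering on bootstrap resamples of the data."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    orders = Counter()
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample observations with replacement
        orders[tuple(estimator(X[:, idx]))] += 1
    return orders

# counts = bootstrap_orders(X, direct_lingam_order)
# One dominant ordering suggests a reliable result; widely spread counts
# suggest a small sample or weak non-Gaussianity.
```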
55 
Extensions
56 Extensions (a partial list)
• Relaxing the assumptions of the LiNGAM model:
  – Acyclic → cyclic (Lacerda et al., UAI2008)
  – Single homogeneous population → heterogeneous population (Shimizu et al., ICONIP2007)
  – i.i.d. sampling → time structures (Part II) (Hyvarinen et al., JMLR, 2010; Kawahara, S & Washio, 2010)
  – No latent confounders → allow latents (Part II) (Hoyer et al., IJAR, 08; Kawahara, Bollen et al., 2010)
  – Linear → nonlinear (Hoyer et al., NIPS08; Zhang & Hyvarinen, UAI09; Tilmann, Gretton & Spirtes, NIPS09)
57 
Application areas so far
58 Non-Gaussian SEMs have 
been applied to… 
• Neuroinformatics 
– Brain connectivity analysis (Hyvarinen et al., JMLR, 2010) 
• Bioinformatics 
– Gene network estimation (Sogawa et al., ICANN2010) 
• Economics (Wan & Tan, 2009; Moneta, Entner, Hoyer & Coad, 2010) 
• Genetics (Ozaki & Ando, 2009) 
• Environmental sciences (Niyogi et al., 2010) 
• Physics (Kawahara, Shimizu & Washio, 2010) 
• Sociology (Kawahara, Bollen, Shimizu & Washio, 2010)
59 
Final summary of Part I 
• Use of non-Gaussianity in linear SEMs is 
useful for model identification. 
• Non-Gaussian data is encountered in many 
applications. 
• The non-Gaussian approach can be a good 
option. 
• Links to codes and papers: 
http://homepage.mac.com/shoheishimizu/lingampapers.html
60 
FAQs
61 Q. My data is Gaussian. LiNGAM will not be useful.
• A. You're right. Try Gaussian methods.
• Comment: Hoyer et al. (UAI2008) showed to what extent one can identify the model for a mixture of Gaussian and non-Gaussian external influence variables.
62 Q. I applied LiNGAM, but the result is not reasonable given background knowledge.
• A. You might first want to check:
  – Some model assumptions might be violated.
    → Try other extensions of LiNGAM, or non-parametric methods such as PC or FCI (Spirtes et al., 2000).
  – Small sample size or small non-Gaussianity.
    → Try the bootstrap to see if the result is reliable.
  – Background knowledge might be wrong.
63 Q. Relation to the causal Markov condition?
• A. The following 3 estimation principles are equivalent (Zhang & Hyvarinen, ECML09; Hyvarinen et al., JMLR, 2010):
  1. Maximize independence between the external influences ei.
  2. Minimize the sum of the entropies of the external influences ei.
  3. Causal Markov condition (each variable is independent of its non-descendants in the DAG conditional on its parents) AND maximization of independence between the parents of each variable and its corresponding external influence ei.
64 Q. I am a psychometrician and am more interested in latent factors.
• A. Shimizu, Hoyer, and Hyvarinen (2009) propose LiNGAM for latent factors:

  f = Bf + d   -- LiNGAM for latent factors
  x = Gf + e   -- Measurement model
65 Others
• Q. Prior knowledge?
  – It is possible to incorporate prior knowledge. The accuracy of DirectLiNGAM is often greatly improved even if the amount of prior knowledge is not so large (Inazumi et al., LVA/ICA2010).
• Q. Sparse LiNGAM?
  – Zhang et al. (ICA09) and Hyvarinen et al. (JMLR, 2010).
  – ICA + adaptive lasso (Zou, 2006).
• Q. Bayesian approach?
  – Hoyer and Hyttinen (UAI09); Henao and Winther (NIPS09).
• Q. Can the idea be applied to discrete variables?
  – One proposal by Peters et al. (AISTATS2010).
  – Comment: if your discrete variables are close to continuous, e.g., ordinal scales with many points, LiNGAM might work.
66 Q. Nonlinear extensions?
• A. Several nonlinear SEMs have been proposed (DAG; no latent confounders):

  1. xi = Σ_j fij(parent j of xi) + ei                  -- Imoto et al. (2002)
  2. xi = fi(parents of xi) + ei                        -- Hoyer et al. (NIPS08)
  3. xi = f_{i,2}^{-1}( f_{i,1}(parents of xi) + ei )   -- Zhang et al. (UAI09)

• For two-variable cases, unique identification is possible except for several combinations of nonlinearities and distributions (Hoyer et al., NIPS08; Zhang & Hyvarinen, UAI09).
67 Nonlinear extensions (continued)
• Proposals aiming at computational efficiency (Mooij et al., ICML09; Tilmann et al., NIPS09; Zhang & Hyvarinen, ECML09; UAI09).
• Pros:
  – Nonlinear models are more general than linear models.
• Cons:
  – Computationally demanding.
    • Currently: at most 7 or 8 variables.
    • Perhaps the assumption of Gaussian external influences might help: Imoto et al. (2002) analyze 100 variables.
  – More difficult to allow other possible violations of LiNGAM assumptions, latent confounders etc.
68 Q. My data follows neither such linear SEMs nor such nonlinear SEMs as you have talked about.
• A. Try non-parametric methods for the general model xi = fi(parents of xi, ei), e.g.,
  – DAG: PC (Spirtes & Glymour, 1991)
  – DAG with latent confounders: FCI (Spirtes et al., 1995)
• Probably you will get a (probably large) equivalence class of models rather than a single model, but that would be the best you currently can do.
69 
Q. Deterministic relations? 
• A. LiNGAM is not applicable. 
• See Daniusis et al. (UAI2010) for a method to 
analyze deterministic relations.
70 
Extra slides
Very brief review of 
Linear SEMs and causality 
See Pearl (2000) for details.
72 We can give a causal interpretation to each path coefficient bij (Pearl, 2000)
• Example:

    x1 = e1
    x2 = b21 x1 + e2        (DAG: x1 → x2 → x3)
    x3 = b32 x2 + e3

• Causal interpretation of b32: if you change x2 from a constant c0 to a constant c1, then E(x3) will change by b32(c1 − c0).
73 'Change x2 from c0 to c1'
• It means 'fix x2 at c0 regardless of the variables that previously determine x2, and change the constant from c0 to c1' (Pearl, 2000).
• 'Fix x2 at c0 regardless of the variables that determine x2' is denoted by do(x2 = c0).
  – In the SEM, 'replace the function determining x2 with a constant c0':

    Original model:          Mutilated model with do(x2 = c0):
    x1 = e1                  x1 = e1
    x2 = b21 x1 + e2         x2 = c0
    x3 = b32 x2 + e3         x3 = b32 x2 + e3

  (Assumption of invariance: the other functions don't change.)
74 Average causal effect: Definition (Rubin, 1974; Pearl, 2000)
• Average causal effect of x2 on x3 when changing x2 from c0 to c1:

  E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0)) = b32 (c1 − c0)

  – Computed based on the mutilated models with do(x2 = c1) or do(x2 = c0).
• Average causal effects can be computed based on the estimated path-coefficient matrix B.
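A small simulation sketch of this definition (NumPy; the chain x1 → x2 → x3 and its coefficients here are hypothetical): simulate the two mutilated models and compare the difference of means with b32(c1 − c0).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200000
b21, b32 = 0.8, 1.5            # hypothetical path coefficients, x1 -> x2 -> x3

def mean_x3_under_do(c):
    """Mutilated model: the equation for x2 is replaced by the constant c."""
    e1 = rng.uniform(-1.0, 1.0, n)
    e3 = rng.uniform(-1.0, 1.0, n)
    x1 = e1                     # unchanged (assumption of invariance)
    x2 = np.full(n, c)          # do(x2 = c): b21*x1 + e2 is ignored
    x3 = b32 * x2 + e3
    return x3.mean()

c0, c1 = 0.0, 1.0
ace = mean_x3_under_do(c1) - mean_x3_under_do(c0)
print(ace, "vs", b32 * (c1 - c0))   # these agree up to sampling noise
```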
75 Cf. Conducting experiments with random assignment is a method to estimate average causal effects
• Example: a drug trial.
• Random assignment of subjects into two groups fixes x2 to c1 or c0 regardless of the variables that determine x2:
  – Experimental group with x2 = c1 (take the drug)
  – Control group with x2 = c0 (placebo)

  E(x3 | do(x2 = c1)) − E(x3 | do(x2 = c0))
  = E(x3) for the experimental group − E(x3) for the control group

Mais conteúdo relacionado

Mais procurados

PRML輪読#10
PRML輪読#10PRML輪読#10
PRML輪読#10matsuolab
 
構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展
構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展
構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展Shiga University, RIKEN
 
PRML上巻勉強会 at 東京大学 資料 第1章後半
PRML上巻勉強会 at 東京大学 資料 第1章後半PRML上巻勉強会 at 東京大学 資料 第1章後半
PRML上巻勉強会 at 東京大学 資料 第1章後半Ohsawa Goodfellow
 
パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰sleipnir002
 
自動微分変分ベイズ法の紹介
自動微分変分ベイズ法の紹介自動微分変分ベイズ法の紹介
自動微分変分ベイズ法の紹介Taku Yoshioka
 
機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料
機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料
機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料at grandpa
 
ベイズ統計入門
ベイズ統計入門ベイズ統計入門
ベイズ統計入門Miyoshi Yuya
 
PRML輪読#3
PRML輪読#3PRML輪読#3
PRML輪読#3matsuolab
 
第8章 ガウス過程回帰による異常検知
第8章 ガウス過程回帰による異常検知第8章 ガウス過程回帰による異常検知
第8章 ガウス過程回帰による異常検知Chika Inoshita
 
ベイズ統計学の概論的紹介
ベイズ統計学の概論的紹介ベイズ統計学の概論的紹介
ベイズ統計学の概論的紹介Naoki Hayashi
 
PRML上巻勉強会 at 東京大学 資料 第1章前半
PRML上巻勉強会 at 東京大学 資料 第1章前半PRML上巻勉強会 at 東京大学 資料 第1章前半
PRML上巻勉強会 at 東京大学 資料 第1章前半Ohsawa Goodfellow
 
変分ベイズ法の説明
変分ベイズ法の説明変分ベイズ法の説明
変分ベイズ法の説明Haruka Ozaki
 
統計的学習の基礎_3章
統計的学習の基礎_3章統計的学習の基礎_3章
統計的学習の基礎_3章Shoichi Taguchi
 
統計的因果推論 勉強用 isseing333
統計的因果推論 勉強用 isseing333統計的因果推論 勉強用 isseing333
統計的因果推論 勉強用 isseing333Issei Kurahashi
 
統計的因果推論への招待 -因果構造探索を中心に-
統計的因果推論への招待 -因果構造探索を中心に-統計的因果推論への招待 -因果構造探索を中心に-
統計的因果推論への招待 -因果構造探索を中心に-Shiga University, RIKEN
 
統計的学習の基礎 5章前半(~5.6)
統計的学習の基礎 5章前半(~5.6)統計的学習の基礎 5章前半(~5.6)
統計的学習の基礎 5章前半(~5.6)Kota Mori
 

Mais procurados (20)

PRML輪読#10
PRML輪読#10PRML輪読#10
PRML輪読#10
 
構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展
構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展
構造方程式モデルによる因果推論: 因果構造探索に関する最近の発展
 
PRML上巻勉強会 at 東京大学 資料 第1章後半
PRML上巻勉強会 at 東京大学 資料 第1章後半PRML上巻勉強会 at 東京大学 資料 第1章後半
PRML上巻勉強会 at 東京大学 資料 第1章後半
 
パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰
 
自動微分変分ベイズ法の紹介
自動微分変分ベイズ法の紹介自動微分変分ベイズ法の紹介
自動微分変分ベイズ法の紹介
 
Prml 1.3~1.6 ver3
Prml 1.3~1.6 ver3Prml 1.3~1.6 ver3
Prml 1.3~1.6 ver3
 
機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料
機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料
機械学習プロフェッショナルシリーズ輪読会 #5 異常検知と変化検知 Chapter 1 & 2 資料
 
PRML chapter7
PRML chapter7PRML chapter7
PRML chapter7
 
EMアルゴリズム
EMアルゴリズムEMアルゴリズム
EMアルゴリズム
 
ベイズ統計入門
ベイズ統計入門ベイズ統計入門
ベイズ統計入門
 
PRML輪読#3
PRML輪読#3PRML輪読#3
PRML輪読#3
 
第8章 ガウス過程回帰による異常検知
第8章 ガウス過程回帰による異常検知第8章 ガウス過程回帰による異常検知
第8章 ガウス過程回帰による異常検知
 
Prml 3 3.3
Prml 3 3.3Prml 3 3.3
Prml 3 3.3
 
ベイズ統計学の概論的紹介
ベイズ統計学の概論的紹介ベイズ統計学の概論的紹介
ベイズ統計学の概論的紹介
 
PRML上巻勉強会 at 東京大学 資料 第1章前半
PRML上巻勉強会 at 東京大学 資料 第1章前半PRML上巻勉強会 at 東京大学 資料 第1章前半
PRML上巻勉強会 at 東京大学 資料 第1章前半
 
変分ベイズ法の説明
変分ベイズ法の説明変分ベイズ法の説明
変分ベイズ法の説明
 
統計的学習の基礎_3章
統計的学習の基礎_3章統計的学習の基礎_3章
統計的学習の基礎_3章
 
統計的因果推論 勉強用 isseing333
統計的因果推論 勉強用 isseing333統計的因果推論 勉強用 isseing333
統計的因果推論 勉強用 isseing333
 
統計的因果推論への招待 -因果構造探索を中心に-
統計的因果推論への招待 -因果構造探索を中心に-統計的因果推論への招待 -因果構造探索を中心に-
統計的因果推論への招待 -因果構造探索を中心に-
 
統計的学習の基礎 5章前半(~5.6)
統計的学習の基礎 5章前半(~5.6)統計的学習の基礎 5章前半(~5.6)
統計的学習の基礎 5章前半(~5.6)
 

Destaque

Non-Gaussian structural equation models for causal discovery
Non-Gaussian structural equation models for causal discoveryNon-Gaussian structural equation models for causal discovery
Non-Gaussian structural equation models for causal discoveryShiga University, RIKEN
 
A non-Gaussian model for causal discovery in the presence of hidden common ca...
A non-Gaussian model for causal discovery in the presence of hidden common ca...A non-Gaussian model for causal discovery in the presence of hidden common ca...
A non-Gaussian model for causal discovery in the presence of hidden common ca...Shiga University, RIKEN
 
因果探索: 基本から最近の発展までを概説
因果探索: 基本から最近の発展までを概説因果探索: 基本から最近の発展までを概説
因果探索: 基本から最近の発展までを概説Shiga University, RIKEN
 
構造方程式モデルによる因果探索と非ガウス性
構造方程式モデルによる因果探索と非ガウス性構造方程式モデルによる因果探索と非ガウス性
構造方程式モデルによる因果探索と非ガウス性Shiga University, RIKEN
 
Linear Non-Gaussian Structural Equation Models
Linear Non-Gaussian Structural Equation ModelsLinear Non-Gaussian Structural Equation Models
Linear Non-Gaussian Structural Equation ModelsShiga University, RIKEN
 
非ガウス性を利用した 因果構造探索
非ガウス性を利用した因果構造探索非ガウス性を利用した因果構造探索
非ガウス性を利用した 因果構造探索Shiga University, RIKEN
 
因果探索: 観察データから 因果仮説を探索する
因果探索: 観察データから因果仮説を探索する因果探索: 観察データから因果仮説を探索する
因果探索: 観察データから 因果仮説を探索するShiga University, RIKEN
 
PDF uncertainties the LHC made easy: a compression algorithm for the combinat...
PDF uncertainties the LHC made easy: a compression algorithm for the combinat...PDF uncertainties the LHC made easy: a compression algorithm for the combinat...
PDF uncertainties the LHC made easy: a compression algorithm for the combinat...juanrojochacon
 

Destaque (8)

Non-Gaussian structural equation models for causal discovery
Non-Gaussian structural equation models for causal discoveryNon-Gaussian structural equation models for causal discovery
Non-Gaussian structural equation models for causal discovery
 
A non-Gaussian model for causal discovery in the presence of hidden common ca...
A non-Gaussian model for causal discovery in the presence of hidden common ca...A non-Gaussian model for causal discovery in the presence of hidden common ca...
A non-Gaussian model for causal discovery in the presence of hidden common ca...
 
因果探索: 基本から最近の発展までを概説
因果探索: 基本から最近の発展までを概説因果探索: 基本から最近の発展までを概説
因果探索: 基本から最近の発展までを概説
 
構造方程式モデルによる因果探索と非ガウス性
構造方程式モデルによる因果探索と非ガウス性構造方程式モデルによる因果探索と非ガウス性
構造方程式モデルによる因果探索と非ガウス性
 
Linear Non-Gaussian Structural Equation Models
Linear Non-Gaussian Structural Equation ModelsLinear Non-Gaussian Structural Equation Models
Linear Non-Gaussian Structural Equation Models
 
非ガウス性を利用した 因果構造探索
非ガウス性を利用した因果構造探索非ガウス性を利用した因果構造探索
非ガウス性を利用した 因果構造探索
 
因果探索: 観察データから 因果仮説を探索する
因果探索: 観察データから因果仮説を探索する因果探索: 観察データから因果仮説を探索する
因果探索: 観察データから 因果仮説を探索する
 
PDF uncertainties the LHC made easy: a compression algorithm for the combinat...
PDF uncertainties the LHC made easy: a compression algorithm for the combinat...PDF uncertainties the LHC made easy: a compression algorithm for the combinat...
PDF uncertainties the LHC made easy: a compression algorithm for the combinat...
 

Semelhante a Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

Nonparametric approach to multiple regression
Nonparametric approach to multiple regressionNonparametric approach to multiple regression
Nonparametric approach to multiple regressionAlexander Decker
 
Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Japheth Muthama
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JMJapheth Muthama
 
Regression
Regression Regression
Regression Ali Raza
 
chapter1_part2.pdf
chapter1_part2.pdfchapter1_part2.pdf
chapter1_part2.pdfAliEb2
 
Linear equations in two variables
Linear equations in two variablesLinear equations in two variables
Linear equations in two variablesVivekNaithani3
 
Ch9-Gauss_Elimination4.pdf
Ch9-Gauss_Elimination4.pdfCh9-Gauss_Elimination4.pdf
Ch9-Gauss_Elimination4.pdfRahulUkhande
 
Linear Regression - Least Square.pptx
Linear Regression - Least Square.pptxLinear Regression - Least Square.pptx
Linear Regression - Least Square.pptxAyushSingh254558
 
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)마이캠퍼스
 
Regression analysis presentation
Regression analysis presentationRegression analysis presentation
Regression analysis presentationMuhammadFaisal733
 
Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章Tsuyoshi Sakama
 
SimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.pptSimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.pptAdnanAli861711
 

Semelhante a Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I (20)

Nonparametric approach to multiple regression
Nonparametric approach to multiple regressionNonparametric approach to multiple regression
Nonparametric approach to multiple regression
 
8.5 GaussianBNs.pdf
8.5 GaussianBNs.pdf8.5 GaussianBNs.pdf
8.5 GaussianBNs.pdf
 
Chapter13
Chapter13Chapter13
Chapter13
 
Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Regression Analysis by Muthama JM
Regression Analysis by Muthama JM
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JM
 
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
 
Linear regression
Linear regressionLinear regression
Linear regression
 
MLU_DTE_Lecture_2.pptx
MLU_DTE_Lecture_2.pptxMLU_DTE_Lecture_2.pptx
MLU_DTE_Lecture_2.pptx
 
Regression Analysis.pdf
Regression Analysis.pdfRegression Analysis.pdf
Regression Analysis.pdf
 
Regression
Regression Regression
Regression
 
Chapter 06
Chapter 06Chapter 06
Chapter 06
 
Input analysis
Input analysisInput analysis
Input analysis
 
chapter1_part2.pdf
chapter1_part2.pdfchapter1_part2.pdf
chapter1_part2.pdf
 
Linear equations in two variables
Linear equations in two variablesLinear equations in two variables
Linear equations in two variables
 
Ch9-Gauss_Elimination4.pdf
Ch9-Gauss_Elimination4.pdfCh9-Gauss_Elimination4.pdf
Ch9-Gauss_Elimination4.pdf
 
Linear Regression - Least Square.pptx
Linear Regression - Least Square.pptxLinear Regression - Least Square.pptx
Linear Regression - Least Square.pptx
 
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
 
Regression analysis presentation
Regression analysis presentationRegression analysis presentation
Regression analysis presentation
 
Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章
 
SimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.pptSimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.ppt
 

Último

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx023NiWayanAnggiSriWa
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxmaryFF1
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 

Último (20)

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 

Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I

  • 1. UAI2010 Tutorial, July 8, 1 Catalina Island. Non-Gaussian Methods for Learning Linear Structural Equation Models Shohei Shimizu and Yoshinobu Kawahara Osaka University 1 Special thanks to Aapo Hyvärinen, Patrik O. Hoyer and Takashi Washio.
  • 2. 2 Abstract • Linear structural equation models (linear SEMs) can be used to model data generating processes of variables. • We review a new approach to learn or estimate linear SEMs. • The new estimation approach utilizes non-Gaussianity of data for model identification and uniquely estimates much wider variety of models.
  • 3. 3 Outline • Part I. Overview (70 min.) : Shohei • Break (10 min.) • Part II. Recent advances (40 min): Yoshi – Time series – Latent confounders
  • 4. 4 Motivation (1/2) • Suppose that data X was randomly generated from either of the following two data generating processes: Model 1: Model 2: x 1 = e 1 x1 e1 or x 1 = b 12 x 2 + e 1 x = b x + e x2 e2 x 2 = e 2 x1 x2 e1 e2 2 21 1 2 1 e 2 e , ) where and are latent variables ( disturbances, errors). • We want to estimate or identify which model t generated d th the d t data X Xb d based on th the d t data X l only.
  • 5. 5 Motivation (2/2) • We want to identify which model generated the data X based on the data X only. • If x1 and x2 are Gaussian, it is well known that we cannot identify the data generating process. e 1 e 2 – M d l Models 1 d and 2 ll equally fit fitd t data. • If e 1 x1 and e 2 x2 are non-Gaussian, an interesting result is obtained: We can identify which of Models 1 and 2 generated the data data. • This tutorial reviews how such non-Gaussian methods work.
  • 7. 7 Basic problem setup (1/3) • Assume that the data generating process of continuous observed variables x i is graphically represented by a directed acyclic graph ( DAG). p y y g p ) – Acyclicity means that there are no directed cycles. Example of a directed E l f di t d acyclic graph (DAG): Example of a directed cyclic graph: x3 e3 x3 e3 x1 e1 x2 e2 x1 e1 x2 e2 is a parent of 1 etc. 3 x x
  • 8. 8 Basic problem setup (2/3) • Further assume linear relations of variables i . x • Then we obtain a linear acyclic SEM (Wright, 1921; Bollen, 1989): x = Bx + e i ij j i x = Σ b x + e or where j i : parents of – The are continuous latent variables that are not model external i e determined inside the model, which we call influences (disturbances, errors). – The e i are of non-zero variance and are independent. – The ‘path-coefficient’ matrix B = [ ] corresponds to a DAG. ij b
  • 9. 9 p Example of linear y acyclic SEMs • A three-variable linear acyclic SEM: ⎥ ⎤ ⎢ ⎡ ⎥ ⎤ ⎢ ⎡ ⎥ ⎤ ⎢ ⎡ ⎥ ⎤ ⎢ ⎡ 1 1 1 x 0 0 1.5 x e 1 3 1 x =1.5x + e ⎢ e ⎣ 3 ⎦ ⎢ ⎢ ⎥ ⎥ ⎢ ⎣ + x ⎦ ⎢ ⎢ ⎥ ⎥ ⎢ ⎣ ⎢ ⎣ ⎥ ⎥ ⎦ ⎢ ⎢ = − ⎥ ⎦ ⎢ ⎢ ⎥ ⎥ 2 2 3 2 3 1.3 0 0 0 0 0 e x x x = − + or x x e = 2 1 2 1.3 x e 1442443 B 3 3 • B corresponds to the data-generating DAG: x3 e3 1.5 ij j i b = 0⇔ No directed edge from x to x x1 e1 -1.3 b ij ≠ 0 0⇔A di directed d d edge f from x j to x i x2 e2
  • 10. 10 Assumption of acyclicity • Acyclicity ensures existence of an ordering of variables x i that makes B lower-triangular with zeros on the diagonal (Bollen, 1989). ⎥ ⎥ ⎤ ⎢ ⎢ ⎡ + ⎥ ⎥ ⎤ ⎢ ⎢ ⎡ ⎥ ⎥ ⎤ x 0 x 0 0 ⎢ ⎢ ⎡ = ⎥ ⎥ ⎤ ⎢ ⎢ ⎡ e 3 ⎢ 1 ⎣ x 2 0 1.3 x 2 e 2 x 3 1 ⎤ 3 ⎥ ⎥ 1 0 0 0 0 1.5 0 e x x 0 ⎢ ⎢ ⎡ + ⎥ ⎥ ⎤ ⎢ ⎢ ⎡ ⎥ ⎥ ⎤ − = ⎥ ⎥ ⎤ ⎢ ⎢ ⎡ ⎢ ⎢ ⎡ e 1 ⎢ 2 ⎣ x 3 0 0 0 x 3 e 3 x 1 2 1 2 0 0 1.5 1.3 0 0 e x x ⎢ ⎦ ⎥ ⎢ ⎥ ⎢ ⎦ ⎥ ⎣ ⎥ ⎢ ⎥⎣ ⎢ ⎥ ⎣ − ⎢ ⎦ ⎥ ⎢ ⎦ 14424430 B ⎥ ⎢ ⎥ ⎣ ⎢ ⎦ ⎥ ⎢ ⎥⎣ ⎢ ⎦ ⎥ ⎢ ⎥ ⎣ ⎢ ⎦ ⎥ ⎢ ⎦ 1442443 B x3 e3 perm The ordering is : x1 e1 1.5 g . x < x < x 3 1 2 x x x 3 may be an ancestor of 1 , 2 , x2 -1.3 e2 but not vice versa.
  • 11. 11 Assumption of independence between external influences • It implies that there are no latent confounders (Spirtes et al. 2000) – A latent confounder f is a latent variable that is a parent of more than or equal to two observed variables: x1 x2 f e1’ e2 e2’ confounder f makes influences • Such a latent external dependent (Part II): x1 e1 x2 e2
  • 12. 12 Basic problem setup (3/3): Learning linear acyclic SEMs • Assume that data X is randomly sampled from a linear acyclic SEM (with no latent confounders): x1 e1 x = Bx+e b 21 x2 e2 • Goal: Estimate the path-coefficient matrix B by observing data X only! – B corresponds to the data-generating DAG.
  • 13. Problems: Identifiability problems of conventional methods
  • 14. 14 Under what conditions B is identifiable? ≡ • `B is identifiable’ `B is uniquely determined or estimated from p(x) x)’ . • Linear acylic SEM: B x1 e1 x = Bx+e b x2 e2 21 – B and p(e) induce p(x). – If p(x) are different for different B, then B is uniquely determined.
  • 15. 15 Conventional estimation principle: Causal Markov condition • If the data-generating model is a (linear) acyclic SEM, causal Markov condition holds : – Each observed variable xxi is independent of its non-descendants in the DAG conditional on its t i x parents (Pearl & Verma, 1991) : p ( x ) =Π ( | parentsof ) i i p p x x i i=1
  • 16. 16 Conventional methods based on causal Markov condition • Methods based on conditional independencies (Spirtes & Glymour, 1991) – Many linear acyclic SEMs give a same set of conditional independences and equally fit data. • Scoring methods based on Gaussianity (Chickering, 2002) – Many linear acyclic SEMs give a same Gaussian distribution and equally fit data. •• In many cases, the path-coefficient matrix B is not uniquely determined.
  • 17. 17 Example • Two models with Gaussian e11 and : e 2 e Model 1: Model 2: 1 1 x = e 1 2 1 x = 0.8x + e x1 e1 x1 e1 2 1 2 x = 0.8x + e 2 2 x2 e2 x = e x2 e2 var( ) var( ) 1 1 2 ( ) ( ) 0, x = x = 1 2 E e = E e = • Both introduce no conditional independence: cov( ( x 1 , x 2 ) = 0.8 ≠ 0 • Both induce the same Gaussian distribution: ⎡ x ⎤ ⎛ ⎡ 0⎤ ⎡ 1 0 8⎤ ⎞ ⎟ ⎟⎠ ⎜ ⎜⎝ ⎥⎦ ⎤ ⎢⎣ ⎤ ⎥⎦ ⎢⎣ 1 N x ⎥⎦ ⎢⎣ 0.8 0.8 1 0 0 ~ 2
  • 19. 19 A new direction: Non-Gaussian approach • Non-Gaussian data in many applications: – Neuroinformatics (Hyvarinen et al., 2001); Bioinformatics (Sogawa et al., ICANN2010); Social sciences (Micceri, 1989); Economics (Moneta, Entner, et al., 2010) • Utilize non-Gaussianity for model identification: – Bentler (1983); Mooijaart (1985); Dodge and Rousson (2001) • The path-coefficient matrix B is uniquely estimated if ei are non-Gaussian. i e – Shimizu, Hoyer, Hyvarinen and Kerminen (JMLR, 2006)
  • 20. Illustrative example: 20 Gaussian vs non-Gaussian Gaussian Non-Gaussian (uniform) Model x2 x2 1: x e 1 1 x1 e1 x1 x1 x 2 = 0 8 0.8x 1 + e 2 x2 e2 Model 2: = M d l 2 x1 x2 x1 e1 x2 1 2 1 x = 0.8x + e x2 e2 x1 2 2 x = e ( ) ( ) 0, 1 2 E e = E e = var( ) var( ) 1 1 2 x = x =
  • 21. 21 Linear Non-Gaussian Acyclic Model: LiNGAM (Shimizu, Hyvarinen, Hoyer & Kerminen, JMLR, 2006) • Non-Gaussian version of linear acyclic SEM: x i = Σ b ij x or j + e i x == Bx ++ e j : parents of i where – The external influence variables e (disturbances, errors) are • of non-variance i zero variance. • non-Gaussian and mutually independent.
  • 22. 22 Identifiability of LiNGAM model • LiNGAM model can be shown to be identifiable. – B is uniquely estimated. • To see the identifiability, helpful to review independent component analysis (ICA) (Hyvarinen et al., 2001).
  • 23. 23 Independent Component Analysis (ICA) (Jutten & Herault, 1991; Comon, 1994) • Observed random vector x is modeled by p =Σ x = As i ij j x a s or where Σ= j 1 ij a – The mixing matrix A = [ ] is square and is of full column rank. The latent ariables s (– variables i independent components) are of non-zero variance, non-Gaussian and mutually independent. • Then, A is identifiable up to permutation P and scaling D of the columns: A = APD ica
  • 24. 24 Estimation of ICA • Most of estimation methods estimate ( Hyvarinen et al., 2001) W = A−1 : y , ) x = As = W−1s • Most of the methods minimize mutual information (or its approximation) of estimated independent components: s W x ica ˆ = • W is estimated up to permutation P and scaling D of the rows: W = PDW ( (= PDA PDA−1 i ica ) • Consistent and computationally efficient algorithms: – Fixed point (FastICA) (Hyvarinen,99); Gradient-based (Amari, 98) – Semiparametric: no specific distributional assumption
  • 25. Back to LiNGAM model
  • 26. 26 Identifiability of LiNGAM (1/3): ICA achieves half of identification •• LiNGAM model is ICA. i x – Observed variables are linear combinations of non Gaussian independent external non-influences e : x = Bx + e⇔ x = (I −B)−1e i = Ae =W−1e (W= I −B) W = PDW = PD(I −B) ica • ICA gives . – P: unknown permutation matrix – D: unknown scaling (diagonal) matrix • Need to determine P and D to identify B.
  • 27. 27 Identifiability of LiNGAM (2/3): No permutation indeterminacy (1/6) • ICA gives . W = PDW = PD(I −B) ica – P : permutation matrix; D: scaling (diagonal) matrix • We want to find such a permutation matrix P that cancels the permutation i e PP = I : P, i.e., P P PW = PPDW = DW ica I P • We can show (Shimizu et al., UAI05) (illustrated in the next slides): – If PP = I i e no = , i.e., permutation is made on the rows of D W , has no zero in the diagonal (obvious by definition). ica PW DW PP ≠ I – If , i i.e., any id ti l nonidentical t ti permutation i is d made on th the rows of , has a zero in the diagonal. ica DW PW
28
Identifiability of LiNGAM (2/3):
No permutation indeterminacy (2/6)
• By definition, W = I - B has all unities in the diagonal:
  – the diagonal elements of B are all zeros.
• Acyclicity ensures existence of an ordering of variables that makes B
  lower-triangular, and then W = I - B is also lower-triangular.
• So, without loss of generality, W can be assumed to be lower-triangular:
  W = [[1, 0, 0],
       [*, 1, 0],
       [*, *, 1]]   -- no zeros in the diagonal!
29
Identifiability of LiNGAM (2/3):
No permutation indeterminacy (3/6)
• Premultiplying W by a scaling (diagonal) matrix D does NOT change the
  zero/non-zero pattern of W:
  DW = [[d11,   0,   0],
        [  *, d22,   0],
        [  *,   *, d33]]   -- no zeros in the diagonal!
30
Identifiability of LiNGAM (2/3):
No permutation indeterminacy (4/6)
• Any other permutation of the rows of DW changes the zero/non-zero pattern
  and brings a zero into the diagonal, e.g., exchanging the 1st and 2nd rows:
  P̃DW = [[  *, d22,   0],
          [d11,   0,   0],
          [  *,   *, d33]]   -- zero in the diagonal!
31
Identifiability of LiNGAM (2/3):
No permutation indeterminacy (5/6)
• Likewise, exchanging the 1st and 3rd rows of DW:
  P̃DW = [[  *,   *, d33],
          [  *, d22,   0],
          [d11,   0,   0]]   -- zero in the diagonal!
32
Identifiability of LiNGAM (2/3):
No permutation indeterminacy (6/6)
• We can find the correct P̃ by finding the permutation that gives no zero
  on the diagonal of P̃W_ica (Shimizu et al., UAI05).
• Thus, we can solve the permutation indeterminacy and obtain:
  P̃W_ica = P̃PDW = DW = D(I - B)   (since P̃P = I)
33
Identifiability of LiNGAM (3/3):
No scaling indeterminacy
• Now we have P̃W_ica = D(I - B).
• Then D = diag(P̃W_ica).
• Divide each row of P̃W_ica by its corresponding diagonal element to get
  I - B, i.e., B:
  diag(P̃W_ica)^{-1} P̃W_ica = D^{-1} D(I - B) = I - B
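The identification argument translates directly into code. A minimal sketch, assuming W_ica is the output of ICA for a model that holds exactly; the brute-force search over permutations is for illustration only and is feasible only for small p.

```python
import numpy as np
from itertools import permutations

def recover_B(W_ica):
    """Undo ICA's row permutation and scaling to obtain B (where W = I - B)."""
    p = W_ica.shape[0]
    idx = np.arange(p)
    # Pick the row permutation whose diagonal stays farthest from zero; with
    # an estimated (noisy) W_ica no entry is exactly zero, so maximize the
    # smallest |diagonal| element instead of testing for zeros.
    perm = max(permutations(range(p)),
               key=lambda q: np.min(np.abs(W_ica[list(q), idx])))
    W = W_ica[list(perm)]            # candidate for D(I - B)
    W = W / np.diag(W)[:, None]      # remove the scaling D: now I - B
    return np.eye(p) - W
```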
34
Estimation of the LiNGAM model
1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
35
Two estimation algorithms
• ICA-LiNGAM algorithm (Shimizu, Hoyer, Hyvarinen & Kerminen, JMLR, 2006)
• DirectLiNGAM algorithm (Shimizu, Hyvarinen, Kawahara & Washio, UAI09)
• Both estimate an ordering of variables that makes the path-coefficient
  matrix B lower-triangular:
  x_perm = B_perm x_perm + e_perm, with B_perm lower-triangular,
  i.e., a full DAG with possibly redundant edges.
  – Acyclicity ensures existence of such an ordering.
36
Once such an ordering is found…
• Many existing (covariance-based) methods can do:
  – Pruning the redundant path coefficients
    • Sparse methods like the adaptive lasso (Zou, 2006)
  – Finding significant path coefficients
    • Testing, bootstrapping (Shimizu et al., 2006; Hyvarinen et al., 2010)
• [Figure: the full DAG implied by the lower-triangular B_perm is reduced to
  the true sparse DAG by setting redundant coefficients to zero.]
37
1. Outline of the ICA-LiNGAM algorithm
(Shimizu, Hoyer, Hyvarinen & Kerminen, JMLR, 2006)
1. Estimate B by ICA + permutation.
2. Prune redundant edges (e.g., b13 and b23 in the three-variable example).
38
ICA-LiNGAM algorithm (1/2):
Step 1. Estimation of B
1. Perform ICA (here, FastICA (Hyvarinen, 1999)) to obtain an estimate of
   W_ica = PDW = PD(I - B).
2. Find a permutation P̃ that makes the diagonal elements of P̃Ŵ_ica as
   large (non-zero) as possible in absolute value:
   P̃ = argmin_P Σ_i 1 / |(PŴ_ica)_ii|   (Hungarian algorithm, Kuhn, 1955)
3. Normalize each row of P̃Ŵ_ica; then we get an estimate of I - B and B̂.
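Step 2 is a linear assignment problem, so instead of brute force it can be solved exactly with the Hungarian algorithm. A sketch using scipy; W_hat denotes the ICA estimate, and the small eps guarding against division by zero is an implementation detail of this sketch, not part of the original formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def permute_rows(W_hat, eps=1e-12):
    """Return P W_hat, where P minimizes sum_i 1/|(P W_hat)_ii| (Kuhn, 1955)."""
    p = W_hat.shape[0]
    cost = 1.0 / (np.abs(W_hat) + eps)     # cheap diagonal slot = large |entry|
    rows, cols = linear_sum_assignment(cost)
    perm = np.empty(p, dtype=int)
    perm[cols] = rows                      # row r is moved to position cols[r]
    return W_hat[perm]
```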
39
ICA-LiNGAM algorithm (2/2):
Step 2. Pruning
• Find an ordering of variables that makes the estimated B as close to
  lower-triangular as possible:
  – Find a permutation matrix Q̃ that minimizes the sum of the squared
    elements in the upper-triangular part of the permuted B̂:
    Q̃ = argmin_Q Σ_{i≤j} [(QB̂Q^T)_ij]^2
  – An approximate algorithm exists for large numbers of variables
    (Hoyer et al., ICA06).
• [Figure: small upper-triangular coefficients (e.g., 0.1, -0.01) are set to
  zero, leaving the strong edges of the lower-triangular DAG.]
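A brute-force sketch of this ordering search (exact but exponential in p, so only for small models; the approximate algorithm of Hoyer et al. (ICA06) replaces it for larger p):

```python
import numpy as np
from itertools import permutations

def closest_lower_triangular_order(B_hat):
    """Ordering q minimizing the sum of squared upper-triangular (i <= j)
    elements of the permuted path-coefficient matrix Q B_hat Q^T."""
    p = B_hat.shape[0]
    def upper_sq(q):
        Bq = B_hat[np.ix_(q, q)]           # Q B_hat Q^T
        return np.sum(np.triu(Bq) ** 2)    # i <= j part (diagonal is ~0)
    return min((list(q) for q in permutations(range(p))), key=upper_sq)
```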
40
Basic properties of the ICA-LiNGAM algorithm
• ICA-LiNGAM algorithm = ICA + permutations
  – Computationally efficient with the help of well-developed ICA techniques.
• Potential problems:
  – ICA is an iterative search method:
    • may get stuck in a local optimum if the initial guess or step size is
      badly chosen.
  – The permutation algorithms are not scale-invariant:
    • may provide different estimates for different scales of variables.
41
Estimation of the LiNGAM model
1. ICA-LiNGAM algorithm
2. DirectLiNGAM algorithm
42
2. DirectLiNGAM algorithm
(Shimizu, Hyvarinen, Kawahara & Washio, UAI2009)
• Alternative estimation method without ICA:
  – estimates an ordering of variables that makes the path-coefficient
    matrix B lower-triangular (a full DAG with possibly redundant edges):
    x_perm = B_perm x_perm + e_perm
• Many existing (covariance-based) methods can do further pruning or find
  significant path coefficients (Zou, 2006; Shimizu et al., 2006;
  Hyvarinen et al., 2010).
43
Basic idea (1/2):
An exogenous variable can be at the top of a right ordering
• An exogenous variable x_j is a variable with no parents (Bollen, 1989),
  here x3:
  – the corresponding row of B has all zeros.
• So, an exogenous variable can be at the top of such an ordering that makes
  B lower-triangular with zeros on the diagonal:
  [x3]   [  0    0    0] [x3]   [e3]
  [x1] = [1.5    0    0] [x1] + [e1]
  [x2]   [  0 -1.3    0] [x2]   [e2]
44
Basic idea (2/2): Regress the exogenous variable out
• Compute the residuals r_i^(3) (i = 1, 2) of regressing the other
  variables x_i on the exogenous x3:
  r_i^(3) = x_i - [cov(x_i, x3)/var(x3)] x3
  – The residuals r_1^(3) and r_2^(3) form a LiNGAM model:
    [r_1^(3)]   [   0  0] [r_1^(3)]   [e1]
    [r_2^(3)] = [-1.3  0] [r_2^(3)] + [e2]
  – The ordering of the residuals is equivalent to that of the corresponding
    original variables.
• The residual r_1^(3) is exogenous, which implies that x1 can be at the
  second top.
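The regression-out step only needs simple least squares. A minimal sketch, assuming the columns of the data matrix X have been centered:

```python
import numpy as np

def residuals(X, j):
    """r_i^(j) = x_i - cov(x_i, x_j)/var(x_j) * x_j for every i != j.
    X: (n, p) centered data matrix; returns an (n, p-1) residual matrix."""
    xj = X[:, j]
    beta = X.T @ xj / (xj @ xj)     # cov(x_i, x_j)/var(x_j) for centered data
    R = X - np.outer(xj, beta)      # column j becomes identically zero
    return np.delete(R, j, axis=1)
```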
45
Outline of DirectLiNGAM
• Iteratively find exogenous variables until all the variables are ordered:
  1. Find an exogenous variable, here x3.
     – Put x3 at the top of the ordering.
     – Regress x3 out.
  2. Find an exogenous residual, here r_1^(3).
     – Put x1 at the second top of the ordering.
     – Regress r_1^(3) out (yielding r_2^(3,1)).
  3. Put x2 at the third top of the ordering and terminate.
• The estimated ordering is x3 < x1 < x2.
46-1
Identification of an exogenous variable (two-variable case)
i) x1 is exogenous:              ii) x1 is NOT exogenous:
   x1 = e1                           x1 = b12 x2 + e1   (b12 ≠ 0)
   x2 = b21 x1 + e2  (b21 ≠ 0)       x2 = e2
• Regressing x2 on x1 in case i):
  r_2^(1) = x2 - [cov(x1, x2)/var(x1)] x1 = x2 - b21 x1 = e2,
  so x1 and r_2^(1) are independent.
• Regressing x2 on x1 in case ii):
  r_2^(1) = x2 - [cov(x1, x2)/var(x1)] x1
          = {1 - b12 cov(x1, x2)/var(x1)} e2 - [cov(x1, x2)/var(x1)] e1,
  so x1 and r_2^(1) are NOT independent.
46-2
Need to use the Darmois-Skitovitch theorem
(Darmois, 1953; Skitovitch, 1953)
• In case ii) above (x1 NOT exogenous), both x1 = b12 e2 + e1 and r_2^(1)
  load on the non-Gaussian e2 with non-zero coefficients, so x1 and r_2^(1)
  are NOT independent by the theorem below.
• Darmois-Skitovitch theorem: define two variables x1 and x2 as
  x1 = Σ_{j=1}^{p} a_1j e_j,   x2 = Σ_{j=1}^{p} a_2j e_j,
  where the e_j are independent random variables. If there exists a
  non-Gaussian e_i for which a_1i a_2i ≠ 0, then x1 and x2 are dependent.
47
Identification of an exogenous variable (p-variable case)
• Lemma: x_j is exogenous ⇔ x_j and all its residuals
  r_i^(j) = x_i - [cov(x_i, x_j)/var(x_j)] x_j   (i ≠ j)
  are independent.
• In practice, we can identify an exogenous variable by finding the variable
  that is most independent of its residuals.
48
Independence measures
• Evaluate independence between a variable and a residual by a nonlinear
  correlation: corr{x_j, g(r_i^(j))} with g = tanh.
• Taking the sum over all the residuals, we get:
  T(j) = Σ_{i≠j} [ |corr{x_j, g(r_i^(j))}| + |corr{g(x_j), r_i^(j)}| ]
• More sophisticated measures can be used as well (Bach & Jordan, 2002;
  Gretton et al., 2005; Kraskov et al., 2004).
  – The kernel-based independence measure (Bach & Jordan, 2002) often gives
    more accurate estimates (Sogawa et al., IJCNN10).
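A sketch of this score; taking absolute values of the correlations is an assumption of this sketch (so that terms of opposite sign cannot cancel):

```python
import numpy as np

def independence_score(xj, R, g=np.tanh):
    """T(j) = sum_i |corr(x_j, g(r_i))| + |corr(g(x_j), r_i)|.
    Smaller T(j) means x_j looks more independent of its residuals R."""
    def corr(a, b):
        a = a - a.mean(); b = b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sum(abs(corr(xj, g(r))) + abs(corr(g(xj), r)) for r in R.T)
```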
49
Important properties of DirectLiNGAM
• DirectLiNGAM repeats:
  – least-squares simple linear regression;
  – evaluation of pairwise independence between each variable and its
    residuals.
• No algorithmic parameters like step size, initial guesses, or convergence
  criteria.
• Guaranteed convergence to the right solution in a fixed number of steps
  (the number of variables) if the data strictly follow the model.
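Putting the previous pieces together, a self-contained end-to-end sketch of the ordering search (illustrative only; coefficient pruning and the more accurate kernel-based measure are omitted):

```python
import numpy as np

def _corr(a, b):
    a = a - a.mean(); b = b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def direct_lingam_order(X):
    """Causal ordering, most exogenous variable first. X: (n, p) data matrix."""
    X = X - X.mean(0)                      # centered working copy
    active = list(range(X.shape[1]))       # original indices of the columns
    order = []
    while len(active) > 1:
        scores = []
        for k in range(len(active)):
            xj = X[:, k]
            beta = X.T @ xj / (xj @ xj)    # simple least-squares coefficients
            R = np.delete(X - np.outer(xj, beta), k, axis=1)
            scores.append(sum(abs(_corr(xj, np.tanh(r)))
                              + abs(_corr(np.tanh(xj), r)) for r in R.T))
        k = int(np.argmin(scores))         # most independent -> exogenous
        order.append(active.pop(k))
        xj = X[:, k]
        beta = X.T @ xj / (xj @ xj)
        X = np.delete(X - np.outer(xj, beta), k, axis=1)   # regress it out
    order.append(active[0])
    return order
```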
50
Estimation of the LiNGAM model: Summary (1/2)
• Two estimation algorithms:
  – ICA-LiNGAM: estimation using ICA
    • Pros: fast.
    • Cons: possible local optima; not scale-invariant.
  – DirectLiNGAM: alternative estimation without ICA
    • Pros: guaranteed convergence; scale-invariant.
    • Cons: less fast.
  – Cf. Neither needs faithfulness (Shimizu et al., JMLR, 2006; Hoyer,
    personal comm., July, 2010).
51
Estimation of the LiNGAM model: Summary (2/2)
• Experimental comparison of the two algorithms (Sogawa et al., IJCNN2010):
  – Sample size: both need a sample size of at least 1000 for more than 10
    variables.
  – Scalability: both can analyze 100 variables. The performance depends on
    the sample size etc., of course!
  – Scale invariance: ICA-LiNGAM is less robust to changing scales of
    variables.
  – Local optima? For fewer than 10 variables, ICA-LiNGAM is often a bit
    better; for more than 10 variables, DirectLiNGAM is often better,
    perhaps because the problem of local optima becomes more serious.
52
Testing and reliability evaluation
53
Testing testable assumptions
• Non-Gaussianity of external influences e_i:
  – Gaussianity tests
• Could detect violations of some assumptions:
  – Local tests, i.e.:
    • independence of external influences;
    • conditional independencies between observed variables x_i (causal
      Markov condition);
    • linearity.
  – Overall fit of the model assumptions:
    • chi-square test using 3rd- and/or 4th-order moments
      (Shimizu & Kano, 2008)
• Still under development.
54
Reliability evaluation
• Need to evaluate the statistical reliability of LiNGAM results:
  – sample fluctuations;
  – smaller non-Gaussianity makes the model closer to being NOT identifiable.
• Reliability evaluation by bootstrapping (Komatsu, Shimizu & Shimodaira,
  ICANN2010; Hyvarinen et al., JMLR, 2010):
  – If either the sample size or the magnitude of non-Gaussianity is small,
    LiNGAM would give very different results for bootstrap samples.
56
Extensions (a partial list)
• Relaxing the assumptions of the LiNGAM model:
  – Acyclic → cyclic (Lacerda et al., UAI2008)
  – Single homogeneous population → heterogeneous population
    (Shimizu et al., ICONIP2007)
  – i.i.d. sampling → time structures (Part II)
    (Hyvarinen et al., JMLR, 2010; Kawahara, S & Washio, 2010)
  – No latent confounders → allow latents (Part II)
    (Hoyer et al., IJAR, 08; Kawahara, Bollen et al., 2010)
  – Linear → non-linear (Hoyer et al., NIPS08; Zhang & Hyvarinen, UAI09;
    Tilmann, Gretton & Spirtes, NIPS09)
58
Non-Gaussian SEMs have been applied to…
• Neuroinformatics: brain connectivity analysis (Hyvarinen et al., JMLR, 2010)
• Bioinformatics: gene network estimation (Sogawa et al., ICANN2010)
• Economics (Wan & Tan, 2009; Moneta, Entner, Hoyer & Coad, 2010)
• Genetics (Ozaki & Ando, 2009)
• Environmental sciences (Niyogi et al., 2010)
• Physics (Kawahara, Shimizu & Washio, 2010)
• Sociology (Kawahara, Bollen, Shimizu & Washio, 2010)
59
Final summary of Part I
• Use of non-Gaussianity in linear SEMs is useful for model identification.
• Non-Gaussian data is encountered in many applications.
• The non-Gaussian approach can be a good option.
• Links to codes and papers:
  http://homepage.mac.com/shoheishimizu/lingampapers.html
61
Q. My data is Gaussian. LiNGAM will not be useful.
• A. You're right. Try Gaussian methods.
• Comment: Hoyer et al. (UAI2008) showed to what extent one can identify
  the model for a mixture of Gaussian and non-Gaussian external influence
  variables.
62
Q. I applied LiNGAM, but the result is not reasonable given background
knowledge.
• A. You might first want to check:
  – Some model assumptions might be violated.
    → Try other extensions of LiNGAM, or non-parametric methods such as PC
      or FCI (Spirtes et al., 2000).
  – Small sample size or small non-Gaussianity.
    → Try the bootstrap to see if the result is reliable.
  – Background knowledge might be wrong.
63
Q. Relation to the causal Markov condition?
• A. The following three estimation principles are equivalent
  (Zhang & Hyvarinen, ECML09; Hyvarinen et al., JMLR, 2010):
  1. Maximize independence between external influences e_i.
  2. Minimize the sum of entropies of external influences e_i.
  3. Causal Markov condition (each variable is independent of its
     non-descendants in the DAG conditional on its parents) AND maximization
     of independence between the parents of each variable and its
     corresponding external influence e_i.
64
Q. I am a psychometrician and am more interested in latent factors.
• A. Shimizu, Hoyer, and Hyvarinen (2009) propose LiNGAM for latent factors:
  f = Bf + d   -- LiNGAM for latent factors
  x = Gf + e   -- measurement model
65
Others
• Q. Prior knowledge?
  – It is possible to incorporate prior knowledge. The accuracy of
    DirectLiNGAM is often greatly improved even if the amount of prior
    knowledge is not so large (Inazumi et al., LVA/ICA2010).
• Q. Sparse LiNGAM?
  – Zhang et al. (ICA09) and Hyvarinen et al. (JMLR, 2010).
  – ICA + adaptive lasso (Zou, 2006).
• Q. Bayesian approach?
  – Hoyer and Hyttinen (UAI09); Henao and Winther (NIPS09).
• Q. Can the idea be applied to discrete variables?
  – One proposal by Peters et al. (AISTATS2010).
  – Comment: if your discrete variables are close to continuous, e.g.,
    ordinal scales with many points, LiNGAM might work.
66
Q. Nonlinear extensions?
• A. Several nonlinear SEMs have been proposed (DAG; no latent confounders):
  1. x_i = Σ_j f_ij(parent j of x_i) + e_i    -- Imoto et al. (2002)
  2. x_i = f_i(parents of x_i) + e_i          -- Hoyer et al. (NIPS08)
  3. x_i = f_{i,2}^{-1}( f_{i,1}(parents of x_i) + e_i )
                                              -- Zhang et al. (UAI09)
• For two-variable cases, unique identification is possible except for
  several combinations of nonlinearities and distributions
  (Hoyer et al., NIPS08; Zhang & Hyvarinen, UAI09).
67
Nonlinear extensions (continued)
• Proposals aiming at computational efficiency (Mooij et al., ICML09;
  Tilmann et al., NIPS09; Zhang & Hyvarinen, ECML09; UAI09).
• Pros:
  – Nonlinear models are more general than linear models.
• Cons:
  – Computationally demanding: currently at most 7 or 8 variables.
    • Perhaps the assumption of Gaussian external influences might help:
      Imoto et al. (2002) analyze 100 variables.
  – More difficult to allow other possible violations of the LiNGAM
    assumptions, latent confounders etc.
68
Q. My data follows neither such linear SEMs nor such nonlinear SEMs as you
have talked about.
• A. Try non-parametric methods for the model x_i = f_i(parents of x_i, e_i),
  e.g.:
  – DAG: PC (Spirtes & Glymour, 1991)
  – DAG with latent confounders: FCI (Spirtes et al., 1995)
• Probably you get a (probably large) class of equivalent models rather than
  a single model, but that would be the best you currently can do.
69
Q. Deterministic relations?
• A. LiNGAM is not applicable.
• See Daniusis et al. (UAI2010) for a method to analyze deterministic
  relations.
Very brief review of linear SEMs and causality
See Pearl (2000) for details.
72
We can give a causal interpretation to each path coefficient b_ij
(Pearl, 2000)
• Example:
  x1 = e1
  x2 = b21 x1 + e2
  x3 = b32 x2 + e3
• Causal interpretation of b32: if you change x2 from a constant c0 to a
  constant c1, then E(x3) will change by b32 (c1 - c0).
73
`Change x2 from c0 to c1'
• It means `fix x2 at c0 regardless of the variables that previously
  determine x2, and change the constant from c0 to c1' (Pearl, 2000).
• `Fix x2 at c0 regardless of the variables that determine x2' is denoted
  by do(x2 = c0).
  – In SEM, `replace the function determining x2 with the constant c0':
    Original model:        Mutilated model with do(x2 = c0):
    x1 = e1                x1 = e1
    x2 = b21 x1 + e2       x2 = c0
    x3 = b32 x2 + e3       x3 = b32 x2 + e3
  (Assumption of invariance: the other functions don't change.)
74
Average causal effect: Definition (Rubin, 1974; Pearl, 2000)
• Average causal effect of x2 on x3 when changing x2 from c0 to c1:
  E(x3 | do(x2 = c1)) - E(x3 | do(x2 = c0)) = b32 (c1 - c0)
  – computed based on the mutilated models with do(x2 = c1) or do(x2 = c0).
• Average causal effects can be computed based on the estimated
  path-coefficient matrix B.
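The definition can be checked by simulating the mutilated models directly; a small sketch with made-up coefficients and noise distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
b21, b32 = 0.8, 1.5                      # hypothetical path coefficients
c0, c1, n = 0.0, 2.0, 100_000

def mean_x3(do_x2=None):
    """E(x3) in the model x1 = e1, x2 = b21*x1 + e2, x3 = b32*x2 + e3,
    optionally mutilated by do(x2 = const)."""
    e = rng.uniform(-1, 1, size=(n, 3))  # any zero-mean external influences
    x1 = e[:, 0]
    x2 = np.full(n, do_x2) if do_x2 is not None else b21 * x1 + e[:, 1]
    x3 = b32 * x2 + e[:, 2]
    return x3.mean()

ace = mean_x3(do_x2=c1) - mean_x3(do_x2=c0)
print(ace, "vs", b32 * (c1 - c0))        # agree up to simulation noise
```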
75
Cf. Conducting experiments with random assignment is a method to estimate
average causal effects
• Example: a drug trial.
• Random assignment of subjects into two groups fixes x2 to c1 or c0
  regardless of the variables that determine x2:
  – experimental group with x2 = c1 (take the drug);
  – control group with x2 = c0 (placebo).
• E(x3 | do(x2 = c1)) - E(x3 | do(x2 = c0))
  = E(x3) for the experimental group - E(x3) for the control group.