A phylogenetic model of language diversification

Robin J. Ryder (1) and Geoff K. Nicholls (2)

(1) CEREMADE, Université Paris-Dauphine
(2) Department of Statistics, University of Oxford

UCLA, March 2013
www.slideshare.net/robinryder

Gray and Atkinson’s tree(s)

R. Ryder & G. Nicholls (Dauphine & Oxford) Language phylogenies UCLA 2013 2 / 81

Caveats

I am not a linguist
Statistics: additional insight alongside the comparative method
I use the word "evolution" in a broad sense
"All models are wrong, but some are useful"


Advantages of statistical methods

Analyse (very) large datasets
Test multiple hypotheses
Cross-validation
Estimate uncertainty


Questions to answer

Topology of the tree
Age of ancestor nodes
Age of root: 6000–6500 BP or 8000–9500 BP (Before Present)?
6000 BP: Kurgan horsemen; 8000 BP: Anatolian farmers


Statistical method in a nutshell

1 Collect data
2 Design model
3 Perform inference (MCMC, ...)
4 Check convergence
5 In-model validation (is our inference method able to answer questions from our model?)
6 Model mis-specification analysis (do we need a more complex model?)
7 Conclude


Outline

1 Data

2 Model

3 Inference

4 In-model validation

5 Model mis-specification

6 Results

7 Semitic lexical data

8 Bergsland and Vogt

9 Punctuational bursts


Morris Swadesh and glottochronology

200/100 word list
Compares 2 languages (c = fraction of shared cognates)
Assumes that r, the fraction of cognates retained after 1000 years, is constant for all languages (86%)
Infers age t of Most Recent Common Ancestor

$$\hat{t} = \frac{\ln c}{2 \ln r}$$
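
As a minimal sketch of the formula above: the function below computes the estimated age in millennia, since r is a retention fraction per 1000 years. The function name and the example value c = 0.60 are assumptions made for illustration, not taken from Swadesh's data.

```python
import math

def glottochronology_age(c, r=0.86):
    """Estimated time since two languages split, in millennia.

    c: fraction of shared cognates between the two word lists.
    r: fraction of cognates retained by one language after 1000 years.
    """
    if not 0.0 < c <= 1.0:
        raise ValueError("c must be in (0, 1]")
    if not 0.0 < r < 1.0:
        raise ValueError("r must be in (0, 1)")
    # Both lineages lose cognates independently, hence the factor 2.
    return math.log(c) / (2.0 * math.log(r))

# Hypothetical pair of languages sharing 60% of their cognates.
print(round(glottochronology_age(0.60), 2))
```

Two languages that each retained a fraction r of cognates over one millennium share c = r² of them, so the formula returns exactly 1 in that case.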


all and animal ashes at back bad bark because belly big bird bite black blood blow bone breast breathe burn child claw cloud cold come
dog drink dry dull dust ear earth eat egg eye fall far fat father fear feather few fight fire fish five float flow flower fly fog
grass green guts hair hand he head hear heart heavy here hit hold horn how hunt husband I ice if in kill knee know lake laugh
long louse man many meat moon mother mountain mouth name narrow near neck new night nose not old one other person play pull push
river road root rope rotten round rub salt sand say scratch sea see seed sew sharp short sing sit skin sky sleep small smell
split squeeze stab stand star stick stone straight suck sun swell swim tail ten that there they thick thin think this thou three throw tie
walk warm wash water we wet what when where white who wide wife wind wing wipe with woman woods

Bergsland and Vogt (1962)

Found different rates for different pairs of languages: Old Norse
and Icelandic, Georgian and Mingrelian, Armenian and Old
Armenian
Discredited Glottochronology
Sankoff (1973): sample selection bias, no estimation of
uncertainty
Fair criticism
Bad observation protocol from Swadesh
Does not apply (so much) to modern methods


Core vocabulary

100 or 200 words, present in almost all languages: bird, hand, to
eat, red...
Borrowing can occur (evolution not along a tree), but:
“Easy” to detect
Rare
Does not bias the results


Binary data

                      he dies            three     all
Old English           stierfþ            þrīe      ealle
Old High German       stirbit, touwit    drī       alle
Avestan               miriiete           þraiio    vispe
Old Church Slavonic   umĭretŭ            trĭje     vĭsi
Latin                 moritur            trēs      omnēs
Oscan                 ?                  trís      súllus

Cognacy classes (traits) for the meaning he dies:
1 {stierfþ, stirbit}
2 {touwit}
3 {miriiete, umĭretŭ, moritur}

Cognacy classes for the meaning three:
1 {þrīe, drī, þraiio, trĭje, trēs, trís}

Cognacy classes for all:
1 {ealle, alle}
2 {vispe, vĭsi}
3 {omnēs}
4 {súllus}

               he dies    three   all
               1  2  3    1       1  2  3  4
O. English     1  0  0    1       1  0  0  0
OH German      1  1  0    1       1  0  0  0
Avestan        0  0  1    1       0  1  0  0
OC Slavonic    0  0  1    1       0  1  0  0
Latin          0  0  1    1       0  0  1  0
Oscan          ?  ?  ?    1       0  0  0  1
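
The coding step above can be sketched as follows. This is an illustration only: the function name and the data layout (one list of cognacy classes per meaning, plus the set of languages with no recorded word) are assumptions, not the format used by TraitLab.

```python
def encode_cognates(classes_by_meaning, languages):
    """Build one binary column per cognacy class.

    classes_by_meaning maps a meaning to (classes, missing), where classes is
    a list of sets of languages and missing is the set of languages for which
    the meaning is unrecorded (coded '?').
    """
    matrix = {lang: [] for lang in languages}
    for classes, missing in classes_by_meaning.values():
        for members in classes:
            for lang in languages:
                if lang in missing:
                    matrix[lang].append("?")
                else:
                    matrix[lang].append(1 if lang in members else 0)
    return matrix

languages = ["O. English", "OH German", "Avestan", "OC Slavonic", "Latin", "Oscan"]
data = {
    "he dies": ([{"O. English", "OH German"}, {"OH German"},
                 {"Avestan", "OC Slavonic", "Latin"}], {"Oscan"}),
    "three": ([set(languages)], set()),
    "all": ([{"O. English", "OH German"}, {"Avestan", "OC Slavonic"},
             {"Latin"}, {"Oscan"}], set()),
}
matrix = encode_cognates(data, languages)
for lang in languages:
    print(lang, *matrix[lang])
```

Old High German appears in two classes for he dies (stirbit and touwit), which is how polymorphism shows up in this coding.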


Observation process

Old English 1 0 0 1 1 0 0 0
Old High German 1 1 0 1 1 0 0 0
Avestan 0 0 1 1 0 1 0 0
Old Church Slavonic 0 0 1 1 0 1 0 0
Latin 0 0 1 1 0 0 1 0
Oscan ? ? ? 1 0 0 0 1


Observation process

Old English 1 0 1 1 0
Old High German 1 0 1 1 0
Avestan 0 1 1 0 1
Old Church Slavonic 0 1 1 0 1
Latin 0 1 1 0 0
Oscan ? ? 1 0 0


Constraints

Constraints on the tree topology
30 constraints on the age of some nodes or ancient languages
These constraints are used to estimate the evolution rates and the age.


Constraints


Model (1): birth-death process

Traits are born at rate λ
Traits die at rate µ
λ and µ are constant
1 1 0 0 0 0 0 0 0
2 1 0 1 0 0 0 0 0
3 1 0 0 0 0 0 0 1
4 0 0 0 0 1 0 0 0
5 0 0 0 0 1 0 0 0
6 1 1 0 0 0 1 1 0
7 1 1 0 0 0 1 0 0
8 1 0 0 0 0 0 0 0
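
A minimal sketch of the birth-death process along a single edge, assuming we only track the number of traits and not their identity. The function name and the parameter values in the example are hypothetical.

```python
import math
import random

def evolve_edge(n_traits, lam, mu, t, rng):
    """Number of traits present at the end of an edge of length t."""
    # Each existing trait survives the whole edge with probability exp(-mu * t).
    survivors = sum(rng.random() < math.exp(-mu * t) for _ in range(n_traits))
    # Births form a Poisson process of rate lam: exponential gaps between births.
    s = rng.expovariate(lam)
    while s < t:
        # A trait born at time s must survive the remaining t - s.
        if rng.random() < math.exp(-mu * (t - s)):
            survivors += 1
        s += rng.expovariate(lam)
    return survivors

rng = random.Random(0)
lam, mu = 0.05, 0.01          # hypothetical rates; stationary mean is lam / mu = 5
runs = [evolve_edge(5, lam, mu, 100.0, rng) for _ in range(2000)]
print(sum(runs) / len(runs))  # should be close to lam / mu
```

With constant λ and µ the number of traits in a language fluctuates around λ/µ, which is the quantity the catastrophe model is later calibrated against.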


Model (2): catastrophic rate heterogeneity

Catastrophes occur at rate ρ
At a catastrophe, each trait dies with probability κ and Poiss(ν) traits are born.
λ/µ = ν/κ: the number of traits is constant on average.
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 1 0 0 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 1 1 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 1 1 0 0 0 0 0 1 0
7 1 0 0 0 0 1 0 0 0 0 0 0 1 0
8 1 0 0 0 0 0 0 0 0 0 0 0 1 0
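
The constraint λ/µ = ν/κ can be checked with a few lines of arithmetic. The sketch below uses hypothetical parameter values. It also computes the length T of ordinary evolution that has the same effect as one catastrophe, which follows from setting the survival probability exp(−µT) equal to 1 − κ.

```python
import math

def expected_traits_after_catastrophe(n_traits, kappa, nu):
    """Mean trait count just after a catastrophe, given n_traits just before."""
    return n_traits * (1.0 - kappa) + nu

def equivalent_years(kappa, mu):
    """Length T of ordinary evolution with survival probability 1 - kappa."""
    return -math.log(1.0 - kappa) / mu

lam, mu, kappa = 0.05, 2e-4, 0.2   # hypothetical values
nu = kappa * lam / mu              # the constraint lam / mu = nu / kappa
stationary_mean = lam / mu

after = expected_traits_after_catastrophe(stationary_mean, kappa, nu)
T = equivalent_years(kappa, mu)
# Mean number of traits born during T years and still alive at the end.
births_surviving = lam / mu * (1.0 - math.exp(-mu * T))

print(stationary_mean, after)      # the mean count is unchanged
print(T, births_surviving, nu)     # births over T years match nu
```

Under the constraint, a catastrophe therefore acts on the trait counts like a jump forward in time of T years.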


Model (3): missing data

Observation process: each point goes missing with probability ξ_i
Some traits are not observed and are thinned out of the data
1 1000?00000?000
2 ?01000?000000?
3 0?00?000011000
4 0000?0?0000?00
5 00?01?00000000
6 10000??0?000?0
7 ?0000?0?000010
8 10000000000010


Observation process

0 1 0 0 1 0 1 1 0
0 0 0 1 1 0 0 1 1
1 1 0 1 1 1 1 1 1
1 0 0 1 0 1 1 1 0
0 0 1 1 1 1 0 0 1


Observation process

? 1 0 0 ? 0 1 1 0
0 0 ? ? 1 0 0 1 1
? 1 ? ? ? 1 ? 1 1
1 0 0 1 0 1 1 1 0
0 ? ? 1 1 1 0 0 1


Observation process

1 0 ? 0 1 1 0
0 ? 1 0 0 1 1
1 ? ? 1 ? 1 1
0 1 0 1 1 1 0
? 1 1 1 0 0 1
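
The three matrices above can be reproduced with a short sketch. The thinning rule used here (drop every trait displayed in at most one language) is an assumption inferred from the example matrices, and the function names are hypothetical.

```python
import random

def mask_data(rows, xi, rng):
    """Replace each entry of row i by '?' with probability xi[i]."""
    return [["?" if rng.random() < xi[i] else v for v in row]
            for i, row in enumerate(rows)]

def thin_unobserved(rows):
    """Drop trait columns displayed (value 1) in at most one language."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols)
            if sum(row[j] == 1 for row in rows) >= 2]
    return [[row[j] for j in keep] for row in rows]

# Second matrix of the slides: data after points have gone missing.
Q = "?"
masked = [
    [Q, 1, 0, 0, Q, 0, 1, 1, 0],
    [0, 0, Q, Q, 1, 0, 0, 1, 1],
    [Q, 1, Q, Q, Q, 1, Q, 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 1, 0],
    [0, Q, Q, 1, 1, 1, 0, 0, 1],
]
registered = thin_unobserved(masked)
for row in registered:
    print(*row)
```

The first and third columns are removed, leaving the seven columns of the third matrix. `mask_data` sketches the step from the first matrix to the second, with one missing-data probability per language.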


TraitLab software
Bayesian inference
Markov Chain Monte Carlo
(Almost) uniform prior over the age of the root


Why be Bayesian?

In the settings described in this talk, it usually makes sense to use
Bayesian inference, because:
The models are complex
Estimating uncertainty is paramount
The output of one model is used as the input of another
We are interested in complex functions of our parameters


Frequentist statistics

Statistical inference deals with estimating an unknown parameter
θ given some data D.
In the frequentist view of statistics, θ has a true fixed (deterministic) value.
Uncertainty is measured by confidence intervals, which are not intuitive to interpret: if I get a 95% CI of [80; 120] (i.e. 100 ± 20) for θ, I cannot say that there is a 95% probability that θ belongs to the interval [80; 120].
Frequentist statistics often use the maximum likelihood estimator: for which value of θ would the data be most likely (under our model)?

$$L(\theta \mid D) = P[D \mid \theta]$$

$$\hat{\theta} = \arg\max_{\theta} L(\theta \mid D)$$
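
As a toy illustration of maximum likelihood, suppose n traits are observed at one time and k of them are still present t years later, each surviving independently with probability exp(−µt). The data (n = 100, k = 80, t = 1000) are invented for the example. The sketch maximises the likelihood over a grid and compares the result with the closed form −ln(k/n)/t.

```python
import math

def log_likelihood(mu, n, k, t):
    """Binomial log-likelihood of k survivors out of n after t years."""
    log_p = -mu * t                              # log survival probability
    log_q = math.log(-math.expm1(-mu * t))       # log death probability
    return k * log_p + (n - k) * log_q

def mle_on_grid(n, k, t, grid):
    """Grid value of mu under which the data are most likely."""
    return max(grid, key=lambda mu: log_likelihood(mu, n, k, t))

n, k, t = 100, 80, 1000.0                        # hypothetical data
grid = [i * 1e-6 for i in range(1, 2001)]        # mu from 1e-6 to 2e-3
mu_hat = mle_on_grid(n, k, t, grid)
mu_closed_form = -math.log(k / n) / t
print(mu_hat, mu_closed_form)
```

The grid search is only there to make the arg max explicit; in this simple case the closed form is available directly.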


Bayesian statistics

In the Bayesian framework, the parameter θ is seen as inherently
random: it has a distribution.
Before I see any data, I have a prior distribution π(θ) on θ, usually uninformative.
Once I take the data into account, I get a posterior distribution,
which is hopefully more informative.

π(θ|D) ∝ π(θ)L(θ|D)

Different people have different priors, hence different posteriors.
But with enough data, the choice of prior matters little.
We are now allowed to make probability statements about θ, such
as "there is a 95% probability that θ belongs to the interval
[78 ; 119]" (credible interval)
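
A minimal sketch of the prior-to-posterior update on a grid, for an invented example: n = 100 traits of which k = 80 survive t = 1000 years, with the improper prior 1/µ on the death rate. The interval computed is an equal-tailed credible interval, which is an assumption of this sketch and not necessarily the HPD interval reported later in the talk.

```python
import math

def log_likelihood(mu, n=100, k=80, t=1000.0):
    """Binomial log-likelihood of k survivors out of n after t years."""
    return k * (-mu * t) + (n - k) * math.log(-math.expm1(-mu * t))

def posterior_on_grid(grid, log_prior, log_lik):
    """Normalised posterior weights: prior times likelihood on each grid point."""
    logs = [log_prior(x) + log_lik(x) for x in grid]
    top = max(logs)                              # subtract the max for stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def credible_interval(grid, probs, level=0.95):
    """Equal-tailed interval read off the cumulative posterior on the grid."""
    tail = (1.0 - level) / 2.0
    cum, lower, upper = 0.0, None, None
    for x, p in zip(grid, probs):
        cum += p
        if lower is None and cum >= tail:
            lower = x
        if cum >= 1.0 - tail:
            upper = x
            break
    return lower, upper

grid = [i * 1e-6 for i in range(1, 2001)]
probs = posterior_on_grid(grid, lambda mu: -math.log(mu), log_likelihood)
lower, upper = credible_interval(grid, probs)
print(lower, upper)
```

The resulting pair can be read as "there is a 95% probability that µ lies between lower and upper", given the model and the prior.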


Advantages and drawbacks of Bayesian statistics

More intuitive interpretation of the results
Easier to think about uncertainty
In a hierarchical setting, it becomes easier to take into account all
the sources of variability
Prior specification: need to check that changing your prior does not change your result
Computationally intensive


Prior and inference
Parameter                  Prior       Note on prior                           Method
Tree g                     f_G         marginally uniform on root age,         MCMC
                                       uniform on topologies
Death rate µ               1/µ         improper; invariant by scale change     MCMC
Birth rate λ               1/λ         improper; invariant by scale change     integration
Birth time Z               PPP         Poisson process + observation process   integration (pruning)
Catastrophe time k         PPP         Total per edge                          MCMC
Catastrophe rate ρ         f_R, Γ      95% CI: 1/tree – 1/edge                 MCMC
Catastrophe death rate κ   U(0, 1)                                             MCMC
Missing data rate ξ        U(0, 1)^L                                           MCMC


Likelihood calculation

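$$
\sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_i, c), g, \mu] =
\begin{cases}
\delta_{i,c} \times \sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_c, c), g, \mu] & \text{if } Y(\Omega_a^{(c)}) \geq 1 \\
(1 - \delta_{i,c}) + \delta_{i,c} \times \sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_c, c), g, \mu] & \text{if } Y(\Omega_a^{(c)}) = 0 \text{ and } Q(\Omega_a^{(c)}) \geq 1 \\
(1 - \delta_{i,c}) + \delta_{i,c} \, v_c^{(0)} & \text{if } Y(\Omega_a^{(c)}) + Q(\Omega_a^{(c)}) = 0 \text{ (i.e. } \Omega_a^{(c)} = \{\emptyset\})
\end{cases}
$$

$$
\sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_c, c), g, \mu] =
\begin{cases}
1 & \text{if } \Omega_a^{(c)} = \{\{c\}, \emptyset\} \text{ or } \{\{c\}\} \text{ (i.e. } D_{c,a} \in \{?, 1\}) \\
0 & \text{if } \Omega_a^{(c)} = \{\emptyset\} \text{ (i.e. } D_{c,a} = 0)
\end{cases}
$$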


MCMC

Fit the model to the data
Trees that make the data likely
Obtain a sample of trees and dates
Samples weighted by quality of fit to data
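
To make the idea concrete, here is a minimal random-walk Metropolis sketch for a single parameter. It is not the sampler used in TraitLab, which moves over trees and several parameters. The target is the toy posterior for a death rate µ with prior 1/µ and invented data (80 of 100 traits surviving 1000 years).

```python
import math
import random

def log_posterior(mu, n=100, k=80, t=1000.0):
    """Log of prior (1/mu) times binomial likelihood, up to a constant."""
    if mu <= 0.0:
        return -math.inf
    log_p = -mu * t
    log_q = math.log(-math.expm1(-mu * t))
    return -math.log(mu) + k * log_p + (n - k) * log_q

def metropolis(log_target, x0, step, n_iter, rng):
    """Random-walk Metropolis: propose x + noise, accept with prob min(1, ratio)."""
    x, lx = x0, log_target(x0)
    samples = []
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)
        ly = log_target(y)
        if rng.random() < math.exp(min(0.0, ly - lx)):
            x, lx = y, ly                 # accept the proposal
        samples.append(x)                 # otherwise keep the current value
    return samples

rng = random.Random(1)
samples = metropolis(log_posterior, x0=2e-4, step=5e-5, n_iter=20000, rng=rng)
kept = samples[2000:]                     # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

Values of µ that fit the data well are visited more often, so the retained samples approximate the posterior distribution.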


Tests on synthetic data

Figure: True tree, 40 words/language
Figure: Consensus tree


Tests on synthetic data (2)

Figure: Death rate (µ)


Initial model: no catastrophes

Traits are born at rate λ
Traits die at rate µ
λ and µ are constant
1 1 0 0 0 0 0 0 0
2 1 0 1 0 0 0 0 0
3 1 0 0 0 0 0 0 1
4 0 0 0 0 1 0 0 0
5 0 0 0 0 1 0 0 0
6 1 1 0 0 0 1 1 0
7 1 1 0 0 0 1 0 0
8 1 0 0 0 0 0 0 0


Mis-specification: catastrophic heterogeneity

Figure: four panels, (a) to (d)


Influence of borrowing (1)

Figure: [...] words/language, 10% borrowing
Figure: Consensus tree

Figure: [...] words/language, 50% borrowing
Figure: Consensus tree

The topology is reconstructed well
Dates are under-estimated

Figure: Root age
Figure: Death rate (µ)


Presence of borrowing?

Figure legend: Ringe 100, b = 0, b = 0.1, b = 0.5, b = 1 (vertical axis from 0.4 to 1; horizontal axis from 2 to 24)


Mis-specifications

Heterogeneity between traits    Analyse subset of data + simulated data
Heterogeneity in time/space     Simulated data analysis with (non catastrophic) edge rate from a Γ distribution
Borrowing                       Simulated data analysis + check level of borrowing
Data missing in blocks          Simulated data analysis
Non-empty meaning categories    Simulated data analysis


Data

Indo-European languages
Core vocabulary (Swadesh 100 or 207)
Two (almost) independent data sets
Dyen et al. (1997): 87 languages, mostly modern
Ringe et al. (2002): 24 languages, mostly ancient


Cross-validation

Predict age of nodes for which we have a constraint: would we
reject the truth?
Γ: space of trees which respect all constraints
Γ^{-c}: space of trees with constraint c removed, c = 1, ..., 30
M_0: g ∈ Γ; M_1: g ∈ Γ^{-c}. Bayes factor:

$$B(c) = \frac{P[g \in \Gamma \mid D, g \in \Gamma^{-c}]}{P[g \in \Gamma \mid g \in \Gamma^{-c}]}$$

Constraint c conflicts with the model if 2 log B(c) < −5.
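
One way to estimate such a Bayes factor, sketched under assumptions: run the analysis with constraint c removed, then compare the fraction of posterior samples whose node age falls inside the removed constraint with the same fraction under the prior. The sample values below are invented, the function names are hypothetical, and the logarithm is taken to be natural.

```python
import math

def bayes_factor(posterior_ages, prior_ages, lower, upper):
    """Ratio of posterior to prior probability that the age satisfies [lower, upper]."""
    def fraction_inside(ages):
        return sum(lower <= a <= upper for a in ages) / len(ages)
    prior_frac = fraction_inside(prior_ages)
    if prior_frac == 0.0:
        raise ValueError("no prior sample satisfies the constraint")
    return fraction_inside(posterior_ages) / prior_frac

def conflicts(b, threshold=-5.0):
    """True if 2 log B falls below the threshold (B = 0 counts as a conflict)."""
    return b == 0.0 or 2.0 * math.log(b) < threshold

# Invented samples of the age (in years BP) of one constrained node.
posterior_ages = [700, 720, 650, 900]
prior_ages = [100, 700, 3000, 5000]
b = bayes_factor(posterior_ages, prior_ages, lower=660, upper=760)
print(b, conflicts(b))
```

A large negative value of 2 log B(c) means the data pull the node away from its known age, which is the sense in which we "would reject the truth".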


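Cross-validation

Figure: cross-validation results for the 30 constraints, labelled HI TA TB LU LY OI UM OS LA GK AR GO ON OE OG OS PR AV PE VE CE IT GE WG NW BS BA IR II TG (first axis ticks: 100, 10, 5, 2, 0, −2, −5, −10, −100; second axis from 0 to 8000)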

Consensus tree: modern languages (Dyen data)


Consensus tree: ancient languages (Ringe data)

Figure: consensus tree with leaves, from top to bottom: oldhighgerman, oldenglish, oldnorse, gothic, oscan, umbrian, latin, welsh, oldirish, oldpersian, avestan, vedic, lithuanian, latvian, oldprussian, oldcslavonic, greek, armenian, lycian, luvian, hittite, tocharian_b, tocharian_a, albanian. Numbers printed on internal nodes: 66, 85, 58, 78, 62. Time axis from 8000 to 0.


Root age


Conclusions

Strong support for Anatolian farming hypothesis: root around 8000
BP
Statistics reconstruct known linguistic facts and answer
unresolved questions
TraitLab: it’s free! (Though Matlab is not...)


Semitic lexical data

Data: Kitchen et al. (2009)
25 languages, 96 meanings, 674 cognacy classes
Questions of interest: root age (constraint known), topology,
outgroup


Model validation

Thin bar: constraint. Thick bar: 95% posterior HPD. (Red bar: 95%
prior HPD)


Conclusions

Root age 95% HPD: 4400 – 5100 BP
Akkadian outgroup: 67% (Syrian homeland?)
Zero catastrophes: 33%

Back to Bergsland and Vogt

Norse family, 8 languages.
Selection bias
Claim that the rate of change is significantly different for these data.
B&V included words used only in literary Icelandic, which we
exclude
We can handle polymorphism
Do not include catastrophes


Known history

Figure: known history of Gjestal, Sandnes, Riksmal and Icelandic, on an axis marked X, XI, XII, XIII


Tests

Two possible ways to test whether the same model parameters apply
to this example and to Indo-European:
1 Assume parameters are the same as for the general
Indo-European tree, and estimate ancestral ages.
2 Use Norse constraints to estimate parameters, and compare to
parameter estimates from general Indo-European tree


Results

If we use parameter values from another analysis, we can try to
estimate the age of 13th century Norse.
True constraint: 660–760 BP. Our HPD: 615 – 872 BP.
If we analyse the Norse data on its own, we estimate parameters.
Value of µ for Norse: (2.47 ± 0.4) · 10^−4
Value of µ for IE: (1.86 ± 0.39) · 10^−4 (Dyen), (2.37 ± 0.21) · 10^−4 (Ringe)


But...

We can also try to estimate the age of Icelandic (which is 0 BP)
Find 439–560 BP, far from the true value
B&V were right: there was significantly less change on the branch leading to Icelandic than average
However, we are still able to estimate internal node ages.


Georgian

Second data set: Georgian and Mingrelian
Age of ancestor: last millennium BC
Code data given by B&V, discarding borrowed items
Use rate estimate from Ringe et al. analysis
95% HPD: 2065 – 3170 BP


B&V: conclusions

Third data set (Armenian) not clear enough to be recoded.
There is variation in the number of changes on an edge
Nonetheless, we are still able to estimate ancestral language age
Variation in borrowing rates
B&V: "we cannot estimate dates, and it follows that we cannot estimate the topology either".
We can estimate dates, and even if we couldn’t, we might still be
able to estimate the topology


Atkinson et al. (2008)

Hypothesis: when a language is founded by a migration, the
founder effect leads to fast change over a short period of time.
There is a catastrophe at each branching event.
Indirect estimation: correlation between number of changes
between root and leaf, and number of branching events along the
same path
Atkinson: 21% of changes in the history of IE are due to
punctuational bursts



Direct analysis

We force a catastrophe on each edge.
Infer size of catastrophes.
Find κ very close to 0.
Less than 1% of change can be attributed to punctuational bursts.
Reason for discrepancy unclear.


Conclusions

Strong support for age of PIE around 8000 BP
Statistical methods can help answer questions which traditional
methods cannot
Many more questions and models to come
TraitLab: it’s free! (although Matlab is not...)


Questions

otázky kesses
spørgsmåler cwestiwnau
pytania preguntes
preguntas vrae
kláusimai Fragen
voprosy quaestiones
întrebări questions
vragen ερωτήσεις
zapitanni spurningar
domande spørsmåler
questões frågor
vprašanja


References

R. J. Ryder & G. K. Nicholls, Missing data in a stochastic Dollo
model for cognate data, and its application to the dating of
Proto-Indo-European (2011), JRSS C
G. K. Nicholls, Horses or farmers? The tower of Babel and
confidence in trees (2008), Significance (popular science)
G. K. Nicholls & R. J. Ryder, Phylogenetic models for Semitic
vocabulary (2011), IWSM
R. J. Ryder, Phylogenetic Models of Language Diversification
(2010), DPhil. thesis, University of Oxford

