A Phylogenetic Model of Language Diversification

          Robin J. Ryder1 and Geoff K. Nicholls2

             1 CEREMADE,      Université Paris-Dauphine
           2 Department   of Statistics, University of Oxford


                UCLA, March 2013
         www.slideshare.net/robinryder
Gray and Atkinson’s tree(s)




R. Ryder & G. Nicholls (Dauphine & Oxford)   Language phylogenies   UCLA 2013   2 / 81
Caveats




           I am not a linguist
           Statistics: additional insight alongside the comparative method
           I use the word "evolution" in a broad sense
           "All models are wrong, but some are useful"




Advantages of statistical methods




           Analyse (very) large datasets
           Test multiple hypotheses
           Cross-validation
           Estimate uncertainty




Questions to answer




           Topology of the tree
           Age of ancestor nodes
           Age of root: 6000-6500 BP or 8000-9500 BP (Before Present)?
           6000 BP: Kurgan horsemen; 8000 BP: Anatolian farmers




Statistical method in a nutshell



      1    Collect data
      2    Design model
      3    Perform inference (MCMC, ...)
      4    Check convergence
      5    In-model validation (is our inference method able to answer
           questions from our model?)
      6    Model mis-specification analysis (do we need a more complex
           model?)
      7    Conclude




Outline

  1    Data

  2    Model

  3    Inference

  4    In-model validation

  5    Model mis-specification

  6    Results

  7    Semitic lexical data

  8    Bergsland and Vogt

  9    Punctuational bursts

Morris Swadesh and glottochronology



           200/100 word list
           Compares 2 languages (c = fraction of shared cognates)
           Assumes r, the fraction of cognates a language retains after
           1000 years, is constant for all languages (86%)
           Infers the age t (in millennia) of the Most Recent Common Ancestor

           \[ \hat{t} = \frac{\ln c}{2 \ln r} \]
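
A small worked illustration of the estimator on this slide may help. This is only a sketch: the function name `glottochronology_age` and the example value c = 0.5 are illustrative assumptions, while r = 0.86 is the retention rate quoted on the slide.

```python
import math


def glottochronology_age(c, r=0.86):
    """Estimated age (in millennia) of the most recent common ancestor.

    c: fraction of shared cognates between the two languages (0 < c <= 1)
    r: fraction of cognates a language retains after 1000 years (0 < r < 1)
    """
    if not 0.0 < c <= 1.0 or not 0.0 < r < 1.0:
        raise ValueError("need 0 < c <= 1 and 0 < r < 1")
    # Both languages lose cognates independently, hence the factor 2.
    return math.log(c) / (2.0 * math.log(r))


# Hypothetical pair of languages sharing half of their cognates.
print(round(glottochronology_age(0.5), 2))
```

With half of the cognates shared, the estimate is roughly 2.3 millennia. The function returns a point estimate only, with no measure of uncertainty.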




           all, and, animal, ashes, at, back, bad, bark, because, belly, big,
           bird, bite, black, blood, blow, bone, breast, breathe, burn, child,
           claw, cloud, cold, come, dog, drink, dry, dull, dust, ear, earth,
           eat, egg, eye, fall, far, fat, father, fear, feather, few, fight,
           fire, fish, five, float, flow, flower, fly, fog, grass, green, guts,
           hair, hand, he, head, hear, heart, heavy, here, hit, hold, horn,
           how, hunt, husband, I, ice, if, in, kill, knee, know, lake, laugh,
           long, louse, man, many, meat, moon, mother, mountain, mouth, name,
           narrow, near, neck, new, night, nose, not, old, one, other, person,
           play, pull, push, river, road, root, rope, rotten, round, rub, salt,
           sand, say, scratch, sea, see, seed, sew, sharp, short, sing, sit,
           skin, sky, sleep, small, smell, split, squeeze, stab, stand, star,
           stick, stone, straight, suck, sun, swell, swim, tail, ten, that,
           there, they, thick, thin, think, this, thou, three, throw, tie,
           walk, warm, wash, water, we, wet, what, when, where, white, who,
           wide, wife, wind, wing, wipe, with, woman, woods
Bergsland and Vogt (1962)



           Found different rates for different pairs of languages: Old Norse
           and Icelandic, Georgian and Mingrelian, Armenian and Old
           Armenian
           Discredited glottochronology
           Sankoff (1973): sample selection bias, no estimation of
           uncertainty
           Fair criticism
           Bad observation protocol from Swadesh
           Does not apply (so much) to modern methods




Core vocabulary




           100 or 200 words, present in almost all languages: bird, hand, to
           eat, red...
           Borrowing can occur (evolution not along a tree), but:
           “Easy” to detect
           Rare
           Does not bias the results




Binary data: he dies, three, all
                                    he dies            three      all
           Old English              stierfþ            þrīe       ealle
           Old High German          stirbit, touwit    drī        alle
           Avestan                  miriiete           þrāiiō     vispe
           Old Church Slavonic      umĭretŭ            trĭje      vĭsi
           Latin                    moritur            trēs       omnēs
           Oscan                    ?                  trís       súllus

           Cognacy classes (traits) for the meaning he dies:
              1    {stierfþ, stirbit}
              2    {touwit}
              3    {miriiete, umĭretŭ, moritur}
           Cognacy classes for the meaning three:
              1    {þrīe, drī, þrāiiō, trĭje, trēs, trís}
           Cognacy classes for the meaning all:
              1    {ealle, alle}
              2    {vispe, vĭsi}
              3    {omnēs}
              4    {súllus}

                                    he dies     three   all
           Old English              1  0  0     1       1  0  0  0
           Old High German          1  1  0     1       1  0  0  0
           Avestan                  0  0  1     1       0  1  0  0
           Old Church Slavonic      0  0  1     1       0  1  0  0
           Latin                    0  0  1     1       0  0  1  0
           Oscan                    ?  ?  ?     1       0  0  0  1
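
The coding of cognacy classes as binary traits can be sketched in a few lines. The data structure and the names `LANGUAGES`, `COGNATES` and `encode` are illustrative assumptions; the class memberships are those listed on the slide, with `None` standing for the unknown Oscan word.

```python
LANGUAGES = ["Old English", "Old High German", "Avestan",
             "Old Church Slavonic", "Latin", "Oscan"]

# For each meaning: the cognacy classes shown by each language.
# None means the word is unknown, so every class is coded as missing.
COGNATES = {
    "he dies": {"Old English": {1}, "Old High German": {1, 2},
                "Avestan": {3}, "Old Church Slavonic": {3},
                "Latin": {3}, "Oscan": None},
    "three": {lang: {1} for lang in LANGUAGES},
    "all": {"Old English": {1}, "Old High German": {1}, "Avestan": {2},
            "Old Church Slavonic": {2}, "Latin": {3}, "Oscan": {4}},
}


def encode(cognates, languages):
    """Return one row per language: 1/0 per cognacy class, '?' if unknown."""
    rows = {lang: [] for lang in languages}
    for by_lang in cognates.values():
        # Number of classes for this meaning = largest class index in use.
        n_classes = max(max(shown) for shown in by_lang.values() if shown)
        for lang in languages:
            shown = by_lang[lang]
            for k in range(1, n_classes + 1):
                rows[lang].append("?" if shown is None else int(k in shown))
    return rows


for lang, row in encode(COGNATES, LANGUAGES).items():
    print(f"{lang:22s}", *row)
```

A language with two words for one meaning (Old High German stirbit, touwit) simply gets a 1 in two columns.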
Observation process




                       Old English                  1     0     0   1   1   0   0     0
                    Old High German                 1     1     0   1   1   0   0     0
                        Avestan                     0     0     1   1   0   1   0     0
                   Old Church Slavonic              0     0     1   1   0   1   0     0
                          Latin                     0     0     1   1   0   0   1     0
                         Oscan                      ?     ?     ?   1   0   0   0     1




Observation process




                       Old English                  1           0   1   1   0
                    Old High German                 1           0   1   1   0
                        Avestan                     0           1   1   0   1
                   Old Church Slavonic              0           1   1   0   1
                          Latin                     0           1   1   0   0
                         Oscan                      ?           ?   1   0   0




Constraints




           Constraints on the tree topology
           30 constraints on the age of some nodes or ancient languages
           These constraints are used to estimate the evolution rates and the
           age.




Model (1): birth-death process


                                                                Traits are born at rate λ
                                                                Traits die at rate µ
                                                                λ and µ are constant
                                                            1       1   0   0   0   0   0   0   0
                                                            2       1   0   1   0   0   0   0   0
                                                            3       1   0   0   0   0   0   0   1
                                                            4       0   0   0   0   1   0   0   0
                                                            5       0   0   0   0   1   0   0   0
                                                            6       1   1   0   0   0   1   1   0
                                                            7       1   1   0   0   0   1   0   0
                                                            8       1   0   0   0   0   0   0   0



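A minimal simulation sketch of the birth-death process along a single lineage. It assumes that λ is the birth rate for the whole lineage and that µ applies to each existing trait separately; the function name and the event-by-event (Gillespie-style) scheme are illustrative choices, not the implementation used in the talk.

```python
import random


def simulate_trait_count(n0, lam, mu, duration, rng):
    """Number of traits in one lineage after `duration` time units.

    n0: initial number of traits
    lam: birth rate of new traits (assumed per lineage)
    mu: death rate (assumed per existing trait)
    """
    n, t = n0, 0.0
    while True:
        total_rate = lam + n * mu
        if total_rate == 0.0:
            return n  # nothing can happen any more
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > duration:
            return n
        if rng.random() < lam / total_rate:
            n += 1  # birth of a new trait
        else:
            n -= 1  # death of an existing trait


rng = random.Random(0)
counts = [simulate_trait_count(0, 10.0, 1.0, 10.0, rng) for _ in range(300)]
print(sum(counts) / len(counts))  # should be close to lam / mu
```

Under these assumptions the number of traits settles around λ/µ, which is the quantity the catastrophe model is tuned to preserve.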
Model (2): catastrophic rate heterogeneity

                                                      Catastrophes occur at rate ρ
                                                      At a catastrophe, each trait dies
                                                      with probability κ and Poiss(ν)
                                                      traits are born.
                                                      λ/µ = ν/κ : the number of traits
                                                      is constant on average.
                                                       1     1      0   0   0   0   0   0   0   0   0   0   0   0    0
                                                       2     1      0   1   0   0   0   0   0   0   0   0   0   0    1
                                                       3     0      0   0   0   0   0   0   0   0   1   1   0   0    0
                                                       4     0      0   0   0   1   0   0   0   0   0   0   0   0    0
                                                       5     0      0   0   0   1   0   0   0   0   0   0   0   0    0
                                                       6     1      0   0   0   0   1   1   0   0   0   0   0   1    0
                                                       7     1      0   0   0   0   1   0   0   0   0   0   0   1    0
                                                       8     1      0   0   0   0   0   0   0   0   0   0   0   1    0

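The balance condition λ/µ = ν/κ can be checked with a short calculation: a lineage holding λ/µ traits on average loses a fraction κ of them at a catastrophe and gains ν new ones on average. The helper names below are illustrative. The last function is one possible reading of the quantity T_C that appears in the posterior later in the talk (the length of ordinary evolution giving the same per-trait death probability); that reading is an assumption, not something stated on this slide.

```python
import math


def balanced_nu(lam, mu, kappa):
    """Mean number of births per catastrophe implied by lam/mu = nu/kappa."""
    return kappa * lam / mu


def expected_after_catastrophe(n, kappa, nu):
    """Expected trait count after one catastrophe, starting from n traits."""
    return n * (1.0 - kappa) + nu


def equivalent_time(kappa, mu):
    """Time of ordinary evolution in which a trait dies with probability kappa."""
    return -math.log(1.0 - kappa) / mu


lam, mu, kappa = 2.0, 0.5, 0.2  # illustrative values only
nu = balanced_nu(lam, mu, kappa)
print(expected_after_catastrophe(lam / mu, kappa, nu))  # stays at lam / mu
print(equivalent_time(kappa, mu))
```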
Model (3): missing data

                                                      Observation process: each
                                                      point goes missing with
                                                      probability ξi
                                                      Some traits are not observed
                                                      and are thinned out of the data
                                                       1     1000?00000?000
                                                       2     ?01000?000000?
                                                       3     0?00?000011000
                                                       4     0000?0?0000?00
                                                       5     00?01?00000000
                                                       6     10000??0?000?0
                                                       7     ?0000?0?000010
                                                       8     10000000000010


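The masking step can be sketched as follows, assuming that the index i in ξi refers to the language (one missing-data probability per row). The function name `mask_missing` and the toy matrix are illustrative.

```python
import random


def mask_missing(matrix, xi, rng):
    """Replace each entry of row i by '?' independently with probability xi[i]."""
    return [["?" if rng.random() < xi[i] else value for value in row]
            for i, row in enumerate(matrix)]


rng = random.Random(42)
data = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
# Row 0 is fully observed, row 2 is fully missing, row 1 is in between.
for row in mask_missing(data, [0.0, 0.5, 1.0], rng):
    print(*row)
```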
Observation process




     0      1     0     0      1     0       1   1      0
     0      0     0     1      1     0       0   1      1
     1      1     0     1      1     1       1   1      1
     1      0     0     1      0     1       1   1      0
     0      0     1     1      1     1       0   0      1




Observation process




     ?      1     0     0      ?     0       1   1      0
     0      0     ?     ?      1     0       0   1      1
     ?      1     ?     ?      ?     1       ?   1      1
     1      0     0     1      0     1       1   1      0
     0      ?     ?     1      1     1       0   0      1




Observation process




            1           0      ?     0       1   1      0
            0           ?      1     0       0   1      1
            1           ?      ?     1       ?   1      1
            0           1      0     1       1   1      0
            ?           1      1     1       0   0      1




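The last two matrices of this example differ by the removal of two columns. The rule used below (keep a column only if at least two languages show an observed 1) is inferred from the example matrices and is an assumption; the slides only say that some traits are thinned out of the data.

```python
def thin_columns(matrix, min_ones=2):
    """Drop the columns that have fewer than `min_ones` observed 1s."""
    n_cols = len(matrix[0])
    keep = [j for j in range(n_cols)
            if sum(row[j] == "1" for row in matrix) >= min_ones]
    return [[row[j] for j in keep] for row in matrix]


# Matrix with missing entries, copied from the slide.
MASKED = [line.split() for line in [
    "? 1 0 0 ? 0 1 1 0",
    "0 0 ? ? 1 0 0 1 1",
    "? 1 ? ? ? 1 ? 1 1",
    "1 0 0 1 0 1 1 1 0",
    "0 ? ? 1 1 1 0 0 1",
]]

for row in thin_columns(MASKED):
    print(*row)
```

Under this rule the first and third columns are dropped, which reproduces the final matrix of the example.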
TraitLab software
           Bayesian inference
           Markov Chain Monte Carlo
           (Almost) uniform prior over the age of the root




Why be Bayesian?




   In the settings described in this talk, it usually makes sense to use
   Bayesian inference, because:
           The models are complex
           Estimating uncertainty is paramount
           The output of one model is used as the input of another
           We are interested in complex functions of our parameters




Frequentist statistics

           Statistical inference deals with estimating an unknown parameter
           θ given some data D.
           In the frequentist view of statistics, θ has a true fixed
           (deterministic) value.
           Uncertainty is measured by confidence intervals, which are not
           intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20)
           for θ, I cannot say that there is a 95% probability that θ belongs to
           the interval [80 ; 120].
           Frequentist statistics often use the maximum likelihood estimator:
           for which value of θ would the data be most likely (under our
           model)?
           \[ L(\theta \mid D) = P[D \mid \theta], \qquad \hat{\theta} = \arg\max_{\theta} L(\theta \mid D) \]
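
A toy illustration of maximum likelihood, using a model that is not part of the talk: k successes in n independent trials with unknown success probability θ. The grid search and the function names are illustrative assumptions.

```python
def likelihood(theta, k, n):
    """L(theta | D) for k successes in n trials, up to a constant factor."""
    return theta ** k * (1.0 - theta) ** (n - k)


def grid_mle(k, n, grid_size=1001):
    """Value of theta on a regular grid that maximises the likelihood."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda theta: likelihood(theta, k, n))


print(grid_mle(7, 10))  # the analytical answer is k / n
```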


Bayesian statistics

           In the Bayesian framework, the parameter θ is seen as inherently
           random: it has a distribution.
           Before I see any data, I have a prior distribution π(θ) on θ, usually
           uninformative.
           Once I take the data into account, I get a posterior distribution,
           which is hopefully more informative.

                                             π(θ|D) ∝ π(θ)L(θ|D)

           Different people have different priors, hence different posteriors.
           But with enough data, the choice of prior matters little.
           We are now allowed to make probability statements about θ, such
           as "there is a 95% probability that θ belongs to the interval
           [78 ; 119]" (credible interval)
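
The relation π(θ|D) ∝ π(θ)L(θ|D) can be made concrete on a toy model that is not part of the talk: k successes in n trials, with a uniform prior on θ. The grid approximation and the function names are illustrative assumptions.

```python
def grid_posterior(k, n, grid_size=1001):
    """Posterior of theta on a grid: uniform prior times binomial likelihood."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = 1.0  # uniform prior on [0, 1]
    unnormalised = [prior * t ** k * (1.0 - t) ** (n - k) for t in grid]
    total = sum(unnormalised)
    return grid, [w / total for w in unnormalised]


def credible_interval(grid, posterior, level=0.95):
    """Equal-tailed credible interval read off the cumulative posterior."""
    tail = (1.0 - level) / 2.0
    cumulative, lower, upper = 0.0, grid[0], grid[-1]
    lower_found = False
    for theta, p in zip(grid, posterior):
        cumulative += p
        if not lower_found and cumulative >= tail:
            lower, lower_found = theta, True
        if cumulative >= 1.0 - tail:
            upper = theta
            break
    return lower, upper


grid, post = grid_posterior(7, 10)
print(credible_interval(grid, post))
```

The interval returned here supports the direct statement "θ lies in this interval with 95% posterior probability", which a confidence interval does not.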


Advantages and drawbacks of Bayesian statistics




           More intuitive interpretation of the results
           Easier to think about uncertainty
           In a hierarchical setting, it becomes easier to take into account all
           the sources of variability
           Prior specification: need to check that changing your prior does
           not change your result
           Computationally intensive




Prior and inference
     Parameter                   Prior        Note on prior                            Method
     Tree g                      f_G          marginally uniform on root age,          MCMC
                                              uniform on topologies
     Death rate µ                1/µ          improper; scale-invariant                MCMC
     Birth rate λ                1/λ          improper; scale-invariant                integration
     Birth time Z                PPP          Poisson process + observation process    integration (pruning)
     Catastrophe time k          PPP          total per edge                           MCMC
     Catastrophe rate ρ          f_R, Γ       95% CI: 1/tree to 1/edge                 MCMC
     Catastrophe death rate κ    U(0, 1)                                               MCMC
     Missing data rate ξ         U(0, 1)^L                                             MCMC

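For readers unfamiliar with the MCMC entries in the Method column, here is a minimal random-walk Metropolis sketch. It is not the TraitLab sampler: the target is a toy posterior (k successes in n trials with a uniform prior on θ), and all names and tuning values are illustrative assumptions.

```python
import math
import random


def log_posterior(theta, k, n):
    """Log posterior up to a constant: uniform prior, binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)


def metropolis(k, n, n_steps, step, rng):
    """Random-walk Metropolis chain targeting the posterior of theta."""
    theta = 0.5
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_posterior(proposal, k, n) - log_posterior(theta, k, n)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)
    return samples


rng = random.Random(1)
chain = metropolis(7, 10, 20000, 0.2, rng)
kept = chain[2000:]  # discard an initial burn-in period
print(sum(kept) / len(kept))  # posterior mean is (k + 1) / (n + 2)
```

Only ratios of posterior densities are needed, which is why the normalising constant of the posterior never has to be computed.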
Posterior distribution



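\[
\begin{aligned}
p(g,\mu,\lambda,\kappa,\rho,\xi \mid \mathcal{D} = D)
 ={}& \frac{1}{N!}\left(\frac{\lambda}{\mu}\right)^{N}
   \exp\Biggl(-\frac{\lambda}{\mu}\sum_{\langle i,j\rangle\in E}
   P[E_Z \mid Z=(t_i,i), g,\mu,\kappa,\xi]\,\bigl(1-e^{-\mu(t_j-t_i+k_iT_C)}\bigr)\Biggr)\\
 &\times \prod_{a=1}^{N}\Biggl(\sum_{\langle i,j\rangle\in E_a}\sum_{\omega\in\Omega_a}
   P[M=\omega \mid Z=(t_i,i), g,\mu]\,\bigl(1-e^{-\mu(t_j-t_i+k_iT_C)}\bigr)\Biggr)\\
 &\times \frac{1}{\mu\lambda}\,p(\rho)\,f_G(g\mid T)\,
   \frac{e^{-\rho|g|}(\rho|g|)^{k_T}}{k_T!}\,
   \prod_{i=1}^{L}(1-\xi_i)^{Q_i}\,\xi_i^{N-Q_i}
\end{aligned}
\]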
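The last line of the posterior (the 1/(µλ) prior, the Poisson term for the total number of catastrophes k_T, and the product over ξ_i) is simple to evaluate in log-space. The sketch below does only that; it omits p(ρ) and f_G(g|T), whose forms are not given on this slide. The function name and all numerical values are hypothetical.

```python
import math


def log_tail_factor(mu, lam, rho, tree_length, k_total, xi, q, n_traits):
    """Log of (1/(mu*lam)) * Poisson(k_total; rho*|g|) * prod_i (1-xi_i)^Q_i xi_i^(N-Q_i).

    Requires mu, lam, rho, tree_length > 0 and 0 < xi_i < 1.
    p(rho) and f_G(g|T) are NOT included.
    """
    log_p = -math.log(mu) - math.log(lam)
    # Poisson probability of k_total catastrophes on a tree of total length |g|
    mean = rho * tree_length
    log_p += -mean + k_total * math.log(mean) - math.lgamma(k_total + 1)
    # One factor per leaf i = 1..L
    for xi_i, q_i in zip(xi, q):
        log_p += q_i * math.log(1.0 - xi_i) + (n_traits - q_i) * math.log(xi_i)
    return log_p


# Hypothetical values, for illustration only
value = log_tail_factor(mu=2e-4, lam=0.02, rho=1e-4, tree_length=30000.0,
                        k_total=3, xi=[0.9, 0.8, 0.95], q=[10, 25, 4],
                        n_traits=100)
print(value)
```

Working in log-space avoids underflow when L and N are large, since the raw product over leaves quickly becomes smaller than the smallest representable float.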




Likelihood calculation


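\[
\sum_{\omega\in\Omega_a^{(c)}} P[M=\omega \mid Z=(t_i,c), g,\mu] =
\begin{cases}
\delta_{i,c}\times\displaystyle\sum_{\omega\in\Omega_a^{(c)}} P[M=\omega \mid Z=(t_c,c), g,\mu]
  & \text{if } Y(\Omega_a^{(c)}) \ge 1\\[1ex]
(1-\delta_{i,c})+\delta_{i,c}\times\displaystyle\sum_{\omega\in\Omega_a^{(c)}} P[M=\omega \mid Z=(t_c,c), g,\mu]
  & \text{if } Y(\Omega_a^{(c)}) = 0 \text{ and } Q(\Omega_a^{(c)}) \ge 1\\[1ex]
(1-\delta_{i,c})+\delta_{i,c}\,v_c^{(0)}
  & \text{if } Y(\Omega_a^{(c)}) + Q(\Omega_a^{(c)}) = 0 \quad (\text{i.e. } \Omega_a^{(c)}=\{\emptyset\})
\end{cases}
\]

\[
\sum_{\omega\in\Omega_a^{(c)}} P[M=\omega \mid Z=(t_c,c), g,\mu] =
\begin{cases}
1 & \text{if } \Omega_a^{(c)}=\{\{c\},\emptyset\} \text{ or } \{\{c\}\} \quad (\text{i.e. } D_{c,a}\in\{?,1\})\\
0 & \text{if } \Omega_a^{(c)}=\{\emptyset\} \quad (\text{i.e. } D_{c,a}=0)
\end{cases}
\]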


MCMC




           Fit the model to the data
           Trees that make the data likely
           Obtain a sample of trees and dates
           Samples weighted by quality of fit to data
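As a toy illustration of the MCMC idea (not the tree sampler used in the talk), the sketch below runs a Metropolis-Hastings chain for a single death rate µ. It assumes, purely for illustration, that we observe trait lifetimes that are exponential with rate µ, and it uses the improper 1/µ prior. The function names and data are hypothetical.

```python
import math
import random


def log_posterior(mu, lifetimes):
    """Unnormalised log posterior: exponential likelihood times 1/mu prior."""
    if mu <= 0:
        return float("-inf")
    n = len(lifetimes)
    return n * math.log(mu) - mu * sum(lifetimes) - math.log(mu)


def metropolis_mu(lifetimes, n_iter=5000, step=0.3, seed=1):
    """Random-walk Metropolis-Hastings on log(mu)."""
    rng = random.Random(seed)
    mu = len(lifetimes) / sum(lifetimes)  # start at the maximum-likelihood value
    samples = []
    for _ in range(n_iter):
        # Multiplicative proposal, which keeps mu positive
        proposal = mu * math.exp(step * rng.gauss(0.0, 1.0))
        # The Hastings correction for this proposal is proposal / mu
        log_alpha = (log_posterior(proposal, lifetimes)
                     - log_posterior(mu, lifetimes)
                     + math.log(proposal / mu))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            mu = proposal
        samples.append(mu)
    return samples


lifetimes = [3000.0, 5200.0, 4100.0, 6900.0, 2500.0]  # hypothetical data
samples = metropolis_mu(lifetimes)
print(sum(samples) / len(samples))
```

States that fit the data well are visited more often, which is the sense in which the samples are "weighted by quality of fit".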




Tests on synthetic data




     Figure: True tree, 40 words/language
     Figure: Consensus tree




Tests on synthetic data (2)




                                             Figure: Death rate (µ)




Initial model: no catastrophes


           Traits are born at rate λ
           Traits die at rate µ
           λ and µ are constant

           1       1   0   0   0   0   0   0   0
           2       1   0   1   0   0   0   0   0
           3       1   0   0   0   0   0   0   1
           4       0   0   0   0   1   0   0   0
           5       0   0   0   0   1   0   0   0
           6       1   1   0   0   0   1   1   0
           7       1   1   0   0   0   1   0   0
           8       1   0   0   0   0   0   0   0
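The birth-death dynamics on this slide can be simulated along a single branch. The sketch below is a minimal illustration with hypothetical rates and a hypothetical helper `simulate_branch`; it tracks only the number of traits, not their identities.

```python
import numpy as np


def simulate_branch(n_start, lam, mu, length, rng):
    """Number of traits present at the end of a branch of the given length.

    Each of the n_start initial traits dies independently at rate mu.
    New traits are born as a Poisson process of rate lam; a trait born
    at time s is still alive at the end with probability exp(-mu * (length - s)).
    """
    survivors = rng.binomial(n_start, np.exp(-mu * length))
    n_births = rng.poisson(lam * length)
    birth_times = rng.uniform(0.0, length, size=n_births)
    alive = rng.random(n_births) < np.exp(-mu * (length - birth_times))
    return int(survivors) + int(alive.sum())


rng = np.random.default_rng(42)
lam, mu = 0.02, 2e-4  # hypothetical rates; lam / mu = 100 traits at equilibrium
print(simulate_branch(n_start=100, lam=lam, mu=mu, length=1000.0, rng=rng))
```

With constant λ and µ the expected number of traits settles at λ/µ, so a branch that starts at that value stays there on average.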



Mis-specification: catastrophic heterogeneity




     Figure: four panels, (a) to (d)

Influence of borrowing (1)




     Figure: True tree, 40 words/language, 10% borrowing
     Figure: Consensus tree



Influence of borrowing (2)




     Figure: True tree, 40 words/language, 50% borrowing
     Figure: Consensus tree




Influence of borrowing (3)

           The topology is reconstructed well
           Dates are under-estimated




     Figure: Root age
     Figure: Death rate (µ)



Presence of borrowing?

     Figure: curves for Ringe 100 and for b=0, b=0.1, b=0.5 and b=1; vertical axis from 0.4 to 1, horizontal axis from 2 to 24




Mis-specifications



     Heterogeneity between traits                      Analyse subset of data + simulated data
     Heterogeneity in time/space (non catastrophic)    Simulated data analysis with edge rate from a Γ distribution
     Borrowing                                         Simulated data analysis + check level of borrowing
     Data missing in blocks                            Simulated data analysis
     Non-empty meaning categories                      Simulated data analysis




Data




           Indo-European languages
           Core vocabulary (Swadesh 100 or 207)
           Two (almost) independent data sets
           Dyen et al. (1997): 87 languages, mostly modern
           Ringe et al. (2002): 24 languages, mostly ancient




Cross-validation



           Predict age of nodes for which we have a constraint: would we
           reject the truth?
           Γ: space of trees which respect all constraints
           Γ_{-c}: space of trees with constraint c removed, c = 1, ..., 30
           M0: g ∈ Γ; M1: g ∈ Γ_{-c}. Bayes factor:

           \[ B^{(c)} = \frac{P[g \in \Gamma \mid D,\ g \in \Gamma_{-c}]}{P[g \in \Gamma \mid g \in \Gamma_{-c}]} \]

           Constraint c conflicts with the model if 2 log B^{(c)} < −5.
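One simple way to estimate this Bayes factor is from two MCMC runs with constraint c removed: one targeting the posterior and one targeting the prior. In each run, count the fraction of sampled trees that happen to satisfy constraint c. The sketch below assumes natural logarithms and uses hypothetical counts; the helper name is not from TraitLab.

```python
import math


def two_log_bayes_factor(posterior_hits, n_posterior, prior_hits, n_prior):
    """Estimate 2 log B(c) from sample counts.

    posterior_hits / n_posterior estimates P[g in Gamma | D, g in Gamma_{-c}],
    prior_hits / n_prior estimates P[g in Gamma | g in Gamma_{-c}].
    """
    if prior_hits == 0:
        raise ValueError("constraint never satisfied under the prior sample")
    if posterior_hits == 0:
        return float("-inf")
    posterior_prob = posterior_hits / n_posterior
    prior_prob = prior_hits / n_prior
    return 2.0 * math.log(posterior_prob / prior_prob)


# Hypothetical counts: the constraint holds in 3% of posterior samples
# but in 40% of prior samples.
score = two_log_bayes_factor(30, 1000, 400, 1000)
print(score, "conflict" if score < -5 else "no conflict")
```

A strongly negative value means the data pull the node age away from the range the constraint allows, which is the situation the threshold of −5 is meant to flag.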




Cross validation

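     Figure: two panels; horizontal axis labels for the 30 constraints: HI, TA, TB, LU, LY, OI, UM, OS, LA, GK, AR, GO, ON, OE, OG, OS, PR, AV, PE, VE, CE, IT, GE, WG, NW, BS, BA, IR, II, TG. Upper panel: vertical axis from −100 to 100, with ticks at 0, ±2, ±5 and ±10. Lower panel: vertical axis from 0 to 8000.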
Consensus tree: modern languages (Dyen data)




Consensus tree: ancient languages (Ringe data)
     Figure: consensus tree with leaves oldhighgerman, oldenglish, oldnorse, gothic, oscan, umbrian, latin, welsh, oldirish, oldpersian, avestan, vedic, lithuanian, latvian, oldprussian, oldcslavonic, greek, armenian, lycian, luvian, hittite, tocharian_b, tocharian_a, albanian; node labels 66, 85, 58, 78, 62; time axis from 8000 to 0




Root age




Conclusions




           Strong support for Anatolian farming hypothesis: root around 8000
           BP
           Statistics reconstruct known linguistic facts and answer
           unresolved questions
           TraitLab: it’s free! (Though Matlab is not...)




Semitic lexical data




           Data: Kitchen et al. (2009)
           25 languages, 96 meanings, 674 cognacy classes
           Questions of interest: root age (constraint known), topology,
           outgroup




Model validation




   Thin bar: constraint. Thick bar: 95% posterior HPD. (Red bar: 95%
   prior HPD)
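
A 95% HPD (highest posterior density) interval is usually computed from MCMC samples as the shortest interval containing 95% of them. Below is a minimal sketch, assuming a unimodal distribution; `hpd_interval` is a hypothetical helper and the samples are simulated, not taken from the analysis.

```python
import numpy as np


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    k = max(1, int(np.ceil(mass * n)))  # number of samples inside the interval
    # Width of every window of k consecutive sorted samples
    widths = x[k - 1:] - x[: n - k + 1]
    i = int(np.argmin(widths))
    return float(x[i]), float(x[i + k - 1])


rng = np.random.default_rng(3)
fake_root_ages = rng.normal(4750.0, 180.0, size=5000)  # simulated, illustration only
print(hpd_interval(fake_root_ages))
```

For a skewed posterior the HPD interval differs from the equal-tailed interval, which is one reason to report it for node ages.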




Conclusions




           Root age 95% HPD: 4400 – 5100 BP
           Akkadian outgroup: 67% (Syrian homeland?)
           Zero catastrophes: 33%

Back to Bergsland and Vogt



           Norse family, 8 languages.
           Selection bias
           Claim that the rate of change is significantly different for these
           data.
           B&V included words used only in literary Icelandic, which we
           exclude
           We can handle polymorphism
           Do not include catastrophes




Known history



     Figure: Gjestal, Sandnes, Riksmal, Icelandic; axis labels X, XI, XII, XIII




Tests




   Two possible ways to test whether the same model parameters apply
   to this example and to Indo-European:
      1    Assume parameters are the same as for the general
           Indo-European tree, and estimate ancestral ages.
      2    Use Norse constraints to estimate parameters, and compare to
           parameter estimates from general Indo-European tree




Results




           If we use parameter values from another analysis, we can try to
           estimate the age of 13th century Norse.
           True constraint: 660–760 BP. Our HPD: 615 – 872 BP.
           If we analyse the Norse data on its own, we estimate parameters.
           Value of µ for Norse: (2.47 ± 0.4) · 10^−4
           Value of µ for IE: (1.86 ± 0.39) · 10^−4 (Dyen), (2.37 ± 0.21) · 10^−4
           (Ringe)
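
One rough way to read these rate estimates is to compare them pairwise. The sketch below assumes that each "±" is one posterior standard deviation and that the estimates are independent and approximately normal; none of this is stated on the slide, so treat the result as an informal check only.

```python
import math


def z_score(mean_a, sd_a, mean_b, sd_b):
    """Standardised difference between two independent estimates."""
    return (mean_a - mean_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)


# Values from the slide, in units of 10^-4
norse = (2.47, 0.40)
dyen = (1.86, 0.39)
ringe = (2.37, 0.21)

print("Norse vs Dyen: ", z_score(*norse, *dyen))
print("Norse vs Ringe:", z_score(*norse, *ringe))
```

Under these assumptions both differences are well within two standard deviations, which is consistent with the Norse data having a death rate similar to the Indo-European estimates.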




But...




           We can also try to estimate the age of Icelandic (which is 0 BP)
           Find 439–560 BP, far from the true value
           B&V were right: there was significantly less change on the branch
           leading to Icelandic than average
           However, we are still able to estimate internal node ages.




Georgian




           Second data set: Georgian and Mingrelian
           Age of ancestor: last millennium BC
           Code data given by B&V, discarding borrowed items
           Use rate estimate from Ringe et al. analysis
           95% HPD: 2065 – 3170 BP




B&V: conclusions



           Third data set (Armenian) not clear enough to be recoded.
           There is variation in the number of changes on an edge
           Nonetheless, we are still able to estimate ancestral language age
           Variation in borrowing rates
           B&V: "we cannot estimate dates, and it follows that we cannot
           estimate the topology either".
           We can estimate dates, and even if we couldn’t, we might still be
           able to estimate the topology




Atkinson et al. (2008)



           Hypothesis: when a language is founded by a migration, the
           founder effect leads to fast change over a short period of time.
           There is a catastrophe at each branching event.
           Indirect estimation: correlation between number of changes
           between root and leaf, and number of branching events along the
           same path
           Atkinson: 21% of changes in the history of IE are due to
           punctuational bursts
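
The indirect estimation described here boils down to relating, for each leaf, the number of changes on the root-to-leaf path to the number of branching events on that path. The sketch below shows only that basic computation on made-up numbers. It ignores the shared ancestry between paths, which a real analysis would need to account for, and the helper name is hypothetical.

```python
import numpy as np


def path_length_vs_nodes(n_nodes, n_changes):
    """Correlation and least-squares line of changes against branching events."""
    n_nodes = np.asarray(n_nodes, dtype=float)
    n_changes = np.asarray(n_changes, dtype=float)
    r = float(np.corrcoef(n_nodes, n_changes)[0, 1])
    slope, intercept = np.polyfit(n_nodes, n_changes, 1)
    return r, float(slope), float(intercept)


# Made-up data: one entry per leaf
branching_events = [2, 3, 3, 5, 6, 8]
changes_on_path = [40, 44, 41, 52, 55, 61]
r, slope, intercept = path_length_vs_nodes(branching_events, changes_on_path)
print(f"r = {r:.2f}, extra changes per branching event = {slope:.2f}")
```

A positive slope is what a punctuational-burst hypothesis predicts: paths that pass through more branching events accumulate more change over the same elapsed time.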








Direct analysis




           We force a catastrophe on each edge.
           Infer size of catastrophes.
           Find κ very close to 0.
           Less than 1% of change can be attributed to punctuational bursts.
           Reason for discrepancy unclear.




Conclusions




           Strong support for age of PIE around 8000 BP
           Statistical methods can help answer questions which traditional
           methods cannot
           Many more questions and models to come
           TraitLab: it’s free! (although Matlab is not...)




Questions


                            otázky                      kesses
                         spørgsmåler                 cwestiwnau
                           pytania                    preguntes
                          preguntas                       vrae
                          kláusimai                     Fragen
                             voprosy                 quaestiones
                             întrebări                questions
                              vragen                  ερωτήσεις
                           zapitanni                  spurningar
                           domande                   spørsmåler
                           questões                      frågor
                           vprašanja




References



           R. J. Ryder & G. K. Nicholls, Missing data in a stochastic Dollo
           model for cognate data, and its application to the dating of
           Proto-Indo-European (2011), JRSS C
           G. K. Nicholls, Horses or farmers? The tower of Babel and
           confidence in trees (2008), Significance (popular science)
           G. K. Nicholls & R. J. Ryder, Phylogenetic models for Semitic
           vocabulary (2011), IWSM
           R. J. Ryder, Phylogenetic Models of Language Diversification
           (2010), DPhil. thesis, University of Oxford




5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destaque (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

A phylogenetic model of language diversification

  • 1. A Phylogenetic Model of Language Diversification
      Robin J. Ryder (1) and Geoff K. Nicholls (2)
      (1) CEREMADE, Université Paris-Dauphine
      (2) Department of Statistics, University of Oxford
      UCLA, March 2013
      www.slideshare.net/robinryder
  • 2. Gray and Atkinson’s tree(s) R. Ryder & G. Nicholls (Dauphine & Oxford) Language phylogenies UCLA 2013 2 / 81
  • 3. Caveats
      I am not a linguist
      Statistics: additional insight alongside the comparative method
      I use the word "evolution" in a broad sense
      "All models are false, but some are useful"
  • 4. Advantages of statistical methods
      Analyse (very) large datasets
      Test multiple hypotheses
      Cross-validation
      Estimate uncertainty
  • 5. Questions to answer
      Topology of the tree
      Age of ancestor nodes
      Age of root: 6000-6500 BP or 8000-9500 BP (Before Present)?
      6000 BP: Kurgan horsemen; 8000 BP: Anatolian farmers
  • 6. Statistical method in a nutshell
      1 Collect data
      2 Design model
      3 Perform inference (MCMC, ...)
      4 Check convergence
      5 In-model validation (is our inference method able to answer questions from our model?)
      6 Model mis-specification analysis (do we need a more complex model?)
      7 Conclude
  • 7. Outline
      1 Data
      2 Model
      3 Inference
      4 In-model validation
      5 Model mis-specification
      6 Results
      7 Semitic lexical data
      8 Bergsland and Vogt
      9 Punctuational bursts
  • 8. Morris Swadesh and glottochronology
      200/100 word list
      Compares 2 languages (c = fraction of shared cognates)
      Assumes that r, the fraction of cognates retained after 1000 years, is constant for all languages (86%)
      Infers the age t of the Most Recent Common Ancestor, in millennia: $\hat{t} = \frac{\ln c}{2 \ln r}$
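The glottochronology formula on this slide can be checked numerically. The sketch below is a minimal illustration, not code from the talk: the function name `glottochronology_age` and the sample value of c are made up for the example. The factor 2 reflects that both languages lose cognates independently since their common ancestor, so the expected shared fraction after t millennia is r^(2t).

```python
import math


def glottochronology_age(c: float, r: float = 0.86) -> float:
    """Estimated age, in millennia, of the most recent common ancestor.

    c: fraction of shared cognates between the two languages, in (0, 1].
    r: assumed retention rate per 1000 years, in (0, 1) (86% on the slide).
    """
    if not 0.0 < c <= 1.0:
        raise ValueError("c must lie in (0, 1]")
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie in (0, 1)")
    # Shared fraction after t millennia is r**(2 t); solve for t.
    return math.log(c) / (2.0 * math.log(r))


# Hypothetical pair of languages sharing 60% of the word list.
print(f"{glottochronology_age(0.60):.2f} millennia")
```

The estimate is a single number with no uncertainty attached, which is one of the criticisms listed on the Bergsland and Vogt slide.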
  • 9. Swadesh word list
      all, and, animal, ashes, at, back, bad, bark, because, belly, big, bird, bite, black, blood, blow, bone, breast, breathe, burn,
      child, claw, cloud, cold, come, dog, drink, dry, dull, dust, ear, earth, eat, egg, eye,
      fall, far, fat, father, fear, feather, few, fight, fire, fish, five, float, flow, flower, fly, fog,
      grass, green, guts, hair, hand, he, head, hear, heart, heavy, here, hit, hold, horn, how, hunt, husband,
      I, ice, if, in, kill, knee, know, lake, laugh, long, louse, man, many, meat, moon, mother, mountain, mouth,
      name, narrow, near, neck, new, night, nose, not, old, one, other, person, play, pull, push,
      river, road, root, rope, rotten, round, rub,
      salt, sand, say, scratch, sea, see, seed, sew, sharp, short, sing, sit, skin, sky, sleep, small, smell, split, squeeze, stab, stand, star, stick, stone, straight, suck, sun, swell, swim,
      tail, ten, that, there, they, thick, thin, think, this, thou, three, throw, tie,
      walk, warm, wash, water, we, wet, what, when, where, white, who, wide, wife, wind, wing, wipe, with, woman, woods
  • 10. Bergsland and Vogt (1962)
      Found different rates for different pairs of languages: Old Norse and Icelandic, Georgian and Mingrelian, Armenian and Old Armenian
      Discredited glottochronology
      Sankoff (1973): sample selection bias, no estimation of uncertainty
      Fair criticism
      Bad observation protocol from Swadesh
      Does not apply (so much) to modern methods
  • 11-12. Core vocabulary
      100 or 200 words, present in almost all languages: bird, hand, to eat, red...
      Borrowing can occur (evolution not along a tree), but:
        "Easy" to detect
        Rare
        Does not bias the results
  • 13. Binary data: he dies, three, all
      Language               he dies            three     all
      Old English            stierfþ            þrīe      ealle
      Old High German        stirbit, touwit    drī       alle
      Avestan                miriiete           þrāiiō    vispe
      Old Church Slavonic    umĭretŭ            trĭje     vĭsi
      Latin                  moritur            trēs      omnēs
      Oscan                  ?                  trís      súllus
  • 14. Cognacy classes (traits) for the meaning he dies:
      1 {stierfþ, stirbit}
      2 {touwit}
      3 {miriiete, umĭretŭ, moritur}
  • 15. Binary coding of the three classes for he dies:
      O. English     1 0 0
      OH German      1 1 0
      Avestan        0 0 1
      OC Slavonic    0 0 1
      Latin          0 0 1
      Oscan          ? ? ?
  • 16. Cognacy classes for the meaning three:
      1 {þrīe, drī, þrāiiō, trĭje, trēs, trís}
      O. English     1 0 0 1
      OH German      1 1 0 1
      Avestan        0 0 1 1
      OC Slavonic    0 0 1 1
      Latin          0 0 1 1
      Oscan          ? ? ? 1
  • 17. Cognacy classes for all:
      1 {ealle, alle}
      2 {vispe, vĭsi}
      3 {omnēs}
      4 {súllus}
      O. English     1 0 0 1 1 0 0 0
      OH German      1 1 0 1 1 0 0 0
      Avestan        0 0 1 1 0 1 0 0
      OC Slavonic    0 0 1 1 0 1 0 0
      Latin          0 0 1 1 0 0 1 0
      Oscan          ? ? ? 1 0 0 0 1
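The coding step shown on these slides (one binary column per cognacy class, `?` where a language has no attested word) can be sketched as follows. This is an illustration only: the function `encode_meaning` and its argument names are hypothetical, and the class memberships are copied from the "he dies" example above.

```python
from typing import Dict, List, Sequence, Set, Union

Entry = Union[int, str]  # 1, 0 or "?"


def encode_meaning(
    languages: Sequence[str],
    cognacy_classes: Sequence[Set[str]],
    unknown: Set[str],
) -> Dict[str, List[Entry]]:
    """Return one row per language and one column per cognacy class.

    A language listed in `unknown` has no attested word for this meaning,
    so every class for the meaning is coded "?".
    """
    rows: Dict[str, List[Entry]] = {}
    for language in languages:
        if language in unknown:
            rows[language] = ["?"] * len(cognacy_classes)
        else:
            rows[language] = [
                1 if language in members else 0 for members in cognacy_classes
            ]
    return rows


LANGUAGES = [
    "Old English", "Old High German", "Avestan",
    "Old Church Slavonic", "Latin", "Oscan",
]
# Classes for "he dies": {stierfþ, stirbit}, {touwit}, {miriiete, umĭretŭ, moritur}
HE_DIES = [
    {"Old English", "Old High German"},
    {"Old High German"},
    {"Avestan", "Old Church Slavonic", "Latin"},
]

matrix = encode_meaning(LANGUAGES, HE_DIES, unknown={"Oscan"})
for language, row in matrix.items():
    print(f"{language:20s}", *row)
```

Old High German gets two 1s for the same meaning because it has two words (stirbit, touwit); this is the polymorphism that a purely pairwise method has trouble with.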
  • 18-19. Observation process
      Old English            1 0 0 1 1 0 0 0
      Old High German        1 1 0 1 1 0 0 0
      Avestan                0 0 1 1 0 1 0 0
      Old Church Slavonic    0 0 1 1 0 1 0 0
      Latin                  0 0 1 1 0 0 1 0
      Oscan                  ? ? ? 1 0 0 0 1
  • 20. Observation process
      Old English            1 0 1 1 0
      Old High German        1 0 1 1 0
      Avestan                0 1 1 0 1
      Old Church Slavonic    0 1 1 0 1
      Latin                  0 1 1 0 0
      Oscan                  ? ? 1 0 0
  • 21. Constraints
      Constraints on the tree topology
      30 constraints on the age of some nodes or ancient languages
      These constraints are used to estimate the evolution rates and the age.
  • 22. Constraints
  • 24. Model (1): birth-death process
      Traits are born at rate λ
      Traits die at rate µ
      λ and µ are constant
      1    1 0 0 0 0 0 0 0
      2    1 0 1 0 0 0 0 0
      3    1 0 0 0 0 0 0 1
      4    0 0 0 0 1 0 0 0
      5    0 0 0 0 1 0 0 0
      6    1 1 0 0 0 1 1 0
      7    1 1 0 0 0 1 0 0
      8    1 0 0 0 0 0 0 0
  • 25. Model (2): catastrophic rate heterogeneity
      Catastrophes occur at rate ρ
      At a catastrophe, each trait dies with probability κ and Poiss(ν) traits are born.
      λ/µ = ν/κ: the number of traits is constant on average.
      1    1 0 0 0 0 0 0 0 0 0 0 0 0 0
      2    1 0 1 0 0 0 0 0 0 0 0 0 0 1
      3    0 0 0 0 0 0 0 0 0 1 1 0 0 0
      4    0 0 0 0 1 0 0 0 0 0 0 0 0 0
      5    0 0 0 0 1 0 0 0 0 0 0 0 0 0
      6    1 0 0 0 0 1 1 0 0 0 0 0 1 0
      7    1 0 0 0 0 1 0 0 0 0 0 0 1 0
      8    1 0 0 0 0 0 0 0 0 0 0 0 1 0
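To see why the condition λ/µ = ν/κ keeps the number of traits constant on average, the process can be simulated along a single lineage. The sketch below is an illustration under stated assumptions, not TraitLab code: it tracks only the number of traits (not their identities or a tree), the function name `simulate_trait_count` is hypothetical, and the parameter values are arbitrary.

```python
import numpy as np


def simulate_trait_count(lam, mu, rho, kappa, duration, n0, rng):
    """Number of traits alive on one lineage after `duration` time units.

    Births occur at rate lam, each trait dies at rate mu, and catastrophes
    occur at rate rho. At a catastrophe each trait dies with probability
    kappa and Poisson(nu) traits are born, with nu = kappa * lam / mu so
    that lam / mu = nu / kappa as on the slide.
    """
    nu = kappa * lam / mu
    t, n = 0.0, int(n0)
    while True:
        total_rate = lam + mu * n + rho
        t += rng.exponential(1.0 / total_rate)  # waiting time to next event
        if t > duration:
            return int(n)
        u = rng.random() * total_rate
        if u < lam:                      # birth of a new trait
            n += 1
        elif u < lam + mu * n:           # death of one existing trait
            n -= 1
        else:                            # catastrophe
            n = n - rng.binomial(n, kappa) + rng.poisson(nu)


demo_rng = np.random.default_rng(0)
demo_counts = [
    simulate_trait_count(lam=2.0, mu=0.1, rho=0.05, kappa=0.3,
                         duration=50.0, n0=20, rng=demo_rng)
    for _ in range(100)
]
print("mean trait count:", np.mean(demo_counts), "expected about", 2.0 / 0.1)
```

Starting from λ/µ traits, the expected count stays at λ/µ: a catastrophe removes a fraction κ of the traits on average and adds ν = κλ/µ new ones, which balances exactly at that level.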
  • 26. Model (3): missing data
      Observation process: each point goes missing with probability ξi
      Some traits are not observed and are thinned out of the data
      1    1000?00000?000
      2    ?01000?000000?
      3    0?00?000011000
      4    0000?0?0000?00
      5    00?01?00000000
      6    10000??0?000?0
      7    ?0000?0?000010
      8    10000000000010
  • 27-28. Observation process
      0 1 0 0 1 0 1 1 0
      0 0 0 1 1 0 0 1 1
      1 1 0 1 1 1 1 1 1
      1 0 0 1 0 1 1 1 0
      0 0 1 1 1 1 0 0 1
  • 29-30. Observation process
      ? 1 0 0 ? 0 1 1 0
      0 0 ? ? 1 0 0 1 1
      ? 1 ? ? ? 1 ? 1 1
      1 0 0 1 0 1 1 1 0
      0 ? ? 1 1 1 0 0 1
  • 31. Observation process
      1 0 ? 0 1 1 0
      0 ? 1 0 0 1 1
      1 ? ? 1 ? 1 1
      0 1 0 1 1 1 0
      ? 1 1 1 0 0 1
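The two stages of the observation process (entries go missing, then some traits are thinned out) can be sketched as below. The thinning rule used here is an assumption read off the example matrices on the slides, where every column with fewer than two observed 1s disappears; the slides do not state the rule in words. The function names `thin` and `observe`, and the use of -1 to encode `?`, are choices made for this example.

```python
import numpy as np

MISSING = -1  # stands for "?" in the slides


def thin(data, min_ones=2):
    """Drop trait columns with fewer than `min_ones` observed 1s (assumed rule)."""
    data = np.asarray(data)
    keep = (data == 1).sum(axis=0) >= min_ones
    return data[:, keep]


def observe(traits, xi, rng, min_ones=2):
    """Apply missingness, then thinning, to a 0/1 matrix of traits.

    traits: array of shape (languages, traits) with entries 0 or 1.
    xi:     one missing-data probability per language (row).
    """
    traits = np.asarray(traits)
    xi = np.asarray(xi, dtype=float)
    missing = rng.random(traits.shape) < xi[:, None]
    return thin(np.where(missing, MISSING, traits), min_ones)


# Matrix from the slide with missing entries, "?" written as -1.
with_missing = np.array([
    [-1,  1,  0,  0, -1,  0,  1,  1,  0],
    [ 0,  0, -1, -1,  1,  0,  0,  1,  1],
    [-1,  1, -1, -1, -1,  1, -1,  1,  1],
    [ 1,  0,  0,  1,  0,  1,  1,  1,  0],
    [ 0, -1, -1,  1,  1,  1,  0,  0,  1],
])
print(thin(with_missing))
```

Applied to the matrix with missing entries, `thin` removes the first and third columns, which reproduces the 5 by 7 matrix on the last "Observation process" slide.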
  • 33. TraitLab software
      Bayesian inference
      Markov Chain Monte Carlo
      (Almost) uniform prior over the age of the root
  • 34. Why be Bayesian?
      In the settings described in this talk, it usually makes sense to use Bayesian inference, because:
        The models are complex
        Estimating uncertainty is paramount
        The output of one model is used as the input of another
        We are interested in complex functions of our parameters
  • 35-36. Frequentist statistics
      Statistical inference deals with estimating an unknown parameter θ given some data D.
      In the frequentist view of statistics, θ has a true fixed (deterministic) value.
      Uncertainty is measured by confidence intervals, which are not intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20) for θ, I cannot say that there is a 95% probability that θ belongs to the interval [80 ; 120].
      Frequentist statistics often use the maximum likelihood estimator: for which value of θ would the data be most likely (under our model)?
      $L(\theta \mid D) = P[D \mid \theta]$
      $\hat{\theta} = \arg\max_{\theta} L(\theta \mid D)$
  • 37. Bayesian statistics
      In the Bayesian framework, the parameter θ is seen as inherently random: it has a distribution.
      Before I see any data, I have a prior distribution π(θ), usually uninformative.
      Once I take the data into account, I get a posterior distribution, which is hopefully more informative.
      $\pi(\theta \mid D) \propto \pi(\theta) L(\theta \mid D)$
      Different people have different priors, hence different posteriors. But with enough data, the choice of prior matters little.
      We are now allowed to make probability statements about θ, such as "there is a 95% probability that θ belongs to the interval [78 ; 119]" (credible interval)
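The relation posterior ∝ prior × likelihood can be made concrete on a one-parameter toy problem. The sketch below is not from the talk: it assumes a binomial model (k successes out of n trials with success probability θ), a uniform prior, and a simple grid approximation; the function names are hypothetical.

```python
import numpy as np


def grid_posterior(k, n, grid_size=2001):
    """Posterior of theta on a regular grid for k successes out of n trials."""
    theta = np.linspace(0.0, 1.0, grid_size)
    prior = np.ones_like(theta)                    # uniform prior pi(theta)
    likelihood = theta**k * (1.0 - theta)**(n - k)  # L(theta | D), up to a constant
    unnormalised = prior * likelihood
    return theta, unnormalised / unnormalised.sum()


def credible_interval(theta, posterior, level=0.95):
    """Equal-tailed credible interval read off the grid posterior."""
    cdf = np.cumsum(posterior)
    tail = (1.0 - level) / 2.0
    lower = theta[np.searchsorted(cdf, tail)]
    upper = theta[np.searchsorted(cdf, 1.0 - tail)]
    return lower, upper


theta, posterior = grid_posterior(k=30, n=100)
low, high = credible_interval(theta, posterior)
print(f"posterior mean {np.sum(theta * posterior):.3f}, "
      f"95% credible interval [{low:.3f} ; {high:.3f}]")
```

The interval printed here supports the direct statement "θ lies in this interval with 95% posterior probability", which is the interpretation the slide contrasts with a confidence interval. A grid only works for one or two parameters; for a model with a tree and several rates, sampling methods such as MCMC are needed.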
  • 38. Advantages and drawbacks of Bayesian statistics
      More intuitive interpretation of the results
      Easier to think about uncertainty
      In a hierarchical setting, it becomes easier to take into account all the sources of variability
      Prior specification: need to check that changing your prior does not change your result
      Computationally intensive
  • 39. Prior and inference
      Parameter                   Prior        Note on prior                                            Method
      Tree g                      f_G          marginally uniform on root age, uniform on topologies    MCMC
      Death rate µ                1/µ          improper; invariant by scale change                      MCMC
      Birth rate λ                1/λ          improper; invariant by scale change                      integration
      Birth time Z                PPP          Poisson process + observation process                    integration (pruning)
      Catastrophe time k          PPP          Total per edge                                           MCMC
      Catastrophe rate ρ          f_R, Γ       95% CI: 1/tree to 1/edge                                 MCMC
      Catastrophe death rate κ    U(0, 1)                                                               MCMC
      Missing data rate ξ         U(0, 1)^L                                                             MCMC
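  • 40. Posterior distribution
      $$
      \begin{aligned}
      p(g,\mu,\lambda,\kappa,\rho,\xi \mid D = D)
      ={}& \frac{1}{N!}\left(\frac{\lambda}{\mu}\right)^{N}
      \exp\left(-\frac{\lambda}{\mu}\sum_{\langle i,j\rangle\in E} P[E_Z \mid Z=(t_i,i),g,\mu,\kappa,\xi]\,\bigl(1-e^{-\mu(t_j-t_i+k_i T_C)}\bigr)\right)\\
      &\times \prod_{a=1}^{N}\left(\sum_{\langle i,j\rangle\in E_a}\ \sum_{\omega\in\Omega_a} P[M=\omega \mid Z=(t_i,i),g,\mu]\,\bigl(1-e^{-\mu(t_j-t_i+k_i T_C)}\bigr)\right)\\
      &\times \frac{1}{\mu\lambda}\,p(\rho)\,f_G(g\mid T)\,\frac{e^{-\rho|g|}(\rho|g|)^{k_T}}{k_T!}\,\prod_{i=1}^{L}(1-\xi_i)^{Q_i}\,\xi_i^{N-Q_i}
      \end{aligned}
      $$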
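  • 41. Likelihood calculation
      $$
      \sum_{\omega\in\Omega_a^{(c)}} P[M=\omega \mid Z=(t_i,c),g,\mu] =
      \begin{cases}
      \delta_{i,c}\times\sum_{\omega\in\Omega_a^{(c)}} P[M=\omega\mid Z=(t_c,c),g,\mu] & \text{if } Y(\Omega_a^{(c)})\ge 1\\
      (1-\delta_{i,c})+\delta_{i,c}\times\sum_{\omega\in\Omega_a^{(c)}} P[M=\omega\mid Z=(t_c,c),g,\mu] & \text{if } Y(\Omega_a^{(c)})=0 \text{ and } Q(\Omega_a^{(c)}) \dots\\
      (1-\delta_{i,c})+\delta_{i,c}\,v_c^{(0)} & \text{if } Y(\Omega_a^{(c)})+Q(\Omega_a^{(c)}) = \dots \text{ (i.e. } \Omega_a^{(c)}=\{\emptyset\}\text{)}
      \end{cases}
      $$
      $$
      \sum_{\omega\in\Omega_a^{(c)}} P[M=\omega\mid Z=(t_c,c),g,\mu] =
      \begin{cases}
      1 & \text{if } \Omega_a^{(c)}=\{\{c\},\emptyset\} \text{ or } \{\{c\}\} \text{ (i.e. } D_{c,a}\in\{?,1\}\text{)}\\
      0 & \text{if } \Omega_a^{(c)}=\{\emptyset\} \text{ (i.e. } D_{c,a}=0\text{)}
      \end{cases}
      $$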
  • 42. MCMC
      Fit the model to the data
      Trees that make the data likely
      Obtain a sample of trees and dates
      Samples weighted by quality of fit to data
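As a rough picture of what "obtain a sample weighted by quality of fit" means, here is a random-walk Metropolis sampler for a single death rate µ. It is a toy, not TraitLab's sampler, and everything about the setup is an assumption made for the example: n traits are present at time 0, each survives a period t with probability exp(-µt), k survivors are observed, and µ has the improper prior 1/µ listed in the prior table. The walk is run on θ = log µ, where that prior becomes flat.

```python
import math
import random
from typing import List


def log_target(theta: float, k: int, n: int, t: float) -> float:
    """Log posterior density of theta = log(mu), up to an additive constant."""
    mu = math.exp(theta)
    lost = -math.expm1(-mu * t)  # probability 1 - exp(-mu t) that a trait died
    if lost <= 0.0:
        return -math.inf
    # k traits survived (log prob -mu t each), n - k traits died.
    return -k * mu * t + (n - k) * math.log(lost)


def metropolis(k: int, n: int, t: float, n_iter: int = 20000,
               step: float = 0.2, seed: int = 0) -> List[float]:
    """Return n_iter sampled values of mu (successive samples are correlated)."""
    if not 0 < k < n:
        raise ValueError("this toy example needs 0 < k < n")
    rng = random.Random(seed)
    theta = math.log(-math.log(k / n) / t)  # start at the maximum likelihood value
    current = log_target(theta, k, n, t)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)      # symmetric random-walk move
        candidate = log_target(proposal, k, n, t)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(math.exp(theta))
    return samples


draws = metropolis(k=120, n=200, t=1000.0)
print(f"posterior mean of mu: {sum(draws) / len(draws):.2e}")
```

Values of µ that fit the data well are visited more often, so summaries such as a posterior mean or a credible interval are simple averages over the sample. In the full model the state also contains the tree and the other parameters, and the moves are correspondingly more elaborate.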
  • 44. Tests on synthetic data
      Figure: True tree, 40 words/language
      Figure: Consensus tree
  • 45. Tests on synthetic data (2)
      Figure: Death rate (µ)
  • 47. Initial model: no catastrophes
      Traits are born at rate λ
      Traits die at rate µ
      λ and µ are constant
      1    1 0 0 0 0 0 0 0
      2    1 0 1 0 0 0 0 0
      3    1 0 0 0 0 0 0 1
      4    0 0 0 0 1 0 0 0
      5    0 0 0 0 1 0 0 0
      6    1 1 0 0 0 1 1 0
      7    1 1 0 0 0 1 0 0
      8    1 0 0 0 0 0 0 0
  • 48. Mis-specification: catastrophic heterogeneity
      Figure: panels (a), (b), (c), (d)
  • 49. Influence of borrowing (1)
      Figure: True tree, 40 words/language, 10% borrowing
      Figure: Consensus tree
  • 50. Influence of borrowing (2)
      Figure: True tree, 40 words/language, 50% borrowing
      Figure: Consensus tree
  • 51. Influence of borrowing (3)
      The topology is reconstructed well
      Dates are under-estimated
      Figure: Root age
      Figure: Death rate (µ)
  • 52. Presence of borrowing?
      Figure: curves for Ringe 100 and for b = 0, b = 0.1, b = 0.5, b = 1; vertical axis from 0.4 to 1, horizontal axis from 2 to 24
  • 53. Mis-specifications
      Mis-specification                                 Check
      Heterogeneity between traits                      Analyse subset of data + simulated data
      Heterogeneity in time/space (non catastrophic)    Simulated data analysis with edge rate from a Γ distribution
      Borrowing                                         Simulated data analysis + check level of borrowing
      Data missing in blocks                            Simulated data analysis
      Non-empty meaning categories                      Simulated data analysis
  • 55. Data
      Indo-European languages
      Core vocabulary (Swadesh 100 or 207)
      Two (almost) independent data sets
      Dyen et al. (1997): 87 languages, mostly modern
      Ringe et al. (2002): 24 languages, mostly ancient
  • 56. Cross-validation
      Predict age of nodes for which we have a constraint: would we reject the truth?
      Γ: space of trees which respect all constraints
      $\Gamma^{-c}$: remove constraint c = 1 ... 30
      $M_0$: $g \in \Gamma$; $M_1$: $g \in \Gamma^{-c}$
      Bayes factor: $B^{(c)} = \frac{P[g \in \Gamma \mid D,\ g \in \Gamma^{-c}]}{P[g \in \Gamma \mid g \in \Gamma^{-c}]}$
      Constraint c conflicts with the model if $2 \log B^{(c)} < -5$.
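One way to estimate this Bayes factor from simulation output is to count how often sampled trees satisfy the removed constraint, once under the posterior and once under the prior. The sketch below is a simplified illustration, not the procedure used for the talk: it assumes the removed constraint is an age interval on a single node, that samples of that node's age are available from runs without constraint c, and that log is the natural logarithm. The function names and numbers are hypothetical.

```python
import math
from typing import Sequence


def two_log_bayes_factor(posterior_ages: Sequence[float],
                         prior_ages: Sequence[float],
                         lower: float, upper: float) -> float:
    """Monte Carlo estimate of 2 log B for an age constraint [lower, upper].

    posterior_ages: node ages sampled from the posterior with the constraint removed.
    prior_ages:     node ages sampled from the prior with the constraint removed.
    """
    def fraction_inside(ages: Sequence[float]) -> float:
        return sum(lower <= age <= upper for age in ages) / len(ages)

    post = fraction_inside(posterior_ages)   # estimates P[g in Gamma | D, g in Gamma^-c]
    prior = fraction_inside(prior_ages)      # estimates P[g in Gamma | g in Gamma^-c]
    if prior == 0.0:
        raise ValueError("no prior sample satisfies the constraint")
    if post == 0.0:
        return -math.inf
    return 2.0 * math.log(post / prior)


def conflicts(two_log_b: float, threshold: float = -5.0) -> bool:
    """Decision rule from the slide: conflict if 2 log B < -5."""
    return two_log_b < threshold


# Made-up samples: the posterior mostly agrees with a constraint of 2400-2600 BP.
posterior_sample = [2450.0, 2500.0, 2550.0, 2700.0]
prior_sample = [1000.0, 2500.0, 4000.0, 6000.0]
value = two_log_bayes_factor(posterior_sample, prior_sample, 2400.0, 2600.0)
print(f"2 log B = {value:.2f}, conflict: {conflicts(value)}")
```

A strongly negative value means that, once the data are taken into account, the model puts far less mass on the known age than the prior did, which is the sense in which the model would "reject the truth".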
  • 57. Cross-validation
      Figure: one entry per constraint (HI, TA, TB, LU, LY, OI, UM, OS, LA, GK, AR, GO, ON, OE, OG, OS, PR, AV, PE, VE, CE, IT, GE, WG, NW, BS, BA, IR, II, TG); axis ticks from −100 to 100 and from 0 to 8000
  • 68. Consensus tree: modern languages (Dyen data)
  • 69. Consensus tree: ancient languages (Ringe data)
      Figure: tree with leaves oldhighgerman, oldenglish, oldnorse, gothic, oscan, umbrian, latin, welsh, oldirish, oldpersian, avestan, vedic, lithuanian, latvian, oldprussian, oldcslavonic, greek, armenian, lycian, luvian, hittite, tocharian_b, tocharian_a, albanian; node labels 66, 85, 58, 78, 62; time axis from 8000 to 0
  • 70. Root age
  • 71. Conclusions
      Strong support for Anatolian farming hypothesis: root around 8000 BP
      Statistics reconstruct known linguistic facts and answer unresolved questions
      TraitLab: it’s free! (Though Matlab is not...)
  • 73. Semitic lexical data
      Data: Kitchen et al. (2009)
      25 languages, 96 meanings, 674 cognacy classes
      Questions of interest: root age (constraint known), topology, outgroup
  • 74. Model validation
      Thin bar: constraint. Thick bar: 95% posterior HPD. (Red bar: 95% prior HPD)
  • 75. Model validation
  • 76. Conclusions
      Root age 95% HPD: 4400-5100 BP
      Akkadian outgroup: 67% (Syrian homeland?)
      Zero catastrophes: 33%
  • 78. Back to Bergsland and Vogt
      Norse family, 8 languages. Selection bias
      Claim that the rate of change is significantly different for these data.
      B&V included words used only in literary Icelandic, which we exclude
      We can handle polymorphism
      Do not include catastrophes
  • 79. Known history
      Figure labels: Gjestal, Sandnes, Riksmal, X, XI, XII, XIII, Icelandic
  • 80. Tests
      Two possible ways to test whether the same model parameters apply to this example and to Indo-European:
      1 Assume parameters are the same as for the general Indo-European tree, and estimate ancestral ages.
      2 Use Norse constraints to estimate parameters, and compare to parameter estimates from general Indo-European tree
  • 81. Results
      If we use parameter values from another analysis, we can try to estimate the age of 13th century Norse.
      True constraint: 660-760 BP. Our HPD: 615-872 BP.
      If we analyse the Norse data on its own, we estimate parameters.
      Value of µ for Norse: $(2.47 \pm 0.4) \cdot 10^{-4}$
      Value of µ for IE: $(1.86 \pm 0.39) \cdot 10^{-4}$ (Dyen), $(2.37 \pm 0.21) \cdot 10^{-4}$ (Ringe)
  • 82. But...
      We can also try to estimate the age of Icelandic (which is 0 BP)
      Find 439-560 BP, far from the true value
      B&V were right: there was significantly less change on the branch leading to Icelandic than average
      However, we are still able to estimate internal node ages.
  • 83-84. Georgian
      Second data set: Georgian and Mingrelian
      Age of ancestor: last millennium BC
      Code data given by B&V, discarding borrowed items
      Use rate estimate from Ringe et al. analysis
      95% HPD: 2065-3170 BP
  • 85. B&V: conclusions
      Third data set (Armenian) not clear enough to be recoded.
      There is variation in the number of changes on an edge
      Nonetheless, we are still able to estimate ancestral language age
      Variation in borrowing rates
      B&V: "we cannot estimate dates, and it follows that we cannot estimate the topology either". We can estimate dates, and even if we couldn’t, we might still be able to estimate the topology
  • 86. Outline. 1. Data; 2. Model; 3. Inference; 4. In-model validation; 5. Model mis-specification; 6. Results; 7. Semitic lexical data; 8. Bergsland and Vogt; 9. Punctuational bursts.
  • 87. Atkinson et al. (2008). Hypothesis: when a language is founded by a migration, the founder effect leads to fast change over a short period of time. There is a catastrophe at each branching event. Indirect estimation: correlation between the number of changes between root and leaf and the number of branching events along the same path. Atkinson: 21% of changes in the history of IE are due to punctuational bursts.
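The indirect estimate rests on a correlation between two per-leaf quantities. The sketch below computes a Pearson correlation and a least-squares slope for that kind of data. The numbers are invented toy values, not data from Atkinson et al., and `pearson_and_slope` is an illustrative helper. A plain correlation like this one also ignores the fact that root-to-leaf paths share branches, which a real analysis would have to account for.

```python
from math import sqrt

def pearson_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope, intercept) for y against x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, my - slope * mx

# Hypothetical toy values, one entry per leaf (language):
branching_events = [2, 3, 3, 5, 6, 8]        # nodes between root and leaf
path_length = [1.0, 1.1, 1.2, 1.4, 1.5, 1.8] # amount of change, root to leaf

r, slope, intercept = pearson_and_slope(branching_events, path_length)
print(f"r = {r:.3f}, extra change per branching event = {slope:.3f}")
```

A positive slope means that paths with more branching events carry more change, which is the pattern the punctuational-burst hypothesis predicts.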
  • 88. Atkinson et al. (2008)
  • 89. Direct analysis. We force a catastrophe on each edge and infer the size of the catastrophes. We find κ very close to 0: less than 1% of change can be attributed to punctuational bursts. The reason for the discrepancy is unclear.
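One way to get a feel for "κ very close to 0" is to translate a catastrophe into an equivalent stretch of ordinary evolution. The sketch below assumes that a catastrophe of size κ kills each cognate independently with probability κ, and that µ is a per-year death rate; neither is spelled out on this slide. Under a constant death rate µ a cognate survives t years with probability exp(−µt), so the catastrophe removes as much as t = −log(1 − κ)/µ years would. The function name is illustrative.

```python
from math import exp, log

def catastrophe_equivalent_years(kappa, mu):
    """Years of ordinary evolution that kill the same fraction of cognates.

    Assumptions: each cognate dies independently with probability `kappa`
    at the catastrophe, and otherwise dies at constant rate `mu` per year.
    """
    if not 0 <= kappa < 1:
        raise ValueError("kappa must lie in [0, 1)")
    return -log(1 - kappa) / mu

mu = 2.37e-4  # Ringe estimate from slide 81; per-year units are assumed
for kappa in (0.01, 0.05, 0.2):
    years = catastrophe_equivalent_years(kappa, mu)
    print(f"kappa = {kappa}: about {years:.0f} years")
```

With these assumptions, a catastrophe with κ = 0.01 corresponds to only a few decades of ordinary change on a branch.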
  • 90. Conclusions. Strong support for an age of PIE around 8000 BP. Statistical methods can help answer questions which traditional methods cannot. Many more questions and models to come. TraitLab: it’s free! (although Matlab is not...)
  • 91. Questions otázky kesses spørgsmåler cwestiwnau pytania preguntes preguntas vrae kláusimai Fragen voprosy quaestiones întrebări questions vragen ερωτήσεις zapitanni spurningar domande spørsmåler questões frågor vprašanja