SlideShare uma empresa Scribd logo
1 de 32
Baixar para ler offline
Diachronic Analysis of Language
exploiting Google Ngram
Dr Annalina Caputo
ADAPT Centre
Diachronic Linguistics
The scientific study of language change over time
also called Historical Linguistics
Synchronic
It describes the language rules
at a specific point in time
without taking its history into
account.
Synchronic vs.
Diachronic
Diachronic
It considers the evolution of a
language over time.
Diachronic Linguistics
Why?
▪ Observe changes in particular languages
▪ Reconstruct the pre-history of languages
▪ Develop general theories about how and why language
changes
▪ Describe the history of speech communities
▪ Etymology
https://en.wikipedia.org/wiki/Historical_linguistics
Google
Book
Ngram
5,195,769 books
4%all published books
500billion words
1500-2012 time span
CULTUROMICSA form of computational lexicology that studies human behavior and
cultural trends through the quantitative analysis of digitized texts.
J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
Culturomics
Grammar Evolution
J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
Culturomics
Popularity
J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
Culturomics
feminism (Italian)
«sufraggette»
J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
Culturomics
Censorship
Marc Chagall (German)
Nazi censorship
J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
Culturomics
Events
Russian Flu
Spanish Flu
Asian Flu
J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
Limit
call (chiamare) vs. phone (telefonare)
Distributional
semantic models
You shall know a word by
the company it keeps!
Meaning of a word is
determined by its usage.
Distributional structure
Mathematical structures of
language
John Rupert Firth Ludwig Wittgenstein Zellig Harris
https://goo.gl/nY4els https://goo.gl/mD1oKn https://goo.gl/b3sMtC
Distributional
Semantic Models
● Analysis of word-usage
statistics over huge
corpora
● Geometric space of
concepts
● Similar words are
represented close in the
space
“ A WordSpace is a snapshot of a
specific corpus it does not take
into account temporal information
Random Indexing
Building the WordSpace
▪ Assign a random vector to
each term in the corpus
vocabulary
▪ Semantic vector for a term
is the sum of the context
vectors co-occurring with
the term
Random Vector
…-1 0 1 0 0 0 0 0 0 0 0 0 -1 …
▪ Sparse
▪ high dimensional
▪ ternary {-1, 0, +1}
▪ small number of randomly
distributed non-zero
elements
https://github.com/semanticvectors/semanticvectors
Temporal Random Indexing
TRI
▪ Corpus with temporal information: split the corpus in
several time periods
▪ Build a WordSpace for each time period
▪ Words in different WordSpaces are comparable!
P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time. IJCoL vol. 1
https://github.com/pippokill/tri
Temporal Random Indexing
TRI
RI
Space1
RI
Space2
RI
Space3
RI
Space4
Corpus1900 Corpus1920 Corpus1930 Corpus1940
Similarity between words can
change over time
WordSpace 1910 WordSpace 1920 WordSpace 1930
chiamare
(call)
chiamare
(call)
telefonare
(phone)
chiamare
(call)
telefonare
(phone)
Google
Ngram
TRI
Methodology
TRI Time
Series
Change Point
Detection
Run TRI on Google
Ngram: a WordSpace
for each time period
is built (10 years)
Provide a time series
for each word
Detect significant
changes in the time
series
Several time series Γ at the time interval k
log frequency
point-wise
cumulative
Word frequency in each time
period k
Cosine similarity between word
vectors across two time periods
Considers a cumulative vector
of the previous k-1 time periods
Time Series
Change point detection
Mean shift model
▪Mean shift of Γ pivoted at time period j
▪Search statistical significant mean shift
▪Bootstrapping approach under the null hypothesis that
there is no change in the meaning
V. Kulkarni, et al. Statistically significant detection of linguistic change. WWW 2015.
Evaluation
▪Build TRI by relying on the Italian Google Ngram corpus
▪Build a standard benchmarking for meaning shift detection for the Italian
language
▫ “Dizionario Sabatino Coletti”
▫ “Dizionario Etimologico Zanichelli”
▪Evaluate the performance of TRI
▫ compare the system output with manual annotations provided by
experts
P. Basile, A. Caputo, G. Semeraro. Diachronic Analysis of the Italian Language exploiting Google Ngram. CLIC-it 2016
Build a gold standard for the evaluation
change point
http://dizionari.corriere.it/dizionario_italiano/
Evaluation
Results
Method Accuracy
TRIpoint 0.3086
TRIcum 0.2963
TRR1point 0.2716
log freq 0.2346
TRR2point 0.1728
TRR1cum 0.1605
TRR2cum 0.1235
Accuracy: the year predicted by the system must be equal
or greater than one of the years reported in the gold
standard
TRR1 and TRR2 are variants of TRI
based on Reflective Random Indexing
On going work…
English Google Ngram
▪ Build a gold standard for the English language
http://www.etymonline.com/
On going work…
social media
▪ Build TRI on Twitter (TWITA collection)
▪ About 500M tweets (feb. 2012 – sep. 2015)
▪ Time interval = 1 month
http://valeriobasile.github.io/twita/about.html
On going work…
social media
On going work…
social media Local Election
Roma Marino (Roma
Mayor) crisis
Workshop on
Temporal Dynamics in Digital Libraries @ TPDL2017
https://tddl2017.github.io/
Submission deadline: June 2, 2017
Thanks!
You can find me at @headlighty &
annalina.caputo@adaptcentre.ie &
annalina.github.io
Credits
▪ Thanks to Pierpaolo Basile for the material of this
presentation
▪ The Google Ngram graphs are taken from J.-B. Michel
et al., Quantitative Analysis of Culture Using Millions of
Digitized Books. Science, 2011
▪ Presentation template by SlidesCarnival
▪ Photographs by Unsplash
▪ The source for every picture has been indicated below
each of them. All copyrights belong to their respective
owners.

Mais conteúdo relacionado

Semelhante a Diachronic Analysis of Language exploiting Google Ngram

Sinmin Literature Review Presentation
Sinmin Literature Review PresentationSinmin Literature Review Presentation
Sinmin Literature Review PresentationChamila Wijayarathna
 
Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Janifer Gatenby
 
Experiences in the Development of Geographical Ontologies and Linked Data
Experiences in the Development of Geographical Ontologies and Linked DataExperiences in the Development of Geographical Ontologies and Linked Data
Experiences in the Development of Geographical Ontologies and Linked DataOscar Corcho
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Riyadh UseR Group - 1st Meeting (Dec 2016(
Riyadh UseR Group - 1st Meeting (Dec 2016(Riyadh UseR Group - 1st Meeting (Dec 2016(
Riyadh UseR Group - 1st Meeting (Dec 2016(Ali Arsalan Kazmi
 
Terminology: tips and tricks to boost your terminology work
Terminology: tips and tricks to boost your terminology workTerminology: tips and tricks to boost your terminology work
Terminology: tips and tricks to boost your terminology workLaura Ramirez Polo
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Ron Martinez
 
TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31Dag Endresen
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguisticsIrum Malik
 
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...Digital Classicist Seminar Berlin
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportAlexandre Rademaker
 
ArCo: the Knowledge Graph of Italian Cultural Heritage
ArCo: the Knowledge Graph of Italian Cultural HeritageArCo: the Knowledge Graph of Italian Cultural Heritage
ArCo: the Knowledge Graph of Italian Cultural HeritageValentina Presutti
 
LEXICOGRAPHY
LEXICOGRAPHY LEXICOGRAPHY
LEXICOGRAPHY mimisy
 
Lexical Resources for Portuguese
Lexical Resources  for PortugueseLexical Resources  for Portuguese
Lexical Resources for PortugueseValeria de Paiva
 
It services & research methods
It services & research methodsIt services & research methods
It services & research methodsAkanshShandilya
 

Semelhante a Diachronic Analysis of Language exploiting Google Ngram (20)

Sinmin Literature Review Presentation
Sinmin Literature Review PresentationSinmin Literature Review Presentation
Sinmin Literature Review Presentation
 
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and SummarizationeSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
 
Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Multilingualism ifla 2014 08
Multilingualism ifla 2014 08
 
Experiences in the Development of Geographical Ontologies and Linked Data
Experiences in the Development of Geographical Ontologies and Linked DataExperiences in the Development of Geographical Ontologies and Linked Data
Experiences in the Development of Geographical Ontologies and Linked Data
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Riyadh UseR Group - 1st Meeting (Dec 2016(
Riyadh UseR Group - 1st Meeting (Dec 2016(Riyadh UseR Group - 1st Meeting (Dec 2016(
Riyadh UseR Group - 1st Meeting (Dec 2016(
 
Terminology: tips and tricks to boost your terminology work
Terminology: tips and tricks to boost your terminology workTerminology: tips and tricks to boost your terminology work
Terminology: tips and tricks to boost your terminology work
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019
 
TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project Report
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 
ArCo: the Knowledge Graph of Italian Cultural Heritage
ArCo: the Knowledge Graph of Italian Cultural HeritageArCo: the Knowledge Graph of Italian Cultural Heritage
ArCo: the Knowledge Graph of Italian Cultural Heritage
 
OWN-PT: Taking Stock
OWN-PT: Taking Stock OWN-PT: Taking Stock
OWN-PT: Taking Stock
 
LEXICOGRAPHY
LEXICOGRAPHY LEXICOGRAPHY
LEXICOGRAPHY
 
Lexical Resources for Portuguese
Lexical Resources  for PortugueseLexical Resources  for Portuguese
Lexical Resources for Portuguese
 
It services & research methods
It services & research methodsIt services & research methods
It services & research methods
 

Último

All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 

Último (20)

All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 

Diachronic Analysis of Language exploiting Google Ngram

  • 1. Diachronic Analysis of Language exploiting Google Ngram Dr Annalina Caputo ADAPT Centre
  • 2. Diachronic Linguistics The scientific study of language change over time also called Historical Linguistics
  • 3. Synchronic It describes the language rules at a specific point in time without taking its history into account. Synchronic vs. Diachronic Diachronic It considers the evolution of a language over time.
  • 4. Diachronic Linguistics Why? ▪ Observe changes in particular languages ▪ Reconstruct the pre-history of languages ▪ Develop general theories about how and why language changes ▪ Describe the history of speech communities ▪ Etymology https://en.wikipedia.org/wiki/Historical_linguistics
  • 5. Google Book Ngram 5,195,769 books 4%all published books 500billion words 1500-2012 time span
  • 6. CULTUROMICSA form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts. J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
  • 7. Culturomics Grammar Evolution J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
  • 8. Culturomics Popularity J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
  • 9. Culturomics feminism (Italian) «sufraggette» J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
  • 10. Culturomics Censorship Marc Chagall (German) Nazi censorship J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
  • 11. Culturomics Events Russian Flu Spanish Flu Asian Flu J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 2011
  • 12. Limit call (chiamare) vs. phone (telefonare)
  • 13. Distributional semantic models You shall know a word by the company it keeps! Meaning of a word is determined by its usage. Distributional structure Mathematical structures of language John Rupert Firth Ludwig Wittgenstein Zellig Harris https://goo.gl/nY4els https://goo.gl/mD1oKn https://goo.gl/b3sMtC
  • 14. Distributional Semantic Models ● Analysis of word-usage statistics over huge corpora ● Geometric space of concepts ● Similar words are represented close in the space
  • 15. “ A WordSpace is a snapshot of a specific corpus it does not take into account temporal information
  • 16. Random Indexing Building the WordSpace ▪ Assign a random vector to each term in the corpus vocabulary ▪ Semantic vector for a term is the sum of the context vectors co-occurring with the term Random Vector …-1 0 1 0 0 0 0 0 0 0 0 0 -1 … ▪ Sparse ▪ high dimensional ▪ ternary {-1, 0, +1} ▪ small number of randomly distributed non-zero elements https://github.com/semanticvectors/semanticvectors
  • 17. Temporal Random Indexing TRI ▪ Corpus with temporal information: split the corpus in several time periods ▪ Build a WordSpace for each time period ▪ Words in different WordSpaces are comparable! P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time. IJCoL vol. 1 https://github.com/pippokill/tri
  • 19. Similarity between words can change over time WordSpace 1910 WordSpace 1920 WordSpace 1930 chiamare (call) chiamare (call) telefonare (phone) chiamare (call) telefonare (phone)
  • 21. Methodology TRI Time Series Change Point Detection Run TRI on Google Ngram: a WordSpace for each time period is built (10 years) Provide a time series for each word Detect significant changes in the time series
  • 22. Several time series Γ at the time interval k log frequency point-wise cumulative Word frequency in each time period k Cosine similarity between word vectors across two time periods Considers a cumulative vector of the previous k-1 time periods Time Series
  • 23. Change point detection Mean shift model ▪Mean shift of Γ pivoted at time period j ▪Search statistical significant mean shift ▪Bootstrapping approach under the null hypothesis that there is no change in the meaning V. Kulkarni, et al. Statistically significant detection of linguistic change. WWW 2015.
  • 24. Evaluation ▪Build TRI by relying on the Italian Google Ngram corpus ▪Build a standard benchmarking for meaning shift detection for the Italian language ▫ “Dizionario Sabatino Coletti” ▫ “Dizionario Etimologico Zanichelli” ▪Evaluate the performance of TRI ▫ compare the system output with manual annotations provided by experts P. Basile, A. Caputo, G. Semeraro. Diachronic Analysis of the Italian Language exploiting Google Ngram. CLIC-it 2016
  • 25. Build a gold standard for the evaluation change point http://dizionari.corriere.it/dizionario_italiano/
  • 26. Evaluation Results Method Accuracy TRIpoint 0.3086 TRIcum 0.2963 TRR1point 0.2716 log freq 0.2346 TRR2point 0.1728 TRR1cum 0.1605 TRR2cum 0.1235 Accuracy: the year predicted by the system must be equal or greater than one of the years reported in the gold standard TRR1 and TRR2 are variants of TRI based on Reflective Random Indexing
  • 27. On going work… English Google Ngram ▪ Build a gold standard for the English language http://www.etymonline.com/
  • 28. On going work… social media ▪ Build TRI on Twitter (TWITA collection) ▪ About 500M tweets (feb. 2012 – sep. 2015) ▪ Time interval = 1 month http://valeriobasile.github.io/twita/about.html
  • 30. On going work… social media Local Election Roma Marino (Roma Mayor) crisis
  • 31. Workshop on Temporal Dynamics in Digital Libraries @ TPDL2017 https://tddl2017.github.io/ Submission deadline: June 2, 2017 Thanks! You can find me at @headlighty & annalina.caputo@adaptcentre.ie & annalina.github.io
  • 32. Credits ▪ Thanks to Pierpaolo Basile for the material of this presentation ▪ The Google Ngram graphs are taken from J.-B. Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2011 ▪ Presentation template by SlidesCarnival ▪ Photographs by Unsplash ▪ The source for every picture has been indicated below each of them. All copyrights belong to their respective owners.