Concepts Through Time: Tracing Concepts in Dutch Newspaper Discourse using Sequential Word Vector Spaces
1. CONCEPTS THROUGH TIME
Tracing Concepts in Dutch Newspaper Discourse
using Sequential Word Vector Spaces
Translantis Project
Digital Humanities Approaches to Reference
Cultures: The Emergence of the United States
in Dutch Public Discourse 1890-1990
Melvin Wevers, Tom Kenter & Pim Huijnen
Utrecht University & University of Amsterdam, the Netherlands
2. PROBLEM = CHALLENGE
• Conceptual history / intellectual history studies the emergence and
transformation of concepts, ideas, and thoughts.
• Problems with existing methods
• Use of predefined list of words (N-gram viewers / Full-text search)
• Top-down approaches (NER, word classification lists) make use of
pre-established models that are often ahistorical
• Topic modeling is useful but quite static
• How to trace the genealogy of a concept?
3. CONCEPTS THROUGH TIME
• We would like to study changes in
the meaning (constitution) of
concepts over time
• Question: What words were used in
the past to talk about particular
concepts?
4. OUR APPROACH
• Multi-dimensional word-vector
space using Google’s
word2vec (neural language
model)
• Data: 500,000 digitized
newspaper issues from the
Dutch National Library
• Semantic and syntactic
information representation by
geometry (Baroni &
Kruszewski, 2014; Wijaya &
Yeniterzi, 2011)
[Figure: timeline with axis ticks 1950, 1960, 1970]
1 model = 10 years
40 models for the period
between 1950 and 1990
5. TRACING CONCEPTS
• One or more words as entry-
points into concept
• Concepts defined by in and out
links > inspired by Deleuze’s
notion of the rhizome
• Model ambiguity: see which
words remain in and which disappear
from the network
• Fast and relatively light
• Forwards and backwards
9. CONCLUSIONS
• Trace concepts over large periods of
time
• Greater sensitivity to semantic
changes based on corpus
• Greater heuristic interactivity with the
researcher
10. FUTURE WORK
• Optimize algorithm based on
different types of conceptual
changes
• Query expansion. Use this
technique to find relevant related
words within specific periods
12. (2009): 71.
Deleuze, Gilles. A Thousand Plateaus: Capitalism and Schizophrenia. University of Minnesota Press, 1987.
Huijnen, Pim, Fons Laan, Maarten de Rijke, and Toine Pieters. “A Digital Humanities Approach to the History of
Science.” In Social Informatics, edited by Akiyo Nadamoto, Adam Jatowt, Adam Wierzbicki, and Jochen L. Leidner,
71–85. Lecture Notes in Computer Science 8359. Springer Berlin Heidelberg, 2014.
Kenter, Tom, Melvin Wevers, and Pim Huijnen. “Ad Hoc Monitoring of Vocabulary Shifts over Time.” To be published.
Kim, Yoon, Yi-I. Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. “Temporal Analysis of Language through
Neural Language Models.” arXiv:1405.3515 [cs], May 14, 2014. http://arxiv.org/abs/1405.3515.
Klingenstein, S., T. Hitchcock, and S. DeDeo. “The Civilizing Process in London’s Old Bailey.” Proceedings of the
National Academy of Sciences 111, no. 26 (July 1, 2014): 9419–24.
Baroni, Marco, Georgiana Dinu, and Germán Kruszewski. “Don’t Count, Predict! A Systematic Comparison of
Context-Counting vs. Context-Predicting Semantic Vectors.” Accessed September 11, 2014.
http://anthology.aclweb.org/P/P14/P14-1023.xhtml.
Wang, Xuerui, and Andrew McCallum. “Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends.”
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 424–33.
ACM, 2006.
Wiedemann, Gregor, Andreas Niekler, and others. “Document Retrieval for Large Scale Content Analysis Using
Contextualized Dictionaries.” In Terminology and Knowledge Engineering 2014, 2014.
http://hal.archives-ouvertes.fr/hal-01005879/.
Wijaya, Derry Tanti, and Reyyan Yeniterzi. “Understanding Semantic Change of Words over Centuries.” In Proceedings
of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web, 35–40. ACM,
Editor's Notes
Today, I will highlight some of the points made in our paper Concepts Through Time: Tracing Concepts in Dutch Newspaper Discourse using Sequential Word Vector Spaces.
I am part of the Translantis project with the insanely long subtitle: Digital humanities approaches to Reference Cultures: The Emergence of the United States in Dutch Public Discourse 1890-1990. My project looks at the ways in which the United States has appeared as a model, or reference culture, in debates concerning consumerism and modernization.
The central theme of today’s talk is the study of the emergence and transformation of concepts, ideas, and thoughts. SLIDE
I am a cultural historian who tries to see how computational tools can aid my work. Historians have increasingly used digital tools for the purposes of conceptual history.
SLIDE
However, in these studies very often researchers employ pre-defined and ahistorical definitions of concepts.
SLIDE
Full-text search and n-gram viewers, for example, require a workable definition of the concept, or a range of words that cover the subject, which can subsequently be analyzed within certain contexts and periods. The necessity of pre-defining terms is a serious drawback of working with these tracking tools. Research done in this way runs the risk of ahistoricity.
SLIDE
The same goes for top-down approaches, in which a specific model of language allows for the recognition of certain semantic information, such as specific entities via Named Entity Recognition or via word classification lists. We would like the corpus itself to generate the list of words that form a concept.
SLIDE
Topic modeling approaches partly circumvent this limitation, because they can infer latent topics from a corpus. They are rather static, though, and do not allow for ad hoc settings: you give a number of texts as input, you set the parameters, and you are presented with your output.
SLIDE
We would like to keep our hands on the wheel, to steer the research process, to follow the genealogy of a concept. And what we would really like is to trace concepts before their key term was even introduced. For instance, the notion of efficiency was thought up in the interwar years; however, even in the years before, people talked about similar notions without using the word efficiency. Well, what words did they use?!
SLIDE
So basically, what we would like is a method to study changes in the meaning / constitution of concepts over time.
SLIDE
Our main research question then is to see what words were used in the past to talk about specific concepts.
This would enable us to show the continuities and discontinuities in discourse, but also to remain sensitive to ambiguities within the words that make up their concepts.
In order to do this, we have turned to multi-dimensional word vector spaces. SLIDE These are created using word2vec, a neural language model that does not start from a top-down model of language; rather, a semantic space is inferred from the input data. SLIDE As our dataset we have used the digitized newspaper collection of the Dutch National Library, which contains over 500,000 newspaper issues published between 1890 and 1990.
SLIDE A multi-dimensional word-vector space contains semantic and linguistic regularities that can be used for the analysis of discourse. A positional shift within the vector space has been established as an indicator of changes on a semantic and syntactic level.
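To make the idea of "meaning as geometry" concrete, here is a minimal sketch that compares a word's nearest neighbour in two periods using cosine similarity. The three-dimensional vectors, the word list, and the two period spaces are all invented for illustration; real word2vec vectors have hundreds of dimensions and are learned from the corpus. Neighbours are computed within each space separately, because vectors from separately trained models are not directly comparable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, space, k=2):
    """Return the k words closest to `word` within one vector space."""
    scores = [(other, cosine(space[word], vec))
              for other, vec in space.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:k]]

# Hypothetical toy spaces, one per period (all values are made up).
space_early = {
    "propaganda": [0.9, 0.1, 0.0],
    "reclame":    [0.8, 0.2, 0.1],   # Dutch: advertising
    "politiek":   [0.1, 0.9, 0.2],   # Dutch: politics
    "krant":      [0.2, 0.2, 0.9],   # Dutch: newspaper
}
space_late = {
    "propaganda": [0.2, 0.9, 0.1],
    "reclame":    [0.9, 0.1, 0.1],
    "politiek":   [0.1, 0.9, 0.2],
    "krant":      [0.2, 0.2, 0.9],
}

early_neighbours = nearest("propaganda", space_early, k=1)
late_neighbours = nearest("propaganda", space_late, k=1)
print(early_neighbours, late_neighbours)
```

A change in a word's neighbourhood from one period's space to the next is the kind of positional shift that is read as a sign of semantic change.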
SLIDE
Our method introduces a sequential modeling of these vector spaces, for which we make multiple models over time. We create a model for one decade, for example 1950-1960. Then we move the window one year and create another one. For the period between 1950 and 1990 we have thus created 40 models.
So then we have all these models. How do we trace concepts within them?
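The sliding-window idea can be sketched as follows. The function name and its parameters are hypothetical, and the boundary convention (a window runs from `start` to `start + span` and may not extend past the last year) is an assumption, so the number of windows this sketch yields need not match the 40 models mentioned in the talk.

```python
def sliding_windows(first_year, last_year, span=10, step=1):
    """Yield (start, end) pairs with end = start + span and end <= last_year."""
    start = first_year
    while start + span <= last_year:
        yield (start, start + span)
        start += step

# One vector space would be trained on the newspaper issues in each window.
windows = list(sliding_windows(1950, 1990))
for start, end in windows[:3]:
    print(f"train one model on issues from {start} to {end}")
```

Because consecutive windows share nine of their ten years, neighbouring models change gradually, which is what makes it possible to carry a cluster of words from one model into the next.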
We trace groups of terms, rather than individual words, by keeping track of semantic relations between terms per period. SLIDE
We use a single seed set of terms, merely as an entry point into the cluster, and then find semantically related words, that is, words that are close to the seed terms within the vector space. This presents us with the first layer of words.
Then we look for the words related to these words. This gives a semantic graph, which we prune by weighting it using in-links and out-links. SLIDE
This pruned model is then located within the subsequent model in time, and the same process is executed. A key aspect of this procedure is that the original seed words might disappear from the cluster of words over time. Remember, this relates to the example of efficiency I just gave.
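The tracing steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation used in the project: each period's model is replaced by a hand-written dictionary of related words, the pruning rule (keep words whose in-links plus out-links reach a threshold) is a guess at what weighting by in-links and out-links might look like, and all names and words are hypothetical.

```python
def expand(seeds, neighbours, layers=2):
    """Collect directed edges (word -> related word), starting from the seeds."""
    edges = set()
    frontier = set(seeds)
    seen = set(seeds)
    for _ in range(layers):
        next_frontier = set()
        for word in sorted(frontier):
            for related in neighbours.get(word, []):
                edges.add((word, related))
                if related not in seen:
                    seen.add(related)
                    next_frontier.add(related)
        frontier = next_frontier
    return edges

def prune(edges, min_links=2):
    """Keep words whose in-links plus out-links reach min_links (assumed rule)."""
    links = {}
    for source, target in edges:
        links[source] = links.get(source, 0) + 1   # out-link
        links[target] = links.get(target, 0) + 1   # in-link
    return {word for word, count in links.items() if count >= min_links}

def trace(seeds, models):
    """Walk through the period models in order, reseeding with each pruned cluster."""
    clusters = []
    current = set(seeds)
    for period, neighbours in models:
        cluster = prune(expand(current, neighbours))
        clusters.append((period, cluster))
        if cluster:
            current = cluster
    return clusters

# Hypothetical related-word lists standing in for two consecutive models.
models = [
    ("period_1", {"doelmatigheid": ["rationalisatie", "besparing"],
                  "rationalisatie": ["doelmatigheid", "besparing"],
                  "besparing": ["rationalisatie"]}),
    ("period_2", {"doelmatigheid": ["rationalisatie"],
                  "rationalisatie": ["besparing", "efficiency"],
                  "besparing": ["efficiency", "rationalisatie"],
                  "efficiency": ["besparing", "rationalisatie"]}),
]

clusters = trace(["doelmatigheid"], models)
for period, cluster in clusters:
    print(period, sorted(cluster))
```

In this toy run the seed word is weakly linked in the second period and is pruned, while a new word enters the cluster. Running `trace` over the models in reverse order would follow a concept backwards in time instead.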
SLIDE Through this approach we try to model ambiguity through time by monitoring the words in the network that change position or leave the network altogether.
SLIDE This technique is fast and relatively light. You can query the models using a number of different operators. SLIDE
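Monitoring which words stay in or leave the network can be approximated by comparing the clusters of two periods. The sketch below is one simple way to do this; the two word sets are typed in by hand from the Aliens example in this talk and are not model output, and the Jaccard overlap is an added convenience measure that the talk does not mention.

```python
def cluster_changes(previous, current):
    """Split two word clusters into words that stayed, left, or entered."""
    previous, current = set(previous), set(current)
    return {
        "stayed": sorted(previous & current),
        "left": sorted(previous - current),
        "entered": sorted(current - previous),
    }

def jaccard(previous, current):
    """Overlap between two clusters: 1.0 means identical, 0.0 means disjoint."""
    previous, current = set(previous), set(current)
    union = previous | current
    return len(previous & current) / len(union) if union else 1.0

# Hand-written illustration, not actual model output.
earlier = {"aliens", "tourists", "europeans"}
later = {"aliens", "guest workers", "immigrants"}

changes = cluster_changes(earlier, later)
overlap = jaccard(earlier, later)
print(changes, overlap)
```

A low overlap between consecutive periods would flag a moment where the vocabulary around a concept changes quickly and deserves a closer reading of the sources.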
You can peruse the models forwards and backwards.
This is the output for now. I have two examples.
Propaganda. Before WW2, propaganda and advertising referred to the same thing; this shifted after WW2.
We have traced the concept, and we have seen that after WW2 its meaning quite suddenly seems to shift into the realm of advertising.
In addition to this tracing, we have looked up the related words through time. This shows that propaganda received a new meaning, namely that of political propaganda, more specifically within the Cold War context.
Another example is the word Aliens. This shows how the debate on foreigners has changed over the years. The association with tourists and Europeans changes into one with illegals, guest workers, immigrants, and people from Surinam and the Dutch Antilles.
Concepts Through Time enables historians to trace concepts over large periods without having to manually select appropriate terms for the entire time span and without being dependent on a fixed set of topics. This allows for a greater sensitivity to semantic changes and a more interactive, heuristic approach to concepts within their discursive context.
Different Conceptualization of Types of Conceptual Changes
Create a user interface to visualize concepts in vector space, and also allow researchers to play with the settings when moving through time.
Implement this as a function of query expansion. This technique can find relevant related words within historical periods. So if you were to look for Efficiency in 1910, it would give you the words used to talk about this concept.
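As a sketch of what such query expansion might look like, the snippet below looks up the traced cluster for the period covering a given year and turns it into a full-text search query. The data structure, the function names, the period boundaries, and the Dutch words are hypothetical placeholders, not results from the newspaper corpus.

```python
def expand_query(year, traced_clusters):
    """Return the traced words for the first period that covers `year`."""
    for start, end, words in traced_clusters:
        if start <= year <= end:
            return sorted(words)
    return []

def to_search_query(words):
    """Join the expanded words into a simple OR query for full-text search."""
    return " OR ".join(f'"{word}"' for word in words)

# Hypothetical clusters obtained by tracing the concept backwards in time.
traced_clusters = [
    (1905, 1915, {"doelmatigheid", "zuinigheid"}),
    (1925, 1935, {"efficiency", "rationalisatie"}),
]

words_1910 = expand_query(1910, traced_clusters)
query_1910 = to_search_query(words_1910)
print(query_1910)
```

The researcher would still judge whether the suggested words belong to the concept; the expansion only proposes candidates for a given period.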