4. Why Detecting Linguistic Change Matters
Focus less on keywords
Intended meaning of words
Semantic Search
Google’s Hummingbird algorithm
Powerset (bought by Microsoft)
Tracking and detecting linguistic change is key to Semantic Web and search applications
5. Talk in a nutshell
Effectively capture word semantics over time using word embeddings
Detect when and whether a change is significant.
Results on Twitter, Google Book Ngrams etc.
Project Code available from: http://www.vivekkulkarni.net
6. Results – A quick preview
SOURCE              WORD      ESTIMATED CHANGE POINT  PAST USAGE                       CURRENT USAGE
Google Book Ngrams  tape      1970                    red-tape, tape from her mouth    A copy of the tape
Google Book Ngrams  gay       1985                    Happy and gay                    Gay and lesbians
Google Book Ngrams  sex       1965                    Of the fair sex                  To have sex with
Google Book Ngrams  plastic   1950                    Of plastic possibilities         Put in a plastic
Twitter             candy     April 2013              Candy sweets                     Candy Crush (the game)
Twitter             snap      Dec 2012                Snap a picture                   Snapchat
Twitter             mystery   Dec 2012                Mystery books                    Mystery Manor (a game)
7. Detecting Linguistic Change – How we did it
Tracking and detecting linguistic change in a word’s usage is really the problem of
Constructing a time series capturing the word’s usage
Analyzing the time series for statistically significant changes (change point detection)
8. Talk outline
Different methods to model word evolution as a time series
Frequency
Syntactic
Distributional
Method to establish statistical significance of changes.
Results on several datasets of online content like Twitter
10. Using Frequency
Frequency-based approaches to capturing word usage are widely used
Google Trends
Google NGrams [Michel et al., 2011]
Given a word w, we construct the time series as
$T_t(w) = \frac{\#(w \in \mathcal{C}_t)}{|\mathcal{C}_t|}$
where $\mathcal{C}_t$ is the corpus at time t.
[Figure: frequency time series for the word "gay"]
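A minimal sketch of the frequency time series, assuming each corpus snapshot is already tokenized into a list of words. The function name `frequency_series` and the toy corpora are hypothetical, chosen only for illustration.

```python
from collections import Counter


def frequency_series(word, corpora):
    """Relative frequency of `word` in each corpus snapshot.

    `corpora` maps a time label to a list of tokens (an assumed input
    format, not something the slides specify).
    """
    series = {}
    for t in sorted(corpora):
        tokens = corpora[t]
        # T_t(w) = (#occurrences of w in C_t) / |C_t|
        series[t] = Counter(tokens)[word] / len(tokens)
    return series


# Toy, made-up snapshots purely to exercise the function.
toy_corpora = {
    1900: "a gay and happy party".split(),
    2000: "gay rights and gay pride".split(),
}
toy_series = frequency_series("gay", toy_corpora)
```

A frequency series like this only shows how often a word is used, not how it is used.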
12. A Second Approach – Syntactic Method
Part-of-speech changes are indicative of linguistic shift
Google Syntactic Ngram Viewer [Mann et al., 2014; Goldberg et al., 2013]
Each word is tagged with its part of speech (POS)
happy  and  sad
ADJ    CC   ADJ
Construct a time series by tracking changes in POS Distribution
$Q_t = \Pr_{X \sim \mathrm{POSTags}}(X \mid w, \mathcal{C}_t)$
$T_t(w) = \mathrm{JSDivergence}(Q_0, Q_t)$
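The syntactic time series can be sketched as follows, assuming we already have the list of POS tags observed for a word in each time period. The tag lists, the two-tag tagset, and the function names are hypothetical. Base-2 logarithms are an assumption so that the divergence lies in [0, 1].

```python
import math
from collections import Counter


def pos_distribution(tags, tagset):
    """Q_t: empirical distribution over POS tags for one word in one period."""
    counts = Counter(tags)
    total = sum(counts.values())
    return [counts[tag] / total for tag in tagset]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def syntactic_series(tags_by_time, tagset):
    """T_t(w) = JSDivergence(Q_0, Q_t), with Q_0 taken from the earliest period."""
    times = sorted(tags_by_time)
    q0 = pos_distribution(tags_by_time[times[0]], tagset)
    return {t: js_divergence(q0, pos_distribution(tags_by_time[t], tagset))
            for t in times}


# Made-up tag observations for a word drifting from common noun to proper noun.
toy_tags = {
    1980: ["NOUN", "NOUN", "NOUN", "NOUN"],
    1984: ["NOUN", "NOUN", "PROPN", "PROPN"],
    1990: ["NOUN", "PROPN", "PROPN", "PROPN"],
}
toy_pos_series = syntactic_series(toy_tags, ["NOUN", "PROPN"])
```

The series starts at 0 and grows as the tag distribution moves away from the initial one.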
17. Learning word representations (embeddings)
        flower  rose  daisy  bird  canary  robin
flower       0    10     20     1       1      1
rose        10     0     15     0       0      0
daisy       20    15      0     1       2      3
bird         1     0      1     0      20     40
canary       1     0      2    20       0     10
robin        1     0      3    40      10      0
[Rumelhart+, 2003]
• Learning a representation is learning a mapping $\phi: \mathcal{V} \to \mathbb{R}^{d}$
• Captures syntactic and semantic aspects of word usage
• Advantages: very effective on NLP tasks; scalable, online methods.
18. Skipgram Model – Learning Word Embeddings
Can learn word representations by back-propagating errors
Predict the surrounding words of every word
[Figure: Skip-gram objective, annotated with the context word, the current word, and the vector for w_I]
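For reference, the Skip-gram objective in its standard published form (Mikolov et al., 2013), which is presumably what the slide's annotations point at. The symbols are taken from that formulation rather than recovered from the slide: T is the number of training words, c the context window size, $v_{w_I}$ the input vector of the current word, $v'_{w_O}$ the output vector of a context word, and $\mathcal{V}$ the vocabulary.

```latex
\max \; \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}
                       {\sum_{w \in \mathcal{V}} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
```

In practice the full softmax is usually approximated (for example by hierarchical softmax or negative sampling), since the denominator sums over the whole vocabulary.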
19. Using Word Embeddings to Detect Linguistic Change – Key Idea
Train word embeddings for each time point
We use the Skipgram model [Mikolov et al., 2013] to train word embeddings.
[Figure: one embedding space per time point: 1900, 1920, 1950, 1980, 1990, 2000]
• Track displacement of a word over time in this latent space
[Figure: distance plotted over the same time points, 1900 to 2000]
20. But … a roadblock
Cannot directly compare word embeddings from different time points because they lie in different vector spaces.
Need to align word embeddings from different vector spaces!
22. Aligning Word Embeddings – Assumptions
Local structure is preserved, as most words did not change over time.
Local structure between vector spaces is equivalent under a linear transformation.
23. Aligning Word Embeddings – Main Idea
Learn a linear transformation W that attempts to preserve local structure: use piecewise linear regression, fitted using only the k nearest neighbors.
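One way to read this idea is as a least-squares fit restricted to a word's neighborhood. The sketch below is an illustration under stated assumptions, not the project's actual code: embeddings are plain dicts of equal-length vectors, neighbors are found by Euclidean distance in the source space, and W is obtained from the normal equations $(X^{\top} X) W = X^{\top} Y$. All function names and the toy data are hypothetical.

```python
import math


def nearest_neighbors(word, emb, k):
    """The k words closest to `word` in `emb` (Euclidean distance)."""
    others = [w for w in emb if w != word]
    return sorted(others, key=lambda w: math.dist(emb[w], emb[word]))[:k]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    Raises ZeroDivisionError if `a` is singular (neighbors do not span the space).
    """
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]


def local_alignment(word, emb_src, emb_dst, k):
    """Fit W minimizing sum ||x_i W - y_i||^2 over the k neighbors of `word`."""
    neighbors = nearest_neighbors(word, emb_src, k)
    xs = [emb_src[w] for w in neighbors]
    ys = [emb_dst[w] for w in neighbors]
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(d)] for i in range(d)]
    xty = [[sum(x[i] * y[j] for x, y in zip(xs, ys)) for j in range(d)]
           for i in range(d)]
    return solve(xtx, xty)


def transform(vec, w):
    """Map a row vector into the target space: vec @ W."""
    return [sum(vec[i] * w[i][j] for i in range(len(vec)))
            for j in range(len(w[0]))]


# Toy 2-D spaces: the "new" space is the "old" one rotated by 90 degrees.
emb_old = {"cat": (1.0, 0.0), "dog": (0.9, 0.2), "pet": (0.8, 0.1),
           "car": (0.0, 1.0), "road": (0.1, 0.9)}
emb_new = {w: (-y, x) for w, (x, y) in emb_old.items()}
w_cat = local_alignment("cat", emb_old, emb_new, k=2)
aligned_cat = transform(emb_old["cat"], w_cat)
```

Because the fit uses only a word's neighbors, each word gets its own W, which is what makes the regression piecewise rather than global.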
25. Distributional Method – Constructing the Time Series
Align embeddings to a joint space using the piecewise linear regression model
Can now induce a distance measure over the embeddings
Construct the time series for w as
$T_t(w) = \mathrm{CosineDistance}(v_0, v_t)$
where $v_0$ and $v_t$ are the aligned embeddings of w at the initial time point and at time t.
Recall: the cosine distance between two vectors is 0 if they point in the same direction. Higher values indicate greater distance.
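A minimal sketch of this time series, assuming the word's vectors have already been aligned into one joint space and are given in time order. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cos(u, v): 0 for vectors pointing the same way, larger when they diverge."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def displacement_series(aligned_vectors):
    """T_t(w) = CosineDistance(v_0, v_t) for one word's aligned vectors."""
    v0 = aligned_vectors[0]
    return [cosine_distance(v0, vt) for vt in aligned_vectors]


# Made-up 2-D vectors for one word drifting away from its starting direction.
toy_vectors = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
toy_displacement = displacement_series(toy_vectors)
```

The first entry is always 0, since the word is compared with itself at the initial time point.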
28. Change Point Detection in Time Series
Track changes in mean, variance, or perhaps both
Track a test statistic at each time point
E.g., the difference in mean between the left and right ends of the time series
Cumulative Sum (CUSUM)
Use some notion of significance to establish whether the test statistic at time point t indicates a significant shift and hence a change point.
Label the most significant shift as the change point.
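A minimal sketch of the mean-shift variant. The slide leaves the notion of significance open, so the permutation test here is an assumption, as is the simplification of taking the split with the largest absolute mean shift as the candidate change point. Function names, parameters, and the toy series are hypothetical.

```python
import random


def mean_shift(series):
    """K[j] = mean(series[j+1:]) - mean(series[:j+1]) for every split j."""
    shifts = []
    for j in range(len(series) - 1):
        left, right = series[:j + 1], series[j + 1:]
        shifts.append(sum(right) / len(right) - sum(left) / len(left))
    return shifts


def detect_change_point(series, n_permutations=1000, seed=0):
    """Return (index of first point after the change, permutation p-value).

    The p-value is the fraction of shuffled series whose largest absolute
    mean shift is at least as large as the one observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = [abs(k) for k in mean_shift(series)]
    best_split = max(range(len(observed)), key=observed.__getitem__)
    best_shift = observed[best_split]

    exceed = 0
    shuffled = list(series)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # destroys any temporal ordering
        if max(abs(k) for k in mean_shift(shuffled)) >= best_shift:
            exceed += 1
    return best_split + 1, exceed / n_permutations


# Made-up series with an obvious jump between index 4 and index 5.
toy_series = [0.0, 0.1, 0.05, 0.0, 0.1, 1.0, 1.1, 0.9, 1.0, 1.05]
change_index, p_value = detect_change_point(toy_series)
```

Shuffling breaks the time order, so a small p-value suggests the observed shift is unlikely to come from random fluctuation alone.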
31. Popular words detected by word embeddings (Google Book NGrams)
WORD      P-VALUE   ESTIMATED CHANGE POINT  PAST USAGE                       CURRENT USAGE
tape      < 0.0001  1970                    red-tape, tape from her mouth    A copy of the tape
gay       0.0001    1985                    Happy and gay                    Gay and lesbians
sex       0.0002    1965                    Of the fair sex                  To have sex with
checking  0.0002    1970                    Then checking himself            Checking him out
peck      0.0004    1935                    Brewed a peck                    A peck on the cheek
plastic   0.0005    1950                    Of plastic possibilities         Put in a plastic
diet      0.0104    1970                    Diet of bread and butter         To go on a diet
honey     0.02      1930                    Land of milk and honey           Oh honey!
32. Popular words detected by POS (Google Book NGrams)
WORD      ESTIMATED CHANGE POINT  REASON
apple     1984                    NOUN TO PROPER NOUN
windows   1992                    NOUN TO PROPER NOUN
bush      1989                    NOUN TO PROPER NOUN
click     1952                    NOUN TO VERB
handle    1951                    NOUN TO VERB
sink      1972                    VERB TO NOUN
33. Popular words detected by word embeddings - Twitter
WORD      ESTIMATED CHANGE POINT  PAST USAGE                       CURRENT USAGE
candy     April 2013              Candy sweets                     Candy Crush (the game)
rally     March 2013              Political rally                  Rally of soldiers (The Immortalis Game)
snap      December 2012           Snap a picture                   Snapchat
mystery   December 2012           Mystery books                    Mystery Manor (the game)
shades    June 2012               Color shades, shaded glasses     50 Shades of Grey
34. Summary
• Looked at the problem of detecting and tracking linguistic change
• A meta approach to detect and track such linguistic changes by constructing a time series
• Demonstrated how to use word embeddings to detect linguistic change.
• Change point detection and estimation
• Results on Google Ngrams, Twitter, Amazon Movie Reviews.
• Project Code available from: http://www.vivekkulkarni.net