August 25-27, 2015 Crazy Futures III 1
Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse
The horizon isn't found in a
dictionary : Identifying
emerging word senses and
identities in raw text
A winding road
● Dictionaries
● A powerful lens to look back, but not to the future
● Lexicographers
● While making dictionaries, engage in a kind of horizon
scanning
– What new words or senses are emerging?
● Natural Language Processing
● Can we automate the task of the lexicographer?
● Can we identify emerging words, senses, and identities?
Dictionaries
● Wonderful for looking back!
● Is that really a word?
● How do you spell it?
● What does it mean?
● When was a word first used?
● When did that sense of a word emerge?
Dictionaries
● Not particularly predictive
● But, the people who create dictionaries
are horizon scanners, always looking
for new words and senses
● Lexicographers
● Or … computer programs? (NLP)
Dictionaries
● Go back to at least 2300 BCE
● Early on were bilingual word lists
● Useful for trade, warfare
● Idea of monolingual dictionary
developed later
● In English, 1604
Descriptive or Prescriptive
● Descriptive
● Document how the language is used
● Use determines meaning
● English – OED
● Prescriptive
● Define how the language should be used
● Experts decide
● English – early Webster
● French Academy – create words to replace Anglicisms
English Lexicography
● 1604 - A Table Alphabeticall, by Robert Cawdrey, approx
2,500 entries
● 1755 - A Dictionary of the English Language, by Samuel
Johnson, approx 42,000 entries
● 1828 - An American Dictionary of the English Language, by
Noah Webster, approx 70,000 entries
● 1928 - Oxford English Dictionary, 10 volumes, approx
400,000 entries
● 1989 - Oxford English Dictionary (2nd ed), 20 volumes,
approx 600,000 entries
Table Alphabeticall (1604)
A Table Alphabeticall, conteyning and teaching the true writing, and
vnderstanding of hard vsuall English wordes, borrowed from the
Hebrew, Greeke, Latine, or French. & c.
With the interpretation thereof by plaine English words, gathered for
the benefit & helpe of Ladies, Gentlewomen, or any other vnskilfull
persons.
Whereby they may the more easilie and better vnderstand many hard
English wordes, which they shall heare or read in Scriptures, Sermons,
or elswhere, and also be made able to vse the same aptly themselues.
Legere, et non intelligere, neglegere est.
As good not read, as not to vnderstand.
Table Alphabeticall (1604)
● A Table Alphabeticall of Hard Usual
English Words
● Developed by Robert Cawdrey
● 120 pages, 2,543 entries
● Short definitions, synonyms
● Doesn't include multiple senses for a word
● http://www.library.utoronto.ca/utel/ret/cawdrey/cawdrey0.html
combustible, easily burnt
combustion, burning or consuming with fire.
comedie, (k) stage play,
comicall, handled merily like a comedie
commemoration, rehearsing or remebring
[fr] commencement, a beginning or entrance
comet, (g) a blasing starre
comentarie, exposition of any thing
commerce, fellowship, entercourse of merchandise.
commination, threatning, or menacing,
commiseration, pittie
commodious, profitable, pleasant, fit,
commotion, rebellion, trouble, or disquietnesse.
communicate, make partaker, or giue part vnto
[fr] communaltie, common people, or comon-wealth
communion, (* synonyms *) fellow-
communitie, ship. (* synonyms end *)
compact, ioyned together, or an agreement.
compassion, pitty, fellow-feeling
compell, to force, or constraine
compendious, short, profitable
Table Alphabeticall (1604)
● The First English Dictionary
● Not clear why words were included or not
● Hard?
● Introspection
● Quickly superseded
A Dictionary of the English
Language (1755)
● Written by Samuel Johnson (Dr. Johnson)
● Worked alone (with six copyists)
● Nearly 43,000 entries
● 2,300 pages
● 100,000 illustrative quotes from literature
● http://johnsonsdictionaryonline.com/
● Sometimes biased, long-winded, inconsistent
● A delight really...
Method
● Decided not to build upon previous works
● Carried out a perusal of English literature
● Studied 2,000 books from 500 authors
going back 200 years
● Entries based on the past
● Selected quotations to show language in
action
The Inimitable Dr. Johnson
● Lexicographer: A writer of dictionaries; a harmless
drudge that busies himself in tracing the original,
and detailing the signification of words.
● Oats: A grain, which in England is generally given to
horses, but in Scotland supports the people.
● To worm: To deprive a dog of something, nobody
knows what, under his tongue, which is said to
prevent him, nobody knows why, from running mad.
oats
● Oats. n.s. [aten, Saxon.] A grain, which in England is generally
given to horses, but in Scotland supports the people.
● It is of the grass leaved tribe; the flowers have no petals, and are
disposed in a loose panicle: the grain is eatable. The meal makes
tolerable good bread. Miller.
● The oats have eaten the horses. Shakespeare.
● It is bare mechanism, no otherwise produced than the turning of a wild
oatbeard, by the insinuation of the particles of moisture. Locke.
● For your lean cattle, fodder them with barley straw first, and the oat
straw last. Mortimer's Husbandry.
● His horse's allowance of oats and beans, was greater than the journey
required. Swift.
A Dictionary of the English
Language (1755)
● A monumental work
● Set precedents for dictionaries that live on
today
● Systematic study of published literature for
words and senses
● Illustrate senses with quotations
● 1700 of Dr. Johnson's definitions remain in OED
today
Noah Webster
● A tireless advocate for American English
● “Blue Backed Speller” (1783, 1804, 1806)
● Proposed Americanized spellings
● Widely used in schools in 1800s
● Dissertations on the English Language
(1789)
● An American standard needed to be developed
Noah Webster
● A Compendious Dictionary of the
English Language (1806)
● 28,000 entries
● Intended to improve, Americanize
Dr. Johnson's dictionary
Noah Webster
● An American Dictionary of the
English Language (1828)
● 70,000 entries
● 1864 Unabridged edition had
114,000 entries
Improving on Dr. Johnson?
OAT, n.
A plant of the genus Avena, and more usually, the
seed of the plant. The word is commonly used in
the plural, oats. This plant flourishes best in cold
latitudes, and degenerates in the warm. The meal
of this grain, oatmeal, forms a considerable and
very valuable article of food for man in Scotland,
and every where oats are excellent food for
horses and cattle.
An American Dictionary
It is not only important, but, in a degree necessary, that the people of
this country, should have an American Dictionary of the English
Language; for, although the body of the language is the same as in
England, and it is desirable to perpetuate that sameness, yet some
differences must exist. Language is the expression of ideas; and if the
people of one country cannot preserve an identity of ideas, they
cannot retain an identity of language. Now an identity of ideas
depends materially upon a sameness of things or objects with which
the people of the two countries are conversant. But in no two portions
of the earth, remote from each other, can such identity be found. Even
physical objects must be different. But the principal differences
between the people of this country and of all others, arise from
different forms of government, different laws, institutions and customs.
Noah Webster
● An American Dictionary of the
English Language (1828)
● 70,000 words
● Not a great success at the time
Oxford English Dictionary
● OED began in 1857 as a revision of Dr.
Johnson's dictionary
● Improve coverage, quality of entries,
consistency, remove biases
● Envisioned as a 10 year project
● Was also a response to perception that other
European languages were more advanced
with their dictionaries
Oxford English Dictionary
● Work began in 1857, first
publication in 1884, first edition
in 1928 (71 years later)
● James Murray, Chief Editor of OED,
1879 – 1915
Crowd-sourced!
● Invite English readers to contribute
words
● Read, and whenever they see a word
of interest used in an illustrative
context, write it on a slip of paper and
send it to OUP
● Word, quotation, citation, reference
First edition 1928
● 10 volumes, 15,490 pages
● 414,800 entries
● 2,000 contributors
● 5 million submitted quotations
● 1.86 million used
Second Edition 1989
● 20 volumes, 21,730 pages
● Weighs 137 pounds
● 658,000 words
● 2.43 million quotations
But...good news
● Duck face is entering dictionaries
● Oxford Dictionaries online
● Urban dictionary
● OED sets high bar for inclusion
● What words are being used today
that will find their way into OED?
And now...NLP?
● OED tells us when a word or sense was
first used
● What if we could automatically recognize
new words or senses going forward?
● What if we could recognize people or
organizations (identities) that were to be
significant?
New words, emerging
senses, new identities
● Scan sources of interest and look for
words or terms that have not occurred
previously, and that reach some level
of regularity and frequency
● Once you have a few candidates, you
can start to investigate further
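The scan described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the thresholds `min_count` and `min_docs`, and the toy sentences are all invented here, and "regularity" is approximated by the number of recent documents a word appears in.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a deliberately simple tokenizer (assumption).
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def new_word_candidates(reference_docs, recent_docs, min_count=3, min_docs=2):
    """Words unseen in the reference text that recur in the recent text."""
    known = {w for doc in reference_docs for w in tokenize(doc)}
    counts = Counter()  # total occurrences of each unseen word (frequency)
    spread = Counter()  # number of recent documents containing it (regularity)
    for doc in recent_docs:
        unseen = [w for w in tokenize(doc) if w not in known]
        counts.update(unseen)
        spread.update(set(unseen))
    return sorted(w for w in counts
                  if counts[w] >= min_count and spread[w] >= min_docs)

# Toy data, invented for illustration only.
reference = ["She made a face at the duck in the pond."]
recent = [
    "Another selfie, another duckface.",
    "Duckface photos are everywhere; duckface is not going away.",
    "He made a face at the camera.",
]
print(new_word_candidates(reference, recent))
```

Candidates that pass both thresholds are only a starting point; the concordance and clustering steps that follow are what would separate a genuinely new word or sense from noise.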
NLP
● Identify interesting or significant
words, phrases, or names
● Group the occurrences of this
“interesting thing” into senses
● Differentiate among the senses
NLP
● Concordances
● Measures of Association
● Clustering
● First order co-occurrences
● Second order co-occurrences
Concordances
● KWIC – Key Word in Context
● A basic tool for lexicographers, and
many other language users
● Long history with religious scholars
● Shows a target word surrounded by
some amount of context on either side
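A KWIC display is simple to produce. The sketch below is a minimal illustration, not the tool used for the talk; the function name `kwic` and the `width` parameter are assumptions, and the sample text reuses the "cold" sentences from later in the deck.

```python
import re

def kwic(text, target, width=30):
    """Return one aligned line per occurrence of target, with context on both sides."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b%s\b" % re.escape(target), flags=re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad both sides so the target word lines up in a single column.
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(0), right.ljust(width)))
    return lines

sample = ("Feed a cold, starve a fever. It is always cold in Minnesota. "
          "The soup was cold and watery. Cold and flu season is upon us.")
for line in kwic(sample, "cold"):
    print(line)
```

Sorting the returned lines by their right-hand context is one easy way to bring similar usages next to each other.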
Concordance
● Can ponder different usages of a word in
context, sort and rearrange them, compare and
contrast, come to understand distinctions in
meaning
● The goal may be to group the contexts in the
concordance into groups or clusters, where each
cluster uses the target word in the same sense
● ...Much like a lexicographer
Collocations
● How to recognize similar entries in a
concordance?
● Collocations with the target word
– All entries using “burnt offering” likely to be using
same sense (of offering)
● Same or similar words co-occur in context
– All entries that also include “priest” may be
similar
Collocations
● Can be recognized via frequency
● May be identified in a large corpus
via measures of association
● Do these two words occur together
significantly more often than expected
by chance?
Frequency
Measures of Association
● Compare the frequency of a pair of words
with the value that would be expected if they
were independent
● Does p(w_1,w_2) = p(w_1) * p(w_2)?
● If the frequency of the pair is close to what would
be expected under independence, then the pair is not
considered interesting (it is just a chance
occurrence)
Measures of Association
http://ngram.sourceforge.net
● Log-likelihood ratio (ll)
● Mutual Information (tmi)
● Pearson's chi-squared test (x2)
● Pointwise Mutual Information (pmi)
● Poisson-Stirling (ps)
● Fisher's Exact Test (leftFisher)
● Jaccard Coefficient (jaccard)
● Odds Ratio (odds)
● Dice Coefficient (dice)
● T-score (tscore)
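Three of the simpler measures can be written directly from bigram counts. The sketch below uses the standard textbook definitions, which may differ in detail from the implementations at the URL above; the function names are illustrative. The counts are those of the "burnt offering" example that appears later in the deck (n_11 = 184, n_1+ = 309, n_+1 = 548, n_++ = 68,617).

```python
import math

def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information: log2 of observed over expected joint count."""
    return math.log2((n11 * npp) / (n1p * np1))

def dice(n11, n1p, np1):
    """Dice coefficient: twice the joint count over the sum of the marginal counts."""
    return 2 * n11 / (n1p + np1)

def jaccard(n11, n1p, np1):
    """Jaccard coefficient: joint count over the count of bigrams containing either word."""
    return n11 / (n1p + np1 - n11)

# Counts for "burnt offering" as given in the deck's example slide.
n11, n1p, np1, npp = 184, 309, 548, 68617
print(pmi(n11, n1p, np1, npp), dice(n11, n1p, np1), jaccard(n11, n1p, np1))
```

PMI is unbounded and tends to reward rare pairs, while Dice and Jaccard stay between 0 and 1, which is one reason to look at more than one measure.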
Log likelihood ratio
Observed versus Expected
● p(w_1,w_2) = n_11 / n_++
● p(w_1) = n_1+ / n_++, p(w_2) = n_+1 / n_++
● m_11 = (n_1+ * n_+1) / n_++
● Generalizes to m_ij
          W2      NOT W2
W1        n_11    n_12      n_1+
NOT W1    n_21    n_22      n_2+
          n_+1    n_+2      n_++
Example
             offering           NOT offering
burnt        n_11 = 184         n_12 = 125            309
             m_11 = 2.47        m_12 = 306.53
NOT burnt    n_21 = 364         n_22 = 67,944         68,308
             m_21 = 545.53      m_22 = 67,762.47
             548                68,069                68,617
● Do n_ij and m_ij diverge enough to reject the
model of independence?
● According to log-likelihood they do …
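The comparison of observed and expected counts can be checked with a few lines of code. This is a sketch under the usual definition of the log-likelihood ratio, G^2 = 2 * sum of n_ij * ln(n_ij / m_ij), with expected values derived from the table margins; the function names are illustrative. Under independence G^2 is approximately chi-squared with one degree of freedom, so values above 3.84 reject independence at the 0.05 level.

```python
import math

def expected_counts(n11, n12, n21, n22):
    """Expected cell counts m_ij = (n_i+ * n_+j) / n_++ under independence."""
    n1p, n2p = n11 + n12, n21 + n22   # row totals
    np1, np2 = n11 + n21, n12 + n22   # column totals
    npp = n1p + n2p                   # grand total
    return (n1p * np1 / npp, n1p * np2 / npp,
            n2p * np1 / npp, n2p * np2 / npp)

def log_likelihood(n11, n12, n21, n22):
    """Log-likelihood ratio G^2 = 2 * sum(n_ij * ln(n_ij / m_ij))."""
    observed = (n11, n12, n21, n22)
    expected = expected_counts(*observed)
    # Cells with a zero observed count contribute nothing to the sum.
    return 2 * sum(n * math.log(n / m) for n, m in zip(observed, expected) if n > 0)

# Observed counts for "burnt offering" from the table above.
m11, m12, m21, m22 = expected_counts(184, 125, 364, 67944)
print(round(m11, 2), round(m12, 2), round(m21, 2), round(m22, 2))
print(log_likelihood(184, 125, 364, 67944) > 3.84)
```

Because the expected counts are computed from the margins, each row and column of expected values sums to the same total as the observed row or column, which is a quick sanity check on any such table.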
Features
● Collocations – words that occur together
more often than expected by chance
● Can indicate sense reliably when target word
involved
● Co-occurrences – words that occur near the
target word (but not adjacent)
● Useful for differentiating among senses,
especially when several are involved
Word Sense Discrimination
● Feed a cold, starve a fever.
● It is always cold in Minnesota.
● The soup was cold and watery.
● Cold and flu season is upon us.
Word Sense Discrimination
● Feed a cold, starve a fever.
● Cold and flu season is upon us.
● It is always cold in Minnesota.
● The soup was cold and watery.
First Order Representations
● CTX1 : Feed a cold, starve a fever.
cold feed fever starve
CTX1 1 1 1 1
First order methods
● Following bag-of-words, text classification
● Represent each target word context with a
binary vector that shows which features occur
within
● Collocations, co-occurrences
● Results in a context by word matrix (where
each row is an instance to be clustered)
● Cluster
First Order Representations
● CTX1 : Feed a cold, starve a fever.
● CTX4 : Cold and flu season is upon us.
        cold  feed  fever  flu  season  starve  upon
CTX1    1     1     1      0    0       1       0
CTX4    1     0     0      1    1       0       1
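The context by word matrix above can be built in a few lines. A minimal sketch, assuming a tiny hand-picked stoplist and a simple tokenizer (both invented here for illustration):

```python
import re

STOP = {"a", "and", "is", "us"}  # toy stoplist, assumed for this example

def tokens(text):
    # Lowercase alphabetic tokens with stop words removed.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

def first_order_matrix(contexts):
    """Binary context by word matrix: one row per context, one column per word."""
    vocab = sorted({w for c in contexts for w in tokens(c)})
    rows = []
    for c in contexts:
        present = set(tokens(c))
        rows.append([1 if w in present else 0 for w in vocab])
    return vocab, rows

contexts = ["Feed a cold, starve a fever.", "Cold and flu season is upon us."]
vocab, rows = first_order_matrix(contexts)
print(vocab)
for row in rows:
    print(row)
```

Each row is then one instance to be clustered, for example by cosine similarity between rows.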
First order representations
● Works well enough if you have moderate to
large numbers of larger contexts
● and a relatively consistent vocabulary...
– and a bit of luck...
● Success in supervised text classification
problems doesn't always transfer over to
the unsupervised arena
What drives us crazy...
● fever and flu have much in
common ...
● But, just can't see it here..
        cold  feed  fever  flu  season  starve  upon
CTX1    1     1     1      0    0       1       0
CTX4    1     0     0      1    1       0       1
CTX1 : Feed a cold, starve a fever.
CTX4 : Cold and flu season is upon us.
Look to the second order...
● You shall know a word by the company it keeps (JR
Firth, 1957)
● Words have friends
– Cold is a friend of fever and flu
● Friends share friends and hang outs
– Fever and flu share some friends that aren't
friends with cold
● 2nd order co-occurrences with cold (f of f)
– Fever and flu hang out in places without cold
● 2nd order “locations” of cold
Look to the second order...
● Fever and flu have some of the same friends...
● His fever caused his temperature to spike.
● The flu brings on a rise in body temperature.
● Fever and flu hang out together...
● Although influenza (the flu) is not considered
serious by many parents, the very high fever that
it can cause is a cause of blindness and even
death in children.
● Second order features can be derived from the target
word contexts, or from other (unannotated) data
LSI, LSA, and Schütze
● Unsupervised methods
● Input Contexts, Output Clusters of Contexts
● Influential
● Context representation a key distinction
● Alternatives to first order features
● They look to the second order...
– LSI/LSA – where do you find your word friends?
– Schütze - who do your word friends hang out with?
Second order
representations
● CTX1 : Feed a cold, starve a fever...
● Create co-occurrence vectors for all non-
stop words : feed, starve, fever
● Replace words in CTX1 with those vectors
● Average together and replace CTX1 with
that new averaged vector
● Do the same with all other target word
contexts, then cluster
Second order
representations
● CTX1 : Feed a cold, starve a fever.
● CTX4 : Cold and flu season is upon us.
● Apart from the target word, nothing matches in the first
order representation, but in second order if fever and flu ...
● both occur with temperature, then there is
some similarity between CTX1 and CTX4
● both occur in document 12432, then there is
some similarity between CTX1 and CTX4
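The effect can be reproduced on a toy scale. The sketch below is illustrative only: it builds sentence-level co-occurrence vectors from the two "temperature" sentences quoted earlier in the deck, replaces each context word with its vector, averages, and compares contexts by cosine. The stoplist and function names are assumptions, the SVD step is omitted, and this is not the SenseClusters implementation.

```python
import math
import re
from collections import Counter, defaultdict

STOP = {"a", "and", "is", "us", "his", "the", "to", "on", "in"}  # toy stoplist (assumption)

def tokens(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

def cooccurrence_vectors(sentences):
    """Map each word to counts of the other words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for s in sentences:
        words = set(tokens(s))
        for w in words:
            for v in words:
                if v != w:
                    vectors[w][v] += 1
    return vectors

def context_vector(context, vectors, target):
    """Average the co-occurrence vectors of the non-target words in a context."""
    words = [w for w in tokens(context) if w != target]
    total = Counter()
    for w in words:
        total.update(vectors.get(w, {}))  # words with no vector contribute nothing
    n = len(words) or 1
    return {k: v / n for k, v in total.items()}

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

background = ["His fever caused his temperature to spike.",
              "The flu brings on a rise in body temperature."]
vectors = cooccurrence_vectors(background)
ctx1 = context_vector("Feed a cold, starve a fever.", vectors, "cold")
ctx4 = context_vector("Cold and flu season is upon us.", vectors, "cold")
print(cosine(ctx1, ctx4))
```

The two contexts share no word other than the target, yet their second order vectors overlap on "temperature", so the cosine is greater than zero.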
Method
● Collect contexts with a given target word
● Identify lexical features within the contexts
● Use these to represent contexts using first or second
order features
● Perform SVD or other dimensionality reduction
● Cluster
● Number of clusters automatically discovered
● Generate a label for each cluster
First order features
● Represent contexts with binary vectors that
show which features occur in the context
● Results in a context by word matrix (where
each row is an instance to be clustered)
● Cluster
Second order
co-occurrences
● Use bigram features to create a word by word
co-occurrence matrix
● SVD or dimensionality reduction
● Replace each word in a target word context
with the corresponding co-occurrence vector
● Average all of the word vectors together to
represent the context
● Do this for each target word context, cluster
A note on word embeddings
● Word embeddings are a recently popular
idea where a vector is created for a word
based on co-occurrence or other kinds of
language information
● 2nd order features as shown here can be
seen as a fairly direct sort of word
embedding
● word2vec is a widely used tool
second order locations
(LSI/LSA)
● Transpose first order representation so that it
becomes word by context
● Perform SVD (LSA recommendation)
● Represent contexts to be clustered by
replacing each word in a target word context
with the corresponding word vector
● Average all of the word vectors together to
represent the context
Clustering
● Repeated Bisections
● Starts with all contexts in one cluster, then
repeatedly partitions (in two) to optimize the
criterion function
● Partitioning done via k-means with k=2
● I2 criterion function
● Finds average pairwise similarity between
each context in the cluster and the centroid,
sums across all clusters to find value
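A rough sketch of repeated bisections with the I2 criterion follows. It rests on several simplifying assumptions that are not from the slides: vectors are normalized to unit length, each 2-way split is seeded deterministically with the first context and the context least similar to it, and the largest cluster is always the one split next (a real implementation would typically pick the split that most improves the criterion). For unit vectors, the summed cosine between members and their centroid equals the length of the cluster's summed vector, which is what `i2` computes.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def composite(rows):
    # Component-wise sum of the vectors in a cluster.
    return [sum(col) for col in zip(*rows)]

def i2(rows, labels):
    """I2 criterion for unit vectors: sum over clusters of the composite vector length."""
    total = 0.0
    for c in set(labels):
        comp = composite([r for r, l in zip(rows, labels) if l == c])
        total += math.sqrt(dot(comp, comp))
    return total

def bisect(rows, iters=20):
    """2-means with cosine similarity, seeded with row 0 and the row least like it."""
    far = min(range(len(rows)), key=lambda i: dot(rows[i], rows[0]))
    cents = [rows[0], rows[far]]
    assign = [0] * len(rows)
    for _ in range(iters):
        assign = [0 if dot(r, cents[0]) >= dot(r, cents[1]) else 1 for r in rows]
        for c in (0, 1):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:
                cents[c] = normalize(composite(members))
    return assign

def repeated_bisections(vectors, k):
    rows = [normalize(v) for v in vectors]
    labels = [0] * len(rows)
    for new_label in range(1, k):
        sizes = {c: labels.count(c) for c in set(labels)}
        target = max(sizes, key=sizes.get)  # simplification: split the largest cluster
        idx = [i for i, l in enumerate(labels) if l == target]
        if len(idx) < 2:
            break
        for i, half in zip(idx, bisect([rows[i] for i in idx])):
            if half == 1:
                labels[i] = new_label
    return labels

toy = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]  # invented context vectors
labels = repeated_bisections(toy, 2)
print(labels, i2([normalize(v) for v in toy], labels))
```

Comparing `i2` before and after a split gives the kind of criterion value that the cluster stopping measures work from.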
Cluster stopping
● Find k where criterion function stops improving
● PK2 (Hartigan, 1975) takes ratio of criterion function
of successive pairs of k
● PK3 takes ratio of twice the criterion function at k
divided by the sum of the criterion function at k-1 and k+1
● PK2 and PK3 stop when these ratios are within 1
std of 1
● Gap Statistic (Tibshirani, 2001) compares observed
data with a reference sample of noise, and finds the k
with greatest divergence from noise
Cluster labeling
● Clusters made up of contexts that use the target
word in a particular sense
● Find top N most associated bigrams that are
unique to that cluster (discriminating features) and
top N that are most associated without regard to
which cluster they are in (descriptive features)
● Use standard measures of association like log-
likelihood, etc.
● Definition via a few well chosen bigrams
The result?
● Contexts that contain a particular
target word
● Organized by sense, where each
cluster contains contexts used in
approximately the same sense
Identities?
● Much like word senses, except
they apply to names
● Many distinct individuals have the
same name
● How do we differentiate among them?
Same techniques can be used.
Synonyms
● Might also be interested in new words
for old ideas
● How similar are the contexts in which
these new words are being used (with old
contexts)
● Or different words for the same idea
● Can use the same techniques to recognize these
The Future of
Word Sense Discrimination
● Automatically identifying senses by clustering
contexts continues to improve
● Automatically creating definitions remains a
challenging but fascinating problem in its own
right
● Given a cluster of contexts, create a definition that
captures why these contexts are in the same cluster
● Related task at Semeval-2015
http://alt.qcri.org/semeval2015/task15/
The Future of
Word Sense Discrimination
● Once a definition has been
created, use that to position the
new sense in a WordNet or
ontology
● Related task at Semeval-2016
http://alt.qcri.org/semeval2016/task14/
Conclusion
● Dictionaries look backwards, and only
include words once they have a good
chance of long-term acceptance
● The process by which dictionaries are
created can be seen as a kind of horizon
scanning
● New words, new senses
● Standards for inclusion in OED very high
Conclusion
● These techniques can be used
to spot emerging words, senses
and identities in raw text
● These can be harbingers of
future trends
Thank you!
● Measures of Association
● http://ngram.sourceforge.net
● Word Sense Discrimination
● http://senseclusters.sourceforge.net
LSI, LSA, and Schütze
● LSI : Deerwester, S., et al. (1988) Improving Information
Retrieval with Latent Semantic Indexing, Proceedings of the
51st Annual Meeting of the American Society for Information
Science 25, pp. 36–40.
● LSA : Landauer, T. K., and Dumais, S. T. (1997) A solution to
Plato's problem: The Latent Semantic Analysis theory of the
acquisition, induction, and representation of knowledge.
Psychological Review, 104, 211-240.
● Schütze : Schütze, H. (1998) Automatic word sense
discrimination. Computational Linguistics, 24(1), pp. 97-123.
● SenseClusters : http://senseclusters.sourceforge.net

Mais conteúdo relacionado

Semelhante a The horizon isn't found in a dictionary : Identifying emerging word senses and identities in raw text

Great britain
Great britainGreat britain
Great britainborzna
 
Dictionaries 2003 version
Dictionaries 2003 versionDictionaries 2003 version
Dictionaries 2003 versionJohan Koren
 
English literature
English literatureEnglish literature
English literatureBusines
 
History of English Language
History of English LanguageHistory of English Language
History of English Languagesuasenglish
 
A Guide To British and American English ( PDFDrive ).pdf
A Guide To British and American English ( PDFDrive ).pdfA Guide To British and American English ( PDFDrive ).pdf
A Guide To British and American English ( PDFDrive ).pdfraykhona_r
 
History of English Language
History of English LanguageHistory of English Language
History of English LanguageTOHIDURRAHMAN5
 
English as a global language grace
English as a global language graceEnglish as a global language grace
English as a global language gracePao Plastina
 
Nineteenth century and after
Nineteenth century and afterNineteenth century and after
Nineteenth century and afteriqbal hussain
 
Lots Of Free Printables For Kids Animal Writing Writi
Lots Of Free Printables For Kids Animal Writing WritiLots Of Free Printables For Kids Animal Writing Writi
Lots Of Free Printables For Kids Animal Writing WritiMonroe Anderton
 
History of english
History of englishHistory of english
History of englishKarlaAnampa
 
All in one general quiz
All in one general quizAll in one general quiz
All in one general quizShivam Agarwal
 
1.1 Introduction to the Industrial Revolution.pptx
1.1 Introduction to the Industrial Revolution.pptx1.1 Introduction to the Industrial Revolution.pptx
1.1 Introduction to the Industrial Revolution.pptxMartensJ
 
presentation of language final.pptx
presentation of language final.pptxpresentation of language final.pptx
presentation of language final.pptxsharjeelmushtaq47
 
America is ruining the english
America is ruining the englishAmerica is ruining the english
America is ruining the englishJessica Soto
 
Anglo-Saxon Glosses And Glossaries An Introduction
Anglo-Saxon Glosses And Glossaries  An IntroductionAnglo-Saxon Glosses And Glossaries  An Introduction
Anglo-Saxon Glosses And Glossaries An IntroductionTye Rausch
 

Semelhante a The horizon isn't found in a dictionary : Identifying emerging word senses and identities in raw text (20)

Dictionaries
DictionariesDictionaries
Dictionaries
 
Great britain
Great britainGreat britain
Great britain
 
Dictionaries
DictionariesDictionaries
Dictionaries
 
Oed
OedOed
Oed
 
Dictionaries 2003 version
Dictionaries 2003 versionDictionaries 2003 version
Dictionaries 2003 version
 
English literature
English literatureEnglish literature
English literature
 
History of English Language
History of English LanguageHistory of English Language
History of English Language
 
A Guide To British and American English ( PDFDrive ).pdf
A Guide To British and American English ( PDFDrive ).pdfA Guide To British and American English ( PDFDrive ).pdf
A Guide To British and American English ( PDFDrive ).pdf
 
History of English Language
History of English LanguageHistory of English Language
History of English Language
 
English as a global language grace
English as a global language graceEnglish as a global language grace
English as a global language grace
 
Nineteenth century and after
Nineteenth century and afterNineteenth century and after
Nineteenth century and after
 
Lots Of Free Printables For Kids Animal Writing Writi
Lots Of Free Printables For Kids Animal Writing WritiLots Of Free Printables For Kids Animal Writing Writi
Lots Of Free Printables For Kids Animal Writing Writi
 
History of english
History of englishHistory of english
History of english
 
All in one general quiz
All in one general quizAll in one general quiz
All in one general quiz
 
1.1 Introduction to the Industrial Revolution.pptx
1.1 Introduction to the Industrial Revolution.pptx1.1 Introduction to the Industrial Revolution.pptx
1.1 Introduction to the Industrial Revolution.pptx
 
presentation of language final.pptx
presentation of language final.pptxpresentation of language final.pptx
presentation of language final.pptx
 
Dictionaries
DictionariesDictionaries
Dictionaries
 
America is ruining the english
America is ruining the englishAmerica is ruining the english
America is ruining the english
 
V1.6 dump qotm
V1.6 dump qotmV1.6 dump qotm
V1.6 dump qotm
 
Anglo-Saxon Glosses And Glossaries An Introduction
Anglo-Saxon Glosses And Glossaries  An IntroductionAnglo-Saxon Glosses And Glossaries  An Introduction
Anglo-Saxon Glosses And Glossaries An Introduction
 

Mais de University of Minnesota, Duluth

Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...
Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...
Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...University of Minnesota, Duluth
 
Algorithmic Bias - What is it? Why should we care? What can we do about it?
Algorithmic Bias - What is it? Why should we care? What can we do about it? Algorithmic Bias - What is it? Why should we care? What can we do about it?
Algorithmic Bias - What is it? Why should we care? What can we do about it? University of Minnesota, Duluth
 
Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?University of Minnesota, Duluth
 
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection
The horizon isn't found in a dictionary : Identifying emerging word senses and identities in raw text

  • 1. August 25-27, 2015 Crazy Futures III 1 Ted Pedersen Department of Computer Science University of Minnesota, Duluth tpederse@d.umn.edu http://www.d.umn.edu/~tpederse The horizon isn't found in a dictionary : Identifying emerging word senses and identities in raw text
  • 2. A winding road
      ● Dictionaries
      ● A powerful lens to look back, but not to the future
      ● Lexicographers
      ● While making dictionaries, engage in a kind of horizon scanning
        – What new words or senses are emerging?
      ● Natural Language Processing
      ● Can we automate the task of the lexicographer?
      ● Can we identify emerging words, senses, and identities?
  • 3. Dictionaries
      ● Wonderful for looking back!
      ● Is that really a word?
      ● How do you spell it?
      ● What does it mean?
      ● When was a word first used?
      ● When did that sense of a word emerge?
  • 4. Dictionaries
      ● Not particularly predictive
      ● But, the people who create dictionaries are horizon scanners, always looking for new words and senses
      ● Lexicographers
      ● Or … computer programs? (NLP)
  • 5. Dictionaries
      ● Go back to at least 2300 BCE
      ● Early on were bilingual word lists
      ● Useful for trade, warfare
      ● Idea of monolingual dictionary developed later
      ● In English, 1604
  • 6. Descriptive or Prescriptive
      ● Descriptive
      ● Document how the language is used
      ● Use determines meaning
      ● English – OED
      ● Prescriptive
      ● Define how the language should be used
      ● Experts decide
      ● English – early Webster
      ● French Academy – create words to replace Anglicisms
  • 7. English Lexicography
      ● 1604 - A Table Alphabeticall, by Robert Cawdrey, approx 2,500 entries
      ● 1755 - A Dictionary of the English Language, by Samuel Johnson, approx 42,000 entries
      ● 1828 - An American Dictionary of the English Language, by Noah Webster, approx 70,000 entries
      ● 1928 - Oxford English Dictionary, 10 volumes, approx 400,000 entries
      ● 1989 - Oxford English Dictionary (2nd ed), 20 volumes, approx 600,000 entries
  • 9. Table Alphabeticall (1604)
      A Table Alphabeticall, conteyning and teaching the true writing, and vnderstanding of hard vsuall English wordes, borrowed from the Hebrew, Greeke, Latine, or French. &c. With the interpretation thereof by plaine English words, gathered for the benefit & helpe of Ladies, Gentlewomen, or any other vnskilfull persons. Whereby they may the more easilie and better vnderstand many hard English wordes, which they shall heare or read in Scriptures, Sermons, or elswhere, and also be made able to vse the same aptly themselues.
      Legere, et non intelligere, neglegere est. As good not read, as not to vnderstand.
  • 10. Table Alphabeticall (1604)
      ● A Table Alphabeticall of Hard Usual English Words
      ● Developed by Robert Cawdrey
      ● 120 pages, 2,543 entries
      ● Short definitions, synonyms
      ● Doesn't include multiple senses for a word
      ● http://www.library.utoronto.ca/utel/ret/cawdrey/cawdrey0.html
  • 12. (sample entries from A Table Alphabeticall)
      combustible, easily burnt
      combustion, burning or consuming with fire.
      comedie, (k) stage play,
      comicall, handled merily like a comedie
      commemoration, rehearsing or remebring
      [fr] commencement, a beginning or entrance
      comet, (g) a blasing starre
      comentarie, exposition of any thing
      commerce, fellowship, entercourse of merchandise.
      commination, threatning, or menacing,
      commiseration, pittie
      commodious, profitable, pleasant, fit,
      commotion, rebellion, trouble, or disquietnesse.
      communicate, make partaker, or giue part vnto
      [fr] communaltie, common people, or comon-wealth
      communion, communitie, (synonyms) fellowship.
      compact, ioyned together, or an agreement.
      compassion, pitty, fellow-feeling
      compell, to force, or constraine
      compendious, short, profitable
  • 13. Table Alphabeticall (1604)
      ● The First English Dictionary
      ● Not clear why words included or not
      ● Hard?
      ● Introspection
      ● Quickly superseded
  • 15. A Dictionary of the English Language (1755)
      ● Written by Samuel Johnson (Dr. Johnson)
      ● Worked alone (with six copyists)
      ● Nearly 43,000 entries
      ● 2,300 pages
      ● 100,000 illustrative quotes from literature
      ● http://johnsonsdictionaryonline.com/
      ● Sometimes biased, long-winded, inconsistent
      ● A delight really...
  • 16. Method
      ● Decided not to build upon previous works
      ● Carried out a perusal of English literature
      ● Studied 2,000 books from 500 authors going back 200 years
      ● Entries based on the past
      ● Selected quotations to show language in action
  • 17. The Inimitable Dr. Johnson
      ● Lexicographer: A writer of dictionaries; a harmless drudge that busies himself in tracing the original, and detailing the signification of words.
      ● Oats: A grain, which in England is generally given to horses, but in Scotland supports the people.
      ● To worm: To deprive a dog of something, nobody knows what, under his tongue, which is said to prevent him, nobody knows why, from running mad.
  • 18. oats
      ● Oats. n.s. [aten, Saxon.] A grain, which in England is generally given to horses, but in Scotland supports the people.
      ● It is of the grass leaved tribe; the flowers have no petals, and are disposed in a loose panicle: the grain is eatable. The meal makes tolerable good bread. Miller.
      ● The oats have eaten the horses. Shakespeare.
      ● It is bare mechanism, no otherwise produced than the turning of a wild oatbeard, by the insinuation of the particles of moisture. Locke.
      ● For your lean cattle, fodder them with barley straw first, and the oat straw last. Mortimer's Husbandry.
      ● His horse's allowance of oats and beans, was greater than the journey required. Swift.
  • 21. A Dictionary of the English Language (1755)
      ● A monumental work
      ● Set precedents for dictionaries that live on today
      ● Systematic study of published literature for words and senses
      ● Illustrate senses with quotations
      ● 1,700 of Dr. Johnson's definitions remain in OED today
  • 22. Noah Webster
      ● A tireless advocate for American English
      ● “Blue Backed Speller” (1783, 1804, 1806)
      ● Proposed Americanized spellings
      ● Widely used in schools in 1800s
      ● Dissertations on the English Language (1789)
      ● An American standard needed to be developed
  • 24. Noah Webster
      ● A Compendious Dictionary of the English Language (1806)
      ● 28,000 entries
      ● Intended to improve, Americanize Dr. Johnson's dictionary
  • 25. Noah Webster
      ● An American Dictionary of the English Language (1828)
      ● 70,000 entries
      ● 1864 Unabridged edition had 114,000 entries
  • 27. Improving on Dr. Johnson?
      OAT, n. A plant of the genus Avena, and more usually, the seed of the plant. The word is commonly used in the plural, oats. This plant flourishes best in cold latitudes, and degenerates in the warm. The meal of this grain, oatmeal, forms a considerable and very valuable article of food for man in Scotland, and every where oats are excellent food for horses and cattle.
  • 28. An American Dictionary
      It is not only important, but, in a degree necessary, that the people of this country, should have an American Dictionary of the English Language; for, although the body of the language is the same as in England, and it is desirable to perpetuate that sameness, yet some differences must exist. Language is the expression of ideas; and if the people of one country cannot preserve an identity of ideas, they cannot retain an identity of language. Now an identity of ideas depends materially upon a sameness of things or objects with which the people of the two countries are conversant. But in no two portions of the earth, remote from each other, can such identity be found. Even physical objects must be different. But the principal differences between the people of this country and of all others, arise from different forms of government, different laws, institutions and customs.
  • 29. Noah Webster
      ● An American Dictionary of the English Language (1828)
      ● 70,000 words
      ● Not a great success at the time
  • 30. Oxford English Dictionary
      ● OED began in 1857 as a revision of Dr. Johnson's dictionary
      ● Improve coverage, quality of entries, consistency, remove biases
      ● Envisioned as a 10 year project
      ● Was also a response to perception that other European languages were more advanced with their dictionaries
  • 31. Oxford English Dictionary
      ● Work began in 1857, first publication in 1884, first edition in 1928 (71 years later)
      ● James Murray, Chief Editor of OED, 1879-1915
  • 33. Crowd-sourced!
      ● Invite English readers to contribute words
      ● Read, and whenever they see a word of interest used in an illustrative context, write it on a slip of paper and send it to OUP
      ● Word, quotation, citation, reference
  • 35. First edition 1928
      ● 10 volumes, 15,490 pages
      ● 414,800 entries
      ● 2,000 contributors
      ● 5 million submitted quotations
      ● 1.86 million used
  • 36. Second Edition 1989
      ● 20 volumes, 21,730 pages
      ● Weighs 137 pounds
      ● 658,000 words
      ● 2.43 million quotations
  • 43. But...good news
      ● Duck face is entering dictionaries
      ● Oxford Dictionaries online
      ● Urban dictionary
      ● OED sets high bar for inclusion
      ● What words are being used today that will find their way into OED?
  • 44. And now...NLP?
      ● OED tells us when a word or sense was first used
      ● What if we could automatically recognize new words or senses going forward?
      ● What if we could recognize people or organizations (identities) that were to be significant?
  • 45. New words, emerging senses, new identities
      ● Scan sources of interest and look for words or terms that have not occurred previously, and that reach some level of regularity and frequency
      ● Once you have a few candidates, you can start to investigate further
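The scan for previously unseen terms can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the function names, the tiny example texts, and the thresholds `min_count` (frequency) and `min_docs` (regularity across documents) are all assumptions made for the sketch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def candidate_new_terms(reference_texts, recent_texts, min_count=3, min_docs=2):
    """Return words absent from the reference texts that occur at least
    min_count times, in at least min_docs of the recent texts."""
    known = set()
    for text in reference_texts:
        known.update(tokenize(text))
    total = Counter()      # overall frequency in the recent texts
    doc_freq = Counter()   # number of recent texts containing the word
    for text in recent_texts:
        words = tokenize(text)
        total.update(words)
        doc_freq.update(set(words))
    return sorted(w for w in total
                  if w not in known
                  and total[w] >= min_count
                  and doc_freq[w] >= min_docs)

# Hypothetical toy data: an older reference sample and a newer sample.
reference = ["she took a photo of her face", "the duck swam across the pond"]
recent = [
    "another selfie with a silly face",
    "he posted a selfie and then a second selfie",
    "the duck ignored the camera",
]
print(candidate_new_terms(reference, recent))  # ['selfie']
```

In practice the reference vocabulary would come from a large dated corpus or an existing dictionary, and the surviving candidates would then be inspected by hand.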
  • 46. NLP
      ● Identify interesting or significant words, phrases, or names
      ● Group the occurrences of this “interesting thing” into senses
      ● Differentiate among the senses
  • 47. NLP
      ● Concordances
      ● Measures of Association
      ● Clustering
      ● First order co-occurrences
      ● Second order co-occurrences
  • 48. Concordances
      ● KWIC – Key Word in Context
      ● A basic tool for lexicographers, and many other language users
      ● Long history with religious scholars
      ● Shows a target word surrounded by some amount of context on either side
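A KWIC concordance is simple enough to sketch directly. The version below is a minimal illustration that measures context in characters; the function name `kwic` and the `width` parameter are assumptions, and real concordancers usually count words and offer sorting on the left or right context.

```python
import re

def kwic(text, target, width=30):
    """Return one line per occurrence of target, with up to `width`
    characters of context on either side, aligned on the target word."""
    lines = []
    pattern = r"\b%s\b" % re.escape(target)
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

text = ("Feed a cold, starve a fever. It is always cold in Minnesota. "
        "The soup was cold and watery. Cold and flu season is upon us.")
for line in kwic(text, "cold"):
    print(line)
```

Each printed line centers one occurrence of the target word, so usages can be compared at a glance.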
  • 50. Concordance
      ● Can ponder different usages of a word in context, sort and rearrange them, compare and contrast, come to understand distinctions in meaning
      ● The goal may be to group the contexts in the concordance into groups or clusters, where each cluster uses the target word in the same sense
      ● ...Much like a lexicographer
  • 51. Collocations
      ● How to recognize similar entries in a concordance?
      ● Collocations with the target word
        – All entries using “burnt offering” likely to be using same sense (of offering)
      ● Same or similar words co-occur in context
        – All entries that also include “priest” may be similar
  • 52. Collocations
      ● Can be recognized via frequency
      ● May be identified in a large corpus via measures of association
      ● Do these two words occur together significantly more often than expected by chance?
  • 53. Frequency
  • 54. Measures of Association
      ● Compare the frequency of a pair of words with the value that would be expected if they were independent
      ● p(w1,w2) = p(w1)*p(w2) ??
      ● If the frequency of the pair is close to what would be expected under independence, then this pair is not considered interesting (it is just a chance occurrence)
  • 55. Measures of Association
      http://ngram.sourceforge.net
      ● Log-likelihood ratio (ll)
      ● Mutual Information (tmi)
      ● Pearson's chi-squared test (x2)
      ● Pointwise Mutual Information (pmi)
      ● Poisson-Stirling (ps)
      ● Fisher's Exact Test (leftFisher)
      ● Jaccard Coefficient (jaccard)
      ● Odds Ratio (odds)
      ● Dice Coefficient (dice)
      ● T-score (tscore)
  • 56. Log likelihood ratio
  • 57. Observed versus Expected
      ● p(w_1,w_2) = n_11 / n_++
      ● p(w_1) = n_1+ / n_++, p(w_2) = n_+1 / n_++
      ● m_11 = (n_1+ * n_+1) / n_++
      ● Generalizes to m_ij

                  W2      NOT W2
        W1        n_11    n_12      n_1+
        NOT W1    n_21    n_22      n_2+
                  n_+1    n_+2      n_++
  • 58. Example

                    offering          NOT offering
        burnt       n_11 = 184        n_12 = 125           309
                    m_11 = 2.47       m_12 = 306.53
        NOT burnt   n_21 = 364        n_22 = 67,944        68,308
                    m_21 = 545.53     m_22 = 67,762.47
                    548               68,069               68,617

      ● Do n_ij and m_ij diverge enough to reject the model of independence?
      ● According to log-likelihood they do …
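The expected values and the log-likelihood ratio for the "burnt offering" counts can be checked with a short script. This is a sketch using the standard formula G^2 = 2 * sum(n_ij * ln(n_ij / m_ij)), not a copy of the Ngram Statistics Package implementation; the function names are made up for the example, and 3.84 is the usual chi-squared critical value for one degree of freedom at p = 0.05.

```python
import math

def expected_counts(n11, n12, n21, n22):
    """Expected cell counts m_ij = (row total * column total) / grand total."""
    n1p, n2p = n11 + n12, n21 + n22      # row totals n_1+, n_2+
    np1, np2 = n11 + n21, n12 + n22      # column totals n_+1, n_+2
    npp = n1p + n2p                      # grand total n_++
    return (n1p * np1 / npp, n1p * np2 / npp,
            n2p * np1 / npp, n2p * np2 / npp)

def log_likelihood(n11, n12, n21, n22):
    """G^2 = 2 * sum of n_ij * ln(n_ij / m_ij) over the four cells."""
    observed = (n11, n12, n21, n22)
    expected = expected_counts(n11, n12, n21, n22)
    return 2.0 * sum(n * math.log(n / m)
                     for n, m in zip(observed, expected) if n > 0)

# Observed counts for "burnt" and "offering" from the example table.
m11, m12, m21, m22 = expected_counts(184, 125, 364, 67944)
g2 = log_likelihood(184, 125, 364, 67944)
print(round(m11, 2), round(m12, 2))   # 2.47 306.53
print(g2 > 3.84)                      # True
```

The observed count of 184 against an expected count of about 2.47 is what drives the score far past the threshold.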
  • 59. Features
      ● Collocations – words that occur together more often than expected by chance
      ● Can indicate sense reliably when target word involved
      ● Co-occurrences – words that occur near the target word (but not adjacent)
      ● Useful for differentiating among senses, especially when several are involved
  • 60. Word Sense Discrimination
      ● Feed a cold, starve a fever.
      ● It is always cold in Minnesota.
      ● The soup was cold and watery.
      ● Cold and flu season is upon us.
  • 61. Word Sense Discrimination
      ● Feed a cold, starve a fever.
      ● Cold and flu season is upon us.

      ● It is always cold in Minnesota.
      ● The soup was cold and watery.
  • 62. First Order Representations
      ● CTX1 : Feed a cold, starve a fever.

                cold  feed  fever  starve
        CTX1     1     1     1      1
  • 63. First order methods
      ● Following bag-of-words, text classification
      ● Represent each target word context with a binary vector that shows which features occur within
      ● Collocations, co-occurrences
      ● Results in a context by word matrix (where each row is an instance to be clustered)
      ● Cluster
  • 64. First Order Representations
      ● CTX1 : Feed a cold, starve a fever.
      ● CTX4 : Cold and flu season is upon us.

                cold  feed  fever  flu  season  starve  upon
        CTX1     1     1     1      0    0       1       0
        CTX4     1     0     0      1    1       0       1
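The context by word matrix for CTX1 and CTX4 can be built as follows. This is a minimal sketch: the stoplist is an assumption, chosen so that the feature columns come out as cold, feed, fever, flu, season, starve, upon, and the helper names are invented for the example.

```python
import re

# Assumed stoplist; these function words are left out of the feature set.
STOPLIST = {"a", "and", "is", "us"}

def features(context):
    """Set of lowercased non-stoplist words in a context."""
    return {w for w in re.findall(r"[a-z]+", context.lower())
            if w not in STOPLIST}

def first_order_matrix(contexts):
    """Return (vocabulary, {context name: binary feature vector})."""
    vocab = sorted(set().union(*(features(c) for c in contexts.values())))
    matrix = {name: [1 if w in features(c) else 0 for w in vocab]
              for name, c in contexts.items()}
    return vocab, matrix

contexts = {
    "CTX1": "Feed a cold, starve a fever.",
    "CTX4": "Cold and flu season is upon us.",
}
vocab, matrix = first_order_matrix(contexts)
print(vocab)           # ['cold', 'feed', 'fever', 'flu', 'season', 'starve', 'upon']
print(matrix["CTX1"])  # [1, 1, 1, 0, 0, 1, 0]
print(matrix["CTX4"])  # [1, 0, 0, 1, 1, 0, 1]
```

Each row is one instance to be clustered; the only shared feature between the two rows is the target word itself.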
  • 65. August 25-27, 2015 Crazy Futures III 65 First order representations ● Works well enough if you have moderate to large numbers of larger contexts ● and a relatively consistent vocabulary... – and a bit of luck... ● Success in supervised text classification problems doesn't always transfer over to unsupervised arena
  • 66. August 25-27, 2015 Crazy Futures III 66 What drives us crazy... ● fever and flu have much in common ... ● But, just can't see it here.. cold feed fever flu season starve upon CTX1 1 1 1 0 0 1 0 CTX4 1 0 0 1 1 0 1 CTX1 : Feed a cold, starve a fever. CTX4 : Cold and flu season is upon us.
  • 67. August 25-27, 2015 Crazy Futures III 67 Look to the second order... ● You shall know a word by the company it keeps (JR Firth, 1957) ● Words have friends – Cold is a friend of fever and flu ● Friends share friends and hang outs – Fever and flu share some friends that aren't friends with cold ● 2nd order co-occurrences with cold (f of f) – Fever and flu hang out in places without cold ● 2nd order “locations” of cold
  • 68. August 25-27, 2015 Crazy Futures III 68 Look to the second order... ● Fever and flu have some of the same friends... ● His fever caused his temperature to spike. ● The flu brings on a rise in body temperature. ● Fever and flu hang out together... ● Although influenza (the flu) is not considered serious by many parents, the very high fever that it can cause is a cause of blindness and even death in children. ● Second order features can be derived from the target word contexts, or from other (unannotated) data
  • 69. August 25-27, 2015 Crazy Futures III 69 LSI, LSA, and Schütze ● Unsupervised methods ● Input Contexts, Output Clusters of Contexts ● Influential ● Context representation a key distinction ● Alternatives to first order features ● They look to the second order... – LSI/LSA – where do you find your word friends? – Schütze - who do your word friends hang out with?
  • 70. August 25-27, 2015 Crazy Futures III 70 Second order representations ● CTX1 : Feed a cold, starve a fever... ● Create co-occurrence vectors for all non- stop words : feed, starve, fever ● Replace words in CTX1 with those vectors ● Average together and replace CTX1 with that new averaged vector ● Do the same with all other target word contexts, then cluster
  • 71. August 25-27, 2015 Crazy Futures III 71 Second order representations ● CTX1 : Feed a cold, starve a fever. ● CTX4 : Cold and flu season is upon us. ● Nothing matches in first order representation, but in second order if fever and flu ... ● both occur with temperature, then there is some similarity between CTX1 and CTX4 ● both occur in document 12432, then there is some similarity between CTX1 and CTX4
  • 72. August 25-27, 2015 Crazy Futures III 72 Method ● Collect contexts with a given target word ● Identify lexical features within the contexts ● Use these to represent contexts using first or second order features ● Perform SVD or other dimensionality reduction ● Cluster ● Number of clusters automatically discovered ● Generate a label for each cluster
  • 73. August 25-27, 2015 Crazy Futures III 73 First order features ● Represent contexts with binary vectors that show which features occur in the context ● Results in a context by word matrix (where each row is an instance to be clustered) ● Cluster
  • 74. Second order co-occurrences ● Use bigram features to create a word by word co-occurrence matrix ● SVD or other dimensionality reduction ● Replace each word in a target word context with the corresponding co-occurrence vector ● Average all of the word vectors together to represent the context ● Do this for each target word context, then cluster
  • 75. A note on word embeddings ● Word embeddings are a recently popular idea where a vector is created for a word based on co-occurrence or other kinds of language information ● 2nd order features as shown here can be seen as a fairly direct sort of word embedding ● word2vec is a widely used tool
  • 76. Second order locations (LSI/LSA) ● Transpose the first order representation so that it becomes word by context ● Perform SVD (as LSA recommends) ● Represent contexts to be clustered by replacing each word in a target word context with the corresponding word vector ● Average all of the word vectors together to represent the context
  • 77. Clustering ● Repeated Bisections ● Starts with all contexts in one cluster, then repeatedly partitions a cluster in two to optimize the criterion function ● Partitioning done via k-means with k=2 ● I2 criterion function ● Measures the similarity between each context in a cluster and that cluster's centroid, then sums across all clusters to find the value
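A rough sketch of repeated bisections with an I2-style criterion follows. Several choices are assumptions made for the example rather than details taken from the slides: I2 is computed as the sum of cosines between each unit-length context vector and its cluster centroid (which equals the length of the cluster's summed vector), the largest cluster is always the one split, and the k=2 step is seeded deterministically. The function names are hypothetical.

```python
from math import sqrt

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(vectors):
    """Component-wise sum of a list of vectors."""
    return [sum(col) for col in zip(*vectors)]

def i2(clusters):
    """Sum over clusters of cos(context, centroid); for unit vectors this is
    the length of each cluster's summed vector."""
    return sum(sqrt(sum(x * x for x in composite(c))) for c in clusters if c)

def bisect(cluster, iterations=10):
    """Split one cluster in two with a small cosine-based k-means (k=2)."""
    # Assumed seeding: the first vector and the vector least similar to it.
    seed_a = cluster[0]
    seed_b = min(cluster[1:], key=lambda v: dot(seed_a, v))
    centroids = [seed_a, seed_b]
    best = None
    for _ in range(iterations):
        parts = [[], []]
        for v in cluster:
            nearest = 0 if dot(v, centroids[0]) >= dot(v, centroids[1]) else 1
            parts[nearest].append(v)
        if not parts[0] or not parts[1]:
            break
        best = parts
        centroids = [unit(composite(p)) for p in parts]
    return best

def repeated_bisections(vectors, k):
    """Start with one cluster and bisect until k clusters exist."""
    clusters = [[unit(v) for v in vectors]]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        target = max(candidates, key=len)  # simplification: split the largest
        parts = bisect(target)
        if parts is None:
            break
        clusters.remove(target)
        clusters.extend(parts)
    return clusters

# Toy context vectors: two point mostly along each axis.
contexts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = repeated_bisections(contexts, k=2)
print(len(clusters), i2(clusters))
```

On this toy data the split raises the criterion value relative to keeping all four contexts in a single cluster, which is the improvement the bisection step is looking for.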
  • 78. Cluster stopping ● Find k where the criterion function stops improving ● PK2 (Hartigan, 1975) takes the ratio of the criterion function at successive values of k ● PK3 takes twice the criterion function at k divided by the sum of its values at k-1 and k+1 ● PK2 and PK3 stop when these ratios are within 1 standard deviation of 1 ● Gap Statistic (Tibshirani et al., 2001) compares the observed data with a reference sample of noise and finds the k with the greatest divergence from noise
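The PK2 and PK3 ratios can be sketched as below. The criterion values are invented for illustration. PK3 is assumed here to take the form 2·crfun(k) / (crfun(k-1) + crfun(k+1)), with a sum in the denominator, and the stopping rule (first k whose ratio is within one standard deviation of 1) is a literal reading of the slide; the exact rule used by SenseClusters may differ in detail.

```python
from statistics import pstdev

def pk2(crfun):
    """crfun maps k to a criterion value; PK2(k) = crfun[k] / crfun[k-1]."""
    return {k: crfun[k] / crfun[k - 1] for k in sorted(crfun) if k - 1 in crfun}

def pk3(crfun):
    """PK3(k) = 2 * crfun[k] / (crfun[k-1] + crfun[k+1]) (assumed form)."""
    return {
        k: 2 * crfun[k] / (crfun[k - 1] + crfun[k + 1])
        for k in sorted(crfun)
        if k - 1 in crfun and k + 1 in crfun
    }

def stop_k(ratios):
    """First k whose ratio lies within one standard deviation of 1, else None."""
    spread = pstdev(list(ratios.values()))
    for k in sorted(ratios):
        if abs(ratios[k] - 1.0) <= spread:
            return k
    return None

# Invented criterion values that rise quickly and then level off.
crfun = {1: 10.0, 2: 14.0, 3: 16.5, 4: 16.9, 5: 17.1, 6: 17.2}
print(stop_k(pk2(crfun)), stop_k(pk3(crfun)))
```

Both ratios approach 1 once adding another cluster no longer improves the criterion function much, so the chosen k marks the point where the curve flattens.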
  • 79. Cluster labeling ● Clusters are made up of contexts that use the target word in a particular sense ● Find the top N most associated bigrams that are unique to that cluster (discriminating features) and the top N that are most associated without regard to which cluster they are in (descriptive features) ● Use standard measures of association like log-likelihood, etc. ● Definition via a few well chosen bigrams
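The distinction between descriptive and discriminating labels can be sketched as follows. Raw bigram frequency stands in for a proper measure of association such as log-likelihood, which is a deliberate simplification; the toy clusters and the function names are hypothetical.

```python
from collections import Counter

def bigrams(context):
    """Adjacent word pairs from a context, lowercased, punctuation stripped."""
    words = [w.strip(".,").lower() for w in context.split()]
    return list(zip(words, words[1:]))

def label_clusters(clusters, n=3):
    """Return descriptive and discriminating bigram labels for each cluster."""
    counts = [Counter(b for ctx in cluster for b in bigrams(ctx)) for cluster in clusters]
    labels = []
    for i, own in enumerate(counts):
        # Bigrams seen in any other cluster cannot be discriminating.
        elsewhere = set()
        for j, other in enumerate(counts):
            if j != i:
                elsewhere.update(other)
        ranked = [b for b, _ in own.most_common()]
        labels.append({
            "descriptive": ranked[:n],
            "discriminating": [b for b in ranked if b not in elsewhere][:n],
        })
    return labels

# Toy clusters of contexts for the target word "cold".
clusters = [
    ["in the common cold season", "in the common cold clinic"],
    ["in the cold war era", "in the cold war years"],
]
labels = label_clusters(clusters)
for label in labels:
    print(label)
```

The bigram "in the" is frequent in both clusters, so it can appear as a descriptive label but never as a discriminating one, while "common cold" and "cold war" separate the two senses.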
  • 80. The result? ● Contexts that contain a particular target word ● Organized by sense, where each cluster contains contexts that use the target word in approximately the same sense
  • 81. Identities? ● Much like word senses, except they apply to names ● Many distinct individuals have the same name ● How do we differentiate among them? Same techniques can be used.
  • 82. Synonyms ● Might also be interested in new words for old ideas ● How similar are the contexts in which these new words are being used to the old contexts?
  • 83. Synonyms ● Might also be interested in new words for old ideas ● How similar are the contexts in which these new words are being used to the old contexts? ● Or different words for the same idea ● Can use the same techniques to recognize them
  • 84. The Future of Word Sense Discrimination ● Automatically identifying senses by clustering contexts continues to improve ● Automatically creating definitions remains a challenging but fascinating problem in its own right ● Given a cluster of contexts, create a definition that captures why these contexts are in the same cluster ● Related task at SemEval-2015 http://alt.qcri.org/semeval2015/task15/
  • 85. The Future of Word Sense Discrimination ● Once a definition has been created, use that to position the new sense in a WordNet or ontology ● Related task at SemEval-2016 http://alt.qcri.org/semeval2016/task14/
  • 86. Conclusion ● Dictionaries look backwards, and only include words once they have a good chance of long-term acceptance ● The process by which dictionaries are created can be seen as a kind of horizon scanning ● New words, new senses ● Standards for inclusion in the OED are very high
  • 87. Conclusion ● These techniques can be used to spot emerging words, senses and identities in raw text ● These can be harbingers of future trends
  • 88. Thank you! ● Measures of Association ● http://ngram.sourceforge.net ● Word Sense Discrimination ● http://senseclusters.sourceforge.net
  • 89. LSI, LSA, and Schütze ● LSI: Deerwester, S., et al. (1988). Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science, 25, pp. 36–40. ● LSA: Landauer, T. K., and Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, pp. 211–240. ● Schütze: Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), pp. 97–123. ● SenseClusters: http://senseclusters.sourceforge.net