Timo Honkela: Spaces of Knowledge

Timo Honkela
Modeling Meaning and Knowledge
1 Feb 2016
timo.honkela@helsinki.fi

http://www.cs.cornell.edu/Info/Department/Annual95/Faculty/Salton.html
Advent of vector-based
information retrieval
● Gerard Salton: Documents and
queries represented as vectors of
term counts
● Similarity between a document
and a query is given by the cosine
of the angle between the query
vector and the document vector
● TF-IDF (term frequency-inverse
document frequency) for weighting
of a term in a document
● Inverse document frequency had
been introduced by Karen
Spärck-Jones in 1972
https://en.wikipedia.org/wiki/Gerard_Salton
https://en.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones
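
The weighting idea above can be sketched in a few lines of Python. This is a minimal illustration, not Salton's original formulation: it assumes raw term counts for TF and the plain logarithm log(N / df) for IDF, which is only one of several variants in use. The function name tf_idf and the three toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts within this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Toy documents (hypothetical); "the" occurs in every one of them.
docs = [
    "the university society university university".split(),
    "the society culture society".split(),
    "the university research".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "the" here, gets weight zero, while a term concentrated in few documents is weighted up.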

[Figure: documents D1, D2, D3 and queries Q1, Q2, Q3 drawn as vectors in a plane with the axes "University" and "Society"]
Document 1: The word “university”
appears three times and “society” once, etc.
Query 1: “university”
https://en.wikipedia.org/wiki/Cosine_similarity
https://en.wikipedia.org/wiki/Sine
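
As a worked example of the cosine measure, Document 1 and Query 1 can be written as two-dimensional count vectors. The sketch below assumes the axis order ("university", "society") and treats the one-word query as a count vector with a single 1; the helper name cosine is chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Axes: (count of "university", count of "society")
document_1 = (3, 1)  # "university" three times, "society" once
query_1 = (1, 0)     # the query "university"
similarity = cosine(document_1, query_1)  # equals 3 / sqrt(10)
```

Because the cosine depends only on direction, a document twice as long with the same proportions of words would get the same score.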

Contexts tell about meaning
● John Rupert Firth: “You shall know a word
by the company it keeps”
● Ludwig Wittgenstein: “For a large class of
cases of the employment of the word
‘meaning’—though not for all—this word can be
explained in this way: the meaning of a word is
its use in the language” (PI 43)
https://en.wikipedia.org/wiki/John_Rupert_Firth
http://plato.stanford.edu/entries/wittgenstein/#Mea
https://en.wikipedia.org/wiki/Ludwig_Wittgenstein

Analysis of term-document matrices
● The same idea as in information retrieval can
also be applied in studying words and
expressions
● Statistical analysis of document-term matrices
gives rise to models of the relationships between
words or documents
● Classical examples include
– Latent Semantic Analysis (Deerwester, Dumais et al. 1988)
– Self-Organizing Semantic Maps (Ritter & Kohonen 1989)
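
To illustrate how the retrieval idea carries over to words, the sketch below compares rows of a term-document matrix instead of columns. It is neither Latent Semantic Analysis nor a self-organizing map: LSA would additionally reduce the matrix with a truncated singular value decomposition, which is left out here. The counts and the helper names are hypothetical.

```python
import math

# Hypothetical counts: rows are terms, columns are three documents.
term_document = {
    "university": [3, 0, 2],
    "research":   [2, 0, 3],
    "fairy":      [0, 4, 0],
    "tale":       [0, 3, 1],
}

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(term):
    """The other term whose document distribution is closest to that of term."""
    others = (t for t in term_document if t != term)
    return max(others, key=lambda t: cosine(term_document[term], term_document[t]))
```

With these counts, most_similar("university") returns "research" and most_similar("fairy") returns "tale", because the paired words occur in the same documents.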

Word spaces, clusters, clouds, ...
● The analysis of the statistical information
related to word contexts can be turned into
visualizations of the word relations

Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995
Automated learning of word relations
using self-organizing map on text context data
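
A rough sketch of the idea follows: each word is described by counts of its immediate neighbours, and a small self-organizing map is trained on these context vectors. This is not the setup of the 1995 study. The four toy sentences, the one-dimensional map with four units, the learning-rate and neighbourhood schedules, and all function names are assumptions made to keep the example short.

```python
import math
import random
from collections import defaultdict

sentences = ["the king rules", "the queen rules", "a dog barks", "a cat barks"]

# Context vector of a word: counts of its left and right neighbours.
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
contexts = defaultdict(lambda: [0.0] * (2 * len(vocab)))
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        if i > 0:                       # left neighbour
            contexts[w][index[words[i - 1]]] += 1
        if i + 1 < len(words):          # right neighbour
            contexts[w][len(vocab) + index[words[i + 1]]] += 1

def best_unit(units, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(units)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(units[k], x)))

def train_som(data, n_units=4, epochs=200, seed=0):
    """Train a one-dimensional SOM (assumed schedules, fixed seed)."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                    # learning rate decays
        radius = 0.5 + (n_units / 2) * frac  # neighbourhood shrinks
        for x in data:
            winner = best_unit(units, x)
            for k, unit in enumerate(units):
                # Units near the winner on the map move more.
                h = math.exp(-((k - winner) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    unit[d] += rate * h * (x[d] - unit[d])
    return units

targets = ["king", "queen", "dog", "cat"]
units = train_som([contexts[w] for w in targets])
word_to_unit = {w: best_unit(units, contexts[w]) for w in targets}
```

Words that occur in identical contexts, such as "king" and "queen" here, have identical context vectors and therefore always land on the same map unit.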

Map of Finnish Science
(T. Honkela & M. Klami 2007)
[Figure: map with regions labelled Chemistry; Natural sciences and engineering; Bio- and environmental sciences; Health; Culture and society]

From term weighting
to term selection
● TF-IDF is a widely used method for term
weighting
● Likey (Language Independent Keyphrase
Extraction) was developed to select terms
automatically by comparing the corpus at hand
with another corpus, called a reference corpus
(Paukkeri et al. 2008, Paukkeri & Honkela 2010)
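
The comparison against a reference corpus can be sketched as a ratio of frequency ranks: a word that ranks high in the corpus at hand but low (or not at all) in the reference corpus gets a small ratio and is selected first. This is a simplified reading of Likey restricted to single words; the published method works on phrases and has further details. The function names, the alphabetical tie-breaking, and the rank assigned to words missing from the reference corpus are assumptions of this sketch.

```python
from collections import Counter

def frequency_ranks(tokens):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Ties are broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}

def select_terms(corpus_tokens, reference_tokens):
    """Words of the corpus ordered by rank ratio, best candidates first."""
    corpus_ranks = frequency_ranks(corpus_tokens)
    reference_ranks = frequency_ranks(reference_tokens)
    # Assumed rank for words that never occur in the reference corpus.
    unseen_rank = len(reference_ranks) + 1
    ratios = {w: r / reference_ranks.get(w, unseen_rank)
              for w, r in corpus_ranks.items()}
    return sorted(ratios, key=lambda w: (ratios[w], w))

# Toy data (hypothetical), standing in for the two corpora.
corpus = "the research of the university the research".split()
reference = "the of the and the of we and the".split()
selected = select_terms(corpus, reference)
```

Function words such as "the" and "of" rank similarly in both corpora, so their ratio stays near 1 and they end up last in the list.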

Most frequent word forms (types) in two corpora

      Academy corpus          Europarl corpus
 1.   the         1276847     the    2023617
 2.   of          1067918     of      945622
 3.   and          817852     to      883206
 4.   in           625330     and     717718
 5.   to           357453     in      611421
 6.   for          225307     that    473739
 7.   is           205723     a       445775
 8.   on           162509     is      445119
 9.   research     157251     we      305590
10.   be           151475     for     296092
11.   with         136854     i       290412
12.   will         135992     this    286924
13.   as           122707     on      274614
14.   are          116508     it      251343
15.   by           113878     be      246917
16.   university    98003     are     197082
...

[Diagram with the elements: Academy corpus, Reference corpus (EU parliament), Likey, Term list, Documents, Terms, SOM, Document map]

Extralinguistic contexts
● Human beings learn language in real world
contexts that include visual, tactile, etc.
perceptions
● In order to model meaning in a human-like
manner, these other modalities have to be taken
into account
● In a project called “Multimodally Grounded
Language Technology” we associated visual
patterns of human movements with expressions
that had been used to describe these
movements

[Figure: movement patterns labelled RUNNING, WALKING, LIMPING, JOGGING]

Modeling subjectivity
of meaning
● In our method Grounded Intersubjective
Concept Analysis (GICA), we added a new
“dimension” to the term-document matrices
● We did not assume that each person
understands and uses every word in a similar
manner but wanted to model the personal
variation
● This was achieved by using Subject-Object-
Context tensors (Honkela et al. 2012)
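
The added "dimension" can be pictured as a three-way array indexed by subject, object and context. The sketch below only illustrates that data structure and one simple way to look at personal variation; it does not reproduce the analysis in Honkela et al. 2012. The subjects, the ratings, the unfolding into (subject, object) rows, and the distance measure are all hypothetical choices.

```python
# Hypothetical ratings on a 0..1 scale: how strongly each subject
# associates an object (a word) with a context item.
subjects = ["anna", "ben"]
objects = ["health", "wealth"]
contexts = ["exercise", "money"]

# tensor[s][o][c]: rating by subject s of object o in context c
tensor = [
    [[0.9, 0.1], [0.2, 0.8]],  # anna
    [[0.4, 0.6], [0.1, 0.9]],  # ben
]

def unfold(tensor):
    """Flatten the tensor to one context profile per (subject, object) pair."""
    rows = {}
    for s, subject in enumerate(subjects):
        for o, obj in enumerate(objects):
            rows[(subject, obj)] = tensor[s][o]
    return rows

def disagreement(rows, obj):
    """Euclidean distance between the two subjects' profiles for one object."""
    a = rows[(subjects[0], obj)]
    b = rows[(subjects[1], obj)]
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

rows = unfold(tensor)
```

In this toy data the two subjects differ far more in how they relate "health" to the contexts than in how they relate "wealth", which is the kind of personal variation a plain term-document matrix cannot express.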

GICA: Grounded Intersubjective
Concept Analysis
Honkela, Raitio, Lagus & Nieminen 2012

Analysis of “health” in the
State of the Union addresses
Subjects on objects in contexts:
Using GICA method to quantify
epistemological subjectivity.
Timo Honkela, Juha Raitio, Krista Lagus,
Ilari T. Nieminen, Nina Honkela, and Mika Pantzar.
Proc. of IJCNN 2012.

Thank you for
your attention!
