Semantic Relations
1. Similarity of Semantic Relations
Peter D. Turney
National Research Council Canada
Presented by: Jennifer Lee
November 14, 2008
CSI 5386
2. Attributional Similarity
Two words, A and B, with a high degree of
attributional similarity are called synonyms.
An example of a typical synonym question that
appears in the TOEFL exam:
Stem: Levied
Choices: (a) imposed
(b) believed
(c) requested
(d) correlated
Solution: (a) imposed
3. Attributional Similarity
A measure of attributional similarity: sima(A, B) ∈ ℝ.
Related notions: semantic relatedness vs. semantic distance.
Semantic relatedness is a more general concept than similarity.
Semantic relatedness is the same as attributional
similarity.
4. Attributional Similarity
Examples of semantic relatedness:
Similar entities: bank:trust company.
Dissimilar but related entities:
Meronymy: car:wheel.
Antonymy: hot:cold.
Any functional relationship or frequent
association: pencil:paper, penguin:Antarctica.
5. Attributional Similarity
Types of attributional similarity:
Semantically associated: bee:honey.
Semantically similar: deer:pony.
Both: doctor:nurse.
The term semantic similarity is misleading as it
refers to a type of attributional similarity, yet
relational similarity is not any less semantic
than attributional similarity.
Hence, we use the term taxonomical
similarity.
6. Relational Similarity
Relational similarity:
When two pairs of words have a high degree of
relational similarity, we say they are analogous.
Measured by: simr(A:B, C:D) ∈ ℝ
A:B::C:D
A is to B as C is to D
7. Verbal Analogy
Examples:
traffic:street::water:riverbed
mason:stone::carpenter:wood
It seems like in the second example, the
relational similarity can be reduced to
attributional similarity.
8. Verbal Analogy
A typical analogy question from SAT:
Stem: mason:stone
Choices: (a) teacher:chalk
(b) carpenter:wood
(c) soldier:gun
(d) photograph:camera
(e) book:word
Solution: (b) carpenter:wood
9. Near Analogy
When there is a high degree of relational similarity
between two pairs, A:B and C:D, and also a
high degree of attributional similarity between A and
C, and between B and D, the analogy is a near analogy.
Otherwise, it is a far analogy.
Which one of these analogies is a near analogy?
(mason:stone::carpenter:wood)
(traffic:street::water:riverbed)
10. Measures of Attributional
Similarity
Many algorithms have been proposed.
Measures of attributional similarity have been
studied extensively.
Applications:
Problems such as recognizing synonyms,
information retrieval, determining semantic
orientation, grading student essays, measuring
textual cohesion, and word sense disambiguation.
11. Measuring Attributional Similarity
Algorithms:
Lexicon-based, corpus-based, or
a hybrid of the two.
We would expect lexicon-based algorithms to
be better at capturing synonymy than corpus-based
algorithms, but this is not the case.
13. Measures of Relational Similarity
Not well developed.
Potential applications are not so well known.
Many problems that involve semantic relations
would benefit from an algorithm for measuring
relational similarity:
NLP, information retrieval and information
extraction.
14. Using Attributional Similarity to
Solve Analogies
We could score each candidate analogy by the
average of the attributional similarities, sima,
between A and C and between B and D:

score(A:B::C:D) = (1/2) (sima(A, C) + sima(B, D))

Performance of algorithms was measured by
precision, recall, and F.
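As a sketch, the scoring rule above can be written directly in code. The sim_a lookup table below is a toy stand-in for a real attributional similarity measure (the values are invented for illustration, not taken from any published algorithm):

```python
# Toy attributional similarities (hypothetical values, for illustration only).
SIM_A = {
    ("mason", "carpenter"): 0.8,
    ("stone", "wood"): 0.7,
    ("mason", "teacher"): 0.3,
    ("stone", "chalk"): 0.4,
}

def sim_a(x, y):
    """Toy attributional similarity: symmetric lookup, default 0."""
    return SIM_A.get((x, y), SIM_A.get((y, x), 0.0))

def score(a, b, c, d):
    """score(A:B::C:D) = (sim_a(A, C) + sim_a(B, D)) / 2."""
    return (sim_a(a, c) + sim_a(b, d)) / 2.0

# The choice pair with the highest score is the guessed answer
# (two of the five SAT choices shown, for brevity).
choices = [("carpenter", "wood"), ("teacher", "chalk")]
best = max(choices, key=lambda cd: score("mason", "stone", *cd))
```

With these toy values, carpenter:wood wins, matching the SAT solution.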
15. Using Attributional Similarity to
Solve Analogies
precision = (number of correct guesses) / (total number of guesses made)

recall = (number of correct guesses) / (maximum possible number correct)

F = (2 × precision × recall) / (precision + recall)
16. Using Attributional Similarity to
Solve Analogies
For example, using the algorithm of Hirst and
St-Onge (1998), out of 374 SAT analogy
questions, 120 questions were answered
correctly, 224 incorrectly, and 30 questions
were skipped.
Precision was 120/(120 + 224) = 34.9%.
Recall was 120/(120 + 224 + 30) = 32.1%.
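The three measures are easy to verify; this snippet recomputes precision, recall, and F from the Hirst and St-Onge counts above:

```python
# Precision, recall, and F for the Hirst and St-Onge (1998) example:
# 120 correct, 224 incorrect, 30 skipped, out of 374 SAT questions.
correct, incorrect, skipped = 120, 224, 30

precision = correct / (correct + incorrect)         # over guesses actually made
recall = correct / (correct + incorrect + skipped)  # over all 374 questions
f = 2 * precision * recall / (precision + recall)

print(round(100 * precision, 1))  # 34.9
print(round(100 * recall, 1))     # 32.1
print(round(100 * f, 1))          # 33.4
```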
17. Using Attributional Similarity to
Solve Analogies
Performance of attributional similarity measures
on the 374 SAT questions. The bottom two
rows are included for comparison.
Algorithm                    Type              Precision  Recall  F
Hirst and St-Onge (1998)     Lexicon-based     34.9       32.1    33.4
Jiang and Conrath (1997)     Hybrid            29.8       27.3    28.5
Leacock and Chodorow (1998)  Lexicon-based     32.8       31.3    32.0
Lin (1998b)                  Hybrid            31.2       27.3    29.1
Resnik (1995)                Hybrid            35.7       33.2    34.4
Turney (2001)                Corpus-based      35.0       35.0    35.0
Turney and Littman (2005)    Relational (VSM)  47.7       47.1    47.4
Random                       Random            20.0       20.0    20.0
18. Using Attributional Similarity to
Solve Analogies
We conclude that there are enough near
analogies in the 374 SAT questions for
attributional similarity to perform better than
random guessing.
But not enough near analogies for attributional
similarity to perform as well as relational
similarity.
19. Recognizing Word Analogies
First attempted by a system called Argus, using
a small hand-built semantic network.
Argus was based on a spreading activation
model and did not explicitly attempt to measure
relational similarity. Therefore, it could only
solve a limited set of analogy questions.
20. Recognizing Word Analogies
Turney et al. (2003) combined 13 independent
modules to answer SAT questions. The VSM is the
best of the 13, achieving a score of 47%.
Veale (2004) applied a lexicon-based approach
to the same 374 SAT questions, attaining a
score of 43%.
WordNet was used to compute a quality measure,
based on the similarity between the A:B paths and the C:D
paths.
21. Latent Relational Analysis
Turney (2005) introduced Latent Relational
Analysis (LRA), an enhanced version of the
VSM approach to measure relational similarity.
LRA has potential in many areas, including
information extraction, word sense
disambiguation, and information retrieval.
LRA relies on three resources: a search engine
with a large corpus of text, a thesaurus of
synonyms, and an efficient implementation of
SVD.
22. Structure Mapping Theory
The most influential work on the modeling of analogy-making,
implemented in the Structure Mapping Engine
(SME).
SME produces an analogical mapping between a
source domain and a target domain, using predicate logic.
Example analogy:
Source domain: solar system (basic objects are sun
and planet)
Target domain: Rutherford's model of the atom
(basic objects are nucleus and electrons)
23. Structure Mapping Theory
Each individual connection in an analogical
mapping implies that the connected relations
are similar.
Later versions of SME allowed similar, non-identical
relations to match.
Although SME focuses on the mapping process
as a whole rather than on measuring similarity
between any two particular relations, LRA could
enhance the performance of SME, and vice versa.
24. Metaphor
Novel metaphors can be understood through
analogy, but conventional metaphors are simply
recalled from memory.
It may be fruitful to combine an algorithm
for handling conventional metaphors (Dolan, 1995)
with LRA and SME for handling novel
metaphors.
25. Metaphor
Lakoff and Johnson (1980):
Metaphorical sentence → SAT-style verbal analogy
He shot down all of my arguments. aircraft:shoot down::argument:refute
I demolished his argument. building:demolish::argument:refute
You need to budget your time. money:budget::time:schedule
I’ve invested a lot of time in her. money:invest::time:allocate
My mind just isn’t operating today. machine:operate::mind:think
Life has cheated me. charlatan:cheat::life:disappoint
Inflation is eating up our profits. animal:eat::inflation:reduce
26. Classifying Semantic Relations
The problem is to classify a noun-modifier pair
according to the semantic relation between the
head noun and the modifier.
Example: laser printer
Rosario and Hearst (2001) trained a neural
network to distinguish 13 classes of semantic
relations in the medical domain.
Lexical resources used: MeSH and UMLS
Each noun-modifier pair is represented by a
feature vector.
27. Classifying Semantic Relations
Nastase and Szpakowicz (2003) classified 600
general noun-modifier pairs using WordNet and
Roget's Thesaurus as lexical resources.
Vanderwende (2004) used hand-built rules,
together with a lexical knowledge base.
Any classification of semantic relations employs
some implicit notion of relational similarity.
28. Classifying Semantic Relations
Barker and Szpakowicz (1998) tried a corpus-based
approach that explicitly uses a measure of
relational similarity.
Moldovan et al. (2004) also used a measure of
relational similarity to map each noun and
modifier into semantic classes in WordNet.
The pairs are taken from a corpus, and the
surrounding context in the corpus is used in a word
sense disambiguation algorithm to improve the
mapping.
29. Classifying Semantic Relations
Turney and Littman (2005) used the VSM (as
a component in a single nearest-neighbour
learning algorithm) to measure relational
similarity. This paper focuses on LRA.
Lauer (1995) used a corpus-based approach to
paraphrase noun-modifier pairs by inserting
prepositions.
Example: reptile haven → haven for reptiles.
Lapata and Keller (2004) improved the result by
using the AltaVista database as a corpus.
30. Word Sense Disambiguation,
Information Extraction
If we can identify the relations between a given
word and its context, then we can disambiguate
the given word.
For example, consider the word plant, and suppose
plant appears in some text near food: knowing the
semantic relation between plant and food can point to
the living-organism sense rather than the factory sense.
Information Extraction:
Given an input document and a specific relation R,
extract all pairs of entities (if any) that have the
relation R in the document.
Example: John Smith and Hardcom Corporation.
31. Information Extraction
With the VSM approach, there would be a training
set of labeled examples of the relation.
Each example would be represented by a vector of
pattern frequencies.
Given two entities, we could construct a vector
representing their relation
Then measure the relational similarity between the
unlabeled vector and each of the labeled training
vectors.
32. Information Extraction and
Question Answering
A likely problem:
Training vectors would be relatively dense,
while the new unlabeled vector for the two entities
would be sparse.
Moldovan et al. (2004) propose to map a given
question to a semantic relation, and then search
for that relation in a corpus of semantically
tagged text.
33. Automatic Thesaurus Generation
Hearst (1992) presents an algorithm that can
automatically generate a thesaurus or
dictionary:
Learning hyponym relations, meronym relations, and more.
Hearst (1992) and Berland and Charniak (1999) use
manually generated rules to mine text for
semantic relations.
Turney and Littman (2005) also use a manually
generated set of 64 patterns.
34. Automatic Thesaurus Generation
Instead of manually generating new rules or
patterns for each semantic relation, LRA can
automatically learn patterns from a large
corpus.
Girju, Badulescu, and Moldovan (2003) present
an algorithm for learning meronyms from a
corpus.
They supplemented manually generated rules with
automatically learned constraints.
35. Information Retrieval
Veale (2003) proposes using an algorithm for
solving word analogies, based on WordNet, for
information retrieval.
Example: Hindu bible → the Vedas.
Focus on analogies of the form:
adjective:noun::adjective:noun
Example: Muslim:mosque::Christian:church
An unsupervised algorithm for discovering
analogies by clustering words from two different
corpora was developed by Marx et al. (2002).
36. Identifying Semantic Roles
Semantic roles are merely a special case of
semantic relations (Moldovan et al).
Example:
Semantic frame: statement
Semantic roles: speaker, addressee, and message
It is helpful to view semantic frames and their
semantic roles as sets of semantic relations.
37. Measuring Attributional Similarity
with the Vector Space Model
In the VSM approach to information retrieval,
queries and documents are represented by
vectors.
Elements in these vectors are the frequencies
of words in the corresponding queries and
documents.
The attributional similarity between a query and
a document is measured by the cosine of the
angle between their corresponding vectors.
38. Singular Value Decomposition
LRA enhances the VSM by using SVD to
smooth vectors.
SVD improves both document-query and word-word
attributional similarity measures.
39. Measuring Relational Similarity
with VSM
Given two unknown relations, R1 (between a
pair of words A and B) and R2 ( between C and
D), we wish to measure the relational similarity
between R1 and R2.
First, we need to create vectors r1 and r2 that
represent the two relations:
r1 = ⟨r1,1, ..., r1,n⟩
r2 = ⟨r2,1, ..., r2,n⟩
40. Measuring Relational Similarity
with VSM
The measure of similarity of R1 and R2 is given
by the cosine of the angle θ between r1 and r2:

cosine(θ) = (Σi r1,i · r2,i) / (√(Σi r1,i²) · √(Σi r2,i²)) = (r1 · r2) / (|r1| · |r2|)
A vector r characterizes the relationship between two
words, X and Y.
It is created by counting the frequencies of short
phrases containing X and Y.
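A minimal sketch of the cosine computation. The example vectors use the pattern frequencies for quart:volume shown later in the deck, with the log(x + 1) transformation LRA applies to raw frequencies:

```python
import math

def cosine(r1, r2):
    """Cosine of the angle between two relation vectors r1 and r2."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    return dot / (norm1 * norm2)

# Frequencies of "quart P volume" and "volume P quart" for four patterns
# (from the step-7 table), log-transformed as log(x + 1).
r1 = [math.log(x + 1) for x in (4, 1, 5, 19)]
r2 = [math.log(x + 1) for x in (10, 0, 2, 16)]
similarity = cosine(r1, r2)
```

Parallel vectors give cosine 1 and orthogonal vectors give cosine 0, so the result always falls in [-1, 1] (and in [0, 1] for non-negative frequency vectors).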
41. Measuring Relational Similarity
with the VSM
If the number of hits for a query is x, then the
corresponding element in the vector r is:
log(x + 1).
To answer multiplechoice analogy questions,
vectors are created for the stem pair and each
choice pair. Then cosines are calculated for the
angles between stem pair and each choice pair.
42. Sample Multiple Choice
This SAT question:
Stem: quart:volume
Choices:
(a) day:night
(b) mile:distance
(c) decade:century
(d) friction:heat
(e) part:whole
Solution: (b) mile:distance
43. Measuring Relational Similarity
with VSM
Turney and Littman (2005) used the AltaVista
search engine to obtain the frequency information
needed to build vectors for the VSM. But AltaVista
later changed its policy toward automated
searching.
The VSM uses the hit count, whereas LRA uses the
number of passages (strings) matching the
query.
44. Measuring Relational Similarity
with VSM
For the experiment:
The Waterloo MultiText System (WMTS) is used; its
corpus contains about 5 × 10^10 English words.
Lin's (1998a) automatically generated
thesaurus is queried online to fetch
lists of synonyms.
Lin's thesaurus:
Generated by parsing a corpus of 5 × 10^7
words.
45. Measuring Relational Similarity
with VSM
Lin's thesaurus provides a list of words sorted
in decreasing order of similarity.
Convenient for LRA.
WordNet, in contrast, provides a list of words
grouped by possible senses, with the groups
sorted by frequency of sense.
46. Steps of LRA
Let's suppose we want to calculate the
relational similarity between the pair
quart:volume and the pair mile:distance.
The LRA consists of 12 steps:
Step 1: Find alternates:
For each word pair A:B in the input set, look in Lin's
thesaurus for the top num_sim words that are most
similar to A, and likewise for B, forming the
alternate pairs A′:B and A:B′.
47. Alternate Forms of the original
pair quart:volume
Word pair      Similarity  Frequency  Filtering step
quart:volume   NA          632        Accept (original pair)
pint:volume    0.210       372
gallon:volume  0.159       1500       Accept (top alternate)
liter:volume   0.122       3323       Accept (top alternate)
50. Steps of LRA
Step 2: Filter alternates:
For each alternate pair, send a query to the WMTS
to find the frequency of phrases that begin with one
member of the pair and end with the other. The
phrases cannot have more than max_phrase words (here,
5). Select the top num_filter most
frequent alternate pairs and discard the remainder.
51. Steps of LRA
Step 3: Find phrases
For each pair, make a list of phrases in the corpus
that contain the pair. Query the WMTS for all
phrases that begin with one member of the pair and
end with the other (in either order); suffixes
are ignored.
The phrases cannot have more than max_phrase words,
and there must be at least one word between the two
members of the pair.
52. Examples of phrases that contain quart and volume:
_____________________________________
quarts liquid volume
quarts of volume
quarts in volume
quart total volume
quart of spray volume
volume in quarts
volume capacity quarts
volume being about two quarts
volume of milk in quarts
volume include measures like quart
53. Steps of LRA
Step 4: Find patterns:
For each phrase found in step 3, build patterns from
the intervening words. A pattern is constructed by
replacing any, all, or none of the intervening words with
wild cards, so a phrase with n words generates 2^(n-2)
patterns.
For each pattern, count the number of pairs
(original and alternates) with phrases that match the
pattern. Keep the top num_patterns (here, 4000)
most frequent patterns and discard the rest.
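The wildcard generation in step 4 can be sketched as follows: for a phrase of n words, the function enumerates the 2^(n-2) patterns obtained by replacing each subset of the intervening words with "*":

```python
from itertools import product

def patterns(phrase_words):
    """Generate the 2**(n-2) wildcard patterns for a phrase of n >= 2 words:
    keep the first and last word fixed, and replace each subset of the
    intervening words with '*'."""
    first, *middle, last = phrase_words
    out = []
    for mask in product([False, True], repeat=len(middle)):
        mid = ["*" if wild else w for w, wild in zip(middle, mask)]
        out.append(" ".join([first] + mid + [last]))
    return out

# A 3-word phrase (n = 3) yields 2**(3-2) = 2 patterns.
print(patterns(["quart", "of", "volume"]))
# ['quart of volume', 'quart * volume']
```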
54. Steps of LRA
Step 5: Map pairs to rows
To build matrix X, create a mapping of word pairs to
row numbers.
For each A:B, create a row for A:B and another row
for B:A.
Step 6: Map patterns to columns
Create a mapping of the top num_patterns to
column numbers
For each pattern P, create a column for word1 P
word2 and another column for word2 P word1
55. Steps of LRA
Step 7: Generate a sparse matrix
Frequencies of various patterns for quart:volume:

                        P = "in"  P = "* of"  P = "of *"  P = "* *"
freq("quart P volume")  4         1           5           19
freq("volume P quart")  10        0           2           16
56. Steps of LRA
Step 8: Calculate entropy
Let m be the number of rows in matrix X and let n
be the number of columns.
To calculate the entropy of a column, we need to
convert the column into a vector of probabilities.
Let pi,j be the probability of xi,j:

pi,j = xi,j / Σ(k=1..m) xk,j
57. Step 8: cont
The entropy of the jth column is:

Hj = −Σ(k=1..m) pk,j · log(pk,j)

Give more weight to columns (patterns) whose
frequencies vary substantially from one row to
the next. Therefore we weight the cell xi,j by
wj = 1 − Hj / log(m), which varies from 0 when the
column is uniform to 1 when the entropy is minimal.
We also apply a log transformation to the
frequencies: log(xi,j + 1).
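A compact sketch of the step-8 transformation, assuming every kept column has at least one nonzero entry (true for the most frequent patterns) and the matrix has more than one row:

```python
import math

def weight_columns(X):
    """Step-8 transformation of matrix X (a list of rows):
    replace x[i][j] by w_j * log(x[i][j] + 1), where
    w_j = 1 - H_j / log(m) and H_j is the entropy of column j.
    Assumes m > 1 and every column sum is positive."""
    m, n = len(X), len(X[0])
    out = [[0.0] * n for _ in range(m)]
    for j in range(n):
        col_sum = sum(X[i][j] for i in range(m))
        # Column probabilities p_ij = x_ij / sum_k x_kj
        probs = [X[i][j] / col_sum for i in range(m)]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        w = 1.0 - entropy / math.log(m)
        for i in range(m):
            out[i][j] = w * math.log(X[i][j] + 1)
    return out
```

A column concentrated in one row (entropy 0) gets weight 1, while a perfectly uniform column (maximal entropy) gets weight 0, as described above.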
58. Step 8: cont, Step 9
Step 8 (cont): For all i and j, replace the original
value xi,j in X by the new value wj · log(xi,j + 1).
Step 9: Apply SVD
SVD decomposes a matrix into a product of three
matrices, U Σ V^T, where U and V are column-orthonormal
and Σ is a diagonal matrix of singular values.
Keeping only the top k singular values, the matrix
Uk Σk Vk^T is the matrix of rank k that best
approximates the original matrix X.
59. Step 9 and 10
Step 9 (cont): Since cosines depend only on dot
products and norms, and
X X^T = (U Σ V^T)(U Σ V^T)^T = U Σ V^T V Σ U^T = U Σ (U Σ)^T,
we can calculate the cosines with the smaller matrix U Σ.
Step 10: Projection
Calculate Uk Σk, projecting the row vector for each
word pair from the original 8000 dimensions down
to k = 300.
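Steps 9 and 10 can be sketched with NumPy's SVD. The project function below returns Uk Σk; cosines between its rows match cosines between rows of the rank-k reconstruction of X (and exactly match the original row cosines when k is the full rank):

```python
import numpy as np

def project(X, k):
    """Return U_k * Sigma_k: the rows of X projected onto the top-k
    singular directions. Cosines between these shorter row vectors
    equal cosines between rows of the rank-k version of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def row_cosine(M, i, j):
    """Cosine between rows i and j of matrix M."""
    a, b = M[i], M[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because V^T has orthonormal rows, multiplying by it preserves dot products and norms, which is exactly the identity X X^T = U Σ (U Σ)^T used in step 9.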
60. Step 11
Step 11: Evaluate alternates
Let A:B and C:D be any two word pairs in the input
set. From step 2, we have (num_filter + 1)^2 = 16 ways
to compare a version of A:B with a version of C:D
(num_filter = 3 here).
Look up the row vectors in Uk Σk that correspond
to each version.
Calculate the (num_filter + 1)^2 cosines.
61. The 16 combinations and their cosines

Word pairs                     Cosine  Cosine ≥ original pair?
quart:volume::mile:distance    0.525   Yes (original pairs)
quart:volume::feet:distance    0.464
quart:volume::mile:length      0.634   Yes
quart:volume::length:distance  0.499
liter:volume::mile:distance    0.736   Yes
liter:volume::feet:distance    0.687   Yes
liter:volume::mile:length      0.745   Yes
63. Step 12
Step 12: Calculate relational similarity
Average the cosines from step 11 that are greater than or
equal to the cosine of the original pairs.
This is a way to filter out poor analogies, which may
have slipped through the filtering in step 2.
Averaging the cosines, as opposed to taking the
maximum, is intended to provide some resistance to
noise.
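Using the seven combinations listed in the table for quart:volume::mile:distance (the deck omits the other nine of the 16), step 12 looks like this; the paper's actual value would average all 16 qualifying cosines:

```python
# Step 12 sketch: keep only cosines at least as large as the cosine of
# the original pair, then average them. The seven cosines below are the
# combinations shown in the slide-61 table (nine of the 16 are omitted).
original = 0.525
cosines = [0.525, 0.464, 0.634, 0.499, 0.736, 0.687, 0.745]

kept = [c for c in cosines if c >= original]
relational_similarity = sum(kept) / len(kept)
print(round(relational_similarity, 3))  # 0.665
```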
65. Performance of LRA on the 374
SAT
Algorithm Precision Recall F
LRA 56.8 56.1 56.5
Veale (2004) 42.8 42.8 42.8
Best attributional similarity 35.0 35.0 35.0
Random guessing 20.0 20.0 20.0
Lowest co-occurrence frequency 16.8 16.8 16.8
Highest co-occurrence frequency 11.8 11.8 11.8
66. Baseline LRA System
Performance of the baseline LRA system on the
374 SAT questions:
210 questions were answered correctly,
160 incorrectly, and 4 questions were skipped
because their stem pairs and alternates were
represented by zero vectors.
The performance of LRA is significantly better than the
lexicon-based approach of Veale (2004) and the best
performance using attributional similarity, with 95%
confidence.
67. LRA versus VSM
LRA performs better than VSM-AV.

Algorithm  Correct  Incorrect  Skipped  Precision  Recall  F
VSM-AV     176      193        5        47.7       47.1    47.4
VSM-WMTS   144      196        34       42.4       38.5    40.3
LRA        210      160        4        56.8       56.1    56.5
With smaller corpus, many more of the input
word pairs simply do not appear together in
short phrases in the corpus.
68. LRA versus VSM
LRA is able to answer as many questions as
VSM-AV, although it uses the same corpus as
VSM-WMTS.
Human performance on 78 verbal SAT-I
questions: 57% recall.
The experiment did not attempt to tune the
parameter values (k, num_sim, ...) to maximize
the precision and recall on the 374 SAT
questions.
70. Ablation Experiments
Without SVD, performance dropped, but the
drop is not statistically significant with 95%
confidence.
With more word pairs, SVD would likely make a
significant contribution; more pairs would also
give SVD more leverage.
Dropping synonyms increases the number of skipped
questions. Recall drops significantly, but the
drop in precision is not significant.
71. Ablation Experiments
When both SVD and synonyms are dropped, the
decrease in recall is significant, but the larger
decrease in precision is not significant.
The difference between LRA and VSM-WMTS is
the patterns.
The contribution of SVD has not been proven.
72. Matrix Symmetry and Vector
Interpretations
A good measure of relational similarity, simr,
should satisfy:
simr(A:B, C:D) = simr(B:A, D:C)
This helps prevent drops in recall and precision.
It is better to choose good alternates than to use
all alternates.
The semantic content of a vector is distributed
over the whole vector.
73. Manual Patterns versus
Automatic Patterns
LRA uses 4000 automatically generated
patterns, whereas Turney and Littman (2005)
used 64 manually generated patterns.
The improvement in performance with
automated patterns is due to the increased
quantity of patterns.
The manually generated patterns are not used
to mine text for instances of word pairs that fit
the patterns.
74. Classes of Relations
The experiment was performed using the 600
labeled noun-modifier pairs of Nastase and
Szpakowicz (2003).
Use single nearest-neighbour classification with
leave-one-out cross-validation.
The data set is split 600 times
There were originally six groups of semantic
relations.
75. Classes of semantic relations
from Nastase and Szpakowicz
Relation  Abbr.  Example phrase  Description
CAUSALITY group:
cause     cs     flu virus (*)   H makes M occur or exist; H is necessary and sufficient.
effect    eff    exam anxiety    M makes H occur or exist; M is necessary and sufficient.
76. Classes of Relations
Answering the 374 SAT questions requires
calculating 374 × 5 × 16 = 29,920 cosines.
With leave-one-out cross-validation, each test
pair has 599 choices, so it would require
calculating 600 × 599 × 16 cosines.
To reduce the amount of computation, we first
ignore alternate pairs
(600 × 599 = 359,400 cosines), then apply the full LRA to
just the top 30 neighbours (600 × 30 × 16 = 288,000
cosines) → Total = 647,400 cosines.
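The counts above can be sanity-checked with a few lines of arithmetic:

```python
# Cosine counts for the noun-modifier experiment.
first_pass = 600 * 599       # original pairs only, no alternates
second_pass = 600 * 30 * 16  # full LRA on the 30 nearest neighbours
assert first_pass == 359_400
assert second_pass == 288_000
assert first_pass + second_pass == 647_400

# For comparison, the 374 SAT questions need 374 * 5 * 16 cosines.
assert 374 * 5 * 16 == 29_920
```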
77. Limitations of LRA
Although LRA performs significantly better than
the VSM, it is also clear that the accuracy might not
be adequate for practical applications.
It is possible to adjust the trade-off between
precision and recall.
Speed: it took nine days to answer the 374 analogy
questions.
78. Conclusions
LRA extends the VSM approach of Turney
and Littman (2005) by:
Exploring variations on the analogies by replacing
words with synonyms (step 1).
Automatically generating connecting patterns (step
4).
Smoothing the data with SVD (step 9).
The accuracy of LRA is significantly higher than the
accuracies of VSM-AV and VSM-WMTS.
79. Conclusions
The difference between VSM-AV and VSM-WMTS
shows that the VSM is sensitive to the size
of the corpus.
LRA may perform better with a larger corpus.
A hybrid approach will likely surpass any purebred
approach.
The pattern selection algorithm has little impact on
performance.