Material of the 4th Intensive Summer school and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010).
Bangkok, Thailand.
1. 4th intensive summer school on Natural Language Processing
Bilingual Terminology Mining
Estelle Delpech
30th November, 2010
2. About me
● Estelle Delpech
● Research engineer at Lingua et Machina, France
  ● CAT tools provider
  ● ed(at)lingua-et-machina(dot)com
  ● www.lingua-et-machina.com
● Ph.D. candidate at LINA, France
  ● TALN team: specialises in NLP
  ● estelle.delpech(at)univ-nantes(dot)fr
5. What is a term ?
● Classical definition: “unequivocal expression of a concept within a technical domain”
● Traces back to Eugene Wüster's « General Theory of Terminology » (1930s)
● Specialized language is / should be unambiguous
[Ogden semiotic triangle: concept, term, referent]
6. What is a term ?
● Classical terminology challenged in the 1990's by:
  ● sociolinguistics
  ● corpus-based linguistics
  ● computational terminology
● Observing terms in texts shows:
  ● there is variation and polysemy
  ● concepts evolve over time
  ● no clear-cut border between specialized and general languages
7. What is a term ?
● Definition of « term » depends on the application / audience of the terminology
● Domain expert:
  ● unit of knowledge
● Information retrieval:
  ● descriptors for indexation
● Translation:
  ● word or phrase that:
    ● is not part of general language
    ● translates differently in a particular domain
  ● can be:
    ● noun, adjective, verb
    ● noun phrase, verb phrase, etc.
8. What is a terminology ?
● Set of terms + terminological records
● Terminological record:
  ● part-of-speech
  ● frequency
  ● variants
  ● contexts
● Relations between terms / concepts:
  ● hypernymy: cat is a sort of animal
  ● meronymy: head is part of body
● Bilingual terminology:
  ● translation relations
10. Where do you find terms ?
● In specialized texts:
  ● research papers on breast cancer
  ● plane crash reports
● Corpora building:
  ● important to gather texts belonging to a well-defined domain / theme
11. Bilingual terminology mining (1)
[Pipeline: specialized texts → term extraction (data mining) → source and target terms → term alignment → bilingual terminology → terminology management software]
12. Bilingual terminology mining (2)
[Pipeline: specialized texts → synchronized term extraction and alignment → bilingual terminology → terminology management software]
14. Term extraction : semi-supervised process
● The notion of term is « slippery » (L'Homme, 2004)
● The same lexical unit may or may not be considered a term depending on:
  ● audience
  ● domain
  ● application
● Term extractors extract candidate terms that:
  ● are frequent in texts of a given domain (e.g. HER2 gene)
  ● look like terms, i.e. well-formed phrases (e.g. human cell lines)
  ● are groups of words that frequently occur together (e.g. to compile a program)
15. Term extraction : semi-supervised, lexico-semantic process
[Pipeline: specialized texts → term extractor → candidate terms → manual selection → terms → terminology (concepts); candidate terms can also feed automatic indexing]
16. Termhood clues (1) : frequency
● A term occurs frequently in specialized texts (L'Homme, 2004)
  ● the higher, the better ?
● Comparison with general language:
  ● does the term occur more frequently than expected in general language ?
● Compute significance tests:
  ● e.g. χ² (chi-square)
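A minimal sketch of the chi-square comparison against general language. The counts, corpus sizes, and the word "oncogene" below are hypothetical, not taken from the slides:

```python
def chi_square(k_spec, n_spec, k_gen, n_gen):
    """2x2 chi-square statistic: does a word occur more often than
    expected in the specialized corpus compared to the general one?"""
    a, b = k_spec, n_spec - k_spec   # word vs. other tokens, specialized
    c, d = k_gen, n_gen - k_gen      # word vs. other tokens, general
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Hypothetical counts: "oncogene" appears 40 times in a 10,000-token
# specialized corpus but only 5 times in a 100,000-token general corpus.
score = chi_square(40, 10_000, 5, 100_000)
print(round(score, 2))  # → 346.86, far above the 3.84 threshold (p < 0.05, 1 d.o.f.)
```

A value above 3.84 means the frequency difference is significant at the 5% level for one degree of freedom, so the word is a plausible term candidate.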
17. Termhood clues (2) : form
● A term is a well-formed phrase
  ● ...HER2/neu oncogenes are members of...
● Match morpho-syntactic patterns
  ● e.g. NOUN + NOUN
● Many patterns:
  ● NOUN PREP DET NOUN: alternation of the gene
  ● NOUN PREP NOUN COORD ADJ NOUN: susceptibility to breast and ovarian cancer
  ● NOUN NOUN NOUN NOUN NOUN: human breast cancer cell lines
18. Termhood clues (2) : form
● Preprocessing:
  ● tokenization
  ● lemmatisation
  ● POS tagging

… HER-2/neu oncogenes are members of ....

token: HER-2/neu | oncogenes | are  | members | of
POS:   NOUN      | NOUN      | VERB | NOUN    | PREP
lemma: HER-2/neu | oncogene  | be   | member  | of
19. Identification of syntactic patterns
● Patterns expressed as regular expressions / finite-state automata
[Finite-state automaton: START → NOUN, with an optional (PREP? NOUN) continuation]

NOUN (PREP? NOUN)?

● NOUN: gene
● NOUN NOUN: HER2 gene
● NOUN PREP NOUN: member of family
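One simple way to run such patterns, sketched below: encode each POS tag as a single letter and apply an ordinary regular expression over the tag string. The tag codes and the tagged sentence are illustrative assumptions:

```python
import re

# Encode POS tags as single letters so the pattern NOUN (PREP? NOUN)?
# becomes a plain regular expression over the tag string.
TAG_CODES = {"NOUN": "N", "PREP": "P", "VERB": "V", "DET": "D"}
PATTERN = re.compile(r"N(?:P?N)?")   # NOUN (PREP? NOUN)?

def candidate_terms(tagged):
    """tagged: list of (word, POS) pairs; returns matched word spans."""
    tag_string = "".join(TAG_CODES.get(tag, "X") for _, tag in tagged)
    spans = []
    for m in PATTERN.finditer(tag_string):
        words = [w for w, _ in tagged[m.start():m.end()]]
        spans.append(" ".join(words))
    return spans

tagged = [("member", "NOUN"), ("of", "PREP"), ("family", "NOUN")]
print(candidate_terms(tagged))  # → ['member of family']
```

Because the regex engine is greedy, the longest matching span (NOUN PREP NOUN) wins over the shorter NOUN-only match.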
20. Termhood clue (3) : word association
● Significant cooccurrences are good clues for termhood:
  ● … breast cancer …
  ● … breast remains …
  ● … alternative cancer …
● Must take into account:
  ● number of times the two words cooccur
  ● number of times word A occurs
  ● number of times word B occurs
21. Measure for cooccurrence significance
● Mutual Information (Church and Hanks, 1990; L'Homme, 2004)

MI(a, b) = log2( P(a, b) / (P(a) · P(b)) )
P(a, b) = nbocc(a, b) / N
P(a) = nbocc(a) / N
N = total nb of words in corpus

pair                  cooc.   nbocc(a)        nbocc(b)        MI
invasive carcinoma    20      invasive: 30    carcinoma: 20   9.7
cancer means          50      cancer: 800     means: 800      1.69

● Remarkable attraction between invasive and carcinoma despite the relatively low number of cooccurrences
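The MI formula above can be computed directly from raw counts. A minimal sketch; the corpus size N = 25,000 is an assumed value chosen to be consistent with the slide's MI of 9.7, not a figure from the slides:

```python
import math

def mutual_information(n_ab, n_a, n_b, n_total):
    """MI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated from raw corpus counts."""
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return math.log2(p_ab / (p_a * p_b))

# "invasive" and "carcinoma" cooccur 20 times, although each word is
# individually rare, in an assumed corpus of 25,000 words.
print(round(mutual_information(20, 30, 20, 25_000), 2))  # → 9.7
```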
24. Parallel and comparable corpora
● Parallel corpora:
  ● source and target texts are translations
  ● reduce the search space little by little:
    ● first sentences
    ● then terms
● Comparable corpora:
  ● not translations but very similar in topic
  ● good proportion of term translations
  ● search space: all terms of the target corpus
25. Sentence alignment (1)
● Gale and Church (1993)'s hypothesis:
  ● translated sentences have roughly the same length
  ● the probability P(S,T) that sentence S translates into T is based on the length difference
● Improvements: use a seed lexicon
  ● the probability P(S,T) is based on the number of words in common
26. Sentence alignment (2)
● Compute probabilities for all pairs (S, T)
● Build a matrix where M(i, j) contains the probability that sentence i translates to sentence j (Gale and Church, 1993)

        0     1     2    ...    n
  0   0.89  0.56  0.2   ...   ...
  1   0.45  0.9   0.1   ...   ...
  2   ...   0.23  0.9   0.3   ...
 ...  ...   ...   0.44  0.76  ...
  m   ...   ...   ...   ...   0.88
27. Sentence alignment (3)
● Use dynamic programming to find the best “path”, i.e. the best alignments (Gale and Church, 1993)

        0     1     2    ...    n
  0   0.89  0.56  0.2   ...   ...
  1   0.45  0.9   0.1   ...   ...
  2   ...   0.23  0.9   0.3   ...
 ...  ...   ...   0.44  0.76  ...
  m   ...   ...   ...   ...   0.88

→ the best path follows the high-probability cells along the diagonal: 0.89, 0.9, 0.9, 0.76, ..., 0.88
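The dynamic-programming step can be sketched as follows. This is a deliberately simplified 1-1 version (real sentence aligners also allow 1-0, 0-1, 1-2 and 2-1 alignments); the matrix values are illustrative:

```python
# Find the best monotone "path" through a matrix of alignment
# probabilities M[i][j], multiplying probabilities along the way.
def best_alignment_score(M):
    m, n = len(M), len(M[0])
    best = [[0.0] * n for _ in range(m)]
    best[0][0] = M[0][0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = max(best[i-1][j-1] if i and j else 0.0,
                       best[i-1][j] if i else 0.0,
                       best[i][j-1] if j else 0.0)
            best[i][j] = prev * M[i][j]
    return best[m-1][n-1]

M = [[0.89, 0.56, 0.20],
     [0.45, 0.90, 0.10],
     [0.23, 0.30, 0.90]]
print(round(best_alignment_score(M), 3))  # → 0.721 (the 0.89, 0.9, 0.9 diagonal)
```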
28. Sub-sentence alignment : AnyMalign (Lardilleux et al., 2010)
● AnyMalign is a sub-sentential aligner
  ● aligns words and groups of words for MT translation tables
● Aligned groups of words:
  ● more or less like statistical collocations
  ● possible to find term patterns in these groups of words
29. AnyMalign (Lardilleux et al., 2010)
● The algorithm is based on « perfect alignments »:
  ● words or groups of words that occur exactly in the same aligned sentences

  ad ↔ AD
  b ↔ B
  b ↔ C
  a e ↔ A DD

  → a ↔ A is a perfect alignment
30. AnyMalign (Lardilleux et al., 2010)
● How to get more « perfect alignments » ?
  ● with smaller corpora
● How to get smaller corpora ?
  ● randomly select sub-corpora from your corpora

  Full corpus:
  ad ↔ AD
  b ↔ B
  b ↔ C
  a e ↔ A DD

  Sub-corpus 1 → perfect alignment b ↔ B
  Sub-corpus 2 → perfect alignment a ↔ A
31. AnyMalign (Lardilleux, 2010)
Complementaires of perfect alignments are
likely to be good alignments too :
●
ad ↔ AD
b↔B
b↔C
a e ↔ A DD
Perfect alignment
a↔A
●
Complementaries
d↔D
e ↔ DD
●
Lardilleux et al., 2010
31
32. AnyMalign (Lardilleux et al., 2010)
● Process: iteratively extract random samples of random size from your corpora
● Extract « perfect alignments » and their complementaries
● The same alignment can occur several times
● Count, for each alignment, the number of times it occurs
33. AnyMalign (Lardilleux et al., 2010)
● Output: alignments sorted by descending number of occurrences
● Alignment probability:

P(S|T) = C(S, T) / C(T)

S = source group of words
T = target group of words
C(S, T) = number of times S was aligned with T
C(T) = number of times T appears in an alignment
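The sampling loop and the probability estimate can be sketched as below. This is a strongly simplified reading of the method (single words only, no complementaries, toy corpus from the slides), not the actual AnyMalign implementation:

```python
import random
from collections import Counter

def perfect_alignments(sample):
    """Source/target words whose sets of sentence indices are identical
    within the sample are 'perfect alignments'."""
    src_occ, tgt_occ = {}, {}
    for idx, (src, tgt) in enumerate(sample):
        for w in src.split():
            src_occ.setdefault(w, set()).add(idx)
        for w in tgt.split():
            tgt_occ.setdefault(w, set()).add(idx)
    return {(s, t) for s, s_idx in src_occ.items()
                   for t, t_idx in tgt_occ.items() if s_idx == t_idx}

def anymalign_counts(corpus, iterations=500, seed=0):
    """Repeatedly draw random sub-corpora of random size and count how
    often each perfect alignment is produced."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(iterations):
        size = rng.randint(1, len(corpus))
        counts.update(perfect_alignments(rng.sample(corpus, size)))
    return counts

corpus = [("a d", "A D"), ("b", "B"), ("b", "C"), ("a e", "A DD")]
counts = anymalign_counts(corpus)
# P(S|T) = C(S,T) / C(T), estimated from the counts
c_t = sum(c for (s, t), c in counts.items() if t == "A")
print(counts[("a", "A")] / c_t)
```

Frequent alignments such as a ↔ A accumulate high counts across samples, which is what the descending-occurrence sort in the output exploits.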
34. AnyMalign (Lardilleux et al., 2010)
● Advantages:
  ● can perform alignment with more than 2 languages at the same time
    ● 1 language → statistical collocations
  ● extracts and aligns non-contiguous sequences of words
    ● to give something up
    ● to let someone down
  ● no a priori expectations on terms
    ● sometimes a term in the source language is not translated by a term
    ● terms = what you can align
35. AnyMalign (Lardilleux et al., 2010)
● Word groups are not grammatical phrases: the aligner outputs
  ● that sample sentences and
  ● exchange format fitted for the
  but not
  ● sample sentences
  ● exchange format
● Solutions:
  ● find term patterns
  ● use heuristics
  ● trim stop words
37. Advantages of comparable corpora
● More available:
  ● new languages
  ● new language pairs
  ● new topics / domains
● Less expensive to build
● More natural:
  ● data was produced spontaneously
  ● no influence from a source text
38. Contextual approach
● Based on distributional linguistics (Z. Harris):
  ● words with similar meaning appear in similar contexts
● If source and target words have similar contexts, they might be translations:
  ● compute contexts for each source and target word
  ● compare contexts
  ● find the most similar contexts
40. Building context vector for « drink »
● Collocates: words occurring at a distance of n words from the head

is variety of reasons to drink plenty of water each day
simple as a glass of drinking water be the key to the
popular in Japan today to drink water from glass after waking

(drink, water) = 3
(drink, glass) = 2
(drink, Japan) = 1
(drink, reason) = 1
(drink, plenty) = 1
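The collocate counting above can be sketched as a sliding window over lemmatized sentences. The window size (4) is an assumption, and the toy sentences are adapted from the slide (shortened and lemmatized so that drink matches in each):

```python
from collections import Counter

def context_vector(sentences, head, window=4):
    """Count collocates of `head` within `window` words on each side."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for i, tok in enumerate(tokens):
            if tok != head:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

sentences = [
    "variety of reason to drink plenty of water each day",
    "simple as a glass of drink water be the key",
    "popular in Japan today to drink water from glass",
]
vec = context_vector(sentences, "drink")
print(vec["water"], vec["glass"], vec["Japan"])  # → 3 2 1
```

The counts reproduce the slide's vector: water 3, glass 2, Japan 1.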
41. Normalized cooccurrence frequency
● Normalization: use measures like MI or the log-likelihood ratio to counteract the influence of high-frequency words (Dunning, 1993)
● Ex: log-likelihood ratio
  ● 1000 cooc. in corpus
  ● (drink, x) = 75 cooc.
  ● (water, y) = 75 cooc.
  ● (drink, water) = 25 cooc.

           water   ¬ water
  drink      25       50      75
  ¬ drink    50      875     925
             75      925    1000
42. Log-likelihood ratio
● Contingency table (Dunning, 1993):

           water   ¬ water
  drink      a        b       e
  ¬ drink    c        d       h
             f        g       N

log-likelihood ratio(water, drink) =
  a·log a + b·log b + c·log c + d·log d + N·log N
  − e·log e − f·log f − g·log g − h·log h

● log-likelihood ratio(drink, water) = 45.05
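A minimal sketch of the formula above, assuming natural logarithms and the conventional factor 2 of Dunning's G statistic (the exact numeric value depends on these conventions, so it may differ from the figure on the slide):

```python
import math

def log_likelihood_ratio(a, b, c, d):
    """Log-likelihood ratio from a 2x2 contingency table, using the
    marginal-based form: sum of O*log(O) over cells plus N*log(N),
    minus the marginals' contributions."""
    e, h = a + b, c + d          # row totals
    f, g = a + c, b + d          # column totals
    n = a + b + c + d

    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0

    return 2 * (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) + xlogx(n)
                - xlogx(e) - xlogx(f) - xlogx(g) - xlogx(h))

# Counts from the slide's example: (drink, water) = 25 cooccurrences,
# drink and water each totalling 75, in 1000 cooccurrences overall.
llr = log_likelihood_ratio(25, 50, 50, 875)
print(round(llr, 2))
```

A high value indicates that drink and water cooccur far more often than chance would predict, even though neither word is especially frequent.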
43. Context vector comparison
m
ou
th
be
er
●
...
●
น
ดม
●
Rapp 1995 ; Fung 1997
●
●
●
ป
ก
●
ยร
เบ
●
ว
แก
drink
gl
as
s
Compute context vectors for words in
source and target corpus
wa
te
r
●
...
●
How to compare words contexts in
different languages ?
43
45. Context vector comparison
● Measuring the context similarity of words a and b = measuring the cosine of the angle between the vector of a and the vector of b (Rapp 1995 ; Fung 1997)

cosine(a, b) = Σ_{c ∈ a ∪ b} w(c, a) · w(c, b)
               / ( sqrt( Σ_{c ∈ a} w(c, a)² ) · sqrt( Σ_{c ∈ b} w(c, b)² ) )

c ∈ x = collocate in the vector of x
w(c, x) = weight of association of collocate c with head x

● Select the top 1, 10 or 20 closest words as candidate translations
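The cosine comparison can be sketched over vectors stored as dicts mapping collocates to association weights. The two vectors below are hypothetical examples, not slide data:

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two context vectors
    (dicts: collocate -> association weight)."""
    shared = set(vec_a) | set(vec_b)
    dot = sum(vec_a.get(c, 0.0) * vec_b.get(c, 0.0) for c in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical weighted vectors for "drink" and a candidate translation:
drink = {"water": 3.0, "glass": 2.0, "Japan": 1.0}
candidate = {"water": 2.0, "glass": 2.0, "mouth": 1.0}
print(round(cosine(drink, candidate), 3))  # → 0.891
```

Ranking all target words by this score and keeping the top 1, 10 or 20 gives the candidate translations mentioned on the slide.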
46. Contextual approach : improvements
● Use syntactic collocates
● Improve the dictionary with cognates, transliterations, other dictionaries
● Give more weight to « anchor words »:
  ● cognates, transliterations
  ● frequent, monosemous words
● Filter with part-of-speech
● Favor reciprocal translations

[Diagram: source words a, b, c, d mapped to target words a', b', c', d'; reciprocal translation links are favored]

(Chiao and Zweigenbaum, 2002; Sadat et al., 2003; Gamallo and Campos, 2005; Koehn and Knight, 2002; Prochasson, 2010)
47. Variant to direct translation of vector
● « Interlingual » translation
● Translate the n closest words instead of the context vector
● Seed lexicon: some mappings between source and target words

[Diagram: source and target spaces linked by a seed lexicon] (Déjean and Gaussier, 2002)
48. Variant to direct translation of vector
● To translate term T:
  ● find its n closest words
  ● such that these closest words are in the seed lexicon

[Diagram: source and target spaces linked by a seed lexicon] (Déjean and Gaussier, 2002)
49. Variant to direct translation of vector
● Find the target term which is the closest to the n closest words

[Diagram: source and target spaces linked by a seed lexicon] (Déjean and Gaussier, 2002)
50. Variant to direct translation of vector
● « Interlingual » approach
● Translate the closest words instead of the direct context

[Diagram: source and target spaces linked by a seed lexicon] (Déjean and Gaussier, 2002)
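The variant on slides 47-50 can be sketched as follows. All names, similarity tables, and lexicon entries below are illustrative assumptions, and `source_sims` / `target_sims` stand in for whatever vector comparison is used (e.g. the cosine of slide 45):

```python
# Sketch of the « interlingual » variant (after Déjean and Gaussier,
# 2002): instead of comparing the context vector of term T directly
# across languages, take the n source words closest to T that the seed
# lexicon covers, translate them, and rank target terms by their summed
# similarity to those translated pivots.
def interlingual_candidates(term, source_sims, seed_lexicon, target_sims, n=2):
    # n closest source words to `term` that appear in the seed lexicon
    closest = sorted((w for w in source_sims[term] if w in seed_lexicon),
                     key=lambda w: source_sims[term][w], reverse=True)[:n]
    translations = [seed_lexicon[w] for w in closest]
    # score each target term by its similarity to the translated pivots
    scores = {}
    for pivot in translations:
        for target_term, sim in target_sims[pivot].items():
            scores[target_term] = scores.get(target_term, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

source_sims = {"oncogene": {"gene": 0.9, "tumour": 0.7, "cell": 0.4}}
seed_lexicon = {"gene": "gène", "cell": "cellule"}
target_sims = {"gène": {"oncogène": 0.8, "génome": 0.5},
               "cellule": {"oncogène": 0.3, "tissu": 0.6}}
print(interlingual_candidates("oncogene", source_sims, seed_lexicon,
                              target_sims))  # → ['oncogène', 'tissu', 'génome']
```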
51. Adaptation to multi-word terms (Morin et al., 2004; Morin and Daille, 2009)
● Context vector of a multi-word term: union of the vectors of each word of the term

[Figure: the vector of energy drink is built from the union of the vector of energy (collocates: strong, beer, ...) and the vector of drink (collocates: beer, glass, mouth, ...)]
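The union of component vectors can be sketched as below, here implemented as summed collocate counts (one possible reading of "union"; the toy vectors are illustrative assumptions):

```python
from collections import Counter

def multiword_vector(vectors):
    """Combine the context vectors of a multi-word term's components
    by summing collocate counts."""
    combined = Counter()
    for vec in vectors:
        combined.update(vec)
    return combined

# Toy component vectors for "energy" and "drink":
energy = Counter({"strong": 2, "beer": 1})
drink = Counter({"beer": 3, "glass": 2, "water": 3})
print(multiword_vector([energy, drink]))
```

The combined vector can then be compared to target-side vectors exactly like a single-word vector.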
52. Evaluation
● Precision on Top-N candidates:
  ● e.g. 50% on Top20 = the correct translation is in the Top 20 best candidates for 50% of the source terms
● Results (Morin and Daille, 2010):

  single-word units   big, general-language corpus   80%
  multi-word units    small, specialized corpus      60%
  multi-word terms    small, specialized corpus      42%

● big = hundreds of millions of words
● small = 100 thousand to one million words
53. Why is it so difficult ?
● the translation might not be present in the corpus
● the target term has not been extracted
● polysemous words: undiscriminating, fuzzy vectors
● low-frequency words: insignificant vectors
● the translation has a different usage in the target language
● big search space: all words of the target corpus
→ cannot be fully automatic
→ semi-supervised term alignment
54. 4th Franco-Thai Workshop 2010
intensive summer school on Natural Language Processing
Thank you
ed(at)lingua-et-machina.com