Fsmnlp presentation 02

Handling Unknown Words in Arabic FST
Morphology

Khaled Shaalan and Mohammed Attia
Faculty of Engineering and IT,
The British University in Dubai

Presented by
Younes Samih
Heinrich-Heine-Universität, Germany

Bird’s-Eye View
Problem
• Out-of-vocabulary (OOV) words cause problems for
morphological analysers, parsers, MT, etc.
• The manual extension of lexical databases is costly and
time-consuming.
• With the large amount of data, manual extension of lexicons
becomes practically impossible.
Solution
• Creating an automatic method for updating a lexical database
• Integrating a Machine Learning method with a finite state
guesser to lemmatize unknown words
• Weighting new words by relevance and importance

Outline
• Introduction
• Morphological Guesser
• Methodology
• Testing and Evaluation
• Conclusion

Introduction
• Why deal with unknown words?

• Complexity of lemmatization in Arabic

• Data used

Introduction
Why deal with unknown words?
• Language is always changing
• New words appear
• Old words disappear
• Unknown words make up 29% of the Gigaword
corpus
• Unknown (OOV) words cause problems for:
• Morphological analysers
• Parsers
• Machine Translation & other applications

Introduction
Complexity of lemmatization in Arabic
• Lemmatization means reducing words to their base
(canonical) forms
• played -> play, studies -> study
• went -> go, wives -> wife
• New words in English appear in their base form 86% of
the time (Lindén, 2008)
• New words in Arabic appear in their base form 45% of
the time
• Arabic morphology is complex and semi-algorithmic:
root, patterns, inflections, clitics, etc.

Introduction
Possible Concatenations in Arabic Verbs
Slot order: Proclitics + Prefix + Lemma + Suffix + Enclitic
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitic (Comp): ل li ‘to’; س sa ‘will’; ل la ‘then’
• Prefix (tense/mood, number/gender): imperfective tense (5); perfective tense (1); imperative (2)
• Lemma: verb lemma
• Suffix (tense/mood, number/gender): imperfective tense (10); perfective tense (12); imperative (5)
• Enclitic (object pronoun): first person (2); second person (5); third person (5)
Example: شكر šakara ‘to thank’ generates 2,552 valid forms.

Introduction
Possible Concatenations in Arabic Nouns
Slot order: Proclitics + Stem + Suffix + Enclitic
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitic (preposition): ب bi ‘with’, ك ka ‘as’ or ل li ‘to’
• Proclitic (definite article): ال al ‘the’
• Stem: noun lemma
• Suffix (gender/number): masculine dual (4); feminine dual (4); masculine regular plural (4); feminine regular plural (1); feminine mark (1)
• Enclitic (genitive pronoun): first person (2); second person (5); third person (5)
Example: معلم mu῾allim ‘teacher’ generates 519 valid forms.

Introduction
Data used
• A large-scale corpus of 1,089,111,204
words
• 85% from the Arabic Gigaword Fourth Edition
• 15% from news articles crawled from the Al-Jazeera
web site

Morphological Guesser
We develop a morphological guesser for
Arabic unknown words that handles all
possible
• Clitics
• Prefixes
• Suffixes
• And all relevant alteration operations that include
insertion, assimilation, and deletion

Guesser
1. LEXC
LEXICON Conjunctions
  وـ+conj:وـ Prepositions;
  فـ+conj:فـ Prepositions;
  Prepositions;
LEXICON Prepositions
  لـ+prep:لـ Article;
  كـ+prep:كـ Article;
  بـ+prep:بـ Article;
  Article;
LEXICON Article
  الـ+defArt Nouns;
  الـ+defArt Adjectives;
  Nouns;
  Adjectives;
LEXICON Nouns
  +noun GuessWords;
  ^ss^خادم^se^ FemMascduMascpl;
  ....
LEXICON Adjectives
  +adj+fem GuessWords;
  +adj+masc GuessWords;
  ^ss^سعيد^se^+adj+masc FemMascduFemduMascplFempl;
  ....
LEXICON GuessWords
  ^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduMascplFempl;
  ^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduFempl;
  ^ss^^GUESSNOUNSTEM^^se^ FemMascduFemdu;
  ....

2. Alteration rules
a -> b || L _ R

3. XFST
read regex < arb-Alphabet.txt
define Alphabet
define PossNounStem [[Alphabet]^{2,24}] "+Guess":0;
substitute defined PossNounStem for "^GUESSNOUNSTEM^"

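The continuation classes in the LEXC fragment can be approximated in ordinary code. What follows is a minimal Python sketch, not the authors' transducer: it tries an optional conjunction, preposition and definite article in that order, accepts any run of 2 to 24 Arabic letters as a stem, and optionally strips one suffix. The function names, the tag spellings, the Unicode range used as the alphabet and the single suffix ون are illustrative assumptions, and the alteration rules are left out entirely.

```python
import re

CONJ = ["و", "ف"]          # LEXICON Conjunctions
PREP = ["ل", "ك", "ب"]     # LEXICON Prepositions
ART = ["ال"]               # LEXICON Article
# Illustrative assumption: one suffix only; the real guesser covers many more.
SUFFIXES = {"ون": "+masc+pl+nom"}
# Assumed alphabet: basic Arabic letters; mirrors [Alphabet]^{2,24}.
STEM = re.compile(r"[\u0621-\u064A]{2,24}")


def strip_optional(word, options, tag):
    """Yield (tags, rest) for skipping the slot and for each matching proclitic."""
    yield "", word
    for p in options:
        if word.startswith(p):
            yield f"{p}+{tag}@", word[len(p):]


def guess(word):
    """Return every candidate analysis of `word` as a tagged string."""
    analyses = []
    for t1, r1 in strip_optional(word, CONJ, "conj"):
        for t2, r2 in strip_optional(r1, PREP, "prep"):
            for t3, r3 in strip_optional(r2, ART, "defArt"):
                for pos in ("+noun", "+adj"):
                    # Reading 1: the whole remainder is an unsuffixed stem.
                    if STEM.fullmatch(r3):
                        analyses.append(f"{t1}{t2}{t3}{pos}{r3}+Guess+sg@")
                    # Reading 2: strip a known suffix and keep its features.
                    for suf, feats in SUFFIXES.items():
                        stem = r3[: -len(suf)]
                        if r3.endswith(suf) and STEM.fullmatch(stem):
                            analyses.append(
                                f"{t1}{t2}{t3}{pos}{stem}+Guess{feats}@"
                            )
    return analyses
```

Because every slot is optional, a single word yields many competing analyses, which is why a separate disambiguation step is needed afterwards.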
Methodology
We use a pipelined approach
• First: a machine learning (SVM), context-sensitive tool
(MADA) is used to predict:
• POS
• Morpho-syntactic features of number, gender, person, tense, etc.
• Second: The finite-state morphological guesser is used
to produce all the possible interpretations of words and
suggested lemmas.
• Third: The two outputs are matched and the analysis on
which they agree is selected.

Methodology
Example
والمسوِّقون
wa-Al-musaw~iquwna “and-the-marketers”

MADA output:
form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na
pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d

Finite-state guesser output:
والمسوقون    +adjوالمسوق+Guess+masc+pl+nom@
والمسوقون    +adjوالمسوقون+Guess+sg@
والمسوقون    +nounوالمسوق+Guess+masc+pl+nom@
والمسوقون    +nounوالمسوقون+Guess+sg@
والمسوقون    و+conj@ال+defArt@+adjمسوق+Guess+masc+pl+nom@
والمسوقون    و+conj@ال+defArt@+adjمسوقون+Guess+sg@
والمسوقون    و+conj@ال+defArt@+nounمسوق+Guess+masc+pl+nom@    <- correct analysis
والمسوقون    و+conj@ال+defArt@+nounمسوقون+Guess+sg@
والمسوقون    و+conj@+adjالمسوق+Guess+masc+pl+nom@
والمسوقون    و+conj@+adjالمسوقون+Guess+sg@
والمسوقون    و+conj@+nounالمسوق+Guess+masc+pl+nom@
والمسوقون    و+conj@+nounالمسوقون+Guess+sg@

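The matching step for this example can be sketched as a filter over the guesser analyses. This is a minimal Python illustration under stated assumptions: the mapping from MADA features to guesser tags in `expected_tags` is hypothetical and covers only the features in this example, and the analyses are written with each tag following the morpheme it labels.

```python
def parse_mada(line):
    """Turn 'key:value key:value ...' into a dict."""
    return dict(field.split(":", 1) for field in line.split())


def expected_tags(mada):
    """Translate MADA features into the guesser tags they should agree with.

    Hypothetical mapping, limited to the features seen in this example.
    """
    tags = ["+" + mada["pos"]]
    if mada.get("prc2") == "wa_conj":
        tags.append("و+conj@")
    if mada.get("prc0") == "Al_det":
        tags.append("ال+defArt@")
    if mada.get("gen") == "m":
        tags.append("+masc")
    if mada.get("num") == "p":
        tags.append("+pl")
    if mada.get("case") == "n":
        tags.append("+nom")
    return tags


def select_agreed(mada_line, guesses):
    """Keep the guesser analyses that contain every tag MADA predicts."""
    tags = expected_tags(parse_mada(mada_line))
    return [g for g in guesses if all(t in g for t in tags)]


MADA = ("form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na "
        "pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d")
GUESSES = [
    "+nounوالمسوق+Guess+masc+pl+nom@",
    "و+conj@ال+defArt@+adjمسوق+Guess+masc+pl+nom@",
    "و+conj@ال+defArt@+nounمسوق+Guess+masc+pl+nom@",
    "و+conj@ال+defArt@+nounمسوقون+Guess+sg@",
    "و+conj@+nounالمسوق+Guess+masc+pl+nom@",
]
print(select_agreed(MADA, GUESSES))
```

With these five candidates, only the noun analysis with the conjunction, the definite article and the masculine plural nominative suffix survives the filter.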
Methodology
Results
• Corpus size is 1,089,111,204 tokens, 7,348,173
types
• Unknown types in the corpus: 2,116,180 (29%)
• Correctly spelt types after spell checking: 208,188
• Types with frequency of 10 or more: 40,277
• After lemmatization: 18,399 types
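To see how strongly each filtering stage narrows the candidate list, the counts on this slide can be turned into stage-to-stage ratios. The short Python sketch below does only that arithmetic; the stage labels and the helper name `retention` are ours.

```python
funnel = [
    ("all types", 7_348_173),
    ("unknown types", 2_116_180),
    ("correctly spelt", 208_188),
    ("frequency >= 10", 40_277),
    ("after lemmatization", 18_399),
]


def retention(stages):
    """Return (label, count, share of the previous stage) for each later stage."""
    return [
        (label, count, count / prev)
        for (_, prev), (label, count) in zip(stages, stages[1:])
    ]


for label, count, share in retention(funnel):
    print(f"{label:22s} {count:>9,d}  {share:6.1%} of previous stage")
```

The first ratio comes to about 28.8%, which the slide rounds to 29%.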

Testing and Evaluation
We create a gold standard of 1,310 words,
manually annotated for:
• Gold lemma
• Gold POS
• Lexical relevance (include in a dictionary): yes or no

Gold POS     Type count   Ratio
noun_prop    584          45%
noun         264          20%
adj          255          19%
verb         52           4%

Among unknown words:
- Proper nouns are the most common
- Verbs are the least common

Evaluating POS (accuracy)
• Baseline: assigning the most frequent tag (proper noun)
to all unknown words: 45%
• MADA: 60%
• Voted POS tagging: 69%. When a lemma gets a
different POS tag with a higher frequency, we
take the higher-frequency tag.

    POS tagging              Accuracy
1   POS tagging baseline     45%
2   MADA POS tagging         60%
3   Voted POS tagging        69%
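The voting idea can be sketched in a few lines of Python. This is one plausible reading rather than the authors' implementation: we assume the input is a list of (lemma, POS) pairs, one per tagged occurrence, and that the most frequent tag per lemma wins. The function name `vote_pos` is ours.

```python
from collections import Counter, defaultdict


def vote_pos(tagged):
    """Map each lemma to the POS tag it received most often.

    `tagged` is an iterable of (lemma, pos) pairs, one per tagged occurrence.
    On a tie, the tag seen first for that lemma is kept.
    """
    counts = defaultdict(Counter)
    for lemma, pos in tagged:
        counts[lemma][pos] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


occurrences = [
    ("مسوق", "adj"),
    ("مسوق", "noun"),
    ("مسوق", "noun"),
]
print(vote_pos(occurrences))
```

Here the lemma is tagged once as an adjective and twice as a noun, so the vote overrides the single adjective reading.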

Evaluating Lemmatization (accuracy)
• Baseline: new words appear in their base form: 45%
• Pipelined, strict definite article ‘al’ matching: 54%
• Pipelined, ignoring definite article ‘al’ matching: 63%

    Lemmatization                                                                           Accuracy
1   Lemma first-order baseline                                                              45%
2   Pipelined lemmatization (first-order decision) with strict definite article matching    54%
3   Pipelined lemmatization (first-order decision) ignoring definite article matching       63%
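One way to read the two pipelined settings is as a switch on whether the definite article takes part in the agreement check between the tagger and the guesser. The Python sketch below illustrates that reading only; the dictionary keys, the function names and the choice to return the first agreeing analysis are assumptions, not details taken from the slides.

```python
def agree(mada, guess, strict_article=True):
    """Return True when a guesser analysis is compatible with the tagger output.

    Both arguments are dicts with the hypothetical keys 'pos', 'num', 'gen'
    and 'has_article'.
    """
    keys = ["pos", "num", "gen"]
    if strict_article:
        keys.append("has_article")
    return all(mada[k] == guess[k] for k in keys)


def first_agreed_lemma(mada, guesses, strict_article=True):
    """Return the lemma of the first agreeing analysis, or None if none agrees."""
    for g in guesses:
        if agree(mada, g, strict_article):
            return g["lemma"]
    return None


# Hypothetical case: the tagger missed the article that the guesser found.
mada = {"pos": "noun", "num": "p", "gen": "m", "has_article": False}
guesses = [
    {"lemma": "مسوق", "pos": "noun", "num": "p", "gen": "m", "has_article": True},
]
print(first_agreed_lemma(mada, guesses, strict_article=True))
print(first_agreed_lemma(mada, guesses, strict_article=False))
```

In this constructed case the strict setting finds no agreeing analysis, while the relaxed setting recovers the lemma, which shows how dropping the article check can change the selected analysis.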

Evaluating Lemma Weighting
• The weighting criteria aim to push lexicographically
relevant words up the list and less interesting words down.
• We aim to make the number of important words high in the
top 100 and low in the bottom 100.

Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor

Good words                                   In top 100   In bottom 100
relying on frequency alone (baseline)        63           50
relying on number of sister forms * 800      87           28
relying on POS factor                        58           30
using combined criteria                      78           15
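The weighting formula translates directly into code. In the Python sketch below we assume that "frequencies of sister forms" means the summed corpus frequency of all sister forms; the POS factor values are not given on the slides, so `pos_factor` is a plain parameter and the numbers in the example are invented.

```python
def word_weight(sister_forms, sister_freq_sum, pos_factor, form_bonus=800):
    """Word Weight = ((sister_forms * 800) + sister_freq_sum) / 2 + pos_factor."""
    return (sister_forms * form_bonus + sister_freq_sum) / 2 + pos_factor


# Invented candidates: (lemma label, sister forms, summed frequency, POS factor)
candidates = [
    ("lemma_a", 5, 120, 0),
    ("lemma_b", 1, 900, 0),
]
ranked = sorted(candidates, key=lambda c: word_weight(*c[1:]), reverse=True)
for name, forms, freq, factor in ranked:
    print(name, word_weight(forms, freq, factor))
```

With the multiplier of 800, a lemma attested in several inflected forms outranks a lemma with a higher raw frequency but only one form, in line with the slide's result that the number of sister forms is the strongest single criterion.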

Conclusion
• We develop a methodology for automatically extracting
and lemmatizing unknown words in Arabic
• We pipeline a finite-state guesser with a machine
learning tool for lemmatization
• We develop a weighting mechanism for predicting the
relevance and importance of lemmas
• Out of 2,116,180 unknown words, we create a lexicon of
18,399 lemmatized, POS-tagged and weighted entries.
