1. Handling Unknown Words in Arabic FST Morphology
Khaled Shaalan and Mohammed Attia
Faculty of Engineering and IT,
The British University in Dubai
Presented by
Younes Samih
Heinrich-Heine-Universität, Germany
2. Bird’s-Eye View
Problem
• Out-of-vocabulary (OOV) words cause problems for
morphological analysers, parsers, MT, etc.
• The manual extension of lexical databases is costly and
time-consuming.
• With the large amount of data, manual extension of lexicons
becomes practically impossible.
Solution
• Creating an automatic method for updating a lexical database
• Integrating a Machine Learning method with a finite state
guesser to lemmatize unknown words
• Weighting new words by relevance and importance
4. Introduction
• Why deal with unknown words?
• Complexity of lemmatization in Arabic
• Data used
5. Introduction
Why deal with unknown words?
• Language is always changing
• New words appear
• Old words disappear
• Unknown words make up 29% of the Gigaword
corpus
• Unknown (OOV) words always cause problems for:
• Morphological analysers
• Parsers
• Machine Translation & other applications
6. Introduction
Complexity of lemmatization in Arabic
• Lemmatization means reducing words to their base
(canonical) forms
• played -> play, studies -> study
• went -> go, wives -> wife
• New words in English appear in their base form 86% of
the time (Lindén, 2008)
• New words in Arabic appear in their base form 45% of
the time
• Arabic morphology is complex and semi-algorithmic:
root, patterns, inflections, clitics, etc.
7. Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Verbs
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitic (Comp): ل li ‘to’; س sa ‘will’; ل la ‘then’
• Prefix (tense/mood, number/gender): imperfective tense (5); perfective tense (1); imperative (2)
• Lemma: verb
• Suffix (tense/mood, number/gender): imperfective tense (10); perfective tense (12); imperative (5)
• Enclitic (object pronoun): first person (2); second person (5); third person (5)
• Example: شكر šakara ‘to thank’ generates 2,552 valid forms
8. Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Nouns
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitic (preposition): ب bi ‘with’, ك ka ‘as’ or ل li ‘to’
• Proclitic (definite article): ال al ‘the’
• Lemma (noun): stem
• Suffix (gender/number): masculine dual (4); feminine dual (4); masculine regular plural (4); feminine regular plural (1); feminine mark (1)
• Enclitic (genitive pronoun): first person (2); second person (5); third person (5)
• Example: معلم mu῾allim ‘teacher’ generates 519 valid forms
9. Introduction
Data used
• A large-scale corpus of 1,089,111,204
words
• 85% from the Arabic Gigaword Fourth Edition
• 15% from news articles crawled from the Al-Jazeera
web site
10. Morphological Guesser
We develop a morphological guesser for
Arabic unknown words that handles all
possible
• Clitics
• Prefixes
• Suffixes
• And all relevant alteration operations that include
insertion, assimilation, and deletion
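The idea of guessing analyses by peeling off clitics and suffixes can be sketched in plain Python. This is only an illustration under strong assumptions: the actual guesser is a finite-state transducer, the affix lists below are small, simplified romanized placeholders (not the authors' inventories), and no alteration rules (insertion, assimilation, deletion) are modelled.

```python
from itertools import product

# Illustrative, simplified romanized affix inventories (assumptions only).
CONJUNCTIONS = ["", "wa", "fa"]
PREPOSITIONS = ["", "bi", "ka", "li"]
ARTICLES = ["", "al"]
SUFFIXES = ["", "an", "un", "at"]  # hypothetical number/gender endings


def guess_noun_analyses(word, min_stem_len=3):
    """Enumerate every (conj, prep, article, stem, suffix) split of `word`."""
    analyses = set()
    for conj, prep, art, suf in product(CONJUNCTIONS, PREPOSITIONS, ARTICLES, SUFFIXES):
        prefix = conj + prep + art
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        if suf and not rest.endswith(suf):
            continue
        stem = rest[:len(rest) - len(suf)] if suf else rest
        # Reject splits that leave an implausibly short stem.
        if len(stem) >= min_stem_len:
            analyses.add((conj, prep, art, stem, suf))
    return sorted(analyses)


# "wabialmuallim": simplified romanization of 'and with the teacher'.
for analysis in guess_noun_analyses("wabialmuallim"):
    print(analysis)
```

The sketch over-generates on purpose: every split that leaves a long enough stem is kept, which is why a second, context-sensitive component is needed to choose among the candidates.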
12. Methodology
We use a pipelined approach
• First: a machine learning (SVM), context-sensitive tool
(MADA) is used to predict:
• POS
• Morpho-syntactic features of number, gender, person, tense, etc.
• Second: The finite-state morphological guesser is used
to produce all the possible interpretations of words and
suggested lemmas.
• Third: The two outputs are matched against each other and
the analysis they agree on is selected.
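The third step, matching the tagger's prediction against the guesser's candidates, might look like the following minimal sketch. The dictionaries are hand-written placeholders: they are not real MADA output or real guesser output, and the feature names and lemma labels are hypothetical.

```python
def select_agreed_analyses(predicted, candidates, features=("pos", "number", "gender")):
    """Keep guesser candidates whose listed features all match the prediction."""
    return [
        cand for cand in candidates
        if all(cand.get(f) == predicted.get(f) for f in features)
    ]


# Hypothetical tagger prediction for one unknown word (not actual MADA output).
predicted = {"pos": "noun", "number": "pl", "gender": "f"}

# Hypothetical guesser candidates for the same word.
candidates = [
    {"lemma": "lemma_a", "pos": "noun", "number": "pl", "gender": "f"},
    {"lemma": "lemma_b", "pos": "verb", "number": "sg", "gender": "m"},
    {"lemma": "lemma_c", "pos": "noun", "number": "sg", "gender": "f"},
]

agreed = select_agreed_analyses(predicted, candidates)
print([a["lemma"] for a in agreed])
```

Only the candidate that agrees with the prediction on every compared feature survives; how ties or empty matches are resolved is not stated on the slide and is left out here.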
14. Methodology
Results
• Corpus size is 1,089,111,204 tokens, 7,348,173
types
• Unknown Types in the corpus: 2,116,180 (29%)
• After spell checking, correctly spelt types are
208,188
• Types with frequency of 10 or more: 40,277
• After lemmatization: 18,399 types
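The 29% figure and the size of each filtering step can be checked with a few lines of arithmetic. The sketch assumes that each count is a subset of the one before it, which the slide suggests but does not state explicitly; the step labels are paraphrases.

```python
# Counts taken from the Results slide; the labels are paraphrased.
funnel = [
    ("all types", 7_348_173),
    ("unknown types", 2_116_180),
    ("correctly spelt unknown types", 208_188),
    ("types with frequency >= 10", 40_277),
    ("types after lemmatization", 18_399),
]


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


unknown_share = share(2_116_180, 7_348_173)

# Report each step relative to the step before it.
for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
    print(f"{name}: {count:,} ({share(count, prev_count):.1f}% of {prev_name})")
```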
15. Testing and Evaluation
We create a gold standard of 1,310 words
manually-annotated for:
• Gold lemma
• Gold POS
• Lexical relevance (include in a dictionary): yes or
no
Among unknown words:
• Proper nouns are the most common
• Verbs are the least common

Gold POS     Type count   Ratio
noun_prop    584          45%
noun         264          20%
adj          255          19%
verb         52           4%
16. Testing and Evaluation
Evaluating POS (accuracy)
• Baseline: The most frequent tag (proper name)
for all unknown words: 45%
• MADA: 60%
• Voted POS tagging: 69%. When a lemma gets a different POS
tag with a higher frequency, we take the higher-frequency tag.

   POS tagging              Accuracy
1  POS tagging baseline     45%
2  MADA POS tagging         60%
3  Voted POS tagging        69%
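One way to read the voting rule is as a per-lemma majority vote over all tagged occurrences. The sketch below follows that reading, which is an assumption: the slide does not specify how frequencies are counted or how ties are broken, and the lemma label is a placeholder.

```python
from collections import Counter, defaultdict


def vote_pos(tagged_occurrences):
    """Map each lemma to the POS tag it received most often.

    `tagged_occurrences` is an iterable of (lemma, pos) pairs, one per
    tagged occurrence. Ties go to the tag that was seen first.
    """
    counts = defaultdict(Counter)
    for lemma, pos in tagged_occurrences:
        counts[lemma][pos] += 1
    return {lemma: tags.most_common(1)[0][0] for lemma, tags in counts.items()}


# Hypothetical tagger output for three occurrences of one lemma.
occurrences = [("lemma_a", "noun"), ("lemma_a", "noun_prop"), ("lemma_a", "noun")]
print(vote_pos(occurrences))
```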
17. Testing and Evaluation
Evaluating Lemmatization (accuracy)
• Baseline: new words appear in their base form:
45%
• Pipelined, with strict matching of the definite article ‘al’: 54%
• Pipelined, ignoring the definite article ‘al’: 63%
   Lemmatization                                                                            Accuracy
1  Lemma first-order baseline                                                               45%
2  Pipelined lemmatization (first-order decision) with strict definite article matching     54%
3  Pipelined lemmatization (first-order decision) ignoring definite article matching        63%
18. Testing and Evaluation
Evaluating Lemma Weighting
• The weighting criteria aim to push lexicographically
relevant words up the list and less interesting words down.
• We aim to make the number of important words high in the
top 100 and low in the bottom 100
Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor

Good words                                  In top 100   In bottom 100
relying on frequency alone (baseline)       63           50
relying on number of sister forms * 800     87           28
relying on POS factor                       58           30
using combined criteria                     78           15
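The weighting formula translates directly into a small function. The example values are invented for illustration, and the slide does not say how the POS factor is assigned, so it is passed in here as a plain number.

```python
def word_weight(num_sister_forms, sister_form_frequency, pos_factor):
    """Word Weight = ((sister forms * 800) + sister-form frequencies) / 2 + POS factor."""
    return (num_sister_forms * 800 + sister_form_frequency) / 2 + pos_factor


# Hypothetical lemma: 3 sister forms, total frequency 120, POS factor 50.
print(word_weight(3, 120, 50))  # (2400 + 120) / 2 + 50 = 1310.0
```

Sorting candidate lemmas by this value in descending order gives the ranked list whose top 100 and bottom 100 are compared on the slide.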
19. Conclusion
• We develop a methodology for automatically extracting
and lemmatizing unknown words in Arabic
• We pipeline a finite-state guesser with a machine
learning tool for lemmatization
• We develop a weighting mechanism for predicting the
relevance and importance of lemmas
• Out of 2,116,180 unknown words, we create a lexicon of
18,399 lemmatized, POS-tagged and weighted entries.