SlideShare uma empresa Scribd logo
1 de 89
1
Arabic Natural Language
Processing
Nizar Habash
Columbia University
Center for Computational Learning Systems
With modifications
2
Road Map
• Introduction
• Orthography
– Arabic Script
– MSA Phonology and Spelling
– Encoding Issues
• Morphology
• Syntax
3
Introduction
• What is ‘Arabic’?
– A Semitic language
– Arabic Script
• With or without diacritics
• More ambiguity!
– Arabic Language
• Modern Standard
Arabic (MSA)
• Arabic Dialects
4
Road Map
• Introduction
• Orthography
– Arabic Script
– MSA Phonology and Spelling
– Encoding Issues
• Morphology
• Syntax
5
Arabic Script
Arabic script is an alphabet with allographic variants,
optional zero-width diacritics and common ligatures.
Arabic script is used to write many languages: Arabic,
Persian, Kurdish, Urdu, Pashto, etc.
PDF Files problem
 ‫طا‬ُ ‫ا‬ ‫خ‬َ‫ط‬ ‫ال‬
‫بي‬ِ‫ي‬ ‫ر‬َ‫ط‬ ‫ع‬َ‫ط‬ ‫ال‬
6
Arabic Script
Alphabet
• letter forms
• letter marks
• Arabic only
• Other languages
• Persian, Kurdish, Urdu,
Pashto, etc.
• OCR output ambiguity
7
Arabic Script
‫ب‬
/b/
Alphabet (MSA)
• letters (form+mark)
• Distinctive
• Non-distinctive
‫ت‬
/t/
‫ث‬
/θ/
‫س‬
/s/
‫ش‬
/ /ʃ
/ʔ/
glottal stop aka hamza
‫ا‬‫أ‬‫إ‬‫آ‬‫ء‬‫ؤ‬ ‫ئ‬
8
Arabic Script
Letter Shapes
• No distinction between print and handwriting
• No capitalization
• Right-to-left
• Ambiguous
shapes
• Connective
letters
• Disconnective
letters
•OCR problems ‫ﺪ‬
‫د‬
‫ﺎ‬
‫ا‬
‫ﺰ‬
‫ز‬
‫ﻦ‬
‫ﻨ‬
‫ﻧ‬
‫ن‬
final
media
l
initial
Stand ‫ا‬
alone
‫ﻎ‬‫ﺶ‬‫ﻢ‬‫ﻚ‬‫ﺐ‬
‫ﻐ‬‫ﺸ‬‫ﻤ‬‫ﻜ‬‫ﺒ‬
‫ﻏ‬‫ﺷ‬‫ﻣ‬‫ﻛ‬‫ب‬
‫ﻍ‬‫ش‬‫م‬‫ك‬‫ب‬
9
Arabic Script
Letter shaping
/kitāb/
 ‫ا‬ ‫ﻛ‬‫ت‬‫ﺎ‬‫ب‬‫ا =ا ﻛتﺎب‬  ‫ا‬  ‫كا‬ ‫تا‬‫ا‬ ‫ا‬‫ب‬  ‫ا‬
ktāb
book
/katab/
 ‫ا‬ ‫ﻛ‬‫ت‬‫ﺐ‬‫ا =ا ﻛتﺐ‬  ‫ا‬  ‫كا‬ ‫تا‬‫ب‬  ‫ا‬
ktb
to write
10
Arabic Script
Nunation
‫ب‬ً
/ban/
‫ب‬ٌ
/bun/
‫ب‬ٍ
/bin/
Diacritics
• Zero-width characters
• Used for short vowels
 ‫تﺐا ا ا‬َ‫ط‬ ‫ﻛ‬َ‫ط‬  ‫ا‬ ‫/ا‬katab/ ‫ا‬to ‫ا‬write
• Nunation is used for
nominal indefinite
marker in MSA
 ‫با ا‬ٌ  ‫تﺎ‬َ‫ط‬ ‫ﻛ‬ِ‫ي‬  ‫ا ا‬ ‫/ا‬kitābun/ ‫ا‬a ‫ا‬book
Vowel
‫ب‬َ
/ba/
‫ب‬ُ
/bu/
‫ب‬ِ
/bi/
11
Arabic Script
Double
Consonant
‫ب‬ّ
/bb/
‫ب‬ّ ‫ب‬ّ ‫ب‬ّ
/bbu/ /bbin/ /bban/
Diacritics
• No-vowel marker (sukun)
 ‫تﺐا ا‬َ‫ط‬ ‫ﻜ‬ْ‫َت‬ ‫ﻣ‬َ‫ط‬  ‫ا‬ ‫/ا‬maktab/ ‫ا‬office
• Double consonant marker
(shadda)
 ‫تﺐا ا‬َّ‫ب‬ ‫ﻛ‬َ‫ط‬  ‫ا‬ ‫/ا‬kattab/ ‫ا‬to ‫ا‬dictate
• Combinable
No Vowel
‫ب‬ْ
/b/
12
Arabic Script
‫ربا =ا عرب‬َ‫ط‬ ‫ع‬َ‫ط‬  ‫ا‬ ‫ا با‬َ‫ط‬ ‫ا  ر‬َ‫ط‬ ‫ع‬
Putting it together
Simple combination
Ligatures
‫ربا =ا ﻏرب‬ْ‫َت‬ ‫ﻏ‬َ‫ط‬  ‫ا‬ ‫ا با‬ْ‫َت‬ ‫ا  ر‬َ‫ط‬ ‫غ‬West ‫/ا‬ʁarb/
Arab / arab/ʕ
 ‫ا ﺳﻠﺎما‬ ‫سا لا اا ما‬Peace ‫/ا‬salām/ ‫ﺳلم‬
13
Arabic Script
Tatweel ‫ا‬
• ‫‘ا‬elongation’
• ‫ا‬aka ‫ا‬kashida
• ‫ا‬used ‫ا‬for ‫ا‬text ‫ا‬highlight ‫ا‬
and ‫ا‬justification ‫ا‬
‫الﻧسﺎن‬ ‫حقوق‬
‫الﻧسـﺎن‬ ‫حقـوق‬
‫الﻧســـﺎن‬ ‫حقـــوق‬
‫الﻧســـــﺎن‬ ‫حقـــــوق‬
human ‫ا‬rights ‫/ا ا‬ħuqūq ‫ا‬alʔinsān/
14
Western Arabic
Tunisia, Morocco, etc.
0 1 2 3 4 5 6 7 8 9
Indo-Arabic
Middle East
٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Eastern Indo-Arabic
Iran, Pakistan, etc.
٠ ١ ٢ ٣ ۴ ۵ ۶ ٧ ٨ ٩
Arabic Script
“Arabic” ‫ا‬Numerals
• ‫ا‬Decimal ‫ا‬system
• ‫ا‬Numbers ‫ا‬written ‫ا‬left-to-right ‫ا‬in ‫ا‬right-to-left ‫ا‬text ‫ا :ا‬dual ‫ا‬directions
 ‫اﺳتقﻠتا الجﺰائرا فيا ﺳﻨةا‬1962 ‫ا بعﺪا‬132 ‫ا عﺎﻣﺎا ﻣﻦا‬
‫.الحتللا الفرﻧسي‬ ‫ا‬
Algeria ‫ا‬achieved ‫ا‬its ‫ا‬independence ‫ا‬in ‫ا 2691ا‬after ‫ا 231ا‬years ‫ا‬of ‫ا‬French ‫ا‬occupation.
 ‫ا‬
• ‫ا‬Three ‫ا‬systems ‫ا‬of ‫ا‬enumeration ‫ا‬symbols ‫ا‬that ‫ا‬vary ‫ا‬by ‫ا‬region
15
Road Map
• Introduction
• Orthography
– Arabic Script
– MSA Phonology and Spelling
– Encoding Issues
• Morphology
• Syntax
16
MSA Phonology and Spelling
• Phonological profile of Standard Arabic
– 28 Consonants
– 3 short vowels, 3 long vowels, 2 diphthongs
• Arabic spelling is mostly phonemic …
– Letter-sound correspondence
ā ʔt bʤθx ħδ dz rsṣ ʃṭ ḍʕk ʁq flm
‫ت‬‫ث‬ ‫ا‬‫ب‬‫ج‬‫ح‬‫خ‬‫د‬‫ذ‬‫ر‬‫ز‬‫س‬‫ش‬‫ص‬‫ض‬‫ط‬‫ظ‬‫ع‬‫غ‬‫ف‬‫ق‬‫ك‬‫ل‬‫م‬‫ن‬‫ه‬‫و‬‫ي‬ ‫ى‬ ‫ء‬‫أ‬‫آ‬‫إ‬‫ؤ‬‫ئ‬‫ة‬
h nwj ūī δ̣
17
MSA Phonology and Spelling
• Arabic spelling is mostly phonemic …
Except for
• Medial short vowels can only appear as
diacritics
• Diacritics are optional in most written text
– Except in holy scripture
– Occasionally appear in newspapers to mark less
common readings, resolve certain ambiguities
• ‫كتب‬ /katab/ to write ‫كتب‬ُ /kutib/ to be written
• Dual use of ‫ا‬,‫و‬,‫ي‬ as consonant and long vowel
– ‫ا‬ (/‘/,/ā/) ‫و‬ (/w/,/ū/) ‫ي‬ (/j/,/ī/)
18
MSA Phonology and Spelling
• Arabic spelling is mostly phonemic …
Except for (continued)
• Morphophonemic characters
– Feminine marker ‫ة‬ (ta marbuta)
• ‫كبير‬ /kabīr/ (big ♂) ‫كبير‬‫ة‬ /kabīra/ (big ♀)
– Derivation marker
• / aṣa/ (to disobeyʕ ‫عص‬‫ى‬ ) (a stick ‫عص‬‫ا‬ )
• Hamza variants (6 characters for one phoneme!)
– ( ‫أآإؤئ‬ ‫ء‬(‫بها‬‫ء‬‫بها‬ ‫ه‬‫ؤ‬‫بها‬ ‫ه‬‫ئ‬‫ه‬ /baha’/ + 3MascSing (his glory)
19
MSA Phonology and Spelling
• Arabic spelling can be ambiguous
• But how ambiguous? Really?
• Classic example
ths s wht n rbc txt lks lk wth n vwls
this is what an Arabic text looks like with no vowels
• Not exactly true
– Long vowels are always written
– Initial vowels are represented by an ‫ا‬ ‘alef’
– Some final short vowels are represented
ths is wht an Arbc txt lks lik wth no vwls
Will revisit ambiguity in more detail again under morphology discussion
Humans can read, machines may be troubled!
20
Road Map
• Introduction
• Orthography
– Arabic Script
– MSA Phonology and Spelling
– Encoding Issues
• Morphology
• Syntax
21
Encoding Issues
• Encoding Arabic
– Data entry, storage, and display
– Ease of use for Arabic-
– Multi-script support
– Multilingual support (extended Arabic characters)
• Types of Encoding
– Machine character sets
• Graphemic (shape insensitive, logical order)
• Allographic (shape/direction sensitive) [obsolete]
– Human accessible
• Transliteration
• Phonetic spelling (IPA)
• Romanization
22
Encoding Issues
• Many Conflicting Character Sets for Arabic
23
Encodings
• CP-1256
– Commonly used
– 1-byte characters
– Widely supported
input/display
– Minimal support for
extended Arabic
characters
– bi-script support
(Roman/Arabic)
– Tri-lingual support:
Arabic, French,
English (ala ANSI)
24
Encodings
• Unicode
– Becoming the
standard more and
more
– 2-byte characters
– Widely supported
input/display
– Supports extended
Arabic characters
– Multi-script
representation
25
Encoding Issues
Arabic Display
• Memory (logical order) 
ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004.
‫ﺕﻙﺭﺍﺵ‬‫ﻥﻱﻃﺱﻝﻑ‬ (Palestine) ‫ﻱﻑ‬‫ﺩﺍﻱﺏﻡﻝﻭﺍ‬ (Olympics) 2000 ‫ﻭ‬2004 .
or this way for those with direction-bias

.4002 æ 0002 )scipmylO( ÏÇíÈãáæÇ íÝ )enitselaP( äíØÓáÝ ÊßÑÇÔ
.4002 ‫ﻭ‬0002 )scipmylO( ‫ﺍﻭﻝﻡﺏﻱﺍﺩ‬‫ﻑﻱ‬ )enitselaP( ‫ﻑﻝﺱﻃﻱﻥ‬ ‫ﺵﺍﺭﻙﺕ‬
26
Encoding Issues
Arabic Display
• Memory (logical order)
ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004.
‫ﺕﻙﺭﺍﺵ‬‫ﻥﻱﻃﺱﻝﻑ‬ (Palestine) ‫ﻱﻑ‬‫ﺩﺍﻱﺏﻡﻝﻭﺍ‬ (Olympics) 2000 ‫ﻭ‬2004 .
• Display (visual order)
– Bidirectional (BiDi) support
• Numbers and Roman script
2000 ‫ﻭ‬2004 . (Olympics) ‫ﺍﻭﻝﻡﺏﻱﺍﺩ‬‫ﻑﻱ‬ (Palestine) ‫ﻑﻝﺱﻃﻱﻥ‬‫ﺵﺍﺭﻙﺕ‬
– Letter and ligature shaping
2000 ‫ﻭ‬2004 . (Olympics) ‫ﺍﻭليبياﺩ‬ ‫ف‬ (Palestine) ‫فلسطي‬ ‫شاﺭكت‬
27
Display Problems
تدشين
منطقة ح
رة في
دبي
للتجارة
الالكترÙ
ˆÙ†ÙŠØ©
ÊÏÔêæ åæ×âÉ ÍÑÉ
áê ÏÈê ääÊÌÇÑÉ
ÇäÇäãÊÑèæêÉ
ÊÏÔíä ãäØÞÉ ÍÑÉ
Ýí ÏÈí ááÊÌÇÑÉ
ÇáÇáßÊÑæäíÉ
‫تدشين‬‫منطقة‬‫حرة‬‫في‬
‫دبي‬‫للتجارة‬‫اللكترونية‬
ُ‫ظ‬‫ظظ؛؟‬ ‫ع‬‫ع‬‫ع‬ 
‫ع‬‫ظ‬  ‫ع‬‫ظ‬  ‫ظظ­ظ‬
‫ع‬‫ع‬‫ظ‬ ‫ظ‬ ‫ع‬
‫ع‬‫ع‬‫ظ‬ ‫ظظظ،ظ‬
‫ظ‬ ‫ع‬‫ظ‬  ‫ع‬‫ع‬‫ظ‬ ‫عظ‬
‫ع‬‫ع‬‫ظ‬ 
ï»…‫ظ‬ †‫ظ‬ ‫ط¯ط´ظ‬ ‫؟ط‬‫ٹ‬ ‫ھ‬
©‫ط­ط±ط‬ ©‫ظ†ط·ظ‚ط‬
‫ط¯ط¨ظ‬ ‫ظپظ‬‫ٹ‬ ‫ٹ‬
©‫ط¬ط§ط±ط‬ ‫ظ„ظ„ط‬‫ھ‬
‫ط§ظ„ط§ظ„ظ‬ƒ‫ط±ظˆظ‬ ‫ط‬‫ھ‬
‫ط‬ ‫†ظ‬‫ٹ‬ ©
Ԫʏ 栥既 ɠɠ͑ ԪψԪ
Ԫ ǑɠǤǤ ㊑親 ɠ
‫تدشين‬‫منطقة‬‫حرة‬‫في‬
‫دبي‬‫للتجارة‬‫اللكترونية‬
‫تدش‬ê‫هو‬ ‫×و‬â‫حرة‬ ‫ة‬
‫ل‬ê ‫دب‬ê ‫ننتجارة‬
‫انانمتر‬è‫و‬ê‫ة‬
Ԫʏ ԪԪ ɠɠ͑ Ԫ
ψ�ԪǑɠǡǡߊѦ Ԫ
‫كلظ‬ ‫ش ل‬ٍ‫ل‬ ‫تد‬Ԫ‫حرة‬ ‫ة‬Ԫٍ‫ل‬
‫افاف‬ ‫ففتجارة‬ ‫ب‬ٍ‫ل‬ ‫د‬Ԫ‫لة‬ٍ‫ل‬‫ترن‬
‫تدشين‬‫منطقة‬‫حرة‬‫في‬
‫دبي‬‫للتجارة‬‫اللكترونية‬
WesternUnicodeISO-8859CP-1256
Display Encoding
CP­1256ISO­8859Unicode
ActualEncoding
• Wrong encoding • Partial support problems
28
http://www.cyrillic.com/kbd/btc.html
‫ﺱ‬‫ﻞ‬‫ا‬‫ﻡ‬ ‫ﻡ‬‫سل‬
Encoding Issues
Arabic Input
• Standard graphemic keyboard
• Logical order input
29
Encodings
Buckwalter Encoding
• Romanization
– One-to-one mapping
to Arabic script spelling
– Left-to-right
– Easy to learn/use
– Human  machine compatible
• Penn Arabic Tree Bank
• Some characters can be
modified to allow use with XML
and regular expressions
• Roman input/display
• Monolingual encoding (can’t do
English and Arabic)
• Minimal support for extended
Arabic characters
30
Road Map
• Introduction
• Orthography
• Morphology
– Derivational Morphology
– Inflectional Morphology
– Morphological Ambiguity
– Arabic Computational Morphology
• Syntax
31
Morphology
• Type
– Concatenative: prefix, suffix, infix
– Templatic: root+pattern
• Function
– Derivational
• Creating new words
• Mostly templatic
– Inflectional
• Modifying features of words
– Tense, number, person, mood, aspect
• Mostly concatenative
32
Road Map
• Introduction
• Orthography
• Morphology
– Derivational Morphology
– Inflectional Morphology
– Morphological Ambiguity
– Arabic Computational Morphology
• Syntax
33
Derivational Morphology
• Templatic Morphology
‫م‬‫كت‬‫و‬‫ب‬
b
?‫و‬ ‫م‬َ??
kt
‫ك‬‫ا‬‫تب‬
?‫ا‬ِ??
maktūb
written
kātib
writerLexeme.Meaning =
(Root.Meaning+Pattern.Meaning)*Idiosyncrasy.Random
‫ب‬ ‫ك‬‫ت‬
maū āi
• Root
• Pattern
• Lexeme
34
Derivational Morphology
Root Meaning
• ‫ك‬‫ت‬‫ب‬ KTB = notion of “writing”
‫كتب‬
/katab/
write
‫كاتب‬
/kātib/
writer
‫مكتوب‬
/maktūb/
letter
‫كتاب‬
/kitāb/
book
‫مكتبة‬
/maktaba/
library
‫مكتب‬
/maktab/
office
‫مكتوب‬
/maktūb/
written
35
Derivational Morphology
Root Meaning
• LHM-2
• Notion of “battle”
– ‫م‬‫لحم‬‫ة‬ /malħama/
• Fierce battle
• Massacre
• Epic
36
Road Map
• Introduction
• Orthography
• Morphology
– Derivational Morphology
– Inflectional Morphology
– Morphological Ambiguity
– Arabic Computational Morphology
• Syntax
37
Inflectional Morphology
• Derivational Morphology
– Lexeme ≈ Root + Pattern
• Inflectional Morphology
– Word = Lexeme + Features
• Features
– Part-of-speech
• Traditional: Noun, Verb, Particle
• Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det,
Aux, Pun, IJ, and others
– Noun-specific
• Number: singular, dual, plural, collective
• Gender: masculine, feminine, Neutral
• Definiteness: definite, indefinite
• Case: nominative, accusative, genitive
• Possessive clitic
38
Inflectional Morphology
• Features (continued)
– Verb-specific
• Aspect: perfective, imperfective, imperative
• Voice: active, passive
• Tense: past, present, future
• Mood: indicative, subjunctive, jussive
• Subject (Person, Number, Gender)
• Object clitic
– Others
• Single-letter conjunctions
• Single-letter prepositions
39
Inflectional Morphology
Nouns
‫وللمكتبات‬
/walilmaktabāt/
‫و‬+‫ل‬+‫ال‬+‫مكتبة‬‫+ات‬
wa+li+al+maktaba+āt
and+for+the+library+plural
And for the libraries
conjprepnounposs plural article
‫وكبيوتنا‬
/wakabiyūtinā/
‫نا‬+‫بيوت‬+‫ك‬+‫و‬
wa+ka+biyūt+nā
and+like+houses+our
And like our houses
• Morphotactics (e.g. ‫ل+ال‬  ‫)لل‬
• Arabic Broken Plurals (templatic)
40
Inflectional Morphology
Verbs
‫فقلناها‬
/faqulnāhā/
‫ف‬+‫قال‬+‫نا‬+‫ها‬
fa+qul+na+hā
so+said+we+it
So we said it.
conjverbobject subj tense
‫وسن‬‫قول‬‫ها‬
/wasanaqūluhā/
‫و‬+‫س‬ +‫ن‬+‫قول‬+‫ها‬
wa+sa+na+qūl+u+hā
and+will+we+say+it
And we will say it
• Morphotactics
• Subject conjugation (suffix or circumfix)
41
Inflectional Morphology
• Perfect verb subject conjugation (suffixes only)
Singular Dual Plural
1 ‫كتب‬‫ت‬ُ katabtu ‫كتب‬‫نا‬ katabnā
2 ‫كتب‬‫ت‬َ katabta ‫كتب‬‫ت‬‫ما‬ katabtumā ‫كتب‬‫ت‬‫م‬ katabtum
3 ‫كتب‬َ kataba ‫كتب‬‫ا‬ katabā ‫كتب‬‫وا‬ katabtū
• Imperfect verb subject conjugation (prefix+suffix)
Feminine form and other verb moods not shown
Singular Dual Plural
1 ‫ا‬‫كتب‬ ُaktubu ‫ن‬‫كتب‬ ُnaktubu
2 ‫ت‬‫كتب‬ ُ taktubu ‫ت‬‫كتب‬‫ان‬ taktubān ‫ت‬‫كتب‬‫ون‬ taktubūn
3 ‫ي‬‫كتب‬ ُ yaktubu ‫ي‬‫كتب‬‫ان‬ yaktubān ‫ي‬‫تكتب‬‫ون‬ yaktubūn
42
Road Map
• Introduction
• Orthography
• Morphology
– Derivational Morphology
– Inflectional Morphology
– Morphological Ambiguity
– Arabic Computational Morphology
• Syntax
43
Morphological Ambiguity
• Derivational ambiguity
– ‫:قاعدة‬ basis/principle/rule, military base, Qa'ida/Qaeda/Qaida
• Inflectional ambiguity
– ‫:تكتب‬ you write, she writes
– Segmentation ambiguity
• ‫:وجد‬ he found; ‫:و+جد‬ and+grandfather
• ‫للغة‬:‫ل+لغة‬ : for a language; ‫:ل+اللغة‬ for the language
• Spelling ambiguity
– Optional diacritics
• ‫:كاتب‬ /kātib/ writer , /kātab/ to correspond
– Suboptimal spelling
• Hamza dropping: ‫أ‬,‫إ‬  ‫ا‬
• Undotted ta-marbuta: ‫ة‬  ‫ه‬
• Undotted final ya: ‫ي‬  ‫ى‬
44
Morphological Ambiguity
• Multiple sources of ambiguity
‫بين‬
– /bayyana/ Verb he declared/demonstrated
– /bayyanna/ Verb they [feminine] declared/demonstrated
– /bayyin/ Adj clear/evident/explicit
– /bayna/ Prep between/among
– /biyin/ Proper Noun in Yen
– /biyn/ Proper Noun Ben
• Hard to measure specific causes of ambiguity
– Derivational ambiguity* (diacritized tokens)
• 1.09 entries/token
• 1.01 entries/token (within same part-of-speech)
– Spelling ambiguity* (undiacritized tokens)
• 1.28 entries/token
• 1.08 entries/token (within same part-of-speech)
* in Buckwalter’s Lexicon (~40,000 lexemes)
45
Road Map
• Introduction
• Orthography
• Morphology
– Derivational Morphology
– Inflectional Morphology
– Morphological Ambiguity
– Arabic Computational Morphology
• Syntax
• Machine Translation Issues
• Syntax
46
Arabic Computational Morphology
• Representation units
• Natural token ‫ولـلـمـكتـبــــات‬
– White space separated strings (as is)
– Can include extra characters (e.g. tatweel/kashida)
• Word ‫وللمكتبات‬
• Segmented word
– Can include any degree of morphological analysis
– Pure segmentation: ‫لمكتبات‬ ‫ل‬ ‫و‬
– Arabic Treebank tokens (with recovery of some
deleted/modified letters): ‫المكتبات‬ ‫ل‬ ‫و‬
47
Arabic Computational Morphology
• Representation units (continued)
• Prefix + Stem + Suffix
– ‫ولل‬+‫مكتب‬+‫ات‬
– Can create more ambiguity
• Lexeme + Features
– ‫+[مكتبة‬Plural +Def +‫+و‬ ‫]ل‬
• Root + Pattern + Features
– ‫كتب‬+‫ة‬ a3a21a‫م‬ + [+Plural +Def + ‫ل‬+‫و‬ ]
– Very abstract
• Root + Pattern + Vocalism + Features
– ‫كتب‬+‫م‬321‫ة‬ + a.a.a + [+Plural +Def + ‫ل‬+‫و‬ ]
– Very very abstract
48
Arabic Computational Morphology
• Approaches
– Finite state machines (Beesely,2001) (Kiraz,2001) (Habash et al, 2005b)
– Concatenative analysis/generation (Buckwlater,2002) (Cavalli-Sforza et
al, 2000)
– Lexeme+Feature analysis/generation (Habash, 2004)
– Shallow stemming (Darwish,2002) (Aljlayl and Frieder 2002)
– Machine learning (Diab et al,2004) (Lee et al,2003) (Rogati et al, 2003)
• Issues
– Appropriateness of system representation for an application
• Machine Translation vs. Information Retrieval
• Arabic spelling vs. phonetic spelling
– System coverage
– System extendibility
– Availability to researchers
– Use for analysis and generation
Road Map
• Introduction
• Orthography
- Arabic Script
- MSA Phonology and Spelling
- Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/…
- Encoding Issues
• Morphology
• Syntax
49
Arabic Script
Other languages
Arabic
• No more than 3 dots
• Dots either above or below
• Marks are 1/2/3 dots, hamza (‫)ء‬
or madda (~) only
• Rare borrowing for foreign words
• ‫/پ‬p/, ‫ڤ‬ /v/, ‫چ‬ ‫گ‬ ‫ڤ‬ /g/, ‫چ‬ /t ʃ /
• regionally variable
Not Arabic
• Extra marks: haft (v), ring (o), taa (‫,)ط‬
four dots (::), vertical dots (:)
• Some Numerals (i,h,g)
‫ب‬ … ‫إ‬ ‫ا‬ ‫ؤ‬ ‫أ‬
‫ئةتثجحخ‬
‫ص‬ ‫ش‬ ‫س‬ ‫ز‬ ‫ر‬ ‫ذد‬
‫ضطظعغ‬
‫ف‬
‫قكلمن…و‬‫ء‬ ‫ىي‬
‫ژ‬ ‫ڑ‬ ‫ڒ‬ ‫ړ‬ ‫ٻ‬ ‫ټ‬ ‫ٽ‬ ‫پ‬ ‫ٿ‬ ‫ڀ‬
‫ڏ‬ ‫ڐ‬ ‫ڻ‬ ‫ڼ‬ ‫ڽ‬‫ڹ‬ s r q p ‫ڈ‬ ‫ډ‬ ‫ڊ‬
‫ځ‬ ‫ڂ‬ ‫ڃ‬ ‫ڄ‬ ‫چ‬ ‫څ‬ ‫ۇ‬ ‫ۈ‬ ‫ۅ‬ ‫ۆ‬
‫ڏ‬ ‫ڐ‬ ‫ڑ‬ ‫ڒ‬ s r q p ‫ڈ‬ ‫ډ‬ ‫ڊ‬
‫گ‬ ‫ێ‬ ‫…ڤ‬
Once you learn the alphabet, it is easier ☺
HOW TO DETECT ARABIC?? 50
Morphology and Syntax
• Rich morphology crosses into syntax
- Pro-drop / Subject conjugation
- Verb sub-categorization and object clitics
• Verbtransitive+subject+object
• Verbintransitive+subject but not Verbintransitive+subject+object
• Verbpassive+subject but not Verbpassive+subject+object
• Morphological interactions with syntax
- Agreement
• Full: e.g. Noun-Adjective on number, gender, and definiteness (for
persons)
• Partial: e.g. Verb-Subject on gender (in VSO order)
- Definiteness
• Noun compound formation, copular sentences, etc.
• Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc.
51
Morphology and Syntax
• Morphological interactions with syntax (continued)
- Case
• MSA is case marking: nominative, accusative, genitive
• Almost-free word order
• Case is often marked with optionally written short vowels
- This effectively limits the word-order freedom in published text
• Agglutination
- Attached prepositions create words that cross phrase
boundaries
‫ت‬  W‫ا+ل‬ li+Almaktabāt
for the-libraries [PP li [NP Almaktabāt]]
• Some morphological analysis (minimally segmentation)
is necessary even for statistical approaches to parsing
52
Road Map
• Introduction
• Orthography
• Morphology
• Syntax
- Morphology and Syntax
- Sentence Structure
- Phrase Structure
- Computational Resources
53
Sentence Structure
Two types of Arabic Sentences
• Verbal sentences
- [Verb Subject Object] (VSO)
-‫العشعار‬ ‫اللول د‬ ‫كتب‬
Wrote the-boys the-poems
The boys wrote the poems
• Copular sentences (aka nominal sentences)
- [Topic Complement]
‫عشعراء‬ ‫اللول د‬
the-boys poets
The boys are poets
54
Sentence Structure
• Verbal sentences
- Verb agreement with gender only
• Default singular number
• ‫الولد-اللول د‬ ‫كتب‬wrote3MascSing the-boy/the-boys
‫كتب‬‫ت‬‫البنت-البنات‬ wrote3FemSing the-girl/the-girls
- Pronominal subjects are conjugated
• ‫كتب‬‫ت‬ wrote-youMascSing
• ‫كتب‬‫تم‬ wrote-youMascPlur
• ‫ا‬ ‫كتب‬‫وا‬ wrote-theyMascPlur
– Passive verbs
• Same structure: Verbpassive SubjectunderlyingObject
• Agreement with surface subject
55
Road Map
• Introduction
• Orthography
• Morphology
• Syntax
- Morphology and Syntax
- Sentence Structure
- Phrase Structure
- Computational Resources
56
Computational Resources
• Monolingual corpora for building language models
- Arabic Gigaword
• Agence France Presse
• AlHayat News Agency
• AnNahar News Agency
• Xinhua News Agency
- Arabic Newswire
- United Nations Corpus (parallel with other UN languages)
- Ummah Corpus (parallel with English)
• Distributors
- Linguistic Data Consortium (LDC)
- Evaluations and Language resources Distribution Agency
(ELDA)
57
Computational Resources
• Penn Arabic Treebank (PATB)
- Started in 2001
- Goal is 1 Million words
- Currently 650K words
• Agence France Presse , AlHayat newspaper, AnNahar
newspaper
• POS tags
- Buckwalter analyzer
- Arabic-tailored POS list
• PATB constituency
representation
- Some modifications of Penn English Treebank
• (e.g. Verb-phrase internal subjects)
58
Efficiency Enhancement
Tools for Arabic Search
Engines:
A Statistical Approach
by
Adnan Yahya
joint work with
Ali Salhi
Birzeit University, Palestine
Presented at AI Class, November 3, 2011
1.Introduction and Motivation
2.Our Arabic NLP Tools and Methods
3.Utility of employed Structures.
4.Search of Arabic PDF Files.
5.Next Steps
Talk Outline
Introduction
• The World-Wide Web is rapidly changing
and expanding.
• Estimates of the size of the World-Wide Web
vary from 15 to 30 billion pages.
• The share of the Arabic is around 1% (for 5%
of population).
• Usually one is interested in web searches
that return exactly the needed info.
Returning too little : not sufficient
Returning too much : not useful.
Introduction
• Sometimes you want to search for more by
(expanding the query or correcting it and
• Sometimes you want to search for less by
eliminating certain query words
• Issues like document structuring, user
profiling and using NLP tools are part of the
solution
• The problems are present in other
languages but may be more complicated
for Arabic
Arabic NLP Tools and Methods
1. It was decided to look for practical
solutions
2. Developing NLP tools and methods based
on the availability of:
• Arabic Corpus,
• Arabic Word Database,
• Stop Word List
Arabic Corpus Construction
 For developing NLP tools we employed a
statistical/Corpus based approach.
 We started with contemporary data we
obtained initially from Al-Sharq Alawsat
newspaper.
 Then we expended it with available
sources (not without major problems)
Some Initial Statistics
Expansion Directions
• Topical Corpora: Topics sub-topics for
Categorization.
• Multisource: books, journals,..
• Multiregional: dialects
• Translation tools usage
• Structures for efficient storage-
manipulation.
Arabic Word Database (AWD)
• A database with all the Arabic words on the
web with their frequencies
• The starting database was (568,106) words.
• Still expanding as we find more documents
and better ways to deal with old ones
• We retain the substructures of data even
after aggregating it: so we retain the ability
to work with subsets of data as well
• Efforts at cleaning: infrequent words +
Construction of Stop Word List
• We created an Arabic stop words list
consisting of
• the Arabic prepositions,
• pronouns, interrogatives, particles,
• English stop words list (translated).
• The list has (1065) words
• Much larger that English due to morphology
• One or multiple lists?: function dependant
Utility of the Built Structures
• Arabic Language Detection.
• Arabic Automatic Categorizer
• Arabic Query Live Suggestion
• Query modification by Word
Elimination/Expansion
• Arabic Automatic Categorizer
• Person Names unification-translation
Arabic Language Detection
1. Determine whether the language of the
document is Arabic or just a language
using Arabic alphabet, say Persian or Urdu
in order to crawl and index it.
2. Based on spell checking of words in the
text against Arabic words and looking for a
threshold that enables accurate but fast
decisions.
3. A side tool: recover wrong language data
entries.
Arabic Automatic Categorizer
• Web documents are assigned to one of
predefined categories.
• Can be used to refine the search results.
Search for query “‫”السل”م‬ under the category
religions to avoid political results. (or ‫)الزراعة‬
• We have been playing with the Granularity
of categorization
• Have implications to the corpora properties
• Multi-stage categorization, user assisted
categorization
Arabic Automatic Categorizer
Steps in Building the automatic categorizer :
1.Defined a set of topics: Politics, Religion,
Science, Economics and Commerce, Sports,
Entertainment, Arts, Social Sciences,
Engineering/technology.
2. Word-based vector for each topic was
prepared.
3. Calculate the similarities for each topic and
select the most similar: How!
Arabic Query Live Suggestion
• When the user types in the search box, the
system queries the suggestions tables to
retrieve a list of possible suggestions.
• Large suggestion tables.
• Attempting the use of two and three words
from the text corpus to reduce the
suggestion space.
Word Elimination
• Eliminate non-discriminating words (stop
words) in queries
• May speed search process.
Example: when a user search for (‫هي‬ ‫يه امما‬
‫,)التعددية‬ the search engine has to make three
runs to find match, while looking for the last
word ‫التعددية‬ is enough to find relevant pages.
Query Expansion
• Looking for relevant words or derivatives.
• In Arabic many words can be derived from
a single root.
• we ran the root extractor tool on the AWD
words.
• For each root we now have almost all the
Arabic words that can be derived from it.
Arabic Query Expansion
The Query expansion is then carried out as follows:
1. Each nonstop word is returned to its root,
then a list of words that share the same
root are retrieved from the AWD.
2. A weighing system is used to assign weight
to the expanded queries say by
frequencies
The Query Correction and
Suggestion System
• The query correction main function is to
correct user entered queries.
• For each query we examine the query
correctness by comparing its words with
words in the AWD. If there is a match, then
the query is considered correct. Otherwise, it
is misspelled.
• Why do we need such dictionary? Aren’t
the normal dictionaries enough?
The Query Correction and
Suggestion System
Common Arabic Spelling Errors
Search Arabic PDF files
• One of the topics that concerns us through
this work is Arabic PDF files because of
prevalence;
• How we can index them so as to make
them searchable by our search engine
• The concern is with legacy files
Search Arabic PDF files
• cross platform font encoding problem in Arabic PDF.
This problem comes from the fact that
windows and MACs encode characters in
different ways once we leave the ASCII
characters. Thus, when we move Arabic
characters from MAC to PC we can notice
that the PC identifies them as unreadable:
we call this legacy pdf files
Search Arabic PDF files
Text As Created In InDesign Under MAC
The Same Text Imported By Notepad under PC
Search of Arabic PDF files
Crawling the web for PDF files.
• Crawling the web for PDF files.
• Character mapping (partial for legacy files).
• Search using approximate match rather
than exact.
• Needs extra work to increase the hit ratio.
Next Steps
• Continue to expand and clean the word
collection: diverse topics
• Observe the behavior and contrast with
other work
• Fine tune the lists and monitor scalability
• We have reported on work in progress
Thanks
Thanks
?
Incremental Search
• Incremental search enable users to find NEW
data on the web.
• Not an Arabic feature but we thought it may
be interesting
• User search for query A for the first time the
search engine will provide the user with
available results.
• Next time the user searches for A, the
search engine will provide the new results
introduced since his last search for A.
The Query Correction and
Suggestion System
Correction is carried out in four stages:
1-Checking stage.
2-Processing Stage.
3-Filtering stage.
4-Weighting stage.
Stemming and Root Extraction
• Indexing based on the roots which are far more
abstract than stems will improve the retrieval
effectiveness over stems and words.
• Arabic words are formed by adding prefixes (letters
and vowels at the start), infixes (vowels) and suffixes
(letters and vowels at the end) to the root.
• The form of an Arabic word is usually determined by
its gender, number, grammatical case, whether it is
definitive or not, and finally if there is a preposition
attached to it.
Stemming and Root Extraction
Stemming is carried out in the following steps.
• There are seven levels of processing (L3,L4, L5, L6, L7,
L8, Ln) that the query may pass through the process
of stemming.
• At each level of processing all the possible
combinations of (prefix core suffix) is examined.
• The combinations where the core is a correct
Arabic (corpus/dictionary) word, prefix and suffix
are extracted.
• Level four processes infixes.
Stemming and Root Extraction
• Example: Stemming ‫”وسيأخذونهما‬
• LN: ‫وسيأخذونهما‬,‫هما‬+‫سيأخذون‬+‫و‬
• L7: ‫سيأخذون‬,‫ون‬+‫أخذ‬+‫سي‬
• L3: ‫,أخذ‬ root

Mais conteúdo relacionado

Mais procurados

Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
파이썬을 활용한 챗봇 서비스 개발 3일차
파이썬을 활용한 챗봇 서비스 개발 3일차파이썬을 활용한 챗봇 서비스 개발 3일차
파이썬을 활용한 챗봇 서비스 개발 3일차Taekyung Han
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing Adarsh Saxena
 
딥러닝 기반의 자연어처리 최근 연구 동향
딥러닝 기반의 자연어처리 최근 연구 동향딥러닝 기반의 자연어처리 최근 연구 동향
딥러닝 기반의 자연어처리 최근 연구 동향LGCNSairesearch
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMohamed Elagnaf
 
Lecture 6
Lecture 6Lecture 6
Lecture 6hunglq
 
Word2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad MahdaviWord2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad Mahdaviirpycon
 

Mais procurados (20)

Theory of computation Lec3 dfa
Theory of computation Lec3 dfaTheory of computation Lec3 dfa
Theory of computation Lec3 dfa
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Tutorial on word2vec
Tutorial on word2vecTutorial on word2vec
Tutorial on word2vec
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
파이썬을 활용한 챗봇 서비스 개발 3일차
파이썬을 활용한 챗봇 서비스 개발 3일차파이썬을 활용한 챗봇 서비스 개발 3일차
파이썬을 활용한 챗봇 서비스 개발 3일차
 
NLP
NLPNLP
NLP
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing
 
딥러닝 기반의 자연어처리 최근 연구 동향
딥러닝 기반의 자연어처리 최근 연구 동향딥러닝 기반의 자연어처리 최근 연구 동향
딥러닝 기반의 자연어처리 최근 연구 동향
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
NLP_KASHK:Regular Expressions
NLP_KASHK:Regular Expressions NLP_KASHK:Regular Expressions
NLP_KASHK:Regular Expressions
 
Lecture 6
Lecture 6Lecture 6
Lecture 6
 
Word2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad MahdaviWord2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad Mahdavi
 
NLP
NLPNLP
NLP
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 

Destaque

Animal & human language ppt.
Animal & human language ppt.Animal & human language ppt.
Animal & human language ppt.Aneeka Kazmi
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs Mustafa Jarrar
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process ImplementationMustafa Jarrar
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineeringMustafa Jarrar
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsMustafa Jarrar
 
Jarrar: Description Logic
Jarrar: Description LogicJarrar: Description Logic
Jarrar: Description LogicMustafa Jarrar
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process ManagementMustafa Jarrar
 
Jarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingJarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingMustafa Jarrar
 
Animal communication and human language
Animal communication and human languageAnimal communication and human language
Animal communication and human languageJasmine Wong
 
Jarrar: Web 2.0 Data Mashups
Jarrar: Web 2.0 Data Mashups Jarrar: Web 2.0 Data Mashups
Jarrar: Web 2.0 Data Mashups Mustafa Jarrar
 
Jarrar: WordNet And Global WordNets
Jarrar: WordNet And Global WordNetsJarrar: WordNet And Global WordNets
Jarrar: WordNet And Global WordNetsMustafa Jarrar
 
SÖZCÜK TÜRLERİ
SÖZCÜK TÜRLERİSÖZCÜK TÜRLERİ
SÖZCÜK TÜRLERİSEYHAN62
 
2012 YGS TÜRKÇE SORULARI
2012 YGS TÜRKÇE SORULARI2012 YGS TÜRKÇE SORULARI
2012 YGS TÜRKÇE SORULARISEYHAN62
 
Türkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signed
Türkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signedTürkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signed
Türkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signedOur Sad Loss, 1930-2018
 
CÜMLE ÇEŞİTLERİ (TÜRLERİ) ppt
CÜMLE ÇEŞİTLERİ (TÜRLERİ) pptCÜMLE ÇEŞİTLERİ (TÜRLERİ) ppt
CÜMLE ÇEŞİTLERİ (TÜRLERİ) pptSEYHAN62
 
The transformation activity of the logic signed
The transformation activity of the logic signedThe transformation activity of the logic signed
The transformation activity of the logic signedOur Sad Loss, 1930-2018
 
Addenda to the turkish grammar yuksel goknel signed
Addenda to the turkish grammar yuksel goknel signedAddenda to the turkish grammar yuksel goknel signed
Addenda to the turkish grammar yuksel goknel signedOur Sad Loss, 1930-2018
 
PARAGRAF KONU ANLATIMI
PARAGRAF KONU ANLATIMIPARAGRAF KONU ANLATIMI
PARAGRAF KONU ANLATIMISEYHAN62
 

Destaque (20)

Animal & human language ppt.
Animal & human language ppt.Animal & human language ppt.
Animal & human language ppt.
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process Implementation
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineering
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical Constructs
 
Jarrar: Description Logic
Jarrar: Description LogicJarrar: Description Logic
Jarrar: Description Logic
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process Management
 
Jarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingJarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language Processing
 
Animal communication and human language
Animal communication and human languageAnimal communication and human language
Animal communication and human language
 
Jarrar: Web 2.0 Data Mashups
Jarrar: Web 2.0 Data Mashups Jarrar: Web 2.0 Data Mashups
Jarrar: Web 2.0 Data Mashups
 
Jarrar: WordNet And Global WordNets
Jarrar: WordNet And Global WordNetsJarrar: WordNet And Global WordNets
Jarrar: WordNet And Global WordNets
 
SÖZCÜK TÜRLERİ
SÖZCÜK TÜRLERİSÖZCÜK TÜRLERİ
SÖZCÜK TÜRLERİ
 
EU Languages
EU LanguagesEU Languages
EU Languages
 
2012 YGS TÜRKÇE SORULARI
2012 YGS TÜRKÇE SORULARI2012 YGS TÜRKÇE SORULARI
2012 YGS TÜRKÇE SORULARI
 
Türkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signed
Türkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signedTürkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signed
Türkçede soru sözcüklerinin kullanimi ve fiil catilari Yuksel Goknel signed
 
CÜMLE ÇEŞİTLERİ (TÜRLERİ) ppt
CÜMLE ÇEŞİTLERİ (TÜRLERİ) pptCÜMLE ÇEŞİTLERİ (TÜRLERİ) ppt
CÜMLE ÇEŞİTLERİ (TÜRLERİ) ppt
 
The transformation activity of the logic signed
The transformation activity of the logic signedThe transformation activity of the logic signed
The transformation activity of the logic signed
 
Addenda to the turkish grammar yuksel goknel signed
Addenda to the turkish grammar yuksel goknel signedAddenda to the turkish grammar yuksel goknel signed
Addenda to the turkish grammar yuksel goknel signed
 
PARAGRAF KONU ANLATIMI
PARAGRAF KONU ANLATIMIPARAGRAF KONU ANLATIMI
PARAGRAF KONU ANLATIMI
 
Understand Arabic In 12 Colored Tables
Understand Arabic In 12 Colored TablesUnderstand Arabic In 12 Colored Tables
Understand Arabic In 12 Colored Tables
 

Semelhante a Habash: Arabic Natural Language Processing

Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...iwan_rg
 
Arabic Content with Apache Solr
Arabic Content with Apache SolrArabic Content with Apache Solr
Arabic Content with Apache SolrRamzi Alqrainy
 
Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq
Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooqArabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq
Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooqLucidworks
 
Arabic language presentation 01
Arabic language presentation 01Arabic language presentation 01
Arabic language presentation 01Mohammed Attia
 
Arabic alphabet - The Bah Studie.pdf
Arabic alphabet - The Bah    Studie.pdfArabic alphabet - The Bah    Studie.pdf
Arabic alphabet - The Bah Studie.pdfStacy Taylor
 
Introduction to Arabic and Alphabet
Introduction to Arabic and AlphabetIntroduction to Arabic and Alphabet
Introduction to Arabic and AlphabetSalaam2Arabic
 
4 theory
4 theory4 theory
4 theorybiscus
 

Semelhante a Habash: Arabic Natural Language Processing (10)

Arabic alphabets
Arabic alphabetsArabic alphabets
Arabic alphabets
 
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
 
Arabic Content with Apache Solr
Arabic Content with Apache SolrArabic Content with Apache Solr
Arabic Content with Apache Solr
 
Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq
Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooqArabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq
Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq
 
Unicode 101
Unicode 101Unicode 101
Unicode 101
 
Arabic language presentation 01
Arabic language presentation 01Arabic language presentation 01
Arabic language presentation 01
 
Easy Arabic
Easy ArabicEasy Arabic
Easy Arabic
 
Arabic alphabet - The Bah Studie.pdf
Arabic alphabet - The Bah    Studie.pdfArabic alphabet - The Bah    Studie.pdf
Arabic alphabet - The Bah Studie.pdf
 
Introduction to Arabic and Alphabet
Introduction to Arabic and AlphabetIntroduction to Arabic and Alphabet
Introduction to Arabic and Alphabet
 
4 theory
4 theory4 theory
4 theory
 

Mais de Mustafa Jarrar

Clustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisClustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisMustafa Jarrar
 
Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal OntologyMustafa Jarrar
 
Discrete Mathematics Course Outline
Discrete Mathematics Course OutlineDiscrete Mathematics Course Outline
Discrete Mathematics Course OutlineMustafa Jarrar
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology Mustafa Jarrar
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesMustafa Jarrar
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORMMustafa Jarrar
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineMustafa Jarrar
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesMustafa Jarrar
 
Presentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalPresentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalMustafa Jarrar
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsMustafa Jarrar
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsMustafa Jarrar
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Mustafa Jarrar
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql ProjectMustafa Jarrar
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringMustafa Jarrar
 
Jarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesJarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesMustafa Jarrar
 
Jarrar: Ontology Modeling using OntoClean Methodology
Jarrar: Ontology Modeling using OntoClean MethodologyJarrar: Ontology Modeling using OntoClean Methodology
Jarrar: Ontology Modeling using OntoClean MethodologyMustafa Jarrar
 
Jarrar: Informed Search
Jarrar: Informed Search  Jarrar: Informed Search
Jarrar: Informed Search Mustafa Jarrar
 
Jarrar: Un-informed Search
Jarrar: Un-informed SearchJarrar: Un-informed Search
Jarrar: Un-informed SearchMustafa Jarrar
 
Introduction to Artificial Intelligence
Introduction to Artificial Intelligence Introduction to Artificial Intelligence
Introduction to Artificial Intelligence Mustafa Jarrar
 

Mais de Mustafa Jarrar (20)

Clustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisClustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment Analysis
 
Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal Ontology
 
Discrete Mathematics Course Outline
Discrete Mathematics Course OutlineDiscrete Mathematics Course Outline
Discrete Mathematics Course Outline
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion Rules
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORM
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in Palestine
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online Courses
 
Presentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalPresentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-final
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 Calls
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql Project
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology Engineering
 
Jarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesJarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing Ontologies
 
Jarrar: Ontology Modeling using OntoClean Methodology
Jarrar: Ontology Modeling using OntoClean MethodologyJarrar: Ontology Modeling using OntoClean Methodology
Jarrar: Ontology Modeling using OntoClean Methodology
 
Jarrar: Games
Jarrar: GamesJarrar: Games
Jarrar: Games
 
Jarrar: Informed Search
Jarrar: Informed Search  Jarrar: Informed Search
Jarrar: Informed Search
 
Jarrar: Un-informed Search
Jarrar: Un-informed SearchJarrar: Un-informed Search
Jarrar: Un-informed Search
 
Introduction to Artificial Intelligence
Introduction to Artificial Intelligence Introduction to Artificial Intelligence
Introduction to Artificial Intelligence
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Habash: Arabic Natural Language Processing

  • 1. 1 Arabic Natural Language Processing Nizar Habash Columbia University Center for Computational Learning Systems With modifications
  • 2. 2 Road Map • Introduction • Orthography – Arabic Script – MSA Phonology and Spelling – Encoding Issues • Morphology • Syntax
  • 3. 3 Introduction • What is ‘Arabic’? – A Semitic language – Arabic Script • With or without diacritics • More ambiguity! – Arabic Language • Modern Standard Arabic (MSA) • Arabic Dialects
  • 4. 4 Road Map • Introduction • Orthography – Arabic Script – MSA Phonology and Spelling – Encoding Issues • Morphology • Syntax
  • 5. 5 Arabic Script Arabic script is an alphabet with allographic variants, optional zero-width diacritics and common ligatures. Arabic script is used to write many languages: Arabic, Persian, Kurdish, Urdu, Pashto, etc. PDF Files problem ‫طا‬ُ ‫ا‬ ‫خ‬َ‫ط‬ ‫ال‬ ‫بي‬ِ‫ي‬ ‫ر‬َ‫ط‬ ‫ع‬َ‫ط‬ ‫ال‬
  • 6. 6 Arabic Script Alphabet • letter forms • letter marks • Arabic only • Other languages • Persian, Kurdish, Urdu, Pashto, etc. • OCR output ambiguity
  • 7. 7 Arabic Script ‫ب‬ /b/ Alphabet (MSA) • letters (form+mark) • Distinctive • Non-distinctive ‫ت‬ /t/ ‫ث‬ /θ/ ‫س‬ /s/ ‫ش‬ / /ʃ /ʔ/ glottal stop aka hamza ‫ا‬‫أ‬‫إ‬‫آ‬‫ء‬‫ؤ‬ ‫ئ‬
  • 8. 8 Arabic Script Letter Shapes • No distinction between print and handwriting • No capitalization • Right-to-left • Ambiguous shapes • Connective letters • Disconnective letters •OCR problems ‫ﺪ‬ ‫د‬ ‫ﺎ‬ ‫ا‬ ‫ﺰ‬ ‫ز‬ ‫ﻦ‬ ‫ﻨ‬ ‫ﻧ‬ ‫ن‬ final media l initial Stand ‫ا‬ alone ‫ﻎ‬‫ﺶ‬‫ﻢ‬‫ﻚ‬‫ﺐ‬ ‫ﻐ‬‫ﺸ‬‫ﻤ‬‫ﻜ‬‫ﺒ‬ ‫ﻏ‬‫ﺷ‬‫ﻣ‬‫ﻛ‬‫ب‬ ‫ﻍ‬‫ش‬‫م‬‫ك‬‫ب‬
  • 9. 9 Arabic Script Letter shaping /kitāb/ ‫ا‬ ‫ﻛ‬‫ت‬‫ﺎ‬‫ب‬‫ا =ا ﻛتﺎب‬ ‫ا‬ ‫كا‬ ‫تا‬‫ا‬ ‫ا‬‫ب‬ ‫ا‬ ktāb book /katab/ ‫ا‬ ‫ﻛ‬‫ت‬‫ﺐ‬‫ا =ا ﻛتﺐ‬ ‫ا‬ ‫كا‬ ‫تا‬‫ب‬ ‫ا‬ ktb to write
  • 10. 10 Arabic Script Nunation ‫ب‬ً /ban/ ‫ب‬ٌ /bun/ ‫ب‬ٍ /bin/ Diacritics • Zero-width characters • Used for short vowels ‫تﺐا ا ا‬َ‫ط‬ ‫ﻛ‬َ‫ط‬ ‫ا‬ ‫/ا‬katab/ ‫ا‬to ‫ا‬write • Nunation is used for nominal indefinite marker in MSA ‫با ا‬ٌ ‫تﺎ‬َ‫ط‬ ‫ﻛ‬ِ‫ي‬ ‫ا ا‬ ‫/ا‬kitābun/ ‫ا‬a ‫ا‬book Vowel ‫ب‬َ /ba/ ‫ب‬ُ /bu/ ‫ب‬ِ /bi/
  • 11. 11 Arabic Script Double Consonant ‫ب‬ّ /bb/ ‫ب‬ّ ‫ب‬ّ ‫ب‬ّ /bbu/ /bbin/ /bban/ Diacritics • No-vowel marker (sukun) ‫تﺐا ا‬َ‫ط‬ ‫ﻜ‬ْ‫َت‬ ‫ﻣ‬َ‫ط‬ ‫ا‬ ‫/ا‬maktab/ ‫ا‬office • Double consonant marker (shadda) ‫تﺐا ا‬َّ‫ب‬ ‫ﻛ‬َ‫ط‬ ‫ا‬ ‫/ا‬kattab/ ‫ا‬to ‫ا‬dictate • Combinable No Vowel ‫ب‬ْ /b/
  • 12. 12 Arabic Script ‫ربا =ا عرب‬َ‫ط‬ ‫ع‬َ‫ط‬ ‫ا‬ ‫ا با‬َ‫ط‬ ‫ا ر‬َ‫ط‬ ‫ع‬ Putting it together Simple combination Ligatures ‫ربا =ا ﻏرب‬ْ‫َت‬ ‫ﻏ‬َ‫ط‬ ‫ا‬ ‫ا با‬ْ‫َت‬ ‫ا ر‬َ‫ط‬ ‫غ‬West ‫/ا‬ʁarb/ Arab / arab/ʕ ‫ا ﺳﻠﺎما‬ ‫سا لا اا ما‬Peace ‫/ا‬salām/ ‫ﺳلم‬
  • 13. 13 Arabic Script Tatweel ‫ا‬ • ‫‘ا‬elongation’ • ‫ا‬aka ‫ا‬kashida • ‫ا‬used ‫ا‬for ‫ا‬text ‫ا‬highlight ‫ا‬ and ‫ا‬justification ‫ا‬ ‫الﻧسﺎن‬ ‫حقوق‬ ‫الﻧسـﺎن‬ ‫حقـوق‬ ‫الﻧســـﺎن‬ ‫حقـــوق‬ ‫الﻧســـــﺎن‬ ‫حقـــــوق‬ human ‫ا‬rights ‫/ا ا‬ħuqūq ‫ا‬alʔinsān/
  • 14. 14 Western Arabic Tunisia, Morocco, etc. 0 1 2 3 4 5 6 7 8 9 Indo-Arabic Middle East ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ Eastern Indo-Arabic Iran, Pakistan, etc. ٠ ١ ٢ ٣ ۴ ۵ ۶ ٧ ٨ ٩ Arabic Script “Arabic” ‫ا‬Numerals • ‫ا‬Decimal ‫ا‬system • ‫ا‬Numbers ‫ا‬written ‫ا‬left-to-right ‫ا‬in ‫ا‬right-to-left ‫ا‬text ‫ا :ا‬dual ‫ا‬directions ‫اﺳتقﻠتا الجﺰائرا فيا ﺳﻨةا‬1962 ‫ا بعﺪا‬132 ‫ا عﺎﻣﺎا ﻣﻦا‬ ‫.الحتللا الفرﻧسي‬ ‫ا‬ Algeria ‫ا‬achieved ‫ا‬its ‫ا‬independence ‫ا‬in ‫ا 2691ا‬after ‫ا 231ا‬years ‫ا‬of ‫ا‬French ‫ا‬occupation. ‫ا‬ • ‫ا‬Three ‫ا‬systems ‫ا‬of ‫ا‬enumeration ‫ا‬symbols ‫ا‬that ‫ا‬vary ‫ا‬by ‫ا‬region
  • 15. 15 Road Map • Introduction • Orthography – Arabic Script – MSA Phonology and Spelling – Encoding Issues • Morphology • Syntax
  • 16. 16 MSA Phonology and Spelling • Phonological profile of Standard Arabic – 28 Consonants – 3 short vowels, 3 long vowels, 2 diphthongs • Arabic spelling is mostly phonemic … – Letter-sound correspondence ā ʔt bʤθx ħδ dz rsṣ ʃṭ ḍʕk ʁq flm ‫ت‬‫ث‬ ‫ا‬‫ب‬‫ج‬‫ح‬‫خ‬‫د‬‫ذ‬‫ر‬‫ز‬‫س‬‫ش‬‫ص‬‫ض‬‫ط‬‫ظ‬‫ع‬‫غ‬‫ف‬‫ق‬‫ك‬‫ل‬‫م‬‫ن‬‫ه‬‫و‬‫ي‬ ‫ى‬ ‫ء‬‫أ‬‫آ‬‫إ‬‫ؤ‬‫ئ‬‫ة‬ h nwj ūī δ̣
  • 17. 17 MSA Phonology and Spelling • Arabic spelling is mostly phonemic … Except for • Medial short vowels can only appear as diacritics • Diacritics are optional in most written text – Except in holy scripture – Occasionally appear in newspapers to mark less common readings, resolve certain ambiguities • ‫كتب‬ /katab/ to write ‫كتب‬ُ /kutib/ to be written • Dual use of ‫ا‬,‫و‬,‫ي‬ as consonant and long vowel – ‫ا‬ (/‘/,/ā/) ‫و‬ (/w/,/ū/) ‫ي‬ (/j/,/ī/)
  • 18. 18 MSA Phonology and Spelling • Arabic spelling is mostly phonemic … Except for (continued) • Morphophonemic characters – Feminine marker ‫ة‬ (ta marbuta) • ‫كبير‬ /kabīr/ (big ♂) ‫كبير‬‫ة‬ /kabīra/ (big ♀) – Derivation marker • / aṣa/ (to disobeyʕ ‫عص‬‫ى‬ ) (a stick ‫عص‬‫ا‬ ) • Hamza variants (6 characters for one phoneme!) – ( ‫أآإؤئ‬ ‫ء‬(‫بها‬‫ء‬‫بها‬ ‫ه‬‫ؤ‬‫بها‬ ‫ه‬‫ئ‬‫ه‬ /baha’/ + 3MascSing (his glory)
  • 19. 19 MSA Phonology and Spelling • Arabic spelling can be ambiguous • But how ambiguous? Really? • Classic example ths s wht n rbc txt lks lk wth n vwls this is what an Arabic text looks like with no vowels • Not exactly true – Long vowels are always written – Initial vowels are represented by an ‫ا‬ ‘alef’ – Some final short vowels are represented ths is wht an Arbc txt lks lik wth no vwls Will revisit ambiguity in more detail again under morphology discussion Humans can read, machines may be troubled!
  • 20. 20 Road Map • Introduction • Orthography – Arabic Script – MSA Phonology and Spelling – Encoding Issues • Morphology • Syntax
  • 21. 21 Encoding Issues • Encoding Arabic – Data entry, storage, and display – Ease of use for Arabic- – Multi-script support – Multilingual support (extended Arabic characters) • Types of Encoding – Machine character sets • Graphemic (shape insensitive, logical order) • Allographic (shape/direction sensitive) [obsolete] – Human accessible • Transliteration • Phonetic spelling (IPA) • Romanization
  • 22. 22 Encoding Issues • Many Conflicting Character Sets for Arabic
  • 23. 23 Encodings • CP-1256 – Commonly used – 1-byte characters – Widely supported input/display – Minimal support for extended Arabic characters – bi-script support (Roman/Arabic) – Tri-lingual support: Arabic, French, English (ala ANSI)
  • 24. 24 Encodings • Unicode – Becoming the standard more and more – 2-byte characters – Widely supported input/display – Supports extended Arabic characters – Multi-script representation
  • 25. 25 Encoding Issues Arabic Display • Memory (logical order)  ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004. ‫ﺕﻙﺭﺍﺵ‬‫ﻥﻱﻃﺱﻝﻑ‬ (Palestine) ‫ﻱﻑ‬‫ﺩﺍﻱﺏﻡﻝﻭﺍ‬ (Olympics) 2000 ‫ﻭ‬2004 . or this way for those with direction-bias  .4002 æ 0002 )scipmylO( ÏÇíÈãáæÇ íÝ )enitselaP( äíØÓáÝ ÊßÑÇÔ .4002 ‫ﻭ‬0002 )scipmylO( ‫ﺍﻭﻝﻡﺏﻱﺍﺩ‬‫ﻑﻱ‬ )enitselaP( ‫ﻑﻝﺱﻃﻱﻥ‬ ‫ﺵﺍﺭﻙﺕ‬
  • 26. 26 Encoding Issues Arabic Display • Memory (logical order) ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004. ‫ﺕﻙﺭﺍﺵ‬‫ﻥﻱﻃﺱﻝﻑ‬ (Palestine) ‫ﻱﻑ‬‫ﺩﺍﻱﺏﻡﻝﻭﺍ‬ (Olympics) 2000 ‫ﻭ‬2004 . • Display (visual order) – Bidirectional (BiDi) support • Numbers and Roman script 2000 ‫ﻭ‬2004 . (Olympics) ‫ﺍﻭﻝﻡﺏﻱﺍﺩ‬‫ﻑﻱ‬ (Palestine) ‫ﻑﻝﺱﻃﻱﻥ‬‫ﺵﺍﺭﻙﺕ‬ – Letter and ligature shaping 2000 ‫ﻭ‬2004 . (Olympics) ‫ﺍﻭليبياﺩ‬ ‫ف‬ (Palestine) ‫فلسطي‬ ‫شاﺭكت‬
  • 27. 27 Display Problems تدشين منطقة Ø­ رة في دبي للتجارة الالكتر٠ˆÙ†ÙŠØ© ÊÏÔêæ åæ×âÉ ÍÑÉ áê ÏÈê ääÊÌÇÑÉ ÇäÇäãÊÑèæêÉ ÊÏÔíä ãäØÞÉ ÍÑÉ Ýí ÏÈí ááÊÌÇÑÉ ÇáÇáßÊÑæäíÉ ‫تدشين‬‫منطقة‬‫حرة‬‫في‬ ‫دبي‬‫للتجارة‬‫اللكترونية‬ ُ‫ظ‬‫ظظ؛؟‬ ‫ع‬‫ع‬‫ع‬  ‫ع‬‫ظ‬  ‫ع‬‫ظ‬  ‫ظظ­ظ‬ ‫ع‬‫ع‬‫ظ‬ ‫ظ‬ ‫ع‬ ‫ع‬‫ع‬‫ظ‬ ‫ظظظ،ظ‬ ‫ظ‬ ‫ع‬‫ظ‬  ‫ع‬‫ع‬‫ظ‬ ‫عظ‬ ‫ع‬‫ع‬‫ظ‬  ï»…‫ظ‬ †‫ظ‬ ‫ط¯ط´ظ‬ ‫؟ط‬‫ٹ‬ ‫ھ‬ ©‫ط­ط±ط‬ ©‫ظ†ط·ظ‚ط‬ ‫ط¯ط¨ظ‬ ‫ظپظ‬‫ٹ‬ ‫ٹ‬ ©‫ط¬ط§ط±ط‬ ‫ظ„ظ„ط‬‫ھ‬ ‫ط§ظ„ط§ظ„ظ‬ƒ‫ط±ظˆظ‬ ‫ط‬‫ھ‬ ‫ط‬ ‫†ظ‬‫ٹ‬ © Ԫʏ 栥既 ɠɠ͑ ԪψԪ Ԫ ǑɠǤǤ ㊑親 ɠ ‫تدشين‬‫منطقة‬‫حرة‬‫في‬ ‫دبي‬‫للتجارة‬‫اللكترونية‬ ‫تدش‬ê‫هو‬ ‫×و‬â‫حرة‬ ‫ة‬ ‫ل‬ê ‫دب‬ê ‫ننتجارة‬ ‫انانمتر‬è‫و‬ê‫ة‬ Ԫʏ ԪԪ ɠɠ͑ Ԫ ψ�ԪǑɠǡǡߊѦ Ԫ ‫كلظ‬ ‫ش ل‬ٍ‫ل‬ ‫تد‬Ԫ‫حرة‬ ‫ة‬Ԫٍ‫ل‬ ‫افاف‬ ‫ففتجارة‬ ‫ب‬ٍ‫ل‬ ‫د‬Ԫ‫لة‬ٍ‫ل‬‫ترن‬ ‫تدشين‬‫منطقة‬‫حرة‬‫في‬ ‫دبي‬‫للتجارة‬‫اللكترونية‬ WesternUnicodeISO-8859CP-1256 Display Encoding CP­1256ISO­8859Unicode ActualEncoding • Wrong encoding • Partial support problems
  • 29. 29 Encodings Buckwalter Encoding • Romanization – One-to-one mapping to Arabic script spelling – Left-to-right – Easy to learn/use – Human machine compatible • Penn Arabic Tree Bank • Some characters can be modified to allow use with XML and regular expressions • Roman input/display • Monolingual encoding (can’t do English and Arabic) • Minimal support for extended Arabic characters
  • 30. 30 Road Map • Introduction • Orthography • Morphology – Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology • Syntax
  • 31. 31 Morphology • Type – Concatenative: prefix, suffix, infix – Templatic: root+pattern • Function – Derivational • Creating new words • Mostly templatic – Inflectional • Modifying features of words – Tense, number, person, mood, aspect • Mostly concatenative
  • 32. 32 Road Map • Introduction • Orthography • Morphology – Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology • Syntax
  • 33. 33 Derivational Morphology • Templatic Morphology ‫م‬‫كت‬‫و‬‫ب‬ b ?‫و‬ ‫م‬َ?? kt ‫ك‬‫ا‬‫تب‬ ?‫ا‬ِ?? maktūb written kātib writerLexeme.Meaning = (Root.Meaning+Pattern.Meaning)*Idiosyncrasy.Random ‫ب‬ ‫ك‬‫ت‬ maū āi • Root • Pattern • Lexeme
  • 34. 34 Derivational Morphology Root Meaning • ‫ك‬‫ت‬‫ب‬ KTB = notion of “writing” ‫كتب‬ /katab/ write ‫كاتب‬ /kātib/ writer ‫مكتوب‬ /maktūb/ letter ‫كتاب‬ /kitāb/ book ‫مكتبة‬ /maktaba/ library ‫مكتب‬ /maktab/ office ‫مكتوب‬ /maktūb/ written
  • 35. 35 Derivational Morphology Root Meaning • LHM-2 • Notion of “battle” – ‫م‬‫لحم‬‫ة‬ /malħama/ • Fierce battle • Massacre • Epic
  • 36. 36 Road Map • Introduction • Orthography • Morphology – Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology • Syntax
  • 37. 37 Inflectional Morphology • Derivational Morphology – Lexeme ≈ Root + Pattern • Inflectional Morphology – Word = Lexeme + Features • Features – Part-of-speech • Traditional: Noun, Verb, Particle • Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others – Noun-specific • Number: singular, dual, plural, collective • Gender: masculine, feminine, Neutral • Definiteness: definite, indefinite • Case: nominative, accusative, genitive • Possessive clitic
  • 38. 38 Inflectional Morphology • Features (continued) – Verb-specific • Aspect: perfective, imperfective, imperative • Voice: active, passive • Tense: past, present, future • Mood: indicative, subjunctive, jussive • Subject (Person, Number, Gender) • Object clitic – Others • Single-letter conjunctions • Single-letter prepositions
  • 39. 39 Inflectional Morphology Nouns ‫وللمكتبات‬ /walilmaktabāt/ ‫و‬+‫ل‬+‫ال‬+‫مكتبة‬‫+ات‬ wa+li+al+maktaba+āt and+for+the+library+plural And for the libraries conjprepnounposs plural article ‫وكبيوتنا‬ /wakabiyūtinā/ ‫نا‬+‫بيوت‬+‫ك‬+‫و‬ wa+ka+biyūt+nā and+like+houses+our And like our houses • Morphotactics (e.g. ‫ل+ال‬  ‫)لل‬ • Arabic Broken Plurals (templatic)
  • 40. 40 Inflectional Morphology Verbs ‫فقلناها‬ /faqulnāhā/ ‫ف‬+‫قال‬+‫نا‬+‫ها‬ fa+qul+na+hā so+said+we+it So we said it. conjverbobject subj tense ‫وسن‬‫قول‬‫ها‬ /wasanaqūluhā/ ‫و‬+‫س‬ +‫ن‬+‫قول‬+‫ها‬ wa+sa+na+qūl+u+hā and+will+we+say+it And we will say it • Morphotactics • Subject conjugation (suffix or circumfix)
  • 41. 41 Inflectional Morphology • Perfect verb subject conjugation (suffixes only) Singular Dual Plural 1 ‫كتب‬‫ت‬ُ katabtu ‫كتب‬‫نا‬ katabnā 2 ‫كتب‬‫ت‬َ katabta ‫كتب‬‫ت‬‫ما‬ katabtumā ‫كتب‬‫ت‬‫م‬ katabtum 3 ‫كتب‬َ kataba ‫كتب‬‫ا‬ katabā ‫كتب‬‫وا‬ katabtū • Imperfect verb subject conjugation (prefix+suffix) Feminine form and other verb moods not shown Singular Dual Plural 1 ‫ا‬‫كتب‬ ُaktubu ‫ن‬‫كتب‬ ُnaktubu 2 ‫ت‬‫كتب‬ ُ taktubu ‫ت‬‫كتب‬‫ان‬ taktubān ‫ت‬‫كتب‬‫ون‬ taktubūn 3 ‫ي‬‫كتب‬ ُ yaktubu ‫ي‬‫كتب‬‫ان‬ yaktubān ‫ي‬‫تكتب‬‫ون‬ yaktubūn
  • 42. 42 Road Map • Introduction • Orthography • Morphology – Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology • Syntax
  • 43. 43 Morphological Ambiguity • Derivational ambiguity – ‫:قاعدة‬ basis/principle/rule, military base, Qa'ida/Qaeda/Qaida • Inflectional ambiguity – ‫:تكتب‬ you write, she writes – Segmentation ambiguity • ‫:وجد‬ he found; ‫:و+جد‬ and+grandfather • ‫للغة‬:‫ل+لغة‬ : for a language; ‫:ل+اللغة‬ for the language • Spelling ambiguity – Optional diacritics • ‫:كاتب‬ /kātib/ writer , /kātab/ to correspond – Suboptimal spelling • Hamza dropping: ‫أ‬,‫إ‬  ‫ا‬ • Undotted ta-marbuta: ‫ة‬  ‫ه‬ • Undotted final ya: ‫ي‬  ‫ى‬
  • 44. 44 Morphological Ambiguity • Multiple sources of ambiguity ‫بين‬ – /bayyana/ Verb he declared/demonstrated – /bayyanna/ Verb they [feminine] declared/demonstrated – /bayyin/ Adj clear/evident/explicit – /bayna/ Prep between/among – /biyin/ Proper Noun in Yen – /biyn/ Proper Noun Ben • Hard to measure specific causes of ambiguity – Derivational ambiguity* (diacritized tokens) • 1.09 entries/token • 1.01 entries/token (within same part-of-speech) – Spelling ambiguity* (undiacritized tokens) • 1.28 entries/token • 1.08 entries/token (within same part-of-speech) * in Buckwalter’s Lexicon (~40,000 lexemes)
  • 45. 45 Road Map • Introduction • Orthography • Morphology – Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology • Syntax • Machine Translation Issues • Syntax
  • 46. 46 Arabic Computational Morphology • Representation units • Natural token ‫ولـلـمـكتـبــــات‬ – White space separated strings (as is) – Can include extra characters (e.g. tatweel/kashida) • Word ‫وللمكتبات‬ • Segmented word – Can include any degree of morphological analysis – Pure segmentation: ‫لمكتبات‬ ‫ل‬ ‫و‬ – Arabic Treebank tokens (with recovery of some deleted/modified letters): ‫المكتبات‬ ‫ل‬ ‫و‬
  • 47. 47 Arabic Computational Morphology • Representation units (continued) • Prefix + Stem + Suffix – ‫ولل‬+‫مكتب‬+‫ات‬ – Can create more ambiguity • Lexeme + Features – ‫+[مكتبة‬Plural +Def +‫+و‬ ‫]ل‬ • Root + Pattern + Features – ‫كتب‬+‫ة‬ a3a21a‫م‬ + [+Plural +Def + ‫ل‬+‫و‬ ] – Very abstract • Root + Pattern + Vocalism + Features – ‫كتب‬+‫م‬321‫ة‬ + a.a.a + [+Plural +Def + ‫ل‬+‫و‬ ] – Very very abstract
  • 48. 48 Arabic Computational Morphology • Approaches – Finite state machines (Beesely,2001) (Kiraz,2001) (Habash et al, 2005b) – Concatenative analysis/generation (Buckwlater,2002) (Cavalli-Sforza et al, 2000) – Lexeme+Feature analysis/generation (Habash, 2004) – Shallow stemming (Darwish,2002) (Aljlayl and Frieder 2002) – Machine learning (Diab et al,2004) (Lee et al,2003) (Rogati et al, 2003) • Issues – Appropriateness of system representation for an application • Machine Translation vs. Information Retrieval • Arabic spelling vs. phonetic spelling – System coverage – System extendibility – Availability to researchers – Use for analysis and generation
  • 49. Road Map • Introduction • Orthography - Arabic Script - MSA Phonology and Spelling - Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/… - Encoding Issues • Morphology • Syntax 49
  • 50. Arabic Script Other languages Arabic • No more than 3 dots • Dots either above or below • Marks are 1/2/3 dots, hamza (‫)ء‬ or madda (~) only • Rare borrowing for foreign words • ‫/پ‬p/, ‫ڤ‬ /v/, ‫چ‬ ‫گ‬ ‫ڤ‬ /g/, ‫چ‬ /t ʃ / • regionally variable Not Arabic • Extra marks: haft (v), ring (o), taa (‫,)ط‬ four dots (::), vertical dots (:) • Some Numerals (i,h,g) ‫ب‬ … ‫إ‬ ‫ا‬ ‫ؤ‬ ‫أ‬ ‫ئةتثجحخ‬ ‫ص‬ ‫ش‬ ‫س‬ ‫ز‬ ‫ر‬ ‫ذد‬ ‫ضطظعغ‬ ‫ف‬ ‫قكلمن…و‬‫ء‬ ‫ىي‬ ‫ژ‬ ‫ڑ‬ ‫ڒ‬ ‫ړ‬ ‫ٻ‬ ‫ټ‬ ‫ٽ‬ ‫پ‬ ‫ٿ‬ ‫ڀ‬ ‫ڏ‬ ‫ڐ‬ ‫ڻ‬ ‫ڼ‬ ‫ڽ‬‫ڹ‬ s r q p ‫ڈ‬ ‫ډ‬ ‫ڊ‬ ‫ځ‬ ‫ڂ‬ ‫ڃ‬ ‫ڄ‬ ‫چ‬ ‫څ‬ ‫ۇ‬ ‫ۈ‬ ‫ۅ‬ ‫ۆ‬ ‫ڏ‬ ‫ڐ‬ ‫ڑ‬ ‫ڒ‬ s r q p ‫ڈ‬ ‫ډ‬ ‫ڊ‬ ‫گ‬ ‫ێ‬ ‫…ڤ‬ Once you learn the alphabet, it is easier ☺ HOW TO DETECT ARABIC?? 50
  • 51. Morphology and Syntax • Rich morphology crosses into syntax - Pro-drop / Subject conjugation - Verb sub-categorization and object clitics • Verbtransitive+subject+object • Verbintransitive+subject but not Verbintransitive+subject+object • Verbpassive+subject but not Verbpassive+subject+object • Morphological interactions with syntax - Agreement • Full: e.g. Noun-Adjective on number, gender, and definiteness (for persons) • Partial: e.g. Verb-Subject on gender (in VSO order) - Definiteness • Noun compound formation, copular sentences, etc. • Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc. 51
  • 52. Morphology and Syntax • Morphological interactions with syntax (continued) - Case • MSA is case marking: nominative, accusative, genitive • Almost-free word order • Case is often marked with optionally written short vowels - This effectively limits the word-order freedom in published text • Agglutination - Attached prepositions create words that cross phrase boundaries ‫ت‬ W‫ا+ل‬ li+Almaktabāt for the-libraries [PP li [NP Almaktabāt]] • Some morphological analysis (minimally segmentation) is necessary even for statistical approaches to parsing 52
  • 53. Road Map • Introduction • Orthography • Morphology • Syntax - Morphology and Syntax - Sentence Structure - Phrase Structure - Computational Resources 53
  • 54. Sentence Structure Two types of Arabic Sentences • Verbal sentences - [Verb Subject Object] (VSO) -‫العشعار‬ ‫اللول د‬ ‫كتب‬ Wrote the-boys the-poems The boys wrote the poems • Copular sentences (aka nominal sentences) - [Topic Complement] ‫عشعراء‬ ‫اللول د‬ the-boys poets The boys are poets 54
  • 55. Sentence Structure • Verbal sentences - Verb agreement with gender only • Default singular number • ‫الولد-اللول د‬ ‫كتب‬wrote3MascSing the-boy/the-boys ‫كتب‬‫ت‬‫البنت-البنات‬ wrote3FemSing the-girl/the-girls - Pronominal subjects are conjugated • ‫كتب‬‫ت‬ wrote-youMascSing • ‫كتب‬‫تم‬ wrote-youMascPlur • ‫ا‬ ‫كتب‬‫وا‬ wrote-theyMascPlur – Passive verbs • Same structure: Verbpassive SubjectunderlyingObject • Agreement with surface subject 55
  • 56. Road Map • Introduction • Orthography • Morphology • Syntax - Morphology and Syntax - Sentence Structure - Phrase Structure - Computational Resources 56
  • 57. Computational Resources • Monolingual corpora for building language models - Arabic Gigaword • Agence France Presse • AlHayat News Agency • AnNahar News Agency • Xinhua News Agency - Arabic Newswire - United Nations Corpus (parallel with other UN languages) - Ummah Corpus (parallel with English) • Distributors - Linguistic Data Consortium (LDC) - Evaluations and Language resources Distribution Agency (ELDA) 57
  • 58. Computational Resources • Penn Arabic Treebank (PATB) - Started in 2001 - Goal is 1 Million words - Currently 650K words • Agence France Presse , AlHayat newspaper, AnNahar newspaper • POS tags - Buckwalter analyzer - Arabic-tailored POS list • PATB constituency representation - Some modifications of Penn English Treebank • (e.g. Verb-phrase internal subjects) 58
  • 59. Efficiency Enhancement Tools for Arabic Search Engines: A Statistical Approach by Adnan Yahya joint work with Ali Salhi Birzeit University, Palestine Presented at AI Class, November 3, 2011
  • 60. 1.Introduction and Motivation 2.Our Arabic NLP Tools and Methods 3.Utility of employed Structures. 4.Search of Arabic PDF Files. 5.Next Steps Talk Outline
  • 61. Introduction • The World-Wide Web is rapidly changing and expanding. • Estimates of the size of the World-Wide Web vary from 15 to 30 billion pages. • The share of the Arabic is around 1% (for 5% of population). • Usually one is interested in web searches that return exactly the needed info. Returning too little : not sufficient Returning too much : not useful.
  • 62. Introduction • Sometimes you want to search for more by (expanding the query or correcting it and • Sometimes you want to search for less by eliminating certain query words • Issues like document structuring, user profiling and using NLP tools are part of the solution • The problems are present in other languages but may be more complicated for Arabic
  • 63. Arabic NLP Tools and Methods 1. It was decided to look for practical solutions 2. Developing NLP tools and methods based on the availability of: • Arabic Corpus, • Arabic Word Database, • Stop Word List
  • 64. Arabic Corpus Construction  For developing NLP tools we employed a statistical/Corpus based approach.  We started with contemporary data we obtained initially from Al-Sharq Alawsat newspaper.  Then we expended it with available sources (not without major problems)
  • 66. Expansion Directions • Topical Corpora: Topics sub-topics for Categorization. • Multisource: books, journals,.. • Multiregional: dialects • Translation tools usage • Structures for efficient storage- manipulation.
  • 67. Arabic Word Database (AWD) • A database with all the Arabic words on the web with their frequencies • The starting database was (568,106) words. • Still expanding as we find more documents and better ways to deal with old ones • We retain the substructures of data even after aggregating it: so we retain the ability to work with subsets of data as well • Efforts at cleaning: infrequent words +
  • 68. Construction of Stop Word List • We created an Arabic stop words list consisting of • the Arabic prepositions, • pronouns, interrogatives, particles, • English stop words list (translated). • The list has (1065) words • Much larger that English due to morphology • One or multiple lists?: function dependant
  • 69. Utility of the Built Structures • Arabic Language Detection. • Arabic Automatic Categorizer • Arabic Query Live Suggestion • Query modification by Word Elimination/Expansion • Arabic Automatic Categorizer • Person Names unification-translation
  • 70. Arabic Language Detection 1. Determine whether the language of the document is Arabic or just a language using Arabic alphabet, say Persian or Urdu in order to crawl and index it. 2. Based on spell checking of words in the text against Arabic words and looking for a threshold that enables accurate but fast decisions. 3. A side tool: recover wrong language data entries.
  • 71. Arabic Automatic Categorizer • Web documents are assigned to one of predefined categories. • Can be used to refine the search results. Search for query “‫”السل”م‬ under the category religions to avoid political results. (or ‫)الزراعة‬ • We have been playing with the Granularity of categorization • Have implications to the corpora properties • Multi-stage categorization, user assisted categorization
  • 72. Arabic Automatic Categorizer Steps in Building the automatic categorizer : 1.Defined a set of topics: Politics, Religion, Science, Economics and Commerce, Sports, Entertainment, Arts, Social Sciences, Engineering/technology. 2. Word-based vector for each topic was prepared. 3. Calculate the similarities for each topic and select the most similar: How!
  • 73. Arabic Query Live Suggestion • When the user types in the search box, the system queries the suggestions tables to retrieve a list of possible suggestions. • Large suggestion tables. • Attempting the use of two and three words from the text corpus to reduce the suggestion space.
  • 74. Word Elimination • Eliminate non-discriminating words (stop words) in queries • May speed search process. Example: when a user search for (‫هي‬ ‫يه امما‬ ‫,)التعددية‬ the search engine has to make three runs to find match, while looking for the last word ‫التعددية‬ is enough to find relevant pages.
  • 75. Query Expansion • Looking for relevant words or derivatives. • In Arabic many words can be derived from a single root. • we ran the root extractor tool on the AWD words. • For each root we now have almost all the Arabic words that can be derived from it.
  • 76. Arabic Query Expansion The Query expansion is then carried out as follows: 1. Each nonstop word is returned to its root, then a list of words that share the same root are retrieved from the AWD. 2. A weighing system is used to assign weight to the expanded queries say by frequencies
  • 77. The Query Correction and Suggestion System • The query correction main function is to correct user entered queries. • For each query we examine the query correctness by comparing its words with words in the AWD. If there is a match, then the query is considered correct. Otherwise, it is misspelled. • Why do we need such dictionary? Aren’t the normal dictionaries enough?
  • 78. The Query Correction and Suggestion System Common Arabic Spelling Errors
  • 79. Search Arabic PDF files • One of the topics that concerns us through this work is Arabic PDF files because of prevalence; • How we can index them so as to make them searchable by our search engine • The concern is with legacy files
  • 80. Search Arabic PDF files • cross platform font encoding problem in Arabic PDF. This problem comes from the fact that windows and MACs encode characters in different ways once we leave the ASCII characters. Thus, when we move Arabic characters from MAC to PC we can notice that the PC identifies them as unreadable: we call this legacy pdf files
  • 81. Search Arabic PDF files Text As Created In InDesign Under MAC The Same Text Imported By Notepad under PC
  • 82. Search of Arabic PDF files Crawling the web for PDF files. • Crawling the web for PDF files. • Character mapping (partial for legacy files). • Search using approximate match rather than exact. • Needs extra work to increase the hit ratio.
  • 83. Next Steps • Continue to expand and clean the word collection: diverse topics • Observe the behavior and contrast with other work • Fine tune the lists and monitor scalability • We have reported on work in progress
  • 85. Incremental Search • Incremental search enable users to find NEW data on the web. • Not an Arabic feature but we thought it may be interesting • User search for query A for the first time the search engine will provide the user with available results. • Next time the user searches for A, the search engine will provide the new results introduced since his last search for A.
  • 86. The Query Correction and Suggestion System Correction is carried out in four stages: 1-Checking stage. 2-Processing Stage. 3-Filtering stage. 4-Weighting stage.
  • 87. Stemming and Root Extraction • Indexing based on the roots which are far more abstract than stems will improve the retrieval effectiveness over stems and words. • Arabic words are formed by adding prefixes (letters and vowels at the start), infixes (vowels) and suffixes (letters and vowels at the end) to the root. • The form of an Arabic word is usually determined by its gender, number, grammatical case, whether it is definitive or not, and finally if there is a preposition attached to it.
  • 88. Stemming and Root Extraction Stemming is carried out in the following steps. • There are seven levels of processing (L3,L4, L5, L6, L7, L8, Ln) that the query may pass through the process of stemming. • At each level of processing all the possible combinations of (prefix core suffix) is examined. • The combinations where the core is a correct Arabic (corpus/dictionary) word, prefix and suffix are extracted. • Level four processes infixes.
  • 89. Stemming and Root Extraction • Example: Stemming ‫”وسيأخذونهما‬ • LN: ‫وسيأخذونهما‬,‫هما‬+‫سيأخذون‬+‫و‬ • L7: ‫سيأخذون‬,‫ون‬+‫أخذ‬+‫سي‬ • L3: ‫,أخذ‬ root

Notas do Editor

  1. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  2. Geared towards non-arab researchers working on Arabic NLP Focus on MT! Orthography discussion is necessary to understand later concepts and phenomena
  3. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  4. Letters Body Shape Connective Disconnective Dots Diacritics oblig/optiona; Vocalic/consonantal font others… Short/long writing Script feature
  5. Problems for OCR … MT from OCR output?
  6. This is exemplifying with a language… but in principle these symbols of a script canbe used whichever way…
  7. Con / dis Similar shapes Very different forms
  8. There are additional ones for koranic text that are not discussed here
  9. There are additional ones for koranic text that are not discussed here
  10. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  11. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  12. Indefiniteness (Nunation تنوين) (tashdid تشديد, dagesh דגש)
  13. Indefiniteness (Nunation تنوين) (tashdid تشديد, dagesh דגש)
  14. Indefiniteness (Nunation تنوين) (tashdid تشديد, dagesh דגש)
  15. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  16. No correspondence … standar (some programs have other additional phonbetiocv )
  17. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  18. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  19. Psycholinguistic reality format  فرمت farmat Dictionary ordered Not all combinations possible
  20. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  21. Article or poss Case Arabic Broken Plurals Intersection of Derivational and Inflectional Morphology
  22. Aspect PA circumfix negation Object, iobj
  23. Stems are different Distribution of features
  24. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  25. Average length of arabic words is 4.89 (5.85 unique) Compared to English 4.36 and 7.20 +نا : we, us, our (subject, object, possessive)
  26. Indefiniteness (Nunation تنوين) (tashdid تشديد, dagesh דגש)
  27. My work on palestinian, arabic mt and arabic hebrew mt Highlight similarities and differences A lot of similarities/differences not included
  28. Lexeme idea (UMD+columbia – me) Representation appropriateness for an application: Information Retrieval vs. Machine Translation appropriateness in terms of orthography too
  29. Lexeme idea (UMD+columbia – me) Representation appropriateness for an application: Information Retrieval vs. Machine Translation appropriateness in terms of orthography too
  30. Lexeme idea (UMD+columbia – me) Representation appropriateness for an application: Information Retrieval vs. Machine Translation appropriateness in terms of orthography too