1. Natural Language Processing for
Amazigh Language:
Challenges and Future Directions
Fadoua Ataa Allah, Siham Boulaknadel
CEISIC, IRCAM
{ataaallah, boulaknadel}@ircam.ma
2. Outline
Amazigh Language
Amazigh Complexity in NLP
State of the Technology on Amazigh
Future Directions
LREC-2012: SALTMIL-AfLaT Workshop 2
3. Amazigh language
Sociolinguistic Context
North African autochthonous language
Spoken by millions of people as dialects
4. Amazigh language
Sociolinguistic Context
Languages of Morocco
Classical Arabic as an official language.
Amazigh has been an official language since 2011.
Moroccan Arabic (Darija) is the spoken variety, in a
diglossic relationship with Classical Arabic.
French as the first foreign language.
Spanish is used in the north of Morocco.
English is becoming the second foreign language.
10/07/2012
5. Amazigh language
History
Amazigh abjad
Tifinagh has been attested for
25 centuries.
Its written form has
continued to change,
from the traditional
Tuareg writing to
Tifinaghe-IRCAM.
Tinzouline Inscriptions
(Zagora, Morocco)
6. Amazigh language
History
Direction
Plate 9
Anou Elias, Mammanet
Valley (Niger).
Henri Lhote, Oued
Mammanet gravures.
Les Nouvelles Editions
Africaines. 1979
8. Amazigh Complexity in NLP
Different writing forms
Complex phonological and phonetic
systems
Rich morphology
9. Amazigh Complexity in NLP
Amazigh script
Converting existing writing conventions to
‘Tifinaghe – Unicode’ is confronted with:
Spelling variation related to regional
varieties ([tfucht] vs. [tafukt] (sun)),
Spelling variation based on the use or the
omission of spaces within or between
words ([tadartino] vs. [tadart ino] (my house)),
Arabic or Latin transcription systems.
10. Amazigh Complexity in NLP
Phonology & phonetics
The main problem in Amazigh phonology
and phonetics is allophony:
/ll/ is realized as [dj] in the North.
11. Amazigh Complexity in NLP
Morphology
Highly inflected language.
Word structure:
Prefix Stem Suffix
Affixes set: Prefixes, Infixes, and Suffixes.
Base form varies with paradigms:
(qqim → svim (make sit)).
12. State of the Amazigh technology
Tifinaghe Encoding
Optical character recognition
Fundamental processing tools
Language resources
13. State of the Amazigh technology
Tifinaghe Encoding
ANSI Unicode
14. State of the Amazigh technology
OCR
Amazigh OCR systems:
System focused on isolated printed characters
based on a syntactic approach using finite
automata.
Global approach based on Hidden Markov
Models for recognizing handwritten characters.
Method using invariant moments for recognizing
printed script.
System based on artificial neural network to
recognize printed characters.
15. State of the Amazigh technology
Fundamental processing
Transliterator
Tagging assistance tool
Light stemmer
Search engine
Concordancer
16. State of the Amazigh technology
Fundamental processing
Transliterator
[Diagram components: Arabic script, Latin script, Tifinaghe (Unicode), Converter, Tifinaghe-Latin Transliterator]
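A minimal sketch of the Tifinaghe-Latin direction of such a transliterator, assuming a simple one-character-to-one-character mapping. The table below is deliberately partial and illustrative (ten letters only), and the names `TIFINAGH_TO_LATIN`, `LATIN_TO_TIFINAGH` and `transliterate` are hypothetical; the actual tool's tables and handling of Arabic script are not described on the slide.

```python
# Partial, illustrative letter table (assumption: one Tifinagh letter maps to
# one Latin letter). A real converter would cover the full alphabet.
TIFINAGH_TO_LATIN = {
    "\N{TIFINAGH LETTER YA}": "a",
    "\N{TIFINAGH LETTER YAD}": "d",
    "\N{TIFINAGH LETTER YAF}": "f",
    "\N{TIFINAGH LETTER YAK}": "k",
    "\N{TIFINAGH LETTER YAM}": "m",
    "\N{TIFINAGH LETTER YAN}": "n",
    "\N{TIFINAGH LETTER YI}": "i",
    "\N{TIFINAGH LETTER YAR}": "r",
    "\N{TIFINAGH LETTER YAT}": "t",
    "\N{TIFINAGH LETTER YU}": "u",
}

# Reverse table for the Latin-to-Tifinagh direction.
LATIN_TO_TIFINAGH = {latin: tif for tif, latin in TIFINAGH_TO_LATIN.items()}


def transliterate(text, table):
    """Map each character through `table`; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = transliterate("tafukt", LATIN_TO_TIFINAGH)
    print(word, transliterate(word, TIFINAGH_TO_LATIN))
```

A character-level table like this cannot handle digraphs or context-dependent spellings, which is one reason the regional and spacing variations listed earlier make conversion harder in practice.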
17. State of the Amazigh technology
Fundamental processing
Tagging assistance tool
[Diagram: Amazigh raw corpora → Tokenization → Manual POS tagging (tag set) and Manual stemming (stem list) → Validation → Tagged corpus (standard output)]
18. State of the Amazigh technology
Fundamental processing
Light stemmer
Begin
Input word: Prefix + Stem + Suffix
Find the largest prefix and strip it: Stem + Suffix
Find the largest suffix and strip it: Stem
End
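The flow above (find the largest prefix, then the largest suffix, and keep what remains as the stem) can be sketched as follows. The affix lists `PREFIXES` and `SUFFIXES` are hypothetical placeholders chosen only to make the example run, not the stemmer's real affix sets, and the `min_stem` guard is an added assumption to avoid stripping a word down to nothing.

```python
# Hypothetical affix lists for illustration only; the real stemmer's affix
# sets are not given on the slide.
PREFIXES = ("ta", "t")
SUFFIXES = ("ino", "t")


def strip_longest(word, affixes, at_start, min_stem):
    """Remove the longest matching affix, if enough of the word remains."""
    for affix in sorted(affixes, key=len, reverse=True):
        matches = word.startswith(affix) if at_start else word.endswith(affix)
        if matches and len(word) - len(affix) >= min_stem:
            return word[len(affix):] if at_start else word[:-len(affix)]
    return word


def light_stem(word, prefixes=PREFIXES, suffixes=SUFFIXES, min_stem=2):
    """Strip the largest prefix, then the largest suffix (min_stem is an assumption)."""
    word = strip_longest(word, prefixes, at_start=True, min_stem=min_stem)
    word = strip_longest(word, suffixes, at_start=False, min_stem=min_stem)
    return word


if __name__ == "__main__":
    print(light_stem("tadartino"))
```

Trying candidates from longest to shortest is what makes the match "largest": with the toy lists above, `ta` is preferred over `t` whenever both apply.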
19. State of the Amazigh technology
Fundamental processing
Search engine
[Diagram components. Data crawling: Web Crawler, Data Repository. Data indexing: Natural Language Processing Tools, Indexer, Index. Data searching: User Interface, Natural Language Processing Tools, Query Engine]
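As a rough illustration of the indexing and searching stages, here is a minimal in-memory inverted index. It is a sketch under stated assumptions, not the system's architecture: `analyze` is a hypothetical stand-in for the natural language processing tools (it only lowercases and splits on word characters), crawling is replaced by a plain dictionary of documents, and queries are treated as a conjunction of terms.

```python
import re
from collections import defaultdict


def analyze(text):
    """Stand-in for the NLP tools: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Indexer: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Query engine: return ids of documents containing every query term."""
    terms = analyze(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


if __name__ == "__main__":
    docs = {"d1": "tadart ino", "d2": "tafukt"}  # toy repository
    idx = build_index(docs)
    print(search(idx, "tadart"))
```

Applying the same `analyze` step to both documents and queries mirrors the diagram, where the language processing tools sit on both the indexing path and the query path.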
20. State of the Amazigh technology
Fundamental processing
Concordancer
[Diagram: Input field (.txt, .doc, .pdf, .zip) → Tokenization → List of the text words and their frequency; Word / expression context display]
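The two outputs of the concordancer, a word frequency list and a keyword-in-context display, can be sketched as below. This assumes the input is already a plain string (reading .txt, .doc, .pdf or .zip files is not covered), uses a simple regular-expression tokenizer, and all function names are hypothetical. The sample text in the usage example is a toy string built from words on the earlier slides, not a real Amazigh sentence.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens (simple regex tokenizer, an assumption)."""
    return re.findall(r"\w+", text)


def word_frequencies(tokens):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(tokens)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    toks = tokenize("tadart ino d tafukt d tadart")  # toy text
    print(word_frequencies(toks).most_common(2))
    for left, key, right in concordance(toks, "d", width=1):
        print(f"{left:>10} [{key}] {right}")
```

Matching here is on single exact tokens; supporting multi-word expressions, as the slide's "Word / expression" suggests, would need a sliding-window comparison instead.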
21. State of the Amazigh technology
Language resources
Corpora
Dictionary
Terminology database
22. State of the Amazigh technology
Language resources
Corpora:
General corpus,
POS corpus.
23. State of the Amazigh technology
Language resources
Dictionary
Definition,
Arabic equivalent words,
French equivalent words,
English equivalent words,
Synonyms,
Classification by domains,
Derivational families.
24. State of the Amazigh technology
Language resources
Terminology database
Media vocabulary
Grammatical vocabulary
25. Future Directions
Building large and representative
Amazigh corpora.
Developing a machine translation
system.
Creating a pool of competent human
resources.
26. Thank you
for
your attention
ⵜⴰⵏⵎⵉⵔⵜ