Natural Language Processing for
      Amazigh Language:
Challenges and Future Directions
  Fadoua Ataa Allah, Siham Boulaknadel
               CEISIC, IRCAM
         {ataaallah, boulaknadel}@ircam.ma
Outline

    Amazigh Language
    Amazigh Complexity in NLP
    State of the Technology on Amazigh
    Future Directions




                LREC-2012: SALTMIL-AfLaT Workshop
Amazigh language
                             Sociolinguistic Context
  Autochthonous language of North Africa
     Spoken by millions of people, in the form of several dialects




Amazigh language
                              Sociolinguistic Context
    Languages of Morocco
        Classical Arabic is an official language.
        Amazigh has been an official language since 2011.
        Moroccan Arabic (Darija) is the colloquial variety, in a
         diglossic relationship with Classical Arabic.
        French is the first foreign language.
        Spanish is used in the north of Morocco.
        English is becoming the second foreign language.

                                                           10/07/2012
Amazigh language
                                                          History

    Amazigh abjad
        Tifinagh has been attested for some 25 centuries.
        Its written form has continued to change, from the
         traditional Tuareg writing to Tifinaghe-IRCAM.

    [Figure: Tinzouline Inscriptions (Zagora, Morocco)]


Amazigh language
                                                  History
 Direction



 [Figure: Plate 9, Anou Elias, Mammanet Valley (Niger). Source: Henri Lhote, Oued Mammanet gravures. Les Nouvelles Editions Africaines, 1979.]




Moroccan Amazigh characteristics
   Amazigh writing system
       Direction: horizontal from left to right.
       Alphabet:
            27 consonants: ⴱ, ⴳ, ⴳⵯ, ⴷ, ⴹ, ⴼ, ⴽ, ⴽⵯ, ⵀ, ⵃ, ⵄ, ⵅ, ⵇ, ⵊ, ⵍ, ⵎ, ⵏ, ⵔ, ⵕ,
             ⵖ, ⵙ, ⵚ, ⵛ, ⵜ, ⵟ, ⵣ, ⵥ;
             2 semi-consonants: ⵢ and ⵡ;
             4 vowels: ⴰ, ⵉ, ⵓ, ⴻ.
       Punctuation marks: conventional signs including: “ ”
        (space), “.”, “,”, “;”, “:”, “?”, “!”, “…” , etc.
       Numerals: Hindu-Arabic numerals [0-9].
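
The inventory above can be turned into a small character classifier. The following is a minimal sketch, not an IRCAM tool: the function names `letters` and `classify` are hypothetical, and it assumes the labialized consonants are encoded as a base letter followed by the Tifinagh labialization mark (U+2D6F).

```python
# Letter inventory copied from the slide; names below are illustrative only.
VOWELS = set("ⴰⵉⵓⴻ")
SEMI_CONSONANTS = set("ⵢⵡ")
# 25 simple consonants; the two labialized ones (ⴳⵯ, ⴽⵯ) are base + mark.
SIMPLE_CONSONANTS = set("ⴱⴳⴷⴹⴼⴽⵀⵃⵄⵅⵇⵊⵍⵎⵏⵔⵕⵖⵙⵚⵛⵜⵟⵣⵥ")
LABIALIZATION_MARK = "\u2d6f"  # TIFINAGH MODIFIER LETTER LABIALIZATION MARK


def letters(text):
    """Split text into letters, attaching a labialization mark to its base."""
    out = []
    for ch in text:
        if ch == LABIALIZATION_MARK and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def classify(letter):
    """Return the class of one letter according to the slide's inventory."""
    base = letter[0]
    if base in VOWELS:
        return "vowel"
    if base in SEMI_CONSONANTS:
        return "semi-consonant"
    if base in SIMPLE_CONSONANTS:
        return "consonant"
    return "other"


if __name__ == "__main__":
    for letter in letters("ⵜⴰⵏⵎⵉⵔⵜ"):
        print(letter, classify(letter))
```

Treating a labialized consonant as one letter keeps the count at 27 consonants, as on the slide, even though each is stored as two code points.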



Amazigh Complexity in NLP

  Different writing forms
  Complex phonological and phonetic systems
  Rich morphology




Amazigh Complexity in NLP
                  Amazigh script
     Converting writing prescriptions into
      ‘Tifinaghe – Unicode’ is confronted with:
        Spelling variation related to regional
         varieties ([tfucht] vs. [tafukt], "sun"),
        Spelling variation in the use or omission
         of spaces within or between words
         ([tadartino] vs. [tadart ino], "my house"),
        The use of Arabic or Latin transcription systems.




Amazigh Complexity in NLP
          Phonology & phonetics
     The main difficulty in Amazigh phonology
      and phonetics lies in allophones, for example:

         /ll/, which is realized as [dj] in the North.




Amazigh Complexity in NLP
                      Morphology
     Highly inflected language.
     Word structure:

              Prefix + Stem + Suffix

     Affix set: prefixes, infixes, and suffixes.
     The base form varies across paradigms:
                (qqim → svim (make sit)).



State of the Amazigh technology
    Tifinaghe Encoding
    Optical character recognition
    Fundamental processing tools
    Language resources




State of the Amazigh technology
               Tifinaghe Encoding
     ANSI
     Unicode




State of the Amazigh technology
                            OCR
    Amazigh OCR systems:
        System focused on isolated printed characters
         based on a syntactic approach using finite
         automata.
        Global approach based on Hidden Markov
         Models for recognizing handwritten characters.
        Method using invariant moments for recognizing
         printed script.
        System based on artificial neural network to
         recognize printed characters.

State of the Amazigh technology
           Fundamental processing
     Transliterator
     Tagging assistance tool
     Light stemmer
     Search engine
     Concordancer


State of the Amazigh technology
           Fundamental processing
     Transliterator


       Input: Arabic script, Latin script, or Tifinaghe Latin
       Processing: converter / transliterator
       Output: Tifinaghe Unicode
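
The Latin-to-Tifinaghe direction of such a transliterator can be sketched as a character table lookup. This is an illustrative sketch only: the mapping below is an assumed partial correspondence between a common Latin transcription and Tifinagh Unicode letters, it is not taken from the slides, and the function names `to_tifinagh` and `to_latin` are hypothetical. Emphatic and labialized consonants are left out.

```python
# Assumed partial Latin -> Tifinagh table (not from the slides).
LATIN_TO_TIFINAGH = {
    "a": "ⴰ", "b": "ⴱ", "g": "ⴳ", "d": "ⴷ", "e": "ⴻ", "f": "ⴼ",
    "k": "ⴽ", "h": "ⵀ", "x": "ⵅ", "q": "ⵇ", "i": "ⵉ", "j": "ⵊ",
    "l": "ⵍ", "m": "ⵎ", "n": "ⵏ", "u": "ⵓ", "r": "ⵔ", "s": "ⵙ",
    "c": "ⵛ", "t": "ⵜ", "w": "ⵡ", "y": "ⵢ", "z": "ⵣ",
}
# Reverse table for the Tifinagh -> Latin direction.
TIFINAGH_TO_LATIN = {tif: lat for lat, tif in LATIN_TO_TIFINAGH.items()}


def to_tifinagh(text):
    """Map each Latin letter to Tifinagh; pass unknown characters through."""
    return "".join(LATIN_TO_TIFINAGH.get(ch, ch) for ch in text.lower())


def to_latin(text):
    """Map each Tifinagh letter back to Latin; pass unknown characters through."""
    return "".join(TIFINAGH_TO_LATIN.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(to_tifinagh("tanmirt"))
    print(to_latin(to_tifinagh("tadart ino")))
```

A one-to-one table keeps the conversion reversible. Arabic-script input would need a separate table and handling of unwritten vowels, which this sketch does not attempt.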




State of the Amazigh technology
           Fundamental processing
    Tagging assistance tool
        1. Input: Amazigh raw corpora.
        2. Tokenization.
        3. Manual stemming, producing a stem list, and manual POS
           tagging with a tag set, producing a tagged corpus.
        4. Validation.
        5. Standard output.


State of the Amazigh technology
           Fundamental processing
     Light stemmer
        1. Begin with a word of the form Prefix + Stem + Suffix.
        2. Find the largest prefix and remove it, leaving Stem + Suffix.
        3. Find the largest suffix and remove it, leaving the Stem.
        4. End.
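
The light stemming flow above (largest prefix first, then largest suffix) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the affix lists are hypothetical placeholders in Latin transcription, and the minimum stem length is an added safeguard that the slide does not mention.

```python
# Hypothetical affix lists in Latin transcription; the real lists are not
# given on the slide.
PREFIXES = ("ta", "ti", "t", "i", "a")
SUFFIXES = ("in", "en", "t", "n")
MIN_STEM = 2  # assumed safeguard against stripping a word down to nothing


def strip_largest(word, affixes, at_start):
    """Remove the longest matching affix that leaves at least MIN_STEM chars."""
    for affix in sorted(affixes, key=len, reverse=True):
        if len(word) - len(affix) < MIN_STEM:
            continue
        if at_start and word.startswith(affix):
            return word[len(affix):]
        if not at_start and word.endswith(affix):
            return word[:-len(affix)]
    return word


def light_stem(word):
    """Strip the largest prefix, then the largest suffix."""
    without_prefix = strip_largest(word, PREFIXES, at_start=True)
    return strip_largest(without_prefix, SUFFIXES, at_start=False)


if __name__ == "__main__":
    print(light_stem("tadart"))
```

Trying longer affixes before shorter ones is what makes the match "largest": with these placeholder lists, `ta` is tried before `t`.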


State of the Amazigh technology
           Fundamental processing
     Search engine
        Data crawling: Web → Crawler → Repository
        Data indexing: Repository → Indexer (with natural language
         processing tools) → Index
        Data searching: User Interface → Query Engine (with natural
         language processing tools) → Index
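
The indexing and searching stages of this architecture can be illustrated with a toy inverted index. This is a minimal sketch under stated assumptions: the repository is a plain dictionary of already-crawled texts, the `tokenize` step is a stand-in for the natural language processing tools, and all names are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Stand-in for the NLP tools: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(repository):
    """Indexer: map each token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Query engine: return documents containing every query token."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Toy repository built from the example words on the earlier slides.
    repository = {"d1": "tadart ino", "d2": "tafukt", "d3": "tadart tafukt"}
    index = build_index(repository)
    print(sorted(search(index, "tadart")))
```

Applying the same `tokenize` function at indexing time and at query time is the design point: whatever normalization the NLP tools perform must be identical on both sides for a query to match.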




State of the Amazigh technology
           Fundamental processing
     Concordancer
        Input field: .txt, .doc, .pdf, .zip
        Tokenization
        Outputs:
           List of the text words and their frequency
           Word / expression context display
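
The two outputs of the concordancer, a word frequency list and a keyword-in-context display, can be sketched in a few lines. This is an illustration and not the tool itself: it works on a plain string instead of .txt/.doc/.pdf/.zip files, the function names are hypothetical, and the sample text is an arbitrary token sequence assembled from the example words on the earlier slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def frequencies(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    tokens = tokenize("tadart ino tafukt tadart ino")
    print(frequencies(tokens))
    for left, word, right in concordance(tokens, "tafukt", width=1):
        print(f"{left:>10} [{word}] {right}")
```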




State of the Amazigh technology
               Language resources
     Corpora
     Dictionary
     Terminology database




State of the Amazigh technology
               Language resources
     Corpora:
         General corpus,
         POS corpus.




State of the Amazigh technology
               Language resources
     Dictionary
         Definition,
         Arabic equivalent words,
         French equivalent words,
         English equivalent words,
         Synonyms,
         Classification by domains,
         Derivational families.

State of the Amazigh technology
               Language resources
     Terminology database
         Media vocabulary
         Grammatical vocabulary




Future Directions

  Building large and representative
   Amazigh corpora.
  Developing a machine translation
   system.
  Creating a pool of competent human
   resources.




Thank you
     for
your attention

       ⵜⴰⵏⵎⵉⵔⵜ



