SlideShare uma empresa Scribd logo
1 de 29
Baixar para ler offline
Learning Morphological Rules for
           Amharic Verbs
 Using Inductive Logic Programming
                                  1


 Won dwossen M uluge ta                       M icha el Ga sser
Addis Ab a b a U n iversi ty,              In dia n a U n iversit y,
 Addis Ab a b a , Ethio p ia                Bloomin g t on , U S A
 won disho @ya h o o. c o m              ga sser@cs.in d ia n a . ed u



                           LREC Con feren ce
                    S ALTM IL- Af La T Worksh op
         N orma liza t i on of U n der-Reso u rc es La n gua ges
                            Ista n b ul, Turkey
                              May 22, 2012
Agenda
                     2


 Introduction
 Amharic Verb Morphology
 Machine Learning (ILP) of Morphology
 Experiment Setup and Data
 Experiments and Result
 Challenges
 Future work
Introduction
                                         3

 Amharic is a Semitic language (27+ speakers)
     Written using Amharic Fidel, ፊደል , which is syllabic script
             (be=በ, bu=ቡ, bi=ቢ, ba=ባ, bE=ቤ, b=ብ, bo=ቦ)
 Many Computational Approaches of Morphology:
   Rule based and Machine learning for many languages
         HornMorpho: Finite State Transducer (Amharic, Tigrigna and Oromo)
     Amharic morphology has so far been attempted using only rule-
      based methods.
 We have applied machine learning approach to the task
   This is a work on progress
 The work is a contribution to learning of Morphology in
  general
Amharic Verbs
                                     4

 Amharic Verbs convey:
    lexical information, subject and object person, number, and
     gender; tense, aspect, and mood; various derivational
     categories such as passive, causative, and reciprocal; polarity
     (affirmative/negative); relativization; and a range of
     prepositions and conjunctions.
 Amharic Verb Morphology:
    affixation, reduplication, and compounding (common to most)
    The stems consist of a root + vowels + template merger
        (e.g., sbr + ee + CVCVC, which leads to the stem seber ‘broke’)
 It is a non-concatenative process
Amharic Verbs…contd
                                       5

 Amharic verbs: 4 prefixes and 5 suffixes.
 The affixes have an intricate set of co-occurrence rule
 Grammatical features are shown using:
     Affixes, Vowel sequence, Root template (CV pattern)

        ?-sebr-alehu (እሰብራለሁ) 1s pers. sing. simplex imperfective
        ?-seber-alehu (እሰበራለሁ) 1st pers. sing. passive imperfective

        te-deres-ku (ተደረስኩ)  1st pers. sing. passive perfective
        deres-ku (ደረስኩ) 1st pers. sing. simplex perfective

 The Geez script has been Romanized using the standard SERA for
 this experiment using our own Prolog script (lookup dictionary).
Amharic Verbs…contd
         6
Amharic Verbs…contd
         7
Amharic Verbs…contd
                                    8

 Amharic morphology has alternation rules:
     stem affix intersection points or within the stem itself
               Word             Root                   Feature
      gdel (ግደል)              gdl(ግድል)   2nd person sing. masc. imperative

      gdey (gdel-i) (ግደይ)     gdl(ግድል)   2nd person sing. fem. imperative

      t-gedl-aleh (ትገድላለህ)    gdl(ግድል)   2nd person sing. masc. imperfect

      t-gedy-alex (ትገድያለሽ)    gdl(ግድል)   2nd person sing. fem. imperfect




Another Example                          I won’t be
Rule Based Morphology
                                     9

 Finite State for morphology dominant after
 Koskenniemi’s two level morphology .
    Rule based
    HornMorpho developed to analyze Amharic, Tigrigna and
     Oromiffa words
    All rules need to de enumerated
    knowledge-based: (HornMorpho experience)
      difficult to debug,
      hard to modify (to add new findings),
      Difficult to adapt to other similar languages
ILP Framework
                                  10
 ILP combines Logic and
  Programming
 Hypothesis is drawn                            Representation
                                  Input                &               Output
  from background                                  Processing
  knowledge and
  examples.
  The examples (E), background
                                                   Learning
   knowledge (B) and hypothesis                    Algorithm
   (H) are all logic programs.


                                          Examples             Background
 Prolog Rules:                             (+ and -)
                                                   )           (how much?)
ILP for Morphology
                                  11

 ILP learning systems is Supervised
    Supervised morphology learning systems are usually based on
     two-level morphology
    These approaches differ in the level of supervision they employ
     to capture the rules.
      word pairs
      segmentation of input words
      a stem or root and a set of grammatical features
Machine Learning (ILP) of Morphology
                                        12

 Attempts to apply ILP to morphology
   Kazakov, 2000, Manandhar et al, 1998, Zdravkova et al, 2005,
         English, Macedonian
     dealt with languages with relatively simple morphology
     No root template issue (Vital for Amharic and similar languages)
     Was possible to list all (most) examples?


 We have used CLOG ILP tool for our experiment
     CLOG is a Prolog based ILP system,
     Developed by Manandhar et al (1998),
     Learn first order decision lists (rules)
     Use only positive examples.
       CLOG relies on output completeness
Experiment Setup and Data
                                      13

 Learning Amharic morphological rules with ILP:
    Training data prepared with the help of HornMorpho
    Used only 216 Amharic verbs
    background knowledge and the learning aspect.
        1)   To handle stem extraction by identifying affixes,
        2)   To identify root and vowel sequence
        3)   To handle orthographic alternations
        4)   To associate grammatical feature with word constituents
     Training Data Format:
             stem(Word, Stem, Root, Grammatical Features)
     Training Example:
         Stem([s,e,b,e,r,k,u],[s,e,b,e,r],[s,b,r] [1,1]).
         stem([s,e,b,e,r,k],[s,e,b,e,r],[s,b,r], [1,2]).
         stem([s,e,b,e,r,x],[s,e,b,e,r],[s,b,r], [1,3]).
Experiment Setup and Data
                                         14

 Learning stem extraction:
   set_affix predicate
    Identify the prefix and suffixes of the input word.
    Takes the Word and Stem and learns the affixes

           set_affix(Word, Stem, P1,P2,S1,S2):-
             split(Word, P1, W11),
             split(Stem, P2, W22),
             split(W11, X, S1),
             split(W22, X, S2),
             not( (P1=[],P2=[],S1=[],S2=[])).

      The utility predicate ‘split’ segments any input string into all possible
       pairs of substrings.
      sebr {([]-[sebr]), ([s]-[ebr]), ([se]-[br]), ([seb]-[r]), or ([sebr]-[])}.
Experiment Setup and Data
                                 15


 Affix extraction example:
              teseberku (seber) Vs tegedelku (gedel)
                       {break}                {kill}

[],[teseberku],[]                                    [],[tegedelku],[]
[t],[eseberku],[]          [te],[seber],[ku]         [t],[egedelku],[]
[te],[seberku],[]          [te],[gedel],[ku]         [te],[gedelku],[]
        :                                                    :
[te],[seber],[ku]    Segment: [te], STEM, [ku]       [te],[gedel],[ku]
        :            CV Pattern: CVCVC, ee                   :
[te],[seber],[ku]                                    [te],[gedel],[ku]
[te],[seber],[ku]                                    [te],[gedel],[ku]
[],[teseber],[ku]                                    [],[tegedel],[ku]
[],[teseberk],[k]       More general rule that can   [],[tegedelk],[u]
                         be derived for the two
                                 words
Experiment Setup and Data
                                    16

 Learning Roots:
 root_vocal, Predicate                    root_vocal(Stem,Root,Vowel):-
    Used to extract the root from                 merge(Stem,Root,Vowel).
     examples by taking only the Stem
     and the Root (the second and third    merge([X,Y,Z|T],[X,Y|R],[Z|V]):-
     arguments)                                   merge(T,R,V).
    The performs unconstrained            merge([X,Y|T],R,[X,Y|V]):-
     permutation of the characters in             merge(T,R,V).
     the Stem until the first segment of   merge([X|Y],[X|Z],W) :-
     the permutated string matches the            merge(Y,Z,W).
     Root character pattern provided in    merge([X|Y],Z,[X|W]) :-
     the example.                                  merge(Y,Z,W).
Experiment Setup and Data
                                17

    Root Vocal and Template extraction”
             seber (sbr) Vs gedl (gdl)

seber                                      gdel
sebre                                      gdle
  :        Vowel Sequence: ee                     Vowel Sequence: e
           CV Pattern: CVCVC
                                             :    CV Pattern: CVCC
  :                                          :
sbree                                      gdle
  :                                          :
  :                                          :
seebr                                      gedl
ebres                                      edlg
esber                                      egdl
eesbr                                      egld
Experiment Setup and Data
                                          18

 Learning stem internal alternations:
    set_internal_alter predicate
        This predicate works much like the ‘set_affix’ predicate except that it
         replaces a substring which is found in the middle of Stem by another
         substring from Valid_Stem.
                                                          alter([h,e,d],[h,y,e,d]).
        Required a different set of training data:       alter([m,o,t],[m,e,w,o,t]).
                                                          alter([s,a,m],[s,e,?,a,m]).

     set_internal_alter(Stem,Valid_Stem,St1,St2):-
          split(Stem,P1,X1),
          split(Valid_Stem,P1,X2),
          split(X1,St1,Y1),
          split(X2,St2,Y1).
Experiments and Result
                                 19

 To learn a set of rules, the predicate and arity for the rules
  must be provided for CLOG.
   predicate schemas
     rule(stem(_,_,_,_)) for set_affix and root_vocal, and
     rule(alter(_,_)) for set_internal_alter.

 The training set contains 216 Amharic verbs.
 The example contains all possible combinations of tense
  and subject features.
 Each word is first Romanized, then segmented into the
  stem and grammatical features
Experiments and Result
                       20



 Verb     • Stem-Affix Extraction


          • Stem-Internal-Alternation
 Stem


          • Template Extraction
 Stem


          • Grammatical Feature
Feature     Assignment
Experiments and Result
                                         21

 The training took less than a minute for Affix extraction
 108 rules for affix extraction, one example
        stem(Word, Stem, [2, 7]):-
               set_affix(Word, Stem, [y], [], [u], []),
               feature([2, 7], [imperfective, tppn]),
               template(Stem, [1, 0, 1, 1]).

Input Word: [y,m,e,k,r,u] {advise}
Set_affix results in: Stem=[m,e,k,r] as to the above rule (removing[y] and [u])
Template will generate: 1,0,1,1 from Stem which is the same as in the rule
Thus, Feature will declare the word is Imperfective, Third Person Plural Neuter

Input Word: *[y,m,e,k,e,r,u] {advise} (valid but no such examples in the training)
Set_affix will result in: Stem=[m,e,k,e,r] according to the above rule
Template will generate: 1,0,1,0,1 from Stem which will fail the rule
Experiments and Result
                                         22

 18 rules for root template extraction: one example
        root(Stem, Root):-
                root_vocal(Stem, Root, [e, e]),
                 template(Stem, [1, 0, 1, 0, 1]) .

Input Stem: [g,e,r,e,f] {beat}
root_vocal results in:
         Root=[m,k,r] based on the number and type of vowels in the rule
Template will generate: 1,0,1,0,1 from Stem which is the same as in the rule

Input Stem: *[g,a,r,e,f]
root_vocal results in: Root=[g,r,f] [a,e], which fails to meet the rule
Experiments and Result
                                          23

 3 rules for internal stem alternation
         alter(Stem,Valid_Stem):-
           set_internal_alter(Stem,Valid_Stem, [o], [e, w, o]).


Input Stem: [m,o,k] {hot}
Set_internal_alter results in:
         Valid_Stem=[m,e,w,o,k] which will be further analyzed for validity later

Input Stem: [m,o,k,e,r] {try} (wrong alternation)
Set_internal_alter results in:
         Valid_Stem=[m,e,w,o,k,e,r] but it will fail letter in the template extraction
         as [1,0,1,0,1,0,1] is not among the learned templates
Experiments and Result
                                     24

 Test Date:
    verbs in their third person singular masculine form
    Source: Online, Armbruster (1908)
 The verbs are inflected for the eight subjects and four
 tense-aspect-mood features of Amharic
    Total Test set:
        1,784 distinct verb forms
 Accuracy:
    1,552=86.9%
Experiments and Result
                                       25


 Error Analysis
     absence of similar examples
     inappropriate alternation rule

  Test Word        Stem        Root         Feature

[s,e,m,a,c,h,u] [s,e,m,a,?] [s,m,?] perfective, sppn     Correct Analysis


[l,e,g,u,m,u]   [l,e,g,u,m]   NA      NA                No Similar Example


[s,e,m,a,c,h,u] [s,e,y,e,m]   [s,y,m] gerundive, sppn   Wrong Alternation
Challenge
                                    26

The current learning trend demands examples for all
combination of features for word formation.
     Every subject for alltense and voice
     For various root groups/radicals

     For all object/negative/applicative/conj combinations

     For all forms with alternation

 Can we (do we have to) exhaustively list all the
  combination?
 What correlation do we have between the constituents?
 How can we learn these interactions?
........
Future work
                                        27

 We can not give all possible combinations for Amharic
 Won’t be generic otherwise
More Generic Approach: (Genetic Programming)
Generalizing from partial data....some morphemes can decide some
feature of the word independent of the other constituents!
    Rules like..if it has the prefix 'te' then the word is passive despite all
      the other morphemes and template structure

stem(A, B, C, D):-                             S, B, C, Ten and Sub are
     set_affix(A, B, [t,e], [], S, []),        variables and they can take
     root_temp(B, C, [e, e]),                  any forms but the word (if
     template(B, [1, 0, 1, 0, 1]),             it a valid Amharic word)
                                               will be Passive
     feature(D, [Ten, passive, Sub]).   ]).
Acknowledgement
                                  28

 Special Thanks to:
    IDRC-ICT4D project hosted by Nairobi University, Kenya
      For sponsoring my conference participation
      For partially supporting this project
29



    አመሰግናችኋለሁ
a-mesegn-a-chwa-lehu
    I Thank you all
      wondisho@yahoo.com
     wondgewe@indiana.edu
             and
     gasser@cs.indiana.edu

Mais conteúdo relacionado

Mais procurados

شهادة بكالوريوس الحاسب الالي
شهادة بكالوريوس الحاسب الاليشهادة بكالوريوس الحاسب الالي
شهادة بكالوريوس الحاسب الالي
abdulrahman aljohar
 
Histórico Escolar de Mestrado em Engenharia Elétrica
Histórico Escolar de Mestrado em Engenharia ElétricaHistórico Escolar de Mestrado em Engenharia Elétrica
Histórico Escolar de Mestrado em Engenharia Elétrica
Gladston Bernardi
 
Diploma Original copy
Diploma Original copyDiploma Original copy
Diploma Original copy
Victor Tupag
 
Sheet music the complete piano player easy blues[1]
Sheet music   the complete piano player easy blues[1]Sheet music   the complete piano player easy blues[1]
Sheet music the complete piano player easy blues[1]
Jr Medina
 
Engineering Science N3 Certificate
Engineering Science N3 CertificateEngineering Science N3 Certificate
Engineering Science N3 Certificate
Barbara Lubbe
 
Unisa Diploma - Project Management
Unisa Diploma - Project ManagementUnisa Diploma - Project Management
Unisa Diploma - Project Management
Samantha de Canha
 
Inglês raymond murphy - essential grammar in use
Inglês   raymond murphy - essential grammar in useInglês   raymond murphy - essential grammar in use
Inglês raymond murphy - essential grammar in use
Netto Inhesta
 

Mais procurados (20)

Jahangirnagar University MSC admission Test Question solved- 2014 Spring EMSC
Jahangirnagar University MSC admission Test Question solved- 2014 Spring EMSCJahangirnagar University MSC admission Test Question solved- 2014 Spring EMSC
Jahangirnagar University MSC admission Test Question solved- 2014 Spring EMSC
 
Triste tom jobim
Triste tom jobimTriste tom jobim
Triste tom jobim
 
شهادة بكالوريوس الحاسب الالي
شهادة بكالوريوس الحاسب الاليشهادة بكالوريوس الحاسب الالي
شهادة بكالوريوس الحاسب الالي
 
Histórico Escolar de Mestrado em Engenharia Elétrica
Histórico Escolar de Mestrado em Engenharia ElétricaHistórico Escolar de Mestrado em Engenharia Elétrica
Histórico Escolar de Mestrado em Engenharia Elétrica
 
NESPAK certificate
NESPAK certificateNESPAK certificate
NESPAK certificate
 
Diploma Original copy
Diploma Original copyDiploma Original copy
Diploma Original copy
 
Sheet music the complete piano player easy blues[1]
Sheet music   the complete piano player easy blues[1]Sheet music   the complete piano player easy blues[1]
Sheet music the complete piano player easy blues[1]
 
Probashi kallyan bank question solution
Probashi kallyan bank question solutionProbashi kallyan bank question solution
Probashi kallyan bank question solution
 
Cambridge practice tests for ielts 5
Cambridge practice tests for ielts 5Cambridge practice tests for ielts 5
Cambridge practice tests for ielts 5
 
Curso Portugues para autodidactas.
Curso Portugues para autodidactas.Curso Portugues para autodidactas.
Curso Portugues para autodidactas.
 
mphil transcript
mphil transcriptmphil transcript
mphil transcript
 
Engineering Science N3 Certificate
Engineering Science N3 CertificateEngineering Science N3 Certificate
Engineering Science N3 Certificate
 
Unisa Diploma - Project Management
Unisa Diploma - Project ManagementUnisa Diploma - Project Management
Unisa Diploma - Project Management
 
The outing
The outingThe outing
The outing
 
Joao bosco vol 2
Joao bosco vol 2Joao bosco vol 2
Joao bosco vol 2
 
Manual l200 2000 pt-br
Manual l200 2000 pt-brManual l200 2000 pt-br
Manual l200 2000 pt-br
 
Inglês raymond murphy - essential grammar in use
Inglês   raymond murphy - essential grammar in useInglês   raymond murphy - essential grammar in use
Inglês raymond murphy - essential grammar in use
 
TADijonSens.pdf
TADijonSens.pdfTADijonSens.pdf
TADijonSens.pdf
 
Giving Unchained: could the blockchain transform charity? (Charities Aid Foun...
Giving Unchained: could the blockchain transform charity? (Charities Aid Foun...Giving Unchained: could the blockchain transform charity? (Charities Aid Foun...
Giving Unchained: could the blockchain transform charity? (Charities Aid Foun...
 
H.S.Marksheet
H.S.MarksheetH.S.Marksheet
H.S.Marksheet
 

Semelhante a Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming

Learning phoneme mappings for transliteration without parallel data
Learning phoneme mappings for transliteration without parallel dataLearning phoneme mappings for transliteration without parallel data
Learning phoneme mappings for transliteration without parallel data
Attaporn Ninsuwan
 
Stemming algorithms
Stemming algorithmsStemming algorithms
Stemming algorithms
Raghu nath
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
butest
 

Semelhante a Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming (20)

MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYAMORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
 
UWB semeval2016-task5
UWB semeval2016-task5UWB semeval2016-task5
UWB semeval2016-task5
 
Inteligencia artificial
Inteligencia artificialInteligencia artificial
Inteligencia artificial
 
Hps a hierarchical persian stemming method
Hps a hierarchical persian stemming methodHps a hierarchical persian stemming method
Hps a hierarchical persian stemming method
 
Learning phoneme mappings for transliteration without parallel data
Learning phoneme mappings for transliteration without parallel dataLearning phoneme mappings for transliteration without parallel data
Learning phoneme mappings for transliteration without parallel data
 
english-grammar
english-grammarenglish-grammar
english-grammar
 
Nlp
NlpNlp
Nlp
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdf
 
Modeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for AmharicModeling Improved Syllabification Algorithm for Amharic
Modeling Improved Syllabification Algorithm for Amharic
 
Stemming algorithms
Stemming algorithmsStemming algorithms
Stemming algorithms
 
INFO-2950-Languages-and-Grammars.ppt
INFO-2950-Languages-and-Grammars.pptINFO-2950-Languages-and-Grammars.ppt
INFO-2950-Languages-and-Grammars.ppt
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
 
nlp (1).pptx
nlp (1).pptxnlp (1).pptx
nlp (1).pptx
 
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMDEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdf
 
Contemporary Models of Natural Language Processing
Contemporary Models of Natural Language ProcessingContemporary Models of Natural Language Processing
Contemporary Models of Natural Language Processing
 
Prolog Programming Language
Prolog Programming  LanguageProlog Programming  Language
Prolog Programming Language
 
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
 
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
 

Mais de Guy De Pauw

The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
Guy De Pauw
 

Mais de Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 

Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming

  • 1. Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming 1 Won dwossen M uluge ta M icha el Ga sser Addis Ab a b a U n iversi ty, In dia n a U n iversit y, Addis Ab a b a , Ethio p ia Bloomin g t on , U S A won disho @ya h o o. c o m ga sser@cs.in d ia n a . ed u LREC Con feren ce S ALTM IL- Af La T Worksh op N orma liza t i on of U n der-Reso u rc es La n gua ges Ista n b ul, Turkey May 22, 2012
  • 2. Agenda 2  Introduction  Amharic Verb Morphology  Machine Learning (ILP) of Morphology  Experiment Setup and Data  Experiments and Result  Challenges  Future work
  • 3. Introduction 3  Amharic is a Semitic language (27+ speakers)  Written using Amharic Fidel, ፊደል , which is syllabic script  (be=በ, bu=ቡ, bi=ቢ, ba=ባ, bE=ቤ, b=ብ, bo=ቦ)  Many Computational Approaches of Morphology:  Rule based and Machine learning for many languages  HornMorpho: Finite State Transducer (Amharic, Tigrigna and Oromo)  Amharic morphology has so far been attempted using only rule- based methods.  We have applied machine learning approach to the task  This is a work on progress  The work is a contribution to learning of Morphology in general
  • 4. Amharic Verbs 4  Amharic Verbs convey:  lexical information, subject and object person, number, and gender; tense, aspect, and mood; various derivational categories such as passive, causative, and reciprocal; polarity (affirmative/negative); relativization; and a range of prepositions and conjunctions.  Amharic Verb Morphology:  affixation, reduplication, and compounding (common to most)  The stems consist of a root + vowels + template merger  (e.g., sbr + ee + CVCVC, which leads to the stem seber ‘broke’)  It is a non-concatenative process
  • 5. Amharic Verbs…contd 5  Amharic verbs: 4 prefixes and 5 suffixes.  The affixes have an intricate set of co-occurrence rule  Grammatical features are shown using:  Affixes, Vowel sequence, Root template (CV pattern) ?-sebr-alehu (እሰብራለሁ) 1s pers. sing. simplex imperfective ?-seber-alehu (እሰበራለሁ) 1st pers. sing. passive imperfective te-deres-ku (ተደረስኩ)  1st pers. sing. passive perfective deres-ku (ደረስኩ) 1st pers. sing. simplex perfective The Geez script has been Romanized using the standard SERA for this experiment using our own Prolog script (lookup dictionary).
  • 8. Amharic Verbs…contd 8  Amharic morphology has alternation rules:  stem affix intersection points or within the stem itself Word Root Feature gdel (ግደል) gdl(ግድል) 2nd person sing. masc. imperative gdey (gdel-i) (ግደይ) gdl(ግድል) 2nd person sing. fem. imperative t-gedl-aleh (ትገድላለህ) gdl(ግድል) 2nd person sing. masc. imperfect t-gedy-alex (ትገድያለሽ) gdl(ግድል) 2nd person sing. fem. imperfect Another Example I won’t be
  • 9. Rule Based Morphology 9  Finite State for morphology dominant after Koskenniemi’s two level morphology .  Rule based  HornMorpho developed to analyze Amharic, Tigrigna and Oromiffa words  All rules need to de enumerated  knowledge-based: (HornMorpho experience)  difficult to debug,  hard to modify (to add new findings),  Difficult to adapt to other similar languages
  • 10. ILP Framework 10  ILP combines Logic and Programming  Hypothesis is drawn Representation Input & Output from background Processing knowledge and examples.  The examples (E), background Learning knowledge (B) and hypothesis Algorithm (H) are all logic programs. Examples Background Prolog Rules: (+ and -) ) (how much?)
  • 11. ILP for Morphology 11  ILP learning systems is Supervised  Supervised morphology learning systems are usually based on two-level morphology  These approaches differ in the level of supervision they employ to capture the rules.  word pairs  segmentation of input words  a stem or root and a set of grammatical features
  • 12. Machine Learning (ILP) of Morphology 12  Attempts to apply ILP to morphology  Kazakov, 2000, Manandhar et al, 1998, Zdravkova et al, 2005,  English, Macedonian  dealt with languages with relatively simple morphology  No root template issue (Vital for Amharic and similar languages)  Was possible to list all (most) examples?  We have used CLOG ILP tool for our experiment  CLOG is a Prolog based ILP system,  Developed by Manandhar et al (1998),  Learn first order decision lists (rules)  Use only positive examples.  CLOG relies on output completeness
  • 13. Experiment Setup and Data 13  Learning Amharic morphological rules with ILP:  Training data prepared with the help of HornMorpho  Used only 216 Amharic verbs  background knowledge and the learning aspect. 1) To handle stem extraction by identifying affixes, 2) To identify root and vowel sequence 3) To handle orthographic alternations 4) To associate grammatical feature with word constituents Training Data Format: stem(Word, Stem, Root, Grammatical Features) Training Example: Stem([s,e,b,e,r,k,u],[s,e,b,e,r],[s,b,r] [1,1]). stem([s,e,b,e,r,k],[s,e,b,e,r],[s,b,r], [1,2]). stem([s,e,b,e,r,x],[s,e,b,e,r],[s,b,r], [1,3]).
  • 14. Experiment Setup and Data 14  Learning stem extraction:  set_affix predicate  Identify the prefix and suffixes of the input word.  Takes the Word and Stem and learns the affixes set_affix(Word, Stem, P1,P2,S1,S2):- split(Word, P1, W11), split(Stem, P2, W22), split(W11, X, S1), split(W22, X, S2), not( (P1=[],P2=[],S1=[],S2=[])).  The utility predicate ‘split’ segments any input string into all possible pairs of substrings.  sebr {([]-[sebr]), ([s]-[ebr]), ([se]-[br]), ([seb]-[r]), or ([sebr]-[])}.
  • 15. Experiment Setup and Data 15  Affix extraction example: teseberku (seber) Vs tegedelku (gedel) {break} {kill} [],[teseberku],[] [],[tegedelku],[] [t],[eseberku],[] [te],[seber],[ku] [t],[egedelku],[] [te],[seberku],[] [te],[gedel],[ku] [te],[gedelku],[] : : [te],[seber],[ku] Segment: [te], STEM, [ku] [te],[gedel],[ku] : CV Pattern: CVCVC, ee : [te],[seber],[ku] [te],[gedel],[ku] [te],[seber],[ku] [te],[gedel],[ku] [],[teseber],[ku] [],[tegedel],[ku] [],[teseberk],[k] More general rule that can [],[tegedelk],[u] be derived for the two words
  • 16. Experiment Setup and Data 16  Learning Roots:  root_vocal, Predicate root_vocal(Stem,Root,Vowel):-  Used to extract the root from merge(Stem,Root,Vowel). examples by taking only the Stem and the Root (the second and third merge([X,Y,Z|T],[X,Y|R],[Z|V]):- arguments) merge(T,R,V).  The performs unconstrained merge([X,Y|T],R,[X,Y|V]):- permutation of the characters in merge(T,R,V). the Stem until the first segment of merge([X|Y],[X|Z],W) :- the permutated string matches the merge(Y,Z,W). Root character pattern provided in merge([X|Y],Z,[X|W]) :- the example. merge(Y,Z,W).
  • 17. Experiment Setup and Data 17  Root Vocal and Template extraction” seber (sbr) Vs gedl (gdl) seber gdel sebre gdle : Vowel Sequence: ee Vowel Sequence: e CV Pattern: CVCVC : CV Pattern: CVCC : : sbree gdle : : : : seebr gedl ebres edlg esber egdl eesbr egld
  • 18. Experiment Setup and Data 18  Learning stem internal alternations:  set_internal_alter predicate  This predicate works much like the ‘set_affix’ predicate except that it replaces a substring which is found in the middle of Stem by another substring from Valid_Stem. alter([h,e,d],[h,y,e,d]).  Required a different set of training data: alter([m,o,t],[m,e,w,o,t]). alter([s,a,m],[s,e,?,a,m]). set_internal_alter(Stem,Valid_Stem,St1,St2):- split(Stem,P1,X1), split(Valid_Stem,P1,X2), split(X1,St1,Y1), split(X2,St2,Y1).
  • 19. Experiments and Result 19  To learn a set of rules, the predicate and arity for the rules must be provided for CLOG.  predicate schemas  rule(stem(_,_,_,_)) for set_affix and root_vocal, and  rule(alter(_,_)) for set_internal_alter.  The training set contains 216 Amharic verbs.  The example contains all possible combinations of tense and subject features.  Each word is first Romanized, then segmented into the stem and grammatical features
  • 20. Experiments and Result 20 Verb • Stem-Affix Extraction • Stem-Internal-Alternation Stem • Template Extraction Stem • Grammatical Feature Feature Assignment
  • 21. Experiments and Result 21  The training took less than a minute for Affix extraction  108 rules for affix extraction, one example stem(Word, Stem, [2, 7]):- set_affix(Word, Stem, [y], [], [u], []), feature([2, 7], [imperfective, tppn]), template(Stem, [1, 0, 1, 1]). Input Word: [y,m,e,k,r,u] {advise} Set_affix results in: Stem=[m,e,k,r] as to the above rule (removing[y] and [u]) Template will generate: 1,0,1,1 from Stem which is the same as in the rule Thus, Feature will declare the word is Imperfective, Third Person Plural Neuter Input Word: *[y,m,e,k,e,r,u] {advise} (valid but no such examples in the training) Set_affix will result in: Stem=[m,e,k,e,r] according to the above rule Template will generate: 1,0,1,0,1 from Stem which will fail the rule
  • 22. Experiments and Result 22  18 rules for root template extraction: one example root(Stem, Root):- root_vocal(Stem, Root, [e, e]), template(Stem, [1, 0, 1, 0, 1]) . Input Stem: [g,e,r,e,f] {beat} root_vocal results in: Root=[m,k,r] based on the number and type of vowels in the rule Template will generate: 1,0,1,0,1 from Stem which is the same as in the rule Input Stem: *[g,a,r,e,f] root_vocal results in: Root=[g,r,f] [a,e], which fails to meet the rule
  • 23. Experiments and Result 23  3 rules for internal stem alternation alter(Stem,Valid_Stem):- set_internal_alter(Stem,Valid_Stem, [o], [e, w, o]). Input Stem: [m,o,k] {hot} Set_internal_alter results in: Valid_Stem=[m,e,w,o,k] which will be further analyzed for validity later Input Stem: [m,o,k,e,r] {try} (wrong alternation) Set_internal_alter results in: Valid_Stem=[m,e,w,o,k,e,r] but it will fail letter in the template extraction as [1,0,1,0,1,0,1] is not among the learned templates
  • 24. Experiments and Result 24  Test Date:  verbs in their third person singular masculine form  Source: Online, Armbruster (1908)  The verbs are inflected for the eight subjects and four tense-aspect-mood features of Amharic  Total Test set:  1,784 distinct verb forms  Accuracy:  1,552=86.9%
  • 25. Experiments and Result 25  Error Analysis  absence of similar examples  inappropriate alternation rule Test Word Stem Root Feature [s,e,m,a,c,h,u] [s,e,m,a,?] [s,m,?] perfective, sppn Correct Analysis [l,e,g,u,m,u] [l,e,g,u,m] NA NA No Similar Example [s,e,m,a,c,h,u] [s,e,y,e,m] [s,y,m] gerundive, sppn Wrong Alternation
  • 26. Challenge 26 The current learning trend demands examples for all combination of features for word formation.  Every subject for alltense and voice  For various root groups/radicals  For all object/negative/applicative/conj combinations  For all forms with alternation  Can we (do we have to) exhaustively list all the combination?  What correlation do we have between the constituents?  How can we learn these interactions? ........
  • 27. Future work 27  We can not give all possible combinations for Amharic  Won’t be generic otherwise More Generic Approach: (Genetic Programming) Generalizing from partial data....some morphemes can decide some feature of the word independent of the other constituents!  Rules like..if it has the prefix 'te' then the word is passive despite all the other morphemes and template structure stem(A, B, C, D):- S, B, C, Ten and Sub are set_affix(A, B, [t,e], [], S, []), variables and they can take root_temp(B, C, [e, e]), any forms but the word (if template(B, [1, 0, 1, 0, 1]), it a valid Amharic word) will be Passive feature(D, [Ten, passive, Sub]). ]).
  • 28. Acknowledgement 28  Special Thanks to:  IDRC-ICT4D project hosted by Nairobi University, Kenya  For sponsoring my conference participation  For partially supporting this project
  • 29. 29 አመሰግናችኋለሁ a-mesegn-a-chwa-lehu I Thank you all wondisho@yahoo.com wondgewe@indiana.edu and gasser@cs.indiana.edu