SlideShare uma empresa Scribd logo
1 de 4
Baixar para ler offline
Setswana Tokenisation and Computational Verb Morphology:
      Facing the Challenge of a Disjunctive Orthography
           Rigardt Pretorius, Ansu Berg, Laurette Pretorius, Biffie Viljoen

Introduction
Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Af-
rica. The language is characterised by a disjunctive orthography, mainly affecting the important word
category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suf-
fixal morphemes follow a conjunctive writing style.

Verb morphology
The most basic form of the verb in Setswana consists of an infinitive prefix + a root + a verb-final suf-
fix, for example, go bona (to see) consists of the infinitive prefix go, the root bon- and the verb-final
suffix -a. While verbs in Setswana may also include various other prefixes and suffixes, the root al-
ways forms the lexical core of a word.

Prefixes of the Setswana verb
    •   The subject agreement morphemes, written disjunctively, include non-consecutive subject
        agreement morphemes and consecutive subject agreement morphemes. For example, the
        non-consecutive subject agreement morpheme for class 5 is le as in lekau le a tshega (the
        young man is laughing), while the consecutive subject agreement morpheme for class 5 is la
        as in lekau la tshega (the young man then laughed).
    •   The object agreement morpheme is written disjunctively in most instances, for example
        ba di bona (they see it).
    •   The reflexive morpheme i- (-self) is always written conjunctively to the root, for example
        o ipona (he sees himself).
    •   The aspectual morphemes are written disjunctively and include the present tense mor-
        pheme a, the progressive morpheme sa (still) and the potential morpheme ka (can). Exam-
        ples                                                                                      are
        o a araba (he answers), ba sa ithuta (they are still learning) and ba ka ithuta (they can
        learn).
    •   The temporal morpheme tla (indicating the future tense) is written disjunctively, for example
        ba tla ithuta (they shall learn).
    •   The negative morphemes ga, sa and se are written disjunctively. Examples are ga ba
        ithute (they do not learn), re sa mo thuse (we do not help him), o se mo rome (do not
        send him).

Suffixes of the Setswana verb
Various morphemes may be suffixed to the verbal root and follow the conjunctive writing style.

Tokenisation
Tokenisation may be defined as the process of breaking up the sequence of characters in a text at the
word boundaries. It may therefore be regarded as a core technology in natural language processing.

Example 1: In the English sentence “I shall buy meat” the four tokens (separated by “/”) are
I / shall / buy / meat. However, in the Setswana sentence Ke tla reka nama (I shall buy meat) the
two tokens are Ke tla reka / nama.

Example 2: Improper tokenisation may distort corpus linguistic conclusions and statistics. Compare
the following:
A/ o itse/ rre/ yo/? (Do you know this gentleman?) Interrogative particle;
Re bone/ makau/ a/ maabane/. (We saw these young men yesterday.) Demonstrative pronoun;
Metsi/ a/ bollo/. (The water is hot.) Descriptive copulative;
Madi/ a/ rona/ a/ mo/ bankeng/. (Our money (the money of us) is in the bank.) Possessive particle
and descriptive copulative;
Mosadi/ a ba bitsa/. (The woman (then) called them.) Subject agreement morpheme;
Dintšwa/ ga di a re bona/. (The dogs did not see us.) Negative morpheme, which is concomitant
with the negative morpheme ga when the negative of the perfect is indicated, thus an example of a
separated dependency.
In the six occurrences of a above only four represent orthographic words that should form part of a
word frequency count for a.

Clearly, Setswana tokenisation cannot be based solely on whitespace, as is the case in many alpha-
betic, segmented languages, including the conjunctively written Nguni group of South African Bantu
languages.

This paper shows how a combination of two tokeniser transducers and a finite-state (rule-based) mor-
phological analyser may be combined to effectively solve the Setswana tokenisation problem. The
approach has the important advantage of bringing the processing of Setswana beyond the morpho-
logical analysis level in line with what is appropriate for the Nguni languages. This means that the
challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and
does not in principle propagate to subsequent levels of analysis.

Research question
Can the development and application of a precise tokeniser and morphological analyser for Setswana
resolve the issue of disjunctive orthography? If so, subsequent levels of processing could exploit the
inherent structural similarities between the Bantu languages.

Our approach
Our underlying assumption is that the Bantu languages are structurally very closely related.

Our contention is that precise tokenisation will result in comparable morphological analyses, and that
the similarities and structural agreement between Setswana and languages such as Zulu will prevail
at subsequent levels of syntactic analysis, which could and should then also be computationally ex-
ploited.

Our approach is based on the novel combination of two tokeniser transducers and a morphological
analyser for Setswana.

Morphological analyser
The finite-state morphological analyser prototype for Setswana, developed with the Xerox finite state
toolkit, implements Setswana morpheme sequencing (morphotactics) by means of a lexc script con-
taining cascades of so-called lexicons, each of which represents a specific type of prefix, suffix or
root.

Sound changes at morpheme boundaries (morphophonological alternation rules) are implemented by
means of xfst regular expressions.

These lexc and xfst scripts are then compiled and subsequently composed into a single finite state
transducer, constituting the morphological analyser.

While the implementation of the morphotactics and alternation rules is, in principle complete, the word
root lexicons still need to be extended to include all known and valid Setswana roots.

The verb morphology is based on the assumption that valid verb structures are disjunctively written.

For example, the verb token re tla dula (we will sit/stay) is analysed as follows:

Verb(INDmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term

  or

Verb(PARmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term

Both modes, indicative and participial, constitute valid analyses. The occurrence of multiple valid
morphological analyses is typical and would require (context dependent) disambiguation at subse-
quent levels of processing.
Tokeniser
A grammar for linguistically valid verb constructions is implemented with xfst regular expressions.
The key principle on which the tokeniser is based is right-to-left longest match.

We note that:
(i) it may happen that a longest match does not constitute a valid verb construct;
(ii) the right-to-left strategy is appropriate since the verb root and suffixes are written conjunctively
     and therefore should not be individually identified at the tokenisation stage while disjunctively
     written prefixes need to be recognised.

The two aspects that need further clarification are:
(i) How do we determine whether a morpheme sequence is valid?
(ii) How do we recognise disjunctively written prefixes?

Methodology
Central to our approach is the assumption that only analysed tokens are valid tokens and strings that
could not be analysed are not valid linguistic words.


                                       Normalise running text



                                          Verb Tokeniser yielding
                                             longest matches



                                         Morphological Analyser



                            Analysed strings, typically verbs       Unanalysed strings

                                                                     Whitespace Tokeniser
                                                                     yielding orthographic
                                  Tokens                                     words

                         Analysed strings, typically
                         other word categories
                                                                     Morphological Analyser


                                                                    Unanalysed strings

                                                                            Errors



            Normalise test data (running text) by removing capitalisation and punctuation;
  Step 1:
            Tokenise on longest match right-to-left;
  Step 2:
            Perform a morphological analysis of the “tokens” from step 2;
  Step 3:
            Separate the tokens that were successfully analysed in step 3 from those that could not be
  Step 4:
            analysed;
            Tokenise all unanalysed “tokens” from step 4 on whitespace;
  Step 5:
            [Example: unanalysed wa me becomes wa and me.]
            Perform a morphological analysis of the “tokens” in step 5;
  Step 6:
            Again, as in step 4, separate the analysed and unanalysed strings resulting from step 6;
  Step 7:
            Combine all the valid tokens from steps 4 and 7.
  Step 8:

Finally a comparison of the correspondences and differences between the hand-tokenised tokens
(hand-tokens) and the tokens obtained by computational means (auto-tokens) is necessary in order to
assess the reliability of the described tokenisation approach.

Test data and results
Since the purpose was to establish the validity of the tokenisation approach, we made use of a short
Setswana text of 547 orthographic words, containing a variety of verb constructions (see Table 1).
The text was tokenised by hand and checked by a linguist in order to provide a means to measure the
success of the tokenisation approach. Furthermore, the text was normalised not to contain capitalisa-
tion and punctuation. All word roots occurring in the text were added to the root lexicon of the mor-
phological analyser to ensure that limitations in the analyser would not influence the tokenisation ex-
periment.
Examples of output of step 2:
ke tla nna
o tla go kopa
le ditsebe
Examples of output of step 3:
Based on the morphological analysis, the first two of the above longest matches are tokens and the
third is not. The relevant analyses are:
ke tla nna
Verb(INDmode), (FUTtense,Pos): AgrSubj-1p-Sg+TmpPre+[nn]+Term
o tla go kopa
Verb(INDmode), (FUTtense,Pos): AgrSubj-Cl1+TmpPre+AgrObj-2p-Sg+[kop]+Term
Examples of output of step 5:
le, ditsebe
Examples of output of step 6:
le
CopVerb(Descr), (INDmode), (FUTtense,Neg): AgrSubj-Cl5
ditsebe
NPre10+[tsebe]


The results of the tokenisation procedure applied to the test data, is summarised in Tables 1 and 2.

 Token length                Test data      Correctly
 (in orthographic words)                    tokenised
 2                           84             68
 3                           25             25
 4                           2              2
   Table 1. Verb constructions

Table 1 shows that 111 of the 409 tokens in the test data consist of more than one orthographic word
(i.e. verb constructions) of which 95 are correctly tokenised. Moreover, it suggests that the tokenisa-
tion improves with the length of the tokens.


                        Tokens            Types
 Hand-tokens, H         409               208
 Auto-tokens, A         412               202
 H∩A                    383 (93.6%)       193 (92.8%)
 AH                    29                9
 HA                    26                15
 Precision, P           0.93              0.96
 Recall, R              0.94              0.93
 F-score, 2PR/(P+R)     0.93              0.94
  Table 2. Tokenisation results


Issues
    •    Longest matches that allow morphological analysis, but do not constitute tokens. Examples
         are ba ba neng, e e siameng and o o fetileng. In these instances the tokeniser did not rec-
         ognise the qualificative particle. The tokenisation should have been ba/ ba neng, e/ e sia-
         meng and o/ o fetileng.
    •    Longest matches that do not allow morphological analysis and are directly split up into single
         orthographic words instead of allowing verb constructions of intermediate length. An example
         is e le monna, which was finally tokenised as e/ le/ monna instead of e le/ monna.
    •    Finally, perfect tokenisation is context sensitive. The string ke tsala should have been toke-
         nised as ke/ tsala (noun), and not as the verb construction ke tsala. In another context it can
         however be a verb with tsal- as the verb root.

In conclusion, we have successfully demonstrated that the novel combination of a precise tokeniser
and morphological analyser for Setswana could indeed form the basis for resolving the issue of dis-
junctive orthography.

Mais conteúdo relacionado

Mais procurados

Types of parsers
Types of parsersTypes of parsers
Types of parsersSabiha M
 
Lecture 7: Definite Clause Grammars
Lecture 7: Definite Clause GrammarsLecture 7: Definite Clause Grammars
Lecture 7: Definite Clause GrammarsCS, NcState
 
Minimalist program
Minimalist programMinimalist program
Minimalist programRabbiaAzam
 
CBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERCBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERijnlc
 
Mba ebooks ! Edhole
Mba ebooks ! EdholeMba ebooks ! Edhole
Mba ebooks ! EdholeEdhole.com
 
Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word IdentificationStatistically-Enhanced New Word Identification
Statistically-Enhanced New Word IdentificationAndi Wu
 
Why parsing is a part of Language Faculty Science (by Daisuke Bekki)
Why parsing is a part of Language Faculty Science (by Daisuke Bekki)Why parsing is a part of Language Faculty Science (by Daisuke Bekki)
Why parsing is a part of Language Faculty Science (by Daisuke Bekki)Daisuke BEKKI
 
Final formal languages
Final formal languagesFinal formal languages
Final formal languagesMegha Khanna
 
Types of Language in Theory of Computation
Types of Language in Theory of ComputationTypes of Language in Theory of Computation
Types of Language in Theory of ComputationAnkur Singh
 
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...Ashish Duggal
 
The Application of Distributed Morphology to the Lithuanian First Accent Noun...
The Application of Distributed Morphology to the Lithuanian First Accent Noun...The Application of Distributed Morphology to the Lithuanian First Accent Noun...
The Application of Distributed Morphology to the Lithuanian First Accent Noun...Adrian Lin
 
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYAMORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYAijnlc
 

Mais procurados (19)

Parser
ParserParser
Parser
 
Types of parsers
Types of parsersTypes of parsers
Types of parsers
 
NLP_KASHK:N-Grams
NLP_KASHK:N-GramsNLP_KASHK:N-Grams
NLP_KASHK:N-Grams
 
Minimalist program
Minimalist programMinimalist program
Minimalist program
 
Lecture 7: Definite Clause Grammars
Lecture 7: Definite Clause GrammarsLecture 7: Definite Clause Grammars
Lecture 7: Definite Clause Grammars
 
Minimalist program
Minimalist programMinimalist program
Minimalist program
 
Lfg and gpsg
Lfg and gpsgLfg and gpsg
Lfg and gpsg
 
CBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERCBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMER
 
Mba ebooks ! Edhole
Mba ebooks ! EdholeMba ebooks ! Edhole
Mba ebooks ! Edhole
 
Phrase structure grammar
Phrase structure grammarPhrase structure grammar
Phrase structure grammar
 
Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word IdentificationStatistically-Enhanced New Word Identification
Statistically-Enhanced New Word Identification
 
Why parsing is a part of Language Faculty Science (by Daisuke Bekki)
Why parsing is a part of Language Faculty Science (by Daisuke Bekki)Why parsing is a part of Language Faculty Science (by Daisuke Bekki)
Why parsing is a part of Language Faculty Science (by Daisuke Bekki)
 
Final formal languages
Final formal languagesFinal formal languages
Final formal languages
 
Flat unit 3
Flat unit 3Flat unit 3
Flat unit 3
 
Types of Language in Theory of Computation
Types of Language in Theory of ComputationTypes of Language in Theory of Computation
Types of Language in Theory of Computation
 
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
 
The Application of Distributed Morphology to the Lithuanian First Accent Noun...
The Application of Distributed Morphology to the Lithuanian First Accent Noun...The Application of Distributed Morphology to the Lithuanian First Accent Noun...
The Application of Distributed Morphology to the Lithuanian First Accent Noun...
 
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYAMORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA
 
Nlp (1)
Nlp (1)Nlp (1)
Nlp (1)
 

Destaque

1.1 nouns and articles
1.1 nouns and articles1.1 nouns and articles
1.1 nouns and articlesaperlick
 
Morphosyntax third class
Morphosyntax   third classMorphosyntax   third class
Morphosyntax third classMauro Uchoa
 
Reflexive and Reciprocal pronouns
Reflexive and Reciprocal pronounsReflexive and Reciprocal pronouns
Reflexive and Reciprocal pronounsGHoltappels
 
Grammar and structure
Grammar and structure Grammar and structure
Grammar and structure BEv Oblina
 
1.12.17 nouns,articles,numbers
1.12.17 nouns,articles,numbers1.12.17 nouns,articles,numbers
1.12.17 nouns,articles,numbersamberglong11
 
8.23.16 Nouns, Articles, Numbers
8.23.16 Nouns, Articles, Numbers8.23.16 Nouns, Articles, Numbers
8.23.16 Nouns, Articles, Numbersamberglong11
 
Nouns and articles in Spanish
Nouns and articles in SpanishNouns and articles in Spanish
Nouns and articles in SpanishEdward Acuna
 
Pronouns reflexive and reciprocal
Pronouns reflexive and reciprocalPronouns reflexive and reciprocal
Pronouns reflexive and reciprocaltati2487
 
100 most common English verbs for beginners .Audio : http://youtu.be/it03G5m...
100 most common English  verbs for beginners .Audio : http://youtu.be/it03G5m...100 most common English  verbs for beginners .Audio : http://youtu.be/it03G5m...
100 most common English verbs for beginners .Audio : http://youtu.be/it03G5m...Olga Vareli
 
English: modal auxiliary verbs (theory and examples)
English: modal auxiliary verbs (theory and examples)English: modal auxiliary verbs (theory and examples)
English: modal auxiliary verbs (theory and examples)home
 
Helping (auxiliary) verbs english- M. van Eijk
Helping (auxiliary) verbs  english- M. van EijkHelping (auxiliary) verbs  english- M. van Eijk
Helping (auxiliary) verbs english- M. van EijkZadkine
 
Verb forms tenses class 9 cbse
Verb forms tenses class 9 cbseVerb forms tenses class 9 cbse
Verb forms tenses class 9 cbsePinky Vincent
 
English Syntax - Basic Sentence Structure
English Syntax - Basic Sentence StructureEnglish Syntax - Basic Sentence Structure
English Syntax - Basic Sentence StructuretheLecturette
 

Destaque (20)

1.1 nouns and articles
1.1 nouns and articles1.1 nouns and articles
1.1 nouns and articles
 
Morphosyntax third class
Morphosyntax   third classMorphosyntax   third class
Morphosyntax third class
 
Reflexive and Reciprocal pronouns
Reflexive and Reciprocal pronounsReflexive and Reciprocal pronouns
Reflexive and Reciprocal pronouns
 
Grammar and structure
Grammar and structure Grammar and structure
Grammar and structure
 
1.12.17 nouns,articles,numbers
1.12.17 nouns,articles,numbers1.12.17 nouns,articles,numbers
1.12.17 nouns,articles,numbers
 
8.23.16 Nouns, Articles, Numbers
8.23.16 Nouns, Articles, Numbers8.23.16 Nouns, Articles, Numbers
8.23.16 Nouns, Articles, Numbers
 
Nouns and articles in Spanish
Nouns and articles in SpanishNouns and articles in Spanish
Nouns and articles in Spanish
 
MORPHOSYNTAX: GENERATIVE MORPHOLOGY
MORPHOSYNTAX: GENERATIVE MORPHOLOGYMORPHOSYNTAX: GENERATIVE MORPHOLOGY
MORPHOSYNTAX: GENERATIVE MORPHOLOGY
 
Possession
PossessionPossession
Possession
 
Pronouns reflexive and reciprocal
Pronouns reflexive and reciprocalPronouns reflexive and reciprocal
Pronouns reflexive and reciprocal
 
100 most common English verbs for beginners .Audio : http://youtu.be/it03G5m...
100 most common English  verbs for beginners .Audio : http://youtu.be/it03G5m...100 most common English  verbs for beginners .Audio : http://youtu.be/it03G5m...
100 most common English verbs for beginners .Audio : http://youtu.be/it03G5m...
 
Gamit ng modal
Gamit ng modalGamit ng modal
Gamit ng modal
 
English: modal auxiliary verbs (theory and examples)
English: modal auxiliary verbs (theory and examples)English: modal auxiliary verbs (theory and examples)
English: modal auxiliary verbs (theory and examples)
 
English verb system
English verb systemEnglish verb system
English verb system
 
Helping (auxiliary) verbs english- M. van Eijk
Helping (auxiliary) verbs  english- M. van EijkHelping (auxiliary) verbs  english- M. van Eijk
Helping (auxiliary) verbs english- M. van Eijk
 
LISP: Introduction to lisp
LISP: Introduction to lispLISP: Introduction to lisp
LISP: Introduction to lisp
 
Verb forms tenses class 9 cbse
Verb forms tenses class 9 cbseVerb forms tenses class 9 cbse
Verb forms tenses class 9 cbse
 
English Syntax - Basic Sentence Structure
English Syntax - Basic Sentence StructureEnglish Syntax - Basic Sentence Structure
English Syntax - Basic Sentence Structure
 
Basic English Structure
Basic English StructureBasic English Structure
Basic English Structure
 
Region 6 Western Visayas
Region 6 Western VisayasRegion 6 Western Visayas
Region 6 Western Visayas
 

Semelhante a Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography

SETSWANA PART OF SPEECH TAGGING
SETSWANA PART OF SPEECH TAGGINGSETSWANA PART OF SPEECH TAGGING
SETSWANA PART OF SPEECH TAGGINGkevig
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShashank Shisodia
 
7. ku gr.sem 2013: Syntax
7. ku gr.sem 2013: Syntax7. ku gr.sem 2013: Syntax
7. ku gr.sem 2013: SyntaxTikaram Poudel
 
5a use of annotated corpus
5a use of annotated corpus5a use of annotated corpus
5a use of annotated corpusThennarasuSakkan
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Natural-Language-Processing-by-Dr-A-Nagesh.pdf
Natural-Language-Processing-by-Dr-A-Nagesh.pdfNatural-Language-Processing-by-Dr-A-Nagesh.pdf
Natural-Language-Processing-by-Dr-A-Nagesh.pdftheboysaiml
 
Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianCorpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianIDES Editor
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
 
Syntax analyzer
Syntax analyzerSyntax analyzer
Syntax analyzerahmed51236
 
Basic techniques in nlp
Basic techniques in nlpBasic techniques in nlp
Basic techniques in nlpSumit Sony
 
Paper id 25201466
Paper id 25201466Paper id 25201466
Paper id 25201466IJRAT
 
The early theroy of chomsky
The early theroy of chomskyThe early theroy of chomsky
The early theroy of chomskyKarim Islam
 
Multi lingual text-processing
Multi lingual text-processingMulti lingual text-processing
Multi lingual text-processingNAVER Engineering
 
Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Dhabal Sethi
 
An implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzerAn implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzerijnlc
 
Correction of Erroneous Characters
Correction of Erroneous CharactersCorrection of Erroneous Characters
Correction of Erroneous CharactersAndi Wu
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
 
5. phases of nlp
5. phases of nlp5. phases of nlp
5. phases of nlpmonircse2
 
Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)iwan_rg
 

Semelhante a Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography (20)

SETSWANA PART OF SPEECH TAGGING
SETSWANA PART OF SPEECH TAGGINGSETSWANA PART OF SPEECH TAGGING
SETSWANA PART OF SPEECH TAGGING
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliterator
 
7. ku gr.sem 2013: Syntax
7. ku gr.sem 2013: Syntax7. ku gr.sem 2013: Syntax
7. ku gr.sem 2013: Syntax
 
5a use of annotated corpus
5a use of annotated corpus5a use of annotated corpus
5a use of annotated corpus
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Aw32322326
Aw32322326Aw32322326
Aw32322326
 
Natural-Language-Processing-by-Dr-A-Nagesh.pdf
Natural-Language-Processing-by-Dr-A-Nagesh.pdfNatural-Language-Processing-by-Dr-A-Nagesh.pdf
Natural-Language-Processing-by-Dr-A-Nagesh.pdf
 
Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianCorpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of Persian
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
Syntax analyzer
Syntax analyzerSyntax analyzer
Syntax analyzer
 
Basic techniques in nlp
Basic techniques in nlpBasic techniques in nlp
Basic techniques in nlp
 
Paper id 25201466
Paper id 25201466Paper id 25201466
Paper id 25201466
 
The early theroy of chomsky
The early theroy of chomskyThe early theroy of chomsky
The early theroy of chomsky
 
Multi lingual text-processing
Multi lingual text-processingMulti lingual text-processing
Multi lingual text-processing
 
Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)
 
An implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzerAn implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzer
 
Correction of Erroneous Characters
Correction of Erroneous CharactersCorrection of Erroneous Characters
Correction of Erroneous Characters
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
5. phases of nlp
5. phases of nlp5. phases of nlp
5. phases of nlp
 
Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)
 

Mais de Guy De Pauw

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Guy De Pauw
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Guy De Pauw
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingGuy De Pauw
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageGuy De Pauw
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguageGuy De Pauw
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)Guy De Pauw
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Guy De Pauw
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusGuy De Pauw
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of SantomeGuy De Pauw
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTGuy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionGuy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishGuy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemGuy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemGuy De Pauw
 

Mais de Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 

Último

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 

Último (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 

Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography

  • 1. Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography Rigardt Pretorius, Ansu Berg, Laurette Pretorius, Biffie Viljoen Introduction Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Af- rica. The language is characterised by a disjunctive orthography, mainly affecting the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suf- fixal morphemes follow a conjunctive writing style. Verb morphology The most basic form of the verb in Setswana consists of an infinitive prefix + a root + a verb-final suf- fix, for example, go bona (to see) consists of the infinitive prefix go, the root bon- and the verb-final suffix -a. While verbs in Setswana may also include various other prefixes and suffixes, the root al- ways forms the lexical core of a word. Prefixes of the Setswana verb • The subject agreement morphemes, written disjunctively, include non-consecutive subject agreement morphemes and consecutive subject agreement morphemes. For example, the non-consecutive subject agreement morpheme for class 5 is le as in lekau le a tshega (the young man is laughing), while the consecutive subject agreement morpheme for class 5 is la as in lekau la tshega (the young man then laughed). • The object agreement morpheme is written disjunctively in most instances, for example ba di bona (they see it). • The reflexive morpheme i- (-self) is always written conjunctively to the root, for example o ipona (he sees himself). • The aspectual morphemes are written disjunctively and include the present tense mor- pheme a, the progressive morpheme sa (still) and the potential morpheme ka (can). Exam- ples are o a araba (he answers), ba sa ithuta (they are still learning) and ba ka ithuta (they can learn). • The temporal morpheme tla (indicating the future tense) is written disjunctively, for example ba tla ithuta (they shall learn). • The negative morphemes ga, sa and se are written disjunctively. Examples are ga ba ithute (they do not learn), re sa mo thuse (we do not help him), o se mo rome (do not send him). Suffixes of the Setswana verb Various morphemes may be suffixed to the verbal root and follow the conjunctive writing style. Tokenisation Tokenisation may be defined as the process of breaking up the sequence of characters in a text at the word boundaries. It may therefore be regarded as a core technology in natural language processing. Example 1: In the English sentence “I shall buy meat” the four tokens (separated by “/”) are I / shall / buy / meat. However, in the Setswana sentence Ke tla reka nama (I shall buy meat) the two tokens are Ke tla reka / nama. Example 2: Improper tokenisation may distort corpus linguistic conclusions and statistics. Compare the following: A/ o itse/ rre/ yo/? (Do you know this gentleman?) Interrogative particle; Re bone/ makau/ a/ maabane/. (We saw these young men yesterday.) Demonstrative pronoun; Metsi/ a/ bollo/. (The water is hot.) Descriptive copulative; Madi/ a/ rona/ a/ mo/ bankeng/. (Our money (the money of us) is in the bank.) Possessive particle and descriptive copulative; Mosadi/ a ba bitsa/. (The woman (then) called them.) Subject agreement morpheme; Dintšwa/ ga di a re bona/. (The dogs did not see us.) Negative morpheme, which is concomitant with the negative morpheme ga when the negative of the perfect is indicated, thus an example of a separated dependency.
  • 2. In the six occurrences of a above only four represent orthographic words that should form part of a word frequency count for a. Clearly, Setswana tokenisation cannot be based solely on whitespace, as is the case in many alpha- betic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how a combination of two tokeniser transducers and a finite-state (rule-based) mor- phological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morpho- logical analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis. Research question Can the development and application of a precise tokeniser and morphological analyser for Setswana resolve the issue of disjunctive orthography? If so, subsequent levels of processing could exploit the inherent structural similarities between the Bantu languages. Our approach Our underlying assumption is that the Bantu languages are structurally very closely related. Our contention is that precise tokenisation will result in comparable morphological analyses, and that the similarities and structural agreement between Setswana and languages such as Zulu will prevail at subsequent levels of syntactic analysis, which could and should then also be computationally ex- ploited. Our approach is based on the novel combination of two tokeniser transducers and a morphological analyser for Setswana. Morphological analyser The finite-state morphological analyser prototype for Setswana, developed with the Xerox finite state toolkit, implements Setswana morpheme sequencing (morphotactics) by means of a lexc script con- taining cascades of so-called lexicons, each of which represents a specific type of prefix, suffix or root. Sound changes at morpheme boundaries (morphophonological alternation rules) are implemented by means of xfst regular expressions. These lexc and xfst scripts are then compiled and subsequently composed into a single finite state transducer, constituting the morphological analyser. While the implementation of the morphotactics and alternation rules is, in principle complete, the word root lexicons still need to be extended to include all known and valid Setswana roots. The verb morphology is based on the assumption that valid verb structures are disjunctively written. For example, the verb token re tla dula (we will sit/stay) is analysed as follows: Verb(INDmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term or Verb(PARmode),(FUTtense,Pos): AgrSubj-1p-Pl+TmpPre+[dul]+Term Both modes, indicative and participial, constitute valid analyses. The occurrence of multiple valid morphological analyses is typical and would require (context dependent) disambiguation at subse- quent levels of processing.
  • 3. Tokeniser A grammar for linguistically valid verb constructions is implemented with xfst regular expressions. The key principle on which the tokeniser is based is right-to-left longest match. We note that: (i) it may happen that a longest match does not constitute a valid verb construct; (ii) the right-to-left strategy is appropriate since the verb root and suffixes are written conjunctively and therefore should not be individually identified at the tokenisation stage while disjunctively written prefixes need to be recognised. The two aspects that need further clarification are: (i) How do we determine whether a morpheme sequence is valid? (ii) How do we recognise disjunctively written prefixes? Methodology Central to our approach is the assumption that only analysed tokens are valid tokens and strings that could not be analysed are not valid linguistic words. Normalise running text Verb Tokeniser yielding longest matches Morphological Analyser Analysed strings, typically verbs Unanalysed strings Whitespace Tokeniser yielding orthographic Tokens words Analysed strings, typically other word categories Morphological Analyser Unanalysed strings Errors Normalise test data (running text) by removing capitalisation and punctuation; Step 1: Tokenise on longest match right-to-left; Step 2: Perform a morphological analysis of the “tokens” from step 2; Step 3: Separate the tokens that were successfully analysed in step 3 from those that could not be Step 4: analysed; Tokenise all unanalysed “tokens” from step 4 on whitespace; Step 5: [Example: unanalysed wa me becomes wa and me.] Perform a morphological analysis of the “tokens” in step 5; Step 6: Again, as in step 4, separate the analysed and unanalysed strings resulting from step 6; Step 7: Combine all the valid tokens from steps 4 and 7. Step 8: Finally a comparison of the correspondences and differences between the hand-tokenised tokens (hand-tokens) and the tokens obtained by computational means (auto-tokens) is necessary in order to assess the reliability of the described tokenisation approach. Test data and results Since the purpose was to establish the validity of the tokenisation approach, we made use of a short Setswana text of 547 orthographic words, containing a variety of verb constructions (see Table 1). The text was tokenised by hand and checked by a linguist in order to provide a means to measure the success of the tokenisation approach. Furthermore, the text was normalised not to contain capitalisa- tion and punctuation. All word roots occurring in the text were added to the root lexicon of the mor- phological analyser to ensure that limitations in the analyser would not influence the tokenisation ex- periment.
  • 4. Examples of output of step 2: ke tla nna o tla go kopa le ditsebe Examples of output of step 3: Based on the morphological analysis, the first two of the above longest matches are tokens and the third is not. The relevant analyses are: ke tla nna Verb(INDmode), (FUTtense,Pos): AgrSubj-1p-Sg+TmpPre+[nn]+Term o tla go kopa Verb(INDmode), (FUTtense,Pos): AgrSubj-Cl1+TmpPre+AgrObj-2p-Sg+[kop]+Term Examples of output of step 5: le, ditsebe Examples of output of step 6: le CopVerb(Descr), (INDmode), (FUTtense,Neg): AgrSubj-Cl5 ditsebe NPre10+[tsebe] The results of the tokenisation procedure applied to the test data, is summarised in Tables 1 and 2. Token length Test data Correctly (in orthographic words) tokenised 2 84 68 3 25 25 4 2 2 Table 1. Verb constructions Table 1 shows that 111 of the 409 tokens in the test data consist of more than one orthographic word (i.e. verb constructions) of which 95 are correctly tokenised. Moreover, it suggests that the tokenisa- tion improves with the length of the tokens. Tokens Types Hand-tokens, H 409 208 Auto-tokens, A 412 202 H∩A 383 (93.6%) 193 (92.8%) AH 29 9 HA 26 15 Precision, P 0.93 0.96 Recall, R 0.94 0.93 F-score, 2PR/(P+R) 0.93 0.94 Table 2. Tokenisation results Issues • Longest matches that allow morphological analysis, but do not constitute tokens. Examples are ba ba neng, e e siameng and o o fetileng. In these instances the tokeniser did not rec- ognise the qualificative particle. The tokenisation should have been ba/ ba neng, e/ e sia- meng and o/ o fetileng. • Longest matches that do not allow morphological analysis and are directly split up into single orthographic words instead of allowing verb constructions of intermediate length. An example is e le monna, which was finally tokenised as e/ le/ monna instead of e le/ monna. • Finally, perfect tokenisation is context sensitive. The string ke tsala should have been toke- nised as ke/ tsala (noun), and not as the verb construction ke tsala. In another context it can however be a verb with tsal- as the verb root. In conclusion, we have successfully demonstrated that the novel combination of a precise tokeniser and morphological analyser for Setswana could indeed form the basis for resolving the issue of dis- junctive orthography.