Statistical Analysis of Myanmar Words on the World Wide Web for
                         Search Engine Development


                   Pann Yu Mon            Maung Maung Thant            Ohnmar Htun Pe
    s065402@ics.nagaokaut.ac.jp               mmthant@gmail.com           ohnmar.iuj@gmail.com


                             San Ko Oo                   Yoshiki Mikami
                        sankooo@gmail.com             mikami@kjs.nagaokaut.ac.jp

               †Management and Information Systems Engineering Department
                            Nagaoka University of Technology
                           ††International University of Japan
Abstract

This paper introduces an automatic Myanmar word analysis program developed as part of ongoing research on Myanmar search engine development. In this research we collected Myanmar words from documents on the World Wide Web to find out which words are frequently used. The program is designed for encodings compatible with the Unicode 5.1 standard. It can automatically generate a Markov chain matrix from the resulting words. The program is written in PHP. The Myanmar head words included in the Myanmar-English dictionary are used as index words.

Keywords

Myanmar, Code conversion tools, Myanmar word searching

1. Introduction

The Myanmar language, a member of the Tibeto-Burman subfamily of the Sino-Tibetan family of languages, is spoken as a mother tongue by more than 37 million Burmese and as a second language by about 20 million members of ethnic minorities in Myanmar. It is the only official language of Myanmar, formerly known as Burma. The Myanmar language is written in a script of circular and semi-circular letters adopted from the Mon script. The Mon script is in turn derived from the Indian Brahmi script, which flourished in the Indian subcontinent between the 5th century B.C. and the 3rd century A.D. The Myanmar language has 33 consonants and 12 vowels according to traditional grammar.

Since 1990, Myanmar natural language processing work has been carried out by the Myanmar Unicode & NLP Research Center. The first Myanmar Unicode font for a GUI environment (Mac) was developed in 1988 and the one for Windows was developed in 1992. In 1998, Myanmar language processing was first discussed at ISO/IEC JTC1 and the Unicode Technical Committee, and the Myanmar character code set was finally included in ISO 10646.

The Center continues to work on Myanmar language processing tasks so that all applications are well supported, but covering the whole area still requires more effort.

In this research, we propose a program that automatically collects Myanmar words from Myanmar Web pages. The main purpose of this research is to present an analysis of Myanmar words on Myanmar Web pages in support of Myanmar search engine development. Establishing a Myanmar search engine requires many components, such as indexing rules, a sorting algorithm, a stemming algorithm, a word breaking algorithm and so on.

In this study, we have collected Myanmar Web pages from various Web sites
including Myanmar daily newspapers, community Web sites and news Web sites, amounting to 9,274 Kbytes in total. We then extracted words from the downloaded Myanmar Web pages. The detailed process of collecting words and the analysis of the resulting data are discussed in the following sections.

2. Related Research

A number of researchers, both local and worldwide, have collected Myanmar words from different sources for their individual purposes.

Since 2007, the Myanmar Unicode and NLP Research Center has been developing the Myanmar National Corpus (MNC) [5]. MNC includes all kinds of text, both written and spoken, from various resources. That project is almost finished.

Hla Hla Htay and colleagues [2] have developed Myanmar corpora based on various resources such as text from official newspapers in Myanmar, over 300 full books, and Myanmar texts from various Web sites including news sites and on-line magazines. In their research, all processing was based on ASCII format.

3. Methodology

Figure 1. Step by step procedure of analysis

The step-by-step process of our analysis is shown in Figure 1. First, Myanmar Web pages are collected regardless of their fonts and encodings. Then they are passed to a multi-font converter that converts them to Unicode 5.1. Finally, the program searches for words in the input text, and the resulting words are saved in the database. Each step is explained in more detail in the following subsections.

3.1. First Step: Downloading Myanmar Web Pages

The World Wide Web is the most convenient existing source of linguistic data, providing users with an abundance of texts of various types in a large number of languages. Being already in electronic form, these texts are well suited to corpus studies.

Downloading Myanmar Web pages requires an efficient crawler that can selectively collect only Myanmar Web pages from the World Wide Web. In this research, the Language Specific Crawler (LSC) developed by one of the authors [3] was used. LSC runs concurrently with a language identifier and collects Myanmar Web pages efficiently. Table 1 lists the sources of the downloaded Web sites. After downloading, the pages were passed to the converter.

Table 1. Detail information for source data

3.2. Second Step: Conversion of various encodings to Unicode 5.1 Standard

Myanmar texts on the Web use various encodings that are not fully compliant with Unicode 5.1, so the crawled Web pages must be converted to Unicode. If the Web pages are already encoded in Unicode, the work becomes easier.

Converting the various Myanmar encodings to Unicode requires an efficient converter. Currently, a number of Myanmar font conversion tools are available on the
Web. In this research, the Kanaung converter¹ and the Burglish converter² were used. Although both work well, some manual editing is still needed. For example, the Kanaung converter could not convert ‘ ’ and ‘ ’ correctly. Burglish converts correctly from the “Zawgyi-One” font to the “Myanmar3” font, but when converting from the “Wininwa” font to the “Myanmar3” font it cannot convert ‘ ’ accurately, and it does not handle punctuation marks and quotation marks correctly. Manual correction is therefore needed in those cases, although the converters are otherwise nearly accurate.

3.3. Third Step: Word Searching Algorithm

The Myanmar language is written in a syllabic system, and spaces are not always put between words or sentences. That is why a word segmentation algorithm and a word searching algorithm for the Myanmar language are needed. Very little research, using different approaches, has been published on segmenting Myanmar sentences into words [1] [4].

In our program, all of the Myanmar head words included in the Myanmar–English Dictionary³ are used as the index file. It includes 28,000 Myanmar words. Those head words are stored in the database and sorted in reverse order of syllable length for comparison with the input data. If the input word matches one of the head words, the program retrieves that word. If the input word does not match any entry in the head word list, the program cannot retrieve the word correctly. The accuracy of this algorithm therefore depends largely on the head word list.

Our algorithm uses longest matching to find words in the input data. It starts at the first character of a text and, using the head word list, attempts to find the longest word in the list. If such a word is found, the longest-matching algorithm marks a boundary at the end of that word and then repeats the same process, searching for the longest match starting at the characters following the match. If no such match is found in the word list, the character is simply segmented as a word.

3.4. Fourth Step: Frequency Markov Chain Analysis

The program also uses word-based Markov models to calculate a word matrix table that records word adjacency in sentences (that is, which word most frequently appears after a given word). This gives us high-level background information for word boundary detection when parsing the Myanmar language. The program first finds the words in the given Web pages and calculates the frequency of each word, that is, how many times the word appears on the Web sites. After that, the Markov chain matrix table is generated automatically.

4. Result

We downloaded various Web sites, including newspaper sites, blog sites, entertainment sites and sport sites, and collected 9,274 Kbytes of text data. After running the program, a total of 766,892 words were collected and 12,211 unique head words were found.

4.1. Distribution of Words on input string

Mono-syllable words were found to be the most frequently used, because they can be used in several ways. For example, the mono-syllable “ ” was found more than 20,000 times, because it can be used in different ways: in case 1, as a polite prefix to a young man’s name (as in “ ”); in case 2, as a postpositional marker indicating the objective case (as in “ ”); in case 3, as an emphatic particle suffixed to words (as in “ ”); and in case 4, as a postpositional marker indicating destination (as in “ ”). Bi-syllable words are the second most frequent, followed by tri-syllable words and so on. The top ten words sorted by frequency for mono-syllable, bi-syllable, tri-syllable and tetra-syllable words are shown in the following tables.

¹ http://code.google.com/p/kanaung/
² http://burglish.googlepages.com/fontconv.htm
³ Myanmar-English dictionary produced by the Department of the Myanmar Language Commission
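The longest-matching search described in Section 3.3 can be sketched as follows. This is a minimal illustration in Python, not the authors' PHP program: the function name `segment`, the romanized toy head words, and the character-level (rather than syllable-level) matching are all assumptions made for the example.

```python
def segment(text, head_words):
    """Greedy longest-match segmentation over a set of head words.

    Hypothetical sketch: matches on characters, whereas the paper's
    database is organized by syllable length.
    """
    max_len = max((len(w) for w in head_words), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a head word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in head_words:
                break
        else:
            # No head word starts here: the single character becomes a word.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Toy head word list in romanized form (illustrative only).
head_words = {"kyun", "kyuntaw", "ka", "kalay"}
print(segment("kyuntawkalayxka", head_words))
```

Because the longest candidate is tried first, "kyuntaw" is preferred over its prefix "kyun", and the unknown character "x" is emitted as a one-character word, which mirrors the fallback rule stated in the text.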
Table 2. Top ten mono-syllable words

Mono-Syllable      Meaning                                                  Frequency
[ko]               Postpositional marker to indicate objective case         20070 (2.61%)
[ma]               Particle prefixed to a verb to give the negative sense   18181 (2.40%)
[ka]               Postpositional marker to indicate nominative case        17469 (2.30%)
[tal]              Colloquial form of the sentence final                    14424 (1.90%)
[par]              Particle denoting inclusion                              12774 (1.70%)

Table 3. Top ten bi-syllable words

Bi-Syllable        Meaning                                                  Frequency
[Kyun taw]         I (male)                                                 3537 (0.46%)
[Kyun ma]          I (female)                                               3332 (0.43%)
[Ka lay]           Child                                                    1994 (0.26%)
[A twat]           For                                                      1981 (0.25%)
[Ae di]            That                                                     1737 (0.22%)

Table 4. Top ten tri-syllable words

Tri-Syllable       Meaning                                                  Frequency
[Tha yot saung]    Actor                                                    627 (0.08%)
[Pa ri thet]       Audience                                                 500 (0.06%)
[Sa yar ma]        Teacher (female)                                         495 (0.06%)
[Thu nge chin]     Friend                                                   404 (0.05%)
[Main ka lay]      Girl                                                     400 (0.05%)

Table 5. Top ten tetra-syllable words

Tetra-Syllable     Meaning                                                  Frequency
[sar yay sa yar]   Author                                                   222 (0.02%)
[a nu pa nyar]     Art                                                      204 (0.02%)
[a chay a nay]     Condition                                                176 (0.02%)
[a yay a tar]      Writing                                                  157 (0.01%)
[a mhat ta ya]     Remembrance                                              138 (0.01%)

Figure 2. Number of syllables found in test data (x-axis: number of syllables; y-axis: number of collected words)

Syllables          Number of collected words
Mono-syllable      581,355
Bi-syllable        147,100
Tri-syllable       27,770
4-syllable         9,752
5-syllable         758
6-syllable         117
7-syllable         16
8-syllable         5
9-syllable         17
10-syllable        2
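As a quick consistency check on the Figure 2 counts, the short Python sketch below sums them and converts each to a share of the total. The counts are taken from the figure; the variable names are only illustrative.

```python
# Word counts per syllable length, as reported in Figure 2.
counts = {
    1: 581_355, 2: 147_100, 3: 27_770, 4: 9_752, 5: 758,
    6: 117, 7: 16, 8: 5, 9: 17, 10: 2,
}

total = sum(counts.values())

# Share of each syllable length as a percentage of all collected words.
shares = {n: 100 * c / total for n, c in counts.items()}

print(f"total words: {total:,}")
for n, c in counts.items():
    print(f"{n:>2} syllable(s): {c:>7,} words ({shares[n]:.2f}%)")
```

The counts add up to the 766,892 words reported in Section 4, and mono-syllable words account for roughly three quarters of them.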
4.2. Word Level Frequency Matrix

Based on the input string, the program generates a word-level Markov table. Using this matrix we can identify adjacent word pairs. It gives us high-level background information for parsing a sentence into words. By applying the same algorithm at the character level, we can also generate a character-level Markov table, which can be used in Myanmar character input methods for mobile phones.

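Table 6. Word-Level Matrix (sum of frequency; rows: first word, columns: second word)

First Word    Second Word                                                      Grand Total
                  -     -     -     -     -     -     -  1144     -            1144
                  -   722     -     -  1273     -     -     -  1217            4893
                  -     -     -     -     -  1564     -     -     -            2343
                  -     -     -  1339     -     -  1511     -     -            2850
                  -     -   934     -     -     -     -     -     -             934
               1205     -  1717     -     -     -     -     -     -            2922
                  -     -     -     -     -   809     -     -     -            1754
Grand Total    1205   722  2651  1339  1273  2373  1511  1144  1217           16840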
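A word-level matrix of this kind can be built by counting adjacent word pairs in segmented text. The Python sketch below is only an illustration under stated assumptions: the function names (`bigram_matrix`, `most_likely_next`, `transition_probability`) and the romanized toy sentences are hypothetical and do not come from the authors' program.

```python
from collections import Counter, defaultdict


def bigram_matrix(sentences):
    """Count how often each second word follows each first word."""
    matrix = defaultdict(Counter)
    for words in sentences:
        for first, second in zip(words, words[1:]):
            matrix[first][second] += 1
    return matrix


def most_likely_next(matrix, word):
    """Return the word that most often follows `word`, or None if unseen."""
    followers = matrix.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def transition_probability(matrix, first, second):
    """Estimate P(second | first) as the pair count over the row total."""
    row = matrix.get(first, Counter())
    row_total = sum(row.values())
    return row[second] / row_total if row_total else 0.0


# Toy segmented sentences in romanized form (illustrative only).
sentences = [
    ["kyuntaw", "ka", "sayarma", "ko"],
    ["kyuntaw", "ka", "kalay", "ko"],
    ["kyunma", "ka", "kalay", "ko"],
]
matrix = bigram_matrix(sentences)
print(dict(matrix["ka"]))
print(most_likely_next(matrix, "ka"))
```

Each row total corresponds to a "Grand Total" entry for a first word, and dividing a cell by its row total gives the estimated transition probability of a first-order Markov chain.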


4.3. Distribution of characters on Input String

We analyzed the character-level frequency of the input data. The result is shown in Figure 3. Words beginning with “ ” number over 90,000, making it the top-ranking initial character, followed by “ ” and so on. No words were found starting with the characters “ ”. We could not find such words even in the Myanmar–English dictionary.
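Counting words by their initial character, as in the analysis above, can be sketched in a few lines of Python. The function name `initial_char_counts` and the three sample words are assumptions for illustration; the sample uses Unicode escapes from the Myanmar block (U+1000 MYANMAR LETTER KA, U+1019 MYANMAR LETTER MA).

```python
from collections import Counter


def initial_char_counts(words):
    """Count how many words begin with each character."""
    return Counter(word[0] for word in words if word)


# Sample words written with Unicode escapes (illustrative only):
# KA + vowel signs I and U, MA alone, and KA alone.
words = ["\u1000\u102D\u102F", "\u1019", "\u1000"]

counts_by_initial = initial_char_counts(words)
for char, count in counts_by_initial.most_common():
    print(f"U+{ord(char):04X}: {count}")
```

Applied to the full list of segmented words, the same counter would yield the ranking plotted in Figure 3, assuming the figure counts word tokens by their first code point.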




Figure 3. Total frequency of Myanmar characters found in test data (x-axis: list of characters; y-axis: number of collected words, 0 to 100,000)
5. Error Analysis

In our test data of 9,274 Kbytes, we found 2,935,233 characters, excluding punctuation marks, numerals and English words. In terms of words, we identified a total of 766,892 Myanmar words (12,211 unique headwords), but 5,861 words (0.76%) were not identified. The errors result from incorrect spelling in the original text, undefined headwords (proper nouns that are not defined in the dictionary) and incorrect descriptions of syllable length in the database. Moreover, some errors result from words ending with certain characters such as “ ” (Myanmar Sign Dot Below) and from ambiguity in word segmentation. Some examples of errors are listed in Table 7.

Table 7. Some examples of errors

6. Conclusion

In this paper, we presented a word segmentation program for Myanmar text based on the longest string matching algorithm and a dictionary. We also presented both word-level and character-level frequency distributions and the word-level Markov table generated by this program. The program performed segmentation well and proved usable as a practical word segmentation engine for various NLP applications, including a Myanmar search engine (in particular, a word stemming engine). The statistical data generated by this program is useful as background information for designing various Myanmar NLP applications, including input systems. As future work, we plan to extend our program by collecting all possible Myanmar words, including not only conversational words but also proper nouns. We expect this ongoing research will yield benefits for our Myanmar search engine development task.

Acknowledgements

We acknowledge and highly appreciate the kind assistance given by the Myanmar Unicode & NLP Research Center. We would like to express our thanks to Dr. Daw Myint Myint Than and U Ngwe Tun, who kindly provided the data we needed.

References

[1] Hla Hla Htay et al., “Myanmar Word Segmentation using Syllable level Longest Matching”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.
[2] Hla Hla Htay, G. Bharadwaja Kumar and Kavi N. Murthy, “Constructing English-Myanmar Parallel Corpora”, The Fourth International Conference on Computer Application, 2006.
[3] Pann Yu Mon, Chew Yew Choong, Yoshiki Mikami, “Language Specific Crawler for Myanmar Pages”, Proceedings of the 11th International Conference on Humans and Computers (HC 2008), Nagaoka, Japan, November 2008.
[4] Tun Thura Thet et al., “Word Segmentation of the Myanmar Language”, Journal of Information Science, Vol. 34, No. 5, pp. 688-704, 2008.
[5] Wunna Ko Ko and Thin Zar Phyo, “Selection of XML tag set for Myanmar National Corpus”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.

MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATIONMACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
 
Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach
 
L10n mozmycampus-uum
L10n mozmycampus-uumL10n mozmycampus-uum
L10n mozmycampus-uum
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indian
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language TextsRBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
 
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language TextsRBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
 
Assisting Tool For Essay Grading For Turkish Language Instructors
Assisting Tool For Essay Grading For Turkish Language InstructorsAssisting Tool For Essay Grading For Turkish Language Instructors
Assisting Tool For Essay Grading For Turkish Language Instructors
 
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
Contextual Analysis for Middle Eastern Languages with Hidden Markov ModelsContextual Analysis for Middle Eastern Languages with Hidden Markov Models
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
 
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
 
Implementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large DictionaryImplementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large Dictionary
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...
 
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
 
M ORPHOLOGICAL A NALYZER U SING THE B I - LSTM M ODEL O NLY FOR JAPANESE H IR...
M ORPHOLOGICAL A NALYZER U SING THE B I - LSTM M ODEL O NLY FOR JAPANESE H IR...M ORPHOLOGICAL A NALYZER U SING THE B I - LSTM M ODEL O NLY FOR JAPANESE H IR...
M ORPHOLOGICAL A NALYZER U SING THE B I - LSTM M ODEL O NLY FOR JAPANESE H IR...
 
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
 
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Chengqing Zong, Casia...
 

Último

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 

Último (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 

Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development

  • 1. Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development

Pann Yu Mon (s065402@ics.nagaokaut.ac.jp), Maung Maung Thant (mmthant@gmail.com), Ohnmar Htun Pe (ohnmar.iuj@gmail.com), San Ko Oo (sankooo@gmail.com), Yoshiki Mikami (mikami@kjs.nagaokaut.ac.jp)

†Management and Information Systems Engineering Department, Nagaoka University of Technology
††International University of Japan

Abstract

This paper introduces an automatic Myanmar word analysis program developed as part of ongoing research on a Myanmar search engine. In this research we collected Myanmar words from documents on the World Wide Web to find out which words are used frequently. The program is designed for encodings compatible with the Unicode 5.1 standard. It can automatically generate a Markov chain matrix from the resulting words. The program is written in PHP. The head words of a Myanmar-English dictionary are used as index words.

Keywords

Myanmar, Code conversion tools, Myanmar word searching

1. Introduction

The Myanmar language, a member of the Tibeto-Burman subfamily of the Sino-Tibetan language family, is spoken as a mother tongue by more than 37 million Burmese and as a second language by about 20 million members of ethnic minorities in Myanmar. It is the only official language of Myanmar, formerly known as Burma. Myanmar is written in a script of circular and semi-circular letters adopted from the Mon script. The Mon script is in turn derived from the Indian Brahmi script, which flourished in the Indian subcontinent between the 5th century B.C. and the 3rd century A.D. The Myanmar language has 33 consonants and 12 vowels according to traditional tones on grammar.

Since 1990, Myanmar natural language processing work has been carried out by the Myanmar Unicode & NLP Research Center. The first Myanmar Unicode font for a GUI environment (Mac) was developed in 1988, and the one for Windows was developed in 1992. In 1998, Myanmar language processing was first discussed at ISO/IEC JTC1 and the Unicode Technical Committee, and the Myanmar character code set was finally included in ISO 10646. The center continues to work on Myanmar language processing so that it copes well with all applications, but covering the whole area still requires more effort.

In this research we propose a program that automatically collects Myanmar words from Myanmar Web pages. The main purpose of the research is to present an analysis of the Myanmar words found on Myanmar Web pages in support of Myanmar search engine development. Establishing a Myanmar search engine requires many components, such as indexing rules, a sorting algorithm, a stemming algorithm, and a word breaking algorithm.

In this study, we have collected Myanmar Web pages from various Web sites
  • 2. including Myanmar daily newspapers, community Web sites, and news Web sites, amounting to 9,274 Kbytes in total. We then extracted words from the downloaded Myanmar Web pages. The detailed process of collecting words and the analysis of the resulting data are discussed in the following sections.

2. Related Research

A number of researchers, both local and worldwide, have collected Myanmar words from different sources for their individual purposes.

In 2007, the Myanmar Unicode and NLP Research Center started developing the Myanmar National Corpus (MNC) [5]. The MNC includes both written and spoken text from various resources. That project is almost finished.

Hla Hla Htay and colleagues [2] have developed Myanmar corpora based on various resources, such as text from official newspapers in Myanmar, over 300 full books, and Myanmar texts from various Web sites including news sites and on-line magazines. In their research, all processing was based on ASCII format.

3. Methodology

Figure. 1. Step by step Procedure of Analysis

The step-by-step process of our analysis is shown in Figure 1. First, Myanmar Web pages are collected regardless of their fonts and encodings. The pages are then passed to a multi-font converter that converts them to Unicode 5.1. Finally, the program searches for words in the input text, and the resulting words are saved in the database. Each step is explained in more detail in the following subsections.

3.1. First Step: Downloading Myanmar Web Pages

The World Wide Web is the most convenient existing source of linguistic data, providing users with an abundance of texts of various types in a large number of languages. Because these texts are already in electronic form, they are well suited to corpus studies.

Downloading Myanmar Web pages requires an efficient crawler that selectively collects only Myanmar Web pages from the World Wide Web. In this research, the Language Specific Crawler (LSC) developed by one of the authors [3] was used. LSC runs concurrently with a language identifier and collects Myanmar Web pages efficiently. Table 1 lists the sources of the downloaded Web sites. After downloading, the pages were passed to the converter.

Table 1. Detail Information for source data

3.2. Second Step: Conversion of various encodings to the Unicode 5.1 Standard

Myanmar texts on the Web use various encodings that are not fully compliant with Unicode 5.1, so the crawled Web pages must be converted to Unicode. If a Web page is already encoded in Unicode, the work becomes easier.

Converting the various Myanmar encodings to Unicode requires an efficient converter. Currently, there are a number of Myanmar font conversion tools available on the
  • 3. Web. In this research, the Kanaung converter (footnote 1) and the Burglish converter (footnote 2) were used. Although both work well, their output still needs some editing. For example, the Kanaung converter could not convert ‘ ’ and ‘ ’ correctly. The Burglish converter works correctly when converting from the “Zawgyi-One” font to the “Myanmar3” font, but when converting from the “Wininwa” font to the “Myanmar3” font it cannot convert ‘ ’ accurately, and it does not handle punctuation marks and quotation marks correctly. Manual correction is therefore needed in those cases, although the converters are otherwise largely accurate.

3.3. Third Step: Word Searching Algorithm

The Myanmar language is written in a syllabic system, and spaces are not always placed between words or sentences. That is why word segmentation and word searching algorithms are needed for the Myanmar language. Very little research, using different approaches, has been published on segmenting Myanmar sentences into words [1] [4].

In our program, all of the Myanmar head words included in the Myanmar-English Dictionary (footnote 3) are used as the index file. It includes 28,000 Myanmar words. These head words are stored in the database and sorted in reverse order of syllable length for comparison with the input data. If an input word matches one of the head words, the program retrieves that word. If the input word does not match any entry in the head word list, the program cannot retrieve the word correctly. The accuracy of this algorithm therefore depends largely on the head word list.

Our algorithm uses longest matching to find words in the input data. It starts at the first character of the text and, using the head word list, attempts to find the longest word in the list. If such a word is found, the longest-matching algorithm marks a boundary at the end of that word and then repeats the same process, searching for the longest match starting at the characters that follow the match. If no match is found in the word list, the character is simply segmented as a word.

3.4. Fourth Step: Frequency Markov Chain Analysis

The program also uses word-based Markov models to calculate a word matrix table that captures which words are adjacent in sentences, that is, which word most frequently appears after a given word. This gives us high-level background information for word boundary detection when parsing the Myanmar language. The program first finds the words in the given Web pages and calculates the frequency of each word, that is, how many times it appears on the Web sites. After that, the Markov chain matrix table is generated automatically.

4. Result

We downloaded various Web sites, including newspaper sites, blog sites, entertainment sites, and sport sites, and collected 9,274 Kbytes of text data. After running the program, a total of 766,892 words were collected and 12,211 unique head words were found.

4.1. Distribution of Words on input string

Mono-syllable words are the most frequently used, because such words can be used in several ways. For example, the mono-syllable “ ” ([ko]) was found more than 20,000 times because it has several uses: case 1, a polite prefix to a young man’s name (as in “ ”); case 2, a postpositional marker indicating the objective case (as in “ ”); case 3, an emphatic particle suffixed to words (as in “ ”); and case 4, a postpositional marker indicating destination (as in “ ”). Bi-syllable words are the second most frequent, followed by tri-syllable words and so on. The top ten words sorted by frequency for mono-syllable, bi-syllable, tri-syllable, and tetra-syllable words are shown in the following tables.

Footnotes:
1 http://code.google.com/p/kanaung/
2 http://burglish.googlepages.com/fontconv.htm
3 Myanmar-English dictionary produced by Department of the Myanmar Language Commission
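The longest-matching procedure of Section 3.3 can be sketched as follows. This is a minimal illustration in Python, not the authors' PHP program: it matches over characters rather than syllables, and the function name `segment` and the romanized head words are hypothetical stand-ins for the dictionary entries.

```python
def segment(text, headwords):
    """Greedy longest-match segmentation of `text` against a set of head words."""
    # Longest candidate worth trying; longer candidates are tried first.
    max_len = max((len(w) for w in headwords), default=1)
    words = []
    i = 0
    while i < len(text):
        match = None
        # Shrink the candidate window until a head word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in headwords:
                match = candidate
                break
        if match is None:
            # No head word starts here: emit the single character as a word.
            match = text[i]
        words.append(match)
        i += len(match)  # continue right after the marked boundary
    return words


# Hypothetical romanized head words, for illustration only.
HEADWORDS = {"kyun", "kyuntaw", "ka", "kalay"}
print(segment("kyuntawkalayx", HEADWORDS))  # ['kyuntaw', 'kalay', 'x']
```

Because the match is greedy, "kyuntaw" is preferred over the shorter head word "kyun", and the unknown trailing character is emitted on its own. This also shows why accuracy depends on the coverage of the head word list.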
  • 4. Table 2. Top ten mono-syllable words

Mono-Syllable   Frequency         Gloss
[ko]            20,070 (2.61%)    Postpositional marker indicating the objective case
[ma]            18,181 (2.40%)    Particle prefixed to a verb for the negative sense
[ka]            17,469 (2.30%)    Postpositional marker indicating the nominative case
[tal]           14,424 (1.90%)    Colloquial form of the sentence final
[par]           12,774 (1.70%)    Particle denoting inclusion

Table 3. Top ten bi-syllable words

Bi-Syllable     Frequency         Gloss
[Kyun taw]      3,537 (0.46%)     I (male)
[Kyun ma]       3,332 (0.43%)     I (female)
[Ka lay]        1,994 (0.26%)     Child
[A twat]        1,981 (0.25%)     For
[Ae di]         1,737 (0.22%)     That

Table 4. Top ten tri-syllable words

Tri-Syllable      Frequency       Gloss
[Tha yot saung]   627 (0.08%)     Actor
[Pa ri thet]      500 (0.06%)     Audience
[Sa yar ma]       495 (0.06%)     Teacher (female)
[Thu nge chin]    404 (0.05%)     Friend
[Main ka lay]     400 (0.05%)     Girl

Table 5. Top ten tetra-syllable words

Tetra-Syllable     Frequency      Gloss
[sar yay sa yar]   222 (0.02%)    Author
[a nu pa nyar]     204 (0.02%)    Art
[a chay a nay]     176 (0.02%)    Condition
[a yay a tar]      157 (0.01%)    Writing
[a mhat ta ya]     138 (0.01%)    Remembrance

Figure. 2. Number of Syllables found in Test Data (bar chart; x-axis: number of syllables, y-axis: number of collected words)

Syllables   Words
1 (mono)    581,355
2 (bi)      147,100
3 (tri)     27,770
4           9,752
5           758
6           117
7           16
8           5
9           17
10          2
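As a quick check on the distribution in Figure 2, the counts can be turned into percentage shares. The short Python sketch below uses only the counts reported in the figure. The names `SYLLABLE_COUNTS` and `shares` are illustrative and not taken from the authors' program.

```python
# Word counts by number of syllables, as reported in Figure 2.
SYLLABLE_COUNTS = {
    1: 581_355, 2: 147_100, 3: 27_770, 4: 9_752, 5: 758,
    6: 117, 7: 16, 8: 5, 9: 17, 10: 2,
}


def shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}


TOTAL = sum(SYLLABLE_COUNTS.values())
SHARES = shares(SYLLABLE_COUNTS)

print(TOTAL)  # 766892
for syllables, pct in SHARES.items():
    print(f"{syllables:>2} syllable(s): {pct:.2f}%")
```

The counts sum to 766,892, the total number of words reported in Section 4. Mono-syllable words account for roughly 75.8% of all collected words and bi-syllable words for roughly 19.2%.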
  • 5. 4.2. Word Level Frequency Matrix

Based on the input string, the program generated a word-level Markov table. Using this matrix we can identify adjacent word pairs, which gives us high-level background information for parsing sentences into words. By applying the same algorithm at the character level, we can also generate a character-level Markov table, which can be used in a Myanmar character input method for mobile phones.

Table 6. Word-Level Matrix (sum of frequency; rows: first word, columns: second word)

The Myanmar word labels of the rows and columns are not legible in this transcript. The cell values legible for each first-word row are listed below with the row's grand total.

Row   Cell values         Grand Total
1     1144                1144
2     722, 1273, 1217     4893
3     1564                2343
4     1339, 1511          2850
5     934                 934
6     1205, 1717          2922
7     809                 1754

Column grand totals: 1205, 722, 2651, 1339, 1273, 2373, 1511, 1144, 1217. Overall grand total: 16840.

4.3. Distribution of characters on Input String

We analyzed the character-level frequency of the input data. The result is shown in Figure 3. More than 90,000 words begin with “ ”, making it the top-ranked initial character, followed by “ ” and so on. No words were found that start with the characters “ ”. We could not find such words even in the Myanmar-English dictionary.

Figure. 3. Total Frequency of Myanmar Characters found in Test Data (bar chart; x-axis: list of characters, y-axis: number of collected words, 0 to 100,000)
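The word-level Markov table of Sections 3.4 and 4.2 amounts to counting how often each word is followed by each other word. The Python sketch below illustrates that idea under stated assumptions: the input is already segmented into lists of words, and the function names and romanized sample words are hypothetical. It is not the authors' PHP implementation.

```python
from collections import Counter, defaultdict


def bigram_table(sentences):
    """Count (first word, second word) adjacencies over segmented sentences."""
    table = defaultdict(Counter)
    for words in sentences:
        # Pair each word with its immediate successor.
        for first, second in zip(words, words[1:]):
            table[first][second] += 1
    return table


def most_frequent_successor(table, word):
    """Return the word that most often follows `word`, or None if unseen."""
    followers = table.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical segmented sentences in romanized form, for illustration only.
SENTENCES = [
    ["kyun", "taw", "ka"],
    ["kyun", "taw", "ko"],
    ["kyun", "ma", "ka"],
]
TABLE = bigram_table(SENTENCES)
print(most_frequent_successor(TABLE, "kyun"))  # taw
print(sum(TABLE["kyun"].values()))             # 3 (row total for "kyun")
```

Each row of `TABLE` corresponds to a first word and each column to a second word, so summing a row gives the kind of per-row grand total shown in Table 6. Running the same function on lists of characters instead of words would give a character-level table.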
  • 6. 5. Error Analysis

In our test data of 9,274 Kbytes, we found 2,935,233 characters, excluding punctuation marks, numerals, and English words. In terms of words, we identified a total of 766,892 Myanmar words (12,211 unique head words), but 5,861 words (0.76%) were not identified. The errors result from incorrect spelling in the original text, undefined head words (proper nouns that are not defined in the dictionary), and incorrect syllable lengths recorded in the database. Some errors also result from words ending with certain characters such as “ ” (Myanmar Sign Dot Below) and from ambiguity in word segmentation. Some examples of errors are listed in Table 7.

Table 7. Some Examples of errors

6. Conclusion

In this paper, we presented a word segmentation program for Myanmar text based on a longest string matching algorithm and a dictionary. We also presented word-level and character-level frequency distributions and the word-level Markov table generated by the program. The program performed segmentation well and proved usable as a practical word segmentation engine for various NLP applications, including a Myanmar search engine (in particular, a word stemming engine). The statistical data generated by the program is useful as background information for designing various Myanmar NLP applications, such as input systems. As future work, we plan to extend the program by collecting all possible Myanmar words, including not only conversational words but also proper nouns. We expect this ongoing research to benefit our Myanmar search engine development.

Acknowledgements

We acknowledge and highly appreciate the kind assistance given by the Myanmar Unicode & NLP Research Center. We would like to thank Dr. Daw Myint Myint Than and U Ngwe Tun, who kindly provided the data we needed.

References

[1] Hla Hla Htay et al., “Myanmar Word Segmentation using Syllable level Longest Matching”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.
[2] Hla Hla Htay, G. Bharadwaja Kumar and Kavi N. Murthy, “Constructing English-Myanmar Parallel Corpora”, The Fourth International Conference on Computer Application, 2006.
[3] Pann Yu Mon, Chew Yew Choong, Yoshiki Mikami, “Language Specific Crawler for Myanmar Pages”, Proceedings of the 11th International Conference on Humans and Computers (HC 2008), Nagaoka, Japan, November 2008.
[4] Tun Thura Thet et al., “Word Segmentation of the Myanmar Language”, Journal of Information Science, Vol. 34, No. 5, pp. 688-704, 2008.
[5] Wunna Ko Ko and Thin Zar Phyo, “Selection of XML tag set for Myanmar National Corpus”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.