Statistical Analysis of Myanmar Words on the World Wide Web for
Search Engine Development
Pann Yu Mon Maung Maung Thant Ohnmar Htun Pe
s065402@ics.nagaokaut.ac.jp mmthant@gmail.com ohnmar.iuj@gmail.com
San Ko Oo Yoshiki Mikami
sankooo@gmail.com mikami@kjs.nagaokaut.ac.jp
†Management and Information Systems Engineering Department
Nagaoka University of Technology
††International University of Japan
Abstract

This paper introduces an automatic Myanmar word analysis program developed as part of ongoing research on a Myanmar search engine. We collected Myanmar words from documents on the World Wide Web to find out which words are used most frequently. The program is designed for encodings compatible with the Unicode 5.1 standard, and it automatically generates a Markov chain matrix from the resulting words. The program is written in PHP. The Myanmar head words included in the Myanmar-English dictionary are used as index words.

Keywords

Myanmar, Code conversion tools, Myanmar word searching

1. Introduction

The Myanmar language, a member of the Tibeto-Burman subfamily of the Sino-Tibetan language family, is spoken as a mother tongue by more than 37 million Burmese and as a second language by about 20 million members of ethnic minorities in Myanmar. It is the only official language of Myanmar, formerly known as Burma. Myanmar is written in a script of circular and semi-circular letters adopted from the Mon script, which in turn derives from the Indian Brahmi script that flourished in the Indian subcontinent between the 5th century B.C. and the 3rd century A.D. According to traditional grammar, the Myanmar language has 33 consonants and 12 vowels.

Since 1990, Myanmar natural language processing tasks have been carried out by the Myanmar Unicode & NLP Research Center. The first Myanmar Unicode font for a GUI environment (Mac) was developed in 1988, and the one for Windows was developed in 1992. In 1998, Myanmar language processing was first discussed at ISO/IEC JTC1 and the Unicode Technical Committee, and the Myanmar character code set was finally included in ISO 10646. The Center continues to work on Myanmar language processing so that it works well with all applications, but covering the whole area still requires more effort.

In this research, we propose a program that automatically collects Myanmar words from Myanmar Web pages. The main purpose of this research is to present an analysis of Myanmar words on Myanmar Web pages in support of Myanmar search engine development. Establishing a Myanmar search engine requires many components, such as indexing rules, a sorting algorithm, a stemming algorithm, a word breaking algorithm and so on.

In this study, we have collected Myanmar Web pages from various Web sites
including Myanmar daily newspapers, community Web sites and news Web sites, amounting to 9,274 Kbytes in total. We then extracted words from the downloaded Myanmar Web pages. The detailed process of collecting the words and the analysis of the resulting data are discussed in the following sections.

2. Related Research

A number of researchers, both local and worldwide, have collected Myanmar words from different sources for their individual purposes.

In 2007, the Myanmar Unicode and NLP Research Center started developing the Myanmar National Corpus (MNC) [5]. The MNC includes both written and spoken text from various resources. That project is almost finished.

Hla Hla Htay and colleagues [2] have developed Myanmar corpora based on various resources, such as text from official newspapers in Myanmar, over 300 full books, and Myanmar texts from various Web sites including news sites and on-line magazines. In their research, all tasks were processed in ASCII format.

3. Methodology

Figure 1. Step by step Procedure of Analysis

The step-by-step process of our analysis is shown in Figure 1. First, Myanmar Web pages are collected regardless of their fonts and encodings. Then they are passed to a multi-font converter that converts them to Unicode 5.1. Finally, the program searches for words in the input text, and the resulting words are saved in the database. Each step is explained in more detail below.

3.1. First Step: Downloading Myanmar Web Pages

The World Wide Web is the most convenient existing source of linguistic data, providing users with an abundance of texts of various types in a large number of languages. Since these texts are already in electronic form, they are well suited to corpus studies.

Downloading Myanmar Web pages requires an efficient crawler that can selectively collect only Myanmar Web pages from the World Wide Web. In this research, the Language Specific Crawler (LSC) developed by one of the authors [3] was used. LSC runs concurrently with a language identifier and collects Myanmar Web pages efficiently. Table 1 lists the sources of the downloaded Web sites. After downloading, the pages were passed to the converter.

Table 1. Detail Information for source data

3.2. Second Step: Conversion of various encoding to Unicode 5.1 Standard

Myanmar texts on the Web use various encodings that are not fully compliant with Unicode 5.1, so the crawled Web pages must be converted to a Unicode encoding. If the Web pages are already encoded in Unicode, the work becomes easier.

Converting the various Myanmar encodings to Unicode requires an efficient converter. Currently, there are a number of Myanmar font conversion tools available on the
Web. In this research, the Kanaung converter¹ and the Burglish converter² were used. Although both of them work well, their output still needs some editing. For example, the Kanaung converter could not convert ‘ ’ and ‘ ’ correctly. The Burglish converter works correctly when converting from the “Zawgyi-One” font to the “Myanmar3” font, but when converting from the “Wininwa” font to the “Myanmar3” font it cannot convert ‘ ’ accurately, and it does not handle punctuation marks and quotation marks correctly. Thus, although the converters are nearly accurate, manual correction is needed in those cases.

3.3. Third Step: Word Searching Algorithm

The Myanmar language is written in a syllabic system, and spaces are not always placed between words or sentences. That is why a word segmenting algorithm and a word searching algorithm are needed for the Myanmar language. Very little research, using different approaches, has been published on segmenting sentences into words in the Myanmar language [1] [4].

In our program, all of the Myanmar head words included in the Myanmar-English Dictionary³ are used as the index file. It includes 28,000 Myanmar words. Those head words are stored in the database and sorted in reverse order of syllable length for comparison with the input data. If the input word matches one of the head words, the program retrieves that word. If the input word does not match any entry in the head word list, the program cannot retrieve the word correctly. Thus the accuracy of this algorithm depends largely on the head word list.

Our algorithm uses longest matching to find words in the input data. It starts at the first character of a text and, using the head word list, attempts to find the longest word in the list that begins there. If such a word is found, the longest-matching algorithm marks a boundary at the end of that word and then repeats the same process, searching for the longest match starting at the characters that follow the match. If no match is found in the word list, the character is simply segmented as a word.

3.4. Fourth Step: Frequency Markov Chain Analysis

In the program, word-based Markov models are also used to calculate a word matrix table that shows which words are adjacent in sentences (that is, which word most frequently appears after a given word). This gives us high-level background information for word boundary detection when parsing the Myanmar language. Our program first finds the words on the given Web pages and calculates the frequency of each word, that is, how many times the word appears on the Web sites. After that, the Markov chain matrix table is generated automatically.

4. Result

We downloaded various Web sites, including newspaper, blog, entertainment and sport sites, and collected 9,274 Kbytes of text data. After running the program, a total of 766,892 words were collected and 12,211 unique head words were found.

4.1. Distribution of Words on input string

We found that mono-syllable words are used most frequently, because they can be used in several ways. For example, the mono-syllable “ ” ([ko] in Table 2) was found more than 20,000 times. It can serve as (1) a polite prefix to a young man’s name (as in “ ”), (2) a postpositional marker indicating the objective case (as in “ ”), (3) an emphatic particle suffixed to words (as in “ ”) and (4) a postpositional marker indicating destination (as in “ ”). Bi-syllable words are the second most frequent, followed by tri-syllable words and so on. The top words sorted by frequency for mono-syllable, bi-syllable, tri-syllable and tetra-syllable words are shown in Tables 2 to 5.

¹ http://code.google.com/p/kanaung/
² http://burglish.googlepages.com/fontconv.htm
³ Myanmar-English dictionary produced by Department of the Myanmar Language Commission
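The longest-matching procedure described in Section 3.3 can be sketched roughly as follows. This is a minimal illustration and not the authors' PHP program: the function name `segment_longest_match` and the romanized toy head words are hypothetical, and the sketch compares candidates by character length, whereas the paper sorts the head word list by syllable length.

```python
def segment_longest_match(text, headwords):
    """Greedy longest-match segmentation against a head word list.

    Assumption: candidates are ordered by character length; the paper
    orders its head words by syllable length instead.
    """
    # Longest entries first, so the first hit at a position is the longest match.
    ordered = sorted({w for w in headwords if w}, key=len, reverse=True)
    words = []
    pos = 0
    while pos < len(text):
        for candidate in ordered:
            if text.startswith(candidate, pos):
                words.append(candidate)
                pos += len(candidate)  # boundary at the end of the match
                break
        else:
            # No head word matches here: emit the single character as a word.
            words.append(text[pos])
            pos += 1
    return words


# Hypothetical romanized head words standing in for dictionary entries.
toy_headwords = ["ka", "ko", "kalay", "kyuntaw"]
print(segment_longest_match("kyuntawkalayko", toy_headwords))
# ['kyuntaw', 'kalay', 'ko']
```

Because "kalay" is tried before "ka", the longer entry wins at that position, which is the behaviour the head word ordering is meant to guarantee. A linear scan over the whole list is used only for clarity; a real head word list of 28,000 entries would more likely be indexed, for example by first syllable.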
Table 2. Top ten mono-syllable words

Mono-Syllable   Meaning                                                      Frequency
[ko]            Postpositional marker to indicate objective case            20070 (2.61%)
[ma]            Particle prefixed to a verb to form the negative sense      18181 (2.40%)
[ka]            Postpositional marker to indicate nominative case           17469 (2.30%)
[tal]           Colloquial form of the sentence final                       14424 (1.90%)
[par]           Particle denoting inclusion                                 12774 (1.70%)

Table 3. Top ten bi-syllable words

Bi-Syllable     Meaning       Frequency
[Kyun taw]      I (male)      3537 (0.46%)
[Kyun ma]       I (female)    3332 (0.43%)
[Ka lay]        Child         1994 (0.26%)
[A twat]        For           1981 (0.25%)
[Ae di]         That          1737 (0.22%)

Table 4. Top ten tri-syllable words

Tri-Syllable        Meaning             Frequency
[Tha yot saung]     Actor               627 (0.08%)
[Pa ri thet]        Audience            500 (0.06%)
[Sa yar ma]         Teacher (female)    495 (0.06%)
[Thu nge chin]      Friend              404 (0.05%)
[Main ka lay]       Girl                400 (0.05%)

Table 5. Top ten tetra-syllable words

Tetra-Syllable      Meaning         Frequency
[sar yay sa yar]    Author          222 (0.02%)
[a nu pa nyar]      Art             204 (0.02%)
[a chay a nay]      Condition       176 (0.02%)
[a yay a tar]       Writing         157 (0.01%)
[a mhat ta ya]      Remembrance     138 (0.01%)
Figure 2. Number of Syllables found in Test Data (x-axis: number of syllables; y-axis: number of collected words)

Syllables per word    Number of collected words
1 (mono-syllable)     581,355
2 (bi-syllable)       147,100
3 (tri-syllable)      27,770
4                     9,752
5                     758
6                     117
7                     16
8                     5
9                     17
10                    2
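As a quick cross-check of the Figure 2 data, the counts per syllable length can be summed and converted to shares. The sketch below only re-uses the numbers reported in the figure; the variable names are illustrative.

```python
# Word counts by number of syllables, as reported in Figure 2.
counts_by_syllables = {
    1: 581355, 2: 147100, 3: 27770, 4: 9752, 5: 758,
    6: 117, 7: 16, 8: 5, 9: 17, 10: 2,
}

total_words = sum(counts_by_syllables.values())

# Share of each syllable length as a percentage of all collected words.
shares = {n: 100 * c / total_words for n, c in counts_by_syllables.items()}

print(total_words)
for n, share in shares.items():
    print(f"{n}-syllable words: {share:.2f}%")
```

The counts add up to the 766,892 words reported in Section 4, with mono-syllable words accounting for roughly 75.8% and bi-syllable words for roughly 19.2% of the total.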
4.2. Word Level Frequency Matrix

Based on the input string, the program generated a word-level Markov table. This matrix shows which word pairs occur adjacently, which gives us high-level background information for parsing a sentence into words. By applying the same algorithm at the character level, we can also generate a character-level Markov table, which can be used in a Myanmar character input method for mobile phones.
Table 6. Word-Level Matrix
(Sum of frequency; rows: first word; columns: second word; last column and last row: grand total)
1144 1144
722 1273 1217 4893
1564 2343
1339 1511 2850
934 934
1205 1717 2922
809 1754
Grand
Total 1205 722 2651 1339 1273 2373 1511 1144 1217 16840
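A word-level matrix of the kind shown in Table 6 can be built by counting adjacent word pairs in the segmented text. The following is a minimal sketch under stated assumptions, not the authors' PHP implementation: the function names and the romanized toy sentences are hypothetical, and the input is assumed to be already segmented into words.

```python
from collections import Counter, defaultdict


def bigram_matrix(sentences):
    """Count how often each second word directly follows each first word."""
    matrix = defaultdict(Counter)
    for words in sentences:
        for first, second in zip(words, words[1:]):
            matrix[first][second] += 1
    return matrix


def most_frequent_follower(matrix, word):
    """Return the word that most often follows `word`, or None if unseen."""
    followers = matrix.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical, already segmented toy sentences (romanized).
toy_sentences = [
    ["kyuntaw", "ka", "kalay", "ko"],
    ["kyuntaw", "ka", "sayarma", "ko"],
    ["kalay", "ka", "kyuntaw", "ko"],
]
matrix = bigram_matrix(toy_sentences)
row_total = sum(matrix["kyuntaw"].values())  # corresponds to a row "Grand Total"
print(most_frequent_follower(matrix, "kyuntaw"), row_total)
```

Running the same counting over characters instead of words would give the character-level table mentioned in Section 4.2.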
4.3. Distribution of characters on Input String

We analyzed the character-level frequency of the input data. The result is shown in Figure 3. Words beginning with “ ” number over 90,000, making it the top-ranking character, followed by “ ” and so on. No words were found starting with the characters “ ”; we could not find such words even in the Myanmar-English dictionary.
Figure 3. Total Frequency of Myanmar Characters found in Test Data (x-axis: list of characters; y-axis: number of collected words, 0 to 100,000)
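The per-character counts behind Figure 3 amount to tallying the initial character of every collected word. The sketch below is illustrative only: the function name and the romanized toy words are hypothetical, and it assumes that the first code point of each word is the character of interest, which may need adjusting for how a given Myanmar encoding orders characters.

```python
from collections import Counter


def first_char_frequency(words):
    """Count how many words start with each character (first code point)."""
    return Counter(word[0] for word in words if word)


# Hypothetical romanized words standing in for segmented Myanmar words.
toy_words = ["ko", "ka", "ma", "kalay", "ahtwat"]
freq = first_char_frequency(toy_words)

# Characters ranked by the number of words that begin with them.
for char, count in freq.most_common():
    print(char, count)
```

Characters that never appear in the resulting counter correspond to the case noted in Section 4.3, where no word in the data starts with a given character.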
5. Error Analysis

In our test data of 9,274 Kbytes, we found 2,935,233 characters, excluding punctuation marks, numerals and English words. In terms of words, we identified a total of 766,892 Myanmar words (12,211 unique headwords), but 5,861 words (0.76%) were not identified. The errors result from incorrect spelling in the original text, undefined headwords (proper nouns that are not defined in the dictionary) and incorrect descriptions of syllable length in the database. Moreover, some errors result from words ending with certain characters such as “ ” (Myanmar Sign Dot Below) and from ambiguity in word segmentation. Some examples of errors are listed in Table 7.

Table 7. Some Examples of errors

6. Conclusion

In this paper, we presented a word segmentation program for Myanmar text based on a longest string matching algorithm and a dictionary. We also presented both word-level and character-level frequency distributions and the word-level Markov table generated by this program. The program performed the segmentation work well and proved usable as a practical word segmentation engine for various NLP applications, including a Myanmar search engine (in particular, a word stemming engine). The statistical data generated by this program are useful as background information for designing various Myanmar NLP applications, including input systems. As future work, we plan to extend our program by collecting all possible Myanmar words, including not only conversational words but also proper nouns. We expect this ongoing research to benefit our Myanmar search engine development task.

Acknowledgements

We acknowledge and highly appreciate the kind assistance given by the Myanmar Unicode & NLP Research Center. We would like to express our thanks to Dr. Daw Myint Myint Than and U Ngwe Tun, who kindly provided us with the data we needed.

References

[1] Hla Hla Htay et al., “Myanmar Word Segmentation using Syllable level Longest Matching”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.
[2] Hla Hla Htay, G. Bharadwaja Kumar and Kavi N. Murthy, “Constructing English-Myanmar Parallel Corpora”, The Fourth International Conference on Computer Application, 2006.
[3] Pann Yu Mon, Chew Yew Choong and Yoshiki Mikami, “Language Specific Crawler for Myanmar Pages”, Proceedings of the 11th International Conference on Humans and Computers (HC 2008), Nagaoka, Japan, November 2008.
[4] Tun Thura Thet et al., “Word Segmentation of the Myanmar Language”, Journal of Information Science, Vol. 34, No. 5, pp. 688-704, 2008.
[5] Wunna Ko Ko and Thin Zar Phyo, “Selection of XML tag set for Myanmar National Corpus”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.