SlideShare uma empresa Scribd logo
1 de 7
Baixar para ler offline
International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013
DOI : 10.5121/ijait.2013.3203 35
PART OF SPEECH TAGGING OF MARATHI TEXT
USING TRIGRAM METHOD
Jyoti Singh1
, Nisheeth Joshi2
, Iti Mathur3
1,2,3
Apaji Institute, Banasthali University, Rajasthan, India,
1
jyoti.singh132@gmail.com,
2
nisheeth.joshi@rediffmail.com,
3
mathur_iti@rediffmail.com
ABSTRACT
In this paper we present a Marathipart of speech tagger. It is morphologically rich language. it is spoken
by the native people of Maharashtra. The general approach used for development of tagger is statistical
using Trigram Method. The main concept of Trigram is to explore the most likely POS for a token based on
given information of previous two tags by calculating probabilities to determine whichthe best sequence of
tag is. In this paper we show the development of the tagger. Moreover we have also shown the evaluation
done.
KEYWORDS:
Part of Speech Tagging, Stochastic Tagging, N-gram Modeling,Marathi
1. INTRODUCTION
Natural Language is the medium for communication which is incorporated by every human
being. One of the most important activities in processing natural languages is Part of Speech
tagging. In POS Tagging we assign a Part of Speech tag to each word in a sentence and literature.
POS tagging is one of the simplest, most constant and statistical model for many NLP
application. POS Tagging is an initial stage of linguistics, text analysis like information retrieval,
machine translator, text to speech synthesis, information extraction etc. Part-of-speech tagging is
a process of assigning the words in a text as corresponding to a particular part of speech.
A fundamental version of POS tagging is the identification of words as nouns, verbs, adjectives
etc. POS Tagging can be regarded as a simplified form of morphological analysis where it only
deals with assigning an appropriate POS tag to the word, while morphological analysis deals with
finding the internal structure of the word. Indian languages are morphologically rich and they
have more than one morpheme of a word due to this tagging of Indian languages are difficult.A
Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and
assigns parts of speech to each word. A part of speech includes nouns, verbs, adverbs, adjectives,
pronouns, conjunction and their sub-categories. Various approaches have been proposed to
International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013
36
implement POS taggers. Broadly they can be classified as rule based Statistical and Hybrid
approaches.The rule based POS tagging models apply a set of hand written rules and use
contextual information to assign POS tags to words.
The necessity of a linguistic background and manually constructing the rules are the main
drawbacks of the rule based systems.A stochastic approach includes frequency and probability or
statistics. The simplest stochastic approach finds out the most frequently used tag for a specific
word in the annotated training data and uses this information to tag that word in the unannotated
text. The problem with this approach is that it can come up with sequences of tags for sentences
that are not acceptable according to the grammar rules of a language.The Hybrid approaches use a
pre-defined set of handcrafted rules as well as automatically induced rules that are generated
during training.
2. PROBLEMS OF PART OF SPEECH TAGGING
Ambiguous words are the main problem in part of speech tagging. There may be many words
which can have more than one tag. Sometimes it happens that a word has same POS but have
different meaning in different context. To solve this problem we consider the context instead of
taking single word. For example-
अंकुश/NNPने/PSP,/SYMया चा /PRPबोलया वर /VM अंकुश/RB लावला/VAUX./SYM
The same word ‘अंकु श’ is given a different label in a same sentence. In the first case it is termed
as a proper noun. In the second case it is termed as a common noun as it is referring to control.
Since first word अंकु श occur in a sentence as subject and after that there is a postposition that’s
why it is labelled as NNP and in second time अंकु श comes between main verb and auxiliary verb
so it is assigned as adverb. POS Tagging tries to correctly identify a POS of a word by looking at
the context (surrounding words) in a sentence.
3. PREVIOUS WORK ON INDIAN LANGUAGE POS TAGGING
There are several approaches that have been used for POStagging and several research workhave
been carried out in this area. Singh et. al. [2] proposed a Manipuri POS Tagger using CRF and
SVM. In this paper, they described a tagger for using Conditional Random Field (CRF) and
Support Vector Machine (SVM). There Evaluation results demonstrated the accuracies of
72.04%, and 74.38% in the CRF, and SVM, respectively. Hasanet. al. [3] Proposed Comparison
of Unigram, Bigram, HMM and Brill’s POS Tagging Approaches for some South Asian
Languages. They used stochastic methodology for some of south Asian languages like Bangla,
Hindi and Telugu and compared corresponding result for all languages.They found that brill
transformation based taggers results are better. Ekbal and Sivaji [4] proposed Web-based Bengali
News Corpus for Lexicon Development and POS Tagging the POS taggers using Hidden Markov
Model (HMM) and Support Vector Machine (SVM). The POS taggers are developed for Bengali
shows the accuracies as 85.56%, and 91.23% for HMM, and SVM, respectively.
Dhanalakshmi V,et. al. [5] presentedTamil POS Tagging using Linear Programming. In this paper
they proposed the Part Of Speech tagger for Tamil using Machine learning techniques. They
International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013
37
discovered that SVM based machine learning tool affords the most encouraging result for Tamil
POS tagger (95.64%). Kumaret. al. [6] presentedBuilding Feature Rich POS Tagger for
Morphologically Rich Languages: Experiences in Hindi. A statistical part-of-speech tagger for a
morphologically rich language: Hindi.This Tagger employs the maximum entropy Markov model
with a rich set of features capturing the lexical and morphological characteristics of the language.
The system achieved the best accuracy of 94.89% and an average accuracy of 94.38%.Singhet. al.
[7], in 2008 proposed Part-of-Speech Tagging for Grammar Checking of Punjabi. In this paper,
they have discussed the issues concerning the development of a POS tagset and a POS tagger for
the use as a part of the project on developing an automated grammar checking system for Punjabi
Language.In 2009, Manju K.et. al. [8] proposed Development of a POS Tagger for Malayalam
which was a Hidden Markov Model (HMM) based part of speech tagger for Malayalam language.
The performance of the developed POS Tagger is about 90% and almost 80% of the sequences
that have been generated automatically for the test case were found correct. Joshiet. al. [9],
proposed Part of Speech Tagging for Hindi. They have used IL POS tag set for the development
of the tagger. They disambiguated correct word-tag combinations using the contextual
information available in the text.They have achieved the accuracy of 92%.
Patel et. al. [10] proposed Part-Of-Speech Tagging for Gujarati Using Conditional Random
Fields. In this paper they described a machine learning algorithm for Gujarati Part of Speech
Tagging.This paper shows a machine learning algorithm for Gujarati Part of Speech Tagging. The
machine learning part is performed using a CRF model.The algorithms have achieved an
accuracy of 92% for Gujarati texts where the training corpus is of 10,000 words and the test
corpus is of 5,000 words. Reddy and Sharoff[11] proposed Cross Language POS Taggers (and
other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources they
have usedTnT (Brants, 2000), a popular implementation of the second-order Markov model for
POS tagging. Kumar et. al[12] presented Part of Speech Tagger for Morphologically rich Indian
Languages: A survey. In this paper they have reported about differentPOS taggers based on
different languages and methods. In this paper they have shown the accuracy of corresponding
tagger.
4. POS TAGSET
Depending on some general principle of tagset design strategy, a number of POS tagsets have
been developed by different organizations based. For POS annotation of texts in Marathi, we have
used tagset developed by IIIT Hyderabad (Bharti, et. al., 2006) [1]. Table shows brief description
of IL POS Tag set.
International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013
38
S.No. Tag Description (Tag Used for) Example
1. NN Common Nouns मुलगा, साखर,मंडळी, सैय ,
चांगुलपणा
2. NST Noun Denoting Spatial and Temporal
Expressions
मागेपुढे,वर,खाली
3. NNP Proper Nouns (name of person) मोहन, राम, सुरेश
4. PRP Pronoun मी,आही ,तुही
5. DEM Demonstrative तो, ती, ते, हा, ही, हे
6. VM Verb Main (Finite or Non-Finite) बसणे,दसणे , िलिहणे
7. VAUX Verb Auxiliary (Any verb, present besides
main verb shall be marked as auxiliary
verb)
नाही,नको,करणे ,हवे,
नये
8. JJ Adjective (Modifier of Noun) उसा ही ,े ,बळवान
9. RB Adverb (Modifier of Verb) आता,काल,कधी,नेहमी
10. PSP Postposition आिण,वर,कडे
11. RP Particles भी, तो, ही
12. QF Quantifiers बत , थोडा, कम
13. QC Cardinals एक, दोन, तीन
14. CC Conjuncts (Coordinating and Subordinating) आिण,केहा ,तेहा ं , जर, तर
15. WQ Question Words कायकधीकुठेकोण
16. QO Ordinals पिहलादुसराितसरा
17. INTF Intensifier खूपफारपुकळ
18. INJ Interjection आहा,छान,अगो, हाय
19. NEG Negative नाही,नको
20. SYM Symbol ? , ; : !
21. XC Compounds काळेमांजर-काळमांजर,
तेलपाणी- तेलवणी
22. RDP Reduplications जवळ-जवळ
24. UNK Foreign Words English, ગુજરાતી,
Table 1: POS tagset for Marathi
International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013
39
5. OUR APPROACH
In this paper we are describing Trigram Model for Marathi POS tagger. Our main aim is to
perform POS Tagging to determine the most likely tag for a word, given the previous two tags.
For trigrams, the probability of a sequence is just the product of conditional probabilities of its
trigrams. So if t1, t2 …tn are tag sequence and w1, w2…wn are corresponding word sequence then
the following equation explains this fact-
P(t |w ) = P(w |t ). P(t |t ,t )
Where ti denotes the tag sequence and wi denotes the word sequences
P (w |t ) is the probability of current word given current tag.
(1)
Here, P(t |t t )is the probability of a current tag given the previous two tags.
This provides the transition between the tags and helps capture the context of the sentence. These
probabilities are computed by following equation.
P(ti/ti-2, ti-1) =f (ti-2, ti-1,ti)/f (ti-2, ti-1) (2)
Each tag transition probability is computed by calculating the frequency count of two tags which
come together in the corpus divided by the frequency count of the previous two tagscomingin the
corpus.
Figure 1. Working of trigram model
International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013
40
6. EVALUATION
For testing the performance of our system, we developed a test corpus of 2000 sentences (48,635
words). This provided us with accuracy of 91.63%. The accuracy was calculated using the
formula:
Accuracy(%) =
.
.
*100
Test scores of our system are as follows:
No. of Correct POS tags assigned bythe system = 44563
No. of POS taginthe text = 48635
Thus the accuracy of the system is 91.63%.
(3)
7. CONCLUSION
The topic of POS tagging discussed in this paper showsTrigram based POS tagger for Marathi.
The POS tagger described here is very simple and efficient for automatic tagging, but the
morphological complexity of the Marathi makes it a little hard. The performance of the current
system is good and the results achieved by this method are excellent. In future we wish to
improve the accuracy of our system by adding more tagged sentences in our training corpus.
REFERENCES
[1] Bharati, A., Sharma, D.M., Bai, L., Sangal, R., (2006) “AnnCorra: Annotating Corpora Guidelines for
POS and Chunk Annotation for Indian Languages”.
[2] Singh Thoudam Doren and Bandyopadhyay Sivaji, (2008) “Morphology Driven Manipuri POS
Tagger”, Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 91–
98, Hyderabad, India.
[3] Hasan Fahim Muhammad, Zaman Naushad Uz and Khan Mumit, (2007) “Comparison of Unigram,
Bigram, HMM and Brill’s POS Tagging Approaches for some South Asian Languages”. In
proceeding of Center for Research on Bangla Language Processing.
[4] Ekbal Asif and Bandyopadhyay Shivaji, (2008) “Web-based Bengali News Corpus for Lexicon
Development and POS Tagging”. In Proceeding of Language Resource and Evaluation.
[5] Dhanalakshmi V, Anandkumar M, Rajendran S, Soman K P, (2009) “Tamil POS Tagging using
Linear Programming” in proceeding of International Journal of Recent Trends in Engineering, Vol. 1,
No. 2.
[6] Dalal Aniket, Kumar Nagraj, Sawant Uma, Shelke Sandeep and Bhattacharyya Pushpak, (2007)
“Building Feature Rich POS Tagger for Morphologically Rich Languages: Experiences in Hindi”. In
Proceedings ofInternational Conference on Natural Language Processing (ICON).
[7] Singh Mandeep, Lehal Gurpreet, and Sharma Shiv, (2009) “Part-of-Speech Tagging for Grammar
Checking of Punjabi” in proceeding of The Linguistics Journal Volume 4 Issue.
[8] Manju K, Soumya S, Idicul S.M., (2009) “A Development of A POS Tagger for Malayalam - An
Experience.” In Proceedings of International Conference on Advances in Recent Technologies in
Communication and Computing.
[9] Joshi Nisheeth, Darbari Hemant, Mathur Iti, (2013) “HMM based POS Tagger for Hindi”. In
Proceeding of 2013 International Conference on Artificial Intelligence and Soft Computing.
International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013
41
[10] Patel Chirag, Gali Karthik, (2008) “Part-Of-Speech Tagging for Gujarati Using Conditional Random
Fields”. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pp 117-
122.
[11] Reddy Siva, Sharoff Serge, (2011) “Cross Language POS Taggers (and other Tools) for Indian
Languages: An Experiment with Kannada using Telugu Resources”. In Proceedings of IJCNLP
workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need
of Multilingual Societies.
[12] Kumar Dinesh and Josan Gurpreet Singh, (2010) “Part of Speech Tagger for Morphologically rich
Indian Languages: A survey”, International Journal of Computer Application, Vol 6(5).
AUTHORS
Jyoti Singh is pursuing her M.Tech in Computer Science from Banasthali University, Rajasthan and is
working as a Research Assistant in English-Indian Languages Machine Translation System Project
sponsored by TDIL Programme, DeitY. She has her interest in Machine Translation specifically in English-
Marathi Language Pair. She has developed various tools on Marathi Language Processing. Her current
research interest includes Natural Language Processing and Machine Translation.
Nisheeth Joshi is an Assistnat Professor at Banasthali University. He has been primarily working in design
and development of evaluation Matrices in Indian languages. Besides this he is also actively involved in the
development of MT engines for English to Indian Languages. He is one of the experts empanelled with
TDIL programme, Department of Electronics and Information Technology (DeitY), Govt. of India, a
premier organization which foresees Language Technology Funding and Research in India. He has several
publications in various journals and conferences and also serves on the Programme Committees and
Editorial Boards of several conferences and journals.
Iti Mathur is an Assistant Professor at Banasthali University. Her primary area of research is Computational
Semantics and Ontological Engineering. Besides this she is also involved in the development of MT
engines for English to Indian Languages. She is one of the experts empanelled with TDIL Programme,
Department of Electronics and Information Technology (DeitY), Govt. of India, a premier organization
which foresees Language Technology Funding and Research in India. She has several publications in
various journals and conferences and also serves on the Programme Committees and Editorial Boards of
several conferences and journals.

Mais conteúdo relacionado

Mais procurados

Morpheme Based Myanmar Word Segmenter
Morpheme Based Myanmar Word SegmenterMorpheme Based Myanmar Word Segmenter
Morpheme Based Myanmar Word Segmenterijtsrd
 
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...ijnlc
 
Marathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language ProcessingMarathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language Processingiosrjce
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological AnalysisKarthik Sankar
 
Kannada Phonemes to Speech Dictionary: Statistical Approach
Kannada Phonemes to Speech Dictionary: Statistical ApproachKannada Phonemes to Speech Dictionary: Statistical Approach
Kannada Phonemes to Speech Dictionary: Statistical ApproachIJERA Editor
 
Current state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyCurrent state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyIAEME Publication
 
Current state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyCurrent state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyiaemedu
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Tamil-English Document Translation Using Statistical Machine Translation Appr...
Tamil-English Document Translation Using Statistical Machine Translation Appr...Tamil-English Document Translation Using Statistical Machine Translation Appr...
Tamil-English Document Translation Using Statistical Machine Translation Appr...baskaran_md
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
 
A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems IJECEIAES
 
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKijnlc
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...IJECEIAES
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET Journal
 
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHHANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHijnlc
 

Mais procurados (17)

Morpheme Based Myanmar Word Segmenter
Morpheme Based Myanmar Word SegmenterMorpheme Based Myanmar Word Segmenter
Morpheme Based Myanmar Word Segmenter
 
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
 
Marathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language ProcessingMarathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language Processing
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological Analysis
 
Kannada Phonemes to Speech Dictionary: Statistical Approach
Kannada Phonemes to Speech Dictionary: Statistical ApproachKannada Phonemes to Speech Dictionary: Statistical Approach
Kannada Phonemes to Speech Dictionary: Statistical Approach
 
Current state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyCurrent state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a study
 
Current state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyCurrent state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a study
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Tamil-English Document Translation Using Statistical Machine Translation Appr...
Tamil-English Document Translation Using Statistical Machine Translation Appr...Tamil-English Document Translation Using Statistical Machine Translation Appr...
Tamil-English Document Translation Using Statistical Machine Translation Appr...
 
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
 
Pxc3898474
Pxc3898474Pxc3898474
Pxc3898474
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems
 
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
 
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHHANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
 

Semelhante a PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD

Current state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languagesCurrent state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languagesiaemedu
 
Current state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languagesCurrent state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languagesiaemedu
 
Current state of the art pos tagging for indian language
Current state of the art pos tagging for indian languageCurrent state of the art pos tagging for indian language
Current state of the art pos tagging for indian languageiaemedu
 
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
 
A New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in MalayalamA New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in Malayalamijcsit
 
Improving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati LanguageImproving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati Languageijistjournal
 
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisGrapheme-To-Phoneme Tools for the Marathi Speech Synthesis
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisIJERA Editor
 
Hybrid part of-speech tagger for non-vocalized arabic text
Hybrid part of-speech tagger for non-vocalized arabic textHybrid part of-speech tagger for non-vocalized arabic text
Hybrid part of-speech tagger for non-vocalized arabic textijnlc
 
Pos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil TextsPos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil Textsijcnes
 
Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...ijctcm
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldWaqas Tariq
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGEScsandit
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESLinda Garcia
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGcsandit
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGcscpconf
 
A Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis SystemA Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis Systemiosrjce
 

Semelhante a PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD (20)

Current state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languagesCurrent state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languages
 
Current state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languagesCurrent state of the art pos tagging for indian languages
Current state of the art pos tagging for indian languages
 
Current state of the art pos tagging for indian language
Current state of the art pos tagging for indian languageCurrent state of the art pos tagging for indian language
Current state of the art pos tagging for indian language
 
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
 
A New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in MalayalamA New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in Malayalam
 
Presentation1
Presentation1Presentation1
Presentation1
 
Cl35491494
Cl35491494Cl35491494
Cl35491494
 
Improving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati LanguageImproving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati Language
 
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisGrapheme-To-Phoneme Tools for the Marathi Speech Synthesis
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis
 
Hybrid part of-speech tagger for non-vocalized arabic text
Hybrid part of-speech tagger for non-vocalized arabic textHybrid part of-speech tagger for non-vocalized arabic text
Hybrid part of-speech tagger for non-vocalized arabic text
 
D3 dhanalakshmi
D3 dhanalakshmiD3 dhanalakshmi
D3 dhanalakshmi
 
Pos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil TextsPos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil Texts
 
Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
 
F017163443
F017163443F017163443
F017163443
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
 
A Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis SystemA Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis System
 

Último

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Último (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD

  • 1. International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013 DOI : 10.5121/ijait.2013.3203 35 PART OF SPEECH TAGGING OF MARATHI TEXT USING TRIGRAM METHOD Jyoti Singh1 , Nisheeth Joshi2 , Iti Mathur3 1,2,3 Apaji Institute, Banasthali University, Rajasthan, India, 1 jyoti.singh132@gmail.com, 2 nisheeth.joshi@rediffmail.com, 3 mathur_iti@rediffmail.com ABSTRACT In this paper we present a Marathipart of speech tagger. It is morphologically rich language. it is spoken by the native people of Maharashtra. The general approach used for development of tagger is statistical using Trigram Method. The main concept of Trigram is to explore the most likely POS for a token based on given information of previous two tags by calculating probabilities to determine whichthe best sequence of tag is. In this paper we show the development of the tagger. Moreover we have also shown the evaluation done. KEYWORDS: Part of Speech Tagging, Stochastic Tagging, N-gram Modeling,Marathi 1. INTRODUCTION Natural Language is the medium for communication which is incorporated by every human being. One of the most important activities in processing natural languages is Part of Speech tagging. In POS Tagging we assign a Part of Speech tag to each word in a sentence and literature. POS tagging is one of the simplest, most constant and statistical model for many NLP application. POS Tagging is an initial stage of linguistics, text analysis like information retrieval, machine translator, text to speech synthesis, information extraction etc. Part-of-speech tagging is a process of assigning the words in a text as corresponding to a particular part of speech. A fundamental version of POS tagging is the identification of words as nouns, verbs, adjectives etc. POS Tagging can be regarded as a simplified form of morphological analysis where it only deals with assigning an appropriate POS tag to the word, while morphological analysis deals with finding the internal structure of the word. Indian languages are morphologically rich and they have more than one morpheme of a word due to this tagging of Indian languages are difficult.A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word. A part of speech includes nouns, verbs, adverbs, adjectives, pronouns, conjunction and their sub-categories. Various approaches have been proposed to
  • 2. International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013 36 implement POS taggers. Broadly they can be classified as rule based Statistical and Hybrid approaches.The rule based POS tagging models apply a set of hand written rules and use contextual information to assign POS tags to words. The necessity of a linguistic background and manually constructing the rules are the main drawbacks of the rule based systems.A stochastic approach includes frequency and probability or statistics. The simplest stochastic approach finds out the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in the unannotated text. The problem with this approach is that it can come up with sequences of tags for sentences that are not acceptable according to the grammar rules of a language.The Hybrid approaches use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training. 2. PROBLEMS OF PART OF SPEECH TAGGING Ambiguous words are the main problem in part of speech tagging. There may be many words which can have more than one tag. Sometimes it happens that a word has same POS but have different meaning in different context. To solve this problem we consider the context instead of taking single word. For example- अंकुश/NNPने/PSP,/SYMया चा /PRPबोलया वर /VM अंकुश/RB लावला/VAUX./SYM The same word ‘अंकु श’ is given a different label in a same sentence. In the first case it is termed as a proper noun. In the second case it is termed as a common noun as it is referring to control. Since first word अंकु श occur in a sentence as subject and after that there is a postposition that’s why it is labelled as NNP and in second time अंकु श comes between main verb and auxiliary verb so it is assigned as adverb. POS Tagging tries to correctly identify a POS of a word by looking at the context (surrounding words) in a sentence. 3. PREVIOUS WORK ON INDIAN LANGUAGE POS TAGGING There are several approaches that have been used for POStagging and several research workhave been carried out in this area. Singh et. al. [2] proposed a Manipuri POS Tagger using CRF and SVM. In this paper, they described a tagger for using Conditional Random Field (CRF) and Support Vector Machine (SVM). There Evaluation results demonstrated the accuracies of 72.04%, and 74.38% in the CRF, and SVM, respectively. Hasanet. al. [3] Proposed Comparison of Unigram, Bigram, HMM and Brill’s POS Tagging Approaches for some South Asian Languages. They used stochastic methodology for some of south Asian languages like Bangla, Hindi and Telugu and compared corresponding result for all languages.They found that brill transformation based taggers results are better. Ekbal and Sivaji [4] proposed Web-based Bengali News Corpus for Lexicon Development and POS Tagging the POS taggers using Hidden Markov Model (HMM) and Support Vector Machine (SVM). The POS taggers are developed for Bengali shows the accuracies as 85.56%, and 91.23% for HMM, and SVM, respectively. Dhanalakshmi V,et. al. [5] presentedTamil POS Tagging using Linear Programming. In this paper they proposed the Part Of Speech tagger for Tamil using Machine learning techniques. They
  • 3. International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013 37 discovered that SVM based machine learning tool affords the most encouraging result for Tamil POS tagger (95.64%). Kumaret. al. [6] presentedBuilding Feature Rich POS Tagger for Morphologically Rich Languages: Experiences in Hindi. A statistical part-of-speech tagger for a morphologically rich language: Hindi.This Tagger employs the maximum entropy Markov model with a rich set of features capturing the lexical and morphological characteristics of the language. The system achieved the best accuracy of 94.89% and an average accuracy of 94.38%.Singhet. al. [7], in 2008 proposed Part-of-Speech Tagging for Grammar Checking of Punjabi. In this paper, they have discussed the issues concerning the development of a POS tagset and a POS tagger for the use as a part of the project on developing an automated grammar checking system for Punjabi Language.In 2009, Manju K.et. al. [8] proposed Development of a POS Tagger for Malayalam which was a Hidden Markov Model (HMM) based part of speech tagger for Malayalam language. The performance of the developed POS Tagger is about 90% and almost 80% of the sequences that have been generated automatically for the test case were found correct. Joshiet. al. [9], proposed Part of Speech Tagging for Hindi. They have used IL POS tag set for the development of the tagger. They disambiguated correct word-tag combinations using the contextual information available in the text.They have achieved the accuracy of 92%. Patel et. al. [10] proposed Part-Of-Speech Tagging for Gujarati Using Conditional Random Fields. In this paper they described a machine learning algorithm for Gujarati Part of Speech Tagging.This paper shows a machine learning algorithm for Gujarati Part of Speech Tagging. The machine learning part is performed using a CRF model.The algorithms have achieved an accuracy of 92% for Gujarati texts where the training corpus is of 10,000 words and the test corpus is of 5,000 words. Reddy and Sharoff[11] proposed Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources they have usedTnT (Brants, 2000), a popular implementation of the second-order Markov model for POS tagging. Kumar et. al[12] presented Part of Speech Tagger for Morphologically rich Indian Languages: A survey. In this paper they have reported about differentPOS taggers based on different languages and methods. In this paper they have shown the accuracy of corresponding tagger. 4. POS TAGSET Depending on some general principle of tagset design strategy, a number of POS tagsets have been developed by different organizations based. For POS annotation of texts in Marathi, we have used tagset developed by IIIT Hyderabad (Bharti, et. al., 2006) [1]. Table shows brief description of IL POS Tag set.
  • 4. International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013 38 S.No. Tag Description (Tag Used for) Example 1. NN Common Nouns मुलगा, साखर,मंडळी, सैय , चांगुलपणा 2. NST Noun Denoting Spatial and Temporal Expressions मागेपुढे,वर,खाली 3. NNP Proper Nouns (name of person) मोहन, राम, सुरेश 4. PRP Pronoun मी,आही ,तुही 5. DEM Demonstrative तो, ती, ते, हा, ही, हे 6. VM Verb Main (Finite or Non-Finite) बसणे,दसणे , िलिहणे 7. VAUX Verb Auxiliary (Any verb, present besides main verb shall be marked as auxiliary verb) नाही,नको,करणे ,हवे, नये 8. JJ Adjective (Modifier of Noun) उसा ही ,े ,बळवान 9. RB Adverb (Modifier of Verb) आता,काल,कधी,नेहमी 10. PSP Postposition आिण,वर,कडे 11. RP Particles भी, तो, ही 12. QF Quantifiers बत , थोडा, कम 13. QC Cardinals एक, दोन, तीन 14. CC Conjuncts (Coordinating and Subordinating) आिण,केहा ,तेहा ं , जर, तर 15. WQ Question Words कायकधीकुठेकोण 16. QO Ordinals पिहलादुसराितसरा 17. INTF Intensifier खूपफारपुकळ 18. INJ Interjection आहा,छान,अगो, हाय 19. NEG Negative नाही,नको 20. SYM Symbol ? , ; : ! 21. XC Compounds काळेमांजर-काळमांजर, तेलपाणी- तेलवणी 22. RDP Reduplications जवळ-जवळ 24. UNK Foreign Words English, ગુજરાતી, Table 1: POS tagset for Marathi
  • 5. International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013 39 5. OUR APPROACH In this paper we are describing Trigram Model for Marathi POS tagger. Our main aim is to perform POS Tagging to determine the most likely tag for a word, given the previous two tags. For trigrams, the probability of a sequence is just the product of conditional probabilities of its trigrams. So if t1, t2 …tn are tag sequence and w1, w2…wn are corresponding word sequence then the following equation explains this fact- P(t |w ) = P(w |t ). P(t |t ,t ) Where ti denotes the tag sequence and wi denotes the word sequences P (w |t ) is the probability of current word given current tag. (1) Here, P(t |t t )is the probability of a current tag given the previous two tags. This provides the transition between the tags and helps capture the context of the sentence. These probabilities are computed by following equation. P(ti/ti-2, ti-1) =f (ti-2, ti-1,ti)/f (ti-2, ti-1) (2) Each tag transition probability is computed by calculating the frequency count of two tags which come together in the corpus divided by the frequency count of the previous two tagscomingin the corpus. Figure 1. Working of trigram model
  • 6. International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013 40 6. EVALUATION For testing the performance of our system, we developed a test corpus of 2000 sentences (48,635 words). This provided us with accuracy of 91.63%. The accuracy was calculated using the formula: Accuracy(%) = . . *100 Test scores of our system are as follows: No. of Correct POS tags assigned bythe system = 44563 No. of POS taginthe text = 48635 Thus the accuracy of the system is 91.63%. (3) 7. CONCLUSION The topic of POS tagging discussed in this paper showsTrigram based POS tagger for Marathi. The POS tagger described here is very simple and efficient for automatic tagging, but the morphological complexity of the Marathi makes it a little hard. The performance of the current system is good and the results achieved by this method are excellent. In future we wish to improve the accuracy of our system by adding more tagged sentences in our training corpus. REFERENCES [1] Bharati, A., Sharma, D.M., Bai, L., Sangal, R., (2006) “AnnCorra: Annotating Corpora Guidelines for POS and Chunk Annotation for Indian Languages”. [2] Singh Thoudam Doren and Bandyopadhyay Sivaji, (2008) “Morphology Driven Manipuri POS Tagger”, Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 91– 98, Hyderabad, India. [3] Hasan Fahim Muhammad, Zaman Naushad Uz and Khan Mumit, (2007) “Comparison of Unigram, Bigram, HMM and Brill’s POS Tagging Approaches for some South Asian Languages”. In proceeding of Center for Research on Bangla Language Processing. [4] Ekbal Asif and Bandyopadhyay Shivaji, (2008) “Web-based Bengali News Corpus for Lexicon Development and POS Tagging”. In Proceeding of Language Resource and Evaluation. [5] Dhanalakshmi V, Anandkumar M, Rajendran S, Soman K P, (2009) “Tamil POS Tagging using Linear Programming” in proceeding of International Journal of Recent Trends in Engineering, Vol. 1, No. 2. [6] Dalal Aniket, Kumar Nagraj, Sawant Uma, Shelke Sandeep and Bhattacharyya Pushpak, (2007) “Building Feature Rich POS Tagger for Morphologically Rich Languages: Experiences in Hindi”. In Proceedings ofInternational Conference on Natural Language Processing (ICON). [7] Singh Mandeep, Lehal Gurpreet, and Sharma Shiv, (2009) “Part-of-Speech Tagging for Grammar Checking of Punjabi” in proceeding of The Linguistics Journal Volume 4 Issue. [8] Manju K, Soumya S, Idicul S.M., (2009) “A Development of A POS Tagger for Malayalam - An Experience.” In Proceedings of International Conference on Advances in Recent Technologies in Communication and Computing. [9] Joshi Nisheeth, Darbari Hemant, Mathur Iti, (2013) “HMM based POS Tagger for Hindi”. In Proceeding of 2013 International Conference on Artificial Intelligence and Soft Computing.
  • 7. International Journal of Advanced Information Technology (IJAIT) Vol. 3, No.2, April2013 41 [10] Patel Chirag, Gali Karthik, (2008) “Part-Of-Speech Tagging for Gujarati Using Conditional Random Fields”. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pp 117- 122. [11] Reddy Siva, Sharoff Serge, (2011) “Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources”. In Proceedings of IJCNLP workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies. [12] Kumar Dinesh and Josan Gurpreet Singh, (2010) “Part of Speech Tagger for Morphologically rich Indian Languages: A survey”, International Journal of Computer Application, Vol 6(5). AUTHORS Jyoti Singh is pursuing her M.Tech in Computer Science from Banasthali University, Rajasthan and is working as a Research Assistant in English-Indian Languages Machine Translation System Project sponsored by TDIL Programme, DeitY. She has her interest in Machine Translation specifically in English- Marathi Language Pair. She has developed various tools on Marathi Language Processing. Her current research interest includes Natural Language Processing and Machine Translation. Nisheeth Joshi is an Assistnat Professor at Banasthali University. He has been primarily working in design and development of evaluation Matrices in Indian languages. Besides this he is also actively involved in the development of MT engines for English to Indian Languages. He is one of the experts empanelled with TDIL programme, Department of Electronics and Information Technology (DeitY), Govt. of India, a premier organization which foresees Language Technology Funding and Research in India. He has several publications in various journals and conferences and also serves on the Programme Committees and Editorial Boards of several conferences and journals. Iti Mathur is an Assistant Professor at Banasthali University. Her primary area of research is Computational Semantics and Ontological Engineering. Besides this she is also involved in the development of MT engines for English to Indian Languages. She is one of the experts empanelled with TDIL Programme, Department of Electronics and Information Technology (DeitY), Govt. of India, a premier organization which foresees Language Technology Funding and Research in India. She has several publications in various journals and conferences and also serves on the Programme Committees and Editorial Boards of several conferences and journals.