HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
1. International Journal in Foundations of Computer Science & Technology (IJFCST), Vol. 3, No.4, July 2013
DOI:10.5121/ijfcst.2013.3408 67
Deepti Chopra, Sudha Morwal and Dr. G.N. Purohit
Department of Computer Engineering, Banasthali Vidyapith, (Raj.), INDIA
deeptichopra11@yahoo.co.in
sudha_morwal@yahoo.co.in
gn_purohitjaipur@yahoo.co.in
ABSTRACT
Named Entity Recognition (NER) is the task of recognizing Named Entities, or Proper Nouns, in a
document and classifying them into Named Entity classes. In this paper we introduce our modified
tool, which performs NER in any natural language and assists in Corpus Development, i.e. in
developing Training and Testing documents. It also addresses the unknown-words problem in NER,
handles spurious words, and automatically computes the Performance Metrics of an NER based
system, i.e. Recall, Precision and F-Measure.
KEYWORDS
NER, Transliteration, Unknown words, Performance Metrics
1. INTRODUCTION
Named Entity Recognition (NER) is one of the application areas of Natural Language
Processing, in which Named Entities are identified and then categorised into different
classes. These classes can include the names of a person, location, organization, state, sport,
river, city or country, as well as percentage, time, quantity etc. Applications of NER include
Information Extraction, Machine Translation, Question Answering Systems, Information
Retrieval, Automatic Summarization etc.
For example, consider the following training sentences:
Ram/PER is/OTHER a/OTHER intelligent/OTHER boy/OTHER
Deepa/PER lives/OTHER in/OTHER Nagpur/CITY
Ankit/PER is/OTHER a/OTHER football/SPORT player/OTHER
Aabhas/PER plays/OTHER cricket/SPORT
In the tagged English training text above, the ‘PER’ tag denotes that ‘Ram’, ‘Deepa’, ‘Ankit’
and ‘Aabhas’ are names of persons. ‘Nagpur’ is tagged ‘CITY’ since it is the name of a
city. Similarly, ‘football’ and ‘cricket’ are names of sports, so they are tagged ‘SPORT’.
Entities tagged ‘OTHER’ are not Named Entities. The tagged sentences are input to the
HMM Train module, which computes the HMM Parameters, i.e. Start Probability,
Transition Probability and Emission Probability. The HMM Parameters and the testing sentences
are then input to the HMM Test module, which uses the Viterbi algorithm to derive the Named Entities.
If the testing sentence is given as:
Aabhas lives in Nagpur
The output of the NER based system for this testing sentence is the list of Named Entities along
with their tags, i.e. Aabhas/PER and Nagpur/CITY.
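The word/TAG notation used above can be read mechanically. The following is a minimal sketch, not the tool's own code: the function names `parse_tagged_sentence` and `named_entities` are hypothetical, and it assumes tokens are separated by whitespace and that ‘OTHER’ is the only non-entity tag.

```python
def parse_tagged_sentence(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last '/', so a word that itself
        # contains '/' keeps everything before the final slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def named_entities(pairs, non_entity_tag="OTHER"):
    """Keep only the tokens whose tag marks them as Named Entities."""
    return [(word, tag) for word, tag in pairs if tag != non_entity_tag]


tagged = parse_tagged_sentence("Aabhas/PER lives/OTHER in/OTHER Nagpur/CITY")
print(named_entities(tagged))  # [('Aabhas', 'PER'), ('Nagpur', 'CITY')]
```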
We have developed NERHMM, a language independent NER tool based on the Hidden
Markov Model technique [1][2]. In this paper, we discuss our modified tool.
2. PERFORMANCE METRICS OF NER BASED SYSTEM
Performance Metrics are a means of measuring the performance of an NER based system.
They can be expressed in terms of three parameters: Precision, Recall and F-Measure.
The result of an NER based system is referred to as the “response” and the human
interpretation as the “answer key” [9]. Consider the following terms:
1. Correct: the response is the same as the answer key.
2. Incorrect: the response is not the same as the answer key.
3. Missing: the answer key is tagged but the response is not tagged.
4. Spurious: the response is tagged but the answer key is not tagged. [6]
Hence, we define Precision, Recall and F-Measure as follows [5][7][8]:
Precision (P): Correct / (Correct + Incorrect + Spurious)
Recall (R): Correct / (Correct + Incorrect + Missing)
F-Measure: (2 * P * R) / (P + R)
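As an illustration of how these counts combine, here is a small sketch under stated assumptions. It uses the conventional assignment in which Spurious responses count against Precision and Missing ones against Recall. The function name `ner_metrics` is hypothetical, "tagged" is taken to mean "carries a tag other than ‘OTHER’", and tokens that are ‘OTHER’ in both the response and the answer key are not counted at all.

```python
def ner_metrics(response, answer_key, non_entity_tag="OTHER"):
    """Return (precision, recall, f_measure) for two aligned tag sequences."""
    correct = incorrect = missing = spurious = 0
    for resp, key in zip(response, answer_key):
        resp_tagged = resp != non_entity_tag
        key_tagged = key != non_entity_tag
        if resp_tagged and key_tagged:
            if resp == key:
                correct += 1      # same entity tag in both
            else:
                incorrect += 1    # both tagged, but with different tags
        elif key_tagged:
            missing += 1          # answer key tagged, response not tagged
        elif resp_tagged:
            spurious += 1         # response tagged, answer key not tagged

    def ratio(numerator, denominator):
        # Guard against empty denominators (assumption: report 0.0).
        return numerator / denominator if denominator else 0.0

    precision = ratio(correct, correct + incorrect + spurious)
    recall = ratio(correct, correct + incorrect + missing)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return precision, recall, f_measure


answer_key = ["PER", "OTHER", "OTHER", "CITY"]
response = ["PER", "OTHER", "SPORT", "OTHER"]
print(ner_metrics(response, answer_key))  # (0.5, 0.5, 0.5)
```

In this example one entity is correct, one is spurious and one is missing, so both Precision and Recall come to 1/2.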
3. HIDDEN MARKOV MODEL
A Hidden Markov Model (HMM) is a machine learning based approach that was initially used
for Speech Recognition and is now also used for Named Entity Recognition in natural
languages. An HMM can be represented by three parameters, λ = (A, B, П): Start Probability (П),
Transition Probability (A = aij) and Emission Probability (B = {bj(O)}) [1][3].
Start Probability (П) is the probability that a given tag occurs first in a sentence.
Transition Probability (A = aij) is the probability that the next tag in a sentence is j,
given that the current tag is i.
Emission Probability (B = {bj(O)}) is the probability of observing the output O given the
state j.
HMM involves two steps: HMM Training and HMM Testing. The input to HMM Train is an
annotated text, and its output is the three parameters, i.e. Start Probability (П),
Transition Probability (A = aij) and Emission Probability (B = {bj(O)}). The input to
HMM Test is a testing sentence together with the three parameters obtained in the training
phase. The output of HMM Test is the sequence of states, from which the Named Entities can
be detected.
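The training step can be sketched as plain relative-frequency counting over the annotated text. This is only one plausible way to estimate the three parameters and is not taken from the tool itself; the function name `train_hmm` and the nested-dictionary layout are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimate start, transition and emission probabilities by counting."""
    start_counts = Counter()
    transition_counts = defaultdict(Counter)  # previous tag -> next tag counts
    emission_counts = defaultdict(Counter)    # tag -> word counts

    for sentence in tagged_sentences:
        previous_tag = None
        for position, (word, tag) in enumerate(sentence):
            if position == 0:
                start_counts[tag] += 1
            else:
                transition_counts[previous_tag][tag] += 1
            emission_counts[tag][word] += 1
            previous_tag = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    start = normalize(start_counts)
    transition = {tag: normalize(c) for tag, c in transition_counts.items()}
    emission = {tag: normalize(c) for tag, c in emission_counts.items()}
    return start, transition, emission


lines = [
    "Ram/PER is/OTHER a/OTHER intelligent/OTHER boy/OTHER",
    "Deepa/PER lives/OTHER in/OTHER Nagpur/CITY",
    "Ankit/PER is/OTHER a/OTHER football/SPORT player/OTHER",
    "Aabhas/PER plays/OTHER cricket/SPORT",
]
# Turn each 'word/TAG' token into a (word, tag) pair.
training = [[tuple(tok.rsplit("/", 1)) for tok in line.split()] for line in lines]

start, transition, emission = train_hmm(training)
print(start)                # {'PER': 1.0}
print(transition["OTHER"])  # {'OTHER': 0.625, 'CITY': 0.125, 'SPORT': 0.25}
```

On the four training sentences from the Introduction, every sentence starts with a ‘PER’ tag, so the start probability of ‘PER’ is 1.0, and two of the eight transitions out of ‘OTHER’ lead to ‘SPORT’.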
4. OUR HMM BASED NER TOOL
We have performed NER in eight languages, namely English, Hindi, Bengali, Telugu, Punjabi,
Urdu, Marathi and French. Our tool can perform the Annotation task, and any existing tag
can be modified if required. The Annotation module is shown in Fig. 1.
Figure 1: Annotation in NER Tool
Figure 2: HMM Train and HMM Parameter estimation
Similarly, a Testing document can also be developed using our tool. The tool can therefore
perform Corpus Development for both training and testing.
Once the annotated corpus is available, we click on the ‘TRAIN HMM’ button and choose the
file to be trained by clicking on the Browse button. The HMM parameters (Start Probability,
Transition Probability and Emission Probability) are calculated and can be viewed by clicking
on the View Parameters button. This is shown in Fig. 2.
Figure 3: HMM Testing and its Output
When we click on the ‘TEST HMM’ button, we can either click on the Browse button to select
an existing file for testing, or build a testing file by clicking on the button named
‘Develop a new testing Corpus’.
Once a testing file has been selected, the Viterbi algorithm is run; it accepts all the HMM
parameters computed by the tool and displays the optimal state sequence, as shown in Fig. 3.
If an unknown word appears in the testing file, the transliteration module is run to handle
it.
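The decoding step can be sketched as a standard Viterbi search in log space. This is an illustrative sketch rather than the tool's implementation. The parameter values below are the relative frequencies obtained from the four training sentences in the Introduction, the function name `viterbi` is hypothetical, and the small `floor` probability given to unseen events is an assumption made here so that logarithms stay defined; it does not stand in for the transliteration module.

```python
import math


def viterbi(words, tags, start, transition, emission, floor=1e-6):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []

    def logp(table, key):
        # Unseen or zero-probability events fall back to `floor`.
        return math.log(table.get(key, 0.0) or floor)

    # best[t][tag] = (log-score of best path ending in `tag` at t, previous tag)
    best = [{
        tag: (logp(start, tag) + logp(emission.get(tag, {}), words[0]), None)
        for tag in tags
    }]
    for word in words[1:]:
        column = {}
        for tag in tags:
            score, prev = max(
                (best[-1][p][0] + logp(transition.get(p, {}), tag), p)
                for p in tags
            )
            column[tag] = (score + logp(emission.get(tag, {}), word), prev)
        best.append(column)

    # Backtrack from the best final tag to recover the whole path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path


tags = ["PER", "OTHER", "CITY", "SPORT"]
start = {"PER": 1.0}
transition = {
    "PER": {"OTHER": 1.0},
    "OTHER": {"OTHER": 0.625, "CITY": 0.125, "SPORT": 0.25},
    "SPORT": {"OTHER": 1.0},
}
emission = {
    "PER": {"Ram": 0.25, "Deepa": 0.25, "Ankit": 0.25, "Aabhas": 0.25},
    "OTHER": {"is": 0.2, "a": 0.2, "intelligent": 0.1, "boy": 0.1,
              "lives": 0.1, "in": 0.1, "player": 0.1, "plays": 0.1},
    "CITY": {"Nagpur": 1.0},
    "SPORT": {"football": 0.5, "cricket": 0.5},
}

print(viterbi("Aabhas lives in Nagpur".split(), tags, start, transition, emission))
# ['PER', 'OTHER', 'OTHER', 'CITY']
```

The returned state sequence marks ‘Aabhas’ as PER and ‘Nagpur’ as CITY, matching the expected output given in the Introduction.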
Our system can perform training and testing in any language when dealing with known words.
For unknown words, it can handle only those that appear in one of the following languages:
Hindi, Punjabi, Marathi, Bengali, Telugu, Urdu, English and French. Clicking on the
‘SAVE OUTPUT’ button saves the output of the NER based system in a file. Clicking on the
‘NER EVALUATION’ button automatically calculates the Performance Metrics of the NER based
system and displays them in a new window, as shown in Fig. 4. Our system is capable of
handling Spurious words, i.e. words that are found untagged in the training file. Such words
are tagged as ‘OTHER’, or Not-a-Named Entity, by our system. We have tried to solve the
problem of unknown words using the Transliteration approach.
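The handling of untagged training words can be illustrated with a short sketch. It assumes the word/TAG file format shown in the Introduction; the function name `tag_untagged_words` is hypothetical and the tool's actual behaviour may differ in detail.

```python
def tag_untagged_words(line, default_tag="OTHER"):
    """Give every token without a tag the Not-a-Named Entity tag."""
    fixed = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No '/' in the token at all: the whole token is the word.
            word, tag = token, default_tag
        elif not tag:
            # Trailing '/' with nothing after it: fill in the default tag.
            tag = default_tag
        fixed.append(f"{word}/{tag}")
    return " ".join(fixed)


print(tag_untagged_words("Deepa/PER lives in/OTHER Nagpur/CITY"))
# Deepa/PER lives/OTHER in/OTHER Nagpur/CITY
```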
Figure 4: NER Evaluation
5. FEATURES OF OUR TOOL
Some of the unique features of our tool are the following:
1. It performs Corpus Development, i.e. it assists in developing Training as well as Testing
documents.
2. It is a language independent tool that can perform NER in any language. Unknown word
handling has been implemented for eight languages, i.e. English, French, Hindi, Urdu,
Punjabi, Telugu, Bengali and Marathi, using the Transliteration approach.
3. Spurious words, i.e. words that are found untagged in the Training Corpus, are handled.
4. Words that are found in the testing file but are absent from the training file are given
the Not-a-Named Entity tag and are fed back into the training file, so that they are known
words the next time testing is done.
5. NER Evaluation, i.e. the Performance Metrics (Recall, Precision and F-Measure), is
computed automatically by our tool.
6. Our tool can perform NER on documents of any domain with high accuracy. Documents
may include dynamic tag sets.
7. Our tool can perform NER on multilingual documents also.
8. Our tool is user friendly, since it assists in Corpus development, automatically
computes the HMM Parameters and also performs NER Evaluation.
9. It is highly accurate. The result of NER Evaluation or Performance Metrics is close to
that of human interpretation.
6. CONCLUSION
We have performed Named Entity Recognition using Hidden Markov Model in Natural
languages such as Hindi, Marathi, Punjabi, Telugu, Urdu, Bengali, English and French.
Existing tools for Named Entity Recognition are highly language dependent and domain
specific. A need was therefore felt for a tool that is language independent and can work in
any domain. We developed such a tool, which performs NER in natural languages and in any
domain using the Hidden Markov Model.
We have also tried to solve the problem of Unknown words in Named Entity Recognition
using the Transliteration approach.
Our system is also capable of performing NER on multilingual data. If the Named Entities in
the training file are in one language and the same Named Entities appear in the testing file
in another language, they can still be identified using the Transliteration approach.
ACKNOWLEDGEMENT
We would like to thank all those who helped us in accomplishing this task.
REFERENCES
[1] Sudha Morwal and Deepti Chopra, “NERHMM: A Tool For Named Entity Recognition based on
Hidden Markov Model”, International Journal on Natural Language Computing (IJNLC), Vol.2, No.2,
April 2013, DOI:10.5121/ijnlc.2013.2204, Pg 43-49. Available at:
http://airccse.org/journal/ijnlc/papers/2213ijnlc04.pdf
[2] Sudha Morwal and Deepti Chopra, “Identification and Classification of Named Entities in Indian
Languages”, International Journal on Natural Language Computing (IJNLC), Vol.2, No.1, February
2013, DOI:10.5121/ijnlc.2013.210, Pg 37-43. Available at:
http://airccse.org/journal/ijnlc/papers/1412ijnlc02.pdf
[3] Sudha Morwal, Nusrat Jahan and Deepti Chopra, “Named Entity Recognition using Hidden Markov
Model (HMM)”, International Journal on Natural Language Computing (IJNLC), Vol.1, No.4,
December 2012, DOI:10.5121/ijnlc.2012.1402, Pg 15-23. Available at:
http://airccse.org/journal/ijnlc/papers/1412ijnlc02.pdf
[4] Deepti Chopra, Nusrat Jahan and Sudha Morwal, “Hindi Named Entity Recognition By Using Rule
Based Heuristics And Hidden Markov Model”, International Journal of Information Sciences and
Techniques (IJIST), Vol.2, No.6, November 2012, DOI:10.5121/ijist.2012.2604. Available at:
http://airccse.org/journal/IS/papers/2612ijist04.pdf
[5] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju and K.S.M.V. Kumar, “Named Entity
Recognition for Telugu Using Maximum Entropy Model”.
[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu and Dr. A. Govardhan, “A Survey on Named Entity
Recognition in Indian Languages with particular reference to Telugu”, IJCSI International Journal of
Computer Science Issues, Vol. 8, Issue 2, March 2011.
[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay,
“Language Independent Named Entity Recognition in Indian Languages”, in Proceedings of the
IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33–40, Hyderabad,
India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[8] Darvinder Kaur and Vishal Gupta, “A survey of Named Entity Recognition in English and other Indian
Languages”, IJCSI International Journal of Computer Science Issues, Vol.7, Issue 6, November 2010.
[9] Shilpi Srivastava, Mukund Sanglikar and D.C. Kothari, “Named Entity Recognition System for Hindi
Language: A Hybrid Approach”, International Journal of Computational Linguistics (IJCL), Volume
(2): Issue (1): 2011. Available at:
http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
Authors
Deepti Chopra is working as an Assistant Professor in the Department of Computer
Science at Banasthali University (Rajasthan), India. She received her B.Tech degree
in Computer Science and Engineering from Rajasthan College of Engineering for
Women, Jaipur, Rajasthan in 2011. She completed her M.Tech in Computer Science and
Engineering at Banasthali University, Rajasthan in 2013. Her research interests
include Artificial Intelligence, Natural Language Processing, and Information
Retrieval. She has published many papers in International journals and conferences.
Sudha Morwal is an active researcher in the field of Natural Language Processing.
She is currently working as an Associate Professor in the Department of Computer Science
at Banasthali University (Rajasthan), India. She holds M.Tech (Computer Science),
NET and M.Sc (Computer Science) qualifications, and her PhD is in progress at Banasthali
University (Rajasthan), India. She has published many papers in International
Conferences and Journals.
Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at
Banasthali University (Rajasthan). Before joining Banasthali University, he was
Professor and Head of the Department of Mathematics, University of Rajasthan,
Jaipur. He has served as Chief Editor of a research journal and as a regular reviewer for
many journals. His present interests are O.R., Discrete Mathematics and Communication
networks. He has published around 40 research papers in various journals.