International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 5, September-October (2013), pp. 224-231
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
IDENTIFICATION OF DEVANAGARI SCRIPT FROM IMAGE DOCUMENT
SATISH R. DAMADE¹, K. P. ADHIYA² and RANJANA S. ZINJORE³
¹ Computer Engineering, KBS's College of Engineering & Technology, North Maharashtra Knowledge City, Jalgaon.
² Computer Engineering, SSBT's College of Engineering & Technology, Bambhori, Jalgaon.
³ Computer Application, KCES's Institute of Management and Research, Jalgaon.
ABSTRACT
Text that appears in an image contains useful and important information. Conventional Optical Character Recognition technology is restricted to finding text printed against clean backgrounds, and cannot handle text printed against shaded or textured backgrounds or embedded in images. Extracting text from images is helpful to society, for example for blind and visually impaired people when a voice synthesizer is attached to the system. In this paper, we present a methodology for extracting text from printed image documents and then identifying the Devanagari script (Hindi language) in the extracted text. First, we use a morphological approach to extract the text from image documents. The resultant text image is passed to Optical Character Recognition for identification. Projection profiles are used for segmentation, followed by a visual discriminating approach for feature extraction. Finally, heuristic search is used for classification. The result of the proposed method for text extraction is compared with an edge-based approach and a connected-component-with-projection-profile approach. Comparison using precision and recall rates shows that the proposed algorithm works well.
Keywords: Area, Bounding Box, Canny edge detector, Heuristic Search, Projection Profile,
Visual Discriminating feature.
I. INTRODUCTION
In recent years, the escalating use of physical documents has driven progress towards the creation of electronic documents to facilitate easy communication and storage. Nowadays, information is increasingly enriched by multimedia components containing images and video in addition to textual information. The extraction of text in an image is a classical
problem in the computer vision research area. Text extraction from images and video finds many applications in document processing, vehicle license plate detection, mobile robot navigation, object identification, text in WWW images, content-based image retrieval from image databases and video content analysis [1]. There are basically three kinds of images: document images, scene text images and caption text images. For extracting text from these images, two approaches are mainly used: the region-based approach and the texture-based approach [2]. After extracting the text from an image document, script identification plays a vital role in the design of Optical Character Recognition. Script identification is a key step in document image analysis, especially when the environment is multi-script and language identification is required to identify the different languages that share the same script. In India, script identification facilitates many important applications such as sorting images, selecting an appropriate script-specific text understanding system and searching online archives of document images containing a particular script [3]. In this paper we use the Hindi language for identification because Hindi is the third most spoken language in the world after Chinese and English, and approximately 500 million people all over the world speak and write Hindi. Many forms and applications are also available in a combination of the state official language and English. In this paper we use printed images consisting of Hindi and English text and identify the Hindi language in such image documents. Hindi is written in the Devanagari script, which consists of 12 vowels and 34 consonants, apart from a horizontal line at the upper part of the characters called the shirorekha. The English alphabet is a Latin-based alphabet consisting of 26 letters, each with upper- and lower-case forms. The structure of the English alphabet contains more vertical and slant strokes.
II. CHALLENGES AND RELATED WORK
Text extraction from complex images is one of the most useful and difficult applications of pattern recognition and computer vision. Identifying the script of extracted image text is also a very difficult task due to the similar shapes of characters across scripts. Leon et al. [4] presented a technique for detecting caption text for indexing purposes: caption text objects are detected by combining texture and geometric features, and textured areas are detected using wavelet analysis. Zhong et al. [5] located text in complex images such as compact discs, book covers or traffic scenes; to find text locations, the authors used the higher spatial variance of image intensity along horizontal text lines. Wu et al. [6] proposed a four-step system which automatically detects and extracts text in images, including texture segmentation in which the image is filtered using a bank of linear filters, followed by stroke extraction, drawing a rectangular box around the text and finally detecting the text. Samarabandu and Liu [7] used an edge-based approach for extracting text, based on generating a feature map using three important properties of edges: edge strength, density and variance of orientation. Neha Gupta [8] proposed a method of image segmentation for text extraction based on the 2D Discrete Wavelet Transform, which decomposes the image into four sub-components; the edges of three sub-bands are then fused to create a candidate text region, followed by a projection profile approach, and text is extracted based on a threshold. C. V. Jawahar et al. [9] proposed a technique to distinguish between Hindi and Telugu script: for Hindi, segmentation involves the removal of the shirorekha, while for Telugu, component extraction implies the separation of connected components. S. Basavaraj Patil [10] presents an approach for identification of Hindi, English and Kannada scripts: for feature extraction, the input image is dilated using 3x3 masks in the horizontal, vertical, left-diagonal and right-diagonal directions, followed by the average pixel distribution of the resultant image, and a neural network is used for classification. Pal and Chaudhuri [11] proposed an automatic technique for separating the text lines of 12 Indian scripts (English, Devanagari, Bangla, Gujarati, Tamil, Kashmiri, Malayalam, Oriya, Punjabi, Telugu and Urdu) using ten triplets formed by grouping English and Devanagari with
any one of the other scripts. Santanu Choudhury et al. [12] proposed a method for identification of Indian languages by combining a Gabor filter based technique and a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu.
III. PROPOSED ARCHITECTURE AND METHODOLOGY
3.1 Proposed Architecture: The proposed architecture for identification of Hindi script from an image document is shown in Fig. 1.
3.2 Methodology
3.2.1) Preprocessing:
i) In this step we convert the image from the RGB color space (Fig. 2) into a gray-scale image. The gray-scale image is then converted into a binary image using Otsu's thresholding.
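The paper performs this step with MATLAB built-ins; the thresholding idea can be sketched in pure Python as follows. The luminance weights and the toy pixel values are illustrative assumptions, not the authors' data.

```python
# Illustrative sketch of preprocessing: grayscale conversion and Otsu's
# thresholding (the paper uses MATLAB's rgb2gray and graythresh).

def rgb_to_gray(pixel):
    """Standard luminance approximation for one (R, G, B) pixel in 0-255."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def otsu_threshold(gray_values):
    """Return the threshold that maximises between-class variance."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * hist[i] for i in range(256))
    best_t, best_var, w_bg, sum_bg = 0, -1.0, 0, 0
    for t in range(256):
        w_bg += hist[t]                 # background pixel count at threshold t
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # foreground pixel count
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# A toy bimodal image: dark text pixels (~20) on a light background (~220).
gray = [20] * 40 + [220] * 60
t = otsu_threshold(gray)
binary = [1 if v > t else 0 for v in gray]   # 1 = background, 0 = text
print(t, sum(binary))
```

On a genuinely bimodal histogram like this one, the threshold lands on the dark mode and cleanly separates text from background.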
[Architecture diagram: the input color image with a complex background is preprocessed and the quality of the text is improved; text regions are detected; the text region is displayed and non-text regions are removed; the resultant image is subtracted from the input image to obtain the final image containing text; the extracted text is then passed for script identification through segmentation into lines and words, feature extraction and heuristic search, producing the identified Hindi script.]
Figure 1: Proposed Architecture for Identification of Hindi script from Image Document
ii) Edge detection and morphological dilation: The Canny method is used for edge extraction. The edge image is dilated using a square structuring element. The dilation of A by B is defined as:

A ⊕ B = { z | (B̂)z ∩ A ≠ ∅ }  ----------------- (1.1)
iii) Hole filling: Hole filling is determined by the selection of marker and mask images, computed by the iterative reconstruction:

Xk = (Xk-1 ⊕ B) ∩ Ac,  k = 1, 2, 3, …  ------------------- (1.2)

where X0 is the marker image and the complement of A acts as the mask.
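The dilation step can be sketched in pure Python. A 3x3 square element is assumed here for illustration (the paper does not state the element's size), and for a symmetric square element the reflection in equation (1.1) is a no-op.

```python
# Minimal sketch of binary dilation by a square structuring element:
# an output pixel is set wherever the element, centred on that pixel,
# overlaps at least one foreground pixel of the input edge map.

def dilate(image, size=3):
    """Dilate a binary image (list of lists of 0/1) by a size x size square."""
    h, w = len(image), len(image[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Set (y, x) if any neighbour within radius r is foreground.
            out[y][x] = int(any(
                image[ny][nx]
                for ny in range(max(0, y - r), min(h, y + r + 1))
                for nx in range(max(0, x - r), min(w, x + r + 1))))
    return out

# A single edge pixel grows into a 3x3 blob; in practice this bridges
# gaps between nearby edge fragments so a word becomes one region.
img = [[0] * 5 for _ in range(5)]
img[2][2] = 1
d = dilate(img)
print(sum(map(sum, d)))   # 9 pixels set
```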
3.2.2) Detection of Text Region
i) The dilated image is labeled using bwlabel in MATLAB with 8-way connectivity. To obtain measurements ('BoundingBox' and 'Area') of each image region we use the regionprops function in MATLAB (Fig. 3).
ii) Further, to extract text regions, we compute a new value by multiplying the height and width of a bounding box and dividing the result by the region's area. By experimentation it is found that if this ratio (new-value/area) is less than 1.78 and the height is greater than 9, the region so obtained is a text region (the specific condition).
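The labeling and filtering steps above (bwlabel/regionprops in MATLAB) can be sketched in pure Python; the breadth-first labeling and the toy images are illustrative, while the 1.78 ratio and height > 9 thresholds come from the paper.

```python
# Sketch of the text-region filter of 3.2.2: label 8-connected components,
# then keep a region when (bbox_height * bbox_width) / area < 1.78 and
# bbox_height > 9 (the paper's experimentally fixed condition).
from collections import deque

def components(image):
    """Yield lists of (y, x) pixels, one per 8-connected component."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if image[sy][sx] and not seen[sy][sx]:
                comp, queue = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for ny in range(max(0, y - 1), min(h, y + 2)):
                        for nx in range(max(0, x - 1), min(w, x + 2)):
                            if image[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                yield comp

def is_text_region(comp):
    """Apply the paper's bounding-box/area condition to one component."""
    ys = [y for y, _ in comp]
    xs = [x for _, x in comp]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return (height * width) / len(comp) < 1.78 and height > 9

# A solid 12x20 block fills its bounding box (ratio 1.0, height 12): text.
img = [[1] * 20 for _ in range(12)]
comps = list(components(img))
# A 2-row strip has the right ratio but fails the height test: non-text.
thin = [[1] * 20 for _ in range(2)]
print(len(comps), is_text_region(comps[0]),
      is_text_region(list(components(thin))[0]))
```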
Figure 2: Input color Image
Figure 3: Bounding box
3.2.3) Displaying the Text Region and Removing Non-Text Regions
i) We find the connected components (CC) of the binary image using the bwconncomp function in MATLAB.
ii) We obtain the size of the dilated image and set all its values to false to make a blank background.
iii) For each connected component that satisfies the specific condition above, we set the component's pixel values to true using PixelIdxList.
iv) The visualization of the resultant image is very poor; to improve it, we subtract the resultant image from the input image (Fig. 4).
Figure 4: Final Result
Figure 5: Segmentation of text into lines
3.2.4) Segmentation of Text into lines and words
i) The document image is segmented into text lines using horizontal projection profiles computed by a row-wise sum of black pixels. We then find the valleys between the minimum and maximum points of the histogram and draw a line (cut point) from the minimum point across the width of the document, as shown in Fig. 5.
ii) After line segmentation we use a vertical projection profile, with a threshold value that maintains the inter-character gap, for bilingual (Devanagari-English) word segmentation. The words obtained are thinned for feature extraction, and a bounding box is fitted to each word by obtaining the first pixel from the left, right, top and bottom. The extracted words are inverted for feature extraction, as shown in Fig. 6.
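The two projection-profile passes can be sketched as follows. Black (text) pixels are represented by 1 here for readability, and the gap-threshold mechanics are an illustrative reading of the description above, not the authors' code.

```python
# Sketch of 3.2.4: a row-wise sum of black pixels locates blank valleys
# between text lines; a column-wise sum with a gap threshold splits a
# line into words while keeping small inter-character gaps inside a word.

def runs(profile, gap=1):
    """(start, end) pairs of segments separated by >= gap consecutive zeros."""
    segs, start, zeros = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            end = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= gap:
                segs.append((start, end))
                start = None
    if start is not None:
        segs.append((start, end))
    return segs

def segment_lines(image):
    """Horizontal projection profile: row-wise sum of text pixels."""
    return runs([sum(row) for row in image])

def segment_words(line_image, gap_threshold):
    """Vertical projection profile with an inter-character gap threshold."""
    cols = [sum(line_image[y][x] for y in range(len(line_image)))
            for x in range(len(line_image[0]))]
    return runs(cols, gap_threshold)

# Two 2-row text lines separated by one blank row.
image = [[1] * 8, [1] * 8, [0] * 8, [1] * 8, [1] * 8]
print(segment_lines(image))           # [(0, 1), (3, 4)]

# One line with a 3-column gap: a small threshold splits it into two
# words, a larger one treats the gap as inter-character space.
row = [[1, 1, 0, 0, 0, 1, 1]]
print(segment_words(row, 2), segment_words(row, 4))
```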
Figure 6: Word segmented Image
3.2.5) Script Identification
The distinct features used for script Identification are:
i) Feature 1: Top_profile and Bottom_profile: The top_profile (bottom_profile) of a text line represents the set of black pixels obtained by scanning each column of the text line from the top (bottom) until the first black pixel is reached. Thus, a component of width N gets N such pixels.
ii) Feature 2: Top-max-row: Represents the row number of the top_profile at which the maximum number of black pixels lies (black pixels with value 0 correspond to the object and white pixels with value 1 correspond to the background).
iii) Feature 3: Bottom-max-row: Represents the row number of the bottom_profile at which the maximum number of black pixels lies.
iv) Feature 4: Top-horizontal-line: (i) Obtain the top-max-row from the top_profile. (ii) Find the components whose number of black pixels is greater than threshold1 (threshold1 = half of the height of the bounding box) and store the number of such components in the attribute horizontal-lines. (iii) Compute the feature top-horizontal-line using equation (1.3) below:

Top-horizontal-line = (hlines × 100) / tc  ----------------- (1.3)

where hlines represents the number of horizontal lines and tc represents the total number of components of the top-max-row.
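The profile features can be sketched in pure Python, following the paper's convention that black (object) pixels are 0. The toy word below is an illustrative stand-in for a segmented Devanagari word, with a solid headline row.

```python
# Sketch of features 1-3 from 3.2.5: per-column first-black-pixel
# profiles and the row where most profile pixels concentrate.

def top_profile(word):
    """Row index of the first black (0) pixel in each column, or None."""
    h, w = len(word), len(word[0])
    return [next((y for y in range(h) if word[y][x] == 0), None)
            for x in range(w)]

def bottom_profile(word):
    """Same scan, but starting from the bottom row of each column."""
    h, w = len(word), len(word[0])
    return [next((y for y in range(h - 1, -1, -1) if word[y][x] == 0), None)
            for x in range(w)]

def max_row(profile):
    """Row number where most profile pixels lie (top/bottom-max-row)."""
    rows = [r for r in profile if r is not None]
    return max(set(rows), key=rows.count)

# A Devanagari-like word: a solid shirorekha along row 0, strokes below.
word = [
    [0, 0, 0, 0, 0, 0],   # headline (shirorekha)
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1],
]
tp, bp = top_profile(word), bottom_profile(word)
print(max_row(tp), max_row(bp))
```

Because the shirorekha spans every column, both maxima fall on row 0, which is exactly the cue the heuristic classifier exploits for Devanagari.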
228
- 6. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
3.2.6) Heuristic script identification algorithm (result shown in Fig. 7):
Input: Pre-processed text words of Devanagari and English scripts
Output: Identified script for each word
1. Compute top_profile
2. Compute bottom_profile
3. Compute features F2, F3 & F4
4. Identify the script type as follows:
   If Top_max_row = Bottom_max_row OR Top_horizontal_line >= 60 then Script = "Hindi"
   else Script = "Others"
5. Return Script
Figure 7: Hindi Identified words
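The decision rule of the heuristic algorithm is small enough to state directly in code; the feature values passed in below are illustrative, not measured from the paper's images.

```python
# The heuristic rule of 3.2.6: a word is labelled Hindi when the
# shirorekha makes the top and bottom profile maxima coincide, or when
# at least 60% of top-profile components form a horizontal line.

def identify_script(top_max_row, bottom_max_row, top_horizontal_line):
    if top_max_row == bottom_max_row or top_horizontal_line >= 60:
        return "Hindi"
    return "Others"

print(identify_script(0, 0, 10))    # profile maxima coincide -> Hindi
print(identify_script(1, 9, 75))    # strong horizontal-line score -> Hindi
print(identify_script(1, 9, 10))    # neither condition -> Others
```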
IV) RESULTS AND DISCUSSION
We compared the proposed algorithm with the edge-based algorithm and the connected-component-with-projection-profile algorithm. For comparison we used the precision and recall rates.
Precision rate = (TP / (TP + FP)) × 100  ----------------- (1.4)

Recall rate = (TP / (TP + FN)) × 100  ----------------- (1.5)

where TP is the number of correctly detected text regions, FP the number of false positives and FN the number of false negatives.
The precision rate takes into consideration the false positives, which are non-text regions in the image that have been detected by the algorithm as text regions. The recall rate takes into consideration the false negatives, which are text words in the image that have not been detected by the algorithm. Thus, precision and recall rates are useful measures of the accuracy of each algorithm in locating correct text regions and eliminating non-text regions, as shown in Table 1.1.
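The two rates reduce to a few lines of Python; the detection counts below are hypothetical, chosen only to show how false positives and false negatives affect each rate.

```python
# Precision penalises false positives (non-text regions reported as
# text); recall penalises false negatives (text regions missed).

def precision(tp, fp):
    """Percentage of reported text regions that are genuinely text."""
    return 100.0 * tp / (tp + fp)

def recall(tp, fn):
    """Percentage of genuine text regions that were detected."""
    return 100.0 * tp / (tp + fn)

# E.g. 9 correct detections, 1 false positive and 3 missed regions.
print(precision(9, 1), recall(9, 3))   # 90.0 75.0
```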
Table 1.1: Comparison of Results of Three Algorithms
(precision rate / recall rate, in %)

Input Image    Edge-based Algorithm    Connected-component-based Algorithm    Proposed Algorithm
Image1         76.19 / 80.00           83.33 / 99.00                          94.11 / 80.00
Image2         68.42 / 76.47           77.27 / 99.00                          86.66 / 76.47
Image3         61.53 / 100.00          66.66 / 100.00                         72.72 / 100.00
Image4         0.00 / 0.00             53.84 / 100.00                         63.63 / 100.00
Image5         83.33 / 99.90           66.66 / 80.00                          90.90 / 98.00
Table 1.2: Results of Identified Devanagari (Hindi) Script from Images
(Hindi words)

Sr. No    Dataset Name    Correct Classification    Misclassification    Rejection
1         Image1          100%                      0%                   0%
2         Image2          80.00%                    20.00%               0%
3         Image3          66.66%                    00.01%               33.33%
4         Image4          100%                      0%                   0%
5         Image5          100%                      0%                   0%
V) CONCLUSION
In this paper, we have presented an efficient and simple algorithm for extraction of text from image documents based on connected components. A morphological approach is applied, followed by computing the ratio of the product of a region's bounding-box height and width to its area. By experimentation we fixed a threshold on this ratio to remove non-text regions from the image. The proposed algorithm was tested on five images having the same font size and obtained an average precision rate of 81.60% and an average recall rate of 90.89%. The text extracted from each image is passed on for script identification. Using the heuristic search classifier we obtained a correct classification accuracy of 89.33%. In the future we will test the algorithm on images with variable font sizes.
REFERENCES
1. Keechul Jung, Kwang In Kim and Anil K. Jain, “Text information extraction in images and
video: a survey”, The journal of the Pattern Recognition society, Vol. 37, Issue 5, pp. 977-997,
May 2004.
2. Chitrakala Gopalan and D. Manjula, “Contourlet Based Approach for Text Identification and
Extraction from Heterogeneous Textual Images”, International Journal of Electrical and
Electronics Engineering 2(8), pp. 491-500, 2008.
3. M. C. Padma and P.A. Vijaya, “Script Identification form Trilingual Documents using profile
Based Features”, International Journal of Computer Science and Applications, Vol. 7 No. 4, pp. 16
- 33 , 2010.
4. Leon, M., Vilaplana, V., Gasull, A. and Marques, F., "Caption text extraction for indexing purposes
using a hierarchical region-based image model," 16th IEEE International Conference on Image
Processing (ICIP), Nov. 2009.
5. Yu. Zhong, K. Karu, A. K. Jain, “Locating text in complex color images,” 3rd International
Conference on Document Analysis and Recognition, vol. 1, pp. 146-149,1995.
6. Victor. Wu, R. Manmatha, E. M. Riseman, “Text Finder: an automatic system to detect and
recognize text in images”, IEEE Transactions on PAMI, vol. 21, pp. 1224-1228, 1999.
7. Jagath Samarabandu and Xiaoqing Liu, “An Edge-based Text Region Extraction Algorithm for
Indoor Mobile Robot Navigation”, World Academy of Science, Engineering and Technology , pp
382-389, 2007
8. Neha Gupta and V. K. Banga, “Image Segmentation for Text Extraction”, 2nd International Conference on Electrical, Electronics and Civil Engineering (ICEECE'2012), Singapore, April 28-29, 2012.
9. C. V. Jawahar, Pavan Kumar, S.S.Ravi Kiran, “A Bilingual OCR for Hindi-Telugu Documents and
its applications”, Proceedings of 7th International Conference on Document Analysis and
Recognition (ICDAR)- Aug 2003, Vol 1, pp 408-412,2003.
10. S. Basavaraj Patil and N. V. Subbareddy, “Neural network based system for script identification in Indian documents”, Sadhana, Academy Proceedings in Engineering Sciences, Vol. 27, Part 1, pp. 83-97, February 2002.
11. K. Roy, U. Pal, and B. B. Chaudhuri, “Neural Network based Word wise Handwritten Script
Identification System for Indian Postal Automation”, Proceedings of ICISIP, International
Conference on IEEE, pp 240-245,2005.
12. Santanu Choudhury, Gaurav Harit, Shekar Madnani, R.B. Shet, “Identification of Scripts of Indian
Languages by Combining Trainable Classifiers”, ICVGIP, Dec.20-22, Bangalore, India, (2000).
13. M Swamy Das, D Sandhya Rani, C R K Reddy and A Govardhan, “Script identification from
Multilingual Telugu, Hindi and English Text Documents”, International Journal of Wisdom Based
Computing, Vol. 1 (3), December 2011.
14. M. M. Kodabagi and S. R. Karjol, “Script Identification from Printed Document Images using
Statistical Features”, International Journal of Computer Engineering & Technology (IJCET),
Volume 4, Issue 2, 2013, pp. 607 - 622, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
15. R. Edbert Rajan and Dr. K. Prasadh, “Spatial and Hierarchical Feature Extraction Based on Sift for
Medical Images”, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 2, 2012, pp. 308 - 322, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
16. M. M. Kodabagi, S. A. Angadi and Chetana. R. Shivanagi, “Character Recognition of Kannada
Text in Scene Images using Neural Network”, International Journal of Graphics and Multimedia
(IJGM), Volume 4, Issue 1, 2013, pp. 9 - 19, ISSN Print: 0976 – 6448, ISSN Online: 0976 –6456.
17. Patange V.V and Prof. Deshmukh B.T, “Visual Acknowledgement [O.C.R.] – A Method to Identify
the Printed Characters”, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 2, 2012, pp. 108 - 114, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
18. M. M. Kodabagi, S. A. Angadi and Anuradha. R. Pujari, “Text Region Extraction from Low
Resolution Display Board Images using Wavelet Features”, International Journal of Information
Technology and Management Information Systems (IJITMIS), Volume 4, Issue 1, 2013,
pp. 38 - 49, ISSN Print: 0976 – 6405, ISSN Online: 0976 – 6413.