Enriching search results using ontology
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print),
ISSN 0976-6375 (Online), Volume 4, Issue 2, March – April (2013), pp. 500-507, © IAEME
ENRICHING SEARCH RESULTS USING ONTOLOGY
Shobha B. Patil (1), S. K. Shirgave (2)
(1) ME student, D. Y. Patil College of Engineering & Technology, Kolhapur,
Maharashtra, India.
(2) Associate Professor, Dept. of I.T., D.K.T.E. Society's Textile & Engineering Institute,
Ichalkaranji, Maharashtra, India.
ABSTRACT
The content of the World Wide Web grows every day. Keyword based search is used to
find documents that are relevant to a search query. Because of the tremendous amount of
information available on the internet, it is very difficult to retrieve relevant documents
using keyword based search alone. The search results are based solely on the frequency of
keywords, without any extra intelligence. The meaning of keywords is not considered in
traditional search techniques.
It is possible to increase the proportion of relevant documents by using ontology, which
allows sophisticated semantic search. An ontology is a representation of the concepts in a
domain of interest and their relationships. This paper presents a method for enriching
search results with documents that are semantically relevant to the search query. In
existing systems, the ranking of documents in a search result is determined by the frequency
of the queried words in each page. The method proposed in this paper also ranks the
documents in the search results according to their semantic relevance to the search query.
Keywords: Ontology, OWL, OWLAPI, Protégé.
1. INTRODUCTION
The World Wide Web is one of the fastest growing areas of information. The information
on the internet grows dynamically and explosively every day. Traditionally, keyword based
search is used to find documents that are relevant to a search query. Keyword based search
uses the concept of TF-IDF to find relevant documents. Term Frequency Inverse Document
Frequency (TF-IDF) determines which words in a set of documents might be more relevant to
use in a query. TF-IDF assigns each word in a document a value that increases with the
frequency of the word in that document and decreases with the fraction of documents in
which the word appears. Words with a high TF-IDF value have a strong relationship with the
document they appear in, suggesting that if such a word appears in a query, the document
could be relevant to the search query.
Because of the tremendous amount of information available on the internet, it is very
difficult to retrieve relevant documents using keyword based search alone. Current keyword
based search returns only a few relevant documents and many irrelevant ones. The search
results are based solely on the frequency of keywords, without any extra intelligence.
Ontology can help to find documents that are semantically relevant to a search query.
An ontology describes concepts and their relationships in a domain of interest. Ontologies
are structural frameworks for organizing information. An ontology is a "formal, explicit
specification of a shared conceptualization" [1]. Various tools are available to design,
implement and maintain ontologies, such as Protégé, OntoEdit and SWOOP [2].
Ontology languages such as RDF and OWL allow users to write explicit, formal
conceptualizations of domain models. OWL builds on RDF and RDF Schema, and uses
RDF's XML syntax [3].
The method proposed in this paper enriches keyword based search by using ontology.
The documents returned by keyword based search are augmented with the documents
obtained by mapping the search query to the ontology. All documents in the search result are
ranked by considering both the frequency of keywords in the document and the ontology.
Documents that are semantically relevant to the search query are given more weight.
The paper is organized as follows. Section 2 defines the terminology used in this
paper. Section 3 reviews related work. Section 4 discusses the architecture and
implementation of the proposed method. Section 5 presents results and evaluation of the
proposed method. Finally, section 6 presents the conclusion.
2. TERMINOLOGIES
2.1 Ontology [1]:
Ontology is a "formal, explicit specification of a shared conceptualization". Ontology
is an explicit representation of concepts of some domain of interest, with their characteristics
and their relationships. Ontology represents knowledge in terms of concepts defined by
classes, properties and individuals and a set of axioms that assert how those concepts are to
be interpreted.
2.2 Domain ontology [1]:
A domain ontology (or domain-specific ontology) models a specific domain, which
represents part of the world. Particular meanings of terms applied to that domain are
provided by domain ontology.
2.3 OWL [4]:
Web Ontology Language (OWL) is one of the knowledge representation languages
for creating, manipulating and processing ontologies. OWL is used to describe the classes
and the relations between them that are inherent in web documents and applications.
2.4 OWL API:
OWL currently has three variants: OWL Full, OWL DL and OWL Lite. The OWL API is
targeted primarily at representing OWL DL. The OWL API is a Java based API for semantic
web ontologies.
2.5 Search terms:
Search terms consist of one or more keywords that represent the information needed
by the user.
2.6 Protégé [5][6]:
It is a free, open-source ontology editor and knowledge-based framework, produced
by Stanford University. Protégé is a tool that enables the construction of domain ontologies
and of customized data entry forms to enter data. Protégé allows the definition of classes,
class hierarchies, variables, variable-value restrictions, the relationships between classes,
and the properties of these relationships.
3. LITERATURE REVIEW
Traditionally, in the keyword based search technique, documents are represented
using the vector space model. The TF/IDF method is used to weight terms by their frequency
within a document and across documents. Documents with high TF/IDF values for the query
terms are returned as relevant documents.
As the amount of information available from sources such as the World Wide Web
increases, it becomes more difficult to find relevant information in large information
spaces. In this situation, an ontology that describes concepts and their relationships in a
domain of interest can help, since the terms provided by the ontology can guide novices
within a specific domain or people who are not familiar with searching [7].
Ontology can also be used for effective navigation.
Ontology is referred to as the explicit and formal specification of a shared
conceptualization. As an engineering artifact, it consists of terms and relationships that
describe a certain reality, plus a set of explicit assumptions regarding the intended meaning
of the vocabulary. Ontology can serve as background knowledge [8] and it can help users
refine the search results from domains that they are not familiar with [9]. Constructing a
formal ontology generally relies on an interactive process that makes knowledge explicit and
formalizes it.
An overview of some editing tools for ontology is given in [2]. The paper [5] describes
ontology management tools that are freely available (Protégé 3.4, Apollo, IsaViz & SWOOP)
and reviews them in terms of: a) interoperability, b) openness, c) ease of update and
maintenance, d) market status and penetration.
The method presented in this paper uses keyword based search (TF/IDF data) and
ontology to provide the documents that are most relevant to the search query.
4. PROPOSED METHOD
Fig. 1 The structure of the proposed method.
The method implementation is divided into five different modules:
A. Extraction of terms from documents
B. TF/IDF data
C. Ontology
D. Enriching search result using ontology
E. Ranking the result.
A. Extraction of terms from documents:
All HTML files from the web site are processed to obtain terms in the following way.
1. Tokenization: the process of splitting a string of text into individual tokens.
2. Stop word removal: stop words are removed from the obtained terms.
3. Stemming: the stemming process is applied to the remaining terms.
Stemming is the reduction of words to abbreviated word roots, so that similar words
can easily be compared for equality.
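The three steps above can be sketched as follows. This is only an illustration, not the paper's implementation: the stop-word list and the suffix-stripping rule are assumptions (a real system would more likely use an established stemmer such as Porter's), and the input is assumed to be plain text already extracted from the HTML files.

```python
import re

# Assumption: a tiny illustrative stop-word list; the paper does not give one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Assumption: crude suffix stripping stands in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_terms(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(extract_terms("Films and filming of the dances"))
# ['film', 'film', 'danc']
```

In this example "films" and "filming" reduce to the same root, so they compare as equal terms in the later steps.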
B. TF/IDF data:
1. Vector space model [10]:
Vector Space Model represents each document as a vector with one entry per term. If
term j appears k times in document i, the document vector for i contains value k in
position j. The document vector for i contains the value 0 in positions corresponding to
terms that do not appear in document i.
[Fig. 1 block labels: Search query; Document searching with TF/IDF data; Ontology; Enriched search result documents; Ranking Module; Ranked documents]
The value for a term in a document vector is its Term Frequency (TF), i.e. the number of
occurrences of that term in the given document.
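A minimal sketch of such document vectors is given below. The two documents and their terms are hypothetical, and ordering the vector positions by the sorted vocabulary is an assumption made only so that every document uses the same positions.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for terms in docs.values() for term in terms})


def term_frequency_vector(terms, vocabulary):
    """Position j holds the count of vocabulary[j] in the document, or 0."""
    counts = Counter(terms)
    return [counts[term] for term in vocabulary]


# Hypothetical, already preprocessed documents (lists of stemmed terms).
docs = {
    "d1": ["film", "music", "film"],
    "d2": ["danc", "music"],
}
vocabulary = build_vocabulary(docs)
vectors = {
    doc_id: term_frequency_vector(terms, vocabulary)
    for doc_id, terms in docs.items()
}

print(vocabulary)      # ['danc', 'film', 'music']
print(vectors["d1"])   # [0, 2, 1]
print(vectors["d2"])   # [1, 0, 1]
```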
2. TF/IDF data [10]:
TF is the term frequency, which is extracted from the vector space model.
IDF (inverse document frequency):
The IDF of a term j is defined as
$\mathrm{IDF}_j = \log(N / n_j)$
where N is the total number of documents and
$n_j$ is the number of documents in which term j appears.
The document vector is refined by using the weight $w_{ij}$:
$w_{ij} = t_{ij} \times \mathrm{IDF}_j$
The value associated with term j in the document vector for document i, denoted
$w_{ij}$, is obtained by multiplying the term frequency $t_{ij}$ by the IDF of term j in the
document collection. IDF effectively increases the weight given to rare terms.
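The weighting above can be sketched in a few lines. The documents are hypothetical, the natural logarithm is an assumption (the text does not state the base), and the weights are stored sparsely per document rather than as full vectors.

```python
import math
from collections import Counter


def idf(docs):
    """IDF_j = log(N / n_j) for every term j that occurs in the collection."""
    n_docs = len(docs)
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    return {term: math.log(n_docs / n_j) for term, n_j in doc_freq.items()}


def tfidf_weights(docs):
    """w_ij = t_ij * IDF_j, stored sparsely as {doc_id: {term: weight}}."""
    idf_values = idf(docs)
    return {
        doc_id: {term: tf * idf_values[term] for term, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }


# Hypothetical, already preprocessed documents.
docs = {
    "d1": ["film", "music", "film"],
    "d2": ["danc", "music"],
    "d3": ["film", "magic"],
}
weights = tfidf_weights(docs)

# "danc" occurs in one document and "music" in two, so within d2 the rarer
# term receives the larger weight.
print(weights["d2"]["danc"] > weights["d2"]["music"])  # True
```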
C. Ontology:
The domain ontology is designed using the Protégé tool [5][6]. Protégé ontologies can be
exported into a variety of formats including RDF(S), OWL, and XML Schema. Concepts
and their relationships are modeled as an ontology using Protégé and stored in an OWL file.
D. Enriching search result using ontology:
1. TF/IDF data is used to return the documents relevant to the search query. The text of
the search query is tokenized to obtain terms, and stop words are removed from the query
terms. Stemming is then applied to obtain stemmed terms. The resulting terms are looked
up in the TF/IDF data to get the relevant documents that contain the search terms.
2. OWLAPI is used to map the search query to the domain ontology. The text of the search
query is tokenized to obtain terms, and stop words are removed from the terms. Each
class in the ontology is retrieved and checked against the search terms. If the class is
present in the search query, its subclasses are retrieved from the ontology. The web
pages related to these subclasses are added to the relevant documents.
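The enrichment step can be illustrated with a small stand-in for the ontology. The sketch below does not use OWLAPI (which is a Java library); instead, plain dictionaries play the role of the class hierarchy and of the class-to-page mapping. All class names and page names are hypothetical, only direct subclasses are followed, and tagging a page found by both routes as an ontology result is an assumption.

```python
# Hypothetical ontology fragment: class name -> direct subclasses.
SUBCLASSES = {
    "music": ["film music", "folk music"],
    "dance": ["ballet"],
}

# Hypothetical mapping from an ontology class to the web pages about it.
PAGES = {
    "film music": ["film_music.html"],
    "folk music": ["folk_music.html"],
    "ballet": ["ballet.html"],
}


def ontology_documents(query_terms):
    """Collect the pages of the subclasses of every class named in the query."""
    found = []
    for cls, subclasses in SUBCLASSES.items():
        if cls in query_terms:
            for sub in subclasses:
                for page in PAGES.get(sub, []):
                    if page not in found:
                        found.append(page)
    return found


def enrich(keyword_hits, query_terms):
    """Merge keyword hits with ontology hits, recording the source of each page."""
    sources = {page: "tfidf" for page in keyword_hits}
    for page in ontology_documents(query_terms):
        sources[page] = "ontology"  # assumption: the ontology source wins on overlap
    return sources


print(enrich(["music.html", "film_music.html"], ["music"]))
# {'music.html': 'tfidf', 'film_music.html': 'ontology', 'folk_music.html': 'ontology'}
```

Recording the source of each page is a design choice made here so that a later ranking step can treat the two kinds of result differently.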
E. Ranking of documents in the search result [11][12][13]:
Documents resulting from the semantic search are ranked using Cosine Similarity.
The Cosine Similarity between each document in the result and the query terms is calculated as
$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{N} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{N} w_{i,q}^{2}}}$$
where
$d_j$: document j from the search result
$q$: query document
$N$: number of terms
$w_{i,j}$: weight of the ith term in document j (term frequency)
$w_{i,q}$: weight of the ith term in document q (term frequency)
Documents resulting from ontology mapping are given higher priority, whereas
documents resulting from searching the TF/IDF data are given second priority.
The cosine similarity is boosted by 0.9 if the document results from ontology
mapping, and by 0.7 if it results from the TF/IDF data.
The resulting web pages are then sorted in descending order of boosted similarity.
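The ranking step can be sketched as below. The boost values 0.9 and 0.7 come from the text; applying them as multiplicative factors is an assumption, since the text does not say how the boost is combined with the similarity. The document weights and the input layout are hypothetical.

```python
import math

# Boost values from the text; multiplying by them is an assumption.
BOOST = {"ontology": 0.9, "tfidf": 0.7}


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * query_weights.get(term, 0.0) for term, w in doc_weights.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0
    return dot / (doc_norm * query_norm)


def rank(documents, query_weights):
    """documents: {doc_id: (term_weights, source)}.

    Returns (doc_id, boosted_score) pairs sorted by descending score.
    """
    scored = [
        (doc_id, BOOST[source] * cosine_similarity(weights, query_weights))
        for doc_id, (weights, source) in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents with identical term weights but different sources.
documents = {
    "a.html": ({"film": 1.0, "music": 1.0}, "tfidf"),
    "b.html": ({"film": 1.0, "music": 1.0}, "ontology"),
}
query = {"film": 1.0, "music": 1.0}
ranked = rank(documents, query)
print([doc_id for doc_id, _ in ranked])  # ['b.html', 'a.html']
```

With equal cosine similarity, the page that came from ontology mapping is ranked first, which matches the priority described above.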
5. RESULTS AND EVALUATION
The results of the system are evaluated using the following metrics:
1. Precision:
Precision is the ratio of the number of relevant documents retrieved to the total number
of documents retrieved.
2. Recall:
Recall is the ratio of the number of relevant documents retrieved to the total number of
relevant documents in the collection.
3. F1-measure:
$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
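The three metrics can be computed from the set of retrieved documents and the set of relevant documents. The sketch below uses hypothetical document identifiers, and returning 0 when a denominator is empty is an assumption made to keep the function total.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical query: 3 documents retrieved, 4 relevant in the collection,
# 2 of the retrieved documents are relevant.
p, r, f1 = precision_recall_f1(["a", "b", "c"], ["a", "b", "d", "e"])
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.5 0.57
```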
The following search queries were tested on one web site:
Q1: “film music”
Q2: “film magic”
Q3: “poetry literature”
Q4: “dance”
Q5: “comedy magic”
Table 1: Performance of document searching using TF/IDF data
Query   Precision   Recall   F1-measure
Q1      0.67        0.45     0.54
Q2      1           0.5      0.67
Q3      0.83        0.34     0.47
Q4      0.83        0.62     0.71
Q5      1           0.28     0.48
Table 2: Performance of proposed method
Query   Precision   Recall   F1-measure
Q1      0.75        0.67     0.70
Q2      0.75        0.75     0.75
Q3      0.86        0.8      0.88
Q4      0.86        0.75     0.80
Q5      0.86        0.86     0.86
Fig. 2 Graph of existing method and proposed method for F1-measure
[Figure: F1-measure (axis from 0 to 1) for queries Q1 to Q5; series: Existing method, Proposed method]
6. CONCLUSION
A keyword based system for searching documents returns only a few relevant documents
for a search query. The search results are based only on the frequency of keywords,
without any extra intelligence.
The method proposed in this paper enriches keyword based search by using ontology.
Ontology can help to find documents that are semantically relevant to the search query.
A keyword based search system with ontology returns more relevant documents for the
search query: for the five test queries, recall and F1-measure improved in every case
(Tables 1 and 2).
REFERENCES:
[1] Webpage: http://www.en.wikipedia.org/ontology
[2] Escórcio, L. and Cardoso, J., "Editing Tools for Ontology Construction", in "Semantic
Web Services: Theory, Tools and Applications", Idea Group, 2007.
[3] Grigoris Antoniou and Frank van Harmelen, "Web Ontology Language: OWL".
[4] Web Ontology Language (OWL): http://www.w3.org/2004/OWL/
[5] "A Comparative Study Ontology Building Tools for Semantic Web Applications",
International Journal of Web & Semantic Technology (IJWesT), Vol. 1, Num. 3,
July 2010.
[6] http://protege.stanford.edu
[7] Groth, K., Lannerö, P., "Context browser: ontology based navigation in
information spaces". Proc. 1st Int. Conf. on Information Interaction in Context, IIiX,
Vol. 176, p. 75-78, 2006.
[8] Schroeder, M., Burger, A., Kostkova, P., Stevens, R., Habermann, B., Dieng-Kunts, R.,
"Sealife: A Semantic Grid Browser for the Life Sciences Applied to the Study of
Infectious Diseases". Vol. 120, p. 167-178, 2006.
[9] Bonomi, A., Mosca, A., Palmonari, M., Vizzari, G., "Integrating a Wiki in an
Ontology Driven Web Site: Approach, Architecture and Application in the
Archaeological Domain". 3rd Semantic Wiki Workshop, 2008.
[10] Raghu Ramakrishnan, Gehrke, "Database Management Systems", McGraw Hill,
3rd edition.
[11] Zhuhadar, L., Nasraoui, O., Wyatt, R., "Dual representation of the semantic user
profile for personalized web search in an evolving domain". In: Proceedings of the
AAAI 2009 Spring Symposium on Social Semantic Web, Where Web 2.0 meets
Web 3.0 (2009), 84–89.
[12] Ahu Sieg, Bamshad Mobasher, Robin Burke, "Learning Ontology-Based User
Profiles: A Semantic Approach to Personalized Web Search", IEEE Int. Informatics
Bulletin, Nov. 2007.
[13] Anna Huang, "Similarity Measures for Text Document Clustering", NZCSRSC 2008,
April 2008, Christchurch, New Zealand.
[14] C. Santhosh kumar, D. Palanikkumar, "Dynamic Customization In The Business
Process Service Composition Using Ontology", International Journal of Computer
Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 138-149, ISSN
Print: 0976-6367, ISSN Online: 0976-6375.
[15] Vinu P.V., Sherimon P.C., Reshmy Krishnan, "Development Of Seafood Ontology For
Semantically Enhanced Information Retrieval", International Journal of Computer
Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 154-162, ISSN
Print: 0976-6367, ISSN Online: 0976-6375.