
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 6, November - December (2013), pp. 156-160
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com

EFFECTIVE WEB MINING TECHNIQUE FOR RETRIEVAL
INFORMATION ON THE WORLD WIDE WEB
Miss Purvi Dubey
Information Technology, Medicapas Institute of Science and Technology, RGPV
Indore (M.P.), India
Asst. Prof. Sourabh Dave
Information Technology, Medicapas Institute of Science and Technology, RGPV
Indore (M.P.), India

ABSTRACT
Predicting a user’s next web page request is a major problem today. In the past few years the Markov model has been used for this problem. However, web mining techniques such as clustering, association rule mining and the Markov model have many drawbacks. We propose an approach to overcome these drawbacks, which helps to improve search engines and helps users find interesting web pages.
Keywords – Association rule mining, Clustering, Hybrid model, Markov model, PageRank algorithm.
1. INTRODUCTION
The web is a complicated system of interconnected elements, and mining is the process of extracting coal or other minerals from a mine. Web mining is therefore the application of data mining techniques to discover patterns from the web [1].
1.1 Classification of Web Mining – Web mining can be classified into three categories:
1.1.1 Web content mining: - Web content mining is the process of extracting information from the content of web documents.



1.1.2 Web structure mining: - Web structure mining is the process of drawing conclusions from the link (reference) structure of the web.
1.1.3 Web usage mining: - Web usage mining is the process of extracting patterns from web access logs; it is also called web log mining [2].
2. WEB MINING TECHNIQUES
The web mining techniques considered here are the following:
2.1 PageRank
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important a web site is. PageRank is defined as follows: we assume page A has pages T1, ..., Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85. Out_deg(A) denotes the number of links going out of page A (the out-degree of A).
Definition 2.1.1 PageRank: - The page rank of a page A is given as follows:
$$ PR(A) = (1 - d) + d \left[ \sum_{i=1}^{n} \frac{PR(T_i)}{Out\_deg(T_i)} \right] \qquad (1) $$
The PageRank PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Let n be the number of documents. We define the link matrix M, where the entry Mij is 1/nj if there is a link from document j to document i, and 0 otherwise; nj is the number of forward links of document j. The PageRank on the graph is then the dominant eigenvector of the matrix M [5].
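The iterative calculation can be sketched as follows. This is a minimal illustration, not the implementation used in this paper: it assumes the standard reading of the definition, in which each page's rank is (1 - d) plus d times the sum of PR(T)/Out_deg(T) over the pages T that link to it. The function name `pagerank`, the example graph and the fixed iteration count are assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively estimate PageRank.

    links maps each page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T) / Out_deg(T) over every page T that links to `page`.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this toy graph page C receives links from both A and B, so it ends up ranked above B, which is linked only from A.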
2.2 Clustering
Clustering is an effective process in which search results are combined into groups of interesting pages according to the typed query, allowing the user to navigate to a document quickly.
2.3 Markov Model
A traditional Markov model describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
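As a hedged illustration of this definition applied to page requests, a first-order model can be estimated by counting how often one page follows another in recorded sessions. The function name `fit_first_order` and the sample sessions are invented for the example and are not data from this paper.

```python
from collections import Counter, defaultdict


def fit_first_order(sessions):
    """Estimate P(next page | current page) from lists of visited pages."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Each consecutive pair (current, next) is one observed transition.
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for page, followers in counts.items()
    }


# Hypothetical click-stream sessions.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "weather"],
    ["home", "shop"],
    ["news", "sports"],
]
model = fit_first_order(sessions)
most_likely_after_news = max(model["news"], key=model["news"].get)
```

Here the prediction after "news" depends only on the current page, not on how the user arrived there, which is exactly the first-order assumption.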
2.4 Association rule mining
In data mining, association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. It aims to identify strong rules discovered in databases using measures of interestingness.
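The text does not name the measures of interestingness; support and confidence are the two most commonly used, and the sketch below assumes them. The helper names and the sample transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical sets of pages visited in four sessions.
transactions = [
    {"home", "news"},
    {"home", "news", "sports"},
    {"home", "shop"},
    {"news", "sports"},
]
rule_support = support(transactions, {"home", "news"})
rule_confidence = confidence(transactions, {"home"}, {"news"})
```

A rule such as "home implies news" would be called strong when both values exceed user-chosen thresholds.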
3. PROBLEM FORMULATION
The web mining techniques have the following problems.
3.1 PageRank
PageRank works well when the search results are good: the user can easily reach the page of interest from the top-ranked results found by PageRank. The problem arises when the search results are of several types. To overcome this problem, clustering is used.


3.2 Clustering
As discussed earlier, clustering is the process in which search results are combined into groups of interesting pages according to the typed query, allowing the user to reach a document quickly.
The general idea of clustering is to partition the search results, that is, the set of top-ranked results, into clusters. Clustering also has some drawbacks:
(i) The way clustering organizes results does not always deliver the information the user expects. For example, if a user who wants to find a “phone number” or “mobile number” enters the query “contact number”, the clusters discovered by the current method may partition the results into “flat number” and “street number”; such clusters would not be very useful for users.
(ii) Sometimes the cluster captions are not descriptive, which misleads the user looking for search results.

Many strategies have been used to overcome these problems, but they solve them only in part. Such methods fail when we want the results of a single query within the query clusters. Other methods analyze the different clicked URLs of a query directly in user click-through logs; however, the number of different clicked URLs is not related to the type of query the user wants.
3.3 Markov model
One limitation of applying Markov model techniques to web personalization and prediction is the difficulty of data interpretation and visualization. The main problem facing Markov model users is identifying the optimal Markov model order.
3.4 Limitations of association rule mining
The main problem with association rule mining is the frequent item problem: items that occur together with high frequency also appear together in many of the resulting rules, which results in inconsistent predictions.
4. PROPOSED ALGORITHM
As discussed in the problem formulation, PageRank and clustering have some drawbacks. Our main aim is to improve the results delivered by the search engine, so we propose the following approach:
(i) Collect access log information from the web server.
(ii) From the log information retrieve only the IP address and URL, remove noise, and filter out irrelevant extensions.
(iii) Place each transaction in an individual cluster and create click-stream transactions.
(iv) Compute the similarity between transactions.
(v) For each transaction, identify the first K neighbors whose similarity exceeds a threshold and remove the other neighbors.
(vi) Merge the pair with the highest similarity.
(vii) Update the similarity for objects in the neighborhood of the merged pair.
(viii) Determine the new set of K neighbors from the 2K neighbors of the merged pair.
(ix) Update the neighbors in the previous list of merged pairs.
(x) Repeat steps (vi) to (vii) until no more merging is possible.
(xi) Identify the cluster whose access sequences match the test session.
(xii) Begin with the highest possible value of K.
(xiii) Apply the Markov model and find the K-th order states for the test session from its cluster.
(xiv) If the support is too low, estimate the next lower order states for the test session from its cluster.
(xv) Repeat step (xiv) until states are generated with enough support.
(xvi) Show the page with the highest probability as the recommended page.
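The prediction stage, steps (xii) to (xvi), can be sketched as a back-off over Markov model orders. This is an illustrative reading under stated assumptions, not the authors' implementation: support is taken to be the number of times a context was observed in the cluster, and the names `build_counts`, `recommend`, `min_support` and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict


def build_counts(sessions, max_order):
    """Count which page follows each context of length 1..max_order."""
    counts = defaultdict(Counter)
    for session in sessions:
        for k in range(1, max_order + 1):
            for i in range(len(session) - k):
                context = tuple(session[i:i + k])
                counts[context][session[i + k]] += 1
    return counts


def recommend(counts, test_session, max_order, min_support):
    """Try the highest order first, then back off to lower orders."""
    for k in range(min(max_order, len(test_session)), 0, -1):
        context = tuple(test_session[-k:])
        followers = counts.get(context)
        # Assumption: support = number of observations of this context.
        if followers and sum(followers.values()) >= min_support:
            return followers.most_common(1)[0][0]
    return None  # no order has enough support


# Hypothetical sessions belonging to the cluster of the test session.
cluster_sessions = [
    ["a", "b", "c"],
    ["a", "b", "c"],
    ["x", "b", "d"],
    ["y", "b", "d"],
    ["z", "b", "d"],
]
counts = build_counts(cluster_sessions, max_order=2)
high_order_hit = recommend(counts, ["a", "b"], max_order=2, min_support=2)
backed_off = recommend(counts, ["x", "b"], max_order=2, min_support=2)
```

In the example, the context ("a", "b") has enough second-order support and yields "c", while ("x", "b") was seen only once, so the sketch falls back to the first-order context ("b",) and recommends "d".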

5. RESULTS
In this section we present the experimental results to check the performance of the proposed algorithm. Fig 1 shows that the proposed algorithm gives better web page prediction accuracy than the traditional Markov model.
The accuracy of prediction is given by:
$$ Acc = \frac{|T_e \cap T_r|}{|T_e|} \qquad (2) $$
where Te is the set of test cases, Tr is the set of training cases, and |·| denotes the number of elements in a set.
The figure shows that the accuracy of the proposed algorithm is higher than that of the traditional Markov model.
Figure 1: Accuracy of the proposed algorithm compared to the traditional Markov model (bar chart; vertical axis from 0 to 100; bars: Traditional Markov Model, Proposed Algorithm)

We applied the proposed model to 60 test sessions and found that 51 sessions had accurate predictions, while the traditional model gave accurate predictions for only 42 of the 60 test sessions. The proposed model thus has 85% accuracy, while the traditional model has 70% accuracy.
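The reported percentages follow directly from the session counts, assuming accuracy is taken as the fraction of test sessions that were predicted correctly:

```latex
\[
\mathrm{Acc}_{\text{proposed}} = \frac{51}{60} = 0.85 = 85\%,
\qquad
\mathrm{Acc}_{\text{traditional}} = \frac{42}{60} = 0.70 = 70\%
\]
```

The difference is 9 sessions out of 60, that is, 15 percentage points.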


6. CONCLUSION
The proposed algorithm helps users find interesting web pages and enhances the delivery of search results. A limitation of the proposed approach is that it is based on estimates, so the result is not always exact. The figure shows the accuracy of the proposed algorithm compared to the traditional Markov model. The proposed approach can be used to enhance search engine results, and its main advantage is that it can be used to promote e-commerce, online marketing, and the purchasing of goods and services.
REFERENCES
[1] Cooley, B. Mobasher, and J. Shrivastava, Web mining: Information and pattern discovery on www, IEEE, 1997, 1082-3409.
[2] Pooja Sharma, Asst. Prof. Rupali Bhartiya, An efficient algorithm for improved web usage mining, ISSN, 3(2), 766-79.
[3] Miss Shital C. Patil, Prof. R. R. Keole, Web usage mining and web content mining - A combine approach for enhancing search engine delivery, ISSN, 3(10), 2013, 800-803.
[4] Poonam Kaushal, Hybrid model for better prediction of web page, International Journal of Scientific and Research Publication, 2(8), 2012, 1-4.
[5] Arun K Pujari, Data mining techniques (Hyderabad (A.P.), India: Universities Press (India) Private Limited, 2010).
[6] Poonam Kaushal, Prediction of user’s next web page request by hybrid technique, International Journal of Engineering Technology and Advanced Engineering, 2(3), 2012, 339-342.
[7] Suresh Subramanian and Dr. Sivaprakasam, “Genetic Algorithm with a Ranking Based
Objective Function and Inverse Index Representation for Web Data Mining”, International
Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, 2013,
pp. 84 - 90, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[8] M. Karthikeyan, M. Suriya Kumar and Dr. S. Karthikeyan, “A Literature Review on the Data
Mining and Information Security”, International Journal of Computer Engineering &
Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141 - 146, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.
[9] Alamelu Mangai J, Santhosh Kumar V and Sugumaran V, “Recent Research in Web Page
Classification – A Review”, International Journal of Computer Engineering & Technology
(IJCET), Volume 1, Issue 1, 2010, pp. 112 - 122, ISSN Print: 0976 – 6367, ISSN Online:
0976 – 6375.
[10] Prakasha S, Shashidhar Hr and Dr. G T Raju, “A Survey on Various Architectures, Models
and Methodologies for Information Retrieval”, International Journal of Computer
Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print:
0976 – 6367, ISSN Online: 0976 – 6375.
[11] R. Lakshman Naik, D. Ramesh and B. Manjula, “Instances Selection Using Advance Data
Mining Techniques”, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 2, 2012, pp. 47 - 53, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[12] Prof. Sindhu P Menon and Dr. Nagaratna P Hegde, “Research on Classification Algorithms
and its Impact on Web Mining”, International Journal of Computer Engineering &
Technology (IJCET), Volume 4, Issue 4, 2013, pp. 495 - 504, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.
