1. PageRank and Hyperlink-Induced Topic Search in Web Structure Mining
Presented By
Priyabrata Satapathy
2. Plan of My Work (I)
Learn the basics of Web structure
Hub
Authority
Link analysis
PageRank
HITS
2 Anand Bihari
3. Plan of My Work (II)
Literature survey on PageRank and HITS in Web structure mining.
Defining the problem (PageRank and HITS).
Proposing/designing a new algorithm for computing the PageRank of a web page.
Simulation and performance analysis of the proposed algorithm.
4. Outline
Introduction
Basic Concepts of
Web Structure
Hub and Authority
PageRank
HITS
Conclusion
Future Work
References
5. Introduction
The World Wide Web is a global information system distributed across numerous Web sites around the world.
Web servers can potentially host millions of pages, which makes the number of Web pages extremely difficult to track.
The Web resembles a network of thousands of interconnected, intertwined cells organized in a complex structure.
Each Web site contains a number of Web pages.
Each Web page contains the following three parts:
Body of the page,
Hypertext markup language (HTML) of the page, and
Hyperlinks to other Web pages.
6. Web Mining
Web mining can generally be divided into three categories:
Web content mining,
Web structure mining
Web usage mining
7. Web Structure Mining
Hyperlink analysis is the main content of Web structure mining: by analyzing the links between pages, it studies the reference relationships between pages in order to find useful patterns and improve search quality.
Structure mining derives a link diagram of a site from the links that lead from one page to another.
8. Simple Web Link Graph
[Figure: link graph of four pages, Page A, Page B, Page C and Page D]
9. Hub
A hub is a page with many out-links.
Authority
An authority is a page with many in-links.
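The two definitions above can be illustrated by counting links. The following is a minimal sketch, assuming the four-page graph that the deck uses later in its PageRank example; the names `links` and `degree_counts` are made up for this illustration.

```python
# Assumed link list (source page, target page) for the four-page example graph.
links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "A"), ("D", "C")]

def degree_counts(links):
    """Return (out_degree, in_degree) dictionaries for every page in links."""
    pages = sorted({page for edge in links for page in edge})
    out_degree = {page: 0 for page in pages}
    in_degree = {page: 0 for page in pages}
    for source, target in links:
        out_degree[source] += 1   # one more out-link for the source page
        in_degree[target] += 1    # one more in-link for the target page
    return out_degree, in_degree

out_degree, in_degree = degree_counts(links)
best_hub = max(out_degree, key=out_degree.get)         # page with the most out-links
best_authority = max(in_degree, key=in_degree.get)     # page with the most in-links
print("hub candidate:", best_hub, "authority candidate:", best_authority)
```

In this graph the page with the most out-links and the page with the most in-links are different pages, which is the distinction the two definitions draw.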
10. Hubs and Authorities on the Internet
[Figure: hubs and authorities]
Authorities and hubs have a mutual reinforcement relationship.
A good hub increases the authority weight of the pages it points to.
A good authority increases the hub weight of the pages that point to it.
11. Link Analysis
There are two famous link analysis methods:
1. PageRank algorithm
2. HITS algorithm
12. PageRank
The heart of Google's search software is PageRank,
a system for ranking web pages developed by Larry Page and Sergey Brin at Stanford University in 1996.
It is based on the idea of a 'random surfer'.
PageRank is a static ranking of Web pages: it is computed offline and does not depend on the search query.
PageRank is based on the measure of prestige in social networks; the PageRank value of each page can be regarded as its prestige.
13. PageRank
From the perspective of prestige, the PageRank algorithm is derived from the following observations.
A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, the more in-links a page i receives, the more prestige page i has.
Pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i. In other words, a page is important if it is pointed to by other important pages.
14. PageRank
In-links of page i: the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.
Out-links of page i: the hyperlinks that point from page i to other pages. Usually, links to pages of the same site are not considered.
[Figure: page A on Website 1 and page B on Website 2]
15. PageRank Algorithm
The PageRank of a web page is calculated as a sum over all pages linking to it (its incoming links): each linking page contributes its own PageRank divided by the number of its outgoing links.
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Where:
PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of a page Ti which links to page A,
C(Ti) is the number of outbound links on page Ti,
d is a damping factor which can be set between 0 and 1. It models how likely the random surfer is to keep clicking on links, and is usually set to 0.85.
n is the number of pages that link to page A.
The PageRank algorithm does not rank a website as a whole; the value is determined for each page individually. Furthermore, the PageRank of page A is recursively defined by the PageRanks of the pages which link to page A.
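As a rough illustration of the formula, the sketch below evaluates PR(A) once from given PageRank values of the linking pages. The function name `pagerank_of`, the page names T1 and T2, and their link lists are assumptions made up for this example.

```python
def pagerank_of(page, ranks, out_links, d=0.85):
    """Evaluate PR(page) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to page."""
    total = 0.0
    for t, targets in out_links.items():
        if page in targets:
            total += ranks[t] / len(targets)   # PR(T) / C(T)
    return (1 - d) + d * total

# Hypothetical setup: T1 links only to A, T2 links to A and B.
out_links = {"T1": ["A"], "T2": ["A", "B"]}
ranks = {"T1": 1.0, "T2": 1.0}
pr_a = pagerank_of("A", ranks, out_links)
print(pr_a)
```

T2 passes on only half of its PageRank to A because it has two outbound links, which is the role of the divisor C(Ti).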
16. The Characteristics of PageRank
[Figure: link graph of the four pages A, B, C and D]
We regard a small web consisting of four pages A, B, C and D, whereby page A links to the pages B, C and D, page B links to page C, page C links to page A and page D links to page C. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5.
PR(A) = 0.5 + 0.5 ( PR(C))
PR(B) = 0.5 + 0.5 ( PR(A)/3)
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) )
PR(D) = 0.5 + 0.5 ( PR(A)/3 )
We get the following PageRank values for the single pages:
PR(A) = 12/10 = 1.2
PR(B) = 7/10 = 0.7
PR(C) = 14/10 = 1.4
PR(D) = 7/10 = 0.7
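The four equations form a small linear system, so the values can be checked exactly. The following is a minimal sketch, assuming plain Gauss-Jordan elimination with Python's `fractions` module; `solve_pagerank` is a name made up for this illustration and not part of the original slides.

```python
from fractions import Fraction

out_links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = sorted(out_links)
d = Fraction(1, 2)

def solve_pagerank(out_links, pages, d):
    """Solve PR = (1 - d) + d * M * PR exactly, where M[i][j] = 1/C(j) if j links to i."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    # Augmented matrix of the system (I - d*M) x = (1 - d).
    m = [[Fraction(int(i == j)) for j in range(n)] + [1 - d] for i in range(n)]
    for source, targets in out_links.items():
        for target in targets:
            m[index[target]][index[source]] -= d / len(targets)
    # Gauss-Jordan elimination with exact fractions.
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return {p: m[index[p]][n] for p in pages}

ranks = solve_pagerank(out_links, pages, d)
for page in pages:
    print(page, ranks[page], float(ranks[page]))
```

Solving the system directly is only practical for tiny graphs like this one; it serves here as a check on the hand calculation.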
17. The Iterative Computation of PageRank
The Google search engine uses an approximate, iterative computation of PageRank values.
This means that each page is assigned an initial starting value, and the PageRank of every page is then recomputed over several rounds.
The iteration ends when the PageRank values no longer change, or change only negligibly.
18. The Iterative Computation of PageRank
Algorithm
General PageRank equation is
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Iteration Algorithm
Set PR ← [R1, R2, ..., Rn], where Ri is some initial rank of page i and n is the number of pages in the graph.
d ← 0.5
i ← 1
Do
  For each page A: PRi(A) ← (1 - d) + d (PRi-1(T1)/C(T1) + ... + PRi-1(Tn)/C(Tn))
  k ← largest |PRi(A) - PRi-1(A)| over all pages A
  i ← i + 1
While k > e, where e is a small number indicating the convergence threshold
Return PR
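The iteration can be sketched in Python as follows. This is one possible reading of the pseudocode, not the presenter's implementation: every page is updated from the previous round's values, the loop stops once the largest change falls below the threshold, and every page is assumed to have at least one out-link. The names `iterate_pagerank`, `eps` and `max_iter` are made up for this example.

```python
def iterate_pagerank(out_links, d=0.5, eps=1e-6, initial=1.0, max_iter=1000):
    """Iterate the PageRank equation until the largest change is below eps."""
    pages = sorted(out_links)
    # Pages T that link to each page p (the in-links of p).
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    pr = {p: initial for p in pages}
    for iteration in range(1, max_iter + 1):
        new_pr = {
            p: (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in in_links[p])
            for p in pages
        }
        k = max(abs(new_pr[p] - pr[p]) for p in pages)   # largest change this round
        pr = new_pr
        if k < eps:
            break
    return pr, iteration

# The four-page example graph from the deck, with d = 0.5.
out_links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
pr, rounds = iterate_pagerank(out_links)
print(rounds, {p: round(v, 4) for p, v in pr.items()})
```

The `max_iter` cap is a safety net so that the loop ends even if the threshold is chosen too small for floating-point arithmetic.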
19. The Iterative Computation of PageRank (example)
Let the initial PageRank value of each page be 1.
Iteration   PR(A)    PR(B)    PR(C)    PR(D)
0           1        1        1        1
1           1        0.6667   1.3332   0.6667
2           1.1666   0.6944   1.3888   0.6944
3           1.1944   0.6990   1.3980   0.6990
4           1.1990   0.6998   1.3996   0.6998
5           1.1998   0.6999   1.3998   0.6999
6           1.1999   0.6999   1.3998   0.6999
7           1.1999   0.6999   1.3998   0.6999
The tabulated values appear to be truncated to four decimal places, with each page computed from the values already updated in the same iteration.
The sum of all pages' PageRanks still converges to the total number of web pages, so the average PageRank of a web page is 1.
20. Effects of Inbound Links (I)
Each additional inbound link for a web page always increases that page's PageRank.
One may assume that an additional inbound link from page X increases the PageRank of page A by
d PR(X) / C(X)
[Figure: the four-page graph of A, B, C and D with an additional page X linking to page A]
PR(A) = 0.5 + 0.5 (PR(X) + PR(C))
PR(B) = 0.5 + 0.5 (PR(A)/3)
PR(C) = 0.5 + 0.5 (PR(A)/3 + PR(B) + PR(D))
PR(D) = 0.5 + 0.5 (PR(A)/3)
21. Effects of Inbound Links (II)
Let PR(X) = 10.
Solving the four equations gives the following PageRank values for the single pages:
PR(A) = 36/5 = 7.2
PR(B) = 17/10 = 1.7
PR(C) = 17/5 = 3.4
PR(D) = 17/10 = 1.7
The initial effect of the additional inbound link of page A was given by
d PR(X) / C(X) = 0.5 × 10 / 1 = 5
PR(A) rises from 1.2 to 7.2, a gain of 6. Hence page A has an even higher PageRank benefit from its additional inbound link, because part of the added PageRank flows back to A through the other pages.
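The comparison between the initial effect and the final benefit can be reproduced numerically. The sketch below is an illustration under stated assumptions: PR(X) is held fixed at 10 with C(X) = 1 as on the slide, the equations are solved by simple fixed-point iteration, and the names `step` and `fixed_point` are made up for this example.

```python
d = 0.5
PR_X, C_X = 10.0, 1   # PageRank and out-link count of the external page X

def step(pr, extra_for_a):
    """Apply the four slide equations once; extra_for_a is PR(X)/C(X), or 0 without X."""
    return {
        "A": (1 - d) + d * (extra_for_a + pr["C"]),
        "B": (1 - d) + d * (pr["A"] / 3),
        "C": (1 - d) + d * (pr["A"] / 3 + pr["B"] + pr["D"]),
        "D": (1 - d) + d * (pr["A"] / 3),
    }

def fixed_point(extra_for_a, rounds=200):
    """Iterate the equations from all-ones until they have settled."""
    pr = {p: 1.0 for p in "ABCD"}
    for _ in range(rounds):
        pr = step(pr, extra_for_a)
    return pr

before = fixed_point(0.0)            # graph without the link from X
after = fixed_point(PR_X / C_X)      # graph with the link from X to A
initial_effect = d * PR_X / C_X
gain = after["A"] - before["A"]
print("initial effect:", initial_effect, "actual gain of A:", round(gain, 4))
```

The gain of page A comes out larger than the initial effect because A's extra PageRank is passed on to B, C and D and partly returns to A through C.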
22. Effect of Outbound Links (I)
Since PageRank is based on the linking structure of the whole web, it is inescapable that if the inbound links of a page influence its PageRank, its outbound links also have some impact.
In this graph, page B has an additional outbound link, to page D.
[Figure: the four-page graph of A, B, C and D with an additional link from page B to page D]
The PageRank values are then given by
PR(A) = 0.5 + 0.5 (PR(C))
PR(B) = 0.5 + 0.5 (PR(A)/3)
PR(C) = 0.5 + 0.5 (PR(A)/3 + PR(B)/2 + PR(D))
PR(D) = 0.5 + 0.5 (PR(A)/3 + PR(B)/2)
23. Effect of Outbound Links (II)
Solving the four equations gives the following PageRank values for the single pages:
PR(A) = 31/27 ≈ 1.148
PR(B) = 56/81 ≈ 0.691
PR(C) = 35/27 ≈ 1.296
PR(D) = 70/81 ≈ 0.864
The total PageRank of all pages = 4.
Hence, adding a link has no effect on the total PageRank of the web.
Additionally, the PageRank of page D increases, while the PageRanks of pages A, B and C decrease.
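The effect of the extra outbound link can be checked by solving both graphs and comparing them. The following is a minimal sketch, assuming d = 0.5 and fixed-point iteration; the function name `solve` and the variable names are made up for this illustration.

```python
d = 0.5

def solve(out_links, rounds=200):
    """Iterate the PageRank equations from all-ones until they have settled."""
    pages = sorted(out_links)
    pr = {p: 1.0 for p in pages}
    for _ in range(rounds):
        pr = {
            p: (1 - d) + d * sum(pr[q] / len(out_links[q]) for q in pages if p in out_links[q])
            for p in pages
        }
    return pr

original = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
with_extra_link = {**original, "B": ["C", "D"]}   # page B additionally links to D

before, after = solve(original), solve(with_extra_link)
print("total before:", round(sum(before.values()), 6))
print("total after: ", round(sum(after.values()), 6))
print("change per page:", {p: round(after[p] - before[p], 4) for p in sorted(after)})
```

The totals agree while the individual values shift, which matches the observation that a new link redistributes PageRank rather than creating it.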
24. The Effect of the Number of Pages
An additional page increases the total PageRank of all pages on the web by 1, because the sum of all PageRanks equals the number of pages.
25. How to Increase the PageRank of a Website
Add new pages to your website (as many as you can).
Swap links with websites which have a high PageRank value.
Raise the number of inbound links (for example, advertise your website on other sites).
26. HITS
HITS stands for Hyperlink-Induced Topic Search.
It was developed by Jon Kleinberg.
HITS is search-query dependent.
When the user issues a search query, HITS first expands the list of relevant pages returned by a search engine and then produces two rankings of the expanded set of pages: an authority ranking and a hub ranking.
It uses hubs and authorities to define a recursive relationship between web pages.
27. HITS Algorithms (I)
HITS depends on the query words. First, HITS invokes a traditional search engine to get a set of pages related to the query, and then expands the set with the pages that point to them or are pointed to by them. After that, HITS finds the top hubs and authorities by iterative calculation. All of the processing is done online.
R is the root set returned by the query, and S is the base set that covers R and all pages linked with it.
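The expansion from R to S can be sketched as a single pass over a link table. This is a minimal illustration under stated assumptions: the link table `out_links`, the page names and the function `build_base_set` are hypothetical, and a real system would obtain the links from a crawler or search index.

```python
def build_base_set(root_set, out_links):
    """Expand root set R into base set S: add pages R points to and pages pointing into R."""
    base = set(root_set)
    for page, targets in out_links.items():
        if page in root_set:
            base.update(targets)                      # pages pointed to by a root page
        elif any(t in root_set for t in targets):
            base.add(page)                            # page that points to a root page
    return base

# Hypothetical link table: r1 and r2 are the pages returned for the query.
out_links = {"r1": ["x"], "r2": ["r1"], "y": ["r2"], "z": ["w"], "x": [], "w": []}
root = {"r1", "r2"}
base = build_base_set(root, out_links)
print(sorted(base))
```

Pages z and w stay outside the base set because they neither link to a root page nor are linked from one.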
28. HITS Algorithm (II)
Let the authority score of page p be a(p) and the hub score of page p be h(p). The mutual reinforcement relationship of the two scores is represented as follows:
a(p) = Σ h(q), summed over all pages q with q → p
h(p) = Σ a(q), summed over all pages q with p → q
The notation q → p means that there is a hyperlink from page q to page p.
The calculation is iterated until the results converge. The final output of the HITS algorithm is a set of pages with large hub weights and a set of pages with large authority weights.
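One round of the two sum updates can be sketched over a list of links. This is an illustration under stated assumptions: both scores are computed from the previous round's values and no normalization is applied; the function name `hits_round` and the link list (the deck's four-page example graph) are chosen for this example.

```python
def hits_round(links, hub, authority):
    """One round of the sum updates, both computed from the previous round's scores."""
    new_authority = {p: 0 for p in authority}
    new_hub = {p: 0 for p in hub}
    for q, p in links:                     # hyperlink q -> p
        new_authority[p] += hub[q]         # a(p) sums h(q) over all q -> p
        new_hub[q] += authority[p]         # h(q) sums a(p) over all q -> p
    return new_hub, new_authority

links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "A"), ("D", "C")]
pages = "ABCD"
hub = {p: 1 for p in pages}
authority = {p: 1 for p in pages}
hub, authority = hits_round(links, hub, authority)
print("hub:", hub)
print("authority:", authority)
```

Calling `hits_round` repeatedly makes the scores grow without bound, so a practical implementation would rescale them after each round before testing for convergence.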
29. HITS Algorithm (III)
Let A be the adjacency matrix of the root set R and denote the authority weight vector by a and the hub weight vector by h, where
a = (a1, a2, ..., an)^T and h = (h1, h2, ..., hn)^T
Then a = A^T h and h = A a.
The computation of authority scores and hub scores is basically the same as the computation of the PageRank scores using the iteration method. If we use a_k and h_k to denote the authority and hub scores at the kth iteration, the iterative processes for generating the final solutions are
a_k = A^T A a_(k-1) and h_k = A A^T h_(k-1)
starting with a_0 = h_0 = (1, 1, ..., 1)^T.
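The matrix form of the iteration can be sketched with plain Python lists. This is a minimal illustration, assuming the adjacency matrix of the four-page graph used in the deck's examples and no normalization of the vectors; the helper names `mat_vec`, `transpose` and `hits_matrix` are made up for this example.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]

def hits_matrix(adjacency, iterations):
    """Apply a_k = A^T A a_(k-1) and h_k = A A^T h_(k-1), starting from all-ones vectors."""
    at = transpose(adjacency)
    a = [1] * len(adjacency)
    h = [1] * len(adjacency)
    for _ in range(iterations):
        a = mat_vec(at, mat_vec(adjacency, a))
        h = mat_vec(adjacency, mat_vec(at, h))
    return a, h

# Rows and columns are ordered A, B, C, D; entry [i][j] is 1 if page i links to page j.
A = [
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
a1, h1 = hits_matrix(A, 1)
print("a_1 =", a1, "h_1 =", h1)
```

Because the vectors are not rescaled here, their entries grow with every iteration; implementations commonly normalize a and h after each step, which the slide does not cover.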
30. HITS Example
[Figure: link graph of the four pages A, B, C and D]
The adjacency matrix of the graph is
A =  0 1 1 1     with transpose  A^T =  0 0 1 0
     0 0 1 0                            1 0 0 0
     1 0 0 0                            1 1 0 1
     0 0 1 0                            1 0 0 0
Assume the initial hub and authority weights are
h = (1, 1, 1, 1)^T and a = (1, 1, 1, 1)^T
From these initial weights we compute the authority weight vector and the hub weight vector:
a = A^T h = (1, 1, 3, 1)^T and h = A a = (3, 1, 1, 1)^T
31. HITS Example(cont.)
Hub weight of
Page A = 3,
Page B = 1,
Page C = 1 and
Page D = 1;
Authority weight of
Page A = 1,
Page B = 1,
Page C = 3 and
Page D = 1;
Hence, after this first iteration from all-ones weights, the hub weight of a page is the total number of its out-linked pages and the authority weight of a page is the total number of its in-linked pages.
32. Conclusion
Studied the basic concepts of hyperlink analysis.
Studied the PageRank technique.
Studied the HITS technique.
33. Future Work
Study hyperlink analysis techniques.
Literature survey on hyperlink analysis and other related topics.
Defining the problem in PageRank and HITS.
Proposing a new algorithm or improving the PageRank and HITS algorithms.
Simulation and performance analysis of the proposed model.
34. Future Literature Survey
Title | Name of Journal/Conference | Publication Year
Mining web informative structures and contents based on entropy analysis | IEEE Transactions on Knowledge and Data Engineering | 2004
Wisdom: web intra page informative structure mining based on document object model | IEEE Transactions on Knowledge and Data Engineering | 2005
Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining | 2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation | 2010
Design and implementation of a web structure mining algorithm using breadth first search strategy for academic search application | International Conference on Internet Technology and Secured Transactions | 2011
35. References
Bing Liu, "Web Data Mining", Springer International Edition.
IEEE conference paper, "Research on PageRank and Hyperlink-Induced Topic Search in Web Structure Mining".
Websites: Google, Wikipedia, http://pr.efactory.de/
www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html