Page rank and hyperlink

PageRank and Hyperlink-Induced
Topic Search in Web Structure Mining

Presented By
Priyabrata Satapathy

Plan of My Work (I)
 Learn the basic concepts of web structure

 Hub

 Authority

 Link analysis

 PageRank

 HITS

2 Anand Bihari

Plan of My Work (II)
 Literature survey on PageRank and HITS in web structure mining.

 Defining the problem (PageRank and HITS).

 Proposing/designing a new algorithm for computing the PageRank of a web page.

 Simulation and performance analysis of the proposed algorithm.


Outline
 Introduction

 Basic Concepts of

 Web Structure

 Hub and Authority

 PageRank

 HITS

 Conclusion

 Future Work

 References


Introduction
 The World Wide Web is a global information system distributed across numerous websites around the world.

 Web servers can potentially host millions of pages, which makes the number of web pages extremely difficult to track.

 The Web is a network of thousands of interconnected sites, intertwined like cells organized in a complex structure.

 Each website contains a number of web pages.

 Each web page contains the following three parts:

 The body of the page,

 The hypertext markup language (HTML) of the page, and

 Hyperlinks between web pages.

Web Mining
 Web mining can generally be divided into three categories:

 Web content mining,

 Web structure mining

 Web usage mining


Web Structure Mining
 Hyperlink analysis is the main content of web structure mining: it analyzes the links between pages to study the relationships between the referencing pages, find useful patterns, and improve search quality.

 Structure mining builds a link diagram of a site from the links that lead from one page to another.


Simple Web Link Graph

[Figure: link graph of Page A, Page B, Page C and Page D]


Hub
 A hub is a page with many out-links.

Authority
 An authority is a page with many in-links.
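
As a minimal sketch of these two definitions, the out-links and in-links of each page can be counted directly from a link graph. The toy graph and the helper name `degree_counts` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy graph: each key links to the pages in its list.
links = {
    "A": ["B", "C", "D"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def degree_counts(links):
    """Return (out_degree, in_degree) dictionaries for a link graph."""
    out_degree = {page: len(targets) for page, targets in links.items()}
    in_degree = defaultdict(int)
    for page, targets in links.items():
        in_degree[page] += 0  # make sure every page appears, even with no in-links
        for target in targets:
            in_degree[target] += 1
    return out_degree, dict(in_degree)


out_degree, in_degree = degree_counts(links)
```

In this toy graph, page A has the most out-links, so it is the most hub-like page, and page C has the most in-links, so it is the most authority-like page.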


Hubs and Authorities on the Internet

[Figure: hubs and authorities]

 Authorities and Hubs have a mutual reinforcement relationship.

 A good hub increases the authority weight of the pages it points to.

 A good authority increases the hub weight of the pages that point to it.


Link Analysis
There are two well-known link analysis methods:
1. PageRank algorithm
2. HITS algorithm


PageRank
 The heart of Google’s searching software is PageRank.

 It is a system for ranking web pages developed by Larry Page and Sergey Brin at Stanford University in 1996.

 It is based on the idea of a "random surfer".

 PageRank is a static ranking of web pages: it is computed offline and does not depend on the search query.

 PageRank is based on the measure of prestige in social networks.

 The PageRank value of each page can be regarded as its prestige.


PageRank
 From the perspective of prestige, the PageRank algorithm is derived from the following observations.
 A hyperlink from one page to another is an implicit conveyance of authority to the target page. Thus, the more in-links a page i receives, the more prestige page i has.

 Pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i. In other words, a page is important if it is pointed to by other important pages.


PageRank
 In-links of page i: the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.

 Out-links of page i: the hyperlinks that point from page i to other pages. Usually, links to pages of the same site are not considered.

[Figure: page A on Website 1 and page B on Website 2]


PageRank Algorithm
The PageRank of a web page is calculated from the sum of the PageRanks of all pages linking to it (its incoming links), each divided by the number of outgoing links on the linking page:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where:
 PR(A) is the PageRank of page A,
 PR(Ti) is the PageRank of page Ti, which links to page A,
 C(Ti) is the number of outbound links on page Ti,
 d is a damping factor which can be set between 0 and 1. It models the probability that the random surfer keeps clicking on links, and is usually set to 0.85,
 n is the number of in-links of page A.
PageRank does not rank a website as a whole; it is determined for each page individually. Furthermore, the PageRank of page A is defined recursively by the PageRanks of the pages that link to page A.
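
The definition can be written as a short function, following the general equation PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). This is only a sketch: the function name `pagerank_of`, the pages T1 and T2, and their values are hypothetical, and the PageRanks of the linking pages are assumed to be already known.

```python
def pagerank_of(page, in_links, out_count, pr, d=0.85):
    """PR(page) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to page."""
    return (1 - d) + d * sum(pr[t] / out_count[t] for t in in_links[page])


# Hypothetical example: pages T1 and T2 both link to page A.
in_links = {"A": ["T1", "T2"]}
out_count = {"T1": 2, "T2": 4}  # C(T1) and C(T2)
pr = {"T1": 1.0, "T2": 2.0}     # assumed current PageRanks of T1 and T2

pr_a = pagerank_of("A", in_links, out_count, pr)
```

Each linking page passes on its own PageRank divided by its number of out-links, so a link from a page with few out-links is worth more than a link from a page with many.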

The Characteristics of PageRank

[Figure: link graph of pages A, B, C and D]

Consider a small web consisting of four pages A, B, C and D, in which page A links to pages B, C and D, page B links to page C, page C links to page A, and page D links to page C. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5.
PR(A) = 0.5 + 0.5 ( PR(C) )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) )
PR(D) = 0.5 + 0.5 ( PR(A)/3 )
Solving these equations gives the following PageRank values for the individual pages:
PR(A) = 12/10 = 1.2
PR(B) = 7/10 = 0.7
PR(C) = 14/10 = 1.4
PR(D) = 7/10 = 0.7
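
The solution can be checked by substituting the values back into the four equations. The sketch below uses exact fractions so that no rounding is involved; the variable names are chosen only for illustration.

```python
from fractions import Fraction as F

d = F(1, 2)
pr = {"A": F(12, 10), "B": F(7, 10), "C": F(14, 10), "D": F(7, 10)}

# Right-hand sides of the four equations for the small web.
rhs = {
    "A": (1 - d) + d * pr["C"],
    "B": (1 - d) + d * (pr["A"] / 3),
    "C": (1 - d) + d * (pr["A"] / 3 + pr["B"] + pr["D"]),
    "D": (1 - d) + d * (pr["A"] / 3),
}

is_solution = all(rhs[page] == pr[page] for page in pr)
total = sum(pr.values())
```

Here `is_solution` is true only if every equation holds exactly, and `total` is the sum of the four PageRank values.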

The Iterative Computation of PageRank
 The Google search engine uses an approximate, iterative computation of PageRank values.

 Each page is assigned an initial starting value, and the PageRank values of all pages are then recomputed in repeated passes.

 The iteration ends when the PageRank values no longer change, or change by less than a small threshold.


Algorithm
The general PageRank equation is
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Iteration algorithm
Set PR ← [R1, R2, ..., Rn], where Ri is some initial rank of page i and n is the number of pages in the graph
d ← 0.5
i ← 1
Do
    For each page A: PRi(A) ← (1 - d) + d (PRi-1(T1)/C(T1) + ... + PRi-1(Tn)/C(Tn))
    k ← largest |PRi(A) - PRi-1(A)| over all pages A
    i ← i + 1
While k ≥ e, where e is a small number indicating the convergence threshold
Return PR
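
The iteration can be sketched in Python as follows. This is an illustrative version under stated assumptions, not a definitive implementation: every page starts with rank 1, every page has at least one out-link, each pass reads only the values of the previous pass, and the names `pagerank`, `eps` and `max_iter` are hypothetical.

```python
def pagerank(links, d=0.5, eps=1e-6, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) until the values settle."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # assumed initial rank of every page
    in_links = {p: [q for q in pages if p in links[q]] for p in pages}
    for _ in range(max_iter):
        new_pr = {
            p: (1 - d) + d * sum(pr[q] / len(links[q]) for q in in_links[p])
            for p in pages
        }
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < eps:  # converged: the largest change is below the threshold
            break
    return pr


# The four-page web: A links to B, C and D; B, and D link to C; C links to A.
links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

Because each pass uses only the previous pass's values, intermediate results can differ from a computation that reuses values updated within the same pass, but both approaches settle on the same final values.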


Example
Let the initial PageRank value of each page be 1.
Iteration PR(A) PR(B) PR(C) PR(D)
0 1 1 1 1
1 1 0.6667 1.3332 0.6667
2 1.1666 0.6944 1.3888 0.6944
3 1.1944 0.6990 1.3980 0.6990
4 1.1990 0.6998 1.3996 0.6998
5 1.1998 0.6999 1.3998 0.6999
6 1.1999 0.6999 1.3998 0.6999
7 1.1999 0.6999 1.3998 0.6999

The sum of all pages' PageRanks still converges to the total
number of web pages. So the average PageRank of a web page is 1.

Effects of Inbound Links (I)
 Each additional inbound link for a web page always increases that page's PageRank.

 One may assume that an additional inbound link from page X increases the PageRank of page A by d PR(X) / C(X).

[Figure: the four-page graph with an additional page X linking to page A]

PR(A) = 0.5 + 0.5 ( PR(X) + PR(C) )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) )
PR(D) = 0.5 + 0.5 ( PR(A)/3 )


Effects of Inbound Links (II)
Let PR(X) = 10. Solving the equations gives:
PR(A) = 36/5 = 7.2
PR(B) = 17/10 = 1.7
PR(C) = 17/5 = 3.4
PR(D) = 17/10 = 1.7
The initial effect of the additional inbound link of page A is given by
d PR(X) / C(X) = 0.5 × 10 / 1 = 5
The PageRank of page A rises from 1.2 to 7.2, a gain of 6. Hence page A benefits even more than this initial effect from its additional inbound link, because part of the added PageRank flows back to A through the pages that link back to it.
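
The four equations with the extra link from X can be solved numerically by repeated substitution. The sketch below assumes, as in the equations, that PR(X) is held constant at 10 and that X has a single out-link; the function name `solve_with_external_link` and the number of sweeps are hypothetical.

```python
def solve_with_external_link(pr_x=10.0, d=0.5, sweeps=200):
    """Repeated substitution on the four equations, with PR(X) held constant."""
    a = b = c = dd = 1.0
    for _ in range(sweeps):
        # The right-hand sides all use the values from the previous sweep.
        a, b, c, dd = (
            (1 - d) + d * (pr_x + c),       # X has one out-link, so C(X) = 1
            (1 - d) + d * (a / 3),
            (1 - d) + d * (a / 3 + b + dd),
            (1 - d) + d * (a / 3),
        )
    return {"A": a, "B": b, "C": c, "D": dd}


with_x = solve_with_external_link()
initial_effect = 0.5 * 10.0 / 1  # d * PR(X) / C(X)
gain_for_a = with_x["A"] - 1.2   # 1.2 is PR(A) without the link from X
```

Comparing `gain_for_a` with `initial_effect` shows how much of the benefit comes from PageRank flowing back to A through the rest of the graph.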

Effect of Outbound Links (I)
 Since PageRank is based on the linking structure of the whole web, it is inescapable that if the inbound links of a page influence its PageRank, its outbound links also have some impact.

 In this graph, page B has an additional outbound link, to page D.

[Figure: the four-page graph with an additional link from page B to page D]

 The PageRank equations then become:
PR(A) = 0.5 + 0.5 ( PR(C) )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/2 + PR(D) )
PR(D) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/2 )


Effect of Outbound Links (II)
PR(A) = 31/27 ≈ 1.148
PR(B) = 56/81 ≈ 0.691
PR(C) = 35/27 ≈ 1.296
PR(D) = 70/81 ≈ 0.864

The total PageRank of all pages is 4.

Hence, adding a link has no effect on the total PageRank of the web.

Additionally, the PageRank of page D increases, while the PageRanks of pages A, B and C decrease.
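
The effect of the extra outbound link can be checked by computing PageRank for the graph before and after the link from B to D is added. This is a compact sketch under the same assumptions as the example (d = 0.5, every page has at least one out-link); the function name and the fixed number of sweeps are hypothetical.

```python
def pagerank(links, d=0.5, sweeps=200):
    """Repeatedly apply PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over pages q linking to p."""
    pr = {p: 1.0 for p in links}
    for _ in range(sweeps):
        pr = {
            p: (1 - d) + d * sum(pr[q] / len(links[q]) for q in links if p in links[q])
            for p in links
        }
    return pr


before = pagerank({"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]})
after = pagerank({"A": ["B", "C", "D"], "B": ["C", "D"], "C": ["A"], "D": ["C"]})
total_after = sum(after.values())
```

Comparing `before` and `after` page by page shows which pages gain and which lose, while `total_after` shows whether the total PageRank of the web has changed.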


The Effect of the Number of Pages
 Since the PageRanks of all pages sum to the number of pages, each additional page increases the total PageRank of the web by one.


How to Increase the PageRank of a Website
 Add new pages to your website (as many as you can).

 Swap links with websites that have a high PageRank value.

 Raise the number of inbound links (for example, advertise your website on other sites).


HITS
 HITS stands for Hyperlink-Induced Topic Search.

 It was developed by Jon Kleinberg.

 HITS is search-query dependent.

 When the user issues a search query, HITS first expands the list of relevant pages returned by a search engine and then produces two rankings of the expanded set of pages: an authority ranking and a hub ranking.
 It uses hubs and authorities to define a recursive relationship between web pages.


HITS Algorithm (I)
 HITS depends on the query words. First, HITS invokes a traditional search engine to get a set of pages related to the query, and then expands the set with the pages that point to them or are pointed to by them. After that, HITS tries to find the top hubs and authorities by iterative calculation. All of the processing is done online.

 R is the root set returned by the query, and S is the base set that covers all linked pages.
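
The expansion from the root set R to the base set S can be sketched as below. The link graph, the root set and the function name `expand_root_set` are hypothetical; a real system would obtain the root set from a search engine and the links from a crawl or index.

```python
def expand_root_set(root, links):
    """Return the base set S: root pages plus pages they link to or that link to them."""
    base = set(root)
    for page, targets in links.items():
        if page in root:
            base.update(targets)  # pages pointed to by a root page
        if any(target in root for target in targets):
            base.add(page)        # pages pointing to a root page
    return base


# Hypothetical link graph and root set, for illustration only.
links = {"A": ["B"], "B": ["C"], "C": [], "D": ["B"], "E": ["A"]}
base = expand_root_set({"B"}, links)
```

In this toy graph, page E is left out of the base set because it neither links to the root page B nor is linked from it.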

HITS Algorithm (II)
 Let the authority score of page p be a(p) and the hub score of page p be h(p). The mutual reinforcement relationship of the two scores is represented as follows:

$$a(p) = \sum_{q \to p} h(q)$$

$$h(p) = \sum_{p \to q} a(q)$$

 The notation q → p means that there is a hyperlink from page q to page p.

 The calculations are iterated until the results converge. The final output of the HITS algorithm is a set of pages with large hub weights and a set of pages with large authority weights.


HITS Algorithm (III)
 Let A be the adjacency matrix of the link graph over the base set S, and denote the authority weight vector by a and the hub weight vector by h, where

$$a = (a_1, a_2, \ldots, a_n)^T, \qquad h = (h_1, h_2, \ldots, h_n)^T$$

Then

$$a = A^T h \quad \text{and} \quad h = A a$$

 The computation of authority scores and hub scores is basically the same as the computation of the PageRank scores using the iteration method. If we use a_k and h_k to denote the authority and hub scores at the k-th iteration, the iterative processes for generating the final solutions are

$$a_k = A^T A \, a_{k-1} \quad \text{and} \quad h_k = A A^T h_{k-1}$$

starting with

$$a_0 = h_0 = (1, 1, \ldots, 1)^T$$
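
The iteration can be sketched in plain Python as follows. The normalization step is an assumption added here to keep the scores bounded (the slides do not specify one), and the function name `hits`, the fixed iteration count and the example matrix are chosen only for illustration.

```python
def hits(adj, iterations=20):
    """Power iteration for HITS on an adjacency matrix given as a list of rows."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # a = A^T h: the authority of page i sums the hub scores of pages linking to i
        a = [sum(adj[j][i] * h[j] for j in range(n)) for i in range(n)]
        # h = A a: the hub score of page i sums the authority scores of pages i links to
        h = [sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)]
        # Assumed normalization: rescale each vector so that its entries sum to 1.
        a_total, h_total = sum(a), sum(h)
        a = [x / a_total for x in a]
        h = [x / h_total for x in h]
    return a, h


# Example matrix: page 0 links to pages 1, 2 and 3; pages 1 and 3 link to page 2;
# page 2 links to page 0.
adj = [[0, 1, 1, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
authority, hub = hits(adj)
```

Updating h from the freshly computed a in each pass is equivalent to the combined form with A^T A and A A^T, up to the scaling introduced by the normalization.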

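HITS Example

[Figure: link graph of pages A, B, C and D]

The adjacency matrix of the graph, with rows and columns ordered A, B, C, D, is

$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

Assume the initial hub and authority weights are

$$h = (1, 1, 1, 1)^T \quad \text{and} \quad a = (1, 1, 1, 1)^T$$

We compute the authority and hub weight vectors from these initial weights by

$$a = A^T h = (1, 1, 3, 1)^T, \qquad h = A a = (3, 1, 1, 1)^T$$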

HITS Example(cont.)
 Hub weight of
 Page A = 3,
 Page B = 1,
 Page C = 1 and
 Page D = 1;
 Authority weight of
 Page A = 1,
 Page B = 1,
 Page C = 3 and
 Page D = 1;
Hence, after one iteration from initial weights of 1, the hub weight of a page is the number of pages it links to (its out-links), and the authority weight of a page is the number of pages that link to it (its in-links).
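
This first step can be reproduced with a few lines of Python. The sketch assumes the page order A, B, C, D and computes both vectors from the initial weights of 1; the variable names are chosen only for illustration.

```python
adj = [
    [0, 1, 1, 1],  # A links to B, C and D
    [0, 0, 1, 0],  # B links to C
    [1, 0, 0, 0],  # C links to A
    [0, 0, 1, 0],  # D links to C
]
ones = [1, 1, 1, 1]
n = len(adj)

# a = A^T h with h = (1, 1, 1, 1): column sums, i.e. the in-link counts
a = [sum(adj[j][i] * ones[j] for j in range(n)) for i in range(n)]
# h = A a with the initial a = (1, 1, 1, 1): row sums, i.e. the out-link counts
h = [sum(adj[i][j] * ones[j] for j in range(n)) for i in range(n)]
```

The equality with the link counts holds only for this first step; in later iterations the weights also reflect the scores of the linked pages.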


Conclusion
 Studied the basic concepts of hyperlink analysis.

 Studied the PageRank technique.

 Studied the HITS technique.


Future Work
 Study hyperlink analysis techniques.

 Literature survey on hyperlink analysis and other related topics.

 Defining problems in PageRank and HITS.

 Proposing a new algorithm or improving the PageRank and HITS algorithms.

 Simulation and performance analysis of the proposed model.


Future Literature Survey
Title | Journal/Conference | Publication Year
Mining web informative structures and contents based on entropy analysis | IEEE Transactions on Knowledge and Data Engineering | 2004
Wisdom: web intra page informative structure mining based on document object model | IEEE Transactions on Knowledge and Data Engineering | 2005
Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining | 2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation | 2010
Design and implementation of a web structure mining algorithm using breadth first search strategy for academic search application | International Conference on Internet Technology and Secured Transactions | 2011


References
 Bing Liu, "Web Data Mining", Springer International Edition.

 IEEE conference paper, "Research on PageRank and Hyperlink-Induced Topic Search in Web Structure Mining".

 Websites: Google, Wikipedia, http://pr.efactory.de/

 www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html


Thank You

