PageRank and Hyperlink-Induced
Topic Search in Web Structure Mining

             Presented By
         Priyabrata Satapathy
Plan of My work(I)
     Learn Basic Knowledge of Web structure

       Hub

       Authority

       Link analysis

       PageRank

       HITS




2                                              Anand Bihari
Plan of My work(II)
     Literature Survey on PageRank and HITS in Web Structure Mining.

     Defining Problem (PageRank and HITS).

     Proposing/designing a new algorithm for computing the PageRank of a web page.

     Simulation and Performance Analysis of proposed Algorithm.




Outline
     Introduction

     Basic Concepts of

        Web Structure

        Hub and Authority

        PageRank

        HITS

     Conclusion

     Future Work

     References



Introduction
     The World Wide Web is a global information system distributed across numerous Web sites around the world.

     Web servers can potentially host millions of pages, which makes the number of web pages extremely difficult to track.

     The Web resembles a network of thousands of interconnected, intertwined cells organized in a complex structure.

     Each Web site contains a number of Web pages.

     Each Web page consists of the following three parts:

        the body of the page,

        the hypertext markup language (HTML) that the page contains, and

        hyperlinks between Web pages.
Web Mining
     Web mining can generally be divided into three categories:

        Web content mining,

        Web structure mining

        Web usage mining




Web Structure Mining
     Web structure mining is mainly hyperlink analysis: it analyzes the links between pages to study how pages reference one another, in order to find useful patterns and improve search quality.

     Structure mining represents a site as a link diagram of the links from one page to another.




Simple Web Link Graph

          [Figure: link graph of four pages: Page A, Page B, Page C and Page D]




Hub
       A hub is a page with many out-links.




    Authority
        An authority is a page with many in-links.




Hubs and Authorities on the Internet




      [Figure: hubs and authorities]


      Authorities and Hubs have a mutual reinforcement relationship.

      A good hub increases the authority weight of the pages it points to.

      A good authority increases the hub weight of the pages that point to it.

Link Analysis
     There are two famous link analysis methods:
          1. PageRank algorithm
          2. HITS algorithm




PageRank
      The heart of Google’s searching software is PageRank.

      A system for ranking web pages developed by Larry Page and Sergey

       Brin at Stanford University in 1996.

      Based on the idea of a 'random surfer'.

      PageRank is a static ranking of Web pages.

      PageRank is based on the measure of prestige in social networks:
      the PageRank value of each page can be regarded as its prestige.




PageRank
      From the perspective of prestige, we use the following to derive the

       PageRank algorithm.
        A hyperlink from a page pointing to another page is an implicit conveyance
          of authority to the target page. Thus, the more in-links a page i
          receives, the more prestige page i has.

        Pages that point to page i also have their own prestige scores. A page
          with a higher prestige score pointing to i is more important than a page
          with a lower prestige score pointing to i. In other words, a page is
          important if it is pointed to by other important pages.




PageRank
      In-links of page i: the hyperlinks that point to page i from other pages.
       Usually, hyperlinks from the same site are not considered.

      Out-links of page i: the hyperlinks that point from page i to other pages.
       Usually, links to pages of the same site are not considered.

      [Figure: page A on Website 1 and page B on Website 2]

PageRank Algorithm
     The PageRank of a web page is calculated from the PageRanks of all pages linking to it (its incoming links), each divided by the number of outgoing links on the linking page:

     $$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

     Where:
      PR(A) is the PageRank of page A,
      PR(Ti) is the PageRank of the pages Ti which link to page A,
      C(Ti) is the number of outbound links on page Ti,
      d is a damping factor which can be set between 0 and 1, usually to 0.85. In the random-surfer model it is the probability that the surfer keeps following links rather than jumping to a random page.
      n is the number of pages linking to page A.
     PageRank does not rank a website as a whole; it is determined for each page individually. Furthermore, the PageRank of page A is defined recursively by the PageRanks of the pages which link to page A.
The Characteristics of PageRank
     [Figure: link graph of the four pages A, B, C and D]

     We consider a small web consisting of four pages A, B, C and D, in which page A links
     to pages B, C and D, page B links to page C, page C links to page A and page D
     links to page C. According to Page and Brin, the damping factor d is usually set to
     0.85, but to keep the calculation simple we set it to 0.5.
          PR(A) = 0.5 + 0.5 ( PR(C) )
          PR(B) = 0.5 + 0.5 ( PR(A)/3 )
          PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) )
          PR(D) = 0.5 + 0.5 ( PR(A)/3 )
     Solving these equations gives the following PageRank values for the individual pages:
          PR(A) = 12/10 = 1.2
          PR(B) = 7/10 = 0.7
          PR(C) = 14/10 = 1.4
          PR(D) = 7/10 = 0.7
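     The stated values can be checked by substituting them back into the four equations. The short Python sketch below does this with exact fractions; the variable names (`pr`, `rhs`, `consistent`, `total`) are illustrative and not part of the original slides.

```python
from fractions import Fraction

d = Fraction(1, 2)  # damping factor used in the example

# PageRank values stated for the four-page web.
pr = {
    "A": Fraction(12, 10),
    "B": Fraction(7, 10),
    "C": Fraction(14, 10),
    "D": Fraction(7, 10),
}

# Right-hand side of each equation, evaluated with the stated values.
rhs = {
    "A": (1 - d) + d * pr["C"],
    "B": (1 - d) + d * (pr["A"] / 3),
    "C": (1 - d) + d * (pr["A"] / 3 + pr["B"] + pr["D"]),
    "D": (1 - d) + d * (pr["A"] / 3),
}

# True when every equation holds exactly for the stated values.
consistent = all(rhs[page] == pr[page] for page in pr)
# Sum of all PageRanks in the four-page web.
total = sum(pr.values())
print(consistent, total)
```

     Every equation holds exactly, and the four values sum to 4, the number of pages.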
The Iterative Computation of PageRank
      The Google search engine uses an approximate, iterative computation
       of PageRank values.

      Each page is assigned an initial starting value, and the PageRanks are then recomputed repeatedly.

      The iteration ends when the PageRank values no longer change, or change by less than a small threshold.




The Iterative Computation of PageRank
                         Algorithm
     The general PageRank equation is
       PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
     Iteration algorithm
       Set PR_0 ← [R1, R2, ..., RN], where Ri is the initial rank of page i and N is the number of pages in the graph.
       d ← 0.5
       i ← 1
       Do
         For each page A: PR_i(A) ← (1-d) + d (PR_{i-1}(T1)/C(T1) + ... + PR_{i-1}(Tn)/C(Tn))
         k ← maximum over all pages A of |PR_i(A) - PR_{i-1}(A)|
         i ← i + 1
       While k ≥ e, where e is a small number indicating the convergence threshold
       Return the most recently computed PR


The Iterative Computation of PageRank
     (example)
     Let the initial PageRank value of each page be 1.
     Iteration        PR(A)          PR(B)           PR(C)           PR(D)
     0                1              1               1               1
     1                1              0.6667          1.3332          0.6667
     2                1.1666         0.6944          1.3888          0.6944
     3                1.1944         0.6990          1.3980          0.6990
     4                1.1990         0.6998          1.3996          0.6998
     5                1.1998         0.6999          1.3998          0.6999
     6                1.1999         0.6999          1.3998          0.6999
     7                1.1999         0.6999          1.3998          0.6999

            The sum of all pages' PageRanks converges to the total
     number of web pages, so the average PageRank of a web page is 1.
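     The iteration can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `pagerank` and the parameters `tol` and `max_iter` are assumptions, and every page must appear as a key of `links`. The sketch updates all pages from the previous iteration's values, whereas the table above appears to reuse values already updated within the same pass, so intermediate values may differ while the limit is the same.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=1000):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) until the largest change is below tol."""
    pages = list(links)
    out_count = {p: len(links[p]) for p in pages}

    # Build the in-link lists: inlinks[p] holds every page that links to p.
    inlinks = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            inlinks[dst].append(src)

    pr = {p: 1.0 for p in pages}  # initial PageRank of 1 for every page
    for _ in range(max_iter):
        new = {
            p: (1 - d) + d * sum(pr[q] / out_count[q] for q in inlinks[p])
            for p in pages
        }
        change = max(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if change < tol:
            break
    return pr


# The four-page web of the example, with d = 0.5 as on the slides.
links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links, d=0.5)
print({page: round(value, 4) for page, value in ranks.items()})
```

     For this graph the values approach 1.2, 0.7, 1.4 and 0.7 for pages A, B, C and D, the exact solution of the four equations.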
Effects of Inbound Links(I)
      Each additional inbound link for a web page always increases that
       page's PageRank.

      One may assume that an additional inbound link from page X increases
       the PageRank of page A by
            d × PR(X) / C(X)

      [Figure: the four-page graph with an additional page X linking to page A]

     With the link from X to A as the only outbound link of X, the equations become
     PR(A) = 0.5 + 0.5 ( PR(X) + PR(C) )
     PR(B) = 0.5 + 0.5 ( PR(A)/3 )
     PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) )
     PR(D) = 0.5 + 0.5 ( PR(A)/3 )

Effects of Inbound Links(II)
     Let PR(X) = 10.
     Solving the four equations gives the following PageRank values for the individual pages:
         PR(A) = 36/5 = 7.2
         PR(B) = 17/10 = 1.7
         PR(C) = 17/5 = 3.4
         PR(D) = 17/10 = 1.7
     The initial effect of the additional inbound link of page A was given by
     d × PR(X) / C(X) = 0.5 × 10 / 1 = 5
     but PR(A) rises from 1.2 to 7.2, an increase of 6, because the added PageRank
     flows on to pages B, C and D and returns to page A through page C.
     Hence page A gets an even higher PageRank benefit from
     its additional inbound link.
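     As a check, the four equations for the graph with page X can be solved by substitution, with PR(X) = 10 and d = 0.5. The derivation below is a worked sketch that uses only those equations.

```latex
\begin{aligned}
PR(B) &= PR(D) = \tfrac{1}{2} + \tfrac{1}{6}\,PR(A) \\
PR(C) &= \tfrac{1}{2} + \tfrac{1}{2}\left(\tfrac{1}{3}\,PR(A) + PR(B) + PR(D)\right) = 1 + \tfrac{1}{3}\,PR(A) \\
PR(A) &= \tfrac{1}{2} + \tfrac{1}{2}\left(10 + PR(C)\right) = 6 + \tfrac{1}{6}\,PR(A)
  \;\Longrightarrow\; PR(A) = \tfrac{36}{5} = 7.2 \\
PR(B) &= PR(D) = \tfrac{1}{2} + \tfrac{7.2}{6} = 1.7, \qquad PR(C) = 1 + \tfrac{7.2}{3} = 3.4
\end{aligned}
```

     Compared with PR(A) = 1.2 in the four-page web without X, the gain is 6, which is more than the initial effect of 5.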
Effect of outbound Links(I)
      PageRank is based on the linking structure of the whole web, so
       if the inbound links of a page influence its
       PageRank, its outbound links inevitably have some impact as well.

      In this graph page B has an additional outbound link, to page D.

      [Figure: the four-page graph with an additional link from page B to page D]

      The PageRank equations then become
           PR(A) = 0.5 + 0.5 ( PR(C) )
           PR(B) = 0.5 + 0.5 ( PR(A)/3 )
           PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/2 + PR(D) )
           PR(D) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/2 )

Effect of outbound Links(II)
     Solving these equations gives the following PageRank values for the individual pages:
        PR(A) = 31/27 ≈ 1.148
        PR(B) = 56/81 ≈ 0.691
        PR(C) = 35/27 ≈ 1.296
        PR(D) = 70/81 ≈ 0.864

     The total PageRank of all pages is 4.

     Hence, adding a link has no effect on the total PageRank of the web.

     Additionally, the PageRank of page D is increased and the PageRanks of
       pages A, B and C are decreased.
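     The effect of the extra link from page B to page D can also be computed exactly by treating the PageRank equations as a linear system. The sketch below is an illustration under stated assumptions: the helper name `exact_pagerank` is hypothetical, it uses Gauss-Jordan elimination over exact fractions rather than the iterative method of the slides, and every page must appear as a key of `links`.

```python
from fractions import Fraction


def exact_pagerank(links, d):
    """Solve PR(p) = (1-d) + d * sum(PR(q)/C(q) for q linking to p) exactly."""
    pages = sorted(links)
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)

    # Augmented matrix of (I - d*M) x = (1-d), where M[p][q] = 1/C(q) if q links to p.
    rows = [[Fraction(int(i == j)) for j in range(n)] + [1 - d] for i in range(n)]
    for src, targets in links.items():
        for dst in targets:
            rows[index[dst]][index[src]] -= d / len(targets)

    # Gauss-Jordan elimination with exact arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]

    return {p: rows[index[p]][n] for p in pages}


# Four-page web in which page B links to both C and D.
links = {"A": ["B", "C", "D"], "B": ["C", "D"], "C": ["A"], "D": ["C"]}
ranks = exact_pagerank(links, Fraction(1, 2))
print({page: str(value) for page, value in ranks.items()}, sum(ranks.values()))
```

     For this graph the exact solution is PR(A) = 31/27, PR(B) = 56/81, PR(C) = 35/27 and PR(D) = 70/81, which sum to 4.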

The Effect of the Number of Pages
      Since the PageRanks of all pages sum to the total number of pages, an additional page increases the total PageRank of the web by one.




How to Increase the PageRank of a Website
      Add new pages to your website (as many as you can).

      Swap links with websites which have a high PageRank value.

      Raise the number of inbound links (for example, advertise your website on other
       sites), etc.




HITS
      HITS stands for Hyperlink-Induced Topic Search.

      Developed by Jon Kleinberg.

      HITS is search-query dependent.

      When the user issues a search query, HITS first expands the list of
       relevant pages returned by a search engine and then produces two
       rankings of the expanded set of pages: an authority ranking and a hub
       ranking.
      It uses hubs and authorities to define a recursive relationship between
       web pages.


HITS Algorithm (I)
       HITS depends on the query words. First, HITS invokes a traditional search
        engine to get a set of pages related to the query, and then expands the
        set with the pages that link to them or are linked from them. After that, HITS
        tries to find the top hubs and authorities by iterative calculation. All of
        the processing is done online.




       R is the root set returned by the query, and S is the base set that covers all
        linked pages.
27                                                                       Anand Bihari
HITS Algorithm (II)
      Let the authority score of page p be a(p) and its hub score be h(p).
       The mutual reinforcement relationship of the two scores is
       represented as follows:

       $$a(p) = \sum_{q \to p} h(q)$$

       $$h(p) = \sum_{p \to q} a(q)$$

      Here q → p means that there is a hyperlink from page q pointing to page p.

       The calculation is iterated until the results converge; the final
       output of the HITS algorithm is the set of pages with large hub weights
       and the set of pages with large authority weights.



HITS Algorithm (III)
      Let A be the adjacency matrix of the base set S, and denote the authority
       weight vector by a and the hub weight vector by h, where

       $$a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}, \qquad h = \begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{pmatrix}$$

       Then $a = A^T h$ and $h = A a$.
      The computation of authority scores and hub scores is basically the same
       as the computation of the PageRank scores using the iteration method. If
       we use $a_k$ and $h_k$ to denote the authority and hub scores at the kth iteration,
       the iterative processes for generating the final solutions are

       $$a_k = A^T A \, a_{k-1} \quad \text{and} \quad h_k = A A^T h_{k-1}$$

       starting with

       $$a_0 = h_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$
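       The iteration can be sketched in Python on the four-page web used in the PageRank example (A links to B, C and D; B, D link to C; C links to A). This is a minimal illustration with stated assumptions: the function name `hits` and the fixed iteration count are hypothetical, and the scores are rescaled to sum to 1 after every step, a normalization the slides do not mention but which keeps the values bounded.

```python
def hits(adj, iterations=50):
    """Iterate a = A^T h, h = A a on a 0/1 adjacency matrix given as a list of rows."""
    n = len(adj)
    a = [1.0] * n  # authority weights, a_0 = all ones
    h = [1.0] * n  # hub weights, h_0 = all ones
    for _ in range(iterations):
        # Authority update: a[p] sums h[q] over pages q that link to p.
        a = [sum(adj[q][p] * h[q] for q in range(n)) for p in range(n)]
        # Hub update: h[p] sums a[q] over pages q that p links to.
        h = [sum(adj[p][q] * a[q] for q in range(n)) for p in range(n)]
        # Rescale so each vector sums to 1 (assumption, not on the slides).
        a_total = sum(a) or 1.0
        h_total = sum(h) or 1.0
        a = [x / a_total for x in a]
        h = [x / h_total for x in h]
    return a, h


adj = [
    [0, 1, 1, 1],  # A -> B, C, D
    [0, 0, 1, 0],  # B -> C
    [1, 0, 0, 0],  # C -> A
    [0, 0, 1, 0],  # D -> C
]
authority, hub = hits(adj)
print([round(x, 3) for x in authority], [round(x, 3) for x in hub])
```

       With this normalization the authority weights approach (0, 0.25, 0.5, 0.25) and the hub weights approach (0.5, 0.25, 0, 0.25) for pages A, B, C and D, so page C is the strongest authority and page A the strongest hub.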
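HITS Example
     [Figure: link graph of the four pages A, B, C and D]

     The adjacency matrix of the graph (rows and columns in the order A, B, C, D) is

     $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \quad \text{with transpose} \quad A^T = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

     Assume the initial hub and authority weights are

     $$h = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} \quad \text{and} \quad a = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$

     We compute the authority and hub weight vectors, both from the initial weights, by

     $$a = A^T h = \begin{pmatrix} 1 \\ 1 \\ 3 \\ 1 \end{pmatrix}, \qquad h = A a = \begin{pmatrix} 3 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$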
HITS Example(cont.)
      Hub weight of
        Page A = 3,
        Page B = 1,
        Page C = 1 and
        Page D = 1;
      Authority weight of
        Page A = 1,
        Page B = 1,
        Page C = 3 and
        Page D = 1;
       Hence, after this first step from all-ones initial weights, the hub weight of a page is the total number of its
        out-linked pages and the authority weight of a page is the total
        number of its in-linked pages.
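       The first step can be reproduced with plain matrix-vector products. The sketch below is an illustration only; the variable names are assumptions, and both products use the all-ones initial weights, as in the example.

```python
adj = [
    [0, 1, 1, 1],  # A -> B, C, D
    [0, 0, 1, 0],  # B -> C
    [1, 0, 0, 0],  # C -> A
    [0, 0, 1, 0],  # D -> C
]
n = len(adj)
ones = [1] * n  # initial hub and authority weights

# a = A^T . h : entry p sums h[q] over every page q that links to p.
authority = [sum(adj[q][p] * ones[q] for q in range(n)) for p in range(n)]
# h = A . a : entry p sums a[q] over every page q that p links to.
hub = [sum(adj[p][q] * ones[q] for q in range(n)) for p in range(n)]

# Link counts read directly from the adjacency matrix.
in_degree = [sum(row[p] for row in adj) for p in range(n)]
out_degree = [sum(row) for row in adj]

print(authority, hub)
```

       The authority vector (1, 1, 3, 1) equals the in-link counts and the hub vector (3, 1, 1, 1) equals the out-link counts of pages A, B, C and D.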

Conclusion
      Studied the basic concepts of hyperlink analysis.

      Studied the PageRank technique.

      Studied the HITS technique.




Future Work
      Study Hyperlink analysis technique.

      Literature Survey on Hyperlink analysis and other related topic.

      Defining problem in PageRank and HITS.

      Proposing a new algorithm or improving the PageRank and HITS
       algorithms.

      Simulation and Performance Analysis of proposed Model.




Future Literature Survey
     Title                                   Journal/Conference                       Publication Year

     Mining web informative structures and   IEEE Transactions on Knowledge and       2004
     contents based on entropy analysis      Data Engineering

     Wisdom: web intra page informative      IEEE Transactions on Knowledge and       2005
     structure mining based on document      Data Engineering
     object model

     Knowledge Discovery and Retrieval       2010 Fourth Asia International           2010
     on World Wide Web Using Web             Conference on Mathematical/Analytical
     Structure Mining                        Modelling and Computer Simulation

     Design and implementation of a web      International Conference on Internet     2011
     structure mining algorithm using        Technology and Secured Transactions
     breadth first search strategy for
     academic search application



References
      Bing Liu, "Web Data Mining", Springer International Edition.

      IEEE conference paper, "Research on PageRank and Hyperlink-Induced
       Topic Search in Web Structure Mining".

      Websites: Google, Wikipedia, http://pr.efactory.de/

      www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html




Thank You




36               Anand Bihari

Mais conteúdo relacionado

Mais procurados (20)

Seminar Report Mine
Seminar Report MineSeminar Report Mine
Seminar Report Mine
 
Cybersecurity
CybersecurityCybersecurity
Cybersecurity
 
web mining
web miningweb mining
web mining
 
Page rank
Page rankPage rank
Page rank
 
Python Pandas
Python PandasPython Pandas
Python Pandas
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python Pandas
 
Nmap commands
Nmap commandsNmap commands
Nmap commands
 
Network security
Network securityNetwork security
Network security
 
Networking and penetration testing
Networking and penetration testingNetworking and penetration testing
Networking and penetration testing
 
Web mining
Web miningWeb mining
Web mining
 
Page rank
Page rankPage rank
Page rank
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_Habib
 
Ceh v5 module 03 scanning
Ceh v5 module 03 scanningCeh v5 module 03 scanning
Ceh v5 module 03 scanning
 
Network security
Network security Network security
Network security
 
Comptia security-sy0-601-exam-objectives-(2-0)
Comptia security-sy0-601-exam-objectives-(2-0)Comptia security-sy0-601-exam-objectives-(2-0)
Comptia security-sy0-601-exam-objectives-(2-0)
 
Python
PythonPython
Python
 
Cyber security ppt
Cyber security pptCyber security ppt
Cyber security ppt
 
Application of Machine Learning in Cyber Security
Application of Machine Learning in Cyber SecurityApplication of Machine Learning in Cyber Security
Application of Machine Learning in Cyber Security
 
Google PageRank
Google PageRankGoogle PageRank
Google PageRank
 
KNN
KNN KNN
KNN
 

Semelhante a Page rank and hyperlink (20)

Page rank algortihm
Page rank algortihmPage rank algortihm
Page rank algortihm
 
Google page rank
Google page rankGoogle page rank
Google page rank
 
PageRank & Searching
PageRank & SearchingPageRank & Searching
PageRank & Searching
 
Google page rank
Google page rankGoogle page rank
Google page rank
 
The Role of Backlinks in SEO
The Role of Backlinks in SEOThe Role of Backlinks in SEO
The Role of Backlinks in SEO
 
Search engine page rank demystification
Search engine page rank demystificationSearch engine page rank demystification
Search engine page rank demystification
 
Google page rank
Google page rankGoogle page rank
Google page rank
 
Page rank2
Page rank2Page rank2
Page rank2
 
PageRank Algorithm
PageRank AlgorithmPageRank Algorithm
PageRank Algorithm
 
Seo and page rank algorithm
Seo and page rank algorithmSeo and page rank algorithm
Seo and page rank algorithm
 
Web mining
Web miningWeb mining
Web mining
 
Page rank by university of michagain.ppt
Page rank by university of michagain.pptPage rank by university of michagain.ppt
Page rank by university of michagain.ppt
 
Ranking Web Pages
Ranking Web PagesRanking Web Pages
Ranking Web Pages
 
I04015559
I04015559I04015559
I04015559
 
Page Rank Link Farm Detection
Page Rank Link Farm DetectionPage Rank Link Farm Detection
Page Rank Link Farm Detection
 
Pr
PrPr
Pr
 
Evaluation of Web Search Engines Based on Ranking of Results and Features
Evaluation of Web Search Engines Based on Ranking of Results and FeaturesEvaluation of Web Search Engines Based on Ranking of Results and Features
Evaluation of Web Search Engines Based on Ranking of Results and Features
 
SEO Web 2.0 Era - Johns Hopkins University
SEO Web 2.0 Era - Johns Hopkins UniversitySEO Web 2.0 Era - Johns Hopkins University
SEO Web 2.0 Era - Johns Hopkins University
 
Page rank
Page rankPage rank
Page rank
 
Dm page rank
Dm page rankDm page rank
Dm page rank
 

Mais de Silicon

Research professional activity network analysis2
Research professional activity network analysis2Research professional activity network analysis2
Research professional activity network analysis2Silicon
 
Research professional activity network analysis
Research professional activity network analysisResearch professional activity network analysis
Research professional activity network analysisSilicon
 
Web mining
Web miningWeb mining
Web miningSilicon
 
Neural network
Neural networkNeural network
Neural networkSilicon
 
Data mining
Data miningData mining
Data miningSilicon
 
Data classification
Data classificationData classification
Data classificationSilicon
 

Mais de Silicon (6)

Research professional activity network analysis2
Research professional activity network analysis2Research professional activity network analysis2
Research professional activity network analysis2
 
Research professional activity network analysis
Research professional activity network analysisResearch professional activity network analysis
Research professional activity network analysis
 
Web mining
Web miningWeb mining
Web mining
 
Neural network
Neural networkNeural network
Neural network
 
Data mining
Data miningData mining
Data mining
 
Data classification
Data classificationData classification
Data classification
 

Page rank and hyperlink

  • 1. PageRank and Hyperlink- Induced Topic Search in Web Structure Mining Presented By Priyabrata Satapathy
  • 2. Plan of My work(I)  Learn Basic Knowledge of Web structure  Hub  Authority  Link analysis  PageRank  HITS 2 Anand Bihari
  • 3. Plan of My work(II)  Literature Survey on PageRank and HITS in Web Structure Mining.  Defining Problem (PageRank and HITS).  Proposing/ Designing a new Algorithm for Computing a PageRank of web page.  Simulation and Performance Analysis of proposed Algorithm. 3 Anand Bihari
  • 4. Outline  Introduction  Basic Concepts of  Web Structure  Hub and Authority  PageRank  HITS  Conclusion  Future Work  References 4 Anand Bihari
  • 5. Introduction  World Wide Web is distributed by numerous Web sites around the world, a global information system.  Web servers can potentially host millions of pages which make the number of web pages extremely difficult to track.  Web networks like the thousands of interconnected, intertwined with the cells organized in a complex structure.  Each Web site also contains a number of Web pages.  It contains the following three parts;  Body of the page,  The page contains hypertext markup language and  Hyperlinks between Web pages. 5 Anand Bihari
  • 6. Web Mining  Web mining can generally be divided into three categories:  Web content mining,  Web structure mining  Web usage mining 6 Anand Bihari
  • 7. Web Structure Mining  Web structure mining is the main content of hyperlink analysis, that is, by analyzing the links between pages to study the relationship between the reference pages to find useful patterns, improve search quality.  Structure mining is the site with one page to another page from a link diagram. 7 Anand Bihari
  • 8. Simple Web Link Graph Page A Page B A Page C Page D 8 Anand Bihari
  • 9. Hub  A hub is a page with many out-links. Authority  An authority is a page with many in-links. 9 Anand Bihari
  • 10. Hubs and Authorities on the Internet Hubs Authorities  Authorities and Hubs have a mutual reinforcement relationship.  A good hub increases the authority weight of the pages it points.  A good authority increases the hub weight of the pages that point to it. 10 Anand Bihari
  • 11. Link Analysis There are two famous link analysis methods: 1.PageRank Algorithm 2.HITS Algorithm 11 Anand Bihari
  • 12. PageRank  The heart of Google’s searching software is PageRank.  A system for ranking web pages developed by Larry Page and Sergey Brin at Stanford University in 1996.  Based on the idea of a ’random surfer’  PageRank is a static ranking of Web pages.  PageRank is based on the measure of prestige in social networks,  The PageRank value of each page can be regarded as its prestige. 12 Anand Bihari
  • 13. PageRank  From the perspective of prestige, we use the following to derive the PageRank algorithm.  A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, the more in-links that a page “ i “ receives, the more prestige the page “ i “ has.  Pages that point to page “ i “also have their own prestige scores. A page with a higher prestige score pointing to “ i “ is more important than a page with a lower prestige score pointing to “ i.” In other words, a page is important if it is pointed to by other important pages. 13 Anand Bihari
  • 14. PageRank  In-links of page i: These are the hyperlinks that point to page “ i “ from other pages. Usually, hyperlinks from the same site are not considered.  Out-links of page i: These are the hyperlinks that point out to other pages from page “ i “. Usually, links to pages of the same site are not considered. A B Website 1 Website 2 14 Anand Bihari
  • 15. PageRank Algorithm The PageRank of a web page is therefore calculated as a sum of the PageRanks of all pages linking to it (its incoming links), divided by the number of out links on each of those pages (its outgoing links). Where:  PR(A) is the PageRank of page A,  PR(Ti) is the PageRank of pages Ti which link to page A,  C(Ti) is the number of outbound links on page Ti  d is a damping factor which can be set between 0 and 1. It depends on the number of clicks, usually set to 0.85.  n is the number of inlinks of page A. It’s obvious that the PageRank algorithm does not rank the whole website, but it’s determined for each page individually. Furthermore, the PageRank of page A is recursively defined by the PageRank of those pages which link to page A 15 Anand Bihari
  • 16. A B A The Characteristics of PageRank C D We regard a small web consisting of four pages A, B, C and D, whereby page A links to the pages B ,C and D, page B links to page C , page C links to page A and page D links to page C. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. PR(A) = 0.5 + 0.5 ( PR(C)) PR(B) = 0.5 + 0.5 ( PR(A)/3) PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) ) PR(D) = 0.5 + 0.5 ( PR(A)/3 ) We get the following PageRank values for the single pages: PR(A) = 12/10 = 1.2 PR(B) = 7/10 = 0.7 PR(C) = 14/10 = 1.4 16 PR(D) = 7/10 = 0.7 Anand Bihari
  • 17. The Iterative Computation of PageRank  The Google search engine uses an approximative, iterative computation of PageRank values.  This means that each page is assigned an initial starting value.  The iteration ends when the PageRank value do not change much or equal. 17 Anand Bihari
  • 18. The Iterative Computation of PageRank Algorithm General PageRank equation is PR(A)=(1-d)+d(PR(T1)/C(T1)+-------------+PR(Tn)/C(Tn)) Iteration Algorithm Set PR  [ R1,R2,……………,Rn] where R is some initial rank of page and n is the number of pages in the graph. d  0.5 i1 Do Pri (A) (1-d) + d (Pri-1(T1)/C(T1) +… +Pri-1(Tn)/C(Tn)) k  | PRi (A) – Pri-1(A)| i  i+1 While k < e , where e is a small number indicating the convergence threshold Return PR 18 Anand Bihari
  • 19. The Iterative Computation of PageRank (example) Let initial PageRank value of each page is 1 Iteration PR(A) PR(B) PR(C) PR(D) 0 1 1 1 1 1 1 0.6667 1.3332 0.6667 2 1.1666 0.6944 1.3888 0.6944 3 1.1944 0.6990 1.3980 0.6990 4 1.1990 0.6998 1.3996 0.6998 5 1.1998 0.6999 1.3998 0.6999 6 1.1999 0.6999 1.3998 0.6999 7 1.1999 0.6999 1.3998 0.6999 The sum of all pages' PageRanks still converges to the total number of web pages. So the average PageRank of a web page is 1. 19 Anand Bihari
  • 20. Effects of Inbound Links(I)  Each additional inbound link for a web page always increases that page's PageRank.  One may assume that an additional inbound link from page X increases the PageRank of page A by d PR(X) / C(X) X PR(A)=0.5+0.5(PR(X)+PR(C)) A B A PR(B) = 0.5 + 0.5 ( PR(A)/3) PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B) + PR(D) ) C D PR(D) = 0.5 + 0.5 ( PR(A)/3 ) 20 Anand Bihari
  • 21. Effects of Inbound Links(II) Let PR(X) = 10. We get the following PageRank values for the single pages: PR(A) = 31/5 = 6.2 PR(B) = 23/15 = 1.53 PR(C) = 46/15 = 3.067 PR(D) = 23/15 = 1.53 We see that the initial effect of the additional inbound link of page A, which was given by d PR(X) / C(X) = 0.5 10 / 1 = 5 Hence page A will have an even higher PageRank benefit from its additional inbound link. 21 Anand Bihari
  • 22. Effect of outbound Links(I)  Since PageRank is based on the linking structure of the whole web.  it is inescapable that if the inbound links of a page influence its PageRank, its outbound links do also have some impact.  In this graph Page B have an additional outbound links.  Then PageRank Value of A B A PR(A)=0.5+0.5(PR(C)) PR(B)=0.5+0.5(PR(A)/3) C D PR(C)= 0.5+0.5(PR(A)/3+PR(B)/2+PR(D)) PR(D)=0.5+0.5(PR(A)/3+PR(B)/2) 22 Anand Bihari
  • 23. Effect of outbound Links(II) We get the following PageRank values for the single pages: PR(A) = 1.14 PR(B) = 0.753 PR(C) = 0.8796 PR(D) = 1.31805 The total PageRank of all pages’ = 4. Hence, adding a link has no effect on the total PageRank of the web. Additionally, the PageRank of page D is increased and the PageRank of Page A and C are decereased. 23 Anand Bihari
  • 24. The Effect of the Number of Pages
    - An additional page increases the total PageRank of all pages on the web by one, because the sum of all PageRanks equals the total number of web pages.
  • 25. How to Increase the PageRank of a Website
    - Add new pages to your website (as many as you can).
    - Swap links with websites that have a high PageRank value.
    - Raise the number of inbound links (for example, advertise your website on other sites).
  • 26. HITS
    - HITS stands for Hyperlink-Induced Topic Search.
    - It was developed by Jon Kleinberg.
    - HITS is search-query dependent.
    - When the user issues a search query, HITS first expands the list of relevant pages returned by a search engine and then produces two rankings of the expanded set of pages: an authority ranking and a hub ranking.
    - It uses hubs and authorities to define a recursive relationship between web pages.
  • 27. HITS Algorithm (I)
    - HITS depends on the query words. It first invokes a traditional search engine to get a set of pages related to the query, and then expands the set with the pages that link to them or are linked from them. After that, HITS finds the top hubs and authorities by iterative calculation. All of this processing is done online, at query time.
    - R is the root set returned by the query, and S is the base set, which extends R to cover all linked pages.
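The expansion from the root set R to the base set S can be sketched as follows. This is an illustrative sketch only: the function name `expand_root_set`, the dictionary-of-outbound-links representation, and the page names are all hypothetical, and the sketch does not model how the search engine produces R.

```python
def expand_root_set(root_set, out_links):
    """Return the base set S: R plus pages R links to and pages that link into R."""
    base_set = set(root_set)
    for page, targets in out_links.items():
        if page in root_set:
            base_set.update(targets)               # pages pointed to by R
        elif any(target in root_set for target in targets):
            base_set.add(page)                     # pages pointing to R
    return base_set


# Hypothetical link graph: p1 is the only page returned by the query.
out_links = {"p1": ["p2"], "p3": ["p1"], "p4": ["p5"]}
base = expand_root_set({"p1"}, out_links)
print(sorted(base))
```

Here p2 joins S because p1 links to it, and p3 joins because it links to p1; p4 and p5 are unrelated to R and stay out.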
  • 28. HITS Algorithm (II)
    - Let the authority score of page p be $a_p$ and its hub score be $h_p$. The mutual reinforcement relationship between the two scores is:
      $$a_p = \sum_{q \to p} h_q, \qquad h_p = \sum_{p \to q} a_q$$
    - Here q → p means that page q has a hyperlink pointing to page p. The calculation is iterated until the results converge; the final output of the HITS algorithm is a set of pages with large hub weights and a set of pages with large authority weights.
  • 29. HITS Algorithm (III)
    - Let A be the adjacency matrix of the root set R, and denote the authority weight vector by a and the hub weight vector by h, where
      $$a = (a_1, a_2, \ldots, a_n)^T, \qquad h = (h_1, h_2, \ldots, h_n)^T$$
      Then
      $$a = A^T h, \qquad h = A a$$
    - The computation of authority scores and hub scores is basically the same as the computation of the PageRank scores using the iteration method. If we use $a_k$ and $h_k$ to denote the authority and hub scores at the k-th iteration, the iterative processes for generating the final solutions are
      $$a_k = A^T A \, a_{k-1}, \qquad h_k = A A^T h_{k-1}$$
      starting with
      $$a_0 = h_0 = (1, 1, \ldots, 1)^T$$
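The iteration a = Aᵀh, h = Aa can be sketched in plain Python. Two things here are assumptions rather than part of the slides: the scores are rescaled to sum to 1 after every iteration (without some normalization the raw values grow without bound), and the function name `hits` and the iteration count are illustrative. The adjacency matrix is the four-page example used in the next slide.

```python
def hits(adj, iterations=50):
    """Power iteration for HITS on an adjacency matrix given as a list of rows."""
    n = len(adj)
    authority = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # a = A^T h: a page's authority is the sum of hub scores of pages linking to it.
        authority = [sum(adj[q][p] * hub[q] for q in range(n)) for p in range(n)]
        # h = A a: a page's hub score is the sum of authority scores of pages it links to.
        hub = [sum(adj[p][q] * authority[q] for q in range(n)) for p in range(n)]
        # Assumed normalization: rescale each vector to sum to 1.
        a_total, h_total = sum(authority), sum(hub)
        authority = [x / a_total for x in authority]
        hub = [x / h_total for x in hub]
    return authority, hub


# Rows and columns are ordered A, B, C, D.
adjacency = [
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
authority, hub = hits(adjacency)
print("authority:", [round(x, 4) for x in authority])
print("hub:      ", [round(x, 4) for x in hub])
```

With this normalization, page C ends up with the largest authority score and page A with the largest hub score.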
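  • 30. HITS Example
    [Figure: link graph of pages A, B, C and D]
    The adjacency matrix of the graph and its transpose are
    $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$
    Assume the initial hub and authority weight vectors are
    $$h = a = (1, 1, 1, 1)^T$$
    We compute the authority weight vector by $a = A^T h$ and the hub weight vector by $h = A a$, each from the initial all-ones vectors:
    $$a = A^T h = (1, 1, 3, 1)^T, \qquad h = A a = (3, 1, 1, 1)^T$$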
  • 31. HITS Example (cont.)
    - Hub weights: page A = 3, page B = 1, page C = 1, page D = 1.
    - Authority weights: page A = 1, page B = 1, page C = 3, page D = 1.
    Hence, after this first iteration from all-ones initial weights, the hub weight of a page is the number of pages it links to, and the authority weight of a page is the number of pages that link to it.
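A short check of this example: with all-ones starting vectors, the first hub and authority weights are just the row sums and column sums of the adjacency matrix. The sketch below uses the matrix from the example (rows and columns ordered A, B, C, D); the variable names are illustrative. It also computes one further authority update to show that the match with link counts holds only for the first step.

```python
adjacency = [
    [0, 1, 1, 1],   # A links to B, C, D
    [0, 0, 1, 0],   # B links to C
    [1, 0, 0, 0],   # C links to A
    [0, 0, 1, 0],   # D links to C
]

# h = A * (1, 1, 1, 1)^T: row sums, i.e. the number of outbound links per page.
first_hub = [sum(row) for row in adjacency]

# a = A^T * (1, 1, 1, 1)^T: column sums, i.e. the number of inbound links per page.
first_authority = [sum(column) for column in zip(*adjacency)]

# One more authority update, a = A^T h, using the hub weights just computed.
second_authority = [
    sum(adjacency[q][p] * first_hub[q] for q in range(4)) for p in range(4)
]

print("hub weights:      ", first_hub)
print("authority weights:", first_authority)
print("next authority:   ", second_authority)
```

The second authority vector weights each inbound link by the hub score of its source, so it no longer equals the plain in-link count.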
  • 32. Conclusion
    - Studied the basic concepts of hyperlink analysis.
    - Studied the PageRank technique.
    - Studied the HITS technique.
  • 33. Future Work
    - Study hyperlink analysis techniques.
    - Literature survey on hyperlink analysis and other related topics.
    - Define the problem in PageRank and HITS.
    - Propose a new algorithm or improve the PageRank and HITS algorithms.
    - Simulation and performance analysis of the proposed model.
  • 34. Future Literature Survey

    Title | Name of Journal/Conference | Publication Year
    Mining web informative structures and contents based on entropy analysis | IEEE Transactions on Knowledge and Data Engineering | 2004
    Wisdom: web intra page informative structure mining based on document object model | IEEE Transactions on Knowledge and Data Engineering | 2005
    Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining | 2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation | 2010
    Design and implementation of a web structure mining algorithm using breadth first search strategy for academic search application | International Conference on Internet Technology and Secured Transactions | 2011
  • 35. References
    - Bing Liu, "Web Data Mining", Springer International Edition.
    - IEEE conference paper, "Research on PageRank and Hyperlink-Induced Topic Search in Web Structure Mining".
    - Websites: Google, Wikipedia, http://pr.efactory.de/
    - www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html
  • 36. Thank You