Linear Algebra in Use: Ranking Web Pages with an Eigenvector

Maia Bittner, Yifei Feng




Abstract—Google PageRank is an algorithm that uses the underlying, hyperlinked structure of the web to determine the theoretical number of times a random web surfer would visit each page. Google converts these numbers into probabilities, and uses these probabilities as each web page’s relative importance. Then, the most important pages for a given query can be returned first in search results. In this paper, we focus on the math behind PageRank, which includes eigenvectors and Markov Chains, and on explaining how it is used to rank web pages.

Index Terms—Google PageRank, Eigenvector, Eigenvalue, Markov Chain

I. INTRODUCTION

The internet is humanity’s largest collection of information, and its democratic nature means that anyone can contribute more information to it. Search engines help users sort through the billions of available web pages to find the information that they are looking for. Most search engines use a two-step process to return web pages based on the user’s query. The first step involves finding which of the pages the search engine has indexed are related to the query, either by containing the words in the query, or by more advanced means that use semantic models. The second step is to order this list of relevant pages according to some criterion. For example, the first web search engines, launched in the early 90s, used text-based matching systems as their criterion for ordering returned results by relevancy. This ranking method often returned exact matches on unauthoritative, poorly written pages ahead of results that the user could trust. Even worse, this system was easy to exploit by page owners, who could fill their pages with irrelevant words and phrases in the hope of ranking highly. These problems prompted researchers to investigate more advanced methods of ranking.

Larry Page and Sergey Brin were researching a new kind of search engine at Stanford University when they had the idea that pages could be ranked by link popularity. The underlying social basis of their idea was that more reputable pages are linked to by other pages more often. Page and Brin developed an algorithm that could quantify this idea; in 1998 they christened the algorithm PageRank [2] and published their first paper on it. Shortly afterwards, they founded Google, a web search engine that uses PageRank to help rank the returned results. Google’s famously useful search results [3] helped it reach almost $29 billion in revenue in 2010 [1]. The original algorithm organized the indexed web pages such that the links between them are used to construct the probability of a random web surfer navigating from one page to another. This system can be characterized as a Markov Chain, a mathematical model described below, in order to take advantage of its convenient properties.

In this paper, we will explain how the interlinking structure of the web and the properties of Markov Chains can be used to quantify the relative importance of each indexed page. We will examine Markov Chains, eigenvectors, and the power iteration method, as well as some of the problems that arise when using this system to rank web pages.

A. Markov Chains

Markov chains are mathematical models that describe particular types of systems. For example, we can construe the number of students at Olin College who are sick as a Markov Chain if we know how likely it is that a student will become sick. Let us say that if a student is feeling well, she has a 5% chance of becoming sick the next day, and that if she is already sick, she has a 35% chance of feeling better tomorrow. In our example, a student can only be healthy or sick; these two states are called the state space of the system. In addition, we’ve decided to only ask how the students are feeling in the morning, and their health on any day only depends on how they were feeling the previous morning. This constant, discrete increase in time makes the system time-homogeneous. We can generate a set of linear equations that will describe how many students at Olin College are healthy and sick on any given day. If we let m_k denote the number of healthy students and n_k denote the number of sick students at morning k, then we get the following two equations:

\[ m_{k+1} = .95\,m_k + .35\,n_k \]
\[ n_{k+1} = .05\,m_k + .65\,n_k \]

Putting this system of linear equations into matrix notation, we get:

\[ \begin{bmatrix} m_{k+1} \\ n_{k+1} \end{bmatrix} = \begin{bmatrix} .95 & .35 \\ .05 & .65 \end{bmatrix} \begin{bmatrix} m_k \\ n_k \end{bmatrix} \tag{1} \]

We can take this matrix full of probabilities and call it P, the transition matrix.

\[ P = \begin{bmatrix} .95 & .35 \\ .05 & .65 \end{bmatrix} \tag{2} \]
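As a concrete sketch, the transition step can be computed with a few lines of numpy (an illustrative aside; the student counts of 270 healthy and 30 sick are the example used later in this section):

```python
import numpy as np

# Transition matrix P from Eqn. (2): columns are today's state
# (healthy, sick); rows are tomorrow's state.
P = np.array([[0.95, 0.35],
              [0.05, 0.65]])

# 270 healthy students and 30 sick students today.
x = np.array([270.0, 30.0])

# One morning later: P @ x gives tomorrow's counts,
# 267 healthy and 33 sick.
tomorrow = P @ x
print(tomorrow)
```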
second row indicates the students who will be sick tomorrow. The intersecting elements of the transition matrix represent the probability that a student will transition from that column state to the row state. So we can see that p_{1,1} indicates that 95% of the students who are healthy today will be healthy tomorrow. The total number of students who will be healthy tomorrow is represented by the first row, so that it is 95% of the students who are healthy today plus 35% of the students who are sick today. Similarly, the second row shows us the number of students who will be sick tomorrow: 5% of the students who are healthy today plus 65% of the students who are sick today.

You can see that each column sums to one, to account for 100% of students who are sick and 100% of students who are healthy in the current state. Square matrices like this, that have nonnegative real entries and where every column sums to one, are called column-stochastic.

We can find the total number of students who will be in a state on day k + 1 by multiplying the transition matrix by a vector containing how many students were in each state on day k.

\[ P x_k = x_{k+1} \tag{3} \]

For example, if 270 students at Olin College are healthy today, and 30 are sick, we can find the state vector for tomorrow:

\[ \begin{bmatrix} .95 & .35 \\ .05 & .65 \end{bmatrix} \begin{bmatrix} 270 \\ 30 \end{bmatrix} = \begin{bmatrix} 267 \\ 33 \end{bmatrix} \tag{4} \]

which shows that tomorrow, 267 students will be healthy and 33 students will be sick. To find the next day’s values, you can multiply again by the transition matrix:

\[ \begin{bmatrix} .95 & .35 \\ .05 & .65 \end{bmatrix} \begin{bmatrix} 267 \\ 33 \end{bmatrix} = \begin{bmatrix} 265.2 \\ 34.8 \end{bmatrix} \tag{5} \]

which is the same as

\[ \begin{bmatrix} .95 & .35 \\ .05 & .65 \end{bmatrix}^{2} \begin{bmatrix} 270 \\ 30 \end{bmatrix} = \begin{bmatrix} 265.2 \\ 34.8 \end{bmatrix} \tag{6} \]

So we can see that in order to find m_k and n_k, we can multiply P^k by the state vector containing m_0 and n_0, as in the equation below:

\[ \begin{bmatrix} m_k \\ n_k \end{bmatrix} = \begin{bmatrix} .95 & .35 \\ .05 & .65 \end{bmatrix}^{k} \begin{bmatrix} m_0 \\ n_0 \end{bmatrix} \tag{7} \]

If we continue to multiply the state vector by the transition matrix for very high values of k, we will see that it will eventually converge upon a steady state, regardless of initial conditions. This is represented by vector q in Eqn. (8).

\[ P q = q \tag{8} \]

Being able to find this steady-state vector is the main advantage of using a column-stochastic matrix to model a system. Column-stochastic transition matrices are always used to represent the known probabilities of transitioning between states in Markov Chains. To model a system as a Markov Chain, it must be a discrete-time process that has a finite state space, and the probability distribution for any state must depend only on the previous state. Every situation that can be classified as a Markov Chain has these steady-state values that the system will remain constant at, regardless of the initial state. This steady-state vector is a specific example of an eigenvector, explained below.

B. Eigenvalues and Eigenvectors

An eigenvector is a nonzero vector x that, when multiplied by a matrix A, only scales in length and does not change direction, except for potentially a reversal in direction.

\[ A x = \lambda x \tag{9} \]

The corresponding amount that an eigenvector is scaled by, λ, is called its eigenvalue. There are several techniques to find the eigenvalues and eigenvectors of a matrix. We will demonstrate one technique below with matrix A.

\[ A = \begin{bmatrix} 1 & 2 & 5 \\ 0 & 3 & 0 \\ 0 & 0 & 4 \end{bmatrix} \]

How to find the eigenvalues for matrix A: We know that λ is an eigenvalue of A if and only if the equation

\[ A x = \lambda x \tag{10} \]

has a nontrivial solution. This is equivalent to finding λ such that

\[ (A - \lambda I) x = 0 \tag{11} \]

The above equation has a nontrivial solution when the determinant of A − λI is zero.

\[ \det(A - \lambda I_3) = \begin{vmatrix} 1-\lambda & 2 & 5 \\ 0 & 3-\lambda & 0 \\ 0 & 0 & 4-\lambda \end{vmatrix} = (4-\lambda)(1-\lambda)(3-\lambda) = 0 \]

Solving for λ, we get that the eigenvalues are λ₁ = 1, λ₂ = 3, and λ₃ = 4. If we solve Av_i = λ_i v_i, we will get the corresponding eigenvector for each eigenvalue:

\[ v_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad v_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \quad v_3 = \begin{bmatrix} 5 \\ 0 \\ 3 \end{bmatrix} \]

This means that if v₂ is transformed by A, the result will scale v₂ by its eigenvalue, 3.

You can see in Eqn. (8) that the steady state of a Markov Chain has an eigenvalue of 1. This is why those steady-state vectors are a special case of eigenvectors. Because they are column-stochastic, all transition matrices of Markov Chains will have an eigenvalue of 1 (we invite the reader to prove this in Exercise 4). A system having an eigenvalue of 1 is the same as it having a steady state.

In some matrices, we may get repeated roots when solving det(A − λI) = 0. We will demonstrate this for the column-stochastic matrix P:

\[ P = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & \frac{1}{2} \\ 0 & 0 & 0 & 0 & \frac{1}{2} \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix} \]
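Eigenvalues and eigenvectors like those of the 3 × 3 example A above can also be checked numerically; here is a short numpy sketch (an illustrative aside, not part of the hand derivation):

```python
import numpy as np

# The 3x3 example matrix A from Section I-B.
A = np.array([[1.0, 2.0, 5.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 4.0]])

# numpy returns the eigenvalues and a matrix whose columns
# are the corresponding eigenvectors.
vals, vecs = np.linalg.eig(A)
print(sorted(vals.real))   # eigenvalues 1, 3, and 4

# Verify the defining property A v = lambda v for each pair.
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)
```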
Figure 1. A small web of 4 pages, connected by directional links

To find the eigenvalues, solve:

\[ \det(P - \lambda I_5) = \begin{vmatrix} -\lambda & 1 & 0 & 0 & 0 \\ 1 & -\lambda & 0 & 0 & 0 \\ 0 & 0 & -\lambda & 1 & \frac{1}{2} \\ 0 & 0 & 0 & -\lambda & \frac{1}{2} \\ 0 & 0 & 1 & 0 & -\lambda \end{vmatrix} = -\frac{1}{2}(\lambda-1)^2(\lambda+1)(2\lambda^2+2\lambda+1) = 0 \]

When we solve the characteristic equation, we find that the five eigenvalues are: λ₁ = 1, λ₂ = 1, λ₃ = −1, λ₄ = −1/2 − i/2, and λ₅ = −1/2 + i/2. Since 1 appears twice as an eigenvalue, we say that it has algebraic multiplicity of 2. The number of linearly independent eigenvectors associated with the eigenvalue 1 is called the geometric multiplicity of λ = 1. The reader can confirm that in this case, λ = 1 has geometric multiplicity of 2 with associated eigenvectors x and y.

\[ x = \begin{bmatrix} \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad y = \begin{bmatrix} 0 \\ 0 \\ \frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix} \]

We can see that when the transition matrix of a Markov chain has geometric multiplicity greater than 1 for the eigenvalue 1, it is unclear which independent eigenvector should be used to represent the steady state of the system.

II. PAGERANK

When the founders of Google created PageRank, they were trying to discern the relative authority of web pages from the underlying structure of links that connects the web. They did this by calculating an importance score for each web page. Given a webpage, call it page k, we can use x_k to denote the importance of this page among the total number of n pages. There are many different ways that one could calculate an importance score. One simple and intuitive way to do page ranking is to count the number of links from other pages to page k, and assign that number as x_k. We can think of each link as being one vote for the page it links to, and of the number of votes a page gets as showing the importance of the page. In the example network of Figure 1, there are four webpages. Page 1 is linked to by pages 2 and 4, so its importance score is x₁ = 2. In the same way, we can get x₂ = 3, x₃ = 1, x₄ = 2. Page 2 has the highest importance score, indicating that page 2 is the most important page in this web.

However, this approach has several drawbacks. First, the pages that have more links to other pages would have more votes, which means that a website can easily gain more influence by creating many links to other pages. Second, we would expect that a vote from an important page should weigh more than a vote from an unimportant one, but every page’s vote is worth the same amount with this method. A way to fix both of these problems is to give each page the amount of voting power that is equivalent to its importance score. So webpage k with an importance score of x_k has a total voting power of x_k. Then we can equally distribute x_k to all the pages it links to. We can define the importance score of a page as the sum of all the weighted votes it gets from the pages that link to it. So if webpage k has a set of pages S_k linking to it, we have

\[ x_k = \sum_{j \in S_k} \frac{x_j}{n_j} \tag{12} \]

where n_j is the number of links from page j. If we apply this method to the network of Figure 1, we get a system of linear equations:

\[ x_1 = \frac{x_2}{1} + \frac{x_4}{2} \]
\[ x_2 = \frac{x_1}{3} + \frac{x_3}{2} + \frac{x_4}{2} \]
\[ x_3 = \frac{x_1}{3} \]
\[ x_4 = \frac{x_1}{3} + \frac{x_3}{2} \]

which can be written in the matrix form x = Lx, where x = [x₁, x₂, x₃, x₄]^T and

\[ L = \begin{bmatrix} 0 & 1 & 0 & \frac{1}{2} \\ \frac{1}{3} & 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{3} & 0 & 0 & 0 \\ \frac{1}{3} & 0 & \frac{1}{2} & 0 \end{bmatrix} \]

L is called the link matrix of this network system since it encapsulates the links between all the pages in the system. Because we’ve evenly distributed x_k to each of the pages k links to, the link matrix is always column-stochastic. As we defined earlier, vector x contains the importance scores of all web pages. To find these scores, we can solve Lx = x. You’ll notice that this looks similar to Eqn. (8), and indeed, we can transform this problem into finding the eigenvector with eigenvalue λ = 1 for the matrix L! For matrix L, the eigenvector for λ = 1 is [0.387, 0.290, 0.129, 0.194]^T. So we know the importance score of each page is x₁ ≈ 0.387, x₂ ≈ 0.290, x₃ ≈ 0.129, x₄ ≈ 0.194. Note that with this more sophisticated method, page 1 has the highest importance score instead of page 2. This is because page 2 only links to page 1, so it casts its entire vote to page 1, boosting up its score. Knowing that the link matrix is a
column-stochastic matrix, let us now look at the problem of ranking internet pages in terms of a Markov Chain system.

For a network with n pages, the ith entry of the n × 1 vector x_k denotes the probability of visiting page i after k clicks. The link matrix L is the transition matrix such that the entry l_ij is the probability of clicking on a link to page i when on page j. Finding the importance score of a page is the same as finding its entry in the steady state vector of the Markov chain that describes the system. For example, say we start from page 1, so that the vector that represents our beginning state is x₀ = [1, 0, 0, 0]^T. To find the probability of ending up on each web page after n clicks, we compute:

\[ x_n = L^n x_0 \tag{13} \]

where x_n represents the state after n clicks (the probability of being on each page), and L is the link matrix, or transition matrix. So by calculating the powers of L, we can determine the steady state vector. This process is called the power iteration method, and it converges on an estimate of the eigenvector for the greatest eigenvalue, which is always 1 in the case of a column-stochastic matrix. For example, by raising the link matrix L to the 25th power, we have

\[ B \approx \begin{bmatrix} 0.387 & 0.387 & 0.387 & 0.387 \\ 0.290 & 0.290 & 0.290 & 0.290 \\ 0.129 & 0.129 & 0.129 & 0.129 \\ 0.194 & 0.194 & 0.194 & 0.194 \end{bmatrix} \]

If we multiply matrix B by our initial state x₀, we get our steady state vector

\[ s = \begin{bmatrix} 0.387 & 0.290 & 0.129 & 0.194 \end{bmatrix}^T \]

which shows the probability of each page being visited. This power iteration process gives us approximately the same result as finding the eigenvector of the link matrix, but is often more computationally feasible, especially for matrices with a dimension of around 1 billion, like Google’s. These computational savings are why this is the method by which Google actually calculates the PageRank of web pages [5]. By taking powers of the matrix to estimate eigenvectors, Google is doing the reverse of many applications, which diagonalize matrices into their eigenvector components in order to take them to a high power. Few applications actually use the power iteration method, since it is only appropriate under a narrow range of conditions. The sparseness of the web’s link matrix, and the need to know only the eigenvector corresponding to the dominant eigenvalue, make PageRank an application well-suited to take advantage of the power iteration method.

A. Subwebs

We now address a problem that this model has when faced with real-world constraints. We refer to this problem as disconnected subwebs, as shown in Figure 2. If there are two or more groups of linked pages that do not link to each other, it is impossible to rank all pages relative to each other. The matrix shown below is the link matrix for the web shown in Figure 2:

Figure 2. Here are two small subwebs, which do not exchange links

\[ A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & \frac{1}{2} \\ 0 & 0 & 0 & 0 & \frac{1}{2} \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix} \]

Mathematically, this problem presents itself as a multidimensional Markov Chain. Link matrix A has a geometric multiplicity of 2 for the eigenvalue of 1, as we showed in Section I-B. It is unclear which of the two associated eigenvectors should be chosen to form the rankings. The two eigenvectors are essentially eigenvectors for each subweb, and each shows rankings which are accurate locally, but not globally.

Google has chosen to solve this problem of subwebs by introducing an element of randomness into the link matrix. Defining a matrix S as an n × n matrix with all entries 1/n and a value m between 0 and 1 as a relative weight, we can replace link matrix A with:

\[ M = (1 - m)A + mS \tag{14} \]

If m > 0, there will be no parts of matrix M which represent entirely disconnected subwebs, as every web surfer has some probability of reaching another page regardless of the page they’re on. In the original PageRank algorithm, an m value of .15 was used. Today, it is speculated by those outside of Google that the value currently in use lies between .1 and .2. The larger the value of m, the more heavily the random matrix is weighted, and the more egalitarian the corresponding PageRank values are. If m is 1, a web surfer has equal probability of getting to any page on the web from any other page, and all links would be ignored. If m is 0, any subwebs contained in the system will cause the eigenvalue 1 to have a multiplicity greater than 1, and there will be ambiguity in the system.

III. DISCUSSION

We’ve shown how systems that can be characterized as a Markov Chain will converge to a steady state, and that these steady state values can be found by using either the characteristic equation or the power iteration method. We then investigated how the web can be viewed as a Markov Chain when the state is which page a web surfer is on, and the hyperlinks between pages dictate the probability of transitioning from one state to another. With this characterization, we can view the steady-state vector as the proportional amount of time a web surfer would spend on every page, and hence a valuable metric for which pages are more important.
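The power iteration and random-surfer adjustment described above can be sketched in a few lines of numpy (an illustrative aside; the matrices are the 4-page link matrix L of Figure 1 and the disconnected-subweb matrix A of Section II-A):

```python
import numpy as np

def power_iteration(M, iters=100):
    """Repeatedly apply the column-stochastic matrix M to a
    probability vector; the result approaches the steady state."""
    n = M.shape[0]
    x = np.full(n, 1.0 / n)   # any probability vector works
    for _ in range(iters):
        x = M @ x
    return x

# Link matrix L for the 4-page web of Figure 1.
L = np.array([[0,   1, 0,   1/2],
              [1/3, 0, 1/2, 1/2],
              [1/3, 0, 0,   0  ],
              [1/3, 0, 1/2, 0  ]])
scores = power_iteration(L)
print(scores)   # approximately [0.387, 0.290, 0.129, 0.194]

# Disconnected-subweb matrix A, patched with the random-surfer
# term of Eqn. (14), M = (1 - m)A + mS, using m = 0.15.
A = np.array([[0, 1, 0, 0, 0  ],
              [1, 0, 0, 0, 0  ],
              [0, 0, 0, 1, 1/2],
              [0, 0, 0, 0, 1/2],
              [0, 0, 1, 0, 0  ]], dtype=float)
m = 0.15
S = np.full((5, 5), 1/5)
M = (1 - m) * A + m * S
ranks = power_iteration(M)
print(ranks)    # a single well-defined ranking over all 5 pages
```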

                            R EFERENCES
[1] O’Neill, Nick. Why Facebook Could Be Worth $200 Billion All Facebook.
    Available at http://www.allfacebook.com/is-facebooks-valuation-hype-a-
    macro-perspective-2011-01
[2] Page, L. and Brin, S. and Motwani, R. and Winograd, T. The pagerank
    citation ranking: Bringing order to the web Technical report, Stanford     Figure 3. A small web of 5 pages
    Digital Library Technologies Project, 1998.
[3] Granka, Laura A. and Joachims, Thorsten and Gay, Geri. Eye-tracking
    analysis of user behavior in WWW search SIGIR, 2004. Available at
    http://www.cs.cornell.edu/People/tj/publications/granketal04a.pdf                                       IV. E XERCISES
[4] Grinstead, Charles M. and Snell, J. Laurie Introduction                        Exercise 1: Find the eigenvalues and correspoding eigenvectors of the
    to Probability: Chapter 11 Markov Chains Available at                          following matrix.
    http://www.dartmouth.edu/ chance/teachingaids/booksarticles/probabilitybook/Chapter11.pdf
                                                                                                               3     0    −1 0
                                                                                                                                  
[5] Ipsen, Ilse, and Rebecca M. Wills Analysis and Computation of Google’s
    PageRank 7th IMACS International Symposium on Iterative Methods in                                      4      −3     2    0 
                                                                                                      A=
    Scientific Computing, Fields Institute, Toronto, Canada, 5 May 2005.                                        0     0     1    0 
    Available at http://www4.ncsu.edu/ ipsen/ps/slidesimacs.pdf                                               −2     0     3    2
                                                                                 Exercise 2: Given the column-stochastic matrix P:
                                                                                                                      0.6   0.3
                                                                                                          P=
                                                                                                                      0.4   0.7
                                                                                 find the steady-state vector for P.

                                                                                 Exercise 3: Create a link matrix for the network with 5 internet pages
                                                                                 in Figure 3, then rank the pages.

                                                                                 Exercise 4: In section I B. we claim that all transition matrices for
                                                                                 Markov chains have 1 as an eigenvalue. Why is this true for every
                                                                                 column-stochastic matrix?
                      V. SOLUTIONS TO EXERCISES

Solution 1: The eigenvalues are λ1 = 2, λ2 = −3, λ3 = 3, and λ4 = 1,
and the corresponding eigenvectors are

         [ 0 ]        [ 0 ]        [  3 ]        [  1 ]
    x1 = [ 0 ],  x2 = [ 1 ],  x3 = [  2 ],  x4 = [  2 ]
         [ 0 ]        [ 0 ]        [  0 ]        [  2 ]
         [ 1 ]        [ 0 ]        [ -6 ]        [ -4 ]
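These eigenpairs can be checked numerically. The sketch below (plain Python; the `matvec` helper is our own, not part of the paper) multiplies A by each claimed eigenvector and confirms that the result is the eigenvalue times the vector.

```python
# Exercise 1's matrix, stored as a list of rows.
A = [[ 3,  0, -1,  0],
     [ 4, -3,  2,  0],
     [ 0,  0,  1,  0],
     [-2,  0,  3,  2]]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# (eigenvalue, eigenvector) pairs claimed in Solution 1.
pairs = [( 2, [0, 0, 0,  1]),
         (-3, [0, 1, 0,  0]),
         ( 3, [3, 2, 0, -6]),
         ( 1, [1, 2, 2, -4])]

for lam, x in pairs:
    # A x must equal lambda * x for an eigenpair.
    assert matvec(A, x) == [lam * xi for xi in x], (lam, x)
print("all four eigenpairs verified")
```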
Solution 2: The steady-state vector for P is

        [ 0.4286 ]
    s = [ 0.5714 ]
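The same answer falls out of the power iteration method described in Section II. A minimal sketch (plain Python; the starting vector and iteration count are our choices): repeatedly apply P to any probability vector until it stops changing.

```python
# Column-stochastic matrix from Exercise 2.
P = [[0.6, 0.3],
     [0.4, 0.7]]

s = [1.0, 0.0]                       # any starting distribution works
for _ in range(100):
    # One step of power iteration: s <- P s.
    s = [P[0][0] * s[0] + P[0][1] * s[1],
         P[1][0] * s[0] + P[1][1] * s[1]]

print([round(si, 4) for si in s])    # -> [0.4286, 0.5714], i.e. [3/7, 4/7]
```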
Solution 3: The link matrix is

         [  0    1   1/3   0    0 ]
         [ 1/3   0   1/3   0    0 ]
    A =  [  0    0    0   1/2   1 ]
         [ 1/3   0   1/3   0    0 ]
         [ 1/3   0    0   1/2   0 ]

and the ranking vector is

    x = [ 1/4  1/6  1/4  1/6  1/6 ]T

so pages 1 and 3 tie for the highest rank, followed by pages 2, 4, and 5.
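As a sanity check, power iteration on this link matrix converges to the same ranking vector. A sketch in plain Python (the uniform starting vector and iteration count are our choices, not from the paper):

```python
# Link matrix for the 5-page web of Figure 3; every column sums to 1.
A = [[0,     1,   1 / 3, 0,     0],
     [1 / 3, 0,   1 / 3, 0,     0],
     [0,     0,   0,     1 / 2, 1],
     [1 / 3, 0,   1 / 3, 0,     0],
     [1 / 3, 0,   0,     1 / 2, 0]]

x = [0.2] * 5                        # start from the uniform distribution
for _ in range(200):
    # One step of power iteration: x <- A x.
    x = [sum(A[i][j] * x[j] for j in range(5)) for i in range(5)]

print([round(xi, 4) for xi in x])    # -> [0.25, 0.1667, 0.25, 0.1667, 0.1667]
```

The result matches the exact ranking vector [1/4, 1/6, 1/4, 1/6, 1/6]ᵀ.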

Solution 4: By definition, every column in a column-stochastic matrix P
contains no negative numbers and sums to 1. As shown in Section I.B., λ
is an eigenvalue of P exactly when det(P − λI) = 0. Set λ = 1: since
every column of P sums to 1, every column of P − I sums to 0. The rows
of P − I therefore add up to the zero row, so the rows are linearly
dependent and det(P − I) = 0. Hence 1 is always an eigenvalue of a
column-stochastic matrix.

Mais conteúdo relacionado

Destaque

Coca-Cola Social Media Case Study
Coca-Cola Social Media Case StudyCoca-Cola Social Media Case Study
Coca-Cola Social Media Case Study
amadrierith
 
Google hummingbird algorithm ppt
Google hummingbird algorithm pptGoogle hummingbird algorithm ppt
Google hummingbird algorithm ppt
Priyodarshini Dhar
 

Destaque (17)

Starbucks Social Media Campaign 2016
Starbucks Social Media Campaign 2016 Starbucks Social Media Campaign 2016
Starbucks Social Media Campaign 2016
 
Coca-Cola Social Media Case Study
Coca-Cola Social Media Case StudyCoca-Cola Social Media Case Study
Coca-Cola Social Media Case Study
 
Social media marketing strategy & plan 2017
Social media marketing strategy & plan 2017Social media marketing strategy & plan 2017
Social media marketing strategy & plan 2017
 
Google algorithms
Google algorithmsGoogle algorithms
Google algorithms
 
Starbucks 2017 Social Media Strategy
Starbucks 2017 Social Media StrategyStarbucks 2017 Social Media Strategy
Starbucks 2017 Social Media Strategy
 
Google Adwords Introduction PPT
Google Adwords Introduction PPTGoogle Adwords Introduction PPT
Google Adwords Introduction PPT
 
Seo 7 step seo process
Seo 7 step seo processSeo 7 step seo process
Seo 7 step seo process
 
Google hummingbird algorithm ppt
Google hummingbird algorithm pptGoogle hummingbird algorithm ppt
Google hummingbird algorithm ppt
 
Searching higher up the funnel
Searching higher up the funnelSearching higher up the funnel
Searching higher up the funnel
 
Introduction to SEO Presentation
Introduction to SEO PresentationIntroduction to SEO Presentation
Introduction to SEO Presentation
 
Starbucks - Competitive Analysis
Starbucks - Competitive AnalysisStarbucks - Competitive Analysis
Starbucks - Competitive Analysis
 
What is Inbound Marketing?
What is Inbound Marketing?What is Inbound Marketing?
What is Inbound Marketing?
 
40 Inspiring Social Media Case Studies
40 Inspiring Social Media Case Studies40 Inspiring Social Media Case Studies
40 Inspiring Social Media Case Studies
 
Search Engine Optimization PPT
Search Engine Optimization PPT Search Engine Optimization PPT
Search Engine Optimization PPT
 
Introduction to SEO
Introduction to SEOIntroduction to SEO
Introduction to SEO
 
2017 Digital Yearbook
2017 Digital Yearbook2017 Digital Yearbook
2017 Digital Yearbook
 
Digital in 2017 Global Overview
Digital in 2017 Global OverviewDigital in 2017 Global Overview
Digital in 2017 Global Overview
 

Semelhante a Google PageRank

Ijsea04031005
Ijsea04031005Ijsea04031005
Ijsea04031005
Editor IJCATR
 
algebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdf
algebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdfalgebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdf
algebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdf
enigmagb9
 
The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?
Kundan Bhaduri
 
System Kendali - Systems and control
System Kendali - Systems and controlSystem Kendali - Systems and control
System Kendali - Systems and control
Rizky Akbar
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data science
Nikhil Jaiswal
 

Semelhante a Google PageRank (20)

Ijsea04031005
Ijsea04031005Ijsea04031005
Ijsea04031005
 
Application of finite markov chain to a model of schooling
Application of finite markov chain to a model of schoolingApplication of finite markov chain to a model of schooling
Application of finite markov chain to a model of schooling
 
algebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdf
algebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdfalgebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdf
algebra lineal ada5c97a-340b-4ae9-afce-b864c9f851df.pdf
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
 
AERA 2022 Presentation
AERA 2022 PresentationAERA 2022 Presentation
AERA 2022 Presentation
 
Bayesian Phylogenetics - Systematics.pptx
Bayesian Phylogenetics - Systematics.pptxBayesian Phylogenetics - Systematics.pptx
Bayesian Phylogenetics - Systematics.pptx
 
The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?
 
Mcs 021 solve assignment
Mcs 021 solve assignmentMcs 021 solve assignment
Mcs 021 solve assignment
 
An application of artificial intelligent neural network and discriminant anal...
An application of artificial intelligent neural network and discriminant anal...An application of artificial intelligent neural network and discriminant anal...
An application of artificial intelligent neural network and discriminant anal...
 
System Kendali - Systems and control
System Kendali - Systems and controlSystem Kendali - Systems and control
System Kendali - Systems and control
 
Bachelor's Thesis
Bachelor's ThesisBachelor's Thesis
Bachelor's Thesis
 
Introduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonIntroduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in Python
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Mcs 021
Mcs 021Mcs 021
Mcs 021
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data science
 
Multivariate Data Analysis Project Report
Multivariate Data Analysis Project ReportMultivariate Data Analysis Project Report
Multivariate Data Analysis Project Report
 
Machine learning
Machine learningMachine learning
Machine learning
 
Secured Ontology Mapping
Secured Ontology Mapping Secured Ontology Mapping
Secured Ontology Mapping
 
Samplying in Factored Dynamic Systems_Fadel.pdf
Samplying in Factored Dynamic Systems_Fadel.pdfSamplying in Factored Dynamic Systems_Fadel.pdf
Samplying in Factored Dynamic Systems_Fadel.pdf
 

Mais de Maia Bittner (9)

Autodesk olin scope mid year report
Autodesk olin scope mid year reportAutodesk olin scope mid year report
Autodesk olin scope mid year report
 
Autodesk olin scope mid year presentation
Autodesk olin scope mid year presentationAutodesk olin scope mid year presentation
Autodesk olin scope mid year presentation
 
2008 Personal Annual report : Dopplr
2008 Personal Annual report : Dopplr2008 Personal Annual report : Dopplr
2008 Personal Annual report : Dopplr
 
Maia Bittner in Olin College promotional brochure
Maia Bittner in Olin College promotional brochureMaia Bittner in Olin College promotional brochure
Maia Bittner in Olin College promotional brochure
 
Nails not wood: focus on building, not writing code
Nails not wood: focus on building, not writing codeNails not wood: focus on building, not writing code
Nails not wood: focus on building, not writing code
 
Final presentation widescreen
Final presentation widescreenFinal presentation widescreen
Final presentation widescreen
 
Senior Design Project
Senior Design ProjectSenior Design Project
Senior Design Project
 
Final Project Presentation for Computer Networking
Final Project Presentation for Computer NetworkingFinal Project Presentation for Computer Networking
Final Project Presentation for Computer Networking
 
Senior Design Project
Senior Design ProjectSenior Design Project
Senior Design Project
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Google PageRank

  • 1. 1 Linear Algebra in Use: Ranking Web Pages with an Eigenvector Maia Bittner, Yifei Feng Abstract—Google PageRank is an algorithm that uses the In this paper, we will explain how the interlinking structure underlying, hyperlinked structure of the web to determine the of the web and the properties of Markov Chains can be used to theoretical number of times a random web surfer would visit each quantify the relative importance of each indexed page. We will page. Google converts these numbers into probabilities, and uses these probabilities as each web page’s relative importance. Then, examine Markov Chains, eigenvectors, and the power iteration the most important pages for a given query can be returned first method, as well as some of the problems that arise when using in search results. In this paper, we focus on the math behind this system to rank web pages. PageRank, which includes eigenvectors and Markov Chains, and on explaining how it is used to rank webpages. A. Markov Chains Google PageRank, Eigenvector, Eigenvalue, Markov Chain Markov chains are mathematical models that describe par- I. I NTRODUCTION ticular types of systems. For example, we can construe the The internet is humanity’s largest collection of information, number of students at Olin College who are sick as a Markov and its democratic nature means that anyone can contribute Chain if we know the how likely it is that a student will more information to it. Search engines help users sort through become sick. Let us say that if a student is feeling well, she has the billions of available web pages to find the information a 5% chance of becoming sick the next day, and that if she is that they are looking for. Most search engines use a two step already sick, she has a 35% chance of feeling better tomorrow. process to return web pages based on the user’s query. 
The first In our example, a student can only be healthy or sick; these step involves finding which of the pages the search engine has two states are called the state space of the system. In addition, indexed are related to the query, either by containing the words we’ve decided to only ask how the students are feeling in the in the query, or by more advanced means that use semantic morning, and their health on any day only depends on how models. The second step is to order this list of relevant pages they were feeling the previous morning. This constant, discrete according to some criterion. For example, the first web search increase in time makes the system time-homogenous. We can engines, launched in the early 90’s, used text-based matching generate a set of linear equations that will describe how many systems as their criterion for ordering returned results by students at Olin College are healthy and sick on any given day. relevancy. This ranking method often resulted in returning If we let mk indicates the number of healthy students and nk exact matches in unauthoritative, poorly written pages before indicates the number of sick students at morning k, then we results that the user could trust. Even worse, this system was get the following two equations: easy to exploit by page owners, who could fill their pages mk+1 = .95mk + .35nk with irrelevant words and phrases, with the hope of ranking nk+1 = .05mk + .65nk highly. These problems prompted researchers to investigate more advanced methods of ranking. Larry Page and Sergey Brin were researching a new kind Putting this system of linear equations into matrix notation, of search engine at Stanford University when they had the we get: idea that pages could be ranked by link popularity. The underlying social basis of their idea was that more reputable mk+1 .95 .35 mk = (1) pages are linked to by other pages more often. 
Page and Brin nk+1 .05 .65 nk developed an algorithm that could quantify this idea, and in We can take this matrix full of probabilities and call it P, the 1998, christened the algorithm PageRank [2] and published transition matrix. their first paper on it. Shortly afterwards, they founded Google, a web search engine that uses PageRank to help rank the .95 .35 returned results. Google’s famously useful search results [3] P= (2) .05 .65 helped it reach almost $29 billion dollars in revenue in 2010 [1]. The original algorithm organized the indexed web pages The columns can be viewed as representing the present state such that the links between them are used to construct the of the system, and the rows can be viewed as representing probability of a random web surfer navigating from one page the future state of the system. The first column accounts for to another. This system can be characterized as a Markov the students who are healthy today, and the second column Chain, a mathematical model described below, in order to take accounts for the students who are sick today, while the first row advantage of their convenient properties. indicates the students who will be healthy tomorrow and the
  • 2. 2 second row indicates the students who will be sick tomorrow. classified as a Markov Chain has these steady-state values that The intersecting elements of the transition matrix represent the system will remain constant at, regardless of the initial the probability that a student will transition from that column state. This steady-state vector is a specific example of an state to the row state. So we can see that p1,1 indicates that eigenvector, explained below. 95% of the students who are healthy today will be healthy tomorrow. The total number of students who will be healthy B. Eigenvalues and Eigenvectors tomorrow is represented by the first row, so that it is 95% of An eigenvector is a nonzero vector x that, when multipled the students who are healthy today plus 35% of the students by a matrix A, only scales in length and does not change who are sick today. Similarly, the second row shows us the direction, except for potentially a reverse in direction. number of students who will be sick tomorrow: 5% of the students who are healthy today plus 65% of the students who Ax = λx (9) are sick today. The corresponding amount that an eigenvector is scaled by, λ, You can see that each column sums to one, to account for is called its eigenvalue. There are several techniques to find the 100% of students who are sick and 100% of students who eigenvalues and eigenvectors of a matrix. We will demonstrate are healthy in the current state. Square matrices like this, that one technique below with matrix A. have nonnegative real entries and where every column sums   to one, are called column-stochastic. 1 2 5 We can find the total number of students who will be in A= 0 3 0  a state on day k + 1 by multiplying the transition matrix by 0 0 4 a vector containing how many students were in each state on How to find the eigenvalues for matrix A: We know that λ is day k. 
a eigenvalue of A if and only if the equation Pxk = xk+1 (3) Ax = λx (10) For example, if 270 students at Olin College are healthy today, and 30 are sick, we can find the state vector for tomorrow: has a nontrivial solution. This is equivalent to finding λ such that: .95 .35 270 267 = (4) (A − λI)x = 0 (11) .05 .65 30 33 The above equation has nontrivial solution when the determi- which shows that tomorrow, 267 students will be healthy and nant of A − λI is zero. 33 students will be sick. To find the next day’s values, you can multiply again by the transition matrix: 1−λ 2 5 det(A − λI3 ) = 0 3−λ 0 .95 .35 267 265.2 0 0 4−λ = (5) .05 .65 33 34.8 = (4 − λ)(1 − λ)(3 − λ) = 0 which is the same as 2 Solving for λ, we get that the eigenvalues are λ1 = 1, λ2 = 3, .95 .35 270 265.2 and λ3 = 4. If we solve for Avi = λi vi , we will get the = (6) .05 .65 30 34.8 corresponding eigenvector for each eigenvalue:       So we can see that in order to find mk and nk , we can 1 4 5 multiply P k by the state vector containing m0 and n0 , as in v1 =  0  , v2 =  −1  , v3 =  0  the equation below: 0 2 3 k mk .95 .35 m0 This means that if v2 is transformed by A, the result will = (7) scale v2 by its eigenvalue, 3. nk .05 .65 n0 You can see in Eqn. (8) that the steady state of a Markov If you continue to multiply the state vector by the transition Chain has an eigenvalue of 1. This is why those steady state matrix for very high values of k, we will see that it will vectors are a special case of eigenvectors. Because they are eventually converge upon a steady state, regardless of initial column-stochastic, all transition matrices of Markov Chains conditions. This is represented by vector q in Eqn. (8). will have an eigenvalue of 1 (we invite the reader to prove this in Exercise 4). A system having an eigenvalue of 1 is the Pq = q (8) same as it having a steady state. 
Being able to find this steady-state vector is the main In some matrices, we may get repeated roots when solving advantage of using a column-stochastic matrix to model a det(A − λI) = 0. We will demonstrate this for the column- system. Column-stochastic transition matrices are always used stochastic matrix P:   to represent the known probabilities of transitioning between 0 1 0 0 0 states in Markov Chains. To model a system as a Markov  1 0 0 0 0   1  Chain, it must be a discrete-time process that has a finite P= 0 0 0 1 2    state space, and the probability distribution for any state must  0 0 0 0 1  2 depend only on the previous state. Every situation that can be 0 0 1 0 0
  • 3. 3 four webpages. Page 1 is linked to by pages 2 and 3, so its importance score is x1 = 2. In the same way, we can get x2 = 3, x3 = 1, x4 = 2. Page 2 has the highest importance score, indicating that page 2 is the most important page in this web. However, this approach has several drawbacks. First, the pages that have more links to other pages would have more votes, which means that a website can easily gain more influence by creating many links to other pages. Second, we would expect that a vote from an important page should weigh more than a vote from an unimportant one, but every page’s vote is worth the same amount with this method. A way to fix both of these problems is to give each page the amount Figure 1. A small web of 4 pages, connected by directional links of voting power that is equivalent to its importance score. So for webpage k with an importance score of xk , it has a total To find the eigenvalues, solve: voting power of xk . Then we can equally distribute xk to all the pages it links to. We can define the importance score of −λ 1 0 0 0 a page as the sum of all the weighted votes it gets from the 1 −λ 0 0 0 pages that link to it. So if webpage k has a set of pages Sk det(P − λI5 ) = 0 0 −λ 1 1 2 linked to it, we have 0 0 −λ 0 1 2 xj 0 0 1 0 −λ xk = (12) nj 1 j∈Sk = − (λ − 1)2 (λ + 1)(2λ2 + 2λ + 1) = 0 2 where nj is the number of links from page j. If we apply this When we solve the characteristic equation, we find that the method to the network of Figure 1, we can get a system of five eigenvalues are: λ1 = 1, λ2 = 1, λ3 = −1, λ4 = − 1 − 2 , i linear equations: 2 1 i x2 x4 λ5 = − 2 + 2 . Since 1 appears twice as an eigenvalue, we x1 = + say that is has algebraic multiplicity of 2. The number of 1 2 individual eigenvectors associated with eigenvalue 1 is called x1 x3 x4 x2 = + + the geometric multiplicity of λ = 1. 
The reader can confirm 3 2 2 x1 that in this case, λ = 1 has geometric multiplicity of 2 with x3 = 3 associated eigenvectors x and y. x1 x3 x4 = +  √2   √2  3 2 √2 − 2 √  2   2   2   2  x=  0 ,y =  0     which can be written in the matrix form x = Lx, where x =  0   0  [x1 , x2 , x3 , x4 ]T and 0 0  0 1 0 2 1  We can see that when transition matrices for Markov chains  1 0 1 1  L= 1 3 2 2  have geometric multiplicity for eigenvalue of 1, it’s unclear  3 0 0 0  1 1 which independent eigenvector should be used to represent 3 0 2 0 the steady-state of the system. L is called the link matrix of this network system since it encapsulates the links between all the pages in the system. II. PAGE R ANK Because we’ve evenly distributed xk to each of the pages When the founders of Google created PageRank, they were k links to, the link matrix is always column-stochastic. As trying to discern the relative authority of web pages from the we defined earlier, vector x contains the importance scores underlying structure of links that connects the web. They did of all web pages. To find these scores, we can solve for this by calculating an importance score for each web page. Lx = x. You’ll notice that this looks similar to Eqn. (8), Given a webpage, call it page k, we can use xk to denote the and indeed, we can transform this problem into finding the importance of this page among the total number of n pages. eigenvector with eigenvalue λ = 1 for the matrix L! For There are many different ways that one could calculate an matrix L, the eigenvector is [0.387, 0.290, 0.129, 0.194]T for importance score. One simple and intuitive way to do page λ = 1. So we know the importance score of each page is ranking is to count the number of links from other pages x1 ≈ 0.387, x2 ≈ 0.290, x3 ≈ 0.129, x4 ≈ 0.194. Note that to page k, and assign that number as xk . 
We can think of with this more sophisticated method, page 1 has the highest each link as being one vote for the page it links to, and of importance score instead of page 2. This is because page the number of votes a page gets as showing the importance 2 only links to page 1, so it casts its entire vote to page of the page. In the example network of Figure 1, there are 1, boosting up its score. Knowing that the link matrix is a
column-stochastic matrix, let us now look at the problem of ranking internet pages in terms of a Markov Chain system. For a network with n pages, the ith entry of the n × 1 vector x_k denotes the probability of visiting page i after k clicks. The link matrix L is the transition matrix such that the entry l_ij is the probability of clicking on a link to page i when on page j. Finding the importance score of a page is the same as finding its entry in the steady state vector of the Markov chain that describes the system. For example, say we start from page 1, so that the vector representing our beginning state is x_0 = [1, 0, 0, 0]^T. To find the probability of being on each web page after n clicks, we compute

   x_n = L^n x_0                                  (13)

where x_n represents the state after n clicks (the probability of being on each page), and L is the link matrix, or transition matrix. So by calculating the powers of L, we can determine the steady state vector. This process is called the power iteration method, and it converges on an estimate of the eigenvector for the greatest eigenvalue, which is always 1 in the case of a column-stochastic matrix. For example, raising the link matrix L to the 25th power gives

       [ 0.387  0.387  0.387  0.387 ]
   B ≈ [ 0.290  0.290  0.290  0.290 ]
       [ 0.129  0.129  0.129  0.129 ]
       [ 0.194  0.194  0.194  0.194 ]

If we multiply matrix B by our initial state x_0, we get our steady state vector

   s = [ 0.387  0.290  0.129  0.194 ]^T

which shows the probability of each page being visited. This power iteration process gives approximately the same result as finding the eigenvector of the link matrix directly, but is often more computationally feasible, especially for matrices with a dimension of around 1 billion, like Google's. These computational savings are why this is the method by which Google actually calculates the PageRank of web pages [5]. By taking powers of the matrix to estimate eigenvectors, Google is doing the reverse of many applications, which diagonalize matrices into their eigenvector components in order to raise them to a high power. Few applications use the power iteration method, since it is appropriate only under a narrow range of conditions. The sparseness of the web's link matrix, and the need to know only the eigenvector corresponding to the dominant eigenvalue, make PageRank well suited to take advantage of the power iteration method.

A. Subwebs

We now address a problem that this model faces under real-world constraints. We refer to this problem as disconnected subwebs, as shown in Figure 2. If there are two or more groups of linked pages that do not link to each other, it is impossible to rank all pages relative to each other.

Figure 2. Here are two small subwebs, which do not exchange links.

The matrix shown below is the link matrix for the web shown in Figure 2:

       [ 0  1  0  0   0  ]
       [ 1  0  0  0   0  ]
   A = [ 0  0  0  1  1/2 ]
       [ 0  0  0  0  1/2 ]
       [ 0  0  1  0   0  ]

Mathematically, this problem poses itself as a multi-dimensional Markov Chain. Link matrix A has a geometric multiplicity of 2 for the eigenvalue 1, as we showed in Section I.B. It is unclear which of the two associated eigenvectors should be chosen to form the rankings. The two eigenvectors are essentially eigenvectors for each subweb, and each gives rankings that are accurate locally, but not globally.

Google has chosen to solve this problem of subwebs by introducing an element of randomness into the link matrix. Defining a matrix S as an n × n matrix with all entries 1/n, and a value m between 0 and 1 as a relative weight, we can replace link matrix A with:

   M = (1 − m)A + mS                              (14)

If m > 0, there will be no parts of matrix M that represent entirely disconnected subwebs, as every web surfer has some probability of reaching another page regardless of the page they are on. In the original PageRank algorithm, an m value of 0.15 was used. Today, it is speculated by those outside of Google that the value currently in use lies between 0.1 and 0.2. The larger the value of m, the more heavily the random matrix is weighted, and the more egalitarian the corresponding PageRank values are. If m is 1, a web surfer has equal probability of getting to any page on the web from any other page, and all links would be ignored. If m is 0, any subwebs contained in the system will cause the eigenvalue 1 to have a multiplicity greater than 1, and there will be ambiguity in the system.

III. DISCUSSION

We've shown how systems that can be characterized as a Markov Chain will converge to a steady state, and that these steady state values can be found by using either the characteristic equation or the power iteration method. We then
investigated how the web can be viewed as a Markov Chain when the state is which page a web surfer is on, and the hyperlinks between pages dictate the probability of transitioning from one state to another. With this characterization, we can view the steady-state vector as the proportional amount of time a web surfer would spend on each page, and hence as a valuable metric for which pages are more important.

REFERENCES

[1] O'Neill, Nick. Why Facebook Could Be Worth $200 Billion. AllFacebook. Available at http://www.allfacebook.com/is-facebooks-valuation-hype-a-macro-perspective-2011-01
[2] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[3] Granka, Laura A., Joachims, Thorsten, and Gay, Geri. Eye-tracking analysis of user behavior in WWW search. SIGIR, 2004. Available at http://www.cs.cornell.edu/People/tj/publications/granketal04a.pdf
[4] Grinstead, Charles M., and Snell, J. Laurie. Introduction to Probability: Chapter 11, Markov Chains. Available at http://www.dartmouth.edu/~chance/teachingaids/booksarticles/probabilitybook/Chapter11.pdf
[5] Ipsen, Ilse, and Wills, Rebecca M. Analysis and Computation of Google's PageRank. 7th IMACS International Symposium on Iterative Methods in Scientific Computing, Fields Institute, Toronto, Canada, 5 May 2005. Available at http://www4.ncsu.edu/~ipsen/ps/slidesimacs.pdf

IV. EXERCISES

Figure 3. A small web of 5 pages.

Exercise 1: Find the eigenvalues and corresponding eigenvectors of the following matrix.

       [  3   0  −1  0 ]
   A = [  4  −3   2  0 ]
       [  0   0   1  0 ]
       [ −2   0   3  2 ]

Exercise 2: Given the column-stochastic matrix P:

   P = [ 0.6  0.3 ]
       [ 0.4  0.7 ]

find the steady-state vector for P.

Exercise 3: Create a link matrix for the network with 5 internet pages in Figure 3, then rank the pages.

Exercise 4: In Section I.B we claim that all transition matrices for Markov chains have 1 as an eigenvalue.
Why is this true for every column-stochastic matrix?
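The power iteration computation behind Equation (13) is short enough to sketch in code. The snippet below (plain Python, no libraries) repeatedly applies a column-stochastic link matrix to a start vector. The 4-page matrix L is an assumed example, not the paper's own (that matrix appears in an earlier section); it was chosen so that its steady state matches the vector s ≈ [0.387, 0.290, 0.129, 0.194]^T quoted in Section II.

```python
# Power iteration sketch: compute x_k = L^k x_0 by repeated
# matrix-vector multiplication, as in Equation (13).

def mat_vec(L, x):
    """Multiply matrix L (a list of rows) by the column vector x."""
    return [sum(L[i][j] * x[j] for j in range(len(x))) for i in range(len(L))]

def power_iteration(L, x0, k):
    """Return x_k = L^k x_0; converges to the dominant eigenvector."""
    x = x0
    for _ in range(k):
        x = mat_vec(L, x)
    return x

# Assumed 4-page column-stochastic link matrix; entry L[i][j] is the
# probability of clicking from page j+1 to page i+1. Its steady state
# is [12/31, 9/31, 4/31, 6/31], matching the s vector in the text.
L = [
    [0.0, 1.0, 0.0, 0.5],
    [1/3, 0.0, 0.5, 0.5],
    [1/3, 0.0, 0.0, 0.0],
    [1/3, 0.0, 0.5, 0.0],
]

x0 = [1.0, 0.0, 0.0, 0.0]          # start on page 1
s = power_iteration(L, x0, 25)     # 25 clicks, as in the text
print([round(v, 3) for v in s])    # → [0.387, 0.29, 0.129, 0.194]
```

Note that the start vector hardly matters: because the dominant eigenvalue is 1 and the others are smaller in magnitude, any probability vector converges to the same steady state.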
V. SOLUTIONS TO EXERCISES

Solution 1: The eigenvalues are λ1 = 2, λ2 = −3, λ3 = 3, and λ4 = 1, and the corresponding eigenvectors are

   x1 = [0, 0, 0, 1]^T,   x2 = [0, 1, 0, 0]^T,
   x3 = [3, 2, 0, −6]^T,  x4 = [1, 2, 2, −4]^T

Solution 2: The steady-state vector for P is

   s = [ 0.4286  0.5714 ]^T

Solution 3: The link matrix is

       [  0   0  1/3   0   1 ]
       [ 1/3  0  1/3   0   0 ]
   A = [  0   1   0   1/2  0 ]
       [ 1/3  0  1/3   0   0 ]
       [ 1/3  0   0   1/2  0 ]

The ranking vector is

   x = [ 1/4  1/6  1/4  1/6  1/6 ]^T

so pages 1 and 3 tie for the highest rank, and pages 2, 4, and 5 tie below them.

Solution 4: By definition, every column in a column-stochastic matrix contains no negative numbers and sums to 1. As shown in Section I.B, the eigenvalues of a matrix A are the numbers λ for which det(A − λI) = 0. Since every column of A sums to 1, every column of A − I sums to 0. The rows of A − I therefore add up to the zero row, so the rows are linearly dependent and det(A − I) = 0. Therefore, 1 is always an eigenvalue of a column-stochastic matrix.
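The numerical solutions above can be spot-checked mechanically: a steady-state (ranking) vector s of a column-stochastic matrix P must satisfy Ps = s. The plain-Python sketch below checks Solutions 2 and 3, with the matrices transcribed from the exercises and solutions.

```python
# Spot-check: a claimed steady-state vector s should be fixed by the
# transition matrix, i.e. mat_vec(P, s) should reproduce s.

def mat_vec(M, x):
    """Multiply matrix M (a list of rows) by the column vector x."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# Exercise 2: P and its claimed steady state s = [3/7, 4/7] ~ [0.4286, 0.5714].
P = [[0.6, 0.3],
     [0.4, 0.7]]
s2 = [3/7, 4/7]
print(mat_vec(P, s2))   # reproduces s2 (up to floating point)

# Exercise 3: the link matrix from Solution 3 and its claimed
# ranking vector x = [1/4, 1/6, 1/4, 1/6, 1/6].
A = [[0,   0,   1/3, 0,   1],
     [1/3, 0,   1/3, 0,   0],
     [0,   1,   0,   1/2, 0],
     [1/3, 0,   1/3, 0,   0],
     [1/3, 0,   0,   1/2, 0]]
x = [1/4, 1/6, 1/4, 1/6, 1/6]
print(mat_vec(A, x))    # reproduces x, confirming Ax = x
```

The same check also confirms Solution 4 indirectly: both P and A are column-stochastic, and both have a fixed vector, i.e. an eigenvector with eigenvalue 1.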