7. PageRank Pages with lots of backlinks are important Backlinks coming from important pages convey more importance to a page Problem: Rank Sink
8. Rank Sink Page cycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside
9. Escape Term Solution: Rank Source c is maximized and = 1 E(u) is some vector over the web pages – uniform, favorite page etc.
10. Matrix Notation R is the dominant eigenvector and c is the dominant eigenvalue of because c is maximized
11. Computing PageRank - initialize vector over web pages loop: - new ranks sum of normalized backlink ranks - compute normalizing factor - add escape term - control parameter while - stop when converged
12. Random Surfer Model Page Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
13. Implementation Computing resources — 24 million pages — 75 million URLs Memory and disk storage Weight Vector (4 byte float) Matrix A (linear access)
14. Implementation (Con't) Unique integer ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute
15. Convergence Properties Graph (V, E) is an expander with factor if for all (not too large) subsets S: |As| |s| Eigenvalue separation: Largest eigenvalue is sufficiently larger than the second-largest eigenvalue Random walk converges fast to a limiting probability distribution on a set of nodes in the graph.
17. Personalized PageRank Rank Source E can be initialized : – uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives result in overly high ranking – total weight on a single page, e.g. Netscape, McCarthy great variation of ranks under different single pages as rank source – and everything in-between, e.g. server root pages allow manipulation by commercial interests
18. Applications I Estimate web traffic – Server/page aliases – Link/traffic disparity, e.g. porn sites, free web-mail Backlink predictor – Citation counts have been used to predict future citations – very difficult to map the citation structure of the web completely – avoid the local maxima that citation counts get stuck in and get better performance
19. Applications II - Ranking Proxy Surfer's Navigation Aid Annotating links by PageRank (bar graph) Not query dependent
20. Issues Users are no random walkers – Content based methods Starting point distribution – Actual usage data as starting vector Reinforcing effects/bias towards main pages How about traffic to ranking pages? No query specific rank Linkage spam – PageRank favors pages that managed to get other pages to link to them – Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)
23. Conclusion PageRank is a global ranking based on the web's graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications