Link Structure Analysis
Kira Radinsky
All of the following slides are courtesy of Ronny Lempel (Yahoo!)
29 November 2010 236620 Search Engine Technology 2
Link Analysis
In the Lecture
• HITS: topic-specific algorithm
– Assigns each page two scores – a hub score and an authority score –
with respect to a topic
• PageRank: query independent algorithm
– Assigns each page a single, global importance score
• Both algorithms reduce to computing the principal eigenvectors
of certain matrices
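As a reminder of what "computing a principal eigenvector" looks like in practice, here is a minimal power-iteration sketch for HITS. The function name `hits`, the edge-list input, and the fixed iteration count are illustrative assumptions, not part of the lecture material.

```python
import math

def hits(edges, iterations=50):
    """Power iteration for HITS on a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for s, t in edges:
            auth[t] += hub[s]
        # Hub update: sum the authority scores of the pages linked to.
        hub = {n: 0.0 for n in nodes}
        for s, t in edges:
            hub[s] += auth[t]
        # Normalize both vectors to unit Euclidean length.
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
    return hub, auth
```

On a toy graph such as `[("a", "c"), ("b", "c"), ("a", "d")]`, page `c` ends up with the highest authority score and page `a` with the highest hub score.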
Today’s Tutorial:
1. Graph modifications in link analysis algorithms
2. SALSA – HITS with a random-walk twist
3. Topic-Sensitive PageRank
Graph Modifications in Link-Analysis
Algorithms
1. Delete irrelevant elements (pages, links) from the collection.
– Non-informative links
– Pages that are deemed irrelevant (mostly by similarity of
content to the query), and their incident links [Bharat
and Henzinger, 1998]
2. Assign varying (positive) link weights to the non-deleted
links.
– Similarity of anchor text to the query [CLEVER]
– Links incident to pre-defined relevant pages [CLEVER]
– Multiple links from pages of site A to pages of site B
[Bharat and Henzinger, 1998]
• Note that some of the above modifications are only
applicable to topic distillation algorithms
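The two kinds of modification above can be sketched as follows. Two choices here are assumptions made for illustration and are not necessarily the exact schemes of the cited papers: links between pages on the same host are treated as non-informative, and k links from one host to the same target page each receive weight 1/k. The function name `modified_links` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def modified_links(links):
    """Return {(source_url, target_url): weight} after both modifications."""
    def host(url):
        return urlparse(url).netloc

    # 1. Deletion: drop links whose endpoints share a host
    #    (assumed here to be non-informative, e.g. navigational).
    kept = {(s, t) for s, t in links if host(s) != host(t)}
    # 2. Weighting: k links from one host to the same page share a total weight of 1.
    per_host = Counter((host(s), t) for s, t in kept)
    return {(s, t): 1.0 / per_host[(host(s), t)] for s, t in kept}
```

With this weighting, a site cannot inflate a target's score simply by linking to it from many of its own pages.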
SALSA – Stochastic Approach to Link
Structure Analysis
• SALSA, like HITS, is a topic-distillation algorithm that aims to
assign pages both hub and authority scores
– SALSA analyzes the same topic-centric graph as HITS, but splits each
node into two – a “hub personality” without in-links and an
“authority personality” without out-links
– Examines the resulting bipartite graph
• Innovation: incorporate stochastic analysis with the
authority-hub paradigm
– Examine two separate random walk Markov chains:
an authority chain A, and a hub chain H.
– A single step in each chain is composed of two link traversals on the
Web - one link forward, and one link backwards.
– The principal community of each type: the most frequently visited
pages in the corresponding Markov Chain
Forming the bipartite graph in SALSA
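SALSA – Authority Chain Example
Pr(2 → 3) = 2/5 · 1/3
• Formally, the transition probability matrix is:
$[P_A]_{i,j} = \sum_{\{k \mid k \to i,\ k \to j\}} \frac{1}{\mathrm{in}(i)} \cdot \frac{1}{\mathrm{out}(k)}$
where in(i) is the number of in-links of i and out(k) is the number of out-links of k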
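A minimal sketch of building the authority chain from an unweighted list of (hub, authority) links, following the transition-probability formula on this slide. The function name `authority_chain` and the nested-dictionary representation are illustrative assumptions; exact fractions are used so that rows can be checked to sum to 1.

```python
from collections import defaultdict
from fractions import Fraction

def authority_chain(edges):
    """Return P with P[i][j] = transition probability from authority i to j."""
    in_links = defaultdict(set)   # authority -> hubs pointing to it
    out_links = defaultdict(set)  # hub -> authorities it points to
    for k, j in edges:
        in_links[j].add(k)
        out_links[k].add(j)
    P = defaultdict(dict)
    for i, hubs in in_links.items():
        for k in hubs:                # one link backwards, from i to hub k
            for j in out_links[k]:    # one link forward, from hub k to j
                step = Fraction(1, len(hubs)) * Fraction(1, len(out_links[k]))
                P[i][j] = P[i].get(j, 0) + step
    return P
```

For the toy links `[("h1", "a"), ("h1", "b"), ("h2", "a")]`, the walk moves from `a` to `b` with probability 1/2 · 1/2 = 1/4, since only one of the two hubs pointing to `a` also points to `b`.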
SALSA: Analysis
• The transition probabilities induce a probability
distribution on the authorities (hubs) in the authority
(hub) Markov chain
– If the chains are not irreducible, the probability depends on the
initial distribution (chosen to be uniform)
• The principal community of authorities (hubs) is defined
as the k most probable pages in the authority (hub) chain
• While one can compute the scores by calculating the
principal eigenvector of the stochastic transition matrices,
a more efficient way exists
SALSA: Analysis (cont.)
Mathematical analysis of SALSA leads to the following theorem:
SALSA’s authority weights reflect the normalized in-degree of each page,
multiplied by the relative size of the page’s component in the
authority side of the graph.
Example, for a page x:
$a(x) = \frac{3}{3+5} \times \frac{4}{4+2} = 0.25$
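The theorem suggests computing authority scores without any eigenvector calculation: find the connected components of the bipartite graph, then combine in-degrees with component sizes. The sketch below assumes unweighted links and a uniform initial distribution. The function name `salsa_authority` is hypothetical. The example is read here as in-degree 3 out of 8 links in the component, with the component holding 4 of the 6 authorities; that reading is an assumption.

```python
from collections import defaultdict

def salsa_authority(edges):
    """Closed-form SALSA authority scores for unweighted (hub, authority) links."""
    parent = {}

    def find(v):
        # Union-find lookup with path halving.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    in_degree = defaultdict(int)
    for k, j in edges:
        in_degree[j] += 1
        # Tag the two "personalities" so a page may be both hub and authority.
        parent[find(("hub", k))] = find(("auth", j))

    comp_auths = defaultdict(int)  # component -> number of authorities
    comp_links = defaultdict(int)  # component -> total in-degree
    for j, d in in_degree.items():
        root = find(("auth", j))
        comp_auths[root] += 1
        comp_links[root] += d

    total = len(in_degree)
    scores = {}
    for j, d in in_degree.items():
        root = find(("auth", j))
        # (relative size of component) * (in-degree normalized within component)
        scores[j] = (comp_auths[root] / total) * (d / comp_links[root])
    return scores
```

A made-up graph with the same counts, in which x has 3 of the 8 links in a four-authority component and a second component holds two more authorities, gives a(x) = 0.25.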
SALSA: Proof for Irreducible Authority Chains
• The proof assumes a weighted graph, in which the link k→j
has weight w(k→j)
– The examples shown so far assumed that all links have a weight of 1
• Define W as the sum of all link weights
• Define a distribution vector π by $\pi_j = d_{in}(j)/W$, where $d_{in}(j)$
is the sum of the weights of j’s incoming links
– Similarly, $d_{out}(k)$ is the sum of the weights of k’s outgoing links
• It is enough to prove that $\pi P_A = \pi$, since an irreducible $P_A$ has a unique
stationary distribution (Ergodic Theorem)
– Recall that $P_A$ is the transition matrix of the authority chain
– $P_A$ is always aperiodic
SALSA: Proof for Irreducible Authority Chains
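One way to verify $\pi P_A = \pi$, sketched under the assumption that the weighted transition probabilities are $[P_A]_{i,j} = \sum_{\{k \mid k \to i,\ k \to j\}} \frac{w(k \to i)}{d_{in}(i)} \cdot \frac{w(k \to j)}{d_{out}(k)}$, which reduces to the unweighted formula when all weights are 1. Swapping the order of summation lets the $d_{in}(i)$ terms cancel, and the inner sum over i then equals $d_{out}(k)$:

```latex
\begin{align*}
[\pi P_A]_j &= \sum_i \pi_i [P_A]_{i,j}
  = \sum_i \frac{d_{in}(i)}{W} \sum_{\{k \mid k \to i,\ k \to j\}}
    \frac{w(k \to i)}{d_{in}(i)} \cdot \frac{w(k \to j)}{d_{out}(k)} \\
&= \frac{1}{W} \sum_{\{k \mid k \to j\}} \frac{w(k \to j)}{d_{out}(k)}
    \sum_{\{i \mid k \to i\}} w(k \to i) \\
&= \frac{1}{W} \sum_{\{k \mid k \to j\}} w(k \to j)
  = \frac{d_{in}(j)}{W} = \pi_j
\end{align*}
```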
Topic Sensitive PageRank
[T. Haveliwala, 2002]
• A topic T is defined by a set of on-topic pages $S_T$.
• A T-biased PageRank is PageRank where the random jumps
(teleportations) land uniformly at random on $S_T$ rather than on an
arbitrary Web page
• Recall the alternative interpretation of PageRank, as walking
random paths of geometrically distributed lengths between
resets
– Here, a reset returns to some on-topic page
• If we assume that pages tend to link to pages with topical
affinity, short paths starting at $S_T$ will not stray too far
from on-topic pages, hence the PageRanks will be T-biased
– Note that pages unreachable from $S_T$ will receive a T-biased PageRank of 0
• Where would be a good place to find sets $S_T$ for certain
topics?
– The pages classified under the 16 top-level topics of the Open Directory
Project (see next slide)
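A minimal sketch of a T-biased PageRank by power iteration, in which every reset lands uniformly at random on the topic set. The function name `topic_pagerank`, the damping value 0.85, and the handling of pages without out-links (their mass is treated as a reset) are assumptions for illustration.

```python
def topic_pagerank(out_links, topic_pages, damping=0.85, iterations=100):
    """PageRank whose random jumps land uniformly on topic_pages.

    out_links maps every page to the list of pages it links to; every link
    target and every topic page must also appear as a key.
    """
    jump = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
            for p in out_links}
    rank = dict(jump)  # start the walk right after a reset
    for _ in range(iterations):
        nxt = {p: 0.0 for p in out_links}
        reset_mass = 1.0 - damping
        for p, targets in out_links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Assumption: a page with no out-links forces a reset.
                reset_mass += damping * rank[p]
        for p in out_links:
            nxt[p] += reset_mass * jump[p]
        rank = nxt
    return rank
```

In a toy graph where page `z` links into the topic set but cannot be reached from it, `z` keeps a score of exactly 0, matching the note above about unreachable pages.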
Topic-Sensitive PageRank (cont.)
• 16 PageRank vectors are computed, $PR_1, \ldots, PR_{16}$
• Given a query q, its affinity to the 16 topics $T_1, \ldots, T_{16}$ is
computed
– Based on the probability of generating the query by the language
model induced by the set of pages $S_T$
– A distribution vector $[\alpha_1, \ldots, \alpha_{16}]$ is computed, where
$\alpha_j \propto \mathrm{Prob}(q \mid \text{language model of } S_{T_j})$
• The PageRank vector that will be used to serve q is
$PR_q = \sum_{j} \alpha_j PR_j$
• The idea of biasing PageRank’s random jump destinations is
also used for personalized PageRank flavors [e.g. Jeh and
Widom 2003]
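The query-time step can be sketched as below. The slide only says that the affinities come from a language model induced by each topic's pages; the unigram model, the add-one smoothing, and the names `topic_affinities` and `blend` are assumptions made for illustration.

```python
import math
from collections import Counter

def topic_affinities(query_terms, topic_texts, smoothing=1.0):
    """alpha_j proportional to Prob(q | unigram language model of topic j)."""
    vocab = {w for words in topic_texts.values() for w in words} | set(query_terms)
    log_prob = {}
    for topic, words in topic_texts.items():
        counts, n = Counter(words), len(words)
        # Additive smoothing keeps unseen query terms from zeroing the product.
        log_prob[topic] = sum(
            math.log((counts[w] + smoothing) / (n + smoothing * len(vocab)))
            for w in query_terms)
    top = max(log_prob.values())  # subtract the max for numerical stability
    weights = {t: math.exp(lp - top) for t, lp in log_prob.items()}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}

def blend(alpha, topic_ranks):
    """PR_q = sum_j alpha_j * PR_j, computed page by page."""
    pages = {p for ranks in topic_ranks.values() for p in ranks}
    return {p: sum(alpha[t] * topic_ranks[t].get(p, 0.0) for t in alpha)
            for p in pages}
```

Because the α values form a distribution and each topic vector sums to 1, the blended vector also sums to 1. The 16 topic vectors can be precomputed offline, leaving only the affinities and the weighted sum for query time.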
Link Analysis Algorithms - Summary
• Many variants and refinements of both HITS and PageRank
have been proposed.
• Other approaches include:
– Max-flow techniques [Flake et al., SIGKDD 2000]
– Machine learning and Bayesian techniques
• Examples of applications:
– Ranking pages (topic specific/global importance/ personalized rankings)
– Categorization, clustering, finding related pages
– Identifying virtual communities
• Computational issues:
– Distributed computations of eigenvectors of massive, sparse matrices
– Convergence acceleration, approximations
• A wealth of literature
