This document analyzes characteristics of the Chilean web (.CL domain) from 2000 using various metrics like PageRank, HITS, and age. It finds that:
1) Only a small percentage of pages have relevant PageRank, hub, or authority scores.
2) PageRank, hub score, and authority score are not correlated.
3) PageRank is biased towards older pages while authority scores do not depend on age.
4) The web has a core structure with a main connected component and other isolated components.
2. Agenda
• Introduction
• Link-based ranking
• Web structure
• Web characteristics
• Web usage
• Web dynamics
• Conclusions
Relating Web Characteristics
3. Introduction: Sample
• Web sample: .CL domain in the year 2000
• 670,000 pages in 7,500 domains
• 15 KB average page size
• Collection from the TodoCL Web search engine
4. Introduction: Emphasis
• Broder et al.: Graph Structure in the Web (2000)
– Page-based structure based on strongly connected components
– The Web graph is not a random graph
– Process: cut & paste model
• Ours is mostly a site-based analysis
– Trying to make Web structure meaningful
7. Link ranking: Pagerank
\[
\mathrm{Pagerank}(p) = \frac{q}{N} + (1 - q) \sum_{i=1}^{k} \mathrm{Pagerank}(r_i)
\]
• r_1, ..., r_k: pages that point to page p
• q/N: probability of a random jump (q) over the number of pages (N)
• Brin & Page, 1998
• Currently used by Google
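The Pagerank formula on this slide can be evaluated by repeated substitution. The following is a minimal sketch, not the authors' implementation. The slide sums Pagerank(r_i) directly, whereas the formulation in Brin & Page (1998) divides each term by the number of out-links of r_i; the sketch uses that normalized form so that the scores sum to 1. The function name `pagerank`, the toy graph `links`, the value q = 0.15, the iteration count, and the handling of pages without out-links are all assumptions made for illustration.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate the Pagerank recurrence on a small link graph.

    links maps each page to the list of pages it points to.
    q is the random-jump probability (0.15 is an assumed value).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with the random-jump term q / N.
        new_rank = {p: q / n for p in pages}
        for source in pages:
            targets = links.get(source, [])
            if targets:
                # Split the source's rank evenly among its out-links.
                share = (1 - q) * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                share = (1 - q) * rank[source] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph, used only to exercise the function.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph page `d` has no in-links, so its score stays at the random-jump floor q/N, while `c`, which is pointed to by every other page, ends up with the highest score.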
8. Link ranking: Hubs & Authorities
• HITS algorithm (Kleinberg, 1998)
• A good authority is a page pointed to by good hubs, so we assume that it has good content
• A good hub is a page that points to good authorities, so we assume it is a good set of links
• Linear system calculated by numerical iteration
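The mutual definition of hubs and authorities can be sketched as a simple numerical iteration. This is an illustrative sketch only: the function name `hits`, the toy graph, the iteration count, and the choice of Euclidean normalization after each step are assumptions, and the full HITS algorithm first selects a query-dependent subgraph, which is omitted here.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a small link graph.

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: three link pages pointing at two content pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

Here `a1` is pointed to by all three hubs and receives the highest authority score, while `h3`, which links to only one authority, gets a lower hub score than `h1` and `h2`.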
9. Link ranking: Distribution
• <2% of pages with relevant Pagerank
• 9% with relevant hub score
• 2-3% with relevant authority score
10. Link ranking: Correlation
Hub score, authority score and Pagerank do not seem to be correlated
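One way to check such a claim numerically is to compute a correlation coefficient between the per-page score vectors. The slide does not say which measure the authors used, so the Pearson coefficient below is an assumption, as are the function name `pearson` and the made-up score lists.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        # A constant vector has no defined correlation; report 0 by convention.
        return 0.0
    return covariance / (spread_x * spread_y)


# Made-up scores for four pages, used only to exercise the function.
pagerank_scores = [1.0, 2.0, 3.0, 4.0]
hub_scores = [1.0, -1.0, -1.0, 1.0]
print(pearson(pagerank_scores, hub_scores))
```

A value near 0 indicates no linear relationship between the two scores, while values near 1 or -1 indicate a strong one.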
11. Link ranking: Sites
• Which measure to use for sites?
• Average score
– But good sites can have lots of bad pages
• Maximum score
– But one good page is not enough to make a good site
• Sum of the scores of all pages
– Natural for Pagerank
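The three candidate site measures can be compared side by side with a small aggregation routine. This is a sketch under stated assumptions: a "site" is taken to be the host name of the page URL, and the function name `site_scores`, the example URLs, and the scores are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_scores(page_scores):
    """Aggregate per-page scores into average, maximum and sum per site.

    page_scores maps a page URL to its score. Grouping pages by host name
    is an assumption about what counts as a site.
    """
    by_site = defaultdict(list)
    for url, score in page_scores.items():
        by_site[urlsplit(url).netloc].append(score)
    return {
        site: {
            "average": sum(scores) / len(scores),
            "maximum": max(scores),
            "sum": sum(scores),
        }
        for site, scores in by_site.items()
    }


# Hypothetical pages: one strong page plus two weak ones on the first site.
page_scores = {
    "http://site-a.example/index.html": 0.5,
    "http://site-a.example/old1.html": 0.125,
    "http://site-a.example/old2.html": 0.125,
    "http://site-b.example/index.html": 0.25,
}
for site, measures in sorted(site_scores(page_scores).items()):
    print(site, measures)
```

In this example the average ties the two sites even though the first has the stronger page, while the maximum and the sum both favor the first site. This mirrors the trade-offs listed on the slide.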
12. Link ranking: Sites Graph
• 90% relevant site-Pagerank
• It’s harder to have a good hub than a good authority (site)
13. Web Structure: Basis
• The Web graph has structure:
– MAIN
– IN
– OUT
– ISLANDS
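The four components can be derived from a link graph with plain reachability searches. The sketch below assumes the definitions from Broder et al.: MAIN is the largest strongly connected component, IN contains the nodes that can reach MAIN, and OUT contains the nodes reachable from MAIN. As a simplification, everything else is lumped into ISLANDS, which may be broader than the authors' definition. The function names `reachable` and `bow_tie` and the toy graph are hypothetical, and the per-node search is only suitable for small graphs.

```python
from collections import deque


def reachable(start_nodes, adjacency):
    """Breadth-first search: all nodes reachable from start_nodes."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def bow_tie(links):
    """Split a directed graph into MAIN, IN, OUT and ISLANDS."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    reverse = {}
    for source, targets in links.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    # Largest strongly connected component: a node's component is the
    # intersection of what it can reach and what can reach it.
    main, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        component = reachable([node], links) & reachable([node], reverse)
        assigned |= component
        if len(component) > len(main):
            main = component
    out_part = reachable(main, links) - main
    in_part = reachable(main, reverse) - main
    islands = nodes - main - in_part - out_part
    return {"MAIN": main, "IN": in_part, "OUT": out_part, "ISLANDS": islands}


# Hypothetical site graph: a three-site cycle, one feeder, one sink, one island pair.
links = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "in1": ["a"], "d": [], "x": ["y"]}
parts = bow_tie(links)
for name in ("MAIN", "IN", "OUT", "ISLANDS"):
    print(name, sorted(parts[name]))
```

For a graph the size of the sample, a linear-time strongly connected components algorithm such as Tarjan's would replace the per-node search used here.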
14. Web Structure: Basis (cont.)
• The MAIN component has structure:
– MAIN IN
– MAIN OUT
– MAIN MAIN
– MAIN NORM
[Figure: subdivision of MAIN, with IN and OUT also labeled]
24. Web Dynamics: Pagerank
Pagerank is biased against newer pages
25. Web Dynamics: Hubs & Authorities
[Figure: authority score and hub score versus age (months)]
26. Conclusions
• Pagerank/HITS do not seem to be correlated
– And Pagerank is biased towards older pages
• Site ranking can help to make good human-selected directories
• Finding good pages is not so simple
• Characterizing Web structure gives valuable insight
– Web Graph Mining is just starting