The Graph Structure of the Web
- Aggregated by Pay-Level Domain
Oliver Lehmberg, Robert Meusel, Christian Bizer
Research Group Data and Web Science
General Knowledge about the Web Graph
• Broder et al.* in 2000:
– In- and Outdegree follow power laws
– There is a directed path between two pages in 25% of all cases
– The Web Graph has the bow-tie structure
Version 6/25/2014 The Graph Structure in the Web - Aggregated by Pay-Level Domain
Lehmberg/Meusel/Bizer
*A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web.
In WWW’00, pages 309–320. North-Holland Publishing Co, 2000.
Slide 2
Our Contributions
• R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure
in the web – revisited. WWW ’14, 2014.
– Analysis of the 2012 Web Graph on page level
• This presentation:
– Analysis of the same graph, aggregated by pay-level domain (PLD)
– Focus on inter-website connections
– No intra-website links
• Additionally:
– Interconnections between topical groups of websites
– Public Suffix aggregation
Slide 3
DATA SET
Slide 4
Web Data Commons Hyperlink Graph
• Page level: the largest hyperlink graph available to the public
– extracted from Common Crawl
– 3.5 billion nodes (web pages)
– 128 billion arcs (hyperlinks)
• Aggregated by pay-level domain
– 43 million nodes (websites)
– 623 million arcs (aggregated hyperlinks)
– Covers about 18% of the 240 million domains registered in the Web in 2012*
• Pay-level domain:
– dws.informatik.uni-mannheim.de → uni-mannheim.de
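The aggregation step can be sketched as follows. This is a minimal illustration, not the published extraction code: the suffix set is a tiny hand-picked assumption (a real run would load the complete Public Suffix List), and the function names `pay_level_domain` and `aggregate_by_pld` are hypothetical.

```python
from collections import Counter

# Assumption: a tiny hand-picked subset of public suffixes, for illustration only.
# A real run would load the complete Public Suffix List instead.
PUBLIC_SUFFIXES = {"com", "org", "de", "uk", "co.uk"}


def pay_level_domain(host: str) -> str:
    """Return the longest matching public suffix plus one more label."""
    labels = host.lower().rstrip(".").split(".")
    # Try candidate suffixes from longest to shortest; starting at index 1
    # guarantees that at least one label remains in front of the suffix.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    return host.lower()  # no known suffix: leave the host unchanged


def aggregate_by_pld(host_arcs):
    """Collapse host-level arcs into PLD-level arcs, dropping intra-website links."""
    pld_arcs = Counter()
    for src, dst in host_arcs:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        if s != d:  # links within one website are not part of the PLD graph
            pld_arcs[(s, d)] += 1
    return pld_arcs
```

Each key of the returned counter corresponds to one aggregated arc; the count records how many underlying links were merged into it.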
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
Slide 5
Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
• 4 aggregation levels:
Graph                      #Nodes        #Arcs           Size (zipped)
Page graph                 3.56 billion  128.73 billion  376 GB
Subdomain graph            101 million   2,043 million   10 GB
1st level subdomain graph  95 million    1,937 million   9.5 GB
PLD graph                  43 million    623 million     3.1 GB
• Extraction code is published under Apache License
– Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
Slide 6
GRAPH HANDS-ON
Slide 7
Node Centrality Ranking
http://wwwranking.webdatacommons.org
Slide 8
Top PLD Lists
Rank  Website              Outdegree  Website          Indegree   Website                 PageRank
1     blogspot.com         3.898.561  wordpress.org    1.822.440  wordpress.org           113,388
2     wordpress.com        2.249.553  youtube.com      1.319.548  gmpg.org                111,173
3     youtube.com          1.078.938  wikipedia.org    1.243.291  youtube.com             88,206
4     wikipedia.org        862.705    gmpg.org         1.156.727  twitter.com             54,644
5     serebella.com        699.609    blogspot.com     1.034.450  wikipedia.org           54,081
6     refertus.info        668.271    google.com       782.660    blogspot.com            40,901
7     top20directory.com   650.884    wordpress.com    710.590    google.com              40,799
8     typepad.com          551.360    twitter.com      646.239    wordpress.com           28,018
9     botw.org             496.645    yahoo.com        554.251    yahoo.com               27,594
10    tumblr.com           496.045    flickr.com       339.231    networkadvertising.org  27,395
11    dmoz.org             476.890    facebook.com     314.051    apple.com               23,929
12    vindhetviahier.nl    424.646    apple.com        312.396    phpbb.com               22,329
13    jcsearch.com         423.918    miibeian.gov.cn  289.605    miibeian.gov.cn         22,165
14    startpagina.nl       392.543    vimeo.com        269.003    hugedomains.com         20,793
15    yahoo.com            371.087    tumblr.com       226.596    facebook.com            20,254
16    tatu.us              370.918    joomla.org       201.863    joomla.org              18,146
17    freeseek.org         362.310    amazon.com       196.690    flickr.com              17,966
18    lap.hu               352.668    w3.org           196.507    adobe.com               17,903
19    blau-webkatalog.com  312.924    nytimes.com      193.907    linkedin.com            16,083
20    allepaginas.nl       276.578    sourceforge.net  189.663    w3.org                  15,539
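The three rankings in this table can be reproduced in principle with a few lines of code. The sketch below is an assumption-laden illustration on an in-memory arc list: the damping factor of 0.85 and the 50 iterations are conventional defaults, not values taken from the slides, and the scores are normalised to sum to 1, which need not match the scaling of the PageRank column above.

```python
from collections import defaultdict


def degrees(arcs):
    """Count in- and outdegree per node, treating parallel arcs as one."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for s, d in set(arcs):
        outdeg[s] += 1
        indeg[d] += 1
    return indeg, outdeg


def pagerank(arcs, damping=0.85, iterations=50):
    """Plain power iteration; rank of dangling nodes is spread uniformly."""
    arcs = set(arcs)
    nodes = sorted({n for arc in arcs for n in arc})
    out = defaultdict(list)
    for s, d in arcs:
        out[s].append(d)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for s in nodes:
            for d in out[s]:
                new[d] += damping * rank[s] / len(out[s])
        rank = new
    return rank
```

Sorting the nodes by any of the three returned mappings in descending order yields a top list of the kind shown above.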
Slide 9
Most interlinked PLDs
Slide 10
GRAPH ANALYSIS
Slide 11
In- and Outdegree – Power-Laws?
Power law:
$y \propto x^{-\gamma}$
Methodology:
• Maximum-likelihood fitting following Clauset et al.* (plfit*²)
• Goodness-of-fit test
Indegree results:
$x_0$ = 3,062
$\gamma$ = 2.40
Cannot reject power law hypothesis
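To make the fitting step concrete, here is a minimal sketch of the continuous maximum-likelihood estimator from Clauset et al., $\hat{\gamma} = 1 + n / \sum_i \ln(x_i / x_0)$, applied to synthetic data. It is not the plfit implementation: degree data is discrete, the sketch treats it as continuous, it takes $x_0$ as given rather than estimating it, and it omits the goodness-of-fit test. The function names and the synthetic sample are assumptions for illustration.

```python
import math
import random


def fit_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of gamma for the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum, len(tail)


def sample_power_law(n, x_min, gamma, rng):
    """Draw n samples from a continuous power law by inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]


# Synthetic check: samples drawn with gamma = 2.4 should give an estimate near 2.4.
rng = random.Random(42)
synthetic = sample_power_law(100_000, x_min=10.0, gamma=2.4, rng=rng)
gamma_hat, n_tail = fit_exponent(synthetic, x_min=10.0)
print(f"estimated gamma = {gamma_hat:.2f} from {n_tail} tail observations")
```

The standard error of this estimator is roughly $(\hat{\gamma} - 1)/\sqrt{n}$, so the estimate tightens quickly as the tail grows.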
Slide 12
* Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.
*² https://github.com/ntamas/plfit
In- and Outdegree – Power-Laws?
Outdegree results:
$x_0$ = 496
$\gamma$ = 2.39
Must reject power law hypothesis
Yet unclear which distribution fits
Slide 13
Bow-Tie Structure
Observations:
• Small IN component
• Large OUT component
• TEND and TUBES almost non-existent
Compared to Broder et al.:
• Unbalanced
• LSCC much larger
Compared to our page graph*:
• Proportions of IN and OUT exchanged
• Large fraction of IN pages were merged into LSCC (ca. 1 billion pages)
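The bow-tie decomposition itself can be sketched on a small in-memory graph. The code below is an illustrative assumption, not the method used for the 43-million-node graph: it finds strongly connected components by intersecting forward and backward reachability, which is simple but far too slow at web scale. Tendrils, tubes and disconnected nodes are lumped together as `OTHER`.

```python
from collections import defaultdict


def reachable(adj, seeds):
    """Return every node reachable from seeds (seeds included) along adj."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(arcs):
    """Split the nodes into LSCC, IN, OUT and everything else."""
    fwd, bwd, nodes = defaultdict(set), defaultdict(set), set()
    for s, d in arcs:
        fwd[s].add(d)
        bwd[d].add(s)
        nodes.update((s, d))
    # The SCC of v is (nodes reachable from v) intersected with (nodes reaching v).
    lscc, unassigned = set(), set(nodes)
    while unassigned:
        v = next(iter(unassigned))
        scc = reachable(fwd, [v]) & reachable(bwd, [v])
        unassigned -= scc
        if len(scc) > len(lscc):
            lscc = scc
    out = reachable(fwd, lscc) - lscc  # reachable from the core
    in_ = reachable(bwd, lscc) - lscc  # can reach the core
    return {"LSCC": lscc, "IN": in_, "OUT": out, "OTHER": nodes - lscc - in_ - out}
```

Dividing the size of each returned set by the number of nodes gives component proportions of the kind compared on this slide.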
* R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisted. WWW ’14, 2014.
Slide 14
Distance Distribution
Methodology:
• Approximate distribution several times (using HyperBall*)
Connected pairs: 42.42 (±3.59)%
Avg. distance: 4.27 (±0.085)
Diameter (at least): 48
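For intuition about these three figures, the sketch below computes them exactly with one breadth-first search per node. This is a stand-in for small graphs only: HyperBall approximates the same distribution with probabilistic counters instead of exact searches, which is why the slide reports error margins and only a lower bound on the diameter. The assumption here is that the average is taken over connected ordered pairs only.

```python
from collections import defaultdict, deque


def distance_stats(arcs):
    """Exact distance statistics over all ordered node pairs of a directed graph."""
    adj, nodes = defaultdict(set), set()
    for s, d in arcs:
        adj[s].add(d)
        nodes.update((s, d))
    histogram = defaultdict(int)  # distance -> number of ordered pairs
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    histogram[dist[v]] += 1
                    queue.append(v)
    n = len(nodes)
    connected = sum(histogram.values())
    total_pairs = n * (n - 1)
    weighted = sum(d * c for d, c in histogram.items())
    return {
        "connected_fraction": connected / total_pairs if total_pairs else 0.0,
        "average_distance": weighted / connected if connected else float("nan"),
        "diameter": max(histogram, default=0),
    }
```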
*P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall:
A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013
Slide 15
High connectivity based on Hubs?
• LSCC of 51.9%, 42% connected pairs & avg. distance of 4.27
– How important are hubs in this graph?
• Approach:
– A) Remove links to Hubs (i.e. high indegree)
– B) Keep only links to Hubs
– Repeat this for different indegree values as thresholds and then measure largest remaining WCC/SCC
• Results
– Removing links to nodes with high indegree: no large SCC once all links to nodes with indegree 10 or higher are removed
– Removing links to nodes with low indegree: the more links we remove, the more likely the remaining nodes are to be part of the largest SCC
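The two experiments can be sketched as a pair of arc filters followed by a component-size measurement. The code below is a simplified assumption: it measures only the largest weakly connected component (via union-find), whereas the slide also reports the largest SCC, and the names `hub_experiment` and `largest_wcc` are hypothetical.

```python
from collections import Counter


def largest_wcc(nodes, arcs):
    """Size of the largest weakly connected component, via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, d in arcs:
        parent[find(s)] = find(d)
    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values(), default=0)


def hub_experiment(arcs, threshold):
    """A) drop links into hubs, B) keep only links into hubs; measure largest WCC."""
    arcs = set(arcs)
    nodes = {n for arc in arcs for n in arc}
    indegree = Counter(d for _, d in arcs)
    without_hub_links = [(s, d) for s, d in arcs if indegree[d] < threshold]
    only_hub_links = [(s, d) for s, d in arcs if indegree[d] >= threshold]
    return {
        "A_largest_wcc": largest_wcc(nodes, without_hub_links),
        "B_largest_wcc": largest_wcc(nodes, only_hub_links),
    }
```

Calling `hub_experiment` for a range of thresholds and plotting both sizes against the threshold mirrors the repeated measurement described in the approach.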
Slide 16
Two Layer Model
7/4/2014 Data and Web Science Group 17
Approach:
Remove incoming links from the graph and measure the sizes of the largest SCC/WCC
Subgraph with indegree < 10
• 73.7% of all nodes weakly connected
• No large strongly connected component
• → Low Degree Layer
Subgraph with indegree ≥ 10
• Removed incoming links of 79.2% of all nodes
• 16.1% of all nodes strongly connected
• → High Degree Layer
PLD Topic Graph
Approach:
• Use topical categories from the Open Directory Project* to categorise our websites
• 15 topical categories
Results:
• “computers”: 6th largest, but largest number of links
• “shopping”: much more incoming than outgoing links, few internal links
Conclusion:
• No obvious patterns, more properties needed
[Figure: links between topical categories; surviving labels: health, kids and teens, news]
Slide 18
*http://dmoz.org
Public Suffix (PS) Graph
Approach:
• Top ten PSs from our PLD graph + “others”
• Generally agrees with Verisign Domain Industry Brief*
gTLDs:
• more external than internal links
ccTLDs:
• more internal than external links
Extreme cases:
• .com does not follow this rule
• .de → half of all links are from a single spammer
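The internal versus external comparison can be sketched as a simple tally over PLD-level arcs. The code is an illustration under stated assumptions: an external link is attributed to the suffix it leaves from, and the helper `last_label` crudely takes the final label as the public suffix, which is wrong for multi-label suffixes such as co.uk.

```python
from collections import Counter


def suffix_link_profile(pld_arcs, suffix_of):
    """Count links that stay inside a public suffix versus links that leave it."""
    internal, external = Counter(), Counter()
    for src, dst in pld_arcs:
        s, d = suffix_of(src), suffix_of(dst)
        if s == d:
            internal[s] += 1
        else:
            external[s] += 1  # attributed to the suffix the link leaves from
    return internal, external


def last_label(pld):
    """Hypothetical helper: treat the last label as the public suffix."""
    return pld.rsplit(".", 1)[-1]
```

Comparing `internal[ps]` with `external[ps]` for each suffix shows whether it links mostly inward or outward.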
[Figure: links between the public suffixes com, de, info, it, net, nl, org, ru, co.uk and others]
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
Slide 19
WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, Microformats statements
2. Corpus of 147 million relational HTML tables
Thank you for your attention!
Let’s Say Someone Did Drop the Bomb. Then What?
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
 

The Graph Structure of the Web - Aggregated by Pay-Level Domain

  • 1. The Graph Structure of the Web - Aggregated by Pay-Level Domain
      Oliver Lehmberg, Robert Meusel, Christian Bizer
      Research Group Data and Web Science
  • 2. General Knowledge about the Web Graph
      • Broder et al.* in 2000:
          – In- and outdegree follow power laws
          – There is a directed path between two pages in 25% of all cases
          – The Web graph has the bow-tie structure
      *A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In WWW’00, pages 309–320. North-Holland Publishing Co, 2000.
      Version 6/25/2014 The Graph Structure in the Web - Aggregated by Pay-Level Domain Lehmberg/Meusel/Bizer Slide 2
  • 3. Our Contributions
      • R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
          – Analysis of the 2012 Web graph on page level
      • This presentation:
          – Analysis of the same graph, aggregated by pay-level domain (PLD)
          – Focus on inter-website connections
          – No intra-website links
      • Additionally:
          – Interconnections between topical groups of websites
          – Public Suffix aggregation
  • 4. DATA SET
  • 5. Web Data Commons Hyperlink Graph
      • Page level: the largest hyperlink graph available to the public
          – extracted from Common Crawl
          – 3.5 billion nodes (web pages)
          – 128 billion arcs (hyperlinks)
      • Aggregated by pay-level domain
          – 43 million nodes (websites)
          – 623 million arcs (aggregated hyperlinks)
          – For comparison: 240 million domains were registered on the Web in 2012, so the graph covers about 18% of them*
      • Pay-level domain:
          – dws.informatik.uni-mannheim.de → uni-mannheim.de
      *http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
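The pay-level-domain aggregation described on this slide can be illustrated with a minimal sketch. It is not the authors' extraction code: the suffix set below is a tiny hand-picked stand-in for the real Public Suffix List, the names `pay_level_domain` and `aggregate_by_pld` are hypothetical, and collapsing parallel links into one arc is an assumption about what "aggregated hyperlinks" means.

```python
from typing import Iterable, Optional, Set, Tuple
from urllib.parse import urlsplit

# Illustrative subset only (assumption); the real Public Suffix List is far longer.
PUBLIC_SUFFIXES = {"com", "org", "de", "uk", "co.uk"}


def pay_level_domain(host: str) -> Optional[str]:
    """Return the public suffix plus one more label, or None if unknown."""
    labels = host.lower().strip(".").split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def aggregate_by_pld(page_arcs: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Map page-level arcs (URL pairs) to distinct PLD-level arcs."""
    pld_arcs = set()
    for src_url, dst_url in page_arcs:
        src = pay_level_domain(urlsplit(src_url).hostname or "")
        dst = pay_level_domain(urlsplit(dst_url).hostname or "")
        # Intra-website links are dropped, as stated on the contributions slide.
        if src and dst and src != dst:
            pld_arcs.add((src, dst))
    return pld_arcs
```

For example, `pay_level_domain("dws.informatik.uni-mannheim.de")` yields `uni-mannheim.de`, matching the slide's example.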
  • 6. Downloading the WDC Hyperlink Graph
      http://webdatacommons.org/hyperlinkgraph/
      • 4 aggregation levels:

      Graph                      #Nodes        #Arcs           Size (zipped)
      Page graph                 3.56 billion  128.73 billion  376 GB
      Subdomain graph            101 million   2,043 million   10 GB
      1st level subdomain graph  95 million    1,937 million   9.5 GB
      PLD graph                  43 million    623 million     3.1 GB

      • Extraction code is published under Apache License
          – Extraction costs per run: ~200 US$ in Amazon EC2 fees
  • 7. GRAPH HANDS-ON
  • 8. Node Centrality Ranking
      http://wwwranking.webdatacommons.org
  • 9. Top PLD Lists

      Rank  Website              Outdegree  Website          Indegree   Website                 PageRank
      1     blogspot.com         3.898.561  wordpress.org    1.822.440  wordpress.org           113,388
      2     wordpress.com        2.249.553  youtube.com      1.319.548  gmpg.org                111,173
      3     youtube.com          1.078.938  wikipedia.org    1.243.291  youtube.com             88,206
      4     wikipedia.org        862.705    gmpg.org         1.156.727  twitter.com             54,644
      5     serebella.com        699.609    blogspot.com     1.034.450  wikipedia.org           54,081
      6     refertus.info        668.271    google.com       782.660    blogspot.com            40,901
      7     top20directory.com   650.884    wordpress.com    710.590    google.com              40,799
      8     typepad.com          551.360    twitter.com      646.239    wordpress.com           28,018
      9     botw.org             496.645    yahoo.com        554.251    yahoo.com               27,594
      10    tumblr.com           496.045    flickr.com       339.231    networkadvertising.org  27,395
      11    dmoz.org             476.890    facebook.com     314.051    apple.com               23,929
      12    vindhetviahier.nl    424.646    apple.com        312.396    phpbb.com               22,329
      13    jcsearch.com         423.918    miibeian.gov.cn  289.605    miibeian.gov.cn         22,165
      14    startpagina.nl       392.543    vimeo.com        269.003    hugedomains.com         20,793
      15    yahoo.com            371.087    tumblr.com       226.596    facebook.com            20,254
      16    tatu.us              370.918    joomla.org       201.863    joomla.org              18,146
      17    freeseek.org         362.310    amazon.com       196.690    flickr.com              17,966
      18    lap.hu               352.668    w3.org           196.507    adobe.com               17,903
      19    blau-webkatalog.com  312.924    nytimes.com      193.907    linkedin.com            16,083
      20    allepaginas.nl       276.578    sourceforge.net  189.663    w3.org                  15,539
  • 10. Most interlinked PLDs
  • 11. GRAPH ANALYSIS
  • 12. In- and Outdegree – Power-Laws?
      Power law: $y \propto x^{-\gamma}$
      Methodology:
      • Clauset et al.* maximum-likelihood fitting (plfit*²)
      • Goodness-of-fit test
      Indegree results:
          – $x_0 = 3{,}062$
          – $\gamma = 2.40$
          – Cannot reject power law hypothesis
      * Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.
      *² https://github.com/ntamas/plfit
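The maximum-likelihood step can be sketched as follows. This is only a minimal illustration, not plfit itself: it uses the closed-form approximation from Clauset et al. for discrete data, takes the lower cutoff as given (assuming the slide's $x_0$ is that cutoff), and omits both the cutoff selection and the goodness-of-fit test. The names `fit_gamma` and `synthetic_degrees` are hypothetical.

```python
import math
import random


def fit_gamma(values, x_min, discrete=True):
    """Approximate MLE of the power-law exponent for the tail values >= x_min."""
    tail = [x for x in values if x >= x_min]
    # For integer data, Clauset et al. shift the cutoff by 1/2 in the estimator.
    cutoff = x_min - 0.5 if discrete else x_min
    log_sum = sum(math.log(x / cutoff) for x in tail)
    if log_sum <= 0:
        raise ValueError("need tail values strictly above the cutoff")
    return 1.0 + len(tail) / log_sum


def synthetic_degrees(n, x_min, gamma, seed=0):
    """Draw approximately power-law distributed integers (for testing only)."""
    rng = random.Random(seed)
    scale = x_min - 0.5
    exponent = -1.0 / (gamma - 1.0)
    # Inverse-transform sampling of a continuous power law, rounded to integers.
    return [math.floor(scale * (1.0 - rng.random()) ** exponent + 0.5)
            for _ in range(n)]


if __name__ == "__main__":
    degrees = synthetic_degrees(20000, x_min=50, gamma=2.4)
    print(fit_gamma(degrees, x_min=50))
```

On real degree data the cutoff strongly affects the estimate, which is why a full fitting procedure also searches over candidate cutoffs.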
  • 13. In- and Outdegree – Power-Laws?
      Outdegree results:
          – $x_0 = 496$
          – $\gamma = 2.39$
          – Must reject power law hypothesis
          – Still unclear which distribution fits
  • 14. Bow-Tie Structure
      Observations:
          – Small IN component
          – Large OUT component
          – TENDRILS and TUBES almost non-existent
      Compared to Broder et al.:
          – Unbalanced
          – LSCC much larger
      Compared to our page graph*:
          – Proportions of IN and OUT exchanged
          – A large fraction of IN pages was merged into the LSCC (ca. 1 billion pages)
      * R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
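How the bow-tie components are defined can be shown on a toy graph. The sketch below is an assumption-laden illustration rather than the authors' tooling: it finds the largest strongly connected component (LSCC) with Kosaraju's algorithm, labels everything that reaches it as IN and everything reachable from it as OUT, and lumps tendrils, tubes and disconnected nodes into one "OTHER" bucket. The function names are hypothetical.

```python
from collections import defaultdict


def reachable(adj, sources):
    """All nodes reachable from `sources` by following `adj`."""
    seen, stack = set(sources), list(sources)
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def largest_scc(nodes, fwd, bwd):
    """Kosaraju's algorithm, written iteratively to avoid deep recursion."""
    order, visited = [], set()
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(fwd.get(root, ())))]
        while stack:
            u, successors = stack[-1]
            for v in successors:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(fwd.get(v, ()))))
                    break
            else:
                order.append(u)  # u is finished
                stack.pop()
    best, assigned = set(), set()
    for root in reversed(order):  # decreasing finish time, on the reversed graph
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = {root}, [root]
        while stack:
            for v in bwd.get(stack.pop(), ()):
                if v not in assigned:
                    assigned.add(v)
                    comp.add(v)
                    stack.append(v)
        if len(comp) > len(best):
            best = comp
    return best


def bow_tie(arcs):
    fwd, bwd, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in arcs:
        fwd[u].append(v)
        bwd[v].append(u)
        nodes.update((u, v))
    lscc = largest_scc(sorted(nodes), fwd, bwd)
    out = reachable(fwd, lscc) - lscc
    in_ = reachable(bwd, lscc) - lscc
    return {"LSCC": lscc, "IN": in_, "OUT": out,
            "OTHER": nodes - lscc - in_ - out}
```

Calling `bow_tie` on a list of `(source, target)` pairs returns the four node sets; their sizes relative to the node count give proportions of the kind compared on this slide.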
  • 15. Distance Distribution
      Methodology:
          – Approximate the distribution several times (using HyperBall*)
      Results:
          – Connected pairs: 42.42 (±3.59)%
          – Avg. distance: 4.27 (±0.085)
          – Diameter (at least): 48
      *P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013.
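The three reported quantities can be made concrete with an exact computation on a small graph. This sketch swaps in plain breadth-first search for HyperBall, so it only scales to toy inputs; it also assumes that "connected pairs" means ordered pairs of distinct nodes joined by a directed path. The name `distance_stats` is hypothetical.

```python
from collections import defaultdict, deque


def distance_stats(arcs):
    """Return (fraction of connected ordered pairs, average distance, diameter)."""
    adj, nodes = defaultdict(list), set()
    for u, v in arcs:
        adj[u].append(v)
        nodes.update((u, v))
    n = len(nodes)
    if n < 2:
        return 0.0, None, 0
    connected = dist_sum = longest = 0
    for source in nodes:
        # Breadth-first search gives shortest directed distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        connected += len(dist) - 1  # exclude the source itself
        dist_sum += sum(dist.values())
        longest = max(longest, max(dist.values()))
    average = dist_sum / connected if connected else None
    return connected / (n * (n - 1)), average, longest
```

Running one search per node costs time proportional to nodes times arcs, which is why an approximation is needed for a graph with 43 million nodes.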
  • 16. High connectivity based on Hubs?
      • LSCC of 51.9%, 42% connected pairs and avg. distance of 4.27
          – How important are hubs in this graph?
      • Approach:
          – A) Remove links to hubs (i.e. nodes with high indegree)
          – B) Keep only links to hubs
          – Repeat this for different indegree thresholds and then measure the largest remaining WCC/SCC
      • Results:
          – Removing links to nodes with high indegree: no large SCC remains once all links to nodes with indegree 10 or higher are removed
          – Removing links to nodes with low indegree: the more links we remove, the more likely the remaining nodes are to be part of the largest SCC
  • 17. Two Layer Model
      Approach: Remove incoming links from the graph and measure the sizes of the largest SCC/WCC
      Subgraph with indegree < 10
          • 73.7% of all nodes weakly connected
          • No large strongly connected component
          • → Low Degree Layer
      Subgraph with indegree ≥ 10
          • Removed incoming links of 79.2% of all nodes
          • 16.1% of all nodes strongly connected
          • → High Degree Layer
      7/4/2014 Data and Web Science Group 17
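The threshold experiment behind the hub analysis and the two-layer model can be sketched in a few lines. This is an illustrative assumption about the procedure, not the authors' code: indegrees are taken from the original graph, links are filtered by the target's indegree, and only the largest weakly connected component (via union-find) is measured. The names `filter_arcs` and `largest_wcc_fraction` are hypothetical.

```python
from collections import Counter


def filter_arcs(arcs, threshold, keep_hubs):
    """Keep links into hubs (indegree >= threshold) or into non-hubs only."""
    indegree = Counter(target for _, target in arcs)
    if keep_hubs:
        return [(u, v) for u, v in arcs if indegree[v] >= threshold]
    return [(u, v) for u, v in arcs if indegree[v] < threshold]


def largest_wcc_fraction(nodes, arcs):
    """Fraction of `nodes` inside the largest weakly connected component."""
    parent = {node: node for node in nodes}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for u, v in arcs:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    sizes = Counter(find(node) for node in nodes)
    return max(sizes.values()) / len(nodes)


if __name__ == "__main__":
    arcs = [("a", "h"), ("b", "h"), ("c", "h"), ("d", "h"), ("a", "b")]
    nodes = {node for arc in arcs for node in arc}
    for keep_hubs in (False, True):
        remaining = filter_arcs(arcs, threshold=2, keep_hubs=keep_hubs)
        print(keep_hubs, largest_wcc_fraction(nodes, remaining))
```

Repeating the loop over a range of thresholds gives the kind of curve the slides summarise; measuring strongly connected components instead would need an SCC algorithm in place of union-find.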
  • 18. PLD Topic Graph
      Approach: Use topical categories from the Open Directory Project* to categorise our websites.
          – 15 topical categories
      Results:
          – “computers”: 6th largest, but largest number of links
          – “shopping”: much more incoming than outgoing links, few internal links
      Conclusion: No obvious patterns, more properties needed
      [Figure: topic graph; legible node labels: health, kids and teens, news]
      *http://dmoz.org
  • 19. Public Suffix (PS) Graph
      Approach: Top ten PSs from our PLD graph + “others”
          – Generally agrees with Verisign Domain Industry Brief*
          – gTLDs: more external than internal links
          – ccTLDs: more internal than external links
      Extreme cases:
          – .com does not follow this rule
          – .de → half of all links are from a single spammer
      [Figure: public suffix graph; node labels: co.uk, ru, others, org, nl, net, it, info, de, com]
      *http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
  • 20. WebDataCommons.org also offers:
      1. Corpus of 17 billion RDFa, Microdata, Microformats statements
      2. Corpus of 147 million relational HTML tables
      Thank you for your attention!