Sampling National Deep Web
Denis Shestakov, fname.lname at aalto.fi
Department of Media Technology, Aalto University
DEXA'11, Toulouse, France, 31.08.2011
Outline

● Background
● Our approach: Host-IP cluster random sampling
● Results
● Conclusions
Background

● Deep Web: web content behind search
  interfaces
● See example of interface (figure on the original slide)
● Main problem: hard to crawl, thus
  content is poorly indexed and not
  available for search (hidden)
● Many research problems: roughly 150-200
  works address certain aspects of the
  challenge (e.g., see 'Search interfaces on the
  Web: querying and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of
  deep web crawling is in its
  infancy" (in 'Web crawling', Olston & Najork, 2010)
Background

● What is still unknown (surprisingly):
   ○ How large is the deep Web: number of deep web
     resources? Amount of content in them? What
     portion is indexed?
● So far only a few studies have addressed this:
   ○ Bergman, 2001: number, amount of content
   ○ Chang et al., 2004: number, coverage
   ○ Shestakov et al., 2007: number
   ○ Chinese surveys: number
   ○ ....
Background

● All approaches used so far have serious limitations
● The basic idea behind estimating the number of
  deep web sites:
   ○ IP address random sampling method (proposed in
     1997)
   ○ Description: take the pool of all IP addresses (~3 billion
     currently in use), generate a random sample (~one
     million is enough), connect to each address and, if it serves
     HTTP, crawl it and look for search interfaces
   ○ Count the search interfaces found in the sample and
     apply sampling math to get an estimate
   ○ One can restrict this to some segment of the Web (e.g.,
     national): the pool then consists of national IP addresses
     only
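The IP random sampling idea above can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: the function names, the fixed seed, and the normal-approximation interval are assumptions, and `hits` stands for the number of sampled IPs on which a search interface was found (which implicitly counts at most one site per IP).

```python
import math
import random

def sample_ips(pool, sample_size, seed=0):
    """Draw a simple random sample of IP addresses without replacement."""
    return random.Random(seed).sample(pool, sample_size)

def estimate_total(pool_size, sample_size, hits, z=1.96):
    """Scale the sample proportion up to the whole pool.

    Returns (estimate, margin), where margin is the half-width of a
    normal-approximation interval (z=1.96 for roughly 95% confidence).
    """
    p = hits / sample_size
    # Finite population correction for sampling without replacement.
    fpc = (pool_size - sample_size) / (pool_size - 1)
    se = math.sqrt(p * (1 - p) / sample_size * fpc)
    return pool_size * p, z * pool_size * se

# Hypothetical numbers, for illustration only.
estimate, margin = estimate_total(pool_size=1000, sample_size=100, hits=10)
```

Restricting the survey to a national segment only changes which addresses go into `pool`.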
Virtual Hosting

● Bottleneck: virtual hosting
● When only an IP is available, the URLs to crawl look
  like http://X.Y.Z.W, so lots of web sites
  hosted on X.Y.Z.W are missed
● Examples:
    ○ OVH (hosting company): 65,000 servers host
      7,500,000 web sites
    ○ This survey: 670,000 hosts on 80,000 IP
      addresses
● You can't ignore it!
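A quick back-of-the-envelope check of the two examples above, as a sketch. The helper name is hypothetical, and the OVH ratio is per server rather than per IP, since the slide gives a server count.

```python
def average_load(hosted, machines):
    """Average number of hosted names per server or IP address."""
    return hosted / machines

# Figures taken from the slide above.
survey_hosts_per_ip = average_load(670_000, 80_000)     # 8.375 hosts per IP
ovh_hosted_per_server = average_load(7_500_000, 65_000)  # about 115 per server
```

Even at the survey's average of roughly eight hosts per IP, a crawl that visits only http://X.Y.Z.W sees at most one of them.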
Host-IP cluster sampling

● What if a large list of hosts is available?
   ○ In fact, getting one is not trivial, as such a list
     should cover a certain web segment well
● Host random sampling can be applied (Shestakov
  et al., 2007)
   ○ Works but with limitations
   ○ Bottleneck: host aliasing, i.e., different hostnames
     lead to the same web site
       ■ Hard to solve: need to crawl all hosts in the list
         (their start web pages)
● Idea: resolve all hosts to their IPs
Host-IP cluster sampling

● Resolve all hosts in the list to their IP addresses
   ○ A set of host-IP pairs
● Cluster hosts (pairs) by IP
   ○ IP1: host11, host12, host13, ...
   ○ IP2: host21, host22, host23, ...
   ○ ...
   ○ IPN: hostN1, hostN2, hostN3, ...
● Generate a random sample of IPs
● Analyze sampled IPs
   ○ E.g., if IP2 is sampled then crawl host21, host22,
     host23, ...
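The resolve, cluster, and sample steps above might look like this in Python. It is a sketch under stated assumptions: the function names are hypothetical, only one IPv4 address per host is kept, and the example pairs use reserved documentation hostnames and addresses rather than real data.

```python
import random
import socket
from collections import defaultdict

def resolve(host):
    """Return one IPv4 address for host, or None if resolution fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None

def cluster_by_ip(pairs):
    """Group (host, ip) pairs into {ip: [hosts]}, skipping unresolved hosts."""
    clusters = defaultdict(list)
    for host, ip in pairs:
        if ip is not None:
            clusters[ip].append(host)
    return dict(clusters)

def sample_clusters(clusters, n, seed=0):
    """Pick n IPs at random; every host on a picked IP becomes a crawl seed."""
    chosen = random.Random(seed).sample(sorted(clusters), n)
    return {ip: clusters[ip] for ip in chosen}

# Pre-resolved example pairs; in practice: [(h, resolve(h)) for h in hosts]
pairs = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("c.example", "192.0.2.2"),
    ("d.example", None),  # did not resolve
]
clusters = cluster_by_ip(pairs)
seeds = sample_clusters(clusters, 1)
```

Sampling whole IPs rather than individual hosts means aliases that share an address land in the same cluster and are crawled together.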
Host-IP cluster sampling

● Analyze sampled IPs
   ○ E.g., if IP2 is sampled then crawl host21, host22,
     host23, ...
   ○ While crawling, 'unknown' hosts (not in the list)
     may be found
       ■ Crawl only those that resolve either to
         IP2 or to IPs that are not among the list's IPs
         (IP1, IP2, ..., IPN)

● Identify search interfaces
   ○ Filtering, machine learning, manual check
   ○ Out of scope here (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4
  of the paper)
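The rule for 'unknown' hosts discovered during the crawl can be written as a small predicate. This is a sketch with hypothetical names. The reading behind it is an assumption: a host on another listed IP presumably belongs to that IP's cluster and would be counted if that cluster were sampled.

```python
def should_crawl(host_ip, sampled_ip, listed_ips):
    """Decide whether a newly discovered, unlisted host is crawled.

    host_ip:    address the new host resolved to (None if unresolved)
    sampled_ip: the IP whose cluster is currently being analyzed
    listed_ips: set of all IPs obtained from the original host list
    """
    if host_ip is None:
        return False
    return host_ip == sampled_ip or host_ip not in listed_ips

# Example with reserved documentation addresses.
listed = {"192.0.2.1", "192.0.2.2"}
stays_in_cluster = should_crawl("192.0.2.2", "192.0.2.2", listed)
other_cluster = should_crawl("192.0.2.1", "192.0.2.2", listed)
outside_list = should_crawl("198.51.100.7", "192.0.2.2", listed)
```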
Results

● Dataset:
   ○ ~670 thousand hostnames
   ○ Obtained from Yandex: good coverage of Russian
     Web as of 2006
   ○ Resolved to ~80 thousand unique IP addresses
   ○ 77.2% of hosts shared their IPs with at least 20
     other hosts (the scale of virtual hosting)
● 1075 IPs sampled, giving 6237 hosts in the initial crawl
  seed
   ○ Enough for an estimate of NUM +/- 25% at 95%
     confidence
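To show how a count per sampled IP turns into an estimate with a margin, here is the textbook one-stage cluster sampling estimator of a total. It is an assumption that this resembles the formulas in Section 4.4 of the paper; the function name and the toy counts are hypothetical.

```python
import math
import statistics

def cluster_total_estimate(counts, num_clusters, z=1.96):
    """Estimate a population total from per-cluster counts.

    counts:       deep web sites found on each sampled IP (one number per IP)
    num_clusters: total number of IPs (clusters) in the dataset
    Returns (estimate, margin) using a normal approximation.
    """
    n = len(counts)
    mean = statistics.fmean(counts)
    s2 = statistics.variance(counts)  # sample variance, requires n >= 2
    fpc = 1 - n / num_clusters        # finite population correction
    se = num_clusters * math.sqrt(fpc * s2 / n)
    return num_clusters * mean, z * se

# Toy data: 4 sampled IPs out of 100, with 0, 0, 1 and 3 sites found.
estimate, margin = cluster_total_estimate([0, 0, 1, 3], num_clusters=100)
```

The margin shrinks roughly with the square root of the number of sampled IPs, which is why a target such as +/- 25% fixes the sample size in advance.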
Results
Comparison: host-IP vs IP sampling
(comparison figure from the original slide not preserved)

Conclusion: IP random sampling (used in previous deep
web characterization studies) applied to the same dataset
resulted in estimates that are 3.5 times smaller than
actual numbers (obtained by host-IP)
Conclusion

● Proposed Host-IP clustering technique
   ○ Superior to IP random sampling
● Accurately characterized a national web segment
   ○ As of 09/2006, 14,200 +/- 3,800 deep web sites in
     the Russian Web
● Estimates obtained by Chang et al. (ref [9] in the
  paper) are too low
● Planning to apply Host-IP to other datasets
   ○ Main challenge is to obtain a large list of hosts that
     reliably covers a certain web segment
● Contact me if interested in Host-IP pairs datasets
Thank you!
Questions?
