Sampling National Deep Web
1. Sampling National Deep Web
Denis Shestakov, fname.lname at aalto.fi
Department of Media Technology, Aalto University
DEXA'11, Toulouse, France, 31.08.2011
3. Background
● Deep Web: web content behind search
interfaces
● See the example of such an interface (shown on the slide)
● Main problem: hard to crawl, thus
content poorly indexed and not
available for search (hidden)
● Many research problems: roughly 150-
200 works addressing certain aspects
of the challenge (e.g., see 'Search interfaces on the
Web: querying and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of
deep web crawling is in its
infancy" (in 'Web crawling', Olston&Najork, 2010)
4. Background
● What is still unknown (surprisingly):
○ How large is the deep Web: the number of deep web
resources? the amount of content in them? what
portion is indexed?
● So far, only a few studies have addressed this:
○ Bergman, 2001: number, amount of content
○ Chang et al., 2004: number, coverage
○ Shestakov et al., 2007: number
○ Chinese surveys: number
○ ....
5. Background
● All approaches used so far have serious limitations
● Basically, the idea behind estimating number of
deep web sites:
○ IP address random sampling method (proposed in
1997)
○ Description: take the pool of all IP addresses (~3 billion
currently in use), generate a random sample (~one
million is enough), connect to each; if it serves HTTP,
crawl it and search for search interfaces
○ Obtain a number of search interfaces in a sample and
apply sampling math to get an estimate
○ One can restrict to some segment of the Web (e.g.,
national): then pool consists of national IP addresses
only
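The estimate itself is straightforward scaling from the sample to the pool. A minimal Python sketch, with illustrative numbers that are not from the survey:

```python
def estimate_total(pool_size, sample_size, found_in_sample):
    """Scale the count of search interfaces found in a random IP
    sample up to the whole IP pool (or a national subset of it)."""
    return pool_size * found_in_sample / sample_size

# illustrative: 3 billion IPs in use, 1 million sampled,
# 5 search interfaces found in the sample
print(estimate_total(3_000_000_000, 1_000_000, 5))  # 15000.0
```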
6. Virtual Hosting
● Bottleneck: virtual hosting
● When only the IP is available, crawl URLs look
like http://X.Y.Z.W, so many web sites hosted
on X.Y.Z.W are missed
● Examples:
○ OVH (hosting company): 65,000 servers host
7,500,000 web sites
○ This survey: 670,000 hosts on 80,000 IP
addresses
● You can't ignore it!
7. Host-IP cluster sampling
● What if a large list of hosts is available?
○ In fact, it is not trivial to get one, as such a list
should cover a given web segment well
● Host random sampling can be applied (Shestakov
et al., 2007)
○ Works but with limitations
○ Bottleneck: host aliasing, i.e., different hostnames
lead to the same web site
■ Hard to solve: need to crawl all hosts in the list
(their start web pages)
● Idea: resolve all hosts to their IPs
8. Host-IP cluster sampling
● Resolve all hosts in the list to their IP addresses
○ A set of host-IP pairs
● Cluster hosts (pairs) by IP
○ IP1: host11,host12, host13, ...
○ IP2: host21,host22, host23, ...
○ ...
○ IPN: hostN1,hostN2, hostN3, ...
● Generate random sample of IP
● Analyze sampled IPs
○ E.g., if IP2 sampled then crawl host21,host22,
host23, ...
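The clustering and sampling steps above can be sketched as follows. The host-IP pairs and hostnames are placeholders (resolving hosts to IPs would in practice use DNS lookups):

```python
import random
from collections import defaultdict

def cluster_and_sample(host_ip_pairs, num_ips, seed=0):
    """Cluster (host, IP) pairs by IP, then draw a uniform random
    sample of IPs; every host clustered on a sampled IP goes into
    the crawl seed for that IP."""
    clusters = defaultdict(list)
    for host, ip in host_ip_pairs:
        clusters[ip].append(host)
    rng = random.Random(seed)
    sampled = rng.sample(sorted(clusters), min(num_ips, len(clusters)))
    return {ip: clusters[ip] for ip in sampled}

# placeholder data: four hosts on three IPs (virtual hosting on 10.0.0.1)
pairs = [("a.example", "10.0.0.1"), ("b.example", "10.0.0.1"),
         ("c.example", "10.0.0.2"), ("d.example", "10.0.0.3")]
seed_clusters = cluster_and_sample(pairs, num_ips=2)
```

Sampling IPs (rather than hosts directly) sidesteps host aliasing: all hostnames sharing an IP are crawled together, so duplicates resolve within one cluster.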
9. Host-IP cluster sampling
● Analyze sampled IPs
○ E.g., if IP2 sampled then crawl host21,host22,
host23, ...
○ While crawling, 'unknown' hosts (not in the list)
may be found
■ Crawl only those that resolve either to
IP2 or to IPs that are not among the list's IPs
(IP1, IP2, ..., IPN)
● Identify search interfaces
○ Filtering, machine learning, manual check
○ Out of scope here (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4
of the paper)
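As a rough illustration, the standard one-stage cluster-sampling estimator can be computed as below; the exact formulas in Section 4.4 of the paper may differ, and the per-IP counts here are made up:

```python
import math
from statistics import mean, variance

def cluster_estimate(counts, total_ips, z=1.96):
    """Textbook one-stage cluster-sampling estimator.
    counts[i] = number of deep web sites found on the i-th sampled IP;
    total_ips = size of the IP frame. Returns (estimate, 95% margin)."""
    n = len(counts)
    total = total_ips * mean(counts)
    # standard error with finite-population correction
    se = total_ips * math.sqrt((1 - n / total_ips) * variance(counts) / n)
    return total, z * se

# made-up counts on 5 sampled IPs out of an 80,000-IP frame
est, margin = cluster_estimate([0, 1, 0, 2, 0], 80_000)
```

With so few sampled clusters the margin is enormous; this is why the survey needed on the order of a thousand sampled IPs.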
10. Results
● Dataset:
○ ~670 thousand hostnames
○ Obtained from Yandex: good coverage of Russian
Web as of 2006
○ Resolved to ~80 thousand unique IP addresses
○ 77.2% of hosts shared their IP with at least 20
other hosts (the scale of virtual hosting)
● 1075 IPs sampled, giving 6237 hosts in the initial
crawl seed
○ Enough if an estimate of NUM+/-25% at 95%
confidence is satisfactory
12. Comparison:
host-IP vs IP sampling
Conclusion: IP random sampling (used in previous deep
web characterization studies) applied to the same dataset
resulted in estimates that are 3.5 times smaller than
actual numbers (obtained by host-IP)
13. Conclusion
● Proposed Host-IP clustering technique
○ Superior to IP random sampling
● Accurately characterized a national web segment
○ As of 09/2006, 14,200+/-3,800 deep web sites in
the Russian Web
● Estimates obtained by Chang et al. (ref [9] in the
paper) are underestimates
● Planning to apply Host-IP to other datasets
○ Main challenge is to obtain a large list of hosts that
reliably covers a certain web segment
● Contact me if interested in Host-IP pairs datasets