Intelligent web crawling
Denis Shestakov, Aalto University
Slides for the tutorial given at WI-IAT'13 in Atlanta, USA, on November 20th, 2013
Outline:
- overview of web crawling;
- intelligent web crawling;
- open challenges
INTELLIGENT WEB CRAWLING
WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013
ver 1.8: 10.04.2015
Denis Shestakov
denshe at gmail
Department of Media Technology, Aalto University, Finland
References to this tutorial
To cite this tutorial, please use:
D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent Informatics Bulletin, 14(1), pp. 5-7, 2013.
Speaker’s Bio
(2009-2013) Postdoc in Web Services Group, Aalto University, Finland
PhD thesis (2008) on the limited coverage of web crawlers
Over ten years of experience in the area
Tutorials on web crawling given at SAC'12 and ICWE'13
Web Services Group in 2011
Speaker’s Info
As of 2013:
http://www.linkedin.com/in/dshestakov
http://www.mendeley.com/profiles/denis-shestakov/
http://www.researchgate.net/profile/Denis_Shestakov
Current:
https://mediatech.aalto.fi/~denis/
TUTORIAL OUTLINE
I. OVERVIEW
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
II. INTELLIGENT WEB CRAWLING
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
III. OPEN CHALLENGES
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
Links to Tutorial
Slides:
http://goo.gl/woVtQk
http://www.slideshare.net/denshe/presentations
Similar tutorials:
Tutorials on web crawling at ICWE’13 and SAC’12
How they differ from this tutorial: they give a better overview of the topic (parts I and III) but do not cover crawling strategies (part II)
Supporting materials:
http://www.mendeley.com/groups/531771/web-crawling/
PART I: OVERVIEW
Visualization of http://media.tkk.fi/webservices by aharef.info applet
Outline of Part I
Overview of Web Crawling
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
Web Crawling in a Nutshell
Automatic harvesting of web content
Done by web crawlers (also known as robots, bots or
spiders)
Follow a link from a set of links (the URL queue), download the page, extract all links, eliminate the already visited ones, add the rest to the queue
Then repeat (a minimal sketch follows)
A set of policies is also involved (e.g., 'ignore links to images')
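A minimal sketch of this crawl loop in Python; the third-party requests and beautifulsoup4 packages, the URL limit and the timeout are illustrative choices, not part of the tutorial:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_urls=100):
    frontier = deque(seed_urls)          # the URL queue
    visited = set(seed_urls)             # URLs already seen
    while frontier and len(visited) <= max_urls:
        url = frontier.popleft()         # follow a link from the queue
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                     # skip unreachable pages
        for a in BeautifulSoup(page, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])   # resolve relative links
            if link not in visited:          # eliminate already visited
                visited.add(link)
                frontier.append(link)        # add the rest to the queue
```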
Web Crawling in a Nutshell
Example:
1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1
Web Crawling in a Nutshell
In essence: a simple and naive process
However, a number of imposed 'restrictions' make it much more complicated
Most complexities are due to the operating environment (the Web)
For example, not overloading web servers (challenging, as the distribution of web pages over web servers is non-uniform)
Or avoiding web spam (it is not only useless but also consumes resources and often spoils the collected content)
Web Crawling in a Nutshell
Crawler Agents
First in 1993: the Wanderer (written in Perl)
Over 1100 different crawler signatures (the User-Agent string in the HTTP request header) are listed at http://www.crawltrack.net/crawlerlist.php
Educated guess on the overall number of different crawlers: at least several thousand
Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
Web Crawling in a Nutshell
Crawler Agents
For advanced needs, you can modify the code of existing projects in your preferred programming language
Crawlers play a big role on the Web
They bring more traffic to some web sites than human visitors do
They generate a sizeable portion of the traffic to any (public) web site
Crawler traffic is especially important for emerging web sites
Web Crawling in a Nutshell
Classification
General/universal crawlers
Not so many of them, lots of resources required
Big web search engines
Topical/focused crawlers
Pages/sites on certain topic
Crawling everything in one specific (e.g., national) web segment is rather general, though
Batch crawling
One or several (static) snapshots
Incremental/continuous crawling
Re-visiting
Resources divided between fetching newly discovered
pages and re-downloading previously crawled pages
Search engines
Applications of Web Crawling
Web Search Engines
Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
One of three underlying technology stacks
Applications of Web Crawling
Web Search Engines
One of three underlying technology stacks
BTW, what are the other two and which is the most
’crucial’?
Applications of Web Crawling
Web Search Engines
What are the other two and which is the most 'crucial'?
The other two: indexing and query processing; the most crucial: the query processor (particularly, ranking)
Applications of Web Crawling
Web Archiving
Digital preservation
A "librarian" view of the Web
The biggest: Internet Archive
Quite huge collections
Batch crawls
Primarily, collections of national web sites: sites under country-specific TLDs or physically hosted in a country
There are quite a few, and some are huge! See the list of Web Archiving Initiatives on Wikipedia
Applications of Web Crawling
Vertical Search Engines
Aggregating data from many sources on a certain topic
E.g., apartment search, car search
Applications of Web Crawling
Web Data Mining
"To get the data to be actually mined"
Usually done with focused crawlers
For example, opinion mining
Or digests of current happenings on the Web (e.g., what music people are listening to now)
Applications of Web Crawling
Web Monitoring
Monitoring sites/pages for changes and updates
Applications of Web Crawling
Detection of malicious web sites
Typically part of an anti-virus, firewall, or search engine service
Building a list of such web sites and informing users about the potential threat of visiting them
Applications of Web Crawling
Web site/application testing
Crawl a web site to check navigation through it, the validity of its links, etc.
Regression/security/... testing a rich internet application
(RIA) via crawling
Checking different application states by simulating possible
user interaction events (e.g., mouse click, time-out)
Applications of Web Crawling
Copyright violation detection
Crawl to find (media) items under copyright or links to them
Regularly re-visiting 'suspicious' web sites, forums, etc.
Tasks like finding terrorist chat rooms also fall here
Applications of Web Crawling
Web Scraping
Extracting particular pieces of information from a group of typically similar pages
Used when an API to the data is not available
Interestingly, scraping may be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API
Applications of Web Crawling
Web Mirroring
Copying of web sites
Hosting copies on different servers to ensure 24x7
accessibility
Industry vs. Academia Divide
In web crawling domain
Huge lag between industrial and academic web crawlers
Both research-wise and development-wise
The algorithms, techniques, and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
Industrial crawlers operate at web scale
That is, tens of billions of pages
Only a few academic crawlers have dealt with more than one billion pages
The academic scale is rather hundreds of millions
Industry vs. Academia
Re-crawling
Batch crawls in academia
Regular re-crawls by industrial crawlers
Evaluation of crawled data
Crucial for corrections and improvements to crawlers
Direct evaluation by the users of search engines
Evaluation of academic crawls is, to some extent, artificial
Web Size and Structure
Some numbers
The number of pages per host is not uniform: most hosts contain only a few pages, while others contain millions
Roughly 100 links per page
According to Google statistics (over 4 billion pages, 2010): fetching a page takes 320 KB (textual content plus all embeddings)
A page has 10-100 KB of textual (HTML) content on average
One trillion URLs known by Google/Yahoo in 2008
Web Size and Structure
Some numbers
20 million web pages in 1995 (indexed by AltaVista)
One trillion (10^12) URLs known by Google/Yahoo in 2008
- The 'independent' search engine Majestic12 (P2P crawling) confirms one trillion items
This does not mean one trillion indexed pages
Supposedly, the index has tens of times fewer pages
Cool crawler fact: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months
Throughput: 1000-1500 pages per second
Over 30 billion discovered URLs
Web Size and Structure
Bow-tie model of the Web
Illustration taken from http://dx.doi.org/doi:10.1038/35012155
Outline of Part II
Intelligent Web Crawling
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
Architecture of Web Crawler
Crawler crawls the Web
[Diagram: seed URLs initialize the URL frontier; the crawler moves the frontier's URLs into the set of crawled URLs, expanding into the uncrawled Web]
Architecture of Web Crawler
Typically done in a distributed fashion
[Diagram: the same frontier, now consumed by multiple crawling threads in parallel]
Architecture of Web Crawler
URL Frontier
It can include multiple pages from the same host
Must avoid fetching them all at the same time (see the sketch below)
Must try to keep all crawling threads busy
Prioritization also helps
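A minimal sketch of such a polite frontier: one FIFO queue plus a "not before" timestamp per host. The structure and the 2-second default delay are illustrative assumptions (loosely Mercator-style), not the tutorial's own design:

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self, delay=2.0):
        self.delay = delay                 # min seconds between hits on one host
        self.queues = defaultdict(deque)   # host -> FIFO queue of its URLs
        self.next_ok = defaultdict(float)  # host -> earliest allowed fetch time

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        now = time.time()
        for host, queue in self.queues.items():
            if queue and self.next_ok[host] <= now:   # this host is ready
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None   # no host is both non-empty and ready; caller should wait
```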
Architecture of Web Crawler
Crawler Architecture
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
Architecture of Web Crawler
Content seen?
If the fetched page is already in the base/index, don't process it again
Detected via document fingerprints (shingles); see the sketch below
Filtering
Filter out URLs due to 'politeness' or restrictions on the crawl
Fetched robots.txt files are cached to avoid downloading them repeatedly
Duplicate URL Elimination
Check whether an extracted and filtered URL has already been passed to the frontier (batch crawling)
More complicated in continuous crawling (a different URL frontier implementation)
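A toy sketch of the "content seen?" test with shingle fingerprints; the shingle size, Python's built-in hash() and the Jaccard threshold are illustrative choices:

```python
def shingles(text, k=4):
    words = text.split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

seen_fingerprints = []   # fingerprints of pages already in the base/index

def content_seen(text, threshold=0.9):
    fp = shingles(text)
    for other in seen_fingerprints:
        jaccard = len(fp & other) / max(1, len(fp | other))
        if jaccard >= threshold:   # near-duplicate of an indexed page
            return True
    seen_fingerprints.append(fp)
    return False
```

A real crawler would index the shingles rather than scan all fingerprints linearly.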
Architecture of Web Crawler
Distributed Crawling
Run multiple crawl threads, under different processes (often at different nodes)
Nodes can be geographically distributed
Partition the hosts being crawled across the nodes
Architecture of Web Crawler
Host Splitter
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
Architecture of Web Crawler
Implementation (in Perl)
Other popular languages: Java, Python, C/C++
Architecture of Web Crawler
Crawling objectives
High web coverage
High page freshness
High content quality
High download rate
Internal and External factors
Amount of hardware (I)
Network bandwidth (I)
Rate of web growth (E)
Rate of web change (E)
Amount of malicious content (e.g., spam, duplicates) (E)
Crawling Strategies
Download prioritization
Within a given period, only a subset of web pages can be downloaded
"Important" pages first
Hence the need for prioritization
Ordering the queue of URLs to be visited
Strategies (ordering metrics)
Breadth-First, Depth-First
Backlink count
Best-First
PageRank
Shark-Search
Crawling Strategies
Breadth-First, Depth-First
Breadth-First search
Implemented with a QUEUE (FIFO)
Pages with the shortest paths are crawled first
Depth-First search
Implemented with a STACK (LIFO)
Crawling Strategies
Pseudocode for Breadth-First
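A minimal sketch of Breadth-First crawling (a reconstruction, not the slide's original listing; fetch and extract_links are assumed helpers like those in the earlier crawl-loop sketch):

```python
from collections import deque

def breadth_first(seeds, fetch, extract_links, max_urls=1000):
    frontier = deque(seeds)    # FIFO queue: shortest paths come out first
    visited = set(seeds)
    while frontier and len(visited) < max_urls:
        url = frontier.popleft()
        page = fetch(url)      # returns page text, or None on failure
        if page is None:
            continue
        for link in extract_links(page, url):
            if link not in visited:    # eliminate already-seen URLs
                visited.add(link)
                frontier.append(link)
```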
Crawling Strategies
Backlink count
Uses the link graph information
Counts the number of crawled pages that point to a page
Links with the highest counts are followed first
Crawling Strategies
Best-First
The best link is selected based on some criterion
E.g., lexical similarity between the topic's keywords and the link's source page
A similarity score sim(topic, p) is assigned to the outgoing links of page p
Cosine similarity is often used:
sim(q, p) = Σ_k f_kq·f_kp / ( √(Σ_k f_kq²) · √(Σ_k f_kp²) )
where q is the topic, p is a crawled page, and f_kq, f_kp are the frequencies of term k in q and p
Crawling Strategies
Pseudocode for Best-First
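A minimal sketch of Best-First (a reconstruction, not the slide's original listing; fetch, extract_links and tokenize are assumed helpers):

```python
import heapq
import math
from collections import Counter

def cosine(q_terms, p_terms):
    q, p = Counter(q_terms), Counter(p_terms)
    dot = sum(q[t] * p[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0

def best_first(seeds, topic_terms, fetch, extract_links, tokenize,
               max_urls=1000):
    frontier = [(0.0, url) for url in seeds]   # (negated score, url)
    heapq.heapify(frontier)
    visited = set(seeds)
    while frontier and len(visited) < max_urls:
        _, url = heapq.heappop(frontier)       # best-scored link first
        page = fetch(url)
        if page is None:
            continue
        score = cosine(topic_terms, tokenize(page))
        for link in extract_links(page, url):
            if link not in visited:            # out-links inherit sim(q, p)
                visited.add(link)
                heapq.heappush(frontier, (-score, link))
```

heapq is a min-heap, hence the negated scores.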
Crawling Strategies
PageRank
The PageRank of a page is the probability for a random surfer (who follows links randomly) to be on this page at any given time
A page's score (rank) is defined by the scores of the pages linking to it:
PR(p) = (1 - γ)/N + γ · Σ_{d ∈ in(p)} PR(d)/|out(d)|
where p is a page, in(p) is the set of pages with links to p, out(d) is the set of links out of d, N is the total number of pages, and γ is the damping factor
The PageRank of pages is periodically recalculated over the data structure holding the crawled pages
Crawling Strategies
Pseudocode for PageRank
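A minimal sketch of the periodic PageRank recalculation (a reconstruction, not the slide's original listing; links maps each crawled page to its out-links):

```python
def pagerank(links, gamma=0.85, iterations=20):
    pages = list(links)
    n = len(pages)
    pr = dict.fromkeys(pages, 1.0 / n)              # uniform initial ranks
    for _ in range(iterations):
        new = dict.fromkeys(pages, (1.0 - gamma) / n)
        for d, outs in links.items():
            if not outs:
                continue
            share = gamma * pr[d] / len(outs)       # PR(d)/|out(d)|, damped
            for p in outs:
                if p in new:                        # ignore uncrawled targets
                    new[p] += share
        pr = new
    return pr   # the frontier can then be ordered by these values
```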
Crawling Strategies
Shark-Search
Puts more emphasis on web segments where relevant pages were found
Penalizes segments yielding few relevant pages
A link's score is defined by the link's anchor text, the text surrounding the link (link context), and the score inherited from ancestor pages (pages pointing to the page with this link)
Parameters:
d - depth bound
r - relative importance of the inherited score versus the link-neighbourhood score
Crawling Strategies
Pseudocode for Shark-Search
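A heavily simplified sketch of Shark-Search link scoring (a reconstruction that keeps only the flavour of the strategy; sim is a topic-similarity helper such as the cosine above, and the weights are illustrative assumptions):

```python
def shark_score(inherited, anchor_text, context_text, topic, sim,
                r=0.5, decay=0.8):
    anchor = sim(topic, anchor_text)
    # if the anchor itself looks relevant, trust it; otherwise fall
    # back to the text surrounding the link
    neighbourhood = anchor if anchor > 0 else sim(topic, context_text)
    # r balances what the ancestors earned against the local evidence
    return r * decay * inherited + (1 - r) * neighbourhood
```

Children of a followed link inherit this score (damped by decay) down to the depth bound d.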
Adaptive Crawling
Static vs. adaptive strategies
The strategies presented to this point are static
They do not adjust in the course of the crawl
Adaptive (intelligent) crawling
InfoSpiders
Ant-based crawling
Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
[Diagram of an agent: an HTML document passes through an HTML parser, a noise-word remover and a stemmer into a compact document representation; document relevance assessment drives learning, reproduction or death, and link assessment and selection; the agent itself is represented by a keyword vector, term weights and neural-net weights]
Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
Each agent uses a list of keywords (initialized with the topic keywords)
A neural network evaluates new links
Keywords in the vicinity of a link are used as input
More importance (weight) is given to keywords close to the link
The maximum goes to words in the anchor text
The output is a numerical quality estimate for the link
The link score is combined with a cosine similarity score (between the agent's keywords and the page containing the link)
Adaptive Crawling
InfoSpiders
Each agent has an energy level
An agent moves from the current page to a new one if a stochastic Boltzmann test succeeds
Its input δ is the difference between the similarity of the new page and of the current page to the agent's keywords
If its energy level passes some threshold, an agent reproduces
The offspring gets half of the parent's frontier
The offspring's keywords are mutated (expanded) with the most frequent terms in the parent's current document
Adaptive Crawling
Pseudocode for InfoSpiders
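A very compressed sketch of an InfoSpiders-style agent (a reconstruction, not the original two-slide listing; fetch, extract_links, sim and score_link, the neural-net stand-in, are assumed helpers, and the temperature and reproduction threshold are illustrative):

```python
import math
import random

def infospiders_agent(start_url, keywords, fetch, extract_links, sim,
                      score_link, energy=1.0, temperature=0.1,
                      reproduce_at=2.0):
    url, current_sim = start_url, 0.0
    while energy > 0:
        page = fetch(url)
        if page is None:
            break
        links = extract_links(page, url)
        if not links:
            break
        # Boltzmann-style stochastic selection over the net's link scores
        weights = [math.exp(score_link(l, page, keywords) / temperature)
                   for l in links]
        url = random.choices(links, weights=weights, k=1)[0]
        new_sim = sim(keywords, fetch(url) or "")
        energy += new_sim - current_sim   # gain/lose energy with relevance
        current_sim = new_sim
        if energy > reproduce_at:
            # a full implementation would spawn an offspring here, giving
            # it half of the frontier and mutated keywords
            energy /= 2
```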
Adaptive Crawling
Ant-based crawling
Motivation: allow crawling agents to communicate with each other
Follows a model of social insects' collective behaviour
Ants leave pheromone along the followed path
Other ants follow such pheromone trails
A crawler agent follows some path by visiting many URLs
At some moment, a certain amount of pheromone (weight) can be assigned to the sequence of URLs on the followed path
The amount can depend on the similarity of the visited pages to a given topic
Adaptive Crawling
Ant-based crawling
Ants (crawlers) operate in cycles
During each cycle, agents make a predefined number of moves (page visits)
#moves = constant ∗ #cycle
At the end of each cycle, the pheromone intensity values are updated along the followed path
Agent-ants then return to their starting positions
Adaptive Crawling
Ant-based crawling
The next link is selected with a probability defined by the corresponding pheromone intensity
If there is no pheromone information, an agent-ant moves randomly
Adaptive Crawling
Ant-based crawling
Probability of selecting a link:
P_ij(t) = τ_ij(t) / Σ_(i,l) τ_il(t)
where t is the cycle number, τ_ij(t) is the pheromone value between p_i and p_j, and (i, l) designates the presence of a link from p_i to p_l
During a cycle, each ant stores the list of visited URLs
If p_j was already visited, P_ij(t) = 0
At the end of the cycle, the list of visited URLs is emptied (a sketch of this selection follows)
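A minimal sketch of the pheromone-driven selection and the end-of-cycle update (a reconstruction under the formula above; the evaporation constant is an illustrative assumption):

```python
import random
from collections import defaultdict

pheromone = defaultdict(float)   # (from_url, to_url) -> tau

def select_next(current, links, visited_this_cycle):
    candidates = [l for l in links if l not in visited_this_cycle]
    if not candidates:
        return None
    taus = [pheromone[(current, l)] for l in candidates]
    if sum(taus) == 0:
        return random.choice(candidates)   # no trail information: move randomly
    return random.choices(candidates, weights=taus, k=1)[0]

def end_of_cycle(path, relevance, evaporation=0.1):
    # reinforce the followed path in proportion to the relevance of the
    # pages found along it; older pheromone partially evaporates
    for a, b in zip(path, path[1:]):
        pheromone[(a, b)] = (1 - evaporation) * pheromone[(a, b)] + relevance
```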
Adaptive Crawling
Implications
Strategies that evaluate links based on their context (nearby text) are not directly applicable to large-scale crawling
E.g., consider crawling 10^9 pages within one month
Crawl rate: around 400 documents per second
Around 40000 links per second
Every second, 10000-30000 "new" links must be evaluated (scored) and added to the frontier
Too many even if only the links' anchor text is evaluated
Outline of Part III
Open Challenges
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
Crawlers in Web ecosystem
Push vs. Pull model
Web pages are accessed via a pull model
- HTTP is a pull protocol
That is, a client requests a page from a server
With push, a server would send a page/info to a client
Why Pull?
Pull is just easier for both parties
No 'agreement' needed between provider and aggregator
No specific protocols for content providers: serving content is enough
Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
What are these?
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
Publishing/updating content is easier with push: no need for redundant requests from crawlers
Providers get better control over their content: no need for crawler politeness
Crawlers in Web ecosystem
Crawler politeness
Content providers possess some control over crawlers
Via special protocols that define access to parts of a site
Via directly banning agents that hit a site too often
Crawlers in Web ecosystem
Crawler politeness
Robots.txt says what can(not) be crawled
Sitemaps is a newer protocol listing a site's crawlable URLs and related metadata
Example: no agent may visit any URL starting with /notcrawldir, except the agent called "goodsearcher"
User-agent: *
Disallow: /notcrawldir
User-agent: goodsearcher
Disallow:
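Such rules can be checked with Python's standard urllib.robotparser (the host name is illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://yoursite.example/robots.txt")
rp.read()   # fetch and parse the robots.txt above

rp.can_fetch("*", "http://yoursite.example/notcrawldir/page.html")
# -> False: disallowed for generic agents
rp.can_fetch("goodsearcher", "http://yoursite.example/notcrawldir/page.html")
# -> True: goodsearcher is exempted
```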
Collaborative Crawling
Main considerations
Lots of redundant crawling
To get data (often on a specific topic), one needs to crawl broadly
- Often a lack of expertise when a large crawl is required
- Often, a lot is crawled but only a small subset is used
Too many redundant requests to content providers
Idea: have one crawler do a very broad and intensive crawl, with many parties accessing the crawled data via an API
- Parties specify filters to select the required pages
Crawler as a common service
Collaborative Crawling
Some requirements
A filter language for specifying conditions
Efficient filter processing (millions of filters to process; a toy sketch follows)
Efficient fetching (hundreds of pages per second)
Support for real-time requests
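A toy sketch of filter processing: an inverted index from terms to filter ids avoids scanning every filter for every document (the filters shown are invented examples, not from the cited system):

```python
from collections import defaultdict

filters = {1: {"apartment", "helsinki"},   # filter id -> required terms
           2: {"used", "car"}}

term_index = defaultdict(set)              # term -> ids of filters using it
for fid, terms in filters.items():
    for term in terms:
        term_index[term].add(fid)

def matching_filters(doc_text):
    doc_terms = set(doc_text.lower().split())
    candidates = set()
    for term in doc_terms:                 # consider only filters sharing a term
        candidates |= term_index.get(term, set())
    # a filter matches when all of its terms occur in the document
    return [fid for fid in candidates if filters[fid] <= doc_terms]

matching_filters("spacious apartment for rent in central helsinki")  # -> [1]
```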
Collaborative Crawling
New component
Process a stream of documents against a filter index
Collaborative Crawling
Filter processing architecture
Collaborative Crawling
Based on 'The architecture and implementation of an extensible web crawler' by Hsieh, Gribble, and Levy, 2010 (illustrations on slides 61-62 are from Hsieh's slides)
E.g., 80legs provides similar crawling services
In a way, this reconsiders the pull/push model of content delivery on the Web
Deep Web Crawling
Visualization of http://amazon.com by aharef.info applet
Deep Web Crawling
In a nutshell
The problem is in the yellow nodes (designating web form elements)
Content hidden behind HTML forms
- Deep Web: the part of the Web not accessible through search engines
- My preferred definition: the content behind web search forms on publicly available pages
- Pages with the forms themselves are typically accessible/searchable (= crawled)
Why is it important?
Large source of structured data
- Forms present a search interface over backend databases
Significant gap in search engine coverage
- Potentially more content than is currently searchable
- More than 10 million distinct HTML forms
- Likely to increase as more data comes online
The size of the deep Web is unclear
- The '500x' figures are highly disputable
- The number of resources is a bit simpler: ~450k databases on the Web in 2004
Some deep web content is crawled/covered by search engines
- Content can often be both searched and browsed via links categorizing it
- Business-driven sites (e.g., shopping) typically provide both ways of access
Why crawlers do not crawl the deep Web
They can't pass through the forms (some values need to be specified)
I.e., the content is "hidden" behind search forms
- Hence another name for the deep Web: the hidden Web
To crawl/access the content behind a form, the following is required (a sketch follows the list):
- Identify a search form on a page
- Fill the form with proper values
- Submit the form
- Get the result pages
- Extract links/data from them
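A minimal sketch of these steps for a simple GET search form (the site and field names are invented for illustration; real deep-web crawling must detect the form and choose the values automatically):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://books.example/search-page"
page = requests.get(base, timeout=10).text
form = BeautifulSoup(page, "html.parser").find("form")     # step 1: find a form
action = urljoin(base, form.get("action", ""))             # (assumes one exists)
values = {"title": "web crawling"}                         # step 2: fill values
result = requests.get(action, params=values, timeout=10)   # step 3: submit
links = [a["href"] for a in                                # steps 4-5: results
         BeautifulSoup(result.text, "html.parser").find_all("a", href=True)]
```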
Approaches to deep Web crawling
Google's Deep Web Crawl (2008)
- Identify search forms
- Pre-compute all interesting form submissions for each HTML form
- Each form submission corresponds to a distinct URL
- Add the URLs for each form submission into the search engine index
- Allows reusing the existing search engine infrastructure
- Does not aim for full coverage of a deep web resource
- Not all forms are covered (only GET forms)
Deep Web site identification
Task: identify a search form leading to content-rich web pages
Surprisingly, quite a challenging task
One of the problems: detecting whether a form is searchable
Searchable forms
Non-searchable: login forms, forms that require user info
It depends: highly interactive forms, e.g., airline reservations
What are deep Web resources? E.g., store locations, used cars, radio stations, patents, recipes
Deep Web site identification
Detecting whether a form is informational
- Challenging for a human too: e.g., assume the form is in an unknown language
Detection by building/training binary classifiers
Forms identified as searchable can then be classified into domains (e.g., car search, apartment search)
- Based on form structure (e.g., number of fields)
- Based on form field labels
A slow process
- Done by a specific component in offline mode
Crawling JavaScript-rich sites
Web pages have become more responsive, interactive, user-friendly, etc.
- Thanks to the emergence of new web technologies such as AJAX
Besides, these technologies led to the wide spread of web applications (RIAs)
A challenge for crawlers, as they do not
- Manipulate the client-side state of a site
- Take into account asynchronous communication with the server
Crawling JavaScript-rich sites
Very similar to the deep Web crawling challenge
- Content is hard to crawl
- Direct problem: AJAX/JS-enabled forms are hard to deal with (e.g., to detect and then generate meaningful queries)
Web pages are designed for human beings, not for automatic programs
JS code must be processed to get to the actual content
- It changes dynamically
- Lots of additional resources are required (the crawler must be supplemented with a JS interpreter)
Crawling JavaScript-rich sites
Several techniques for AJAX crawling have been proposed since 2007/08
- The focus is either on indexing and searching or on testing RIAs
Approach (a sketch follows):
- An AJAX-enabled web page/application is modeled using states, events and transitions
- The crawler uses a breadth-first strategy:
- It triggers the events on a page
- If the DOM of the page changes, a new state/transition is added to the transition graph
- It goes back to the initial state to invoke the next event
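A schematic sketch of that breadth-first state exploration (driver is a hypothetical browser-automation interface; in practice a tool such as Selenium plays this role):

```python
from collections import deque

def crawl_states(driver, initial_url):
    driver.load(initial_url)
    start = driver.dom_snapshot()       # serialized DOM identifies a state
    states, transitions = {start}, set()
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for event in driver.events_in(state):   # clicks, time-outs, ...
            driver.restore(state)               # back to the state under study
            driver.trigger(event)
            new = driver.dom_snapshot()
            transitions.add((state, event, new))
            if new not in states:               # DOM changed: new state found
                states.add(new)
                queue.append(new)
    return states, transitions
```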
Crawling Multimedia Content
The web is now a multimedia platform
Images, video and audio are an integral part of web pages (not just supplements to them)
Almost all crawlers, however, treat the Web as a textual repository
One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications
Hence, no real need to harvest multimedia
But state-of-the-art multimedia retrieval and computer vision techniques already provide adequate search quality
E.g., searching for images with a cat and a man based on the actual image content (not the text around/close to the image)
In the case of video: a set of frames plus audio (which can be converted to textual form)
Crawling Multimedia Content
Challenges in crawling multimedia
Bigger load on web sites, since the files are bigger
More apparent copyright issues
More resources (e.g., bandwidth, storage space) required from a crawler
More complicated duplicate resolution
Re-visiting policy
Crawling Multimedia Content
Scalable Multimedia Web Observatory of ARCOMEM
project (http://www.arcomem.eu)
Focus on web archiving issues
Uses several crawlers
- A 'standard' crawler for regular web pages
- An API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube)
- A deep Web crawler able to extract information from pre-defined web sites
Data can be exported in WARC (Web ARChive) files and in
RDF
Future Directions
Collaborative crawling, mixed pull-push model
Scalable adaptive strategies
Understanding site structure
Deep Web crawling
Semantic Web crawling
Media content crawling
Social network crawling
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5 TB compressed
- Hosted at several cloud services (free license required) or
a copy can be ordered on hard disks (pay for disks)
ClueWeb12:
- Almost 900 million English web pages
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/
and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as Amazon Web Services’ public dataset (pay for
processing)
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
Internet Archive:
- See http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-resea
- Crawl of 2011
- 80 TB of WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request
References: Crawl Datasets
LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- A variety of web graph datasets (nodes, arcs, etc.), including basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications
ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 million blog posts and 230 million social media publications
- 2TB compressed
Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities web sites
References: Literature
For beginners: Udacity/CS101 course;
http://www.udacity.com/overview/Course/cs101
Intermediate: Chapter 20 of the Introduction to Information Retrieval book by Manning, Raghavan, Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Intermediate: Current Challenges in Web Crawling tutorial at ICWE 2013 by Shestakov; http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling
Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
References: Literature
See relevant publications at Mendeley:
http://www.mendeley.com/groups/531771/web-crawling/
Feel free to join the group!
Check the 'Deep Web' group too:
http://www.mendeley.com/groups/601801/deep-web/