Crawler pieces

Spider
实现 Spider 细节

们统构过
识术语

过过实现 “ ”讲实现

简单

问题论

监馈

键问题预测颈

读论 Spike

务调对 “ 识 ”

战较

vs.

http://www.flickr.com/photos/blueblankut/497571704/sizes/z/in/photostream/

http://www.flickr.com/photos/coreyburger/2481836757/sizes/z/in/photostream/

Architecture

Normalization

Dust

Link anchors

Architecture
Page identification

Link task priority and page rating

评键识

Error penalties and history

Freshness

Access

Architecture

Load

Per-IP load

Rules and protocols: sitemap, robots.txt

Link detection
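The robots.txt and sitemap rules listed on this slide can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch, not the deck's implementation: the robots.txt content, the `example-crawler` user agent, and the `example.com` URLs are all made up for illustration, and a real crawler would fetch the file from each site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: http://example.com/sitemap.xml
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# May this user agent fetch these URLs?
allowed = parser.can_fetch("example-crawler", "http://example.com/index.html")
blocked = parser.can_fetch("example-crawler", "http://example.com/private/a.html")

# Politeness hints a scheduler could use for per-site load control.
delay = parser.crawl_delay("example-crawler")

# Sitemap URLs declared by the site (requires Python 3.8 or later).
sitemaps = parser.site_maps()
```

A scheduler could consult `can_fetch` before queueing a link and use `crawl_delay` as one input to its per-site or per-IP load limit.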

Architecture
Processing

Structure recognition

Semantics

Structure recognition

Recognition

Related links

Architecture

词

图 Query转换

词

键词

层访问

Cache

杂 ……

http://www.flickr.com/photos/regolare/791385521/

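Speaking of Spider…… Architecture
Pipeline

Real-time priority pipeline

Real-time processing system, Map-Reduce
Resources
Saving resources

Feedback: long delay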

Crawler Architecture

[Diagram: crawler data flow. Recoverable labels follow.]

Components: Scope.txt, TaskLoader, Sites, Priority Heap, Crawler main loop, Linkbase, links queue, pages queue, Downloader (Download Workers), Extractor (Extractor Workers), Repository

Priority Heap: ordered Sites and their links

Site will refill itself when it's empty

Main loop will put the peek site's links into the queue

Download Worker labels: get a link; if 302 found; update link HTTP status; put downloaded page to queue

Extractor Worker labels: extract links and save

Save page to repository
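The flow in the diagram can be sketched in a few lines of single-threaded Python. This is only an illustration under stated assumptions: the fake web, the `fetch` stub, and the `crawl` function are hypothetical, the workers run inline instead of as separate processes, the downloader is assumed to be the component that saves pages, and a 302 is handled by recording the redirect target as a new link.

```python
import heapq
import re
from collections import deque
from urllib.parse import urlsplit

# Hypothetical stand-in for the network: url -> (status, body or redirect target).
FAKE_WEB = {
    "http://a.example/": (200, '<a href="http://a.example/1">1</a>'),
    "http://a.example/1": (302, "http://a.example/2"),
    "http://a.example/2": (200, "leaf page"),
    "http://b.example/": (200, "no links here"),
}


def fetch(url):
    """Stub downloader (assumption); a real one would issue an HTTP request."""
    return FAKE_WEB.get(url, (404, ""))


def crawl(seed_sites, batch=2):
    """seed_sites: list of (priority, host, [urls]); lower priority runs first."""
    repository = {}   # url -> page body
    linkbase = {}     # url -> HTTP status, None until fetched
    pending = {}      # host -> links waiting to be queued
    for _prio, host, urls in seed_sites:
        pending[host] = deque(urls)
        for url in urls:
            linkbase[url] = None
    heap = [(prio, host) for prio, host, _urls in seed_sites]
    heapq.heapify(heap)
    links_queue, pages_queue = deque(), deque()

    while heap:
        prio, host = heapq.heappop(heap)
        site_links = pending[host]
        if not site_links:
            # "Site will refill itself when it's empty": reload unfetched links.
            site_links.extend(u for u, status in linkbase.items()
                              if status is None and urlsplit(u).netloc == host)
        if not site_links:
            continue  # nothing left for this site, so it leaves the heap
        # Main loop: move a batch of the top site's links to the links queue.
        for _ in range(min(batch, len(site_links))):
            links_queue.append(site_links.popleft())
        heapq.heappush(heap, (prio, host))

        # Download worker: fetch, update link status, queue downloaded pages.
        while links_queue:
            url = links_queue.popleft()
            status, payload = fetch(url)
            linkbase[url] = status
            if status == 302:
                linkbase.setdefault(payload, None)  # redirect target is a new link
            elif status == 200:
                repository[url] = payload
                pages_queue.append((url, payload))

        # Extractor worker: extract links from pages and save them.
        while pages_queue:
            _url, body = pages_queue.popleft()
            for link in re.findall(r'href="([^"]+)"', body):
                linkbase.setdefault(link, None)
    return repository, linkbase


repository, linkbase = crawl([
    (0, "a.example", ["http://a.example/"]),
    (1, "b.example", ["http://b.example/"]),
])
```

Keeping sites rather than individual links in the heap means the scheduler compares a handful of sites instead of every known URL, which is one plausible reason for the "ordered Sites and their links" layout in the diagram.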

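Key technologies
Storage system

System scheduling and monitoring

Pipeline-driven

Link storage structure and task scheduling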

Key technologies
Dust

Pages: Simhash

Rating: PageRank, or a simpler rating

Dictionary-based word segmentation
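Simhash, named on this slide for pages, reduces a document to a short fingerprint so that similar documents tend to get fingerprints that differ in few bits. The sketch below is a minimal version under assumptions of my own: whitespace tokens with equal weight, MD5 as the per-token hash, and a 64-bit fingerprint. The slides do not say which tokenizer, weights, or hash were used.

```python
import hashlib

BITS = 64  # fingerprint width (assumption)


def simhash(text: str) -> int:
    """Return a 64-bit Simhash fingerprint of whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_a = "crawler pieces talk about spider architecture and storage"
page_b = "crawler pieces talk about spider architecture and scheduling"
distance = hamming_distance(simhash(page_a), simhash(page_b))
```

A crawler would treat two pages as near-duplicates when `distance` falls below a threshold chosen from its own data; the threshold is a tuning decision, not something the slides specify.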

Key technologies
Robust HTML: css selector, lxml, tidy

CAPTCHA recognition

URL rules

Proxy techniques, and UA spoofing

Legend 1 - NoSQL
System storage hits a scaling bottleneck; trying NoSQL

Legend - NoSQL
System structures implemented on NoSQL

Priority queue: Heap

Queue

Queue: FIFO

Legend - NoSQL
术选围这选
为

HBase

Cassandra

Legend - NoSQL
Cassandra

Stability problems, bugs

Random
partitioning

Resources

For the Crawler: our Crawler depends heavily on
locks and transactions

Negative news

Legend - NoSQL
CAP对说应该

HBase -> CP Cassandra -> AP

实际 Cassandra C
Crawler link

Legend 2 - Google's moves
Incremental processing system:
Percolator, a.k.a. Caffeine

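Legend 2 - Google's moves
BigTable storage

Transaction guarantees: timestamp oracle, lightweight
lock

Produces Notifications, similar to database triggers

Pipeline: Observers pass Notifications along

The system consumes Notifications; Percolator Workers
implement our pipeline tasks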
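The notification and observer idea on this slide can be illustrated with a toy in-memory table. This sketch only shows the trigger-like chaining. It has no transactions, timestamp oracle, locks, or BigTable behind it, and `ToyTable`, `parse_document`, and `count_links` are hypothetical names invented for the example.

```python
from collections import deque


class ToyTable:
    """In-memory stand-in for a table with observed columns (assumption)."""

    def __init__(self):
        self.rows = {}                # row key -> {column: value}
        self.observers = {}           # column -> callback
        self.notifications = deque()  # pending (row, column) pairs

    def observe(self, column, func):
        """Register an observer that runs when `column` is written."""
        self.observers[column] = func

    def write(self, row, column, value):
        self.rows.setdefault(row, {})[column] = value
        if column in self.observers:
            # A write to an observed column produces a notification.
            self.notifications.append((row, column))

    def run_worker(self):
        """Consume notifications until none are left."""
        while self.notifications:
            row, column = self.notifications.popleft()
            self.observers[column](self, row, self.rows[row][column])


def parse_document(table, row, raw):
    # First pipeline step: extract links from the raw page text.
    links = [word for word in raw.split() if word.startswith("http://")]
    table.write(row, "links", links)  # this write notifies the next observer


def count_links(table, row, links):
    # Second pipeline step, triggered by the write to "links".
    table.write(row, "link_count", len(links))


table = ToyTable()
table.observe("raw", parse_document)
table.observe("links", count_links)
table.write("http://a.example/", "raw",
            "see http://b.example/ and http://c.example/")
table.run_worker()
link_count = table.rows["http://a.example/"]["link_count"]
```

Each observer finishes by writing a column that another observer watches, which is how a chain of small steps can form a pipeline without a batch job over the whole data set.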

Legend 2 - Google's moves
Compared with Map-Reduce

Latency

Relies on locality in the design

Legend 2 - Google's moves
Trade-off

时 trillion:million
Map-Reduce 迟

单Page RPC MR 过读队组预读
缓 10 MR RPC

资

Legend 2 - Google's moves
Percolator vs. traditional DBMS

Query language

为 scale设计库为节
Percolator 节节

调迟调

Percolator 义为shared-nothing parallel
databases
