This is a presentation I prepared for Beijing Open Party. It summarizes what I learned while building a crawler system. It probably contains some mistakes, so please don't rely on it for serious purposes.
4. vs.
http://www.flickr.com/photos/blueblankut/497571704/sizes/z/in/photostream/
http://www.flickr.com/photos/coreyburger/2481836757/sizes/z/in/photostream/
14. Crawler Architecture
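Components: Repository, Downloader, Extractor, Download Worker, Extractor Worker, links queue, pages queue.
Flows:
get a link from the links queue
update the link's HTTP status; if a 302 is found, update the link
save the page to the repository
put the downloaded page on the pages queue
extract links and save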
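The worker and queue flow on this slide can be sketched roughly as below. This is only a minimal illustration, not the real system: `FAKE_WEB`, `fetch`, and `crawl_once` are hypothetical names, the "web" is an in-memory dict instead of an HTTP client, the assumption that the download worker is the one saving pages to the repository is mine, and extracted links are only collected rather than fed back into the links queue.

```python
import queue
import re
import threading

# Hypothetical stand-in for the network: url -> (status, body, redirect target).
FAKE_WEB = {
    "http://a.example/": (200, '<a href="http://a.example/x">x</a>', None),
    "http://a.example/old": (302, "", "http://a.example/"),
    "http://a.example/x": (200, "leaf page", None),
}
HREF = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Pretend HTTP client; unknown URLs answer 404."""
    return FAKE_WEB.get(url, (404, "", None))


def crawl_once(seed_links):
    links_queue, pages_queue = queue.Queue(), queue.Queue()
    repository, link_status, extracted = {}, {}, set()
    lock = threading.Lock()

    def download_worker():
        while True:
            url = links_queue.get()
            try:
                if url is None:  # sentinel: stop the worker
                    return
                status, body, target = fetch(url)
                with lock:
                    link_status[url] = status  # update link HTTP status
                if status == 302 and target:
                    # 302 found: update the link and fetch the new location once
                    url = target
                    status, body, _ = fetch(url)
                    with lock:
                        link_status[url] = status
                if status == 200:
                    with lock:
                        repository[url] = body  # save page to repository
                    pages_queue.put((url, body))  # put downloaded page to queue
            finally:
                links_queue.task_done()

    def extractor_worker():
        while True:
            page = pages_queue.get()
            try:
                if page is None:
                    return
                _, body = page
                with lock:
                    extracted.update(HREF.findall(body))  # extract links and save
            finally:
                pages_queue.task_done()

    workers = [threading.Thread(target=download_worker),
               threading.Thread(target=extractor_worker)]
    for w in workers:
        w.start()
    for url in seed_links:
        links_queue.put(url)
    links_queue.join()   # every link has been downloaded
    pages_queue.join()   # every downloaded page has been extracted
    links_queue.put(None)
    pages_queue.put(None)
    for w in workers:
        w.join()
    return repository, link_status, extracted
```

Calling `crawl_once(["http://a.example/", "http://a.example/old"])` runs one pass: the redirected link ends up recorded with status 302, and the page it points to is stored under its new URL.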
Components: Crawler, Linkbase, main loop, TaskLoader, Priority Heap, Scope.txt, Sites.
Priority Heap: ordered sites and their links.
Main loop: peeks the top site and puts its links on the links queue.
A site refills itself when it is empty.
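One way to picture the main loop and the priority heap of sites is the sketch below. It is only an illustration under assumptions: the `Site` class, its `batch_size`, the dict-based `linkbase`, and the rule of lowering a site's priority after each turn are all hypothetical and not taken from the slides.

```python
import heapq
from collections import deque


class Site:
    """A site holding a small in-memory batch of links (hypothetical design)."""

    def __init__(self, name, priority, linkbase, batch_size=2):
        self.name = name
        self.priority = priority      # smaller value = crawled sooner
        self.linkbase = linkbase      # assumed: dict of site name -> deque of links
        self.batch_size = batch_size
        self.links = deque()

    def refill(self):
        # Pull up to batch_size stored links from the linkbase.
        stored = self.linkbase.get(self.name, deque())
        while stored and len(self.links) < self.batch_size:
            self.links.append(stored.popleft())

    def next_links(self):
        if not self.links:            # the site refills itself when it is empty
            self.refill()
        batch = list(self.links)
        self.links.clear()
        return batch

    def __lt__(self, other):
        # Order by priority, then by name so ties are deterministic.
        return (self.priority, self.name) < (other.priority, other.name)


def main_loop(sites, max_rounds):
    """Peek the top site, hand its links to the (simulated) links queue."""
    heap = list(sites)
    heapq.heapify(heap)
    links_queue = []                  # stand-in for the real links queue
    for _ in range(max_rounds):
        if not heap:
            break
        site = heap[0]                # peek without removing
        batch = site.next_links()
        if not batch:
            heapq.heappop(heap)       # nothing left in the linkbase for this site
            continue
        links_queue.extend(batch)
        site.priority += 1            # assumed politeness rule: let others go next
        heapq.heapreplace(heap, site) # restore heap order after the change
    return links_queue
```

For example, with `linkbase = {"a": deque(["a1", "a2", "a3"]), "b": deque(["b1"])}` and two sites `a` (priority 0) and `b` (priority 1), `main_loop` drains both sites in priority order and stops once every site is empty.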