2. Agenda
• Foreword: Web Crawler
  – HTML Parser
• Practice
  – Feed Crawler
• Prototype demo
• Conclusion
3. HTML Parser
• HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing.
• The first step is to clean up the mess and bring order to tags, attributes, and ordinary text.
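To make the problem concrete, here is a minimal sketch (in Python, using the standard-library `html.parser` rather than the Java toolkits named later) showing a tolerant parser pulling ordinary text out of markup with unclosed tags, an unquoted attribute, and a stray end tag — the sample HTML is invented for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects ordinary text while tolerating ill-formed markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# Typical "dirty" HTML: unclosed <p>, unquoted attribute, stray </b>.
dirty = "<p class=intro>Hello <b>world</p></b><p>bye"
p = TextExtractor()
p.feed(dirty)
print(p.chunks)  # ['Hello', 'world', 'bye']
```

A strict XML parser would reject this input outright; a cleanup-oriented HTML parser keeps going and recovers the text.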
4. Well-known Parsers
• Access the information through standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML
5. Parser inner structure
• HTML scanner
  – Pre-processing action
• Tag balancer
  – Reorders individual elements
  – Produces well-formed XML
• Extraction
• Transformation
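The tag-balancer stage can be sketched with a small stack-based re-emitter (a simplified illustration in Python, not the internals of any of the toolkits above): unclosed tags are closed in order, stray end tags are dropped, and the result is well-formed XML. Attributes are omitted for brevity, and the void-element list is a hypothetical subset:

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # illustrative subset

class TagBalancer(HTMLParser):
    """Re-emits a tag stream as well-formed XML by balancing open tags."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self.out.append(f"<{tag}/>")  # self-closing in XML
        else:
            self.out.append(f"<{tag}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close intervening unclosed tags first.
            while self.stack[-1] != tag:
                self.out.append(f"</{self.stack.pop()}>")
            self.out.append(f"</{self.stack.pop()}>")
        # Stray end tags (no matching open tag) are silently dropped.

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()
        while self.stack:  # close anything left open at end of input
            self.out.append(f"</{self.stack.pop()}>")

b = TagBalancer()
b.feed("<div><p>one<p>two</div>")
b.close()
print("".join(b.out))  # <div><p>one<p>two</p></p></div>
```

The output nests rather than reorders the paragraphs; a production balancer such as the ones in these toolkits also applies per-element nesting rules (e.g. `<p>` cannot contain `<p>`), which this sketch skips.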
7. Extraction
• Text extraction
  – e.g., as input for text search engine databases
• Link extraction
  – for crawling through web pages or harvesting email addresses
• Screen scraping
  – for programmatic data input from web pages
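Link extraction, the use case most relevant to a crawler, takes only a few lines with an event-based parser (again a Python sketch with invented sample markup):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Harvests href targets from <a> tags for crawling or address harvesting."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<p><a href="/news">News</a> <a href="mailto:info@example.org">Mail</a></p>'
p = LinkExtractor()
p.feed(page)
print(p.links)  # ['/news', 'mailto:info@example.org']
```

A crawler would then enqueue the `http(s)` links and set aside `mailto:` targets for address harvesting.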
8. Extraction
• Resource extraction
  – collecting images or sound
• A browser front end
  – the preliminary stage of page display
• Link checking
  – ensuring links are valid
• Site monitoring
  – checking for page differences beyond simplistic diffs
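"Beyond simplistic diffs" can mean comparing pages by their extracted text rather than their raw bytes, so that markup or whitespace churn does not raise false alarms. One possible approach, sketched here with invented sample pages, fingerprints the normalized text content:

```python
import hashlib
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Keeps only ordinary text, so markup-only edits don't trigger alerts."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def content_fingerprint(html):
    p = TextOnly()
    p.feed(html)
    text = " ".join(" ".join(p.parts).split())  # collapse all whitespace runs
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

old = "<html><body><h1>Status</h1><p>All systems OK</p></body></html>"
new = "<html><body >\n<h1>Status</h1>\n<p>All systems   OK</p></body></html>"
print(content_fingerprint(old) == content_fingerprint(new))  # True
```

A byte-level diff of these two pages reports many changes; the text fingerprint is identical, so the monitor stays quiet until the visible content actually changes.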
9. Transformation
• URL rewriting
  – modifying some or all links on a page
• Site capture
  – moving content from the Web to local disk
• Censorship
  – removing offending words and phrases from pages
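URL rewriting is also the piece that makes site capture work: relative links must be resolved against the page's base URL before the content moves to local disk. A minimal sketch (Python, with a hypothetical base URL) re-emits the markup with every `href` made absolute:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkRewriter(HTMLParser):
    """Re-emits HTML with every href resolved against a base URL."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.out = []

    def handle_starttag(self, tag, attrs):
        rendered = []
        for name, value in attrs:
            if name == "href" and value:
                value = urljoin(self.base, value)  # relative -> absolute
            rendered.append(f' {name}="{value}"')
        self.out.append(f"<{tag}{''.join(rendered)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

r = LinkRewriter("http://example.org/docs/")
r.feed('<a href="intro.html">Intro</a>')
print("".join(r.out))  # <a href="http://example.org/docs/intro.html">Intro</a>
```

The same re-emission skeleton serves the other transformations on this slide: swap the `urljoin` step for a phrase filter and it becomes a censorship pass.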
10. Transformation
• HTML cleanup
  – correcting erroneous pages
• Ad removal
  – excising URLs referencing advertising
• Conversion to XML
  – moving existing web pages to XML
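Ad removal follows the same re-emit-while-filtering pattern: drop any element whose URL points at a known advertising host. The blocklist and sample page below are hypothetical, and only `<img>` tags are filtered to keep the sketch short:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

AD_HOSTS = {"ads.example.com", "banner.example.net"}  # hypothetical blocklist

class AdFilter(HTMLParser):
    """Re-emits HTML, excising <img> tags that reference advertising hosts."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "img" and urlparse(d.get("src", "")).hostname in AD_HOSTS:
            return  # drop the ad
        rendered = "".join(f' {k}="{v}"' for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

f = AdFilter()
f.feed('<p><img src="http://ads.example.com/b.gif">Story</p>')
print("".join(f.out))  # <p>Story</p>
```

Combined with a tag balancer, the filtered output can be serialized as well-formed XML, covering the "conversion to XML" bullet as well.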
11. Practice
• Feed Crawler
  – HTML
    • Bloglines, Feedage
  – XML
    • RssMountain
  – JSON
    • Google AJAX Feed API
• Prototype
  – Demo
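Once the crawler has fetched a feed, the parsing step depends on the format: XML feeds (RSS/Atom) are usually well-formed and need no HTML repair, and JSON feeds need none at all. A sketch with invented sample payloads, using only the Python standard library:

```python
import json
import xml.etree.ElementTree as ET

# XML feed (RSS 2.0): standard XML interfaces are enough.
rss = """<rss version="2.0"><channel><title>Demo</title>
<item><title>First post</title><link>http://example.org/1</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
print(items)  # [('First post', 'http://example.org/1')]

# JSON feed (e.g. as returned by a feed API): plain data, no markup cleanup.
payload = '{"entries": [{"title": "First post", "link": "http://example.org/1"}]}'
entries = json.loads(payload)["entries"]
print(entries[0]["title"])  # First post
```

Only HTML sources (such as directory pages on Bloglines or Feedage) require the full cleanup-and-balance pipeline from the earlier slides before extraction.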
12. Conclusion
• Page search, image search, news search, blog search, feed search ...
• Fault tolerance of text processing
• Text mining on the Web
• Q&A