SlideShare uma empresa Scribd logo
1 de 12
Web crawler
Allan@eSobi.com
Agenda
‱ Forward of Web Crawler
– HTML Parser

‱ Practice
– Feed Crawler

‱ Prototype demo
‱ Conclusion
HTML Parser
‱ HTML found on Web is usually dirty,
ill-formed and unsuitable for further
processing.
‱ First clean up the mess and bring the
order to tags, attributes and
ordinary text.
Well-known Parser
‱ Access the information using
standard XML interfaces.
‱ HtmlCleaner
‱ HtmlParser
‱ Nekohtml
Parser inner structure
‱ HTML scanner

– Pre-processing action

‱ Tag balancer

– Reorders individual elements
– Produces well-formed XML

‱ Extraction
‱ Transformation
Example
Extraction
‱ Text extraction
– for use as input for text search engine
databases for example

‱ Link extraction
– for crawling through web pages or harvesting
email addresses

‱ Screen scraping
– for programmatic data input from web pages
Extraction

‱ Resource extraction

– collecting images or sound

‱ A browser front end

– the preliminary stage of page display

‱ Link checking

– ensuring links are valid

‱ Site monitoring

– checking for page differences beyond
simplistic diffs
Transformation
‱ URL rewriting
– modifying some or all links on a page

‱ Site capture
– moving content from the web to local disk

‱ Censorship
– removing offending words and phrases from
pages
Transformation
‱ HTML cleanup
– correcting erroneous pages

‱ AD removal
– excising URLs referencing advertising

‱ Conversion to XML
– moving existing web pages to XML
Practice
‱ Feed Crawler
– HTML

‱ Bloglines, Feedage

– XML

‱ RssMountain

– JSON

‱ Google AJAX Feed API

‱ Prototype
– Demo
Conclusion
‱ Page search, image search, news
search, blog search, feed search ...
‱ Fault toleration of text processing
‱ Text mining in web
‱ Q&A

Mais conteĂșdo relacionado

Destaque

Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profitFederico Feroldi
 
Web crawler and applications
Web crawler and applicationsWeb crawler and applications
Web crawler and applicationsPartnered Health
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlerishmecse13
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawlingDenis Shestakov
 

Destaque (11)

Week10 Web Presentation
Week10 Web PresentationWeek10 Web Presentation
Week10 Web Presentation
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Web crawler and applications
Web crawler and applicationsWeb crawler and applications
Web crawler and applications
 
search engines
search enginessearch engines
search engines
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 

Semelhante a Web Crawler

Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlervinay arora
 
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
CNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the ApplicationCNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the ApplicationSam Bowne
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producingkurtgessler
 
CNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the ApplicationCNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the ApplicationSam Bowne
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptxDEEPAK948083
 
Search Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and AnalyticsSearch Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and AnalyticsBill Hartzer
 
Module-5 Ppt.pptx
Module-5 Ppt.pptxModule-5 Ppt.pptx
Module-5 Ppt.pptxssuser44f56b1
 
SEO for developers (session 1)
SEO for developers (session 1)SEO for developers (session 1)
SEO for developers (session 1)RankAbove
 
On page optimization
On page optimizationOn page optimization
On page optimizationshieepoo
 
Search engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATGSearch engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATGVignesh sitaraman
 
How To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly WebsiteHow To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly WebsiteMatt Siltala
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptxScrbifPt
 

Semelhante a Web Crawler (20)

Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web mining
Web miningWeb mining
Web mining
 
CNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the ApplicationCNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the Application
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producing
 
CNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the ApplicationCNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the Application
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptx
 
Webcrawler
WebcrawlerWebcrawler
Webcrawler
 
Search Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and AnalyticsSearch Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and Analytics
 
Search
SearchSearch
Search
 
On-Page SEO
On-Page  SEO On-Page  SEO
On-Page SEO
 
Webcrawler
WebcrawlerWebcrawler
Webcrawler
 
Module-5 Ppt.pptx
Module-5 Ppt.pptxModule-5 Ppt.pptx
Module-5 Ppt.pptx
 
SEO for developers (session 1)
SEO for developers (session 1)SEO for developers (session 1)
SEO for developers (session 1)
 
On page optimization
On page optimizationOn page optimization
On page optimization
 
Search engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATGSearch engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATG
 
How To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly WebsiteHow To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly Website
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
 

Mais de Allan Huang

Concurrency in Java
Concurrency in  JavaConcurrency in  Java
Concurrency in JavaAllan Huang
 
Build, logging, and unit test tools
Build, logging, and unit test toolsBuild, logging, and unit test tools
Build, logging, and unit test toolsAllan Huang
 
Java JSON Parser Comparison
Java JSON Parser ComparisonJava JSON Parser Comparison
Java JSON Parser ComparisonAllan Huang
 
Netty 4-based RPC System Development
Netty 4-based RPC System DevelopmentNetty 4-based RPC System Development
Netty 4-based RPC System DevelopmentAllan Huang
 
eSobi Website Multilayered Architecture
eSobi Website Multilayered ArchitectureeSobi Website Multilayered Architecture
eSobi Website Multilayered ArchitectureAllan Huang
 
Java New Evolution
Java New EvolutionJava New Evolution
Java New EvolutionAllan Huang
 
Tomcat New Evolution
Tomcat New EvolutionTomcat New Evolution
Tomcat New EvolutionAllan Huang
 
JQuery New Evolution
JQuery New EvolutionJQuery New Evolution
JQuery New EvolutionAllan Huang
 
Responsive Web Design
Responsive Web DesignResponsive Web Design
Responsive Web DesignAllan Huang
 
Boilerpipe Integration And Improvement
Boilerpipe Integration And ImprovementBoilerpipe Integration And Improvement
Boilerpipe Integration And ImprovementAllan Huang
 
YQL Case Study
YQL Case StudyYQL Case Study
YQL Case StudyAllan Huang
 
Build Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGapBuild Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGapAllan Huang
 
HTML5 Multithreading
HTML5 MultithreadingHTML5 Multithreading
HTML5 MultithreadingAllan Huang
 
HTML5 Offline Web Application
HTML5 Offline Web ApplicationHTML5 Offline Web Application
HTML5 Offline Web ApplicationAllan Huang
 
HTML5 Data Storage
HTML5 Data StorageHTML5 Data Storage
HTML5 Data StorageAllan Huang
 
Java Script Patterns
Java Script PatternsJava Script Patterns
Java Script PatternsAllan Huang
 
Weighted feed recommand
Weighted feed recommandWeighted feed recommand
Weighted feed recommandAllan Huang
 
eSobi Site Initiation
eSobi Site InitiationeSobi Site Initiation
eSobi Site InitiationAllan Huang
 
Architecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EEArchitecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EEAllan Huang
 

Mais de Allan Huang (20)

Concurrency in Java
Concurrency in  JavaConcurrency in  Java
Concurrency in Java
 
Build, logging, and unit test tools
Build, logging, and unit test toolsBuild, logging, and unit test tools
Build, logging, and unit test tools
 
Drools
DroolsDrools
Drools
 
Java JSON Parser Comparison
Java JSON Parser ComparisonJava JSON Parser Comparison
Java JSON Parser Comparison
 
Netty 4-based RPC System Development
Netty 4-based RPC System DevelopmentNetty 4-based RPC System Development
Netty 4-based RPC System Development
 
eSobi Website Multilayered Architecture
eSobi Website Multilayered ArchitectureeSobi Website Multilayered Architecture
eSobi Website Multilayered Architecture
 
Java New Evolution
Java New EvolutionJava New Evolution
Java New Evolution
 
Tomcat New Evolution
Tomcat New EvolutionTomcat New Evolution
Tomcat New Evolution
 
JQuery New Evolution
JQuery New EvolutionJQuery New Evolution
JQuery New Evolution
 
Responsive Web Design
Responsive Web DesignResponsive Web Design
Responsive Web Design
 
Boilerpipe Integration And Improvement
Boilerpipe Integration And ImprovementBoilerpipe Integration And Improvement
Boilerpipe Integration And Improvement
 
YQL Case Study
YQL Case StudyYQL Case Study
YQL Case Study
 
Build Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGapBuild Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGap
 
HTML5 Multithreading
HTML5 MultithreadingHTML5 Multithreading
HTML5 Multithreading
 
HTML5 Offline Web Application
HTML5 Offline Web ApplicationHTML5 Offline Web Application
HTML5 Offline Web Application
 
HTML5 Data Storage
HTML5 Data StorageHTML5 Data Storage
HTML5 Data Storage
 
Java Script Patterns
Java Script PatternsJava Script Patterns
Java Script Patterns
 
Weighted feed recommand
Weighted feed recommandWeighted feed recommand
Weighted feed recommand
 
eSobi Site Initiation
eSobi Site InitiationeSobi Site Initiation
eSobi Site Initiation
 
Architecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EEArchitecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EE
 

Último

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂșjo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Último (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Web Crawler

  • 2. Agenda ‱ Forward of Web Crawler – HTML Parser ‱ Practice – Feed Crawler ‱ Prototype demo ‱ Conclusion
  • 3. HTML Parser ‱ HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. ‱ First clean up the mess and bring the order to tags, attributes and ordinary text.
  • 4. Well-known Parser ‱ Access the information using standard XML interfaces. ‱ HtmlCleaner ‱ HtmlParser ‱ Nekohtml
  • 5. Parser inner structure ‱ HTML scanner – Pre-processing action ‱ Tag balancer – Reorders individual elements – Produces well-formed XML ‱ Extraction ‱ Transformation
  • 7. Extraction ‱ Text extraction – for use as input for text search engine databases for example ‱ Link extraction – for crawling through web pages or harvesting email addresses ‱ Screen scraping – for programmatic data input from web pages
  • 8. Extraction ‱ Resource extraction – collecting images or sound ‱ A browser front end – the preliminary stage of page display ‱ Link checking – ensuring links are valid ‱ Site monitoring – checking for page differences beyond simplistic diffs
  • 9. Transformation ‱ URL rewriting – modifying some or all links on a page ‱ Site capture – moving content from the web to local disk ‱ Censorship – removing offending words and phrases from pages
  • 10. Transformation ‱ HTML cleanup – correcting erroneous pages ‱ AD removal – excising URLs referencing advertising ‱ Conversion to XML – moving existing web pages to XML
  • 11. Practice ‱ Feed Crawler – HTML ‱ Bloglines, Feedage – XML ‱ RssMountain – JSON ‱ Google AJAX Feed API ‱ Prototype – Demo
  • 12. Conclusion ‱ Page search, image search, news search, blog search, feed search ... ‱ Fault toleration of text processing ‱ Text mining in web ‱ Q&A