SlideShare uma empresa Scribd logo
1 de 18
http://www.flickr.com/photos/pio1976/3330670980/




                                                  Link extraction and classification
                                                                         Bruno Pedro
                                                                          December 2010
Bruno Pedro
A n e x p e r i e n c e d We b d e v e l o p e r a n d
entrepreneur. Has extensive background in
large scale projects and technical writing.

http://tarpipe.com/user/bpedro
What is tarpipe?




       User
What is tarpipe?
twitter


                      source: mashable.com




• Average ~1.1M tweets/hour
• ~300 new tweets every second
Challenge
• ~300 reads/second
• 160 X 300 = 48 KB/second = 4 GB/day
  (approximate calculation)




• How to process all this information?
Strategy
• Read and store immediately:
 • High performance write storage
 • No locks allowed
 • Prepare for lots of reading errors
Strategy
• Process later:
 • Regular expressions
 • Term extraction
 • Machine learning
Link extraction
Link extraction
                        extract user   extract URL




         new tweet             yes




launch process           has URL?         save       show




                                no




                           dump
Content extraction

  • Short URLs
  • HTTP redirects
  • HTTP errors — retry?
  • Content analysis
Content extraction
    URL                                   retry?
                               yes



                                              yes



                                     no
fetch contents         redirect?          error?

                 yes

                                              no




                                          save
Content analysis

• Assume malformed (X)HTML
• Regular expressions
  or
• Convert to XHTML
• DOM traversing
Content classification
                • HTML title, description, keywords
   tag
  cloud
            {   • H1, H2, H3, ...
                • Paragraphs

  graph
placement   {   • Who shared the link?
                • Internal and external links
Content classification
 (X)HTML                 extract head
                                                   extract text         extract text
                          elements




                   yes                       yes                  yes




                          H1,H2,...                paragraphs
head found?                                                                save
                           found?                    found?




              no                        no
The big picture

new tweet                     classify content   save
             extract URL




            extract content
Food for thought
• PubsubHubbub
  http://code.google.com/p/pubsubhubbub/

• Activity Streams
  http://activitystrea.ms/
• twitter streaming API
  http://dev.twitter.com/pages/streaming_api
tarpipe streamlines your     tarpipe is one of the most     Today I had a chance to
updates to various social    curious experiments in         spend time experimenting
web sites, creating simple   social media that I've         with tarpipe and I have to
or complex workflows to       seen lately. The service       say that I am intrigued by
update several buckets in    has the potential to be        the concept and impressed
one fell swoop.              the answer to the lament       by the implementation.
                             I first talked about in The
Adam Pash                    looming crisis: Personal       Jeff Barr
lifehacker                   syndication overload.          Amazon.com

                             Rafe Needleman
                             CNET news




thank you                                            automated publishing

Mais conteúdo relacionado

Destaque

Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)
Bruno Pedro
 
Plugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debuggingPlugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debugging
Mayflower GmbH
 
Who's using your API?
Who's using your API?Who's using your API?
Who's using your API?
Bruno Pedro
 

Destaque (20)

Schnelle Geschäfte
Schnelle GeschäfteSchnelle Geschäfte
Schnelle Geschäfte
 
Introd to CS. Presentation
Introd to CS. PresentationIntrod to CS. Presentation
Introd to CS. Presentation
 
Test-Driven JavaScript Development IPC
Test-Driven JavaScript Development IPCTest-Driven JavaScript Development IPC
Test-Driven JavaScript Development IPC
 
The Executable Web
The Executable WebThe Executable Web
The Executable Web
 
Native Cross-Platform-Apps mit Titanium Mobile und Alloy
Native Cross-Platform-Apps mit Titanium Mobile und AlloyNative Cross-Platform-Apps mit Titanium Mobile und Alloy
Native Cross-Platform-Apps mit Titanium Mobile und Alloy
 
tarpipe WordPress plugin demo
tarpipe WordPress plugin demotarpipe WordPress plugin demo
tarpipe WordPress plugin demo
 
node-fs
node-fsnode-fs
node-fs
 
Computer concepts presentation 2
Computer concepts presentation 2Computer concepts presentation 2
Computer concepts presentation 2
 
Maintainable consumers
Maintainable consumersMaintainable consumers
Maintainable consumers
 
Shoeism - Frau im Glück
Shoeism - Frau im GlückShoeism - Frau im Glück
Shoeism - Frau im Glück
 
Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)
 
Plugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debuggingPlugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debugging
 
Who's using your API?
Who's using your API?Who's using your API?
Who's using your API?
 
Salt and pepper — native code in the browser Browser using Google native Client
Salt and pepper — native code in the browser Browser using Google native ClientSalt and pepper — native code in the browser Browser using Google native Client
Salt and pepper — native code in the browser Browser using Google native Client
 
APIs Love to Chat
APIs Love to ChatAPIs Love to Chat
APIs Love to Chat
 
How to Automate API Discovery
How to Automate API DiscoveryHow to Automate API Discovery
How to Automate API Discovery
 
Is OAuth Really Secure?
Is OAuth Really Secure?Is OAuth Really Secure?
Is OAuth Really Secure?
 
The importance of /me
The importance of /meThe importance of /me
The importance of /me
 
Information Retrieval Challenges
Information Retrieval ChallengesInformation Retrieval Challenges
Information Retrieval Challenges
 
API Code Generation
API Code GenerationAPI Code Generation
API Code Generation
 

Semelhante a Link extraction and classification

Semelhante a Link extraction and classification (20)

WordPress Intermediate Workshop
WordPress Intermediate WorkshopWordPress Intermediate Workshop
WordPress Intermediate Workshop
 
Wp 3hr-course
Wp 3hr-courseWp 3hr-course
Wp 3hr-course
 
Write a better FM
Write a better FMWrite a better FM
Write a better FM
 
Pub355: SEO Copywriting
Pub355: SEO CopywritingPub355: SEO Copywriting
Pub355: SEO Copywriting
 
From PHP to Hack as a Stack and Back
From PHP to Hack as a Stack and BackFrom PHP to Hack as a Stack and Back
From PHP to Hack as a Stack and Back
 
Big Websites
Big WebsitesBig Websites
Big Websites
 
HTML Semantic Tags
HTML Semantic TagsHTML Semantic Tags
HTML Semantic Tags
 
Data Acquisition for Sentiment Analysis
Data Acquisition for Sentiment AnalysisData Acquisition for Sentiment Analysis
Data Acquisition for Sentiment Analysis
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLP
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePoint
 
Handout1 intro to html
Handout1 intro to htmlHandout1 intro to html
Handout1 intro to html
 
Why Your Site is Slow: Performance Answers for Your Clients
Why Your Site is Slow: Performance Answers for Your ClientsWhy Your Site is Slow: Performance Answers for Your Clients
Why Your Site is Slow: Performance Answers for Your Clients
 
CommunitySherpa Field Presentation
CommunitySherpa Field PresentationCommunitySherpa Field Presentation
CommunitySherpa Field Presentation
 
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
 
Aws r
Aws rAws r
Aws r
 
There's No Crying In Wordpress! (an intro to WP)
There's No Crying In Wordpress! (an intro to WP)There's No Crying In Wordpress! (an intro to WP)
There's No Crying In Wordpress! (an intro to WP)
 
Content Strategy: WordCamp Buffalo 2012
Content Strategy: WordCamp Buffalo 2012Content Strategy: WordCamp Buffalo 2012
Content Strategy: WordCamp Buffalo 2012
 
We are losing our tweets!
We are losing our tweets!We are losing our tweets!
We are losing our tweets!
 
<?php + WordPress
<?php + WordPress<?php + WordPress
<?php + WordPress
 

Mais de Bruno Pedro

OOP (in portuguese)
OOP (in portuguese)OOP (in portuguese)
OOP (in portuguese)
Bruno Pedro
 
Segurança (in portuguese)
Segurança (in portuguese)Segurança (in portuguese)
Segurança (in portuguese)
Bruno Pedro
 
Cache e Performance (in portuguese)
Cache e Performance (in portuguese)Cache e Performance (in portuguese)
Cache e Performance (in portuguese)
Bruno Pedro
 
Web Services (in portuguese)
Web Services (in portuguese)Web Services (in portuguese)
Web Services (in portuguese)
Bruno Pedro
 
Sessões (in portuguese)
Sessões (in portuguese)Sessões (in portuguese)
Sessões (in portuguese)
Bruno Pedro
 
User Interface (in portuguese)
User Interface (in portuguese)User Interface (in portuguese)
User Interface (in portuguese)
Bruno Pedro
 

Mais de Bruno Pedro (14)

What are Web APIs
What are Web APIsWhat are Web APIs
What are Web APIs
 
Growing your business with an API
Growing your business with an APIGrowing your business with an API
Growing your business with an API
 
Product growth with an API
Product growth with an APIProduct growth with an API
Product growth with an API
 
How to grow your business with an API
How to grow your business with an APIHow to grow your business with an API
How to grow your business with an API
 
How to Automate API Testing
How to Automate API TestingHow to Automate API Testing
How to Automate API Testing
 
Asynchronous Microservices in nodejs
Asynchronous Microservices in nodejsAsynchronous Microservices in nodejs
Asynchronous Microservices in nodejs
 
OAuth checklist
OAuth checklistOAuth checklist
OAuth checklist
 
OOP (in portuguese)
OOP (in portuguese)OOP (in portuguese)
OOP (in portuguese)
 
Segurança (in portuguese)
Segurança (in portuguese)Segurança (in portuguese)
Segurança (in portuguese)
 
Cache e Performance (in portuguese)
Cache e Performance (in portuguese)Cache e Performance (in portuguese)
Cache e Performance (in portuguese)
 
Web Services (in portuguese)
Web Services (in portuguese)Web Services (in portuguese)
Web Services (in portuguese)
 
Sessões (in portuguese)
Sessões (in portuguese)Sessões (in portuguese)
Sessões (in portuguese)
 
User Interface (in portuguese)
User Interface (in portuguese)User Interface (in portuguese)
User Interface (in portuguese)
 
Takeoff2008
Takeoff2008Takeoff2008
Takeoff2008
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

Link extraction and classification

  • 1. http://www.flickr.com/photos/pio1976/3330670980/ Link extraction and classification Bruno Pedro December 2010
  • 2. Bruno Pedro A n e x p e r i e n c e d We b d e v e l o p e r a n d entrepreneur. Has extensive background in large scale projects and technical writing. http://tarpipe.com/user/bpedro
  • 5. twitter source: mashable.com • Average ~1.1M tweets/hour • ~300 new tweets every second
  • 6. Challenge • ~300 reads/second • 160 X 300 = 48 KB/second = 4 GB/day (approximate calculation) • How to process all this information?
  • 7. Strategy • Read and store immediately: • High performance write storage • No locks allowed • Prepare for lots of reading errors
  • 8. Strategy • Process later: • Regular expressions • Term extraction • Machine learning
  • 10. Link extraction extract user extract URL new tweet yes launch process has URL? save show no dump
  • 11. Content extraction • Short URLs • HTTP redirects • HTTP errors — retry? • Content analysis
  • 12. Content extraction URL retry? yes yes no fetch contents redirect? error? yes no save
  • 13. Content analysis • Assume malformed (X)HTML • Regular expressions or • Convert to XHTML • DOM traversing
  • 14. Content classification • HTML title, description, keywords tag cloud { • H1, H2, H3, ... • Paragraphs graph placement { • Who shared the link? • Internal and external links
  • 15. Content classification (X)HTML extract head extract text extract text elements yes yes yes H1,H2,... paragraphs head found? save found? found? no no
  • 16. The big picture new tweet classify content save extract URL extract content
  • 17. Food for thought • PubsubHubbub http://code.google.com/p/pubsubhubbub/ • Activity Streams http://activitystrea.ms/ • twitter streaming API http://dev.twitter.com/pages/streaming_api
  • 18. tarpipe streamlines your tarpipe is one of the most Today I had a chance to updates to various social curious experiments in spend time experimenting web sites, creating simple social media that I've with tarpipe and I have to or complex workflows to seen lately. The service say that I am intrigued by update several buckets in has the potential to be the concept and impressed one fell swoop. the answer to the lament by the implementation. I first talked about in The Adam Pash looming crisis: Personal Jeff Barr lifehacker syndication overload. Amazon.com Rafe Needleman CNET news thank you automated publishing

Notas do Editor

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n