the SEO’s guide to:

SCRAPING
EVERYTHING
  @eppievojt
  digital marketing consultant, JPL
NEXT LEVEL
XPATH-ING

  Use Case 1:
  Does site x link to any page on
  eppie.net?

  Scrape partial matches using XPath’s “contains” function to find
  inexact data.

  What we know:
  1)  Link will contain http://www.eppie.net in the href attribute
  2)  Some people like to hurt the internet by capitalizing URLs, so
      we’ll need to account for that
  3)  People who link to you don’t care about your desire for
      canonicalization
DO YOU LINK
TO ME?

  //a[contains(@href,'http://www.eppie.net')]

  PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY
Add translate() to normalize case
  //a[contains(translate(@href,
     'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),
     'http://www.eppie.net')]
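
To see the case-normalizing check outside a spreadsheet, here is a minimal sketch in Python. It is not an XPath engine: it uses the standard library's html.parser as a stand-in and applies the same "contains, ignoring case" test to each href. The names HrefCollector and links_to and the sample markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to(html, target="http://www.eppie.net"):
    """Return every href that contains `target`, ignoring case."""
    parser = HrefCollector()
    parser.feed(html)
    # lower() plays the role of translate(@href, 'ABC...', 'abc...')
    return [h for h in parser.hrefs if target.lower() in h.lower()]


# Hypothetical sample markup: one shouty link to the target, one unrelated link
sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/Blog">x</a> '
    '<a href="http://example.com/">y</a></p>'
)
print(links_to(sample))  # ['HTTP://WWW.EPPIE.NET/Blog']
```

An empty list from links_to means the page no longer links to the target, which is the signal the link-removal notification idea relies on.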




How you can use this:
Get notified when a link is removed
+ Make contact to potentially save dropping link (friendly
  reminder, buy expiring domain, recreate dead resource)

Integrate into link outreach process
+ Get notification when link goes live




NEXT LEVEL
XPATH-ING

  Use Case 2:
  Find every external link from cnn.com

  Combine attribute selectors to more accurately target useful
  information.

  What we know:
  1)  External links all contain http://
  2)  Internal links can also use http://
  3)  So we need to exclude http:// links to the current domain
SCRAPE ALL
EXTERNAL LINKS

  //a[contains(@href,'http://') and not
    (contains(@href,'cnn.com'))]
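
The two conditions in that expression can be sketched in Python as well. As before, this uses the standard library's html.parser as a stand-in for an XPath engine; ExternalLinkCollector, external_links, and the sample markup are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class ExternalLinkCollector(HTMLParser):
    """Keeps hrefs that contain 'http://' but not the site's own domain."""

    def __init__(self, own_domain):
        super().__init__()
        self.own_domain = own_domain
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Same two tests as the XPath: contains 'http://' and not the domain
        if "http://" in href and self.own_domain not in href:
            self.external.append(href)


def external_links(html, own_domain="cnn.com"):
    parser = ExternalLinkCollector(own_domain)
    parser.feed(html)
    return parser.external


# Hypothetical sample markup: relative, internal, and external links
sample = (
    '<a href="/world">relative</a>'
    '<a href="http://www.cnn.com/us">internal</a>'
    '<a href="http://example.org/story">external</a>'
)
print(external_links(sample))  # ['http://example.org/story']
```

Because it mirrors the expression literally, this sketch skips https:// links (the string "https://" does not contain "http://"); testing for "://" instead would be one way to widen it.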
How you can use this:
Identify if a page is too spammed out to bother with by
   pulling external link counts

Find expired or expiring domains being linked to from
   authority sites. Purchase and rebuild or redirect those
   sites.

Broken link building automation




LINK TYPE
IDENTIFICATION

  Use Case 3:
  How are they ranking? What kind of links
  do they have?

  XPath’s ancestor axis lets us leverage semantic markup to ID link
  types.

  What we know:
  A link inside a containing element with an id or class name
  including the word “comment,” “footer,” or “blogroll” is highly
  suggestive of type
  "//a[@href='http://randfishkin.com/blog']/
    ancestor::*[contains(@id,'comment') or
    contains(@class,'comment')]"

  Was Rand comment-spamming his way to the top? This + OSE tells
  the story...
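
The ancestor test can be approximated in plain Python by tracking which elements are open when the link is encountered. This is a rough sketch that uses the standard library's html.parser as a stand-in for the ancestor axis; LinkContextFinder, in_container, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are never "open"
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LinkContextFinder(HTMLParser):
    """For each link to target_href, records whether any open ancestor
    has an id or class containing the keyword."""

    def __init__(self, target_href, keyword):
        super().__init__()
        self.target_href = target_href
        self.keyword = keyword
        self.stack = []    # (tag, "id class") for each open element
        self.matches = []  # one bool per link to target_href

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href") == self.target_href:
            self.matches.append(
                any(self.keyword in label for _, label in self.stack)
            )
        if tag not in VOID:
            label = f"{a.get('id') or ''} {a.get('class') or ''}"
            self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def in_container(html, href, keyword="comment"):
    finder = LinkContextFinder(href, keyword)
    finder.feed(html)
    return finder.matches


# Hypothetical sample markup: one editorial link, one link in a comment list
sample = (
    '<div id="content"><a href="http://randfishkin.com/blog">post</a></div>'
    '<ol class="commentlist"><li>'
    '<a href="http://randfishkin.com/blog">Rand</a></li></ol>'
)
print(in_container(sample, "http://randfishkin.com/blog"))  # [False, True]
```

Running the same function with "footer" or "blogroll" as the keyword covers the other container types named on the slide.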
Why you might use this:
Analyze competitors’ strategies for acquiring links

Find what types of links are being used to get good anchor
   text

Improve workflow: Ignore placed links (comments, directory
  submissions, article submissions, blog networks, etc.) and
  work on a smaller subset of EARNED links for manual
  analysis




REGEX TO
THE RESCUE

  Use Case 4:
  I’ve scraped some data, now I need to
  extract some small portion of it that
  XPath can’t do on its own (easily)

  Use regular expressions to pattern match structured text.

  Example:
  Extract all @mentions of a specific user from a tweet or page
EXTRACT
@MENTIONS

       /(?:^|\s)@([A-Za-z0-9_]+)/gi
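
A short Python sketch of the same pattern follows, assuming the intent is "an @ at the start of the text or after whitespace, followed by letters, digits, or underscores". The MENTION constant, the mentions function, and the sample text are hypothetical.

```python
import re

# (?:^|\s) requires start-of-text or whitespace before the @,
# which keeps email addresses from being picked up as mentions.
MENTION = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")


def mentions(text):
    """Return every @username in text, without the leading @."""
    return MENTION.findall(text)


print(mentions("@eppievojt thanks! cc @jplcreative, mail me at me@example.com"))
# ['eppievojt', 'jplcreative']
```

findall stands in for the /g flag here, and the explicit A-Za-z range makes the /i flag unnecessary.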
Why you might use this:
Pull contact information from a web site (Twitter username,
  email address) to improve outreach efforts

Extract code fragments (like Analytics IDs and AdSense IDs)
  for improved competitive research




BEYOND THE
SPREADSHEET

  Use Case 5:
  I want to chain processes together,
  process lots of data, or allow multiple
  users to leverage what I build.

  Scraping outside the spreadsheet allows for more complex systems
  to be built.

  PHP Scraping Overview:
  1)  cURL target page
  2)  Convert to DOM Object
  3)  Run XPath Queries
  4)  Store Data or Hit API
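
The four-step overview can be sketched end to end. The deck describes a PHP implementation; the version below is a Python illustration under stated assumptions, not the author's scraper class: urllib stands in for cURL, html.parser stands in for the DOM and XPath steps, and an in-memory SQLite table stands in for the data store. All function, class, and table names are hypothetical.

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser


def fetch(url):
    """Step 1: download the page (the slide uses cURL for this)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


class HrefCollector(HTMLParser):
    """Steps 2-3 stand-in: parse the markup and pull out every <a href>."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def scrape(url, db, fetch_page=fetch):
    """Run the four steps for one URL; return the number of links stored."""
    html = fetch_page(url)
    parser = HrefCollector()
    parser.feed(html)
    # Step 4: store the results
    db.execute("CREATE TABLE IF NOT EXISTS links (page TEXT, href TEXT)")
    db.executemany("INSERT INTO links VALUES (?, ?)",
                   [(url, h) for h in parser.hrefs])
    db.commit()
    return len(parser.hrefs)


# Offline demo: a hypothetical page body stands in for a real download
db = sqlite3.connect(":memory:")
fake = lambda url: '<a href="http://example.org/">one</a><a href="/two">two</a>'
print(scrape("http://example.com/", db, fetch_page=fake))  # 2
```

Passing the fetch function in as a parameter keeps the parsing and storage steps testable without network access, and makes it easy to chain several scrapes over a list of URLs.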

 Simple PHP Scraper Class:
 http://www.scrapeeverything.com
SHOW
SOME LOVE

  I’m @eppievojt and I work for @jplcreative

  eppie.net
  linkdetective.com
  jplcreative.com
