the SEO’s guide to:

SCRAPING EVERYTHING
  @eppievojt
  digital marketing consultant, JPL
NEXT LEVEL XPATH-ING

  Use Case 1:
  Does site x link to any page on eppie.net?
  Scrape partial matches using XPath’s “contains” function to find inexact data.

  What we know:
  1)  Link will contain http://www.eppie.net in the href attribute
  2)  Some people like to hurt the internet by capitalizing URLs, so we’ll need
      to account for that
  3)  People who link to you don’t care about your desire for canonicalization
DO YOU LINK TO ME?

  //a[contains(@href,'http://www.eppie.net')]

  PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY
Add translate() to normalize case
  //a[contains(translate(@href,
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),
    'http://www.eppie.net')]
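
A rough sketch of the same check outside the spreadsheet, in Python. It uses the standard library's html.parser rather than an XPath engine, so lowercasing the href stands in for translate(). The names HrefCollector and links_to, and the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def links_to(html, target):
    """Return every href that contains `target`, ignoring case."""
    parser = HrefCollector()
    parser.feed(html)
    needle = target.lower()
    return [href for href in parser.hrefs if needle in href.lower()]


# Hypothetical sample page, for illustration only.
sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/About">me</a> '
    '<a href="http://example.com/">other</a></p>'
)
print(links_to(sample, "http://www.eppie.net"))  # ['HTTP://WWW.EPPIE.NET/About']
```

An empty result means no link was found, which is the signal a link-removal alert would key on.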




How you can use this:
Get notified when a link is removed
+ Make contact to potentially save dropping link (friendly
  reminder, buy expiring domain, recreate dead resource)

Integrate into link outreach process
+ Get notification when link goes live




NEXT LEVEL XPATH-ING

  Use Case 2:
  Find every external link from cnn.com
  Combine attribute selectors to more accurately target useful information

  What we know:
  1)  External links all contain http://
  2)  Internal links can also use http://
  3)  So we need to exclude http:// links to the current domain
SCRAPE ALL EXTERNAL LINKS

  //a[contains(@href,'http://') and not
    (contains(@href,'cnn.com'))]
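
The same two-part filter can be sketched in Python. This is a stand-in built on the standard library's html.parser, not an XPath engine; HrefCollector, external_links, and the sample markup are hypothetical names and data for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def external_links(html, own_domain):
    """Keep hrefs that contain http:// and do not mention own_domain."""
    parser = HrefCollector()
    parser.feed(html)
    return [
        href for href in parser.hrefs
        if "http://" in href and own_domain not in href
    ]


# Hypothetical sample markup, for illustration only.
sample = (
    '<a href="/world">World</a>'
    '<a href="http://www.cnn.com/us">US</a>'
    '<a href="http://example.org/story">Story</a>'
)
print(external_links(sample, "cnn.com"))  # ['http://example.org/story']
```

Like the expression above, this only matches http:// hrefs, so https:// links would need an extra condition. len() of the result gives the external link count.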
How you can use this:
Identify if a page is too spammed out to bother with by
   pulling external link counts

Find expired or expiring domains being linked to from
   authority sites. Purchase and rebuild or redirect those
   sites.

Broken link building automation




LINK TYPE IDENTIFICATION

  Use Case 3:
  How are they ranking? What kind of links
  do they have?
  XPath’s ancestor axis lets us leverage semantic markup to ID link types.

  What we know:
  A link inside a containing element with an id or class name including the
  word “comment,” “footer,” or “blogroll” is highly suggestive of type

  "//a[@href='http://randfishkin.com/blog']/
    ancestor::*[contains(@id,'comment') or
    contains(@class,'comment')]"

  Was Rand comment-spamming his way to the top? This + OSE tells the story...
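
A minimal sketch of the ancestor check in Python, assuming well-nested markup. It swaps the XPath ancestor axis for a stack of open elements kept while parsing with the standard library's html.parser. LinkContextParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class LinkContextParser(HTMLParser):
    """For each link to target_href, records whether any ancestor's
    id or class contains the keyword."""

    def __init__(self, target_href, keyword):
        super().__init__()
        self.target_href = target_href
        self.keyword = keyword
        self.stack = []    # id + class text of every currently open element
        self.matches = []  # one True/False per matching link

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href") == self.target_href:
            # The open elements on the stack are exactly the link's ancestors.
            self.matches.append(
                any(self.keyword in label for label in self.stack)
            )
        if tag not in VOID:
            label = (attr.get("id") or "") + " " + (attr.get("class") or "")
            self.stack.append(label)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


# Hypothetical sample markup, for illustration only.
sample = (
    '<div id="content"><a href="http://randfishkin.com/blog">Rand</a></div>'
    '<ul class="comment-list"><li>'
    '<a href="http://randfishkin.com/blog">Rand</a></li></ul>'
)
parser = LinkContextParser("http://randfishkin.com/blog", "comment")
parser.feed(sample)
print(parser.matches)  # [False, True]
```

Swapping the keyword for "footer" or "blogroll" classifies the other link types the same way. Pages with unclosed tags would need a more forgiving parser.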
Why you might use this:
Analyze competitors’ strategies for acquiring links

Find what types of links are being used to get good anchor
   text

Improve workflow: Ignore placed links (comments, directory
  submissions, article submissions, blog networks, etc.) and
  work on a smaller subset of EARNED links for manual
  analysis




REGEX TO THE RESCUE

  Use Case 4:
  I’ve scraped some data, now I need to
  extract some small portion of it that
  XPath can’t do on its own (easily)

  Use regular expressions to pattern match structured text

  Example:
  Extract all @mentions of a specific user from a tweet or page
EXTRACT @MENTIONS

       /(?:^|\s)@([a-z0-9_]+)/gi
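
A short sketch of the pattern at work, using Python's re module. findall plays the part of the g flag and re.IGNORECASE the part of i. The sample text and the mentions helper are made up for illustration.

```python
import re

# An @ at the start of the text or after whitespace, followed by a username.
MENTION = re.compile(r"(?:^|\s)@([a-z0-9_]+)", re.IGNORECASE)


def mentions(text):
    """Return the usernames mentioned in text, without the leading @."""
    return MENTION.findall(text)


# Hypothetical sample text, for illustration only.
sample = "@eppievojt spoke; thanks @jplcreative! Mail someone@example.com"
print(mentions(sample))  # ['eppievojt', 'jplcreative']
```

The email address is skipped because its @ does not follow whitespace.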
Why you might use this:
Pull contact information from a web site (Twitter username,
  email address) to improve outreach efforts

Extract code fragments (like Analytics IDs and AdSense IDs)
  for improved competitive research




BEYOND THE SPREADSHEET

  Use Case 5:
  I want to chain processes together,
  process lots of data, or allow multiple
  users to leverage what I build.
  Scraping outside the spreadsheet allows for more complex systems to be built.

  PHP Scraping Overview:
  1)  CURL target page
  2)  Convert to DOM Object
  3)  Run XPath Queries
  4)  Store Data or Hit API
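
The four steps can be sketched end to end. The deck's version is PHP; this stand-in is Python using only the standard library, with urllib in place of cURL and html.parser in place of a DOM object plus XPath. All names here (HrefCollector, fetch, scrape) and the sample markup are hypothetical, and it is not the scraper class linked from this deck.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def fetch(url):
    """Step 1: download the target page."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(url, fetcher=fetch):
    """Run all four steps and return the stored rows as CSV text."""
    html = fetcher(url)                    # 1) fetch the target page
    parser = HrefCollector()
    parser.feed(html)                      # 2) parse the markup
    hrefs = [h for h in parser.hrefs       # 3) query: absolute links only
             if h.lower().startswith("http")]
    out = io.StringIO()                    # 4) store (CSV kept in memory here)
    writer = csv.writer(out)
    writer.writerow(["source", "href"])
    for href in hrefs:
        writer.writerow([url, href])
    return out.getvalue()


# Demo with a fake fetcher so nothing is downloaded; the markup is made up.
fake_page = '<a href="http://example.org/a">x</a><a href="/local">y</a>'
result = scrape("http://example.com/", fetcher=lambda url: fake_page)
print(result)
```

Passing the fetcher in as a parameter keeps the steps chainable and lets the parsing be tried without touching the network.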

 Simple PHP Scraper Class:
 http://www.scrapeeverything.com
SHOW SOME LOVE

  I’m @eppievojt and I work for @jplcreative

  eppie.net
  linkdetective.com
  jplcreative.com
