Local Search
(Including Importance Metrics and Link Merging)
Everything you wanted to know
about Crawling*
*But Didn't Know Where to Ask
Agile SEO Meetup – South Jersey
Monday, September 10, 2012
7:00 PM to 9:00 PM
Bill Slawski
Webimax
@bill_slawski
In the Early Days of the Web,
there was an invasion
Robots
Spiders
Via Thomas Shahan - http://www.flickr.com/photos/opoterser/
Crawlers
Invaded pages across the World Wide Web
The Robots Mailing List was formed to solve the problem!
Led by a young Martijn Koster, they developed the Robots.txt protocol
Which Asked Robots to be Polite
And Not Melt Down Internet Servers
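As a minimal sketch of what "being polite" can look like in practice, the snippet below uses Python's standard `urllib.robotparser` to check a robots.txt file before fetching. The robots.txt content, the `ExampleBot` user agent, and the URLs are hypothetical, and `Crawl-delay` is a widely used extension rather than part of the original protocol.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from
# http://<host>/robots.txt before requesting any other URL on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up user agent used only for illustration.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.htm")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/notes.htm")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests
```

A crawler that honors these answers skips disallowed paths and sleeps for `delay` seconds between requests to the same host.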
A student at Stanford named Lawrence Page went on
to co-author a paper on how robots might Crawl web
pages to index important pages first.
http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf
<<Insert Subliminal Advertisement Here>>
Important Web Pages
1. Contain words similar to a query that starts the crawl
2. Have a high backlink count
3. Have a high PageRank
4. Have a high forward link count
5. Are in or are close to the root directory for sites
Image via Fir0002/Flagstaffotos under http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License_1.2
So most crawlers will not only be
Polite, but they will also hunt down
important pages first
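One way to picture "important pages first" is a crawl frontier ordered by an importance metric. The sketch below uses only the backlink count from the list above; the `Frontier` class and the sample paths are hypothetical, and a real crawler would likely combine several metrics.

```python
import heapq
from collections import defaultdict

class Frontier:
    """Crawl frontier that hands out the URL with the most known backlinks first."""

    def __init__(self):
        self.backlinks = defaultdict(int)  # URL -> links seen so far pointing at it
        self.visited = set()
        self._heap = []

    def add_link(self, target):
        """Record one more link pointing at `target` and (re)queue it."""
        if target in self.visited:
            return
        self.backlinks[target] += 1
        # heapq is a min-heap, so the count is negated; older entries for the
        # same URL become stale and are skipped in pop().
        heapq.heappush(self._heap, (-self.backlinks[target], target))

    def pop(self):
        """Return the unvisited URL with the highest backlink count, or None."""
        while self._heap:
            neg_count, url = heapq.heappop(self._heap)
            if url in self.visited or -neg_count != self.backlinks[url]:
                continue  # stale or already crawled
            self.visited.add(url)
            return url
        return None

frontier = Frontier()
for target in ["/a", "/b", "/b", "/c", "/b", "/c"]:
    frontier.add_link(target)
order = [frontier.pop() for _ in range(3)]
```

Here `/b` has three known backlinks, `/c` two, and `/a` one, so they are crawled in that order.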
Search Engines filed patents on how they might crawl
and collect content found on Web pages, including collecting
URLs and Anchor Text associated with them.
<a href="http://www.hungryrobots.com">Feed Me</a>
http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7308643
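To illustrate collecting URLs together with their anchor text, here is a small sketch built on Python's standard `html.parser`. The `AnchorCollector` class is hypothetical and is not the method described in the patent; it only shows the kind of (URL, anchor text) pairs a crawler might record.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = AnchorCollector()
collector.feed('<p>Robots say: <a href="http://www.hungryrobots.com">Feed Me</a></p>')
links = collector.links
```

For the slide's example link, `links` holds the pair `("http://www.hungryrobots.com", "Feed Me")`.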
Also, in one embodiment, the robots are configured to not follow
"permanent redirects". Thus, when a robot encounters a URL that is
permanently redirected to another URL, the robot does not automatically
retrieve the document at the target address of the permanent redirect.
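The behavior quoted above could be sketched as follows. This is one plausible reading, not the patent's implementation: the function name, the redirect log, and the decision to queue the target for later scheduling are all assumptions, as is treating status 308 alongside 301 as permanent.

```python
from collections import deque

# 301 is the classic permanent redirect; including 308 is an assumption.
PERMANENT_REDIRECTS = {301, 308}

def handle_response(url, status, location, frontier, redirect_log):
    """Return True if the body fetched for `url` should be processed.

    On a permanent redirect the target is not fetched right away. It is
    recorded and queued so the scheduler can decide when to crawl it.
    """
    if status in PERMANENT_REDIRECTS and location:
        redirect_log[url] = location
        frontier.append(location)
        return False
    return status == 200

frontier = deque()
redirect_log = {}
moved = handle_response("http://example.com/old", 301,
                        "http://example.com/new", frontier, redirect_log)
ok = handle_response("http://example.com/", 200, None, frontier, redirect_log)
```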
“Use a text browser such as Lynx to examine your site,
because most search engine spiders see your site much as
Lynx would. If fancy features such as JavaScript, cookies,
session IDs, frames, DHTML, or Flash keep you from
seeing all of your site in a text browser, then search engine
spiders may have trouble crawling your site.”*
*Google Webmaster Guidelines - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769
Google’s Webmaster Guidelines make crawlers look pretty
unsophisticated, and incapable of much more than the simple
Lynx browser…
…But we have signs that crawlers can be smarter than that,
and Microsoft introduced a Vision-based Page Segmentation
Algorithm in 2003. Both Google and Yahoo have also published
patents and papers that describe smarter crawlers. IBM filed a patent
for a crawler in 2000 that is smarter than most browsers today.
VIPS: a Vision-based Page Segmentation Algorithm - http://research.microsoft.com/apps/pubs/default.aspx?id=70027
http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7519902
Link Merging
Web Site Structure Analysis - http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7861151
•S-nodes – Structural Link Blocks - organizational and navigational link blocks;
Repeated across pages with the same layout and showing the organization of the site.
They are often lists of links that don’t usually contain other content elements such as text.
•C-nodes – Content link blocks, grouped together by some kind of content association,
such as relating to the same topic or sub-topic. These blocks usually point to information
resources and aren’t likely to be repeated across more than one page.
•I-nodes – Isolated links, which are links on a page that aren’t part of a link group
and may be only loosely related to each other, by virtue of something like their
appearing together within the same paragraph of text. Each link on a page might be
considered an individual i-node, or they might be grouped together by page as an i-node.
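The three block types can be illustrated with a deliberately simplified heuristic. This is not the patent's algorithm: it assumes link blocks have already been extracted per page as tuples of hrefs, treats a block repeated on more than one page as structural, a multi-link block unique to one page as content, and a single link as isolated. All names and sample data are hypothetical.

```python
from collections import Counter

def classify_blocks(pages):
    """pages: dict mapping page URL -> list of link blocks (tuples of hrefs)."""
    # Count on how many pages each exact block of links appears.
    seen_on = Counter(block for blocks in pages.values() for block in set(blocks))
    labels = {}
    for page, blocks in pages.items():
        for block in blocks:
            if len(block) == 1:
                kind = "I-node"   # a lone link outside any group
            elif seen_on[block] > 1:
                kind = "S-node"   # the same block repeats across pages
            else:
                kind = "C-node"   # grouped links unique to this page
            labels[(page, block)] = kind
    return labels

nav = ("/", "/about", "/contact")
pages = {
    "/": [nav, ("/news/1", "/news/2"), ("/partner",)],
    "/about": [nav, ("/team",)],
}
labels = classify_blocks(pages)
```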
Crawling and Self Help
Canonical = Best!
There can be only one:
http://example.com
http://www.example.com
http://example.com/
http://www.example.com/
https://example.com
https://www.example.com
https://example.com/
https://www.example.com/
http://example.com/index.htm
http://www.example.com/index.htm
https://example.com/index.htm
https://www.example.com/index.htm
http://example.com/INDEX.htm
http://www.example.com/INDEX.htm
https://example.com/INDEX.htm
https://www.example.com/INDEX.htm
http://example.com/Index.htm
http://www.example.com/Index.htm
https://example.com/Index.htm
https://www.example.com/Index.htm
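The idea that all twenty variants above should resolve to one URL can be sketched with a small normalizer. The preferred scheme and host, the list of default document names, and the case-insensitive match on the file name are assumptions chosen for this example; which variant is "the one" is the site owner's decision, and path case sensitivity depends on the server.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default document names; a real site may use others.
DEFAULT_DOCUMENTS = {"index.htm", "index.html"}

def strip_www(host):
    return host[4:] if host.startswith("www.") else host

def canonicalize(url, scheme="http", host="www.example.com"):
    """Map any known variant of a page on `host` to one preferred URL."""
    parts = urlsplit(url)
    if strip_www((parts.hostname or "").lower()) != strip_www(host):
        return url  # a different site: leave untouched
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"  # drop the default document name
    return urlunsplit((scheme, host, path, parts.query, ""))

# Rebuild the twenty variants listed on the slide.
variants = [
    f"{s}://{h}{p}"
    for s in ("http", "https")
    for h in ("example.com", "www.example.com")
    for p in ("", "/", "/index.htm", "/INDEX.htm", "/Index.htm")
]
canonical = {canonicalize(v) for v in variants}
```

With these assumptions every variant collapses to `http://www.example.com/`. On a live site the same mapping would typically be enforced with redirects or the canonical link element.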
Canonical Link Element
<link rel="canonical" href="http://example.com/page.html"/>
Rel=“prev” & rel=“next”
On the first page, http://www.example.com/article?story=abc&page=1:
<link rel="next" href="http://www.example.com/article?story=abc&page=2" />
On the second page, http://www.example.com/article?story=abc&page=2:
<link rel="prev" href="http://www.example.com/article?story=abc&page=1" />
<link rel="next" href="http://www.example.com/article?story=abc&page=3" />
On the third page, http://www.example.com/article?story=abc&page=3:
<link rel="prev" href="http://www.example.com/article?story=abc&page=2" />
<link rel="next" href="http://www.example.com/article?story=abc&page=4" />
And on the last page, http://www.example.com/article?story=abc&page=4:
<link rel="prev" href="http://www.example.com/article?story=abc&page=3" />
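The pattern above (no prev on the first page, no next on the last) can be generated with a short helper. The function name and the assumption that the page number is appended as `&page=N` are illustrative; the URL is assumed to be safe to place in an attribute as-is, matching the slide's markup.

```python
def pagination_links(base_url, page, last_page):
    """Return the rel="prev"/rel="next" link elements for one page of a series."""
    def tag(rel, number):
        return f'<link rel="{rel}" href="{base_url}&page={number}" />'

    tags = []
    if page > 1:
        tags.append(tag("prev", page - 1))   # every page except the first
    if page < last_page:
        tags.append(tag("next", page + 1))   # every page except the last
    return tags

base = "http://www.example.com/article?story=abc"
first = pagination_links(base, 1, 4)
middle = pagination_links(base, 2, 4)
last = pagination_links(base, 4, 4)
```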
Paginated Product Pages
Paginated Article Pages
View All Pages
Option 1
• Normal Prev/Next sequence
• Self-Referential Canonicals (point to their own URL)
• Noindex meta element on View All page
Option 2
• Normal Prev/Next Sequence
• Canonicals (all pages use the view-all page URL)
http://googlewebmastercentral.blogspot.com/2011/09/view-all-in-search-results.html
Rel=“hreflang”
HTML link element.
In the HTML <head> section of http://www.example.com/, add
a link element pointing to the Spanish version of that webpage at
http://es.example.com/, like this:
<link rel="alternate" hreflang="es" href="http://es.example.com/" />
HTTP header.
If you publish non-HTML files (like PDFs), you can use an
HTTP header to indicate a different language version of a URL:
Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
Sitemap.
Instead of using markup, you can submit language version
information in a Sitemap.
Rel=“hreflang” XML Sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/english/</loc>
    <xhtml:link
      rel="alternate"
      hreflang="de"
      href="http://www.example.com/deutsch/"
    />
    <xhtml:link
      rel="alternate"
      hreflang="de-ch"
      href="http://www.example.com/schweiz-deutsch/"
    />
    <xhtml:link
      rel="alternate"
      hreflang="en"
      href="http://www.example.com/english/"
    />
  </url>
</urlset>
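A Sitemap like the one above can be produced programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the `build_sitemap` function is hypothetical, and it assumes each language version gets its own url entry that lists every alternate, itself included, as the English entry on the slide does.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize with the same prefixes as the slide: default namespace + xhtml.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)

def build_sitemap(alternates):
    """alternates: dict mapping hreflang code -> URL of that language version."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc in alternates.values():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # Assumption: every entry lists all alternates, including itself.
        for lang, href in alternates.items():
            ET.SubElement(url, f"{{{XHTML_NS}}}link",
                          {"rel": "alternate", "hreflang": lang, "href": href})
    return ET.tostring(urlset, encoding="unicode")

xml_text = build_sitemap({
    "de": "http://www.example.com/deutsch/",
    "de-ch": "http://www.example.com/schweiz-deutsch/",
    "en": "http://www.example.com/english/",
})
```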
XML Sitemap
•Use Canonical links
•Remove 404s
•Don’t set priority past 1 week
•If more than 50,000 URLs, use multiple Sitemaps and a Sitemap index file
•Validate with an XML Sitemap Validator
•Include a Sitemap statement in robots.txt
http://www.sitemaps.org/
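Two of the checklist items, splitting at 50,000 URLs and declaring the Sitemap in robots.txt, can be sketched briefly. The file naming pattern, the index URL, and the function names are hypothetical; writing the actual XML files is left out.

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit named in the checklist

def split_into_sitemaps(urls, pattern="http://www.example.com/sitemap-{n}.xml"):
    """Return a dict mapping each Sitemap file URL to the page URLs it should list."""
    chunks = [urls[i:i + MAX_URLS_PER_SITEMAP]
              for i in range(0, len(urls), MAX_URLS_PER_SITEMAP)]
    return {pattern.format(n=n): chunk for n, chunk in enumerate(chunks, start=1)}

def robots_txt_sitemap_line(index_url):
    """The Sitemap statement to include in robots.txt."""
    return f"Sitemap: {index_url}"

# 120,001 made-up URLs need three Sitemap files plus one index listing them.
urls = [f"http://www.example.com/page-{i}" for i in range(120_001)]
sitemaps = split_into_sitemaps(urls)
sizes = [len(chunk) for chunk in sitemaps.values()]
robots_line = robots_txt_sitemap_line("http://www.example.com/sitemap-index.xml")
```

The Sitemap index file would then list the keys of `sitemaps`, and robots.txt would point at that index.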
Next, we study which of the two crawl systems, Sitemaps and Discovery,
sees URLs first. We conduct this test over a dataset consisting of over five
billion URLs that were seen by both systems.
According to the most recent statistics at the time of the writing,
78% of these URLs were seen by Sitemaps first, compared to
22% that were seen through Discovery first.
Crawling vs. XML
Sitemaps: Above and Beyond the Crawl of Duty –
http://www.shuri.org/publications/www2009_sitemaps.pdf
Crawling Social Media
Ranking of Search Results based on Microblog data - http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220110246457%22.PGNR.&OS=DN/20110246457&RS=DN/20110246457
Questions?
Bill Slawski
Webimax
@bill_slawski
