Everything you wanted to know about crawling, but didn't know where to ask

Local Search
(Including Importance Metrics and Link Merging)
Everything you wanted to know
about Crawling*
*But Didn't Know Where to Ask
Agile SEO Meetup – South Jersey
Monday, September 10, 2012
7:00 PM to 9:00 PM
Bill Slawski
Webimax
@bill_slawski

In the Early Days of the Web,
there was an invasion

Spiders
Via Thomas Shahan - http://www.flickr.com/photos/opoterser/

Invaded pages across the World Wide Web

The Robots Mailing List was formed to solve the problem!

Led by a young Martijn Koster, they developed the Robots.txt protocol

Which Asked Robots to be Polite

And Not Melt Down Internet Servers
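
A polite robot checks robots.txt before it requests a page. The sketch below uses Python's standard urllib.robotparser; the rules, the delay value, and the "ExampleBot" agent name are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path and delay are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, agent="ExampleBot"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_fetch("http://example.com/index.htm"))
print(may_fetch("http://example.com/private/a.htm"))
# Seconds the site asks robots to wait between requests, if it says so.
print(parser.crawl_delay("ExampleBot"))
```

Crawl-delay is a widely used extension rather than part of the original protocol, so a crawler should treat a missing value as "no preference stated".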

A student at Stanford named Lawrence Page went on
to co-author a paper on how robots might Crawl web
pages to index important pages first.
http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf

<<Insert Subliminal Advertisement Here>>

Important Web Pages
1. Contain words similar to a query that starts the crawl
2. Have a high backlink count
3. Have a high PageRank
4. Have a high forward link count
5. Are in or are close to the root directory for sites
Image via Fir0002/Flagstaffotos under http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License_1.2

So most crawlers will not only be
Polite, but they will also hunt down
important pages first
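
One way to picture "important pages first" is a crawl frontier ordered by an importance metric. The sketch below uses only the backlink count seen so far (metric 2 in the list above); the link graph and the crawl_order function are hypothetical, and this is not presented as the ordering any real search engine uses.

```python
from collections import defaultdict


def crawl_order(link_graph, seed):
    """Visit pages reachable from seed, most known backlinks first."""
    backlinks = defaultdict(int)  # backlinks discovered so far, per URL
    frontier = {seed}             # discovered but not yet crawled
    done = set()
    order = []
    while frontier:
        # Highest backlink count wins; ties break alphabetically.
        url = min(frontier, key=lambda u: (-backlinks[u], u))
        frontier.remove(url)
        done.add(url)
        order.append(url)
        for target in link_graph.get(url, []):
            backlinks[target] += 1
            if target not in done:
                frontier.add(target)
    return order


# Hypothetical site: three pages link to /c, so it is crawled before /b.
graph = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": [],
}
print(crawl_order(graph, "/"))
```

Swapping the key function for a PageRank estimate or a directory-depth score would give the other orderings from the list without changing the loop.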

Search Engines filed patents on how they might crawl
and collect content found on Web pages, including collecting
URLs and Anchor Text associated with them.
<a href="http://www.hungryrobots.com">Feed Me</a>
http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7308643
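
Collecting URLs together with their anchor text can be sketched with Python's standard html.parser. The AnchorCollector class is a hypothetical illustration, not the method described in the patent.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = AnchorCollector()
collector.feed('<p>Robots say <a href="http://www.hungryrobots.com">Feed Me</a></p>')
print(collector.links)  # [('http://www.hungryrobots.com', 'Feed Me')]
```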

Also, in one embodiment, the robots are configured to not follow
"permanent redirects". Thus, when a robot encounters a URL that is
permanently redirected to another URL, the robot does not automatically
retrieve the document at the target address of the permanent redirect.
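
A minimal sketch of that behavior: on a 301 the robot records the target instead of fetching it right away. Queuing the target for a later crawl is an assumption made for this example (the quoted passage only says the target is not retrieved automatically), and handle_response is a hypothetical name.

```python
def handle_response(url, status, location, frontier):
    """Return the URL whose content should be stored now, or None."""
    if status == 301 and location:
        # Assumption: schedule the target for a later crawl rather than
        # following the permanent redirect immediately.
        frontier.append(location)
        return None
    if status == 200:
        return url
    return None


frontier = []
print(handle_response("http://example.com/old", 301,
                      "http://example.com/new", frontier))
print(frontier)
```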

“Use a text browser such as Lynx to examine your site,
because most search engine spiders see your site much as
Lynx would. If fancy features such as JavaScript, cookies,
session IDs, frames, DHTML, or Flash keep you from
seeing all of your site in a text browser, then search engine
spiders may have trouble crawling your site.”*
*Google Webmaster Guidelines - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769

Google’s Webmaster Guidelines make crawlers look pretty
unsophisticated, and incapable of much more than the simple
Lynx browser…
…But we have signs that crawlers can be smarter than that,
and Microsoft introduced a Vision-based Page Segmentation
Algorithm in 2003. Both Google and Yahoo have also published
patents and papers that describe smarter crawlers. IBM filed a patent
for a crawler in 2000 that is smarter than most browsers today.

VIPS: a Vision-based Page Segmentation Algorithm - http://research.microsoft.com/apps/pubs/default.aspx?id=70027

http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7519902

Link Merging
Web Site Structure Analysis - http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7861151
• S-nodes – Structural link blocks: organizational and navigational link blocks,
repeated across pages with the same layout and showing the organization of the site.
They are often lists of links that don’t usually contain other content elements such as text.
• C-nodes – Content link blocks: links grouped together by some kind of content association,
such as relating to the same topic or sub-topic. These blocks usually point to information
resources and aren’t likely to be repeated across more than one page.
• I-nodes – Isolated links: links on a page that aren’t part of a link group and
may be only loosely related to each other, for example by
appearing together within the same paragraph of text. Each link on a page might be
considered an individual I-node, or they might be grouped together by page as an I-node.

Canonical = Best!
There can be only one:
http://example.com
http://www.example.com
http://example.com/
http://www.example.com/
https://example.com
https://www.example.com
https://example.com/
https://www.example.com/
http://example.com/index.htm
http://www.example.com/index.htm
https://example.com/index.htm
https://www.example.com/index.htm
http://example.com/INDEX.htm
http://www.example.com/INDEX.htm
https://example.com/INDEX.htm
https://www.example.com/INDEX.htm
http://example.com/Index.htm
http://www.example.com/Index.htm
https://example.com/Index.htm
https://www.example.com/Index.htm
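
The variants above can be collapsed to one URL with a small normalizer. This is a sketch under loud assumptions: the site owner has decided that http, the bare host name, and the trailing-slash root are canonical, and that index.htm in any letter case is the default document. A crawler cannot safely assume any of this about an arbitrary site, and canonicalize is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url, scheme="http", default_docs=("index.htm",)):
    """Map scheme, www, and default-document variants to one URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]                 # assumption: bare host is canonical
    path = parts.path or "/"            # "http://example.com" -> "/"
    head, _, last = path.rpartition("/")
    if last.lower() in default_docs:    # assumption: case-insensitive server
        path = head + "/"
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


variants = [
    "http://example.com",
    "https://www.example.com/",
    "http://www.example.com/index.htm",
    "https://example.com/INDEX.htm",
    "https://www.example.com/Index.htm",
]
print({canonicalize(u) for u in variants})  # {'http://example.com/'}
```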

Canonical Link Element
<link rel="canonical" href="http://example.com/page.html"/>

Rel=“prev” & rel=“next”
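On the first page, http://www.example.com/article?story=abc&page=1:
<link rel="next" href="http://www.example.com/article?story=abc&page=2" />
On the second page, http://www.example.com/article?story=abc&page=2:
<link rel="prev" href="http://www.example.com/article?story=abc&page=1" />
<link rel="next" href="http://www.example.com/article?story=abc&page=3" />
On the third page, http://www.example.com/article?story=abc&page=3:
<link rel="prev" href="http://www.example.com/article?story=abc&page=2" />
<link rel="next" href="http://www.example.com/article?story=abc&page=4" />
And on the last page, http://www.example.com/article?story=abc&page=4:
<link rel="prev" href="http://www.example.com/article?story=abc&page=3" />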
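
The rel="prev" and rel="next" pattern across a paginated series can be sketched as a small generator: every page except the first points back, and every page except the last points forward. pagination_links is a hypothetical helper, and the URL format simply copies the example above.

```python
def pagination_links(base, page, last_page):
    """Return the prev/next link elements for one page of a series."""
    links = []
    if page > 1:
        links.append(f'<link rel="prev" href="{base}&page={page - 1}" />')
    if page < last_page:
        links.append(f'<link rel="next" href="{base}&page={page + 1}" />')
    return links


base = "http://www.example.com/article?story=abc"
for page in range(1, 5):
    print(page, pagination_links(base, page, last_page=4))
```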

View All Pages
Option 1
• Normal Prev/Next sequence
• Self-referential canonicals (each page points to its own URL)
• Noindex meta element on View All page
Option 2
• Normal Prev/Next Sequence
• Canonicals (all pages use the view-all page URL)
http://googlewebmastercentral.blogspot.com/2011/09/view-all-in-search-results.html

Rel=“hreflang”
HTML link element.
In the HTML <head> section of http://www.example.com/, add
a link element pointing to the Spanish version of that webpage at
http://es.example.com/, like this:
<link rel="alternate" hreflang="es" href="http://es.example.com/" />
HTTP header.
If you publish non-HTML files (like PDFs), you can use an
HTTP header to indicate a different language version of a URL:
Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
Sitemap.
Instead of using markup, you can submit language version
information in a Sitemap.
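
The first two options can be generated from the same list of language versions. The helper names below are hypothetical; the output format follows the link element and Link header shown above, with several header values joined by commas.

```python
def hreflang_html(lang, href):
    """Build the HTML link element for one language version."""
    return f'<link rel="alternate" hreflang="{lang}" href="{href}" />'


def hreflang_header(alternates):
    """Build one HTTP Link header from (lang, href) pairs."""
    values = [f'<{href}>; rel="alternate"; hreflang="{lang}"'
              for lang, href in alternates]
    return "Link: " + ", ".join(values)


print(hreflang_html("es", "http://es.example.com/"))
print(hreflang_header([("es", "http://es.example.com/")]))
```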

Rel=“hreflang” XML Sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>http://www.example.com/english/</loc>
<xhtml:link
rel="alternate"
hreflang="de"
href="http://www.example.com/deutsch/"
/>
<xhtml:link
rel="alternate"
hreflang="de-ch"
href="http://www.example.com/schweiz-deutsch/"
/>
<xhtml:link
rel="alternate"
hreflang="en"
href="http://www.example.com/english/"
/>
</url>
</urlset>

XML Sitemap
• Use canonical links
• Remove 404s
• Don’t set priority past 1 week
• If more than 50,000 URLs, use multiple Sitemaps
and a Sitemap index file
• Validate with an XML Sitemap validator
• Include a Sitemap statement in robots.txt
http://www.sitemaps.org/
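
Splitting a large URL list into several Sitemaps plus an index might look like the sketch below. The file-name pattern and the build_sitemaps function are assumptions for illustration; the 50,000 limit and the urlset and sitemapindex element names come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemaps(urls, max_urls=50000,
                   name="http://www.example.com/sitemap-{n}.xml"):
    """Return ({sitemap URL: XML text}, index XML text)."""
    files = {}
    for n, start in enumerate(range(0, len(urls), max_urls), 1):
        urlset = ET.Element("urlset", xmlns=NS)
        for u in urls[start:start + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
        files[name.format(n=n)] = ET.tostring(urlset, encoding="unicode")
    # The index lists each Sitemap file; submit or reference this one URL.
    index = ET.Element("sitemapindex", xmlns=NS)
    for sitemap_url in files:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = sitemap_url
    return files, ET.tostring(index, encoding="unicode")


urls = [f"http://www.example.com/p{i}" for i in range(120000)]
files, index = build_sitemaps(urls)
print(len(files))  # 3
```

The index file's URL is what would go in the robots.txt statement, for example a line such as "Sitemap: http://www.example.com/sitemap-index.xml" (a made-up location).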

Next, we study which of the two crawl systems, Sitemaps and Discovery,
sees URLs first. We conduct this test over a dataset consisting of over five
billion URLs that were seen by both systems.
According to the most recent statistics at the time of the writing,
78% of these URLs were seen by Sitemaps first, compared to
22% that were seen through Discovery first.
Crawling vs. XML
Sitemaps: Above and Beyond the Crawl of Duty –
http://www.shuri.org/publications/www2009_sitemaps.pdf

Crawling Social Media
Ranking of Search Results based on Microblog data - http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220110246457%22.PGNR.&OS=DN/20110246457&RS=DN/20110246457

Questions?
Bill Slawski
Webimax
@bill_slawski
