4. Lecture 8
• Introduction
• How search works
• Anatomy of HTML
• On-page factors
• Off-page factors
• PageRank
From Code to Product Lecture 8 — Search Engines — Slide 4 gidgreen.com/course
5. History of web search
• 1994 — WebCrawler — first full text
• 1994 — Yahoo — directory then portal
• 1995 — AltaVista — first big index
• 1997 — Google — link citation analysis
• 2000 — 2004 — Yahoo uses Google
• 2000 — Baidu — now leader in China
• 2006 — Microsoft Live Search
• 2009 — Bing, switched to by Yahoo
6. Importance of search
• US: 17.1B “core searches” in April 2012
– 55 per US citizen [comScore]
• 92% of online US adults use search
– 96% of college graduates
– 98% with income $75k+
• 70–80% ignore paid ads on right
– (but only 10% ignore ads on top)
• 80% of sessions begin with a search
Sources: comScore, Pew Internet Report, User Centric, PC Magazine,
http://www.searchenginejournal.com/24-eye-popping-seo-statistics/42665/
8. Search as traffic source
9. Global market share
• Google: 81.73%
• Yahoo: 6.42%
• Baidu: 5.65%
• Bing: 4.15%
• Other: 2.05%
Source: May 2012 figures from http://www.netmarketshare.com/
10. USA market share
• Google: 76.57%
• Bing: 10.46%
• Yahoo: 9.83%
• AOL: 1.47%
• Ask: 1.33%
Source: May 2012 figures from http://www.netmarketshare.com/
11. China market share
• Baidu: 78.50%
• Google: 16.60%
• Sogou: 2.80%
• SoSo: 1.40%
• Others: 0.70%
Source: http://chineseseoshifu.com/china-search-engine-market-share/
Also: Japan, Czech Republic, South Korea, Russia
12. Search engine results page
13. Where do people look?
14. Where do people click?
http://www.seomoz.org/blog/mission-imposserpble-establishing-clickthrough-rates
15. Black-hat vs white-hat
Black-hat SEO                 White-hat SEO
Tricking Google               Working with Google
Hidden keywords               Prominent keywords
Cloaking for search           Structured for search
Content scraping              Unique content
Link spam and farms           Attracting links
Short-lived boost (maybe)     Long-term results
16. Google’s recommendations
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769
17. Lecture 8
• Introduction
• How search works
• Anatomy of HTML
• On-page factors
• Off-page factors
• PageRank
18. How search works
• Crawling
– Finding content to index
• Indexing
– Preparing content for search
• Searching
– Showing results to user
19. Basic crawling
• Create an empty URL queue (“frontier”)
• Add one good URL, e.g. wikipedia.org
• Repeat:
– Select random URL from queue
– Retrieve content for that URL
– Add links in content to queue
– Keep track to prevent repeat visits
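The loop above can be sketched in a few lines of Python. This is a minimal illustration, not how a production crawler works: the `fetch` callback, the `PAGES` dictionary and the example.com URLs are made up so that the sketch runs without touching the network.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, max_pages=100, rng=random):
    """Crawl from `seed`; `fetch(url)` returns HTML text, or None on failure."""
    frontier = [seed]   # the URL queue
    seen = {seed}       # every URL ever queued, to prevent repeat visits
    visited = []        # URLs actually retrieved, in order
    while frontier and len(visited) < max_pages:
        # Select a random URL from the queue and remove it.
        url = frontier.pop(rng.randrange(len(frontier)))
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# A tiny in-memory "web" (hypothetical) stands in for real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.com/b.html": '<a href="http://example.com/c.html">C</a>',
}

visited = crawl("http://example.com/", PAGES.get)
```

Each of the three pages is retrieved exactly once, even though they link to each other; the link to c.html is queued but yields no content.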
20. Crawling issues
• Link prioritization
• Duplicate content
– Print versions, sorting
• Infinite loops
– Database-driven sites
• Revisiting pages
• Site overloading
• Parallelization
21. Indexing
22. Indexing
23. Inverted index
https://developer.apple.com/library/mac/#documentation/userexperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
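As a rough sketch of the idea, an inverted index maps each word to the documents (and word positions) where it occurs, so a query only has to look up its own words. The three sample documents and the function names below are invented for illustration; real engines add stemming, ranking and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result


# Hypothetical mini-corpus.
docs = {
    1: "Uber List Manager",
    2: "The world's leading list management software",
    3: "Mailing software for lists",
}
index = build_index(docs)
```

With this corpus, `search(index, "list software")` returns `{2}`. Document 3 is missed for "list" because the sketch does no stemming, so "lists" is a different word.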
24. Other indexed information
• Page metadata
• More about words
– Prominence
– Position
– Frequency
• Links between pages
– Including anchor text
• Images, etc…
25. Other formats
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35287
Forms?
JavaScript?
27. Recent Google changes
• Aug 2012: sometimes 7 results
• May 2012: knowledge graph
• Jan 2012: top heavy ads penalty
• Nov 2011: rewarding freshness
• Feb 2011: hitting content farms
• Dec 2010: social media signals
• Dec 2009: real-time search
http://www.seomoz.org/google-algorithm-change
28. Google web history (2005–2009)
29. Search + your world (2012)
http://www.ubergizmo.com/2012/01/google-now-searches-your-world/
30. Keyword research
But: consider also long tail (referrer logs)
32. Lecture 8
• Introduction
• How search works
• Anatomy of HTML
• On-page factors
• Off-page factors
• PageRank
33. HTTP protocol
GET /wiki/Hypertext_Transfer_Protocol HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/534.50.2 (KHTML, like Gecko) Version/5.0.6 Safari/533.22.3
Referer: http://www.rexswain.com/httpview.html
Connection: close

HTTP/1.0 200 OK
Date: Sun, 17 Jun 2012 06:05:03 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Last-Modified: Sat, 16 Jun 2012 03:14:24 GMT
Content-Length: 164814
Content-Type: text/html; charset=UTF-8
Connection: close
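To show how little structure there is to the exchange, here is a small sketch that picks apart a response head like the one above. The `parse_response_head` helper is made up for this example; in real code the standard library's `http.client` or a similar module does this work.

```python
def parse_response_head(raw):
    """Split a raw HTTP response head into (version, status, reason, headers)."""
    lines = raw.strip().splitlines()
    # Status line, e.g. "HTTP/1.0 200 OK"
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the headers
        # Split on the first colon only: values such as dates contain colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return version, int(status), reason, headers


RAW = """\
HTTP/1.0 200 OK
Date: Sun, 17 Jun 2012 06:05:03 GMT
Server: Apache
Content-Length: 164814
Content-Type: text/html; charset=UTF-8
Connection: close
"""

version, status, reason, headers = parse_response_head(RAW)
```

A crawler would look at `status` first (200, a redirect, an error) and then at headers such as `content-type` and `last-modified` before deciding what to do with the body.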
34. Page structure
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<title>Uber List Manager</title>
</head>
<body>
<h1>Uber List Manager</h1>
<p>The world's leading &amp; best priced list
management software.</p>
</body>
</html>
35. Key <HEAD> elements
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<link rel="stylesheet" type="text/css" href="styles.css">
<script type="text/javascript" src="script.js"></script>
<title>Uber List Manager</title>
<meta name="description" content="An excellent
and well priced list management program.">
<meta name="keywords" content="lists, list
manager, uber, mailing software">
</head>
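A sketch of how an indexer might read these elements, using Python's built-in `html.parser`. The `HeadScanner` class is a name invented for this example, and the sample markup is the head shown above.

```python
from html.parser import HTMLParser


class HeadScanner(HTMLParser):
    """Collects the <title> text and any <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            # http-equiv metas have no name attribute and are skipped.
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


HEAD = """
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Uber List Manager</title>
<meta name="description" content="An excellent and well priced list management program.">
<meta name="keywords" content="lists, list manager, uber, mailing software">
</head>
"""

scanner = HeadScanner()
scanner.feed(HEAD)
keywords = [k.strip() for k in scanner.meta["keywords"].split(",")]
```

After the scan, `scanner.title` holds the page title and `scanner.meta["description"]` the text a results page could show as the snippet.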
36. Key <BODY> elements
<body>
<h1>Uber List Manager</h1>
<p>The world's leading &amp; best priced list
management software.</p>
<h2>Features</h2>
<h2>Customer stories</h2>
<img src="images/ulm.jpeg" width="320" height="240"
alt="Screenshot" title="ULM in action">
<form action="form.php" method="post">
<input type="submit" value="Submit">
</form>
<iframe src="iframe.html" width="300"
height="300"></iframe>
</body>
38. Links
Click <a href="more-information.html">here</a> for
ULM benefits and pricing.
Click for <a href="more-information.html">ULM
benefits and pricing</a>.
Click for <a href="more-information.html" title="ULM
benefits and pricing">more about ULM</a>.
Better than <a href="http://slowlists.com/"
rel="nofollow">our competitors</a>.
<a href="pricing.html"><img src="dollar-bill.jpeg"
alt="Pricing"></a>
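To see what a search engine can take from these variants, the sketch below extracts the target, the anchor text and the nofollow flag of each link. It assumes, for illustration only, that an image's alt text stands in for anchor text when the link wraps an image; `AnchorScanner` is a made-up name.

```python
from html.parser import HTMLParser


class AnchorScanner(HTMLParser):
    """Records href, anchor text and nofollow flag for each <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            self._current = {
                "href": attrs["href"],
                "text": "",
                "nofollow": "nofollow" in rel,
            }
        elif tag == "img" and self._current is not None:
            # Assumption: for an image link, alt text acts as the anchor text.
            self._current["text"] += attrs.get("alt") or ""

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse line breaks and repeated spaces inside the anchor.
            self._current["text"] = " ".join(self._current["text"].split())
            self.anchors.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


HTML = """
Click <a href="more-information.html">here</a> for ULM benefits and pricing.
Click for <a href="more-information.html">ULM
benefits and pricing</a>.
Better than <a href="http://slowlists.com/" rel="nofollow">our competitors</a>.
<a href="pricing.html"><img src="dollar-bill.jpeg" alt="Pricing"></a>
"""

scanner = AnchorScanner()
scanner.feed(HTML)
texts = [a["text"] for a in scanner.anchors]
```

The first link contributes only the word "here" as anchor text, while the second carries the descriptive phrase.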
45. URLs: good vs bad
Bad: www.really-cheap-great-mailing-list-manager.info
Good: www.mailingmanager.com
Bad: googleblog.blogspot.com/view?post_id=3982098&section_id=231
Good: googleblog.blogspot.com/2012/04/introducing-google-drive.html
Bad: amazon.com/store/products/books/computing/internet/seo/Eric+Edge/The%20Art%20Of%20SEO/details
Good: amazon.com/The-Art-SEO-Eric-Edge
46. Meta descriptions
Used for display but not for ranking
Length: 150 to 160 characters
Avoid duplication across many pages
48. Formatting
and <b>good value</b>
and <span style="font-weight:bold;">good value</span>
and <span class="emboldened">good value</span>
and <em>good value</em>
and <strong>good value</strong>
<font size="+2">Features</font>
<big>Features</big>
<p style="font-size:24px;">Features</p>
<h2>Features</h2>
<h2 style="font-size:24px;">Features</h2>
49. Freshness and speed
• Freshness determined by:
– Date the page appeared
– Frequency of content change
– Amount of content change
– Rate of new incoming links
51. Lecture 8
• Introduction
• How search works
• Anatomy of HTML
• On-page factors
• Off-page factors
• PageRank
52. Links from external sites
• From high ranking sites
– Hard to manipulate
• From .edu or .gov
– No commercial motivation
• From topic-related sites
• From many sites
– Diversity of subject
– Different ownership / IP block
53. Links on external pages
• All-important anchor text
– First appearance counts
– Diversity of anchors
• Higher on linking page
• From core text content
– Not navigation/footers
– Image ALT text weaker
• Page has other good links
54. Power of anchors
55. Titles and URLs in anchors
Wikipedia, the free encyclopedia — 450 saves
Visit Wikipedia for more information
Recent referrers: en.wikipedia.org
http://en.wikipedia.org/wiki/Main_Page
56. Attracting links
• (Directories e.g. dmoz)
• Inbound marketing
– Great on-site content
– Post articles elsewhere
– Request reviews
• Viral marketing
– Banners + widgets
– Social network sharing
57. Link bait
60. Duplicate content
• Other sites stealing your content
• www.domain.com vs domain.com
• domain.com/ vs domain.com/index.html
• Printer-friendly versions
• URL parameters
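Several of these duplicates are the same page under different URLs. A minimal sketch of collapsing them to one canonical form is below. The choices made here (prefer the bare domain, strip index.html, ignore the parameters in `IGNORED_PARAMS`) are assumptions for illustration; a real site would pick its own rules and enforce them with redirects or rel=canonical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that change presentation, not content.
IGNORED_PARAMS = {"print", "sort", "sessionid"}


def canonical_url(url):
    """Reduce equivalent URLs for the same page to a single form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]                      # www.domain.com -> domain.com
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]    # /index.html -> /
    # Drop ignored parameters and sort the rest so their order never matters.
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


variants = [
    "http://www.domain.com",
    "http://domain.com/",
    "http://domain.com/index.html",
    "http://WWW.domain.com/index.html?print=1",
]
canonical = {canonical_url(u) for u in variants}
```

All four variants collapse to `http://domain.com/`, which is the kind of grouping a crawler needs in order to avoid indexing one page several times.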
61. robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
User-agent: BadBot
Disallow: /
Sitemap: http://www.example.com/sitemap.xml
Or in <HEAD> of page:
<meta name="robots"
content="noindex">
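The robots.txt rules above can be checked with Python's standard `urllib.robotparser`, which is roughly what a polite crawler does before each request. "GoodBot" is a made-up crawler name, and the file is parsed from a string here rather than fetched from a site.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

User-agent: BadBot
Disallow: /

Sitemap: http://www.example.com/sitemap.xml
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Any crawler may fetch ordinary pages, but not the disallowed directories.
home_ok = robots.can_fetch("GoodBot", "http://www.example.com/index.html")
private_ok = robots.can_fetch("GoodBot", "http://www.example.com/private/notes.html")

# BadBot is shut out of the whole site.
badbot_ok = robots.can_fetch("BadBot", "http://www.example.com/index.html")
```

Note that robots.txt is only a request: well-behaved crawlers honour it, but nothing forces a crawler to do so.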
62. XML sitemaps
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
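A sitemap like this is usually generated rather than written by hand. The sketch below builds one with Python's standard `xml.etree.ElementTree`; the `build_sitemap` helper and the `entries` list are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML (without the <?xml ...?> declaration) as a string."""
    urlset = ET.Element("urlset", xmlns=NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Only <loc> is required; the other fields are optional hints.
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")


entries = [
    {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
     "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/pricing.html"},
]
sitemap_xml = build_sitemap(entries)

# Reading it back: element names are qualified by the sitemap namespace.
root = ET.fromstring(sitemap_xml)
locs = [loc.text for loc in root.iter("{%s}loc" % NS)]
```

The file would then be saved as sitemap.xml and referenced from robots.txt with a `Sitemap:` line.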
63. Redirects and rel=canonical
http://www.seomoz.org/blog/an-seos-guide-to-http-status-codes
64. Putting it all together
http://www.seomoz.org/article/search-ranking-factors#predictions
65. Lecture 8
• Introduction
• How search works
• Anatomy of HTML
• On-page factors
• Off-page factors
• PageRank
66. A random walk
(Diagram: a graph of five linked pages, A to E)
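The random walk can be simulated directly. In this sketch a "random surfer" follows a random link from the current page most of the time and occasionally jumps to a random page instead. The links between pages A to E are hypothetical, since only the page names survive from the slide, and the 0.85 follow probability is the value commonly quoted for PageRank rather than anything stated here.

```python
import random
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["A", "D"],
}


def random_surfer(links, steps=100_000, damping=0.85, seed=0):
    """Estimate how often a random surfer lands on each page."""
    rng = random.Random(seed)   # fixed seed so runs are repeatable
    pages = sorted(links)
    visits = Counter()
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < damping:
            page = rng.choice(links[page])   # follow a link on the page
        else:
            page = rng.choice(pages)         # get bored, jump anywhere
    return {p: visits[p] / steps for p in pages}


share = random_surfer(LINKS)
```

Pages that many others link to (here A and C) collect most of the visits, while E, which nothing links to, is reached only through random jumps.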
67. Probability distribution
http://en.wikipedia.org/wiki/File:PageRanks-Example.svg
68. The maths
http://en.wikipedia.org/wiki/PageRank
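In the form given on the Wikipedia page, each page's rank is PR(p) = (1 - d)/N + d × (sum of PR(q)/L(q) over every page q that links to p), where N is the number of pages, L(q) is the number of links on q and d is the damping factor. A minimal sketch of solving this by repeated substitution (power iteration) follows. The link graph is hypothetical, every link target is assumed to appear as a key, and this is the textbook calculation, not a description of what Google runs today.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Iterate the PageRank formula until the ranks stop changing."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for q in links[p]:
                # p passes an equal share of its rank to each page it links to.
                new[q] += damping * rank[p] / len(links[p])
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["A", "D"],
}

ranks = pagerank(LINKS)
```

The ranks sum to 1, so each can be read as the probability that the random surfer is on that page at any given moment. Page E, with no incoming links, ends up with the minimum possible value of (1 - d)/N = 0.03.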
69. PageRank sculpting?
http://www.seomoz.org/blog/google-says-yes-you-can-still-sculpt-pagerank-no-you-cant-do-it-with-nofollow
70. PageRank in reality
• Domain authority signals
• Nofollow links are clicked by people
• Internal vs external links
• Paid link and link farm detection
• Removed from toolbar in 2009
https://sites.google.com/site/webmasterhelpforum/en/faq--crawling--indexing---ranking#pagerank
71. Tools