Preface
This book aims to provide a general overview of how search engines rank documents in practice, the core of which will remain true even as search engines' algorithms are refined.
Part I
Theory
0.1 Why SEO is important
• A higher search engine result will receive exponentially more clicks than a lower one
For example, if a search was repeated 1000 times by different users, this is typically how many clicks each result would get.
Position   Clicks
1          222
2          63
3          45
4          32
5          26
6          21
7          18
8          16
9          15
10         16
Source: leaked AOL click data
• Paid adverts have low click-through rates, and get expensive quickly
Search Engine   Organic Click-Through Rate (%)   Paid Result Click-Through Rate (%)
Google          72                               28
Yahoo           61                               39
MSN             71                               29
AOL             50                               50
Average         63                               37
88% of online search dollars are spent on paid results, even though 85% of
searchers click on organic results.
Vanessa Fox, Marketing in the Age of Google, May 3, 2010
0.2 Different needs from SEO
There are many different reasons you may wish to engage in optimising your search results, including:
• Money - Sales for e-commerce sites are directly correlated with traffic.
• Reputation - Some companies go to the extent of pushing negative articles down in the rankings.
• Branding - Coming up top in the results pages is impressive to customers, and is particularly important in industries where reputation is extremely important.
1 What is a Search Engine?
1.1 History of Search Engines
The first mechanised information retrieval systems were built by the US military to analyse the mass of documents being captured from the Germans. Research was boosted when the UK and US governments funded research to reduce a perceived science gap with the USSR. By the time the internet was becoming commonplace in the early 1990s, information retrieval was at an advanced stage: complicated methods, primarily statistical, had been developed, and archives of thousands of documents could be searched in seconds.
Web search engines are a special case of information retrieval systems, applied to the massive collection of documents available on the internet. A typical search engine in the 1990s was split into two parts: a web spider that traverses the web following links and creating a local index of the pages, and traditional information retrieval methods to search the index for pages relevant to the user's query and order the pages by some ranking function. Many factors influence a person's decision about what is relevant, such as the current task, context and freshness.
In 1998 pages were primarily ranked by their textual content. Since this is entirely controlled by the owner of the page, results were easy to manipulate, and as the Internet became ever more commercialized the noise from spam in SERPs (search engine results pages) made search a frustrating activity. It was also hard to discern websites which more people would want to visit, for example a celebrity's official home page, from less wanted websites with similar content, for example a fan site. For these reasons directory sites such as Yahoo were still popular, despite being out of date and making the user work out the relevance of results themselves.
The innovation of Google's founders Larry Page and Sergey Brin's PageRank (named after Larry Page), and of a similar algorithm also released in 1998 called Hyperlink-Induced Topic Search (HITS) by Jon Kleinberg, was to use the additional meta information from the link structure of the Internet. A more detailed description of PageRank will follow in [chapter], but for now Google's own description will suffice.
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves important weigh more heavily and help to make other pages important.
Whilst it is impossible to know how Google has evolved its algorithms since the 1998 paper that launched PageRank, or how an efficient real-world implementation differs from the theory, Google itself says the PageRank algorithm remains the heart of Google's software ... and continues to provide the basis for all of [their] web search tools. The search engines continue to evolve at a blistering pace, improving their ranking algorithms (Google says there are now over 200 ranking factors considered for each search [1]) and indexing a growing Internet more rapidly.
1.2 Important Issues
The building of a system as complex as a modern search engine is all about balancing different positive qualities. For example, you could effectively prevent low quality spam by paying humans to review every document on the web, but the cost would be immense. Or you could speed up your search engine by considering only every other document your spider encounters, but the relevance of results would suffer. Some things, such as getting a computer to analyse a document with the same quality as a human, are theoretically impossible today, but Google in particular is pushing boundaries and getting ever closer.
Search engines have some particular considerations:
1.2.1 Performance
The response time to a user's query must be lightning fast.
1.2.2 Dynamic Data
Unlike a traditional information retrieval system in a library, the pages on the Internet are constantly changing.
1.2.3 Scalability
Search engines need to work with billions of users searching through trillions of
documents, distributed across the Earth.
1.2.4 Spam and Manipulation
Actively engaging against other humans to maintain the relevancy of results is relatively unique to search engines. In a library system you may have an author that creates a long title packed with words their readers may be interested in, but that's about the worst of it. When designing your search engine you are in a constant battle with adversaries who will attempt to reverse engineer your algorithm to find the easiest ways to affect your results. A common term for this relationship is Adversarial Information Retrieval. The relationship between the owner of a web site trying to rank high on a search engine and the search engine designer is an adversarial relationship in a zero-sum game. That is, assuming the results were better before, every gain for the web site owner is a loss for the search engine designer. Classifying where your efforts cross from helping a search engine be aware of your web site's content and popularity, which should help to improve a search engine's results, to instead ranking beyond your means and decreasing the quality of a search engine's results can be
1 See http://googlewebmastercentral.blogspot.com/2008/10/good-times-with-inbound-links.html
somewhat tricky. The practicalities of what search engines consider to be spam, and as importantly what they can detect and fix, will be discussed later.
According to Web Spam Taxonomy [2], approximately 10-15% of indexed content on the web is spam. What is considered spam and duplicate content varies, which makes this statistic hard to verify. There is a core of about 56 million pages [3] that are highly interlinked at the centre of the Internet, and these are less likely to be spam. Documents further away (in link steps) from this core are more likely to be spam.
Deciding the quality of a document well (say, whether it is a page written by an expert in the field, or generated by a computer program using natural language processing) is an AI-complete problem; that is, it won't be possible until we have artificial intelligence that can match that of a human.
However, search engines hope to get spam under control by lessening the financial incentive of spam. This quote from a Microsoft Research paper [4] expresses this nicely:
Effectively detecting web spam is essentially an arms race between search engines and site operators. It is almost certain that we will have to adapt our methods over time, to accommodate for new spam methods that the spammers use. It is our hope that our work will help the users enjoy a better search experience on the web. Victory does not require perfection, just a rate of detection that alters the economic balance for a would-be spammer. It is our hope that continued research on this front can make effective spam more expensive than genuine content.
Google developers for their part describe web spam as follows [5], citing the detrimental impact it has upon users:
These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query, or a pornography website, or the manipulated document automatically forwards the user on to a website unrelated to the user's query.
1.3 How a Search Engine works
A typical search engine can be split into two parts: indexing, where the Internet is transformed into an internal representation that can be efficiently searched, and the query process, where the index is searched for the user's query and documents are ranked and returned to the user in a list.
Indexing
2 Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005
3 See On Determining Communities in the Web by K Verbeurg
4 See Detecting Spam Web Pages through Content Analysis by A Ntoulas
5 See patent 7302645: Methods and systems for identifying manipulated articles
1.3.1 Text acquisition
A crawler starts at a seed site such as the DMOZ directory, then repeatedly follows links to find documents across the web, storing the content of the pages and associated meta data (such as the date of indexing, and which page linked to the site). In a modern search engine the crawler is constantly running, downloading thousands of pages simultaneously, to continuously update and expand the index. A good crawler will cover a large percentage of the pages on the Internet, and visit popular pages frequently to keep its index fresh. A crawler will connect to the web server and use an HTTP request to retrieve the document, if it has changed. On average, web page updates follow the Poisson distribution - that is, the crawler can expect the time until the web page next updates to follow an exponential distribution. Crawlers are now also indexing near real-time data through varying sources such as access to RSS feeds and the Twitter API, and are able to index a range of formats such as PDFs and Flash. These formats are converted into a common intermediate format such as XML. A crawler can also be asked to update its copy of a page via methods such as a ping or an XML sitemap, but the update time will still be up to the crawler. The document data store stores the text and meta data the crawler retrieves; it must allow for very fast access to a large number of documents. Text can be compressed relatively easily, and pages are typically indexed by a hash of their URL. Google's original patent used a system called BigTable; Google now keeps documents in sections called shards distributed over a range of data centres (this offers performance, redundancy and security benefits).
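As a rough sketch of the fetch step described above (Python; the requests library, the example URL and the 24-hour mean update interval are illustrative assumptions, not details of any real crawler), the snippet below issues a conditional HTTP request so the server only returns the body if the page has changed, and samples the next visit time from an exponential distribution, in line with the Poisson update model:

import random
import requests

def fetch_if_changed(url, last_modified=None):
    """Fetch a page, asking the server to skip the body if it is unchanged."""
    headers = {"User-Agent": "ExampleCrawler/0.1"}
    if last_modified:
        headers["If-Modified-Since"] = last_modified   # conditional GET
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 304:
        return None, last_modified                     # not modified since last crawl
    return response.text, response.headers.get("Last-Modified")

def hours_until_next_crawl(mean_hours_between_updates):
    """If updates are Poisson, the wait until the next change is exponential."""
    return random.expovariate(1.0 / mean_hours_between_updates)

html, last_mod = fetch_if_changed("http://example.com/")   # placeholder URL
print("changed" if html else "unchanged",
      "- revisit in %.1f hours" % hours_until_next_crawl(24))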
1.3.2 Duplicate Content Detection
Detecting exact duplicates is easy: remove the boilerplate content (menus etc.) then compare the core text through checksums. Detecting near duplicates is harder, particularly if you want to build an algorithm that is fast enough to compare a document against every other document in the index. To perform faster duplicate detection, fingerprints of a document are taken.
A simple fingerprinting algorithm for this is outlined here:
1. Parse the document into words, and remove formatting content such as punctuation and HTML tags.
2. The words are grouped into groups of words (called n-grams, a 3-gram being 3 words, a 4-gram 4 words, etc.)
3. Some of these n-grams are selected to represent the document
4. The selected n-grams are hashed to create a shorter description
5. The hash values are stored in a quick-lookup database
6. The documents are compared by looking at overlaps of fingerprints.
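A minimal sketch of this fingerprinting idea in Python follows; the 3-gram size, the MD5 hash and the keep-one-in-four selection rule are arbitrary choices for illustration, not how any particular engine implements it:

import hashlib
import re

def fingerprints(text, n=3, keep_every=4):
    """Return a set of hashed n-gram fingerprints for a document."""
    words = re.findall(r"[a-z0-9]+", text.lower())            # step 1: parse into words
    ngrams = [" ".join(words[i:i + n])                        # step 2: group into n-grams
              for i in range(len(words) - n + 1)]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16)    # step 4: hash each n-gram
              for g in ngrams]
    return {h for h in hashes if h % keep_every == 0}         # step 3: keep a deterministic subset

def similarity(doc_a, doc_b):
    """Step 6: compare documents by the overlap (Jaccard) of their fingerprints."""
    a, b = fingerprints(doc_a), fingerprints(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(similarity("the quick brown fox jumps over the lazy dog",
                 "the quick brown fox leaps over the lazy dog"))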
Fingerprinting in action
A paper [6] by four Google employees found the following statistics across their index of the web:
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Most common trigram in English: all rights reserved
Detecting unusual patterns of n-grams can also be used to detect low quality/spam documents [7].
1.3.3 Text transformation
Tokenization is the process of splitting a series of characters up into separate words. These tokens are then parsed, looking for markup such as <a> ... </a> tags, to find which parts of the text are plain text, links and so on.
• Identifying Content
Sections of documents that are just content are found, in an attempt to ignore boilerplate content such as navigation menus. A simple way is to look for sections where there are few HTML tags; more complicated methods consider the visual layout of the page.
• Stopping
Common words such as "the" and "and" are removed to increase the efficiency of the search engine, resulting in a slight loss in accuracy. In general, the more unusual a word, the better it is at determining whether a document is relevant.
6 See N-gram Statistics in English and Chinese: Similarities and Differences
7 See http://www.seobythesea.com/?p=5108
• Stemming
Stemming reduces words to just their stem; for example "computer" and "computing" become "comput". Typically around a 10% improvement in relevance is seen in English, and up to 50% in Arabic. The classic stemming algorithm is the Porter Stemmer, which works through a series of rules such as replace "sses" with "ss", so "stresses" becomes "stress" (a toy sketch of this kind of rule-based stemming follows at the end of this section).
• Information Extraction
Trying to determine the meaning of text is very difficult in general, but certain words can give clues. For example the phrase "x has worked at y" is useful when building an index of employees.
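The toy sketch below shows the flavour of rule-based stemming; the handful of suffix rules is invented for illustration and is far simpler than the real Porter stemmer, which applies many more rules with conditions on the remaining stem:

# A few illustrative suffix rules, applied in order. The real Porter stemmer
# has many more rules plus conditions on the length/measure of the stem.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("er", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["stresses", "ponies", "computing", "computer"]:
    print(w, "->", toy_stem(w))   # stresses -> stress, computing/computer -> comput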
1.3.4 Index Creation
Document statistics such as the count of words are stored for use in ranking algorithms. An inverted index [8] is created to allow for fast full-text searches. The index is distributed across multiple data centres across the globe [9].
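As a minimal illustration of what an inverted index looks like (toy documents, no stopping or stemming, and a far simpler structure than a production index):

from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

docs = {1: "search engines rank documents",
        2: "documents are ranked by search engines"}
index = build_inverted_index(docs)
print(index["documents"])   # [(1, 3), (2, 0)] - every posting for the term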
1.3.5 User Interaction
The user is provided with an interface in which to give their query. The query is then transformed, using techniques similar to those applied to documents such as stemming, as well as spell checking and expanding the query to find other queries synonymous with the user's query. After ranking the document set, a top set of results is displayed together with snippets to show how they were matched.
1.3.6 Ranking
A scoring function calculates scores for documents. Some parts of the scoring
can be performed at query time, others at document processing time.
1.3.7 Evaluation
Users' queries and their actions are logged in detail to improve results. For example, if a user clicks on a result then quickly performs the same search again, it is likely that they clicked a poor result.
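A hedged sketch of that click-back heuristic (the log format and the 60-second window are invented purely for illustration):

def poor_result_clicks(events, window_seconds=60):
    """Flag clicks where the user re-ran the same query shortly afterwards."""
    flagged = []
    for i, event in enumerate(events):
        if event["type"] != "click":
            continue
        for later in events[i + 1:]:
            if later["time"] - event["time"] > window_seconds:
                break
            if later["type"] == "query" and later["query"] == event["query"]:
                flagged.append(event)        # likely a poor result for this query
                break
    return flagged

log = [{"type": "query", "query": "driving school", "time": 0},
       {"type": "click", "query": "driving school", "url": "example.com", "time": 5},
       {"type": "query", "query": "driving school", "time": 20}]
print(poor_result_clicks(log))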
8 An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its documents in a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database. http://en.wikipedia.org/wiki/Inverted_index
9 A good overview of Google's shard approach is at http://highscalability.com/google-architecture
2 How good can a search engine be?
There are some very specific limits in computer science as to what a computer program is capable of doing, and these have direct consequences for how search engines can index and rank your web pages. The two core sets of problems are NP-complete problems, which for large sets of data take too long to solve perfectly, and AI-complete problems, which can't be done perfectly until we have computers that are as intelligent as people. That doesn't mean search engines can't make approximations: for example, finding the shortest route around a set of stops on a map (the travelling salesman problem) is NP-complete, yet Google Maps still manages to plot pretty good routes [10].
2.1 NP Hard Problems
Polynomial (P) problems can be solved in polynomial time, that is, relatively quickly. NP-hard problems have no known polynomial-time algorithm, which means in practice they can't be solved exactly for any reasonably large set of inputs, such as a search engine's collection of web pages.
The time taken to solve an NP-hard problem grows extremely quickly as the size of the problem grows.
These concepts become complex quickly, but the key thing to pick up is that if a problem is NP-hard there is no way it can ever be solved perfectly for something as large as a search engine's index, and approximations will have to be used. There are some NP-hard problems that are of particular interest to SEO:
• The Hamiltonian Path Problem - detecting a greedy network (i.e. if you interlink your web pages to hoard PageRank) in the structure of a Hamiltonian path [11] is an NP-hard problem
• Detecting page farms (the set of pages that link to a page) is NP-hard [12]
• Detecting phrase-level duplication in a search engine's index [13]
10 http://www.youtube.com/watch?v=-0ErpE8tQbw
11 http://en.wikipedia.org/wiki/Hamiltonian_path
12 See Sketching Landscapes of Page Farms by Bin Zhou and Jian Pei
2.2 AI Hard Problems
AI-hard problems require intelligence matching that of a human being to be solved. Examples include the Turing Test (tricking a human into thinking they are talking to a human, not a computer), recognising difficult CAPTCHAs, and translating text as well as an expert (who wouldn't be perfect either).
During a question-and-answer session after a presentation at his alma mater, Stanford University, in May 2002, Page said that Google would fulfil its mission only when its search engine was AI-complete, and said something similar in interviews with Newsweek and then Playboy.
I think we're pretty far along compared to 10 years ago, he said. At the same time, where can you go? Certainly if you had all the world's information directly attached to your brain, or an artificial brain that was smarter than your brain, you'd be better. Between that and today, there's plenty of space to cover. What would a perfect search engine look like? we asked. It would be the mind of God. [14]
And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back, in computer science we call that artificial intelligence. That means it would be smart, and we're a long way from having smart computers. [15]
Of particular interest to SEO is that fully understanding the meaning of human text is an AI-complete problem, and even getting close to understanding words in context is very difficult [16]. This means that automatically judging the quality of reasonable computer-generated text against that of a human expert is tricky. It's not unusual to see websites packed with decent computer-generated text (detecting which automatically is an AI-complete problem) and single phrases stitched together from a variety of sources (detecting which is an NP-complete problem) ranking for Google Trends results. This is particularly hard to stop as for new news items there are fewer fresh sources available to choose from; this results in search engine poisoning [17]. Any site that receives a large amount of traffic from this will eventually be visited manually by a Google employee, and penalised manually [18].
Google's solution to the very similar machine translation problem is interesting; rather than attempting to build AI they use their massive resources and the data stored from web pages and user queries to build a reliable statistical engine - their approach isn't necessarily far smarter than their competitors' but their resources make them the best translator out there.
13 See Detecting phrase-level duplication on the world wide web by Microsoft Research employees
14 http://searchenginewatch.com/2156601
15 http://tech.fortune.cnn.com/2011/02/17/is-something-wrong-with-google/
16 http://en.wikipedia.org/wiki/Natural_language_understanding
17 http://igniteresearch.net/spam-in-poisoned-world-cup-results/
18 http://www.google.co.uk/search?q=Google+Spam+Recognition+Guide+for+Quality+Rater
2.3 Competitors
Although not a classic computer science problem, a big limit on how search engines can treat possible spam is that competitors could attempt to make your website look like it was spamming to lower your ranking, increasing theirs. For example, if your website suddenly receives an influx of low quality links from sites known to link to spam, how would Google know whether you naively ordered this or a competitor did?
This is an unsolvable problem, short of non-stop surveillance of all website owners. This is what Google has to say on the matter [19]:
There's almost nothing a competitor can do to harm your ranking or have
your site removed from our index. If you're concerned about another site linking
to yours, we suggest contacting the webmaster of the site in question. Google
aggregates and organizes information published on the web; we don't control the
content of these pages.
I can say from experience that Google bowling most certainly does happen, and there are a couple of experiments written up on the web [20], though it would be very difficult to Google bowl a popular website. Essentially, if a small percentage of links to a site are most likely spam they are just ignored; if a large percentage are likely spam then the links may result in a penalty rather than just being ignored.
It seems likely that poor quality links are increasingly being ignored. The paper Link Spam Alliances from Stanford, the Google founders' alma mater, discusses (now somewhat dated) methods of detecting and punishing potential link spam.
Note that link spam isn't the only way that sites can potentially be Google bowled: if your competitor fills your comment section with duplicate content about organ enlargement and links to known phishing sites, it is unlikely to help your rankings. Google now also takes into account users choosing to block sites from results [21], presumably with a negative effect.
3 Ranking Factors
Google engineers update their algorithms daily [22]. They then run many tests to check they have the right balance between all these factors.
The following is from an interview with Google's Udi Manber.
Q: How do you determine that a change actually improves a set of results?
A: We ran over 5,000 experiments last year. Probably 10 experiments for every successful launch. We launch on the order of 100 to 120 a quarter. We have dozens of people working just on the measurement part. We have statisticians who know how to analyze data, we have engineers to build the tools. We have at least 5 or 10 tools where I can go and see here are 5 bad things that happened. Like this particular query got bad results because it didn't find something or the pages were slow or we didn't get some spell correction.
19 http://www.google.com/support/webmasters/bin/answer.py?answer=34449
20 http://bit.ly/jEKzMa
21 http://googlewebmastercentral.blogspot.com/2011/04/high-quality-sites-algorithm-goes.html
22 http://www.nytimes.com/2007/06/03/business/yourmoney/03google
I have created a spreadsheet that shows how a search engine may calculate the ranking of a trivial set of documents for a particular query; you can view it and try changing things yourself at http://igniteresearch.net/poodle-a-simple-emulation-of-search-engine-ranking-factors/.
3.1 On Page Factors
• Keywords
Repetitions of the words in the query in the document, particularly in key areas such as the title and headers, are positive signals of relevance. The proximity of the words to each other is important, particularly having the exact query in the document. A very large repetition, particularly in non-grammatical sentences, can be a negative signal of spam. Presence of the query words in the domain and URL are useful signals of relevance. Related phrases to the query are also positive signals of relevance (see Latent Semantic Indexing). The meta keywords HTML tag, <meta name="keywords" content="my, keywords">, is largely ignored by modern search engines [23].
• Quality
A number of different authors on a website, good grammar and spelling, and long pages written at reasonable time intervals are positive signs of high quality content [24].
• Geographical Locality
Mentions of an address close to the user show the document may be geographically relevant to the user, particularly for geographically sensitive queries such as "plumbers in London".
• Freshness
For time-dependent queries, such as news events, recent pages are more likely to be helpful to the user. See Google's Quality Deserves Freshness drive, of which Google's faster indexing Caffeine update was a part.
• Duplicate Content
Large percentages of content duplicated either from the same site or others are an indicator of poor quality content, and users will only want to see the canonical copy.
23 See http://googlewebmastercentral.blogspot.com/2009/09/google-does-not-use-keywords-meta-tag.html
24 See http://www.seobythesea.com/?p=541
• Adverts
A very large number of adverts can reduce the user experience, and affiliate links are often associated with heavily SEO-manipulated websites.
• Outbound Links
Links to spammy or phishing websites, or an unusually large number of outbound links on a number of pages, are common indicators of a page that users will not want to visit [25].
• Spam
An unusual repetition of keywords, particularly outside of sentences, is a sign of spam. Techniques such as hidden text and sneaky JavaScript redirects are relatively easy to detect and punish.
3.2 Off Page Factors
• Site Reliability
Unreliable or slow sites provide a poor user experience, and so will have a penalty applied. You can be warned if this happens if you sign up for Google Webmaster Tools [26].
• Popularity of the Site
From aggregated ISP data that search engines buy, and from search traffic [27].
• Incoming Links / PageRank
The link structure of the internet is a useful pointer to a website's popularity. Anchor text on incoming links related to the query shows a search engine the page is related to the query. Links that remain for a long time from sites that have many links pointing to themselves are rated highly. Links that are in boilerplate areas or sitewide may be ignored. Links that are all identical in anchor text (i.e. blatantly machine generated), from spammy websites (bad neighbourhoods [28]), or thought to be paid for with the intention of manipulating rankings or spam, can result in penalties. Links from sites that are most likely owned by the same owner, detected either from Whois data or if the sites are hosted within the same Class C IP range, are likely considered less reliable signals of importance. A normal rate of growth of incoming links is expected, as opposed to bursty starts and stops [29] that indicate link building campaigns [30].
25 See Improving Web Spam Classifiers Using Link Structure for a very interesting Yahoo patent on detecting spam based on the number of inbound and outbound links
26 See http://www.mattcutts.com/blog/site-speed/
27 See http://trends.google.com/websites?q=bing.com&geo=all&date=all and http://www.compete.com
28 See http://www.google.com/support/webmasters/bin/answer.py?answer=35769
29 See http://www.seobook.com/link-growth-profile
30 See http://www.wolf-howl.com/seo/google-patent-analysis/
• Other indirect signals of a website's popularity
Other data can include mentions in chats, emails and social networks.
• Links from trusted websites
The proximity on the web graph to important, trusted sites (links from old, high-PageRank websites at the centre of the old heavily interconnected internet are useful signals that a website can be trusted and is important [31]).
• Links from other sites that rank for the query
Results may be reordered based on how they link to each other.
• Geographical Location
If the geographical location of the server, the website's location according to directories, its top-level domain, or the location set in Google Webmaster Tools match that of the user, it is a signal that the page will be more relevant to the user, particularly for location-sensitive searches.
• User Click Data
If users often search again after clicking on the site's result, that is an indicator that the page is not a good match for the query. The personal history of results clicked, and the pattern of related searches, may help indicate what a user is looking for [32].
• Domain Information
Older domains are likely trusted more. Google is a domain registrar so has extensive Whois information, and validates that address information associated with domains is correct.
• Manual Reviews
Google Quality Raters [33] manually review websites and tag them with categories such as essential to query, not relevant to query, or spam.
3.3 Google PageRank Notes
Google's PageRank was the innovation that propelled Google to the top of
the search engine pile. Whilst its implementation has changed much since its
original description, and many other factors are now taken into account, it is
still at the heart of modern search engines so some extra notes will be made on
it here.
31 See http://www.touchgraph.com/seo and type in http://www.nasa.gov for a visual graph
32 See http://www.seobythesea.com/?p=334
33 See http://searchengineland.com/the-google-quality-raters-handbook-13575
3.3.1 Short Description
The key point is that PageRank considers each link a vote, and links from pages
which have many links themselves are considered more important. Or as Google
puts it:
PageRank reflects our view of the importance of web pages by considering
more than 500 million variables and 2 billion terms. Pages that we believe are
important pages receive a higher PageRank and are more likely to appear at the
top of the search results. PageRank also considers the importance of each page
that casts a vote, as votes from some pages are considered to have greater value,
thus giving the linked page greater value.
3.3.2 Mathematical Description
It's not essential to have a mathematical understanding of how PageRank is calculated, but for those familiar with basic graph theory and algebra it is useful. You may wish to skip this section, and read a slightly less mathematical description [34]. For a more complete treatment of the mathematics see the original PageRank paper [35], Deeper Inside PageRank by Amy N. Langville and Carl D. Meyer, and this thesis [36]. The following is summarised from Sketching Landscapes of Page Farms [37] by Bin Zhou and Jian Pei:
The Web can be modeled as a directed Web graph G = (V, E), where V is the set of Web pages, and E is the set of hyperlinks. A link from page p to page q is denoted by the edge p → q. An edge p → q can also be written as a tuple (p, q).
PageRank measures the importance of a page p by considering how collectively other Web pages point to p directly or indirectly. Formally, for a Web page p, the PageRank score is defined as:
PR(p) = d × Σ_{q ∈ M(p)} PR(q) / OutDeg(q) + (1 - d) / N
where N is the total number of pages, M(p) = { q | q → p } is the set of pages having a hyperlink pointing to p, OutDeg(q) is the out-degree of q (i.e., the number of hyperlinks from q pointing to pages other than q), and d is a damping factor (0.85 in the original PageRank implementation) which models the random transitions of the web. If a damping factor of 0.5 is used then at each page there is a 50/50 chance of the surfer clicking a link, or jumping to a random page on the internet. Without the damping factor the PageRank of any page with an outgoing link would be 0.
34 See the introductions of http://www.sirgroane.net/google-page-rank/, http://www.webworkshop.net/pagerank.html or the Wikipedia article
35 At http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
36 http://web.engr.oregonstate.edu/~sheldon/papers/thesis.pdf
37 See http://www.cs.sfu.ca/~bzhou/personal/paper/sdm07_page_farm.pdf
To calculate the PageRank scores for all pages in a graph, one can assign a random PageRank score value to each node in the graph, then apply the above equation iteratively until the PageRank scores in the graph converge.
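A minimal sketch of that iterative calculation in Python (a toy four-page graph; uniform rather than random initial scores, which converges to the same values; every page is assumed to have at least one outgoing link):

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                   # initial scores
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * incoming     # the equation above
        pr = new_pr
    return pr

# Toy graph: A and B link to each other; C and D both link to A.
links = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"]}
print(pagerank(links))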
The Google toolbar shows PageRank on a logarithmic scale out of 10, not the actual internal data. For example:
Domain        Calculated PageRank   PageRank displayed in Toolbar
small.com     47                    2
medium1.com   54093                 5
medium2.com   84063                 5
big.com       1234567               7
big2.com      2364854               7
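Purely to illustrate how a logarithmic scale compresses very different raw scores into the same toolbar value, here is a toy mapping; the base of 8 and the rounding are invented (they happen to reproduce the example table above), and Google has never published the real mapping:

import math

def toolbar_pagerank(raw_score, base=8):
    """Map a raw score onto a 0-10 logarithmic toolbar scale (illustrative only)."""
    return min(10, round(math.log(max(raw_score, 1), base)))

for raw in (47, 54093, 84063, 1234567, 2364854):
    print(raw, "->", toolbar_pagerank(raw))   # 2, 5, 5, 7, 7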
3.3.3 Interesting Notes on the Original Implementation of PageRank
From PageRank Uncovered [38], essential reading for those looking to understand PageRank from an SEO perspective:
• PageRank is a multiplier, applied after relevant results are found
Remember, PageRank alone cannot get you high rankings. We've mentioned before that PageRank is a multiplier; so if your score for all other factors is 0 and your PageRank is twenty billion, then you still score 0 (last in the results). This is not to say PageRank is worthless, but there is some confusion over when PageRank is useful and when it is not. This leads to many misinterpretations of its worth. The only way to clear up these misinterpretations is to point out when PageRank is not worthwhile. If you perform any broad search on Google, it will appear as if you've found several thousand results. However, you can only view the first 1000 of them. Understanding why this is so explains why you should always concentrate on the on-page factors and anchor text first, and PageRank last.
• Each page is born with a small amount of PageRank
A page that is in the Google index has a vote, however small. Thus, the more pages you have in the index the more overall vote you are likely to have. Or, simply put, bigger sites tend to hold a greater total amount of PageRank within their site (as they have more pages to work with).
Note that Google's original algorithm has most likely been amended since to detect and reduce PageRank hoarding, and the generation of PageRank by massive interlinking of auto-generated pages. Also, for quicker calculations, an approximation of PageRank which only gives certain seed pages PageRank may be used [39].
38 See http://www.bbs-consultant.net/IMG/pdf_PageRank.pdf
Interestingly, however, there are examples of this working; see How to get billions of pages indexed in Google at http://www.threadwatch.org/node/6999. In a related issue, at one point 10% of MSN Search's (now known as Bing) German index was computer generated content on a single domain [40].
3.3.4 Optimal Linking Strategies
Deciding how to interlink pages that you own or have influence over is tricky; interlinking can be a good signal that pages are related and on a certain topic, and can build PageRank and control PageRank flow. However, heavy interlinking can be a signal of manipulation and spam, and different linking structures can make different sites in your possession rank higher. The mathematics gets tricky fast; here is a quick overview of the literature today:
• Note from Web Spam Taxonomy
Though written about spam farms, the math holds true for good commercial sites too. Essentially this states that maximum PageRank for a target page is achieved by linking only to the target page from forums, blogs etc., then interlinking the network of sites owned (as if there are no outlinks on a page the random surfer will jump to a random page on the Internet).
1. Inaccessible pages are those that a spammer cannot modify. These are the pages out of reach; the spammer cannot influence their outgoing links. (Note that a spammer can still point to inaccessible pages.)
2. Accessible pages are maintained by others (presumably not affiliated with the spammer), but can still be modified in a limited way by a spammer. For example, a spammer may be able to post a comment to a blog entry, and that comment may contain a link to a spam site.
3. Own pages are maintained by the spammer, who thus has full control over their contents.
We can observe how the presented structure maximizes the total PageRank score of the spam farm, and of page t in particular:
1. All available n own pages are part of the spam farm, maximizing the static PageRank score.
2. All m accessible pages point to the spam farm, maximizing the incoming PageRank score.
3. Links pointing outside the spam farm are suppressed, making the outgoing PageRank score PRout zero.
4. All pages within the farm have some outgoing links, rendering a zero PRsink score component.
Within the spam farm, the score of page t is maximal because:
1. All accessible and own pages point directly to the target, maximizing its incoming score PRin(t).
2. The target points to all other own pages. Without such links, t would have lost a significant part of its score (PRsink(t) > 0), and the own pages would have been unreachable from outside the spam farm. Note that it would not be wise to add links from the target to pages outside the farm, as those would decrease the total PageRank of the spam farm.
39 For more on why this shouldn't work see http://www.pagerank.dk/Pagerank/Generate-pagerank.htm
40 See http://research.microsoft.com/pubs/65144/sigir2005.pdf
• From Link Spam Alliances
The analysis that we have presented shows how the PageRank of target pages can be maximized in spam farms. Most importantly, we find that there is an entire class of farm structures that yield the largest achievable target PageRank score. All such optimal farm structures share the following properties:
1. All boosting pages point to and only to the target.
2. All hijacked pages point to the target.
3. There are some links from the target to one or more boosting pages.
• From Maximizing PageRank via Outlinks
In this paper we provide the general shape of an optimal link structure for a website in order to maximize its PageRank. This structure, with a forward chain and every possible backward link, may not be intuitive. To our knowledge, it has never been mentioned, while topologies like a clique, a ring or a star are considered in the literature on collusion and alliance between pages. Moreover, this optimal structure gives new insight into the affirmation of Bianchini et al. that, in order to maximize the PageRank of a website, hyperlinks to the rest of the webgraph should be in pages with a small PageRank and that have many internal hyperlinks. More precisely, we have seen that the leaking pages must be chosen with respect to the mean number of visits before zapping they give to the website, rather than their PageRank.
• From The Effect of New Links on PageRank by Xie
Theorem: The optimal linking strategy for a Web page is to have only one outgoing link pointing to a Web page with a shortest mean first passage time back to the original page.
Conclusions: ... We conclude that having no outgoing link is a bad policy and that the best policy is to link to pages from the same Web community. Surprisingly, a new incoming link might not be good news if a page that points to us gives many other irrelevant links at the same time.
Reading this paper fully, it is only in very particular circumstances that a new incoming link is not good news.
3.3.5 Implementation to make computing PageRank faster
There have been a number of proposed improvements to the original PageRank algorithm to improve the speed of calculation [41], and to adapt it to be better at determining quality results. No search engine calculates PageRank as shown in the naive algorithm in the original paper [42].
3.3.6 HITS
HITS is another ranking algorithm that takes into account the pattern of links found throughout the web; it was released around the same time as PageRank, in 1998. HITS treats some pages on the web as authorities, which are good documents on a topic, and others as hubs, which mostly link to authorities.
A page is given a high authority score by being linked to by pages that are recognized as hubs for information. A page is given a high hub score by linking to nodes that are considered to be authorities on the subject.
Unlike PageRank, which is query independent and so computed at indexing time, HITS hub and authority scores are query dependent and so computed (though likely cached) at query time.
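A rough sketch of the hub/authority iteration (toy graph, fixed iteration count and a simple normalisation; real implementations restrict the computation to a query-specific subgraph):

def hits(links, iterations=50):
    """links maps each page to the pages it links to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority comes from the hub scores of the pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # A page's hub score comes from the authority of the pages it links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalise so scores do not grow without bound.
        a_total, h_total = sum(auth.values()) or 1, sum(hub.values()) or 1
        auth = {p: s / a_total for p, s in auth.items()}
        hub = {p: s / h_total for p, s in hub.items()}
    return hub, auth

links = {"hub1": ["authA", "authB"], "hub2": ["authA"], "authA": [], "authB": []}
print(hits(links))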
3.3.7 Is linking out a good thing?
Whilst TEOMA is the only search engine that uses HITS at its core, its thinking has heavily influenced search engine designers - so it is likely that linking out to high quality authorities can influence either a page's ranking (positively, though potentially negatively if designers want authorities rather than hubs to appear in their results [43]), or the importance of the other links it contains. Many webmasters fear linking out to sites as they would rather keep links internal to prevent PageRank flowing out (many webmasters also nofollow links for similar reasons; note that this form of PageRank sculpting no longer works according to Matt Cutts, Google's head of [anti]web spam).
Matt Cutts also said a number of years ago:
Of course, folks never know when we're going to adjust our scoring. It's pretty easy to spot domains that are hoarding PageRank; that can be just another factor in scoring.
Some search engines are even concerned about people linking out too much: whilst crawlers can now index a large number of links on a page, a very large number of outbound links often indicates that a site has been hacked with spam links or is machine generated.
A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score. At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning [44].
41 For example, see Computing PageRank using Power Extrapolation and Efficient PageRank Approximation via Graph Aggregation
42 Matt Cutts discusses a couple of the implementation details at http://www.mattcutts.com/blog/more-info-on-pagerank/
43 See http://www.wolf-howl.com/seo/seo-case-study-outbound-links/ and Deeper Inside PageRank, discussed earlier
3.3.8 TrustRank / Bad PageRank
It's likely that after results are generated based on relevance, PageRank and then TrustRank are applied to help order the results. A site may lose trust every time it fails some kind of spam test (for example if a large number of reciprocal links are found, cloaking, duplicate content, or fake Whois data) and gain trust for certain properties (domain age, traffic, or being one of a number of important seed sites that are manually tagged as trusted sites). These initial TrustRank scores could then be propagated in a similar way to PageRank, so linking to and from bad neighborhoods would negatively affect a site's TrustRank through association [45].
From SEO By The Sea:
In 2004, a Yahoo whitepaper was published which described how the search engine might attempt to identify web spam by looking at how different pages linked to each other. That paper was mistakenly attributed to Google by a large number of people, most likely because Google was in the process of trademarking the term TrustRank around the same time, but for different reasons. Surprisingly, Google was granted a patent on something it referred to as Trust Rank in 2009, though the concept behind it was different than Yahoo's description of TrustRank. Instead of looking at the ways that different sites linked to each other, Google's Trust Rank works to have pages ranked according to a measure of the trust associated with entities that have provided labels for the documents.
44 See Web Spam Taxonomy
45 See http://bakara.eng.tau.ac.il/semcomm/GKRT.pdf and http://www.freepatentsonline.com/7603350.html and http://www.cs.toronto.edu/vldb04/protected/eProceedings/contents/pdf/RS15P3.PDF
...
If you've ever heard or seen the phrase TrustRank before, it's possible that whoever was writing about it, or referring to it, was discussing a paper titled Combating Web Spam with TrustRank (pdf). While the paper was the joint work of researchers from Stanford University and Yahoo!, many writers have attributed it to Google since its publication date in 2004. The confusion over who came up with the idea of TrustRank wasn't helped by Google trademarking the term TrustRank in 2005. That trademark was abandoned by Google on February 29, 2008, according to the records at the US PTO TESS database. However, a patent called Search result ranking based on trust deals with something called trust rank, filed on May 9, 2006.
Google mentions distrust and trust changes as indicators. More than trust analysis, trust variation analysis is on the road. Fake reviews, sponsored blogs and e-commerce trust network influence are pointed out.
The paper A Cautious Surfer for PageRank comments on why TrustRank shouldn't be overused:
However, the goal of a search engine is to find good quality results; spam-free is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well connected to those seeds.
3.3.9 Improvements to Google's ranking algorithms
There have been a number of notable algorithm changes which made considerable changes to results pages, though often the effects were later scaled back slightly.
• NoFollow
Matt Cutts and Jason Shellen created the nofollow specification to help limit the effect of, and incentive for, blog spam. If a search engine comes across a link tagged as nofollow, it will not treat the link as a vote, i.e. as a positive signal in rankings. Areas where untrusted users can post content are often tagged nofollow; roughly 80% of content management systems (the software that websites run on) implement nofollow.
The HTML code of a NoFollow link:
<a href="signin.php" rel="nofollow">sign in</a>
• Increasing use of anchor text
Even the original PageRank algorithm took into account the anchor text of links, so links were used to give both a number that indicated the site's popularity and information about the content of a document, and so its relevance for user queries.
• Google Bombing Prevention, 2nd February 2007
Google bombing is the process of massively linking to a page with a specific anchor text, to give PageRank but, more importantly, indications that the document is related to the anchor text. For example, in 1999 a number of bloggers grouped together to link to Microsoft.com with the anchor text more evil than Satan himself. This resulted in Microsoft being placed number one in searches for more evil than Satan himself despite not having the phrase anywhere on its page. Detecting a sudden influx of links with identical anchor text is very easy, and in 2007 Google changed their indexing structure so that Google bombs such as miserable failure would typically return commentary, discussions, and articles about the tactic itself. Matt Cutts said the Google bombs had not been a very high priority for us. Over time, we've seen more people assume that they are Google's opinion, or that Google has hand-coded the results for these Google-bombed queries. That's not true, and it seemed like it was worth trying to correct that perception. [46] Some Google bombs still work, particularly those targeting unusual phrases, with varied anchor text, over a period of time, within paragraphs of text.
• Florida, November 2003
Results for highly commercial queries, likely informed by the cost of AdWords, became heavily filtered so that more trusted academic websites and less commercially optimised websites ranked. Some of these changes resulted in less relevance, for example if a user was searching for buy bricks they probably didn't want to mainly see websites about the process of creating bricks, and were rolled back. For more see [47] and [48].
• Bourbon, June 2005
A penalty was applied to sites with unusually fast or bursty patterns of link growth.
• Jagger, October 2005
A penalty was applied to sites with unusually large numbers of reciprocal links, plus new methods for detecting hidden text.
• Big Daddy, December 2005
According to Matt Cutts, punished were sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling. [49]
46 See http://answers.google.com/answers/main?cmd=threadviewid=179922
47 http://www.searchengineguide.com/barry-lloyd/been-gazumped-by-google-trying-to-make-sense-of-the-florida-update.php
48 http://www.seoresearchlabs.com/seo-research-labs-google-report.pdf
49 See http://www.webworkshop.net/googles-big-daddy-update.html
• Caffeine, October 2010
A faster indexing system that changed results little, but allowed for fresher results and some of the later Panda updates [50].
• Panda, April 2011
A penalty applied to content deemed low quality, detected primarily from user data. Websites which contained masses of articles, focusing on quantity over quality, were often hit [51].
4 Detecting Spam and Manipulation
You will often hear that your site has to look natural to the search engines. Just what natural means is hard to define, but essentially it means the profile of a site whose popularity was never engineered or promoted, and was instead based on people luckily coming across it and deciding to recommend it to their friends with links. What's more, you also need to make your site look popular: creating no links to your site yourself will look natural, but you will have no chance of competing with people who do unless you have the cash to buy large amounts of advertising. This section briefly covers what search engines consider to be acceptable, when and how they can detect violations, and what the potential penalties are.
4.1 Google Webmaster Guidelines
Google have created a page called Webmaster Guidelines to inform users of what they consider to be acceptable methods of promoting your website. Whilst the lines for crossing general principles such as Would I do this if search engines didn't exist? are somewhat vague, they do offer some specific notes of what not to do:
• Avoid hidden text or hidden links.
• Don't use cloaking or sneaky redirects.
• Don't send automated queries to Google.
• Don't load pages with irrelevant keywords.
• Don't create multiple pages, sub-domains, or domains with substantially duplicate content.
• Don't create pages with malicious behavior, such as phishing or installing viruses, Trojans, or other badware.
• Avoid doorway pages created just for search engines, or other cookie cutter approaches such as affiliate programs with little or no original content.
• If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.
50 See http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
51 See http://blog.searchmetrics.com/us/2011/04/12/googles-panda-update-rolls-out-to-uk/ and http://www.seobook.com/questioning-questions and http://googlewebmastercentral.blogspot.com/2011/05/more-guidance-on-building-high-quality.html
Most of the methods listed above are naive and easy to detect. Google have been fairly successful in aligning successful manipulation with the creation of genuine content, though without any promotion it is unlikely even the best content will be noticed.
4.2 Penalties
Penalties [52] that Google applies to detected manipulation vary in length of time and effect, from small ranking penalties for certain keywords on a page to site-wide bans, depending upon the sophistication of the manipulation methods and the quality of the offending site. If you believe you have had one applied, you can submit a Google Reconsideration Request (http://www.google.com/support/webmasters/bin/answer.py?answer=35843) from Google Webmaster Tools, once you have fixed the offending issues.
4.3 Detecting Manipulation in Content
There is a fascinating paper by Microsoft which details a number of methods for detecting spam pages in search engine indexes based on their content. A simple way is to use Bayesian filters (one is included with Ignite SEO to test your content as the search engines would), so for example seeing the phrase buy pills would be a strong indicator of spam. Most of the research is on detecting blatantly computer generated lists of keywords, which is fairly easy to detect. Detecting the quality of human written content is very difficult, so if you are writing your own content and not endlessly repeating your keywords you can be reasonably happy with its quality in the search engines' eyes.
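As a toy illustration of the Bayesian filtering idea (a tiny invented training set, single-word counts and Laplace smoothing; nothing like the scale or feature set a real engine would use):

import math
from collections import Counter

spam_docs = ["buy pills cheap pills online", "cheap viagra buy now"]
ham_docs = ["driving school lessons in springfield", "book driving lessons online"]

def count_words(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = count_words(spam_docs)
ham_counts, ham_total = count_words(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def spam_log_odds(text):
    """Sum of log likelihood ratios; positive means 'looks like spam'."""
    score = 0.0
    for w in text.split():
        p_spam = (spam_counts[w] + 1) / (spam_total + len(vocab))   # Laplace smoothing
        p_ham = (ham_counts[w] + 1) / (ham_total + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

print(spam_log_odds("buy pills"))         # clearly positive
print(spam_log_odds("driving lessons"))   # clearly negative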
Graphs illustrating these content-based features can be found in Detecting Spam Web Pages through Content Analysis [53] by Microsoft Research employees.
4.4 Detecting Manipulation in Links
Much research has focused on detecting spam pages through their backlinks or outlinks. Yahoo obtained a patent that uses the rate of link growth to detect manipulation. Essentially a constant rate of new backlinks, perhaps with small growth over time, is expected for a typical site. A saw-tooth pattern of inlinks is a strong indicator of backlink campaigns that start and stop (though it could also be an indicator of, say, a site that releases new software monthly).
52 http://www.forbes.com/2007/04/29/sanar-google-skyfacet-tech-cx_ag_0430googhell.html
53 http://cs.wellesley.edu/~cs315/Papers/Ntoulas-DetectingSpamThroughContentAnalysis.pdf
In their paper, Fetterly et al. analyse the indegree (incoming/backlinks) and outdegree (links on the page) distributions of web pages (a rough sketch of this outlier check appears at the end of this section):
Most web pages have in and outdegrees that follow a power-law distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages.
As discussed in the TrustRank section earlier, a large number of links from sites that have already been detected as linking to spam (so-called untrustworthy hubs) is a negative indicator. Links from unrelated websites, reciprocal links, links outside of content, links from sites that are known to host paid links, and many other signals are likely taken into consideration.
Zhang et al. have identified a method for identifying unusually highly interconnected groups of web pages. More methods of identifying manipulative sites are listed in Link Spam Alliances by Gyöngyi and Garcia-Molina.
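The sketch referred to above: a rough illustration of flagging indegree values that occur far more often than a power-law fit predicts (the synthetic counts, the exponent of 2.1 and the tolerance are all invented; the real analysis fits the distribution much more carefully):

from collections import Counter

def degree_outliers(indegrees, exponent=2.1, tolerance=5.0):
    """Flag indegree values shared by far more pages than a power law predicts."""
    histogram = Counter(indegrees)              # indegree value -> number of pages
    total = len(indegrees)
    outliers = []
    for degree, observed in histogram.items():
        expected = total * degree ** -exponent  # crude power-law expectation
        if observed > tolerance * expected:
            outliers.append((degree, observed, round(expected, 1)))
    return outliers

# Synthetic data: roughly power-law distributed, plus 400 pages that all
# have exactly 37 inlinks (e.g. a machine-generated link farm).
indegrees = [d for d in range(1, 80) for _ in range(int(10000 * d ** -2.1))] + [37] * 400
print(degree_outliers(indegrees))           # only the suspicious value 37 is flagged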
4.5 Other Methods
If you think a competitor has been using methods that violate the webmaster guidelines, you can report them to Google [54]. It's good practice to ensure that any site you wish to keep for a long time, and expect to get reasonable amounts of traffic from, stays well within these guidelines.
Google will sometimes manually review websites without prompting; Google Quality Raters inspect sites for relevance to results but can also tag web pages as spam. Particular markets are inspected more often than others.
54 https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1
Part II
Practice
5 An Example Campaign
Now we've covered the theory, it's time for a real world example of putting it into practice.
5.1 Company Profile
John runs a driving school in Springfield, Ohio. He has a website he has owned for a couple of years, which ranks around the second page for most searches related to driving schools in Ohio and receives about 20 visitors a day, a third from search engines and two thirds from links from local websites.
A quick search for what he imagines would be his main keyword, driving school Springfield Ohio, has a company directory site at the top followed by other directories, companies and people asking on forums for recommendations. This mix of relevant small companies' websites and small pages on big websites indicates the keyword to be of medium difficulty to rank for.
5.2 Goals
John thinks if he can get his site to rank 3rd instead of around the middle of the second page for his core keywords, he will increase his search traffic by around 1000%, his overall traffic by about 300%, and roughly double his sales. He aims to do this over a period of roughly one month.
5.3 Competitor Research
John finds his main competitors by searching, and gets estimates of their traffic sources using sites such as compete.com and serversiders.com. A tool such as Ignite SEO can automatically build SEO reports of competitors, listing their paid and organic keywords, demographics and backlinks. Looking at the HTML source code of some of his competitors reveals their targeted keywords in the <meta name="keywords" content="keyword1, keyword2, ..."> tag.
5.4 Keyword Research
John takes his initial guesses of what potential customers might search for, and those from his competitors and his existing traffic, and expands this list using the Google Keyword Tool [55] and Google Insights [56].
55 https://adwords.google.co.uk/select/KeywordToolExternal
56 http://www.google.com/insights
5.5 Content Creation
John takes his keywords and creates a small amount of content on his website containing them. He then quickly creates a large amount of content and sites hosted on free hosting sites [57], each one targeting a different keyword. The content generator section of Ignite SEO [58] is perfect for this.
5.6 Website Check
Before investing in off-site promotion (i.e. link building), it is worth performing a quick check that the site is search engine friendly. Creating an account in Google Webmaster Tools will let you know if Google has any issues indexing your website, and it is worth ensuring navigation isn't over-reliant on JavaScript or Flash.
5.7 Link Building
This is the core process that will actually improve John's rankings. By looking at his competitors' backlinks using Yahoo's linkdomain: command, John replicates their links to his website by visiting each site one by one. Using a tool such as Ignite SEO, he can automatically build links to the hosted sites he quickly created in 5.5, without the risk of a link campaign negatively affecting the rankings of his core website. Other signals of quality such as Facebook and Twitter recommendations are built here.
5.8 Analysis
The success of the campaign is measured with a good tracking system such as Google Analytics, as well as by tracking the new incoming links with Google Webmaster Tools and Yahoo's link: command. The results are compared with the goals, and the whole process is refined and repeated.
57 http://igniteresearch.net/which-web-2-0-ranks-best-hubpages-vs-squidoo-vs-tumblr-vs-blogspot-etc/
58 http://igniteresearch.net
About the Author
Christopher Doman is a partner of Ignite Research, a firm specialising in software and consultancy for search engine marketing. He holds a BA in Computer Science from the University of Cambridge.