Using 'page importance' in ongoing conversation with Googlebot to get just a bit more crawl budget as part of technical SEO strategy for ecommerce and enterprise SEO website projects
1. USING ‘PAGE IMPORTANCE’ IN ONGOING CONVERSATION WITH GOOGLEBOT TO GET JUST A BIT MORE THAN YOUR ALLOCATED CRAWL BUDGET
NEGOTIATING CRAWL BUDGET WITH GOOGLEBOTS
Dawn Anderson @dawnieando
4. 1994 - 1998
“THE GOOGLE INDEX IN 1998 HAD 60 MILLION PAGES” (GOOGLE)
(Source: Wikipedia.org)
5. 2000
“INDEXED PAGES REACHES THE ONE BILLION MARK” (GOOGLE) “IN OVER 17 MILLION WEBSITES” (INTERNETLIVESTATS.COM)
6. 2001 ONWARDS
ENTER WORDPRESS, DRUPAL CMS’, PHP DRIVEN CMS’, ECOMMERCE PLATFORMS, DYNAMIC SITES, AJAX – WHICH CAN GENERATE 10,000S OR 100,000S OR 1,000,000S OF DYNAMIC URLS ON THE FLY WITH DATABASE ‘FIELD BASED’ CONTENT
DYNAMIC CONTENT CREATION GROWS
ENTER FACETED NAVIGATION (WITH MANY # PATHS TO SAME CONTENT)
2003 – WE’RE AT 40 MILLION WEBSITES
7. 2003 ONWARDS – USERS BEGIN TO JUMP ON THE CONTENT GENERATION BANDWAGON
LOTS OF CONTENT – IN MANY FORMS
8. WE KNEW THE WEB WAS BIG… (GOOGLE, 2008)
https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
“1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!” (Jesse Alpert on Google’s Official Blog, 2008)
2008 – EVEN GOOGLE ENGINEERS STOPPED IN AWE
9. 2010 – USER GENERATED CONTENT GROWS
“Let me repeat that: we create as much information in two days now as we did from the dawn of man through 2003” … “The real issue is user-generated content.” (Eric Schmidt, 2010 – Techonomy Conference Panel)
SOURCE: http://techcrunch.com/2010/08/04/schmidt-data/
10. CONTENT KEEPS GROWING
Indexed Web contains at least 4.73 billion pages (13/11/2015)
[Chart: total number of websites, 2000 - 2014]
THE NUMBER OF WEBSITES DOUBLED BETWEEN 2011 AND 2012, AND GREW BY ANOTHER THIRD IN 2014
11. 2014 – WE PASS A BILLION INDIVIDUAL WEBSITES ONLINE
EVEN SIR TIM BERNERS-LEE (Inventor of www) TWEETED
12. 2014 – WE ARE ALL PUBLISHERS
SOURCE: http://wordpress/activity/posting
13. YUP - WE ALL ‘LOVE CONTENT’
IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO – A LOT
http://www.internetlivestats.com/total-number-of-websites/
14. CAPACITY LIMITATIONS – EVEN FOR SEARCH ENGINES
“As of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents” (MANY GOOGLE PATENTS)
Source: Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
15. NOT ENOUGH TIME – SOME THINGS MUST BE FILTERED
“So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-)” (Jesse Alpert, Google, 2008)
Source: https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
16. A LOT OF THE CONTENT IS ‘KIND OF THE SAME’
“There’s a needle in here somewhere” “It’s an important needle too”
17. WHAT IS THE SOLUTION?
Capacity limits on Google’s crawling system – how have search engines responded?
By prioritising URLs for crawling
By assigning crawl period intervals to URLs
By creating work ‘schedules’ for Googlebots
“To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling”. - Scheduler for search engine crawler (Zhu et al)
18. GOOGLE CRAWL SCHEDULER PATENTS
EFFICIENCY IS NECESSARY
Include:
‘Managing items in a crawl schedule’
‘Scheduling a recrawl’
‘Web crawler scheduler that utilizes sitemaps from websites’
‘Document reuse in a search engine crawler’
‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’
‘Scheduler for search engine’
19. CRAWL BUDGET
1. Crawl Budget – “An allocation of crawl frequency visits to a host (IP LEVEL)”
2. Roughly proportionate to PageRank and host load / speed / host capacity
3. Pages with a lot of links get crawled more
4. The vast majority of URLs on the web don’t get a lot of budget allocated to them (low to 0 PageRank URLs).
https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
20. BUT… MAYBE THINGS HAVE CHANGED?
CRAWL BUDGET / CRAWL FREQUENCY IS NOT JUST ABOUT HOST-LOAD AND PAGERANK ANY MORE
21. STOP THINKING IT’S JUST ABOUT ‘PAGERANK’
http://www.youtube.com/watch?v=GVKcMU7YNOQ&t=4m45s
“You keep focusing on PageRank”… “There’s a shit-ton of other stuff going on” (Illyes, G, Google - 2016)
22. THERE ARE A LOT OF OTHER THINGS AFFECTING ‘CRAWLING’
WEB PROMOS Q & A WITH GOOGLE’S ANDREY LIPATTSEV
Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
23. WHY? BECAUSE…
THE WEB GOT ‘MAHOOOOOSIVE’ AND CONTINUES TO GET ‘MAHOOOOOOSIVER’
SITES GOT MORE DYNAMIC, COMPLEX, AUTO-GENERATED, MULTI-FACETED, DUPLICATED, INTERNATIONALISED, BIGGER, BECAME PAGINATED AND SORTED
24. GOOGLEBOT’S TO-DO LIST GOT REALLY BIG
WE NEED MORE WAYS TO GET MORE EFFICIENT AND FILTER OUT TIME-WASTING CRAWLING SO WE CAN FIND IMPORTANT CHANGES QUICKLY
25. FURTHER IMPROVED CRAWLING EFFICIENCY SOLUTIONS NEEDED
Hard and Soft Crawl Limits
Importance Thresholds
Min and Max Hints & ‘Hint ranges’
Importance Crawl Periods
Scheduling
Prioritization
Tiered Crawling Buckets (‘Real Time’, ‘Daily’, ‘Base Layer’)
26. SEVERAL PATENTS UPDATED (SEEM TO WORK TOGETHER)
‘Managing URLs’ (Alpert et al, 2013) (PAGE IMPORTANCE DETERMINING SOFT AND HARD LIMITS ON CRAWLING)
‘Managing Items in a Crawl Schedule’ (Alpert, 2014)
‘Scheduling a Recrawl’ (Auerbach, Alpert, 2013) (PREDICTING CHANGE FREQUENCY IN ORDER TO SCHEDULE NEXT VISIT, EMPLOYING HINTS (Min & Max))
‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’ (INCLUDES EMPLOYING HINTS TO DETECT PAGES ‘NOT’ TO CRAWL)
27. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
3 layers / tiers / buckets for scheduling:
Real Time Crawl – crawled multiple times daily
Daily Crawl – crawled daily or bi-daily
Base Layer Crawl (Most Unimportant) – crawled least, on a ‘round robin’ basis; split into segments on random rotation and only the ‘active’ segment is crawled
URLs are moved in and out of layers based on past visits data
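Purely as an illustration of the three-layer idea above (this is not Google's code, and the thresholds and segment count are invented), a toy model might look like this:

```python
# Toy model of the real-time / daily / base-layer crawl scheduling described above.
REAL_TIME, DAILY, BASE = "real-time", "daily", "base-layer"

class ToyCrawlSchedule:
    def __init__(self, urls, base_segments=4):
        self.layers = {REAL_TIME: set(), DAILY: set(), BASE: set(urls)}
        self.base_segments = base_segments  # base layer is split into rotating segments

    def reassign(self, url, importance, change_probability):
        """Move a URL between layers based on past-visit data (toy rules, made-up thresholds)."""
        for members in self.layers.values():
            members.discard(url)
        if importance > 0.8 and change_probability > 0.5:
            self.layers[REAL_TIME].add(url)   # crawled multiple times daily
        elif importance > 0.5:
            self.layers[DAILY].add(url)       # crawled daily or bi-daily
        else:
            self.layers[BASE].add(url)        # crawled least, round robin

    def todays_list(self, day_number):
        base = sorted(self.layers[BASE])
        # only the 'active' base-layer segment is crawled on any given day
        active = base[day_number % self.base_segments::self.base_segments]
        return sorted(self.layers[REAL_TIME]) + sorted(self.layers[DAILY]) + active
```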
28. CAN WE ESCAPE THE ‘BASE LAYER’ CRAWL BUCKET RESERVED FOR ‘UNIMPORTANT’ URLS?
29. SOME OF THE MAJOR SEARCH ENGINE CHARACTERS
10 types of Googlebot
History Logs / History Server
The URL Scheduler / Crawl Manager
30. HISTORY LOGS / HISTORY SERVER
Builds a picture of historical data and past behaviour of the URL and ‘importance’ score to predict and plan for future crawl scheduling
• Last crawled date
• Next crawl due
• Last server response
• Page importance score
• Collaborates with link logs
• Collaborates with anchor logs
• Contributes info to scheduling
31. ‘BOSS’ – URL SCHEDULER / URL MANAGER – JOBS
Think of it as Google’s line manager or ‘air traffic controller’ for Googlebots in the web crawling system
• Schedules Googlebot visits to URLs
• Decides which URLs to ‘feed’ to Googlebot
• Uses data from the history logs about past visits (change rate and importance)
• Calculates importance crawl threshold
• Assigns visit regularity of Googlebot to URLs
• Drops ‘max and min hints’ to Googlebot to guide on types of content NOT to crawl or to crawl as exceptions
• Excludes some URLs from schedules
• Assigns URLs to ‘layers / tiers’ for crawling schedules
• Checks URLs for ‘importance’, ‘boost factor’ candidacy and ‘probability of modification’
• Budgets are allocated to IPs and shared amongst the domains hosted there
32. GOOGLEBOT – CRAWLER – JOBS
• ‘Ranks nothing at all’
• Takes a list of URLs to crawl from the URL Scheduler
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling
• Follows directives (robots) and takes ‘hints’ when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links and collects content checksums (binary data equivalent of web content) for comparison with past visits by the history and link logs
• Will go beyond the crawl schedule if it finds something more important than the URLs scheduled
33. WHAT MAKES THE DIFFERENCE BETWEEN BASE LAYER AND ‘REAL TIME’ SCHEDULE ALLOCATION?
34. CONTRIBUTING FACTORS
1. Page Importance (which may include PageRank)
2. Hints (max and min)
3. Soft limits and hard crawl limits
4. Host load capability & past site performance (speed and access) (IP level and domain level within)
5. Probability / predictability of ‘CRITICAL MATERIAL’ change + importance crawl period
35. 1 - PAGE IMPORTANCE
Page importance is the importance of a page independent of a query
• Location in site (e.g. home page more important than parameter 3 level output)
• PageRank
• Page type / file type
• Internal PageRank
• Internal backlinks
• In-site anchor text consistency
• Relevance (content, anchors and elements) to a topic (Similarity Importance)
• Directives from in-page robots and robots.txt management
• Parent quality brushes off on child page quality – IMPORTANT PARENTS LIKELY SEEN TO HAVE IMPORTANT CHILD PAGES
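Google does not publish how these signals are combined, so the weights below are entirely hypothetical; the sketch is only meant to show the kind of query-independent roll-up the slide is describing, as a thinking aid:

```python
# Hypothetical illustration only: made-up weights over the signals listed above.
def toy_page_importance(page):
    score = 0.0
    score += 0.30 * page.get("pagerank", 0.0)               # PageRank
    score += 0.20 * page.get("internal_pagerank", 0.0)      # internal PageRank
    score += 0.15 * page.get("internal_backlink_share", 0)  # internal backlinks relative to the rest of the site
    score += 0.15 * page.get("topic_relevance", 0.0)        # similarity importance (content, anchors, elements)
    score += 0.10 * page.get("anchor_consistency", 0.0)     # in-site anchor text consistency
    score += 0.10 * page.get("parent_importance", 0.0)      # important parents, important children
    score -= 0.05 * page.get("depth", 0)                    # e.g. home page beats a parameter-level-3 output page
    return score

# e.g. toy_page_importance({"pagerank": 0.6, "internal_pagerank": 0.4, "depth": 1})
```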
36. 2 - HINTS – ‘MIN’ HINTS & ‘MAX’ HINTS
MIN HINT / MIN HINT RANGES
• e.g. Programmatically generated content which changes content checksum on load
• Unimportant duplicate parameter URLs
• Canonicals
• Rel=next, rel=prev
• HReflang
• Duplicate content
• Spammy URLs?
• Objectionable content
MAX HINT / MAX HINT RANGES
• CHANGE CONSIDERED ‘CRITICAL MATERIAL CHANGE’ (useful to users e.g. availability, price) & / or improved site sections or change to IMPORTANT but infrequently changing content
• Important pages / page range updates
E.G. rel="prev" and rel="next" act as hints to Google, not absolute directives
https://support.google.com/webmasters/answer/1663744?hl=en&ref_topic=4617741
37. 3 - HARD AND SOFT LIMITS ON CRAWLING
If URLs are discovered during crawling that are more important than those scheduled to be crawled, then Googlebot can go beyond its schedule to include these, up to a hard crawl limit
‘Soft’ crawl limit is set (original schedule)
‘Hard’ crawl limit is set (e.g. 130% of schedule) FOR IMPORTANT FINDINGS
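A rough sketch of that mechanism (the 130% figure comes from the slide's example; everything else is an illustrative guess, not the patent's code):

```python
# Toy illustration of soft vs hard crawl limits during a single crawl run.
def build_crawl_list(scheduled, discovered, importance, hard_limit_ratio=1.3):
    soft_limit = len(scheduled)                       # the original schedule
    hard_limit = int(soft_limit * hard_limit_ratio)   # e.g. 130% of the schedule
    baseline = min((importance.get(u, 0.0) for u in scheduled), default=0.0)
    to_crawl = list(scheduled)
    # important URLs discovered mid-crawl may be added, but only up to the hard limit
    for url in sorted(discovered, key=lambda u: importance.get(u, 0.0), reverse=True):
        if len(to_crawl) >= hard_limit:
            break
        if importance.get(url, 0.0) > baseline:
            to_crawl.append(url)
    return to_crawl
```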
38. 4 – HOST LOAD CAPACITY / PAST SITE PERFORMANCE
Googlebot has a list of URLs to crawl
Naturally, if your site is fast that list can be crawled quicker
If Googlebot experiences 500s e.g. she will retreat & ‘past performance’ is noted
If Googlebot doesn’t get ‘round the list’ you may end up with ‘overdue’ URLs to crawl
39. 5 - CHANGE
• Not all change is considered equal
• There are many dynamic sites with low importance pages changing frequently – SO WHAT
• Constantly changing your page just to get Googlebot back won’t work if the page is low importance (crawl importance period < change rate) – POINTLESS
• Hints are employed to determine pages which simply change the content checksum with every visit
• Features are weighted for change importance to the user (e.g. price > colour)
• Change identified as useful to users is considered ‘CRITICAL MATERIAL CHANGE’
• Don’t just try to randomise things to catch Googlebot’s eye
• That counter or clock you added probably isn’t going to help you get more attention, nor will random or shuffle
• Change on some types of pages is more important than on other pages (e.g. CNN home page > SME about us page)
40. FACTORS AFFECTING HIGHER GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is high
• Your URL has a high ‘importance score’
• Your URL is in the real time (HIGH IMPORTANCE), daily crawl (LESS IMPORTANT) or ‘active’ base layer segment (UNIMPORTANT BUT SELECTED)
• Your URL changes a lot with CRITICAL MATERIAL CONTENT change (AND IS IMPORTANT)
• Probability and predictability of CRITICAL MATERIAL CONTENT change is high for your URL (AND URL IS IMPORTANT)
• Your website speed is fast and Googlebot gets the time to visit your URL on its bucket list of scheduled URLs for that visit
• Your URL has been ‘upgraded’ to a daily or real time crawl layer as its importance is detected as raised
• History logs and URL Scheduler ‘learn’ together
41. FACTORS AFFECTING LOWER GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is low
• Your URL has been detected as a ‘spam’ URL
• Your URL is in an ‘inactive’ base layer segment (UNIMPORTANT)
• Your URLs are ‘tripping hints’ built into the system to detect non-critical change dynamic content
• Probability and predictability of critical material content change is low for your URL
• Your website speed is slow and Googlebot doesn’t get the time to visit your URL
• Your URL has been ‘downgraded’ to an ‘inactive’ base layer (UNIMPORTANT) segment
• Your URL has returned an ‘unreachable’ server response code recently
• In-page robots management or robots.txt send the wrong signals
42. GET MORE CRAWL BY ‘TURNING GOOGLEBOT’S HEAD’ – MAKE YOUR URLs MORE IMPORTANT AND ‘EMPHASISE’ IMPORTANCE
43. GOOGLEBOT DOES AS SHE’S TOLD – WITH A FEW EXCEPTIONS
• Hard limits and soft limits
• Follows ‘min’ and ‘max’ hints
• If she finds something important she will go beyond a scheduled crawl (SOFT LIMIT) to seek out importance (TO HARD LIMIT)
• You need to IMPRESS Googlebot
• If you ‘bore’ Googlebot she will return to boring URLs less (e.g. with pages all the same (duplicate content) or dynamically generated low usefulness content)
• If you ‘delight’ Googlebot she will return to delightful URLs more (they became more important or they changed with ‘CRITICAL MATERIAL CHANGE’)
• If she doesn’t get her crawl completed you will end up with an ‘overdue’ list of URLs to crawl
44. GETTING MORE CRAWL BY IMPROVING PAGE IMPORTANCE
• Your URL became more important and achieved a higher ‘importance score’ via increased PageRank
• Your URL became more important via increased IB(P) (INTERNAL BACKLINKS IN OWN SITE) relative to other URLs within your site (you emphasised importance)
• You made the URL content more relevant to a topic and improved the importance score
• The parent of your URL became more important (E.G. IMPROVED TOPIC RELEVANCE (SIMILARITY), PageRank OR local (in-site) importance metric)
• THE ‘IMPORTANCE SCORE’ OF SOME OF YOUR URLS EXCEEDED THE ‘IMPORTANCE SOFT LIMIT THRESHOLD’ SO THEY ARE INCLUDED FOR CRAWLING, BEING VISITED UP TO A POINT OF ‘HARD LIMIT’ CRAWLING (E.G. 130% OF SCHEDULED CRAWLING)
46. TO DO - FIND GOOGLEBOT
AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB
grep Googlebot access_log > googlebot_access.txt
ANALYSE THE LOGS
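A minimal sketch of analysing the googlebot_access.txt file produced above; it assumes Apache combined log format (adjust the regex for other servers) and does not verify that the hits are genuinely Googlebot:

```python
# Summarise Googlebot hits per URL and per response code from googlebot_access.txt.
import re
from collections import Counter

LOG_LINE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+" (\d{3})')

url_hits, status_hits = Counter(), Counter()
with open("googlebot_access.txt") as fh:
    for line in fh:
        match = LOG_LINE.search(line)
        if not match:
            continue
        url_hits[match.group(1)] += 1
        status_hits[match.group(2)] += 1

print("Most crawled URLs:")
for url, hits in url_hits.most_common(20):
    print(f"{hits:6d}  {url}")
print("Response codes served to Googlebot:", dict(status_hits))
```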
47. LOOK THROUGH SPIDER-EYES – PREPARE TO BE HORRIFIED
Incorrect URL header response codes
301 redirect chains
Old files or XML sitemaps left on the server from years ago
Infinite / endless loops (circular dependency)
On parameter driven sites, URLs crawled which produce the same output
AJAX content fragments pulled in alone
URLs generated by spammers
Dead image files being visited
Old CSS files still being crawled and loading EVERYTHING
You may even see ‘mini’ abandoned projects within the site
Legacy URLs generated by long forgotten .htaccess regex pattern matching
Googlebot hanging around in your ‘ever-changing’ blog but nowhere else
48. URL CRAWL FREQUENCY ‘CLOCKING’
Spreadsheet provided by @johnmu during Webmaster Hangout - https://goo.gl/1pToL8
Identify your ‘real time’, ‘daily’ and ‘base layer’ URLs - ARE THEY THE ONES YOU WANT THERE? WHAT IS BEING SEEN AS UNIMPORTANT?
NOTE GOOGLEBOT: Do you recognise all the URLs and URL ranges that are appearing? If not… why not?
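If you prefer scripting to the spreadsheet, a rough equivalent is sketched below; the layer thresholds (12 and 48 hours) are illustrative guesses, not figures from Google:

```python
# Bucket URLs by how often Googlebot returns to them, from googlebot_access.txt
# (Apache combined log format assumed).
import re
from collections import defaultdict
from datetime import datetime

visits = defaultdict(list)
with open("googlebot_access.txt") as fh:
    for line in fh:
        ts = re.search(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})', line)
        url = re.search(r'"(?:GET|HEAD) (\S+) HTTP', line)
        if ts and url:
            visits[url.group(1)].append(datetime.strptime(ts.group(1), "%d/%b/%Y:%H:%M:%S"))

for url, times in sorted(visits.items()):
    times.sort()
    if len(times) < 2:
        label = "seen once (base layer?)"
    else:
        gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
        avg_hours = sum(gaps) / len(gaps)
        label = "real time?" if avg_hours < 12 else "daily?" if avg_hours <= 48 else "base layer?"
    print(f"{url}\tvisits={len(times)}\t{label}")
```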
49. IMPROVE & EMPHASISE PAGE IMPORTANCE
• Cross modular internal linking
• Canonicalization
• Important URLs in XML sitemaps
• Anchor text target consistency (but not spammy repetition of anchors everywhere (it’s still output))
• Internal links in right descending order – emphasise IMPORTANCE
• Reduce boilerplate content and improve relevance of content and elements to the specific topic (if category) / product (if product page) / subcategory (if subcategory)
• Reduce duplicate content parts of the page to allow primary targets to take ‘IMPORTANCE’
• Improve parent pages to raise the IMPORTANCE reputation of the children rather than over-optimising the child pages and cannibalising the parent
• Improve content as more ‘relevant’ to a topic to increase ‘IMPORTANCE’ and get reassigned to a different crawl layer
• Flatten ‘architectures’
• Avoid content cannibalisation
• Link relevant content to relevant content
• Build strong highly relevant ‘hub’ pages to tie together strength & IMPORTANCE
50. EMPHASISE IMPORTANCE WISELY
USE CUSTOM XML SITEMAPS, E.G. XML UNLIMITED SITEMAP GENERATOR
PUT IMPORTANT URLS IN HERE
IF EVERYTHING IS IMPORTANT THEN IMPORTANCE IS NOT DIFFERENTIATED
51. KEEP CUSTOM SITEMAPS ‘CURRENT’ AUTOMATICALLY
AUTOMATE UPDATES WITH CRON JOBS OR WEB CRON JOBS
IT’S NOT AS TECHNICAL AS YOU MAY THINK – USE WEB CRON JOBS
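For example, a small script like the sketch below (file names are hypothetical) can rebuild a sitemap containing only your important URLs, and a cron entry keeps it current:

```python
# build_sitemap.py - regenerate a custom XML sitemap of important URLs only.
# Run from cron to keep it current, e.g.:  0 3 * * * /usr/bin/python3 /path/to/build_sitemap.py
from xml.sax.saxutils import escape

with open("important_urls.txt") as fh:                 # hypothetical curated list of important URLs
    urls = [line.strip() for line in fh if line.strip()]

entries = "\n".join(
    f"  <url>\n    <loc>{escape(url)}</loc>\n  </url>" for url in urls
)
# note: if you add <lastmod>, base it on the page's real last change,
# not the script run date - faking freshness won't help (see the later slides)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n</urlset>\n"
)
with open("sitemap-important.xml", "w") as out:
    out.write(sitemap)
```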
52. BE ‘PICKY’ ABOUT WHAT YOU INCLUDE IN XML SITEMAPS
EXCLUDE AND INCLUDE CRAWL PATHS IN XML SITEMAPS TO EMPHASISE IMPORTANCE
53. IF YOU CAN’T IMPROVE – EXCLUDE (VIA NOINDEX) FOR NOW
• YOU’RE OUT FOR NOW
• When you improve you can come back in
• Tell Googlebot quickly that you’re out (via temporary XML sitemap inclusion)
• But ‘follow’ because there will be some relevance within these URLs
• Include again when you’ve improved
• Don’t try to canonicalize me to something in the index
54. OR REMOVE – 410 GONE (IF IT’S NEVER COMING BACK)
EMBRACE THE ‘410 GONE’
There’s even a song about it: http://faxfromthefuture.bandcamp.com/track/410-gone-acoustic-demo
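As a sketch of both ideas above (exclude-for-now via noindex, follow; 410 for URLs that are never coming back), assuming a Flask app purely for illustration and with hypothetical URL rules:

```python
from flask import Flask, Response, request

app = Flask(__name__)

GONE_PATHS = {"/old-promo-2012", "/discontinued-range"}   # hypothetical dead URLs

@app.before_request
def gone_for_good():
    # 410 tells Googlebot the URL is gone permanently, not just missing today
    if request.path in GONE_PATHS:
        return Response("Gone", status=410)

@app.after_request
def exclude_for_now(response):
    # keep thin sort/filter parameter pages out of the index but still follow their links
    if b"sort=" in request.query_string:
        response.headers["X-Robots-Tag"] = "noindex, follow"
    return response
```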
55. #BIGSITEPROBLEMS – LOSE THE INDEX BLOAT
LOSE THE BLOAT TO INCREASE THE CRAWL
The number of unimportant URLs indexed extends far beyond the available importance crawl threshold allocation
56. #BIGSITEPROBLEMS – LOSE THE CRAZY TAG MAN
Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
Creating ‘thin’ content and even more URLs to crawl
Image Credit: Buzzfeed
57. #BIGSITEPROBLEMS – INTERNAL BACKLINKS SKEWED
Most Important Page 1, Most Important Page 2, Most Important Page 3
IS THIS YOUR BLOG?? HOPE NOT
IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING - LOCAL IB (P) – INTERNAL BACKLINKS
58. #BIGSITEPROBLEMS – WARNING SIGNS – LOSE THE ‘MISTER OVER-OPTIMIZER’
‘OPTIMIZE ALL THE THINGS’
Optimize Everything: I must optimize ALL the pages across a category’s descendants for the same terms as my primary target category page, so that each of them is of almost equal relevance to the target page and confuses crawlers as to which is the important one. I’ll put them all in a sitemap as standard too, just for good measure.
HOW CAN SEARCH ENGINES KNOW WHICH IS MOST IMPORTANT TO A TOPIC IF ‘EVERYTHING’ IS IMPORTANT??
Image Credit: Buzzfeed
59. #BIGSITEPROBLEMS – WARNING SIGNS – LOSE THE ‘MISTER DUPLICATER’
‘DUPLICATE ALL THE THINGS’
Duplicate Everything: I must have a massive boilerplate area in the footer, identical sidebars and a massive mega menu with all the same output in, sitewide. I’ll put very little unique content into the page body and it will also look very much like its parents and grandparents too. From time to time I’ll outrank my parent and grandparent pages but ‘Meh’…
HOW CAN SEARCH ENGINES KNOW WHICH IS THE MOST IMPORTANT PAGE IF ALL ITS CHILDREN AND GRANDCHILDREN ARE NEARLY THE SAME??
Image Credit: Buzzfeed
60. IMPROVE SITE PERFORMANCE – HELP GOOGLEBOT GET THROUGH THE ‘BUCKET LIST’ – GET FAST AND RELIABLE
Avoid wasting time on ‘overdue-URL’ crawling (E.G. send correct response codes, speed up your site, etc.)
(Patent US 8,666,964 B1)
Example: added to Cloudflare CDN – half the load time, > 2 x page crawls per day
61. ‘GET FRESH’ AND STAY ‘FRESH’ – ‘BUT DON’T TRY TO FAKE FRESH & USE FRESH WISELY’
GOOGLEBOT GOES WHERE THE ACTION IS – USE ‘ACTION’ WISELY
DON’T TRY TO TRICK GOOGLEBOT BY FAKING ‘FRESHNESS’ ON LOW IMPORTANCE PAGES – GOOGLEBOT WILL REALISE
UPDATE IMPORTANT PAGES OFTEN
NURTURE SEASONAL URLs TO GROW IMPORTANCE WITH FRESHNESS (regular updates) & MATURITY (HISTORY)
DON’T TURN GOOGLEBOT’S HEAD INTO THE WRONG PLACES
Image Credit: Buzzfeed
62. IMPROVE TO GET THE HARD LIMITS ON CRAWLING
CAN IMPROVING YOUR SITE HELP TO ‘OVERRIDE’ THE SOFT LIMIT CRAWL PERIODS SET?
By improving your URL importance on an ongoing basis via increased PageRank, content improvements (e.g. quality hub pages), internal link strategies, IB(P) and restructuring, you can get to the ‘hard limit’ or get visited more generally
63. YOU THINK IT DOESN’T MATTER… RIGHT?
YOU SAY… ”GOOGLE WILL WORK IT OUT” ”LET’S JUST MAKE MORE CONTENT”
65. WRONG – CRAWL TANK CAN LOOK LIKE THIS
SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW ‘THIN’ PARAMETER INTO A SITE, OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP))
WHAT’S WORSE THAN AN INFINITE LOOP? ‘A LOGICAL INFINITE LOOP’
IMPORTANCE DISTORTED BY BADLY CODED PARAMETERS GENERATING ‘JUNK’ OR, EVEN WORSE, PULLING LOGIC TO CRAWLERS BUT NOT HUMANS
67. VIA ‘EXPONENTIAL URL UNIMPORTANCE’
Your URLs are exponentially confirmed unimportant with each iterative crawl visit to other similar or duplicate content checksum URLs. Fewer and fewer internal links and ‘thinner and thinner’ relevant content.
MULTIPLE RANDOM URLs competing for the same query confirm the irrelevance of all the competing in-site URLs, with no dominant single relevant IMPORTANT URL
68. WRONG – ‘SENDING WRONG SIGNALS TO GOOGLEBOT’ COSTS DEARLY
“2015 was the year where website owners managed to be mostly at fault, all by themselves” (Sistrix 2015 Organic Search Review - 2016)
(Source: Sistrix)
69. WRONG - NO-ONE IS EXEMPT
“It doesn’t matter how big your brand is if you ‘talk to the spider’ (Googlebot) wrong” – you can still ‘tank’
(Source: Sistrix)
70. WRONG – GOOGLE THINKS SEOS SHOULD UNDERSTAND CRAWL BUDGET
71. SORT OUT CRAWLING – ”EMPHASISE IMPORTANCE”
“Make sure the right URLs get on Googlebot’s menu and increase URL importance to build Googlebot’s appetite for your site more”
Dawn Anderson @dawnieando
73. UNDERSTAND GOOGLEBOT & URL SCHEDULER – LIKES & DISLIKES
LIKES:
• Going ‘where the action is’ in sites
• The ‘need for speed’
• Logical structure
• Correct ‘response’ codes
• XML sitemaps with important URLs
• Successful crawl visits
• ‘Seeing everything’ on a page
• Taking MAX ‘hints’
• Clear unique single ‘URL fingerprints’ (no duplicates)
• Predicting likelihood of ‘future change’
• Finding ‘more’ important content worth crawling
DISLIKES:
• Slow sites
• Too many redirects
• Being bored (Meh) (Min ‘hints’ are built in by the search engine systems – takes ‘hints’)
• Being lied to (e.g. on XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (infinite loops)
• Spam URLs
• Crawl wasting minor content change URLs
• ‘Hidden’ and blocked content
• Uncrawlable URLs
CHANGE IS KEY:
• Not just any change – critical material change
• Predicting future change
• Dropping ‘hints’ to Googlebot
• Sending Googlebot where ‘the action is’
• Not just page change designed to catch Googlebot’s eye with no added value
74. CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER – LIKES & DISLIKES
(Repeats the LIKES, DISLIKES and CHANGE IS KEY lists from the previous slide.)
75. FIX GOOGLEBOT’S JOURNEY – TECHNICAL ‘FIXES’
SPEED UP YOUR SITE TO ‘FEED’ GOOGLEBOT MORE
Speed up your site
Implement compression, minification, caching
Fix incorrect header response codes
Fix nonsensical ‘infinite loops’ generated by database driven parameters or ‘looping’ relative URLs
Use absolute versus relative internal links
Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
Ensure no CSS or JavaScript files are blocked from crawlers
Unpick 301 redirect chains
Consider using a CDN such as Cloudflare (IMPLEMENTATION OF A CONTENT DELIVERY NETWORK)
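For example, a quick way to surface 301 chains is sketched below, using the `requests` library; the URL list file name is hypothetical:

```python
# Flag URLs whose final destination is more than one redirect hop away.
import requests

with open("urls_to_check.txt") as fh:
    urls = [line.strip() for line in fh if line.strip()]

for url in urls:
    resp = requests.get(url, allow_redirects=True, timeout=10)
    if len(resp.history) > 1:
        hops = " -> ".join(f"{r.url} [{r.status_code}]" for r in resp.history)
        print(f"{len(resp.history)}-hop chain: {hops} -> {resp.url} [{resp.status_code}]")
```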
76. FIX GOOGLEBOT’S JOURNEY – SAVE BUDGET / EMPHASISE IMPORTANCE
Minimise 301 redirects
Minimise canonicalisation
Use ‘if modified’ headers on low importance ‘hygiene’ pages
Use ‘expires after’ headers on content with a short shelf life (e.g. auctions, job sites, event sites)
Noindex low search volume or near duplicate URLs (use the noindex directive on robots.txt)
Use 410 ‘gone’ headers on dead URLs liberally
Revisit the .htaccess file and review legacy pattern matched 301 redirects
Combine CSS and JavaScript files
Use minification, compression and caching
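A minimal sketch of the ‘if modified’ idea (Flask is used purely for illustration; the page path and last-modified source are hypothetical): return 304 when nothing has changed since Googlebot’s last visit, and set explicit cache lifetimes on short-shelf-life content.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from flask import Flask, request, make_response

app = Flask(__name__)
LAST_CHANGED = datetime(2016, 5, 1, tzinfo=timezone.utc)   # hypothetical real change date for this page

@app.route("/hygiene-page")
def hygiene_page():
    ims = request.headers.get("If-Modified-Since")
    if ims:
        try:
            if parsedate_to_datetime(ims) >= LAST_CHANGED:
                return "", 304            # nothing new - don't make Googlebot refetch the body
        except (TypeError, ValueError):
            pass
    response = make_response("...page body...")
    response.headers["Last-Modified"] = format_datetime(LAST_CHANGED, usegmt=True)
    response.headers["Cache-Control"] = "max-age=86400"     # short shelf life, e.g. listings
    return response
```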
77. TRAIN GOOGLEBOT – ‘TALK TO THE SPIDER’ (PROMOTE URLS TO HIGHER CRAWL LAYERS)
EMPHASISE PAGE IMPORTANCE – BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES:
Revisit ‘votes for self’ via internal links in GSC
Clear ‘unique’ URL fingerprints
Improve whole site sections / categories
Use XML sitemaps for your important URLs (don’t put everything on them)
Use ‘mega menus’ (very selectively) to key pages
Use ‘breadcrumbs’
Build ‘bridges’ and ‘shortcuts’ via HTML sitemaps and ‘cross modular’ ‘related’ internal linking to key pages
Consolidate (merge) important but similar content (e.g. merge FAQs or ‘low search volume’ content into other relevant pages)
Consider flattening your site structure so ‘importance’ flows further
Reduce internal linking to lower priority URLs
TRAIN ON CHANGE – GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE (AS LONG AS THOSE URLS ARE NOT UNIMPORTANT):
Not just any change – critical material change
Keep the ‘action’ in the key areas - NOT JUST THE BLOG
Use relevant ‘supplementary content’ to keep key pages ‘fresh’
Remember min crawl ‘hints’
Regularly update key IMPORTANT content
Consider ‘updating’ rather than replacing seasonal content URLs (e.g. annual events). Append and update.
Build ‘dynamism’ and ‘interactivity’ into your web development (sites that ‘move’ win)
Keep working to improve and make your URLs more important
79. URL IMPORTANCE & CRAWL FREQUENCY TOOLS
• GSC Internal Links report (URL importance)
• Link Research Tools (Strongest Subpages reports)
• GSC Internal Links (add site categories and sections as additional profiles)
• PowerMapper
• XML sitemap generators for custom sitemaps
• Crawl frequency clocking (@johnmu)
81. REFERENCES
Efficient Crawling Through URL Ordering (Page et al) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
Crawl Optimisation (Blind Five Year Old – A J Kohn - @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization
Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459
Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112
Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8
Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
Web Promo Q and A with Google’s Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Google Number 1 SEO Advice – Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
82. REFERENCES
Internet Live Stats - http://www.internetlivestats.com/total-number-of-websites/
Scheduler for search engine crawler – Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8707313
Managing items in crawl schedule – Google Patent (Alpert) - http://www.google.ch/patents/US8666964
Document reuse in a search engine crawler – Google Patent (Zhu et al) - https://www.google.com/patents/US8707312
Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054
Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610
Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897