USING 'PAGE IMPORTANCE' IN ONGOING CONVERSATION WITH GOOGLEBOT TO GET JUST A BIT MORE THAN YOUR ALLOCATED CRAWL BUDGET
NEGOTIATING CRAWL BUDGET WITH GOOGLEBOTS
Dawn Anderson @dawnieando
Another Rainy Day In Manchester
@dawnieando
WTF???
1994 – 1998
"THE GOOGLE INDEX IN 1998 HAD 60 MILLION PAGES" (GOOGLE) (Source: Wikipedia.org)
2000
"INDEXED PAGES REACHES THE ONE BILLION MARK" (GOOGLE)
"IN OVER 17 MILLION WEBSITES" (INTERNETLIVESTATS.COM)
2001  ONWARDS
ENTER WORDPRESS, DRUPAL, PHP-DRIVEN CMSs, ECOMMERCE PLATFORMS, DYNAMIC SITES, AJAX
WHICH CAN GENERATE 10,000s, 100,000s OR 1,000,000s OF DYNAMIC URLS ON THE FLY WITH DATABASE 'FIELD-BASED' CONTENT
DYNAMIC CONTENT CREATION GROWS
ENTER FACETED NAVIGATION (WITH MANY PATHS TO THE SAME CONTENT)
2003 – WE'RE AT 40 MILLION WEBSITES
2003 ONWARDS – USERS BEGIN TO JUMP ON THE CONTENT GENERATION BANDWAGON
LOTS OF CONTENT – IN MANY FORMS
WE  KNEW  THE  WEB  WAS  BIG…  (GOOGLE,  2008)
https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
"1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!" (Jesse Alpert on Google's Official Blog, 2008)
2008 – EVEN GOOGLE ENGINEERS STOPPED IN AWE
2010  – USER  GENERATED  CONTENT  GROWS
"Let me repeat that: we create as much information in two days now as we did from the dawn of man through 2003"
"The real issue is user-generated content." (Eric Schmidt, 2010 – Techonomy Conference Panel)
SOURCE: http://techcrunch.com/2010/08/04/schmidt-data/
The Indexed Web contains at least 4.73 billion pages (13/11/2015)
CONTENT KEEPS GROWING
[Chart: total number of websites, 2000–2014]
THE NUMBER OF WEBSITES DOUBLED BETWEEN 2011 AND 2012, AND GREW BY ANOTHER THIRD IN 2014
EVEN SIR TIM BERNERS-LEE (inventor of the www) TWEETED
2014 – WE PASS A BILLION INDIVIDUAL WEBSITES ONLINE
2014  – WE  ARE  ALL PUBLISHERS
SOURCE: http://wordpress/activity/posting
YUP – WE ALL 'LOVE CONTENT'
IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO? – A LOT
http://www.internetlivestats.com/total-number-of-websites/
"As of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents" (MANY GOOGLE PATENTS)
CAPACITY LIMITATIONS – EVEN FOR SEARCH ENGINES
Source: Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
"So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-)" (Jesse Alpert, Google, 2008)
Source: https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
NOT ENOUGH TIME
SOME THINGS MUST BE FILTERED
A LOT OF THE CONTENT IS 'KIND OF THE SAME'
"There's a needle in here somewhere"
"It's an important needle too"
Capacity limits on Google's crawling system – how have search engines responded?
• By prioritising URLs for crawling
• By assigning crawl period intervals to URLs
• By creating work 'schedules' for Googlebots
WHAT IS THE SOLUTION?
"To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling." – Scheduler for search engine crawler (Zhu et al)
GOOGLE CRAWL SCHEDULER PATENTS include:
• 'Managing items in a crawl schedule'
• 'Scheduling a recrawl'
• 'Web crawler scheduler that utilizes sitemaps from websites'
• 'Document reuse in a search engine crawler'
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
• 'Scheduler for search engine'
EFFICIENCY IS NECESSARY
CRAWL BUDGET
1. Crawl Budget – "An allocation of crawl frequency visits to a host (IP LEVEL)"
2. Roughly proportionate to PageRank and host load / speed / host capacity
3. Pages with a lot of links get crawled more
4. The vast majority of URLs on the web don't get a lot of budget allocated to them (low to 0 PageRank URLs)
https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
BUT…  MAYBE  THINGS  HAVE  CHANGED?
CRAWL BUDGET / CRAWL FREQUENCY IS NOT JUST ABOUT HOST-LOAD AND PAGERANK ANY MORE
STOP  THINKING  IT’S  JUST  ABOUT  ‘PAGERANK’
http://www.youtube.com/watch?v=GVKcMU7YNOQ&t=4m45s
"You keep focusing on PageRank"…
"There's a shit-ton of other stuff going on" (Illyes, G., Google, 2016)
THERE ARE A LOT OF OTHER THINGS AFFECTING 'CRAWLING'
WEB PROMOS Q & A WITH GOOGLE'S ANDREY LIPATTSEV
Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
WHY? BECAUSE… THE WEB GOT 'MAHOOOOOSIVE' AND CONTINUES TO GET 'MAHOOOOOOSIVER'
SITES GOT MORE DYNAMIC, COMPLEX, AUTO-GENERATED, MULTI-FACETED, DUPLICATED, INTERNATIONALISED, BIGGER, BECAME PAGINATED AND SORTED
WE NEED MORE WAYS TO GET MORE EFFICIENT AND FILTER OUT TIME-WASTING CRAWLING SO WE CAN FIND IMPORTANT CHANGES QUICKLY
GOOGLEBOT’S  TO-­DO  LIST  GOT  REALLY  BIG
FURTHER IMPROVED CRAWLING EFFICIENCY SOLUTIONS NEEDED
• Hard and soft crawl limits
• Importance thresholds
• Min and max hints & 'hint ranges'
• Importance crawl periods
• Scheduling / prioritization
• Tiered crawling buckets ('Real Time', 'Daily', 'Base Layer')
SEVERAL PATENTS UPDATED (SEEM TO WORK TOGETHER)
• 'Managing URLs' (Alpert et al, 2013) (PAGE IMPORTANCE DETERMINING SOFT AND HARD LIMITS ON CRAWLING)
• 'Managing Items in a Crawl Schedule' (Alpert, 2014)
• 'Scheduling a Recrawl' (Auerbach, Alpert, 2013) (PREDICTING CHANGE FREQUENCY IN ORDER TO SCHEDULE THE NEXT VISIT, EMPLOYING HINTS (Min & Max))
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' (INCLUDES EMPLOYING HINTS TO DETECT PAGES 'NOT' TO CRAWL)
MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
3 layers / tiers / buckets for scheduling:
• Real Time Crawl – crawled multiple times daily
• Daily Crawl – crawled daily or bi-daily
• Base Layer Crawl – crawled least, on a 'round robin' basis; split into segments on random rotation and only the 'active' segment is crawled (the most unimportant URLs)
URLs are moved in and out of layers based on past visits data
CAN WE ESCAPE THE 'BASE LAYER' CRAWL BUCKET RESERVED FOR 'UNIMPORTANT' URLS?
10 types of Googlebot
SOME OF THE MAJOR SEARCH ENGINE CHARACTERS
• History Logs / History Server
• The URL Scheduler / Crawl Manager
HISTORY LOGS / HISTORY SERVERS
HISTORY LOGS / HISTORY SERVER – builds a picture of historical data and past behaviour of the URL and its 'importance' score to predict and plan for future crawl scheduling
• Last crawled date
• Next crawl due
• Last server response
• Page importance score
• Collaborates with link logs
• Collaborates with anchor logs
• Contributes info to scheduling
‘BOSS’- URL SCHEDULER / URL MANAGER
Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system
JOBS
• Schedules Googlebot visits to URLs
• Decides which URLs to 'feed' to Googlebot
• Uses data from the history logs about past visits (change rate and importance)
• Calculates the importance crawl threshold
• Assigns the visit regularity of Googlebot to URLs
• Drops 'max and min hints' to Googlebot to guide on types of content NOT to crawl, or to crawl as exceptions
• Excludes some URLs from schedules
• Assigns URLs to 'layers / tiers' for crawling schedules
• Checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
• Budgets are allocated to IPs and shared amongst the domains there
GOOGLEBOT – CRAWLER
JOBS
• 'Ranks nothing at all'
• Takes a list of URLs to crawl from the URL Scheduler
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling
• Follows directives (robots) and takes 'hints' when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary data equivalent of web content) for comparison with past visits by the history and link logs
• Will go beyond the crawl schedule if it finds something more important than the URLs scheduled
WHAT MAKES THE DIFFERENCE BETWEEN BASE LAYER AND 'REAL TIME' SCHEDULE ALLOCATION?
CONTRIBUTING  FACTORS
1. Page Importance (which may include PageRank)
2. Hints (max and min)
3. Soft limits and hard crawl limits
4. Host load capability & past site performance (speed and access) (IP level and domain level within)
5. Probability / predictability of 'CRITICAL MATERIAL' change + importance crawl period
1 - PAGE IMPORTANCE – Page importance is the importance of a page independent of a query
• Location in site (e.g. the home page is more important than parameter level 3 output)
• PageRank
• Page type / file type
• Internal PageRank
• Internal backlinks
• In-site anchor text consistency
• Relevance (content, anchors and elements) to a topic (similarity importance)
• Directives from in-page robots management and robots.txt
• Parent quality brushes off on child page quality
IMPORTANT PARENTS ARE LIKELY SEEN TO HAVE IMPORTANT CHILD PAGES
2 - HINTS – 'MIN' HINTS & 'MAX' HINTS
MIN HINT / MIN HINT RANGES
• e.g. programmatically generated content which changes its content checksum on load
• Unimportant duplicate parameter URLs
• Canonicals
• Rel=next, rel=prev
• HReflang
• Duplicate content
• Spammy URLs?
• Objectionable content
MAX HINT / MAX HINT RANGES
• Change considered 'CRITICAL MATERIAL CHANGE' (useful to users, e.g. availability, price) and/or improved site sections, or change to IMPORTANT but infrequently changing content
• Important pages / page range updates
E.G. rel="prev" and rel="next" act as hints to Google, not absolute directives – https://support.google.com/webmasters/answer/1663744?hl=en&ref_topic=4617741
3 - HARD AND SOFT LIMITS ON CRAWLING
If URLs are discovered during crawling that are more important than those scheduled to be crawled, then Googlebot can go beyond its schedule to include these, up to a hard crawl limit.
• A 'soft' crawl limit is set (the original schedule)
• A 'hard' crawl limit is set (e.g. 130% of schedule) FOR IMPORTANT FINDINGS
4 – HOST LOAD CAPACITY / PAST SITE PERFORMANCE
Googlebot has a list of URLs to crawl. Naturally, if your site is fast that list can be crawled quicker. If Googlebot experiences 500s, for example, she will retreat and 'past performance' is noted. If Googlebot doesn't get 'round the list' you may end up with 'overdue' URLs to crawl.
5 - CHANGE
• Not all change is considered equal
• There are many dynamic sites with low importance pages changing frequently – SO WHAT
• Constantly changing your page just to get Googlebot back won't work if the page is low importance (crawl importance period < change rate) – POINTLESS
• Hints are employed to determine pages which simply change the content checksum with every visit
• Features are weighted for change importance to the user (e.g. price > colour)
• Change identified as useful to users is considered 'CRITICAL MATERIAL CHANGE'
• Don't just try to randomise things to catch Googlebot's eye
• That counter or clock you added probably isn't going to help you get more attention, nor will random or shuffle
• Change on some types of pages is more important than on others (e.g. the CNN home page > an SME 'about us' page)
FACTORS AFFECTING HIGHER GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is high
• Your URL has a high 'importance score'
• Your URL is in the real time (HIGH IMPORTANCE), daily crawl (LESS IMPORTANT) or 'active' base layer segment (UNIMPORTANT BUT SELECTED)
• Your URL changes a lot with CRITICAL MATERIAL CONTENT change (AND IS IMPORTANT)
• Probability and predictability of CRITICAL MATERIAL CONTENT change is high for your URL (AND THE URL IS IMPORTANT)
• Your website speed is fast and Googlebot gets the time to visit your URL on its bucket list of scheduled URLs during that visit
• Your URL has been 'upgraded' to a daily or real time crawl layer as its importance is detected as raised
• History logs and the URL Scheduler 'learn' together
FACTORS AFFECTING LOWER GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is low
• Your URL has been detected as a 'spam' URL
• Your URL is in an 'inactive' base layer segment (UNIMPORTANT)
• Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
• Probability and predictability of critical material content change is low for your URL
• Your website speed is slow and Googlebot doesn't get the time to visit your URL
• Your URL has been 'downgraded' to an 'inactive' base layer (UNIMPORTANT) segment
• Your URL has returned an 'unreachable' server response code recently
• In-page robots management or robots.txt send the wrong signals
GET MORE CRAWL BY 'TURNING GOOGLEBOT'S HEAD' – MAKE YOUR URLs MORE IMPORTANT AND 'EMPHASISE' IMPORTANCE
GOOGLEBOT DOES AS SHE'S TOLD – WITH A FEW EXCEPTIONS
• Hard limits and soft limits
• Follows 'min' and 'max' hints
• If she finds something important she will go beyond a scheduled crawl (SOFT LIMIT) to seek out importance (TO HARD LIMIT)
• You need to IMPRESS Googlebot
• If you 'bore' Googlebot she will return to boring URLs less (e.g. pages that are all the same (duplicate content) or dynamically generated, low-usefulness content)
• If you 'delight' Googlebot she will return to delightful URLs more (they became more important or they changed with 'CRITICAL MATERIAL CHANGE')
• If she doesn't get her crawl completed you will end up with an 'overdue' list of URLs to crawl
GETTING MORE CRAWL BY IMPROVING PAGE IMPORTANCE
• Your URL became more important and achieved a higher 'importance score' via increased PageRank
• Your URL became more important via increased IB(P) (INTERNAL BACKLINKS IN YOUR OWN SITE) relative to other URLs within your site (you emphasised importance)
• You made the URL content more relevant to a topic and improved the importance score
• The parent of your URL became more important (e.g. improved topic relevance (similarity), PageRank or local (in-site) importance metric)
• The 'importance score' of some of your URLs exceeded the 'importance soft limit threshold', so they are included for crawling and visited up to a point of 'hard limit' crawling (e.g. 130% of scheduled crawling)
HOW DO WE DO THIS?
TO DO - FIND GOOGLEBOT
AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB (a sketch follows below)
grep Googlebot access_log > googlebot_access.txt
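A minimal sketch of that cron job, assuming an Apache/Nginx-style access log at /var/log/apache2/access_log and a writable ~/logs directory (both assumptions to adjust for your own host):

```bash
#!/usr/bin/env bash
# googlebot_extract.sh - keep a dated copy of Googlebot's hits from the access log.
# The log path and location of the output are assumptions; point LOG at your own access log.
LOG=/var/log/apache2/access_log
OUT="$HOME/logs/googlebot_access_$(date +%F).txt"

mkdir -p "$HOME/logs"
# '|| true' so a day with no Googlebot hits doesn't make the cron job fail noisily.
grep -i 'Googlebot' "$LOG" > "$OUT" || true

# Example crontab entry to run it every night at 01:00:
# 0 1 * * * /home/youruser/bin/googlebot_extract.sh
```

Verifying that the hits really come from Googlebot (reverse DNS lookup) is a separate step if spoofed user agents are a concern.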
ANALYSE THE LOGS
LOOK THROUGH SPIDER-EYES
PREPARE TO BE HORRIFIED
Incorrect URL header response codes (a tally sketch follows this list)
301 redirect chains
Old files or XML sitemaps left on the server from years ago
Infinite / endless loops (circular dependency)
On parameter-driven sites, URLs crawled which produce the same output
AJAX content fragments pulled in alone
URLs generated by spammers
Dead image files being visited
Old CSS files still being crawled and loading EVERYTHING
You may even see 'mini' abandoned projects within the site
Legacy URLs generated by long-forgotten .htaccess regex pattern matching
Googlebot hanging around in your 'ever-changing' blog but nowhere else
URL  CRAWL  FREQUENCY  ’CLOCKING’
Spreadsheet provided by @johnmu during a Webmaster Hangout - https://goo.gl/1pToL8
Identify your 'real time', 'daily' and 'base layer' URLs (a log-based sketch follows below)
– ARE THEY THE ONES YOU WANT THERE? WHAT IS BEING SEEN AS UNIMPORTANT?
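A rough 'clocking' sketch along the same lines, again assuming the combined log format ($4 is the timestamp field, $7 the requested path): counting Googlebot hits per URL per day hints at which URLs are being treated as 'real time', 'daily' or 'base layer'.

```bash
# Googlebot hits per URL per day, busiest first.
awk '{ split($4, d, ":"); day = substr(d[1], 2); print day, $7 }' googlebot_access.txt \
  | sort | uniq -c | sort -rn | head -40
# Many hits per day looks like the 'real time' layer, roughly one a day like the
# 'daily' layer; URLs that barely appear are probably sitting in the base layer.
```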
NOTE GOOGLEBOT
Do you recognise all the URLs and URL ranges that are appearing?
If not… why not?
IMPROVE & EMPHASISE PAGE IMPORTANCE
• Cross-modular internal linking
• Canonicalization
• Important URLs in XML sitemaps
• Anchor text target consistency (but not spammy repetition of anchors everywhere (it's still output))
• Internal links in the right descending order – emphasise IMPORTANCE
EMPHASISE IMPORTANCE WISELY
• Reduce boilerplate content and improve the relevance of content and elements to the specific topic (if category) / product (if product page) / subcategory (if subcategory)
• Reduce duplicate content parts of the page to allow primary targets to take 'IMPORTANCE'
• Improve parent pages to raise the IMPORTANCE reputation of the children rather than over-optimising the child pages and cannibalising the parent
• Improve content to be more 'relevant' to a topic to increase 'IMPORTANCE' and get reassigned to a different crawl layer
• Flatten 'architectures'
• Avoid content cannibalisation
• Link relevant content to relevant content
• Build strong, highly relevant 'hub' pages to tie together strength & IMPORTANCE
USE CUSTOM XML SITEMAPS (E.G. XML UNLIMITED SITEMAP GENERATOR)
PUT IMPORTANT URLS IN HERE
IF EVERYTHING IS IMPORTANT THEN IMPORTANCE IS NOT DIFFERENTIATED
KEEP CUSTOM SITEMAPS 'CURRENT' AUTOMATICALLY
AUTOMATE UPDATES WITH CRON JOBS OR WEB CRON JOBS (see the sketch below)
IT'S NOT AS TECHNICAL AS YOU MAY THINK – USE WEB CRON JOBS
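A sketch of that automation; `generate_important_urls_sitemap.sh` is a hypothetical placeholder for whatever builds your curated sitemap (a database export, a sitemap generator run, etc.), and all paths are assumptions:

```bash
# Crontab entry: rebuild the curated sitemap of IMPORTANT URLs nightly, then ping Google that it changed.
# Script name, output path and domain are placeholders for your own setup.
0 2 * * * /home/youruser/bin/generate_important_urls_sitemap.sh > /var/www/html/sitemap-important.xml \
  && curl -s "https://www.google.com/ping?sitemap=https://www.example.com/sitemap-important.xml" > /dev/null
```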
BE 'PICKY' ABOUT WHAT YOU INCLUDE IN XML SITEMAPS
EXCLUDE AND INCLUDE CRAWL PATHS IN XML SITEMAPS TO EMPHASISE IMPORTANCE
IF YOU CAN'T IMPROVE – EXCLUDE (VIA NOINDEX) FOR NOW
• YOU'RE OUT FOR NOW
• When you improve you can come back in
• Tell Googlebot quickly that you're out (via temporary XML sitemap inclusion)
• But 'follow', because there will be some relevance within these URLs
• Include again when you've improved
• Don't try to canonicalize me to something in the index
OR REMOVE – 410 GONE (IF IT'S NEVER COMING BACK) – a quick verification sketch follows
EMBRACE THE '410 GONE'
There's Even A Song About It: http://faxfromthefuture.bandcamp.com/track/410-gone-acoustic-demo
#BIGSITEPROBLEMS – LOSE THE INDEX BLOAT
LOSE THE BLOAT TO INCREASE THE CRAWL
The number of unimportant URLs indexed extends far beyond the available importance crawl threshold allocation
Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
Image Credit: Buzzfeed
Creating 'thin' content and even more URLs to crawl
#BIGSITEPROBLEMS - LOSE THE CRAZY TAG MAN
Most Important Page 1
Most Important Page 2
Most Important Page 3
IS THIS YOUR BLOG?? HOPE NOT
#BIGSITEPROBLEMS – INTERNAL BACKLINKS SKEWED
IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING – LOCAL IB(P) – INTERNAL BACKLINKS
Optimize Everything: I must optimize ALL the pages across a category's descendants for the same terms as my primary target category page, so that each of them is of almost equal relevance to the target page and crawlers are confused as to which is the important one. I'll put them all in a sitemap as standard too, just for good measure.
Image Credit: Buzzfeed
HOW CAN SEARCH ENGINES KNOW WHICH IS MOST IMPORTANT TO A TOPIC IF 'EVERYTHING' IS IMPORTANT??
#BIGSITEPROBLEMS - WARNING SIGNS – LOSE THE 'MISTER OVER-OPTIMIZER' – 'OPTIMIZE ALL THE THINGS'
Duplicate Everything: I must have a massive boilerplate area in the footer, identical sidebars and a massive mega menu with all the same output sitewide. I'll put very little unique content into the page body and it will also look very much like its parents and grandparents too. From time to time I'll outrank my parent and grandparent pages, but 'Meh'…
Image Credit: Buzzfeed
HOW CAN SEARCH ENGINES KNOW WHICH IS THE MOST IMPORTANT PAGE IF ALL ITS CHILDREN AND GRANDCHILDREN ARE NEARLY THE SAME??
#BIGSITEPROBLEMS - WARNING SIGNS – LOSE THE 'MISTER DUPLICATER' – 'DUPLICATE ALL THE THINGS'
IMPROVE SITE PERFORMANCE – HELP GOOGLEBOT GET THROUGH THE 'BUCKET LIST' – GET FAST AND RELIABLE
Avoid wasting time on 'overdue-URL' crawling (e.g. send correct response codes, speed up your site, etc) (US 8,666,964 B1)
Example: added to Cloudflare CDN – roughly half the response time and more than 2x pages crawled per day
GOOGLEBOT GOES WHERE THE ACTION IS
USE 'ACTION' WISELY
DON'T TRY TO TRICK GOOGLEBOT BY FAKING 'FRESHNESS' ON LOW IMPORTANCE PAGES – GOOGLEBOT WILL REALISE
UPDATE IMPORTANT PAGES OFTEN
NURTURE SEASONAL URLs TO GROW IMPORTANCE WITH FRESHNESS (regular updates) & MATURITY (HISTORY)
DON'T TURN GOOGLEBOT'S HEAD INTO THE WRONG PLACES
Image Credit: Buzzfeed
'GET FRESH' AND STAY 'FRESH' – 'BUT DON'T TRY TO FAKE FRESH & USE FRESH WISELY'
IMPROVE TO GET THE HARD LIMITS ON CRAWLING
By improving your URL importance on an ongoing basis via increased PageRank, content improvements (e.g. quality hub pages), internal link strategies, IB(P) and restructuring, you can get the 'hard limit' or simply get visited more generally.
CAN IMPROVING YOUR SITE HELP TO 'OVERRIDE' THE SOFT LIMIT CRAWL PERIODS SET?
YOU THINK IT DOESN’T MATTER… RIGHT?
YOU SAY…
"GOOGLE WILL WORK IT OUT"
"LET'S JUST MAKE MORE CONTENT"
WRONG  – ‘CRAWL  TANK’  IS  UGLY
WRONG  – CRAWL  TANK  CAN  LOOK  LIKE  THIS
SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW 'THIN' PARAMETER INTO A SITE OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP))
WHAT'S WORSE THAN AN INFINITE LOOP? 'A LOGICAL INFINITE LOOP'
IMPORTANCE DISTORTED BY BADLY CODED PARAMETERS GENERATING 'JUNK' OR, EVEN WORSE, PULLING LOGIC TO CRAWLERS BUT NOT HUMANS
WRONG – SITE DROWNED IN ITS OWN SEA OF UNIMPORTANT URLS VIA 'EXPONENTIAL URL UNIMPORTANCE'
Your URLs are exponentially confirmed unimportant with each iterative crawl visit to other similar or duplicate content checksum URLs. Fewer and fewer internal links and 'thinner and thinner' relevant content.
MULTIPLE RANDOM URLs competing for the same query confirm the irrelevance of all the competing in-site URLs, with no dominant single relevant IMPORTANT URL.
WRONG – 'SENDING WRONG SIGNALS TO GOOGLEBOT' COSTS DEARLY
(Source: Sistrix)
"2015 was the year where website owners managed to be mostly at fault, all by themselves" (Sistrix 2015 Organic Search Review, 2016)
WRONG – NO-ONE IS EXEMPT
(Source: Sistrix)
"It doesn't matter how big your brand is if you 'talk to the spider' (Googlebot) wrong" – you can still 'tank'
WRONG – GOOGLE THINKS SEOS SHOULD UNDERSTAND CRAWL BUDGET
"EMPHASISE IMPORTANCE"
"Make sure the right URLs get on Googlebot's menu and increase URL importance to build Googlebot's appetite for your site more"
Dawn Anderson @dawnieando
SORT OUT CRAWLING
TWITTER – @dawnieando
GOOGLE+ – +DawnAnderson888
LINKEDIN – msdawnanderson
THANK YOU
Dawn Anderson @dawnieando
CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER – LIKES & DISLIKES
LIKES
• Going 'where the action is' in sites
• The 'need for speed'
• Logical structure
• Correct 'response' codes
• XML sitemaps with important URLs
• Successful crawl visits
• 'Seeing everything' on a page
• Taking MAX 'hints'
• Clear, unique, single 'URL fingerprints' (no duplicates)
• Predicting the likelihood of 'future change'
• Finding 'more' important content worth crawling
DISLIKES
• Slow sites
• Too many redirects
• Being bored (Meh) (Min 'hints' are built in by the search engine systems – takes 'hints')
• Being lied to (e.g. on XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (infinite loops)
• Spam URLs
• Crawl-wasting minor content change URLs
• 'Hidden' and blocked content
• Uncrawlable URLs
CHANGE IS KEY
Not just any change – critical material change. Predicting future change. Dropping 'hints' to Googlebot. Sending Googlebot where 'the action is'. Not just page change designed to catch Googlebot's eye with no added value.
FIX GOOGLEBOT'S JOURNEY
SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
TECHNICAL 'FIXES'
• Speed up your site
• Implement compression, minification, caching
• Fix incorrect header response codes (see the curl sketch after this list)
• Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
• Use absolute versus relative internal links
• Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
• Ensure no CSS or JavaScript files are blocked from crawlers
• Unpick 301 redirect chains
• Consider using a CDN such as Cloudflare (IMPLEMENTATION OF A CONTENT DELIVERY NETWORK)
• Minimise 301 redirects
• Minimise canonicalisation
• Use 'if modified' headers on low importance 'hygiene' pages
• Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
• Noindex low search volume or near-duplicate URLs (use the noindex directive on robots.txt)
• Use 410 'gone' headers on dead URLs liberally
• Revisit the .htaccess file and review legacy pattern-matched 301 redirects
• Combine CSS and JavaScript files
• Use minification, compression and caching
FIX GOOGLEBOT'S JOURNEY
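To help with the 'fix incorrect header response codes' and 'unpick 301 redirect chains' items above, a small curl sketch (urls.txt is a hypothetical file of absolute URLs, one per line) that reports how many redirect hops each URL takes and where it ends up:

```bash
# For each URL: number of redirect hops followed and the final status code.
while read -r url; do
  hops=$(curl -s -o /dev/null -L -w '%{num_redirects}' "$url")
  final=$(curl -s -o /dev/null -L -w '%{http_code}' "$url")
  printf '%s redirect(s) -> %s  %s\n' "$hops" "$final" "$url"
done < urls.txt | sort -rn
```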
SAVE BUDGET / EMPHASISE IMPORTANCE
• Revisit 'votes for self' via internal links in GSC
• Clear 'unique' URL fingerprints
• Improve whole site sections / categories
• Use XML sitemaps for your important URLs (don't put everything on them)
• Use 'mega menus' (very selectively) to key pages
• Use 'breadcrumbs'
• Build 'bridges' and 'shortcuts' via HTML sitemaps and 'cross modular' 'related' internal linking to key pages
• Consolidate (merge) important but similar content (e.g. merge FAQs or 'low search volume' content into other relevant pages)
• Consider flattening your site structure so 'importance' flows further
• Reduce internal linking to lower priority URLs
BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES
TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
• Not just any change – critical material change
• Keep the 'action' in the key areas – NOT JUST THE BLOG
• Use relevant 'supplementary content' to keep key pages 'fresh'
• Remember min crawl 'hints'
• Regularly update key IMPORTANT content
• Consider 'updating' rather than replacing seasonal content URLs (e.g. annual events). Append and update.
• Build 'dynamism' and 'interactivity' into your web development (sites that 'move' win)
• Keep working to improve and make your URLs more important
GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE (AS LONG AS THOSE URLS ARE NOT UNIMPORTANT)
EMPHASISE PAGE IMPORTANCE – TRAIN ON CHANGE
SAVINGS, CHANGE & SPEED TOOLS
SAVINGS & CHANGE
• GSC index levels (over-indexation checks)
• GSC crawl stats
• Last accessed tools (versus competitors)
• Server logs
• Keyword tools
SPEED
• YSlow
• Pingdom
• Google Page Speed tests
• Minification – JS Compress and CSS Minifier
• Image compression – compressjpeg.com, tinypng.com
• Content delivery networks (e.g. Cloudflare)
URL IMPORTANCE & CRAWL FREQUENCY TOOLS
• GSC internal links report (URL importance)
• Link Research Tools (strongest sub-pages reports)
• GSC internal links (add site categories and sections as additional profiles)
• Powermapper
• XML sitemap generators for custom sitemaps
• Crawl frequency clocking (@Johnmu)
SPIDER EYES TOOLS
• GSC crawl stats
• URL Profiler
• Deepcrawl
• Screaming Frog
• Server logs
• SEMrush (auditing tools)
• Webconfs (header responses / similarity checker)
• Powermapper (bird's eye view of site)
• Lynx browser
• Crawl frequency clocking (@Johnmu)
REFERENCES
Efficient Crawling Through URL Ordering (Page et al) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
Crawl Optimisation (Blind Five Year Old – A J Kohn, @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization
Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459
Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112
Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8
Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
Web Promo Q and A with Google's Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Google Number 1 SEO Advice – Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
Internet Live Stats - http://www.internetlivestats.com/total-number-of-websites/
Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8707313
Managing items in crawl schedule – Google Patent (Alpert) - http://www.google.ch/patents/US8666964
Document reuse in a search engine crawler – Google Patent (Zhu et al) - https://www.google.com/patents/US8707312
Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054
Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610
Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897
https://www.sistrix.com/blog/how-nordstrom-bested-zappos-on-google/
https://www.xml-sitemaps.com/generator-demo/

Mais conteúdo relacionado

Mais procurados

Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...
Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...
Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...
Steve Morgan
 

Mais procurados (20)

How to Incorporate ML in your SERP Analysis, Lazarina Stoy -BrightonSEO Oct, ...
How to Incorporate ML in your SERP Analysis, Lazarina Stoy -BrightonSEO Oct, ...How to Incorporate ML in your SERP Analysis, Lazarina Stoy -BrightonSEO Oct, ...
How to Incorporate ML in your SERP Analysis, Lazarina Stoy -BrightonSEO Oct, ...
 
BrightonSEO March 2021 | Dan Taylor, Image Entity Tags
BrightonSEO March 2021 | Dan Taylor, Image Entity TagsBrightonSEO March 2021 | Dan Taylor, Image Entity Tags
BrightonSEO March 2021 | Dan Taylor, Image Entity Tags
 
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
 
PubCon, Lazarina Stoy. - Machine Learning in Search: Google's ML APIs vs Open...
PubCon, Lazarina Stoy. - Machine Learning in Search: Google's ML APIs vs Open...PubCon, Lazarina Stoy. - Machine Learning in Search: Google's ML APIs vs Open...
PubCon, Lazarina Stoy. - Machine Learning in Search: Google's ML APIs vs Open...
 
SEO Automation Without Using Hard Code by Tevfik Mert Azizoglu - BrightonSEO ...
SEO Automation Without Using Hard Code by Tevfik Mert Azizoglu - BrightonSEO ...SEO Automation Without Using Hard Code by Tevfik Mert Azizoglu - BrightonSEO ...
SEO Automation Without Using Hard Code by Tevfik Mert Azizoglu - BrightonSEO ...
 
How to automate a long tail SEO strategy for ecommerce
How to automate a long tail SEO strategy for ecommerceHow to automate a long tail SEO strategy for ecommerce
How to automate a long tail SEO strategy for ecommerce
 
Automating Google Lighthouse
Automating Google LighthouseAutomating Google Lighthouse
Automating Google Lighthouse
 
Influencing Discovery, Indexing Strategies For Complex Websites
Influencing Discovery, Indexing Strategies For Complex WebsitesInfluencing Discovery, Indexing Strategies For Complex Websites
Influencing Discovery, Indexing Strategies For Complex Websites
 
A beginner's guide to machine learning for SEOs - WTSFest 2022
A beginner's guide to machine learning for SEOs  - WTSFest 2022A beginner's guide to machine learning for SEOs  - WTSFest 2022
A beginner's guide to machine learning for SEOs - WTSFest 2022
 
SMX Session - State of Search 2023
SMX Session - State of Search 2023SMX Session - State of Search 2023
SMX Session - State of Search 2023
 
Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...
Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...
Going Solo - The Survival Guide for Freelance SEOs (Present & Future) | brigh...
 
Beth Barnham Schema Auditing BrightonSEO Slides.pptx
Beth Barnham Schema Auditing BrightonSEO Slides.pptxBeth Barnham Schema Auditing BrightonSEO Slides.pptx
Beth Barnham Schema Auditing BrightonSEO Slides.pptx
 
Why Scaling (Great) Content Is So Bloody Hard
Why Scaling (Great) Content Is So Bloody HardWhy Scaling (Great) Content Is So Bloody Hard
Why Scaling (Great) Content Is So Bloody Hard
 
Crawl Budget: Everything you Need to Know
Crawl Budget: Everything you Need to KnowCrawl Budget: Everything you Need to Know
Crawl Budget: Everything you Need to Know
 
[BrightonSEO 2019] Restructuring Websites to Improve Indexability
[BrightonSEO 2019] Restructuring Websites to Improve Indexability[BrightonSEO 2019] Restructuring Websites to Improve Indexability
[BrightonSEO 2019] Restructuring Websites to Improve Indexability
 
How Search Works
How Search WorksHow Search Works
How Search Works
 
Paige Hobart - How to do GOOD Keyword Research - Search Advertising Show 2021
Paige Hobart - How to do GOOD Keyword Research - Search Advertising Show 2021Paige Hobart - How to do GOOD Keyword Research - Search Advertising Show 2021
Paige Hobart - How to do GOOD Keyword Research - Search Advertising Show 2021
 
Core Web Vitals Audit - Sophie Gibson - PDF - BrightonSEO.pdf
Core Web Vitals Audit - Sophie Gibson - PDF - BrightonSEO.pdfCore Web Vitals Audit - Sophie Gibson - PDF - BrightonSEO.pdf
Core Web Vitals Audit - Sophie Gibson - PDF - BrightonSEO.pdf
 
BrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
BrightonSEO - Master Crawl Budget Optimization for Enterprise WebsitesBrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
BrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
 
SEO low hanging Fruit - Identifying High Impact Opportunities Fast #SEOforUkr...
SEO low hanging Fruit - Identifying High Impact Opportunities Fast #SEOforUkr...SEO low hanging Fruit - Identifying High Impact Opportunities Fast #SEOforUkr...
SEO low hanging Fruit - Identifying High Impact Opportunities Fast #SEOforUkr...
 

Destaque

Contributing to WordPress: Why it's Important to Your Business
Contributing to WordPress: Why it's Important to Your Business Contributing to WordPress: Why it's Important to Your Business
Contributing to WordPress: Why it's Important to Your Business
Kel
 

Destaque (20)

Head Slapping WordPress Security
Head Slapping WordPress SecurityHead Slapping WordPress Security
Head Slapping WordPress Security
 
Paid Traffic with WordPress PPC Hacks - by Peter Mead for BigDigital 2016
Paid Traffic with WordPress PPC Hacks - by Peter Mead for BigDigital 2016Paid Traffic with WordPress PPC Hacks - by Peter Mead for BigDigital 2016
Paid Traffic with WordPress PPC Hacks - by Peter Mead for BigDigital 2016
 
Mobile Visibility to the Max - 2016 Edition #BigDigitalADL
Mobile Visibility to the Max - 2016 Edition #BigDigitalADLMobile Visibility to the Max - 2016 Edition #BigDigitalADL
Mobile Visibility to the Max - 2016 Edition #BigDigitalADL
 
Harnessing The Power Of Archetypes For Your Digital Marketing
Harnessing The Power Of Archetypes For Your Digital MarketingHarnessing The Power Of Archetypes For Your Digital Marketing
Harnessing The Power Of Archetypes For Your Digital Marketing
 
How to achieve mind-blowing Content Marketing ROI
How to achieve mind-blowing Content Marketing ROIHow to achieve mind-blowing Content Marketing ROI
How to achieve mind-blowing Content Marketing ROI
 
Writing the Right Content at #SMS2016
Writing the Right Content at #SMS2016 Writing the Right Content at #SMS2016
Writing the Right Content at #SMS2016
 
Keeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AU
Keeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AUKeeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AU
Keeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AU
 
Tori Cushing - Actionable SEO Insights - SMX 2015
Tori Cushing - Actionable SEO Insights - SMX 2015Tori Cushing - Actionable SEO Insights - SMX 2015
Tori Cushing - Actionable SEO Insights - SMX 2015
 
Identifying a Compromised WordPress Site
Identifying a Compromised WordPress SiteIdentifying a Compromised WordPress Site
Identifying a Compromised WordPress Site
 
Accelerated Mobile Pages (AMP)
Accelerated Mobile Pages (AMP)Accelerated Mobile Pages (AMP)
Accelerated Mobile Pages (AMP)
 
WordPress Security Basics - Melbourne WordPress User Meetup
WordPress Security Basics - Melbourne WordPress User MeetupWordPress Security Basics - Melbourne WordPress User Meetup
WordPress Security Basics - Melbourne WordPress User Meetup
 
SEO Training at Envatotalks
SEO Training at EnvatotalksSEO Training at Envatotalks
SEO Training at Envatotalks
 
WordPress SEO Tips
WordPress SEO TipsWordPress SEO Tips
WordPress SEO Tips
 
WordPress SEO Basics - Melbourne WordPress Meetup
WordPress SEO Basics - Melbourne WordPress MeetupWordPress SEO Basics - Melbourne WordPress Meetup
WordPress SEO Basics - Melbourne WordPress Meetup
 
Web Performance Optimisation
Web Performance OptimisationWeb Performance Optimisation
Web Performance Optimisation
 
Final cbd slides
Final cbd slidesFinal cbd slides
Final cbd slides
 
Installing WordPress The Right Way
Installing WordPress The Right WayInstalling WordPress The Right Way
Installing WordPress The Right Way
 
WordPress Menus - Melbourne User Meetup
WordPress Menus - Melbourne User MeetupWordPress Menus - Melbourne User Meetup
WordPress Menus - Melbourne User Meetup
 
Contributing to WordPress: Why it's Important to Your Business
Contributing to WordPress: Why it's Important to Your Business Contributing to WordPress: Why it's Important to Your Business
Contributing to WordPress: Why it's Important to Your Business
 
Build on Chassis: Introduction to a Solid Development Workflow
Build on Chassis: Introduction to a Solid Development WorkflowBuild on Chassis: Introduction to a Solid Development Workflow
Build on Chassis: Introduction to a Solid Development Workflow
 

Semelhante a Negotiating crawl budget with googlebots

Semelhante a Negotiating crawl budget with googlebots (20)

Bringing in the family to emphasise importance and win during crawling
Bringing in the family to emphasise importance and win during crawlingBringing in the family to emphasise importance and win during crawling
Bringing in the family to emphasise importance and win during crawling
 
SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016
SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016
SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016
 
How to Optimize Your Website for Crawl Efficiency
How to Optimize Your Website for Crawl EfficiencyHow to Optimize Your Website for Crawl Efficiency
How to Optimize Your Website for Crawl Efficiency
 
Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...
Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...
Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...
 
Crawl Budget - Some Insights & Ideas @ seokomm 2015
Crawl Budget - Some Insights & Ideas @ seokomm 2015Crawl Budget - Some Insights & Ideas @ seokomm 2015
Crawl Budget - Some Insights & Ideas @ seokomm 2015
 
SEO Cannibalisation of Your Own SEO Success
SEO Cannibalisation of Your Own SEO SuccessSEO Cannibalisation of Your Own SEO Success
SEO Cannibalisation of Your Own SEO Success
 
Web2.0.2012 - lesson 8 - Google world
Web2.0.2012 - lesson 8 - Google worldWeb2.0.2012 - lesson 8 - Google world
Web2.0.2012 - lesson 8 - Google world
 
Sasconbeta 2015 Dawn Anderson - Talk To The Spider
Sasconbeta 2015 Dawn Anderson - Talk To The SpiderSasconbeta 2015 Dawn Anderson - Talk To The Spider
Sasconbeta 2015 Dawn Anderson - Talk To The Spider
 
From Web Site to Web App: Fantastic Optimisations and Where To Find Them
From Web Site to Web App: Fantastic Optimisations and Where To Find ThemFrom Web Site to Web App: Fantastic Optimisations and Where To Find Them
From Web Site to Web App: Fantastic Optimisations and Where To Find Them
 
Web Design Trends: 2018 Edition
Web Design Trends: 2018 EditionWeb Design Trends: 2018 Edition
Web Design Trends: 2018 Edition
 
E017624043
E017624043E017624043
E017624043
 
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
 
Web of things introduction
Web of things introductionWeb of things introduction
Web of things introduction
 
Web 2.0 Mashups
Web 2.0 MashupsWeb 2.0 Mashups
Web 2.0 Mashups
 
Crawl optimization - ( How to optimize to increase crawl budget)
Crawl optimization - ( How to optimize to increase crawl budget)Crawl optimization - ( How to optimize to increase crawl budget)
Crawl optimization - ( How to optimize to increase crawl budget)
 
SMX Advanced 2018 SEO for Javascript Frameworks by Patrick Stox
SMX Advanced 2018 SEO for Javascript Frameworks by Patrick StoxSMX Advanced 2018 SEO for Javascript Frameworks by Patrick Stox
SMX Advanced 2018 SEO for Javascript Frameworks by Patrick Stox
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Senior Project Documentation.
Senior Project Documentation.Senior Project Documentation.
Senior Project Documentation.
 
Modern SEO Players Guide
Modern SEO Players GuideModern SEO Players Guide
Modern SEO Players Guide
 

Mais de Dawn Anderson MSc DigM

Mais de Dawn Anderson MSc DigM (20)

Human vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdfHuman vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdf
 
Life of An SEO - Surfing The Waves of Googles Many Algorithmic Updates
Life of An SEO - Surfing The Waves of Googles Many Algorithmic UpdatesLife of An SEO - Surfing The Waves of Googles Many Algorithmic Updates
Life of An SEO - Surfing The Waves of Googles Many Algorithmic Updates
 
Passage indexing is likely more important than you think
Passage indexing is likely more important than you thinkPassage indexing is likely more important than you think
Passage indexing is likely more important than you think
 
Zipfs Law & Zipfian Distribution in SEO - Pubcon Virtual Fall 2020 - Dawn And...
Zipfs Law & Zipfian Distribution in SEO - Pubcon Virtual Fall 2020 - Dawn And...Zipfs Law & Zipfian Distribution in SEO - Pubcon Virtual Fall 2020 - Dawn And...
Zipfs Law & Zipfian Distribution in SEO - Pubcon Virtual Fall 2020 - Dawn And...
 
Google BERT - SMX London 2020 Virtual Conference


Negotiating crawl budget with googlebots

  • 1. USING  ’PAGE  IMPORTANCE’  IN  ONGOING   CONVERSATION  WITH  GOOGLEBOT  TO  GET   JUST  A  BIT  MORE  THAN  YOUR  ALLOCATED   CRAWL  BUDGET NEGOTIATING   CRAWL   BUDGET  WITH   GOOGLEBOTS Dawn  Anderson  @  dawnieando
  • 2. Another  Rainy   Day  In   Manchester @dawnieando
  • 4. 1994  -­ 1998 “THE  GOOGLE  INDEX  IN  1998  HAD   60  MILLION  PAGES”  (GOOGLE)   (Source:Wikipedia.org)
  • 5. 2000 “INDEXED  PAGES  REACHES  THE  ONE  BILLION   MARK”  (GOOGLE) “IN  OVER  17  MILLION   WEBSITES”   (INTERNETLIVESTATS.COM)
  • 6. 2001  ONWARDS ENTER  WORDPRESS,  DRUPAL  CMS’,  PHP  DRIVEN  CMS’,  ECOMMERCE   PLATFORMS,  DYNAMIC  SITES,  AJAX WHICH  CAN  GENERATE  10,000S  OR  100,000S   OR  1,000,000S  OF  DYNAMIC URLS  ON  THE  FLY  WITH  DATABASE  ‘FIELD   BASED’  CONTENT DYNAMIC  CONTENT  CREATION  GROWS ENTER  FACETED  NAVIGATION  (WITH  MANY  #   PATHS  TO  SAME  CONTENT) 2003  – WE’RE  AT  40  MILLION  WEBSITES
  • 7. 2003  ONWARDS  – USERS  BEGIN  TO  JUMP  ON  THE  CONTENT   GENERATION  BANDWAGGON LOTS  OF   CONTENT  – IN   MANY  FORMS
  • 8. WE  KNEW  THE  WEB  WAS  BIG…  (GOOGLE,  2008) https://googleblog.blogspot.co.uk/2008/07/we-­‐knew-­‐web-­‐was-­‐big.html “1  trillion  (as  in  1,000,000,000,000)   unique  URLs  on  the  web  at  once!” (Jesse  Alpert  on  Google’s   Official  Blog,  2008) 2008  – EVEN   GOOGLE   ENGINEERS   STOPPED  IN  AWE
  • 9. 2010  – USER  GENERATED  CONTENT  GROWS “Let  me  repeat  that:  we   create  as  much  information   in  two  days  now  as  we  did   from  the  dawn  of  man   through  2003” “The  real  issue  is  user-­‐ generated  content.”  (Eric   Schmidt,  2010  – Techonomy Conference  Panel) SOURCE:  http://techcrunch.com/2010/08/04/schmidt-­‐data/
  • 10. Indexed  Web  contains at  least  4.73  billion   pages (13/11/2015) CONTENT KEEPS GROWING Total  number  of  websites 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 1,000,000,000 750,000,000 500,000,000 250,000,000 THE  NUMBER  OF  WEBSITES   DOUBLED  IN  SIZE  BETWEEN   2011  AND  2012 AND  AGAIN  BY  1/3  IN  2014
  • 11. EVEN  SIR  TIM   BERNERS-­‐LEE (Inventor  of  www)   TWEETED 2014  – WE  PASS  A  BILLION  INDIVIDUAL  WEBSITES   ONLINE
  • 12. 2014  – WE  ARE  ALL PUBLISHERS SOURCE:  http://wordpress/activity/posting
  • 13. YUP  -­ WE  ALL‘LOVE  CONTENT’ IMAGINE  HOW  MANY   UNIQUE  URLs    COMBINED   THIS  AMOUNTS  TO?   – A  LOT http://www.internetlivestats.com/total-­‐number-­‐of-­‐websites/
  • 14. “As  of  the  end  of  2003,  the   WWW  is  believed  to  include   well  in  excess  of  10  billion   distinct  documents  or  web   pages,  while  a  search   engine  may  have  a  crawling   capacity  that  is  less  than   half  as  many  documents”   (MANY  GOOGLE  PATENTS) CAPACITY  LIMITATIONS  – EVEN  FOR  SEARCH   ENGINES Source:  Scheduler  for  search  engine  crawler Google  Patent US  8042112  B1,  (Zhu  et  al)
  • 15. “So  how  many  unique  pages   does  the  web  really   contain?  We  don't  know;  we   don't  have  time  to  look  at   them  all!  :-­‐)”   (Jesse  Alpert,  Google,  2008) Source:  https://googleblog.blogspot.co.uk/2008/07/we-­‐knew-­‐web-­‐ was-­‐big.html NOT   ENOUGH   TIME SOME  THINGS   MUST  BE   FILTERED
  • 16. A LOT OF THE CONTENT IS ‘KIND OF THE SAME’ “There’s a needle in here somewhere” “It’s an important needle too”
  • 17. WHAT IS THE SOLUTION? Capacity limits on Google’s crawling system – how have search engines responded? By prioritising URLs for crawling, by assigning crawl period intervals to URLs, and by creating work ‘schedules’ for Googlebots. “To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling”. – Scheduler for search engine crawler (Zhu et al)
  • 18. GOOGLE CRAWL SCHEDULER PATENTS include ‘Managing items in a crawl schedule’, ‘Scheduling a recrawl’, ‘Web crawler scheduler that utilizes sitemaps from websites’, ‘Document reuse in a search engine crawler’, ‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’ and ‘Scheduler for search engine’. EFFICIENCY IS NECESSARY
  • 19. CRAWL BUDGET 1. Crawl Budget – “An allocation of crawl frequency visits to a host (IP LEVEL)” 2. Roughly proportionate to PageRank and host load / speed / host capacity 3. Pages with a lot of links get crawled more 4. The vast majority of URLs on the web don’t get a lot of budget allocated to them (low to 0 PageRank URLs). https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • 20. BUT… MAYBE THINGS HAVE CHANGED? CRAWL BUDGET / CRAWL FREQUENCY IS NOT JUST ABOUT HOST-LOAD AND PAGERANK ANY MORE
  • 21. STOP THINKING IT’S JUST ABOUT ‘PAGERANK’ http://www.youtube.com/watch?v=GVKcMU7YNOQ&t=4m45s “You keep focusing on PageRank”… “There’s a shit-ton of other stuff going on” (Gary Illyes, Google, 2016)
  • 22. THERE ARE A LOT OF OTHER THINGS AFFECTING ‘CRAWLING’ WEB PROMOS Q & A WITH GOOGLE’S ANDREY LIPATTSEV Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
  • 23. WHY? BECAUSE… THE WEB GOT ‘MAHOOOOOSIVE’ AND CONTINUES TO GET ‘MAHOOOOOOSIVER’. SITES GOT MORE DYNAMIC, COMPLEX, AUTO-GENERATED, MULTI-FACETED, DUPLICATED, INTERNATIONALISED, BIGGER, BECAME PAGINATED AND SORTED
  • 24. GOOGLEBOT’S TO-DO LIST GOT REALLY BIG. WE NEED MORE WAYS TO GET MORE EFFICIENT AND FILTER OUT TIME-WASTING CRAWLING SO WE CAN FIND IMPORTANT CHANGES QUICKLY
  • 25. FURTHER IMPROVED CRAWLING EFFICIENCY SOLUTIONS NEEDED: Hard and Soft Crawl Limits, Importance Thresholds, Min and Max Hints & ‘Hint Ranges’, Importance Crawl Periods, Scheduling, Prioritization, Tiered Crawling Buckets (‘Real Time’, ‘Daily’, ‘Base Layer’)
  • 26. SEVERAL PATENTS UPDATED (SEEM TO WORK TOGETHER): ‘Managing URLs’ (Alpert et al, 2013) (PAGE IMPORTANCE DETERMINING SOFT AND HARD LIMITS ON CRAWLING); ‘Managing Items in a Crawl Schedule’ (Alpert, 2014); ‘Scheduling a Recrawl’ (Auerbach, Alpert, 2013) (PREDICTING CHANGE FREQUENCY IN ORDER TO SCHEDULE THE NEXT VISIT, EMPLOYING HINTS (Min & Max)); ‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’ (INCLUDES EMPLOYING HINTS TO DETECT PAGES ‘NOT’ TO CRAWL)
  • 27. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT): 3 layers / tiers / buckets for scheduling. Real Time Crawl – crawled multiple times daily. Daily Crawl – crawled daily or bi-daily. Base Layer Crawl (most unimportant) – split into segments on random rotation and crawled least, on a ‘round robin’ basis; only the ‘active’ segment is crawled. URLs are moved in and out of layers based on past visits data.
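To make the three-tier idea easier to reason about, here is a minimal, purely illustrative sketch of how URLs could be bucketed by importance and predicted change. The thresholds, field names and URLs are assumptions invented for the example, not values taken from the patent.

# Illustrative three-tier bucketing of URLs (toy thresholds, not Google's real logic).
from dataclasses import dataclass

@dataclass
class UrlRecord:
    url: str
    importance: float          # e.g. blended from PageRank, internal links, topic relevance
    change_probability: float  # predicted likelihood of critical material change

def assign_layer(record: UrlRecord) -> str:
    """Bucket a URL into 'real-time', 'daily' or 'base' using made-up thresholds."""
    if record.importance > 0.8 and record.change_probability > 0.7:
        return "real-time"   # crawled multiple times daily
    if record.importance > 0.5:
        return "daily"       # crawled daily or bi-daily
    return "base"            # segmented, round-robin, crawled least

urls = [
    UrlRecord("https://example.com/", 0.95, 0.9),
    UrlRecord("https://example.com/category/widgets", 0.6, 0.4),
    UrlRecord("https://example.com/?sort=price&page=57", 0.1, 0.9),
]
for u in urls:
    print(assign_layer(u), u.url)

The point of the toy example: a page that changes constantly but carries a low importance score still lands in the base layer, which is exactly the ‘so what’ argument made on slide 39.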
  • 28. CAN WE ESCAPE THE ‘BASE LAYER’ CRAWL BUCKET RESERVED FOR ‘UNIMPORTANT’ URLS?
  • 29. SOME OF THE MAJOR SEARCH ENGINE CHARACTERS: 10 types of Googlebot; the History Logs / History Server; the URL Scheduler / Crawl Manager
  • 30. HISTORY LOGS / HISTORY SERVER – builds a picture of historical data, the past behaviour of the URL and its ‘importance’ score to predict and plan for future crawl scheduling • Last crawled date • Next crawl due • Last server response • Page importance score • Collaborates with link logs • Collaborates with anchor logs • Contributes info to scheduling
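As a mental model only, the kind of per-URL record such a history log might hold could look like the sketch below. The field names mirror the bullets on this slide, but the structure itself is an assumption for illustration, not Google's schema.

# Hypothetical shape of a history-log entry for one URL (illustration only).
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class UrlHistoryEntry:
    url: str
    last_crawled: Optional[datetime] = None
    next_crawl_due: Optional[datetime] = None
    last_server_response: Optional[int] = None   # e.g. 200, 304, 404, 500
    page_importance: float = 0.0                 # score used by the scheduler
    content_checksum: Optional[str] = None       # fingerprint used to detect change between visits
    crawl_interval_days: float = 30.0            # currently assigned crawl period

entry = UrlHistoryEntry(url="https://example.com/category/widgets",
                        last_server_response=200,
                        page_importance=0.62)
print(entry)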
  • 31. ‘BOSS’ – URL SCHEDULER / URL MANAGER. Think of it as Google’s line manager or ‘air traffic controller’ for Googlebots in the web crawling system. JOBS • Schedules Googlebot visits to URLs • Decides which URLs to ‘feed’ to Googlebot • Uses data from the history logs about past visits (change rate and importance) • Calculates the importance crawl threshold • Assigns visit regularity of Googlebot to URLs • Drops ‘max and min hints’ to Googlebot to guide on types of content NOT to crawl or to crawl as exceptions • Excludes some URLs from schedules • Assigns URLs to ‘layers / tiers’ for crawling schedules • Checks URLs for ‘importance’, ‘boost factor’ candidacy and ‘probability of modification’ • Budgets are allocated to IPs and shared amongst the domains there
  • 32. GOOGLEBOT – CRAWLER JOBS • ‘Ranks nothing at all’ • Takes a list of URLs to crawl from the URL Scheduler • Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs • Makes notes of outbound linked pages and additional links for future crawling • Follows directives (robots) and takes ‘hints’ when crawling • Tells tales of URL accessibility status and server response codes, notes relationships between links and collects content checksums (compact fingerprints of the page content) for comparison with past visits by the history and link logs • Will go beyond the crawl schedule if it finds something more important than the URLs scheduled
  • 33. WHAT MAKES THE DIFFERENCE BETWEEN BASE LAYER AND ‘REAL TIME’ SCHEDULE ALLOCATION?
  • 34. CONTRIBUTING FACTORS 1. Page Importance (which may include PageRank) 2. Hints (max and min) 3. Soft limits and hard crawl limits 4. Host load capability & past site performance (speed and access) (IP level and domain level within) 5. Probability / predictability of ‘CRITICAL MATERIAL’ change + importance crawl period
  • 35. 1 – PAGE IMPORTANCE – Page importance is the importance of a page independent of a query • Location in site (e.g. the home page is more important than third-level parameter output) • PageRank • Page type / file type • Internal PageRank • Internal backlinks • In-site anchor text consistency • Relevance (content, anchors and elements) to a topic (Similarity Importance) • Directives from in-page robots and robots.txt management • Parent quality brushes off on child page quality. IMPORTANT PARENTS ARE LIKELY SEEN TO HAVE IMPORTANT CHILD PAGES
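To make that concrete, a toy scoring sketch that blends a few of the factors listed above. Every weight here is invented purely for illustration; real importance scoring is far richer and undisclosed.

# Toy page-importance score with invented weights (illustration only).
def page_importance(pagerank: float,
                    internal_backlinks: int,
                    depth_from_home: int,
                    topic_relevance: float) -> float:
    """Blend a few of the factors named on this slide; all weights are assumptions."""
    link_signal = min(internal_backlinks / 100.0, 1.0)   # cap the internal-link contribution
    depth_penalty = 1.0 / (1 + depth_from_home)          # deeper pages score lower
    return 0.4 * pagerank + 0.3 * link_signal + 0.2 * depth_penalty + 0.1 * topic_relevance

# A well-linked page one click from home vs. a deep, poorly linked parameter page.
print(page_importance(0.7, internal_backlinks=250, depth_from_home=1, topic_relevance=0.8))
print(page_importance(0.1, internal_backlinks=3, depth_from_home=4, topic_relevance=0.3))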
  • 36. 2 – HINTS – ‘MIN’ HINTS & ‘MAX’ HINTS. MIN HINT / MIN HINT RANGES • e.g. programmatically generated content which changes the content checksum on every load • Unimportant duplicate parameter URLs • Canonicals • rel=next, rel=prev • HReflang • Duplicate content • Spammy URLs? • Objectionable content. MAX HINT / MAX HINT RANGES • Change considered ‘CRITICAL MATERIAL CHANGE’ (useful to users, e.g. availability, price) & / or improved site sections or change to IMPORTANT but infrequently changing content • Important pages / page range updates. E.g. rel="prev" and rel="next" act as hints to Google, not absolute directives: https://support.google.com/webmasters/answer/1663744?hl=en&ref_topic=4617741
  • 37. 3 – HARD AND SOFT LIMITS ON CRAWLING. A ‘soft’ crawl limit is set (the original schedule). A ‘hard’ crawl limit is set (e.g. 130% of schedule) FOR IMPORTANT FINDINGS. If URLs are discovered during crawling that are more important than those scheduled to be crawled, then Googlebot can go beyond its schedule to include them, up to the hard crawl limit.
  • 38. 4 – HOST LOAD CAPACITY / PAST SITE PERFORMANCE. Googlebot has a list of URLs to crawl. Naturally, if your site is fast that list can be crawled quicker. If Googlebot experiences 500s, for example, she will retreat & ‘past performance’ is noted. If Googlebot doesn’t get ‘round the list’ you may end up with ‘overdue’ URLs to crawl.
  • 39. 5 – CHANGE • Not all change is considered equal • There are many dynamic sites with low importance pages changing frequently – SO WHAT • Constantly changing your page just to get Googlebot back won’t work if the page is low importance (crawl importance period < change rate) – POINTLESS • Hints are employed to detect pages which simply change the content checksum with every visit • Features are weighted for change importance to the user (e.g. price > colour) • Change identified as useful to users is considered ‘CRITICAL MATERIAL CHANGE’ • Don’t just try to randomise things to catch Googlebot’s eye • That counter or clock you added probably isn’t going to help you get more attention, nor will random or shuffle • Change on some types of pages is more important than on other pages (e.g. the CNN home page > an SME about-us page)
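A hedged sketch of the checksum idea behind those ‘min hints’: hashing the whole rendered page flags every cosmetic change (the counter, the clock), whereas hashing only the fields that matter to users approximates ‘critical material change’. The page strings and field names below are invented for the example.

# Whole-page checksum vs. a checksum of user-critical fields (illustrative only).
import hashlib

def checksum(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

old_page = "Widget X | Price: 19.99 | In stock | Visitor counter: 10431"
new_page = "Widget X | Price: 19.99 | In stock | Visitor counter: 10585"

# The whole-page hash changes on every load because of the counter - noise, not material change.
print(checksum(old_page) == checksum(new_page))                                      # False

def critical_fields(page: str) -> str:
    """Keep only the fields users care about (price, availability) before hashing."""
    parts = page.split(" | ")
    return " | ".join(p for p in parts if p.startswith(("Price:", "In stock", "Out of stock")))

# The critical-field hash is unchanged, so nothing material happened here.
print(checksum(critical_fields(old_page)) == checksum(critical_fields(new_page)))    # True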
  • 40. FACTORS AFFECTING HIGHER GOOGLEBOT VISIT FREQUENCY • Current capacity of the web crawling system is high • Your URL has a high ‘importance score’ • Your URL is in the real time (HIGH IMPORTANCE), daily crawl (LESS IMPORTANT) or ‘active’ base layer segment (UNIMPORTANT BUT SELECTED) • Your URL changes a lot with CRITICAL MATERIAL CONTENT change (AND IS IMPORTANT) • Probability and predictability of CRITICAL MATERIAL CONTENT change is high for your URL (AND THE URL IS IMPORTANT) • Your website speed is fast and Googlebot gets the time to visit your URL on its bucket list of scheduled URLs for that visit • Your URL has been ‘upgraded’ to a daily or real time crawl layer as its importance is detected as raised • History logs and the URL Scheduler ‘learn’ together
  • 41. FACTORS AFFECTING LOWER GOOGLEBOT VISIT FREQUENCY • Current capacity of the web crawling system is low • Your URL has been detected as a ‘spam’ URL • Your URL is in an ‘inactive’ base layer segment (UNIMPORTANT) • Your URLs are ‘tripping hints’ built into the system to detect non-critical-change dynamic content • Probability and predictability of critical material content change is low for your URL • Your website speed is slow and Googlebot doesn’t get the time to visit your URL • Your URL has been ‘downgraded’ to an ‘inactive’ base layer (UNIMPORTANT) segment • Your URL has returned an ‘unreachable’ server response code recently • In-page robots management or robots.txt sends the wrong signals
  • 42. GET MORE CRAWL BY ‘TURNING GOOGLEBOT’S HEAD’ – MAKE YOUR URLs MORE IMPORTANT AND ‘EMPHASISE’ IMPORTANCE
  • 43. GOOGLEBOT DOES AS SHE’S TOLD – WITH A FEW EXCEPTIONS • Hard limits and soft limits • Follows ‘min’ and ‘max’ hints • If she finds something important she will go beyond a scheduled crawl (SOFT LIMIT) to seek out importance (UP TO THE HARD LIMIT) • You need to IMPRESS Googlebot • If you ‘bore’ Googlebot she will return to boring URLs less (e.g. pages that are all the same (duplicate content) or dynamically generated low-usefulness content) • If you ‘delight’ Googlebot she will return to delightful URLs more (they became more important or they changed with ‘CRITICAL MATERIAL CHANGE’) • If she doesn’t get her crawl completed you will end up with an ‘overdue’ list of URLs to crawl
  • 44. GETTING MORE CRAWL BY IMPROVING PAGE IMPORTANCE • Your URL became more important and achieved a higher ‘importance score’ via increased PageRank • Your URL became more important via increased IB(P) (INTERNAL BACKLINKS IN YOUR OWN SITE) relative to other URLs within your site (you emphasised importance) • You made the URL content more relevant to a topic and improved the importance score • The parent of your URL became more important (E.G. IMPROVED TOPIC RELEVANCE (SIMILARITY), PAGERANK OR LOCAL (IN-SITE) IMPORTANCE METRIC) • THE ‘IMPORTANCE SCORE’ OF SOME OF YOUR URLS EXCEEDED THE ‘IMPORTANCE SOFT LIMIT THRESHOLD’, SO THEY ARE INCLUDED FOR CRAWLING AND VISITED UP TO THE POINT OF THE ‘HARD LIMIT’ (E.G. 130% OF SCHEDULED CRAWLING)
  • 45. HOW DO WE DO THIS?
  • 46. TO DO – FIND GOOGLEBOT: AUTOMATE SERVER LOG RETRIEVAL VIA A CRON JOB, THEN ANALYSE THE LOGS. grep Googlebot access_log > googlebot_access.txt
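The one-liner above is a good start, but user agents are easily spoofed, so it is worth confirming that ‘Googlebot’ hits really come from Google. A hedged sketch, assuming the common Apache/Nginx combined log format and the googlebot_access.txt file produced by the grep above, using reverse-then-forward DNS verification:

# Verify that 'Googlebot' log lines are genuine via reverse + forward DNS (sketch).
import socket

def is_real_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the hostname, then forward-resolve to confirm."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:          # covers reverse/forward lookup failures
        return False

with open("googlebot_access.txt") as log:        # produced by the grep one-liner above
    for line in log:
        ip = line.split()[0]                     # first field in combined log format
        if not is_real_googlebot(ip):
            print("Spoofed Googlebot hit:", line.strip())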
  • 47. LOOK THROUGH SPIDER EYES – PREPARE TO BE HORRIFIED: Incorrect URL header response codes; 301 redirect chains; old files or XML sitemaps left on the server from years ago; infinite / endless loops (circular dependency); on parameter-driven sites, URLs crawled which produce the same output; AJAX content fragments pulled in alone; URLs generated by spammers; dead image files being visited; old CSS files still being crawled and loading EVERYTHING; you may even see ‘mini’ abandoned projects within the site; legacy URLs generated by long-forgotten .htaccess regex pattern matching; Googlebot hanging around in your ‘ever-changing’ blog but nowhere else
  • 48. URL CRAWL FREQUENCY ‘CLOCKING’. Spreadsheet provided by @johnmu during a Webmaster Hangout – https://goo.gl/1pToL8 Identify your ‘real time’, ‘daily’ and ‘base layer’ URLs – ARE THEY THE ONES YOU WANT THERE? WHAT IS BEING SEEN AS UNIMPORTANT? Note Googlebot visits: do you recognise all the URLs and URL ranges that are appearing? If not… why not?
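If the spreadsheet is not to hand, crawl frequency can be ‘clocked’ directly from the filtered log. A minimal sketch, under the same assumptions as before (combined log format, the hypothetical googlebot_access.txt), that counts Googlebot hits per URL and estimates the average revisit gap:

# Rough per-URL crawl-frequency 'clocking' from a Googlebot-only access log (sketch).
from collections import defaultdict
from datetime import datetime

hits = defaultdict(list)
with open("googlebot_access.txt") as log:
    for line in log:
        parts = line.split()
        if len(parts) < 7:
            continue
        when = datetime.strptime(parts[3].lstrip("["), "%d/%b/%Y:%H:%M:%S")
        hits[parts[6]].append(when)              # parts[6] is the request path

for url, times in sorted(hits.items(), key=lambda kv: len(kv[1]), reverse=True):
    times.sort()
    if len(times) > 1:
        gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
        print(f"{url}  hits={len(times)}  avg gap={sum(gaps) / len(gaps):.1f}h")
    else:
        print(f"{url}  hits=1 (base-layer candidate?)")

URLs revisited on hour-level gaps behave like ‘real time’ layer candidates; a single hit over a month looks much more like the base layer.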
  • 49. IMPROVE & EMPHASISE PAGE IMPORTANCE • Cross-modular internal linking • Canonicalization • Important URLs in XML sitemaps • Anchor text target consistency (but not spammy repetition of anchors everywhere (it’s still output)) • Internal links in the right descending order – emphasise IMPORTANCE • Reduce boilerplate content and improve relevance of content and elements to the specific topic (if category) / product (if product page) / subcategory (if subcategory) • Reduce duplicate content parts of the page to allow primary targets to take ‘IMPORTANCE’ • Improve parent pages to raise the IMPORTANCE reputation of the children rather than over-optimising the child pages and cannibalising the parent • Improve content as more ‘relevant’ to a topic to increase ‘IMPORTANCE’ and get reassigned to a different crawl layer • Flatten ‘architectures’ • Avoid content cannibalisation • Link relevant content to relevant content • Build strong, highly relevant ‘hub’ pages to tie together strength & IMPORTANCE
  • 50. EMPHASISE IMPORTANCE WISELY: USE CUSTOM XML SITEMAPS, E.G. XML UNLIMITED SITEMAP GENERATOR. PUT IMPORTANT URLS IN HERE. IF EVERYTHING IS IMPORTANT THEN IMPORTANCE IS NOT DIFFERENTIATED
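If a dedicated generator feels heavy, a custom sitemap of only your important URLs can be built with a few lines of standard-library Python. A hedged sketch with placeholder URLs and lastmod dates:

# Build a sitemap.xml containing only hand-picked important URLs (sketch).
import xml.etree.ElementTree as ET

important_urls = [
    ("https://example.com/", "2016-05-01"),
    ("https://example.com/category/widgets", "2016-04-28"),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in important_urls:
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = loc
    ET.SubElement(url_el, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)

Run it on a schedule (a cron or web cron job, as the next slide suggests) so the file stays current.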
  • 51. KEEP CUSTOM SITEMAPS ‘CURRENT’ AUTOMATICALLY: AUTOMATE UPDATES WITH CRON JOBS OR WEB CRON JOBS. IT’S NOT AS TECHNICAL AS YOU MAY THINK – USE WEB CRON JOBS
  • 52. BE ‘PICKY’ ABOUT WHAT YOU INCLUDE IN XML SITEMAPS: EXCLUDE AND INCLUDE CRAWL PATHS IN XML SITEMAPS TO EMPHASISE IMPORTANCE
  • 53. IF YOU CAN’T IMPROVE – EXCLUDE (VIA NOINDEX) FOR NOW • YOU’RE OUT FOR NOW • When you improve you can come back in • Tell Googlebot quickly that you’re out (via temporary XML sitemap inclusion) • But ‘follow’, because there will be some relevance within these URLs • Include again when you’ve improved • Don’t try to canonicalize a noindexed page to something in the index
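To confirm the right pages are the ones ‘out for now’, you could audit which URLs currently send a noindex signal, either in an X-Robots-Tag response header or in a robots meta tag. A rough standard-library sketch with placeholder URLs (the meta-tag check here is deliberately crude):

# Report which URLs send a noindex signal via header or meta tag (rough audit sketch).
import urllib.request

urls_to_check = [
    "https://example.com/thin-page",
    "https://example.com/important-page",
]

for url in urls_to_check:
    req = urllib.request.Request(url, headers={"User-Agent": "crawl-audit-script"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        body = resp.read(200_000).decode("utf-8", errors="ignore").lower()
        meta_noindex = 'name="robots"' in body and "noindex" in body   # crude substring check
        print(url, "| X-Robots-Tag:", header or "-", "| meta noindex:", meta_noindex)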
  • 54. OR REMOVE – 410 GONE (IF IT’S NEVER COMING BACK). EMBRACE THE ‘410 GONE’. There’s even a song about it: http://faxfromthefuture.bandcamp.com/track/410-gone-acoustic-demo
  • 55. #BIGSITEPROBLEMS – LOSE THE INDEX BLOAT. LOSE THE BLOAT TO INCREASE THE CRAWL. The number of unimportant URLs indexed extends far beyond the available importance crawl threshold allocation
  • 56. #BIGSITEPROBLEMS – LOSE THE CRAZY TAG MAN. Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it. Creating ‘thin’ content and even more URLs to crawl (Image credit: Buzzfeed)
  • 57. #BIGSITEPROBLEMS – INTERNAL BACKLINKS SKEWED. Most Important Page 1, Most Important Page 2, Most Important Page 3… IS THIS YOUR BLOG?? HOPE NOT. IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING – LOCAL IB(P) – INTERNAL BACKLINKS
  • 58. #BIGSITEPROBLEMS – WARNING SIGNS – LOSE THE ‘MISTER OVER-OPTIMIZER’ (‘OPTIMIZE ALL THE THINGS’). Optimize Everything: I must optimize ALL the pages across a category’s descendants for the same terms as my primary target category page, so that each of them is of almost equal relevance to the target page, and confuse crawlers as to which is the important one. I’ll put them all in a sitemap as standard too, just for good measure. HOW CAN SEARCH ENGINES KNOW WHICH IS MOST IMPORTANT TO A TOPIC IF ‘EVERYTHING’ IS IMPORTANT?? (Image credit: Buzzfeed)
  • 59. #BIGSITEPROBLEMS – WARNING SIGNS – LOSE THE ‘MISTER DUPLICATER’ (‘DUPLICATE ALL THE THINGS’). Duplicate Everything: I must have a massive boilerplate area in the footer, identical sidebars and a massive mega menu with all the same output in it sitewide. I’ll put very little unique content into the page body and it will also look very much like its parents and grandparents too. From time to time I’ll outrank my parent and grandparent pages but ‘Meh’… HOW CAN SEARCH ENGINES KNOW WHICH IS THE MOST IMPORTANT PAGE IF ALL ITS CHILDREN AND GRANDCHILDREN ARE NEARLY THE SAME?? (Image credit: Buzzfeed)
  • 60. IMPROVE SITE PERFORMANCE – HELP GOOGLEBOT GET THROUGH THE ‘BUCKET LIST’ – GET FAST AND RELIABLE. Avoid wasting time on ‘overdue-URL’ crawling (e.g. send correct response codes, speed up your site, etc.) (Patent 8,666,964 B1). Case in point: a site added to the Cloudflare CDN – roughly half the response time, more than 2 x page crawls per day
  • 61. ‘GET FRESH’ AND STAY ‘FRESH’ – ‘BUT DON’T TRY TO FAKE FRESH & USE FRESH WISELY’. GOOGLEBOT GOES WHERE THE ACTION IS. USE ‘ACTION’ WISELY. DON’T TRY TO TRICK GOOGLEBOT BY FAKING ‘FRESHNESS’ ON LOW IMPORTANCE PAGES – GOOGLEBOT WILL REALISE. UPDATE IMPORTANT PAGES OFTEN. NURTURE SEASONAL URLs TO GROW IMPORTANCE WITH FRESHNESS (regular updates) & MATURITY (HISTORY). DON’T TURN GOOGLEBOT’S HEAD INTO THE WRONG PLACES (Image credit: Buzzfeed)
  • 62. IMPROVE TO GET THE HARD LIMITS ON CRAWLING. CAN IMPROVING YOUR SITE HELP TO ‘OVERRIDE’ THE SOFT LIMIT CRAWL PERIODS SET? By improving your URL importance on an ongoing basis via increased PageRank, content improvements (e.g. quality hub pages), internal link strategies, IB(P) and restructuring, you can get the ‘hard limit’ or get visited more generally
  • 63. YOU THINK IT DOESN’T MATTER… RIGHT? YOU SAY… “GOOGLE WILL WORK IT OUT” “LET’S JUST MAKE MORE CONTENT”
  • 64. WRONG – ‘CRAWL TANK’ IS UGLY
  • 65. WRONG – CRAWL TANK CAN LOOK LIKE THIS. SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW ‘THIN’ PARAMETER INTO A SITE OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP)). WHAT’S WORSE THAN AN INFINITE LOOP? ‘A LOGICAL INFINITE LOOP’. IMPORTANCE DISTORTED BY BADLY CODED PARAMETERS GENERATING ‘JUNK’, OR, EVEN WORSE, EXPOSING LOOPING LOGIC TO CRAWLERS BUT NOT TO HUMANS
  • 66. WRONG – SITE DROWNED IN ITS OWN SEA OF UNIMPORTANT URLS
  • 67. VIA ‘EXPONENTIAL URL UNIMPORTANCE’. Your URLs are exponentially confirmed unimportant with each iterative crawl visit to other similar or duplicate content checksum URLs. Fewer and fewer internal links and ‘thinner and thinner’ relevant content. MULTIPLE RANDOM URLs competing for the same query confirm the irrelevance of all the competing in-site URLs, with no dominant single relevant IMPORTANT URL
  • 68. WRONG – ‘SENDING WRONG SIGNALS TO GOOGLEBOT’ COSTS DEARLY. “2015 was the year where website owners managed to be mostly at fault, all by themselves” (Sistrix 2015 Organic Search Review, 2016) (Source: Sistrix)
  • 69. WRONG – NO-ONE IS EXEMPT. “It doesn’t matter how big your brand is, if you ‘talk to the spider’ (Googlebot) wrong” – you can still ‘tank’ (Source: Sistrix)
  • 70. WRONG – GOOGLE THINKS SEOS SHOULD UNDERSTAND CRAWL BUDGET
  • 71. SORT OUT CRAWLING – “EMPHASISE IMPORTANCE”. “Make sure the right URLs get on Googlebot’s menu and increase URL importance to build Googlebot’s appetite for your site more” Dawn Anderson @ dawnieando
  • 72. THANK YOU. Dawn Anderson @ dawnieando TWITTER – @dawnieando GOOGLE+ – +DawnAnderson888 LINKEDIN – msdawnanderson
  • 73. UNDERSTAND GOOGLEBOT & URL SCHEDULER – LIKES & DISLIKES. LIKES: Going ‘where the action is’ in sites • The ‘need for speed’ • Logical structure • Correct ‘response’ codes • XML sitemaps with important URLs • Successful crawl visits • ‘Seeing everything’ on a page • Taking MAX ‘hints’ • Clear, unique, single ‘URL fingerprints’ (no duplicates) • Predicting likelihood of ‘future change’ • Finding ‘more’ important content worth crawling. DISLIKES: Slow sites • Too many redirects • Being bored (Meh) (Min ‘hints’ are built in by the search engine systems – takes ‘hints’) • Being lied to (e.g. on XML sitemap priorities) • Crawl traps and dead ends • Going round in circles (infinite loops) • Spam URLs • Crawl-wasting minor content change URLs • ‘Hidden’ and blocked content • Uncrawlable URLs. CHANGE IS KEY: Not just any change – critical material change • Predicting future change • Dropping ‘hints’ to Googlebot • Sending Googlebot where ‘the action is’ • Not just page change designed to catch Googlebot’s eye with no added value
  • 74. CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER – LIKES & DISLIKES. LIKES: Going ‘where the action is’ in sites • The ‘need for speed’ • Logical structure • Correct ‘response’ codes • XML sitemaps • Successful crawl visits • ‘Seeing everything’ on a page • Taking ‘hints’ • Clear, unique, single ‘URL fingerprints’ (no duplicates) • Predicting likelihood of ‘future change’. DISLIKES: Slow sites • Too many redirects • Being bored (Meh) (‘Hints’ are built in by the search engine systems – takes ‘hints’) • Being lied to (e.g. on XML sitemap priorities) • Crawl traps and dead ends • Going round in circles (infinite loops) • Spam URLs • Crawl-wasting minor content change URLs • ‘Hidden’ and blocked content • Uncrawlable URLs. CHANGE IS KEY: Not just any change – critical material change • Predicting future change • Dropping ‘hints’ to Googlebot • Sending Googlebot where ‘the action is’
  • 75. FIX GOOGLEBOT’S JOURNEY – SPEED UP YOUR SITE TO ‘FEED’ GOOGLEBOT MORE. TECHNICAL ‘FIXES’: Speed up your site • Implement compression, minification, caching • Fix incorrect header response codes • Fix nonsensical ‘infinite loops’ generated by database-driven parameters or ‘looping’ relative URLs • Use absolute versus relative internal links • Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content) • Ensure no CSS or JavaScript files are blocked from crawlers • Unpick 301 redirect chains • Consider using a CDN such as Cloudflare (IMPLEMENTATION OF A CONTENT DELIVERY NETWORK)
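To find the 301 chains worth unpicking, a hedged sketch that follows redirects hop by hop for a list of URLs and flags chains longer than one hop. It assumes the third-party requests package is installed, and the URL list is a placeholder.

# Trace redirect chains hop by hop and flag chains longer than one hop (sketch).
import requests   # assumes the third-party 'requests' package is installed

def trace(url: str, max_hops: int = 10):
    chain = []
    for _ in range(max_hops):
        resp = requests.get(url, allow_redirects=False, timeout=10)
        chain.append((url, resp.status_code))
        location = resp.headers.get("Location")
        if resp.status_code in (301, 302, 307, 308) and location:
            url = requests.compat.urljoin(url, location)   # handle relative Location headers
        else:
            break
    return chain

for start in ["http://example.com/old-page"]:              # placeholder URL list
    chain = trace(start)
    if len(chain) > 2:                                      # more than one redirect hop
        print("Redirect chain to unpick:", " -> ".join(f"{u} ({code})" for u, code in chain))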
  • 76. FIX GOOGLEBOT’S JOURNEY – SAVE BUDGET / EMPHASISE IMPORTANCE: Minimise 301 redirects • Minimise canonicalisation • Use ‘if modified’ headers on low importance ‘hygiene’ pages • Use ‘expires after’ headers on content with a short shelf life (e.g. auctions, job sites, event sites) • Noindex low search volume or near-duplicate URLs (via a robots noindex meta tag or X-Robots-Tag header) • Use 410 ‘gone’ headers on dead URLs liberally • Revisit the .htaccess file and review legacy pattern-matched 301 redirects • Combine CSS and JavaScript files • Use minification, compression and caching
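For the ‘if modified’ suggestion, a hedged sketch that checks whether a URL actually honours conditional requests: a well-behaved server should answer an If-Modified-Since revalidation with 304 Not Modified rather than resending the full page. Placeholder URL; assumes requests is installed.

# Check whether a URL honours If-Modified-Since conditional requests (sketch).
import requests   # assumes the third-party 'requests' package is installed

url = "https://example.com/hygiene-page"          # placeholder URL

first = requests.get(url, timeout=10)
last_modified = first.headers.get("Last-Modified")

if last_modified:
    second = requests.get(url, headers={"If-Modified-Since": last_modified}, timeout=10)
    if second.status_code == 304:
        print("Conditional GET honoured: 304 Not Modified")
    else:
        print("Full response returned again:", second.status_code)
else:
    print("No Last-Modified header - conditional requests cannot be validated this way")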
  • 77. TRAIN GOOGLEBOT – ‘TALK TO THE SPIDER’ (PROMOTE URLS TO HIGHER CRAWL LAYERS). EMPHASISE PAGE IMPORTANCE: Revisit ‘votes for self’ via internal links in GSC • Clear ‘unique’ URL fingerprints • Improve whole site sections / categories • Use XML sitemaps for your important URLs (don’t put everything on them) • Use ‘mega menus’ (very selectively) to key pages • Use ‘breadcrumbs’ • Build ‘bridges’ and ‘shortcuts’ via HTML sitemaps and ‘cross-modular’ ‘related’ internal linking to key pages • Consolidate (merge) important but similar content (e.g. merge FAQs or ‘low search volume’ content into other relevant pages) • Consider flattening your site structure so ‘importance’ flows further • Reduce internal linking to lower priority URLs. BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES. TRAIN ON CHANGE: Not just any change – critical material change • Keep the ‘action’ in the key areas – NOT JUST THE BLOG • Use relevant ‘supplementary content’ to keep key pages ‘fresh’ • Remember min crawl ‘hints’ • Regularly update key IMPORTANT content • Consider ‘updating’ rather than replacing seasonal content URLs (e.g. annual events) – append and update • Build ‘dynamism’ and ‘interactivity’ into your web development (sites that ‘move’ win) • Keep working to improve and make your URLs more important. GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE (AS LONG AS THOSE URLS ARE NOT UNIMPORTANT)
  • 78. SAVINGS, CHANGE & SPEED TOOLS. SAVINGS & CHANGE: • GSC index levels (over-indexation checks) • GSC crawl stats • Last-accessed tools (versus competitors) • Server logs • Keyword tools. SPEED: • YSlow • Pingdom • Google Page Speed tests • Minification – JS Compress and CSS Minifier • Image compression – compressjpeg.com, tinypng.com • Content delivery networks (e.g. Cloudflare)
  • 79. URL IMPORTANCE & CRAWL FREQUENCY TOOLS. URL IMPORTANCE: • GSC Internal Links report (URL importance) • Link Research Tools (strongest sub-pages reports) • GSC internal links (add site categories and sections as additional profiles) • PowerMapper • XML sitemap generators for custom sitemaps • Crawl frequency clocking (@Johnmu)
  • 80. SPIDER EYES TOOLS: • GSC crawl stats • URL Profiler • DeepCrawl • Screaming Frog • Server logs • SEMrush (auditing tools) • Webconfs (header responses / similarity checker) • PowerMapper (bird’s eye view of a site) • Lynx browser • Crawl frequency clocking (@Johnmu)
  • 81. REFERENCES Efficient Crawling Through URL Ordering (Cho et al) – http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf Crawl Optimization (Blind Five Year Old – A J Kohn – @ajkohn) – http://www.blindfiveyearold.com/crawl-optimization Scheduling a recrawl (Auerbach) – http://www.google.co.uk/patents/US8386459 Scheduler for search engine crawler (Zhu et al) – http://www.google.co.uk/patents/US8042112 Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) – https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html Crawl Data Aggregation Propagation (Mueller) – https://goo.gl/1pToL8 Matt Cutts Interviewed By Eric Enge – https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/ Web Promo Q and A with Google’s Andrey Lipattsev – https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/ Google Number 1 SEO Advice – Be Consistent – https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
  • 82. REFERENCES Internet Live Stats – http://www.internetlivestats.com/total-number-of-websites/ Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) – https://www.google.com/patents/US8707313 Managing items in crawl schedule – Google Patent (Alpert) – http://www.google.ch/patents/US8666964 Document reuse in a search engine crawler – Google Patent (Zhu et al) – https://www.google.com/patents/US8707312 Web crawler scheduler that utilizes sitemaps (Brawer et al) – http://www.google.com/patents/US8037054 Distributed crawling of hyperlinked documents (Dean et al) – http://www.google.co.uk/patents/US7305610 Minimizing visibility of stale content (Carver) – http://www.google.ch/patents/US20130226897