Sasconbeta 2015 Dawn Anderson - Talk To The Spider

Googlebot has been put on a diet of URLs by the Scheduler in the web crawling system. If your URLs are not on the list, Googlebot is not coming in. The increasing influx of content flooding the internet means that crawling has to be prioritised across the web pages and files visited. Are you telling Googlebot and the URL Scheduler, via technical SEO and architectural blunders, that your content is less important than it really is? There's a real need to understand Googlebot's persona and that of the Scheduler, along with the jobs they do, in order to 'talk to the spider' and gain more from your time with it.


1. TALK TO THE SPIDER – Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas, And How To Train Them. Dawn Anderson @ dawnieando
2. THE KEY PERSONAS
   - 9 types of Googlebot
   - SUPPORTING ROLES: Indexer / Ranking Engine, The URL Scheduler, History Logs, Link Logs, Anchor Logs
3. GOOGLEBOT'S JOBS
   - 'Ranks nothing at all'
   - Takes a list of URLs to crawl from the URL Scheduler
   - Job varies based on 'bot' type
   - Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
   - Makes notes of outbound linked pages and additional links for future crawling
   - Takes notes of 'hints' from the URL Scheduler when crawling
   - Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary data equivalent of web content) for comparison with past visits by the history and link logs
4. ROLES – MAJOR PLAYERS – A 'BOSS': THE URL SCHEDULER
   - Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system
   - Schedules Googlebot visits to URLs
   - Decides which URLs to 'feed' to Googlebot
   - Uses data from the history logs about past visits
   - Assigns visit regularity of Googlebot to URLs
   - Drops 'hints' to Googlebot to guide on types of content NOT to crawl, and excludes some URLs from schedules
   - Analyses past 'change' periods and predicts future 'change' periods for URLs for the purposes of scheduling Googlebot visits
   - Checks 'page importance' in scheduling visits
   - Assigns URLs to 'layers / tiers' for crawling schedules
5. TOO MUCH CONTENT
   - The indexed Web contains at least 4.73 billion pages (13/11/2015)
   - [Chart: total number of websites, 2000–2014, rising towards 1,000,000,000]
   - SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
6. TOO MUCH CONTENT – How have search engines responded?
   - Capacity limits on Google's crawling system
   - By prioritising URLs for crawling
   - By assigning crawl period intervals to URLs
   - By creating work 'schedules' for Googlebots
7. GOOGLE CRAWL SCHEDULER PATENTS – Include:
   - 'Managing items in a crawl schedule'
   - 'Scheduling a recrawl'
   - 'Web crawler scheduler that utilizes sitemaps from websites'
   - 'Document reuse in a search engine crawler'
   - 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
   - 'Scheduler for search engine'
8. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT) – 3 layers / tiers
   - Real Time Crawl: crawled multiple times daily
   - Daily Crawl: crawled daily or bi-daily
   - Base Layer Crawl: split into segments on random rotation; crawled least, on a 'round robin' basis – only the 'active' segment is crawled
   - URLs are moved in and out of layers based on past visits data
9. GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET
   - The URL Scheduler controls the meal planner
   - Carefully controls the list of URLs Googlebot visits
   - The Scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
   - 'Budgets' are allocated
10. CRAWL BUDGET
   - WHAT IS A CRAWL BUDGET? An allocation of 'crawl visit frequency' apportioned to URLs on a site
   - Apportioned by the URL Scheduler to Googlebots
   - Roughly proportionate to Page Importance (link equity) & speed
   - Pages with a lot of healthy links get crawled more (can include internal links?)
   - But there are other factors affecting frequency of Googlebot visits aside from importance / speed
   - The vast majority of URLs on the web don't get a lot of budget allocated to them
11. HINTS & CRITICAL MATERIAL CONTENT CHANGE
   - Change score: C = sum for i = 0 to n-1 of (weight_i * feature_i)
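   A reading of that formula typeset in LaTeX. The slide does not name the individual features, so the examples in the comments are assumptions about the kinds of signals the scheduling patents describe rather than a definitive list:

   % Weighted-sum 'critical material content change' score, as the slide's formula reads
   C = \sum_{i=0}^{n-1} \mathrm{weight}_i \cdot \mathrm{feature}_i
   % where each feature_i is a measurable aspect of how the page changed between crawls
   % (hypothetically: the proportion of visible text that changed, or changes to links
   % and navigation), and weight_i reflects how strongly that aspect indicates a change
   % that matters to searchers.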
12. POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
   - Current capacity of the web crawling system is high
   - Your URL is 'important'
   - Your URL is in the real time crawl, the daily crawl or an 'active' base layer segment
   - Your URL changes a lot, with critical material content change
   - Probability and predictability of critical material content change is high for your URL
   - Your website speed is fast and Googlebot gets the time to visit your URL
   - Your URL has been 'upgraded' to a daily or real time crawl layer
13. NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
   - Current capacity of the web crawling system is low
   - Your URL has been detected as a 'spam' URL
   - Your URL is in an 'inactive' base layer segment
   - Your URLs are 'tripping hints' built into the system to detect dynamic content with non-critical change
   - Probability and predictability of critical material content change is low for your URL
   - Your website speed is slow and Googlebot doesn't get the time to visit your URL
   - Your URL has been 'downgraded' to an 'inactive' base layer segment
   - Your URL has returned an 'unreachable' server response code recently
14. IT'S NOT JUST ABOUT 'FRESHNESS'
   - It's about the probability & predictability of future 'freshness'
   - BASED ON DATA FROM THE HISTORY LOGS – HOW CAN WE INFLUENCE THEM TO ESCAPE THE BASE LAYER?
15. CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER – LIKES & DISLIKES
   LIKES:
   - Going 'where the action is' in sites
   - The 'need for speed'
   - Logical structure
   - Correct 'response' codes
   - XML sitemaps
   - Successful crawl visits
   - 'Seeing everything' on a page
   - Taking 'hints'
   - Clear, unique, single 'URL fingerprints' (no duplicates)
   - Predicting likelihood of 'future change'
   DISLIKES:
   - Slow sites
   - Too many redirects
   - Being bored (meh) ('hints' are built in by the search engine systems – it takes 'hints')
   - Being lied to (e.g. on XML sitemap priorities)
   - Crawl traps and dead ends
   - Going round in circles (infinite loops)
   - Spam URLs
   - Crawl-wasting minor content change URLs
   - 'Hidden' and blocked content
   - Uncrawlable URLs
   CHANGE IS KEY:
   - Not just any change – critical material change
   - Predicting future change
   - Dropping 'hints' to Googlebot
   - Sending Googlebot where 'the action is'
16. FIND GOOGLEBOT – AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB
   grep Googlebot access_log > googlebot_access.txt
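   A minimal sketch of the cron-driven retrieval the slide describes, assuming SSH access to the web server and an Apache-style access_log; the host name, paths and schedule are placeholders to adapt, not prescribed values:

   #!/bin/sh
   # fetch_googlebot_log.sh - pull the latest access log and keep only Googlebot lines
   scp user@www.example.com:/var/log/apache2/access_log /tmp/access_log
   grep Googlebot /tmp/access_log > googlebot_access.txt

   # crontab entry to run it every night at 02:00:
   # 0 2 * * * /home/user/fetch_googlebot_log.sh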
17. LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT. PREPARE TO BE HORRIFIED
   - Incorrect URL header response codes (e.g. 302s)
   - 301 redirect chains
   - Old files or XML sitemaps left on the server from years ago
   - Infinite / endless loops (circular dependency)
   - On parameter-driven sites, URLs crawled which produce the same output
   - URLs generated by spammers
   - Dead image files being visited
   - Old CSS files still being crawled
   - Identify your 'real time', 'daily' and 'base layer' URLs – ARE THEY THE ONES YOU WANT THERE? (a few starter one-liners follow below)
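   A few one-liners for a first pass over that filtered log, assuming the common Apache/nginx 'combined' log format (request path in field 7, status code in field 9); adjust the field numbers if your format differs:

   # Status codes Googlebot is actually receiving - watch for 302s, 404s and 5xx
   awk '{print $9}' googlebot_access.txt | sort | uniq -c | sort -rn

   # URLs Googlebot hits most often - your de facto 'real time' / 'daily' candidates
   awk '{print $7}' googlebot_access.txt | sort | uniq -c | sort -rn | head -50

   # Crawl volume per day - is Googlebot visiting more or less over time?
   awk -F'[' '{print substr($2, 1, 11)}' googlebot_access.txt | sort | uniq -c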
18. FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE. TECHNICAL 'FIXES':
   - Speed up your site
   - Implement compression, minification, caching
   - Fix incorrect header response codes
   - Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
   - Use absolute rather than relative internal links
   - Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
   - Ensure no CSS or JavaScript files are blocked from crawlers
   - Unpick 301 redirect chains (a quick command-line check follows below)
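   Two spot checks for the header and redirect items above; the URLs are placeholders for pages on your own site:

   # Print the status line and Location header for every hop - more than one
   # 301 in a row is a redirect chain worth unpicking
   curl -sIL https://www.example.com/old-page/ | grep -iE '^(HTTP/|location:)'

   # Confirm compression is actually served when the client asks for it
   curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' https://www.example.com/ | grep -i 'content-encoding'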
19. FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET
   - Minimise 301 redirects
   - Minimise canonicalisation
   - Use 'if modified' headers on low importance 'hygiene' pages (see the check below)
   - Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
   - Noindex low search volume or near-duplicate URLs (use the noindex directive in robots.txt)
   - Use 410 'gone' headers on dead URLs liberally
   - Revisit the .htaccess file and review legacy pattern-matched 301 redirects
   - Combine CSS and JavaScript files
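   Quick checks that the conditional-request and 410 suggestions are in place; again, the URLs and the example date are placeholders:

   # A 304 here means the page can be revalidated cheaply instead of re-downloaded
   curl -s -o /dev/null -w '%{http_code}\n' \
     -H 'If-Modified-Since: Mon, 01 Jun 2015 00:00:00 GMT' \
     https://www.example.com/terms-and-conditions/

   # Retired content should answer 410 Gone, not 200 or a redirect to the homepage
   curl -s -o /dev/null -w '%{http_code}\n' https://www.example.com/expired-auction-12345/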
20. TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
   EMPHASISE PAGE IMPORTANCE – BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES:
   - Revisit 'votes for self' via internal links in GSC
   - Clear 'unique' URL fingerprints
   - Use XML sitemaps for your important URLs (don't put everything on them)
   - Use 'mega menus' (very selectively) to key pages
   - Use 'breadcrumbs' (for hierarchical structure)
   - Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular', 'related' internal linking to key pages
   - Consolidate (merge) important but similar content (e.g. merge FAQs)
   - Consider flattening your site structure so 'importance' flows further
   - Reduce internal linking to low priority URLs
   TRAIN ON CHANGE – GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE:
   - Not just any change – critical material change
   - Keep the 'action' in the key areas – NOT JUST THE BLOG
   - Use relevant supplementary content to keep key pages 'fresh'
   - Remember the negative impact of 'crawl hints'
   - Regularly update key content
   - Consider 'updating' rather than replacing seasonal content URLs
   - Build 'dynamism' into your web development (sites that 'move' win)
21. TOOLS YOU CAN USE
   - SPEED: YSlow, Pingdom, Google Page Speed Tests, minification (JS Compress and CSS Minifier), image compression (Compressjpeg.com, tinypng.com)
   - SPIDER EYES: GSC Crawl Stats, Deepcrawl, Screaming Frog, server logs, SEMRush (auditing tools), Webconfs (header responses / similarity checker), Powermapper (bird's-eye view of site)
   - URL IMPORTANCE: GSC Internal Links report (URL importance), Link Research Tools (strongest sub pages reports), GSC Internal Links (add site categories and sections as additional profiles), Powermapper
   - SAVINGS & CHANGE: GSC Index levels (over-indexation checks), GSC Crawl Stats, last-accessed tools (versus competitors), server logs, Webmaster Hangout Office Hours
22. WARNING SIGNS – TOO MANY VOTES BY SELF FOR WRONG PAGES – IS THIS YOUR BLOG?? HOPE NOT
   - Most Important Page 1
   - Most Important Page 2
   - Most Important Page 3
23. WARNING SIGNS – OVER INDEXATION – FIX IT FOR A BETTER CRAWL
24. WARNING SIGNS – TAG MAN
   - Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
   - Creating 'thin' content and even more URLs to crawl
   - Image credit: Buzzfeed
25. GOOGLE THINKS SO
26. REMEMBER
   - "Googlebot's on a strict diet"
   - "Make sure the right URLs get on the menu"
   Dawn Anderson @ dawnieando
