These are the slides from the presentation by Tyler McConville, our CEO and head of SEO, at Neil Patel's Advanced SEO Summit 2017.
Video can be found here: https://www.crowdcast.io/e/advanced-seo-summit/9
2. Why Should you care?
What you get out of this talk
D Ability to understand how crawl bots function and view your website
D Ability to decrypt and understand bot behaviour and triggers
D Prioritize pages and content, as well as identify roadblocks
3. [~]$ whoami
Tyler McConville
D CEO & Co-Founder of NAV43
D Technical Search Engine Optimizer
D 7 YEAR SERP JOURNEY
D Survivor of 2013
Those who have fallen… in the SERPs
4. Agenda
D A very short understanding of how Google works.
D A background on Crawl Bot duties
D Tracking those f#@kers!
D The Secret Data
D Key Take-Aways
D Questions anyone?
5. Disclaimer
This is NOT an exploit resource
D It’s just an understanding from tests ヽ༼ຈل͜ ຈ༽ノ
D …and some implementation specific oddities
Google has done nothing [especially] wrong
D To the contrary, their bots are quite organized
Modifying your server and misdirecting Google bots can be damaging
D If not done correctly. You have been forewarned…
Duplicating this is NOT guaranteed to rank you
D I’m looking at you… Understand the concepts before implementing them.
8. Google Bots and Duties
Google Bots a brief explanation
D Crawler -> A discovery program for Google!
D A bot that mines “meta-data” and organizes “relationship” mapping
D Technically a robust scraper running within a cluster
D Spoiler: Google Crawl bots aren’t intelligent!
List of Google Crawl Bots and functions
D Desktop
  D Standard website scraper
D Smartphone
  D Standard scraper with mobile rules
D Image
  D Image check & meta-data gatherer
D Video
  D Video render & processing
D News
D App
9. Removing the Distortion
What we want to do
D Isolate Google crawlers from Users!
What to actually do
D Mine server logs and compile a repository
D Recommended: keep an upper limit of roughly 2 months of crawl logs
D When shipping logs, send them in an encrypted state to the ELK stack on the network
D Basically, keeping your info… yours. A logical first step.
What implementation will also do
D Store symmetric crawl schedules (so you can find the otherwise random-looking actions)
D Give real-time feedback on crawl errors and application issues
“ Find bots, you must”
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics
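For reference, a minimal Python sketch of what "finding bots" in a combined-format access log can look like. The regex, the `access.log` path, and the user-agent check are illustrative assumptions, not the exact setup used in the talk; Google's documented verification also does a forward DNS lookup on the returned hostname.

```python
import re
import socket

# Apache/Nginx "combined" log format: IP, identity, user, [time], "request",
# status, bytes, "referrer", "user agent" -- matching the sample lines above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def is_googlebot(ip, agent):
    """User-agent check plus reverse DNS lookup on the requesting IP."""
    if "Googlebot" not in agent:
        return False
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

with open("access.log") as log:                 # hypothetical log path
    for line in log:
        match = LOG_PATTERN.match(line)
        if match and is_googlebot(match["ip"], match["agent"]):
            print(match.groupdict())            # one candidate Googlebot hit per line
```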
10. Shipping Logs.. WOOHOO
MUST FIND: GBOT
SERVER: Here’s everything, BABY.
*Command to slice the server log file for shipping: `split -l 200000 logfile.log`
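A hedged sketch of the shipping step: slicing the log in Python (equivalent to the `split` command above) and posting each chunk over HTTPS so it travels in an encrypted state on its way to the ELK stack. The URL assumes a Logstash `http` input listening at that address; host, port, and chunk size are placeholders.

```python
from itertools import islice

import requests

LOGSTASH_URL = "https://logs.example.internal:8080"   # hypothetical Logstash http input
CHUNK_LINES = 200_000                                  # mirrors `split -l 200000`

def ship_log(path):
    with open(path) as log:
        while True:
            chunk = list(islice(log, CHUNK_LINES))     # slice the file, 200k lines at a time
            if not chunk:
                break
            # HTTPS keeps each slice encrypted in transit to the ELK stack.
            requests.post(LOGSTASH_URL, data="".join(chunk), timeout=30).raise_for_status()

ship_log("logfile.log")
```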
11. ELK Stack, by the docs
[Diagram: Google Bot → webserver → ELK stack → visualization dashboard; based on https://logz.io/learn/complete-guide-elk-stack/]
12. So what just happened?
D Client connects: “Every time a client (or bot) connects to the server through Apache, an entry is formed in the Apache access log.”
D “When this is complete, the log shipper will send the sliced log files into Logstash, where they will be processed and then placed into Elasticsearch.”*
D Elasticsearch: “…provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents…”
D Kibana: “This is the final step to visualizing the data within Elasticsearch so that we can sort and compile the data we need. […] This is where it gets interesting; we will live here.”*
*Elasticsearch quoted from here: https://qbox.io/blog/what-is-elasticsearch
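To make the pipeline concrete, here is a small Python sketch of the Logstash-to-Elasticsearch step done by hand: a parsed log entry becomes a schema-free JSON document and is indexed over Elasticsearch's HTTP interface, ready for Kibana. The `crawl-logs` index name, the localhost address, and the sample values are assumptions for illustration.

```python
import json

import requests

ES_URL = "http://localhost:9200"         # default Elasticsearch HTTP port
INDEX = "crawl-logs"                      # hypothetical index name

doc = {
    "ip": "66.249.66.1",
    "timestamp": "26/Apr/2000:00:23:48 -0400",
    "request": "GET /pics/wpaper.gif HTTP/1.0",
    "status": 200,
    "bytes": 6248,
    "referrer": "http://www.jafsoft.com/asctortf/",
    "agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

# POST to /<index>/_doc indexes one JSON document; Kibana then visualizes the index.
resp = requests.post(f"{ES_URL}/{INDEX}/_doc",
                     data=json.dumps(doc),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
print(resp.json()["_id"])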
14. Mission
We want to be able to see data trends and crawl patterns from Google bots navigating
the webserver.
We want to gather any contextual information that we can use for forensic purposes,
regardless of whether or not we can accomplish the above
We (as an adversary) want to be able to map Google bots’ direct actions and compile data trends in order to predict and guide the data a Google bot digests in a crawl.
D We want to do this without manipulating Google bots… too much. ;)
18. The Data itself.
D JSON format / organized and saved
D Required for future processing
D Oh… and it can be real time ;)
Approach Premise:
D Organize Server Logs into Line-items
D Requests are relatively small
D Server logs are larger and cluttered
D … they are a “log” after all.
D Create relationships based off JSON requests
D FindAES from: http://jessekornblum.com/tools/
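As a rough illustration of the "real time" point above, this is what turning fresh access-log entries into JSON line items can look like while the server is being crawled. The file path and the crude field split are assumptions; in practice the regex parser shown earlier would do the extraction.

```python
import json
import time

def follow(path):
    """Yield new lines as they are appended to the log (a 'tail -f' in Python)."""
    with open(path) as log:
        log.seek(0, 2)                     # jump to the end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)            # wait for the next request to be logged
                continue
            yield line

# Each new request becomes one JSON line item the moment it hits the log.
for raw in follow("access.log"):
    fields = raw.split()                   # crude split; the earlier regex is better
    item = {"ip": fields[0], "timestamp": fields[3].lstrip("["), "raw": raw.strip()}
    print(json.dumps(item))                # real-time stream of line items
```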
19. Grouping Data
_New_Search
  4 Agent
  4 Time
  4/8 Response [“code”]
  4/8 Request [“URL”]
  CSslUserContext
Look complicated? Let’s go into this more at a later date.
_LOG_ITEM
  4 IP Address
  4 Request
  4 Server Response
  4/8 Referral
  4 Timestamp
  4/8 Agent
_Relationship_MAP
  4 Request URL
  4 Server Response
  4 Bot Type
  …
  4 Bytes downloaded
  ? Referral link
  ? Exit
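A rough sketch of the relationship-mapping idea: group line items by bot type and link each request to its referral, so crawl paths (and dead ends) fall out of the data. The structure mirrors the _LOG_ITEM / _Relationship_MAP grouping above, but the field names and sample values here are illustrative assumptions.

```python
from collections import defaultdict

# line_items would normally come out of Elasticsearch; three hand-written
# examples stand in for it here.
line_items = [
    {"agent": "Googlebot", "referral": "/", "request": "/services/", "response": 200, "bytes": 8130},
    {"agent": "Googlebot", "referral": "/services/", "request": "/services/seo/", "response": 200, "bytes": 6248},
    {"agent": "Googlebot", "referral": "/services/seo/", "request": "/old-page/", "response": 404, "bytes": 0},
]

relationship_map = defaultdict(list)
for item in line_items:
    # One edge per hit: referral -> request, tagged with bot type and response.
    relationship_map[item["agent"]].append({
        "from": item["referral"],
        "to": item["request"],
        "response": item["response"],
        "bytes": item["bytes"],
    })

for bot, edges in relationship_map.items():
    for edge in edges:
        flag = "  <-- dead end" if edge["response"] >= 400 else ""
        print(f'{bot}: {edge["from"]} -> {edge["to"]} [{edge["response"]}]{flag}')
```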
20. The Results
These functions do three things:
D Isolate drop-off or dead spots
D Return deep error reporting
D Check natural flow of Crawl bot types
A Complete Website Google Crawl Map
*There are many ways to visualize this.
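One way (of the many mentioned above) to pull the error and drop-off view out of the data: an Elasticsearch aggregation that counts crawl hits per URL, restricted to 4xx/5xx responses. The index name, field names, and localhost address carry over from the earlier sketches and remain assumptions.

```python
import requests

ES_URL = "http://localhost:9200"
query = {
    "size": 0,
    "query": {"range": {"status": {"gte": 400}}},          # only error responses
    "aggs": {
        "problem_urls": {"terms": {"field": "request.keyword", "size": 20}}
    },
}

resp = requests.post(f"{ES_URL}/crawl-logs/_search", json=query)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["problem_urls"]["buckets"]:
    print(f'{bucket["doc_count"]:>5}  {bucket["key"]}')     # hit count, then URL
```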
22. Pre-Planned crawl routes
D Bots are not smart
D As bots will always follow a given path that is laid out for them, we are always in control.
D Silos are not dead. Instead, they are topically focused.
D Provide a crawl path that makes sense for a user navigating your funnel. Outline relationships and target dead areas.
Bottom line:
Create absolute relationship links between pages for bots.
23. Reactive Server Actions
D Preprogrammed response actions
D You can program your server to react to requests and craft responses accordingly, e.g. using 304 response codes (a minimal sketch follows at the end of this slide).
D Addressing incorrect internal referral sources.
D Provide the crawl bot with unique meta-data within headers and avoid silly server errors.
Bottom line:
You control your server and how it behaves.
Google bots are just here for the ride.
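Purely as a toy illustration of the 304 idea above (not production advice): a tiny Python server that answers a conditional GET with 304 Not Modified when the page has not changed, so the crawler spends no budget re-downloading the body. The date and page body are placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

LAST_MODIFIED = "Wed, 01 Mar 2017 00:00:00 GMT"   # placeholder change date

class ConditionalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # If the crawler's conditional header matches, tell it nothing changed:
        # a 304 carries no body, so the bot does not re-download the page.
        if self.headers.get("If-Modified-Since") == LAST_MODIFIED:
            self.send_response(304)
            self.end_headers()
            return
        body = b"<html><body>Hello, crawler</body></html>"
        self.send_response(200)
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), ConditionalHandler).serve_forever()
```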