SEO for Large/Enterprise Websites - Data & Tech Side

1. Doing SEO for large websites. Working on large websites, or a large number of websites. Let's talk about SEO at scale, in the enterprise.
2. 31 m
3. 1.8m & 220kg
4. 17x larger
5. 4,913x heavier
6. 1,084 T
7. x2
8. x2 x2
9. x2 x3 x2
10. SLOWER DIFFICULT TO WORK WITH
11. Working in a large organisation Working with data Technical Foundation Minimising Risk Scaling Content Reporting
12. Working in a large organisation Scaling Content Reporting Working with data Technical Foundation Minimising Risk
13. Working in a large organisation Scaling Content Working with data Technical Foundation Reporting Minimising Risk
14. Templates
15. Getting (& processing) data
16. Finding technical issues
17. Preventing technical issues
18. Templates
19. I would like a 1000 problems please.
20. “Please fix all 18,304 pages”
21. LIES
22. LIES 5 6
23. Category Home page Product Contact Us Obviously different
24. Small product number Main category page Out of stock product Extremes
25. Facet category page Reviews Page 2 Same page different URL
26. Country County City Area/District Street
27. Getting (& processing) data
28. Impressions week by week for new content
29. Pre change Post change Clicks pre and post change for site sections
30. Competing pages for a set of terms
31. SLOWER DIFFICULT TO WORK WITH SAMPLING
32. SLOWER DIFFICULT TO WORK WITH SAMPLING LIMITS
33. 1,000 rows at a time
34. SLOWER DIFFICULT TO WORK WITH SAMPLING LIMITS LAG
35. SLOWER DIFFICULT TO WORK WITH SAMPLING LIMITS LAG SEGMENTATION
36. Search console properties for a large brand.
37. Register all the things.
38. 5 sub-folders provided 260% more keywords
39. Part 1: Data Studio Part 2: Day by day data Part 3: Python Part 4: Data warehousing Get Get, Analyse Get, Store, Analyse, Report
40. Part 1: Data Studio Part 2: Day by day data Part 3: Python Part 4: Data warehousing
41. Data studio for extracting data ● Add a data source ● Create a table for it. ● Download the table. With both GA & GSC, you'll get everything in the table, no paginating.
42. Part 1: Data Studio Part 2: Day by day data Part 3: Python Part 4: Data warehousing
43. Day by day data To get even more data we have to get it day by day. ● bit.ly/search-console-data-downloader This bit is Search Console only.
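The linked downloader handles this for you; purely as a sketch of the idea, a day-by-day loop against the Search Console API looks roughly like this. The key file name, property URL and date range are made up, and on big sites you would also want to paginate each day with startRow.

    import datetime
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    # Assumes a service account that has been granted access to the property;
    # the key file and property URL below are hypothetical.
    credentials = service_account.Credentials.from_service_account_file(
        'service-account.json',
        scopes=['https://www.googleapis.com/auth/webmasters.readonly'])
    service = build('searchconsole', 'v1', credentials=credentials)

    SITE = 'https://www.example.com/'
    start, end = datetime.date(2021, 1, 1), datetime.date(2021, 1, 31)

    all_rows = []
    day = start
    while day <= end:
        body = {
            'startDate': day.isoformat(),
            'endDate': day.isoformat(),   # one day per request
            'dimensions': ['page', 'query'],
            'rowLimit': 25000,
        }
        response = service.searchanalytics().query(siteUrl=SITE, body=body).execute()
        for row in response.get('rows', []):
            all_rows.append({
                'date': day.isoformat(),
                'page': row['keys'][0],
                'query': row['keys'][1],
                'clicks': row['clicks'],
                'impressions': row['impressions'],
            })
        day += datetime.timedelta(days=1)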
44. Part 1: Data Studio Part 2: Day by day data Part 3: Python Part 4: Data warehousing
45. Getting data from APIs Pull down your analytics data. ● Daily_google_analytics_v3 ● Getting search console data from the API
46. Getting data from APIs Pull down your analytics data. ● Daily_google_analytics_v3 ● Getting search console data from the API Getting started with pandas: ● Pandas tutorial with ranking data
47. Getting data from APIs Pull down your analytics data. ● Daily_google_analytics_v3 ● Getting search console data from the API Getting started with pandas: ● Pandas tutorial with ranking data As a workflow I'd highly recommend Jupyter notebooks for getting started. ● Why use jupyter notebooks? ● SearchLove Video (paid)
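To give a flavour of the pandas workflow those tutorials cover, here is a minimal sketch. It assumes `all_rows` is the list of dicts built in the day-by-day sketch above; a CSV export from any of the linked tools would work the same way.

    import pandas as pd

    df = pd.DataFrame(all_rows)          # or: df = pd.read_csv('gsc_export.csv')
    df['date'] = pd.to_datetime(df['date'])

    # Weekly clicks per page - the kind of "week by week" view shown earlier.
    weekly = (df.groupby([pd.Grouper(key='date', freq='W'), 'page'])['clicks']
                .sum()
                .reset_index())
    print(weekly.sort_values('clicks', ascending=False).head(20))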
48. SEO Pythonistas A memorial and soon-to-be collection of Hamlet's excellent work. SEO Pythonistas - In loving memory of Hamlet Batista @DataChaz
49. Part 1: Data Studio Part 2: Day by day data Part 3: Python Part 4: Data warehousing
50. Analyse Store data Get data Report
51. Analyse Store data Get data Report Takes time & space.
52. Analyse Store data Get data Report Takes time & space.
53. A developer could do it.
54. Rolling your own JC Chouinard has built a series of excellent granular tutorials which walk you through setting up one on your own machine. Link.
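Whatever warehouse you pick, the "store" step is just getting a table of rows in. A minimal sketch with BigQuery, assuming the google-cloud-bigquery client is authorised (e.g. via GOOGLE_APPLICATION_CREDENTIALS) and the dataset already exists; the project, dataset and table names are made up, and `df` is the DataFrame from the pandas sketch above.

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = 'my-project.seo_warehouse.gsc_daily'   # hypothetical table

    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # block until the load job finishes
    print(client.get_table(table_id).num_rows, 'rows in', table_id)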
55. Off the shelf Get in touch with me! I run Piped Out which is software for building SEO data warehouses.
56. Finding technical issues
57. Part 1: Templates Part 2: Logs Part 3: Crawling Big
58. Part 1: Templates Part 2: Logs Part 3: Crawling Big
59. Not the same fields as a crawl. No page title for example.
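Logs give you the request line, timestamp, status code, response size, referrer and user agent rather than on-page elements. A minimal sketch of pulling those fields out of a combined-format access log line; formats vary by server and CDN, so treat the pattern as a starting point, and note that properly verifying Googlebot needs a reverse DNS check rather than the user-agent shortcut used here.

    import re

    # One made-up line in a fairly standard "combined" log format.
    line = ('66.249.66.1 - - [10/Oct/2019:13:55:36 +0000] '
            '"GET /product/big-brown-shoe HTTP/1.1" 200 5124 '
            '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

    pattern = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
    )

    m = pattern.match(line)
    if m and 'Googlebot' in m.group('ua'):
        print(m.group('time'), m.group('status'), m.group('path'))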
60. ● Crawling & indexing problems
61. ● Crawling & indexing problems ● Measuring freshness
62. Time until article crawled
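One way to measure that, sketched here with made-up data: take publish timestamps from your CMS or news sitemaps, take the first Googlebot hit per URL from the parsed logs, and diff the two.

    import pandas as pd

    # Hypothetical inputs: publish times per path, and parsed Googlebot log hits.
    published = pd.DataFrame({
        'path': ['/news/article-1', '/news/article-2'],
        'published_at': pd.to_datetime(['2019-10-01 08:00', '2019-10-01 09:30']),
    })
    googlebot_hits = pd.DataFrame({
        'path': ['/news/article-1', '/news/article-1', '/news/article-2'],
        'time': pd.to_datetime(['2019-10-01 10:15', '2019-10-02 01:00', '2019-10-03 12:00']),
    })

    first_crawl = (googlebot_hits.groupby('path', as_index=False)['time'].min()
                                 .rename(columns={'time': 'first_crawled_at'}))
    freshness = published.merge(first_crawl, on='path', how='left')
    freshness['time_to_crawl'] = freshness['first_crawled_at'] - freshness['published_at']
    print(freshness[['path', 'time_to_crawl']])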
63. ● Crawling & indexing problems ● Measuring freshness ● Prioritisation
64. ● Crawling & indexing problems ● Measuring freshness ● Prioritisation ● Monitoring website changes (e.g. migrations)
65. Chart: status codes on product pages, Apr '19 to Oct '19 (200 / 301 / 302)
66. Chart: status codes on product pages, Apr '19 to Oct '19 (200 / 301 / 302), built in ELK
67. ● Crawling & indexing problems ● Measuring freshness ● Prioritisation ● Monitoring website changes (e.g. migrations) ● Debugging
68. Hi x

I'm {x} from {y} and we've been asked to do some log analysis to understand better how Google is behaving on the website. I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).

What time period do we want?
What we'd ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they're spending their time, the status code errors they're finding etc. We can absolutely do analysis with a month or so (we've even done it with just a week or two), but it means we lose historical context and obviously we're more likely to miss things on a larger site.

There are also some things that are really helpful for us to know when getting logs.

Do the logs have any personal information in them?
We're only concerned with the various search crawler bots like Google and Bing; we don't need any logs from users, so any logs with emails, telephone numbers etc. can be removed.

Can we get logs from as close to the edge as possible?
It's pretty likely you've got a couple of different layers of your network that might log. Ideally we want logs from as close to the edge as possible. This prevents a couple of issues:
● If you've got caching going on, like a CDN or Varnish, and we get logs from after them, we won't see any of the requests they answer.
● If you've got a load balancer distributing to several servers, sometimes the external IP gets lost (perhaps X-Forwarded-For isn't working), which we need to verify Googlebot, or we accidentally only get logs from a couple of servers.

Are there any sub-parts of your site which log to a different place?
Have you got anything like an embedded WordPress blog which logs to a different location? If so, we'll need those logs as well. (Although of course if you're sending us CDN logs this won't matter.)

How do you log hostname and protocol?
It's very helpful for us to be able to see hostname & protocol. How do you distinguish those in the log files? Do you log HTTP & HTTPS to separate files? Do you log hostname at all? This is one of the problems that's often solved by getting logs closer to the edge: while many servers won't give you those by default, load balancers and CDNs often will.

Where would we like the logs?
In an ideal world, they would be files in an S3 bucket and we can pull them down from there. If possible, we'd also ask that multiple files aren't zipped together for upload, because that makes processing harder. (No problem with compressed logs, just not zipping multiple log files into a single archive.)

Is there anything else we should know?

Best,
{x}
69. Part 1: Templates Part 2: Logs Part 3: Crawling Big
70. Sampling your crawl ● Limit your crawl percentage per template, e.g. ● 20% of product pages ● 30% of category pages
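If you're assembling the sample yourself (for example, to feed a crawler a list of URLs), per-template sampling can be as simple as the sketch below; the template patterns, percentages and URLs are made up for illustration.

    import random
    import re

    # Hypothetical template rules: a URL pattern per template and the fraction
    # of matching URLs to keep in the crawl.
    RULES = [
        (re.compile(r'^/product/'), 0.20),   # 20% of product pages
        (re.compile(r'^/category/'), 0.30),  # 30% of category pages
    ]

    def keep(path, default_fraction=1.0):
        """Decide whether a URL path makes it into the sampled crawl list."""
        for pattern, fraction in RULES:
            if pattern.match(path):
                return random.random() < fraction
        return random.random() < default_fraction

    # e.g. `all_paths` could come from sitemaps or a previous crawl export.
    all_paths = ['/product/big-brown-shoe', '/category/shoes', '/about-us']
    sample = [path for path in all_paths if keep(path)]
    print(sample)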
71. Low memory crawler Runs locally on your machine and allows you to crawl with a very low memory footprint. Doesn't render JS or process data however.
72. Run SF in the cloud You can purchase a super high memory computer in the cloud, install SF on it and run it at maximum speed.
73. Preventing technical issues
74. Search console properties for a large brand.
75. Part 1: Manually crawling Part 2: Automating assertions Part 3: Unit testing
76. Change detection with SF
77. Change detection with SF
78. Part 1: Manually crawling change detection Part 2: Automating assertions Part 3: Unit testing
79. <meta name="robots" content="noindex">
80. <meta name="robots" content="noindex,nofollow"> <meta name="robots" content="noindex">
81. Is it different?
82. Is it the value I want? Is it different?
83. <meta name="robots" content="noindex,nofollow"> <meta name="robots" content="noindex">
84. Assertions per element:
Element | Equals
Title | Big Brown Shoe - £12.99 - Example.com
Status Code | 200
H1 | Big Brown Shoe
Canonical | <link rel="canonical" href="https://example.com/product/big-brown-shoe" />
CSS Selector: #review-counter | Any number
CSS Selector: #product-data | { "@context": "https://schema.org/", "@type": "Product", "name": "Big Brown Shoe", "description": "The biggest brownest shoe you can find.", "sku": "0446310786", "mpn": "925872" }
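A rough sketch of checking a table like this with requests and BeautifulSoup; the URL and expected values are the example ones above, and a real set-up would read them from a config or spreadsheet rather than hard-coding them.

    import requests
    from bs4 import BeautifulSoup

    URL = 'https://example.com/product/big-brown-shoe'  # hypothetical template URL

    response = requests.get(URL, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')

    canonical = next((link.get('href') for link in soup.find_all('link')
                      if 'canonical' in (link.get('rel') or [])), None)
    review_counter = soup.select_one('#review-counter')

    checks = {
        'status code is 200': response.status_code == 200,
        'title': soup.title is not None
                 and soup.title.get_text(strip=True) == 'Big Brown Shoe - £12.99 - Example.com',
        'h1': soup.h1 is not None and soup.h1.get_text(strip=True) == 'Big Brown Shoe',
        'canonical': canonical == 'https://example.com/product/big-brown-shoe',
        'review counter is any number': review_counter is not None
                                        and review_counter.get_text(strip=True).isdigit(),
    }

    for name, passed in checks.items():
        print('PASS' if passed else 'FAIL', '-', name)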
85. Asserting with Google sheets
86. Asserting with Google sheets
87. Part 1: Manually crawling Part 2: Automating assertions Part 3: Unit testing
88. Unit tests
89. Create code Test code Deployment
90. Create code Test code Deployment All our hard work.
91. Create code Test code Deployment All our hard work.
92. Create code Test code Deployment
93. endtest.io
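Whether that test step is a service like endtest.io or a handful of pytest checks in CI, the idea is the same: assert the SEO-critical bits of each template before the deploy goes out. A minimal, illustrative sketch; the staging URLs are made up and the substring-based noindex check is deliberately crude.

    import pytest
    import requests

    # Hypothetical staging URLs and expectations; in a real pipeline this would
    # run against the build before deployment.
    PAGES = {
        'https://staging.example.com/product/big-brown-shoe': {'status': 200, 'indexable': True},
        'https://staging.example.com/search?q=shoes': {'status': 200, 'indexable': False},
    }

    @pytest.mark.parametrize('url,expected', list(PAGES.items()))
    def test_template_seo_basics(url, expected):
        response = requests.get(url, timeout=30)
        assert response.status_code == expected['status']
        # Crude check: treat any "noindex" in the HTML as a noindexed page.
        noindexed = 'noindex' in response.text.lower()
        assert noindexed != expected['indexable'], f'{url}: indexability looks wrong'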
94. Conclusions
95. @dom_woodman bit.ly/seo-for-large-websites www.pipedout.com @dom_woodman
