How easy is it to crawl the Australian web graph - or, in other words, crawl all Australian sites? Frank has set himself this challenge and in his talk he will cover web crawling in depth, as well as a number of interesting findings and trends about the Australian web market that he came across along the way.
3. @frankseo
frank@orchidbox.com
Overview
The basic techniques of web crawling
Backlink tools - Moz, Ahrefs and Majestic SEO
Outreachr.com – the tool
An Australian challenge
Insights into the Outreachr database
Owning the data – what you can do with it
Take-aways
Web crawling – an introduction
- A web crawler is a computer program that browses the web in a methodical and automated manner.
- They are called crawlers because they crawl through a site one page at a time, following the links to other pages on the site until all pages have been read.
- All major search engines and SEO tools deploy crawlers, also known as "spiders" or "bots".
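The page-by-page process described above can be sketched in a few lines of Python. This is only an illustration, not the crawler used for the talk: the three-page site in PAGES is hypothetical and stands in for real HTTP requests, and the names LinkExtractor and crawl_site are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, fetch):
    """Read every reachable page on start_url's host, one page at a time.

    `fetch` is any callable that returns the HTML for a URL, or None.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or not fetchable
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Stay on the same site and never queue a page twice.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site (assumption, replaces real HTTP requests).
PAGES = {
    "http://example.com.au/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://example.com.au/about": '<a href="/">Home</a> <a href="http://other.example/">Out</a>',
    "http://example.com.au/news": '<a href="/about#team">Team</a>',
}

print(crawl_site("http://example.com.au/", PAGES.get))
```

A real crawler would also need politeness delays, robots.txt handling and error handling, all of which are left out here.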
Web crawling – an introduction
Breadth First Search
• BFS begins at a root node and inspects all neighbouring nodes.
• For each neighbour node in turn, it inspects the neighbouring nodes that have not yet been visited, and so on.
• Assumption: if we start with "good" pages, this keeps us close to other good pages.
• Variations of this algorithm are more memory efficient and popular in computing.
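A minimal BFS sketch in Python, assuming the link graph is already available as a dictionary of node to neighbours. The GRAPH below and the name bfs_order are hypothetical. The depth values show how many hops each page is from the seed, which is the sense in which BFS "stays close" to the good starting pages.

```python
from collections import deque


def bfs_order(graph, root):
    """Return (visit order, hops from root) for a breadth-first traversal."""
    order = [root]
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in depth:  # unvisited neighbour
                depth[neighbour] = depth[node] + 1
                order.append(neighbour)
                queue.append(neighbour)
    return order, depth


# Hypothetical link graph: each key links to the pages in its list.
GRAPH = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}

order, depth = bfs_order(GRAPH, "seed")
print(order)  # nodes in the order BFS reaches them
print(depth)  # number of hops from the seed to each node
```

Every node at one hop from the seed is visited before any node at two hops, so stopping the crawl early still leaves a set of pages near the seeds.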
Web Crawling – An Introduction
Depth First Search
• Invented in the 19th century by French mathematician Charles Pierre Trémaux (as a strategy for solving mazes).
• Algorithm for traversing or searching tree or graph data structures.
• Starts at the root and explores as far as possible along each branch before backtracking.
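For comparison, a minimal recursive DFS sketch in Python over a small hypothetical link graph (GRAPH and dfs_order are names made up for this example). It follows one branch to its end before backtracking to try the next one.

```python
def dfs_order(graph, root):
    """Return the nodes in the order a depth-first traversal visits them."""
    order = []
    visited = set()

    def visit(node):
        visited.add(node)
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visit(neighbour)  # go deeper before trying the next neighbour

    visit(root)
    return order


# Hypothetical link graph: each key links to the pages in its list.
GRAPH = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}

print(dfs_order(GRAPH, "seed"))
```

On a graph the size of the web, a recursive version like this would hit Python's recursion limit. An explicit stack is the usual replacement.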
Why?
- Be more efficient in finding the right sites for our clients
- Speed up the contact process
- Outsource some of the most repetitive work (e.g. sending emails/filling contact forms)
- Work for various clients in various languages
- Codebase ownership = freedom to run custom campaigns
- We don’t want to piss people off! We keep a historical index of who we have contacted in the past.
Outreachr.com - how we do it
Discovery (engine scraping, Twitter, own index)
Get SEO stats (Moz & PR)
Social
Contact extraction (crawling sites, Whois data)
Sorting algorithm
New campaign queries
The Australian web graph is hard to crawl
Step 1 - We started with a small, tight seed set (abc.net.au, news.com.au, theaustralian.com.au and other popular Australian news sites).
After obtaining over 1M URLs and analysing over 8M links, we had found only 90,000 unique domains, out of 2.4M registered .au domains.
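To put those numbers in proportion, here is a quick back-of-the-envelope calculation in Python. It treats "over 1M" and "over 8M" as round lower bounds and assumes all 90,000 discovered domains are .au domains, which the slide does not state, so the results are rough.

```python
# Figures quoted on the slide; "over 1M" and "over 8M" are treated as
# round lower bounds, so the derived numbers are approximate.
urls_crawled = 1_000_000
links_analysed = 8_000_000
domains_found = 90_000
registered_au_domains = 2_400_000

# Assumption: every discovered domain is a registered .au domain.
coverage = domains_found / registered_au_domains
links_per_url = links_analysed / urls_crawled
links_per_domain = links_analysed / domains_found

print(f"Share of registered .au domains discovered: {coverage:.2%}")
print(f"Links analysed per URL crawled: {links_per_url:.0f}")
print(f"Links analysed per unique domain found: {links_per_domain:.0f}")
```

Under these assumptions the crawl reached under 4% of registered .au domains, and it took roughly 89 analysed links to turn up each unique domain.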
Take-aways
- If you want to do outreach in Australia, you probably need to be on Twitter.
- The top Aussie sites are aggregators (products, reviews or local business) - get listed to increase visibility.
- You are already lucky! You don’t need to work to get as many root domains as you would in other countries like the UK.
- Use a range of tools, including Open Site Explorer, ahrefs.com and MajesticSEO, to check backlink profiles, as no single tool seems to do a great job at indexing the Australian subnet.
- You need a .com.au to rank in Australia. 19% are .com, but usually with an Australian subdomain (e.g. au.domain.com).