Pricesearcher is a vertical search engine and our mission is to give consumers the complete view. Our technology processes 500m+ prices per day across 10 countries.
2017 saw the launch of Pricesearcher’s web crawler, PriceBot, to complete the indexing of all UK prices.
In this talk we will analyse what PriceBot discovered and how this information can help you improve the crawlability of your own site.
What a search engine can teach you about product sitemaps - BrightonSEO April 2018
1. Vlassios Rizopoulos
Chief Technology Officer @ pricesearcher.com
What a search engine can teach you about product sitemaps
@Pricesearcher #BrightonSEO
5. PROGRESS TO DATE
• Gathered data on 1.1 billion products
• Online in 11 countries: GB / US / DE / FR / IT / IE / NO / SE / FI / DK / NG
• Gathered 91 billion price points for our products
• On average we check the price of a product 3 times a day
We have gathered:
• 17,000,000 ISBNs
• 144,000,000 MPNs
• 73,000,000 SKUs
• 157,000,000 GTINs
6. WHAT IS PRICEBOT?
PriceBot is our proprietary crawler, built to discover products and turn unstructured data from web pages into structured data for our product database
Pricesearcher is the only product search engine that crawls to complement its product coverage
PriceBot is fully robots.txt compliant, leaves behind a footprint in its user agent and has a built-in feedback mechanism
http://www.pricesearcher.com/pricebot
7. WHAT INFORMATION IS PRICEBOT COLLECTING?
We are looking to extract the following fields:
• Product Title
• Product Image
• Product Price
and optionally:
• Product Description
• Product Identifier (GTIN/UPC/EAN/ISBN)
• Product Brand
• Product Category
• Product Stock Availability
8. INITIAL CRAWLING TECH DEPENDED ON SITEMAPS
Vastly simplified discovering all the products from retailers
11. 1. SITEMAP DATA
• 91% of retailer websites have an XML sitemap
• 61% of retailer websites have an XML sitemap with product links
• 54% of retailer websites have an XML sitemap with product links that’s regularly updated
12. 2. BLOCKING OF CRAWLERS
• 2% of retailer websites have blocked us unintentionally (generic robots.txt entry or 403 automatic block)
• 0.05% of retailer websites have blocked us intentionally (robots.txt entry)
13. 3. EXTRACTION USING METADATA STANDARDS
• 41% of retailer websites have product title + price + image defined using meta / opengraph tags
• 36% of retailer websites have product title + price + image defined using meta / itemprop tags (schema)
• 12% of retailer websites have product title + price + image defined using both
14. 4. EXTRACTION USING JAVASCRIPT
• 2% of retailer websites: no info extracted due to heavy rendering being uneconomical
• 1% of retailer websites: price cannot be extracted as it is converted / calculated on the fly
15. 5. SITEMAP LINKS
• 2% of retailer websites have multiple links to the same product pages
• 3% of retailer websites have multiple links to pages that return 404 codes
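As an illustration of the duplicate-link finding, here is a minimal Python sketch that lists every URL appearing more than once in a sitemap. The sitemap content and the example.com URLs are made up for the example, and the helper names sitemap_locs and duplicate_locs are hypothetical. This is not how PriceBot itself works. Checking for 404s would need an HTTP request per URL and is left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sitemap content; the example.com URLs are placeholders.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/red-kettle</loc></url>
  <url><loc>https://example.com/products/blue-kettle</loc></url>
  <url><loc>https://example.com/products/red-kettle</loc></url>
</urlset>
"""

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text):
    """Return every <loc> value in a urlset sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


def duplicate_locs(locs):
    """Return the URLs that are listed more than once, sorted."""
    counts = Counter(locs)
    return sorted(url for url, n in counts.items() if n > 1)


if __name__ == "__main__":
    for url in duplicate_locs(sitemap_locs(SITEMAP_XML)):
        print("listed more than once:", url)
```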
16. 6. PRODUCT IDENTIFIERS
• 24% of retailer websites provide a GTIN-14, EAN-13, UPC-12/8 for their products
• 7% of retailer websites provide an SKU for their products
• 3% of retailer websites provide an ISBN for their products
17. 7. PRODUCT CATALOGUE SIZE
• 14% of retailer websites have fewer than 5000 product links in their sitemap
• 79% of retailer websites have between 5000 and 30000 links
• 7% of retailer websites have more than 30000 links
18. 8. DATA RICHNESS #1
• 17% of retailer websites provide a brand for their products
• 44% of retailer websites provide a category for their products
• 62% of retailer websites provide a stock indicator for their products
22. ACTION POINT #1 - SITEMAP
• Have an XML sitemap
• Have the path of your sitemap listed in robots.txt
• Have your product pages in your sitemap
• Regularly update your sitemap
• Don’t point to 404 pages from your sitemap
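One way to check the second point is to parse your robots.txt and see which sitemap URLs it declares. The sketch below uses Python's standard urllib.robotparser, whose site_maps() method is available from Python 3.8. The robots.txt content and the example.com URL are made up for the example, and listed_sitemaps is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the sitemap URL is a placeholder.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def listed_sitemaps(robots_text):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(listed_sitemaps(ROBOTS_TXT))
```

An empty result means a crawler has to guess where the sitemap lives.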
23. ACTION POINT #2 - META / OPENGRAPH / ITEMPROP
• Provide structured information on your products using meta itemprop (schema) or opengraph tags
• Provide as much structured data as possible
• Implement them as closely as possible to the standards
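To make this concrete, here is a minimal Python sketch of reading title, price and image from meta tags without rendering any JavaScript. It is an illustration only, not PriceBot's extraction logic. The sample page, the class and function names, and the choice of keys are assumptions. In particular, product:price:amount is a commonly used extension rather than part of the core Open Graph vocabulary.

```python
from html.parser import HTMLParser

# Hypothetical product page; all values are placeholders.
PAGE_HTML = """
<html><head>
  <meta property="og:title" content="Red Kettle 1.7L">
  <meta property="og:image" content="https://example.com/img/red-kettle.jpg">
  <meta property="product:price:amount" content="24.99">
</head><body>
  <div itemscope itemtype="https://schema.org/Product">
    <meta itemprop="name" content="Red Kettle 1.7L">
    <meta itemprop="price" content="24.99">
    <meta itemprop="image" content="https://example.com/img/red-kettle.jpg">
  </div>
</body></html>
"""

# Assumed mapping from each core field to the keys that may carry it.
CORE_FIELDS = {
    "title": ("og:title", "name"),
    "price": ("product:price:amount", "price"),
    "image": ("og:image", "image"),
}


class ProductMetaParser(HTMLParser):
    """Collect content values of <meta> tags keyed by property or itemprop."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("itemprop")
        content = attr.get("content")
        if key and content is not None:
            # Keep the first value seen for each key.
            self.fields.setdefault(key, content)


def extract_product_meta(html_text):
    parser = ProductMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.fields


def missing_core_fields(fields):
    """Return the core fields (title, price, image) that no tag provides."""
    return [label for label, keys in CORE_FIELDS.items()
            if not any(fields.get(k) for k in keys)]


if __name__ == "__main__":
    found = extract_product_meta(PAGE_HTML)
    print(found)
    print("missing:", missing_core_fields(found))
```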
24. ACTION POINT #3 - JAVASCRIPT & PRICE
• Be wary of the side effects of a JavaScript-heavy site on crawling
• If you do implement a JavaScript-heavy site, meta tags with structured information are even more important!
• Be wary when converting the price based on geolocation
• Don’t perform the price conversion in JavaScript
25. ACTION POINT #4 - ANTI-CRAWL & ROBOTS.TXT
• Ask yourselves what’s the benefit of an anti-crawl mechanism
• Ask yourselves what’s the benefit of blocking all crawlers in robots.txt
• Control the speed of crawlers using crawl-delay
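As a sketch of the alternative to blocking, the robots.txt below leaves product pages crawlable, keeps the checkout off limits and asks crawlers to pause between requests. The Python code reads it back with the standard urllib.robotparser. The rules, the example.com URLs, the "PriceBot" user-agent token and the crawl_policy helper are all assumptions made for the example. Note that Crawl-delay is a non-standard directive and not every crawler honours it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: throttle crawlers instead of blocking them outright.
ROBOTS_WITH_DELAY = """\
User-agent: *
Crawl-delay: 2
Disallow: /checkout/
"""


def crawl_policy(robots_text, user_agent, url):
    """Return (allowed, crawl_delay_seconds) for a user agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    # "PriceBot" is an assumed user-agent token, used only for illustration.
    print(crawl_policy(ROBOTS_WITH_DELAY, "PriceBot",
                       "https://example.com/products/red-kettle"))
    print(crawl_policy(ROBOTS_WITH_DELAY, "PriceBot",
                       "https://example.com/checkout/basket"))
```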
26. ACTION POINT #5 - HAVE A SITEMAP MEETING
• Have a sitemap strategy; it’s just as important as your SEO strategy
• Sitemaps contribute massively to discoverability, yet are often overlooked
• Make sure you are doing everything you can to provide structured information
• Review your robots.txt contents
• Address missed opportunities from your sitemap sooner rather than later
27. THANKS FOR LISTENING!
PriceBot
http://www.pricesearcher.com/pricebot
Keen to hear from you with feedback about PriceBot or Pricesearcher in general.
Feel free to drop me a line at vlassios@pricesearcher.com or catch up with me at
our stand B11 in the expo hall
Editor's Notes
Unintentional blocks:
Crawl-delay is set so high that it would take weeks to crawl a single site
All user-agents are blocked in robots.txt
Automated anti-crawl system kicks in and starts serving 403s
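To see how a high crawl-delay turns into weeks, here is a rough Python calculation. It assumes one page is fetched per delay interval by a single sequential crawler, and it ignores download time. The page count and the delay values are illustrative, not figures from the talk, and crawl_days is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(page_count, crawl_delay_seconds):
    """Days needed to fetch page_count pages with one request per delay interval."""
    return page_count * crawl_delay_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Illustrative numbers: 30,000 product pages at two different delays.
    print(f"{crawl_days(30_000, 60):.1f} days at a 60 second delay")
    print(f"{crawl_days(30_000, 2):.1f} days at a 2 second delay")
```

With these assumed numbers, a 60 second delay takes about 20.8 days, while a 2 second delay takes about 0.7 days.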