4. What Is Web Scraping?
Web Scraping
Also known as screen scraping, web scraping is the act of copying
large amounts of data from a website – either manually or with an
automated program.
Legitimate Scraping
Scraping can sometimes be benevolent and totally acceptable. For
example, search engine bots scrape your website in order to index it.
Malicious Scraping
A systematic theft of intellectual property accessible on a website,
including pricing, content, images, and proprietary data
5. Web scraping of listing data results in competitive disadvantage, damage to brand
reputation and loss of revenue and customers
Presenting inaccurate listing data - once listing data has been scraped, you lose
control over the presentation of the property
Damaged SEO PageRank due to stolen data duplicated on other websites
Skewed analytics data leads to misinformed business decisions
Slowdowns and downtime due to excessive scraping
How Web Scraping Impacts Real Estate Portals
6. The cost of scraping/acquiring data has gone down
The ability to scrape has become easier and more accessible
Virtual servers and bandwidth are cheap
Sophistication of Botnet as a Service
The Growth of the Scraping Problem
7. Survey Respondents
100 real estate executives representing over
600,000 realtors
14 real estate portal operators running 400,000
real estate websites
2015 Real Estate Web Scraping Survey
8. We asked real estate portals “how much scraping is acceptable for your business model and
operational budget?”
How much Scraping Traffic is Too Much for your Business?
43% responded “Less than 1%”
28% responded “Less than 15%”
9. Up to 25% of website traffic on real estate portals is from scrapers
How Much Bot Traffic do Real Estate Portals Actually See?
Real Estate Sites: Bad Bots 25%, Good Bots 35%, Humans 40%
Source: “Distil Networks, The 2015 Bad Bot Landscape Report”
71% of those surveyed said their business model could NOT handle this level of scraping
11. ~80% are relying on the wrong tools to detect the problem
Advanced bots and distributed scraping may not be apparent in log analysis
Only 21% are relying on
commercial tools
Why Aren’t Portals Aware of the Scope of the Problem?
12. Legacy Tools Are Ineffective Against Modern Bots
Real estate portals are largely
implementing the wrong tools
Top implemented anti-scraping
solutions are
● IP Blocking
● Rate limiting (based on IPs)
● WAFs
Why can’t these tools keep up?
Source: “Distil Networks, 2015 Study of Scraping Real Estate Websites and MLS Data Security”
13. IP Blocking is always one step behind attackers
Attackers rotate IP addresses from huge pools of IPs
IP addresses can easily be spoofed
Anonymous proxies help mask user origins
IP Based Solutions are too Reactive
14. Attacks are often distributed among many IP Addresses
Scraping happens at a very slow pace but from many sources
Low and Slow Attacks Evade Rate Limiting
1 IP scraping 1,000 pages = 500 IPs scraping 2 pages each
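The arithmetic on this slide can be illustrated with a small simulation. This is only a sketch: the `blocked_ips` helper, the limit of 100 pages per IP, and the example addresses are all assumptions made for illustration, not details from the survey or from any particular rate limiter.

```python
from collections import Counter


def blocked_ips(requests, max_pages_per_ip):
    """Return the IPs whose page count exceeds the per-IP limit."""
    counts = Counter(ip for ip, _page in requests)
    return {ip for ip, n in counts.items() if n > max_pages_per_ip}


# Hypothetical limit: block any IP that fetches more than 100 pages.
LIMIT = 100

# One IP scraping 1,000 pages.
single_source = [("10.9.9.9", page) for page in range(1000)]

# 500 IPs scraping 2 pages each: the same 1,000 pages in total.
distributed = [
    (f"10.0.{i // 250}.{i % 250 + 1}", i * 2 + j)
    for i in range(500)
    for j in range(2)
]

pages_single = {page for _ip, page in single_source}
pages_distributed = {page for _ip, page in distributed}

print("single source blocked IPs:", len(blocked_ips(single_source, LIMIT)))
print("distributed blocked IPs:", len(blocked_ips(distributed, LIMIT)))
print("same pages taken:", pages_single == pages_distributed)
```

Both runs take the same 1,000 pages, but only the single-source run ever crosses the per-IP limit. Lowering the limit far enough to catch 2 pages per IP would also block ordinary visitors.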
15. Bad guys have more tools to leverage when building bots
Web Browsers are Becoming More Complex
The Evolution of the Web
Browser versions and their Technologies
Source: http://www.evolutionoftheweb.com
16. Advanced bots use browser capabilities to evade detection and mimic human
behavior
Bots are Increasingly Able to Mimic Humans
Bad Bot Sophistication levels, 2014
18. Tools Must Leverage Many Techniques to Detect Advanced Bots
Identifying advanced bots and browser automation
requires specialized techniques
Commercial, purpose-built solutions tend to have
more automation checks
Approaches to Detecting Bots, by Tier
19. IP blocking is not effective when dealing with modern threats
Device fingerprinting provides distinct advantages like
○ Tracking attackers across IP addresses
○ Detecting bots through anonymous proxy networks
○ Reducing false positives associated with
humans anonymizing themselves
Use Device Fingerprinting Instead of IP Blocking
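A rough sketch of the idea behind the bullets above: identify a client by a hash of its attributes rather than by its IP address. The field names in `FINGERPRINT_FIELDS`, the sample clients, and the helper functions are hypothetical; commercial fingerprinting relies on many more signals than the four assumed here.

```python
import hashlib
from collections import defaultdict

# Hypothetical attribute names; real fingerprinting collects many more signals.
FINGERPRINT_FIELDS = ("user_agent", "accept_language", "screen", "timezone")


def fingerprint(client):
    """Hash client attributes (deliberately excluding the IP) into a short ID."""
    raw = "|".join(str(client.get(field, "")) for field in FINGERPRINT_FIELDS)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def ips_by_fingerprint(traffic):
    """Map each fingerprint to the set of IP addresses it was seen on."""
    seen = defaultdict(set)
    for request in traffic:
        seen[fingerprint(request)].add(request["ip"])
    return dict(seen)


bot = {"user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
       "accept_language": "en-US", "screen": "800x600", "timezone": "UTC"}
human = {"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10)",
         "accept_language": "nl-NL", "screen": "1440x900",
         "timezone": "Europe/Amsterdam"}

traffic = [
    dict(bot, ip="10.0.0.1"),
    dict(bot, ip="10.0.0.2"),    # same bot, rotated IP
    dict(bot, ip="10.0.0.3"),    # same bot, now behind a shared proxy
    dict(human, ip="10.0.0.3"),  # a person using that same proxy IP
]

groups = ips_by_fingerprint(traffic)
for fp, ips in groups.items():
    print(fp, sorted(ips))
```

In this toy data the bot keeps one fingerprint across three IPs, while the person sharing the proxy address gets a different fingerprint, so blocking by fingerprint would not block that person.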
20. Community-sourced attack data aggregation provides a more accurate data source for
enforcement
Machine learning and self-configuration greatly
reduce security maintenance overhead
Community Sourced Intelligence Improves Accuracy
21. Mobile users now outnumber desktop
users
Mobile clients are now being used to
launch attacks
Mobile sites tend to be easier to scrape
○ Less superfluous content
○ Highly structured and easy to
navigate layouts
Mobile Growth Brings With it Mobile Threats
Source: Comscore, The US Mobile App Report
22. Precautions should be implemented to extend security strategies to cover mobile
websites
Mobile clients need to be subjected to the same scrutiny as other users
Mobile Should not be Overlooked
23. The World’s Most Accurate Bot Detection System
Inline Fingerprinting
Fingerprints stick to the bot even if it attempts to reconnect
from random IP addresses or hide behind an anonymous
proxy.
Known Violators Database
Real-time updates from the world’s largest Known Violators
Database, which is based on the collective intelligence of all
Distil-protected sites.
Browser Validation
The first solution to disallow browser spoofing: it validates that
each incoming request really comes from the browser it reports itself
to be, and it detects all known browser automation tools.
Behavioral Modeling and Machine Learning
Machine-learning algorithms pinpoint behavioral anomalies
specific to your site’s unique traffic patterns.
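As a very rough illustration of flagging behavior that deviates from a site's own traffic patterns, the sketch below uses a plain z-score on request rates. This is a simple statistical stand-in chosen for illustration, not machine learning and not Distil's method; the function name, the threshold of 3, and the sample numbers are assumptions.

```python
from statistics import mean, pstdev


def anomalous_sessions(baseline_rates, sessions, threshold=3.0):
    """Return session IDs whose request rate sits more than `threshold`
    standard deviations above the site's own baseline."""
    mu = mean(baseline_rates)
    sigma = pstdev(baseline_rates)
    if sigma == 0:
        # No variation in the baseline: flag anything above it.
        return [sid for sid, rate in sessions.items() if rate > mu]
    return [
        sid for sid, rate in sessions.items()
        if (rate - mu) / sigma > threshold
    ]


# Hypothetical baseline: pages per minute seen from ordinary visitors.
baseline = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]

# Hypothetical live sessions and their pages per minute.
sessions = {"visitor-a": 6, "visitor-b": 5, "crawler-x": 40}

print(anomalous_sessions(baseline, sessions))
```

Because the baseline comes from the site's own traffic, the same rate could be normal on one portal and anomalous on another.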
24. Challenges and Distil Results
Challenge: Homegrown ‘IP blocking’ solution costly to maintain
Distil result: Automated bot defense eliminated the need for manual tuning and maintenance
Challenge: Had to overprovision infrastructure to account for random spikes in bot traffic
Distil result: Eliminated attacks from 90+ countries representing over 99.9% of bad bots
Challenge: Web scraping bots broke through their defenses
Distil result: Stopped thousands of threats from imposter Googlebots – making single page requests from 1000+ IP addresses/month
Onthehouse Saves Infrastructure Costs by Blocking Bad Bots
Australia’s only free property research portal, covering 98% of
Australian properties
“Distil was quick to setup and ensures that we block the bots that are
dangerous to our organization.”
- Arun Thenabadu, CTO of Onthehouse
27. Abstract
Session Time: 20 minutes
Industry: Real Estate (Global - show is in Amsterdam)
Title: Ensuring Property Portal Listing Data Security
Subtitle: Don’t Bother with Litigation, Just Protect Your Listing Data Before the Theft Occurs
Abstract:
Securing your property portal listing data is harder than ever. Why? Web scraping is cheap and easy. Bots simply steal whatever content they’ve been
programmed to fetch – listing text, photos, and other data that should only be available to paid subscribers and legitimate consumers.
Attend this session to learn how to avoid expensive litigation by protecting your content before the theft occurs. Review the latest research on how non-human
traffic has evolved over the past few years and best practices to protect both copyrighted and non-copyrightable content.
Hear the results from research conducted with property portal executives on the current state of anti-scraping efforts.
Key takeaways include:
Insights into the latest research about “scraping” property portal websites
How web scraping works and what you can do to shore up your defenses
How to create a secure listing “supply chain” with your upstream and downstream partners
How to protect your brand image, reputation and SEO rankings
29. ○ Realtor.org offers free tools to track data - Reactive = expensive
Checklist for Syndication has many references to data scraping – legal guidance
NoScrape – aborted project - no update since 2010?
Problem is not going away
Industry Help? ...is Way behind on Bad Bots
Ads for Scraping Programs
on Realtor.com!
○ Realtor.com blog to “deter scraping” relies on
obsolete IP address blocking and expensive IP
litigation
“REALTOR.com® logging, tracking and monitoring patterns
that indicate data is being stolen for these illegitimate
purposes. Once an offender is identified, their IP address is
blocked from accessing the site.”
(Oct 10, 2014)