3. Summary
1. Web Scraping
– Definitions
– Value added
– Analysis of a Sample Case
2. Scrapy Framework
– Overview
– Architecture
– A simple Scrapy program.
3. Build an auto-scraping system for location-based apps
– Extract LatLng from address
– Extract phone number
– Real-time updates, running continuously 24/7
– Prevent duplicate data
– Deploy without a dedicated server or VPS
4. Web crawler
An Internet bot that systematically browses the World Wide Web, typically for web indexing.
Source: wikipedia.org
16. Analysis of a sample case
Need:
(1) collect [home for sale] records from the Web
(2) from many websites in Vietnam
(3) as soon as they are posted
(4) continuously, 24/7
19. Step 3: Ctrl+C, Ctrl+V
• For every site:
– Find the link to the page listing its latest records.
– For every record:
• Check whether it is a new record.
– Copy & paste its fields into a new record in my DB.
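The manual loop above maps directly onto code. The following is a minimal sketch of that loop, not the author's implementation: `RecordStore`, `collect_new_records`, the `fetch_latest` callable, and the use of a record's URL as its identity are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RecordStore:
    """In-memory stand-in for "my DB", keyed by each record's source URL."""
    rows: dict = field(default_factory=dict)

    def is_new(self, url):
        return url not in self.rows

    def insert(self, url, fields):
        self.rows[url] = dict(fields)


def collect_new_records(sites, fetch_latest, store):
    """Walk every site's latest listings and store only unseen records.

    `fetch_latest(site)` is a hypothetical callable that returns the
    site's newest records as dicts with at least a "url" key.
    """
    added = []
    for site in sites:
        for record in fetch_latest(site):
            if store.is_new(record["url"]):      # "Check whether it is new"
                store.insert(record["url"], record)  # "Copy & paste fields"
                added.append(record["url"])
    return added
```

Running the function twice over the same listings should add nothing the second time, which is the behaviour the "check whether it is new" step is there to guarantee.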
30. Simple Scrapy Program (cont.)
(4) Run the spider to extract the data
(5) Review scraped data
31. Build an auto-scraping system for location-based apps
• Extract LatLng from address
• Extract phone number
• Real-time updates, running continuously 24/7
• Prevent duplicate data
• Deploy without a dedicated server or VPS
32. Extract LatLng from address
• Use the Google Geocoding API
• https://maps.googleapis.com/maps/api/geocode/json?address=xxx&sensor=true_or_false&key=API_KEY
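As a rough sketch of how that request could be built and its answer read, the helpers below use only the Python standard library. The function names are hypothetical, and the response layout (`status`, then `results[0].geometry.location` holding `lat` and `lng`) is assumed from the Geocoding API's JSON format; the actual HTTP call is left out so the example runs offline.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key, sensor="false"):
    """Return the request URL; urlencode escapes spaces and non-ASCII text."""
    query = urlencode({"address": address, "sensor": sensor, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def parse_latlng(response_text):
    """Return (lat, lng) from a geocoding JSON body, or None if no match.

    Assumes the body carries a "status" field and a "results" list whose
    first entry holds geometry.location.lat / geometry.location.lng.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

Encoding the address through `urlencode` matters for Vietnamese addresses, which usually contain spaces, commas, and diacritics that cannot go into a query string unescaped.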
38. Without a dedicated server or VPS
• Problem: my server side runs on cPanel web hosting => can’t deploy Scrapy there
• Solution:
– Build a web service for syncing new record data.
• /get_head_revision
• /sync
– Scrapy runs on my PC, then syncs with the server.
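One way the PC-side sync step might look is sketched below. The slides only name the two endpoints, so everything else here is an assumption: the integer `revision` field on each record, the idea that /get_head_revision returns the highest revision the server already holds, and the `FakeServer` class that stands in for the real web service so the sketch runs without a network.

```python
def sync_with_server(local_records, get_head_revision, push_records):
    """Push every local record newer than the server's head revision.

    `get_head_revision()` stands in for a call to /get_head_revision and
    `push_records(batch)` for a call to /sync; both are hypothetical
    callables, as is the integer "revision" field on each record.
    """
    head = get_head_revision()
    pending = sorted(
        (r for r in local_records if r["revision"] > head),
        key=lambda r: r["revision"],
    )
    if pending:
        push_records(pending)
    return [r["revision"] for r in pending]


class FakeServer:
    """In-memory stand-in for the cPanel-hosted web service."""

    def __init__(self):
        self.records = []

    def get_head_revision(self):
        # Highest revision stored so far, or 0 when the server is empty.
        return max((r["revision"] for r in self.records), default=0)

    def sync(self, batch):
        self.records.extend(batch)
```

Asking the server for its head revision first means the PC only uploads what the server is missing, so repeated runs send nothing when there is nothing new.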