SlideShare uma empresa Scribd logo
1 de 21
Baixar para ler offline
Web Scraping 1-2-3 with
   Python + Scrapy
        Sammy Fung
    sammy.hk, gownjob.com
Today Agenda
● Some Cases
● Python and Scrapy
Web Scraping
● a computer software technique of extracting
  information from websites. (Wikipedia)
● for business, hobbies, research.......
● NOT talk about business cases today.
CableTV & NOWTV Programme
(Past)
● 2004.
● slow, slow, slow, or worst - can't connect.
● use Flash.
HK Observatory and Joint Typhoon
Warning Center
● no easy data exchange format, eg.
  RSS/Atom.
● We won't check websites everyday.
Transportation - KMB, PTES
● no map view on KMB website for a bus route
  in the past.
● Exteremly Poor, Ugly (or much worse) map
  UI on PTES.
My experiences on web scraping
● 2004: php
● year after: python
● recent year: python with scrapy
Document Types
● HTML, XML,......
● Text
● Others, eg. pictures, videos,......
Web Scraping
●   Look for right URLs to scrap.
●   Look for right content from webpages.
●   Saving data into data store.
●   When to run the web scraping program ?
What is Scrapy ?
● An open source web scraping framework for
  Python.
● Scrapy is a fast high-level screen scraping
  and web crawling framework, used to crawl
  websites and extract structured data from
  their pages. It can be used for a wide range
  of purposes, from data mining to monitoring
  and automated testing.
Features of Scrapy
● define data you want to scrapy
● write spider to extract data
● Built-in: selecting and extracting data from
  HTML and XML
● Built-in: JSON, CSV, XML output
● Interactive shell console
● Built-in: web service, telnet console, logging
● Others
Installation of Scrapy
●   pip
●   APT repo
●   RPM
●   tarball (binary/source)
Create new scrapy project
$ scrapy startproject mybot
mybot/
mybot/scrapy.cfg
mybot/mybot/items.py
mybot/mybot/pipeline.py
mybot/mybot/settings.py
mybot/mybot/spiders/myspider.py
etc.......
items.py
from scrapy.item import Item, Field

class HKOCurrentItem(Item):
 time = Field()
 station = Field()
 temperature = Field()
 humidity = Field()
 #......
spiders/hko_spider.py (1/5)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from weatherhk.items import HKOCurrentItem

import datetime, re
spiders/hko_spider.py (2/5)
class HKOCurrentSpider(BaseSpider):
 name = "HKOCurrentSpider"
 #allowed_domains = ["www.weather.gov.hk"]
 start_urls = [
   "http://www.weather.gov.
hk/textonly/forecast/chinesewx.htm"
 ]
spiders/hko_spider.py (3/5)
def parse(self, response):
 hxs = HtmlXPathSelector(response)

 stations = []
 # Getting weather data from each stations.
 tx = hxs.select("//pre[1]/text()").re('[^n]*n')
spiders/hko_spider.py (4/5)
   for i in tx:
     if re.search(u'd 度',i):
       data = HKOCurrentItem()
       data['time'] = int(dt)
       data['station'] = self.station.code(i)
       data['temperature'] = int(re.findall(u'd+',i)
[0])
       stations.append(data)
spiders/hko_spider.py (5/5)
 return stations
pipelines.py (1/2)
class HKOCurrentPipeline(object):
 def process_item(self, item, spider):
   station = self.db[item['station']]
   storeditem = dict(item.__dict__)['_values']
pipelines.py (2/2)
   try:
     if 'temperature' in storeditem:
       lasttime = station.find({'temperature': {'$gt':
0}}).sort('time', -1).limit(1)
     if lasttime[0]['time'] != storeditem['time']:
       id = self.insert(storeditem)

  return item

Mais conteúdo relacionado

Mais procurados

Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Steven Francia
 
Thinking in documents
Thinking in documentsThinking in documents
Thinking in documentsCésar Rodas
 
Dive into .git
Dive into .gitDive into .git
Dive into .gitnishio
 
世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するために世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するためにKuniaki Igarashi
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
 
Using MongoDB With Groovy
Using MongoDB With GroovyUsing MongoDB With Groovy
Using MongoDB With GroovyJames Williams
 
Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)MongoSF
 
7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)Steven Francia
 
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Дмитрий Бумов
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide webSergio Burdisso
 
7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid them7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid themSteven Francia
 
Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)EunGi Hong
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudreyAudrey Lim
 
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게Seongyun Byeon
 
Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Steven Francia
 
Writing dumb tests
Writing dumb testsWriting dumb tests
Writing dumb testsLuke Lee
 
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012Steven Pousty
 

Mais procurados (20)

Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
 
Thinking in documents
Thinking in documentsThinking in documents
Thinking in documents
 
Dive into .git
Dive into .gitDive into .git
Dive into .git
 
世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するために世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するために
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
 
Using MongoDB With Groovy
Using MongoDB With GroovyUsing MongoDB With Groovy
Using MongoDB With Groovy
 
Django Mongodb Engine
Django Mongodb EngineDjango Mongodb Engine
Django Mongodb Engine
 
Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)
 
Fun with Python
Fun with PythonFun with Python
Fun with Python
 
7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)
 
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide web
 
7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid them7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid them
 
Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudrey
 
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
 
Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go
 
Writing dumb tests
Writing dumb testsWriting dumb tests
Writing dumb tests
 
Kiosk / PHP
Kiosk / PHP Kiosk / PHP
Kiosk / PHP
 
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
 

Destaque

Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source toolsSammy Fung
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profitFederico Feroldi
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012Jimmy Lai
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Bruno Rocha
 
Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15Dan Whitehouse
 
RESTful API Design Fundamentals
RESTful API Design FundamentalsRESTful API Design Fundamentals
RESTful API Design FundamentalsHüseyin BABAL
 
Web Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional SoftwareWeb Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional SoftwareNosheen Qamar
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scrapingScrapinghub
 

Destaque (20)

Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source tools
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014
 
Scrapy
ScrapyScrapy
Scrapy
 
Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15
 
CPA-Script
CPA-ScriptCPA-Script
CPA-Script
 
摘星
摘星摘星
摘星
 
Relevance Assessment Tool
Relevance Assessment ToolRelevance Assessment Tool
Relevance Assessment Tool
 
Scrapy workshop
Scrapy workshopScrapy workshop
Scrapy workshop
 
RESTful API Design Fundamentals
RESTful API Design FundamentalsRESTful API Design Fundamentals
RESTful API Design Fundamentals
 
Web::Scraper
Web::ScraperWeb::Scraper
Web::Scraper
 
Pydata-Python tools for webscraping
Pydata-Python tools for webscrapingPydata-Python tools for webscraping
Pydata-Python tools for webscraping
 
Website vs web app
Website vs web appWebsite vs web app
Website vs web app
 
Mobile Website vs Mobile App
Mobile Website vs Mobile AppMobile Website vs Mobile App
Mobile Website vs Mobile App
 
Web Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional SoftwareWeb Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional Software
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scraping
 
Webscraping with asyncio
Webscraping with asyncioWebscraping with asyncio
Webscraping with asyncio
 

Semelhante a Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)

Open Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object StorageOpen Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object StorageSammy Fung
 
Monitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntopMonitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntopPyCon Italia
 
Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)Sammy Fung
 
How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)Sammy Fung
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015Younggun Kim
 
Python. Why to learn?
Python. Why to learn?Python. Why to learn?
Python. Why to learn?Oleh Korkh
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...Karthik Murugesan
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.Diep Nguyen
 
Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015Younggun Kim
 
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Jimmy DeadcOde
 
Python in Industry
Python in IndustryPython in Industry
Python in IndustryDharmit Shah
 
[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep DiveAkihiro Suda
 
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...Megumi Takeshita
 
Tornado Web Server Internals
Tornado Web Server InternalsTornado Web Server Internals
Tornado Web Server InternalsPraveen Gollakota
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Thomas Weise
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 

Semelhante a Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version) (20)

Open Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object StorageOpen Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object Storage
 
Monitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntopMonitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntop
 
Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)
 
How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Deploy your own P2P network
Deploy your own P2P networkDeploy your own P2P network
Deploy your own P2P network
 
Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015
 
Python. Why to learn?
Python. Why to learn?Python. Why to learn?
Python. Why to learn?
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.
 
Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015
 
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
 
Python in Industry
Python in IndustryPython in Industry
Python in Industry
 
[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive
 
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
 
Tornado Web Server Internals
Tornado Web Server InternalsTornado Web Server Internals
Tornado Web Server Internals
 
Crawler
CrawlerCrawler
Crawler
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 

Mais de Sammy Fung

Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹Sammy Fung
 
DevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to onlineDevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to onlineSammy Fung
 
Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)Sammy Fung
 
My Open Source Journey - Developer and Community
My Open Source Journey - Developer and CommunityMy Open Source Journey - Developer and Community
My Open Source Journey - Developer and CommunitySammy Fung
 
Introduction to development with Django web framework
Introduction to development with Django web frameworkIntroduction to development with Django web framework
Introduction to development with Django web frameworkSammy Fung
 
香港中文開源軟件翻譯
香港中文開源軟件翻譯香港中文開源軟件翻譯
香港中文開源軟件翻譯Sammy Fung
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web APISammy Fung
 
Global Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 ForecastGlobal Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 ForecastSammy Fung
 
Mozilla - Openness of the Web
Mozilla - Openness of the WebMozilla - Openness of the Web
Mozilla - Openness of the WebSammy Fung
 
Open Source Technology and Community
Open Source Technology and CommunityOpen Source Technology and Community
Open Source Technology and CommunitySammy Fung
 
Access Open Data with Open Source Software Tools
Access Open Data with Open Source Software ToolsAccess Open Data with Open Source Software Tools
Access Open Data with Open Source Software ToolsSammy Fung
 
Installation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server EditionInstallation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server EditionSammy Fung
 
Software Freedom and Open Source Community
Software Freedom and Open Source CommunitySoftware Freedom and Open Source Community
Software Freedom and Open Source CommunitySammy Fung
 
Building your own job site with Drupal
Building your own job site with DrupalBuilding your own job site with Drupal
Building your own job site with DrupalSammy Fung
 
Use open source software to develop ideas at work
Use open source software to develop ideas at workUse open source software to develop ideas at work
Use open source software to develop ideas at workSammy Fung
 
Software Freedom and Community
Software Freedom and CommunitySoftware Freedom and Community
Software Freedom and CommunitySammy Fung
 
Open Source Job Board
Open Source Job BoardOpen Source Job Board
Open Source Job BoardSammy Fung
 
Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)Sammy Fung
 
Introduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMSIntroduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMSSammy Fung
 
Local Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell ExtensionLocal Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell ExtensionSammy Fung
 

Mais de Sammy Fung (20)

Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹
 
DevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to onlineDevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to online
 
Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)
 
My Open Source Journey - Developer and Community
My Open Source Journey - Developer and CommunityMy Open Source Journey - Developer and Community
My Open Source Journey - Developer and Community
 
Introduction to development with Django web framework
Introduction to development with Django web frameworkIntroduction to development with Django web framework
Introduction to development with Django web framework
 
香港中文開源軟件翻譯
香港中文開源軟件翻譯香港中文開源軟件翻譯
香港中文開源軟件翻譯
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web API
 
Global Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 ForecastGlobal Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 Forecast
 
Mozilla - Openness of the Web
Mozilla - Openness of the WebMozilla - Openness of the Web
Mozilla - Openness of the Web
 
Open Source Technology and Community
Open Source Technology and CommunityOpen Source Technology and Community
Open Source Technology and Community
 
Access Open Data with Open Source Software Tools
Access Open Data with Open Source Software ToolsAccess Open Data with Open Source Software Tools
Access Open Data with Open Source Software Tools
 
Installation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server EditionInstallation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server Edition
 
Software Freedom and Open Source Community
Software Freedom and Open Source CommunitySoftware Freedom and Open Source Community
Software Freedom and Open Source Community
 
Building your own job site with Drupal
Building your own job site with DrupalBuilding your own job site with Drupal
Building your own job site with Drupal
 
Use open source software to develop ideas at work
Use open source software to develop ideas at workUse open source software to develop ideas at work
Use open source software to develop ideas at work
 
Software Freedom and Community
Software Freedom and CommunitySoftware Freedom and Community
Software Freedom and Community
 
Open Source Job Board
Open Source Job BoardOpen Source Job Board
Open Source Job Board
 
Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)
 
Introduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMSIntroduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMS
 
Local Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell ExtensionLocal Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell Extension
 

Último

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 

Último (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)

  • 1. Web Scraping 1-2-3 with Python + Scrapy Sammy Fung sammy.hk, gownjob.com
  • 2. Today Agenda ● Some Cases ● Python and Scrapy
  • 3. Web Scraping ● a computer software technique of extracting information from websites. (Wikipedia) ● for business, hobbies, research....... ● NOT talk about business cases today.
  • 4. CableTV & NOWTV Programme (Past) ● 2004. ● slow, slow, slow, or worst - can't connect. ● use Flash.
  • 5. HK Observatory and Joint Typhoon Warning Center ● no easy data exchange format, eg. RSS/Atom. ● We won't check websites everyday.
  • 6. Transportation - KMB, PTES ● no map view on KMB website for a bus route in the past. ● Exteremly Poor, Ugly (or much worse) map UI on PTES.
  • 7. My experiences on web scraping ● 2004: php ● year after: python ● recent year: python with scrapy
  • 8. Document Types ● HTML, XML,...... ● Text ● Others, eg. pictures, videos,......
  • 9. Web Scraping ● Look for right URLs to scrap. ● Look for right content from webpages. ● Saving data into data store. ● When to run the web scraping program ?
  • 10. What is Scrapy ? ● An open source web scraping framework for Python. ● Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
  • 11. Features of Scrapy ● define data you want to scrapy ● write spider to extract data ● Built-in: selecting and extracting data from HTML and XML ● Built-in: JSON, CSV, XML output ● Interactive shell console ● Built-in: web service, telnet console, logging ● Others
  • 12. Installation of Scrapy ● pip ● APT repo ● RPM ● tarball (binary/source)
  • 13. Create new scrapy project $ scrapy startproject mybot mybot/ mybot/scrapy.cfg mybot/mybot/items.py mybot/mybot/pipeline.py mybot/mybot/settings.py mybot/mybot/spiders/myspider.py etc.......
  • 14. items.py from scrapy.item import Item, Field class HKOCurrentItem(Item): time = Field() station = Field() temperature = Field() humidity = Field() #......
  • 15. spiders/hko_spider.py (1/5) from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from weatherhk.items import HKOCurrentItem import datetime, re
  • 16. spiders/hko_spider.py (2/5) class HKOCurrentSpider(BaseSpider): name = "HKOCurrentSpider" #allowed_domains = ["www.weather.gov.hk"] start_urls = [ "http://www.weather.gov. hk/textonly/forecast/chinesewx.htm" ]
  • 17. spiders/hko_spider.py (3/5) def parse(self, response): hxs = HtmlXPathSelector(response) stations = [] # Getting weather data from each stations. tx = hxs.select("//pre[1]/text()").re('[^n]*n')
  • 18. spiders/hko_spider.py (4/5) for i in tx: if re.search(u'd 度',i): data = HKOCurrentItem() data['time'] = int(dt) data['station'] = self.station.code(i) data['temperature'] = int(re.findall(u'd+',i) [0]) stations.append(data)
  • 20. pipelines.py (1/2) class HKOCurrentPipeline(object): def process_item(self, item, spider): station = self.db[item['station']] storeditem = dict(item.__dict__)['_values']
  • 21. pipelines.py (2/2) try: if 'temperature' in storeditem: lasttime = station.find({'temperature': {'$gt': 0}}).sort('time', -1).limit(1) if lasttime[0]['time'] != storeditem['time']: id = self.insert(storeditem) return item