Learning Scrapy the Happy Way

h4353062@hotmail.com
鍾怡傑

 What is Scrapy?
 How do you install it?
 Go! Go! Let's get started
 Building your first Spider
 Demo

 Scrapy is a Python framework that lets you write web spiders in plain Python code, making it easy to scrape data from web pages.
 Scrapy at a glance
 Frequently Asked Questions

 Prerequisites
◦ Python 2.7
◦ lxml
◦ OpenSSL
◦ pip or easy_install
 Install Scrapy
◦ pip install Scrapy
◦ or
◦ easy_install Scrapy
http://doc.scrapy.org/en/latest/intro/install.html

 Step
1. Edit /etc/apt/sources.list and add this line:
deb http://archive.scrapy.org/ubuntu precise main
2. Import the archive key: curl -s http://archive.scrapy.org/ubuntu/archive.key | sudo apt-key add -
3. Update the package lists: sudo apt-get update
4. Install: sudo apt-get install scrapy-<Version>
5. Test it by running the scrapy command
http://doc.scrapy.org/en/latest/topics/ubuntu.html

http://doc.scrapy.org/en/latest/topics/architecture.html

 What is the Scrapy shell?
◦ An interactive Python interpreter that shows the result of each expression as soon as you run it.
 Run scrapy shell "<URL>"
 Ex:
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

 Let's try it out
 Observe the output of each command
In [1]: sel.xpath('//title')
In [2]: sel.xpath('//title').extract()
In [3]: sel.xpath('//title/text()')
In [4]: sel.xpath('//title/text()').extract()
In [5]: sel.xpath('//title/text()').re('(\w+):')
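
To see what kind of values these queries return without a network connection, here is a minimal offline sketch. It assumes a made-up, well-formed page (PAGE) and swaps in the standard library's xml.etree.ElementTree and re modules for Scrapy's Selector, so the calls below are rough stand-ins for sel.xpath(...), .extract(), and .re(...), not Scrapy's own API.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded page (not real dmoz.org output).
PAGE = (
    "<html><head>"
    "<title>Open Directory - Computers: Programming: Languages: Python: Books</title>"
    "</head><body></body></html>"
)

root = ET.fromstring(PAGE)

# Roughly sel.xpath('//title'): the matching element nodes.
title_nodes = root.findall(".//title")

# Roughly sel.xpath('//title/text()').extract(): a list of strings.
title_texts = [node.text for node in title_nodes]

# Roughly .re('(\w+):'): every word that is directly followed by a colon.
words_before_colon = re.findall(r"(\w+):", title_texts[0])

print(title_texts)
print(words_before_colon)
```

The point to notice is the difference between a node (In [1], In [3]) and the extracted string data (In [2], In [4]), and that the regular expression needs the backslash in \w to match word characters.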

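 The output shows that Scrapy uses XPath syntax to pull information out of a web page, and that extract() returns the matches as a list of strings (displayed much like a JSON array)
 If you are not familiar with XPath syntax, see:
1. http://www.w3schools.com/XPath/
2. http://msdn.microsoft.com/zh-tw/library/ms256086(v=vs.110).aspx
 Practice resources
1. XPath Tester
2. The XPath Helper extension for the Chrome browser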

 Basic workflow:
1. Create a new Scrapy project
2. Define the Items you need
3. Write a Spider to scrape the data
4. Write an Item Pipeline to store the scraped data

 scrapy startproject <Project_Name>
scrapy startproject tutorial
tutorial/
├── tutorial
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

 Edit tutorial/tutorial/items.py
from scrapy.item import Item, Field


class DmozItem(Item):
    # One Field per piece of data to scrape from each list entry.
    title = Field()
    link = Field()
    desc = Field()

 Edit tutorial/tutorial/spiders/dmoz_spider.py
from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    # The URLs below are truncated in the source.
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/B",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/R",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')  # every list entry on the page
        items = []
        for site in sites:
            item = DmozItem()  # the Item defined earlier in step 2
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
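
The loop in parse can be tried offline on a small sample. This sketch assumes a made-up HTML fragment (SAMPLE) and uses the standard library's xml.etree.ElementTree in place of Scrapy's Selector; parse_sites is a hypothetical helper, and plain dicts stand in for DmozItem.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the list the spider walks over.
SAMPLE = """
<html><body><ul>
  <li><a href="http://example.com/book-a">Book A</a> - An introductory text.</li>
  <li><a href="http://example.com/book-b">Book B</a> - A reference manual.</li>
</ul></body></html>
"""


def parse_sites(html):
    """Return one dict per <li>, mirroring the title/link/desc fields."""
    root = ET.fromstring(html.strip())
    items = []
    for site in root.findall(".//ul/li"):
        link = site.find("a")
        if link is None:
            continue  # skip list entries that contain no link
        items.append({
            "title": [link.text],                 # like site.xpath('a/text()')
            "link": [link.get("href")],           # like site.xpath('a/@href')
            "desc": [(link.tail or "").strip()],  # the text that follows the link
        })
    return items


for item in parse_sites(SAMPLE):
    print(item)
```

Each field holds a list because extract() in the spider also returns a list of matches rather than a single string.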

 Run the command from inside the project; here it is run from tutorial/tutorial/spiders/
 The scraped data is then saved to spiders/items.json
scrapy crawl dmoz -o items.json -t json
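
The command above relies on Scrapy's built-in feed export, so step 4 of the workflow (an Item Pipeline) is not shown. As a rough sketch only: a pipeline is a plain Python class with a process_item(self, item, spider) method, plus optional open_spider and close_spider hooks. The class name JsonLinesPipeline, the path argument, and the items.jl file name are assumptions for illustration, and the class would still need to be listed in ITEM_PIPELINES in settings.py before Scrapy calls it.

```python
import json


class JsonLinesPipeline(object):
    """Write each scraped item to a file as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # Hypothetical output path; the default is there so the class
        # can be created without arguments.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w")

    def process_item(self, item, spider):
        # dict(item) accepts plain dicts and dict-like Item objects alike.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Writing one JSON object per line keeps the file valid even if the crawl stops partway through, which a single large JSON array would not.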

Reference: Scrapy 0.22 documentation
