SlideShare uma empresa Scribd logo
1 de 54
Baixar para ler offline
PYTHON
CRAWLER
From Beginner To Intermediate
1
Self Introduction
Cheng-Yi, Yu
erinus.startup@gmail.com
• LifePlus Inc.
Technical Manager
• Paganini Plus Inc.
Senior Software Developer
• Freelancer
~ 10 Years
2
DAY 1
3
Python
• Entry
if __name__ == '__main__':
# do something
• Method
def main ():
# do something
• Package
import [package]
import [package] as [alias]
• Format
'%s' % ([parameters ...])
4
demo01.py
demo02.py
demo03.py
Python
• If … Else …
if [condition]:
# do something
else:
# do something
• For Loop
for item in list:
# do something
• Array Slice
array[start:end]
5
Python
• Array Creation From For Loop
– Object
[item.attr for item in array]
– Dictionary
[item[key] for item in array]
6
demo04.py
Python
• In
– String
if <str> in <str>:
– Array
if item in list:
– Dictionary
if <str:key> in <dict>:
7
Installation
• Ubuntu Fonts
https://design.ubuntu.com/font/
• Source Han Sans Fonts
https://github.com/adobe-fonts/source-han-sans
• Visual Studio Code And Extensions
https://code.visualstudio.com/
8
Installation
• Cmder
http://cmder.net/
• Python 3.6
https://www.python.org/
• Python Packages
pip install requests
pip install pyquery
pip install beautifulsoup4
pip install js2py
pip install selenium
9
Visual Studio Code
10
Visual Studio Code
• Install Python Extensions
11
Visual Studio Code
• Open Integrated Terminal
12
Visual Studio Code
• Open Integrated Terminal
13
Built-in
• Json
import json
– json.loads
json.loads(<str>)
– json.dumps
json.dumps(<dict>)
14
demo05.py
demo06.py
Built-in
• Xml
import xml.etree.ElementTree as ET
– Load From File
tree = ET.ElementTree(file=<str:filepath>)
tree = ET.parse(<str:filepath>)
root = tree.getroot()
– Load From String
root = ET.fromstring(<str>)
15
demo07.py
demo08.py
Built-in
• Xml
– Child Nodes
Only One Level
for node in root:
# do something
– XPath
Multiple Levels
nodes = root.findall(<str:expression>)
for node in nodes:
# do something
16
demo07.py
demo08.py
Built-in
• Url
import urllib.parse as UP
– urlparse
result = UP.urlparse(<str:url>)
– urlunparse
url = UP.urlunparse(<ParseResult>)
– quote
str = UP.quote(<str:unquoted>)
– unquote
str = UP.unquote(<str:quoted>)
17
demo09.py
demo10.py
Built-in
• Regular Expression
import re
– re.search
Find First Match
match = re.search(<str:pattern>, <str:text>)
match.group(<int:index>)
– re.findall
Find All Matches
finds = re.findall(<str:pattern>, <str:text>)
for find in finds:
# do something
18
demo11.py
Built-in
• Regular Expression
– re.split
Split By Pattern
re.split(<str:pattern>, <str:text>)
– re.sub
Replace By Pattern
re.sub(<str:pattern>, <str:replace>, <str:text>)
19
demo11.py
Built-in
• Regular Expression
– Expressions
1. Range
[Start-End]
[0-9], [a-z], [A-Z], [a-zA-Z], [0-9a-zA-Z], ...
20
demo12.py
Built-in
• Regular Expression
– Expressions
1. Zero Or More Times
*
2. One Or More Times
+
3. Zero Or One Time
?
21
demo12.py
Built-in
• Regular Expression
– Expressions
1. Numbers
d = [0-9]
2. Words
w = [a-zA-Z0-9] (ANSI)
w = [a-zA-Z0-9] + Non-ANSI Characters (UTF-8)
3. Spaces, Tabs, …
s
22
demo12.py
Built-in
• Regular Expression
– Expressions
1. Start With
^
2. End With
$
23
demo12.py
Built-in
• Regular Expression
– Expressions
1. Named Group
(?P<name>expr)
(?P<country>+d+)-(?P<phone>d+)
24
demo13.py
AnalySIS
• Chrome Developer Tools
– Elements
See Elements In DOM
Id, Class, Attribute, ...
– Network
See Requests, Responses
Urls, Methods, Headers, Cookies, Bodies, ...
25
Elements
• Find Element by Mouse Pointer
26
Elements
• Find Element by HTML Tag
27
Networks
• See All Requests And Responses
28
Networks
• See Details Of Request And Response
29
Networks
• See Response Content
30
Networks
• See Cookies Sent And Set
31
Documents
• Requests
http://docs.python-requests.org/
• PyQuery
https://pythonhosted.org/pyquery/
• Beautiful Soup 4
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
• Js2Py
https://github.com/PiotrDabkowski/Js2Py
32
Packages
• Requests
import requests
– Request
1. Method
GET, POST, ...
response = requests.get(<str:url>)
response = requests.post(<str:url>, data=<str:body>)
response = requests.post(<str:url>, data=<dict:body>)
2. Session
session = requests.Session()
response = session.get(<str:url>)
33
demo14.py
Packages
• Requests
import requests
– Request
• Headers
response = requests.get(<str:url>, headers=<dict>)
• Cookies
response = requests.get(<str:url>, cookies=<dict>)
34
Packages
• Requests
– Response
1. Status Code
response.status_code
2. Headers
response.headers[<str:name>]
3. Cookies
response.cookies[<str:name>]
35
demo14.py
Packages
• Requests
– Response
1. Binary Content
response.content
2. Text Content
response.text
3. Json Content
response.json()
36
demo14.py
Packages
• PyQuery
import pyquery
– Load From String
d = pyquery.PyQuery(<str:html>)
– Load From Url
d = pyquery.PyQuery(url=<str:url>)
– Load From File
d = pyquery.PyQuery(filename=<str:filepath>)
37
demo15.py
Packages
• PyQuery
– Find
p = d(<str:expression>)
– Element To HTML
p.html()
– Extract Text From Element
p.text()
– Get Value From Element Attribute
val = p.attr[<str:name>]
38
demo15.py
Packages
• Beautiful Soup 4
import bs4
– Load From String
d = bs4.BeautifulSoup(<str:html>, 'html.parser')
39
demo16.py
Packages
• Beautiful Soup 4
– Find
p = d.find_all(<str:tag>, <attr-key>=<attr-val>, ...)
p = d.find_all(<regex>, <attr-key>=<attr-val>, ...)
p = d.find_all(<array>, <attr-key>=<attr-val>, ...)
p = d.find(<str:tag>, <attr-key>=<attr-val>, ...)
p = d.find(<regex>, <attr-key>=<attr-val>, ...)
p = d.find(<array>, <attr-key>=<attr-val>, ...)
p = d.select(<str:expression>)
p = d.select_one(<str:expression>)
40
demo16.py
demo17.py
Packages
• Beautiful Soup 4
– Extract Text From Element
p.get_text()
– Get Value From Element Attribute
p.get(<str:name>)
41
demo16.py
demo17.py
Packages
• Js2Py
import js2py
– Eval
js2py.eval_js(<str:code>)
res = js2py.eval_js('var o = <str:js>; o')
42
demo18.py
DAY 2
43
WORKSHOP
• Apple Daily
https://tw.appledaily.com/
– Realtime News
https://tw.appledaily.com/new/realtime
44
WORKSHOP
• Facebook Page
– Cookies
– Feed
45
DAY 3
46
SELENIUM
• Download ChromeDriver
https://sites.google.com/a/chromium.org/chromedriver/
47
SELENIUM
• Import
import selenium.webdriver
• Initialize
option = elenium.webdriver.ChromeOptions()
• Start
driver = selenium.webdriver.Chrome(chrome_options=option)
• Browse
driver.get(<str:url>)
• Close
driver.quit()
48
SELENIUM
• Source
driver.page_source
49
SELENIUM
• Find One
– find_element_by_id
– find_element_by_name
– find_element_by_tag_name
– find_element_by_class_name
– find_element_by_css_selector
50
SELENIUM
• Find Multiple
– find_elements_by_name
– find_elements_by_tag_name
– find_elements_by_class_name
– find_elements_by_css_selector
51
SELENIUM
• Actions
– send_keys
– click
52
WORKSHOP
• Facebook Page
– Login
– Feed
53
54

Mais conteúdo relacionado

Mais procurados

Mais procurados (8)

Programming in python Unit-1 Part-1
Programming in python Unit-1 Part-1Programming in python Unit-1 Part-1
Programming in python Unit-1 Part-1
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.
 
Functional concepts in C#
Functional concepts in C#Functional concepts in C#
Functional concepts in C#
 
Odessapy2013 - Graph databases and Python
Odessapy2013 - Graph databases and PythonOdessapy2013 - Graph databases and Python
Odessapy2013 - Graph databases and Python
 
R. herves. clean code (theme)2
R. herves. clean code (theme)2R. herves. clean code (theme)2
R. herves. clean code (theme)2
 
Lecture 5 python function (ewurc)
Lecture 5 python function (ewurc)Lecture 5 python function (ewurc)
Lecture 5 python function (ewurc)
 
2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge
 

Semelhante a Python Crawler

Semelhante a Python Crawler (20)

Tuples, Dicts and Exception Handling
Tuples, Dicts and Exception HandlingTuples, Dicts and Exception Handling
Tuples, Dicts and Exception Handling
 
Python update in 2018 #ll2018jp
Python update in 2018 #ll2018jpPython update in 2018 #ll2018jp
Python update in 2018 #ll2018jp
 
Intro to Python
Intro to PythonIntro to Python
Intro to Python
 
Intro
IntroIntro
Intro
 
Modules and packages in python
Modules and packages in pythonModules and packages in python
Modules and packages in python
 
programming with python ppt
programming with python pptprogramming with python ppt
programming with python ppt
 
Dov'è il tasto ok?
Dov'è il tasto ok?Dov'è il tasto ok?
Dov'è il tasto ok?
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
package module in the python environement.pptx
package module in the python environement.pptxpackage module in the python environement.pptx
package module in the python environement.pptx
 
TDD with phpspec2
TDD with phpspec2TDD with phpspec2
TDD with phpspec2
 
Python and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementPython and Oracle : allies for best of data management
Python and Oracle : allies for best of data management
 
Mufix Network Programming Lecture
Mufix Network Programming LectureMufix Network Programming Lecture
Mufix Network Programming Lecture
 
Python in 30 minutes!
Python in 30 minutes!Python in 30 minutes!
Python in 30 minutes!
 
go.ppt
go.pptgo.ppt
go.ppt
 
Day 1 - Python Overview and Basic Programming - Python Programming Camp - One...
Day 1 - Python Overview and Basic Programming - Python Programming Camp - One...Day 1 - Python Overview and Basic Programming - Python Programming Camp - One...
Day 1 - Python Overview and Basic Programming - Python Programming Camp - One...
 
Good ideas that we forgot
Good ideas that we forgot   Good ideas that we forgot
Good ideas that we forgot
 
Programming in Python
Programming in Python Programming in Python
Programming in Python
 
Crawler 2
Crawler 2Crawler 2
Crawler 2
 
PyCon Taiwan 2013 Tutorial
PyCon Taiwan 2013 TutorialPyCon Taiwan 2013 Tutorial
PyCon Taiwan 2013 Tutorial
 
Lecture 0 python basic (ewurc)
Lecture 0 python basic (ewurc)Lecture 0 python basic (ewurc)
Lecture 0 python basic (ewurc)
 

Mais de Cheng-Yi Yu

Mais de Cheng-Yi Yu (9)

CEF.net
CEF.netCEF.net
CEF.net
 
Go Web Development
Go Web DevelopmentGo Web Development
Go Web Development
 
Facebook Dynamic Ads API
Facebook Dynamic Ads APIFacebook Dynamic Ads API
Facebook Dynamic Ads API
 
Network Device Development - Part 5: Firewall 104 ~ Packet Splitter
Network Device Development - Part 5: Firewall 104 ~ Packet SplitterNetwork Device Development - Part 5: Firewall 104 ~ Packet Splitter
Network Device Development - Part 5: Firewall 104 ~ Packet Splitter
 
Network Device Development - Part 4: Firewall 103 ~ Protocol Filter & Payload...
Network Device Development - Part 4: Firewall 103 ~ Protocol Filter & Payload...Network Device Development - Part 4: Firewall 103 ~ Protocol Filter & Payload...
Network Device Development - Part 4: Firewall 103 ~ Protocol Filter & Payload...
 
2015.10.05 Updated > Network Device Development - Part 2: Firewall 101
2015.10.05 Updated > Network Device Development - Part 2: Firewall 1012015.10.05 Updated > Network Device Development - Part 2: Firewall 101
2015.10.05 Updated > Network Device Development - Part 2: Firewall 101
 
2015.10.05 Updated > Network Device Development - Part 1: Switch
2015.10.05 Updated > Network Device Development - Part 1: Switch2015.10.05 Updated > Network Device Development - Part 1: Switch
2015.10.05 Updated > Network Device Development - Part 1: Switch
 
Android Security Development - Part 2: Malicious Android App Dynamic Analyzi...
Android Security Development - Part 2: Malicious Android App Dynamic Analyzi...Android Security Development - Part 2: Malicious Android App Dynamic Analyzi...
Android Security Development - Part 2: Malicious Android App Dynamic Analyzi...
 
2015.04.24 Updated > Android Security Development - Part 1: App Development
2015.04.24 Updated > Android Security Development - Part 1: App Development 2015.04.24 Updated > Android Security Development - Part 1: App Development
2015.04.24 Updated > Android Security Development - Part 1: App Development
 

Último

The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 

Último (20)

WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 

Python Crawler