Scraping the Web with Scrapinghub
For Startups
“Getting information off the
Internet is like taking a drink
from a fire hydrant.”
– Mitchell Kapor
Who Uses Web Scraping
Web scraping is used by everyone from individuals to
multinational companies:
● Monitor your competitors’ prices by scraping
product information
● Detect fraudulent reviews and sentiment
changes by scraping product reviews
● Track online reputation by scraping social
media profiles
● Create apps that use public data
● Track SEO by scraping search engine results
Web Scraping Traffic
Scrapinghub
Our products empower our users to scrape data quickly and
effectively using open source technologies. We offer:
● A cloud-based platform to help you scale your crawlers
● A smart proxy rotator to crawl the web even faster
● Professional Services to handle web scraping and data
mining for you
● Off-the-shelf datasets so you can get data hassle-free
Scrapy
Scrapy is a web scraping framework that
gets the dirty work related to web crawling
out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
Portia
Portia is a Visual Scraping tool that lets you
get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● JavaScript dynamic content generation
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
Large Scale Infrastructure
Meet Scrapy Cloud, our PaaS for web crawlers:
● Scalable: Crawlers run on EC2 instances or dedicated servers
● Crawlera add-on
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn, among others
● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
Broad Crawls
Frontera allows us to build large scale web crawlers in Python:
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
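The idea of a crawl frontier (a queue that decides which URL to fetch next) can be sketched in plain Python. This is only an illustration, not Frontera's API: the `CrawlFrontier` class, its method names, and the scores are all hypothetical, and a real link-analysis score such as PageRank would be computed elsewhere.

```python
import heapq
from itertools import count


class CrawlFrontier:
    """Tiny in-memory crawl frontier: the highest-score URL is fetched first."""

    def __init__(self):
        self._heap = []        # entries are (-score, insertion_order, url)
        self._seen = set()     # URLs already scheduled, to avoid re-queuing
        self._order = count()  # tie-breaker so equal scores stay first-in, first-out

    def add(self, url, score):
        """Schedule a URL once; return False if it was already known."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def next_url(self):
        """Pop the highest-priority URL, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


# Hypothetical scores; in practice they might come from link analysis.
frontier = CrawlFrontier()
frontier.add("https://example.com/", 0.9)
frontier.add("https://example.com/about", 0.2)
frontier.add("https://example.com/products", 0.7)
duplicate_added = frontier.add("https://example.com/", 0.1)  # ignored
crawl_order = [frontier.next_url() for _ in range(3)]
```

A distributed frontier would keep the queue and the seen-set in shared storage rather than in one process, but the prioritization logic has the same shape.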
Web Scraping Pitfalls
Bot Countermeasures
Websites are using increasingly sophisticated techniques to
protect against bad bots.
Unfortunately, these same technologies often prevent
harmless bots from scraping content.
Common countermeasures include:
● IP address-based bans
● JavaScript- and session-based countermeasures
Blocked Crawlers
Servers identify and block crawlers that continuously fire many requests
to a website.
Solution: Meet Crawlera, our smart proxy rotator for web crawlers.
● Routes requests through a pool of 50k+ IPs
● Detects, logs and handles bans
● Polite scraping: Automatically throttles requests to websites
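What a proxy rotator does can be sketched in a few lines. This is not Crawlera's implementation or API: the `ProxyRotator` class, the proxy addresses, and the one-second default delay are assumptions made for illustration. It shows the three ideas from the list above: rotate through a pool, skip banned proxies, and throttle per domain.

```python
import time
from collections import deque


class ProxyRotator:
    """Round-robin proxy pool with ban tracking and per-domain throttling."""

    def __init__(self, proxies, min_delay=1.0, clock=time.monotonic):
        self._pool = deque(proxies)
        self._banned = set()
        self._min_delay = min_delay  # minimum seconds between hits to one domain
        self._clock = clock          # injectable so the timing can be tested
        self._last_request = {}      # domain -> time of the last recorded request

    def next_proxy(self):
        """Return the next non-banned proxy in round-robin order."""
        for _ in range(len(self._pool)):
            proxy = self._pool[0]
            self._pool.rotate(-1)    # move the head of the pool to the back
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("all proxies are banned")

    def report_ban(self, proxy):
        """Mark a proxy as banned so it is skipped from now on."""
        self._banned.add(proxy)

    def record_request(self, domain):
        """Remember when `domain` was last requested."""
        self._last_request[domain] = self._clock()

    def delay_for(self, domain):
        """Seconds still to wait before `domain` may be requested again."""
        last = self._last_request.get(domain)
        if last is None:
            return 0.0
        return max(0.0, self._min_delay - (self._clock() - last))


# Hypothetical proxy addresses from a documentation-only IP range.
rotator = ProxyRotator(["192.0.2.1:8000", "192.0.2.2:8000", "192.0.2.3:8000"])
first = rotator.next_proxy()
rotator.report_ban("192.0.2.2:8000")  # e.g. after the site started refusing it
second = rotator.next_proxy()         # the banned proxy is skipped
```

A crawler would call `delay_for(domain)` before each request, sleep for that long, then call `record_request(domain)`.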
JavaScript in Web Pages
Dynamic content generated by JavaScript is often used by websites to
render the page (SPA) or to avoid being scraped by naive crawlers.
For simple instances, you can emulate the AJAX requests in Scrapy.
For complex cases, you can use Splash:
● Works through an HTTP API
● Lua Scripts simulate user interaction
● No lock-in, it’s an open source project!
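Because Splash works through an HTTP API, a crawler asks it for a rendered page instead of fetching the page directly. The sketch below only builds the request URL for Splash's `render.html` endpoint with its `url` and `wait` arguments; the helper name `splash_render_url` is hypothetical, and the host and port assume a Splash instance running locally on its default port.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=0.5):
    """Build a Splash render.html URL that returns the page's HTML after JavaScript ran.

    `splash_base` is an assumed local Splash instance; `wait` is the number of
    seconds Splash waits after page load before returning the HTML.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


render_url = splash_render_url("https://example.com/app")
```

The crawler then fetches `render_url` like any other page and parses the returned HTML as usual.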
Duplicate Content
The web is full of duplicate content.
Duplicate Content negatively impacts:
● Storage
● Re-crawl performance
● Quality of data
Efficient algorithms for Near Duplicate Detection, like SimHash, are
applied to estimate similarity between web pages to avoid scraping
duplicated content.
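A minimal SimHash sketch, to show how the fingerprinting works. The choices here are assumptions made for illustration: word tokens with equal weight, MD5 as the per-token hash, and a 64-bit fingerprint. Production systems usually weight tokens and index the fingerprints for fast lookup.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def simhash(text):
    """Return a 64-bit SimHash fingerprint of the text's word tokens."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:FINGERPRINT_BITS // 8], "big")
        # Each token votes +1 or -1 on every bit position of the fingerprint.
        for i in range(FINGERPRINT_BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_a = simhash("The web is full of duplicate content")
page_b = simhash("the web is full of duplicate content!")
distance = hamming_distance(page_a, page_b)  # same tokens, so the fingerprints match
```

Pages whose fingerprints differ in only a few bits are treated as near duplicates; the cut-off is tuned per dataset.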
Near Duplicate Detection Uses
Compare prices of products scraped from different retailers by finding
near duplicates in a dataset:

Title | Store | Price
ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) | Acme Store | 599.89
Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) | XYZ Electronics | 559.95

Merge similar items to avoid duplicate entries:

Name | Summary | Location
Saint Fin Barre’s Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… | 51.8944, -8.48064
St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, ... | 51.894401550293, -8.48064041137695
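One simple way to spot that the two laptop listings describe the same product is to compare the sets of tokens in their titles. The sketch below uses Jaccard similarity as a stand-in technique; it is not necessarily what Scrapinghub uses, and the tokenizer and the function names are assumptions made for illustration.

```python
import re


def title_tokens(title):
    """Lower-case a title and split it into alphanumeric tokens (dots kept for '12.5')."""
    return set(re.findall(r"[a-z0-9.]+", title.lower()))


def jaccard_similarity(title_a, title_b):
    """Share of tokens the two titles have in common, between 0.0 and 1.0."""
    a, b = title_tokens(title_a), title_tokens(title_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


acme = "ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB)"
xyz = "Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5'', HDD 320)"
similarity = jaccard_similarity(acme, xyz)  # 6 shared tokens out of 14 distinct ones
```

Listings whose similarity exceeds a chosen threshold can be grouped, after which their prices are compared or their records merged.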
Examples of Web Scraping Usage
Competitor Monitoring
E-commerce companies use web scraping to
monitor the price fluctuations and the ratings of
competitors:
● Scrape online retailers
● Structure the data in a search engine or DB
● Create an interface to search for products
● Sentiment analysis for product rankings
Monitor Resellers
We help electronics companies monitor the activities of their resellers:
● Tracking and watching out for stolen goods
● Pricing agreement violations
● Customer support responses on complaints
● Product line quality checks
Lead Generation
Mine scraped data to identify who to target in a company for your
outbound sales campaigns:
● Locate possible leads in your target market
● Identify the right contacts within each one
● Augment the information you already have on them
● Use data science to guess their email address
Human Resources
Reduce the time spent on HR tasks by creating a
select pool of applicants:
● Mine scraped data to locate candidates
● Match requisite skills and background
● Spot and rescue employees that are shopping
for a new job
Thank you!
scrapinghub.com
