REGULAR EXPRESSIONS, EXTRAORDINARY POWER
UNSL
2013
Burdisso Sergio - sergio.burdisso@gmail.com
• I have 20 minutes to cover everything about using REs on the W3
• HTTP
• Internet bots
• Web Crawler
• Web Scraping
HyperText Transfer Protocol
[Diagram: a Web browser sends an HTTP request to the WWW (the Web) and receives an HTTP response]
• Application layer protocol
• HTTP is the protocol used to exchange or transfer hypertext (sequences of characters)
HTTP documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html
• HTTP response example
[Figure: an HTTP response divided into its header and its body]
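Since the figure is not reproduced here, the following is a minimal sketch of what such a response looks like as text and how the header can be separated from the body at the first blank line. The response content and the name `split_response` are made up for illustration, and Python's `re` module is used only as a convenient stand-in.

```python
import re

# Hypothetical response text; a real one would arrive over a network socket.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 28\r\n"
    "\r\n"
    "<html><body>Hi</body></html>"
)

def split_response(raw):
    """Split an HTTP response at the first blank line into (header, body)."""
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    header = parts[0]
    # If there is no blank line, treat everything as header and the body as empty.
    body = parts[1] if len(parts) == 2 else ""
    return header, body

header, body = split_response(RAW_RESPONSE)
print(header)
print("---")
print(body)
```

The header is the status line plus the header fields; everything after the blank line is the body.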
EXTRAORDINARY POWER
• First Things First… Regular Expressions Are Awesome!
• Gather text
• Replace / Transform text
• Search / Validate text
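The three uses listed above can be sketched in a few lines. This is only an illustration: the sample text and the patterns are invented, and the e-mail pattern is deliberately simple rather than a complete validator.

```python
import re

text = "Contact: alice@example.com, bob@example.org (updated 2013-05-20)"

# Gather: collect every substring that looks like an e-mail address.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

# Replace / transform: rewrite the year-month-day date as day/month/year.
transformed = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# Search / validate: check that a whole string is exactly four digits.
is_year = re.fullmatch(r"\d{4}", "2013") is not None

print(emails)
print(transformed)
print(is_year)
```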
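• POSIX regular expressions (standard)
▪ ^ . [ ] [^ ] ( ) * {m,n} ? + | $
• regex.h
• pattern = "([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})"  // POSIX has no \d, so use [0-9]
• regcomp(&regex, pattern, cflags);  // regex_t regex; cflags must include REG_EXTENDED for this syntax
• regex.re_nsub = 4  // number of parenthesized subexpressions
• regexec(&regex, text, nmatch, pmatch, eflags);  // regmatch_t pmatch[nmatch]
• pmatch[i].rm_so, pmatch[i].rm_eo  // start and end offsets of subexpression i; each octet must then be checked to be <= 255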
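The same compile, match, and inspect-subexpressions sequence can be sketched with Python's `re` module, used here as a stand-in for the regex.h calls rather than as the API on the slide. `IPV4.groups` plays the role of `re_nsub`, and `m.start(i)` / `m.end(i)` correspond to `rm_so` / `rm_eo`. The function name `find_ipv4` is hypothetical.

```python
import re

# Four parenthesized subexpressions, one per octet.
IPV4 = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")

def find_ipv4(text):
    """Return (start, end, octets) for the first dotted quad in text, or None.

    The pattern alone accepts values such as 999, so every octet is
    range-checked after the match. Only the first candidate is examined.
    """
    m = IPV4.search(text)
    if m is None:
        return None
    octets = [int(g) for g in m.groups()]
    if any(o > 255 for o in octets):
        return None
    return m.start(), m.end(), octets

print(IPV4.groups)                       # number of subexpressions
print(find_ipv4("host 192.168.0.1 up"))
print(find_ipv4("bad 999.1.1.1"))
```

The range check lives outside the pattern on purpose: encoding "0 to 255" in the expression itself is possible but much harder to read.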
• Using REs to parse HTTP response headers
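One way this could look, as a sketch under assumptions: one expression for the status line and one for each "Name: value" header field. The patterns are simplified (no folded header lines, no repeated field names) and the names `STATUS_LINE`, `HEADER_FIELD`, and `parse_headers` are invented for illustration.

```python
import re

# Status line, e.g. "HTTP/1.1 200 OK": version, 3-digit code, reason phrase.
STATUS_LINE = re.compile(r"HTTP/(\d\.\d) (\d{3}) ?([^\r\n]*)")

# One header field per line: name, colon, optional blanks, value.
HEADER_FIELD = re.compile(r"^([^:\s]+):[ \t]*(.*?)[ \t]*\r?$", re.MULTILINE)

def parse_headers(header_block):
    """Return (version, status_code, reason, fields) from a response header."""
    m = STATUS_LINE.match(header_block)
    if m is None:
        raise ValueError("not an HTTP response header")
    # Field names are case-insensitive, so they are stored in lower case.
    fields = {name.lower(): value
              for name, value in HEADER_FIELD.findall(header_block)}
    return m.group(1), int(m.group(2)), m.group(3), fields

header_block = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Mon, 20 May 2013 10:00:00 GMT\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 28"
)
version, code, reason, fields = parse_headers(header_block)
print(version, code, reason)
print(fields)
```

Fields such as Content-Type and Content-Length are what tell a client how to interpret the body that follows.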
Great! Now we’re able to parse the HTTP response headers… so what?
- We can properly process the response body!
Ah, I see! … And what would I do that for?
- Let me show you!
Just like spiders on the web!
[Figure: "Regular Expressions" cartoon from xkcd]
Web Scraping (we will see!)
• Internet bots (web robots, WWW robots, or bots) are software applications that run automated tasks over the Internet
• A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing
• Web scraping is a computer software technique for extracting information from websites
• A Web crawler starts with a list of URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit
hyperlinks0 = getAllLexemes(rsp.Body, "href=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
hyperlinks1 = getAllLexemes(rsp.Body, "src=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
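The crawl loop described above can be sketched as follows. The link pattern is adapted from the one on the slide, with href and src merged into one alternation and the host class also excluding the double quote so that a host cannot run past the closing quote. The pages, `fake_fetch`, `extract_links`, and `crawl` are hypothetical; a real crawler would fetch pages over HTTP and respect robots.txt.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Groups: 1 = scheme and host, 2 = "http:", 3 = host, 4 = path.
LINK = re.compile(r'(?:href|src)="((http:)?//([^/"\r\n]*))?(/?[^"]*)"')

def extract_links(body, base_url):
    """Return the absolute URLs of all href/src targets found in body."""
    links = []
    for _, _, host, path in LINK.findall(body):
        if host:
            links.append("http://" + host + path)
        else:
            # Relative (or non-http) target: resolve against the current page.
            links.append(urljoin(base_url, path))
    return links

def crawl(seeds, fetch, limit=10):
    """Visit URLs breadth-first, adding newly found links to the list."""
    to_visit = deque(seeds)
    seen = set(seeds)
    visited = []
    while to_visit and len(visited) < limit:
        url = to_visit.popleft()
        visited.append(url)
        for link in extract_links(fetch(url), url):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/":
        '<a href="/about">About</a> <img src="//cdn.example.com/logo.png">',
    "http://example.com/about": '<a href="http://example.com/">Home</a>',
}

def fake_fetch(url):
    return PAGES.get(url, "")

print(crawl(["http://example.com/"], fake_fetch))
```

The `seen` set is what keeps the crawler from visiting the same URL twice when pages link back to each other.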
• Web scraping: a simple yet powerful approach to extracting information from web pages can be based on the regular expression matching facilities of programming languages (for instance C++, Perl, or Python)
[Figure: "Regular Expressions" cartoon from xkcd]
WebScraping wScraping(8, "http://emails.com/victim");
wScraping.findAll(
"^(?n:(?<address1>(\\d{1,5}( 1/[234])?(\\x20[A-Z]([a-z])+)+ )|(P\\.O\\. Box \\d{1,5}))\\s{1,2}"
"(?i:(?<address2>(((APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)\\x20\\w{1,5})"
"|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)\\.?)\\s{1,2})?)"
"(?<city>[A-Z]([a-z])+(\\.?)(\\x20[A-Z]([a-z])+){0,2}),\\x20"
"(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]"
"|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\\x20(?<zipcode>(?!0{5})\\d{5}(-\\d{4})?))$"
);
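To make the idea behind that expression easier to follow, here is a much smaller sketch that keeps only the city, state, and ZIP code parts and reads them back through named groups. It is not the expression from the slide: the state is reduced to any two capital letters, Python writes named groups as `(?P<name>...)`, and `ADDRESS_TAIL`, `find_addresses`, and the sample text are invented for illustration.

```python
import re

ADDRESS_TAIL = re.compile(
    r"(?P<city>[A-Z][a-z]+(?: [A-Z][a-z]+){0,2}), "   # up to three capitalized words
    r"(?P<state>[A-Z]{2}) "                           # simplified: any two capitals
    r"(?P<zipcode>(?!0{5})\d{5}(?:-\d{4})?)"          # 5 digits, not 00000, optional +4
)

def find_addresses(page_text):
    """Return one dict of named groups per "City, ST 12345" occurrence."""
    return [m.groupdict() for m in ADDRESS_TAIL.finditer(page_text)]

sample = ("Ship to: 12 Main St, San Luis, CA 90210-1234. "
          "Billing: Springfield, IL 62704")
for address in find_addresses(sample):
    print(address)
```

Named groups let the calling code ask for `address["zipcode"]` instead of counting parentheses, which matters a great deal once an expression grows as large as the one above.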
We’ve saved the day!
Everybody stand back!
We know regular expressions
The end
Thank you for your patience!
