Information Retrieval and Extraction
Christopher M. Frenz
• Information is being generated at a faster rate than ever before
• The speed at which information can be generated is continually increasing
• Continuous improvements in computers, storage, and networking make much of this information readily available to individuals
54,000 hits
• Most search engines use a keyword-based approach
• If a document contains all of the specified keywords, it is returned as a match
• Ranking algorithms (e.g. PageRank) are used to put the most relevant results at the top of the list and the least relevant at the bottom
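
A minimal sketch of the keyword approach described above, assuming a tiny in-memory set of documents invented for illustration. The `%documents` hash and the `matches_all_keywords` helper are hypothetical names, not part of any search engine. A document is returned only when every keyword appears in it; ranking is left out entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns true when $text contains every keyword as a whole word,
# ignoring case.
sub matches_all_keywords {
    my ($text, @keywords) = @_;
    for my $keyword (@keywords) {
        return 0 unless $text =~ /\b\Q$keyword\E\b/i;
    }
    return 1;
}

# Hypothetical documents, made up for this example.
my %documents = (
    doc1 => 'Perl makes regular expression matching easy',
    doc2 => 'Search engines rank documents by relevance',
    doc3 => 'A regular expression can describe a phone number',
);

my @keywords = ('regular', 'expression');
for my $id (sort keys %documents) {
    print "$id\n" if matches_all_keywords($documents{$id}, @keywords);
}
```

With the sample data, only doc1 and doc3 contain both keywords, so only those two identifiers are printed.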
• Not everything can be easily expressed as a keyword
• Suppose you want to search for phone numbers you do not already know. How can you do this with keywords?
• How do we recognize a phone number when we see one?
• We recognize a phone number by recognizing the pattern of digits
  ◦ (XXX) XXX-XXXX
• While it is hard to express such a pattern as a keyword, it is easy to express it as a regular expression
• (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
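#!/usr/bin/perl
use strict;
use warnings;

# Sample text: every line except the last contains a number that fits the pattern
my $string=<<'LIST';
John (555) 555-5555 fits pattern
Bob 234 567-8901
Mary 734-234-9873
Tom 999 999-9999
Harry 111 111 1111 does not fit pattern
LIST

# Prints every substring that matches the phone number pattern
while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
    print "$1\n";
}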
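
Once the pattern matches, capture groups can pull the pieces of each number apart. The sketch below is a variation on the script above and is not part of the original slides: it captures the three digit groups separately and returns each number in a single XXX-XXX-XXXX form. The `normalized_numbers` name and the sample text are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Finds every phone number in $text and returns it as XXX-XXX-XXXX.
sub normalized_numbers {
    my ($text) = @_;
    my @numbers;
    # Three capture groups: area code, exchange, and line number.
    while ($text =~ /\(?(\d{3})\)?[-\s.]?(\d{3})[-.](\d{4})/g) {
        push @numbers, "$1-$2-$3";
    }
    return @numbers;
}

my $sample = 'John (555) 555-5555, Mary 734-234-9873, Harry 111 111 1111';
print "$_\n" for normalized_numbers($sample);
```

The last number in the sample is skipped because the pattern requires a hyphen or period before the final four digits.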
• Conduct a broad keyword search using an existing search engine
• Use your custom-coded application to take the returned search results and perform regular-expression-based pattern matching
• The results that match your regular expression are your refined search results
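
The steps above can be sketched as a small filter. This is an illustration under stated assumptions, not the author's application: the `@search_results` list stands in for text returned by a search engine and is invented here, and `refine_results` is a hypothetical helper name. The pattern is the phone number expression from the earlier slide, without the leading `\s?` and the outer capture group so that each hit is the number itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Phone number pattern; it must not contain capture groups of its own,
# because refine_results wraps it in a single group.
my $phone = qr/\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;

# Keeps only the results that match the pattern and records what matched.
sub refine_results {
    my ($pattern, @results) = @_;
    my @refined;
    for my $result (@results) {
        my @hits = $result =~ /($pattern)/g;
        push @refined, { text => $result, hits => \@hits } if @hits;
    }
    return @refined;
}

# Stand-in for the text of results from a broad keyword search (made up).
my @search_results = (
    'Call our office at (555) 555-5555 today',
    'Perl tutorial for beginners',
    'Support line: 734-234-9873 or 800.555.1212',
);

for my $match (refine_results($phone, @search_results)) {
    print "$match->{text}\n";
    print "  found: $_\n" for @{ $match->{hits} };
}
```

Keeping the search step and the pattern step separate means the same filter can sit behind any API that returns text.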
General Search APIs
• Bing
• Yahoo BOSS
• Blekko
• Yandex
• Twitter

Specialized Search APIs
• Medicine – PubMed
• Physics – arXiv
• Government – GovTrack
• Finance – Yahoo Finance
• etc.
Seeking to extract: DFN [A-Z]\d+
Script described in: http://www.biomedcentral.com/1472-6947/7/32
#!/usr/bin/perl
use LWP;
use strict;
use warnings;

# Sets the query and the session of Congress
my $query='fracking';
my $congress=112;

# Requests the matching bills from the GovTrack API
my $ua = LWP::UserAgent->new;
my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
my $response=$ua->get($url);
die "Request failed: ", $response->status_line, "\n" unless $response->is_success;
my $result=$response->content;
print $result;

Returns JSON-formatted output
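
Since the output is JSON, it can be decoded into Perl data before any pattern matching. The sketch below uses the core `JSON::PP` module on a hard-coded sample string rather than a live request. The sample response and its field names (`objects`, `title`) are assumptions made for illustration and should be checked against the actual API output; `matching_titles` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;

# Made-up sample response; the "objects" and "title" field names are
# assumptions, not a documented contract.
my $json_text = '{"objects":[{"title":"A bill about fracking disclosure"},'
              . '{"title":"A bill about clean water"}]}';

# Decodes the JSON text and returns the titles that match the pattern.
sub matching_titles {
    my ($json, $pattern) = @_;
    my $data = decode_json($json);
    my @titles;
    for my $bill (@{ $data->{objects} || [] }) {
        my $title = $bill->{title};
        push @titles, $title if defined $title && $title =~ $pattern;
    }
    return @titles;
}

print "$_\n" for matching_titles($json_text, qr/fracking/i);
```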
#!/usr/bin/perl

use LWP;
use XML::LibXML;
use strict;
use warnings;

# Requests the search results as an RSS feed
my $ua=LWP::UserAgent->new();
my $query='perl programming';
my $url="http://blekko.com/ws/?q=$query+/rss";
my $response=$ua->get($url);
die "Request failed: ", $response->status_line, "\n" unless $response->is_success;
my $results=$response->content;

# Parses the feed and prints the link and description of each item
my $parser=XML::LibXML->new;
my $domtree=$parser->parse_string($results);
my @Records=$domtree->getElementsByTagName("item");
my $i=0;
foreach my $record (@Records){
    my $link=$record->getChildrenByTagName("link");
    print "$i $link\n";
    my $description=$record->getChildrenByTagName("description");
    print "$description\n\n";
    $i++;
}
• Allows programmers to extract code samples pertaining to a set of keywords
• Recognizes the patterns associated with C/C++ functions and C/C++ control structures
int myfunc ( ){
//code here
}
while ( ) {
//code here
}
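
The patterns above pair a header (a function signature or control keyword) with a brace-delimited body. A minimal sketch of pulling one such sample out of surrounding text with `Text::Balanced` follows. The `$header` pattern is a simplified assumption for this example, and `first_code_sample` and the sample page text are hypothetical. Because `extract_codeblock` scans the body with Perl-oriented rules, the sample body is kept to plain statements.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Simplified header pattern (an assumption for this sketch): a return type
# and name, or a control keyword, followed by a parenthesized list.
my $header = qr/(?:(?:int|long|double|float|void)\s+\w{1,25}|if|while|for)\s*\([^)]*\)\s*/;

# Returns the first header plus its balanced {...} body, or nothing.
sub first_code_sample {
    my ($text) = @_;
    # Discards everything before the header but leaves the opening brace.
    return unless $text =~ s/.*?($header)\{/{/s;
    my $found = $1;
    # In scalar context the balanced block is returned (undef on failure).
    my $block = extract_codeblock($text, '{}');
    return unless defined $block;
    return $found . $block;
}

# Made-up page text with one code sample embedded in prose.
my $page = <<'PAGE';
Some prose about adding numbers.
int add (int a, int b) {
    int total = a + b;
    return total;
}
More prose after the sample.
PAGE

my $sample = first_code_sample($page);
print "$sample\n" if defined $sample;
```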
use Text::Balanced qw(extract_codeblock);

# Fragment: @links (result URLs), $request (an LWP::UserAgent object), and the
# OFile output filehandle are assumed to be set up earlier in the script

# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';

# regex used to match keywords/patterns that precede code blocks
my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

foreach my $link (@links){
    my $response=$request->get("$link"); # gets Web page
    my $results=$response->content;
    while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out JavaScript
    pos($results)=0;
    while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
        my $code=$1 . extract_codeblock($results,$delim);
        print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
        print OFile "$code" . "\n" . "\n";
    }
}
• A common challenge when performing information extraction and text mining on Web pages, or parts of Web pages, is that the content is served up by JavaScript
• This can be dealt with by running the JavaScript that serves up the content through a JavaScript engine such as V8
<title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">
<!--
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at http://www.jottings.com/obfuscator/
{ coded = "OKUxkq@KwtoO2K.0ko"
key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}
//-->
</script><noscript>Sorry, you need Javascript on to email me.</noscript>
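
To see what the embedded script computes, its decoding loop can be traced by hand. The sketch below is a direct Perl port of that loop, offered only as an aid to reading the page above; it is not the approach taken in these slides, which run the JavaScript itself through an engine. The `decode_obfuscated` name and the short sample key are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ports the page's decoding loop: characters found in the key are shifted
# back by the length of the coded string, all others pass through unchanged.
sub decode_obfuscated {
    my ($coded, $key) = @_;
    my $shift      = length($coded);
    my $key_length = length($key);
    my $link       = '';
    for my $char (split //, $coded) {
        my $index = index($key, $char);
        if ($index == -1) {
            $link .= $char;
        }
        else {
            my $position = ($index - $shift + $key_length) % $key_length;
            $link .= substr($key, $position, 1);
        }
    }
    return $link;
}

# Made-up key and coded string, small enough to check by hand.
print decode_obfuscated('cd@e', 'abcdef'), "\n";
```

With the six-character sample key and a four-character coded string, each key character moves back four places (wrapping around), so the call prints `ef@a`.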
#!/usr/bin/perl
use JavaScript::V8;
use LWP;
use Text::Balanced qw(extract_codeblock);
use strict;
use warnings;

# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';

# downloads Web page
my $ua=LWP::UserAgent->new;
my $response=$ua->get('http://localhost/email.html');
my $result=$response->content;
#print "$result\n\n";

# extracts JavaScript
my $js;
if($result=~s/.*?http:\/\/www\.jottings\.com\/obfuscator\/\s*\{/{/s){
    $js=extract_codeblock($result,$delim);
}
die "No obfuscator script found\n" unless defined $js;

# modifies JS to make it processable by the V8 module
$js=~s/document\.write/write/;
$js=~s/'/'/g;
#print "$js\n\n";

# processes JS
my $context = JavaScript::V8::Context->new();
$context->bind_function(write => sub { print @_ });
my $mail=$context->eval("$js");
print "$mail\n\n";
• cfrenz@gmail.com
• http://www.linkedin.com/in/christopherfrenz/
