Web data acquisition with R


        Scott Chamberlain
        October 28, 2011
Why would you even need to do this?

  Why not just get data through a
            browser?
Some use cases
• Reason 1: It just takes too dam* long to
  manually search/get data on a web interface

• Reason 2: Workflow integration

• Reason 3: Your work is reproducible and
  transparent if done from R instead of clicking
  buttons on the web
A few general methods of getting web
           data through R
•   Read file – ideal if available
•   HTML
•   XML
•   JSON
•   APIs that serve up XML/JSON
Practice…read.csv (or xls, txt, etc.)



Get URL for file…see screenshot
url <- "http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1"

mycsv <- read.csv(url)

mycsv
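A remote file can move or go offline, so it may help to wrap the read in an error handler. The sketch below is only an illustration: `read_csv_safely` is a hypothetical helper (not from the slides), and it is demonstrated on a small temporary file with made-up rows rather than on the Dryad URL, so it runs without a network connection. `read.csv` accepts a URL in the same position as the file path.

```r
# Hypothetical helper: read a CSV from a path or URL, returning NULL on failure
read_csv_safely <- function(path_or_url) {
  tryCatch(
    read.csv(path_or_url, stringsAsFactors = FALSE),
    error = function(e) {
      message("Could not read file: ", conditionMessage(e))
      NULL
    }
  )
}

# Demonstration on a temporary local file (made-up data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,count", "raven,3", "coyote,5"), tmp)
demo <- read_csv_safely(tmp)
str(demo)

# A path that does not exist yields NULL instead of stopping the script
missing <- suppressWarnings(read_csv_safely(file.path(tempdir(), "does-not-exist.csv")))
```

Returning `NULL` lets a longer workflow check the result and decide whether to retry or skip that data source.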
'Scraping' web data

• Why? When there is no API
  – Can either scrape XML or HTML or JSON
  – XML and JSON are easier formats to deal with
    from R
Scraping E.g. 1: XML
http://www.fishbase.org/summary/speciessummary.php?id=2
Scraping E.g. 1: XML
The summary XML page behind the rendered page…
Scraping E.g. 1: XML
We can process the XML ourselves using a bunch of lines of code…
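Processing XML by hand might look roughly like the sketch below. It uses the XML package on a small inline document so that it runs offline; the element names (`species`, `name`, `max_length_cm`) and all values are made up for illustration and are not the actual FishBase summary format.

```r
library(XML)

# Made-up XML standing in for a species summary page (not real FishBase output)
xml_text <- '<records>
  <species id="1"><name>Example fish A</name><max_length_cm>60</max_length_cm></species>
  <species id="2"><name>Example fish B</name><max_length_cm>25</max_length_cm></species>
</records>'

# Parse the string into an internal document
doc <- xmlParse(xml_text, asText = TRUE)

# Pull attributes and element values out with XPath expressions
fish <- data.frame(
  id = xpathSApply(doc, "//species", xmlGetAttr, "id"),
  name = xpathSApply(doc, "//species/name", xmlValue),
  max_length_cm = as.numeric(xpathSApply(doc, "//species/max_length_cm", xmlValue)),
  stringsAsFactors = FALSE
)
fish
```

For a real page, the XPath expressions would need to match whatever element names the source actually uses.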
Scraping E.g. 1: XML
…OR just use a package someone already created - rfishbase



                                         And you get this nice plot
Practice…XML and JSON formats
     data from the USA National Phenology Network
install.packages(c("RCurl","XML","RJSONIO")) # if not installed already
require(RCurl); require(XML); require(RJSONIO)

XML Format
xmlurl <- paste0('http://www-dev.usanpn.org/npn_portal/observations/',
    'getObservationsForSpeciesIndividualAtLocation.xml?',
    'year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3')
xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
xmlTreeParse(xmlout)[[1]][[1]]

JSON Format
jsonurl <- paste0('http://www-dev.usanpn.org/npn_portal/observations/',
    'getObservationsForSpeciesIndividualAtLocation.json?',
    'year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3')
jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
fromJSON(jsonout)
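Once the JSON is parsed, the result is a nested R list that usually needs flattening before analysis. The sketch below shows one way to do that with RJSONIO on an inline string; the field names (`station_id`, `observations`, `day_of_year`, `count`) and values are assumptions made up for illustration, not the actual USA-NPN response format.

```r
library(RJSONIO)

# Made-up JSON standing in for an API response (not real USA-NPN output)
json_text <- '{"station_id": 4881,
  "observations": [{"day_of_year": 120, "count": 2},
                   {"day_of_year": 127, "count": 5}]}'

parsed <- fromJSON(json_text, asText = TRUE)

# Each observation is one element of a list; extract fields by name
obs <- parsed[["observations"]]
obs_df <- data.frame(
  day_of_year = sapply(obs, function(o) o[["day_of_year"]]),
  count = sapply(obs, function(o) o[["count"]])
)
obs_df
```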
Scraping E.g. 2: HTML
 All this code can produce something like…
Scraping E.g. 2: HTML
          …this
Practice…scraping HTML
install.packages(c("XML","RCurl")) # if not already installed
require(XML); require(RCurl)

# Let's look at the raw HTML first
rawhtml <- getURLContent('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawhtml

# Scrape data from the website
rawPMI <- readHTMLTable('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawPMI
PMI <- data.frame(rawPMI[[1]])
names(PMI)[1] <- 'Year'
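Scraped table columns typically arrive as text, so a conversion step is often needed before plotting or modelling. The following sketch runs `readHTMLTable` on a small inline HTML table so it works offline; the table contents are made-up placeholder numbers, not real PMI values, and the column layout is an assumption.

```r
library(XML)

# Made-up HTML table (placeholder numbers, not real ISM data)
html_text <- '<html><body><table>
  <tr><th>Year</th><th>JAN</th><th>FEB</th></tr>
  <tr><td>2010</td><td>50.1</td><td>51.2</td></tr>
  <tr><td>2011</td><td>52.3</td><td>53.4</td></tr>
</table></body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable returns a list with one data frame per <table> in the page
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
pmi_demo <- tables[[1]]

# Columns come back as character; convert them all to numeric
pmi_demo[] <- lapply(pmi_demo, as.numeric)
str(pmi_demo)
```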
APIs (application programming interfaces)

• Many data sources have APIs – largely for
  talking to other web interfaces
  – we can use their API from R
• An API consists of a set of methods to search,
  retrieve, or submit data to a data
  source/repository
• One can write R code to interface with an API
  – Keep in mind some APIs require authentication
    keys
API Documentation
• API docs for the Integrated Taxonomic
  Information Service (ITIS):
http://www.itis.gov/ws_description.html




                  http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
Example: Simple call to API
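A simple call like this amounts to building a URL from an endpoint plus query parameters and then fetching it. The sketch below covers the URL-building half in base R; `build_query_url` is a hypothetical helper written for this example, and the fetch is left as a comment because it needs network access and the RCurl and XML packages.

```r
# Hypothetical helper: combine a base endpoint and named parameters into a URL
build_query_url <- function(base, params) {
  values <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(paste0(names(params), "=", values), collapse = "&"))
}

itis_url <- build_query_url(
  "http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName",
  list(srchKey = "Tardigrada")
)
itis_url

# With a network connection, the response could then be fetched and parsed:
# library(RCurl); library(XML)
# xmlTreeParse(getURLContent(itis_url))
```

Keeping the parameters in a named list makes it easy to loop over many search terms, which is where calling an API from R starts to pay off.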
rOpenSci suite of R packages
• There are many packages on CRAN for specific
  data sources on the web – search on CRAN to
  find these
• rOpenSci is developing a lot of packages for as
  many open source data sources as possible
  – Please use and give feedback…
Data                                    Literature/metadata




       http://ropensci.org/ , code at GitHub
Three examples of packages that
      interact with an API
API E.g. 1: Search literature: rplos
You can do this using this tutorial: http://ropensci.org/tutorials/rplos-tutorial/
API E.g. 2: Get taxonomic information
    for your study species: taxize
      A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
API E.g. 3: Get some data: dryad
     A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
Calling external programs from
               R
Why even think about doing this?
ā€¢ Again, workflow integration

• It's just easier to call X program from R if you
  are going to run many analyses with said
  program
Eg. 1: Phylometa
…using the files in the dropbox
Also, get Phylometa here:
http://lajeunesse.myweb.usf.edu/publications.html
• On a Mac: Phylometa doesn't run because it is a
  Windows .exe
   – But system() often can work to run external programs
• On Windows:
   system('"new_phyloMeta_1.2b.exe" Aerts2006JEcol_tree.txt Aerts2006JEcol_data.txt', intern = TRUE)
   NOTE: intern = TRUE returns the program's output to R as a character vector


   Should give you something like this   →
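For readers without Phylometa, the same idea can be tried with any command-line program. The sketch below uses `system2()`, a base R alternative to `system()` that takes the command and its arguments separately, and calls the `Rscript` executable that ships with R as a stand-in external program. The choice of `Rscript` is only an assumption made so that the example runs on any platform.

```r
# Locate the Rscript executable bundled with the running R installation
rscript <- file.path(R.home("bin"), "Rscript")

# Run it as an external program and capture its standard output
out <- system2(
  rscript,
  args = c("-e", shQuote("cat(1 + 1)")),
  stdout = TRUE
)
out
```

Capturing the output with `stdout = TRUE` plays the same role as `intern = TRUE` in `system()`: the text comes back as a character vector that later R code can parse.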
Resources
• rOpenSci (development of R packages for all
  open source data and literature)
• CRAN packages (search for a data source)
• Tutorials/websites:
  – http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl

• Non-R based, but cool:
  http://ecologicaldata.org/

Ā 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
Ā 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
Ā 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Ā 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
Ā 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Ā 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
Ā 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
Ā 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Ā 

Web data from R

  • 1. Web data acquisition with R Scott Chamberlain October 28, 2011
  • 2. Why would you even need to do this? Why not just get data through a browser?
  • 3. Some use cases
      • Reason 1: It just takes too dam* long to manually search/get data on a web interface
      • Reason 2: Workflow integration
      • Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
  • 4. A few general methods of getting web data through R
  • 5. Methods
      • Read file: ideal if available
      • HTML
      • XML
      • JSON
      • APIs that serve up XML/JSON
  • 6. Practice… read.csv (or xls, txt, etc.)
      Get the URL for the file (see screenshot), then read it directly:
        url <- "http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1"
        mycsv <- read.csv(url)
        mycsv
  • 7. 'Scraping' web data
      • Why? When there is no API
          - Can scrape XML, HTML, or JSON
          - XML and JSON are easier formats to deal with from R
  • 8. Scraping E.g. 1: XML http://www.fishbase.org/summary/speciessummary.php?id=2
  • 9. Scraping E.g. 1: XML The summary XML page behind the rendered page…
  • 10. Scraping E.g. 1: XML We can process the XML ourselves using a bunch of lines of code…
  • 11. Scraping E.g. 1: XML …OR just use a package someone already created: rfishbase. And you get this nice plot
  • 12. Practice… XML and JSON formats: data from the USA National Phenology Network
        install.packages(c("RCurl", "XML", "RJSONIO")) # if not installed already
        require(RCurl); require(XML); require(RJSONIO)

        # XML format (the URL is built with paste0 so it contains no line breaks)
        xmlurl <- paste0('http://www-dev.usanpn.org/npn_portal/observations/',
                         'getObservationsForSpeciesIndividualAtLocation.xml?',
                         'year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3')
        xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
        xmlTreeParse(xmlout)[[1]][[1]]

        # JSON format
        jsonurl <- paste0('http://www-dev.usanpn.org/npn_portal/observations/',
                          'getObservationsForSpeciesIndividualAtLocation.json?',
                          'year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3')
        jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
        fromJSON(jsonout)
  • 13. Scraping E.g. 2: HTML All this code can produce something like…
  • 14. Scraping E.g. 2: HTML …this
  • 15. Practice… scraping HTML
        install.packages(c("XML", "RCurl")) # if not already installed
        require(XML); require(RCurl)

        # Let's look at the raw HTML first
        rawhtml <- getURLContent('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
        rawhtml

        # Scrape data from the website
        rawPMI <- readHTMLTable('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
        rawPMI
        PMI <- data.frame(rawPMI[[1]])
        names(PMI)[1] <- 'Year'
  • 16. APIs (application programming interfaces)
      • Many data sources have APIs, largely for talking to other web interfaces; we can use their API from R
      • An API consists of a set of methods to search, retrieve, or submit data to a data source/repository
      • One can write R code to interface with an API
          - Keep in mind that some APIs require authentication keys
  • 17. API Documentation
      • API docs for the Integrated Taxonomic Information System (ITIS): http://www.itis.gov/ws_description.html
      • Example call: http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
  • 19. rOpenSci suite of R packages
      • There are many packages on CRAN for specific data sources on the web; search on CRAN to find these
      • rOpenSci is developing a lot of packages for as many open source data sources as possible
          - Please use and give feedback…
  • 20. Data Literature/metadata http://ropensci.org/ , code at GitHub
  • 21. Three examples of packages that interact with an API
  • 22. API E.g. 1: Search literature: rplos You can do this using this tutorial: http://ropensci.org/tutorials/rplos-tutorial/
  • 23. API E.g. 2: Get taxonomic information for your study species: taxize A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
  • 24. API E.g. 3: Get some data: dryad A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
  • 26. Why even think about doing this?
      • Again, workflow integration
      • It's just easier to call X program from R if you are going to run many analyses with said program
  • 27. E.g. 1: Phylometa …using the files in the dropbox. Also, get Phylometa here: http://lajeunesse.myweb.usf.edu/publications.html
      • On a Mac: doesn't work, because it's an .exe
          - But system() often can work to run external programs
      • On Windows:
          system('"new_phyloMeta_1.2b.exe" Aerts2006JEcol_tree.txt Aerts2006JEcol_data.txt', intern = TRUE)
        NOTE: intern = TRUE captures the program's output as an R character vector, which is printed to the R console if not assigned
        Should give you something like this →
  • 28. Resources
      • rOpenSci (development of R packages for all open source data and literature)
      • CRAN packages (search for a data source)
      • Tutorials/websites:
          - http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl
      • Non-R based, but cool: http://ecologicaldata.org/
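
Slide 27 runs an external program from R and captures its output with system(..., intern = T). As a minimal sketch of the same idea that does not need the Phylometa executable, the example below uses base R's system2() and calls the Rscript binary that ships with R as a stand-in for any command-line program. The helper name run_external and the use of Rscript as the external program are assumptions made for illustration; they are not from the slides.

```r
# Stand-in external program (assumption): the Rscript executable that ships
# with the running R installation. Any command-line tool could take its place.
rscript <- file.path(R.home("bin"), "Rscript")

# Hypothetical helper: run a program with arguments and return what it prints
# to standard output as a character vector, one element per line.
run_external <- function(program, args = character()) {
  # stdout = TRUE plays the same role as intern = TRUE in system()
  system2(program, args = args, stdout = TRUE)
}

# Ask the external process to print a small computation, then capture it.
# shQuote() protects the expression from the shell.
out <- run_external(rscript, c("-e", shQuote("cat(6 * 7)")))
print(out)
```

Because the output comes back as an ordinary character vector, it can be parsed or passed on to later analysis steps, which is the workflow-integration point made on slide 26.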