2. Why would you even need to do this?
Why not just get data through a browser?
3. Some use cases
• Reason 1: It just takes too dam* long to manually search/get data on a web interface
• Reason 2: Workflow integration
• Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
5. • Read file – ideal if available
• HTML
• XML
• JSON
• APIs that serve up XML/JSON
6. Practice…read.csv (or xls, txt, etc.)
Get URL for file…see screenshot
url <- 'http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1'
mycsv <- read.csv(url)
mycsv
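A small defensive variant (not from the original slides): remote files move or disappear, so wrapping `read.csv()` in `tryCatch()` lets a longer script continue with a message instead of stopping. The helper name `read_csv_safely` is my own.

```r
# Sketch: read a CSV from a path or URL, returning NULL (with a message)
# instead of throwing an error when the source cannot be read.
read_csv_safely <- function(path_or_url) {
  tryCatch(read.csv(path_or_url, stringsAsFactors = FALSE),
           error = function(e) {
             message("Could not read ", path_or_url, ": ", conditionMessage(e))
             NULL
           })
}

# mycsv <- read_csv_safely(url)   # NULL (plus a message) if the link is dead
```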
7. "Scraping" web data
• Why? When there is no API
– Can either scrape XML, HTML, or JSON
– XML and JSON are easier formats to deal with from R
8. Scraping E.g. 1: XML
http://www.fishbase.org/summary/speciessummary.php?id=2
9. Scraping E.g. 1: XML
The summary XML page behind the rendered page…
10. Scraping E.g. 1: XML
We can process the XML ourselves using a bunch of lines of code…
11. Scraping E.g. 1: XML
…OR just use a package someone already created – rfishbase
And you get this nice plot
12. Practice…XML and JSON formats
data from the USA National Phenology Network
install.packages(c("RCurl","XML","RJSONIO")) # if not installed already
require(RCurl); require(XML); require(RJSONIO)
XML Format
xmlurl <- 'http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.xml?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3'
xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
xmlTreeParse(xmlout)[[1]][[1]]
JSON Format
jsonurl <- 'http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.json?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3'
jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
fromJSON(jsonout)
13. Scraping E.g. 2: HTML
All this code can produce something like…
15. Practice…scraping HTML
install.packages(c("XML","RCurl")) # if not already installed
require(XML); require(RCurl)
# Let's look at the raw html first
rawhtml <- getURLContent('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawhtml
# Scrape data from the website
rawPMI <- readHTMLTable('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawPMI
PMI <- data.frame(rawPMI[[1]])
names(PMI)[1] <- 'Year'
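One cleanup step you will almost always need after `readHTMLTable()`: it returns character (or factor) columns, so numeric columns must be converted explicitly. A sketch on a toy two-column table standing in for the scraped PMI data:

```r
# Toy stand-in for the scraped table; real scraped values arrive as text
PMI <- data.frame(Year = c("2011", "2012"), PMI = c("55.3", "53.1"),
                  stringsAsFactors = FALSE)

# Convert any column that is entirely numeric; leave the rest alone
PMI[] <- lapply(PMI, function(col) {
  num <- suppressWarnings(as.numeric(as.character(col)))
  if (all(!is.na(num))) num else col
})
str(PMI)
```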
16. APIs (application programming interface)
• Many data sources have APIs – largely for talking to other web interfaces
– we can use their API from R
• Consists of a set of methods to search, retrieve, or submit data to a data source/repository
• One can write R code to interface with an API
– Keep in mind some APIs require authentication keys
17. API Documentation
ā¢ API docs for the Integrated Taxonomic
Information Service (ITIS):
http://www.itis.gov/ws_description.html
http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
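The same request can be issued from R. A sketch that builds the query URL shown above; the fetch itself needs a network connection, and the `combinedName` element queried at the end is my guess at the response schema, not confirmed from the ITIS docs:

```r
# Build the request URL for the searchByScientificName method
base <- 'http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName'
qry <- paste0(base, '?srchKey=', URLencode('Tardigrada', reserved = TRUE))
qry

# With a network connection, fetch and parse the XML response:
# require(RCurl); require(XML)
# doc <- xmlParse(getURLContent(qry))
# xpathSApply(doc, "//*[local-name()='combinedName']", xmlValue)
```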
19. rOpenSci suite of R packages
• There are many packages on CRAN for specific data sources on the web – search on CRAN to find these
• rOpenSci is developing packages for as many open source data sources as possible
– Please use and give feedback…
20. Data Literature/metadata
http://ropensci.org/, code at GitHub
26. Why even think about doing this?
• Again, workflow integration
• It's just easier to call X program from R if you are going to run many analyses with said program
27. E.g. 1: Phylometa
…using the files in the dropbox
Also, get Phylometa here:
http://lajeunesse.myweb.usf.edu/publications.html
• On a Mac: doesn't work because it's a .exe
– But system() can often work to run external programs
• On Windows:
system(paste('"new_phyloMeta_1.2b.exe" Aerts2006JEcol_tree.txt Aerts2006JEcol_data.txt'), intern=T)
NOTE: intern = T returns the output to the R console
Should give you something like this:
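A related option worth knowing: `system2()` separates the program name from its arguments and captures output without `paste()`. Demonstrated with a generic command here (the Phylometa call itself is sketched as a comment, since it needs the .exe and input files from the dropbox):

```r
# system2() takes the program and its arguments separately and can
# return the command's output directly
out <- system2("echo", args = "hello", stdout = TRUE)
out

# The Windows Phylometa call above would look like:
# system2("new_phyloMeta_1.2b.exe",
#         args = c("Aerts2006JEcol_tree.txt", "Aerts2006JEcol_data.txt"),
#         stdout = TRUE)
```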
28. Resources
• rOpenSci (development of R packages for all open source data and literature)
• CRAN packages (search for a data source)
• Tutorials/websites:
– http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl
• Non-R based, but cool: http://ecologicaldata.org/