SlideShare uma empresa Scribd logo
1 de 11
Baixar para ler offline
useR Vignette:



    Accessing Databases
          from R


      Greater Boston useR Group
              May 4, 2011


                          by

                 Jeffrey Breen
           jbreen@cambridge.aero




Photo from http://en.wikipedia.org/wiki/File:Oracle_Headquarters_Redwood_Shores.jpg
Outline
●    Why relational databases?
●    Introducing DBI
●    Simple SQL queries
●    dbApply() marries strengths of
     SQL and *apply()
●    The parallel universe of RODBC
●    sqldf: No database? No problem!
●    Further Reading
●    Loading mtcars sample
     data.frame into MySQL




                                            AP Photo/Ben Margot

useR Vignette: Accessing Databases from R                Greater Boston useR Meeting, May 2011   Slide 2
Why relational databases?
●    Databases excel at handling large amounts of, um, data
●    They're everywhere
      ●    Virtually all enterprise applications are built on relational databases (CRM, ERP,
           HRIS, etc.)
      ●    Thanks to high quality open source databases (esp. MySQL and PostgreSQL),
           they're central to dynamic web development since beginning.
             –   “LAMP” = Linux + Apache + MySQL + PHP
      ●    Amazon's “Relational Data Service” is just a tuned deployment of MySQL
●    SQL provides almost-standard language to filter, aggregate, group, sort
      ●    SQL-like query languages showing up in new places (Hadoop Hive)
      ●    ODBC provides SQL interface to non-database data (Excel, CSV, text files)




useR Vignette: Accessing Databases from R       Greater Boston useR Meeting, May 2011    Slide 3
Introducing DBI

 ●      DBI provides a common interface for (most of) R's
        database packages
 ●      Database-specific code implemented in sub-
        packages
          ●     RMySQL, RPostgreSQL, ROracle, RSQLite, RJDBC
 ●      Use dbConnect(), dbDisconnect() to open, close
        connections:
> library(RMySQL)
> con = dbConnect("MySQL", "testdb", username="testuser", password="testpass")
[...]
> dbDisconnect(con)




useR Vignette: Accessing Databases from R   Greater Boston useR Meeting, May 2011   Slide 4
Using DBI
 ●    dbReadTable() and dbWriteTable() read and write entire tables
> df = dbReadTable(con, 'motortrend')
> head(df, 4)
                   mpg cyl disp hp drat                     wt     qsec vs am gear carb                     mfg     model
Mazda RX4         21.0   6 160 110 3.90                  2.620    16.46 0 1      4    4                   Mazda       RX4
Mazda RX4 Wag     21.0   6 160 110 3.90                  2.875    17.02 0 1      4    4                   Mazda   RX4 Wag
Datsun 710        22.8   4 108 93 3.85                   2.320    18.61 1 1      4    1                  Datsun       710
Hornet 4 Drive    21.4   6 258 110 3.08                  3.215    19.44 1 0      3    1                  Hornet   4 Drive


 ●    dbGetQuery() runs SQL query and returns entire result set
> df = dbGetQuery(con, "SELECT              * FROM motortrend")
> head(df,4)
       row_names mpg cyl disp                hp   drat      wt    qsec vs am gear carb    mfg   model
1      Mazda RX4 21.0   6 160               110   3.90   2.620   16.46 0 1      4    4 Mazda      RX4
2 Mazda RX4 Wag 21.0    6 160               110   3.90   2.875   17.02 0 1      4    4 Mazda RX4 Wag
3     Datsun 710 22.8   4 108                93   3.85   2.320   18.61 1 1      4    1 Datsun     710
4 Hornet 4 Drive 21.4   6 258               110   3.08   3.215   19.44 1 0      3    1 Hornet 4 Drive

       ●     Note how dbReadTable() uses “row_names” column
 ●    Use dbSendQuery() & fetch() to stream larger result sets
 ●    Advanced functions available to read schema definitions, handle
      transactions, call stored procedures, etc.
useR Vignette: Accessing Databases from R                        Greater Boston useR Meeting, May 2011                      Slide 5
Simple SQL queries

Fetch a column with no filtering but de-dupe:
> df = dbGetQuery(con, "SELECT DISTINCT mfg FROM motortrend")
> head(df, 3)
       mfg
1    Mazda
2   Datsun
3   Hornet


Aggregate and sort result:
> df = dbGetQuery(con, "SELECT mfg, avg(hp) AS meanHP FROM motortrend GROUP BY mfg ORDER BY
meanHP DESC")
> head(df, 4)
       mfg meanHP
1 Maserati    335
2     Ford    264
3   Duster    245
4   Camaro    245

> df = dbGetQuery(con, "SELECT cyl as cylinders, avg(hp) as meanHP FROM motortrend GROUP by
cyl ORDER BY cyl")
> df
  cylinders    meanHP
1         4 82.63636
2         6 122.28571
3         8 209.21429

useR Vignette: Accessing Databases from R   Greater Boston useR Meeting, May 2011       Slide 6
●dbApply() marries strengths of SQL and
*apply()
 ●    Operates on result set from dbSendQuery()
        ●     Uses fetch() to bring in smaller chunks at a time to handle Big Data
        ●     You must order result set by your “chunking” variable
 ●    Example: calculate quantiles for horsepower vs. cylinders
> sql = "SELECT cyl, hp FROM motortrend ORDER BY cyl"
> rs = dbSendQuery(con, sql)
> dbApply(rs, INDEX='cyl', FUN=function(x, grp) quantile(x$hp))
$`4.000000`
   0%   25%   50%   75% 100%
 52.0 65.5 91.0 96.0 113.0

$`6.000000`
  0% 25% 50%                    75% 100%
 105 110 110                    123 175

$`8.000000`
    0%    25%    50%    75%   100%
150.00 176.25 192.50 241.25 335.00

 ●    Implemented and available in RMySQL, RPostgreSQL
useR Vignette: Accessing Databases from R   Greater Boston useR Meeting, May 2011   Slide 7
The parallel universe of RODBC
 ●    ODBC = “open database connectivity”
       ●     Released by Microsoft in 1992
       ●     Cross-platform, but strongest support on Windows
       ●     ODBC drivers are available for every database you can think of PLUS
             Excel spreadsheets, CSV text files, etc.
 ●    For historical reasons, RODBC not part of DBI family
 ●    Same idea, different details:
       ●     odbcConnect() instead of dbConnection()
       ●     sqlFetch() = dbReadTable()
       ●     sqlSave() = dbWriteTable()
       ●     sqlQuery() = dbGetQuery()
 ●    Closest match in DBI family is RJDBC using Java JDBC drivers

useR Vignette: Accessing Databases from R   Greater Boston useR Meeting, May 2011   Slide 8
sqldf: No database? No problem!
 ●    Provides SQL access to data.frames as if they were tables
 ●    Creates & updates SQLite databases automagically
        ●    But can also be used with existing SQLite, MySQL databases
> library(sqldf)
> data(mtcars)
> sqldf("SELECT cyl, avg(hp) FROM mtcars GROUP BY cyl ORDER BY cyl")
  cyl   avg(hp)
1   4 82.63636
2   6 122.28571
3   8 209.21429

> library(stringr)
> mtcars$mfg = str_split_fixed(rownames(mtcars), ' ', 2)[,1]
> sqldf("SELECT mfg, avg(hp) AS meanHP FROM mtcars GROUP BY mfg ORDER BY meanHP DESC
LIMIT 4")
       mfg meanHP
1 Maserati    335
2     Ford    264
3   Camaro    245
4   Duster    245


useR Vignette: Accessing Databases from R   Greater Boston useR Meeting, May 2011   Slide 9
Further Reading
●   Bell Labs: R/S-Database Interface
      ●   http://stat.bell-labs.com/RS-DBI/
●   R Data Import/Export manual
      ●   http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases
●   CRAN: DBI and “Reverse depends” friends
      ●   http://cran.r-project.org/web/packages/DBI/
      ●   http://cran.r-project.org/web/packages/RMySQL/
      ●   http://cran.r-project.org/web/packages/RPostgreSQL/
      ●   http://cran.r-project.org/web/packages/RJDBC/
●   CRAN: RODBC
      ●   http://cran.r-project.org/web/packages/RODBC/
●   CRAN: sqldf
      ●   http://cran.r-project.org/web/packages/sqldf/
●   Phil Spector's SQL tutorial
      ●   http://www.stat.berkeley.edu/~spector/sql.pdf
useR Vignette: Accessing Databases from R       Greater Boston useR Meeting, May 2011   Slide 10
Loading 'mtcars' sample data.frame into
MySQL
In MySQL, create new database & user:
mysql> create database testdb;
mysql> grant all privileges on testdb.* to 'testuser'@'localhost' identified by 'testpass';
mysql> flush privileges;


In R, load "mtcars" data.frame, clean up, and write to new "motortrend" data base table:
library(stringr)
library(RMySQL)

data(mtcars)

mtcars$mfg = str_split_fixed(rownames(mtcars), ' ', 2)[,1]
mtcars$mfg[mtcars$mfg=='Merc'] = 'Mercedes'
mtcars$model = str_split_fixed(rownames(mtcars), ' ', 2)[,2]

con = dbConnect("MySQL", "testdb", username="testuser", password="testpass")

dbWriteTable(con, 'motortrend', mtcars)

dbDisconnect(con)



useR Vignette: Accessing Databases from R   Greater Boston useR Meeting, May 2011        Slide 11

Mais conteúdo relacionado

Mais procurados

patroni-based citrus high availability environment deployment
patroni-based citrus high availability environment deploymentpatroni-based citrus high availability environment deployment
patroni-based citrus high availability environment deployment
hyeongchae lee
 
MySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMySQL 5.7: Core Server Changes
MySQL 5.7: Core Server Changes
Morgan Tocker
 
Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011
mubarakss
 

Mais procurados (20)

PGDay.Seoul 2016 lightingtalk
PGDay.Seoul 2016 lightingtalkPGDay.Seoul 2016 lightingtalk
PGDay.Seoul 2016 lightingtalk
 
patroni-based citrus high availability environment deployment
patroni-based citrus high availability environment deploymentpatroni-based citrus high availability environment deployment
patroni-based citrus high availability environment deployment
 
What is new in MariaDB 10.6?
What is new in MariaDB 10.6?What is new in MariaDB 10.6?
What is new in MariaDB 10.6?
 
Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0
Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0
Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0
 
Set Up & Operate Real-Time Data Loading into Hadoop
Set Up & Operate Real-Time Data Loading into HadoopSet Up & Operate Real-Time Data Loading into Hadoop
Set Up & Operate Real-Time Data Loading into Hadoop
 
MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016
MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016
MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016
 
What's New in PostgreSQL 9.6
What's New in PostgreSQL 9.6What's New in PostgreSQL 9.6
What's New in PostgreSQL 9.6
 
Plmce 14 be a_hero_16x9_final
Plmce 14 be a_hero_16x9_finalPlmce 14 be a_hero_16x9_final
Plmce 14 be a_hero_16x9_final
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
Oracle Database 12.1.0.2 New Features
Oracle Database 12.1.0.2 New FeaturesOracle Database 12.1.0.2 New Features
Oracle Database 12.1.0.2 New Features
 
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
 
Scaling MySQL -- Swanseacon.co.uk
Scaling MySQL -- Swanseacon.co.uk Scaling MySQL -- Swanseacon.co.uk
Scaling MySQL -- Swanseacon.co.uk
 
Transparent sharding with Spider: what's new and getting started
Transparent sharding with Spider: what's new and getting startedTransparent sharding with Spider: what's new and getting started
Transparent sharding with Spider: what's new and getting started
 
MySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMySQL 5.7: Core Server Changes
MySQL 5.7: Core Server Changes
 
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, CloudflareClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
 
Oracle 12c - Multitenant Feature
Oracle 12c - Multitenant FeatureOracle 12c - Multitenant Feature
Oracle 12c - Multitenant Feature
 
Percona xtra db cluster(pxc) non blocking operations, what you need to know t...
Percona xtra db cluster(pxc) non blocking operations, what you need to know t...Percona xtra db cluster(pxc) non blocking operations, what you need to know t...
Percona xtra db cluster(pxc) non blocking operations, what you need to know t...
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011
 
Ordina Oracle Open World
Ordina Oracle Open WorldOrdina Oracle Open World
Ordina Oracle Open World
 

Destaque

R Journal 2009 1
R Journal 2009 1R Journal 2009 1
R Journal 2009 1
Ajay Ohri
 
Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2
rusersla
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2
rusersla
 

Destaque (9)

R Journal 2009 1
R Journal 2009 1R Journal 2009 1
R Journal 2009 1
 
Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2
 
Merge Multiple CSV in single data frame using R
Merge Multiple CSV in single data frame using RMerge Multiple CSV in single data frame using R
Merge Multiple CSV in single data frame using R
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
R Markdown Tutorial For Beginners
R Markdown Tutorial For BeginnersR Markdown Tutorial For Beginners
R Markdown Tutorial For Beginners
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
RMySQL Tutorial For Beginners
RMySQL Tutorial For BeginnersRMySQL Tutorial For Beginners
RMySQL Tutorial For Beginners
 

Semelhante a Accessing Databases from R

[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)
Steve Min
 
Data stage scenario design9 - job1
Data stage scenario   design9 - job1Data stage scenario   design9 - job1
Data stage scenario design9 - job1
Naresh Bala
 
OOW09 EBS Tech Essentials
OOW09 EBS Tech EssentialsOOW09 EBS Tech Essentials
OOW09 EBS Tech Essentials
jucaab
 

Semelhante a Accessing Databases from R (20)

TSQL in SQL Server 2012
TSQL in SQL Server 2012TSQL in SQL Server 2012
TSQL in SQL Server 2012
 
My sql susecon_crashcourse_2012
My sql susecon_crashcourse_2012My sql susecon_crashcourse_2012
My sql susecon_crashcourse_2012
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017
 
Replicate from Oracle to Oracle, Oracle to MySQL, and Oracle to Analytics
Replicate from Oracle to Oracle, Oracle to MySQL, and Oracle to AnalyticsReplicate from Oracle to Oracle, Oracle to MySQL, and Oracle to Analytics
Replicate from Oracle to Oracle, Oracle to MySQL, and Oracle to Analytics
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql database
 
Replicate Oracle to Oracle, Oracle to MySQL, and Oracle to Analytics
Replicate Oracle to Oracle, Oracle to MySQL, and Oracle to AnalyticsReplicate Oracle to Oracle, Oracle to MySQL, and Oracle to Analytics
Replicate Oracle to Oracle, Oracle to MySQL, and Oracle to Analytics
 
MariaDB Tech und Business Update Hamburg 2023 - MariaDB Enterprise Server
MariaDB Tech und Business Update Hamburg 2023 - MariaDB Enterprise Server MariaDB Tech und Business Update Hamburg 2023 - MariaDB Enterprise Server
MariaDB Tech und Business Update Hamburg 2023 - MariaDB Enterprise Server
 
Die Neuheiten in MariaDB Enterprise Server
Die Neuheiten in MariaDB Enterprise ServerDie Neuheiten in MariaDB Enterprise Server
Die Neuheiten in MariaDB Enterprise Server
 
Tungsten Use Case: How Gittigidiyor (a subsidiary of eBay) Replicates Data In...
Tungsten Use Case: How Gittigidiyor (a subsidiary of eBay) Replicates Data In...Tungsten Use Case: How Gittigidiyor (a subsidiary of eBay) Replicates Data In...
Tungsten Use Case: How Gittigidiyor (a subsidiary of eBay) Replicates Data In...
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)
 
Copy Data Management for the DBA
Copy Data Management for the DBACopy Data Management for the DBA
Copy Data Management for the DBA
 
Data stage scenario design9 - job1
Data stage scenario   design9 - job1Data stage scenario   design9 - job1
Data stage scenario design9 - job1
 
HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)
 
GoldenGate CDR from UKOUG 2017
GoldenGate CDR from UKOUG 2017GoldenGate CDR from UKOUG 2017
GoldenGate CDR from UKOUG 2017
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next Frontier
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
 
NoSQL
NoSQLNoSQL
NoSQL
 
OOW09 EBS Tech Essentials
OOW09 EBS Tech EssentialsOOW09 EBS Tech Essentials
OOW09 EBS Tech Essentials
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Accessing Databases from R

  • 1. useR Vignette: Accessing Databases from R Greater Boston useR Group May 4, 2011 by Jeffrey Breen jbreen@cambridge.aero Photo from http://en.wikipedia.org/wiki/File:Oracle_Headquarters_Redwood_Shores.jpg
  • 2. Outline ● Why relational databases? ● Introducing DBI ● Simple SQL queries ● dbApply() marries strengths of SQL and *apply() ● The parallel universe of RODBC ● sqldf: No database? No problem! ● Further Reading ● Loading mtcars sample data.frame into MySQL AP Photo/Ben Margot useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 2
  • 3. Why relational databases? ● Databases excel at handling large amounts of, um, data ● They're everywhere ● Virtually all enterprise applications are built on relational databases (CRM, ERP, HRIS, etc.) ● Thanks to high quality open source databases (esp. MySQL and PostgreSQL), they're central to dynamic web development since beginning. – “LAMP” = Linux + Apache + MySQL + PHP ● Amazon's “Relational Data Service” is just a tuned deployment of MySQL ● SQL provides almost-standard language to filter, aggregate, group, sort ● SQL-like query languages showing up in new places (Hadoop Hive) ● ODBC provides SQL interface to non-database data (Excel, CSV, text files) useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 3
  • 4. Introducing DBI ● DBI provides a common interface for (most of) R's database packages ● Database-specific code implemented in sub- packages ● RMySQL, RPostgreSQL, ROracle, RSQLite, RJDBC ● Use dbConnect(), dbDisconnect() to open, close connections: > library(RMySQL) > con = dbConnect("MySQL", "testdb", username="testuser", password="testpass") [...] > dbDisconnect(con) useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 4
  • 5. Using DBI ● dbReadTable() and dbWriteTable() read and write entire tables > df = dbReadTable(con, 'motortrend') > head(df, 4) mpg cyl disp hp drat wt qsec vs am gear carb mfg model Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive ● dbGetQuery() runs SQL query and returns entire result set > df = dbGetQuery(con, "SELECT * FROM motortrend") > head(df,4) row_names mpg cyl disp hp drat wt qsec vs am gear carb mfg model 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive ● Note how dbReadTable() uses “row_names” column ● Use dbSendQuery() & fetch() to stream larger result sets ● Advanced functions available to read schema definitions, handle transactions, call stored procedures, etc. useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 5
  • 6. Simple SQL queries Fetch a column with no filtering but de-dupe: > df = dbGetQuery(con, "SELECT DISTINCT mfg FROM motortrend") > head(df, 3) mfg 1 Mazda 2 Datsun 3 Hornet Aggregate and sort result: > df = dbGetQuery(con, "SELECT mfg, avg(hp) AS meanHP FROM motortrend GROUP BY mfg ORDER BY meanHP DESC") > head(df, 4) mfg meanHP 1 Maserati 335 2 Ford 264 3 Duster 245 4 Camaro 245 > df = dbGetQuery(con, "SELECT cyl as cylinders, avg(hp) as meanHP FROM motortrend GROUP by cyl ORDER BY cyl") > df cylinders meanHP 1 4 82.63636 2 6 122.28571 3 8 209.21429 useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 6
  • 7. ●dbApply() marries strengths of SQL and *apply() ● Operates on result set from dbSendQuery() ● Uses fetch() to bring in smaller chunks at a time to handle Big Data ● You must order result set by your “chunking” variable ● Example: calculate quantiles for horsepower vs. cylinders > sql = "SELECT cyl, hp FROM motortrend ORDER BY cyl" > rs = dbSendQuery(con, sql) > dbApply(rs, INDEX='cyl', FUN=function(x, grp) quantile(x$hp)) $`4.000000` 0% 25% 50% 75% 100% 52.0 65.5 91.0 96.0 113.0 $`6.000000` 0% 25% 50% 75% 100% 105 110 110 123 175 $`8.000000` 0% 25% 50% 75% 100% 150.00 176.25 192.50 241.25 335.00 ● Implemented and available in RMySQL, RPostgreSQL useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 7
  • 8. The parallel universe of RODBC ● ODBC = “open database connectivity” ● Released by Microsoft in 1992 ● Cross-platform, but strongest support on Windows ● ODBC drivers are available for every database you can think of PLUS Excel spreadsheets, CSV text files, etc. ● For historical reasons, RODBC not part of DBI family ● Same idea, different details: ● odbcConnect() instead of dbConnection() ● sqlFetch() = dbReadTable() ● sqlSave() = dbWriteTable() ● sqlQuery() = dbGetQuery() ● Closest match in DBI family is RJDBC using Java JDBC drivers useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 8
  • 9. sqldf: No database? No problem! ● Provides SQL access to data.frames as if they were tables ● Creates & updates SQLite databases automagically ● But can also be used with existing SQLite, MySQL databases > library(sqldf) > data(mtcars) > sqldf("SELECT cyl, avg(hp) FROM mtcars GROUP BY cyl ORDER BY cyl") cyl avg(hp) 1 4 82.63636 2 6 122.28571 3 8 209.21429 > library(stringr) > mtcars$mfg = str_split_fixed(rownames(mtcars), ' ', 2)[,1] > sqldf("SELECT mfg, avg(hp) AS meanHP FROM mtcars GROUP BY mfg ORDER BY meanHP DESC LIMIT 4") mfg meanHP 1 Maserati 335 2 Ford 264 3 Camaro 245 4 Duster 245 useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 9
  • 10. Further Reading ● Bell Labs: R/S-Database Interface ● http://stat.bell-labs.com/RS-DBI/ ● R Data Import/Export manual ● http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases ● CRAN: DBI and “Reverse depends” friends ● http://cran.r-project.org/web/packages/DBI/ ● http://cran.r-project.org/web/packages/RMySQL/ ● http://cran.r-project.org/web/packages/RPostgreSQL/ ● http://cran.r-project.org/web/packages/RJDBC/ ● CRAN: RODBC ● http://cran.r-project.org/web/packages/RODBC/ ● CRAN: sqldf ● http://cran.r-project.org/web/packages/sqldf/ ● Phil Spector's SQL tutorial ● http://www.stat.berkeley.edu/~spector/sql.pdf useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 10
  • 11. Loading 'mtcars' sample data.frame into MySQL In MySQL, create new database & user: mysql> create database testdb; mysql> grant all privileges on testdb.* to 'testuser'@'localhost' identified by 'testpass'; mysql> flush privileges; In R, load "mtcars" data.frame, clean up, and write to new "motortrend" data base table: library(stringr) library(RMySQL) data(mtcars) mtcars$mfg = str_split_fixed(rownames(mtcars), ' ', 2)[,1] mtcars$mfg[mtcars$mfg=='Merc'] = 'Mercedes' mtcars$model = str_split_fixed(rownames(mtcars), ' ', 2)[,2] con = dbConnect("MySQL", "testdb", username="testuser", password="testpass") dbWriteTable(con, 'motortrend', mtcars) dbDisconnect(con) useR Vignette: Accessing Databases from R Greater Boston useR Meeting, May 2011 Slide 11