SlideShare uma empresa Scribd logo
1 de 10
Basic Data Ingestion in R
Denver RUG 11/16/10
@jrideout
Software Engineer & Data Monkey
@ReturnPath
Where is the data?
• Flat-file (text/binary)
• Relational Database
• Where is … (from google suggestions)
– chuck norris
– the love
– my mind
– the love lyrics (apparently a song by Black Eyed Peas)
read.*
• read.table
• read.csv(2)
– csv2 for , decimal points, : delim
• read.delim(2)
– Tab defaults
read.*
• library(foreign) provides read.
– systat, xport, ssd, octave, spss, mtp, epiinfo, dta,
dbf
• Many Others:
Search http://crantastic.org/
Scan
• Better for numeric matrices
M1 <- matrix(scan("test.data"),nrow=x,ncol=y,byrow=T)
Read 10000000 items
user system elapsed
28.565 18.513 50.882
M2 <- as.matrix(read.table("test.data"))
> 40 minutes on my laptop
Actually (read.* just uses scan anyway)
Others
• readLines
• Sqldf
• MapReduce
• bigmemory
Some tricks
• comment.char="“
• Use colClasses or as.is for read.table
– stringsAsFactors
• Colnames(data) <- c(‘newName’,’other’)
• na.strings = “.”
Working with the DF
• Attach(df); fieldname
• df[[index]]
• df$fieldname
• Plyr/Reshape
• name abbreviation
• as.*, matrix, data.matrix
Type coercion
• Check types with str(), typeof()
• attributes()
• logical < integer < double < complex
• It’s better to get the read.* methods right
than coerce later.
?

Mais conteúdo relacionado

Semelhante a Basic data ingestion in r

Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Windows Memory Forensic Analysis using EnCase
Windows Memory Forensic Analysis using EnCaseWindows Memory Forensic Analysis using EnCase
Windows Memory Forensic Analysis using EnCaseTakahiro Haruyama
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101MongoDB
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Konstantin V. Shvachko
 
Materials Project Validation, Provenance, and Sandboxes by Dan Gunter
Materials Project Validation, Provenance, and Sandboxes by Dan GunterMaterials Project Validation, Provenance, and Sandboxes by Dan Gunter
Materials Project Validation, Provenance, and Sandboxes by Dan GunterDan Gunter
 
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Ontico
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012Amazon Web Services
 
Full Table Scan: friend or foe
Full Table Scan: friend or foeFull Table Scan: friend or foe
Full Table Scan: friend or foeMauro Pagano
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Maté Ongenaert
 
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305mjfrankli
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008eComm2008
 
Bigdata and Hadoop
 Bigdata and Hadoop Bigdata and Hadoop
Bigdata and HadoopGirish L
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2PoguttuezhiniVP
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
Deep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationDeep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationJake Mannix
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Lucidworks
 

Semelhante a Basic data ingestion in r (20)

Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Windows Memory Forensic Analysis using EnCase
Windows Memory Forensic Analysis using EnCaseWindows Memory Forensic Analysis using EnCase
Windows Memory Forensic Analysis using EnCase
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
 
Materials Project Validation, Provenance, and Sandboxes by Dan Gunter
Materials Project Validation, Provenance, and Sandboxes by Dan GunterMaterials Project Validation, Provenance, and Sandboxes by Dan Gunter
Materials Project Validation, Provenance, and Sandboxes by Dan Gunter
 
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
 
Full Table Scan: friend or foe
Full Table Scan: friend or foeFull Table Scan: friend or foe
Full Table Scan: friend or foe
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
 
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
 
Bigdata and Hadoop
 Bigdata and Hadoop Bigdata and Hadoop
Bigdata and Hadoop
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
 
DA_02_algorithms.pptx
DA_02_algorithms.pptxDA_02_algorithms.pptx
DA_02_algorithms.pptx
 
Deep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationDeep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep Tokenization
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
 

Último

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Basic data ingestion in r

  • 1. Basic Data Ingestion in R Denver RUG 11/16/10 @jrideout Software Engineer & Data Monkey @ReturnPath
  • 2. Where is the data? • Flat-file (text/binary) • Relational Database • Where is … (from google suggestions) – chuck norris – the love – my mind – the love lyrics (apparently a song by Black Eyed Peas)
  • 3. read.* • read.table • read.csv(2) – csv2 for , decimal points, : delim • read.delim(2) – Tab defaults
  • 4. read.* • library(foreign) provides read. – systat, xport, ssd, octave, spss, mtp, epiinfo, dta, dbf • Many Others: Search http://crantastic.org/
  • 5. Scan • Better for numeric matrices M1 <- matrix(scan("test.data"),nrow=x,ncol=y,byrow=T) Read 10000000 items user system elapsed 28.565 18.513 50.882 M2 <- as.matrix(read.table("test.data")) > 40 minutes on my laptop Actually (read.* just uses scan anyway)
  • 6. Others • readLines • Sqldf • MapReduce • bigmemory
  • 7. Some tricks • comment.char="“ • Use colClasses or as.is for read.table – stringsAsFactors • Colnames(data) <- c(‘newName’,’other’) • na.strings = “.”
  • 8. Working with the DF • Attach(df); fieldname • df[[index]] • df$fieldname • Plyr/Reshape • name abbreviation • as.*, matrix, data.matrix
  • 9. Type coercion • Check types with str(), typeof() • attributes() • logical < integer < double < complex • It’s better to get the read.* methods right than coerce later.
  • 10. ?