Garrett Grolemund
PhD Student / Rice University
Department of Statistics
Data cleaning
1. Intro to data cleaning
2. What you can’t fix
3. What you can fix
4. Intro to reshape
Your turn
Do you think men or women leave a larger
tip when dining out? What data would
you collect to test this belief? What would
prompt you to change your belief?
Data Analysis
Data
Residuals
Model
Compare
Visualize
Transform
10 - 20%
of an analysis
Data Cleaning
“Happy families are all alike;
every unhappy family is
unhappy in its own way.”
—Leo Tolstoy
“Clean datasets are all alike;
every messy dataset is
messy in its own way.”
—Hadley Wickham
Clean data is:
Complete
Correct
(factual and internally consistent)
Concise
Compatible
(required variables: observations in rows, one column per
variable)
What you
can’t fix:
Complete
Correct
Correct
You can’t restore incorrect values without
the original data, but you can remove
clearly incorrect values.
Options:
Remove entire row
Mark incorrect value as missing (NA)
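The two options above can be sketched in base R. This is a minimal illustration, not the tipping data itself: the `bills` data frame and its values are invented here, and the rule "a tip cannot be negative" is an assumed correctness check.

```r
# Toy data invented for illustration; the -5 tip is the clearly incorrect value.
bills <- data.frame(
  customer_ID = 1:4,
  total_bill  = c(16.99, 10.34, 21.01, 23.68),
  tip         = c(1.01, 1.66, -5.00, 3.31)
)

# Assumed rule: a tip cannot be negative, so flag those rows.
bad <- !is.na(bills$tip) & bills$tip < 0

# Option 1: remove the entire row.
bills_dropped <- bills[!bad, ]

# Option 2: keep the row but mark the incorrect value as missing (NA).
bills_marked <- bills
bills_marked$tip[bad] <- NA

nrow(bills_dropped)
mean(bills_marked$tip, na.rm = TRUE)
```

Option 2 keeps the row's other values (here the total bill) available for analysis, at the cost of handling NA later, for example with `na.rm = TRUE`.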
When two rows present the same
information with different values, at least
one row is wrong.
Whenever there is inconsistency, you are
going to have to make some tradeoff to
ensure concision.
Detecting inconsistency is not always
easy.
Inconsistency = incorrect
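One way to look for rows that present the same information with different values is to count the distinct values recorded per identifier. The sketch below uses base R and an invented `records` data frame; the column names are assumptions chosen to resemble the tipping data.

```r
# Invented records: customer 2 appears twice with different bills.
records <- data.frame(
  customer_ID = c(1, 2, 2, 3, 3),
  total_bill  = c(16.99, 10.34, 13.40, 21.01, 21.01)
)

# Count the distinct values recorded for each customer.
n_values <- tapply(records$total_bill, records$customer_ID,
                   function(x) length(unique(x)))

# IDs with more than one distinct value are internally inconsistent.
conflicting_ids <- names(n_values)[n_values > 1]

# Pull the conflicting rows for inspection.
conflicts <- records[as.character(records$customer_ID) %in% conflicting_ids, ]
conflicts
```

Customer 3 is repeated but consistent, so it is not flagged; that case is a concision problem rather than a correctness problem.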
General strategy
To find incorrect values you need to be
creative, combining graphics and data
processing.
Tipping data
One waiter recorded information
about each tip he received over a
period of a few months
244 records
Do men or women tip more?
Your turn
Subset the tipping data to include only
rows without NAs. Judge whether you
think all of the data points are correct.
How will you make your decision?
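library(ggplot2)  # provides qplot()

# load the data and check the summaries for NAs and odd values
tips <- read.csv("tipping.csv",
  stringsAsFactors = FALSE)
summary(tips)
# keep only rows with no NA in the smoking columns
tips <- subset(tips, !is.na(smoker) &
  !is.na(non_smoker))
# look for implausible values graphically
qplot(tip, data = tips, binwidth = .5)
qplot(total_bill, data = tips, binwidth = 2)
qplot(total_bill, tip, data = tips)
# if sex is coded 0/1, the two sums should add up to the number of rows
nrow(tips)
sum(tips$male)
sum(tips$female)
# rows where the two indicators differ (exactly one is set)
subset(tips, male != female)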
What you
can fix:
Concise
(each fact represented once)
Repeating facts:
1. wastes memory
2. creates opportunities for inconsistency
Compatible
(Data is compatible with your analysis
in both form and fact)
1. Do you have the relevant variables for
your analysis?
This often requires some type of calculation.
For example,
proportion = successes / attempts
Avg score per game per team = ?
join(), summarise(), and ddply() from plyr, along with
base R's transform(), address this need
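Both calculations above can be sketched with base R alone. The `shots` data frame is invented for illustration, and `aggregate()` is used here as a base R stand-in for a grouped summary; with plyr loaded, `ddply(shots, "team", summarise, score = mean(score))` should give the same per-team averages.

```r
# Invented game data, for illustration only.
shots <- data.frame(
  team      = c("A", "A", "B", "B"),
  game      = c(1, 2, 1, 2),
  successes = c(8, 6, 5, 9),
  attempts  = c(10, 12, 10, 12),
  score     = c(70, 64, 58, 80)
)

# New per-row variable: proportion = successes / attempts.
shots <- transform(shots, proportion = successes / attempts)

# Group-wise summary: average score per game for each team.
avg_score <- aggregate(score ~ team, data = shots, FUN = mean)
avg_score
```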
Compatible
(Data is compatible with your analysis
in both form and fact)
2. Is the data in the right form for your
analysis and visualization tools? (reshape)
Rectangular
Observations
in rows
Variables
in columns
(1 column per variable)
Your turn
What are the variables in tipping.csv?
How are they arranged in rows and
columns? Can you form the variables into
two groups?
Reshape
install.packages("reshape")
library(reshape)
library(stringr)
head(tips)
Molten data
We can use melt to put each
variable into its own column.
“Protect” the good columns.
“Melt” the offending columns.
Then subset.
Two types of variables
1. ID variables - identify the object that
measurements will take place on (we
know these before the experiment)
2. Measured variables - the features of
the object that will be measured (we have
to do an experiment to observe these)
Object
ID variables: Bruce Wayne; Batman; SSN: 555-89-3000
Measured variables: Height (6’1’’), IQ (180), Age (71)
ID variables: Gotham City + male + top 1% tax bracket
Identifier variable          Measured variable
Index of random variable     Random variable
Dimension                    Measure
Experimental design          Measurement
predictors (Xi)              response (Y)
Molten data
Molten data collapses all the
measured variables into two
columns: 1) the variable being
measured and 2) the value.
Sometimes called “long” form.
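To make the two-column collapse concrete, here is a base R sketch that builds the molten form by hand. It is not how `melt()` is implemented; the `wide` data frame is an invented two-row stand-in for the tipping data, and `id_cols` and `measured_cols` are names chosen for this example.

```r
# Invented two-row stand-in for the tipping data.
wide <- data.frame(
  customer_ID = c(1, 2),
  tip         = c(1.01, 1.66),
  male        = c(0, 1),
  female      = c(1, 0)
)

id_cols       <- c("customer_ID", "tip")
measured_cols <- c("male", "female")

# Stack each measured column under the id columns, recording
# which variable was measured and the value that was observed.
molten <- do.call(rbind, lapply(measured_cols, function(col) {
  data.frame(wide[id_cols], variable = col, value = wide[[col]])
}))
molten
```

Each original row now appears once per measured variable, so two rows with two measured columns become four rows.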
To protect a column from being
melted, label it as an id variable.
reshape::melt(data, id)
# melt every column not listed as an id variable
tips1 <- melt(tips, id =
  c("customer_ID", "total_bill", "tip",
  "smoker", "non_smoker"))
# assign an appropriate variable name
names(tips1)[6] <- "sex"
# subset out unwanted rows
tips1 <- subset(tips1, value == 1)
# reorder the columns, dropping value (column 7)
tips1 <- tips1[ , c(1,2,6,4,5,3)]
Your turn
Use melt to fix the smoking variable. One
column should be enough to record
whether a person smokes or not.
Rectangular data are
much easier to work with!
qplot(total_bill, tip, data = tips1,
color = sex)
# vs. the original data, where no single column records sex
qplot(total_bill, tip, data = tips,
colour = ?)
qplot(total_bill, tip, data = tips1, color = sex) +
geom_smooth(method = lm)
Clean data is:
Complete
Correct
(factual and internally consistent)
Concise
Compatible
(required variables: observations in rows, one column per
variable)
Resource
Wickham, H. (2007). Reshaping data with
the reshape package. Journal of
Statistical Software, 21(12).
http://www.jstatsoft.org/v21/i12
Summary
Clean data is:
Rectangular
(observations in rows, one column per variable)
Consistent
Concise
Complete
Correct
Data analysis cycle: Data, Residuals, Model, Compare, Visualize, Transform
Tools: ggplot2, plyr, reshape
most statistics classes
This work is licensed under the Creative
Commons Attribution-Noncommercial 3.0 United
States License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc/3.0/us/
or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.
