Data cleaning
Stat405

Hadley Wickham
Monday, 31 August 2009
               1. Intro to data cleaning
               2. Missing values
               3. Subsetting
               4. Modifying
               5. Short cuts



Clean data is:
                   Columnar
                   (rectangular, observations in rows, variables in columns)

                   Consistent
                   Concise
                   Complete
                   Correct


Correct
                   Can’t restore correct values without
                   original data but can remove clearly
                   incorrect values
                   Options:
                         Remove entire row
                         Mark incorrect value as missing


What is a missing value?
                   In R, written as NA. Has special
                   behaviour:
                         NA + 3 = ?
                         NA > 2 = ?
                         mean(c(2, 7, 10, NA)) = ?
                         NA == NA ?
                   Use is.na() to see if a value is NA
                   Many functions have na.rm argument
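
The questions above can be checked at the R console. A minimal sketch using only base R; the vector name `v` is made up for illustration:

```r
# Arithmetic and comparisons involving NA return NA
NA + 3        # NA
NA > 2        # NA
NA == NA      # NA, so == cannot be used to test for missingness

v <- c(2, 7, 10, NA)
mean(v)                 # NA
mean(v, na.rm = TRUE)   # 6.333333, the mean of 2, 7 and 10

# is.na() is the reliable test for missing values
is.na(v)                # FALSE FALSE FALSE  TRUE
```
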
Your turn

                   Look at histograms and scatterplots of x,
                   y, z from the diamonds dataset
                   Which values are clearly incorrect? Which
                   values might we be able to correct?
                   (Remember measurements are in millimetres,
                   1 inch ≈ 25 mm)




Plots

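               library(ggplot2)  # provides qplot() and the diamonds dataset

               # Histograms of each dimension
               qplot(x, data = diamonds, binwidth = 0.1)
               qplot(y, data = diamonds, binwidth = 0.1)
               qplot(z, data = diamonds, binwidth = 0.1)

               # Pairwise scatterplots
               qplot(x, y, data = diamonds)
               qplot(x, z, data = diamonds)
               qplot(y, z, data = diamonds)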




Modifying data
                   To modify, must first know how to extract,
                   or subset. Many different methods
                   available in R. We’ll start with most
                   explicit then learn some shortcuts.
                   Basic structure:
                   df$varname
                   df[row index, column index]


$

                   Remember str(diamonds) ?
                   That hints at how to extract individual
                   variables:
                   diamonds$carat
                   diamonds$price



[
            positive integers   select specified
            negative integers   omit specified
            characters          extract named items
            nothing             include everything
            logicals            select TRUE, omit FALSE
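
Each kind of index can be tried on a small named vector. A sketch with a made-up vector `v` (not part of the diamonds data):

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

v[c(1, 3)]                       # positive integers: elements 1 and 3
v[-c(1, 3)]                      # negative integers: everything except 1 and 3
v[c("b", "d")]                   # characters: the elements named b and d
v[]                              # nothing: every element
v[c(TRUE, FALSE, FALSE, TRUE)]   # logicals: elements where the index is TRUE
```

The same five kinds of index work in either position of df[row index, column index].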


Challenge
                   There is an equivalency between logical
                   (boolean) and numerical (set) indexing.
                   How do you change a logical index to a
                   numeric index? And vice versa?
                   What are the equivalents of the boolean
                   operations for numerical indices?



     # Nothing: every row and every column
     str(diamonds[, ])

     # Positive integers & nothing
     diamonds[1:6, ] # same as head(diamonds)
     diamonds[, 1:4] # watch out! prints every row

     # Positive integers for both rows and columns
     diamonds[1:10, 1:4]
     diamonds$carat[1:100]

     # Negative integers
     diamonds[-(1:53900), -1]

     # Character vector
     diamonds[, c("depth", "table")]
     diamonds[1:100, "carat"]

[ + logical vectors
               # The most complicated to understand, but
               # the most powerful. Lets you extract a
               # subset defined by some characteristic of
               # the data
               x_big <- diamonds$x > 10

               head(x_big)
               tail(x_big)
               sum(x_big)

               diamonds$x[x_big]
               diamonds[x_big, ]



Useful functions for logical vectors
               table(zeros)
               sum(zeros)
               mean(zeros)
               TRUE = 1; FALSE = 0
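
Because TRUE counts as 1 and FALSE as 0, sum() gives the number of TRUEs and mean() gives their proportion. A small sketch with made-up measurements (the vector `x` is hypothetical, not taken from diamonds):

```r
# Hypothetical measurements; two of the five are recorded as zero
x <- c(4.2, 0, 5.1, 0, 3.9)
zeros <- x == 0

table(zeros)   # counts of FALSE and TRUE: 3 and 2
sum(zeros)     # 2, the number of TRUEs
mean(zeros)    # 0.4, the proportion of TRUEs
```
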



     x_big <- diamonds$x > 10
     diamonds[x_big, ]
     diamonds[x_big, "x"]
     diamonds[x_big, c("x", "y", "z")]

     small <- diamonds[diamonds$carat < 1, ]
     lowqual <- diamonds[diamonds$clarity
       %in% c("I1", "SI2", "SI1"), ]

     # Comparison functions:
     # < > <= >= != == %in%

     # Boolean operators
     small <- diamonds$carat < 1 &
       diamonds$price > 500
     lowqual <- diamonds$color == "D" |
       diamonds$cut == "Fair"

                   And   a & b
                   Or    a | b
                   Not   !b
                   Xor   xor(a, b)
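
These operators work element by element on logical vectors. A short sketch with two made-up vectors `a` and `b` that cover all four combinations:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b       # TRUE FALSE FALSE FALSE
a | b       # TRUE  TRUE  TRUE FALSE
!b          # FALSE  TRUE FALSE  TRUE
xor(a, b)   # FALSE  TRUE  TRUE FALSE
```
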


Saving results
               # Prints to screen
               diamonds[diamonds$x > 10, ]

               # Saves to new data frame
               big <- diamonds[diamonds$x > 10, ]

               # Overwrites existing data frame. Dangerous!
               diamonds <- diamonds[diamonds$x < 10, ]



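     diamonds <- diamonds[1, 1]
     diamonds

     # Uh oh! Only the first cell is left

     rm(diamonds)   # removes your modified copy
     str(diamonds)

     # Phew! The original from ggplot2 is visible again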
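
The recovery works because the assignment creates a copy in your workspace that hides the packaged dataset, and rm() removes only that copy. A sketch of the same effect using `mtcars`, which ships with base R and stands in for diamonds here so the example runs without ggplot2:

```r
mtcars <- mtcars[1, 1]   # workspace copy now holds a single number
mtcars                   # 21

rm(mtcars)               # removes only the workspace copy
dim(mtcars)              # 32 11, the packaged data frame is visible again
```

This rescue only applies to datasets that come from a package. Data you read from a file yourself would have to be loaded again.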




Your turn

                   Extract diamonds with equal x & y.
                   Extract diamonds with incorrect/unusual
                   x, y, or z values.




     equal <- diamonds[diamonds$x == diamonds$y, ]

     y_big <- diamonds$y > 10
     z_big <- diamonds$z > 6

     x_zero <- diamonds$x == 0
     y_zero <- diamonds$y == 0
     z_zero <- diamonds$z == 0
     zeros <- x_zero | y_zero | z_zero

     bad <- y_big | z_big | zeros
     dbad <- diamonds[bad, ]


Aside: strategy

                   The biggest mistake I see new
                   programmers make is trying to do too
                   much at once.
                   Break the problem into pieces and solve
                   the smallest piece first. Then check each
                   piece before solving the next problem.



Making new variables
               diamonds$pricepc <- diamonds$price /
                 diamonds$carat

               diamonds$volume <- diamonds$x *
                 diamonds$y * diamonds$z

               qplot(pricepc, carat, data = diamonds)
               qplot(carat, volume, data = diamonds)


Modifying values
                   Combination of subsetting and making
                   new variables:
                   diamonds$x[x_zero] <- NA
                   diamonds$z[z_big] <- diamonds$z[z_big] / 10

                   These modify the data in place.
                   Be careful!
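
The pattern can be tried safely on a throwaway data frame first. A minimal sketch, assuming a made-up data frame `toy` with one impossible zero and one value that looks ten times too large:

```r
# Made-up measurements, not taken from the diamonds data
toy <- data.frame(x = c(4.1, 0, 5.2), z = c(2.5, 3.1, 31.8))

# Build the logical vectors first, while the original values are intact
x_zero <- toy$x == 0
z_big  <- toy$z > 6

toy$x[x_zero] <- NA                   # mark the impossible zero as missing
toy$z[z_big]  <- toy$z[z_big] / 10    # assume a misplaced decimal point

toy
```

Only the selected elements change; the other rows keep their values.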



     diamonds$volume <- diamonds$x *
       diamonds$y * diamonds$z
     qplot(carat, volume, data = diamonds)

     # Fix problems & replot
     diamonds$x[x_zero] <- NA
     diamonds$y[y_zero] <- NA
     diamonds$z[z_zero] <- NA
     diamonds$y[y_big] <- diamonds$y[y_big] / 10
     diamonds$z[z_big] <- diamonds$z[z_big] / 10

     diamonds$volume <- diamonds$x *
       diamonds$y * diamonds$z
     qplot(carat, volume, data = diamonds)

Your turn
                   Fix the incorrect values and replot
                   scatterplots of x, y, and z. Are all the
                   unusual values gone?
                   Correct any other strange values.
                   Hint: If qplot(a, b) is a straight line,
                   qplot(a, a / b) will be a flat line. Makes
                   selecting strange values much easier!


     qplot(carat, volume, data = diamonds)
     qplot(carat, volume / carat, data = diamonds)

     weird_density <-
       (diamonds$volume / diamonds$carat) < 140 |
       (diamonds$volume / diamonds$carat) > 180
     weird_density <- weird_density & !is.na(weird_density)

     diamonds[weird_density, c("x", "y", "z", "volume")] <- NA
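
The second assignment to weird_density is needed because volume is NA wherever a dimension was marked as missing, and a comparison with NA returns NA rather than TRUE or FALSE. A sketch with a made-up vector `ratio` standing in for volume / carat:

```r
# Hypothetical volume/carat ratios; the NA stands for a diamond whose
# dimensions were already marked as missing
ratio <- c(160, 95, NA, 210)

weird <- ratio < 140 | ratio > 180
weird                     # FALSE  TRUE    NA  TRUE

# NA & FALSE is FALSE, so the missing entry is no longer selected
weird <- weird & !is.na(weird)
weird                     # FALSE  TRUE FALSE  TRUE
```
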




Short cuts
                   You’ve been typing diamonds many many
                   times. There are three shortcuts: with,
                   subset and transform.
                   These save typing, but may be a little
                   harder to understand, and will not work in
                   some situations.
                   Useful tools, but don’t forget the basics.


     weird_density <-
       (diamonds$volume / diamonds$carat) < 140 |
       (diamonds$volume / diamonds$carat) > 180
     weird_density <- with(diamonds,
       (volume / carat) < 140 | (volume / carat) > 180)

     diamonds[diamonds$carat < 1, ]
     subset(diamonds, carat < 1)

     equal <- diamonds[diamonds$x == diamonds$y, ]
     equal <- subset(diamonds, x == y)
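
The two forms are not always interchangeable. One difference is how they treat missing values, sketched here on a made-up data frame `toy` (the values are hypothetical):

```r
toy <- data.frame(carat = c(0.5, 1.2, NA, 0.8),
                  x = c(5.1, 6.8, 4.4, 5.9),
                  y = c(5.1, 6.7, 4.4, 6.0))

# with() evaluates an expression using the columns as variables
with(toy, x == y)              # TRUE FALSE  TRUE FALSE

# [ returns a row of NAs where the condition is NA ...
nrow(toy[toy$carat < 1, ])     # 3
# ... while subset() drops rows where the condition is NA
nrow(subset(toy, carat < 1))   # 2
```
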



     diamonds$volume <- diamonds$x * diamonds$y *
       diamonds$z
     diamonds$pricepc <- diamonds$price /
       diamonds$carat

     diamonds <- transform(diamonds,
       volume = x * y * z,
       pricepc = price / carat)
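
A runnable sketch of transform() on a made-up two-row data frame `toy` (hypothetical values, chosen so the results are easy to check by hand):

```r
toy <- data.frame(price = c(1000, 4500), carat = c(0.5, 1.5),
                  x = c(5, 7), y = c(5, 7), z = c(3, 4))

# Both new columns are computed from the columns that already exist
toy <- transform(toy,
  volume = x * y * z,
  pricepc = price / carat)

toy$volume    # 75 196
toy$pricepc   # 2000 3000
```

Note that transform() evaluates every expression against the original data frame, so a column created in the call cannot be used by another expression in the same call. That case needs two calls, or the explicit $ form.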




Your turn

                   Try to convert your previous statements
                   to use with, subset and transform. Which
                   ones convert easily? Which are hard?
                   When is the shortcut actually a longcut?




Next time

                   Learning how to use LaTeX, a scientific
                   publishing program.
                   If you’re using a laptop, please install
                   LaTeX from the links on the course
                   webpage.




     Logical (boolean)    Numeric (set)
     a & b                intersect(c, d)
     a | b                union(c, d)
     !b                   setdiff(U, d)
     xor(a, b)            union(setdiff(c, d), setdiff(d, c))

     Logical to numeric   Numeric to logical
     c = which(a)         a = U %in% c
     d = which(b)         b = U %in% d

     U = seq_along(a)
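
These equivalences can be checked on two short logical vectors. A sketch with made-up values; the slide's c and d are renamed `c_idx` and `d_idx` here so that the base function c() is not masked, and sort() is added only because union() does not return positions in order:

```r
a <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

U <- seq_along(a)   # every position: 1 2 3 4 5
c_idx <- which(a)   # 1 2 5
d_idx <- which(b)   # 1 3

# Each boolean operation matches a set operation on the positions
identical(which(a & b), intersect(c_idx, d_idx))      # TRUE
identical(which(a | b), sort(union(c_idx, d_idx)))    # TRUE
identical(which(!b), setdiff(U, d_idx))               # TRUE
identical(which(xor(a, b)),
          sort(union(setdiff(c_idx, d_idx),
                     setdiff(d_idx, c_idx))))         # TRUE

# And back from numeric to logical
identical(a, U %in% c_idx)                            # TRUE
identical(b, U %in% d_idx)                            # TRUE
```
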
