plyr
One data-analytic strategy

Hadley Wickham
Rice University
Friday, 29 May 2009
1. Motivation: Deseasonalising ozone measurements
2. Outline of strategy: split-apply-combine
3. Specifics: input vs. output
4. Fiddly details
5. Thoughts on data analysis
24 x 24 x 72 = 41,472

[Figure: map of the 24 x 24 grid of measurement locations; longitude −110 to −60, latitude −20 to 30.]
[Figure: ozone value (0.3 to 1.0) plotted against time (0.0 to 1.0) for a single location.]
resid(deseas1) + mean(one$value)

[Figure: deseasonalised value (0.3 to 1.0) plotted against time (0.0 to 1.0) for the same location.]
How can we do this for all 24 x 24 locations?
(assume ozone levels are stored in a 24 x 24 x 72 array)
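The function deseasf used in the code below is not defined on these slides; a later slide mentions that splitting produces 576 rlm models, so a minimal sketch, assuming six years of monthly measurements and a robust regression on month of year, might look like this (the details are an assumption, not the author's actual code):

library(MASS)  # for rlm

# Hypothetical deseasonalising function: fit a robust linear model of
# ozone on month of year, so that resid() removes the seasonal cycle.
# Assumes the 72 values per location are six years of monthly data.
deseasf <- function(value) {
  month <- factor(rep(1:12, length.out = length(value)))
  rlm(value ~ month, maxit = 50)
}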
With a for loop

models <- as.list(rep(NA, 24 * 24))
dim(models) <- c(24, 24)

deseas <- array(NA, c(24, 24, 72))
dimnames(deseas) <- dimnames(ozone)

for (i in seq_len(24)) {
  for (j in seq_len(24)) {
    mod <- deseasf(ozone[i, j, ])

    models[[i, j]] <- mod
    deseas[i, j, ] <- resid(mod)
  }
}
With apply

models <- apply(ozone, 1:2, deseasf)

resids <- unlist(lapply(models, resid))
dim(resids) <- c(72, 24, 24)
deseas <- aperm(resids, c(2, 3, 1))
dimnames(deseas) <- dimnames(ozone)
With plyr

models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)

Succinct, but you need to know what aaply does
(cf. onomatopoeia, schadenfreude, soliloquy)
[Figure: map of average ozone (legend "avg", 250 to 310) over the same region; longitude −110 to −60, latitude −20 to 30.]
[Figure: a second map over the same region; longitude −110 to −60, latitude −20 to 30.]
Many problems involve splitting up a large data structure, operating on each piece, and joining the results back together:

split-apply-combine
How you split up depends on the type of input: arrays, data frames, lists.
How you combine depends on the type of output: arrays, data frames, lists, nothing.
              output type:
input type    array   data frame   list    nothing
array         aaply   adply        alply   a_ply
data frame    daply   ddply        dlply   d_ply
list          laply   ldply        llply   l_ply
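Read each name as input letter then output letter. As a small illustrative sketch (df, g, and x are made-up names, not from the slides), the same split of a data frame can be combined into any of the output types:

df <- data.frame(g = c("a", "a", "b"), x = 1:3)
daply(df, .(g), function(d) mean(d$x))                     # data frame in, array out
ddply(df, .(g), function(d) data.frame(mean = mean(d$x)))  # data frame in, data frame out
dlply(df, .(g), function(d) mean(d$x))                     # data frame in, list out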
Closest base R equivalents (plyr names remain where base R has no direct equivalent):

              output type:
input type    array    data frame   list     nothing
array         apply    adply        alply    a_ply
data frame    daply    aggregate    by       d_ply
list          sapply   ldply        lapply   l_ply
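For list input, the correspondence looks roughly like this (x is a made-up example):

x <- list(a = 1:3, b = 4:7)
sapply(x, mean)   # base: list in, simplified array out (cf. laply)
lapply(x, mean)   # base: list in, list out (cf. llply)
ldply(x, mean)    # plyr: list in, data frame out (no direct base equivalent)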
Split: array, data frame, list

[Figure: a 2D array split by dimension 1, by dimension 2, or by both dimensions (1,2).]
Split: array, data frame, list

[Figure: a 3D array split by dimension 1, 2, or 3; by pairs of dimensions (1,2), (1,3), (2,3); or by all three (1,2,3).]
Take the 3D array and split it up by the first two dimensions:

models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)

Splitting up ozone gives 576 vectors of length 72.
Splitting up models gives 576 rlm models.

How are they combined?
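A minimal sketch of the array case, assuming the deseasf sketch from earlier: when each piece maps to a length-72 vector, aaply reassembles an array whose leading dimensions are the split dimensions.

deseas <- aaply(ozone, 1:2, function(v) resid(deseasf(v)))
dim(deseas)   # 24 24 72: split dimensions first, then the per-piece output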
Combine: array, data frame, list

[Figure: combining the pieces back into an array; when each piece's result is itself an array, the combined result gains dimensions: 4D!]
Split: array, data frame, list

Full data frame:
  name      age   sex
  John      13    Male
  Mary      15    Female
  Alice     14    Female
  Peter     13    Male
  Roger     14    Male
  Phyllis   13    Female

Split by .(sex):
  name      age   sex          name      age   sex
  John      13    Male         Mary      15    Female
  Peter     13    Male         Alice     14    Female
  Roger     14    Male         Phyllis   13    Female

Split by .(age):
  name      age   sex          name      age   sex          name   age   sex
  John      13    Male         Alice     14    Female       Mary   15    Female
  Peter     13    Male         Roger     14    Male
  Phyllis   13    Female
Combine: array, data frame, list

Applying nrow to each piece:

.(sex)               .(age)             .(sex, age)
sex      value       age   value        sex      age   value
Male     3           13    3            Male     13    2
Female   3           14    2            Male     14    1
                     15    2            Female   13    1
                                        Female   14    1
                                        Female   15    1
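In code, assuming the table above is stored in a data frame called people (a made-up name), the three results come from ddply with nrow as the applied function:

ddply(people, .(sex), nrow)        # counts per sex
ddply(people, .(age), nrow)        # counts per age
ddply(people, .(sex, age), nrow)   # counts per sex/age combination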
Case study: Baseball



21,699 records; 1,228 players; 15-31 years for each player.

id         year   team     g    ab     r     h
ruthba01   1914   BOS       5    10     1     2
ruthba01   1915   BOS      42    92    16    29
ruthba01   1916   BOS      67   136    18    37
ruthba01   1917   BOS      52   123    14    40
ruthba01   1918   BOS      95   317    50    95
ruthba01   1919   BOS     130   432   103   139
ruthba01   1920   NYA     142   457   158   172
ruthba01   1921   NYA     152   540   177   204
ruthba01   1922   NYA     110   406    94   128
ruthba01   1923   NYA     152   522   151   205
ruthba01   1924   NYA     153   529   143   200
ruthba01   1925   NYA      98   359    61   104
ruthba01   1926   NYA     152   495   139   184
ruthba01   1927   NYA     151   540   158   192
ruthba01   1928   NYA     154   536   163   173
ruthba01   1929   NYA     135   499   121   172
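This data ships with plyr as the baseball data frame (assuming the packaged dataset matches these slides), so the case study can be followed along directly:

library(plyr)
data(baseball)
head(baseball[, c("id", "year", "team", "g", "ab", "r", "h")])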
How does performance (rbi/ab) change over the course of a career?

First we need to add a column that gives the "career year".

Easy for a single player:
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth,
  cyear = year - min(year) + 1)

For many players, use ddply + transform:
baseball <- ddply(baseball, "id", transform,
  cyear = year - min(year) + 1)
Draw time series for all 1,228 players

baseball <- subset(baseball, ab >= 25)
xlim <- range(baseball$cyear, na.rm = TRUE)
ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)
plotpattern <- function(df) {
  qplot(cyear, rbi / ab, data = df, geom = "line",
    xlim = xlim, ylim = ylim)
}

pdf("paths.pdf", width = 8, height = 4)
d_ply(baseball, .(reorder(id, rbi / ab)),
  failwith(NA, plotpattern), .print = TRUE)
dev.off()
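The next two figures summarise per-player linear models of rbi/ab against cyear. The model-fitting step is not shown on these slides; a minimal sketch with ddply (model and bcoefs are made-up names) might be:

# Fit a linear model of rbi/ab against career year for each player,
# then collect the coefficients and R^2 into one data frame.
model <- function(df) lm(rbi / ab ~ cyear, data = df)
bcoefs <- ddply(baseball, .(id), function(df) {
  mod <- model(df)
  data.frame(intercept = coef(mod)[1],
             slope     = coef(mod)[2],
             rsquare   = summary(mod)$r.squared)
})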
[Figure: histogram of per-player rsquare (0.0 to 1.0) against count (0 to 200).]
[Figure: two scatterplots of per-player intercept against slope, coloured by rsquare (0.00 to 1.00); the right panel zooms in to slopes between −0.010 and 0.010.]
Fiddly details

Labelling
Progress bars
Consistent argument names
Missing values / Nulls
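Progress bars, for instance, are exposed through the .progress argument that most plyr functions accept; a small sketch (counts is a made-up name):

# Show a text progress bar while splitting, applying and combining.
counts <- ddply(baseball, .(id), nrow, .progress = "text")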
Data analysis

What other patterns of data analysis are waiting to be discovered?
How can we identify these strategies and then develop software to support them?
Does teaching these patterns make it easier for novices to become experts?
http://had.co.nz/plyr



