plyr
One data-analytic strategy

Hadley Wickham
Rice University
Friday, 29 May 2009
1. Motivation: Deseasonalising ozone measurements
2. Outline of strategy: split-apply-combine
3. Specifics: input vs. output
4. Fiddly details
5. Thoughts on data analysis
24 x 24 x 72 = 41,472

[Figure: map of the 24 x 24 grid of measurement locations; longitude −110 to −60, latitude −20 to 30.]
[Figure: ozone value (0.3 to 1.0) plotted against time (0.0 to 1.0) for a single location.]
resid(deseas1) + mean(one$value)

[Figure: deseasonalised value (0.3 to 1.0) plotted against time (0.0 to 1.0) for the same location.]
How can we do this for all 24 x 24 locations?
(assume ozone levels are stored in a 24 x 24 x 72 array)
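The function deseasf used in the code below is not defined on these slides; a later slide mentions that splitting produces 576 rlm models, so a minimal sketch, assuming six years of monthly measurements and a robust regression on month of year, might look like this (the details are an assumption, not the author's actual code):

library(MASS)  # for rlm

# Hypothetical deseasonalising function: fit a robust linear model of
# ozone on month of year, so that resid() removes the seasonal cycle.
# Assumes the 72 values per location are six years of monthly data.
deseasf <- function(value) {
  month <- factor(rep(1:12, length.out = length(value)))
  rlm(value ~ month, maxit = 50)
}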
With a for loop

models <- as.list(rep(NA, 24 * 24))
dim(models) <- c(24, 24)

deseas <- array(NA, c(24, 24, 72))
dimnames(deseas) <- dimnames(ozone)

for (i in seq_len(24)) {
  for (j in seq_len(24)) {
    mod <- deseasf(ozone[i, j, ])

    models[[i, j]] <- mod
    deseas[i, j, ] <- resid(mod)
  }
}
With apply

models <- apply(ozone, 1:2, deseasf)

resids <- unlist(lapply(models, resid))
dim(resids) <- c(72, 24, 24)
deseas <- aperm(resids, c(2, 3, 1))
dimnames(deseas) <- dimnames(ozone)
With plyr

models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)

Succinct, but you need to know what aaply does
(cf. onomatopoeia, schadenfreude, soliloquy)
[Figure: map of average ozone (legend "avg", 250 to 310) over the same region; longitude −110 to −60, latitude −20 to 30.]
[Figure: a second map over the same region; longitude −110 to −60, latitude −20 to 30.]
Many problems involve splitting up a large data structure, operating on each piece, and joining the results back together:

split-apply-combine
How you split up depends on the type of input: arrays, data frames, lists.
How you combine depends on the type of output: arrays, data frames, lists, nothing.
              output type:
input type    array   data frame   list    nothing
array         aaply   adply        alply   a_ply
data frame    daply   ddply        dlply   d_ply
list          laply   ldply        llply   l_ply
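Read each name as input letter then output letter. As a small illustrative sketch (df, g, and x are made-up names, not from the slides), the same split of a data frame can be combined into any of the output types:

df <- data.frame(g = c("a", "a", "b"), x = 1:3)
daply(df, .(g), function(d) mean(d$x))                     # data frame in, array out
ddply(df, .(g), function(d) data.frame(mean = mean(d$x)))  # data frame in, data frame out
dlply(df, .(g), function(d) mean(d$x))                     # data frame in, list out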
Closest base R equivalents (plyr names remain where base R has no direct equivalent):

              output type:
input type    array    data frame   list     nothing
array         apply    adply        alply    a_ply
data frame    daply    aggregate    by       d_ply
list          sapply   ldply        lapply   l_ply
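For list input, the correspondence looks roughly like this (x is a made-up example):

x <- list(a = 1:3, b = 4:7)
sapply(x, mean)   # base: list in, simplified array out (cf. laply)
lapply(x, mean)   # base: list in, list out (cf. llply)
ldply(x, mean)    # plyr: list in, data frame out (no direct base equivalent)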
Split: array, data frame, list

[Figure: a 2D array split by dimension 1, by dimension 2, or by both dimensions (1,2).]
Split: array, data frame, list

[Figure: a 3D array split by dimension 1, 2, or 3; by pairs of dimensions (1,2), (1,3), (2,3); or by all three (1,2,3).]
Take the 3D array and split it up by the first two dimensions:

models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)

Splitting up ozone gives 576 vectors of length 72.
Splitting up models gives 576 rlm models.

How are they combined?
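A minimal sketch of the array case, assuming the deseasf sketch from earlier: when each piece maps to a length-72 vector, aaply reassembles an array whose leading dimensions are the split dimensions.

deseas <- aaply(ozone, 1:2, function(v) resid(deseasf(v)))
dim(deseas)   # 24 24 72: split dimensions first, then the per-piece output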
Combine: array, data frame, list

[Figure: combining the pieces back into an array; when each piece's result is itself an array, the combined result gains dimensions: 4D!]
Split: array, data frame, list

Full data frame:
  name      age   sex
  John      13    Male
  Mary      15    Female
  Alice     14    Female
  Peter     13    Male
  Roger     14    Male
  Phyllis   13    Female

Split by .(sex):
  name      age   sex          name      age   sex
  John      13    Male         Mary      15    Female
  Peter     13    Male         Alice     14    Female
  Roger     14    Male         Phyllis   13    Female

Split by .(age):
  name      age   sex          name      age   sex          name   age   sex
  John      13    Male         Alice     14    Female       Mary   15    Female
  Peter     13    Male         Roger     14    Male
  Phyllis   13    Female
Combine: array, data frame, list

Applying nrow to each piece:

.(sex)               .(age)             .(sex, age)
sex      value       age   value        sex      age   value
Male     3           13    3            Male     13    2
Female   3           14    2            Male     14    1
                     15    2            Female   13    1
                                        Female   14    1
                                        Female   15    1
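In code, assuming the table above is stored in a data frame called people (a made-up name), the three results come from ddply with nrow as the applied function:

ddply(people, .(sex), nrow)        # counts per sex
ddply(people, .(age), nrow)        # counts per age
ddply(people, .(sex, age), nrow)   # counts per sex/age combination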
Case study: Baseball



21,699 records; 1,228 players; 15-31 years for each player.

id         year   team     g    ab     r     h
ruthba01   1914   BOS       5    10     1     2
ruthba01   1915   BOS      42    92    16    29
ruthba01   1916   BOS      67   136    18    37
ruthba01   1917   BOS      52   123    14    40
ruthba01   1918   BOS      95   317    50    95
ruthba01   1919   BOS     130   432   103   139
ruthba01   1920   NYA     142   457   158   172
ruthba01   1921   NYA     152   540   177   204
ruthba01   1922   NYA     110   406    94   128
ruthba01   1923   NYA     152   522   151   205
ruthba01   1924   NYA     153   529   143   200
ruthba01   1925   NYA      98   359    61   104
ruthba01   1926   NYA     152   495   139   184
ruthba01   1927   NYA     151   540   158   192
ruthba01   1928   NYA     154   536   163   173
ruthba01   1929   NYA     135   499   121   172
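This data ships with plyr as the baseball data frame (assuming the packaged dataset matches these slides), so the case study can be followed along directly:

library(plyr)
data(baseball)
head(baseball[, c("id", "year", "team", "g", "ab", "r", "h")])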
How does performance (rbi/ab) change over the course of a career?

First we need to add a column that gives the "career year".

Easy for a single player:
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth,
  cyear = year - min(year) + 1)

For many players, use ddply + transform:
baseball <- ddply(baseball, "id", transform,
  cyear = year - min(year) + 1)
Draw time series for all 1,228 players

baseball <- subset(baseball, ab >= 25)
xlim <- range(baseball$cyear, na.rm = TRUE)
ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)
plotpattern <- function(df) {
  qplot(cyear, rbi / ab, data = df, geom = "line",
    xlim = xlim, ylim = ylim)
}

pdf("paths.pdf", width = 8, height = 4)
d_ply(baseball, .(reorder(id, rbi / ab)),
  failwith(NA, plotpattern), .print = TRUE)
dev.off()
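The next two figures summarise per-player linear models of rbi/ab against cyear. The model-fitting step is not shown on these slides; a minimal sketch with ddply (model and bcoefs are made-up names) might be:

# Fit a linear model of rbi/ab against career year for each player,
# then collect the coefficients and R^2 into one data frame.
model <- function(df) lm(rbi / ab ~ cyear, data = df)
bcoefs <- ddply(baseball, .(id), function(df) {
  mod <- model(df)
  data.frame(intercept = coef(mod)[1],
             slope     = coef(mod)[2],
             rsquare   = summary(mod)$r.squared)
})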
[Figure: histogram of per-player rsquare (0.0 to 1.0) against count (0 to 200).]
[Figure: two scatterplots of per-player intercept against slope, coloured by rsquare (0.00 to 1.00); the right panel zooms in to slopes between −0.010 and 0.010.]
Fiddly details

Labelling
Progress bars
Consistent argument names
Missing values / Nulls
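Progress bars, for instance, are exposed through the .progress argument that most plyr functions accept; a small sketch (counts is a made-up name):

# Show a text progress bar while splitting, applying and combining.
counts <- ddply(baseball, .(id), nrow, .progress = "text")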
Data analysis

What other patterns of data analysis are waiting to be discovered?
How can we identify these strategies and then develop software to support them?
Does teaching these patterns make it easier for novices to become experts?
http://had.co.nz/plyr



