10. How can we do this for
all 24 x 24 locations?
(assume ozone levels stored
in a 24 x 24 x 72 array)
Friday, 29 May 2009
11. With a for loop
models <- as.list(rep(NA, 24 * 24))
dim(models) <- c(24, 24)
deseas <- array(NA, c(24, 24, 72))
dimnames(deseas) <- dimnames(ozone)
for (i in seq_len(24)) {
  for (j in seq_len(24)) {
    mod <- deseasf(ozone[i, j, ])
    models[[i, j]] <- mod
    deseas[i, j, ] <- resid(mod)
  }
}
13. With apply
models <- apply(ozone, 1:2, deseasf)
resids <- unlist(lapply(models, resid))
dim(resids) <- c(72, 24, 24)
deseas <- aperm(resids, c(2, 3, 1))
dimnames(deseas) <- dimnames(ozone)
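A minimal sketch to check that the for-loop and apply versions agree, using base R only. The small simulated array, the month factor, and deseasf_lm (an lm() fit with one mean per month) are assumptions for illustration; they stand in for the real ozone data and deseasf, which are not shown here.

```r
set.seed(1)

# Hypothetical stand-in for the ozone array: a 3 x 4 grid with 24 monthly values
nx <- 3; ny <- 4; nt <- 24
month <- factor(rep(1:12, length.out = nt))
ozone_small <- array(rnorm(nx * ny * nt), c(nx, ny, nt))

# Assumed stand-in for deseasf: ordinary least squares, one mean per month
deseasf_lm <- function(value) lm(value ~ month - 1)

# For-loop version: fill the residual array one location at a time
deseas_loop <- array(NA_real_, c(nx, ny, nt))
for (i in seq_len(nx)) {
  for (j in seq_len(ny)) {
    deseas_loop[i, j, ] <- resid(deseasf_lm(ozone_small[i, j, ]))
  }
}

# apply version: a 3 x 4 list-matrix of models, then reshape the residuals
models <- apply(ozone_small, 1:2, deseasf_lm)
resids <- unlist(lapply(models, resid))
dim(resids) <- c(nt, nx, ny)               # time varies fastest after unlist()
deseas_apply <- aperm(resids, c(2, 3, 1))  # back to location x location x time

same <- isTRUE(all.equal(deseas_loop, deseas_apply))
```

The aperm() step is needed because unlist() stacks each location's residuals contiguously, so time ends up as the first dimension rather than the last.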
15. With plyr
models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)
Succinct, but you need to
know what aaply does
cf. onomatopoeia, schadenfreude, soliloquy
18. Many problems involve splitting up a large
data structure, operating on each piece
and joining the results back together:
split-apply-combine
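The three steps can be sketched in base R with split(), lapply(), and a manual combine. The data frame df and its columns g and x are hypothetical, chosen only to keep the example small.

```r
# Hypothetical data: a grouping column g and a value column x
df <- data.frame(g = c("a", "b", "a", "b", "a"), x = 1:5)

pieces <- split(df, df$g)                        # split: one data frame per group
totals <- lapply(pieces, function(p) sum(p$x))   # apply: summarise each piece
result <- data.frame(                            # combine: back into one data frame
  g = names(totals),
  total = unname(unlist(totals))
)
```

Each plyr function bundles these three steps into a single call.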
19. How you split up depends on the type of
input: arrays, data frames, lists
How you combine depends on the type of
output: arrays, data frames, lists,
nothing
20. Input type (rows) by output type (columns):
             array   data frame   list    nothing
array        aaply   adply        alply   a_ply
data frame   daply   ddply        dlply   d_ply
list         laply   ldply        llply   l_ply
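A small sketch of the naming scheme, assuming the plyr package is installed: the first letter names the input type and the second the output type. The list xs is a hypothetical input.

```r
library(plyr)  # assumes plyr is installed

# Hypothetical input: a named list of numeric vectors
xs <- list(a = 1:3, b = 4:6)

as_array <- laply(xs, mean)   # list in, array out (here a plain vector)
as_list  <- llply(xs, mean)   # list in, list out
as_df    <- ldply(xs, function(x) data.frame(mean = mean(x)))  # list in, data frame out
```

The same function (mean) is applied in every case; only the way the results are combined changes.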
21. The same grid, with base R equivalents where they exist:
             array    data frame   list     nothing
array        apply    adply        alply    a_ply
data frame   daply    aggregate    by       d_ply
list         sapply   ldply        lapply   l_ply
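23. Split: array, data frame, list
[Figure: a 3d array can be split along one dimension (1, 2, or 3), along two dimensions (1,2; 1,3; or 2,3), or along all three (1,2,3).]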
24. Take 3d array, split up by first two dimensions.
models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)
Splitting up ozone gives 576 vectors of length 72.
Splitting up models gives 576 rlm models.
How are they combined?
27. Split: array, data frame, list
Original data:
name     age  sex
John     13   Male
Mary     15   Female
Alice    14   Female
Peter    13   Male
Roger    14   Male
Phyllis  13   Female

Split by .(sex), two pieces:
name     age  sex
John     13   Male
Peter    13   Male
Roger    14   Male

name     age  sex
Mary     15   Female
Alice    14   Female
Phyllis  13   Female

Split by .(age), three pieces:
name     age  sex
John     13   Male
Peter    13   Male
Phyllis  13   Female

name     age  sex
Alice    14   Female
Roger    14   Male

name     age  sex
Mary     15   Female
28. Combine: array, data frame, list
Applying nrow to each piece:

.(sex)
sex     value
Male    3
Female  3

.(age)
age  value
13   3
14   2
15   1

.(sex, age)
sex     age  value
Male    13   2
Male    14   1
Female  13   1
Female  14   1
Female  15   1
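A sketch of the split and combine steps with ddply, assuming the plyr package is installed. The data frame people is a name introduced here; its six rows are the ones in the split example. The name of the count column may vary between plyr versions, so the counts are best read by position.

```r
library(plyr)  # assumes plyr is installed

# The six-row example table; the object name people is an assumption
people <- data.frame(
  name = c("John", "Mary", "Alice", "Peter", "Roger", "Phyllis"),
  age  = c(13, 15, 14, 13, 14, 13),
  sex  = c("Male", "Female", "Female", "Male", "Male", "Female")
)

# Split by the variables in .(), apply nrow to each piece, combine into a data frame
by_sex     <- ddply(people, .(sex), nrow)
by_age     <- ddply(people, .(age), nrow)
by_sex_age <- ddply(people, .(sex, age), nrow)
```

Only combinations that occur in the data appear in by_sex_age, so it has five rows rather than six.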
30. Baseball data: 21,699 records, 1228 players, 15-31 years for each player
id        year  team  g    ab   r    h
ruthba01  1914  BOS   5    10   1    2
ruthba01  1915  BOS   42   92   16   29
ruthba01  1916  BOS   67   136  18   37
ruthba01  1917  BOS   52   123  14   40
ruthba01  1918  BOS   95   317  50   95
ruthba01  1919  BOS   130  432  103  139
ruthba01  1920  NYA   142  457  158  172
ruthba01  1921  NYA   152  540  177  204
ruthba01  1922  NYA   110  406  94   128
ruthba01  1923  NYA   152  522  151  205
ruthba01  1924  NYA   153  529  143  200
ruthba01  1925  NYA   98   359  61   104
ruthba01  1926  NYA   152  495  139  184
ruthba01  1927  NYA   151  540  158  192
ruthba01  1928  NYA   154  536  163  173
ruthba01  1929  NYA   135  499  121  172
31. How does performance (rbi/ab)
change over the course of a career?
First need to add column that gives
“career year”
Easy for a single player.
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth,
  cyear = year - min(year) + 1)
For many players, use ddply + transform
baseball <- ddply(baseball, "id", transform,
  cyear = year - min(year) + 1)
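For comparison, the same per-player calculation can be sketched in base R with ave(), which computes a group-wise summary and returns it aligned with the original rows. This is a swapped-in base R technique rather than the ddply approach, and the toy data frame is hypothetical.

```r
# Hypothetical toy data: two players with different first seasons
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  year = c(1914, 1915, 1916, 1920, 1922)
)

# ave() returns each player's first year, repeated for every row of that player
first_year <- ave(toy$year, toy$id, FUN = min)
toy$cyear <- toy$year - first_year + 1
```

Player b skipped a season, so the career years are 1 and 3: cyear counts calendar years since the first season, not seasons played.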
32. Draw time series for all 1228 players
baseball <- subset(baseball, ab >= 25)
xlim <- range(baseball$cyear, na.rm=TRUE)
ylim <- range(baseball$rbi / baseball$ab, na.rm=TRUE)
plotpattern <- function(df) {
  qplot(cyear, rbi / ab, data = df, geom = "line",
    xlim = xlim, ylim = ylim)
}
pdf("paths.pdf", width = 8, height = 4)
d_ply(baseball, .(reorder(id, rbi / ab)),
  failwith(NA, plotpattern), .print = TRUE)
dev.off()
36. Data analysis
What other patterns of data analysis are
waiting to be discovered?
How can we identify these strategies and
then develop software to support them?
Does teaching these patterns make it
easier for novices to become experts?