The document discusses strategies for analyzing large datasets and applying quantitative models. It summarizes the key steps as: start with a single unit and identify patterns, summarize patterns with a model, apply model to all units, and look for units that don't fit the pattern. It then applies this strategy to housing data from Texas cities from 2000-2009, exploring patterns in Houston and applying models to deseasonalize and analyze sales data across all cities. Different modeling approaches are discussed, including using models as tools to remove seasonal trends and exploring model coefficients in more depth.
1. Stat405 Quantitative models
Hadley Wickham
Wednesday, 28 October 2009
2. 1. Strategy for analysing large data.
2. Introduction to the Texas housing data.
3. What’s happening in Houston?
4. Using models as tools
5. Using models in their own right
3. Large data strategy
Start with a single unit, and identify
interesting patterns.
Summarise patterns with a model.
Apply model to all units.
Look for units that don’t fit the pattern.
Summarise with a single model.
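The first four steps above can be sketched in base R. This is only an illustration on synthetic data: the units "a", "b" and "c", the column names and the 0.7 cut-off are assumptions made for the example, not part of the course material.

```r
# Synthetic stand-in data: three hypothetical units, 60 monthly observations each.
set.seed(1)
df <- expand.grid(t = 1:60, unit = c("a", "b", "c"), stringsAsFactors = FALSE)
df$month <- (df$t - 1) %% 12 + 1
season <- sin(2 * pi * df$month / 12)
# Units "a" and "b" follow a seasonal pattern; unit "c" is pure noise.
df$y <- ifelse(df$unit == "c",
               rnorm(nrow(df)),
               5 + 2 * season + rnorm(nrow(df), sd = 0.2))

# 1. Start with a single unit and look for a pattern.
one <- df[df$unit == "a", ]
# 2. Summarise the pattern with a model.
fit_one <- lm(y ~ factor(month), data = one)
# 3. Apply the same model to all units.
fits <- lapply(split(df, df$unit), function(d) lm(y ~ factor(month), data = d))
# 4. Look for units that don't fit the pattern (low R-squared).
rsq <- sapply(fits, function(m) summary(m)$r.squared)
poor <- names(rsq)[rsq < 0.7]
```

Here `poor` picks out the unit whose behaviour the seasonal model fails to describe.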
4. Texas housing data
For each metropolitan area (45) in Texas,
for each month from 2000 to 2009 (112):
Number of houses listed and sold
Total value of houses, and average sale
price
Average time on market
CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/
5. Strategy
Start with a single city (Houston).
Explore patterns & fit models.
Apply models to all cities.
6. [Plot: avgprice against date, 2000 to 2008]
7. [Plot: sales against date, 2000 to 2008]
8. [Plot: onmarket against date, 2000 to 2008]
9. Seasonal trends
They make it much harder to see the long-term
trend. How can we remove the seasonal pattern?
(Many sophisticated techniques from time
series, but what’s the simplest thing that
might work?)
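One candidate for "the simplest thing that might work" is to subtract each calendar month's average. The sketch below tries this on a made-up series; the numbers and variable names are assumptions for illustration only.

```r
# Hypothetical series: 4 years of monthly data with a trend and a seasonal swing.
month <- rep(1:12, times = 4)
t <- seq_along(month)
y <- 100 + 0.5 * t + 10 * sin(2 * pi * month / 12)

# Average of y within each calendar month, repeated so it lines up with y.
month_mean <- ave(y, month)
# Remove the monthly averages, then add back the overall mean to keep the scale.
y_ds <- y - month_mean + mean(y)
```

After this step every calendar month has the same average, so what remains is the year-to-year movement. Note that any trend within a year is absorbed into the monthly averages, which is one price of keeping the method this simple.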
12. Challenge
What does the following function do?
deseas <- function(var, month) {
  resid(lm(var ~ factor(month))) +
    mean(var, na.rm = TRUE)
}
How could you use it in conjunction with
transform to deseasonalise the data? What if
you wanted to deseasonalise every city?
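One possible answer, sketched in base R on synthetic data. The data frame `toy` and its two cities are invented for the example. The `na.action = na.exclude` argument is an assumed tweak that is not on the slide: it makes `resid()` return `NA` for rows with missing values, so the result keeps the same length as the input.

```r
# deseas() as on the slide, plus the assumed na.exclude tweak described above.
deseas <- function(var, month) {
  resid(lm(var ~ factor(month), na.action = na.exclude)) +
    mean(var, na.rm = TRUE)
}

# Synthetic stand-in for the housing data: two hypothetical cities, 3 years each.
set.seed(42)
toy <- expand.grid(month = 1:12, year = 2000:2002, city = c("A", "B"),
                   stringsAsFactors = FALSE)
toy$sales <- ifelse(toy$city == "A", 5000, 500) *
  (1 + 0.3 * sin(2 * pi * toy$month / 12)) + rnorm(nrow(toy), sd = 20)

# One city: transform() adds the deseasonalised column.
city_a <- transform(toy[toy$city == "A", ], sales_ds = deseas(sales, month))

# Every city: split by city, transform each piece, and bind the pieces together.
pieces <- lapply(split(toy, toy$city), function(df) {
  transform(df, sales_ds = deseas(sales, month))
})
toy_ds <- do.call(rbind, pieces)
```

The split, transform and bind steps are what a single `ddply(tx, "city", transform, ...)` call does in plyr.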
15. Models as tools
Here we’re using the linear model as a
tool - we don’t care about the coefficients
or the standard errors, just using it to get
rid of a striking pattern.
Tukey described this pattern as residuals
and reiteration: by removing a striking
pattern we can see more subtle patterns.
16. [Plot: avgprice_ds against date, 2000 to 2008]
17. [Plot: sales_ds against date, 2000 to 2008]
18. [Plot: onmarket_ds against date, 2000 to 2008]
19. Summary
Most variables seem to be a combination of a
strong seasonal pattern plus a weaker
long-term trend.
How do these patterns hold up for the
rest of Texas? We’ll focus on sales.
20. [Plot: sales against date, 2000 to 2008]
21. [Plot: sales_ds against date, one line per city]
tx <- ddply(tx, "city", transform,
  sales_ds = deseas(sales, month))
qplot(date, sales_ds, data = tx, geom = "line",
  group = city)
22. Is this still such a good idea? What do
we lose? Is there anything else we
could do to improve the plot?
23. [Plot: log10(sales) against date, one line per city]
qplot(date, log10(sales), data = tx, geom = "line",
  group = city)
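Why a log scale helps when cities differ greatly in size can be seen in a small sketch. The two "cities" and their sizes are invented for illustration.

```r
# Two hypothetical cities with the same relative seasonal swing but very
# different sizes.
month <- 1:12
season <- 1 + 0.3 * sin(2 * pi * month / 12)
big <- 5000 * season
small <- 50 * season

# On the raw scale the size of the swing differs by the ratio of city sizes.
raw_ratio <- diff(range(big)) / diff(range(small))

# On the log10 scale the swings are the same; only the level differs.
same_swing <- isTRUE(all.equal(diff(range(log10(big))),
                               diff(range(log10(small)))))
```

On the raw scale the small city's line would look flat next to the large one, while on the log scale both show the same seasonal shape.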
24. It works, but...
Instead of throwing the models away and
just using the residuals, let’s keep the
models and explore them in more depth.
25. Two new tools
dlply: takes a data frame, splits it up in the
same way as ddply, applies a function to
each piece and combines the results into a
list
ldply: takes a list, splits it up into elements,
applies a function to each piece and then
combines the results into a data frame
dlply + ldply = ddply
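The two steps can be mimicked in base R to show what each one produces. This is a rough analogue on a toy data frame, not plyr's implementation; `toy` and its columns are made up for the example.

```r
# Toy data frame with a hypothetical grouping column.
toy <- data.frame(
  city = rep(c("A", "B"), each = 4),
  x = rep(1:4, times = 2),
  y = c(1, 3, 5, 7, 2, 2.5, 3, 3.5)
)

# Data frame -> list (roughly what dlply does): one fitted model per city.
models <- lapply(split(toy, toy$city), function(df) lm(y ~ x, data = df))

# List -> data frame (roughly what ldply does): one row of coefficients per
# model, with the list names carried along as a label column.
coefs <- do.call(rbind, lapply(names(models), function(nm) {
  data.frame(city = nm, t(coef(models[[nm]])), check.names = FALSE)
}))
```

Keeping the intermediate list of models is the point: they can be inspected or summarised in several ways before being collapsed into a data frame.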
26. models <- dlply(tx, "city", function(df)
lm(sales ~ factor(month), data = df))
models[[1]]
coef(models[[1]])
ldply(models, coef)
27. Labelling
Notice we didn’t have to do anything to
have the coefficients labelled correctly.
Behind the scenes plyr records the labels
used for the split step, and ensures they
are preserved across multiple plyr calls.
28. Back to the model
What are some problems with this model?
How could you fix them?
Is the format of the coefficients optimal?
Turn to the person next to you and
discuss for 2 minutes.
30. qplot(date, log10(sales), data = tx, geom = "line",
  group = city)
models2 <- dlply(tx, "city", function(df)
  lm(log10(sales) ~ factor(month), data = df))
# Put coefficients in rows, so they can be plotted more easily
coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1])
})
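The `c(0, coef(mod)[-1])` step relies on how R codes factors by default. A sketch for one hypothetical city, with noise-free made-up sales so the result can be checked exactly:

```r
# Hypothetical single-city data: 3 years of monthly sales with a known
# multiplicative seasonal pattern and no noise.
month <- rep(1:12, times = 3)
true_factor <- 1 + 0.3 * sin(2 * pi * (1:12) / 12)
sales <- 1000 * true_factor[month]

mod <- lm(log10(sales) ~ factor(month))

# With R's default treatment contrasts, January is the baseline: its effect is
# 0 and every other coefficient is a difference from January on the log10 scale.
eff <- data.frame(
  month = 1:12,
  effect = c(0, unname(coef(mod)[-1])),
  intercept = unname(coef(mod)[1])
)

# Back-transforming gives each month's sales as a multiple of January's.
eff$ratio <- 10 ^ eff$effect
```

This is why plotting `10 ^ effect` reads as "sales relative to January" for each city.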
31. [Plot: effect against month, one line per city]
qplot(month, effect, data = coef2, group = city, geom = "line")
32. [Plot: 10^effect against month, one line per city]
qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line")
33. [Plot: 10^effect against month, one panel per city (45 panels, Abilene to Wichita Falls)]
qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city)
34. What should
we do next?
What do you think?
You have 30 seconds to come up with (at
least) one idea.
35. My ideas
Fit a single model, log(sales) ~ city *
factor(month), and look at residuals
Fit individual models, log(sales) ~
factor(month) + ns(date, 3), look for cities
that don’t fit
36. # One approach - fit a single model
mod <- lm(log10(sales) ~ city + factor(month),
data = tx)
tx$sales2 <- 10 ^ resid(mod)
qplot(date, sales2, data = tx, geom = "line",
group = city)
last_plot() + facet_wrap(~ city)
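To see what the back-transformed residuals represent, the same model can be fitted to synthetic data in base R. The cities, their sizes and the 5% yearly growth are assumptions invented for this sketch.

```r
set.seed(7)
toy <- expand.grid(month = 1:12, year = 2000:2004, city = c("A", "B", "C"),
                   stringsAsFactors = FALSE)
level <- c(A = 5000, B = 500, C = 50)[toy$city]
season <- 1 + 0.3 * sin(2 * pi * toy$month / 12)
# A hypothetical 5% yearly growth is the long-term trend we hope to reveal.
growth <- 1.05 ^ (toy$year - 2000)
toy$sales <- level * season * growth * exp(rnorm(nrow(toy), sd = 0.01))

# One model for all cities: a level per city plus a shared monthly pattern.
mod <- lm(log10(sales) ~ city + factor(month), data = toy)
# Sales relative to what city level and month alone would predict.
toy$sales2 <- 10 ^ resid(mod)

# Averaging by year shows the trend that is left once both are removed.
yearly <- tapply(toy$sales2, toy$year, mean)
```

Values of `sales2` above 1 mean more sales than the city's level and the month would suggest, and values below 1 mean fewer.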
37. [Plot: sales2 against date, one line per city]
qplot(date, sales2, data = tx, geom = "line", group = city)
38. [Plot: sales2 against date, one panel per city (45 panels, Abilene to Wichita Falls)]
last_plot() + facet_wrap(~ city)
39. # Another approach: Essence of most cities is seasonal
# term plus long term smooth trend. We could fit this
# model to each city, and then look for models which don't
# fit well.
library(splines)
models3 <- dlply(tx, "city", function(df) {
lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
})
# Extract rsquared from each model
rsq <- function(mod) c(rsq = summary(mod)$r.squared)
quality <- ldply(models3, rsq)
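The same idea can be tried without plyr on synthetic data. In this sketch the two cities, the numeric `date` column and the noise levels are all assumptions; `split()` and `lapply()` stand in for `dlply`, and the 0.7 cut-off is arbitrary.

```r
library(splines)  # ships with R; provides ns()

set.seed(3)
toy <- expand.grid(month = 1:12, year = 2000:2007, city = c("A", "B"),
                   stringsAsFactors = FALSE)
# Numeric stand-in for the date column (an assumption; the real data may differ).
toy$date <- toy$year + (toy$month - 1) / 12
season <- 0.1 * sin(2 * pi * toy$month / 12)
trend <- 0.05 * (toy$date - 2000)
# City "A" follows season + smooth trend closely; city "B" is mostly noise.
noise_sd <- ifelse(toy$city == "A", 0.02, 0.5)
toy$log_sales <- 3 + season + trend + rnorm(nrow(toy), sd = noise_sd)

# Seasonal term plus a smooth long-term trend, fitted separately to each city.
models <- lapply(split(toy, toy$city), function(df) {
  lm(log_sales ~ factor(month) + ns(date, 3), data = df)
})

# Extract R-squared from each model and flag the poor fits.
rsq <- function(mod) c(rsq = summary(mod)$r.squared)
quality <- data.frame(city = names(models),
                      rsq = unname(sapply(models, rsq)))
quality$poor <- quality$rsq < 0.7
```

A low R-squared only says that season plus smooth trend explains little of the variation; it does not say why, which is what the residual plots are for.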
40. [Plot: rsq (x-axis, about 0.5 to 0.9) for each city (y-axis), cities in alphabetical order]
qplot(rsq, city, data = quality)
41. [Plot: rsq for each city, cities ordered by rsq, with San Antonio at the top and Port Arthur at the bottom]
How are the good fits different from the bad fits?
qplot(rsq, reorder(city, rsq), data = quality)
42. quality$poor <- quality$rsq < 0.7
tx2 <- merge(tx, quality, by = "city")
cities <- dlply(tx, "city")
mfit <- mdply(cbind(mod = models3, df = cities),
function(mod, df) {
data.frame(
city = df$city, date = df$date,
resid = resid(mod), pred = predict(mod))
})
tx2 <- merge(tx2, mfit)
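The pairing of each model with its own data can also be written with base R's `Map()`. This is an analogue of the `mdply` step on toy data, not plyr's implementation; the data frame and its values are invented.

```r
# Toy inputs: a list of per-city data frames and a matching list of models.
toy <- data.frame(
  city = rep(c("A", "B"), each = 5),
  date = rep(2000:2004, times = 2),
  y = c(1.0, 2.1, 2.9, 4.2, 5.0, 3.0, 2.8, 3.1, 2.9, 3.2)
)
cities <- split(toy, toy$city)
models <- lapply(cities, function(df) lm(y ~ date, data = df))

# Map() walks the two lists in parallel, calling the function once per
# (model, data) pair, much as mdply calls it once per row of its input.
mfit <- do.call(rbind, Map(function(mod, df) {
  data.frame(city = df$city, date = df$date,
             resid = unname(resid(mod)), pred = unname(predict(mod)))
}, models, cities))
rownames(mfit) <- NULL

# Residuals and predictions can now be joined back onto the raw data.
toy2 <- merge(toy, mfit, by = c("city", "date"))
```

Naming the merge keys explicitly, as in the last line, avoids accidental joins on other columns that happen to share a name.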
43. mdply: takes each row of input and feeds it to the function
44. Raw data
[Plot: log10(sales) against date, one panel per city (45 panels, Abilene to Wichita Falls)]
45. Predictions
[Plot: pred against date, one panel per city (45 panels, Abilene to Wichita Falls)]
46. Residuals
[Plot: resid against date, one panel per city (45 panels, Abilene to Wichita Falls)]
47. Your turn
Pick one of the other variables and
perform a similar exploration, working
through the same steps.
48. Conclusions
Simple (and relatively small) example, but
shows how collections of models can be
useful for gaining understanding.
Each attempt illustrated something new
about the data.
Plyr made it easy to create and summarise
a collection of models, so we didn’t have to
worry about the mechanics.