SlideShare uma empresa Scribd logo
1 de 48
Baixar para ler offline
Stat405               Quantitative models




                             Hadley Wickham
Wednesday, 28 October 2009
1. Strategy for analysing large data.
               2. Introduction to the Texas housing
                  data.
               3. What’s happening in Houston?
               4. Using a models as a tool
               5. Using models in their own right


Wednesday, 28 October 2009
Large data strategy
                   Start with a single unit, and identify
                   interesting patterns.
                   Summarise patterns with a model.
                   Apply model to all units.
                   Look for units that don’t fit the pattern.
                   Summarise with a single model.


Wednesday, 28 October 2009
Texas housing data
                   For each metropolitan area (45) in Texas,
                   for each month from 2000 to 2009 (112):
                   Number of houses listed and sold
                   Total value of houses, and average sale
                   price
                   Average time on market


CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/
Wednesday, 28 October 2009
Strategy

                   Start with a single city (Houston).
                   Explore patterns & fit models.
                   Apply models to all cities.




Wednesday, 28 October 2009
220000




            200000
 avgprice




            180000




            160000




                     2000    2002   2004    2006   2008
                                     date




Wednesday, 28 October 2009
8000


         7000


         6000
 sales




         5000


         4000


         3000


                 2000        2002   2004     2006   2008
                                      date




Wednesday, 28 October 2009
6.5


            6.0
 onmarket




            5.5


            5.0


            4.5


            4.0

                  2000       2002   2004      2006   2008
                                       date




Wednesday, 28 October 2009
Seasonal trends

                   Make it much harder to see long term
                   trend. How can we remove the trend?
                   (Many sophisticated techniques from time
                   series, but what’s the simplest thing that
                   might work?)




Wednesday, 28 October 2009
220000




            200000
 avgprice




            180000




            160000




                             2   4     6     8   10   12
                                     month
Wednesday, 28 October 2009
220000




            200000
 avgprice




            180000




            160000




qplot(month, avgprice, data = houston, geom = "line", group = year) +
  stat_summary(aes(group = 1), fun.y = "mean", geom = "line",
               2         4          6        8         10        12
  colour = "red", size = 2, na.rm month
                                  = TRUE)
Wednesday, 28 October 2009
Challenge
                   What does the following function do?
                   deseas <- function(var, month) {
                     resid(lm(var ~ factor(month))) +
                       mean(var, na.rm = TRUE)
                   }
                   How could you use it in conjunction with
                   transform to deasonalise the data? What if
                   you wanted to deasonalise every city?


Wednesday, 28 October 2009
houston <- transform(houston,
       avgprice_ds = deseas(avgprice, month),
       listings_ds = deseas(listings, month),
       sales_ds = deseas(sales, month),
       onmarket_ds = deseas(onmarket, month)
     )

     qplot(month, sales_ds, data = houston,
       geom = "line", group = year) + avg




Wednesday, 28 October 2009
210000



               200000



               190000
 avgprice_ds




               180000



               170000



               160000



               150000



                             2   4     6     8   10   12
                                     month
Wednesday, 28 October 2009
Model as tools
                   Here we’re using the linear model as a
                   tool - we don’t care about the coefficients
                   or the standard errors, just using it to get
                   rid of a striking pattern.
                   Tukey described this pattern as residuals
                   and reiteration: by removing a striking
                   pattern we can see more subtle patterns.


Wednesday, 28 October 2009
210000


               200000


               190000
 avgprice_ds




               180000


               170000


               160000


               150000


                        2000   2002   2004    2006   2008
                                       date




Wednesday, 28 October 2009
7000


            6500


            6000
 sales_ds




            5500


            5000


            4500


            4000

                   2000      2002   2004     2006   2008
                                      date




Wednesday, 28 October 2009
6.5



               6.0
 onmarket_ds




               5.5



               5.0



               4.5



               4.0

                     2000    2002   2004      2006   2008
                                       date




Wednesday, 28 October 2009
Summary

                   Most variables seem to be combination of
                   strong seasonal pattern plus weaker long-
                   term trend.
                   How do these patterns hold up for the
                   rest of Texas? We’ll focus on sales.




Wednesday, 28 October 2009
8000



         6000
 sales




         4000



         2000




                 2000        2002   2004     2006   2008
                                      date




Wednesday, 28 October 2009
7000

            6000

            5000
 sales_ds




            4000

            3000

            2000

            1000




                   2000      2002   2004   2006   2008
tx <- ddply(tx, "city", transform,
                           date

  sales_ds = deseas(sales, month))
qplot(date, sales_ds, data = tx, geom = "line",
  group = city)
Wednesday, 28 October 2009
Is this still such a good idea? What do
                                    we lose? Is there anything else we
                                    could do to improve the plot?
            7000

            6000

            5000
 sales_ds




            4000

            3000

            2000

            1000




                   2000      2002    2004         2006        2008
tx <- ddply(tx, "city", transform,
                           date

  sales_ds = deseas(sales, month))
qplot(date, sales_ds, data = tx, geom = "line",
  group = city)
Wednesday, 28 October 2009
3.5



                3.0



                2.5
 log10(sales)




                2.0



                1.5



                1.0



                0.5

qplot(date, log10(sales), data = tx, geom = "line",
      2000      2002       2004       2006      2008
  group = city)                date
Wednesday, 28 October 2009
It works, but...


                   Instead of throwing the models away and
                   just using the residuals, let’s keep the
                   models and explore them in more depth.




Wednesday, 28 October 2009
Two new tools
                   dlply: takes a data frame, splits up in the
                   same way as ddply, applies function to
                   each piece and combines the results into a
                   list
                   ldply: takes a list, splits up into elements,
                   applies function to each piece and then
                   combines the results into a data frame
                   dlply + ldply = ddply


Wednesday, 28 October 2009
models <- dlply(tx, "city", function(df)
       lm(sales ~ factor(month), data = df))


     models[[1]]
     coef(models[[1]])

     ldply(models, coef)




Wednesday, 28 October 2009
Labelling

                   Notice we didn’t have to do anything to
                   have the coefficients labelled correctly.
                   Behind the scenes plyr records the labels
                   used for the split step, and ensures they
                   are preserved across multiple plyr calls.




Wednesday, 28 October 2009
Back to the model

                   What are some problems with this model?
                   How could you fix them?
                   Is the format of the coefficients optimal?
                   Turn to the person next to you and
                   discuss for 2 minutes.



Wednesday, 28 October 2009
qplot(date, log10(sales), data = tx, geom = "line",
     group = city)

     models2 <- dlply(tx, "city", function(df)
       lm(log10(sales) ~ factor(month), data = df))

     coef2 <- ldply(models2, function(mod) {
        data.frame(
          month = 1:12,
          effect = c(0, coef(mod)[-1]),
          intercept = coef(mod)[1])
     })


Wednesday, 28 October 2009
qplot(date, log10(sales), data = tx, geom = "line",
     group = city)

     models2 <- dlply(tx, "city", function(df)
       lm(log10(sales) ~ factor(month), data = df))

     coef2 <- ldply(models2, function(mod) {
        data.frame(
          month = 1:12,
          effect = c(0, coef(mod)[-1]),
          intercept = coef(mod)[1])
     })
                               Put coefficients in
                             rows, so they can be
                              plotted more easily
Wednesday, 28 October 2009
0.4




           0.3




           0.2
 effect




           0.1




           0.0




          −0.1




                             2   4       6            8           10   12
qplot(month, effect, data = coef2, group month
                                         = city, geom = "line")
Wednesday, 28 October 2009
2.5




             2.0
 10^effect




             1.5




             1.0




                             2   4       6           8           10    12
                                         month
qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line")
Wednesday, 28 October 2009
Abilene          Amarillo         Arlington        Austin        Bay Area      Beaumont      Brazoria County
             2.5
             2.0
             1.5
             1.0
                      BrownsvilleBryan−College Station
                                                     Collin County     Corpus Christi    Dallas     Denton County      El Paso
             2.5
             2.0
             1.5
             1.0
                      Fort Bend        Fort Worth        Galveston        Garland       Harlingen     Houston           Irving
             2.5
             2.0
             1.5
             1.0
                   Killeen−Fort Hood     Laredo      Longview−Marshall   Lubbock         Lufkin        McAllen         Midland
 10^effect




             2.5
             2.0
             1.5
             1.0
               Montgomery CountyNacogdoches NE Tarrant County             Odessa        Palestine       Paris         Port Arthur
             2.5
             2.0
             1.5
             1.0
                      San Angelo       San Antonio      San Marcos Sherman−DenisonTemple−Belton       Texarkana          Tyler
             2.5
             2.0
             1.5
             1.0
                       Victoria          Waco          Wichita Falls
             2.5
             2.0
             1.5
             1.0
                     2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012
                                         month
qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city)
Wednesday, 28 October 2009
What should
                             we do next?

                   What do you think?
                   You have 30 seconds to come up with (at
                   least) one idea.




Wednesday, 28 October 2009
My ideas

                   Fit a single model, log(sales) ~ city *
                   factor(month), and look at residuals
                   Fit individual models, log(sales) ~
                   factor(month) + ns(date, 3), look cities
                   that don’t fit



Wednesday, 28 October 2009
# One approach - fit a single model

     mod <- lm(log10(sales) ~ city + factor(month),
       data = tx)

     tx$sales2 <- 10 ^ resid(mod)
     qplot(date, sales2, data = tx, geom = "line",
       group = city)
     last_plot() + facet_wrap(~ city)




Wednesday, 28 October 2009
3.5



          3.0



          2.5
 sales2




          2.0



          1.5



          1.0



          0.5




                2000         2002     2004           2006     2008
                                         date
qplot(date, sales2, data = tx, geom = "line", group = city)
Wednesday, 28 October 2009
Abilene          Amarillo         Arlington        Austin        Bay Area      Beaumont      Brazoria County
          3.5
          3.0
          2.5
          2.0
          1.5
          1.0
          0.5
                   BrownsvilleBryan−College Station
                                                  Collin County     Corpus Christi    Dallas     Denton County      El Paso
          3.5
          3.0
          2.5
          2.0
          1.5
          1.0
          0.5
                   Fort Bend        Fort Worth        Galveston        Garland       Harlingen     Houston           Irving
          3.5
          3.0
          2.5
          2.0
          1.5
          1.0
          0.5
                Killeen−Fort Hood     Laredo      Longview−Marshall   Lubbock         Lufkin        McAllen         Midland
          3.5
 sales2




          3.0
          2.5
          2.0
          1.5
          1.0
          0.5
            Montgomery CountyNacogdoches NE Tarrant County             Odessa        Palestine       Paris         Port Arthur
          3.5
          3.0
          2.5
          2.0
          1.5
          1.0
          0.5
                   San Angelo       San Antonio      San Marcos Sherman−DenisonTemple−Belton       Texarkana          Tyler
          3.5
          3.0
          2.5
          2.0
          1.5
          1.0
          0.5
                    Victoria          Waco          Wichita Falls
          3.5
          3.0
          2.5
          2.0
          1.5
          1.0
          0.5
            2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008
              2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006
                                                                      date
last_plot() + facet_wrap(~ city)
Wednesday, 28 October 2009
#    Another approach: Essence of most cities is seasonal
     #    term plus long term smooth trend. We could fit this
     #    model to each city, and then look for models which don't
     #    fit well.

     library(splines)
     models3 <- dlply(tx, "city", function(df) {
        lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
     })

     # Extract rsquared from each model
     rsq <- function(mod) c(rsq = summary(mod)$r.squared)
     quality <- ldply(models3, rsq)




Wednesday, 28 October 2009
Wichita Falls                                                                   ●
                                                Waco                                                                                ●
                                              Victoria                                   ●
                                                 Tyler                                                                                      ●
                                          Texarkana                                  ●
                                     Temple−Belton                                                                              ●
                                Sherman−Denison                                                                    ●
                                        San Marcos                                                         ●
                                        San Antonio                                                                                                     ●
                                        San Angelo                                                     ●
                                         Port Arthur     ●
                                                 Paris                           ●
                                            Palestine                  ●
                                             Odessa                                                            ●
                                 NE Tarrant County                                                                                          ●
                                      Nacogdoches                                            ●
                               Montgomery County                                                                                                     ●
                                             Midland                                                                           ●
                                             McAllen                                                                   ●
                                               Lufkin                                    ●
                                            Lubbock                                                                             ●
                                Longview−Marshall                                                                               ●
                                              Laredo                                                                           ●
                      city




                                 Killeen−Fort Hood                                                                         ●
                                                Irving                                                         ●
                                             Houston                                                                                                ●
                                           Harlingen                                                                       ●
                                             Garland                                                                           ●
                                          Galveston                                                            ●
                                         Fort Worth                                                                                             ●
                                          Fort Bend                                                                                             ●
                                             El Paso               ●
                                     Denton County                                                                                              ●
                                               Dallas                                                                                            ●
                                      Corpus Christi                                                                                    ●
                                       Collin County                                                                                            ●
                             Bryan−College Station                                                                                               ●
                                         Brownsville                                                                   ●
                                   Brazoria County                                                                     ●
                                          Beaumont                                       ●
                                            Bay Area                                                                                    ●
                                               Austin                                                                                           ●
                                            Arlington                                                                                   ●
                                             Amarillo                                             ●
                                              Abilene                                                                      ●



                                                             0.5             0.6                 0.7               0.8              0.9
                                                                           rsq
qplot(rsq, city, data = quality)
Wednesday, 28 October 2009
San Antonio                                                                                                ●
                                             Montgomery County                                                                                                  ●
                                                           Houston         How are the good                                                                    ●
                                                             Dallas                                                                                           ●
                                           Bryan−College Station
                                                     Collin County
                                                                           fits different from                                                               ●
                                                                                                                                                              ●

                                                   Denton County
                                                             Austin
                                                                             the bad fits?                                                                   ●
                                                                                                                                                            ●
                                                        Fort Bend                                                                                           ●
                                                       Fort Worth                                                                                           ●
                                                               Tyler                                                                                       ●
                                               NE Tarrant County                                                                                          ●
                                                          Bay Area                                                                                    ●
                                                    Corpus Christi                                                                                    ●
                                                          Arlington                                                                                   ●
                                                              Waco                                                                                ●
                                                   Temple−Belton                                                                              ●
                                                          Lubbock                                                                            ●
                      reorder(city, rsq)




                                                           Garland                                                                           ●
                                              Longview−Marshall                                                                              ●
                                                           Midland                                                                           ●
                                                            Laredo                                                                          ●
                                                         Harlingen                                                                      ●
                                                            Abilene                                                                     ●
                                               Killeen−Fort Hood                                                                        ●
                                                 Brazoria County                                                                    ●
                                                           McAllen                                                                  ●
                                                       Brownsville                                                                  ●
                                                     Wichita Falls                                                                  ●
                                              Sherman−Denison                                                                   ●
                                                              Irving                                                        ●
                                                        Galveston                                                           ●
                                                           Odessa                                                           ●
                                                      San Marcos                                                        ●
                                                      San Angelo                                                    ●
                                                           Amarillo                                            ●
                                                    Nacogdoches                                           ●
                                                             Lufkin                                   ●
                                                            Victoria                                  ●
                                                        Beaumont                                     ●
                                                        Texarkana                                ●
                                                               Paris                        ●
                                                          Palestine                    ●
                                                           El Paso                 ●
                                                       Port Arthur     ●



                                                                             0.5           0.6                0.7               0.8               0.9
                                          rsq
qplot(rsq, reorder(city, rsq), data = quality)
Wednesday, 28 October 2009
quality$poor <- quality$rsq < 0.7
     tx2 <- merge(tx, quality, by = "city")

     cities <- dlply(tx, "city")

     mfit <- mdply(cbind(mod = models3, df = cities),
        function(mod, df) {
          data.frame(
            city = df$city, date = df$date,
            resid = resid(mod), pred = predict(mod))
     })
     tx2 <- merge(tx2, mfit)



Wednesday, 28 October 2009
quality$poor <- quality$rsq < 0.7
     tx2 <- merge(tx, quality, by = "city")
                             Takes each row of input
     cities <-               dlply(tx, to function
                             and feeds it "city")

     mfit <- mdply(cbind(mod = models3, df = cities),
        function(mod, df) {
          data.frame(
            city = df$city, date = df$date,
            resid = resid(mod), pred = predict(mod))
     })
     tx2 <- merge(tx2, mfit)



Wednesday, 28 October 2009
Abilene          Amarillo        Arlington        Austin       Bay Area      Beaumont      Brazoria County
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                         BrownsvilleBryan−College Station
                                                        Collin County   Corpus Christi    Dallas     Denton County      El Paso
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                         Fort Bend        Fort Worth       Galveston       Garland       Harlingen     Houston           Irving
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
 log10(sales)




                      Killeen−Fort Hood    Laredo      Longview−Marshall   Lubbock        Lufkin        McAllen         Midland
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                  Montgomery CountyNacogdoches NE Tarrant County Odessa      Palestine     Paris   Port Arthur
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                      San Angelo    San Antonio  San Marcos Sherman−DenisonTemple−Belton Texarkana    Tyler
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                        Victoria      Waco       Wichita Falls
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                   2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008
                     2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006   Raw data
                                                                           date
Wednesday, 28 October 2009
Abilene          Amarillo        Arlington        Austin       Bay Area      Beaumont      Brazoria County
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                 BrownsvilleBryan−College Station
                                                Collin County   Corpus Christi    Dallas     Denton County      El Paso
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                 Fort Bend        Fort Worth       Galveston       Garland       Harlingen     Houston           Irving
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
              Killeen−Fort Hood    Laredo      Longview−Marshall   Lubbock        Lufkin        McAllen         Midland
        3.5
        3.0
 pred




        2.5
        2.0
        1.5
        1.0
          Montgomery CountyNacogdoches NE Tarrant County Odessa      Palestine     Paris   Port Arthur
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
              San Angelo    San Antonio  San Marcos Sherman−DenisonTemple−Belton Texarkana    Tyler
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                Victoria      Waco       Wichita Falls
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                                                                                                  Predictions
           2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008
             2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006
                                                                   date
Wednesday, 28 October 2009
Abilene          Amarillo         Arlington        Austin        Bay Area      Beaumont      Brazoria County
          0.4
          0.2
          0.0
         −0.2
         −0.4
                   BrownsvilleBryan−College Station
                                                  Collin County     Corpus Christi    Dallas     Denton County      El Paso
          0.4
          0.2
          0.0
         −0.2
         −0.4
                   Fort Bend        Fort Worth        Galveston        Garland       Harlingen     Houston           Irving
          0.4
          0.2
          0.0
         −0.2
         −0.4
                Killeen−Fort Hood     Laredo      Longview−Marshall   Lubbock         Lufkin        McAllen         Midland
          0.4
          0.2
 resid




          0.0
         −0.2
         −0.4
            Montgomery County
                            Nacogdoches NE Tarrant County              Odessa        Palestine       Paris         Port Arthur
          0.4
          0.2
          0.0
         −0.2
         −0.4
                   San Angelo       San Antonio      San Marcos Sherman−DenisonTemple−Belton       Texarkana          Tyler
          0.4
          0.2
          0.0
         −0.2
         −0.4
                    Victoria          Waco          Wichita Falls
          0.4
          0.2
          0.0
         −0.2
         −0.4
            2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008
              2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006      Residuals
                                                                      date
Wednesday, 28 October 2009
Your turn


                   Pick one of the other variables and
                   perform a similar exploration, working
                   through the same steps.




Wednesday, 28 October 2009
Conclusions
                   Simple (and relatively small) example, but
                   shows how collections of models can be
                   useful for gaining understanding.
                   Each attempt illustrated something new
                   about the data.
                   Plyr made it easy to create and summarise
                   collection of models, so we didn’t have to
                   worry about the mechanics.


Wednesday, 28 October 2009

Mais conteúdo relacionado

Mais de Hadley Wickham (20)

27 development
27 development27 development
27 development
 
24 modelling
24 modelling24 modelling
24 modelling
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
22 spam
22 spam22 spam
22 spam
 
21 spam
21 spam21 spam
21 spam
 
20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
08 functions
08 functions08 functions
08 functions
 

Último

2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 

Último (20)

2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 

Quantitative models of housing data

  • 1. Stat405 Quantitative models Hadley Wickham Wednesday, 28 October 2009
  • 2. 1. Strategy for analysing large data. 2. Introduction to the Texas housing data. 3. What’s happening in Houston? 4. Using a models as a tool 5. Using models in their own right Wednesday, 28 October 2009
  • 3. Large data strategy Start with a single unit, and identify interesting patterns. Summarise patterns with a model. Apply model to all units. Look for units that don’t fit the pattern. Summarise with a single model. Wednesday, 28 October 2009
  • 4. Texas housing data For each metropolitan area (45) in Texas, for each month from 2000 to 2009 (112): Number of houses listed and sold Total value of houses, and average sale price Average time on market CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/ Wednesday, 28 October 2009
  • 5. Strategy Start with a single city (Houston). Explore patterns & fit models. Apply models to all cities. Wednesday, 28 October 2009
  • 6. 220000 200000 avgprice 180000 160000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
  • 7. 8000 7000 6000 sales 5000 4000 3000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
  • 8. 6.5 6.0 onmarket 5.5 5.0 4.5 4.0 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
  • 9. Seasonal trends Make it much harder to see long term trend. How can we remove the trend? (Many sophisticated techniques from time series, but what’s the simplest thing that might work?) Wednesday, 28 October 2009
  • 10. 220000 200000 avgprice 180000 160000 2 4 6 8 10 12 month Wednesday, 28 October 2009
  • 11. 220000 200000 avgprice 180000 160000 qplot(month, avgprice, data = houston, geom = "line", group = year) + stat_summary(aes(group = 1), fun.y = "mean", geom = "line", 2 4 6 8 10 12 colour = "red", size = 2, na.rm month = TRUE) Wednesday, 28 October 2009
  • 12. Challenge What does the following function do? deseas <- function(var, month) { resid(lm(var ~ factor(month))) + mean(var, na.rm = TRUE) } How could you use it in conjunction with transform to deasonalise the data? What if you wanted to deasonalise every city? Wednesday, 28 October 2009
  • 13. houston <- transform(houston, avgprice_ds = deseas(avgprice, month), listings_ds = deseas(listings, month), sales_ds = deseas(sales, month), onmarket_ds = deseas(onmarket, month) ) qplot(month, sales_ds, data = houston, geom = "line", group = year) + avg Wednesday, 28 October 2009
  • 14. 210000 200000 190000 avgprice_ds 180000 170000 160000 150000 2 4 6 8 10 12 month Wednesday, 28 October 2009
  • 15. Model as tools Here we’re using the linear model as a tool - we don’t care about the coefficients or the standard errors, just using it to get rid of a striking pattern. Tukey described this pattern as residuals and reiteration: by removing a striking pattern we can see more subtle patterns. Wednesday, 28 October 2009
  • 16. 210000 200000 190000 avgprice_ds 180000 170000 160000 150000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
  • 17. 7000 6500 6000 sales_ds 5500 5000 4500 4000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
  • 18. 6.5 6.0 onmarket_ds 5.5 5.0 4.5 4.0 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
  • 19. Summary Most variables seem to be combination of strong seasonal pattern plus weaker long- term trend. How do these patterns hold up for the rest of Texas? We’ll focus on sales. Wednesday, 28 October 2009
  • 20. 8000 6000 sales 4000 2000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
  • 21. 7000 6000 5000 sales_ds 4000 3000 2000 1000 2000 2002 2004 2006 2008 tx <- ddply(tx, "city", transform, date sales_ds = deseas(sales, month)) qplot(date, sales_ds, data = tx, geom = "line", group = city) Wednesday, 28 October 2009
  • 22. Is this still such a good idea? What do we lose? Is there anything else we could do to improve the plot? 7000 6000 5000 sales_ds 4000 3000 2000 1000 2000 2002 2004 2006 2008 tx <- ddply(tx, "city", transform, date sales_ds = deseas(sales, month)) qplot(date, sales_ds, data = tx, geom = "line", group = city) Wednesday, 28 October 2009
  • 23. 3.5 3.0 2.5 log10(sales) 2.0 1.5 1.0 0.5 qplot(date, log10(sales), data = tx, geom = "line", 2000 2002 2004 2006 2008 group = city) date Wednesday, 28 October 2009
  • 24. It works, but... Instead of throwing the models away and just using the residuals, let’s keep the models and explore them in more depth. Wednesday, 28 October 2009
  • 25. Two new tools dlply: takes a data frame, splits up in the same way as ddply, applies function to each piece and combines the results into a list ldply: takes a list, splits up into elements, applies function to each piece and then combines the results into a data frame dlply + ldply = ddply Wednesday, 28 October 2009
  • 26. models <- dlply(tx, "city", function(df) lm(sales ~ factor(month), data = df)) models[[1]] coef(models[[1]]) ldply(models, coef) Wednesday, 28 October 2009
  • 27. Labelling Notice we didn’t have to do anything to have the coefficients labelled correctly. Behind the scenes plyr records the labels used for the split step, and ensures they are preserved across multiple plyr calls. Wednesday, 28 October 2009
  • 28. Back to the model What are some problems with this model? How could you fix them? Is the format of the coefficients optimal? Turn to the person next to you and discuss for 2 minutes. Wednesday, 28 October 2009
  • 29. qplot(date, log10(sales), data = tx, geom = "line", group = city) models2 <- dlply(tx, "city", function(df) lm(log10(sales) ~ factor(month), data = df)) coef2 <- ldply(models2, function(mod) { data.frame( month = 1:12, effect = c(0, coef(mod)[-1]), intercept = coef(mod)[1]) }) Wednesday, 28 October 2009
  • 30. qplot(date, log10(sales), data = tx, geom = "line", group = city) models2 <- dlply(tx, "city", function(df) lm(log10(sales) ~ factor(month), data = df)) coef2 <- ldply(models2, function(mod) { data.frame( month = 1:12, effect = c(0, coef(mod)[-1]), intercept = coef(mod)[1]) }) Put coefficients in rows, so they can be plotted more easily Wednesday, 28 October 2009
  • 31. 0.4 0.3 0.2 effect 0.1 0.0 −0.1 2 4 6 8 10 12 qplot(month, effect, data = coef2, group month = city, geom = "line") Wednesday, 28 October 2009
  • 32. 2.5 2.0 10^effect 1.5 1.0 2 4 6 8 10 12 month qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line") Wednesday, 28 October 2009
  • 33. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 2.5 2.0 1.5 1.0 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 2.5 2.0 1.5 1.0 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 2.5 2.0 1.5 1.0 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 10^effect 2.5 2.0 1.5 1.0 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 2.5 2.0 1.5 1.0 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 2.5 2.0 1.5 1.0 Victoria Waco Wichita Falls 2.5 2.0 1.5 1.0 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 month qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city) Wednesday, 28 October 2009
  • 34. What should we do next? What do you think? You have 30 seconds to come up with (at least) one idea. Wednesday, 28 October 2009
  • 35. My ideas Fit a single model, log(sales) ~ city * factor(month), and look at residuals Fit individual models, log(sales) ~ factor(month) + ns(date, 3), look cities that don’t fit Wednesday, 28 October 2009
  • 36. # One approach - fit a single model mod <- lm(log10(sales) ~ city + factor(month), data = tx) tx$sales2 <- 10 ^ resid(mod) qplot(date, sales2, data = tx, geom = "line", group = city) last_plot() + facet_wrap(~ city) Wednesday, 28 October 2009
  • 37. 3.5 3.0 2.5 sales2 2.0 1.5 1.0 0.5 2000 2002 2004 2006 2008 date qplot(date, sales2, data = tx, geom = "line", group = city) Wednesday, 28 October 2009
  • 38. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 3.5 3.0 2.5 2.0 1.5 1.0 0.5 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 3.5 sales2 3.0 2.5 2.0 1.5 1.0 0.5 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 3.5 3.0 2.5 2.0 1.5 1.0 0.5 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Victoria Waco Wichita Falls 3.5 3.0 2.5 2.0 1.5 1.0 0.5 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 date last_plot() + facet_wrap(~ city) Wednesday, 28 October 2009
  • 39. # Another approach: Essence of most cities is seasonal # term plus long term smooth trend. We could fit this # model to each city, and then look for models which don't # fit well. library(splines) models3 <- dlply(tx, "city", function(df) { lm(log10(sales) ~ factor(month) + ns(date, 3), data = df) }) # Extract rsquared from each model rsq <- function(mod) c(rsq = summary(mod)$r.squared) quality <- ldply(models3, rsq) Wednesday, 28 October 2009
  • 40. Wichita Falls ● Waco ● Victoria ● Tyler ● Texarkana ● Temple−Belton ● Sherman−Denison ● San Marcos ● San Antonio ● San Angelo ● Port Arthur ● Paris ● Palestine ● Odessa ● NE Tarrant County ● Nacogdoches ● Montgomery County ● Midland ● McAllen ● Lufkin ● Lubbock ● Longview−Marshall ● Laredo ● city Killeen−Fort Hood ● Irving ● Houston ● Harlingen ● Garland ● Galveston ● Fort Worth ● Fort Bend ● El Paso ● Denton County ● Dallas ● Corpus Christi ● Collin County ● Bryan−College Station ● Brownsville ● Brazoria County ● Beaumont ● Bay Area ● Austin ● Arlington ● Amarillo ● Abilene ● 0.5 0.6 0.7 0.8 0.9 rsq qplot(rsq, city, data = quality) Wednesday, 28 October 2009
  • 41. San Antonio ● Montgomery County ● Houston How are the good ● Dallas ● Bryan−College Station Collin County fits different from ● ● Denton County Austin the bad fits? ● ● Fort Bend ● Fort Worth ● Tyler ● NE Tarrant County ● Bay Area ● Corpus Christi ● Arlington ● Waco ● Temple−Belton ● Lubbock ● reorder(city, rsq) Garland ● Longview−Marshall ● Midland ● Laredo ● Harlingen ● Abilene ● Killeen−Fort Hood ● Brazoria County ● McAllen ● Brownsville ● Wichita Falls ● Sherman−Denison ● Irving ● Galveston ● Odessa ● San Marcos ● San Angelo ● Amarillo ● Nacogdoches ● Lufkin ● Victoria ● Beaumont ● Texarkana ● Paris ● Palestine ● El Paso ● Port Arthur ● 0.5 0.6 0.7 0.8 0.9 rsq qplot(rsq, reorder(city, rsq), data = quality) Wednesday, 28 October 2009
  • 42. quality$poor <- quality$rsq < 0.7 tx2 <- merge(tx, quality, by = "city") cities <- dlply(tx, "city") mfit <- mdply(cbind(mod = models3, df = cities), function(mod, df) { data.frame( city = df$city, date = df$date, resid = resid(mod), pred = predict(mod)) }) tx2 <- merge(tx2, mfit) Wednesday, 28 October 2009
  • 43. quality$poor <- quality$rsq < 0.7 tx2 <- merge(tx, quality, by = "city") Takes each row of input cities <- dlply(tx, to function and feeds it "city") mfit <- mdply(cbind(mod = models3, df = cities), function(mod, df) { data.frame( city = df$city, date = df$date, resid = resid(mod), pred = predict(mod)) }) tx2 <- merge(tx2, mfit) Wednesday, 28 October 2009
  • 44. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 3.5 3.0 2.5 2.0 1.5 1.0 0.5 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 3.5 3.0 2.5 2.0 1.5 1.0 0.5 log10(sales) Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 3.5 3.0 2.5 2.0 1.5 1.0 0.5 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Victoria Waco Wichita Falls 3.5 3.0 2.5 2.0 1.5 1.0 0.5 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 Raw data date Wednesday, 28 October 2009
  • 45. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 3.5 3.0 2.5 2.0 1.5 1.0 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 3.5 3.0 2.5 2.0 1.5 1.0 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 3.5 3.0 2.5 2.0 1.5 1.0 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 3.5 3.0 pred 2.5 2.0 1.5 1.0 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 3.5 3.0 2.5 2.0 1.5 1.0 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 3.5 3.0 2.5 2.0 1.5 1.0 Victoria Waco Wichita Falls 3.5 3.0 2.5 2.0 1.5 1.0 Predictions 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 date Wednesday, 28 October 2009
  • 46. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 0.4 0.2 0.0 −0.2 −0.4 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 0.4 0.2 0.0 −0.2 −0.4 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 0.4 0.2 0.0 −0.2 −0.4 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 0.4 0.2 resid 0.0 −0.2 −0.4 Montgomery County Nacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 0.4 0.2 0.0 −0.2 −0.4 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 0.4 0.2 0.0 −0.2 −0.4 Victoria Waco Wichita Falls 0.4 0.2 0.0 −0.2 −0.4 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 Residuals date Wednesday, 28 October 2009
  • 47. Your turn Pick one of the other variables and perform a similar exploration, working through the same steps. Wednesday, 28 October 2009
  • 48. Conclusions Simple (and relatively small) example, but shows how collections of models can be useful for gaining understanding. Each attempt illustrated something new about the data. Plyr made it easy to create and summarise collection of models, so we didn’t have to worry about the mechanics. Wednesday, 28 October 2009