Stat405: Visualising time & space


                            Hadley Wickham
Thursday, 14 October 2010
1. New data: baby names by state
                2. Visualise time (done!)
                3. Visualise time conditional on space
                4. Visualise space
                5. Visualise space conditional on time
                6. Aside: geographic data


Project
                    Project 2 due November 4.
                    Basically the same as project 1, but you will be
                    using the full play-by-play data from the
                    08/09 NBA season.
                    I expect to see lots of ddply usage, and
                    more advanced graphics (next week).



Baby names by state
                    Top 100 male and female baby
                    names for each state, 1960–2008.
                    480,000 records
                    (100 * 50 * 2 * 48)
                    Slightly different variables: state,
                    year, name, sex and number.

                                          CC BY http://www.flickr.com/photos/the_light_show/2586781132
Subset

                    Easier to compare states if we have
                    proportions. To calculate proportions, we
                    need the number of births. Could only find
                    birth data from 1981 onwards.
                    Selected 30 names that occurred fairly
                    frequently, and had interesting patterns.
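The proportion calculation described above can be sketched as follows. This is a minimal illustration on invented toy numbers, not the course data: the `births` table and its column names are assumptions, and only `state`, `year`, `name` and `number` come from the slide.

```r
# Toy counts of one name (invented numbers, for illustration only)
counts <- data.frame(
  state  = c("TX", "TX", "CA", "CA"),
  year   = c(1981, 1982, 1981, 1982),
  name   = "Juan",
  number = c(500, 550, 900, 880)
)

# Hypothetical table of total births per state and year
births <- data.frame(
  state  = c("TX", "CA", "TX", "CA"),
  year   = c(1981, 1981, 1982, 1982),
  births = c(250000, 450000, 220000, 400000)
)

# Attach total births to each name count, then divide
props <- merge(counts, births, by = c("state", "year"))
props$prop <- props$number / props$births
props
```

Dividing by births puts large and small states on the same scale, which is what makes the state-by-state comparisons in the rest of the lecture possible.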



Aaron Alex Allison Alyssa Angela Ashley
                    Carlos Chelsea Christian Eric Evan
                    Gabriel Jacob Jared Jennifer Jonathan
                    Juan Katherine Kelsey Kevin Matthew
                    Michelle Natalie Nicholas Noah Rebecca
                    Sara Sarah Taylor Thomas




Getting started

                library(ggplot2)
                library(plyr)

                bnames <- read.csv("interesting-names.csv",
                  stringsAsFactors = FALSE)

                matthew <- subset(bnames, name == "Matthew")




Time | Space
[Figure: line plot of prop against year (1985 to 2005) for Matthew, one line per state]

     qplot(year, prop, data = matthew,
       geom = "line", group = state)
[Figure: the same plot split into one panel per state (AK to WY); x axis year, y axis prop]

     last_plot() + facet_wrap(~ state)
Your turn

                    Ensure that you can re-create these plots
                    for other names. What do you see?
                    Can you write a function that plots the
                    trend for a given name?




     show_name <- function(name) {
       # Keep only the rows for the requested name
       df <- bnames[bnames$name == name, ]
       qplot(year, prop, data = df, geom = "line",
         group = state)
     }

     show_name("Jessica")
     show_name("Aaron")
     show_name("Juan") + facet_wrap(~ state)




[Figure: prop against year for Matthew, one line per state, with a single thick smooth curve overlaid]

     qplot(year, prop, data = matthew, geom = "line", group = state) +
       geom_smooth(aes(group = 1), se = F, size = 3)
[Same plot, annotated] So we only get one smooth for the whole dataset:
aes(group = 1) in geom_smooth() overrides the group = state set in qplot().
Three useful tools
                    Smoothing: can be easier to perceive
                    overall trend by smoothing individual
                    functions
                    Centering: remove differences in center
                    by subtracting mean
                    Scaling: remove differences in range by
                    dividing by sd, or by scaling to [0, 1]
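To make centering and scaling concrete, here is a small sketch on two invented series with different levels and ranges. The numbers are toy values, and `scale_sd` is a name chosen here only to avoid clashing with base R's `scale()`.

```r
# Two toy series with different levels and ranges (invented numbers)
a <- c(0.010, 0.020, 0.030)
b <- c(0.030, 0.032, 0.034)

center   <- function(x) x - mean(x, na.rm = TRUE)
scale_sd <- function(x) x / sd(x, na.rm = TRUE)
scale01  <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

center(a)    # both series now vary around 0
center(b)
scale_sd(a)  # both series now have standard deviation 1
scale_sd(b)
scale01(a)   # both series now run from 0 to 1
scale01(b)
```

After centering, only the shape and spread of each series differ. After scaling to [0, 1], these two series are the same up to floating-point rounding, which shows that they follow the same pattern at different magnitudes.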


     library(mgcv)

     # Fit a cubic regression spline to y against x with a fixed
     # smoothing parameter; larger amounts give smoother curves
     smooth <- function(y, x, amount = 0.1) {
       mod <- gam(y ~ s(x, bs = "cr"), sp = amount)
       as.numeric(predict(mod))
     }

     # Smooth each state's series at four levels of smoothing
     matthew <- ddply(matthew, "state", transform,
       prop_s1 = smooth(prop, year, amount = 0.01),
       prop_s2 = smooth(prop, year, amount = 0.1),
       prop_s3 = smooth(prop, year, amount = 1),
       prop_s4 = smooth(prop, year, amount = 10))

     qplot(year, prop_s1, data = matthew, geom = "line",
       group = state)
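The effect of `amount` can be checked without plotting. The sketch below applies the same `smooth()` function to a synthetic noisy series (invented here, not the baby names data) and measures how wiggly each fit is. The `roughness` helper is an assumption introduced for illustration.

```r
library(mgcv)

smooth <- function(y, x, amount = 0.1) {
  mod <- gam(y ~ s(x, bs = "cr"), sp = amount)
  as.numeric(predict(mod))
}

# Synthetic noisy trend, invented for illustration
set.seed(405)
year <- 1981:2008
prop <- 0.02 + 0.01 * sin((year - 1981) / 5) +
  rnorm(length(year), sd = 0.002)

# Roughness: sum of squared second differences of the fitted curve
roughness <- function(fit) sum(diff(fit, differences = 2)^2)

amounts <- c(0.01, 0.1, 1, 10)
rough <- sapply(amounts, function(a) {
  roughness(smooth(prop, year, amount = a))
})
data.frame(amount = amounts, roughness = rough)
```

Roughness should fall as `amount` grows, which is why `prop_s4` looks much flatter than `prop_s1`.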

     # Subtract the mean so every series varies around 0
     center <- function(x) x - mean(x, na.rm = TRUE)

     matthew <- ddply(matthew, "state", transform,
       prop_c = center(prop),
       prop_sc = center(prop_s1))

     qplot(year, prop_c, data = matthew, geom = "line",
       group = state)
     qplot(year, prop_sc, data = matthew, geom = "line",
       group = state)




     # Divide by the standard deviation (note: masks base::scale)
     scale <- function(x) x / sd(x, na.rm = TRUE)
     # Rescale so the minimum is 0 and the maximum is 1
     scale01 <- function(x) {
       rng <- range(x, na.rm = TRUE)
       (x - rng[1]) / (rng[2] - rng[1])
     }

     matthew <- ddply(matthew, "state", transform,
       prop_ss = scale01(prop_s1))

     qplot(year, prop_ss, data = matthew, geom = "line",
       group = state)



Your turn
                    Create a plot to show all names
                    simultaneously. Does smoothing every
                    name in every state make it easier to see
                    patterns?
                    Hint: run the R code on the next slide to
                    eliminate name/state combinations with 10
                    or fewer years of data.


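     # Keep name/state combinations with more than 10 rows (years);
     # the others return NULL, which ddply drops from the result
     longterm <- ddply(bnames, c("name", "state"),
       function(df) {
         if (nrow(df) > 10) {
           df
         }
       })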
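The same filter can be written in base R, which makes it easy to check on a small example. This is a sketch on invented toy data, using `ave()` in place of `ddply()`. The `toy`, `n_years` and `longterm_toy` names are illustrative only.

```r
# Toy data: "Juan" in TX has 12 years, "Noah" in TX only 3 (invented)
toy <- data.frame(
  name  = c(rep("Juan", 12), rep("Noah", 3)),
  state = "TX",
  year  = c(1981:1992, 1981:1983)
)

# Number of rows in each name/state group, repeated for every row
n_years <- ave(seq_len(nrow(toy)), toy$name, toy$state, FUN = length)

# Keep only the groups with more than 10 rows
longterm_toy <- toy[n_years > 10, ]
unique(longterm_toy$name)
```

Only the "Juan" rows survive. This matters for the smoothing step, because the spline fit needs enough distinct years in every group.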




qplot(year, prop, data = bnames, geom = "line",
       group = state, alpha = I(1 / 4)) +
       facet_wrap(~ name)

     longterm <- ddply(longterm, c("name", "state"),
       transform, prop_s = smooth(prop, year))

     qplot(year, prop_s, data = longterm, geom = "line",
       group = state, alpha = I(1 / 4)) +
       facet_wrap(~ name)
     last_plot() + facet_wrap(~ name, scales = "free_y")



Space

Spatial plots

                    Choropleth map:
                    map colour of areas to value.
                    Proportional symbol map:
                    map size of symbols to value




juan2000 <- subset(bnames, name == "Juan" &
       year == 2000)

     # Turn map data into normal data frame
     library(maps)
     states <- map_data("state")
     states$state <- state.abb[match(states$region,
       tolower(state.name))]

     # Join datasets
     choropleth <- join(states, juan2000, by = "state")

     # Plot with polygons
     qplot(long, lat, data = choropleth, geom = "polygon",
       fill = prop, group = group)
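The `match()` step that turns lower-case region names into two-letter codes can be tried on its own. The sketch below uses only base R's built-in `state.name` and `state.abb`. The fix for the District of Columbia is an assumption that the names data codes it as "DC".

```r
regions <- c("texas", "new york", "district of columbia")

# Look up each region in the lower-cased state names
abb <- state.abb[match(regions, tolower(state.name))]
abb
# [1] "TX" "NY" NA    (state.name holds the 50 states only)

# One possible fix, assuming the names data uses the code "DC"
abb[regions == "district of columbia"] <- "DC"
abb
```

Any region left as `NA` will not match a row in the names data, so it ends up with a missing `prop` after the join.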

[Figure: choropleth map of the US, lat against long, with each state filled by prop (legend 0.004 to 0.01)]

What's the problem with this map? How could we fix it?

ggplot(choropleth, aes(long, lat, group = group)) +
       geom_polygon(fill = "white", colour = "grey50") +
       geom_polygon(aes(fill = prop))




[Figure: the same choropleth, with every state first drawn in white with grey50 borders and the prop fill drawn on top]
Problems?

                    What are the problems with this sort of
                    plot?
                    Take one minute to brainstorm some
                    possible issues.




Problems
                    Big areas are the most striking. But in the US (as
                    in most countries) big areas tend to be the
                    least populated. The most populated areas
                    tend to be small and dense, e.g. the East
                    Coast.
                    (Another computational problem: need to
                    push around a lot of data to create these
                    plots)


[Figure: proportional symbol map, lat against long, one point per state with size and colour mapped to prop (legend 0.004 to 0.010)]
mid_range <- function(x) mean(range(x))
     centres <- ddply(states, c("state"), summarise,
       lat = mid_range(lat), long = mid_range(long))

     bubble <- join(juan2000, centres, by = "state")
     qplot(long, lat, data = bubble,
       size = prop, colour = prop)

     ggplot(bubble, aes(long, lat)) +
       geom_polygon(aes(group = group), data = states,
         fill = NA, colour = "grey50") +
       geom_point(aes(size = prop, colour = prop))
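A small example shows how `mid_range()` differs from a plain mean of the outline points. The coordinates are invented. The motivation given here, that outline vertices bunch up along intricate borders, is an assumption rather than something stated on the slide.

```r
mid_range <- function(x) mean(range(x))

# Toy outline: four vertices bunched near long = -100, one corner at -90
long <- c(-100, -99.9, -99.8, -99.7, -90)

mean(long)       # pulled towards the densely sampled edge
mid_range(long)  # middle of the bounding box
```

The mid-range depends only on the extremes, so it gives the centre of the state's bounding box however unevenly the border is sampled.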


Your turn


                    Replicate either a choropleth or a
                    proportional symbol map with the name
                    of your choice.




Space | Time

[Figure: plot not recoverable from the extracted text]
Your turn


                    Try and create this plot yourself. What is
                    the main difference between this plot and
                    the previous?




juan <- subset(bnames, name == "Juan")
     bubble <- merge(juan, centres, by = "state")

     ggplot(bubble, aes(long, lat)) +
       geom_polygon(aes(group = group), data = states,
         fill = NA, colour = "grey80") +
       geom_point(aes(colour = prop)) +
       facet_wrap(~ year)




Aside: geographic data

                    Boundaries for most countries available
                    from: http://gadm.org
                    To use with ggplot2, use the fortify
                    function to convert to usual data frame.
                    Will also need to install the sp package.



# install.packages("sp")

     library(sp)
     load(url("http://gadm.org/data/rda/CHE_adm1.RData"))

     head(as.data.frame(gadm))
     ch <- fortify(gadm, region = "ID_1")
     str(ch)

     qplot(long, lat, group = group, data = ch,
       geom = "polygon", colour = I("white"))



This work is licensed under the Creative Commons
Attribution-Noncommercial 3.0 United States License.
To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc/3.0/us/
or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.




Thursday, 14 October 2010

Mais conteúdo relacionado

Mais de Hadley Wickham (20)

23 data-structures
23 data-structures23 data-structures
23 data-structures
 
R packages
R packagesR packages
R packages
 
22 spam
22 spam22 spam
22 spam
 
21 spam
21 spam21 spam
21 spam
 
19 tables
19 tables19 tables
19 tables
 
17 polishing
17 polishing17 polishing
17 polishing
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
08 functions
08 functions08 functions
08 functions
 
07 problem-solving
07 problem-solving07 problem-solving
07 problem-solving
 
06 data
06 data06 data
06 data
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
04 reports
04 reports04 reports
04 reports
 
03 extensions
03 extensions03 extensions
03 extensions
 

Último

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 

Último (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

15 time-space

  • 1. Stat405Visualising time & space Hadley Wickham Thursday, 14 October 2010
  • 2. 1. New data: baby names by state 2. Visualise time (done!) 3. Visualise time conditional on space 4. Visualise space 5. Visualise space conditional on time 6. Aside: geographic data Thursday, 14 October 2010
  • 3. Project Project 2 due November 4. Basically same as project 2, but will be using the full play-by-play data from the 08/09 NBA season. I expect to see lots of ddply usage, and more advanced graphics (next week). Thursday, 14 October 2010
  • 4. Baby names by state Top 100 male and female baby names for each state, 1960–2008. 480,000 records (100 * 50 * 2 * 48) Slightly different variables: state, year, name, sex and number. CC BY http://www.flickr.com/photos/the_light_show/2586781132 Thursday, 14 October 2010
  • 5. Subset Easier to compare states if we have proportions. To calculate proportions, need births. Could only find data from 1981. Selected 30 names that occurred fairly frequently, and had interesting patterns. Thursday, 14 October 2010
  • 6. Aaron Alex Allison Alyssa Angela Ashley Carlos Chelsea Christian Eric Evan Gabriel Jacob Jared Jennifer Jonathan Juan Katherine Kelsey Kevin Matthew Michelle Natalie Nicholas Noah Rebecca Sara Sarah Taylor Thomas Thursday, 14 October 2010
  • 7. Getting started library(ggplot2) library(plyr) bnames <- read.csv("interesting-names.csv", stringsAsFactors = F) matthew <- subset(bnames, name == "Matthew") Thursday, 14 October 2010
  • 8. Time | Space Thursday, 14 October 2010
  • 9. 0.04 0.03 prop 0.02 0.01 1985 1990 1995 2000 2005 year Thursday, 14 October 2010
  • 10. 0.04 0.03 prop 0.02 0.01 1985 1990 1995 2000 2005 qplot(year, prop, data = matthew,year geom = "line", group = state) Thursday, 14 October 2010
  • 11. AK AL AR AZ CA CO CT DC 0.04 0.03 0.02 0.01 DE FL GA HI IA ID IL IN 0.04 0.03 0.02 0.01 KS KY LA MA MD ME MI MN 0.04 0.03 0.02 0.01 MO MS MT NC NE NH NJ NM 0.04 prop 0.03 0.02 0.01 NV NY OH OK OR PA RI SC 0.04 0.03 0.02 0.01 SD TN TX UT VA VT WA WI 0.04 0.03 0.02 0.01 WV WY 0.04 0.03 0.02 0.01 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 year Thursday, 14 October 2010
  • 12. AK AL AR AZ CA CO CT DC 0.04 0.03 0.02 0.01 DE FL GA HI IA ID IL IN 0.04 0.03 0.02 0.01 KS KY LA MA MD ME MI MN 0.04 0.03 0.02 0.01 MO MS MT NC NE NH NJ NM 0.04 prop 0.03 0.02 0.01 NV NY OH OK OR PA RI SC 0.04 0.03 0.02 0.01 SD TN TX UT VA VT WA WI 0.04 0.03 0.02 0.01 WV WY 0.04 0.03 0.02 0.01 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 last_plot() + facet_wrap(~ state)year Thursday, 14 October 2010
  • 13. Your turn Ensure that you can re-create these plots for other names. What do you see? Can you write a function that plots the trend for a given name? Thursday, 14 October 2010
  • 14. show_name <- function(name) { name <- bnames[bnames$name == name, ] qplot(year, prop, data = name, geom = "line", group = state) } show_name("Jessica") show_name("Aaron") show_name("Juan") + facet_wrap(~ state) Thursday, 14 October 2010
  • 15. 0.04 0.03 prop 0.02 0.01 1985 1990 1995 2000 2005 year Thursday, 14 October 2010
  • 16. 0.04 0.03 prop 0.02 0.01 qplot(year, prop, data = matthew, geom1995 "line", 2000 1985 1990 = group = state) + 2005 geom_smooth(aes(group = 1), se year size = 3) = F, Thursday, 14 October 2010
  • 17. 0.04 0.03 prop 0.02 0.01 So we only get one smooth for the whole dataset qplot(year, prop, data = matthew, geom1995 "line", 2000 1985 1990 = group = state) + 2005 geom_smooth(aes(group = 1), se year size = 3) = F, Thursday, 14 October 2010
  • 18. Three useful tools Smoothing: can be easier to perceive overall trend by smoothing individual functions Centering: remove differences in center by subtracting mean Scaling: remove differences in range by dividing by sd, or by scaling to [0, 1] Thursday, 14 October 2010
  • 19. library(mgcv) smooth <- function(y, x, amount = 0.1) { mod <- gam(y ~ s(x, bs = "cr"), sp = amount) as.numeric(predict(mod)) } matthew <- ddply(matthew, "state", transform, prop_s1 = smooth(prop, year, amount = 0.01), prop_s2 = smooth(prop, year, amount = 0.1), prop_s3 = smooth(prop, year, amount = 1), prop_s4 = smooth(prop, year, amount = 10)) qplot(year, prop_s1, data = matthew, geom = "line", group = state) Thursday, 14 October 2010
  • 20.
      # Subtract the mean so each state's series is centred on zero
      center <- function(x) x - mean(x, na.rm = T)
      matthew <- ddply(matthew, "state", transform,
        prop_c = center(prop),
        prop_sc = center(prop_s1))
      qplot(year, prop_c, data = matthew, geom = "line", group = state)
      qplot(year, prop_sc, data = matthew, geom = "line", group = state)
  • 21.
      # Divide by the standard deviation
      # (note: this definition masks base::scale() for the rest of the session)
      scale <- function(x) x / sd(x, na.rm = T)
      # Rescale so the minimum maps to 0 and the maximum to 1
      scale01 <- function(x) {
        rng <- range(x, na.rm = T)
        (x - rng[1]) / (rng[2] - rng[1])
      }
      matthew <- ddply(matthew, "state", transform,
        prop_ss = scale01(prop_s1))
      qplot(year, prop_ss, data = matthew, geom = "line", group = state)
  • 22. Your turn Create a plot to show all names simultaneously. Does smoothing every name in every state make it easier to see patterns? Hint: run the R code on the next slide to eliminate name and state combinations with 10 or fewer years of data.
  • 23.
      # Keep only name/state combinations with more than 10 rows (years);
      # groups for which the function returns NULL are dropped
      longterm <- ddply(bnames, c("name", "state"), function(df) {
        if (nrow(df) > 10) {
          df
        }
      })
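The same kind of group-size filter can be sketched in base R without plyr. This is an alternative formulation on made-up data, not the approach used on the slide; the data frame `toy` and the threshold of 2 are assumptions chosen to keep the example small.

```r
# Made-up data: name "A" has three years in TX, name "B" has only one
toy <- data.frame(
  name  = c(rep("A", 3), rep("B", 1)),
  state = "TX",
  year  = c(2000, 2001, 2002, 2000)
)

# ave() returns, for every row, the size of its (name, state) group
n <- ave(seq_len(nrow(toy)), toy$name, toy$state, FUN = length)

# Keep only rows that belong to groups with more than 2 rows
kept <- toy[n > 2, ]
print(kept)
```

With the real data the threshold would be 10, matching `nrow(df) > 10` in the slide's code.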
  • 24.
      qplot(year, prop, data = bnames, geom = "line", group = state,
        alpha = I(1 / 4)) + facet_wrap(~ name)
      longterm <- ddply(longterm, c("name", "state"), transform,
        prop_s = smooth(prop, year))
      qplot(year, prop_s, data = longterm, geom = "line", group = state,
        alpha = I(1 / 4)) + facet_wrap(~ name)
      last_plot() + facet_wrap(~ name, scales = "free_y")
  • 26. Spatial plots Choropleth map: map colour of areas to value. Proportional symbol map: map size of symbols to value
  • 27.
      juan2000 <- subset(bnames, name == "Juan" & year == 2000)
      # Turn map data into normal data frame
      # (map_data() is from ggplot2 and needs the maps package)
      library(maps)
      states <- map_data("state")
      # Convert lower-case region names to two-letter state abbreviations
      states$state <- state.abb[match(states$region, tolower(state.name))]
      # Join datasets (join() is from plyr)
      choropleth <- join(states, juan2000, by = "state")
      # Plot with polygons
      qplot(long, lat, data = choropleth, geom = "polygon",
        fill = prop, group = group)
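The lookup from region names to abbreviations relies on the built-in `state.name` and `state.abb` vectors, which are in the same order. A minimal sketch of the idiom, using a hand-written `regions` vector (an assumption standing in for `states$region`):

```r
# Lower-case region names, as a map data frame might supply them
regions <- c("texas", "new york", "district of columbia")

# match() finds each region's position in the lower-cased state names;
# indexing state.abb by that position gives the abbreviation
abbr <- state.abb[match(regions, tolower(state.name))]
print(abbr)
```

`state.name` holds only the 50 states, so a region such as "district of columbia" has no match and becomes `NA`. Rows with an `NA` state will not pick up any values in the later join.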
  • 28. [Figure: choropleth map of prop by state; axes long and lat.]
  • 29. What’s the problem with this map? How could we fix it? [Figure: choropleth map of prop by state; axes long and lat.]
  • 30.
      ggplot(choropleth, aes(long, lat, group = group)) +
        # Base layer: every state drawn in white with a grey outline
        geom_polygon(fill = "white", colour = "grey50") +
        # Top layer: fill by prop
        geom_polygon(aes(fill = prop))
  • 31. [Figure: choropleth map of prop by state; axes long and lat.]
  • 32. Problems? What are the problems with this sort of plot? Take one minute to brainstorm some possible issues.
  • 33. Problems Big areas are the most striking. But in the US (as in most countries) big areas tend to be the least populated. The most populated areas tend to be small and dense, e.g. the East coast. (Another, computational, problem: need to push around a lot of data to create these plots.)
  • 34. [Figure: proportional symbol map, one point per state showing prop; axes long and lat.]
  • 35.
      # Midpoint of the range: the centre of each state's bounding box
      mid_range <- function(x) mean(range(x))
      centres <- ddply(states, c("state"), summarise,
        lat = mid_range(lat), long = mid_range(long))
      bubble <- join(juan2000, centres, by = "state")
      qplot(long, lat, data = bubble, size = prop, colour = prop)
      # Same points, drawn over state outlines
      ggplot(bubble, aes(long, lat)) +
        geom_polygon(aes(group = group), data = states,
          fill = NA, colour = "grey50") +
        geom_point(aes(size = prop, colour = prop))
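A plausible reason to use the midpoint of the range rather than the mean is that boundary points are not evenly spread: a detailed stretch of border contributes many vertices and pulls the mean towards itself. The sketch below shows the difference on made-up longitudes; the vector `long` is an assumption, not real map data.

```r
mid_range <- function(x) mean(range(x))

# Made-up boundary longitudes: three vertices clustered in the west,
# one in the east
long <- c(-100, -99, -98, -80)

centre_mid  <- mid_range(long)  # midpoint of the extremes: -90
centre_mean <- mean(long)       # pulled towards the cluster: -94.25

print(c(centre_mid, centre_mean))
```

The midpoint depends only on the two extremes, so it gives the centre of the bounding box however the vertices are distributed along the boundary.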
  • 36. Your turn Replicate either a choropleth or a proportional symbol map with the name of your choice.
  • 37. Space | Time
  • 39. Your turn Try and create this plot yourself. What is the main difference between this plot and the previous?
  • 40.
      juan <- subset(bnames, name == "Juan")
      bubble <- merge(juan, centres, by = "state")
      ggplot(bubble, aes(long, lat)) +
        geom_polygon(aes(group = group), data = states,
          fill = NA, colour = "grey80") +
        geom_point(aes(colour = prop)) +
        facet_wrap(~ year)
  • 41. Aside: geographic data Boundaries for most countries available from: http://gadm.org To use with ggplot2, use the fortify function to convert to usual data frame. Will also need to install the sp package.
  • 42.
      # install.packages("sp")
      library(sp)
      load(url("http://gadm.org/data/rda/CHE_adm1.RData"))
      head(as.data.frame(gadm))
      ch <- fortify(gadm, region = "ID_1")
      str(ch)
      qplot(long, lat, group = group, data = ch, geom = "polygon",
        colour = I("white"))
  • 44. This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.