1      Example of self-documenting data journalism notes
This is an example of using Sweave to combine code and output from the R statistical programming
environment with the LaTeX document processing environment to generate a self-documenting
script, in which the actual code used to do stats and generate statistical graphics is displayed
alongside the charts it directly produces.
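
As a rough illustration of what such a source file looks like, here is a minimal sketch of a Sweave (.Rnw) document. The section title and the R statements inside the chunk are made up for illustration and are not taken from these notes; only the chunk delimiters are standard Sweave syntax.

```rnw
\documentclass{article}
\begin{document}

\section{A tiny example}
Ordinary LaTeX text goes here.

<<echo=TRUE>>=
# Everything between the chunk header and the @ is R code.
# echo=TRUE prints the code as well as its output.
x <- c(1, 2, 3)
mean(x)
@

\end{document}
```

Assuming the file is saved under a hypothetical name such as notes.Rnw, running Sweave("notes.Rnw") in R should write notes.tex, which can then be processed with LaTeX in the usual way.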

1.1     Getting Started...
The aim is to try to replicate a graphic included by Ben Goldacre in his article DIY statistical
analysis: experience the thrill of touching real data (footnote 1).

>   # The << echo = T >>= identifies an R code region;
>   # echo=T means run the code, and print what happens when it's run
>   # In the code area, lines beginning with a # are comment lines and are not executed
>
>   #First, we need to load in the XML library that contains the scraper function
>   library(XML)
>   #Now we scrape the table
>   srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis'
>   cancerdata=data.frame(
+     readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) )
>
>   #The @ symbol on its own at the start of a line marks the end of a code block

   The format is simple: readHTMLTable(url, which=TABLENUMBER), where TABLENUMBER selects
the N’th table in the page. The header argument labels the columns (the data pulled in from
the HTML table itself contains all sorts of clutter).
   We can inspect the data we’ve imported as follows:

>   #Look at the whole table (the whole table is quite long,
>   # so don't display it; the command is commented out for now).
>   #cancerdata
>   #If you are using RStudio, you can inspect the data using the command: View(cancerdata)
>   #Look at the column headers
>   names(cancerdata)

[1] "Area"            "Rate"          "Population" "Number"

> #Look at the first 6 rows (head shows 6 rows by default)
> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7

> #Look at the last 6 rows (tail shows 6 rows by default)
> tail(cancerdata)

          Area  Rate Population Number
374  Wiltshire 18.69     727662    136
375  Sheffield  16.9     757396    128
376     Durham 17.29     786582    136
377      Leeds  17.3     959538    166
378   Cornwall 15.44    1062176    164
379 Birmingham 19.78    1268959    251

    1 http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis

> #What sort of datatype is in the Number column?
> class(cancerdata$Number)

[1] "factor"

   The last line, class(cancerdata$Number), identifies the data as type factor. In order to
do stats and plot graphs, we need the Number, Rate and Population columns to contain actual
numbers. (Factors organise data according to categories; when the table is loaded in, the data is
loaded in as strings of characters, so rather than seeing each number as a number, R identifies it
as a category.) The following code converts each column by looking up the level label for every
entry and turning that label into a number:

>   #Convert the numerical columns to a numeric datatype
>   cancerdata$Rate =
+     as.numeric(levels(cancerdata$Rate)[as.numeric(cancerdata$Rate)])
>   cancerdata$Population =
+     as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
>   cancerdata$Number =
+     as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
>   #Just check it worked
>   class(cancerdata$Number)

[1] "numeric"

> class(cancerdata$Rate)

[1] "numeric"

> class(cancerdata$Population)

[1] "numeric"

> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7
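
   To see why the conversion goes via levels(), here is a small sketch on a toy factor (the four values are copied from the Rate column printed above; the variable names are just for illustration). Calling as.numeric directly on a factor returns the internal category codes rather than the numbers we want.

```r
# Toy factor standing in for a column that was read in as text
x <- factor(c("19.15", "21.49", "17.05", "29.87"))

# as.numeric() on a factor gives the internal level codes, not the values
codes <- as.numeric(x)

# Looking the codes up in levels() recovers the labels, which then parse as numbers
values <- as.numeric(levels(x)[as.integer(x)])

# An equivalent, shorter idiom: convert to character first
values2 <- as.numeric(as.character(x))

print(codes)
print(values)
```

Both idioms should give the same result; the as.character route is usually easier to read, while the levels() lookup is the form used in the notes.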

   We can now plot the data as a simple scatterplot using the plot command (figure 1), or we
can add a title to the graph and tweak the axis labels (figure 2).
   The plot command is great for generating quick charts. If we want a bit more control over
the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn’t part of the standard R
bundle, so you’ll need to install the package yourself if you haven’t already done so. In RStudio,
find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its
dependencies.) You can see the sort of chart ggplot creates out of the box in figure 3.


> #Plot the Number of deaths by the Population
> plot(Number ~ Population, data=cancerdata)

   [Scatter plot of Number (y-axis, 0 to 250) against Population (x-axis, 0 to 1,200,000).]

                                           Figure 1: Vanilla scatter plot

> #Plot the Number of deaths by the Population.
> #Add in a title (main) and tweak the y-axis label (ylab).
> plot(Number ~ Population, data=cancerdata,
+      main='Bowel Cancer Occurrence by Population', ylab='Number of deaths')



   [Scatter plot titled "Bowel Cancer Occurrence by Population": Number of deaths (y-axis, 0 to 250)
   against Population (x-axis, 0 to 1,200,000).]

                                           Figure 2: Vanilla scatter plot

>   require(ggplot2)
>   #Plot the Number of deaths by the Population
>   p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number))
>   print(p)


   [ggplot2 scatter plot of Number (y-axis) against Population (x-axis).]

                                           Figure 3: A rather prettier plot

1.2    Generating the Funnel Plot
Doing a bit of searching for the “funnel plot” chart type used to display the data in Goldacre’s
article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated
to statistics-related Q&A: How to draw funnel plot using ggplot2 in R? (footnote 2)
    The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing
the code, with confidence limits set at the 95% and 99.9% levels. Note that I needed to do a couple
of things:

  1. work out what values to use where! I did this by looking at the ggplot code to see what
     was plotted. p was on the y-axis and should be used to present the death rate. The data
     provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the
     range 0..1. The x-axis is the population.
  2. change the range and width of samples used to create the curves
  3. change the y-axis range.

   You can see the result in the funnel plot that follows the code.
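
   The arithmetic behind those steps can be sketched on its own, without any plotting. The three rows below are copied from the head(cancerdata) output earlier; funnel_limits is a hypothetical helper name, and the reference rate p0 = 0.0002 is an illustrative value chosen by hand, not the weighted mean used in the real plot.

```r
# First three rows of the table, as printed by head(cancerdata)
toy <- data.frame(
  Area       = c("Shetland Islands", "Limavady", "Ballymoney"),
  Rate       = c(19.15, 21.49, 17.05),
  Population = c(31332, 32573, 35191),
  Number     = c(6, 7, 6)
)

# The rate is per 100,000, so dividing by 100,000 gives a proportion in 0..1
toy$p <- toy$Rate / 100000

# Sanity check: the proportion should match deaths / population
toy$p_check <- toy$Number / toy$Population

# Hypothetical helper: normal-approximation limits around a reference rate p0
# for a vector of population sizes n (z = 1.96 for 95%, 3.29 for 99.9%)
funnel_limits <- function(p0, n, z = 1.96) {
  se <- sqrt(p0 * (1 - p0) / n)
  data.frame(n = n, lower = p0 - z * se, upper = p0 + z * se)
}

# Illustrative reference rate; the limits narrow as the population grows
limits <- funnel_limits(p0 = 0.0002, n = c(10000, 100000, 1000000))

print(toy[, c("Area", "p", "p_check")])
print(limits)
```

The narrowing of the limits with increasing population is what gives the funnel its shape: small areas can show extreme rates by chance alone, so they need wider limits.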




   2 http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210

>   #TH: funnel plot code from:
>   #stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
>   #TH: Use our cancerdata
>   number=cancerdata$Population
>   #TH: The rate is given as a 'per 100,000' value, so normalise it
>   p=cancerdata$Rate/100000
>   p.se <- sqrt((p*(1-p)) / (number))
>   df <- data.frame(p, number, p.se, Area=cancerdata$Area)
>   ## common effect (fixed effect model)
>   p.fem <- weighted.mean(p, 1/p.se^2)
>   ## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
>   #TH: I'm going to alter the spacing of the samples used to generate the curves
>   number.seq <- seq(1000, max(number), 1000)
>   number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
>   ## draw plot
>   #TH: note that we need to tweak the limits of the y-axis
>   fp <- ggplot(aes(x = number, y = p), data = df) +
+   geom_point(shape = 1) +
+   geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
+   geom_hline(aes(yintercept = p.fem), data = dfCI) +
+   xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()
>   #Automatically set the maximum y-axis value to be just a bit larger than the max data value
>   fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))
>   #Label the outlier point
>   fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))
>   print(fp)




   [Funnel plot: Bowel cancer death rate (y-axis, 0.00000 to about 0.00030) against Population
   (x-axis, up to 1,200,000), with the outlier point labelled Glasgow City.]

Inspiring content - You Don't Need Big Data to Tell Good Data Stories
 
Lincoln jun14datajournalism
Lincoln jun14datajournalismLincoln jun14datajournalism
Lincoln jun14datajournalism
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
