This document discusses the history and importance of data visualization. It provides examples of how the software DataDesk can be used to interactively explore and analyze Irish road accident data. Key points include: (1) DataDesk allows slicing and dicing of data through linked views, revealing patterns like regional differences in accident distributions. (2) Rotating plots can show relationships that emerge from entire datasets, like distinguishing clusters in accident profiles. (3) Interactive visualization brings statistics to life and aids decision making.
Visualising Road Traffic Accident Data with DataDesk
1. A Brief History of Data Visualisation
The power and importance of effective visualisation have long been recognised.
William Playfair (1759-1823), the founder of statistical graphics, contrasted his new graphical
method with the tabular presentation of data as follows:
‘Information, that is imperfectly acquired, is generally imperfectly retained; and a man who
has carefully inspected a printed table, finds when done, that he has only a very faint and
partial idea of what he has read’ (1)
This view has been echoed over the intervening 200 years. For example, Florence
Nightingale (1820-1910) recognised the power of data visualisation as an effective aid for
communicating issues of concern to a wide audience, particularly the impact of poor
sanitation on mortality rates during the Crimean war. This is summarised in her statement of
the power of graphics ‘to affect thro the eyes what we may fail to convey to the brains of the
public through their word-proof ears’.
Graphical innovations were relatively absent in the first half of the 20th century
but renewed interest in visualisation followed the publication in 1962 of a paper entitled ‘The
Future of Data Analysis’ (2) by American statistician John W. Tukey. This paper was regarded
as a landmark in data visualisation. Tukey suggested that we examine our data as a
detective would examine the scene of a crime - not with a hypothesis - ‘I’ll bet the butler did
it’, but with an open mind and as few assumptions as possible.
This approach was a radical departure from conventional data analysis (and research
programmes in general) which tended to be based on the scientific principles of formulating a
hypothesis, collecting appropriate data and finally using some test statistic to decide on the
validity of the hypothesis. Tukey believed that by letting the data speak to us ‘we can learn the
truths hidden beneath the random fluctuations, errors and general confusion seen in real
data’.
The publication in 1967 of Jacques Bertin's ‘Semiologie Graphique’ (3) was also an
important milestone in the development of data visualisation. In his foreword to the English
version of this text published in 1983 Howard Wainer states that the text ‘is the most
important work on graphics since the publication of William Playfair's Atlas. While William
Playfair illustrated good graphic practice over 200 years previously he did not explain why the
specific structures of his graphic forms and formats work’.
Cyril Connolly, IADT
The development of a variety of highly specialised and well-developed interactive computer
systems during the 1970s allowed data to be analysed in a dynamic, iterative and visual
manner. One of the early systems was PRIM-9 (4), developed at the Stanford Linear
Accelerator Center. PRIM stood for Projection, Rotation, Isolation and Masking, and the system
allowed the exploration of multidimensional data in up to nine dimensions. It ran on an IBM
system, required a few million dollars' worth of computer and display hardware (the
display unit alone cost $400,000) and cost several hundred dollars an hour to use.
Later developments in hardware and software allowed PRIM technology to become
generally available on desktop computers. The innovative Apple Macintosh hardware
and software, first produced during the mid-1980s, led the way in these developments
with applications like MacSpin (5) and DataDesk (6). These changes in computer
systems have, as William Cleveland states in his text Visualising Data (7), ‘changed how
we carry out visualisation but not its goals’.
2. Data Visualisation using DataDesk
DataDesk was originally developed on the Apple Macintosh platform by Apple
research fellow Paul Velleman during the latter part of the 1980s and subsequently
became available on the Windows platform. The principal feature of DataDesk, in
contrast to other mainstream data analysis applications, is the ability to interact with
multiple linked views of a dataset, so that, for example, selecting a subset of cases
in one view highlights them in all other views. This ability to ‘slice and dice’ data
using dynamic and interactive tools brings statistics to life, generating interest in it and
an appreciation of its importance in the decision making process. Some examples
of the use of DataDesk to explore Irish road accident data are shown below.
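The linked-view idea described above can be modelled in a few lines of Python. This is only a rough sketch of the concept, not DataDesk's implementation: the class names LinkedSelection and BarChartView, and the five sample cases, are hypothetical and introduced purely for illustration.

```python
class LinkedSelection:
    """Shared selection of case ids that notifies every registered view."""

    def __init__(self):
        self._views = []
        self.selected = set()

    def register(self, view):
        self._views.append(view)

    def select(self, case_ids):
        # Changing the selection refreshes all linked views at once.
        self.selected = set(case_ids)
        for view in self._views:
            view.refresh(self.selected)


class BarChartView:
    """Toy 'view' that counts the selected cases per value of one variable."""

    def __init__(self, name, values):
        self.name = name
        self.values = values  # one value per case, indexed by case id
        self.counts = {}

    def refresh(self, selected):
        counts = {}
        for case_id in selected:
            key = self.values[case_id]
            counts[key] = counts.get(key, 0) + 1
        self.counts = counts


# Hypothetical data: five accidents, weekday coded Sunday = 1 ... Saturday = 7.
weekday = [1, 7, 3, 7, 1]
month = [6, 7, 12, 7, 1]

selection = LinkedSelection()
weekday_view = BarChartView("weekday", weekday)
month_view = BarChartView("month", month)
selection.register(weekday_view)
selection.register(month_view)

# Selecting cases 1, 3 and 4 in one place updates both bar charts.
selection.select([1, 3, 4])
print(weekday_view.counts, month_view.counts)
```

The design choice here is that views never talk to each other directly; they only react to the shared selection, which is one simple way to keep any number of views in step.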
i) Regional Variation of Road Accidents
The knife tool ‘slices’ over the east coast of Ireland’s accident scatterplot map in Figure
1. The two bar charts to the right of this plot illustrate the daily (Sunday = 1, Saturday
= 7) and monthly (January = 1, December = 12) distribution of accidents. From the
plot, the distribution of accidents along the east coast appears to be fairly constant
by both weekday and month.
Figure 1: Spatial distribution of east and west coast accidents
If the knife is moved to the west coast, as shown in Figure 1, the bar charts update
automatically and the distribution of accidents by weekday and month reveals a different
pattern from the east coast. Accidents by weekday are lowest during midweek and
highest at the weekends, while accidents by month are highest during the summer
months and lowest during the winter months.
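Outside an interactive tool, the same regional comparison can be approximated by filtering on a coordinate band and tabulating the selected cases. The sketch below assumes each accident record carries a longitude; the records, the longitude bands and the function names are hypothetical and do not come from the Irish accident dataset.

```python
from collections import Counter

# Hypothetical records: (longitude, weekday, month).
# Weekday is coded Sunday = 1 ... Saturday = 7, month January = 1 ... December = 12.
accidents = [
    (-6.3, 2, 3),
    (-6.2, 5, 11),
    (-6.1, 7, 6),
    (-9.1, 7, 7),
    (-9.4, 1, 8),
    (-8.9, 1, 7),
]


def slice_by_longitude(records, west, east):
    """Return the records whose longitude lies within [west, east]."""
    return [r for r in records if west <= r[0] <= east]


def distributions(records):
    """Count the selected accidents by weekday and by month."""
    weekday_counts = Counter(r[1] for r in records)
    month_counts = Counter(r[2] for r in records)
    return weekday_counts, month_counts


# Assumed bands: roughly the east coast and the west coast.
east_weekday, east_month = distributions(slice_by_longitude(accidents, -6.5, -6.0))
west_weekday, west_month = distributions(slice_by_longitude(accidents, -10.0, -8.5))

print("east by weekday:", dict(east_weekday))
print("west by weekday:", dict(west_weekday))
```

Moving the knife corresponds to changing the west and east bounds and recomputing the two counts.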
ii) The Influence of Daylight Variation on Pedestrian Road Accidents
Figure 2 illustrates the number of pedestrians killed in Ireland by month between 2000 and
2006. The plot suggests a U-shaped profile, with fatalities higher in the winter months and lower in
the summer months.
Figure 2: Monthly distribution of fatal pedestrian road accidents
To investigate this pattern in more detail, a plot of the number of pedestrian fatalities by hour is
generated. Browsing the hourly bar chart with the knife tool, it becomes clear that the U
shape is explained by fatalities between 16:00 and 21:00 hours, as shown in Figure 3.
Figure 3: Monthly distribution of fatal accidents between 16:00 and 21:00 (left) and excluding the
hours 16:00-21:00 (right)
This is further illustrated by examining the distribution of accidents excluding the hours 16:00
to 21:00 as shown in Figure 3. The monthly bar chart now shows no evidence of a seasonal
profile. The seasonal U profile of fatal pedestrian accidents during these hours is explained
by the variation in the amount of daylight during these hours throughout the year
(8).
For the winter months of December and January there is virtually no daylight during these
hours and the corresponding number of fatal accidents is highest. For the summer months
of June and July there is almost complete daylight between 4pm and 10pm and the number
of pedestrian accidents is lowest.
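The comparison behind Figure 3 amounts to splitting the fatalities by time of day and counting each part by month. A minimal sketch follows, assuming each fatality is recorded as a (month, hour) pair and treating the window as 16:00 up to but not including 21:00. The sample records are invented for illustration and are not the 2000-2006 data.

```python
from collections import Counter


def monthly_counts(fatalities, start_hour=16, end_hour=21, inside=True):
    """Count fatalities per month inside (or outside) the hour window.

    The window is half-open, [start_hour, end_hour); that boundary choice
    is an assumption made for this sketch.
    """
    counts = Counter()
    for month, hour in fatalities:
        in_window = start_hour <= hour < end_hour
        if in_window == inside:
            counts[month] += 1
    return counts


# Hypothetical (month, hour) records; month January = 1 ... December = 12.
fatalities = [(12, 17), (12, 18), (1, 17), (6, 17), (6, 10), (12, 10), (1, 3)]

evening = monthly_counts(fatalities, inside=True)
rest_of_day = monthly_counts(fatalities, inside=False)

print("16:00-21:00:", dict(evening))
print("other hours:", dict(rest_of_day))
```

Plotting the two results as monthly bar charts side by side gives the kind of comparison shown in Figure 3.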
iii) Accident Profiling using Rotating Plots
The French cartographer Jacques Bertin stated in his groundbreaking text Graphics and
Graphic Information Processing (9) that ‘it is not sufficient to have data, to have statistics, in
order to arrive at a decision. Items of data do not supply the information necessary for
decision making. What must be seen are the relationships which emerge from consideration
of the entire set of data’.
This statement is illustrated in the examination of the age distribution of the driver, front and
rear seat passengers coded as ageDr, ageFP and ageRP, respectively. If we are restricted
to working in what Edward Tufte (10) refers to as two-dimensional Flatland we would
generate three scatterplots which would examine the relationship between driver and front
seat passenger, driver and rear seat passenger and front seat and rear seat passenger as
shown in Figure 4.
Figure 4: Scatterplots of the age of driver vs age of front passenger (left), age of driver vs age of rear
passenger (centre) and age of front seat passenger vs age of rear seat passenger (right)
While these plots illustrate the presence of up to three clusters it is through the use of a
rotating plot that we can see the overall relationships emerging from consideration of the
entire set of data as shown in Figure 5. After spending a short time rotating the data a star
shape becomes evident with each arm corresponding to a distinctive cluster. Investigating
the profile of each cluster is easy with DataDesk. By capturing each cluster with a lasso tool
and dynamically linking it with the variables hour, primcoltype, ageDr, ageFP,
ageRP, genderDr, genderFP and genderRP, the profile of that segment can be
readily determined.
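The reason rotation reveals structure that a single scatterplot hides can be shown with a small calculation. The sketch below is not DataDesk's rotating plot. It assumes the three ages are mapped to the axes as x = ageDr, y = ageFP and z = ageRP, rotates the points about the vertical axis and projects them orthographically onto the screen. The two sample cases are hypothetical.

```python
import math


def rotate_about_y(points, angle_deg):
    """Rotate (x, y, z) points about the vertical (y) axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a + z * sin_a, y, -x * sin_a + z * cos_a) for x, y, z in points]


def project_to_screen(points):
    """Orthographic projection: keep x and y, discard the depth z."""
    return [(x, y) for x, y, _ in points]


# Hypothetical cases as (ageDr, ageFP, ageRP): same driver and front
# passenger ages, but very different rear passenger ages.
points = [(20, 20, 20), (20, 20, 60)]

flat = project_to_screen(rotate_about_y(points, 0))
turned = project_to_screen(rotate_about_y(points, 90))

print(flat)    # both cases fall on the same screen position
print(turned)  # after a quarter turn the two cases separate horizontally
```

At zero degrees the projection is simply the ageDr vs ageFP scatterplot, so the two cases overlap. Turning the cloud brings the third variable into view and pulls them apart.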
For example, in Figure 5 the centre cluster is selected. The linked variables suggest that this
profile comprises young vehicle occupants, with a substantial number of accidents in the
early hours of the morning and a high proportion of primcoltype code 2 values, which
correspond to single vehicle accidents. In addition, the drivers are primarily male,
with an excess of male over female passengers. In summary, this accident profile is
explained by young male drivers with passengers of a similar age who are involved primarily
in single vehicle accidents. The principal causal factor associated with this profile is alcohol
and/or excessive speed.
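Once a cluster has been selected, its profile is essentially a set of summary statistics over the linked variables. The sketch below uses the variable names hour, primcoltype, ageDr and genderDr from the text, but the records, the "M"/"F" gender coding, the use of code 1 for other collision types and the 00:00-06:00 definition of "early morning" are all assumptions made for illustration.

```python
from statistics import mean


def profile(cases):
    """Summarise a selected cluster of accident records."""
    n = len(cases)
    return {
        "n": n,
        "mean_ageDr": mean(c["ageDr"] for c in cases),
        # primcoltype code 2 corresponds to single vehicle accidents.
        "share_single_vehicle": sum(c["primcoltype"] == 2 for c in cases) / n,
        "share_male_driver": sum(c["genderDr"] == "M" for c in cases) / n,
        # Assumed definition of early morning: 00:00 up to 06:00.
        "share_early_morning": sum(0 <= c["hour"] < 6 for c in cases) / n,
    }


# Hypothetical cluster of four accidents.
cluster = [
    {"hour": 2, "primcoltype": 2, "ageDr": 19, "genderDr": "M"},
    {"hour": 3, "primcoltype": 2, "ageDr": 21, "genderDr": "M"},
    {"hour": 14, "primcoltype": 1, "ageDr": 20, "genderDr": "F"},
    {"hour": 1, "primcoltype": 2, "ageDr": 20, "genderDr": "M"},
]

summary = profile(cluster)
print(summary)
```

Comparing such summaries across clusters is a non-interactive counterpart to lassoing each arm of the star and reading off the linked bar charts.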
Figure 5: Centre of star cluster with dynamically linked variables hour, type of collision, age and
gender of vehicle occupants
In contrast, selecting the southern arm of the star in Figure 6 we see a considerably different
profile. The early morning surge is absent as is the dominance of code 2 primcoltype. The
driver and front seat passengers are of a similar but older age profile with a considerably
younger rear seat passenger. The drivers are primarily male, the front seat passengers are
primarily female while the distribution of male and female rear seat passengers is virtually the
same. It is clear that this profile represents accidents involving parents with a young child in
the rear seat.
The ability to slice, brush and rotate data allows the analyst to discover hidden patterns
and relationships while also providing a framework for explaining more theoretical
concepts, including the use of multivariate analysis techniques.
Figure 6: Southern arm of star cluster with dynamically linked variables hour, type of collision, age and
gender of vehicle occupants
In summary, data visualisation is described by the American psychologist and statistician
Michael Friendly as ‘an approach to data analysis that focuses on insightful graphical display.
The word ‘insightful’ suggests that the goal is (we hope) to reveal some aspects of the data
that might not be perceived, appreciated or absorbed by other means’ (11).
References
[1] Playfair, William, Commercial and Political Atlas, London, 1786, pp xiii-xiv. Reprinted as Playfair’s
Commercial and Political Atlas and Statistical Breviary, edited and introduced by Howard Wainer and
Ian Spence, 2005, Cambridge University Press.
[2] Tukey, J. W., 1962, The future of data analysis, Annals of Mathematical Statistics,
33: 1-67, 812.
[3] Bertin, J., Semiologie Graphique, 1967, Paris: Editions Gauthier-Villars. English translation by W.J.
Berg as Semiology of Graphics, Madison, WI: University of Wisconsin Press, 1983 (reprinted in
October 2010 by ESRI Press).
[4] Fisherkeller, M.A., Friedman, J.H., and Tukey, J.W., 1975, PRIM-9: an interactive multidimensional
data display analysis system, Data: Its Use, Organisation and Management, 140-145. New York: The
Association for Computing Machinery.
[5] Donoho, A.W., Donoho, D.L., and Gasko, M, 1988, MacSpin: Dynamic Graphics on a desktop
computer. In W.S Cleveland and M.E. McGill, eds., Dynamic Graphics for Statistics. Belmont, CA:
Wadsworth, pp 331-351.
[6] Velleman, P.F., 1988, Data Desk. Ithaca, New York: Data Descriptions Inc.
[7] Cleveland, W.S, Visualising data, 1993, Hobart Press, page 2.
[8] Pedestrian Accidents in Ireland, Great Britain and Northern Ireland, 1998, National Roads Authority,
Dublin.
[9] Bertin, J., La Graphique et le Traitement Graphique de l’Information, 1977, Paris: Flammarion. English
translation by W.J. Berg and P. Scott as Graphics and Graphic Information Processing, 1981, Berlin:
Walter de Gruyter & Co.
[10] Tufte, E.R, 1990, Envisioning information, Graphics Press. pp 12-30.
[11] Friendly, M., 2001, Visualizing Categorical Data, SAS Institute Inc., Cary, NC, USA.