Statistics for Geography and Environmental Science: an introductory lecture course
1. Statistics for Geography and
Environmental Science:
an introductory lecture course
By Richard Harris, with material
by Claire Jarvis
USA: http://amzn.to/rNBWd5
UK: http://amzn.to/tZ7fVu
5. The modules
Module 1 makes the case for knowing
about statistics as a transferable skill
and for being equipped for social and
political debate.
Module 2 is about using descriptive
statistics and simple graphical
techniques to explore and make
sense of data.
Module 3 discusses the Normal
curve, the properties of which
provide the basis for inferential
statistics.
6. The modules
Module 4 is about the principles of
research design and effective data
collection.
Module 5 moves from describing
samples of data to drawing
inferences about the wider
population.
Module 6 discusses the role of
hypothesis testing.
Module 7 is about regression
analysis.
7. The modules
Module 8 moves to modelling point
patterns, 'hotspot analysis' and ways
of measuring patterns of spatial
autocorrelation in data.
Module 9 looks at spatial regression
models, geographically weighted
regression and multilevel modelling.
Each module is explored more fully
in the accompanying
textbook, Statistics for Geography
and Environmental Science.
8. Module 1
(Extracts from Chapter 1 of Statistics for Geography
and Environmental Science)
DATA, STATISTICS AND
GEOGRAPHY
9. Module overview
To convince you that studying
statistics is a good idea!
Our argument is that data collection
and analysis are central to the
functioning of contemporary society
so knowledge of quantitative
methods is a necessary skill to
contribute to social and scientific
debate.
10. About statistics
Statistics are a reflective practice: a
way of approaching research that
requires a clear and manageable
research question to be formulated, a
means to answer that question,
knowledge of the assumptions of
each test used, an understanding of
the consequences of violating those
assumptions, and awareness of the
researcher's own prejudices when
doing the research.
11. Some reasons to study statistics
Reasons for human geographers
– Data collection and analysis are central
to the functioning of society, to systems
of governance and science.
– Knowledge of statistics is an entry into
debate, informed critique and the
possibility of creating change.
12. Some reasons to study statistics
Reasons for GI scientists
– To address the uncertainties and
ambiguities of using data analytically.
– Because of the increased integration of
mapping capabilities, data visualizations
and (geo-) statistical analysis.
13. Some reasons to study statistics
Reasons for all students
– They provide a transferable skill set
used in other areas of research, study
and employment.
– There is a recognised shortage of
students with skills in quantitative
methods, especially within the social
sciences.
14. Types of statistic
Descriptive
– Used to provide a summary of a set of
measurements, e.g. the average.
Inferential
– Use the data at hand to convey information
about the population ('the greater
something') from which the data are drawn.
Relational
– Consider whether greater or lesser values
in one set of data are related to greater or
lesser values in another.
15. Geographical data
These are records of what has
happened at some location on the
Earth's surface and where.
For many statistical tests the where
is largely ignored.
However, it is central to geostatistics
and to spatial statistics (as their
names suggest).
16. Some problems when analysing
geographical data
Standard statistical tests assume that
each 'bit' of data (each observation)
has a value that is not influenced by
any other.
However, we may often expect there
to be geographical patterns in the
data.
– Spatial autocorrelation: geographical
patterns in the measurements
17. Some problems when analysing
geographical data
Determining what causes what in a
complex and dynamic natural or
social system is extremely tricky.
Two things may be associated (e.g.
greater income inequality and more
non-recycled waste) without the one
directly causing the other.
18. Some problems when analysing
geographical data
Data and structured forms of enquiry
can only tell us so much and may not
be appropriate to some types of
research for which a more
qualitative, participatory or less
representational approach may be
better.
19. Further reading
Chapter 1 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: types of statistics;
why error is unavoidable;
geographical data analysis; and
spatial autocorrelation and the first
law of geography.
20. Module 2
(Extracts from Chapter 2 of Statistics for Geography
and Environmental Science)
DESCRIPTIVE STATISTICS
21. Module overview
This module is about "everyday
statistics", the sort that summarise
data and describe them in simple
ways.
They include the number of home
runs this season, average male
earnings, numbers unemployed,
outside temperature, average cost of
a barrel of oil, regional variations in
crime rates, pollution statistics,
measures of the economy and other
"facts and figures".
22. Data and variables
Data
– A collection of observations:
measurements made of something.
A variable
– Another name for a collection of data.
Variable because it is unlikely that the
data are all the same.
Data types
– These include
discrete, continuous, and categorical
data.
23. Simple ways of presenting data
Discrete data
– Frequency table
– Bar chart (below)
Continuous data
– Summary table
– Histogram (below, with a rug plot)
25. Information to include
in a summary table
Measures of central tendency
("averages")
– The mean and/or median
• The "centre" of the data
Measures of spread and variation
– The range (minimum to maximum)
– The interquartile range (the
'mid-spread' of the data)
– The standard deviation, s
26. More about the standard deviation
Essentially a measure of average
variation around the mean.
It is also the square root of the
variance.
The variance is the sum of squares
divided by the degrees of freedom.
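As a quick illustration, the sketch below (Python, with a small made-up data set rather than one from the book) computes the variance from the sum of squares and the degrees of freedom, and the standard deviation as its square root:

```python
import math
import statistics

# A small, made-up set of measurements
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(data) / len(data)
# Sum of squares: the squared deviations from the mean
sum_of_squares = sum((x - mean) ** 2 for x in data)
# Variance: the sum of squares divided by the degrees of freedom (n - 1)
variance = sum_of_squares / (len(data) - 1)
# Standard deviation: the square root of the variance
sd = math.sqrt(variance)

# Matches the library implementation
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(sd, statistics.stdev(data))
```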
27. Boxplots
Are useful for
showing the
median,
interquartile
range and range
of a set of data,
for identifying
outliers and also
for comparing
variables.
28. Other ways of classifying numeric
data
Nominal, ordinal, interval and ratio
Counts and rates
Proportions and percentages
Parametric and non-parametric
Arithmetic and geometric
Primary and secondary
29. Further reading
Chapter 2 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: data and variables; discrete
and continuous data; the range;
histograms, rug plots, and stem and
leaf plots; measures of central
tendency; why averages can be
misleading; quantiles; the sum of
squares; degrees of freedom; the
standard deviation and the variance;
box plots; and five and six number
summaries
30. Module 3
(Extracts from Chapter 3 of Statistics for Geography
and Environmental Science)
THE NORMAL CURVE
31. Module overview
This module introduces the normal
curve, so called because it describes
how many social and scientific data
appear to be distributed.
32. The normal curve
It is also known as
the Gaussian
distribution and is
often described as
'bell-shaped'.
It is a family of
distributions all of
which have the
same probability
density function
(the same formula
defining their
shape).
33. The central limit theorem
The central limit theorem states that
the sum (and therefore average) of a
large number of independent and
identically distributed random
variables will approach a normal
distribution as the sample size
increases, even if the variables are
not themselves normally distributed.
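The theorem can be demonstrated by simulation. The Python sketch below (a made-up example, not one from the book) averages many small samples of a decidedly non-normal variable; the means form a new variable that is approximately normal, centred on the population mean:

```python
import random
import statistics

random.seed(42)

# Each observation is uniform on [0, 1): not normal at all
# (population mean 0.5, population variance 1/12)
sample_means = [
    statistics.mean(random.random() for _ in range(50))
    for _ in range(5000)
]

# The 5000 means cluster around 0.5, with a spread close to
# sqrt(1/12) / sqrt(50), roughly 0.04
print(round(statistics.mean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though the underlying observations are flat (uniform).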
34. Properties of a normal curve
Ranges from
negative to positive
infinity
Is symmetrical
around its mean
95% of the area
under the curve is
within 1.96
standard
deviations of the
mean
99% of the area is
within 2.58
standard
deviations.
35. Properties of a normal curve
Consequently, if a
data set is
approximately
Normal, the
probability of
selecting, at random,
an observation at
that is within 1.96
standard deviations
of the mean is p =
0.95, and the
probability it will be
within 2.58 standard
deviations is p
=0.99.
36. Standardising data (z values)
Data are
standardised if their
original
measurement units
are replaced with
units of standard
deviation from the
mean (z values).
It is a little like
converting a
proportion (0 to 1) to
a percentage (0 to
100): it doesn't
change the shape of
the data.
37. Standardising data (z values)
The z values are calculated by
subtracting the mean of the data from
each observation and then dividing by
the standard deviation.
Once data are standardised and
assuming they are approximately
normal then they can be compared
against the Standard Normal curve.
This is a special instance of a normal
curve that has a mean of zero and a
standard deviation of one.
It provides a model or benchmark for
the data.
38. Probability and the Standard
Normal
The area between two z values
(under the Standard Normal) is the
probability of selecting an
observation randomly from the data
that will have a z value between
those two values.
That area can be determined using a
statistical table or equivalent.
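In place of a statistical table, the area can be computed with the error function from Python's standard library. A minimal sketch (the data mean and standard deviation here are made up for illustration):

```python
import math

def standard_normal_cdf(z):
    # Area under the Standard Normal curve to the left of z
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_between(z_low, z_high):
    # Probability of a z value falling between z_low and z_high
    return standard_normal_cdf(z_high) - standard_normal_cdf(z_low)

# Standardising: made-up data with mean 20 and standard deviation 4
mean, sd = 20.0, 4.0
x = 27.0
z = (x - mean) / sd  # 1.75 standard deviations above the mean

# The properties quoted earlier for the normal curve
print(round(area_between(-1.96, 1.96), 2))  # → 0.95
print(round(area_between(-2.58, 2.58), 2))  # → 0.99
```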
39. Probability and the Standard
Normal
See the worked
examples on pp.
62–70 of
Statistics for
Geography and
Environmental
Science
40. Some data are skewed but can
often be transformed to
approximate normality
41. The quantile plot
Useful (and better
than a histogram)
to check for non-
normality, such as
skew and the
presence of
outliers.
If the data were
normal they‘d be
distributed along
the straight line.
42. Further reading
Chapter 3 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: properties of normal curves;
the central limit theorem; probability
and the normal curve; finding the area
under the normal curve; skewed data
and the 'ladder of transformation';
moments of a distribution; and the
quantile plot.
43. Module 4
(Extracts from Chapter 4 of Statistics for Geography
and Environmental Science)
SAMPLING
44. Module overview
It is rarely possible or necessary to
collect all possible data about
something that is being studied.
This module is about how to go
about collecting a sample of data that
is fit for a particular research task.
45. Sampling
It is common in geographical and
other research to gather a sample
(or subset) of data from a target
population.
The aim is for the sample to be
representative of that population.
Sampling bias occurs when the
sample favours some parts of the
target population more than
others, perhaps by sampling at an
unrepresentative time or place or
because of the data collection method.
46. The process of sampling
Define the research question
Review the related literature
Review the scope of the planned
study
Construct a sample frame
Select a sample design method
Review the design from
practical, ethical, safety and logistical
perspectives
Implement the design and collect the
data.
47. Sampling methods
Non-probabilistic sampling methods
– Judgemental, convenience, quota and
snowball sampling
Probabilistic sampling methods
– Simple random, systematic, clustered
random and stratified random sampling
48. Sampling methods
The different methods are outlined on pp.
94-105 of Statistics for Geography and
Environmental Science.
In general, random sampling methods are
preferred because the errors in the data
should be random too.
However, a random sample won't
necessarily offer a wide enough coverage
of the target population.
Therefore stratified samples may be used
which may themselves target
specific, representative places to reduce
the cost and ease the logistics of the data
collection.
49. Sampling error and sample size
The impression that is formed of the target
population depends on the sample of data
taken to represent it.
It is possible that a random sample
accidentally misrepresents the population
if it happens only to observe its most
unusual occurrences: it is susceptible to
sampling error.
The larger the sample (the more
observations there are) the smaller the
error is expected to be, but with
'diminishing returns'
– the error is generally inversely proportional
to the square root of the sample size
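This relationship can be checked by simulation. The Python sketch below (a hypothetical normal population with mean 100 and standard deviation 15, not an example from the book) estimates the sampling error for two sample sizes; quadrupling n roughly halves the error:

```python
import random
import statistics

random.seed(1)

def sampling_error(n, trials=2000):
    # Standard deviation of the sample mean over repeated samples
    # drawn from a hypothetical normal population (mean 100, sd 15)
    means = [statistics.mean(random.gauss(100, 15) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

# Quadrupling the sample size roughly halves the error
print(round(sampling_error(25), 1))   # close to 15 / sqrt(25)  = 3.0
print(round(sampling_error(100), 1))  # close to 15 / sqrt(100) = 1.5
```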
50. Sampling error and sample size
The error is also a function of how
much the target population varies
– If it were exactly the same, everywhere, it
wouldn't matter where the samples were
taken
A larger sample is costly and more time
consuming to collect.
However, a small sample of a highly
variable population is unlikely to
generate any statistically meaningful
analytical results.
51. Sampling methods: issues and
practicalities
Personal safety, gaining permission
from an ethics committee, what to do
about missing data.
Practical considerations
– Weight and/or volume of the
sample, import/export
restrictions, analytical costs
Instrument accuracy and scale
Bottom line: if your sample is no
good, your analysis won‘t be any
good either.
52. Further reading
Chapter 4 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: the target population;
representative samples; sampling
frames; sampling bias; metadata;
fitness for purpose and use; sample
design; sampling error and sample size;
sample size and replicates; and
measurement accuracy.
53. Module 5
(Extracts from Chapter 5 of Statistics for Geography
and Environmental Science)
FROM DESCRIPTION TO
INFERENCE
54. Module overview
This module is about inference.
Inference is at the heart of how and
why statistics developed.
It moves beyond simply summarising
data (the sample) to using those
summaries to gain insights into the
underlying system, process or
structure that the data are
measurements of (the population).
55. A population
Its meaning isn't restricted to "everyone
who lives in a particular place" but can
be much more abstract.
– "Every possible object (or entity) from which
the sample is selected."
– "The complete set of all possible
measurements that might hypothetically be
recorded."
Informally: the complete 'thing' that you
are interested in studying but which can't
be measured in its entirety.
57. The sample mean and
the population mean
Assuming the sample is
representative (unbiased), it is
possible to estimate the true mean of
the population from the mean for the
sample.
– The population mean from the sample
mean
But that estimate is sample
dependent
– Change the sample and you get a
different estimate
58. Confidence intervals
It is improbable that the sample
mean is exactly equal to the
population mean
– And we wouldn't know even if it was.
• (unless we sampled the population in its
entirety, in which case there'd be no need to
make an estimate!)
However, we can place a confidence
interval around the sample mean and
estimate the probability that the
confidence interval contains the
population mean.
60. The width of a confidence interval
The confidence interval is wider
– The greater the probability you want it to
contain the unknown population mean
– The more variable the data are (the
greater their variance / standard
deviation)
– The less data you have
61. The standard error (of the mean)
The standard deviation of the data
divided by the square root of the
number of observations gives an
estimate of the standard error (of
the mean) and is a measure of
uncertainty in the data
– The greater the standard error, the
greater the uncertainty
62. Why confidence intervals 'work'
In principle, if a
population were
sampled a very large
number of times, the
sample mean
calculated in each
case and those means
then collected together
to form a new
variable, we'd find that
variable to be normally
distributed, centred on
the population mean
and with a standard
deviation equal to the
standard error of the
mean.
63. Small samples
For small samples the
confidence interval will
be underestimated if it is
calculated with reference
to a normal distribution.
A t-distribution is used
instead.
This is 'fatter' than the
normal.
Intuitively: we are more
cautious with small
samples that contain little
information. The
confidence intervals are
widened to reflect that
caution.
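Putting the pieces together, a t-based confidence interval can be sketched as follows (Python; the sample is invented, and the critical t value of 2.262 for 9 degrees of freedom is taken from a standard t table):

```python
import math
import statistics

# A small made-up sample (n = 10)
sample = [12.1, 9.8, 11.4, 10.3, 12.7, 9.5, 11.0, 10.8, 11.9, 10.5]

n = len(sample)
mean = statistics.mean(sample)
# Standard error of the mean: sample sd over the square root of n
se = statistics.stdev(sample) / math.sqrt(n)

# With n - 1 = 9 degrees of freedom the two-tailed 95% t value is
# 2.262 (from a t table), wider than the normal curve's 1.96
t_crit = 2.262
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se
print(round(ci_low, 2), round(ci_high, 2))  # → 10.26 11.74
```

Note how each of the three factors from the earlier slide (confidence level, variability, sample size) enters the calculation: the t value, the standard deviation and the square root of n.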
64. Summary
Mean of the sample – Known
Standard deviation of the sample – Known
Standard deviation of the population – Unknown, but approximated
by the standard deviation of the sample
Standard error of the mean – Estimated as the standard deviation
of the sample divided by the square root of the sample size
Mean of the population – Unknown, but we can estimate the
probability that it has a value that lies within a given number of
standard errors either side of the sample mean (within a given
confidence interval)
65. Further reading
Chapter 5 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: inference, samples
and populations; the distribution of
the sample means; standard error of
the mean; confidence intervals; and
the t-distribution and confidence
intervals for 'small samples'.
66. Module 6
(Extracts from Chapter 6 of Statistics for Geography
and Environmental Science)
HYPOTHESIS TESTING
67. Module overview
This module introduces hypothesis
testing as a way of formally
questioning whether a population
mean could plausibly be equal to a
hypothesised value, and to consider
whether two or more samples of data
were most probably drawn from the
same population.
68. The process of hypothesis testing
Define the null hypothesis
Define the alternative hypothesis
Specify an alpha value
– The maximum probability of rejecting the
null hypothesis when it is, in fact, correct.
Calculate the test statistic
Compare the test statistic with a critical
value
Reject the null hypothesis if the test
statistic has greater magnitude than the
critical value
69. One-sample t test
The one-sample t test measures the
number of standard errors the sample
mean is from a hypothesised value.
The further it is, the less probable it is
that the sample was drawn from a
population with a mean equal to that
hypothesised value.
The p value records that probability.
A p value of 0.05 or less means we can
be (at least) "95% confident" that the
"true mean" (the population mean) for
whatever has been measured is not the
hypothesised value.
A p value of 0.01 or less gives 99%
confidence.
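A worked sketch of the test (Python; the pH readings and the hypothesised mean of 7.0 are invented for illustration, and the critical value of 2.365 for 7 degrees of freedom comes from a standard t table):

```python
import math
import statistics

# Made-up pH measurements; null hypothesis: the population mean is 7.0
sample = [6.8, 7.3, 6.9, 7.1, 6.6, 6.7, 7.0, 6.8]
hypothesised = 7.0

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)

# The t statistic: how many standard errors the sample mean
# lies from the hypothesised value
t = (mean - hypothesised) / se
print(round(t, 2))  # → -1.25

# Compare |t| with the critical value for n - 1 = 7 degrees of
# freedom at alpha = 0.05, two-tailed (2.365, from a t table)
print(abs(t) > 2.365)  # → False: do not reject the null hypothesis
```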
70. Two-sample t test
Considers the probability that two samples
of data do not have the same population
mean.
– If they don't, it suggests the samples measure
categorically different things.
It works by measuring the difference
between the sample means relative to the
variance of the samples.
There are different versions of the t
test, for example for paired data and for
whether the two samples have
approximately equal variance or not.
An F test is used to compare the sample
variances and see if any difference could
be due to chance.
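The equal-variance version of the test can be computed by hand. In the Python sketch below both samples are invented; the F ratio of the variances is shown alongside the pooled-variance t statistic:

```python
import math
import statistics

# Two made-up samples, e.g. nitrate readings at two stream sites
a = [3.1, 2.8, 3.4, 3.0, 2.9, 3.3]
b = [2.5, 2.7, 2.4, 2.8, 2.6, 2.2]

n1, n2 = len(a), len(b)
m1, m2 = statistics.mean(a), statistics.mean(b)
v1, v2 = statistics.variance(a), statistics.variance(b)

# F ratio of the sample variances (larger over smaller): a value
# near 1 is consistent with approximately equal variances
f = max(v1, v2) / min(v1, v2)
print(round(f, 2))  # → 1.15

# Equal-variance t statistic using a pooled variance
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
print(round(t, 2))  # → 4.25
```

The difference between the sample means is large relative to their pooled variability, which is what a big t statistic records.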
71. Analysis of variance (ANOVA)
Used to test whether
three or more groups
of data have the
same population
mean.
Considers the
variations between
groups relative to the
variation within
groups.
Contrasts can be
used to specifically
contrast one or more
of the groups with
one or more of the
others.
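The between-versus-within logic can be sketched directly (Python; the three groups of data are invented):

```python
import statistics

# Three made-up groups, e.g. soil moisture under three land uses
groups = [
    [12.0, 14.0, 13.0, 15.0],
    [16.0, 18.0, 17.0, 17.0],
    [11.0, 12.0, 11.0, 13.0],
]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n

# Variation between the groups ...
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups)
# ... relative to the variation within them
ss_within = sum((x - statistics.mean(g)) ** 2
                for g in groups for x in g)

f = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f, 1))  # → 26.4: the group means clearly differ
```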
72. Two- and one-tailed tests
A two-tailed test is
non-directional
whereas a one-tailed
test is directional.
Consider a one-
sample t test
– The alternative
hypothesis for a two-
tailed test is only that
the population mean
is not equal to the
hypothesised value.
– A one-tailed test
specifies which is the
greater.
73. Non-parametric tests
Non-parametric tests do not begin
with fixed assumptions about how
the data and the population are
distributed
– E.g. a normal distribution
However, if the assumptions are
met, it is better to use the
parametric test.
Parametric tests and their
non-parametric equivalents:
– Two-sample t test: Wilcoxon rank sum
test (aka the Mann-Whitney test)
– ANOVA: Kruskal-Wallis test
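To see what "rank-based" means in practice, the sketch below (Python, invented data) computes the rank sum that the Wilcoxon / Mann-Whitney test is built on:

```python
# Made-up data for two samples (same idea as a two-sample comparison)
a = [3.1, 2.8, 3.4, 3.0, 2.9, 3.3]
b = [2.5, 2.7, 2.4, 2.8, 2.6, 2.2]

# Pool and sort all the observations, then rank them
pooled = sorted(a + b)

def rank(x):
    # Rank 1 = smallest; tied values share the average of their ranks
    positions = [i + 1 for i, v in enumerate(pooled) if v == x]
    return sum(positions) / len(positions)

# The rank sum for the first sample is the basis of the test statistic
w = sum(rank(x) for x in a)
print(w)  # → 56.5
```

Because only the ranks are used, the test makes no assumption about the shape of the underlying distribution.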
75. Power
We worry about limiting the probability of
rejecting the null hypothesis when it is
correct (of making a wrong decision)
– Of setting a low alpha value
But we could avoid the error by never
rejecting the null hypothesis.
Except, that's daft because the null
hypothesis could be wrong.
So, we also need to think about the
probability of rejecting the null
hypothesis when it is indeed wrong
– The probability of making this, the right
decision, is the power of the test.
76. Further reading
Chapter 6 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: type I errors; the one-
sample t test; hypothesis testing; two-
and one-tailed tests; type II errors and
statistical power;
homoscedasticity, heteroscedasticity
and the F test; analysis of variance;
measuring effects; and parametric and
non-parametric tests.
77. Module 7
(Extracts from Chapter 7 of Statistics for Geography
and Environmental Science)
RELATIONSHIPS AND
EXPLANATIONS
78. Module overview
This module looks at relational
statistics, exploring whether higher
values in one variable are associated
with higher values in another (a
positive relationship) or whether
higher values in the one are
associated with lesser values in the
other (a negative relationship).
It also looks at trying to explain the
variation found in one variable using
others.
79. Scatter plots
Scatter plots are
an effective way
of seeing if there
is any relationship
between two
variables, whether
it is a straight line
relationship, and
to help detect
errors in the data.
80. A positive relationship is when the line
of best fit is upwards sloping.
A negative relationship is when it is
downwards sloping.
The X variable (horizontal axis) is the
independent variable.
The Y variable (vertical axis) is the
dependent variable.
It is assumed that the X variable leads
to, possibly even causes, the Y
variable.
82. Uses of regression
To summarise data
To make predictions
To explain what causes what
83. Bivariate regression
Bivariate
regression finds a
line of best fit to
summarise the
relationship
between two
variables.
That line can be
used to make
predictions for
what the Y value
would be for a
given value for X.
It is a line of best
fit, rarely perfect fit.
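A line of best fit can be computed directly from the data. The Python sketch below uses invented distance and price values; the gradient is the covariance of X and Y divided by the variance of X:

```python
import statistics

# Made-up data: distance from the city centre (x, km) against
# average house price (y, £1000s)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [250.0, 228.0, 210.0, 195.0, 178.0, 160.0]

mx, my = statistics.mean(x), statistics.mean(y)
# Gradient of the least-squares line: the covariance of x and y
# divided by the variance of x
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx  # intercept: the line passes through (mx, my)
print(round(b1, 1), round(b0, 1))  # → -17.6 265.0

# The fitted line predicts y for a given value of x
predict = lambda xi: b0 + b1 * xi
print(round(predict(3.5), 1))  # → 203.5
```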
84. Regression tables
The strength of the effect of the X variable
on the Y is measured by the gradient of
the line of best fit
– It measures whether a change in X will lead to
a change in Y and by how much.
We have greater confidence that the effect
is genuine and not a chance property of
the sample the better the line fits the data
(i.e. the less the residual variation around
it).
Regression tables report various measures
and diagnostics including the measured
gradient of the line, the residual error, the
probability the gradient could actually be
zero (no relationship) and goodness-of-fit
measures.
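Two of those measures, the R-squared goodness-of-fit statistic and the residual standard error, can be computed by hand. The Python sketch below uses invented data together with the gradient and intercept of a least-squares line fitted to them:

```python
import statistics

# Made-up data and the least-squares line fitted to them
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [250.0, 228.0, 210.0, 195.0, 178.0, 160.0]
b1, b0 = -17.5714, 265.0  # gradient and intercept of the fitted line

# Residuals: the vertical distances from the points to the line
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Goodness of fit (R squared): the share of the variation in y
# that the line explains
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((yi - statistics.mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # → 0.997

# Residual standard error (two parameters estimated, so n - 2 df)
rse = (ss_res / (len(x) - 2)) ** 0.5
print(round(rse, 2))  # → 2.02
```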
85. Assumptions of regression
analysis
There are various
types of regression
analysis but the most
common, Ordinary
Least Squares
regression, assumes
that the two variables
are linearly related (or
could be transformed
to be so) and that the
residual errors are
random with no
unexplained patterns.
Visual checks can
easily be made.
Watch out for
leverage points,
extreme outliers and
other departures from
these assumptions.
87. Multiple regression
When two or more X variables are
used to explain the Y variable.
In addition to the usual checks (of
linearity and of random errors) we need
to check also for multicollinearity.
It is often helpful to standardise the
variables so their effects can be
compared.
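A simple multicollinearity check is the correlation between predictors, which is easy to read once the variables are standardised. A Python sketch with two invented predictors:

```python
import statistics

# Two made-up explanatory variables for the same set of areas
x1 = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]   # e.g. unemployment rate (%)
x2 = [1.0, 2.1, 2.4, 3.6, 4.1, 5.0]    # e.g. low-income households (%)

def standardise(v):
    # Replace the original units with z values
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(x - m) / s for x in v]

z1, z2 = standardise(x1), standardise(x2)

# Pearson correlation between the standardised predictors; values
# near +1 or -1 are a warning of multicollinearity
r = sum(p * q for p, q in zip(z1, z2)) / (len(z1) - 1)
print(round(r, 2))  # → 1.0: these two predictors overlap heavily
```

Including both predictors in the same model would make the individual gradients unstable and hard to interpret.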
88. A strategy for multiple regression
Crawley (2005; 2007) describes the aim
of statistical modelling as finding a
minimal adequate model. The process
involves going from a ―maximal model‖
containing all the variables of interest to
a simpler model that fits the data almost
as well by deleting the least significant
variables one at a time (and checking
the impact on the model at each stage
of doing so). As part of the
process, consideration also needs to be
given to outliers and to other checks
that the regression assumptions are
being met.
89. Further reading
Chapter 7 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: scatter plots; independent
and dependent variables; Pearson's
correlation coefficient; the equation of a
straight line; residuals; bivariate
regression; outliers and leverage
points; multiple regression; goodness-
of-fit measures; assumptions of OLS
regression; and Occam's Razor and the
minimal adequate model.
90. Module 8
(Extracts from Chapter 8 of Statistics for Geography
and Environmental Science)
DETECTING & MANAGING
SPATIAL DEPENDENCY
93. The ecological fallacy
In a general sense
– Means that statistical relationships found
at one scale may not apply at another scale
– E.g. the same negative correlation (r)
weakens as the zones become smaller
and more numerous:
Scale    n      r
Region   9      -0.95
LA       376    -0.77
Ward     8868   -0.55
A more specific meaning
– When inappropriate assumptions are
made about individuals from using
grouped data
94. Spatial autocorrelation
Standard statistics assume the
observations / errors are
independent of each other
But spatial data tend to be more
similar in value at nearby locations
than those further away
– This is positive spatial autocorrelation
Negative spatial autocorrelation is
when nearby measurements are
'opposite' to each other
96. Detecting spatial autocorrelation
The semi-variogram
is used
to explore a data
set visually and to
estimate how far
you need to move
away from a
particular data
point before data
points at that
distance can be
considered
unassociated with
the first.
97. Other measures of global
autocorrelation
Moran's I
Getis‘ G statistic
Geary‘s C
Joint counts method
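As an indication of how such a measure works, the sketch below computes Moran's I by hand for five invented values arranged along a chain of neighbouring regions (a deliberately simple weights structure):

```python
# Five made-up values along a chain of regions, where each region
# neighbours the one(s) immediately beside it
values = [10.0, 12.0, 11.0, 30.0, 28.0]
n = len(values)
m = sum(values) / n

# Binary contiguity weights: w[i][j] = 1 if regions i and j touch
w = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
s0 = sum(map(sum, w))  # sum of all the weights

# Moran's I: cross-products of neighbouring deviations from the
# mean, scaled by the overall variation in the data
num = sum(w[i][j] * (values[i] - m) * (values[j] - m)
          for i in range(n) for j in range(n))
den = sum((x - m) ** 2 for x in values)
morans_i = (n / s0) * (num / den)
print(round(morans_i, 2))  # → 0.4: positive spatial autocorrelation
```

A positive value records that neighbouring regions tend to have similar values (the low values cluster at one end of the chain, the high values at the other).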
98. Global vs local measures
A global measure of spatial autocorrelation
gives a single summary measure of the
patterns of association for the whole study
region.
This can conceal more localised patterns
within the region.
Global measures can often be 'broken
down' into local measures where the
patterns of association are measured and
compared for sub-regions
– E.g. Local Moran's I, Local Getis G.
Can be used to identify 'hotspots' and 'cold
spots' of something (e.g. crime)
100. Further reading
Chapter 8 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: spatial
autocorrelation; the MAUP; the
ecological fallacy; semi-variance;
semi-variogram; common structures
used to model the semi-variogram;
and hotspots.
101. Module 9
(Extracts from Chapter 9 of Statistics for Geography
and Environmental Science)
EXPLORING SPATIAL
RELATIONSHIPS
102. Module overview
This module is about treating where
something happens as useful
information that may help explain
what is happening. The central idea
is that when we find geographical
patterns in data, and there is
evidence to suggest they did not
arise by chance, it is
better to explore and model the
cause of the patterns than to treat
them as an inconvenience.
103. Spatial regression
The spatial error model and the
spatially lagged y model are
examples of spatial regression
models that allow for and measure
the interdependencies between
neighbouring or proximate data.
Neighbourhoods are defined by a
weights matrix indicating, for
example, if places share a boundary.
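A weights matrix can be made concrete with a small sketch (Python; five invented regions arranged in a chain). Row-standardising the matrix turns each region's neighbours into an average, giving the spatially lagged variable used by the lagged-y model:

```python
# Five made-up regions in a chain; w[i][j] = 1 if i and j touch
values = [10.0, 12.0, 11.0, 30.0, 28.0]
n = len(values)
w = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

# Row-standardise: each row sums to one, so multiplying by the
# matrix averages over each region's neighbours
w = [[wij / sum(row) for wij in row] for row in w]

# The spatially lagged variable: each region's value replaced by
# the mean of its neighbours' values
lag = [sum(w[i][j] * values[j] for j in range(n)) for i in range(n)]
print(lag)  # → [12.0, 10.5, 21.0, 19.5, 30.0]
```

In practice the weights would come from shared boundaries or distances between real places; the chain here just keeps the matrix easy to read.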
107. Multilevel modelling
Multilevel modelling can be used to
model at multiple scales simultaneously
and to explore how individual
behaviours and characteristics are
shaped by the places in which they live
or by the organisations they attend.
Because multilevel models can
consider people in places they are
sometimes used to generate evidence
of a neighbourhood effect.
Also useful for longitudinal analysis
(analysis over time).
108. Geography, computation and
statistics
The development of spatial analysis
has been made possible by
advances in computation.
But techniques like GWR are
characterised by repeat fitting and
remain demanding computationally.
There is increasing integration
between geographical information
science, computer science and
statistics.
109. Further reading
Chapter 9 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: cartograms; spatial
analysis; weights matrices; spatial
econometrics; geographically
weighted regression; local indicators
of spatial association; and multilevel
modelling.