Key lecture for the EURO-BASIN Training Workshop on Introduction to Statistical Modelling for Habitat Model Development, 26-28 Oct, AZTI-Tecnalia, Pasaia, Spain (www.euro-basin.eu)
OUTLINE
• Why model?
• Habitat models
• Model properties
• Steps for modelling
• What about data?
WHY MODEL?
• “All models are wrong, but some are useful” (G. Box)
• Models are how we understand the world:
– We see the world through models
– We learn about the world using formal descriptions
• Model types:
– Static vs dynamic
– Explanatory vs predictive
– Deterministic vs stochastic
– Discrete vs continuous
HABITAT MODELS
• Habitat models focus on how environmental factors control
the distribution of species and communities.
• Multiple applications:
– Biogeography, impact of global change, management,
conservation, ecology, …
• New conceptual and operational advances driven by the growth in
computing power, e.g. GIS, remote sensing and new computer-intensive
statistical modelling tools
MODEL PROPERTIES
Some desirable model properties:
• Parsimony (Occam’s razor): “All things being equal, the simplest
solution tends to be the best one”
• Tractability: easy to analyse
• Conceptually insightful: reveal fundamental properties
• Generalizability: can be applied to other situations/species/…
• Empirical consistency: consistent with the available data
• Falsifiability: can be tested by observations
• Predictive precision
MODEL PROPERTIES
[Figure: predictive habitat distribution models; after Levins (1966), Sharpe (1990), Guisan and Zimmermann (2000)]
MODEL PROPERTIES
[Figure: model complexity vs generality]
The most complex model is not necessarily the best…
STEPS FOR MODELLING
1) Conceptual phase
2) Model formulation
3) Model calibration
4) Spatial predictions
5) Model evaluation
6) Model applicability
1. Conceptual phase
• A theoretical model should be in mind before a statistical
model is even considered
• This phase includes:
– Literature review
– Define an up-to-date conceptual model
– Set multiple hypotheses
– Assess available and missing data
– Identify an appropriate sampling strategy for new data
– Choose appropriate spatio-temporal resolution and geographic
extent
– Identify the most appropriate statistical methods for the other
phases
2. Model formulation
• The model depends on the type of response variable and its
associated probability distribution
Distribution         Examples
Gaussian             Biomass
Poisson              Individual counts
Negative Binomial    Individual counts
Multinomial          Communities
Binomial             Presence/absence
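
Both the Poisson and the Negative Binomial are listed for individual counts. A minimal sketch of one common way to choose between them, assuming the variance-to-mean ratio of the raw counts is used as a rough overdispersion check. The function name and the threshold of 1.5 are illustrative assumptions, not part of the lecture:

```python
from statistics import mean, variance


def suggest_count_distribution(counts, threshold=1.5):
    """Suggest a distribution for count data from its variance-to-mean ratio.

    A Poisson variable has variance equal to its mean, so a ratio well
    above 1 hints at overdispersion. The threshold is an arbitrary,
    illustrative cut-off (assumption).
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("all counts are zero; the ratio is undefined")
    ratio = variance(counts) / m  # sample variance divided by sample mean
    name = "Negative Binomial" if ratio > threshold else "Poisson"
    return name, ratio


# Hypothetical counts per haul: mostly zeros with a few large catches
patchy = [0, 0, 1, 0, 25, 3, 0, 40, 2, 0]
print(suggest_count_distribution(patchy))
```

This is only a screening step on the raw response; a proper check would look at the dispersion of a fitted model.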
REGRESSION ANALYSIS 2. Model formulation
Other regression models:
• Mixed models: LMs, GLMs and GAMs that include random-effect
terms. Useful for meta-analysis.
• Quantile regression: the quantiles are modelled instead of
the mean. Useful for finding limiting factors
• Segmented regression: the model changes depending on a
partition of the explanatory variable. Useful for detecting
regime changes
• Spatial autocorrelation and autoregressive models
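
The idea behind segmented regression can be sketched as a grid search over candidate breakpoints, fitting a separate least-squares line on each side and keeping the split with the smallest total squared error. This is a simplified illustration under stated assumptions (one breakpoint, lines not forced to join at the breakpoint); the function names are hypothetical:

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def segmented_fit(xs, ys, min_points=3):
    """Try every split of the sorted data and keep the one with lowest SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_points, len(pairs) - min_points + 1):
        left, right = pairs[:i], pairs[i:]
        l_slope, l_int, l_sse = fit_line(*zip(*left))
        r_slope, r_int, r_sse = fit_line(*zip(*right))
        total = l_sse + r_sse
        if best is None or total < best["sse"]:
            best = {
                # breakpoint placed halfway between the two neighbouring x values
                "breakpoint": (left[-1][0] + right[0][0]) / 2,
                "left": (l_slope, l_int),
                "right": (r_slope, r_int),
                "sse": total,
            }
    return best


# Synthetic response that rises up to x = 4 and declines afterwards
x = list(range(10))
y = [2 * v if v <= 4 else 20 - v for v in x]
print(segmented_fit(x, y)["breakpoint"])
```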
CLASSIFICATION TECHNIQUES 2. Model formulation
• Classification is the placement of species and/or sample units
into groups based on the environmental variables
• Many techniques are included: classification trees,
regression trees, rule-based classification, maximum-likelihood
classification
• Two main groups:
– Supervised classification: a training data set is required
(groups are known beforehand)
– Unsupervised classification: groups are unknown and need
to be defined, as in cluster analysis
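
To make the supervised case concrete, here is a minimal sketch using a nearest-centroid rule. This technique is swapped in for brevity and is not one of the methods listed above; the variable names and habitat labels are hypothetical:

```python
from math import dist


def train_centroids(samples, labels):
    """Training step: average the environmental variables within each known group."""
    sums, counts = {}, {}
    for x, lab in zip(samples, labels):
        acc = sums.setdefault(lab, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def classify(x, centroids):
    """Assign a new sample unit to the group with the closest centroid."""
    return min(centroids, key=lambda lab: dist(x, centroids[lab]))


# Hypothetical training set: (temperature, salinity) with known habitat labels
train_x = [(12.0, 35.1), (12.5, 35.0), (18.0, 33.0), (18.5, 32.8)]
train_y = ["offshore", "offshore", "coastal", "coastal"]
centroids = train_centroids(train_x, train_y)
print(classify((17.6, 33.2), centroids))
```

In the unsupervised case the labels in `train_y` would not exist and would have to be derived from the data first, for example by clustering.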
ENVIRONMENTAL ENVELOPES 2. Model formulation
• The environmental envelope of a species is defined as the set
of environments within which it is believed that the species can
persist (Walker and Cocks, 1991)
• Examples of models:
– BIOCLIM: minimal rectilinear envelopes based on
classification trees
– HABITAT: convex polytope envelopes based on
classification trees
– DOMAIN: based on multivariate distance metrics
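
The simplest form of an environmental envelope can be sketched as a rectilinear box: the minimum and maximum of each environmental variable over the sites where the species was observed. This is an illustration of the concept only, not the actual BIOCLIM, HABITAT or DOMAIN algorithms; the function and variable names are assumptions:

```python
def fit_envelope(presence_sites):
    """Return {variable: (min, max)} over the sites where the species is present."""
    variables = presence_sites[0].keys()
    return {
        v: (min(s[v] for s in presence_sites), max(s[v] for s in presence_sites))
        for v in variables
    }


def inside_envelope(site, envelope):
    """True if every variable of the site falls within the envelope limits."""
    return all(lo <= site[v] <= hi for v, (lo, hi) in envelope.items())


# Hypothetical presence records
presences = [
    {"temperature": 14.0, "salinity": 35.2},
    {"temperature": 16.5, "salinity": 34.8},
    {"temperature": 15.1, "salinity": 35.5},
]
envelope = fit_envelope(presences)
print(inside_envelope({"temperature": 15.0, "salinity": 35.0}, envelope))  # True
print(inside_envelope({"temperature": 19.0, "salinity": 35.0}, envelope))  # False
```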
ORDINATION TECHNIQUES 2. Model formulation
• Ordination is the arrangement or ‘ordering’ of species and/or
sample units along gradients
• Usually applied to community data matrices (row: species,
column: samples, value: abundance)
ORDINATION TECHNIQUES 2. Model formulation
• Indirect gradient analysis (no environmental data used)
  – Distance-based approaches:
    • Polar ordination, Principal Coordinates Analysis, Nonmetric
      Multidimensional Scaling
  – Eigenanalysis-based approaches:
    • Linear model
      – Principal Components Analysis
    • Unimodal model
      – Correspondence Analysis, Detrended Correspondence Analysis
• Direct gradient analysis (environmental data used)
  – Linear model
    • Redundancy Analysis
  – Unimodal model
    • Canonical Correspondence Analysis, Detrended Canonical
      Correspondence Analysis
ter Braak and Prentice (1988)
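
The distance-based approaches start from a matrix of dissimilarities between samples. A minimal sketch of that first step, assuming the Bray-Curtis dissimilarity (a choice made here for illustration, not named in the lecture) and a community matrix with species in rows and samples in columns:

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("undefined for two empty samples")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def bray_curtis_matrix(community):
    """Pairwise dissimilarities between samples (the columns of the matrix)."""
    samples = list(zip(*community))  # transpose: one tuple per sample
    return [[bray_curtis(s, t) for t in samples] for s in samples]


# Hypothetical abundances: 3 species (rows) x 3 samples (columns)
community = [
    [10, 0, 5],
    [0, 8, 5],
    [2, 2, 0],
]
for row in bray_curtis_matrix(community):
    print([round(d, 3) for d in row])
```

The resulting square matrix is the kind of input that Principal Coordinates Analysis or Nonmetric Multidimensional Scaling would then arrange along a few axes.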
NEURAL NETWORKS 2. Model formulation
• Models inspired by the human brain (an interconnected group of
neurons)
• They define a non-linear function as a weighted sum of functions,
each of which can in turn be decomposed in the same way. The result
is a complex non-parametric model (black box?)
• Adjusted by varying parameters, connection weights, or
specifics of the architecture such as the number of neurons or
their connectivity
• Few examples available so far
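
The "weighted sum of functions" structure can be sketched as the forward pass of a network with one hidden layer. The sigmoid activation, the layer sizes and all weights below are illustrative assumptions; in practice the weights would be adjusted during training:

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer: each neuron applies a function to a weighted sum of inputs,
    and the output applies a function to a weighted sum of the neurons."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Two environmental inputs, two hidden neurons, one output (e.g. probability of presence)
p = forward(
    x=[0.3, -1.2],
    hidden_weights=[[0.5, -0.4], [1.1, 0.7]],
    hidden_biases=[0.0, -0.2],
    output_weights=[1.5, -2.0],
    output_bias=0.1,
)
print(p)
```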
3. Model calibration
• It includes model fitting (finding the values of the unknown
parameters that give the best agreement between the data and the model
outputs) and model selection (deciding which explanatory variables to
include)
• Points to take into account:
– Use of predictors that are ecologically relevant: direct vs indirect
(proxy) variables
– Correlation between explanatory variables
• Each method has its own diagnostic tools according to its
assumptions, e.g. the residual deviance in regression models
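
Correlation between explanatory variables can be screened before fitting. A minimal sketch, assuming pairwise Pearson correlation and an absolute threshold of 0.7, which is a common rule of thumb and not a value given in the lecture. The predictor names and values are hypothetical:

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(predictors, threshold=0.7):
    """List predictor pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for a, b in combinations(predictors, 2):
        r = pearson(predictors[a], predictors[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


predictors = {
    "temp_surface": [10.0, 12.0, 14.0, 16.0, 18.0],
    "temp_10m": [9.5, 11.6, 13.4, 15.5, 17.6],
    "depth": [50.0, 200.0, 80.0, 400.0, 30.0],
}
for a, b, r in correlated_pairs(predictors):
    print(a, b, round(r, 3))
```

A flagged pair suggests keeping only one of the two variables, ideally the one with the more direct ecological meaning.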
4. Spatial predictions
• Spatial predictions can be made on the data set used for calibration
or on new data sets. Care must be taken when predicting on a new data
set that contains new combinations of the explanatory variables or
values outside the range covered by the calibration data set
• GIS tools are very often used, but many statistical models are still
not implemented in a GIS environment
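
The warning about values outside the calibration range can be turned into a simple check before predicting. A minimal sketch with hypothetical function and variable names; it only tests each variable separately and does not detect new combinations of variables that each lie within range:

```python
def calibration_ranges(calibration_rows):
    """Observed (min, max) of each explanatory variable in the calibration data."""
    variables = calibration_rows[0].keys()
    return {
        v: (min(r[v] for r in calibration_rows), max(r[v] for r in calibration_rows))
        for v in variables
    }


def extrapolated_variables(new_row, ranges):
    """Names of the variables for which a new site lies outside the calibration range."""
    return [v for v, (lo, hi) in ranges.items() if not lo <= new_row[v] <= hi]


calibration = [
    {"sst": 12.0, "chl": 0.4},
    {"sst": 15.0, "chl": 1.1},
    {"sst": 17.5, "chl": 0.7},
]
ranges = calibration_ranges(calibration)
print(extrapolated_variables({"sst": 21.0, "chl": 0.9}, ranges))  # ['sst']
```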
5. Model evaluation
• The aim is to evaluate the predictive power of a model
• If only one data set is available (the one already used for
calibration), resampling methods are used: bootstrap, cross-validation,
jackknife
• If other data sets are available (independent of the calibration data
set), predicted and observed values are compared using:
– the same goodness of fit measure as used for model calibration
– any other measure of association
The data sets for calibration and evaluation are called the
training and evaluation data sets, respectively. Sometimes the original
single data set is split in two (split-sample approach)
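
When only one data set is available, cross-validation can be sketched as follows: the data are split into k folds, each fold is held out in turn, and the model fitted on the rest is evaluated on it. The straight-line model, the RMSE measure and all function names are illustrative assumptions:

```python
import random
from math import sqrt


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def predict_line(model, x):
    slope, intercept = model
    return intercept + slope * x


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Root mean squared error of predictions made on held-out folds."""
    squared_errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)  # calibrate on the training part only
        squared_errors.extend((predict(model, xs[i]) - ys[i]) ** 2 for i in fold)
    return sqrt(sum(squared_errors) / len(squared_errors))


xs = [float(i) for i in range(20)]
ys = [3.0 + 2.0 * x + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]
print(cross_validate(xs, ys, fit_line, predict_line, k=5))
```

The split-sample approach corresponds to a single such split, with one part used for training and the other for evaluation.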
6. Model applicability
• It refers to the domain over which a validated model can be properly
used
• Potential uses (Decoursey, 1992):
– Screening
– Research
– Planning, monitoring and assessment
WHAT ABOUT DATA?
• Data is even more important than the model itself.
• Usually from multiple sources: surveys (continuous, stations, vertical
profiles), remote sensing, circulation models, …
• The scale of the response and the environmental variables might not
be the same, so a common scale unit needs to be defined. Sometimes
interpolation is needed, which can introduce additional
uncertainties
• Simple exploratory statistics and figures can be very useful before
even starting to think about any model. They also help to spot errors
in the data.
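
Bringing variables to a common scale often comes down to interpolation. A minimal sketch, assuming linear interpolation of a vertical profile onto the depth of a biological sample; the function name and the values are hypothetical, and other interpolation schemes may be more appropriate in practice:

```python
from bisect import bisect_left


def linear_interpolate(xs, ys, x):
    """Linearly interpolate y at x from points sorted by increasing xs.

    Returns None outside the observed range rather than extrapolating.
    """
    if not xs[0] <= x <= xs[-1]:
        return None
    i = bisect_left(xs, x)
    if xs[i] == x:
        return ys[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical temperature profile (depth in m, temperature in degrees C)
depths = [0.0, 10.0, 50.0]
temperatures = [18.0, 16.0, 12.0]
print(linear_interpolate(depths, temperatures, 5.0))   # 17.0
print(linear_interpolate(depths, temperatures, 30.0))  # 14.0
print(linear_interpolate(depths, temperatures, 60.0))  # None
```

Returning None outside the profile keeps the interpolation from silently becoming an extrapolation.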