Optimization of sample configurations for spatial trend estimation

Pedometrics 2015
14 – 18 September 2015
Faculty of Labour Sciences, Avenida de Ollerías 2
Córdoba, Spain (37.891586, -4.777202)
Optimization of Sample Configurations for Spatial Trend Estimation
Alessandro Samuel-Rosa(1), Dick J Brus(2), Gustavo M Vasques(3), Lúcia H C Anjos(1)
(1) Universidade Federal Rural do Rio de Janeiro, Brazil (alessandrosamuelrosa@gmail.com, lanjos@ufrrj.br); (2) Alterra, Wageningen University and Research Centre, the
Netherlands (dick.brus@wur.nl); (3) Embrapa Soils, Brazil (gustavo.vasques@embrapa.br).
Introduction
The spatial trend corresponds to the spatial variation of Z(s) that is explained linearly or non-linearly by the
covariates. There are various methods to design samples for spatial trend estimation. One of the most widely used in
soil science, the so-called conditioned Latin Hypercube Sampling (cLHS) (Minasny & McBratney, 2006), searches for a
spatial sample that is optimal in terms of
1) coverage of the marginal distribution of numeric covariates,
2) linear correlation of numeric covariates, and
3) proportional sample sizes for the classes of factor covariates.
The idea is that with such a sample we can identify the “true” spatial trend if we are ignorant about its form. We
propose to improve on the existing cLHS and present our implementation in the R-package spsann.
Measuring the Association Between Factor Covariates
Like the cLHS, our implementation is based on solving a multi-objective optimization problem (MOOP)
using spatial simulated annealing. But instead of three, we define two objective functions. As such, we redefine the
optimization criterion as the reproduction of an Association/Correlation measure and the marginal Distribution of
the Covariates (ACDC).
This is because the cLHS ignores the association among factor covariates and between factor and numeric
covariates. We propose to use Pearson's r (correlation) only when all covariates are numeric, and Cramér's V
(association) when some or all covariates are factors. In the latter case, any numeric covariate is transformed into a
factor covariate, with the factor levels defined by the marginal sampling strata.
In the formula for Cramér's V, r and c are the number of rows and columns of the contingency table, n is the number
of observations, and $\chi^2$ is the chi-squared statistic, in which $O_{ij}$ and $E_{ij}$ are the observed and
expected frequencies in cell (i, j) of the contingency table, respectively (Cramér, 1946).
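As a rough illustration of the association measure, the sketch below computes Cramér's V between a factor covariate and a numeric covariate that is first converted to a factor. The covariate values, the break points, and the helper name `cramers_v` are hypothetical and are not taken from spsann; only base R functions are used.

```r
# Hypothetical helper: Cramér's V for two factors (not the spsann implementation)
cramers_v <- function(a, b) {
  tab <- table(a, b)                                 # r x c contingency table
  n <- sum(tab)                                      # number of observations
  expected <- outer(rowSums(tab), colSums(tab)) / n  # expected frequencies
  chi2 <- sum((tab - expected)^2 / expected)         # chi-squared statistic
  sqrt((chi2 / n) / min(ncol(tab) - 1, nrow(tab) - 1))
}

# Toy covariates (made-up values)
land_use <- factor(c("crop", "crop", "forest", "forest", "crop", "forest",
                     "pasture", "pasture", "crop", "forest", "pasture", "crop"))
elevation <- c(110, 125, 340, 360, 130, 355, 220, 240, 115, 350, 230, 120)

# Numeric covariate converted to a factor using assumed marginal strata breaks
elev_breaks <- c(100, 200, 300, 400)
elevation_f <- cut(elevation, breaks = elev_breaks, include.lowest = TRUE)

v <- cramers_v(land_use, elevation_f)
v
```

The toy data were built so that every land-use class falls in a single elevation stratum, so V reaches its upper bound of 1. With real covariates the value lies between 0 (no association) and 1.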
Defining the Marginal Sampling Strata
The cLHS uses quantiles to create equal-area marginal sampling strata. Depending on the number of marginal
strata, this may produce replicated breakpoints in regions with a relatively high frequency of covariate values.
R> # Replicated breakpoints
R> sample_size <- 5
R> covariate <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
R> probs <- seq(0, 1, length.out = sample_size + 1)
R> breaks <- quantile(covariate, probs, na.rm = TRUE)
R> breaks
0% 20% 40% 60% 80% 100%
1.0 1.0 2.6 4.4 9.0 9.0
The presence of replicated breakpoints prevents the optimization algorithm from converging to the optimum. We
propose defining marginal sampling strata using only the unique values of the sample quantiles estimated with a
discontinuous function (Hyndman & Fan, 1996). This avoids creating empty marginal strata.
R> # Unique breakpoints
R> breaks <- quantile(covariate, probs, na.rm = TRUE, type = 3)
R> breaks <- unique(breaks)
R> breaks
[1] 1 2 4 9
This approach results in each numeric covariate having a different number of quasi-equal-size sampling strata.
The number of sample points that should fall in each marginal sampling stratum is proportional to the number of
sampling units in that stratum.
R> # Number of points per stratum
R> count <- hist(covariate, breaks, plot = FALSE)$counts
R> count <- count / sum(count) * sample_size
R> count
[1] 2 1 2
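To show how these target counts might be used, the sketch below compares them with the counts observed in a candidate sample. The candidate sample and the mismatch measure (a sum of absolute differences) are assumptions made for illustration; the text above does not state the exact form of the objective function.

```r
covariate <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
sample_size <- 5
probs <- seq(0, 1, length.out = sample_size + 1)
breaks <- unique(quantile(covariate, probs, na.rm = TRUE, type = 3))

# Target number of sample points per marginal stratum
target <- hist(covariate, breaks, plot = FALSE)$counts
target <- target / sum(target) * sample_size

# Hypothetical candidate sample: positions of five sampling units
candidate <- c(1, 2, 4, 11, 12)
observed <- hist(covariate[candidate], breaks, plot = FALSE)$counts

# Assumed mismatch measure: sum of absolute differences (0 = target reproduced)
mismatch <- sum(abs(observed - target))
mismatch
```

Here the candidate places one point too few in the first stratum and one too many in the last, so the mismatch is 2. A candidate such as `c(1, 3, 4, 2, 11)` reproduces the target counts and gives a mismatch of 0.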
Avoiding Numerical Dominance
We also solve the MOOP by aggregating the objective functions into a single utility function U using a weighted
sum, with the weights defining the relative importance of each objective function. In U, w is a vector of positive
weights that sum to unity and k is the number of objective functions (Marler & Arora, 2009). The improvement is that
the objective functions are first scaled to the same approximate range of values using the upper-lower bound approach
with the Pareto maximum $f_i^{\max}$ (and minimum), where $x_j^*$ is the point that minimizes the jth objective
function, a vertex of the Pareto optimal set in the design space (Marler & Arora, 2005).
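The scaling and aggregation steps can be sketched as follows. The two objective functions, their minimizers, and the function name `utility` are hypothetical, and the scaling is assumed to take the usual upper-lower bound form, (f - f_min) / (f_max - f_min); this is not the spsann code.

```r
# Two hypothetical objective functions on very different numerical scales
f <- list(
  function(x) 100 * (x - 1)^2 + 5,
  function(x) (x - 3)^2 / 10
)
x_star <- c(1, 3)  # assumed known minimizers of f[[1]] and f[[2]]

# k x k matrix: row i holds f_i evaluated at every x_j*
f_at_star <- sapply(x_star, function(xj) sapply(f, function(fi) fi(xj)))
f_max <- apply(f_at_star, 1, max)  # Pareto maximum of each objective function
f_min <- diag(f_at_star)           # Pareto minimum, f_i(x_i*)

# Weighted sum of the scaled objective functions
utility <- function(x, w = c(0.5, 0.5)) {
  stopifnot(all(w > 0), isTRUE(all.equal(sum(w), 1)))
  f_x <- sapply(f, function(fi) fi(x))
  f_scaled <- (f_x - f_min) / (f_max - f_min)
  sum(w * f_scaled)
}

utility(2)
```

After scaling, each objective function ranges from 0 at its own minimizer to 1 at the worst of the individual minimizers, so the weights alone control the relative importance of the objectives.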
Using the Pareto maximum (and minimum) avoids the numerical dominance (bias) of any objective function,
such as occurs with the first objective function (O1) of the cLHS, which yields criterion values much larger than the
second (O2) and third (O3). The numerical dominance occurs because O1 uses the number of points per stratum (0 to
n), while O2 uses the proportion of points per stratum (0 to 1) and O3 uses the linear correlation coefficient (-1 to 1).
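A small numerical illustration of this dominance is given below. All criterion values and bounds are made-up numbers, chosen only to mimic the different ranges of O1, O2, and O3; they are not results from the cLHS or from spsann.

```r
# Hypothetical criterion values for one candidate sample (made-up numbers)
raw <- c(O1 = 12, O2 = 0.15, O3 = 0.30)
w <- c(1, 1, 1) / 3  # equal weights

# Share of the unscaled weighted sum contributed by each objective function
share_raw <- w * raw / sum(w * raw)

# Assumed Pareto minimum and maximum of each objective function
f_min <- c(O1 = 0, O2 = 0, O3 = 0)
f_max <- c(O1 = 40, O2 = 0.5, O3 = 1)

# Shares after scaling each objective function to a common range
scaled <- (raw - f_min) / (f_max - f_min)
share_scaled <- w * scaled / sum(w * scaled)

round(rbind(share_raw, share_scaled), 2)
```

Without scaling, O1 accounts for more than 95% of the weighted sum despite the equal weights. With the assumed bounds, the three scaled values are equal and each objective function contributes one third.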
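Cramér's V:
$$V = \sqrt{\frac{\chi^2 / n}{\min(c - 1, r - 1)}}$$
Chi-squared statistic:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Utility function:
$$U = \sum_{i=1}^{k} w_i f_i(x)$$
Pareto maximum:
$$f_i^{\max} = \max_{1 \le j \le k} f_i(x_j^*)$$
References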
Cramér, H. Mathematical methods of statistics. Princeton: Princeton University Press, p. 575, 1946.
Hyndman, R. J. & Fan, Y. Sample quantiles in statistical packages. The American Statistician, v. 50, p. 361-365,
1996.
Marler, R. T. & Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering
Optimization, v. 37, p. 551-570, 2005.
Marler, R. T. & Arora, J. S. The weighted sum method for multi-objective optimization: new insights. Structural
and Multidisciplinary Optimization, v. 41, p. 853-862, 2009.
Minasny, B. & McBratney, A. B. A conditioned Latin hypercube method for sampling in the presence of ancillary
information. Computers & Geosciences, v. 32, p. 1378-1388, 2006.
Preliminary Results
Our preliminary results indicate that the sampling distributions derived using our algorithm vary very little for
the same set of covariates, which suggests that the criterion approaches the global optimum.
An in-depth study is being carried out to evaluate how our implementation performs compared to the original
cLHS method (and other sample designs as well). Using simulated data, we will evaluate their ability to capture the
true form of the spatial trend (linear and non-linear) and make accurate predictions.
Acknowledgements
We are grateful to Dr. Gerard Heuvelink, from ISRIC – World Soil Information, for his comments during the
development of this work. The first author was supported by the CAPES Foundation, Ministry of Education of
Brazil (Process BEX 11677/13-9), and by the CNPq Foundation, Ministry of Science and Technology of Brazil
(Process 140720/2012-0). The last author was supported by the CNPq Foundation, Ministry of Science and
Technology of Brazil (Process 480515/2013-1).
Student Presentation
