Case Study: Overview of Metabolomic Data Normalization Strategies
Implementation of Metabolomic Data Normalization Strategies
Dmitry Grapov, PhD
Summary
Five normalization methods were compared, of which the combination of
qc-LOESS and cubic splines showed the best performance based on within-batch and
between-batch variable relative standard deviations for QCs. This approach was used to
normalize the sample measurements, and the results were analyzed using principal
components analysis. Based on this analysis, an unknown source of variance was
identified among the samples (batches ~1-7 versus 8-25) which was absent from the QC
samples and was concluded to stem from biological variability related to the
experimental design.
Results
The complete data set, acquired over a one-year period (3/6/2013 to
2/20/2014), consisted of 1262 measurements of 319 variables. Analytical variance over
the duration of the data acquisition was estimated from 105 equally interspersed
quality control (QC) samples (1:10 QC-to-sample ratio). To aid the overview of temporal
trends, the full acquisition period was segmented into 1-3 day increments, yielding 25
batches (median, 53 samples per batch; range, 13 to 84).
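The batch-wise RSD estimate described above can be sketched as follows. This is an illustrative Python/pandas version (the analysis itself was done in R), with random toy data in place of the study measurements and made-up variable names:

```python
# Within-batch relative standard deviation (RSD) per variable, as used to
# assess QC precision. Data, sizes, and column names are illustrative.
import numpy as np
import pandas as pd

def batch_rsd(qc: pd.DataFrame, batches: pd.Series) -> pd.DataFrame:
    """RSD (%) of each variable within each acquisition batch."""
    grouped = qc.groupby(batches)
    return 100 * grouped.std() / grouped.mean()

rng = np.random.default_rng(0)
qc = pd.DataFrame(rng.normal(100, 10, size=(30, 3)),
                  columns=["mz_101", "mz_202", "mz_303"])  # hypothetical variables
batches = pd.Series([i // 10 for i in range(30)])          # 3 batches of 10 QCs
print(batch_rsd(qc, batches).round(1))                     # one row per batch
```

A between-batch RSD can be computed analogously from the per-batch medians aggregated across the study.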
QC samples were used to evaluate five common data normalization procedures:
quantile, cubic splines, cyclic LOESS [1], batch ratio and (qc-)LOESS [2]. Normalization
performance was assessed based on within-batch (Figure 1A) and between-batch
(Figure 1B and C) variable relative standard deviations (RSD) of QC samples. The qc-LOESS
approach, which is a modification of the LOESS procedure (Figure 2), displayed the best
performance for QC samples (median batch RSD, 30%; range, 20-42%; raw data: 35%;
range, 19-51%), with 78% of normalized variables showing RSD < 40% compared to 65%
for the raw data. However, 113 variables (35%) displayed inconsistent trends between
the qc-LOESS model training and test sets and were identified as inappropriate for the qc-LOESS
normalization. The remaining variables were normalized using the cubic splines method,
which does not require a similar consistency criterion, and showed the second best
performance for QC samples (median batch RSD, 31%; range, 18-44%; 77% of variables
with RSD < 40%). The combination of qc-LOESS and cubic splines normalizations was
shown to improve data quality by reducing within-batch and between-batch analytical
variance (Figure 3).
Principal components analysis was used to evaluate raw and normalized QC and
sample measurements for batch effects (Figure 4). The raw QC data displayed slight
differences between batches 1-7 and 8-25 (Figure 4A, red points), which were removed
after normalization (Figure 4B). However, both raw and normalized samples displayed a
large mode of variance between samples in batches ~1-7 and all other batches
(Figure 4C and D). After confirming that this trend was not due to the biological design of
the study, based on evaluation of the same samples measured on an orthogonal
metabolomic platform (LC-Q-TOF), a semi-supervised, model-based clustering approach
was used to define the members of the unique modes of variance. A linear
model was then used to adjust the normalized data based on the clusters defined by
model-based clustering (Figure 5).
Methods
Principal components analysis (PCA) on autoscaled data was used to provide an overview
of the raw and normalized data and of QC sample variance by acquisition batch, and to
identify one outlier QC sample (Bio Rec 94), which was removed from all further analyses
(Figure 6). Quantile, cubic splines and cyclic LOESS normalizations were implemented
without cross-validation [1]. Within- and between-batch RSDs were calculated based on
batch and aggregated medians. Batch ratio (BR) and qc-LOESS were implemented using
cross-validation: 2/3 of the QC samples were used to train the model, which was then
applied to the remaining 1/3. For consistency with the other normalization methods,
performance is reported for the combined training and test sets.
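The autoscaling and PCA overview can be sketched as below; this uses Python/scikit-learn rather than the original R implementation, and the random matrix only mirrors the stated dimensions (105 QC samples x 319 variables):

```python
# PCA on autoscaled (mean-centered, unit-variance) data for a quick
# overview of sample variance; dimensions follow the text, data are random.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(105, 319))             # QC samples x variables
X_auto = StandardScaler().fit_transform(X)  # autoscale each variable
scores = PCA(n_components=2).fit_transform(X_auto)
print(scores.shape)  # (105, 2): one point per sample in the scores plot
```

Plotting these scores colored by batch or acquisition order gives views analogous to Figures 4 and 6.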
BR normalization applies a batch-specific correction factor to each variable,
calculated as the ratio of the within-batch to the study-wide variable medians. The
qc-LOESS normalization is an adaptation of the LOESS normalization which uses QC
samples, but adds a step to determine whether the LOESS-based normalization is
applicable to the data by testing the correlation between LOESS models for the
training and test sets (cubic splines interpolated). LOESS model span was selected using
leave-one-out cross-validation on the training data. Variables inappropriate for the qc-
LOESS normalization were instead normalized by the cubic splines method. Cubic splines
normalization displayed the best performance of all algorithms for variables with
intensities < 1,000, but displayed slightly higher RSD compared to no normalization for
variables with intensities > 1,000 (Figure 6). The combination of qc-LOESS and cubic
splines was used to fully normalize the dataset, but variables with intensities > 1,000
showing poor cubic splines performance could instead be presented as raw
(non-normalized) data.
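A minimal sketch of the two corrections, assuming a single variable tracked over injection order; a running median through the QC injections stands in for the cross-validated LOESS fit, and all data are synthetic:

```python
# Batch-ratio (BR) and QC-based drift correction for one variable.
# The running median is a stand-in for the LOESS fit used in the report.
import numpy as np
import pandas as pd

def batch_ratio_normalize(x: pd.Series, batches: pd.Series) -> pd.Series:
    """Divide each value by (within-batch median / study-wide median)."""
    factors = x.groupby(batches).transform("median") / x.median()
    return x / factors

def qc_drift_normalize(x, qc_idx, window=5):
    """Smooth the QC signal over run order, interpolate the trend to all
    injections, and divide it out (scaled to preserve the QC median)."""
    qc_trend = pd.Series(x[qc_idx]).rolling(window, center=True,
                                            min_periods=1).median()
    trend = np.interp(np.arange(len(x)), qc_idx, qc_trend)
    return x / (trend / np.median(x[qc_idx]))

rng = np.random.default_rng(2)
order = np.arange(100)
x = 1000 + 3 * order + rng.normal(0, 20, 100)  # signal with linear drift
qc_idx = order[::10]                           # every 10th injection is a QC
corrected = qc_drift_normalize(x, qc_idx)      # drift largely removed
```

In the report the LOESS span was additionally chosen by leave-one-out cross-validation on the QC training set, and variables whose training and test trends disagreed were routed to the cubic splines method instead.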
Model-based clustering was carried out by fitting finite Gaussian mixture models with
the EM algorithm, initialized by model-based hierarchical clustering and compared using
the Bayesian information criterion (BIC) [3]. The best model, with two clusters, was
selected based on BIC. Analyte-specific linear models were used to adjust sample
means based on the model-based cluster memberships.
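The clustering and adjustment steps can be sketched as follows; scikit-learn's GaussianMixture stands in for the R mclust implementation cited in [3], the data are toy, and with only a categorical cluster covariate the analyte-specific linear-model adjustment reduces to equalizing cluster means:

```python
# BIC-selected Gaussian mixture clustering, then a per-variable
# cluster-membership mean adjustment. Toy two-mode data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 5)),
               rng.normal(4, 1, (40, 5))])   # two shifted sample groups

# Fit 1-3 component models and keep the one with the lowest BIC.
fits = {k: GaussianMixture(n_components=k, random_state=0).fit(X)
        for k in (1, 2, 3)}
best_k = min(fits, key=lambda k: fits[k].bic(X))
labels = fits[best_k].predict(X)

# Remove the cluster-mean differences, restoring the grand mean.
adjusted = X.copy()
for k in np.unique(labels):
    adjusted[labels == k] -= X[labels == k].mean(axis=0)
adjusted += X.mean(axis=0)
```

After adjustment, each variable has the same mean in every cluster, removing the between-mode shift while leaving within-cluster variation intact.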
All analyses were implemented in R v3.0.2 [4] using the Devium package
(https://github.com/dgrapov/devium).
Figure 1. Overview of common data normalization approaches applied to the QC samples.
A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the
logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is
preferable; B) number of variables displaying RSDs in the specified intervals for each normalization method;
and C) number of batches displaying median variable RSDs in the specified intervals.
Figure 2. Modified workflow for qc-LOESS normalization.
Figure 3. Comparison of raw and normalized sample relative standard deviations.
A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the
logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is
preferable; B) number of variables displaying RSDs in the specified intervals for each normalization method;
and C) number of batches displaying median variable RSDs in the specified intervals.
Figure 4. PCA scores of raw and normalized samples and QCs, annotated by batch and acquisition order.
PCA sample scores for the first two components for A) raw QCs, B) normalized QCs, C) raw samples, and
D) normalized samples.
Figure 5. PCA scores of normalized samples before and after adjustment for the covariate defined by
unsupervised model-based clustering.
PCA sample scores for the first two components for A) model-based clustering defined clusters and
B) cluster-membership adjusted data.
Figure 6. Principal components analysis of QCs, with annotation of acquisition order (sample label).
A) PCA scores from the first two components displaying QC sample label IDs (duplicated labels are expressed
as X.1). Sample 94, circled in red, was identified as an outlier (no other QC scores with similar dates in
its proximity).
References
1. Kohl, S.M., et al., State-of-the-art data normalization methods improve NMR-
based metabolomic analysis. Metabolomics, 2012. 8(Suppl 1): p. 146-160.
2. Dunn, W.B., et al., Procedures for large-scale metabolic profiling of serum and
plasma using gas chromatography and liquid chromatography coupled to mass
spectrometry. Nat Protoc, 2011. 6(7): p. 1060-83.
3. Fraley, C. and A.E. Raftery, Model-based Clustering, Discriminant Analysis
and Density Estimation. Journal of the American Statistical Association,
2002. 97: p. 611-631.
4. R Development Core Team, R: A language and environment for statistical
computing. R Foundation for Statistical Computing, 2011. ISBN 3-900051-07-0,
URL http://www.R-project.org/.