1. Metabolomics Data Analysis
Johan A. Westerhuis
Swammerdam Institute for Life Sciences, University of Amsterdam
Business Mathematics and Information,
North-West University, Potchefstroom, South Africa
SeqAhead, Barcelona February 2013
3. Metabolomics pipeline: Issues for biostatistics
Biological question -> Experimental design -> Data acquisition -> Data pre-processing -> Statistical data analysis -> Metabolite identification -> Biological interpretation
• Power analysis, treatment design, QC strategy, measurement design
• Normalisation, quantification
• Explorative, predictive, hypothetical biomarkers
• Spectral matching, de novo identification
• Network inference, MSEA, pathway analysis
4. Data Analysis
special issue Metabolomics
• Data preprocessing methods (make samples
more comparable)
• How to treat non-detects
• Variable importance in multivariate models
• Metabolic network analysis
• Data fusion methods
• Individual responses
• Between-metabolite ratios
Guest Editors
Jeroen J. Jansen
Johan A. Westerhuis
6. Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters, structure / outliers in metabolites and in samples
• Supervised
– Discriminate two or more groups to make a predictive model and to find biomarkers
• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference
• Special topics
– Between-metabolite ratios
– Metabolomics Data Fusion
7. Metabolomics Data preprocessing
• Optimize biological content of data
• Correct for incorrect sampling, sample
workup issues, batch effects
• What is the noise level in the data? (Generalized log transform, variance stabilization)
• High peaks more important than low
peaks?
• Multivariate methods love large values!
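The generalized log transform mentioned on this slide can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the exact glog variant (here the form that approaches ln(x) for large x), the tuning constant `lam`, and the peak values are all assumptions.

```python
import math

def glog(x, lam=1.0):
    """Generalized log: ln((x + sqrt(x^2 + lam)) / 2).

    For large x this approaches ln(x); near zero it stays finite,
    which damps the influence of noise on small peaks.
    `lam` is a tuning constant (assumed here, normally estimated
    from the noise level of the data).
    """
    return math.log((x + math.sqrt(x * x + lam)) / 2.0)

# Hypothetical peak intensities spanning several orders of magnitude.
peaks = [0.0, 1.0, 10.0, 1000.0]
transformed = [glog(v, lam=4.0) for v in peaks]
```

After the transform, the large peaks no longer dominate by orders of magnitude, which matters for multivariate methods that are driven by large values.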
9. Metabolic changes during E. coli culture growth using k-means clustering.
[Figure axes: time, metabolites]
(A) Growth curve (optical density) of unperturbed E. coli culture. Numbers of
respective sampling time points are marked in the curve. Time point 0 minutes
marks the application of the respective stress condition.
(B) Relative changes of metabolite pools, normalized to time point 1. Fold change is
presented on a log10 scale. To reveal main trends of metabolic changes,
10 k-means clusters are color coded.
Szymanski, Jedrzej et al. PLoS ONE (2009), vol. 4, issue 10
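As a rough sketch of the procedure described in this caption, metabolite time profiles can be expressed as log10 fold changes relative to the first time point and then grouped with k-means. The metabolite names, intensities and the choice of k = 2 below are made up for illustration; the cited study used 10 clusters on real data.

```python
import math
import random

def kmeans(profiles, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length profiles."""
    rng = random.Random(seed)
    centers = rng.sample(profiles, k)  # initial centers: k random profiles
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in profiles]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical metabolite pools measured at four time points.
raw = {
    "met_up_1": [1.0, 2.0, 4.0, 8.0],
    "met_up_2": [2.0, 4.2, 7.8, 16.0],
    "met_down_1": [8.0, 4.0, 2.0, 1.0],
    "met_down_2": [4.0, 2.1, 0.9, 0.5],
}
# Fold change relative to time point 1, on a log10 scale.
profiles = [[math.log10(v / series[0]) for v in series] for series in raw.values()]
labels, centers = kmeans(profiles, k=2)
```

Normalizing to the first time point removes differences in absolute pool size, so the clusters reflect the shape of the response over time.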
10. Self Organising Map of Metabolites in serum
1H NMR spectra of 613 patients
with type I diabetes and a diverse
spread of complications
Nonlinear mapping method
for large number of samples.
Relate position on the map to
diagnostic responses.
Can be made supervised
1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death
VP Mäkinen et al, Molecular Systems Biology 4:167, 2008
11. Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters, structure / outliers in metabolites and in samples
• Supervised (Differentially expressed)
– Discriminate two or more groups to make a predictive model and to find biomarkers
• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference
• Special topics
– Between-metabolite ratios
– Metabolomics Data Fusion
12. Supervised Metabolomics Data analysis
Case – Control (PLSDA)
[Figure: score plot, PC1 versus PC2, for men and women, with class vector Y coded 0/1]
• Is there really a difference between the groups?
Statistical validation issues
[Figure: PLS regression coefficients b versus chemical shift (ppm)]
• Which are the most important peaks for discrimination?
Variable importance
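One common way to address the question "is there really a difference between the groups?" is a permutation test: shuffle the class labels many times and check how often the shuffled data look as separated as the real data. The sketch below uses a simple difference of group means on one variable as a stand-in statistic. That is an assumption made to keep the example short; with PLS-DA the full model would be refitted for every permutation and a model quality measure used instead. All numbers are invented.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of class labels
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical intensities of one peak in two groups.
men = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
women = [3.9, 4.2, 3.7, 4.1, 4.0, 3.8]
p_value = permutation_p_value(men, women)
```

A small p-value means that a separation this large rarely arises when labels are assigned at random.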
13. • Explain the Psyhogios example using examples from the paper and from MetaboAnalyst
Proton NMR spectra of the urine samples were obtained
on a 500 MHz 1H NMR machine.
16. Experimental Design Example
Experiment:
Rats are given bromobenzene (BB), which affects the liver
Measurements: NMR spectroscopy of urine
Experimental Design:
Time: 6, 24 and 48 hours
Groups: 3 doses of BB, vehicle group, control group
Animals: 3 rats per dose per time point
[Figure: NMR spectra of rat urine at 6, 24 and 48 hours with annotated peak positions; x-axis: chemical shift (ppm)]
17. Different contributions
Experimental Design
[Figure: metabolite concentration versus time, split into separate panels for the Time, Dose and Animal contributions and the resulting Trajectories]
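19. ANOVA and PCA -> ASCA
ANOVA partition of the data matrix:
$$X = \mathbf{1}m^{T} + X_{\alpha} + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$
PCA on each effect matrix:
$$X = \mathbf{1}m^{T} + T_{\alpha}P_{\alpha}^{T} + T_{\alpha\beta}P_{\alpha\beta}^{T} + T_{\alpha\beta\gamma}P_{\alpha\beta\gamma}^{T} + E$$
E: parts of the data not explained by the component models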
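The ANOVA step of this decomposition can be sketched in a few lines. This sketch assumes α is time, αβ is dose within time and αβγ is the individual animal, which matches the experimental design of the example but is an interpretation of the slide. The PCA step on each effect matrix is left out, and the data are invented.

```python
from collections import defaultdict

def mean_vec(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def asca_partition(times, doses, X):
    """Split X into overall mean, time effect, dose-within-time effect
    and individual (animal) deviations. No PCA is performed here."""
    m = mean_vec(X)
    by_time, by_cell = defaultdict(list), defaultdict(list)
    for t, d, x in zip(times, doses, X):
        by_time[t].append(x)
        by_cell[(t, d)].append(x)
    time_mean = {t: mean_vec(r) for t, r in by_time.items()}
    cell_mean = {c: mean_vec(r) for c, r in by_cell.items()}
    X_a, X_ab, X_abg = [], [], []
    for t, d, x in zip(times, doses, X):
        X_a.append([tm - mj for tm, mj in zip(time_mean[t], m)])
        X_ab.append([cm - tm for cm, tm in zip(cell_mean[(t, d)], time_mean[t])])
        X_abg.append([xj - cm for xj, cm in zip(x, cell_mean[(t, d)])])
    return m, X_a, X_ab, X_abg

# Hypothetical design: 2 time points x 2 doses x 2 animals, 2 variables.
times = [6, 6, 6, 6, 24, 24, 24, 24]
doses = ["low", "low", "high", "high", "low", "low", "high", "high"]
X = [[1.0, 2.0], [1.2, 2.1], [2.0, 2.5], [2.2, 2.4],
     [1.5, 3.0], [1.4, 3.2], [3.0, 4.0], [3.1, 4.3]]
m, X_a, X_ab, X_abg = asca_partition(times, doses, X)

# Adding the parts back together recovers the original data.
rebuilt = [[m[j] + X_a[i][j] + X_ab[i][j] + X_abg[i][j] for j in range(len(m))]
           for i in range(len(X))]
```

Each effect matrix would then be analysed by its own PCA, giving the scores T and loadings P per effect.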
20. Results
[Figure: αβ-scores (from Xαβ) versus Time (Hours) at 6, 24 and 48 hours for the control, vehicle, low, medium and high groups; score panels for Xα, Xαβ and Xαβγ; annotation: 40 %]
22. Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters, structure / outliers in metabolites and in samples
• Supervised
– Discriminate two or more groups to make a predictive model and to find biomarkers
– Method comparison
• Biological Interpretation
– Metabolite set enrichment
– Pathway analysis
– Metabolic network inference
• Special topics
– Between-metabolite ratios
– Metabolomics Data Fusion
23. NONTARGETED
SELDI measurements of serum samples of
20 Gaucher patients and 20 healthy
controls.
Gaucher is a genetic disease in which a fatty
substance (lipid) accumulates in cells and
certain organs
24. • Human urine and porcine cerebrospinal fluid
samples spiked with a range of peptides.
• Variation in number of samples and in within-group and
between-group variation
26. Feature selection methods RESULTS
• Complex nontargeted Gaucher profiling data with
highly variable background and varying difference
between case and control: Multivariate methods
perform best.
• Spiked LCMS targeted data with less variation in
effect size: univariate and semi-univariate methods
are best in selecting biomarkers.
29. BMR of green tea intervention study
186 human subjects with abdominal obesity
Validation shows significant changes in BMR between placebo and green tea treatment
together with most important triacylglycerols TG28-29 and TG41-42.
37. Special topic: Metabolic networks
Biochemical Network vs Association Network
Figure 7 Marginal correlation network for a set of metabolites in
tomato. Volatiles in red, derivatized metabolites in yellow. Solid
lines represent positive correlations, dashed lines negative ones.
Thickness of line corresponds to magnitude of ...
Margriet M.W.B. Hendriks, Data-processing strategies for metabolomics studies, Trends in Analytical Chemistry, 2011
38. Metabolomics, 2005
Data from
Potato tubers
Metabolic neighbors Do not participate in common reactions
High correlation due to e.g. chemical equilibrium, mass conservation,..
“a systematic relationship between observed correlation
networks and the underlying biochemical pathways.”
Ralf Steuer: Observing and interpreting correlations in metabolomic networks, Bioinformatics, 2003
39. Metabolic Network Inference
Search for the link between metabolome data and underlying metabolic
networks.
[Figure: metabolites A to F drawn twice, with the connections between them unknown (??)]
As an example: can we distinguish healthy from diseased networks?
[Figure: HEALTHY and DISEASE networks linking Glucose and metabolites A to G]
40. From data to network
Goal: NETWORK TOPOLOGY? DIRECTIONS?
Problems:
• NOISE
• MISSING METABOLITES
• HUGE AMOUNT OF POSSIBLE NETWORK STRUCTURES
41. Inference from static data
1. DATA COLLECTION
A. Enzymatic Variability
B. Intrinsic Variability
C. Environmental Variability
2. SIMILARITY SCORE CALCULATION (ALL POSSIBLE PAIRWISE INTERACTIONS)
2a. Relevance Networks
– Pearson Correlation (PC) (linear)
– Mutual Information (MI) (non-linear)
2b. Conditioned Networks
– Partial Pearson Correlation (PPC) (linear)
– Conditional Mutual Information (CMI) (non-linear)
[Figure: simulated concentration profiles for each source of variability and the resulting networks over metabolites A to F for each similarity score]
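The difference between the two linear scores on this slide, Pearson correlation and partial Pearson correlation, can be illustrated on a simulated chain A -> B -> C. The simulation (noise levels, sample size, seed) is an assumption chosen for illustration. The partial correlation shown is the first-order case, conditioning on a single third metabolite.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_pearson(x, y, z):
    """First-order partial correlation of x and y, conditioned on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulated chain A -> B -> C: A and C are linked only through B.
rng = random.Random(1)
A = [rng.gauss(0, 1) for _ in range(2000)]
B = [a + rng.gauss(0, 0.5) for a in A]
C = [b + rng.gauss(0, 0.5) for b in B]

r_ac = pearson(A, C)                   # high: indirect association
r_ac_given_b = partial_pearson(A, C, B)  # near zero once B is accounted for
```

A relevance network would draw an edge between A and C, whereas a conditioned network would drop it, keeping only the direct links A-B and B-C.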
44. Metabolomics data fusion
• Account for between-block difference in quality of
measurements to improve data fusion
• For example, multi-platform data fusion, with differences in
quantification, (non) targeted, error structure
Amino acids Lipids
Fused data
• How to quantify the quality of measurements with many
metabolites, and many samples?
45. Error model for 1 metabolite
QC sample -> RSD
• Error models:
- RSD using 1 QC sample
- 2-component, using study samples
• Good error description
- sufficient # samples
- large I-range
[Figure: Standard Deviation (St.D) versus Mean Intensity (I), for the QC sample and the study samples]
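The two error descriptions named on this slide can be written down compactly. The RSD is the standard deviation of repeated QC measurements divided by their mean. For the 2-component model, the sketch assumes the common form with an additive and a proportional term; the slide does not give the formula, so treat this form, the parameter names and all numbers as assumptions.

```python
import math
import statistics

def rsd(replicates):
    """Relative standard deviation of repeated measurements of one QC sample."""
    return statistics.stdev(replicates) / statistics.mean(replicates)

def two_component_sd(intensity, sd_additive, rsd_proportional):
    """Assumed two-component error model:
    sd(I) = sqrt(sd_additive^2 + (rsd_proportional * I)^2).
    The additive part dominates at low intensity, the proportional part at high."""
    return math.sqrt(sd_additive ** 2 + (rsd_proportional * intensity) ** 2)

# Hypothetical repeated measurements of one metabolite in a QC sample.
qc = [98.0, 102.0, 100.0, 101.0, 99.0]
qc_rsd = rsd(qc)

# Predicted standard deviation at low and high intensity (made-up parameters).
sd_low = two_component_sd(0.0, sd_additive=3.0, rsd_proportional=0.1)
sd_high = two_component_sd(40.0, sd_additive=3.0, rsd_proportional=0.1)
```

A single QC sample only characterises the error at one intensity, which is why fitting the 2-component model needs enough samples over a large intensity range.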
46. Figure of merit for data from 1 platform
Median: F-50 = 0.1
90th-percentile: F-90 = 0.35
[Figure: St.D versus I for example variables (Var. 15, Var. 365, Var. 118, Var. 213); histogram of the number of peaks with F-50 and F-90 marked]
(Van Batenburg et al. Analytical Chemistry, 2011)
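Read literally, F-50 and F-90 are the median and the 90th percentile of a per-peak error measure across all peaks of a platform. The sketch below assumes that measure is one number per peak (for example an RSD) and uses a simple linear-interpolation percentile. The exact definition in the cited paper may differ, and the values are invented.

```python
def percentile(values, q):
    """Percentile by linear interpolation between order statistics (0 <= q <= 100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical per-peak error measures for one platform (11 peaks).
peak_error = [0.05, 0.08, 0.10, 0.10, 0.12, 0.20, 0.30, 0.50, 0.07, 0.09, 0.15]

f50 = percentile(peak_error, 50)  # typical peak quality
f90 = percentile(peak_error, 90)  # quality of the poorer peaks
```

Reporting both numbers summarises a whole platform in two values: how good a typical peak is, and how bad the worst tenth gets.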
47. Two-step data fusion
GC/MS: J1 = 82 peaks
LC/MS: J2 = 49 peaks
• Step 1:
Compute figures of merit for each platform
48. Two-step data fusion: MB-MLPCA
• Step 2: Multi-block PCA with weighting by figures of merit
[Figure: blocks X1 (Amino acids) and X2 (Lipids) combined using a fused error covariance built from the estimated error variances]
• Method needs good estimation of error variance by
– Repeats
– QC samples
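The idea of weighting blocks by measurement quality before fusion can be illustrated with a deliberately simplified sketch: each variable is divided by its estimated error standard deviation and the blocks are then concatenated. This is not MB-MLPCA itself. It is the special case of uncorrelated errors with one error variance per variable, and the PCA step is omitted. Function names and numbers are hypothetical.

```python
def weight_block(block, error_sd):
    """Divide every column of a block by that column's estimated error SD."""
    return [[value / sd for value, sd in zip(row, error_sd)] for row in block]

def fuse(blocks, error_sds):
    """Weight each block by its error SDs, then concatenate blocks sample by sample."""
    weighted = [weight_block(b, s) for b, s in zip(blocks, error_sds)]
    return [sum(rows, []) for rows in zip(*weighted)]

# Hypothetical data: 2 samples, an amino acid block (2 peaks) and a lipid block (1 peak).
amino_acids = [[10.0, 20.0], [12.0, 18.0]]
lipids = [[100.0], [140.0]]
sd_amino_acids = [1.0, 2.0]  # assumed error SD per peak
sd_lipids = [20.0]

fused = fuse([amino_acids, lipids], [sd_amino_acids, sd_lipids])
```

After weighting, a noisy variable contributes less to a subsequent PCA than a precisely measured one, regardless of which platform it came from.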
49. Realistic simulations
using GCMS and
LCMS data
• Error variance estimated
from duplicates
• True error variance
• Estimating variance from
duplicates is problematic.
• Use a mix of QC samples and
repeats.
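For reference, the usual estimate of error variance from duplicate measurements is the sum of squared within-pair differences divided by twice the number of pairs. The sketch below shows that estimate on invented numbers. Each pair contributes only one degree of freedom, which is one plausible reason why estimates from a small number of duplicates are unreliable.

```python
def variance_from_duplicates(pairs):
    """Pooled error variance from duplicate measurements:
    sum of squared within-pair differences / (2 * number of pairs)."""
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

# Hypothetical duplicate measurements of one metabolite in four samples.
duplicates = [(10.0, 12.0), (20.0, 19.0), (15.0, 15.0), (8.0, 11.0)]
error_variance = variance_from_duplicates(duplicates)
```

Pooling QC replicates with such duplicates increases the degrees of freedom behind the estimate.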