Implementation of Metabolomic Data Normalization Strategies
Dmitry Grapov, PhD
Summary
Five normalization methods were compared. The combination of qc-LOESS and cubic
splines showed the best performance, based on within-batch and between-batch variable
relative standard deviations (RSDs) for QCs. This approach was used to normalize the
sample measurements, which were then analyzed using principal components analysis.
This analysis identified an unknown source of variance among the samples (batches
~1-7 versus 8-25) which was absent from the QC samples and was concluded to stem
from biological variability due to the experimental design.
Results
The complete data set, acquired over a one-year period (3/6/2013 to
2/20/2014), consisted of 1262 measurements of 319 variables. Analytical variance over
the duration of the data acquisition was estimated based on 105 evenly interspersed
quality control (QC) samples (QC:sample ratio of 1:10). To aid the overview of temporal
trends, the full data acquisition time was segmented into 25 batches of 1-3 days each
(median samples per batch, 53; range, 13 to 84).
QC samples were used to evaluate five common data normalization procedures:
quantile, cubic splines, cyclic LOESS [1], batch ratio and qc-LOESS [2]. Normalization
performance was assessed based on within-batch (Figure 1A) and between-batch
(Figure 1B and C) variable relative standard deviations (RSDs) of QC samples. The
qc-LOESS approach, which is a modification of the LOESS procedure (Figure 2), displayed
the best performance for QC samples (median batch RSD, 30%; range, 20-42%; raw data,
35%; range, 19-51%), with 78% of normalized variables showing RSD<40% compared to 65%
for raw data. However, 113 variables (35%) displayed inconsistent trends between the
qc-LOESS model training and test sets and were identified as inappropriate for the
qc-LOESS normalization. These variables were instead normalized using the cubic splines
method, which does not require a similar consistency criterion and showed the second-best
performance for QC samples (median batch RSD, 31%; range, 18-44%; 77% of variables
with RSD<40%). The combination of qc-LOESS and cubic splines normalizations was
shown to improve data quality by reducing within-batch and between-batch analytical
variance (Figure 3).
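
The consistency criterion between training and test QC trends can be sketched as follows. This is a minimal illustration in Python, not the Devium implementation: the basic local linear smoother, the `span` default, the every-third-QC hold-out, the 20-point evaluation grid and the `min_r` correlation threshold are all assumptions, and the cubic splines interpolation step is replaced here by evaluating both fits on a shared grid.

```python
import math

def loess(x, y, x_new, span=0.75):
    """Local linear regression with tricube weights (a basic LOESS)."""
    k = max(3, int(math.ceil(span * len(x))))  # points per local window
    fitted = []
    for x0 in x_new:
        dists = sorted(abs(xi - x0) for xi in x)
        # bandwidth: distance to the k-th neighbour, padded to avoid zero weights
        h = 1.001 * dists[min(k, len(x)) - 1] + 1e-12
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def qc_trend_is_consistent(order, intensity, span=0.75, min_r=0.5):
    """Fit LOESS to 2/3 (train) and 1/3 (test) of the QCs for one variable
    and compare the two trends over their shared acquisition range."""
    idx = sorted(range(len(order)), key=lambda i: order[i])
    test = idx[2::3]                    # hypothetical split: every third QC
    held_out = set(test)
    train = [i for i in idx if i not in held_out]
    lo = max(order[train[0]], order[test[0]])
    hi = min(order[train[-1]], order[test[-1]])
    grid = [lo + (hi - lo) * g / 19 for g in range(20)]
    fit_train = loess([order[i] for i in train],
                      [intensity[i] for i in train], grid, span)
    fit_test = loess([order[i] for i in test],
                     [intensity[i] for i in test], grid, span)
    r = pearson(fit_train, fit_test)
    return r >= min_r, r
```

A variable failing such a check would be routed to the cubic splines normalization instead of qc-LOESS.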
Principal components analysis was used to evaluate raw and normalized QC and
sample measurements for batch effects (Figure 4). Raw QC data displayed slight
differences between batches 1-7 and 8-25 (Figure 4A, red points), which were removed
after normalization (Figure 4B). However, both raw and normalized samples displayed a
large mode of variance between samples in batches ~1-7 and all other batches
(Figure 4C and D). After confirming that this trend was not due to the biological design
of the study, based on evaluation of the same samples measured by an orthogonal
metabolomic platform (LC-Q-TOF), a semi-supervised approach of model-based
clustering was used to define the members of the unique modes of variance. A linear
model was used to adjust the normalized data based on the clusters defined by
model-based clustering (Figure 5).
Methods
Principal components analysis (PCA) on autoscaled data was used to overview
raw and normalized data and QC sample variance based on acquisition batch, and to
identify 1 outlier QC sample (Bio Rec 94), which was removed from all further analyses
(Figure 6). Quantile, cubic splines and cyclic LOESS normalizations were implemented
without cross-validation [1]. Within- and between-batch RSDs were calculated based on
batch and aggregated medians. Batch ratio (BR) and qc-LOESS were implemented using
cross-validation, where 2/3 of the QC samples were used to train the model, which was
then applied to the remaining 1/3 of the data. For consistency with the other
normalization methods, performance is reported for the combined training and test sets.
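
For illustration, the RSD summaries could be computed along the following lines. The report only states that RSDs were "calculated based on batch and aggregated medians", so this Python sketch shows one plausible reading and is an assumption: the within-batch value is taken as the median of a variable's per-batch RSDs, and the between-batch value as the RSD of its per-batch medians. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def rsd(values):
    """Relative standard deviation in percent (sample SD / mean * 100)."""
    return stdev(values) / mean(values) * 100.0

def batch_rsds(intensities, batches):
    """Summarise one variable's QC intensities by acquisition batch.

    Returns (within, between), where
      within  = median of the per-batch RSDs
      between = RSD of the per-batch medians
    Batches with fewer than 2 QCs are skipped for the within-batch RSD.
    """
    by_batch = defaultdict(list)
    for value, batch in zip(intensities, batches):
        by_batch[batch].append(value)
    within = median(rsd(v) for v in by_batch.values() if len(v) > 1)
    between = rsd([median(v) for v in by_batch.values()])
    return within, between

def fraction_below(rsds, cutoff=40.0):
    """Share of variables whose RSD falls below the cutoff (e.g. RSD<40%)."""
    return sum(r < cutoff for r in rsds) / len(rsds)

# Toy example: one variable, two batches of three QCs each
within, between = batch_rsds([100, 110, 90, 200, 220, 180], [1, 1, 1, 2, 2, 2])
```

In this toy example each batch has a 10% RSD on its own, while the shift between the two batch medians gives a much larger between-batch RSD, which is the pattern a batch correction is meant to remove.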
BR normalization implements a batch-specific correction factor for each
variable, calculated as the ratio of the within-batch to the study-wide variable
medians. The qc-LOESS normalization is an adaptation of the LOESS normalization which
uses QC samples, but also includes a step to determine whether the LOESS-based
normalization is applicable to the data by testing the correlation between the LOESS
models for the training and test sets (cubic splines interpolated). The LOESS model span
was selected using leave-one-out cross-validation on the training data. Variables
inappropriate for the qc-LOESS normalization were instead normalized by the cubic splines
method. Cubic splines normalization displayed the best performance of all algorithms for
variables with intensities <1000, but displayed slightly higher RSD compared to no
normalization for variables with intensities >1000 (Figure 6). The combination of
qc-LOESS and cubic splines was used to fully normalize the data set, but variables with
intensities >1000 that show poor cubic splines performance could instead be presented
as raw (non-normalized) data.
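
As a minimal sketch of the batch ratio idea, the correction factor for one variable can be computed from QC samples and then applied to all measurements in the same batch. The function names are hypothetical, and applying the factor by division is an assumption; the report only defines how the factor itself is calculated.

```python
from statistics import median

def batch_ratio_factors(qc_values, qc_batches):
    """Per-batch factor: within-batch QC median / study-wide QC median."""
    overall = median(qc_values)
    by_batch = {}
    for value, batch in zip(qc_values, qc_batches):
        by_batch.setdefault(batch, []).append(value)
    return {batch: median(v) / overall for batch, v in by_batch.items()}

def batch_ratio_normalize(values, batches, factors):
    """Divide each measurement by the factor of its acquisition batch."""
    return [value / factors[batch] for value, batch in zip(values, batches)]

# Toy example: batch 2 reads roughly twice as high as batch 1
qc = [100, 110, 90, 200, 220, 180]
qc_batch = [1, 1, 1, 2, 2, 2]
factors = batch_ratio_factors(qc, qc_batch)
samples_norm = batch_ratio_normalize([50, 400], [1, 2], factors)
qc_norm = batch_ratio_normalize(qc, qc_batch, factors)
```

After the correction, the QC medians of both batches coincide with the study-wide QC median, so the remaining spread reflects within-batch variance only.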
Model-based clustering was carried out using Gaussian finite mixture models
fitted by expectation-maximization (EM), initialized by hierarchical clustering and
optimized by the Bayesian information criterion (BIC) [3]. The best two-cluster model
was selected based on BIC. Analyte-specific linear models were used to adjust sample
means based on the model-based cluster memberships.
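
The cluster-membership adjustment can be illustrated with a small sketch. For a linear model with cluster membership as the only (categorical) predictor, the fitted values are the cluster means, so the adjustment reduces to removing each cluster's mean. Re-centering on the grand mean is an assumption made here for illustration, and the function name is hypothetical.

```python
from statistics import mean

def adjust_for_cluster(values, clusters):
    """Remove cluster-level mean differences for one analyte.

    Equivalent to taking the residuals of the model `value ~ cluster`
    and adding back the grand mean (assumed re-centering choice).
    """
    grand = mean(values)
    groups = {}
    for value, cluster in zip(values, clusters):
        groups.setdefault(cluster, []).append(value)
    cluster_mean = {c: mean(v) for c, v in groups.items()}
    return [value - cluster_mean[c] + grand
            for value, c in zip(values, clusters)]

# Toy example: two clusters offset by 10 units
adjusted = adjust_for_cluster([10, 12, 14, 20, 22, 24], [1, 1, 1, 2, 2, 2])
```

Within-cluster differences between samples are preserved, while the offset between the two clusters is removed.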
All analyses were implemented in R v3.0.2 [4] using the Devium package
(https://github.com/dgrapov/devium).
Figure 1. Overview of common data normalization approaches applied to the QC samples.
A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the
logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is
preferable; B) number of variables displaying RSDs in the specified intervals for each normalization method;
and C) number of batches displaying median variable RSDs in the specified intervals.
Figure 2. Modified workflow for qc-LOESS normalization.
Figure 3. Comparison of raw and normalized sample relative standard deviations.
A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the
logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is
preferable; B) number of variables displaying RSDs in the specified intervals for each normalization method;
and C) number of batches displaying median variable RSDs in the specified intervals.
Figure 4. PCA scores of raw and normalized samples and QCs, annotated by batch and acquisition order.
PCA sample scores for the first 2 components for A) raw QCs, B) normalized QCs, C) raw samples and D)
normalized samples.
Figure 5. PCA scores of normalized samples before and after covariate adjustment defined by
non-supervised model-based clustering.
PCA sample scores for the first 2 components for A) clusters defined by model-based clustering and
B) cluster-membership adjusted data.
Figure 6. Principal components analysis of QCs, with annotation of acquisition order (sample label).
PCA scores from the first two components displaying QC sample label IDs (duplicated labels are expressed
as X.1). Sample 94, circled in red, was identified as an outlier (no other QC scores with similar dates in
its proximity).
Figure 6. Performance of the cubic splines normalization on QC samples.
References
1. Kohl, S.M., et al., State-of-the art data normalization methods improve NMR-based
metabolomic analysis. Metabolomics, 2012. 8(Suppl 1): p. 146-160.
2. Dunn, W.B., et al., Procedures for large-scale metabolic profiling of serum and
plasma using gas chromatography and liquid chromatography coupled to mass
spectrometry. Nat Protoc, 2011. 6(7): p. 1060-83.
3. Fraley, C. and A.E. Raftery, Model-based Clustering, Discriminant Analysis
and Density Estimation. Journal of the American Statistical Association,
2002. 97: p. 611-631.
4. R Development Core Team, R: A language and environment for statistical
computing. R Foundation for Statistical Computing, 2011. ISBN 3-900051-07-0,
URL http://www.R-project.org/.

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 

Case Study: Overview of Metabolomic Data Normalization Strategies

  • 1. Implementation of Metabolomic Data Normalization Strategies
    Dmitry Grapov, PhD
    Summary
    Five normalization methods were compared, of which the combination of qc-LOESS and cubic splines showed the best performance based on within-batch and between-batch variable relative standard deviations (RSDs) for QCs. This approach was used to normalize the sample measurements, which were then analyzed using principal components analysis. This analysis identified an unknown source of variance among the samples (batches ~1-7 versus 8-25) that was absent from the QC samples and was concluded to stem from biological variability due to the experimental design.
    Results
    The complete data set, acquired over a one-year period (3/6/2013 to 2/20/2014), consisted of 1262 measurements of 319 variables. Analytical variance over the duration of the data acquisition was estimated from 105 evenly interspersed quality control (QC) samples (1:10 QC:sample ratio). To aid the overview of temporal trends, the full data acquisition time was segmented into 1-3 day increments, giving 25 batches (median samples per batch, 53; range, 13 to 84). QC samples were used to evaluate five common data normalization procedures: quantile, cubic splines, cyclic LOESS [1], batch ratio and (qc-)LOESS [2]. Normalization performance was assessed based on within-batch (Figure 1A) and between-batch (Figure 1 B&C) variable RSDs of QC samples. The qc-LOESS approach, which is a modification of the LOESS procedure (Figure 2), displayed the best performance for QC samples (median batch RSD, 30%; range, 20-42%; raw data, 35%; range, 19-51%).
  • 2. After qc-LOESS normalization, 78% of variables showed RSD < 40%, compared with 65% for raw data. However, 113 variables (35%) displayed inconsistent trends between the qc-LOESS model training and test sets and were identified as inappropriate for the qc-LOESS normalization. These variables were instead normalized using the cubic splines method, which does not require a similar consistency criterion and showed the second-best performance for QC samples (median batch RSD, 31%; range, 18-44%; 77% of variables with RSD < 40%). The combination of qc-LOESS and cubic splines normalizations was shown to improve data quality by reducing within-batch and between-batch analytical variance (Figure 3). Principal components analysis was used to evaluate raw and normalized QC and sample measurements for batch effects (Figure 4). Raw QC data displayed slight differences between batches 1-7 and 8-25 (Figure 4A, red points), which were removed after normalization (Figure 4B). However, both raw and normalized samples displayed a large mode of variance between the samples in batches ~1-7 and those in all other batches (Figure 4 C&D). After confirming that this trend was not due to the biological design of the study, based on evaluation of the same samples measured by an orthogonal metabolomic platform (LC-Q-TOF), a semi-supervised approach of model-based clustering was used to define the members of the unique modes of variance. A linear model was used to adjust the normalized data based on the clusters defined by model-based clustering (Figure 5).
    Methods
  • 3. Principal components analysis (PCA) on autoscaled data was used to overview raw and normalized sample and QC variance by acquisition batch, and to identify one outlier QC sample (Bio Rec 94), which was removed from all further analyses (Figure 6). Quantile, cubic splines and cyclic LOESS normalizations were implemented without cross-validation [1]. Within- and between-batch RSDs were calculated based on batch and aggregated medians. Batch ratio (BR) and qc-LOESS were implemented using cross-validation, where 2/3 of the QC samples were used to train the model, which was then applied to the remaining 1/3 of the data; for consistency with the other normalization methods, performance is reported for the combined training and test sets. BR normalization implements a batch-specific correction factor for each variable, calculated as the ratio of the within-batch to the study-wide variable medians. The qc-LOESS normalization is an adaptation of the LOESS normalization that uses QC samples, but also includes a step to determine whether the LOESS-based normalization is applicable to the data by testing the correlation between the LOESS models for the training and test sets (cubic splines interpolated). The LOESS model span was selected using leave-one-out cross-validation on the training data. Variables inappropriate for the qc-LOESS normalization were instead normalized by the cubic splines method. Cubic splines normalization displayed the best performance of all algorithms for variables with intensities < 1,000, but displayed slightly higher RSD than no normalization for variables with intensities > 1,000 (Figure 6). The combination of qc-LOESS and cubic splines was used to fully normalize the dataset, but variables with intensities > 1,000 that show poor cubic splines performance could instead be presented as raw (non-normalized) data.
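As an illustration of the batch ratio (BR) correction and the RSD criterion described above, the following is a minimal Python sketch. The original analysis used R and the Devium package, so this is not the author's code. It assumes that the correction factor for each batch is estimated from QC samples only and that each measurement is divided by that factor; the function names `rsd` and `batch_ratio_normalize` and the toy intensities are hypothetical.

```python
from statistics import mean, median, stdev


def rsd(values):
    """Relative standard deviation in percent: 100 * sd / mean."""
    return 100.0 * stdev(values) / mean(values)


def batch_ratio_normalize(values, batches, is_qc):
    """Batch ratio normalization for a single variable.

    Assumption: the batch factor is the QC median within the batch
    divided by the study-wide QC median, and each value is divided by
    the factor of its own batch.
    """
    qc_all = [v for v, q in zip(values, is_qc) if q]
    study_median = median(qc_all)
    factors = {}
    for b in set(batches):
        qc_in_batch = [
            v for v, bb, q in zip(values, batches, is_qc) if q and bb == b
        ]
        factors[b] = median(qc_in_batch) / study_median
    return [v / factors[b] for v, b in zip(values, batches)]


# Toy data: one variable, two batches, the second with a two-fold drift.
values = [100.0, 90.0, 110.0, 120.0, 200.0, 180.0, 220.0, 240.0]
batches = [1, 1, 1, 1, 2, 2, 2, 2]
is_qc = [True, False, True, False, True, False, True, False]

normalized = batch_ratio_normalize(values, batches, is_qc)

# Compare the between-batch QC spread before and after the correction.
qc_raw = [v for v, q in zip(values, is_qc) if q]
qc_norm = [v for v, q in zip(normalized, is_qc) if q]
print(f"QC RSD raw: {rsd(qc_raw):.1f}%, normalized: {rsd(qc_norm):.1f}%")
```

In this toy case the QC RSD drops because the drift is a constant multiplicative shift per batch. A single factor per batch cannot remove trends within a batch, which is where a smoother fitted against acquisition order, such as LOESS, becomes relevant.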
  • 4. Model-based clustering was carried out by fitting finite Gaussian mixture models with the EM algorithm, initialized by hierarchical clustering, with model selection by the Bayesian information criterion (BIC) [3]. The best two-cluster model was selected based on BIC. Analyte-specific linear models were used to adjust sample means based on the model-based cluster memberships. All analyses were implemented in R v3.0.2 [4] using the Devium package (https://github.com/dgrapov/devium).
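The adjustment step can be sketched as follows, again in Python rather than the R used in the original analysis. The sketch assumes that cluster memberships are already available (for example from a two-component Gaussian mixture) and that the linear model for each analyte contains only the cluster term, so the adjusted value is the model residual re-centred on the analyte's grand mean. The function name `adjust_for_cluster` and the example values are hypothetical.

```python
from statistics import mean


def adjust_for_cluster(values, clusters):
    """Remove the cluster effect from one analyte.

    Fits the one-factor linear model y ~ cluster, whose fitted values
    are the cluster means, and returns the residuals shifted back to
    the grand mean so the overall level of the analyte is preserved.
    """
    grand_mean = mean(values)
    cluster_means = {
        c: mean([v for v, k in zip(values, clusters) if k == c])
        for c in set(clusters)
    }
    return [v - cluster_means[c] + grand_mean for v, c in zip(values, clusters)]


# Toy analyte: cluster 2 sits about 10 units above cluster 1.
values = [10.0, 12.0, 20.0, 22.0]
clusters = [1, 1, 2, 2]

adjusted = adjust_for_cluster(values, clusters)
print(adjusted)
```

After the adjustment both clusters share the same mean while the spread within each cluster is unchanged, which is the behaviour expected when the cluster term is treated as a nuisance covariate.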
  • 5. Figure 1. Overview of common data normalization approaches applied to the QC samples (panels A, B and C).
  • 6. Figure 1 (continued). A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is preferable; B) number of variables displaying RSDs in the specified intervals for each normalization method; and C) number of batches displaying median variable RSDs in the specified intervals.
  • 7. Figure 2. Modified workflow for qc-LOESS normalization. A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is preferable B) number of variables displaying RSDs in the specified intervals for each normalization method and C) number of batches displaying median variable RSDs in the specified intervals.
  • 8. Figure 3. Comparison of raw and normalized sample relative standard deviations (panels A, B and C).
  • 9. A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is preferable B) number of variables displaying RSDs in the specified intervals for each normalization method and C) number of batches displaying median variable RSDs in the specified intervals.
  • 10. Figure 4. PCA scores of raw and normalized samples and QCs, annotated by batch and acquisition order. PCA sample scores for the first two components for A) raw QCs, B) normalized QCs, C) raw samples and D) normalized samples.
  • 11. Figure 5. PCA scores of normalized samples before and after adjustment for the covariate defined by unsupervised model-based clustering. PCA sample scores for the first two components for A) the clusters defined by model-based clustering and B) the cluster-membership adjusted data.
  • 12. Figure 6. Principal components analysis of QCs, with annotation of acquisition order (sample label). A) PCA scores from the first two components displaying QC sample label IDs (duplicated labels are expressed as X.1). Sample 94, circled in red, was identified as an outlier (no other QC scores with similar acquisition dates lie in its proximity).
  • 13. Figure 6. Performance of the cubic splines normalization on QC samples.
  • 14. References
    1. Kohl, S.M., et al., State-of-the art data normalization methods improve NMR-based metabolomic analysis. Metabolomics, 2012. 8(Suppl 1): p. 146-160.
    2. Dunn, W.B., et al., Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat Protoc, 2011. 6(7): p. 1060-83.
    3. Fraley, C. and Raftery, A.E., Model-based Clustering, Discriminant Analysis and Density Estimation. Journal of the American Statistical Association, 2002. 97: p. 611-631.
    4. R Development Core Team, R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2011. ISBN 3-900051-07-0, URL http://www.R-project.org/.