Inferring microbial community function from taxonomic composition
Morgan G.I. Langille1,*, Jesse R.R. Zaneveld2, J Gregory Caporaso3, Joshua Reyes4,
Dan Knights5, Daniel McDonald6, Rob Knight5, Robert G. Beiko1, Curtis Huttenhower4
1Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada; 2Dept. of Microbiology, Oregon State University, Corvallis, OR, USA; 3Dept. of Computer Science, Northern Arizona University, Flagstaff, AZ, USA; 4Dept. of Biostatistics, Harvard School of Public Health, Boston, MA, USA; 5Dept. of Computer Science, University of Colorado, Boulder, CO, USA; 6Biofrontiers Institute, University of Colorado, Boulder, CO, USA; *morgangilangille@gmail.com
Abstract
It is often most efficient to characterize microbial communities using taxonomic markers such as the 16S ribosomal small subunit rRNA gene. The 16S gene is typically used to describe the organisms or taxonomic units present in a sample, but data from such markers do not inherently reveal the molecular functions or ecological roles of members of a microbial community. We have developed and validated a novel computational method that takes a set of observed taxonomic abundances and infers abundance profiles of enzymes and pathways from multiple functional classification schemes (KEGG, PFAM, COG, etc.). We use ancestral state reconstruction to determine approximate genomic content, taking into account 16S copy number and known functional abundance profiles from all currently available microbial genomes. We have evaluated the accuracy of this inference for different groups of taxa and for different areas of biological function. Our method, implemented as the PI-CRUST software (Phylogenetic Investigation of Communities by Reconstruction of Unobserved STates), allows 16S metagenomic based studies to be extended to predict the functional abilities of microbiomes as well as to compare expected versus observed functions in shotgun based metagenomic experiments.
3. Genome Validation
3.1 Method
1) Remove a single genome from our reference dataset (pretending it has not been sequenced)
2) Use PI-CRUST to predict the functional abundances for our “unknown” genome using only its 16S gene
3) Compare PI-CRUST predictions vs. the known functional abundances of our genome
4) Repeat for all completed genomes (>2000)
5) Plot the distribution of accuracy values for each genome (3.2) or each functional group (3.3)
3.2 PI-CRUST accuracy for completed genomes
[Figure panels: “Using Various Ancestral State Reconstruction”; “Distance to nearest genome affects accuracy”]
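The leave-one-out procedure in 3.1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the PI-CRUST code: the “Nearest Neighbour” baseline stands in for the predictor, Pearson correlation is assumed as the accuracy measure (the poster does not state which metric is used), and the genomes, trait counts and 16S distances are invented toy values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences (assumed metric)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def nearest_neighbour_predict(target, distances, traits):
    """Copy the profile of the reference genome with the smallest 16S distance."""
    others = [g for g in traits if g != target]
    nearest = min(others, key=lambda g: distances[frozenset((target, g))])
    return traits[nearest]

def leave_one_out(distances, traits):
    """Hold out each genome in turn, predict it, and score the prediction."""
    scores = {}
    for genome in traits:
        predicted = nearest_neighbour_predict(genome, distances, traits)
        scores[genome] = pearson(predicted, traits[genome])
    return scores

# Toy functional abundances per genome (hypothetical values).
traits = {
    "A": [2, 0, 5, 1],
    "B": [2, 0, 4, 1],
    "C": [0, 3, 1, 6],
}
# Toy pairwise 16S distances (hypothetical values).
distances = {
    frozenset(("A", "B")): 0.02,
    frozenset(("A", "C")): 0.30,
    frozenset(("B", "C")): 0.31,
}
scores = leave_one_out(distances, traits)
```

In this toy example, genomes A and B are close relatives and predict each other well, while C has no close relative and is predicted poorly, which mirrors the poster's point that distance to the nearest genome affects accuracy.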
1. PI-CRUST Software Pipeline
1.1 Starting Data Sources (Internally used by PI-CRUST)
• Entire GreenGenes 16S reference tree.
• A functional “Trait Table” for all completed genomes (e.g. KEGG, PFAM, etc.). This contains abundances of each functional category for each genome in the IMG database.
• 16S copy number information for each completed genome in IMG (used to normalize OTU tables).
• GreenGenes identifier to IMG completed genomes map (to link information we have about completed genomes to tips in our reference tree).
1.2 PI-CRUST: Genome Functional Predictions
[Flowchart. Inputs: 16S Copy Number (completed genomes only); Genome Functional Table (completed genomes only); Reference 16S Tree (GreenGenes). Steps: Prune taxa with no genome information; Infer ancestral genome traits; Predict functional compositions. Outputs: 16S Copy Number Predictions; Functional Trait Predictions.]
[Tree diagram labels: Known functional composition (from sequenced genome); Inferred ancestral functional composition; Predicted functional composition (for unsequenced genome).]
1.3 User Input
• “OTU table”: number of OTUs (with GreenGenes identifiers) per sample
1.4 PI-CRUST: Metagenome Functional Predictions
[Flowchart: OTU Table and 16S Copy Number Predictions → Normalized OTU Table; Normalized OTU Table and Functional Trait Predictions → Metagenome Functional Predictions.]
[Figures for 3.2: axis “16S phylogenetic distance to nearest species”; annotation “Endosymbionts & Reduced Genomes”. Prediction methods compared:]
• “Random”: Each functional abundance is drawn at random from its distribution across all genomes.
• “Nearest Neighbour”: Functional profile from genome with closest 16S distance is used.
• “PIC”: Ancestral state reconstruction using least squares regression (APE R package).
• “WAGNER”: Ancestral state reconstruction using Wagner parsimony (Count package).
3.3 PI-CRUST accuracy for various functional groups
[Figure: axis “PI-CRUST Accuracy (for each SEED function)”]
The ability to predict functions from 16S varies depending on the functional class. Functions that are well conserved and evolve similarly to 16S have higher accuracy, such as “RNA metabolism” and “Cell Division and Cell Cycle”. Other groups that tend not to be inherited by vertical descent, such as “Phages, Prophages, Transposable Elements, Plasmids”, are not predicted as accurately.
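The metagenome prediction step in 1.4 can be illustrated with a small Python sketch. It assumes that normalization means dividing each OTU count by that OTU's predicted 16S copy number, and that the metagenome prediction is the abundance-weighted sum of each OTU's predicted functional counts. The function names, OTU identifiers and function labels ("F1", "F2") are hypothetical and the numbers are invented; this is not the PI-CRUST implementation.

```python
def normalize_otu_table(otu_table, copy_number):
    """Divide each OTU count by that OTU's predicted 16S copy number (assumed step)."""
    return {
        sample: {otu: count / copy_number[otu] for otu, count in counts.items()}
        for sample, counts in otu_table.items()
    }

def predict_metagenome(normalized, trait_predictions):
    """Sum each OTU's predicted function counts, weighted by normalized abundance."""
    result = {}
    for sample, counts in normalized.items():
        totals = {}
        for otu, abundance in counts.items():
            for function, per_genome in trait_predictions[otu].items():
                totals[function] = totals.get(function, 0.0) + abundance * per_genome
        result[sample] = totals
    return result

# Toy inputs (hypothetical values).
otu_table = {"sample1": {"otu1": 10, "otu2": 4}}   # 16S read counts per OTU
copy_number = {"otu1": 5, "otu2": 2}               # predicted 16S copies per genome
trait_predictions = {                              # predicted function counts per genome
    "otu1": {"F1": 1, "F2": 0},
    "otu2": {"F1": 2, "F2": 3},
}

normalized = normalize_otu_table(otu_table, copy_number)
metagenome = predict_metagenome(normalized, trait_predictions)
```

Here both OTUs normalize to an abundance of 2.0 organisms, so the 10 raw reads of "otu1" do not outweigh "otu2" merely because its genome carries more 16S copies.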
2. Metagenome Validation
2.1 Method
1) Obtain microbiome samples with both whole metagenomic and 16S sequencing
2) Use PI-CRUST with 16S data to predict functions for samples
3) Compare PI-CRUST predictions with functions observed from sequencing
2.2 PI-CRUST accuracy on HMP samples
4. Concluding Remarks
4.1 Discussion
• Genome content has been shown in the past to vary widely even in closely related species. However, this may not be typical for the majority of bacterial and archaeal species. Our ability to predict the functions encoded in an organism based solely on its 16S gene and knowledge from the thousands of completed genomes suggests that gene content often has good phylogenetic correlation with 16S.
• PI-CRUST allows 16S-only studies to be expanded to include information about functional abundances.
• Studies with full metagenomic sequencing can use PI-CRUST to identify functions that are observed but not expected based on their 16S profiles (i.e. the taxa that are present in the sample).
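One way to picture the comparison of predicted and observed functions is the sketch below, which flags functions whose observed relative abundance is well above the 16S-based expectation. The ratio threshold, the pseudocount, the function names and the toy counts are all assumptions made for illustration; the poster does not specify how such functions would be identified.

```python
def relative_abundance(counts):
    """Scale counts so they sum to 1 (left unchanged if the total is zero)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else dict(counts)

def observed_not_expected(predicted, observed, min_ratio=2.0, pseudo=1e-6):
    """Return functions whose observed relative abundance is at least
    min_ratio times the predicted one (threshold is an assumed choice)."""
    pred = relative_abundance(predicted)
    obs = relative_abundance(observed)
    flagged = {}
    for function in sorted(set(pred) | set(obs)):
        # The pseudocount avoids division by zero for functions absent from the prediction.
        ratio = (obs.get(function, 0.0) + pseudo) / (pred.get(function, 0.0) + pseudo)
        if ratio >= min_ratio:
            flagged[function] = ratio
    return flagged

# Toy per-function counts for one sample (hypothetical values).
predicted = {"F1": 50, "F2": 40, "F3": 10}   # from 16S-based prediction
observed = {"F1": 45, "F2": 20, "F3": 35}    # from shotgun sequencing
flagged = observed_not_expected(predicted, observed)
```

With these toy values only "F3" is flagged, because it makes up 35% of the observed profile but 10% of the predicted one.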
4.2 Availability & Future Plans
• PI-CRUST is still under development but will be freely available under the GPL at:
http://picrust.sourceforge.net
• Various methods of ancestral state reconstruction and confidence weighting are still being evaluated.
• Evaluation of PI-CRUST on other paired metagenomic and 16S datasets is underway.
[Figure for 2.2: axis “PI-CRUST predicted abundance based on 16S data”. Each point represents the predicted vs. observed relative abundance for a single KEGG category.]
Acknowledgements
• MGIL is the recipient of an IHMC travel award funded by the NIH.
• MGIL and RGB are supported by a CIHR emerging team grant.