Hans-Joachim Ruscheweyh: Pooling Metagenomes in MEGAN Based on Environmental Parameters
1. Introduction MEGAN Metadata Pooling Datasets Summary & Conclusion
Pooling metagenomes in MEGAN based on
environmental parameters
Hans-Joachim Ruscheweyh
Center for Bioinformatics, Tuebingen University
June 15, 2011
1 / 27 Hans-Joachim Ruscheweyh Pooling metagenomes
1 Introduction Metagenomics
Unculturable Microbes
Typical Metagenomic Samples
Pipeline
2 MEGAN
MEGAN Introduction
Taxonomic & Functional Analysis
Comparison Analysis
PostgreSQL
3 Metadata
What is Metadata?
Using Metadata to pool Datasets
4 Pooling Datasets
Basic Idea
Combined Datasets
MetaData Analyzer
5 Summary & Conclusion
Metagenomics
The study of DNA of uncultured organisms
> 99% of all microbes cannot be cultured
A genome is the entire genetic information of a single
organism
A metagenome is the entire genetic information of an
assemblage of organisms
Typical Metagenomic Samples
Human microbiome
Soil samples
Sea water samples
Seabed samples
Air samples
Medical samples
Ancient bones
Metagenomic Pipeline
A primer on metagenomics; Wooley et al. (2010)
Taxonomic Analysis
Tree reflects the NCBI taxonomy
Reads are compared against a reference database, e.g. NR
Reads are mapped onto the tree using the comparison results, based on the LCA algorithm
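
The mapping step can be illustrated with a minimal Python sketch of a lowest common ancestor (LCA) assignment. It assumes a toy parent map in place of the full NCBI taxonomy; the names `PARENT`, `path_to_root` and `lca` are hypothetical, and the sketch leaves out any score-based filtering of matches that a real tool may apply.

```python
# Toy parent map standing in for the NCBI taxonomy (hypothetical excerpt).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Vibrio cholerae": "Gammaproteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def path_to_root(taxon):
    """Return the nodes from the given taxon up to the root, lowest first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(taxa):
    """Return the lowest node that is an ancestor of every taxon hit by a read."""
    taxa = list(taxa)
    if not taxa:
        return None  # a read without matches stays unassigned
    common = set(path_to_root(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(path_to_root(taxon))
    # Walking upwards from the first taxon, the first shared node is the LCA.
    for node in path_to_root(taxa[0]):
        if node in common:
            return node
    return None
```

A read that matches only one species is placed on that species, while a read that matches species from different phyla moves up to their shared ancestor, here `Bacteria`.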
Functional Analysis - SEED
The tree contains the nodes of the SEED classification
Reads are mapped onto the SEED classification
www.theSEED.org
Comparing Datasets
Based on the (normalized) number of reads assigned to each node
Each color represents a dataset
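
The slide does not state how the read counts are normalized. One common choice, assumed here purely for illustration, is to scale every dataset to the size of the smallest one so that node counts become comparable. The function name `normalize` and the example counts are hypothetical.

```python
def normalize(counts_by_dataset):
    """Scale per-node read counts so every dataset has the same total.

    counts_by_dataset maps a dataset name to a dict {node: read count}.
    The target total is the size of the smallest dataset (an assumption).
    """
    totals = {name: sum(nodes.values()) for name, nodes in counts_by_dataset.items()}
    target = min(totals.values())
    return {
        name: {node: round(count * target / totals[name]) for node, count in nodes.items()}
        for name, nodes in counts_by_dataset.items()
    }


# Hypothetical read counts for two datasets of different size.
raw = {
    "A": {"Bacteria": 600, "Archaea": 400},
    "B": {"Bacteria": 150, "Archaea": 50},
}
normalized = normalize(raw)
```

After scaling, dataset `A` contributes 120 and 80 reads to the two nodes, which can be compared directly with the 150 and 50 reads of dataset `B`.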
DB Extension - PostgreSQL
MEGAN communicates with a PostgreSQL database
Many datasets are available in one database instance
Many users can operate on the same database instance
This avoids redundant copies of datasets, which are often large
http://www.postgresql.org/
What is Metadata?
Metadata are, for example, environmental parameters recorded
together with the actual metagenomic sample, e.g. collection
date, gender, health status, ...
Dataset        Month     Salinity  Ammonia
January_2PM    January   33.3      0.0
January_10PM   January   34.2      0.0
August_4AM     August    33.3      0.14
August_10AM    August    32.1      0.06
Datasets taken from: The taxonomic and functional diversity of microbes at a temperate coastal site: a ’multi-omic’
study of the seasonal and diel temporal variation; Gilbert et al. (2010)
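
Pooling by metadata can be sketched in Python as selecting dataset names with a Boolean condition on their metadata. The values are taken from the table above; the helper `pool` and the lambda predicates are hypothetical and do not reproduce the expression syntax of the MetaData Analyzer.

```python
# Metadata of the four example datasets, as listed in the table.
METADATA = {
    "January_2PM": {"Month": "January", "Salinity": 33.3, "Ammonia": 0.0},
    "January_10PM": {"Month": "January", "Salinity": 34.2, "Ammonia": 0.0},
    "August_4AM": {"Month": "August", "Salinity": 33.3, "Ammonia": 0.14},
    "August_10AM": {"Month": "August", "Salinity": 32.1, "Ammonia": 0.06},
}


def pool(metadata, predicate):
    """Return the sorted names of all datasets whose metadata satisfy the predicate."""
    return sorted(name for name, row in metadata.items() if predicate(row))


# Pools defined by a single attribute.
winter = pool(METADATA, lambda row: row["Month"] == "January")
summer = pool(METADATA, lambda row: row["Month"] == "August")

# Pool defined by a Boolean combination of two attributes.
salty_without_ammonia = pool(
    METADATA, lambda row: row["Salinity"] > 33.0 and row["Ammonia"] == 0.0
)
```

With these four datasets the combined condition happens to select the same two January samples as the `winter` pool.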
Basic Idea
Create two new datasets (winter, summer) from the four
BLAST files
Problems:
Doubles the space consumption
Is time-consuming
Idea:
Use database technology to avoid redundancy and to save time
and space
Primary & Combined Datasets in the Database
A primary dataset is a dataset created from the original
BLAST output and the reads file
A combined dataset is created from primary datasets
A combined dataset is created by using:
References to read and match data of the primary datasets
Optionally also the classification data of the primary
datasets
Hence, a combined dataset can be created in a time- and
space-efficient manner
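
The reference idea can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the database schema used by MEGAN: the classes `PrimaryDataset` and `CombinedDataset`, the read identifiers and the taxon assignments are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PrimaryDataset:
    name: str
    reads: dict  # read id -> assigned taxon; stands in for read, match and classification data


@dataclass
class CombinedDataset:
    name: str
    members: list  # references to PrimaryDataset objects; no read data is copied

    def classification(self):
        """Compute node counts on demand from the referenced primary datasets."""
        counts = Counter()
        for dataset in self.members:
            counts.update(dataset.reads.values())
        return counts


# Hypothetical toy data for two primary datasets.
jan_2pm = PrimaryDataset("January_2PM", {"r1": "Bacteria", "r2": "Archaea"})
jan_10pm = PrimaryDataset("January_10PM", {"r3": "Bacteria"})

# The combined dataset only stores references to its members.
winter = CombinedDataset("winter", [jan_2pm, jan_10pm])
```

Because `winter` holds references only, creating it costs almost nothing, and its classification always reflects the current content of the primary datasets.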
Analysis
Input: 8 primary datasets. Altogether ~100,000 reads, ~4
million matches, ~4.5 GB space
It takes ~50 minutes to load these datasets into the database
Three combined datasets (winter, spring, summer) are
created
Their creation takes ~30 seconds and needs ~40 MB
additional space
Alternatively, combined datasets can be created on the fly.
This takes less than a second and needs no additional
space
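
To put these figures in proportion, the following short Python calculation relates the cost of the stored combined datasets to the cost of the primary datasets. It uses only the approximate numbers quoted above and assumes 1 GB = 1000 MB; the variable names are illustrative.

```python
# Approximate figures from the analysis above.
primary_space_mb = 4.5 * 1000  # ~4.5 GB, assuming 1 GB = 1000 MB
combined_space_mb = 40         # additional space for the three combined datasets
load_seconds = 50 * 60         # ~50 minutes to load the primary datasets
combine_seconds = 30           # ~30 seconds to create the combined datasets

# Additional space and time, relative to the primary datasets, in percent.
space_overhead_pct = 100 * combined_space_mb / primary_space_mb
time_fraction_pct = 100 * combine_seconds / load_seconds

print(f"space overhead: {space_overhead_pct:.1f}%")
print(f"time relative to loading: {time_fraction_pct:.1f}%")
```

Under these assumptions the stored combined datasets add roughly 0.9% of space and take about 1% of the original loading time.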
Summary & Conclusion
MEGAN communicates with a PostgreSQL database
This gives the user access to many datasets
Many users can work on the database simultaneously
Primary datasets can be pooled to create combined
datasets
The MetaData Analyzer allows one to create combined
datasets by evaluating Boolean expressions on the
assigned metadata
This technique is highly space and time efficient
MEGAN v4 is freely available from
www-ab.informatik.uni-tuebingen.de/software/megan
Integrative analysis of environmental sequences using
MEGAN4, Daniel H. Huson, Suparna Mitra, Hans-Joachim
Ruscheweyh, Nico Weber, Stephan C. Schuster; submitted
2011
Thanks go to Daniel Huson, Suparna Mitra, Nico Weber,
Stephan Schuster
Thank you for your attention!