MGTAXA – a toolkit and a Web server for predicting taxonomy
     of the metagenomic sequences with Galaxy front-end and
                 parallel computational back-end

        Andrey Tovchigrechko, Seung-Jin Sul, Timothy Prindle,
        Shannon J. Williamson and Shibu Yooseph
        http://andreyto.github.com/mgtaxa/         atovtchi@jcvi.org
MGTAXA Components


    Predict taxonomy for bacterial metagenomic sequences
    • Glimmer ICM based classifier (similar to Phymm approach). Parallelized on HTC and HPC clusters,
      command line and Galaxy Web interface. Sequences above 300 bp.
    • BLAST+ and Batch SOM implementations for HPC clusters with MPI-MapReduce framework. Calls
      to pristine NCBI BLAST+ API – full compatibility. Scales to 2000 cores on XSEDE (former TeraGrid)
      (TACC Ranger).
    • Assembly contig consensus classifier from ORF BLASTP or APIS assignments


    First algorithm to predict hosts for bacteriophages in metagenomes
    •   Explores compositional similarity between phage and host
    •   ICM and SOM for prediction and visualization
    •   Assigns phage scaffolds (5Kbp+) to bacterial sequences
    •   Uses infrastructure and interfaces of the bacterial classifier




    CRISPR pipeline
    • Scans genomes & metagenomes for CRISPR arrays and genes
    • Connects with viral metagenomes through spacer matches – alternative way to establish the
      bacteriophage-host relationship
    • Parallelized on HTC and HPC clusters
ICMs for more sensitive compositional models with less
    dependency on the length of training sequence

    • The Interpolated Context Model (ICM) was developed for Glimmer (Delcher 1999) to find the
      correct reading frame based on nucleotide composition
    • ICMs are used as the basis of the Phymm prokaryotic taxonomic classification algorithm
      (Brady & Salzberg, 2009)
    • The main idea of ICM is to use longer k-mers when there is a statistically significant
      frequency of those specific k-mers in the training sample
    • This helps extract compositional signal from sequences shorter than would be required by
      a fixed k-mer model
    • Also less bias toward longer training sequences – critical for classification of viruses
    • We use ICMs to assign sample taxonomy as well as bacteriophage host taxonomy
    [Figure: nucleotide context window over the example sequence "a c c t t a c g g g a c".
    Image from Delcher 1999]
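The idea of leaning on longer k-mers only where the training sample supports them can be illustrated with a much simpler model than the real one. The sketch below is not the Glimmer ICM algorithm: it is a toy interpolated Markov model in which each context length is weighted by how often that context was seen. The class name `ToyInterpolatedModel` and the constant `prior_strength` are assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "acgt"


class ToyInterpolatedModel:
    """Simplified interpolated Markov model; NOT the Glimmer ICM algorithm."""

    def __init__(self, max_order=4, prior_strength=20.0):
        self.max_order = max_order
        self.prior_strength = prior_strength  # hypothetical smoothing constant
        # counts[k][context][base]: times `base` followed a context of length k
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, seq):
        seq = seq.lower()
        for i, base in enumerate(seq):
            if base not in ALPHABET:
                continue
            for k in range(min(i, self.max_order) + 1):
                self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        zero = self.counts[0].get("", {})
        # Order 0 with add-one smoothing, so the result is never zero.
        p = (zero.get(base, 0) + 1) / (sum(zero.values()) + len(ALPHABET))
        for k in range(1, min(len(context), self.max_order) + 1):
            seen = self.counts[k].get(context[-k:])
            if not seen:
                break  # longer contexts were not observed either
            n = sum(seen.values())
            # Frequently seen contexts get more weight than rare ones.
            weight = n / (n + self.prior_strength)
            p = weight * seen.get(base, 0) / n + (1 - weight) * p
        return p

    def score(self, seq):
        """Log-likelihood of `seq` under the model."""
        seq = seq.lower()
        return sum(math.log(self.prob(seq[max(0, i - self.max_order):i], base))
                   for i, base in enumerate(seq) if base in ALPHABET)


# Toy usage: two "genomes" with different composition, one query.
host_a = ToyInterpolatedModel()
host_a.train("acgt" * 100)
host_b = ToyInterpolatedModel()
host_b.train("aaaaccccggggtttt" * 25)
query = "acgtacgtacgt"
best = max([("host_a", host_a), ("host_b", host_b)],
           key=lambda item: item[1].score(query))[0]
```

Because rare long contexts get little weight, a short training sequence falls back to lower orders instead of producing unreliable high-order estimates, which is the property the slide highlights.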
How predicting the viral host taxonomy relates to
            predicting the bacterial taxonomy

    [Figure: Self-organizing Map (SOM) of tetranucleotide (4-mer) frequency vectors for a mix of
    NCBI RefSeq 5 Kb chunks and GOS scaffolds, with a color legend showing the proportions of
    inputs to the SOM. Annotation on the map: "In this region, enterobacteria and their phages
    cluster together. Many other clades behave like that on this map."]
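The input vectors for such a map can be sketched as follows. This is a minimal illustration, assuming non-overlapping 5 Kb chunks and plain 4-mer counts on one strand. Whether the original work merged reverse complements or normalized differently is not stated on the slide, and the function names are hypothetical.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order, and their vector positions.
KMERS = ["".join(p) for p in product("acgt", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def chunks(seq, size=5000):
    """Split a sequence into non-overlapping full-size chunks."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def tetranucleotide_frequencies(seq):
    """Return a 256-element vector of 4-mer frequencies for one chunk."""
    seq = seq.lower()
    counts = [0] * len(KMERS)
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # skips 4-mers with ambiguous bases
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


# One frequency vector per chunk would become one SOM input.
vectors = [tetranucleotide_frequencies(c) for c in chunks("acgt" * 3000)]
```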
Phage host prediction algorithm

    • Viruses exhibit huge diversity and fast evolution. Homology-based methods for assigning
      taxonomy to viral metagenomic sequences suffer from the lack of sufficiently close
      sequences in the reference databases
    • Although several groups have shown that viruses tend to adopt the polynucleotide (k-mer)
      composition of their hosts, regardless of homology, ours is the first practical method that
      uses this tendency to build a classifier for metagenomic viral contigs (5 Kbp or longer)
    • We score viral contigs against prokaryotic ICMs for the entire RefSeq and pick the ICM with
      the best score as the host prediction
    • We also have a mixed-mode classifier, where we score against both ICMs trained on viral
      sequences and prokaryotic ICMs. Unlike the only other reported compositional taxonomic
      classifier of viral sequences (NBC, fixed-k models), ICM scoring does not exhibit any strong
      preference for longer reference sequences (NBC was reported to misclassify many of its
      benchmark sequences as the giant mimivirus).
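The "score against every model, keep the best" step can be sketched like this. The scoring functions here are deliberately crude stand-ins (a GC-content likelihood) rather than ICMs, and the model names and the `min_length` parameter are assumptions for illustration only.

```python
import math


def gc_model(gc_fraction):
    """Return a toy scoring function: log-likelihood under a fixed GC content."""
    log_p = {
        "g": math.log(gc_fraction / 2), "c": math.log(gc_fraction / 2),
        "a": math.log((1 - gc_fraction) / 2), "t": math.log((1 - gc_fraction) / 2),
    }
    return lambda seq: sum(log_p[b] for b in seq.lower() if b in log_p)


def predict_host(contig, models, min_length=5000):
    """Score a contig against every model and return (best_model, best_score).

    Contigs shorter than `min_length` are not classified (returns None).
    """
    if len(contig) < min_length:
        return None
    scores = {name: score(contig) for name, score in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical host models keyed by name.
models = {"high_gc_host": gc_model(0.65), "low_gc_host": gc_model(0.35)}
prediction = predict_host("gcgcat" * 1000, models)
```

In a mixed-mode setup the same loop would simply run over a dictionary that also contains models trained on viral sequences.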
Phage host prediction benchmarking
    • The benchmark was compiled from all NCBI RefSeq GenBank records (Nov 2010) where a virus
      had a prokaryotic species identified as a host, either by the host record or by the naming
      of the virus
    • Among those, lysogenic viruses (those that integrate into host DNA during their lifecycle)
      were identified by literature curation for viruses that named a specific host genome

    Columns genus through phylum give the predicted host rank.

    A. Phages with lysogenic cycle (24 viral species, 10 host genera)
    Exclude      genus    family    order    class    phylum    rejected
    none          89%      96%       94%      99%      97%        2%
    species       85%      90%       90%      97%      95%        5%

    B. All phages (294 viral species, 33 host genera)
    Exclude      genus    family    order    class    phylum    rejected
    none          50%      55%       59%      71%      76%        6%
    species       43%      49%       54%      68%      75%        7%

    • Random assignment would result in 0.2% accuracy at the genus level
    • The algorithm, benchmarking and results on Global Ocean Sampling data are described in
      (Williamson et al, Metagenomic Investigation of Viruses throughout the Indian Ocean.
      PLoS One 2012; accepted)
Parallelization approach

    • 2500 models, 50 GB on disk; each input sequence has to be scored against each model
    • Coarse-grained parallelism
        • Training: one task is to build one ICM (1 min to build one model for a 4.7 Mbp genome,
          creating a 29 MB model file)
        • Scoring: one task is to score input sequences against one model (12.5 min to score
          1 Gbp against one model)
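The figures above imply a rough cost estimate. The sketch below assumes that scoring time scales linearly with input size and that every model costs the same as the quoted one; neither assumption is stated on the slide. The function names are hypothetical.

```python
import math

N_MODELS = 2500            # number of models, from the slide
SCORE_MIN_PER_GBP = 12.5   # minutes to score 1 Gbp against one model, from the slide


def scoring_cpu_hours(input_gbp, n_models=N_MODELS):
    """Total CPU hours to score the input against every model."""
    return input_gbp * n_models * SCORE_MIN_PER_GBP / 60.0


def ideal_wall_hours(input_gbp, n_cores, n_models=N_MODELS):
    """Wall time if each core runs one model task at a time, with no overhead."""
    rounds = math.ceil(n_models / n_cores)  # cores beyond n_models stay idle
    return rounds * input_gbp * SCORE_MIN_PER_GBP / 60.0


cpu_hours = scoring_cpu_hours(1.0)        # 2500 * 12.5 min = 31,250 CPU minutes
wall_hours = ideal_wall_hours(1.0, 500)   # 5 rounds of 12.5 minutes each
```

Under these assumptions 1 Gbp of input costs about 521 CPU hours serially, which is why one-task-per-model parallelism is the natural unit of work.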

    [Diagram: groups of per-model scoring tasks (Model, Model, Model, …) each feed a local
    reduction step that writes HDF5; the HDF5 outputs feed a global reduction and prediction
    step, followed by aggregate counts and output formatting.]
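The two-level reduction in the diagram can be sketched in memory as below. This is an illustration only: the slide does not say what the HDF5 files hold, so the sketch assumes that each scoring task yields a score per input sequence and that both reductions keep the best-scoring model per sequence. All function names are hypothetical.

```python
from collections import Counter


def keep_best(best, seq_id, model, score):
    """Update `best` in place if this (model, score) beats the current entry."""
    if seq_id not in best or score > best[seq_id][1]:
        best[seq_id] = (model, score)


def local_reduce(score_tables):
    """Reduce one group of per-model results: [(model, {seq_id: score}), ...]."""
    best = {}
    for model, scores in score_tables:
        for seq_id, score in scores.items():
            keep_best(best, seq_id, model, score)
    return best


def global_reduce(partials):
    """Merge the outputs of several local reductions into final predictions."""
    best = {}
    for partial in partials:
        for seq_id, (model, score) in partial.items():
            keep_best(best, seq_id, model, score)
    return best


def aggregate_counts(predictions):
    """Count how many input sequences were assigned to each model."""
    return Counter(model for model, _ in predictions.values())


group1 = local_reduce([("m1", {"s1": -10.0, "s2": -30.0}),
                       ("m2", {"s1": -12.0, "s2": -20.0})])
group2 = local_reduce([("m3", {"s1": -11.0, "s2": -25.0})])
predictions = global_reduce([group1, group2])
counts = aggregate_counts(predictions)
```

Taking a maximum is associative, so the result does not depend on how the models are grouped for the local step.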
Backends for parallel execution

    • How MGTAXA workflows (DAGs) are executed is selected at run time by a command-line switch:
      mgt-classifier --run-mode batchDep --batch-backend makeflow [ other options … ]
    • Implemented choices:
        • Serially in one process. No cluster or external workflow support is needed
        • Submit itself as a DAG of SGE jobs using the qsub job dependency option
        • Generate a “make file” for the Makeflow workflow execution engine, then run makeflow on
          it with a choice of its own back-ends:
            • Parallel multi-processing on a single node (normally N processes == N cores)
            • Submitting multiple SGE jobs while satisfying the dependencies defined by the DAG
            • Glide-in mechanism (“dual-level scheduling”) within one or more large MPI jobs. With
              that, MGTAXA can run on large XSEDE clusters that schedule only large parallel MPI
              jobs efficiently
            • WorkQueue with a shared file-system option. Advantageous for a large number of very
              short jobs
            • Generic LRM interface (define the job submit command for the current LRM)
        • Makeflow is tolerant of compute node failures, and also has restart capability (like make)
    • As a general note, Makeflow seems like a great fit for designing Galaxy tools that need to
      execute their own massively parallel programmatic workflows (as opposed to interactively
      user-defined Galaxy built-in workflows). Makeflow is serverless, zero-administration, and
      appears to Galaxy as a single serial job that can be submitted to SGE like any other tool.
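To make the "generate a make file" option concrete, here is a small sketch that writes make-style rules (target, sources, tab-indented command) of the kind Makeflow reads. It is not MGTAXA's generator. The script names `score_one_model.sh` and `reduce_scores.sh`, the file extensions, and the function name are all hypothetical.

```python
def makeflow_rules(model_names, query="query.fna"):
    """Build make-style rules: one scoring task per model, then one reduction."""
    lines = []
    outputs = []
    for name in model_names:
        out = f"{name}.scores"
        outputs.append(out)
        # Rule header lists the target and every file the task needs.
        lines.append(f"{out}: {query} {name}.icm score_one_model.sh")
        # The command line must start with a tab, as in make.
        lines.append(f"\t./score_one_model.sh {name}.icm {query} > {out}")
        lines.append("")
    joined = " ".join(outputs)
    lines.append(f"predictions.tsv: {joined} reduce_scores.sh")
    lines.append(f"\t./reduce_scores.sh {joined} > predictions.tsv")
    return "\n".join(lines) + "\n"


workflow_text = makeflow_rules(["model_0001", "model_0002"])
```

Because each scoring rule depends only on its own model file and the query, the engine is free to run them in any order or in parallel, and the final rule waits for all of them.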
Web server - architecture

    [Architecture diagram; components listed by network zone]
    DMZ:
    • Galaxy, with a patched DRMAA runner to support job file I/O staging
    • DRMAA
    • GridWay metascheduler
    • Built-in Globus MAD (middle-access driver): a way to use Grid sites (e.g. NSF XSEDE)
    • Our Proxy-MAD (middle-access driver) requester
    • Apache Qpid messaging server
    • DMZ disk
    Firewall between the DMZ and the Intranet; links shown across it: scp, TCP socket
    Intranet:
    • Our Proxy-MAD dispatcher: validation and untainting of job requests
    • MGTAXA tools: DAGs with 100s of serial tasks. Choice of execution: generate makeflow;
      self-submit with qsub job dependencies; execute serially in one process
    • LRM (SGE)
    • Cluster disk
    • Local copy of Galaxy on disk for pristine code of Galaxy tools
MR-MPI BLAST+ implementation

    • Runs as a regular MPI program, on any supercomputer with a shared file system
    • Makes high-level API calls to the unmodified NCBI C++ Toolkit: results are fully compatible
      with the upstream NCBI code, and it is easy to keep up to date and to support any options
      of NCBI BLAST+
    • Implemented with the MapReduce MPI (MR-MPI) library from Sandia Lab, which helps organize
      computations and data flow
    • The library scheduler was modified to solve the problem of maintaining context between
      map() calls, a common problem with the classical MapReduce algorithm
    • Parallel sort for the final output results
    • Output: tabular, NCBI XML, internal binary format, HDF5
    • Special mode for BLASTN metagenomic read recruitment (compatible with the implementation
      in Rusch et al, PLoS Biol 2007)

    [Figure: Control flow of the MR-MPI BLAST]
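The map, collate and reduce pattern behind such a search can be sketched serially in plain Python. This is an illustration of the pattern only: it uses neither MPI nor the MR-MPI library, and `toy_search` (shared 3-mer counting) is a hypothetical stand-in for a BLAST call.

```python
from collections import defaultdict
from itertools import product


def toy_search(query_seq, db_partition):
    """Stand-in for a BLAST call: score = number of distinct shared 3-mers."""
    q_kmers = {query_seq[i:i + 3] for i in range(len(query_seq) - 2)}
    for subject_id, subject_seq in db_partition:
        s_kmers = {subject_seq[i:i + 3] for i in range(len(subject_seq) - 2)}
        shared = len(q_kmers & s_kmers)
        if shared:
            yield subject_id, shared


def run_search(query_blocks, db_partitions, search=toy_search):
    collated = defaultdict(list)
    # Map: every (query block, DB partition) pair is one sequential work unit.
    for block, partition in product(query_blocks, db_partitions):
        for query_id, query_seq in block:
            for hit in search(query_seq, partition):
                collated[query_id].append(hit)  # collate hits by query
    # Reduce: merge the hits of each query across partitions, best score first.
    return {qid: sorted(hits, key=lambda h: (-h[1], h[0]))
            for qid, hits in collated.items()}


query_blocks = [[("q1", "acgtac")], [("q2", "ttttt")]]
db_partitions = [[("s1", "acgtacgt")], [("s2", "gtacaa"), ("s3", "ggggg")]]
results = run_search(query_blocks, db_partitions)
```

In a parallel run the work units would be distributed across ranks, and keeping a database partition loaded between consecutive units on the same rank is the kind of context the modified scheduler is said to preserve.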
MR-MPI BLASTN and BLASTP scaling




     [Left figure] Scaling chart for MR-MPI BLASTN showing process wall clock time at different
     total core counts in the MPI job. The total number of query sequences is 40,000. The
     sequences are split into 40 blocks of 400 kbp. Each block, when combined with one DB
     partition, forms a sequential work unit for the MapReduce algorithm. The data point labels
     represent time in minutes.

     [Right figure] "Useful" CPU utilization per core during the course of the computation for
     the MR-MPI BLASTP run with 1024 cores. CPU user time used at any given moment within a
     BLAST call was divided by the corresponding wall clock time, summed over all concurrent
     calls, and divided by the total number of cores allocated to the MPI program.
     From (Sul and Tovchigrechko, IPDPS, 2011)
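The "useful" utilization defined in the caption reduces to a short calculation. The sketch below assumes that each concurrent BLAST call is summarized as a pair (CPU user seconds, wall clock seconds) over the same sampling window; the function name and input layout are hypothetical.

```python
def useful_cpu_utilization(concurrent_calls, total_cores):
    """Sum of per-call (CPU user time / wall time), divided by allocated cores.

    `concurrent_calls` is a list of (cpu_user_seconds, wall_seconds) pairs
    for the BLAST calls running at the same moment.
    """
    busy = sum(cpu / wall for cpu, wall in concurrent_calls if wall > 0)
    return busy / total_cores


# Two cores allocated: one call is 90% busy, the other 50% busy.
utilization = useful_cpu_utilization([(9.0, 10.0), (5.0, 10.0)], total_cores=2)
```

Cores that are allocated but running no BLAST call contribute nothing to the sum, so idle time from load imbalance lowers the value just as I/O waits do.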
CRISPR Analysis Pipeline
    • Detects CRISPR arrays and CAS genes (genes as annotated by the JCVI protein annotation
      pipeline), builds genome diagrams, finds spacer matches to the viral sequences
    • Applied in (Yooseph et al, Nature 2010) to the Moore marine genomes collection: 137 marine
      genomes sequenced and annotated by JCVI (and 60 other, previously sequenced marine
      genomes). Demonstrated that the presence of a CRISPR system is clearly associated with the
      lifestyle of the organism, as related to the chances of its repeated encounters with the
      same types of viruses.
    • Applied to the GOS bacterial assembly and viral reads, and to several smaller datasets
    • The CRISPR finder is batched PILERCR with our post-processing to remove false positives
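Linking spacers to viral sequences can be illustrated with the simplest possible matcher. The slide does not say how matches are computed, so exact substring matching on both strands is an assumption made here, and the function names and identifiers are hypothetical.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (lower-cased)."""
    return seq.lower().translate(_COMPLEMENT)[::-1]


def spacer_matches(spacers, viral_seqs):
    """Return (spacer_id, viral_id, strand) for every exact spacer occurrence.

    Both arguments are dictionaries mapping an identifier to a sequence.
    """
    matches = []
    for spacer_id, spacer in spacers.items():
        forward = spacer.lower()
        reverse = reverse_complement(forward)
        for viral_id, viral in viral_seqs.items():
            viral = viral.lower()
            if forward in viral:
                matches.append((spacer_id, viral_id, "+"))
            elif reverse in viral:
                matches.append((spacer_id, viral_id, "-"))
    return matches


spacers = {"array1_sp1": "ACGTTGCA", "array1_sp2": "GGGTTTAA"}
viral_seqs = {"virus_x": "ttacgttgcaat", "virus_y": "ccttaaacccgg"}
links = spacer_matches(spacers, viral_seqs)
```

Each match ties the genome that carries the array to a virus it has presumably encountered, which is the alternative route to a host assignment mentioned earlier.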
Assembly contig ORF consensus classifier

    • Suppose that we have applied a homology-based metagenomic classification pipeline, such as
      the JCVI pipeline working on best BLASTP hits (Tanenbaum et al, Stand. Genomic Sci. 2010)
      or APIS (Badger et al, J. Bacteriol. 2006; Allen et al, ISME J. 2012) working on
      phylogenetic inference.
    • And we identify the taxonomic composition of our sample by counting individual gene
      assignments
    • In the example below, we would decide that we have all these bugs in whatever proportions
    • But, if the annotation was done on sufficiently large metagenomic contigs, we can classify
      each contig by the lowest common ancestor node of its ORF assignments at which a specified
      majority vote (e.g. 75%) is reached.
    • The example below is a fragment of a globally abundant SAR86 genome (published in 2011)
      searched against the version of Uniref100 from 2010. Our consensus classifier correctly
      called it gamma-proteobacteria.
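The majority-vote rule can be sketched as a walk down the taxonomy that stops as soon as no single taxon holds the required share of ORFs. This is a minimal illustration, assuming each ORF assignment is given as a root-first lineage list and that the vote is counted over all ORFs on the contig; the function name and the example lineages are hypothetical.

```python
from collections import Counter


def consensus_lineage(orf_lineages, majority=0.75):
    """Deepest lineage prefix supported by at least `majority` of the ORFs."""
    consensus = []
    n_orfs = len(orf_lineages)
    while n_orfs:
        depth = len(consensus)
        # ORFs that agree with the consensus so far and go at least one level deeper.
        candidates = [lin[depth] for lin in orf_lineages
                      if len(lin) > depth and lin[:depth] == consensus]
        if not candidates:
            break
        taxon, votes = Counter(candidates).most_common(1)[0]
        if votes / n_orfs < majority:
            break
        consensus.append(taxon)
    return consensus


gamma = ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]
orfs = (
    [gamma + ["Pseudomonadales"]] * 3
    + [gamma + ["Alteromonadales"]] * 2
    + [gamma + ["Vibrionales"]]
    + [["Bacteria", "Proteobacteria", "Alphaproteobacteria"]]
    + [["Bacteria", "Bacteroidetes"]]
)
contig_call = consensus_lineage(orfs)
```

In this toy contig the individual ORFs point at three different orders plus two unrelated groups, yet six of eight agree at the class level, so the contig is called at that rank rather than split among the lower-level hits.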
Team


        Seung-Jin Sul (HPC tools)
        Tim Prindle (Proxy MAD)
        Shibu Yooseph (co-PI, metagenomics expertise)
        Shannon Williamson (original idea to work on
         bacteriophage-host prediction tool)

        Funding: NSF 0850256, DOE DE-FC02-02ER63453

Mais conteúdo relacionado

Mais procurados

Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
Thomas Keane
 
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues
Dongyan Zhao
 
LUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis PipelineLUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis Pipeline
Hai-Wei Yen
 

Mais procurados (20)

An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment Design
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGSCurso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS
 
RNA-Seq with R-Bioconductor
RNA-Seq with R-BioconductorRNA-Seq with R-Bioconductor
RNA-Seq with R-Bioconductor
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Introduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqIntroduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seq
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo Protocol
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis Overview
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
LUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis PipelineLUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis Pipeline
 
RNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential ExpressionRNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential Expression
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 

Destaque

The role of cost in yeast gene expression
The role of cost in yeast gene expressionThe role of cost in yeast gene expression
The role of cost in yeast gene expression
Michael Barton
 
Procter Vamsas Bosc2009
Procter Vamsas Bosc2009Procter Vamsas Bosc2009
Procter Vamsas Bosc2009
bosc
 
E Talevich - Biopython project-update
E Talevich - Biopython project-updateE Talevich - Biopython project-update
E Talevich - Biopython project-update
Jan Aerts
 
Intel Theater Presentation - SC11
Intel Theater Presentation - SC11Intel Theater Presentation - SC11
Intel Theater Presentation - SC11
Deepak Singh
 

Destaque (20)

J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis Framework
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualization
 
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...
 
The role of cost in yeast gene expression
The role of cost in yeast gene expressionThe role of cost in yeast gene expression
The role of cost in yeast gene expression
 
20120622 fridayadelboden
20120622 fridayadelboden20120622 fridayadelboden
20120622 fridayadelboden
 
Procter Vamsas Bosc2009
Procter Vamsas Bosc2009Procter Vamsas Bosc2009
Procter Vamsas Bosc2009
 
Chamberlain PhD Thesis
Chamberlain PhD ThesisChamberlain PhD Thesis
Chamberlain PhD Thesis
 
Bio4j
Bio4jBio4j
Bio4j
 
Jonathan Eisen talk for #SCS2012 at #ISMB "Networks in genomics and bioinfor...
Jonathan Eisen talk for #SCS2012 at #ISMB  "Networks in genomics and bioinfor...Jonathan Eisen talk for #SCS2012 at #ISMB  "Networks in genomics and bioinfor...
Jonathan Eisen talk for #SCS2012 at #ISMB "Networks in genomics and bioinfor...
 
Tetrahymena genome project 2003 presentation by Jonathan Eisen
Tetrahymena genome project 2003 presentation by Jonathan EisenTetrahymena genome project 2003 presentation by Jonathan Eisen
Tetrahymena genome project 2003 presentation by Jonathan Eisen
 
Bio::Phylo - phyloinformatic analysis using perl
Bio::Phylo - phyloinformatic analysis using perlBio::Phylo - phyloinformatic analysis using perl
Bio::Phylo - phyloinformatic analysis using perl
 
OBF Address at BOSC 2012
OBF Address at BOSC 2012OBF Address at BOSC 2012
OBF Address at BOSC 2012
 
Surfacing the deep data of taxonomy
Surfacing the deep data of taxonomySurfacing the deep data of taxonomy
Surfacing the deep data of taxonomy
 
VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic Variation
 
Evolution of the RecA Protein: from Systematics to Structure 1995 talk for CA...
Evolution of the RecA Protein: from Systematics to Structure 1995 talk for CA...Evolution of the RecA Protein: from Systematics to Structure 1995 talk for CA...
Evolution of the RecA Protein: from Systematics to Structure 1995 talk for CA...
 
The neurobiological nature of free will
The neurobiological nature of free willThe neurobiological nature of free will
The neurobiological nature of free will
 
ORCID Principles
ORCID PrinciplesORCID Principles
ORCID Principles
 
E Talevich - Biopython project-update
E Talevich - Biopython project-updateE Talevich - Biopython project-update
E Talevich - Biopython project-update
 
Intel Theater Presentation - SC11
Intel Theater Presentation - SC11Intel Theater Presentation - SC11
Intel Theater Presentation - SC11
 
A brief description of the Chemical Rediscovery Survey and Open Chemistry in ...
A brief description of the Chemical Rediscovery Survey and Open Chemistry in ...A brief description of the Chemical Rediscovery Survey and Open Chemistry in ...
A brief description of the Chemical Rediscovery Survey and Open Chemistry in ...
 

Semelhante a A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of the metagenomic sequences

Automated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDBAutomated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDB
Omer Gertel
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
Hajime Tazaki
 
Directive-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingDirective-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous Computing
Ruymán Reyes
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
sesejun
 

Semelhante a A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of the metagenomic sequences (20)

3rd presentation
3rd presentation3rd presentation
3rd presentation
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
Automated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDBAutomated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDB
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Genetic Algorithm for task scheduling in Cloud Computing Environment
Genetic Algorithm for task scheduling in Cloud Computing EnvironmentGenetic Algorithm for task scheduling in Cloud Computing Environment
Genetic Algorithm for task scheduling in Cloud Computing Environment
 
Directive-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingDirective-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous Computing
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical ResearchBruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii Vozniuk
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Presentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_SaturnePresentation of the open source CFD code Code_Saturne
Presentation of the open source CFD code Code_Saturne
 
Inbot10 vxclass
Inbot10 vxclassInbot10 vxclass
Inbot10 vxclass
 

Mais de Jan Aerts

Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?
Jan Aerts
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013
Jan Aerts
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Jan Aerts
 

Mais de Jan Aerts (20)

Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?
 
Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data Analysis
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformatics
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing Consortium
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisB Chapman - Toolkit for variation comparison and analysis
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of the metagenomic sequences

  • 1. MGTAXA – a toolkit and a Web server for predicting taxonomy of the metagenomic sequences with Galaxy front-end and parallel computational back-end. Andrey Tovchigrechko, Seung-Jin Sul, Timothy Prindle, Shannon J. Williamson and Shibu Yooseph. http://andreyto.github.com/mgtaxa/ atovtchi@jcvi.org
  • 2. MGTAXA Components
      Predict taxonomy for bacterial metagenomic sequences
        • Glimmer ICM based classifier (similar to Phymm approach). Parallelized on HTC and HPC clusters, command line and Galaxy Web interface. Sequences above 300 bp.
        • BLAST+ and Batch SOM implementations for HPC clusters with MPI-MapReduce framework. Calls to pristine NCBI BLAST+ API – full compatibility. Scales to 2000 cores on XSEDE (former TeraGrid) (TACC Ranger).
        • Assembly contig consensus classifier from ORF BLASTP or APIS assignments
      First algorithm to predict hosts for bacteriophages in metagenomes
        • Explores compositional similarity between phage and host
        • ICM and SOM for prediction and visualization
        • Assigns phage scaffolds (5Kbp+) to bacterial sequences
        • Uses infrastructure and interfaces of the bacterial classifier
      CRISPR pipeline
        • Scans genomes & metagenomes for CRISPR arrays and genes
        • Connects with viral metagenomes through spacer matches – alternative way to establish the bacteriophage-host relationship
        • Parallelized on HTC and HPC clusters
  • 3. ICMs for more sensitive compositional models with less dependency on the length of training sequence
      • Interpolated Context Model (ICM) developed for Glimmer (Delcher 1999) to find correct reading frame based on nucleotide composition
      • ICMs used as a basis of Phymm taxonomic prokaryotic classification algorithm (Brady & Salzberg, 2009)
      • The main idea of ICM is to use longer k-mers when there is a statistically significant frequency of those specific k-mers in the training sample
      • This helps extract compositional signal from sequences shorter than would be required by a fixed k-mer model
      • Also less bias toward longer training sequences – critical for classification of viruses
      • We use ICMs to assign sample taxonomy as well as bacteriophage host taxonomy
      [Figure: ICM context window over an example nucleotide sequence. Image from Delcher 1999]
  • 4. How predicting the viral host taxonomy relates to predicting the bacterial taxonomy
      [Figure: Self-organizing Map (SOM) of tetranucleotide (4-mer) frequency vectors, built from a mix of NCBI RefSeq 5 Kb chunks and GOS scaffolds, with a color legend and the proportions of inputs to the SOM]
      • Figure callout: in this region, enterobacteria and their phages cluster together. Many other clades behave like that on this map.
  • 5. Phage host prediction algorithm
      • Viruses exhibit huge diversity and fast evolution. Homology based methods for assigning taxonomy to viral metagenomic sequences suffer from the lack of sufficiently close sequences in the reference databases
      • Although it has been shown by several groups that viruses tend to adopt the polynucleotide (k-mer) composition of their hosts, regardless of homology, ours is the first practical method that uses this tendency to build a classifier for metagenomic viral contigs (5Kbp or longer)
      • We score viral contigs against prokaryotic ICMs for the entire RefSeq and pick the ICM with the best score as the host prediction
      • We also have a mixed mode classifier, where we score against both ICMs trained on viral sequences and prokaryotic ICMs. Unlike the only other reported compositional taxonomic classifier of viral sequences (NBC, fixed k models), ICM scoring does not exhibit any strong preference for longer reference sequences (NBC was reported to misclassify many of its benchmark sequences as giant mimivirus).
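The "score against every model and pick the best" step on slide 5 can be illustrated with a minimal sketch. It does not use Glimmer ICMs: as a stand-in, it trains a plain fixed-order k-mer Markov model per host (which is exactly the kind of fixed-k model the slide contrasts ICMs with), so only the selection logic carries over. The function names `train_kmer_model`, `score_contig` and `predict_host` are hypothetical and are not part of MGTAXA.

```python
import math
from collections import Counter


def train_kmer_model(genome, k=3):
    """Fixed-order stand-in for an ICM: count k-mers and their (k-1)-base contexts."""
    kmers = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    contexts = Counter()
    for kmer, n in kmers.items():
        contexts[kmer[:-1]] += n
    return kmers, contexts, k


def score_contig(model, contig, pseudo=1.0):
    """Log-likelihood of the contig under the model, with add-one style smoothing."""
    kmers, contexts, k = model
    total = 0.0
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        # P(last base | previous k-1 bases), smoothed over the 4 nucleotides
        p = (kmers[kmer] + pseudo) / (contexts[kmer[:-1]] + 4 * pseudo)
        total += math.log(p)
    return total


def predict_host(contig, models):
    """Score the contig against every host model and return the best-scoring host."""
    scores = {name: score_contig(m, contig) for name, m in models.items()}
    return max(scores, key=scores.get), scores
```

With one model per reference genome, `predict_host(contig, models)` returns the name of the top-scoring model together with all scores, so a rejection rule could be layered on top if needed.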
  • 6. Phage host prediction benchmarking
      • Benchmark was compiled from all NCBI RefSeq Genbank records (Nov 2010) where a virus had a prokaryotic species identified as a host, either by the host record or by the naming of the virus
      • Among those, lysogenic viruses (those that integrate into host DNA during their lifecycle) were identified by literature curation for viruses that named a specific host genome

      A. Phages with lysogenic cycle (24 viral species, 10 host genera)

        Exclude   Predict Host Rank
                  genus   family   order   class   phylum   rejected
        none      89%     96%      94%     99%     97%      2%
        species   85%     90%      90%     97%     95%      5%

      B. All phages (294 viral species, 33 host genera)

        Exclude   Predict Host Rank
                  genus   family   order   class   phylum   rejected
        none      50%     55%      59%     71%     76%      6%
        species   43%     49%      54%     68%     75%      7%

      • Random assignment would result in 0.2% accuracy at a genus level
      • Algorithm, benchmarking and results on Global Ocean Sampling data are described in (Williamson et al, Metagenomic Investigation of Viruses throughout the Indian Ocean. PLoS One 2012; accepted)
  • 7. Parallelization approach
      • 2500 models, 50G on disk, each input sequence has to be scored against each model
      • Coarse-grained parallelism
      • Training – one task is to build one ICM (1 min to build one model for a 4.7Mbp genome, creating 29MB model file)
      • Scoring – one task is to score input sequences against one model (12.5 min to score 1Gbp against one model)
      [Diagram: groups of models each feed a local reduction that writes HDF5; the local results feed a global reduction, followed by aggregate counts and prediction, then output formatting]
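The figures on slide 7 imply a substantial total cost: 2500 models at 12.5 minutes per Gbp each is about 521 CPU-hours per Gbp of input, which is why one-task-per-model parallelism matters. The sketch below works through that arithmetic and mimics the local and global reduction stages in memory. It is an illustration under assumptions, not MGTAXA code: the names are hypothetical, a thread pool stands in for cluster tasks, and plain dictionaries stand in for the HDF5 files.

```python
from concurrent.futures import ThreadPoolExecutor


def estimate_cpu_hours(n_models, minutes_per_gbp_per_model, input_gbp):
    """Total scoring cost when every input base is scored against every model."""
    return n_models * minutes_per_gbp_per_model * input_gbp / 60.0


def score_task(model_id, sequences, score_fn):
    """One coarse-grained task: score all input sequences against ONE model."""
    return {seq_id: (score_fn(model_id, seq), model_id)
            for seq_id, seq in sequences.items()}


def reduce_best(partials):
    """Keep, for each sequence, the best (score, model_id) pair seen so far."""
    best = {}
    for partial in partials:
        for seq_id, candidate in partial.items():
            if seq_id not in best or candidate > best[seq_id]:
                best[seq_id] = candidate
    return best


def classify(sequences, model_ids, score_fn, batch_size=2, workers=4):
    """Map: one task per model. Reduce: per-batch (local), then across batches (global)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(
            lambda m: score_task(m, sequences, score_fn), model_ids))
    local = [reduce_best(partials[i:i + batch_size])
             for i in range(0, len(partials), batch_size)]
    return reduce_best(local)
```

Because the local and global stages use the same reduction, the batch size only affects how much intermediate data each stage holds, not the result.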
  • 8. Backends for parallel execution
      • How MGTAXA workflows (DAGs) will be executed is selected at run-time by a command-line switch: mgt-classifier --run-mode batchDep --batch-backend makeflow [ other options … ]
      • Implemented choices:
          • Serially in one process. No cluster or external workflow support is needed
          • Submit itself as a DAG of SGE jobs using the qsub job dependency option
          • Generate a “make file” for the Makeflow workflow execution engine, then run makeflow on it with a choice of its own back-ends:
              • Parallel multi-processing on a single node (normally N processes == N cores)
              • Submitting multiple SGE jobs while satisfying the dependencies defined by the DAG
              • Glide-in mechanism (“dual-level scheduling”) within one or more large MPI jobs. With that, MGTAXA can run on large XSEDE clusters that efficiently schedule only large parallel MPI jobs
              • WorkQueue with a shared file-system option. An advantage for a large number of very short jobs.
              • Generic LRM interface (define the job submit command for the current LRM)
      • Makeflow is tolerant of compute node failures, and also has restart capability (like make)
      • As a general note, Makeflow seems like a great fit for designing Galaxy tools that need to execute their own massively parallel programmatic workflows (as opposed to interactively user-defined Galaxy built-in workflows). Makeflow is serverless, zero-administration, and appears to Galaxy as a single serial job that can be submitted to SGE like any other tool.
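To make the idea of one DAG with interchangeable back-ends concrete, here is a minimal sketch that describes a scoring workflow once and then either derives a serial execution order or emits Make-style rules of the kind Makeflow consumes (a `targets: sources` line followed by a tab-indented command). The task layout, the file names, and the `./score_one_model` and `./reduce_scores` commands are hypothetical placeholders, not the rules MGTAXA actually generates.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def build_dag(model_ids):
    """Describe the workflow once: one scoring task per model, then one reduction."""
    tasks = {}
    for m in model_ids:
        tasks[f"score_{m}"] = {
            "inputs": ["./score_one_model", f"{m}.icm", "input.fna"],
            "output": f"{m}.scores",
            "cmd": f"./score_one_model {m}.icm input.fna > {m}.scores",
        }
    score_files = [f"{m}.scores" for m in model_ids]
    tasks["reduce"] = {
        "inputs": ["./reduce_scores"] + score_files,
        "output": "predictions.tsv",
        "cmd": "./reduce_scores " + " ".join(score_files) + " > predictions.tsv",
    }
    return tasks


def serial_order(tasks):
    """Back-end 1: an order in which the tasks could run serially in one process."""
    produced_by = {t["output"]: name for name, t in tasks.items()}
    graph = {name: {produced_by[i] for i in t["inputs"] if i in produced_by}
             for name, t in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


def to_makeflow(tasks):
    """Back-end 2: emit Make-style rules for a workflow engine to schedule."""
    lines = []
    for t in tasks.values():
        lines.append(f"{t['output']}: {' '.join(t['inputs'])}")
        lines.append("\t" + t["cmd"])
        lines.append("")
    return "\n".join(lines)
```

Keeping the DAG description separate from its execution is what lets a single run-time switch choose between serial execution, self-submission to a scheduler, or a generated rule file.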
  • 9. Web server - architecture
      [Diagram spanning three zones: DMZ, Intranet and Cluster, separated by a firewall. Labels recoverable from the diagram:]
      • DMZ: Galaxy with a patched DRMAA runner to support job file staging (disk I/O, scp); our Proxy-MAD (middle-access driver) requester; a local copy of Galaxy on disk for pristine code of Galaxy tools
      • Connection across the firewall: Apache Qpid messaging server over a TCP socket
      • Intranet: our Proxy-MAD dispatcher, which performs validation and untainting of job requests; GridWay metascheduler (DRMAA) with the built-in Globus-MAD (middle-access driver), a way to use Grid sites (e.g. NSF XSEDE)
      • Cluster: LRM (SGE) running MGTAXA tools, which are DAGs with 100s of serial tasks
      • Choice of execution: generate makeflow; self-submit with qsub job dependencies; execute serially in one process
  • 10. MR-MPI BLAST+ implementation
      • Runs as a regular MPI program, on any supercomputer with a shared file system
      • Makes high-level API calls to the unmodified NCBI C++ Toolkit – results are fully compatible with the upstream NCBI code; easy to keep up to date and to support any options of the NCBI BLAST+
      • Implemented with the MapReduce MPI (MR-MPI) library from Sandia Lab, which helps organize computations and data flow
      • The library scheduler was modified to solve the problem of maintaining context between map() calls – a common problem with the classical MapReduce algorithm
      • Parallel sort for the final output results
      • Output: Tabular, NCBI XML, internal binary format, HDF5
      • Special mode for BLASTN metagenomic read recruitment (compatible with implementation in Rusch et al, PLoS Biol 2007)
      [Figure: Control flow of the MR-MPI BLAST]
  • 11. MR-MPI BLASTN and BLASTP scaling
      [Figure, left: Scaling chart for MR-MPI BLASTN showing process wall clock time at different total core counts in MPI job. The total number of query sequences is 40,000. The sequences are split into 40 blocks of 400 kbp. Each block, when combined with one DB partition, forms a sequential work unit for the MapReduce algorithm. The data point labels represent time in minutes.]
      [Figure, right: "Useful" CPU utilization per core during the course of the computation for the MR-MPI BLASTP run with 1024 cores. CPU user time used at any given moment within a BLAST call was divided by the corresponding wall clock time, summed over all concurrent calls, and divided by a total number of cores allocated to the MPI program.]
      From (Sul and Tovchigrechko, IPDPS, 2011)
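The "useful" CPU utilization metric in the second caption is a simple ratio, which the following sketch spells out. The function name and the input layout (one `(cpu_user_seconds, wall_clock_seconds)` pair per concurrent BLAST call) are assumptions made for illustration; the original measurement code is not shown in the slides.

```python
def useful_cpu_utilization(concurrent_calls, total_cores):
    """Per-core utilization at one moment in the run.

    concurrent_calls: iterable of (cpu_user_seconds, wall_clock_seconds),
    one pair per BLAST call in flight at that moment.
    total_cores: number of cores allocated to the MPI program.
    """
    # Each call contributes its CPU-to-wall ratio; idle cores contribute nothing.
    busy = sum(cpu / wall for cpu, wall in concurrent_calls if wall > 0)
    return busy / total_cores
```

For example, if only half of 1024 cores are inside a BLAST call and each call spends 90% of its wall clock time in user CPU, the metric is 0.45.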
  • 12. CRISPR Analysis Pipeline
      • Detects CRISPR arrays and CAS genes (genes as annotated by the JCVI protein annotation pipeline), builds genome diagrams, finds spacer matches to the viral sequences
      • Applied in (Yooseph et al, Nature 2010) for the Moore marine genomes collection, 137 marine genomes sequenced and annotated by JCVI (and 60 other previously sequenced marine genomes). Demonstrated that the presence of a CRISPR system is clearly associated with the lifestyle of the organism as related to the chances of its repeated encounters with the same types of viruses.
      • Applied to GOS bacterial assembly and viral reads, and to several smaller datasets
      • The CRISPR finder is batched PILERCR with our post-processing to remove false positives
  • 13. Assembly contig ORF consensus classifier
      • Suppose that we have applied a homology-based metagenomic classification pipeline, such as the JCVI pipeline working on best BLASTP hits (Tanenbaum et al, Stand. Genomic Sci. 2010) or APIS (Badger et al, J. Bacteriol. 2006; Allen et al, ISME J. 2012) working on phylogenetic inference.
      • And we identify the taxonomic composition of our sample by counting individual gene assignments
      • In the example below, we would decide that we have all these bugs in whatever proportions
      • But, if the annotation was done on sufficiently large metagenomic contigs, we can classify each contig by the lowest common ancestor node of its ORF assignments where a specified majority vote (e.g. 75%) is reached.
      • Example below is a fragment of a globally abundant SAR86 genome (published in 2011) searched against the version of Uniref100 from 2010. Our consensus classifier correctly called it gamma-proteobacteria.
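The majority-vote lowest common ancestor rule on slide 13 can be sketched in a few lines. This is one plausible reading of the rule, assuming each ORF assignment is given as a root-to-leaf lineage and that the vote is taken over all ORFs on the contig; the function name `consensus_taxon` and the input layout are hypothetical, not MGTAXA's interface.

```python
from collections import Counter


def consensus_taxon(orf_lineages, majority=0.75):
    """Deepest lineage prefix supported by at least `majority` of the contig's ORFs.

    orf_lineages: list of root-to-leaf lineages, one per ORF,
    e.g. ["Bacteria", "Proteobacteria", "Gammaproteobacteria"].
    Returns the consensus lineage as a list (empty if no rank reaches the vote).
    """
    if not 0.5 < majority <= 1.0:
        # Above 50%, at most one node per rank can win, so the path is unique.
        raise ValueError("majority must be in (0.5, 1.0]")
    n = len(orf_lineages)
    consensus = []
    depth = 0
    while n > 0:
        # Vote among ORFs whose assignment reaches this depth.
        votes = Counter(tuple(lin[:depth + 1])
                        for lin in orf_lineages if len(lin) > depth)
        if not votes:
            break
        node, count = votes.most_common(1)[0]
        if count / n < majority:
            break
        consensus = list(node)
        depth += 1
    return consensus
```

A contig whose ORFs hit several different orders within one class is thereby reported at the class level, rather than contributing several spurious taxa to the sample composition.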
  • 14. Team
      • Seung-Jin Sul (HPC tools)
      • Tim Prindle (Proxy MAD)
      • Shibu Yooseph (co-PI, metagenomics expertise)
      • Shannon Williamson (original idea to work on bacteriophage-host prediction tool)
      • Funding: NSF 0850256, DOE DE-FC02-02ER63453