Building bioinformatics resources for the global community

  1. Building bioinformatics resources for the global community James Pettengill james.pettengill@fda.hhs.gov Biostatistics and Bioinformatics Staff Office of Analytics and Outreach FDA Center for Food Safety and Applied Nutrition GMI9 May 24, 2016 Rome, Italy
  2. CFSAN’s open-access peer reviewed methods for analyzing and differentiating among samples based on WGS data. Submitted 16 April 2014 Accepted 23 September 2014 Published 14 October 2014 Corresponding author Errol Strain, Errol.Strain@fda.hhs.gov Academic editor Keith Crandall Additional Information and Declarations can be found on page 21 DOI 10.7717/peerj.620 An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella James B. Pettengill, Yan Luo, Steven Davis, Yi Chen, Narjol Gonzalez-Escalona, Andrea Ottesen, Hugh Rand, Marc W. Allard and Errol Strain Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA ABSTRACT Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. 
However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.
  3. Real-time pathogen detection in the era of whole-genome sequencing and big data: K-mer and site-based methods for inferring the distances among tens of thousands of Salmonella samples James Pettengill james.pettengill@fda.hhs.gov Biostatistics and Bioinformatics Staff Office of Analytics and Outreach FDA Center for Food Safety and Applied Nutrition GMI9 May 24, 2016 Rome, Italy
  9. Premise/Background of the project
     •  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real time.
     •  These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years.
     •  For these databases to be of optimal use, one must be able to quickly interrogate them to accurately determine the genomic distances among a set of samples.
     •  Being able to do so is challenging due to both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues.
     •  Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and whole-genome multi-locus sequence typing (wgMLST))
     •  Empirical data: whole-genome sequence data from 18,997 Salmonella isolates
  10. NutButter Outbreak ? http://www.cdc.gov/salmonella/braenderup-08-14/index.html NCBI GenomeTrakr Tree
  11. Experimental design: based on a classification scheme, determine how well each distance measure performs
      [Figure: panels "Efficient method" and "Inefficient method"; axes: genetic distances, #; series: inter-category comparisons, intra-category comparisons]
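One way to put a number on "how well each distance measure performs" is the area under the ROC curve: the probability that a randomly chosen inter-category distance is larger than a randomly chosen intra-category distance. The sketch below is a minimal illustration under that assumption, not the analysis code behind the slides; the sample names, categories, and toy distances are hypothetical.

```python
from itertools import combinations


def split_distances(categories, distance):
    """Split all pairwise distances into intra- and inter-category lists.

    categories: dict mapping sample name -> category label
    distance:   function (sample_a, sample_b) -> float
    """
    intra, inter = [], []
    for a, b in combinations(sorted(categories), 2):
        d = distance(a, b)
        (intra if categories[a] == categories[b] else inter).append(d)
    return intra, inter


def auc(intra, inter):
    """Probability that an inter-category distance exceeds an
    intra-category distance; ties count as one half."""
    wins = 0.0
    for x in inter:
        for y in intra:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(intra) * len(inter))


# Hypothetical toy data: four samples in two categories.
categories = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
toy = {
    ("s1", "s2"): 0.01, ("s3", "s4"): 0.02,  # intra-category pairs
    ("s1", "s3"): 0.30, ("s1", "s4"): 0.28,  # inter-category pairs
    ("s2", "s3"): 0.31, ("s2", "s4"): 0.29,
}

intra, inter = split_distances(categories, lambda a, b: toy[(a, b)])
efficient_auc = auc(intra, inter)
```

An AUC near 1 corresponds to the "efficient" case, where the intra- and inter-category distributions barely overlap; an AUC near 0.5 corresponds to the "inefficient" case, where the distance carries little information about the classification.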
  12. Experimental design: Simulated data:
  13. Experimental design: Empirical data:
      •  Analyze different distance methods on de novo assemblies of all Salmonella samples in GenomeTrakr
      •  Use serovar as the classification scheme
      [Figure: panel "Efficient method"; axes: genetic distances, #; series: inter-enteritidis comparisons, intra-enteritidis comparisons]
  18. Experimental design: Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
      Assembly workflow (in order):
      •  Obtain latest metadata file from NCBI pathogen database
      •  Parse metadata and download raw data
      •  Quality filter using fastx toolkit
      •  Taxonomic/contamination filtering using Kraken with custom db
      •  Assembly using SPAdes
  19. 1.  Obtain an assembly for each sample within GenomeTrakr
      •  Use pilot of cloud computing to accomplish assemblies: “cloudbursting”

      Summary: We have successfully completed running Use Cases 2 and 3 on AWS servers via the CycleCloud platform. Even without time for extensive optimization of the clusters, we were able to complete the Use Cases rapidly and inexpensively.

      Use Case 2: Listeria Isolates (3,645 Listeria assemblies)
      A workflow was designed to analyze sequencing data from all of the publicly available Listeria isolates (3645) collected by the GenomeTrakr network. This workflow involves downloading data from the NCBI servers, trimming the sequencing reads based on quality scores, filtering the reads based on quality and taxonomy, and assembling the reads into contiguous genome segments. The results of this workflow will allow us to improve our methods of identifying outbreak isolates.
      1. Cluster Specs
         Max cores: 4000
         Max parallel jobs: 1000
         Master node: i2.4xlarge
         Compute nodes: r3.2xlarge, r3.4xlarge
      2. Results
         Jobs: 3645
         Run time: 8 hours
         Job completion rate: 99.8%
         Approximate cost: $1800.00
      3. Additional Notes
         Local runtime: 3.5 days
         Feasible to run locally: YES
         Anticipated frequency: once/quarter
         Estimated yearly cost: $9000.00*
         * Assuming the current growth rate of this dataset, we estimate 900 additional samples per quarter for the next year.

      Use Case 3: Salmonella Isolates (25,765 Salmonella assemblies)
      Our revised Use Case 3 applies the workflow described […] publicly available Salmonella isolates (25765) collected by […] network. The analysis of this dataset is much more difficult due […] size and a much larger number of isolates and is not feasible o[…] resources.
      1. Cluster Specs
         Max cores: 12000
         Max parallel jobs: 3000
         Master node: i2.4xlarge
         Compute nodes: r3.2xlarge, r3.4xlarge, r3.8xlarge
      2. Results
         Jobs: 25765
         Run time: 20 hours
         Job completion rate: 99.1%
         Approximate cost: $8000.00
      3. Additional Notes
         Estimated local runtime: 23 days
         Feasible to run locally: NO
         Anticipated frequency: once/quarter
         Estimated yearly cost: $56000.00*
         * Assuming the current growth rate of this dataset, we estimate 12,50[…] per quarter for the next year.
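The run figures reported for the two use cases can be turned into a speedup and a per-assembly cost with simple arithmetic. The sketch below only restates numbers from the slide (jobs, cloud run time, local run time, approximate cost); the function name and the returned keys are illustrative assumptions.

```python
def summarize(jobs, cloud_hours, local_days, cost_usd):
    """Derive speedup and unit cost from one cloud run."""
    return {
        # Local wall-clock time divided by cloud wall-clock time.
        "speedup": local_days * 24 / cloud_hours,
        # Approximate cloud cost per submitted assembly job.
        "usd_per_assembly": cost_usd / jobs,
    }


# Figures as reported on the slide.
listeria = summarize(jobs=3645, cloud_hours=8, local_days=3.5, cost_usd=1800.0)
salmonella = summarize(jobs=25765, cloud_hours=20, local_days=23, cost_usd=8000.0)

for name, run in (("Listeria", listeria), ("Salmonella", salmonella)):
    print(f"{name}: {run['speedup']:.1f}x faster, "
          f"${run['usd_per_assembly']:.2f} per assembly")
```

On these numbers the larger Salmonella run is both cheaper per assembly and gains more from the cloud than the Listeria run, which is consistent with the slide marking only the Listeria workload as feasible to run locally.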
  20. Experimental design: Distance measures
      Site-based (requires statements about homology/sequence alignment):
        Sample1: ACCTAGTACC
        Sample2: ACGTACTACC
        Similarity = 0.8
      K-mer-based (L = 9; fast but loss/oversimplification of information):
        Sample1: ACCTAGTACC  kmer1: ACCTAGTAC  kmer2: CCTAGTACC
        Sample2: ACGTACTACC  kmer1: ACGTACTAC  kmer2: CGTACTACC
        Similarity = 0
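The two similarity values on this slide can be reproduced directly from the two ten-base sequences. The sketch below assumes the site-based similarity is the fraction of identical aligned positions and the k-mer similarity is the Jaccard index of the k-mer sets with k = 9; the function names are illustrative, not taken from any of the tools evaluated.

```python
def site_similarity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k):
    """Jaccard index: shared k-mers divided by all distinct k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


sample1 = "ACCTAGTACC"
sample2 = "ACGTACTACC"

site = site_similarity(sample1, sample2)      # 8 of 10 positions match
kmer = kmer_jaccard(sample1, sample2, k=9)    # no 9-mer is shared
print(site, kmer)  # prints 0.8 0.0
```

The two mismatches sit at positions 3 and 6, so every 9-base window spans at least one of them. That is why the k-mer similarity collapses to zero even though the sequences are 80% identical site by site.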
  21. Summary of methods used to infer the relationships among samples.
      Class | Method | Description | Exec. time (s)
      Site-based | NUCmer§ | Pairwise genome alignment using suffix arrays | 11.9
      Site-based | wgMLST¶ | Gene-based approach | 46.95
      K-mer based | Jaccard Index§ | The intersection divided by the union of all k-mers found between two samples | 9.4
      K-mer based | Manhattan Distance§ | Sum of the absolute differences between the abundance of each k-mer present between two samples | 45.1
      K-mer based | Euclidean Distance§ | The square root of the sum of squares of all pairwise differences in k-mer abundance | 44.2
      K-mer based | Mash Distance | MinHash (Broder 1998) technique to reduce genomes to sketches and estimate a novel evolutionary distance metric among them | 1.2
      K-mer based | Mash Jaccard Distance | The Jaccard Distance (as described above) but based on the sketch size (e.g., the number of hashes) | 1.2
      § Performed using de novo assemblies and requires k-mer indexing, which with jellyfish takes 7.4 s (0.8) per sample (2.1 days for 25,000 samples)
      ¶ Requires a reference genome
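The k-mer based descriptions in this summary translate almost directly into code. The following is a toy sketch under stated assumptions, not the implementation used by jellyfish or Mash: SHA-1 stands in for Mash's hash function, k-mers are not canonicalized, and the function names are illustrative. The Mash distance uses the published formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard estimate.

```python
import hashlib
import math
from collections import Counter


def kmer_counts(seq, k):
    """Abundance of every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan(ca, cb):
    """Sum of absolute abundance differences over all k-mers seen."""
    return sum(abs(ca[x] - cb[x]) for x in ca.keys() | cb.keys())


def euclidean(ca, cb):
    """Square root of the sum of squared abundance differences."""
    return math.sqrt(sum((ca[x] - cb[x]) ** 2 for x in ca.keys() | cb.keys()))


def sketch(seq, k, size):
    """Bottom-`size` MinHash sketch: the smallest k-mer hash values."""
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in kmer_counts(seq, k)
    }
    return sorted(hashes)[:size]


def sketch_jaccard(sa, sb, size):
    """Jaccard estimate from the `size` smallest hashes of the merged sketches."""
    merged = sorted(set(sa) | set(sb))[:size]
    if not merged:
        return 0.0
    shared = set(sa) & set(sb)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j and k-mer length k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


# The two ten-base sequences from the distance-measure slide, with k = 9.
c1 = kmer_counts("ACCTAGTACC", 9)
c2 = kmer_counts("ACGTACTACC", 9)
print(manhattan(c1, c2), euclidean(c1, c2))  # prints 4 2.0
```

The sketch-based measures are fast because only `size` hash values per genome are compared, regardless of genome length, which fits the 1.2 s execution times reported for the two Mash measures.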
  22. Classification of simulated data: ROC curves are identical across the different distance methods
      * The simulated data are not complex/noisy enough
  26. Summary/Implications:
      •  There are features (e.g., genomic, assembly, and contamination) that cause k-mer based methods to fail to accurately capture the distance between samples.
      •  Treating absent data as informative may be problematic.
      •  Site-based methods, like NUCmer and MLST, tended to be superior in performance.
      •  Accessing the computing resources necessary to perform site-based methods may be challenging when analyzing large databases.
      •  If working with k-mer distances, err on the side of false positives and have high-quality assemblies.
  27. Acknowledgements FDA •  Center for Food Safety and Applied Nutrition •  Biostats/Bioinformatics staff – J. Baugher, H. Rand, J. Miller, Y. Luo, S. Davis, E. Strain •  Center for Veterinary Medicine •  Office of Regulatory Affairs National Institutes of Health •  National Center for Biotechnology Information State Health and University Labs •  Alaska •  Arizona •  California •  Florida •  Hawaii •  Maryland •  Minnesota •  New Mexico •  New York •  South Dakota •  Texas •  Virginia •  Washington USDA/FSIS •  Eastern Laboratory CDC •  Enteric Diseases Laboratory •  INEI-ANLIS “Carlos Malbrán Institute,” Argentina •  Centre for Food Safety, University College Dublin, Ireland •  Food Environmental Research Agency, UK •  Public Health England, UK •  WHO •  Illumina •  Pac Bio •  CLC Bio •  Other independent collaborators
  28. Validation exercise key findings: Number of false negatives
      •  False negatives are primarily due to failure to meet consensus frequency threshold
      •  False negatives are not random across the genome
      [Figure: number of false negatives (value axis, 0 to 5000) by filter (ConsensusFrequency<0.9, Coverage<8) for the X20x_Coverage and X100x_Coverage datasets]
  29. Validation exercise of CFSAN SNP Pipeline key findings:
      •  100× dataset: recovered 98.9% of the introduced SNPs; false positive rate of 1.04 × 10^-6
      •  20× dataset: recovered 98.8% of SNPs; false positive rate of 8.34 × 10^-7
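The validation figures above rest on two standard ratios, and the false negatives are attributed to two site filters. The sketch below shows those definitions under stated assumptions: the thresholds (coverage of at least 8, consensus frequency of at least 0.9) are read off the plot labels on the false-negative slide, the counts are hypothetical, and none of this is the CFSAN SNP Pipeline's own code.

```python
def recall(true_pos, false_neg):
    """Share of introduced SNPs that were recovered."""
    return true_pos / (true_pos + false_neg)


def false_positive_rate(false_pos, true_neg):
    """Share of truly invariant sites that were called as SNPs."""
    return false_pos / (false_pos + true_neg)


def filter_failures(depth, consensus_freq, min_depth=8, min_freq=0.9):
    """Names of the filters a site fails; an empty list means it passes.

    The default thresholds are assumptions taken from the plot labels.
    """
    failed = []
    if depth < min_depth:
        failed.append("Coverage<8")
    if consensus_freq < min_freq:
        failed.append("ConsensusFrequency<0.9")
    return failed


# Hypothetical counts chosen only to illustrate the arithmetic.
example_recall = recall(true_pos=989, false_neg=11)
example_fpr = false_positive_rate(false_pos=5, true_neg=4_999_995)

# A well-covered site can still be missed if reads disagree on the base.
print(filter_failures(depth=40, consensus_freq=0.85))
```

Separating the two filters in this way makes it possible to count which one accounts for each missed SNP, which is the breakdown the false-negative plot reports.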