The document describes the NCBI Pathogen Analysis Pipeline, which supports real-time sequencing of foodborne pathogens. The pipeline performs k-mer analysis, genome assembly, annotation, placement, clustering, SNP analysis, and tree construction on sequencing data submitted to NCBI. It provides automated bacterial assembly and a SNP analysis pipeline for clustering isolates and identifying outbreaks. The pipeline is demonstrated on outbreaks linked to stone fruit and chicken Kiev. NCBI aims to build a database of sequenced antibiotic-resistant isolates with standardized metadata and to maintain reference databases of antibiotic resistance genes.
The National Center for Biotechnology Information (NCBI) Pathogen Analysis Pipeline to Support Real Time Sequencing of Foodborne Pathogens
5. Automated Bacterial Assembly
• Input: SRA reads (sample 1)
• Trim reads (Ns, adaptor)
• Find closest reference genome(s) using the reference distance tree
• Argo (reference-assisted assembly)
• De novo assembly panel: SOAPdenovo, GS-assembler (Newbler), MaSuRCA, Celera Assembler, SPAdes
• ArgoCA (combined assembly)
• Reads remapped to combined assembly
• Outputs: contig FASTA, read placements (BAM), quality profile
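The "Trim reads (Ns, adaptor)" step can be illustrated with a minimal sketch. This is not the trimming code used by the NCBI pipeline; the function name `trim_read` and the adaptor sequence are assumptions chosen only for illustration, and real trimmers also handle partial adaptor matches and base qualities.

```python
def trim_read(seq, adaptor):
    """Cut a read at the first exact adaptor match, then strip flanking Ns.

    Hypothetical helper: exact matching only, no quality-based trimming.
    """
    seq = seq.upper()
    idx = seq.find(adaptor)
    if idx != -1:
        # Everything from the adaptor onward is discarded.
        seq = seq[:idx]
    # Remove ambiguous bases (N) from both ends of what remains.
    return seq.strip("N")


# Illustrative adaptor sequence and read (assumptions, not from the slides).
ADAPTOR = "AGATCGGAAGAGC"
trimmed = trim_read("NNACGTACGT" + ADAPTOR + "NN", ADAPTOR)
print(trimmed)
```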
6. SNP pipeline
1. Initial partition of isolates within each species by k-mer distances
2. Within each partition, BLAST comparison of all pairs of genomes
3. Single linkage clusters with at most 50 SNPs
4. Within clusters, SNPs with respect to one reference
5. Generate final SNP list and phylogenetic trees
Filtering:
• Base level
• Repeat
• Density
Problematic genomes are eliminated at various points along the way
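Step 3 (single linkage clusters with at most 50 SNPs) can be sketched with a union-find structure. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function `single_linkage_clusters`, the `snp_distance` callback, and the toy distances are all hypothetical. The point it shows is that single linkage chains isolates together, so two isolates more than 50 SNPs apart can still share a cluster through an intermediate.

```python
from itertools import combinations


def single_linkage_clusters(isolates, snp_distance, max_snps=50):
    """Merge any two isolates whose pairwise SNP distance is <= max_snps."""
    parent = {name: name for name in isolates}

    def find(x):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(isolates, 2):
        if snp_distance(a, b) <= max_snps:
            parent[find(a)] = find(b)

    clusters = {}
    for name in isolates:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy pairwise SNP counts (made up for illustration).
toy_distances = {
    frozenset(("iso1", "iso2")): 12,
    frozenset(("iso2", "iso3")): 48,
    frozenset(("iso1", "iso3")): 60,   # > 50, but linked through iso2
    frozenset(("iso1", "iso4")): 400,
    frozenset(("iso2", "iso4")): 390,
    frozenset(("iso3", "iso4")): 410,
}


def toy_snp_distance(a, b):
    return toy_distances[frozenset((a, b))]


clusters = single_linkage_clusters(
    ["iso1", "iso2", "iso3", "iso4"], toy_snp_distance
)
print(clusters)
```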
7. High SNP density
Cumulative count of differences
Iterative density filtering (Richa Agarwala modification of Science. 2011 Jan 28;331(6016):430-4)
8. Total targets (May 2016)
Type                  Total targets in k-mer tree   Targets in clusters (single linkage <= 50 SNPs)
Salmonella            45297                         38794
Listeria              9621                          8135
E. coli & Shigella    13144                         6046
Campylobacter         2234                          1569
Acinetobacter         2179                          1299
Elizabethkingia       89                            74
Serratia              336                           227
Klebsiella            1194                          677
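As a reading aid, the share of targets that fall into a cluster can be computed directly from the counts on this slide. The variable names below are illustrative; the numbers are the ones in the table.

```python
# (total targets in k-mer tree, targets in clusters) per organism group,
# copied from the May 2016 table.
targets = {
    "Salmonella": (45297, 38794),
    "Listeria": (9621, 8135),
    "E. coli & Shigella": (13144, 6046),
    "Campylobacter": (2234, 1569),
    "Acinetobacter": (2179, 1299),
    "Elizabethkingia": (89, 74),
    "Serratia": (336, 227),
    "Klebsiella": (1194, 677),
}

# Fraction of each group's targets that are in a <= 50 SNP cluster.
clustered_fraction = {
    name: in_clusters / total for name, (total, in_clusters) in targets.items()
}

total_targets = sum(total for total, _ in targets.values())
total_clustered = sum(in_clusters for _, in_clusters in targets.values())

for name, fraction in clustered_fraction.items():
    print(f"{name}: {fraction:.1%}")
print(f"All groups: {total_clustered}/{total_targets}")
```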
12. Several rows are NULL: the target either is not in a cluster (check the last column) or is in a cluster without any other isolate of the opposite type
Rows with a low SNP count are significant
These isolates all differ by <10 SNPs, and they are all in the same cluster
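The NULL rule described on this slide can be sketched as follows. This is a hedged illustration, not the pipeline's query: the function `min_snp_to_opposite_type`, the data layout, and the two type labels ("clinical" and "environmental") are assumptions, since the slide only says "opposite type".

```python
def min_snp_to_opposite_type(target, cluster_of, isolate_type, snp_distance):
    """Smallest SNP distance from target to an opposite-type isolate in its cluster.

    Returns None (the NULL case) when the target is unclustered or its
    cluster holds no isolate of the opposite type.
    """
    cluster = cluster_of.get(target)
    if cluster is None:
        return None
    candidates = [
        other
        for other, other_cluster in cluster_of.items()
        if other_cluster == cluster
        and other != target
        and isolate_type[other] != isolate_type[target]
    ]
    if not candidates:
        return None
    return min(snp_distance(target, other) for other in candidates)


# Toy data (made up): cluster membership, isolate types, pairwise SNP counts.
cluster_of = {"c1": "A", "e1": "A", "e2": "A", "c2": "B", "c3": None}
isolate_type = {
    "c1": "clinical", "c2": "clinical", "c3": "clinical",
    "e1": "environmental", "e2": "environmental",
}
pair_snps = {frozenset(("c1", "e1")): 3, frozenset(("c1", "e2")): 9}


def toy_distance(a, b):
    return pair_snps[frozenset((a, b))]


result_c1 = min_snp_to_opposite_type("c1", cluster_of, isolate_type, toy_distance)
result_c2 = min_snp_to_opposite_type("c2", cluster_of, isolate_type, toy_distance)
result_c3 = min_snp_to_opposite_type("c3", cluster_of, isolate_type, toy_distance)
print(result_c1, result_c2, result_c3)
```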
15. NCBI Pathogen Detection SNP Pipeline: example 2 – MN chicken Kiev outbreak
16. NCBI Pathogen Detection SNP Pipeline Web viewer (coming soon):
example 3 – Elizabethkingia outbreak
17. R&D: wgMLST
wgMLST approach
• Complementary to SNP analysis, e.g. consistency check
• Efficient for initial clustering of all isolates in species
• Generate loci using “essentially complete” RefSeq genomes
Organism        Number of loci   Genome in loci   Number of genomes   Major species
Acinetobacter   2420             58.25%           43/47               baumannii
Campylobacter   1257             68.36%           90/132              jejuni
Escherichia     2896             52.97%           159/165             coli
Klebsiella      4004             82.54%           67/82               pneumoniae
Listeria        2364             73.88%           73/81               monocytogenes
Salmonella      3469             66.98%           137/147             enterica
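To show how wgMLST profiles can support initial clustering, here is a minimal sketch of an allele distance between two isolates. The profile layout (locus name mapped to an allele identifier, with None for an uncalled locus), the locus names, and the function `allele_distance` are assumptions for illustration; the slides do not specify NCBI's scoring.

```python
def allele_distance(profile_a, profile_b):
    """Count loci with different alleles among loci called in both profiles.

    Returns (differences, loci_compared) so that callers can judge how much
    of the scheme was actually comparable.
    """
    shared = [
        locus
        for locus, allele in profile_a.items()
        if allele is not None and profile_b.get(locus) is not None
    ]
    differences = sum(1 for locus in shared if profile_a[locus] != profile_b[locus])
    return differences, len(shared)


# Toy profiles over four hypothetical loci; None marks a locus that could not
# be called, for example because of an assembly gap.
isolate_x = {"locus1": 4, "locus2": 7, "locus3": 1, "locus4": None}
isolate_y = {"locus1": 4, "locus2": 9, "locus3": 1, "locus4": 2}

distance_xy = allele_distance(isolate_x, isolate_y)
print(distance_xy)
```

Skipping uncalled loci rather than counting them as differences is one possible heuristic; it keeps assembly gaps from inflating the distance, which relates to the assembly-error caveat raised for this method.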
18. R&D: wgMLST – a complementary method
• Fast & relatively simple
• Epidemiologists are familiar with it
• Good for initial clustering
• Different heuristics
• Can use special markers, e.g. for serovars
• Still need to deal with assembly errors
• Recombination can still be a problem…
Loci are not independent
19. NCBI’s Role in Combating Antibiotic Resistant Bacteria
“Create a repository of resistant bacterial strains (an “isolate bank”) and maintain a well-curated reference database that describes the characteristics of these strains.”
“Develop and maintain a national sequence database of resistant pathogens.”
20. AMR efforts at NCBI
• With collaborators, build database of sequenced isolates with standardized AMR metadata (i.e. accept antibiograms) (2019 samples as of May 16 - http://www.ncbi.nlm.nih.gov/biosample/?term=antibiogram[filter])
• Collaborators include: CDC, WRAIR, FDA, B&W
• Stable, up-to-date database of AMR genes with standardized nomenclature
• Collaborators (CARD)
• RefSeq set released by June 2016
• Implement and validate tools for identifying AMR genes in new isolates
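The BioSample link above uses the web interface. A programmatic equivalent can be sketched with the NCBI E-utilities `esearch` endpoint, assuming the same `antibiogram[filter]` term is accepted against `db=biosample`. The function names are illustrative, and `fetch_count` needs network access, so it is defined here but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def antibiogram_search_url(retmax=20):
    """Build an esearch URL for BioSample records carrying antibiogram data."""
    params = {
        "db": "biosample",
        "term": "antibiogram[filter]",  # same term as the web link on the slide
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def fetch_count(url):
    """Return the number of matching records reported by esearch (needs network)."""
    with urlopen(url) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


url = antibiogram_search_url()
print(url)
```

The count returned today will differ from the 2019 samples quoted on the slide, since submissions have continued since May 2016.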
21. Antibiogram Fields
• Fields designed to find balance between comprehensiveness and ease of submission
• Data dictionaries based on outside expertise (ASM, CLSI) standardize input and minimize ‘data drift’
24. Acknowledgements
Richa Agarwala
Azat Badretdin
Slava Brover
Joshua Cherry
Vyacheslav Chetvernin
Robert Cohen
Michael DiCuccio
Mike Feldgarden
Dan Haft
William Klimke
Alex Kotliarov
Arjun Prasad
Edward Rice
Kirill Rotmistrovskyy
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. http://www.ncbi.nlm.nih.gov
National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA
CDC
FDA/CFSAN
USDA-FSIS
PHE/FERA
NHGRI
NIAID
WRAIR
Broad
Wadsworth/MDH
Vendors: PacBio, Illumina, Roche
Stephen Sherry
Sergey Shiryev
Martin Shumway
Tatiana Tatusova
Igor Tolstoy
Chunlin Xiao
Leonid Zaslavsky
Alexander Zasypkin
Alejandro A. Schaffer
Lukas Wagner
Aleksandr Morgulis
David Lipman
James Ostell