The National Center for Biotechnology Information (NCBI) Pathogen Analysis Pipeline to Support Real Time Sequencing of Foodborne Pathogens

William Klimke
GMI9

NCBI Pathogen Detection Pipeline
NCBI Submission Portal
BioSamples
SRA
GenBank
BioProject
NCBI Pathogen Pipeline
Kmer analysis
Genome Assembly
Genome Annotation
Genome Placement
Clustering
SNP analysis
Tree Construction
Reports
QC

sample_name
organism
strain/isolate
Category (attribute_package)
1a) Clinical/Host-associated
1a1) specific_host
1a2) isolation_source
1a3) host-disease
OR
1b) Environmental/Food/Other
1b1) isolation_source
collection_date
Geographic location
6a) geo_loc_name
OR
6b) lat_lon
collected by
Minimal metadata: What, Where, When, Who
NCBI BioSample – Pathogen Template (Foodborne Outbreaks)
https://submit.ncbi.nlm.nih.gov/subs/biosample/
https://www.ncbi.nlm.nih.gov/biosample/docs/
http://www.ncbi.nlm.nih.gov/projects/biosample/validate/
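
The minimal metadata listed above can be checked before submission with a short script. The following is a minimal sketch, not the NCBI validator: the dictionary keys (`strain`, `isolate`, `host_disease`, `collected_by`, `attribute_package`) and the category strings are assumed spellings chosen for illustration, and treating strain/isolate as "at least one of the two" is likewise an assumption.

```python
# Hypothetical category labels; the real template may spell them differently.
CLINICAL = "clinical/host-associated"
ENVIRONMENTAL = "environmental/food/other"


def missing_fields(record):
    """Return the minimal-metadata fields that are absent or empty in record."""

    def has(key):
        return bool(str(record.get(key, "")).strip())

    required = ("sample_name", "organism", "collection_date", "collected_by")
    missing = [key for key in required if not has(key)]

    # Assumption: either strain or isolate satisfies "strain/isolate".
    if not (has("strain") or has("isolate")):
        missing.append("strain|isolate")

    category = str(record.get("attribute_package", "")).strip().lower()
    if category == CLINICAL:
        for key in ("specific_host", "isolation_source", "host_disease"):
            if not has(key):
                missing.append(key)
    elif category == ENVIRONMENTAL:
        if not has("isolation_source"):
            missing.append("isolation_source")
    else:
        missing.append("attribute_package")

    # Geographic location: geo_loc_name OR lat_lon.
    if not (has("geo_loc_name") or has("lat_lon")):
        missing.append("geo_loc_name|lat_lon")
    return missing


food_sample = {
    "sample_name": "example-001",
    "organism": "Listeria monocytogenes",
    "isolate": "example-isolate",
    "attribute_package": "Environmental/Food/Other",
    "isolation_source": "stone fruit",
    "collection_date": "2016-05-01",
    "geo_loc_name": "USA",
    "collected_by": "example lab",
}
print(missing_fields(food_sample))
```

An empty list means the record carries every field in the minimal set; any returned name points at the gap to fill before using the submission portal.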

NCBI Pathogen Detection Pipeline
Submissions (Jan – May, 2016)

Automated Bacterial Assembly
SRA reads (sample 1)
Trim reads (Ns, adaptor)
Find closest reference genome(s) (reference distance tree)
Argo (reference-assisted assembly)
De novo assembly panel: SOAPdenovo, GS-assembler (Newbler), MaSuRCA, Celera Assembler, SPAdes
ArgoCA (Combined Assembly)
Reads remapped to combined assembly
Outputs: contig FASTA, read placements (BAM), quality profile

1. Initial partition of isolates within each species by kmer distances
2. Within each partition, BLAST comparison of all pairs of genomes
3. Single linkage clusters with at most 50 SNPs
4. Within clusters, SNPs with respect to one reference
5. Generate final SNP list and phylogenetic trees
Filtering:
• Base level
• Repeat
• Density
Problematic genomes are eliminated at various points along the way
SNP pipeline
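
Step 3 above (single linkage clusters with at most 50 SNPs) can be illustrated with a small union-find sketch. This is only an illustration of single-linkage clustering under a distance threshold, not the pipeline's code; the isolate names and pairwise SNP counts are made up.

```python
def single_linkage_clusters(isolates, pair_snps, max_snps=50):
    """Group isolates so that each cluster is connected by links of <= max_snps.

    pair_snps maps (isolate_a, isolate_b) to a pairwise SNP count.
    Returns clusters with two or more members, sorted for stable output.
    """
    parent = {name: name for name in isolates}

    def find(x):
        # Walk to the cluster root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), snps in pair_snps.items():
        if snps <= max_snps:
            parent[find(a)] = find(b)

    clusters = {}
    for name in isolates:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values() if len(members) > 1)


# Hypothetical isolates and distances.
isolates = ["A", "B", "C", "D", "E"]
pair_snps = {
    ("A", "B"): 10,
    ("B", "C"): 45,
    ("A", "C"): 55,
    ("C", "D"): 120,
    ("D", "E"): 51,
}
print(single_linkage_clusters(isolates, pair_snps))
```

In this made-up data, A and C differ by 55 SNPs yet end up in the same cluster because each is within 50 SNPs of B. That chaining is the defining property of single linkage.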

High SNP density
Cumulative count of differences
Iterative density filtering (Richa Agarwala modification of Science. 2011 Jan 28;331(6016):430-4)
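
The general idea of iterative density filtering can be sketched as follows. This is a deliberately simple stand-in and not the method used in the pipeline or in the cited paper: it flags SNPs that fall in windows whose SNP count is well above the genome-wide expectation, removes them, recomputes the expectation, and repeats until nothing more is flagged. The window size, `factor`, and `min_count` values are illustrative assumptions.

```python
from bisect import bisect_right


def dense_positions(positions, genome_len, window, factor, min_count):
    """Return SNP positions in any window whose SNP count exceeds the cutoff.

    positions must be sorted. The cutoff is the larger of min_count and
    factor times the count expected from the current genome-wide density.
    """
    expected = len(positions) * window / genome_len
    cutoff = max(min_count, factor * expected)
    flagged = set()
    for i, start in enumerate(positions):
        # j is one past the last SNP inside [start, start + window - 1].
        j = bisect_right(positions, start + window - 1)
        if j - i > cutoff:
            flagged.update(positions[i:j])
    return flagged


def iterative_density_filter(positions, genome_len, window=1000, factor=5.0, min_count=3):
    """Repeatedly drop SNPs in high-density windows until none remain flagged."""
    kept = sorted(set(positions))
    while True:
        flagged = dense_positions(kept, genome_len, window, factor, min_count)
        if not flagged:
            return kept
        kept = [p for p in kept if p not in flagged]


# Hypothetical SNP positions: five SNPs packed into 41 bases, three isolated ones.
snps = [100, 50000, 100000, 100010, 100020, 100030, 100040, 150000]
print(iterative_density_filter(snps, genome_len=200000))
```

Because the cutoff depends on the density of the SNPs still kept, removing a dense region lowers the expectation for the next pass, which is what makes the loop iterative rather than a single scan.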

Total targets (May 2016)
Type                 Total targets in k-mer tree   Targets in clusters (single linkage <= 50 SNPs)
Salmonella           45297                         38794
Listeria             9621                          8135
E. coli & Shigella   13144                         6046
Campylobacter        2234                          1569
Acinetobacter        2179                          1299
Elizabethkingia      89                            74
Serratia             336                           227
Klebsiella           1194                          677
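
The two counts per organism give the share of targets that fall in a cluster. The short sketch below only does that arithmetic on the May 2016 numbers from the table; the percentage is a derived figure, not something reported on the slide.

```python
# (total targets in k-mer tree, targets in clusters) as of May 2016
targets = {
    "Salmonella": (45297, 38794),
    "Listeria": (9621, 8135),
    "E. coli & Shigella": (13144, 6046),
    "Campylobacter": (2234, 1569),
    "Acinetobacter": (2179, 1299),
    "Elizabethkingia": (89, 74),
    "Serratia": (336, 227),
    "Klebsiella": (1194, 677),
}


def clustered_percent(total, clustered):
    """Percentage of targets that are in a cluster, rounded to one decimal."""
    return round(100.0 * clustered / total, 1)


for name, (total, clustered) in targets.items():
    print(f"{name}: {clustered_percent(total, clustered)}% of {total} targets in clusters")
```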

http://www.ncbi.nlm.nih.gov/pathogens/
Results Available Now

Several rows are NULL, meaning the target either is not in a cluster (check the last column) or is in a cluster without any other isolate of the opposite type.
Rows with a low SNP count are significant.
These isolates are all <10 SNPs apart, and they are all in the same cluster.

NCBI Pathogen Detection SNP Pipeline: example 1 - stone fruit outbreak

http://www.cdc.gov/mmwr/preview/mmwrhtml/mm6410a6.htm?s_cid=mm6410a6_e#Fig
similar results to CDC wgMLST

MN chicken Kiev outbreak
NCBI Pathogen Detection SNP Pipeline: example 2 – chicken Kiev outbreak

NCBI Pathogen Detection SNP Pipeline Web viewer (coming soon):
example 3 – Elizabethkingia outbreak

wgMLST approach
• Complementary to SNP analysis e.g. consistency check
• Efficient for initial clustering of all isolates in species
• Generate loci using “essentially complete” RefSeq genomes
Organism        Number of loci   Genome in loci   Number of genomes   Major species
Acinetobacter   2420             58.25%           43/47               baumannii
Campylobacter   1257             68.36%           90/132              jejuni
Escherichia     2896             52.97%           159/165             coli
Klebsiella      4004             82.54%           67/82               pneumoniae
Listeria        2364             73.88%           73/81               monocytogenes
Salmonella      3469             66.98%           137/147             enterica
R&D: wgMLST
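
A wgMLST comparison reduces each isolate to an allele call per locus and counts the loci at which two isolates differ. The sketch below shows that counting step only, under stated assumptions: the locus names and allele numbers are hypothetical, and ignoring loci that are missing or uncalled (`None`) in either isolate is one possible convention, not necessarily the one NCBI uses.

```python
def allele_distance(profile_a, profile_b):
    """Count differing alleles over loci called in both profiles.

    Each profile maps a locus name to an allele identifier, or None when
    the locus could not be called. Returns (differing, compared).
    """
    shared = [
        locus
        for locus in profile_a
        if profile_a[locus] is not None and profile_b.get(locus) is not None
    ]
    differing = sum(1 for locus in shared if profile_a[locus] != profile_b[locus])
    return differing, len(shared)


# Hypothetical allele profiles for two isolates.
isolate_1 = {"locus1": 1, "locus2": 4, "locus3": None, "locus4": 7}
isolate_2 = {"locus1": 1, "locus2": 5, "locus3": 2, "locus5": 3}
print(allele_distance(isolate_1, isolate_2))
```

Returning the number of loci compared alongside the difference count makes it easier to spot pairs whose small distance only reflects a poor assembly with few callable loci.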

• Fast & relatively simple
• Epidemiologists are familiar with it
• Good for initial clustering
• Different heuristics
• Can use special markers, e.g. for serovars
• Still need to deal with assembly errors
• Recombination can still be a problem…
wgMLST – a complementary method
Loci are not independent
R&D: wgMLST

NCBI’s Role in Combating Antibiotic Resistant Bacteria
“Create a repository of resistant bacterial strains (an “isolate bank”) and maintain a well-curated reference database that describes the characteristics of these strains.”
“Develop and maintain a national sequence database of resistant pathogens.”

AMR efforts at NCBI
• With collaborators, build a database of sequenced isolates with standardized AMR metadata (i.e. accept antibiograms); 2,019 samples as of May 16 (http://www.ncbi.nlm.nih.gov/biosample/?term=antibiogram[filter])
• Collaborators include CDC, WRAIR, FDA, B&W
• Stable, up-to-date database of AMR genes with standardized nomenclature
• Collaborators: CARD
• RefSeq set released by June 2016
• Implement and validate tools for identifying AMR genes in new isolates

Antibiogram Fields
• Fields designed to find balance between comprehensiveness and ease of submission
• Data dictionaries based on outside expertise (ASM, CLSI) standardize input and minimize ‘data drift’

mcr-1 encoding organisms Total
E. coli 11
Salmonella 10
Antibiotic resistance

NCBI Outputs
Kmer tree
ftp://ftp.ncbi.nlm.nih.gov/pathogen/Results/
• Genome Workbench
• full SNP reports
• Integrated web-based interactive
system*
• AMR reports*
• wgMLST*

Acknowledgements
Richa Agarwala
Azat Badretdin
Slava Brover
Joshua Cherry
Vyacheslav Chetvernin
Robert Cohen
Michael DiCuccio
Mike Feldgarden
Dan Haft
William Klimke
Alex Kotliarov
Arjun Prasad
Edward Rice
Kirill Rotmistrovskyy
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. http://www.ncbi.nlm.nih.gov
National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA
CDC
FDA/CFSAN
USDA-FSIS
PHE/FERA
NIHGRI
NIAID
WRAIR
Broad
Wadsworth/MDH
Vendors: PacBio, Illumina, Roche
Stephen Sherry
Sergey Shiryev
Martin Shumway
Tatiana Tatusova
Igor Tolstoy
Chunlin Xiao
Leonid Zaslavsky
Alexander Zasypkin
Alejandro A. Schaffer
Lukas Wagner
Aleksandr Morgulis
David Lipman
James Ostell
