"Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database": presentation at the Standards for Pathogen Identification via NGS (SPIN) workshop, hosted by the National Institute of Standards and Technology in October 2014, by Heike Sichtig, PhD, of the FDA and Luke Tallon of IGS, UMSOM.
1. Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
U.S. Food and Drug Administration
Institute for Genome Sciences
Heike Sichtig, Ph.D.
Division of Microbiology Devices
OIR/CDRH/FDA/HHS
Heike.Sichtig@fda.hhs.gov
Luke Tallon
Genomics Resource Center
Institute for Genome Sciences
UMSOM
ljtallon@som.umaryland.edu
October 21-22, 2014
NIST Workshop to Identify Standards Needed to Support Pathogen Identification via Next-Generation Sequencing, NIST, MD
2. Microbial NGS-Based Diagnostic Devices
• OIR/DMD working on a fast-tracked Draft Guidance
• On April 1st 2014 held Public Workshop
“Advancing Regulatory Science for High Throughput Sequencing
Devices for Microbial Identification and Detection of Antimicrobial
Resistance Markers” [FR Doc No: 2014-04940]
• Workshop agenda, discussion paper and webcast online:
http://www.fda.gov/MedicalDevices/NewsEvents/WorkshopsConferences/ucm386967.htm
Objectives:
1. Streamline/shorten clinical trials for microbial diagnosis/identification
2. Establish a new comparator algorithm for assays developed using this
new technology
3. Develop regulatory science standards for microbial genome sequencing
4. Investigate the regulatory science required for antimicrobial resistance
determination through microbial genome sequence information.
3. Inter-Agency Working Group on Feasibility
Approach:
• Formed a diverse working group: FDA, NIH-NCBI, NIAID, DTRA, LLNL, and CDC
• Conducted small pilot study to generate information to evaluate
quality of existing sequences in the public domain (In Progress)
• Identify the pre-existing high-quality deposits, and build from
there
• Will use information to set quality bar for sequence outputs for
our ongoing sequencing efforts
• Utilized existing standards (if available) for technical and isolate metadata – no need to reinvent
• Attention given to connecting antimicrobial resistance
phenotype to genomic deposits – clinical collection site
4. Looking ahead: Predictions for Reference Databases
– Multiple levels of Reference DBs likely
• “High quality” genomes only
– For validation and clinical use
• “High quality” + other available genomes
– For testing and development
• Requires definition of “high quality” that must include
some draft genomes
– Extensive screening required
• Human and other hosts; chimeras
• Artificial constructs
– Separate bacterial, viral, fungal reference DBs
– Publicly available (NCBI/EMBL/DDBJ)
Courtesy of Tom Slezak
7. Microbial Reference Database (MicroDB) ($1.67M; funding awarded by FDA/OCET)
• >600 clinically relevant and MCM microorganisms
• Highly controlled and documented approach
• Identify “gaps” and target sequencing efforts
• All raw reads, assemblies, annotations, metadata sent to NCBI and accessible to the PUBLIC
• Traceable results that could be reevaluated as necessary
Collaborations with Clinical Labs and Repositories
• Children’s National Hospital
• DoD Critical Reagents Program (CRP, USAMRIID)
• FDA-CFSAN, FDA-CBER, FDA-CDER
• DHS National Biodefense Analysis and
Countermeasures Center (NBACC)
• The Rockefeller University
• Culture Collections: ATCC, DSMZ
Sequencing Center (UMD IGS)
• Hybrid Approach (PacBio and Illumina)
• Deposit of Raw Reads at NCBI (SRA)
• Deposit of Assemblies at NCBI
• Deposit of Annotations at NCBI
• FDA Interface to Access Data
8. MicroDB Requirements
A. Extracted Genomic DNA (gDNA)
– Extracted gDNA should be of high quality and purity, and at sufficient concentration to
achieve a suitable yield to assure adequate depth and breadth of genomic coverage for
the type of sequencing method employed.
B. BioSample Metadata
– A minimal description of the isolate source material is necessary for traceability. We are
using 14 descriptors as outlined below. (Note: Minimal metadata is modeled in part after
NCBI’s minimal pathogen template)
– Unique ID, organism, strain/isolate, sample site, specimen type, host disease, collection
date, collected by, patient age, gender, geographic location, AST method*, AST method
manufacturer*, Antimicrobial Susceptibilities*
C. Sequencing Data
– The minimum requirement for sequencing data is that the generated raw reads should be
deposited in NCBI’s Sequence Read Archive (SRA) and assemblies should be deposited
at NCBI’s Assembly division. The availability of raw reads and assemblies will provide a
pathway to re-analyze the data as newer technologies emerge. Furthermore, annotation
data should be deposited when available.
– Raw reads, assemblies, annotations*
*not used as a criterion for exclusion
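The BioSample requirement above can be sketched as a simple completeness check. This is only an illustration: the snake_case field names and the `check_biosample` helper are hypothetical, not part of an NCBI template or of the MicroDB pipeline. The one rule taken from the slide is that the three starred AST descriptors are reported as missing but never cause exclusion.

```python
# Keys paraphrase the 14 BioSample descriptors listed above; the snake_case
# spellings are illustrative and are not taken from an NCBI template.
REQUIRED_FIELDS = (
    "unique_id", "organism", "strain_isolate", "sample_site", "specimen_type",
    "host_disease", "collection_date", "collected_by", "patient_age", "gender",
    "geographic_location",
)
# Starred descriptors: recorded when available, never grounds for exclusion.
OPTIONAL_FIELDS = (
    "ast_method", "ast_method_manufacturer", "antimicrobial_susceptibilities",
)


def is_blank(value) -> bool:
    """Treat None and empty or whitespace-only values as missing."""
    return value is None or not str(value).strip()


def check_biosample(record: dict) -> dict:
    """Report which descriptors are missing and whether the record passes."""
    missing_required = [f for f in REQUIRED_FIELDS if is_blank(record.get(f))]
    missing_optional = [f for f in OPTIONAL_FIELDS if is_blank(record.get(f))]
    return {
        "accept": not missing_required,  # optional gaps do not block acceptance
        "missing_required": missing_required,
        "missing_optional": missing_optional,
    }


# Placeholder values only; this is not a real MicroDB record.
example = {field: "placeholder" for field in REQUIRED_FIELDS}
print(check_biosample(example))
```

A record that lacks only AST information is still accepted, and the gap is listed under `missing_optional` so it can be filled in later.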
9. MicroDB Requirements
D. Sequencing Metadata
– A minimal description of the sequencing process is necessary for traceability. We are
using 7 descriptors as outlined below including bioinformatics tool information for assembly
and annotation, and genomic coverage information.
– Library, platform, submitted by, fold coverage, pipeline, assembler, annotation tool*
E. Suggested phenotypic metadata*
– A description of the phenotypic information is suggested to create a link between the
phenotypic traits of particular organisms and their genomic sequence. We are
recommending 5 descriptors as outlined below (1-4 are also included in sections B and C).
– Annotation, AST method, AST method manufacturer, antimicrobial susceptibilities,
additional phenotypic data
*not used as a criterion for exclusion
10. NCBI Submission Cases
1. Children’s National Medical Center
– Submit all data when available
– Register sample metadata via BioSample
– Submit raw reads and assemblies generated by IGS when available
2. FDA/CFSAN
– Collaborative agreement: Wait for genome announcements
– Follow same procedures as for 1 and place a ‘6-month hold’ on data release; lift the hold when genome announcements are out
3. Rockefeller University
– Collaborative agreement: Wait for publication
– Follow same procedures as for 1 and place a ‘6-month hold’ on data release; lift the hold when publication is out
Similar agreements in place with other collaborators depending
on their needs
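One possible reading of the hold logic in cases 2 and 3 is sketched below. It assumes that data become public on the earlier of two dates: the end of the 6-month hold or the date the genome announcement or publication appears. That interpretation, and the helper names `add_months` and `release_date`, are assumptions for illustration; the slide does not spell out the exact rule.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(submitted: date,
                 published: Optional[date] = None,
                 hold_months: int = 6) -> date:
    """Assumed rule: release at hold expiry, or earlier if already published."""
    hold_expiry = add_months(submitted, hold_months)
    if published is not None and published < hold_expiry:
        # The hold is lifted early, but never before the submission itself.
        return max(published, submitted)
    return hold_expiry


# Hypothetical dates, used only to exercise the function.
print(release_date(date(2014, 10, 21)))
print(release_date(date(2014, 10, 21), published=date(2015, 1, 15)))
```

Case 1 (submit all data when available) corresponds to `hold_months=0` under the same assumptions.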
11. Project Approach
• Sequencing in large batches
– Illumina HiSeq paired-end sequencing: >200x
– PacBio long-insert SMRT P4-C2 sequencing: >80-100x
• Assembly
– PacBio only (HGAP, PBcR CA)
– Illumina only (CA, MaSuRCA)
– PacBio/Illumina hybrid (CA)
– Minimal manual QA/QC & curation
• Automated Annotation
• Base modification detection
• Raw reads -> NCBI SRA
• Assembled & annotated genomes -> GenBank
– NCBI BIOPROJECT ID: PRJNA231221
• FDA Web interface to aggregate data
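The coverage targets on this slide follow the usual definition of fold coverage: total sequenced bases divided by genome size. A minimal sketch of that arithmetic is below. The 2.8 Mbp genome size is the Staphylococcus aureus figure quoted elsewhere in the deck; the 100 bp read length is a hypothetical value chosen for illustration, and the function names are not from the project.

```python
import math


def fold_coverage(read_count: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return read_count * read_length_bp / genome_size_bp


def reads_needed(target_fold: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest number of reads that reaches the target fold coverage."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_fold * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 2_800_000  # S. aureus genome size quoted in the deck
READ_LENGTH_BP = 100        # hypothetical read length, for illustration only

# For paired-end data, each mate counts as one read here.
n_reads = reads_needed(200, READ_LENGTH_BP, GENOME_SIZE_BP)
print(n_reads, fold_coverage(n_reads, READ_LENGTH_BP, GENOME_SIZE_BP))
```

Under these assumptions, the >200x Illumina target corresponds to at least 5.6 million 100 bp reads per isolate. Uneven coverage across the genome is not modelled.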
12. Progress - Batch 1
Rockefeller (50)
• Uniform sample set
– Staphylococcus aureus
– 2.8 Mbp genome size
– 32.8 %GC
– Significant metadata
CNH/CFSAN (41)
• Diverse sample set
– 18 genera represented
– 2 – 8 Mbp genome size range
– 38 – 67 %GC range
Images: Wikimedia Commons
NCBI BioProject: PRJNA231221
13. Rockefeller Samples
• Sequencing
– Avg Illumina cvg: 578x
– Avg PacBio cvg: 185x
– 1 or 2 SMRT cells each
• Assembly:
– 32 of 50 in single contig chromosome
– Average contig count = 5
– “Best” assembly:
• HGAP = 29
• CA hybrid = 21
• Most differences subtle
• Annotation complete
• Final QC & data submissions underway
14. CNH/CFSAN Samples
• Sequencing
– Avg Illumina cvg: 315x
– Avg PacBio cvg: 167x
• 2 SMRT cells each
• Assembly
– 12 of 41 in single contig chromosome
• 29 in <= 5 contigs
– Avg contig count = 4.5
– Median contig count = 3
– “Best” assembly (of 41):
• HGAP = 24
• PBcR CA = 14
• CA hybrid = 3
• Annotation underway
15. Assembly QC & Curation
[Figure: ROCK_290 Celera8 ctg vs. ref. Contig ctg7180000000002 plotted against reference gi|374362062|gb|CP003033.1|; both axes 0 to 2,500,000; color scale: %similarity, 0 to 100]
CA8 – Ill/PB hybrid
• Largest Ctg Len: 2,759,091 bp
• Total asm Ctg Len: 2,770,822 bp
[Figure: ROCK_290 HGAP2 ctg vs. ref. Query (QRY) contigs plotted against reference gi|374362062|gb|CP003033.1|; reference axis 0 to 2,500,000; color scale: %similarity, 0 to 100]
HGAP2
• Largest Ctg Len: 2,128,476 bp
• Total asm Ctg Len: 2,802,621 bp
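One quick way to read the ROCK_290 numbers is the share of the total assembly contained in the largest contig. The sketch below uses only the four lengths reported on this slide. The metric and the function name are illustrative choices, not a QC criterion stated in the presentation.

```python
def largest_contig_fraction(largest_ctg_bp: int, total_asm_bp: int) -> float:
    """Fraction of the assembled bases that sit in the largest contig."""
    if total_asm_bp <= 0:
        raise ValueError("total_asm_bp must be positive")
    return largest_ctg_bp / total_asm_bp


# (largest contig length, total assembled contig length) in bp, from the slide.
assemblies = {
    "CA8 Illumina/PacBio hybrid": (2_759_091, 2_770_822),
    "HGAP2": (2_128_476, 2_802_621),
}

for name, (largest, total) in assemblies.items():
    share = 100 * largest_contig_fraction(largest, total)
    print(f"{name}: {share:.1f}% of assembled bases in the largest contig")
```

For this isolate, the hybrid assembly places about 99.6% of its bases in one contig, against about 75.9% for HGAP2. This is a single example and does not by itself say which assembler was judged "best" across the batch.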
18. Thank You
FDA Micro Team
Peyton Hobson, Brittany Goldberg, Kevin Snyder, Tamara Feldblyum, Uwe Scherf, Sally Hojvat
Collaborators
LLNL
Tom Slezak
NIH-NCBI
Bill Klimke, Martin Shumway, David Lipman
NIH-NIAID
Vivien Dugan, Maria Giovani
DTRA
Matt Tobelmann, Chris Detter, Eric VanGieson, Nels Olsen
CDC
Duncan MacCannell
FDA-CFSAN
Maria Hoffmann, Cary Pirone, Andrea Ottessen, Marc Allard, Eric Brown
NMRC
Kim Bishop-Lilly, Ken Frey
IGS@UMD
Lisa Sadzewicz, Luke Tallon, Naomi Sengamalay, Al Godinez, Sandy Ott, Sushma Nagaraj, Claire Fraser
Rockefeller University
Bryan Utter, Douglas Deutsch
Children’s National Medical Center
Brittany Goldberg, Joseph Campos
DOD-CRP
Shanmuga Sozhamannan, Mike Smith
DOD-USAMRIID
Tim Minogue
NBACC
Adam Phillippy, Nick Bergman
ATCC
Liz Kerrigan
DSMZ
Cathrin Sproer