2. Topics
- PacBio SMRT Sequencing Technology Development
- Human Genome Sequencing with PacBio Systems
- The Role of NIST GIAB Reference Material in PacBio Sequencing Technology Development, Optimization, and Demonstration
5. SMRT SEQUENCING DATA CHARACTERISTICS
PACIFIC BIOSCIENCES® CONFIDENTIAL
Long Reads
- Average >10,000 bases
High Consensus Accuracy
- Achieves >99.999% (30x)
Uniform, Unbiased Coverage
- Lack of GC% or sequence complexity bias
DNA Modification Detection
- Epigenome characterization
6. AREAS OF PACBIO TECHNOLOGY DEVELOPMENT
Library Preparation
- DNA Shearing
- Size Selection
- SMRTbell™ Library Preparation
Sequencing
- Instruments: PacBio® RS II, Sequel™ System
- Consumables: SMRT Cells, Zero-Mode Waveguides, Phospholinked Nucleotides
Data Analysis
Primary Analysis
- Base calling
Secondary & Tertiary Analysis
- Mapping (daligner/BLASR)
- Consensus accuracy (Quiver / HGAP)
- De novo assembly (Falcon / MHAP)
- SV calling
- Phasing
- Epigenetic analysis
7. PRODUCT RELEASES OVER THE LAST FOUR YEARS
Feb 2012: C2 Launch
May 2012: v1.3.1 SW Release – Base Mods
Aug 2012: v1.3.2 MagBead Release
Nov 2012: v1.3.3 Microbial Base Modification, XL Chemistry, Stage Start
Jan 2013: SMRT® Cells v3, HGAP/Quiver
Apr 2013: RS II Product Release
• 75K to 150K ZMW
• 2x Throughput
Oct 2013: v2.1
• P5-C3 release
• HGAP 2.0
Mar 2014: v2.2
• IsoSeq™
• HLA-Typing
Oct 2014: v2.3
• P6-C4 release
Apr 2015: Barcode Support
Oct 2015: Sequel System
Increased throughput by over 100x
9. NIST GIAB REFERENCE MATERIAL 8398
- Serves as a well-characterized control material to facilitate development of novel library preparation and sequencing methods for human genomes at PacBio.
10. LIBRARY PREPARATION
Sample Preparation
- DNA Sample
- Fragment DNA
Building of the SMRTbell Template
- Repair Ends
- Ligate Adapters
- Purify DNA
Binding
11. ASSESSING THE IMPACT OF DNA QUALITY ON READ LENGTH
Human gDNA samples from NIST GIAB:
NA12878: CEPH/Utah Pedigree 1463, Lot K6
Thanks Dave Hsu!
E. coli K12 gDNA is mostly >40 kb (same gel)
Both NA12878 samples show significant degradation
Look similar to Coriell samples
PFGE conditions:
Bio-Rad CHEF Mapper XA System
1% PFG-certified agarose gel in 0.5x TBE
~200 ng DNA per lane
Auto-algorithm program
Low = 5 kb
High = 150 kb
Markers:
1 kb Extension Ladder (Invitrogen)
5 kb DNA Ladder (Bio-Rad)
EtBr stained post-electrophoresis
Typhoon imaging:
Fluorescence mode, EtBr channel
100 microns resolution
+3 mm focal plane
- Initial QC of human gDNA samples (NIST/Stanford)
12. Performance of NIST/NA12878 Libraries and E. coli K12
Metrics from SMRT Portal RS.PreAssembler.2
>15 kb libraries loaded at 25 pM on-chip (OCPW)
>30 and >40 kb libraries loaded at 75 pM on-chip (OCPW)
Sample          nReads   #Bases     Mean RL (bp)   RL N50 (bp)
NA12878_15kb    84,969   1,150 Mb   13,533         18,622
K12_15kb        24,941   378 Mb     15,161         21,140
K12_30kb_DDR    60,460   1,031 Mb   17,055         24,745
K12_40kb_DDR    51,679   922 Mb     17,835         26,282
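The read-length metrics in the table can be illustrated with a minimal sketch. It assumes the usual definition of N50 (the length at which reads of that length or longer hold at least half of all bases); the function name `n50` and the toy read lengths are hypothetical and are not taken from SMRT Portal. The last line is only a consistency check on the NA12878_15kb row, dividing the reported bases by the reported read count.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy read lengths in bp (not data from the runs above).
toy_reads = [2_000, 4_000, 6_000, 8_000, 10_000]
toy_mean = sum(toy_reads) / len(toy_reads)
toy_n50 = n50(toy_reads)

# Consistency check on the NA12878_15kb row: total bases / number of reads
# should land close to the reported mean read length (bases are rounded to Mb).
na12878_mean_estimate = 1_150e6 / 84_969
```

N50 weights long reads more heavily than the mean does, which is why the RL N50 column is consistently larger than the Mean RL column.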
13. TYPICAL P6-C4 CHEMISTRY READ LENGTH PERFORMANCE ON A HUMAN GENOME
Data per SMRT Cell: 0.5 – 1 Gb
20 kb size-selected human library
4 hour movie
P6-C4 chemistry
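As a rough planning sketch, the per-cell yield above can be turned into a SMRT Cell count for a target coverage. The 3.2 Gb haploid genome size and the 30x target are assumptions chosen for illustration, not figures from the slide, and `cells_needed` is a hypothetical helper.

```python
import math

GENOME_SIZE_BP = 3_200_000_000   # assumption: approximate haploid human genome size
TARGET_COVERAGE = 30             # assumption: illustrative coverage target


def cells_needed(target_coverage, genome_size_bp, yield_per_cell_bp):
    """Number of SMRT Cells needed to reach the target coverage, rounded up."""
    return math.ceil(target_coverage * genome_size_bp / yield_per_cell_bp)


# Yield range from the slide: 0.5 to 1 Gb per SMRT Cell.
cells_at_high_yield = cells_needed(TARGET_COVERAGE, GENOME_SIZE_BP, 1_000_000_000)
cells_at_low_yield = cells_needed(TARGET_COVERAGE, GENOME_SIZE_BP, 500_000_000)
```

Under these assumptions the estimate spans a factor of two, which follows directly from the 0.5 to 1 Gb yield range.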
14. NEW LARGE INSERT LIBRARY PREPARATION PROTOCOLS
http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Megaruptor-Shearing.pdf
http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Needle_Shearing.pdf
16. THE HUMAN GENOME – FEBRUARY 2001
Source: Science. 2001 Feb 16;291(5507):1304-51; Nature. 2001 Feb 15;409(6822):860-921.
17. THE HUMAN GENOME
- Over 6 billion base pairs
- Organized into 23 chromosomes
- With 2 copies of each
- One maternal, one paternal
- Carrying 20,000 genes
- Each encoding an average of 3 proteins
Source: NHGRI fact sheet
Accessing variation in the human genome enables genetic research.
“Much of the missing heritability (the 'dark matter' of the
genome) will probably turn up as the technology advances.”
- Francis Collins
Nature 464, 674-675 (1 April 2010)
18. TYPES OF INFORMATION COLLECTED FROM PACBIO SEQUENCING OF A HUMAN GENOME
DNA
- Single-Nucleotide Variation (SNPs) ← Illumina “$1000 Genome”
- Structural Variation (SVs) ← Illumina “$1000 Genome”
- Haplotype Phasing ← Cloning/Sanger sequencing
- Epigenetics ← Illumina + bisulfite sequencing
- De Novo Genome Assembly ← Illumina + Hi-C/Dovetail
RNA
- Expression Quantitation ← Illumina
- Isoform Characterization ← PacBio
PacBio Genome
19. PACBIO SEQUENCING AND ASSEMBLY OF NA12878
“We sequenced NA12878 genomic DNA across 851
Pre P5-C3 and 162 P5-C3 [SMRT Cells] to generate
24× and 22× coverage with aligned mean read
lengths of 2,425 and 4,891 base pairs, respectively.”
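The quoted figures imply a per-cell aligned yield for each chemistry, which can be back-calculated as coverage times genome size divided by the number of SMRT Cells. This is only a back-of-envelope sketch: the 3.2 Gb genome size is an assumption, and `implied_yield_per_cell` is a hypothetical helper, not part of any PacBio tool.

```python
GENOME_SIZE_BP = 3_200_000_000  # assumption: approximate haploid human genome size


def implied_yield_per_cell(coverage, n_cells, genome_size_bp=GENOME_SIZE_BP):
    """Average aligned bases per SMRT Cell implied by a coverage figure."""
    return coverage * genome_size_bp / n_cells


# Figures from the quote: 24x over 851 pre-P5-C3 cells, 22x over 162 P5-C3 cells.
pre_p5c3_yield = implied_yield_per_cell(24, 851)
p5c3_yield = implied_yield_per_cell(22, 162)
yield_ratio = p5c3_yield / pre_p5c3_yield
```

Under the stated genome-size assumption, the ratio is independent of the exact genome size chosen, since it cancels out.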
23. “It is time to stop thinking that merely more DNA sequencing will give us the variants that determine human traits”
“We encourage the use of a range of sequencing technologies to explore highly variable and complex genomic regions in a large number of human samples.”
Source: Nature Genetics, September 2015, http://www.nature.com/ng/journal/v47/n9/pdf/ng.3397.pdf
24. “Full resolution of variation is only guaranteed by complete de novo assembly of a genome.”
“We … emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.”
Source: Nature Reviews Genetics, Volume 16, November 2015, p. 627, www.nature.com/nrg/journal/v16/n11/full/nrg3933.html
25. COST-PER-GENOME DILEMMA (QUANTITY VS. QUALITY)
Contig N50 by assembly:
- NCBI-34: 29 Mb
- HuRef: 107 kb
- BGI YH: 7.4 kb
- KB1: 5.5 kb
- NA12878: 24 kb
- CHM1: 144 kb
- RP11: 127 kb
According to the NHGRI website, the definition of “sequencing a genome” changed in the year 2008 to refer to “re-sequencing” in lieu of “de novo assembly.”
- Obtaining a de novo human genome that has the same scientific quality standard as the initial HGP work has NOT followed Moore’s law.
Source: NHGRI – Genome Sequencing Costs - http://www.genome.gov/sequencingcosts/
28. REFERENCE ASSEMBLY QUALITY STANDARDS
Source: McDonnell Genome Institute http://genome.wustl.edu/projects/detail/reference-genomes-improveme
29. MGI METHOD FOR IMPROVING REFERENCE GENOMES
Source: McDonnell Genome Institute http://genome.wustl.edu/projects/detail/reference-genomes-improveme
30. HUMAN GENOME DE NOVO ASSEMBLIES
Year  Technology             Assembler         Sample     Contig N50 (Mb)
2007  ABI 3730               Celera            HuRef      0.11
2009  Illumina GA            SOAP de novo      BGI YH     0.007
2010  454 GS Flx Titanium    Newbler           KB1        0.006
2010  Illumina GA            ALLPATHS-LG       NA12878    0.024
2013  454 GS, HiSeq, MiSeq   Newbler           RP11_0.7   0.13
2014  HiSeq, BAC clones      Reference-guided  CHM1       0.14
2014  PacBio RS II           FALCON            CHM1       4.38
2015  PacBio RS II           FALCON            CHM13      12.98
2015  PacBio RS II           FALCON            AK1        7.28
2015  PacBio RS II           FALCON            HuRef      10.38
2015  PacBio RS II           FALCON            PC-9*      3.58
2015  PacBio RS II           FALCON            SK-BR-3*   2.56
*cancer cell lines
Chart annotation: 26.9 Mb - NCBI: GCA_001297185.1
Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table 3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)
31. THE HUMAN GENOME - 2015
http://www.ncbi.nlm.nih.gov/assembly/GCA_001297185.1/
Contig N50
26.9 Mb
32. TOWARDS PLATINUM GENOMES: PACBIO RELEASES A NEW, HIGHER QUALITY CHM1 ASSEMBLY TO NCBI
Figure 1. The PacBio CHM1 assembly resolves the q arms of chromosomes 2 and 6 into very few contigs, with max contigs 107 Mbp and 109 Mbp long, respectively.
Posted: Friday, October 2, 2015
Source: PacBio blog post, Tuesday September 29, 2015, http://pacb.com/blog
34. NIST GENOME IN A BOTTLE (GIAB) PROJECT
Ashkenazim Trio de novo Genome Sequencing Project
Collaborative project with Icahn School of Medicine at Mt. Sinai, New York City
Sequencing:
• Generated PacBio de novo human sequencing from the GIAB Ashkenazim son-father-mother trio from the Personal Genome Project (HG002, HG003, HG004).
• The AJ genomes are candidate NIST Reference Materials planned for release in 2016.
• PacBio coverage is 69X, 32X, and 30X for HG002, HG003, and HG004, respectively.
• A paper describing these data and other data from GIAB is now on bioRxiv.
Sequencing data publicly posted on NCBI:
• NIST Human HG002 NA24385 (Ashkenazim Trio Son) on the NCBI FTP site.
• NIST Human HG003 NA24149 (Ashkenazim Trio Father) on the NCBI FTP site.
• NIST Human HG004 NA24143 (Ashkenazim Trio Mother) on the NCBI FTP site.
https://github.com/PacificBiosciences/DevNet/wiki/Genome-in-a-Bottle-Ashkenazim-Trio
35. GIAB PacBio Assembly Summary with SV calls derived from de novo assemblies
Mount Sinai: Ali Bashir, Matthew Pendleton, Ryan Neff
Pacific Biosciences: Jason Chin
Reed College: Anna Ritz
36. Overview
• Steps for SV calling
– De novo Falcon assembly
– Reference-based comparison
• Mapping with BLASR and Nucmer
– Secondary refined using HMM
– Re-examination of potential deviations in the reference with raw reads
• Currently extending MultiBreak-SV
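The reference-based comparison step can be illustrated with a deliberately simplified sketch; it is not the pipeline used in this work and omits the HMM refinement and raw-read re-examination entirely. It assumes two consecutive, same-strand, collinear alignment blocks of one contig against the reference and compares the gap between them on each sequence. The `Block` type, `classify_gap` function, and the 50 bp `min_size` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One aligned block of a contig (query) against the reference, half-open coordinates."""
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int


def classify_gap(left, right, min_size=50):
    """Classify the gap between two consecutive collinear blocks.

    Extra contig sequence between the blocks suggests an insertion;
    extra reference sequence suggests a deletion.
    """
    ref_gap = right.ref_start - left.ref_end
    qry_gap = right.qry_start - left.qry_end
    diff = qry_gap - ref_gap
    if diff >= min_size:
        return ("insertion", diff)
    if -diff >= min_size:
        return ("deletion", -diff)
    return ("none", abs(diff))


# Hypothetical coordinates for illustration only.
anchor = Block(0, 1000, 0, 1000)
ins_call = classify_gap(anchor, Block(1000, 2000, 1300, 2300))
del_call = classify_gap(anchor, Block(1500, 2500, 1000, 2000))
```

Events that fit neither pattern, such as inversions, would need strand and ordering information that this sketch does not model.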
41. PacBio Assembly Based SV Calls
Sample Deletion Insertion Other Total
HG002 9237 12489 2534 24260
HG003 9356 12299 2580 24235
HG004 9189 12290 2589 24068
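A short sketch can confirm that the Total column equals the sum of the three categories and show the share of insertions per sample. The counts are copied from the table above; the dictionary layout and the helper names are hypothetical.

```python
# Counts copied from the table above.
sv_calls = {
    "HG002": {"Deletion": 9237, "Insertion": 12489, "Other": 2534, "Total": 24260},
    "HG003": {"Deletion": 9356, "Insertion": 12299, "Other": 2580, "Total": 24235},
    "HG004": {"Deletion": 9189, "Insertion": 12290, "Other": 2589, "Total": 24068},
}


def totals_consistent(table):
    """Map each sample to True if Deletion + Insertion + Other equals Total."""
    return {
        sample: row["Deletion"] + row["Insertion"] + row["Other"] == row["Total"]
        for sample, row in table.items()
    }


def insertion_fraction(table):
    """Map each sample to the fraction of its calls that are insertions."""
    return {sample: row["Insertion"] / row["Total"] for sample, row in table.items()}


consistency = totals_consistent(sv_calls)
fractions = insertion_fraction(sv_calls)
```

In all three samples insertions make up slightly more than half of the calls, and the per-category counts are similar across the trio.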
42. PacBio Assembly Based SV Calls
Note: Log x-scale to show full event sizes
43. SV calls consistent between assembly approaches (Falcon vs. Celera)
[Figure panels: Insertion, Deletion, Other]
44. Ongoing
• Refining raw read-based analysis:
– Build new calls
– Mark false-positives
– Identifying discrepancies between two assemblies
– Force calling trios
• Improving heterozygous calls missed via local assembly
• Refining “other” categories
– e.g. splitting out simple and complex inversions
• Merging BioNano/10X calls with PacBio data
45. ROLE OF NIST GIAB AJ TRIO PROJECT AND REFERENCE MATERIAL IN PACBIO TECHNOLOGY DEVELOPMENT
- PacBio characterization data serves as a public resource for data analysis methods development by the community:
  - Structural variation
  - SNV calling
  - De novo assembly
  - Phasing & haplotype reconstruction
  - Methylation / Epigenetic analysis
- Analytical data from multiple platforms serves as validation for algorithm development
- Characterization data and reference material provide a benchmark for development of novel methods
  - New chemistry development to increase read length and accuracy (e.g., library prep methods, polymerase, etc.)
  - Scaffolding using novel library preparation methods
  - Rare variant calling with dilution analysis
- Well-characterized RM will serve as a resource for future use in internal quality testing
  - Consumables
  - Instruments
  - Analysis methods