Jan2016 pac bio giab

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved.
NIST Genome in a Bottle (GIAB) Consortium
Workshop at Stanford University
Luke Hickey – Senior Director, Human BioMedical Sciences, PacBio
January 29, 2016

Topics
- PacBio SMRT Sequencing Technology Development
- Human Genome Sequencing with PacBio Systems
- The Role of NIST GIAB Reference Material in PacBio Sequencing Technology Development, Optimization and Demonstration

PacBio SMRT Sequencing Technology

PACIFIC BIOSCIENCES® CONFIDENTIAL
SINGLE MOLECULE, REAL-TIME (SMRT) DNA SEQUENCING

SMRT SEQUENCING DATA CHARACTERISTICS
Long Reads
- Average >10,000 bases
High Consensus Accuracy
- Achieves >99.999% at 30x coverage
Uniform, Unbiased Coverage
- Lack of GC% or sequence complexity bias
DNA Modification Detection
- Epigenome characterization
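
The >99.999% consensus accuracy figure can be restated on the Phred scale, which is how the same number appears elsewhere in this deck as QV50. The sketch below shows the standard conversion, QV = -10 * log10(error rate); the function names are illustrative and not part of any PacBio tool.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a fractional accuracy to a Phred-scaled quality value."""
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value back to a fractional accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # 99.999% accuracy is one expected error per 100,000 bases.
    print(round(accuracy_to_qv(0.99999)))
    print(qv_to_accuracy(50))
```

Each step of 10 in QV is a tenfold drop in expected error rate, so QV40 is one error per 10,000 bases and QV50 is one per 100,000.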

AREAS OF PACBIO TECHNOLOGY DEVELOPMENT
Library Preparation
- DNA Shearing
- Size Selection
- SMRTbell™ Library Preparation
Sequencing
- Instruments: PacBio® RS II, Sequel™ System
- Consumables: SMRT Cells, Zero-Mode Waveguides, Phospholinked Nucleotides
Data Analysis
- Primary Analysis: base calling
- Secondary & Tertiary Analysis:
  - Mapping (daligner/BLASR)
  - Consensus accuracy (Quiver / HGAP)
  - De novo assembly (Falcon / MHAP)
  - SV calling
  - Phasing
  - Epigenetic analysis

PRODUCT RELEASES OVER THE LAST FOUR YEARS
- Feb 2012: C2 Launch
- May 2012: v1.3.1 SW Release (Base Mods)
- Aug 2012: v1.3.2 MagBead Release
- Nov 2012: v1.3.3 (Microbial Base Modification, XL Chemistry, Stage Start)
- Jan 2013: SMRT® Cells v3, HGAP/Quiver
- Apr 2013: RS II Product Release (75K to 150K ZMW, 2x Throughput)
- Oct 2013: v2.1 (P5-C3 release, HGAP 2.0)
- Mar 2014: v2.2 (IsoSeq™, HLA-Typing)
- Oct 2014: v2.3 (P6-C4 release)
- Apr 2015: Barcode Support
- Oct 2015: Sequel System
Increased throughput by over 100x

HISTORY OF READ LENGTH PERFORMANCE
[Chart: average read length (bp) by year, 2008 to 2015; y-axis 0 to 14,000 bp. Chemistries shown: early PacBio chemistries (453, 1012, 1734 bp), LPR, FCR, ECR2, C2–C2, P4–C2, P5–C3, P6–C4]
P6–C4:
- Average Read Length: 10,000 - 15,000 bp
- Throughput / SMRT® Cell: 750 Mb – 1.25 Gb
- Consensus Accuracy: QV50 @30-fold

NIST GIAB REFERENCE MATERIAL 8398
- Serves as a well-characterized control material to facilitate development of novel library preparation and sequencing methods for human genomes at PacBio.

LIBRARY PREPARATION
[Workflow diagram: DNA Sample, Sample Preparation, Building of the SMRTbell Template, Binding]
Steps shown:
- Fragment DNA
- Repair Ends
- Ligate Adapters
- Purify DNA

ASSESSING THE IMPACT OF DNA QUALITY ON READ LENGTH
Human gDNA samples from NIST GIAB:
NA12878: CEPH/Utah Pedigree 1463, Lot K6
Thanks Dave Hsu!
E. coli K12 gDNA is mostly >40 kb (same gel)
Both NA12878 samples show significant degradation
Look similar to Coriell samples
PFGE conditions:
Bio-Rad CHEF Mapper XA System
1% PFG-certified agarose gel in 0.5x TBE
~200 ng DNA per lane
Auto-algorithm program
Low = 5 kb
High = 150 kb
Markers:
1 kb Extension Ladder (Invitrogen)
5 kb DNA Ladder (Bio-Rad)
EtBr stained post-electrophoresis
Typhoon imaging:
Fluorescence mode, EtBr channel
100 microns resolution
+3 mm focal plane
- Initial QC of human gDNA samples (NIST/Stanford)

Performance of NIST/NA12878 Libraries and E.coli K12
Metrics from SMRT Portal RS.PreAssembler.2
>15 kb libraries loaded at 25 pM on-chip (OCPW)
>30 and >40 kb libraries loaded at 75 pM on-chip (OCPW)
Sample         nReads   #Bases     Mean RL (bp)   RL N50 (bp)
NA12878_15kb   84,969   1,150 Mb   13,533         18,622
K12_15kb       24,941     378 Mb   15,161         21,140
K12_30kb_DDR   60,460   1,031 Mb   17,055         24,745
K12_40kb_DDR   51,679     922 Mb   17,835         26,282
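
For readers unfamiliar with the columns above, the following sketch shows how a read length N50 is typically computed and checks that the reported base counts agree with reads times mean read length. The toy read lengths are invented for illustration; only the four table rows come from the slide.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def implied_megabases(n_reads, mean_read_length):
    """Total yield in Mb implied by a read count and a mean read length."""
    return n_reads * mean_read_length / 1e6


# Rows from the table: (sample, nReads, reported Mb, mean read length in bp)
TABLE = [
    ("NA12878_15kb", 84_969, 1_150, 13_533),
    ("K12_15kb", 24_941, 378, 15_161),
    ("K12_30kb_DDR", 60_460, 1_031, 17_055),
    ("K12_40kb_DDR", 51_679, 922, 17_835),
]

if __name__ == "__main__":
    # Hypothetical read lengths, only to show how N50 differs from the mean.
    toy = [2_000, 5_000, 8_000, 12_000, 20_000, 25_000]
    print("toy mean:", sum(toy) / len(toy), "toy N50:", n50(toy))
    for sample, n_reads, reported_mb, mean_rl in TABLE:
        print(sample, round(implied_megabases(n_reads, mean_rl)), "Mb vs reported", reported_mb, "Mb")
```

Because long reads carry a disproportionate share of the bases, N50 sits above the mean in every row of the table.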

TYPICAL P6-C4 CHEMISTRY READ LENGTH PERFORMANCE ON A HUMAN GENOME
Data per SMRT Cell: 0.5 – 1 Gb
20 kb size-selected human library
4 hour movie
P6-C4 chemistry
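
The per-cell yield above translates directly into the number of SMRT Cells a human genome project would need. This is a back-of-the-envelope sketch: the 3.2 Gb haploid genome size and the 30x target are assumptions chosen for illustration and do not appear on the slide.

```python
import math

# Assumption: approximate haploid human genome size, not stated on the slide.
GENOME_SIZE_BP = 3.2e9


def cells_needed(target_coverage, yield_per_cell_bp, genome_size_bp=GENOME_SIZE_BP):
    """Number of SMRT Cells required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size_bp / yield_per_cell_bp)


if __name__ == "__main__":
    # Slide range: 0.5 to 1 Gb per SMRT Cell.
    for yield_bp in (0.5e9, 1.0e9):
        print(f"{yield_bp / 1e9:.1f} Gb per cell -> {cells_needed(30, yield_bp)} cells for 30x")
```

Under these assumptions a 30x genome needs roughly 96 to 192 cells, which is why per-cell throughput matters as much as read length for human-scale work.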

NEW LARGE INSERT LIBRARY PREPARATION PROTOCOLS
http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Megaruptor-Shearing.pdf
http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Needle_Shearing.pdf

Sequencing Human Genomes
So, you sequenced a human genome … how well did you do?

THE HUMAN GENOME – FEBRUARY 2001
Source: Science. 2001 Feb 16;291(5507):1304-51., Nature. 2001 Feb 15;409(6822):860-921.

THE HUMAN GENOME
- Over 6 billion base pairs
- Organized into 23 chromosomes
- With 2 copies of each
- One maternal, one paternal
- Carrying 20,000 genes
- Each encoding an average of 3 proteins
Source: NHGRI fact sheet
Accessing variation in the human genome enables genetic research.
“Much of the missing heritability (the 'dark matter' of the
genome) will probably turn up as the technology advances.”
- Francis Collins
Nature 464, 674-675 (1 April 2010)

TYPES OF INFORMATION COLLECTED FROM PACBIO SEQUENCING OF A HUMAN GENOME
DNA
- Single-Nucleotide Variation (SNPs) ← Illumina “$1000 Genome”
- Structural Variation (SVs) ← Illumina “$1000 Genome”
- Haplotype Phasing ← Cloning/Sanger sequencing
- Epigenetics ← Illumina + bisulfite sequencing
- De Novo Genome Assembly ← Illumina + Hi-C/Dovetail
RNA
- Expression Quantitation ← Illumina
- Isoform Characterization ← PacBio
PacBio Genome

PACBIO SEQUENCING AND ASSEMBLY OF NA12878
“We sequenced NA12878 genomic DNA across 851
Pre P5-C3 and 162 P5-C3 [SMRT Cells] to generate
24× and 22× coverage with aligned mean read
lengths of 2,425 and 4,891 base pairs, respectively.”

TABLE 1. NA12878 – PACBIO ASSEMBLY RESULTS

FIGURE 2. TANDEM-REPEAT DETECTION FROM SINGLE
MOLECULES PREDICTS A LARGE DIVERGENCE FROM
REFERENCE.

REPEAT EXPANSION DISEASES
Sergei M. Mirkin (2007). Expandable DNA repeats and human disease, Nature 447, 932-940

“It is time to stop thinking that merely more DNA sequencing will give us the variants that determine human traits”
“We encourage the use of a range of sequencing technologies to explore highly variable and complex genomic regions in a large number of human samples.”
Source: http://www.nature.com/ng/journal/v47/n9/pdf/ng.3397.pdf (September 2015)

“Full resolution of variation is only guaranteed by complete de novo assembly of a genome.”
“We … emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.”
Source: www.nature.com/nrg/journal/v16/n11/full/nrg3933.html (Volume 16, November 2015, p. 627)

COST-PER-GENOME DILEMMA (QUANTITY VS. QUALITY)
Contig N50:
- NCBI-34: 29 Mb
- HuRef: 107 kb
- BGI YH: 7.4 kb
- KB1: 5.5 kb
- NA12878: 24 kb
- CHM1: 144 kb
- RP11: 127 kb
According to the NHGRI website, the definition of “sequencing a genome” changed in the year 2008 to refer to “re-sequencing” in lieu of “de novo assembly.”
- Obtaining a de novo human genome that has the same scientific quality standard as the initial HGP work has NOT followed Moore’s law.
Source: NHGRI – Genome Sequencing Costs - http://www.genome.gov/sequencingcosts/

NHGRI GenomeTV: https://www.youtube.com/watch?v=PdVdlzWhaLE

Source: McDonnell Genome Institute http://genome.wustl.edu/projects/detail/reference-genomes-improveme
REFERENCE ASSEMBLY QUALITY STANDARDS

Source: McDonnell Genome Institute http://genome.wustl.edu/projects/detail/reference-genomes-improveme
MGI METHOD FOR IMPROVING REFERENCE GENOMES

HUMAN GENOME DE NOVO ASSEMBLIES
Year   Technology             Assembler          Sample      Contig N50 (Mb)
2007   ABI 3730               Celera             HuRef       0.11
2009   Illumina GA            SOAPdenovo         BGI YH      0.007
2010   454 GS FLX Titanium    Newbler            KB1         0.006
2010   Illumina GA            ALLPATHS-LG        NA12878     0.024
2013   454 GS, HiSeq, MiSeq   Newbler            RP11_0.7    0.13
2014   HiSeq, BAC clones      Reference-guided   CHM1        0.14
2014   PacBio RS II           FALCON             CHM1        4.38
2015   PacBio RS II           FALCON             CHM13       12.98
2015   PacBio RS II           FALCON             AK1         7.28
2015   PacBio RS II           FALCON             HuRef       10.38
2015   PacBio RS II           FALCON             PC-9*       3.58
2015   PacBio RS II           FALCON             SK-BR-3*    2.56
*cancer cell lines
[Bar chart: Contig N50 (Mb), x-axis 0 to 14. Annotation: 26.9 Mb - NCBI: GCA_001297185.1]
Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table 3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)

THE HUMAN GENOME - 2015
http://www.ncbi.nlm.nih.gov/assembly/GCA_001297185.1/
Contig N50: 26.9 Mb

TOWARDS PLATINUM GENOMES: PACBIO RELEASES A
NEW, HIGHER QUALITY CHM1 ASSEMBLY TO NCBI
Figure 1. The PacBio CHM1 assembly resolves the q arms of
chromosomes 2 and 6 into very few contigs, with max contigs
107 Mbp and 109 Mbp long, respectively.
Posted: Friday, October 2, 2015
Source: PacBio blog post, Tuesday September 29, 2015, http://pacb.com/blog

Source: MGI http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
REFERENCE GENOME IMPROVEMENT

NIST GENOME IN A BOTTLE (GIAB) PROJECT
Ashkenazim Trio de novo Genome Sequencing Project
Collaborative project with Icahn School of Medicine at Mt. Sinai, New York City
Sequencing:
• Generated PacBio de novo human sequencing data from the GIAB Ashkenazim son-father-mother trio from the Personal Genome Project (HG002, HG003, HG004).
• The AJ genomes are candidate NIST Reference Materials planned for release in 2016.
• PacBio coverage is 69X, 32X, and 30X for HG002, HG003, and HG004, respectively.
• A paper describing these data and other data from GIAB is now on bioRxiv.
Sequencing data publicly posted on NCBI:
• NIST Human HG002 NA24385 (Ashkenazim Trio Son): available on the NCBI FTP site.
• NIST Human HG003 NA24149 (Ashkenazim Trio Father): available on the NCBI FTP site.
• NIST Human HG004 NA24143 (Ashkenazim Trio Mother): available on the NCBI FTP site.
https://github.com/PacificBiosciences/DevNet/wiki/Genome-in-a-Bottle-Ashkenazim-Trio

GIAB PacBio Assembly Summary with SV calls derived from de novo assemblies
Mount Sinai: Ali Bashir, Matthew Pendleton, Ryan Neff
Pacific Biosciences: Jason Chin
Reed College: Anna Ritz

Overview
• Steps for SV calling
– De novo Falcon assembly
– Reference-based comparison
• Mapping with BLASR and Nucmer
– Secondary refined using HMM
– Re-examination of potential deviations in the
reference with raw-reads
• Currently extending MultiBreak-SV

PacBio Falcon Assembly Stats Trio
Sample   Contigs   Average   N50      Max       Total Size
HG002    13231     230 kb    4.1 Mb   31.6 Mb   3.04 Gb
HG003    17873     172 kb    4.6 Mb   21.5 Mb   3.08 Gb
HG004    16487     185 kb    5.3 Mb   22.6 Mb   3.05 Gb
[Plots: log y-scale and log x-scale]
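
As a quick sanity check on the assembly table, the contig count multiplied by the average contig length should land close to the reported total size. The sketch below does that arithmetic; the 0.02 Gb tolerance is an arbitrary allowance for rounding in the slide values.

```python
# Values copied from the Falcon assembly table: contigs, average contig length (kb), total size (Gb).
ASSEMBLIES = {
    "HG002": (13_231, 230, 3.04),
    "HG003": (17_873, 172, 3.08),
    "HG004": (16_487, 185, 3.05),
}


def implied_total_gb(contigs, average_kb):
    """Total assembly size in Gb implied by contig count and average contig length."""
    return contigs * average_kb * 1e3 / 1e9


if __name__ == "__main__":
    for sample, (contigs, average_kb, reported_gb) in ASSEMBLIES.items():
        implied = implied_total_gb(contigs, average_kb)
        print(f"{sample}: implied {implied:.3f} Gb, reported {reported_gb:.2f} Gb")
```

The N50 values (4.1 to 5.3 Mb) are far larger than the averages (172 to 230 kb), which indicates that most of each assembly sits in a small number of long contigs alongside many short ones.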

Both high/low coverage AJ assemblies highly consistent with GRCh38
[Figure panels: HG002, HG003, HG004]

PacBio Assembly Based SV Calls
Sample Deletion Insertion Other Total
HG002 9237 12489 2534 24260
HG003 9356 12299 2580 24235
HG004 9189 12290 2589 24068
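
The SV table can be checked and summarized with a few lines of arithmetic. The sketch below confirms that each row sums to its reported total and computes the insertion-to-deletion ratio; the variable and function names are illustrative.

```python
# Values copied from the SV table: deletions, insertions, other, reported total.
SV_CALLS = {
    "HG002": (9_237, 12_489, 2_534, 24_260),
    "HG003": (9_356, 12_299, 2_580, 24_235),
    "HG004": (9_189, 12_290, 2_589, 24_068),
}


def insertion_deletion_ratio(deletions, insertions):
    """Ratio of insertion calls to deletion calls."""
    return insertions / deletions


if __name__ == "__main__":
    for sample, (deletions, insertions, other, total) in SV_CALLS.items():
        row_sum = deletions + insertions + other
        ratio = insertion_deletion_ratio(deletions, insertions)
        print(f"{sample}: sum {row_sum} (reported {total}), INS/DEL ratio {ratio:.2f}")
```

All three samples show more insertions than deletions relative to the reference, with a ratio of roughly 1.3 in each case.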

Note: Log x-scale to show full event sizes

SV calls consistent between assembly approaches (Falcon vs. Celera)
[Figure panels: Insertion, Deletion, Other]

Ongoing
• Refining raw read-based analysis:
– Build new calls
– Mark false-positives
– Identify discrepancies between the two assemblies
– Force calling trios
• Improving heterozygous calls missed via local assembly
• Refining “other” categories
– e.g. splitting out simple and complex inversions
• Merging BioNano/10X calls with PacBio data

ROLE OF NIST GIAB AJ TRIO PROJECT AND REFERENCE MATERIAL IN PACBIO TECHNOLOGY DEVELOPMENT
- PacBio characterization data serves as a public resource for data analysis methods development by the community:
  - Structural variation
  - SNV calling
  - De novo assembly
  - Phasing & haplotype reconstruction
  - Methylation / Epigenetic analysis
- Analytical data from multiple platforms serves as validation for algorithm development
- Characterization data and reference material provide a benchmark for development of novel methods
  - New chemistry development to increase read length and accuracy (e.g., library prep methods, polymerase, etc.)
  - Scaffolding using novel library preparation methods
  - Rare variant calling with dilution analysis
- Well-characterized RM will serve as a resource for future use in internal quality testing
  - Consumables
  - Instruments
  - Analysis methods

1000+ PUBLICATIONS TO DATE FEATURING PACBIO SEQUENCING
[Chart: publication counts, 2011 to 2015 (y-axis 0 to 800), by category: Human Biomedical, Plant & Animal, Microbiology]

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.
www.pacb.com

PACBIO RS II
150+ PLACEMENTS
Some pins represent multiple placements
