FIND MEANING IN COMPLEXITY
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
Jason Chin (@infoecho) / Sept. 20, 2014, GRC Workshop, Cambridge, UK
Learning Genomic Structures From De Novo Assembly and Long-read Mapping
Cost per Genome Dilemma
Sequencing cost is down for sure, but getting a de novo human genome that has the
same scientific standard as the initial work does NOT follow Moore’s law.
PacBio® CHM1: 4378 kb, from just a single random fragment library
HGP: N50 ~100 kb
NCBI-34: contig N50 29 Mb
HuRef: 107kb
BGI YH: 7.4kb
KB1: 5.5kb
NA12878: 24kb
CHM1: 144kb
RP11: 127kb
According to the NHGRI website, the definition of “sequencing a genome” changed in 2008.
The 1000 Genomes Project started in 2008, too.
Question Asked!!
•  Since the 1000 Genomes Project, we have learned a lot about point mutations. Can we go beyond that?
•  What if we had 50, 100 or more human assemblies, so that we could address as much genetic variation as possible?
•  Will all human genome sequencing one day be done de novo?
–  If so, how can we get ready for that as bioinformaticians?
Evan Eichler, in Future Opportunities for Genome Sequencing and Beyond, July 28-29, 2014
Where We Are Now
•  One PacBio® human data set is publicly available; more are likely to come
•  Multiple groups have independently assembled the public CHM1 data set from raw data with new algorithms
•  With the new alignment/assembly tools from Gene Myers, one can assemble a genome in ~20,000 CPU-hours (20X faster than the 400,000+ CPU-hours of the previous effort)
New assembly statistics with Daligner:
#Seqs    5,058
Mean     562,695
Max      27,292,514
N50      5,265,098
Total    2,846,115,586
http://dazzlerblog.wordpress.com
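The N50 in the statistics above is the contig length at which contigs of that length or longer hold at least half of the assembled bases. As a minimal sketch of that definition (the function name and the toy lengths are illustrative, not taken from the assembly):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Toy lengths (hypothetical): total is 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70, because 80 + 70 reaches 150
```

Run on the real contig lengths, the same function should give the N50 row of the table, assuming the table uses this standard definition.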
What Can We Learn from High-contiguity
De Novo Human Assemblies?
•  Low-hanging fruit
–  Calling SNPs (assembly not needed, but it helps)
–  Calling structural variants with whole-genome alignment approaches
–  Inferring repeats by coverage analysis
•  The assembly graph can provide information for understanding more complicated polymorphisms
Call SNPs / Example: HLA-B
Calling Structural Variation by Whole-genome Alignment
•  Whole-genome alignment (~1 hr on a 32-core machine)
–  With multi-threaded MUMmer
–  Cluster the hits with mgaps, identify “gaps” in the alignments, and convert them to BED format for visualization
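The gap-finding step can be sketched as follows. This is only an illustration under stated assumptions: the input is a hypothetical list of `(chrom, ref_start, ref_end)` tuples, one per alignment cluster, in 0-based half-open coordinates. It is not the actual mgaps output format, which would need its own parser. Only reference-side gaps are reported here; insertions would show up as gaps on the contig side and need the same treatment with query coordinates.

```python
def gaps_to_bed(clusters, min_size=100):
    """Return BED lines for reference gaps between consecutive clusters.

    clusters: iterable of (chrom, ref_start, ref_end), 0-based half-open.
    min_size: report only gaps longer than this many bases.
    """
    bed = []
    furthest_end = {}  # per chromosome, the furthest aligned position so far
    for chrom, start, end in sorted(clusters):
        last_end = furthest_end.get(chrom)
        if last_end is not None and start - last_end > min_size:
            bed.append(f"{chrom}\t{last_end}\t{start}")
        furthest_end[chrom] = max(end, last_end or 0)
    return bed


# Hypothetical clusters: one 500 bp gap and one 50 bp gap on chr1.
example = [
    ("chr1", 0, 1000),
    ("chr1", 1500, 3000),
    ("chr1", 3050, 4000),
    ("chr2", 0, 500),
]
for line in gaps_to_bed(example):
    print(line)
```

With the default threshold only the 500 bp gap (chr1, 1000 to 1500) is written out; the 50 bp gap falls below the cutoff.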
Structural Variants Called on Chromosome 1
Distribution of Structural Variation Sizes
•  Number of insertions/deletions: 13796 SV calls (insertions or deletions > 100 bp against hg19)
PacBio® vs. Short-read Alignment View for SV in the MHC region
318 bp insertion
Assembly Graph
Each edge is associated with a sequence.
Every path is a candidate model for part of the genome.
From Gene Myers’ ISMB 2014 Keynote talk
Dissect a Contig from a String Graph
The anatomy of a contig from a string graph layout
A contig: a linear non-branching path
Each node: the beginning (5’) or end (3’) of a read
Each edge: a contiguous subsequence from one read
Ek: (V1, V2, Read, Range) = (00099576_1:B, 00101043_0:B, 00101043_0, 1991-0)
Read 1: 00099576_1, Read 2: 00101043_0
In practice, we might just encode the paths in a contig rather than each single edge:
C = (Ek, Ek+1, Ek+2, Ek+3) = (Pj, Pj+1)
Edges: V1 -Ek-> V2 -Ek+1-> V3 -Ek+2-> V4 -Ek+3-> V5
Paths: V1 -Pj-> V3 -Pj+1-> V5
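A small sketch of how such edge records and contig paths might be held in code. The field names are illustrative, and reading Range as a pair of start-end coordinates on the read (with start > end meaning the reverse strand) is an assumption, not something the slide states. The first edge is the one shown above; the second edge and its read ID are hypothetical.

```python
from collections import namedtuple

# One string-graph edge: (V1, V2, Read, Range), with Range split in two.
Edge = namedtuple("Edge", ["v1", "v2", "read", "start", "end"])


def parse_edge(v1, v2, read, read_range):
    """Build an Edge from the four fields, e.g. a range of '1991-0'."""
    start, end = (int(x) for x in read_range.split("-"))
    return Edge(v1, v2, read, start, end)


def is_linear_path(edges):
    """True if each edge starts at the node where the previous one ended."""
    return all(a.v2 == b.v1 for a, b in zip(edges, edges[1:]))


def compact(edges):
    """Collapse a linear path into (first node, last node, total bases)."""
    if not edges or not is_linear_path(edges):
        raise ValueError("edges do not form a linear path")
    total = sum(abs(e.end - e.start) for e in edges)
    return (edges[0].v1, edges[-1].v2, total)


e_k = parse_edge("00099576_1:B", "00101043_0:B", "00101043_0", "1991-0")
# Hypothetical follow-on edge, for illustration only.
e_k1 = parse_edge("00101043_0:B", "00012345_0:E", "00012345_0", "0-3000")
print(compact([e_k, e_k1]))
```

Storing the compacted record instead of every edge mirrors the idea of encoding paths (Pj) rather than single edges (Ek).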
Assembly String Graph of CHM1 Genome
•  Largest connected component: 31998 nodes, 39399 edges, ~36.5% (~1 Gbp) of the human genome (total: 87572 nodes, 94530 edges)
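For readers who want to reproduce this kind of count on their own graph, here is a minimal sketch of finding the largest connected component with a breadth-first search. It ignores edge direction (weak connectivity), which is an assumption about how the numbers above were obtained, and the toy edge list is hypothetical.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    best = set()
    for node in adjacency:
        if node in seen:
            continue
        # Breadth-first search from an unvisited node.
        component = {node}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(sorted(largest_component(toy_edges)))  # ['a', 'b', 'c']
```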
Centromere?
Casey Bergman:
“it almost looks like an
electron micrograph of
the nucleus”
#convergence
Polymorphism Structure vs. Local Assembly Graph Structure
Figure: SNPs and SVs in a diploid genome, and a segmental duplication, give a similar string graph.
Identify Contigs: A New Proposal
Figure labels: SNPs, SVs, primary contig, associated contig 1, associated contig 2.
1 full-length contig + 2 associated contigs
Keep the long-range information while maintaining the relations between the alternative alleles.
Contig 4076 Alignment Around DPY19L2 Locus
Same contig
Contig Graph and Segmental Duplication
Contig 4076: one primary contig, 3 associated contigs, aligned to Chr7 and Chr12
Contig 4076 Alignment to Chr7
Same contig
SV calls from CHM1 asm
SV calls from GRCh38
Local Neighborhood Subgraph of Contig 4076
Examining an Assembly Graph at Contig Level Around 1q21
•  Contig graph, 1q21, contig 4108: another potential segmental duplication?
Another Intriguing Case
•  Contig 4006 mapped to chr 9
The aligned region changes a lot in GRCh38.
Contig Coverage Analysis
Coverage levels marked on the plot: 18.5X, 2 * 18.5X, 3 * 18.5X
High-coverage long contigs: 40 contigs > 100 kbp with coverage > 2.5 * 18.5X
Poor assemblies, alignment artifacts, or sequence errors?
High repeat elements
Checking the Complexity of the High-coverage Contigs
Contig 4006, 687 kb, 53x coverage
Contig 4235, 453 kb, 59x coverage
Contig 3842, 235 kb, 54x coverage
Warning: These contigs may not be 100% correctly assembled due to
some nasty repeats. However, the local graphs give hints about the
true genome structures.
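The selection behind the high-coverage list (contigs longer than 100 kbp with coverage above 2.5 times the 18.5X baseline) can be sketched as below. The function name and tuple layout are illustrative. The first three records use the lengths and coverages quoted above; the last two are hypothetical controls.

```python
def high_coverage_contigs(contigs, base=18.5, fold=2.5, min_length=100_000):
    """Return names of long contigs whose coverage exceeds fold * base.

    contigs: iterable of (name, length_bp, mean_coverage).
    """
    cutoff = base * fold  # 46.25X with the defaults
    return [
        name
        for name, length, coverage in contigs
        if length > min_length and coverage > cutoff
    ]


contigs = [
    ("4006", 687_000, 53.0),
    ("4235", 453_000, 59.0),
    ("3842", 235_000, 54.0),
    ("toy_single_copy", 500_000, 18.0),   # hypothetical: normal coverage
    ("toy_short_repeat", 20_000, 90.0),   # hypothetical: too short
]
print(high_coverage_contigs(contigs))  # ['4006', '4235', '3842']
```

A contig passing this filter is only a candidate: as the warning above says, high coverage can also come from misassembly or alignment artifacts.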
What Does the High-coverage Contig Look Like?
>2000X in this region
High-coverage region: alpha satellites?
Extreme Repeats
Identify Centromere Alpha-satellite Structure
•  Most of the nasty contig graphs are around the centromeres. It remains hard to get long contigs around those very long tandem repeats.
•  However, we can still learn many useful things from long-read data
•  Tool in development: α-Centauri, for identifying different higher-order repeat (HOR) structures (https://github.com/volkansevim/alpha-CENTAURI, Volkan Sevim, Ali Bashir & Karen Miga)
Centromere Alpha Satellites Have Non-trivial Higher-order Repeat Structure
Karen Miga
Example: A Read Reconstructs a 24-mer HOR
Align monomers to each other to identify near-identical monomers
Identify the HOR from the monomer IDs and positions
Figure: monomers 1 to 24 of the HOR, labeled along the read.
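Once each monomer in a read carries an ID (near-identical monomers sharing one ID), the HOR length shows up as the period of the ID sequence. The sketch below illustrates that idea only. It is not the alpha-CENTAURI algorithm, it assumes the IDs are already assigned and exact, and it needs at least two full copies of the repeat in the read.

```python
def hor_period(monomer_ids):
    """Return the smallest period of the monomer ID sequence, or None.

    A period p means monomer_ids[i] == monomer_ids[i + p] for every i.
    Only periods up to half the sequence length are considered, so the
    repeat unit must occur at least twice.
    """
    n = len(monomer_ids)
    for period in range(1, n // 2 + 1):
        if all(
            monomer_ids[i] == monomer_ids[i + period]
            for i in range(n - period)
        ):
            return period
    return None


# Hypothetical read: two full copies of a 24-monomer unit plus 3 extra.
read_ids = list(range(1, 25)) * 2 + [1, 2, 3]
print(hor_period(read_ids))  # 24
```

Real reads carry sequencing errors and diverged monomers, so a practical tool would have to tolerate mismatches rather than require exact periodicity.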
Many Other Open Topics
•  Low-coverage assembly: cost vs. quality analysis
•  Phasing for haplotypes
•  Crowd-sourcing infrastructure for examining / annotating / correcting genome assemblies
•  Evaluating SNP calling with short reads against better assemblies
•  Large-scale comparative genomics with de novo assemblies
•  Assembly-graph data format
•  Visualization techniques
•  Combining other data types, e.g. optical mapping
It is a very exciting time. We still need more tools to harvest
information to generate new knowledge.
For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq
are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
