Presentation by Tina Graves-Lindsay at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on production of reference grade assemblies for various human populations.
2. The Human Reference is a Work in Progress!
• The current reference – GRCh38 - is not optimal for some
regions of the genome and/or some individuals/ancestries.
• GRCh38 is comprised of DNA from several individual humans.
• Allelic diversity and structural variation present major
challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have
allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to
represent the full range of genetic diversity in humans
6. Genome Status
Data Source  Origin           Assembly Accession  Status
CHM1         NA               GCA_001297185.1     Assembly Improvement
CHM13        NA               GCA_000983455.2     Assembly Assessment
NA19240      Yoruban          GCA_001524155.4     Chr-level Assembly Submitted
HG00733      Puerto Rican     GCA_002208065.1     Contig Assembly Submitted
HG00514      Han Chinese      GCA_002180035.1     Contig Assembly Submitted
NA12878      European         GCA_002077035.2     Chr-level Assembly Submitted
HG01352      Colombian        GCA_002209525.1     Contig Assembly Submitted
HG02818      Gambian          -                   Assembly Underway
HG02059      Kinh-Vietnamese  -                   Assembly Assessment
NA19434      Luhya            -                   Assembly Assessment
HG04217      Telugu           -                   Data Production Underway
HG03486      Mende            -                   Assembly Underway**
** First Sequel only data set
8. Assembly QC and Submission Steps
• Multiple Falcon assemblies
• Using stats and alignment to Bionano, pick the best assembly
• Quiver and Pilon on best assembly
• Use Bionano to identify misassemblies
• Submit contig level AGPs to GenBank
• Run through NCBI assembly QA pipeline
• Evaluate and curate output of QA pipeline
• Generate final chromosome level AGPs and submit
• Annotation of chromosome level assembly
22. Falcon Assembly of NA12878 in CYP2D6 Region
[Figure: alignment of NA12878 to GRCh38, with gene labels CYP2D8, CYP2D7, CYP2D6]
• Region of NA12878 that doesn’t exist in GRCh38
• Shows duplication of CYP2D7 gene in NA12878 genome
26. 10X Data – Separating a Heterozygous Allele
[Figure tracks: GRCh38; NA12878 Falcon; 10X Allele 1; 10X Allele 2]
• Heterozygous SV identified by Bionano
• 10X Supernova assembly used - GCA_002022845.1
27. Short Term Future Plans
• Lots of assemblies to analyze!
• Generate the latest Falcon Unzip assemblies for all
samples
• Improve those assemblies
• Identifying misassemblies
• Making the breaks where needed
• Scaffolding the assemblies
• Incorporating BACs as they are finished
• Create Chromosomal AGPs
• Submit to Genbank
28. Longer Term Future Work
• Better Utilization of the Reference
• Mapping Strategies
• Graph based alignments
• Other alt-aware read mapping strategies
• Alternative reference data display challenges – How should we
present data
• Do we continue the current scheme of alt alleles?
• Full reference sequences?
• 2 Haplo-resolved sequences for each allele
• Using Falcon unzip
• Using 10X
• Other technologies?
29. Acknowledgements
• The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
• University of Washington: Evan Eichler
• NCBI: Valerie Schneider
• University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
• BioNano Genomics: Alex Hastie
• Pacific Biosciences: Nick Sisneros, Sarah Kingan, Luke Hickey, Greg Concepcion
• UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
• 10X Genomics: Deanna Church
• Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath
Editor's Notes
As part of our work as a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that there are still a few regions of the genome not fully resolved. There are still a few genes that are not optimally represented for all individuals or ancestries, although we have fixed quite a few of them in GRCh38. The reference is comprised of many individuals, so there are regions where allelic diversity and structural variation present challenges in the assembly. Many of the newer technologies have allowed for improvements to the reference genome. But we realize that there still is a need for additional high quality human genome assemblies to fully cover the range of genetic diversity in humans.
This is a great example of how allelic differences can cause assembly problems. The gene UGT2B17 is known to be copy number variable: some individuals have 1 copy of this gene and other individuals lack it altogether. In Build 36, the clones in this region were all from the RP11 sample, which happened to be heterozygous for this indel. The blue, red, and black boxes represent the clone path through the region, the yellow blocks indicate annotated segmental duplication, and there were two genes annotated in this region. In Build 36 we were representing both the insertion and deletion alleles in the assembly.
By removing the black clones from the path, we were able to close the gap. We then created an alternate allele from the black clones, which required sequencing one additional clone, so that we ended up with both the insertion and deletion alleles in GRCh37.
This changes our understanding of the biology of this region: we closed the gap that existed, we removed the falsely annotated duplication, and there is really only one copy of this gene present in the assembly, with an allelic variant.
This example shows how multiple haplotypes in the assembly can cause problems.
In the past few years we have been working on a project funded by NIH to sequence additional human reference genomes. These are the samples we have been working on. Originally we planned to sequence 5 diploid genomes and 2 haploid genomes. Currently we are working on our 10th diploid genome. These genomes will help to add diversity to the reference.
As part of this project, we are generating ~60X coverage of PacBio long read data. We will do a de novo assembly of that PacBio data. Then we are using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential misassemblies. We are also starting to work with 10X Genomics data. For the initial few genomes, we were targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high quality whole genome assembly.
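The chromosomal AGPs mentioned in this note describe how contigs and gaps are laid out along a larger object. Here is a minimal sketch of that layout, assuming the nine-column, tab-delimited AGP 2.0 format. The function name, object name, contig names, lengths, and gap size are all hypothetical, and the "scaffold" gap type with "map" linkage evidence is only one plausible choice for optical-map-based scaffolding, not necessarily what was submitted.

```python
def scaffold_to_agp(object_name, parts, gap_type="scaffold", evidence="map"):
    """Build AGP 2.0-style lines for one scaffold or chromosome.

    parts is an ordered list of tuples, either
      ("contig", name, length, orientation)  or  ("gap", length).
    Coordinates are 1-based and inclusive, as in the AGP format.
    """
    lines = []
    pos = 1
    for part_number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _, name, length, orientation = part
            end = pos + length - 1
            # "W" marks a WGS contig; the whole contig (1..length) is used.
            fields = [object_name, pos, end, part_number, "W",
                      name, 1, length, orientation]
        else:
            _, length = part
            end = pos + length - 1
            # "N" marks a gap of specified size with linkage.
            fields = [object_name, pos, end, part_number, "N",
                      length, gap_type, "yes", evidence]
        lines.append("\t".join(str(f) for f in fields))
        pos = end + 1
    return lines


# Hypothetical toy scaffold: two contigs joined across a 200 bp gap.
demo_parts = [
    ("contig", "ctg1", 1000, "+"),
    ("gap", 200),
    ("contig", "ctg2", 500, "-"),
]
agp_lines = scaffold_to_agp("chr_demo", demo_parts)
for line in agp_lines:
    print(line)
```

Each output row places one component on the object, so the end coordinate of the last row is the total object length.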
To date, data has been generated for 2 haploid genomes and 10 diploid genomes, all at ~60X coverage or higher. We have a lot of data and a lot of assemblies to work with. For 2 of the diploid genomes, we have chromosome level assemblies; the rest are at the contig level.
**2 additional genomes – data will be generated soon
Here are the assembly stats we have for all of the genomes we have assembled to date. All of these genomes are being assembled using Falcon. With the newer version of Falcon, we are seeing a huge increase in contiguity. In most cases, the N50 has increased by 3 times. Assemblies were generated with FALCON-integrate 1.7.5, using various minimum seed read lengths and min_cov settings.
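For readers less familiar with the N50 statistic quoted here, it is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. A minimal sketch follows. The lengths are toy values in Mb, not numbers from these assemblies, and the function name is an illustrative choice.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First contig at which the running sum reaches half the total.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy contig lengths (Mb) with the same total size but different contiguity.
older = [4, 4, 3, 3, 2, 2, 1, 1]
newer = [12, 5, 2, 1]
print(n50(older), n50(newer))
```

Both toy assemblies total 20 Mb, but the N50 goes from 3 to 12 because more of the sequence sits in long contigs.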
We generate multiple assemblies, varying the minimum seed read length and min_cov. From those 20 or so assemblies, we pick the best one using assembly stats and alignment to Bionano. The raw data is generally submitted a month or so after production of the data is completed.
This diagram shows the workflow for the Bionano Irys system. It is a nanochannel technology where long DNA molecules are nicked and labeled at specific recognition sites. You end up with nick sites along the DNA molecules, similar to a restriction digest, only you have the added benefit of the nicks being in context to one another. Once the data is assembled, the resulting genome maps can be used for SV detection, gap sizing, assembly QC, and scaffolding.
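To compare a sequence assembly against such a map, the sequence is first converted into the same kind of label pattern. Here is a rough sketch of that in silico labeling step, assuming a single recognition motif searched on both strands. The motif GCTCTTC (the recognition sequence of the nicking enzyme Nt.BspQI) is an assumption for illustration, since the talk does not name the enzyme. The function names and the toy sequence are hypothetical, and this is not Bionano's own software.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def label_positions(seq, motif):
    """Sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    hits = set()
    # A site on the reverse strand appears as the reverse complement
    # of the motif when read along the forward strand.
    for pattern in {motif.upper(), reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            hits.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(hits)


def label_intervals(positions):
    """Distances between consecutive labels, the pattern a map compares."""
    return [b - a for a, b in zip(positions, positions[1:])]


MOTIF = "GCTCTTC"  # assumed motif, for illustration only
toy = "AA" + MOTIF + "AAAAA" + reverse_complement(MOTIF) + "TTT" + MOTIF + "A"
sites = label_positions(toy, MOTIF)
print(sites, label_intervals(sites))
```

The spacing between labels, rather than the bases themselves, is what gets aligned between the sequence and the optical map.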
Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. So from this you can see how these two technologies are very complementary to one another. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
BioNano has also identified a second enzyme that nicks well for human genomes. You can create a second map with the other enzyme and then, through software improvements that are coming in the next month, you will be able to align your sequence to both maps. This will increase the N50 by 2 times.
used 14k_120_120_1
Once we identified which assembly version we wanted to improve, we aligned it to the BioNano maps, generated SV calls, and did hybrid scaffolding. During the hybrid scaffolding process, conflicts are identified. For this genome, 51 conflicts were identified. We looked at the sequence alignments for all of these conflicts and found 35 to be PacBio assembly errors. We also looked through the translocation and complex SV calls, as well as a rough alignment of the assembly to GRCh38, to identify contigs that crossed chromosomes. From looking through all of this data, 69 breaks were made. You will see that breaking the obvious chimeric contigs only brought the N50 down a little bit, to 25.7 Mb.
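Breaking chimeric contigs does not necessarily cost much contiguity, which a toy calculation makes concrete. This sketch splits contigs at given breakpoints and compares the N50 before and after. The contig names, lengths, and breakpoint are invented for illustration and are not the actual breaks made in this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def break_contig(length, breakpoints):
    """Split one contig at the given offsets and return the piece lengths."""
    cuts = sorted({b for b in breakpoints if 0 < b < length})
    edges = [0] + cuts + [length]
    return [right - left for left, right in zip(edges, edges[1:])]


def apply_breaks(contigs, breaks):
    """Apply per-contig breakpoints; unbroken contigs pass through whole."""
    pieces = []
    for name, length in contigs.items():
        pieces.extend(break_contig(length, breaks.get(name, [])))
    return pieces


# Toy assembly (Mb); ctgA is treated as chimeric and broken at offset 27.
contigs = {"ctgA": 30, "ctgB": 28, "ctgC": 26, "ctgD": 16}
breaks = {"ctgA": [27]}
before = n50(contigs.values())
after = n50(apply_breaks(contigs, breaks))
print(before, after)
```

In this toy case the N50 only drops from 28 to 27, because the break removes a small piece from one long contig and leaves the rest of the length distribution unchanged.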
Sequence alignments were looked at for all conflicts. Then, to narrow down the complex and translocation calls, we first looked at the BioNano alignments in IrysView.
This is the same PacBio contig as in the last slide, only this time it is comparing the PacBio contig to GRCh38. In the top panel you can see
We have also been using the BioNano maps to identify variation between our genomes and the reference. In this example, there are 2 haplotypes in the BioNano map compared to GRCh38; this appears to be a heterozygous inversion in NA19240.
Here is a list of the initial set of SV calls for our genomes when compared to GRCh38. These contain both homozygous and heterozygous calls.
I have a few examples of what we have been seeing in these assemblies. We decided to take a look at the MHC region of NA19240. This is a comparison of the BioNano map of NA19240 to the reference; the reference is in green and the NA19240 BioNano map is in blue. From the BioNano map, it looks like there is a ~65 kb insertion.
We then aligned the contig from Jason’s most recent assembly to the current reference as well as the alts. This is the region that corresponds to the insertion in the BioNano map, so from this initial look, it appears there is an insertion here in this assembly. We need to look at it further to evaluate whether this would be a useful addition to the alts that are already present.
CYP2D6 is a very diverse genomic region that has implications for drug metabolism. In collaboration with the Pharmacogenomics Research Network (PGRN), we have sequenced multiple alleles in this region using fosmid libraries created from ethnically diverse individuals. Within the region, there is also another CYP gene, CYP2D7, and a pseudogene called CYP2D8, with common repeats interspersed between the gene and pseudogene copies, facilitating genomic rearrangements. The gene CYP2D6 and the associated pseudogenes are shown here, along with some of the different alleles we have sequenced.
This is the alignment of NA12878 to GRCh38, as well as the genes aligned to NA12878.
It was important, especially in highly variable regions of the genome, to capture both alleles from the diploid samples. In collaboration with us, PacBio has generated an unzip assembly. Here is a diagram showing how with Falcon you will be missing allelic variation, but by using Falcon unzip, you should capture the variation that is present. You end up with a set of very contiguous primary contigs and then a set of smaller haplotigs that contain the variation.
The Gambian assembly was done at PacBio for us, and this version is polished.
I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.