150224 grc kms
1. Characterizing extreme diversity in
the human genome using a single
haplotype genomic resource
Karyn Meltz Steinberg, Ph.D.
AGBT 2015 GRC Workshop
@KMS_Meltzy
2. Human Genetic Variation
[Figure: "Types of genetic variants". Frequency plotted against size of variant (1 bp, 1 kb, 1 Mb, 1 chr), showing SNP, copy number variants, and trisomy/monosomy.]
Slide courtesy of S. Girirajan
3. Human Genetic Variation
[Figure: as slide 2, with a second panel, "How do we assay them?", plotting throughput against size of variant (1 bp, 1 kb, 1 Mb, 1 chr).]
Slide courtesy of S. Girirajan
4. Human Genetic Variation
[Figure: as slide 3, with SNP genotyping added to the "How do we assay them?" panel.]
Slide courtesy of S. Girirajan
5. Human Genetic Variation
[Figure: as slide 4, with Array-CGH and Karyotyping added to the "How do we assay them?" panel.]
Slide courtesy of S. Girirajan
6. Human Genetic Variation
[Figure: as slide 5, with Sequencing added to the "How do we assay them?" panel.]
Slide courtesy of S. Girirajan
7. Extreme diversity in the human genome
• <99.5% identity to the reference
• Refractory to traditional sequencing efforts
• Loci often contain gene families associated with
immune response and xenobiotic metabolism
8. HLA is a classic example of an extremely diverse locus
• Critical to immune response
• Characterized by overdominant
selection
• Alleles are linked and segregate as
distinct haplotypes
• Shaped by gene duplication and
diversification
15. Diploid Genome
[Figure: allelic copies and repeat copies (repeat copies noted by color difference), with example bases A, A, C, T, C, G, C, C.]
With a diploid genome, there is significant ambiguity
sorting allelic copies from repeat copies
Haploid Genome
[Figure: repeat copies ONLY (noted by color differences), with example bases A, C, C, C.]
With a haploid genome, allelic differences are eliminated, and
base differences are likely indicative of repeat copies
20. CHM1_1.1 Assembly
• Reference-guided assembly
• SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
Total Sequence Length 3,037,866,619 bp
Total Assembly Gap Length 210,229,812 bp
Number of Scaffolds 163
Scaffold N50 50,362,920 bp
CHM1 Assembly Paper - Genome Research Steinberg et al. 2014
24. Alignment of CHM1 Illumina data to assembly revealed
regions of extreme heterogeneity
                                 Heterozygous   Homozygous   Total
Variants                         64033          22513        86546
In RepeatMasked (RM) sequence    37060          14833        51893
In Segmental duplication (SD)    30670          4843         35513
In RM and SD                     51466          17174        68640
Ts:Tv                            1.5            0.7          1.2
Mean SNV density/kb              0.02           0.008        0.03
There are significantly more heterozygous variants in repetitive
sequence than expected (p < 1x10^-16). BAC ends mapping discordantly
and in multiple loci are significantly enriched for segmental
duplications (p < 1x10^-5).
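The enrichment in the table can be read off directly as proportions. The sketch below only re-derives fractions from the counts on this slide; it is not the statistical test behind the reported p-values, and the dictionary layout and function name are illustrative assumptions.

```python
# Counts copied from the slide 24 table (CHM1 reads aligned to CHM1_1.1).
VARIANTS = {
    "heterozygous": {"total": 64033, "in_rm": 37060, "in_sd": 30670},
    "homozygous": {"total": 22513, "in_rm": 14833, "in_sd": 4843},
}


def fraction(zygosity: str, category: str) -> float:
    """Fraction of calls of one zygosity that fall in a sequence category."""
    counts = VARIANTS[zygosity]
    return counts[category] / counts["total"]


if __name__ == "__main__":
    for zygosity in VARIANTS:
        print(
            f"{zygosity}: "
            f"{fraction(zygosity, 'in_sd'):.1%} in segmental duplication, "
            f"{fraction(zygosity, 'in_rm'):.1%} in RepeatMasked sequence"
        )
```

Roughly 48% of the heterozygous calls sit in segmental duplications, against roughly 22% of the homozygous calls. In a haploid source, heterozygous calls are themselves a sign that reads from paralogous copies are being placed at one location.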
26. CHM1 BioNano Genome Map Aligned to GRCh38
GRCh38
CHM1 BioNano Map
~15kb additional data
27. BioNano SV Calls Identified Assembly Problems
[Figure: CHM1 BioNano Map aligned to the CHM1_1.1 Assembly, with regions labeled "Collapse", "Expansion", "in Assembly" and "Gap in Sequence".]
28. Conclusion
• Extremely diverse regions of the genome are difficult to
characterize due to issues distinguishing allelic from
paralogous duplications
• CHM1_1.1 is a highly contiguous single haplotype
representation of the genome
• Identified regions of misassembly or reference-ized
regions
• Utilize long read technology and nanopore technology to
attempt to fix these regions
31. Need to add more diversity to reference
• Finish another hydatidiform mole to platinum
status
• Finish 5 genomes to gold status
• NA19240 (Yoruban)
• NA12878 (European)
• HG00513 (Han Chinese)
• 2 “wildcards”
• Looking for underrepresented minority population
• Add high quality alternative sequences to
reference to create a population reference graph
or “pan genome”
32. Use colored de Bruijn graph structure to represent
population reference graph
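As a rough illustration of the idea on this slide, a colored de Bruijn graph can be sketched as a set of k-mer edges, each tagged with the samples ("colors") that contain it. This is a minimal toy under stated assumptions, not the data structure of any particular tool: the function name, sample names and sequences are hypothetical, and real implementations also handle reverse complements, sequencing errors and compact storage.

```python
from collections import defaultdict


def build_colored_dbg(samples: dict[str, str], k: int) -> dict[tuple[str, str], set[str]]:
    """Build a toy colored de Bruijn graph.

    Nodes are (k-1)-mers. Each k-mer contributes one edge from its prefix
    to its suffix, and the edge records which samples contain that k-mer.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    edges: dict[tuple[str, str], set[str]] = defaultdict(set)
    for name, seq in samples.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(name)
    return dict(edges)


if __name__ == "__main__":
    # Hypothetical reference and alternate haplotypes differing by one base.
    graph = build_colored_dbg({"reference": "ACGTAC", "alternate": "ACGGAC"}, k=3)
    for (prefix, suffix), colors in sorted(graph.items()):
        print(prefix, "->", suffix, sorted(colors))
```

Edges shared by every sample carry all colors, while a variant appears as a bubble whose two branches carry different colors.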
33. Bioinformatic tool development in the future
• Alignment of short reads to population reference
graph
• Variant calling
• Variant reporting/Haplotype resolution
43. Summary of hydatidiform mole sequence
• 47 functional V genes
• 24 total variants (SNV and CNV) involving 29 IGHV
genes
• 5 structural variants
• 19 single nucleotide variants
• 15 non-synonymous mutations
• 20 out of 24 variants represent differences in amino acid
sequence or gene copy number
44. Summary of hydatidiform mole sequence
100 kbp of novel sequence
45. Current status of CHM1 resources
• CHORI-17 BAC Library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs (>750 have been sequenced, with 592 of them in
Genbank as phase 3)
• Active cell line
• >100X coverage Illumina 100bp reads
• 300 bp, 500 bp and 3 kb inserts
• Reference assisted assembly CHM1_1.1
• BioNano genome map
• >50X coverage of PacBio long read data
46. CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
• Comparison back to GRCh37 reference to provide appropriate gap
sizes
• Assembly submitted to Genbank
• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
• Steinberg et al, 2014
• Genome Research (Dec;24(12):2066-76)
There are many types of genetic variation in the human genome, ranging from single nucleotide variants up to chromosomal abnormalities. Copy number variants fall between these two extremes.
The techniques we use to assay these variants depend on the size of the variant, and this in turn affects the throughput.
SNP genotyping has very high throughput, but the amount of the genome that can be assayed is small.
Array CGH is ideal for copy number variants, but the throughput is much lower than SNP genotyping and it is less effective for smaller variants.
Karyotyping can detect large abnormalities, and the throughput is similar to array CGH.
Finally, sequencing should be able to assay all forms of genetic variation, and the goal of next generation sequencing is to keep increasing the throughput.
So I became interested in how to effectively identify and assay extreme genetic diversity. We expect any two human haplotypes to be 99.8-100% identical to one another. We define extreme genetic diversity as less than 99.5% identity to the reference sequence, which corresponds to more than about 1 variant per 200 base pairs. These loci are often refractory to traditional sequencing efforts and are enriched for genes and gene families related to immune response and environmental detoxification.
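The 99.5% identity cutoff can be applied mechanically once a haplotype has been aligned to the reference. The sketch below is a minimal illustration, assuming the two sequences are already aligned to the same length with "-" marking gaps; the function names are hypothetical and gap columns are simply counted as differences.

```python
EXTREME_DIVERSITY_THRESHOLD = 99.5  # percent identity, from the definition above


def percent_identity(ref: str, alt: str) -> float:
    """Percent of aligned columns in which both sequences carry the same base."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not ref:
        raise ValueError("sequences must be non-empty")
    # A column matches only if the bases agree and neither is a gap.
    matches = sum(1 for r, a in zip(ref, alt) if r == a and r != "-")
    return 100.0 * matches / len(ref)


def is_extremely_diverse(ref: str, alt: str,
                         threshold: float = EXTREME_DIVERSITY_THRESHOLD) -> bool:
    """True if the aligned haplotype falls below the identity threshold."""
    return percent_identity(ref, alt) < threshold


if __name__ == "__main__":
    reference = "ACGT" * 250                  # 1,000 bp toy reference
    haplotype = "CCGT" * 6 + "ACGT" * 244     # six substitutions
    print(percent_identity(reference, haplotype))
    print(is_extremely_diverse(reference, haplotype))
```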
The human leukocyte antigen locus is a classic example of high diversity in the human genome. HLA is critical to human disease as it binds to the T cell receptor and plays a key role in antigen processing and presentation. The locus is characterized by overdominant selection, where individuals with a heterozygous genotype have higher fitness than those with a homozygous genotype. Alleles at this locus are linked and segregate as distinct haplotypes, and the locus has been shaped by extensive gene duplication and diversification.
Now what do I mean by gene duplication and diversification? Let's say there is a gene with 4 functions that duplicates. It has a few different fates.
First, the two copies could each acquire different mutations that inactivate certain functions, but overall the two genes together retain the same functions as before.
Second, the duplicate copy could acquire mutations that endow it with novel functions.
Third, the duplicate gene could acquire mutations to the point that it no longer functions, leading to loss of the gene or pseudogenization.
A duplicated region of the genome that is larger than 1 kilobase and has greater than 90% identity is called a segmental duplication. These duplications can predispose loci to further rearrangement via non-allelic homologous recombination. This can lead to deletions or duplications of intervening sequence.
Segmental duplications may also lead to inversions. They are an important part of the genome's architecture as they serve as hotspots that can create more complex architecture and diversity.
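The size and identity cutoffs in the definition of a segmental duplication translate into a simple filter over pairwise self-alignments. This is a minimal sketch: the `PairwiseAlignment` record and its field names are assumptions made for illustration, not the output format of any specific aligner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseAlignment:
    """Hypothetical summary of one alignment between two genomic regions."""
    length_bp: int    # aligned length in base pairs
    identity: float   # fraction of identical bases, between 0 and 1


def is_segmental_duplication(aln: PairwiseAlignment,
                             min_length_bp: int = 1000,
                             min_identity: float = 0.90) -> bool:
    """Apply the definition: larger than 1 kb and greater than 90% identity."""
    return aln.length_bp > min_length_bp and aln.identity > min_identity


if __name__ == "__main__":
    candidates = [
        PairwiseAlignment(length_bp=244_000, identity=0.99),  # long, near identical
        PairwiseAlignment(length_bp=800, identity=0.99),      # too short
        PairwiseAlignment(length_bp=50_000, identity=0.85),   # too divergent
    ]
    for aln in candidates:
        print(aln, is_segmental_duplication(aln))
```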
Here is an example of one of those segmentally duplicated regions. The SRGAP2 gene family maps to 3 specific regions on chr 1. All three of these loci were poorly assembled in GRCh37. This region was resequenced using the CH17 single haplotype BAC library.
This diagram shows the homologous regions using Miropeats, where the green lines indicate nearly identical segments between SRGAP2A and the duplicate paralogs. The blue lines delineate the larger extent of the homology between SRGAP2B and C. Notice the scale of these regions: the red boxed regions are 244 kb of sequence that is nearly identical among all three loci. In GRCh37, these regions contained multiple haplotypes and were very fragmented. By sequencing these regions with the CH17 BACs, we were able to fully resolve all three of these regions.
Here is another representation of this region. The graphic at the bottom shows the alignment of the fix patch to the reference assembly. The blue boxes highlight sequences on the patches that were completely missing from GRCh37.
By sequencing these regions from a single haplotype source, approximately 500kb of sequence, that was previously misassembled, incorrectly oriented or completely missing, has now been resolved.
As I mentioned, the CHM1_1.1, which is the Illumina-based assembly, was a reference guided assembly created by Richa Agarwala at NCBI, using her SRPRISM assembler. The process involves alignment of the Illumina data to the GRCh37 primary assembly. One thing that is unique about this assembly compared to other whole genome assemblies is the fact that we used many BAC tilepaths in segmentally duplicated regions. There were 45 total paths used in the assembly and then another 104 singletons that were included.
This assembly is available for download from Genbank using the following link. We have recently published a paper on this assembly with further analysis.
At the time this assembly was generated, it had the longest N50 contig length of any human whole genome assembly in Genbank.
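For reference, N50 is the length L such that sequences of length L or longer together cover at least half of the total assembled bases. The sketch below is a minimal implementation of that standard definition; the function name and the toy lengths are illustrative, and the same calculation applies to contigs or scaffolds.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig or scaffold lengths.

    Sort from longest to shortest and walk down until the running sum
    reaches half of the total; the length at that point is the N50.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


if __name__ == "__main__":
    # Toy lengths, not the CHM1_1.1 scaffolds.
    print(n50([2, 3, 4, 5, 6]))
```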
Another CHM1 resource I mentioned is the BioNano Genome Map. BioNano is a nanochannel technology where the DNA is nicked and labeled at specific recognition sites. This is similar to a restriction digest, with the added benefit that the labeled nick sites stay in context along the intact DNA molecules.
This is showing the same region as the previous example. The top green line here is the in silico representation of the GRCh38 reference and the bottom blue lines are the map. The alignment of these two data sets shows the same size discrepancy, so the BioNano map data confirms the extra data found in CHM1 at this position. If this data were accessioned, we could add this sequence to the reference: it would become a fix patch if we think there is an error in GRCh38, or a novel patch if we believe this to be variation.
Here is another way to use the BioNano map to identify potential assembly issues. This time, I have the BioNano map aligned to the CHM1 Illumina assembly. The green bar represents our CHM1 Illumina-based sequence assembly and the blue bars are the map contigs. The first tag, labeled here as a collapse, indicates that there is more data present here in the map contig. To look into the possibility that we have a collapse in our assembly, we examined the Illumina reads aligned back to the CHM1_1.1 assembly. We found that the reads were piling up in the region, indicating a collapse, so it looks as if the BioNano map is correct through this region. The expansion label indicates that there is too much data in the assembly. Within this region we have a gap in our assembly, so our gap is likely sized too big. I think in this example a standard 50 kb gap size was used in the assembly, which likely indicates that we were not sure what the size was.
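The collapse and expansion labels described above come down to comparing the distance between the same pair of label sites in the map and in the assembly. The sketch below illustrates that comparison only; it is not BioNano's structural variant caller, and the function name and the 5 kb tolerance are assumptions chosen for the example.

```python
def classify_interval(map_length_bp: int,
                      assembly_length_bp: int,
                      tolerance_bp: int = 5000) -> str:
    """Compare one interval between matched label sites.

    "collapse":   the map holds more sequence than the assembly.
    "expansion":  the assembly holds more sequence than the map,
                  for example because a gap was sized too large.
    "concordant": the two lengths agree within the tolerance.
    """
    difference = map_length_bp - assembly_length_bp
    if difference > tolerance_bp:
        return "collapse"
    if difference < -tolerance_bp:
        return "expansion"
    return "concordant"


if __name__ == "__main__":
    # Hypothetical interval lengths in base pairs.
    print(classify_interval(map_length_bp=180_000, assembly_length_bp=120_000))
    print(classify_interval(map_length_bp=90_000, assembly_length_bp=130_000))
    print(classify_interval(map_length_bp=100_000, assembly_length_bp=101_000))
```

A call from this kind of comparison is only a flag for follow-up. As in the example above, read depth in the assembly is what supports the collapse.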
Immunoglobulin molecules are formed when somatic recombination occurs between one V, one D and one J gene.
The second major issue is that the current reference sequence was assembled from three lymphoblastoid cell lines. These will be subject to somatic recombination and don't reflect a true haplotype.
Across the V gene region we find that the CH17 haplotype is quite different from the reference haplotype. I have indicated the reference genome sequence on the bottom; each of the green boxes is a functional V gene. On the top line is the mole haplotype and, as you can see, there are many differences, both allelic and structural. We identified and characterized five large structural variants, including the novel insertion of the 7-4-1 gene, an insertion-deletion event where two genes were deleted and replaced with two additional genes, and a duplication and insertion event at V1-69 and V2-70. Interestingly, V1-69 is preferentially used against hemagglutinin epitopes of particular influenza strains, and copy number correlates with expression levels.
This is reflected in the enormous and highly significant Fst values shown here.
Using Fst, we see that this locus is significantly differentiated between the Asian and African populations.
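For readers unfamiliar with Fst, one textbook form for a biallelic site in two populations is (H_T - H_S) / H_T, where H_S is the mean within-population expected heterozygosity and H_T is the expected heterozygosity of the pooled population. The sketch below assumes equal population weights and takes allele frequencies as inputs; the estimator actually used for the values on the slide is not stated, so this is only an illustration.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Fst for one biallelic site from allele frequencies in two populations.

    Uses Fst = (H_T - H_S) / H_T with equal weighting of the populations.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie between 0 and 1")
    # Mean expected heterozygosity within populations.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # site is monomorphic overall, so no differentiation
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    # Hypothetical allele frequencies, not values from the talk.
    print(fst_biallelic(0.2, 0.8))
    print(fst_biallelic(0.3, 0.3))
```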
We identify and annotate 47 functional V genes, 27 functional D genes and six functional J genes in the mole haplotype.
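Since an immunoglobulin heavy chain is formed from one V, one D and one J gene, these counts give an upper bound on the number of distinct germline V-D-J combinations for this haplotype. The short calculation below is only that product; it ignores junctional diversity, somatic hypermutation and light chain pairing, and it assumes every combination can be used.

```python
# Functional gene counts annotated in the mole (CH17) haplotype.
FUNCTIONAL_V_GENES = 47
FUNCTIONAL_D_GENES = 27
FUNCTIONAL_J_GENES = 6


def germline_vdj_combinations(v: int, d: int, j: int) -> int:
    """Number of ways to pick one V, one D and one J gene."""
    return v * d * j


if __name__ == "__main__":
    print(germline_vdj_combinations(
        FUNCTIONAL_V_GENES, FUNCTIONAL_D_GENES, FUNCTIONAL_J_GENES))
```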
There are 7 functional V genes present in the mole haplotype that are not in the reference, and 3 V genes are deleted from the mole haplotype.
In conclusion, using the CH17 hydatidiform mole BAC library, we identify approximately 100 kbp of novel sequence not present in the human reference.
Knowing the utility of this single haploid source, it was decided to sequence the whole genome of CHM1. At the time we started this project, Illumina data was the only cost effective method of generating a whole genome assembly. We generated over 100X coverage of Illumina paired end reads, and a reference guided assembly was produced using this data. More recently, PacBio generated >50X coverage of CHM1 in long read data, and we have also had a BioNano Genome map generated.
As I mentioned, the CHM1_1.1 assembly was a reference guided assembly created by Richa Agarwala at NCBI, using her SRPRISM assembler. The process involves alignment of the Illumina data to the GRCh37 primary assembly. One thing that is unique about this assembly compared to other whole genome assemblies is the fact that we used many BAC tilepaths in segmentally duplicated regions. There were 45 total paths used in the assembly and then another 104 singletons that were included. Then a final step was done to compare the assembly back to GRCh37 to provide appropriate gap sizes. This assembly is available for download from Genbank using the following link. We also have a paper that should be published soon, and you can find it on bioRxiv.