150224 grc kms

Characterizing extreme diversity in the human genome using a single haplotype genomic resource
Karyn Meltz Steinberg, Ph.D.
AGBT 2015 GRC Workshop
@KMS_Meltzy

Human Genetic Variation
[Figure: Types of genetic variants. Frequency plotted against size of variant (1 bp, 1 kb, 1 Mb, 1 chr), running from SNPs through copy number variants to trisomy and monosomy. Slide courtesy of S. Girirajan.]

How do we assay them?
[Figure: the frequency-by-size plot of variant types (SNP, copy number variants, trisomy/monosomy), paired with a second panel plotting assay throughput against size of variant over the same 1 bp to 1 chr axis.]

[Figure build: SNP genotyping added to the throughput-by-size panel.]

[Figure build: array-CGH and karyotyping added to the throughput-by-size panel alongside SNP genotyping.]

[Figure build: sequencing added to the throughput-by-size panel alongside SNP genotyping, array-CGH, and karyotyping.]

Extreme diversity in the human genome
• <99.5% identity to the reference
• Refractory to traditional sequencing efforts
• Loci often contain gene families associated with immune response and xenobiotic metabolism
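
The 99.5% identity cutoff above can be illustrated with a minimal sketch. It assumes a pairwise alignment is already available as two equal-length gapped strings and that gap columns count as mismatches; the slide does not say how identity was measured, so both the function names and that convention are assumptions made for illustration.

```python
DIVERSITY_THRESHOLD = 99.5  # percent identity cutoff quoted on the slide


def percent_identity(aligned_ref, aligned_query):
    """Percent identity over all alignment columns.

    Assumption: gap columns ("-") count as mismatches. Other conventions
    exist, and the slide does not state which one was used.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_ref:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for r, q in zip(aligned_ref, aligned_query)
        if r == q and r != "-"
    )
    return 100.0 * matches / len(aligned_ref)


def is_extremely_diverse(aligned_ref, aligned_query):
    """True when the locus falls below the 99.5% identity cutoff."""
    return percent_identity(aligned_ref, aligned_query) < DIVERSITY_THRESHOLD
```

For example, 10 mismatches across 1,000 aligned columns gives 99.0% identity, which falls below the cutoff, while 4 mismatches gives 99.6%, which does not.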

HLA is a classic example of an extremely diverse locus
• Critical to immune response
• Characterized by overdominant selection
• Alleles are linked and segregate as distinct haplotypes
• Shaped by gene duplication and diversification

Segmental duplications can predispose loci to further
rearrangement via NAHR

[Figure: a diploid genome showing allelic copies and repeat copies (repeat copies noted by color difference), with example bases A, A, C, T, C, G, C, C.]
With a diploid genome, there is significant ambiguity in sorting allelic copies from repeat copies.
[Figure: a haploid genome showing repeat copies only (noted by color differences), with example bases A, C, C, C.]
With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies.
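
The reasoning on this slide can be sketched as a small decision rule. This is an illustrative toy, not a method from the talk: the function name, the string input of bases observed at one position, and the 20% minor-base cutoff are all assumptions.

```python
from collections import Counter


def interpret_site(bases, ploidy, min_minor_fraction=0.2):
    """Interpret mixed bases observed at one position of aligned reads.

    bases: string of base calls covering the position (hypothetical input).
    ploidy: 1 for a haploid source such as a hydatidiform mole, 2 for diploid.
    min_minor_fraction: assumed cutoff below which a second base is treated
    as sequencing error rather than a real difference.
    """
    if not bases:
        raise ValueError("no bases observed at this position")
    ranked = Counter(bases.upper()).most_common()
    if len(ranked) < 2:
        return "no difference"
    minor_fraction = ranked[1][1] / len(bases)
    if minor_fraction < min_minor_fraction:
        return "no difference"
    if ploidy == 1:
        # A haploid genome has no second allele, so a real base difference
        # most likely separates repeat (paralogous) copies.
        return "repeat copy difference"
    # In a diploid genome the same signal could be allelic or paralogous.
    return "ambiguous: allelic or repeat copy"
```

With an even mix of A and C, the haploid case is reported as a repeat copy difference, while the diploid case stays ambiguous.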

SRGAP2: homology between genes
[Figure: nearly identical segments between SRGAP2A and the SRGAP2 paralogs, and homology between SRGAP2B and SRGAP2C; tracks for SRGAP2A, SRGAP2B, and SRGAP2C. Dennis et al. 2012.]

1q21
[Figure: 1q21 patch alignment to chromosome 1, with 1q32, 1q21, and 1p21 labeled.]

Hydatidiform mole
Let’s sequence and assemble the whole genome!

CHM1_1.1 Assembly
• Reference-guided assembly
• SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
Total sequence length: 3,037,866,619 bp
Total assembly gap length: 210,229,812 bp
Number of scaffolds: 163
Scaffold N50: 50,362,920 bp
CHM1 assembly paper: Steinberg et al. 2014, Genome Research
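
As a reading aid for the statistics above, here is a minimal sketch of how a scaffold N50 is conventionally computed, plus the ungapped length implied by the two reported totals. It assumes the total sequence length includes gaps, as in NCBI assembly statistics; the scaffold lengths in the usage example are made up, since the per-scaffold lengths of CHM1_1.1 are not given here.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Values reported on the slide for CHM1_1.1.
TOTAL_SEQUENCE_LENGTH = 3_037_866_619
TOTAL_GAP_LENGTH = 210_229_812

# Assumption: total sequence length includes gaps, so the difference is
# the ungapped sequence.
ungapped_length = TOTAL_SEQUENCE_LENGTH - TOTAL_GAP_LENGTH

# Clone tally from the slide: tilepath clones plus singletons.
total_clones = 324 + 104

# Made-up scaffold lengths, for illustration only.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Under that assumption the ungapped length works out to 2,827,636,807 bp, and the clone counts sum to the 428 reported.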

CHM1_1.1 assembly is highly contiguous compared to
other WGS based assemblies

Integrating BAC tiling paths improved assembly

Alignment of CHM1 Illumina data to assembly revealed
regions of extreme heterogeneity
                                 Heterozygous  Homozygous   Total
Variants                                64033       22513   86546
In RepeatMasked (RM) sequence           37060       14833   51893
In segmental duplication (SD)           30670        4843   35513
In RM and SD                            51466       17174   68640
Ts:Tv                                     1.5         0.7     1.2
Mean SNV density/kb                      0.02       0.008    0.03
There are significantly more heterozygous variants in repetitive sequence than expected (p < 1x10^-16). BAC ends mapping discordantly and in multiple loci are significantly enriched for segmental duplications (p < 1x10^-5).
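
Because CHM1 is haploid, heterozygous calls are not expected from allelic variation, so the share of heterozygous calls in each category is a useful quantity to read off the table. The sketch below only recomputes totals and heterozygous fractions from the counts on the slide. It does not reproduce the significance tests, whose expected values are not given here, and the dictionary and function names are illustrative.

```python
# (heterozygous, homozygous) counts copied from the slide's table.
VARIANT_COUNTS = {
    "Variants": (64033, 22513),
    "In RepeatMasked (RM) sequence": (37060, 14833),
    "In segmental duplication (SD)": (30670, 4843),
    "In RM and SD": (51466, 17174),
}


def heterozygous_fraction(het, hom):
    """Fraction of calls in a category that are heterozygous."""
    return het / (het + hom)


for label, (het, hom) in VARIANT_COUNTS.items():
    frac = heterozygous_fraction(het, hom)
    print(f"{label}: {het + hom} total, {frac:.1%} heterozygous")
```

From these counts, roughly 74% of all calls are heterozygous, rising to roughly 86% within segmental duplications.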

Identified 549 novel protein coding genes not annotated
in GRCh37

CHM1 BioNano Genome Map Aligned to GRCh38
GRCh38
CHM1 BioNano Map
~15kb additional data

BioNano SV Calls Identified Assembly Problems
[Figure: the CHM1_1.1 assembly compared against the CHM1 BioNano map, with examples labeled collapse in assembly, expansion in assembly, and gap in sequence.]

Conclusion
• Extremely diverse regions of the genome are difficult to characterize because allelic variation is hard to distinguish from paralogous duplications
• CHM1_1.1 is a highly contiguous single-haplotype representation of the genome
• Identified regions of misassembly or "reference-ized" regions
• Use long-read and nanopore technology to attempt to fix these regions

Need to add more diversity to reference
• Finish another hydatidiform mole to platinum status
• Finish 5 genomes to gold status
  • NA19240 (Yoruban)
  • NA12878 (European)
  • HG00513 (Han Chinese)
  • 2 “wildcards”
  • Looking for underrepresented minority population
• Add high quality alternative sequences to reference to create a population reference graph or “pan genome”

Use colored de Bruijn graph structure to represent
population reference graph
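
To make the data structure concrete, here is a toy sketch of a colored de Bruijn graph: each edge is a k-mer, stored as a pair of overlapping (k-1)-mers, and each edge carries the set of samples ("colors") whose sequence contains it. The sample names and sequences are hypothetical, and the sketch ignores reverse complements, sequencing errors, and memory-efficient encodings that a real implementation would need.

```python
from collections import defaultdict


def build_colored_dbg(samples, k):
    """Build a colored de Bruijn graph from a dict of name -> sequence.

    Returns a dict mapping (prefix, suffix) edges, where prefix and suffix
    are the two overlapping (k-1)-mers of a k-mer, to the set of sample
    names (colors) in which that k-mer occurs.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    edges = defaultdict(set)
    for color, seq in samples.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(color)
    return dict(edges)


def variant_edges(graph, all_colors):
    """Edges not shared by every sample; these mark where haplotypes diverge."""
    return {edge: colors for edge, colors in graph.items() if colors != all_colors}


# Hypothetical two-sample example with a single base difference.
samples = {"ref": "ACGTAC", "alt": "ACGAAC"}
graph = build_colored_dbg(samples, k=3)
divergent = variant_edges(graph, set(samples))
```

Shared sequence collapses onto edges carrying every color, while a variant opens a bubble whose two branches carry different colors.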

Bioinformatic tool development in the future
• Alignment of short reads to population reference
graph
• Variant calling
• Variant reporting/Haplotype resolution

Adapted from Weinstein et al, 2009

The GRCh37 reference sequence was assembled
from three lymphoblastoid cell lines
Not a true haplotype
Incomplete

The CH17 haplotype is quite different from the reference

Novel insertion

Complex Indel

Hotspot/Recurrent Mutation

[Figure: 60 kbp insertion (hotspot), shown for African, Asian, and European samples.]

Duplication (influenza)

[Figure: 44 kbp duplication (influenza), shown for African, Asian, and European samples.]

Summary of hydatidiform mole sequence
• 47 functional V genes
• 24 total variants (SNV and CNV) involving 29 IGHV genes
• 5 structural variants
• 19 single nucleotide variants
• 15 non-synonymous mutations
• 20 out of 24 variants represent differences in amino acid sequence or gene copy number
• 100 kbp of novel sequence

Current status of CHM1 resources
• CHORI-17 BAC library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 FPC contigs)
• CHORI-17 BACs (>750 have been sequenced, with 592 of them in GenBank as phase 3)
• Active cell line
• >100X coverage Illumina 100 bp reads
• 300 bp, 500 bp, and 3 kb inserts
• Reference-assisted assembly CHM1_1.1
• BioNano genome map
• >50X coverage of PacBio long read data

CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
• Comparison back to GRCh37 reference to provide appropriate gap sizes
• Assembly submitted to GenBank
• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
• Steinberg et al, 2014
• Genome Research (Dec;24(12):2066-76)

LILR (leukocyte immunoglobulin-like receptor)/KIR (killer cell immunoglobulin-like receptor)
Immunoglobulin kappa chain
Immunoglobulin lambda chain
TCRA/B
17q21.31 inversion polymorphism
Immunoglobulin heavy chain locus
CYP2D6
SRGAP2
15q13.3 inversion polymorphism
