This document summarizes the challenges of integrating historical human genetic variation data from analog formats into digital genomic databases. It discusses issues with standardizing phenotypic data, variant call formats from clinical labs, reference assemblies, and defining mutations consistently. Harmonizing these diverse data sources will improve access and interpretation of human genetic variation.
1. Converting from Analog to Digital
Integrating the historical archive of human variation in an NGS world
Deanna M. Church
Staff Scientist, NCBI
@deannachurch Genome Informatics Alliance 2013
2. Acknowledgements
GeT-RM
Lisa Kalman (CDC)
Birgit Funke (Harvard)
Madhuri Hegde (Emory)
Maryam Halavi
Chao Chen
Jon Trow
Douglas Slotta
Peter Meric
Daniel Frishberg
Victor Ananiev
ClinVar
Alex Astashyn
Shanmuga Chitipiralla
Douglas Hoffman
Wonhee Jang
Brandi Kattman
Melissa Landrum
Jennifer Lee
Adriana Malheiro
Wendy Rubinstein
George Riley
Amanjeev Sethi
Ricardo Villamarin
ISCA
Christa Lese Martin (Geisinger)
Erin Riggs (Geisinger)
Jose Mena
Mike Feolo
Tim Hefferon
John Garner
John Lopez
GRC
Valerie Schneider (NCBI)
The Genome Institute at Washington University
The Wellcome Trust Sanger Institute
The European Bioinformatics Institute
6. Variant Call (dbVar submission)
Array data files
Clinical Labs
QC Analysis
Curation
Data regularization
dbGaP
Controlled Access
Web access
FTP Access
Assembly
Remapping
dbVar
ISCA
UCSC
DGV
DGVa
NCBI
Approved Users
BioProject ID
ClinVar
dbGaP projects need a sponsoring NIH institute to run the DAC (NICHD)
7. ASD
Atrial Septum Defect
Autism Spectrum Disorder
??
No HPO: 1,814
HPO: 6,770
Riggs et al., 2012
~2 HPO terms/case (max of 16)
The Human Phenotype Ontology
14. Dennis et al., 2012
1q32 1q21 1p21
1p21 patch alignment to chromosome 1
15. Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35 Unlocalized in NCBI36/GRCh37 Finished in GRCh38
Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID
Alignment to Hydin1 CHM1_1.0, >99.9% ID
Doggett et al., 2006
17. Kidd et al., 2007: APOBEC cluster
Part of chr22 assembly
Alternate locus for chr22
White: Insertion
Black: Deletion
22. Reporting Standards: Not standard
Twelve submitting labs to date
Twelve custom scripts to regularize data
Despite defined formats here:
http://www.ncbi.nlm.nih.gov/projects/variation/get-rm
What are the issues?
23. Reporting Standards: Not standard
What are the issues?
Better Example: QUAL*
*Required sixth column in VCF file
10.01-18357.11
2.6-21.2
0-21.2
20-3070
Allele string
34.79-44624.03
None
20-46006
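The spread of QUAL ranges above can be checked per submission before any regularization is attempted. The following is a minimal sketch, assuming a tab-delimited VCF body with QUAL in the sixth column; the function name `qual_range` and the sample records are hypothetical and are not part of the GeT-RM pipeline.

```python
def qual_range(vcf_lines):
    """Summarize the QUAL column (sixth field) of VCF records.

    Returns (minimum, maximum, non_numeric_count). Header lines starting
    with '#' are skipped. Values that cannot be parsed as numbers, such as
    '.' or an allele string, are counted rather than raising an error.
    """
    values = []
    non_numeric = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            non_numeric += 1  # record too short to carry a QUAL value
            continue
        try:
            values.append(float(fields[5]))
        except ValueError:
            non_numeric += 1
    if not values:
        return None, None, non_numeric
    return min(values), max(values), non_numeric


# Hypothetical records for illustration only.
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t10.01\tPASS\t.\n"
    "1\t200\t.\tC\tT\t18357.11\tPASS\t.\n"
    "1\t300\t.\tG\tA\t.\tPASS\t.\n"
    "1\t400\t.\tT\tC\tC/T\tPASS\t.\n"
)

low, high, bad = qual_range(SAMPLE_VCF.splitlines())
print(low, high, bad)
```

Running a summary like this on each lab's file makes it easy to see which submissions use a different scale and which place non-numeric content in the QUAL column.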
24. Reporting Standards: Not standard
What are the issues?
Lab reporting a single nucleotide change (C->T) het change as:
c.1956+15C>CT
HGVS standards say this should be reported as:
c.1956+15C>T[=]
Lab reporting a single nucleotide change (A->G) hom change as:
c.670+9A>G
HGVS standards say this should be reported as:
c.[670+9A>G];[670+9A>G]
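Both reporting problems on this slide can be flagged mechanically. The sketch below is a rough illustration, assuming only simple coding-sequence substitutions with an optional intronic offset; the names `check_substitution` and `homozygous_alleles` are hypothetical, and this is not a full HGVS parser.

```python
import re

# Matches simple substitutions such as "c.670+9A>G". Deliberately narrow:
# deletions, insertions and other variant types are not handled here.
SUBST = re.compile(r"^c\.(\d+(?:[+-]\d+)?)([ACGT])>([ACGT]+)$")


def check_substitution(desc):
    """Classify a substitution description.

    'ok'             : single reference base to single alternate base
    'genotype-style' : alternate side lists two bases including the
                       reference base (for example C>CT), i.e. a genotype
                       written where an allele change is expected
    'unexpected-alt' : any other multi-base alternate
    'unparsed'       : not a simple substitution this sketch understands
    """
    match = SUBST.match(desc)
    if match is None:
        return "unparsed"
    _position, ref, alt = match.groups()
    if len(alt) == 1:
        return "ok"
    if len(alt) == 2 and ref in alt:
        return "genotype-style"
    return "unexpected-alt"


def homozygous_alleles(desc):
    """Write a simple substitution on both alleles, as c.[X];[X]."""
    match = SUBST.match(desc)
    if match is None or len(match.group(3)) != 1:
        raise ValueError(f"not a simple substitution: {desc!r}")
    change = desc[2:]  # drop the leading "c."
    return f"c.[{change}];[{change}]"


print(check_substitution("c.1956+15C>CT"))
print(homozygous_alleles("c.670+9A>G"))
```

A check of this kind only catches the patterns shown on the slide; anything it returns as `unparsed` would still need manual review or a dedicated HGVS library.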
25. Defining a reference sequence: Data validation
Reported as: NM_007171.3:c.942T>C
Base in transcript is a ‘C’ not a ‘T’
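This kind of mismatch can be detected by comparing the stated reference base with the transcript sequence at the reported position. The following is a minimal sketch, assuming the coding sequence is already in hand and starts at c.1 (the A of the start codon); the function `check_ref_base`, the accession `NM_000000.1`, and the toy sequence are all hypothetical placeholders, not real transcript data.

```python
import re

# Simple exonic substitution on a versioned accession, e.g. "NM_000000.1:c.5T>C".
HGVS_SUBST = re.compile(
    r"^(?P<acc>[A-Z]{2}_\d+\.\d+):c\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def check_ref_base(hgvs, cds):
    """Compare the stated reference base with the coding sequence.

    `cds` is the coding sequence with c.1 at index 0.
    Returns (matches, observed_base).
    """
    match = HGVS_SUBST.match(hgvs)
    if match is None:
        raise ValueError(f"unsupported description: {hgvs!r}")
    position = int(match["pos"])
    if position < 1 or position > len(cds):
        raise ValueError(f"position {position} is outside the coding sequence")
    observed = cds[position - 1].upper()  # HGVS positions are 1-based
    return observed == match["ref"], observed


# Toy coding sequence for illustration only.
TOY_CDS = "ATGCCT"

print(check_ref_base("NM_000000.1:c.5T>C", TOY_CDS))  # stated T, sequence has C
print(check_ref_base("NM_000000.1:c.6T>C", TOY_CDS))  # stated T, sequence has T
```

In practice the sequence would have to come from the exact accession and version named in the report, since the base at a given position can differ between transcript versions.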
The reference is not just the chromosome sequences of the primary assembly unit; it also includes the alternate loci and patches, which provide additional sequence representations at selected genomic regions. The GRC has been releasing patches to the human assembly on a quarterly cycle, and we’re now at GRCh37.p12. There are two varieties of patches:
FIX patches correct existing assembly problems: the chromosome will update, and the patches will be integrated in GRCh38.
NOVEL patches add new sequence representations: these will become alternate loci.
This ideogram shows the current distribution of patches and alternate loci, and you can see that many regions have changed since GRCh37. Note that approximately 3% of the current public human assembly GRCh37 is associated with a region that is represented by a patch or alternate locus.