9. GRC Beginnings
Distributed data
Old Assembly Model
Genome not in INSDC Database
10. Build sequence contigs based on contigs defined in the TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
15. Distributed data
Centralized Data
Old Assembly Model
Genome not in INSDC Database
16. Large-Scale Variation Complicates Genome Assembly
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
19. UGT2B17 MHC MAPT GRCh37 (hg19)
7 alternate haplotypes
at the MHC
Alternate loci released as:
FASTA
AGP
Alignment to chromosome
http://genomereference.org
20.
21. Assembly (e.g. GRCh37)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1 through ALT 9
Genomic regions: PAR, MHC, UGT2B17, MAPT
24. Oh No! Not a new version of the human genome!
http://genomereference.org
25.
26. Assembly (e.g. GRCh37.p5)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1 through ALT 9
Patches
Genomic regions: PAR, MHC, UGT2B17, MAPT, ABO, SMA, PECAM1
…
27. TBC1D3C TBC1D3 TBC1D3H
TBC1D3C
Myo19 region (17q21)
28. 70 Fix PATCHES: Chromosome will update in GRCh38
(adds >1 Mb of novel sequence to the assembly)
71 Novel PATCHES: Additional sequence added
(adds >800K of novel sequence to the assembly)
Releasing patches quarterly
29. Distributed data
Centralized Data
Old Assembly Model
Updated Assembly Model
Genome not in INSDC Database
Genome in INSDC Database
30. Data Archives
GenBank
Data in a common format
Data in a single location (and mirrored)
Most quality checked prior to deposition
Robust data tracking mechanism (accession.version)
Data owned by submitter
31. Data tracking
ABC14-1065514J1
Accession.version  Date         Phase  Gaps  Length
FP565796.1         21-Oct-2009  1      1
FP565796.2         14-Oct-2010  1      0
FP565796.3         07-Nov-2010  3      0
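The accession.version tracking shown on this slide can be illustrated with a small sketch. It assumes the identifier is simply an accession, a dot, and an integer version; the helper names `parse_accession_version` and `latest_versions` are hypothetical and not part of any NCBI tool.

```python
from typing import Dict, Iterable, NamedTuple


class AccVer(NamedTuple):
    accession: str
    version: int


def parse_accession_version(text: str) -> AccVer:
    """Split 'FP565796.3' into ('FP565796', 3). Assumes 'accession.version'."""
    accession, sep, version = text.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return AccVer(accession, int(version))


def latest_versions(identifiers: Iterable[str]) -> Dict[str, AccVer]:
    """Keep only the highest version seen for each accession."""
    newest: Dict[str, AccVer] = {}
    for record in map(parse_accession_version, identifiers):
        current = newest.get(record.accession)
        if current is None or record.version > current.version:
            newest[record.accession] = record
    return newest


# The three versions of the clone record listed on the slide.
history = ["FP565796.1", "FP565796.3", "FP565796.2"]
print(latest_versions(history)["FP565796"])
```

Because the accession stays fixed while the version increments, any analysis can record exactly which sequence it used.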
39. Assembly (e.g. GRCh37.p5)
Unit                                  GenBank            RefSeq
Full assembly                         GCA_000001405.6    GCF_000001405.17
Primary Assembly                      GCA_000001305.1    GCF_000001305.13
Non-nuclear assembly unit (e.g. MT)   GCA_000006015.1    GCF_000006015.1
ALT 1                                 GCA_000001315.1    GCF_000001315.1
ALT 2                                 GCA_000001325.1    GCF_000001325.2
ALT 3                                 GCA_000001335.1    GCF_000001335.1
ALT 4                                 GCA_000001345.1    GCF_000001345.1
ALT 5                                 GCA_000001355.1    GCF_000001355.1
ALT 6                                 GCA_000001365.1    GCF_000001365.2
ALT 7                                 GCA_000001375.1    GCF_000001375.1
ALT 8                                 GCA_000001385.1    GCF_000001385.1
ALT 9                                 GCA_000001395.1    GCF_000001395.1
Patches                               GCA_000005045.5    GCF_000005045.4
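Each unit on this slide carries a GenBank (GCA_) accession and a RefSeq (GCF_) accession that share a numeric part but are versioned independently. A minimal sketch of reading such identifiers follows; the regular expression and the helper names `parse_assembly_accession` and `is_genbank_refseq_pair` are assumptions made for illustration, inferred from the identifiers shown here rather than taken from a formal specification.

```python
import re

# Assumed pattern: prefix, underscore, nine digits, dot, version.
_ASSEMBLY_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")
_SOURCE = {"GCA": "GenBank", "GCF": "RefSeq"}


def parse_assembly_accession(text: str) -> dict:
    """Return the source database, numeric part, and version of an accession."""
    match = _ASSEMBLY_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized assembly accession: {text!r}")
    prefix, number, version = match.groups()
    return {"source": _SOURCE[prefix], "number": number, "version": int(version)}


def is_genbank_refseq_pair(a: str, b: str) -> bool:
    """True when the two accessions differ in source but share the numeric part."""
    pa, pb = parse_assembly_accession(a), parse_assembly_accession(b)
    return pa["source"] != pb["source"] and pa["number"] == pb["number"]


# Full-assembly accessions for GRCh37.p5 as listed on the slide.
genbank, refseq = "GCA_000001405.6", "GCF_000001405.17"
print(parse_assembly_accession(genbank))
print(parse_assembly_accession(refseq))
print(is_genbank_refseq_pair(genbank, refseq))
```

Note that the versions (6 and 17 here) need not match, so a pair should be compared on the numeric part only.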
40. GenBank vs RefSeq
                GenBank               RefSeq
                Submitter Owned       RefSeq Owned
                Redundancy            Non-Redundant
                Updated rarely        Curated
                INSDC                 Not INSDC
BRCA1           83 genomic records    3 genomic records
                31 mRNA records       5 mRNA records
                27 protein records    1 RNA record
                                      5 protein records
41.
42. RefSeq for Assemblies
Typical assembly edits
Addition of non-nuclear (e.g. MT) assembly units
Removal of contamination
Drop unlocalized/unplaced scaffolds
Mask contamination that is placed on chromosome
48. Genome Data is MORE than just the Genome
ATGCGTGCAAAATGCAGTGAGT
ATGCGTGCAAAATGCAGTGAGT
ATGCGTGCAAAATGCAGTGAGT
ATGCGTGCAAAATGCAGTGAGT
NM_000336.2:c.800C>T
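The variant description on this slide ties a change to a specific versioned reference sequence. The sketch below pulls such an expression apart; it handles only simple coding-DNA substitutions of the form shown here, not the full HGVS nomenclature, and the helper name `parse_simple_substitution` is hypothetical.

```python
import re

# Assumed shape: <accession>.<version>:c.<position><ref base>><alt base>
_SUBSTITUTION_RE = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_simple_substitution(expression: str) -> dict:
    """Parse a simple coding substitution such as 'NM_000336.2:c.800C>T'."""
    match = _SUBSTITUTION_RE.match(expression.strip())
    if match is None:
        raise ValueError(f"unsupported variant expression: {expression!r}")
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    fields["position"] = int(fields["position"])
    return fields


variant = parse_simple_substitution("NM_000336.2:c.800C>T")
print(variant)
```

The position is only meaningful relative to the stated accession and version, which is why the version is kept as a separate field.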
52. Thanks!
The Genome Reference Consortium
The Genome Center at Washington University
The Wellcome Trust Sanger Institute
The European Bioinformatics Institute
The National Center for Biotechnology Information
Church group at NCBI
Valerie Schneider
Nathan Bouk
Hsiu-Chuan Chen
Peter Meric
Victor Ananiev
Chao Chen
John Lopez
John Garner
Tim Hefferon
NCBI
Cliff Clausen
For Slides:
Francoise Thibaud-Nissen
Evan Eichler
Steve Sherry
Editor's Notes
Alignments refer to pairs of sequences. Once you know how a pair of sequences fits together, you can string the pairs along into a contig. The contig is essentially the consensus sequence produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
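The switch-point idea described above can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not the GRC pipeline: the `Component` class, its fields, and `build_contig` are hypothetical names, the switch points are given as plain indices, and the overlapping components are assumed to agree exactly in the overlap.

```python
from dataclasses import dataclass
from typing import List

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    name: str
    sequence: str      # sequence as stored in the component record
    orientation: str   # "+" or "-" relative to the contig
    start: int         # 0-based index where this component starts contributing
    stop: int          # exclusive index where we switch to the next component

    def oriented(self) -> str:
        """Sequence as it reads along the contig."""
        if self.orientation == "-":
            return reverse_complement(self.sequence)
        return self.sequence


def build_contig(components: List[Component]) -> str:
    """Join the slice of each component that lies between its switch points."""
    return "".join(c.oriented()[c.start:c.stop] for c in components)


# compA reads ACGTGGTT; compB is stored on the opposite strand and reads
# GGTTCCAA once oriented. They overlap on GGTT, and we switch after "ACGTGG".
comp_a = Component("compA", "ACGTGGTT", "+", 0, 6)
comp_b = Component("compB", "TTGGAACC", "-", 2, 8)
contig = build_contig([comp_a, comp_b])
print(contig)
```

Each base of the contig comes from exactly one component, so the choice of switch point decides which component's sequence represents the overlap.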
Show alignment of a feature from first slide to show how far down the chromosome it has moved…
Keeping track of people is way easier than keeping track of assemblies.