The Evolution of Genome Data


                Deanna M. Church, NCBI




@deannachurch
Collins FS et al, 1998




   Throughput: 500 Mb/year
     Cost: < $0.25 per base
Variation: 100,000 SNPs mapped
[Figure: Twenty Two Years of Growth: NCBI Data and User Services. X-axis: 1989 to 2011. Left axis: GenBank base pairs (millions), 0 to 140,000. Right axis: users per weekday (average), 0 to 2,500,000. Labels for NCBI resources are placed along the timeline, from BLAST, GenBank, and Entrez in the early years to ClinVar, GTR, and the Genome Remapping Service at the end.]
Steve Sherry, NCBI

NCBI dbSNP database growth: human variations
[Figure, upper panel: non-redundant annotations in millions of rs-ids (0 to 60), 1999 to 2011, split into SNP, STR & Indel, and ambiguous mapping.]
[Figure, lower panel: submissions by project in millions of submissions (0 to 175): TSC, HapMap, 1000 Genomes, other projects.]
dbSNP build 135, November 2011
Kidd et al, 2007 APOBEC cluster




Black: Deletion
White: Insertion
http://www.ncbi.nlm.nih.gov/dbvar
Church et al., 2011 PLoS




http://genomereference.org
GRC Beginnings


       Distributed data

    Old Assembly Model

Genome not in INSDC Database
Build sequence contigs based on contigs
defined in TPF.
 Check for orientation consistencies
 Select switch points
 Instantiate sequence for further analysis


                 Switch point




                      Consensus sequence
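The assembly steps listed on this slide can be sketched roughly as follows. This is only an illustration and not the GRC's actual pipeline: the `Component` record, the way switch points are stored as offsets, and the toy clone sequences are all assumptions made for the example.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    """One clone in a tiling path (hypothetical record layout)."""
    accession: str    # clone named in the tiling path file (TPF)
    sequence: str     # clone sequence as deposited
    orientation: str  # "+" or "-" relative to the chromosome
    start: int        # 0-based offset, after orienting, where this clone starts contributing
    end: int          # exclusive offset of the switch point to the next clone


def build_contig(tiling_path: list) -> str:
    """Orient each clone, cut it at its switch points, and join the pieces."""
    pieces = []
    for comp in tiling_path:
        # Check for orientation consistency.
        if comp.orientation not in ("+", "-"):
            raise ValueError(f"{comp.accession}: unknown orientation {comp.orientation!r}")
        oriented = (
            comp.sequence if comp.orientation == "+"
            else reverse_complement(comp.sequence)
        )
        # Apply the selected switch points.
        if not 0 <= comp.start <= comp.end <= len(oriented):
            raise ValueError(f"{comp.accession}: switch points outside the clone")
        pieces.append(oriented[comp.start:comp.end])
    # Instantiate the contig sequence for further analysis.
    return "".join(pieces)


# Two toy clones that overlap by four bases; the second is stored on the minus strand.
path = [
    Component("CLONE_A", "AAAACCCC", "+", 0, 6),
    Component("CLONE_B", "AACCGGGG", "-", 2, 8),
]
contig = build_contig(path)
```

The switch point falls inside the overlap, so each base of the contig comes from exactly one clone.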
Community Input
Distributed data
      Centralized Data

    Old Assembly Model

Genome not in INSDC Database
Large-Scale Variation Complicates Genome Assembly

         Sequences from haplotype 1
         Sequences from haplotype 2




Old Assembly model: compress into a consensus



New Assembly model: represent both haplotypes
UGT2B17 Region




NCBI36 (hg18)
NCBI36 NC_000004.10 (chr4) Tiling Path (components in approximate left-to-right order)
   AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
   Genes: TMPRSS11E, TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path
   AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
   Genes: TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)
   AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
   Genes: TMPRSS11E2




Xue Y et al, 2008
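One way to read the three tiling paths is to compare them as sets of clone accessions. The sketch below uses only the accessions shown on the slide; the helper name `accessions` and the choice to ignore version suffixes are assumptions made for the comparison.

```python
# Clone components as listed on the slide (order does not matter here).
ncbi36_chr4 = [
    "AC074378.4", "AC079749.5", "AC134921.2", "AC147055.2",
    "AC140484.1", "AC019173.4", "AC093720.2", "AC021146.7",
]
grch37_chr4 = [
    "AC074378.4", "AC079749.5", "AC134921.1", "AC147055.2",
    "AC093720.2", "AC021146.7",
]
grch37_alt = [
    "AC074378.4", "AC140484.1", "AC019173.4", "AC226496.2", "AC021146.7",
]


def accessions(path):
    """Strip the version suffix so clones can be compared across builds."""
    return {component.split(".")[0] for component in path}


# Clones dropped from the chromosome path in GRCh37 ...
removed_from_chr4 = accessions(ncbi36_chr4) - accessions(grch37_chr4)
# ... that reappear in the alternate locus.
moved_to_alt = removed_from_chr4 & accessions(grch37_alt)
# Clones in the alternate locus that were not in the NCBI36 path at all.
new_in_alt = accessions(grch37_alt) - accessions(ncbi36_chr4)
```

With the slide's data, the two clones that leave the chromosome path are the same two that appear in the alternate locus, which matches the idea of representing both haplotypes rather than merging them.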
GRCh37 (hg19)

Regions with alternate loci: UGT2B17, MHC, MAPT

7 alternate haplotypes at the MHC

Alternate loci released as:
   FASTA
   AGP
   Alignment to chromosome

Assembly (e.g. GRCh37)
   Primary Assembly
   PAR
   Non-nuclear assembly unit (e.g. MT)
   Genomic Regions: MHC, UGT2B17, MAPT
   Alternate loci units: ALT 1 through ALT 9
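The assembly model in this diagram could be represented with a small data structure along these lines. The class and field names are hypothetical and chosen for illustration; they are not taken from any NCBI or GRC software.

```python
from dataclasses import dataclass, field


@dataclass
class AssemblyUnit:
    """A named collection of sequences within an assembly."""
    name: str
    sequences: list = field(default_factory=list)  # accession.version strings


@dataclass
class Assembly:
    """An assembly made of a primary unit, a non-nuclear unit, and alternate loci units."""
    name: str
    primary: AssemblyUnit
    non_nuclear: AssemblyUnit
    regions: list = field(default_factory=list)    # named genomic regions
    alt_units: list = field(default_factory=list)  # alternate loci units

    def all_units(self):
        """Every assembly unit, primary first."""
        return [self.primary, self.non_nuclear, *self.alt_units]


grch37 = Assembly(
    name="GRCh37",
    primary=AssemblyUnit("Primary Assembly"),
    non_nuclear=AssemblyUnit("Non-nuclear (MT)"),
    regions=["MHC", "UGT2B17", "MAPT"],
    alt_units=[AssemblyUnit(f"ALT {i}") for i in range(1, 10)],
)
```

Keeping alternate loci in their own units leaves the primary assembly a single, non-redundant path while still giving each alternate haplotype a place in the same assembly.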
Richa Agarwala




MHC Alternate locus
  Alignment to chr6
Oh No! Not a new version of the human genome!




Assembly (e.g. GRCh37.p5)
   Primary Assembly
   PAR
   Non-nuclear assembly unit (e.g. MT)
   Genomic Regions: MHC, UGT2B17, MAPT, ABO, SMA, PECAM1, …
   Alternate loci units: ALT 1 through ALT 9
   Patches
TBC1D3C         TBC1D3   TBC1D3H




                TBC1D3C




Myo19 region (17q21)
70 Fix PATCHES: Chromosome will update in GRCh38
  (adds >1 Mb of novel sequence to the assembly)

71 Novel PATCHES: Additional sequence added
  (adds >800 kb of novel sequence to the assembly)

                                                   Releasing patches quarterly
Distributed data
      Centralized Data
    Old Assembly Model
   Updated Assembly Model
Genome not in INSDC Database
  Genome in INSDC Database
Data Archives




                     GenBank



   Data in a common format
   Data in a single location (and mirrored)
   Most quality checked prior to deposition
   Robust data tracking mechanism (accession.version)
   Data owned by submitter
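The accession.version mechanism can be illustrated with a short parser. This is a minimal sketch: the pattern is deliberately loose, and the function names are made up for the example rather than taken from an NCBI toolkit.

```python
import re
from typing import NamedTuple


class AccVer(NamedTuple):
    accession: str  # stable identifier for the record
    version: int    # incremented whenever the sequence changes


_ACCVER_RE = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)$")


def parse_accession_version(text: str) -> AccVer:
    """Split 'FP565796.3' into its accession and integer version."""
    match = _ACCVER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an accession.version: {text!r}")
    return AccVer(match.group(1), int(match.group(2)))


def is_newer_version(old: str, new: str) -> bool:
    """True if both strings name the same record and `new` has a higher version."""
    a, b = parse_accession_version(old), parse_accession_version(new)
    return a.accession == b.accession and b.version > a.version
```

Comparing versions as integers matters: as plain strings, "NC_000004.9" would sort after "NC_000004.10".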
Data tracking

ABC14-1065514J1

Accession.version   Date          Phase   Gaps   Length
FP565796.1          21-Oct-2009   1       1
FP565796.2          14-Oct-2010   1       0
FP565796.3          07-Nov-2010   3       0
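The version history in this table can be queried directly once it is held as data. The sketch below copies the three rows from the slide; the tuple layout and helper names are assumptions, and "phase 3" is read as finished sequence following the usual GenBank HTGS convention.

```python
from datetime import date

_MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_date(text: str) -> date:
    """Parse a GenBank-style date such as '21-Oct-2009' without relying on locale."""
    day, month, year = text.split("-")
    return date(int(year), _MONTHS[month], int(day))


# (accession.version, date, phase, gaps) as shown on the slide.
history = [
    ("FP565796.1", "21-Oct-2009", 1, 1),
    ("FP565796.2", "14-Oct-2010", 1, 0),
    ("FP565796.3", "07-Nov-2010", 3, 0),
]


def first_gap_free(rows):
    """Return the first version with no gaps, or None."""
    return next((acc for acc, _, _, gaps in rows if gaps == 0), None)


def first_with_phase(rows, phase):
    """Return the first version that reached the given HTGS phase, or None."""
    return next((acc for acc, _, p, _ in rows if p == phase), None)


days_tracked = (parse_date(history[-1][1]) - parse_date(history[0][1])).days
```

Because every change produces a new version under the same accession, the full path from draft to finished clone stays recoverable.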
Mouse chrX: 34,800,000-34,890,000

NC_000086 versions: .1, .2, .3, .4, .5, .6, .7
CM001013 versions: .1, .2
Mouse chrX: 35,000,000-36,000,000
           MGSCv3       MGSCv36




                    X
What’s in a name?

GRCh37
hg19

               Zv7
               danRer5

  MGSCv37
mm8
    NCBIM37
By any other name…




chr21:8,913,216-9,246,964
By any other name…




Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
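The point of these slides, that a coordinate range means nothing without the assembly it refers to, can be enforced in code. The sketch below is one hypothetical way to do it; `Region` and `parse_region` are invented names, and the accepted syntax is only the `chrN:start-end` form shown on the slide.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    assembly: str  # e.g. "Zv7" or "GRCh37"; required, never guessed
    chrom: str
    start: int
    end: int


_REGION_RE = re.compile(r"^(chr[0-9A-Za-z_]+):([\d,]+)-([\d,]+)$")


def parse_region(text: str, assembly: str) -> Region:
    """Parse 'chr21:8,913,216-9,246,964' and bind it to a named assembly."""
    if not assembly:
        raise ValueError("a region is ambiguous without an assembly name")
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(assembly, match.group(1), start, end)


zebrafish = parse_region("chr21:8,913,216-9,246,964", "Zv7")
human = parse_region("chr21:8,913,216-9,246,964", "GRCh37")
```

The two parsed regions share a chromosome name and coordinates but compare as different values, because the assembly is part of the identity.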
hg19
               GRCh37




http://www.ncbi.nlm.nih.gov/genome/assembly
Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17

Assembly unit                          GenBank (GCA)      RefSeq (GCF)
Primary Assembly                       GCA_000001305.1    GCF_000001305.13
Non-nuclear assembly unit (e.g. MT)    GCA_000006015.1    GCF_000006015.1
ALT 1                                  GCA_000001315.1    GCF_000001315.1
ALT 2                                  GCA_000001325.1    GCF_000001325.2
ALT 3                                  GCA_000001335.1    GCF_000001335.1
ALT 4                                  GCA_000001345.1    GCF_000001345.1
ALT 5                                  GCA_000001355.1    GCF_000001355.1
ALT 6                                  GCA_000001365.1    GCF_000001365.2
ALT 7                                  GCA_000001375.1    GCF_000001375.1
ALT 8                                  GCA_000001385.1    GCF_000001385.1
ALT 9                                  GCA_000001395.1    GCF_000001395.1
Patches                                GCA_000005045.5    GCF_000005045.4
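Assembly accessions like those above can be taken apart with a few lines of code. In this sketch the function names are hypothetical; the only conventions relied on are that GCA marks a GenBank assembly, GCF marks a RefSeq assembly, and the two are versioned independently, as the accessions on the slide show.

```python
import re

_ASSEMBLY_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(text: str):
    """Return (source, numeric id, version) for a GCA_/GCF_ accession."""
    match = _ASSEMBLY_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, digits, version = match.groups()
    source = "GenBank" if prefix == "GCA" else "RefSeq"
    return source, digits, int(version)


def is_paired(first: str, second: str) -> bool:
    """True if one accession is the GenBank copy and the other the RefSeq copy of the same assembly."""
    source_a, id_a, _ = parse_assembly_accession(first)
    source_b, id_b, _ = parse_assembly_accession(second)
    return id_a == id_b and source_a != source_b
```

Note that the version numbers are not compared when pairing: GRCh37.p5 is GCA_000001405.6 on the GenBank side and GCF_000001405.17 on the RefSeq side.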
GenBank                       vs      RefSeq
Submitter owned                       RefSeq owned
Redundant                             Non-redundant
Updated rarely                        Curated
INSDC                                 Not INSDC

BRCA1
83 genomic records                    3 genomic records
31 mRNA records                       5 mRNA records
27 protein records                    1 RNA record
                                      5 protein records
RefSeq for Assemblies

Typical assembly edits
  Addition of non-nuclear (e.g. MT) assembly units
  Removal of contamination
    Drop unlocalized/unplaced scaffolds
    Mask contamination that is placed on chromosome
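The two contamination edits listed above might look like this in practice. This is only a sketch of the idea: the function names, the half-open interval convention, and the use of `N` as the mask character are assumptions, not a description of the RefSeq pipeline.

```python
def mask_intervals(sequence: str, intervals, mask_char: str = "N") -> str:
    """Replace each 0-based, half-open (start, end) interval with mask characters.

    The sequence length is unchanged, so chromosome coordinates stay valid.
    """
    chars = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


def drop_scaffolds(scaffolds: dict, contaminated: set) -> dict:
    """Remove unlocalized or unplaced scaffolds flagged as contamination."""
    return {name: seq for name, seq in scaffolds.items() if name not in contaminated}


masked = mask_intervals("ACGTACGTAC", [(2, 5)])
kept = drop_scaffolds({"scaf1": "ACGT", "scaf2": "TTTT"}, {"scaf2"})
```

Masking rather than deleting is what keeps placed sequence on a chromosome from shifting downstream coordinates.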
http://www.ncbi.nlm.nih.gov/genome
Understanding relationships between assemblies using alignments




First Pass   Reciprocal best hit




Second Pass        Non-reciprocal, duplicative hits
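The difference between the two passes can be sketched with a toy reciprocal-best-hit calculation. The slide names the passes but does not describe how they are computed, so the scoring, the tuple layout, and the function names below are all assumptions.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_to_b, b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_to_b)
    reverse = best_hits(b_to_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy alignments between regions of an old and a new assembly.
old_to_new = [
    ("old1", "new1", 100), ("old1", "new2", 90),
    ("old2", "new2", 80), ("old3", "new2", 85),
]
new_to_old = [
    ("new1", "old1", 100), ("new2", "old3", 85), ("new2", "old2", 80),
]

first_pass = reciprocal_best_hits(old_to_new, new_to_old)
# Regions with a hit that is not reciprocal are candidates for the second pass.
second_pass = sorted(set(best_hits(old_to_new)) - {a for a, _ in first_pass})
```

Here `old2` aligns to `new2`, but `new2` aligns better to `old3`, so `old2` falls out of the first pass and is left for the non-reciprocal, duplicative second pass.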
NCBI36




                                            GRCh37.p5




No second pass alignments in GRCh37.p5

http://www.ncbi.nlm.nih.gov/tools/gbench/
Genome Data is MORE than just the Genome
  ATGCGTGCAAAATGCAGTGAGT
   ATGCGTGCAAAATGCAGTGAGT
    ATGCGTGCAAAATGCAGTGAGT
      ATGCGTGCAAAATGCAGTGAGT




NM_000336.2:c.800C>T
NC_000001.10:g.(?_20700513)_(21062644_?)del
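Both variant names above carry the versioned reference sequence they are defined on, which a parser can make explicit. The sketch below handles only the two forms shown on the slide (a simple substitution and a deletion with uncertain breakpoints); it is not a general HGVS parser, and the function and field names are invented for the example.

```python
import re

_SUBSTITUTION_RE = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+):(?P<kind>[cg])\."
    r"(?P<position>\d+)(?P<ref_base>[ACGT])>(?P<alt_base>[ACGT])$"
)

_UNCERTAIN_DEL_RE = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+):g\."
    r"\((?P<outer_start>\?|\d+)_(?P<inner_start>\d+)\)_"
    r"\((?P<inner_end>\d+)_(?P<outer_end>\?|\d+)\)del$"
)


def parse_substitution(text: str) -> dict:
    """Parse a name such as 'NM_000336.2:c.800C>T'."""
    match = _SUBSTITUTION_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported variant name: {text!r}")
    return {
        "reference": match["reference"],   # accession.version the position refers to
        "coordinate_type": match["kind"],  # "c" for coding, "g" for genomic
        "position": int(match["position"]),
        "ref_base": match["ref_base"],
        "alt_base": match["alt_base"],
    }


def parse_uncertain_deletion(text: str) -> dict:
    """Parse a deletion whose outer breakpoints are unknown ('?')."""
    match = _UNCERTAIN_DEL_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported variant name: {text!r}")
    inner_start = int(match["inner_start"])
    inner_end = int(match["inner_end"])
    return {
        "reference": match["reference"],
        "inner_start": inner_start,
        "inner_end": inner_end,
        # The deletion covers at least the span between the inner breakpoints.
        "min_length": inner_end - inner_start + 1,
    }


snv = parse_substitution("NM_000336.2:c.800C>T")
deletion = parse_uncertain_deletion("NC_000001.10:g.(?_20700513)_(21062644_?)del")
```

If the reference accession or its version were dropped, position 800 or the range on chromosome 1 could no longer be interpreted unambiguously.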
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
http://www.youtube.com/NCBINLM   @NCBI   http://www.facebook.com/ncbi.nlm

http://www.ncbi.nlm.nih.gov/education/
Thanks!
 The Genome Reference Consortium
  The Genome Center at Washington University
  The Wellcome Trust Sanger Institute
  The European Bioinformatics Institute
  The National Center for Biotechnology Information

  Church group at NCBI
    Valerie Schneider
    Nathan Bouk
    Hsiu-Chuan Chen
    Peter Meric
    Victor Ananiev
    Chao Chen
    John Lopez
    John Garner
    Tim Hefferon
    Cliff Clausen

  For Slides:
    Francoise Thibaud-Nissen
    Evan Eichler
    Steve Sherry

Church nhgri 2012

  • 1. The Evolution of Genome Data Deanna M. Church, NCBI @deannachurch
  • 2. Collins FS et al, 1998. Throughput: 500 Mb/year; Cost: < $0.25 per base; Variation: 100,000 SNPs mapped
  • 3. Twenty Two Years of Growth: NCBI Data and User Services. [Chart, 1989 to 2011: GenBank base pairs (millions, left axis, 0 to 140,000) and average users per weekday (right axis, 0 to 2,500,000), annotated with NCBI resources and services, including GenBank, BLAST, Entrez, PubMed, OMIM, dbEST, UniGene, dbSNP, RefSeq, GEO, PubMed Central, PubChem, dbGaP, Seq Read Archive, dbVar, 1000 Genomes, ClinVar and GTR.]
  • 4. Steve Sherry, NCBI. [Chart: NCBI dbSNP database growth of rs-ids (human variations), in millions, 1999 to 2011; legend: Non-redundant, STR & Indel, SNP annotations, Ambiguous mapping. Lower panel: millions of submissions by project; legend: 1000 Genomes, Other projects, HapMap, TSC.] dbSNP build 135, November 2011
  • 5. Kidd et al, 2007 APOBEC cluster BLACK: Deletion White: Insertion
  • 7.
  • 8. Church et al., 2011 PLoS http://genomereference.org
  • 9. GRC Beginnings Distributed data Old Assembly Model Genome not in INSDC Database
  • 10. Build sequence contigs based on contigs defined in TPF: check for orientation consistencies; select switch points; instantiate sequence for further analysis. [Diagram labels: switch point, consensus sequence]
  • 11.
  • 13.
  • 15. Distributed data Centralized Data Old Assembly Model Genome not in INSDC Database
  • 16. Large-Scale Variation Complicates Genome Assembly. [Diagram: sequences from haplotype 1, sequences from haplotype 2.] Old assembly model: compress into a consensus. New assembly model: represent both haplotypes.
  • 18. UGT2B17 Region (Xue Y et al, 2008). NCBI36 NC_000004.10 (chr4) tiling path: AC079749.5, AC147055.2, AC019173.4, AC021146.7, AC074378.4, AC134921.2, AC140484.1, AC093720.2; TMPRSS11E, TMPRSS11E2. GRCh37 NC_000004.11 (chr4) tiling path: AC079749.5, AC147055.2, AC021146.7, AC074378.4, AC134921.1, AC093720.2; TMPRSS11E. GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC019173.4, AC021146.7, AC074378.4, AC226496.2, AC140484.1; TMPRSS11E2.
  • 19. UGT2B17, MHC, MAPT. GRCh37 (hg19): 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, alignment to chromosome. http://genomereference.org
  • 20.
  • 21. [Diagram labels: Assembly (e.g. GRCh37); Primary Assembly; PAR; Non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT)]
  • 22. Richa Agarwala MHC Alternate locus Alignment to chr6
  • 23.
  • 24. Oh No! Not a new version of the human genome! http://genomereference.org
  • 25.
  • 26. [Diagram labels: Assembly (e.g. GRCh37.p5); Primary Assembly; PAR; Non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT); Genomic Region (ABO); Genomic Region (SMA); Genomic Region (PECAM1); Patches; …]
  • 27. TBC1D3C TBC1D3 TBC1D3H TBC1D3C Myo19 region (17q21)
  • 28. 70 Fix PATCHES: chromosome will update in GRCh38 (adds >1 Mb of novel sequence to the assembly). 71 Novel PATCHES: additional sequence added (adds >800K of novel sequence to the assembly). Releasing patches quarterly.
  • 29. Distributed data Centralized Data Old Assembly Model Updated Assembly Model Genome not in INSDC Database Genome in INSDC Database
  • 30. Data Archives: GenBank. Data in a common format; data in a single location (and mirrored); most quality checked prior to deposition; robust data tracking mechanism (accession.version); data owned by submitter.
  • 31. Data tracking: ABC14-1065514J1. Columns: Date | Phase | Gaps | Length. FP565796.1: 21-Oct-2009 | 1 | 1. FP565796.2: 14-Oct-2010 | 1 | 0. FP565796.3: 07-Nov-2010 | 3 | 0.
  • 34. What’s in a name? GRCh37 hg19 Zv7 danRer5 MGSCv37 mm8 NCBIM37
  • 35. By any other name… chr21:8,913,216-9,246,964
  • 36. By any other name… Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
  • 37. hg19 GRCh37 http://www.ncbi.nlm.nih.gov/genome/assembly
  • 38.
  • 39. Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17. Primary Assembly: GCA_000001305.1 / GCF_000001305.13. Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1. ALT 1: GCA_000001315.1 / GCF_000001315.1. ALT 2: GCA_000001325.1 / GCF_000001325.2. ALT 3: GCA_000001335.1 / GCF_000001335.1. ALT 4: GCA_000001345.1 / GCF_000001345.1. ALT 5: GCA_000001355.1 / GCF_000001355.1. ALT 6: GCA_000001365.1 / GCF_000001365.2. ALT 7: GCA_000001375.1 / GCF_000001375.1. ALT 8: GCA_000001385.1 / GCF_000001385.1. ALT 9: GCA_000001395.1 / GCF_000001395.1. Patches: GCA_000005045.5 / GCF_000005045.4.
  • 40. GenBank vs RefSeq. GenBank: submitter owned; redundancy; updated rarely; INSDC; BRCA1: 83 genomic records, 31 mRNA records, 27 protein records. RefSeq: RefSeq owned; non-redundant; curated; not INSDC; BRCA1: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records.
  • 41.
  • 42. RefSeq for Assemblies. Typical assembly edits: addition of non-nuclear (e.g. MT) assembly units; removal of contamination; drop unlocalized/unplaced scaffolds; mask contamination that is placed on chromosome.
  • 44. Understanding relationships between assemblies using alignments. First pass: reciprocal best hit. Second pass: non-reciprocal, duplicative hits.
  • 45.
  • 46. NCBI36 GRCh37.p5 No second pass alignments in GRCh37.p5 http://www.ncbi.nlm.nih.gov/tools/gbench/
  • 47. Genome Data is MORE than just the Genome
  • 48. Genome Data is MORE than just the Genome ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT NM_000336.2:c.800C>T
  • 49. ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT NM_000336.2:c.800C>T NC_000001.10:g.(?_20700513)_(21062644_?)del
  • 51. http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm http://www.ncbi.nlm.nih.gov/education/
  • 52. Thanks! The Genome Reference Consortium The Genome Center at Washington University The Wellcome Trust Sanger Institute The European Bioinformatics Institute The National Center for Biotechnology Information Church group at NCBI For Slides: Valerie Schneider Francoise Thibaud-Nissen Nathan Bouk Evan Eichler Hsiu-Chuan Chen Steve Sherry Peter Meric Victor Ananiev Chao Chen John Lopez John Garner Tim Hefferon NCBI Cliff Clausen
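
Slides 30 and 31 describe data tracking through accession.version identifiers (for example FP565796.1 through FP565796.3). The following is only a minimal sketch of that idea, assuming an identifier is an accession, a dot, and an integer version; the names `AccVer`, `parse_accession_version` and `latest_versions` are hypothetical and are not part of any NCBI tool.

```python
from typing import Dict, Iterable, NamedTuple


class AccVer(NamedTuple):
    """An accession and its integer version (hypothetical helper type)."""
    accession: str
    version: int


def parse_accession_version(identifier: str) -> AccVer:
    """Split 'FP565796.3' into ('FP565796', 3).

    Assumption: the version is the integer after the last dot.
    """
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return AccVer(accession, int(version))


def latest_versions(identifiers: Iterable[str]) -> Dict[str, int]:
    """Return the highest version seen for each accession."""
    latest: Dict[str, int] = {}
    for identifier in identifiers:
        accession, version = parse_accession_version(identifier)
        # Versions start at 1, so 0 is a safe "not seen yet" default.
        if version > latest.get(accession, 0):
            latest[accession] = version
    return latest


if __name__ == "__main__":
    # Identifiers taken from slides 18 and 31.
    records = ["FP565796.1", "FP565796.2", "FP565796.3", "AC134921.2", "AC134921.1"]
    print(latest_versions(records))
```

Because the version changes whenever the sequence changes, two tiling paths that cite AC134921.2 and AC134921.1 are visibly using different sequence for the same record.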

Editor's Notes

  1. Alignments refer to pairs of sequences. Once you know how a pair of sequences goes together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence that is produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
  2. Show alignment of a feature from first slide to show how far down the chromosome it has moved…
  3. Keeping track of people is way easier than keeping track of assemblies.
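
The first note describes switch points as the places where the consensus stops using one component and starts using the next. A minimal sketch of that idea follows, assuming the alignments have already fixed the orientation and the switch coordinates for each component. `Segment` and `build_contig` are hypothetical names used for illustration, not the GRC's actual software.

```python
from dataclasses import dataclass
from typing import List

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Segment:
    """One component of the tiling path and the slice of it to use.

    start and end are 0-based, end-exclusive coordinates on the oriented
    component; they play the role of the switch points.
    """
    name: str
    sequence: str
    start: int
    end: int
    strand: str = "+"


def build_contig(path: List[Segment]) -> str:
    """Concatenate the chosen slice of each component into a consensus."""
    pieces = []
    for seg in path:
        # Orient the component first, then cut at the switch points.
        oriented = seg.sequence if seg.strand == "+" else reverse_complement(seg.sequence)
        if not 0 <= seg.start < seg.end <= len(oriented):
            raise ValueError(f"switch points out of range for {seg.name}")
        pieces.append(oriented[seg.start:seg.end])
    return "".join(pieces)


if __name__ == "__main__":
    # Two made-up components that overlap by six bases ("GCAAAA").
    # The switch point sits inside the overlap: use A up to base 9,
    # then continue with B from base 3.
    path = [
        Segment("A", "ATGCGTGCAAAA", 0, 9),
        Segment("B", "GCAAAATGCAGT", 3, 12),
    ]
    print(build_contig(path))  # ATGCGTGCAAAATGCAGT
```

Placing the switch anywhere inside the overlap gives the same consensus when the two components agree. Where they differ, the choice of switch point decides which component's bases end up in the contig.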