
Kim Pruitt trainingbiocuration2015


  1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA. Curating sequence and literature data for RefSeq and Gene. Kim D. Pruitt. 8th International Biocuration Conference, Training workshop, April 23, 2015.
  2. RefSeq overview: What is RefSeq? How does it compare to GenBank? What are the advantages? How is the dataset built? • Curated data • Sequence analysis • Curation in-depth: examples • Data access
  3. What is RefSeq? An NCBI project to provide reference sequence standards that incorporate current knowledge for genomes, transcripts, and proteins. Counts taken in early March 2015 (columns: Vertebrates | Eukaryotes | Prokaryotes | Virus):
     Genomes: 169 | 503 | 31,000 | 4,538
     Genes: 4 million | 9.2 million | 2 million | 200,000
     Transcripts: 5.6 million | 11 million | 20,000 | na
     Proteins: 4.9 million | 10 million | 38 million | 214,287
  4. RefSeq versus GenBank (columns: GenBank | RefSeq):
     Is archival (member of INSDC): Yes | No
     Source of sequence: Submitter | GenBank (INSDC)
     Source of annotation: Submitter | GenBank, collaboration, literature, curation, computation
     Genome is always annotated: No | Yes for archaea, bacteria, eukaryotes
     'Owner' of sequence records and annotation: Submitter | NCBI
     NCBI staff can update based on user requests: Submitter must authorize | RefSeq may drop contamination; may add transcript/protein/pseudogene based on data analysis and curation; may update annotation
     Annotation may be curated by NCBI staff: No | Yes
  5. 15 years of building RefSeq (www.ncbi.nlm.nih.gov/refseq/). Advantages: consistency; non-redundant; use current names; expanded feature annotation; connected to Gene information. Products & access: annotated genomes, transcripts, proteins; Gene, BLAST, FTP, programming API. Curation: correct errors; add new records; add functional information; connect sequence to function; gene & protein names; functional sequence elements. Curation focus: human, mouse, rat, zebrafish, cow, chicken.
  6. RefSeq's unique contribution for vertebrates • Correct transcript/protein sequence even if the genome is incomplete or wrong • Clear information on data source & evidence • Connect DNA <> RNA <> Protein • Connect sequence regions to function, for both transcripts and proteins. Example: NM_001033952.2
  7. RefSeq Genomes in a Nutshell (diagram). Data submissions: Genome Submitter: Sequence, Assembly, (Annotate), Submit to GenBank/INSDC. Sequence: Nucleotide, Protein. Meta-data: BioSample, Assembly, BioProject, SRA (reads). Access: FTP, BLAST, Web, eUtils. RefSeq creation: Annotation Pipeline, RefSeq Curation, Collaboration. Resources: BLAST, FTP, RefSeq, Gene, Genome, Tracks, Reports, Assembly, HomoloGene. Legend: Data Submissions, RefSeq, Process Flows, Resources.
  8. RefSeq genomes: leveraging computation & curation (www.ncbi.nlm.nih.gov/genome/annotation_euk/process/). Curation (iterative process): sequence analysis, literature review, quality checks; collaborators: International CCDS Collaboration, Genome Reference Consortium (GRC), nomenclature groups, model organism databases, UniProtKB/SwissProt, miRBase; output: curated RefSeqs. Annotation pipeline (iterative process): Align: RefSeqs, cDNAs, proteins, RNA-Seq; Interpret: build models; Call orthologs: vs. human; Filter: best hits; assign GeneID; assign accession; public release; output: model RefSeqs. Resources: Genes, RefSeqs, Gene, FTP, Nucleotide, Protein. User feedback!
  9. Annotation: a conservative approach. Annotate every exon that is observed once? (X) Consolidate information to represent supported genes and transcripts! Example loci: 1. STAG3L5P-PVRIG2P-PILRB readthrough; 2. stromal antigen 3-like 5 pseudogene; 3. poliovirus receptor related immunoglobulin domain pseudogene; 4. paired immunoglobin-like type 2 receptor beta (regulation of inflammatory responses).
  10. Annotation pipeline results in NCBI Gene: access genome annotation information including RNA-Seq tracks. Example: Rabbit, GeneID:103352519, Assembly: OryCun2.0 (not annotated in Ensembl 76). Figure labels: track names; RNA-Seq tracks (exon coverage, log2 scale graphs; interpreted introns); model RefSeqs; curated; Ensembl track; Configure.
  11. How to identify a RefSeq sequence record. Keyword: RefSeq. Accession format: two letters + underscore + 6-9 digits, or two letters + underscore + GenBank accession. RefSeq categories (transcripts & proteins): • Known RefSeq: subject to curation; accession prefix N*_ • Model RefSeq: evidence-based predictions; accession prefix X*_. Example: www.ncbi.nlm.nih.gov/nucleotide/NM_002197.2
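The accession rules on slide 11 can be sketched as a small classifier. This is a minimal illustration, not an NCBI tool: the function name `classify_refseq` is hypothetical, the optional `.version` suffix is an assumption, and only the numeric form (two letters, underscore, 6-9 digits) is handled. The transcript and protein prefixes NM_, NR_, NP_ (known) and XM_, XR_, XP_ (model) are the standard RefSeq prefixes matching the N*_ and X*_ patterns on the slide.

```python
import re
from typing import Optional

# Two uppercase letters, underscore, 6-9 digits, optional ".version" (assumed optional here).
_REFSEQ_NUMERIC = re.compile(r"^([A-Z]{2})_(\d{6,9})(?:\.(\d+))?$")

KNOWN_PREFIXES = {"NM", "NR", "NP"}   # curated transcripts and proteins
MODEL_PREFIXES = {"XM", "XR", "XP"}   # evidence-based predictions


def classify_refseq(accession: str) -> Optional[str]:
    """Return 'known', 'model', or 'other' for a numeric RefSeq accession.

    Returns None when the string does not match the numeric RefSeq form
    (for example, a GenBank accession such as AL713297.1).
    """
    match = _REFSEQ_NUMERIC.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix in KNOWN_PREFIXES:
        return "known"
    if prefix in MODEL_PREFIXES:
        return "model"
    return "other"  # e.g. genomic records such as NC_


print(classify_refseq("NM_002197.2"))
```

The second accession form on the slide (two letters, underscore, GenBank accession) is deliberately left out of this sketch.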
  12. RefSeq overview • Curated data: genes, sequence, publications, imported data • Sequence analysis • Curation in-depth: examples • Data access
  13. Curation: Review data • Gene information • Gene-2-sequence associations • Publications • Data from collaborators. Resolve errors • Remove wrong name synonyms, publications • Fix sequence associations • Update gene type • Correct collaborator gene : NCBI Gene associations. Add data • Create RefSeq records • RefSeq attributes & summary • Transcript variant description • Alternate names, publications. Bulk processes: Import • Add data from collaborators. Update DB • Add, update, remove accessions to match GenBank. QA • Identify data conflicts for curator review.
  14. How do we curate? • Collaborations: nomenclature groups, MODs, UniProt, Genome Reference Consortium, individual scientists • In-depth sequence analysis: genome, transcript and protein sequence; alignments; RNA-Seq; QA tests; epigenomics; clinical variants • Literature review. Diagram labels: collaboration, sequence analysis, literature; curation (guidelines, validation); mRNA, ncRNA, protein, and pseudogene records; vertebrate transcripts; WWW, FTP, BLAST; genome annotation.
  15. Tracking data & curation consistency. Curation management: • Standard operating procedures • Curation decision trees: ncRNA <> pseudo <> protein-coding? 5’ complete transcript <> partial? • Sequence analysis tools and CGIs • Support collaborations. Data management: • Specifications for the product • Relational database to track data and curation decisions over time • Process flows • Data validation • Disaster recovery/backup • Public access
  16. What do we curate? • Genes: type, location, length; names, summary; publications; gene-2-accession bins • Imported data • Sequence: accuracy, length; alternate splice products; sequence features; functional regions. Gene types: protein-coding, pseudogene, ncRNAs, unknown (???). RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/
  17. Curating literature • Curation review for genes: move to correct gene; add functional citations; mark to include on RefSeq • GeneRIF submissions from public • Add RefSeq attribute and citation • Most publications are added from: National Library of Medicine MeSH indexing service; sequence records; nomenclature groups, MODs, GO, OMIM, GWAS catalog, more…
  18. GeneRIFs: an annotated bibliography (http://www.ncbi.nlm.nih.gov/gene/10309). RefSeq curators review GeneRIF submissions from individuals to correct spelling, check the gene association, and remove irrelevant submissions.
  19. Curation supports data import processes. Sources: HGNC, MGD, RGD, XenBase, ZFIN, QTL db, Pseudogene.org, miRBase, OMIM, CGNC. Generic processing dataflow (FTP/API): compare to known data; update the Gene backend database if OK; report for curation if conflicts are found.
  20. Curating data import errors • Manually add or update some data: HGNC may have HGNC ID 1 = genome location ‘x’ = ENSG ID 1; processing can’t identify the corresponding GeneID; a curator reviews the genomic location and either updates or creates a Gene record. • Coordinate with data sources to reconcile data association conflicts between sites: NCBI may have Gene ID 1 = HGNC ID 1 = Accession 123; HGNC may have HGNC ID 1 = Gene ID 1 = Accession 234; NCBI may have Accession 234 = GeneID 2 = HGNC ID 2 (a paralog).
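The association conflict described on slide 20 can be illustrated with a short comparison routine. This is a sketch under stated assumptions, not NCBI's import code: the `Association` record, the `find_conflicts` function, and the report wording are all hypothetical, and the IDs are the placeholder values from the slide.

```python
from typing import Dict, List, NamedTuple, Set


class Association(NamedTuple):
    gene_id: str
    hgnc_id: str
    accession: str


def find_conflicts(ncbi: List[Association], hgnc: List[Association]) -> List[str]:
    """Report HGNC associations that disagree with the NCBI view, for curator review."""
    accessions_by_hgnc: Dict[str, Set[str]] = {}
    owner_of_accession: Dict[str, str] = {}
    for record in ncbi:
        accessions_by_hgnc.setdefault(record.hgnc_id, set()).add(record.accession)
        owner_of_accession[record.accession] = record.hgnc_id

    reports: List[str] = []
    for record in hgnc:
        known = accessions_by_hgnc.get(record.hgnc_id)
        if known is None:
            reports.append(f"{record.hgnc_id}: no matching NCBI gene")
        elif record.accession not in known:
            owner = owner_of_accession.get(record.accession)
            where = f"NCBI links it to {owner}" if owner else "unknown to NCBI"
            reports.append(
                f"{record.hgnc_id}: accession {record.accession} conflicts ({where})"
            )
    return reports


# The scenario from the slide: accession 234 belongs to a paralog at NCBI.
ncbi_view = [Association("GeneID:1", "HGNC:1", "123"),
             Association("GeneID:2", "HGNC:2", "234")]
hgnc_view = [Association("GeneID:1", "HGNC:1", "234")]
print(find_conflicts(ncbi_view, hgnc_view))
```

Records that agree produce no report, which mirrors the "update if OK, report for curation if conflicts found" dataflow on slide 19.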
  21. RefSeq overview • Curated data • Sequence analysis: tools, quality assurance checks • Curation in-depth: examples • Data access
  22. Quick access to stored BLAST results. In-house: a set of BLAST searches is run per accession; results are stored for 3 months, with quick access to results from the Gene back-end curation database. Programs: blastn, blastx, blastp. Databases: UniVec, EST, NR, Genome. View hits in NCBI’s genome browser.
  23. Sequence and alignment analysis using NCBI’s Genome Workbench (www.ncbi.nlm.nih.gov/tools/gbench/). An application for viewing and analyzing sequence data from NCBI databases, or for uploading your own data for analysis • Compiled for several operating systems • Analysis: BLAST and more • Supports many display options: graphical, alignments, dot plot, phylogenetic trees, more
  24. General layout • Data display area • Project Tree shows loaded data • Search for features, search the sequence, search for open reading frames • Monitor the progress of analysis tasks
  25. Multi-pane cross alignment view: Turkey_5.0 Chromosome 1; Turkey_2.01 Chromosome 1
  26. Search
  27. (No transcribed text.)
  28. Load a set of protein accession.version numbers. Select accessions to include in your analysis. Select the analysis option from the Tool menu.
  29. Load a set of protein accession.version numbers. Select accessions to include in your analysis. Select the analysis option from the Tool menu.
  30. Display the phylogenetic tree calculated from selected CELF proteins.
  31. Genome Workbench: multiple protein alignment display. Curation use: orthology review; gene type review; sequence conservation.
  32. RADAR (RefSeq Analysis, Display, and Recommendation): a Genome Workbench plug-in for RefSeq curation. Displays information on: genomic region, gene annotation; RNA-seq called introns; CpG islands, repeats, variation, more; QA results for newly built RefSeqs; aligned RefSeqs, cDNAs, ESTs. Alignment display: coding sequence region (green); strain data; clone library; stored in DB with quality concern (D); multiple alignments to the genome (M); consensus splice sites (‘a’, ‘d’); mismatches; indels; unaligned ends (not shown). Column labels: Library, Strain, New RefSeq QA.
  33. RADAR functions • RNA-seq supported introns • ORF finder • Signal peptides • Transmembrane regions • Compare/diff transcripts • Find similar transcripts • Integrated QA tests • View nucleotide • View translation • Links to web for details
  34. Curation: Review data • Gene information • Gene-2-sequence associations • Publications • Data from collaborators. Resolve errors • Remove wrong name synonyms, publications • Fix sequence associations • Update gene type • Correct collaborator gene : NCBI Gene associations. Add data • Create RefSeq records • RefSeq attributes & summary • Transcript variant description • Alternate names, publications and GeneRIF. Process: Import • Add data from collaborators. Update DB • Add, update, remove accessions to match GenBank. QA • Identify data conflicts for curator review.
  35. Quality assurance tests. Tests are available in the NCBI C++ toolkit (http://www.ncbi.nlm.nih.gov/toolkit/): transcript tests, protein tests, genome tests, alignment tests. Figure labels: results over time; sequence tested; results summary; details (not shown).
  36. RefSeq overview • Curated data • Sequence analysis • Curation in-depth, examples: work flow; making decisions; working with collaborators; RefSeq curated data is in Gene; annotating RefSeq records • Data access
  37. General process flow for manual transcript-based curation: identify quality full-length cDNAs or ESTs; determine the supported complete CDS; extend 5’ and 3’ ends using all aligning transcript data; identify splice variants and assess their protein-coding capacity. Representative RefSeqs: NMs, including a protein-coding variant that encodes an alternate C-terminus; NR, a non-coding variant that is subject to nonsense-mediated decay (NMD).
  38. Transcript-based curation process. Example: Human DNAJC22 gene (Gene ID: 79962); RefSeqs are constructed using RADAR. Curated NMs (Known) are based on full-length transcripts, and UTRs are extended. Model XMs are created computationally based on transcript and RNA-seq data and often lack full-length support. Tracks: RNA-seq alignments; Model; Known; aligned cDNAs. NCBI RADAR: NC_000012.12, Chromosome 12, GRCh38.p2 (similar to UCSC hg38).
  39. Determining protein-coding potential of a variant. Example: Human CCNO gene (Gene ID: 10309). Three non-coding RefSeqs (NRs) were made to represent full-length transcript variants that either lack an open reading frame (ORF) that meets our quality criteria or have an ORF that renders the transcript a candidate for nonsense-mediated decay (NMD). Figure labels: non-coding variants (NR_); protein-coding variant (NM_); NMD candidate; ORFs are short (< 60 aa). NCBI RADAR: NC_000005.10, Chromosome 5, GRCh38.p2 (similar to UCSC hg38).
  40. Detailed documentation improves consistency. Protein-coding RNA loci: • 1 long cDNA, or 2 lines of support: overlapping partial transcripts + more support; protein homology or ORF conservation or publication • Consensus splice sites • ORF length >= 100 aa; if < 100 aa, require more support • Not apparently a pseudogene. Non-coding RNA loci: • 1 long cDNA if > 2 exons • 2 independent lines of support if 2 exons • 5 lines of support if 1 exon • ORF length < 100 aa • No quality protein hits (blastx) • Consensus splice sites • Consider if syntenic region in human, mouse • No other data (publication) indicates it is protein-coding • 3’ end does not correspond to genomic polyA
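Two of the numeric thresholds on slide 40 can be written down as a small rule sketch. The function names are hypothetical, and treating "1 long cDNA" as a single line of support is a simplifying assumption; the real curation guidelines weigh more evidence (splice sites, protein hits, synteny, publications) than this covers.

```python
ORF_MIN_AA = 100  # default ORF length for a protein-coding locus, per the slide


def min_support_lines_noncoding(exon_count: int) -> int:
    """Minimum independent lines of transcript support for a non-coding locus."""
    if exon_count < 1:
        raise ValueError("exon_count must be at least 1")
    if exon_count == 1:
        return 5
    if exon_count == 2:
        return 2
    return 1  # more than 2 exons: one long cDNA (counted here as one line)


def orf_needs_extra_support(orf_length_aa: int) -> bool:
    """True when an ORF is shorter than the default protein-coding length."""
    return orf_length_aa < ORF_MIN_AA


# A 54 aa ORF, like the APELA example on slide 42, falls under the threshold.
print(min_support_lines_noncoding(2), orf_needs_extra_support(54))
```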
  41. Using epigenomic data to determine 5’ completeness. H3K4me3 tracks from the UCSC Genome Browser. Example: mouse Fgd4 gene (Gene ID: 224014). Panels: NCBI RADAR (NC_000082.6, Chromosome 16, GRCm38); UCSC Browser.
  42. Representing genes based on published data. Example: Human APELA gene (Gene ID: 100506013). Transcript data support an independent gene with a short ORF (54 aa) that typically would not meet RefSeq criteria for a protein-coding locus. Literature review confirms the short ORF is functional. Assembly: GRCh38.p2, chromosome 4. Figure labels: 54 aa ORF; functional data support the 54 aa ORF. NCBI RADAR: NC_000004.12, Chromosome 4, GRCh38.p2.
  43. Gene type decisions depend on transcript data, epigenomics and functional studies. Example: Human FALEC gene (Gene ID: 100874054). Assembly: GRCh38.p2; chromosome 1. • The locus is supported by a single two-exon EST (AL713297.1) • Epigenomic marks support the 5’ completeness of the transcript data • Published data support a functional role for this lncRNA. Panels: NCBI RADAR, NC_000001.11, Chromosome 1, GRCh38.p2 (hg38); UCSC, NC_000001.10, Chromosome 1, GRCh37 (hg19).
  44. Working with nomenclature groups to coordinate changes. Example: non-coding gene LINC00948 was updated to a protein-coding gene, MRLN (GeneID: 100507027). Human Annotation Release 107. Private comments in the in-house Gene database record the curation history. Figure label: RefSeq proteins (red).
  45. Functional annotation on the RefSeq record. Example: Human GHRL gene (Gene ID: 51738), ghrelin/obestatin prepropeptide. Diagram labels: GHRL gene; prepro-ghrelin; pro-ghrelin; mature peptides; signal peptide; ghrelin; C-ghrelin; ghrelin-28; obestatin. http://www.ncbi.nlm.nih.gov/protein/NP_057446.1
  46. GHRL annotation display in NCBI’s Gene resource (http://www.ncbi.nlm.nih.gov/gene/51738) • Mature peptides were annotated on protein products of 8 alternatively spliced transcripts (red arrows). • The Graphics display shown in NCBI’s Gene resource was reconfigured to show all transcripts and proteins, and to show the protein features.
  47. Micro RNA annotation: collaboration with miRBase. Example: Human MIR124-1 (Gene ID: 406907), NR_029668.1, miRBase ID: MI0000443 (http://www.ncbi.nlm.nih.gov/gene/406907). NCBI imports data directly from miRBase (mirbase.org). RefSeq represents the miRNA stem-loop precursor. RefSeq annotates the mature microRNAs. Figure: Gene Graphics view.
  48. RefSeq record: feature annotation for miRNAs. RefSeq NR_029668.1, Human MIR124-1, Gene ID: 406907. http://www.ncbi.nlm.nih.gov/nuccore/NR_029668.1
  49. Feature annotation: more examples of feature annotation will be provided in Session 1.
  50. RefSeq collaborates to improve genome annotation. GRCh37 (Chromosome 7, GRCh37/hg19, NC_000007.13): several exons of the human COPG2 RefSeq were missing in the reference genome assembly. Curators constructed the RefSeq from transcripts and reported the assembly gap to the Genome Reference Consortium (GRC). GRCh38 (Chromosome 7, GRCh38/hg38, NC_000007.14): the gap is fixed in the updated assembly. RefSeq and Sanger collaborate to produce matching annotation on the new assembly. CCDS: the annotated CDS is tracked by the Consensus CDS (CCDS) collaboration once NCBI and Ensembl have both annotated the protein.
  51. Caution: using RefSeq data from non-NCBI resources. Panels: UCSC’s Genome Browser RefSeq Genes track (GRCh37/hg19), with callouts for a missing XM_ variant, a missing pseudogene locus, and a missing locus (also missing for UCSC GRCh38/hg38); NCBI’s Graphics Viewer (GRCh38/hg38).
  52. RefSeq overview • Curated data • Sequence analysis • Curation in-depth: examples • Data access
  53. Finding RefSeq data in NCBI’s Gene resource • NCBI’s Gene resource is primarily based on RefSeq • Gene integrates data from many sources: RefSeq & GeneRIF; official nomenclature; Gene Ontology; orthologs, pathways, phenotypes, variation, protein interactions, and more • Gene provides a unique ID and includes RefSeq details: RefSeq genome annotation; RefSeq details including transcript variant descriptions; report of exon coordinates
  54. RefSeq data in Gene • Genomic regions, transcripts, proteins: find genome annotation details • NCBI Reference Sequences (RefSeqs): find information for individual accessions
  55. Manual curation provides annotation for Gene. Example: human GHRL (GeneID: 51738). Curated content: nomenclature; summary; publications; RefSeq transcript variant descriptions.
  56. Navigating from Gene to Sequence to download
  57. Nucleotide & Protein queries • Build a query starting with: refseq[filter] • Add an organism: AND human[organism] • Add a name, a RefSeq attribute, or a specific feature type: AND ghrelin-27[protein name], or AND mat_peptide[feature key], or AND obestatin[protein name]. Protein database query example: refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
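The query-building steps on slide 57 amount to joining fielded terms with AND. The helper below is a hypothetical convenience function, not part of any NCBI library; it only assembles the query string and reproduces the Protein database example from the slide.

```python
from typing import Iterable, Tuple


def build_refseq_query(organism: str, terms: Iterable[Tuple[str, str]] = ()) -> str:
    """Join refseq[filter], an organism, and (value, field) pairs with AND."""
    parts = ["refseq[filter]", f"{organism}[orgn]"]
    parts.extend(f"{value}[{field}]" for value, field in terms)
    return " AND ".join(parts)


query = build_refseq_query(
    "human",
    [("ghrelin-27", "protein name"), ("mat_peptide", "feature key")],
)
print(query)
# refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
```

The resulting string can be pasted into the Protein database search box or passed as the search term of a scripted retrieval.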
  58. RefSeq in BLAST
  59. Bulk retrievals • RefSeq FTP site: ftp://ftp.ncbi.nlm.nih.gov/refseq/ Comprehensive bi-monthly release organized by major groups (e.g., vertebrate_mammals); weekly updates of transcript/protein records for some organisms • Genomes FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ Releases of genome assembly and annotation data; updated to add new file formats, when the assembly updates, and when there is a major annotation update • Gene FTP site: ftp://ftp.ncbi.nlm.nih.gov/gene/ Reports Gene to RefSeq accession associations, and more • NCBI Programming Utilities (eUtils): supports scripted retrievals. Introduction: http://www.ncbi.nlm.nih.gov/books/NBK25497/ Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/
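A scripted retrieval through eUtils starts by composing request URLs. The sketch below only builds the URLs and makes no network call; the helper names are hypothetical, and the base URL and the `db`, `term`, `retmax`, `id`, `rettype`, and `retmode` parameters come from the public E-utilities documentation linked on the slide, not from the slide text itself.

```python
from typing import Sequence
from urllib.parse import urlencode

# Documented E-utilities base URL (not shown on the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """URL for an ESearch request that returns IDs matching a query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db: str, ids: Sequence[str], rettype: str = "fasta",
               retmode: str = "text") -> str:
    """URL for an EFetch request that downloads records by ID or accession."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("protein", "refseq[filter] AND human[orgn]"))
print(efetch_url("nuccore", ["NR_029668.1"]))
```

A real script would fetch these URLs with an HTTP client and respect NCBI's usage guidelines on request rate.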
  60. User feedback and RefSeq updates • Feedback: http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi • RefSeq updates: subscribe to the refseq-announce mail list, http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce/ • NCBI News: http://www.ncbi.nlm.nih.gov/news/ Figure labels: RefSeq home page; Gene report pages.
  61. Acknowledgements. RefSeq Curators (Vertebrates & Other taxa): Stacy Ciufo, Eric Cox, Diana Haddad, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Wenjun Li, Kelly McGarvey, Mike Murphy, Nuala O'Leary, Kathleen O’Neill, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, Barbara Robberts, Brian Smith-White, Anjana Raina Vatsan, Dave Webb, Matt Wright. Databases & programming: Terence Murphy, Olga Ermolaeva, Craig Wallin, Alex Astashyn, David Maganadze, Mike DiCuccio, Andrei Shkeda, Donna Maglott. Genome Workbench & RADAR: Anatoliy Kuznetsov, David Falk, Andrei Shkeda. NCBI Leadership: David Lipman, James Ostell.
