The document discusses curating sequence and literature data for RefSeq and Gene at the National Center for Biotechnology Information. It provides an overview of RefSeq, describing what RefSeq is, how it compares to GenBank, its advantages, and how the RefSeq dataset is built through curated data and sequence analysis. It then discusses the curation process in depth, including examples of curating genes, transcripts, proteins, and literature. It also describes the tools and quality assurance checks used in curation.
1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA
Curating sequence and literature data for RefSeq and Gene
Kim D. Pruitt
8th International Biocuration Conference
Training workshop
April 23, 2015
2. National Center for Biotechnology Information
RefSeq overview
What is RefSeq?
How does it compare to GenBank?
What are the advantages?
How is the dataset built?
• Curated data
• Sequence analysis
• Curation in-depth – examples
• Data access
3. National Center for Biotechnology Information
What is RefSeq?
An NCBI project to provide reference sequence standards that incorporate current knowledge for genomes, transcripts, and proteins.
              Vertebrates   Eukaryotes    Prokaryotes   Virus
Genomes       169           503           31,000        4,538
Genes         4 million     9.2 million   2 million     200,000
Transcripts   5.6 million   11 million    20,000        na
Proteins      4.9 million   10 million    38 million    214,287
Counts taken in early March 2015
4. National Center for Biotechnology Information
RefSeq versus GenBank
Property | GenBank | RefSeq
Is archival (member of INSDC) | Yes | No
Source of sequence | Submitter | GenBank (INSDC)
Source of annotation | Submitter | GenBank, Collaboration, Literature, Curation, Computation
Genome is always annotated | No | Yes for archaea, bacteria, eukaryotes
‘Owner’ of sequence records and annotation | Submitter | NCBI
NCBI staff can update based on user requests | Submitter must authorize | RefSeq may drop contamination; RefSeq may add transcript/protein/pseudogene based on data analysis and curation; RefSeq may update annotation
Annotation may be curated by NCBI staff | No | Yes
5. National Center for Biotechnology Information
Advantages:
Consistency
Non-redundant
Use current names
Expanded feature annotation
Connected to Gene information
Products & Access:
Annotated genomes, transcripts, proteins
Gene, BLAST, FTP, programming API
15 years of building RefSeq
www.ncbi.nlm.nih.gov/refseq/
Curation:
Correct errors
Add new records
Add functional information
Connect sequence to function
Gene & protein names
Functional sequence elements
Curation focus
Human
Mouse
Rat
Zebrafish
Cow
Chicken
6. National Center for Biotechnology Information
RefSeq’s unique contribution for vertebrates
• Correct transcript/protein sequence even if genome is incomplete/wrong
• Clear information on data source & evidence
• Connect DNA<>RNA<>Protein
• Connect sequence regions to function
- for both transcripts and proteins
NM_001033952.2
7. National Center for Biotechnology Information
RefSeq Genomes in a Nutshell
Sequence
Assembly
(Annotate)
Submit
GenBank/INSDC Genome Submitter
Sequence
Meta-data
Nucleotide Protein
BioSample Assembly BioProject
SRA
(reads)
FTP BLAST
Web
eUtils
Access
RefSeq Creation
Annotation Pipeline
RefSeq Curation
Collaboration
BLAST
FTP
RefSeq Gene
Genome Tracks
Reports Assembly HomoloGene
Data Submissions
RefSeq
Process Flows
Resources
8. National Center for Biotechnology Information
RefSeq genomes: Leveraging computation & curation
www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
Genes
Curation
International CCDS
Collaboration
Genome Reference
Consortium (GRC)
RefSeqs
Nomenclature
Groups
Model Organism
Databases
UniProtKB/
SwissProt
miRBase
Sequence Analysis
Literature Review
Iterative process
Iterative process
Quality Checks
Model
RefSeqs
Gene
FTP
Nucleotide
Protein
Annotation Pipeline
Align:
RefSeq
cDNAs
Proteins
RNA-Seq
Interpret:
Build models
Call orthologs:
vs. human
Filter:
Best hits
Assign GeneID
Assign Accession
Public release
User Feedback!
Curated RefSeqs
9. National Center for Biotechnology Information
Annotation - a conservative approach
1. STAG3L5P-PVRIG2P-PILRB readthrough
2. stromal antigen 3-like 5 pseudogene
3. poliovirus receptor related immunoglobulin domain pseudogene
4. paired immunoglobin-like type 2 receptor beta
(regulation of inflammatory responses)
Annotate every exon
that is observed once?
Consolidate information
to represent supported
genes and transcripts!
X
10. National Center for Biotechnology Information
Exon coverage
Log2 scale graphs
Interpreted introns
Model RefSeqs
Curated
Track names
Rabbit - GeneID:103352519 - Assembly: OryCun2.0
Annotation pipeline results in NCBI Gene
Access genome annotation information including RNA-Seq tracks
Not annotated in Ensembl 76
RNA-Seq tracks
Ensembl track
Configure
11. National Center for Biotechnology Information
How to identify a RefSeq sequence record
Keyword:
• RefSeq
Accession format:
Two letters + underscore + 6-9 digits, or
Two letters + underscore + GenBank accession
RefSeq categories
(transcripts & proteins):
• Known RefSeq
• Subject to curation
• Accession prefix N*_
• Model RefSeq
• Evidence-based predictions
• Accession prefix X*_
www.ncbi.nlm.nih.gov/nucleotide/NM_002197.2
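The accession rules on this slide can be sketched as a small classifier. This is a minimal illustration, not NCBI's own validation logic: the regular expression follows the slide's wording (two letters, an underscore, then 6-9 digits or a GenBank-style accession), the GenBank-style part (letters followed by digits) is a simplifying assumption, and the function name `classify_refseq` is hypothetical. The known/model split applies to transcript and protein records, as stated above.

```python
import re

# Simplified pattern based on the slide: two letters, an underscore, then either
# 6-9 digits or a GenBank-style accession, with an optional ".version" suffix.
# The GenBank-style alternative (letters followed by digits) is an assumption.
_REFSEQ_RE = re.compile(r"^([A-Z]{2})_(\d{6,9}|[A-Z]{1,6}\d{5,})(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return 'known', 'model', or 'other' for a RefSeq-style accession.

    Returns None when the string does not look like a RefSeq accession.
    """
    match = _REFSEQ_RE.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix.startswith("N"):   # N*_ prefix: known RefSeq, subject to curation
        return "known"
    if prefix.startswith("X"):   # X*_ prefix: model RefSeq, evidence-based prediction
        return "model"
    return "other"


if __name__ == "__main__":
    for acc in ("NM_002197.2", "XM_005252812.1", "U12345"):
        print(acc, classify_refseq(acc))
```

A GenBank accession such as `U12345` has no two-letter prefix and underscore, so it is not classified as RefSeq.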
12. National Center for Biotechnology Information
RefSeq overview
Curated data
Genes
Sequence
Publications
Imported data
• Sequence analysis
• Curation in-depth – examples
• Data access
13. National Center for Biotechnology Information
CURATION
Review data
• Gene information
• Gene-2-sequence associations
• Publications
• Data from collaborators
Resolve errors
• Remove wrong name synonyms, publications
• Fix sequence associations
• Update gene type
• Correct collaborator Gene: NCBI Gene associations
Add data
• Create RefSeq records
• RefSeq Attributes & Summary
• Transcript variant description
• Alternate names, publications
BULK PROCESSES
Import
• Add data from collaborators
Update DB
• Add, update, remove accessions to match GenBank
QA
• Identify data conflicts for curator review
14. National Center for Biotechnology Information
How do we curate?
• Collaborations
• Nomenclature, MODs, UniProt, Genome
Reference Consortium, individual scientists
• In-depth sequence analysis
• Genome, transcript and protein sequence
• Alignments
• RNA-Seq
• QA tests
• Epigenomics
• Clinical variants
• Literature review
mRNA, ncRNA, protein,
and pseudogene records
Collaboration
Sequence Analysis
Literature
Curation
Guidelines
Validation
Vertebrate transcripts
WWW – FTP - BLAST
Genome Annotation
15. National Center for Biotechnology Information
Tracking data & curation consistency
Curation management
• Standard operating procedures
• Curation decision trees
• ncRNA <> pseudo <> protein-coding?
• 5’ complete transcript <> partial?
• Sequence analysis tools and CGIs
• Support collaborations
Data management
• Specifications for the product
• Relational database to track data and curation decisions over time
• Process flows
• Data validation
• Disaster recovery/backup
• Public access
16. National Center for Biotechnology Information
What do we curate?
•Genes:
• Type, location, length
• Names, Summary
• Publications
• Gene-2-accession bins
•Imported data
•Sequence:
• Accuracy, length
• Alternate splice products
• Sequence features
• Functional regions
RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/
Protein-coding Pseudogene
ncRNAs Unknown ???
17. National Center for Biotechnology Information
Curating Literature
• Curation Review for Genes
• Move to correct gene
• Add functional citations
• Mark to include on RefSeq
• GeneRIF submissions from public
• Add RefSeq attribute and citation
• Most publications are added from:
• National Library of Medicine MeSH
indexing service
• Sequence records
• Nomenclature groups, MODs, GO,
OMIM, GWAS catalog, more…
18. National Center for Biotechnology Information
GeneRIFs – an annotated bibliography
http://www.ncbi.nlm.nih.gov/gene/10309
RefSeq curators review GeneRIF submissions from
individuals to correct spelling, check the gene
association, and remove irrelevant submissions.
19. National Center for Biotechnology Information
Curation supports data import processes
Gene
Backend
Database
HGNC
MGD
RGD
XenBase
ZFIN
QTL db
Pseudo
geneOrg
MIRBASE
OMIM
CGNC
Generic
Processing
Dataflow
FTP/API
Compare to known data
Update if OK
Report for curation if
conflicts found
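The compare, update, and report steps of this data flow can be sketched as follows. This is only an illustration of the idea: the dictionary structure (external ID mapped to GeneID) and the function name `process_import` are hypothetical and do not describe the actual Gene back-end.

```python
def process_import(known, incoming):
    """Compare incoming collaborator records to known data.

    Both arguments map an external ID (e.g. an HGNC ID) to a GeneID; this
    structure is a hypothetical simplification. Returns (updated, conflicts):
    `updated` is the merged mapping, and `conflicts` lists
    (external_id, known_gene_id, incoming_gene_id) tuples for curator review.
    """
    updated = dict(known)
    conflicts = []
    for external_id, gene_id in incoming.items():
        current = known.get(external_id)
        if current is None or current == gene_id:
            # New or consistent record: update if OK.
            updated[external_id] = gene_id
        else:
            # Disagreement with known data: report for curation, do not overwrite.
            conflicts.append((external_id, current, gene_id))
    return updated, conflicts


if __name__ == "__main__":
    known = {"HGNC:1": 10}
    incoming = {"HGNC:1": 11, "HGNC:2": 20}
    print(process_import(known, incoming))
```

Keeping the known value when a conflict is found reflects the slide's rule that conflicting records go to a curator instead of being applied automatically.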
20. National Center for Biotechnology Information
Curating data import errors
• Manually add or update some data
• HGNC may have: HGNC ID 1 = genome location ‘x’ = ENSG ID 1
• Processing can’t identify corresponding GeneID
• Curator reviews genomic location and either updates or creates a Gene record.
• Coordinate with data sources to reconcile data association conflicts
between sites
• NCBI may have: Gene ID 1 = HGNC ID 1 = Accession 123
• HGNC may have: HGNC ID 1 = Gene ID 1 = Accession 234
• NCBI may have: Accession 234 = GeneID 2 = HGNC ID 2 (a paralog)
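The association conflict in the last three bullets can be detected mechanically before a curator looks at it. The sketch below assumes each site's data is available as (gene ID, HGNC ID, accession) triples; that representation and the function name `find_accession_conflicts` are hypothetical.

```python
def find_accession_conflicts(ncbi, hgnc):
    """Find accessions that the two sites attach to different genes.

    Each argument is a list of (gene_id, hgnc_id, accession) triples, a
    hypothetical flattening of each site's associations.
    """
    ncbi_by_accession = {acc: (gene_id, hgnc_id) for gene_id, hgnc_id, acc in ncbi}
    conflicts = []
    for gene_id, hgnc_id, acc in hgnc:
        ncbi_claim = ncbi_by_accession.get(acc)
        if ncbi_claim is not None and ncbi_claim != (gene_id, hgnc_id):
            conflicts.append(
                {"accession": acc, "ncbi": ncbi_claim, "hgnc": (gene_id, hgnc_id)}
            )
    return conflicts


if __name__ == "__main__":
    # The example from the slide: accession 234 is tied to Gene ID 1 at HGNC
    # but to Gene ID 2 (a paralog) at NCBI.
    ncbi = [(1, 1, "123"), (2, 2, "234")]
    hgnc = [(1, 1, "234")]
    print(find_accession_conflicts(ncbi, hgnc))
```

A report like this only flags the disagreement; deciding which association is correct still requires coordination between the sites.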
21. National Center for Biotechnology Information
RefSeq overview
Curated data
Sequence analysis
Tools
Quality assurance checks
• Curation in-depth - examples
• Data access
22. National Center for Biotechnology Information
Quick access to stored BLAST results
View hits in NCBI’s genome browser
Gene back-end curation database
In-house: Set of BLAST searches per accession
Results are stored for 3 months
Quick access to results
UniVec
EST
NR
Genome
Blastn
Blastx
blastp
23. National Center for Biotechnology Information
Sequence and alignment analysis using NCBI’s
Genome Workbench
www.ncbi.nlm.nih.gov/tools/gbench/
An application for viewing and
analyzing sequence data from
NCBI databases, or upload your
data for analysis
• Compiled for several
operating systems
• Analysis: BLAST and more
• Supports many display
options
• graphical
• alignments
• dot plot
• phylogenetic trees
• more
24. National Center for Biotechnology Information
General layout
Data display area
Project Tree shows loaded data
Search for features, search the sequence, search for open reading frames
Monitor the progress of analysis tasks
*
*
25. National Center for Biotechnology Information
Multi-pane cross alignment view
Turkey_5.0
Chromosome 1
Turkey_2.01
Chromosome 1
28. National Center for Biotechnology Information
Load a set of protein accession.version numbers
Select accessions to include in your analysis
Select the analysis option from the Tool menu
30. National Center for Biotechnology Information
Display the phylogenetic tree calculated
from selected CELF proteins.
31. National Center for Biotechnology Information
Genome Workbench - Multiple protein alignment display
Curation use:
- Orthology review
- Gene type review
- Sequence conservation
32. National Center for Biotechnology Information
RADAR – a Genome Workbench plug-in for RefSeq Curation
Displays Information on:
Genomic region, gene annotation
RNA-seq called introns
CpG Islands, Repeats, variation, more
QA results for newly built RefSeqs
Aligned RefSeqs, cDNAs, ESTs
Coding sequence region (green)
Strain data
Clone library
Stored in DB with quality concern (D)
Multiple alignments to the genome (M)
Consensus splice sites (‘a’, ‘d’)
Mismatches
Indels
Unaligned ends (not shown)
Library Strain New RefSeq
QA
RefSeq Analysis, Display, and Recommendation
33. National Center for Biotechnology Information
RADAR
• Functions
• RNAseq supported intron
• ORF finder
• Signal peptides
• Transmembrane regions
• Compare/diff transcripts
• Find similar transcripts
• Integrated QA tests
• View nucleotide
• View translation
• Links to web for details
34. National Center for Biotechnology Information
CURATION
Review data
• Gene information
• Gene-2-sequence associations
• Publications
• Data from collaborators
Resolve errors
• Remove wrong name synonyms, publications
• Fix sequence associations
• Update gene type
• Correct collaborator Gene: NCBI Gene associations
Add data
• Create RefSeq records
• RefSeq Attributes & Summary
• Transcript variant description
• Alternate names, publications and GeneRIF
PROCESS
Import
• Add data from collaborators
Update DB
• Add, update, remove accessions to match GenBank
QA
• Identify data conflicts for curator review
35. National Center for Biotechnology Information
Quality assurance tests
Tests are available in the NCBI C++ toolkit – http://www.ncbi.nlm.nih.gov/toolkit/
Transcript tests – protein tests – genome tests – alignment tests
Results
over time
Sequence
tested
Results
summary
Details (not
shown)
36. National Center for Biotechnology Information
RefSeq overview
Curated data
Sequence analysis
Curation in-depth – examples
Work flow
Making decisions
Working with collaborators
RefSeq curated data is in Gene
Annotating RefSeq records
• Data access
37. National Center for Biotechnology Information
General process flow for manual transcript-based curation
• Identify quality full-length cDNAs or ESTs
• Determine the supported complete CDS
• Extend 5’ and 3’ ends using all aligning transcript data
• Identify splice variants and assess their protein-coding capacity
Representative RefSeqs
• NMs: protein-coding variants, including a variant that encodes an alternate C-terminus
• NR: non-coding variant that is subject to nonsense-mediated decay (NMD)
[Figure: transcript models with gt/ag splice sites and poly(A) tails]
38. National Center for Biotechnology Information
Transcript-based curation process
Example: Human DNAJC22 gene (Gene ID:79962)- RefSeqs are constructed using RADAR
Curated NMs are based on full-length transcripts
UTRs are extended
Model XMs are created computationally based on transcript and RNA-seq data and often lack full-length support.
RNA-seq
alignments
Model
Known
Aligned
cDNAs
Chr 12
NCBI RADAR: NC_000012.12 Chromosome 12 GRCh38.p2 (similar to UCSC hg38)
39. National Center for Biotechnology Information
Determining protein-coding potential of a variant
Example: Human CCNO gene (Gene ID: 10309) – Three non-coding RefSeqs (NRs) were made to represent full-length transcript variants that either lack an open reading frame (ORF) that meets our quality criteria or have an ORF that renders the transcript a candidate for nonsense-mediated decay (NMD).
non-coding variants (NR_)
protein-coding variant (NM_)
NMD candidate
ORFs are short < 60 aa
NCBI RADAR: NC_000005.10 Chromosome 5 GRCh38.p2 (similar to UCSC hg38)
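A first-pass check of ORF length, the kind of screen behind the "ORFs are short" note on this slide, can be sketched as below. This is not the RADAR ORF finder: it scans only the forward strand, counts only ATG-to-stop ORFs, and ignores ORFs that run off the end of the sequence. The 100 aa default is the threshold given in the curation criteria elsewhere in this deck; the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_aa(seq):
    """Length in amino acids of the longest ATG-to-stop ORF on the forward strand."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # The stop codon itself is not counted as an amino acid.
                best = max(best, (i - start) // 3)
                start = None
    return best


def meets_length_threshold(seq, min_aa=100):
    """True if the longest forward-strand ORF is at least `min_aa` residues."""
    return longest_orf_aa(seq) >= min_aa


if __name__ == "__main__":
    short_transcript = "ATG" + "GCT" * 50 + "TGA"
    print(longest_orf_aa(short_transcript), meets_length_threshold(short_transcript))
```

A length check alone does not settle coding potential; the slides also weigh homology, conservation, NMD candidacy, and published evidence.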
40. National Center for Biotechnology Information
Detailed documentation improves consistency
Protein-coding RNA loci
• 1 long cDNA
• Or, 2 lines of support:
• Overlapping partial transcripts + more support
• Protein homology or ORF conservation or publication
• Consensus splice sites
• ORF length >=100 aa
• If <100 aa require more support
• Not apparently pseudogene
Non-coding RNA loci
• 1 long cDNA if > 2 exons
• 2 independent lines of support if 2 exons
• 5 lines of support if 1 exon
• ORF length <100 aa
• No quality protein hits (blastx)
• Consensus splice sites
• Consider if syntenic region in human, mouse
• No other data (publication) indicates it is protein-coding
• 3’ end does not correspond to genomic polyA
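The exon-count rule for non-coding loci on this slide (one long cDNA for more than two exons, two independent lines of support for two exons, five for a single exon) can be written as a lookup. This sketch treats "1 long cDNA" as one line of support, which is a simplifying assumption, and covers only this one criterion; the function names are hypothetical.

```python
def required_support_lines(exon_count):
    """Minimum lines of support for a non-coding locus, by exon count.

    Follows the slide: >2 exons needs 1 (a long cDNA), 2 exons needs 2,
    1 exon needs 5. Treating a long cDNA as one line is an assumption.
    """
    if exon_count < 1:
        raise ValueError("exon_count must be at least 1")
    if exon_count > 2:
        return 1
    if exon_count == 2:
        return 2
    return 5


def has_enough_support(exon_count, support_lines):
    """True if the observed number of support lines meets the slide's rule."""
    return support_lines >= required_support_lines(exon_count)


if __name__ == "__main__":
    for exons, lines in ((4, 1), (2, 1), (1, 5)):
        print(exons, lines, has_enough_support(exons, lines))
```

The stricter requirement for single-exon loci mirrors the slide: with no splicing evidence, more independent observations are needed.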
41. National Center for Biotechnology Information
Using Epigenomic data to determine 5’ completeness
H3K4me3 tracks
from the UCSC
Genome Browser
Example: mouse Fgd4 gene (Gene ID: 224014). NCBI RADAR: NC_000082.6 Chromosome 16 GRCm38
UCSC Browser
42. National Center for Biotechnology Information
Representing genes based on published data
Example: Human APELA gene (Gene ID: 100506013) – transcript data supports an independent gene
with a short ORF (54 aa) that typically would not meet RefSeq criteria for a protein-coding locus.
Literature review confirms the short ORF is functional.
Assembly: GRCh38.p2, chromosome 4.
54 aa ORF
Functional data support the 54 aa ORF
NCBI RADAR: NC_000004.12 Chromosome 4 GRCh38.p2
43. National Center for Biotechnology Information
Gene type decisions depend on transcript data,
epigenomics and functional studies
Example: Human FALEC gene (Gene ID: 100874054)
Assembly: GRCh38.p2; chromosome 1
The locus is supported by a single
two-exon EST (AL713297.1)
Epigenomic marks support the 5’
completeness of the transcripts data
Published data support a functional
role for this lncRNA
NCBI RADAR: NC_000001.11 Chromosome 1 GRCh38.p2 (hg38)
UCSC - NC_000001.10 Chromosome 1 GRCh37 (hg19)
44. National Center for Biotechnology Information
Working with nomenclature groups to coordinate changes
Example: Non-coding gene LINC00948 was updated to a protein-coding gene MRLN (GeneID: 100507027).
Human Annotation Release 107
Private comments in the in-house Gene database record the curation history
RefSeq
proteins
(red)
45. National Center for Biotechnology Information
Functional annotation on the RefSeq record
Example: Human GHRL gene (Gene ID: 51738)
- ghrelin/obestatin prepropeptide
GHRL gene
Prepro-ghrelin
Mature
peptides
pro-ghrelin
Ghrelin-28 Obestatin
Ghrelin C-Ghrelin
Signal
peptide
Ghrelin C-Ghrelin
http://www.ncbi.nlm.nih.gov/protein/NP_057446.1
46. National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov/gene/51738
• Mature peptides were annotated on protein products of 8
alternatively spliced transcripts (red arrows).
• The Graphics display shown in NCBI’s Gene resource was
reconfigured to show all transcripts and proteins, and to
show the protein features.
GHRL annotation display
in NCBI’s Gene resource
47. National Center for Biotechnology Information
MicroRNA annotation – collaboration with miRBase
RefSeq annotates the mature microRNAs
RefSeq represents the miRNA stem-loop precursor
Gene Graphics view
NCBI imports data directly from miRBase (mirbase.org)
miRBase ID:
MI0000443
Example: Human MIR124-1 (Gene ID: 406907)
NR_029668.1
http://www.ncbi.nlm.nih.gov/gene/406907
48. National Center for Biotechnology Information
RefSeq NR_029668.1
- Human MIR124-1
- Gene ID: 406907
RefSeq record – feature annotation for miRNAs
http://www.ncbi.nlm.nih.gov/nuccore/NR_029668.1
49. National Center for Biotechnology Information
Feature annotation –
More examples of feature annotation will be provided in Session 1
50. RefSeq collaborates to improve genome annotation
GRCh38 – The gap is fixed in
the updated assembly. RefSeq
and Sanger collaborate to
produce matching annotation
on the new assembly.
GRCh37 – Several exons of the
human COPG2 RefSeq were
missing in the reference genome
assembly. Curators constructed
the RefSeq from transcripts and
reported the assembly gap to
the Genome Reference
Consortium (GRC).
Chromosome 7 GRCh37/hg19 NC_000007.13
Chromosome 7 GRCh38/hg38 NC_000007.14
CCDS – The annotated CDS is
tracked by the Consensus CDS
(CCDS) collaboration once NCBI
and Ensembl have both
annotated the protein
51. National Center for Biotechnology Information
Caution: using RefSeq data from non-NCBI resources
missing XM_ variant
missing pseudogene
locus
missing locus
UCSC’s Genome Browser
RefSeq Genes track
GRCh37/hg19
- Also missing for UCSC
GRCh38/hg38
NCBI’s Graphics Viewer
GRCh38/hg38
52. National Center for Biotechnology Information
RefSeq overview
Curated data
Sequence analysis
Curation in-depth – examples
Data access
53. National Center for Biotechnology Information
Finding RefSeq data in NCBI’s Gene resource
• NCBI’s Gene resource is primarily based on RefSeq
• Gene integrates data from many sources:
• RefSeq & GeneRIF
• Official Nomenclature
• Gene Ontology
• Orthologs, Pathways, Phenotypes, Variation, Protein interactions, and
more
• Gene provides a unique ID and includes RefSeq details:
• RefSeq genome annotation
• RefSeq details including transcript variant descriptions
• Report of exon coordinates
54. National Center for Biotechnology Information
RefSeq data in Gene
• Genomic regions, transcripts, proteins
• Find genome annotation details
• NCBI Reference Sequences (RefSeqs)
• Find information for individual accessions
55. National Center for Biotechnology Information
Manual curation provides annotation for Gene
Example: human GHRL (GeneID:51738)
Nomenclature
Summary
Publications
RefSeq transcript
variant
descriptions
56. National Center for Biotechnology Information
Navigating from Gene to Sequence to download
57. National Center for Biotechnology Information
Nucleotide & Protein queries
• Build a query starting with: refseq[filter]
• Add an organism: AND human[organism]
• Add a name, a RefSeq attribute, or a specific feature type
• AND ghrelin-27[protein name]
• Or… ‘AND mat_peptide[feature key]’ Or … ‘AND obestatin[protein name]’
Protein database query example:
refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
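The query-building steps above can be scripted. The sketch below assembles the same query string and wraps it in an E-utilities `esearch` URL without sending any request. The helper names `build_refseq_query` and `esearch_url` are hypothetical, and `retmax=20` is an arbitrary choice for illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_query(organism, terms=()):
    """Build an Entrez query restricted to RefSeq records for one organism.

    `terms` is a sequence of (value, field) pairs, e.g.
    ("ghrelin-27", "protein name").
    """
    parts = ["refseq[filter]", f"{organism}[orgn]"]
    parts.extend(f"{value}[{field}]" for value, field in terms)
    return " AND ".join(parts)


def esearch_url(query, db="protein", retmax=20):
    """Return an esearch URL for the query; nothing is fetched here."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


if __name__ == "__main__":
    query = build_refseq_query(
        "human",
        [("ghrelin-27", "protein name"), ("mat_peptide", "feature key")],
    )
    print(query)
    print(esearch_url(query))
```

Starting every query from `refseq[filter]` keeps GenBank records out of the results, as the slide recommends.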
59. National Center for Biotechnology Information
Bulk retrievals
• RefSeq FTP site – ftp://ftp.ncbi.nlm.nih.gov/refseq/
• Comprehensive bi-monthly release organized by major groups (e.g.,
vertebrate_mammals, etc.)
• Weekly updates of transcript/protein records for some organisms
• Genomes FTP site – ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
• Releases of genome assembly and annotation data. Updated when new file formats are added, when an assembly is updated, or when there is a major annotation update.
• Gene FTP site – ftp://ftp.ncbi.nlm.nih.gov/gene/
• Reports Gene to RefSeq accession associations, and more.
• NCBI Programming Utilities (eUtils) – supports scripted retrievals
• Introduction: http://www.ncbi.nlm.nih.gov/books/NBK25497/
• Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/
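As a small illustration of a scripted retrieval, the sketch below builds an E-utilities `efetch` URL that requests FASTA records for a list of RefSeq accessions. The helper name `efetch_fasta_url` is hypothetical, and the code only constructs the URL; see the eUtils help pages listed above for usage guidelines before sending requests.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions, db="nuccore"):
    """Return an efetch URL requesting FASTA text for the given accessions."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # comma-separated accession.version list
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Accessions taken from examples elsewhere in these slides.
    print(efetch_fasta_url(["NM_002197.2", "NR_029668.1"]))
```

The resulting URL can be opened with `urllib.request.urlopen` or any HTTP client; for whole-release downloads the FTP sites above are the better fit.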
60. National Center for Biotechnology Information
User feedback and RefSeq updates
• Feedback:
http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi
• RefSeq Updates: subscribe to the refseq-announce mail list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce/
• NCBI News
http://www.ncbi.nlm.nih.gov/news/
RefSeq Home page Gene report pages
61. National Center for Biotechnology Information
Acknowledgements
Databases & programming
• Terence Murphy
• Olga Ermolaeva
• Craig Wallin
• Alex Astashyn
• David Maganadze
• Mike DiCuccio
• Andrei Shkeda
• Donna Maglott
RefSeq Curators (Vertebrates & Other taxa)
Stacy Ciufo
Eric Cox
Diana Haddad
Catherine Farrell
Tamara Goldfarb
Tripti Gupta
Vinita Joardar
Vamsi Kodali
Wenjun Li
Kelly McGarvey
Mike Murphy
Nuala O'Leary
Kathleen O’Neill
Shashi Pujar
Bhanu Rajput
Sanjida Rangwala
Lillian Riddick
Barbara Robberts
Brian Smith-White
Anjana Raina Vatsan
Dave Webb
Matt Wright
NCBI Leadership
• David Lipman
• James Ostell
Genome Workbench & RADAR
• Anatoliy Kuznetsov
• David Falk
• Andrei Shkeda