National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA
Curating sequence and literature data for RefSeq and Gene
Kim D. Pruitt
8th International Biocuration Conference
Training workshop
April 23, 2015
National Center for Biotechnology Information
RefSeq overview
What is RefSeq?
How does it compare to GenBank?
What are the advantages?
How is the dataset built?
• Curated data
• Sequence analysis
• Curation in-depth – examples
• Data access
What is RefSeq?
An NCBI project to provide reference sequence standards that incorporate current knowledge for genomes, transcripts, and proteins.

              Vertebrates   Eukaryotes    Prokaryotes   Virus
Genomes       169           503           31,000        4,538
Genes         4 million     9.2 million   2 million     200,000
Transcripts   5.6 million   11 million    20,000        n/a
Proteins      4.9 million   10 million    38 million    214,287

Counts taken in early March 2015
RefSeq versus GenBank
• Is archival (member of INSDC): GenBank yes; RefSeq no
• Source of sequence: GenBank submitter; RefSeq GenBank (INSDC)
• Source of annotation: GenBank submitter; RefSeq GenBank, collaboration, literature, curation, computation
• Genome is always annotated: GenBank no; RefSeq yes for archaea, bacteria, eukaryotes
• ‘Owner’ of sequence records and annotation: GenBank submitter; RefSeq NCBI
• NCBI staff can update based on user requests: GenBank submitter must authorize; RefSeq may drop contamination, may add transcript/protein/pseudogene based on data analysis and curation, and may update annotation
• Annotation may be curated by NCBI staff: GenBank no; RefSeq yes
15 years of building RefSeq
www.ncbi.nlm.nih.gov/refseq/
Advantages:
• Consistency
• Non-redundant
• Use current names
• Expanded feature annotation
• Connected to Gene information
Products & Access:
• Annotated genomes, transcripts, proteins
• Gene, BLAST, FTP, programming API
Curation:
• Correct errors
• Add new records
• Add functional information
• Connect sequence to function
• Gene & protein names
• Functional sequence elements
Curation focus:
• Human
• Mouse
• Rat
• Zebrafish
• Cow
• Chicken
RefSeq's unique contribution for vertebrates
• Correct transcript/protein sequence even if genome is incomplete/wrong
• Clear information on data source & evidence
• Connect DNA<>RNA<>Protein
• Connect sequence regions to function
- for both transcripts and proteins
NM_001033952.2
RefSeq Genomes in a Nutshell
[Figure: data flow from genome submitter to RefSeq resources]
• Submitter: Sequence, Assembly, (Annotate), Submit
• GenBank/INSDC: Genome; Sequence (Nucleotide, Protein); Meta-data (BioSample, Assembly, BioProject); SRA (reads)
• Access: FTP, BLAST, Web, eUtils
• RefSeq: RefSeq Creation, Annotation Pipeline, RefSeq Curation, Collaboration
• Resources: BLAST, FTP, RefSeq, Gene, Genome, Tracks, Reports, Assembly, HomoloGene
• Legend: Data Submissions, RefSeq Process Flows, Resources
RefSeq genomes: Leveraging computation & curation
www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
[Figure: curation and the annotation pipeline feeding RefSeq products]
• Curation (iterative process): Sequence Analysis, Literature Review, Quality Checks
• Collaboration: International CCDS Collaboration, Genome Reference Consortium (GRC), Nomenclature Groups, Model Organism Databases, UniProtKB/SwissProt, miRBase
• Annotation Pipeline:
  • Align: RefSeq, cDNAs, Proteins, RNA-Seq
  • Interpret: Build models
  • Call orthologs: vs. human
  • Filter: Best hits
  • Assign GeneID
  • Assign Accession
  • Public release
• Products: Curated RefSeqs, Model RefSeqs, Genes
• Access: Gene, FTP, Nucleotide, Protein
• User Feedback!
Annotation - a conservative approach
1. STAG3L5P-PVRIG2P-PILRB readthrough
2. stromal antigen 3-like 5 pseudogene
3. poliovirus receptor related immunoglobulin domain pseudogene
4. paired immunoglobin-like type 2 receptor beta (regulation of inflammatory responses)
Annotate every exon that is observed once?
Instead: consolidate information to represent supported genes and transcripts!
Annotation pipeline results in NCBI Gene
Access genome annotation information including RNA-Seq tracks
Example: Rabbit - GeneID:103352519 - Assembly: OryCun2.0 (not annotated in Ensembl 76)
[Screenshot labels: track names; RNA-Seq tracks (exon coverage, log2 scale graphs; interpreted introns); Model RefSeqs; Curated; Ensembl track; Configure]
How to identify a RefSeq sequence record
Keyword:
• RefSeq
Accession format:
• Two letters + underscore + 6-9 digits, or
• Two letters + underscore + GenBank accession
RefSeq categories (transcripts & proteins):
• Known RefSeq
  • Subject to curation
  • Accession prefix N*_
• Model RefSeq
  • Evidence-based predictions
  • Accession prefix X*_
www.ncbi.nlm.nih.gov/nucleotide/NM_002197.2
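The accession rules on this slide can be sketched as a small Python check. This is only an illustration: the function name `classify_refseq` is hypothetical, and the pattern used for the "GenBank accession" branch (1-6 letters followed by 5-9 digits) is an assumption, not an NCBI specification. The known/model labels apply to transcript and protein records, as stated on the slide.

```python
import re

# Two letters, an underscore, then either 6-9 digits or a GenBank-style
# accession, with an optional ".version" suffix.
# ASSUMPTION: the GenBank-style branch is approximated as 1-6 letters + 5-9 digits.
REFSEQ_ACCESSION = re.compile(
    r"^(?P<prefix>[A-Z]{2})_"
    r"(?P<body>\d{6,9}|[A-Z]{1,6}\d{5,9})"
    r"(?:\.(?P<version>\d+))?$"
)


def classify_refseq(accession):
    """Return prefix, category and version for a RefSeq-style accession.

    Returns None when the string does not match the RefSeq accession format.
    """
    match = REFSEQ_ACCESSION.match(accession.strip())
    if match is None:
        return None
    prefix = match.group("prefix")
    if prefix.startswith("N"):
        category = "known"   # N*_ prefix: subject to curation
    elif prefix.startswith("X"):
        category = "model"   # X*_ prefix: evidence-based prediction
    else:
        category = "other"
    version = match.group("version")
    return {
        "prefix": prefix,
        "category": category,
        "version": int(version) if version is not None else None,
    }


if __name__ == "__main__":
    for acc in ("NM_002197.2", "XM_005252900.1", "AL713297.1"):
        print(acc, classify_refseq(acc))
```

A GenBank accession such as AL713297.1 has no two-letter prefix followed by an underscore, so it is reported as not RefSeq.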
RefSeq overview
Curated data
Genes
Sequence
Publications
Imported data
• Sequence analysis
• Curation in-depth – examples
• Data access
BULK PROCESSES
Import
• Add data from collaborators
Update DB
• Add, update, remove accessions to match GenBank
QA
• Identify data conflicts for curator review
CURATION
Review data
• Gene information
• Gene-2-sequence associations
• Publications
• Data from collaborators
Resolve errors
• Remove wrong name synonyms, publications
• Fix sequence associations
• Update gene type
• Correct collaborator Gene: NCBI Gene associations
Add data
• Create RefSeq records
• RefSeq Attributes & Summary
• Transcript variant description
• Alternate names, publications
How do we curate?
• Collaborations
  • Nomenclature, MODs, UniProt, Genome Reference Consortium, individual scientists
• In-depth sequence analysis
  • Genome, transcript and protein sequence
  • Alignments
  • RNA-Seq
  • QA tests
  • Epigenomics
  • Clinical variants
• Literature review
[Figure: Collaboration, Sequence Analysis, and Literature feed Curation (Guidelines, Validation), which yields mRNA, ncRNA, protein, and pseudogene records for vertebrate transcripts; outputs: WWW, FTP, BLAST, Genome Annotation]
Tracking data & curation consistency
Curation management
• Standard operating procedures
• Curation decision trees
  • ncRNA <> pseudo <> protein-coding?
  • 5’ complete transcript <> partial?
• Sequence analysis tools and CGIs
• Support collaborations
Data management
• Specifications for the product
• Relational database to track data and curation decisions over time
• Process flows
• Data validation
• Disaster recovery/backup
• Public access
What do we curate?
• Genes:
  • Type, location, length
  • Names, Summary
  • Publications
  • Gene-2-accession bins
• Imported data
• Sequence:
  • Accuracy, length
  • Alternate splice products
  • Sequence features
  • Functional regions
Gene types: Protein-coding, Pseudogene, ncRNAs, Unknown (???)
RefSeq: www.ncbi.nlm.nih.gov/refseq/
Gene: www.ncbi.nlm.nih.gov/gene/
Curating Literature
• Curation Review for Genes
• Move to correct gene
• Add functional citations
• Mark to include on RefSeq
• GeneRIF submissions from public
• Add RefSeq attribute and citation
• Most publications are added from:
  • National Library of Medicine MeSH indexing service
  • Sequence records
  • Nomenclature groups, MODs, GO, OMIM, GWAS catalog, more…
GeneRIFs – an annotated bibliography
http://www.ncbi.nlm.nih.gov/gene/10309
RefSeq curators review GeneRIF submissions from
individuals to correct spelling, check the gene
association, and remove irrelevant submissions.
Curation supports data import processes
[Figure: data sources flow through generic processing into the Gene backend database]
• Sources: HGNC, MGD, RGD, XenBase, ZFIN, QTL db, Pseudogene.org, miRBase, OMIM, CGNC
• Generic processing dataflow (FTP/API):
  • Compare to known data
  • Update if OK
  • Report for curation if conflicts found
• Target: Gene backend database
Curating data import errors
• Manually add or update some data
• HGNC may have: HGNC ID 1 = genome location ‘x’ = ENSG ID 1
• Processing can’t identify corresponding GeneID
• Curator reviews genomic location and either updates or creates a Gene record.
• Coordinate with data sources to reconcile data association conflicts
between sites
• NCBI may have: Gene ID 1 = HGNC ID 1 = Accession 123
• HGNC may have: HGNC ID 1 = Gene ID 1 = Accession 234
• NCBI may have: Accession 234 = GeneID 2 = HGNC ID 2 (a paralog)
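The conflict in the example above (one accession attached to different genes at two sites) can be detected with a simple comparison. The sketch below is illustrative only: `find_accession_conflicts` and the dictionary layout are hypothetical and do not describe NCBI's in-house processing.

```python
def find_accession_conflicts(ncbi, collaborator):
    """Return accessions that the two sources attach to different genes.

    Both arguments map accession -> GeneID (hypothetical layout).
    Each result is (accession, ncbi_gene_id, collaborator_gene_id).
    """
    conflicts = []
    for accession, their_gene in sorted(collaborator.items()):
        our_gene = ncbi.get(accession)
        # Accessions unknown to NCBI are not conflicts; they need a separate review.
        if our_gene is not None and our_gene != their_gene:
            conflicts.append((accession, our_gene, their_gene))
    return conflicts


# Data taken from the slide's example: NCBI places Accession 234 on GeneID 2
# (a paralog), while the collaborator places it on GeneID 1.
ncbi_view = {"Accession 123": 1, "Accession 234": 2}
hgnc_view = {"Accession 234": 1}

if __name__ == "__main__":
    for accession, ours, theirs in find_accession_conflicts(ncbi_view, hgnc_view):
        print(f"{accession}: NCBI GeneID {ours} vs collaborator GeneID {theirs}")
```

Conflicts found this way would be reported for curator review rather than resolved automatically, in line with the process described on the slide.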
RefSeq overview
Curated data
Sequence analysis
Tools
Quality assurance checks
• Curation in-depth - examples
• Data access
Quick access to stored BLAST results
• Gene back-end curation database
• In-house: Set of BLAST searches per accession
• Results are stored for 3 months
• Quick access to results
• View hits in NCBI’s genome browser
• Databases: UniVec, EST, NR, Genome
• Programs: blastn, blastx, blastp
Sequence and alignment analysis using NCBI’s Genome Workbench
www.ncbi.nlm.nih.gov/tools/gbench/
An application for viewing and analyzing sequence data from NCBI databases, or upload your data for analysis
• Compiled for several operating systems
• Analysis: BLAST and more
• Supports many display options
  • graphical
  • alignments
  • dot plot
  • phylogenetic trees
  • more
General layout
Data display area
Project Tree shows loaded data
Search for features, search the sequence, search for open reading frames
Monitor the progress of analysis tasks
Multi-pane cross alignment view
Turkey_5.0
Chromosome 1
Turkey_2.01
Chromosome 1
Search
Load a set of protein accession.version numbers
Select accessions to include in your analysis
Select the analysis option from the Tool menu
Display the phylogenetic tree calculated from selected CELF proteins.
Genome Workbench - Multiple protein alignment display
Curation use:
- Orthology review
- Gene type review
- Sequence conservation
RADAR – a Genome Workbench plug-in for RefSeq Curation
RefSeq Analysis, Display, and Recommendation
Displays information on:
• Genomic region, gene annotation
• RNA-seq called introns
• CpG islands, repeats, variation, more
• QA results for newly built RefSeqs
• Aligned RefSeqs, cDNAs, ESTs
  • Coding sequence region (green)
  • Strain data
  • Clone library
  • Stored in DB with quality concern (D)
  • Multiple alignments to the genome (M)
  • Consensus splice sites (‘a’, ‘d’)
  • Mismatches
  • Indels
  • Unaligned ends (not shown)
[Screenshot column labels: Library, Strain, New RefSeq, QA]
RADAR
• Functions
• RNAseq supported intron
• ORF finder
• Signal peptides
• Transmembrane regions
• Compare/diff transcripts
• Find similar transcripts
• Integrated QA tests
• View nucleotide
• View translation
• Links to web for details
PROCESS
Import
• Add data from collaborators
Update DB
• Add, update, remove accessions to match GenBank
QA
• Identify data conflicts for curator review
CURATION
Review data
• Gene information
• Gene-2-sequence associations
• Publications
• Data from collaborators
Resolve errors
• Remove wrong name synonyms, publications
• Fix sequence associations
• Update gene type
• Correct collaborator Gene: NCBI Gene associations
Add data
• Create RefSeq records
• RefSeq Attributes & Summary
• Transcript variant description
• Alternate names, publications and GeneRIF
Quality assurance tests
Tests are available in the NCBI C++ toolkit – http://www.ncbi.nlm.nih.gov/toolkit/
Transcript tests – protein tests – genome tests – alignment tests
[Screenshot labels: results over time; sequence tested; results summary; details (not shown)]
RefSeq overview
Curated data
Sequence analysis
Curation in-depth – examples
Work flow
Making decisions
Working with collaborators
RefSeq curated data is in Gene
Annotating RefSeq records
• Data access
General process flow for manual transcript-based curation
• Identify quality full-length cDNAs or ESTs
• Determine the supported complete CDS
• Extend 5’ and 3’ ends using all aligning transcript data
• Identify splice variants and assess their protein-coding capacity
Representative RefSeqs:
• NMs: including a protein-coding variant that encodes an alternate C-terminus
• NR: non-coding variant that is subject to nonsense-mediated decay (NMD)
Transcript-based curation process
Example: Human DNAJC22 gene (Gene ID: 79962) - RefSeqs are constructed using RADAR
• Curated NMs are based on full-length transcripts; UTRs are extended.
• Model XMs are created computationally based on transcript and RNA-seq data and often lack full-length support.
[Screenshot tracks: RNA-seq alignments, Model, Known, Aligned cDNAs, Chr 12]
NCBI RADAR: NC_000012.12 Chromosome 12 GRCh38.p2 (similar to UCSC hg38)
Determining protein-coding potential of a variant
Example: Human CCNO gene (Gene ID: 10309) – Three non-coding RefSeqs (NRs) were made to represent full-length transcript variants that either lack an open reading frame (ORF) that meets our quality criteria or have an ORF that renders the transcript a candidate for nonsense-mediated decay (NMD).
[Screenshot labels: protein-coding variant (NM_); non-coding variants (NR_): NMD candidate; ORFs are short (< 60 aa)]
NCBI RADAR: NC_000005.10 Chromosome 5 GRCh38.p2 (similar to UCSC hg38)
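To illustrate the kind of ORF-length check mentioned in this example, here is a minimal Python sketch that finds the longest forward-strand ORF in a transcript and compares it with a length threshold. It is not the RADAR ORF finder: the function names are hypothetical, the threshold is a parameter chosen by the caller, and NMD candidacy is not assessed at all.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_aa(transcript):
    """Length in amino acids of the longest ATG..stop ORF on the given strand.

    Only complete ORFs (start codon through an in-frame stop) are counted;
    the stop codon itself is not included in the length.
    """
    seq = transcript.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        length = None  # None means we are not inside an ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if length is None:
                if codon == "ATG":
                    length = 1
            elif codon in STOP_CODONS:
                best = max(best, length)
                length = None
            else:
                length += 1
    return best


def orf_meets_threshold(transcript, min_aa):
    """True when the longest ORF reaches the caller-supplied length threshold."""
    return longest_orf_aa(transcript) >= min_aa


if __name__ == "__main__":
    toy = "CCATGAAACCCTGA"  # ATG AAA CCC TGA in frame 2
    print(longest_orf_aa(toy), orf_meets_threshold(toy, 60))
```

A transcript whose longest ORF falls below the chosen threshold would still need the other evidence discussed in these slides (homology, conservation, publications) before a gene type is decided.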
Detailed documentation improves consistency
Protein-coding RNA loci
• 1 long cDNA
• Or, 2 lines of support:
  • Overlapping partial transcripts + more support
  • Protein homology or ORF conservation or publication
• Consensus splice sites
• ORF length >= 100 aa
  • If < 100 aa require more support
• Not apparently pseudogene
Non-coding RNA loci
• 1 long cDNA if > 2 exons
• 2 independent lines of support if 2 exons
• 5 lines of support if 1 exon
• ORF length < 100 aa
• No quality protein hits (blastx)
• Consensus splice
• Consider if syntenic region in human, mouse
• No other data (publication) indicates it is protein-coding
• 3’ end does not correspond to genomic polyA
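A few of the criteria above can be written as a rough decision sketch. This is a deliberate simplification for illustration: the `LocusEvidence` fields and function names are hypothetical, and the sketch ignores splice-site consensus, pseudogene checks, synteny, polyA position, and literature, all of which a curator weighs in practice.

```python
from dataclasses import dataclass

MIN_ORF_AA = 100  # ORF length threshold taken from the slide


@dataclass
class LocusEvidence:
    exon_count: int
    support_lines: int          # independent lines of supporting evidence
    has_long_cdna: bool
    orf_length_aa: int
    quality_protein_hit: bool   # e.g. a good blastx hit


def noncoding_support_ok(ev):
    """Support required for a non-coding locus depends on exon count."""
    if ev.exon_count > 2:
        return ev.has_long_cdna
    if ev.exon_count == 2:
        return ev.support_lines >= 2
    return ev.support_lines >= 5


def suggest_locus_type(ev):
    """Return a suggestion only; anything unclear goes to a curator."""
    if ev.orf_length_aa >= MIN_ORF_AA and (ev.has_long_cdna or ev.support_lines >= 2):
        return "protein-coding candidate"
    if (ev.orf_length_aa < MIN_ORF_AA
            and not ev.quality_protein_hit
            and noncoding_support_ok(ev)):
        return "non-coding candidate"
    return "needs curator review"


if __name__ == "__main__":
    single_est = LocusEvidence(exon_count=2, support_lines=1, has_long_cdna=False,
                               orf_length_aa=40, quality_protein_hit=False)
    print(suggest_locus_type(single_est))
```

A two-exon locus with a single line of support falls through to curator review in this sketch, which is where additional evidence such as epigenomic marks or publications would be considered.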
Using Epigenomic data to determine 5’ completeness
Example: mouse Fgd4 gene (Gene ID: 224014). NCBI RADAR: NC_000082.6 Chromosome 16 GRCm38
[Screenshot: H3K4me3 tracks from the UCSC Genome Browser]
Representing genes based on published data
Example: Human APELA gene (Gene ID: 100506013) – transcript data supports an independent gene
with a short ORF (54 aa) that typically would not meet RefSeq criteria for a protein-coding locus.
Literature review confirms the short ORF is functional.
Assembly: GRCh38.p2, chromosome 4.
54 aa ORF
Functional data support the 54 aa ORF
NCBI RADAR: NC_000004.12 Chromosome 4 GRCh38.p2
Gene type decisions depend on transcript data,
epigenomics and functional studies
Example: Human FALEC gene (Gene ID: 100874054)
Assembly: GRCh38.p2; chromosome 1
The locus is supported by a single
two-exon EST (AL713297.1)
Epigenomic marks support the 5’ completeness of the transcript data
Published data support a functional role for this lncRNA
NCBI RADAR: NC_000001.11 Chromosome 1 GRCh38.p2 (hg38)
UCSC - NC_000001.10 Chromosome 1 GRCh37 (hg19)
Working with nomenclature groups to coordinate changes
Example: Non-coding gene LINC00948 was updated to a protein-coding gene MRLN (GeneID: 100507027).
Human Annotation Release 107
Private comments in the in-house Gene database record the curation history
[Screenshot label: RefSeq proteins (red)]
Functional annotation on the RefSeq record
Example: Human GHRL gene (Gene ID: 51738)
- ghrelin/obestatin prepropeptide
[Figure: processing of the GHRL gene product: prepro-ghrelin (signal peptide, ghrelin, C-ghrelin), pro-ghrelin (ghrelin, C-ghrelin), and mature peptides (ghrelin-28, obestatin)]
http://www.ncbi.nlm.nih.gov/protein/NP_057446.1
GHRL annotation display in NCBI’s Gene resource
http://www.ncbi.nlm.nih.gov/gene/51738
• Mature peptides were annotated on protein products of 8 alternatively spliced transcripts (red arrows).
• The Graphics display shown in NCBI’s Gene resource was reconfigured to show all transcripts and proteins, and to show the protein features.
Micro RNA annotation – collaboration with miRBase
Example: Human MIR124-1 (Gene ID: 406907)
http://www.ncbi.nlm.nih.gov/gene/406907
• NCBI imports data directly from miRBase (mirbase.org); miRBase ID: MI0000443
• RefSeq represents the miRNA stem-loop precursor (NR_029668.1)
• RefSeq annotates the mature microRNAs
[Screenshot: Gene Graphics view]
RefSeq NR_029668.1
- Human MIR124-1
- Gene ID: 406907
RefSeq record – feature annotation for miRNAs
http://www.ncbi.nlm.nih.gov/nuccore/NR_029668.1
Feature annotation – more examples of feature annotation will be provided in Session 1
RefSeq collaborates to improve genome annotation
• GRCh37 – Several exons of the human COPG2 RefSeq were missing in the reference genome assembly. Curators constructed the RefSeq from transcripts and reported the assembly gap to the Genome Reference Consortium (GRC). (Chromosome 7 GRCh37/hg19 NC_000007.13)
• GRCh38 – The gap is fixed in the updated assembly. RefSeq and Sanger collaborate to produce matching annotation on the new assembly. (Chromosome 7 GRCh38/hg38 NC_000007.14)
• CCDS – The annotated CDS is tracked by the Consensus CDS (CCDS) collaboration once NCBI and Ensembl have both annotated the protein.
Caution: using RefSeq data from non-NCBI resources
• UCSC’s Genome Browser, RefSeq Genes track, GRCh37/hg19: missing XM_ variant, missing pseudogene locus, missing locus (also missing for UCSC GRCh38/hg38)
• NCBI’s Graphics Viewer, GRCh38/hg38
RefSeq overview
Curated data
Sequence analysis
Curation in-depth – examples
Data access
Finding RefSeq data in NCBI’s Gene resource
• NCBI’s Gene resource is primarily based on RefSeq
• Gene integrates data from many sources:
• RefSeq & GeneRIF
• Official Nomenclature
• Gene Ontology
• Orthologs, Pathways, Phenotypes, Variation, Protein interactions, and
more
• Gene provides a unique ID and includes RefSeq details:
• RefSeq genome annotation
• RefSeq details including transcript variant descriptions
• Report of exon coordinates
RefSeq data in Gene
• Genomic regions, transcripts, proteins
• Find genome annotation details
• NCBI Reference Sequences (RefSeqs)
• Find information for individual accessions
Manual curation provides annotation for Gene
Example: human GHRL (GeneID:51738)
Nomenclature
Summary
Publications
RefSeq transcript
variant
descriptions
Navigating from Gene to Sequence to download
Nucleotide & Protein queries
• Build a query starting with: refseq[filter]
• Add an organism: AND human[organism]
• Add a name, a RefSeq attribute, or a specific feature type
• AND ghrelin-27[protein name]
• Or… ‘AND mat_peptide[feature key]’ Or … ‘AND obestatin[protein name]’
Protein database query example:
refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
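The query-building steps above can be scripted. The sketch below assembles the same Entrez query string and wraps it in an E-utilities ESearch URL; it does not contact NCBI. The helper names are hypothetical; the `esearch.fcgi` endpoint and its `db`, `term`, and `retmax` parameters are part of the public E-utilities interface.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_query(organism, clauses=()):
    """Join the RefSeq filter, an organism, and (value, field) clauses with AND."""
    terms = ["refseq[filter]", f"{organism}[orgn]"]
    terms += [f"{value}[{field}]" for value, field in clauses]
    return " AND ".join(terms)


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for the given database and query (no request is sent)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = build_refseq_query(
    "human",
    [("ghrelin-27", "protein name"), ("mat_peptide", "feature key")],
)

if __name__ == "__main__":
    print(query)
    print(esearch_url("protein", query))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; NCBI's usage guidelines for request rates apply to scripted access.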
RefSeq in BLAST
Bulk retrievals
• RefSeq FTP site – ftp://ftp.ncbi.nlm.nih.gov/refseq/
• Comprehensive bi-monthly release organized by major groups (e.g., vertebrate_mammals, etc.)
• Weekly updates of transcript/protein records for some organisms
• Genomes FTP site – ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
• Releases of genome assembly and annotation data. Updated when new file formats are added, when an assembly is updated, or when there is a major annotation update.
• Gene FTP site – ftp://ftp.ncbi.nlm.nih.gov/gene/
• Reports Gene to RefSeq accession associations, and more.
• NCBI Programming Utilities (eUtils) – supports scripted retrievals
• Introduction: http://www.ncbi.nlm.nih.gov/books/NBK25497/
• Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/
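As a small illustration of scripted retrieval with eUtils, the sketch below builds EFetch URLs that request FASTA records for a list of RefSeq accessions, split into batches. No request is sent. The helper names and the batch size are assumptions made for this example; `efetch.fcgi` and its `db`, `id`, `rettype`, and `retmode` parameters are part of the public E-utilities interface.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_fasta_url(accessions, db="nuccore"):
    """Build an EFetch URL returning FASTA text for the given accession.versions."""
    params = {
        "db": db,
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    wanted = ["NM_002197.2", "NR_029668.1", "NM_001033952.2"]
    # ASSUMPTION: a batch size of 2 is used only to show the batching.
    for group in batches(wanted, 2):
        print(efetch_fasta_url(group))
```

For very large sets, the FTP releases listed above are usually a better fit than many small EFetch requests.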
User feedback and RefSeq updates
• Feedback:
http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi
• RefSeq Updates: subscribe to the refseq-announce mail list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce/
• NCBI News
http://www.ncbi.nlm.nih.gov/news/
[Screenshots: RefSeq home page; Gene report pages]
Acknowledgements
RefSeq Curators (Vertebrates & Other taxa)
• Stacy Ciufo
• Eric Cox
• Diana Haddad
• Catherine Farrell
• Tamara Goldfarb
• Tripti Gupta
• Vinita Joardar
• Vamsi Kodali
• Wenjun Li
• Kelly McGarvey
• Mike Murphy
• Nuala O'Leary
• Kathleen O’Neill
• Shashi Pujar
• Bhanu Rajput
• Sanjida Rangwala
• Lillian Riddick
• Barbara Robberts
• Brian Smith-White
• Anjana Raina Vatsan
• Dave Webb
• Matt Wright
Databases & programming
• Terence Murphy
• Olga Ermolaeva
• Craig Wallin
• Alex Astashyn
• David Maganadze
• Mike DiCuccio
• Andrei Shkeda
• Donna Maglott
Genome Workbench & RADAR
• Anatoliy Kuznetsov
• David Falk
• Andrei Shkeda
NCBI Leadership
• David Lipman
• James Ostell

Preclinical Scale Bioprocessing, Nov. 2, 2009Preclinical Scale Bioprocessing, Nov. 2, 2009
Preclinical Scale Bioprocessing, Nov. 2, 2009
 
Día 19 - Noel Chen - Introducción a Novogene
Día 19 - Noel Chen - Introducción a Novogene Día 19 - Noel Chen - Introducción a Novogene
Día 19 - Noel Chen - Introducción a Novogene
 
Bioinformatics مي.pdf
Bioinformatics  مي.pdfBioinformatics  مي.pdf
Bioinformatics مي.pdf
 
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917
 
Qi liu 08.08.2014
Qi liu 08.08.2014Qi liu 08.08.2014
Qi liu 08.08.2014
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517
 
NCBI
NCBINCBI
NCBI
 
160628 giab for festival of genomics
160628 giab for festival of genomics160628 giab for festival of genomics
160628 giab for festival of genomics
 
Assign 2.0 software for the analysis of Phred quality values for quality con...
Assign 2.0  software for the analysis of Phred quality values for quality con...Assign 2.0  software for the analysis of Phred quality values for quality con...
Assign 2.0 software for the analysis of Phred quality values for quality con...
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
 
150219 agbt giab_poster_marc
150219 agbt giab_poster_marc150219 agbt giab_poster_marc
150219 agbt giab_poster_marc
 
Genome resource databases in horticutural crops
Genome resource databases in horticutural cropsGenome resource databases in horticutural crops
Genome resource databases in horticutural crops
 
Sequencedatabases
SequencedatabasesSequencedatabases
Sequencedatabases
 
RNA Seq Data Analysis
RNA Seq Data AnalysisRNA Seq Data Analysis
RNA Seq Data Analysis
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015
 

Último

Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 

Último (20)

The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 

Kim Pruitt trainingbiocuration2015

  • 1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA Curating sequence and literature data for RefSeq and Gene Kim D. Pruitt 8th International Biocuration Conference Training workshop April 23, 2015
  • 2. National Center for Biotechnology Information RefSeq overview What is RefSeq? How does it compare to GenBank? What are the advantages? How is the dataset built? • Curated data • Sequence analysis • Curation in-depth – examples • Data access
  • 3. National Center for Biotechnology Information An NCBI project to provide reference sequence standards, that incorporate current knowledge, for genomes, transcripts, and proteins. What is RefSeq? Vertebrates Eukaryotes Prokaryotes Virus Genomes 169 503 31,000 4,538 Genes 4 million 9.2 million 2 million 200,000 Transcripts 5.6 million 11 million 20,000 na Proteins 4.9 million 10 million 38 million 214,287 Counts taken in early March 2015
  • 4. National Center for Biotechnology Information RefSeq versus GenBank GenBank RefSeq Is archival (member of INSDC) Yes No Source of sequence Submitter GenBank (INSDC) Source of annotation Submitter GenBank, Collaboration, Literature, Curation, Computation Genome is always annotated No Yes for archaea, bacteria, eukaryotes ‘Owner’ of sequence records and annotation Submitter NCBI NCBI staff can update based on user requests Submitter must authorize RefSeq may drop contamination RefSeq may add transcript/protein/pseudogene based on data analysis and curation RefSeq may update annotation Annotation may be curated by NCBI staff No Yes
  • 5. National Center for Biotechnology Information Advantages: Consistency Non-redundant Use current names Expanded feature annotation Connected to Gene information Products & Access: Annotated genomes, transcripts, proteins Gene, BLAST, FTP, programming API 15 years of building RefSeq www.ncbi.nlm.nih.gov/refseq/ Curation: Correct errors Add new records Add functional information Connect sequence to function Gene & protein names Functional sequence elements Curation focus Human Mouse Rat Zebrafish Cow Chicken
  • 6. RefSeq's unique contribution for vertebrates • Correct transcript/protein sequence even if the genome is incomplete/wrong • Clear information on data source & evidence • Connect DNA<>RNA<>Protein • Connect sequence regions to function - for both transcripts and proteins NM_001033952.2
  • 7. National Center for Biotechnology Information RefSeq Genomes in a Nutshell Sequence Assembly (Annotate) Submit GenBank/INSDC GenomeSubmitter Sequence Meta-data Nucleotide Protein BioSampleAssembly BioProject SRA (reads) FTPBLAST Web eUtils Access RefSeq Creation Annotation Pipeline RefSeq Curation Collaboration BLAST FTP RefSeq Gene Genome Tracks Reports Assembly HomoloGene Data Submissions RefSeq Process Flows Resources
  • 8. National Center for Biotechnology Information RefSeq genomes: Leveraging computation & curation www.ncbi.nlm.nih.gov/genome/annotation_euk/process/ Genes Curation International CCDS Collaboration Genome Reference Consortium (GRC) RefSeqs Nomenclature Groups Model Organism Databases UniProtKB/ SwissProt miRBase Sequence Analysis Literature Review Iterative process Iterative process Quality Checks Model RefSeqs Gene FTP Nucleotide Protein Annotation Pipeline Align: RefSeq cDNAs Proteins RNA-Seq Interpret: Build models Call orthologs: vs. human Filter: Best hits Assign GeneID Assign Accession Public release User Feedback! Curated RefSeqs
  • 9. National Center for Biotechnology Information Annotation - a conservative approach 2. stromal antigen 3-like 5 pseudogene 3. poliovirus receptor related immunoglobulin domain pseudogene 4. paired immunoglobin-like type 2 receptor beta (regulation of inflammatory responses) 1. STAG3L5P-PVRIG2P-PILRB readthrough Annotate every exon that is observed once? Consolidate information to represent supported genes and transcripts! X
  • 10. National Center for Biotechnology Information Exon coverage Log2 scale graphs Interpreted introns Model RefSeqs Curated Track names Rabbit - GeneID:103352519 - Assembly: OryCun2.0 Annotation pipeline results in NCBI Gene Access genome annotation information including RNA-Seq tracks Not annotated in Ensembl 76 RNA-Seq tracks Ensembl track Configure
  • 11. How to identify a RefSeq sequence record Keyword: • RefSeq Accession format: two letters + underscore + 6-9 digits, or two letters + underscore + GenBank accession RefSeq categories (transcripts & proteins): • Known RefSeq • Subject to curation • Accession prefix N*_ • Model RefSeq • Evidence-based predictions • Accession prefix X*_ www.ncbi.nlm.nih.gov/nucleotide/NM_002197.2
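The accession conventions on slide 11 can be sketched as a small check. This is a minimal illustration, not NCBI code: the function name `classify_refseq` is hypothetical, it looks only at the first letter of the prefix as the slide describes for transcripts and proteins, and it handles only the numeric form (two letters, underscore, 6-9 digits), not the "two letters + underscore + GenBank accession" form.

```python
import re

# Numeric RefSeq form from the slide: two letters, underscore, 6-9 digits,
# with an optional ".version" suffix (e.g. NM_002197.2).
_NUMERIC_REFSEQ = re.compile(r"^([A-Z]{2})_(\d{6,9})(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return 'known', 'model', 'other', or None if not in the numeric form.

    Assumption: the category is read from the first prefix letter only
    (N*_ = known, X*_ = model), as stated on the slide.
    """
    match = _NUMERIC_REFSEQ.match(accession)
    if match is None:
        return None
    prefix = match.group(1)
    if prefix.startswith("N"):
        return "known"
    if prefix.startswith("X"):
        return "model"
    return "other"


if __name__ == "__main__":
    for acc in ("NM_002197.2", "NR_029668.1", "AL713297.1"):
        print(acc, classify_refseq(acc))
```

A GenBank accession such as AL713297.1 has no underscore, so the sketch returns None for it.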
  • 12. National Center for Biotechnology Information RefSeq overview Curated data Genes Sequence Publications Imported data • Sequence analysis • Curation in-depth – examples • Data access
  • 13. National Center for Biotechnology Information Review data • Gene information • Gene-2-sequence associations • Publications • Data from collaborators Resolve Errors • Remove wrong name synonyms, publications • Fix sequence associations • Update gene type • Correct collaborator Gene: NCBI Gene associations Add data • Create RefSeq records • RefSeq Attributes & Summary • Transcript variant description • Alternate names, publications Import • Add data from collaborators Update DB • Add, update, remove accessions to match GenBank QA • Identify data conflicts for curator review BULK PROCESSES CURATION
  • 14. National Center for Biotechnology Information How do we curate? • Collaborations • Nomenclature, MODs, UniProt, Genome Reference Consortium, individual scientists • In-depth sequence analysis • Genome, transcript and protein sequence • Alignments • RNA-Seq • QA tests • Epigenomics • Clinical variants • Literature review mRNA, ncRNA, protein, and pseudogene records Collaboration Sequence Analysis Literature Curation Guidelines Validation Vertebrate transcripts WWW – FTP - BLAST Genome Annotation
  • 15. National Center for Biotechnology Information Tracking data & curation consistency • Standard operating procedures • Curation decision trees • ncRNA <> pseudo <> protein-coding? • 5’ complete transcript <>partial? • Sequence analysis tools and CGI’s • Support collaborations Data management Curation management • Specifications for the product • Relational database to track data and curation decisions over time • Process flows • Data validation • Disaster recovery/backup • Public access
  • 16. National Center for Biotechnology Information What do we curate? •Genes: • Type, location, length • Names, Summary • Publications • Gene-2-accession bins •Imported data •Sequence: • Accuracy, length • Alternate splice products • Sequence features • Functional regions RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/ Protein-coding Pseudogene ncRNAs Unknown ???
  • 17. National Center for Biotechnology Information Curating Literature • Curation Review for Genes • Move to correct gene • Add functional citations • Mark to include on RefSeq • GeneRIF submissions from public • Add RefSeq attribute and citation • Most publications are added from: • National Library of Medicine MeSH indexing service • Sequence records • Nomenclature groups, MODs, GO, OMIM, GWAS catalog, more…
  • 18. National Center for Biotechnology Information GeneRIFs – an annotated bibliography http://www.ncbi.nlm.nih.gov/gene/10309 RefSeq curators review GeneRIF submissions from individuals to correct spelling, check the gene association, and remove irrelevant submissions.
  • 19. National Center for Biotechnology Information Curation supports data import processes Gene Backend Database HGNC MGD RGD XenBase ZFIN QTL db Pseudo geneOrg MIRBASE OMIM CGNC Generic Processing Dataflow FTP/API Compare to known data Update if OK Report for curation if conflicts found
  • 20. National Center for Biotechnology Information Curating data import errors • Manually add or update some data • HGNC may have: HGNC ID 1 = genome location ‘x’ = ENSG ID 1 • Processing can’t identify corresponding GeneID • Curator reviews genomic location and either updates or creates a Gene record. • Coordinate with data sources to reconcile data association conflicts between sites • NCBI may have: Gene ID 1 = HGNC ID 1 = Accession 123 • HGNC may have: HGNC ID 1 = Gene ID 1 = Accession 234 • NCBI may have: Accession 234 = GeneID 2 = HGNC ID 2 (a paralog)
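The association conflict described on slide 20 (the same accession tied to different genes at NCBI and HGNC) can be sketched as a simple comparison. This is an illustrative sketch only: the record layout `(gene_id, hgnc_id, accession)` and the function name `accession_conflicts` are assumptions made for the example, not the in-house processing.

```python
def accession_conflicts(ncbi_records, hgnc_records):
    """Report accessions that the two sources attach to different genes.

    Each record is a hypothetical (gene_id, hgnc_id, accession) tuple.
    Returns a list of (accession, ncbi_pair, hgnc_pair) for curator review.
    """
    # Index the NCBI view by accession for direct lookup.
    ncbi_by_accession = {acc: (gene, hgnc) for gene, hgnc, acc in ncbi_records}
    conflicts = []
    for gene, hgnc, acc in hgnc_records:
        known = ncbi_by_accession.get(acc)
        # Conflict: both sources know the accession but disagree on the gene.
        if known is not None and known != (gene, hgnc):
            conflicts.append((acc, known, (gene, hgnc)))
    return conflicts


if __name__ == "__main__":
    # The slide's example: NCBI ties Accession 234 to Gene ID 2 / HGNC ID 2,
    # while HGNC ties it to Gene ID 1 / HGNC ID 1.
    ncbi = [(1, 1, "123"), (2, 2, "234")]
    hgnc = [(1, 1, "234")]
    print(accession_conflicts(ncbi, hgnc))
```

Records that agree, or that only one source knows about, are left out of the report; only disagreements reach the curator.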
  • 21. National Center for Biotechnology Information RefSeq overview Curated data Sequence analysis Tools Quality assurance checks • Curation in-depth - examples • Data access
  • 22. National Center for Biotechnology Information Quick access to stored BLAST results View hits in NCBI’s genome browser Gene back-end curation database In-house: Set of BLAST searches per accession Results are stored for 3 months Quick access to results UniVec EST NR Genome Blastn Blastx blastp
  • 23. National Center for Biotechnology Information Sequence and alignment analysis using NCBI’s Genome Workbench www.ncbi.nlm.nih.gov/tools/gbench/ An application for viewing and analyzing sequence data from NCBI databases, or upload your data for analysis • Compiled for several operating systems • Analysis: BLAST and more • Supports many display options • graphical • alignments • dot plot • phylogenetic trees • more
  • 24. General layout Data display area Project Tree shows loaded data Search for features, search the sequence, search for open reading frames Monitor the progress of analysis tasks
  • 25. National Center for Biotechnology Information Multi-pane cross alignment view Turkey_5.0 Chromosome 1 Turkey_2.01 Chromosome 1
  • 26. National Center for Biotechnology Information Search
  • 27. National Center for Biotechnology Information
  • 28. National Center for Biotechnology Information Load a set of protein accession.version numbers Select accessions to include in your analysis Select the analysis option from the Tool menu
  • 29. National Center for Biotechnology Information Load a set of protein accession.version numbers Select accessions to include in your analysis Select analysis option from the Tool menu
  • 30. Display the phylogenetic tree calculated from selected CELF proteins.
  • 31. National Center for Biotechnology Information Genome workbench - Multiple protein alignment display Curation use: - Orthology review - Gene type review - Sequence conservation
  • 32. RADAR – a Genome Workbench plug-in for RefSeq Curation Displays information on: Genomic region, gene annotation RNA-seq called introns CpG Islands, Repeats, variation, more QA results for newly built RefSeq Aligned RefSeqs, cDNAs, ESTs Coding sequence region (green) Strain data Clone library Stored in DB with quality concern (D) Multiple alignments to the genome (M) Consensus splice sites (‘a’, ‘d’) Mismatches Indels Unaligned ends (not shown) Library Strain New RefSeq QA RefSeq Analysis, Display, and Recommendation
  • 33. National Center for Biotechnology Information RADAR • Functions • RNAseq supported intron • ORF finder • Signal peptides • Transmembrane regions • Compare/diff transcripts • Find similar transcripts • Integrated QA tests • View nucleotide • View translation • Links to web for details
  • 34. National Center for Biotechnology Information Review data • Gene information • Gene-2-sequence associations • Publications • Data from collaborators Resolve Errors • Remove wrong name synonyms, publications • Fix sequence associations • Update gene type • Correct collaborator Gene: NCBI Gene associations Add data • Create RefSeq records • RefSeq Attributes & Summary • Transcript variant description • Alternate names, publications and GeneRIF Import •Add data from collaborators Update DB •Add, update, remove accessions to match GenBank QA •Identify data conflicts for curator review PROCESS CURATION
  • 35. National Center for Biotechnology Information Quality assurance tests Tests are available in the NCBI C++ toolkit – http://www.ncbi.nlm.nih.gov/toolkit/ Transcript tests – protein tests – genome tests – alignment tests Results over time Sequence tested Results summary Details (not shown)
  • 36. National Center for Biotechnology Information RefSeq overview Curated data Sequence analysis Curation in-depth – examples Work flow Making decisions Working with collaborators RefSeq curated data is in Gene Annotating RefSeq records • Data access
  • 37. National Center for Biotechnology Information AAAAAA AAAAAA AAAAAA General process flow for manual transcript-based curation gt ag gt ag Identify quality full-length cDNAs or ESTs Determine the supported complete CDS Extend 5’ and 3’ ends using all aligning transcript data Representative RefSeqs AAAAAA Identify splice variants and assess their protein-coding capacity Protein-coding variant that encodes an alternate C-terminus Non-coding variant that is subject to nonsense-mediated decay (NMD) NMs NR
  • 38. Transcript-based curation process Example: Human DNAJC22 gene (Gene ID: 79962) - RefSeqs are constructed using RADAR. Curated NMs are based on full-length transcripts; UTRs are extended. Model XMs are created computationally based on transcript and RNA-seq data and often lack full-length support. RNA-seq alignments Model Known Aligned cDNAs Chr 12 NCBI RADAR: NC_000012.12 Chromosome 12 GRCh38.p2 (UCSC hg38)
  • 39. Determining protein-coding potential of a variant Example: Human CCNO gene (Gene ID: 10309) – Three non-coding RefSeqs (NRs) were made to represent full-length transcript variants that either lack an open reading frame (ORF) that meets our quality criteria or have an ORF that renders the transcript a candidate for nonsense-mediated decay (NMD). non-coding variants (NR_) protein-coding variant (NM_) NMD candidate ORFs are short < 60 aa NCBI RADAR: NC_000005.10 Chromosome 5 GRCh38.p2 (UCSC hg38)
  • 40. Detailed documentation improves consistency Protein-coding RNA loci: • 1 long cDNA • Or, 2 lines of support: • Overlapping partial transcripts + more support • Protein homology or ORF conservation or publication • Consensus splice sites • ORF length >=100 aa • If <100 aa require more support • Not apparently pseudogene Non-coding RNA loci: • 1 long cDNA if > 2 exons • 2 independent lines of support if 2 exons • 5 lines of support if 1 exon • ORF length <100 aa • No quality protein hits (blastx) • Consensus splice • Consider if syntenic region in human, mouse • No other data (publication) indicates it is protein-coding • 3’ end does not correspond to genomic polyA
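The exon-count rule for non-coding loci on slide 40 can be written as a small lookup. This is a sketch under stated assumptions: it reads the second list on the slide as the non-coding criteria, counts "1 long cDNA" as one line of support, and uses hypothetical function names; the real guidelines weigh more evidence than this.

```python
def min_support_for_noncoding(exon_count):
    """Minimum lines of support for a non-coding locus, by exon count.

    Assumed reading of the slide: more than 2 exons needs 1 long cDNA,
    exactly 2 exons needs 2 independent lines, a single exon needs 5.
    """
    if exon_count < 1:
        raise ValueError("exon_count must be at least 1")
    if exon_count > 2:
        return 1
    if exon_count == 2:
        return 2
    return 5


def orf_needs_more_support(orf_length_aa):
    """Protein-coding loci with an ORF under 100 aa require more support."""
    return orf_length_aa < 100


if __name__ == "__main__":
    for exons in (1, 2, 3):
        print(exons, "exon(s):", min_support_for_noncoding(exons), "line(s)")
```

For instance, a 54 aa ORF falls under the 100 aa threshold, so `orf_needs_more_support(54)` is true and extra evidence such as publications would be needed.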
  • 41. Using epigenomic data to determine 5’ completeness H3K4me3 tracks from the UCSC Genome Browser Example: mouse Fgd4 gene (Gene ID: 224014). NCBI RADAR: NC_000082.6 Chromosome 16 GRCm38 UCSC Browser
  • 42. Representing genes based on published data Example: Human APELA gene (Gene ID: 100506013) – transcript data supports an independent gene with a short ORF (54 aa) that typically would not meet RefSeq criteria for a protein-coding locus. Literature review confirms the short ORF is functional. Assembly: GRCh38.p2, chromosome 4. 54 aa ORF Functional data support the 54 aa ORF NCBI RADAR: NC_000004.12 Chromosome 4 GRCh38.p2
  • 43. Gene type decisions depend on transcript data, epigenomics and functional studies Example: Human FALEC gene (Gene ID: 100874054) Assembly: GRCh38.p2; chromosome 1 The locus is supported by a single two-exon EST (AL713297.1) Epigenomic marks support the 5’ completeness of the transcript data Published data support a functional role for this lncRNA NCBI RADAR: NC_000001.11 Chromosome 1 GRCh38.p2 (hg38) UCSC - NC_000001.10 Chromosome 1 GRCh37 (hg19)
  • 44. National Center for Biotechnology Information Working with nomenclature groups to coordinate changes Example: Non-coding gene LINC00948 was updated to a protein-coding gene MRLN (GeneID: 100507027). Human Annotation Release 107 Private comments in the in-house Gene database record the curation history RefSeq proteins (red)
  • 45. National Center for Biotechnology Information AAAAAA Functional annotation on the RefSeq record Example: Human GHRL gene (Gene ID: 51738) - ghrelin/obestatin prepropeptide GHRL gene Prepro-ghrelin Mature peptides pro-ghrelin Ghrelin-28 Obestatin Ghrelin C-Ghrelin Signal peptide Ghrelin C-Ghrelin http://www.ncbi.nlm.nih.gov/protein/NP_057446.1
  • 46. GHRL annotation display in NCBI’s Gene resource http://www.ncbi.nlm.nih.gov/gene/51738 • Mature peptides were annotated on protein products of 8 alternatively spliced transcripts (red arrows). • The Graphics display shown in NCBI’s Gene resource was reconfigured to show all transcripts and proteins, and to show the protein features.
  • 47. Micro RNA annotation – collaboration with miRBase RefSeq annotates the mature microRNAs RefSeq represents the miRNA stem-loop precursor Gene Graphics view NCBI imports data directly from miRBase (mirbase.org) miRBase ID: MI0000443 Example: Human MIR124-1 (Gene ID: 406907) NR_029668.1 http://www.ncbi.nlm.nih.gov/gene/406907
  • 48. National Center for Biotechnology Information RefSeq NR_029668.1 - Human MIR124-1 - Gene ID: 406907 RefSeq record – feature annotation for miRNAs http://www.ncbi.nlm.nih.gov/nuccore/NR_029668.1
  • 49. National Center for Biotechnology Information Feature annotation – More examples of feature annotation will be provided in Session 1
  • 50. RefSeq collaborates to improve genome annotation GRCh38 – The gap is fixed in the updated assembly. RefSeq and Sanger collaborate to produce matching annotation on the new assembly. GRCh37 – Several exons of the human COPG2 RefSeq were missing in the reference genome assembly. Curators constructed the RefSeq from transcripts and reported the assembly gap to the Genome Reference Consortium (GRC). Chromosome 7 GRCh37/hg19 NC_000007.13 Chromosome 7 GRCh38/hg20 NC_000007.14 CCDS – The annotated CDS is tracked by the Consensus CDS (CCDS) collaboration once NCBI and Ensembl have both annotated the protein
  • 51. Caution: using RefSeq data from non-NCBI resources. UCSC’s Genome Browser RefSeq Genes track, GRCh37/hg19 (also missing for UCSC GRCh38/hg38): missing XM_ variant; missing pseudogene locus; missing locus. NCBI’s Graphics Viewer: GRCh38/hg38.
  • 52. RefSeq overview Curated data Sequence analysis Curation in-depth – examples Data access
  • 53. Finding RefSeq data in NCBI’s Gene resource • NCBI’s Gene resource is primarily based on RefSeq • Gene integrates data from many sources: • RefSeq & GeneRIF • Official Nomenclature • Gene Ontology • Orthologs, Pathways, Phenotypes, Variation, Protein interactions, and more • Gene provides a unique ID and includes RefSeq details: • RefSeq genome annotation • RefSeq details including transcript variant descriptions • Report of exon coordinates
  • 54. RefSeq data in Gene • Genomic regions, transcripts, proteins • Find genome annotation details • NCBI Reference Sequences (RefSeqs) • Find information for individual accessions
  • 55. Manual curation provides annotation for Gene Example: human GHRL (GeneID:51738) Nomenclature Summary Publications RefSeq transcript variant descriptions
  • 56. Navigating from Gene to Sequence to download
  • 57. Nucleotide & Protein queries • Build a query starting with: refseq[filter] • Add an organism: AND human[organism] • Add a name, a RefSeq attribute, or a specific feature type: AND ghrelin-27[protein name], or AND mat_peptide[feature key], or AND obestatin[protein name] • Protein database query example: refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
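The query-building steps on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `build_refseq_query` and its argument layout are hypothetical, not an NCBI tool. It starts from `refseq[filter]`, adds the organism, and joins any further `value[field]` terms with AND.

```python
def build_refseq_query(organism, terms=()):
    """Assemble an Entrez query string restricted to RefSeq records.

    `terms` is a sequence of (value, field) pairs, for example
    ("mat_peptide", "feature key"). Hypothetical helper for illustration.
    """
    parts = ["refseq[filter]", f"{organism}[orgn]"]
    parts.extend(f"{value}[{field}]" for value, field in terms)
    return " AND ".join(parts)


# Rebuild the Protein database query example from the slide.
query = build_refseq_query(
    "human",
    [("ghrelin-27", "protein name"), ("mat_peptide", "feature key")],
)
print(query)
```

The resulting string can be pasted into the Protein search box at NCBI, or passed as the search term in a scripted retrieval.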
  • 58. RefSeq in BLAST
  • 59. Bulk retrievals • RefSeq FTP site – ftp://ftp.ncbi.nlm.nih.gov/refseq/ • Comprehensive bi-monthly release organized by major groups (e.g., vertebrate_mammals) • Weekly updates of transcript/protein records for some organisms • Genomes FTP site – ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ • Releases of genome assembly and annotation data. Updated when new file formats are added, when an assembly is updated, or when there is a major annotation update. • Gene FTP site – ftp://ftp.ncbi.nlm.nih.gov/gene/ • Reports Gene to RefSeq accession associations, and more. • NCBI Programming Utilities (eUtils) – supports scripted retrievals • Introduction: http://www.ncbi.nlm.nih.gov/books/NBK25497/ • Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/
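As a minimal sketch of a scripted retrieval with eUtils, the Python below builds an ESearch request for the Protein database and, optionally, fetches the matching record IDs. The function names `esearch_url` and `esearch_ids` are hypothetical helpers written for this example; the `esearch.fcgi` endpoint and the `db`, `term`, `retmax`, and `retmode` parameters are standard E-utilities usage. Consult the eUtils help pages listed above for usage guidelines before running queries in bulk.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; the query term is URL-encoded here."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def esearch_ids(db, term, retmax=20):
    """Run the search over the network and return the list of record IDs."""
    with urlopen(esearch_url(db, term, retmax)) as response:
        result = json.load(response)
    return result["esearchresult"]["idlist"]


# Building the URL needs no network access.
url = esearch_url("protein", "refseq[filter] AND human[orgn]", retmax=5)
print(url)
```

Calling `esearch_ids("protein", "refseq[filter] AND human[orgn]", retmax=5)` would contact NCBI and return up to five IDs, which can then be passed to EFetch to download the records.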
  • 60. User feedback and RefSeq updates • Feedback: http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi • RefSeq Updates: subscribe to the refseq-announce mail list http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce/ • NCBI News http://www.ncbi.nlm.nih.gov/news/ (RefSeq Home page; Gene report pages)
  • 61. Acknowledgements • RefSeq Curators (Vertebrates & Other taxa): Stacy Ciufo, Eric Cox, Diana Haddad, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Wenjun Li, Kelly McGarvey, Mike Murphy, Nuala O'Leary, Kathleen O’Neill, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, Barbara Robberts, Brian Smith-White, Anjana Raina Vatsan, Dave Webb, Matt Wright • Databases & programming: Terence Murphy, Olga Ermolaeva, Craig Wallin, Alex Astashyn, David Maganadze, Mike DiCuccio, Andrei Shkeda, Donna Maglott • NCBI Leadership: David Lipman, James Ostell • Genome Workbench & RADAR: Anatoliy Kuznetsov, David Falk, Andrei Shkeda