Introduction to Apollo: A webinar for the i5K Research Community

Introduction to Apollo 
Collaborative genome annotation editing
A webinar for the i5K Research Community
Monica Munoz-Torres, PhD | @monimunozto 
Berkeley Bioinformatics Open-Source Projects (BBOP) 
Lawrence Berkeley National Laboratory |  
University of California Berkeley | U.S. Department of Energy
 
i5K Pilot Project Species Call | 13 October, 2015

OUTLINE
Web Apollo: Collaborative Curation and Interactive Analysis of Genomes

•  Today we will discover how to extract very valuable information about a genome through curation efforts.

APOLLO DEVELOPMENT
APOLLO DEVELOPERS
http://GenomeArchitect.org/

•  Nathan Dunn
•  Eric Yao (JBrowse, UC Berkeley)
•  Colin Diesh and Deepak Unni (Christine Elsik’s Lab, University of Missouri)
•  Suzi Lewis (Principal Investigator, BBOP)
•  Moni Munoz-Torres
•  Stephen Ficklin (GenSAS, Washington State University)

AFTER THIS TALK WE WILL...
What to expect
•  Better understand genome curation in the context of annotation: assembled genome → automated annotation → manual annotation.
•  Become familiar with the environment and functionality of the Apollo genome annotation editing tool.
•  Learn to identify homologs of known genes of interest in a newly sequenced genome.
•  Learn about corroborating and modifying automatically annotated gene models using available evidence in Apollo.

A typical genome sequencing project

Genome Sequencing Project
Anatomy of a genome sequencing project
[Figure: stages labeled in the diagram are experimental design and sampling; sequencing; assembly; automated annotation; manual annotation; consensus gene set; comparative analyses; synthesis and dissemination.]

CURATING GENOMES 
steps involved
1  Generation of Gene Models: calling ORFs, one or more rounds of gene prediction, etc.
2  Annotation of gene models: describing function, expression patterns, metabolic network memberships.
3  Manual annotation

GENOME ANNOTATION 
objectives and uses
The gene set of an organism informs a variety of studies:
•  Gene number, GC%, TE composition, repetitive regions.
•  Functional assignments.
•  Molecular evolution, sequence conservation.
•  Gene families.
•  Metabolic pathways.
•  What makes an organism what it is? What makes a bee a “bee”?

Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild

REMEMBER...  
for manual annotation
To remember…
Biological concepts to better understand manual annotation
BIO-REFRESHER
•  GLOSSARY: from contig to splice site
•  CENTRAL DOGMA: in molecular biology
•  WHAT IS A GENE?: defining your goal
•  TRANSCRIPTION: mRNA in detail
•  TRANSLATION: and other definitions
•  GENOME CURATION: steps involved

WHAT IS A GENE?
•  A continuously evolving concept paints a very complex picture of molecular activity:
“A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”.
- The Sequence Ontology

WHAT IS A GENE?
•  ... also long transcripts, dispersed regulation.
“The gene is a DNA segment that contributes to phenotype and function. In the absence of demonstrated function, a gene may be characterized by sequence, transcription or homology.”
- The ENCODE Project

https://www.encodeproject.org/

THE GENE: a moving target
“The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.”
Gerstein et al., 2007. Genome Res

TRANSLATION
reading frames
•  A reading frame is a manner of dividing the sequence of nucleotides in mRNA (or DNA) into a set of consecutive, non-overlapping triplets (codons).
•  Three frames can be read in the 5’ → 3’ direction. Given that DNA has two anti-parallel strands, an additional three frames can be read on the anti-sense strand. Six possible reading frames exist in total.
•  In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. This is called the OPEN READING FRAME (ORF).
   –  ORF = Start signal + coding sequence (divisible by 3) + Stop signal
•  The sections of the mature mRNA transcribed with the coding sequence but not translated are called UnTranslated Regions (UTR); one at each end.
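
The six reading frames and the ORF definition above can be sketched in a few lines of Python. This is only an illustration, not Apollo's implementation: the function names are hypothetical, the input is assumed to be an uppercase, already-spliced DNA string, and only ATG starts and the three standard stop codons are considered.

```python
# Minimal sketch: enumerate the six reading frames and report the longest
# ATG...stop ORF. Function names are hypothetical, not part of Apollo.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built from the conventional TCAG codon ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    """Return the anti-sense strand read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(seq):
    """Yield (strand, offset, codons) for the three frames on each strand."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield strand, offset, codons


def longest_orf(seq):
    """Return (strand, offset, orf) for the longest ORF, or None if absent."""
    best = None
    for strand, offset, codons in six_frames(seq):
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = idx                      # first Start signal in this frame
            elif start is not None and CODON_TABLE.get(codon) == "*":
                orf = "".join(codons[start:idx + 1])  # Start + coding + Stop
                if best is None or len(orf) > len(best[2]):
                    best = (strand, offset, orf)
                start = None                     # look for the next ORF
    return best


if __name__ == "__main__":
    strand, offset, orf = longest_orf("CCATGAAATGGTAACC")
    protein = "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf), 3))
    print(strand, offset, orf, protein)
```

Because every ORF found this way starts at a codon boundary and ends with a stop codon, its length is always divisible by 3, matching the definition on the slide.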

TRANSLATION
splice sites
•  The spliceosome catalyzes the removal of introns and the ligation of flanking exons.
   –  introns: spaces inside the gene, not part of the coding sequence
   –  exons: expression units (of the coding sequence)
•  Splicing signals (from the point of view of an intron):
   –  One splice signal (site) on the 5’ end: usually GT (less common: GC)
   –  And a 3’ end splice site: usually AG
   –  Canonical splice sites look like this: …]5’-GT/AG-3’[…
•  It is possible to produce more than one protein (polypeptide) sequence from the same genic region by alternatively bringing exons together = alternative splicing. For example, the gene Dscam (Drosophila) has 38,000 alternatively spliced mRNAs = isoforms.
TRANSLATION
phase
•  Introns can interrupt the reading frame of a gene by inserting a sequence between two consecutive codons
•  Between the first and second nucleotide of a codon
•  Or between the second and third nucleotide of a codon
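
The three cases above are commonly called intron phase 0, 1 and 2. A small sketch, assuming we already know the coding length contributed by each exon (UTR bases excluded) in 5’ to 3’ order; the function name is hypothetical.

```python
# Minimal sketch: intron phase from the coding length of each exon.
# Phase 0 = intron falls between two codons, phase 1 = after the first
# nucleotide of a codon, phase 2 = after the second nucleotide.
from itertools import accumulate


def intron_phases(coding_exon_lengths):
    """Return the phase of the intron that follows each exon except the last."""
    running_totals = list(accumulate(coding_exon_lengths))
    return [total % 3 for total in running_totals[:-1]]


if __name__ == "__main__":
    # Three coding exons of 99, 100 and 101 bases (300 coding bases in total).
    print(intron_phases([99, 100, 101]))
```

A complete coding sequence should sum to a multiple of 3, which is a cheap sanity check to run alongside the phases.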

"Exon and Intron classes”. Licensed under Fair use via Wikipedia

"Gene structure" by Daycd- Wikimedia Commons
mRNA
now in your mind
•  Although of brief existence, understanding mRNAs is crucial, as they will become the center of your work.

GENE PREDICTION & ANNOTATION
PREDICTION & ANNOTATION
•  Identification and annotation of genome features:
   –  primarily focuses on protein-coding genes.
   –  also identifies RNAs (tRNA, rRNA, long and small non-coding RNAs (ncRNA)), regulatory motifs, repetitive elements, etc.
   –  happens in 2 phases:
      1.  Computation phase
      2.  Annotation phase

COMPUTATION PHASE
a.  Experimental data are aligned to the genome: expressed sequence tags, RNA-sequencing reads, proteins (homologous and heterologous).
b.  Gene predictions are generated:
   -  ab initio: based on nucleotide sequence and composition, e.g. Augustus, GENSCAN, geneid, fgenesh, etc.
   -  evidence-driven: identifying also domains and motifs, e.g. SGP2, JAMg, fgenesh++, etc.
Result: the single most likely coding sequence, no UTRs, no isoforms.

Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174

ANNOTATION PHASE
Experimental data (evidence) and predictions are synthesized into gene annotations.
Result: gene models that [generally] include UTRs, isoforms, evidence trails.
Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174
[Figure: gene model with the 5’ UTR and 3’ UTR labeled.]
CONSENSUS GENE SETS
Gene models may be organized into sets using:
•  combiners for automatic integration of predicted sets, e.g. GLEAN, EvidenceModeler
or
•  tools packaged into pipelines, e.g. MAKER, PASA, Gnomon, Ensembl, etc.
In some cases, algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation.

ANNOTATION 
an imperfect art
No one is perfect, least of all automated annotation.
New technology brings new challenges:
•  Assembly errors can cause fragmented annotations
•  Limited coverage makes precise identification a difficult task

Image: www.BroadInstitute.org

MANUAL ANNOTATION 
improving predictions
Precise elucidation of biological features encoded in the genome requires careful examination and review.
Schiex et al. Nucleic Acids 2003 (31) 13: 3738-3741
[Figure: Automated Predictions shown against Experimental Evidence (cDNAs, HMM domain searches, RNAseq, genes from other species).]
Manual Annotation – to the rescue.

BIOCURATION
structural and functional adjustments
1  Identifies elements that best represent the underlying biology and eliminates elements that reflect systemic errors of automated analyses.
2  Assigns function through comparative analysis of similar genome elements from closely related species using literature, databases, and experimental data.
http://GeneOntology.org

GENOME ANNOTATION 
an inherently collaborative task
Researchers often turn to colleagues for second opinions and insight from those with expertise in particular areas (e.g., domains, families).
So many sequences, not enough hands.

APOLLO 
collaborative genome annotation editing tool
•  Web based, integrated with JBrowse.
•  Supports real-time collaboration.
•  Automatic generation of ready-made computable data.
•  Supports annotation of genes, pseudogenes, tRNAs, snRNAs, snoRNAs, ncRNAs, miRNAs, TEs, and repeats.
•  Intuitive annotation, gestures, and pull-down menus to create and edit transcripts and exon structures, insert comments (CV, freeform text), associate GO terms, etc.
http://GenomeArchitect.org

Continuous training and support for hundreds of geographically dispersed scientists, from diverse research communities, in conducting manual annotation efforts to recover coding sequences in agreement with all available biological evidence using Apollo.

LESSONS LEARNED
•  Collaborative work distills invaluable knowledge

A LITTLE TRAINING GOES A LONG WAY!
Provided with adequate tools, wet lab scientists make exceptional curators who can easily learn to maximize the generation of accurate, biologically supported gene models.

Apollo - current version at i5K Workspace@NAL
The Sequence Selection Window
4. Becoming Acquainted with Web Apollo.

APOLLO
annotation editing environment
BECOMING ACQUAINTED WITH APOLLO
Screenshot callouts:
•  Color by CDS frame, toggle strands, set color scheme and highlights.
•  Upload evidence files (GFF3, BAM, BigWig), combination track, sequence search track.
•  Query the genome using BLAT.
•  Navigation and zoom.
•  Search for a gene model or a scaffold.
•  Get coordinates and “rubber band” selection for zooming.
•  Login
•  User-created annotations.
•  New annotator panel.
•  Evidence Tracks
•  Stage and cell-type specific transcription data.
http://genomearchitect.org/web_apollo_user_guide

GENERAL PROCESS OF CURATION
main steps to remember
1.  Select or find a region of interest, e.g. scaffold.
2.  Select appropriate evidence tracks to review the gene model.
3.  Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.
4.  If necessary, adjust the gene model.
5.  Check your edited gene model for integrity and accuracy by comparing it with available homologs.
6.  Comment and finish.

USER NAVIGATION
removable side dock
HIGHLIGHTED IMPROVEMENTS
Side dock tabs: Annotations, Tracks, Reference Sequence, Organism, Users, Groups, Admin

EDITS & EXPORTS
annotation details, exon boundaries, data export
[Screenshots: (1, 2) Annotations panel listing gene and mRNA entries; (3) Reference Sequences panel with FASTA and GFF3 export.]

USER NAVIGATION
Annotator panel.
•  Choose appropriate evidence from list of “Tracks” on annotator panel.
•  Select & drag elements from evidence track into the ‘User-created Annotations’ area.
•  Hovering over annotation in progress brings up an information pop-up.
•  Creating a new annotation

•  Annotation right-click menu
•  ‘Zoom to base level’ option reveals the DNA Track.
•  Color exons by CDS from the ‘View’ menu.
•  Zoom in/out with keyboard: shift + arrow keys up/down
•  Toggle reference DNA sequence and translation frames in forward strand. Toggle models in either direction.

ANNOTATING SIMPLE CASES
“Simple case”:
-  the predicted gene model is correct or nearly correct, and
-  this model is supported by evidence that completely or mostly agrees with the prediction.
-  evidence that extends beyond the predicted model is assumed to be non-coding sequence.
The following are simple modifications.

ADDING EXONS
•  A confirmation box will warn you if the receiving transcript is not on the same strand as the feature where the new exon originated.
•  Check ‘Start’ and ‘Stop’ signals after each edit.

ADDING UTRs
If transcript alignment data are available & extend beyond your original annotation, you may extend or add UTRs.
1.  Right click at the exon edge and ‘Zoom to base level’.
2.  Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.
To add a new spliced UTR to an existing annotation, also follow the procedure for adding an exon.


MATCHING EXON BOUNDARY TO EVIDENCE
To modify an exon boundary and match data in the evidence tracks: select both the [offending] exon and the feature with the expected boundary, then right click on the annotation to select ‘Set 3’ end’ or ‘Set 5’ end’ as appropriate.
In some cases all the data may disagree with the annotation; in other cases some data support the annotation and some of the data support one or more alternative transcripts. Try to annotate as many alternative transcripts as are well supported by the data.

ORFs AND SPLICE SITES
Screenshot callouts: non-canonical splice site flags; double click: selection of feature and sub-features; Evidence Tracks Area; ‘User-created Annotations’ Track; edge-matching.
Apollo’s editing logic (brain):
•  selects longest ORF as CDS
•  flags non-canonical splice sites

SPLICE SITES
Non-canonical splices are indicated by an orange circle with a white exclamation point inside, placed over the edge of the offending exon.
Canonical splice sites:
forward strand: 5’-…exon]GT / AG[exon…-3’
reverse strand, not reverse-complemented: 3’-…exon]GA / TG[exon…-5’
Zoom to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. GC donor), they should be flagged with a comment.
[Screenshot labels: exon/intron splice site error warning; curated model.]
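
To see why a reverse-strand intron shows GA / TG when the strand is displayed without reverse-complementing, the sketch below takes an intron in forward coordinates and reports its end dinucleotides as they read on each strand from left to right. It is a toy illustration with assumed 0-based, end-exclusive coordinates and a hypothetical function name.

```python
# Minimal sketch: intron end dinucleotides as displayed on both strands.
# The reverse strand is complemented but NOT reversed, i.e. it is read
# left to right in forward coordinates (3' to 5').
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def intron_ends_as_displayed(forward_seq, intron_start, intron_end):
    """Return ((fwd_left, fwd_right), (rev_left, rev_right)) for one intron."""
    intron = forward_seq[intron_start:intron_end]
    fwd = (intron[:2], intron[-2:])
    rev = (fwd[0].translate(COMPLEMENT), fwd[1].translate(COMPLEMENT))
    return fwd, rev


if __name__ == "__main__":
    # Intron of a forward-strand gene: GT ... AG on the forward strand.
    print(intron_ends_as_displayed("AAA" + "GTAAGTCCCCAG" + "TTT", 3, 15))
    # Intron of a reverse-strand gene: the forward strand carries the
    # reverse complement of GTAAGTCCCCAG, so the reverse strand shows GA ... TG.
    print(intron_ends_as_displayed("AAA" + "CTGGGGACTTAC" + "TTT", 3, 15))
```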


‘Start’ AND ‘Stop’ SITES
Apollo calculates the longest possible open reading frame (ORF) that includes canonical ‘Start’ and ‘Stop’ signals within the predicted exons.
If ‘Start’ appears to be incorrect, modify it by selecting an in-frame ‘Start’ codon further up or downstream, depending on evidence (proteins, RNAseq).
It may be present outside the predicted gene model, within a region supported by another evidence track.
In very rare cases, the actual ‘Start’ codon may be non-canonical (non-ATG).
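
When looking for an alternative in-frame ‘Start’, it can help to list the candidates first. The sketch below is an assumption-laden illustration, not Apollo functionality: it takes a spliced transcript as an uppercase DNA string and the 0-based position of the current start codon, and it only considers canonical ATG starts.

```python
# Minimal sketch: list ATG codons in the same frame as the current start.
# Upstream candidates are collected until an in-frame stop codon is met;
# downstream candidates are collected until the ORF's own stop codon.
STOPS = {"TAA", "TAG", "TGA"}


def in_frame_starts(transcript, cds_start):
    """Return (upstream_positions, downstream_positions), all 0-based."""
    upstream, downstream = [], []
    pos = cds_start - 3
    while pos >= 0:                       # walk towards the 5' end
        codon = transcript[pos:pos + 3]
        if codon in STOPS:
            break                         # a stop upstream closes the frame
        if codon == "ATG":
            upstream.append(pos)
        pos -= 3
    pos = cds_start + 3
    while pos + 3 <= len(transcript):     # walk towards the 3' end
        codon = transcript[pos:pos + 3]
        if codon in STOPS:
            break                         # end of the current ORF
        if codon == "ATG":
            downstream.append(pos)
        pos += 3
    return upstream, downstream


if __name__ == "__main__":
    transcript = "TAA" "ATG" "CCC" "ATG" "AAA" "ATG" "GGG" "TGA"
    print(in_frame_starts(transcript, 9))
```

Which candidate is right still depends on the evidence (protein alignments, RNAseq); the list only narrows down where to look.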

CHECKING EXON INTEGRITY
1.  Zoom in to clearly resolve each exon as a distinct rectangle.
2.  Two exons from different tracks sharing the same start/end coordinates display a red bar to indicate matching edges.
3.  Selecting the whole annotation or one exon at a time, use this edge-matching function and scroll along the length of the annotation, verifying exon boundaries against available data. Use square [ ] brackets to scroll from exon to exon. Use curly { } brackets to scroll from annotation to annotation.
4.  Check if cDNA / RNAseq reads lack one or more of the annotated exons or include additional exons.
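
The idea behind edge matching (step 2) and the exon comparison in step 4 can be written down as plain coordinate checks. This is a sketch of the concept only, with hypothetical function names and exons assumed to be (start, end) tuples in a shared, end-exclusive coordinate system; it says nothing about how Apollo implements the red-bar display.

```python
# Minimal sketch: compare annotated exons with exons from one evidence track.
def edge_matches(annotation_exons, evidence_exons):
    """Return (start, end, start_matches, end_matches) per annotated exon."""
    starts = {start for start, _ in evidence_exons}
    ends = {end for _, end in evidence_exons}
    return [(start, end, start in starts, end in ends)
            for start, end in annotation_exons]


def overlaps(a, b):
    """True when two (start, end) intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def exon_differences(annotation_exons, evidence_exons):
    """Return (annotated exons without evidence, evidence exons not annotated)."""
    unsupported = [a for a in annotation_exons
                   if not any(overlaps(a, e) for e in evidence_exons)]
    additional = [e for e in evidence_exons
                  if not any(overlaps(a, e) for a in annotation_exons)]
    return unsupported, additional


if __name__ == "__main__":
    annotation = [(100, 200), (300, 400), (500, 600)]
    evidence = [(100, 200), (310, 400), (700, 800)]
    print(edge_matches(annotation, evidence))
    print(exon_differences(annotation, evidence))
```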

COMPLEX CASES
merge two gene predictions on the same scaffold
Evidence may support joining two or more different gene models. Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!
1.  In the ‘User-created Annotations’ area, shift-click to select an intron from each gene model and right click to select the ‘Merge’ option from the menu.
2.  Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.
3.  Check the resulting translation by querying a protein database, e.g. UniProt, NCBI nr. Add comments to record that this annotation is the result of a merge.
Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement without examining each exon at the base level.

COMPLEX CASES
split a gene prediction
One or more splits may be recommended when:
-  different segments of the predicted protein align to two or more different gene families
-  predicted protein doesn’t align to known proteins over its entire length
-  Transcript data may support a split, but first verify whether they are alternative transcripts.

COMPLEX CASES
correcting frameshifts and single-base errors
[Screenshot labels: DNA Track; ‘User-created Annotations’ Track.]
Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.

COMPLEX CASES
correcting selenocysteine containing proteins

COMPLEX CASES
correcting frameshifts, single-base errors, and selenocysteines
1.  Apollo allows annotators to make single base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence.
2.  If you determine that you need to make one of these changes, zoom in to the nucleotide level and right click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.
3.  The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor’s current location. The ‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature asks for the string of nucleotide residues that will replace the ones on the DNA track.
4.  Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu ‘Get Sequence’ option. Since the underlying genomic sequence is reflected in all annotations that include the modified region, you should alert the curators of your organism’s database using the ‘Comments’ section to report the CDS edits.
5.  In special cases such as selenocysteine containing proteins (read-throughs), right-click over the offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from the menu.
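
The three kinds of modification in step 3 can be mimicked on a plain string to see their effect on a transcript while the genomic sequence stays untouched. This is a toy model, not Apollo code: the function name, the 0-based position and the string representation are all assumptions made for illustration.

```python
# Minimal sketch: apply an insertion, deletion or substitution to a COPY of
# the sequence, leaving the original ("frozen") genomic string unchanged.
def apply_alteration(genomic, kind, position, value):
    """Return an edited copy of `genomic`.

    kind == "insertion":    value is a string inserted to the right of position
    kind == "deletion":     value is the number of bases removed from position on
    kind == "substitution": value is a string replacing bases from position on
    """
    if kind == "insertion":
        return genomic[:position + 1] + value + genomic[position + 1:]
    if kind == "deletion":
        return genomic[:position] + genomic[position + value:]
    if kind == "substitution":
        return genomic[:position] + value + genomic[position + len(value):]
    raise ValueError(f"unknown alteration kind: {kind}")


if __name__ == "__main__":
    genomic = "ATGAAATAG"
    print(apply_alteration(genomic, "insertion", 2, "C"))     # frameshift (+1)
    print(apply_alteration(genomic, "deletion", 3, 1))        # frameshift (-1)
    print(apply_alteration(genomic, "substitution", 3, "C"))  # single-base change
    print(genomic)                                            # still the original
```

Insertions and deletions whose length is not a multiple of 3 shift the reading frame downstream of the edit, which is why ‘Start’ and ‘Stop’ should be checked again afterwards.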

•  Annotation right-click menu
Annotations, annotation edits, and History: stored in a centralized database.

COMPLETING THE ANNOTATION
Follow the checklist until you are happy with the annotation!
And remember to…
–  comment to validate your annotation, even if you made no changes to an existing model. Think of comments as your vote of confidence.
–  or add a comment to inform the community of unresolved issues you think this model may have.
Always Remember: Apollo curation is a community effort so please use comments to communicate the reasons for your annotation. Your comments will be visible to everyone.

•  Annotation right-click menu
The Annotation Information Editor
DBXRefs are database cross-references: if you have reason to believe that this gene is linked to a gene in a public database (including your own), then add it here.
•  Add PubMed IDs
•  Include GO terms as appropriate from any of the three ontologies
•  Write comments stating how you have validated each model.

MANUAL ANNOTATION CHECKLIST
for accuracy and integrity
•  Check ‘Start’ and ‘Stop’ sites.
•  Check splice sites: most splice sites display these residues …]5’-GT/AG-3’[…
•  Check if you can annotate UTRs, for example using RNA-Seq data:
   –  align it against relevant genes/gene family
   –  blastp against NCBI’s RefSeq or nr
•  Check for gaps in the genome.
•  Additional functionality may be necessary:
   –  merging 2 gene predictions on the same scaffold
   –  merging 2 gene predictions from different scaffolds
   –  splitting a gene prediction
   –  correcting frameshifts and other errors in the genome assembly
   –  annotating selenocysteines, correcting single-base errors, etc.
•  Add:
   –  Important project information in the form of comments
   –  IDs from public databases e.g. GenBank (via DBXRef), gene symbol(s), common name(s), synonyms, top BLAST hits, orthologs with species names, and everything else you can think of, because you are the expert.
   –  Comments about the kinds of changes you made to the gene model of interest, if any.
   –  Any appropriate functional assignments, e.g. via BLAST, RNA-Seq data, literature searches, etc.

i5K Workspace@NAL
THE COLLABORATIVE CURATION PROCESS AT i5K
1.  A computationally predicted consensus gene set has been generated using multiple lines of evidence; e.g. LDEC_v0.5.3-Models
2.  i5K Projects will integrate consensus computational predictions with manual annotations to produce an updated Official Gene Set (OGS):
Attention!
•  If it’s not on either track, it won’t make the OGS!
•  If it’s there and it shouldn’t be, it will still make the OGS!

3.  In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation. Use your judgment and choose a different model to annotate.
4.  Isoforms: drag original and alternatively spliced form to ‘User-created Annotations’ area.
5.  If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area and label as ‘Delete’ on the Information Editor.
6.  Overlapping interests? Collaborate to reach agreement.
7.  Follow guidelines for i5K Pilot Species Projects, at http://goo.gl/LRu1VY

Example
Curation example using the Hyalella azteca genome (amphipod crustacean).

What do we know about this genome?
•  Currently publicly available data at NCBI:
   –  >37,000 nucleotide seqs → scaffolds, mitochondrial genes
   –  344 amino acid seqs → mitochondrion
   –  47 ESTs
   –  0 conserved domains identified
   –  0 “gene” entries submitted
•  Data at i5K Workspace@NAL (annotation hosted at USDA)
   –  10,832 scaffolds: 23,288 transcripts: 12,906 proteins

PubMed Search: what’s new?
“Ten populations (3 cultures, 7 from California water bodies) differed by at least 550-fold in sensitivity to pyrethroids.”
“By sequencing the primary pyrethroid target site, the voltage-gated sodium channel (vgsc), we show that point mutations and their spread in natural populations were responsible for differences in pyrethroid sensitivity.”
“The finding that a non-target aquatic species has acquired resistance to pesticides used only on terrestrial pests is troubling evidence of the impact of chronic pesticide transport from land-based applications into aquatic systems.”

How many sequences are there, publicly available, for our gene of interest?
•  Para (voltage-gated sodium channel alpha subunit; Nasonia vitripennis).
•  NaCP60E (Sodium channel protein 60 E; D. melanogaster).
And what do we know about them?
–  MF: voltage-gated cation channel activity (IDA, GO:0022843).
–  BP: olfactory behavior (IMP, GO:0042048), sodium ion transmembrane transport (ISS, GO:0035725).
–  CC: voltage-gated sodium channel complex (IEA, GO:0001518).

Retrieving sequences for a sequence similarity search.
>vgsc-Segment3-DomainII

RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDG
QMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR

BLAT search: input

BLAT search: results
•  High-scoring segment pairs (hsp) are listed in tabulated format.
•  Clicking on one line of results sends you to those coordinates.
BLAST at i5K: https://i5k.nal.usda.gov/blast

BLAST at i5K: hsps in “BLAST+ Results” track

Creating a new gene model: drag and drop
•  Apollo automatically calculates the longest ORF.
•  In this case, the ORF includes the high-scoring segment pairs (hsp), marked here in blue.
•  Note that the gene is transcribed from the reverse strand.

Get Sequence
http://blast.ncbi.nlm.nih.gov/Blast.cgi

Also, flanking sequences (other gene models) vs. NCBI nr
In this case, two gene models upstream, at 5’ end.
BLAST hsps

Review alignments
HaztTmpM006234
HaztTmpM006233
HaztTmpM006232

Hypothesis for vgsc gene model

Editing: merge the three models
Merge by dropping an exon or gene model onto another.
or…
Merge by selecting two exons (holding down “Shift”) and using the right click menu.

Result of merging the gene models:

Editing: correct offending splice site
Modify exon / intron boundary:
-  Drag the end of the exon to the nearest canonical splice site.
or
-  Use right-click menu.

Editing: set translation start

Editing: delete exon not supported by evidence
Delete first exon from HaztTmpM006233

Editing: add an exon supported by RNAseq
•  RNAseq reads show evidence in support of transcribed product, which was not predicted.
•  Add exon at coordinates 97946-98012 by dragging up one of the RNAseq reads.

Editing: adjust offending splice site using evidence

Editing: adjust other boundaries supported by evidence

Finished model
Corroborate integrity and accuracy of the model:
-  Start and Stop
-  Exon structure and splice sites …]5’-GT/AG-3’[…
-  Check the predicted protein product vs. NCBI nr, UniProt, etc.

Information Editor
•  DBXRefs: e.g. NP_001128389.1, N. vitripennis, RefSeq
•  PubMed identifier: PMID: 24065824
•  Gene Ontology IDs: GO:0022843, GO:0042048, GO:0035725, GO:0001518.
•  Comments
•  Name, Symbol
•  Approve / Delete radio button
Comments (if applicable)
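
Cross-references and GO terms like the ones above are the kind of metadata that the GFF3 format carries in its ninth (attributes) column, where `Dbxref`, `Ontology_term`, `Name` and `Note` are reserved attribute names. The sketch below shows how such a column could be assembled. It is an illustration only: whether and how Apollo writes these attributes on export is not stated in the slides, and the `RefSeq:` database prefix, the gene name `vgsc` and the note text are assumptions.

```python
# Minimal sketch: build a GFF3 attributes column (column 9).
# Attribute names are GFF3 reserved names; the values are examples only.
def gff3_escape(text):
    """Percent-encode the characters that have special meaning in column 9."""
    for ch in "%;=&,\t\n":               # '%' first, to avoid double encoding
        text = text.replace(ch, f"%{ord(ch):02X}")
    return text


def gff3_attributes(pairs):
    """Serialize [(key, [values...]), ...] as 'key=v1,v2;key2=v3'."""
    return ";".join(
        f"{key}={','.join(gff3_escape(value) for value in values)}"
        for key, values in pairs)


if __name__ == "__main__":
    attributes = gff3_attributes([
        ("Name", ["vgsc"]),                                        # hypothetical name
        ("Dbxref", ["RefSeq:NP_001128389.1", "PMID:24065824"]),    # prefix assumed
        ("Ontology_term", ["GO:0022843", "GO:0042048",
                           "GO:0035725", "GO:0001518"]),
        ("Note", ["Result of a merge; boundaries adjusted to RNAseq"]),
    ])
    print(attributes)
```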

PUBLIC DEMO
APOLLO ON THE WEB
instructions
At i5K
1.  Register for access to Apollo at the i5K Workspace@NAL at https://i5k.nal.usda.gov/web-apollo-registration
2.  Contact the coordinator for each species community to receive more information about how to contribute. Contact info is available on each organism’s page.

Public Honey bee demo available at: http://GenomeArchitect.org/WebApolloDemo

APOLLO
demonstration
Demonstration video is available at https://youtu.be/VgPtAP_fvxY

OUTLINE
Web Apollo: Collaborative Curation and Interactive Analysis of Genomes
•  BIO-REFRESHER: biological concepts for curation
•  ANNOTATION: automatic predictions
•  MANUAL ANNOTATION: necessary, collaborative
•  APOLLO: advancing collaborative curation
•  EXAMPLE: demos

Thank you!
•  Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).
•  § Christine G. Elsik (PI). University of Missouri.
•  * Ian Holmes (PI). University of California Berkeley.
•  Arthropod genomics community: i5K Steering Committee (esp. Sue Brown (Kansas State)), Alexie Papanicolaou (UWS), and the Honey Bee Genome Sequencing Consortium.
•  Stephen Ficklin, GenSAS, Washington State University
•  Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI. Both projects are also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231
•  For your attention, thank you!

BBOP
•  Apollo: Nathan Dunn, Colin Diesh §, Deepak Unni §
•  Gene Ontology: Chris Mungall, Seth Carbon, Heiko Dietze

NAL at USDA: Monica Poelchau, Christopher Childers, Gary Moore, Mei-Ju Chen
HGSC at BCM: Stephen Richards, Kim Worley
JBrowse: Eric Yao *

Apollo: http://GenomeArchitect.org
GO: http://GeneOntology.org
i5K: http://arthropodgenomes.org/wiki/i5K
