2014 villefranche

C. Titus Brown
Assistant Professor
MMG, CSE, BEACON
Michigan State University
May 2014
ctb@msu.edu
Applying mRNAseq to non-model organisms:
challenges, opportunities, and solutions

We practice open science!
Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
 Twitter: @ctitusbrown
 Grants on Lab Web site:
http://ged.msu.edu/research.html
 Preprints available.

Sequencing has become very
inexpensive.

Sequencing costs
 Approximately $1000 of mRNAseq will yield a
decent transcriptome.
 Multiple samples will allow you to generate gene
inventories.
 For the ascidian project I will show you,
 1 graduate student,
 2 transcriptomes,
 3 genomes…

Mapping => quantitation
Reference transcriptome required.

Interpreting RNAseq requires gene
models:
http://www.hitseq.com/images/RNA-seq_AS.jpg

The challenges of non-model
transcriptomics
 Missing or low quality genome reference.
 Evolutionarily distant.
 Most extant computational tools focus on model
organisms –
 Assume low polymorphism (internal variation)
 Assume reference genome
 Assume somewhat reliable functional annotation
 More significant compute infrastructure
…and cannot easily or directly be used on critters of
interest.

Outline
1. Challenges of non-model
transcriptomics.
2. Lamprey: too much data, not enough
genome
3. Digital normalization as a coping
mechanism
4. …applied to Molgulid ascidians…
5. …and back to lamprey.
6. More transcriptome challenges
7. What’s next?
Note: I also work on metagenomics, which I will not discuss today.

Sea lamprey in the Great Lakes
 Non-native
 Parasite of
medium to large
fishes
 Caused
populations of
host fishes to
crash
Li Lab / Y-W C-D

The problem of lamprey:
 Diverged at base of vertebrates;
evolutionarily distant from model
organisms.
 Large, complicated genome (~2 Gb)
 Relatively little existing sequence.
 We sequenced the liver genome…

Lamprey has incomplete genomic sequence
J. Smith et al., PNAS 2009
Evidence of somatic recombination;
100s of Mb of sequence eliminated
from genome during development.
More recent evidence (unpub, J.
Smith et al.) suggests that this loss
is developmentally regulated,
results in changes in gene
expression (due to loss of genes!),
and is tissue specific.
Liver genome is not the entire
genome.

Lamprey tissues for which we have
mRNAseq
embryo stages (late blastula, gastrula, neurula, 22b, neural-crest migration, 24c1, 24c2)
metamorphosis 2 (liver, intestine, …)
metamorphosis 3 (intestine, kidney)
metamorphosis 5 (liver, intestine, kidney)
larval (gill, kidney, liver, intestine)
juvenile (intestine, liver, kidney)
small parasite distal intestine, kidney, proximal intestine
freshwater (gill, intestine, kidney)
salt water (gill, intestine)
adult intestine
adult kidney
brain paired
brain (0, 3, 21 dpi)
spinal cord (0, 3, 21 dpi)
lips
supraneural tissue
monocytes
ovulatory female head skin
preovulatory female eye
preovulatory female tail skin
prespermiating male gill
spermiating male gill
spermiating male head skin
spermiating male muscle
mature adult male rope tissue
… kidney) (fragment, four occurrences)

Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!

Shared low-level
transcripts may
not reach the
threshold for
assembly.

Main problem (4 years ago):
We have a massive amount of data
that challenges existing computers
when we try to assemble it all
together.

Solution: Digital normalization
(a computational version of library normalization)
Suppose A and B are present at a
10:1 ratio (a dilution factor of 10).
To get 10x coverage of B you
need to get 100x of A!
Overkill!!
This 100x will consume
disk space and, because
of errors, memory.
We can discard it for
you…
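The arithmetic on this slide can be spelled out in a few lines. This is only an illustrative sketch: the function names are hypothetical, and it assumes A and B have equal length so that coverage is proportional to read count.

```python
def coverage_needed(abundance_ratio: float, target_cov: float) -> float:
    """Coverage the abundant sequence reaches once the rare one hits target_cov."""
    return abundance_ratio * target_cov


def discardable_fraction(abundance_ratio: float, target_cov: float) -> float:
    """Fraction of all reads that lies beyond target_cov.

    Assumption (not from the slides): both sequences have the same length,
    so read counts are proportional to coverage.
    """
    abundant_cov = coverage_needed(abundance_ratio, target_cov)
    total = abundant_cov + target_cov   # coverage sequenced across A and B
    kept = 2 * target_cov               # target_cov retained for each of A and B
    return (total - kept) / total


# The slide's example: 10:1 ratio, 10x wanted for B.
needed = coverage_needed(10, 10)        # coverage of A
excess = discardable_fraction(10, 10)   # share of reads beyond 10x
```

Under these assumptions most of the sequencing effort lands on A, which is the redundancy that normalization aims to throw away.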

Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
 Is single pass: looks at each read only once;
 Does not “collect” the majority of errors;
 Keeps all low-coverage reads;
 Smooths out coverage of sequencing.
=> Enables analyses that are otherwise completely
impossible.
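A minimal sketch of the idea, for readers who want to see the single-pass logic: estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. This is a toy version, not the khmer implementation. The exact `Counter` stands in for a memory-efficient counting structure, and the function names and the values of `k` and `cutoff` are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only while its estimated coverage is low.

    Assumption: an exact in-memory Counter is used here for clarity only.
    """
    counts = Counter()   # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue     # read shorter than k: no coverage estimate possible
        # Median k-mer count serves as the read's coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # Otherwise the read is redundant and is discarded, along with
        # any sequencing errors it carries.
    return kept


abundant = "ACGTTGCAAGCT"
rare = "TTTTGGGGCCCA"
kept = normalize_by_median([abundant] * 50 + [rare] * 2, k=4, cutoff=5)
```

In this toy run the 50 copies of the abundant read are capped at the cutoff while both copies of the rare read survive, which is the "keeps all low-coverage reads" and "smooths out coverage" behaviour listed above.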

Evaluating diginorm – how?
 Can’t assemble lamprey w/o
diginorm; are results any good &
how would we know?
 Need comparative data set
 …ascidians!

Looking at the Molgula…
Putnam et al., 2008, Nature.
Modified from Swalla 2001

Sea squirts!
Molgula oculata
Molgula occulta
Molgula oculata
Ciona intestinalis
Elijah Lowe; collaboration w/Billie Swalla

Challenging organisms to work on --
 Only spawn ~1 month out of the year
 Located off the northern coast of France (Roscoff)
 Hybrids not found outside of lab conditions
 Species cannot be cultured (yet)
 Wet lab techniques are not fully developed for species

Tail loss and notochord genes
a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996

Diginorm applied to Molgula
embryonic mRNAseq

Substantial
time savings
(3-5x)
Much less RAM
Elijah Lowe

Question: does it matter what
assembly pipeline you use? (No)
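[Figure: four-way Venn diagram of the Diginorm V/O, Raw V/O, Diginorm Trinity, and Raw Trinity assemblies. The largest region holds 13563 orthologs; each of the other 14 regions holds 70 or fewer.]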
Numbers are putative orthologs (reciprocal
best hits) w/Ciona intestinalis, calculated for
each assembly.
Elijah Lowe
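For anyone unfamiliar with reciprocal best hits, here is a minimal sketch of the definition. It assumes the similarity search has already been run in both directions and reduced to (query, subject, score) tuples. The function names, sequence IDs, and scores are all made up for illustration and are not from the actual analysis.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical hits: assembled transcripts (t*) vs. Ciona proteins (c*).
assembly_vs_ciona = [("t1", "c1", 200), ("t1", "c2", 150),
                     ("t2", "c2", 90), ("t3", "c1", 50)]
ciona_vs_assembly = [("c1", "t1", 210), ("c2", "t1", 160), ("c2", "t2", 80)]
pairs = reciprocal_best_hits(assembly_vs_ciona, ciona_vs_assembly)
```

Only t1/c1 qualifies in this example: t2 and t3 each have a best hit in Ciona, but that Ciona protein's own best hit is t1.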

How complete are these
transcriptomes?
Elijah Lowe

Shift in differentially expressed genes
from gastrulation to neurulation
[Panels: M. ocu vs. M. occ, gastrula; M. ocu vs. M. occ, neurula]
Differentially expressed during neurulation in M. ocu vs M. occ

Notochord gene expression similar to
tailed species
[Scatter plot: expression difference, hybrid vs. parent species. Axes: log2(hybrid) - log2(oculata) and log2(hybrid) - log2(occulta), each spanning -10 to 15.]
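The quantity on each axis is a difference of log2 expression values. A minimal sketch of that calculation follows. The pseudocount, which avoids taking the log of zero, is an assumption of this sketch and is not stated on the slide, and the function name is hypothetical.

```python
import math


def log2_difference(hybrid: float, parent: float, pseudocount: float = 1.0) -> float:
    """log2(hybrid) - log2(parent), with a pseudocount added to both values.

    Assumption: the pseudocount of 1 is illustrative; the slides do not say
    how zero-expression genes were handled.
    """
    return math.log2(hybrid + pseudocount) - math.log2(parent + pseudocount)


# A gene at 7 units in the hybrid and 1 unit in a parent species:
diff = log2_difference(7, 1)   # log2(8) - log2(2)
```

A gene with the same expression in the hybrid and in a parent sits at zero on that parent's axis, so genes close to zero on both axes are expressed similarly in all three.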

M. occulta transgenic NoTrlc
Alberto Stolfi & Lionel Christiaen

Lionel Christiaen (NYU)
Claudia Racioppi (Stazione Zoologica Napoli)

Enabling Molgula research…
 Develop candidate genes to generate
hypotheses about gene network
evolution;
 Rapid development of genomic
resources => reporter constructs.
Doesn’t answer any biological questions
directly, but enables us to go looking for
things much faster!

Transcriptome assembly
thoughts
 We can (now) assemble really big data
sets, and get pretty good results.
 We have lots of evidence (some
presented here :) that some assemblies
are not strongly affected by digital
normalization.
(Note: normalization algorithm is now
standard part of Trinity mRNAseq
pipeline.)

Transcriptome results - lamprey
 Started with 5.1 billion reads from 50
different tissues.
(4 years of computational research, and
about 1 month of compute time, GO
HERE)
Ended with:

Lamprey transcriptome basic
stats
 616,000 transcripts (!)
 263,000 transcript families (!)
(This seems like a lot.)

Lamprey transcriptome basic
stats
 616,000 transcripts
 263,000 transcript families
 Only 20436 transcript families have transcripts >
1kb
(compare with mouse: 17331 of 29769 genes
are > 1kb)
So, the rule-of-thumb estimate is not that far off
for long transcripts.

Common vs rare genes
[Plot axes: # transcripts, # samples]
Camille Scott

Can look at transcripts by tissue --
Camille Scott

Too… many… samples…
Camille Scott
Presence/absence clustering
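One common way to set up presence/absence clustering is to treat each sample as the set of transcripts detected in it and compare samples by Jaccard distance. The sketch below assumes that approach, which is not necessarily the one used for the figure. The function names, sample names, and transcript IDs are made up.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def pairwise_distances(samples: dict) -> dict:
    """Distance for every unordered pair of samples, keyed by sorted name pair."""
    return {
        (x, y): jaccard_distance(samples[x], samples[y])
        for x, y in combinations(sorted(samples), 2)
    }


# Hypothetical samples: which transcripts were detected in each tissue.
samples = {
    "gill": {"t1", "t2", "t3"},
    "liver": {"t2", "t3", "t4"},
    "brain": {"t9"},
}
distances = pairwise_distances(samples)
```

The resulting distances can be handed to any hierarchical clustering routine. Presence/absence ignores expression level, so it is a different view from expression-based clustering.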

Expression-based clustering
Some known biology recapitulated; and… ???
Camille Scott

Next steps with lamprey
 Far more complete transcriptome than the one
generated from the genome!
 (…but suffering from contamination,
oversensitivity to unprocessed transcripts, …?)
 Enabling studies in –
 Basal vertebrate phylogeny
 Biliary atresia
 Evolutionary origin of brown fat (previously thought
to be mammalian only!)
 Pheromonal response in adults
 Spinal cord regeneration

Next challenges
OK, we can deal with volume of data,
make pretty pictures, and ... Now what?

Contamination!
Both experimental and “real” contaminants are big problems.
Camille Scott

Pathway predictions vary
dramatically depending on data
set, annotation
Likit Preeyanon
KEGG
pathway
comparison
across several
different gene
annotation
sets for
chicken

The problem of lopsided gene characterization is
pervasive: e.g., the brain "ignorome"
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."
Ref.: Pandey et al. (2014), PLoS One 9, e88889.
Slide courtesy Erich Schwarz

Practical implications of diginorm
 Data is (essentially) free;
 For some problems, analysis is now
cheaper than data gathering (i.e.
essentially free);
 …plus, we can run most of our
approaches in the cloud (per-hour
rental compute resources – e.g.
Amazon Web Services).

1. khmer-protocols
 Effort to provide standard “cheap”
assembly protocols for the cloud.
 Entirely copy/paste; ~2-6 days from
raw reads to assembly,
annotations, and differential
expression analysis.
 Open, versioned, forkable, citable.
(“Don’t bother me unless it doesn’t
work.”)
Read cleaning
Diginorm
Assembly
Annotation
RSEM differential
expression

CC0; BSD; on github; in reStructuredText.

A few thoughts on our
approach…
 Explicitly a “protocol” – explicit steps, copy-paste,
customizable.
 No requirement for computational expertise or
significant computational hardware.
 ~1-5 days to teach a bench biologist to use.
 $100-150 of rental compute (“cloud computing”)…
 …for $1000 data set.
 Adding in quality control and internal validation
steps.

2. Data availability is important for
annotating distant sequences
[Figure legend: Anything else; Mollusc; Cephalopod; no similarity]

Can we incentivize data sharing?
 ~$100-$150/transcriptome in the cloud
 Offer to analyze people’s existing data for free,
IFF they open it up within a year.
See:
• CephSeq white paper.
• “Dead Sea Scrolls & Open Marine Transcriptome
Project” blog post;
Note: data sets can now be cited.

First results: Loligo
genomic/transcriptome resources
Putting other people’s sequences where my
mouth is:
w/Josh Rosenthal and Benton Grav

Acknowledgements
Lab members involved
 Adina Howe (w/Tiedje)
 Jason Pell
 Arend Hintze
 Qingpeng Zhang
 Elijah Lowe
 Likit Preeyanon
 Jiarong Guo
 Tim Brom
 Kanchan Pavangadkar
 Eric McDonald
 Camille Scott
 Jordan Fish
 Michael Crusoe
 Leigh Sheneman
Collaborators
 Billie Swalla (UW)
 Josh Rosenthal (UPR)
 Weiming Li (MSU)
 Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM)
Funding
USDA NIFA; NSF IOS;
NIH; BEACON.

C. Titus Brown (MSU)
Billie J. Swalla (UW)
