2014 whitney-research

Like the dog that caught the bus:
now what?
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
MMG, CSE, BEACON
Michigan State University
Feb 2014
ctb@msu.edu

The challenges of non-model
transcriptomics
 Missing or low quality genome reference.
 Evolutionarily distant.
 Most extant computational tools focus on model organisms:
   Assume low polymorphism (internal variation)
   Assume reference genome
   Assume somewhat reliable functional annotation
   More significant compute infrastructure
…and cannot easily or directly be used on critters of interest.

Isoform analysis – some easy…

Isoform analysis – some hard

Counting methods mostly rely on presence of unique
sequence to which to map.

Types of Alternative Splicing
40%

25%

<5%, more in plants, fungi, protozoa

Keren H, Lev-Maor G & Ast G, Nat Rev Genet 2010

locate, given genomic
sequence

Genome-reference-free assembly
leads to many isoforms.

Massive redundancy!

Gene models can be “collapsed”
given genomic sequence.

In sum,
 mRNAseq is pretty easy to deal with if you have a good genomic sequence.
 We don’t have a good genomic sequence for many organisms, including lamprey.
 We need to do de novo assembly to construct a transcriptome from short reads.
 We also have lots and lots of mRNAseq sequence:

The problem of lamprey…
 Diverged at base of vertebrates; evolutionarily distant from model organisms.
 Large, complicated genome (~2 Gb)
 Relatively little existing sequence.
 We sequenced the liver genome…

Sea lamprey in the Great Lakes
 Non-native
 Parasite of medium to large fishes
 Caused populations of host fishes to crash

Li Lab / Y-W C-D

Lamprey has incomplete genomic sequence

Evidence of somatic recombination; 100s of Mb of sequence eliminated from genome during development.
More recent evidence (unpub., J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific.
Liver genome is not the entire genome.

J. Smith et al., PNAS 2009

Lamprey tissues for which we have
mRNAseq
embryo stages (late blastula, gastrula, neurula, 22b, neural-crest migration, 24c1, 24c2)

metamorphosis 3 (intestine,
kidney)

ovulatory female head skin
preovulatory female eye

adult intestine

kidney)

preovulatory female tail skin

adult kidney

metamorphosis 5 (liver, intestine,
kidney)

brain paired

kidney)

prespermiating male gill

freshwater (gill, intestine, kidney)

kidney)

mature adult male rope tissue

larval (gill, kidney, liver, intestine)

monocytes

juvenile (intestine, liver, kidney)

brain (0,3,21 dpi)

lips

spinal cord (0, 3, 21 dpi)

metamorphosis 2 (liver, intestine, kidney)

spermiating male muscle

spermiating male gill
spermiating male head skin
supraneural tissue
small parasite distal intestine,
kidney, proximal intestine
salt water (gill, intestine)

Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness

…but for lots and lots of fragments!
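The shredded-sentence picture above can be played with directly. What follows is a toy sketch only: it merges the four fragments by greedy suffix/prefix overlap, which is an assumption made for illustration and not how real short-read assemblers work (they typically use graph-based methods and must cope with errors and repeats). The function names `overlap` and `greedy_assemble` are made up for this example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave remaining pieces apart
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


reads = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]
contigs = greedy_assemble(reads)
print(contigs[0])
```

Note that "it was the" repeats throughout the sentence; with shorter fragments the overlaps become ambiguous, which is the same problem repeats cause in real assemblies.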

Shared low-level
transcripts may
not reach the
threshold for
assembly.

Two problems:
 We have a massive amount of data that challenges existing computers, and we want to assemble it all together.
 We need to construct transcript families (to collapse isoforms) without having a solid reference genome.

Solution 1: Digital normalization
(a computational version of library normalization)

Suppose transcript A is 10 times as abundant as transcript B (a dilution factor of 10 to 1). To get 10x coverage of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…

Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
 Is single pass: looks at each read only once;
 Does not “collect” the majority of errors;
 Keeps all low-coverage reads;
 Smooths out coverage of regions.

=> Enables analyses that are otherwise completely
impossible.
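The single-pass idea can be sketched in a few lines. This is a minimal illustration, assuming the usual description of diginorm: estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. The exact in-memory `Counter` used here is a stand-in chosen for clarity; a production implementation would need a memory-efficient counting structure. The function names and the default values of `k` and `cutoff` are illustrative, not taken from the slides.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Single pass: keep a read only while its estimated coverage is low."""
    counts = Counter()   # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue     # read shorter than k; nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy version of the A/B example: A is sampled 100 times, B only 10 times.
high = "ACGTTGCAAG"
low = "TTCCGGAATC"
kept = digital_normalize([high] * 100 + [low] * 10, k=4, cutoff=10)
print(kept.count(high), kept.count(low))
```

In this toy run the abundant read is capped at the cutoff while every copy of the rare read survives, which is the "keeps all low-coverage reads" property listed above.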

Solution 2: Partitioning transcripts
into “transcript families”

Transcript family

Pell et al., 2012, PNAS
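One way to picture a "transcript family" is as a connected component: sequences that share sequence, directly or through intermediates, end up in the same group. The sketch below is an assumption-laden simplification, not the method of the cited paper: it links sequences that share any exact k-mer, uses a plain dictionary and union-find, and ignores reverse complements and sequencing errors. The name `partition_by_shared_kmers` is hypothetical.

```python
def partition_by_shared_kmers(seqs, k=6):
    """Group sequence indices into families connected by shared k-mers."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # k-mer -> index of the first sequence containing it
    for idx, seq in enumerate(seqs):
        for pos in range(len(seq) - k + 1):
            km = seq[pos:pos + k]
            if km in first_owner:
                root_a, root_b = find(first_owner[km]), find(idx)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_owner[km] = idx

    families = {}
    for idx in range(len(seqs)):
        families.setdefault(find(idx), []).append(idx)
    return sorted(families.values())


isoforms = ["AAACCCGGGTTT", "CCCGGGTTTAAA", "GATTACAGATTACA"]
print(partition_by_shared_kmers(isoforms, k=6))
```

The first two toy sequences share the k-mer CCCGGG and fall into one family; the third shares nothing and stays alone.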

Transcriptome results - lamprey
 Started with 5.1 billion reads from 50 different tissues.
(4 years of computational research, and about 1 month of compute time, GO HERE)

Ended with:

Lamprey transcriptome basic stats
 616,000 transcripts (!)
 263,000 transcript families (!)
(This seems like a lot.)
 Only 20,436 transcript families have transcripts > 1 kb
(compare with mouse: 17,331 of 29,769 genes are > 1 kb).
So a rule-of-thumb estimate is not that far off for long transcripts.

Validation
Assume computers lie. How do we judge precision & recall?
1) Homology!
Do we see sequence similarity to e.g. mouse sequences?
2) Orthogonal data sets and analyses
For example, look at sperm genome, or independently cloned CDS.

Evolution: mouse
 58,000 lamprey transcript families have some matches to mouse.
 10,000 putative orthologs (reciprocal best hits)
So that’s a pretty good sign.
(expecting ~30k total genes)
Conclusion:
These numbers “feel” good to me; hard to know what to expect after ~350-500 mya.
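For readers unfamiliar with "reciprocal best hits": A and B are called putative orthologs when B is A's best hit in one direction and A is B's best hit in the other. A minimal sketch follows, assuming search results are already available as (query, subject, e-value, bit score) tuples. The tuple layout, the function names, the toy identifiers, and the default e-value threshold are all assumptions for illustration, not the pipeline actually used for the lamprey numbers.

```python
def best_hits(rows, max_evalue=1e-6):
    """Map each query to its highest-scoring subject under the e-value cutoff."""
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-6):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(a_vs_b, max_evalue)
    b_to_a = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Made-up example rows: (query, subject, evalue, bitscore)
lamprey_vs_mouse = [
    ("lamp1", "mus1", 1e-50, 200.0),
    ("lamp1", "mus2", 1e-20, 90.0),
    ("lamp2", "mus1", 1e-10, 60.0),
    ("lamp3", "mus3", 1e-3, 30.0),   # fails the e-value cutoff
]
mouse_vs_lamprey = [
    ("mus1", "lamp1", 1e-48, 195.0),
    ("mus2", "lamp1", 1e-20, 90.0),
    ("mus3", "lamp3", 1e-3, 30.0),
]
print(reciprocal_best_hits(lamprey_vs_mouse, mouse_vs_lamprey))
```

Here lamp2 has a match to mouse but is not a reciprocal best hit, which is the kind of gap that separates "some matches" from "putative orthologs" in the counts above.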

Orthogonal data set: pm2 (liver genome)
 64% of our new transcript families have a match in pm2.
 71% of conserved transcript families have a match in pm2.
 83% of long transcripts have a match in pm2.
Good: we don’t expect 100%, because we know pm2 is probably missing stuff. So that means:

Conclusion:
At least 64% of transcript families are “really lamprey”
(and > 83% of the long transcripts!)

Orthogonal data set: sperm genome
 94.2% of ref-based transcripts have a match in sperm genome.
 98.2% of full-length cDNAs have a match in sperm genome.
So sperm genome is “pretty good” for cross validation.
But only 71% of our new transcript families have a match in sperm genome. ??

New transcriptome:
 71% of transcript families have a match in sperm genome.
 92% (!!) of long transcript families have a match in sperm genome.
(Since the sperm genome is low coverage, this length dependence makes sense: the longer the …)

Conclusion:
Our is poorer than but comparable

Orthogonal data set: full-length cDNAs
 We can look at both precision and recall by asking
   Are known sequences represented completely by a single transcript? (“best match”)
   Are known sequences covered by one or more transcripts? (“total matches”)
70%
90%

Ref-based (lamp0) “best” matches are better than new assembly (lamp3)

lamp3 “total” is better than lamp0

Conclusions from full-length cDNA
 Ref-based data set has longer “best matches” (better precision; less fragmented)
 De novo assembly is more sensitive overall (better recall; contains more real sequences)
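The difference between "best match" and "total matches" is easy to see with intervals. The sketch below assumes each assembled transcript has already been aligned to a known cDNA and reduced to a half-open (start, end) interval on that cDNA; the function name and the toy coordinates are made up for illustration and are not from the analysis shown here.

```python
def best_and_total_coverage(cdna_length, intervals):
    """Return (best single-transcript fraction, union-of-transcripts fraction).

    intervals: half-open (start, end) alignment spans on the known cDNA.
    """
    if cdna_length <= 0:
        raise ValueError("cdna_length must be positive")

    # "Best match": the longest stretch covered by any one transcript.
    best = max((end - start for start, end in intervals), default=0)

    # "Total matches": bases covered by at least one transcript.
    covered = 0
    covered_up_to = 0
    for start, end in sorted(intervals):
        start = max(start, covered_up_to)   # skip bases already counted
        if end > start:
            covered += end - start
            covered_up_to = end

    return best / cdna_length, covered / cdna_length


# A fragmented assembly: three overlapping pieces of a 1000 bp cDNA.
print(best_and_total_coverage(1000, [(0, 400), (300, 700), (650, 900)]))
```

A fragmented assembly scores low on the first number and high on the second, which is the pattern described for the de novo assembly relative to the ref-based one.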

Mapping percentages
(with orthogonal data)
Ona Bloom generated more data; how much maps?

            BR         SC
Ref-based   29.20%     42.94%
New/all     100.00%    100.00%
New/long    45.99%     46.89%

Conclusion:
Ref-based is considerably less “complete” than
new, de novo transcriptome assembly.

Lamprey transcriptome conclusions
 A substantial portion of the new transcriptome seems “good”:
   58k transcript families with mouse homology, 10k orthologs;
   20k transcript families with transcripts > 1kb.
   Good matches to liver genome & sperm genome.
   Reasonable numbers ~mouse.
   Much (!) better than ref-based for mapping. (2x as good)

But!
 Poor recall of known full-length cDNA !?
 240k partitions with only small sequences !?
=> microbial contamination?

Separate question: how much of the pm2 genome is missing??
 64% of lamp3 transcript families match to pm2.
 82.5% of long transcript families match to pm2.
 71% of lamp3 transcript families conserved with mouse match in pm2.
Conclusion I:
Probably about 30% of genic sequence is missing.

 22.5% of sperm genome contigs have no hits in pm2.
Conclusion II (firmer):
About 30% of single-copy sequence is missing.

CEGMA based completeness estimates
(Core eukaryotic genes)

                           Number   Completeness /   Completeness /
                           seqs     100% matches     partial matches
lamp3 entire               620k     70.6             96.4
lamp3 all ORFs > 80aa      269k     46.4             89
lamp3 longest ORF in tr    80k      41.1             77.8
lamp0                      11k      44.7             62.5

Camille Scott

Looking at the Molgula…

Putnam et al., 2008, Nature.
Modified from Swalla 2001

What do these animals look like?
Molgula oculata

Molgula oculata

Molgula occulta

Ciona intestinalis

Tail loss and notochord genes

a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208 , 15 November 1996

Diginorm applied to Molgula embryonic mRNAseq

                     No. of reads    Reads kept    Percentage kept
M. occulta F+3       42,174,510      15,642,268    ?
M. occulta F+3       50,018,302      6,012,894     ?
M. occulta F+4       44,948,983      3,499,935     ?
M. occulta F+5       53,692,296      2,993,715     ?
M. occulta F+6       45,782,981      2,774,342     ?
M. occulta Total     236,617,072     30,923,154    13%
M. oculata F+3       47,045,433      10,754,899    ?
M. oculata F+4       52,890,938      3,949,489     ?
M. oculata F+6       50,156,895      2,874,196     ?
M. oculata Total     150,093,266     17,578,584    11.70%
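The per-sample percentages are shown as "?" on the slide, but they follow directly from the read counts. A small sketch of the arithmetic, using only the numbers in the table above (sample labels are copied as they appear, including the repeated F+3 label; the helper names are made up):

```python
# (stage label, number of reads, reads kept), copied from the table.
occulta = [
    ("F+3", 42_174_510, 15_642_268),
    ("F+3", 50_018_302, 6_012_894),
    ("F+4", 44_948_983, 3_499_935),
    ("F+5", 53_692_296, 2_993_715),
    ("F+6", 45_782_981, 2_774_342),
]
oculata = [
    ("F+3", 47_045_433, 10_754_899),
    ("F+4", 52_890_938, 3_949_489),
    ("F+6", 50_156_895, 2_874_196),
]


def percent_kept(total, kept):
    """Percentage of reads retained after normalization."""
    return 100.0 * kept / total


def totals(rows):
    """Sum the read counts and kept counts over all samples."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


for species, rows in (("M. occulta", occulta), ("M. oculata", oculata)):
    for stage, total, kept in rows:
        print(f"{species} {stage}: {percent_kept(total, kept):.2f}% kept")
    total, kept = totals(rows)
    print(f"{species} total: {percent_kept(total, kept):.2f}% kept")
```

The fraction kept drops with each additional sample of a species, as one would expect if later samples mostly re-cover transcripts already seen.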

Question: does normalization “lose”
transcript information?
[Figure: Venn diagrams comparing Diginorm and Raw assemblies against C. intestinalis]
M. occulta: 37 / 13623 / 17; missing 2446
M. oculata: 64 / 13646 / 15; missing 2398

Reciprocal best hit vs. Ciona
Blast e-value cutoff: 1e-6
Elijah Lowe

Transcriptome assembly thoughts
 We can (now) assemble really big data sets, and get pretty good results.
 We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.

Practical implications of diginorm
 Data is (essentially) free;
 For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
 …plus, we can run most of our approaches in the cloud.

1. khmer-protocols
 Effort to provide standard “cheap” assembly protocols for the cloud.
 Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
 Open, versioned, forkable, citable.

Pipeline: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression

CC0; BSD; on github; in reStructuredText.

2. Data availability is important for
annotating distant sequences
[Figure legend: no similarity; Anything else; Mollusc; Cephalopod]

Can we incentivize data sharing?
 ~$100-$150/transcriptome in the cloud
 Offer to analyze people’s existing data for free, IFF they open it up within a year.
See:
• CephSeq white paper.
• “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post;

First results: Loligo
genomic/transcriptome resources
Putting other people’s sequences where my mouth is:

Tools to routinely update metazoan orthology/homology relationships
 > 100 mRNAseq data sets already;
 Build interconnections between them via homology;
 Build tools to update interconnections as new data sets arrive.
 Provide raw data, processed data, underlying tools, simple Web interface, all CC0/in da cloud/open/reproducible.
(Question: what biology problems could we tackle?)

“Research singularity”
The data a researcher generates in their lab
constitutes an increasingly small component of
the data used to reach a conclusion.
Corollary: The true value of the data an individual
investigator generates should be considered in the
context of aggregate data.
Even if we overcome the social barriers and
incentivize sharing, we are, needless to say, not
remotely prepared for sharing all the data.

We practice open science!
Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
 Twitter: @ctitusbrown
 Grants on Lab Web site: http://ged.msu.edu/research.html
 Preprints: on arXiv, q-bio: ‘diginorm arxiv’

Acknowledgements
Lab members involved
Adina Howe (w/Tiedje)
Jason Pell
Arend Hintze
Qingpeng Zhang
Elijah Lowe
Likit Preeyanon
Jiarong Guo
Tim Brom
Kanchan Pavangadkar
Eric McDonald
Camille Scott
Jordan Fish
Michael Crusoe
Leigh Sheneman

Collaborators
 Josh Rosenthal (UPR)
 Weiming Li, MSU
 Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM)

Funding
USDA NIFA; NSF IOS; NIH; BEACON.
