The document discusses challenges in analyzing non-model organism transcriptomics data due to lack of reference genomes and presents two solutions: digital normalization to reduce massive amounts of sequencing data while retaining important information, and partitioning transcripts into "transcript families" to collapse isoforms without a reference genome. It then provides examples of applying these approaches to lamprey and tunicate transcriptome data.
2014 whitney-research
1. Like the dog that caught the bus:
now what?
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
MMG, CSE, BEACON
Michigan State University
Feb 2014
ctb@msu.edu
2. The challenges of non-model
transcriptomics
Missing or low quality genome reference.
Evolutionarily distant.
Most extant computational tools focus on model
organisms –
Assume low polymorphism (internal variation)
Assume reference genome
Assume somewhat reliable functional annotation
More significant compute infrastructure
…and cannot easily or directly be used on critters of
interest.
9. In sum,
mRNAseq is pretty easy to deal with if you have a
good genomic sequence.
We don't have a good genomic sequence for
many organisms, including lamprey.
We need to do de novo assembly to construct a
transcriptome from short reads.
We also have lots and lots of mRNAseq
sequence:
10. The problem of lamprey…
Diverged at base of vertebrates; evolutionarily
distant from model organisms.
Large, complicated genome (~2 Gb)
Relatively little existing sequence.
We sequenced the liver genome…
11. Sea lamprey in the Great Lakes
Non-native
Parasite of
medium to large
fishes
Caused
populations of
host fishes to
crash
Li Lab / Y-W C-D
13. Lamprey has incomplete genomic sequence
Evidence of somatic recombination;
100s of Mb of sequence eliminated
from genome during development.
More recent evidence (unpub, J.
Smith et al.) suggests that this loss
is developmentally
regulated, results in changes in
gene expression (due to loss of
genes!), and is tissue specific.
Liver genome is not the entire
genome.
J. Smith et al., PNAS 2009
14. Lamprey tissues for which we have
mRNAseq
embryo stages (late
blastula, gastrula, neurula, 22b,
neural-crest migration, 24c1, 24c2)
metamorphosis 3 (intestine,
kidney)
ovulatory female head skin
preovulatory female eye
adult intestine
metamorphosis 4 (intestine,
kidney)
preovulatory female tail skin
adult kidney
metamorphosis 5 (liver, intestine,
kidney)
brain paired
metamorphosis 6 (intestine,
kidney)
prespermiating male gill
freshwater (gill, intestine, kidney)
metamorphosis 7 (intestine,
kidney)
mature adult male rope tissue
larval (gill, kidney, liver, intestine)
monocytes
juvenile (intestine, liver, kidney)
brain (0,3,21 dpi)
lips
spinal cord (0,3,21 dpi)
metamorphosis 1 (intestine,
kidney)
metamorphosis 2 (liver, intestine,
spermiating male muscle
spermiating male gill
spermiating male head skin
supraneural tissue
small parasite distal intestine,
kidney, proximal intestine
salt water (gill, intestine)
15. Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
17. Two problems:
We have a massive amount of data that
challenges existing computers, and we want to
assemble it all together.
We need to construct transcript families (to
collapse isoforms) without having a solid
reference genome.
18. Solution 1: Digital normalization
(a computational version of library normalization)
Suppose you have a
dilution factor of A (10) to
B(1). To get 10x of B you
need to get 100x of A!
Overkill!!
This 100x will consume
disk space and, because
of errors, memory.
We can discard it for
you…
25. Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
Is single pass: looks at each read only once;
Does not “collect” the majority of errors;
Keeps all low-coverage reads;
Smooths out coverage of regions.
=> Enables analyses that are otherwise completely
impossible.
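A minimal sketch of the single-pass idea above, assuming the median k-mer count of a read is used as its coverage estimate. It uses an exact in-memory counter rather than the memory-efficient counting structure in khmer, and the sequences, `k`, and `cutoff` values are made up for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below cutoff.

    Each read is examined exactly once. Counts are only updated for
    reads that are kept, so low-coverage reads are always retained.
    """
    counts = Counter()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            yield read


# Hypothetical transcripts: A sampled at 100x, B at 10x.
A = "ACGTTGCAAGGCTTAA"
B = "TTGACCGGTATCGAGC"
reads = [A] * 100 + [B] * 10

kept = list(diginorm(reads, k=4, cutoff=10))
print(len(kept), kept.count(A), kept.count(B))
```

In this toy case the excess copies of the abundant transcript are discarded once its coverage reaches the cutoff, while every read from the rare transcript is kept.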
26. Solution 2: Partitioning transcripts
into “transcript families”
Transcript family
Pell et al., 2012, PNAS
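One way to picture partitioning, as a rough sketch: treat two transcripts as connected if they share at least one k-mer, then take connected components as families. This is an assumption-laden simplification of graph partitioning, not the method from the cited paper; it ignores reverse complements and sequencing errors, and the function name and sequences are hypothetical.

```python
def partition_transcripts(transcripts, k=5):
    """Group transcripts that share any k-mer, using union-find."""
    parent = {name: name for name in transcripts}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}  # k-mer -> first transcript in which it was seen
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if word in owner:
                parent[find(name)] = find(owner[word])
            else:
                owner[word] = name

    families = {}
    for name in transcripts:
        families.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in families.values())


# Hypothetical data: two isoforms sharing an exon, plus one unrelated transcript.
transcripts = {
    "iso1": "ACGTACGGTTCA",
    "iso2": "ACGGTTCATTGG",
    "other": "GGGCCCTTTAAA",
}
families = partition_transcripts(transcripts, k=5)
print(families)
```

Isoforms that share sequence end up in one family without any reference genome being consulted.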
27. Transcriptome results - lamprey
Started with 5.1 billion reads from 50 different
tissues.
(4 years of computational research, and about 1
month of compute time, GO HERE)
Ended with:
29. Lamprey transcriptome basic
stats
616,000 transcripts
263,000 transcript families
Only 20436 transcript families have transcripts >
1kb
(compare with mouse: 17331 of 29769 genes
are > 1kb)
So a rule-of-thumb estimate is not far off, for long
transcripts.
30. Validation - Assume computers lie. How do we judge precision
& recall?
1) Homology!
Do we see sequence similarity to e.g. mouse
sequences?
2) Orthogonal data sets and analyses
For example, look at sperm genome, or
independently cloned CDS.
31. Evolution: mouse
58,000 lamprey transcript families have some
matches to mouse.
10,000 putative orthologs (reciprocal best hits)
So that's a pretty good sign.
(expecting ~30k total genes)
Conclusion:
These numbers “feel” good to me; hard to know
what to expect after ~350-500 mya.
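The "reciprocal best hits" criterion can be sketched as follows, assuming similarity scores have already been computed in both directions (for example from BLAST output). The score tables here are invented purely for illustration; only the logic is the point.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Made-up scores: transcript families searched against mouse genes, and back.
family_vs_mouse = {
    ("fam1", "Sox10"): 250.0,
    ("fam1", "Sox9"): 180.0,
    ("fam2", "Sox9"): 90.0,
    ("fam3", "Pax6"): 300.0,
}
mouse_vs_family = {
    ("Sox10", "fam1"): 245.0,
    ("Sox9", "fam1"): 170.0,
    ("Sox9", "fam2"): 95.0,
    ("Pax6", "fam3"): 310.0,
}

orthologs = reciprocal_best_hits(family_vs_mouse, mouse_vs_family)
print(orthologs)
```

Here `fam2` has a match to mouse but is not a putative ortholog, because its best mouse hit prefers a different family. That is the same distinction the slide draws between "some matches" and reciprocal best hits.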
32. Orthogonal data set: pm2 (liver
genome)
64% of our new transcript families have a match in
pm2.
71% of conserved transcript families have a
match in pm2.
83% of long transcripts have a match in pm2.
Good – we don't expect 100%, because we know pm2
is probably missing stuff. So that means:
Conclusion:
At least 64% of transcript families are “really lamprey”
(and > 83% of the long transcripts!)
33. Orthogonal data set: sperm genome
94.2% of ref-based transcripts have a match in
sperm genome.
98.2% of full-length cDNAs have a match in
sperm genome.
So sperm genome is “pretty good” for cross
validation.
But only
71% of our new transcript families have a match
in sperm genome. ??
34. Orthogonal data set: sperm genome
94.2% of ref-based transcripts have a match in sperm
genome.
98.2% of full-length cDNAs have a match in sperm
genome.
New transcriptome:
71% of transcript families have a match in sperm
genome.
92% (!!) of long transcript families have a match in
sperm genome.
(Since the sperm genome is low coverage, this length
dependence makes sense – the longer the
35. Orthogonal data set: sperm genome
94.2% of ref-based transcripts have a match in sperm
genome.
98.2% of full-length cDNA have a match in sperm
genome.
New transcriptome:
71% of new transcriptome families have a match in
sperm genome.
92% (!!) of long transcript families have a match in
sperm genome.
Conclusion:
Our is poorer than but comparable
36. Orthogonal data set: full-length
cDNAs
We can look at both precision and recall by
asking
Are known sequences represented completely by a
single transcript? (“best match”)
Are known sequences covered by one or more
transcripts? (“total matches”)
70%
90%
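The two questions above can be made concrete with a small sketch. It assumes each alignment of a transcript to a known cDNA is given as a half-open interval `(start, end)` on the cDNA; the function names and the example intervals are hypothetical.

```python
def best_match_fraction(length, hits):
    """Fraction of the known sequence covered by its single longest hit."""
    return max((end - start for start, end in hits), default=0) / length


def total_match_fraction(length, hits):
    """Fraction of the known sequence covered by the union of all hits."""
    covered = 0
    reach = 0  # rightmost position already counted
    for start, end in sorted(hits):
        start = max(start, reach)  # skip any part that overlaps earlier hits
        if end > start:
            covered += end - start
            reach = end
    return covered / length


# Hypothetical 1000 bp cDNA hit by three assembled transcripts.
hits = [(0, 400), (300, 700), (800, 1000)]
best = best_match_fraction(1000, hits)
total = total_match_fraction(1000, hits)
print(best, total)
```

A fragmented assembly scores low on the first measure but can still score high on the second, which is why the slide treats them separately.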
41. Conclusions from full-length
cDNA
Ref-based data set has longer “best matches”
(better precision; less fragmented)
De novo assembly is more sensitive overall
(better recall; contains more real sequences)
42. Mapping percentages
(with orthogonal data)
Ona Bloom generated more data; how much
maps?
        Ref-based   New/all    New/long
BR      29.20%      100.00%    45.99%
SC      42.94%      100.00%    46.89%
Conclusion:
Ref-based is considerably less “complete” than
new, de novo transcriptome assembly.
43. Lamprey transcriptome conclusions
A substantial portion of the new transcriptome seems
“good”:
58k transcript families with mouse homology, 10k
orthologs;
20k transcript families with transcripts > 1kb.
Good matches to liver genome & sperm genome.
Reasonable numbers ~mouse.
Much (!) better than ref-based for mapping. (2x as good)
But!
Poor recall of known full-length cDNA !?
240k partitions with only small sequences !?
=> microbial contamination?
44. Separate question: how much of the
pm2 genome is missing??
64% of lamp3 transcript families match to pm2.
82.5% of long transcript families match to pm2.
71% of lamp3 transcript families conserved with
mouse match in pm2.
Conclusion I:
Probably about 30% of genic sequence is missing.
45. Separate question: how much of the
pm2 genome is missing??
64% of lamp3 transcript families match to pm2.
82.5% of long transcript families match to pm2.
71% of lamp3 transcript families conserved with
mouse match in pm2.
22.5% of sperm genome contigs have no hits in
pm2.
Conclusion II (firmer):
About 30% of single-copy sequence is missing.
46. CEGMA based completeness estimates
(Core eukaryotic genes)
                          Number seqs   Completeness /    Completeness /
                                        100% matches      partial matches
lamp3 entire              620k          70.6              96.4
lamp3 all ORFs > 80aa     269k          46.4              89
lamp3 longest ORF in tr   80k           41.1              77.8
lamp0                     11k           44.7              62.5
Camille Scott
47. Looking at the Molgula…
Putnam et al., 2008, Nature.
Modified from Swalla 2001
48. What do these animals look like?
Molgula oculata
Molgula oculata
Molgula occulta
Ciona intestinalis
49. Tail loss and notochord genes
a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208 , 15 November 1996
51. Question: does normalization “lose”
transcript information?
[Venn diagrams: Diginorm vs. Raw assemblies, compared against C. intestinalis]
M. occulta: 13623 in both; 37 and 17 in only one assembly; missing 2446
M. oculata: 13646 in both; 64 and 15 in only one assembly; missing 2398
Reciprocal best hit vs. Ciona
Blast e-value cutoff: 1e-6
Elijah Lowe
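A quick arithmetic check on the numbers from this slide. It assumes the largest number in each diagram is the count found in both the Diginorm and Raw assemblies and that the two small numbers are found in only one of them; the extracted slide text does not say which small number belongs to which assembly. Both groups sum to the same total, which is consistent with that reading.

```python
def summarize(shared, only_one, only_other, missing):
    """Return (total, fraction shared, fraction found in only one assembly)."""
    total = shared + only_one + only_other + missing
    return total, shared / total, (only_one + only_other) / total


# Counts as they appear on the slide; the grouping is an assumption.
occulta = summarize(shared=13623, only_one=37, only_other=17, missing=2446)
oculata = summarize(shared=13646, only_one=64, only_other=15, missing=2398)

for name, (total, both, differing) in [("M. occulta", occulta),
                                       ("M. oculata", oculata)]:
    print(f"{name}: total={total} both={both:.1%} differing={differing:.2%}")
```

Under that assumption, well under 1% of the total differs between the normalized and raw assemblies for either species.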
52. Transcriptome assembly
thoughts
We can (now) assemble really big data sets, and
get pretty good results.
We have lots of evidence (some presented here :)
that some assemblies are not strongly affected by
digital normalization.
53. Practical implications of diginorm
Data is (essentially) free;
For some problems, analysis is now cheaper
than data gathering (i.e. essentially free);
…plus, we can run most of our approaches in
the cloud.
54. 1. khmer-protocols
Effort to provide standard “cheap”
assembly protocols for the cloud.
Entirely copy/paste; ~2-6 days from
raw reads to assembly, annotations, and
differential expression analysis.
~$150 on Amazon per data set.
Open, versioned, forkable, citable.
Pipeline: Read cleaning -> Diginorm -> Assembly -> Annotation -> RSEM differential expression
56. 2. Data availability is important for
annotating distant sequences
[Figure legend: no similarity; anything else; mollusc; cephalopod]
57. Can we incentivize data sharing?
~$100-$150/transcriptome in the cloud
Offer to analyze people's existing data for
free, IFF they open it up within a year.
See:
• CephSeq white paper.
• “Dead Sea Scrolls & Open Marine Transcriptome
Project” blog post;
59. Tools to routinely update metazoan
orthology/homology relationships
> 100 mRNAseq data sets already;
Build interconnections between them via homology;
Build tools to update interconnections as new data
sets arrive.
Provide raw data, processed data, underlying
tools, simple Web interface, all CC0/in da
cloud/open/reproducible.
(Question: what biology problems could we tackle?)
60. “Research singularity”
The data a researcher generates in their lab
constitutes an increasingly small component of
the data used to reach a conclusion.
Corollary: The true value of the data an individual
investigator generates should be considered in the
context of aggregate data.
Even if we overcome the social barriers and
incentivize sharing, we are, needless to say, not
remotely prepared for sharing all the data.
62. We practice open science!
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/research.html
Preprints: on arXiv, q-bio:
'diginorm arxiv'
63. Acknowledgements
Lab members involved
Adina Howe (w/Tiedje)
Jason Pell
Arend Hintze
Qingpeng Zhang
Elijah Lowe
Likit Preeyanon
Jiarong Guo
Tim Brom
Kanchan Pavangadkar
Eric McDonald
Camille Scott
Jordan Fish
Michael Crusoe
Leigh Sheneman
Collaborators
Josh Rosenthal (UPR)
Weiming Li, MSU
Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM)
Funding
USDA NIFA; NSF IOS;
NIH; BEACON.
Editor's Notes
Transcripts are then mapped back to the chicken genome. Because the transcripts are mature mRNA, only exons will map to the genome. The solid boxes represent exons. As shown in this figure, different isoforms are detected using different parameter settings. The explanation for this phenomenon is unknown. The goal of this step is to define all exons in each gene.
Larvae/stream bottoms 3-6 years; parasitic adult -> great lakes, 12-20 months feeding. 5-8 years. 40 lbs of fish per life as parasite. 98% of fish in great lakes went away!
Marine invertebrates, chordata phylum, notochord, hollow dorsal nerve cord, pharyngeal slits and a post anal tail at some point in life. colonial, filter feeders
Notochord cells present, do not intercalate or extend