Mar Gonzales Porta, One gene One transcript, fged_seattle_2013
1. One gene – one transcript… mostly
Mar Gonzàlez-Porta, Adam Frankish, Johan Rung,
Jennifer Harrow, Alvis Brazma
European Bioinformatics Institute
European Molecular Biology Laboratory
Wellcome Trust Sanger Institute
2. Analysis of RNA-seq data across different tissues
and cell lines reveals that the majority of genes have
a single dominant transcript
Mar Gonzàlez-Porta
Adam Frankish (Sanger)
Johan Rung (EBI)
Jennifer Harrow (Sanger)
Alvis Brazma (EBI)
To appear in Genome Biology on July 1, 2013
http://genomebiology.com/
4. The number of genes in the human genome
• Estimates before the Human Genome Project
• ~100,000 genes
• After
• ~21,000 genes
• By comparison
• Yeast ~6,000 genes
• C. elegans ~17,000 genes
5. Central dogma revised – one gene, many
transcripts and proteins via alternative
splicing
In human, there are 21,405 protein coding genes and
141,031 different isoforms (92,581 of which are protein
coding) annotated, 17,413 genes have >1 isoform
6. From RNA-seq and other recent experiments
• Most human genes have more than one splice-form
expressed [Pan et al, Nat Genetics 2008], [Wang et al, Nat
Genetics 2008] [Mortazavi, Nat Methods 2008]
• Several isoforms per gene are often expressed to significant
levels either in the same cell type or across different [Wang
et al, Nat Genetics 2008], [Tang et al, Nat Methods 2009],
[Trapnell et al, Nat Biotechnol 2010]
• Isoform expression is regulated [Waks et al, Mol Syst Biol
2011] but splicing can be noisy [Melamud, NAR 2009]
7. ENCODE – Nature, September 2012
• ‘Isoform expression by a gene does not follow a
minimalistic expression strategy, resulting in a tendency
for genes to express many isoforms simultaneously’
• ‘[…] alternative isoforms within a gene are not expressed
at similar levels, and one isoform dominates in a given
condition’
• Which is it then?
8. A fundamental question still remains – are
different isoforms of the same gene
expressed at similar levels?
[Figure: two panels plotting expression level (y-axis) against isoforms 1 to 6 (x-axis), separated by a question mark]
We think this is a fundamental question related to the complexity
of the transcriptome
9. Data and methods
• Illumina Body Map: 16 tissues, PE (80M), no replicates, HiSeq 2000
• ENCODE: 5 cell lines, PE (40M), technical replicates, GAII
• 2 different datasets: 46 samples
• Three different state-of-the-art tools for transcript quantification
(MISO, Cufflinks and mmseq)
• Direct evidence from splice junctions
• Simulated data – can the methods distinguish between the two
scenarios?
• All approaches produced very consistent results
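One way to put a number on "consistent results" is to ask how often the tools name the same major transcript for a gene. The sketch below is only an illustration of that idea, not the comparison the authors ran. The input layout (tool name mapped to per-gene major transcript IDs) and the function name `major_call_agreement` are assumptions made for this example.

```python
def major_call_agreement(calls):
    """Fraction of genes on which every tool names the same major transcript.

    calls: dict mapping tool name -> dict mapping gene ID -> major transcript ID
    (hypothetical layout). Only genes reported by all tools are compared.
    """
    tools = list(calls)
    if not tools:
        return 0.0
    # Restrict the comparison to genes every tool has a call for.
    shared = set(calls[tools[0]])
    for tool in tools[1:]:
        shared &= set(calls[tool])
    if not shared:
        return 0.0
    agreeing = sum(
        1 for gene in shared
        if len({calls[tool][gene] for tool in tools}) == 1
    )
    return agreeing / len(shared)


# Toy calls, invented for illustration only.
toy_calls = {
    "MISO": {"G1": "a", "G2": "b", "G3": "c"},
    "Cufflinks": {"G1": "a", "G2": "b", "G3": "d"},
    "mmseq": {"G1": "a", "G2": "x"},
}
agreement = major_call_agreement(toy_calls)
```

In the toy data only G1 and G2 are called by all three tools, and the tools agree on G1 alone.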
10. Most genes express one predominant
transcript
• 2-fold dominance: 79% of cases
• 5-fold dominance: 56% of cases
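The slide does not spell out how dominance is computed. A minimal sketch, assuming that fold dominance is the ratio between the most abundant transcript of a gene and its second most abundant transcript within one sample, could look like the following. The input layout and the function names are hypothetical.

```python
from collections import defaultdict


def dominance_ratios(fpkm):
    """Return gene -> (major transcript, fold dominance) for one sample.

    fpkm: dict mapping (gene ID, transcript ID) -> expression value.
    Fold dominance is assumed to be top / second-highest expression;
    it is infinite when no second transcript is expressed.
    """
    by_gene = defaultdict(list)
    for (gene, transcript), value in fpkm.items():
        by_gene[gene].append((value, transcript))

    result = {}
    for gene, entries in by_gene.items():
        entries.sort(reverse=True)  # highest expression first
        top_value, top_transcript = entries[0]
        if top_value <= 0:
            continue  # gene not expressed in this sample
        second = entries[1][0] if len(entries) > 1 else 0.0
        fold = top_value / second if second > 0 else float("inf")
        result[gene] = (top_transcript, fold)
    return result


def fraction_dominant(ratios, threshold):
    """Share of expressed genes whose fold dominance reaches the threshold."""
    if not ratios:
        return 0.0
    hits = sum(1 for _, fold in ratios.values() if fold >= threshold)
    return hits / len(ratios)


# Toy expression values, invented for illustration only.
toy = {
    ("G1", "a"): 10.0, ("G1", "b"): 1.0,
    ("G2", "a"): 6.0, ("G2", "b"): 2.0,
    ("G3", "a"): 3.0, ("G3", "b"): 2.5,
}
ratios = dominance_ratios(toy)
two_fold = fraction_dominant(ratios, 2.0)
five_fold = fraction_dominant(ratios, 5.0)
```

With real data, `fraction_dominant` at thresholds 2 and 5 would give the kind of percentages shown on the slide, under the stated assumption about how the ratio is defined.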
14. Most genes express one predominant
transcript
We detect a total of 31,902 transcripts expressed above 1 FPKM in at least one
tissue and 26,641 of these are major transcripts (ratio 1.12)
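Counts of this kind can be reproduced in outline as follows. This is a sketch under assumptions: a transcript is "expressed" if it exceeds the threshold in at least one tissue, and a transcript is "major" if it is the most abundant expressed transcript of its gene in at least one tissue. The data layout and the name `count_expressed_and_major` are hypothetical.

```python
def count_expressed_and_major(samples, min_fpkm=1.0):
    """Count expressed transcripts and major transcripts across tissues.

    samples: dict mapping tissue name -> dict mapping
             (gene ID, transcript ID) -> FPKM (hypothetical layout).
    Returns (number of transcripts above min_fpkm in at least one tissue,
             number of transcripts that are the top transcript of their
             gene in at least one tissue).
    """
    expressed = set()
    major = set()
    for fpkm in samples.values():
        best = {}  # gene -> (value, transcript) within this tissue
        for (gene, transcript), value in fpkm.items():
            if value <= min_fpkm:
                continue  # below the expression threshold
            expressed.add((gene, transcript))
            if gene not in best or value > best[gene][0]:
                best[gene] = (value, transcript)
        for gene, (_, transcript) in best.items():
            major.add((gene, transcript))
    return len(expressed), len(major)


# Toy values, invented for illustration only.
toy_samples = {
    "t1": {("G1", "a"): 5.0, ("G1", "b"): 2.0, ("G2", "a"): 0.5,
           ("G4", "a"): 10.0, ("G4", "b"): 2.0},
    "t2": {("G1", "a"): 1.5, ("G1", "b"): 4.0, ("G2", "a"): 0.2,
           ("G4", "a"): 8.0, ("G4", "b"): 3.0},
}
n_expressed, n_major = count_expressed_and_major(toy_samples)
```

In the toy data G1 switches its major transcript between tissues, so both of its transcripts count as major, while G2 never passes the threshold.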
16. There is just over one highly expressed
transcript per gene!
85% of the transcriptome comes from the
dominant transcripts
Is it the same dominant transcript in all
tissues, or do different tissues tend to have
different dominant transcripts?
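Both points on this slide can be expressed as simple computations. The sketch below treats summed FPKM as a rough proxy for the mRNA pool, which is an assumption of this example rather than a statement of the authors' method, and flags genes whose major transcript differs between tissues. The data layout and all function names are hypothetical.

```python
def major_by_gene(fpkm):
    """Return gene -> (major transcript, its expression) for one sample."""
    best = {}
    for (gene, transcript), value in fpkm.items():
        if value > 0 and (gene not in best or value > best[gene][1]):
            best[gene] = (transcript, value)
    return best


def major_pool_fraction(fpkm):
    """Share of total expression contributed by the major transcripts."""
    total = sum(fpkm.values())
    if total == 0:
        return 0.0
    return sum(value for _, value in major_by_gene(fpkm).values()) / total


def genes_with_switches(samples):
    """Genes whose major transcript is not the same in every tissue."""
    seen = {}  # gene -> set of transcripts that were major somewhere
    for fpkm in samples.values():
        for gene, (transcript, _) in major_by_gene(fpkm).items():
            seen.setdefault(gene, set()).add(transcript)
    return {gene for gene, transcripts in seen.items() if len(transcripts) > 1}


# Toy values, invented for illustration only.
toy_tissues = {
    "t1": {("G1", "a"): 5.0, ("G1", "b"): 2.0, ("G2", "a"): 0.5,
           ("G4", "a"): 10.0, ("G4", "b"): 2.0},
    "t2": {("G1", "a"): 1.5, ("G1", "b"): 4.0, ("G2", "a"): 0.2,
           ("G4", "a"): 8.0, ("G4", "b"): 3.0},
}
pool_t1 = major_pool_fraction(toy_tissues["t1"])
switching = genes_with_switches(toy_tissues)
```

A gene appearing in `switching` only indicates a change in the top-ranked transcript; whether the switch changes the encoded protein would need annotation that this sketch does not model.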
20. Some of the major transcripts are non-canonical
AES
21. Do the dominant transcripts of protein coding
genes always code for proteins?
• Only in about 80% of cases
• However, much more often in the cytosol than in the nucleolus
• In the nucleolus, the retained intron is predominantly located
towards the 3’ end of the transcript
• Is this because we extract mRNA before the splicing is
completed or can a retained intron be a form of expression
regulation?
22. Conclusions
• Most genes express one predominant transcript over the
rest
• ~85% of the mRNA pool comes from major transcripts
• Major transcripts tend to be recurrent across samples,
switch events exist but only a small number of these are
likely to express different proteins
• Despite the complexity of the transcriptome, the central dogma
of molecular biology may be closer to the truth than
recently believed!