Single-Cell Transcriptome Analysis of Pluripotent Stem Cells
1. Single-Cell Transcriptome Analysis of Pluripotent Stem Cells
Nacho Caballero
Center for Regenerative Medicine
Boston University
Jun 12, 2017
From raw data to insights
13. Demultiplex
Barcoded sequencing files → one pair of sequencing files per cell
@NB500996:64:HNM72BGX2:3:12510:12240:9366 2:N:0:T
CTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTG
+
AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE<
…
Millions of reads
Metadata file (short, simple names)
Cell_id  Condition1  Condition2
Cell_01  BU3         red
Cell_02  BU3         green
Cell_03  C17         red
Cell_04  C17         green
Cell_05  BU3         red
Cell_06  BU3         green
…
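A metadata file of this shape can be loaded into a per-cell lookup with a few lines of Python. This is a minimal sketch, assuming the file is whitespace-delimited with a header row; the function name `read_metadata` and the in-memory example are illustrative and not part of the original pipeline.

```python
import io


def read_metadata(handle):
    """Parse a whitespace-delimited metadata table into {cell_id: {column: value}}."""
    header = handle.readline().split()
    condition_columns = header[1:]  # first column is assumed to hold the cell id
    metadata = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(f"Expected {len(header)} fields, got {len(fields)}: {line!r}")
        metadata[fields[0]] = dict(zip(condition_columns, fields[1:]))
    return metadata


# The six example rows shown on the slide.
example = io.StringIO(
    "Cell_id Condition1 Condition2\n"
    "Cell_01 BU3 red\n"
    "Cell_02 BU3 green\n"
    "Cell_03 C17 red\n"
    "Cell_04 C17 green\n"
    "Cell_05 BU3 red\n"
    "Cell_06 BU3 green\n"
)
cells = read_metadata(example)
print(cells["Cell_03"])
```

Short names without spaces keep this kind of parsing trivial, which is one practical reason to prefer them.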
14. Analysis pipeline
Raw data → Initial QC → Alignment and Quantification → Outlier analysis → Gene selection and clustering → Insights
19. Read length is often inversely correlated with base-pair sequencing quality
[Figure: three panels (good cDNA quality, average quality, bad quality); axes: position in read, average sequence quality]
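The per-position quality plotted here comes from the fourth line of each FASTQ record. The sketch below decodes such a line and averages the scores by read position. It assumes the Phred+33 encoding; the function names are illustrative, and the quality string is the example read shown earlier in the deck.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality_by_position(quality_lines):
    """Average Phred score at each read position; reads may differ in length."""
    sums, counts = [], []
    for line in quality_lines:
        for position, score in enumerate(phred_scores(line)):
            if position == len(sums):
                sums.append(0)
                counts.append(0)
            sums[position] += score
            counts[position] += 1
    return [total / n for total, n in zip(sums, counts)]


quality = "AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE<"
scores = phred_scores(quality)
print(scores[:10])
print(mean_quality_by_position([quality, quality[:20]])[:5])
```

In practice a dedicated QC tool produces these plots for every file; the point of the sketch is only to show what the y-axis measures.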
28. We quantify the gene expression in a cell by counting how many reads align to each gene
Example read: AGGCAGAGGGGCGAGATGCA…
1358 reads aligned to the SFTPC gene in this cell
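Counting reads per gene can be sketched as a simple tally. This example assumes that alignment has already produced one gene assignment per read, with `None` for reads that could not be assigned to a single gene. The function name, the toy input and the gene name `GENE_X` are hypothetical.

```python
from collections import Counter


def count_reads_per_gene(assigned_genes):
    """Tally reads per gene, ignoring reads with no single-gene assignment (None)."""
    return Counter(gene for gene in assigned_genes if gene is not None)


# Hypothetical toy input: the gene assigned to each read of one cell.
read_assignments = ["SFTPC", "GENE_X", "SFTPC", None, "SFTPC", "GENE_X"]
counts = count_reads_per_gene(read_assignments)
print(counts.most_common())
```

Repeating this for every cell gives the gene-by-cell count matrix that the later steps work on.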
35. 40-60% of the raw reads cannot be used to quantify gene expression
Read type                                  Number of reads per cell
Raw                                        333,229
Unaligned                                  81,673
Aligned, but non-uniquely                  28,813
Aligned uniquely, but not to a gene        32,774
Aligned uniquely, but span multiple genes  20,838
Aligned uniquely to a single gene          167,241
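The 40-60% figure can be checked against the table with a short calculation. This sketch treats only reads aligned uniquely to a single gene as usable, which is how the table is read here; the variable names are illustrative.

```python
# Reads per cell, copied from the table.
raw_reads = 333_229
discarded = {
    "Unaligned": 81_673,
    "Aligned, but non-uniquely": 28_813,
    "Aligned uniquely, but not to a gene": 32_774,
    "Aligned uniquely, but span multiple genes": 20_838,
}
usable_reads = 167_241  # aligned uniquely to a single gene

usable_fraction = usable_reads / raw_reads
unusable_fraction = 1 - usable_fraction
print(f"Usable: {usable_fraction:.1%}, unusable: {unusable_fraction:.1%}")

# Share of the raw reads lost at each step.
for read_type, n in discarded.items():
    print(f"{read_type}: {n / raw_reads:.1%}")
```

For this cell, roughly half of the raw reads end up contributing to the gene counts.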
37. Filter out cells with fewer than 5K aligned reads
[Figure: number of aligned reads per cell (axis ticks: 0, 1K, 10K, 1M); 120 cells]
38. Filter out cells with a high percentage of mitochondrial gene counts (indicative of a broken cell membrane)
[Figure: % of mitochondrial gene counts per cell (axis ticks: 0, 25%, 50%, 75%, 100%); 48 cells]
39. Filter out cells with fewer than 2K expressed genes
[Figure: number of expressed genes per cell (axis ticks: 0, 2K, 4K, 6K); 30 cells]
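The three outlier filters can be combined into a single check per cell. This is a minimal sketch: the 5K aligned-read and 2K expressed-gene cutoffs come from the slides, while the 25% mitochondrial cutoff is an assumption, because the slides only say "high percentage". The field names and per-cell values are hypothetical.

```python
def passes_qc(cell, min_aligned_reads=5_000, min_expressed_genes=2_000,
              max_mito_fraction=0.25):
    """Return True if a cell survives all three outlier filters.

    max_mito_fraction=0.25 is an assumed cutoff, not one stated in the slides.
    """
    return (
        cell["aligned_reads"] >= min_aligned_reads
        and cell["expressed_genes"] >= min_expressed_genes
        and cell["mito_fraction"] <= max_mito_fraction
    )


# Hypothetical per-cell summary statistics.
cell_stats = {
    "Cell_01": {"aligned_reads": 250_000, "expressed_genes": 4_500, "mito_fraction": 0.05},
    "Cell_02": {"aligned_reads": 3_000, "expressed_genes": 2_500, "mito_fraction": 0.04},
    "Cell_03": {"aligned_reads": 180_000, "expressed_genes": 3_900, "mito_fraction": 0.60},
    "Cell_04": {"aligned_reads": 90_000, "expressed_genes": 1_200, "mito_fraction": 0.08},
}
kept = [name for name, stats in cell_stats.items() if passes_qc(stats)]
print(kept)
```

Keeping the thresholds as parameters makes it easy to revisit them after looking at the distributions for a particular experiment.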
45. Normalization corrects for differences in capture efficiency, sequencing depth and other technical bias
Raw count data
Assume that most genes are not differentially expressed
Calculate scaling factors for each cell
Apply the scaling factors and log-transform
Normalized expression data
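The scale-then-log steps can be sketched as follows. The slides do not name the method used to compute the scaling factors, so this example swaps in the simplest choice, library-size scaling (each cell's total count relative to the average cell), followed by log2(x + 1). The function name and the toy counts are hypothetical.

```python
import math


def normalize(counts):
    """counts: {cell: {gene: raw count}}. Returns (scaling factors, log2 normalized values).

    Assumes every cell has a non-zero total, which holds after outlier filtering.
    """
    totals = {cell: sum(genes.values()) for cell, genes in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    # Scaling factor: the cell's sequencing depth relative to the average cell.
    factors = {cell: total / mean_total for cell, total in totals.items()}
    normalized = {
        cell: {gene: math.log2(count / factors[cell] + 1) for gene, count in genes.items()}
        for cell, genes in counts.items()
    }
    return factors, normalized


# Hypothetical toy data: Cell_02 was sequenced twice as deeply as Cell_01.
raw_counts = {
    "Cell_01": {"GENE_A": 10, "GENE_B": 30},
    "Cell_02": {"GENE_A": 20, "GENE_B": 60},
}
factors, normalized = normalize(raw_counts)
print(factors)
print(normalized["Cell_01"]["GENE_A"], normalized["Cell_02"]["GENE_A"])
```

After scaling, the two cells show the same expression for each gene, which is the intended effect when the only difference between them is sequencing depth.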
54. Typical questions
What are the expression differences
between my experimental groups?
What are the subpopulations in my data?
What are the gene expression patterns
in each subpopulation?
64-66. The silhouette coefficient is a useful metric to determine the optimal number of groups
Flowchart: TREAT CONDITIONS AS GROUPS? NO; SELECT GENES; ASSIGN CELLS TO GROUPS
k  Silhouette coefficient
2  0.48
3  0.56
4  0.47
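For reference, the silhouette coefficient compares how close each cell is to its own group with how close it is to the nearest other group, and averages the result over all cells. The sketch below is a plain-Python version that assumes Euclidean distance; the slides do not state which distance or implementation was used, and the toy points are hypothetical.

```python
from math import dist
from statistics import mean


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points; points are coordinate tuples, labels are group ids."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    if len(groups) < 2:
        raise ValueError("The silhouette coefficient needs at least two groups")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a single-member group scores 0
            continue
        a = mean(dist(points[i], points[j]) for j in own)  # cohesion
        b = min(  # separation from the nearest other group
            mean(dist(points[i], points[j]) for j in members)
            for other, members in groups.items() if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return mean(scores)


# Hypothetical one-dimensional toy data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_coefficient(points, [0, 0, 1, 1])
bad = silhouette_coefficient(points, [0, 1, 0, 1])
print(good, bad)
```

Running the clustering for several values of k and keeping the one with the highest coefficient reproduces the comparison in the table, where k = 3 scores best.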
71. Flowchart: TREAT CONDITIONS AS GROUPS? YES / NO; SELECT GENES; ASSIGN CELLS TO GROUPS; TEST GENES FOR DIFFERENTIAL EXPRESSION
[Figure: differentially expressed genes; axes: variance, average expression]
[Figure: highly variable genes; axes: variance, average expression]
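One common way to pick highly variable genes is to compare each gene's variance with its average expression. The sketch below ranks genes by the variance-to-mean ratio. This is an assumed selection rule for illustration, because the slides only show variance plotted against average expression; the function name and the toy values are hypothetical.

```python
from statistics import mean, pvariance


def rank_by_dispersion(expression):
    """expression: {gene: [value per cell]}. Returns (genes ranked by variance/mean, stats)."""
    stats = {}
    for gene, values in expression.items():
        average = mean(values)
        variance = pvariance(values)
        dispersion = variance / average if average > 0 else 0.0
        stats[gene] = (average, variance, dispersion)
    ranked = sorted(stats, key=lambda gene: stats[gene][2], reverse=True)
    return ranked, stats


# Hypothetical expression values across four cells.
expression = {
    "stable_gene": [5, 5, 5, 5],
    "variable_gene": [0, 10, 0, 10],
    "silent_gene": [0, 0, 0, 0],
}
ranked, stats = rank_by_dispersion(expression)
print(ranked[0], stats[ranked[0]])
```

A high rank only says that a gene varies more than its expression level would suggest; whether that variation reflects real groups of cells still has to be checked by clustering.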
84. Gene set enrichment analysis depends on the quality of the gene set
MSigDB hallmark gene sets only contain 4000 genes
MAKE YOUR OWN GENE SETS FROM THE LITERATURE
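To see how much the result depends on which genes a gene set contains, it helps to look at the simplest enrichment test. The sketch below is a hypergeometric over-representation test, a simpler relative of rank-based gene set enrichment analysis, and is not claimed to be the method used in these slides. The function name and the numbers are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_in_set belong to the gene set."""
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 50-gene set,
# 200 selected genes, 8 of which fall in the set.
p_value = overrepresentation_pvalue(20_000, 50, 200, 8)
print(p_value)
```

Genes that are missing from every gene set can never contribute to the overlap, which is why a curated list built from the literature can reveal signals that a general-purpose collection misses.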
94. Takeaways
Remember to provide a metadata file
More reads are usually better than longer reads
Only about 50% of your reads will be usable to quantify gene expression
Assume that 50% of your cells could fail
High variance doesn’t imply subpopulations
Make your own gene lists!
100. Slides available at: bit.ly/crem_bioinformatics