Microarrays, one of the more recent biomedical technologies, produce high-dimensional data, which makes statistical analysis challenging. In this discussion I presented an overview of microarray analysis, focusing on its use in gene expression profiling.
2. Outline
• Biological background
– Central Dogma
– DNA
– Genes
• Genomics
• Microarrays
• Gene expression data analysis pipeline
• What’s next?
Gene expression analysis
3. Central Dogma
http://compbio.pbworks.com
4. DeoxyriboNucleic Acid (DNA)
• DNA is the organic molecule that carries the information used by a cell to build the proteins that carry out most of the biological processes in a cell.
• Double helix
• Base pairs: G ≡ C, A = T
• Example sequence: ATGCTGATCGATGCAGAATCGATC
wikipedia
• The length of human DNA is about 3 × 10^9 base pairs (bp).
• Between any two humans, DNA is 99.9% the same.
• Our DNA is 99% the same as that of chimpanzees.
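The pairing rule above (G with C, A with T) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the input is the example sequence from the slide.

```python
# Watson-Crick pairing from the slide: G pairs with C, A pairs with T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(seq)[::-1]

example = "ATGCTGATCGATGCAGAATCGATC"  # example sequence from the slide
print(complement(example))
print(reverse_complement(example))
```

Taking the reverse complement twice gives back the original strand, which is a quick sanity check on the pairing table.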
5. Gene
• The full DNA sequence of an organism is called its genome.
• A gene is a segment of DNA that specifies the sequence of a protein.
• Length: 1,000-3,000 bases
• Approximately 20,000-25,000 genes in the human genome
http://www.dna-sequencing-service.com/dna-sequencing/gene-dna/
6. Genetic Code
• The nucleotide sequence of an mRNA is translated into the amino acid sequence of the corresponding protein.
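As a rough illustration of translation, the sketch below builds the standard genetic code table and translates an mRNA codon by codon. It assumes the example DNA sequence from the DNA slide is the coding strand and that translation starts at its first base; the function names are made up for this example.

```python
# Standard genetic code, listed in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # "*" marks a stop codon

def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U (an assumption, see text)."""
    return dna.replace("T", "U")

def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGCTGATCGATGCAGAATCGATC")
print(mrna)             # AUGCUGAUCGAUGCAGAAUCGAUC
print(translate(mrna))  # MLIDAESI
```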
http://www.cs.tau.ac.il/~rshamir/
7. Genomics
• Genomics is the study of all the genes of a cell or tissue at:
– the DNA level (genotype), e.g., GWAS, SNP, CNV, etc.,
– the mRNA level (transcriptomics), e.g., gene expression,
– or the protein level (proteomics).
• Functional genomics: the study of the functionality of specific genes, their relations to diseases, their associated proteins, and their participation in biological processes.
8. Gene Expression
• Different tissues in the same human may express different genes, according to their role in the human body.
• The same cell may express different genes under different circumstances (stress, nutrition, etc.).
• Cells express different genes during their lifetime (for instance, embryonic gene expression differs from adult gene expression).
• Technologies for measuring mRNA assume:
– The level of mRNA in the cell is an indication of the protein level in the cell, since most regulation acts on the transcription process and not on the translation process.
– Genes are expressed only when needed.
10. Microarray Technologies
• Two types of microarray technologies:
– Single channel
– Dual channel
• Platforms:
– Affymetrix,
– Illumina,
– Agilent
11. Microarrays Applications
• Gene expression profiling (our focus)
• SNP arrays for studying single nucleotide polymorphisms (SNP) and copy number variations (CNV) such as deletions or insertions.
• Etc.:
– ChIP-on-chip for investigating protein binding site occupancy,
– Exon arrays to search for alternative splicing events,
– Tiling arrays for identifying novel transcripts that are either coding or non-coding.
12. Microarrays Applications: MammaPrint
• The MammaPrint test can determine the likelihood of breast cancer returning within 10 years after treatment.
• First FDA-approved molecular test that is based on microarray technology.
• Predicts whether existing cancer will metastasize.
• Investigates the patterns and behavior of large numbers of genes.
• The recurrence of cancer is partly dependent on the activation and suppression of certain genes located in the tumor.
• MammaPrint measures the activity of those genes and uses it to predict patients’ odds of the cancer spreading.
15. Log2 Intensity
• Response: log2 intensity. Why?
• Statistics: Log-transforming the data makes the intensity distribution more symmetric and bell-shaped, i.e., closer to a normal distribution.
• Biology: The biological processes in whole individuals presumably act in a multiplicative way. Log transformation turns these multiplicative effects on the intensities and the expression levels into additive ones.
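A small numeric sketch of why the log2 scale is convenient: a two-fold increase and a two-fold decrease are asymmetric on the raw scale but symmetric after the log2 transform. The intensity values are made up for illustration.

```python
import math

control = 1000.0                 # hypothetical raw intensity
doubled, halved = 2000.0, 500.0  # two-fold up and two-fold down

# On the raw scale the two changes look asymmetric (+1000 vs -500).
raw_up, raw_down = doubled - control, halved - control

# On the log2 scale the same changes are symmetric around zero.
log_up = math.log2(doubled / control)
log_down = math.log2(halved / control)

print(raw_up, raw_down)   # 1000.0 -500.0
print(log_up, log_down)   # 1.0 -1.0
```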
16. Normalization
• Process to remove systematic errors which can cause considerable biases.
• Systematic errors are due to:
– Different incorporation efficiencies of dyes.
– Different amounts of mRNA in the tested sample, causing different expression levels.
– Differences in experimenter or protocol (if data were gathered in different labs).
– Different scanning parameters.
– Differences between chips created in different production batches.
• Example: quantile normalization
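Quantile normalization, the example named above, forces every array to share the same intensity distribution. A minimal sketch, assuming each array is a list of log2 intensities for the same genes in the same order and that there are no tied values within an array (real implementations handle ties more carefully); the function name and data are made up for this example.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    arrays: list of equal-length lists, one list of intensities per array.
    """
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: average of the values sharing the same rank.
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_genes)
    ]
    normalized = []
    for a in arrays:
        # Gene indices ordered from the lowest to the highest intensity.
        order = sorted(range(n_genes), key=lambda i: a[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized

chip_a = [2.0, 4.0, 6.0]     # hypothetical intensities for three genes
chip_b = [30.0, 10.0, 20.0]  # same genes on a much brighter array
result = quantile_normalize([chip_a, chip_b])
print(result)  # [[6.0, 12.0, 18.0], [18.0, 6.0, 12.0]]
```

After normalization both arrays contain the same set of values; only the ranking of the genes within each array is preserved.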
19. Microarrays, Applications
• Identify disease-related genes
• Classification, e.g., MammaPrint
• Cluster genes
• Cluster the samples (disease stages, tissues): class discovery
• Cluster genes and samples
• Pharmacogenomics:
– Personalized medicine: individualize therapies
– Target-based medicine: more effective drugs with fewer side effects.
20. Data Analysis Challenges
• The curse of high dimensionality: an obstacle in the solution of classification and clustering problems.
• The multiple testing problem: the number of false positive results increases because many hypotheses (one per gene) are tested at the same time.
• Multiple testing correction:
– FWER (family-wise error rate): Bonferroni, Holm.
– FDR (false discovery rate): BH (Benjamini-Hochberg), BY (Benjamini-Yekutieli).
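The corrections listed above can be written as adjusted p-values. The sketch below covers Bonferroni, Holm and BH in plain Python (BY is omitted); the function names and the four p-values are made up for illustration, and in practice an established statistics package would normally be used instead.

```python
def bonferroni(pvals):
    """FWER control: multiply every p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """FWER control, step-down: the i-th smallest p-value is scaled by (m - i + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank is 0-based
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

def benjamini_hochberg(pvals):
    """FDR control, step-up: the i-th smallest p-value is scaled by m / i."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        idx = order[rank]
        running_min = min(running_min, pvals[idx] * m / (rank + 1))
        adjusted[idx] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-gene p-values
print(bonferroni(pvals))
print(holm(pvals))
print(benjamini_hochberg(pvals))
```

On this toy input the BH-adjusted values are never larger than the Holm or Bonferroni ones, which reflects that FDR control is the less strict criterion.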
21. Identification of Differential Genes
• Discover genes with different expression in two or more different tissues/conditions.
• Fold change
• t-type tests:
– t-test
– Modified t-tests: Significance Analysis of Microarrays (SAM), t-LIMMA
• Linear Models for Microarray Data (LIMMA)
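For a single gene, the fold change and an ordinary t-type statistic can be computed as sketched below. This is the plain Welch t-statistic, not the moderated statistics used by SAM or LIMMA, and the log2 intensities are made up for illustration.

```python
import math
from statistics import mean, variance

def log2_fold_change(group_a, group_b):
    """Difference of mean log2 intensities (inputs are already on the log2 scale)."""
    return mean(group_b) - mean(group_a)

def welch_t(group_a, group_b):
    """Welch t-statistic: mean difference divided by its standard error."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_b) - mean(group_a)) / standard_error

control = [8.0, 8.2, 7.8]    # hypothetical log2 intensities, condition A
treated = [10.0, 10.2, 9.8]  # hypothetical log2 intensities, condition B

lfc = log2_fold_change(control, treated)
t_stat = welch_t(control, treated)
print(round(lfc, 3), round(t_stat, 3))  # 2.0 12.247
```

A log2 fold change of 2 corresponds to a four-fold difference on the raw intensity scale. Turning the t-statistic into a p-value needs the t distribution, which is left out here.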
22. Clustering
• Clustering genes or conditions or both.
• Deducing functions of unknown genes from known genes with similar expression patterns.
• Identifying disease profiles: tissues with similar pathology should yield similar expression profiles.
• Co-expression of genes may imply co-regulation.
• Classification of biological conditions.
• Drug development
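Clustering methods need a measure of how similar two expression patterns are. One common choice, used here as an assumption since the slides do not name one, is the Pearson correlation between expression profiles. The sketch finds the known gene whose profile is most similar to that of an unknown gene; gene names and values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlation_distance(x, y):
    """0 for perfectly co-expressed profiles, 2 for perfectly opposite ones."""
    return 1.0 - pearson(x, y)

# Hypothetical profiles of known genes across four conditions.
known = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.0, 3.0, 2.0, 3.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
unknown = [0.5, 1.0, 1.6, 2.0]

nearest = min(known, key=lambda g: correlation_distance(known[g], unknown))
print(nearest)  # geneA
```

A full clustering algorithm (hierarchical, k-means, and so on) would apply such a distance to all pairs of genes or samples.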
24. Classification
• Classification of tumor malignancies into known classes: supervised learning;
• Identification of marker genes that characterize the different tumor classes: feature selection.
Genes distinguishing ALL from AML (two types of leukemia).
25. Classification
• Methods:
– Discriminant analysis: LDA, k-nearest neighbor.
– Classification tree
– Logistic regression, penalized LR: LASSO.
– Neural network
– Support vector machines (SVM)
– Random forest, etc.
A survey of these methods:
http://www.ibiostat.be/publications/phd/suzyvansanden.pdf
http://www.stat.cmu.edu/~jiashun/Research/software/Data/papers/dudoit.pdf
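Of the methods listed, k-nearest neighbor is the simplest to sketch. The example below classifies a new sample as ALL or AML from two marker genes; the expression values are invented for illustration, and the function name is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_samples, train_labels, new_sample, k=3):
    """Predict the majority label among the k closest training samples."""
    distances = sorted(
        (math.dist(sample, new_sample), label)
        for sample, label in zip(train_samples, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical log2 expression of two marker genes per patient.
train_samples = [
    [1.0, 5.0], [1.2, 4.8], [0.9, 5.2],  # ALL patients
    [4.0, 1.0], [4.2, 1.1], [3.9, 0.8],  # AML patients
]
train_labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

print(knn_predict(train_samples, train_labels, [1.1, 5.0]))  # ALL
print(knn_predict(train_samples, train_labels, [4.1, 1.0]))  # AML
```

With thousands of genes and few samples, feature selection before a distance-based classifier such as this one matters a great deal, which is the high-dimensionality problem discussed earlier.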
26. Pathways Analysis
• We have discovered differentially expressed (DE) genes. What's next?
• Identify which pathway terms (e.g., GO, KEGG) are most commonly associated with the DE genes.
• Methods: GEA, GSEA, NEA, etc.
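One simple way to ask whether a pathway is over-represented among the DE genes is a hypergeometric test. This is a generic over-representation sketch, not the GSEA or NEA procedure, and the counts are made up for illustration.

```python
from math import comb

def enrichment_pvalue(total_genes, pathway_genes, de_genes, overlap):
    """P(at least `overlap` DE genes fall in the pathway) under random draws.

    total_genes:   number of genes measured
    pathway_genes: number of measured genes annotated to the pathway
    de_genes:      number of DE genes
    overlap:       number of DE genes annotated to the pathway
    """
    favorable = sum(
        comb(pathway_genes, i) * comb(total_genes - pathway_genes, de_genes - i)
        for i in range(overlap, min(pathway_genes, de_genes) + 1)
    )
    return favorable / comb(total_genes, de_genes)

# Hypothetical counts: 20 genes measured, 5 in the pathway, 4 DE, 3 of them in the pathway.
p_value = enrichment_pvalue(20, 5, 4, 3)
print(round(p_value, 4))  # 0.032
```

Because such a test is repeated for every pathway term, the resulting p-values are themselves subject to the multiple testing problem.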
27. What’s next
• Next-generation sequencing
+ No need to know the sequence of the transcript.
+ There are no artifacts due to cross-hybridization.
+ Better quantitation of low-abundance transcripts.
- New data types and huge data volumes.
- Quality
• Epigenetics
– The study of heritable changes in genome function that occur without a change in DNA sequence (http://epigenome.eu/en/1,1,0).
– DNA methylation
28. Reference
• Gohlmann, H. and Talloen, W., Gene Expression Studies Using Affymetrix Microarrays, Chapman & Hall/CRC Mathematical & Computational Biology, 2009.
• http://www.cs.tau.ac.il/~rshamir/ge/09/
Other useful books:
• Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S, editors: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer Science, New York, 2005.
• Amaratunga D, Cabrera J: Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley-Interscience, 2004.