This document discusses the role of statisticians in personalized medicine and provides an overview of statistical methods used in bioinformatics. It describes how statisticians are involved in all stages of drug development from discovery through clinical trials. Personalized medicine aims to determine an individual's unique characteristics to select the best treatment. Advanced technologies like microarrays and next-generation sequencing generate large genomic datasets that require statistical analysis for applications like disease classification, biomarker discovery, and identifying disease subtypes and targeted therapies. The document outlines statistical methods used for tasks like microarray data analysis, RNA sequencing, and finding subtype-specific genes and transcripts.
The Role of Statisticians in Personalized Medicine: An Overview of Statistical Methods in Bioinformatics
1. The Role of Statisticians in
Personalized Medicine:
An Overview of Statistical
Methods in Bioinformatics
Setia Pramana
STIS
Jakarta, 8 August 2014
2. Outline
• Drug Development
• Personalized Medicine
• Central Dogma
• Microarray Data Analysis
• Next Generation Sequencing
• Summary
3. Drug Development
• Takes 10-15 years
• Costs millions of USD
• Who: Pharmaceutical, biotechnology, device companies,
Universities and government research agencies
• Regulators: the US Food and Drug Administration (FDA), BPOM
• Evaluate:
– Safety – can people take it?
– Efficacy – does it do anything in humans?
– Effectiveness – is it better or at least as good as what is
currently available?
– Do the benefits outweigh the risks?
4. Drug Development
• The Stages:
- Drug Discovery
- Pre-clinical Development
- Clinical Development (4 phases)
• Statisticians are involved in all stages
• Stages are highly regulated
• Results are based on the majority of patients
• But... patients are all different!
6. Patient Heterogeneity
• We’re all different in
- Physiological, demographic characteristics
- Medical history
- Genetic/genomic characteristics
• What works for a patient with one set of
characteristics might not work for another!
7. Patient Heterogeneity
• “One size does not fit all”
• Use a patient’s characteristics to determine best
treatment for him/her
• Genomic information has great potential
--> Personalized medicine:
“The right treatment for the right patient at the right time”
8. Personalized Medicine
• The ability to determine an individual's unique molecular
characteristics and to use those genetic distinctions to
diagnose more finely an individual's disease, select
treatments that increase the chances of a successful
outcome and reduce possible adverse reactions.
• Personalized medicine is also the ability to predict an
individual's susceptibility to diseases and thus to take
steps that may help avoid a disease or reduce the extent to
which the individual will experience it.
9. Subgroup Identification and Targeted Treatment
• Determine subgroups of patients who share certain
characteristics and would get better on a particular
treatment
• Discover biomarkers which can identify the subgroup
• Focus on finding and treating a subgroup
10. Subgroup Identification and Targeted Treatment
Genotype, Phenotype, Intervention, Outcome:
• Genotype: mutations/SNPs, gene/protein expression, epigenetics
• Phenotype: diseases, disability, etc.
• Intervention: drugs, therapies, regimens
• Outcome: personalized medicine
11. Advanced Biomedical Technologies
• High-throughput microarrays and molecular imaging
to monitor SNPs and gene and protein expression
• Next-Generation Sequencing
14. Gene
• The full DNA sequence of an organism is called its
genome
• A gene is a segment of DNA that specifies the sequence of
one or more proteins.
15. Genomics
• The study of all the genes of a cell or tissue at the level of:
– DNA (genotype), e.g., GWAS SNPs, CNVs, etc.
– mRNA (transcriptomics), i.e., gene expression
– protein (proteomics)
• Functional Genomics: study the functionality of specific
genes, their relations to diseases, their associated
proteins and their participation in biological processes.
17. Microarray
• DNA microarrays are a biotechnology that allows
the expression of thousands of genes to be
monitored at the same time.
18. Applications
• Drugs with high efficacy and few or no side effects
• Disease-related genes
• Biological discovery
– new and better molecular diagnostics
– new molecular targets for therapy
– finding and refining biological pathways
• Molecular diagnosis of leukemia, breast cancer, etc.
• Appropriate treatment for genetic signature
• Potential new drug targets
19. Microarray
Figure: Overview of the process of generating high-throughput gene expression data using microarrays.
23. Challenges
• Very large data sets that are difficult to visualize
• Too few samples (columns), usually < 100
• Too many genes (rows), usually > 10,000
• Testing so many genes is likely to produce false positives
• For exploration, a large set of all relevant genes is
desired
• For diagnostics or identification of therapeutic
targets, the smallest set of genes is needed
• Model needs to be explainable to biologists
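The false-positive problem above is usually handled by adjusting the per-gene p-values for multiple testing. The slides do not name a specific adjustment, so the sketch below assumes the Benjamini-Hochberg false discovery rate procedure as one common choice; the p-values are made-up toy numbers.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: five genes with hypothetical raw p-values.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
selected = [i for i, p in enumerate(adjusted_p) if p < 0.05]
print(adjusted_p, selected)
```

Genes whose adjusted p-value falls below the chosen threshold are reported; with tens of thousands of genes the adjustment removes many hits that would pass an unadjusted 0.05 cutoff.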
24. Types of Microarray Data Analysis
• Gene Selection
– find genes for therapeutic targets
• Classification (Supervised)
– identify disease (biomarker study)
– predict outcome / select best treatment
• Clustering (Unsupervised)
– find new biological classes / refine existing ones
– understand regulatory relationships/pathways
– exploration
25. Gene Selection
• Modified t-test
• Significance Analysis of Microarray (SAM)
• limma (Linear Models for Microarray Data)
• Linear mixed models
• Logistic regression
• Lasso (least absolute shrinkage and selection operator)
• Elastic net
• Etc.
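To make the idea of a modified t-test concrete, here is a minimal sketch of a SAM-style statistic, d = (mean difference) / (s + s0), where s is the pooled standard error and s0 is a small constant that stops genes with tiny variance from dominating the ranking. The value s0 = 0.1 and the gene names and expression values are assumptions for illustration only; SAM itself chooses s0 from the data, and this sketch does not reproduce its permutation-based significance step.

```python
import math
import statistics


def modified_t(group_a, group_b, s0):
    """SAM-style statistic: mean difference divided by (pooled SE + s0)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = statistics.fmean(group_a)
    mean_b = statistics.fmean(group_b)
    # Pooled within-group sum of squares.
    ss = sum((x - mean_a) ** 2 for x in group_a)
    ss += sum((x - mean_b) ** 2 for x in group_b)
    pooled_var = ss / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / (standard_error + s0)


# Hypothetical log-expression values: (control samples, disease samples).
expression = {
    "geneA": ([5.1, 5.3, 5.2], [7.9, 8.1, 8.0]),     # large, real difference
    "geneB": ([6.0, 6.1, 5.9], [6.0, 6.2, 5.8]),     # no difference
    "geneC": ([2.00, 2.01, 2.00], [2.03, 2.02, 2.03]),  # tiny difference, tiny variance
}

scores = {gene: modified_t(a, b, s0=0.1) for gene, (a, b) in expression.items()}
# Rank genes by the absolute value of the statistic, largest first.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
print(ranked)
```

With s0 = 0 the function reduces to the ordinary two-sample t-statistic with pooled variance, under which geneC would look strongly significant despite its negligible difference; the added constant shrinks its score.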
27. Clustering
• Cluster the genes
• Cluster the arrays/conditions
• Cluster both simultaneously
• K-means
• Hierarchical
• Biclustering algorithms
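As an illustration of the first algorithm in the list, the following is a minimal pure-Python sketch of k-means applied to samples (arrays). The data points, the choice k = 2, and the random initialisation from the data are all assumptions for the example; in practice an established implementation would be used.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means: alternate assignment and centre-update steps."""
    rng = random.Random(seed)
    # Initialise the centres with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the centre of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


# Six hypothetical samples measured on two genes, forming two obvious groups.
samples = [[1.0, 1.2], [0.9, 1.1], [1.1, 0.9], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels, centers = kmeans(samples, k=2)
print(labels)
```

To cluster genes instead of arrays, the same function can be applied to the transposed data, so that each point is one gene's expression profile across samples.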
28. Clustering
• Cluster or classify genes according to tumors
• Cluster tumors according to genes
30. Classification
• Linear Discriminant Analysis
• K-nearest neighbors
• Logistic regression
• L1 Penalized Logistic Regression
• Neural Network
• Support Vector Machines
• Random forest
• etc
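Of the classifiers listed, k-nearest neighbors is the simplest to show in a few lines. The sketch below assumes Euclidean distance and majority voting; the training profiles and the labels "subtype_1" and "subtype_2" are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort the training samples by Euclidean distance to the query.
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical expression profiles (two genes) with known subtypes.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = ["subtype_1"] * 3 + ["subtype_2"] * 3

print(knn_predict(train_x, train_y, [1.1, 1.0], k=3))
```

In a real study k would be tuned, for example by cross-validation, and the features would first be put on a comparable scale.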
31. Aim: To improve understanding of host protein profiles during disease progression, especially in children.
32. Classification of Malaria Subtypes
• Identify a panel of proteins that can distinguish between the different subtypes.
• Implement L1-penalized logistic regression.
33. Penalized Logistic Regression
• Logistic regression is a supervised method for binary or multi-class classification.
• In high-dimensional data (e.g., microarray data) there are more variables than observations, so classical logistic regression does not work.
• Other problems: correlated variables (multicollinearity) and overfitting.
• Solution: introduce a penalty for model complexity.
35. L1 Penalized Logistic Regression
• Shrinks all regression coefficients (β) toward zero and sets some of them exactly to zero.
• Performs parameter estimation and variable selection at the same time.
• The choice of the penalty parameter λ is crucial; it is chosen via k-fold cross-validation.
• The procedure is implemented in the R package penalized.
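The shrink-and-select behaviour described above can be seen in a small sketch. It minimises the average logistic loss plus λ times the sum of absolute coefficients using proximal gradient descent with soft-thresholding. This is an assumption made for illustration: it is not the algorithm used inside the R package penalized, and the data, step size, and λ values are made up.

```python
import math


def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=1000):
    """Proximal gradient descent for L1-penalized logistic regression.

    The intercept is not penalized; `lam` is the L1 penalty weight.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = 0.0
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        residuals = [
            sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad_intercept = sum(residuals) / n
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(p)]
        intercept -= step * grad_intercept
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[-2.0, 0.3], [-1.5, -0.2], [-1.0, 0.1], [-0.5, -0.4],
     [0.5, 0.2], [1.0, -0.1], [1.5, 0.4], [2.0, -0.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

intercept, beta = fit_l1_logistic(X, y, lam=0.1)
print(beta)
```

On this toy data a moderate λ keeps the informative feature and sets the noise feature's coefficient to exactly zero, while a sufficiently large λ removes both. In practice λ would be chosen by k-fold cross-validation, as the slide notes.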
50. Subtype-specific Transcripts/Isoforms
• Breast invasive carcinoma (BRCA) from The Cancer Genome Atlas (TCGA) project.
• 329 tumor samples.
• Platform: Illumina.
• Paired-end reads (length 50 bp).
• 20-100 million reads.
51. Subtype-specific Transcripts/Isoforms
• Goal: discover transcripts/isoforms that are significantly over- or under-expressed in only one cancer subtype.
Pramana et al., NBBC 2013
52. Analysis Flow
• 329 TCGA samples, split into:
- Discovery set: 179 samples
- Validation set: 150 TCGA samples, plus external samples
• Classification into molecular subtypes
- Swedish microarray data used as training data
- Based on gene-level FPKM
- Median and variance normalization
- K-nearest neighbor
- Selection of classifier genes
• Subtype-specific transcripts
- Transcript-level FPKM of all genes
- For each transcript: robust contrast tests
- Multiple testing adjustment
56. Software?
• R is growing in popularity, especially in bioinformatics
– Statistics, data analysis, machine learning
– Free
– High Quality
– Open Source
– Extendable (you can submit and publish your own package!!)
– Can be integrated with other languages (C/C++, Java, Python)
– Large active user community
– Command-line based (a drawback)
57. My Current Research
• Integration of Somatic Mutation, Expression and
Functional Data Reveals Potential Driver Genes Predictive
of Breast Cancer Survival (KI, Ewha Univ, Brescia Univ).
• Molecular Subtyping of Breast Cancers using RNA-Sequence Data (KI, Ewha Univ, Brescia Univ).
• The genomic surveillance of drug-resistant tuberculosis
(FKUI, NUS).
• Genomic screening for prostate cancer (KI)
• Molecular subtyping of Malaria (KI, Scilab, Eijkman Inst.)
• Health Technology Assessment (FKUI, Depkes)
58. Summary
• Statistics plays an important role in developing
personalized medicine
• It is a multidisciplinary field that needs collaboration
among different experts.
• Bioinformatician is one of the sexiest jobs
• Big Data in Medicine: Numerous opportunities to be
explored and discovered.
From the sample tissue, RNA is extracted and chopped into millions of small pieces called fragments. These fragments are around 400 bp long.
The RNA fragments are then converted to cDNA. The cDNA fragments are sequenced, or read, from both ends, which is why the results are called paired-end reads.
These reads are then mapped to a reference genome using TopHat.
The next step is to measure the expression of each gene and transcript using a measure called FPKM (fragments per kilobase of transcript per million mapped reads).
For example, here is a gene with three exons, and this gene has two transcripts, or isoforms.
I have to remind you that some genes do not produce just a single protein. A single gene can produce different forms of a protein through alternative splicing, which produces different versions of RNA from the same gene.
In this example the gene has three exons: transcript one comes from the first and third exons, and transcript two is produced from exons 1 and 2.
The small dots here are the mapped cDNA fragments, so for each transcript we can count the number of fragments mapped to it.
The abundance of RNA is proportional to the number of reads. FPKM is the number of fragments divided by the length of the gene or transcript (in kilobases) and by the total number of mapped reads (in millions).
Once FPKM has been obtained for each transcript, a gene-level FPKM can be obtained by summing the FPKM of all transcripts of the gene.
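The FPKM calculation just described can be written out as a short sketch. The transcript names, fragment counts, lengths, and library size below are made-up numbers for illustration; real pipelines estimate the fragment counts per transcript from the alignments before this step.

```python
def fpkm(fragment_count, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # Dividing by (length / 1e3) and (total / 1e6) is the same as
    # multiplying the raw ratio by 1e9.
    return fragment_count * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical gene with two transcripts: name -> (fragments mapped, length in bp).
transcripts = {
    "transcript_1": (600, 1500),
    "transcript_2": (200, 2000),
}
total_mapped = 20_000_000  # total mapped fragments in the sample

transcript_fpkm = {
    name: fpkm(count, length, total_mapped)
    for name, (count, length) in transcripts.items()
}
# Gene-level FPKM is the sum over the gene's transcripts.
gene_fpkm = sum(transcript_fpkm.values())
print(transcript_fpkm, gene_fpkm)
```

With these toy numbers the two transcripts get FPKM values of 20 and 5, so the gene-level FPKM is 25.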
For the analysis, we split the data into two sets: a discovery set and a validation set.
In the discovery set, we start by classifying the samples into molecular subtypes with K-nearest neighbor, using Swedish microarray data as the training data.
Since gene expression is measured differently on the two platforms, we normalized them together using median and variance normalization before classification.
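The notes do not spell out how the median and variance normalization was done. One plausible reading, assumed here, is that each gene is centred on its median and scaled by its standard deviation within each platform, so that microarray and RNA-seq values end up on a comparable scale. The expression values below are made up.

```python
import statistics


def median_variance_normalize(values):
    """Centre values on their median and scale them by their standard deviation."""
    median = statistics.median(values)
    spread = statistics.stdev(values)
    if spread == 0:
        # A constant gene carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - median) / spread for v in values]


# Hypothetical values of one gene across samples on each platform.
microarray_values = [7.0, 8.0, 9.0, 10.0, 11.0]  # e.g. log intensities
rnaseq_values = [1.0, 2.0, 3.0, 4.0, 5.0]        # e.g. log FPKM

microarray_norm = median_variance_normalize(microarray_values)
rnaseq_norm = median_variance_normalize(rnaseq_values)
print(microarray_norm)
print(rnaseq_norm)
```

After this step both platforms have median 0 and unit standard deviation for the gene, which is what allows a distance-based classifier such as K-nearest neighbor to be trained on one platform and applied to the other.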
Then we select the genes that classify the samples best.
Once the subtypes are obtained, the next step is to find subtype-specific transcripts using transcript-level FPKM. To find transcripts that are up- or down-regulated in a specific subtype, robust contrast tests were performed.