Chien-Wei (Masaki) Lin
Research Statement
Throughout my previous work experience and my current PhD training, I have worked on
both methodological and collaborative research on multi-omics high-throughput data (both
microarray and sequencing data). Inspired by real data, I am particularly interested in
developing methodology for statistical genetics, bioinformatics, power and sample size
calculation, integrative and meta-analysis of different omics data types, and
supervised/unsupervised machine learning applications. In my collaborative work, I have worked
closely with scientists in many fields, such as cardiovascular epidemiology, psychiatry, and
cancer biology. I have experience with various types of omics data, including single nucleotide
polymorphism (SNP), copy number variation (CNV), DNA methylation, gene expression,
proteomics (peptide), and metabolomics data. In the rest of this statement, I summarize my
past and ongoing research projects and my future research plans.
1. Methodology Research
1.1. Power calculation tool in sequencing data
Unlike earlier fluorescence-based technologies such as microarrays, next-generation
sequencing (NGS) produces discrete count data, and its modelling should reflect this. In addition
to sample size, sequencing depth is directly related to the experimental cost. Consequently, given
a total budget and pre-specified unit experimental costs, experimental design for NGS data is
conceptually a multi-dimensional constrained optimization problem rather than the
one-dimensional sample size calculation of the traditional hypothesis testing setting.
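To make the design problem concrete, the following is a minimal sketch of a budget-constrained grid search over sample size and sequencing depth. Everything in it is an assumption for illustration: the cost model (a fixed cost per sample plus a cost per million reads), the two-group layout, the function names, and in particular `toy_power`, which is a made-up placeholder and not the SeqDesign or MethylSeqDesign power estimate.

```python
import math

def toy_power(n_per_group, depth_millions):
    """Hypothetical placeholder for genome-wide power.

    It only mimics the qualitative behaviour: power increases with the
    number of samples and saturates with sequencing depth.
    """
    return (1.0 - math.exp(-0.15 * n_per_group)) * (1.0 - math.exp(-0.10 * depth_millions))

def best_design(budget, cost_per_sample, cost_per_million_reads,
                depths=(5, 10, 20, 40, 80), power_fn=toy_power):
    """Grid search: for each depth, use the largest affordable sample size.

    Returns (n_per_group, depth_millions, power), or None if no design
    with at least two samples per group fits the budget.
    """
    best = None
    for depth in depths:
        unit_cost = cost_per_sample + cost_per_million_reads * depth
        n_total = int(budget // unit_cost)
        n_per_group = n_total // 2  # two-group comparison assumed
        if n_per_group < 2:
            continue
        power = power_fn(n_per_group, depth)
        if best is None or power > best[2]:
            best = (n_per_group, depth, power)
    return best
```

With a real power estimate plugged in as `power_fn`, the same loop trades off more samples at shallow depth against fewer samples at greater depth under one fixed budget.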
Past and Ongoing Research Projects
• Power calculation tool in RNA-Seq and Methyl-Seq data [1, 2]
Most existing methods focus on a single-gene formula: given the type-I error, effect
size, and sample size for one gene, they derive the corresponding statistical power
based on a particular association test. In high-throughput data, however, statistical power
should be considered for thousands of genes simultaneously, with the false discovery rate
and the expected discovery rate serving as the genome-wide counterparts of type-I error
and power.
We proposed two statistical frameworks, “SeqDesign” [1] and “MethylSeqDesign”
[2], that use pilot data for power calculation and experimental design of RNA-Seq and
Methyl-Seq experiments, respectively. The approach is based on mixture model fitting of the
p-value distribution from pilot data and a parametric bootstrap procedure based on
approximated Wald test statistics to infer genome-wide power for the optimal sample size and
sequencing depth. Our methods contribute in the following ways:
- Our methods model count data appropriately and use the false discovery rate to
control type-I error. By incorporating pilot data, they reflect the
characteristics of the target experiment.
- We provide an intuitive visualization tool to guide practitioners through various
practical experimental designs.
- We performed simulations and real data applications to evaluate the performance of
our methods and compare them with existing methods. The results showed that our methods
outperform the others.
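The genome-wide quantities mentioned above can be illustrated with a short sketch. It applies the standard Benjamini-Hochberg procedure to a vector of p-values and then, for a simulated data set where the truly differentially expressed genes are known, reports the false discovery proportion and the fraction of true signals detected. Reading the expected discovery rate as the average of that detected fraction is an assumption here, and the function names are hypothetical. This is not the mixture-model and bootstrap procedure of SeqDesign.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices rejected by the Benjamini-Hochberg procedure at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            k = rank
    return set(order[:k])

def discovery_rates(rejected, truly_de):
    """False discovery proportion and detected fraction for one simulated data set.

    rejected: set of gene indices declared significant.
    truly_de: set of gene indices that are truly differentially expressed.
    """
    true_pos = len(rejected & truly_de)
    fdp = (len(rejected) - true_pos) / max(len(rejected), 1)
    detected = true_pos / max(len(truly_de), 1)
    return fdp, detected
```

Averaging the two returned values over many simulated data sets gives empirical estimates of the false discovery rate and of genome-wide power for a given sample size and depth.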
Future Research Plan
Technology evolves rapidly. As new types of omics experiments are developed, new challenges in power
calculation and experimental design will arise. I would like to explore extending this statistical
framework to other types of omics data, such as single-cell experiments.
1.2. Meta-analysis and integrative analysis
Many scientific findings suffer from low reproducibility: the findings
cannot be replicated in another cohort even under the same or similar experimental conditions,
mostly because of the complexity of omics data analysis and biological variation. Findings that
are reliable across multiple studies are therefore more desirable. Meta-analysis has been used
successfully to achieve this goal by combining effect sizes or p-values from multiple studies.
Moreover, following the central dogma of molecular biology, different types of omics data work
jointly as a system. Biological conclusions drawn from multiple omics data types therefore tend
to have higher reliability and interpretability.
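As one concrete instance of combining p-values across studies, the sketch below implements Fisher's method using only the standard library. It is a generic textbook technique chosen for illustration, not necessarily the combination rule used in the projects described in this statement, and it assumes the per-study p-values are independent.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Under the null, -2 * sum(log p) follows a chi-square distribution with
    2k degrees of freedom. For even degrees of freedom the survival function
    has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!, used below.
    """
    k = len(pvalues)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # Fisher statistic divided by 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

For example, two studies that each report p = 0.05 for the same gene combine to a p-value of roughly 0.017, which shows how moderate but consistent evidence accumulates across studies.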
A growing number of databases are now available that greatly facilitate these types of analysis; for
example, dbGaP for genetic variant data, the NCBI GEO database for gene expression data, and
the TCGA/ICGC databases for pan-cancer data.
Past and Ongoing Research Projects
• Meta-Analytic Robust Classifier [3]
In biomedical research, predicting disease diagnosis, prognosis, or survival is an important
application. Robust and interpretable classifiers are usually favored for their clinical and
translational potential. The top scoring pair (TSP) algorithm is one example: it applies a
simple rank-based rule to identify rank-altered gene pairs for classifier construction.
However, many classification algorithms, including TSP, suffer from low performance in
cross-study validation. I therefore took part in developing a meta-analytic top scoring pair
(MetaKTSP) framework that combines data from multiple transcriptomic studies to
generate a single robust prediction model. Our main findings are as follows:
- We conducted simulations to compare our method with other popular single-study
methods. The results showed that our method outperforms the others.
- We showed that in real data, the biomarkers identified from multiple studies have
robust predictive power and better biological interpretation.
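For readers unfamiliar with the rank-based idea behind TSP, here is a minimal single-study sketch. It scores each gene pair by how much the probability that gene i is expressed below gene j differs between two classes, and keeps the best pair. This is the basic TSP idea only, not the MetaKTSP framework: it uses one pair rather than k pairs, breaks ties by taking the first pair found, and does no cross-study aggregation. The function names and data layout are assumptions.

```python
from itertools import combinations

def top_scoring_pair(expr, labels):
    """Find the gene pair whose within-sample ordering differs most between classes.

    expr: list of samples, each a list of expression values (one per gene).
    labels: list of 0/1 class labels, one per sample.
    Returns (score, gene_i, gene_j) for the best pair.
    """
    n_genes = len(expr[0])
    groups = {0: [s for s, y in zip(expr, labels) if y == 0],
              1: [s for s, y in zip(expr, labels) if y == 1]}
    if not groups[0] or not groups[1]:
        raise ValueError("both classes need at least one sample")
    best = (-1.0, None, None)
    for i, j in combinations(range(n_genes), 2):
        # Fraction of samples in each class where gene i is expressed below gene j.
        p0 = sum(s[i] < s[j] for s in groups[0]) / len(groups[0])
        p1 = sum(s[i] < s[j] for s in groups[1]) / len(groups[1])
        score = abs(p0 - p1)
        if score > best[0]:
            best = (score, i, j)
    return best

def predict(sample, pair, class_if_less):
    """Classify one sample by checking whether gene i is expressed below gene j."""
    i, j = pair
    return class_if_less if sample[i] < sample[j] else 1 - class_if_less
```

Because the rule depends only on the relative ordering of two genes within a sample, it is unaffected by monotone normalization differences between studies, which is part of what makes pair-based classifiers attractive for cross-study use.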
• Integrative analysis of SNP and gene expression [4, 5]
It has been shown that SNPs are informative for tracing the ancestral ethnicity of individuals.
However, they are useful mainly for classifying distinct continental origins and cannot
discriminate individuals from closely related ancestral lineages. We found that gene expression
data also carry ethnic information that supplements SNPs [4]. Our contributions are
summarized below:
- To the best of our knowledge, this is the first study to integrate SNP and gene
expression data to aid classification of subjects from closely related ethnic populations.
- By integrating SNP and gene expression data, we can construct an ancestry
prediction model with fewer markers and higher accuracy.
Expression quantitative trait loci (eQTL) analysis has become popular because of its functional
interpretation (SNPs regulating gene expression). However, most analyses are single-locus
based. We used the partial least squares (PLS) method to investigate the roles of gene-based
eQTL in ancestral ethnicity and pharmacogenetics [5]. We observed that ancestry
information is enriched in eQTL and can be used to construct prediction models that distinguish
subjects from closely related ethnic populations. We also identified two ancestry-informative
eQTL associated with adverse drug reactions and/or drug response.
Future Research Plan
Data science (a.k.a. “big data”) is an emerging discipline. As a statistician, I find the question
of how to properly extract information from such large data sets an attractive topic. I am keen to
develop methods for meta-analysis across various types of statistical problems and for integrating
various types of data (including multi-omics data and brain imaging data).
1.3. General Bioinformatics Problems
Bioinformatics is an interdisciplinary field that provides software and tools to help biologists
understand biological data better and in greater depth. Its applications are broad, and demand
keeps growing. For example, databases such as TCGA and ICGC provide
abundant data, and integrative tools such as UCSC Xena, which collects, analyzes, and visualizes
the data, greatly facilitate this area.
Past and Ongoing Research Projects
• SNP array quality control [6]
Genome-wide SNP arrays contain hundreds of thousands of SNPs, and array quality
plays an important role in downstream analysis. We defined a quality index for each array
by quantifying the overall deviation of the individual-based allele frequencies from
reference frequencies. Our method can successfully detect poor-quality SNP arrays. A
software package called SAQC (written in R and R-GUI) is provided as a quality evaluation and
visualization tool.
• Meta-analysis suite for transcriptomic data [7]
Transcriptomic data have many applications, for example, finding genes
differentially expressed between conditions or groups, quality assessment, clustering,
classification, and biological pathway analysis. Many meta-analysis tools have been developed
for each of these purposes. We have built a comprehensive suite called “MetaOmics”
(written in R and Shiny) that provides an interactive graphical user interface for biologists.
• Statistical metabolomics tool [8]
Metabolomics data provide opportunities to decipher metabolic mechanisms. Data quality
is a known concern in metabolomics data and must be addressed appropriately
along with downstream statistical analysis. We developed an R tool called “SMART”, which
can analyze input files in different formats, visually represent various types of data
features, implement peak alignment and annotation, conduct quality control for samples
and peaks, explore batch effects, and perform association analysis.
Future Research Plan
I am happy to work closely with biologists and to learn from their data what kinds of tools they
need. Data from genome-wide experiments will foreseeably grow in quantity,
dimension, and complexity. As a statistician, I am eager to develop useful statistical tools to help
biologists analyze, visualize, and interpret these data.
Besides the topics mentioned above, I am also interested in unsupervised and supervised machine
learning problems. Many machine learning algorithms have been proposed in different fields,
and I would like to explore whether any of them can be applied to biological data.
2. Collaboration/Application Research
I have had many opportunities to collaborate with scientists in different research areas, such as
cardiovascular epidemiology, psychiatry, and cancer biology. These experiences have been great
training for understanding important biological questions and translating statistical language
for biologists. I am eager to work with biologists to help them decipher the underlying
pathological mechanisms and to be inspired to develop useful statistical methodology.
Past and Ongoing Research Projects
• Age effect in human orbitofrontal cortex [9, 10, 11, 12, 13, 14]
We performed genotyping and gene expression analyses on two brain regions (BA47 and
BA11) of 209 healthy postmortem brain samples (ages 16 to 91) to investigate
molecular mechanisms and genetic modulation in the brain during aging [9]. We
defined a “delta age” measure as the individual deviation of molecular age from
chronological age, which reflects accelerated or delayed aging for each individual. Finally,
we performed a GWAS and developed a polygenic risk score to investigate the genetic
modulation that predicts delta age.
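The two quantities named above can be written down in a few lines. The sketch below is illustrative only: it assumes delta age is molecular age minus chronological age (so positive values indicate accelerated aging), and it assumes the common additive form of a polygenic score, a weighted sum of allele dosages. How molecular age was estimated and how SNP weights were chosen in [9] is not described here, and the function names are hypothetical.

```python
def delta_age(molecular_age, chronological_age):
    """Deviation of molecular age from chronological age.

    Assumed sign convention: positive means accelerated aging,
    negative means delayed aging.
    """
    return molecular_age - chronological_age

def polygenic_score(dosages, weights):
    """Additive polygenic score: weighted sum of allele dosages.

    dosages: copies of the effect allele per SNP (0, 1, or 2).
    weights: per-SNP effect sizes, e.g. from a GWAS (assumed given).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))
```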
The same dataset was used to investigate other questions as well. We conducted an
isoform-specific analysis of the KALRN gene (involved in regulation of the actin cytoskeleton
within dendrites) [10]. Overexpression of two isoforms, KAL9 and KAL12, was hypothesized
to be associated with age. Our analysis indicated that the age effect is significant but modest.
Our work also concluded that global KALRN expression analysis might be misleading and that
future studies should focus on isoform-specific quantification.
We also investigated the VSNL1 gene, a peripheral biomarker for Alzheimer’s
disease (AD). We found that VSNL1 was significantly co-expressed with genes in pathways for
calcium signaling, AD, long-term potentiation, long-term depression, and trafficking of
AMPA receptors [11]. These findings provide an unbiased link between VSNL1 and
molecular mechanisms of AD, including pathways implicated in synaptic pathology in AD.
Another gene, FREM3, is known from GWAS to be associated with major depressive disorder
(MDD). We investigated how nearby SNPs affect the FREM3 expression level in the
brain [12]. Our work suggested that common genetic variation associated with
reduced FREM3 expression may confer risk for accelerated aging.
BDNF and SST expression levels are known to decrease robustly with age, and lower
expression levels of both genes have been observed in many brain disorders. However, the
underlying mechanism that decreases the expression of both genes is unknown, and
our work suggests that DNA methylation may be the proximal mechanism [13]. In addition,
there is a consistent set of age-related genes, and in another study we
extended the view from these two genes to global age-related genes [14]. Again, we
investigated whether DNA methylation is the underlying mechanism for genes that undergo
age-related changes.
• Major Depressive Episode/Disorder [15, 16, 17, 18]
We proposed a gene coexpression analysis to investigate biological pathways associated
with predisposition to antidepressant treatment response and their regulation by microRNAs in
major depressive episode (MDE) samples and control samples [15]. Our work underlines
the importance of inflammation-related pathways and the involvement of a large miRNA
program as biological processes associated with antidepressant treatment response.
We investigated the neurobiological abnormalities related to late-life depression (LLD) using a
peripheral proteomic panel and structural brain imaging in LLD patients and control
samples [16]. We found that differentially expressed proteins are enriched in biological pathways
related to abnormal immune-inflammatory control, cell survival and proliferation,
proteostasis control, lipid metabolism, and intracellular signaling, which increase brain and
systemic allostatic load, leading to the downstream negative outcomes of LLD.
Using the same proteomic dataset, we investigated whether a systemic molecular pattern
associated with aging (senescent-associated secretory phenotype [SASP]) is elevated in
adults with LLD [17]. Our results suggest that individuals with LLD display enhanced
aging-related molecular patterns that are associated with higher medical comorbidity and worse
cognitive function.
In another study, we analyzed 11 transcriptomic datasets from human postmortem brains with
MDD [18]. We developed a meta-analytic clustering method to identify coexpression
modules across the 11 studies. We further incorporated information from GWAS
of brain disorders and identified a module consistently and significantly associated
with MDD and other complex brain disorders. Our work demonstrates the importance of
integrating transcriptome data and incorporating GWAS results to decipher the molecular
pathology of MDD and other complex brain disorders.
Future Research Plan
In my future research, I will continue to actively seek collaboration opportunities with local
biologists. I am particularly interested in working on applied problems that also inspire my
methodological research.
References
[1] Lin, C.-W.*, Liao, G.*, Lee, M. L. T., Park, Y., & Tseng, G. C. (2017). SeqDesign: A framework
for RNA-Seq genome-wide power calculation and experimental design issues. (in
preparation)
[2] Lin, C.-W.*, Liu, P.*, Park, Y., & Tseng, G. C. (2017). MethylSeqDesign: A framework for
Methyl-Seq genome-wide power calculation and experimental design issues. (in
preparation)
[3] Kim, S., Lin, C.-W., & Tseng, G. C. (2016). MetaKTSP: A Meta-Analytic Top Scoring Pair
Method for Robust Cross-Study Validation of Omics Prediction Analysis. Bioinformatics,
32(March), btw115.
[4] Yang, H.-C., Wang, P.-L., Lin, C.-W., Chen, C.-H., & Chen, C.-H. (2012). Integrative analysis
of single nucleotide polymorphisms and gene expression efficiently distinguishes samples
from closely related ethnic populations. BMC Genomics, 13(1), 346.
[5] Yang, H.-C., Lin, C.-W., Chen, C.-W., & Chen, J. (2014). Applying genome-wide gene-based
expression quantitative trait locus mapping to study population ancestry and
pharmacogenetics. BMC Genomics, 15(1), 319.
[6] Yang, H.-C., Lin, H.-C., Kang, M., Chen, C.-H., Lin, C.-W., Li, L.-H., … Pan, W.-H. (2011). SAQC:
SNP array quality control. BMC Bioinformatics, 12, 100.
[7] Ma, T., Huo, Z., Kuo, A., Zeng, X., Zhu, L., Fang, A., Wang, L., Lin, C. W., Rahman, T., Liu, S.,
Park, Y., Kim, S., Li, J., Chang, L. C., Song, C., & Tseng, G. C. (2017) MetaOmics - a
Comprehensive Software Suite with Interactive Visualization for Transcriptomic Meta-
Analysis. (in preparation)
[8] Liang, Y. J., Lin, Y. T., Chen, C. W., Lin, C. W., Chao, K. M., Pan, W. H., & Yang, H. C. (2016).
SMART: Statistical Metabolomics Analysis - An R Tool. Analytical Chemistry, 88(12), 6334–
6341.
[9] Lin, C.-W., Chang, L. C., Ma, T., Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2017). Genetic
Modulation of Brain Molecular Aging. (In preparation)
[10] Grubisha, M. J.*, Lin, C.-W.*, Tseng, G. C., Penzes, P., Sibille, E., & Sweet, R. A. (2016).
Age-dependent increase in Kalirin-9 and Kalirin-12 transcripts in human orbitofrontal cortex.
European Journal of Neuroscience, 44(7), 2483–2492.
[11] Lin, C. W., Chang, L. C., Tseng, G. C., Kirkwood, C. M., Sibille, E. L., & Sweet, R. A. (2015).
VSNL1 co-expression networks in aging include calcium signaling, synaptic plasticity, and
Alzheimer’s disease pathways. Frontiers in Psychiatry, 6(MAR), 30.
[12] Nikolova, Y. S., Iruku, S. P., Lin, C.-W., Conley, E. D., Puralewski, R., French, B., … Sibille, E.
(2015). FRAS1-related extracellular matrix 3 (FREM3) single-nucleotide polymorphism
effects on gene expression, amygdala reactivity and perceptual processing speed: An
accelerated aging pathway of depression risk. Frontiers in Psychology, 6(September), 1377.
[13] McKinney, B. C., Lin, C.-W., Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2015).
Hypermethylation of BDNF and SST Genes in the Orbital Frontal Cortex of Older
Individuals: A Putative Mechanism for Declining Gene Expression with Age.
Neuropsychopharmacology : Official Publication of the American College of
Neuropsychopharmacology, 40(11), 2604–13.
[14] McKinney, B. C.*, Lin, C.-W.*, Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2017). DNA
Methylation in the Human Frontal Cortex Reveals a Putative Mechanism for Age-by-Disease
Interactions. (In preparation)
[15] Belzeaux, R., Lin, C.-W., Ding, Y., Bergon, A., Ibrahim, E. C., Turecki, G., … Sibille, E. (2016).
Predisposition to treatment response in major depressive episode: A peripheral blood
gene coexpression network analysis. Journal of Psychiatric Research, 81, 119–126.
[16] Diniz, B. S., Lin, C. W., Sibille, E., Tseng, G., Lotrich, F., Aizenstein, H. J., … Butters, M. A.
(2016). Circulating biosignatures of late-life depression (LLD): Towards a comprehensive,
data-driven approach to understanding LLD pathophysiology. Journal of Psychiatric
Research, 82, 1–7.
[17] Diniz, B. S., Reynolds, C. F., Sibille, E., Lin, C.-W., Tseng, G., Lotrich, F., … Butters, M. A.
(2016). Enhanced Molecular Aging in Late-Life Depression: the Senescent Associated
Secretory Phenotype. The American Journal of Geriatric Psychiatry.
[18] Chang, L. C., Jamain, S., Lin, C. W., Rujescu, D., Tseng, G. C., & Sibille, E. (2014). A conserved
BDNF, glutamate- and GABA-enriched gene module related to human depression
identified by coexpression meta-analysis and DNA variant genome-wide association
studies. PLoS ONE, 9(3), e90980.

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...
Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...
Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
EMBL-EBI
EMBL-EBIEMBL-EBI
EMBL-EBI
 
Sequence Analysis
Sequence AnalysisSequence Analysis
Sequence Analysis
 
Ncbi
NcbiNcbi
Ncbi
 
Gene Expression Omnibus (GEO)
Gene Expression Omnibus (GEO)Gene Expression Omnibus (GEO)
Gene Expression Omnibus (GEO)
 
Protein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural AlignmentProtein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural Alignment
 
Knowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional PredictionsKnowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional Predictions
 
Ionomics
IonomicsIonomics
Ionomics
 
2 d gel electrophoresis
2 d gel electrophoresis2 d gel electrophoresis
2 d gel electrophoresis
 
2 d gel electrophresis
2 d gel electrophresis2 d gel electrophresis
2 d gel electrophresis
 
Massively Parallel Signature Sequencing (MPSS)
Massively Parallel Signature Sequencing (MPSS) Massively Parallel Signature Sequencing (MPSS)
Massively Parallel Signature Sequencing (MPSS)
 
Protein data bank
Protein data bankProtein data bank
Protein data bank
 
ILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptxILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptx
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In Bioinformatics
 
Ppt snp detection
Ppt snp detectionPpt snp detection
Ppt snp detection
 
Protein database
Protein databaseProtein database
Protein database
 
Nanopore sequencing (NGS)
Nanopore sequencing (NGS)Nanopore sequencing (NGS)
Nanopore sequencing (NGS)
 
Molecular diagnostic techniques
Molecular diagnostic techniques Molecular diagnostic techniques
Molecular diagnostic techniques
 
Protein protein interaction, functional proteomics
Protein protein interaction, functional proteomicsProtein protein interaction, functional proteomics
Protein protein interaction, functional proteomics
 

Destaque

GLASS, Ian (Mar15)
GLASS, Ian (Mar15)GLASS, Ian (Mar15)
GLASS, Ian (Mar15)
ian glass
 

Destaque (19)

Vilket programmerings språk borde jag lära?
Vilket programmerings språk borde jag lära?Vilket programmerings språk borde jag lära?
Vilket programmerings språk borde jag lära?
 
Tangki sumpit & mixing tank bio seven
Tangki sumpit & mixing tank bio sevenTangki sumpit & mixing tank bio seven
Tangki sumpit & mixing tank bio seven
 
Quebrando muros - Dafiti tech
Quebrando muros - Dafiti techQuebrando muros - Dafiti tech
Quebrando muros - Dafiti tech
 
LOGOS_Philosophy_iunie_2016_47to58
LOGOS_Philosophy_iunie_2016_47to58LOGOS_Philosophy_iunie_2016_47to58
LOGOS_Philosophy_iunie_2016_47to58
 
Micro c lab8(serial communication)
Micro c lab8(serial communication)Micro c lab8(serial communication)
Micro c lab8(serial communication)
 
Minecraft - A História
Minecraft - A HistóriaMinecraft - A História
Minecraft - A História
 
PC DOORS OPEN
PC DOORS OPENPC DOORS OPEN
PC DOORS OPEN
 
IV Mostra Cultural e Educativa
IV Mostra Cultural e EducativaIV Mostra Cultural e Educativa
IV Mostra Cultural e Educativa
 
IAP Certificate
IAP CertificateIAP Certificate
IAP Certificate
 
Actividad número 8 wilson rios
Actividad número 8 wilson riosActividad número 8 wilson rios
Actividad número 8 wilson rios
 
Examen final c1 v
Examen final c1   vExamen final c1   v
Examen final c1 v
 
CRM - Schlüssel zum Erfolg?
CRM - Schlüssel zum Erfolg?CRM - Schlüssel zum Erfolg?
CRM - Schlüssel zum Erfolg?
 
Stick diagram basics
Stick diagram basicsStick diagram basics
Stick diagram basics
 
Evaluation one comparing
Evaluation one comparing Evaluation one comparing
Evaluation one comparing
 
GLASS, Ian (Mar15)
GLASS, Ian (Mar15)GLASS, Ian (Mar15)
GLASS, Ian (Mar15)
 
Presentation on nokia
Presentation on nokiaPresentation on nokia
Presentation on nokia
 
Research Methodology
Research MethodologyResearch Methodology
Research Methodology
 
Product Development
Product DevelopmentProduct Development
Product Development
 
Tata Tinplate Company of India Ltd
Tata Tinplate Company of India LtdTata Tinplate Company of India Ltd
Tata Tinplate Company of India Ltd
 

Semelhante a Research Statement Chien-Wei Lin

2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
Michael Atkins
 

Semelhante a Research Statement Chien-Wei Lin (20)

COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONCOMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
 
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
 
GENE-GENE INTERACTION ANALYSIS IN ALZHEIMER
GENE-GENE INTERACTION ANALYSIS IN ALZHEIMERGENE-GENE INTERACTION ANALYSIS IN ALZHEIMER
GENE-GENE INTERACTION ANALYSIS IN ALZHEIMER
 
GENE-GENE INTERACTION ANALYSIS IN ALZHEIMER
GENE-GENE INTERACTION ANALYSIS IN ALZHEIMERGENE-GENE INTERACTION ANALYSIS IN ALZHEIMER
GENE-GENE INTERACTION ANALYSIS IN ALZHEIMER
 
Evaluation of Logistic Regression and Neural Network Model With Sensitivity A...
Evaluation of Logistic Regression and Neural Network Model With Sensitivity A...Evaluation of Logistic Regression and Neural Network Model With Sensitivity A...
Evaluation of Logistic Regression and Neural Network Model With Sensitivity A...
 
Math, Stats and CS in Public Health and Medical Research
Math, Stats and CS in Public Health and Medical ResearchMath, Stats and CS in Public Health and Medical Research
Math, Stats and CS in Public Health and Medical Research
 
Machine learning in biology
Machine learning in biologyMachine learning in biology
Machine learning in biology
 
Semantic Web for Health Care and Biomedical Informatics
Semantic Web for Health Care and Biomedical InformaticsSemantic Web for Health Care and Biomedical Informatics
Semantic Web for Health Care and Biomedical Informatics
 
Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...
 
How to analyse large data sets
How to analyse large data setsHow to analyse large data sets
How to analyse large data sets
 
Computational biology
Computational biologyComputational biology
Computational biology
 
woot2
woot2woot2
woot2
 
C0344023028
C0344023028C0344023028
C0344023028
 
A Survey on Various Disease Prediction Techniques
A Survey on Various Disease Prediction TechniquesA Survey on Various Disease Prediction Techniques
A Survey on Various Disease Prediction Techniques
 
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
 
Enhancing Genomic Insights: 40 Pivotal Use Cases of Data Science and Machine ...
Enhancing Genomic Insights: 40 Pivotal Use Cases of Data Science and Machine ...Enhancing Genomic Insights: 40 Pivotal Use Cases of Data Science and Machine ...
Enhancing Genomic Insights: 40 Pivotal Use Cases of Data Science and Machine ...
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
 
INBIOMEDvision Workshop at MIE 2011. Victoria López
INBIOMEDvision Workshop at MIE 2011. Victoria LópezINBIOMEDvision Workshop at MIE 2011. Victoria López
INBIOMEDvision Workshop at MIE 2011. Victoria López
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 

Research Statement Chien-Wei Lin

  • 1. Chien-Wei (Masaki) Lin 1 Research Statement During my past working experiences and current PhD program training, I have been working on many methodology and collaborative researches in multi-omics high-throughput data (both microarray and sequencing data). Inspired by the real data, I am particularly interested in developing methodology in statistical genetics, bioinformatics, power and sample size calculation tool, integrative/meta-analysis methods for different omics data type, and supervised/unsupervised machine learning applications. For collaboration, I have been working closely with scientists in many fields, such as cardiovascular epidemiology, psychiatry, and cancer biology. I have experiences in various types of omics data, including single nucleotide polymorphism (SNP), copy number variation (CNV), DNA methylation, gene expression, proteomics (peptide), and metabolomics data. In the rest of the statement, I will summarize my past and ongoing research projects and future research plans. 1. Methodology Research 1.1. Power calculation tool in sequencing data Unlike earlier fluorescence-based technologies such as microarray, modelling of next generation sequencing (NGS) data should consider discrete count data. In addition to sample size, sequencing depth is also directly related to the experimental cost. Consequently, given total budget and pre-specified unit experimental cost, the experimental design issue in NGS data is conceptually a more complex multi-dimensional constrained optimization problem rather than one-dimensional sample size calculation in traditional hypothesis setting. Past and Ongoing Research Projects • Power calculation tool in RNA-Seq and Methyl-Seq data [1, 2] Most existing methods focus on single gene formula, that is, given the type-I error, effect size, and sample size for one gene, and then derive the corresponding statistical power based on a certain association test. 
However, in high-throughput data, statistical power should be considered for thousands of genes simultaneously, where genome-wide versions of type-I error and power are considered as false discovery rate and expected discovery rate. We proposed two statistical frameworks, namely ``SeqDesign'' [1] and ``MethylSeqDesign” [2], to utilize pilot data for power calculation and experimental design of RNA-Seq and Methyl-Seq experiments, respectively. The approach is based on mixture model fitting of p-value distribution from pilot data and a parametric bootstrap procedure based on approximated Wald test statistics to infer genome-wide power for optimal sample size and sequencing depth. Our method contributed in the following ways:
  • 2. Chien-Wei (Masaki) Lin 2 - Our method modeled count data adequately, and considers false discover rate to control type-I error. By incorporating pilot data, our method can reflect the characteristics of target experiments. - We provided intuitive visualization tool to guide various practical experimental designs for practitioners. - We performed simulations and real data applications to evaluate the performance of our methods and compared to existing methods. We showed that our method outperforms the others. Future Research Plan The technology evolves rapidly. As new type of omics experiment develops, challenges for power calculation and design issues will arise. I would like to explore possibilities to extend this statistical framework to other type of omics data, such as single-cell experiments, in the future. 1.2. Meta-analysis and integrative analysis Nowadays, many scientific findings suffer from low reproducibility/reliability, that is, the findings could not be replicated in another cohort even under the same/similar experimental conditions, which mostly due to complexity of omics data analysis and biological variation. In other words, reliable findings across multiple studies are more desirable. Meta-analysis has been successfully used to achieve this goal by combining effect sizes/p-values from multiple studies. Moreover, based on central dogma of molecular biology, different types of omics data work jointly as a system. Therefore, biological findings/conclusions that draw from multiple omics data have higher reliability and interpretability by its nature. More and more databases now become available to greatly facilitate these type of analysis; for example, dbGaP for genetic variants data, NCBI GEO database for gene expression data, and TCGA/ICGC database for pan cancer data. 
Past and Ongoing Research Projects
• Meta-Analytic Robust Classifier [3]
In biomedical research, predicting disease diagnosis, prognosis, or survival is an important application. Robust and interpretable classifiers are usually favored for their clinical and translational potential. The top scoring pair (TSP) algorithm is one example: it applies a simple rank-based rule to identify rank-altered gene pairs for classifier construction. However, many classification algorithms, including TSP, perform poorly in cross-study validation. Hence, I participated in developing a meta-analytic top scoring pair (MetaKTSP) framework that combines data from multiple transcriptomic studies to generate a single robust prediction model. Our main findings are as follows:
- We conducted simulation analyses to compare our method with other popular single-study-based methods. The results showed that our method outperforms the others.
- We showed that in real data, the biomarkers we identified from multiple studies have robust prediction power and better biological interpretation.
• Integrative analysis of SNP and gene expression [4, 5]
It has been shown that SNPs are informative for tracing the ancestral ethnicity of individuals. However, they are useful only for classifying distinct continental origins and cannot discriminate individuals from closely related ancestral lineages. We found that gene expression data also supply ethnic information that is supplemental to SNPs [4]. Our contributions are summarized below:
- To the best of our knowledge, ours is the first study to integrate SNP and gene expression data to aid classification of subjects from closely related ethnic populations.
- By integrating SNP and gene expression data, we can construct an ancestral prediction model with a reduced number of markers and higher accuracy.
Expression quantitative trait loci (eQTL) analysis has become popular because of its functional meaning (SNPs regulate gene expression). However, most such analyses are single-locus based. We used the partial least squares (PLS) method to investigate the roles of gene-based eQTL in ancestral ethnicity and pharmacogenetics [5]. We observed that ancestry information is enriched in eQTL and can be used to construct prediction models that distinguish subjects from closely related ethnic populations. We also identified 2 ancestry-informative eQTL associated with adverse drug reactions and/or drug response.
Future Research Plan
Data science (a.k.a. "Big data") is an emerging discipline. As a statistician, I find the question of how to properly extract information from big data very attractive. I am keen to develop methods for meta-analysis in various types of statistical problems and for integrating various types of data (including multi-omics data and brain imaging data).
1.3. General Bioinformatics Problems
Bioinformatics is an interdisciplinary field that provides software and tools to help biologists understand biological data better and more deeply. The applications are wide-ranging and the demand keeps growing. For example, databases such as TCGA and ICGC, which provide abundant data, and integrative tools such as UCSC Xena, which collects, analyzes, and visualizes the data, greatly facilitate this area.
Past and Ongoing Research Projects
• SNP array quality control [6]
Genome-wide SNP arrays contain hundreds of thousands of SNPs. Array quality plays an important role in downstream analysis. We defined a quality index for each array by quantifying the overall deviation of the individual-based allele frequencies from reference frequencies. Our method can successfully detect poor-quality SNP arrays. Software called SAQC (written in R and R-GUI) is provided as a quality evaluation and visualization tool.
• Meta-analysis suite for transcriptomic data [7]
Transcriptomic data have many applications, for example, finding genes differentially expressed between conditions or groups, quality assessment, clustering, classification, and biological pathway analysis. Many meta-analysis tools have been developed for each of these purposes. We have built a comprehensive suite called "MetaOmics" (written in R and Shiny) that provides an interactive graphical user interface for biologists.
• Statistical metabolomics tool [8]
Metabolomics data provide opportunities to decipher metabolic mechanisms. Data quality is a known concern in metabolomics and must be addressed appropriately along with downstream statistical analysis. We developed an R tool called "SMART", which can analyze input files in different formats, visually represent various types of data features, implement peak alignment and annotation, conduct quality control for samples and peaks, explore batch effects, and perform association analysis.
Future Research Plan
I am happy to work closely with biologists and to learn from data what kinds of tools they need. It is easy to foresee that data from genome-wide experiments will grow in quantity, dimension, and complexity. As a statistician, I am eager to develop useful statistical tools to help biologists analyze, visualize, and interpret the data. Besides the topics mentioned above, I am also interested in unsupervised/supervised machine learning problems. Many machine learning algorithms have been proposed in different fields, and I would like to explore whether any of them can be applied to biological data.
2. Collaboration/Application Research
I have had many opportunities to collaborate with scientists in many different research areas, such as cardiovascular epidemiology, psychiatry, and cancer biology.
These experiences have been great training for me in understanding important biological questions and translating statistical language for biologists. I am eager to work with biologists to help them decipher the underlying pathological mechanisms and to be inspired to develop useful statistical methodology.
Past and Ongoing Research Projects
• Age effect in human orbitofrontal cortex [9, 10, 11, 12, 13]
We performed genotyping and gene expression analyses on two brain regions (BA47 and BA11) of 209 healthy postmortem brain samples (ages 16 to 91) to investigate molecular mechanisms and genetic modulation in the brain during the aging process [9]. We defined a "delta age" measure as the individual deviation of molecular age from chronological age, which reflects accelerated or delayed aging for each individual. Finally, we performed a GWAS analysis and developed a polygenic risk score to investigate genetic modulation that predicts delta age.
The same dataset was used to investigate other aspects as well. We conducted an isoform-specific analysis of the KALRN gene (involved in regulation of the actin cytoskeleton within dendrites) [10]. Overexpression of two isoforms, KAL9 and KAL12, was hypothesized to be associated with age. Our analysis indicated that the age effect is significant but modest. Our work also concluded that global KALRN expression analysis might be misleading and that future studies should focus on isoform-specific quantification.
We also investigated the VSNL1 gene, which is a peripheral biomarker for Alzheimer's disease (AD). We found that VSNL1 was significantly co-expressed with genes in pathways for calcium signaling, AD, long-term potentiation, long-term depression, and trafficking of AMPA receptors [11]. These findings provide an unbiased link between VSNL1 and molecular mechanisms of AD, including pathways implicated in synaptic pathology in AD.
Another gene, FREM3, was known to be associated with major depressive disorder (MDD) in GWAS studies. We investigated how nearby SNPs affect the FREM3 expression level in the brain [12]. Our work suggested that common genetic variation associated with reduced FREM3 expression may confer risk for accelerated aging.
BDNF and SST expression levels are known to decrease robustly with age, and lower expression levels of both genes have been observed in many brain disorders. However, the underlying mechanism that decreases the expression of both genes is unknown, and our work suggests that DNA methylation may be the proximal mechanism [13]. In addition, there is a consistent set of age-related genes, and in another study we extended the view from these two genes to global age-related genes [14].
Again, we investigated whether DNA methylation is the underlying mechanism for genes that undergo age-related changes.
• Major Depressive Episode/Disorder [15, 16, 17, 18]
We proposed a gene coexpression analysis to investigate biological pathways associated with predisposition to antidepressant treatment response and regulation by microRNAs in major depressive episode (MDE) samples and control samples [15]. Our work underlines the importance of inflammation-related pathways and the involvement of a large miRNA program as biological processes associated with antidepressant treatment response.
We investigated the neurobiological abnormalities related to late-life depression (LLD) using a peripheral proteomic panel and structural brain imaging in LLD patients and control samples [16]. We found that differentially expressed proteins are enriched in biological pathways related to abnormal immune-inflammatory control, cell survival and proliferation, proteostasis control, lipid metabolism, and intracellular signaling, which increase brain and systemic allostatic load, leading to the downstream negative outcomes of LLD.
Using the same proteomic dataset, we investigated whether a systemic molecular pattern associated with aging (senescent-associated secretory phenotype [SASP]) is elevated in adults with LLD [17]. Our results suggest that individuals with LLD display enhanced aging-related molecular patterns that are associated with higher medical comorbidity and worse cognitive function.
In another study, we used 11 transcriptomic datasets from human postmortem brains with MDD [18]. We developed a meta-analytic clustering method to identify coexpression modules across the 11 studies. We further incorporated information from GWAS studies of brain disorders and identified a module consistently and significantly associated with MDD and other complex brain disorders. Our work demonstrates the importance of integrating transcriptome data and incorporating GWAS results to decipher the molecular pathology of MDD and other complex brain disorders.
Future Research Plan
In my future research, I will continue to actively seek collaboration opportunities with local biologists. I am particularly interested in working on application problems that also inspire my methodology research.
References
[1] Lin, C.-W.*, Liao, G.*, Lee, M. L. T., Park, Y., & Tseng, G. C. (2017). SeqDesign: A framework for RNA-Seq genome-wide power calculation and experimental design issues. (In preparation)
[2] Lin, C.-W.*, Liu, P.*, Park, Y., & Tseng, G. C. (2017). MethylSeqDesign: A framework for Methyl-Seq genome-wide power calculation and experimental design issues. (In preparation)
[3] Kim, S., Lin, C.-W., & Tseng, G. C. (2016). MetaKTSP: A Meta-Analytic Top Scoring Pair Method for Robust Cross-Study Validation of Omics Prediction Analysis. Bioinformatics, 32(March), btw115.
[4] Yang, H.-C., Wang, P.-L., Lin, C.-W., Chen, C.-H., & Chen, C.-H. (2012). Integrative analysis of single nucleotide polymorphisms and gene expression efficiently distinguishes samples from closely related ethnic populations. BMC Genomics, 13(1), 346.
[5] Yang, H.-C., Lin, C.-W., Chen, C.-W., & Chen, J. (2014). Applying genome-wide gene-based expression quantitative trait locus mapping to study population ancestry and pharmacogenetics. BMC Genomics, 15(1), 319.
[6] Yang, H.-C., Lin, H.-C., Kang, M., Chen, C.-H., Lin, C.-W., Li, L.-H., … Pan, W.-H. (2011). SAQC: SNP array quality control. BMC Bioinformatics, 12, 100.
[7] Ma, T., Huo, Z., Kuo, A., Zeng, X., Zhu, L., Fang, A., Wang, L., Lin, C. W., Rahman, T., Liu, S., Park, Y., Kim, S., Li, J., Chang, L. C., Song, C., & Tseng, G. C. (2017). MetaOmics - a Comprehensive Software Suite with Interactive Visualization for Transcriptomic Meta-Analysis. (In preparation)
[8] Liang, Y. J., Lin, Y. T., Chen, C. W., Lin, C. W., Chao, K. M., Pan, W. H., & Yang, H. C. (2016). SMART: Statistical Metabolomics Analysis - An R Tool. Analytical Chemistry, 88(12), 6334–6341.
[9] Lin, C.-W., Chang, L. C., Ma, T., Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2017). Genetic Modulation of Brain Molecular Aging. (In preparation)
[10] Grubisha, M. J.*, Lin, C.-W.*, Tseng, G. C., Penzes, P., Sibille, E., & Sweet, R. A. (2016). Age-dependent increase in Kalirin-9 and Kalirin-12 transcripts in human orbitofrontal cortex. European Journal of Neuroscience, 44(7), 2483–2492.
[11] Lin, C. W., Chang, L. C., Tseng, G. C., Kirkwood, C. M., Sibille, E. L., & Sweet, R. A. (2015). VSNL1 co-expression networks in aging include calcium signaling, synaptic plasticity, and Alzheimer’s disease pathways. Frontiers in Psychiatry, 6(MAR), 30.
[12] Nikolova, Y. S., Iruku, S. P., Lin, C.-W., Conley, E. D., Puralewski, R., French, B., … Sibille, E. (2015). FRAS1-related extracellular matrix 3 (FREM3) single-nucleotide polymorphism effects on gene expression, amygdala reactivity and perceptual processing speed: An accelerated aging pathway of depression risk. Frontiers in Psychology, 6(September), 1377.
[13] McKinney, B. C., Lin, C.-W., Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2015). Hypermethylation of BDNF and SST Genes in the Orbital Frontal Cortex of Older Individuals: A Putative Mechanism for Declining Gene Expression with Age. Neuropsychopharmacology: Official Publication of the American College of Neuropsychopharmacology, 40(11), 2604–13.
[14] McKinney, B. C.*, Lin, C.-W.*, Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2017). DNA Methylation in the Human Frontal Cortex Reveals a Putative Mechanism for Age-by-Disease Interactions. (In preparation)
[15] Belzeaux, R., Lin, C.-W., Ding, Y., Bergon, A., Ibrahim, E. C., Turecki, G., … Sibille, E. (2016). Predisposition to treatment response in major depressive episode: A peripheral blood gene coexpression network analysis. Journal of Psychiatric Research, 81, 119–126.
[16] Diniz, B. S., Lin, C. W., Sibille, E., Tseng, G., Lotrich, F., Aizenstein, H. J., … Butters, M. A. (2016). Circulating biosignatures of late-life depression (LLD): Towards a comprehensive, data-driven approach to understanding LLD pathophysiology. Journal of Psychiatric Research, 82, 1–7.
[17] Diniz, B. S., Reynolds, C. F., Sibille, E., Lin, C.-W., Tseng, G., Lotrich, F., … Butters, M. A. (2016). Enhanced Molecular Aging in Late-Life Depression: the Senescent Associated Secretory Phenotype. The American Journal of Geriatric Psychiatry.
[18] Chang, L. C., Jamain, S., Lin, C. W., Rujescu, D., Tseng, G. C., & Sibille, E. (2014). A conserved BDNF, glutamate- and GABA-enriched gene module related to human depression identified by coexpression meta-analysis and DNA variant genome-wide association studies. PLoS ONE, 9(3), e90980.