1. Chien-Wei (Masaki) Lin
Research Statement
Through my past work experience and current PhD training, I have worked on both
methodological and collaborative research in multi-omics high-throughput data (both
microarray and sequencing data). Inspired by real data, I am particularly interested in
developing methodology in statistical genetics, bioinformatics, power and sample size
calculation tools, integrative/meta-analysis methods for different omics data types, and
supervised/unsupervised machine learning applications. In collaboration, I have worked
closely with scientists in many fields, such as cardiovascular epidemiology, psychiatry, and
cancer biology. I have experience with various types of omics data, including single nucleotide
polymorphism (SNP), copy number variation (CNV), DNA methylation, gene expression,
proteomics (peptide), and metabolomics data. In the rest of this statement, I summarize my
past and ongoing research projects and future research plans.
1. Methodology Research
1.1. Power calculation tool in sequencing data
Unlike earlier fluorescence-based technologies such as microarrays, models for next generation
sequencing (NGS) data must account for discrete counts. In addition to sample size,
sequencing depth is also directly related to the experimental cost. Consequently, given a total
budget and pre-specified unit experimental costs, experimental design for NGS data is
conceptually a complex multi-dimensional constrained optimization problem rather than the
one-dimensional sample size calculation of the traditional hypothesis testing setting.
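As a rough illustration of the design space described above, the following Python sketch enumerates, for each candidate sequencing depth, the largest sample size that fits a fixed budget, and then picks the design that maximizes a user-supplied power function. The cost model (a fixed per-sample cost plus a per-million-reads cost), the function names, and the toy power curve are all assumptions made for illustration; they are not taken from any published tool.

```python
import math


def feasible_designs(budget, cost_per_sample, cost_per_million_reads, depths_in_millions):
    """For each candidate depth, return (largest affordable sample size, depth)."""
    designs = []
    for depth in depths_in_millions:
        # Assumed cost model: fixed cost per sample plus a cost proportional to depth.
        unit_cost = cost_per_sample + cost_per_million_reads * depth
        n_samples = int(budget // unit_cost)
        if n_samples >= 2:  # at least two samples are needed for any comparison
            designs.append((n_samples, depth))
    return designs


def best_design(designs, power_fn):
    """Pick the feasible design with the highest value of power_fn(n_samples, depth)."""
    return max(designs, key=lambda design: power_fn(*design))


def toy_power(n_samples, depth):
    # Hypothetical placeholder curve, NOT a real power model: it only increases
    # with the total number of reads so that the example has something to optimize.
    return 1.0 - math.exp(-n_samples * depth / 500.0)


designs = feasible_designs(10_000, 200, 10, [10, 20, 40])
print(designs)
print(best_design(designs, toy_power))
```

In practice the placeholder `toy_power` would be replaced by a genome-wide power estimate, which is the hard part of the problem; the enumeration itself only makes the budget constraint explicit.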
Past and Ongoing Research Projects
• Power calculation tool in RNA-Seq and Methyl-Seq data [1, 2]
Most existing methods focus on a single-gene formula: given the type-I error, effect
size, and sample size for one gene, they derive the corresponding statistical power
based on a specific association test. In high-throughput data, however, statistical power
should be considered for thousands of genes simultaneously, where the genome-wide
versions of type-I error and power are the false discovery rate (FDR) and the expected
discovery rate (EDR).
We proposed two statistical frameworks, “SeqDesign” [1] and “MethylSeqDesign”
[2], that use pilot data for power calculation and experimental design of RNA-Seq and
Methyl-Seq experiments, respectively. The approach is based on mixture model fitting of
the p-value distribution from pilot data and a parametric bootstrap procedure based on
approximated Wald test statistics to infer genome-wide power for the optimal sample size and
sequencing depth. Our methods contribute in the following ways:
- Our methods model count data adequately and use the false discovery rate to
control genome-wide type-I error. By incorporating pilot data, they reflect the
characteristics of the target experiment.
- We provide an intuitive visualization tool to guide practitioners through various practical
experimental designs.
- We performed simulations and real data applications to evaluate the performance of
our methods against existing methods, and our methods outperformed the alternatives
in these evaluations.
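To make the genome-wide quantities above concrete, the sketch below applies the standard Benjamini-Hochberg step-up procedure to a list of p-values and computes the empirical discovery rate in a simulated setting where the truly differential genes are known. It only illustrates FDR control and the discovery rate; it does not implement the mixture model fitting or the parametric bootstrap used in SeqDesign or MethylSeqDesign, and the function names are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def empirical_discovery_rate(rejected, is_true_signal):
    """Fraction of truly non-null genes that were rejected (known only in simulation)."""
    n_true = sum(is_true_signal)
    if n_true == 0:
        return 0.0
    hits = sum(1 for r, t in zip(rejected, is_true_signal) if r and t)
    return hits / n_true


pvals = [0.01, 0.02, 0.03, 0.9]
truth = [True, True, True, False]  # simulated ground truth, an assumption of this example
calls = benjamini_hochberg(pvals, alpha=0.05)
print(calls, empirical_discovery_rate(calls, truth))
```

Averaging the empirical discovery rate over many simulated data sets gives an estimate of the expected discovery rate for a given design.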
Future Research Plan
Technology evolves rapidly. As new types of omics experiments emerge, new challenges in power
calculation and experimental design will arise. In the future, I would like to extend this statistical
framework to other types of omics data, such as single-cell experiments.
1.2. Meta-analysis and integrative analysis
Many scientific findings today suffer from low reproducibility/reliability; that is, the findings
cannot be replicated in another cohort even under the same or similar experimental conditions,
mostly due to the complexity of omics data analysis and biological variation. Findings that hold
across multiple studies are therefore more desirable. Meta-analysis has been successfully
used to achieve this goal by combining effect sizes or p-values from multiple studies. Moreover,
according to the central dogma of molecular biology, different types of omics data work jointly as a
system. Therefore, biological findings and conclusions drawn from multiple omics data types have
higher reliability and interpretability by nature.
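One classical way to combine p-values across studies is Fisher's method, shown in the minimal Python sketch below. It is given only as a textbook example of p-value combination and is not claimed to be the method used in the projects described here; the function name is hypothetical. Under the null hypothesis and with independent studies, the statistic -2 * sum(log p_i) follows a chi-square distribution with 2k degrees of freedom, whose upper tail has a closed form for even degrees of freedom.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values from k studies with Fisher's method."""
    if not pvalues:
        raise ValueError("at least one p-value is required")
    for p in pvalues:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # Fisher statistic divided by 2
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


print(fisher_combined_pvalue([0.05, 0.05]))
```

Two studies that are each only borderline significant yield a clearly smaller combined p-value, which is the sense in which consistent evidence across studies is rewarded.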
A growing number of public databases greatly facilitate these types of analysis; for
example, dbGaP for genetic variant data, the NCBI GEO database for gene expression data, and
the TCGA/ICGC databases for pan-cancer data.
Past and Ongoing Research Projects
• Meta-Analytic Robust Classifier [3]
In biomedical research, predicting disease diagnosis, prognosis, or survival is an important
application. Robust and interpretable classifiers are usually favored for their clinical and
translational potential. The top scoring pair (TSP) algorithm is one example: it applies a
simple rank-based rule to identify rank-altered gene pairs for classifier construction.
However, many classification algorithms, including TSP, perform poorly in cross-study
validation. I therefore participated in developing a meta-analytic top scoring pair
(MetaKTSP) framework that combines data from multiple transcriptomic studies to
generate a single robust prediction model. Our main findings are as follows:
- We conducted simulation analyses to compare our method with other popular single-study
methods. The results showed that our method outperforms them.
- We showed that, in real data, the biomarkers identified from multiple studies have
robust predictive power and better biological interpretation.
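The rank-based idea behind TSP can be sketched in a few lines of Python. The example below covers only the basic single-study, single-pair rule: it scores every gene pair by how differently the ordering "gene i is lower than gene j" occurs in the two classes and predicts from the best pair. It is not the MetaKTSP procedure, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def fit_top_scoring_pair(samples, labels):
    """samples: list of expression vectors; labels: 0 or 1 for each sample."""
    groups = {c: [s for s, y in zip(samples, labels) if y == c] for c in (0, 1)}
    if not groups[0] or not groups[1]:
        raise ValueError("both classes must be present")
    n_genes = len(samples[0])
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(n_genes), 2):
        # Frequency of the ordering x_i < x_j within each class.
        freq = {c: sum(s[i] < s[j] for s in groups[c]) / len(groups[c]) for c in (0, 1)}
        score = abs(freq[0] - freq[1])
        if score > best_score:
            best_score = score
            label_if_less = 0 if freq[0] > freq[1] else 1
            best_pair = (i, j, label_if_less)
    return best_pair, best_score


def predict(pair, sample):
    """Classify a new sample using only the relative order of the selected pair."""
    i, j, label_if_less = pair
    return label_if_less if sample[i] < sample[j] else 1 - label_if_less


# Toy data: genes 0 and 1 swap order between classes, gene 2 is uninformative.
train = [[1, 5, 0], [2, 6, 10], [5, 1, 0], [6, 2, 10]]
classes = [0, 0, 1, 1]
pair, score = fit_top_scoring_pair(train, classes)
print(pair, score, predict(pair, [0.5, 9, 4]))
```

Because the rule depends only on the within-sample ordering of two genes, it is insensitive to monotone normalization differences between platforms, which is part of its appeal for cross-study use.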
• Integrative analysis of SNP and gene expression [4, 5]
SNPs have been shown to be informative for tracing the ancestral ethnicity of individuals.
However, they are useful mainly for classifying distinct continental origins and cannot discriminate
individuals from closely related ancestral lineages. We found that gene expression data
also carry ethnic information that supplements SNPs [4]. Our contributions are
summarized below:
- To the best of our knowledge, ours is the first study to integrate SNP and gene
expression data to aid classification of subjects from closely related ethnic populations.
- By integrating SNP and gene expression data, we can construct an ancestry
prediction model with fewer markers and higher accuracy.
Expression quantitative trait loci (eQTL) analysis has become popular because of its functional
interpretation (a SNP regulates gene expression). However, most eQTL analyses are single-locus
analyses. We used the partial least squares (PLS) method to investigate the roles of
gene-based eQTL in ancestral ethnicity and pharmacogenetics [5]. We observed that ancestry
information is enriched in eQTL and can be used to construct prediction models that distinguish
subjects from closely related ethnic populations. We also identified 2 ancestry-informative eQTL
associated with adverse drug reactions and/or drug response.
Future Research Plan
Data science (a.k.a. “Big data”) is an emerging discipline. As a statistician, I find the question of
how to properly extract information from such large data sets very attractive. I am keen to develop
methods for meta-analysis in various types of statistical problems and for integrating various
types of data (including multi-omics data and brain imaging data).
1.3. General Bioinformatics Problems
Bioinformatics is an interdisciplinary field that provides software and tools to help biologists
understand biological data better and more deeply. Its applications are very broad, and the demand
keeps growing. For example, databases such as TCGA and ICGC provide
abundant data, and integrative tools such as UCSC Xena, which collects, analyzes, and visualizes the
data, greatly facilitate this area.
Past and Ongoing Research Projects
• SNP array quality control [6]
Genome-wide SNP arrays contain hundreds of thousands of SNPs. The quality of arrays
plays an important role in downstream analysis. We define a quality index for each array
by quantifying the overall deviation of the individual-based allele frequencies from
reference frequencies. Our method can successfully detect poor-quality SNP arrays. A
software package called SAQC (written in R and R-GUI) is provided as a quality evaluation and
visualization tool.
• Meta-analysis suite for transcriptomic data [7]
Transcriptomic data have many applications, for example, finding genes
differentially expressed across conditions/groups, quality assessment, clustering,
classification, and biological pathway analysis. Many meta-analysis tools have been developed
for each of these purposes. We have built a comprehensive suite called “MetaOmics”
(written in R and Shiny), which provides an interactive graphical user interface for biologists.
• Statistical metabolomics tool [8]
Metabolomics data provide opportunities to decipher metabolic mechanisms. Data quality
is a known concern in metabolomics data and must be appropriately addressed
along with downstream statistical analysis. We developed an R tool called “SMART”, which
can analyze input files in different formats, visually represent various types of data
features, implement peak alignment and annotation, conduct quality control for samples
and peaks, explore batch effects, and perform association analysis.
Future Research Plan
I enjoy working closely with biologists and learning from their data what kinds of tools they
need. It is easy to foresee that data from genome-wide experiments will grow in quantity,
dimension, and complexity. As a statistician, I am eager to develop useful statistical tools to help
biologists analyze, visualize, and interpret their data.
Besides the topics mentioned above, I am also interested in unsupervised/supervised machine
learning problems. Many machine learning algorithms have been proposed in different fields,
and I would like to explore whether any of them can be applied to biological data.
2. Collaboration/Application Research
I have had many opportunities to collaborate with scientists in different research areas, such as
cardiovascular epidemiology, psychiatry, and cancer biology. These experiences have been great
training in understanding important biological questions and translating statistical language
for biologists. I am eager to work with biologists to help them decipher underlying
pathological mechanisms and to be inspired to develop useful statistical methodology.
Past and Ongoing Research Projects
• Age effect in human orbitofrontal cortex [9, 10, 11, 12, 13, 14]
We performed genotyping and gene expression analyses on two brain regions (BA47 and
BA11) of 209 healthy postmortem brain samples (ages 16 to 91) to investigate
molecular mechanisms and genetic modulation in the brain during the aging process [9]. We
defined a “delta age” measure as the individual deviation of molecular age from
chronological age, which reflects accelerated or delayed aging for each individual. Finally,
we performed GWAS analysis and developed a polygenic risk score to investigate genetic
modulation that predicts delta age.
The same dataset was used to investigate other questions as well. We conducted an
isoform-specific analysis of the KALRN gene (involved in regulation of the actin cytoskeleton within
dendrites) [10]. Overexpression of two isoforms, KAL9 and KAL12, was hypothesized
to be associated with age. Our analysis indicated that the age effect is significant but modest. Our
work also concluded that global KALRN expression analysis might be misleading and that future
studies should focus on isoform-specific quantification.
We also investigated the VSNL1 gene, which is a peripheral biomarker for Alzheimer’s
disease (AD). We found that VSNL1 was significantly co-expressed with genes in pathways for
calcium signaling, AD, long-term potentiation, long-term depression, and trafficking of
AMPA receptors [11]. These findings provide an unbiased link between VSNL1 and
molecular mechanisms of AD, including pathways implicated in synaptic pathology in AD.
Another gene, FREM3, has been associated with major depressive disorder (MDD)
in GWAS. We investigated how nearby SNPs affect the FREM3 gene
expression level in the brain [12]. Our work suggested that common genetic variation associated with
reduced FREM3 expression may confer risk for accelerated aging.
BDNF and SST expression levels are known to decrease robustly with age, and lower
expression levels of both genes have been observed in many brain disorders. However, the
underlying mechanism that decreases the expression of both genes is unknown, and
our work suggests that DNA methylation may be the proximal mechanism [13]. In addition,
there is a consistent set of age-related genes, and in a separate study we
extended the view from these two genes to global age-related genes [14]. Again, we
investigated whether DNA methylation is the underlying mechanism for genes that undergo
age-related changes.
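The delta age measure and the polygenic risk score mentioned for the aging study [9] can be illustrated with a small Python sketch. The sign convention (molecular age minus chronological age, so that positive values indicate accelerated aging), the additive weighted-sum form of the score, and all names and numbers are assumptions for illustration; how molecular age is estimated and how the SNP weights are derived are not covered here.

```python
def delta_age(molecular_ages, chronological_ages):
    """Per-individual deviation of molecular age from chronological age.

    Assumed convention: positive values mean the tissue looks older than expected.
    """
    return [m - c for m, c in zip(molecular_ages, chronological_ages)]


def polygenic_score(allele_counts, weights):
    """Additive score: sum over SNPs of (effect-allele count) x (per-SNP weight)."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is required per SNP")
    return sum(count * weight for count, weight in zip(allele_counts, weights))


# Hypothetical values: two individuals and three SNPs coded as 0/1/2 effect alleles.
print(delta_age([70, 40], [65, 45]))
print(polygenic_score([0, 1, 2], [0.5, 0.25, 0.125]))
```

A score of this form can then be related to delta age across individuals, for example by regression, to ask whether common variants jointly predict accelerated or delayed molecular aging.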
• Major Depressive Episode/Disorder [15, 16, 17, 18]
We proposed a gene coexpression analysis to investigate biological pathways associated
with antidepressant treatment response predisposition and regulation by microRNAs in
major depressive episode (MDE) samples and control samples [15]. Our work underlines
the importance of inflammation-related pathways and the involvement of a large miRNA
program as biological processes associated with antidepressant treatment response.
We investigated the neurobiological abnormalities related to late-life depression (LLD) using a
peripheral proteomic panel and structural brain imaging in LLD patients and control
samples [16]. We found that differentially expressed proteins are enriched in biological pathways
related to abnormal immune-inflammatory control, cell survival and proliferation,
proteostasis control, lipid metabolism, and intracellular signaling, which increase brain and
systemic allostatic load, leading to the downstream negative outcomes of LLD.
Using the same proteomic dataset, we investigated if a systemic molecular pattern
associated with aging (senescent-associated secretory phenotype [SASP]) is elevated in
adults with LLD [17]. Our result suggests that individuals with LLD display enhanced aging-
related molecular patterns that are associated with higher medical comorbidity and worse
cognitive function.
In another study, we analyzed 11 transcriptomic datasets from human postmortem brains with
MDD [18]. We developed a meta-analytic clustering method to identify coexpression
modules across the 11 studies. We further incorporated information from GWAS
of brain disorders and identified a module consistently and significantly associated
with MDD and other complex brain disorders. Our work demonstrates the importance of
integrating transcriptome data and incorporating GWAS results to decipher the molecular
pathology of MDD and other complex brain disorders.
Future Research Plan
In my future research, I will continue to actively seek collaboration opportunities with local
biologists. I am particularly interested in working on applied problems that also inspire my
methodology research.
References
[1] Lin, C.-W.*, Liao, G.*, Lee, M. L. T., Park, Y., & Tseng, G. C. (2017). SeqDesign: A framework
for RNA-Seq genome-wide power calculation and experimental design issues. (in
preparation)
[2] Lin, C.-W.*, Liu, P.*, Park, Y., & Tseng, G. C. (2017). MethylSeqDesign: A framework for
Methyl-Seq genome-wide power calculation and experimental design issues. (in
preparation)
[3] Kim, S., Lin, C.-W., & Tseng, G. C. (2016). MetaKTSP: A Meta-Analytic Top Scoring Pair
Method for Robust Cross-Study Validation of Omics Prediction Analysis. Bioinformatics,
32(March), btw115.
[4] Yang, H.-C., Wang, P.-L., Lin, C.-W., Chen, C.-H., & Chen, C.-H. (2012). Integrative analysis
of single nucleotide polymorphisms and gene expression efficiently distinguishes samples
from closely related ethnic populations. BMC Genomics, 13(1), 346.
[5] Yang, H.-C., Lin, C.-W., Chen, C.-W., & Chen, J. (2014). Applying genome-wide gene-based
expression quantitative trait locus mapping to study population ancestry and
pharmacogenetics. BMC Genomics, 15(1), 319.
[6] Yang, H.-C., Lin, H.-C., Kang, M., Chen, C.-H., Lin, C.-W., Li, L.-H., … Pan, W.-H. (2011). SAQC:
SNP array quality control. BMC Bioinformatics, 12, 100.
[7] Ma, T., Huo, Z., Kuo, A., Zeng, X., Zhu, L., Fang, A., Wang, L., Lin, C.-W., Rahman, T., Liu, S.,
Park, Y., Kim, S., Li, J., Chang, L. C., Song, C., & Tseng, G. C. (2017). MetaOmics - a
Comprehensive Software Suite with Interactive Visualization for Transcriptomic
Meta-Analysis. (in preparation)
[8] Liang, Y. J., Lin, Y. T., Chen, C. W., Lin, C. W., Chao, K. M., Pan, W. H., & Yang, H. C. (2016).
SMART: Statistical Metabolomics Analysis - An R Tool. Analytical Chemistry, 88(12), 6334–
6341.
[9] Lin, C.-W., Chang, L. C., Ma, T., Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2017). Genetic
Modulation of Brain Molecular Aging. (In preparation)
[10] Grubisha, M. J.*, Lin, C.-W.*, Tseng, G. C., Penzes, P., Sibille, E., & Sweet, R. A. (2016).
Age-dependent increase in Kalirin-9 and Kalirin-12 transcripts in human orbitofrontal cortex.
European Journal of Neuroscience, 44(7), 2483–2492.
[11] Lin, C. W., Chang, L. C., Tseng, G. C., Kirkwood, C. M., Sibille, E. L., & Sweet, R. A. (2015).
VSNL1 co-expression networks in aging include calcium signaling, synaptic plasticity, and
Alzheimer’s disease pathways. Frontiers in Psychiatry, 6(MAR), 30.
[12] Nikolova, Y. S., Iruku, S. P., Lin, C.-W., Conley, E. D., Puralewski, R., French, B., … Sibille, E.
(2015). FRAS1-related extracellular matrix 3 (FREM3) single-nucleotide polymorphism
effects on gene expression, amygdala reactivity and perceptual processing speed: An
accelerated aging pathway of depression risk. Frontiers in Psychology, 6(September), 1377.
[13] McKinney, B. C., Lin, C.-W., Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2015).
Hypermethylation of BDNF and SST Genes in the Orbital Frontal Cortex of Older
Individuals: A Putative Mechanism for Declining Gene Expression with Age.
Neuropsychopharmacology: Official Publication of the American College of
Neuropsychopharmacology, 40(11), 2604–2613.
[14] McKinney, B. C.*, Lin, C.-W.*, Oh, H., Tseng, G. C., Lewis, D. A., & Sibille, E. (2017). DNA
Methylation in the Human Frontal Cortex Reveals a Putative Mechanism for
Age-by-Disease Interactions. (In preparation)
[15] Belzeaux, R., Lin, C.-W., Ding, Y., Bergon, A., Ibrahim, E. C., Turecki, G., … Sibille, E. (2016).
Predisposition to treatment response in major depressive episode: A peripheral blood
gene coexpression network analysis. Journal of Psychiatric Research, 81, 119–126.
[16] Diniz, B. S., Lin, C. W., Sibille, E., Tseng, G., Lotrich, F., Aizenstein, H. J., … Butters, M. A.
(2016). Circulating biosignatures of late-life depression (LLD): Towards a comprehensive,
data-driven approach to understanding LLD pathophysiology. Journal of Psychiatric
Research, 82, 1–7.
[17] Diniz, B. S., Reynolds, C. F., Sibille, E., Lin, C.-W., Tseng, G., Lotrich, F., … Butters, M. A.
(2016). Enhanced Molecular Aging in Late-Life Depression: the Senescent Associated
Secretory Phenotype. The American Journal of Geriatric Psychiatry.
[18] Chang, L. C., Jamain, S., Lin, C. W., Rujescu, D., Tseng, G. C., & Sibille, E. (2014). A conserved
BDNF, glutamate- and GABA-enriched gene module related to human depression
identified by coexpression meta-analysis and DNA variant genome-wide association
studies. PLoS ONE, 9(3), e90980.