1. The big-data analytics challenge –
combining statistical and algorithmic
perspectives
Anat Reiner-Benaim
Department of Statistics
University of Haifa
IDC, May 14, 2015
2. Outline
Data science:
◦ Definition?
◦ Who needs it?
◦ The elements of data science
Analysis:
◦ Modeling
◦ Software
Examples:
◦ Scheduling – prediction of runtime
◦ Genetics – detection of rare events
3. What is Data Science?
From Wikipedia:
“Data science is the study of the generalizable extraction of knowledge from data…
4. More from Wikipedia:
…builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing and high performance computing…
…goal: extracting meaning from data and creating data products…
…not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science.”
5. Data Science – who needs it?
Anyone who has (big) data, e.g.:
Cellular industry – phones, apps, advertisers
Internet: search engines, social media, marketing,
advertisers
Computer networks and server systems
Cyber security
Credit cards
Banks
Health care providers
Life science – genome, proteome…
TV and related
Weather forecast
6. The elements of data science
Store & preprocess (“big data technologies”):
• NoSQL database (e.g. Cassandra)
• DFS (Distributed File System) (e.g. Hadoop, Spark, GraphLab)
• Dump to an SQL database (e.g. MySQL, SAS-SQL)
Analyze (“big data analytics”):
• Apply sophisticated methods: statistical modeling, machine learning algorithms
7. Data Analysis – first, define the problem
◦ How can I decide that an item in a manufacturing process is faulty?
◦ What is the difference between the new machine and the old one?
◦ What are the factors that affect system load?
◦ How can I predict the memory/runtime of a program?
◦ How can I predict that a customer will churn?
◦ What is the chance that the phone/web user will click my advertisement?
◦ What is the chance that the current ATM user is committing fraud?
◦ What is the chance of snow this week?
9. Choosing models
Type of variables:
◦ Continuous, ordinal, categorical.
Statistical assumptions:
◦ Normality, equal variance, independence.
Missing data
Stability
10. Learning tools
Bootstrap
◦ Repeatedly fit the model on resampled data (see the sketch after this list).
Bagging (“bootstrap aggregation”)
◦ Combine bootstrap samples to prevent instability.
Boosting
◦ Combine a set of weak learners to create a single strong learner.
Regularization
◦ Solve over-fitting by restriction (e.g. limit regression to a linear or low-degree polynomial form).
Utility/cost function
◦ Evaluate performance, compare models.
These are typically iterative procedures, combined with the modeling procedures, that help optimize the model and evaluate its performance.
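To make the bootstrap idea concrete, here is a minimal base-R sketch (toy data; all names are illustrative and not from the talk): the model is refit on many resamples, and the spread of the refitted coefficients estimates their sampling variability.

```r
# Bootstrap a regression slope on simulated data (illustrative sketch)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
dat <- data.frame(x, y)

B <- 1000
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)       # resample rows with replacement
  coef(lm(y ~ x, data = dat[idx, ]))[2]  # refit the model on the resample
})

# Spread of the refitted slopes estimates the sampling variability:
quantile(boot_slopes, c(0.025, 0.975))
```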
11. More to consider –
Control statistical error due to large-scale analysis:
multiple statistical tests → inflated statistical error → control FDR?
FDR = the expected proportion of false findings (e.g. false “features”) among all findings.
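As a concrete illustration (not from the slides), the Benjamini–Hochberg FDR adjustment is available in base R via `p.adjust`; the p-values below are toy values:

```r
# Benjamini-Hochberg FDR adjustment in base R (toy p-values)
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
p_bh  <- p.adjust(pvals, method = "BH")  # BH-adjusted p-values
which(p_bh <= 0.10)                      # discoveries at FDR level 0.10
```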
12. The R software
Open-source programming language and environment for statistical computing.
Widely used among statisticians for developing statistical software (“packages”) and for data analysis.
Increasingly popular among all data professionals.
Advantages:
Contains the most up-to-date statistical models and machine-learning algorithms.
Methods are based on research, compiled and documented.
Contains Hadoop functions (package “rhdfs”).
Very convenient for plain programming, scripting, simulations, visualization.
Friendly interface (e.g. RStudio).
The R project site: https://www.r-project.org
14. Example 1: Classification of Job Runtime at Intel
Joint work with:
Anna Grabarnick, University of Haifa
Edi Shmueli, Intel
15. Job processing
(Diagram: users submit jobs to a job scheduler, which decides to which server and queue each job is sent.)
16. Job schedulers
Algorithms aimed at efficiently queuing and distributing jobs among servers, thereby improving system utilization.
Popular scheduling algorithms (e.g. backfilling) use information on how long the jobs are expected to run.
In serial job systems, scheduling performance can be improved by merely separating the short jobs from the long ones and assigning them to different queues in the system.
This helps reduce the likelihood that short jobs will be delayed after long ones, and thus improves overall performance.
18. The problem
Main purpose: classify jobs into “short” and “long” durations.
Questions:
◦ How can the classes be defined?
◦ How can the jobs be classified?
19. Available data
Two traces obtained from one of Intel’s data centers:
1. ~1 million jobs executed during a period of 10 consecutive days. Used for training.
2. ~755,000 jobs executed during a period of 7 consecutive days. Used for model validation.
Aside from runtime information, 9 categorical variables were available:
20.
TABLE I. ROUGH GROUPING OF THE 9 CATEGORICAL VARIABLES
Group | # of variables | Relates to | Example
A | 3 | Scheduling information | Resources requested by the job
B | 2 | Execution-specific information | Command line and arguments
C | 4 | Association information | Project and component

TABLE II. STATISTICS REGARDING THE CATEGORICAL VARIABLES
Variable | # of categories | # of missing (in training data)
A1 | 9 | 0
A2 | 7 | 0
A3 | 5 | 0
B1 | 44 | 173
B2 | 22 | 184
C1 | 2 | 0
C2 | 5 | 239
C3 | 6 | 184
C4 | 32 | 0
21. Analysis steps
Exploratory visualization of the data.
Class construction and characterization.
Classification:
◦ Choice of a classification model.
◦ Optimize the model.
◦ Validate the model.
26. Constructing classes by the mixture model
• The Gaussian (normal) mixture model has the form
$f(x) = \sum_{m=1}^{M} \alpha_m \, \phi(x; \mu_m, \Sigma_m)$,
with mixing proportions $\alpha_m$, $\sum_m \alpha_m = 1$.
• Each Gaussian density has a mean $\mu_m$ and covariance matrix $\Sigma_m$.
• The parameters are usually estimated by maximum likelihood using the EM algorithm.
27. Mixture distribution – parameter estimation
• We model the runtime $Y$ as a mixture of the two normal variables
$Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$ (the first component corresponding to the “short” jobs).
$Y$ can be defined by
$Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2$,
where $\Delta \in \{0, 1\}$ with $\mathbb{P}(\Delta = 1) = \pi$.
• Let $\phi_\theta(x)$ denote the normal density with parameters $\theta = (\mu, \sigma^2)$. Then the density of $Y$ is
$g_Y(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y)$.
• We fit this model to our data by maximum likelihood. The parameters are
$\theta = (\pi, \theta_1, \theta_2) = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$.
The log-likelihood based on $N$ training cases is
$\ell(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \log\left[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right]$.
28. Parameter estimation – cont’d
• Direct maximization of $\ell(\theta; \mathbf{Z})$ is quite difficult numerically. Instead, we consider unobserved latent variables $\Delta_i$ taking values 0 or 1 as earlier: if $\Delta_i = 1$ then $Y_i$ comes from distribution 2, otherwise it comes from distribution 1.
• Suppose we knew the values of the $\Delta_i$’s. Then the log-likelihood would be
$\ell(\theta; \mathbf{Z}, \Delta) = \sum_{i=1}^{N} \left[(1 - \Delta_i) \log \phi_{\theta_1}(y_i) + \Delta_i \log \phi_{\theta_2}(y_i)\right] + \sum_{i=1}^{N} \left[(1 - \Delta_i) \log(1 - \pi) + \Delta_i \log \pi\right]$,
and the maximum likelihood estimates of $\mu_1$ and $\sigma_1^2$ would be the sample mean and sample variance of the observations with $\Delta_i = 0$. Similarly, the estimates for $\mu_2$ and $\sigma_2^2$ would be the sample mean and sample variance of the observations with $\Delta_i = 1$.
29. Parameter estimation – cont’d
• Since the $\Delta_i$ values are actually unknown, we proceed in an iterative fashion, substituting for each $\Delta_i$ in the previous equation its expected value
$\gamma_i(\theta) = \mathbb{E}[\Delta_i \mid \theta, \mathbf{Z}] = \mathbb{P}(\Delta_i = 1 \mid \theta, \mathbf{Z})$,
which is also called the responsibility of model 2 for observation $i$.
• We use the following procedure, known as the EM algorithm, for the two-component Gaussian mixture (a worked sketch follows):
1. Take initial guesses for the parameters $\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2$ (see below).
2. Expectation step: compute the responsibilities
$\gamma_i = \dfrac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}, \quad i = 1, 2, \ldots, N$.
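The E-step above, together with the standard responsibility-weighted updates for the maximization step (the M-step slide is not reproduced in this extraction), can be coded directly. A minimal hand-rolled sketch on simulated runtimes, not the talk’s actual code:

```r
set.seed(42)
# Toy 'runtimes': a short component and a long component
y <- c(rnorm(600, mean = 1, sd = 0.5), rnorm(400, mean = 4, sd = 1))
N <- length(y)

# 1. Initial guesses: two random observations for the means, the overall
#    sample variance for both variances, and pi = 0.5.
mu1 <- sample(y, 1); mu2 <- sample(y, 1)
s2_1 <- s2_2 <- mean((y - mean(y))^2)
pi2 <- 0.5                          # mixing proportion of component 2

for (iter in 1:200) {
  # 2. E-step: responsibility of component 2 for each observation
  d1 <- dnorm(y, mu1, sqrt(s2_1))
  d2 <- dnorm(y, mu2, sqrt(s2_2))
  gamma <- pi2 * d2 / ((1 - pi2) * d1 + pi2 * d2)

  # 3. M-step: responsibility-weighted means, variances, mixing proportion
  mu1  <- sum((1 - gamma) * y) / sum(1 - gamma)
  mu2  <- sum(gamma * y) / sum(gamma)
  s2_1 <- sum((1 - gamma) * (y - mu1)^2) / sum(1 - gamma)
  s2_2 <- sum(gamma * (y - mu2)^2) / sum(gamma)
  pi2  <- mean(gamma)
}

c(pi = pi2, mu1 = mu1, mu2 = mu2, var1 = s2_1, var2 = s2_2)
```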
31. Parameter estimation – additional notes
• A simple choice of initial guesses for $\mu_1$ and $\mu_2$ is two randomly selected observations $y_i$. The overall sample variance $\sum_{i=1}^{N} (y_i - \bar{y})^2 / N$ can be used as an initial guess for both $\sigma_1^2$ and $\sigma_2^2$. The initial mixing proportion $\pi$ can be set to 0.5.
• Software: the “mixtools” R package was used for the mixture analysis, with the function “normalmixEM” for parameter and posterior probability (responsibility) estimation, as sketched below.
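A sketch of how such a mixtools call might look; `runtime` here is a stand-in for the observed job runtimes and is not defined in the slides:

```r
library(mixtools)

# Two-component Gaussian mixture fitted by EM:
fit <- normalmixEM(runtime, k = 2)

fit$lambda           # estimated mixing proportions (1 - pi, pi)
fit$mu               # estimated component means mu_1, mu_2
fit$sigma            # estimated component standard deviations
head(fit$posterior)  # responsibilities: posterior class probabilities

# Assign classes with a 0.5 posterior threshold, as on the next slide
# (assuming component 2 has the larger mean; check fit$mu):
cls <- ifelse(fit$posterior[, 2] > 0.5, "long", "short")
```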
32. • We obtain the following estimates:
(Table of fitted parameter estimates shown as a figure in the original slide.)
33.
• Each observation $i$ is assigned a posterior probability of belonging to each class:
$\dfrac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}, \quad i = 1, 2, \ldots, N$.
• For instance, using a probability threshold of 0.5, the runtimes partition into short (class 1, 60.56%) and long (class 2, 39.44%).
34. Building a Classifier – the learning algorithm
1. Fit a model on training data (model/feature selection).
2. Evaluate the model on testing data.
3. Summarize model performance: ROC, misclassification rates, fit (F test, SSE).
4. Compare models.
5. Optimize on the full data: ROC, pseudo-ROC.
6. Validate on the validation set.
35. The training and testing process
• We use observations that are close to the means (±0.5 sd). They include ~450,000 observations (~43%).
• 80% are for training – finding a classifier (model/feature selection), using sequential procedures for model reduction.
• 20% are for testing – checking performance.
• After obtaining a classifier – optimize: choose the mixture threshold that maximizes performance on the full dataset.
36. Classifiers
• Here we choose two classification models:
• logistic regression
• decision trees
• They can both handle:
• missing data
• candidate classifying variables that are either continuous or categorical
• categorical variables with many categories
37. Decision trees
• Classification rules are formed by the paths from the root to the leaves.
• No assumptions are made regarding the distribution of predictors.
• Relatively unstable.
• Steps (an rpart-based sketch follows):
• The tree is built using recursive splitting of nodes, until a “maximal” tree is generated.
• “Pruning” – simplification of the tree by cutting nodes off – prevents overfitting.
• Selection of the “optimal” pruned tree – one that fits without overfitting.
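A sketch of how such a tree could be grown and pruned with the rpart package; `train` and `test` are hypothetical data frames holding the mixture-derived class `short` and the categorical predictors A1…C4:

```r
library(rpart)

# Grow a large tree first (cp = 0 disables early stopping):
tree <- rpart(short ~ A1 + A2 + A3 + B1 + B2 + C1 + C2 + C3 + C4,
              data = train, method = "class",
              control = rpart.control(cp = 0))

# Prune back with the complexity parameter that minimizes
# cross-validated error, to avoid overfitting:
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)

# Classify the test set with the pruned tree:
pred <- predict(pruned, newdata = test, type = "class")
```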
38. Logistic Regression
• Regression used to predict the outcome of a binary variable (like “short” or “long”).
• The conditional mean $E(Y|X)$ follows a Bernoulli distribution.
• The connection between $E(Y|X)$ and $X$ can be described by the logistic function, which has an “s” shape:
$E(Y_i \mid X_i) = \dfrac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}$.
In general, the logistic function is
$f(z) = \dfrac{1}{1 + e^{-z}}$.
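The corresponding fit in R is a one-liner with `glm`; a sketch using the same hypothetical columns as above, with `short` coded 0/1 (1 = short runtime):

```r
# Logistic regression: binomial family with the default logit link
logit_fit <- glm(short ~ A1 + A2 + A3 + B1 + B2 + C1 + C2 + C3 + C4,
                 data = train, family = binomial)

# Predicted probabilities of a short runtime on the test set:
p_hat <- predict(logit_fit, newdata = test, type = "response")
pred  <- ifelse(p_hat > 0.5, "short", "long")
```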
39. Performance measures
• We use the ROC curve (a sketch of computing one follows).
• It combines both types of errors:
• Sensitivity (“true positive rate”) – the probability of a “short” classification when the runtime is “short”.
• Specificity (“true negative rate”) – the probability of a “long” classification when the runtime is “long”.
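One common way to compute and inspect an ROC curve in R is the pROC package; a sketch, where `labels` are the true classes and `p_hat` the predicted probabilities from either classifier:

```r
library(pROC)

roc_obj <- roc(response = labels, predictor = p_hat)
plot(roc_obj)            # sensitivity vs. specificity across all thresholds
auc(roc_obj)             # area under the curve as a single summary
coords(roc_obj, "best")  # threshold with the best sensitivity/specificity tradeoff
```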
40. Performance optimization
• For the CART procedure, variables A1, A2, A3 and B4 were selected to be in the classifier.
• For performance optimization, we use a pseudo-ROC curve (shown as a figure in the original slide):
• a blue circle marks the optimal tradeoff between sensitivity and specificity,
• obtained for a mixture probability threshold of 0.45.
41.
• For the logistic regression, most variables were selected to be in the classifier.
• For performance optimization, we compare ROC curves obtained for different thresholds, and choose threshold 0.4.
42. Validation results
• Total misclassification rates:
• CART: 9%.
• Logistic regression: 17%.
• Summary:
• Runtime can be effectively classified using the available information.
• Further evaluation of our method is required, using different data sets from different installations and times.
43. Example 2: Detection of 2nd-order Epistasis on multi-trait complexes
Joint work with:
Pavel Goldstein and Prof. Avraham Korol, University of Haifa
44. Searching for Epistasis
Goal: search for epistatic effects (interactions between genomic loci) on expression traits.
46.
Despite the growing interest in searching for epistatic interactions, there is no consensus as to the best strategy for their detection.
Suggested approach:
QTL analysis – combine gene expression and mapping data.
Use multi-trait complexes rather than single traits (trait = gene expression of a particular gene).
Screen for potential epistatic regions in a hierarchical manner.
Control the overall FDR (False Discovery Rate).
47. Multi-trait complexes
Number of tests for interactions on single traits:
number of genes (~7,200) × number of loci pairs (~120,000) = a lot!
A dimension-reduction stage can be of help.
Suggestion: consider correlated traits as multi-trait complexes – this has been shown to increase QTL detection power, mapping resolution and estimation accuracy (Korol et al., 2001).
48. Clustering traits (genes)
Use WGCNA – weighted correlation network analysis:
Top-down hierarchical clustering.
Dynamic Tree Cut algorithm: a branch-cutting method for detecting gene modules, depending on their shape.
Build meta-genes by taking the first principal component of the genes in every cluster.
(A WGCNA sketch follows.)
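A sketch of how this clustering stage might look with the WGCNA package; `expr` is a samples × genes expression matrix, and the parameter values (soft-threshold power, minimum module size) are illustrative rather than the study’s settings:

```r
library(WGCNA)

adj     <- adjacency(expr, power = 6)  # weighted correlation network
TOM     <- TOMsimilarity(adj)          # topological overlap between genes
dissTOM <- 1 - TOM

geneTree <- hclust(as.dist(dissTOM), method = "average")  # gene dendrogram

# Dynamic Tree Cut: shape-aware branch cutting into gene modules
modules <- cutreeDynamic(dendro = geneTree, distM = dissTOM,
                         deepSplit = 2, minClusterSize = 30)

# Meta-genes: first principal component of each module's genes
MEs <- moduleEigengenes(expr, colors = modules)$eigengenes
```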
49. Testing for epistasis
Natural and Orthogonal Interactions (NOIA) model (Alvarez-Castro and Carlborg, 2007).
For trait t, loci pair l (loci A and B) and replicate i, the gene-expression values are modeled via an indicator of the genotype combinations for the two loci, multiplied by a design matrix and a vector of genetic effects (the model equation is shown as a figure in the original slide).
50. The test for epistasis is done hierarchically.
(Figure: genomic map showing sparse “framework” markers and the “secondary” markers surrounding each of them.)
51. False Discovery Rate (FDR) in hierarchical testing
Yekutieli (2008) offers a procedure to control the FDR for the full tree of tests.
52. Hierarchical FDR control
A universal upper bound is derived for the full-tree FDR (Yekutieli, 2008); it involves a quantity $\delta^*$ (the bound itself is shown as a figure in the original slide).
An upper bound for $\delta^*$ may be estimated using $R_t^{P_i=0}$ and $R_t^{P_i=1}$ – the numbers of discoveries in $\tau_t$ given that $H_i$ is a true null hypothesis in $\tau_t$, or a false null hypothesis, respectively.
53. Searching algorithm
STAGE 1:
Construct multi-trait complexes (using WGCNA clustering).
STAGE 2: hierarchical search
◦ Step 1: screen for combinations of loci pair and multi-trait complex with potential for epistasis (NOIA model).
◦ Step 2: test using higher-resolution loci, only for the selected regions (NOIA model).
54. Data
A sample of 210 individuals from an Arabidopsis thaliana population.
The genotypic map consists of 579 markers.
Transcript levels were quantified using Affymetrix whole-genome microarrays.
A total of 22,810 gene expressions from all five chromosomes (non-expressed genes filtered out).
55. Two-stage hierarchical testing for epistasis
STAGE 1: identified 314 gene clusters (WGCNA).
STAGE 2:
47 sparse “framework” markers that are within 10 cM of each other.
10-12 “secondary” markers related to each “framework” marker.
First step:
1,081 marker pairs × 314 meta-genes = 339,434 tests
– 11 regions are identified.
Second step:
– 1,141 epistatic effects are identified.
59. Preprocessing
Variance Stabilization Normalization.
Gene-expression filtering: 7,244 genes retained out of 22,810.
Marker preprocessing.
60. Computational advantage
Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested.
Naive analysis: 121,278 loci pairs for each of 7,244 traits, namely 878,537,832 tests, would have been performed.
The number of tests is reduced by a factor of about 2,575.
62. Define a scan statistic
For gene $g$, $g = 1, \ldots, m$, let
$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}$.
Then the scan statistic for gene $g$ is
$S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t)$.
For gene $g$, we test the null hypothesis that there is no $k$ such that
$E(D_{g,k}), \ldots, E(D_{g,k+w-1}) > \delta_0$,
where $\delta_0$ is the baseline level for the gene. (A direct computation is sketched below.)
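Computed directly, the moving-sum scan statistic is a few lines of R; a sketch in which `d` holds the point-wise statistics $D_{g,p}$ for one gene and `w` is the window size (the data here are simulated, not the study’s):

```r
# Scan statistic: maximum moving sum over windows of length w
scan_stat <- function(d, w) {
  n <- length(d)
  # moving sums Y(t) = d[t] + ... + d[t + w - 1]
  Y <- vapply(seq_len(n - w + 1),
              function(t) sum(d[t:(t + w - 1)]), numeric(1))
  max(Y)  # S = max over all window start positions t
}

set.seed(7)
d <- rnorm(200)      # toy point-wise signal for one gene
scan_stat(d, w = 10)
```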
63. Peak Detection
Point-wise statistics: $D_{g,p}$.
Moving-sum statistics: $Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}$.
Scan statistic: $S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t)$.
64. Summary – data science
• Data science is an emerging field/profession that incorporates knowledge and expertise from several disciplines.
• It combines both big-data technologies and sophisticated methods for complicated data analysis.
• Data analysis is aimed at answering various questions with case-specific challenges, and should therefore be carefully tailored to the type of problem and data.
65. References
Reiner-Benaim, A., Shmueli, E. and Grabarnick, A.
(submitted)
A statistical learning approach for runtime prediction in Intel’s
data center.
Goldstein, P., Korol, A. B. and Reiner-Benaim, A. (2014)
Two-stage genome-wide search for epistasis with
implementation to Recombinant Inbred Lines (RIL)
populations. PLOS ONE, 9(12).
Reiner-Benaim, A. (2015) Scan statistic tail probability
assessment based on process covariance and window size.
Methodology and Computing in Applied Probability, In Press.
Reiner-Benaim, A., Davis, R. W. and Juneau, K. (2014)
Scan statistics analysis for detection of introns in time-course
tiling array data. Statistical Applications in Genetics and
Molecular Biology, 13(2), 173-90.
Speaker notes:
However, more than one gene may be affecting the trait, and then an epistatic effect is of potential interest. The Y-axis here is gene expression and the X-axis shows genotypes of QTL1. The markers have only two levels, A or H, which is the case for a recombinant inbred line (RIL) population. The first plot represents the case of no epistasis; conceptually, it is similar to a 2-way analysis of variance. In the second plot an epistatic effect is involved.
The WGCNA, proposed by Zhang and Horvath, is used for gene-expression clustering. First, top-down hierarchical clustering is applied, using weighted inter-gene distances. Then, a branch-cutting method sensitive to branch shapes is implemented for detecting gene modules. Meta-genes are then defined as the first principal component of the genes in every cluster.
We propose to test the epistasis hypothesis by fitting the NOIA model proposed by Alvarez-Castro and Carlborg, modified for second-order epistasis in RIL populations, which are homozygous. The model allows orthogonal estimation of genetic effects. For loci A and B, the gene-expression level of trait t, loci pair l and replicate i, we represent the vector of gene expressions as a product of phenotypes with the corresponding genotype-combination indicators, plus an error term. In turn, the phenotypes may be represented as a multiplication of the genetic effects by a design matrix that guarantees orthogonality of the effects.
As mentioned, neighboring markers on the genotype map contain very similar information. Based on this attribute, we separated all markers into “framework” markers (marked as bold dots) – relatively distant loci – and “secondary” markers (small vertical lines) related to the corresponding framework markers. Long vertical lines denote the borders of “framework” marker areas. Thus our markers have a hierarchical structure.
We propose a two-stage approach for identifying QTL epistasis. The algorithm starts with an initial construction of multi-trait complexes (or meta-genes) by WGCNA clustering of the microarray gene-expression data. Then, epistasis is tested for among all combinations of such complexes and loci pairs: starting with an initial “rough” search for pairs among framework markers, followed by a higher-resolution search only within the identified regions.
If we find an epistatic effect between markers m1 and m2, we continue the search between all pairs of markers along with their “secondary” markers (colored in yellow).
Since the number of tests involved is enormous, we must control false positives. For this purpose we used the False Discovery Rate criterion proposed by Benjamini and Hochberg; in our case it is defined as the expected proportion of erroneously identified epistatic effects among all identified ones.
Yekutieli (2008) suggested a hierarchical procedure to control the FDR across the tree of hypotheses. In our case, all hypotheses can be arranged in a 2-level structure. In the first level are the hypotheses for all combinations of multi-trait complexes and pairs of sparse “framework” markers. In the second level are the hypotheses for all combinations selected in the first level, this time using the “secondary” markers related to the corresponding framework markers. We are interested in full-tree FDR control – over all epistasis discoveries in the whole tree. The rejection threshold q should be chosen such that the full-tree FDR is controlled at the desired level, 0.1.
We implemented the algorithm on Arabidopsis data of 210 RILs. Around 23,000 gene expressions were produced from all five chromosomes. We then applied our algorithm: 314 gene clusters were identified (WGCNA); for the first stage of hierarchical testing, 47 sparse “framework” markers within 10 cM of each other were used; 10-12 “secondary” markers were placed for each framework area. In total, we tested around 440,000 epistatic hypotheses.
The Variance Stabilization Normalization (VSN) uses a generalized log transformation. After filtering out non-expressed genes, 7,244 genes out of 22,810 remained. We also filtered out bad or non-informative markers.
Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested. If instead all possible combinations of markers and raw traits were tested in one stage, about 900,000,000 tests would have been performed.