The big-data analytics challenge –
combining statistical and algorithmic
perspectives
Anat Reiner-Benaim
Department of Statistics
University of Haifa
IDC, May 14, 2015
Outline
IDC, May 2015
 Data science:
◦ Definition?
◦ Who needs it?
◦ The elements of data science
 Analysis:
◦ Modeling
◦ Software
 Examples:
◦ Scheduling – prediction of runtime
◦ Genetics – detection of rare events
2
What is Data Science?
IDC, May 2015
From Wikipedia:
“Data science is the
study of the
generalizable extraction
of knowledge from
data…
3
IDC, May 2015
More from Wikipedia:
…builds on techniques and theories from many fields,
including signal processing, mathematics, probability
models, machine learning, statistical learning, computer
programming, data engineering, pattern recognition and
learning, visualization, uncertainty modeling, data
warehousing and high performance computing...
…goal: extracting meaning from data and creating data
products…
…not restricted to only big data, although the fact that
data is scaling up makes big data an important aspect of
data science.”
4
Data Science – who needs it?
IDC, May 2015 5
Anyone who has (big) data, e.g.:
 Cellular industry – phones, apps, advertisers
 Internet: search engines, social media, marketing,
advertisers
 Computer networks and server systems
 Cyber security
 Credit cards
 Banks
 Health care providers
 Life science – genome, proteome…
 TV and related
 Weather forecast
The elements of data science
IDC, May 2015 6
“Big data technologies” – store, preprocess:
• NoSQL database (e.g. Cassandra)
• DFS (Distributed File System) (e.g. Hadoop, Spark, GraphLab)
• Dump to an SQL database (e.g. MySQL, SAS-SQL)
“Big data analytics” – analyze:
• Apply sophisticated methods:
 Statistical modeling
 Machine learning algorithms
IDC, May 2015
◦ How can I decide that an item in a manufacturing process is
faulty?
◦ What is the difference between the new machine and the old
one?
◦ What are the factors that affect system load?
◦ How can I predict memory/runtime of a program?
◦ How can I predict that a customer will churn?
◦ What is the chance that the phone/web user will click my
advertisement?
◦ What is the chance that the current ATM user is committing fraud?
◦ What is the chance of snow this week?
7
Data Analysis –
First, define the problem
IDC, May 2015
 Possible goals:
◦ Predicting, classifying
 (Logistic) Regression, LDA, QDA, Naïve Bayes, Neural networks
 CART, Random forests, SVM, KNN
◦ Clustering
 Hierarchical, K-means, Mixture models, HMM, PCA
◦ Anomaly detection, peak detection
 Scan statistic, outlier detection methods
◦ “A/B testing” (actually a two-sample comparison)
 Parametric tests (normal, t, chi-square, ANOVA)
 Non-parametric tests (signed-rank, rank-sum, Kruskal-Wallis)
◦ Identify trends, cycles
 Regression, time-series
8
Modeling
IDC, May 2015 9
Choosing models
 Type of variables:
◦ Continuous, ordinal, categorical.
 Statistical assumptions:
◦ Normality, equal-variance, independence.
 Missing data
 Stability
IDC, May 2015 10
Learning tools
 Bootstrap
◦ Repeatedly fit the model on resampled data (see the sketch after this list).
 Bagging (“bootstrap aggregation”)
◦ Average models fitted to bootstrap samples to reduce instability.
 Boosting
◦ Combine a set of weak learners into a single strong learner.
 Regularization
◦ Control over-fitting by restricting the model
(e.g. limit regression to linear or low-degree polynomial terms).
 Utility/cost function
◦ Evaluate performance, compare models.
 These are typically iterative procedures, combined with the modeling
procedures; they help optimize the model and evaluate its performance.
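A minimal R sketch of the bootstrap and bagging ideas above; the data frame `dat` and the formula `y ~ x` are illustrative placeholders, not objects from the talk.

```r
# Illustrative bootstrap/bagging sketch (hypothetical data frame `dat` with columns y, x).
set.seed(1)
n <- nrow(dat)
B <- 200                                  # number of bootstrap samples
boot_coefs <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)     # resample rows with replacement
  fit <- lm(y ~ x, data = dat[idx, ])     # refit the model on the resampled data
  coef(fit)
})
apply(boot_coefs, 1, sd)                  # bootstrap standard errors of the coefficients

# Bagging: average predictions from models fit to bootstrap samples.
bag_pred <- rowMeans(replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  predict(lm(y ~ x, data = dat[idx, ]), newdata = dat)
}))
```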
IDC, May 2015 11
More to consider –
control statistical error due to large-scale analysis
Multiple statistical tests → inflated statistical error.
 Control the FDR?
FDR = expected proportion of false findings among the reported findings
(e.g. reported “features”).
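A small R illustration of FDR control via the Benjamini–Hochberg adjustment (`p.adjust` in base R); the p-values are simulated purely for illustration.

```r
# FDR control with the Benjamini-Hochberg procedure on a vector of p-values
# (p-values here are simulated for illustration only).
set.seed(1)
pvals <- c(runif(950), rbeta(50, 1, 50))   # mostly nulls plus a few small p-values
qvals <- p.adjust(pvals, method = "BH")    # BH-adjusted p-values
discoveries <- which(qvals <= 0.05)        # findings reported at FDR level 0.05
length(discoveries)
```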
IDC, May 2015 12
The R software
 Open source programming language and
environment for statistical computing.
 Widely used among statisticians for developing
statistical software (“packages”) and for data
analysis.
 Increasingly popular among all data professionals.
Advantages:
 Contains the most up-to-date statistical models and machine learning algorithms.
 Methods are based on research, compiled and documented.
 Contains Hadoop functions (package “rhdfs”).
 Very convenient for plain programming, scripting, simulations, visualization.
 Friendly interfaces (e.g. RStudio).
 The R project site
IDC, May 2015 13
Examples
 Runtime prediction
(manufacturing, scheduling)
 Anomaly/peak detection
(fraud, electronics, genetics)
 Diagnostics
(biotech, healthcare)
 Epistatic detection
(genetics)
Example 1:
Classification of Job Runtime
at Intel
Joint work with:
Anna Grabarnick, University of Haifa
Edi Shmueli, Intel
Job processing
IDC, May 2015 15
Diagram: users submit jobs to the job scheduler, which decides which server and which queue each job is assigned to.
Job schedulers
IDC, May 2015
 Algorithms aimed at efficiently queuing and distributing
jobs among servers, thereby improving system
utilization.
 Popular scheduling algorithms (e.g. backfilling) use
information on how long the jobs are expected to run.
 In serial job systems, scheduling performance can be
improved by merely separating the short jobs from the
long ones and assigning them to different queues in the
system.
 This helps reduce the likelihood that short jobs will be
delayed after long ones, and thus improves overall performance.
Job processing
IDC, May 2015 17
Diagram: each incoming job is classified as short or long, and the job scheduler assigns it to the corresponding queue and servers.
The problem
IDC, May 2015
 Main purpose:
Classify jobs into “short” and “long” durations.
 Questions:
◦ How can the classes be defined?
◦ How can the jobs be classified?
18
Available data
IDC, May 2015
 Two traces obtained from one of Intel’s data centers:
1. ~1 million jobs executed during a period of 10 consecutive days,
used for training.
2. ~755,000 jobs executed during a period of 7 consecutive days,
used for model validation.
 Aside from runtime information, 9 categorical variables
were available:
19
IDC, May 2015 20
TABLE I. ROUGH GROUPING OF THE 9 CATEGORICAL VARIABLES
Group | # of variables | Relates to                     | Example
A     | 3              | Scheduling information         | Resources requested by the job
B     | 2              | Execution-specific information | Command line and arguments
C     | 4              | Association information        | Project and component

TABLE II. STATISTICS REGARDING THE CATEGORICAL VARIABLES
Variable | # of categories | # of missing (in training data)
A1       | 9               | 0
A2       | 7               | 0
A3       | 5               | 0
B1       | 44              | 173
B2       | 22              | 184
C1       | 2               | 0
C2       | 5               | 239
C3       | 6               | 184
C4       | 32              | 0
Analysis steps
IDC, May 2015
 Exploratory visualization of the data.
 Class construction and characterization.
 Classification:
◦ Choice of a classification model.
◦ Optimize model.
◦ Validate model.
21
IDC, May 2015 22
Plots of the categorical variables A1, B4, A3 and A2 (exploratory visualization).
IDC, May 2015 23
IDC, May 2015 24
Runtime distribution (in seconds): all observations, and zoomed in to wtime < 15,000 sec.
IDC, May 2015 25
Runtime – log transformation: histogram of log2(wtime).
IDC, May 2015 26
Constructing classes by the mixture model
• The Gaussian (normal) mixture model has the form
$f(x) = \sum_{m=1}^{M} \alpha_m \, \phi(x; \mu_m, \Sigma_m)$,
with mixing proportions $\alpha_m$, $\sum_{m} \alpha_m = 1$.
• Each Gaussian density has a mean $\mu_m$ and covariance matrix $\Sigma_m$.
• The parameters are usually estimated by maximum likelihood
using the EM algorithm.
IDC, May 2015 27
• We model the runtime $Y$ as a mixture of the two normal variables
$Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$.
$Y$ can be defined by
$Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2$,
where $\Delta \in \{0, 1\}$ with $P(\Delta = 1) = \pi$.
• Let $\phi_\theta(x)$ denote the normal density with parameters $\theta = (\mu, \sigma^2)$.
Then the density of $Y$ is
$g_Y(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y)$.
• We fit this model to our data by maximum likelihood. The parameters are
$\theta = (\pi, \theta_1, \theta_2) = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$.
The log-likelihood based on $N$ training cases is
$\ell(\theta; Z) = \sum_{i=1}^{N} \log\left[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right]$.
Mixture distribution – parameters estimation
IDC, May 2015 28
• Direct maximization of $\ell(\theta; Z)$ is quite difficult numerically. Instead,
we consider unobserved latent variables $\Delta_i$ taking values 0 or 1 as
earlier: if $\Delta_i = 1$ then $Y_i$ comes from distribution 2, otherwise it
comes from distribution 1.
• Suppose we knew the values of the $\Delta_i$'s. Then the log-likelihood would be
$\ell(\theta; Z, \Delta) = \sum_{i=1}^{N}\left[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\right] + \sum_{i=1}^{N}\left[(1 - \Delta_i)\log(1 - \pi) + \Delta_i\log\pi\right]$,
and the maximum likelihood estimates of $\mu_1$ and $\sigma_1^2$ would be the
sample mean and the sample variance of the observations with $\Delta_i = 0$.
Similarly, the estimates for $\mu_2$ and $\sigma_2^2$ would be the sample mean
and the sample variance of the observations with $\Delta_i = 1$.
Parameters estimation – cont’d
• Since the $\Delta_i$ values are actually unknown, we proceed in an
iterative fashion, substituting for each $\Delta_i$ in the previous equation
its expected value
$\gamma_i(\theta) = E(\Delta_i \mid \theta, Z) = P(\Delta_i = 1 \mid \theta, Z)$,
which is also called the responsibility of model 2 for observation $i$.
• We use the following procedure, known as the EM algorithm, for
the two-component Gaussian mixture:
1. Take initial guesses for the parameters $\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2$ (see below).
2. Expectation step: compute the responsibilities
$\gamma_i = \dfrac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}, \quad i = 1, 2, \ldots, N.$
IDC, May 2015 29
Parameters estimation – cont’d
3. Maximization step: compute the weighted means and variances,
$\hat\mu_1 = \dfrac{\sum_{i=1}^{N}(1 - \gamma_i)\,y_i}{\sum_{i=1}^{N}(1 - \gamma_i)}, \qquad
\hat\sigma_1^2 = \dfrac{\sum_{i=1}^{N}(1 - \gamma_i)\,(y_i - \hat\mu_1)^2}{\sum_{i=1}^{N}(1 - \gamma_i)},$
$\hat\mu_2 = \dfrac{\sum_{i=1}^{N}\gamma_i\,y_i}{\sum_{i=1}^{N}\gamma_i}, \qquad
\hat\sigma_2^2 = \dfrac{\sum_{i=1}^{N}\gamma_i\,(y_i - \hat\mu_2)^2}{\sum_{i=1}^{N}\gamma_i},$
and the mixing probability,
$\hat\pi = \dfrac{\sum_{i=1}^{N}\gamma_i}{N}.$
4. Iterate steps 2 and 3 until convergence.
IDC, May 2015 30
Parameters estimation – cont’d
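The EM steps above translate almost line by line into R; the following is a compact illustrative sketch (the function name em_mix2 and the input vector y of log-runtimes are assumptions, not code from the study).

```r
# A compact EM iteration for a two-component Gaussian mixture, following the
# steps above (y is a numeric vector of log-runtimes; written for illustration).
em_mix2 <- function(y, iter = 100) {
  n <- length(y)
  mu <- sample(y, 2)                           # initial means: two random observations
  s2 <- rep(var(y) * (n - 1) / n, 2)           # initial variances: overall sample variance
  pi2 <- 0.5                                   # initial mixing proportion
  for (k in seq_len(iter)) {
    # E-step: responsibilities of component 2
    d1 <- dnorm(y, mu[1], sqrt(s2[1]))
    d2 <- dnorm(y, mu[2], sqrt(s2[2]))
    g  <- pi2 * d2 / ((1 - pi2) * d1 + pi2 * d2)
    # M-step: weighted means, variances and mixing probability
    mu[1] <- sum((1 - g) * y) / sum(1 - g)
    mu[2] <- sum(g * y) / sum(g)
    s2[1] <- sum((1 - g) * (y - mu[1])^2) / sum(1 - g)
    s2[2] <- sum(g * (y - mu[2])^2) / sum(g)
    pi2   <- mean(g)
  }
  list(mu = mu, sigma2 = s2, pi = pi2, responsibility = g)
}
```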
• A simple choice of initial guesses for $\mu_1$ and $\mu_2$ is two randomly
selected observations $y_i$. The overall sample variance
$\sum_{i=1}^{N}(y_i - \bar y)^2 / N$ can be used as an initial guess for both
$\sigma_1^2$ and $\sigma_2^2$. The initial mixing proportion $\pi$ can be set to 0.5.
• Software:
The "mixtools" R package was used for the mixture analysis, with
the function "normalmixEM" for parameter and posterior probability
(responsibility) estimation.
IDC, May 2015 31
Parameter estimation - additional notes
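A sketch of how the mixtools fit might be invoked, consistent with the slide; the vector log_wtime is an assumed name for the log2 runtimes, not an object from the study.

```r
# Fitting the two-component mixture with the mixtools package, as on the slide
# (log_wtime is an illustrative vector of log2 runtimes).
library(mixtools)
fit <- normalmixEM(log_wtime, k = 2)   # EM fit of a 2-component normal mixture
fit$lambda                             # estimated mixing proportions
fit$mu                                 # estimated component means
fit$sigma                              # estimated component standard deviations
head(fit$posterior)                    # responsibilities (posterior class probabilities)
```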
IDC, May 2015 32
• We obtain the following estimates:
IDC, May 2015 33
• Each observation $i$ is assigned a posterior probability of belonging to each class:
$\dfrac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}, \quad i = 1, 2, \ldots, N.$
• For instance, using a probability threshold of 0.5:
Pie chart: partition of the runtimes into short (1, 60.56%) and long (2, 39.44%).
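Continuing the sketch above, classes can be obtained by thresholding the posterior probabilities returned by normalmixEM; which column corresponds to the "long" component is an assumption here and should be checked against fit$mu.

```r
# Assigning each job to a class by thresholding the posterior probability
# of the "long" component (threshold 0.5, as on the slide).
post_long <- fit$posterior[, 2]                       # responsibility of component 2 (assumed "long")
job_class <- ifelse(post_long > 0.5, "long", "short")
prop.table(table(job_class))                          # proportions of short/long jobs
```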
Building a Classifier –
The Learning algorithm
IDC, May 2015 34
Flow: fit a model on training data (model/feature selection) → evaluate the
model on testing data → summarize model performance (ROC, misclassification
rates, fit: F test, SSE) → compare models → optimize on the full data (ROC,
pseudo-ROC) → validate on the validation set.
IDC, May 2015 35
• We use observations that are close to the component means (±0.5 sd).
They include ~450,000 observations (~43%).
The training and testing process:
• 80% are used for training – finding a classifier (model/feature selection),
with sequential procedures for model reduction.
• 20% are used for testing – checking performance.
• After obtaining a classifier – optimize: choose the mixture threshold that
maximizes performance on the full dataset.
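An illustrative 80/20 split in R; jobs is an assumed data frame holding the ~450,000 retained observations.

```r
# An illustrative 80/20 train/test split of the retained observations
# (jobs is a hypothetical data frame of the ~450,000 jobs kept for model building).
set.seed(1)
n <- nrow(jobs)
train_idx <- sample(n, size = round(0.8 * n))
train <- jobs[train_idx, ]    # 80% for training (model/feature selection)
test  <- jobs[-train_idx, ]   # 20% for testing (performance checking)
```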
IDC, May 2015 36
Classifiers
• Here we choose two classification models:
• logistic regression
• decision trees
• They can both handle:
• Missing data
• Candidate classifying variables that are either continuous or
categorical.
• Categorical variables with many categories
IDC, May 2015 37
Decision trees
• Classification rules are formed by the paths from the root to the leaves.
• No assumptions are made regarding the distribution of predictors.
• Relatively unstable.
• Steps:
• The tree is built by recursive splitting of nodes, until a “maximal” tree is
generated.
• “Pruning” – simplification of the tree by cutting nodes off; prevents
overfitting.
• Selection of the “optimal” pruned tree – one that fits well without overfitting.
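A hedged CART sketch with the rpart package (not the exact code used in the study); the data frames train/test and the class/predictor column names are illustrative.

```r
# A CART sketch with rpart: grow a large tree, then prune it using the
# complexity parameter chosen by cross-validation.
library(rpart)
tree <- rpart(factor(class) ~ A1 + A2 + A3 + B4, data = train, method = "class",
              control = rpart.control(cp = 0.0001))     # grow a large ("maximal") tree
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)                    # prune to avoid overfitting
pred    <- predict(pruned, newdata = test, type = "class")
```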
IDC, May 2015 38
Logistic Regression
• Regression used to predict the outcome of a binary variable (like “short” or “long”).
• Given X, the response Y is distributed Bernoulli with success probability E(Y|X).
• The connection between E(Y|X) and X can be described by the logistic function,
which has an “s” shape:
$E(Y_i \mid X_i) = \dfrac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}.$
In general, the logistic function is
$f(z) = \dfrac{1}{1 + e^{-z}}.$
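A corresponding logistic-regression sketch with glm; again the data frames and column names are assumptions for illustration only.

```r
# Logistic regression with glm: the fitted values are the estimated
# probabilities E(Y|X) from the logistic function above (illustrative data).
logit_fit <- glm(I(class == "long") ~ A1 + A2 + A3 + B4,
                 data = train, family = binomial)
p_long <- predict(logit_fit, newdata = test, type = "response")  # P(long | X)
pred   <- ifelse(p_long > 0.5, "long", "short")
```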
IDC, May 2015 39
Performance measures
• We use the ROC curve.
• It combines both types of error:
• Sensitivity (“true positive rate”)
- the probability of a “short” classification when the runtime is “short”.
• Specificity (“true negative rate”)
- the probability of a “long” classification when the runtime is “long”.
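One way to compute these quantities in R is the pROC package (a sketch, not the authors' code); test$class and p_long are the illustrative objects from the previous sketches.

```r
# Computing sensitivity, specificity and an ROC curve with the pROC package
# (truth is the mixture-based class, p_long the predicted probability).
library(pROC)
roc_obj <- roc(response = test$class, predictor = p_long, levels = c("short", "long"))
plot(roc_obj)                     # ROC curve
auc(roc_obj)                      # area under the curve
coords(roc_obj, x = "best")       # threshold balancing sensitivity and specificity
```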
IDC, May 2015 40
Performance optimization
• For the CART procedure, variables A1, A2, A3 and B4 were selected
to be in the classifier.
• For performance optimization, we use a pseudo-ROC curve:
• blue circle marks optimal tradeoff between sensitivity and specificity
• obtained for mixture probability threshold of 0.45.
IDC, May 2015 41
• For the logistic regression, most variables were selected for the classifier.
• For performance optimization, we compare ROC curves obtained for
different thresholds, and choose threshold 0.4:
IDC, May 2015 42
Validation results
• Total misclassification rates:
• CART: 9%.
• Logistic regression: 17%.
• Summary:
• Runtime can be effectively classified
using the available information.
• Further evaluation of our method is
required using different data sets from
different installations and times.
IDC, May 2015 43
Joint work with:
Pavel Goldstein and Prof. Avraham Korol,
University of Haifa
Example 2:
Detection of 2nd order Epistasis
on multi-trait complexes
IDC, May 2015
 Goal:
search for epistatic effects (interactions between
genomic loci) on expression traits.
44
Searching for Epistasis
Epistasis
Figure: gene expression vs. QTL1 genotype, plotted separately for each QTL2
allele – one panel showing no epistasis and one showing epistasis.
IDC, May 2015
 Despite the growing interest in searching for epistatic
interactions, there is no consensus as to the best
strategy for their detection.
Suggested approach:
 QTL analysis - combine gene expression and mapping
data
 Use multi-trait complexes rather than single traits
(trait = gene expression of a particular gene).
 Screen for potential epistatic regions in a hierarchical
manner.
 Control the overall FDR (False Discovery Rate).
46
Multi-trait complexes
47IDC, May 2015
 Number of tests for interactions on single traits:
number of genes (~7,200) × number of loci pairs (~120,000) = a lot
(hundreds of millions of tests)!
 A dimension reduction stage can be of help!
 Suggestion:
Considering correlated traits as multi-trait complexes
has been shown to increase QTL detection power,
mapping resolution and estimation accuracy
(Korol et al., 2001).
 Use WGCNA – weighted correlation network analysis:
 Top-down hierarchical clustering.
 Dynamic Tree Cut algorithm:
a branch-cutting method for detecting gene modules
based on their shape.
 Build meta-genes by taking the first principal
component of the genes in every cluster.
48IDC, May 2015
Clustering traits (genes)
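A hedged sketch of this pipeline with the WGCNA R package; expr is an assumed samples-by-genes matrix, and power = 6, minModuleSize = 30 are illustrative settings, not values from the study.

```r
# A WGCNA sketch: cluster genes into modules (Dynamic Tree Cut is applied
# internally) and take each module's first principal component as a meta-gene.
library(WGCNA)
net <- blockwiseModules(expr, power = 6, minModuleSize = 30)  # weighted network + modules
table(net$colors)                                             # module assignment per gene
meta_genes <- moduleEigengenes(expr, colors = net$colors)$eigengenes  # first PC per module
```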
Testing for epistasis:
the Natural and Orthogonal Interactions (NOIA) model
(Alvarez-Castro and Carlborg, 2007)
For trait t, loci-pair l (loci A and B) and replicate i, the gene expression is
modeled as the product of an indicator of the genotype combination of the two
loci with a design matrix and a vector of genetic effects, plus an error term.
49IDC, May 2015
The test for epistasis is done hierarchically
Diagram: sparse “framework” markers are tested first; the “secondary” markers
around each selected framework marker are tested at the second step.
IDC, May 2015
False Discovery Rate (FDR)
in hierarchical testing
Yekutieli (2008) offers a procedure to control the FDR for the full
tree of tests.
51IDC, May 2015
Hierarchical FDR control
A universal upper bound is derived for the full-tree FDR (Yekutieli, 2008),
and an upper bound for δ* may be estimated using
$R_t^{P_i=0}$ and $R_t^{P_i=1}$ – the numbers of discoveries in $\tau_t$ given that
$H_i$ is a true null hypothesis in $\tau_t$, or a false null hypothesis, respectively.
52IDC, May 2015
IDC, May 2015 53
Searching algorithm
 STAGE 1:
Construct multi-trait complexes (using WGCNA clustering)
 STAGE 2: hierarchical search
◦ Step 1:
Screen for combinations of loci-pair and multi-trait complex
with potential for epistasis (NOIA model).
◦ Step 2:
Test using higher-resolution loci only for the selected
regions (NOIA model).
Data
 A sample of 210 individuals from an Arabidopsis
thaliana population.
 The genotype map consists of 579 markers.
 Transcript levels were quantified using Affymetrix
whole-genome microarrays.
 A total of 22,810 gene expression traits from all five
chromosomes
(non-expressed genes filtered out).
54IDC, May 2015
Two-stage hierarchical testing for
epistasis
 STAGE 1: identified 314 gene clusters (WGCNA).
 STAGE 2:
47 sparse "framework" markers that are within 10 cM of each
other.
10–12 “secondary" markers related to each "framework" marker.
 First step:
1,081 marker pairs × 314 meta-genes = 339,434 tests
– 11 regions are identified.
 Second step:
– 1,141 epistatic effects are identified.
55IDC, May 2015
IDC, May 2015 56
Epistatic
regions
IDC, May 2015 57
Simulation study
IDC, May 2015 58
Simulation study (cont’d)
Preprocessing
 Variance stabilization normalization (VSN).
 Gene expression filtering: 7,244 genes retained out of 22,810.
 Marker preprocessing: bad or non-informative markers filtered out.
59IDC, May 2015
Computational advantage
 Using the two-stage algorithm on meta-genes, 341,107
hypotheses were tested.
 Naive analysis:
121,278 loci pairs for each of 7,244 traits, namely 878,537,832
tests, would have been performed.
 A reduction in the number of tests by a factor of ~2,575.
60IDC, May 2015
Peak Detection
61
Figure: point-wise statistics $D_{g,p}$ along the gene for the wild-type and the
mutant, and the moving-sum statistic
$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}.$
IDC, May 2015
Define a scan statistic
 For gene $g$, $g = 1, \ldots, m$, let
$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}.$
 Then the scan statistic for gene $g$ is
$S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t).$
 For gene $g$, we test the null hypothesis that there
is no $k$ such that
$E(D_{g,k}), \ldots, E(D_{g,k+w-1}) > \delta_0$,
where $\delta_0$ is the baseline level for the gene.
62IDC, May 2015
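A direct R sketch of the moving-sum scan statistic defined above for a single gene; the vector D of point-wise statistics is illustrative.

```r
# Moving-sum scan statistic for one gene: Y_g^w(t) is a window sum of the
# point-wise statistics D, and S_g^w is its maximum over window positions.
scan_stat <- function(D, w) {
  n <- length(D)
  Yw <- sapply(1:(n - w + 1), function(t) sum(D[t:(t + w - 1)]))  # Y_g^w(t)
  max(Yw)                                                         # S_g^w
}
# Example: scan_stat(D = rnorm(500), w = 25)
```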
IDC, May 2015
Peak Detection
63
Figure: point-wise statistics $D_{g,p}$ and the moving-sum statistics
$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}$, whose maximum over windows gives the scan
statistic $S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t)$.
IDC, May 2015 64
Summary – data science
• Data science is an emerging field/profession that
incorporates knowledge and expertise from several
disciplines.
• It combines both big data technologies and
sophisticated methods for complicated data analysis.
• Data analysis is aimed at answering various questions
with case-specific challenges, and should therefore
be carefully tailored to the type of problem and data.
References
IDC, May 2015 65
 Reiner-Benaim, A., Shmueli, E. and Grabarnick, A.
(submitted)
A statistical learning approach for runtime prediction in Intel’s
data center.
 Goldstein, P., Korol, A. B. and Reiner-Benaim, A. (2014)
Two-stage genome-wide search for epistasis with
implementation to Recombinant Inbred Lines (RIL)
populations. PLOS ONE, 9(12).
 Reiner-Benaim, A., (2015) Scan statistic tail probability
assessment based on process covariance and window size.
Methodology and Computing in Applied Probability, In Press.
 Reiner-Benaim, A., Davis, R. W. and Juneau, K. (2014)
Scan statistics analysis for detection of introns in time-course
tiling array data. Statistical Applications in Genetics and
Molecular Biology, 13(2), 173-90.
Thank you


Editor's Notes

  1. More than one gene may be affecting the trait, and then an epistatic effect is of potential interest. The Y-axis here is gene expression and the X-axis shows the genotypes of QTL1. The markers have only two levels, A or H, which is the case in a recombinant inbred line (RIL) population. The first plot represents the case of no epistasis; conceptually it is similar to a 2-way analysis of variance. In the second plot an epistatic effect is involved.
  2. The WGCNA, proposed by Zhang and Horvath, is used for gene expression clustering. First, top-down hierarchical clustering is applied, using weighted inter-gene distances. Then the branch-cutting method, sensitive to branch shapes, is implemented for detecting gene modules, and meta-genes are defined as the first principal component of the genes from every cluster.
  3. We propose to test the epistasis hypothesis by fitting the NOIA model proposed by Alvarez-Castro and Carlborg, modified for second-order epistasis in RIL populations, which are homozygous. The model allows orthogonal estimation of genetic effects. For loci A and B, the gene expression level for trait t, loci-pair l and replicate i can be represented as a product of phenotypes with the corresponding genotype-combination indicators, plus an error term. In turn, the phenotypes may be represented as a multiplication of the genetic effects and a design matrix that guarantees orthogonality of the effects.
  4. As mentioned, neighboring markers on the genotype map contain very similar information. Based on this we separated all markers into "framework" markers (marked as bold dots) – relatively distant loci – and "secondary" markers (small vertical lines) related to the corresponding framework markers. Long vertical lines denote the borders of the "framework" marker areas. Thus the markers have a hierarchical structure. We propose a two-stage approach for identifying QTL epistasis. The algorithm starts with an initial construction of multi-trait complexes (meta-genes) by WGCNA clustering of the microarray gene expression data. Then epistasis is tested among all combinations of such complexes and loci-pairs: starting with an initial "rough" search for pairs among framework markers, followed by a higher-resolution search only within the identified regions. If an epistatic effect is found between markers m1 and m2, the search continues between all pairs of these markers along with their "secondary" markers (colored in yellow).
  5. Since the number of tests involved is enormous, we should control false positives. For this purpose we used the False Discovery Rate criterion proposed by Benjamini and Hochberg; in our case it is defined as the expected proportion of erroneously identified epistasis effects among all identified ones. Yekutieli (2008) suggested a hierarchical procedure to control the FDR across a tree of hypotheses. In our case all hypotheses can be arranged in a 2-level structure. In the first level are the hypotheses for all combinations of multi-trait complexes and pairs of sparse "framework" markers. In the second level are the hypotheses for all combinations selected in the first level, this time using the "secondary" markers related to the corresponding framework markers. We are interested in full-tree FDR control – all epistasis discoveries in the whole tree. The rejection threshold q should be chosen such that the full-tree FDR is controlled at the level 0.1.
  6. We implemented the algorithm on Arabidopsis data of 210 RILs. Around 23,000 gene expressions were produced from all five chromosomes.
  7. We then applied the algorithm: 314 gene clusters were identified (WGCNA). For the first stage of hierarchical testing, 47 sparse "framework" markers that are within 10 cM of each other were used, and 10-12 "secondary" markers were placed for each framework area. So we tested around 440,000 epistatic hypotheses.
  8. The Variance Stabilization Normalization (VSN) uses a generalized log transformation. After filtering non-expressed genes, 7,244 genes out of 22,810 remained. We also filtered out bad or non-informative markers.
  9. Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested. If instead all possible combinations of markers and raw traits were tested at one stage, about 900,000,000 tests would have been performed.