Advanced Regression
and Model Selection
UpGrad Live Session - Ankit Jain
Model Selection Techniques
● If you are looking for a good place to start when choosing a
machine learning algorithm for your dataset, here are some
general guidelines.
● How large is your training set?
○ Small: prefer high-bias/low-variance classifiers (e.g.
Naive Bayes) over low-bias/high-variance classifiers (e.g.
KNN) to avoid overfitting.
○ Large: low-bias/high-variance classifiers tend to produce
more accurate models.
Adv/Disadv of Various Algorithms
● Naive Bayes:
○ Very simple to implement as it’s just a bunch of counts.
○ If conditional independence holds, it converges faster
than, say, Logistic Regression and thus requires less
training data.
○ If you want something fast and easy that performs well, NB
is a good choice (see the sketch below).
○ Its biggest disadvantage is that it can’t learn interactions
between features in the dataset.
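A minimal R sketch of Naive Bayes, assuming the e1071 package is installed (the iris data is used purely for illustration):

  library(e1071)
  # Class-conditional distributions are estimated from simple per-class counts/means
  nb_fit <- naiveBayes(Species ~ ., data = iris)
  # Predict on the training data just to illustrate the API
  pred <- predict(nb_fit, newdata = iris)
  table(pred, iris$Species)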
Adv/Disadv of Various Algorithms
● Logistic Regression:
○ Lots of ways to regularize the model and no need to worry
about features being correlated like in Naive Bayes.
○ Nice probabilistic interpretation. Helpful in problems like
churn prediction.
○ Online algorithm: easy to update the model with new data
(using an online gradient descent method); see the sketch
below.
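A minimal R sketch, assuming the glmnet package is installed and using simulated data with illustrative names: glm() fits a plain logistic regression, while cv.glmnet() adds a regularization penalty.

  library(glmnet)
  set.seed(1)
  n <- 200
  x <- matrix(rnorm(n * 5), n, 5)
  y <- rbinom(n, 1, plogis(x[, 1] - 0.5 * x[, 2]))
  # Unregularized logistic regression
  glm_fit <- glm(y ~ x, family = binomial)
  # L2-regularized logistic regression (alpha = 0 gives the ridge penalty)
  glmnet_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
  coef(glmnet_fit, s = "lambda.min")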
Adv/Disadv of Various Algorithms
● Decision Trees:
○ Easy to explain and interpret (at least for some people)
○ Easily handles feature interactions.
○ No need to worry about outliers or whether data is linearly
separable or not.
○ Doesn’t support online learning. Rebuilding the model with
new data every time can be painful.
○ Tends to overfit easily. Solution: ensemble methods such as
Random Forests (see the sketch below).
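A minimal R sketch, assuming the rpart and randomForest packages are installed (iris is used purely for illustration): a single tree is easy to interpret but overfits, while a random forest averages many trees to reduce that variance.

  library(rpart)
  library(randomForest)
  # Single decision tree: easy to plot and explain, prone to overfitting
  tree_fit <- rpart(Species ~ ., data = iris)
  # Random forest: an ensemble of trees that reduces the variance of a single tree
  rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500)
  print(rf_fit)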
Adv/Disadv of Various Algorithms
● SVM:
○ High accuracy for many datasets
○ With appropriate kernel, can work well even if your data
isn’t linearly separable in the base feature space.
○ Popular in text processing applications, where high
dimensionality is the norm.
○ Memory intensive, hard to interpret, and kind of annoying
to run and tune (see the sketch below).
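A minimal R sketch with e1071::svm, assuming the package is installed; the RBF kernel lets the model separate classes that are not linearly separable in the original feature space, and tune() illustrates the parameter search that makes SVMs fiddly to tune.

  library(e1071)
  # RBF-kernel SVM on iris (illustrative only)
  svm_fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
  table(predict(svm_fit, iris), iris$Species)
  # Grid search over the cost parameter -- the "annoying to tune" part
  tuned <- tune(svm, Species ~ ., data = iris, ranges = list(cost = c(0.1, 1, 10)))
  summary(tuned)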
ADVANCED REGRESSION
Linear Regression Issues
● Sensitivity to outliers
● Multicollinearity leads to high variance of the estimator.
● Prone to overfitting if there are a lot of variables.
● Hard to interpret when the number of predictors is large; we
need a smaller subset that exhibits the strongest effects.
Regularization Techniques
● Regularization techniques typically work by penalizing the
magnitude of coefficients of features along with minimizing
the error between predicted and actual observations
● Different types of penalization
○ Ridge Regression: penalizes the sum of squared coefficients
(L2 penalty)
○ Lasso Regression: penalizes the sum of absolute values of
the coefficients (L1 penalty)
Why penalize on model coefficients?
Model1 (linear): y = beta0 + beta1*x, fitted beta1 = -0.58
Model2 (degree-10 polynomial): y = beta0 + beta1*x + … + beta10*x^10, fitted beta1 = -1.4e05
Without a penalty, the flexible model’s coefficients blow up in
magnitude, which is exactly what regularization discourages (see
the sketch below).
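A small R illustration (simulated data, illustrative values): a degree-10 polynomial fit on a handful of points produces coefficients orders of magnitude larger than the simple linear fit.

  set.seed(1)
  x <- seq(0, 1, length.out = 15)
  y <- sin(2 * pi * x) + rnorm(15, sd = 0.2)
  model1 <- lm(y ~ x)                        # simple linear fit
  model2 <- lm(y ~ poly(x, 10, raw = TRUE))  # degree-10 polynomial fit
  coef(model1)  # small, stable coefficients
  coef(model2)  # coefficients explode in magnitude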
Ridge Regression
● L2 penalty
● Pros
○ Works when the number of variables far exceeds the number
of rows (p >> n)
○ Handles multicollinearity
○ Higher bias but lower variance than Linear Regression
● Cons
○ Doesn’t produce a parsimonious model (all predictors are
kept)
Let’s see a collinearity example in R
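Here is one way that collinearity example could look, using simulated data and assuming glmnet is installed: x2 is nearly a copy of x1, so the OLS coefficients become unstable with huge standard errors, while ridge (alpha = 0) shrinks the two coefficients toward each other.

  library(glmnet)
  set.seed(42)
  n  <- 100
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.01)   # almost perfectly collinear with x1
  y  <- 3 * x1 + rnorm(n)
  summary(lm(y ~ x1 + x2))         # note the inflated standard errors
  ridge_fit <- cv.glmnet(cbind(x1, x2), y, alpha = 0)  # alpha = 0 -> L2 penalty
  coef(ridge_fit, s = "lambda.min")                    # effect split across x1 and x2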
Example: Leukemia Prediction
● Leukemia Data, Golub et al. Science 1999
● There are 38 training samples and 34 test samples with total
genes ~ 7000 (p >> n)
● Xij is the gene expression value for sample i and gene j
● Sample i either has tumor type AML or ALL
● We want to select genes relevant to tumor type
○ eliminate the trivial genes
○ grouped selection as many genes are highly correlated
● Ridge Regression can help with this modeling task
Grouped Selection
● If two predictors are highly correlated among themselves, the
estimated coefficients will be similar for them.
● If some variables are exactly identical, they will get the same
coefficients (see the small check below)
Ridge is good for grouped selection but not good for eliminating
trivial genes
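A tiny check of the grouped-selection property, with simulated data and assuming glmnet is installed: two identical columns end up with (essentially) the same ridge coefficient.

  library(glmnet)
  set.seed(7)
  x1 <- rnorm(100)
  X  <- cbind(g1 = x1, g2 = x1, g3 = rnorm(100))  # g1 and g2 are identical
  y  <- 2 * x1 + rnorm(100)
  ridge_fit <- glmnet(X, y, alpha = 0, lambda = 0.5)
  coef(ridge_fit)  # g1 and g2 share the effect roughly equally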
LASSO
● Pros
○ Allow p >> n
○ Enforce sparsity in parameters
● Cons
○ If a group of predictors is highly correlated among
themselves, LASSO tends to pick only one of them and shrink
the others to zero
○ Cannot do grouped selection; tends to select only one
variable from a correlated group (see the sketch below)
LASSO is good for eliminating trivial genes but not good for
grouped selection
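A minimal LASSO sketch in R with glmnet (alpha = 1), assuming the package is installed and using simulated data where p >> n and only two features matter; most fitted coefficients come out exactly zero.

  library(glmnet)
  set.seed(1)
  n <- 50; p <- 200
  X <- matrix(rnorm(n * p), n, p)
  y <- X[, 1] - 2 * X[, 2] + rnorm(n)      # only the first two features matter
  lasso_fit <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 -> pure L1 (lasso) penalty
  b <- coef(lasso_fit, s = "lambda.min")
  sum(b != 0)                              # number of non-zero coefficients (sparse)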
Elastic Net
● Weighted combination of the L1 and L2 penalties
● Helps in enforcing sparsity
● Encourages a grouping effect among highly correlated predictors
(see the sketch below)
In the gene selection problem, it can achieve both purposes:
removing trivial genes and performing grouped selection
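A minimal elastic net sketch in R, assuming glmnet is installed and using simulated data: setting alpha between 0 and 1 mixes the L1 and L2 penalties, so highly correlated predictors tend to enter or leave the model together.

  library(glmnet)
  set.seed(1)
  n <- 50; p <- 200
  X <- matrix(rnorm(n * p), n, p)
  X[, 2] <- X[, 1] + rnorm(n, sd = 0.05)   # columns 1 and 2 are highly correlated
  y <- X[, 1] + rnorm(n)
  enet_fit <- cv.glmnet(X, y, alpha = 0.5) # 50/50 mix of lasso and ridge penalties
  coef(enet_fit, s = "lambda.min")[1:5, ]  # the correlated pair tends to be kept together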
Other Advanced Regression Methods
Poisson Regression
○ Typically used when the Y variable follows a Poisson
distribution (typically counts of events within a time window t)
○ Example: the number of times a customer will visit an
e-commerce website next month (see the sketch below)
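A minimal Poisson regression sketch in base R with simulated, illustrative data (the variable names are made up): the response is a count, and glm() with family = poisson models its log-mean.

  set.seed(1)
  n <- 500
  past_visits <- rpois(n, 3)
  is_member   <- rbinom(n, 1, 0.4)
  visits_next_month <- rpois(n, exp(0.2 + 0.2 * past_visits + 0.5 * is_member))
  pois_fit <- glm(visits_next_month ~ past_visits + is_member, family = poisson)
  summary(pois_fit)
  exp(coef(pois_fit))  # multiplicative effects on the expected visit count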
Piecewise Linear Regression
● Polynomial regression won’t work well here, as it has a high
tendency to overfit or underfit.
● Instead, splitting the curve into separate linear pieces and
building a linear model for each piece leads to better results
(see the sketch below).
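A minimal piecewise-linear sketch in base R with simulated data and a knot placed by hand at x = 0.5: adding a hinge term pmax(x - knot, 0) lets the slope change at the knot while keeping the fit continuous, one linear piece on each side.

  set.seed(1)
  x <- runif(200)
  y <- ifelse(x < 0.5, 2 * x, 1 - 3 * (x - 0.5)) + rnorm(200, sd = 0.1)
  knot <- 0.5
  pw_fit <- lm(y ~ x + pmax(x - knot, 0))  # slope is beta_x left of the knot,
  summary(pw_fit)                          # beta_x + beta_hinge to the right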
QUESTIONS