BigML Release: PCA

Introducing Principal Component Analysis
PCA Release

BigML, Inc BigML PCA Release Webinar
Fall 2018 Release
Speaker: GREGORY ANTELL, Ph.D. - Machine Learning Architect and Product Manager
Moderator: ATAKAN CETINSOY - VP of Predictive Applications
Resources: https://bigml.com/releases/fall-2018
Contact: support@bigml.com
Twitter: @bigmlcom
Questions: Please enter questions into the chat box. We will answer some via chat and others at the end of the session.

Agenda: Principal Component Analysis
1 Utility in Machine Learning Workflows
2 High Dimensional Data in Machine Learning
3 PCA Intuition and Methodology
4 Use Cases with the BigML Dashboard
5 BigML Implementation

Steps of an ML Application
Problem Formulation
Data Acquisition
Feature Engineering
Modeling and Evaluations
Predictions
Measure Results
Data Transformations
Task

• Improvement comes more often from more data or better features than from changing models
• Garbage In, Garbage Out principle
• Model training and hyper-parameter tuning can be automated; feature engineering (mostly) cannot

Today’s release further expands what is possible in Feature Engineering

High-dimensional Data
Data matrix: Instances (n) as rows I1 … IN, Features (p) as columns F1 … FN
Machine Learning typically performs better when n >>> p
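
A minimal NumPy sketch of why n >> p matters. It is an illustration, not part of the BigML release; the function name `train_residual` and the toy sizes are assumptions. It fits ordinary least squares to targets that are pure noise: with more features than instances the noise is reproduced almost exactly, which is overfitting in its simplest form.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_residual(n, p):
    """Fit least squares to pure-noise targets and return the training RMSE."""
    X = rng.normal(size=(n, p))      # n instances, p features
    y = rng.normal(size=n)           # targets unrelated to X
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((X @ coef - y) ** 2)))

wide = train_residual(n=20, p=50)    # p > n: the noise is fit almost exactly
tall = train_residual(n=2000, p=50)  # n >> p: the noise cannot be memorized
print(f"training RMSE with p > n:  {wide:.2e}")
print(f"training RMSE with n >> p: {tall:.2f}")
```

A near-zero training error on random targets means the model has enough freedom to memorize anything, so its training fit says nothing about generalization.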

Dangers of High-dimensional Data
• Implicitly increases model complexity, making models prone to overfitting
• Requires more observations in order to generalize well
• Contains correlated or useless variables
• Data is difficult to visualize
• Takes longer to train models and make predictions
Principal Component Analysis addresses all of these issues

Model Complexity and Training Data
• Models with lower complexity will converge to higher test error rates
• A threshold exists where enough training data is available to favor the more complex model
• With a fixed amount of data, less complex models are often favored
Figure: Test Error vs. Number of training examples for Model 1 and Model 2, with regions labeled "Less Complex Model Wins" and "More Complex Model Wins"
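
The crossover described on this slide can be reproduced on synthetic data. The sketch below is an assumption-laden illustration, not the models from the webinar: the "less complex" model is a straight line, the "more complex" model is a degree-9 polynomial, and the target is a noisy sine curve. The helper `mean_test_mse` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(6)

def mean_test_mse(degree, n_train, trials=100):
    """Average test MSE of a least-squares polynomial fit of the given degree."""
    x_test = np.linspace(0.0, 1.0, 200)
    y_test = np.sin(2 * np.pi * x_test)
    errors = []
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n_train)
        model = np.polynomial.Polynomial.fit(x, y, degree)
        errors.append(np.mean((model(x_test) - y_test) ** 2))
    return float(np.mean(errors))

simple_small = mean_test_mse(degree=1, n_train=12)
complex_small = mean_test_mse(degree=9, n_train=12)
simple_large = mean_test_mse(degree=1, n_train=1000)
complex_large = mean_test_mse(degree=9, n_train=1000)

print("12 training examples:  ", simple_small, complex_small)
print("1000 training examples:", simple_large, complex_large)
```

With few examples the straight line has the lower average test error; with many examples the polynomial does, while the line levels off at a higher error.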

Combating High-dimensional Data
MODEL: Pruning, Node threshold
ENSEMBLE: Bagging, Randomization
LOGISTIC REGRESSION: L1 and L2 penalties
DEEPNET: Dropout

Dimensionality Reduction
Feature Selection
• Preserves the original variables and selects a subset
• Often uses recursive methods or statistical thresholds
• Examples: RFE, Chi-Squared Test, Boruta
Feature Extraction
• Transforms original variables into variables better suited for modeling
• Examples: word vectors, clustering
• PCA falls into this category
Reducing the dimensions will decrease model complexity

Why Consider Using PCA?
1. You want to reduce the number of variables in your model, but it is not clear which should be eliminated
2. You want to generate variables that are not correlated
3. You are okay with sacrificing some amount of interpretability for potential downstream performance gains

PCA in Machine Learning Workflows
Workflow diagram: SOURCE, DATASET, TRAIN and TEST splits, PCA, one BATCH PROJECTION per split, NEW TRAIN FEATURES and NEW TEST FEATURES
What’s special about these new features?
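
A sketch of this workflow in plain NumPy, not the BigML API. It assumes the PCA is learned from the TRAIN split only and then applied unchanged to both splits, which is common practice but is not stated on the slide. The toy data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
train, test = data[:150], data[150:]

# "PCA" step: learn the centering and the component directions from TRAIN only.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt.T                      # columns are principal directions

# "BATCH PROJECTION" step: apply the same transformation to each split.
new_train = (train - mean) @ components
new_test = (test - mean) @ components

print(new_train.shape, new_test.shape)
```

Reusing the training mean and components for the test split keeps both sets of new features in the same coordinate system and avoids leaking test information into the transformation.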

What Does PCA Yield?
Original Data Matrix: rows I1 … IN, columns F1 … FN
Transformed Data Matrix: rows I1 … IN, columns PC1 … PCN
The new variables are the “principal components”

Properties of Principal Components
Each PC is a linear combination of the original variables, with its own set of weights
PC1 = w1,1 F1 + w1,2 F2 + w1,3 F3 + … + w1,N FN
PC2 = w2,1 F1 + w2,2 F2 + w2,3 F3 + … + w2,N FN
…
PCN = wN,1 F1 + wN,2 F2 + wN,3 F3 + … + wN,N FN
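
A small NumPy sketch of the linear-combination property, under the textbook assumption that the weights are the eigenvectors of the covariance matrix of the centered features. The toy data and variable names are illustrative, and this is not BigML's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)                       # center each feature

# Eigenvectors of the covariance matrix hold the weights.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]             # largest variance first
W = eigvecs[:, order]                         # W[j, k]: weight of feature j in PC k+1

scores = Xc @ W                               # all PCs for all instances

# PC1 of the first instance, written out as a weighted sum of its features.
pc1_manual = sum(W[j, 0] * Xc[0, j] for j in range(3))
print(pc1_manual, scores[0, 0])
```

The matrix product `Xc @ W` is the same weighted sum computed for every instance and every component at once.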

Geometric Interpretation of PCA

Intuition Behind Principal Components

Figure panels: Original Data, Transformed Data
Principal Components are not correlated
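
The claim can be checked numerically. This sketch builds two strongly correlated toy features (an assumption for illustration), projects them onto their principal components with NumPy's SVD, and compares the correlation before and after.

```python
import numpy as np

rng = np.random.default_rng(3)
f1 = rng.normal(size=500)
f2 = 0.8 * f1 + 0.2 * rng.normal(size=500)    # strongly correlated with f1
X = np.column_stack([f1, f2])
Xc = X - X.mean(axis=0)

_, _, vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ vt.T                            # transformed data

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"correlation of F1 and F2:   {corr_before:.3f}")
print(f"correlation of PC1 and PC2: {corr_after:.1e}")
```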

Principal Components are sorted by the percentage of variance explained in the original data
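
One way to compute these percentages, sketched with NumPy under the assumption that variance is measured on centered, unscaled features. The toy data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.1])  # unequal spreads
Xc = X - X.mean(axis=0)

singular_values = np.linalg.svd(Xc, compute_uv=False)  # returned in descending order
variances = singular_values ** 2 / (len(X) - 1)        # variance along each PC
percent_explained = 100 * variances / variances.sum()

print(np.round(percent_explained, 2))
```

The variances along the components add up to the total variance of the original features, so the percentages sum to 100.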

How to Reduce Dimensions
Approach #1
Directly select how many PCs to keep
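
A sketch of Approach #1 in NumPy. The toy data is built from 2 underlying factors spread over 6 features, so k = 2 is a sensible choice here; both the data and the value of k are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
latent = rng.normal(size=(200, 2))                       # 2 underlying factors
X = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(200, 6))
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)

k = 2                                                    # chosen directly by the user
reduced = (X - mean) @ vt[:k].T                          # 6 features -> k PCs
restored = reduced @ vt[:k] + mean                       # map back to feature space
rel_error = np.linalg.norm(X - restored) / np.linalg.norm(X - mean)

print(reduced.shape, f"relative reconstruction error: {rel_error:.4f}")
```

The reconstruction error shows how much information the discarded components carried.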

How to Reduce Dimensions
Approach #2
Select a threshold for the cumulative Percent Variance Explained
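
A sketch of Approach #2. The function name `components_for_threshold` and the percentages are hypothetical; the percentages are assumed to be sorted in descending order, as principal components are.

```python
import numpy as np

def components_for_threshold(percent_explained, threshold):
    """Smallest number of leading PCs whose cumulative percentage reaches threshold."""
    cumulative = np.cumsum(percent_explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))     # guard against floating-point shortfall at 100

percent_explained = np.array([60.0, 25.0, 10.0, 4.0, 1.0])  # illustrative values
k = components_for_threshold(percent_explained, 90.0)
print(k)
```

Here the cumulative percentages are 60, 85, 95, 99 and 100, so a 90% threshold keeps the first three components.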

4 Use Cases with BigML Dashboard

BigML-Specific Implementation
• Standard PCA only applies to numerical data
• BigML uses three different data transformation methods in order to handle different data types
  • Numeric data: Principal Component Analysis (PCA)
  • Categorical data: Multiple Correspondence Analysis (MCA)
  • Mixed data: Factorial Analysis of Mixed Data (FAMD)
• BigML will automatically handle numeric, text, items, and categorical data without needing user input

More Info
https://bigml.com/releases/fall-2018

Questions?
@bigmlcom support@bigml.com
