BigML brings Principal Component Analysis (PCA) to the platform, a key unsupervised Machine Learning technique used to transform a given dataset in order to yield uncorrelated features and reduce dimensionality. BigML's PCA implementation is distinct from other approaches to PCA in that it can handle numeric and non-numeric data types, including text, categorical, and items fields, as well as combinations of different data types. PCA can be used in any industry vertical as a preprocessing technique to improve supervised learning performance, with the caveat that some measure of interpretability may be sacrificed. It is commonly applied in fields with high-dimensional data, including bioinformatics, quantitative finance, and signal processing.
2. BigML, Inc BigML PCA Release Webinar
Fall 2018 Release
Speaker: GREGORY ANTELL, Ph.D. - Machine Learning Architect and Product Manager
Moderator: ATAKAN CETINSOY - VP of Predictive Applications
Questions: Please enter questions into chat box. We will answer some via chat and others at the end of the session
Resources: https://bigml.com/releases/fall-2018
Contact support@bigml.com
Twitter @bigmlcom
3. BigML, Inc BigML PCA Release Webinar
Agenda: Principal Component Analysis
1 Utility in Machine Learning Workflows
2 High Dimensional Data in Machine Learning
3 PCA Intuition and Methodology
4 Use Cases with the BigML Dashboard
5 BigML Implementation
5. BigML, Inc BigML PCA Release Webinar
Steps of an ML Application
Problem Formulation
Data Acquisition
Feature Engineering
Modeling and Evaluations
Predictions
Measure Results
Data Transformations
Task
6. BigML, Inc BigML PCA Release Webinar
Steps of an ML Application
• More often than changing models, improvement comes from more data or better features
• Garbage In, Garbage Out principle
• Model training and hyper-parameter tuning can be automated, feature engineering (mostly) cannot
7. BigML, Inc BigML PCA Release Webinar
Steps of an ML Application
Today’s release further expands what is possible in
9. BigML, Inc BigML PCA Release Webinar
High-dimensional Data
[Figure: data matrix with Instances (n) I1 … IN as rows and Features (p) F1 … FN as columns]
Machine Learning typically performs better when n >>> p
10. BigML, Inc BigML PCA Release Webinar
Dangers of high-dimensional Data
• Implicitly increases model complexity, prone to overfitting
• Requires more observations in order to generalize well
• Contains correlated or useless variables
• Data is difficult to visualize
• Takes a longer time to train models or make predictions
Principal Component Analysis addresses all of these issues
11. BigML, Inc BigML PCA Release Webinar
Model Complexity and Training Data
• Models with lower complexity will converge to higher test error rates
[Plot: Test Error vs. Number of training examples, for Model 1 and Model 2]
12. BigML, Inc BigML PCA Release Webinar
Model Complexity and Training Data
• Models with lower complexity will converge to higher test error rates
• A threshold exists where enough training data is available to favor the more complex model
• With a fixed amount of data, less complex models are often favored
[Plot: Test Error vs. Number of training examples, for Model 1 and Model 2, with regions labeled "Less Complex Model Wins" and "More Complex Model Wins"]
13. BigML, Inc BigML PCA Release Webinar
Combating High-dimensional Data
MODEL: Pruning, Node threshold
ENSEMBLE: Bagging, Randomization
LOGISTIC REGRESSION: L1 and L2 penalties
DEEPNET: Dropout
14. BigML, Inc BigML PCA Release Webinar
Dimensionality Reduction
Feature Selection
• Preserves the original variables and selects a subset
• Often uses recursive methods or statistical thresholds
• Examples: RFE, Chi-Squared Test, Boruta
Feature Extraction
• Transforms original variables into variables better suited for modeling
• Examples: word vectors, clustering
• PCA falls into this category
Reducing the dimensions will decrease model complexity
16. BigML, Inc BigML PCA Release Webinar
Why Consider Using PCA?
1. You want to reduce the number of variables in your model, but it is not clear which should be eliminated
2. You want to generate variables that are not correlated
3. You are okay with sacrificing some amount of interpretability for potential downstream performance gains
17. BigML, Inc BigML PCA Release Webinar
PCA in Machine Learning Workflows
[Diagram: SOURCE, DATASET, TRAIN, TEST]
18. BigML, Inc BigML PCA Release Webinar
PCA in Machine Learning Workflows
[Diagram: SOURCE, DATASET, TRAIN, TEST, PCA]
20. BigML, Inc BigML PCA Release Webinar
PCA in Machine Learning Workflows
[Diagram: SOURCE, DATASET, TRAIN, TEST, PCA, two BATCH PROJECTION steps, NEW TRAIN FEATURES, NEW TEST FEATURES]
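The workflow on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not BigML's implementation: it assumes purely numeric data, an 80/20 split, and that the PCA is fit on the TRAIN split only and then applied unchanged to both splits. The helper names `fit_pca` and `project` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the DATASET: 100 instances, 5 numeric features (synthetic).
dataset = rng.normal(size=(100, 5))

# Split into TRAIN (80%) and TEST (20%).
train, test = dataset[:80], dataset[80:]

def fit_pca(X, n_components):
    """Learn the feature means and the top principal directions from X."""
    mean = X.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    # eigh returns ascending eigenvalues; keep the largest n_components.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return mean, eigenvectors[:, order]

def project(X, mean, components):
    """Apply a previously fitted PCA to any data with the same features."""
    return (X - mean) @ components

# Fit on TRAIN only, then project both splits with the same transformation.
mean, components = fit_pca(train, n_components=2)
new_train_features = project(train, mean, components)
new_test_features = project(test, mean, components)
```

Reusing the means and directions learned from TRAIN keeps the TEST projection consistent with the features a downstream model was trained on.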
21. BigML, Inc BigML PCA Release Webinar
PCA in Machine Learning Workflows
What’s special about these new features?
22. BigML, Inc BigML PCA Release Webinar
What Does PCA Yield?
Original Data Matrix: instances I1 … IN (rows) by features F1 … FN (columns)
Transformed Data Matrix: instances I1 … IN (rows) by principal components PC1 … PCN (columns)
The new variables are the “principal components”
23. BigML, Inc BigML PCA Release Webinar
Properties of Principal Components
Each PC is a linear combination of original variables, with its own set of weights
PC1 = w1,1 F1 + w1,2 F2 + w1,3 F3 + … + w1,N FN
PC2 = w2,1 F1 + w2,2 F2 + w2,3 F3 + … + w2,N FN
…
PCN = wN,1 F1 + wN,2 F2 + wN,3 F3 + … + wN,N FN
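As a minimal sketch of this property, the NumPy snippet below computes principal components for a small made-up dataset by eigen-decomposition of the covariance matrix. This is textbook PCA on centered numeric data, not necessarily how BigML computes it; the data values and variable names are illustrative assumptions.

```python
import numpy as np

# Toy data: 6 instances (rows) x 3 features (columns); values are made up.
X = np.array([
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.9],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 1.0],
    [3.1, 3.0, 0.8],
    [2.3, 2.7, 1.2],
])

# Center each feature so the components describe variation around the mean.
X_centered = X - X.mean(axis=0)

# Eigen-decomposition of the covariance matrix (symmetric, so eigh applies).
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder to descending.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
weights = eigenvectors[:, order]  # column j holds the weights of PC(j+1)

# Every PC score is a weighted sum of the centered features.
scores = X_centered @ weights

# PC1 for the first instance, written out term by term as a weighted sum.
pc1_first = sum(weights[i, 0] * X_centered[0, i] for i in range(X.shape[1]))
```

The explicit sum `pc1_first` matches `scores[0, 0]`, which is the same linear combination expressed as a matrix product.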
27. BigML, Inc BigML PCA Release Webinar
Properties of Principal Components
[Figure: two panels, Original Data and Transformed Data]
Principal Components are not correlated
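One way to check this property numerically is sketched below, assuming two synthetic, strongly correlated numeric features and plain covariance-based PCA (an illustration only, not BigML's implementation). The correlation between the original features is high, while the correlation between the resulting component scores is zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the second feature is mostly a copy of the first,
# so the two original features are strongly correlated.
f1 = rng.normal(size=200)
f2 = 0.9 * f1 + 0.1 * rng.normal(size=200)
X = np.column_stack([f1, f2])

# Project the centered data onto the eigenvectors of its covariance matrix.
X_centered = X - X.mean(axis=0)
_, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
scores = X_centered @ eigenvectors

corr_original = np.corrcoef(X, rowvar=False)[0, 1]
corr_transformed = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"correlation of original features:    {corr_original:.3f}")
print(f"correlation of principal components: {corr_transformed:.3f}")
```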
28. BigML, Inc BigML PCA Release Webinar
Properties of Principal Components
Principal Components are sorted by the percentage of variance explained in the original data
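The percentage of variance explained can be sketched from the eigenvalues of the covariance matrix, as below. This assumes synthetic numeric data and unscaled covariance-based PCA; a given platform may standardize fields first, which would change the numbers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: three independent features with very different spreads.
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

# Eigenvalues of the covariance matrix are the variances along each PC.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))

# Sort descending so that PC1 explains the most variance.
eigenvalues = np.sort(eigenvalues)[::-1]
explained = eigenvalues / eigenvalues.sum()

for i, ratio in enumerate(explained, start=1):
    print(f"PC{i}: {ratio:.1%} of variance")
```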
29. BigML, Inc BigML PCA Release Webinar
How to Reduce Dimensions
Approach #1
Directly select how many PCs to keep
30. BigML, Inc BigML PCA Release Webinar
How to Reduce Dimensions
Approach #2
Select a threshold for the cumulative Percent Variance Explained
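Both ways of choosing how many PCs to keep can be sketched in plain Python. The explained-variance ratios below are hypothetical values, assumed to be already sorted from PC1 downward, and `components_for_threshold` is an illustrative helper rather than part of any library.

```python
from itertools import accumulate

def components_for_threshold(variance_ratios, threshold):
    """Return the smallest number of leading PCs whose cumulative
    explained variance reaches `threshold` (a fraction in (0, 1])."""
    for k, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= threshold - 1e-12:  # tolerate float rounding
            return k
    return len(variance_ratios)

# Hypothetical explained-variance ratios, sorted from PC1 down.
ratios = [0.55, 0.25, 0.10, 0.06, 0.04]

# Fix the number of PCs directly.
k_fixed = 2
kept_fixed = ratios[:k_fixed]

# Keep enough PCs to explain at least 90% of the variance.
k_threshold = components_for_threshold(ratios, 0.90)
kept_threshold = ratios[:k_threshold]
```

With these ratios, a 90% threshold keeps the first three PCs, since the first two only reach 80%.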
33. BigML, Inc BigML PCA Release Webinar
BigML-Specific Implementation
• Standard PCA only applies to numerical data
• BigML uses three different data transformation methods in order to handle different data types
  • Numeric data: Principal Component Analysis (PCA)
  • Categorical data: Multiple Correspondence Analysis (MCA)
  • Mixed data: Factorial Analysis of Mixed Data (FAMD)
• BigML will automatically handle numeric, text, items, and categorical data without needing user input
34. BigML, Inc BigML PCA Release Webinar
More Info
https://bigml.com/releases/fall-2018