AutoML can automate data preparation, feature extraction, model selection, and model tuning. This can save a Data Scientist loads of time. So instead of hiring four Data Scientists, you may only need two, right?
It’s no secret that there is a shortage of data science talent to help companies produce advanced analytics from their stockpiles of data. There is also a plethora of vendor tools promising to turn an analyst into the next great data scientist (which, BTW, is possible).
The hardcore mathematicians, statisticians and computer scientists (who created this stuff in the first place) have built more advanced tools that automate the model creation process to help Data Scientists become more efficient and (hopefully) better at our jobs.
I will demo AutoML, discuss some pros/cons, and show what it can do for you.
AutoML: Helping to Bridge Skills Gap Between Data Enthusiasts & Data Scientists
1. AUTOML: HELPING TO BRIDGE SKILLS GAP BETWEEN DATA ENTHUSIASTS & DATA SCIENTISTS
By Josh Janzen – Data Scientist
2. AUTOML: AGENDA
What is ML
What is AutoML
Animated Visualizations of ML vs AutoML
AutoML code demos on Titanic Dataset
Comparison of AutoML tools available
4. THE EMERGING FIELD OF DATA SCIENCE
“The more I learn, the more I realize how much I don’t know.”
5. EXAMPLE OF ML (MACHINE LEARNING)
Example: should I wear a coat today?
1. Gather historical data (years of weather, were you cold, hot, comfortable, activity levels, did you wear a coat)
2. Apply algorithm to learn relationships (learn impact of weather, activity, coat, to determine whether you were comfortable)
3. Predict on new data/future (is it a good idea to wear a coat today?) NO. 95.2% chance of being comfortable by not wearing a coat
ML finds relationships in large datasets to help us understand patterns and make predictions
what is ML video
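The three steps above can be sketched in a few lines of Python. This is only an illustration, not the method behind the 95.2% figure on the slide: the history records are made up, and a simple k-nearest-neighbors vote is an assumed stand-in for "apply algorithm". The function name and feature choices are hypothetical.

```python
from math import dist

# Step 1: gather historical data. These records are made up for illustration:
# ((temperature in °F, activity level 0-10), comfortable_without_coat)
HISTORY = [
    ((30, 2), False),
    ((35, 5), False),
    ((40, 1), False),
    ((45, 3), False),
    ((55, 1), False),
    ((50, 8), True),
    ((60, 2), True),
    ((65, 4), True),
    ((70, 6), True),
    ((75, 1), True),
]


def chance_comfortable_without_coat(today, history=HISTORY, k=3):
    """Estimate the chance of being comfortable with no coat on a day like `today`."""
    # Step 2: "learn" from history by finding the k most similar past days.
    nearest = sorted(history, key=lambda record: dist(record[0], today))[:k]
    # Step 3: predict the share of those days that were comfortable without a coat.
    return sum(comfortable for _, comfortable in nearest) / k


today = (68, 3)  # hypothetical: 68 °F, light activity
chance = chance_comfortable_without_coat(today)
print(f"Wear a coat today? {'NO' if chance >= 0.5 else 'YES'} "
      f"({chance:.0%} chance of being comfortable without one)")
```

With real data the features would usually be scaled first, since temperature and activity level are on different ranges and the distance is dominated by the larger one.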
6. WHAT IS ML (MACHINE LEARNING)
ML != Analysis
ML is another tool to help further drive value from data.
1. There is a place for business analysis, and a place for ML. They are very different
2. ML is another tool to help drive value from data, especially with large, complex datasets
3. Both ML & Analysis need an understanding of the business to be successful
4. ML can’t solve every data problem
5. ML is a vast and growing field
7. AUTOML
WHAT IT IS
• “Tools to make Data Scientists more efficient”
• “..Data Science democratization”
• AutoML makes ML available to Data Enthusiasts
• Simplifies the Machine Learning model building process by applying Computer Science and Statistical techniques to find an optimal model in an efficient amount of time.
WHAT IT ISN’T
• The silver bullet to make all ML better
• Proven to produce better results than a very experienced ML Data Scientist
• Simpler and easier to use (at least not yet)
• Another data buzz word like “big data”
8. WHAT IS A DATA SCIENCE ENTHUSIAST
• The software developer who wants to try ML
• The college student with aspirations to be a DS
• The mid-level MGR looking to up their game
• The analyst looking to differentiate their skills
• Do not need advanced MATH & STATS skills
Estimated from multiple sources including https://www.kdnuggets.com/2018/09/how-many-data-scientists-are-there.html
9. ML VS. AUTOML
[Animation frame. Credit: Josh Janzen. Label shown: Data Scientist]
ML without AutoML
Work w/Business to Identify Opportunity
Import Data
Visualize and Structure the Dataset
Prediction Score: 0.752
10. ML VS. AUTOML
[Animation frame. Credit: Josh Janzen. Labels shown: Data Scientist, SME]
ML without AutoML
Work w/Business to Identify Opportunity
Import Data
Visualize and Structure the Dataset
Preprocessing
- Missing values
- Outlier handling
- Checking variable types
Feature Engineering
- Feature selection
- Feature transformation
Partition Data & Model Selection
- Split data train, valid, test
- Import ML libraries
- Try various algorithm(s)
- Score models
Model Tuning
- Evaluate model
- Tune hyper parameters
Predict on New Data
- Save best model
- Run to make predictions
Prediction Score: 0.752
ML with AutoML
Work w/Business to Identify Opportunity
Import Data
Visualize and Structure the Dataset
Prediction Score: 0.752
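The Preprocessing bullets (missing values, outlier handling, checking variable types) are the kind of work a Data Scientist does by hand without AutoML. A minimal sketch, using only the Python standard library and a made-up list of passenger ages; median imputation and 1.5 x IQR clipping are assumed choices for illustration, not the ones used in the talk:

```python
from statistics import median, quantiles

# Made-up ages; None marks a missing value and 480.0 mimics a data-entry error.
ages = [22.0, 38.0, None, 35.0, 54.0, 2.0, None, 27.0, 14.0, 480.0]

# Checking variable types: every present value should be a float.
assert all(value is None or isinstance(value, float) for value in ages)

# Missing values: fill gaps with the median of the known values.
known = [value for value in ages if value is not None]
fill_value = median(known)
imputed = [fill_value if value is None else value for value in ages]

# Outlier handling: clip to the 1.5 * IQR fences around the quartiles.
q1, _, q3 = quantiles(known, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [min(max(value, low), high) for value in imputed]

print("fill value:", fill_value)
print("fences:", low, high)
print("cleaned:", cleaned)
```

Each of these choices (median vs. mean, clip vs. drop) is a decision an AutoML tool would otherwise make or search over on your behalf.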
12. ML VS. AUTOML
[Animation frame. Credit: Josh Janzen. Label shown: Data Scientist]
ML without AutoML
Work w/Business to Identify Opportunity
Import Data
Visualize and Structure the Dataset
Preprocessing
- Missing values
- Outlier handling
- Checking variable types
Feature Engineering
- Feature selection
- Feature transformation
Partition Data & Model Selection
- Split data train, valid, test
- Import ML libraries
- Try various algorithm(s)
- Score models
Model Tuning
- Evaluate model
- Tune hyper parameters
Predict on New Data
- Save best model
- Run to make predictions
Prediction Score: 0.752
ML with AutoML
Work w/Business to Identify Opportunity
Import Data
Visualize and Structure the Dataset
- Automatically build and evaluate 100s of models
- Review performance and variable importance
Prediction Score: 0.752
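"Automatically build and evaluate 100s of models" boils down to a search loop: generate candidate models, score each on validation data, keep a leaderboard. The sketch below shows that loop in plain Python. Everything in it is an assumption for illustration: the data is synthetic (a made-up stand-in for a Titanic-style table), and each candidate is a one-rule model (a decision stump) standing in for the real algorithms and hyperparameters a tool would try.

```python
from itertools import product

FEATURES = ["is_female", "pclass", "age"]  # hypothetical column names


def make_row(i):
    """Build one synthetic passenger; the label mostly follows is_female."""
    is_female = i % 2
    pclass = (i // 2) % 3 + 1
    age = (i * 7) % 60 + 1
    # Every 10th row is flipped to "survived" to add some noise.
    survived = 1 if (is_female == 1 or i % 10 == 0) else 0
    return [is_female, pclass, age], survived


# Split data into train, valid, test.
rows = [make_row(i) for i in range(200)]
train, valid, test = rows[:120], rows[120:160], rows[160:]


def stump_predict(x, feature, threshold, direction):
    """Predict 1 when the feature is above (">") or not above ("<=") the threshold."""
    above = x[feature] > threshold
    return int(above) if direction == ">" else int(not above)


def accuracy(data, feature, threshold, direction):
    hits = sum(stump_predict(x, feature, threshold, direction) == y for x, y in data)
    return hits / len(data)


def candidate_models(data):
    """Yield every (feature, threshold, direction) combination seen in the data."""
    for feature in range(len(FEATURES)):
        values = sorted({x[feature] for x, _ in data})
        midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold, direction in product(midpoints, (">", "<=")):
            yield feature, threshold, direction


# Try every candidate and score it on the validation split.
leaderboard = sorted(
    ((accuracy(valid, *model), model) for model in candidate_models(train)),
    key=lambda entry: entry[0],
    reverse=True,
)

# Keep the best model and check it once on the held-out test split.
best_score, best_model = leaderboard[0]
test_score = accuracy(test, *best_model)
print(len(leaderboard), "candidates tried")
print("best:", FEATURES[best_model[0]], best_model[2], best_model[1])
print("valid accuracy:", best_score, "test accuracy:", test_score)
```

The leaderboard is also a crude view of which variables matter: features whose candidates land near the top carry most of the signal.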
13. DEMO WITH TITANIC DATASET
• IPython notebook: https://github.com/donnemartin/data-science-ipython-notebooks/blob/master/kaggle/titanic.ipynb
• Run MLBox from command line
• Demo AutoML App
• Show Azure AutoML tool
17. AZURE ML RESULTS
PROS:
• No code to write
• Lots of investment from Microsoft in this space
CONS:
• Slower than expected, took about 20 min
• No easy way to create new predictions
• Process not as polished or easy to use as expected
18. NEXT FRONTIER: AUTO FEATURE ENGINEERING
Source: https://towardsdatascience.com/feature-engineering-what-powers-machine-learning-93ab191bcc2d
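One simple idea behind automated feature engineering is to generate candidate features mechanically and then rank them against the target. The sketch below is a rough illustration of that idea, not the approach of any specific tool or of the linked article. The columns borrow Titanic-style names, but the values are synthetic, and the target is constructed to depend on sibsp + parch so the sketch has a known answer.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y) if var_x and var_y else 0.0


# Synthetic base columns (made-up values, Titanic-style names).
columns = {
    "sibsp": [i % 4 for i in range(24)],
    "parch": [(i // 4) % 3 for i in range(24)],
    "fare": [10 + (i * 13) % 50 for i in range(24)],
}
# Constructed target: driven by family size, which is not a base column.
target = [3 * (s + p) + 2 for s, p in zip(columns["sibsp"], columns["parch"])]

OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def generate_features(base):
    """Return the base columns plus every pairwise combination under OPERATIONS."""
    features = dict(base)
    for (name_a, a), (name_b, b) in combinations(base.items(), 2):
        for symbol, op in OPERATIONS.items():
            features[f"{name_a} {symbol} {name_b}"] = [op(x, y) for x, y in zip(a, b)]
    return features


features = generate_features(columns)
# Rank candidates by absolute correlation with the target, strongest first.
ranking = sorted(features, key=lambda name: abs(pearson(features[name], target)), reverse=True)
print(len(features), "candidate features")
print("top 3:", ranking[:3])
```

Real tools search a much larger space (aggregations, date parts, encodings) and need to guard against overfitting to spurious correlations, which this sketch ignores.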