AutoML
Productivity for Data Science, AND …
A better way to make Digital Decisions
Dr. Steven Gustafson
Chief Scientist, Maana (2+ years)
previously, GE Research (10+ years)
(before that, PhD in AI for “automatic programming”)
What do you take away?
Observe my arguments about AutoML and the algorithm
Reason about the evidence, consider past experience
Decide to change your Data Science approach
Learn by experimentation and feedback
My Argument
• Generate new knowledge
• Find good model pipelines
• Allow your experts and data scientists to
understand, learn and improve models that drive
business decisions!
• We created our AutoML as an archetype for
architecting digital decisions!
AutoML
• Generate and tune ML pipelines
• Auto-WEKA, Auto-SKLEARN, Google NN, Azure ML, …, TPOT
• Most Bayesian learning or computation vs. improvement
• Black box – helps find solutions, not knowledge or wisdom
• Assumes future problems represented by data
• Biased by what code and data are available vs. what’s useful
• Can be very long running - hours to days
Expert Data Science
Observe the Problem, Data, Background Knowledge
Reason about data characteristics vs. goals vs. techniques
Decide on initial approaches
Learn from results and iterate
Expert Data Science Vs. AutoML
Massive compute
Optimize many, many parameters
Blind search, etc?
How do you explain results?
Justify compute budget
Engage an SME?
Does the Data Scientist learn?
Observe the Problem, Data, Background Knowledge
Reason about data characteristics
Decide on initial approaches
Learn from results and iterate
=?
What if AutoML…
• Capture & represent knowledge
• Use reasoning to “expertly” choose pipelines
• Use reinforcement learning with human input in real time to guide iteration
• Target seconds and minutes for results instead of hours and days, to match expert iteration
What is Knowledge Representation?
• A surrogate, a substitute for the thing itself.
• Enable an entity to determine consequences by thinking rather than
acting.
• A “language” in which we say things about the world.
• A “theory” of intelligent reasoning: the type of reasoning and the
applicable reasoning given data
• Guidance for organizing information to facilitate inferences to get
new expressions from old.
• A KR is not a data structure. A KR must be implemented in the
machine by some data structure.
http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html
Program Search for Machine Learning Pipelines Leveraging
Symbolic Planning and Reinforcement Learning
F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu.  Program Search for Machine
Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning.  In
Genetic Programming Theory and Practice XVI. 2018. Springer.
Symbolic planning
• Symbolic planning uses a logical formalism to represent dynamic systems and runs automated algorithms that generate plans
• A plan is a sequence of actions that reaches the goal state from an initial state
• Problems are written in a common action description language (such as B, C, C+, BC), from which a plan can be computed automatically using an ASP solver such as Clingo.
Data science consists of a set of actions that transform and fit data.
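To make the idea of a plan concrete, here is a minimal sketch in plain Python. It is not the BC/Clingo encoding used in the talk: the action names, facts, and the breadth-first search are all illustrative assumptions. Each action has preconditions and facts it adds, and the search returns every action sequence that reaches the goal, shortest first.

```python
from collections import deque

# Hypothetical action model: name -> (preconditions, facts added).
ACTIONS = {
    "vectorize": ({"raw_text"}, {"features"}),
    "reduce_dims": ({"features"}, {"reduced"}),
    "fit_classifier": ({"features"}, {"model"}),
}


def find_plans(initial, goal, max_len=3):
    """Return all action sequences (up to max_len) that reach the goal."""
    plans = []
    queue = deque([(frozenset(initial), [])])
    while queue:
        state, plan = queue.popleft()
        if goal <= state:          # every goal fact holds: this is a plan
            plans.append(plan)
            continue
        if len(plan) == max_len:   # give up on overly long sequences
            continue
        for name, (pre, add) in ACTIONS.items():
            # An action applies when its preconditions hold; use each once.
            if pre <= state and name not in plan:
                queue.append((state | add, plan + [name]))
    return plans


plans = find_plans({"raw_text"}, {"model"})
for p in plans:
    print(" -> ".join(p))
```

Because the search is breadth-first, shorter plans come out before longer ones, which loosely mirrors a planner that prefers the shortest satisfying pipeline.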
AutoML Pipelines
• Featurizers
– Count / bag of words vectorizer
– Tfidf vectorizer
• Preprocessors
– matrix decompositions (truncatedSVD, pca, kernelPCA, fastICA)
– kernel approximation (rbfsampler, nystroem)
– feature selection (selectkbest, selectpercentile)
– scaling (minmaxscaler, robustscaler, absscaler)
– no preprocessing
• Classifiers
– logistic regression
– gaussian naive Bayes
– linear SVM
– random forest
– multinomial naive Bayes
– stochastic gradient descent
Nystroem: Approximate a kernel map using a
subset of the training data.
KernelPCA: Kernel Principal component analysis
(KPCA)
fastICA: a fast algorithm for Independent
Component Analysis.
truncatedSVD : This transformer performs
linear dimensionality reduction by means of
truncated singular value decomposition (SVD).
Rbfsampler : Approximates feature map of an
RBF kernel by Monte Carlo approximation of its
Fourier transform.
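The size of this search space can be sketched by enumerating one choice per stage. This assumes exactly one featurizer, at most one preprocessor, and one classifier per pipeline, which the slides do not state explicitly; the string names are just labels taken from the lists above, not library objects.

```python
from itertools import product

FEATURIZERS = ["count_vectorizer", "tfidf_vectorizer"]
PREPROCESSORS = [
    "truncatedSVD", "pca", "kernelPCA", "fastICA",   # matrix decompositions
    "rbfsampler", "nystroem",                        # kernel approximation
    "selectkbest", "selectpercentile",               # feature selection
    "minmaxscaler", "robustscaler", "absscaler",     # scaling
    None,                                            # no preprocessing
]
CLASSIFIERS = [
    "logistic_regression", "gaussian_naive_bayes", "linear_svm",
    "random_forest", "multinomial_naive_bayes", "stochastic_gradient_descent",
]


def candidate_pipelines():
    """Yield every featurizer/preprocessor/classifier combination."""
    for steps in product(FEATURIZERS, PREPROCESSORS, CLASSIFIERS):
        # Drop the placeholder when no preprocessing step is used.
        yield tuple(step for step in steps if step is not None)


pipelines = list(candidate_pipelines())
print(len(pipelines))
```

Even before hyper-parameters are considered, the structural choices alone multiply quickly, which is why the order in which pipelines are tried matters.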
Reinforcement Learning
• Find a policy, i.e., a mapping from state to action, such that the agent can accumulate maximal reward
• Learns the policy by trial and error: executing actions in the environment, obtaining rewards, and updating its estimate of the value function until value iteration converges
• R-learning updates R(s,a) and rho(s), which reflect the long-term undiscounted average reward and gain reward, targeting finite-horizon problems (a fixed number of steps into the future)
Data scientists perform trial and error on different ML pipelines to find the most effective pipeline and hyper-parameters, similar to performing a reinforcement learning process
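A single R-learning update might look like the sketch below. It follows the common tabular form of R-learning with a per-state gain rho(s) as named on the slide; the step sizes alpha and beta, the state and action names, and the choice to update rho on every step (the standard algorithm only does so after greedy actions) are assumptions made for brevity.

```python
from collections import defaultdict


def r_learning_step(R, rho, s, a, reward, s_next, actions, alpha=0.1, beta=0.5):
    """Apply one tabular R-learning update for the transition (s, a) -> s_next."""
    best_here = max(R[(s, b)] for b in actions[s])
    best_next = max((R[(s_next, b)] for b in actions.get(s_next, [])), default=0.0)
    # Average-adjusted value: reward relative to the gain, plus the best follow-up.
    R[(s, a)] += alpha * (reward - rho[s] + best_next - R[(s, a)])
    # Gain: running estimate of the undiscounted reward obtainable from s.
    rho[s] += beta * (reward + best_next - best_here - rho[s])


# Hypothetical example: after featurizing, choose between two preprocessors.
actions = {"featurized": ["fastICA", "pca"]}
R = defaultdict(float)
rho = defaultdict(float)
r_learning_step(R, rho, "featurized", "fastICA", 0.8, "preprocessed", actions)
print(R[("featurized", "fastICA")], rho["featurized"])
```

Here the reward stands in for a cross-validation score, so steps that appear in accurate pipelines accumulate higher values over repeated episodes.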
PEORL: Planning--Execution--Observation--Reinforcement-Learning

Define pipeline goals
Find all satisfying plans
Shortest/highest reward plan instantiated
Update plan R-values
Planner focuses future trials on plan
components and overall pipelines with
higher learned rewards until all plans are
tried, accuracy achieved, or out of time.
BC Action Language
ASP - Clingo
scikit-learn
R-Learning
Evidence from Experiments
Best IMDB 300
Bag of words, FastICA and stochastic gradient descent (SGD)
• Hashing vectorizer: ngram range = (1,2), lowercase = False
• FastICA: n components = 3
• SGD classifier: loss = log, penalty = l2
Best Polarity Dataset 2.0 (2000 movie reviews)
Cross validation accuracy of 0.84
• Hashing vectorizer: ngram range = (1,3), lowercase = True
• FastICA: n components = 3
• SGD classifier: loss = modified huber, penalty = elasticnet
Best Full IMDB dataset
Cross validation score of 0.88
• Hashing vectorizer: ngram range = (1,1), lowercase = False
• FastICA: n components = 3
• SGD classifier: loss=log, penalty=None
300 IMDB Docs – Top 5
300 IMDB Docs – Bottom 5
Classifiers
All have viable
options, but pipelines
vary significantly.
Rho value evolution
Pipeline A,B,C:
• A – B fixed
• C changes
* Episodes are sequential, not reflected below
Each pipeline is evaluated for 1..5 episodes of 5-fold
cross-validation, 300 documents, 2 classes. Each
episode updates the value
(Plot: 𝜌 by episode)
PEORL learns to focus on promising pipelines
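For readers unfamiliar with the evaluation protocol, the 5-fold split over 300 documents can be sketched as follows. This is a plain-Python illustration of k-fold index splitting only; it does no shuffling or stratification, and whether the original experiments did either is not stated in the slides.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(300, k=5)
for train, test in folds:
    print(len(train), len(test))
```

Each episode would train on the larger part and score on the held-out part of every fold, then average the five scores into the reward for that pipeline.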
UCI Data
UCI Data Set | Score | Classifier
Absenteeism | 0.912 | linear_svc_classifier
Blood Transfusion | 0.792 | random_forest_classifier
Breast Cancer Coimbra | 0.75 | random_forest_classifier
Breast Cancer Wisconsin | 0.972 | sgd_classifier
Breast tissue | 0.707 | logistic_classifier
Cervical Cancer | 0.9685 | linear_svc_classifier
Climate | 0.9574 | linear_svc_classifier
Connectionist Bench | 0.8269 | gradient_boosting_classifier
Ecoli | 0.875 | logistic_classifier
Energy Efficiency | A: 0.570 | gradient_boosting_classifier
Energy Efficiency | B: 0.501 | random_forest_classifier
Glass | 0.780 |
Haberman's Survival | 0.735 | gradient_boosting_classifier
HCC Survival | 0.745 | gradient_boosting_classifier
Ionosphere | 0.94 | random_forest_classifier
Iris | 0.953 | random_forest_classifier
Leaf | 0.793 | linear_svc_classifier
Libras Movement | 0.852 | logistic_classifier
LSVT Voice Rehabilitation | 0.881 | logistic_classifier
Mammographic Mass | 0.85 | random_forest_classifier
Musk | 0.823 | random_forest_classifier
Optical Interconnection | 0.647 | gradient_boosting_classifier
Parkinsons | 0.897 | gradient_boosting_classifier
Quality Assessment of Digital Colposcopies | A: 0.796 | random_forest_classifier
Seeds | 0.971 | linear_svc_classifier
SPECTF Heart | 0.8 | random_forest_classifier
Sports articles for objectivity analysis | 0.853 | linear_svc_classifier
Vehicle Silhouettes | 0.754 | gradient_boosting_classifier
Student Performance | 0.185 | random_forest_classifier
Tennis Major Tournament Match Statistics | 0.998 | logistic_classifier
Ultrasonic Flowmeter Diagnostics | A: 0.839 | gradient_boosting_classifier
Ultrasonic Flowmeter Diagnostics | B: 1 | random_forest_classifier
User Knowledge Modeling | 0.922 | logistic_classifier
Vertebral Column | A: 0.842 | linear_svc_classifier
Vertebral Column | B: 0.864 | logistic_classifier
Wholesale Customers | 0.920 | random_forest_classifier
Wine | 0.983 |
Azure ML comparison
UCI | AutoML | Azure AutoML
Student Performance | 0.185 random_forest_classifier | 0.1507 LightGBM
Absenteeism | 0.912 linear_svc_classifier | 1.0 LogisticRegression
Blood Transfusion | 0.792 random_forest_classifier | 1.0 LogisticRegression
Breast Cancer Coimbra | 0.75 random_forest_classifier | 1.0 LogisticRegression
Ionosphere | 0.94 random_forest_classifier | 0.8785 LightGBM
Optical Interconnection | 0.647 gradient_boosting_classifier | 0.4179 LogisticRegression
Wine | 0.983 random_forest_classifier | 1.0 LightGBM
References
F. Yang, A. Elkholy, S. Gustafson. Interpretable Automated Machine Learning
in Maana Knowledge Platform. 18th International Conference on
Autonomous Agents and Multiagent Systems (AAMAS), Montreal.  Extended
Abstract, May, 2019.
D. Lyu, F. Yang, B. Liu, S. Gustafson. SDRL: Interpretable and Data-efficient
Deep Reinforcement Learning Leveraging Symbolic Planning.  33rd AAAI
Conference on Artificial Intelligence (AAAI), Honolulu, HI, 2019
F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu.  Program Search for Machine
Learning Pipelines Leveraging Symbolic Planning and Reinforcement
Learning.  In Genetic Programming Theory and Practice XVI. 2018. Springer.
F. Yang, D. Lyu, B. Liu, S. Gustafson.  PEORL: Integrating Symbolic Planning
and Hierarchical Reinforcement Learning for Robust Decision-Making.  IJCAI.
Sweden. 2018.


AutoML
• Algorithm closely mirrors expert’s process, reasonable results
• Algorithm is naturally “human in the loop”
• Includes learning, via human input and reinforcement learning
• Anything else?
Digitization / Digital Decisions
AutoML has a knowledge representation of a digital decision
It allows you to think & reason about the decision before
making it
I have made AutoML before, but this time, I want to do it in a
way that aligns with digital decisions in general.
AutoML is simply a digital decision for picking a ML pipeline!
Canvas (derived from “to canvass”)
• A set of topics and questions that allow you to
gather information about your business and
strategy, reflect, brainstorm and refine strategy
• We will use a four-section Decision Canvas:
1. Define the problem or opportunity
2. Identify the decision strategy
3. Break down the decision
4. Define the solution as composable functions
Given data with labels, what is the best model to predict label of new data?
• Data, methods that can be combined into a pipeline
• Pipeline with good cross validation accuracy
• Shortest pipelines with low variability in accuracy
• Iterate over different pipelines
• What pipeline steps have worked well, gotten closer to goal (better CV results)
• Stop pipeline, set accuracy, constrain options
• Select next pipeline to try
• Pipeline meets goal, best so far
• Labeled data, user preferences on pipeline
• CV results
• CV results, user action to stop
model = best( ... ( learn( score( plan( input data, user preferences ) ) ) ) )
where ... is a repetition of learn( score( plan( ) ) ) until all plans are tried or a target accuracy is met
model : given input data and user preferences, what is the best pipeline
plan : given input data, user preferences and known pipeline element performance, what are the possible pipelines, ordered by potential performance and length
score : given a potential pipeline, what is its accuracy
learn : given pipeline performance, what is pipeline element performance
best : given known pipeline accuracy, what is the best one
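The composition of plan, score, learn and best can be sketched as ordinary functions. This is an illustrative toy rather than the implementation behind the talk: the averaging rule inside learn, the ranking rule inside plan, the target value, and the lookup table of accuracies are all hypothetical stand-ins (a real score would run cross-validation).

```python
def plan(pipelines, step_value, tried):
    """Untried pipelines, ordered by potential performance, then by length."""
    untried = [p for p in pipelines if p not in tried]

    def potential(p):
        return sum(step_value.get(s, 0.0) for s in p) / len(p)

    return sorted(untried, key=lambda p: (-potential(p), len(p)))


def learn(step_value, pipeline, accuracy, rate=0.5):
    """Move each step's value toward the accuracy of the pipeline it was in."""
    for s in pipeline:
        old = step_value.get(s, 0.0)
        step_value[s] = old + rate * (accuracy - old)


def best(tried):
    """Pipeline with the highest observed accuracy."""
    return max(tried, key=tried.get)


def model(pipelines, score, target=0.85):
    """Repeat plan -> score -> learn until plans run out or the target is met."""
    step_value, tried = {}, {}
    while True:
        candidates = plan(pipelines, step_value, tried)
        if not candidates:
            break
        chosen = candidates[0]
        tried[chosen] = score(chosen)
        learn(step_value, chosen, tried[chosen])
        if tried[chosen] >= target:
            break
    return best(tried), tried


# Hypothetical toy accuracies standing in for cross-validation results.
TOY_SCORES = {
    ("bow", "sgd"): 0.7,
    ("bow", "fastICA", "sgd"): 0.9,
    ("tfidf", "nb"): 0.6,
}
result, tried = model(list(TOY_SCORES), TOY_SCORES.get)
print(result, len(tried))
```

In this toy run the first pipeline teaches the loop that "bow" and "sgd" are promising, so the longer pipeline sharing those steps is tried next and the weaker third option is never evaluated.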
Example Digitization: Should I bring my umbrella?
• Traditionally, I would only observe the weather report, but I can now
combine this with my online calendar to decide if I’ll be outside
• It stands to reason that I should bring an umbrella if I’ll be outside long
when it is most likely to rain
• Whether I have an important meeting, a long distance to walk, or a lot of
other things to carry will factor into the decision about bringing an
umbrella.
• I want to learn to better predict what to bring, get a better estimate of
walking times, and learn to manage my daily activities better in
general.
• Optimizing a decision (bring umbrella) extends previous data (weather
report) and fills in missing data (walking times), useful for other
opportunities.
Given today’s activities, should I bring my umbrella? (main PQ)
Given activities and step monitor data, when can I assume I am outside? (predict time outside based on step data)
Given time outside and the weather forecast, what is likelihood of getting wet? (combine outside and weather prediction)
Given likelihood of getting wet and activities, when do I accept recommendation to bring umbrella? (learn judgement decision to bring umbrella (Y/N) as conditioned on wet likelihood and activities)
• Today’s activities, weather predictions
• Am I outside when it’s raining?
• Will being wet matter? Cost of carrying it?
• Don’t get caught out in the rain
• What activities, when will I be outside, based on steps data
• Carry umbrella given day’s activities?
• Bring umbrella?
• Happy with advice to bring umbrella – sent by text
• Day’s activities (locations and times), weather service
• Activities (name and time) and activity step monitor data
• Reply to text is Yes, No. A No is used to train a function on decision to send text.
Given today’s activities, should I bring my umbrella?
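The umbrella decision can be broken into composable functions in the same spirit. The sketch below is only an illustration: the field names, the step threshold, the independence assumption between hours, and the decision threshold are all made up, and the learning step (training on Yes/No replies) is left out.

```python
def time_outside(activities, steps_per_hour, walking_threshold=500):
    """Hours at which the step monitor suggests I am outside walking."""
    return [
        a["hour"] for a in activities
        if steps_per_hour.get(a["hour"], 0) >= walking_threshold
    ]


def wet_likelihood(hours_outside, rain_probability):
    """Chance of rain during at least one outside hour (hours independent)."""
    dry = 1.0
    for hour in hours_outside:
        dry *= 1.0 - rain_probability.get(hour, 0.0)
    return 1.0 - dry


def bring_umbrella(likelihood, activities, threshold=0.3):
    """Recommend an umbrella; be more cautious on days with important events."""
    important = any(a.get("important") for a in activities)
    return likelihood >= (threshold / 2 if important else threshold)


# Hypothetical day: a morning commute on foot and an important indoor meeting.
activities = [
    {"name": "commute", "hour": 8},
    {"name": "client meeting", "hour": 13, "important": True},
]
steps_per_hour = {8: 2000, 13: 300}
rain_probability = {8: 0.2, 13: 0.9}

hours = time_outside(activities, steps_per_hour)
likelihood = wet_likelihood(hours, rain_probability)
decision = bring_umbrella(likelihood, activities)
print(hours, round(likelihood, 2), decision)
```

Each function answers one of the questions above, so any of them can be replaced by a learned model without changing the others.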
What do you take away?
Observe my arguments about AutoML and the algorithm
Reason about the evidence, consider past experience
Decide to change your Data Science approach
Learn by experimentation and feedback
Team
Fangkai Yang (NVIDIA) Prof. Bo Liu (Auburn) Daoming Lyu (Auburn) Alexander Elkholy Krishnan Ram (intern)
Jeremy Brown Sergey Ilinskiy
Take-away
Solve AutoML (and digitization in general) like and with human experts!
www.globalbigdataconference.com
Twitter : @bigdataconf
#GAIC

Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

AutoML for Expert Data Science

  • 1. AutoML Productivity for Data Science, AND … A better way to make Digital Decisions Dr. Steven Gustafson Chief Scientist, Maana (2+ years) previously, GE Research (10+ years) (before previously, PhD AI for “automatic programming”)
  • 2. What do you take away? Observe my arguments about AutoML and the algorithm Reason about the evidence, consider past experience Decide to change your Data Science approach Learn by experimentation and feedback
  • 3. My Argument • Generate new knowledge • Find good model pipelines • Allow your experts and data scientists to understand, learn and improve models that drive business decisions! • We created our AutoML as an archetype for architecting digital decisions!
  • 4. AutoML • Generate and tune ML pipeline • Auto-WEKA, Auto-SKLEARN, Google NN, Azure ML, …, TPOT • Most Bayesian learning or computation vs. improvement • Black box – helps find solutions, not knowledge or wisdom • Assumes future problems are represented by data • Biased by what code and data are available vs. what’s useful • Can be very long-running: hours to days
  • 5. Expert Data Science Observe the Problem, Data, Background Knowledge Reason about data characteristics vs. goals vs. techniques Decide on initial approaches Learn from results and iterate
  • 6. Expert Data Science vs. AutoML
      – AutoML: Massive compute • Optimize many, many parameters • Blind search, etc.? • How do you explain results? • Justify compute budget • Engage an SME? • Does the Data Scientist learn?
      – Expert Data Science: Observe the Problem, Data, Background Knowledge • Reason about data characteristics • Decide on initial approaches • Learn from results and iterate
      – =?
  • 7. What if AutoML… • Capture & represent knowledge • Use reasoning to “expertly” choose pipelines • Use reinforcement learning with human input in real-time to guide iteration • Target seconds and minutes for results instead of hours and days, match expert iteration
  • 8. What is Knowledge Representation? • A surrogate, a substitute for the thing itself. • Enable an entity to determine consequences by thinking rather than acting. • A “language” in which we say things about the world. • A “theory” of intelligent reasoning: the type of reasoning and the applicable reasoning given data • Guidance for organizing information to facilitate inferences to get new expressions from old. • A KR is not a data structure. A KR must be implemented in the machine by some data structure. http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html
  • 9. Program Search for Machine Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu.  Program Search for Machine Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning.  In Genetic Programming Theory and Practice XVI. 2018. Springer.
  • 10. Symbolic planning • Symbolic planning uses a logical formalism to represent dynamic systems and runs automated algorithms that generate plans • A plan is a sequence of actions that reaches the goal state from an initial state • With a common action description language (such as B, C, C+, BC), a plan can be computed automatically using an ASP solver such as Clingo. Data science contains a set of actions that transform and fit data.
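The idea of a plan as a sequence of actions from an initial state to a goal state can be illustrated with a small breadth-first planner. This is only a sketch in Python, not the BC/Clingo encoding mentioned on the slide; the state facts (`raw_text`, `features`, `reduced`, `model`) and the action names are hypothetical.

```python
from collections import deque

# Hypothetical actions: name -> (preconditions, facts added to the state).
ACTIONS = {
    "vectorize": ({"raw_text"}, {"features"}),
    "reduce_dims": ({"features"}, {"reduced"}),
    "fit_classifier": ({"features"}, {"model"}),
}


def find_plan(initial, goal, actions=ACTIONS):
    """Return the shortest action sequence whose final state contains `goal`, or None."""
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return plan
        for name, (pre, add) in actions.items():
            if pre <= state:  # action is applicable
                nxt = frozenset(state | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None  # goal is unreachable with the given actions


plan = find_plan({"raw_text"}, {"model"})
print(plan)
```

Breadth-first search returns a shortest plan first, which loosely matches the preference for short pipelines; an ASP solver would instead enumerate all plans that satisfy the declared constraints.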
  • 12. Pipelines • Featurizers – Count / bag of words Vectorizer – Tfidf Vectorizer • Preprocessors – matrix decompositions (truncatedSVD, pca, kernelPCA, fastICA) – kernel approximation (rbfsampler, nystroem) – feature selection (selectkbest, selectpercentile) – scaling (minmaxscaler, robustscaler, absscaler) – no preprocessing • Classifiers – logistic regression – gaussian naive Bayes – linear SVM – random forest – multinomial naive Bayes – stochastic gradient descent. Notes: Nystroem: approximates a kernel map using a subset of the training data. KernelPCA: kernel principal component analysis (KPCA). fastICA: a fast algorithm for independent component analysis. truncatedSVD: performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Rbfsampler: approximates the feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform.
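To get a feel for the size of the search space on this slide, the sketch below enumerates every featurizer/preprocessor/classifier combination. It assumes each pipeline takes exactly one element per stage, which the slide does not state explicitly, and it ignores hyper-parameters; the string labels are placeholders, not library class names.

```python
from itertools import product

FEATURIZERS = ["count_vectorizer", "tfidf_vectorizer"]
PREPROCESSORS = [
    "truncatedSVD", "pca", "kernelPCA", "fastICA",  # matrix decompositions
    "rbfsampler", "nystroem",                       # kernel approximation
    "selectkbest", "selectpercentile",              # feature selection
    "minmaxscaler", "robustscaler", "absscaler",    # scaling
    "none",                                         # no preprocessing
]
CLASSIFIERS = [
    "logistic_regression", "gaussian_naive_bayes", "linear_svm",
    "random_forest", "multinomial_naive_bayes", "sgd",
]

# One element per stage (an assumption made for this sketch).
pipelines = list(product(FEATURIZERS, PREPROCESSORS, CLASSIFIERS))
print(len(pipelines))  # 2 * 12 * 6 = 144
```

Even before hyper-parameters are considered there are 144 structural candidates, which is why ordering the trials matters more than exhaustively running them.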
  • 13. Reinforcement Learning • Find a policy, i.e., a mapping from state to action, such that the agent can accumulate maximal reward • Learn the policy by trial and error: execute actions in the environment, obtain a reward, and update the estimate of the value function until the value iteration converges • R-learning: update R(s,a) and rho(s), which reflect the long-term undiscounted average reward and the gain reward, targeting finite-horizon problems (a fixed number of steps into the future). Data scientists perform trial and error on different ML pipelines to understand the most effective pipeline and hyper-parameters, similar to performing a reinforcement learning process.
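As an illustration of the R(s,a) and rho(s) updates, here is one common statement of the R-learning rule (after Schwartz, 1993), with rho stored per state as on the slide. The talk does not give the exact form used by the authors, so treat the update equations, the step sizes `alpha` and `beta`, and the state and action names as assumptions.

```python
from collections import defaultdict


def r_learning_update(R, rho, s, a, reward, s_next, actions, next_actions,
                      alpha=0.5, beta=0.1):
    """Apply one R-learning step in place and return the temporal-difference error."""
    best_next = max(R[(s_next, b)] for b in next_actions)
    delta = reward - rho[s] + best_next - R[(s, a)]
    R[(s, a)] += alpha * delta
    best_here = max(R[(s, b)] for b in actions)
    # The average-reward estimate is only updated when the action taken is greedy.
    if R[(s, a)] >= best_here:
        rho[s] += beta * (reward + best_next - best_here - rho[s])
    return delta


R = defaultdict(float)    # average-adjusted action values, initially 0
rho = defaultdict(float)  # per-state average-reward estimates, initially 0

# Hypothetical transition: choosing "fastICA" after featurizing earned reward 1.0.
delta = r_learning_update(
    R, rho,
    s="featurized", a="fastICA", reward=1.0, s_next="preprocessed",
    actions=["fastICA", "pca"], next_actions=["sgd", "random_forest"],
)
```

In the AutoML setting the reward would come from a pipeline's cross-validation result, so elements that appear in well-scoring pipelines accumulate higher values.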
  • 14. PEORL: Planning--Execution--Observation--Reinforcement-Learning. Loop: define pipeline goals; find all satisfying plans; instantiate the shortest/highest-reward plan; update plan R-values. The planner focuses future trials on plan components and overall pipelines with higher learned rewards until all plans are tried, the accuracy is achieved, or time runs out. Components: BC action language, ASP (Clingo), scikit-learn, R-learning.
  • 15.
  • 16. Evidence from Experiments
      – Best on IMDB (300 documents): bag of words, fastICA and stochastic gradient descent (SGD) • Hashing vectorizer: ngram range = (1,2), lowercase = False • FastICA: n components = 3 • SGD classifier: loss = log, penalty = l2
      – Best on Polarity Dataset 2.0 (2000 movie reviews): cross-validation accuracy of 0.84 • Hashing vectorizer: ngram range = (1,3), lowercase = True • FastICA: n components = 3 • SGD classifier: loss = modified huber, penalty = elasticnet
      – Best on full IMDB dataset: cross-validation score of 0.88 • Hashing vectorizer: ngram range = (1,1), lowercase = False • FastICA: n components = 3 • SGD classifier: loss = log, penalty = None
      – [Tables: 300 IMDB Docs, Top 5; 300 IMDB Docs, Bottom 5]
  • 17. Classifiers All have viable options, but pipelines vary significantly.
  • 18. Rho value evolution. Pipeline A, B, C: • A – B fixed • C changes. Each pipeline is evaluated for 1..5 episodes of 5-fold cross-validation, 300 documents, 2 classes. Each episode updates the value ρ. * Episodes are sequential, which is not reflected in the plot. [Plot: ρ vs. episode]
  • 19. PEORL learns to focus on promising pipelines
  • 20. UCI Data (UCI data set: score, classifier)
      – Absenteeism: 0.912, linear_svc_classifier
      – Blood Transfusion: 0.792, random_forest_classifier
      – Breast Cancer Coimbra: 0.75, random_forest_classifier
      – Breast Cancer Wisconsin: 0.972, sgd_classifier
      – Breast tissue: 0.707, logistic_classifier
      – Cervical Cancer: 0.9685, linear_svc_classifier
      – Climate: 0.9574, linear_svc_classifier
      – Connectionist Bench: 0.8269, gradient_boosting_classifier
      – Ecoli: 0.875, logistic_classifier
      – Energy Efficiency: A: 0.570, gradient_boosting_classifier; B: 0.501, random_forest_classifier
      – Glass: 0.780
      – Haberman's Survival: 0.735, gradient_boosting_classifier
      – HCC Survival: 0.745, gradient_boosting_classifier
      – Ionosphere: 0.94, random_forest_classifier
      – Iris: 0.953, random_forest_classifier
      – Leaf: 0.793, linear_svc_classifier
      – Libras Movement: 0.852, logistic_classifier
      – LSVT Voice Rehabilitation: 0.881, logistic_classifier
      – Mammographic Mass: 0.85, random_forest_classifier
      – Musk: 0.823, random_forest_classifier
      – Optical Interconnection: 0.647, gradient_boosting_classifier
      – Parkinsons: 0.897, gradient_boosting_classifier
      – Quality Assessment of Digital Colposcopies: A: 0.796, random_forest_classifier
      – Seeds: 0.971, linear_svc_classifier
      – SPECTF Heart: 0.8, random_forest_classifier
      – Sports articles for objectivity analysis: 0.853, linear_svc_classifier
      – Vehicle Silhouettes: 0.754, gradient_boosting_classifier
      – Student Performance: 0.185, random_forest_classifier
      – Tennis Major Tournament Match Statistics: 0.998, logistic_classifier
      – Ultrasonic Flowmeter Diagnostics: A: 0.839, gradient_boosting_classifier; B: 1, random_forest_classifier
      – User Knowledge Modeling: 0.922, logistic_classifier
      – Vertebral Column: A: 0.842, linear_svc_classifier; B: 0.864, logistic_classifier
      – Wholesale Customers: 0.920, random_forest_classifier
      – Wine: 0.983
  • 21. Azure ML comparison (UCI data set: AutoML result; Azure AutoML result)
      – Student Performance: AutoML 0.185, random_forest_classifier; Azure AutoML 0.1507, LightGBM
      – Absenteeism: AutoML 0.912, linear_svc_classifier; Azure AutoML 1.0, LogisticRegression
      – Blood Transfusion: AutoML 0.792, random_forest_classifier; Azure AutoML 1.0, LogisticRegression
      – Breast Cancer Coimbra: AutoML 0.75, random_forest_classifier; Azure AutoML 1.0, LogisticRegression
      – Ionosphere: AutoML 0.94, random_forest_classifier; Azure AutoML 0.8785, LightGBM
      – Optical Interconnection: AutoML 0.647, gradient_boosting_classifier; Azure AutoML 0.4179, LogisticRegression
      – Wine: AutoML 0.983, random_forest_classifier; Azure AutoML 1.0, LightGBM
  • 22. References
      – F. Yang, A. Elkholy, S. Gustafson. Interpretable Automated Machine Learning in Maana Knowledge Platform. 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Montreal. Extended Abstract, May 2019.
      – D. Lyu, F. Yang, B. Liu, S. Gustafson. SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning. 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, 2019.
      – F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu. Program Search for Machine Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning. In Genetic Programming Theory and Practice XVI. Springer, 2018.
      – F. Yang, D. Lyu, B. Liu, S. Gustafson. PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making. IJCAI, Sweden, 2018.

  • 23. AutoML • Algorithm closely mirrors expert’s process, reasonable results • Algorithm is naturally “human in the loop” • Includes learning, via human input and reinforcement learning • Anything else?
  • 24. Digitization / Digital Decisions AutoML has a knowledge representation of a digital decision. It allows you to think & reason about the decision before making it. I have made AutoML before, but this time, I want to do it in a way that aligns with digital decisions in general. AutoML is simply a digital decision for picking an ML pipeline!
  • 25. Canvas (derived from “to canvass”) • A set of topics and questions that allow you to gather information about your business and strategy, reflect, brainstorm and refine strategy • We will use a four-section Decision Canvas: 1. Define the problem or opportunity 2. Identify the decision strategy 3. Break down the decision 4. Define the solution as composable functions
  • 26.
  • 27. Given data with labels, what is the best model to predict label of new data?
  • 28. Data, methods that can be combined into a pipeline Pipeline with good cross validation accuracy Shortest pipelines with low variability in accuracy Iterate over different pipelines Given data with labels, what is the best model to predict label of new data?
  • 29. What pipeline steps have worked well, gotten closer to goal (better CV results) Stop pipeline, set accuracy, constrain options Select next pipeline to try Pipeline meets goal, best so far Labeled data, user preferences on pipeline CV results CV results, user action to stop Data, methods that can be combined into a pipeline Pipeline with good cross validation accuracy Shortest pipelines with low variability in accuracy Iterate over different pipelines Given data with labels, what is the best model to predict label of new data?
  • 30. model = best( ... ( learn( score( plan( input data, user preferences ) ) ) ) ), where ... is an iteration of learn(score(plan( ))) until all plans are tried or a target accuracy is met. model: given input data and user preferences, what is the best pipeline? plan: given input data, user preferences and known pipeline element performance, what are the possible pipelines, ordered by potential performance and length? score: given a potential pipeline, what is its accuracy? learn: given pipeline performance, what is the pipeline element performance? best: given known pipeline accuracies, which is the best one? What pipeline steps have worked well, gotten closer to goal (better CV results) Stop pipeline, set accuracy, constrain options Select next pipeline to try Pipeline meets goal, best so far Labeled data, user preferences on pipeline CV results CV results, user action to stop Data, methods that can be combined into a pipeline Pipeline with good cross validation accuracy Shortest pipelines with low variability in accuracy Iterate over different pipelines Given data with labels, what is the best model to predict label of new data?
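The decomposition into `plan`, `score`, `learn` and `best` can be sketched as ordinary Python functions. This is a toy illustration under stated assumptions, not the Maana implementation: the scorer is a made-up lookup rather than real cross-validation, the accuracies are invented for the example, and the element-value update is a simple running average rather than R-learning.

```python
from itertools import product


def plan(stages, element_value):
    """List candidate pipelines, highest learned element value first, then shortest first."""
    candidates = [tuple(e for e in combo if e is not None) for combo in product(*stages)]
    return sorted(
        candidates,
        key=lambda p: (-sum(element_value.get(e, 0.0) for e in p), len(p)),
    )


def learn(element_value, pipeline, accuracy, rate=0.5):
    """Move each element's value toward the accuracy of the pipeline it appeared in."""
    updated = dict(element_value)
    for e in pipeline:
        updated[e] = updated.get(e, 0.0) + rate * (accuracy - updated.get(e, 0.0))
    return updated


def best(results):
    """Return the pipeline with the highest known accuracy."""
    return max(results, key=results.get)


def model(stages, score, target=0.9):
    """Iterate plan -> score -> learn until all plans are tried or the target is met."""
    element_value, results = {}, {}
    while True:
        untried = [p for p in plan(stages, element_value) if p not in results]
        if not untried:
            break
        pipeline = untried[0]
        results[pipeline] = score(pipeline)
        element_value = learn(element_value, pipeline, results[pipeline])
        if results[pipeline] >= target:
            break
    return best(results), results


# Toy scorer with invented numbers, standing in for cross-validation.
TOY_ACCURACY = {"sgd": 0.8, "naive_bayes": 0.6}


def toy_score(pipeline):
    bonus = 0.15 if "fastICA" in pipeline else 0.0
    return TOY_ACCURACY[pipeline[-1]] + bonus


stages = [["tfidf"], ["fastICA", None], ["sgd", "naive_bayes"]]
chosen, results = model(stages, toy_score)
print(chosen, len(results))
```

In this toy run the short `tfidf` + `sgd` pipeline is tried first, its elements gain value, and the next trial adds `fastICA` to the same classifier instead of switching to the untested one, so the target is met after two of the four candidates.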
  • 31. Example Digitization: Should I bring my umbrella? • Traditionally, I would only observe the weather report, but I can now combine this with my online calendar to decide if I’ll be outside • It stands to reason that I should bring an umbrella if I’ll be outside for a long time when it is most likely to rain • Whether I have an important meeting, a long distance to walk, or a lot of other things to carry will factor into the decision about bringing an umbrella. • I want to learn to predict better what to bring, get a better estimate of walking times, and learn to manage my daily activities better in general. • Optimizing a decision (bring umbrella) extends previous data (weather report) and fills in missing data (walking times), which is useful for other opportunities.
  • 32. Given today’s activities, should I bring my umbrella? (main PQ) Given activities and step monitor data, when can I assume I am outside? (predict time outside based on step data) Given time outside and the weather forecast, what is likelihood of getting wet? (combine outside and weather prediction) Given likelihood of getting wet and activities, when do I accept recommendation to bring umbrella? (learn judgement decision to bring umbrella (Y/N) as conditioned on wet likelihood and activities) Today’s activities, Weather predictions Am I outside when it’s raining? Will being wet matter? Cost of carrying it? Don’t get caught out in the rain What activities, when will I be outside, based on steps data Carry umbrella given day’s activities? Bring umbrella? Happy with advice to bring umbrella – sent by text Day’s activities (locations and times), weather service Activities (name and time) and activity step monitor data Reply to text is Yes, No. A No is used to train a function on decision to send text. Given today’s activities, should I bring my umbrella?
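The umbrella decision breaks down into composable functions in the same way. The sketch below is hypothetical throughout: the activity records carry an `outdoors` flag as a stand-in for the step-monitor prediction, rain probabilities are keyed by hour, hours are treated as independent, and the 0.3 threshold is arbitrary rather than learned from the Yes/No replies.

```python
def hours_outside(activities):
    """Hours of the day spent outdoors (stand-in for a prediction from step data)."""
    return [h for a in activities if a["outdoors"] for h in range(a["start"], a["end"])]


def wet_likelihood(outside, rain_prob_by_hour):
    """Chance of at least one rainy hour outside, assuming independent hours."""
    dry = 1.0
    for h in outside:
        dry *= 1.0 - rain_prob_by_hour.get(h, 0.0)
    return 1.0 - dry


def bring_umbrella(activities, rain_prob_by_hour, threshold=0.3):
    """Recommend an umbrella when the likelihood of getting wet reaches the threshold."""
    return wet_likelihood(hours_outside(activities), rain_prob_by_hour) >= threshold


today = [
    {"name": "walk to office", "start": 8, "end": 9, "outdoors": True},
    {"name": "meeting", "start": 9, "end": 12, "outdoors": False},
]
print(bring_umbrella(today, {8: 0.5}))   # rain during the walk
print(bring_umbrella(today, {10: 0.9}))  # rain only while indoors
```

Each function answers one of the sub-questions on the canvas, so any of them can be replaced by a learned model without changing the others.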
  • 33. What do you take away? Observe my arguments about AutoML and the algorithm Reason about the evidence, consider past experience Decide to change your Data Science approach Learn by experimentation and feedback
  • 34. Team Fangkai Yang (NVIDIA) Prof. Bo Liu (Auburn) Daoming Lyu (Auburn) Alexander Elkholy Krishnan Ram (intern) Jeremy Brown Sergey Ilinskiy
  • 35. Take-away Solve AutoML (and digitization in general) like and with human experts!