AutoML is an approach to automating machine learning that aims to:
- Generate new machine learning pipelines
- Find good model pipelines by capturing and representing knowledge to "expertly" choose pipelines
- Use reinforcement learning with real-time human input to guide iteration between pipelines, matching how experts iterate
The key advantage of this AutoML approach is that it more closely mirrors how experts work by incorporating reasoning, knowledge representation, and human feedback into the process. This allows the algorithm to provide interpretable results and to focus future trials based on learned performance, targeting faster results than traditional AutoML methods.
IAC 2024 - IA Fast Track to Search Focused AI Solutions
AutoML for Expert Data Science
1. AutoML
Productivity for Data Science, AND …
A better way to make Digital Decisions
Dr. Steven Gustafson
Chief Scientist, Maana (2+ years)
previously, GE Research (10+ years)
(before previously, PhD AI for “automatic programming”)
2. What do you take away?
Observe my arguments about AutoML and the algorithm
Reason about the evidence, consider past experience
Decide to change your Data Science approach
Learn by experimentation and feedback
3. My Argument
• Generate new knowledge
• Find good model pipelines
• Allow your experts and data scientists to
understand, learn and improve models that drive
business decisions!
• We created our AutoML as an archetype for
architecting digital decisions!
4. AutoML
• Generate and tune ML pipeline
• Auto-WEKA, Auto-SKLEARN, Google NN, Azure ML, …, TPOT
• Mostly Bayesian learning or raw computation rather than learned improvement
• Black box – helps find solutions, not knowledge or wisdom
• Assumes future problems are represented by the data
• Biased by what code and data are available vs. what’s useful
• Can be very long running – hours to days
5. Expert Data Science
Observe the Problem, Data, Background Knowledge
Reason about data characteristics vs. goals vs. techniques
Decide on initial approaches
Learn from results and iterate
6. Expert Data Science Vs. AutoML
AutoML:
• Massive compute
• Optimize many, many parameters
• Blind search, etc.?
• How do you explain results? Justify the compute budget?
• Engage an SME?
• Does the Data Scientist learn?
Expert Data Science:
• Observe the Problem, Data, Background Knowledge
• Reason about data characteristics
• Decide on initial approaches
• Learn from results and iterate
=?
7. What if AutoML…
• Capture & represent knowledge
• Use reasoning to ”expertly” choose pipelines
• Use reinforcement learning with human input in real-time to guide iteration
• Target seconds and minutes for results instead of hours and days, matching expert iteration
8. What is Knowledge Representation?
• A surrogate, a substitute for the thing itself.
• Enable an entity to determine consequences by thinking rather than
acting.
• A “language” in which we say things about the world.
• A “theory” of intelligent reasoning: which types of reasoning are sanctioned, and which apply given the data
• Guidance for organizing information to facilitate inferences to get
new expressions from old.
• A KR is not a data structure. A KR must be implemented in the
machine by some data structure.
http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html
9. Program Search for Machine Learning Pipelines Leveraging
Symbolic Planning and Reinforcement Learning
F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu. Program Search for Machine
Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning. In
Genetic Programming Theory and Practice XVI. 2018. Springer.
10. Symbolic planning
• Symbolic planning uses logical formalisms to represent dynamic systems and automated algorithms to generate plans
• A plan is a sequence of actions that achieves the goal state from an initial state
• Given a common action description language (such as B, C, C+, or BC), plans can be automatically computed using an ASP solver such as Clingo.
Data science contains a set of actions that transform and fit data.
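A toy stand-in for this planning step (the actual system uses the BC action language with the Clingo ASP solver; the stage and method names below are simplified labels, not the paper's action descriptions):

```python
from itertools import product

# Toy planner: a "plan" is an ordered featurizer -> preprocessor -> classifier
# action sequence that takes raw labeled text (initial state) to a fitted
# model (goal state). Stage/method names are illustrative.
STAGES = {
    "featurizer": ["count_vectorizer", "tfidf_vectorizer"],
    "preprocessor": ["truncated_svd", "select_k_best", "no_preprocessing"],
    "classifier": ["logistic_regression", "linear_svm", "random_forest"],
}

def satisfying_plans(order=("featurizer", "preprocessor", "classifier")):
    """Enumerate every action sequence satisfying the stage ordering."""
    return list(product(*(STAGES[stage] for stage in order)))

plans = satisfying_plans()
```

A real ASP encoding would also express preconditions (e.g. a classifier requires numeric features), pruning this enumeration before anything is executed.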
12. Pipelines
• Featurizers
– Count / bag of words Vectorizer
– Tfidf Vectorizer
• Preprocessors
– matrix decompositions (truncatedSVD, pca, kernelPCA, fastICA)
– kernel approximation (rbfsampler, nystroem)
– feature selection (selectkbest, selectpercentile)
– scaling (minmaxscaler, robustscaler, maxabsscaler)
– no preprocessing
• Classifiers
– logistic regression
– gaussian naive Bayes
– linear SVM
– random forest
– multinomial naive Bayes
– stochastic gradient descent
Nystroem: approximates a kernel map using a subset of the training data.
KernelPCA: Kernel Principal Component Analysis (KPCA).
fastICA: a fast algorithm for Independent Component Analysis.
truncatedSVD: performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).
RBFSampler: approximates the feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform.
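One pipeline from this menu can be assembled with scikit-learn; the six-document corpus below is made up for illustration and is not the data or configuration used in the experiments later in the talk:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus (not from the paper's experiments).
texts = ["great movie", "terrible film", "loved the story", "awful acting",
         "wonderful and moving", "boring plot"]
labels = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("featurizer", TfidfVectorizer()),               # text -> tf-idf matrix
    ("preprocessor", TruncatedSVD(n_components=2)),  # linear dim. reduction
    ("classifier", LogisticRegression()),
])
pipe.fit(texts, labels)
train_acc = pipe.score(texts, labels)
```

Swapping any step for another entry in the menu above yields a different point in the same search space, which is exactly what the planner enumerates.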
13. Reinforcement Learning
• Find a policy, i.e., a mapping from state to action, such that the agent accumulates maximal reward
• Learns the policy by trial and error: execute actions in the environment, obtain reward, and update the estimated value function until value iteration converges
• R-learning updates R(s,a) and rho(s), which reflect the long-term undiscounted average reward and gain, targeting finite-horizon problems (a fixed number of future steps)
Data scientists perform trial and error on different ML pipelines to understand the most effective pipeline and hyper-parameters, similar to performing a reinforcement learning process
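A minimal single-state R-learning sketch of that analogy; the two "pipelines", their mean rewards, and the learning rates are assumptions for illustration, not values from the paper:

```python
import random

# Two candidate "pipelines" act as actions in a single state. R(s,a) tracks
# the average-reward-adjusted value of each action; rho tracks the long-term
# undiscounted average reward under the (mostly) greedy policy.
random.seed(0)
mean_reward = {"pipeline_a": 0.70, "pipeline_b": 0.85}  # assumed CV accuracies
R = {a: 0.0 for a in mean_reward}
rho, alpha, beta, epsilon = 0.0, 0.1, 0.05, 0.1

for _ in range(2000):
    greedy = max(R, key=R.get)
    a = random.choice(list(R)) if random.random() < epsilon else greedy
    r = mean_reward[a] + random.gauss(0, 0.05)          # noisy episode reward
    R[a] += alpha * (r - rho + max(R.values()) - R[a])  # R-learning update
    if a == greedy:                       # rho updated only on greedy steps
        rho += beta * (r - rho)

best_arm = max(R, key=R.get)
```

After enough episodes the higher-reward pipeline dominates and rho approaches its average accuracy, mirroring how repeated trials sharpen a data scientist's sense of what works.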
14. PEORL: Planning--Execution--Observation--Reinforcement-Learning
Define pipeline goals → find all satisfying plans → instantiate the shortest / highest-reward plan → update plan R-values.
The planner focuses future trials on plan components and overall pipelines with higher learned rewards until all plans are tried, accuracy is achieved, or time runs out.
Built on: BC Action Language, ASP (Clingo), scikit-learn, R-Learning.
15.
16. Evidence from Experiments
Best IMDB 300
bag of words, fastICA and stochastic gradient descent (SGD)
• Hashing vectorizer: ngram range = (1,2), lowercase = False
• FastICA: n components = 3
• SGD classifier: loss = log, penalty = l2
Best Polarity Dataset 2.0 (2000 movie reviews)
Cross validation accuracy of 0.84
• Hashing vectorizer: ngram range = (1,3), lowercase = True
• FastICA: n components = 3
• SGD classifier: loss = modified huber, penalty = elasticnet
Best Full IMDB dataset
Cross validation score of 0.88
• Hashing vectorizer: ngram range = (1,1), lowercase = False
• FastICA: n components = 3
• SGD classifier: loss = log, penalty = None
300 IMDB Docs – Top 5
300 IMDB Docs – Bottom 5
18. Rho value evolution
Pipeline A, B, C:
• A – B fixed
• C changes
* Episodes are sequential, not reflected below
Each pipeline is evaluated for 1..5 episodes of 5-fold cross-validation, 300 documents, 2 classes. Each episode updates the value ρ. (Figure: ρ vs. episode.)
22. References
F. Yang, A. Elkholy, S. Gustafson. Interpretable Automated Machine Learning
in Maana Knowledge Platform. 18th International Conference on
Autonomous Agents and Multiagent Systems (AAMAS), Montreal. Extended
Abstract, May, 2019.
D. Lyu, F. Yang, B. Liu, S. Gustafson. SDRL: Interpretable and Data-efficient
Deep Reinforcement Learning Leveraging Symbolic Planning. 33rd AAAI
Conference on Artificial Intelligence (AAAI), Honolulu, HI, 2019
F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu. Program Search for Machine
Learning Pipelines Leveraging Symbolic Planning and Reinforcement
Learning. In Genetic Programming Theory and Practice XVI. 2018. Springer.
F. Yang, D. Lyu, B. Liu, S. Gustafson. PEORL: Integrating Symbolic Planning
and Hierarchical Reinforcement Learning for Robust Decision-Making. IJCAI.
Sweden. 2018.
23. AutoML
• Algorithm closely mirrors expert’s process, reasonable results
• Algorithm is naturally “human in the loop”
• Includes learning, via human input and reinforcement learning
• Anything else?
24. Digitization / Digital Decisions
AutoML has a knowledge representation of a digital decision
It allows you to think & reason about the decision before
making it
I have made AutoML before, but this time, I want to do it in a
way that aligns with digital decisions in general.
AutoML is simply a digital decision for picking a ML pipeline!
25. Canvas (derived from “to canvass”)
• A set of topics and questions that allow you to gather information about your business and strategy, reflect, brainstorm, and refine strategy
• We will use a four section Decision Canvas:
1. Define the problem or opportunity
2. Identify the decision strategy
3. Break down the decision
4. Define the solution as composable functions
26.
27. Given data with labels, what is the best model to predict label of new data?
28. Given data with labels, what is the best model to predict the label of new data?
• Data and methods that can be combined into a pipeline
• Pipeline with good cross-validation accuracy
• Shortest pipelines with low variability in accuracy
• Iterate over different pipelines
29. Given data with labels, what is the best model to predict the label of new data?
• Labeled data, user preferences on pipeline
• Data and methods that can be combined into a pipeline
• What pipeline steps have worked well, gotten closer to the goal (better CV results)
• Select next pipeline to try
• Iterate over different pipelines
• CV results; CV results plus a user action to stop
• Stop pipeline, set accuracy, constrain options
• Pipeline meets goal, best so far
• Pipeline with good cross-validation accuracy
• Shortest pipelines with low variability in accuracy
30. model = best( … ( learn( score( plan( input data, user preferences ) ) ) ) )
where … is an iteration of learn(score(plan( ))) until all plans are tried or a target accuracy is met.
• model: given input data and user preferences, what is the best pipeline
• plan: given input data, user preferences, and known pipeline-element performance, what are the possible pipelines, ordered by potential performance and length
• score: given a potential pipeline, what is its accuracy
• learn: given pipeline performance, what is the pipeline-element performance
• best: given known pipeline accuracies, which is the best one
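A sketch of this loop in Python; the pipelines and their CV accuracies are a hypothetical lookup table standing in for real cross-validation, and the element-credit rule in learn is an assumed simplification of the R-value update:

```python
# Hypothetical CV accuracies per pipeline (stand-in for running score()).
ASSUMED_CV = {
    ("tfidf", "svd", "logreg"): 0.81,
    ("tfidf", "none", "sgd"): 0.84,
    ("count", "svd", "sgd"): 0.79,
}
element_value = {}  # learned per-element performance

def plan(tried):
    """Order untried pipelines by the learned value of their elements."""
    untried = [p for p in ASSUMED_CV if p not in tried]
    return sorted(untried,
                  key=lambda p: -sum(element_value.get(e, 0.0) for e in p))

def score(pipeline):
    return ASSUMED_CV[pipeline]  # a real system runs cross-validation here

def learn(pipeline, cv):
    for e in pipeline:  # credit every element with the pipeline's result
        element_value[e] = 0.5 * element_value.get(e, cv) + 0.5 * cv

def best(results):
    return max(results, key=results.get)

results, target = {}, 0.83
while plan(results):  # iterate until all plans tried or target accuracy met
    p = plan(results)[0]
    results[p] = score(p)
    learn(p, results[p])
    if results[p] >= target:
        break
model = best(results)
```

Each pass through the loop focuses the next trial on pipelines whose elements have scored well so far, so the target can be met without evaluating every plan.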
31. Example Digitization : Should I bring my
umbrella?
• Traditionally, I would only observe the weather report, but I can now
combine this with my online calendar to decide if I’ll be outside
• It stands to reason that I should bring an umbrella if I’ll be outside long
when it is most likely to rain
• An important meeting, a long distance to walk, or having a lot of other things to carry will all factor into the decision about bringing an umbrella.
• I want to learn to predict what to bring better, a better estimate of
walking times, and learn to manage my daily activities better in
general.
• Optimizing a decision (bring umbrella) extends previous data (weather
report) and fills in missing data (walking times), useful for other
opportunities.
32. Given today’s activities, should I bring my umbrella? (main PQ)
• Given activities and step-monitor data, when can I assume I am outside? (predict time outside based on step data)
• Given time outside and the weather forecast, what is the likelihood of getting wet? (combine outside and weather prediction)
• Given the likelihood of getting wet and activities, when do I accept the recommendation to bring the umbrella? (learn the judgement decision to bring the umbrella (Y/N) as conditioned on wet likelihood and activities)
• Today’s activities, weather predictions
• Am I outside when it’s raining? Will being wet matter? Cost of carrying it?
• Don’t get caught out in the rain
• What activities, when will I be outside, based on steps data
• Carry umbrella given the day’s activities?
• Bring umbrella?
• Happy with advice to bring umbrella – sent by text
• Day’s activities (locations and times), weather service
• Activities (name and time) and activity step-monitor data
• Reply to text is Yes or No. A No is used to train a function on the decision to send the text.
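The decomposition on this canvas can be sketched as composable functions; every helper name, threshold, and data value below is a made-up illustration of the shape of the solution, not the author's implementation:

```python
def time_outside(activities, step_data):
    """Predict hours outside from calendar activities plus step-monitor data
    (assumption: an activity with many steps happened outdoors)."""
    return sum(hours for name, hours in activities
               if step_data.get(name, 0) > 1000)

def wet_likelihood(hours_outside, rain_probability):
    """Combine predicted exposure and the forecast into a chance of getting wet."""
    return min(1.0, hours_outside * rain_probability)

def bring_umbrella(likelihood, important_meeting):
    """Judgement step; the real canvas learns this from Yes/No text replies."""
    threshold = 0.3 if important_meeting else 0.5  # assumed thresholds
    return likelihood >= threshold

# Illustrative day: 1.0h commute and a 0.5h walk register as outdoors.
activities = [("commute", 1.0), ("desk work", 6.0), ("walk to dinner", 0.5)]
steps = {"commute": 4000, "desk work": 200, "walk to dinner": 2500}
decision = bring_umbrella(
    wet_likelihood(time_outside(activities, steps), rain_probability=0.4),
    important_meeting=True)
```

Because each sub-question is its own function, each can be improved or retrained independently, which is the point of breaking the decision down on the canvas.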
33. What do you take away?
Observe my arguments about AutoML and the algorithm
Reason about the evidence, consider past experience
Decide to change your Data Science approach
Learn by experimentation and feedback
34. Team
Fangkai Yang (NVIDIA) Prof. Bo Liu (Auburn) Daoming Lyu (Auburn) Alexander Elkholy Krishnan Ram (intern)
Jeremy Brown Sergey Ilinskiy