Machine Learning - Black Art
Charles Parker
Allston Trading
Machine Learning is Hard!
• By now, you know kind of a lot

• Different types of models

• Feature engineering

• Ways to evaluate

• But you’ll still fail!

• Out in the real world, there’s a
whole bunch of things that will kill
your project

• FYI - A lot of these talks are stolen
Join Me!
• On a journey into the Machine Learning House of
Horrors!

• Mwa ha ha!
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
Choosing A Hypothesis Space
• By “hypothesis space” we
mean the possible classifiers
you could build with an
algorithm given the data

• This is the choice you make
when you pick a learning
algorithm

• You have one job!

• Is there any way to make it
easier?
Theory to The Rescue!
• Probably Approximately Correct

• We’d like our model to have error less than epsilon

• We’d like that to happen at least some percentage of the time

• If the error is epsilon, the percentage is sigma, the number of
training examples is m, and the hypothesis space size is d:
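The formula that followed on the original slide did not survive in the text. As a stand-in, here is one standard textbook bound written with the symbols defined above. It assumes a finite hypothesis space and a learner that always returns a hypothesis consistent with the training data; both are assumptions of this sketch, not statements from the slide.

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln d + \ln\frac{1}{1-\sigma}\right)
```

Textbooks usually write the allowed failure probability as \delta, which corresponds to 1 - \sigma here.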
The Triple Trade-Off
• There is a triple trade-off between the error, the size
of the hypothesis space, and the amount of training
data you have
[Figure: diagram relating Error, Hypothesis Space, and Training Data]
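To make the trade-off concrete, here is a minimal sketch that evaluates a standard PAC sample-size bound for a finite hypothesis space and a consistent learner. The function name and the example numbers are hypothetical, and the bound itself is a textbook assumption rather than something taken from the slide.

```python
import math

def pac_sample_size(epsilon, sigma, d):
    """Examples sufficient for error < epsilon with probability >= sigma.

    Assumes a finite hypothesis space of size d and a learner that is
    consistent with its training data (textbook assumptions).
    """
    delta = 1.0 - sigma  # allowed failure probability
    return math.ceil((math.log(d) + math.log(1.0 / delta)) / epsilon)

base = pac_sample_size(epsilon=0.1, sigma=0.95, d=1000)
tighter_error = pac_sample_size(epsilon=0.05, sigma=0.95, d=1000)
bigger_space = pac_sample_size(epsilon=0.1, sigma=0.95, d=1_000_000)
```

Asking for a smaller error or searching a bigger hypothesis space both push the required amount of training data up, which is the trade-off in miniature.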
What About Huge Data?
• I’m clever, so I’ll use non-parametric methods (decision
trees, k-NN, kernelized SVMs)

• As data scales, curious things
tend to happen

• Simpler models become more
desirable as they’re faster to fit.

• You can increase model
complexity by adding features
(maybe word counts)

• Big data often trumps modeling!
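As a small illustration of adding features to a simple model, the sketch below turns raw text into word-count features that a linear model could consume. The vocabulary and the helper name are made up for this example.

```python
from collections import Counter

def word_count_features(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["free", "money", "meeting"]  # hypothetical vocabulary
x = word_count_features("Free money free prizes", vocab)
```

Growing the vocabulary grows the feature vector, which adds capacity while the model itself stays cheap to fit.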
A Dirty Little Secret About ML Algorithms
• They don’t care what you want

• Decision Trees:

• SVM:

• LR:

• LDA:
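The formulas after each algorithm name were not preserved in the text. For orientation, these are the usual textbook objectives, not necessarily the exact forms shown on the slide. The sketch assumes labels y_i in {-1, +1}, that LR means logistic regression, and that LDA means linear discriminant analysis with between-class scatter S_B and within-class scatter S_W.

```latex
\begin{aligned}
\text{Decision trees:}\quad & \text{choose the split } X_j \text{ maximizing } I(Y; X_j) = H(Y) - H(Y \mid X_j) \\
\text{SVM:}\quad & \min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 + C \sum_i \max\bigl(0,\ 1 - y_i (w^\top x_i + b)\bigr) \\
\text{LR:}\quad & \min_{w}\ \sum_i \log\bigl(1 + e^{-y_i w^\top x_i}\bigr) \\
\text{LDA:}\quad & \max_{w}\ \frac{w^\top S_B\, w}{w^\top S_W\, w}
\end{aligned}
```

None of these mentions what a mistake costs in your application.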
Real-world Losses
• Real losses are nothing like this

• False positive in disease
diagnosis

• False positive in face
detection

• False positive in thumbprint
identification

• Some aren’t even instance-based

• Path dependencies

• Game playing
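A minimal sketch of why error rate and real-world loss can disagree. The cost values are invented for illustration (a missed diagnosis is assumed to be 20 times worse than a false alarm), as are the two sets of predictions.

```python
# Hypothetical costs keyed by (predicted, actual).
COSTS = {
    ("pos", "neg"): 1.0,   # false positive: an unnecessary follow-up
    ("neg", "pos"): 20.0,  # false negative: a missed diagnosis
    ("pos", "pos"): 0.0,
    ("neg", "neg"): 0.0,
}

def total_cost(predicted, actual, costs=COSTS):
    """Sum the application-specific cost of every prediction."""
    return sum(costs[(p, a)] for p, a in zip(predicted, actual))

def error_rate(predicted, actual):
    """Plain fraction of wrong predictions, blind to the kind of mistake."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

actual = ["pos", "neg", "neg", "neg", "pos", "neg"]
model_a = ["neg", "neg", "neg", "neg", "pos", "neg"]  # one false negative
model_b = ["pos", "pos", "pos", "neg", "pos", "neg"]  # two false positives
```

Model A makes fewer mistakes, yet under these assumed costs model B is the cheaper one to deploy.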
Specializing Your Loss
• One solution is to let developers apply their own loss

• This is the approach of SVMlight, which has been around for a while:

http://svmlight.joachims.org/

• For decision trees, losses other than mutual information can be plugged into the
appropriate place in the splitting code

• Models trained via gradient descent can obviously be customized (Python’s
Theano is interesting for this)

• In the case of multi-example loss functions, we have SEARN in Vowpal Wabbit

https://github.com/JohnLangford/vowpal_wabbit
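To show what customizing a gradient-descent model can look like, here is a dependency-free sketch: a one-feature logistic model trained on a weighted log loss, where `fn_weight` (a name invented here) makes false negatives more expensive. It is not the SVMlight, Theano, or Vowpal Wabbit API, only the underlying idea on toy data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, fn_weight=1.0, lr=0.1, steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on a weighted log loss.

    fn_weight scales the loss on positive examples, so values above 1
    penalize false negatives more heavily than false positives.
    """
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            weight = fn_weight if y == 1 else 1.0
            grad_w += weight * (p - y) * x  # d(loss)/dw for this example
            grad_b += weight * (p - y)      # d(loss)/db for this example
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]  # toy data with overlapping classes
w_plain, b_plain = fit(xs, ys)
w_fn, b_fn = fit(xs, ys, fn_weight=5.0)
p_plain = sigmoid(w_plain * 2.5 + b_plain)
p_fn = sigmoid(w_fn * 2.5 + b_fn)
```

With the heavier penalty on missed positives, the model becomes more willing to predict the positive class for a borderline input such as x = 2.5.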
Other Hackery
• Sometimes, the solution is just to hack
around the actual prediction

• Have several levels (cascade) of
classifiers in e.g., medical diagnosis, text
recognition

• Apply logic to explicitly avoid high loss
cases (e.g., when buying/selling equities)

• Changing the problem setting

• Will you be doing queries? Use ranking
or metric learning

• If you catch yourself saying “I want to do crazy
thing x with classifiers”, chances are it’s already been
done and you can read about it.
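One way to picture a cascade is sketched below: a cheap scorer handles the confident cases, and only uncertain inputs reach a more expensive one. The thresholds and both scoring functions are hypothetical placeholders.

```python
def cascade(x, cheap_score, expensive_score, low=0.1, high=0.9):
    """Return (label, stage); fall through to the expensive model when unsure."""
    score = cheap_score(x)
    if score <= low:
        return 0, "cheap"
    if score >= high:
        return 1, "cheap"
    return int(expensive_score(x) >= 0.5), "expensive"

# Hypothetical stand-ins for real models.
cheap = lambda x: x / 10.0
expensive = lambda x: 1.0 if x >= 6 else 0.0

results = [cascade(x, cheap, expensive) for x in (0.5, 5, 7, 9.5)]
```

The same shape works for explicit veto logic: replace the expensive stage with a rule that refuses to act when the potential loss is too high.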
When Validation Attacks!
• Cross validation

• n-Fold - Hold out one fold for
testing, train on n - 1 folds

• Great way to measure
performance, right?

• It’s all about information leakage

• via instances

• via features
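For reference, a plain n-fold split can be sketched in a few lines. The interleaved fold assignment is an arbitrary choice made for this example. Note that it treats every instance as independent, which is exactly the assumption the case studies that follow call into question.

```python
def n_fold_splits(n_items, n_folds):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_items))
    for k in range(n_folds):
        test = indices[k::n_folds]  # every n_folds-th item, starting at k
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

splits = list(n_fold_splits(10, 5))
```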
Case Study #1: Law of Averages
• Estimate sporting event
outcomes

• Use previous games to
estimate points scored for
each team (via windowing
transform)

• Choose winner based on
predicted score

• What if you’re off by one on
the window?
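The off-by-one can be sketched as follows, using made-up scores. With the window shifted by a single game, the feature for a game already includes that game's own score, so the "prediction" has seen the answer.

```python
def rolling_mean_features(points, window, leak=False):
    """Feature for game i: mean points over `window` games.

    leak=False uses games i-window .. i-1 (known before game i).
    leak=True shifts the window by one, so it includes game i itself.
    """
    features = []
    for i in range(window, len(points)):
        if leak:
            start, stop = i - window + 1, i + 1
        else:
            start, stop = i - window, i
        features.append(sum(points[start:stop]) / window)
    return features

points = [10, 20, 30, 40, 50]  # hypothetical points scored per game
clean = rolling_mean_features(points, window=2)
leaky = rolling_mean_features(points, window=2, leak=True)
```

Both versions look equally plausible in a feature table, which is why this kind of leak tends to be found late.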
Case Study #2: Photo Dating
• Take scanned photos from
30 different users (on
average 200 per user) and
create a model to assign a
date taken (plus or minus
five years)

• Perform 10-fold cross-
validation

• Accuracy is 85%. Can
you trust it?
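One plausible reading of this case is that photos from the same user share an era, a camera, and a scanner, so random folds let the model recognize the user rather than the date. A hedged sketch of the usual remedy, keeping each user entirely on one side of the split, is below; the helper name and user labels are invented.

```python
from collections import defaultdict

def grouped_folds(groups, n_folds):
    """Assign whole groups (here: users) to folds so no user spans train and test."""
    fold_of_group = {g: i % n_folds for i, g in enumerate(sorted(set(groups)))}
    folds = defaultdict(list)
    for index, group in enumerate(groups):
        folds[fold_of_group[group]].append(index)
    for k in range(n_folds):
        test = folds[k]
        train = [i for j in range(n_folds) if j != k for i in folds[j]]
        yield train, test

users = ["ann", "ann", "bob", "bob", "cat", "cat", "dan", "dan"]
splits = list(grouped_folds(users, n_folds=2))
```

If accuracy drops sharply under a grouped split, the original 85% was probably measuring leakage.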
Case Study #3: Moments In Time
• You have a buy/sell
opportunity every five
seconds

• The signals you use to
evaluate the opportunity
are aggregates of market
activity over the last five
minutes

• How careful must you be
with cross-validation?
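With a five-minute window sampled every five seconds, neighboring examples share most of their inputs. One common precaution, sketched here under the assumption that examples are indexed in time order, is to drop a gap of one full window on each side of the test block. The function name and sizes are illustrative.

```python
def time_split_with_gap(n_samples, test_start, test_stop, gap):
    """Test on [test_start, test_stop); train only on samples at least `gap` away."""
    test = list(range(test_start, test_stop))
    train = [i for i in range(n_samples)
             if i < test_start - gap or i >= test_stop + gap]
    return train, test

STEP_SECONDS = 5
WINDOW_SECONDS = 5 * 60
GAP = WINDOW_SECONDS // STEP_SECONDS  # samples whose windows can overlap

train, test = time_split_with_gap(n_samples=1000, test_start=400,
                                  test_stop=500, gap=GAP)
```

The price is lost training data around every test block, so the gap is worth sizing to the real window length rather than guessing.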
Breaking Machine Learning
• You’ve got this great model!
Congratulations!

• Suddenly it stops working.
Why?

• You might be in a domain
that tends to change over
time (document classification,
sales prediction)

• You might be experiencing
adverse selection (market
data predictions, spam)
Concept Drift
• This is called non-stationarity in either the prior or the conditional
distributions

• Could be a couple of different things

• If the prior p(input) is changing, it’s covariate shift

• If the conditional p(output | input) is changing, it’s concept drift

• No rule that it can’t be both

• http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
Take Action!
• First: Look for symptoms

• Getting a lot of errors

• The distribution of predicted values changes

• Drift detection algorithms (that I know about) have the same basic flavor:

• Buffer some data in memory

• If recent data is “different” from past data, retrain, update or give up

• Some resources - A nice survey paper and an open source package:
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf

http://moa.cms.waikato.ac.nz/
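The buffer-and-compare flavor described above can be sketched with two fixed-size buffers of per-example errors. This is a deliberately naive illustration, not the algorithm from the survey or from MOA; the class name, window size, and threshold are all assumptions.

```python
from collections import deque

class MeanShiftDetector:
    """Flag drift when the recent mean error exceeds the reference mean by `threshold`."""

    def __init__(self, window, threshold):
        self.reference = deque(maxlen=window)  # older errors
        self.recent = deque(maxlen=window)     # newest errors
        self.threshold = threshold

    def update(self, error):
        if len(self.recent) == self.recent.maxlen:
            # The oldest "recent" value graduates into the reference buffer.
            self.reference.append(self.recent[0])
        self.recent.append(error)
        if len(self.reference) < self.reference.maxlen:
            return False  # not enough history yet
        ref_mean = sum(self.reference) / len(self.reference)
        recent_mean = sum(self.recent) / len(self.recent)
        return recent_mean - ref_mean > self.threshold

detector = MeanShiftDetector(window=5, threshold=0.5)
errors = [0.0] * 20 + [1.0] * 10  # the model starts failing at index 20
flags = [detector.update(e) for e in errors]
```

A `True` flag is the cue to retrain, update, or give up; a real detector would typically use a statistical test rather than a fixed threshold.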
The Benefits of Archeology
• Why might you train on old
data, even if it’s not relevant?

• Verification of your research
process

• You’d have done the same
thing last year. Would it have worked?

• Gives you a good idea of
how much drift you should
expect
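A walk-forward check along these lines might look like the sketch below: train on each period, score on the next, and read the sequence of scores as an estimate of drift. The toy "model" (a mean) and the data are placeholders chosen only to keep the example self-contained.

```python
def walk_forward(data_by_period, train_fn, score_fn):
    """Train on each period and score on the following one."""
    periods = sorted(data_by_period)
    scores = {}
    for past, future in zip(periods, periods[1:]):
        model = train_fn(data_by_period[past])
        scores[future] = score_fn(model, data_by_period[future])
    return scores

# Hypothetical stand-ins: the "model" is last period's mean value,
# and the score is mean absolute error on the next period.
def train_mean(values):
    return sum(values) / len(values)

def mean_abs_error(model, values):
    return sum(abs(v - model) for v in values) / len(values)

history = {2012: [1.0, 1.0], 2013: [1.0, 3.0], 2014: [4.0, 6.0]}
scores = walk_forward(history, train_mean, mean_abs_error)
```

If last year's process would have failed on this year's data, that is worth knowing before trusting this year's model.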
Publish or Perish
• Academic papers are a certain type of
result

• Show incremental improvement in
accuracy or generality

• Prove something about your
algorithm

• This latter is hard to come by as results
get more realistic

• Machine learning proofs assume data
is “i.i.d.”, but this is obviously false.

• Real-world data sucks, and dealing
with that significantly changes the
dataset
Usefulness of Results
• Theoretical Results

• Most of the time bounds do not apply (error, sample
complexity, convergence)

• Sometimes they don’t even make any sense

• Beware of putting too much faith in a single person or single
person’s work

• Usefulness generally occurs only in the aggregate

• And sometimes not even then (researchers are people, too)
Machine Learning Isn’t About Machine Learning
• Why doesn’t it work like in the
paper?

• Remember, the paper is carefully
controlled in a way your application
is not.

• Performance is rarely driven by
machine learning

• It’s driven by camera
microphones

• It’s driven by Mario Draghi
So, Don’t Bother With It?
• Of course not!

• What’s the alternative?

• “All our science, measured
against reality, is primitive
and childlike — and yet it is
the most precious thing we
have” - Albert Einstein

• Use academia as your
starting point, but don’t
think it will get you out of
the work
Some Themes
• The major points of this talk:

• Machine learning is hard to get right

• The algorithms won’t do what you want

• Good results are probably spurious

• Even if they aren’t, it won’t last

• Reading the research won’t help

• Wait, no!

• Have an attitude of skeptical optimism (or optimal skepticism?)
30

Mais conteúdo relacionado

Mais procurados

Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.butest
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyMarina Santini
 
A Friendly Introduction to Machine Learning
A Friendly Introduction to Machine LearningA Friendly Introduction to Machine Learning
A Friendly Introduction to Machine LearningHaptik
 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBigML, Inc
 
Lecture 01: Machine Learning for Language Technology - Introduction
 Lecture 01: Machine Learning for Language Technology - Introduction Lecture 01: Machine Learning for Language Technology - Introduction
Lecture 01: Machine Learning for Language Technology - IntroductionMarina Santini
 
VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2BigML, Inc
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine LearningJoel Graff
 
Supervised learning
Supervised learningSupervised learning
Supervised learningankit_ppt
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Hayim Makabee
 
Machine learning basics
Machine learning basics Machine learning basics
Machine learning basics Akanksha Bali
 
Lesson 3 ai in the enterprise
Lesson 3   ai in the enterpriseLesson 3   ai in the enterprise
Lesson 3 ai in the enterpriseankit_ppt
 
Machine Learning in NutShell
Machine Learning in NutShellMachine Learning in NutShell
Machine Learning in NutShellAshwin Shiv
 
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighborsMl1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighborsankit_ppt
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learningbutest
 

Mais procurados (20)

Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language Technology
 
What is Machine Learning
What is Machine LearningWhat is Machine Learning
What is Machine Learning
 
A Friendly Introduction to Machine Learning
A Friendly Introduction to Machine LearningA Friendly Introduction to Machine Learning
A Friendly Introduction to Machine Learning
 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly Detection
 
Lecture 01: Machine Learning for Language Technology - Introduction
 Lecture 01: Machine Learning for Language Technology - Introduction Lecture 01: Machine Learning for Language Technology - Introduction
Lecture 01: Machine Learning for Language Technology - Introduction
 
VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2
 
ML Basics
ML BasicsML Basics
ML Basics
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine Learning
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)
 
Machine learning basics
Machine learning basics Machine learning basics
Machine learning basics
 
Machine learning
Machine learningMachine learning
Machine learning
 
Lesson 3 ai in the enterprise
Lesson 3   ai in the enterpriseLesson 3   ai in the enterprise
Lesson 3 ai in the enterprise
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Machine Learning in NutShell
Machine Learning in NutShellMachine Learning in NutShell
Machine Learning in NutShell
 
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighborsMl1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learning
 

Semelhante a L15. Machine Learning - Black Art

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...tboubez
 
Influx/Days 2017 San Francisco | Baron Schwartz
Influx/Days 2017 San Francisco | Baron SchwartzInflux/Days 2017 San Francisco | Baron Schwartz
Influx/Days 2017 San Francisco | Baron SchwartzInfluxData
 
DutchMLSchool. Logistic Regression, Deepnets, Time Series
DutchMLSchool. Logistic Regression, Deepnets, Time SeriesDutchMLSchool. Logistic Regression, Deepnets, Time Series
DutchMLSchool. Logistic Regression, Deepnets, Time SeriesBigML, Inc
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?Srinath Perera
 
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale DataPredicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Dataphilippbayer
 
Will Robots Replace Testers?
Will Robots Replace Testers?Will Robots Replace Testers?
Will Robots Replace Testers?TEST Huddle
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modellingQuinton Anderson
 
AI Models For Fun and Profit by Walmart Director of Artificial Intelligence
AI Models For Fun and Profit by Walmart Director of Artificial IntelligenceAI Models For Fun and Profit by Walmart Director of Artificial Intelligence
AI Models For Fun and Profit by Walmart Director of Artificial IntelligenceProduct School
 
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tDefcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tpseudor00t overflow
 
VSSML18. OptiML and Fusions
VSSML18. OptiML and FusionsVSSML18. OptiML and Fusions
VSSML18. OptiML and FusionsBigML, Inc
 
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...Alex Pinto
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk KnowledgeKrishna Sankar
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
How to make m achines learn
How to make m achines learnHow to make m achines learn
How to make m achines learniskamegy
 
November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...
November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...
November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...Drew Miller
 
Testing for cognitive bias in ai systems
Testing for cognitive bias in ai systemsTesting for cognitive bias in ai systems
Testing for cognitive bias in ai systemsPeter Varhol
 

Semelhante a L15. Machine Learning - Black Art (20)

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 
Influx/Days 2017 San Francisco | Baron Schwartz
Influx/Days 2017 San Francisco | Baron SchwartzInflux/Days 2017 San Francisco | Baron Schwartz
Influx/Days 2017 San Francisco | Baron Schwartz
 
Waves keynote2c
Waves keynote2cWaves keynote2c
Waves keynote2c
 
DutchMLSchool. Logistic Regression, Deepnets, Time Series
DutchMLSchool. Logistic Regression, Deepnets, Time SeriesDutchMLSchool. Logistic Regression, Deepnets, Time Series
DutchMLSchool. Logistic Regression, Deepnets, Time Series
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?
 
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale DataPredicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
 
Will Robots Replace Testers?
Will Robots Replace Testers?Will Robots Replace Testers?
Will Robots Replace Testers?
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
AI Models For Fun and Profit by Walmart Director of Artificial Intelligence
AI Models For Fun and Profit by Walmart Director of Artificial IntelligenceAI Models For Fun and Profit by Walmart Director of Artificial Intelligence
AI Models For Fun and Profit by Walmart Director of Artificial Intelligence
 
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tDefcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
 
Ml masterclass
Ml masterclassMl masterclass
Ml masterclass
 
VSSML18. OptiML and Fusions
VSSML18. OptiML and FusionsVSSML18. OptiML and Fusions
VSSML18. OptiML and Fusions
 
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
How to make m achines learn
How to make m achines learnHow to make m achines learn
How to make m achines learn
 
November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...
November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...
November 15th 2018 denver cu seminar (drew miller) ai robotics cryptocurrency...
 
Testing for cognitive bias in ai systems
Testing for cognitive bias in ai systemsTesting for cognitive bias in ai systems
Testing for cognitive bias in ai systems
 
CS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptxCS194Lec0hbh6EDA.pptx
CS194Lec0hbh6EDA.pptx
 

Mais de Machine Learning Valencia

Mais de Machine Learning Valencia (12)

From Turing To Humanoid Robots - Ramón López de Mántaras
From Turing To Humanoid Robots - Ramón López de MántarasFrom Turing To Humanoid Robots - Ramón López de Mántaras
From Turing To Humanoid Robots - Ramón López de Mántaras
 
Artificial Intelligence Progress - Tom Dietterich
Artificial Intelligence Progress - Tom DietterichArtificial Intelligence Progress - Tom Dietterich
Artificial Intelligence Progress - Tom Dietterich
 
L14. Anomaly Detection
L14. Anomaly DetectionL14. Anomaly Detection
L14. Anomaly Detection
 
L9. Real World Machine Learning - Cooking Predictions
L9. Real World Machine Learning - Cooking PredictionsL9. Real World Machine Learning - Cooking Predictions
L9. Real World Machine Learning - Cooking Predictions
 
L7. A developers’ overview of the world of predictive APIs
L7. A developers’ overview of the world of predictive APIsL7. A developers’ overview of the world of predictive APIs
L7. A developers’ overview of the world of predictive APIs
 
LR1. Summary Day 1
LR1. Summary Day 1LR1. Summary Day 1
LR1. Summary Day 1
 
L6. Unbalanced Datasets
L6. Unbalanced DatasetsL6. Unbalanced Datasets
L6. Unbalanced Datasets
 
L5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature EngineeringL5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature Engineering
 
L4. Ensembles of Decision Trees
L4. Ensembles of Decision TreesL4. Ensembles of Decision Trees
L4. Ensembles of Decision Trees
 
L3. Decision Trees
L3. Decision TreesL3. Decision Trees
L3. Decision Trees
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 
L1. State of the Art in Machine Learning
L1. State of the Art in Machine LearningL1. State of the Art in Machine Learning
L1. State of the Art in Machine Learning
 

Último

+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制vexqp
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样wsppdmt
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss ConfederationEfruzAsilolu
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjurptikerjasaptiker
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schscnajjemba
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制vexqp
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 

Último (20)

+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

L15. Machine Learning - Black Art

  • 1. Machine Learning - Black Art Charles Parker Allston Trading
  • 2. Machine Learning is Hard! • By now, you know kind of a lot • Different types of models • Feature engineering • Ways to evaluate • But you’ll still fail! • Out in the real world, there’s a whole bunch of things that will kill your project • FYI - A lot of these talks are stolen
  • 3. Join Me! • On a journey into the Machine Learning House of Horrors! • Mwa ha ha!
  • 4. The Machine Learning House of Horrors! • The Horror of The Huge Hypothesis Space • The Perils of The Poorly Picked Loss Function • The Creeping Creature Called Cross Validation • The Dread of the Drifting Domain • The Repugnance of Reliance on Research Results
  • 5. Choosing A Hypothesis Space • By “hypothesis space” we mean the possible classifiers you could build with an algorithm given the data • This is the choice you make when you pick a learning algorithm • You have one job! • Is there any way to make it easier?
  • 6. Theory to The Rescue! • Probably Approximately Correct • We’d like our model to have error less than epsilon • We’d like that to happen at least some percentage of the time • If the error is epsilon, the percentage is sigma, the number of training examples is m, and the hypothesis space size is d:
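The formula that followed the colon on this slide is not preserved in the transcript. As a hedged stand-in, one standard textbook PAC result says that for a finite hypothesis space of size d and a learner that is consistent with its training data, m >= (1/epsilon) * (ln d + ln(1/delta)) examples are enough. Here delta is the allowed failure probability, that is, one minus the "percentage" the slide calls sigma. This may not be the exact bound the speaker showed, and the function name below is hypothetical.

```python
import math


def pac_sample_bound(epsilon: float, delta: float, d: int) -> int:
    """Examples sufficient for error <= epsilon with probability >= 1 - delta.

    Assumes a finite hypothesis space of size d and a learner that returns a
    hypothesis consistent with the training data (the textbook setting, which
    is an assumption here, not something stated on the slide).
    """
    if not (0 < epsilon < 1 and 0 < delta < 1 and d >= 1):
        raise ValueError("need 0 < epsilon < 1, 0 < delta < 1 and d >= 1")
    return math.ceil((math.log(d) + math.log(1.0 / delta)) / epsilon)


if __name__ == "__main__":
    # The bound grows only logarithmically with the hypothesis space size,
    # but linearly with 1 / epsilon.
    for size in (10**3, 10**6, 10**9):
        print(size, pac_sample_bound(epsilon=0.1, delta=0.05, d=size))
```

Read this way, the bound already hints at the trade-off: for a fixed amount of data, a bigger hypothesis space or a smaller failure probability forces a larger error tolerance.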
  • 7. The Triple Trade-Off • There is a triple trade-off between the error, the size of the hypothesis space, and the amount of training data you have • [Figure: diagram labeled Error, Hypothesis Space, Training Data]
  • 8. What About Huge Data? • I’m clever, so I’ll use non-parametric methods (decision trees, k-NN, kernelized SVMs) • As data scales, curious things tend to happen • Simpler models become more desirable as they’re faster to fit. • You can increase model complexity by adding features (maybe word counts) • Big data often trumps modeling!
  • 9. The Machine Learning House of Horrors! • The Horror of The Huge Hypothesis Space • The Perils of The Poorly Picked Loss Function • The Creeping Creature Called Cross Validation • The Dread of the Drifting Domain • The Repugnance of Reliance on Research Results
  • 10. A Dirty Little Secret About ML Algorithms • They don’t care what you want • Decision Trees: • SVM: • LR: • LDA:
  • 11. Real-world Losses • Real losses are nothing like this • False positive in disease diagnosis • False positive in face detection • False positive in thumbprint identification • Some aren’t even instance-based • Path dependencies • Game playing
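To make the gap between accuracy and a real-world loss concrete, here is a small sketch that scores two classifiers with identical accuracy under two different cost settings. The confusion counts, the cost values and the function names are all made up for illustration; none of them come from the talk.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def total_cost(tp, fp, fn, tn, cost_fp, cost_fn):
    """Total cost of the errors; correct predictions are assumed to be free."""
    return fp * cost_fp + fn * cost_fn


# Two hypothetical classifiers: same accuracy, opposite kinds of mistakes.
model_a = dict(tp=40, fp=10, fn=0, tn=50)  # errs toward false positives
model_b = dict(tp=30, fp=0, fn=10, tn=60)  # errs toward false negatives

# Hypothetical cost settings. Screening: a missed case is expensive.
# Identification: a false match is expensive.
screening = dict(cost_fp=1, cost_fn=50)
identification = dict(cost_fp=50, cost_fn=1)

if __name__ == "__main__":
    for name, model in (("A", model_a), ("B", model_b)):
        print(name, accuracy(**model),
              total_cost(**model, **screening),
              total_cost(**model, **identification))
```

Both models are 90% accurate, yet which one is preferable flips depending on which error is expensive, which is exactly the information an off-the-shelf training objective never sees.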
  • 12. Specializing Your Loss • One solution is to let developers apply their own loss • This is the approach of SVMlight (http://svmlight.joachims.org/), which has been around for a while • Losses other than Mutual Information can be plugged into the appropriate place in splitting code • Models trained via gradient descent can obviously be customized (Python’s Theano is interesting for this) • For loss functions defined over multiple examples, there is SEARN in Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit
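As a minimal sketch of customizing a gradient-descent model, the code below fits a one-feature logistic regression with a weighted log loss, so that missed positives and false alarms can be penalized differently. It uses plain Python rather than Theano, and the function names, the weights and the toy data are illustrative assumptions.

```python
import math


def predict_proba(x, w, b):
    """Model output: p(y = 1 | x) = sigmoid(w * x + b)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def fit_weighted_logistic(xs, ys, cost_fn=1.0, cost_fp=1.0, lr=0.5, steps=2000):
    """Batch gradient descent on a cost-weighted log loss.

    cost_fn scales the penalty for scoring a positive example low;
    cost_fp scales the penalty for scoring a negative example high.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = predict_proba(x, w, b)
            # Derivative of the weighted loss with respect to the logit.
            g = cost_fp * (1 - y) * p - cost_fn * y * (1 - p)
            grad_w += g * x
            grad_b += g
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    # Toy data: one positive among four otherwise identical examples.
    xs, ys = [0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0]
    print(predict_proba(0.0, *fit_weighted_logistic(xs, ys)))
    print(predict_proba(0.0, *fit_weighted_logistic(xs, ys, cost_fn=3.0)))
```

With equal weights the model settles near the base rate of 0.25; tripling the cost of a missed positive pushes the same prediction up to about 0.5, with no change to the data or the model family.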
  • 13. Other Hackery • Sometimes, the solution is just to hack around the actual prediction • Have several levels (a cascade) of classifiers, e.g., in medical diagnosis or text recognition • Apply logic to explicitly avoid high-loss cases (e.g., when buying/selling equities) • Changing the problem setting • Will you be doing queries? Use ranking or metric learning • If you think “I want to do crazy thing x with classifiers”, chances are it’s already been done and you can read about it.
  • 14. The Machine Learning House of Horrors! • The Horror of The Huge Hypothesis Space • The Perils of The Poorly Picked Loss Function • The Creeping Creature Called Cross Validation • The Dread of the Drifting Domain • The Repugnance of Reliance on Research Results
  • 15. When Validation Attacks! • Cross validation • n-fold: hold out one fold for testing, train on the other n - 1 folds • Great way to measure performance, right? • It’s all about information leakage • via instances • via features
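For reference, the n-fold procedure on this slide can be sketched as a generator of index splits. This is a plain, unshuffled version with a hypothetical function name; it deliberately does nothing about leakage, which is the point of the case studies that follow.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(10, 3):
        print("train", train_idx, "test", test_idx)
```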
  • 16. Case Study #1: Law of Averages • Estimate sporting event outcomes • Use previous games to estimate points scored for each team (via windowing transform) • Choose winner based on predicted score • What if you’re off by one on the window?
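A minimal sketch of the off-by-one hazard, assuming the feature is a rolling mean of a team's points. The function name and the `include_current` flag are hypothetical; the flag exists only to reproduce the bug, where the window ends on the game being predicted and the target leaks into its own feature.

```python
def rolling_mean_features(points, window, include_current=False):
    """Rolling mean of points scored, one feature value per game.

    include_current=False: use only games strictly before game i (safe).
    include_current=True: the window ends at game i itself (the off-by-one),
    so the score being predicted is part of its own input.
    """
    features = []
    for i in range(len(points)):
        end = i + 1 if include_current else i
        past = points[max(0, end - window):end]
        # No history yet for the first game in the safe version.
        features.append(sum(past) / len(past) if past else None)
    return features


if __name__ == "__main__":
    scores = [10, 20, 30, 40]
    print(rolling_mean_features(scores, window=2))
    print(rolling_mean_features(scores, window=2, include_current=True))
```

The leaky version will look better in validation, because every feature already contains part of the answer, and it cannot be computed at prediction time.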
  • 17. Case Study #2: Photo Dating • Take scanned photos from 30 different users (on average 200 per user) and create a model to assign a date taken (plus or minus five years) • Perform 10-fold cross validation • Accuracy is 85%. Can you trust it?
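The slide leaves the question open. One plausible worry is leakage via instances: if photos from the same user land in both the training and test folds, the model may be recognizing the user's collection rather than the date. A common precaution, sketched here under that assumption, is to assign whole users to folds. The function name and the round-robin assignment are illustrative choices.

```python
def group_k_fold(groups, n_folds):
    """Yield (train_idx, test_idx) so that no group appears on both sides.

    groups[i] is the group label (for example a user id) of sample i.
    """
    unique_groups = sorted(set(groups))
    if not 2 <= n_folds <= len(unique_groups):
        raise ValueError("need 2 <= n_folds <= number of distinct groups")
    # Round-robin assignment of whole groups to folds.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    for fold in range(n_folds):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    users = ["u1", "u1", "u2", "u3", "u3", "u3"]
    for train_idx, test_idx in group_k_fold(users, n_folds=3):
        print("train", train_idx, "test", test_idx)
```

If accuracy drops sharply when users are held out as a block, the original 85% was measuring something other than the ability to date photos from a new user.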
  • 18. Case Study #3: Moments In Time • You have a buy/sell opportunity every five seconds • The signals you use to evaluate the opportunity are aggregates of market activity over the last five minutes • How careful must you be with cross-validation?
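Here the leakage is via features: samples five seconds apart share almost all of their five-minute window, so a random split puts near-copies of each test point into training. One way to be careful, sketched below as an assumption rather than the speaker's stated method, is to split by time and discard training samples within one lookback of the test period. The function name and the timestamps in seconds are hypothetical.

```python
def time_split_with_gap(timestamps, test_start, test_end, lookback):
    """Split time-stamped samples around a test period, leaving a gap.

    A sample at time t is assumed to use data from [t - lookback, t].
    Training samples whose window could overlap the window of any test
    sample are discarded.
    """
    train_idx, test_idx = [], []
    for i, t in enumerate(timestamps):
        if test_start <= t < test_end:
            test_idx.append(i)
        elif t < test_start - lookback or t >= test_end + lookback:
            train_idx.append(i)
        # Anything else lies within `lookback` of the test period: dropped.
    return train_idx, test_idx


if __name__ == "__main__":
    # One sample every 5 seconds for 20 minutes; features look back 300 seconds.
    times = list(range(0, 1200, 5))
    train_idx, test_idx = time_split_with_gap(times, 600, 700, lookback=300)
    print(len(train_idx), len(test_idx))
```

The price is lost training data: in this toy setup, 120 of 240 samples are thrown away to protect a 100-second test period.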
  • 19. The Machine Learning House of Horrors! • The Horror of The Huge Hypothesis Space • The Perils of The Poorly Picked Loss Function • The Creeping Creature Called Cross Validation • The Dread of the Drifting Domain • The Repugnance of Reliance on Research Results
  • 20. Breaking Machine Learning • You’ve got this great model! Congratulations! • Suddenly it stops working. Why? • You might be in a domain that tends to change over time (document classification, sales prediction) • You might be experiencing adverse selection (market data predictions, spam)
  • 21. Concept Drift • This is called non-stationarity in either the prior or the conditional distributions • Could be a couple of different things • If the prior p(input) is changing, it’s covariate shift • If the conditional p(output | input) is changing, it’s concept drift • No rule that it can’t be both • http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
  • 22. Take Action! • First: Look for symptoms • Getting a lot of errors • The distribution of predicted values changes • Drift detection algorithms (that I know about) have the same basic flavor: • Buffer some data in memory • If recent data is “different” from past data, retrain, update or give up • Some resources - A nice survey paper and an open source package: http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf and http://moa.cms.waikato.ac.nz/
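The "buffer and compare" flavor can be sketched with two sliding windows of prediction outcomes. This is an illustrative toy, not an algorithm taken from the survey or from MOA; the class name, the window size and the margin are all assumptions, and "different" is reduced to a simple gap in error rates.

```python
from collections import deque


class ErrorRateDriftDetector:
    """Flag drift when the recent error rate exceeds the older one by a margin."""

    def __init__(self, window=50, margin=0.15):
        self.window = window
        self.margin = margin
        self.reference = deque(maxlen=window)  # older outcomes
        self.recent = deque(maxlen=window)     # newest outcomes

    def update(self, is_error):
        """Record one prediction outcome; return True if drift is suspected."""
        if len(self.recent) == self.window:
            # The outcome about to fall out of `recent` moves to `reference`.
            self.reference.append(self.recent[0])
        self.recent.append(bool(is_error))
        if len(self.reference) < self.window:
            return False  # not enough history to compare yet
        ref_rate = sum(self.reference) / len(self.reference)
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate - ref_rate > self.margin


if __name__ == "__main__":
    detector = ErrorRateDriftDetector()
    # 100 outcomes at a 10% error rate, then the model starts failing.
    stream = [i % 10 == 0 for i in range(100)] + [True] * 20
    for step, outcome in enumerate(stream):
        if detector.update(outcome):
            print("drift suspected at step", step)
            break
```

What to do when the flag fires (retrain, update or give up) is a separate decision; the detector only says that the recent buffer no longer looks like the past.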
  • 23. The Benefits of Archeology • Why might you train on old data, even if it’s not relevant? • Verification of your research process • You would have done the same thing last year. Would it have worked? • Gives you a good idea of how much drift you should expect
  • 24. The Machine Learning House of Horrors! • The Horror of The Huge Hypothesis Space • The Perils of The Poorly Picked Loss Function • The Creeping Creature Called Cross Validation • The Dread of the Drifting Domain • The Repugnance of Reliance on Research Results
  • 25. Publish or Perish • Academic papers are a certain type of result • Show incremental improvement in accuracy or generality • Prove something about your algorithm • The latter is hard to come by as results get more realistic • Machine learning proofs assume data is “i.i.d.”, but this is obviously false. • Real world data sucks, and dealing with that significantly changes the dataset
  • 26. Usefulness of Results • Theoretical Results • Most of the time bounds do not apply (error, sample complexity, convergence) • Sometimes they don’t even make any sense • Beware of putting too much faith in a single person or single person’s work • Usefulness generally occurs only in the aggregate • And sometimes not even then (researchers are people, too)
  • 27. Machine Learning Isn’t About Machine Learning • Why doesn’t it work like in the paper? • Remember, the paper is carefully controlled in a way your application is not. • Performance is rarely driven by machine learning • It’s driven by camera microphones • It’s driven by Mario Draghi
  • 28. So, Don’t Bother With It? • Of course not! • What’s the alternative? • “All our science, measured against reality, is primitive and childlike - and yet it is the most precious thing we have” - Albert Einstein • Use academia as your starting point, but don’t think it will get you out of the work
  • 29. Some Themes • The major points of this talk: • Machine learning is hard to get right • The algorithms won’t do what you want • Good results are probably spurious • Even if they aren’t, it won’t last • Reading the research won’t help • Wait, no! • Have an attitude of skeptical optimism (or optimal skepticism?)