Wapid and wobust active online machine leawning with Vowpal Wabbit 
Pycon Finland 2014, Helsinki 
2014-10-27 
Antti Haapala 
antti@anttipatterns.com
Disclaimer 
● IANAS – I Am Not A Statistician 
● I researched the principles on how this works 
for this presentation
Why did I start to do ML? 
● Task 
– Receive social media content from various sources 
– Filter out all messages that are not in English, are press 
releases or outright spam. 
● Easy, when you can hire a team of people for just this task... 
● But people are expensive compared to computers... 
– And filtering messages is tedious work 
● Clearly a little machine learning could help us to 
separate the spam from sausage, eggs and ham.
Time to code 
● Write a binary classifier 
● But with what? 
– How does one even do it?
Libraries: NLTK 
● NLTK 
– Has some pure Python classifier implementations 
– These algorithms require all data in memory 
– The speed is an issue here 
● Some of them are too slow 
● The rest are even slower
Libraries: Scikit-Learn 
● Scikit-Learn 
– Better than NLTK 
– Though most algorithms require all data in memory 
● And our data still does not fit 
– There are some out-of-core algorithms yes, but 
they're not clearly documented 
– Still slow - we cannot afford to reevaluate our 
classifiers for hours...
Possible libraries 
● How about FANN, Orange, PyMC, PyML, 
LIBSVM, PyBrain, ffnet, MDP, Shogun toolbox, 
Theano, mlpy, Elefant, Bayes Blocks, Monte 
Python, hcluster, Plearn, Pycplex, pymorph....
????
Asking does not hurt 
“Have you tried using Vowpal Wabbit?” 
“Vowwhat?” 
“Vowpal Wabbit”
What is Vowpal Wabbit? 
A research project with the most Pythonic name 
ever
The name
What is Vowpal Wabbit? 
• John Langford: I'd like to solve AI. 
• Interviewer: How? 
• John: I want to use parallel learning algorithms 
to create fantastic learning machines!
What is Vowpal Wabbit? 
“VW is the essence of speed in machine learning, 
able to learn from terafeature datasets with ease.”
What is Vowpal Wabbit? 
“Via parallel learning, it can exceed the 
throughput of any single machine network 
interface when doing linear learning, a first 
amongst learning algorithms.”
Built for speed and scalability 
● “Plausibly the most scalable public linear 
learner, and plausibly the most scalable 
anywhere” 
● Excels on a cluster, though performance is impressive 
even on a single node.
Vowpal Wabbit compared to scikit-learn 
The algorithms where the cheatsheet says “> 
100k samples”
Scalability 
● Find a good linear predictor f_w(x) = \sum_i w_i x_i 
– For 2,100,000,000,000 features... 
– 17,000,000,000 examples... 
– 16,000,000 parameters... 
– Using 1,000 nodes... 
● Finished in 70 minutes, at 500M features per second 
● That was years ago, using the then stock build of 
VW.
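The linear predictor above is a weighted sum over the features present in an example. A minimal sketch, assuming weights and features are stored as plain dictionaries (the names `predict`, `weights` and `example` are illustrative, not part of VW):

```python
def predict(weights, features):
    """Return f_w(x) = sum_i w_i * x_i, visiting only the non-zero features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Hypothetical weights; features missing from the table contribute nothing.
weights = {"nigerian": -1.5, "prince": -0.5, "interview": 2.0}
example = {"nigerian": 1.0, "interview": 1.0, "unseen": 1.0}

score = predict(weights, example)
print(score)  # 0.5
```

Because only the features that occur in the example are visited, the cost of one prediction depends on the length of the example, not on the total number of features.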
Open Source 
● Vowpal Wabbit is open source, under the BSD 
license 
● Available even in the Ubuntu universe repository 
● The project was started at Yahoo Research and is 
currently under Microsoft Research. 
– So even Windows will be supported...
Sparse Stochastic 
Gradient Descent 
● Maps all inputs to n-dimensional space 
● And divides the space by one hyperplane 
minimizing the loss caused by wrong 
classification 
– One class is on one side of the plane 
– The other is on the other side of the plane 
– The loss is modeled by a loss function
Stochastic Gradient Descent 
Image from Scikit-Learn
Which loss function for a classifier? 
● Crash course in statistics: 
– “It helps if you understand the data” 
– “But if you don't then try logistic regression” 
– Thus go for the logistic loss function
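To make the idea concrete, here is a minimal sketch of one plain stochastic gradient step on the logistic loss, assuming labels in {-1, +1} and dictionary-based sparse features. This is textbook SGD with a fixed learning rate, not VW's actual update rule; `sgd_step` and `logistic_loss` are illustrative names.

```python
import math


def dot(weights, features):
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def logistic_loss(weights, features, label):
    """Loss for one example: log(1 + exp(-y * w.x))."""
    return math.log1p(math.exp(-label * dot(weights, features)))


def sgd_step(weights, features, label, learning_rate=0.5):
    """Move the weights one step against the gradient of the logistic loss."""
    margin = label * dot(weights, features)
    # d(loss)/d(w_i) = -y * x_i / (1 + exp(y * w.x))
    scale = -label / (1.0 + math.exp(margin))
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) - learning_rate * scale * v
    return weights


weights = {}
spam = {"spam": 1.0}
loss_before = logistic_loss(weights, spam, -1)
sgd_step(weights, spam, -1)
loss_after = logistic_loss(weights, spam, -1)
print(weights, loss_before, loss_after)
```

Each step touches only the weights of features present in the example, which is what keeps the method cheap on sparse data.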
Multiclass classifier 
● Vowpal Wabbit supports various methods for 
multiclass classification; see the documentation 
for how to use them.
Least squares regression 
● The gradient descent algorithm can also be used 
for regression, for example using the “squared” 
loss function for least squares. 
● A regression predicts a real-valued output that 
depends on the given features 
● A classifier gives a class for the input, and 
possibly the probability of the input belonging to that 
class
Classifier output 
in logistic regression 
● With Vowpal Wabbit the prediction value given 
by a classifier with logistic loss is in the range 
[-50, 50] 
● You can map this to a binary probability using 
the logistic function
From prediction to probability 
p = \frac{1}{1 + e^{-x}}
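In code, the logistic function is a one-liner. A minimal sketch (the function name `to_probability` is illustrative); with labels -1 and 1, the result can be read as the probability of the positive class:

```python
import math


def to_probability(prediction):
    """Map a raw prediction x to p = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-prediction))


print(to_probability(0.0))        # 0.5
print(to_probability(-0.145824))  # slightly below 0.5
print(to_probability(50.0))       # very close to 1.0
```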
Common practices of machine 
learning 
● Reduce the number of features by hand 
guessing which features are relevant 
● Use non-linear approaches such as the kernel 
trick 
● Map your features to integers 
● Leave your computer on at night to build the 
model from your training data
... become don'ts 
● Reduce the number of features by hand 
guessing which features are relevant 
● Use non-linear approaches such as the kernel 
trick 
● Map your features to integers 
● Leave your workstation on at night to build the 
model from your training data
Reduce the number of features 
● Vowpal Wabbit can handle sparse featuresets 
having millions of features efficiently
Use non-linear approaches 
● A sparse dataset with many dimensions yields 
results comparable to using fewer features with 
kernel tricks 
● One can ask Vowpal Wabbit to generate new 
features as the Cartesian product of existing 
features, using namespaces: 
– That is, given features u^a, u^b, v^c, and v^d, by 
using command line parameter -q uv, VW can make 
u^a^v^c and so forth.
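What such a Cartesian product looks like can be sketched in a few lines. This is an illustration of the idea only, assuming an example is a dictionary of namespaces; VW does this internally, and the function name `quadratic_features` is hypothetical.

```python
from itertools import product


def quadratic_features(example, first, second):
    """Cross every feature in namespace `first` with every feature in `second`."""
    crossed = {}
    pairs = product(example[first].items(), example[second].items())
    for (name_a, value_a), (name_b, value_b) in pairs:
        # The value of a crossed feature is the product of the two values.
        crossed[f"{first}^{name_a}^{second}^{name_b}"] = value_a * value_b
    return crossed


example = {"u": {"a": 1.0, "b": 2.0}, "v": {"c": 3.0, "d": 1.0}}
crossed = quadratic_features(example, "u", "v")
print(sorted(crossed))
```

Two namespaces with m and n features yield m * n crossed features, so the feature count grows quickly; this is one reason the sparse representation matters.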
Map your features to integers 
● Vowpal Wabbit hashes feature names to 
integers internally using Murmur hash v3 
● The downside of hashing is possible 
collisions when there are too many features 
– H(“Nigerian prince”) = H(“job interview”) 
● Though it also decreases the possibility of 
overfitting
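The hashing trick can be sketched as follows. The Python standard library has no Murmur hash, so this example substitutes `zlib.crc32` as a stand-in; `feature_index` and its `bits` parameter are illustrative names, not VW's API.

```python
import zlib


def feature_index(name, bits=18):
    """Hash a feature name into one of 2**bits weight slots."""
    return zlib.crc32(name.encode("utf-8")) & ((1 << bits) - 1)


# With only 2**4 = 16 slots, 17 distinct names must share at least one slot.
names = [f"feature{i}" for i in range(17)]
slots = {feature_index(name, bits=4) for name in names}
print(len(names), "names ->", len(slots), "distinct slots")
```

No dictionary of feature names has to be kept in memory: the weight table has a fixed size, and colliding features simply share a weight.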
Fit the model at night 
● Vowpal Wabbit supports online and active 
learning. 
● Most learning tasks are IO-, not CPU-bound 
● That is, your feature extraction code 
will be the bottleneck.
Supervised Learning 
● Training 
Input → Feature extractor → Features; Features + Label → Machine learning algorithm → Model 
● Prediction 
Input → Feature extractor → Features → Model → Label
Offline vs Online learning 
● In offline learning the model is fed all the input, after 
which it is finalized; the finalized model will be used for 
predictions 
– That is, teach the classifier all kinds of unwanted messages 
before actual use, and use the resulting classifier for 10 years. 
→ Certainly not going to work. 
● In online learning, the model can be used for predictions 
right after the first input 
– The model will gradually converge towards better classification
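A minimal sketch of the online loop: predict first, then learn from the same example. To keep it short, the learner here is a simple perceptron update rather than VW's algorithm, and the stream is made-up data.

```python
def predict(weights, features):
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def learn(weights, features, label, learning_rate=1.0):
    """Perceptron update: adjust the weights only when the example is misclassified."""
    if label * predict(weights, features) <= 0:
        for f, v in features.items():
            weights[f] = weights.get(f, 0.0) + learning_rate * label * v


# Hypothetical message stream: -1 is unwanted, 1 is wanted.
stream = [
    ({"nigerian": 1.0, "prince": 1.0}, -1),
    ({"job": 1.0, "interview": 1.0}, 1),
    ({"nigerian": 1.0, "money": 1.0}, -1),
    ({"interview": 1.0, "invite": 1.0}, 1),
]

weights = {}
mistakes = 0
for features, label in stream:
    guess = 1 if predict(weights, features) > 0 else -1  # usable from the first input
    mistakes += guess != label
    learn(weights, features, label)                       # then improve on it

print(mistakes)  # 1
```

Counting mistakes on each example before learning from it gives a running estimate of accuracy without a separate test set.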
Semisupervised learning – active 
learning 
● Asking for input for classifier is expensive 
– If one asks for labels on all given examples, it is almost 
worse than not asking at all 
● The solution is active learning
Active learning 
● Train only if importance >= threshold 
Input → Feature extractor → Features; Features + Label → Machine learning algorithm → Model 
● Prediction 
Input → Feature extractor → Features → Model → Label + Importance
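The thresholding step can be sketched as below. The importance measure used here (highest when the predicted probability is 0.5) is an assumption made for illustration and is not how VW computes importance; `importance` and `select_for_labeling` are hypothetical names.

```python
import math


def importance(prediction):
    """Hypothetical importance: 1.0 at p = 0.5, near 0.0 when the model is confident."""
    p = 1.0 / (1.0 + math.exp(-prediction))
    return 1.0 - abs(2.0 * p - 1.0)


def select_for_labeling(predictions, threshold=0.5):
    """Return the indices of examples worth sending to a human for a label."""
    return [i for i, x in enumerate(predictions) if importance(x) >= threshold]


predictions = [-4.0, -0.1, 0.0, 0.3, 5.0]
queried = select_for_labeling(predictions)
print(queried)  # [1, 2, 3]
```

Only the uncertain examples reach a human; the confident ones at either end are classified without asking.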
How to use Vowpal Wabbit 
● You can use it on the command line. To teach a 
model using logistic regression: 
% cat train.txt 
-1 |t nigerian prince offers money ... |a user@example.com 
1 |t invite job interview ... |a boss@dreamco.com 
... 
% vw -d train.txt --loss_function=logistic -f model.vw 
● To test 
% vw -i model.vw --loss_function=logistic -p /dev/stdout 
|t nigerian prince interview 
-0.145824 
|t spam ham and eggs |a boss@dreamco.com 
0.134225
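Lines in the format shown in train.txt can be produced from Python with a small helper. A minimal sketch, assuming each namespace maps to a list of tokens; `format_example` is a hypothetical helper, and replacing the characters that act as delimiters in the input format ("|", ":" and whitespace) is a design choice of this sketch.

```python
def sanitize(token):
    """Replace characters that would be read as delimiters in the input line."""
    return "_".join(token.replace("|", "_").replace(":", "_").split())


def format_example(label, namespaces):
    """Build one input line: '<label> |ns feat feat ... |ns feat ...'."""
    parts = [str(label)]
    for namespace, tokens in namespaces.items():
        parts.append("|" + namespace + " " + " ".join(sanitize(t) for t in tokens))
    return " ".join(parts)


line = format_example(
    -1,
    {"t": ["nigerian", "prince", "offers", "money"], "a": ["user@example.com"]},
)
print(line)  # -1 |t nigerian prince offers money |a user@example.com
```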
How to use VW in Python 
● Multiple libraries exist 
– Though none of the APIs are to my liking 
– So I wrote my own 
from caerbannog import Rabbit
Examples in Python
Thanks 
Questions?

Architecture decision records - How not to get lost in the pastPapp Krisztián
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 

Último (20)

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 

Wapid and wobust active online machine leawning with Vowpal Wabbit

  • 1. Wapid and wobust active online machine leawning with Vowpal Wabbit Pycon Finland 2014, Helsinki 2014-10-27 Antti Haapala antti@anttipatterns.com
  • 2. Disclaimer ● IANAS – I Am Not A Statistician ● I researched the principles on how this works for this presentation
  • 3. Why did I start to do ML? ● Task – Receive social media content from various sources – Filter out all messages that are not in English, are press releases or outright spam. ● Easy, when you can hire a team of people for just this task... ● But people are expensive compared to computers... – And filtering messages is tedious work ● Clearly a little machine learning could help us to separate the spam from sausage, eggs and ham.
  • 4. Time to code ● Write a binary classifier ● But with what? – How does one even do it?
  • 5. Libraries: NLTK ● NLTK – Has some pure Python classifier implementations – These algorithms require all data in memory – The speed is an issue here ● Some of them are too slow ● The rest are even slower
  • 6. Libraries: Scikit-Learn ● Scikit-Learn – Better than NLTK – Though most algorithms require all data in memory ● And our data still does not fit – There are some out-of-core algorithms yes, but they're not clearly documented – Still slow - we cannot afford to reevaluate our classifiers for hours...
  • 7. Possible libraries ● How about FANN, Orange, PyMC, PyML, LIBSVM, PyBrain, ffnet, MDP, Shogun toolbox, Theano, mlpy, Elefant, Bayes Blocks, Monte Python, hcluster, Plearn, Pycplex, pymorph....
  • 9. Asking does not hurt “Have you tried using Vowpal Wabbit?” “Vowwhat?” “Vowpal Wabbit”
  • 10. What is Vowpal Wabbit? A research project with the most Pythonic name ever
  • 12. What is Vowpal Wabbit? • John Langford: I'd like to solve AI. • Interviewer: How? • John: I want to use parallel learning algorithms to create fantastic learning machines!
  • 13. What is Vowpal Wabbit? “VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease.”
  • 14. What is Vowpal Wabbit? “Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms.”
  • 15. Built for speed and scalability ● “Plausibly the most scalable public linear learner, and plausibly the most scalable anywhere” ● Excels on the network, though performance is impressive even on a single node.
  • 16. Vowpal Wabbit compared to scikit-learn The algorithms where the cheatsheet says “> 100k samples”
  • 17. Scalability ● Find a good linear predictor $f_w(x) = \sum_i w_i x_i$ – For 2,100,000,000,000 features... – 17,000,000,000 examples... – 16,000,000 parameters... – Using 1,000 nodes... ● Finished in 70 minutes, at 500M features per second ● That was years ago, using the then stock build of VW.
  • 18. Open Source ● Vowpal Wabbit is open source, under BSD license ● Exists even in Ubuntu universe repository ● The project was started by Yahoo Research, currently under Microsoft Research. – So even Windows will be supported...
  • 19. Sparse Stochastic Gradient Descent ● Maps all inputs to n-dimensional space ● And divides the space by one hyperplane minimizing the loss caused by wrong classification – One class is on one side of the plane – The other is on the other side of the plane – The loss is modeled by a loss function
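The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration of stochastic gradient descent with the logistic loss over sparse features, not Vowpal Wabbit's actual implementation; the function names, the learning rate, and the toy examples are assumptions made for the sketch. Labels are -1 and +1, matching the training file format shown later in the talk.

```python
import math
from collections import defaultdict


def predict(weights, features):
    """Raw linear score: the sum of w_i * x_i over the features present."""
    return sum(weights[name] * value for name, value in features.items())


def sgd_step(weights, features, label, learning_rate=0.5):
    """One SGD update for the logistic loss log(1 + exp(-y * score)), y in {-1, +1}."""
    score = predict(weights, features)
    # Derivative of the loss with respect to the score.
    gradient_scale = -label / (1.0 + math.exp(label * score))
    # Only the features present in this example are touched (sparse update).
    for name, value in features.items():
        weights[name] -= learning_rate * gradient_scale * value


# Hypothetical toy data: -1 marks spam, +1 marks wanted messages.
weights = defaultdict(float)
examples = [
    ({"nigerian": 1.0, "prince": 1.0, "money": 1.0}, -1),
    ({"job": 1.0, "interview": 1.0, "invite": 1.0}, +1),
]
for _ in range(20):
    for features, label in examples:
        sgd_step(weights, features, label)
```

Because each update looks at one example only, the weights can be used for prediction at any point during training, which is what makes the approach suitable for online learning.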
  • 20. Stochastic Gradient Descent Image from Scikit-Learn
  • 21. Which loss function for a classifier? ● Crash course in statistics: – “It helps if you understand the data” – “But if you don't then try logistic regression” – Thus go for the logistic loss function
  • 22. Multiclass classifier ● Vowpal Wabbit supports various methods for multiclass classification; read the documentation to see how to use them.
  • 23. Least squares regression ● The gradient descent algorithm can also be used for regression, for example using the “squared” loss function for least squares. ● A regression predicts the real number value for the input that is dependent on the given features ● A classifier gives a class for the input, and possibly the probability for input belonging to that class
  • 24. Classifier output in logistic regression ● With Vowpal Wabbit the prediction value given by a classifier with logistic loss is in range [-50, 50] ● You can map this to a binary probability using the logistic function
  • 25. From prediction to probability $p = \frac{1}{1 + e^{-x}}$
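As a small sketch, the logistic function can be applied to a raw prediction in Python like this. The function name is an assumption for illustration; the input range of [-50, 50] stated on the previous slide is well within what `math.exp` can handle.

```python
import math


def to_probability(prediction):
    """Map a raw prediction x to p = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-prediction))


# A raw prediction of 0 sits exactly on the decision boundary.
boundary = to_probability(0.0)  # 0.5
```

Negative predictions map to probabilities below 0.5 and positive predictions to probabilities above 0.5.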
  • 26. Common practices of machine learning ● Reduce the number of features by hand guessing which features are relevant ● Use non-linear approaches such as the kernel trick ● Map your features to integers ● Leave your computer on at night to build the model from your training data
  • 27. ... become don'ts ● Reduce the number of features by hand guessing which features are relevant ● Use non-linear approaches such as the kernel trick ● Map your features to integers ● Leave your workstation on at night to build the model from your training data
  • 28. Reduce the number of features ● Vowpal Wabbit can handle sparse featuresets having millions of features efficiently
  • 29. Use non-linear approaches ● A sparse dataset with many dimensions yields results comparable to using fewer features with kernel tricks ● One can ask Vowpal Wabbit to generate new features as the Cartesian product of existing features, using namespaces: – That is, given features u^a, u^b, v^c, and v^d, by using command line parameter -q uv, VW can make u^a^v^c and so forth.
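What the Cartesian product of two namespaces looks like can be sketched in plain Python. This only illustrates the idea behind `-q uv` using the slide's `u^a^v^c` notation; the function name and data layout are assumptions, and VW builds these features internally rather than as strings.

```python
from itertools import product


def quadratic_features(namespaces, first, second):
    """Cross every feature in namespace `first` with every feature in `second`."""
    return [
        f"{first}^{a}^{second}^{b}"
        for a, b in product(namespaces[first], namespaces[second])
    ]


# The slide's example: features a and b in namespace u, c and d in namespace v.
namespaces = {"u": ["a", "b"], "v": ["c", "d"]}
crossed = quadratic_features(namespaces, "u", "v")
# crossed == ["u^a^v^c", "u^a^v^d", "u^b^v^c", "u^b^v^d"]
```

Two namespaces with m and n features produce m × n crossed features, which is why this works best when each example is sparse.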
  • 30. Map your features to integers ● Vowpal Wabbit hashes feature names to integers internally using Murmur hash v3 ● The downside of hashing is the possibility of collisions when there are too many features – H(“Nigerian prince”) = H(“job interview”) ● Though it also decreases the possibility of overfitting
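The hashing trick itself is easy to sketch. The example below uses `zlib.crc32` from the standard library as a stand-in hash, because Murmur hash v3 is not in the Python standard library; the function names and the 18-bit table size are illustrative assumptions, not a description of VW's internals.

```python
import zlib


def hashed_index(feature_name, bits=18):
    """Map a feature name to a slot in a table of 2**bits weights."""
    # crc32 is a stand-in here; the slide says VW uses Murmur hash v3.
    return zlib.crc32(feature_name.encode("utf-8")) & ((1 << bits) - 1)


def hash_features(tokens, bits=18):
    """Count tokens per slot; names that collide share one slot."""
    slots = {}
    for token in tokens:
        index = hashed_index(token, bits)
        slots[index] = slots.get(index, 0.0) + 1.0
    return slots


sparse = hash_features(["nigerian", "prince", "offers", "money"])
```

With fewer bits the table shrinks and collisions become more likely: with `bits=1` there are only two slots, so any three distinct features must share at least one.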
  • 31. Fit the model at night ● Vowpal Wabbit supports online and active learning. ● Most learning tasks are IO-bound, not CPU-bound ● That is, your feature extraction code will be the bottleneck.
  • 32. Supervised Learning [Diagram] ● Training: Label, Input, Feature extractor, Features, Machine Learning algorithm ● Prediction: Input, Feature extractor, Features, Model, Label
  • 33. Offline vs Online learning ● In offline learning the model is fed all the input, after which it is finalized; the finalized model will be used for predictions – That is, teach the classifier all kinds of unwanted messages before actual use, and use the resulting classifier for 10 years. → Certainly not going to work. ● In online learning, the model can be used for predictions right after the first input – The model will gradually converge towards better classification
  • 34. Semisupervised learning – active learning ● Asking for input for the classifier is expensive – If one asks to label all given examples, it is almost worse than not asking at all ● The solution is active learning
  • 35. Active learning ● Train only if importance >= threshold [Diagram] ● Training: Label, Input, Feature extractor, Features, Machine Learning algorithm ● Prediction: Input, Feature extractor, Features, Model, Label, Importance
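The "train only if importance >= threshold" rule can be illustrated with a deliberately simplified sketch. Here the importance is taken to be how close the predicted probability is to 0.5; this uncertainty measure, the function name, and the default threshold are all assumptions for illustration and are not how Vowpal Wabbit computes importance.

```python
import math


def should_query_label(prediction, threshold=0.2):
    """Ask a human for a label only when the model is unsure about the input."""
    probability = 1.0 / (1.0 + math.exp(-prediction))
    # Hypothetical importance: 1.0 at p = 0.5, falling to 0.0 as p nears 0 or 1.
    importance = 1.0 - abs(2.0 * probability - 1.0)
    return importance >= threshold


unsure = should_query_label(0.0)      # on the decision boundary
confident = should_query_label(10.0)  # far from the decision boundary
```

Only the examples for which this returns `True` would be sent to a person for labelling and then used for training, which keeps the labelling cost down.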
  • 36. How to use Vowpal Wabbit ● You can use it on the command line. To teach a model using logistic regression:
      % cat train.txt
      -1 |t nigerian prince offers money ... |a user@example.com
      1 |t invite job interview ... |a boss@dreamco.com
      ...
      % vw -d train.txt --loss_function=logistic -f model.vw
    ● To test:
      % vw -i model.vw --loss_function=logistic -p /dev/stdout
      |t nigerian prince interview
      -0.145824
      |t spam ham and eggs |a boss@dreamco.com
      0.134225
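Lines in the `label |t ... |a ...` layout used in this example can be produced from Python with a small helper. This is a sketch under assumptions: the function names are hypothetical, the namespaces `t` and `a` are taken from the slide, and tokens are assumed to contain no whitespace. The characters `:` and `|` have special meaning in VW's input format, so the sketch replaces them.

```python
def sanitize(token):
    """Replace characters that are special in the VW input format."""
    return token.replace(":", "_").replace("|", "_")


def vw_line(label, tokens, author=None):
    """Format one example; pass label=None for an unlabelled test example."""
    parts = []
    if label is not None:
        parts.append(str(label))
    parts.append("|t " + " ".join(sanitize(token) for token in tokens))
    if author:
        parts.append("|a " + sanitize(author))
    return " ".join(parts)


train_line = vw_line(-1, ["nigerian", "prince", "offers", "money"], "user@example.com")
test_line = vw_line(None, ["spam", "ham", "and", "eggs"])
```

Writing one such line per message to a file gives input that can be passed to `vw -d`, as in the command above.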
  • 37. How to use VW in Python ● Multiple libraries exist – Though none of the APIs are to my liking – So I wrote my own:
      from caerbannog import Rabbit