Scikit-learn for easy machine learning:
the vision, the tool, and the project
Gaël Varoquaux
scikit
machine learning in Python
1 Scikit-learn: the vision
An enabler
Machine learning
for everybody and
for everything
Machine learning
without learning the
machinery
G Varoquaux 2
Machine learning in a nutshell
Machine learning is about making predictions from data
G Varoquaux 3
1 Machine learning: a historical perspective
Artificial Intelligence The 80s
Building decision rules
Eatable?
Mobile?
Tall?
Machine learning The 90s
Learn these from observations
Statistical learning 2000s
Model the noise in the observations
Big data today
Many observations,
simple rules
“Big data isn’t actually interesting without machine
learning”
Steve Jurvetson, VC, Silicon Valley
G Varoquaux 4
1 Machine learning in a nutshell: an example
Face recognition
Andrew Bill Charles Dave
?
G Varoquaux 5
1 Machine learning in a nutshell
A simple method:
1 Store all the known (noisy) images and the names
that go with them.
2 From a new (noisy) image, find the image that is
most similar.
“Nearest neighbor” method
How many errors on already-known images?
... 0: no errors
Test data ≠ Train data
G Varoquaux 6
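A minimal sketch of the point above, assuming scikit-learn is installed and using synthetic noisy data in place of the face images (the slides do not name a dataset). A 1-nearest-neighbor model scored on the samples it has stored finds each sample as its own neighbor, so only the score on held-out samples is informative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic data with 20% label noise stands in for the images.
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Store all the known samples": fitting 1-nearest neighbor does exactly that.
nn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_score = nn.score(X_train, y_train)  # already-known samples
test_score = nn.score(X_test, y_test)     # samples never seen before
print(train_score, test_score)
```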
1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
x
y
x
y
Which model to prefer?
G Varoquaux 7
1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
x
y
x
y
Problem of “over-fitting”
Minimizing error is not always the best strategy
(learning noise)
Test data ≠ train data
G Varoquaux 7
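The over-fitting problem can be sketched with NumPy alone. The data below is an assumption made for illustration (a straight line plus noise); the degree-9 polynomial passes through the 10 training points, so its training error is the smaller one, while its error on fresh data is what matters.

```python
import numpy as np

rng = np.random.RandomState(0)

def noisy_line(n):
    # Assumption: the true relation is a straight line plus Gaussian noise.
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = noisy_line(10)
x_test, y_test = noisy_line(1000)

def errors(degree):
    # Least-squares polynomial fit, then mean squared error on both sets.
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_error = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_error = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_error, test_error

train_simple, test_simple = errors(1)    # simple model
train_complex, test_complex = errors(9)  # as many parameters as data points
print(train_simple, test_simple, train_complex, test_complex)
```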
1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
x
y
x
y
Prefer simple models
= concept of “regularization”
Balance the number of parameters to learn
with the amount of data
Bias-variance tradeoff
G Varoquaux 7
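One way to see regularization at work, as a sketch under assumptions: the data is synthetic, ridge regression is only one of several regularized models, and alpha=10 is an arbitrary choice. With few samples and many parameters, the penalty shrinks the learned coefficients toward a simpler model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n_features = 25
true_coef = np.zeros(n_features)
true_coef[:3] = [1.0, -2.0, 0.5]  # assumption: only 3 features matter

def make_data(n_samples):
    X = rng.normal(size=(n_samples, n_features))
    y = X @ true_coef + rng.normal(scale=1.0, size=n_samples)
    return X, y

X_train, y_train = make_data(30)    # few samples for 25 parameters
X_test, y_test = make_data(1000)

ols = LinearRegression().fit(X_train, y_train)   # no regularization
ridge = Ridge(alpha=10.0).fit(X_train, y_train)  # penalized coefficients

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
# Compare both models on data they were not trained on.
print(ols.score(X_test, y_test), ridge.score(X_test, y_test))
```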
1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
x
y
Two descriptors:
2 dimensions
X_1
X_2
y
More parameters
⇒ need more data
“curse of dimensionality”
G Varoquaux 7
1 Machine learning in a nutshell: classification
Example:
recognizing hand-written digits
Represent with 2 numerical features
It’s about finding
separating lines
G Varoquaux 8
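A possible sketch of "finding separating lines", with several assumptions that are not in the slides: only the digits 0 and 1 are kept, the 2 numerical features are the first 2 principal components, and logistic regression is the linear classifier.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: two classes only, so that a single line can separate them.
X, y = load_digits(n_class=2, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumption: the 2 features are the first 2 principal components.
pca = PCA(n_components=2).fit(X_train)
clf = LogisticRegression().fit(pca.transform(X_train), y_train)

# The separating line is w1 * X1 + w2 * X2 + b = 0.
(w1, w2), b = clf.coef_[0], clf.intercept_[0]
accuracy = clf.score(pca.transform(X_test), y_test)
print(w1, w2, b, accuracy)
```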
1 Machine learning in a nutshell: models
1 staircase
2 staircases combined
3 staircases combined
300 staircases combined
Fit with a staircase of 10 constant values
Fit a new staircase on errors
Keep going
Boosted regression trees
Complexity trade-offs
Computational + statistical
G Varoquaux 9
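The staircase procedure can be written out by hand, as a sketch: a regression tree with 10 leaves is a staircase of 10 constant values, and each new tree is fit on the errors left by the previous ones. The noisy sine data is an assumption, and this loop omits the learning rate that scikit-learn's GradientBoostingRegressor applies.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
# Assumption: a noisy sine curve as the data to fit.
x = np.sort(rng.uniform(0, 10, size=200))
X = x[:, np.newaxis]
y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)

def boost(n_staircases):
    prediction = np.zeros_like(y)
    train_errors = []
    for _ in range(n_staircases):
        # A tree with 10 leaves: a staircase of 10 constant values.
        staircase = DecisionTreeRegressor(max_leaf_nodes=10, random_state=0)
        staircase.fit(X, y - prediction)     # fit on the current errors
        prediction += staircase.predict(X)   # combine with previous ones
        train_errors.append(np.mean((y - prediction) ** 2))
    return train_errors

train_errors = boost(300)
print(train_errors[0], train_errors[2], train_errors[-1])
```

The training error keeps shrinking as staircases are added; the statistical side of the trade-off only shows up when the error is measured on held-out data.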
1 Machine learning in a nutshell: unsupervised
Stock market structure
Unlabeled data
more common than labeled data
G Varoquaux 10
Machine learning
Mathematics and algorithms for fitting predictive models
Regression
x
y
Classification
Unsupervised...
Notions of overfit, test error
regularization, model complexity
G Varoquaux 11
Machine learning is everywhere
Image recognition
Marketing (click-through rate)
Movie / music recommendation
Medical data
Logistics chains (e.g. supermarkets)
Language translation
Detecting industrial failures
G Varoquaux 12
Why another machine learning package?
G Varoquaux 13
Real statisticians use R
And real astronomers use IRAF
Real economists use Gauss
Real coders use C assembler
Real experiments are controlled in Labview
Real Bayesians use BUGS stan
Real text processing is done in Perl
Real deep learning is best done with Torch (Lua)
And medical doctors only trust SPSS
G Varoquaux 14
1 My stack
Python, what else?
General purpose
Interactive language
Easy to read / write
G Varoquaux 15
1 My stack
The scientific Python stack
numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared with C / fortran
G Varoquaux 15
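A short sketch of what these points mean in practice: a NumPy array is a typed block of numbers plus a shape, with nothing attached to rows or columns, and its memory layout can follow either the C or the Fortran convention.

```python
import numpy as np

# A 2D array: a block of floats and a shape, no column names or units.
X = np.arange(12, dtype=np.float64).reshape(3, 4)

n_rows, n_cols = X.shape
is_c_ordered = X.flags["C_CONTIGUOUS"]   # row-major, the C convention

# The same values stored column-major, the Fortran convention.
X_fortran = np.asfortranarray(X)
is_f_ordered = X_fortran.flags["F_CONTIGUOUS"]
print(X.dtype, X.shape, is_c_ordered, is_f_ordered)
```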
1 My stack
The scientific Python stack
numpy arrays
Connecting to
scipy
scikit-image
pandas
...
It’s about plugging things
together
Being Pythonic and
SciPythonic
G Varoquaux 15
1 scikit-learn vision
Machine learning for all
No specific application domain
No requirements in machine learning
High-quality Pythonic software library
Interfaces designed for users
Community-driven development
BSD licensed, very diverse contributors
http://scikit-learn.org
G Varoquaux 16
1 Between research and applications
Machine learning research
Conceptual complexity is not an issue
New and bleeding edge is better
Simple problems are old science
In the field
Tried and tested (aka boring) is good
Little sophistication from the user
API is more important than maths
Solving simple problems matters
Solving them really well matters a lot
G Varoquaux 17
2 Scikit-learn: the tool
A Python library for machine learning
© Theodore W. Gray
G Varoquaux 18
2 A Python library
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
As easy as py
from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)
G Varoquaux 19
2 API: specifying a model
A central concept: the estimator
Instantiated without data
But specifying the parameters
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier(n_neighbors=2)
G Varoquaux 20
2 API: training a model
Training from data
estimator.fit(X_train, Y_train)
with:
X a numpy array with shape
n_samples × n_features
y a numpy 1D array, of ints or floats, with shape
n_samples
G Varoquaux 21
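As an illustration of the expected shapes, here is a sketch with hypothetical toy data (4 samples, 2 features) and a k-nearest-neighbors classifier as the estimator.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 4 samples described by 2 features each.
X_train = np.array([[0.0, 0.1],
                    [0.2, 0.0],
                    [1.0, 0.9],
                    [0.9, 1.1]])   # shape (n_samples, n_features)
Y_train = np.array([0, 0, 1, 1])   # shape (n_samples,)

# Parameters are given at instantiation, data only at fit time.
estimator = KNeighborsClassifier(n_neighbors=2)
fitted = estimator.fit(X_train, Y_train)  # fit returns the estimator itself
print(X_train.shape, Y_train.shape)
```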
2 API: using a model
Prediction: classification, regression
Y_test = estimator.predict(X_test)
Transforming: dimension reduction, filter
X_new = estimator.transform(X_test)
Test score, density estimation
test_score = estimator.score(X_test)
G Varoquaux 22
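The three methods can be tried side by side. This is a sketch on hypothetical data (two Gaussian blobs); the choice of KNeighborsClassifier, PCA and KernelDensity as the estimators is an assumption made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity, KNeighborsClassifier

rng = np.random.RandomState(0)
# Hypothetical data: two Gaussian blobs in 5 dimensions.
X_train = np.vstack([rng.normal(0, 1, size=(50, 5)),
                     rng.normal(4, 1, size=(50, 5))])
Y_train = np.repeat([0, 1], 50)
X_test = np.vstack([rng.normal(0, 1, size=(10, 5)),
                    rng.normal(4, 1, size=(10, 5))])

# Prediction: one label per test sample.
classifier = KNeighborsClassifier(n_neighbors=2).fit(X_train, Y_train)
Y_test = classifier.predict(X_test)

# Transforming: dimension reduction from 5 features to 2.
X_new = PCA(n_components=2).fit(X_train).transform(X_test)

# Density estimation: score is the total log-likelihood of the test data.
test_score = KernelDensity().fit(X_train).score(X_test)
print(Y_test.shape, X_new.shape, test_score)
```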
2 Vectorizing
From raw data to a sample matrix X
For text data: counting word occurrences
- Input data: list of documents (string)
- Output data: numerical matrix
from sklearn.feature_extraction.text import HashingVectorizer
hasher = HashingVectorizer()
X = hasher.fit_transform(documents)
G Varoquaux 23
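To make the word counting visible, the sketch below uses three hypothetical documents and adds CountVectorizer, which is not on the slide: unlike the hashing variant, it stores a vocabulary, so a column can be traced back to a word.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Hypothetical documents, for illustration only.
documents = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# CountVectorizer learns a vocabulary: one column per distinct word.
counter = CountVectorizer()
X_counts = counter.fit_transform(documents)
the_column = counter.vocabulary_["the"]
the_counts = X_counts[:, the_column].toarray().ravel()

# HashingVectorizer hashes words to columns and stores no vocabulary.
hasher = HashingVectorizer(n_features=2 ** 10)
X_hashed = hasher.fit_transform(documents)
print(X_counts.shape, the_counts, X_hashed.shape)
```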
2 Scikit-learn: very rich feature set
Supervised learning
Decision trees (Random-Forest, Boosted Tree)
Linear models
SVM
Unsupervised Learning
Clustering
Dictionary learning
Outlier detection
Model selection
Built in cross-validation
Parameter optimization
G Varoquaux 24
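The model-selection tools might be used as in the sketch below. It assumes a recent scikit-learn release, where they live in sklearn.model_selection, and the parameter grid is arbitrary, chosen only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Built-in cross-validation: one score per fold.
scores = cross_val_score(SVC(), X, y, cv=3)

# Parameter optimization: every (C, gamma) pair is cross-validated.
# Assumption: this small grid is arbitrary.
param_grid = {"C": [1, 10], "gamma": [0.001, 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
print(scores, best_params, search.best_score_)
```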
2 Computational performance
             scikit-learn  mlpy   pybrain  pymvpa  mdp    shogun
SVM          5.2           9.47   17.5     11.52   40.48  5.63
LARS         1.17          105.3  -        37.35   -      -
Elastic Net  0.52          73.7   -        1.44    -      -
kNN          0.57          1.41   -        0.56    0.58   1.36
PCA          0.18          -      -        8.93    0.47   0.33
k-Means      1.34          0.79   ∞        -       35.75  0.68
Algorithmic optimizations
Minimizing data copies
Random Forest fit time (s), bar chart by library:
Scikit-Learn-RF: 203.01
Scikit-Learn-ETs: 211.53
OpenCV-RF: 4464.65
OpenCV-ETs: 3342.83
OK3-RF: 1518.14
OK3-ETs: 1711.94
Weka-RF: 1027.91
R-RF: 13427.06
Orange-RF: 10941.72
Implementation languages: Scikit-Learn (Python, Cython), OpenCV (C++), OK3 (C), Weka (Java), randomForest (R, Fortran), Orange (Python)
Figure: Gilles Louppe
G Varoquaux 25
What if the data does not fit in memory?
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
G Varoquaux 26
2 On-line algorithms
estimator.partial_fit(X_train, Y_train)
Linear models
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier
Clustering
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch (new in 0.16)
PCA (new in 0.16)
sklearn.decomposition.IncrementalPCA
G Varoquaux 27
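An on-line fit could look like the sketch below. The data_chunks generator is a hypothetical stand-in for reading successive chunks from disk, and the data it produces is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

def data_chunks(n_chunks, chunk_size=100):
    # Hypothetical helper: stands in for loading one chunk at a time.
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

estimator = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])
for X_chunk, y_chunk in data_chunks(50):
    # classes lists every label, since one chunk may not contain them all.
    estimator.partial_fit(X_chunk, y_chunk, classes=all_classes)

# Evaluate on a fresh chunk that was never used for fitting.
X_test, y_test = next(data_chunks(1, chunk_size=1000))
accuracy = estimator.score(X_test, y_test)
print(accuracy)
```

Only one chunk is held in memory at a time, which is the point of the pattern.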
2 On-the-fly data reduction
Many features
⇒ Reduce the data as it is loaded
X_small = estimator.transform(X_big, y)
G Varoquaux 28
2 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
G Varoquaux 28
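The first two reductions can be sketched on hypothetical random data with 1000 features. The target of 50 components is arbitrary, and SparseRandomProjection is one of the transformers in sklearn.random_projection, picked here as an assumption.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X_big = rng.normal(size=(100, 1000))  # hypothetical data: 1000 features

# Random projections: each new feature is a random linear combination.
projection = SparseRandomProjection(n_components=50, random_state=0)
X_projected = projection.fit_transform(X_big)

# Feature clustering: each new feature is the mean of a cluster of features.
agglomeration = FeatureAgglomeration(n_clusters=50)
X_agglomerated = agglomeration.fit_transform(X_big)
print(X_projected.shape, X_agglomerated.shape)
```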
3 Scikit-learn: the project
G Varoquaux 29
3 Having an impact
1% of Debian installs
1200 job offers on Stack Overflow
G Varoquaux 30
3 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
∼ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
G Varoquaux 31
3 Many eyes make code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
G Varoquaux 32
3 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
G Varoquaux 33
3 Quality assurance
Code review: pull requests
Can include newcomers
We read each other's code
Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are names meaningful?
- Are the numerics stable?
- Could it be faster?
G Varoquaux 34
3 Quality assurance
Unit testing
Everything is tested
Great for numerics
Overall tests enforce on all estimators
- consistency with the API
- basic invariances
- good handling of various inputs
If it ain’t tested
it’s broken
G Varoquaux 35
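A test enforced on all estimators might look like the sketch below. The helper check_api_consistency is hypothetical and far smaller than scikit-learn's own common tests; it only illustrates the three kinds of checks listed above on a handful of classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)

def check_api_consistency(estimator):
    # Hypothetical helper, not part of scikit-learn.
    # Consistency with the API: fit returns the estimator itself.
    assert estimator.fit(X, y) is estimator
    # Basic invariance: one prediction per sample, among the known labels.
    predictions = estimator.predict(X)
    assert predictions.shape == (X.shape[0],)
    assert set(predictions) <= set(y)
    # Handling of various inputs: a list of lists gives the same result.
    assert np.array_equal(estimator.predict(X.tolist()), predictions)
    return True

estimators = (LogisticRegression(), KNeighborsClassifier(),
              DecisionTreeClassifier())
results = [check_api_consistency(est) for est in estimators]
print(results)
```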
Make it work, make it right, make it boring
G Varoquaux 36
3 The tragedy of the commons
Individuals, acting independently and rationally according
to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
+ It’s so hard to scale
User support
Growing codebase
G Varoquaux 37
@GaelVaroquaux
Scikit-learn
The vision
Machine learning as a means not an end
Versatile library: the “right” level of abstraction
Close to research, but seeking different tradeoffs
The tool
Simple API uniform across learners
Numpy matrices as data containers
Reasonably fast
The project
Many people working together
Tests and discussions for quality
We’re hiring!

More Related Content

What's hot

Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsGael Varoquaux
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesGael Varoquaux
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomesGael Varoquaux
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Pooyan Jamshidi
 
Data-driven Hypothesis Management
Data-driven Hypothesis ManagementData-driven Hypothesis Management
Data-driven Hypothesis Managementbgoncalves2
 
Efficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive WindowsEfficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive WindowsAlbert Bifet
 
A Short Course in Data Stream Mining
A Short Course in Data Stream MiningA Short Course in Data Stream Mining
A Short Course in Data Stream MiningAlbert Bifet
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsAlbert Bifet
 
Learning stochastic neural networks with Chainer
Learning stochastic neural networks with ChainerLearning stochastic neural networks with Chainer
Learning stochastic neural networks with ChainerSeiya Tokui
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data ScienceAlbert Bifet
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorchMayur Bhangale
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developersAbdul Muneer
 
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data StreamsMining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data StreamsAlbert Bifet
 
Introduction to behavior based recommendation system
Introduction to behavior based recommendation systemIntroduction to behavior based recommendation system
Introduction to behavior based recommendation systemKimikazu Kato
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 

What's hot (20)

Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizations
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questions
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
 
Data-driven Hypothesis Management
Data-driven Hypothesis ManagementData-driven Hypothesis Management
Data-driven Hypothesis Management
 
Efficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive WindowsEfficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive Windows
 
A Short Course in Data Stream Mining
A Short Course in Data Stream MiningA Short Course in Data Stream Mining
A Short Course in Data Stream Mining
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data Streams
 
Learning stochastic neural networks with Chainer
Learning stochastic neural networks with ChainerLearning stochastic neural networks with Chainer
Learning stochastic neural networks with Chainer
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorch
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data StreamsMining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
 
Deep Learning in theano
Deep Learning in theanoDeep Learning in theano
Deep Learning in theano
 
Introduction to behavior based recommendation system
Introduction to behavior based recommendation systemIntroduction to behavior based recommendation system
Introduction to behavior based recommendation system
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 

Viewers also liked

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnSarah Guido
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanChetan Khatri
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learnYoss Cohen
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPôle Systematic Paris-Region
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learnodsc
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnArnaud Joly
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnKan Ouivirach, Ph.D.
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnMatt Hagy
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learnJeff Klukas
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learnQingkai Kong
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017Francesco Mosconi
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnAsim Jalis
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learnAWeber
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnAWeber
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/CategorizationOswal Abhishek
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnGilles Louppe
 

Viewers also liked (20)

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
Scikit-learn for easy machine learning: the vision, the tool, and the project

  • 1. Scikit-learn for easy machine learning: the vision, the tool, and the project. Gaël Varoquaux. scikit-learn: machine learning in Python
  • 2. 1 Scikit-learn: the vision
  • 3. 1 Scikit-learn: the vision. An enabler
  • 4. 1 Scikit-learn: the vision. An enabler. Machine learning for everybody and for everything. Machine learning without learning the machinery
  • 5. Machine learning in a nutshell. Machine learning is about making predictions from data
  • 6. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules (Eatable? Mobile? Tall?)
  • 7. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations
  • 8. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations. Statistical learning (2000s): model the noise in the observations
  • 9. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations. Statistical learning (2000s): model the noise in the observations. Big data (today): many observations, simple rules
  • 10. 1 Machine learning: a historical perspective. Artificial intelligence (the 80s): building decision rules. Machine learning (the 90s): learn these from observations. Statistical learning (2000s): model the noise in the observations. Big data (today): many observations, simple rules. “Big data isn’t actually interesting without machine learning” (Steve Jurvetson, VC, Silicon Valley)
  • 11. 1 Machine learning in a nutshell: an example. Face recognition: Andrew, Bill, Charles, Dave
  • 12. 1 Machine learning in a nutshell: an example. Face recognition: Andrew, Bill, Charles, Dave, and an unknown face (?)
  • 13. 1 Machine learning in a nutshell. A simple method: (1) store all the known (noisy) images and the names that go with them; (2) from a new (noisy) image, find the known image that is most similar. This is the “nearest neighbor” method
  • 14. 1 Machine learning in a nutshell. A simple method: (1) store all the known (noisy) images and the names that go with them; (2) from a new (noisy) image, find the known image that is most similar. This is the “nearest neighbor” method. How many errors on already-known images? 0: no errors. Test data ≠ train data
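The nearest-neighbor recipe on the two slides above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "images" are hypothetical random 64-pixel vectors standing in for faces, and the helper names `nearest_neighbor_predict` and `noisy_images` are made up for this sketch. It counts errors once on the already-known images and once on fresh noisy images.

```python
import numpy as np

def nearest_neighbor_predict(X_known, names_known, X_new):
    """Return, for each row of X_new, the name of the closest row of X_known."""
    # Squared Euclidean distance between every new sample and every known sample
    diffs = X_new[:, np.newaxis, :] - X_known[np.newaxis, :, :]
    distances = (diffs ** 2).sum(axis=2)
    return names_known[distances.argmin(axis=1)]

rng = np.random.default_rng(0)
names = np.array(["Andrew", "Bill", "Charles", "Dave"])
# Hypothetical stand-in for face images: one clean 64-pixel template per person
templates = rng.normal(size=(4, 64))

def noisy_images(n_per_person, noise=4.0):
    """Draw noisy copies of each template, with the matching names."""
    labels = np.repeat(names, n_per_person)
    images = np.repeat(templates, n_per_person, axis=0)
    images = images + rng.normal(scale=noise, size=images.shape)
    return images, labels

X_train, y_train = noisy_images(10)   # the stored, already-known images
X_test, y_test = noisy_images(10)     # new images of the same four people

train_pred = nearest_neighbor_predict(X_train, y_train, X_train)
test_pred = nearest_neighbor_predict(X_train, y_train, X_test)
train_errors = int((train_pred != y_train).sum())
test_errors = int((test_pred != y_test).sum())
print("errors on already-known images:", train_errors)
print("errors on new images:", test_errors)
```

Each stored image is its own nearest neighbor, so the count on already-known images is zero by construction. Only the count on new images says anything about how well the method recognizes faces.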
  • 15. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension [plot: y vs. x]
  • 16. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension [two plots: y vs. x]. Which model to prefer?
  • 17. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension [two plots: y vs. x]. Problem of “over-fitting”: minimizing error is not always the best strategy (it learns the noise). Test data ≠ train data
  • 18. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension [two plots: y vs. x]. Prefer simple models = concept of “regularization”. Balance the number of parameters to learn with the amount of data
  • 19. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension [two plots: y vs. x]. Prefer simple models = concept of “regularization”. Balance the number of parameters to learn with the amount of data. Bias-variance tradeoff
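The over-fitting point made on the regression slides above can be illustrated with a small NumPy sketch. The data-generating process (a straight line plus Gaussian noise), the sample sizes, and the two polynomial degrees are assumptions chosen for illustration and do not come from the slides. The sketch fits a simple and a flexible model on the same 15 points and compares the error on those points with the error on fresh data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n_samples):
    # Assumed ground truth for illustration: a straight line plus Gaussian noise
    x = rng.uniform(0.0, 1.0, size=n_samples)
    y = 2.0 * x + rng.normal(scale=0.3, size=n_samples)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 9):
    # Least-squares polynomial fit: degree 1 has 2 parameters, degree 9 has 10
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The degree-9 fit can never do worse than the straight line on the training points, because the straight line is one of the models it is allowed to pick. That is why training error alone cannot be used to choose between the two; the comparison has to be made on data that was not used for fitting.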
  • 20. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension [plot: y vs. x]. Two descriptors: 2 dimensions [plot: y vs. X_1 and X_2]. More parameters
  • 21. 1 Machine learning in a nutshell: regression. A single descriptor: one dimension [plot: y vs. x]. Two descriptors: 2 dimensions [plot: y vs. X_1 and X_2]. More parameters ⇒ need more data: the “curse of dimensionality”
  • 22. 1 Machine learning in a nutshell: classification. Example: recognizing hand-written digits
  • 23. 1 Machine learning in a nutshell: classification. Example: recognizing hand-written digits. Represent with 2 numerical features [plot: X1 vs. X2]
  • 24. 1 Machine learning in a nutshell: classification [plot: X1 vs. X2]
  • 25. 1 Machine learning in a nutshell: classification [plot: X1 vs. X2]. It’s about finding separating lines
  • 26. 1 Machine learning in a nutshell: models. Fit with a staircase of 10 constant values [plot: 1 staircase]
  • 27. 1 Machine learning in a nutshell: models. Fit with a staircase of 10 constant values. Fit a new staircase on errors [plots: 1 staircase; 2 staircases combined]
  • 28. 1 Machine learning in a nutshell: models. Fit with a staircase of 10 constant values. Fit a new staircase on errors. Keep going [plots: 1 staircase; 2 staircases combined; 3 staircases combined]
  • 29. 1 Machine learning in a nutshell: models. Fit with a staircase of 10 constant values. Fit a new staircase on errors. Keep going: boosted regression trees [plots: 1 staircase; 2 staircases combined; 3 staircases combined; 300 staircases combined]
  • 30. 1 Machine learning in a nutshell: models. Fit with a staircase of 10 constant values. Fit a new staircase on errors. Keep going: boosted regression trees [plots: 1 staircase; 2 staircases combined; 3 staircases combined; 300 staircases combined]. Complexity trade-offs: computational + statistical
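The "fit a new staircase on the errors, keep going" recipe from the slides above can be sketched by hand with scikit-learn's `DecisionTreeRegressor`. The noisy sine data and the helper names `fit_boosted_staircases` and `predict_boosted` are assumptions made for this illustration. It is a bare-bones version of boosting on residuals, not the library's own implementation (scikit-learn provides `GradientBoostingRegressor` for real use).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical 1D regression problem: a noisy sine wave
X = np.sort(rng.uniform(0.0, 6.0, size=200)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=200)

def fit_boosted_staircases(X, y, n_staircases):
    """Fit each new staircase on the errors left by the previous ones."""
    staircases = []
    residual = y.copy()
    for _ in range(n_staircases):
        # A tree with at most 10 leaves is a staircase of at most 10 constant values
        tree = DecisionTreeRegressor(max_leaf_nodes=10)
        tree.fit(X, residual)
        residual = residual - tree.predict(X)
        staircases.append(tree)
    return staircases

def predict_boosted(staircases, X):
    """The combined model is the sum of all the staircases."""
    return sum(tree.predict(X) for tree in staircases)

train_mse = {}
for n in (1, 2, 3, 30):
    model = fit_boosted_staircases(X, y, n)
    train_mse[n] = float(np.mean((predict_boosted(model, X) - y) ** 2))
    print(f"{n} staircase(s) combined: train MSE {train_mse[n]:.4f}")
```

Each added staircase can only lower the error on the training points, so the number of staircases acts as a complexity knob. More staircases cost more computation and, past some point, start fitting the noise, which is the computational and statistical trade-off named on the last slide of this group.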
  • 31. 1 Machine learning in a nutshell: unsupervised Stock market structure G Varoquaux 10
  • 32. 1 Machine learning in a nutshell: unsupervised Stock market structure Unlabeled data more common than labeled data G Varoquaux 10
  • 33. Machine learning Mathematics and algorithms for fitting predictive models Regression x y Classification Unsupervised... Notions of overfit, test error regularization, model complexity G Varoquaux 11
  • 34. Machine learning is everywhere Image recognition Marketing (click-through rate) Movie / music recommendation Medical data Logistic chains (eg supermarkets) Language translation Detecting industrial failures G Varoquaux 12
  • 35. Why another machine learning package? G Varoquaux 13
  • 36. Real statisticians use R And real astronomers use IRAF Real economists use Gauss Real coders use C assembler Real experiments are controlled in Labview Real Bayesians use BUGS stan Real text processing is done in Perl Real Deep learner is best done with torch (Lua) And medical doctors only trust SPSS G Varoquaux 14
  • 37. 1 My stack Python, what else? General purpose Interactive language Easy to read / write G Varoquaux 15
  • 38. 1 My stack The scientific Python stack numpy arrays Mostly a float** No annotation / structure Universal across applications Easily shared with C / fortran 03878794797927 01790752701578 94071746124797 54970718717887 13653490495190 74754265358098 48721546349084 90345673245614 57187745620 03878794797927 01790752701578 94071746124797 54970718717887 13653490495190 74754265358098 48721546349084 90345673245614 187745620 G Varoquaux 15
  • 39. 1 My stack: the scientific Python stack. numpy arrays, connecting to scipy, scikit-image, pandas, ... It’s about plugging things together.
  • 40. 1 My stack The scientific Python stack numpy arrays Connecting to scipy scikit-image pandas ... Being Pythonic and SciPythonic
  • 41. 1 scikit-learn vision Machine learning for all No specific application domain No requirements in machine learning High-quality Pythonic software library Interfaces designed for users Community-driven development BSD licensed, very diverse contributors http://scikit-learn.org
  • 42. 1 Between research and applications Machine learning research Conceptual complexity is not an issue New and bleeding edge is better Simple problems are old science In the field Tried and tested (aka boring) is good Little sophistication from the user API is more important than maths Solving simple problems matters Solving them really well matters a lot
  • 43. 2 Scikit-learn: the tool A Python library for machine learning © Theodore W. Gray
  • 44. 2 A Python library. A library, not a program: more expressive and flexible; easy to include in an ecosystem. As easy as py:
      from sklearn import svm
      classifier = svm.SVC()
      classifier.fit(X_train, Y_train)
      Y_test = classifier.predict(X_test)
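To make the slide's snippet runnable end to end, here is a hedged sketch that supplies the missing data. The iris dataset and the `train_test_split` helper from `sklearn.model_selection` are choices made for this example (the latter assumes a reasonably recent scikit-learn release); the name `Y_true` is introduced here to hold the held-out labels.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Illustrative data: the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_true = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The three lines from the slide.
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)

# Fraction of held-out samples predicted correctly.
accuracy = (Y_test == Y_true).mean()
print(f"accuracy on held-out data: {accuracy:.2f}")
```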
  • 45. 2 API: specifying a model. A central concept: the estimator. Instantiated without data, but specifying the parameters:
      from sklearn.neighbors import KNeighborsClassifier
      estimator = KNeighborsClassifier(n_neighbors=2)
  • 46. 2 API: training a model. Training from data: estimator.fit(X_train, Y_train) with: X a numpy array with shape n_samples × n_features; y a numpy 1D array, of ints or floats, with shape n_samples
  • 47. 2 API: using a model.
      Prediction (classification, regression): Y_test = estimator.predict(X_test)
      Transforming (dimension reduction, filter): X_new = estimator.transform(X_test)
      Test score, density estimation: test_score = estimator.score(X_test)
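The specify / train / use cycle from the three API slides can be sketched as follows. The digits dataset, the choice of `PCA` as the transformer, and the variable names are assumptions made for illustration. Note that for a supervised estimator `score` also takes the true labels, whereas the one-argument form on the slide applies to models such as density estimators.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: X has shape (n_samples, n_features), y has shape (n_samples,).
X, y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_true = train_test_split(X, y, random_state=0)

# 1. Specify the models: parameters only, no data yet.
reducer = PCA(n_components=16)
estimator = KNeighborsClassifier(n_neighbors=2)

# 2. Train from data.
reducer.fit(X_train)
X_train_small = reducer.transform(X_train)
estimator.fit(X_train_small, Y_train)

# 3. Use the fitted models: transform, predict, score.
X_test_small = reducer.transform(X_test)
Y_test = estimator.predict(X_test_small)
test_score = estimator.score(X_test_small, Y_true)
print(f"test accuracy: {test_score:.2f}")
```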
  • 48. 2 Vectorizing: from raw data to a sample matrix X. For text data: counting word occurrences - Input data: list of documents (string) - Output data: numerical matrix
  • 49. 2 Vectorizing: from raw data to a sample matrix X. For text data: counting word occurrences - Input data: list of documents (string) - Output data: numerical matrix
      from sklearn.feature_extraction.text import HashingVectorizer
      hasher = HashingVectorizer()
      X = hasher.fit_transform(documents)
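A small runnable version of the vectorizing step, with made-up documents. The three strings and the reduced `n_features=2**10` are assumptions chosen to keep the output easy to inspect; the default number of hashed features is much larger.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Input data: a list of documents (strings). These are illustrative only.
documents = [
    "machine learning in Python",
    "learning from data",
    "Python for data processing",
]

# Each word is hashed to one of n_features columns; no vocabulary is stored.
hasher = HashingVectorizer(n_features=2**10)
X = hasher.fit_transform(documents)

# Output data: a sparse numerical matrix, one row per document.
print(X.shape)  # (3, 1024)
```

Because the vectorizer keeps no state, calling `hasher.transform(documents)` directly gives the same matrix, which is what makes it usable on chunks of data processed separately.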
  • 50. 2 Scikit-learn: very rich feature set. Supervised learning: decision trees (random forest, boosted trees), linear models, SVM. Unsupervised learning: clustering, dictionary learning, outlier detection. Model selection: built-in cross-validation, parameter optimization
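As an illustration of the model selection items, the sketch below runs cross-validation and a small parameter search. It assumes a recent scikit-learn where these tools live in `sklearn.model_selection`; the dataset, the SVM estimator, and the grid of `C` values are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Built-in cross-validation: one score per fold.
scores = cross_val_score(SVC(), X, y, cv=5)
print("mean cross-validated accuracy:", scores.mean())

# Parameter optimization: try each candidate value of C with cross-validation.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]
print("selected C:", best_C)
```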
  • 51. 2 Computational performance
      Algorithm    scikit-learn  mlpy   pybrain  pymvpa  mdp    shogun
      SVM          5.2           9.47   17.5     11.52   40.48  5.63
      LARS         1.17          105.3  -        37.35   -      -
      Elastic Net  0.52          73.7   -        1.44    -      -
      kNN          0.57          1.41   -        0.56    0.58   1.36
      PCA          0.18          -      -        8.93    0.47   0.33
      k-Means      1.34          0.79   ∞        -       35.75  0.68
    Algorithmic optimizations. Minimizing data copies.
  • 52. 2 Computational performance (same benchmark table as slide 51). Algorithmic optimizations. Minimizing data copies. Random Forest fit time (s):
      Scikit-Learn-RF   203.01
      Scikit-Learn-ETs  211.53
      OpenCV-RF         4464.65
      OpenCV-ETs        3342.83
      OK3-RF            1518.14
      OK3-ETs           1711.94
      Weka-RF           1027.91
      R-RF              13427.06
      Orange-RF         10941.72
    Implementation languages: Scikit-Learn (Python, Cython); OpenCV (C++); OK3 (C); Weka (Java); randomForest (R, Fortran); Orange (Python). Figure: Gilles Louppe
  • 53. What if the data does not fit in memory? “Big data”: Petabytes... Distributed storage Computing cluster
  • 54. What if the data does not fit in memory? “Big data”: Petabytes... Distributed storage, computing cluster. Mere mortals: Gigabytes... Python programming, off-the-shelf computers. See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
  • 55. 2 On-line algorithms: estimator.partial_fit(X_train, Y_train) [Figure: blocks of digits depicting data]
  • 56. 2 On-line algorithms: estimator.partial_fit(X_train, Y_train). Linear models: sklearn.linear_model.SGDRegressor, sklearn.linear_model.SGDClassifier. Clustering: sklearn.cluster.MiniBatchKMeans, sklearn.cluster.Birch (new in 0.16). PCA (new in 0.16): sklearn.decomposition.IncrementalPCA
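A minimal sketch of on-line learning with `partial_fit`, assuming the data arrives in chunks. The generator `iter_chunks` is a hypothetical stand-in for reading successive blocks from disk; here it simply draws synthetic samples.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)


def iter_chunks(n_chunks=20, chunk_size=100):
    """Hypothetical data source: yields (X, y) chunks, as if read from disk."""
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, 5)
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


estimator = SGDClassifier(random_state=0)
# The full set of labels must be declared on the first call to partial_fit,
# since a single chunk may not contain every class.
classes = np.array([0, 1])
for X_train, Y_train in iter_chunks():
    estimator.partial_fit(X_train, Y_train, classes=classes)

# Evaluate on a fresh chunk that was never used for training.
X_test, Y_true = next(iter_chunks(n_chunks=1, chunk_size=500))
test_score = estimator.score(X_test, Y_true)
print(f"test accuracy: {test_score:.2f}")
```

Only one chunk is held in memory at a time, so the total data size is bounded by disk rather than RAM.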
  • 57. 2 On-the-fly data reduction. Many features ⇒ reduce the data as it is loaded: X_small = estimator.transform(X_big, y)
  • 58. 2 On-the-fly data reduction. Random projections (will average features): sklearn.random_projection, random linear combinations of the features. Fast clustering of features: sklearn.cluster.FeatureAgglomeration; on images: super-pixel strategy. Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer; stateless: can be used in parallel
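The first two reduction strategies can be sketched on a random matrix. The data, the target dimensions, and the choice of `SparseRandomProjection` among the classes in `sklearn.random_projection` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.random_projection import SparseRandomProjection

# Illustrative "wide" data: few samples, many features.
rng = np.random.RandomState(0)
X_big = rng.randn(50, 1000)

# Random projections: each new feature is a random linear combination
# of the original ones.
projector = SparseRandomProjection(n_components=100, random_state=0)
X_small = projector.fit_transform(X_big)

# Feature clustering: similar features are grouped, and each group is
# replaced by its average.
agglo = FeatureAgglomeration(n_clusters=20)
X_agglo = agglo.fit_transform(X_big)

print(X_big.shape, "->", X_small.shape, "and", X_agglo.shape)
```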
  • 59. 3 Scikit-learn: the project
  • 60. 3 Having an impact
  • 63. 3 Having an impact: 1% of Debian installs; 1200 job offers on Stack Overflow
  • 65. 3 Community-based development in scikit-learn. Huge feature set: benefits of a large team. Project growth: more than 200 contributors; ∼12 core contributors; 1 full-time INRIA programmer from the start. Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn)
  • 66. 3 Many eyes make code fast: L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
  • 67. 3 6 steps to a community-driven project: 1. Focus on quality 2. Build great docs and examples 3. Use GitHub 4. Limit the technicality of your codebase 5. Releasing and packaging matter 6. Focus on your contributors, give them credit, decision power. http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
  • 68. 3 Quality assurance. Code review: pull requests. Can include newcomers. We read each other’s code. Everything is discussed: - Should the algorithm go in? - Are there good defaults? - Are names meaningful? - Are the numerics stable? - Could it be faster?
  • 69. 3 Quality assurance. Unit testing: everything is tested. Great for numerics. Overall tests enforced on all estimators: - consistency with the API - basic invariances - good handling of various inputs. If it ain’t tested it’s broken
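The idea of one test applied to every estimator can be illustrated with a hand-written check. This is a sketch only, not the project's actual test suite: the helper `check_basic_api`, the three estimators, and the particular invariants are assumptions chosen for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier


def check_basic_api(estimator):
    """Hypothetical helper: a few invariants any supervised estimator should satisfy."""
    rng = np.random.RandomState(0)
    X = rng.randn(40, 3)
    y = (X[:, 0] > 0).astype(int)

    # Consistency with the API: fit returns the estimator itself,
    # predict returns one value per sample.
    est = clone(estimator)
    assert est.fit(X, y) is est
    pred = est.predict(X)
    assert pred.shape == (X.shape[0],)

    # Good handling of various inputs: plain Python lists give the same result.
    pred_list = clone(estimator).fit(X.tolist(), y.tolist()).predict(X.tolist())
    assert np.allclose(pred, pred_list)

    # Basic invariance: cloning preserves the constructor parameters.
    assert clone(estimator).get_params() == estimator.get_params()
    return True


# The same check runs unchanged on very different estimators.
results = [
    check_basic_api(estimator)
    for estimator in [LogisticRegression(), Ridge(), KNeighborsClassifier(n_neighbors=2)]
]
print("all checks passed:", all(results))
```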
  • 70. Make it work, make it right, make it boring
  • 71. 3 The tragedy of the commons: “Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.” (Wikipedia) Make it work, make it right, make it boring. Core projects (boring) taken for granted ⇒ hard to fund, less excitement. They need citation, in papers & on corporate web pages
  • 72. 3 The tragedy of the commons: “Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.” (Wikipedia) Make it work, make it right, make it boring. Core projects (boring) taken for granted ⇒ hard to fund, less excitement. They need citation, in papers & on corporate web pages. + It’s so hard to scale: user support, growing codebase
  • 73. @GaelVaroquaux Scikit-learn The vision Machine learning as a means not an end Versatile library: the “right” level of abstraction Close to research, but seeking different tradeoffs
  • 74. @GaelVaroquaux Scikit-learn The vision Machine learning as a means not an end The tool Simple API uniform across learners Numpy matrices as data containers Reasonably fast
  • 75. @GaelVaroquaux Scikit-learn The vision Machine learning as a means not an end The tool Simple API uniform across learners The project Many people working together Tests and discussions for quality We’re hiring!