Personal point of view on scikit-learn: past, present, and future.
This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.
1. Scikit-learn: The state of the union
Gaël Varoquaux Open Source Innovation Spring
2016
Personal point of view, as an opening to scikit-learn days 2016 in Paris
3. 1 scikit-learn growth: users
Website users (weekly): Google analytics
Debian popcon: ∼ 1% of the Debian users
Web searches: Google trends
G Varoquaux 3
5. 1 scikit-learn growth: lines of code
Lines of code:
Huge feature set
https://www.openhub.net/p/scikit-learn
7. 1 Started as David Cournapeau’s failed PhD project
David then preferred
improving numpy/scipy
That’s David sprinting in 2011
8. 1 2009: We (Inria Parietal) need machine learning
My team takes over the development
Hire a young guy (Fabian Pedregosa)
Put post-docs and PhD students on it (Alexandre Gramfort, Vincent Michel...)
Work in the open
Pythonic, fast, documented
9. 1 2010: ICML MLOSS workshop
Machine Learning Open Source Software
“The examples in the tutorial are pretty, but not particularly useful for the serious user.”
“For the sustainability of the project it might be better to narrow the focus...”
10. 1 2011: NIPS sprint
People that I didn’t know
were solving my problems
The project took off because of the community...
14. 2 Less code: Cython no longer embedded
Lines of code:
Generated C no longer embedded in git
⇒ opens the door to fused types (polymorphism)
⇒ multiple-dtype support in algorithms
= memory saver
Arthur Mensch
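Fused types are a Cython feature and cannot be shown in plain Python, so the sketch below only illustrates the memory argument: the same array held as float32 takes half the bytes of its float64 copy. The array shape is an arbitrary assumption chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical data set: 1000 samples, 100 features (sizes are arbitrary).
X64 = rng.rand(1000, 100)        # NumPy's default dtype is float64
X32 = X64.astype(np.float32)     # single-precision copy of the same data

# Bytes used by each array: float32 needs half the memory of float64.
bytes_64 = X64.nbytes
bytes_32 = X32.nbytes
print(bytes_64, bytes_32)
```

An algorithm that accepts float32 input directly avoids an upcast to float64, and with it a temporary copy twice the size of the input.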
15. 2 Faster code: better algorithmics
RandomizedPCA → PCA
Automatic choice of solver: randomized linear algebra, power iteration (arpack), or full (lapack)
For large data: up to 20× speed up
https://github.com/scikit-learn/scikit-learn/issues/5243
Giorgio Patrini
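A minimal sketch of the merged estimator, assuming the `svd_solver` parameter of `sklearn.decomposition.PCA` as released. The synthetic low-rank data and its sizes are assumptions made for illustration, and the rule that `"auto"` applies depends on the scikit-learn release.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Hypothetical data: a rank-5 signal plus a little noise.
X = rng.randn(2000, 5) @ rng.randn(5, 200) + 0.01 * rng.randn(2000, 200)

# One estimator, several solvers selected through svd_solver.
pca_auto = PCA(n_components=5, svd_solver="auto").fit(X)
pca_full = PCA(n_components=5, svd_solver="full").fit(X)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

# On data like this, the randomized and full solvers should agree closely.
print(pca_full.explained_variance_ratio_)
print(pca_rand.explained_variance_ratio_)
```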
Elkan’s K means
For large data: ∼ 2× speed up.
https://github.com/scikit-learn/scikit-learn/pull/5414
Andreas Müller
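Elkan's variant uses the triangle inequality to skip distance computations, so it should reach the same clustering as the standard iterations, only faster. A minimal sketch, assuming the `algorithm="elkan"` option of `sklearn.cluster.KMeans`; the blob data set is an assumption made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: 5 Gaussian blobs in 2 dimensions.
X, _ = make_blobs(n_samples=3000, centers=5, cluster_std=0.5, random_state=0)

# Same seed and initializations; only the algorithm differs.
km_default = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
km_elkan = KMeans(n_clusters=5, n_init=10, random_state=0,
                  algorithm="elkan").fit(X)

# Both runs should end at the same within-cluster sum of squares.
print(km_default.inertia_, km_elkan.inertia_)
```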
17. 2 New cross-validation objects
# Old API: the splitter is built from the labels y
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, n_folds=2)
for train, test in cv:
    X_train = X[train]
    y_train = y[train]
18. 2 New cross-validation objects
# New API: the splitter takes no data; split(X, y) yields the indices
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    X_train = X[train]
    y_train = y[train]
Data-independent ⇒ nested-CV possible
https://github.com/scikit-learn/scikit-learn/pull/4294
Raghav R V
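Because the splitters no longer hold the data, the same objects can be reused at two levels. A minimal sketch of nested cross-validation under that design: an inner loop selects a hyper-parameter and an outer loop scores the whole selection procedure. The estimator, the grid and the iris data are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Splitters are defined without data, so they can be passed around freely.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: pick C on each outer training set.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: score the tuned model on data never seen during tuning.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```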
19. 2 Sequential / Bayesian search CV
See hyper-parameter selection as a Bayesian optimization / noisy fit problem.
⇒ choose hyper-parameters cleverly, not on a grid
Pull request stalled
https://github.com/scikit-learn/scikit-learn/pull/5491
Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
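The idea can be sketched with tools already in scikit-learn. This is not the code of the stalled pull request: it is an illustration that fits a Gaussian process to the scores observed so far and picks the next value by an upper confidence bound. The estimator (SVC on iris), the Matern kernel, the search range for log10(C) and the acquisition rule are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)


def cv_score(log_C):
    """Noisy objective: mean cross-validated accuracy for C = 10**log_C."""
    return cross_val_score(SVC(C=10.0 ** log_C), X, y, cv=3).mean()


# Candidate values of log10(C), and a small initial design.
candidates = np.linspace(-3, 3, 61).reshape(-1, 1)
tried = [-3.0, 0.0, 3.0]
scores = [cv_score(c) for c in tried]

for _ in range(5):
    # Model the score as a noisy function of the hyper-parameter.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3,
                                  normalize_y=True)
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    mean, std = gp.predict(candidates, return_std=True)
    # Upper confidence bound: favor high predicted score or high uncertainty.
    next_log_C = float(candidates[np.argmax(mean + 1.96 * std), 0])
    tried.append(next_log_C)
    scores.append(cv_score(next_log_C))

best_log_C = tried[int(np.argmax(scores))]
print(best_log_C, max(scores))
```

Eight model evaluations are spent here, against 61 for an exhaustive pass over the same candidate grid.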
21. Mission statement
Enable progress via data science
Lower the costs, fewer technicalities
Machine learning for everybody and for everything
Small hardware,
medium data
23. 3 Deep learning
sklearn.neural_network.MLPClassifier
architecture-specification language
GPUs: unbounded technicality
keras, caffe...
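A minimal sketch of what `MLPClassifier` covers: a plain fully connected network on the CPU, with the architecture given as a tuple of layer sizes. The digits data set, the single hidden layer of 64 units and the scaling step are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The architecture is just a tuple of hidden-layer sizes: no graph language.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```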
25. 3 AutoML
Automatic model selection
Better hyper-parameter selection
Better description and uniformization of estimators
Integrate feedback from auto-sklearn
26. 3 Better, faster, stronger
Faster models
From lightning, back to sklearn
Inspiration from XGBoost: the paper is out!
Larger data
More partial_fit, online forests?
Fewer copies
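For readers who have not used it, `partial_fit` updates a model one batch at a time, so the full data set never has to sit in memory. A minimal sketch with `SGDClassifier`, one of the estimators that already supports it; the synthetic data and the batch size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classes = np.unique(y)  # all classes must be declared, since a batch may miss some

clf = SGDClassifier(random_state=0)

# Stream the first 1500 samples in batches of 250.
for start in range(0, 1500, 250):
    batch = slice(start, start + 250)
    clf.partial_fit(X[batch], y[batch], classes=classes)

# Evaluate on the 500 samples the model has never seen.
accuracy = clf.score(X[1500:], y[1500:])
print(accuracy)
```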
28. 3 Scaling up (out?)
I don’t want java/scala
Less fluid prototyping
Cross-VM debugging is hard
Numerics in Java slower than LAPACK
Need C somewhere
They have:
Coupling distributed store to computation
Distributed job management
Create new stack? Ride on this one?
Blaze, Ibis, dask: require rewrite of algorithms
dask promising for ETL
New backends for joblib parallel and storage
distributed, ssh
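The hook for such backends is the `backend` argument of `joblib.Parallel`: the calling code stays the same and only the execution engine changes. A minimal sketch using the existing `"threading"` backend; the distributed and ssh backends named above are not shown, and the toy workload is an assumption.

```python
from math import sqrt

from joblib import Parallel, delayed

# The same call would run unchanged on another backend; only `backend` differs.
roots = Parallel(n_jobs=2, backend="threading")(
    delayed(sqrt)(i ** 2) for i in range(6)
)
print(roots)
```

Results come back in submission order, whichever worker ran each task.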
31. Sustainable growth
Reviewing is the bottleneck
User support drowns core devs
Users need stability (Airbus)
Coding is not the only thing
sprints, GSoC management, tutorials...
Structure & stability
How to organize funding and governance?
process/meetings/reports/funding proposal...
= work on project
Passionate coders get a lot done
unless they get drowned by meetings