Scikit-learn: the state of the union 2016

Scikit-learn: The state of the union
Gaël Varoquaux, Open Source Innovation Spring 2016
Personal point of view, as an opening to scikit-learn days 2016 in Paris

1 Some history
Scikit-learn canal historique
G Varoquaux 2

1 scikit-learn growth: users
Website users (weekly): Google analytics
Debian popcon: ∼ 1% of the Debian users
Web searches: Google trends

1 scikit-learn growth: lines of code
Lines of code:
Huge feature set
https://www.openhub.net/p/scikit-learn

1 scikit-learn growth: contributors
Contributors:
759 contributors
https://www.openhub.net/p/scikit-learn

1 Started as David Cournapeau’s failed PhD project
David then preferred improving numpy/scipy
That's David sprinting in 2011

1 2009: We (Inria Parietal) need machine learning
My team takes over the development
Hire a young guy (Fabian Pedregosa)
Put post-docs and PhD students on it (Alexandre Gramfort, Vincent Michel...)
Work in the open
Pythonic, fast, documented

1 2010: ICML MLOSS workshop
Machine Learning Open Source Software
“The examples in the tutorial are pretty, but not particularly useful for the serious user.”
“For the sustainability of the project it might be better to narrow the focus...”

1 2011: NIPS sprint
People that I didn’t know
were solving my problems
The project took off because of the community...

2 Upcoming cool stuff
Upcoming 0.18 release

2 Less code: Cython no longer embedded
Lines of code:
Generated C no longer embedded in git
⇒ opens the door to fused types (polymorphism)
⇒ multiple dtype support in algorithms
= memory savings
Arthur Mensch
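The memory argument can be illustrated with plain NumPy, independently of the Cython changes: the same array stored as float32 takes half the bytes of its float64 copy. This is only a sketch of the motivation; the array shape is arbitrary, and the slide does not say which estimators keep float32 inputs as they are.

```python
import numpy as np

rng = np.random.RandomState(0)
X64 = rng.rand(1000, 50)          # NumPy creates float64 by default
X32 = X64.astype(np.float32)      # same values, single precision

# Half the bytes per element, so half the memory for the same shape.
bytes_64 = X64.nbytes
bytes_32 = X32.nbytes
print(bytes_64, bytes_32)
```

An algorithm that can only run on float64 has to make the larger copy internally, which is the cost that multi-dtype support avoids.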

2 Faster code: better algorithmics
RandomizedPCA → PCA
Automatic choice of solver: randomized linear algebra, power iteration (arpack), full (lapack)
For large data: up to 20× speed-up
https://github.com/scikit-learn/scikit-learn/issues/5243
Giorgio Patrini
Elkan's K-means
For large data: ∼2× speed-up
https://github.com/scikit-learn/scikit-learn/pull/5414
Andreas Müller
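A minimal sketch of what this looks like from the user side, assuming the released API where the solver is chosen with the `svd_solver` parameter of `PCA` and Elkan's variant with the `algorithm` parameter of `KMeans`. The random data and sizes are made up for illustration and are far too small to show the speed-ups quoted above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(500, 40)             # toy data, illustration only

# One PCA class; the solver is a parameter rather than a separate estimator.
pca_auto = PCA(n_components=5, svd_solver="auto").fit(X)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)
pca_full = PCA(n_components=5, svd_solver="full").fit(X)

X_reduced = pca_rand.transform(X)

# Elkan's variant of K-means, selected by name.
km = KMeans(n_clusters=3, algorithm="elkan", n_init=10, random_state=0).fit(X)
labels = km.labels_
```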

2 New cross-validation objects
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, n_folds=2)
for train, test in cv:
    X_train = X[train]
    y_train = y[train]
Data-independent nested-CV possible
Raghav R V

2 New cross-validation objects
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    X_train = X[train]
    y_train = y[train]
Data-independent ⇒ nested-CV possible
Raghav R V
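Why data-independence matters for nested CV can be sketched as follows, assuming the released `sklearn.model_selection` API. The dataset, the estimator, and the parameter grid are arbitrary choices for illustration. Because the CV object no longer holds `y`, the inner search can split whatever training subset the outer loop hands it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# CV objects are built without data, so they can be used at both levels.
inner_cv = StratifiedKFold(n_splits=3)
outer_cv = StratifiedKFold(n_splits=3)

# Inner loop: choose C on each outer training set.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: score the whole selection procedure on held-out folds.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```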

2 Sequential / Bayesian search CV
See hyper-parameter selection as a Bayesian optimization / noisy fit problem.
⇒ choose hyper-parameters cleverly, not on a grid
Pull request stalled
Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
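The idea can be sketched in a few lines. This is not the code of the stalled pull request: the Gaussian-process surrogate, the upper-confidence-bound rule, the search range, and the iteration counts are all assumptions chosen for illustration. Each step fits a model of "score as a function of the hyper-parameter" on the evaluations so far, then evaluates the point that looks best or most uncertain.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)


def objective(log_gamma):
    """Black box to maximize: mean CV accuracy of an SVC for log10(gamma)."""
    model = SVC(gamma=10.0 ** log_gamma)
    return cross_val_score(model, X, y, cv=3).mean()


rng = np.random.RandomState(0)
candidates = np.linspace(-4, 2, 200).reshape(-1, 1)   # search space, log10 scale

# Start from a few random evaluations.
tried = list(rng.uniform(-4, 2, size=3))
scores = [objective(g) for g in tried]

for _ in range(7):
    # Surrogate model of the score; alpha accounts for noise in the scores.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3,
                                  normalize_y=True)
    gp.fit(np.array(tried).reshape(-1, 1), scores)

    # Upper confidence bound: favour points that look good or are uncertain.
    mean, std = gp.predict(candidates, return_std=True)
    next_log_gamma = float(candidates[np.argmax(mean + 1.96 * std), 0])

    tried.append(next_log_gamma)
    scores.append(objective(next_log_gamma))

best_log_gamma = tried[int(np.argmax(scores))]
best_score = max(scores)
```

Ten evaluations are spent in total, each one informed by the previous ones, instead of a fixed grid chosen in advance.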

3 Vision(s): the future

Mission statement
Enable progress via data science
Lower the costs, fewer technicalities
Machine learning for everybody and for everything
Small hardware, medium data

3 Deep learning
sklearn.neural_network.MLPClassifier
Architecture-specification language, GPUs: unbounded technicality
keras, caffe...
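For reference, a minimal sketch of the scikit-learn side of this trade-off: the whole architecture of an `MLPClassifier` is a tuple of hidden-layer widths, with no specification language and no GPU. The dataset, layer sizes, and iteration count are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, random_state=0)   # pixel values are in [0, 16]

# The architecture is just a tuple of hidden-layer widths.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```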

3 AutoML
Automatic model selection
Better hyper-parameter selection
Better description and uniformization of estimators
Integrate feedback from auto-sklearn

3 Better, faster, stronger
Faster models
From lightning, back to sklearn
Inspiration from XGBoost: the paper is out!
Larger data
More partial_fit, online forests?
Fewer copies
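What `partial_fit` buys for larger data can be sketched with an estimator that already supports it. The synthetic batches below stand in for chunks read from disk; `make_batch` and all sizes are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
n_features = 20
true_coef = rng.randn(n_features)


def make_batch(n_samples=200):
    """Stand-in for reading one chunk of a dataset too large for memory."""
    X = rng.randn(n_samples, n_features)
    y = (X.dot(true_coef) > 0).astype(int)
    return X, y


clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])        # every class must be declared up front

# Only one batch is held in memory at any time.
for _ in range(50):
    X_batch, y_batch = make_batch()
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test, y_test = make_batch(1000)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```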

3 Scaling up (out?)
I don't want Java/Scala
Less fluid prototyping
Cross-VM debugging is hard
Numerics in Java are slower than LAPACK
Need C somewhere

3 Scaling up (out?)
They have:
Coupling distributed store to computation
Distributed job management
Create new stack? Ride on this one?
Blaze, Ibis, dask: require rewrite of algorithms
dask promising for ETL
New backends for joblib parallel and storage
distributed, ssh
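The appeal of pluggable joblib backends is that the calling code stays the same while the execution engine changes. A minimal sketch, using only the built-in `threading` backend; the distributed and ssh backends mentioned above are not shown, and the toy workload is an assumption for illustration.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend

serial = [sqrt(i) for i in range(10)]

# Backend chosen explicitly at the call site.
threaded = Parallel(n_jobs=2, backend="threading")(
    delayed(sqrt)(i) for i in range(10))

# Backend chosen from the outside: the inner call does not name it,
# so library code written this way can be redirected by the caller.
with parallel_backend("threading", n_jobs=2):
    switched = Parallel()(delayed(sqrt)(i) for i in range(10))

print(threaded == serial, switched == serial)
```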

Sustainable growth
Reviewing is the bottleneck
User support drowns core devs
Users need stability (Airbus)
Coding is not the only thing
sprints, GSoC management, tutorials...
Structure & stability
How to organize funding and governance?
process/meetings/reports/funding proposal...
= work on project
Passionate coders get a lot done
unless they get drowned by meetings

@GaelVaroquaux
Funding: Inria, Nexedi, Paris-Saclay CDS, NYU CDS, GSoC
