
Scikit-learn for easy machine learning: the vision, the tool, and the project

Scikit-learn is a popular machine learning tool. What can it do for you? Why would you want to use it? What can you do with it? Where is it going? In this talk, I will discuss why and how scikit-learn became popular. I will argue that it is successful because of its vision: it fills an important slot in the rich ecosystem of data science. I will demonstrate how scikit-learn makes predictive analysis easy and yet versatile. I will shed some light on our development process: how do we, as a community, ensure the quality and the growth of scikit-learn?


  1. 1. Scikit-learn for easy machine learning: the vision, the tool, and the project Gaël Varoquaux scikit machine learning in Python
  2. 2. 1 Scikit-learn: the vision G Varoquaux 2
  3. 3. 1 Scikit-learn: the vision An enabler G Varoquaux 2
  4. 4. 1 Scikit-learn: the vision An enabler Machine learning for everybody and for everything Machine learning without learning the machinery G Varoquaux 2
  5. 5. Machine learning in a nutshell Machine learning is about making predictions from data G Varoquaux 3
  6. 6. 1 Machine learning: a historical perspective Artificial Intelligence The 80s Building decision rules Eatable? Mobile? Tall? G Varoquaux 4
  7. 7. 1 Machine learning: a historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations G Varoquaux 4
  8. 8. 1 Machine learning: a historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations G Varoquaux 4
  9. 9. 1 Machine learning: a historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations Big data today Many observations, simple rules G Varoquaux 4
  10. 10. 1 Machine learning: a historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations Big data today Many observations, simple rules “Big data isn’t actually interesting without machine learning” Steve Jurvetson, VC, Silicon Valley G Varoquaux 4
  11. 11. 1 Machine learning in a nutshell: an example Face recognition Andrew Bill Charles Dave G Varoquaux 5
  12. 12. 1 Machine learning in a nutshell: an example Face recognition Andrew Bill Charles Dave ?G Varoquaux 5
  13. 13. 1 Machine learning in a nutshell A simple method: 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) image, find the image that is most similar. “Nearest neighbor” method G Varoquaux 6
  14. 14. 1 Machine learning in a nutshell A simple method: 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) image, find the image that is most similar. “Nearest neighbor” method How many errors on already-known images? ... 0: no errors Test data = Train data G Varoquaux 6
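The zero-error point above is easy to check in code. A minimal sketch (the dataset and parameters are illustrative choices, not from the talk): with k=1, every training image is its own nearest neighbor, so scoring on the training data itself reports essentially no errors — which says nothing about generalization.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# 1-nearest-neighbor: memorize the training set, predict the closest match.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# Test data = train data: perfect by construction
# (barring exact duplicate images with different labels).
train_accuracy = knn.score(X, y)
print(train_accuracy)
```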
  15. 15. 1 Machine learning in a nutshell: regression A single descriptor: one dimension x y G Varoquaux 7
  16. 16. 1 Machine learning in a nutshell: regression A single descriptor: one dimension x y x y Which model to prefer? G Varoquaux 7
  17. 17. 1 Machine learning in a nutshell: regression A single descriptor: one dimension x y x y Problem of “over-fitting” Minimizing error is not always the best strategy (learning noise) Test data = train data G Varoquaux 7
  18. 18. 1 Machine learning in a nutshell: regression A single descriptor: one dimension x y x y Prefer simple models = concept of “regularization” Balance the number of parameters to learn with the amount of data G Varoquaux 7
  19. 19. 1 Machine learning in a nutshell: regression A single descriptor: one dimension x y x y Prefer simple models = concept of “regularization” Balance the number of parameters to learn with the amount of data Bias variance tradeoff G Varoquaux 7
  20. 20. 1 Machine learning in a nutshell: regression A single descriptor: one dimension x y Two descriptors: 2 dimensions X_1 X_2 y More parameters G Varoquaux 7
  21. 21. 1 Machine learning in a nutshell: regression A single descriptor: one dimension x y Two descriptors: 2 dimensions X_1 X_2 y More parameters ⇒ need more data “curse of dimensionality” G Varoquaux 7
  22. 22. 1 Machine learning in a nutshell: classification Example: recognizing hand-written digits G Varoquaux 8
  23. 23. 1 Machine learning in a nutshell: classification X1 X2 Example: recognizing hand-written digits Represent with 2 numerical features G Varoquaux 8
  24. 24. 1 Machine learning in a nutshell: classification X1 X2 G Varoquaux 8
  25. 25. 1 Machine learning in a nutshell: classification X1 X2 It’s about finding separating lines G Varoquaux 8
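"Finding separating lines", concretely: a linear classifier on two features learns a line (exposed as coef_ and intercept_) that splits the classes. The synthetic 2D data below is an illustrative assumption, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 2)                      # two descriptors: X1, X2
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # classes separated by a line

clf = LogisticRegression().fit(X, y)

# The decision boundary is the line coef_ . x + intercept_ = 0.
print(clf.coef_, clf.intercept_)
print(clf.score(X, y))
```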
  26. 26. 1 Machine learning in a nutshell: models 1 staircase Fit with a staircase of 10 constant values G Varoquaux 9
  27. 27. 1 Machine learning in a nutshell: models 1 staircase 2 staircases combined Fit with a staircase of 10 constant values Fit a new staircase on errors G Varoquaux 9
  28. 28. 1 Machine learning in a nutshell: models 1 staircase 2 staircases combined 3 staircases combined Fit with a staircase of 10 constant values Fit a new staircase on errors Keep going G Varoquaux 9
  29. 29. 1 Machine learning in a nutshell: models 1 staircase 2 staircases combined 3 staircases combined 300 staircases combined Fit with a staircase of 10 constant values Fit a new staircase on errors Keep going Boosted regression trees G Varoquaux 9
  30. 30. 1 Machine learning in a nutshell: models 1 staircase 2 staircases combined 3 staircases combined 300 staircases combined Fit with a staircase of 10 constant values Fit a new staircase on errors Keep going Boosted regression trees Complexity trade-offs Computational + statistical G Varoquaux 9
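The "staircases combined" progression above is gradient boosting: each new regression tree is fit on the residual errors of the ensemble so far. A hedged sketch (data and hyperparameters are illustrative, not from the talk):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X.ravel()) + 0.1 * rng.randn(200)

# 1 staircase vs 300 staircases combined: each stage is a small tree
# fit on the errors of the previous stages.
one_stage = GradientBoostingRegressor(n_estimators=1, max_depth=3).fit(X, y)
many_stages = GradientBoostingRegressor(n_estimators=300, max_depth=3).fit(X, y)

# More stages fit the training data more closely.
print(one_stage.score(X, y), many_stages.score(X, y))
```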
  31. 31. 1 Machine learning in a nutshell: unsupervised Stock market structure G Varoquaux 10
  32. 32. 1 Machine learning in a nutshell: unsupervised Stock market structure Unlabeled data more common than labeled data G Varoquaux 10
  33. 33. Machine learning Mathematics and algorithms for fitting predictive models Regression x y Classification Unsupervised... Notions of overfit, test error, regularization, model complexity G Varoquaux 11
  34. 34. Machine learning is everywhere Image recognition Marketing (click-through rate) Movie / music recommendation Medical data Logistics chains (e.g. supermarkets) Language translation Detecting industrial failures G Varoquaux 12
  35. 35. Why another machine learning package? G Varoquaux 13
  36. 36. Real statisticians use R And real astronomers use IRAF Real economists use Gauss Real coders use C / assembler Real experiments are controlled in Labview Real Bayesians use BUGS / stan Real text processing is done in Perl Real deep learning is best done with torch (Lua) And medical doctors only trust SPSS G Varoquaux 14
  37. 37. 1 My stack Python, what else? General purpose Interactive language Easy to read / write G Varoquaux 15
  38. 38. 1 My stack The scientific Python stack numpy arrays Mostly a float** No annotation / structure Universal across applications Easily shared with C / fortran [figure: a numpy array rendered as a block of raw digits] G Varoquaux 15
  39. 39. 1 My stack The scientific Python stack numpy arrays Connecting to scipy scikit-image pandas ... It’s about plugin things together G Varoquaux 15
  40. 40. 1 My stack The scientific Python stack numpy arrays Connecting to scipy scikit-image pandas ... Being Pythonic and SciPythonic G Varoquaux 15
  41. 41. 1 scikit-learn vision Machine learning for all No specific application domain No requirements in machine learning High-quality Pythonic software library Interfaces designed for users Community-driven development BSD licensed, very diverse contributors http://scikit-learn.org G Varoquaux 16
  42. 42. 1 Between research and applications Machine learning research Conceptual complexity is not an issue New and bleeding edge is better Simple problems are old science In the field Tried and tested (aka boring) is good Little sophistication from the user API is more important than maths Solving simple problems matters Solving them really well matters a lot G Varoquaux 17
  43. 43. 2 Scikit-learn: the tool A Python library for machine learning c Theodore W. Gray G Varoquaux 18
  44. 44. 2 A Python library A library, not a program More expressive and flexible Easy to include in an ecosystem As easy as py:
      from sklearn import svm
      classifier = svm.SVC()
      classifier.fit(X_train, Y_train)
      Y_test = classifier.predict(X_test)
      G Varoquaux 19
  45. 45. 2 API: specifying a model A central concept: the estimator Instantiated without data But specifying the parameters
      from sklearn.neighbors import KNeighborsClassifier
      estimator = KNeighborsClassifier(n_neighbors=2)
      G Varoquaux 20
  46. 46. 2 API: training a model Training from data
      estimator.fit(X_train, Y_train)
      with: X a numpy array with shape n_samples × n_features; y a numpy 1D array, of ints or floats, with shape n_samples
      G Varoquaux 21
  47. 47. 2 API: using a model Prediction: classification, regression
      Y_test = estimator.predict(X_test)
      Transforming: dimension reduction, filter
      X_new = estimator.transform(X_test)
      Test score, density estimation
      test_score = estimator.score(X_test)
      G Varoquaux 22
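The three API verbs on the slides above — fit, predict, score — can be put together in one short, runnable example. The dataset, the train/test split, and the use of the current sklearn.model_selection module (which postdates some versions of this talk) are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = SVC()
estimator.fit(X_train, y_train)              # training from data
y_pred = estimator.predict(X_test)           # prediction on new data
accuracy = estimator.score(X_test, y_test)   # test score
print(accuracy)
```

Note that held-out test data is used for scoring here, in line with the earlier warning that test data = train data measures nothing useful.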
  48. 48. 2 Vectorizing From raw data to a sample matrix X For text data: counting word occurrences - Input data: list of documents (string) - Output data: numerical matrix G Varoquaux 23
  49. 49. 2 Vectorizing From raw data to a sample matrix X For text data: counting word occurrences - Input data: list of documents (string) - Output data: numerical matrix
      from sklearn.feature_extraction.text import HashingVectorizer
      hasher = HashingVectorizer()
      X = hasher.fit_transform(documents)
      G Varoquaux 23
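The vectorizing step, spelled out end to end: HashingVectorizer maps raw documents to a sparse numerical matrix without keeping any vocabulary state. The documents and the n_features value below are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

documents = [
    "machine learning in Python",
    "scikit-learn makes machine learning easy",
]

# n_features fixes the width of the output matrix via hashing.
hasher = HashingVectorizer(n_features=2**10)
X = hasher.fit_transform(documents)

print(X.shape)   # (2, 1024): one row per document, a scipy sparse matrix
```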
  50. 50. 2 Scikit-learn: very rich feature set Supervised learning Decision trees (Random-Forest, Boosted Tree) Linear models SVM Unsupervised Learning Clustering Dictionary learning Outlier detection Model selection Built in cross-validation Parameter optimization G Varoquaux 24
  51. 51. 2 Computational performance
                   scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
      SVM          5.2            9.47    17.5      11.52    40.48   5.63
      LARS         1.17           105.3   -         37.35    -       -
      Elastic Net  0.52           73.7    -         1.44     -       -
      kNN          0.57           1.41    -         0.56     0.58    1.36
      PCA          0.18           -       -         8.93     0.47    0.33
      k-Means      1.34           0.79    ∞         -        35.75   0.68
      Algorithmic optimizations Minimizing data copies
      G Varoquaux 25
  52. 52. 2 Computational performance
                   scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
      SVM          5.2            9.47    17.5      11.52    40.48   5.63
      LARS         1.17           105.3   -         37.35    -       -
      Elastic Net  0.52           73.7    -         1.44     -       -
      kNN          0.57           1.41    -         0.56     0.58    1.36
      PCA          0.18           -       -         8.93     0.47    0.33
      k-Means      1.34           0.79    ∞         -        35.75   0.68
      Algorithmic optimizations Minimizing data copies
      [Figure (Gilles Louppe): Random Forest fit time (s) -- Scikit-Learn-RF 203.01, Scikit-Learn-ETs 211.53, OpenCV-RF 4464.65, OpenCV-ETs 3342.83, OK3-RF 1518.14, OK3-ETs 1711.94, Weka-RF 1027.91, R-RF 13427.06, Orange-RF 10941.72; implementations: Scikit-Learn (Python, Cython), OpenCV (C++), OK3 (C), Weka (Java), randomForest (R, Fortran), Orange (Python)]
      G Varoquaux 25
  53. 53. What if the data does not fit in memory? “Big data”: Petabytes... Distributed storage Computing cluster G Varoquaux 26
  54. 54. What if the data does not fit in memory? “Big data”: Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-shelf computers See also: http://www.slideshare.net/GaelVaroquaux/processing- biggish-data-on-commodity-hardware-simple-python-patterns G Varoquaux 26
  55. 55. 2 On-line algorithms
      estimator.partial_fit(X_train, Y_train)
      [figure: a large array of digits, processed chunk by chunk]
      G Varoquaux 27
  56. 56. 2 On-line algorithms
      estimator.partial_fit(X_train, Y_train)
      Linear models: sklearn.linear_model.SGDRegressor, sklearn.linear_model.SGDClassifier
      Clustering: sklearn.cluster.MiniBatchKMeans, sklearn.cluster.Birch (new in 0.16)
      PCA (new in 0.16): sklearn.decomposition.IncrementalPCA
      G Varoquaux 27
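The partial_fit pattern above can be sketched with one of the listed estimators, SGDClassifier: the model sees the data one mini-batch at a time, so nothing bigger than a batch need fit in memory. The synthetic data and batch layout are illustrative; note that the set of classes must be declared on the first call.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = (X[:, 0] > 0).astype(int)

clf = SGDClassifier(random_state=0)
for start in range(0, 1000, 100):
    # Feed one mini-batch at a time; classes= is required up front
    # because no single batch is guaranteed to contain every class.
    clf.partial_fit(X[start:start + 100], y[start:start + 100],
                    classes=[0, 1])

print(clf.score(X, y))
```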
  57. 57. 2 On-the-fly data reduction Many features ⇒ Reduce the data as it is loaded
      X_small = estimator.transform(X_big, y)
      G Varoquaux 28
  58. 58. 2 On-the-fly data reduction Random projections (will average features) sklearn.random_projection random linear combinations of the features Fast clustering of features sklearn.cluster.FeatureAgglomeration on images: super-pixel strategy Hashing when observations have varying size (e.g. words) sklearn.feature_extraction.text.HashingVectorizer stateless: can be used in parallel G Varoquaux 28
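A sketch of the first reduction strategy listed above, random projections: the transformer builds random linear combinations of the features, so a wide feature space shrinks cheaply. The array sizes and n_components value are illustrative assumptions.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X_big = rng.randn(100, 10000)   # many features

# Project 10000 features down to 50 random linear combinations.
reducer = GaussianRandomProjection(n_components=50, random_state=0)
X_small = reducer.fit_transform(X_big)

print(X_big.shape, "->", X_small.shape)   # (100, 10000) -> (100, 50)
```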
  59. 59. 3 Scikit-learn: the project G Varoquaux 29
  60. 60. 3 Having an impact G Varoquaux 30
  61. 61. 3 Having an impact G Varoquaux 30
  62. 62. 3 Having an impact G Varoquaux 30
  63. 63. 3 Having an impact 1% of Debian installs 1200 job offers on stack overflow G Varoquaux 30
  64. 64. 3 Having an impact 1% of Debian installs 1200 job offers on stack overflow G Varoquaux 30
  65. 65. 3 Community-based development in scikit-learn Huge feature set: benefits of a large team Project growth: More than 200 contributors, ∼ 12 core contributors, 1 full-time INRIA programmer from the start Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn) G Varoquaux 31
  66. 66. 3 Many eyes make code fast L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer G Varoquaux 32
  67. 67. 3 6 steps to a community-driven project 1 Focus on quality 2 Build great docs and examples 3 Use github 4 Limit the technicality of your codebase 5 Releasing and packaging matter 6 Focus on your contributors, give them credit, decision power http://www.slideshare.net/GaelVaroquaux/ scikit-learn-dveloppement-communautaire G Varoquaux 33
  68. 68. 3 Quality assurance Code review: pull requests Can include newcomers We read each other’s code Everything is discussed: - Should the algorithm go in? - Are there good defaults? - Are names meaningful? - Are the numerics stable? - Could it be faster? G Varoquaux 34
  69. 69. 3 Quality assurance Unit testing Everything is tested Great for numerics Overall tests enforce, on all estimators: - consistency with the API - basic invariances - good handling of various inputs If it ain’t tested it’s broken G Varoquaux 35
  70. 70. Make it work, make it right, make it boring G Varoquaux 36
  71. 71. 3 The tragedy of the commons Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource. Wikipedia Make it work, make it right, make it boring Core projects (boring) taken for granted ⇒ Hard to fund, less excitement They need citation, in papers & on corporate web pages G Varoquaux 37
  72. 72. 3 The tragedy of the commons Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource. Wikipedia Make it work, make it right, make it boring Core projects (boring) taken for granted ⇒ Hard to fund, less excitement They need citation, in papers & on corporate web pages + It’s so hard to scale User support Growing codebase G Varoquaux 37
  73. 73. @GaelVaroquaux Scikit-learn The vision Machine learning as a means not an end Versatile library: the “right” level of abstraction Close to research, but seeking different tradeoffs
  74. 74. @GaelVaroquaux Scikit-learn The vision Machine learning as a means not an end The tool Simple API uniform across learners Numpy matrices as data containers Reasonably fast
  75. 75. @GaelVaroquaux Scikit-learn The vision Machine learning as a means not an end The tool Simple API uniform across learners The project Many people working together Tests and discussions for quality We’re hiring!
