The Artful Business of Data Mining: Computational Statistics with Open Source Tools

This talk covers concepts of data mining and data analysis using open source tools, mainly Python and R, along with interesting libraries and the tools I have used and currently use at Engine Yard.

  1. The Artful Business of Data Mining: Computational Statistics with Open Source Tools (Wednesday 20 March 13)
  2. David Coallier, @davidcoallier
  3. Data Scientist at Engine Yard (.com)
  4. Find Data
  5. Clean Data
  6. Analyse Data?
  7. Analyse Data
  8. Question Data
  9. Report Findings
  10. Data Scientist
  11. Data Janitor
  12. Actual Tasks
  13. “If your model is elegant, it’s probably wrong”
  14. “The Times they are a-Changing” (Bob Dylan)
  15. Python & R
  16. SciPy: http://www.scipy.org
  17. scipy.stats
  18. scipy.stats: Descriptive Statistics
  19. from scipy.stats import describe
      s = [1, 2, 1, 3, 4, 5]
      print(describe(s))
  20. scipy.stats: Probability Distributions
  21. Example: Poisson Distribution
  22. $f(k; \lambda) = \dfrac{\lambda^{k} e^{-\lambda}}{k!}$ for $k \ge 0$
  23. from scipy.stats import poisson
      p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
  24. print(p.mean())
      print(p.sum())
      ...
  25. NumPy: http://www.numpy.org/
  26. NumPy: Linear Algebra
  27. $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
  28. import numpy as np
      x = np.array([[1, 0], [0, 1]])
      val, vec = np.linalg.eig(x)  # eigenvalues first, then eigenvectors
      np.linalg.eigvals(x)         # eigenvalues only
  29. >>> np.linalg.eig(x)
      (array([1., 1.]),
       array([[1., 0.],
              [0., 1.]]))
  30. Matplotlib: Python Plotting
  31. statsmodels: Advanced Statistics Modeling
  32. NLTK: Natural Language Toolkit
  33. scikit-learn: Machine Learning
  34. from sklearn import tree
      X = [[0, 0], [1, 1]]
      Y = [0, 1]
      clf = tree.DecisionTreeClassifier()
      clf = clf.fit(X, Y)
      clf.predict([[2., 2.]])  # array([1])
  35. PyBrain: ... Machine Learning
  36. PyMC: Bayesian Inference
  37. Pattern: Web Mining for Python
  38. NetworkX: Study Networks
  39. MILK: MOAR machine LEARNING!
  40. Pandas: easy-to-use data structures
  41. from pandas import DataFrame
      x = DataFrame([
          {"age": 26},
          {"age": 19},
          {"age": 21},
          {"age": 18},
      ])
      print(x[x['age'] > 20].count())
      print(x[x['age'] > 20].mean())
  42. R
  43. RStudio: The IDE
  44. lubridate and zoo: Dealing with Dates...
  45. yy/mm/dd, mm/dd/yy, YYYY-mm-dd HH:MM:ss TZ, yy-mm-dd, 1363784094.513425, yy/mm, different timezone
  46. reshape2: Reshape your Data
  47. ggplot2: Visualise your Data
  48. RCurl, RJSONIO: Find more Data
  49. Hmisc: Miscellaneous useful functions
  50. forecast: Can you guess?
  51. garch and rugarch
  52. quantmod: Statistical Financial Trading
  53. xts: Extensible Time Series
  54. igraph: Study Networks
  55. maptools: Read & View Maps
  56. library(maps)  # map() and the 'state' database come from the maps package
      map('state', region = c(row.names(USArrests)),
          col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
          fill = TRUE)
  57. Storage
  58. Oppose “big” Data
  59. “Learn how to sample”
  60. Experiments
  61. What Do You Want to Answer?
  62. Understand Your Audience
  63. Scientific Reporting
  64. Busy-ness: Time is money
  65. Public Visualisation
  66. Best Visualisation, Bad Data
  67. Best Forecasting models... Bad Visualisation
  68. (no text)
  69. (no text)
  70. Seanchaí
  71. (no text)
  72. Feel it
  73. (no text)
  74. (no text)
  75. (no text)
  76. “Don’t be scared of bar charts.”
  77. Mathematical Statistics Engineering Business Economics Curiosity
  78. davidcoallier.github.com, @davidcoallier on Twitter
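
The Poisson formula on slide 22 can be checked by hand before reaching for `scipy.stats`. The sketch below uses only the standard library and the textbook form of the probability mass function, λ^k e^(−λ) / k!. The helper name `poisson_pmf` and the cut-off of 60 terms are assumptions made for illustration; λ = 2 matches the rate passed to `poisson.pmf` on slide 23.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 2  # same rate as the slide's poisson.pmf(..., 2) call
ks = range(0, 60)  # truncated support; the tail beyond 60 is negligible for lam = 2
probs = [poisson_pmf(k, lam) for k in ks]

total = sum(probs)                            # should be very close to 1
mean = sum(k * p for k, p in zip(ks, probs))  # should be very close to lam
print(total, mean)
```

Summing over (almost) the whole support is what makes the total meaningful. Slide 24's `p.sum()` only adds up the probabilities for the handful of `k` values passed in, so it is not expected to equal 1.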
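
For the `np.linalg.eig` call on slides 28 and 29, the identity matrix hides the structure of the result because every vector is an eigenvector. A minimal sketch with a different matrix, chosen here purely for illustration and not taken from the slides, shows that the first return value holds the eigenvalues and the second holds the eigenvectors as columns:

```python
import numpy as np

# A small symmetric matrix chosen for illustration (not from the slides).
a = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns eigenvalues first, then eigenvectors as columns.
values, vectors = np.linalg.eig(a)

# Each column v of `vectors` should satisfy a @ v == value * v (up to rounding).
for i in range(len(values)):
    v = vectors[:, i]
    print(values[i], np.allclose(a @ v, values[i] * v))
```

The order in which the eigenvalues come back is not guaranteed, so sort them if the code depends on a particular order.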
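
Slide 45's pile of date layouts is the kind of clean-up that lubridate handles in R. A rough Python equivalent, using only the standard library, is sketched below. The function name `parse_timestamp`, the list of layouts, and the decision to treat naive values as UTC are all assumptions for illustration. `mm/dd/yy` is deliberately left out because a string like `03/04/13` cannot be told apart from `yy/mm/dd` without knowing the source, and the time zone is assumed to be a numeric offset such as `+0100`.

```python
from datetime import datetime, timezone

# Candidate layouts, tried in order. The order is an assumption: an ambiguous
# string matches whichever layout comes first.
FORMATS = ["%Y-%m-%d %H:%M:%S %z", "%y/%m/%d", "%y-%m-%d"]


def parse_timestamp(raw):
    """Return a timezone-aware UTC datetime for an epoch number or a string in FORMATS."""
    # Epoch seconds such as "1363784094.513425" parse as a float.
    try:
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except (ValueError, OverflowError):
        pass
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumes naive values are UTC
        return parsed.astimezone(timezone.utc)
    raise ValueError("unrecognised timestamp: %r" % (raw,))


for raw in ["13/03/20", "13-03-20", "2013-03-20 13:54:54 +0100", "1363784094.513425"]:
    print(raw, "->", parse_timestamp(raw).isoformat())
```

Normalising everything to UTC up front means later comparisons and groupings do not depend on which source a row came from.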
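
Slides 58 and 59 argue for sampling rather than processing everything. One way to see why that can work is to compare a statistic on a full data set with the same statistic on a small simple random sample. The sketch below uses synthetic data: the population size, the sample size, and the normal distribution are made up for illustration and do not come from the talk.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic "population"; the values are invented for illustration only.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Simple random sample without replacement: 1% of the rows.
sample = random.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
print(population_mean, sample_mean)
```

With 1,000 rows and a standard deviation of 10, the sample mean is expected to land within a few tenths of the population mean. That is often accurate enough to answer the question at a fraction of the cost.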