SlideShare uma empresa Scribd logo
1 de 36
Baixar para ler offline
Data structures for statistical computing in Python

                      Wes McKinney


                         SciPy 2010




 McKinney ()     Statistical Data Structures in Python   SciPy 2010   1 / 31
Environments for statistics and data analysis



    The usual suspects: R / S+, MATLAB, Stata, SAS, etc.
    Python being used increasingly in statistical or related applications
         scikits.statsmodels: linear models and other econometric estimators
         PyMC: Bayesian MCMC estimation
         scikits.learn: machine learning algorithms
         Many interfaces to mostly non-Python libraries (pycluster, SHOGUN,
         Orange, etc.)
         And others (look at the SciPy conference schedule!)
    How can we attract more statistical users to Python?




      McKinney ()          Statistical Data Structures in Python   SciPy 2010   2 / 31
What matters to statistical users?




    Standard suite of linear algebra, matrix operations (NumPy, SciPy)
    Availability of statistical models and functions
         More than there used to be, but nothing compared to R / CRAN
         rpy2 is coming along, but it doesn’t seem to be an “end-user” project
    Data visualization and graphics tools (matplotlib, ...)
    Interactive research environment (IPython)




      McKinney ()          Statistical Data Structures in Python   SciPy 2010   3 / 31
What matters to statistical users? (cont’d)




    Easy installation and sources of community support
    Well-written and navigable documentation
    Robust input / output tools
    Flexible data structures and data manipulation tools




      McKinney ()        Statistical Data Structures in Python   SciPy 2010   4 / 31
What matters to statistical users? (cont’d)




    Easy installation and sources of community support
    Well-written and navigable documentation
    Robust input / output tools
    Flexible data structures and data manipulation tools




      McKinney ()        Statistical Data Structures in Python   SciPy 2010   5 / 31
Statistical data sets

Statistical data sets commonly arrive in tabular format, i.e. as a
two-dimensional list of observations and names for the fields of each
observation.

array([(’GOOG’, ’2009-12-28’, 622.87, 1697900.0),
       (’GOOG’, ’2009-12-29’, 619.40, 1424800.0),
       (’GOOG’, ’2009-12-30’, 622.73, 1465600.0),
       (’GOOG’, ’2009-12-31’, 619.98, 1219800.0),
       (’AAPL’, ’2009-12-28’, 211.61, 23003100.0),
       (’AAPL’, ’2009-12-29’, 209.10, 15868400.0),
       (’AAPL’, ’2009-12-30’, 211.64, 14696800.0),
       (’AAPL’, ’2009-12-31’, 210.73, 12571000.0)],
      dtype=[(’item’, ’|S4’), (’date’, ’|S10’),
             (’price’, ’<f8’), (’volume’, ’<f8’)])


      McKinney ()         Statistical Data Structures in Python   SciPy 2010   6 / 31
Structured arrays


    Structured arrays are great for many applications, but not always
    great for general data analysis
    Pros
         Fast, memory-efficient, good for loading and saving big data
         Nested dtypes help manage hierarchical data




      McKinney ()         Statistical Data Structures in Python   SciPy 2010   7 / 31
Structured arrays


    Structured arrays are great for many applications, but not always
    great for general data analysis
    Pros
         Fast, memory-efficient, good for loading and saving big data
         Nested dtypes help manage hierarchical data
    Cons
         Can’t be immediately used in many (most?) NumPy methods
         Are not flexible in size (have to use or write auxiliary methods to “add”
         fields)
         Not too many built-in data manipulation methods
         Selecting subsets is often O(n)!




      McKinney ()          Statistical Data Structures in Python   SciPy 2010   7 / 31
Structured arrays


    Structured arrays are great for many applications, but not always
    great for general data analysis
    Pros
         Fast, memory-efficient, good for loading and saving big data
         Nested dtypes help manage hierarchical data
    Cons
         Can’t be immediately used in many (most?) NumPy methods
         Are not flexible in size (have to use or write auxiliary methods to “add”
         fields)
         Not too many built-in data manipulation methods
         Selecting subsets is often O(n)!
    What can be learned from other statistical languages?



      McKinney ()          Statistical Data Structures in Python   SciPy 2010   7 / 31
R’s data.frame

One of the core data structures of the R language. In many ways similar
to a structured array.

    > df <- read.csv(’data’)
      item       date price                 volume
    1 GOOG 2009-12-28 622.87               1697900
    2 GOOG 2009-12-29 619.40               1424800
    3 GOOG 2009-12-30 622.73               1465600
    4 GOOG 2009-12-31 619.98               1219800
    5 AAPL 2009-12-28 211.61              23003100
    6 AAPL 2009-12-29 209.10              15868400
    7 AAPL 2009-12-30 211.64              14696800
    8 AAPL 2009-12-31 210.73              12571000



      McKinney ()         Statistical Data Structures in Python   SciPy 2010   8 / 31
R’s data.frame

Perhaps more like a mutable dictionary of vectors. Much of R’s statistical
estimators and 3rd-party libraries are designed to be used with
data.frame objects.

     > df$isgoog <- df$item == "GOOG"
     > df
       item       date price    volume isgoog
     1 GOOG 2009-12-28 622.87 1697900    TRUE
     2 GOOG 2009-12-29 619.40 1424800    TRUE
     3 GOOG 2009-12-30 622.73 1465600    TRUE
     4 GOOG 2009-12-31 619.98 1219800    TRUE
     5 AAPL 2009-12-28 211.61 23003100 FALSE
     6 AAPL 2009-12-29 209.10 15868400 FALSE
     7 AAPL 2009-12-30 211.64 14696800 FALSE
     8 AAPL 2009-12-31 210.73 12571000 FALSE

      McKinney ()         Statistical Data Structures in Python   SciPy 2010   9 / 31
pandas library


    Began building at AQR in 2008, open-sourced late 2009
    Many goals
         Data structures to make working with statistical or “labeled” data sets
         easy and intuitive for non-experts
         Create a both user- and developer-friendly backbone for implementing
         statistical models
         Provide an integrated set of tools for common analyses
         Implement statistical models!




      McKinney ()          Statistical Data Structures in Python   SciPy 2010   10 / 31
pandas library


    Began building at AQR in 2008, open-sourced late 2009
    Many goals
         Data structures to make working with statistical or “labeled” data sets
         easy and intuitive for non-experts
         Create a both user- and developer-friendly backbone for implementing
         statistical models
         Provide an integrated set of tools for common analyses
         Implement statistical models!
    Takes some inspiration from R but aims also to improve in many
    areas (like data alignment)




      McKinney ()          Statistical Data Structures in Python   SciPy 2010   10 / 31
pandas library


    Began building at AQR in 2008, open-sourced late 2009
    Many goals
         Data structures to make working with statistical or “labeled” data sets
         easy and intuitive for non-experts
         Create a both user- and developer-friendly backbone for implementing
         statistical models
         Provide an integrated set of tools for common analyses
         Implement statistical models!
    Takes some inspiration from R but aims also to improve in many
    areas (like data alignment)
    Core idea: ndarrays with labeled axes and lots of methods




      McKinney ()          Statistical Data Structures in Python   SciPy 2010   10 / 31
pandas library


    Began building at AQR in 2008, open-sourced late 2009
    Many goals
         Data structures to make working with statistical or “labeled” data sets
         easy and intuitive for non-experts
         Create a both user- and developer-friendly backbone for implementing
         statistical models
         Provide an integrated set of tools for common analyses
         Implement statistical models!
    Takes some inspiration from R but aims also to improve in many
    areas (like data alignment)
    Core idea: ndarrays with labeled axes and lots of methods
    Etymology: panel data structures



      McKinney ()          Statistical Data Structures in Python   SciPy 2010   10 / 31
pandas DataFrame

Basically a pythonic data.frame, but with automatic data alignment!
Arithmetic operations align on row and column labels.

    >>> data = DataFrame.fromcsv(’data’, index_col=None)
         date           item     price      volume
    0    2009-12-28     GOOG     622.9      1.698e+06
    1    2009-12-29     GOOG     619.4      1.425e+06
    2    2009-12-30     GOOG     622.7      1.466e+06
    3    2009-12-31     GOOG     620        1.22e+06
    4    2009-12-28     AAPL     211.6      2.3e+07
    5    2009-12-29     AAPL     209.1      1.587e+07
    6    2009-12-30     AAPL     211.6      1.47e+07
    7    2009-12-31     AAPL     210.7      1.257e+07
    >>> df[’ind’] = df[’item’] == ’GOOG’


      McKinney ()        Statistical Data Structures in Python   SciPy 2010   11 / 31
How to organize the data?



Especially for larger data sets, we’d rather not pay O(# obs) to select a
subset of the data. O(1)-ish would be preferable

     >>> data[data[’item’] == ’GOOG’]
     array([(’GOOG’, ’2009-12-28’, 622.87, 1697900.0),
            (’GOOG’, ’2009-12-29’, 619.40, 1424800.0),
            (’GOOG’, ’2009-12-30’, 622.73, 1465600.0),
            (’GOOG’, ’2009-12-31’, 619.98, 1219800.0)],
           dtype=[(’item’, ’|S4’), (’date’, ’|S10’),
                  (’price’, ’<f8’), (’volume’, ’<f8’)])




      McKinney ()         Statistical Data Structures in Python   SciPy 2010   12 / 31
How to organize the data?


Really we have data on three dimensions: date, item, and data type. We
can pay upfront cost to pivot the data and save time later:


    >>> df = data.pivot(’date’, ’item’, ’price’)
    >>> df
                  AAPL           GOOG
    2009-12-28    211.6          622.9
    2009-12-29    209.1          619.4
    2009-12-30    211.6          622.7
    2009-12-31    210.7          620




      McKinney ()        Statistical Data Structures in Python   SciPy 2010   13 / 31
How to organize the data?


In this format, grabbing labeled, lower-dimensional slices is easy:

     >>> df[’AAPL’]
     2009-12-28     211.61
     2009-12-29     209.1
     2009-12-30     211.64
     2009-12-31     210.73

     >>> df.xs(’2009-12-28’)
     AAPL    211.61
     GOOG    622.87




       McKinney ()         Statistical Data Structures in Python   SciPy 2010   14 / 31
Data alignment


Data sets originating from different files or different database tables may
not always be homogenous:

    >>> s1                   >>> s2
    AAPL       0.044         AAPL           0.025
    IBM        0.050         BAR            0.158
    SAP        0.101         C              0.028
    GOOG       0.113         DB             0.087
    C          0.138         F              0.004
    SCGLY      0.037         GOOG           0.154
    BAR        0.200         IBM            0.034
    DB         0.281
    VW         0.040



      McKinney ()         Statistical Data Structures in Python   SciPy 2010   15 / 31
Data alignment

Arithmetic operations, etc., match on axis labels. Done in Cython so
significantly faster than pure Python.

    >>> s1 + s2
    AAPL     0.0686791008184
    BAR      0.358165479807
    C        0.16586702944
    DB       0.367679872693
    F        NaN
    GOOG     0.26666583847
    IBM      0.0833057542385
    SAP      NaN
    SCGLY    NaN
    VW       NaN


      McKinney ()         Statistical Data Structures in Python   SciPy 2010   16 / 31
Missing data handling

Since data points may be deemed “missing” or “masked”, having tools for
these makes sense.

    >>> (s1 + s2).fill(0)
    AAPL     0.0686791008184
    BAR      0.358165479807
    C        0.16586702944
    DB       0.367679872693
    F        0.0
    GOOG     0.26666583847
    IBM      0.0833057542385
    SAP      0.0
    SCGLY    0.0
    VW       0.0


      McKinney ()        Statistical Data Structures in Python   SciPy 2010   17 / 31
Missing data handling


    >>> (s1        + s2).valid()
    AAPL           0.0686791008184
    BAR            0.358165479807
    C              0.16586702944
    DB             0.367679872693
    GOOG           0.26666583847
    IBM            0.0833057542385

    >>> (s1 + s2).sum()
    1.3103630754662747

    >>> (s1 + s2).count()
    6


     McKinney ()            Statistical Data Structures in Python   SciPy 2010   18 / 31
Categorical data and “Group by”
Often want to compute descriptive stats on data given group designations:

    >>> s                   >>> cats
                                   industry
    AAPL       0.044        AAPL   TECH
    IBM        0.050        IBM    TECH
    SAP        0.101        SAP    TECH
    GOOG       0.113        GOOG   TECH
    C          0.138        C      FIN
    SCGLY      0.037        SCGLY FIN
    BAR        0.200        BAR    FIN
    DB         0.281        DB     FIN
    VW         0.040        VW     AUTO
                            RNO    AUTO
                            F      AUTO
                            TM     AUTO

      McKinney ()        Statistical Data Structures in Python   SciPy 2010   19 / 31
GroupBy in R


R users are spoiled by having vector recognized as something you might
want to “group by”:

    > labels
    [1] GOOG GOOG GOOG GOOG AAPL AAPL AAPL AAPL
    Levels: AAPL GOOG
    > data
    [1] 622.87 619.40 622.73 619.98 211.61 209.10
    211.64 210.73

    > tapply(data, labels, mean)
       AAPL    GOOG
    210.770 621.245



      McKinney ()        Statistical Data Structures in Python   SciPy 2010   20 / 31
GroupBy in pandas




We try to do something similar in pandas; the input can be any function or
dict-like object mapping labels to groups:

    >>> data.groupby(labels).aggregate(np.mean)
    AAPL    210.77
    GOOG    621.245




      McKinney ()         Statistical Data Structures in Python   SciPy 2010   21 / 31
GroupBy in pandas
More fancy things are possible, like “transforming” groups by arbitrary
functions:
     demean = lambda x: x - x.mean()

     def group_demean(obj, keyfunc):
         grouped = obj.groupby(keyfunc)
         return grouped.transform(demean)

     >>> group_demean(s, ind)
     AAPL     -0.0328370881632
     BAR      0.0358663891836
     C        -0.0261271326111
     DB       0.11719543981
     GOOG     0.035936259143
     IBM      -0.0272802815728
     SAP      0.024181110593
      McKinney ()         Statistical Data Structures in Python   SciPy 2010   22 / 31
Merging data sets



One commonly encounters a group of data sets which are not quite
identically-indexed:

>>> df1                                       >>> df2
                    AAPL    GOOG                                    MSFT       YHOO
2009-12-24          209     618.5             2009-12-24            31         16.72
2009-12-28          211.6   622.9             2009-12-28            31.17      16.88
2009-12-29          209.1   619.4             2009-12-29            31.39      16.92
2009-12-30          211.6   622.7             2009-12-30            30.96      16.98
2009-12-31          210.7   620




      McKinney ()           Statistical Data Structures in Python           SciPy 2010   23 / 31
Merging data sets



By default gluing these together on the row labels seems reasonable:

    >>> df1.join(df2)
                AAPL            GOOG             MSFT             YHOO
    2009-12-24 209              618.5            31               16.72
    2009-12-28 211.6            622.9            31.17            16.88
    2009-12-29 209.1            619.4            31.39            16.92
    2009-12-30 211.6            622.7            30.96            16.98
    2009-12-31 210.7            620              NaN              NaN




      McKinney ()         Statistical Data Structures in Python           SciPy 2010   24 / 31
Merging data sets


Returning to our first example, one might also wish to join on some other
key:

    >>> df.join(cats, on=’item’)
         date        industry item                         value
    0    2009-12-28 TECH      GOOG                         622.9
    1    2009-12-29 TECH      GOOG                         619.4
    2    2009-12-30 TECH      GOOG                         622.7
    3    2009-12-31 TECH      GOOG                         620
    4    2009-12-28 TECH      AAPL                         211.6
    5    2009-12-29 TECH      AAPL                         209.1
    6    2009-12-30 TECH      AAPL                         211.6
    7    2009-12-31 TECH      AAPL                         210.7



      McKinney ()        Statistical Data Structures in Python     SciPy 2010   25 / 31
Manipulating panel (3D) data
In finance, econometrics, etc. we frequently encounter panel data, i.e.
multiple data series for a group of individuals over time:
     >>> grunfeld
            capita        firm                 inv                value      year
     0      2.8           1                    317.6              3078       1935
     20     53.8          2                    209.9              1362       1935
     40     97.8          3                    33.1               1171       1935
     60     10.5          4                    40.29              417.5      1935
     80     183.2         5                    39.68              157.7      1935
     100    6.5           6                    20.36              197        1935
     120    100.2         7                    24.43              138        1935
     140    1.8           8                    12.93              191.5      1935
     160    162           9                    26.63              290.6      1935
     180    4.5           10                   2.54               70.91      1935
     1      52.6          1                    391.8              4662       1936
     ...
      McKinney ()         Statistical Data Structures in Python           SciPy 2010   26 / 31
Manipulating panel (3D) data


What you saw was the “stacked” or tabular format, but the 3D form can
be more useful at times:

    >>> lp = LongPanel.fromRecords(grunfeld, ’year’,
                                   ’firm’)
    >>> wp = lp.toWide()
    >>> wp
    <class ’pandas.core.panel.WidePanel’>
    Dimensions: 3 (items) x 20 (major) x 10 (minor)
    Items: capital to value
    Major axis: 1935 to 1954
    Minor axis: 1 to 10



      McKinney ()        Statistical Data Structures in Python   SciPy 2010   27 / 31
Manipulating panel (3D) data

What you saw was the “stacked” or tabular format, but the 3D form can
be more useful at times:

    >>> wp[’capital’].head()
        1935      1936      1937                             1938    1939
    1   2.8       265       53.8                             213.8   97.8
    2   52.6      402.2     50.5                             132.6   104.4
    3   156.9     761.5     118.1                            264.8   118
    4   209.2     922.4     260.2                            306.9   156.2
    5   203.4     1020      312.7                            351.1   172.6
    6   207.2     1099      254.2                            357.8   186.6
    7   255.2     1208      261.4                            342.1   220.9
    8   303.7     1430      298.7                            444.2   287.8
    9   264.1     1777      301.8                            623.6   319.9
    10 201.6      2226      279.1                            669.7   321.3

      McKinney ()        Statistical Data Structures in Python        SciPy 2010   28 / 31
Manipulating panel (3D) data

What you saw was the “stacked” or tabular format, but the 3D form can
be more useful at times:

     # mean over time for each firm
    >>> wp.mean(axis=’major’)
          capital     inv         value
    1     140.8       98.45       923.8
    2     153.9       131.5       1142
    3     205.4       134.8       1140
    4     244.2       115.8       872.1
    5     269.9       109.9       998.9
    6     281.7       132.2       1056
    7     301.7       169.7       1148
    8     344.8       173.3       1068
    9     389.2       196.7       1236
    10    428.5       197.4       1233
      McKinney ()        Statistical Data Structures in Python   SciPy 2010   29 / 31
Implementing statistical models




    Common issues
         Model specification (think R formulas)
         Data cleaning
         Attaching metadata (labels) to variables
    To the extent possible, should make the user’s life easy
    Short demo




      McKinney ()          Statistical Data Structures in Python   SciPy 2010   30 / 31
Conclusions




   Let’s attract more (statistical) users to Python by providing superior
   tools!
   Related projects: larry (la), tabular, datarray, others...
   Come to the BoF today at 6 pm
   pandas Website: http://pandas.sourceforge.net
   Contact: wesmckinn@gmail.com




     McKinney ()          Statistical Data Structures in Python   SciPy 2010   31 / 31

Mais conteúdo relacionado

Mais procurados

Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsWes McKinney
 
Python Collections Tutorial | Edureka
Python Collections Tutorial | EdurekaPython Collections Tutorial | Edureka
Python Collections Tutorial | EdurekaEdureka!
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
Introduction to matplotlib
Introduction to matplotlibIntroduction to matplotlib
Introduction to matplotlibPiyush rai
 
String Manipulation in Python
String Manipulation in PythonString Manipulation in Python
String Manipulation in PythonPooja B S
 
Exploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubExploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubMartin Bago
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To PythonVanessa Rene
 
Python Pandas
Python PandasPython Pandas
Python PandasSunil OS
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization Sourabh Sahu
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPyDevashish Kumar
 
Html html5 diapositivas
Html  html5 diapositivasHtml  html5 diapositivas
Html html5 diapositivasEliana Caycedo
 

Mais procurados (20)

NLTK in 20 minutes
NLTK in 20 minutesNLTK in 20 minutes
NLTK in 20 minutes
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
 
Python Collections Tutorial | Edureka
Python Collections Tutorial | EdurekaPython Collections Tutorial | Edureka
Python Collections Tutorial | Edureka
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Pandas csv
Pandas csvPandas csv
Pandas csv
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data Tutorial
 
Introduction to matplotlib
Introduction to matplotlibIntroduction to matplotlib
Introduction to matplotlib
 
String Manipulation in Python
String Manipulation in PythonString Manipulation in Python
String Manipulation in Python
 
Bootstrap ppt
Bootstrap pptBootstrap ppt
Bootstrap ppt
 
Exploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubExploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science Club
 
Matplotlib
MatplotlibMatplotlib
Matplotlib
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To Python
 
Python Pandas
Python PandasPython Pandas
Python Pandas
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
 
Python list
Python listPython list
Python list
 
Pandas
PandasPandas
Pandas
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPy
 
Html html5 diapositivas
Html  html5 diapositivasHtml  html5 diapositivas
Html html5 diapositivas
 
Text mining
Text miningText mining
Text mining
 

Semelhante a Data Structures for Statistical Computing in Python

An Overview of Python for Data Analytics
An Overview of Python for Data AnalyticsAn Overview of Python for Data Analytics
An Overview of Python for Data AnalyticsIRJET Journal
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docxrohithprabhas1
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...Alex Liu
 
TUW-ASE Summer 2015 - Quality of Result-aware data analytics
TUW-ASE Summer 2015 - Quality of Result-aware data analyticsTUW-ASE Summer 2015 - Quality of Result-aware data analytics
TUW-ASE Summer 2015 - Quality of Result-aware data analyticsHong-Linh Truong
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analyticsSteven Hugo
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analyticsJan Vandevelde
 
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham AL
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham ALSecrets of Enterprise Data Mining: SQL Saturday 328 Birmingham AL
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham ALMark Tabladillo
 
Data science technology overview
Data science technology overviewData science technology overview
Data science technology overviewSoojung Hong
 
Data analytics using R programming
Data analytics using R programmingData analytics using R programming
Data analytics using R programmingUmang Singh
 
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumSimplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumVMware Tanzu
 
BAS 150 Lesson 1 Lecture
BAS 150 Lesson 1 LectureBAS 150 Lesson 1 Lecture
BAS 150 Lesson 1 LectureWake Tech BAS
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxhkabir55
 
Belladati Meetup Singapore Workshop
Belladati Meetup Singapore WorkshopBelladati Meetup Singapore Workshop
Belladati Meetup Singapore Workshopbelladati
 
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...yashbheda
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsHong-Linh Truong
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinGuido Schmutz
 

Semelhante a Data Structures for Statistical Computing in Python (20)

An Overview of Python for Data Analytics
An Overview of Python for Data AnalyticsAn Overview of Python for Data Analytics
An Overview of Python for Data Analytics
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
2015 03-28-eb-final
2015 03-28-eb-final2015 03-28-eb-final
2015 03-28-eb-final
 
FDS_dept_ppt.pptx
FDS_dept_ppt.pptxFDS_dept_ppt.pptx
FDS_dept_ppt.pptx
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
 
TUW-ASE Summer 2015 - Quality of Result-aware data analytics
TUW-ASE Summer 2015 - Quality of Result-aware data analyticsTUW-ASE Summer 2015 - Quality of Result-aware data analytics
TUW-ASE Summer 2015 - Quality of Result-aware data analytics
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analytics
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analytics
 
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham AL
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham ALSecrets of Enterprise Data Mining: SQL Saturday 328 Birmingham AL
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham AL
 
Data science technology overview
Data science technology overviewData science technology overview
Data science technology overview
 
Data analytics using R programming
Data analytics using R programmingData analytics using R programming
Data analytics using R programming
 
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumSimplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
 
BAS 150 Lesson 1 Lecture
BAS 150 Lesson 1 LectureBAS 150 Lesson 1 Lecture
BAS 150 Lesson 1 Lecture
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptx
 
Belladati Meetup Singapore Workshop
Belladati Meetup Singapore WorkshopBelladati Meetup Singapore Workshop
Belladati Meetup Singapore Workshop
 
CSE509 Lecture 4
CSE509 Lecture 4CSE509 Lecture 4
CSE509 Lecture 4
 
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflows
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
 

Mais de Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 

Mais de Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

Último

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Último (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Data Structures for Statistical Computing in Python

  • 1. Data structures for statistical computing in Python Wes McKinney SciPy 2010 McKinney () Statistical Data Structures in Python SciPy 2010 1 / 31
  • 2. Environments for statistics and data analysis The usual suspects: R / S+, MATLAB, Stata, SAS, etc. Python being used increasingly in statistical or related applications scikits.statsmodels: linear models and other econometric estimators PyMC: Bayesian MCMC estimation scikits.learn: machine learning algorithms Many interfaces to mostly non-Python libraries (pycluster, SHOGUN, Orange, etc.) And others (look at the SciPy conference schedule!) How can we attract more statistical users to Python? McKinney () Statistical Data Structures in Python SciPy 2010 2 / 31
  • 3. What matters to statistical users? Standard suite of linear algebra, matrix operations (NumPy, SciPy) Availability of statistical models and functions More than there used to be, but nothing compared to R / CRAN rpy2 is coming along, but it doesn’t seem to be an “end-user” project Data visualization and graphics tools (matplotlib, ...) Interactive research environment (IPython) McKinney () Statistical Data Structures in Python SciPy 2010 3 / 31
  • 4. What matters to statistical users? (cont’d) Easy installation and sources of community support Well-written and navigable documentation Robust input / output tools Flexible data structures and data manipulation tools McKinney () Statistical Data Structures in Python SciPy 2010 4 / 31
  • 5. What matters to statistical users? (cont’d) Easy installation and sources of community support Well-written and navigable documentation Robust input / output tools Flexible data structures and data manipulation tools McKinney () Statistical Data Structures in Python SciPy 2010 5 / 31
  • 6. Statistical data sets Statistical data sets commonly arrive in tabular format, i.e. as a two-dimensional list of observations and names for the fields of each observation. array([(’GOOG’, ’2009-12-28’, 622.87, 1697900.0), (’GOOG’, ’2009-12-29’, 619.40, 1424800.0), (’GOOG’, ’2009-12-30’, 622.73, 1465600.0), (’GOOG’, ’2009-12-31’, 619.98, 1219800.0), (’AAPL’, ’2009-12-28’, 211.61, 23003100.0), (’AAPL’, ’2009-12-29’, 209.10, 15868400.0), (’AAPL’, ’2009-12-30’, 211.64, 14696800.0), (’AAPL’, ’2009-12-31’, 210.73, 12571000.0)], dtype=[(’item’, ’|S4’), (’date’, ’|S10’), (’price’, ’<f8’), (’volume’, ’<f8’)]) McKinney () Statistical Data Structures in Python SciPy 2010 6 / 31
  • 7. Structured arrays Structured arrays are great for many applications, but not always great for general data analysis Pros Fast, memory-efficient, good for loading and saving big data Nested dtypes help manage hierarchical data McKinney () Statistical Data Structures in Python SciPy 2010 7 / 31
  • 8. Structured arrays Structured arrays are great for many applications, but not always great for general data analysis Pros Fast, memory-efficient, good for loading and saving big data Nested dtypes help manage hierarchical data Cons Can’t be immediately used in many (most?) NumPy methods Are not flexible in size (have to use or write auxiliary methods to “add” fields) Not too many built-in data manipulation methods Selecting subsets is often O(n)! McKinney () Statistical Data Structures in Python SciPy 2010 7 / 31
  • 9. Structured arrays Structured arrays are great for many applications, but not always great for general data analysis Pros Fast, memory-efficient, good for loading and saving big data Nested dtypes help manage hierarchical data Cons Can’t be immediately used in many (most?) NumPy methods Are not flexible in size (have to use or write auxiliary methods to “add” fields) Not too many built-in data manipulation methods Selecting subsets is often O(n)! What can be learned from other statistical languages? McKinney () Statistical Data Structures in Python SciPy 2010 7 / 31
  • 10. R’s data.frame One of the core data structures of the R language. In many ways similar to a structured array. > df <- read.csv(’data’) item date price volume 1 GOOG 2009-12-28 622.87 1697900 2 GOOG 2009-12-29 619.40 1424800 3 GOOG 2009-12-30 622.73 1465600 4 GOOG 2009-12-31 619.98 1219800 5 AAPL 2009-12-28 211.61 23003100 6 AAPL 2009-12-29 209.10 15868400 7 AAPL 2009-12-30 211.64 14696800 8 AAPL 2009-12-31 210.73 12571000 McKinney () Statistical Data Structures in Python SciPy 2010 8 / 31
  • 11. R’s data.frame Perhaps more like a mutable dictionary of vectors. Much of R’s statistical estimators and 3rd-party libraries are designed to be used with data.frame objects. > df$isgoog <- df$item == "GOOG" > df item date price volume isgoog 1 GOOG 2009-12-28 622.87 1697900 TRUE 2 GOOG 2009-12-29 619.40 1424800 TRUE 3 GOOG 2009-12-30 622.73 1465600 TRUE 4 GOOG 2009-12-31 619.98 1219800 TRUE 5 AAPL 2009-12-28 211.61 23003100 FALSE 6 AAPL 2009-12-29 209.10 15868400 FALSE 7 AAPL 2009-12-30 211.64 14696800 FALSE 8 AAPL 2009-12-31 210.73 12571000 FALSE McKinney () Statistical Data Structures in Python SciPy 2010 9 / 31
  • 12. pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or “labeled” data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31
  • 13. pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or “labeled” data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! Takes some inspiration from R but aims also to improve in many areas (like data alignment) McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31
  • 14. pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or “labeled” data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! Takes some inspiration from R but aims also to improve in many areas (like data alignment) Core idea: ndarrays with labeled axes and lots of methods McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31
  • 15. pandas library Began building at AQR in 2008, open-sourced late 2009 Many goals Data structures to make working with statistical or “labeled” data sets easy and intuitive for non-experts Create a both user- and developer-friendly backbone for implementing statistical models Provide an integrated set of tools for common analyses Implement statistical models! Takes some inspiration from R but aims also to improve in many areas (like data alignment) Core idea: ndarrays with labeled axes and lots of methods Etymology: panel data structures McKinney () Statistical Data Structures in Python SciPy 2010 10 / 31
  • 16. pandas DataFrame Basically a pythonic data.frame, but with automatic data alignment! Arithmetic operations align on row and column labels. >>> data = DataFrame.fromcsv(’data’, index_col=None) date item price volume 0 2009-12-28 GOOG 622.9 1.698e+06 1 2009-12-29 GOOG 619.4 1.425e+06 2 2009-12-30 GOOG 622.7 1.466e+06 3 2009-12-31 GOOG 620 1.22e+06 4 2009-12-28 AAPL 211.6 2.3e+07 5 2009-12-29 AAPL 209.1 1.587e+07 6 2009-12-30 AAPL 211.6 1.47e+07 7 2009-12-31 AAPL 210.7 1.257e+07 >>> df[’ind’] = df[’item’] == ’GOOG’ McKinney () Statistical Data Structures in Python SciPy 2010 11 / 31
  • 17. How to organize the data? Especially for larger data sets, we’d rather not pay O(# obs) to select a subset of the data. O(1)-ish would be preferable >>> data[data[’item’] == ’GOOG’] array([(’GOOG’, ’2009-12-28’, 622.87, 1697900.0), (’GOOG’, ’2009-12-29’, 619.40, 1424800.0), (’GOOG’, ’2009-12-30’, 622.73, 1465600.0), (’GOOG’, ’2009-12-31’, 619.98, 1219800.0)], dtype=[(’item’, ’|S4’), (’date’, ’|S10’), (’price’, ’<f8’), (’volume’, ’<f8’)]) McKinney () Statistical Data Structures in Python SciPy 2010 12 / 31
  • 18. How to organize the data? Really we have data on three dimensions: date, item, and data type. We can pay upfront cost to pivot the data and save time later: >>> df = data.pivot(’date’, ’item’, ’price’) >>> df AAPL GOOG 2009-12-28 211.6 622.9 2009-12-29 209.1 619.4 2009-12-30 211.6 622.7 2009-12-31 210.7 620 McKinney () Statistical Data Structures in Python SciPy 2010 13 / 31
  • 19. How to organize the data? In this format, grabbing labeled, lower-dimensional slices is easy: >>> df[’AAPL’] 2009-12-28 211.61 2009-12-29 209.1 2009-12-30 211.64 2009-12-31 210.73 >>> df.xs(’2009-12-28’) AAPL 211.61 GOOG 622.87 McKinney () Statistical Data Structures in Python SciPy 2010 14 / 31
  • 20. Data alignment Data sets originating from different files or different database tables may not always be homogenous: >>> s1 >>> s2 AAPL 0.044 AAPL 0.025 IBM 0.050 BAR 0.158 SAP 0.101 C 0.028 GOOG 0.113 DB 0.087 C 0.138 F 0.004 SCGLY 0.037 GOOG 0.154 BAR 0.200 IBM 0.034 DB 0.281 VW 0.040 McKinney () Statistical Data Structures in Python SciPy 2010 15 / 31
  • 21. Data alignment Arithmetic operations, etc., match on axis labels. Done in Cython so significantly faster than pure Python. >>> s1 + s2 AAPL 0.0686791008184 BAR 0.358165479807 C 0.16586702944 DB 0.367679872693 F NaN GOOG 0.26666583847 IBM 0.0833057542385 SAP NaN SCGLY NaN VW NaN McKinney () Statistical Data Structures in Python SciPy 2010 16 / 31
  • 22. Missing data handling Since data points may be deemed “missing” or “masked”, having tools for these makes sense. >>> (s1 + s2).fill(0) AAPL 0.0686791008184 BAR 0.358165479807 C 0.16586702944 DB 0.367679872693 F 0.0 GOOG 0.26666583847 IBM 0.0833057542385 SAP 0.0 SCGLY 0.0 VW 0.0 McKinney () Statistical Data Structures in Python SciPy 2010 17 / 31
  • 23. Missing data handling >>> (s1 + s2).valid() AAPL 0.0686791008184 BAR 0.358165479807 C 0.16586702944 DB 0.367679872693 GOOG 0.26666583847 IBM 0.0833057542385 >>> (s1 + s2).sum() 1.3103630754662747 >>> (s1 + s2).count() 6 McKinney () Statistical Data Structures in Python SciPy 2010 18 / 31
  • 24. Categorical data and “Group by” Often want to compute descriptive stats on data given group designations: >>> s >>> cats industry AAPL 0.044 AAPL TECH IBM 0.050 IBM TECH SAP 0.101 SAP TECH GOOG 0.113 GOOG TECH C 0.138 C FIN SCGLY 0.037 SCGLY FIN BAR 0.200 BAR FIN DB 0.281 DB FIN VW 0.040 VW AUTO RNO AUTO F AUTO TM AUTO McKinney () Statistical Data Structures in Python SciPy 2010 19 / 31
  • 25. GroupBy in R R users are spoiled by having vector recognized as something you might want to “group by”: > labels [1] GOOG GOOG GOOG GOOG AAPL AAPL AAPL AAPL Levels: AAPL GOOG > data [1] 622.87 619.40 622.73 619.98 211.61 209.10 211.64 210.73 > tapply(data, labels, mean) AAPL GOOG 210.770 621.245 McKinney () Statistical Data Structures in Python SciPy 2010 20 / 31
  • 26. GroupBy in pandas We try to do something similar in pandas; the input can be any function or dict-like object mapping labels to groups: >>> data.groupby(labels).aggregate(np.mean) AAPL 210.77 GOOG 621.245 McKinney () Statistical Data Structures in Python SciPy 2010 21 / 31
  • 27. GroupBy in pandas More fancy things are possible, like “transforming” groups by arbitrary functions: demean = lambda x: x - x.mean() def group_demean(obj, keyfunc): grouped = obj.groupby(keyfunc) return grouped.transform(demean) >>> group_demean(s, ind) AAPL -0.0328370881632 BAR 0.0358663891836 C -0.0261271326111 DB 0.11719543981 GOOG 0.035936259143 IBM -0.0272802815728 SAP 0.024181110593 McKinney () Statistical Data Structures in Python SciPy 2010 22 / 31
  • 28. Merging data sets One commonly encounters a group of data sets which are not quite identically-indexed: >>> df1 >>> df2 AAPL GOOG MSFT YHOO 2009-12-24 209 618.5 2009-12-24 31 16.72 2009-12-28 211.6 622.9 2009-12-28 31.17 16.88 2009-12-29 209.1 619.4 2009-12-29 31.39 16.92 2009-12-30 211.6 622.7 2009-12-30 30.96 16.98 2009-12-31 210.7 620 McKinney () Statistical Data Structures in Python SciPy 2010 23 / 31
  • 29. Merging data sets By default gluing these together on the row labels seems reasonable: >>> df1.join(df2) AAPL GOOG MSFT YHOO 2009-12-24 209 618.5 31 16.72 2009-12-28 211.6 622.9 31.17 16.88 2009-12-29 209.1 619.4 31.39 16.92 2009-12-30 211.6 622.7 30.96 16.98 2009-12-31 210.7 620 NaN NaN McKinney () Statistical Data Structures in Python SciPy 2010 24 / 31
  • 30. Merging data sets Returning to our first example, one might also wish to join on some other key: >>> df.join(cats, on=’item’) date industry item value 0 2009-12-28 TECH GOOG 622.9 1 2009-12-29 TECH GOOG 619.4 2 2009-12-30 TECH GOOG 622.7 3 2009-12-31 TECH GOOG 620 4 2009-12-28 TECH AAPL 211.6 5 2009-12-29 TECH AAPL 209.1 6 2009-12-30 TECH AAPL 211.6 7 2009-12-31 TECH AAPL 210.7 McKinney () Statistical Data Structures in Python SciPy 2010 25 / 31
  • 31. Manipulating panel (3D) data In finance, econometrics, etc. we frequently encounter panel data, i.e. multiple data series for a group of individuals over time: >>> grunfeld capita firm inv value year 0 2.8 1 317.6 3078 1935 20 53.8 2 209.9 1362 1935 40 97.8 3 33.1 1171 1935 60 10.5 4 40.29 417.5 1935 80 183.2 5 39.68 157.7 1935 100 6.5 6 20.36 197 1935 120 100.2 7 24.43 138 1935 140 1.8 8 12.93 191.5 1935 160 162 9 26.63 290.6 1935 180 4.5 10 2.54 70.91 1935 1 52.6 1 391.8 4662 1936 ... McKinney () Statistical Data Structures in Python SciPy 2010 26 / 31
  • 32. Manipulating panel (3D) data What you saw was the “stacked” or tabular format, but the 3D form can be more useful at times: >>> lp = LongPanel.fromRecords(grunfeld, ’year’, ’firm’) >>> wp = lp.toWide() >>> wp <class ’pandas.core.panel.WidePanel’> Dimensions: 3 (items) x 20 (major) x 10 (minor) Items: capital to value Major axis: 1935 to 1954 Minor axis: 1 to 10 McKinney () Statistical Data Structures in Python SciPy 2010 27 / 31
  • 33. Manipulating panel (3D) data What you saw was the “stacked” or tabular format, but the 3D form can be more useful at times: >>> wp[’capital’].head() 1935 1936 1937 1938 1939 1 2.8 265 53.8 213.8 97.8 2 52.6 402.2 50.5 132.6 104.4 3 156.9 761.5 118.1 264.8 118 4 209.2 922.4 260.2 306.9 156.2 5 203.4 1020 312.7 351.1 172.6 6 207.2 1099 254.2 357.8 186.6 7 255.2 1208 261.4 342.1 220.9 8 303.7 1430 298.7 444.2 287.8 9 264.1 1777 301.8 623.6 319.9 10 201.6 2226 279.1 669.7 321.3 McKinney () Statistical Data Structures in Python SciPy 2010 28 / 31
  • 34. Manipulating panel (3D) data What you saw was the “stacked” or tabular format, but the 3D form can be more useful at times: # mean over time for each firm >>> wp.mean(axis=’major’) capital inv value 1 140.8 98.45 923.8 2 153.9 131.5 1142 3 205.4 134.8 1140 4 244.2 115.8 872.1 5 269.9 109.9 998.9 6 281.7 132.2 1056 7 301.7 169.7 1148 8 344.8 173.3 1068 9 389.2 196.7 1236 10 428.5 197.4 1233 McKinney () Statistical Data Structures in Python SciPy 2010 29 / 31
  • 35. Implementing statistical models Common issues Model specification (think R formulas) Data cleaning Attaching metadata (labels) to variables To the extent possible, should make the user’s life easy Short demo McKinney () Statistical Data Structures in Python SciPy 2010 30 / 31
  • 36. Conclusions Let’s attract more (statistical) users to Python by providing superior tools! Related projects: larry (la), tabular, datarray, others... Come to the BoF today at 6 pm pandas Website: http://pandas.sourceforge.net Contact: wesmckinn@gmail.com McKinney () Statistical Data Structures in Python SciPy 2010 31 / 31