A look inside pandas
design and development
          Wes McKinney
        Lambda Foundry, Inc.

            @wesmckinn

    NYC Python Meetup, 1/10/2012


                                   1
a.k.a. “Pragmatic Python
 for high performance
       data analysis”



                           2
a.k.a. “Rise of the pandas”



                              3
Me




     4
More like...




SPEED!!!

                          5
Or maybe... (j/k)




                    6
Me
• Mathematician at heart
• 3 years in the quant finance industry
• Last 2: statistics + freelance + open source
• My new company: Lambda Foundry
 • Building analytics and tools for finance
    and other domains


                                                 7
Me
• Blog: http://blog.wesmckinney.com
• GitHub: http://github.com/wesm
• Twitter: @wesmckinn
• Working on “Python for Data Analysis” for
  O’Reilly Media
• Giving PyCon tutorial on pandas (!)

                                              8
pandas?
• http://pandas.sf.net
• Swiss-army knife of (in-memory) data
  manipulation in Python

  • Like R’s data.frame on steroids
  • Excellent performance
  • Easy-to-use, highly consistent API
• A foundation for data analysis in Python
                                             9
pandas

• In heavy production use in the financial
  industry
• Generally much better performance than
  other open source alternatives (e.g. R)
• Hope: basis for the “next generation” data
  analytical environment in Python



                                               10
Simplifying data wrangling

• Data munging / preparation / cleaning /
  integration is slow, error prone, and time
  consuming
• Everyone already <3’s Python for data
  wrangling: pandas takes it to the next level




                                                 11
Explosive pandas growth
• Last 6 months: 240 files changed
  49428 insertions(+), 15358 deletions(-)

                          Cython-generated C removed




                                                       12
Rigorous unit testing
• Need to be able to trust your $1e3/e6/e9s
  to pandas
• > 98% line coverage as measured by
  coverage.py
• v0.3.0 (2/19/2011): 533 test functions
• v0.7.0 (1/09/2012): 1272 test functions

                                              13
Some development asides
• I get a lot of questions about my dev env
• Emacs + IPython FTW
• Indispensable development tools
 • pdb (and IPython-enhanced pdb)
 • pylint / pyflakes (integrated with Emacs)
 • nose
 • coverage.py
• grin, for searching code. >> ack/grep IMHO
                                               14
IPython
• Matthew Goodman: “If you are not using
  this tool, you are doing it wrong!”

• Tab completion, introspection, interactive
  debugger, command history

• Designed to enhance your productivity in
  every way. I can’t live without it

• IPython HTML notebook is a game changer
                                               15
Profiling and optimization
• %time, %timeit in IPython
• %prun, to profile a statement with cProfile
• %run -p to profile whole programs
• line_profiler module, for line-by-line timing
• Optimization: find right algorithm first.
  Cython-ize the bottlenecks (if need be)
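
Outside IPython, roughly the same measurements can be taken with the standard library. The sketch below is only an illustration: `slow_group_sum` and its inputs are made up for the example, `timeit.repeat` stands in for %timeit, and `cProfile` plus `pstats` stand in for %prun.

```python
import cProfile
import io
import pstats
import timeit


def slow_group_sum(labels, values):
    """Deliberately naive O(N * K) group sum, used only as something to measure."""
    return {
        label: sum(v for lab, v in zip(labels, values) if lab == label)
        for label in set(labels)
    }


# Made-up inputs: 5,000 points in 50 groups.
labels = [i % 50 for i in range(5000)]
values = list(range(5000))

# Rough equivalent of %timeit: best of several repeats, reported per call.
n_calls = 5
best = min(timeit.repeat(lambda: slow_group_sum(labels, values),
                         number=n_calls, repeat=3)) / n_calls
print("best per-call time: %.4f s" % best)

# Rough equivalent of %prun: profile one call, sort by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
result = slow_group_sum(labels, values)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The profile output points at the hot function first; only then is it worth deciding whether a better algorithm or Cython is the right fix.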


                                                 16
Other things that matter
• Follow PEP8 religiously
 • Naming conventions, other code style
 • 80 characters per line hard limit
• Test more than you think you need to, aim
  for 100% line coverage

• Avoid long functions (> 50 lines), refactor
  aggressively

                                                17
I’m serious about
  function length




 http://gist.github.com/1580880
                                  18
Don’t make a mess




        Uncle Bob

YouTube: “What killed Smalltalk could kill s/Ruby/Python, too”
                                                                 19
Other stuff
• Good keyboard




                       20
Other stuff
• Big monitors




                         21
Other stuff
• Ergonomic chair (good hacking posture)




                                           22
pandas DataFrame
•    Jack-of-all-trades tabular data structure
    In [10]: tips[:10]
    Out[10]:
        total_bill tip     sex      smoker   day   time     size
    1   16.99       1.01   Female   No       Sun   Dinner   2
    2   10.34       1.66   Male     No       Sun   Dinner   3
    3   21.01       3.50   Male     No       Sun   Dinner   3
    4   23.68       3.31   Male     No       Sun   Dinner   2
    5   24.59       3.61   Female   No       Sun   Dinner   4
    6   25.29       4.71   Male     No       Sun   Dinner   4
    7   8.770       2.00   Male     No       Sun   Dinner   2
    8   26.88       3.12   Male     No       Sun   Dinner   4
    9   15.04       1.96   Male     No       Sun   Dinner   2
    10 14.78        3.23   Male     No       Sun   Dinner   2


                                                                   23
DataFrame

• Heterogeneous columns
• Data alignment and axis indexing
• No-copy data selection (!)
• Agile reshaping
• Fast joining, merging, concatenation

                                         24
DataFrame
• Axis indexing enables rich data alignment,
  joins / merges, reshaping, selection, etc.

  day             Fri     Sat     Sun     Thur
  sex    smoker
  Female No       3.125   2.725   3.329   2.460
         Yes      2.683   2.869   3.500   2.990
  Male   No       2.500   3.257   3.115   2.942
         Yes      2.741   2.879   3.521   3.058




                                                  25
Let’s have a little fun

 To the IPython Notebook, Batman

http://ashleyw.co.uk/project/food-nutrient-database




                                                      26
Axis indexing, the special
 pandas-flavored sauce
• Enables “alignment-free” programming
• Prevents major source of data munging
  frustration and errors
• Fast data selection (O(1) or O(log n))
• Powerful way of describing reshape / join /
  merge / pivot-table operations


                                                27
Data alignment, join ops

• The brains live in the axis index
• Indexes know how to do set logic
• Join/align ops: produce “indexers”
 • Mapping between source/output
• Indexer passed to fast “take” function

                                           28
Index join example
left  JOIN  right      joined   lidx   ridx
 d           a           a       -1      0
 b           b           b        1      1
 c           c           c        2      2
 e                       d        0     -1
                         e        3     -1

left_values.take(lidx, axis)       reindexed data
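
A minimal pure-Python sketch of the example above, assuming plain lists in place of Index objects and NumPy arrays. The helpers `get_indexer` and `take` are written here for illustration (they are not the pandas internals), and `left_values` is made-up data.

```python
def get_indexer(source, target):
    """For each label in target, give its position in source, or -1 if absent."""
    positions = {label: i for i, label in enumerate(source)}  # hash table
    return [positions.get(label, -1) for label in target]


def take(values, indexer, fill=None):
    """Gather values by position; -1 marks a missing row and yields fill."""
    return [values[i] if i != -1 else fill for i in indexer]


left = ["d", "b", "c", "e"]
right = ["a", "b", "c"]

# Outer join of two indexes: the sorted union of their labels.
joined = sorted(set(left) | set(right))
lidx = get_indexer(left, joined)
ridx = get_indexer(right, joined)

left_values = [10.0, 20.0, 30.0, 40.0]   # made-up data aligned with left
reindexed = take(left_values, lidx)

print(joined)      # ['a', 'b', 'c', 'd', 'e']
print(lidx)        # [-1, 1, 2, 0, 3]
print(ridx)        # [0, 1, 2, -1, -1]
print(reindexed)   # [None, 20.0, 30.0, 10.0, 40.0]
```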

                                                    29
Implementing index joins
• Completely irregular case: use hash tables
• Monotonic / increasing values
 • Faster specialized left/right/inner/outer
    join routines, especially for native types
    (int32/64, datetime64)
• Lookup hash table is persisted inside the
  Index object!
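
For the monotonic case, one way such a specialized routine could look is a two-pointer merge, which needs no hash table at all. This is a sketch under the assumption that both inputs are sorted and free of duplicates; `inner_join_sorted` is a name chosen for the example.

```python
def inner_join_sorted(left, right):
    """Inner-join two sorted, duplicate-free label lists in one linear pass.

    Returns (joined, lidx, ridx); lidx and ridx are positions into the inputs.
    """
    joined, lidx, ridx = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            joined.append(left[i])
            lidx.append(i)
            ridx.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1          # left label has no partner, skip it
        else:
            j += 1          # right label has no partner, skip it
    return joined, lidx, ridx


joined, lidx, ridx = inner_join_sorted([1, 3, 4, 7, 9], [2, 3, 7, 8, 9, 10])
print(joined)  # [3, 7, 9]
print(lidx)    # [1, 3, 4]
print(ridx)    # [1, 2, 4]
```

Each element is visited at most once, so the cost is O(N_left + N_right) with purely sequential memory access.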


                                                 30
Um, hash table?
    left (hash table)        joined   indexer
    { d: 0,                    a        -1
      b: 1,           map      b         1
      c: 2,                    c         2
      e: 3 }                   d         0
                               e         3




                                    31
Hash tables
• Form the core of many critical pandas
  algorithms
 • unique (for set intersection / union)
 • “factor”ize
 • groupby
 • join / merge / align

                                           32
GroupBy, a brief
 algorithmic exploration
• Simple problem: compute group sums for a
  vector given group identifications
 labels  values        unique labels   group sums
   b       -1                a              2
   b        3                b              4
   a        2
   a        3
   b        2
   a       -4
   a        1
                                              33
GroupBy: Algo #1

unique_labels = np.unique(labels)
results = np.empty(len(unique_labels))

for i, label in enumerate(unique_labels):
    results[i] = values[labels == label].sum()



    For all these examples, assume N data
         points and K unique groups


                                                 34
GroupBy: Algo #1, don’t do this

 unique_labels = np.unique(labels)
 results = np.empty(len(unique_labels))

 for i, label in enumerate(unique_labels):
     results[i] = values[labels == label].sum()


Some obvious problems
  • O(N * K) comparisons. Slow for large K
  • K passes through values
  • numpy.unique is pretty slow (more on this later)
                                                       35
GroupBy: Algo #2
Make this dict in O(N) (pseudocode)
   g_inds = {label : [i where labels[i] == label]}
Now
    for i, label in enumerate(unique_labels):
        indices = g_inds[label]
        label_values = values.take(indices)
        result[i] = label_values.sum()


 Pros: one pass through values. ~O(N) for N >> K
 Cons: g_inds can be built in O(N), but too many
 list/dict API calls, even using Cython
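
A runnable pure-Python version of the pseudocode above, assuming plain lists in place of NumPy arrays. `group_sums_algo2` is a name chosen for the example; the data is the labels/values example from the earlier GroupBy slide.

```python
from collections import defaultdict


def group_sums_algo2(labels, values):
    """Group sums via a dict of index lists (the g_inds dict)."""
    g_inds = defaultdict(list)
    for i, label in enumerate(labels):      # single O(N) pass over labels
        g_inds[label].append(i)

    unique_labels = sorted(g_inds)          # O(K log K)
    results = [sum(values[i] for i in g_inds[label])
               for label in unique_labels]
    return unique_labels, results


labels = ["b", "b", "a", "a", "b", "a", "a"]
values = [-1, 3, 2, 3, 2, -4, 1]
print(group_sums_algo2(labels, values))  # (['a', 'b'], [2, 4])
```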

                                                     36
GroupBy: Algo #3, much faster
 • “Factorize” labels
  • Produce vector of integers from 0, ..., K-1
     corresponding to the unique observed
      values (use a hash table)
   result = np.zeros(k)
   for i, j in enumerate(factorized_labels):
       result[j] += values[i]

Pros: avoid expensive dict-of-lists creation. Avoid
numpy.unique and have the option not to sort the
unique labels, skipping O(K lg K) work
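
A pure-Python sketch of Algo #3 without sorting, assuming a dict as the hash table and lists in place of arrays. `factorize` and `group_add` here are illustrative stand-ins written for this example, not the Cython routines.

```python
def factorize(labels):
    """Map labels to integer ids 0..K-1 in first-seen order (one hash pass)."""
    table = {}
    ids = []
    uniques = []
    for label in labels:
        idx = table.get(label)
        if idx is None:
            idx = len(uniques)
            table[label] = idx
            uniques.append(label)
        ids.append(idx)
    return ids, uniques


def group_add(values, ids, k):
    """Accumulate values into k buckets selected by the integer ids."""
    result = [0] * k
    for i, j in enumerate(ids):
        result[j] += values[i]
    return result


labels = ["b", "b", "a", "a", "b", "a", "a"]
values = [-1, 3, 2, 3, 2, -4, 1]

ids, uniques = factorize(labels)
sums = group_add(values, ids, len(uniques))
print(ids)      # [0, 0, 1, 1, 0, 1, 1]
print(uniques)  # ['b', 'a']
print(sums)     # [4, 2]
```

The groups come out in first-seen order (b before a), which is exactly the work that skipping the sort saves.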
                                                      37
Speed comparisons
• Test case: 100,000 data points, 5,000 groups
 • Algo 3, don’t sort groups: 5.46 ms
 • Algo 3, sort groups: 10.6 ms
 • Algo 2: 155 ms (14.6x slower than sorted Algo 3)
 • Algo 1: 10.49 seconds (990x slower than sorted Algo 3)
• Algos 2/3 implemented in Cython
                                                 38
GroupBy


• Situation is significantly more complicated
  in the multi-key case.
• More on this later


                                               39
Algo 3, profiled
In [32]: %prun for _ in xrange(100): algo3_nosort()

cumtime   filename:lineno(function)
  0.592   <string>:1(<module>)
  0.584   groupby_ex.py:37(algo3_nosort)
  0.535   {method 'factorize' of 'DictFactorizer' objects}
  0.047   {pandas._tseries.group_add}
  0.002   numeric.py:65(zeros_like)
  0.001   {method 'fill' of 'numpy.ndarray' objects}
  0.000   {numpy.core.multiarray.empty_like}
  0.000   {numpy.core.multiarray.empty}


                     Curious
                                                        40
Slaves to algorithms

• Turns out that numpy.unique works by
  sorting, not a hash table. Thus O(N log N)
  versus O(N)
• Takes > 70% of the runtime of Algo #2
• Factorize is the new bottleneck, possible to
  go faster?!



                                                 41
Unique-ing faster
Basic algorithm using a dict, do this in Cython

        table = {}
        uniques = []
        for value in values:
            if value not in table:
                 table[value] = None # dummy
                 uniques.append(value)
        if sort:
            uniques.sort()

     Performance may depend on the number of
          unique groups (due to dict resizing)
                                                  42
Unique-ing faster




No Sort: at best ~70x faster, worst 6.5x faster
   Sort: at best ~70x faster, worst 1.7x faster
                                                  43
Remember




           44
Can we go faster?
• Python dict is renowned as one of the best
  hash table implementations anywhere
• But:
 • No ability to preallocate, subject to
    arbitrary resizings

  • We don’t care about reference counting,
    throw away table once done
• Hm, what to do, what to do?
                                               45
Enter klib
• http://github.com/attractivechaos/klib
• Small, portable C data structures and
  algorithms
• khash: fast, memory-efficient hash table
• Hack a Cython interface (pxd file) and
  we’re in business



                                            46
khash Cython interface
cdef extern from "khash.h":
    ctypedef struct kh_pymap_t:
        khint_t n_buckets, size, n_occupied, upper_bound
        uint32_t *flags
        PyObject **keys
        Py_ssize_t *vals

    inline kh_pymap_t* kh_init_pymap()
    inline void kh_destroy_pymap(kh_pymap_t*)
    inline khint_t kh_get_pymap(kh_pymap_t*, PyObject*)
    inline khint_t kh_put_pymap(kh_pymap_t*, PyObject*, int*)
    inline void kh_clear_pymap(kh_pymap_t*)
    inline void kh_resize_pymap(kh_pymap_t*, khint_t)
    inline void kh_del_pymap(kh_pymap_t*, khint_t)
    bint kh_exist_pymap(kh_pymap_t*, khiter_t)


                                                                47
PyDict vs. khash unique




Conclusions: dict resizing makes a big impact
                                                48
Use strcmp in C




                  49
Gloves come off
             with int64




PyObject* boxing / PyRichCompare obvious culprit
                                                   50
Some NumPy-fu
• Think about the sorted factorize algorithm
 • Want to compute sorted unique labels
 • Also compute integer ids relative to the
    unique values, without making 2 passes
    through a hash table!

    sorter = uniques.argsort()
    reverse_indexer = np.empty(len(sorter), dtype=np.int64)
    reverse_indexer.put(sorter, np.arange(len(sorter)))

    labels = reverse_indexer.take(labels)
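
The same trick spelled out in pure Python as a worked example, assuming lists in place of NumPy arrays: one hash pass yields first-seen ids, then only the K uniques are sorted and the ids are remapped through a reverse indexer. `factorize_sorted` is a name chosen for the sketch.

```python
def factorize_sorted(values):
    """Return (labels, uniques) with uniques sorted, using one hash pass."""
    table, labels, uniques = {}, [], []
    for v in values:                      # the only pass through the hash table
        if v not in table:
            table[v] = len(uniques)
            uniques.append(v)
        labels.append(table[v])

    # argsort of the K uniques: sorter[new_id] == old_id
    sorter = sorted(range(len(uniques)), key=uniques.__getitem__)

    # Invert it: reverse_indexer[old_id] == new_id
    reverse_indexer = [0] * len(sorter)
    for new_id, old_id in enumerate(sorter):
        reverse_indexer[old_id] = new_id

    sorted_uniques = [uniques[i] for i in sorter]
    sorted_labels = [reverse_indexer[old_id] for old_id in labels]
    return sorted_labels, sorted_uniques


labels, uniques = factorize_sorted(["c", "a", "c", "b", "a"])
print(uniques)  # ['a', 'b', 'c']
print(labels)   # [2, 0, 2, 1, 0]
```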


                                                          51
Aside, for the R community
• R’s factor function is suboptimal
• Makes two hash table passes
 • unique: uniquify and sort
 • match: ids relative to unique labels
• This is highly fixable
• R’s integer unique is about 40% slower than
  my khash_int64 unique

                                                   52
Multi-key GroupBy
• Significantly more complicated because the
  number of possible key combinations may
  be very large
• Example, group by two sets of labels
 • 1000 unique values in each
 • “Key space”: 1,000,000, even though
    observed key pairs may be small


                                              53
Multi-key GroupBy
Simplified Algorithm
  id1, count1 = factorize(label1)
  id2, count2 = factorize(label2)
  group_id = id1 * count2 + id2
  nobs = count1 * count2

  if nobs > LARGE_NUMBER:
      group_id, nobs = factorize(group_id)

  result = group_add(data, group_id, nobs)
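
A runnable pure-Python sketch of the simplified algorithm above. The `factorize` and `group_add` helpers are stand-ins written for this example, the `large_number` threshold is a hypothetical parameter, and the input data is made up.

```python
def factorize(labels):
    """Return (ids, count): first-seen integer ids and the number of uniques."""
    table = {}
    ids = [table.setdefault(label, len(table)) for label in labels]
    return ids, len(table)


def group_add(data, group_id, nobs):
    """Sum data into nobs buckets selected by group_id."""
    out = [0] * nobs
    for value, gid in zip(data, group_id):
        out[gid] += value
    return out


def multi_key_sum(label1, label2, data, large_number=1000):
    id1, count1 = factorize(label1)
    id2, count2 = factorize(label2)
    # Combine the two ids into one id over the full key space.
    group_id = [a * count2 + b for a, b in zip(id1, id2)]
    nobs = count1 * count2
    if nobs > large_number:
        # Compress: keep only the key combinations actually observed.
        group_id, nobs = factorize(group_id)
    return group_id, group_add(data, group_id, nobs)


label1 = ["x", "x", "y", "y", "x"]
label2 = ["p", "q", "p", "p", "q"]
data = [1, 2, 3, 4, 5]

print(multi_key_sum(label1, label2, data))
# ([0, 1, 2, 2, 1], [1, 7, 7, 0])   full 2 x 2 key space, (y, q) unobserved
print(multi_key_sum(label1, label2, data, large_number=1))
# ([0, 1, 2, 2, 1], [1, 7, 7])      compressed to the 3 observed pairs
```

Without compression the result has one slot per possible key pair, observed or not, which is why a large key space gets expensive.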




                                             54
Multi-GroupBy
• Pathological, but realistic example
• 50,000 values, 1e4 unique keys x 2, key
  space 1e8
• Compress key space: 9.2 ms
• Don’t compress: 1.2s (!)
• I actually discovered this problem while
  writing this talk (!!)


                                             55
Speaking of performance
• Testing the correctness of code is easy:
  write unit tests
• How to systematically test performance?
• Need to catch performance regressions
• Being mildly performance obsessed, I got
  very tired of playing performance whack-a-
  mole with pandas


                                               56
vbench project
• http://github.com/wesm/vbench
• Run benchmarks for each version of your
  codebase
• vbench checks out each revision of your
  codebase, builds it, and runs all the
  benchmarks you define
• Results stored in a SQLite database
• Only works with git right now
                                            57
vbench
join_dataframe_index_single_key_bigger = \
    Benchmark("df.join(df_key2, on='key2')", setup,
              name='join_dataframe_index_single_key_bigger')




                                                               58
vbench
stmt3 = "df.groupby(['key1', 'key2']).sum()"
groupby_multi_cython = Benchmark(stmt3, setup,
                                 name="groupby_multi_cython",
                                 start_date=datetime(2011, 7, 1))




                                                                    59
Fast database joins
• Problem: SQL-compatible left, right, inner,
  outer joins
• Row duplication
• Join on index and / or join on columns
• Sorting vs. not sorting
• Algorithmically closely related to groupby
  etc.


                                                60
Row duplication
  left         right         outer join
key lvalue   key rvalue   key lvalue rvalue
foo   1      foo    5     foo    1     5
foo   2      foo    6     foo    1     6
bar   3      bar    7     foo    2     5
baz   4      qux    8     foo    2     6
                          bar    3     7
                          baz    4    NA
                          qux   NA     8

                                              61
Join indexers
  left         right         outer join
key lvalue   key rvalue   key lidx ridx
foo   1      foo    5     foo    0    0
foo   2      foo    6     foo    0    1
bar   3      bar    7     foo    1    0
baz   4      qux    8     foo    1    1
                          bar    2    2
                          baz    3   -1
                          qux   -1    3
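
The indexers in this table can be reproduced with a small hash-based sketch. It is an illustration only, assuming plain lists and a dict of row lists; `outer_join_indexers` and `take` are names chosen for the example, and this is not the sort-based routine the talk goes on to describe.

```python
from collections import defaultdict


def outer_join_indexers(left_keys, right_keys):
    """Return (lidx, ridx) row positions for an outer join; -1 means no match."""
    right_rows = defaultdict(list)
    for j, key in enumerate(right_keys):
        right_rows[key].append(j)

    lidx, ridx = [], []
    matched = set()
    for i, key in enumerate(left_keys):
        if key in right_rows:
            matched.add(key)
            for j in right_rows[key]:   # row duplication: one row per pair
                lidx.append(i)
                ridx.append(j)
        else:
            lidx.append(i)
            ridx.append(-1)

    for j, key in enumerate(right_keys):  # right-only rows come last
        if key not in matched:
            lidx.append(-1)
            ridx.append(j)
    return lidx, ridx


def take(values, indexer):
    """Gather by position; -1 becomes None (the NA in the table)."""
    return [values[i] if i != -1 else None for i in indexer]


lidx, ridx = outer_join_indexers(["foo", "foo", "bar", "baz"],
                                 ["foo", "foo", "bar", "qux"])
print(lidx)                       # [0, 0, 1, 1, 2, 3, -1]
print(ridx)                       # [0, 1, 0, 1, 2, -1, 3]
print(take([1, 2, 3, 4], lidx))   # [1, 1, 2, 2, 3, 4, None]
print(take([5, 6, 7, 8], ridx))   # [5, 6, 5, 6, 7, None, 8]
```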

                                          62
Join indexers
    left         right         outer join
  key lvalue   key rvalue   key lidx ridx
  foo   1      foo    5     foo    0    0
  foo   2      foo    6     foo    0    1
  bar   3      bar    7     foo    1    0
  baz   4      qux    8     foo    1    1
                            bar    2    2
                            baz    3   -1
                            qux   -1    3
Problem: factorized keys need to be sorted!
                                            63
An algorithmic observation

• If N values are known to be from the range
  0 through K - 1, can be sorted in O(N)
• Variant of counting sort
• For our purposes, only compute the
  sorting indexer (argsort)
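
A sketch of that counting-sort variant in pure Python, assuming the ids are integers in the range 0 through K - 1. `groupsort_indexer` is a name chosen for the example; it returns the argsort only and never moves the data itself.

```python
def groupsort_indexer(ids, k):
    """Stable argsort of ids drawn from range(k), in O(N + K) time."""
    # Pass 1: count how many rows fall in each group.
    counts = [0] * k
    for gid in ids:
        counts[gid] += 1

    # starts[g] is the output position of group g's first row.
    starts = [0] * k
    for g in range(1, k):
        starts[g] = starts[g - 1] + counts[g - 1]

    # Pass 2: drop each row index into the next free slot of its group.
    indexer = [0] * len(ids)
    for i, gid in enumerate(ids):
        indexer[starts[gid]] = i
        starts[gid] += 1
    return indexer


ids = [2, 0, 1, 0, 2, 1]
indexer = groupsort_indexer(ids, 3)
print(indexer)                    # [1, 3, 2, 5, 0, 4]
print([ids[i] for i in indexer])  # [0, 0, 1, 1, 2, 2]
```

Two linear passes replace the O(N log N) comparison sort, and rows within a group keep their original relative order.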



                                               64
Winning join algorithm
   Step                                   Cost
   Factorize key columns                  O(K log K) (sort keys) or O(N) (don’t sort keys)
   Compute / compress group indexes       O(N) (refactorize)
   "Sort" by group indexes                O(N) (counting sort)
   Compute left / right join indexers
     for join method                      O(N_output)
   Remap indexers relative to
     original row ordering                O(N_output)
   Move data efficiently into
     output DataFrame                     O(N_output) (this step is actually fairly nontrivial)

                                                                             65
“You’re like CLR, I’m like CLRS”
                    - “Kill Dash Nine”, by Monzy




                                                   66
Join test case

• Left: 80k rows, 2 key columns, 8k unique
  key pairs

• Right: 8k rows, 2 key columns, 8k unique
  key pairs
• 6k matching key pairs between the tables,
  many-to-one join
• One column of numerical values in each
                                              67
Join test case

• Many-to-many case: stack right DataFrame
  on top of itself to yield 16k rows, 2 rows
  for each key pair
• Aside: sorting the unique keys dominates
  the runtime (that pesky O(K log K)), not
  included in these benchmarks



                                               68
Quick, algebra!

     Many-to-one             Many-to-many
• Left join: 80k rows    • Left join: 140k rows
• Right join: 62k rows   • Right join: 124k rows
• Inner join: 60k rows   • Inner join: 120k rows
• Outer join: 82k rows   • Outer join: 144k rows
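
The row counts above can be checked with a few lines of arithmetic. The sketch assumes every left key pair occurs exactly 10 times (80k rows / 8k pairs); the slides do not state this, but the counts are consistent with it. `join_row_counts` is a name chosen for the example.

```python
def join_row_counts(left_rows_per_pair, right_rows_per_pair,
                    n_left_pairs, n_right_pairs, n_matching_pairs):
    """Output row counts for each join type, assuming uniform key multiplicity."""
    inner = n_matching_pairs * left_rows_per_pair * right_rows_per_pair
    left_only = (n_left_pairs - n_matching_pairs) * left_rows_per_pair
    right_only = (n_right_pairs - n_matching_pairs) * right_rows_per_pair
    return {
        "left": inner + left_only,
        "right": inner + right_only,
        "inner": inner,
        "outer": inner + left_only + right_only,
    }


# Many-to-one: each right key pair appears once.
many_to_one = join_row_counts(10, 1, 8000, 8000, 6000)
# Many-to-many: right table stacked on itself, so each pair appears twice.
many_to_many = join_row_counts(10, 2, 8000, 8000, 6000)

print(many_to_one)
# {'left': 80000, 'right': 62000, 'inner': 60000, 'outer': 82000}
print(many_to_many)
# {'left': 140000, 'right': 124000, 'inner': 120000, 'outer': 144000}
```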

                                                   69
Results vs. some R packages




        * relative timings
                              70
Results vs SQLite3
         Absolute timings




      * outer is LEFT   OUTER   in SQLite3

 Note: In SQLite3 doing something like




                                             71
DataFrame sort by columns
• Applied same ideas / tools to “sort by
  multiple columns op” yesterday




                                           72
The bottom line
• Just a flavor: pretty much all of pandas has
  seen the same level of design effort and
  performance scrutiny
• Make sure whoever implemented your data
  structures and algorithms cares about
  performance. A lot.
• Python has amazingly powerful and
  productive tools for implementation work


                                                73
Thanks!

• Follow me on Twitter: @wesmckinn
• Blog: http://blog.wesmckinney.com
• Exciting Python things ahead in 2012


                                         74

Mais conteúdo relacionado

Mais procurados

Le Big data en santé et l'éthique, sont- ils compatibles ?
Le Big data en santé et l'éthique, sont- ils compatibles ?Le Big data en santé et l'éthique, sont- ils compatibles ?
Le Big data en santé et l'éthique, sont- ils compatibles ?Céline Poirier
 
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j InternalsTobias Lindaaker
 
Apache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantApache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantALTIC Altic
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016DataStax
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
BigData_TP1: Initiation à Hadoop et Map-Reduce
BigData_TP1: Initiation à Hadoop et Map-ReduceBigData_TP1: Initiation à Hadoop et Map-Reduce
BigData_TP1: Initiation à Hadoop et Map-ReduceLilia Sfaxi
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Teradata vs-exadata
Teradata vs-exadataTeradata vs-exadata
Teradata vs-exadataLouis liu
 
Introduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and ConsistencyIntroduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and ConsistencyBenjamin Black
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the DisruptorTrisha Gee
 

Mais procurados (20)

Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Le Big data en santé et l'éthique, sont- ils compatibles ?
Le Big data en santé et l'éthique, sont- ils compatibles ?Le Big data en santé et l'éthique, sont- ils compatibles ?
Le Big data en santé et l'éthique, sont- ils compatibles ?
 
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j Internals
 
Apache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantApache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performant
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
BigData_TP1: Initiation à Hadoop et Map-Reduce
BigData_TP1: Initiation à Hadoop et Map-ReduceBigData_TP1: Initiation à Hadoop et Map-Reduce
BigData_TP1: Initiation à Hadoop et Map-Reduce
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Teradata vs-exadata
Teradata vs-exadataTeradata vs-exadata
Teradata vs-exadata
 
Introduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and ConsistencyIntroduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and Consistency
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the Disruptor
 

Destaque

pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
vbench: lightweight performance testing for Python
vbench: lightweight performance testing for Pythonvbench: lightweight performance testing for Python
vbench: lightweight performance testing for PythonWes McKinney
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWes McKinney
 
Ibis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceWes McKinney
 
Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsWes McKinney
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)Wes McKinney
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonChetan Khatri
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsWes McKinney
 

Inside pandas: Design and development for high performance data analysis

  • 1. A look inside pandas design and development Wes McKinney Lambda Foundry, Inc. @wesmckinn NYC Python Meetup, 1/10/2012
  • 2. a.k.a. “Pragmatic Python for high performance data analysis”
  • 3. a.k.a. “Rise of the pandas”
  • 4. Me
  • 7. Me • Mathematician at heart • 3 years in the quant finance industry • Last 2: statistics + freelance + open source • My new company: Lambda Foundry • Building analytics and tools for finance and other domains
  • 8. Me • Blog: http://blog.wesmckinney.com • GitHub: http://github.com/wesm • Twitter: @wesmckinn • Working on “Python for Data Analysis” for O’Reilly Media • Giving PyCon tutorial on pandas (!)
  • 9. pandas? • http://pandas.sf.net • Swiss-army knife of (in-memory) data manipulation in Python • Like R’s data.frame on steroids • Excellent performance • Easy-to-use, highly consistent API • A foundation for data analysis in Python
  • 10. pandas • In heavy production use in the financial industry • Generally much better performance than other open source alternatives (e.g. R) • Hope: basis for the “next generation” data analytical environment in Python
  • 11. Simplifying data wrangling • Data munging / preparation / cleaning / integration is slow, error prone, and time consuming • Everyone already <3’s Python for data wrangling: pandas takes it to the next level
  • 12. Explosive pandas growth • Last 6 months: 240 files changed 49428 insertions(+), 15358 deletions(-) Cython-generated C removed
  • 13. Rigorous unit testing • Need to be able to trust your $1e3/e6/e9s to pandas • > 98% line coverage as measured by coverage.py • v0.3.0 (2/19/2011): 533 test functions • v0.7.0 (1/09/2012): 1272 test functions
  • 14. Some development asides • I get a lot of questions about my dev env • Emacs + IPython FTW • Indispensable development tools • pdb (and IPython-enhanced pdb) • pylint / pyflakes (integrated with Emacs) • nose • coverage.py • grin, for searching code. >> ack/grep IMHO
  • 15. IPython • Matthew Goodman: “If you are not using this tool, you are doing it wrong!” • Tab completion, introspection, interactive debugger, command history • Designed to enhance your productivity in every way. I can’t live without it • IPython HTML notebook is a game changer 15
  • 16. Profiling and optimization • %time, %timeit in IPython • %prun, to profile a statement with cProfile • %run -p to profile whole programs • line_profiler module, for line-by-line timing • Optimization: find right algorithm first. Cython-ize the bottlenecks (if need be) 16
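Outside IPython, the same kind of measurement that %timeit gives can be approximated with the standard library. This is a minimal sketch, not the speaker's workflow: `group_sums` and the input sizes are made-up stand-ins for whatever statement you want to time.

```python
import timeit


def group_sums(labels, values):
    """Toy workload: sum the values belonging to each label using a dict."""
    sums = {}
    for label, value in zip(labels, values):
        sums[label] = sums.get(label, 0) + value
    return sums


# Hypothetical input: 10,000 data points spread over 50 groups.
labels = [i % 50 for i in range(10_000)]
values = list(range(10_000))

# Rough analogue of %timeit: repeat the measurement and keep the best run,
# which is the one least disturbed by other activity on the machine.
runs = timeit.repeat(lambda: group_sums(labels, values), number=10, repeat=3)
per_call_ms = min(runs) / 10 * 1e3
print(f"best of 3: {per_call_ms:.3f} ms per call")

result = group_sums(labels, values)
```

Timing comes after correctness here: the `result` line exists so the function can be checked before anyone starts optimizing it.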
  • 17. Other things that matter • Follow PEP8 religiously • Naming conventions, other code style • 80 character per line hard limit • Test more than you think you need to, aim for 100% line coverage • Avoid long functions (> 50 lines), refactor aggressively 17
  • 18. I’m serious about function length http://gist.github.com/1580880 18
  • 19. Don’t make a mess Uncle Bob YouTube: “What killed Smalltalk could kill s/Ruby/Python, too” 19
  • 20. Other stuff • Good keyboard 20
  • 21. Other stuff • Big monitors 21
  • 22. Other stuff • Ergonomic chair (good hacking posture) 22
  • 23. pandas DataFrame • Jack-of-all-trades tabular data structure In [10]: tips[:10] Out[10]: total_bill tip sex smoker day time size 1 16.99 1.01 Female No Sun Dinner 2 2 10.34 1.66 Male No Sun Dinner 3 3 21.01 3.50 Male No Sun Dinner 3 4 23.68 3.31 Male No Sun Dinner 2 5 24.59 3.61 Female No Sun Dinner 4 6 25.29 4.71 Male No Sun Dinner 4 7 8.770 2.00 Male No Sun Dinner 2 8 26.88 3.12 Male No Sun Dinner 4 9 15.04 1.96 Male No Sun Dinner 2 10 14.78 3.23 Male No Sun Dinner 2 23
  • 24. DataFrame • Heterogeneous columns • Data alignment and axis indexing • No-copy data selection (!) • Agile reshaping • Fast joining, merging, concatenation 24
  • 25. DataFrame • Axis indexing enables rich data alignment, joins / merges, reshaping, selection, etc. day Fri Sat Sun Thur sex smoker Female No 3.125 2.725 3.329 2.460 Yes 2.683 2.869 3.500 2.990 Male No 2.500 3.257 3.115 2.942 Yes 2.741 2.879 3.521 3.058 25
  • 26. Let’s have a little fun To the IPython Notebook, Batman http://ashleyw.co.uk/project/food-nutrient-database 26
  • 27. Axis indexing, the special pandas-flavored sauce • Enables “alignment-free” programming • Prevents a major source of data munging frustration and errors • Fast (O(1) or O(log n)) data selection • Powerful way of describing reshape / join / merge / pivot-table operations 27
  • 28. Data alignment, join ops • The brains live in the axis index • Indexes know how to do set logic • Join/align ops: produce “indexers” • Mapping between source/output • Indexer passed to fast “take” function 28
  • 29. Index join example • left: d, b, c, e • right: a, b, c • joined: a, b, c, d, e • lidx: -1, 1, 2, 0, 3 • ridx: 0, 1, 2, -1, -1 • left_values.take(lidx, axis) gives the reindexed data 29
  • 30. Implementing index joins • Completely irregular case: use hash tables • Monotonic / increasing values • Faster specialized left/right/inner/outer join routines, especially for native types (int32/64, datetime64) • Lookup hash table is persisted inside the Index object! 30
  • 31. Um, hash table? • left map: {d: 0, b: 1, c: 2, e: 3} • joined: a, b, c, d, e • indexer: -1, 1, 2, 0, 3 31
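The indexer idea from the index join slides can be sketched in plain Python. This is an illustration under assumptions, not pandas internals: the helper names `build_position_map`, `get_indexer` and `take` are hypothetical, and `left_values` is made-up data. The labels are the ones from the slide (left d, b, c, e and right a, b, c).

```python
def build_position_map(labels):
    """Hash table mapping each label to its position in the index."""
    return {label: pos for pos, label in enumerate(labels)}


def get_indexer(position_map, target):
    """For each target label, its position in the source index, or -1 if absent."""
    return [position_map.get(label, -1) for label in target]


def take(values, indexer, fill_value=None):
    """Gather values by position; -1 marks a missing row and yields fill_value."""
    return [values[i] if i != -1 else fill_value for i in indexer]


left = ["d", "b", "c", "e"]
right = ["a", "b", "c"]
joined = sorted(set(left) | set(right))  # outer join of the two label sets

lidx = get_indexer(build_position_map(left), joined)
ridx = get_indexer(build_position_map(right), joined)

left_values = [10.0, 20.0, 30.0, 40.0]  # hypothetical data aligned with `left`
reindexed = take(left_values, lidx)
```

The join itself only ever touches labels. Moving the data is a separate gather step driven by the integer indexer, which is why that step can be handed to one fast routine.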
  • 32. Hash tables • Form the core of many critical pandas algorithms • unique (for set intersection / union) • “factor”ize • groupby • join / merge / align 32
  • 33. GroupBy, a brief algorithmic exploration • Simple problem: compute group sums for a vector given group identifications • labels: b, b, a, a, b, a, a • values: -1, 3, 2, 3, 2, -4, 1 • unique labels: a, b • group sums: 2, 4 33
  • 34. GroupBy: Algo #1 unique_labels = np.unique(labels) results = np.empty(len(unique_labels)) for i, label in enumerate(unique_labels): results[i] = values[labels == label].sum() For all these examples, assume N data points and K unique groups 34
  • 35. GroupBy: Algo #1, don’t do this unique_labels = np.unique(labels) results = np.empty(len(unique_labels)) for i, label in enumerate(unique_labels): results[i] = values[labels == label].sum() Some obvious problems • O(N * K) comparisons. Slow for large K • K passes through values • numpy.unique is pretty slow (more on this later) 35
  • 36. GroupBy: Algo #2 Make this dict in O(N) (pseudocode) g_inds = {label : [i where labels[i] == label]} Now for i, label in enumerate(unique_labels): indices = g_inds[label] label_values = values.take(indices) result[i] = label_values.sum() Pros: one pass through values. ~O(N) for N >> K Cons: g_inds can be built in O(N), but too many list/dict API calls, even using Cython 36
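The Algo #2 pseudocode can be made runnable along these lines. It is a pure-Python sketch rather than the Cython version that was benchmarked; the function name `group_sums_algo2` is hypothetical, and the labels and values are the ones from the group-sum example earlier in the talk.

```python
from collections import defaultdict


def group_sums_algo2(labels, values):
    """Group sums via a dict mapping each label to the list of its row positions."""
    # One O(N) pass builds {label: [i where labels[i] == label]}.
    g_inds = defaultdict(list)
    for i, label in enumerate(labels):
        g_inds[label].append(i)

    unique_labels = sorted(g_inds)
    result = []
    for label in unique_labels:
        # Equivalent of values.take(indices) followed by .sum().
        label_values = [values[i] for i in g_inds[label]]
        result.append(sum(label_values))
    return unique_labels, result


labels = ["b", "b", "a", "a", "b", "a", "a"]
values = [-1, 3, 2, 3, 2, -4, 1]
unique_labels, sums = group_sums_algo2(labels, values)
```

Each value is visited once, but every row still costs a dict lookup and a list append, which is the overhead the slide calls out.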
  • 37. GroupBy: Algo #3, much faster • “Factorize” labels • Produce vector of integers from 0, ..., K-1 corresponding to the unique observed values (use a hash table) result = np.zeros(k) for i, j in enumerate(factorized_labels): result[j] += values[i] Pros: avoid expensive dict-of-lists creation. Avoid numpy.unique and have the option not to sort the unique labels, skipping O(K lg K) work 37
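A plain-Python sketch of Algo #3, assuming a dict as the hash table. The names `factorize` and `group_sums_algo3` are illustrative only; the talk's real implementation lives in Cython.

```python
def factorize(labels):
    """Map labels to integer ids 0..K-1 in order of first appearance."""
    table = {}    # label -> id
    uniques = []  # id -> label
    ids = []
    for label in labels:
        idx = table.get(label)
        if idx is None:
            idx = len(uniques)
            table[label] = idx
            uniques.append(label)
        ids.append(idx)
    return ids, uniques


def group_sums_algo3(labels, values):
    """Accumulate each value directly into the slot of its group id."""
    ids, uniques = factorize(labels)
    result = [0] * len(uniques)
    for i, j in enumerate(ids):
        result[j] += values[i]
    return uniques, result


labels = ["b", "b", "a", "a", "b", "a", "a"]
values = [-1, 3, 2, 3, 2, -4, 1]
uniques, sums = group_sums_algo3(labels, values)
```

The groups come out in order of first appearance (b, then a), not sorted. Sorting the K unique labels is an optional extra step, which is the O(K lg K) work the slide says can be skipped.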
  • 38. Speed comparisons • Test case: 100,000 data points, 5,000 groups • Algo 3, don’t sort groups: 5.46 ms • Algo 3, sort groups: 10.6 ms • Algo 2: 155 ms (14.6x slower) • Algo 1: 10.49 seconds (990x slower) • Algos 2/3 implemented in Cython 38
  • 39. GroupBy • Situation is significantly more complicated in the multi-key case. • More on this later 39
  • 40. Algo 3, profiled In [32]: %prun for _ in xrange(100): algo3_nosort() cumtime filename:lineno(function) 0.592 <string>:1(<module>) 0.584 groupby_ex.py:37(algo3_nosort) 0.535 {method 'factorize' of 'DictFactorizer' objects} 0.047 {pandas._tseries.group_add} 0.002 numeric.py:65(zeros_like) 0.001 {method 'fill' of 'numpy.ndarray' objects} 0.000 {numpy.core.multiarray.empty_like} 0.000 {numpy.core.multiarray.empty} Curious 40
  • 41. Slaves to algorithms • Turns out that numpy.unique works by sorting, not a hash table. Thus O(N log N) versus O(N) • Takes > 70% of the runtime of Algo #2 • Factorize is the new bottleneck, possible to go faster?! 41
  • 42. Unique-ing faster Basic algorithm using a dict, do this in Cython table = {} uniques = [] for value in values: if value not in table: table[value] = None # dummy uniques.append(value) if sort: uniques.sort() Performance may depend on the number of unique groups (due to dict resizing) 42
  • 43. Unique-ing faster No Sort: at best ~70x faster, worst 6.5x faster Sort: at best ~70x faster, worst 1.7x faster 43
  • 44. Remember 44
  • 45. Can we go faster? • Python dict is renowned as one of the best hash table implementations anywhere • But: • No ability to preallocate, subject to arbitrary resizings • We don’t care about reference counting, throw away table once done • Hm, what to do, what to do? 45
  • 46. Enter klib • http://github.com/attractivechaos/klib • Small, portable C data structures and algorithms • khash: fast, memory-efficient hash table • Hack a Cython interface (pxd file) and we’re in business 46
  • 47. khash Cython interface cdef extern from "khash.h": ctypedef struct kh_pymap_t: khint_t n_buckets, size, n_occupied, upper_bound uint32_t *flags PyObject **keys Py_ssize_t *vals inline kh_pymap_t* kh_init_pymap() inline void kh_destroy_pymap(kh_pymap_t*) inline khint_t kh_get_pymap(kh_pymap_t*, PyObject*) inline khint_t kh_put_pymap(kh_pymap_t*, PyObject*, int*) inline void kh_clear_pymap(kh_pymap_t*) inline void kh_resize_pymap(kh_pymap_t*, khint_t) inline void kh_del_pymap(kh_pymap_t*, khint_t) bint kh_exist_pymap(kh_pymap_t*, khiter_t) 47
  • 48. PyDict vs. khash unique Conclusions: dict resizing makes a big impact 48
  • 50. Gloves come off with int64 PyObject* boxing / PyRichCompare obvious culprit 50
  • 51. Some NumPy-fu • Think about the sorted factorize algorithm • Want to compute sorted unique labels • Also compute integer ids relative to the unique values, without making 2 passes through a hash table! sorter = uniques.argsort() reverse_indexer = np.empty(len(sorter), dtype=np.int64) reverse_indexer.put(sorter, np.arange(len(sorter))) labels = reverse_indexer.take(labels) 51
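The relabeling trick can be checked by hand with lists standing in for NumPy arrays. This is a sketch under that assumption; `sort_factorized` is a hypothetical name, and the input ids are what an unsorted factorize pass would give for the labels b, b, a, a, b, a, a.

```python
def sort_factorized(ids, uniques):
    """Sort the unique labels and remap the ids without a second hash pass."""
    # argsort: positions of `uniques` in sorted order.
    sorter = sorted(range(len(uniques)), key=uniques.__getitem__)

    # Inverse permutation: reverse_indexer[old_id] = new_id.
    reverse_indexer = [0] * len(sorter)
    for new_id, old_id in enumerate(sorter):
        reverse_indexer[old_id] = new_id

    sorted_uniques = [uniques[i] for i in sorter]
    new_ids = [reverse_indexer[i] for i in ids]  # reverse_indexer.take(ids)
    return new_ids, sorted_uniques


ids = [0, 0, 1, 1, 0, 1, 1]  # ids in order of first appearance
uniques = ["b", "a"]
new_ids, sorted_uniques = sort_factorized(ids, uniques)
```

Only the K unique values are sorted. The N ids are remapped with a single gather, so the hash table is consulted once per row rather than twice.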
  • 52. Aside, for the R community • R’s factor function is suboptimal • Makes two hash table passes • unique: uniquify and sort • match: ids relative to unique labels • This is highly fixable • R’s integer unique is about 40% slower than my khash_int64 unique 52
  • 53. Multi-key GroupBy • Significantly more complicated because the number of possible key combinations may be very large • Example, group by two sets of labels • 1000 unique values in each • “Key space”: 1,000,000, even though the number of observed key pairs may be small 53
  • 54. Multi-key GroupBy Simplified Algorithm id1, count1 = factorize(label1) id2, count2 = factorize(label2) group_id = id1 * count2 + id2 nobs = count1 * count2 if nobs > LARGE_NUMBER: group_id, nobs = factorize(group_id) result = group_add(data, group_id, nobs) 54
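A runnable reading of the simplified algorithm, in plain Python. Several things here are assumptions made for illustration: `factorize` returns (ids, count) as on the slide, the `group_add` call is replaced by an inline accumulation loop, and the value of `LARGE_NUMBER` is made up.

```python
LARGE_NUMBER = 1_000_000  # hypothetical threshold; the slide gives no value


def factorize(labels):
    """Return integer ids (order of first appearance) and the number of uniques."""
    table = {}
    ids = []
    for label in labels:
        ids.append(table.setdefault(label, len(table)))
    return ids, len(table)


def multi_key_group_sums(label1, label2, data, large_number=LARGE_NUMBER):
    id1, count1 = factorize(label1)
    id2, count2 = factorize(label2)

    # Each key pair maps to one slot in a count1 x count2 "key space".
    group_id = [a * count2 + b for a, b in zip(id1, id2)]
    nobs = count1 * count2

    # If the key space is huge, compress it down to the observed pairs only.
    if nobs > large_number:
        group_id, nobs = factorize(group_id)

    # Stand-in for group_add(data, group_id, nobs).
    result = [0] * nobs
    for gid, value in zip(group_id, data):
        result[gid] += value
    return group_id, result


label1 = ["x", "x", "y", "y"]
label2 = ["p", "q", "p", "p"]
data = [1, 2, 3, 4]

full_ids, full = multi_key_group_sums(label1, label2, data)
packed_ids, packed = multi_key_group_sums(label1, label2, data, large_number=0)
```

Without compression the result has one slot for every possible pair, including (y, q), which never occurs. Passing `large_number=0` forces the compression branch and leaves only the three observed pairs.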
  • 55. Multi-GroupBy • Pathological, but realistic example • 50,000 values, 1e4 unique keys x 2, key space 1e8 • Compress key space: 9.2 ms • Don’t compress: 1.2s (!) • I actually discovered this problem while writing this talk (!!) 55
  • 56. Speaking of performance • Testing the correctness of code is easy: write unit tests • How to systematically test performance? • Need to catch performance regressions • Being mildly performance obsessed, I got very tired of playing performance whack-a-mole with pandas 56
  • 57. vbench project • http://github.com/wesm/vbench • Run benchmarks for each version of your codebase • vbench checks out each revision of your codebase, builds it, and runs all the benchmarks you define • Results stored in a SQLite database • Only works with git right now 57
  • 58. vbench join_dataframe_index_single_key_bigger = Benchmark("df.join(df_key2, on='key2')", setup, name='join_dataframe_index_single_key_bigger') 58
  • 59. vbench stmt3 = "df.groupby(['key1', 'key2']).sum()" groupby_multi_cython = Benchmark(stmt3, setup, name="groupby_multi_cython", start_date=datetime(2011, 7, 1)) 59
  • 60. Fast database joins • Problem: SQL-compatible left, right, inner, outer joins • Row duplication • Join on index and / or join on columns • Sorting vs. not sorting • Algorithmically closely related to groupby etc. 60
  • 61. Row duplication • left (key, lvalue): (foo, 1), (foo, 2), (bar, 3), (baz, 4) • right (key, rvalue): (foo, 5), (foo, 6), (bar, 7), (qux, 8) • outer join (key, lvalue, rvalue): (foo, 1, 5), (foo, 1, 6), (foo, 2, 5), (foo, 2, 6), (bar, 3, 7), (baz, 4, NA), (qux, NA, 8) 61
  • 62. Join indexers • left (key, lvalue): (foo, 1), (foo, 2), (bar, 3), (baz, 4) • right (key, rvalue): (foo, 5), (foo, 6), (bar, 7), (qux, 8) • outer join (key, lidx, ridx): (foo, 0, 0), (foo, 0, 1), (foo, 1, 0), (foo, 1, 1), (bar, 2, 2), (baz, 3, -1), (qux, -1, 3) 62
  • 63. Join indexers • left (key, lvalue): (foo, 1), (foo, 2), (bar, 3), (baz, 4) • right (key, rvalue): (foo, 5), (foo, 6), (bar, 7), (qux, 8) • outer join (key, lidx, ridx): (foo, 0, 0), (foo, 0, 1), (foo, 1, 0), (foo, 1, 1), (bar, 2, 2), (baz, 3, -1), (qux, -1, 3) • Problem: factorized keys need to be sorted! 63
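The lidx / ridx pairs in this example can be reproduced with a small pure-Python sketch. It groups row positions per key with lists, in the style of Algo #2, rather than sorting factorized keys as the talk's implementation does, so treat it as an illustration of the output and not of the fast path. The name `outer_join_indexers` is hypothetical.

```python
def outer_join_indexers(left_keys, right_keys):
    """Return (lidx, ridx) for an outer join; -1 marks a missing side."""
    table = {}         # key -> group id, in order of first appearance
    left_groups = []   # group id -> row positions in left
    right_groups = []  # group id -> row positions in right

    def group_of(key):
        gid = table.get(key)
        if gid is None:
            gid = len(table)
            table[key] = gid
            left_groups.append([])
            right_groups.append([])
        return gid

    for pos, key in enumerate(left_keys):
        left_groups[group_of(key)].append(pos)
    for pos, key in enumerate(right_keys):
        right_groups[group_of(key)].append(pos)

    lidx, ridx = [], []
    for lpos, rpos in zip(left_groups, right_groups):
        # Cartesian product within a group is the source of row duplication.
        for left_row in lpos or [-1]:
            for right_row in rpos or [-1]:
                lidx.append(left_row)
                ridx.append(right_row)
    return lidx, ridx


left_keys = ["foo", "foo", "bar", "baz"]
right_keys = ["foo", "foo", "bar", "qux"]
lidx, ridx = outer_join_indexers(left_keys, right_keys)

lvalue = [1, 2, 3, 4]
rvalue = [5, 6, 7, 8]
rows = [
    (lvalue[i] if i != -1 else None, rvalue[j] if j != -1 else None)
    for i, j in zip(lidx, ridx)
]
```

The two foo rows on each side expand to four output rows, and the unmatched baz and qux rows each pick up a -1 on the missing side.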
  • 64. An algorithmic observation • If N values are known to be from the range 0 through K - 1, can be sorted in O(N) • Variant of counting sort • For our purposes, only compute the sorting indexer (argsort) 64
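A counting-sort argsort for ids known to lie in 0 through K - 1 might look like the following. This is a plain-Python sketch of the general technique, not the pandas routine; the name `counting_argsort` and the sample ids are made up.

```python
def counting_argsort(ids, k):
    """Stable argsort of integer ids in range(k), in O(N + K) time."""
    # counts[g + 1] holds the number of rows with id g.
    counts = [0] * (k + 1)
    for gid in ids:
        counts[gid + 1] += 1

    # Prefix sums turn counts[g] into the first output slot of group g.
    for g in range(1, k + 1):
        counts[g] += counts[g - 1]

    # Drop each row position into the next free slot of its group.
    indexer = [0] * len(ids)
    for pos, gid in enumerate(ids):
        indexer[counts[gid]] = pos
        counts[gid] += 1
    return indexer


ids = [2, 0, 1, 0, 2, 1]
indexer = counting_argsort(ids, k=3)
sorted_ids = [ids[i] for i in indexer]
```

No comparisons between ids are made, and rows with equal ids keep their original relative order. That stability is what lets the indexer be used to lay out groups contiguously.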
  • 65. Winning join algorithm • Factorize key columns: O(K log K) (sort keys) or O(N) (don’t sort keys) • Compute / compress group indexes: O(N) (refactorize) • "Sort" by group indexes: O(N) (counting sort) • Compute left / right join indexers for join method: O(N_output) • Remap indexers relative to original row ordering: O(N_output) • Move data efficiently into output DataFrame: O(N_output) (this step is actually fairly nontrivial) 65
  • 66. “You’re like CLR, I’m like CLRS” - “Kill Dash Nine”, by Monzy 66
  • 67. Join test case • Left: 80k rows, 2 key columns, 8k unique key pairs • Right: 8k rows, 2 key columns, 8k unique key pairs • 6k matching key pairs between the tables, many-to-one join • One column of numerical values in each 67
  • 68. Join test case • Many-to-many case: stack right DataFrame on top of itself to yield 16k rows, 2 rows for each key pair • Aside: sorting the unique keys dominates the runtime (that pesky O(K log K)), not included in these benchmarks 68
  • 69. Quick, algebra! • Many-to-one: Left join: 80k rows • Right join: 62k rows • Inner join: 60k rows • Outer join: 82k rows • Many-to-many: Left join: 140k rows • Right join: 124k rows • Inner join: 120k rows • Outer join: 144k rows 69
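The row counts follow from the test case description by simple arithmetic. The sketch below assumes the left table's 80k rows are spread evenly over its 8k key pairs (10 rows each), which the slides do not state explicitly; `join_row_counts` is a hypothetical helper.

```python
def join_row_counts(left_per_key, right_per_key, n_left_keys, n_right_keys, n_matching):
    """Output row counts of each join type, assuming uniform rows per key pair."""
    inner = n_matching * left_per_key * right_per_key
    left_only = (n_left_keys - n_matching) * left_per_key
    right_only = (n_right_keys - n_matching) * right_per_key
    return {
        "left": inner + left_only,
        "right": inner + right_only,
        "inner": inner,
        "outer": inner + left_only + right_only,
    }


# Many-to-one: 10 left rows and 1 right row per key pair, 6k pairs in common.
many_to_one = join_row_counts(10, 1, 8_000, 8_000, 6_000)

# Many-to-many: the right table stacked on itself gives 2 rows per key pair.
many_to_many = join_row_counts(10, 2, 8_000, 8_000, 6_000)
```

Under that assumption the numbers match the slide: 80k / 62k / 60k / 82k for the many-to-one case and 140k / 124k / 120k / 144k for many-to-many.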
  • 70. Results vs. some R packages * relative timings 70
  • 71. Results vs SQLite3 Absolute timings * outer is LEFT OUTER in SQLite3 Note: In SQLite3 doing something like 71
  • 72. DataFrame sort by columns • Applied same ideas / tools to “sort by multiple columns op” yesterday 72
  • 73. The bottom line • Just a flavor: pretty much all of pandas has seen the same level of design effort and performance scrutiny • Make sure whoever implemented your data structures and algorithms cares about performance. A lot. • Python has amazingly powerful and productive tools for implementation work 73
  • 74. Thanks! • Follow me on Twitter: @wesmckinn • Blog: http://blog.wesmckinney.com • Exciting Python things ahead in 2012 74