www.twosigma.com
Python Data Wrangling: Preparing for the Future
October 26, 2016
All Rights Reserved
Wes McKinney @wesmckinn
PyCon HK 2016

Disclaimer
Important Legal Information
The information presented here is offered for informational purposes only and
should not be used for any other purpose (including, without limitation, the
making of investment decisions). Examples provided herein are for illustrative
purposes only and are not necessarily based on actual data. Nothing herein
constitutes: an offer to sell or the solicitation of any offer to buy any security or
other interest; tax advice; or investment advice. This presentation shall remain the
property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the
right to require the return of this presentation at any time. Copyright © 2016
TWO SIGMA INVESTMENTS, LP. All rights reserved.

Me
• Creator of Python pandas project (ca. 2008)
• PMC member for Apache Arrow and Parquet
• Currently: Data technology at Two Sigma
• Past
  • Cloudera: Python on Hadoop/Spark + Ibis project
  • DataPad: Co-founder/CEO

Python for Data Analysis, 2nd Edition
Coming in 2017
• Updated for pandas 1.0
• Remove deprecated / out-of-date functionality
• New content: Advanced pandas, intro to statsmodels, scikit-learn

The Python data community
• Python has grown from a niche scientific computing language in 2011 to a mainstream data science language in 2016
• A language of choice for latest-generation ML: Keras, TensorFlow, Theano
• Worldwide ecosystem of conferences and meetups: PyData, SciPy, etc.

How / why did the community grow?
• Confluence of important projects
  • Array computing, linear algebra: NumPy, Cython, SciPy
  • Data manipulation tools: pandas
  • Statistics / Machine Learning: statsmodels, scikit-learn
  • Programming interfaces: Jupyter notebooks
  • Packaging: {Ana, mini}conda, portable wheels (.whl) for pip

conda-forge: community-led conda packaging
conda config --add channels conda-forge

NumFOCUS: Not-for-profit open source sponsorship

pandas, the project
• https://github.com/pandas-dev/pandas
• Latest version: 0.19.0, released October 3, 2016
• Code base turned 8 years old in April 2016
• Over 600 distinct contributors, 14,000+ commits
• > 5300 pull requests, > 7200 issues closed on GitHub
• Only 11 developers with 100 or more commits

Estimating the pandas user base size
pandas.pydata.org: 50-70K unique visitors per month

Sources of pandas’s popularity
• Responsive (volunteer) core developers: code reviewed and bugs fixed quickly
• Easy to use, good performance
• One of the only mature data preparation tools available. Most data is not clean
• Strong suite of time series / event analytics
• Library has been battle tested by tens of thousands of users
• Good learning resources (multiple books, etc.)

pandas development, where to from here?
• pandas has accumulated much technical debt: problems stemming from early software architecture decisions
• pandas is increasingly used as a building block in distributed systems like Spark and Dask
• Sprawling codebase: over 200K lines of code
• In the works: pandas 1.0
  • Instead of pandas v0.{n + 1}
  • API-stable release, focusing only on stability, bug fixes, performance improvements

Some known pandas problems
• Missing data inconsistencies
• Excess / unpredictable memory use
• Mutability issues, defensive memory copying
• Difficult to add new data types
• Mostly single-threaded, GIL-contention issues
• Degrading micro-performance from complex pure Python internals
• Costly interoperability with file formats, other systems (Dask, Spark)
See also: 2013 talk “10 Things I Hate About pandas”
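
The first item in the list above, missing data inconsistencies, can be reproduced in a few lines. This is a minimal sketch that assumes the default NumPy-backed dtypes; the variable names are illustrative only.

```python
import pandas as pd

# NumPy integers have no missing-value marker, so introducing a gap
# silently converts the whole column to float64.
ints = pd.Series([1, 2, 3])
with_gap = ints.reindex([0, 1, 2, 3])   # label 3 does not exist, so it becomes NaN

int_dtype_before = str(ints.dtype)
int_dtype_after = str(with_gap.dtype)

# Booleans with a missing value fall back to generic Python objects instead.
bools = pd.Series([True, False, None])
bool_dtype = str(bools.dtype)

print(int_dtype_before, "->", int_dtype_after)
print("bool with missing value ->", bool_dtype)
```

The same kind of event (one missing value) changes an integer column to floating point but a boolean column to object, which is the inconsistency the slide refers to.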

pandas 2.0
• Re-architecting of pandas Series / DataFrame internals
• Backwards compatible for 90+% of user code
• Goals
  • Improved performance
  • Lower and more predictable memory requirements
  • Better utilize hardware with many cores and large amounts of RAM
  • Tighter integration with external data

libpandas: a modern C++ core “engine”
[Diagram: in pandas 0.x/1.x, a pure-Python pandas.core is surrounded by many separate C extensions; in pandas 2.x, a Python wrapper layer sits on top of a single C++11 library, libpandas.so]

pandas 2.0: some libpandas benefits
• Encapsulate memory management and low-level details of Series and
DataFrame
• Isolate user from low-level data issues (e.g. missing data in integers, booleans)
• Better performance for “cheap” operations (e.g. indexing / selection)
• Easier access to multithreading primitives, hardware optimizations
• Note: Similar to the architecture used by TensorFlow and other modern scientific tools targeting Python

pandas 2.0: memory mapping and fast serialization
• Provide for memory-mapped DataFrames or “lazy-loading” of data from disk
• Improved performance with fast disk formats
  • HDF5, Feather, bcolz, zarr, etc.
• Faster data movement in distributed applications
  • Dask, Spark
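
The idea behind memory mapping and lazy loading can be sketched with the standard library alone. This is not how pandas or libpandas implements it; the raw float64 file layout and the helper name `read_float64_slice` are assumptions made for illustration.

```python
import mmap
import os
import shutil
import tempfile
from array import array

def read_float64_slice(path, start, stop):
    """Return values [start, stop) from a raw float64 file via a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[start * 8:stop * 8]   # only these bytes are copied out
    out = array("d")
    out.frombytes(raw)
    return out.tolist()

# Write one million float64 values (about 8 MB) to a temporary file.
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "column.bin")
with open(data_path, "wb") as f:
    array("d", (float(i) for i in range(1_000_000))).tofile(f)

# Fetch a three-value window from the middle without loading the whole file.
window = read_float64_slice(data_path, 500_000, 500_003)
print(window)

shutil.rmtree(tmp_dir)
```

The operating system pages in only the part of the file that is touched, so the cost of reading a small window does not grow with the size of the file.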

Apache Arrow: Process and Move Data Fast
• New top-level Apache project as of February 2016
• Collaboration among a broad set of OSS projects around shared needs
• Language-independent columnar data structures
• Metadata for describing schemas / chunks of data
• Protocol for moving data between processes with minimal serialization
overhead
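
To make "columnar data structures" and "metadata for describing schemas" concrete, here is a toy sketch using only the standard library. It is not the Arrow API or its binary format; the field names and the `(name, typecode)` schema layout are invented for illustration.

```python
from array import array

# Row-oriented: one Python object per record.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.25},
    {"id": 3, "price": 7.0},
]

# Column-oriented: one contiguous, typed buffer per field, plus a schema
# (metadata) that tells any consumer how to interpret each buffer.
schema = [("id", "q"), ("price", "d")]   # (field name, array typecode)
columns = {
    name: array(code, (row[name] for row in rows)) for name, code in schema
}

# A consumer that needs only one field touches only one buffer.
total_price = sum(columns["price"])

# The raw bytes of a column can be handed to another process unchanged
# and reinterpreted there using the schema alone.
price_bytes = columns["price"].tobytes()
restored = array("d")
restored.frombytes(price_bytes)
```

Because the receiving side only reinterprets bytes according to the schema, no per-value conversion is needed, which is the sense in which serialization overhead is minimal.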

High performance data interchange
[Figure: data interchange today compared with data interchange using Arrow. Source: Apache Arrow]

File formats and memory interoperability
• Improving IO throughput is essential to good in-memory pandas performance
• Output data quickly to other tools/systems (R, Spark, Dask, etc.)

Arrow in action: Feather file format
• https://github.com/wesm/feather
• Cross-language binary DataFrame format
• Write from Python, read in R, and vice versa
• Collaboration with Hadley Wickham from the R community

Arrow in action: Parquet files in Python
• Parquet C++ project https://github.com/apache/parquet-cpp
• Parquet files read into Arrow arrays
• Arrow arrays written back to Parquet files
import pyarrow.parquet as pq
# path: location of a Parquet file; options: optional keyword arguments for read_table
table = pq.read_table(path, **options)

Some serialization benchmarks
• Benchmark setup
  • 1 GB DataFrame of floating point data, 100 columns
  • Pickle, Parquet, bcolz, Feather format, Arrow IPC format
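
A wall-clock, RAM-to-RAM measurement of this kind can be sketched with the standard library. This is an assumption about the general method, not the harness used for the talk: a much smaller payload built from `array.array` stands in for the 1 GB DataFrame, only pickle is timed, and the helper name `time_ram_to_ram` is hypothetical.

```python
import pickle
import time
from array import array

def time_ram_to_ram(payload, repeats=5):
    """Best wall-clock time in seconds to rebuild an object from in-memory bytes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        pickle.loads(payload)
        best = min(best, time.perf_counter() - start)
    return best

# 100 "columns" of 10,000 float64 values each (8 MB in total).
columns = [array("d", [float(i)] * 10_000) for i in range(100)]
payload = pickle.dumps(columns, protocol=pickle.HIGHEST_PROTOCOL)

elapsed = time_ram_to_ram(payload)
size_gb = sum(len(c) * c.itemsize for c in columns) / 1e9
print(f"{elapsed * 1e3:.2f} ms, {size_gb / elapsed:.2f} GB/s")
```

Taking the best of several repeats reduces noise from other processes; the absolute numbers depend entirely on the machine and will not match the slides.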

Benchmark results
Tool              Absolute time   Bandwidth
Arrow IPC         148 ms          6.77 GB/s
pandas.HDFStore   178 ms          5.62 GB/s
pickle            296 ms          3.38 GB/s
bcolz             498 ms          2.01 GB/s
pandas.read_csv   21247 ms        0.047 GB/s
Wall clock RAM-to-RAM conversion time from source to a 1 GB pandas.DataFrame with 100 float64 columns (no missing data). HDF5 using tmpfs.
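
The bandwidth column is the data size divided by the elapsed time. The check below assumes the payload is exactly 1 GB and uses the rounded times from the table, so it reproduces the reported figures only to within rounding (Arrow IPC comes out near 6.76 GB/s rather than 6.77, presumably because the reported time was itself rounded).

```python
def bandwidth_gb_per_s(size_gb, elapsed_ms):
    """Throughput in GB/s for size_gb of data moved in elapsed_ms milliseconds."""
    return size_gb / (elapsed_ms / 1000.0)

timings_ms = {
    "Arrow IPC": 148,
    "pandas.HDFStore": 178,
    "pickle": 296,
    "bcolz": 498,
    "pandas.read_csv": 21247,
}

bandwidths = {tool: bandwidth_gb_per_s(1.0, ms) for tool, ms in timings_ms.items()}
csv_slowdown = timings_ms["pandas.read_csv"] / timings_ms["Arrow IPC"]

for tool, gbps in bandwidths.items():
    print(f"{tool:16s} {gbps:6.3f} GB/s")
print(f"read_csv takes about {csv_slowdown:.0f} times as long as Arrow IPC")
```

The ratio in the last line is the main takeaway of the table: parsing text is more than two orders of magnitude slower than loading a binary columnar representation.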

Benefits of standard columnar formats
• Readable for other systems (even JVM-based), easier interoperability
• On-disk filtering / subsetting possible
• Benefit from development work by other communities

Thank you
• Conference organizers
• pandas developer community
• Apache Software Foundation
• Note: Opinions are my own
