Koalas: How Well Does It Scale Pandas for Big Data

Koalas
How Well Does Koalas Work?
Takuya Ueshin, Xinrong Meng
Software Engineer @ Databricks

About Us
Takuya Ueshin
▪ Software Engineer @ Databricks
▪ Apache Spark committer and PMC member
▪ Focusing on Spark SQL and PySpark
▪ Koalas maintainer
Xinrong Meng
▪ Software Engineer @ Databricks
▪ Koalas maintainer

Agenda
▪ Introduction of Koalas
  - pandas
  - PySpark
▪ Koalas Internal
▪ Benchmark
  - Introduction of Dask
  - Koalas benchmark against Dask
▪ Koalas Updates

What’s Koalas?
Announced April 24, 2019
Provides a drop-in replacement for pandas
- enabling efficient scaling out to hundreds of worker nodes
For pandas users
- Scale out the pandas code using Koalas
- Make learning PySpark much easier
For PySpark users
- Be more productive with pandas-like functions

pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem
- NumPy
- Matplotlib
- scikit-learn
Chart: Stack Overflow Trends

Apache Spark
De facto unified analytics engine for large-scale data processing
- Streaming
- ETL
- ML
Originally created at UC Berkeley by Databricks’ founders
PySpark is the Python API;
APIs are also available for Scala/Java, R, and SQL

Koalas DataFrame and PySpark DataFrame
Koalas DataFrame
- Follow the structure of pandas
- Provide pandas APIs
- Implement index/identifier
- Translate pandas APIs into a logical plan of Spark SQL
- The plan will be optimized and executed by Spark SQL engine
PySpark DataFrame
- More compliant with the relations/tables in relational databases
- Does not have unique row identifiers

InternalFrame
Internal immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame and pandas DataFrame

InternalFrame
Diagram: a Koalas DataFrame holds an InternalFrame, which wraps a PySpark DataFrame and tracks
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes

InternalFrame
Diagram: an API call produces a new Koalas DataFrame whose InternalFrame is a copy with new state, backed by a new PySpark DataFrame.

InternalFrame
Diagram: an API call that only updates metadata produces a new Koalas DataFrame whose InternalFrame is a copy with new state, backed by the same PySpark DataFrame.
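The "copy with new state" idea can be sketched in plain Python. This is a minimal illustration, not the Koalas implementation: `InternalFrameSketch` and its two methods are hypothetical names, and a bare `object()` stands in for a PySpark DataFrame.

```python
from dataclasses import dataclass, replace
from typing import Any, Tuple


@dataclass(frozen=True)
class InternalFrameSketch:
    """Hypothetical stand-in for InternalFrame, not the Koalas class itself."""

    spark_frame: Any                # placeholder for the current PySpark DataFrame
    index_names: Tuple[str, ...]    # index names
    column_labels: Tuple[str, ...]  # data column names

    def with_column_labels(self, labels):
        # Metadata-only API call: copy with new state, reuse the same frame.
        return replace(self, column_labels=tuple(labels))

    def with_spark_frame(self, spark_frame):
        # API call that changes the data: copy with new state and a new frame.
        return replace(self, spark_frame=spark_frame)


frame_a = object()  # stands in for a PySpark DataFrame
original = InternalFrameSketch(frame_a, ("idx",), ("fare_amt", "tip_amt"))
renamed = original.with_column_labels(["fare", "tip"])
filtered = original.with_spark_frame(object())
```

Because the dataclass is frozen, `original` is never mutated: `renamed` shares the same underlying frame with different labels, while `filtered` points at a different frame.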

Introduction of Dask
• A parallel computing framework
• Written in pure Python
• Using blocked algorithms and task scheduling
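As a rough illustration of what "blocked algorithms and task scheduling" means, the sketch below computes a mean by splitting the input into blocks, running one small task per block, and combining the partial results. It uses only the Python standard library; the function names are hypothetical and a thread pool merely stands in for a real task scheduler, so this is not how Dask itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def block_stats(block):
    # One task per block: return the partial sum and the number of values.
    return sum(block), len(block)


def blocked_mean(values, block_size):
    # Split the input into fixed-size blocks (the "blocked algorithm" part).
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    # Hand the per-block tasks to a pool (a stand-in for a task scheduler).
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(block_stats, blocks))
    # Combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


result = blocked_mean([1, 2, 3, 4, 5, 6, 7], block_size=3)
```

The point of the design is that no single task ever needs the whole dataset: each block can be processed independently, and only the small partial results are brought together.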

Dask is different from Koalas
Execution engine
- Koalas: Apache Spark, a unified analytics engine for large-scale data processing
- Dask: Dask, a graph execution engine
Aim
- Koalas: A single codebase that works with both pandas and Spark
- Dask: Scale pandas workflow
Abstraction
- Koalas: Query plan
- Dask: Task graph and task scheduler
Collections
- Koalas: DataFrame
- Dask: Array, DataFrame, Bag

Benchmark setup - Methodology
• Dataset
  - 157 GB Yellow Taxi Trip Records (2009 - 2013)
• Operations
  - Basic statistical calculations
  - Joins
  - Grouping
• Operations were applied to
  - The whole dataset
  - Filtered data (36% of the whole dataset)
  - Cached filtered data (36% of the whole dataset)
The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.

Benchmark setup - Environment
• Local execution
  - A single i3.16xlarge VM (488 GB memory | 64 cores | 25 Gigabit Ethernet)
• Distributed execution
  - 1 driver node, 3 worker nodes
  - Each node is an i3.4xlarge VM (122 GB memory | 16 cores | 10 Gigabit Ethernet)

Benchmark results - Overview
Koalas outperformed Dask:
                        Geometric Mean   Simple Average
Local execution         2.1x             4x
Distributed execution   4.6x             7.9x
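The two summary columns can differ a lot because they aggregate per-operation speedups differently. The sketch below shows the arithmetic with the standard library only; the `speedups` list is made up for illustration and is not taken from the benchmark.

```python
import math


def simple_average(ratios):
    # Arithmetic mean of the speedup ratios.
    return sum(ratios) / len(ratios)


def geometric_mean(ratios):
    # n-th root of the product, computed through logarithms.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-operation speedups, NOT the benchmark's measurements.
speedups = [0.5, 2.0, 8.0]

avg = simple_average(speedups)   # 3.5
geo = geometric_mean(speedups)   # approximately 2.0
```

A single large ratio pulls the simple average up much more than the geometric mean, and the geometric mean of positive numbers is never larger than their simple average, which is consistent with the simple-average column being the larger one in both rows.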

Benchmark results - On the whole dataset
Local execution: Koalas is ~1.2x faster
Distributed execution: Koalas is ~2x faster

Benchmark results - On the filtered data
Local execution: Koalas is ~6x faster
Distributed execution: Koalas is ~9x faster

Benchmark results - On the cached filtered data
Local execution: Koalas is ~1.4x faster
Distributed execution: Koalas is ~5x faster

Why is Koalas fast?
● Query plan optimization by Catalyst
● Whole-stage code generation

Why is Koalas fast - Catalyst optimizer
Query plan of mean calculation on the filtered data
• Before the Catalyst optimization
• After the Catalyst optimization
# Pseudocode
expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5)
df[expr_filter].fare_amt.mean()
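The query plans themselves are not reproduced here. As a toy illustration of one kind of rewrite a query optimizer such as Catalyst can apply (column pruning), the sketch below computes the same filtered mean two ways in plain Python. The rows and the `passenger_count` column are made up, and neither function is Spark code.

```python
# Toy rows standing in for taxi trips; the values are made up for illustration.
rows = [
    {"fare_amt": 10.0, "tip_amt": 0.5, "passenger_count": 1},
    {"fare_amt": 20.0, "tip_amt": 2.0, "passenger_count": 2},
    {"fare_amt": 30.0, "tip_amt": 4.0, "passenger_count": 1},
    {"fare_amt": 40.0, "tip_amt": 9.0, "passenger_count": 3},
]


def naive_mean(rows):
    # Filter full rows first, then pick the column: every column is carried along.
    kept = [row for row in rows if 1 <= row["tip_amt"] <= 5]
    fares = [row["fare_amt"] for row in kept]
    return sum(fares) / len(fares)


def pruned_mean(rows):
    # Read only the two columns the query touches, then filter and aggregate.
    pairs = ((row["tip_amt"], row["fare_amt"]) for row in rows)
    fares = [fare for tip, fare in pairs if 1 <= tip <= 5]
    return sum(fares) / len(fares)


naive = naive_mean(rows)
pruned = pruned_mean(rows)
```

Both functions return the same value; the second simply never touches columns the query does not need, which matters when the data is wide and stored in a columnar format.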

Why is Koalas fast - Whole-stage code generation
~650% improvement
~1200% improvement
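Whole-stage code generation collapses a chain of operators into a single function. Spark generates JVM code for this; the Python sketch below only mimics the idea, contrasting an operator-at-a-time pipeline with a hand-fused loop for the same filtered mean. All function names and the `trips` data are hypothetical.

```python
def scan(rows):
    # Leaf operator: produce rows one at a time.
    for row in rows:
        yield row


def filter_rows(child, predicate):
    # Filter operator: pull from the child and keep matching rows.
    for row in child:
        if predicate(row):
            yield row


def project(child, fn):
    # Projection operator: pull from the child and transform each row.
    for row in child:
        yield fn(row)


def mean_operator_at_a_time(rows):
    # Each row passes through several operator calls before it is aggregated.
    plan = project(filter_rows(scan(rows), lambda r: 1 <= r[0] <= 5), lambda r: r[1])
    total, count = 0.0, 0
    for fare in plan:
        total += fare
        count += 1
    return total / count


def mean_fused(rows):
    # The same stage collapsed into one loop, as generated code would do.
    total, count = 0.0, 0
    for tip, fare in rows:
        if 1 <= tip <= 5:
            total += fare
            count += 1
    return total / count


# Made-up (tip_amt, fare_amt) pairs for illustration only.
trips = [(0.5, 10.0), (2.0, 20.0), (4.0, 30.0), (9.0, 40.0)]
pipelined = mean_operator_at_a_time(trips)
fused = mean_fused(trips)
```

The two versions produce the same answer; the fused loop avoids the per-row hand-offs between operators, which is the overhead that code generation is meant to remove.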

Benchmark conclusions
• SQL optimizers improve the performance of DataFrame APIs
• Caching accelerates both Koalas and Dask dramatically
• Koalas outperforms Dask in the majority of use cases
Reference blog post: Benchmark: Koalas (PySpark) and Dask

Version 1.0.0~1.8.0
▪ Improve Plotly backend support, and switch the default plotting backend to Plotly
▪ Extension dtypes support
▪ More Index types
▪ Create Index from Series or Index objects
▪ Support setting to a Series via attribute access
▪ Operations between Series and Index
▪ Standardize binary operations between int and str columns
▪ Index operations support
▪ Better type support
▪ Return type annotations for major Koalas objects

Version 1.0.0~1.8.0
▪ Support for non-string names
▪ Non-named Series support
▪ Wider support of in-place update
▪ Improve distributed-sequence default index
▪ pandas 1.1, 1.1.4 support
▪ Better pandas API coverage
▪ Introduced koalas and Spark accessors
▪ Improve testing infrastructure
▪ Apache Spark 3.0 support
▪ Python 3.8 support
▪ Support for API extensions
▪ Better type hints support

Porting Koalas to Spark
SPIP: Support pandas API layer on PySpark
https://issues.apache.org/jira/browse/SPARK-34849

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
