Making Easy Transition Koalas pandas Apache Spark

Koalas: Making an Easy Transition from pandas to Apache Spark
Takuya Ueshin
Software Engineer @ Databricks

About
Takuya Ueshin
Software Engineer at Databricks
▪ Apache Spark committer and PMC member
▪ Focusing on Spark SQL and PySpark
▪ Koalas maintainer

Outline
▪ What’s Koalas?
▪ pandas vs Apache Spark at a high level
▪ Koalas 1.0
▪ Demo
▪ InternalFrame
▪ Index and Default Index
▪ Roadmap

What’s Koalas?
▪ Announced April 24, 2019
▪ Aims at providing the pandas API on top of Apache Spark
▪ Unifies the two ecosystems with a familiar API
▪ Seamless transition between small and large data
▪ For pandas users
▪ Scale out the pandas code using Koalas
▪ Make learning PySpark much easier
▪ For PySpark users
▪ Be more productive with pandas-like functions

pandas
▪ Authored by Wes McKinney in 2008
▪ The standard tool for data manipulation and analysis in Python
▪ The current version: 1.0.4
(Chart: Stack Overflow Trends)

pandas
▪ Deeply integrated into Python data science ecosystem
▪ numpy
▪ matplotlib
▪ scikit-learn
▪ Can deal with a lot of different situations, including:
▪ Basic statistical analysis
▪ Handling missing data
▪ Time series, categorical variables, strings

Apache Spark
▪ De facto unified analytics engine for large-scale data processing
▪ Streaming
▪ ETL
▪ ML
▪ Originally created at UC Berkeley by Databricks’ founders
▪ PySpark API for Python; also API support for Scala, R and SQL
▪ The latest version: 3.0.0

pandas DataFrame vs. PySpark DataFrame
▪ Column
  ▪ pandas: df['col']
  ▪ PySpark: df['col']
▪ Mutability
  ▪ pandas: Mutable
  ▪ PySpark: Immutable
▪ Execution
  ▪ pandas: Eagerly
  ▪ PySpark: Lazily
▪ Add a column
  ▪ pandas: df['c'] = df['a'] + df['b']
  ▪ PySpark: df = df.withColumn('c', df['a'] + df['b'])
▪ Rename columns
  ▪ pandas: df.columns = ['a','b']
  ▪ PySpark: df = df.select(df['c1'].alias('a'), df['c2'].alias('b'))
  ▪ PySpark (alternative): df = df.toDF('a', 'b')
▪ Value count
  ▪ pandas: df['col'].value_counts()
  ▪ PySpark: df.groupBy(df['col']).count().orderBy('count', ascending=False)
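The mutability and execution rows of this comparison can be illustrated without either library. The following is a minimal pure-Python sketch, not pandas or PySpark code: `EagerFrame` and `LazyFrame` are hypothetical toy classes that only mimic the semantics (an immediate in-place update versus a new object holding a deferred plan).

```python
class EagerFrame:
    """Toy stand-in for pandas semantics: mutable, evaluated immediately."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def add_column(self, name, left, right):
        # Computed right away and stored in place,
        # like df['c'] = df['a'] + df['b'] in pandas.
        self.columns[name] = [
            a + b for a, b in zip(self.columns[left], self.columns[right])
        ]


class LazyFrame:
    """Toy stand-in for PySpark semantics: immutable, evaluated on demand."""

    def __init__(self, columns, plan=()):
        self._columns = columns
        self._plan = tuple(plan)

    def with_column(self, name, left, right):
        # Nothing is computed here; a new frame with a longer plan is returned,
        # like df = df.withColumn('c', df['a'] + df['b']) in PySpark.
        return LazyFrame(self._columns, self._plan + ((name, left, right),))

    def collect(self):
        # The recorded plan runs only when results are requested.
        result = {name: list(values) for name, values in self._columns.items()}
        for name, left, right in self._plan:
            result[name] = [a + b for a, b in zip(result[left], result[right])]
        return result


data = {"a": [1, 2], "b": [10, 20]}

eager = EagerFrame(data)
eager.add_column("c", "a", "b")          # eager itself now contains column "c"

lazy = LazyFrame(data)
lazy2 = lazy.with_column("c", "a", "b")  # lazy is unchanged; lazy2 holds the plan
```

This is why the PySpark column of the comparison always reassigns `df`: each call returns a new DataFrame rather than changing the existing one.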

A short example
pandas:
  import pandas as pd
  df = pd.read_csv("/path/to/my_data.csv")
  df.columns = ['x', 'y', 'z1']
  df['x2'] = df.x * df.x
PySpark:
  df = (spark.read
        .option("inferSchema", "true")
        .csv("/path/to/my_data.csv"))
  df = df.toDF('x', 'y', 'z1')
  df = df.withColumn('x2', df.x * df.x)

A short example
pandas:
  import pandas as pd
  df = pd.read_csv("/path/to/my_data.csv")
Koalas:
  import databricks.koalas as ks
  df = ks.read_csv("/path/to/my_data.csv")
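Because Koalas follows the pandas API, the remaining two lines of the earlier pandas example can stay the same after the import is swapped. A minimal sketch, assuming the CSV has three columns as in the previous slide; `prepare` and `load_with_koalas` are hypothetical helper names, and the Koalas import is deferred so that `prepare` can also be tried with plain pandas.

```python
def prepare(df):
    """Rename the columns and add a squared column.

    Uses only operations shown on the slides, so it accepts either a
    pandas DataFrame or a Koalas DataFrame.
    """
    df.columns = ['x', 'y', 'z1']
    df['x2'] = df.x * df.x
    return df


def load_with_koalas(path):
    """Read a CSV with Koalas (needs Koalas and a working Spark setup)."""
    import databricks.koalas as ks  # deferred: only needed on the Koalas path
    return prepare(ks.read_csv(path))
```

With pandas, `prepare(pd.read_csv(path))` runs the same function body; only the reader changes.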

Koalas Growth
▪ 30,000+ Downloads per day, 800,000+ Downloads last month

Koalas 1.0
▪ Spark 3.0 support
▪ Optimize using Spark 3.0 functions, such as mapInPandas().
▪ Python 3.8 support
▪ pandas 1.0 support (since 0.28.0)
▪ Basically Koalas will follow pandas 1.0 behavior.
▪ Remove deprecated functions
▪ Functions removed in pandas 1.0
▪ @pandas_wraps, DataFrame.map_in_pandas()
▪ Introduce spark property and move Spark-specific functions

Koalas 1.0
▪ Most common pandas functions have been implemented in Koalas:
▪ Series : 70%
▪ DataFrame : 77%
▪ Index : 65%
▪ MultiIndex : 60%
▪ DataFrameGroupBy : 67%
▪ SeriesGroupBy : 69%
▪ Plotting: 80%
▪ APIs for Spark users:
▪ to_koalas(), to_spark()
▪ DataFrame.spark.to_spark_io(), ks.read_spark_io(), ...
▪ DataFrame.spark.cache(), ks.sql(), ...
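The Spark-facing APIs listed above can be combined into a round trip. This is a hedged sketch rather than a reference usage: `spark_roundtrip` is a hypothetical helper, the sample rows and column names are made up, and it assumes an existing SparkSession is passed in and that Koalas 1.x is installed.

```python
def spark_roundtrip(spark):
    """Convert Spark -> Koalas -> Spark, keeping 'key' as the index.

    `spark` is an existing SparkSession; the data below is illustrative only.
    """
    # Importing Koalas also makes to_koalas() available on Spark DataFrames.
    import databricks.koalas as ks  # noqa: F401

    sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "value"])

    # Spark DataFrame -> Koalas DataFrame, using 'key' as the index.
    kdf = sdf.to_koalas(index_col="key")

    # Cache the underlying Spark data while it is reused; used as a
    # context manager, the cache is released when the block exits.
    with kdf.spark.cache() as cached:
        row_count = len(cached)

    # Koalas DataFrame -> Spark DataFrame; index_col keeps the index as a column.
    sdf_back = kdf.to_spark(index_col="key")
    return row_count, sdf_back.columns
```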

InternalFrame
▪ A Koalas DataFrame holds an InternalFrame
▪ The InternalFrame holds metadata and a Spark DataFrame:
  ▪ column_labels
  ▪ index_map
  ▪ spark_columns
  ▪ Spark DataFrame

InternalFrame
▪ Before: Koalas DataFrame → InternalFrame (column_labels, index_map, spark_columns) → Spark DataFrame
▪ API call: copy with a new state
▪ After: new Koalas DataFrame → new InternalFrame (column_labels, index_map, spark_columns) → new Spark DataFrame

InternalFrame
▪ Before: Koalas DataFrame → InternalFrame (column_labels, index_map, spark_columns) → Spark DataFrame
▪ API call: copy with a new state
▪ After: new Koalas DataFrame → new InternalFrame (column_labels, index_map, spark_columns); the diagram shows no separate Spark DataFrame for it
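The "copy with a new state" idea in the InternalFrame diagrams can be sketched in plain Python. This is an illustration under stated assumptions, not the Koalas implementation: `ToyInternalFrame` and `ToyDataFrame` are hypothetical classes, a plain dict stands in for the Spark DataFrame, and only the field names `column_labels`, `index_map` and `spark_columns` come from the slides.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyInternalFrame:
    """Immutable metadata plus a reference to the underlying data."""
    spark_frame: dict                 # stand-in for a Spark DataFrame
    column_labels: Tuple[str, ...]    # labels shown to the user
    spark_columns: Tuple[str, ...]    # matching column names in spark_frame
    index_map: Tuple[str, ...] = ()   # columns treated as the index


class ToyDataFrame:
    def __init__(self, internal):
        self._internal = internal

    def rename(self, mapping):
        # Metadata-only change: new labels, same underlying data object.
        labels = tuple(mapping.get(l, l) for l in self._internal.column_labels)
        return ToyDataFrame(replace(self._internal, column_labels=labels))

    def with_column(self, label, values):
        # Data change: a new underlying frame and new metadata are created.
        new_data = dict(self._internal.spark_frame, **{label: list(values)})
        return ToyDataFrame(replace(
            self._internal,
            spark_frame=new_data,
            column_labels=self._internal.column_labels + (label,),
            spark_columns=self._internal.spark_columns + (label,),
        ))

    def to_dict(self):
        i = self._internal
        return {lab: i.spark_frame[col]
                for lab, col in zip(i.column_labels, i.spark_columns)}


base = ToyDataFrame(ToyInternalFrame({"c1": [1, 2]}, ("c1",), ("c1",)))
renamed = base.rename({"c1": "a"})           # shares base's underlying data
extended = renamed.with_column("b", [3, 4])  # gets its own underlying data
```

In this toy model every API call returns a new frame and leaves the old one untouched; whether the underlying data is replaced depends on whether the call changes data or only metadata.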

Index and Default Index
▪ Koalas manages a group of columns as an index.
▪ The index behaves the same as pandas’.
▪ to_koalas() has index_col parameter to specify index columns.
▪ If no index is specified when creating a Koalas DataFrame, a “default index” is attached automatically.
▪ Each “default index” type has pros and cons.

Comparison of Default Index Types
Configurable by the option “compute.default_index_type”
▪ sequence
  ▪ Distributed computation: No, in a single worker node
  ▪ Map-side operation: No, requires a shuffle
  ▪ Continuous increment: Yes
▪ distributed-sequence
  ▪ Distributed computation: Yes
  ▪ Map-side operation: Yes, but requires another Spark job
  ▪ Continuous increment: Yes, in most cases
▪ distributed
  ▪ Distributed computation: Yes
  ▪ Map-side operation: Yes
  ▪ Continuous increment: No
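The trade-offs between the three default index types can be made concrete with a small simulation. The sketch below is pure Python and simplified: lists of lists stand in for Spark partitions, the three function names are hypothetical, and the `distributed` variant assumes the ID layout of Spark's `monotonically_increasing_id()` (partition ID in the upper bits, row position in the lower 33 bits).

```python
def sequence_index(partitions):
    """'sequence': bring all rows to one place and number them 0..n-1."""
    all_rows = [row for part in partitions for row in part]  # single-node step
    return list(range(len(all_rows)))


def distributed_sequence_index(partitions):
    """'distributed-sequence': count rows per partition first (the extra
    pass), then number each partition independently from its offset."""
    counts = [len(part) for part in partitions]              # extra pass
    offsets = [sum(counts[:i]) for i in range(len(counts))]
    return [[off + i for i in range(len(part))]
            for off, part in zip(offsets, partitions)]


def distributed_index(partitions):
    """'distributed': each partition numbers its own rows with no
    coordination, so values increase but leave gaps between partitions."""
    return [[(pid << 33) + i for i in range(len(part))]
            for pid, part in enumerate(partitions)]


partitions = [["a", "b"], ["c"], ["d", "e"]]

print(sequence_index(partitions))              # [0, 1, 2, 3, 4]
print(distributed_sequence_index(partitions))  # [[0, 1], [2], [3, 4]]
print(distributed_index(partitions))
# [[0, 1], [8589934592], [17179869184, 17179869185]]
```

In Koalas itself the type is chosen through the option named on the slide, for example `ks.set_option("compute.default_index_type", "distributed")`.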

Roadmap
▪ July/Aug 2020: The DBR/MLR 7.1 release will pre-install Koalas 1.x
▪ Improve the coverage and the behavior compatibility of APIs.
▪ Visualization
▪ Matplotlib
▪ ...
▪ ML libraries
▪ Documentation
▪ More examples
▪ Workarounds for APIs we won’t support

Getting started
▪ pip install koalas
▪ conda install -c conda-forge koalas
▪ Look for docs on https://koalas.readthedocs.io/en/latest/ and updates on github.com/databricks/koalas
▪ A 10-minute tutorial in a live Jupyter notebook is available from the docs.
▪ blog post: 10 Minutes from pandas to Koalas on Apache Spark
https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html

Do you have suggestions or requests?
▪ Submit requests to github.com/databricks/koalas/issues
▪ Very easy to contribute
koalas.readthedocs.io/en/latest/development/contributing.html

Koalas Session
▪ Koalas: pandas on Apache Spark
▪ Friday, June 26th 10:00 AM (PDT)

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
