Koalas is an open-source project that aims to bridge the gap between big data and small data for data scientists, and to simplify Apache Spark for people who are already familiar with the pandas library in Python. pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data.
4. Outline
▪ What’s Koalas?
▪ pandas vs Apache Spark at a high level
▪ Koalas 1.0
▪ Demo
▪ InternalFrame
▪ Index and Default Index
▪ Roadmap
5. What’s Koalas?
▪ Announced April 24, 2019
▪ Aims at providing the pandas API on top of Apache Spark
▪ Unifies the two ecosystems with a familiar API
▪ Seamless transition between small and large data
▪ For pandas users
  ▪ Scale out pandas code using Koalas
  ▪ Make learning PySpark much easier
▪ For PySpark users
  ▪ Be more productive with pandas-like functions
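As a rough illustration of the "familiar API" point above, the sketch below writes one aggregation function and applies it to both a pandas DataFrame and a Koalas DataFrame. It assumes pandas, PySpark and Koalas are installed; the column names, the sample rows and the helper function names are made up for this example. The library imports are kept inside the functions so the file can be loaded on a machine without Spark.

```python
def total_sales_by_city(df):
    """Sum sales per city, using only calls shared by pandas and Koalas."""
    return df.groupby("city")["sales"].sum().sort_index()


def sample_records():
    """Made-up rows used only for this illustration."""
    return {
        "city": ["Tokyo", "Osaka", "Tokyo", "Osaka"],
        "sales": [10, 20, 30, 40],
    }


def run_with_pandas():
    import pandas as pd  # small data, single machine

    pdf = pd.DataFrame(sample_records())
    return total_sales_by_city(pdf)


def run_with_koalas():
    import databricks.koalas as ks  # assumes Koalas and PySpark are installed

    kdf = ks.DataFrame(sample_records())  # backed by a Spark DataFrame
    # to_pandas() collects the (small) aggregated result to the driver.
    return total_sales_by_city(kdf).to_pandas()
```

Only the DataFrame constructor differs between the two paths; the aggregation itself is the same code.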
6. pandas
▪ Authored by Wes McKinney in 2008
▪ The standard tool for data manipulation and analysis in Python
▪ The current version: 1.0.4
(Chart: Stack Overflow Trends)
7. pandas
▪ Deeply integrated into the Python data science ecosystem
  ▪ numpy
  ▪ matplotlib
  ▪ scikit-learn
▪ Can deal with a lot of different situations, including:
  ▪ Basic statistical analysis
  ▪ Handling missing data
  ▪ Time series, categorical variables, strings
8. Apache Spark
▪ De facto unified analytics engine for large-scale data processing
  ▪ Streaming
  ▪ ETL
  ▪ ML
▪ Originally created at UC Berkeley by Databricks’ founders
▪ PySpark API for Python; also API support for Scala, R and SQL
▪ The latest version: 3.0.0
19. Index and Default Index
▪ Koalas manages a group of columns as an index.
▪ The index behaves the same as pandas’.
▪ to_koalas() has an index_col parameter to specify index columns.
▪ If no index is specified when creating a Koalas DataFrame, it attaches a “default index” automatically.
▪ Each default index type has pros and cons.
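A minimal sketch of the two cases above: converting a Spark DataFrame with explicit index columns, and converting without them so that a default index is attached. It assumes Koalas and PySpark are installed and that `sdf` is an existing Spark DataFrame; the helper names and the "id" column are hypothetical. The three index type names are the ones used by the "compute.default_index_type" option.

```python
DEFAULT_INDEX_TYPES = ("sequence", "distributed-sequence", "distributed")


def check_index_type(index_type):
    """Reject names that are not one of the three default index types."""
    if index_type not in DEFAULT_INDEX_TYPES:
        raise ValueError(f"unknown default index type: {index_type!r}")
    return index_type


def spark_to_koalas_with_index(sdf, index_col="id"):
    """Use an existing column (hypothetical name 'id') as the index."""
    import databricks.koalas  # noqa: F401  importing it adds to_koalas() to Spark DataFrames

    return sdf.to_koalas(index_col=index_col)


def spark_to_koalas_default_index(sdf, index_type="distributed-sequence"):
    """No index columns given, so Koalas attaches a default index."""
    import databricks.koalas as ks

    # Note: set_option changes the setting for the whole session.
    ks.set_option("compute.default_index_type", check_index_type(index_type))
    return sdf.to_koalas()
```

Passing index_col avoids the default index altogether, which is usually the cheaper choice when the data already has a natural key.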
20. Comparison of Default Index Types
Configurable by the option “compute.default_index_type”

Index type           | Distributed computation     | Map-side operation                  | Continuous increment
sequence             | No, in a single worker node | No, requires a shuffle              | Yes
distributed-sequence | Yes                         | Yes, but requires another Spark job | Yes, in most cases
distributed          | Yes                         | Yes                                 | No
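To make the trade-offs in this comparison concrete, here is a pure-Python simulation of the three strategies, with partitions modelled as plain lists. It is not Koalas's implementation, only a sketch of the idea behind each type. The 33-bit layout for the "distributed" case follows the documented behaviour of Spark's monotonically_increasing_id (partition ID in the upper bits, record number in the lower 33 bits), which, as far as I understand the Koalas documentation, is what that index type relies on. All function names are made up for this example.

```python
from itertools import accumulate

# Number of low bits Spark's monotonically_increasing_id reserves for the
# record number within a partition.
RECORD_BITS = 33


def sequence_index(partitions):
    """'sequence': gather every row in one place, then number them 0..n-1."""
    all_rows = [row for part in partitions for row in part]  # single worker
    return list(range(len(all_rows)))


def distributed_sequence_index(partitions):
    """'distributed-sequence': an extra pass counts rows per partition,
    then each partition numbers its rows starting from its own offset."""
    sizes = [len(part) for part in partitions]  # the extra job
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [offset + i for i in range(len(part))]
        for offset, part in zip(offsets, partitions)
    ]


def distributed_index(partitions):
    """'distributed': each partition builds IDs on its own; values increase
    but leave large gaps between partitions."""
    return [
        [(pid << RECORD_BITS) + i for i in range(len(part))]
        for pid, part in enumerate(partitions)
    ]


partitions = [["a", "b", "c"], ["d"], ["e", "f"]]
print(sequence_index(partitions))
print(distributed_sequence_index(partitions))
print(distributed_index(partitions))
```

In the simulation, the first two strategies yield 0 through 5 without gaps, while the third jumps by 2**33 at each partition boundary, which is why it needs no coordination between partitions.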
21. Roadmap
▪ July/Aug 2020: The DBR/MLR 7.1 release will pre-install Koalas 1.x
▪ Improve the coverage and the behavior compatibility of APIs.
▪ Visualization
  ▪ Matplotlib
  ▪ ...
▪ ML libraries
▪ Documentation
  ▪ More examples
  ▪ Workarounds for APIs we won’t support
22. Getting started
▪ pip install koalas
▪ conda install -c conda-forge koalas
▪ Look for docs on https://koalas.readthedocs.io/en/latest/ and updates on github.com/databricks/koalas
▪ A 10-minute tutorial in a live Jupyter notebook is available from the docs.
▪ Blog post: 10 Minutes from pandas to Koalas on Apache Spark
  https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
23. Do you have suggestions or requests?
▪ Submit requests to github.com/databricks/koalas/issues
▪ Very easy to contribute: koalas.readthedocs.io/en/latest/development/contributing.html