SlideShare uma empresa Scribd logo
1 de 40
Baixar para ler offline
Koalas
How Well Does Koalas Work?
Takuya Ueshin, Xinrong Meng
Software Engineer @ Databricks
About Us
Takuya Ueshin
▪ Software Engineer @ Databricks
▪ Apache Spark committer and PMC
member
▪ Focusing on Spark SQL and PySpark
▪ Koalas maintainer
Xinrong Meng
▪ Software Engineer @ Databricks
▪ Koalas maintainer
Agenda
▪ Introduction of Koalas
pandas
PySpark
▪ Koalas Internal
▪ Benchmark
Introduction of Dask
Koalas benchmark against Dask
▪ Koalas Updates
Introduction of Koalas
What’s Koalas?
Announced April 24, 2019
Provides a drop-in replacement for pandas
- enabling efficient scaling out to hundred of worker nodes
For pandas users
- Scale out the pandas code using Koalas
- Make learning PySpark much easier
For PySpark users
- More productive by pandas-like functions
pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation
and analysis in Python
Deeply integrated into Python data
science ecosystem
- NumPy
- Matplotlib
- scikit-learn
Stack Overflow Trends
Apache Spark
De facto unified analytics engine for large-scale data processing
- Streaming
- ETL
- ML
Originally created at UC Berkeley by Databricks’ founders
PySpark for Python;
also APIs support for Scala/Java, R, and SQL
Koalas DataFrame and PySpark DataFrame
- Follow the structure of pandas
- Provide pandas APIs
- Implement index/identifier
- More compliant with the
relations/tables in relational
databases
- Does not have unique row identifiers
PySpark DataFrame
Koalas DataFrame
Koalas DataFrame and PySpark DataFrame
- Follow the structure of pandas
- Provide pandas APIs
- Implement index/identifier
- Translate pandas APIs into a logical
plan of Spark SQL
- The plan will be optimized and
executed by Spark SQL engine
- More compliant with the
relations/tables in relational
databases
- Does not have unique row identifiers
PySpark DataFrame
Koalas DataFrame
Koalas Internal
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Koalas
DataFrame
PySpark
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
InternalFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
PySpark
DataFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
PySpark
DataFrame
API call
copy with new
state
InternalFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
PySpark
DataFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
API call
Only updates metadata
copy with new
state
Benchmark
Introduction of Dask
• A parallel computing framework
• Written in pure python
• Using blocked algorithms and
task scheduling
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
Abstraction
Collections
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
A single codebase that works with both
pandas and Spark
Scale pandas workflow
Abstraction
Collections
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
A single codebase that works with both
pandas and Spark
Scale pandas workflow
Abstraction Query plan Task graph and task scheduler
Collections
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
A single codebase that works with both
pandas and Spark
Scale pandas workflow
Abstraction Query plan Task graph and task scheduler
Collections DataFrame Array, DataFrame, Bag
Benchmark setup - Methodology
• Dataset
157 GB Yellow Taxi Trip Records (2009 - 2013)
• Operations
Basic statistical calculations
Joins
Grouping
• Operations were applied to
The whole dataset
Filtered data (36% whole dataset)
Cached filtered data (36% whole dataset)
The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
Benchmark setup - Environment
• Local execution
A single i3.16xlarge VM:
(488 GB memory | 64 cores | 25 Gigabit Ethernet)
• Distributed execution
1 driver node, 3 worker nodes
Each node is a i3.4xlarge VM:
(122 GB memory | 16 cores | 10 Gigabit Ethernet)
Benchmark results - Overview
Geometric Mean Simple Average
Local execution 2.1x 4x
Distributed execution 4.6x 7.9x
Koalas outperformed Dask:
Benchmark results - On the whole dataset
Local execution: Koalas is ~1.2x
faster
Distributed execution: Koalas is ~2x
faster
Benchmark results - On the filtered data
Local execution: Koalas is ~6x faster Distributed execution: Koalas is ~9x
faster
Benchmark results - On the cached filtered data
Local execution: Koalas is ~1.4x faster Distributed execution: Koalas is ~5x
faster
Why is Koalas fast?
● Query plan optimization by Catalyst
● Whole-stage code generation
Why is Koalas fast - Catalyst optimizer
Query plan of mean calculation on the filtered data
• Before the Catalyst’s optimization
# Pseudocode
expr_filter = (df.tip_amt >= 1) &
(df.tip_amt <= 5)
df[expr_filter].fare_amt.mean()
Why is Koalas fast - Catalyst optimizer
Query plan of mean calculation on the filtered data
• Before the Catalyst optimization
• After the Catalyst optimization
# Pseudocode
expr_filter = (df.tip_amt >= 1) &
(df.tip_amt <= 5)
df[expr_filter].fare_amt.mean()
Why is Koalas fast - Whole-stage code generation
~650%
improvement
~1200%
improvement
Benchmark conclusions
• SQL optimizers improve the performance of DataFrame
APIs
• Caching accelerates both Koalas and Dask dramatically
• Koalas outperforms Dask in the majority of use cases
Reference blog post : Benchmark: Koalas (PySpark) and Dask
Koalas updates
Version 1.0.0~1.8.0
▪ Improve Plotly backend support, and
switch the default plotting backend
to Plotly
▪ Extension dtypes support
▪ More Index types
▪ Create Index from Series or Index
objects
▪ Support setting to a Series via
attribute access
▪ Operations between Series and Index
▪ Standardize binary operations
between int and str columns
▪ Index operations support
▪ Better type support
▪ Return type annotations for major
Koalas objects
Version 1.0.0~1.8.0
▪ Support for non-string names
▪ Non-named Series support
▪ Wider support of in-place update
▪ Improve distributed-sequence
default index
▪ pandas 1.1, 1.1.4 support
▪ Better pandas API coverage
▪ Introduced koalas and Spark
accessors
▪ Improve testing infrastructure
▪ Apache Spark 3.0 support
▪ Python 3.8 support
▪ Support for API extensions
▪ Better type hints support
Porting Koalas to Spark
SPIP: Support pandas API layer on PySpark
https://issues.apache.org/jira/browse/SPARK-
34849
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

Mais conteúdo relacionado

Mais procurados

Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)NTT DATA OSS Professional Services
 
NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)NTT DATA Technology & Innovation
 
Amazon Cognito使って認証したい?それならSpring Security使いましょう!
Amazon Cognito使って認証したい?それならSpring Security使いましょう!Amazon Cognito使って認証したい?それならSpring Security使いましょう!
Amazon Cognito使って認証したい?それならSpring Security使いましょう!Ryosuke Uchitate
 
大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...
大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...
大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...NTT DATA Technology & Innovation
 
平成最後の1月ですし、Databricksでもやってみましょうか
平成最後の1月ですし、Databricksでもやってみましょうか平成最後の1月ですし、Databricksでもやってみましょうか
平成最後の1月ですし、DatabricksでもやってみましょうかRyuichi Tokugami
 
Amazon S3を中心とするデータ分析のベストプラクティス
Amazon S3を中心とするデータ分析のベストプラクティスAmazon S3を中心とするデータ分析のベストプラクティス
Amazon S3を中心とするデータ分析のベストプラクティスAmazon Web Services Japan
 
初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!Tetsutaro Watanabe
 
グラフデータベース入門
グラフデータベース入門グラフデータベース入門
グラフデータベース入門Masaya Dake
 
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)Takeshi Mikami
 
はじめてのElasticsearchクラスタ
はじめてのElasticsearchクラスタはじめてのElasticsearchクラスタ
はじめてのElasticsearchクラスタSatoyuki Tsukano
 
Data platformdesign
Data platformdesignData platformdesign
Data platformdesignRyoma Nagata
 
リクルートのビッグデータ活用基盤とデータ活用に向けた取組み
リクルートのビッグデータ活用基盤とデータ活用に向けた取組みリクルートのビッグデータ活用基盤とデータ活用に向けた取組み
リクルートのビッグデータ活用基盤とデータ活用に向けた取組みRecruit Technologies
 
IoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkIoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkTakanori Suzuki
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and PerformanceMineaki Motohashi
 
pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)
pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)
pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)NTT DATA Technology & Innovation
 
アプリを成長させるためのログ取りとログ解析に必要なこと
アプリを成長させるためのログ取りとログ解析に必要なことアプリを成長させるためのログ取りとログ解析に必要なこと
アプリを成長させるためのログ取りとログ解析に必要なことTakao Sumitomo
 
マイクロにしすぎた結果がこれだよ!
マイクロにしすぎた結果がこれだよ!マイクロにしすぎた結果がこれだよ!
マイクロにしすぎた結果がこれだよ!mosa siru
 
こんなに使える!今どきのAPIドキュメンテーションツール
こんなに使える!今どきのAPIドキュメンテーションツールこんなに使える!今どきのAPIドキュメンテーションツール
こんなに使える!今どきのAPIドキュメンテーションツールdcubeio
 
ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計Yoshinori Matsunobu
 

Mais procurados (20)

Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
 
NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
NTTデータが考えるデータ基盤の次の一手 ~AI活用のために知っておくべき新潮流とは?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
 
Amazon Cognito使って認証したい?それならSpring Security使いましょう!
Amazon Cognito使って認証したい?それならSpring Security使いましょう!Amazon Cognito使って認証したい?それならSpring Security使いましょう!
Amazon Cognito使って認証したい?それならSpring Security使いましょう!
 
大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...
大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...
大量のデータ処理や分析に使えるOSS Apache Spark入門 - Open Source Conference2020 Online/Fukuoka...
 
平成最後の1月ですし、Databricksでもやってみましょうか
平成最後の1月ですし、Databricksでもやってみましょうか平成最後の1月ですし、Databricksでもやってみましょうか
平成最後の1月ですし、Databricksでもやってみましょうか
 
Amazon S3を中心とするデータ分析のベストプラクティス
Amazon S3を中心とするデータ分析のベストプラクティスAmazon S3を中心とするデータ分析のベストプラクティス
Amazon S3を中心とするデータ分析のベストプラクティス
 
初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!
 
グラフデータベース入門
グラフデータベース入門グラフデータベース入門
グラフデータベース入門
 
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
 
はじめてのElasticsearchクラスタ
はじめてのElasticsearchクラスタはじめてのElasticsearchクラスタ
はじめてのElasticsearchクラスタ
 
Data platformdesign
Data platformdesignData platformdesign
Data platformdesign
 
リクルートのビッグデータ活用基盤とデータ活用に向けた取組み
リクルートのビッグデータ活用基盤とデータ活用に向けた取組みリクルートのビッグデータ活用基盤とデータ活用に向けた取組み
リクルートのビッグデータ活用基盤とデータ活用に向けた取組み
 
IoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkIoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache Flink
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and Performance
 
pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)
pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)
pgvectorを使ってChatGPTとPostgreSQLを連携してみよう!(PostgreSQL Conference Japan 2023 発表資料)
 
アプリを成長させるためのログ取りとログ解析に必要なこと
アプリを成長させるためのログ取りとログ解析に必要なことアプリを成長させるためのログ取りとログ解析に必要なこと
アプリを成長させるためのログ取りとログ解析に必要なこと
 
マイクロにしすぎた結果がこれだよ!
マイクロにしすぎた結果がこれだよ!マイクロにしすぎた結果がこれだよ!
マイクロにしすぎた結果がこれだよ!
 
こんなに使える!今どきのAPIドキュメンテーションツール
こんなに使える!今どきのAPIドキュメンテーションツールこんなに使える!今どきのAPIドキュメンテーションツール
こんなに使える!今どきのAPIドキュメンテーションツール
 
ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計
 
Docker Compose 徹底解説
Docker Compose 徹底解説Docker Compose 徹底解説
Docker Compose 徹底解説
 

Semelhante a Koalas: How Well Does It Scale Pandas for Big Data

Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsXiao Li
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Koalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache SparkKoalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache SparkDatabricks
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongCeph Community
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; pythonMaloy Manna, PMP®
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 

Semelhante a Koalas: How Well Does It Scale Pandas for Big Data (20)

Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Koalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache SparkKoalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache Spark
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 

Último (20)

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Koalas: How Well Does It Scale Pandas for Big Data

  • 1. Koalas How Well Does Koalas Work? Takuya Ueshin, Xinrong Meng Software Engineer @ Databricks
  • 2. About Us Takuya Ueshin ▪ Software Engineer @ Databricks ▪ Apache Spark committer and PMC member ▪ Focusing on Spark SQL and PySpark ▪ Koalas maintainer Xinrong Meng ▪ Software Engineer @ Databricks ▪ Koalas maintainer
  • 3. Agenda ▪ Introduction of Koalas pandas PySpark ▪ Koalas Internal ▪ Benchmark Introduction of Dask Koalas benchmark against Dask ▪ Koalas Updates
  • 5. What’s Koalas? Announced April 24, 2019 Provides a drop-in replacement for pandas - enabling efficient scaling out to hundred of worker nodes For pandas users - Scale out the pandas code using Koalas - Make learning PySpark much easier For PySpark users - More productive by pandas-like functions
  • 6. pandas Authored by Wes McKinney in 2008 The standard tool for data manipulation and analysis in Python Deeply integrated into Python data science ecosystem - NumPy - Matplotlib - scikit-learn Stack Overflow Trends
  • 7. Apache Spark De facto unified analytics engine for large-scale data processing - Streaming - ETL - ML Originally created at UC Berkeley by Databricks’ founders PySpark for Python; also APIs support for Scala/Java, R, and SQL
  • 8. Koalas DataFrame and PySpark DataFrame - Follow the structure of pandas - Provide pandas APIs - Implement index/identifier - More compliant with the relations/tables in relational databases - Does not have unique row identifiers PySpark DataFrame Koalas DataFrame
  • 9. Koalas DataFrame and PySpark DataFrame - Follow the structure of pandas - Provide pandas APIs - Implement index/identifier - Translate pandas APIs into a logical plan of Spark SQL - The plan will be optimized and executed by Spark SQL engine - More compliant with the relations/tables in relational databases - Does not have unique row identifiers PySpark DataFrame Koalas DataFrame
  • 11. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 12. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 13. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 14. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 15. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 17. InternalFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes PySpark DataFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes PySpark DataFrame API call copy with new state
  • 18. InternalFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes PySpark DataFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes API call Only updates metadata copy with new state
  • 20. Introduction of Dask • A parallel computing framework • Written in pure python • Using blocked algorithms and task scheduling
  • 21. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim Abstraction Collections
  • 22. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim A single codebase that works with both pandas and Spark Scale pandas workflow Abstraction Collections
  • 23. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim A single codebase that works with both pandas and Spark Scale pandas workflow Abstraction Query plan Task graph and task scheduler Collections
  • 24. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim A single codebase that works with both pandas and Spark Scale pandas workflow Abstraction Query plan Task graph and task scheduler Collections DataFrame Array, DataFrame, Bag
  • 25. Benchmark setup - Methodology • Dataset 157 GB Yellow Taxi Trip Records (2009 - 2013) • Operations Basic statistical calculations Joins Grouping • Operations were applied to The whole dataset Filtered data (36% whole dataset) Cached filtered data (36% whole dataset) The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
  • 26. Benchmark setup - Environment • Local execution A single i3.16xlarge VM: (488 GB memory | 64 cores | 25 Gigabit Ethernet) • Distributed execution 1 driver node, 3 worker nodes Each node is a i3.4xlarge VM: (122 GB memory | 16 cores | 10 Gigabit Ethernet)
  • 27. Benchmark results - Overview Geometric Mean Simple Average Local execution 2.1x 4x Distributed execution 4.6x 7.9x Koalas outperformed Dask:
  • 28. Benchmark results - On the whole dataset Local execution: Koalas is ~1.2x faster Distributed execution: Koalas is ~2x faster
  • 29. Benchmark results - On the filtered data Local execution: Koalas is ~6x faster Distributed execution: Koalas is ~9x faster
  • 30. Benchmark results - On the cached filtered data Local execution: Koalas is ~1.4x faster Distributed execution: Koalas is ~5x faster
  • 31. Why is Koalas fast? ● Query plan optimization by Catalyst ● Whole-stage code generation
  • 32. Why is Koalas fast - Catalyst optimizer Query plan of mean calculation on the filtered data • Before the Catalyst’s optimization # Pseudocode expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5) df[expr_filter].fare_amt.mean()
  • 33. Why is Koalas fast - Catalyst optimizer Query plan of mean calculation on the filtered data • Before the Catalyst optimization • After the Catalyst optimization # Pseudocode expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5) df[expr_filter].fare_amt.mean()
  • 34. Why is Koalas fast - Whole-stage code generation ~650% improvement ~1200% improvement
  • 35. Benchmark conclusions • SQL optimizers improve the performance of DataFrame APIs • Caching accelerates both Koalas and Dask dramatically • Koalas outperforms Dask in the majority of use cases Reference blog post : Benchmark: Koalas (PySpark) and Dask
  • 37. Version 1.0.0~1.8.0 ▪ Improve Plotly backend support, and switch the default plotting backend to Plotly ▪ Extension dtypes support ▪ More Index types ▪ Create Index from Series or Index objects ▪ Support setting to a Series via attribute access ▪ Operations between Series and Index ▪ Standardize binary operations between int and str columns ▪ Index operations support ▪ Better type support ▪ Return type annotations for major Koalas objects
  • 38. Version 1.0.0~1.8.0 ▪ Support for non-string names ▪ Non-named Series support ▪ Wider support of in-place update ▪ Improve distributed-sequence default index ▪ pandas 1.1, 1.1.4 support ▪ Better pandas API coverage ▪ Introduced koalas and Spark accessors ▪ Improve testing infrastructure ▪ Apache Spark 3.0 support ▪ Python 3.8 support ▪ Support for API extensions ▪ Better type hints support
  • 39. Porting Koalas to Spark SPIP: Support pandas API layer on PySpark https://issues.apache.org/jira/browse/SPARK- 34849
  • 40. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.