By Li Jin
PyData New York City 2017
Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up data processing. However, it is not well integrated with popular Python tools such as Pandas, and using Pandas with PySpark often results in poor performance. In this talk, we will demonstrate how we improve PySpark performance with Apache Arrow.
3. About Me
• Li Jin (@icexelloss)
• Software Engineer @ Two Sigma Investments
• Apache Arrow Committer
• Analytics Tools Smith
• Other Open Source Projects:
• Flint: A Time Series Library on Spark
• Cook: A Fair Scheduler on Mesos
4. • PySpark Overview
• PySpark UDF: current state and limitation
• Apache Arrow Overview
• Improvement to PySpark UDF with Apache Arrow
• Future Roadmap
This Talk
10. • PySpark’s interface to interact with other Python libraries
• Types of UDFs:
• Row UDF
• Group UDF
PySpark User Defined Function (UDF)
11. • Operates on a row-by-row basis
• Similar to `map` operator
• Examples:
• String processing
• Timestamp processing
• Poor performance
• 1-2 orders of magnitude slower than the alternatives (built-in Spark functions or vectorized operations)
Row UDF: Current
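The difference between a row-at-a-time function and a vectorized one can be sketched in plain Python. This is an illustration of the calling convention only, not PySpark's API or execution path; the function names `upper_row` and `upper_batch` and the sample data are hypothetical.

```python
# Pure-Python illustration only; none of these names are PySpark API.
calls = {"row": 0, "batch": 0}

def upper_row(value):
    """Row-style function: receives one value and returns one value."""
    calls["row"] += 1
    return value.upper()

def upper_batch(values):
    """Batch-style function: receives a whole column and returns a whole column."""
    calls["batch"] += 1
    return [v.upper() for v in values]

names = ["joe", "ann", "bob", "eve"]  # hypothetical sample column

row_result = [upper_row(v) for v in names]  # one function call per row
batch_result = upper_batch(names)           # one function call for the whole batch

assert row_result == batch_result
print(calls)  # {'row': 4, 'batch': 1}
```

Both versions produce the same column; the row-style version pays the per-call cost once for every row, which is one reason row-at-a-time processing tends to be slower.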
12. • UDF that operates on multiple rows
• Similar to `groupBy` followed by `map` operator
• Example:
• Monthly weighted mean
• Not supported out of the box
• Poor performance
Group UDF: Current
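As a rough sketch of "`groupBy` followed by `map`", the monthly weighted mean can be written in plain Python. The sample rows, the column order, and the helper name `weighted_mean` are assumptions made for illustration; this is not how PySpark executes a group UDF.

```python
from collections import defaultdict

# Hypothetical sample rows: (month, value, weight).
rows = [
    ("2017-01", 10.0, 1.0),
    ("2017-01", 20.0, 3.0),
    ("2017-02", 5.0, 2.0),
    ("2017-02", 7.0, 2.0),
]

def weighted_mean(values, weights):
    """Weighted mean of one group: sum(v * w) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# groupBy: collect the rows that belong to each month.
groups = defaultdict(list)
for month, value, weight in rows:
    groups[month].append((value, weight))

# map: apply the function once per group, on all of the group's rows.
monthly = {
    month: weighted_mean([v for v, _ in group], [w for _, w in group])
    for month, group in groups.items()
}
print(monthly)  # {'2017-01': 17.5, '2017-02': 6.0}
```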
13. • (values - values.mean()) / values.std()
Group UDF: Example
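The expression on this slide reads as Pandas code, where `values` would be a Series. A minimal standard-library sketch of the same arithmetic follows, assuming the sample standard deviation (the default of `Series.std()` in Pandas); the function name `normalize` and the input values are hypothetical.

```python
import statistics

def normalize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1 in the denominator)
    return [(v - mean) / std for v in values]

values = [1.0, 2.0, 3.0]  # hypothetical values of a single group
print(normalize(values))  # [-1.0, 0.0, 1.0]
```

The result depends on every value in the group, which is why this cannot be written as a row-at-a-time function.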
18. • In memory columnar format
• Building on the success of Parquet
• Standard from the start:
• Developers from 13+ major open source projects involved
• Benefits:
• Share the effort
• Create an ecosystem
Apache Arrow
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
Hbase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
21. Record Batch Construction
[Figure: record batch construction]
• Stream: Schema, Dictionary Batch, Record Batch, Record Batch, Record Batch
• Record batch contents:
• data header (describes offsets into data)
• name: bitmap, offset, data
• age: bitmap, data
• phones: bitmap, list offset, offset, data
• Example record:
{
name: 'Joe',
age: 18,
phones: [
'555-111-1111',
'555-222-2222'
]
}
• Each box (vector) is contiguous memory
• The entire record batch is contiguous on wire
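The bitmap, offset, and data vectors named in the figure can be sketched in plain Python. This is a simplified illustration, not the Arrow library: real Arrow buffers use fixed-width integer offsets and padded, aligned memory, and a list column also carries its own validity bitmap. The helper names and the rows other than 'Joe' are hypothetical.

```python
def encode_strings(values):
    """Encode a list of str-or-None as (validity bitmap, offsets, data)."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # bit i set means row i is not null
            data += value.encode("utf-8")
        offsets.append(len(data))           # row i spans data[offsets[i]:offsets[i + 1]]
    return bytes(bitmap), offsets, bytes(data)

def encode_string_lists(lists):
    """Encode lists of strings as list offsets plus a flattened child string column."""
    list_offsets = [0]
    flat = []
    for items in lists:
        flat.extend(items)
        list_offsets.append(len(flat))      # row i owns flat[list_offsets[i]:list_offsets[i + 1]]
    return list_offsets, encode_strings(flat)

names = ["Joe", None, "Ann"]
bitmap, offsets, data = encode_strings(names)
print(bitmap, offsets, data)  # b'\x05' [0, 3, 3, 6] b'JoeAnn'

phones = [["555-111-1111", "555-222-2222"], [], ["555-333-3333"]]
list_offsets, (child_bitmap, child_offsets, child_data) = encode_string_lists(phones)
print(list_offsets)   # [0, 2, 2, 3]
print(child_offsets)  # [0, 12, 24, 36]
```

Each returned buffer is one contiguous block, so a reader can locate any value with offset arithmetic instead of following per-row object pointers.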
22. • Maximize CPU throughput
• Pipelining
• SIMD
• Cache locality
• Scatter/gather I/O
In Memory Columnar Format for Speed
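To make the row-versus-column distinction concrete, here is a small standard-library sketch. It only shows the two layouts; it does not measure pipelining, SIMD, or cache effects, and the records are hypothetical.

```python
from array import array

# Hypothetical records in a row-oriented layout: one object per record.
rows = [
    {"name": "Joe", "age": 18},
    {"name": "Ann", "age": 25},
    {"name": "Bob", "age": 31},
]

# Row-oriented: reading one field means visiting every record object.
row_total = sum(r["age"] for r in rows)

# Column-oriented: all values of one field sit in a single typed, contiguous buffer.
ages = array("q", (r["age"] for r in rows))  # "q" = signed 64-bit integers
col_total = sum(ages)

assert row_total == col_total
print(col_total)                  # 74
print(ages.itemsize * len(ages))  # size of the age column's buffer in bytes
```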
23. • PySpark “toPandas” Improvement
• 53x Speedup
• Streaming Arrow Performance
• 7.75GB/s data movement
• Arrow Parquet C++ Integration
• 4GB/s reads
• Pandas Integration
• 9.71GB/s
Results
Read more on http://arrow.apache.org/blog/
34. • Split-apply-combine
• Break a problem into smaller pieces
• Operate on each piece independently
• Put all pieces back together
• Common pattern supported in SQL, Spark, Pandas, R …
Introduce Group UDF
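The three steps above can be sketched with a small generic helper in plain Python. This is a standard-library illustration of the pattern, not the PySpark group UDF API; the helper names, the column names `id` and `v`, and the sample rows are hypothetical. The per-group function reuses the normalization from the earlier group UDF example.

```python
import statistics
from collections import defaultdict

def split_apply_combine(rows, key, func):
    """Split rows by `key`, apply `func` to each group, and concatenate the results."""
    groups = defaultdict(list)
    for row in rows:                    # split: break the problem into smaller pieces
        groups[row[key]].append(row)
    combined = []
    for group_rows in groups.values():
        combined.extend(func(group_rows))  # apply to each piece, then combine
    return combined

def normalize_group(group_rows):
    """Replace v with (v - mean) / std, computed within one group."""
    values = [r["v"] for r in group_rows]
    mean = statistics.mean(values)
    std = statistics.stdev(values)      # sample standard deviation
    return [dict(r, v=(r["v"] - mean) / std) for r in group_rows]

rows = [
    {"id": "a", "v": 1.0}, {"id": "a", "v": 3.0},
    {"id": "b", "v": 10.0}, {"id": "b", "v": 20.0}, {"id": "b", "v": 30.0},
]
result = split_apply_combine(rows, "id", normalize_group)
for row in result:
    print(row)
```

Because each group is processed independently, the groups could in principle be handled in parallel, which is what makes the pattern a good fit for a distributed engine.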
37. • (values - values.mean()) / values.std()
Previous Example
38. Group UDF: Before and After
For updated API, see: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Before: After*:
40. • Available in the upcoming Apache Spark 2.3 release
• Try it with Databricks community version:
• https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Try It!
41. • Improving PySpark/Pandas interoperability (SPARK-22216)
• Working towards Arrow 1.0 release
• More Arrow integration
Future Roadmap
43. Bryan Cutler
Hyukjin Kwon
Jeff Reback
Leif Walsh
Li Jin
Liang-Chi Hsieh
Reynold Xin
Takuya Ueshin
Wenchen Fan
Wes McKinney
Xiao Li
Collaborators