
In-Memory Storage Evolution in Apache Spark


Presentation slides for "In-Memory Storage Evolution in Apache Spark" at Spark+AI Summit 2019
https://databricks.com/session/in-memory-storage-evolution-in-apache-spark



  1. In-Memory Storage Evolution in Apache Spark
     Kazuaki Ishizaki, IBM Research – Tokyo, @kiszk
     #UnifiedAnalytics #SparkAISummit
  2. About Me – Kazuaki Ishizaki
     • Researcher at IBM Research – Tokyo, working on compiler optimizations
     • Has worked on the IBM Java virtual machine for over 20 years, in particular on the just-in-time compiler
     • Committer of Apache Spark (SQL package) since 2018
     • ACM Distinguished Member
     • Homepage: http://ibm.biz/ishizaki
     • GitHub: https://github.com/kiszk
     • Twitter: @kiszk
     • SlideShare: https://slideshare.net/ishizaki
  3. Why In-Memory Storage?
     • In-memory storage is mandatory for high performance
     • In-memory columnar storage is necessary to
       – support Parquet, the first-class-citizen column format
       – achieve a better compression ratio for the table cache
     [Diagram: in the row format, the fields of each row ("Spark", 2.0, 1) are adjacent in memory; in the column format, the values of each column ("Spark", "AI", "Summit") are adjacent. Columns x, y, z; rows 0 to 2.]
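     To make the two layouts concrete, here is a minimal Java sketch (the class and field
     names are illustrative, not Spark code): the row format keeps the fields of one record
     together, while the column format keeps the values of one column together.

        // RowVsColumn.java: a hypothetical sketch of the two memory layouts.
        public class RowVsColumn {
            // Row format: an array of records; one row's fields sit next to each other.
            static final class Record {
                String name; float version; int rank;
                Record(String name, float version, int rank) {
                    this.name = name; this.version = version; this.rank = rank;
                }
            }

            public static void main(String[] args) {
                // Rows 0, 1, 2 of the slide's example table.
                Record[] rows = {
                    new Record("Spark", 2.0f, 1),
                    new Record("AI", 1.9f, 2),
                    new Record("Summit", 5000.0f, 3),
                };

                // Column format: one array per column (x, y, z); a column's values are
                // contiguous, which scans fast and compresses well.
                String[] x = { "Spark", "AI", "Summit" };
                float[]  y = { 2.0f, 1.9f, 5000.0f };
                int[]    z = { 1, 2, 3 };

                System.out.println(rows[1].name + " == " + x[1] + ", y[1] = " + y[1] + ", z[1] = " + z[1]);
            }
        }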
  4. What I Will Talk About
     • Columnar storage is used to improve the performance of the table cache, Parquet, ORC, and Arrow
     • Columnar storage from Spark 2.3
       – improves the performance of PySpark with Pandas UDFs using Arrow
       – can be connected to other, external columnar storage through the public class "ColumnVector"
  5. How Columnar Storage Is Used
     • Table cache:

        df = ...
        df.cache
        df1 = df.selectExpr("y + 1.2")

     • Parquet:

        df = spark.read.parquet("c")
        df1 = df.selectExpr("y + 1.2")

     • ORC:

        df = spark.read.format("orc").load("c")
        df1 = df.selectExpr("y + 1.2")

     • Pandas UDF:

        @pandas_udf('double')
        def plus(v):
            return v + 1.2
        df1 = df.withColumn('yy', plus(df.y))
  6. Performance Among Spark Versions
     • DataFrame table cache, compared from Spark 2.0 to Spark 2.4
     • Benchmark query: df.filter("i % 16 == 0").count
     [Bar chart: relative elapsed time for Spark 2.0, 2.3, and 2.4; shorter is better.]
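     A minimal sketch of how such a measurement could be reproduced, assuming a local
     SparkSession; the column name i and the row count are illustrative, not taken from
     the talk's actual benchmark setup.

        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SparkSession;

        public class TableCacheBench {
            public static void main(String[] args) {
                SparkSession spark = SparkSession.builder()
                    .master("local[*]").appName("table-cache-bench").getOrCreate();

                // Build a one-column DataFrame named "i" and pin it in the table cache.
                Dataset<Row> df = spark.range(0, 100_000_000L).selectExpr("id AS i").cache();
                df.count(); // materialize the cache before timing

                long start = System.nanoTime();
                long matched = df.filter("i % 16 == 0").count(); // the query from the slide
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println(matched + " rows in " + elapsedMs + " ms");

                spark.stop();
            }
        }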
  7. How This Improvement Is Achieved
     • The structure of the columnar storage
     • The generated code that accesses the columnar storage
  8. Outline
     • Introduction
     • Deep dive into columnar storage
     • Deep dive into the generated code for columnar storage
     • Next steps
  9. In-Memory Storage Evolution (1/2)
     • Up to Spark 1.3 – RDD table cache: rows stored as Java objects
     • Spark 1.4 to 1.6 – table cache: its own memory layout, built by Project Tungsten
     • Spark 2.0 to 2.2 – Parquet vectorized reader: its own memory layout, but a different class from the table cache
     [Diagram: how the example values ("Spark", "AI", 2.0, 1.9) are laid out by each storage, per Spark version.]
  10. In-Memory Storage Evolution (2/2)
      • Spark 2.3 – Pandas UDF with Arrow; the ColumnVector class becomes public
      • Spark 2.4 – ORC vectorized reader
      • Table cache, Parquet, ORC, and Arrow now use the common ColumnVector class
  11. Implementation in Spark 1.4 to 1.6
      • The table cache uses CachedBatch, which is not accessed directly from generated code

         case class CachedBatch(
           buffers: Array[Array[Byte]],
           stats: Row)

      [Diagram: column values serialized into CachedBatch.buffers.]
  12. Implementation in Spark 2.0
      • Parquet uses the ColumnVector class, which has well-defined methods that can be called from generated code

         public abstract class ColumnVector {
           float getFloat(...) ...
           UTF8String getUTF8String(...) ...
           ...
         }

         public final class OnHeapColumnVector extends ColumnVector {
           private byte[] byteData; ...
           private float[] floatData; ...
         }

      [Diagram: row data is copied into ColumnVectors held in a ColumnarBatch.]
  13. Implementation in Spark 2.3
      • Table cache, Parquet, and Arrow also use ColumnVector
      • ColumnVector becomes a public class to define the APIs: it now backs the table cache, the Parquet vectorized reader, and Pandas UDFs with Arrow

         /**
          * An interface representing in-memory columnar data in Spark. This interface defines the main APIs
          * to access the data, as well as their batched versions. The batched versions are considered to be
          * faster and preferable whenever possible.
          */
         @Evolving
         public abstract class ColumnVector ... {
           float getFloat(...) ...
           UTF8String getUTF8String(...) ...
           ...
         }

         public final class OnHeapColumnVector extends ColumnVector {
           // Array for each type.
           private byte[] byteData; ...
           private float[] floatData; ...
         }

         public final class ArrowColumnVector extends ColumnVector { ... }

      • Source: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java
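      A rough illustration of the accessor-style API, assuming the Spark 2.3/2.4 classes.
      Note that OnHeapColumnVector lives in the internal
      org.apache.spark.sql.execution.vectorized package, so this is a sketch for
      illustration rather than a pattern for application code.

         import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
         import org.apache.spark.sql.types.DataTypes;

         public class ColumnVectorDemo {
             public static void main(String[] args) {
                 // Allocate an on-heap vector of three floats and fill it via the writable API.
                 OnHeapColumnVector col = new OnHeapColumnVector(3, DataTypes.FloatType);
                 col.putFloat(0, 2.0f);
                 col.putFloat(1, 1.9f);
                 col.putFloat(2, 5000.0f);

                 // Generated code reads elements through the well-defined getters.
                 float sum = 0;
                 for (int i = 0; i < 3; i++) {
                     sum += col.getFloat(i);
                 }
                 System.out.println("sum = " + sum);
                 col.close();
             }
         }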
  14. ColumnVector for Your Own Columnar Storage
      • Developers can write their own class, extending ColumnVector, to support a new columnar data source or to exchange data with other formats (see the sketch below)
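      A minimal sketch of such a class, assuming the ColumnVector signatures of Spark 2.4.
      FloatArrayColumnVector is a hypothetical name: it wraps a plain float[] with no nulls,
      implements only the float accessor, and leaves every other type unsupported.

         import org.apache.spark.sql.types.DataTypes;
         import org.apache.spark.sql.types.Decimal;
         import org.apache.spark.sql.vectorized.ColumnVector;
         import org.apache.spark.sql.vectorized.ColumnarArray;
         import org.apache.spark.sql.vectorized.ColumnarMap;
         import org.apache.spark.unsafe.types.UTF8String;

         public final class FloatArrayColumnVector extends ColumnVector {
             private final float[] data;

             public FloatArrayColumnVector(float[] data) {
                 super(DataTypes.FloatType); // element type of this vector
                 this.data = data;
             }

             // The one accessor this vector actually supports.
             @Override public float getFloat(int rowId) { return data[rowId]; }

             // This sketch assumes a dense, non-nullable column.
             @Override public boolean hasNull() { return false; }
             @Override public int numNulls() { return 0; }
             @Override public boolean isNullAt(int rowId) { return false; }
             @Override public void close() { /* nothing to release for an on-heap array */ }

             // All other element types are unsupported in this single-type vector.
             @Override public boolean getBoolean(int rowId) { throw new UnsupportedOperationException(); }
             @Override public byte getByte(int rowId) { throw new UnsupportedOperationException(); }
             @Override public short getShort(int rowId) { throw new UnsupportedOperationException(); }
             @Override public int getInt(int rowId) { throw new UnsupportedOperationException(); }
             @Override public long getLong(int rowId) { throw new UnsupportedOperationException(); }
             @Override public double getDouble(int rowId) { throw new UnsupportedOperationException(); }
             @Override public ColumnarArray getArray(int rowId) { throw new UnsupportedOperationException(); }
             @Override public ColumnarMap getMap(int ordinal) { throw new UnsupportedOperationException(); }
             @Override public Decimal getDecimal(int rowId, int precision, int scale) { throw new UnsupportedOperationException(); }
             @Override public UTF8String getUTF8String(int rowId) { throw new UnsupportedOperationException(); }
             @Override public byte[] getBinary(int rowId) { throw new UnsupportedOperationException(); }
             @Override protected ColumnVector getChild(int ordinal) { throw new UnsupportedOperationException(); }
         }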
  15. Implementation in Spark 2.4
      • ORC also uses ColumnVector: the table cache, the Parquet and ORC vectorized readers, and Pandas UDFs with Arrow all share the ColumnVector class shown on slide 13
  16. Outline
      • Introduction
      • Deep dive into columnar storage
      • Deep dive into the generated code for columnar storage
      • Next steps
  17. How Is a Spark Program Executed?
      • A Spark program is translated by Catalyst into Java code, which is then executed by the Java virtual machine

         df = ...
         df.cache
         df1 = df.selectExpr("y + 1.2")

         is translated into code such as:

         while (rowIterator.hasNext()) {
           Row row = rowIterator.next();
           ...
         }

      • Source: Michael Armbrust et al., "Spark SQL: Relational Data Processing in Spark", SIGMOD '15
  18. Access to Columnar Storage (before 2.0)
      • Although columnar storage is used, the generated code gets data from row storage, so a conversion from columnar storage to row storage is required

         df1 = df.selectExpr("y + 1.2")

         while (rowIterator.hasNext()) {
           Row row = rowIterator.next();
           float y = row.getFloat(1);
           float f = y + 1.2f;
           ...
         }

      [Diagram: the columnar CachedBatch is converted into rows before the generated code reads them.]
  19. Access to Columnar Storage (from 2.0)
      • When columnar storage is used, the generated code reads data elements directly from the columnar storage
        – The copy was removed for Parquet in 2.0 and for the table cache in 2.3

         df1 = df.selectExpr("y + 1.2")

         ColumnVector column1 = ...;
         while (i++ < numRows) {
           float y = column1.getFloat(i);
           float f = y + 1.2f;
           ...
         }

      [Diagram: the loop reads y = 2.0 (i = 0) and y = 1.9 (i = 1) directly from the ColumnVector, producing f = 3.2 and 3.1.]
  20. Access to Columnar Storage (from 2.3)
      • Generate this pattern for all cases that use ColumnVector
      • Use a for-loop to encourage compiler optimizations
        – The HotSpot compiler applies loop optimizations to a well-formed loop (see the sketch below)

         ColumnVector column1 = ...;
         for (int i = 0; i < numRows; i++) {
           float y = column1.getFloat(i);
           float f = y + 1.2f;
           ...
         }
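      As an illustration of why the loop shape matters, here is a standalone sketch, not
      Spark's actual generated code: a counted for-loop over a primitive array is the form
      that HotSpot's JIT compiler can unroll and, on supported CPUs, auto-vectorize into
      SIMD instructions.

         public class WellFormedLoop {
             // A counted loop with a simple body over primitive arrays; HotSpot can
             // unroll it and compile it to SIMD instructions where the CPU supports them.
             static float[] addConst(float[] y) {
                 float[] out = new float[y.length];
                 for (int i = 0; i < y.length; i++) {
                     out[i] = y[i] + 1.2f;
                 }
                 return out;
             }

             public static void main(String[] args) {
                 float[] out = addConst(new float[] { 2.0f, 1.9f, 5000.0f });
                 System.out.println(out[0] + ", " + out[1] + ", " + out[2]);
             }
         }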
  21. How Columnar Storage Is Used in PySpark
      • Spark and Pandas share data held in columnar storage through Arrow
        – No serialization and deserialization
        – 3-100x performance improvements

         @pandas_udf('double')
         def plus(v):
             return v + 1.2

      • Source: "Introducing Pandas UDF for PySpark", Databricks blog
      • Details in "Apache Arrow and Pandas UDF on Apache Spark" by Takuya Ueshin
  22. Outline
      • Introduction
      • Deep dive into columnar storage
      • Deep dive into the generated code for columnar storage
      • Next steps
  23. Next Steps
      • Short term
        – Support an array type in ColumnVector for the table cache
        – Support additional external columnar storage
      • Middle term
        – Exploit SIMD instructions to process multiple rows in a column in the generated code, as an extension of SPARK-25728 (Tungsten IR)
  24. Integrate Spark with Others
      • Frameworks: DL/ML frameworks
        – SPARK-24579, SPARK-26413
      • Resources: GPU, FPGA, ...
        – SPARK-27396
        – SAIS 2019: "Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators"
      [Images from rapids.ai: GPU and FPGA.]
  25. Takeaway
      • Columnar storage is used to improve the performance of the table cache, Parquet, ORC, and Arrow
      • Columnar storage from Spark 2.3
        – improves the performance of PySpark with Pandas UDFs using Arrow
        – can be connected to other, external columnar storage through the public class "ColumnVector"
  26. Thanks to the Spark Community
      • Especially @andrewor14, @bryanCutler, @cloud-fan, @dongjoon-hyun, @gatorsmile, @hvanhovell, @mgaido91, @ueshin, @viirya
