
SparkTokyo2019


Published at

https://spark-meetup-tokyo.connpass.com/event/131791/



  1. Kazuaki Ishizaki (石崎 一明), IBM Research – Tokyo (日本アイ・ビー・エム(株)東京基礎研究所), @kiszk
     Introduction to the "Spark In-Memory" Talk and Related Sessions
  2. About Me – Kazuaki Ishizaki
     ▪ Researcher at IBM Research – Tokyo (https://ibm.biz/ishizaki) – compiler optimization, language runtimes, and parallel processing
     ▪ Has worked on IBM Java (now OpenJ9) since 1996 – technical lead for the just-in-time compiler for PowerPC
     ▪ Apache Spark committer since September 2018 (SQL module) – one of four Apache Spark committers in Japan
     ▪ ACM Distinguished Member (2018–)
     ▪ SNS – @kiszk, ishizaki
  3. Today's Topics
     ▪ Highlights from the talk "In-Memory Storage Evolution in Apache Spark"
     ▪ The relationship between Apache Spark and Apache Arrow
     ▪ Highlights from talks on using Apache Spark with Apache Arrow
  4. In-Memory Storage Evolution in Apache Spark
     ▪ History of in-memory storage from Spark 1.3 to 2.4
       – From Java objects to Spark's own memory format (Project Tungsten)
       – Introduction of a columnar storage class
     ▪ Support for Apache Arrow
       – Performance improvements for PySpark Pandas UDFs
     ▪ Refactoring of the internal data structure
       – One public abstract class: ColumnVector
     Slides: https://www.slideshare.net/ishizaki/in-memory-evolution-in-apache-spark
  5. Why In-Memory Storage? (In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki, #UnifiedAnalytics #SparkAISummit)
     • In-memory storage is mandatory for high performance
     • In-memory columnar storage is necessary to
       – support Parquet, a first-class-citizen column format
       – achieve a better compression rate for the table cache
     (Figure: the same three-row table laid out in row format, with each row's values interleaved in memory, and in column format, with each of columns x, y, and z contiguous in memory)
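The row-vs-column layout from this slide can be sketched in plain Python. This is only an illustration of the idea (why a column scan touches one contiguous array), not Spark's actual Tungsten memory layout; the sample values mirror the slide's figure.

```python
# Illustrative sketch (not Spark internals): the same three-row table
# stored row-wise vs. column-wise, as in the slide's figure.
rows = [("Spark", 1, 2.0), ("AI", 2, 1.9), ("Summit", 3, 5000.0)]

# Column format: each column lives in its own contiguous list, so a scan
# of one column never touches the other columns' data.
col_x = [r[0] for r in rows]
col_y = [r[1] for r in rows]
col_z = [r[2] for r in rows]

# Scanning column z in row format must visit every row tuple...
row_scan = [r[2] for r in rows]
# ...while in column format the column is already one contiguous array.
col_scan = col_z

assert row_scan == col_scan
```

The columnar layout also helps compression: values of one type and similar range sit next to each other, which is the "better compression rate for the table cache" point on the slide.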
  6. In-Memory Storage Evolution (1/2)
     – Up to Spark 1.3: RDD table cache stored as Java objects
     – Spark 1.4 to 1.6: table cache uses Spark's own memory layout (Project Tungsten)
     – Spark 2.0 to 2.2: Parquet vectorized reader uses its own memory layout, but a different class from the table cache
     (Figure: timeline of in-memory storage formats by Spark version)
  7. In-Memory Storage Evolution (2/2)
     – Spark 2.3 and 2.4 add Pandas UDF with Arrow and an ORC vectorized reader
     – The ColumnVector class becomes public from Spark 2.3
     – Table cache, Parquet, ORC, and Arrow use the common ColumnVector class
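The "one common ColumnVector class" idea above can be sketched as a single abstract columnar interface that several storage backends implement, so callers are written once against the abstraction. The names below (ColumnVectorSketch, ListBackedVector) are hypothetical; Spark's real ColumnVector is a Java class with a richer API.

```python
# Conceptual sketch: one abstract columnar interface, many backends,
# mirroring how table cache, Parquet, ORC, and Arrow all sit behind
# Spark's public ColumnVector class from 2.3 on. Names are illustrative.
from abc import ABC, abstractmethod

class ColumnVectorSketch(ABC):
    @abstractmethod
    def get_double(self, row_id: int) -> float:
        """Read the double value at the given row position."""

class ListBackedVector(ColumnVectorSketch):
    # One possible backend; Parquet/ORC/Arrow readers would be others.
    def __init__(self, values):
        self._values = list(values)

    def get_double(self, row_id: int) -> float:
        return self._values[row_id]

def sum_column(vec: ColumnVectorSketch, n: int) -> float:
    # Written against the abstract interface only, so it works
    # unchanged for any backend.
    return sum(vec.get_double(i) for i in range(n))

v = ListBackedVector([2.0, 1.9, 5000.0])
total = sum_column(v, 3)
```

The design payoff is the slide's point: operators that consume columns need no per-format code paths once every data source exposes the same vector interface.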
  8. Performance among Spark Versions
     • DataFrame table cache from Spark 2.0 to Spark 2.4, measured with df.filter("i % 16 == 0").count
     (Chart: relative elapsed time for Spark 2.0, 2.3, and 2.4; shorter is better)
  9. How Columnar Storage is Used in PySpark
     • Share data between the columnar storages of Spark and Pandas through Apache Arrow
       – No serialization and deserialization
       – 3–100x performance improvements
       @pandas_udf('double')
       def plus(v):
           return v + 1.2
     • Details in "Apache Arrow and Pandas UDF on Apache Spark" by Takuya Ueshin
     • Source: "Introducing Pandas UDF for PySpark", Databricks blog
  10. How Columnar Storage is Used
      • Table cache:
          df = ...
          df.cache()
          df1 = df.selectExpr("y + 1.2")
      • Parquet:
          df = spark.read.parquet("c")
          df1 = df.selectExpr("y + 1.2")
      • ORC:
          df = spark.read.format("orc").load("c")
          df1 = df.selectExpr("y + 1.2")
      • Pandas UDF:
          @pandas_udf('double')
          def plus(v):
              return v + 1.2
          df1 = df.withColumn('yy', plus(df.y))
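The Pandas UDF examples above gain their speed by receiving a whole column batch per call instead of one scalar per call. The sketch below mimics that batch-at-a-time shape with plain Python lists so it is self-contained; real code would use pyspark.sql.functions.pandas_udf, and apply_vectorized is a hypothetical helper standing in for Spark's executor loop.

```python
# Conceptual sketch of the slide's `plus` UDF: a vectorized UDF is
# invoked once per batch (column chunk), not once per value.
def plus(v):
    # v is a whole batch of column values, not a single scalar.
    return [x + 1.2 for x in v]

def apply_vectorized(column, udf, batch_size=2):
    # Hypothetical stand-in for the engine: split the column into
    # batches and call the UDF once per batch.
    out = []
    for i in range(0, len(column), batch_size):
        out.extend(udf(column[i:i + batch_size]))
    return out

y = [1.0, 2.0, 3.0]
yy = apply_vectorized(y, plus)
```

In real PySpark, each batch crosses the JVM/Python boundary in Arrow's columnar format, which is what removes the per-row serialization and deserialization cost mentioned on slide 9.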
  11. Integrating Spark with Others
      • Frameworks: DL/ML frameworks – SPARK-24579, SPARK-26413
      • Resources: GPU, FPGA, ... – SPARK-27396
      • SAIS2019 talk: "Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators"
      (Figure: from rapids.ai, showing GPU and FPGA)
  12. Presentations with Spark & Arrow at SAIS2019
      ▪ Language / framework
        – Running R at Scale with Apache Arrow on Spark
        – Introducing .NET Bindings for Apache Spark
        – Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
        – Make your PySpark Data Fly with Arrow!
      ▪ Hardware resources (e.g., GPU and FPGA)
        – Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
        – Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators
  13. Exchange Data between Spark and R
      ▪ "Running R at Scale with Apache Arrow on Spark"
        https://www.slideshare.net/databricks/running-r-at-scale-with-apache-arrow-on-spark
  14. Exchange Data between Spark and .NET UDFs
      ▪ "Introducing .NET Bindings for Apache Spark"
        https://www.slideshare.net/databricks/introducing-net-bindings-for-apache-spark
  15. Exchange Data between Spark and TensorFlow
      ▪ "Make your PySpark Data Fly with Arrow!"
        https://www.slideshare.net/databricks/make-your-pyspark-data-fly-with-arrow
  16. Make the Arrow Format Standard in Spark
      ▪ "Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark"
        https://www.slideshare.net/databricks/updates-from-project-hydrogen-unifying-stateoftheart-ai-and-big-data-in-apache-spark
  17. Exchange Data between Spark and the RAPIDS Library
      ▪ "Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL"
        https://www.slideshare.net/databricks/accelerating-machine-learning-workloads-and-apache-spark-applications-via-cuda-and-nccl
      ▪ PR #24795 (SPARK-27945) is minimal support for columnar processing
  18. Exchange Data between Spark and Accelerators
      ▪ "Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators"
        https://www.slideshare.net/databricks/apache-arrowbased-unified-data-sharing-and-transferring-format-among-cpu-and-accelerators
  19. Takeaway
      ▪ In-memory storage in Apache Spark keeps evolving while keeping the same APIs (e.g., DataFrame and Dataset)
        – Improved performance by using columnar storage and Spark's own memory format
        – Support for Apache Arrow
        – A defined API that increases generality and eases support for other data sources
      ▪ Apache Arrow improves performance and programmability when exchanging data between Spark and
        – frameworks
        – hardware accelerators
