SAIS/DWS2018 Report Meeting #saisdws2018

Copyright (C) 2018 Yahoo Japan Corporation. All Rights Reserved.
August 6, 2018
李燮鳴
Spark AI Summit 2018 Report Meeting

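Self-introduction
李燮鳴 (Ri Shomei)
March 2017: Ph.D. in Engineering, Graduate School of the University of Tsukuba
• Research on schedulers for parallel file systems
April 2017: joined Yahoo Japan
• Since joining, in charge of DevOps for Hadoop clusters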

Spark AI Summit 2018
Dates: 2018/06/04 to 2018/06/06
Venue: Moscone Center West, San Francisco
Attendees: about 6,000
Sessions: roughly 9 to 11 parallel tracks, 193 sessions in total

Venue (exterior)

Venue (interior)

Food

Agenda
• Frameworks that make ML more convenient (2 talks)
• Spark SQL (2 talks)
• Cluster architecture (2 talks)

Frameworks that make ML more convenient

MEET UP: Horovod
Alexander Sergeev, Uber
• A framework that speeds up distributed training in TensorFlow
• Uses MPI all-reduce to speed up the averaging of gradients

MEET UP: Horovod (1)
Uber. (2018, February 19). Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow. Retrieved July 9, 2018, from https://eng.uber.com/horovod/
• Distributed training in TensorFlow uses a Parameter Server to average the gradients computed by each worker
• Drawback: choosing the Parameter Server configuration is difficult

• Horovod does not use a Parameter Server. Instead, it exchanges and averages gradients with the ring all-reduce of NCCL (NVIDIA Collective Communications Library, which provides MPI-style collectives)
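The ring all-reduce mentioned above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the function name `ring_allreduce_mean` and the list-based "workers" are hypothetical, and it is not Horovod's or NCCL's actual implementation, which runs across processes and GPUs.

```python
def ring_allreduce_mean(grads):
    """Average one gradient vector per worker using a simulated ring all-reduce.

    grads: list of equal-length lists of floats, one per worker (hypothetical setup).
    Returns the buffer each worker holds afterwards (all identical).
    """
    n = len(grads)
    size = len(grads[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [size * c // n for c in range(n + 1)]
    data = [list(g) for g in grads]  # local buffer of each worker

    # Phase 1 (reduce-scatter): in step s, worker w sends chunk (w - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [((w - s) % n, w) for w in range(n)]
        payloads = [data[w][bounds[c]:bounds[c + 1]] for c, w in sends]
        for (c, w), payload in zip(sends, payloads):
            dst = (w + 1) % n
            for i, v in enumerate(payload):
                data[dst][bounds[c] + i] += v
    # Now worker w holds the complete sum of chunk (w + 1) % n.

    # Phase 2 (all-gather): in step s, worker w forwards chunk (w + 1 - s) % n,
    # and the right neighbour overwrites its copy with the completed sum.
    for s in range(n - 1):
        sends = [((w + 1 - s) % n, w) for w in range(n)]
        payloads = [data[w][bounds[c]:bounds[c + 1]] for c, w in sends]
        for (c, w), payload in zip(sends, payloads):
            dst = (w + 1) % n
            data[dst][bounds[c]:bounds[c + 1]] = payload

    # Every worker divides the summed gradients by the number of workers.
    return [[v / n for v in buf] for buf in data]
```

Each worker only talks to its neighbour and sends one chunk per step, so no single node has to receive every worker's full gradient the way a central Parameter Server does.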

• An evaluation with the official TensorFlow benchmarks confirmed a performance improvement of roughly 2x

• Using RDMA (Remote Direct Memory Access) over InfiniBand improved performance further

KEYNOTE: Hydrogen
Reynold Xin, Databricks
A proposal to make DL frameworks run efficiently on Spark
• SPIP = Spark Project Improvement Proposal
• At present the design sketch is complete (design about 15% done, SPARK-24374)

KEYNOTE: Hydrogen (1)
Databricks. Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark. Retrieved July 9, 2018, from https://databricks.com/session/databricks-keynote-2

KEYNOTE: Hydrogen (2)
Databricks. Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark. Retrieved July 9, 2018, from https://databricks.com/session/databricks-keynote-2

Spark SQL

Deep Dive into Spark SQL with Advanced Performance Tuning
Xiao Li, Wenchen Fan, Databricks
Presented parameter tuning techniques that can be applied at each stage between submitting a Spark SQL query and executing it
Databricks. (2018, June 20). Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao L... Retrieved July 10, 2018, from https://www.slideshare.net/databricks/deep-dive-into-spark-sql-with-advanced-performance-tuning-with-xiao-li-wenchen-fan


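Spark SQL Adaptive Execution
Carson Wang, Intel, Yuanjian Li, Baidu
Makes Spark SQL more efficient by changing the execution plan at runtime
• Decides the optimal number of reducers at runtime
• Decides the appropriate join strategy at runtime
• Used in production at Baidu (SPARK-23128)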

Spark SQL Adaptive Execution (1)
Tuning the number of reducers
• Too few: spill, OOM
• Too many: scheduling overhead, more I/O requests, too many small output files
• It is difficult to specify a number that suits all stages

[Figure: ShuffledRowRDD partitions before and after merging]
ShuffledRowRDD: Partition 0 (70MB), Partition 1 (30MB), Partition 2 (20MB), Partition 3 (10MB), Partition 4 (50MB)
Target Size per Reducer = 64MB, Min-Max Shuffle Partition Number = 1 to 5
Partitions 1 to 3 can be handled by one reducer because 30+20+10 < 64MB
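The merging rule on this slide can be sketched as a greedy pass over contiguous partitions. This is a minimal illustration, assuming a hypothetical helper `coalesce_partitions`; it is not Spark's actual code and it ignores the min/max partition-number bounds shown on the slide.

```python
def coalesce_partitions(sizes_mb, target_mb=64):
    """Group contiguous shuffle partitions so each group stays within target_mb.

    sizes_mb: partition sizes in MB, in partition order.
    A partition that alone exceeds the target still forms its own group.
    Returns a list of groups, each a list of partition indices.
    """
    groups = []
    current, current_size = [], 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Sizes from the slide: partitions 0 to 4.
print(coalesce_partitions([70, 30, 20, 10, 50]))
```

With the sizes from the slide, partitions 1 to 3 (30+20+10 = 60MB) end up in one group, while partitions 0 and 4 each stay alone, so five shuffle partitions are read by three reducers.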

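[Figure: SQL Query -> Logical Plan -> Optimized Logical Plan -> Multiple Physical Plans -> Selected Physical Plan, chosen by evaluating a cost model]
[Figure: plan tree with Join2, Join1, Exchange1 (T1), Exchange2 (T2), Exchange3 (T3)]
• A non-optimal join may be selected
• The planner's estimates can differ greatly from the actual values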

[Figure: the plan tree (Join2, Join1, Exchange1 to Exchange3 over T1 to T3) split into query stages]
• QueryStage1: Exchange1, T1
• QueryStage2: Exchange2, T2
• QueryStage3: Exchange3, T3
• QueryStage4: Join2, Join1, QueryStageInput1, QueryStageInput2, QueryStageInput3
• The actual values become known
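How knowing the actual sizes can change the join choice is sketched below. The function `choose_join`, the threshold constant, and the example sizes are hypothetical; the 10 MB value is an assumption modelled on Spark's default broadcast threshold, and the real planner considers more than input size.

```python
# Assumed threshold: broadcast a side only if it is at most 10 MB.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(left_bytes, right_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from the sizes of the two inputs.

    If the smaller side fits under the threshold it can be broadcast to every
    executor; otherwise both sides are shuffled and sort-merge joined.
    """
    if min(left_bytes, right_bytes) <= threshold:
        return "BroadcastJoin"
    return "SortMergeJoin"


# Hypothetical numbers: the planner estimated 2 GB for the right side,
# but the finished query stage actually produced only 5 MB.
estimated = choose_join(50 * 1024**3, 2 * 1024**3)
actual = choose_join(50 * 1024**3, 5 * 1024**2)
print(estimated, "->", actual)
```

Re-running the same decision once the child query stages have produced their real output sizes is what lets a plan that started as a SortMergeJoin switch to a BroadcastJoin.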

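Results of applying Adaptive Execution at Baidu
• SortMergeJoin was changed to BroadcastJoin, and a performance improvement of 50% to 200% was confirmed
• For jobs that run for one hour or longer, an appropriate number of reducers was set, and a performance improvement of 50% to 100% was confirmed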

Cluster architecture

Taking Advantage of a Disaggregated Storage and Compute Architecture
Brian Cho, Facebook
• Introduces an architecture that separates data from compute
• Spark optimizations for an architecture that separates data from compute
1. Definition of a file interface
2. Optimized access to Spark temporary files
3. Optimized Spark shuffle

Taking Advantage of a Disaggregated Storage and Compute Architecture (1)
Databricks. Taking Advantage of a Disaggregated Storage and Compute Architecture. Retrieved July 10, 2018, from
https://databricks.com/session/taking-advantage-of-a-disaggregated-storage-and-compute-architecture

Benefits of an architecture that separates data from compute
• Servers suited to data and to compute can be procured separately
• Capacity planning is simple
• Each side can be maintained by its own team

[Figure: three Executors, each paired with an ESS (external shuffle service) and a Local FS, all on the compute side]
• Access to the Local FS is local
[Figure: three Executors, each paired with an ESS, on the compute side; Warm Storage on the storage side]
• Access to Warm Storage is remote (*network transfer)

[Figure: diagram labeled with shuffle files and index files]

Apache Spark on Kubernetes Clusters
Sean Suchter, PepperData, Anirudh Ramanathan, Google
• Gave an overview of Kubernetes, the implementation of Spark on Kubernetes, and future plans
• The Spark driver is implemented as a Kubernetes custom controller
• Features to be added in the future (selection)
  • PySpark: SPARK-23984
  • Dynamic Allocation: SPARK-24432
  • Driver HA

Apache Spark on Kubernetes Clusters (1)
Databricks. Apache Spark on Kubernetes Clusters. Retrieved July 11, 2018, from https://databricks.com/session/apache-spark-on-kubernetes-clusters

bin/spark-submit \
  --master k8s://<server:port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///path/to/examples.jar
• Users submit jobs in almost the same way as before

Spark on Kubernetes Roadmap

EOP

Backup slides

KEYNOTE: MLflow
Matei Zaharia, Databricks
• A framework for managing the machine learning lifecycle on Spark
• Named three points that make ML on Spark difficult and offered a solution for each

KEYNOTE: MLflow (1)
Databricks. Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark. Retrieved July 9, 2018, from
https://databricks.com/session/unifying-data-and-ai-for-better-data-products

KEYNOTE: MLflow (2)
Databricks. Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark. Retrieved July 9, 2018, from
https://databricks.com/session/unifying-data-and-ai-for-better-data-products

KEYNOTE: MLflow (3)
MLflow Tracking

from sys import argv
import mlflow
from sklearn.linear_model import ElasticNet

# load_data() and eval_metrics() are helpers assumed to be defined elsewhere.
def main():
    alpha = float(argv[1]) if len(argv) > 1 else 0
    l1_ratio = float(argv[2]) if len(argv) > 2 else 0
    (x_train, y_train) = load_data("train.parquet")
    (x_test, y_test) = load_data("test.parquet")
    print("Using parameter alpha=%.1f l1_ratio=%.1f" % (alpha, l1_ratio))
    # Record the hyperparameters of this run.
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1", l1_ratio)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    (mae, rmse, r2) = eval_metrics(y_test, y_pred)
    # Record the evaluation metrics of this run.
    mlflow.log_metric("MAE", mae)
    print("MAE", mae)
    mlflow.log_metric("RMSE", rmse)
    print("RMSE", rmse)
    mlflow.log_metric("R2", r2)
    print("R2", r2)

KEYNOTE: MLflow (4)
