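IEEE/ACM SC2013 Report

Ryousei Takano

National Institute of Advanced Industrial Science and Technology (AIST), Information Technology Research Institute
January 15, 2014, 41st Grid Consortium Japan Workshop @ Akihabara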

SC13 “HPC Everywhere”
•  25th IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis
–  “Supercomputing” no longer appears in the conference name
–  This year's focus was big data (Analysis)

•  November 10-16, Denver, Colorado, USA
•  The top conference in the HPC field
–  This year's acceptance rate: 20% (90/456)

•  TOP500, various awards, workshops, tutorials, BoFs, etc.
•  Huge exhibition floor
–  Booths of the laboratories under the US DoE were absent

•  10,500 attendees

Big Data
•  Keynote: G. Bell (Intel), “The Secret Life of Data”
•  Invited talks
–  A. N. Choudhary (Northwestern University), “Big Data + Big Compute = An Extreme Scale Marriage for Smarter Science?”
–  S. Koonin (New York University), “Big Data for Big Cities”, http://cusp.nyu.edu/

Data Intensive / Data Driven

TOP 500
•  No major changes in the ranking

 #  Prev.  System                   Rmax (TFlop/s)  Rpeak (TFlop/s)  Power (kW)
 1   1     Tianhe-2 (Xeon/Phi)         33862.7         54902.4         17808
 2   2     Titan (Opteron/K20x)        17590.0         27112.5          8209
 3   3     Sequoia (BG/Q)              17173.2         20132.7          7890
 4   4     K computer (SPARC64)        10510.0         11280.4         12660
 5   5     Mira (BG/Q)                  8586.6         10066.3          3945
 6   -     Piz Daint (Xeon/K20x)        6271.0          7788.9          2325
 7   6     Stampede (Xeon/Phi)          5168.1          8520.1          4510
 8   7     JUQUEEN (BG/Q)               5008.9          5872.0          2301
 9   8     Vulcan (BG/Q)                4293.3          5033.2          1972
10   9     SuperMUC (Xeon)              2897.0          3185.1          3423
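As a rough cross-check between the TOP500 and Green 500 lists, energy efficiency can be estimated from the Rmax and power columns of the TOP500 table. The sketch below is an illustration, not how the Green500 is actually compiled: the Green500 uses its own submitted power measurements, so its numbers differ (for example, Piz Daint is listed there at 3185.91 MFlops/W with 1753.66 kW). The names `TOP500` and `mflops_per_watt` are made up for this example.

```python
# (system, Rmax in TFlop/s, power in kW), copied from the TOP500 table
TOP500 = [
    ("Tianhe-2", 33862.7, 17808),
    ("Titan", 17590.0, 8209),
    ("Sequoia", 17173.2, 7890),
    ("K computer", 10510.0, 12660),
    ("Piz Daint", 6271.0, 2325),
]


def mflops_per_watt(rmax_tflops, power_kw):
    """Convert Rmax (TFlop/s) and power (kW) into MFlops/W."""
    # 1 TFlop/s = 1e6 MFlop/s and 1 kW = 1e3 W
    return rmax_tflops * 1e6 / (power_kw * 1e3)


efficiency = {name: mflops_per_watt(rmax, kw) for name, rmax, kw in TOP500}

# Print the systems from most to least efficient under this estimate.
for name, value in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {value:8.1f} MFlops/W")
```

Under this estimate the GPU-accelerated Piz Daint comes out ahead of the other four systems, which is consistent with the Green 500 result that Xeon + K20x machines dominate.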

Green 500
•  A sweeping victory for Xeon + NVIDIA K20x systems

 #  System                            MFlops/W  Power (kW)
 1  TSUBAME-KFC (Xeon/K20x)            4503.17      27.78
 2  Wilkes (Xeon/K20)                  3631.86      52.62
 3  HA-PACS TCA (Xeon/K20x)            3517.84      78.77
 4  Piz Daint (Xeon/K20x)              3185.91    1753.66
 5  romeo (Xeon/K20x)                  3130.95      81.41
 6  TSUBAME 2.5 (Xeon/K20x)            3068.71     922.54
 7  iDataPlex DX360M4 (Xeon/K20x)      2702.16      53.62
 8  iDataPlex DX360M4 (Xeon/K20x)      2629.10     269.94
 9  iDataPlex DX360M4 (Xeon/K20x)      2629.10      55.62
10  CSIRO GPU Cluster (Xeon/K20m)      2358.69      71.01

Photo: TSUBAME-KFC (oil-immersion cooling)

Graph 500
•  No change from the previous list

 #  Prev.  System                  GTEPS
 1   1     Sequoia (BG/Q)          15363
 2   2     Mira (BG/Q)             14328
 3   3     JUQUEEN (BG/Q)           5848
 4   4     K computer (SPARC64)     5524.12
 5   5     Fermi (BG/Q)             2567
 6   6     Tianhe-2 (Xeon/Phi)      2061.48
 7   7     Turing (BG/Q)            1427
 7   7     Blue Joule (BG/Q)        1427
 7   7     DIRAC (BG/Q)             1427
 7   7     Zumbrota (BG/Q)          1427

*) TEPS: Traversed Edges Per Second

Green Graph 500
•  TSUBAME-KFC takes a double crown, topping this list as well as the Green 500
•  In the small data category, the Graph CREST team won decisively

Big data category:
 #  System          MTEPS/W  Graph500 rank
 1  TSUBAME-KFC        6.72       47
 2  JUQUEEN            5.41        3
 3  Mira               4.42        2
 4  EBD-RH5885v2       4.35       96
 5  Sequoia            3.55        1

Small data category (scale < 30):
 #  System                        MTEPS/W  Graph500 rank
 1  GraphCREST-Xperia-A-SO-04E     153.17      143
 2  GraphCREST-NEXUS7-2013         129.63      141
 3  Kicy6                           73.57       58
 4  GraphCREST-Tegra3               64.12      150
 5  GraphCREST-Intel-NUC            53.82      124

30 Technical Sessions
•  Application Performance Characterization
•  Cloud Resource Management and Scheduling
•  Data Management in the Cloud
•  Energy Management
•  Engineering Scalable Applications
•  Extreme-Scale Applications
•  Fault Tolerance and Migration in the Cloud
•  Fault-Tolerant Computing
•  GPU Programming
•  Graph Partitioning and Data Clustering
•  I/O Tuning
•  Improving Large-Scale Computation and Data Resources
•  In-Situ Data Analytics and Reduction
•  Inter-Node Communication
•  Load Balancing
•  MPI Performance and Debugging
•  Matrix Computations
•  Memory Hierarchy
•  Memory Resilience
•  Optimizing Data Movement
•  Optimizing Numerical Code
•  Parallel Performance Tools
•  Parallel Programming Models and Compilation
•  Performance Analysis of Applications at Large Scale
•  Performance Management of HPC Systems
•  Physical Frontiers
•  Preconditioners and Unstructured Meshes
•  Sorting and Graph Algorithms
•  System-wide Application Performance Assessments
•  Tools for Scalable Analysis

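Fast VM Migration
•  Proposes guide-copy, a live migration scheme that is fast and places little load on the network

–  A variant of the post-copy approach
–  Page transfers are optimized according to hints from a guide VM left behind at the migration source
–  cf. Yabusame, Miyakodori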
[Figure 3: The guide-copy migration’s architecture with an example of a guided memory transfer scenario. Panels: (a) Guide-copy architecture, (b) Guided memory transfer mechanism.]

J. Kim (POSTECH), et al., “Guide-copy: fast and silent migration of virtual machine for data centers”
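The idea behind guide-copy can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not the authors' implementation: memory accesses are a fixed, replayable list of page numbers, the guide VM at the source runs `lead` accesses ahead of the migrated VM, and the link can push `pages_per_step` hinted pages per access. All function and parameter names are hypothetical.

```python
from collections import deque


def postcopy_faults(trace):
    """Plain post-copy: every first touch of a page is a remote page fault."""
    resident, faults = set(), 0
    for page in trace:
        if page not in resident:
            faults += 1
            resident.add(page)
    return faults


def guidecopy_faults(trace, lead, pages_per_step):
    """Guide VM logs pages `lead` accesses ahead; they are pushed in advance."""
    resident, queued, queue = set(), set(), deque()
    faults, guide_pos = 0, 0
    for step, page in enumerate(trace):
        # The migrated VM touches the page; a miss is a demand fetch.
        if page not in resident:
            faults += 1
            resident.add(page)
        # The guide VM advances and records hints from its access log.
        while guide_pos < len(trace) and guide_pos <= step + lead:
            hinted = trace[guide_pos]
            if hinted not in resident and hinted not in queued:
                queue.append(hinted)
                queued.add(hinted)
            guide_pos += 1
        # The link pushes a limited number of hinted pages per step.
        budget = pages_per_step
        while budget and queue:
            pushed = queue.popleft()
            if pushed not in resident:
                resident.add(pushed)
                budget -= 1
    return faults


trace = [0, 1, 2, 3, 0, 1, 4, 5, 6, 7]
print(postcopy_faults(trace), guidecopy_faults(trace, lead=2, pages_per_step=1))
```

With no lead (`lead=0`) the model degenerates to plain post-copy. With a small lead, pages arrive before they are touched and only the very first access faults.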

Fast VM Migration (continued)
•  Fewer page faults and lower delay
[Figure 5: Guide-copy’s in-time memory transfer reducing the number of page faults and their service latency. Panels: (a) Page faults - 1Gbps, (b) Delay - 1Gbps, (c) Page faults - 5Gbps, (d) Delay - 5Gbps. Axes: Page fault (MB) and Delay (s) per workload, post-copy vs. guide-copy.]
[Figure 6: The execution time of workloads repeating back-to-back post-copy and guide-copy migrations with a 5s interval.]
[Figure 7: The guide-copy migration delay with varying network bandwidth availability. Panels: (a) Delay - bzip2, (b) Delay - cactusADM. Axis: Network bandwidth (Gbps).]
•  Reduced bandwidth usage
[Figure 8: Guide-copy’s cost-effective adaptive migrat… (normalized to the baseline post-copy scheme)]

Cloud Resource Management
•  Background and motivation

–  Environments for building virtual clusters on public clouds are becoming available, e.g., StarCluster
–  Users want to compute cheaply by taking advantage of reserved instances

•  Proposes the Semi-Elastic Cluster (SEC), in which users jointly purchase and share cloud resources, much like “Groupon”
•  The cluster size is adjusted dynamically according to load
•  Implemented as an extension of batch scheduling

–  61% cost reduction in simulation experiments

[Figure 2: Semi-elastic cluster model. Panels: (a) Pure on-demand cloud, (b) Traditional local cluster, (c) Semi-elastic cluster. Axes: Processors vs. Time (Hour). Each job is labeled with its (arrival time, execution time) pair: A (0,1.5), B (0.25,0.5), C (1,0.75), D (1.75,1.5).]

S. Niu (Tsinghua Univ.), et al., “Cost-effective Cloud HPC Resource Provisioning by Building Semi-Elastic Virtual Clusters”
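Why sharing instances across jobs can save money is easiest to see with the four jobs from the figure. The following is a toy model, not the scheduler from the paper. It assumes that instances are billed in whole hours from launch, that every job needs one instance, and that a shared pool hands a job to whichever idle instance adds the fewest billed hours. The function names are made up for this sketch.

```python
import math

# (name, arrival hour, execution hours), as labeled in the figure
JOBS = [("A", 0.0, 1.5), ("B", 0.25, 0.5), ("C", 1.0, 0.75), ("D", 1.75, 1.5)]


def billed_hours(launch, end):
    """Whole instance-hours charged for an instance alive from launch to end."""
    return math.ceil(end - launch)


def on_demand_cost(jobs):
    """Every job launches its own instance and releases it when it finishes."""
    return sum(math.ceil(run) for _, _, run in jobs)


def shared_cost(jobs):
    """Jobs reuse idle instances of a shared pool when that is cheaper."""
    instances = []  # each entry: [launch time, time the instance becomes idle]
    for _, arrival, run in sorted(jobs, key=lambda job: job[1]):
        end = arrival + run
        best, best_extra = None, math.ceil(run)  # cost of a fresh instance
        for inst in instances:
            if inst[1] <= arrival:  # idle when the job arrives
                extra = billed_hours(inst[0], end) - billed_hours(inst[0], inst[1])
                if extra <= best_extra:
                    best, best_extra = inst, extra
        if best is None:
            instances.append([arrival, end])
        else:
            best[1] = end
    return sum(billed_hours(launch, end) for launch, end in instances)


print(on_demand_cost(JOBS), shared_cost(JOBS))
```

In this model the shared pool needs fewer billed instance-hours because short jobs fill the already-paid remainder of an hour instead of opening a new billing period.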

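Data Management in the Cloud (1)

•  Background

–  In data science fields that handle extremely large data, it is common either to transfer data with GridFTP and process it on the client side, or to use the SaaS offering Globus Online
–  When transferring over a WAN, the server side needs to support user-defined data subsetting in order to reduce the data volume
•  Developed SDQuery DSI (Scientific Data Query Data Storage Interface) as a GridFTP plug-in
–  Provides a subsetting API for the HDF5 and NetCDF data formats
–  System optimizations
•  A performance model that automatically chooses between index-based search over data segments and a full scan with in-memory filtering
•  Parallel-stream data transfer, which uses separate TCP streams when different disk blocks are read
•  Parallel indexing, which builds the index for each sub-block concurrently

Y. Su (Ohio State Univ.), et al., “SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol”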

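Data Management in the Cloud (1), continued

The experiments demonstrated the following:
•  The performance model is valid
•  Subsetting has little effect on high-bandwidth networks, but a large effect when bandwidth is insufficient
•  Parallel streams and parallel indexing improve performance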
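The relationship between bandwidth and the benefit of subsetting can be sketched with a back-of-the-envelope cost model. This is an illustrative assumption, not the performance model from the paper: the server scans the whole dataset at a fixed rate, then sends only the selected fraction. Every parameter value below is hypothetical.

```python
def transfer_seconds(size_gb, bandwidth_gbps):
    """Time to send size_gb gigabytes over a link of bandwidth_gbps gigabits/s."""
    return size_gb * 8 / bandwidth_gbps


def subset_seconds(size_gb, selectivity, scan_gb_per_s, bandwidth_gbps):
    """Scan everything on the server, then transfer only the selected fraction."""
    scan = size_gb / scan_gb_per_s
    return scan + transfer_seconds(size_gb * selectivity, bandwidth_gbps)


def speedup(size_gb, selectivity, scan_gb_per_s, bandwidth_gbps):
    """How much faster server-side subsetting is than sending the full file."""
    full = transfer_seconds(size_gb, bandwidth_gbps)
    return full / subset_seconds(size_gb, selectivity, scan_gb_per_s, bandwidth_gbps)


# Hypothetical workload: 100 GB file, 10% selected, 10 GB/s server-side scan.
for gbps in (1, 10, 40):
    print(f"{gbps:3d} Gbps: speedup {speedup(100, 0.1, 10, gbps):.2f}x")
```

In this model the fixed server-side scan cost dominates once the link is fast, so the gain from sending less data shrinks as bandwidth grows.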

Data Management in the Cloud (2)
•  Background

–  Data-intensive applications need ultra-high-performance data transfer tools
–  Three bottlenecks on the end-to-end path must be addressed: host, network, and storage

•  Designed, optimized, and evaluated a 100 Gbps end-to-end high-speed data transfer system
–  Uses iSER (iSCSI Extensions for RDMA) for the back-end storage connection
–  Uses RFTP (an RDMA-based file transfer protocol) for host-to-host communication
–  Optimizes performance on each host through NUMA tuning

Y. Ren (Stony Brook Univ.), et al., “Design and Performance Evaluation of NUMA-Aware RDMA-Based End-to-End Data Transfer Systems”

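Data Management in the Cloud (2), continued

Back-end SAN design
•  Uses the iSER protocol
•  Each file is placed in the memory of a designated NUMA node, and target processes are assigned so that the I/O stays local
Use of the RDMA-based protocol RFTP
•  Zero-copy high-speed data transfer greatly reduces CPU utilization

•  The proposed approach (RFTP) achieved 91 Gbps in a 100 Gbps environment, versus 29 Gbps for GridFTP
•  CPU utilization was also lower with the proposed approach
•  The reduction is especially large on the RFTP sink side (RDMA Write)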
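To give a feel for what NUMA tuning on a host involves, the sketch below pins the current process to the CPUs of one NUMA node on Linux. It is an illustration only, not taken from the paper, whose system may well use other mechanisms such as `numactl`. It relies on the Linux sysfs file `/sys/devices/system/node/node<N>/cpulist` and on `os.sched_setaffinity`. The helper names are invented for this example.

```python
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8,10-11" into CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def pin_to_numa_node(node):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the current process
    return cpus


print(parse_cpulist("0-3,8,10-11"))
```

CPU affinity by itself does not bind memory. It helps because Linux by default allocates pages on the node of the CPU that first touches them, so a pinned process tends to keep its buffers local.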

Fault Tolerance Toward Post-Petascale
•  Technical sessions

–  Fault-Tolerant Computing
–  Fault Tolerance and Migration in the Cloud
–  Matrix Computations

•  Panel

–  Fault Tolerance/Resilience at Petascale/Exascale: Is it Really Critical?...

•  Some presentations build fault tolerance into the algorithm itself, such as parallel Hessenberg reduction (linear algebra operations with checksums), but my impression is that Checkpoint/Restart still gets the job done (or that people intend to make it do so)
Y. Jia (Univ. of Tennessee), et al., “Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance”
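“Linear algebra operations with checksums” can be illustrated with the classic checksum scheme for matrix multiplication. The sketch below is that textbook scheme on small integer matrices, not the Hessenberg reduction algorithm from the paper, and all function names are chosen for this example. A checksum row is appended to A and a checksum column to B, so the product carries row and column sums that expose and repair a single corrupted element.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksums(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksums(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_single_error(c_full):
    """Return (row, col) of one corrupted data element, or None if consistent."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one element is inconsistent")


def repair(c_full, i, j):
    """Recompute element (i, j) from the checksum stored at the end of row i."""
    m = len(c_full[0]) - 1
    c_full[i][j] = c_full[i][m] - sum(c_full[i][k] for k in range(m) if k != j)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(add_column_checksums(A), add_row_checksums(B))
C_full[0][1] = 99                     # inject a single soft error
where = locate_single_error(C_full)   # the failing row and column intersect here
repair(C_full, *where)
```

Unlike Checkpoint/Restart, this kind of scheme corrects the error in place without rolling back, at the cost of one extra row and column of arithmetic.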

Exhibition
•  58 countries, 350 exhibits, 10,550 attendees
•  Covered by various media
–  http://news.mynavi.jp/column/sc13/
•  CUDA6, Post-FX10, SX-ACE, etc.
–  http://www.hpcwire.com/tag/sc13/
•  3 main trends: Big data, Cloud, Exascale

ARM-based system

EU exascale supercomputer research project: Mont-Blanc
(The photo above is from another project :-)

Tiled wall display controlled by a RasPi cluster @ SDSC

Charm++ cluster in a bag

FPGA
Convey HC memcached appliance @ DELL
memcached benchmark: 3,644,876 -> 11,756,645 ops/s (about 3.2x)

Non silicon-based computers

CNT Computer @ Stanford

LEGO Turing Machine @ Inria (http://rubens.ens-lyon.fr/)

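Impressions
•  HPC Everywhere = HPC + Big Data

–  HPC is no longer only for science and technology
–  Hybrid architectures are needed (?)

•  Growing attention to HPC Cloud

–  Some papers looked as if they belonged at a systems conference
–  The AIST booth has exhibited HPC cloud work for the past several years, and I could feel firsthand that more people take an interest every year

http://sc13.supercomputing.org/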
