AI Bridging Cloud Infrastructure (ABCI) is a large-scale open AI infrastructure in Japan, built and operated by AIST (the National Institute of Advanced Industrial Science and Technology) and hosted at the University of Tokyo Kashiwa II Campus. It provides:
1) Over 0.55 exaflops of AI (half-precision) computing power from 1088 compute nodes with 4352 NVIDIA Tesla V100 GPUs and 43520 CPU cores for AI and data science research.
2) A dense rack design with ambient warm-water cooling, optimized for thermal management to achieve high-density computing (up to 70 kW per rack).
3) Hierarchical storage comprising 1.6 PB of node-local NVMe SSDs used as burst buffers, a 22 PB parallel filesystem, and object storage for campaign storage.
4) An open access platform to accelerate joint academic-industry R&D for AI in Japan through distributed deep learning.
4. AI Bridging Cloud Infrastructure
as World’s First Large-scale Open AI Infrastructure
• Open, Public, and Dedicated infrastructure for AI/Big Data
• Platform to accelerate joint academic-industry R&D for AI in Japan
• Top-level compute capability w/ 0.550 EFlops(AI), 37.2 PFlops(DP)
Univ. Tokyo Kashiwa II Campus
Operation Scheduled in 2018
5. • 1088x compute nodes w/ 4352x NVIDIA Tesla V100 GPUs, 43520 CPU Cores,
476TiB of Memory, 1.6PB of NVMe SSDs, 22PB of HDD-based Storage and
Infiniband EDR network
• Ultra-dense IDC design from the ground-up w/ 20x thermal density of standard IDC
• Extreme Green w/ ambient warm liquid cooling, high-efficiency power supplies, etc.,
commoditizing supercomputer cooling technologies to clouds ( 2.3MW, 70kW/rack)
Gateway and Firewall; External Networks (SINET5) via 100GbE; Service Network (10GbE)
Computing Nodes: 0.550 EFlops (half precision), 37 PFlops (DP), 476 TiB Mem, 1.6 PB NVMe SSD
• High Performance Computing Nodes (w/ GPU) x 1088
– Intel Xeon Gold 6148 (2.4GHz/20 cores) x 2
– NVIDIA Tesla V100 (SXM2) x 4
– 384 GiB Memory, 1.6 TB NVMe SSD
• Multi-platform Nodes (w/o GPU) x 10
– Intel Xeon Gold 6132 (2.6GHz/14 cores) x 2
– 768 GiB Memory, 3.8 TB NVMe SSD
• Interactive Nodes
Storage: 22 PB GPFS
• DDN SFA14K (w/ SS8462 Enclosure x 10) x 3
– 12 TB 7.2Krpm NL-SAS HDD x 2400
– 3.84 TB SAS SSD x 216
– NSD Servers x 12
• Object Storage / Protocol Nodes
Interconnect: InfiniBand EDR
6. ABCI: AI Bridging Cloud Infrastructure
0.550 EFlops(AI), 37.2 PFlops(DP); 19.88 PFlops(Peak), Ranked #5 Top500 June 2018
• Chips (GPU, CPU)
– GPU: NVIDIA Tesla V100 (16GB SXM2): 7.8 TFlops(DP), 125 TFlops(AI)
– CPU: Intel Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 Core): 1.53 TFlops(DP), 3.07 TFlops(AI)
• Compute Node (4 GPUs, 2 CPUs): 34.2 TFlops(DP), 506 TFlops(AI), 384 GiB MEM, 3.72 TB/s MEM BW, 200 Gbps NW BW, 1.6 TB NVMe SSD
• Node Chassis (2 Compute Nodes): 68.5 TFlops(DP), 1.01 PFlops(AI)
• Rack (17 Chassis): 1.16 PFlops(DP), 17.2 PFlops(AI), 131 TB/s MEM BW, Full Bisection BW within Rack, 70 kW Max
• System (32 Racks): 37.2 PFlops(DP), 0.550 EFlops(AI), 1088 Compute Nodes, 4352 GPUs, 4.19 PB/s MEM BW, 1/3 Oversubscription BW, 2.3 MW
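As a quick sanity check, the per-level figures above compose as expected:

\[
\begin{aligned}
\text{Node (DP)} &= 4 \times 7.8 + 2 \times 1.53 \approx 34.2\ \text{TFlops}\\
\text{Node (AI)} &= 4 \times 125 + 2 \times 3.07 \approx 506\ \text{TFlops}\\
\text{Rack (DP)} &= 34 \times 34.2\ \text{TFlops} \approx 1.16\ \text{PFlops}\\
\text{System (AI)} &= 32 \times 17.2\ \text{PFlops} \approx 0.550\ \text{EFlops}
\end{aligned}
\]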
7. GPU Compute Nodes
• NVIDIA TESLA V100
(16GB, SXM2) x 4
• Intel Xeon Gold 6148
x 2 Sockets
– 20 cores per Socket
• 384GiB of DDR4 Memory
• 1.6TB NVMe SSD x 1
– Intel DC P4600 (U.2)
• EDR Infiniband HCA x 2
– Connected to other Compute Nodes and Filesystems
Node block diagram:
• Xeon Gold 6148 x 2, linked by UPI x3 (10.4 GT/s)
• DDR4-2666 32GB x 6 per socket (128 GB/s per socket)
• IB HCA (100Gbps) x 2
• PCIe gen3 x16 links from each CPU through PCIe switches (x48 and x64) to the GPUs
• Tesla V100 SXM2 x 4, coupled by NVLink2 x2
• NVMe SSD
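One way to see this topology on a running node (not something shown on the slides) is to ask the NVIDIA driver for its interconnect matrix. The sketch below simply wraps nvidia-smi; it assumes the command is available on the node.

# Minimal sketch: print the GPU/NVLink/PCIe topology matrix on a compute node.
# Assumes the NVIDIA driver and nvidia-smi are installed (an assumption, not slide content).
import subprocess

def show_gpu_topology():
    # 'nvidia-smi topo -m' prints NVLink/PCIe relations between GPUs and HCAs
    # along with CPU affinity for each device.
    result = subprocess.run(["nvidia-smi", "topo", "-m"],
                            stdout=subprocess.PIPE, universal_newlines=True,
                            check=True)
    print(result.stdout)

if __name__ == "__main__":
    show_gpu_topology()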
8. Rack as Dense-packaged “Pod”
Pod #1
• LEAF #1–#4 (SB7890)
• SPINE #1–#2 (CS7500)
• FBB #1–#3 (SB7890)
• CX400 #1: CX2570 #1, CX2570 #2
• CX400 #2: CX2570 #3, CX2570 #4
• CX400 #3: CX2570 #5, CX2570 #6
• …
• CX400 #17: CX2570 #33, CX2570 #34
• Uplinks to the SPINE: IB-EDR x 24, i.e. 1/3 oversubscription BW (full bisection BW would need IB-EDR x 72)
• Link widths shown: InfiniBand EDR x1, x6, x4
• x 32 pods
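The 1/3 figure is just the ratio of the uplinks provided to what full bisection would require:

\[
\frac{\text{IB-EDR uplinks per pod}}{\text{IB-EDR links for full bisection}} = \frac{24}{72} = \frac{1}{3}
\]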
9. Hierarchical Storage Tiers
• Local Storage
– 1.6 TB NVMe SSD (Intel DC P4600, U.2) per Node
– Local Storage Aggregation w/ BeeOND (staging sketch below)
• Parallel Filesystem
– 22PB of GPFS
• DDN SFA14K (w/ SS8462 Enclosure x 10) x 3 sets
• Bare Metal NSD servers and Flash-based Metadata
Volumes for metadata operation acceleration
– Home and Shared Use
• Object Storage
– Part of GPFS using OpenStack Swift
– S3-like API Access, Global Shared Use
– Additional Secure Volumes w/ Encryption
(Planned)
Parallel Filesystem
Local Storage as Burst Buffers
Object Storage as Campaign Storage
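To make the tier roles concrete, here is a minimal, hypothetical staging pattern: copy input data from the shared GPFS area to the node-local NVMe SSD (the burst-buffer tier), run against the fast local copy, then write results back to shared storage. The paths below are illustrative placeholders, not actual ABCI paths, and a real BeeOND mount point would differ.

# Hedged sketch of burst-buffer style staging between the storage tiers.
# All paths are hypothetical placeholders, not actual ABCI paths.
import os
import shutil

GPFS_HOME = os.path.expanduser("~/dataset")   # shared parallel filesystem (GPFS) tier
LOCAL_SSD = "/local/scratch/dataset"          # node-local NVMe SSD tier (burst buffer)
RESULTS = os.path.expanduser("~/results")     # outputs go back to shared storage

def stage_in():
    # Copy the working set onto the local SSD so training I/O hits NVMe, not GPFS.
    if os.path.isdir(GPFS_HOME) and not os.path.exists(LOCAL_SSD):
        shutil.copytree(GPFS_HOME, LOCAL_SSD)

def stage_out(local_path):
    # Persist outputs to the shared filesystem before the node-local space is reclaimed.
    if os.path.exists(local_path):
        os.makedirs(RESULTS, exist_ok=True)
        shutil.copy2(local_path, RESULTS)

if __name__ == "__main__":
    stage_in()
    # ... training would read from LOCAL_SSD and write checkpoints there ...
    stage_out(os.path.join(LOCAL_SSD, "checkpoint.npz"))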
10. Performance Reference for Distributed Deep Learning
• Environments
– ABCI 64 nodes (256 GPUs)
– Framework: ChainerMN v1.3.0
• Chainer 4.2.0, Cupy 4.2.3, mpi4py 3.0.0, Python 3.6.5
– Baremetal
• CentOS 7.4, gcc-4.8.5,
CUDA 9.2, CuDNN 7.1.4, NCCL 2.2, OpenMPI 2.1.3
• Settings (see the sketch after this list)
– Dataset: Imagenet-1K
– Model: ResNet-50
– Training:
• Batch size: 32 per GPU, 32 x 256 in total
• Learning Rate: starts at 0.1, multiplied by 0.1 at epochs 30, 60, and 80, w/ warm-up scheduling
• Optimization: Momentum SGD (momentum=0.9)
• Weight Decay: 0.0001
• Training Epoch: 100
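The settings listed above translate fairly directly into ChainerMN code. The following is a minimal sketch under the slide's software versions, not the actual benchmark script: a toy linear classifier and synthetic arrays stand in for ResNet-50 and ImageNet-1K, and the warm-up phase of the learning-rate schedule is omitted.

# Hedged sketch of the distributed training setup from the slide's hyperparameters.
# The toy model and synthetic data are placeholders for ResNet-50 / ImageNet-1K;
# warm-up scheduling is omitted. Run with one MPI rank per GPU (e.g. via mpiexec).
import numpy as np
import chainer
import chainer.links as L
from chainer import training
from chainer.training import extensions, triggers
import chainermn

def main():
    # NCCL2-based all-reduce communicator, one rank per GPU.
    comm = chainermn.create_communicator('pure_nccl')
    device = comm.intra_rank  # local GPU index for this rank

    model = L.Classifier(L.Linear(None, 1000))  # stand-in for ResNet-50
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Momentum SGD: lr=0.1, momentum=0.9, weight decay 1e-4 (from the slide).
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.1, momentum=0.9), comm)
    optimizer.setup(model)
    optimizer.add_hook(chainer.optimizer.WeightDecay(0.0001))

    # Synthetic data in place of ImageNet-1K; batch size 32 per GPU.
    x = np.random.rand(1024, 3 * 224 * 224).astype(np.float32)
    y = np.random.randint(0, 1000, size=1024).astype(np.int32)
    dataset = chainermn.scatter_dataset(
        chainer.datasets.TupleDataset(x, y), comm)
    train_iter = chainer.iterators.SerialIterator(dataset, batch_size=32)

    updater = training.StandardUpdater(train_iter, optimizer, device=device)
    trainer = training.Trainer(updater, (100, 'epoch'))

    # Step decay: multiply lr by 0.1 at epochs 30, 60 and 80.
    trainer.extend(extensions.ExponentialShift('lr', 0.1),
                   trigger=triggers.ManualScheduleTrigger([30, 60, 80], 'epoch'))
    trainer.run()

if __name__ == '__main__':
    main()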
11. Job Submission and Scheduling
• Jobs are submitted as script files to the NQS batch scheduler, which schedules them onto the compute nodes:
$ qsub <option> script_filename
• /home (GPFS) is shared by all nodes over the interconnect
• Interactive access via SSH
• High Throughput Computing
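Scripted submission follows the same command. A small hypothetical wrapper is sketched below; the option list and the script name train_resnet50.sh are placeholders, since the slide does not specify resource flags.

# Hedged sketch: submit a batch script to the NQS scheduler via qsub.
# 'options' and the script name are placeholders; actual resource options
# are site-specific and not given on the slide.
import subprocess

def submit(script_filename, options=None):
    cmd = ["qsub"] + list(options or []) + [script_filename]
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.strip()  # the scheduler's job identifier output

if __name__ == "__main__":
    print(submit("train_resnet50.sh"))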
24. Software Stack for Distributed Deep Learning with Containers
The host provides the base drivers; containers carry the userland libraries and frameworks:
Base Drivers and Libraries on the Host:
• CUDA Drivers
• InfiniBand Drivers
• Filesystem Libraries (GPFS, Lustre)
Userland Libraries in the Container:
• CUDA, CuDNN, NCCL2
• MPI (mpi4py)
• ibverbs (mounted from the host)
Distributed Deep Learning Frameworks:
• Caffe2, ChainerMN, Distributed TensorFlow, MXNet
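As an illustrative check (not an ABCI-provided tool), a short script run inside such a container can report the userland versions that must match the host drivers. It assumes Chainer, CuPy, and mpi4py are installed, as in the slide-10 environment.

# Hedged sketch: report userland CUDA/NCCL/MPI versions inside the container.
# Assumes Chainer, CuPy and mpi4py are installed (as in the slide-10 stack).
import chainer
import cupy
from mpi4py import MPI

print("Chainer:", chainer.__version__)
print("CuPy:", cupy.__version__)
print("CUDA runtime:", cupy.cuda.runtime.runtimeGetVersion())
print("CUDA driver (host):", cupy.cuda.runtime.driverGetVersion())
try:
    from cupy.cuda import nccl
    print("NCCL:", nccl.get_version())
except ImportError:
    print("NCCL: not bundled with this CuPy build")
print("MPI:", MPI.Get_library_version().strip())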