Opportunities of
ML-based data analytics in ABCI
Ryousei Takano
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Data Analytics for System and Facility Energy Management BoF, SC18, December 15th
1
INTRODUCTION OF ABCI SYSTEM
2
HPC is an engine of AI
and
Data is the new OIL
3
Engine
Fuel
Large neural networks
Labeled data
(x,y pairs)
Why is Deep Learning taking off?
Rocket engines: Deep Learning driven by scale
1 million connections (2007): CPU
10 million connections (2008): GPU
1 billion connections (2011): Cloud (many CPUs)
100 billion connections (2015): HPC (many GPUs)
Andrew Ng (Baidu), “What Data Scientists Should Know about Deep Learning”
4
ABCI: Open Innovation Platform for
advancing AI Research & Deployment
System
(32 Racks)
Rack
(17 Chassis)
Compute Node
(4 GPUs, 2 CPUs)
Chips
(GPU, CPU)
Node Chassis
(2 Compute Nodes)
System: 37.2 PFlops (FP64), 0.55 EFlops (FP16)
Rack: 1.16 PFlops (FP64), 17.2 PFlops (FP16)
Node Chassis: 68.5 TFlops (FP64), 1.01 PFlops (FP16)
Compute Node: 34.2 TFlops (FP64), 506 TFlops (FP16)
GPU:
7.8 TFlops(FP64)
125 TFlops(FP16)
CPU:
1.53 TFlops(FP64)
3.07 TFlops(FP32)
0.550 EFlops(FP16), 37.2 PFlops(FP64)
19.88 PFlops(Peak), Ranked #7 in Top500
14.423 GFlops/W, Ranked #4 in Green500
ImageNet training in 224 seconds (world record)
1088 Compute Nodes
4352 GPUs
PRIMERGY
CX2570 M4
PRIMERGY
CX400 M4
Tesla V100
Xeon
Skylake-SP
AIST Booth
#2409
5
System comparison (columns: AAIC, TSUBAME3.0, ABCI, Summit, TPU 3.0 Pod)
Organization: AIST, TokyoTech, AIST, ORNL, Google
Start of operation: 2017, 2017, 2018, 2018, 2018
Number of nodes: 50, 540, 1088, 4608, unknown
Throughput Processor (TP): NVIDIA Tesla P100, NVIDIA Tesla P100, NVIDIA Tesla V100, NVIDIA Tesla V100, TPU 3.0
Number of TP: 400, 2160, 4352, 27648, unknown
Theoretical Perf. (FP64): 2.2 PF, 12.2 PF, 37.2 PF, 200 PF, unknown
Theoretical Perf. (DL): 8.6 PF, 47.2 PF, 550 PF, 3.3 EF, 100 PF / Pod
TOP500*: 287, 19, 5, 1, unknown
GREEN500*: 7, 6, 8, 5, unknown
#Nodes / Rack: 6, 36, 34, 16, unknown
#GPU / Rack: 48, 144, 136, 96, unknown
kW / Rack: 22 kW, 64.8 kW, 67.33 kW, 45-55 kW (est.), unknown
DL Perf. / Rack: 0.9 PF, 3.1 PF, 17 PF, 12 PF, 12.5 PF (100 PF / 8 racks)
*June 2018
ABCI achieves an ultra-densely packaged rack
6
AI Datacenter
“Commoditizing supercomputer cooling technologies to Cloud
(70kW/rack)”
High Voltage Transformer
(3.25MW)
Extension Space
UPS
72 Racks
18 Racks
(w/ UPS)
Extension Space
Active
Chillers
Cooling
capacity:
200kW
Passive
Cooling
Tower
Cooling
capacity:
3MW
Cooling Pod
Server Room: 19m x 24m x 6m
Outdoor Facilities
• Single-floor, cost-effective building
• Hard concrete floor with 2 t/m2 weight tolerance for racks and cooling pods
• Number of racks
  – Initial: 90 (ABCI uses 41 racks)
  – Max: 144
• Power capacity: 3.25 MW
  – ABCI uses 2.3 MW max
• Cooling capacity: 3.2 MW
  – 70 kW/rack: 60 kW water + 10 kW air
7
Free cooling with hybrid air/water cooling system
Fan Coil Unit (FCU)
Cooling Tower
CDU
Water Block
Air 40℃
Air 35℃
Air Cooling
Cooling Pod / Rack
Computing Node
Water
40℃
Water
32℃
Water Cooling
1st closed
circuit
2nd closed
circuit
Outdoor Facilities
Optimizing AI data center operation using ML
8
Monitoring data
(job log and
facility sensor data)
Data Store
Data
for learning
Learning/Inference
• Reduce power consumption
• Improve resource utilization
• Reduce HW maintenance costs using failure prediction
Improve data center operation using a feedback loop of sensor data analysis
ABCI
World-leading supercomputing systems for big data / AI
Develop a framework for optimizing the operation of the AI data center by self-adapting ML/DL technologies
TSUBAME 3.0
Monitoring
Data analytics
Data-driven
operation
Monitoring mechanism
9
Grafana
Prometheus
(Time series DB)
Sensors
(server, facility)
MariaDB
Job logs,
accounting info.
Data Analytics /
ML tools
ABCI Facility Monitoring through Grafana
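The Prometheus time-series DB above exposes its sensor data through an HTTP API whose `query_range` endpoint returns JSON; a minimal sketch of converting such a response into (timestamp, value) pairs for the data-analytics/ML tools. The metric name `node_power_watts` and the instance label are hypothetical, but the response shape follows Prometheus's documented matrix format.

```python
import json

def parse_prometheus_range(response_text):
    """Convert a Prometheus /api/v1/query_range JSON response into a dict
    mapping each series' label set to a list of (timestamp, float) pairs."""
    body = json.loads(response_text)
    series = {}
    for result in body["data"]["result"]:
        # Use the sorted label set as a hashable key for the series.
        key = tuple(sorted(result["metric"].items()))
        series[key] = [(ts, float(v)) for ts, v in result["values"]]
    return series

# Example response shaped like Prometheus's matrix result type
# (metric name and instance label are made up for illustration).
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"__name__": "node_power_watts", "instance": "g0001"},
             "values": [[1542240000, "1480.5"], [1542240060, "1502.0"]]}
        ],
    },
})
parsed = parse_prometheus_range(sample)
```

In a real deployment the `sample` string would come from an HTTP GET against the Prometheus server, and the parsed series would feed directly into the learning pipeline.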
Collecting sensor data
10
Location / #items / Sampling rate / Sensors:
• Server: 873 items, 1 min.-1 hour. CPU/GPU usage, memory usage, I/O usage, temperature, power consumption, etc.
• IB/Ethernet switch: 6245 items, 1 min.-1 hour.
• Job scheduler: 26 items. Job ID, account ID, group ID, job type, resource type, walltime, application info., etc.
• AI data center: 408 items, 1 sec. Temperature (hot/cold aisle), rack inlet air temperature, humidity, power consumption; FCU inlet/outlet temperature, power consumption, status; CDU water temperature, volume of water flow, pump power consumption, status.
• Site: 8 items, 1 min. Temperature, humidity, wind speed/direction, rainfall.
11
ABCI Software Stack
Software
Operating System: CentOS, RHEL
Job Scheduler: Univa Grid Engine
Container Engine: Docker, Singularity
MPI: OpenMPI, MVAPICH2, Intel MPI
Development tools: Intel Parallel Studio XE Cluster Edition, PGI Professional Edition, NVIDIA CUDA SDK, GCC, Python, Ruby, R, Java, Scala, Perl
Deep Learning: Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXnet, Chainer, Keras, etc.
Big Data Processing: Hadoop, Spark
Container support
• Containers enable users to instantly try the state-of-the-art software developed in the AI community
• ABCI supports two container technologies:
  – Docker, which has a large user community
  – Singularity, which has recently been accepted in the HPC community
• ABCI provides various single-node/distributed deep learning framework container images optimized for high performance on ABCI
12
EFFICIENT FACILITY CONTROL
Use Case 1:
100% free cooling is possible in high summer
13
[Chart: water temperature over time during the grand challenge. Cooling water from ABCI stays below 40℃; cooling water returned to ABCI stays below 32℃.]
• The first ABCI grand challenge
was held on 22-26 July 2018.
• The peak power consumption
reached about 1.5MW.
Max./min. temperature in Kashiwa in July 2018
Use case 1: More efficient and smarter facility control
• Cooling towers cannot quickly adjust the water temperature to server load fluctuations.
  – The longer the water circuit loop, the slower the response.
  – In some HPC/data centers, this can be a serious problem.
• There is an opportunity to apply ML to improve utilization.
  – Workload prediction makes it possible to generate cooler water before a heavy workload arrives. That means...
  – the job scheduler can execute more jobs under the temperature constraint, and
  – GPUs/CPUs can run at higher frequencies.
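The pre-cooling idea above can be sketched as a small control loop: forecast near-term rack power from recent samples and lower the cooling-water setpoint before a predicted spike, giving the slow cooling-tower circuit time to respond. The moving-average forecast and all thresholds here are illustrative stand-ins for the learned workload-prediction model the slide proposes.

```python
def forecast_power(history_kw, window=3):
    """Naive next-interval forecast of rack power (kW): mean of the last
    `window` samples. Stands in for a learned workload-prediction model."""
    recent = history_kw[-window:]
    return sum(recent) / len(recent)

def water_setpoint(predicted_kw, base_c=32.0, threshold_kw=60.0, precool_c=2.0):
    """Lower the supply-water setpoint ahead of a predicted heavy load
    (all parameters illustrative, not ABCI's actual control policy)."""
    if predicted_kw >= threshold_kw:
        return base_c - precool_c
    return base_c

history = [40.0, 45.0, 58.0, 65.0, 70.0]  # recent rack power samples (kW)
predicted = forecast_power(history)        # mean of the last 3 samples
setpoint = water_setpoint(predicted)       # heavy load predicted -> pre-cool
```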
14
15
EFFICIENT JOB SCHEDULING
Use case 2:
Preliminary data analysis for efficient deep learning
job scheduling in AAIC (prototype system of ABCI)
16
• Analyzed 55,127 jobs submitted on AAIC from 07/14/2017 to 12/31/2017
• 95% of jobs are single-GPU jobs, and WRA is very low: 0.103 on average.
[Chart: job breakdown into 1-GPU jobs, single-node multi-GPU jobs, and multi-node jobs]
Walltime Request Accuracy (WRA): WRA_j = Walltime_j / Walltime_Request_j
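The WRA metric defined above is straightforward to compute from job-log records; a minimal sketch, with made-up example jobs (real inputs would come from the scheduler's accounting data):

```python
def wra(walltime_s, requested_s):
    """Walltime Request Accuracy for one job: actual walltime divided by
    the walltime the user requested (1.0 = perfect estimate)."""
    return walltime_s / requested_s

jobs = [  # (actual walltime, requested walltime) in seconds; illustrative
    (360, 3600),
    (1800, 7200),
    (900, 86400),
]
scores = [wra(actual, requested) for actual, requested in jobs]
mean_wra = sum(scores) / len(scores)  # low mean = heavy over-requesting
```

A mean WRA near 0.1, as observed on AAIC, means users request roughly ten times more walltime than their jobs actually use, which directly degrades backfill scheduling.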
Use case 2: Efficient job scheduling
• Predicting job execution time and performance
modeling are important for efficient job scheduling
17
[Figure: regression-based workflow — run stress benchmarks on some systems, extract features, and select a prediction model]
Collaborative filtering (CF) based algorithms handle this by identifying inter-dependencies linking benchmarks with systems.
[Fig. 2: overall workflow for building machine-learning-based models]
Machine Learning Predictions
for Underestimation of Job
Runtime on HPC System
[Guo2018]
Predicting Performance Using
Collaborative Filtering
[Salaria2018]
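The collaborative-filtering idea cited above can be illustrated with a toy sketch: estimate a target system's score on a benchmark it has not run, as a similarity-weighted average over systems that have run it, with similarity computed on the benchmarks they share. This is only a minimal illustration of the CF principle, not the actual algorithm of [Salaria2018]; all scores are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity over the benchmarks both systems have scores for."""
    common = [k for k in u if k in v]
    if not common:
        return 0.0
    dot = sum(u[k] * v[k] for k in common)
    nu = math.sqrt(sum(u[k] ** 2 for k in common))
    nv = math.sqrt(sum(v[k] ** 2 for k in common))
    return dot / (nu * nv)

def predict(target, others, bench):
    """Similarity-weighted average of other systems' scores on `bench`."""
    num = den = 0.0
    for scores in others:
        if bench in scores:
            s = cosine(target, scores)
            num += s * scores[bench]
            den += s
    return num / den if den else None

# Benchmark scores per system (higher = faster); values are illustrative.
sys_a = {"b1": 1.0, "b2": 2.0}             # target: has never run b3
sys_b = {"b1": 1.1, "b2": 2.1, "b3": 4.0}
sys_c = {"b1": 0.9, "b2": 1.8, "b3": 3.5}
estimate = predict(sys_a, [sys_b, sys_c], "b3")
```

Because `sys_a` is similar to both reference systems, the estimate lands between their two `b3` scores.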
Use case 2: Efficient job scheduling
• Future work: Job scripts contain important information, including package dependencies, program parameters, and the container image, which can greatly affect the efficiency and execution time of jobs.
  – E.g., use a CNN to process job scripts [Wyatt2018]
18
19
Thank you for your attention!
More Information is available!
http://abci.ai/
AIST Booth
#2409
