This session introduces High Performance Computing (HPC) and outlines the challenges of fitting those workloads into containers. It then touches on the community solutions before showing an approach based on proper Docker. The talk wraps up with an outlook on how containers can foster scientific discoveries by making HPC usable by everyone.
6. Responding to Natural Disasters with IT
● Extreme weather and natural disasters occurring at greater frequencies
● Containers and cloud services for disaster recovery
● Real-time monitoring and social media aiding in predicting damage and loss of life
8. High Performance Computing?
1. Computation has to be done sequentially
[diagram: time steps t=0 → t=1 → t=2]
9. High Performance Computing?
1. Computation has to be done sequentially
2. Decomposition to compute on multiple cores
3. Domains exchange intermediate results
10. High Performance Computing?
1. Computation has to be done sequentially
2. Decomposition to compute on multiple cores
3. Domains exchange intermediate results
4. Fit into compute node(s)
[diagram: domains mapped onto compute0 and compute1]
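The steps on these slides can be sketched in plain Python. This is a toy 1-D averaging stencil: the list slices stand in for domains that MPI ranks would own on real compute nodes, and the halo values stand in for the exchanged intermediate results. All names here are illustrative, not from the talk.

```python
# Toy illustration of domain decomposition with halo exchange.
# A real HPC code would run each domain as an MPI rank on its own
# core/node; here the "domains" are just slices of one list.

def step(domain, left_halo, right_halo):
    """One time step: a simple 3-point averaging stencil."""
    padded = [left_halo] + domain + [right_halo]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]

def run(field, n_domains, steps):
    size = len(field) // n_domains            # assumes an even split
    domains = [field[i * size:(i + 1) * size] for i in range(n_domains)]
    for _ in range(steps):                    # 1. time steps are sequential
        halos = []
        for d in range(n_domains):            # 3. exchange intermediate results
            left = domains[d - 1][-1] if d > 0 else domains[d][0]
            right = domains[d + 1][0] if d < n_domains - 1 else domains[d][-1]
            halos.append((left, right))
        domains = [step(domains[d], *halos[d])  # 2. each domain on its own core
                   for d in range(n_domains)]
    return [x for d in domains for x in d]
```

The halo exchange is the part that forces HPC nodes to communicate every step, which is why node placement and interconnects matter so much in the scheduling discussion later.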
18. Current Solutions
HPC runtimes were given birth to because Docker (CE, back then) did not provide the features necessary to run on HPC systems.
[diagram: Development → Build → Ship (hub.docker.com) → HPC Runtimes]
HPC Runtimes: Pull Image → Extract File-System → Store on /share → chroot /container
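The workflow on this slide can be sketched as a command sequence. This helper only composes the commands for illustration (the function name, image name, and paths are assumptions; real HPC runtimes implement this pipeline natively rather than shelling out to Docker):

```python
def hpc_image_setup(image, share_dir="/share"):
    """Sketch of the classic HPC workaround: pull an image, flatten its
    filesystem onto shared storage, then chroot into it.
    Returns the shell commands instead of executing them."""
    rootfs = f"{share_dir}/{image.replace('/', '_').replace(':', '_')}"
    return [
        f"docker pull {image}",                        # Pull Image
        f"docker create --name tmp {image}",           # container to export from
        f"mkdir -p {rootfs}",
        f"docker export tmp | tar -xf - -C {rootfs}",  # Extract File-System, Store on /share
        "docker rm tmp",
        f"chroot {rootfs} /bin/sh",                    # chroot /container
    ]
```

Note how the flattening step throws away the layered OCI image structure, which is exactly why these runtimes lose the supply-chain properties criticized on the next slides.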
19. Service/Batch Scheduling
Traditionally, container workloads are scheduled declaratively, as a task (pod) on a worker. HPC schedules a workload as a batch job across multiple nodes.
[diagram: left, the Docker Engine with SWARM/Kubernetes placing process1 and process2 on a shared system; right, an HPC controller/manager dispatching one distributed process as job-processes via agents on node0 … nodeN]
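The contrast above can be made concrete with a small sketch (all names invented): a service scheduler places each task independently wherever a node is free, while an HPC batch scheduler gang-schedules, running a job only when every rank can get a node at once.

```python
def service_schedule(tasks, free_nodes):
    """Service model: place each task independently on any free node;
    tasks beyond the available nodes simply wait."""
    return {t: n for t, n in zip(tasks, free_nodes)}

def batch_schedule(job_ranks, free_nodes):
    """Batch model: all-or-nothing. The job runs only if every rank
    gets a node simultaneously; otherwise it stays queued."""
    if len(free_nodes) < job_ranks:
        return None
    return {f"rank{r}": free_nodes[r] for r in range(job_ranks)}
```

The all-or-nothing placement is what the halo-exchange pattern demands: a distributed process cannot make progress with only some of its ranks running.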
21. Current Solutions [cont]
HPC-specific workarounds
+ Drop-in replacement, as it wraps the job
- Not OCI compliant
- No secure supply chain
- No integration with the upstream ecosystem
[diagram: controller/manager dispatching job-processes through an HPC runtime via agents on node0 … nodeN of a shared system]
23. Kernel-bypassing Devices
To achieve the highest performance, the kernel got squeezed out of the equation for some technologies.
[diagram: userland application reaching hardware, with ETH going through the OS kernel's TCP/IP stack, while GPU (via CUDA) and IB (via OFED/libnet) bypass the kernel]
24. Scientific Environments
Scientific end-users expect the environment to be set up for them, without prior knowledge about the specific cluster.
[diagram: a service cluster and a compute cluster sharing storage (/home/, /proj/), with an Engine per node running ranks rank0 … rank2 and AI workloads]
28. Leveraging HPC in the Enterprise
[diagram: a control plane connecting an image registry (security scan & sign) and the docker store to traditional, third-party, and HPC workloads]
29. Convergence of AI and HPC
A ladder of rising complexity and maturity:
- beginner: single node, local storage (non-GPU)
- intermediate: single node, shared storage (device passthrough, shared file-system)
- advanced: multi node, shared storage (MPI), a.k.a. HPC!