Best practices for optimizing Red Hat
platforms for large scale datacenter
deployments on DGX systems
Charlie Boyle, NVIDIA
Andre Beausoleil and Jeremy Eder, Red Hat
NVIDIA GTC, Washington, DC, October 2018
Agenda
● Relationship Overview
● Announcements / What’s New
● Tuned profile for DGX
● NGC Container Support overview
● RHEL, OpenShift, DGX-1 Integration Details
Summary of Announcements!
● NVIDIA DGX-1 is now CERTIFIED on Red Hat Enterprise Linux 7
● Support for using DGX nodes as workers in OpenShift 3.10 or later
● NGC containers can run on Red Hat Enterprise Linux and OpenShift
● Expanded Engineering Relationship
Red Hat/NVIDIA Partnership Timeline
Open Source Project Collaboration
NOUVEAU DRIVER
● Key Red Hat Maintainer: Ben Skeggs
● Qualified with new NVIDIA architectures
● Part of complete OSS toolchain for HMM
HETEROGENEOUS MEMORY MGMT. (HMM)
● Key Red Hat developer: Jerome Glisse
● Memory management between device & CPU
● Key developer simplification, not just NVIDIA
GPU-AWARE GCC (LIBGOMP)
● Key Red Hat Maintainer: Jakub Jelinek
● OpenMP common library
NVIDIA vGPU & RHV
● Multiple vGPUs for compute and graphics workloads
Joint Testing of Critical CVEs
Installing Red Hat Enterprise Linux 7
Tuned
Tuning profile delivery mechanism
Red Hat ships tuned profiles that
improve performance for many
workloads...hopefully yours!
Okay, but why do I care?
Tuned: Your Custom Profiles
Parents: throughput-performance, latency-performance
Children: network-throughput, network-latency, virtual-host, virtual-guest, balanced, desktop
Children/Grandchildren (your custom profiles): Your Web Profile, Your Database Profile, Your Middleware Profile
Tuned: Profile Inheritance (throughput)
throughput-performance (parent) → dgx-performance (child)
governor=performance
energy_perf_bias=performance
min_perf_pct=100
readahead=4096
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
vm.dirty_background_ratio = 10
vm.swappiness=10
[bootloader]
cmdline = ast.modeset=0
rd.driver.blacklist=nouveau nouveau.modeset=0
transparent_hugepage=madvise console=tty0
console=ttyS1,115200n8
intremap=no_x2apic_optout
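Collected into a single profile file, the settings on this slide might look like the sketch below. The file path and section placement follow standard tuned conventions; this is an illustration assembled from the slide's values, not the official profile as shipped.

```ini
# /etc/tuned/dgx-performance/tuned.conf -- illustrative sketch
[main]
summary=Performance profile for NVIDIA DGX systems
include=throughput-performance

[cpu]
governor=performance
energy_perf_bias=performance
min_perf_pct=100

[disk]
readahead=4096

[sysctl]
kernel.sched_min_granularity_ns=10000000
kernel.sched_wakeup_granularity_ns=15000000
vm.dirty_background_ratio=10
vm.swappiness=10

[bootloader]
cmdline=ast.modeset=0 rd.driver.blacklist=nouveau nouveau.modeset=0 transparent_hugepage=madvise console=tty0 console=ttyS1,115200n8 intremap=no_x2apic_optout
```

A profile laid out this way is activated with `tuned-adm profile dgx-performance`; the `include=` line is what gives you the parent/child inheritance shown above.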
Red Hat OpenShift Container Platform
OPENSHIFT IS GAINING MOMENTUM
OPENSHIFT CUSTOMER GROWTH IS ACCELERATING
Why OpenShift is the Best Choice
CODE: Red Hat is the leading Kubernetes developer and contributor with Google. We make container development easy, reliable, and more secure.
CLOUD: Years of experience running OpenShift Online and OpenShift Dedicated services.
PARTNERS: Strong partnerships with cloud providers, ISVs, CCSPs. Extensive container catalog of certified partner images.
CUSTOMERS: Most reference customers running in production.
COMPREHENSIVE: Comprehensive portfolio of container products and services, including developer tools, security, application services, storage, and management.
One Platform to...
OpenShift is the single platform
to run any application:
● Old or new
● Monolithic/Microservice
What does an OpenShift (OCP) Cluster look like?
DGX-1 server with Red Hat Enterprise Linux and OpenShift Container Platform (OCP)
Upstream First: Kubernetes Working Groups
● Resource Management Working Group
○ Features Delivered
■ Device Plugins (GPU/Bypass/FPGA)
■ CPU Manager (exclusive cores)
■ Huge Pages Support
○ Extensive Roadmap
○ Intel, IBM, Google, NVIDIA, Red Hat, many more...
● Network Plumbing Working Group
○ Formalized Dec 2017
○ Goal is to implement a pseudo-standard collection of CRDs for multiple networks, owned by sig-network, *out of tree*
○ Separate control- and data-planes, overlapping IPs, fast data-plane
○ IBM, Intel, Red Hat, Huawei, Cisco, Tigera...at least
OpenShift Cluster Topology
● Control Plane: 3× (master and etcd)
● Infrastructure: LB, 3× (registry and router)
● Compute and GPU Nodes: 4× DGX-1
OpenShift Cluster Topology: Compute and GPU Nodes (4× DGX-1)
How to enable software to take advantage of “special” hardware:
● Create Node Pools
○ Mark them as “special”
○ Taints/Tolerations
○ Priority/Preemption
○ ExtendedResourceToleration
● Tune/Configure the OS
○ Tuned Profiles
○ CPU Isolation
○ sysctls
● Optimize your workload
○ Dedicate CPU cores
○ Consume hugepages
● Enable the Hardware
○ Install drivers
○ Deploy Device Plugin
● Consume the Device
○ KubeFlow Template deployment
Soft or Hard Shared Cluster Partitioning?
Priority and Preemption
● Create PriorityClasses based on business
goals
● Annotate pod specs with priorityClassName
● If all GPUs are used
○ A high prio pod is queued
○ A low prio pod is running
○ Kube will preempt low prio pod
■ And schedule high prio pod
● Ensures optimal density
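A minimal sketch of these two steps as manifests. The class name, value, and pod details are illustrative, and the `scheduling.k8s.io` API version varies by Kubernetes release (it has since graduated to `v1`):

```yaml
# PriorityClass reflecting a business goal (name and value are illustrative)
apiVersion: scheduling.k8s.io/v1beta1   # v1 in current Kubernetes releases
kind: PriorityClass
metadata:
  name: production-training
value: 1000000
globalDefault: false
description: "High-priority GPU training jobs"
---
# Annotate the pod spec with priorityClassName
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  priorityClassName: production-training
  containers:
  - name: train
    image: nvcr.io/nvidia/tensorflow:18.08-py3   # illustrative NGC image tag
    resources:
      limits:
        nvidia.com/gpu: 8
```

With all GPUs busy, the scheduler preempts a running pod of a lower PriorityClass to make room for this one.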
Taints and Tolerations
● Taints are “node labels with policies”
○ You can taint a node like nvidia.com/gpu=value:NoSchedule
● A pod must then “tolerate” the nvidia.com/gpu taint, otherwise it won’t run on that node
● This allows you to create “node pools”
● Could lead to under-utilized resources
● Might make sense for security or business rules
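In practice the taint is applied per node and the toleration lives in the pod spec. A sketch, with the taint value and pod details as illustrative placeholders:

```yaml
# Applied once per GPU node, e.g.:
#   oc adm taint nodes <node-name> nvidia.com/gpu=present:NoSchedule
#
# A pod that should land in the GPU node pool tolerates that taint:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"      # tolerate any value of the taint
    effect: "NoSchedule"
  containers:
  - name: work
    image: nvcr.io/nvidia/cuda:9.2-base   # illustrative NGC image
    resources:
      limits:
        nvidia.com/gpu: 1
```

Pods without the toleration are kept off the tainted nodes, which is what creates the “node pool” behavior described above.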
OpenShift + NVIDIA Device Plugin on DGX (software stack, top to bottom)
● Pods: NGC-gpu-pod-1, NGC-gpu-pod-2, NGC-gpu-pod-3, nvidia-device-plugin
● OpenShift Container Platform
● Linux Container Runtime + nvidia-container-runtime-hook + libnvidia-container
● NVIDIA Driver
● Red Hat Enterprise Linux
OpenShift + NVIDIA Device Plugin on DGX
The Device Plugin runs as a daemonset and advertises the node’s 8× Volta GPUs to the Kubelet; the Kube Scheduler then places pods onto nodes with free GPUs. A benchmark pod submitted with oc create requests all eight:
resources:
  limits:
    nvidia.com/gpu: 8
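Expanded into a complete manifest, the benchmark pod’s resource request might look like this. Pod name, image, and command are illustrative, not taken from the original deck:

```yaml
# benchmark-pod.yaml -- requests all 8 GPUs on a DGX-1
apiVersion: v1
kind: Pod
metadata:
  name: benchmark
spec:
  restartPolicy: Never
  containers:
  - name: benchmark
    image: nvcr.io/nvidia/cuda:9.2-base   # illustrative NGC CUDA image
    command: ["nvidia-smi"]               # prints the GPUs visible to the container
    resources:
      limits:
        nvidia.com/gpu: 8   # matched against the GPUs the Device Plugin advertises
```

Submitted with `oc create -f benchmark-pod.yaml`; the pod stays Pending until a node with eight free GPUs is available.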
Demo
1. Log in to the OpenShift web console and land at the Service Catalog
2. Verify the NVIDIA device-plugin daemonset is running in the kube-system namespace
3. Show how you can get a console in any running container
4. Change to the nvidia namespace, and filter the catalog to only show NGC templates
5. Start a TensorRT Inference Server that uses 4 of the 8 GPUs in the DGX
6. Show logs of the TensorRT pod: it is consuming 4 GPUs and the model server is ready (curl output)
7. Go back to the Service Catalog and again filter by NGC images
8. Start an NGC Caffe framework pod, and configure it to use the remaining 4 GPUs
9. Show logs of the Caffe pod, show nvidia-smi, and show that this pod can access the inference server via curl
NVIDIA Driver Packaging
Red Hat/NVIDIA Expanded Collaboration
● Driver Packaging
● Expanded DGX Testing
● Monitoring
● Heterogeneous Clusters
○ Resource API
● Topology Awareness
● Resource Quota API
References
● radanalytics templates for ML-workflow on OpenShift
● How to use GPUs with DevicePlugin in OpenShift 3.10
● Machine-Learning OpenShift Commons
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
