Perlmutter - A 2020 Pre-Exascale
GPU-accelerated System for NERSC -
Architecture and Application
Performance Optimization
Nicholas J. Wright
Perlmutter Chief Architect
GPU Technology Conference
San Jose
March 21 2019
NERSC is the mission High Performance
Computing facility for the DOE SC
7,000 Users
800 Projects
700 Codes
2000 NERSC citations per year
Simulations at scale
Data analysis support for
DOE’s experimental and
observational facilities
Photo Credit: CAMERA
NERSC has a dual mission to advance science
and the state-of-the-art in supercomputing
• We collaborate with computer companies years
before a system’s delivery to deploy advanced
systems with new capabilities at large scale
• We provide a highly customized software and
programming environment for science
applications
• We are tightly coupled with the workflows of
DOE’s experimental and observational facilities –
ingesting tens of terabytes of data each day
• Our staff provide advanced application and
system performance expertise to users
Perlmutter is a Pre-Exascale System
[Chart: DOE pre-exascale and exascale systems]
Pre-Exascale Systems
• 2013: Mira (Argonne, IBM BG/Q), Titan (ORNL, Cray/NVidia K20), Sequoia (LLNL, IBM BG/Q)
• 2016: Cori (LBNL, Cray/Intel Xeon/KNL), Trinity (LANL/SNL, Cray/Intel Xeon/KNL), Theta (Argonne, Intel/Cray KNL)
• 2018: Summit (ORNL, IBM/NVIDIA P9/Volta), Sierra (LLNL, IBM/NVIDIA P9/Volta)
• 2020: Perlmutter (LBNL, Cray/NVIDIA/AMD)
Exascale Systems
• 2021: A21 (Argonne, Intel/Cray)
• 2021-2023: Frontier (ORNL, TBD), Crossroads (LANL/SNL, TBD), LLNL system (TBD)
NERSC Systems Roadmap
• NERSC-7: Edison – multicore CPU (2013)
• NERSC-8: Cori – manycore CPU (2016); NESAP launched: transition applications to advanced architectures
• NERSC-9: CPU and GPU nodes (2020); continued transition of applications and support for complex workflows
• NERSC-10: Exa system (2024)
• NERSC-11: Beyond Moore (2028)
Increasing need for energy-efficient architectures
Cori: A pre-exascale supercomputer for the
Office of Science workload
Cray XC40 system with 9,600+ Intel
Knights Landing compute nodes
68 cores / 96 GB DRAM / 16 GB HBM
Support the entire Office of Science
research community
Begin to transition workload to energy
efficient architectures
1,600 Haswell processor nodes
NVRAM Burst Buffer 1.5 PB, 1.5 TB/sec
30 PB of disk, >700 GB/sec I/O bandwidth
Integrated with Cori Haswell nodes on
Aries network for data / simulation /
analysis on one system
Perlmutter: A System Optimized for Science
● GPU-accelerated and CPU-only nodes
meet the needs of large scale
simulation and data analysis from
experimental facilities
● Cray “Slingshot” - High-performance,
scalable, low-latency Ethernet-
compatible network
● Single-tier All-Flash Lustre based HPC
file system, 6x Cori’s bandwidth
● Dedicated login and high memory
nodes to support complex workflows
GPU nodes: 2-3x Cori (Volta specs shown)
● 4x NVIDIA “Volta-next” GPUs
  ○ > 7 TF
  ○ > 32 GiB HBM-2
  ○ NVLINK
● 1x AMD CPU
● 4 Slingshot connections (4x25 GB/s)
● GPU direct, Unified Virtual Memory (UVM)
CPU nodes: ~1x Cori (Rome specs shown)
● AMD “Milan” CPU
  ○ ~64 cores
  ○ “Zen 3” cores, 7nm+
  ○ AVX2 SIMD (256-bit)
● 8 channels of DDR memory (>= 256 GiB total per node)
● 1 Slingshot connection (1x25 GB/s)
Perlmutter: A System Optimized for Science
● GPU-accelerated and CPU-only nodes
meet the needs of large scale
simulation and data analysis from
experimental facilities
● Cray “Slingshot” - High-performance,
scalable, low-latency Ethernet-
compatible network
● Single-tier All-Flash Lustre based HPC
file system, 6x Cori’s bandwidth
● Dedicated login and high memory
nodes to support complex workflows
How do we optimize
the size of each partition?
NERSC System Utilization (Aug’17 - Jul’18)
● 3 codes > 25% of the
workload
● 10 codes > 50% of the
workload
● 30 codes > 75% of the
workload
● Over 600 codes
comprise the remaining
25% of the workload.
GPU Readiness Among NERSC Codes (Aug’17 - Jul’18)
GPU Status & Description – Fraction
• Enabled: most features are ported and performant – 32%
• Kernels: ports of some kernels have been documented – 10%
• Proxy: kernels in related codes have been ported – 19%
• Unlikely: a GPU port would require major effort – 14%
• Unknown: GPU readiness cannot be assessed at this time – 25%
Breakdown of Hours at NERSC
A number of applications in the NERSC workload are GPU-enabled already.
We will leverage existing GPU codes from CAAR and the community.
How many GPU nodes to buy - Benchmark Suite
Construction & Scalable System Improvement
Select codes to represent the anticipated workload
• Include key applications from the current workload.
• Add apps that are expected to contribute significantly to the future workload.
Scalable System Improvement (SSI)
Measures the aggregate performance of an HPC machine
• How many more copies of the benchmark can be run
relative to the reference machine
• Performance relative to reference machine
Application – Description
• Quantum Espresso – materials code using DFT
• MILC – QCD code using staggered quarks
• StarLord – compressible radiation hydrodynamics
• DeepCAM – weather / Community Atmospheric Model 5
• GTC – fusion PIC code
• “CPU Only” (3 total) – representative of applications that cannot be ported to GPUs
SSI = (#Nodes × Jobsize × Perf_per_node) / (#Nodes_ref × Jobsize_ref × Perf_per_node_ref)
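To make the metric concrete, here is a minimal C++ sketch that evaluates this per-application ratio. The function name and all of the numbers are made up for illustration (they are not Perlmutter, Cori, or benchmark figures); the terms simply follow the formula above.

```cpp
#include <cstdio>

// Per-application SSI ratio of a candidate system against a reference system,
// following the formula above: (#Nodes x Jobsize x Perf_per_node) for the
// candidate divided by the same product for the reference machine.
double ssi_ratio(double nodes, double jobsize, double perf_per_node,
                 double nodes_ref, double jobsize_ref, double perf_per_node_ref) {
    return (nodes * jobsize * perf_per_node) /
           (nodes_ref * jobsize_ref * perf_per_node_ref);
}

int main() {
    // Hypothetical example: a 1,500-node candidate machine that is 4x faster
    // per node than a 9,000-node reference machine, at the same job size.
    double s = ssi_ratio(1500, 1.0, 4.0, 9000, 1.0, 1.0);
    std::printf("per-application SSI ratio = %.2f\n", s);  // prints 0.67
    return 0;
}
```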
Hetero system design & price sensitivity:
Budget for GPUs increases as GPU price drops
Chart explores an isocost design space
• Vary the budget allocated to GPUs
• Assume GPU-enabled applications have a performance advantage of 10x per node; 3 of 8 apps are still CPU-only.
• Examine GPU/CPU node cost ratio
Circles: 50% CPU nodes + 50% GPU nodes
Stars: Optimal system configuration.
GPU/CPU $ per node – SSI increase vs. CPU-only (@ GPU budget %):
• 8:1 – None (no justification for GPUs)
• 6:1 – 1.05x @ 25% (slight justification for up to 50% of budget on GPUs)
• 4:1 – 1.23x @ 50% (GPUs cost effective up to full system budget, but optimum at 50%)
B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, N. J.
Wright, “A Metric for Evaluating Supercomputer Performance in the Era of Extreme
Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking
and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018.
Application readiness efforts justify larger GPU
partitions.
Explore an isocost design space
• Assume 8:1 GPU/CPU node cost ratio.
• Vary the budget allocated to GPUs
• Examine GPU / CPU performance gains
such as those obtained by software
optimization & tuning. 5 of 8 codes
have 10x, 20x, 30x speedup.
Circles: 50% CPU nodes + 50% GPU nodes
Stars: Optimal system configuration
GPU/CPU perf. per node – SSI increase vs. CPU-only (@ GPU budget %):
• 10x – None (no justification for GPUs)
• 20x – 1.15x @ 45% (compare to 1.23x for 10x at a 4:1 GPU/CPU cost ratio)
• 30x – 1.40x @ 60% (compare to the 3x from NESAP for KNL)
B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, N. J.
Wright, “A Metric for Evaluating Supercomputer Performance in the Era of Extreme
Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking
and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018.
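A minimal sketch of the kind of isocost sweep behind the two tables above. It assumes an 8-code benchmark suite with 5 GPU-enabled and 3 CPU-only codes (as on the slides), that every code can still use the CPU cores of any node at 1x, and that SSI is an equally weighted geometric mean of per-code throughput ratios; this simplified model and the function names are illustrative assumptions, and the exact methodology is in the PMBS18 paper cited above.

```cpp
#include <cmath>
#include <cstdio>

// Relative SSI of a mixed CPU/GPU system versus an all-CPU system of the same
// cost. Simplified model (an assumption, not the paper's exact formulation):
// 8 equally weighted codes, 5 of which gain `gpu_speedup` per node on GPU
// nodes; all codes can use the CPU cores of any node at 1x; SSI is the
// geometric mean of the per-code throughput ratios.
double relative_ssi(double gpu_budget_frac, double gpu_cpu_cost_ratio,
                    double gpu_speedup) {
    double cpu_nodes = 1.0 - gpu_budget_frac;                // CPU node cost = 1
    double gpu_nodes = gpu_budget_frac / gpu_cpu_cost_ratio; // GPU node costs more
    double ref_nodes = 1.0;                                  // all-CPU reference

    double gpu_code_ratio = (cpu_nodes + gpu_nodes * gpu_speedup) / ref_nodes;
    double cpu_code_ratio = (cpu_nodes + gpu_nodes) / ref_nodes;
    return std::pow(std::pow(gpu_code_ratio, 5.0) * std::pow(cpu_code_ratio, 3.0),
                    1.0 / 8.0);
}

int main() {
    // Sweep the GPU budget fraction for a few of the cases highlighted in the
    // tables: {GPU/CPU node cost ratio, GPU speedup per node}.
    const double cases[][2] = {{4.0, 10.0}, {8.0, 20.0}, {8.0, 30.0}};
    for (const auto& c : cases)
        for (double f = 0.0; f <= 1.0001; f += 0.25)
            std::printf("cost %.0f:1  speedup %2.0fx  GPU budget %3.0f%%  SSI %.2f\n",
                        c[0], c[1], 100.0 * f, relative_ssi(f, c[0], c[1]));
    return 0;
}
```

Under these assumptions the sweep reproduces the shape of the tables (for example, roughly 1.14x at a 50% GPU budget for a 20x speedup at 8:1, close to the 1.15x @ 45% above), but the numbers should not be read as the paper's results.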
Application Readiness Strategy for Perlmutter
How to enable NERSC’s diverse community of 7,000 users, 750 projects, and 700 codes to run on advanced architectures like Perlmutter and beyond?
• NERSC Exascale Science Application Program (NESAP)
• Engage ~25 applications
• Up to 17 postdoctoral fellows
• Deep partnerships with every SC Office area
• Leverage vendor expertise and hack-a-thons
• Knowledge transfer through documentation and training for all users
• Optimize codes with improvements relevant to multiple architectures
GPU Transition Path for Apps
NESAP for Perlmutter will extend activities from NESAP for Cori:
1. Identifying and exploiting on-node parallelism
2. Understanding and improving data locality within the memory hierarchy
What’s New for NERSC Users?
1. Heterogeneous compute elements
2. Identification and exploitation of even more parallelism
3. Emphasis on a performance-portable programming approach:
– Continuity from Cori through future NERSC systems
OpenMP is the most popular non-MPI
parallel programming technique
● Results from ERCAP
2017 user survey
○ Question answered
by 328 of 658
survey respondents
● Total exceeds 100%
because some
applications use
multiple techniques
OpenMP meets the needs of the
NERSC workload
● Supports C, C++ and Fortran
○ The NERSC workload consists of ~700 applications with a
relatively equal mix of C, C++ and Fortran
● Provides portability to different architectures at other DOE
labs
● Works well with MPI: the hybrid MPI+OpenMP approach is used successfully in many NERSC apps (a minimal sketch follows this list)
● Recent release of OpenMP 5.0 specification – the third version
providing features for accelerators
○ Many refinements over this five year period
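For readers less familiar with the hybrid model, here is a minimal MPI+OpenMP sketch: a generic dot product, not taken from any NERSC application. OpenMP threads parallelize the on-node loop and MPI combines the per-rank results.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

// Minimal hybrid MPI+OpenMP example: each rank computes a partial dot product
// with OpenMP threads, then the partial sums are combined with MPI_Allreduce.
int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       // local problem size per rank
    std::vector<double> x(n, 1.0), y(n, 2.0);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; ++i)
        local += x[i] * y[i];

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("dot product = %.1f (OpenMP threads per rank: %d)\n",
                    global, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}
```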
Ensuring OpenMP is ready for
Perlmutter CPU+GPU nodes
● NERSC will collaborate with NVIDIA to enable OpenMP
GPU acceleration with PGI compilers
○ NERSC application requirements will help prioritize
OpenMP and base language features on the GPU
○ Co-design of NESAP-2 applications to enable effective use
of OpenMP on GPUs and guide PGI optimization effort
● We want to hear from the larger community
○ Tell us your experience, including what OpenMP
techniques worked / failed on the GPU
○ Share your OpenMP applications targeting GPUs
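For reference, the kind of OpenMP target-offload code this collaboration is aiming at looks roughly like the sketch below: a generic SAXPY-style loop, not taken from any NESAP application. Whether a given compiler maps such a loop efficiently onto the GPU is exactly what the co-design effort above is meant to ensure.

```cpp
#include <cstdio>
#include <vector>

// Minimal OpenMP target-offload sketch (OpenMP 4.5+): map the arrays to the
// device and distribute the loop across GPU teams and threads. With unified
// memory (e.g. CUDA UVM or OpenMP 5.0 unified shared memory) the explicit map
// clauses may become unnecessary, but they are kept here for portability.
int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 3.0;
    double* xp = x.data();
    double* yp = y.data();

    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] += a * xp[i];                      // y = a*x + y

    std::printf("y[0] = %.1f\n", y[0]);          // expect 5.0
    return 0;
}
```

If no device is available, the construct falls back to running on the host, so the same code remains portable across the CPU-only and GPU-accelerated partitions.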
Breaking News!
Engaging around Performance Portability
• NERSC is working with PGI/NVIDIA to enable OpenMP GPU acceleration
• NERSC is a member [organization logos shown on the slide]
• Doug Doerfler led the Performance Portability Workshop at SC18 and the 2019 DOE COE Performance Portability Meeting
• NERSC is leading development of performanceportability.org
• NERSC hosted a past C++ Summit and an ISO C++ meeting on HPC
Slingshot Network
• High Performance scalable interconnect
– Low latency, high-bandwidth, MPI
performance enhancements
– 3 hops between any pair of nodes
– Sophisticated congestion control and
adaptive routing to minimize tail latency
• Ethernet compatible
– Blurs the line between the inside and the
outside of the machine
– Allow for seamless external
communication
– Direct interface to storage
Perlmutter has an All-Flash Filesystem
[Storage architecture diagram]
• All-Flash Lustre Storage serving the CPU + GPU nodes: 4.0 TB/s to Lustre, >10 TB/s overall
• Logins, DTNs, Workflows
• Community FS: ~200 PB, ~500 GB/s
• Terabits/sec to ESnet, ALS, ...
• Fast across many dimensions
– 4 TB/s sustained bandwidth
– 7,000,000 IOPS
– 3,200,000 file creates/sec
• Usable for NERSC users
– 30 PB usable capacity
– Familiar Lustre interfaces
– New data movement capabilities
• Optimized for NERSC data workloads
– NEW small-file I/O improvements
– NEW features for high IOPS, non-
sequential I/O
NERSC-9 will be named after Saul Perlmutter
• Winner of the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe
• The Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large-scale simulations with experimental data analysis
• Login: “saul.nersc.gov”
Perlmutter: A System Optimized for Science
● Cray Shasta System providing 3-4x capability of Cori system
● First NERSC system designed to meet needs of both large scale simulation
and data analysis from experimental facilities
○ Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
○ Cray Slingshot high-performance network will support Terabit-rate connections to the system
○ Optimized data software stack enabling analytics and ML at scale
○ All-Flash filesystem for I/O acceleration
● Robust readiness program for simulation, data and learning applications
and complex workflows
● Delivery in late 2020
Thank you !
We are hiring - https://jobs.lbl.gov/