Monitoring Exascale Supercomputers With Tim Osborne | Current 2022
The Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy Office of Science user facility, provides world-class high-performance computing (HPC) resources for open science, along with expertise in scientific computing. The OLCF operates 2 of the top 5 supercomputers in the world: Frontier and Summit. Our Kafka cluster was built in 2018 to stream data from Summit, a 200-petaflop system with 4,000 compute nodes, but is the cluster ready for exascale? The OLCF has recently delivered Frontier, the world's first exascale system, and we engineered a significant increase in streaming bandwidth and volume to serve its performance metrics, system events, utilization metrics, job metadata, and facilities monitoring. Data is indexed and served through an Elasticsearch cluster and provided in real time to Grafana dashboards.
In this talk we will discuss scaling and planning a system to meet the streaming demands of the world’s only exascale and most energy-efficient supercomputer. Tune in to learn more about HPC and how streaming fits into monitoring large-scale systems. We will discuss aggregating data from many clusters into a central streaming system, shedding technical debt by pivoting to Confluent Operator on Kubernetes, and how we use real-time data to optimize supercomputer performance.
1. ORNL is managed by UT-Battelle LLC for the US Department of Energy
Monitoring Exascale Supercomputers
Tim Osborne, Rachel Palumbo, and Corwin Lester
Oak Ridge National Lab (ORNL)
4. 4
Oak Ridge Leadership
Computing Facility (OLCF)
Mission: Deploy and operate the
computational resources required to tackle
global challenges
• Providing world-leading computational and data
resources and specialized services for the most
computationally intensive problems
• Providing a stable hardware/software path of increasing
scale to maximize productive applications development
• Providing the resources to investigate otherwise
inaccessible systems at every scale: from galaxy
formation to supernovae to earth systems to automobiles
to nanomaterials
• With our partners, deliver transforming discoveries in
materials, biology, climate, energy technologies, and
basic science
7. 7
OAK RIDGE NATIONAL LABORATORY'S FRONTIER SUPERCOMPUTER
TOP500: #1
ORNL’s Frontier supercomputer is #1 on the TOP500.
1.1 exaflops of performance on the May 2022 Top500 list.
• 74 HPE Cray EX cabinets
• 9,408 AMD EPYC CPUs, 37,632 AMD GPUs
• 700 petabytes of storage capacity, peak write speeds of 5 terabytes per second using Cray ClusterStor Storage System
• 90 miles of HPE Slingshot networking cables
GREEN500: #1
ORNL’s Frontier supercomputer is #1 on the GREEN500.
52.23 gigaflops/watt power efficiency.
HPL-AI: #1
ORNL’s Frontier supercomputer is #1 on the HPL-AI list.
6.88 exaflops on the HPL-AI benchmark.
Sources: May 30, 2022 Top500 release
8. 8
What is an Exaflop?
• 1 quintillion (billion billion) floating point
operations per second
• 40 times more capable than what ORNL
deployed 10 years ago.
• Can do more floating point operations in 1
second than everyone in America will do in
their lifetimes.
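• Suppose each of 330 million Americans lives roughly 3 billion seconds (about 93 years) and does 1 calculation every 3 seconds, even while asleep. Together they complete only 330 million billion (3.3 × 10^17) calculations, about a third of what an exaflop system does in 1 second.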
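The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted on the slide; the variable names are illustrative and not part of the talk.

```python
# Back-of-the-envelope check of the exaflop comparison.
# All inputs are the round numbers quoted on the slide.
EXAFLOP_OPS_PER_SECOND = 10**18      # 1 quintillion operations per second

population = 330_000_000             # people
lifetime_seconds = 3_000_000_000     # seconds lived per person
seconds_per_calculation = 3          # one calculation every 3 seconds

# Calculations one person completes in a lifetime: 1 billion.
calcs_per_person = lifetime_seconds // seconds_per_calculation

# Calculations everyone completes together: 330 million billion.
total_human_calcs = population * calcs_per_person

# Fraction of a single exaflop-second that this represents.
fraction_of_one_second = total_human_calcs / EXAFLOP_OPS_PER_SECOND

print(f"per person: {calcs_per_person:.2e}")
print(f"everyone:   {total_human_calcs:.2e}")
print(f"fraction of one exaflop-second: {fraction_of_one_second:.2f}")
```

Under these assumptions the combined lifetime total is still short of what an exaflop machine completes in one second.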
10. 10
Frontier Operational Data
• Acceptance Testing
• Validate design decisions
• Monitor 40,000+ components
• Build up requirements for next system
• HPC Research
12. 12
Speeds and Feeds
• Systems data from Summit, Frontier, and
7 other clusters
• File System data from 5 Lustre
deployments and 1 GPFS
• Any other data admins or users want to
throw our way
• 200 topics currently (still working on
incorporating some smaller systems)
• Average over 20 MB/s
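To put the average rate in context, the sketch below converts it to a daily volume and a per-topic average. It assumes the rate is exactly 20 MB/s (the slide says "over", so these are lower bounds), uses decimal units, and assumes traffic is spread evenly across topics, which is unlikely in practice.

```python
# Rough sizing from the slide's figures; results are lower bounds.
average_mb_per_s = 20        # slide: "Average over 20 MB/s"
topic_count = 200            # slide: "200 topics currently"
seconds_per_day = 24 * 60 * 60

# Daily volume in decimal terabytes (1 TB = 1,000,000 MB).
daily_tb = average_mb_per_s * seconds_per_day / 1_000_000

# Mean rate per topic in kB/s, assuming an even spread (an assumption).
per_topic_kb_per_s = average_mb_per_s * 1000 / topic_count

print(f"at least {daily_tb:.3f} TB/day")
print(f"about {per_topic_kb_per_s:.0f} kB/s per topic on average")
```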
13. 13
Analytics and Monitoring Team
• We scale with scale by centralizing data streams.
• Standardize infrastructure, data governance and security.
• Provide tools that add value and visibility to data.
• Help operators become proactive instead of reactive.
• Empower leadership by backing decisions with data.
• Enable researchers to find new ways to make supercomputing
better.
14. 14
2021: NCCS Analytics and Monitoring Platform
This is an implementation of an hourglass
model for our streaming data ecosystem.
The message bus is the spanning layer
between data sources and sinks.
https://cacm.acm.org/magazines/2019/7/237714-on-the-hourglass-model
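The idea of a spanning layer can be illustrated with a toy, in-memory stand-in for the message bus. This is only a sketch of the pattern, not the platform's implementation: the `MessageBus` class, the `connection_counts` helper, the topic name, and the record fields are all hypothetical, and a real deployment would use Kafka producers and consumers instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = dict
Handler = Callable[[Record], None]


class MessageBus:
    """Toy in-memory spanning layer: sources publish, sinks subscribe."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # A sink registers interest in a topic, not in a specific source.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: Record) -> None:
        # A source writes once; every subscribed sink receives the record.
        for handler in self._subscribers[topic]:
            handler(record)


def connection_counts(n_sources: int, n_sinks: int) -> Tuple[int, int]:
    """Return (point-to-point links, links through a shared bus)."""
    return n_sources * n_sinks, n_sources + n_sinks


bus = MessageBus()
search_index: List[Record] = []   # stands in for an indexing sink
dashboard: List[Record] = []      # stands in for a dashboard sink

bus.subscribe("node-power", search_index.append)
bus.subscribe("node-power", dashboard.append)

# Hypothetical record from one data source.
bus.publish("node-power", {"node": "n0001", "watts": 512})

direct, via_bus = connection_counts(n_sources=9, n_sinks=4)
print(direct, via_bus)
```

The narrow waist is what keeps integration cost roughly additive: a new source or sink only has to speak to the bus, not to every system on the other side.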
15. 15
Use Case – Summit Cooling Intelligence
[Diagram: data sources feeding the Kafka message bus, with downstream consumers. Box labels as extracted:]
• Weather: Wetbulb Forecast (NOAA)
• Cooling Plant: MTW PLC Data (C-TECH)
• Cooling Plant: MTW PLC Outlet
• Summit: OpenBMC Telemetry Streaming (IBM)
• Summit Job Scheduler: LSF Job Allocation (IBM)
• Message Bus (Kafka)
• Human Engineer Dashboard
• State Snapshot, Histogram Snapshot, Generator
• Lightweight Time Series Database
• Data Exporters: Serialization, Compression
• Long-term Data Storage: GPFS, HPSS
• Elasticsearch
• Training, Learning, Model, Artifact
16. 16
Thank You!
Acknowledgements:
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak
Ridge National Laboratory, which is supported by the Office of Science of the U.S.
Department of Energy under Contract No. DE-AC05-00OR22725.
Ryan Adamson: adamsonrm@ornl.gov
Tim Osborne: osbornetd@ornl.gov
Rachel Palumbo: palumborl@ornl.gov
Corwin Lester: lestercp@ornl.gov
Rob Jones: jonesjr@ornl.gov
Leah Huk: hukln@ornl.gov