Monitoring Exascale Supercomputers With Tim Osborne | Current 2022

ORNL is managed by UT-Battelle LLC for the US Department of Energy
Monitoring Exascale Supercomputers
Tim Osborne, Rachel Palumbo, and Corwin Lester
Oak Ridge National Lab (ORNL)

ORNL’s mission
Deliver scientific discoveries and technical breakthroughs needed to realize
solutions in energy and national security and provide economic benefit to the nation

Oak Ridge Leadership
Computing Facility (OLCF)
Mission: Deploy and operate the
computational resources required to tackle
global challenges
• Providing world-leading computational and data
resources and specialized services for the most
computationally intensive problems
• Providing stable hardware/software path of increasing
scale to maximize productive applications development
• Providing the resources to investigate otherwise
inaccessible systems at every scale: from galaxy
formation to supernovae to earth systems to automobiles
to nanomaterials
• With our partners, deliver transforming discoveries in
materials, biology, climate, energy technologies, and
basic science

OAK RIDGE NATIONAL LABORATORY'S FRONTIER SUPERCOMPUTER
TOP500: #1
ORNL’s Frontier supercomputer is #1 on the TOP500.
1.1 exaflops of performance on the May 2022 Top500 list.
• 74 HPE Cray EX cabinets
• 9,408 AMD EPYC CPUs, 37,632 AMD GPUs
• 700 petabytes of storage capacity, peak write speeds of 5 terabytes per second using Cray Clusterstor Storage System
• 90 miles of HPE Slingshot networking cables
GREEN500: #1
ORNL’s Frontier supercomputer is #1 on the GREEN500.
52.23 gigaflops/watt power efficiency.
HPL-AI: #1
ORNL’s Frontier supercomputer is #1 on the HPL-AI list.
6.88 exaflops on the HPL-AI benchmark.
Sources: May 30, 2022 Top500 release
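As a rough sanity check on the Frontier hardware figures above, a short Python sketch can derive two ratios from them: GPUs per CPU, and the time needed to write the full storage capacity at the peak write speed. It assumes decimal units (1 petabyte = 1,000 terabytes) and a sustained peak write speed, neither of which the slide states.

```python
# Figures quoted on the slide.
CPUS = 9_408
GPUS = 37_632
STORAGE_PB = 700
PEAK_WRITE_TB_PER_S = 5

# Assumption: decimal units, 1 PB = 1,000 TB.
TB_PER_PB = 1_000

gpus_per_cpu = GPUS / CPUS
fill_seconds = STORAGE_PB * TB_PER_PB / PEAK_WRITE_TB_PER_S
fill_hours = fill_seconds / 3600

print(f"GPUs per CPU: {gpus_per_cpu:g}")
print(f"Time to fill storage at peak write speed: {fill_hours:.1f} hours")
```

Under these assumptions there are 4 GPUs for every CPU, and filling the file system at peak speed would take roughly a day and a half.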

What is an Exaflop?
• 1 quintillion (billion billion) floating point
operations per second
• 40 times more capable than what ORNL
deployed 10 years ago.
• Can do more floating point operations in 1
second than everyone in America will do in
their lifetimes.
• 330 million Americans are alive for about 3 billion seconds (93 years). If each does 1 calculation every 3 seconds (including during sleep), together they reach only 330 million billion calculations, about a third of a quintillion.
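The comparison above is plain arithmetic, so it can be checked in a few lines of Python. The sketch uses only the slide's own round numbers; the variable names are illustrative.

```python
EXAFLOP = 10**18                    # 1 quintillion operations per second
AMERICANS = 330_000_000             # population figure from the slide
LIFETIME_SECONDS = 3_000_000_000    # about 3 billion seconds, per the slide
SECONDS_PER_CALCULATION = 3         # one calculation every 3 seconds

# Total calculations by everyone over a whole lifetime.
lifetime_calculations = AMERICANS * LIFETIME_SECONDS // SECONDS_PER_CALCULATION

print(lifetime_calculations)              # 330000000000000000
print(lifetime_calculations < EXAFLOP)    # True
```

The lifetime total is 3.3 × 10^17, which is still short of the 10^18 operations an exascale machine performs in one second.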

Frontier Operational Data
• Acceptance Testing
• Validate design decisions
• Monitor 40,000+ components
• Build up requirements for next system
• HPC Research

Speeds and Feeds
• Systems data from Summit, Frontier, and
7 other clusters
• File System data from 5 Lustre
deployments and 1 GPFS
• Any other data admins or users want to
throw our way
• 200 topics currently (still working on
incorporating some smaller systems)
• Average over 20 MB/s
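To put the average rate in perspective, the following Python sketch converts 20 MB/s into daily and yearly volumes. It treats 20 MB/s as a lower bound and assumes decimal units (1 TB = 1,000,000 MB); the slide does not say which units are meant or whether the figure is measured before or after compression.

```python
AVERAGE_MB_PER_S = 20            # lower bound quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal units, 1 TB = 1,000,000 MB.
tb_per_day = AVERAGE_MB_PER_S * SECONDS_PER_DAY / 1_000_000
tb_per_year = tb_per_day * 365

print(f"{tb_per_day:.2f} TB/day, {tb_per_year:.0f} TB/year")
```

Under these assumptions the streams add up to at least about 1.7 TB per day.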

Analytics and Monitoring Team
• We scale with scale by centralizing data streams.
• Standardize infrastructure, data governance and security.
• Provide tools that add value and visibility to data.
• Help operators become proactive instead of reactive.
• Empower leadership by backing decisions with data.
• Enable researchers to find new ways to make supercomputing
better.

2021: NCCS Analytics and Monitoring Platform
This is an implementation of an hourglass
model for our streaming data ecosystem.
The message bus is the spanning layer
between data sources and sinks.
https://cacm.acm.org/magazines/2019/7/237714-on-the-hourglass-model
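The spanning-layer idea can be illustrated with a minimal Python sketch. The `MessageBus` class below is a hypothetical in-memory stand-in, not Kafka and not the platform's actual code; the topic name and message fields are likewise made up. It only shows that sources and sinks share the bus and never reference each other directly.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class MessageBus:
    """In-memory stand-in for the spanning layer (hypothetical, not Kafka)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a sink for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        """Deliver a message to every sink subscribed to the topic."""
        for handler in self._subscribers[topic]:
            handler(message)


bus = MessageBus()
dashboard: List[dict] = []
archive: List[dict] = []

# Two independent sinks attach to the same topic.
bus.subscribe("node-telemetry", dashboard.append)
bus.subscribe("node-telemetry", archive.append)

# A source publishes once; it never references the sinks directly.
bus.publish("node-telemetry", {"node": "node0001", "temp_c": 41.5})

print(len(dashboard), len(archive))  # 1 1
```

Adding a new sink is one more `subscribe` call, with no change to any source. That is the property the hourglass model is meant to provide.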

Use Case – Summit Cooling Intelligence
[Diagram: data sources publish to the message bus (Kafka), which feeds downstream consumers]
• Sources: Weather Wetbulb Forecast (NOAA); Cooling Plant MTW PLC Data (C-TECH); Cooling Plant MTW PLC Outlet; Summit OpenBMC Telemetry Streaming (IBM); Summit Job Scheduler LSF Job Allocation (IBM)
• Message Bus (Kafka)
• Consumers and outputs: Human Engineer Dashboard; State Snapshot; Histogram Snapshot Generator; Lightweight Time Series Database; Data Exporters (Serialization, Compression); Long-term Data Storage (GPFS, HPSS); Elasticsearch; Training; Learning Model; Artifact
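As an illustration of what a histogram snapshot over telemetry might look like, here is a minimal Python sketch. The message fields (`node`, `temp_c`), the bin width, and the `histogram_snapshot` function are all assumptions made for the example; the slides do not describe how the actual generator works.

```python
from collections import Counter
from typing import Dict, Iterable


def histogram_snapshot(readings: Iterable[dict], bin_width_c: float = 5.0) -> Dict[float, int]:
    """Count readings per temperature bin, keyed by each bin's lower edge."""
    counts: Counter = Counter()
    for reading in readings:
        lower_edge = (reading["temp_c"] // bin_width_c) * bin_width_c
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical telemetry messages, as they might arrive from the message bus.
telemetry = [
    {"node": "a01", "temp_c": 38.2},
    {"node": "a02", "temp_c": 41.7},
    {"node": "a03", "temp_c": 43.9},
    {"node": "a04", "temp_c": 47.0},
]

snapshot = histogram_snapshot(telemetry)
print(snapshot)  # {35.0: 1, 40.0: 2, 45.0: 1}
```

A snapshot like this reduces thousands of per-node readings to a handful of counts, which is easier to plot on a dashboard or to store than the raw stream.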

Thank You!
Acknowledgements:
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak
Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department
of Energy under Contract No. DE-AC05-00OR22725.
Ryan Adamson: adamsonrm@ornl.gov
Tim Osborne: osbornetd@ornl.gov
Rachel Palumbo: palumborl@ornl.gov
Corwin Lester: lestercp@ornl.gov
Rob Jones: jonesjr@ornl.gov
Leah Huk: hukln@ornl.gov
