The Earth System Grid Federation (ESGF) is a distributed network of climate data servers that archives and shares model output data used by scientists worldwide. ESGF has led data archiving for the Coupled Model Intercomparison Project (CMIP) since its inception. The ESGF Holdings have grown significantly from CMIP5 to CMIP6 and are expected to continue growing rapidly. A new ESGF2 project funded by the US Department of Energy aims to modernize ESGF to handle exabyte scale data volumes through a new architecture based on centralized Globus services, improved data discovery tools, and data proximate computing capabilities.
The Earth System Grid Federation: Origins, Current State, Evolution
1. The Earth System Grid Federation: Origins, Current State, Evolution
Presented by: Ian Foster, Argonne National Laboratory and University of Chicago
PIs: Forrest M. Hoffman (ORNL), Ian Foster (ANL), Sasha Ames (LLNL)
And: Rachana Ananthakrishnan, Jason Boutte, Nathan Collier, Scott Collis, Carlos Downie, Max Grover,
Robert Jacob, Jitendra Kumar, Lee Liming, Giri Prakash, Sarat Sreepathi, Steven Turoscy, Min Xu
2. A DOE project to address the challenges of terabyte climate data
In the beginning was the Earth System Grid
[Images: 2010 report; 2006 web interface]
3. What is the Earth System Grid Federation?
● The Earth System Grid Federation (ESGF) is a globally distributed peer-to-peer network of data servers using a common set of protocols and interfaces to archive and distribute Earth system model (ESM) output
● ESM output data are used by scientists all over the world to investigate consequences of possible climate change scenarios and the resulting Earth system feedbacks
4. ESGF has led data archiving for CMIP since its inception
● Gathering and sharing of climate data is a key effort of the Coupled Model Intercomparison Project (CMIP), the worldwide standard experimental protocol for studying general circulation model output
● This climate modeling research requires enormous scientific and computational resources, involving over 100 models and spanning more than 20 countries
● The World Climate Research Programme (WCRP) serves as the primary coordinating body for this research activity
● The WCRP Working Group on Coupled Modelling (WGCM) relies on the ESGF to support these activities by coordinating and maintaining the distributed petabyte data archive
● CMIP simulation model runs are key components of periodic assessments by the Intergovernmental Panel on Climate Change (IPCC)
5. ESGF Holdings are Large and Growing
● CMIP5 totals >5 PB
● CMIP6 totals >20 PB
● We expect CMIP7 output, including high-resolution simulations and more ensembles, to total >100 PB
● We also plan to expand Federation holdings by adding other Earth science data projects
[Chart: total, distinct, and replica holdings, as of August 22, 2021]
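The growth figures on this slide imply roughly a 4–5x increase per CMIP phase. A quick illustrative check, using only the slide's round numbers (the CMIP8 extrapolation is a hypothetical, not a project projection):

```python
# Back-of-the-envelope growth check using the slide's round numbers
# (5 PB for CMIP5, 20 PB for CMIP6, >100 PB projected for CMIP7).
cmip5_pb = 5
cmip6_pb = 20
cmip7_pb = 100  # projected lower bound from the slide

growth_5_to_6 = cmip6_pb / cmip5_pb   # 4x
growth_6_to_7 = cmip7_pb / cmip6_pb   # 5x

print(f"CMIP5 -> CMIP6: {growth_5_to_6:.0f}x")
print(f"CMIP6 -> CMIP7 (projected): {growth_6_to_7:.0f}x")

# If ~4-5x per phase continued, the following phase would reach
# the exabyte range, matching the "exabyte scale" framing above:
cmip8_pb = cmip7_pb * 5
print(f"Hypothetical next phase at 5x: {cmip8_pb} PB (~{cmip8_pb / 1000:.1f} EB)")
```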
6. Success of ESGF (2021)
● The ESGF data distribution solution has scaled to:
○ Dozens of data nodes
○ Dozens of modeling centers with multiple models
○ 1000s of simulation experiments/variants/ensembles
○ Tens of millions of files
○ Tens of petabytes
○ Thousands of research papers
● ESGF achieved a new level of stability and reliability for CMIP6
● Much of ESGF's success is due to the fact that it has been a paragon of international collaboration, including strong international governance
But also new challenges:
• Much more data
• Desire for yet more data
• Many more consumers
• New applications
• Aging software stack
• Mixed user support
7. ESGF2-US: A New Consortium Project in the USA
● A new team from Oak Ridge National Laboratory, Argonne National Laboratory, and Lawrence Livermore National Laboratory proposed to modernize the data backplane based on the Globus platform
● The ESGF2 proposal was reviewed by a panel of 8 scientists on August 30–31, 2021, and was selected for funding by the US Department of Energy in September
● In collaboration with the ESGF Executive Committee, we will develop and deploy a new architecture based on the Future Architecture Roadmap
● In addition, we will develop new data discovery tools and data access interfaces, server-side computing (subsetting & summarizing), and user computing (Kubernetes & JupyterHub) with improved user & system metrics
● We will add a Resource & Project Liaison group and a Science, User & Facility Advisory Board; hold outreach activities; and offer a help desk/user support
8. Design and implementation principles
● Open architecture and protocols
○ Enable substitution of alternative implementations
● Leverage highly available and scalable central services from Globus
○ Reduce complexity, increase reliability, provide economies of scale
● Use proven, modern security technologies and practices
○ Integrated access control; protect against attacks and intrusions
● Use case approach to design, implementation, and evaluation
○ Ensure that solutions meet real user needs
● Integrated instrumentation
○ Metrics drive data management, data access features, capability development
● Focus on performance to deal with big data
○ High-speed data transfer, scalable search, HPC server-side processing
9. ESGF2-US: Open architecture and protocols
[Architecture diagram: Data Nodes (on-prem/cloud compute and storage) expose data via HTTP/S, Globus Transfer, and OPeNDAP. ESGF Central Services comprise an Identity Provider Proxy (fronting InCommon, ORCID, eduGAIN, and custom identity providers), a Managed Publication Service, a Central Search Service, and an Authorization Service. Clients (the ESGF webapp, custom applications such as portals and scripts, analysis environments such as JupyterHub, and personal laptops) reach data through standard APIs in four ways: (1) use the ESGF webapp to download or transfer data; (2) access with subsetting or reduction via API; (3) custom web and thick-client applications access data via API; (4) analysis applications directly access data.]
10. ESGF2-US: "Services from Globus" – What does that mean?
● Globus is a set of cloud-hosted data management services operated by the University of Chicago for the community
● Managed services for:
○ Auth: Identity broker and delegation
○ Transfer and Sharing: Reliable file transfer
○ Search: Indexing and discovery
○ Flows: Automation
○ Compute: Remote computing
● Hybrid model:
○ Management services in the cloud
○ Agents/connectors at institutions
All services are: accessible by web, accessible via simple APIs, secure, reliable, and easily automatable
[Sanjay Aiyagari]
11. Globus platform
Operated by UChicago for researchers worldwide
Made possible by the support of 200+ subscribers
12. DOE's Current Earth System Grid Federation (ESGF2-US)
● Primary server at LLNL
● Replicating data from the global Federation
● Independent data node at ANL
13. DOE's Next Generation Earth System Grid Federation (ESGF2-US)
● Co-located at DOE's major computing facilities (60,000+ GPUs)
● Replicating data from the global Federation
● Providing cloud indexing, automated migration, and tape archiving
14. Enabling a new level of research productivity
Logging in with her institutional credentials, Samantha is presented with new data, code, and papers relevant to her current research. Intrigued by a new report on extreme precipitation events, she examines a Jupyter notebook that implements the method used. Wondering how this method would work with higher-resolution E3SM data, she quickly locates the required datasets and runs the notebook on a subset. Results are promising, so she shares them with collaborators via ESGF2 federated storage, and they agree that a larger ensemble analysis is called for. ESGF2 confirms that the full ensemble data are available at OLCF, so they submit a request to execute the analysis there. Within 24 hours, results have been published to ESGF2 for broader consumption, along with the notebook used to produce and validate results.
[Image: Flood risk increases with water availability]
15. New: ESGF2-US Failsafe Data Replication
● In the US, LLNL operates the primary ESGF node, which replicates much of the CMIP6 and related model output from around the globe
● Since the data at LLNL are contained only on spinning disk, we decided to replicate the entire ~7.5 PB collection of data to Argonne National Laboratory (ANL) and Oak Ridge National Laboratory (ORNL)
Solution: Use Globus to transfer all the data over ESnet
● We used custom Globus scripting (thanks to Lukasz Lacinski), ESnet network monitoring and diagnostics (thanks to Eli Dart), DTN and GPFS optimized configurations (thanks to Cameron Harr and others), and debugging and problem-solving (thanks to Sasha Ames, Lee Liming, and others)
16. ESnet Global Connectivity
Global ESnet interconnectivity, including high-speed connections to London, Amsterdam, and Geneva, will enable rapid data replication across most of the Federation data nodes.
ESGF2 will make use of the high bandwidth between DOE labs and HPC centers.
Data will be automatically migrated and cached across the ORNL, ANL, and LLNL sites.
An ESnet representative is part of our Resource & Project Liaisons group.
17. https://dashboard.globus.org/esgf (as of May 4, 2022)
● Sustained transfer rates of 1.5 GB/s, with peaks of 4 to 6 GB/s
● 7.5 PB transferred between mid-February and May 4
● 17,347,671 directories and 28,907,532 files
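As a sanity check, the dashboard figures hang together: total bytes, sustained rate, and elapsed time are related by simple arithmetic (the 7.5 PB and GB/s numbers come from the slide; the ~78-day duration for mid-February to May 4 is an approximation):

```python
# Sanity-check the replication campaign numbers from the Globus dashboard.
PB = 1e15  # petabyte, decimal units
GB = 1e9   # gigabyte, decimal units
SECONDS_PER_DAY = 86400

total_bytes = 7.5 * PB

def days_at_rate(rate_gb_per_s: float) -> float:
    """Days needed to move total_bytes at a sustained rate."""
    return total_bytes / (rate_gb_per_s * GB) / SECONDS_PER_DAY

print(f"{days_at_rate(1.5):.0f} days at 1.5 GB/s")  # ~58 days
print(f"{days_at_rate(6.0):.0f} days at 6 GB/s")    # ~14 days

# Mid-February to May 4 is roughly 78 days, so the overall average
# works out slightly below the sustained figure, as expected given
# pauses, retries, and varying load:
avg_rate = total_bytes / (78 * SECONDS_PER_DAY) / GB
print(f"Average over ~78 days: {avg_rate:.1f} GB/s")
```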
20. New Data Discovery Methods
● Scalable, highly available index
● Globus Search for cloud-based, security-enabled index
● Bulk ingest operations for publication and replication
● Globus Flows to automate publication/replication
● Support discovery (faceted search) and query
● MetaGrid, other interfaces
● (In progress/discussion) STAC interface
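Faceted discovery against a Globus Search index is driven by a structured query document. Below is an illustrative sketch of assembling such a request; the facet names (`source_id`, `experiment_id`, `variable_id`, `frequency`) follow the CMIP6 controlled vocabulary, but the exact schema of the ESGF2 index is an assumption here, not confirmed by the slides:

```python
# Build a faceted-search request document of the kind accepted by the
# Globus Search API. Facet/field names below are illustrative CMIP6
# vocabulary terms; the real ESGF2 index schema may differ.
def build_faceted_query(text, filters=None, facets=()):
    """Assemble a search request with match_any filters and term facets."""
    request = {"q": text, "limit": 10}
    if filters:
        request["filters"] = [
            {"type": "match_any", "field_name": field, "values": values}
            for field, values in filters.items()
        ]
    if facets:
        request["facets"] = [
            {"name": f, "type": "terms", "field_name": f, "size": 20}
            for f in facets
        ]
    return request

query = build_faceted_query(
    "tas",  # near-surface air temperature
    filters={"experiment_id": ["historical"], "source_id": ["E3SM-1-0"]},
    facets=("variable_id", "frequency"),
)
# `query` could then be POSTed to an index, e.g. with
# globus_sdk.SearchClient(...).post_search(index_id, query)
```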
21. New: CMIP6 metadata in Globus Search
● Metadata copied from the LLNL index
○ ~5.4 million datasets
○ ~26 million files
● Metadata includes pointers to files at ALCF
○ HTTPS and Globus Transfer
● ~5 second response time to return pointers to all 5.4 million datasets
● Future:
○ Include OLCF pointers
○ Automated metadata enhancement
An index is created and managed in the Globus Search service. Globus Search allows fine-grained access controls for metadata visibility.
22. New publication pipelines
● Repeatable, reliable process with well-defined steps
● Coordinate across heterogeneous resources/services
● Asynchronous processing of publication requests
● Enforce authorization to ensure only permitted accounts can publish
23. Automating publication with Globus Flows
[Flow diagram: Auth (login via Globus Auth) → User input (web form) → Transfer (stage data) → funcX (extract) → funcX (validate) → Search (ingest) → Share (set policy) → Identifier (mint DOI)]
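A Globus flow is declared as a JSON state machine chaining action invocations. The sketch below mirrors the publication steps shown on this slide; the `ActionUrl` values and parameter details are placeholders, not the project's actual flow definition:

```python
# Skeleton of a Globus Flows definition chaining the publication
# pipeline steps: stage data, extract, validate, ingest, set policy,
# mint DOI. ActionUrl values are illustrative placeholders.
STEPS = ["StageData", "Extract", "Validate", "Ingest", "SetPolicy", "MintDOI"]

def build_flow(steps):
    """Chain the named steps into a linear Flows-style state machine."""
    states = {}
    for i, name in enumerate(steps):
        state = {
            "Type": "Action",
            "ActionUrl": f"https://example.org/actions/{name.lower()}",  # placeholder
            "Parameters": {},
            "ResultPath": f"$.{name}Result",
        }
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1]  # linear hand-off to next step
        else:
            state["End"] = True           # terminal state
        states[name] = state
    return {"StartAt": steps[0], "States": states}

flow = build_flow(STEPS)
```

Declaring the pipeline this way is what makes it "repeatable, reliable, with well-defined steps": the flows service, not ad hoc scripts, drives retries and hand-offs between heterogeneous services.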
25. New: Data Proximate Compute
Data Proximate Compute (DPC) is an important part of ESGF2, as:
○ Volumes for CMIP6 are huge and CMIP7's will be larger
○ There are active efforts that we want to enable in HPC workflows: the Pangeo community is one example
○ Not all users have the computers or resources to respond to LCF calls
With DOE computer centers, we pursue a staged, multi-faceted DPC strategy:
1. Built-in server-side operations: e.g., subsetting
2. Dynamic replication of subsets to (elastic) compute platforms
3. Vetted Jupyter notebooks running on leadership computers
We aim for convergence of HPC and cloud environments so that (1) software development effort can be shared and (2) each consumer can be directed to the environment most appropriate for their situation. We do not want to replicate anything!
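The server-side subsetting in step 1 amounts to translating a user's geographic request into index ranges so that only the needed slab is read and shipped. A minimal pure-Python sketch of that translation (the 1-degree grid origin and resolution are assumed for illustration, not taken from any ESGF2 interface):

```python
# Translate a lat/lon bounding box into array index slices on a
# regular global grid, so a server can return only the requested slab.
# The 1-degree grid with origin (-90, 0) is an illustrative assumption.
def bbox_to_slices(lat_min, lat_max, lon_min, lon_max, resolution=1.0):
    """Map a bounding box (degrees) to (lat_slice, lon_slice) on a
    regular grid with latitude starting at -90 and longitude at 0."""
    lat0, lon0 = -90.0, 0.0
    i0 = int((lat_min - lat0) / resolution)
    i1 = int((lat_max - lat0) / resolution) + 1  # inclusive upper edge
    j0 = int((lon_min - lon0) / resolution)
    j1 = int((lon_max - lon0) / resolution) + 1
    return slice(i0, i1), slice(j0, j1)

# A continental-US-sized request on a 1-degree global grid:
lat_s, lon_s = bbox_to_slices(25, 50, 235, 293)
cells = (lat_s.stop - lat_s.start) * (lon_s.stop - lon_s.start)
total = 180 * 360
print(f"{cells} of {total} cells ({100 * cells / total:.1f}%)")
# → 1534 of 64800 cells (2.4%), i.e. the server ships ~2% of the field
```

The point of the sketch: even a crude regional subset cuts the bytes moved by ~40x, which is why subsetting is the first rung of the DPC strategy.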
26. Unsupervised cloud classification with a rotation-invariant autoencoder
https://takglobus.github.io/AICCA-explorer/
Another approach to making large datasets accessible is to use unsupervised learning to generate compact representations
27. AICCA: AI-Driven Cloud Classification Atlas
https://takglobus.github.io/AICCA-explorer/
Unsupervised cloud classification with a rotation-invariant autoencoder reduces 801 TB of source data to a 54 GB atlas: a ~15,000x reduction
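The quoted reduction factor follows directly from the two sizes on the slide (decimal units assumed):

```python
# Check the AICCA compression claim: 801 TB of source data vs 54 GB atlas.
TB, GB = 1e12, 1e9
reduction = (801 * TB) / (54 * GB)
print(f"~{reduction:,.0f}x reduction")  # → ~14,833x, i.e. roughly 15,000x
```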
29. Summary
● ESGF2-US is exploring a next generation Earth System Grid Federation (ESGF2)
○ Designed for an order of magnitude increase in data sizes
○ Highly available, scalable, and fast
○ Automatically migrate data as needed
○ Improved data discovery and sharing tools
○ Server-side computing (e.g., subsetting) for derived data
○ User computing capabilities (e.g., JupyterHub/JupyterLab) near data
● The Globus platform is expected to provide many of the central services of the ESGF2-US data backplane
● We are eager to understand your use cases for ESGF data, and to connect with other data sources and services to broaden applications and user base