Manage Data with Assurance
- Globus is a data transfer service that aims to increase the efficiency of researchers by enabling sustainable data sharing.
- It has over 138,000 registered users and has transferred over 600 petabytes of data between endpoints.
- The document discusses recent enhancements to Globus including new connectors, improved security features for regulated data, and plans to further develop the platform capabilities.
4. Globus by the numbers
• 138,000 registered users
• 100+ subscribers
• 600 PB moved
• 90 billion files processed
• 99.9% availability
• 600+ identity providers
• 7,400 active shared endpoints
• 22,000 active personal endpoints
• 1,800 active server endpoints
• 2,000+ most shared endpoints at a single institution
• 3 months: longest running transfer
• 1 PB: largest single transfer to date
7. Globus User Story Highlights
Themes: file sharing, value, improved performance, ease of use, connector benefits, platform development
• “We needed an easy way to share terabytes of data on a regular basis with dozens of researchers. Thanks to Globus sharing, it’s easy for us to get our researchers the data they need.”
• “Now Canadian researchers have a single repository where data can easily and securely be accessed, searched and shared.”
• “With Globus, our researchers have one less thing to worry about!”
• “I routinely have to move hundreds of gigabytes of data – Globus makes it easy, so I can execute these transfers with very little effort.”
• “Users can quickly, effectively, and securely share data with their research community or the broader public.”
• “WVU uses Globus to archive research data out to Google Drive.”
• “[BlackPearl with Globus] enables us to archive and share petabytes of information in a convenient solution.”
Usage Briefs: www.globus.org/usage-brief-library | User Stories: www.globus.org/user-stories
What makes it all worthwhile
8. “Whatever you are studying right now, if you are not getting up to speed on deep learning, neural networks, etc., you lose. We are going through the process where software will automate software, automation will automate automation.” -- Mark Cuban
10. Opportunities for AI in science: Research today
[Diagram: what scientists want to do (solve societal problems, create knowledge) versus where most scientist time actually goes (configure apparatus/write code, run experiments, analyze and plan)]
12. AI at Argonne: data-driven discovery
• Strong and weak lensing in sky survey data
• Prediction of antimicrobial resistance phenotypes
• Prediction of radiation stopping power
• Identification and tracking of storms
• Parameter extraction in atom probe tomography
• Learning for dynamic sampling in spectroscopy
• Structure-property-process triangle in additive manufacturing
• Vehicle energy consumption prediction
• Photometric redshift estimation
• New materials for efficient solar cells
• Cosmic Microwave Background emulation
• Enhancement of noisy tomographic images
• Nowcasting with convolutional LSTMs
• Efficient climate model emulators
• Defect-level prediction in semiconductors
• Flying object detector for edge deployment
• Discovery of new energy storage materials
• Reduced order modeling of laser sintering
14. Rethinking Data Infrastructure for Science AI
[Diagram: AI workflows (data ingest, data QA/QC, data enhancement, feature selection, model creation, model training, HPO, UQ, model reduction, inference, active/reinforcement learning) connect data, models, and surrogates from:
• Scientific instruments: major user facilities, laboratory equipment, automated labs, …
• Sensors: environmental, laboratories, mobile, …
• Simulation codes: computational results, function memorization, …
• Databases: reference data, experimental data, computed properties, scientific literature, …
with compute (accelerators, agile infrastructure), scientists (expert input, goal setting, …), and AI industry/academia (new methods, open source codes, AI accelerators, …), supported by agile services: data transfer, data sharing, registries, containers, integrity, automation, FaaS, identifiers]
15.–17. (The slide 14 diagram repeated as build slides, progressively overlaying the Globus services and tools that support each element: Transfer, Auth, Sharing; then Automate, Identifiers, funcX; then Parsl, DLHub, xDF, Petrel, CANDLE.)
18. DLHub: Organizing and Serving Models
• Collect, publish, and categorize models
• Serve models via API with access controls to simplify sharing, consumption, and access
• Leverage ALCF resources and prepare for exascale ML
• Deploy and scale automatically
• Provide citable DOIs for reproducible science
www.dlhub.org: Models and Processing Logic as a Service
[Figure: example applications in energy storage, X-ray science, and tomography (Cherukara et al., Ward et al., TomoGAN: Liu et al.); Argonne Advanced Computing LDRD]
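The serving model above can be sketched as a minimal "models as a service" registry. Everything here is illustrative: the class, model name, and metadata fields are hypothetical stand-ins for what the real DLHub service (www.dlhub.org) provides, shown only to make the publish/serve pattern concrete.

```python
# Minimal sketch of the "models and processing logic as a service" pattern
# that DLHub implements. The registry, model name, and metadata fields are
# illustrative, not the real DLHub API.

class ModelRegistry:
    """Collect, categorize, and serve models under a citable identifier."""

    def __init__(self):
        self._models = {}

    def publish(self, name, fn, metadata):
        # In DLHub, publishing also assigns a DOI for reproducibility.
        self._models[name] = {"fn": fn, "metadata": metadata}

    def run(self, name, inputs):
        # In DLHub, this is an authenticated API call; here it is local.
        return self._models[name]["fn"](inputs)

    def describe(self, name):
        return self._models[name]["metadata"]


registry = ModelRegistry()
registry.publish(
    "denoise_tomo",                      # hypothetical model name
    lambda xs: [x * 2 for x in xs],      # stand-in for a trained model
    {"domain": "tomography", "doi": "10.xxxx/example"},
)

print(registry.run("denoise_tomo", [1, 2, 3]))      # [2, 4, 6]
print(registry.describe("denoise_tomo")["domain"])  # tomography
```

The point of the pattern is that consumers invoke models by name through one interface, without installing each model's software stack.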
25. Manage Protected Data
Higher assurance levels for HIPAA and other regulated data
• Support for managed transfer of protected data such as health-related information
• Share data with collaborators while meeting compliance requirements
• Administration and management of access
• Includes BAA option
26. Globus for high assurance data management
• Restricted data handling
– PHI (Protected Health Information)
– PII (Personally Identifiable Information)
– CUI (Controlled Unclassified Information)
• University of Chicago security controls
– NIST 800-53 Low
– Superset of 800-171 Low
• Business Associate Agreements (BAAs) between the University of Chicago and our subscribers
27. Services in scope
• Globus Services: Auth, Transfer & Sharing, Groups
• Globus Connect Server v5.2 and above
• Globus Connect Personal v3.x
• Web app (app.globus.org)
• Globus Command Line Interface (CLI)
• Connectors: POSIX, Google Drive, AWS S3, CEPH
28. Restricted data disclosure to Globus
• Globus never sees file contents
– File contents may therefore contain restricted data
• File paths/names may contain restricted data (e.g., PHI)
• No other elements (endpoint definitions, labels, collection definitions) may contain restricted data
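A site that wants to honor these rules might lint user-supplied labels before task submission. The sketch below is a hypothetical illustration (the patterns and function are not part of Globus); it only shows the kind of client-side check the disclosure rules suggest.

```python
# Hypothetical pre-flight lint: Globus permits restricted data in file
# contents and paths, but NOT in labels or endpoint/collection definitions.
# A site could scan user-supplied label text for obvious PHI/PII patterns
# before submitting a task. The patterns below are illustrative only.
import re

RESTRICTED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-like number
    re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),  # medical record number
]

def label_is_safe(label: str) -> bool:
    """Return False if the label matches an obvious restricted-data pattern."""
    return not any(p.search(label) for p in RESTRICTED_PATTERNS)

print(label_is_safe("nightly genomics sync"))          # True
print(label_is_safe("scan for patient 123-45-6789"))   # False
```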
29. Product enhancements for high assurance
• Additional authentication assurance
– Authenticate with a specific identity, within a specific time window, within a session
• Isolation of applications
– Authentication context is per application, per session (~browser session)
• Enforced encryption of all user data in transit
• Audit logging
– Both at the institution and in Globus services
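The session-bound authentication requirement can be pictured as a timing check. This is an illustrative sketch only: the field names and the 30-minute window are assumptions, and in practice Globus Auth enforces the policy on the service side.

```python
# Sketch of the high-assurance session rule: access is allowed only if the
# required identity authenticated within the last `max_age` seconds of the
# current session. Field names and the 30-minute default are illustrative.
import time

def session_satisfies_policy(session, required_identity,
                             max_age=30 * 60, now=None):
    now = time.time() if now is None else now
    auth_time = session.get("authentications", {}).get(required_identity)
    if auth_time is None:
        return False  # this identity never authenticated in this session
    return (now - auth_time) <= max_age

session = {"authentications": {"alice@example.edu": 1_000_000}}

# Fresh enough: authenticated 10 minutes ago.
print(session_satisfies_policy(session, "alice@example.edu",
                               now=1_000_000 + 600))    # True
# Too old: authenticated 2 hours ago, so re-authentication is required.
print(session_satisfies_policy(session, "alice@example.edu",
                               now=1_000_000 + 7200))   # False
# A different identity in the same session is not sufficient.
print(session_satisfies_policy(session, "bob@example.edu",
                               now=1_000_000 + 600))    # False
```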
30. Product enhancements for high assurance
• Additional security requirements enforced on management of all high assurance resources
– Data access, and any interaction that can lead to data access
– Examples: Groups, Management Console
• Enhanced user interfaces for seamless management of protected data
– Web app and CLI
31. Operational enhancements for high assurance
• Intrusion detection and prevention
• Encryption
• Enhanced logging
• Secure remote access, access control, and secure practices for laptops
• Uniform configuration management and change control
• AWS best practices for a secure environment: VPCs, security groups, IAM
32. New subscription levels
• High Assurance
– 33% uplift on the Standard subscription and on premium connectors used for high assurance data
• BAA
– All High Assurance features, plus a BAA with the University of Chicago
– 50% uplift on the Standard subscription and on premium connectors used under a BAA
34. Web app enhancements
• Accessibility
– Target WCAG 2.0 AA compliance
• Responsiveness and touch support
• Works with new connectors
collections.globus.org
35. Web app enhancements
• Customizable interface
• Full-screen view
• Compact file listing display
• Remember user configuration
– Single vs. dual panel
– Columns displayed
• Continue incorporating user feedback
36. CLI enhancements
• Support for use with high assurance collections
• '--format UNIX' flag: output suitable for line-oriented processing with typical Unix tools
• 'globus rm' command
• 'globus whoami --linked-identities' flag to show all linked identities
• '--timeout-exit-code' flag overrides the default exit code for commands that wait on tasks
• Enhancements to the SDK as needed
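As a sketch of why line-oriented output matters: once each record is one line of delimited fields, downstream tools (or a few lines of Python) can process it directly. The tab-separated three-column layout below is an assumption for illustration, not the CLI's documented output schema.

```python
# Line-oriented output (as produced with the CLI's '--format UNIX' flag) is
# meant for tools like awk and cut. This sketch parses such output in
# Python. The tab-separated (type, name, size) layout is assumed here for
# illustration only.
SAMPLE = """\
file\tresults_001.h5\t1048576
file\tresults_002.h5\t2097152
dir\tlogs\t0
"""

def parse_listing(text):
    """Split line-oriented listing output into (type, name, size) records."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, name, size = line.split("\t")
        records.append((kind, name, int(size)))
    return records

files = [r for r in parse_listing(SAMPLE) if r[0] == "file"]
total = sum(size for _, _, size in files)
print(len(files), total)  # 2 3145728
```

The same pipeline in the shell would be a `globus ls --format UNIX` piped into `awk`; the point is that one record per line makes both approaches trivial.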
38. Connector updates
• Enhanced user experience for credential handling for several connectors (GCSv5)
• AWS S3
– Automated multi-region support
• Google Drive
– Enhanced retry handling for large transfers
• HPSS
– Support added for HPSS 7.5 (7.3 to 7.5 supported)
– Improved asynchronous staging from tape
– New home for documentation: docs.globus.org/premium-storage-connectors/hpss
39. S3-compatible systems
• Initial customer deployments
• Validation, testing, and vendor engagement planned
• Additional systems driven by customer demand
41. Globus for Box
• Extends the value of your Box deployment
• Unifies access to cloud and on-prem storage
• Transitions protected data (HIPAA-regulated, CUI) seamlessly between Box and other storage systems
43. Make Box part of your research storage ecosystem
globus.org/connectors/box
docs.globus.org/premium-storage-connectors/box
45. Globus Connect Server v5.3
• Subsumes GCS versions 5.0, 5.1, and 5.2
• Standard and high assurance guest collections (sharing)
• High assurance mapped collections
• Connectors: POSIX, AWS S3, CEPH, Google Drive, Box
• Data access protocols: GridFTP and HTTPS
• A single deployment supports both high assurance and standard gateways
• Upgrade all v5.x deployments to v5.3
46. Recent Transfer enhancements
• Verify transfers using client-provided checksums
– The user-provided checksum is used for verification instead of a checksum computed at the source
• Improvements for scaling the transfer service
– Multiple transfer-service nodes for higher availability and reliability
– Allows code updates with no downtime
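The client-provided checksum feature can be illustrated as follows. The dict fields are chosen to mirror the idea and are not necessarily the exact Transfer API field names; the hashing itself is ordinary `hashlib`.

```python
# Verifying a transfer against a client-provided checksum: the client
# computes a checksum before submission, and the service compares it with
# the checksum of the received bytes instead of re-reading the source.
# The dict below mirrors the concept, not the exact Transfer API schema.
import hashlib

def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

payload = b"experiment run 42: detector frames"

# Client side: attach a precomputed checksum to the transfer item.
transfer_item = {
    "source_path": "/data/run42.dat",          # illustrative paths
    "destination_path": "/archive/run42.dat",
    "checksum_algorithm": "md5",
    "external_checksum": md5_of(payload),
}

# Destination side: recompute over the received bytes and compare.
received = payload
print(md5_of(received) == transfer_item["external_checksum"])   # True

# A corrupted transfer fails verification.
corrupted = payload[:-1] + b"X"
print(md5_of(corrupted) == transfer_item["external_checksum"])  # False
```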
47. SSH with OAuth
• Securely access resources using SSH with a federated identity
– Facilitates automation and eliminates SSH key management
– Replacement for the deprecated GSI OpenSSH
• First version released
– Server-side PAM module with Globus Auth support
– Command line client
• Open source, community supported
– Not part of the standard subscription
– OAuth SSH client: https://pypi.org/project/oauth-ssh/
– OAuth SSH server PAM module: https://github.com/xsede/oauth-ssh
50. Globus Transfer: A complete solution
☑ Bulk transfer and sync
☑ Good end-to-end performance in a myriad of real-world settings
☑ End-to-end reliability
☑ Robust security, with federated identities
☑ Layers onto diverse storage systems
☑ Web-compatible client/server remote access
☑ Easy-to-use interfaces
☑ Easy installation and administration
☑ Sharing data with guest users
☑ Dedicated professional support
51. HTTPS and what it enables
• Browser-based upload and download
• Allow your (research) storage to be “on the web”
• Enforce the same security policies
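A rough sketch of what "storage on the web" means in practice: a file on an HTTPS-enabled collection is addressable by URL and fetched with an ordinary authorized GET. The hostname, path, and token below are placeholders, and no network request is made.

```python
# With HTTPS access, a file on a Globus collection can be fetched like any
# web resource: an HTTPS GET carrying an access token. This sketch only
# builds the request (no network call); the hostname and path are
# illustrative placeholders.
from urllib.parse import quote

def build_download_request(collection_host, path, access_token):
    """Return (url, headers) for an authorized HTTPS download."""
    url = f"https://{collection_host}{quote(path)}"
    headers = {"Authorization": f"Bearer {access_token}"}
    return url, headers

url, headers = build_download_request(
    "g-xxxxxx.data.globus.org",     # placeholder collection hostname
    "/shared/results/run 42.csv",
    "TOKEN",                        # placeholder Globus Auth access token
)
print(url)      # https://g-xxxxxx.data.globus.org/shared/results/run%2042.csv
print(headers)  # {'Authorization': 'Bearer TOKEN'}
```

Because the security policies are the same ones enforced on the collection, a link like this can be handed to any HTTPS client that can present a valid token.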
52. Globus Connect Server v5 Milestones
[Timeline: v5.0: Google Drive → v5.1: POSIX guest collections, HTTPS → v5.2: High assurance → v5.3 → v5.4: … → v5.x: v4 feature parity+]
Other features: multi-DTN support, additional storage systems, endpoint-specific identity providers, …
54. GCSv5: Key enabling technology for the future
• Challenge: managing an increasing amount of shared, dynamic state among multiple DTNs
– Endpoint configuration
– Multiple storage gateway configurations
– Collection configurations
– Credentials (user and system)
• Approach: stateless DTNs
– No persistent state on a DTN
– Multi-DTN endpoints without a shared file system
• GCS state stored in the cloud
– Dynamic sync of state to each DTN
– Enabled by our use of AWS AppSync
• Customer-managed encryption keys with optional escrow
– Only you can see and modify your endpoint’s state
• Facilitates creation of new Globus Connect features
55. GCSv5 has significant admin benefits
• Greatly simplified multi-DTN deployment
– Bootstrap a DTN from only the client ID & secret and the encryption key
– No more copy-pasting GCS config files with every change
– Command line, REST API, and (eventually) web admin of GCS
– Automatic synchronization among DTNs
• Rapid recovery from failures
– Restore all nodes from stored state with minimal effort
– No local backups of GCS state required
• Lost client ID/secret? Recover them from Auth.
• Enables us to roll out new features more quickly
56. What does it mean for you?
• No sudden moves!
• Ready for GCS v4 to v5 migration late this year
• Tools will be available for migration from GCS v4
• Comprehensive documentation
• Long migration period with parallel support of v5 & v4
• Only use GCS v5 today if you need its specific features; otherwise continue to use GCS v4
57. Planned Features for Globus Transfer
• S3 compatible HTTPS interface to GCSv5 storage
• Browser based up/downloaders
• Multiple checksum algorithm support
• Manifest support
• Automated recurring replication as a service
• …
58. Rethinking data publication
• Limited adoption
– Not easily customizable
• Maintenance Challenges
– Costly to maintain
– JRE licensing concerns
• Going forward
– Code will be open source
– Leverage platform
• Invest in higher priorities
59. Platform challenge
• Transform how research applications, services, and workflows are created, delivered, used, and sustained
– Scientific instrument data processing
– Repositories: make data more FAIR
– Science gateways
• Interoperable ecosystem
60. Globus platform services
• Identity and Access Management (IAM)
– Federated identity login, Groups, Attributes, Access Control
– Auth: OAuth authorization provider
• Connect
• Transfer
– Will become a family of services
• Execution
• Search, Identifiers
• Automation
– Queues, Events, Actions, Triggers
– Flows
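As context for "Auth: OAuth authorization provider": native applications commonly use the OAuth 2.0 authorization-code flow with PKCE (RFC 7636). The sketch below shows just the standard PKCE verifier/challenge step, which is generic OAuth machinery rather than anything Globus-specific; endpoint URLs and client registration are omitted.

```python
# Globus Auth is an OAuth 2.0 authorization provider; native apps commonly
# use the authorization-code flow with PKCE. This sketch shows the standard
# PKCE step (RFC 7636): generate a code verifier and its S256 challenge.
import base64
import hashlib
import secrets

def make_pkce_pair():
    """Return (code_verifier, code_challenge) per RFC 7636 S256."""
    verifier = base64.urlsafe_b64encode(
        secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

verifier, challenge = make_pkce_pair()
# The challenge goes in the authorization request; the verifier is sent
# later when exchanging the authorization code for tokens, proving that
# the same client started and finished the flow.
print(len(verifier), len(challenge))  # 43 43
```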
62. Platform status
• Generally Available in a few years
• Separate product with a separate sustainability model
• Early engagements help shape product direction
– Argonne Leadership Computing Facility, Materials Data Facility
– NCAR Research Data Archive, NSO, …
– Use in Globus products
• Multiple integrations facilitate a more complete solution
– e.g. Django, JupyterHub
– Follow progress: globus-integration-examples.readthedocs.io
• Currently accessible via the professional services team