1. Delivering a Campus Research Data
Service with Globus
MAGIC Meeting
Ian Foster
May 7, 2014
2. Give me your data,
your terabytes,
Your huddled files
yearning to
breathe free …
Building campus research
data services
3. “It’s deja vu all over again.”
Yogi Berra
Globus Toolkit
Globus Online
Globus
Globus
4. What is Globus (today)?
Big data transfer
and sharing…
…simply, securely, and fast…
…directly from your own
storage systems
5. Reliable, secure, high-performance
file transfer and synchronization
• “Fire-and-forget”
transfers
• Automatic fault
recovery
• Seamless security
integration
• Powerful GUI
and APIs
Data Source
Data Destination
(1) User initiates transfer request
(2) Globus moves and syncs files
(3) Globus notifies user
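The three steps above (request, move and sync with automatic fault recovery, notify) can be illustrated with a minimal sketch. This is not the Globus API: FlakyLink, sync, and the in-memory dictionaries standing in for storage systems are hypothetical names chosen for illustration, and the retry policy is an assumption.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    """Content hash used to decide whether a file is already in sync."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class FlakyLink:
    """Hypothetical network link that fails its first `failures` copy attempts."""
    failures: int
    attempts: int = 0

    def copy(self, data: bytes) -> bytes:
        self.attempts += 1
        if self.attempts <= self.failures:
            raise ConnectionError("simulated network fault")
        return data


def sync(source: dict, destination: dict, link: FlakyLink, max_retries: int = 5) -> list:
    """Copy files that are missing or differ at the destination, retrying on faults.

    Returns the notification messages that would be sent to the user.
    """
    events = []
    for name, data in sorted(source.items()):
        if name in destination and checksum(destination[name]) == checksum(data):
            continue  # already in sync, nothing to move
        for attempt in range(1, max_retries + 1):
            try:
                destination[name] = link.copy(data)
                events.append(f"{name}: transferred after {attempt} attempt(s)")
                break
            except ConnectionError:
                continue  # automatic fault recovery: try again
        else:
            events.append(f"{name}: failed after {max_retries} attempts")
    return events


source = {"a.dat": b"1", "b.dat": b"2"}
destination = {"a.dat": b"1"}  # a.dat is already present and identical
link = FlakyLink(failures=2)
events = sync(source, destination, link)
print(events)
```

The user only calls sync once; the retries happen inside the loop, which is the "fire-and-forget" behavior the slide describes.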
6. Simple, secure sharing off existing
storage systems
Data Source
(1) User A selects file(s) to share, selects user or group, and sets permissions
(2) Globus tracks shared files; no need to move files to cloud storage!
(3) User B logs in to Globus and accesses shared file
• Easily share large data
with any user or group
• No cloud storage
required
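The sharing model above (permissions recorded per user or group, data left in place) can be sketched as follows. SharedEndpoint and its methods are hypothetical and only illustrate the idea; they are not the Globus sharing interface, and the "r"/"rw" permission strings are an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SharedEndpoint:
    """Hypothetical record of what is shared; the files themselves never move."""
    owner: str
    groups: dict = field(default_factory=dict)  # group name -> set of users
    rules: list = field(default_factory=list)   # (path prefix, principal, permission)

    def share(self, path, principal, permission="r"):
        """Grant a user or group "r" or "rw" access to everything under `path`."""
        if permission not in ("r", "rw"):
            raise ValueError("permission must be 'r' or 'rw'")
        self.rules.append((path.rstrip("/") + "/", principal, permission))

    def can(self, user, path, action):
        """Return True if `user` may perform `action` ("r" or "w") on `path`."""
        if user == self.owner:
            return True
        target = path.rstrip("/") + "/"
        for prefix, principal, permission in self.rules:
            is_member = principal == user or user in self.groups.get(principal, set())
            if is_member and target.startswith(prefix) and action in permission:
                return True
        return False


endpoint = SharedEndpoint(owner="userA", groups={"lab": {"userB"}})
endpoint.share("/data/run1", "lab", "r")
print(endpoint.can("userB", "/data/run1/out.h5", "r"))
```

Only the access rules are stored by the service in this sketch, which mirrors the slide's point that no copy to cloud storage is needed.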
11. Globus is enabling…
Study of the structure
and evolution of
galaxies, the nature
of dark energy, and
cosmological history
of the universe
Sloan Digital Sky Survey
Source: University of Utah
Joel Brownstein
University of Utah
12. Globus is enabling…
Development
of numerical
simulations of
severe storms
for improved
responsiveness
to weather
events
Weather Research and Forecasting Model
Source: UCAR
Ann Syrowski
University of Illinois
13. Globus is enabling…
Pediatric brain
research by
enhancing
analysis of
genetic material
in pursuit of the
underlying
cause
Communication impairment by genetic variants
Source: Wikimedia Commons
William Dobyns
U. Washington
14. Globus increasingly used to build
campus-wide data service
Source: University of Nebraska
Holland Computing Center
Enable campus computing
facilities to better utilize
high performance network
infrastructure
15. Typical deployment
Science
DMZ
+
Globus
[Network diagram: UNL Science DMZ. Labeled elements: Holland Computing Center; Lincoln Core Router; Lincoln Core Switch (CMS and HPC clusters); Omaha Core; Omaha HPC clusters; 100 Gigabit capable West Campus and East Campus border routers; 10x CMS data transfer nodes; DYNES equipment; perfSONAR and Bro IDS (including additions); Center for Brain Imaging and Behavior; East/West campus networks (firewalls + IDS); campus network researchers; WDM; Internet2 via GPN and Internet2 via CIC (composite traffic, 100 Gigabit); future redundant I2 path (2015+). Link labels: 10 Gigabit, 2x 10 Gigabit, 4x 10 Gigabit, 10x 10 Gigabit, 100 Gigabit.]
Source:
University of Nebraska
Holland Computing Center
16. Instruments are increasingly driving the
need for broader data service deployments
Next Gen
Sequencer
Light Sheet Microscope
MRI Advanced
Light Source
17. Globus enables users to manage data as
research requirements scale up or down
Research Computing HPC Cluster
Lab Server
Campus Home Filesystem
Desktop Workstation
Personal Laptop
XSEDE Resource
Public Cloud
27. Globus Provider Subscriptions
• Managed Endpoints
– Priority support
– Management console
– Usage reports
– Mass Storage System optimization
– Host shared endpoints
– Integration support
• Plus Subscriptions
– Create and manage shared endpoints
– Personal transfers
• Branded Web Site
• Alternate Identity Provider (InCommon is standard)
https://www.globus.org/provider-plans
28. NET+ Globus
• Internet2 members get discounted
Globus Provider subscriptions
• Completing “Service Validation” phase
– Sponsors:
Cornell, U.Michigan, Yale, U.Missouri, and
U.Chicago
• Available to “Early Adopters” soon
29. Bridging the gap to sustainability
• $500,000 from Sloan Foundation
• Recognition of what it takes to
“cross the chasm”
• Funds non-R&D
activities
– User Support
– Operations
– Marketing
30. Globus Behind the Scenes
Identity, Group, Profile
Management Services
…
Sharing Service
Transfer Service
Globus Toolkit
GlobusConnect
38. Campus Data Service User Stories
• “I need a good place to store / backup / archive
my (big) research data, at a reasonable price.”
• “I need to easily, quickly, and reliably move or
mirror portions of my data to other places.”
• “I need a way to easily and securely share my
data with my colleagues at other institutions.”
39. Campus Data Service User Stories
• “I need a good place to store / backup / archive
my (big) research data, at a reasonable price.”
• “I need to easily, quickly, and reliably move or
mirror portions of my data to other places.”
• “I need a way to easily and securely share my
data with my colleagues at other institutions.”
• “I want to publish my data.”
• “I want to discover published data.”
51. Recap: Globus Data Publication
• SaaS for publishing large research data
• Bring your own storage
• Extensible metadata
• Publication and curation workflows
• Public and restricted collections
• Rich discovery model
60. Looking for 3-5 early adopters
Summer:
Use and
provide
feedback
on alpha
Fall:
Test beta on
your campus
Winter:
Celebrate
General
Availability
Spring:
Tell us about it
at GlobusWorld
2015!
61. Thank you to our sponsors!
U.S. Department of Energy
Editor's Notes
Review what the Globus team has done over the past year. Announce an exciting new capability.
Joel Brownstein is the data archivist of the Sloan Digital Sky Survey-IV. Transfers daily telescope observations to the University of Utah. There they have a large cluster to run their various data reduction pipelines. Using the Globus command-line interface within their Python API. Joel has moved more than 70 TB of data so far.
Ann develops numerical simulations of severe storms using the Weather Research and Forecasting (WRF) model. Uses several HPC facilities throughout the country. Moved more than 100 TB of data using Globus, 50 TB last January alone! Moves data between various XSEDE resources, NCSA's mass storage system, and PSC's data archiver.
Collects tissue samples from young patients and their families, then extracts, sequences, and analyzes the genetic material to understand the underlying cause of disease. Uses Globus to move NGS data to and from public clouds where he runs analysis pipelines. More on Bill’s work later in this talk (under Globus Genomics).
Can use standard tools such as apt and yum to deploy. Uses configuration file. Allows incremental config changes. Multiple I/O nodes. ID node (MyProxy). Web node (OAuth).
Allows site administrators to monitor traffic to/from their site. Ultimately will allow for control.
Geoffrey Moore
Highlight CI Connect. Highlight XSEDE’s planned adoption of user, group and profile management.
Highlight CI Connect; coming up in Rob Gardner’s talk. Highlight XSEDE’s planned adoption of user, group and profile management.
Competitive TCO. Alternatives are campus computing cores and commercial sequence analysis services.
Collection is a set of Datasets. Dataset is data + metadata. Collection is within a Community. Policies on a Collection: metadata, access control, curation workflow, license, storage.
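The relationships in this note can be written down as a small data model. The sketch below is one possible reading, not the actual Globus publication schema; the class names, field names, and the example values ("example#storage", "example-license") are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    """A dataset is data (files) plus metadata."""
    files: list
    metadata: dict


@dataclass
class Policy:
    """Per-collection policies, following the list in the note above."""
    required_metadata: tuple  # metadata fields every dataset must supply
    readers: Optional[set]    # access control; None means public
    needs_curation: bool      # curation workflow
    license: str
    storage: str              # where the collection's data lives


@dataclass
class Collection:
    """A collection is a set of datasets and sits within a community."""
    name: str
    community: str
    policy: Policy
    datasets: list = field(default_factory=list)

    def missing_metadata(self, dataset):
        """List required metadata fields the dataset has not filled in."""
        return [k for k in self.policy.required_metadata if not dataset.metadata.get(k)]

    def can_read(self, user):
        return self.policy.readers is None or user in self.policy.readers


policy = Policy(
    required_metadata=("title", "authors"),
    readers={"chard"},
    needs_curation=True,
    license="example-license",   # hypothetical value
    storage="example#storage",   # hypothetical endpoint name
)
cnm = Collection(name="Center for Nanoscale Materials", community="Argonne", policy=policy)
draft = Dataset(files=["scan_001.dat"], metadata={"title": "Nanoscale materials data"})
print(cnm.missing_metadata(draft))
```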
Demo scenario: A scientist, referred to throughout as “the Scientist” and associated with the user Blaiszik, has just published a paper associated with his research on nanoscale materials. He now wants to publish the data associated with this publication. Using the Globus publication system, he selects the Argonne community and the Center for Nanoscale Materials (CNM) collection, and chooses to publish his data. He describes the submission with both publication (Dublin Core) and scientific metadata. The CNM collection has been preconfigured with its own storage provided at Argonne. As part of this submission, a unique endpoint is created for “the Scientist”; the endpoint is created so that only “the Scientist” can write to it. “The Scientist” assembles his dataset on this endpoint by transferring files from one or more locations. He can assemble this dataset over a long period of time and can return to the submission workflow when he is happy with the submission. The CNM collection has also been preconfigured with a workflow requiring that an Argonne curator approve the submission. A curator, referred to throughout as “the Curator” and associated with the user Chard, is able to view and edit the metadata and files of the dataset. Once approved, the submission is published in the CNM collection with a DOI. Other users (with permission to view the collection) can then discover published datasets by their DOI or by using the Globus discovery interface to find datasets by their metadata. These users can choose to browse published datasets and download datasets to other resources (including local resources).
Users can log in using any of their linked Globus identities, e.g., campus credentials (via InCommon), Google account, XSEDE account, etc.
The first step of submission is to select a collection. In this case "The Scientist" selects the “Center for Nanoscale Materials”, as this is the department through which he conducted his research. Note: "The Scientist" can only see collections he is allowed to publish to.
"The Scientist" must first describe the dataset he is publishing. There are two types of metadata required for submission to the CNM collection: 1) Dublin core and 2) scientific metadata. These metadata requirements are defined by the collection and can be configured depending on the domain. Additional pages can also be defined. Here, "The Scientist" enters information about the Authors, their ORCID (a unique researcher identity), the submission title, the date of publication, the accompanying publication to which this dataset is related, and the DOI for that publication. Note: "The Scientist" has missed an ORCID for one of his co-authors.
Using the familiar Globus interface, "The Scientist" is able to select files from multiple sources and transfer them to his unique submission endpoint (publish#submission_11). This submission endpoint is created on shared Argonne storage resources, but is initially accessible only to "The Scientist". The dataset may be assembled over any period of time. "The Scientist" can create new files and folders on the endpoint and can arrange these files in any hierarchy. At the completion of the submission, the permissions on the endpoint will be changed such that the dataset is immutable. "The Scientist" will be given read access to the dataset; collection curators will also be given read access to the data so that they can view the contents.
Having verified the submission, "The Scientist" must grant the submission license. This license is again configured by the collection (i.e. each collection can customize their individual licenses), and allows the submitting user to grant rights to the collection (CNM) and the Globus system to manage and disseminate the dataset based on the agreed upon policies.
The Argonne CNM collection has defined a workflow that requires a curator to view and approve all submissions. The curation workflow enables the curator to view the submitted files and to edit the submitted metadata.
At this point, the dataset is now published in the collection with a unique DOI (handle in this case) for other researchers to reference this published dataset. Access to the dataset (both metadata and files) is changed to reflect the policies of the collection. Access may be restricted to particular users, or groups of users, or it may be made public for any user to access.
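The submission steps described in these notes (assemble, grant license, curate, publish) can be sketched as a small state machine. This is an illustrative model only: the Submission class, its state names, and the placeholder identifier "hdl:example/1" are hypothetical, and no real identifier is minted.

```python
from enum import Enum


class State(Enum):
    ASSEMBLING = 1
    IN_CURATION = 2
    PUBLISHED = 3


class Submission:
    def __init__(self, submitter, curators):
        self.submitter = submitter
        self.curators = set(curators)
        self.state = State.ASSEMBLING
        self.files = []
        self.metadata = {}
        self.identifier = None

    def add_file(self, user, name):
        # Only the submitter may write, and only while the dataset is being assembled.
        if user != self.submitter or self.state is not State.ASSEMBLING:
            raise PermissionError("dataset is not writable by this user")
        self.files.append(name)

    def submit(self, license_granted):
        if self.state is not State.ASSEMBLING:
            raise RuntimeError("already submitted")
        if not license_granted:
            raise ValueError("the collection license must be granted first")
        self.state = State.IN_CURATION  # files are immutable from here on

    def edit_metadata(self, user, key, value):
        if user not in self.curators or self.state is not State.IN_CURATION:
            raise PermissionError("only a curator may edit metadata during curation")
        self.metadata[key] = value

    def approve(self, user, identifier):
        if user not in self.curators or self.state is not State.IN_CURATION:
            raise PermissionError("only a curator may approve a pending submission")
        self.identifier = identifier
        self.state = State.PUBLISHED


sub = Submission(submitter="scientist", curators={"curator"})
sub.add_file("scientist", "results.csv")
sub.submit(license_granted=True)
sub.edit_metadata("curator", "title", "Corrected title")
sub.approve("curator", "hdl:example/1")  # placeholder identifier
print(sub.state, sub.identifier)
```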
“The Researcher” chooses to search for all published data in the CNM collection. The results show a brief summary of each published dataset including information about the publication time, collection, summary of number of files, name, authors, description and a set of keyword tags as well as key-value tags. Each of these fields can be used to search for a particular dataset.
Knowing that other collections may well have datasets of interest, “The Researcher” may broaden the search context to all accessible collections and search for datasets related to “Li-ion” and “autonomic”. Here, the results show datasets from 2 collections: the CNM and the Chemical Sciences and Engineering collection (red boxes). Results are ranked according to their relevance to the search.
Going further, “The Researcher” can use different queries such as key-value and ranges. In this case, “The Researcher” searches for energy density > 1500 and microcapsules, and finds the dataset previously published in this demo with an associated key-value pair of energy-density:2000 that fits the range query criteria.
Having found the desired published dataset, “The Researcher” can navigate to the summary page.
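A combined keyword and range query like the one in this note can be sketched as below. The dataset records and the search function are hypothetical and held in memory; the sketch filters only and does not attempt the relevance ranking mentioned earlier.

```python
import operator

# Comparison operators accepted in a range condition.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


def matches(dataset, keywords=(), ranges=()):
    """True if every keyword appears in the title or tags and every range holds."""
    text = " ".join([dataset["title"], *dataset["tags"]]).lower()
    if not all(keyword.lower() in text for keyword in keywords):
        return False
    for key, op, bound in ranges:
        value = dataset["key_values"].get(key)
        if value is None or not OPS[op](value, bound):
            return False
    return True


def search(datasets, keywords=(), ranges=()):
    return [d["title"] for d in datasets if matches(d, keywords, ranges)]


# Hypothetical published datasets.
datasets = [
    {"title": "Microcapsules for Li-ion batteries", "tags": ["autonomic"],
     "key_values": {"energy-density": 2000}},
    {"title": "Baseline cells", "tags": ["microcapsules"],
     "key_values": {"energy-density": 1200}},
]

hits = search(datasets, keywords=["microcapsules"],
              ranges=[("energy-density", ">", 1500)])
print(hits)
```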
The summary page shows a summary of the dataset and the list of files. “The Researcher” can choose to download individual files, browse the dataset using Globus, or download the entire dataset. Ability to view the dataset and download files is governed by the access control on the collection and permissions associated with “The Researcher”.
Finally, “The Researcher” can view the downloaded dataset on their desktop PC.