A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
7. Cloud services make it easy to automate and outsource
many routine tasks associated with commodity data
8. And research!
To learn more:
cloud4scieng.org
Cloud services make it easy to automate and outsource
many routine tasks associated with commodity data
9.
• Auth: Manage identities, authentication, and authorization
• Transfer: Manage movement
• Sharing: Manage who can access
• Publish: Preserve, identify, describe, curate
• Search: Index and search
• Identify: Assign identifiers
• Process: Run processing pipelines
• Learn: Discover, train, run machine learning models
What tasks can we automate and outsource in the case
of research data?
Let’s give these a try …
11. National Center for Atmospheric Research (NCAR)
Research Data Archive (RDA)
700 datasets, of total size ~2 petabytes
In 2017, >12,000 unique users retrieved ~2 petabytes
12. Globus platform allows NCAR RDA users to:
• Sign on with institutional credentials
• Manage a personalized data space
• Transfer data easily, rapidly, reliably, securely
Streamlined data access = reduced data friction = greater data reuse
14. The Advanced Photon Source
34 sectors
Dozens of beamlines generate from GBs to 10s of TB per day
Data still commonly transported via portable media
Upgrade will scale by 10 to 1000 times
5,000 annual users
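To put the data rates on this slide in network terms, here is a rough back-of-the-envelope sketch. It assumes that "10s of TB per day" means 10 TB/day for a single busy beamline and uses decimal units; both the baseline and the function name are illustrative assumptions, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_gbps(terabytes_per_day: float) -> float:
    """Average network rate (Gbps) needed to move a daily volume as it is produced."""
    bits_per_day = terabytes_per_day * 1e12 * 8  # decimal TB -> bits
    return bits_per_day / SECONDS_PER_DAY / 1e9


# Assumption: 10 TB/day stands in for "10s of TB per day".
baseline_tb_per_day = 10.0

# The slide gives an upgrade factor of 10 to 1000; 1 is today's baseline.
for factor in (1, 10, 1000):
    rate = sustained_gbps(baseline_tb_per_day * factor)
    print(f"x{factor:>4}: {rate:,.1f} Gbps sustained")
```

Under these assumptions the upper end of the upgrade range needs several hundred Gbps of sustained throughput, which is far beyond what portable media can deliver.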
15. Automate and outsource:
Capture, publication, analysis
Move to permanent location
Extract and record metadata
Assign persistent identifier
Index for discovery
Associate with machine learning models
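The first four steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration only, not the Globus or Materials Data Facility implementation: the `publish` function, the local archive directory, the UUID-based identifier, and the in-memory index are all hypothetical stand-ins, and the machine learning step is left out.

```python
import hashlib
import shutil
import tempfile
import uuid
from pathlib import Path


def publish(source: Path, archive_dir: Path, index: dict) -> dict:
    """Run the capture-to-index steps for one file and return its record."""
    # 1. Move to permanent location. A local directory stands in for an
    #    archive, and the file is copied so the sketch is non-destructive.
    archive_dir.mkdir(parents=True, exist_ok=True)
    stored = Path(shutil.copy2(source, archive_dir / source.name))

    # 2. Extract and record metadata.
    metadata = {
        "name": stored.name,
        "bytes": stored.stat().st_size,
        "sha256": hashlib.sha256(stored.read_bytes()).hexdigest(),
    }

    # 3. Assign a persistent identifier (a random UUID stands in for a DOI).
    identifier = f"urn:uuid:{uuid.uuid4()}"

    # 4. Index for discovery: map each underscore-separated word of the
    #    file name's stem to the identifiers that contain it.
    record = {"id": identifier, "path": str(stored), **metadata}
    for token in stored.stem.lower().split("_"):
        index.setdefault(token, []).append(identifier)
    return record


# Usage: publish one made-up file into a temporary archive.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "scan_042.dat"
    sample.write_bytes(b"example detector frame")
    index = {}
    record = publish(sample, Path(tmp) / "archive", index)
    print(record["id"], record["bytes"], sorted(index))
```

The point of the sketch is that each step is mechanical once the file exists, which is what makes the whole chain a candidate for automation and outsourcing.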
Data Publication
Indexing
materialsdatafacility.org
2 petabytes
100 Gbps
Globus APIs
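As a rough sense of scale for the two figures above, the following sketch estimates how long it would take to move 2 petabytes over a 100 Gbps link. It assumes the figures describe a storage volume and a network connection, uses decimal units, and treats the `efficiency` parameter as a hypothetical knob for real-world overhead.

```python
def transfer_hours(petabytes: float, gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move a volume over a link at a given fraction of line rate."""
    bits = petabytes * 1e15 * 8  # decimal PB -> bits
    seconds = bits / (gbps * 1e9 * efficiency)
    return seconds / 3600


print(f"{transfer_hours(2, 100):.1f} hours at full line rate")
print(f"{transfer_hours(2, 100, efficiency=0.5):.1f} hours at 50% of line rate")
```

Even at full line rate the transfer takes nearly two days, so reliable, unattended transfer matters more than raw speed alone.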
16.
Programmatic access (Python, Jupyter)
Web browse and search
Data Publication
Indexing
materialsdatafacility.org
2 petabytes
100 Gbps
Globus APIs
Automate and outsource:
Capture, publication, analysis
17. “Friction and wear pose enormous challenges to 21st
Century mechanical and electromechanical systems.
By one estimate, 6% of the gross national product is
wasted through its impact.”*
The costs of data friction are surely as great.
Tribology was established as a discipline in the 1960s.
We now need a new discipline of data tribology.
* https://www.machinedesign.com/materials/investing-knowledge-friction-wear-and-lubrication
18. Cloud automation is the WD-40 of data.
Globus demonstrates what is possible.
• Auth: Manage identities, authentication, authorization
• Transfer: Manage movement
• Sharing: Manage who can access
• Publish: Preserve, identify, describe, curate
• Search: Index and search
• Identify: Assign identifiers
• Process: Run processing pipelines
• Learn: Discover, train, run machine learning models
Established
20,000 endpoints
120,000 users
New
100s of users
Experimental
10s of users
Ian Foster — foster@anl.gov
Editor's Notes
A perspective that while is likely orthogonal
One of 112 boxes of photographs collected by Geoffrey Robinson,
Automate: In this case, capture of images, assignment of metadata, inference, training of inference models, data transfer, …
Outsource: certain tasks at least …