This document discusses the potential for "Science as a Service" by leveraging on-demand computing capabilities. It notes that most research labs have limited resources, so automation and outsourcing are needed to apply sophisticated methods to larger datasets. The author proposes building a "discovery cloud" by identifying time-consuming research activities that can be automated and delivered as software/platform/infrastructure as a service. This would help accelerate scientific discovery. Globus is highlighted as an example of a platform providing data management services using a software as a service model.
4. computationinstitute.org
But most labs have extremely
limited resources
Heidorn: of NSF grants in 2007, 80% of awards and 50% of grant $$ were < $350,000
[Chart: number of awards (2000 to 8000) vs. award size ($1,000 to $1,000,000, log scale)]
7. computationinstitute.org
Building a discovery cloud
• Identify time-consuming activities amenable
to automation and outsourcing
• Implement as high-quality, low-touch SaaS
• Leverage IaaS for reliability,
economies of scale
• Extract common elements as
research automation platform
Bonus question: Sustainability
Software as a service
Platform as a service
Infrastructure as a service
15. computationinstitute.org
[Diagram: sharing directly from the data source]
1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
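The three-step flow on this slide can be sketched as a minimal in-memory model (all class and method names here are illustrative, not the actual Globus Online API): the service records only an access rule pointing at the file's existing location, so nothing is copied to cloud storage.

```python
# Illustrative sketch of in-place sharing: the service stores only an
# access rule (user, path, permissions); the file never leaves its source.
class SharingService:
    def __init__(self):
        self.acl = []  # list of (user, path-prefix, permissions)

    def share(self, user, path, permissions="r"):
        """Steps 1-2: User A shares a path; the service tracks it, no data copied."""
        self.acl.append((user, path, permissions))

    def access(self, user, path):
        """Step 3: User B logs in and reads the shared file if permitted."""
        return any(u == user and path.startswith(p) and "r" in perms
                   for u, p, perms in self.acl)

svc = SharingService()
svc.share("userB", "/data/source/run42/", permissions="r")
assert svc.access("userB", "/data/source/run42/results.h5")
assert not svc.access("userC", "/data/source/run42/results.h5")
```

The key design point the slide makes is visible in the sketch: sharing adds a rule, not a replica.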
16. computationinstitute.org
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
34. computationinstitute.org
We are adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus
(Identity, Group, Profile)
Globus Online APIs
Globus Connect
36. computationinstitute.org
We are adding capabilities
• Ingest and publication
– Imagine a DropBox that not only replicates, but also extracts metadata, catalogs, converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
– Associate computational procedures, orchestrate application, catalog results, record provenance
37. computationinstitute.org
Looking deeply at how
researchers use data
• A single research question often requires the
integration of many data elements that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways
• Best grouping can vary during investigation
– Longitudinal, vertical, cross-cutting
• But always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …
38. computationinstitute.org
How do we manage data today?
• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory
naming conventions
– Capture status in README
files, spreadsheets, notebooks
• Time-consuming, complex, error prone
Why can’t we manage our data like
we manage our pictures and music?
40. computationinstitute.org
Introducing the dataset
• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and
describe usage
• Tag with characteristics that reflect content …
– Capture as much existing information as we can
• …or to reflect current status in investigation
– Stage of processing, provenance, validation, ..
• Share data sets for collaboration
– Control access to data and metadata
• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
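The idea of a dataset as a logical, location-independent group can be sketched as follows (the class and method names are hypothetical, chosen only to mirror the bullets above):

```python
# A dataset groups file references by use, not location, carries tags
# for content and status, and is operated on as a single unit.
class Dataset:
    def __init__(self, name):
        self.name = name
        self.members = []   # (location, path) pairs; files may live anywhere
        self.tags = {}      # name -> value: content or investigation status

    def add(self, location, path):
        self.members.append((location, path))

    def tag(self, name, value):
        self.tags[name] = value

    def apply(self, op):
        """Operate on the dataset as a unit: copy, export, archive, ..."""
        return [op(loc, path) for loc, path in self.members]

ds = Dataset("experiment-7")
ds.add("lab-cluster", "/scans/a.cdf")      # different locations,
ds.add("laptop", "C:/results/a.xlsx")      # different formats
ds.tag("status", "validated")
listed = ds.apply(lambda loc, path: f"{loc}:{path}")
assert listed == ["lab-cluster:/scans/a.cdf", "laptop:C:/results/a.xlsx"]
```

Because grouping is logical, the same files can be regrouped (longitudinally, vertically, cross-cutting) without moving any data.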
41. computationinstitute.org
Builds on catalog as a service
Approach:
• Hosted user-defined catalogs
• Based on tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs:
• /query/: retrieve subjects
• /tags/: create, delete, retrieve tags
• /tagdef/: create, delete, retrieve tag definitions
Builds on USC Tagfiler project (C. Kesselman et al.)
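The tag model maps naturally onto three operations over <subject, name, value> triples, one per REST API. A minimal in-memory sketch (an assumption-laden illustration, not the actual Tagfiler implementation):

```python
# Catalog of <subject, name, value> tags, mirroring the three REST APIs:
# /tagdef/ defines tags, /tags/ attaches them, /query/ retrieves subjects.
class Catalog:
    def __init__(self):
        self.tagdefs = {}   # tag name -> optional type constraint
        self.tags = []      # (subject, name, value) triples

    def tagdef(self, name, value_type=None):            # cf. POST /tagdef/
        self.tagdefs[name] = value_type

    def tag(self, subject, name, value):                # cf. POST /tags/
        if name not in self.tagdefs:
            raise KeyError(f"undefined tag: {name}")
        t = self.tagdefs[name]
        if t is not None and not isinstance(value, t):  # optional schema constraint
            raise TypeError(f"{name} expects {t.__name__}")
        self.tags.append((subject, name, value))

    def query(self, name, value):                       # cf. GET /query/
        return sorted({s for s, n, v in self.tags if n == name and v == value})

cat = Catalog()
cat.tagdef("stage", str)
cat.tag("scan-001.h5", "stage", "validated")
cat.tag("scan-002.h5", "stage", "raw")
assert cat.query("stage", "validated") == ["scan-001.h5"]
```

The optional type constraint shows why tag definitions are a separate API: catalogs can stay schema-free or enforce structure per tag.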
42. computationinstitute.org
Provide more capability for
more people at lower cost by
building a “Discovery Cloud”
Delivering “Science as a service”
Our vision for a 21st century
discovery infrastructure
43. computationinstitute.org
It’s a time of great opportunity … to
develop and apply Science aaS
Globus Nexus
(Identity, Group, Profile)
…
Sharing Service
Transfer Service
Dataset Services
Globus Toolkit
Globus Online APIs
Globus Connect
44. computationinstitute.org
Thanks to great colleagues
and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle
Chard, Raj Kettimuthu, Ravi Madduri, Tanu
Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob
Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others
at Argonne
80% of awards and 50% of grant $$ are < $350K
Concern:
Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data … and not just the 2 GB of PowerPoints, or the 100 GB of family photos and videos, but the petabytes and exabytes of data that will soon be the norm for many.
So how would such a drop box for science be used? Let's look at a very typical scientific data workflow. Data is generated by some instrument (a sequencer at JGI or a light source like APS/ALS). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user, so the data is typically moved from a staging area to some type of ingest store. Etcetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
Started with the seemingly simple/mundane task of transferring files … etc.
And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation
This image shows a 3D rendering of a Shewanella biofilm grown on a flat plastic substrate in a Constant Depth bioFilm Fermenter (CDFF). The image was generated using x-ray microtomography at the Advanced Photon Source, Argonne National Laboratory.
Assuming 1 core can handle all tasks, and a basis of 5 GB of memory per analysis.
Questions remain:
-- What capabilities? Where does time go?
-- How do we turn them into usable solutions?
-- How do we scale from thousands to millions?
-- How do we incentivize contributions? Long tail.