Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support
1. Data Sets, Ensemble Cloud Computing, and the University Library: Getting the Most Out of Research Support
Jim Myers (1), Margaret Hedstrom (1), Beth A Plale (2), Praveen Kumar (3), Robert McDonald (4), Rob Kooper (5), Luigi Marini (5), Inna Kouper (4), Kavitha Chandrasekar (4)
myersjd@umich.edu
(1) School of Information, University of Michigan, Ann Arbor, MI, United States.
(2) School of Informatics and Computing, Indiana University, Bloomington, IN, United States.
(3) Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States.
(4) Data To Insight Center, Indiana University, Bloomington, IN, United States.
(5) National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.
2. Overview
• Technological advances are making it ever easier to
move computation, data, and metadata around
• With decreasing costs and increasing recognition of the
value of data re-use, many organizations are exploring
their role in data curation/preservation
• If we look at the nature of the problem
– How should data be curated to scalably support research?
• Lifecycle approaches to manage value-defined research objects
– Can we do it?
• SEAD as an end-to-end demonstration…
– Which organization(s) are best positioned and most capable
of leading and providing such services long-term?
• Primary research organizations have a combination of capability,
motivation, and long-term commitment.
3. Technology – the world is flat
• Today’s researchers can employ computing
and data resources from anywhere, using
scalable search technologies …
Enough said.
4. Data as a key resource, Big Data
• Data is increasingly recognized as valuable
beyond its initial use:
– Data reproducibility
– Re-analysis
– Reference Data
– Data mining/machine learning/…
– NSF Data plan requirement
– Paper publication with data requirements
– Community and institutional collections growing
5. Data Publication today
• Data cited in papers (to limited depth)
• Project file archives (large, limited description,
gray/dark)
• Reference/analytical data (standardized content,
limited breadth)
• Historical collections (temporal breadth, limited
numbers)
– Do any of these solve the problem?
6. Researchers think, and work, like this:
• Multi
– Disciplinary
– Format
– Model
– Semantics
– Location
7. and this
– Raw and derived data
– ~5 levels of quality, processing, maturity
– Observations, calibrations, experiments, models, statistical ensembles, …
– Also organized by location, time, variables, technique, creator, project, provenance, …
– Large amount of reference information from external sources (e.g. NASA)
– Evidence for ‘nonorthogonal’ subcollections
8. What’s Really Needed?
Scalable Research Productivity Requires:
• A way to
– Store what you want
– Reference what you want
– Organize how you want (search, filter, tag, collect)
• At the scale, and level of detail/richness, you want
• When you figure that out
• In a way that is self-describing/high-fidelity across
applications and owners
• In the vocabularies and formats you find efficient
• Beyond the lifetime of individual/project interest
• For active use and external credit
• With minimal training/IT support required.
9. How can we approach magic?
• Global identifiers – data, terms, metadata
• Content management abstractions (blob + type +
metadata)
• Service architectures and automated processing
(conversion, preview, extraction, derivation, cataloging,
…)
• Applications that share these abstractions – write what
you know, display/ignore what you don’t
• Research Object management (structured, interrelated collections)
Web 2.0, Web 3.0, + explicit context management …
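A minimal Python sketch of the "blob + type + metadata" abstraction and the "write what you know, display/ignore what you don't" rule. This is an illustration only, not SEAD's actual data model: the `ContentItem` class, the `annotate` and `display` helpers, the `urn:uuid:` identifier scheme, and the `ex:columns` term are all assumptions made for the example. The two Dublin Core term URIs are real vocabulary terms used here as sample global identifiers for metadata terms.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """Hypothetical 'blob + type + metadata' record (not SEAD's schema)."""
    blob: bytes
    media_type: str
    metadata: dict = field(default_factory=dict)
    # Global identifier; a URN-style UUID is an assumption for illustration.
    identifier: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")


def annotate(item, known_terms):
    """An application writes only the terms it knows about."""
    item.metadata.update(known_terms)


def display(item, understood):
    """An application shows the terms it understands and ignores the rest.

    Ignored terms stay untouched on the item for other applications.
    """
    return {k: v for k, v in item.metadata.items() if k in understood}


item = ContentItem(blob=b"depth,temp\n1.0,12.5\n", media_type="text/csv")

# A hypothetical extraction service adds what it can derive from the file.
annotate(item, {
    "http://purl.org/dc/terms/format": "text/csv",
    "ex:columns": ["depth", "temp"],  # made-up term for this sketch
})

# A hypothetical curation tool adds a creator without touching other terms.
annotate(item, {"http://purl.org/dc/terms/creator": "A. Researcher"})

# A viewer that only understands 'creator' displays that and ignores the rest.
viewer_view = display(item, understood={"http://purl.org/dc/terms/creator"})
```

Because every application reads and writes against the same identifier-keyed record, a new tool can be added without coordinating a shared schema change; it simply contributes the terms it knows.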
10. SEAD: Sustainable Environment Actionable Data
• An NSF DataNet project started in
October, 2011
• An international resource for
sustainability science
• A provider of light-weight Data Services
based on novel technical and business
approaches:
– Supporting the long-tail of research
– Enabling active and social curation
– Providing integrated lifecycle support for data
http://sead-data.net/
Margaret Hedstrom, PI
Praveen Kumar, co-PI
Jim Myers, co-PI
Beth Plale, co-PI
11. SEAD is:
• Data discovery
• Project workspaces
• A data-aware
community network
• Curation and
preservation services
that link to multiple archives and discovery
services
12. SEAD is:
• An active repository that creates data pages with
– Previews
– Extracted Metadata
– Overlays
– Tags
– Comments
– Provenance
– Use information
– Download/Embed
13. SEAD is:
• A tool for community exploration:
– Personal and
Project Profiles
– Publications and
Data Citations
– Co-author,
co-investigator
graphs
– Temporal analysis
14. SEAD is:
• Curation and Preservation Services:
– Research Object
management
– ID assignment
– Matchmaking to long-term repositories
– Citation Generation
– Catalog Registration
– Discovery services
SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long-term storage in SEAD-managed storage or external institutional repositories and cloud data services.
15.
– Apps read what they need and write what they know
– Curation snapshots meaningful Research Objects
– Multiple ROs can be defined/managed re-using the same underlying ‘living’ content
– The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level
[Figure: diagram with component labels: Flickr-style web management of data; Sensor data; Semantic Content Middleware over Scalable File System and Triple Store; Geospatial, social network mash-ups, workflows and services; Curation Services to harvest and package specific data sets; Federation of OAI repositories for long-term preservation]
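The Research Object ideas on this slide (snapshots taken over shared ‘living’ content, with several ROs re-using the same items) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how SEAD implements curation: the `ResearchObject` class, the `snapshot` helper, the `living` store, and the `id:N` identifiers are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchObject:
    """Hypothetical snapshot: a named, immutable set of content identifiers."""
    name: str
    members: frozenset


# 'Living' content keyed by identifier; values stand in for files + metadata.
living = {
    "id:1": "raw sensor file",
    "id:2": "calibration table",
    "id:3": "model output",
}


def snapshot(name, ids):
    """Freeze the requested membership; unknown identifiers are rejected."""
    missing = set(ids) - set(living)
    if missing:
        raise KeyError(f"unknown identifiers: {sorted(missing)}")
    return ResearchObject(name, frozenset(ids))


paper_ro = snapshot("paper-2013", ["id:1", "id:2"])
model_ro = snapshot("model-run-7", ["id:2", "id:3"])

# The two ROs overlap: they re-use the same underlying item.
shared = paper_ro.members & model_ro.members

# The larger graph can be approximately reassembled from the snapshots.
reassembled = paper_ro.members | model_ro.members

# New living content added later does not alter an existing snapshot.
living["id:4"] = "derived plot"
```

Here the ROs hold only identifiers, so defining another RO costs a set of references, not a copy of the data; the overlap in `shared` is the ‘nonorthogonal’ case the slides describe.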
16. Key Points
• Research Objects have meaning/value but data comes in
smaller chunks
• Research Objects are not orthogonal, but individual data
sets/files are
• Lifecycle approaches for datasets are becoming possible
• Managing intermixed ROs is the problem that needs to be
tackled to meet the research community’s needs
• Research Data Alliance (RDA) can help drive
standardization/scaling
17. What will drive research data preservation?
• The most valuable data service(s) are
active/actionable research service(s)…
– The ability to define Research Objects is more
important than any given RO
• Led by research organizations as part of their
long-term mission?
– The only organizations with the focus, scope, and
scale to solve the whole problem (end-to-end
research productivity)
18. Acknowledgements
• SEAD Team @ UM, UI, IU
• NSF
• NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other
sustainability researchers
• and Thank You!
… stop by the SEAD booth and share your thoughts!
http://sead-data.net/