SEAD Datanet and Sustainability Science

1. NSF DataNet Overview
2. SEAD Overview
3. SEAD Active/Social Curation
4. SEAD Virtual Archive Repository
Robert H. McDonald
Deputy Director/Associate Dean
Data to Insight Center/IU Libraries
SC12 | Salt Lake City, UT

November 12, 2012

http://www.sead-data.net
@SEADdatanet

http://slidesha.re/TAk3ht @SEADdatanet

2 SEAD DataNet Home

SEAD TEAMS
Michigan: Margaret Hedstrom (PI), Marietta Van Buhler, Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR)

Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Robert Ping, Ryan Cobine

Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd

Illinois: Praveen Kumar (Co-PI), Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)

NSF DataNet Program

Motivation:
“… one of the major challenges of this scientific
generation: how to develop the new methods,
management structures and technologies to
manage the diversity, size, and complexity of
current and future data sets and data streams.”
Response:
DataNet creates “a set of exemplar national and
global data research infrastructure organizations” to
address this challenge.

Current NSF DataNet Projects

SEAD
• http://sead-data.net
DataONE
• http://www.dataone.org
DataNet Federation Consortium
• http://datafed.org
Terra Populus
• https://www.pop.umn.edu/terra_pop

SEAD’s Approach
SEAD Partners - http://sead-data.net
• Contribute infrastructure to the
DataNet vision that supports data
access, sharing, reuse, and
preservation for the long tail
• Develop a data access and
preservation environment that
supports the research, technical,
and economic requirements for
data management in the long tail
• Enable Active and Social Curation
• Utilize emerging preservation and access infrastructures

Long Tail Data Challenges
[Figure: data volume in bytes per day, with scale marks at gigabytes, terabytes, petabytes, and exabytes. Annotation: "Many smaller datasets…"]

CI for the Long Tail

What is the “long tail” of scientific research and
why does it matter?
• Diverse set of researchers, questions, data, and
methodologies, etc.
• Diverse set of requirements for instrumentation, data
collection, models, analysis, etc.
• Little standardization, no common denominator
• Most researchers and most research dollars go to
researchers in the long tail
• The long tail is underserved by current CI

Long Tail Example: Sustainability
Research
Many dimensions, many coordinate systems, many scales,
many data collection and analysis tools, many formats, a
long-tail of providers and users, …

SEAD 18-Month Pilot Phase
Domain Engagement:
• National Center for Earth-Surface Dynamics (NCED), Illinois River
Basin Observatory
• Requirements, Use Cases, Prioritization of Data Types and
Services
Active and Social Curation
• Pilot Active Content Repository, VIVO deployments
• Exemplar services for Data Ingest, Discovery, Re-use, Curation
(Tupelo/Medici)
CI for Long-term Access (Virtual Archive)
• Data model, protocol design/development
• Pilot Federated Repository infrastructure
Education, Outreach, and Training
• Post-doc mentoring
• Web site, training materials, meetings, workshops, …
Project Oversight
• Management, reporting, committees
• Business model development

NCED Collection Access
NCED collections in SEAD-ACR
• (20 Top-level Collections, 454K
files, 2.25M objects, 1.6 TB data)
• NCED Repository Interface
• Support for hierarchy
• Support for collection annotation
• View/add NCED/domain specific
Terms
• New Large Server with Virtual
Machine ACR instances
• Ingest tools and procedures
• csv2rdf4LOD
• Archiving, Citation, DOI
assignment, …
With an account, NCED users can go from a web page directly to previews and downloads (without a cart), add annotations, browse, and search by text (any field and content), tags, etc.


SEAD's Defined Data Phases
The phases of the data lifecycle acknowledge and accommodate the difference between public data and data a researcher is still working on.
Research Data Phase: the data set is a research data collection, owned by an individual and under their control.
• Data need not be licensed at this stage because it is not ready for broader release
• Data need not have permanent IDs because it is still a work in progress
• Corresponds to first existence in the Active Curation Repository
Published Phase: the owner of the research data collection determines that the data set is ready for publication.
• License terms set
• Persistent ID assigned
• Made available as part of public profile in VIVO
• Activated by user-controlled publish event
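
The two phases can be read as a small state model. The sketch below is only an illustration under assumptions: the class, field, and method names are hypothetical and are not taken from SEAD's software. It shows a collection that starts in the research phase with no license or persistent ID, and moves to the published phase only through an owner-triggered publish event.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"    # work in progress, under the owner's control
    PUBLISHED = "published"  # licensed, persistently identified, public


@dataclass
class DataCollection:
    """Hypothetical model of a collection in the two phases described above."""
    owner: str
    title: str
    phase: Phase = Phase.RESEARCH
    license: Optional[str] = None        # not required in the research phase
    persistent_id: Optional[str] = None  # not required in the research phase

    def publish(self, requested_by: str, license: str, persistent_id: str) -> None:
        """User-controlled publish event: only the owner may trigger it."""
        if requested_by != self.owner:
            raise PermissionError("only the owner can publish this collection")
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license or not persistent_id:
            raise ValueError("publication requires a license and a persistent ID")
        self.license = license
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED
```

Keeping the license and identifier optional until `publish` is called mirrors the point that neither is needed while the data is still in progress.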


SEAD Active/Social Curation
Repository


ACR Bulk Ingest Process
Configuration:
• Headers to Standard Vocabularies
• Content Mapping to identifiers
• Additional Inference possible

[Diagram labels: Metadata; Data; TWC: csv2rdf4lod .ttl output file; DROID Analysis; global ID, filepath, file; Extractors/; On/Off; ACR Ingest Preview; Incremental ingest, restart, verify; SEAD ACR Instance]
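
The configuration step (headers to standard vocabularies, content mapping to identifiers) can be illustrated with a short sketch. This is not csv2rdf4lod and does not reproduce its configuration format; it is a minimal stand-in using only the Python standard library. The header-to-predicate table, the base URI, and the function names are all assumptions made for the example.

```python
import csv
import io

# Assumed mapping from CSV headers to vocabulary terms (Dublin Core here).
HEADER_TO_PREDICATE = {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator",
    "date": "http://purl.org/dc/terms/date",
}

# Hypothetical base URI used to turn a row's ID into an identifier.
BASE = "http://example.org/dataset/"


def rows_to_triples(csv_text, id_column="id"):
    """Map each CSV row to (subject, predicate, literal) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + row[id_column]
        for header, value in row.items():
            predicate = HEADER_TO_PREDICATE.get(header)
            if predicate and value:  # skip unmapped headers and empty cells
                triples.append((subject, predicate, value))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples lines with plain string literals."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)
```

Columns without a mapping are ignored rather than guessed at, which keeps the output limited to terms the configuration explicitly names.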


SEAD/NCED Data Social Network


NCED Data Social Network in SEAD-VIVO
Mary Power, NCED PI and Professor, University of California
William Dietrich, NCED PI and Professor, University of California
Collin Bode, NCED Data Technician

NCED Social Network Connections Based on Data Authorship


Angelo Basic GIS Coverage Data Set


SEAD Data Set Publishing Workflow

1. NCED Data Set Ingested to ACR
• Data content used within ACR
• Researcher Profile Established in VIVO
2. NCED Data Set Ingested to VA
• Data Set ready to publish
3. NCED Data Set Deposited with IR
• DataCite minted DOI attached to finalized Data Set
4. NCED Data Set Published to VIVO
• DOI Resolution to designated IR


Published NCED Data Set in IR (IU ScholarWorks)


SEAD Virtual Archive


Virtual Archive Features
Usability consistent with research user expectations
• Additional metadata fields for scientific datasets
• Ability to ingest data with data preview
Repository tracking: tracking member Institutional Repositories (IRs) and their stored content
• Not just a link to a repository, but an extensive cataloging tool (metadata and other additional information)
• Allows users to search for data in a particular IR or across all IRs
Low-cost replication: cloud-based storage for reliability
• Proof of concept uses Amazon S3 to maintain a copy of files and collections. Amazon Glacier is low-cost, secure, and durable, and is optimized for cold storage. Other solutions exist.


Component Interactions: Virtual Archive and ACR

1. Data Set Uploaded to ACR
2. Data Set Ingested to Virtual Archive
3. Data Set Deposited with Institutional Repository
4. Data Set Published to VIVO


ACR – VA Interaction Protocol
[Sequence diagram; time runs top to bottom]
Participants: Researcher (ACR UI), Curator (VA UI), Active Curation Repository, Virtual Archive
Endpoints shown: (SPARQL) Query Endpoint, SWORD Endpoint
Messages, in order:
• Researcher: Mark Data For Publication (and Accept Licensing Terms)
• Curator: Request for Preview
• (SPARQL) Query Metadata / Return Metadata
• Curator Preview
• Ingest Data To VA
• User Queries VA for DOI
• Metadata update and View DOI Metadata
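
As an illustration of the "(SPARQL) Query Metadata" step, the sketch below builds a query that asks for every property of one data set and encodes it as a GET request in the style of the SPARQL 1.1 Protocol (a `query` URL parameter). The endpoint URL, the data set URI, and the function names are assumptions for the example; the slides do not specify the actual queries or endpoints SEAD uses.

```python
from urllib.parse import urlencode


def build_metadata_query(dataset_uri: str) -> str:
    """Return a SPARQL SELECT listing all properties and values of a data set."""
    # Reject characters that are not allowed inside an IRI reference (<...>).
    if any(ch in dataset_uri for ch in '<>"{}|\\^` \n'):
        raise ValueError("dataset_uri contains characters not allowed in an IRI")
    return (
        "SELECT ?property ?value\n"
        "WHERE {\n"
        f"  <{dataset_uri}> ?property ?value .\n"
        "}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request URL; no network call is made here."""
    return endpoint + "?" + urlencode({"query": query})
```

A caller would pass the resulting URL to an HTTP client of its choice; the returned bindings correspond to the "Return Metadata" message in the diagram.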

Virtual Archive Workflow
[Workflow diagram; steps recovered from the figure]
• Accept Repository Agreement in ACR
• Data Ready to Publish
• Upload Data to VA
• Preview
• Run Virus Checking
• File Characterization
• Deposit to IR
• Mint DOI
• Index Metadata

To be completed by March 2013:
• Large Dataset
• Version Data
• IR Matchmaker
• Index Scientific Metadata
• Policy Decision
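
The workflow steps (virus checking, file characterization, deposit to an IR, DOI minting, metadata indexing) can be sketched as a simple pipeline of functions. Everything here is hypothetical: the step order is one plausible reading of the diagram, the virus check and deposit are stubs, and the identifier is a placeholder string rather than a real DOI. It is meant only to show the shape of such a pipeline, not SEAD's implementation.

```python
import hashlib
import mimetypes
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def virus_check(record: Record) -> Record:
    # Stub: a real deployment would call an actual scanner here.
    record["virus_checked"] = True
    return record


def characterize(record: Record) -> Record:
    # Guess a MIME type from the file name and fingerprint the content.
    mime, _ = mimetypes.guess_type(record["filename"])
    record["mime_type"] = mime or "application/octet-stream"
    record["sha256"] = hashlib.sha256(record["content"]).hexdigest()
    return record


def deposit_to_ir(record: Record) -> Record:
    # Stub: records a made-up repository name instead of depositing anything.
    record["repository"] = "example-ir"
    return record


def mint_doi(record: Record) -> Record:
    # Placeholder identifier only; no DOI service is contacted.
    record["doi"] = "doi-placeholder/" + record["sha256"][:8]
    return record


def index_metadata(record: Record) -> Record:
    # Index every field except the raw file content.
    record["indexed_fields"] = sorted(k for k in record if k != "content")
    return record


PIPELINE: List[Callable[[Record], Record]] = [
    virus_check, characterize, deposit_to_ir, mint_doi, index_metadata,
]


def run_workflow(record: Record, agreement_accepted: bool) -> Record:
    """Run all steps in order, but only after the repository agreement is accepted."""
    if not agreement_accepted:
        raise PermissionError("repository agreement must be accepted first")
    record = dict(record)  # work on a copy so the caller's input is untouched
    for step in PIPELINE:
        record = step(record)
    return record
```

Expressing the steps as a list makes it straightforward to slot in the items marked "to be completed", such as versioning or an IR matchmaker, without changing the runner.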


Key Questions for SEAD Prototype

• What could SEAD capture, and when?
• How can SEAD provide direct value
to data producers, users, and
curators?
• How can web 2.0/3.0 and social
computing lower barriers and
reduce/realign costs?


Towards A Shared Data Future
[Diagram layers, top to bottom]
• Data Generators and Users: user functionalities, data capture & transfer, virtual research environments
• Community Support Services: data discovery & navigation, workflow generation, annotation, interpretability
• Common Data Services: persistent storage, identification, authenticity (provenance), workflow execution, data mining
Side labels: Data Curation, Trust

Source: EU HLEG Report on Data Deluge: Riding the Wave, p. 31, 2010


Data Interoperability and SEAD
• NSF OCI: DataNet and INTEROP now DIBBs
• EUDAT
• Research Data Alliance
• IETF Research Data Identifier BOF
• NCED Data Network


Acknowledgements
SEAD is funded by the National Science Foundation
under cooperative agreement #OCI0940824

• For more on SEAD go to:
• http://sead-data.net

• Follow us on Twitter
@SEADdatanet
