1. SEAD DataNet and Sustainability Science
1. NSF DataNet Overview
2. SEAD Overview
3. SEAD Active/Social Curation
4. SEAD Virtual Archive Repository
Robert H. McDonald
Deputy Director/Associate Dean
Data to Insight Center/IU Libraries
SC12 | Salt Lake City, UT
November 12, 2012
http://www.sead-data.net
@SEADdatanet
2. SEAD DataNet and Sustainability
Science
http://www.sead-data.net
http://slidesha.re/TAk3ht @SEADdatanet
2 SEAD DataNet Home
3. SEAD TEAMS
Michigan: Margaret Hedstrom (PI), Marietta Van Buhler, Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR)
Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Robert Ping, Ryan Cobine
Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd
Illinois: Praveen Kumar (Co-PI), Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)
4. NSF DataNet Program
Motivation:
“… one of the major challenges of this scientific
generation: how to develop the new methods,
management structures and technologies to
manage the diversity, size, and complexity of
current and future data sets and data streams.”
Response:
DataNet creates “a set of exemplar national and
global data research infrastructure organizations” to
address this challenge.
5. Current NSF DataNet Projects
SEAD
• http://sead-data.net
DataONE
• http://www.dataone.org
DataNet Federation Consortium
• http://datafed.org
Terra Populus
• https://www.pop.umn.edu/terra_pop
6. SEAD’s Approach
SEAD Partners - http://sead-data.net
• Contribute infrastructure to the
DataNet vision that supports data
access, sharing, reuse, and
preservation for the long tail
• Develop a data access and
preservation environment that
supports the research, technical,
and economic requirements for
data management in the long tail
• Enable Active and Social Curation
• Utilize emerging preservation and
access infrastructures
7. Long Tail Data Challenges
[Figure: data volume scale running from gigabytes through terabytes and petabytes to exabytes (bytes per day), with many smaller datasets making up the long tail.]
8. CI for the Long Tail
What is the “long tail” of scientific research and
why does it matter?
• Diverse set of researchers, questions, data, and
methodologies, etc.
• Diverse set of requirements for instrumentation, data
collection, models, analysis, etc.
• Little standardization, no common denominator
• Most researchers work in the long tail, and most research
dollars go there
• The long tail is underserved by current CI
9. Long Tail Example: Sustainability
Research
Many dimensions, many coordinate systems, many scales,
many data collection and analysis tools, many formats, a
long-tail of providers and users, …
10. SEAD 18-Month Pilot Phase
Domain Engagement:
• National Center for Earth-Surface Dynamics (NCED), Illinois River
Basin Observatory
• Requirements, Use Cases, Prioritization of Data Types and
Services
Active and Social Curation
• Pilot Active Content Repository, VIVO deployments
• Exemplar services for Data Ingest, Discovery, Re-use, Curation
(Tupelo/Medici)
CI for Long-term Access (Virtual Archive)
• Data model, protocol design/development
• Pilot Federated Repository infrastructure
Education, Outreach, and Training
• Post-doc mentoring
• Web site, training materials, meetings, workshops, …
Project Oversight
• Management, reporting, committees
• Business model development
11. NCED Collection Access
NCED collections in SEAD-ACR
• (20 Top-level Collections, 454K
files, 2.25M objects, 1.6 TB data)
• NCED Repository Interface
• Support for hierarchy
• Support for collection annotation
• View/add NCED/domain-specific
terms
• New Large Server with Virtual
Machine ACR instances
• Ingest tools and procedures
• csv2rdf4LOD
• Archiving, Citation, DOI
assignment, …
NCED users can (with an account) go from
web page to previews and downloads (w/o
cart), can add annotations, can browse,
search by text (any fields and content), tags,
etc.
12. SEAD notions of defined Data Phases
The phases of the data lifecycle acknowledge and accommodate the difference between public
data and data a researcher is still working on.
Research Data Phase: the data set is a research data collection, owned by an individual and
under their control.
• Data need not be licensed at this time because it is not ready
for broader release
• Data need not have permanent IDs because it is still a work in
progress
• Corresponds to first existence in Active Curation Repository
Published Phase: Owner of research data collection determines that dataset is ready for
publication
• License terms set
• Persistent ID
• Made available as part of public profile in VIVO
• Activated by user-controlled publish event
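The two phases can be read as a small state machine. The sketch below is illustrative only and is not SEAD's implementation: the class, field, and method names are hypothetical, and the license string and DOI used in the usage example are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"    # held in the ACR, under the owner's control
    PUBLISHED = "published"  # licensed, persistently identified, public


@dataclass
class DataCollection:
    """Hypothetical record for a research data collection."""
    owner: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None   # not required in the research phase
    persistent_id: Optional[str] = None   # not required while work is in progress

    def publish(self, requested_by: str, license_terms: str, persistent_id: str) -> None:
        """User-controlled publish event: only the owner may trigger it."""
        if requested_by != self.owner:
            raise PermissionError("only the owner can publish this collection")
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license_terms or not persistent_id:
            raise ValueError("license terms and a persistent ID are required")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED
```

For example, `DataCollection(owner="alice")` starts in the research phase with no license or ID, and `publish("alice", "CC-BY-4.0", "doi:10.5555/example")` moves it to the published phase.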
28. NCED Data Social Network in SEAD-VIVO
Mary Power, NCED PI and Professor, University of California
William Dietrich, NCED PI and Professor, University of California
Collin Bode, NCED Data Technician
NCED Social Network Connections Based on Data Authorship
30. SEAD Data Set Publishing Workflow
NCED Data Set Ingested to ACR
• Data content used within ACR
• Researcher Profile Established in VIVO
NCED Data Set Ingested to VA
• Data Set ready to publish
NCED Data Set Deposited with IR
• DataCite minted DOI attached to finalized Data Set
• DOI Resolution to designated IR
NCED Data Set Published to VIVO
33. Virtual Archive Features
Usability consistent with research user expectations
• Additional metadata fields for scientific datasets
• Ability to preview data during ingest
Repository tracking: tracking member Institutional Repositories
(IRs) and their stored content
• Not just link to repository, but extensive cataloging tool
(metadata and other additional information)
• Allows users to search for data in a particular IR or across
all IRs
Low-cost replication: cloud-based storage for reliability
• Proof of concept uses Amazon S3 to maintain a copy of
files and collections. Amazon Glacier is low-cost, secure,
and durable, and is optimized for cold storage. Other
solutions exist.
36. Component Interactions:
Virtual Archive and ACR
[Figure: four stages]
• Data Set Uploaded to ACR
• Data Set Ingested to Virtual Archive
• Data Set Deposited with Institutional Repository
• Data Set Published to VIVO
37. ACR – VA Interaction Protocol
[Sequence diagram, time running downward. Actors: Researcher (ACR UI), Curator (VA UI). Components: Active Curation Repository, Virtual Archive. Endpoints labeled: (SPARQL) Query Endpoint, SWORD, Metadata Endpoint.]
• Researcher: Mark Data For Publication (and Accept Licensing Terms)
• Curator: Request for Preview
• (SPARQL) Query Metadata; Return Metadata
• Curator Preview
• Ingest Data To VA
• User Queries VA for DOI
• Metadata update and View DOI
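The "(SPARQL) Query Metadata" step can be illustrated with a small query builder. This is a sketch under stated assumptions: the slides do not give SEAD's metadata vocabulary, so the Dublin Core terms used here (`dcterms:title`, `dcterms:creator`, `dcterms:license`) are stand-ins, and the function name and example URI are hypothetical.

```python
def build_metadata_query(dataset_uri: str) -> str:
    """Return a SPARQL SELECT query for a dataset's descriptive metadata.

    Assumption: metadata is expressed with Dublin Core terms; the real
    ACR vocabulary is not specified in the slides.
    """
    # Reject characters that are not allowed inside <...> in SPARQL,
    # so the URI cannot break out of the IRI reference.
    if any(ch in dataset_uri for ch in '<>"{}|\\^` \n\t'):
        raise ValueError("dataset_uri contains characters not allowed in an IRI")
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT ?title ?creator ?license WHERE {\n"
        f"  <{dataset_uri}> dcterms:title ?title .\n"
        f"  OPTIONAL {{ <{dataset_uri}> dcterms:creator ?creator . }}\n"
        f"  OPTIONAL {{ <{dataset_uri}> dcterms:license ?license . }}\n"
        "}"
    )
```

A caller would send the returned string to the repository's SPARQL endpoint with whatever HTTP client it already uses; that transport step is left out here because the endpoint address is not given in the slides.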
38. Virtual Archive Workflow
[Figure: Virtual Archive workflow]
Steps shown: Data Ready to Publish, Accept Repository Agreement in ACR, Upload to VA, Preview File, Data Characterization, Run Virus Checking, Mint DOI, Deposit to IR, Index Metadata
To be completed by March 2013: Large Dataset, Version Data, IR Match-maker, Index Scientific Metadata, Policy Decision
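A workflow like this is often modeled as an ordered chain of steps applied to a dataset record. The sketch below assumes one plausible ordering (characterization, virus checking, DOI minting, deposit, indexing); the slide does not confirm the exact sequence. Every function is a hypothetical stand-in: the virus check looks for a made-up marker, and the DOI is a placeholder rather than a registered identifier.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[Record], Record]


def characterize(record: Record) -> Record:
    # Data characterization: here, just record the payload size.
    return {**record, "size_bytes": len(record["content"])}


def virus_check(record: Record) -> Record:
    # Stand-in for a real scanner: rejects content carrying a made-up marker.
    if b"INFECTED" in record["content"]:
        raise ValueError("virus check failed")
    return {**record, "scanned": True}


def mint_doi(record: Record) -> Record:
    # Placeholder identifier; a real system would call a DOI registration service.
    return {**record, "doi": f"10.5555/{record['name']}"}


def deposit_to_ir(record: Record) -> Record:
    return {**record, "deposited": True}


def index_metadata(record: Record) -> Record:
    return {**record, "indexed": True}


PIPELINE: List[Step] = [characterize, virus_check, mint_doi, deposit_to_ir, index_metadata]


def run_pipeline(record: Record, steps: List[Step] = PIPELINE) -> Record:
    """Apply each step in order; any step may abort by raising an exception."""
    for step in steps:
        record = step(record)
    return record
```

Keeping each step as a plain function makes it straightforward to slot in the items listed as still to be completed (for example a versioning or policy step) without touching the others.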
39. Key Questions for SEAD Prototype
• What could SEAD capture, and when?
• How can SEAD provide direct value
to data producers, users, and
curators?
• How can web 2.0/3.0 and social
computing lower barriers and
reduce/realign costs?
40. Towards A Shared Data Future
[Figure: layered data infrastructure]
Data Generators, Users: user functionalities, data capture & transfer, virtual research environments
Community Support Services: data discovery & navigation, workflow generation, annotation, interpretability
Common Data Services: persistent storage, identification, authenticity (provenance), workflow execution, data mining
Side labels: Data Curation, Trust
Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010
41. Data Interoperability and SEAD
• NSF OCI: DataNet and INTEROP (now DIBBs)
• EUDAT
• Research Data Alliance
• IETF Research Data Identifier BOF
• NCED Data Network
42. Acknowledgements
SEAD is funded by the National Science Foundation
under cooperative agreement #OCI0940824
• For more on SEAD go to:
• http://sead-data.net
• Follow us on Twitter
@SEADdatanet
Editor's Notes
A collection of heterogeneous files. Users can tag and comment on the entire collection, and can individually tag and comment on the objects in it. Note: extraction services and previewers are all driven by the file MIME type. Extraction services are customizable and are designed to automate the creation of derived data products from the file being uploaded. Examples follow…
Lidar data saved as .png. The image extraction service does the following:
• Creates the thumbnail and preview image
• Creates an image pyramid of the image (zoom/pan large images w/o downloading the entire image, via the SeaDragon webapp)
• Extracts all header information from the image file, including Exif, GPS, Interoperability, etc.
Extracted data is viewed by clicking on the "Extracted Information" section.
A data set saved as a simple ASCII text file. Users can preview the first 80 lines of the text file.
Preview the contents of .csv files
Simple map image:
• User-defined information
• Image is part of multiple collections
• Image is tagged
3 images (3 clicks):
• Standard Medici info
• Scroll down to show location and annotation
This image file also contained geolocation data, which becomes visible in "Location". Geolocation can be extracted from the image Exif data, or authors can add a geolocation to any file in the repository. Note the creator tag and VIVO reference.
TIFF support: a relatively large 71 MB file. Clicks:
• Click Zoom to enable SeaDragon to explore the details of the file via zoom and pan with the mouse.
• Click the lower right icon to enable full screen. Use the + or – key to zoom (or the mouse wheel); click the image and drag to pan.
• Click the lower right icon to return to the embedded window in Medici.
Image file that contains GPS data which is extracted by Medici as part of the upload process.
MPEG file uploads: the extraction service creates a Flash version of the file for preview.
PDF files: the extraction service generates an image per page of the file, in this case a slide set from a presentation. Click 'Pages' to enable the slide set mode and click the left or right arrows to navigate the pages. 2 images (click to advance slide).
.shp files:
• The components of a shapefile are uploaded to Medici as a zip.
• Medici saves the zip blob, and the extraction service registers the contents of the .shp file with GeoServer.
• OpenStreetMap displays the contents of the zip.
• Layers are on by default but can be turned off by clicking the 'show' button.
• Opacity of layers can be varied using the opacity scale.
• (WIP) We plan to embed OpenStreetMap in Medici as a previewer for .shp and .kml.
All layers off except Illinois Flood Zone map. Map zoomed into the Champaign region of interest.
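These notes describe extraction services and previewers that are driven by the file MIME type. A minimal sketch of that dispatch pattern follows. It is not Medici's code: the registry, handler names, and returned fields are hypothetical, the image handler is a stub that only lists the derived products named in the notes, and only the 80-line text preview is actually implemented.

```python
import mimetypes
from typing import Any, Callable, Dict

Handler = Callable[[bytes], Dict[str, Any]]


def extract_text(content: bytes, max_lines: int = 80) -> Dict[str, Any]:
    # Text previewer: keep only the first `max_lines` lines.
    lines = content.decode("utf-8", errors="replace").splitlines()
    return {"preview": lines[:max_lines]}


def extract_image(content: bytes) -> Dict[str, Any]:
    # Stub: a real service would generate these products with an imaging library.
    return {"products": ["thumbnail", "preview image", "image pyramid", "header metadata"]}


# Hypothetical registry mapping MIME types to extraction handlers.
EXTRACTORS: Dict[str, Handler] = {
    "text/plain": extract_text,
    "image/png": extract_image,
    "image/tiff": extract_image,
}


def run_extraction(filename: str, content: bytes) -> Dict[str, Any]:
    """Pick a handler from the file's guessed MIME type and run it."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    handler = EXTRACTORS.get(mime_type or "")
    if handler is None:
        return {"mime_type": mime_type, "status": "no extractor registered"}
    return {"mime_type": mime_type, **handler(content)}
```

Adding support for another format then means registering one more handler under its MIME type, which matches the customizable behavior the notes describe.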