Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and Breakthrough Technologies
1. Do It Yourself (DIY) Earth Science
Collaboratories Using Best Practices
and Breakthrough Technologies
IN13D-01
ERIC STEPHAN
December 11, 2017
Pacific Northwest National Laboratory
AGU Fall meeting 2017, New Orleans, LA
IN13D: Approaches for Curation to Data Discovery in the Era of Big Data Variety II
2. Addressing Data Challenges of Scientists on
Small and Midscale Budgets
Do it yourself (DIY) home project videos have taken the media by storm,
helping you reroof a house or replace a water pump.
DIY recommendations can even help you determine whether you can do it yourself!
This talk targets innovative, smaller-sized science projects that produce
quality science products, including data that can be shared with future
consumer communities.
Many best practices can be carried out in even the humblest situations.
If you run a big data center: smaller projects want more effective ways to connect to your
resources beyond ’point and click’.
3. Emergence of Scientific Collaborative Tools –
Science inspired the Web and so much more!
Collaboratory: “A center without walls, in which the nation’s researchers can perform their research
without regard to physical location, interacting with colleagues, accessing instrumentation, sharing data
and computational resources, [and] accessing information in digital libraries.” [1]
[1] The national collaboratory. In Towards a national collaboratory. Unpublished report of a National Science Foundation
invitational workshop, Rockefeller University, New York. 1988.
The DOE 2000 Project
Environmental Molecular Sciences
Laboratory (EMSL) User Facility
12 March 1989: Sir Tim Berners-Lee’s original “vague but exciting”
submission to CERN on a distributed information system
National Institutes of Health:
The Human Genome Project (HGP) began 1989.
Engage with EMSL to advance your research
How can we work together?
• Collaborate with our experts
• Work within multi-disciplinary teams to accelerate science
• Access world-class scientific user facilities and specialized instrumentation
• Provide research and career opportunities for your students
December 8, 2017
www.emsl.pnnl.gov
www.universities.pnnl.gov
4. Examples of Off the Shelf and Standards
Deluge: What Works for You?
5. Attaining Data Study Afterlife?
[Figure: visibility through commercial search engines. Labels: Signal, Message, Application, Database, File store, Archive, Deep Web, Science publications, Data]
New advancements in science
and engineering require
careful attention to keeping
scientific discovery literature
and data artifacts in
circulation
[Figure: Example Data Lifecycle]
“…Placed in storage, the data has as much
productive value as your labor value when
you sit on the sofa at night to watch TV.”
“…If you want to increase the value of your data
you have to increase its active circulation and
utility!”
Steven Adler, DWBP co-Chair
Without some help, science can remain largely
invisible in the Deep Web
6. Increasing Lifespan, Reuse and Visibility DIY
Choose from 35 DWBP best practices to match research functional needs
Scope best practices with reference model sketches
Assess off the shelf product capabilities and limitations with DWBP
Identify required additional plumbing to accomplish research
https://www.w3.org/TR/dwbp/
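The steps above amount to a small gap analysis. The following is a minimal sketch of that idea, not a tool from the talk: the best practice names come from the DWBP, while the project needs and the product capability list are invented for illustration.

```python
# Gap analysis sketch: which DWBP best practices does a project need,
# which does an off-the-shelf product already cover, and what remains
# as "additional plumbing"? All inputs below are illustrative assumptions.

# Best practices the research project needs (names taken from the DWBP).
needed = {
    "Provide metadata",
    "Provide data provenance information",
    "Provide a version indicator",
    "Use persistent URIs as identifiers of datasets",
    "Make data available through an API",
}

# Hypothetical capabilities of an off-the-shelf product.
product_covers = {
    "Provide metadata",
    "Make data available through an API",
    "Provide bulk download",
}


def gap_analysis(needed, covered):
    """Split the needs into those the product covers and those still missing."""
    return {
        "covered": sorted(needed & covered),
        "plumbing": sorted(needed - covered),
    }


report = gap_analysis(needed, product_covers)
for status, practices in report.items():
    print(status)
    for practice in practices:
        print("  -", practice)
```

Anything listed under "plumbing" is work the project would have to build or source elsewhere.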
7. DWBP Data Challenges and Motivating
Questions
Challenge: Motivating question
Metadata: How do I provide metadata?
Data License: How do I permit/restrict access?
Provenance: How can I convey transparency?
Data Quality: How can I add trust?
Versioning: How can I track version history?
Identification: How can I create and use persistent identifiers?
Data Formats: What non-proprietary structures should I use?
Vocabularies: How do I make my data more easily understood?
Access: How can I make data retrieval easy, robust, and intuitive?
Preservation: What should I consider when archiving?
Feedback: How can data producers and users be better engaged?
Enrichment: How can I add better value to data?
Replication: How do I use data responsibly?
“The Web is not a glorified USB Stick”,
Phil Archer, W3C Data Activity Lead https://www.w3.org/2017/Talks/0621-phila-oai/
http://w3c.github.io/dwbp/dwbp-implementation-report.html
8. Best Practices Benefit Measures
• Comprehension: humans will have a better understanding about the data
structure and meaning, the metadata and the nature of the dataset.
• Processability: machines can automatically ingest and operate on data.
• Discoverability: finding new associations between and in data resources.
• Reuse: increase intrinsic value to wider data consumer communities.
• Trust: improving the confidence that consumers have in the dataset.
• Linkability: it will be possible to associate data resources with one another.
• Access: humans and machines will be able to retrieve relevant data in familiar
common formats.
• Interoperability: cooperation among data publishers and consumers.
9. Using Technology Agnostic Reference
Models to Assess Best Practice Relevance
ISO Open Archival Information System (OAIS) ISO 14721:2003
The Context, Containers, Components and Classes (C4) model for software architecture
10. Example Context Data Producer Reference
Models
• Provide data provenance information
• Provide data quality information
• Provide a version indicator
• Provide version history
• Preserve identifiers
• Provide metadata
• Provide structural metadata
• Use machine-readable standardized data formats
• Provide data in multiple formats
• Reuse vocabularies, preferably standardized
ones
• Provide subsets for large datasets
• Provide bulk download
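Several of the producer practices listed on this slide (metadata, a version indicator, provenance, multiple formats) can be expressed in one machine-readable record. The sketch below builds a DCAT-style JSON-LD description with the Python standard library. The vocabulary terms (dcat, dct, owl, prov) are real W3C and Dublin Core terms; the dataset title, version, and example.org URLs are hypothetical placeholders, and this is one possible layout rather than the one used in the talk.

```python
import json


def build_dataset_metadata(title, version, derived_from, distributions):
    """Return a DCAT-style JSON-LD description of one dataset.

    distributions maps a media type to a download URL.
    """
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
            "owl": "http://www.w3.org/2002/07/owl#",
            "prov": "http://www.w3.org/ns/prov#",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,                             # descriptive metadata
        "owl:versionInfo": version,                     # version indicator
        "prov:wasDerivedFrom": {"@id": derived_from},   # provenance
        "dcat:distribution": [                          # multiple formats
            {
                "@type": "dcat:Distribution",
                "dcat:mediaType": media_type,
                "dcat:downloadURL": {"@id": url},
            }
            for media_type, url in distributions.items()
        ],
    }


# Hypothetical dataset; every value below is a placeholder.
record = build_dataset_metadata(
    title="Soil moisture observations (example)",
    version="1.0",
    derived_from="https://example.org/datasets/raw-sensor-feed",
    distributions={
        "text/csv": "https://example.org/datasets/soil-moisture.csv",
        "application/json": "https://example.org/datasets/soil-moisture.json",
    },
)
print(json.dumps(record, indent=2))
```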
11. Use Case: Energy Exascale Earth System
Model (E3SM) and Mass Spectrometry
Focus: Recovering enough information to re-execute a given simulation
Achieves this through IETF and W3C formats, W3C Provenance,
and interoperable protocols
Off the shelf: Swagger, Jupyter
Notebook, NoSQL databases
Repurposed to support
reproducible mass spectrometry
experiments
Thomas M, J Laskin, B Raju, EG Stephan, TO Elsethagen, NYS Van, and SN Nguyen. 2016. "Enabling
Re-executable Workflows with Near-real-time Visualization, Provenance Capture and Advanced Querying for Mass
Spectrometry Data." In NYSDS 2016 - Data-Driven Discovery.
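To make "recovering enough information to re-execute" concrete, here is a minimal sketch of provenance capture for a single run. It is not the system described in the cited paper: the field names loosely follow W3C PROV terms (an activity that "used" input entities), and the run identifier, code version, parameters, and file contents are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Fingerprint an input so later changes can be detected."""
    return hashlib.sha256(data).hexdigest()


def capture_run(run_id, code_version, parameters, input_blobs):
    """Record what a run used: code version, parameters, and input hashes."""
    return {
        "activity": {
            "id": run_id,
            "startedAtTime": datetime.now(timezone.utc).isoformat(),
            "codeVersion": code_version,
            "parameters": parameters,
        },
        "used": [
            {"name": name, "sha256": sha256_of(blob)}
            for name, blob in sorted(input_blobs.items())
        ],
    }


def can_reexecute(record, available_blobs):
    """True when every recorded input is present and unchanged."""
    return all(
        entry["name"] in available_blobs
        and sha256_of(available_blobs[entry["name"]]) == entry["sha256"]
        for entry in record["used"]
    )


# Hypothetical inputs standing in for real configuration and data files.
inputs = {
    "config.yaml": b"timestep: 30\n",
    "forcing.dat": b"0.1 0.2 0.3\n",
}
record = capture_run("run-001", "v1.2.3", {"timestep": 30}, inputs)
print(json.dumps(record, indent=2))
print("re-executable:", can_reexecute(record, inputs))
```

Hashing the inputs rather than storing them keeps the record small while still showing whether a later re-execution would see the same data.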
12. Example Context Data Publisher Reference
Model
• Provide metadata
• Provide descriptive metadata
• Provide structural metadata
• Provide data provenance information
• Use locale-neutral data representations
• Reuse vocabularies, preferably standardized ones
• Choose the right formalization level
• Gather feedback from data consumers
• Enrich data by generating new data
• Provide Complementary Presentations
• Interoperability
• Use persistent URIs as identifiers of datasets
• Use persistent URIs as identifiers within datasets
• Reuse vocabularies, preferably standardized ones
• Choose the right formalization level
• Make data available through an API
• Use Web Standards as the foundation of APIs
• Avoid Breaking Changes to Your API
• Provide Feedback to the Original Publisher
• Provide data provenance information
• Provide data quality information
• Provide a version indicator
• Provide version history
• Preserve identifiers
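One way a publisher can "use Web Standards as the foundation of APIs" while offering data in more than one format is HTTP content negotiation on the Accept header. The sketch below is a deliberately simplified illustration: it handles q-values and the `*/*` wildcard but ignores partial wildcards such as `text/*` and media type specificity rules. The list of available formats is an assumption.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    choices = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        choices.append((media_type, q))
    # The sort is stable, so equal q-values keep their header order.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


def negotiate(header, available):
    """Pick the representation to serve, or None if nothing is acceptable."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]  # fall back to the publisher's default
        if media_type in available:
            return media_type
    return None


# Hypothetical formats a dataset endpoint can serve; the first is the default.
AVAILABLE = ["application/ld+json", "text/csv"]

print(negotiate("text/csv;q=0.8, application/ld+json", AVAILABLE))
print(negotiate("text/html, */*;q=0.1", AVAILABLE))
print(negotiate("text/html", AVAILABLE))
```

Serving every format from the same persistent URI lets new formats be added later without breaking existing clients.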
13. Example curating and re-publishing to
support discovery
Based on a single soil moisture use case
1.4 billion triples of curated measurement
metadata (i.e., relationships, graph edges)
Including descriptions of 777,230 datasets,
2,767 data catalogs,
1,701 data centers,
52 data networks.
Chappell AR, JR Weaver, S Purohit, WP Smith, KL Schuchardt, P West, B Lee, and P Fox. 2015. "Enhancing the Impact of Science Data:
Toward Data Discovery and Reuse." In Proceedings of the 14th IEEE/ACIS International Conference on Computer and Information Science
2015.
Ontology alignment
Query Optimization with SPARQL and
Schema.org
Use of services such as geonames.org
14. DWBP Implementation Report: Field
Guide to Examples of Best Practices
Use evaluation criteria in report for
assessing your own technology stack and
data resources.
http://w3c.github.io/dwbp/dwbp-implementation-report.html
15. Indirect Collaborations
Producers
Publishers
Analysts
Researchers
There is real interest in your data from
emerging fields!
Using common methods and
approaches is an extremely helpful
form of indirect collaboration
Internationalizing your products can
widen your impact
Approach supports open and closed
(behind firewall) collaborations
[Figure: Example Data Lifecycle]
16. What Type of Data Terrain Are We Providing
for Future Science?
Active technical recommendation communities such as W3C are here to serve
you and are interested in your problems.
Evolving good practices as guidelines is less expensive than switching between
technology solutions without good practices.
Success criteria described in the DWBP can help you measure benefit to your
project
Change is good; for legacy applications, adopting good practices and new
technology may be more impactful at a gradual pace.
Questions? Eric.Stephan@pnnl.gov
“Thank you for giving us level terrain to build upon”
Sir Tim Berners-Lee (inventor of the Web), recalling a conversation he had with Vint Cerf
(co-inventor of the Internet)
Paraphrased from notes on TBL’s remarks at the W3C Technical Plenary and Advisory Committee meeting, 2014
17. The International Data on the Web Best
Practices Recommendations Team!
Contributors:
• Annette Greiner (Lawrence Berkeley National Laboratory)
• Antoine Isaac
• Carlos Iglesias
• Carlos Laufer
• Christophe Guéret
• Deirdre Lee (Working Group co-Chair)
• Doug Schepers
• Eric G. Stephan (Pacific Northwest National Laboratory)
• Eric Kauz
• Ghislain A. Atemezing
• Hadley Beeman (Working Group co-Chair)
• Ig Ibert Bittencourt
• João Paulo Almeida
• Makx Dekkers
• Peter Winstanley
• Phil Archer (Data Activity Chair)
• Riccardo Albertoni
• Sumit Purohit (Pacific Northwest National Laboratory)
• Yasodara Córdova
DWBP Editors:
• Bernadette Farias Lóscio
• Caroline Burle
• Newton Calegari
Working Group Chairs
• Hadley Beeman
• Deirdre Lee
• Yasodara Córdova
• Steven Adler, Perspective & Community Outreach
W3C Data Activity Lead, W3C Team Contact: Phil Archer