This document discusses using linked open data and semantic technologies to support next generation science. It provides background on the increasing availability of open data and opportunities for citizen science contributions. Semantic technologies can help integrate and link diverse scientific data sources. Linked data principles allow disparate datasets to be connected through shared identifiers and relationships. Examples are provided of existing projects that use semantic approaches to enable scientific data discovery, analysis and collaboration across domains like population health, water quality monitoring and climate change. Overall, the document argues that semantic technologies are mature and can help scientists address large, distributed problems by facilitating data integration and knowledge sharing.
1. Linked Open Data and Next Generation Science
Deborah L. McGuinness
Tetherless World Senior Constellation Chair
Professor of Computer and Cognitive Science
Rensselaer Polytechnic Institute, Troy, NY
& CEO McGuinness Associates, Latham, NY
Earth System Information Partners, Madison, Wisconsin, July 18, 2012
2. Background I
– Access to data is exploding, with open government
data and numerous agencies publishing data and
providing service access, or at least FOIA access
– Citizen interest and contributions are increasing –
data gathering (e.g., bird observations), reviewing
(e.g., galaxy zoo), compute cycles (e.g., SETI), …
– Arguably, more large science problems (large in both
data volume and breadth of area) need addressing –
these go beyond what a single research team can
easily solve
3. Background II
– Semantic Technologies – technological support for
encoding meaning in a form computers can
understand and manipulate – are maturing and
increasing in usage
– Computational encodings of meaning can be used
to help integrate, link, validate, filter, … essentially
to make smarter, more context-aware applications
– Semantic Technologies enable linking data … and
linked data provides a way of connecting and
traversing information, nodes, graphs, webs, …
4. Take Home Message (early)
– Linked Data is usable now by any project
– Linked Data and Semantic Technologies can help in
forming and connecting large, distributed,
evolving efforts such as many earth and space
science projects
– In the rest of the talk:
– Brief intro to Linked Data and Semantic
Technologies through examples
– Discussion about what we might do now and strive
for in the future
5. Linked Data
• Linked Data is quite simple and follows principles set
out by Berners-Lee in
http://www.w3.org/DesignIssues/LinkedData.html
– Use URIs as names for things
– Use HTTP URIs so that people can look up those names.
– When someone looks up a URI, provide useful information,
using the standards (RDF*, SPARQL)
– Include links to other URIs, so that they can discover more
things.
– Introduction by examples and then discussion
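The four principles above can be sketched with a small in-memory graph. This is an illustration only, not a real deployment: every URI under example.org and other.example.net is hypothetical, and a dictionary-style lookup stands in for an actual HTTP request.

```python
# Minimal sketch of the Linked Data principles using plain Python data.
# Assumption: all example.org / other.example.net URIs are hypothetical.

EX = "http://example.org/id/"  # hypothetical HTTP URI namespace
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Principles 1 and 2: things are named with HTTP URIs.
# Each triple is (subject URI, predicate URI, object); objects are URIs or literals.
TRIPLES = [
    (EX + "state/NY", RDFS_LABEL, "New York"),
    (EX + "state/NY", EX + "vocab/hasCounty", EX + "county/Rensselaer"),
    (EX + "county/Rensselaer", RDFS_LABEL, "Rensselaer County"),
    (EX + "state/NY", OWL_SAME_AS, "http://other.example.net/place/new-york"),
]


def describe(uri, triples=TRIPLES):
    """Principle 3: return the statements a lookup of `uri` would provide."""
    return [(p, o) for s, p, o in triples if s == uri]


def linked_uris(uri, triples=TRIPLES):
    """Principle 4: return objects that are themselves HTTP URIs to follow."""
    return [o for _, o in describe(uri, triples)
            if o.startswith(("http://", "https://"))]


def crawl(start, triples=TRIPLES):
    """Follow links breadth-first and return every URI reached, in visit order."""
    seen, queue = [], [start]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        queue.extend(linked_uris(uri, triples))
    return seen
```

Calling `crawl(EX + "state/NY")` visits the state, then the county and the external identifier it links to, which is the "discover more things" step in miniature.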
6. Population Sciences Grid Goals
• Convey complex health-related information to
consumer and public health decision makers
for community health impact
• Inform the development of future research
opportunities effectively utilizing
cyberinfrastructure for cancer prevention and
control
McGuinness, D., Shaikh, A., Lebo, T., Ding, L., Courtney, P., McCusker, J., Moser,. Morgan, G.D., Tatalovich, Z., Willis, G., Contractor, N., and Hesse, B.
2012. Towards Semantically-Enabled Next Generation Community Health Information Portals: The PopSciGrid Pilot. In Proceedings of the Hawaii
International Conference on System Sciences 2012.
7. Semantic Web Perspective on Initial Project Goals
• How can semantic technologies be used to integrate, present,
and analyze data for a wide range of users?
• Can tools allow lay people to build their own demos and
support public usage and accurate interpretation?
• How do we facilitate collaboration and “viral” applications?
• Within PopSciGrid:
– Which policies (taxation, smoking bans, etc) are correlated with health
and health care costs?
– What data should be displayed to help scientists and lay people
evaluate related questions?
– What data might be presented so that people choose to make (positive)
behavior changes?
– What does the data show? Why should someone believe that?
– What are appropriate follow up questions to support actionability?
8. What is an Ontology?
[Figure: ontology spectrum. Labels: Catalog/ID; Terms/glossary; Thesauri, “narrower term” relation; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Disjointness, Inverse, part-of…; General Logical constraints]
Ontologies Come of Age, McGuinness, 2001, and from AAAI Panel 99 – McGuinness, Welty, Uschold, Gruninger, Lehmann
Plus basis of Ontologies Come of Age – McGuinness, 2003
9. Inference Web: Making Data Transparent and Actionable Using Semantic Technologies
• How and when does it make sense to use smart system results & how do we
interact with them?
[Figure: application areas. Labels: (Mobile) Intelligent Agents; Knowledge Provenance in Virtual Observatories; NSF Interops: SONET, SSIII – Sea Ice; Intelligence Analyst Tools; Hypothesis Investigation / Policy Advisors]
10. Foundations: Web Layer Cake
[Figure: Semantic Web layer cake annotated with related work. Recoverable labels:
– Visualization APIs; S2S; Govt Data
– Inference Web, Proof Markup Language, IW Trust, Air + Trust; W3C Provenance Working group formal model, W3C incubator group
– DL, KIF, CL, N3Logic, …
– OWL 1 & 2 WG: edited main OWL docs, quick reference, OWL profiles (OWL RL); earlier languages: DAML, DAML+OIL, Classic
– Ontology repositories (ontolinguag); ontology evolution env: Chimaera; Semantic eScience ontologies, MANY other ontologies
– RIF WG; AIR accountability tool
– SPARQL WG; earlier QL: OWL-QL, Classic’ QL, …
– Govt metadata search; Linked Open Govt Data; SPARQL to XQuery translator; RDFS materialization (Billion triple winner)
– Transparent Accountable Datamining Initiative (TAM]
11.
12. PopSciGrid Workflow
[Workflow diagram. Labels: Ban coverage; CHSI 2009; archive; CSV2RDF4LOD; Enhance; derive; SemDiff; Publish; Direct visualize]
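The CSV2RDF4LOD step in the workflow turns tabular data into RDF. As a rough illustration of what a raw conversion of that kind might look like, here is a minimal sketch. It is not CSV2RDF4LOD's implementation or output format; the base URI, the function names, and the sample row are all assumptions made for the example.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/dataset/demo/"  # hypothetical base URI


def escape_literal(value):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text, key_column, base=BASE):
    """Raw conversion: one subject per row, one predicate per column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base}row/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            # Skip the key column, overflow cells, and empty or missing values.
            if column is None or column == key_column or not value:
                continue
            local_name = quote(column.strip().replace(" ", "_"), safe="")
            predicate = f"<{base}vocab/{local_name}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Made-up sample row, for illustration only.
SAMPLE = "county,life expectancy\nExample County,70.0\n"
```

An "enhanced" conversion would go further than this sketch, for example by typing values and reusing shared vocabularies instead of minting a predicate per column.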
13. PopSciGrid Example
State View
Extensible Mashups via Linked Data
Diverse datasets from NIH
Potentially linking to other content (e.g.
“unemployment rate”)
Accountable Mashups via Provenance
Annotate datasets used in demos
Feed back users’ comments to gov contact (e.g. %)
Annotation capabilities coming (and more)
15. Reflections
Successful but….
• What if we could allow data experts to build
their own demos?
• What if we could allow non-subject matter
experts to function as subject-literate staff?
• What if team members could interchange roles
(and thus make contributions in other areas)?
• What technological infrastructure is required?
• Claim: all of this is being done now – and it is
starting to scale and growing more accessible
16. Updates and Motivations from a Computer Science Perspective
Old:
• Raw conversions
• Per-dataset vocabularies
• Custom queries
• Custom data management code
• Limited use because of Google Visualization licenses
• State-level data
New:
• Enhanced conversions
• Vocabulary reuse
• Generic queries
• Re-usable data management code
• Unlimited use of new open source visualization toolkit
• State and county-level data
17. County average life expectancy (Summary Measures of Health …)
18. Why Did I Show A Population Science
Project and a Water Project?
Questions and goals are similar –
What’s happening with x? – health of a country,
water quality and other parts of an ecosystem,
climate changes
What intervention strategies are being tested?
What policies are correlated with factors under
investigation?
And
Why should people believe the outcome?
19. See Global Change Provenance Representation in the
Global Change Information System (GCIS)
Curt.Tilmes@nasa.gov
What’s happening with the climate
and how will it affect the U.S.?
National Climate Assessment 2013
30 chapters, 240 authors
A “Highly Influential Scientific Assessment”
Why should I believe it?
GCIS presents the provenance of the report
itself and of the key messages of the report,
including traceable accounts of the >500
technical inputs from reports, papers, models,
datasets, observations, etc.
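A traceable account can be pictured as a chain of "derived from" links that a reader follows from a key message back to its inputs. The sketch below assumes such links are recorded with the W3C PROV property prov:wasDerivedFrom; the slide does not say which vocabulary GCIS uses, and the ex: identifiers are hypothetical.

```python
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"

# (entity, predicate, source) statements. All ex: identifiers are hypothetical.
STATEMENTS = [
    ("ex:key-message-1", PROV_DERIVED, "ex:paper-A"),
    ("ex:key-message-1", PROV_DERIVED, "ex:dataset-B"),
    ("ex:dataset-B", PROV_DERIVED, "ex:observations-C"),
]


def trace(entity, statements=STATEMENTS):
    """Return every direct and indirect source of `entity`, depth-first, no repeats."""
    found = []

    def visit(node):
        for s, p, o in statements:
            if s == node and p == PROV_DERIVED and o not in found:
                found.append(o)
                visit(o)  # keep walking back toward the original inputs

    visit(entity)
    return found
```

Because already-seen sources are skipped, the walk also terminates if the recorded statements happen to contain a cycle.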
20. SemantEco/SemantAqua
• Enable/empower citizens & scientists to explore pollution
sites, facilities, regulations, and health impacts along with
provenance.
• Demonstrates semantic monitoring possibilities.
• Map presentation of analysis
• Explanations and provenance available
http://was.tw.rpi.edu/swqp/map.html and
http://aquarius.tw.rpi.edu/projects/semantaqua
1. Map view of analyzed results
2. Explanation of pollution
3. Possible health effect of contaminant (from EPA)
4. Filtering by facet to select type of data
5. Link for reporting problems
6. Now joint with USGS resource managers; expanded to
endangered species; now more virtual observatory style
22. Originally developed for VSTO, now in SSIII, SESDI, SESF, OOI …
The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. Proc. 19th
Conf. on Innovative Applications of Artificial Intelligence (IAAI-07),
http://www.vsto.org
23. Reflections
• What began as Semantic water quality monitoring is now SemantEco –
ecological and environmental monitoring in support of ecosystem analysis
• Now includes endangered species and related health impacts working with
USGS to prototype resource manager dashboard
• Expanding to include citizen science reporting on water on mobile platforms
• Now working with SONet, Santa Barbara County LTER, CUAHSI to integrate
other related scientific observations
– Current focus use case: ecological researcher
– Find relevant data (within and outside DataONE) by region, timeframe,
chemical, measurement dimension, species
– Currently the background ontology is relatively simple and aims more at
discovery and integration
• Semantic Sea Ice project aimed at helping arctic ice researchers find and
evaluate data in support of understanding the state of ice in the arctic
• These technologies span the spectrum of supporting discovery, integration,
analysis, and ultimately prediction
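The discovery use case (finding data by region, timeframe, chemical, and species) can be sketched as a faceted filter. This is only an illustration over made-up in-memory records; the class, field names, and values are assumptions, and a semantic implementation would instead query integrated data described with ontology terms.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Observation:
    region: str
    observed_on: date
    chemical: str
    species: Optional[str]
    value: float
    unit: str


def find(observations, region=None, start=None, end=None,
         chemical=None, species=None):
    """Keep observations matching every facet supplied (None means 'any')."""
    def matches(o):
        return ((region is None or o.region == region)
                and (start is None or o.observed_on >= start)
                and (end is None or o.observed_on <= end)
                and (chemical is None or o.chemical == chemical)
                and (species is None or o.species == species))
    return [o for o in observations if matches(o)]


# Made-up records, for illustration only.
RECORDS = [
    Observation("Region A", date(2011, 5, 1), "arsenic", None, 1.0, "ug/L"),
    Observation("Region A", date(2012, 3, 15), "nitrate", "example species", 2.0, "mg/L"),
    Observation("Region B", date(2012, 6, 30), "arsenic", None, 3.0, "ug/L"),
]
```

For example, `find(RECORDS, region="Region A", chemical="arsenic")` narrows the records by two facets at once, and omitting a facet leaves that dimension unconstrained.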
24. Discussion
• Semantic Technologies and Linked Data are
powering a wide array of applications – many
in Big Science, Team Science, or at least
interdisciplinary science
• Tools and methodologies are ready for use
• We love to partner in these areas
• What do you need or want from linked data
and semantic technologies?
Questions? - Deborah McGuinness
dlm @ cs . rpi . edu
26. RDF Data Cube Vocabulary
• For publishing multi-dimensional data, such as statistics, on the web
in such a way that it can be linked to related data sets and concepts
using RDF.
• Compatible with the cube model that underlies SDMX (Statistical Data
and Metadata eXchange).
• Also compatible with:
– SKOS, SCOVO, VoiD, FOAF, Dublin Core Terms
• Integrated with the LOGD data conversion infrastructure
• Integrated with other tooling like Stats2RDF
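To make the cube model concrete, here is a minimal sketch that writes a single statistical observation as N-Triples using the Data Cube terms qb:Observation and qb:dataSet. The example.org dataset, dimension, and measure URIs are hypothetical, as is the value; a real cube would also declare its structure (dimensions and measures), which is omitted here.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
EX = "http://example.org/cube/"  # hypothetical namespace for the example data


def observation_triples(obs_id, dataset, dimensions, measure, value):
    """Describe one cell of the cube: its dataset, dimension values, and measure."""
    subject = f"<{EX}obs/{obs_id}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{QB}Observation> .",
        f"{subject} <{QB}dataSet> <{dataset}> .",
    ]
    # Each dimension property points at a URI, so the value can link to other data.
    for prop, target in dimensions.items():
        lines.append(f"{subject} <{prop}> <{target}> .")
    lines.append(f'{subject} <{measure}> "{value}"^^<{XSD_DECIMAL}> .')
    return lines


# Made-up observation, for illustration only.
TRIPLES = observation_triples(
    "o1",
    EX + "dataset/demo",
    {EX + "dimension/area": EX + "area/example-county",
     EX + "dimension/year": EX + "year/2009"},
    EX + "measure/lifeExpectancy",
    "70.0",
)
```

Because the area and year are URIs rather than strings, the same observation can be joined with any other dataset that refers to those identifiers.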
27. Foundations: The Tetherless World Constellation Linked Open Government Data Portal
[Diagram: TWC LOGD. Labels: Create; Convert; Enhance; Query/Access; LOGD SPARQL Endpoint; Community Portal; Data.gov deployment]
Output formats listed:
• RDF
• RSS
• JSON
• XML
• HTML
• CSV
• …
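Query access through a SPARQL endpoint follows the standard SPARQL protocol: the query is sent as a `query` parameter and results can be requested as JSON. The sketch below builds such a request and parses a results document without contacting any server. The endpoint URL, the query, and the sample response are assumptions for illustration, not the LOGD portal's actual address or data.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint URL

QUERY = """
SELECT ?dataset ?title WHERE {
  ?dataset <http://purl.org/dc/terms/title> ?title .
} LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows(results_json):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    data = json.loads(results_json)
    names = data["head"]["vars"]
    return [{n: b[n]["value"] for n in names if n in b}
            for b in data["results"]["bindings"]]


# Made-up response in the SPARQL JSON results format, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["dataset", "title"]},
    "results": {"bindings": [
        {"dataset": {"type": "uri", "value": "http://example.org/dataset/demo"},
         "title": {"type": "literal", "value": "Demo dataset"}},
    ]},
})
```

Passing the request to `urllib.request.urlopen` and feeding the response body to `rows` would complete the round trip against a live endpoint.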
28. Directions
• Incorporation of TWC data Quality Facts label
(Zednik et al)
• Use of DataFAQs automated data quality
framework (Lebo et al)
• Additional provenance inclusion / usage (Inference /
Provenance Web)
• Annotation / Collaboration facilities (Michaelis et al)
• Other data sets? Or exposition of other
parameters?
• Partners in additional topic areas