Talk delivered at YOW! Developer Conferences in Melbourne, Brisbane and Sydney, Australia, on 1-9 December 2016.
Abstract: Governments collect a lot of data. Data on air quality, toxic chemicals, laws and regulations, public health, and the census are intended to be widely distributed. Some data is not for public consumption. This talk focuses on open government data: the information that is meant to be made available for the benefit of policy makers, researchers, scientists, industry, community organisers, journalists and members of civil society.
We’ll cover the evolution of Linked Data, which is now being used by Google, Apple, IBM Watson, federal governments, non-profits including CSIRO and OpenPHACTS, and thousands of others worldwide.
Next we’ll delve into the evolution of the U.S. Environmental Protection Agency’s Open Data service that we implemented using Linked Data and an Open Source Data Platform. Highlights include how we connected to hundreds of billions of open data facts in PubChem, the world’s largest open database of chemical molecules, and in DBpedia.
WHO SHOULD ATTEND
Data scientists, software engineers, data analysts, DBAs, technical leaders and anyone interested in utilising linked data and open government data.
1. ExtendYourReach.
Linking Open Government
Data at Scale
YOW! 2016 Conference
Melbourne December 1-2 ~ Brisbane December 5-6
Sydney December 8-9
Bernadette Hyland
CEO & co-founder
3 Round Stones, Inc.
@BernHyland
bhyland@3RoundStones.com
10. Linked Data
Refers to a set of best practices for publishing and
interlinking data for access by both humans and
machines via the use of the RDF family of syntaxes
(e.g., JSON-LD, N3, Turtle) and HTTP URIs.
11. Linked Data can be published by a person
or organization behind the firewall or on the
public Web.
Linked Data published on the public Web is
generally called Linked Open Data.
- W3C Linked Data Glossary
27. Linked Data on the Web
[Diagram: a graph labelled "my data". A measurement (date: 2011-01-01; value: 0; units of measure: degrees Centigrade) is collected at Galway Airport and collected by a collector, who is a Person (first name: Michael; last name: Hausenblas).]
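The graph on this slide can be written down as subject-predicate-object triples. The following is a minimal sketch, not the speaker's own code. The slide shows labels only, so the `http://example.org/` URIs, the local names such as `measurement1` and `collectedBy`, and the helper functions are all assumptions made for illustration.

```python
# Hypothetical namespace: the slide shows labels, not real identifiers.
EX = "http://example.org/"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"  # the "a" arcs


def uri(local_name):
    """Wrap a local name as an N-Triples URI term."""
    return f"<{EX}{local_name}>"


def literal(value):
    """Wrap a value as a plain N-Triples string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Each tuple is (subject, predicate, object), one per arc in the diagram.
triples = [
    (uri("measurement1"), RDF_TYPE, uri("Measurement")),
    (uri("measurement1"), uri("date"), literal("2011-01-01")),
    (uri("measurement1"), uri("value"), literal(0)),
    (uri("measurement1"), uri("unitsOfMeasure"), literal("degrees Centigrade")),
    (uri("measurement1"), uri("collectedAt"), uri("GalwayAirport")),
    (uri("measurement1"), uri("collectedBy"), uri("collector1")),
    (uri("collector1"), RDF_TYPE, uri("Person")),
    (uri("collector1"), uri("firstName"), literal("Michael")),
    (uri("collector1"), uri("lastName"), literal("Hausenblas")),
]


def to_ntriples(rows):
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```

Because every node and arc is named with a URI, anyone else can add statements about the same measurement or person without coordinating on a shared schema first.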
28. “Linked Data was part of my initial vision for the Web
and is an important part of the Web’s future. The Web
took off as a web of hyperlinked documents which
were exciting to read, but which could not be
effectively used as data.”
- Tim Berners-Lee
29. The Semantic Web morphed when it hit
the marketplace
44. • Widens EPA’s audience (justifies relevance), for
research, environmental justice
• More cost-effective than web portals backed by
relational databases
• Used for scientific R&D, green chemistry, ++
• Increased transparency
https://opendata.epa.gov
45. 7 Steps to Publish Linked Data
Source: W3C Best Practices for Publishing Linked Data, see https://www.w3.org/TR/ld-bp/
46. Step #1 - Identify
Identify the dataset(s) to be modeled
• Request a copy of the logical and physical model of the
database(s)
• Obtain data extracts (i.e., databases and/or
spreadsheets) or create data in a way that can be
replicated.
47. Step #2 - Model Data
Model data without context to allow for reuse and
easier merging of data sets
• Traditional DBAs organize data for specified
Web services or applications
• In Linked Data, application logic does not drive
the data schema, concepts, etc.
48. Step #2 - Modeling (cont)
Look for real world objects of interest (e.g., people, places,
things, locations, etc.) and model them.
• Investigate how others are already modeling similar or
related data.
• Look for duplication & normalize the data
• Use common sense to decide whether or not to make a
link
49. Step #2 - Modeling (cont)
• Connect data from different sources & authoritative
vocabularies
• Use URIs as names for your objects
• Put aside immediate needs of any application
• Don’t think about how an application will use your data
• Do think about time and how the data will change over
time.
50. Steps #3 & 4 - Name & Describe
Identifiers are at the heart of how things
become useful as linked data.
We use the same mechanism for connecting
data as the Web: the humble HTTP URI.
The Web is formed by HTTP URIs that are
essentially connections linking pieces of
information together.
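As an illustration of naming and describing a resource, here is a small sketch that mints a stable HTTP URI from a natural key and attaches a label to it. The base `http://example.org/id/`, the `mint_uri` helper and the slug rule are hypothetical choices made for this example, not the scheme used by EPA.

```python
from urllib.parse import quote

# Hypothetical base; a real publisher would use a domain it controls.
BASE = "http://example.org/id/"


def mint_uri(kind, key):
    """Build a stable HTTP URI from a resource kind and its natural key."""
    slug = quote(str(key).strip().lower().replace(" ", "-"), safe="-")
    return f"{BASE}{kind}/{slug}"


# Step 3: name the thing with an HTTP URI.
facility = mint_uri("facility", "Galway Airport")

# Step 4: describe it, here as a JSON-LD style dictionary keyed by the URI.
description = {
    "@id": facility,
    "http://www.w3.org/2000/01/rdf-schema#label": "Galway Airport",
}

print(facility)
print(description)
```

The same key always yields the same URI, which is what lets other datasets link to the resource reliably.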
51. Steps #5, 6 & 7 - Convert, Publish & Maintain
5. Write a script or process to convert the data set
repeatedly
6. Publish to the Web and announce it!
7. Maintenance strategy
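A conversion that can be run repeatedly might look like the sketch below, which turns a CSV extract into N-Triples using only the Python standard library. The column names (`id`, `date`, `value`), the `http://example.org/` namespace and the sample rows are made up for illustration and are not taken from any EPA dataset.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace


def convert(csv_text):
    """Convert CSV rows to N-Triples lines.

    The output is sorted so that rerunning the script on the same extract
    always produces identical output, which makes changes easy to diff.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}measurement/{row['id']}>"
        for column in ("date", "value"):
            text = row[column].replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}{column}> "{text}" .')
    return "\n".join(sorted(lines))


# Made-up sample extract, for illustration only.
SAMPLE = "id,date,value\n1,2011-01-01,0\n2,2011-01-02,3\n"

print(convert(SAMPLE))
```

Keeping the conversion in a script, rather than editing the output by hand, means the published data can be regenerated whenever the source extract is refreshed.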
52. Take an iterative approach
1. Review of modeling decisions
2. Review vocabularies chosen and developed
3. Modify/update data conversion scripts
4. Do a maintenance walk-through with real use cases
5. Show how to explore data with SPARQL and
visualizations
6. Discuss a persistent identifier strategy (think PURLs)
55. Technical DNA of EPA
Linked Data Services
• Built on Open Source Software
• Provides downloadable Linked Open Data (RDF,
JSON-LD)
• Developer guide includes RESTful API, persistent
URL strategy
• Sample apps on GitHub (https://github.com/USEPA)
56. Power of LOD
Combining data sets
in a day with Linked Open
Data from DBpedia &
EPA.
Next the EPA wanted
more chemical data
linked to their data…
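One reason data sets can be combined quickly is that merging RDF graphs amounts to taking the union of their triples, as long as both sources use the same URI for the same thing. The sketch below shows the idea with two tiny in-memory graphs. All URIs and values are placeholders on `http://example.org/`, not actual DBpedia or EPA data.

```python
# Placeholder URI that both hypothetical sources use for the same chemical.
CHEMICAL = "http://example.org/chemical/123"

# Source A: a statement that only the first publisher knows.
source_a = {
    (CHEMICAL, "http://example.org/a/reportedBy", "http://example.org/facility/1"),
}

# Source B: a statement that only the second publisher knows.
source_b = {
    (CHEMICAL, "http://www.w3.org/2000/01/rdf-schema#label", "Example chemical"),
}

# For graphs without blank nodes, merging is simply a set union of triples.
merged = source_a | source_b


def describe(graph, subject):
    """Return all (predicate, object) pairs known about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


for predicate, obj in describe(merged, CHEMICAL):
    print(predicate, obj)
```

No join keys or schema mapping are needed here because the shared URI already identifies the common resource.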
58. PubChem, the world’s
largest open molecular
database
Used by healthcare /
life sciences industry
worldwide - all Linked
Open Data
59. Use of shared
vocabularies, including
SKOS, RDFS, OWL.
Other key vocabularies
include Dublin Core,
Geo, FOAF, ORG, vCard.
Shared vocabularies are the “lingua franca” of
data interoperability.
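In practice, a shared vocabulary is referenced through its namespace URI, usually abbreviated with a prefix. The sketch below maps the vocabularies named on this slide to their published namespaces and expands a prefixed name into a full URI. It assumes "Geo" refers to the W3C Basic Geo (WGS84) vocabulary and "Dublin Core" to DCMI Metadata Terms; the `expand` helper is hypothetical.

```python
# Published namespace URIs for the vocabularies named on the slide.
# Assumption: "Geo" means W3C Basic Geo (WGS84), "Dublin Core" means DCMI Terms.
NAMESPACES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "dcterms": "http://purl.org/dc/terms/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "org": "http://www.w3.org/ns/org#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'foaf:name' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in NAMESPACES:
        raise KeyError(f"unknown prefix: {prefix}")
    return NAMESPACES[prefix] + local


print(expand("foaf:name"))
print(expand("skos:prefLabel"))
```

Two datasets that both use `foaf:name` for a person's name can be queried together without any field mapping.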
61. [Architecture diagram. Users: public; registered developer. Clients: Web browser; application, script or automated client. Access points: resource URIs, REST API, SPARQL endpoint. Backend: Linked Data management system located at a Tier 1 cloud provider (FISMA compliant), with an RDF database.]
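A script or automated client would typically reach the SPARQL endpoint over plain HTTP. The sketch below builds such a request following the SPARQL 1.1 Protocol convention of passing the query in a `query` URL parameter. The endpoint address `http://example.org/sparql` is a placeholder, not the actual EPA endpoint, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

# A generic query that lists any ten statements in the store.
QUERY = """SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10"""


def build_request(endpoint, query):
    """Build an HTTP GET request for a SPARQL query, asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# To execute against a live endpoint, pass the request to
# urllib.request.urlopen(request) and parse the JSON response body.
```

Because the interface is ordinary HTTP, the same request can be issued from a browser, a shell script or any programming language.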
62. Linked Data is a gift …
• A worldwide system of linked information systems
• Global addressing scheme for data integration that scales to the
Web
• Nearly immediate data integration to billions of facts