Session 1.4 a distributed network of heritage information
1. A distributed network of digital
heritage information
Enno Meijers
Semantics Conference - Amsterdam - 12 September 2017
2. Contents
• Introduction to Dutch Digital Heritage Network
• Problems with the current infrastructure
• Strategies for improvement
• Building the distributed network
4. Digital Heritage Network (NDE) aims
at increasing the social value of the
heritage information maintained by
libraries, archives, museums and
other cultural heritage institutions.
Long term cooperation between the
government and the institutions on
national, regional and local level. It's
about information and people!
5. Three-layered approach for
improving the sustainability,
the usability and the visibility
of digital heritage information.
sustainable
usable
visible
6. In general - discovery of the "deep web"
• Institutional repositories, collection management systems
• Millions of "invisible" datasets: publications, research data, heritage collections
• Poor coverage by regular search engines
• Metadata is key, describing physical materials or (licensed) digital content
• Dutch cultural heritage sector: 1500 institutions, >>1500 collections
• Demand for cross-institutional, cross-domain discovery
• Many specialized portals giving access to different views
11. Evaluating current approach
Positive results so far:
• many data sources available through OAI-PMH protocol
• powerful and smart protocol for metadata synchronization
• opened up data silos
• created the need for aligning data models
• made cross-collection and cross-domain discovery possible (e.g. Europeana)
But there are two problem areas:
• semantic alignment
• data integration
12. Problem #1: Poor semantic alignment
Not enough semantic alignment in the data sources:
• lack of sustainable URIs and shared identifiers
• no shared terminology sources available
• no provisions for linking between sources
• implementations lack support for multiple data models
• data is "flattened" to a common data model (EDM, Dublin Core)
• loss of meaning due to transformation
=> poor capabilities for cross-collection, cross-domain discovery
=> cleaning, aligning and enriching is needed after harvesting
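The "flattening" problem can be illustrated with a minimal sketch. The record fields and the mapping to Dublin Core elements below are hypothetical, chosen only to show how distinct source fields collapse into one target element.

```python
# Hypothetical source record with fine-grained fields.
rich_record = {
    "title": "View of a windmill",
    "painter": "A. Painter",
    "engraver": "B. Engraver",
    "depicted_place": "Amsterdam",
    "place_of_creation": "Haarlem",
}

# Hypothetical mapping to a flat, Dublin Core style model.
DC_MAPPING = {
    "title": "dc:title",
    "painter": "dc:creator",
    "engraver": "dc:creator",
    "depicted_place": "dc:coverage",
    "place_of_creation": "dc:coverage",
}

def flatten(record, mapping):
    """Map every source field onto its target element, collecting values in a list."""
    flat = {}
    for field, value in record.items():
        flat.setdefault(mapping[field], []).append(value)
    return flat

flat = flatten(rich_record, DC_MAPPING)
# After flattening, nothing tells the painter from the engraver,
# or the depicted place from the place of creation.
print(flat)
```

Restoring those distinctions after harvesting is the kind of cleaning and enriching work the slide refers to.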
13. Problem #2: Inefficient data integration
Physical data integration based on OAI-PMH (= copying)
• synchronizing with the sources is hard work
• ownership, licensing, provenance, control over access are difficult topics
• no feedback loop to the data source (usage, cleaning, enrichments)
• data source owner and end user are disconnected
• centralized model leads to scalability problems
• OAI-PMH is not a web-centric protocol
See also:
Miel Vander Sande et al., Towards sustainable publishing and querying of distributed Linked Data archives - Journal of Documentation (2017)
Herbert Van de Sompel, Reminiscing About 15 Years of Interoperability Efforts - D-Lib Magazine - December (2015)
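To make the "copying" concrete: an OAI-PMH harvester pages through a repository with ListRecords requests and keeps following the resumptionToken until the list is complete. The sketch below only builds the request URLs and reads the token from a response; the endpoint https://example.org/oai and the sample response are assumptions for illustration, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def next_token(response_xml):
    """Return the resumptionToken of a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()

# Hypothetical, heavily shortened response from a repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai")
token = next_token(SAMPLE)
follow_up = list_records_url("https://example.org/oai", resumption_token=token)
print(url)
print(follow_up)
```

Every aggregator has to repeat this loop for every source and keep its copy in sync, which is where the synchronization and scalability costs listed above come from.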
15. Design principles for a discovery infrastructure
• build portals as views based on a common data layer
• minimize the intermediate layers
• refer to the source instead of copying
• support decentralized discovery
• maximize the usability of data at the source
• develop a sustainable, "web-centric" solution
• use HTTP, RDF and RESTful APIs as building blocks
=> implement the Linked Data principles
Inspired by the work of Ruben Verborgh, Herbert Van de Sompel and colleagues.
See for example: Miel Vander Sande et al., Towards sustainable publishing and querying of distributed Linked Data archives - Journal of Documentation (2017)
16. Implementing Linked Data principles
At the data source level:
• use sustainable URIs to identify the resources
• use formal definitions for persons, places, concepts, events (API)
• use domain vocabularies / data models to describe the data
• add support for cross-domain discovery (EDM, Schema.org,...)
At the network level:
• create a "network of terms" for shared entities
• provide tools for aligning and linking
• create alignments and links between different terminology sources
• provide easy access for collection management systems (API)
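A tool for aligning terminology sources could start from something as simple as label matching. The sketch below is an assumption about how such a tool might work, not a description of the NDE tooling: the thesaurus URIs are hypothetical, and the proposed skos:closeMatch links are candidates that still need review by a person.

```python
def normalize(label):
    """Lower-case a label and collapse whitespace so that trivial variants match."""
    return " ".join(label.lower().split())

def candidate_alignments(source_a, source_b):
    """Propose skos:closeMatch links between terms that share a normalized label.

    Both sources map a term URI to a list of labels.
    """
    index = {}
    for uri, labels in source_b.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(uri)

    links = []
    for uri_a, labels in source_a.items():
        matched = set()
        for label in labels:
            matched |= index.get(normalize(label), set())
        for uri_b in sorted(matched):
            links.append((uri_a, "skos:closeMatch", uri_b))
    return links

# Hypothetical terminology sources.
source_a = {"https://example.org/thesaurus-a/123": ["Windmolen", "windmill"]}
source_b = {
    "https://example.org/thesaurus-b/w-9": ["Windmill"],
    "https://example.org/thesaurus-b/w-10": ["Watermill"],
}

links = candidate_alignments(source_a, source_b)
print(links)
```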
21. The Semantic Web is still a dream... #1
=> So discovery of Linked Data requires registering datasets?!
22. The Semantic Web is still a dream... #2
A tiny example... suppose a resource is defined as:
museum_X:object1
    a nde:painting ;
    dcterms:subject aat:windmill .
For "browsable Linked Data" you should(!) add the inverse relation [1],[2]:
aat:windmill
    a skos:Concept ;
    skos:prefLabel "Windmolen"@nl ;
    dcterms:isSubjectOf museum_X:object1 . # "backlinks"
=> a Linked Data integration problem...
[1]: Tim Berners-Lee on "browsable linked data" (2006)
[2]: Tom Heath and Christian Bizer on "Incoming Links" (2011)
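The backlinks in the example above can be derived mechanically once the forward triples from several sources are available in one place, which is exactly why this is an integration problem. A minimal sketch, using plain tuples instead of an RDF library; museum_Y:object7 is a hypothetical second source, and dcterms:isSubjectOf is taken over from the slide.

```python
# Forward property -> inverse property, as used in the slide's example.
INVERSE = {"dcterms:subject": "dcterms:isSubjectOf"}

def backlinks(triples, inverse=INVERSE):
    """Return the inverse triple for every triple whose predicate has a known inverse."""
    return [(o, inverse[p], s) for s, p, o in triples if p in inverse]

triples = [
    ("museum_X:object1", "rdf:type", "nde:painting"),
    ("museum_X:object1", "dcterms:subject", "aat:windmill"),
    ("museum_Y:object7", "dcterms:subject", "aat:windmill"),  # hypothetical second source
]

links = backlinks(triples)
for link in links:
    print(link)
```

The term aat:windmill is maintained by a third party, so neither museum can add these statements to the term itself; something at the network level has to collect them.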
23. Comparing Linked Data integration approaches...
1. Only semantic integration:
• just implement schema.org, let search engines "infer" the relations
• is the data interesting enough for Google?
• what about special thematic or regional views? how about reuse?
• can we reuse the results of the integration? (NO!)
2. Physical integration:
• aggregate all the related Linked Data sources
• build large triplestore and infer the relations
• but like OAI-PMH, based on copying data
Special case: LOD Laundromat
24. 3. Virtual integration by federated approach
"Traditional" solutions to federated querying are not feasible:
- publishing Linked Data in triplestores is hard for small data providers
- service is vulnerable because of rich functionality
- federated querying over SPARQL endpoints performs poorly
Follow the Linked Data Fragments approach:
- Linked Data available through Triple Pattern Fragments interface
- easier to implement for data providers
- federated querying is supported, even SPARQL
- more complexity at network level is acceptable
- even support for time-based versions (Memento)
See also: Miel Vander Sande et al. (2017), Towards sustainable publishing and querying of distributed Linked Data archives - Journal of Documentation
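The division of labour behind Triple Pattern Fragments can be sketched in a few lines: the server only answers single triple patterns, page by page and with a count, while the client combines the fragments into the answer to a larger query. The code below is an in-memory simplification under that assumption; it does not implement the actual HTTP interface or its hypermedia controls, and the data is hypothetical.

```python
def matches(triple, pattern):
    """A pattern position of None acts as a variable and matches anything."""
    return all(p is None or p == t for t, p in zip(triple, pattern))

def fragment(triples, subject=None, predicate=None, obj=None, page=1, page_size=2):
    """Server side: one page of the triples matching a single triple pattern."""
    hits = [t for t in triples if matches(t, (subject, predicate, obj))]
    start = (page - 1) * page_size
    return {
        "triples": hits[start:start + page_size],
        "total": len(hits),
        "has_next": start + page_size < len(hits),
    }

def all_matches(triples, **pattern):
    """Client side: follow the pages of one fragment until it is exhausted."""
    page = 1
    while True:
        result = fragment(triples, page=page, **pattern)
        yield from result["triples"]
        if not result["has_next"]:
            return
        page += 1

# Hypothetical data held by one provider.
data = [
    ("museum_X:object1", "rdf:type", "nde:painting"),
    ("museum_X:object1", "dcterms:subject", "aat:windmill"),
    ("museum_X:object2", "rdf:type", "nde:drawing"),
    ("museum_X:object2", "dcterms:subject", "aat:windmill"),
    ("museum_X:object3", "rdf:type", "nde:painting"),
    ("museum_X:object3", "dcterms:subject", "aat:windmill"),
]

# Client-side join for "paintings with a windmill as subject".
about_windmills = [s for s, _, _ in all_matches(data, predicate="dcterms:subject", obj="aat:windmill")]
paintings = [
    s for s in about_windmills
    if fragment(data, subject=s, predicate="rdf:type", obj="nde:painting")["total"] > 0
]
print(paintings)
```

The server function stays trivial, which is the point for small data providers; the join, and therefore the complexity, sits with the client or the network.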
25. Federated querying needs source selection...
Use the backlinks to support the discovery process.
More advanced: data source profiling or dataset summaries
See also: Miel Vander Sande et al. (2016) Hypermedia-Based Discovery for Source Selection Using Low-Cost Linked Data Interfaces (IJSWIS) 12(3) 79-110
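Source selection with backlinks could work roughly as follows: a network-level index records which data sources refer to which shared terms, and a federated query is only sent to the sources that mention every term it uses. The index contents and endpoint URLs below are hypothetical.

```python
# Hypothetical backlink index: shared term -> data sources that refer to it.
BACKLINKS = {
    "aat:windmill": {"https://example.org/museum-x/tpf", "https://example.org/archive-y/tpf"},
    "aat:tulip": {"https://example.org/museum-x/tpf"},
}

def select_sources(terms, backlinks):
    """Return the sources that refer to every one of the given terms."""
    sets = [backlinks.get(term, set()) for term in terms]
    if not sets:
        return []
    return sorted(set.intersection(*sets))

print(select_sources(["aat:windmill"], BACKLINKS))
print(select_sources(["aat:windmill", "aat:tulip"], BACKLINKS))
```

Only the selected sources then need to be queried, instead of every registered dataset.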
26. To make discovery of Linked Data work:
1. Register organizations and datasets
2. Build a knowledge graph with backlinks for resource discovery
Implementations will depend on capabilities of cultural heritage institutions
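What a registry entry might look like is sketched below as a Schema.org Dataset description in JSON-LD. The slides do not prescribe a registry format, so the choice of Schema.org, the helper function and all URIs and names are assumptions for illustration.

```python
import json

def dataset_description(dataset_uri, name, publisher_uri, publisher_name, access_url, media_type):
    """Build a minimal Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": dataset_uri,
        "name": name,
        "publisher": {
            "@type": "Organization",
            "@id": publisher_uri,
            "name": publisher_name,
        },
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": access_url,
                "encodingFormat": media_type,
            }
        ],
    }

# Hypothetical institution and dataset.
entry = dataset_description(
    dataset_uri="https://example.org/museum-x/datasets/paintings",
    name="Paintings collection of Museum X",
    publisher_uri="https://example.org/museum-x",
    publisher_name="Museum X",
    access_url="https://example.org/museum-x/paintings.ttl",
    media_type="text/turtle",
)
print(json.dumps(entry, indent=2))
```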
28. Strategy for our distributed network
1. build a knowledge graph for Dutch digital heritage entities
2. improve the usability of the data source:
- align object descriptions with shared entities
- publish data as Linked Data
3. build a discovery infrastructure:
- register organizations and datasets in a registry
- build knowledge graph to support discovery (backlinks)
4. implement virtual data integration technology :
- use registry and knowledge graph for selecting the resources
- support federated querying (or selective aggregation)
[Diagram labels: semantic alignment; data integration]
30. Roadmap
• Start with the existing (OAI-PMH) based infrastructure
• Build registry for organizations and datasets
• Build network of terms to provide shared entities for discovery
• Upgrade object descriptions with URIs
• Make aggregators Linked Data compliant
• Build knowledge graph with backlinks for discovery
• Support federated querying (or selective harvesting)
• Make collection management systems Linked Data compliant
• Transform aggregators to service portals for discovery
31. Thank you for your attention!
please share your thoughts with us...
email: enno.meijers at kb.nl
twitter, slideshare: ennomeijers
https://github.com/netwerk-digitaal-erfgoed