From Ambition to Go Live SWIB.pdf

From Ambition to Go Live
The National Library Board of Singapore’s Journey
to an operational Linked Data Management
and Discovery System
Richard Wallis
Evangelist and Founder
Data Liberate
richard.wallis@dataliberate.com
@dataliberate
15th Semantic Web in Libraries Conference
SWIB23 - 12th September 2023 - Berlin
Independent Consultant, Evangelist & Founder
W3C Community Groups:
• Bibframe2Schema (Chair) – Standardised conversion path(s)
• Schema Bib Extend (Chair) - Bibliographic data
• Schema Architypes (Chair) - Archives
• Financial Industry Business Ontology – Financial schema.org
• Tourism Structured Web Data (Co-Chair)
• Schema Course Extension
• Schema IoT Community
• Educational & Occupational Credentials in Schema.org
richard.wallis@dataliberate.com — @dataliberate
40+ Years – Computing
30+ Years – Cultural Heritage technology
20+ Years – Semantic Web & Linked Data
Worked With:
• Google – Schema.org vocabulary, site, extensions. documentation and community
• OCLC – Global library cooperative
• FIBO – Financial Industry Business Ontology Group
• Various Clients – Implementing/understanding Linked Data, Schema.org:
National Library Board Singapore
British Library — Stanford University — Europeana
2
3
National Library Board Singapore
Public Libraries
Network of 28 Public Libraries,
including 2 partner libraries*
Reading Programmes and Initiatives
Programmes and Exhibitions
targeted at Singapore communities
*Partner libraries are libraries which are partner owned
and
funded but managed by NLB/NLB’s subsidiary Libraries
and Archives Solutions Pte Ltd. Library@Chinatown and
the Lifelong Learning Institute Library are Partner libraries.
National Archives
Transferred from NHB to NLB in Nov
2012
Custodian of Singapore’s
Collective Memory: Responsible
for Collection, Preservation and
Management of
Singapore’s Public and Private
Archival Records
Promotes Public Interest in our
Nation’s
History and Heritage
National Library
Preserving Singapore’s Print
and Literary Heritage, and
Intellectual memory
Reference Collections
Legal Deposit (including
electronic)
4
Over
560,000
Singapore &
SEA items
Over 147,000
Chinese, Malay &
Tamil Languages
items
Reference Collection
Over 62,000
Social Sciences
& Humanities
items
Over 39,000
Science &
Technology
items
Over
53,000
Arts items
Over 19,000
Rare Materials
items
Archival Materials
Over 290,000
Government files &
Parliament papers
Over 190,000
Audiovisual & sound
recordings
Over 70,000
Maps & building
plans
Over
1.14m
Photographs
Over 35,000
Oral history
interviews
Over 55,000
Speeches & press
releases
Over
7,000
Posters
National Library Board Singapore
Over 5m
print collection
Over 2.4m
music tracks
78
databases
Over 7,400
e-newspapers and
e-magazines titles
Over
8,000
e-learning
courses
Over 1.7m
e-books and
audio books
Lending Collection
5
National Library Board Online Services
7
The Ambition
• To enable the discovery & display of entitles from different sources
in a combined interface
• To bring together resources physical and digital
• To bring together diverse systems across the National Library,
National Archives, and Public Libraries in a Linked Data Environment
• To provide a staff interface to view and manage all entities, their
descriptions and relationships
8
The Ambition
• To enable the discovery & display of entitles from different sources
in a combined interface
• To bring together resources physical and digital
• To bring together diverse systems across the National Library,
National Archives, and Public Libraries in a Linked Data Environment
• To provide a staff interface to view and manage all entities, their
descriptions and relationships
9
The Ambition
• To enable the discovery & display of entitles from different sources
in a combined interface
• To bring together resources physical and digital
• To bring together diverse systems across the National Library,
National Archives, and Public Libraries in a Linked Data Environment
• To provide a staff interface to view and manage all entities, their
descriptions and relationships
10
The Ambition
• To enable the discovery & display of entitles from different sources
in a combined interface
• To bring together resources physical and digital
• To bring together diverse systems across the National Library,
National Archives, and Public Libraries in a Linked Data Environment
• To provide a staff interface to view and manage all entities, their
descriptions and relationships
11
The Journey Starts(March 2021)
A Linked Data Management System (LDMS)
• Cloud-based System
• Management Interface
• Source data curated at source
– ILS, Authorities, CMS, NAS
• Linked Data
• Public Interface – Google friendly
– Schema.org
• Shared standards
– Bibframe, Schema.org
12
The Journey Starts(March 2021)
A Linked Data Management System (LDMS)
• Cloud-based System
• Management Interface
• Source data curated at source
– ILS, Authorities, CMS, NAS
• Linked Data
• Public Interface – Google friendly
– Schema.org
• Shared standards
– Bibframe, Schema.org
13
The Journey Starts(March 2021)
A Linked Data Management System (LDMS)
• Cloud-based System
• Management Interface
• Source data curated at source
– ILS, Authorities, CMS, NAS
• Linked Data
• Public Interface – Google friendly
– Schema.org
• Shared standards
– Bibframe, Schema.org
14
The Journey Starts(March 2021)
A Linked Data Management System (LDMS)
• Cloud-based System
• Management Interface
• Source data curated at source
– ILS, Authorities, CMS, NAS
• Linked Data
• Public Interface – Google friendly
– Schema.org
• Shared standards
– Bibframe, Schema.org
15
The Journey Starts(March 2021)
A Linked Data Management System (LDMS)
• Cloud-based System
• Management Interface
• Source data curated at source
– ILS, Authorities, CMS, NAS
• Linked Data
• Public Interface – Google friendly
– Schema.org
• Shared standards
– Bibframe, Schema.org
16
The Journey Starts(March 2021)
A Linked Data Management System (LDMS)
• Cloud-based System
• Management Interface
• Source data curated at source
– ILS, Authorities, CMS, NAS
• Linked Data
• Public Interface – Google friendly
– Schema.org
• Shared standards
– Bibframe, Schema.org
17
The Journey Starts(March 2021)
A Linked Data Management System (LDMS)
• Cloud-based System
• Management Interface
• Source data curated at source
– ILS, Authorities, CMS, NAS
• Linked Data
• Public Interface – Google friendly
– Schema.org
• Shared standards
– Bibframe, Schema.org
18
Contract Awarded
metaphactory platform
Low-code knowledge graph platform
Semantic knowledge modeling
Semantic search & discovery
AWS Partner
Public sector partner
Singapore based
Linked Data, Structured data, Semantic
Web, bibliographic meta data, Schema.org
and management systems consultant
19
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
20
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
21
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
22
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
23
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
24
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
25
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
26
Initial Challenges
• Data – lots of it!
– Record-based formats: MARC-XML, DC-XML, CSV
• Entities described in the data – lots and lots!
• Regular updates – daily, weekly, monthly
• Automatic ingestion
• Not the single source of truth
• Manual update capability
• Entity-based search & discovery
• Web friendly – Schema.org output
27
Basic Data Model
• Linked Data
– BIBFRAME to capture detail of bibliographic records
– Schema.org to deliver structured data for search engines
– Schema.org representation of CMS, NAS, TTE data
– Schema.org enrichment of BIBFRAME
• Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph
– All entities described using Schema.org as a minimum.
28
Basic Data Model
• Linked Data
– BIBFRAME to capture detail of bibliographic records
– Schema.org to deliver structured data for search engines
– Schema.org representation of CMS, NAS, TTE data
– Schema.org enrichment of BIBFRAME
• Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph
– All entities described using Schema.org as a minimum.
29
Basic Data Model
• Linked Data
– BIBFRAME to capture detail of bibliographic records
– Schema.org to deliver structured data for search engines
– Schema.org representation of CMS, NAS, TTE data
– Schema.org enrichment of BIBFRAME
• Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph
– All entities described using Schema.org as a minimum.
30
Basic Data Model
• Linked Data
– BIBFRAME to capture detail of bibliographic records
– Schema.org to deliver structured data for search engines
– Schema.org representation of CMS, NAS, TTE data
– Schema.org enrichment of BIBFRAME
• Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph
– All entities described using Schema.org as a minimum.
31
Basic Data Model
• Linked Data
– BIBFRAME to capture detail of bibliographic records
– Schema.org to deliver structured data for search engines
– Schema.org representation of CMS, NAS, TTE data
– Schema.org enrichment of BIBFRAME
• Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph
– All entities described using Schema.org as a minimum.
32
Data Data Data!
Data Source Source Records Entity Count Update Frequency
ILS 1.4m 56.8m Daily
CMS 82k 228k Weekly
NAS 1.6m 6.7m Monthly
TTE 3k 317k Monthly
3.1m 70.4m
33
Data Ingest Pipelines – ILS
• Source data in MARC-XML files
• Step 1: marc2bibframe2 scripts
– Open source – shared by Library of Congress
– “standard” approach
– Output BIBFRAME as individual RDF-XML files
• Step 2: bibframe2schema.org script
– Open source – Bibframe2Schema.org Community Group
– SPARQL-based script
– Output Schema.org enriched individual BIBFRAME RDF files for loading into
Knowledge graph triplestore
34
Data Ingest Pipelines – ILS
• Source data in MARC-XML files
• Step 1: marc2bibframe2 scripts
– Open source – shared by Library of Congress
– “standard” approach
– Output BIBFRAME as individual RDF-XML files
• Step 2: bibframe2schema.org script
– Open source – Bibframe2Schema.org Community Group
– SPARQL-based script
– Output Schema.org enriched individual BIBFRAME RDF files for loading into
Knowledge graph triplestore
35
Data Ingest Pipelines – ILS
• Source data in MARC-XML files
• Step 1: marc2bibframe2 scripts
– Open source – shared by Library of Congress
– “standard” approach
– Output BIBFRAME as individual RDF-XML files
• Step 2: bibframe2schema.org script
– Open source – Bibframe2Schema.org Community Group
– SPARQL-based script
– Output Schema.org enriched individual BIBFRAME RDF files for loading into
Knowledge graph triplestore
36
Data Ingestion Pipelines – TTE (Authorities)
• Source data in bespoke CSV format
• Exported in collection-based files
– People, Places, organisations, time periods, etc.
• Interrelated references so needed to be considered as a whole
• Bespoke python script
• Creating Schema.org entity descriptions
• Output RDF format file for loading into Knowledge graph
triplestore
37
Data Ingestion Pipelines – TTE (Authorities)
• Source data in bespoke CSV format
• Exported in collection-based files
– People, Places, organisations, time periods, etc.
• Interrelated references so needed to be considered as a whole
• Bespoke python script
• Creating Schema.org entity descriptions
• Output RDF format file for loading into Knowledge graph
triplestore
38
Data Ingestion Pipelines – TTE (Authorities)
• Source data in bespoke CSV format
• Exported in collection-based files
– People, Places, organisations, time periods, etc.
• Interrelated references so needed to be considered as a whole
• Bespoke python script
• Creating Schema.org entity descriptions
• Output RDF format file for loading into Knowledge graph
triplestore
39
Data Ingestion Pipelines – NAS & CMS
• Source data in DC-XML files
• Step 1: DC-XML to DC-Terms RDF
– Bespoke XSLT script to translate from DC record to DCT-RDF
• Step 2: TTE lookup
– Lookup identified values against TTE entities in Knowledge graph
• use URI if matched
• Step 3: Schema.org entity creation
– SPARQL script create schema representation of DCT entities
• Output individual RDF format files for loading into Knowledge graph
40
Data Ingestion Pipelines – NAS & CMS
• Source data in DC-XML files
• Step 1: DC-XML to DC-Terms RDF
– Bespoke XSLT script to translate from DC record to DCT-RDF
• Step 2: TTE lookup
– Lookup identified values against TTE entities in Knowledge graph
• use URI if matched
• Step 3: Schema.org entity creation
– SPARQL script create schema representation of DCT entities
• Output individual RDF format files for loading into Knowledge graph
41
Data Ingestion Pipelines – NAS & CMS
• Source data in DC-XML files
• Step 1: DC-XML to DC-Terms RDF
– Bespoke XSLT script to translate from DC record to DCT-RDF
• Step 2: TTE lookup
– Lookup identified values against TTE entities in Knowledge graph
• use URI if matched
• Step 3: Schema.org entity creation
– SPARQL script create schema representation of DCT entities
• Output individual RDF format files for loading into Knowledge graph
42
Data Ingestion Pipelines – NAS & CMS
• Source data in DC-XML files
• Step 1: DC-XML to DC-Terms RDF
– Bespoke XSLT script to translate from DC record to DCT-RDF
• Step 2: TTE lookup
– Lookup identified values against TTE entities in Knowledge graph
• use URI if matched
• Step 3: Schema.org entity creation
– SPARQL script create schema representation of DCT entities
• Output individual RDF format files for loading into Knowledge graph
43
Technical Architecture (simplified)
Hosted on Amazon Web Services
Batch Scripts
import control
Etc.
SOURCE DATA
IMPORT
44
Technical Architecture (simplified)
Hosted on Amazon Web Services
Pipeline
processing
Batch Scripts
import control
Etc.
SOURCE DATA
IMPORT
45
Technical Architecture (simplified)
Hosted on Amazon Web Services
GraphDB
Cluster
GraphDB
Cluster
GraphDB
Cluster
GraphDB
Cluster
Pipeline
processing
Batch Scripts
import control
Etc.
SOURCE DATA
IMPORT
46
PUBLIC ACCESS
Technical Architecture (simplified)
Hosted on Amazon Web Services
DI
GraphDB
Cluster
GraphDB
Cluster
GraphDB
Cluster
GraphDB
Cluster
Pipeline
processing
Batch Scripts
import control
Etc.
SOURCE DATA
IMPORT
47
CURATOR ACCESS
PUBLIC ACCESS
Technical Architecture (simplified)
Hosted on Amazon Web Services
DI
GraphDB
Cluster
GraphDB
Cluster
GraphDB
Cluster
GraphDB
Cluster
Pipeline
processing
Batch Scripts
import control
Etc.
SOURCE DATA
IMPORT
DMI
48
A need for entity reconciliation …..
• Lots (and lots and lots) of source entities – 70.4 million entities
• Lots of duplication
– Lee, Kuan Yew – 1st Prime Minister of Singapore
• 160 individual entities in ILS source data
– Singapore Art Museum
• Entities from source data
• 21 CMS, 1 NAS, 66 ILS, 1 TTE
• Users only want 1 of each!
49
A need for entity reconciliation …..
• Lots (and lots and lots) of source entities – 70.4 million entities
• Lots of duplication
– Lee, Kuan Yew – 1st Prime Minister of Singapore
• 160 individual entities in ILS source data
– Singapore Art Museum
• Entities from source data
• 21 CMS, 1 NAS, 66 ILS, 1 TTE
• Users only want 1 of each!
50
A need for entity reconciliation …..
• Lots (and lots and lots) of source entities – 70.4 million entities
• Lots of duplication
– Lee, Kuan Yew – 1st Prime Minister of Singapore
• 160 individual entities in ILS source data
– Singapore Art Museum
• Entities from source data
• 21 CMS, 1 NAS, 66 ILS, 1 TTE
• Users only want 1 of each!
51
A need for entity reconciliation …..
• Lots (and lots and lots) of source entities – 70.4 million entities
• Lots of duplication
– Lee, Kuan Yew – 1st Prime Minister of Singapore
• 160 individual entities in ILS source data
– Singapore Art Museum
• Entities from source data
• 21 CMS, 1 NAS, 66 ILS, 1 TTE
• Users only want 1 of each!
52
70.4(million) into 6.1(million) entities does go!
• We know that now - but how did we get there?
• Requirements hurdles to be cleared:
– Not a single source of truth
– Regular automatic updates – add / update / delete
– Manual management of combined entities
• Suppression of incorrect or ’private’ attributes from display
• Addition of attributes not in source data
• Creating / breaking relationships between entities
• Near real-time updates
53
70.4(million) into 6.1(million) entities does go!
• We know that now - but how did we get there?
• Requirements hurdles to be cleared:
– Not a single source of truth
– Regular automatic updates – add / update / delete
– Manual management of combined entities
• Suppression of incorrect or ’private’ attributes from display
• Addition of attributes not in source data
• Creating / breaking relationships between entities
• Near real-time updates
54
70.4(million) into 6.1(million) entities does go!
• We know that now - but how did we get there?
• Requirements hurdles to be cleared:
– Not a single source of truth
– Regular automatic updates – add / update / delete
– Manual management of combined entities
• Suppression of incorrect or ’private’ attributes from display
• Addition of attributes not in source data
• Creating / breaking relationships between entities
• Near real-time updates
55
70.4(million) into 6.1(million) entities does go!
• We know that now - but how did we get there?
• Requirements hurdles to be cleared:
– Not a single source of truth
– Regular automatic updates – add / update / delete
– Manual management of combined entities
• Suppression of incorrect or ’private’ attributes from display
• Addition of attributes not in source data
• Creating / breaking relationships between entities
• Near real-time updates
56
70.4(million) into 6.1(million) entities does go!
• We know that now - but how did we get there?
• Requirements hurdles to be cleared:
– Not a single source of truth
– Regular automatic updates – add / update / delete
– Manual management of combined entities
• Suppression of incorrect or ’private’ attributes from display
• Addition of attributes not in source data
• Creating / breaking relationships between entities
• Near real-time updates
57
Adaptive Data Model Concepts
• Source entitles
– Individual representation of source data
• Aggregation entities
– Tracking relationships between source entities for the same thing
– No copying of attributes
• Primary Entities
– Searchable by users
– Displayable to users
– Consolidation of aggregated source data & managed attributes
58
Adaptive Data Model Concepts
• Source entitles
– Individual representation of source data
• Aggregation entities
– Tracking relationships between source entities for the same thing
– No copying of attributes
• Primary Entities
– Searchable by users
– Displayable to users
– Consolidation of aggregated source data & managed attributes
59
Adaptive Data Model Concepts
• Source entitles
– Individual representation of source data
• Aggregation entities
– Tracking relationships between source entities for the same thing
– No copying of attributes
• Primary Entities
– Searchable by users
– Displayable to users
– Consolidation of aggregated source data & managed attributes
60
61
62
63
64
65
66
67
68
69
Entity Matching
• Candidate matches based on schema:names
– Lucene indexes matching
– Levenshtein similarity refined
– Same entity type
– Entity type specific rules eg:
• Work: name + creator / contributor / author / sameAs
• Person: name + birthDate / deathDate / sameAs
70
Entity Matching
• Candidate matches based on schema:names
– Lucene indexes matching
– Levenshtein similarity refined
– Same entity type
– Entity type specific rules eg:
• Work: name + creator / contributor / author / sameAs
• Person: name + birthDate / deathDate / sameAs
71
Entity Matching
• Candidate matches based on schema:names
– Lucene indexes matching
– Levenshtein similarity refined
– Same entity type
– Entity type specific rules eg:
• Work: name + creator / contributor / author / sameAs
• Person: name + birthDate / deathDate / sameAs
72
Entity Matching
• Candidate matches based on schema:names
– Lucene indexes matching
– Levenshtein similarity refined
– Same entity type
– Entity type specific rules eg:
• Work: name + creator / contributor / author / sameAs
• Person: name + birthDate / deathDate / sameAs
73
Entity Matching
• Candidate matches based on schema:names
– Lucene indexes matching
– Levenshtein similarity refined
– Same entity type
– Entity type specific rules eg:
• Work: name + creator / contributor / author / sameAs
• Person: name + birthDate / deathDate / sameAs
74
The entity iceberg
75
The entity iceberg
Primary
76
The entity iceberg
Primary
Discovery
77
The entity iceberg
Primary
Aggregation
Discovery
78
The entity iceberg
Primary
Aggregation
Source
Ingestion
Pipelines
Discovery
79
The entity iceberg
Primary
Aggregation
Source
Ingestion
Pipelines
Discovery
Management
80
Data Management Interface
• Rapidly developed – easily modified
• Using mataphactory’s semantic visual interface
• A collaboration environment for data curators
• Pre-publish entity management
• Search & discovery – emulates DI
• Dashboard view
• Entity management / Creation
81
Data Management Interface
• Rapidly developed – easily modified
• Using mataphactory’s semantic visual interface
• A collaboration environment for data curators
• Pre-publish entity management
• Search & discovery – emulates DI
• Dashboard view
• Entity management / Creation
82
Data Management Interface
• Rapidly developed – easily modified
• Using mataphactory’s semantic visual interface
• A collaboration environment for data curators
• Pre-publish entity management
• Search & discovery – emulates DI
• Dashboard view
• Entity management / Creation
83
Data Management Interface
• Rapidly developed – easily modified
• Using mataphactory’s semantic visual interface
• A collaboration environment for data curators
• Pre-publish entity management
• Search & discovery – emulates DI
• Dashboard view
• Entity management / Creation
84
Data Management Interface
• Rapidly developed – easily modified
• Using mataphactory’s semantic visual interface
• A collaboration environment for data curators
• Pre-publish entity management
• Search & discovery – emulates DI
• Dashboard view
• Entity management / Creation
85
Data Management Interface
• Rapidly developed – easily modified
• Using mataphactory’s semantic visual interface
• A collaboration environment for data curators
• Pre-publish entity management
• Search & discovery – emulates DI
• Dashboard view
• Entity management / Creation
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
Discovery Interface (DI)
104
105
106
107
108
109
110
111
LDMS System Live – December 2022
• Live – data curation team using DMI
– Entity curation
– Manually merging & splitting aggregations
• Live – data ingest pipelines
– Daily / Weekly / Monthly
– Delta update exports from source systems
• Live – Discovery Interface ready for deployment
• Live – Support
– Issue resolution & updates
112
LDMS System Live – December 2022
• Live – data curation team using DMI
– Entity curation
– Manually merging & splitting aggregations
• Live – data ingest pipelines
– Daily / Weekly / Monthly
– Delta update exports from source systems
• Live – Discovery Interface ready for deployment
• Live – Support
– Issue resolution & updates
113
LDMS System Live – December 2022
• Live – data curation team using DMI
– Entity curation
– Manually merging & splitting aggregations
• Live – data ingest pipelines
– Daily / Weekly / Monthly
– Delta update exports from source systems
• Live – Discovery Interface ready for deployment
• Live – Support
– Issue resolution & updates
114
LDMS System Live – December 2022
• Live – data curation team using DMI
– Entity curation
– Manually merging & splitting aggregations
• Live – data ingest pipelines
– Daily / Weekly / Monthly
– Delta update exports from source systems
• Live – Discovery Interface ready for deployment
• Live – Support
– Issue resolution & updates
115
Discovery Interface
• Live and ready for deployment
– Performance, load and user tested
• Designed & developed by NLB Data Team to demonstrate
characteristics and benefits of a Linked Data service
• NLB services rolled-out by National Library & Public Library divisions
– Recently identified a service for Linked Data integration
• Currently undergoing fine tuning to meet service’s UI requirements
116
Discovery Interface
• Live and ready for deployment
– Performance, load and user tested
• Designed & developed by NLB Data Team to demonstrate
characteristics and benefits of a Linked Data service
• NLB services rolled-out by National Library & Public Library divisions
– Recently identified a service for Linked Data integration
• Currently undergoing fine tuning to meet service’s UI requirements
117
Discovery Interface
• Live and ready for deployment
– Performance, load and user tested
• Designed & developed by NLB Data Team to demonstrate
characteristics and benefits of a Linked Data service
• NLB services rolled-out by National Library & Public Library divisions
– Recently identified a service for Linked Data integration
• Currently undergoing fine tuning to meet service’s UI requirements
118
Daily Delta Ingestion & Reconciliation
119
• Inaccurate source data
– Not apparent in source systems
– Very obvious when aggregated in LDMS
• Eg. Subject and Person URIs duplicated
– Identified and analysed in DMI
– Reported to source curators and fixed
– Updated in next delta update
– Data issues identified in LDMS – source data quality enhanced
Some Challenges Addressed on the Way
120
• Inaccurate source data
– Not apparent in source systems
– Very obvious when aggregated in LDMS
• Eg. Subject and Person URIs duplicated
– Identified and analysed in DMI
– Reported to source curators and fixed
– Updated in next delta update
– Data issues identified in LDMS – source data quality enhanced
Some Challenges Addressed on the Way
121
• Asian name matching
– Often consisting of several short 3-4 letter names
– Lucene matches not of much help
– Introduction of Levenshtein similarity matching helped
– Tuning is challenging – trading off European & Asian names
Some Challenges Addressed on the Way
122
• Asian name matching
– Often consisting of several short 3-4 letter names
– Lucene matches not of much help
– Introduction of Levenshtein similarity matching helped
– Tuning is challenging – trading off European & Asian names
Some Challenges Addressed on the Way
137
From Ambition to Go Live – A Summary
• Ambition based on previous years of trials, experimentation, and understanding of
Linked Data potential
• Ambition to deliver a production LDMS to provide Linked Data curation and
management capable of delivering a public discovery service
• With experienced commercial partners & tools as part of a globally distributed team
– Benefiting from Open Source community developments where appropriate
• Unique and challenging requirements
– Distributed sources of truth in disparate data formats
– Auto updating, consolidated entity view – plus individual entity management
– Web friendly – providing structured data for search engines
• Live and operational actively delivering benefits from December 2022
138
From Ambition to Go Live – A Summary
• Ambition based on previous years of trials, experimentation, and understanding of
Linked Data potential
• Ambition to deliver a production LDMS to provide Linked Data curation and
management capable of delivering a public discovery service
• With experienced commercial partners & tools as part of a globally distributed team
– Benefiting from Open Source community developments where appropriate
• Unique and challenging requirements
– Distributed sources of truth in disparate data formats
– Auto updating, consolidated entity view – plus individual entity management
– Web friendly – providing structured data for search engines
• Live and operational actively delivering benefits from December 2022
139
From Ambition to Go Live – A Summary
• Ambition based on previous years of trials, experimentation, and understanding of
Linked Data potential
• Ambition to deliver a production LDMS to provide Linked Data curation and
management capable of delivering a public discovery service
• With experienced commercial partners & tools as part of a globally distributed team
– Benefiting from Open Source community developments where appropriate
• Unique and challenging requirements
– Distributed sources of truth in disparate data formats
– Auto updating, consolidated entity view – plus individual entity management
– Web friendly – providing structured data for search engines
• Live and operational actively delivering benefits from December 2022
140
From Ambition to Go Live – A Summary
• Ambition based on previous years of trials, experimentation, and understanding of
Linked Data potential
• Ambition to deliver a production LDMS to provide Linked Data curation and
management capable of delivering a public discovery service
• With experienced commercial partners & tools as part of a globally distributed team
– Benefiting from Open Source community developments where appropriate
• Unique and challenging requirements
– Distributed sources of truth in disparate data formats
– Auto updating, consolidated entity view – plus individual entity management
– Web friendly – providing structured data for search engines
• Live and operational actively delivering benefits from December 2022
141
From Ambition to Go Live – A Summary
• Ambition based on previous years of trials, experimentation, and understanding of
Linked Data potential
• Ambition to deliver a production LDMS to provide Linked Data curation and
management capable of delivering a public discovery service
• With experienced commercial partners & tools as part of a globally distributed team
– Benefiting from Open Source community developments where appropriate
• Unique and challenging requirements
– Distributed sources of truth in disparate data formats
– Auto updating, consolidated entity view – plus individual entity management
– Web friendly – providing structured data for search engines
• Live and operational actively delivering benefits from December 2022
From Ambition to Go Live
The National Library Board of Singapore’s Journey
to an operational Linked Data Management
and Discovery System
Richard Wallis
Evangelist and Founder
Data Liberate
richard.wallis@dataliberate.com
@dataliberate
15th Semantic Web in Libraries Conference
SWIB23 - 12th September 2023 - Berlin
1 de 127

Mais conteúdo relacionado

Similar a From Ambition to Go Live SWIB.pdf

Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataEUCLID project
18.5K visualizações67 slides
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataMarin Dimitrov
1.9K visualizações67 slides
WorldCat, Works, and Schema.orgWorldCat, Works, and Schema.org
WorldCat, Works, and Schema.orgRichard Wallis
2.3K visualizações95 slides

Similar a From Ambition to Go Live SWIB.pdf(20)

Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
EUCLID project18.5K visualizações
Telling the World and Our Users What We HaveTelling the World and Our Users What We Have
Telling the World and Our Users What We Have
Richard Wallis1.7K visualizações
Three Linked Data choices for LibrariesThree Linked Data choices for Libraries
Three Linked Data choices for Libraries
Richard Wallis636 visualizações
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
Marin Dimitrov1.9K visualizações
WorldCat, Works, and Schema.orgWorldCat, Works, and Schema.org
WorldCat, Works, and Schema.org
Richard Wallis2.3K visualizações
Schema.org where did that come from?Schema.org where did that come from?
Schema.org where did that come from?
Richard Wallis735 visualizações
NISO Webinar: Library Linked Data: From Vision to RealityNISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to Reality
National Information Standards Organization (NISO)4.8K visualizações
Entification: The Route to 'Useful' Library DataEntification: The Route to 'Useful' Library Data
Entification: The Route to 'Useful' Library Data
Richard Wallis1.7K visualizações
The Web of Data is Our OpportunityThe Web of Data is Our Opportunity
The Web of Data is Our Opportunity
Richard Wallis1.6K visualizações
Introduction to APIs and Linked DataIntroduction to APIs and Linked Data
Introduction to APIs and Linked Data
Adrian Stevenson1.1K visualizações
Structured Data: It's All about the Graph | Richard Wallis, Data LiberateStructured Data: It's All about the Graph | Richard Wallis, Data Liberate
Structured Data: It's All about the Graph | Richard Wallis, Data Liberate
Click Consult (Part of Ceuta Group)402 visualizações
Structured Data: It's All About the Graph!Structured Data: It's All About the Graph!
Structured Data: It's All About the Graph!
Richard Wallis692 visualizações
Extending Schema.orgExtending Schema.org
Extending Schema.org
Richard Wallis4.8K visualizações
Finding Data SetsFinding Data Sets
Finding Data Sets
Anja Jentzsch1.1K visualizações
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
Barbara Starr5.9K visualizações
Web-Scale Discovery: Post ImplementationWeb-Scale Discovery: Post Implementation
Web-Scale Discovery: Post Implementation
Rachel Vacek2.1K visualizações
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
National Information Standards Organization (NISO)1.5K visualizações

Último(20)

MemVerge: Past Present and Future of CXLMemVerge: Past Present and Future of CXL
MemVerge: Past Present and Future of CXL
CXL Forum110 visualizações
Business Analyst Series 2023 -  Week 2 Session 3Business Analyst Series 2023 -  Week 2 Session 3
Business Analyst Series 2023 - Week 2 Session 3
DianaGray10319 visualizações
Spesifikasi Lengkap ASUS Vivobook Go 14Spesifikasi Lengkap ASUS Vivobook Go 14
Spesifikasi Lengkap ASUS Vivobook Go 14
Dot Semarang34 visualizações
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)
CSUC - Consorci de Serveis Universitaris de Catalunya59 visualizações
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray1094 visualizações
CXL at OCPCXL at OCP
CXL at OCP
CXL Forum203 visualizações
Java Platform Approach 1.0 - Picnic MeetupJava Platform Approach 1.0 - Picnic Meetup
Java Platform Approach 1.0 - Picnic Meetup
Rick Ossendrijver24 visualizações
ChatGPT and AI for Web DevelopersChatGPT and AI for Web Developers
ChatGPT and AI for Web Developers
Maximiliano Firtman161 visualizações
Micron CXL product and architecture updateMicron CXL product and architecture update
Micron CXL product and architecture update
CXL Forum23 visualizações
Green Leaf Consulting: Capabilities DeckGreen Leaf Consulting: Capabilities Deck
Green Leaf Consulting: Capabilities Deck
GreenLeafConsulting177 visualizações

From Ambition to Go Live SWIB.pdf

  • 1. From Ambition to Go Live The National Library Board of Singapore’s Journey to an operational Linked Data Management and Discovery System Richard Wallis Evangelist and Founder Data Liberate richard.wallis@dataliberate.com @dataliberate 15th Semantic Web in Libraries Conference SWIB23 - 12th September 2023 - Berlin
  • 2. Independent Consultant, Evangelist & Founder W3C Community Groups: • Bibframe2Schema (Chair) – Standardised conversion path(s) • Schema Bib Extend (Chair) - Bibliographic data • Schema Architypes (Chair) - Archives • Financial Industry Business Ontology – Financial schema.org • Tourism Structured Web Data (Co-Chair) • Schema Course Extension • Schema IoT Community • Educational & Occupational Credentials in Schema.org richard.wallis@dataliberate.com — @dataliberate 40+ Years – Computing 30+ Years – Cultural Heritage technology 20+ Years – Semantic Web & Linked Data Worked With: • Google – Schema.org vocabulary, site, extensions. documentation and community • OCLC – Global library cooperative • FIBO – Financial Industry Business Ontology Group • Various Clients – Implementing/understanding Linked Data, Schema.org: National Library Board Singapore British Library — Stanford University — Europeana 2
  • 3. 3 National Library Board Singapore Public Libraries Network of 28 Public Libraries, including 2 partner libraries* Reading Programmes and Initiatives Programmes and Exhibitions targeted at Singapore communities *Partner libraries are libraries which are partner owned and funded but managed by NLB/NLB’s subsidiary Libraries and Archives Solutions Pte Ltd. Library@Chinatown and the Lifelong Learning Institute Library are Partner libraries. National Archives Transferred from NHB to NLB in Nov 2012 Custodian of Singapore’s Collective Memory: Responsible for Collection, Preservation and Management of Singapore’s Public and Private Archival Records Promotes Public Interest in our Nation’s History and Heritage National Library Preserving Singapore’s Print and Literary Heritage, and Intellectual memory Reference Collections Legal Deposit (including electronic)
  • 4. 4 Over 560,000 Singapore & SEA items Over 147,000 Chinese, Malay & Tamil Languages items Reference Collection Over 62,000 Social Sciences & Humanities items Over 39,000 Science & Technology items Over 53,000 Arts items Over 19,000 Rare Materials items Archival Materials Over 290,000 Government files & Parliament papers Over 190,000 Audiovisual & sound recordings Over 70,000 Maps & building plans Over 1.14m Photographs Over 35,000 Oral history interviews Over 55,000 Speeches & press releases Over 7,000 Posters National Library Board Singapore Over 5m print collection Over 2.4m music tracks 78 databases Over 7,400 e-newspapers and e-magazines titles Over 8,000 e-learning courses Over 1.7m e-books and audio books Lending Collection
  • 5. 5 National Library Board Online Services
  • 6. 7 The Ambition • To enable the discovery & display of entitles from different sources in a combined interface • To bring together resources physical and digital • To bring together diverse systems across the National Library, National Archives, and Public Libraries in a Linked Data Environment • To provide a staff interface to view and manage all entities, their descriptions and relationships
  • 7. 8 The Ambition • To enable the discovery & display of entitles from different sources in a combined interface • To bring together resources physical and digital • To bring together diverse systems across the National Library, National Archives, and Public Libraries in a Linked Data Environment • To provide a staff interface to view and manage all entities, their descriptions and relationships
  • 8. 9 The Ambition • To enable the discovery & display of entitles from different sources in a combined interface • To bring together resources physical and digital • To bring together diverse systems across the National Library, National Archives, and Public Libraries in a Linked Data Environment • To provide a staff interface to view and manage all entities, their descriptions and relationships
  • 9. 10 The Ambition • To enable the discovery & display of entitles from different sources in a combined interface • To bring together resources physical and digital • To bring together diverse systems across the National Library, National Archives, and Public Libraries in a Linked Data Environment • To provide a staff interface to view and manage all entities, their descriptions and relationships
  • 10. 11 The Journey Starts(March 2021) A Linked Data Management System (LDMS) • Cloud-based System • Management Interface • Source data curated at source – ILS, Authorities, CMS, NAS • Linked Data • Public Interface – Google friendly – Schema.org • Shared standards – Bibframe, Schema.org
  • 11. 12 The Journey Starts(March 2021) A Linked Data Management System (LDMS) • Cloud-based System • Management Interface • Source data curated at source – ILS, Authorities, CMS, NAS • Linked Data • Public Interface – Google friendly – Schema.org • Shared standards – Bibframe, Schema.org
  • 12. 13 The Journey Starts(March 2021) A Linked Data Management System (LDMS) • Cloud-based System • Management Interface • Source data curated at source – ILS, Authorities, CMS, NAS • Linked Data • Public Interface – Google friendly – Schema.org • Shared standards – Bibframe, Schema.org
  • 13. 14 The Journey Starts(March 2021) A Linked Data Management System (LDMS) • Cloud-based System • Management Interface • Source data curated at source – ILS, Authorities, CMS, NAS • Linked Data • Public Interface – Google friendly – Schema.org • Shared standards – Bibframe, Schema.org
  • 14. 15 The Journey Starts(March 2021) A Linked Data Management System (LDMS) • Cloud-based System • Management Interface • Source data curated at source – ILS, Authorities, CMS, NAS • Linked Data • Public Interface – Google friendly – Schema.org • Shared standards – Bibframe, Schema.org
  • 15. 16 The Journey Starts(March 2021) A Linked Data Management System (LDMS) • Cloud-based System • Management Interface • Source data curated at source – ILS, Authorities, CMS, NAS • Linked Data • Public Interface – Google friendly – Schema.org • Shared standards – Bibframe, Schema.org
  • 16. 17 The Journey Starts(March 2021) A Linked Data Management System (LDMS) • Cloud-based System • Management Interface • Source data curated at source – ILS, Authorities, CMS, NAS • Linked Data • Public Interface – Google friendly – Schema.org • Shared standards – Bibframe, Schema.org
  • 17. 18 Contract Awarded metaphactory platform Low-code knowledge graph platform Semantic knowledge modeling Semantic search & discovery AWS Partner Public sector partner Singapore based Linked Data, Structured data, Semantic Web, bibliographic meta data, Schema.org and management systems consultant
  • 18. 19 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 19. 20 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 20. 21 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 21. 22 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 22. 23 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 23. 24 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 24. 25 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 25. 26 Initial Challenges • Data – lots of it! – Record-based formats: MARC-XML, DC-XML, CSV • Entities described in the data – lots and lots! • Regular updates – daily, weekly, monthly • Automatic ingestion • Not the single source of truth • Manual update capability • Entity-based search & discovery • Web friendly – Schema.org output
  • 26. 27 Basic Data Model • Linked Data – BIBFRAME to capture detail of bibliographic records – Schema.org to deliver structured data for search engines – Schema.org representation of CMS, NAS, TTE data – Schema.org enrichment of BIBFRAME • Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph – All entities described using Schema.org as a minimum.
  • 27. 28 Basic Data Model • Linked Data – BIBFRAME to capture detail of bibliographic records – Schema.org to deliver structured data for search engines – Schema.org representation of CMS, NAS, TTE data – Schema.org enrichment of BIBFRAME • Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph – All entities described using Schema.org as a minimum.
  • 28. 29 Basic Data Model • Linked Data – BIBFRAME to capture detail of bibliographic records – Schema.org to deliver structured data for search engines – Schema.org representation of CMS, NAS, TTE data – Schema.org enrichment of BIBFRAME • Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph – All entities described using Schema.org as a minimum.
  • 29. 30 Basic Data Model • Linked Data – BIBFRAME to capture detail of bibliographic records – Schema.org to deliver structured data for search engines – Schema.org representation of CMS, NAS, TTE data – Schema.org enrichment of BIBFRAME • Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph – All entities described using Schema.org as a minimum.
  • 30. 31 Basic Data Model • Linked Data – BIBFRAME to capture detail of bibliographic records – Schema.org to deliver structured data for search engines – Schema.org representation of CMS, NAS, TTE data – Schema.org enrichment of BIBFRAME • Schema.org as the ‘lingua franca’ vocabulary of the Knowledge graph – All entities described using Schema.org as a minimum.
  • 31. 32 Data Data Data! Data Source Source Records Entity Count Update Frequency ILS 1.4m 56.8m Daily CMS 82k 228k Weekly NAS 1.6m 6.7m Monthly TTE 3k 317k Monthly 3.1m 70.4m
  • 32. 33 Data Ingest Pipelines – ILS • Source data in MARC-XML files • Step 1: marc2bibframe2 scripts – Open source – shared by Library of Congress – “standard” approach – Output BIBFRAME as individual RDF-XML files • Step 2: bibframe2schema.org script – Open source – Bibframe2Schema.org Community Group – SPARQL-based script – Output Schema.org enriched individual BIBFRAME RDF files for loading into Knowledge graph triplestore
  • 33. 34 Data Ingest Pipelines – ILS • Source data in MARC-XML files • Step 1: marc2bibframe2 scripts – Open source – shared by Library of Congress – “standard” approach – Output BIBFRAME as individual RDF-XML files • Step 2: bibframe2schema.org script – Open source – Bibframe2Schema.org Community Group – SPARQL-based script – Output Schema.org enriched individual BIBFRAME RDF files for loading into Knowledge graph triplestore
  • 34. 35 Data Ingest Pipelines – ILS • Source data in MARC-XML files • Step 1: marc2bibframe2 scripts – Open source – shared by Library of Congress – “standard” approach – Output BIBFRAME as individual RDF-XML files • Step 2: bibframe2schema.org script – Open source – Bibframe2Schema.org Community Group – SPARQL-based script – Output Schema.org enriched individual BIBFRAME RDF files for loading into Knowledge graph triplestore
  • 35. 36 Data Ingestion Pipelines – TTE (Authorities) • Source data in bespoke CSV format • Exported in collection-based files – People, Places, organisations, time periods, etc. • Interrelated references so needed to be considered as a whole • Bespoke python script • Creating Schema.org entity descriptions • Output RDF format file for loading into Knowledge graph triplestore
  • 36. 37 Data Ingestion Pipelines – TTE (Authorities) • Source data in bespoke CSV format • Exported in collection-based files – People, Places, organisations, time periods, etc. • Interrelated references so needed to be considered as a whole • Bespoke python script • Creating Schema.org entity descriptions • Output RDF format file for loading into Knowledge graph triplestore
  • 37. 38 Data Ingestion Pipelines – TTE (Authorities) • Source data in bespoke CSV format • Exported in collection-based files – People, Places, organisations, time periods, etc. • Interrelated references so needed to be considered as a whole • Bespoke python script • Creating Schema.org entity descriptions • Output RDF format file for loading into Knowledge graph triplestore
  • 38. 39 Data Ingestion Pipelines – NAS & CMS • Source data in DC-XML files • Step 1: DC-XML to DC-Terms RDF – Bespoke XSLT script to translate from DC record to DCT-RDF • Step 2: TTE lookup – Lookup identified values against TTE entities in Knowledge graph • use URI if matched • Step 3: Schema.org entity creation – SPARQL script create schema representation of DCT entities • Output individual RDF format files for loading into Knowledge graph
  • 39. 40 Data Ingestion Pipelines – NAS & CMS • Source data in DC-XML files • Step 1: DC-XML to DC-Terms RDF – Bespoke XSLT script to translate from DC record to DCT-RDF • Step 2: TTE lookup – Lookup identified values against TTE entities in Knowledge graph • use URI if matched • Step 3: Schema.org entity creation – SPARQL script create schema representation of DCT entities • Output individual RDF format files for loading into Knowledge graph
  • 40. 41 Data Ingestion Pipelines – NAS & CMS • Source data in DC-XML files • Step 1: DC-XML to DC-Terms RDF – Bespoke XSLT script to translate from DC record to DCT-RDF • Step 2: TTE lookup – Lookup identified values against TTE entities in Knowledge graph • use URI if matched • Step 3: Schema.org entity creation – SPARQL script create schema representation of DCT entities • Output individual RDF format files for loading into Knowledge graph
  • 41. 42 Data Ingestion Pipelines – NAS & CMS • Source data in DC-XML files • Step 1: DC-XML to DC-Terms RDF – Bespoke XSLT script to translate from DC record to DCT-RDF • Step 2: TTE lookup – Lookup identified values against TTE entities in Knowledge graph • use URI if matched • Step 3: Schema.org entity creation – SPARQL script create schema representation of DCT entities • Output individual RDF format files for loading into Knowledge graph
  • 42. 43 Technical Architecture (simplified) Hosted on Amazon Web Services Batch Scripts import control Etc. SOURCE DATA IMPORT
  • 43. 44 Technical Architecture (simplified) Hosted on Amazon Web Services Pipeline processing Batch Scripts import control Etc. SOURCE DATA IMPORT
  • 44. 45 Technical Architecture (simplified) Hosted on Amazon Web Services GraphDB Cluster GraphDB Cluster GraphDB Cluster GraphDB Cluster Pipeline processing Batch Scripts import control Etc. SOURCE DATA IMPORT
  • 45. 46 PUBLIC ACCESS Technical Architecture (simplified) Hosted on Amazon Web Services DI GraphDB Cluster GraphDB Cluster GraphDB Cluster GraphDB Cluster Pipeline processing Batch Scripts import control Etc. SOURCE DATA IMPORT
  • 46. 47 CURATOR ACCESS PUBLIC ACCESS Technical Architecture (simplified) Hosted on Amazon Web Services DI GraphDB Cluster GraphDB Cluster GraphDB Cluster GraphDB Cluster Pipeline processing Batch Scripts import control Etc. SOURCE DATA IMPORT DMI
  • 47. 48 A need for entity reconciliation ….. • Lots (and lots and lots) of source entities – 70.4 million entities • Lots of duplication – Lee, Kuan Yew – 1st Prime Minister of Singapore • 160 individual entities in ILS source data – Singapore Art Museum • Entities from source data • 21 CMS, 1 NAS, 66 ILS, 1 TTE • Users only want 1 of each!
  • 48. 49 A need for entity reconciliation ….. • Lots (and lots and lots) of source entities – 70.4 million entities • Lots of duplication – Lee, Kuan Yew – 1st Prime Minister of Singapore • 160 individual entities in ILS source data – Singapore Art Museum • Entities from source data • 21 CMS, 1 NAS, 66 ILS, 1 TTE • Users only want 1 of each!
  • 49. 50 A need for entity reconciliation ….. • Lots (and lots and lots) of source entities – 70.4 million entities • Lots of duplication – Lee, Kuan Yew – 1st Prime Minister of Singapore • 160 individual entities in ILS source data – Singapore Art Museum • Entities from source data • 21 CMS, 1 NAS, 66 ILS, 1 TTE • Users only want 1 of each!
  • 50. 51 A need for entity reconciliation ….. • Lots (and lots and lots) of source entities – 70.4 million entities • Lots of duplication – Lee, Kuan Yew – 1st Prime Minister of Singapore • 160 individual entities in ILS source data – Singapore Art Museum • Entities from source data • 21 CMS, 1 NAS, 66 ILS, 1 TTE • Users only want 1 of each!
  • 51. 52 70.4(million) into 6.1(million) entities does go! • We know that now - but how did we get there? • Requirements hurdles to be cleared: – Not a single source of truth – Regular automatic updates – add / update / delete – Manual management of combined entities • Suppression of incorrect or ’private’ attributes from display • Addition of attributes not in source data • Creating / breaking relationships between entities • Near real-time updates
  • 52. 53 70.4(million) into 6.1(million) entities does go! • We know that now - but how did we get there? • Requirements hurdles to be cleared: – Not a single source of truth – Regular automatic updates – add / update / delete – Manual management of combined entities • Suppression of incorrect or ’private’ attributes from display • Addition of attributes not in source data • Creating / breaking relationships between entities • Near real-time updates
  • 53. 54 70.4(million) into 6.1(million) entities does go! • We know that now - but how did we get there? • Requirements hurdles to be cleared: – Not a single source of truth – Regular automatic updates – add / update / delete – Manual management of combined entities • Suppression of incorrect or ’private’ attributes from display • Addition of attributes not in source data • Creating / breaking relationships between entities • Near real-time updates
  • 54. 55 70.4(million) into 6.1(million) entities does go! • We know that now - but how did we get there? • Requirements hurdles to be cleared: – Not a single source of truth – Regular automatic updates – add / update / delete – Manual management of combined entities • Suppression of incorrect or ’private’ attributes from display • Addition of attributes not in source data • Creating / breaking relationships between entities • Near real-time updates
  • 55. 56 70.4(million) into 6.1(million) entities does go! • We know that now - but how did we get there? • Requirements hurdles to be cleared: – Not a single source of truth – Regular automatic updates – add / update / delete – Manual management of combined entities • Suppression of incorrect or ’private’ attributes from display • Addition of attributes not in source data • Creating / breaking relationships between entities • Near real-time updates
  • 56. 57 Adaptive Data Model Concepts • Source entitles – Individual representation of source data • Aggregation entities – Tracking relationships between source entities for the same thing – No copying of attributes • Primary Entities – Searchable by users – Displayable to users – Consolidation of aggregated source data & managed attributes
  • 57. 58 Adaptive Data Model Concepts • Source entitles – Individual representation of source data • Aggregation entities – Tracking relationships between source entities for the same thing – No copying of attributes • Primary Entities – Searchable by users – Displayable to users – Consolidation of aggregated source data & managed attributes
  • 58. 59 Adaptive Data Model Concepts • Source entitles – Individual representation of source data • Aggregation entities – Tracking relationships between source entities for the same thing – No copying of attributes • Primary Entities – Searchable by users – Displayable to users – Consolidation of aggregated source data & managed attributes
  • 59. 60
  • 60. 61
  • 61. 62
  • 62. 63
  • 63. 64
  • 64. 65
  • 65. 66
  • 66. 67
  • 67. 68
  • 68. 69 Entity Matching • Candidate matches based on schema:names – Lucene indexes matching – Levenshtein similarity refined – Same entity type – Entity type specific rules eg: • Work: name + creator / contributor / author / sameAs • Person: name + birthDate / deathDate / sameAs
  • 69. 70 Entity Matching • Candidate matches based on schema:names – Lucene indexes matching – Levenshtein similarity refined – Same entity type – Entity type specific rules eg: • Work: name + creator / contributor / author / sameAs • Person: name + birthDate / deathDate / sameAs
  • 70. 71 Entity Matching • Candidate matches based on schema:names – Lucene indexes matching – Levenshtein similarity refined – Same entity type – Entity type specific rules eg: • Work: name + creator / contributor / author / sameAs • Person: name + birthDate / deathDate / sameAs
  • 71. 72 Entity Matching • Candidate matches based on schema:names – Lucene indexes matching – Levenshtein similarity refined – Same entity type – Entity type specific rules eg: • Work: name + creator / contributor / author / sameAs • Person: name + birthDate / deathDate / sameAs
  • 72. 73 Entity Matching • Candidate matches based on schema:names – Lucene indexes matching – Levenshtein similarity refined – Same entity type – Entity type specific rules eg: • Work: name + creator / contributor / author / sameAs • Person: name + birthDate / deathDate / sameAs
  • 79. 80 Data Management Interface • Rapidly developed – easily modified • Using mataphactory’s semantic visual interface • A collaboration environment for data curators • Pre-publish entity management • Search & discovery – emulates DI • Dashboard view • Entity management / Creation
  • 80. 81 Data Management Interface • Rapidly developed – easily modified • Using mataphactory’s semantic visual interface • A collaboration environment for data curators • Pre-publish entity management • Search & discovery – emulates DI • Dashboard view • Entity management / Creation
  • 81. 82 Data Management Interface • Rapidly developed – easily modified • Using mataphactory’s semantic visual interface • A collaboration environment for data curators • Pre-publish entity management • Search & discovery – emulates DI • Dashboard view • Entity management / Creation
  • 82. 83 Data Management Interface • Rapidly developed – easily modified • Using mataphactory’s semantic visual interface • A collaboration environment for data curators • Pre-publish entity management • Search & discovery – emulates DI • Dashboard view • Entity management / Creation
  • 83. 84 Data Management Interface • Rapidly developed – easily modified • Using mataphactory’s semantic visual interface • A collaboration environment for data curators • Pre-publish entity management • Search & discovery – emulates DI • Dashboard view • Entity management / Creation
  • 84. 85 Data Management Interface • Rapidly developed – easily modified • Using mataphactory’s semantic visual interface • A collaboration environment for data curators • Pre-publish entity management • Search & discovery – emulates DI • Dashboard view • Entity management / Creation
  • 85. 86
  • 86. 87
  • 87. 88
  • 88. 89
  • 89. 90
  • 90. 91
  • 91. 92
  • 92. 93
  • 93. 94
  • 94. 95
  • 95. 96
  • 96. 97
  • 97. 98
  • 98. 99
  • 99. 100
  • 100. 101
  • 101. 102
  • 103. 104
  • 104. 105
  • 105. 106
  • 106. 107
  • 107. 108
  • 108. 109
  • 109. 110
  • 110. 111 LDMS System Live – December 2022 • Live – data curation team using DMI – Entity curation – Manually merging & splitting aggregations • Live – data ingest pipelines – Daily / Weekly / Monthly – Delta update exports from source systems • Live – Discovery Interface ready for deployment • Live – Support – Issue resolution & updates
  • 111. 112 LDMS System Live – December 2022 • Live – data curation team using DMI – Entity curation – Manually merging & splitting aggregations • Live – data ingest pipelines – Daily / Weekly / Monthly – Delta update exports from source systems • Live – Discovery Interface ready for deployment • Live – Support – Issue resolution & updates
  • 112. 113 LDMS System Live – December 2022 • Live – data curation team using DMI – Entity curation – Manually merging & splitting aggregations • Live – data ingest pipelines – Daily / Weekly / Monthly – Delta update exports from source systems • Live – Discovery Interface ready for deployment • Live – Support – Issue resolution & updates
  • 113. 114 LDMS System Live – December 2022 • Live – data curation team using DMI – Entity curation – Manually merging & splitting aggregations • Live – data ingest pipelines – Daily / Weekly / Monthly – Delta update exports from source systems • Live – Discovery Interface ready for deployment • Live – Support – Issue resolution & updates
  • 114. 115 Discovery Interface • Live and ready for deployment – Performance, load and user tested • Designed & developed by NLB Data Team to demonstrate characteristics and benefits of a Linked Data service • NLB services rolled-out by National Library & Public Library divisions – Recently identified a service for Linked Data integration • Currently undergoing fine tuning to meet service’s UI requirements
  • 115. 116 Discovery Interface • Live and ready for deployment – Performance, load and user tested • Designed & developed by NLB Data Team to demonstrate characteristics and benefits of a Linked Data service • NLB services rolled-out by National Library & Public Library divisions – Recently identified a service for Linked Data integration • Currently undergoing fine tuning to meet service’s UI requirements
  • 116. 117 Discovery Interface • Live and ready for deployment – Performance, load and user tested • Designed & developed by NLB Data Team to demonstrate characteristics and benefits of a Linked Data service • NLB services rolled-out by National Library & Public Library divisions – Recently identified a service for Linked Data integration • Currently undergoing fine tuning to meet service’s UI requirements
  • 117. 118 Daily Delta Ingestion & Reconciliation
  • 118. 119 • Inaccurate source data – Not apparent in source systems – Very obvious when aggregated in LDMS • Eg. Subject and Person URIs duplicated – Identified and analysed in DMI – Reported to source curators and fixed – Updated in next delta update – Data issues identified in LDMS – source data quality enhanced Some Challenges Addressed on the Way
  • 119. 120 • Inaccurate source data – Not apparent in source systems – Very obvious when aggregated in LDMS • Eg. Subject and Person URIs duplicated – Identified and analysed in DMI – Reported to source curators and fixed – Updated in next delta update – Data issues identified in LDMS – source data quality enhanced Some Challenges Addressed on the Way
  • 120. 121 • Asian name matching – Often consisting of several short 3-4 letter names – Lucene matches not of much help – Introduction of Levenshtein similarity matching helped – Tuning is challenging – trading off European & Asian names Some Challenges Addressed on the Way
  • 121. 122 • Asian name matching – Often consisting of several short 3-4 letter names – Lucene matches not of much help – Introduction of Levenshtein similarity matching helped – Tuning is challenging – trading off European & Asian names Some Challenges Addressed on the Way
  • 122. 137 From Ambition to Go Live – A Summary • Ambition based on previous years of trials, experimentation, and understanding of Linked Data potential • Ambition to deliver a production LDMS to provide Linked Data curation and management capable of delivering a public discovery service • With experienced commercial partners & tools as part of a globally distributed team – Benefiting from Open Source community developments where appropriate • Unique and challenging requirements – Distributed sources of truth in disparate data formats – Auto updating, consolidated entity view – plus individual entity management – Web friendly – providing structured data for search engines • Live and operational actively delivering benefits from December 2022
  • 123. 138 From Ambition to Go Live – A Summary • Ambition based on previous years of trials, experimentation, and understanding of Linked Data potential • Ambition to deliver a production LDMS to provide Linked Data curation and management capable of delivering a public discovery service • With experienced commercial partners & tools as part of a globally distributed team – Benefiting from Open Source community developments where appropriate • Unique and challenging requirements – Distributed sources of truth in disparate data formats – Auto updating, consolidated entity view – plus individual entity management – Web friendly – providing structured data for search engines • Live and operational actively delivering benefits from December 2022
  • 124. 139 From Ambition to Go Live – A Summary • Ambition based on previous years of trials, experimentation, and understanding of Linked Data potential • Ambition to deliver a production LDMS to provide Linked Data curation and management capable of delivering a public discovery service • With experienced commercial partners & tools as part of a globally distributed team – Benefiting from Open Source community developments where appropriate • Unique and challenging requirements – Distributed sources of truth in disparate data formats – Auto updating, consolidated entity view – plus individual entity management – Web friendly – providing structured data for search engines • Live and operational actively delivering benefits from December 2022
  • 125. 140 From Ambition to Go Live – A Summary • Ambition based on previous years of trials, experimentation, and understanding of Linked Data potential • Ambition to deliver a production LDMS to provide Linked Data curation and management capable of delivering a public discovery service • With experienced commercial partners & tools as part of a globally distributed team – Benefiting from Open Source community developments where appropriate • Unique and challenging requirements – Distributed sources of truth in disparate data formats – Auto updating, consolidated entity view – plus individual entity management – Web friendly – providing structured data for search engines • Live and operational actively delivering benefits from December 2022
  • 126. 141 From Ambition to Go Live – A Summary • Ambition based on previous years of trials, experimentation, and understanding of Linked Data potential • Ambition to deliver a production LDMS to provide Linked Data curation and management capable of delivering a public discovery service • With experienced commercial partners & tools as part of a globally distributed team – Benefiting from Open Source community developments where appropriate • Unique and challenging requirements – Distributed sources of truth in disparate data formats – Auto updating, consolidated entity view – plus individual entity management – Web friendly – providing structured data for search engines • Live and operational actively delivering benefits from December 2022
  • 127. From Ambition to Go Live The National Library Board of Singapore’s Journey to an operational Linked Data Management and Discovery System Richard Wallis Evangelist and Founder Data Liberate richard.wallis@dataliberate.com @dataliberate 15th Semantic Web in Libraries Conference SWIB23 - 12th September 2023 - Berlin