Merging controlled vocabularies through semantic alignment based on linked data
Authors: Konstantinos Kyprianos, Ioannis Papadakis
IONIAN UNIVERSITY
DEPARTMENT OF ARCHIVES, LIBRARY SCIENCE AND MUSEOLOGY
Ioannou Theotoki 72, 49100, Corfu
3. Introduction (1/2)
Controlled vocabularies are predefined lists of words used for knowledge organization and the description of library collections
Semantically similar yet syntactically and linguistically heterogeneous controlled vocabularies with overlapping parts are continuously created
Matching tools and techniques: Lexical similarity
◦ Compares terms according to the order of their characters
◦ Edit distance, prefix/suffix variations, n-grams, etc.
Matching tools and techniques: Semantic alignment
◦ Based on semantic techniques to identify similar terms between two structured vocabularies
4. Introduction (2/2)
Our approach:
A methodology that brings together semantically similar yet different vocabularies through the semantic alignment of their underlying terms, employing LOD technologies
◦ Semantic alignment is achieved through external linguistic datasets
◦ No structure of any kind (schema or ontology) is required of the compared datasets
5. Proposed approach
• S is the set of terms in the source dataset
• T is the set of terms in the target dataset
• L is the set of terms in the linguistic dataset
• L' is the set of terms in L that are found to be linguistically associated with some terms of the source dataset
• L'' is the set of terms in L that are found to be semantically associated with some terms of L'
• T' contains the terms in T that are linguistically associated with some terms of L' or L''
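The set relations above can be sketched in Python. The toy vocabularies, the similarity threshold and the `synonyms` map (standing in for semantic links inside the linguistic dataset) are all illustrative assumptions, not the paper's data:

```python
from difflib import SequenceMatcher

def lexically_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Toy lexical similarity: edit-distance-style ratio on casefolded strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio() >= threshold

# Toy stand-ins for the datasets (illustrative terms only)
S = {"Economics", "Data mining"}           # source dataset (e.g. Dione)
T = {"Economy", "Data Mining", "Sports"}   # target dataset (e.g. NYT)
L = {"Economics", "Data mining", "Music"}  # linguistic dataset (e.g. DBpedia)
synonyms = {"Economics": {"Economy"}}      # stand-in for semantic links inside L

# L1 (L'):  terms of L lexically associated with some term of S
L1 = {l for l in L if any(lexically_match(l, s) for s in S)}
# L2 (L''): terms of L semantically associated with some term of L'
L2 = {syn for l in L1 for syn in synonyms.get(l, set())}
# T1 (T'):  terms of T lexically associated with some term of L' or L''
T1 = {t for t in T if any(lexically_match(t, l) for l in L1 | L2)}
print(sorted(T1))
```

Note how "Economy" reaches T' only through the semantic link from "Economics": a purely lexical comparison of S against T at a high threshold would miss it.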
6. Proof of concept (1/2)
University of Piraeus digital library (Dione)
◦ Theses and dissertations
◦ 3,323 bilingual subject headings
◦ DSpace installation
New York Times – NYT
◦ Approximately 10,000 subject headings
◦ Journal articles
DBpedia
◦ Extracts structured information from Wikipedia
◦ 3.5 million entities
WordNet
◦ Lexical database
◦ Consists of synsets (~117,659 distinct concepts containing terms
interlinked through conceptual-semantic relations)
7. Proof of concept (2/2)
1. Let the source dataset S be D (i.e. Dione)
2. Let the target dataset T be N (i.e. NYT)
3. Let linguistic dataset A be DB (i.e. DBpedia) and
4. Let linguistic dataset B be W (i.e. WordNet)
5. D1' corresponds to S', assuming that the linguistic dataset L is DB. In a similar manner, D2' corresponds to S', assuming that the linguistic dataset L is W.
6. DB' and DB'' correspond to L' and L'' respectively, assuming that the linguistic dataset L is DB. In a similar manner, W' and W'' correspond to L' and L'' respectively, assuming that the linguistic dataset L is W.
7. N1' corresponds to T', assuming that the linguistic dataset L is DB. In a similar manner, N2' corresponds to T', assuming that the linguistic dataset L is W.
8. Deployment of the proposed approach
Google Refine
◦ Tool to manipulate tabular data
◦ Reconciliation of data with existing knowledge bases
◦ RDF extension
Process
1. Subject headings from Dione are imported to Google Refine
2. DBpedia and WordNet endpoints are registered in Google Refine as SPARQL reconciliation services
3. The subject headings of Dione are linguistically matched (i.e. lexical similarity) against DBpedia's and WordNet's reconciliation services, creating the corresponding subsets
4. The terms in the subsets of step 3 are extended with semantically equivalent terms (i.e. semantic alignment) deriving from the rest of DBpedia and WordNet
5. Subject headings from NYT are imported to Google Refine
6. The subject headings of NYT are linguistically matched (i.e. lexical similarity) against the terms belonging to the subsets described in steps 3 and 4
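Steps 3 and 4 can be illustrated with a single SPARQL lookup against DBpedia. The query below is a hedged sketch, not the exact query issued by Google Refine's RDF extension (which the slides do not show); it uses the real DBpedia properties `rdfs:label` and `dbo:wikiPageRedirects` to find a resource matching a subject heading and to pull the labels of resources redirecting to it as semantic equivalents:

```python
def reconciliation_query(heading: str) -> str:
    """Build a sketch SPARQL query: find a DBpedia resource whose English
    label equals a subject heading (lexical match, step 3) and collect
    labels of resources redirecting to it as semantic equivalents (step 4)."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?equivLabel WHERE {{
  ?resource rdfs:label "{heading}"@en .
  OPTIONAL {{
    ?redirect dbo:wikiPageRedirects ?resource ;
              rdfs:label ?equivLabel .
  }}
}}"""

query = reconciliation_query("Economics")
```

The `OPTIONAL` block keeps the lexical match even when a resource has no redirects, so step 3 results survive intact and step 4 simply enriches them when equivalents exist.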
9. Deployment results (1/2)
Linguistically matched terms between
◦ Dione and DBpedia
◦ Dione and WordNet
through lexical similarity techniques

                                     DBpedia      WordNet
One-word subject headings            331 (29%)    297 (65%)
Two-word subject headings            658 (59%)    128 (28%)
Subject headings with 3+ words       130 (12%)     30 (7%)
Subject headings with subdivisions     0            0
Sum (1,574)                          1,119        455
10. Deployment results (2/2)
[Figure: diagram of the resulting subsets]
D (Dione) = 3,323 terms; N (NYT) = 10,000 terms
D1' = 1,119 and D2' = 455 (terms of D matched through DBpedia and WordNet respectively)
N1' = 163 and N2' = 117 (terms of N matched through DBpedia and WordNet respectively)
The intermediate subsets DB', DB'', W' and W'' also appear in the figure, with region sizes 986, 5,700, 72, 86, 77 and 45.
11. Comparative evaluation (1/4)
The proposed methodology is compared against an algorithm (introduced in a previous work*) addressed to Dione and NYT, based only on lexical similarity techniques
Dione and NYT are not described by schemas. Thus, any attempt to merge their underlying terms cannot be based on traditional ontology-alignment techniques

*Papadakis, I., Kyprianos, K.: Merging Controlled Vocabularies for More Efficient Subject-based Search. International Journal of Knowledge Management 7(3), 76-90, July-September (2011)
12. Comparative evaluation (2/4)
List A: 207 matched pairs
List B: 280 matched pairs

List A (previous work): only lexically matched pairs between Dione and NYT
List B (proposed work): lexically AND semantically matched pairs between Dione and NYT
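As a quick sanity check on these counts, moving from purely lexical matching to lexical plus semantic matching adds 73 pairs, a relative gain of roughly 35%:

```python
list_a = 207  # previous work: lexically matched pairs only
list_b = 280  # proposed work: lexically AND semantically matched pairs
gain = (list_b - list_a) / list_a
print(f"+{list_b - list_a} pairs, {gain:.0%} relative improvement")
```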
15. Conclusions
A methodology was presented that is capable of finding equivalent terms between semantically similar controlled vocabularies
Lexical similarity discovery and semantic alignment through external LOD datasets
Google Refine renders the deployment of the proposed methodology a straightforward process that can be applied to other cases aiming at discovering equivalent terms in different yet semantically similar datasets
The deployment of the proposed methodology is facilitated through the employment of linked data technologies
16. Future work
Future work is targeted towards the reconciliation of Dione's subject headings with linked data services such as the French National Library (RAMEAU), the German National Library (GND), the Biblioteca Nacional de España (BNE) and LIBRIS.