Merging controlled vocabularies through semantic alignment based on linked data
Authors: Konstantinos Kyprianos, Ioannis Papadakis
IONIAN UNIVERSITY
DEPARTMENT OF ARCHIVES, LIBRARY SCIENCE AND MUSEOLOGY
Ioannou Theotoki 72, 49100, Corfu
3. Introduction (1/2)
Controlled vocabularies are predefined lists of words used for knowledge organization and the description of library collections
Semantically similar yet syntactically and linguistically heterogeneous controlled vocabularies with overlapping parts are continuously created
Matching tools and techniques: Lexical similarity
◦ Compares terms according to the order of their characters
◦ Edit distance, prefix/suffix variations, n-grams, etc.
Matching tools and techniques: Semantic alignment
◦ Based on semantic techniques to identify similar terms between two structured vocabularies
4. Introduction (2/2)
Our approach:
A methodology that brings together semantically similar yet different vocabularies through the semantic alignment of their underlying terms, employing LOD technologies
◦ Semantic alignment is achieved through external linguistic datasets
◦ No structure of any kind (schema or ontology) is required of the compared datasets
5. Proposed approach
• S is the set of terms in the source dataset
• T is the set of terms in the target dataset
• L is the set of terms in the linguistic dataset
• L' is the set of terms in L that are found to be linguistically associated with some terms of the source dataset
• L'' is the set of terms in L that are found to be semantically associated with some terms of L'
• T' contains the terms in T that are linguistically associated with some terms of L' or L''
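The set relations above can be sketched in Python. The toy vocabularies, the similarity threshold and the `synonyms` map (standing in for semantic links inside the linguistic dataset) are all illustrative assumptions, not the paper's data:

```python
from difflib import SequenceMatcher

def lexically_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Toy lexical similarity: edit-distance-style ratio on casefolded strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio() >= threshold

# Toy stand-ins for the datasets (illustrative terms only)
S = {"Economics", "Data mining"}           # source dataset (e.g. Dione)
T = {"Economy", "Data Mining", "Sports"}   # target dataset (e.g. NYT)
L = {"Economics", "Data mining", "Music"}  # linguistic dataset (e.g. DBpedia)
synonyms = {"Economics": {"Economy"}}      # stand-in for semantic links inside L

# L1 (L'):  terms of L lexically associated with some term of S
L1 = {l for l in L if any(lexically_match(l, s) for s in S)}
# L2 (L''): terms of L semantically associated with some term of L'
L2 = {syn for l in L1 for syn in synonyms.get(l, set())}
# T1 (T'):  terms of T lexically associated with some term of L' or L''
T1 = {t for t in T if any(lexically_match(t, l) for l in L1 | L2)}
print(sorted(T1))
```

Note how "Economy" reaches T' only through the semantic link from "Economics": a purely lexical comparison of S against T at a high threshold would miss it.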
6. Proof of concept (1/2)
University of Piraeus digital library (Dione)
◦ Theses and dissertations
◦ 3,323 bilingual subject headings
◦ DSpace installation
New York Times – NYT
◦ Approximately 10,000 subject headings
◦ Journal articles
DBpedia
◦ Extracts structured information from Wikipedia
◦ 3.5 million entities
WordNet
◦ Lexical database
◦ Consists of synsets (~117,659 distinct concepts containing terms
interlinked through conceptual-semantic relations)
7. Proof of concept (2/2)
1. Let the source dataset S be D (i.e. Dione)
2. Let the target dataset T be N (i.e. NYT)
3. Let linguistic dataset A be DB (i.e. DBpedia) and
4. Let linguistic dataset B be W (i.e. WordNet)
5. D1' corresponds to S', assuming that the linguistic dataset L is DB. In a similar manner, D2' corresponds to S', assuming that the linguistic dataset L is W.
6. DB' and DB'' correspond to L' and L'' respectively, assuming that the linguistic dataset L is DB. In a similar manner, W' and W'' correspond to L' and L'' respectively, assuming that the linguistic dataset L is W.
7. N1' corresponds to T', assuming that the linguistic dataset L is DB. In a similar manner, N2' corresponds to T', assuming that the linguistic dataset L is W.
8. Deployment of the proposed approach
Google Refine
◦ Tool to manipulate tabular data
◦ Reconciliation of data with existing knowledge bases
◦ RDF extension
Process
1. Subject headings from Dione are imported to Google Refine
2. DBpedia and WordNet endpoints are registered in Google Refine as SPARQL reconciliation services
3. The subject headings of Dione are linguistically matched (i.e. lexical similarity) against DBpedia's and WordNet's reconciliation services, creating the corresponding subsets
4. The terms in the subsets of step 3 are extended with semantically equivalent terms (i.e. semantic alignment) deriving from the rest of DBpedia and WordNet
5. Subject headings from NYT are imported to Google Refine
6. The subject headings of NYT are linguistically matched (i.e. lexical similarity) against the terms belonging to the subsets described in steps 3 and 4
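Steps 3 and 4 can be illustrated with a single SPARQL lookup against DBpedia. The query below is a hedged sketch, not the exact query issued by Google Refine's RDF extension (which the slides do not show); it uses the real DBpedia properties `rdfs:label` and `dbo:wikiPageRedirects` to find a resource matching a subject heading and to pull the labels of resources redirecting to it as semantic equivalents:

```python
def reconciliation_query(heading: str) -> str:
    """Build a sketch SPARQL query: find a DBpedia resource whose English
    label equals a subject heading (lexical match, step 3) and collect
    labels of resources redirecting to it as semantic equivalents (step 4)."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource ?equivLabel WHERE {{
  ?resource rdfs:label "{heading}"@en .
  OPTIONAL {{
    ?redirect dbo:wikiPageRedirects ?resource ;
              rdfs:label ?equivLabel .
  }}
}}"""

query = reconciliation_query("Economics")
```

The `OPTIONAL` block keeps the lexical match even when a resource has no redirects, so step 3 results survive intact and step 4 simply enriches them when equivalents exist.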
9. Deployment results (1/2)
Linguistically matched terms between
◦ Dione and DBpedia
◦ Dione and WordNet
through lexical similarity techniques

                                     DBpedia      WordNet
One-word subject headings            331 (29%)    297 (65%)
Two-word subject headings            658 (59%)    128 (28%)
Subject headings with 3+ words       130 (12%)     30 (7%)
Subject headings with subdivisions     0            0
Sum (1,574)                          1,119        455
10. Deployment results (2/2)
[Figure: diagram of the resulting subsets]
D (Dione) = 3,323 terms; N (NYT) = 10,000 terms
D1' = 1,119 and D2' = 455 (terms of D matched through DBpedia and WordNet respectively)
N1' = 163 and N2' = 117 (terms of N matched through DBpedia and WordNet respectively)
The intermediate subsets DB', DB'', W' and W'' also appear in the figure, with region sizes 986, 5,700, 72, 86, 77 and 45.
11. Comparative evaluation (1/4)
The proposed methodology is compared against an algorithm (introduced in a previous work*) addressed to Dione and NYT, based only on lexical similarity techniques
Dione and NYT are not described by schemas. Thus, any attempt to merge their underlying terms cannot be based on traditional ontology-alignment techniques

*Papadakis, I., Kyprianos, K.: Merging Controlled Vocabularies for More Efficient Subject-based Search. International Journal of Knowledge Management 7(3), 76-90, July-September (2011)
12. Comparative evaluation (2/4)
List A: 207 matched pairs
List B: 280 matched pairs

List A (previous work): only lexically matched pairs between Dione and NYT
List B (proposed work): lexically AND semantically matched pairs between Dione and NYT
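As a quick sanity check on these counts, moving from purely lexical matching to lexical plus semantic matching adds 73 pairs, a relative gain of roughly 35%:

```python
list_a = 207  # previous work: lexically matched pairs only
list_b = 280  # proposed work: lexically AND semantically matched pairs
gain = (list_b - list_a) / list_a
print(f"+{list_b - list_a} pairs, {gain:.0%} relative improvement")
```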
15. Conclusions
A methodology was presented that is capable of finding equivalent terms between semantically similar controlled vocabularies
Lexical similarity discovery and semantic alignment through external LOD datasets
Google Refine renders the deployment of the proposed methodology a straightforward process that can be applied to other cases aiming at discovering equivalent terms in different yet semantically similar datasets
The deployment of the proposed methodology is facilitated through the employment of linked data technologies
16. Future work
Future work is targeted towards the reconciliation of Dione's subject headings with linked data services such as the French National Library (RAMEAU), the German National Library (GND), the Biblioteca Nacional de España (BNE) and LIBRIS.