Identifying alignments between vocabularies has become a central knowledge engineering activity. A plethora of alignment techniques has been developed over the past
years. In this paper we present a case study in which we examine and evaluate the practical use of three typical alignment techniques. The study involves the alignment
of two vocabularies used in a semantic-search engine for cultural-heritage objects. We show that applying the techniques in sequence can be beneficial. The case study gives insight into evaluation issues, such as techniques for the identification of false positives. We see this work as a step towards a badly needed methodology for alignment.
2. Vocabulary Alignments Many museums, libraries and archives capture their knowledge in structured vocabularies covering similar areas (materials, subject matter). One goal in the CH field is data integration: making museum collections and their vocabularies available through common portals. Our solution is to align vocabularies to each other and/or to large, commonly used resources.
3. Aligning is difficult Differences in: Lexical conventions Structure/metamodel Ontological commitments Use of: Jargon Homonyms/polysemes Background knowledge/implicit context
4. What can we achieve with current alignment tools? A wide selection of alignment tools exists OAEI workshop: alignment tools are tested on benchmark sets and on real-world applications A practical methodology that tells us which tool to use in which situation is still lacking How can I get better results on MY data?
6. Research Question Does combining alignment techniques have added value? If yes, we need a methodology that tells us how to combine alignment techniques.
7. Case Study Setup 2 data sets in Dutch: RKD subject thesaurus Cornetto, a lexical thesaurus linked to WordNet Alignment techniques for generating exact matches Baseline technique Lexical technique Structural technique Manual evaluation Techniques for improving precision/recall Combining alignment techniques to improve recall and precision Disambiguation techniques for improving precision
8. Data Sets RKD subject thesaurus 3,342 concepts 3,342 preferred labels 242 alternative labels Broader, narrower and related relations between concepts Cornetto 70,434 synsets 102,572 sense-labels 16 relation types including hypernym relation One word can be part of multiple synsets Rationale: link small to large hub vocabularies Small specialized vocabularies are frequent (in the CH field) Linking to large vocabularies adds synonyms and relations
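The two data models described above can be sketched minimally as follows. This is an illustrative sketch only: the class and field names are placeholders, not the actual RKD or Cornetto schemas.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """An RKD subject-thesaurus entry: one preferred label, optional
    alternative labels, and broader/narrower/related links."""
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)    # ids of broader concepts
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)

@dataclass
class Synset:
    """A Cornetto entry: a set of sense labels plus typed relations
    (only the hypernym relation is shown here)."""
    sense_labels: list
    hypernyms: list = field(default_factory=list)  # ids of hypernym synsets

# One word can be part of multiple synsets -- the source of the
# disambiguation problem discussed later.
bank_financial = Synset(sense_labels=["bank", "financial institution"])
bank_river = Synset(sense_labels=["bank", "riverbank"])
```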
9. Alignment techniques Baseline technique: optimizes precision Plain string matching Ignores ambiguous matches Lexical technique (STITCH tool): increases recall Matches terms and uses lemmatization and compound splitting Returns all (possibly ambiguous) matches Structural technique (Falcon-AO): best tool in town (OAEI 2007) Uses the structure of vocabularies Uses lexical measures, lemmatization and distance metrics
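As an illustration, the baseline behaviour, exact string matching that simply drops ambiguous hits, can be sketched like this. It is a minimal sketch of the idea, not the code of any of the actual tools.

```python
def baseline_align(concepts, synsets):
    """Plain string matching from concept labels to sense labels.

    concepts: dict concept_id -> preferred label
    synsets:  dict synset_id -> list of sense labels
    A concept is aligned only when its label matches exactly one synset;
    ambiguous matches are ignored, trading recall for precision.
    """
    # Index every sense label to the synsets that carry it.
    index = {}
    for sid, labels in synsets.items():
        for label in labels:
            index.setdefault(label.lower(), []).append(sid)

    alignments = {}
    for cid, label in concepts.items():
        hits = index.get(label.lower(), [])
        if len(hits) == 1:          # ignore ambiguous matches
            alignments[cid] = hits[0]
    return alignments
```

The lexical technique differs mainly in normalising the labels first (lemmatization, compound splitting) and in keeping the ambiguous hits instead of dropping them.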
11. Evaluation 1 person (me) evaluated the entire set 2,493 concepts with 4,375 alignments Taking approximately 26 person-hours 5 (external) people evaluated small samples of alignments to validate the manual evaluation 50 concepts with around 80 alignments Taking 17 minutes on average
12. Validation of Manual Evaluation We measured inter-observer agreement for exact matches between me and the 5 raters using Cohen’s Kappa κ= 0.70 Reasons for disagreement: Disagreement in the vocabulary interpretation Vocabulary error Human error We will use the list of correct exact-matches as a “Gold Standard” to compare the performance of the tools
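Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance; a minimal computation looks like this:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    individual category frequencies.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
              for c in categories)
    if p_e == 1:                    # both raters always use one category
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A value of κ = 0.70 indicates substantial but far from perfect agreement, which is why the disagreements were inspected by hand.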
13. Qualitative Results The tools found no alignments for 849 concepts Recall is based on the correct exact-matches that were found
15. Disambiguation Total aligned concepts 2,493 with 4,375 alignments 860 concepts have more than one alignment with a total of 2,712 alignments From the manual evaluation we know that many of these alignments are wrong We will disambiguate alignments using the structure of the vocabularies (broader/hyponym relations) Child match Parent match
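The parent-match idea, keeping the candidate whose parents are themselves aligned to the source concept's parents, can be sketched as follows. This is an illustrative sketch; child match works the same way on the children, and the paper's exact scoring criterion may differ in detail.

```python
def parent_match(alignments, src_parents, tgt_parents):
    """For each ambiguous source concept, keep the candidate target(s)
    best supported by alignments between the concepts' parents.

    alignments:  dict src_id -> set of candidate tgt_ids
    src_parents: dict src_id -> set of parent src_ids (broader relation)
    tgt_parents: dict tgt_id -> set of parent tgt_ids (hypernym relation)
    """
    kept = {}
    for src, targets in alignments.items():
        if len(targets) == 1:
            kept[src] = targets
            continue

        def support(tgt):
            # Count parents of tgt that are aligned to a parent of src.
            return sum(
                1
                for p_src in src_parents.get(src, set())
                for p_tgt in alignments.get(p_src, set())
                if p_tgt in tgt_parents.get(tgt, set())
            )

        scores = {t: support(t) for t in targets}
        best = max(scores.values())
        if best > 0:
            kept[src] = {t for t in targets if scores[t] == best}
        else:
            kept[src] = targets     # no structural evidence to decide on
    return kept
```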
17. Disambiguation Results Child match: 120 out of 449 alignments for 112 concepts have the highest number of child alignments, with 24% false positives and 10% false negatives Parent match: 234 out of 561 alignments for 185 concepts had the highest number of parent alignments, with 22% false positives and 12% false negatives Small overlap of 59 alignments for 18 concepts A third of the ambiguous alignments is resolved using the two disambiguation methods: for 279 out of 860 concepts we keep 336 alignments and throw away 615 alignments
19. Conclusion and Future Work A methodology is much needed in this area Our next step is to see how alignment techniques can be combined on larger vocabularies: We are currently working on experiments with Getty’s AAT and Princeton WordNet
20. Thanks and Acknowledgements Cornetto project team The Netherlands Institute for Art History (RKD) Antoine Isaac and the STITCH team Wei Hu (Falcon) Mark van Assem, Willem van Hage, Laura Hollink and Jan Wielemaker for their contribution to the alignment evaluation Bob Wielinga for comments on earlier versions of the paper
Editor’s Notes
Lexical convention example: plural vs singular. Structure example: hypernymy vs broader-than. Ontological commitments: OWL vs SKOS. Jargon: expert terms vs layman terms. Homonyms: bank (financial institution) vs bank (river). Polysemes: to milk (act of) and milk (product). Background knowledge: the application domain can define the meaning of concepts.
This is what my, well, our data looks like. We have a number of museum datasets; the blobs with the same color indicate that. These datasets contain artwork metadata, and vocabularies describing people, locations, concepts and events. We have some domain-specific vocabularies such as Getty’s Art and Architecture Thesaurus, and lexical resources in various languages: English, Dutch and French WordNets. In this talk I will focus on the alignment between these two blobs. Our problem essentially is that none of the tools we tried works well enough on our data, so our main research question is…
Does combining alignment techniques or tools have added value? If yes, then we need a methodology that tells us how to combine alignment techniques given certain goals. As a first step we performed a case study.
RKD: The Netherlands Institute for Art History. We performed a manual evaluation to determine the correctness of the alignments. Because the lexical thesaurus contains many homonyms, we also applied disambiguation techniques to improve precision.
RKD: The Netherlands Institute for Art History. The RKD subject thesaurus contains fewer than three and a half thousand concepts, each with a preferred label. There are few alternative labels. There are also broader, narrower and related relations between concepts. Cornetto contains over 70 thousand synsets with over 100 thousand labels. A large portion of the synsets has a single label, but there are synsets with over a dozen labels. There are 16 relation types, including the hypernym relation. One word can be part of multiple synsets, which creates a disambiguation problem for automatic alignment techniques. Our rationale is to link small vocabularies to large hub vocabularies, as small vocabularies are frequent in the Cultural Heritage field. Also, linking to large vocabularies adds new synonyms and relations, which makes the data more searchable.
We used the following alignment techniques. The baseline technique optimizes precision: it performs plain string matching and simply ignores ambiguous matches. The lexical technique, for which we used the tool from the STITCH project, is geared towards increasing recall: it matches terms using lemmatization and compound splitting, and returns all matches found, even ambiguous ones. Finally, we have the structural technique; here we used Falcon-AO, which was the best-performing tool in the 2007 alignment workshop. It uses the structure of the vocabularies as well as lexical measures, lemmatization and distance metrics for finding the best possible alignments. The three tools found overlapping sets of candidate alignments. Note that at this point we do not say anything about the quality of the alignments.
So what were the quantitative results? The three tools together returned over 4,000 candidate alignments. The baseline tool found 30% of all candidate alignments, Falcon found 60% and the STITCH tool 86%. There is a large overlap between the three tools, as well as between STITCH and Falcon. Almost half of the alignments found by STITCH were not found by the other two techniques. But how good are these alignments?
In order to find out about the quality of alignments we performed a manual evaluation of the alignments. One person, you may guess who, evaluated the entire set taking approximately 26 hours. We also had 5 external people evaluating separate sets of sample alignments to validate the manual evaluation each taking 17 minutes on average.
We then measured inter-observer agreement for exact matches between me and the 5 raters using Cohen’s kappa and found a kappa of 0.70, which is relatively low. This just goes to show how difficult it is to evaluate alignments, even for humans. The reasons for disagreement were either differences in the interpretation of the terms of the vocabularies, differences in dealing with errors in the vocabulary, or plain human error, such as accidentally pushing the wrong button without noticing. We will use the list of correct exact matches as a Gold Standard to compare the performance of the tools.
For the baseline tool we have a high number of correct exact matches, as expected. About half of the non-exact-match alignments are some semantic relation like broader, narrower or related. The incorrect alignments were entirely due to homonyms, where one meaning appears in one vocabulary and the other meaning in the other vocabulary. Still, the baseline tool returned approximately half of the total number of correct alignments found. For the STITCH tool the picture is quite different, with only about half of the alignments being correct and most of the non-exact-match alignments entirely incorrect, although 750 more correct concepts were found than by the baseline. The performance of the Falcon tool is somewhere in the middle: a higher percentage of correct exact matches than the STITCH tool but slightly lower coverage; still, Falcon also returned significantly more correct alignments than the baseline tool. Looking at the distinct total, we see that for almost every concept aligned there is at least one correct alignment. In addition, the tools found no alignments for around 850 concepts of the subject thesaurus, so for around 1,000 concepts we have no correct alignment. Our recall is based on the correct exact matches found and could also have been called coverage.
The most interesting conclusions can be drawn where the STITCH tool and Falcon do not overlap with the baseline. In the overlap between STITCH and Falcon the precision drops to 50%, while for alignments found only by Falcon it drops further to 29%, and further still to 25% for alignments found only by STITCH. One key observation is that most of the alignments found only by STITCH, and to some extent by Falcon, are ambiguous: for a single source concept we have multiple alignments. To tackle this problem we also perform automatic disambiguation.
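The per-region precision analysis behind these numbers can be sketched as follows. This is illustrative Python, not the tooling actually used; the tool names and alignment pairs in the usage below are placeholders, not the real data.

```python
def precision_by_region(tool_results, gold):
    """Partition candidate alignments by the set of tools that found
    them, and report precision per region against a gold standard.

    tool_results: dict tool_name -> set of (src, tgt) pairs
    gold:         set of correct (src, tgt) pairs
    Returns: dict frozenset(tool names) -> precision in that region.
    """
    regions = {}
    all_pairs = set().union(*tool_results.values())
    for pair in all_pairs:
        found_by = frozenset(t for t, res in tool_results.items()
                             if pair in res)
        regions.setdefault(found_by, []).append(pair)
    return {
        region: sum(p in gold for p in pairs) / len(pairs)
        for region, pairs in regions.items()
    }
```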
So here is a small sample of the hierarchies of the two thesauri. The source thesaurus is the RKD subject thesaurus and the target thesaurus is Cornetto. For the child-match technique, when we have two alignments for a single source concept, we look at the bottom of the hierarchy to see whether there are any alignments between the children of these concepts. In this case we have two child alignments for the topmost candidate, so we assume that this is the correct alignment and discard the other one. The parent-match technique works similarly, except that there we look at the top of the hierarchy. Again we have a source concept with multiple alignments, but in this case we look at the parents of the concepts, and if there is at least one parent alignment we consider that alignment to be correct and discard the other one. So what were the results of the disambiguation?
The key part of this slide is that there is only a small overlap between the two methods, and that with this computationally cheap approach we were able to disambiguate a third of all the ambiguous alignments. About 23% of the alignments we keep are actually false positives, and of the discarded alignments about 10% were correct. For more information I refer you to the paper.
So in general, when it comes to combining the tools and the use of disambiguation, we found the following: by doing an additional manual evaluation of a very selective subset we can boost the recall and precision even further. This is described in more detail in the paper.
I would like to conclude by saying that a methodology is much needed in this area. With regard to future work, our next step is to see how alignment techniques can be combined on larger vocabularies. We are currently working on experiments with Getty’s Art and Architecture Thesaurus and Princeton WordNet.