Identifying alignments between vocabularies has become a central knowledge engineering activity. A plethora of alignment techniques has been developed over the past
years. In this paper we present a case study in which we examine and evaluate the practical use of three typical alignment techniques. The study involves the alignment
of two vocabularies used in a semantic-search engine for cultural-heritage objects. We show that applying the techniques in sequence can be beneficial. The case study gives insight into evaluation issues, such as techniques for the identification of false positives. We see this work as a step towards a badly needed methodology for alignment.
2. Vocabulary Alignments Many museums, libraries and archives capture their knowledge in structured vocabularies covering similar areas (materials, subject matter). One goal in the CH field is data integration: making museum collections and their vocabularies available through common portals. Our solution is to align vocabularies to each other and/or to large, commonly used resources.
3. Aligning is difficult Differences in: Lexical conventions Structure/metamodel Ontological commitments Use of: Jargon Homonyms/polysemes Background knowledge/implicit context
4. What can we achieve with current alignment tools? A wide selection of alignment tools exists OAEI workshop: alignment tools are tested on benchmark sets and on real-world applications A practical methodology that tells us which tool to use in which situation is still lacking How can I get better results on MY data?
6. Research Question Does combining alignment techniques have added value? If yes, we need a methodology that tells us how to combine alignment techniques.
7. Case Study Setup 2 data sets in Dutch: RKD subject thesaurus Cornetto, a lexical thesaurus linked to WordNet Alignment techniques for generating exact matches Baseline technique Lexical technique Structural technique Manual evaluation Techniques for improving precision/recall Combining alignment techniques to improve recall and precision Disambiguation techniques for improving precision
8. Data Sets RKD subject thesaurus 3,342 concepts 3,342 preferred labels 242 alternative labels Broader, narrower and related relations between concepts Cornetto 70,434 synsets 102,572 sense-labels 16 relation types including hypernym relation One word can be part of multiple synsets Rationale: link small to large hub vocabularies Small specialized vocabularies are frequent (in the CH field) Linking to large vocabularies adds synonyms and relations
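The two data models described above can be sketched minimally as follows. This is an illustrative sketch only: the class and field names are placeholders, not the actual RKD or Cornetto schemas.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """An RKD subject-thesaurus entry: one preferred label, optional
    alternative labels, and broader/narrower/related links."""
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)    # ids of broader concepts
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)

@dataclass
class Synset:
    """A Cornetto entry: a set of sense labels plus typed relations
    (only the hypernym relation is shown here)."""
    sense_labels: list
    hypernyms: list = field(default_factory=list)  # ids of hypernym synsets

# One word can be part of multiple synsets -- the source of the
# disambiguation problem discussed later.
bank_financial = Synset(sense_labels=["bank", "financial institution"])
bank_river = Synset(sense_labels=["bank", "riverbank"])
```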
9. Alignment techniques Baseline technique: optimizes precision Plain string matching Ignores ambiguous matches Lexical technique (STITCH tool): increases recall Matches terms and uses lemmatization and compound splitting Returns all (possibly ambiguous) matches Structural technique (Falcon-AO): best tool in town (OAEI 2007) Uses the structure of vocabularies Uses lexical measures, lemmatization and distance metrics
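As an illustration, the baseline behaviour, exact string matching that simply drops ambiguous hits, can be sketched like this. It is a minimal sketch of the idea, not the code of any of the actual tools.

```python
def baseline_align(concepts, synsets):
    """Plain string matching from concept labels to sense labels.

    concepts: dict concept_id -> preferred label
    synsets:  dict synset_id -> list of sense labels
    A concept is aligned only when its label matches exactly one synset;
    ambiguous matches are ignored, trading recall for precision.
    """
    # Index every sense label to the synsets that carry it.
    index = {}
    for sid, labels in synsets.items():
        for label in labels:
            index.setdefault(label.lower(), []).append(sid)

    alignments = {}
    for cid, label in concepts.items():
        hits = index.get(label.lower(), [])
        if len(hits) == 1:          # ignore ambiguous matches
            alignments[cid] = hits[0]
    return alignments
```

The lexical technique differs mainly in normalising the labels first (lemmatization, compound splitting) and in keeping the ambiguous hits instead of dropping them.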
11. Evaluation 1 person (me) evaluated the entire set 2,493 concepts with 4,375 alignments Taking approximately 26 person-hours 5 (external) people evaluated small samples of alignments to validate the manual evaluation 50 concepts with around 80 alignments Taking 17 minutes on average
12. Validation of Manual Evaluation We measured inter-observer agreement for exact matches between me and the 5 raters using Cohen’s Kappa κ= 0.70 Reasons for disagreement: Disagreement in the vocabulary interpretation Vocabulary error Human error We will use the list of correct exact-matches as a “Gold Standard” to compare the performance of the tools
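Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance; a minimal computation looks like this:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    individual category frequencies.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
              for c in categories)
    if p_e == 1:                    # both raters always use one category
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A value of κ = 0.70 indicates substantial but far from perfect agreement, which is why the disagreements were inspected by hand.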
13. Qualitative Results The tools found no alignments for 849 concepts Recall is based on the correct exact-matches that were found
15. Disambiguation Total aligned concepts 2,493 with 4,375 alignments 860 concepts have more than one alignment with a total of 2,712 alignments From the manual evaluation we know that many of these alignments are wrong We will disambiguate alignments using the structure of the vocabularies (broader/hyponym relations) Child match Parent match
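The parent-match idea, keeping the candidate whose parents are themselves aligned to the source concept's parents, can be sketched as follows. This is an illustrative sketch; child match works the same way on the children, and the paper's exact scoring criterion may differ in detail.

```python
def parent_match(alignments, src_parents, tgt_parents):
    """For each ambiguous source concept, keep the candidate target(s)
    best supported by alignments between the concepts' parents.

    alignments:  dict src_id -> set of candidate tgt_ids
    src_parents: dict src_id -> set of parent src_ids (broader relation)
    tgt_parents: dict tgt_id -> set of parent tgt_ids (hypernym relation)
    """
    kept = {}
    for src, targets in alignments.items():
        if len(targets) == 1:
            kept[src] = targets
            continue

        def support(tgt):
            # Count parents of tgt that are aligned to a parent of src.
            return sum(
                1
                for p_src in src_parents.get(src, set())
                for p_tgt in alignments.get(p_src, set())
                if p_tgt in tgt_parents.get(tgt, set())
            )

        scores = {t: support(t) for t in targets}
        best = max(scores.values())
        if best > 0:
            kept[src] = {t for t in targets if scores[t] == best}
        else:
            kept[src] = targets     # no structural evidence to decide on
    return kept
```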
17. Disambiguation Results Child match: 120 out of 449 alignments for 112 concepts have the highest number of child alignments, with 24% false positives and 10% false negatives Parent match: 234 out of 561 alignments for 185 concepts had the highest number of parent alignments, with 22% false positives and 12% false negatives Small overlap of 59 alignments for 18 concepts A third of the ambiguous alignments is resolved using the two disambiguation methods: for 279 out of 860 concepts we keep 336 alignments and throw away 615 alignments
19. Conclusion and Future Work A methodology is much needed in this area Our next step is to see how alignment techniques can be combined on larger vocabularies: We are currently working on experiments with Getty’s AAT and Princeton WordNet
20. Thanks and Acknowledgements Cornetto project team The Netherlands Institute for Art History (RKD) Antoine Isaac and the STITCH team Wei Hu (Falcon) Mark van Assem, Willem van Hage, Laura Hollink and Jan Wielemaker for their contribution to the alignment evaluation Bob Wielinga for comments on earlier versions of the paper
Editor’s Notes
Lexical convention example: plural vs singular. Structure example: hypernymy vs broader-than. Ontological commitments: OWL vs SKOS. Jargon: expert terms vs layman terms. Homonyms: bank (financial institution) vs bank (river). Polysemes: to milk (act of) and milk (product). Background knowledge: the application domain can define the meaning of concepts.
This is what my, well, our data looks like. We have a number of museum datasets; the blobs with the same color indicate that. These datasets contain artwork metadata, and vocabularies describing people, locations, concepts and events. We have some domain-specific vocabularies such as Getty’s Art and Architecture Thesaurus, and lexical resources in various languages: English, Dutch and French WordNets. In this talk I will focus on the alignment between these two blobs. Our problem essentially is that none of the tools we tried works well enough on our data, so our main research question is…
Does combining alignment techniques or tools have added value? If yes, then we need a methodology that tells us how to combine alignment techniques given certain goals. As a first step we performed a case study.
RKD: The Netherlands Institute for Art History. We performed a manual evaluation to determine the correctness of the alignments. Because the lexical thesaurus contains many homonyms, we also applied disambiguation techniques to improve precision.
RKD: The Netherlands Institute for Art History. The RKD subject thesaurus contains fewer than three and a half thousand concepts, each with a preferred label. There are few alternative labels. There are also broader, narrower and related relations between concepts. Cornetto contains over 70 thousand synsets with over 100 thousand labels. A large portion of the synsets has a single label, but there are synsets with over a dozen labels. There are 16 relation types, including the hypernym relation. One word can be part of multiple synsets, which creates a disambiguation problem for automatic alignment techniques. Our rationale is to link small vocabularies to large hub vocabularies, as small vocabularies are frequent in the Cultural Heritage field. Also, linking to large vocabularies adds new synonyms and relations, which makes the data more searchable.
We used the following alignment techniques. The baseline technique optimizes precision: it performs plain string matching and simply ignores ambiguous matches. The lexical technique, for which we used the tool from the STITCH project, is geared towards increasing recall: it matches terms using lemmatization and compound splitting, and returns all matches found, even ambiguous ones. Finally, we have the structural technique; here we used Falcon-AO, which was the best-performing tool in the 2007 alignment workshop. It uses the structure of the vocabularies as well as lexical measures, lemmatization and distance metrics for finding the best possible alignments. The three tools found overlapping sets of candidate alignments. Note that at this point we do not say anything about the quality of the alignments.
So what were the quantitative results? The three tools together returned over 4,000 candidate alignments. The baseline tool found 30% of all candidate alignments, Falcon found 60% and the STITCH tool 86%. There is a large overlap between the three tools, as well as between STITCH and Falcon. Almost half of the alignments found by STITCH were not found by the other two techniques. But how good are these alignments?
In order to find out about the quality of alignments we performed a manual evaluation of the alignments. One person, you may guess who, evaluated the entire set taking approximately 26 hours. We also had 5 external people evaluating separate sets of sample alignments to validate the manual evaluation each taking 17 minutes on average.
We then measured inter-observer agreement for exact matches between me and the 5 raters using Cohen’s kappa and found a kappa of 0.70, which is relatively low. This just goes to show how difficult it is to evaluate alignments, even for humans. The reasons for disagreement were either differences in the interpretation of the terms of the vocabularies, differences in dealing with errors in the vocabulary, or plain human error, such as accidentally pushing the wrong button without noticing. We will use the list of correct exact matches as a Gold Standard to compare the performance of the tools.
For the baseline tool we have a high number of correct exact matches, as expected. About half of the non-exact-match alignments are some semantic relation like broader, narrower or related. The incorrect alignments were entirely due to homonyms, where one meaning appears in one vocabulary and the other meaning in the other vocabulary. Still, the baseline tool returned approximately half of the total number of correct alignments found. For the STITCH tool the picture is quite different, with only about half of the alignments being correct and most of the non-exact-match alignments entirely incorrect, although 750 more correct concepts were found than by the baseline. The performance of the Falcon tool is somewhere in the middle: a higher percentage of correct exact matches than the STITCH tool but slightly lower coverage; still, Falcon also returned significantly more correct alignments than the baseline tool. Looking at the distinct total, we see that for almost every concept aligned there is at least one correct alignment. In addition, the tools found no alignments for around 850 concepts of the subject thesaurus, so for around 1,000 concepts we have no correct alignment. Our recall is based on the correct exact matches found and could also have been called coverage.
The most interesting conclusions can be drawn where the STITCH tool and Falcon do not overlap with the baseline. In the overlap between STITCH and Falcon the precision drops to 50%, while for alignments found only by Falcon it drops further to 29%, and further still to 25% for alignments found only by STITCH. One key observation is that most of the alignments found only by STITCH, and to some extent by Falcon, are ambiguous: for a single source concept we have multiple alignments. To tackle this problem we also perform automatic disambiguation.
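The per-region precision analysis behind these numbers can be sketched as follows. This is illustrative Python, not the tooling actually used; the tool names and alignment pairs in the usage below are placeholders, not the real data.

```python
def precision_by_region(tool_results, gold):
    """Partition candidate alignments by the set of tools that found
    them, and report precision per region against a gold standard.

    tool_results: dict tool_name -> set of (src, tgt) pairs
    gold:         set of correct (src, tgt) pairs
    Returns: dict frozenset(tool names) -> precision in that region.
    """
    regions = {}
    all_pairs = set().union(*tool_results.values())
    for pair in all_pairs:
        found_by = frozenset(t for t, res in tool_results.items()
                             if pair in res)
        regions.setdefault(found_by, []).append(pair)
    return {
        region: sum(p in gold for p in pairs) / len(pairs)
        for region, pairs in regions.items()
    }
```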
So here is a small sample of the hierarchies of the two thesauri. The source thesaurus is the RKD subject thesaurus and the target thesaurus is Cornetto. For the child-match technique, when we have two alignments for a single source concept, we look at the bottom of the hierarchy to see whether there are any alignments between the children of these concepts. In this case we have two child alignments for the topmost candidate, so we assume that this is the correct alignment and discard the other one. The parent-match technique works similarly, except that there we look at the top of the hierarchy. Again we have a source concept with multiple alignments, but in this case we look at the parents of the concepts, and if there is at least one parent alignment we consider that alignment to be correct and discard the other one. So what were the results of the disambiguation?
The key part of this slide is that there is only a small overlap between the two methods, and that with this computationally cheap approach we were able to disambiguate a third of all the ambiguous alignments. About 23% of the alignments we keep are actually false positives, and of the discarded alignments about 10% were correct. For more information I refer you to the paper.
So in general, when it comes to combining the tools and the use of disambiguation, we found the following: by doing an additional manual evaluation of a very selective subset we can boost the recall and precision even further. This is described in more detail in the paper.
I would like to conclude by saying that a methodology is much needed in this area. With regard to future work, our next step is to see how alignment techniques can be combined on larger vocabularies. We are currently working on experiments with Getty’s Art and Architecture Thesaurus and Princeton WordNet.