Combining Vocabulary Alignment Techniques
Anna Tordai, Jacco van Ossenbruggen, Guus Schreiber
VU University Amsterdam

Vocabulary Alignments
  • Many museums, libraries and archives capture their knowledge in structured vocabularies covering similar areas (materials, subject matter)
  • One goal in the cultural heritage (CH) field is data integration: making museum collections and their vocabularies available through common portals
  • Our solution is aligning vocabularies to each other and/or to large, commonly used resources

Aligning is difficult
  • Differences in:
      • Lexical conventions
      • Structure/metamodel
      • Ontological commitments
  • Use of:
      • Jargon
      • Homonyms/polysemes
      • Background knowledge/implicit context

What can we achieve with current alignment tools?
  • A wide selection of alignment tools exists
  • OAEI workshop: alignment tools are tested on benchmark sets and on real-world applications
  • A practical methodology that tells us which tool to use in which situation is still lacking
  • How can I get better results on MY data?

My Data: E-Culture Cloud
  • Figure: the E-Culture data cloud. Labels recovered from the figure: Concepts 11,995; WordNet NL (Cornetto) 70,434

Research Question
  • Does combining alignment techniques have added value?
  • If yes, we need a methodology that tells us how to combine alignment techniques.

Case Study Setup
  • 2 data sets in Dutch:
      • RKD subject thesaurus
      • Cornetto, a lexical thesaurus linked to WordNet
  • Alignment techniques for generating exact matches:
      • Baseline technique
      • Lexical technique
      • Structural technique
  • Manual evaluation
  • Techniques for improving precision/recall:
      • Combining alignment techniques to improve recall and precision
      • Disambiguation techniques for improving precision

Data Sets
  • RKD subject thesaurus
      • 3,342 concepts
      • 3,342 preferred labels
      • 242 alternative labels
      • Broader, narrower and related relations between concepts
  • Cornetto
      • 70,434 synsets
      • 102,572 sense-labels
      • 16 relation types, including the hypernym relation
      • One word can be part of multiple synsets
  • Rationale: link small vocabularies to large hub vocabularies
      • Small specialized vocabularies are frequent (in the CH field)
      • Linking to large vocabularies adds synonyms and relations

Alignment techniques
  • Baseline technique: optimizes precision
      • Plain string matching
      • Ignores ambiguous matches
  • Lexical technique (STITCH tool): increases recall
      • Matches terms and uses lemmatization and compound splitting
      • Returns all (possibly ambiguous) matches
  • Structural technique (Falcon-AO): best tool in town (OAEI 2007)
      • Uses the structure of vocabularies
      • Uses lexical measures, lemmatization and distance metrics

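The baseline technique above (plain string matching that ignores ambiguous matches) can be sketched as follows. This is a minimal illustration, not the tool used in the study: the function name, the concept identifiers, and the lowercasing/trimming of labels are all assumptions made for the example.

```python
from collections import defaultdict


def baseline_align(source_labels, target_labels):
    """Exact string matching that keeps only unambiguous matches.

    source_labels: dict mapping a source concept id to its label
    target_labels: iterable of (target concept id, label) pairs
    """
    # Index the target concepts by label (normalisation is an assumption).
    index = defaultdict(set)
    for target_id, label in target_labels:
        index[label.strip().lower()].add(target_id)

    alignments = {}
    for source_id, label in source_labels.items():
        candidates = index.get(label.strip().lower(), set())
        # A label that matches several target concepts is ambiguous: skip it.
        if len(candidates) == 1:
            alignments[source_id] = next(iter(candidates))
    return alignments


# Hypothetical identifiers: "bank" matches two synsets, so it is dropped.
source = {"rkd:1": "bank", "rkd:2": "molen"}
target = [("c:10", "bank"), ("c:11", "bank"), ("c:20", "molen")]
print(baseline_align(source, target))
```

Dropping every ambiguous label is what trades recall for precision here; a recall-oriented matcher would return all candidates instead.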
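Quantitative Results: 4375 Candidate Alignments
  • Figure: Venn diagram of the candidate alignments returned by each tool
  • Share of all candidate alignments per tool: Baseline 30%, STITCH 86%, Falcon 59%
  • Region counts as extracted from the diagram: 59, 10, 1726, 1145, 92, 836, 507
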
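The seven region counts add up to the 4375 candidate alignments, and the per-tool percentages can be recomputed from them. The sketch below does that arithmetic. The assignment of counts to Venn regions is an assumption: the extracted slide text does not label the regions, and this labelling was chosen because it is the one that reproduces the 30%, 86% and 59% printed on the slide.

```python
TOTAL = 4375

# Region labels are inferred, not read from the slide (see the note above).
regions = {
    frozenset({"Baseline"}): 10,
    frozenset({"Baseline", "STITCH"}): 59,
    frozenset({"STITCH"}): 1726,
    frozenset({"Baseline", "STITCH", "Falcon"}): 1145,
    frozenset({"Baseline", "Falcon"}): 92,
    frozenset({"STITCH", "Falcon"}): 836,
    frozenset({"Falcon"}): 507,
}

# The regions partition the full set of candidate alignments.
assert sum(regions.values()) == TOTAL


def share(tool):
    """Return (candidates found by the tool, rounded percentage of the total)."""
    found = sum(count for tools, count in regions.items() if tool in tools)
    return found, round(100 * found / TOTAL)


for tool in ("Baseline", "STITCH", "Falcon"):
    found, percent = share(tool)
    print(f"{tool}: {found} candidates ({percent}%)")
```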
Evaluation
  • 1 person (me) evaluated the entire set
      • 2493 concepts with 4375 alignments
      • Taking approximately 26 person-hours
  • 5 (external) people evaluated small samples of alignments to validate the manual evaluation
      • 50 concepts with around 80 alignments
      • Taking 17 minutes on average

Validation of Manual Evaluation
  • We measured inter-observer agreement for exact matches between me and the 5 raters using Cohen’s kappa
      • κ = 0.70
  • Reasons for disagreement:
      • Disagreement in the vocabulary interpretation
      • Vocabulary error
      • Human error
  • We will use the list of correct exact matches as a “Gold Standard” to compare the performance of the tools

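For readers unfamiliar with Cohen's kappa, here is a minimal sketch of the standard two-rater formula, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The judgements in the example are invented for illustration and are not the study's data; the slides also do not say how the five raters' scores were combined, so only the pairwise case is shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # Undefined when p_e == 1 (both raters always use the same single label).
    return (p_o - p_e) / (1 - p_e)


# Invented judgements: "E" = exact match, "O" = anything else.
me = list("EEEEEOOOOO")
rater = list("EEEEOOOOOE")
print(cohens_kappa(me, rater))  # roughly 0.6 for this toy data
```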
Qualitative Results
  • The tools found no alignments for 849 concepts
  • Recall is based on the correct exact matches that were found

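Overlap in correct exact-match alignments (precision)
  • Figure: Venn diagram over Baseline, STITCH and Falcon
  • Correct alignments and precision per region, as extracted from the diagram: 53 (90%), 9 (90%), 429 (25%), 1073 (94%), 434 (52%), 87 (95%), 147 (29%)
  • Distinct total: 2232
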
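The per-region precision figures can be cross-checked against the candidate counts from the quantitative results, and the same numbers give an overall precision per tool. The sketch below is a reader's recomputation, not a result reported on the slides. The mapping of counts to Venn regions is an assumption: it is the labelling under which each correct/candidate ratio matches the percentage printed next to it.

```python
# (candidate alignments, correct exact matches) per Venn region.
# Region labels are inferred from the printed percentages, not from the slide.
regions = {
    ("Baseline",): (10, 9),
    ("Baseline", "STITCH"): (59, 53),
    ("STITCH",): (1726, 429),
    ("Baseline", "STITCH", "Falcon"): (1145, 1073),
    ("STITCH", "Falcon"): (836, 434),
    ("Baseline", "Falcon"): (92, 87),
    ("Falcon",): (507, 147),
}


def precision(tool=None):
    """Precision over all regions, or only over regions that include `tool`."""
    selected = [
        counts for tools, counts in regions.items()
        if tool is None or tool in tools
    ]
    candidates = sum(c for c, _ in selected)
    correct = sum(k for _, k in selected)
    return correct / candidates


# The correct matches add up to the "distinct total" on the slide.
assert sum(correct for _, correct in regions.values()) == 2232

for tool in ("Baseline", "STITCH", "Falcon"):
    print(f"{tool}: {precision(tool):.1%}")
print(f"All candidates: {precision():.1%}")
```

Under this labelling the baseline is correct for most of what it returns, STITCH for about half, and Falcon sits in between, which agrees with the qualitative description in the speaker notes.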
Disambiguation
  • Total aligned concepts: 2,493, with 4,375 alignments
  • 860 concepts have more than one alignment, with a total of 2,712 alignments
  • From the manual evaluation we know that many of these alignments are wrong
  • We will disambiguate alignments using the structure of the vocabularies (broader/hyponym relations)
      • Child match
      • Parent match

Child Match and Parent Match
  • Figure: the two techniques illustrated on a source thesaurus and a target thesaurus

Disambiguation Results
  • Child match: 120 out of 449 alignments for 112 concepts have the highest number of child alignments
      • with 24% false positives
      • and 10% false negatives
  • Parent match: 234 out of 561 alignments for 185 concepts have the highest number of parent alignments
      • with 22% false positives
      • and 12% false negatives
  • Small overlap of 59 alignments for 18 concepts
  • A third of the ambiguous alignments is resolved using the two disambiguation methods: for 279 out of 860 concepts we keep 336 alignments and throw away 615 alignments

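The combined figures on this slide follow from the two methods by inclusion-exclusion over their overlap. The short check below reproduces that arithmetic from the numbers on the slide; the variable names are illustrative, and the last derived value (kept alignments selected by both methods) is an inference from the totals, not a number stated in the source.

```python
child = {"concepts": 112, "alignments": 449, "kept": 120}
parent = {"concepts": 185, "alignments": 561, "kept": 234}
overlap = {"concepts": 18, "alignments": 59}

# Concepts resolved by at least one method.
resolved_concepts = child["concepts"] + parent["concepts"] - overlap["concepts"]

# Alignments considered by at least one method.
considered = child["alignments"] + parent["alignments"] - overlap["alignments"]

kept, discarded = 336, 615  # totals reported on the slide
assert resolved_concepts == 279
assert considered == kept + discarded

# Implied by the totals: kept alignments that both methods selected.
kept_by_both = child["kept"] + parent["kept"] - kept

print(f"resolved concepts: {resolved_concepts} of 860 "
      f"({resolved_concepts / 860:.0%})")
print(f"alignments considered: {considered}, kept by both methods: {kept_by_both}")
```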
Final Results
Conclusion and Future Work
  • A methodology is much needed in this area
  • Our next step is to see how alignment techniques can be combined with regard to larger vocabularies
  • We are currently working on experiments with Getty’s AAT and Princeton WordNet

Thanks and Acknowledgements
  • Cornetto project team
  • The Netherlands Institute for Art History (RKD)
  • Antoine Isaac and the STITCH team
  • Wei Hu (Falcon)
  • Mark van Assem, Willem van Hage, Laura Hollink and Jan Wielemaker for their contribution to the alignment evaluation
  • Bob Wielinga for comments on earlier versions of the paper


Editor's Notes

  1. Examples of each difficulty:
       • Lexical conventions: plural vs. singular
       • Structure: hypernymy vs. "broader than"
       • Ontological commitments: OWL vs. SKOS
       • Jargon: expert terms vs. layman's terms
       • Homonyms: bank (financial institution) vs. bank (river)
       • Polysemes: to milk (the act) and milk (the product)
       • Background knowledge: the application domain can define the meaning of concepts
  2. This is what my, well, our data looks like. We have a number of museum datasets; blobs with the same color belong to the same dataset. These datasets contain artwork metadata and vocabularies describing people, locations, concepts and events. We have some domain-specific vocabularies, such as Getty’s Art and Architecture Thesaurus, and lexical resources in various languages: English, Dutch and French WordNets. In this talk I will focus on the alignment between these two blobs. Our problem, essentially, is that none of the tools we tried works well enough on our data, so our main research question is…
  3. Does combining alignment techniques or tools have added value? If yes, then we need a methodology that tells us how to combine alignment techniques given certain goals. As a first step we performed a case study.
  4. RKD: The Netherlands Institute for Art History. We performed a manual evaluation to determine the correctness of the alignments. Because the lexical thesaurus contains multiple homonyms, we also applied disambiguation techniques to improve precision.
  5. RKD: The Netherlands Institute for Art History. The RKD subject thesaurus contains fewer than three and a half thousand concepts, each with a preferred label. There are few alternative labels. There are also broader, narrower and related relations between concepts. Cornetto contains over 70 thousand synsets with over 100 thousand labels. A large portion of the synsets has a single label, but there are synsets with over a dozen labels. There are 16 relation types, including the hypernym relation. One word can be part of multiple synsets, which creates a disambiguation problem for automatic alignment techniques. Our rationale is to link small vocabularies to large hub vocabularies, as small vocabularies are frequent in the cultural heritage field. Linking to large vocabularies also adds new synonyms and relations, which make the data more searchable.
  6. We used the following alignment techniques. The baseline technique optimizes precision: it performs plain string matching and simply ignores ambiguous matches. The lexical technique, for which we used the tool from the STITCH project, is geared towards increasing recall: it matches terms, uses lemmatization and compound splitting, and returns all matches found, even ambiguous ones. Finally, we have the structural technique. Here we used Falcon-AO, which was the best-performing tool in the 2007 alignment workshop. It uses the structure of the vocabularies as well as lexical measures, lemmatization and distance metrics to find the best possible alignments. The three tools found overlapping sets of candidate alignments. Note that at this point we do not say anything about the quality of the alignments.
  7. So what were the quantitative results? The three tools together returned over 4 thousand candidate alignments. The baseline tool found 30% of all candidate alignments, Falcon roughly 60%, and the STITCH tool 86%. There is a large overlap between the three tools, as well as between STITCH and Falcon. Almost half of the alignments found by STITCH were not found by the other two techniques. But how good are these alignments?
  8. To find out about the quality of the alignments, we performed a manual evaluation. One person, you may guess who, evaluated the entire set, taking approximately 26 hours. We also had 5 external people evaluate separate sets of sample alignments to validate the manual evaluation, each taking 17 minutes on average.
  9. We then measured inter-observer agreement for exact matches between me and the 5 raters using Cohen’s kappa and found a kappa of 0.7, which is relatively low. This just goes to show how difficult it is to evaluate alignments, even for humans. The disagreements were due to differences in the interpretation of the terms of the vocabularies, differences in dealing with errors in the vocabulary, and plain human error, such as accidentally pushing the wrong button without noticing. We will use the list of correct exact matches as a Gold Standard to compare the performance of the tools.
  10. For the baseline tool we have a high number of correct exact matches, as expected. About half of the non-exact-match alignments are some other semantic relation, such as broader, narrower or related. The incorrect alignments were entirely due to homonyms, where one meaning appears in one vocabulary and the other meaning in the other vocabulary. However, the baseline tool returned only approximately half of the total number of correct alignments found. For the STITCH tool the picture is quite different: only about half of the alignments are correct and most of the non-exact-match alignments are entirely incorrect, although it found 750 more correct concepts than the baseline. The performance of the Falcon tool is somewhere in the middle, with a higher percentage of correct exact matches than the STITCH tool but slightly lower coverage; still, Falcon also returned significantly more correct alignments than the baseline tool. Looking at the distinct total, we see that for almost every concept aligned there is at least one correct alignment. In addition, the tools found no alignments for around 850 concepts of the subject thesaurus, so for around 1000 concepts we have no correct alignment. Our recall is based on the correct exact matches found and could have been called coverage.
  11. The most interesting conclusions can be drawn where the STITCH tool and Falcon don’t overlap with the baseline. We see that in the overlap between STITCH and Falcon the precision drops to about 50%, while for alignments found only by Falcon the precision drops further to 29%, and even further, to 25%, for alignments found only by STITCH. One key observation is that most of the alignments found only by STITCH, and to some extent those found only by Falcon, are ambiguous: for a single source concept we have multiple alignments. To tackle this problem we also perform automatic disambiguation.
  12. So here is a small sample of the hierarchies of two thesauri. The source thesaurus is the RKD subject thesaurus and the target thesaurus is Cornetto. For the Child Match technique, suppose we have two alignments for a single source concept. We look further down the hierarchy to see whether there are any alignments between the children of these concepts. In this case we have two such child alignments for the topmost alignment. Here we make the assumption that this is the correct alignment, and we discard the other alignment. The Parent Match technique works similarly, except that there we start at the bottom of the hierarchy: again we have a source concept with multiple alignments, but in this case we look at the parents of the concepts, and if there is at least one parent alignment we consider that alignment to be correct and discard the other one. So what were the results of the disambiguation?
  13. The key part of this slide is that there is a small overlap between the two methods, but with this computationally cheap method we were able to disambiguate a third of all the ambiguous alignments. About 23% of the alignments we keep are actually false positives, and of the discarded alignments about 10% were correct. For more information I refer you to the paper.
  14. So in general, when it comes to combining the tools and the use of disambiguation, we found the following: by doing an additional manual evaluation of a very selective subset we can boost the recall and precision even further. This is described in more detail in the paper.
  15. I would like to conclude by saying that a methodology is much needed in this area. With regard to future work, our next step is to see how alignment techniques can be combined on larger vocabularies. We are currently working on experiments with Getty’s Art and Architecture Thesaurus and Princeton WordNet.
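
The Child Match and Parent Match techniques described in note 12 can be sketched with one function that counts supporting alignments among neighbouring concepts. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the concept identifiers and the handling of ties (all tied candidates are kept) are hypothetical. Passing child maps gives Child Match; passing parent maps gives Parent Match.

```python
def support(source, target, alignments, source_rel, target_rel):
    """Count alignments (s, t) where s is a neighbour of `source` and t a
    neighbour of `target` (children for Child Match, parents for Parent Match)."""
    source_neighbours = source_rel.get(source, set())
    target_neighbours = target_rel.get(target, set())
    return sum(
        1 for s, t in alignments
        if s in source_neighbours and t in target_neighbours
    )


def disambiguate(source, candidates, alignments, source_rel, target_rel):
    """Keep the candidate target(s) with the most supporting alignments."""
    scores = {
        t: support(source, t, alignments, source_rel, target_rel)
        for t in candidates
    }
    best = max(scores.values(), default=0)
    if best == 0:
        # No structural evidence: leave the concept ambiguous.
        return set(candidates)
    return {t for t, score in scores.items() if score == best}


# Hypothetical example: one source concept with two candidate synsets.
source_children = {"rkd:animals": {"rkd:dog", "rkd:cat"}}
target_children = {
    "c:animal-1": {"c:dog", "c:cat"},
    "c:animal-2": {"c:party-animal"},
}
alignments = {
    ("rkd:animals", "c:animal-1"),
    ("rkd:animals", "c:animal-2"),
    ("rkd:dog", "c:dog"),
    ("rkd:cat", "c:cat"),
}
kept = disambiguate(
    "rkd:animals", ["c:animal-1", "c:animal-2"],
    alignments, source_children, target_children,
)
print(kept)
```

Because the method only counts alignments that already exist between neighbouring concepts, it needs no extra lexical processing, which is consistent with the note's description of it as computationally cheap.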