A similarity measure based on semantic and linguistic information
1. A Similarity Measure Based on Semantic and Linguistic Information. Nitish Aggarwal, DERI, NUI Galway, firstname.lastname@deri.org. Wednesday, 15th June 2011, DERI Reading Group
2. Based On: “A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness”. Authors: Giuseppe Pirrò and Jérôme Euzenat. Published: International Semantic Web Conference, 2010. “SyMSS: A syntax-based measure for short-text semantic similarity”. Authors: J. Oliva, J. Serrano, M. Castillo, and Ángel Iglesias. Published: Data & Knowledge Engineering, Volume 70, Issue 4, April 2011
3. Overview: Introduction; Classical Approaches; Ontology-based Similarity (set of relations, Information Content); SyMSS, syntax-based (deep parsing, influence of adjectives and adverbs); Conclusion
4. Introduction & Motivation. Short-text similarity: lack of semantics and linguistics. Applications: semantic annotation, semantic search, information retrieval and extraction
5. Classical Approaches. String similarity: Levenshtein distance, Dice coefficient. Corpus-based: ESA, Google distance, Vector-Space Model. Ontology-based: path distance, information content. Syntax similarity: word order, part of speech
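The two string-similarity measures named on this slide can be sketched in a few lines. This is a minimal illustration, not code from either paper: the function names are hypothetical, and the Dice coefficient is computed here over sets of character bigrams, which is one common convention among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of adjacent character pairs (assumption: duplicates are ignored)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * |X and Y| / (|X| + |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


print(levenshtein("kitten", "sitting"))  # 3
print(dice("night", "nacht"))            # 0.25
```

Both measures look only at characters, so "car" and "automobile" score as unrelated; that gap is what the corpus-based and ontology-based approaches address.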
6. First Paper: “A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness” (Giuseppe Pirrò and Jérôme Euzenat, International Semantic Web Conference, 2010)
7. Ontology-based - Overview. Features: whole set of semantic relations defined in an ontology. Resnik’s Information Content: IC(c) = -log p(c). Intrinsic Information Content: avoids the analysis of large corpora. Extended Information Content: maps the feature-based model to the information-theoretic domain
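A small sketch of Resnik's IC(c) = -log p(c) may help here. The taxonomy and corpus counts below are hypothetical toy values, and the sketch assumes the usual convention that an occurrence of a concept also counts as an occurrence of each of its ancestors, so p(c) grows toward 1 at the root.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and raw corpus counts.
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity", "car": "entity"}
COUNTS = {"dog": 30, "cat": 20, "animal": 10, "car": 40, "entity": 0}


def cumulative_counts(parent, counts):
    """Propagate each concept's count to itself and all of its ancestors."""
    total = dict.fromkeys(counts, 0)
    for concept, n in counts.items():
        node = concept
        while node is not None:
            total[node] += n
            node = parent.get(node)  # None once the root is passed
    return total


def information_content(concept, parent=PARENT, counts=COUNTS):
    """IC(c) = -log p(c), with p(c) = cumulative count of c / count at the root."""
    freq = cumulative_counts(parent, counts)
    root_total = max(freq.values())
    return -math.log(freq[concept] / root_total)


for c in ("entity", "animal", "dog"):
    print(c, information_content(c))
```

More specific concepts are rarer and so carry more information: here "dog" scores higher than "animal", and the root scores zero. The dependence on corpus counts is what the intrinsic variant removes.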
9. Ontology-based - Model. Tversky’s feature-based similarity model: common features of two concepts ~ similarity; extra features ~ 1/similarity. Ratio-based formulation of Tversky’s model.
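The formula for the ratio-based formulation did not survive on this slide. As an assumption, the sketch below uses the standard form of Tversky's ratio model, sim(a, b) = f(A ∩ B) / (f(A ∩ B) + α·f(A \ B) + β·f(B \ A)), with f taken as set size. The feature sets and the function name are hypothetical.

```python
def tversky(features_a: set, features_b: set,
            alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky ratio model with f = set cardinality (an assumption of this sketch)."""
    common = len(features_a & features_b)   # shared features raise similarity
    only_a = len(features_a - features_b)   # features unique to a lower it
    only_b = len(features_b - features_a)   # features unique to b lower it
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0


# Hypothetical feature sets for two concepts.
car = {"wheels", "engine", "doors", "transport"}
bicycle = {"wheels", "pedals", "transport"}

print(tversky(car, bicycle))            # alpha = beta = 0.5 (Dice-like weighting)
print(tversky(car, bicycle, 1.0, 1.0))  # alpha = beta = 1 (Jaccard-like weighting)
```

The weights α and β control how strongly the extra features of each concept count against the similarity; unequal weights make the measure asymmetric.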
13. Ontology-based - Framework. Intrinsic information content (iIC), where sub(c) is the number of sub-concepts of a given concept c. Extended information content (eIC), where EIC(c) is a relatedness coefficient computed over all kinds of relations
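The iIC formula itself is missing from this slide. A commonly used intrinsic definition, assumed here rather than taken from the slide, is iIC(c) = 1 - log(sub(c) + 1) / log(max_con), where max_con is the total number of concepts in the ontology. The sketch below shows how that definition behaves at the extremes; the parameter names are hypothetical.

```python
import math


def intrinsic_ic(num_subconcepts: int, total_concepts: int) -> float:
    """Assumed form: iIC(c) = 1 - log(sub(c) + 1) / log(max_con).

    num_subconcepts is sub(c); total_concepts is max_con, the ontology size.
    """
    return 1.0 - math.log(num_subconcepts + 1) / math.log(total_concepts)


print(intrinsic_ic(0, 1000))    # leaf concept: maximally informative
print(intrinsic_ic(9, 1000))    # concept with 9 sub-concepts
print(intrinsic_ic(999, 1000))  # root subsuming every other concept
```

Under this definition no corpus is needed: a leaf gets iIC = 1, the root gets iIC = 0, and everything else falls in between according to how many sub-concepts it subsumes.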
14. Ontology-based - Evaluation of Similarity. DataSet: 65 human-evaluated pairs. Correlation values:
16. Ontology-based - Summary. Intrinsic similarity measure; ontology-based similarity; outperforms corpus measures. Limitations: no short-text support; model-based, e.g. only concepts in the ontology are considered (e.g. car accident)
17. Second Paper (SyMSS): “SyMSS: A syntax-based measure for short-text semantic similarity” (J. Oliva, J. Serrano, M. Castillo, and Ángel Iglesias, Data & Knowledge Engineering, Volume 70, Issue 4, April 2011)
18. SyMSS - Overview. SyMSS = “syntax-based measure for short-text semantic similarity”. Syntactic information: not only word order; deep parsing; parts of speech. Semantic information: WordNet similarity; different ontology-based similarity measures
19. SyMSS - Semantic Information. Path-based measures: shortest path, Hirst and St-Onge (HSO). Information content: Resnik measure, Jiang and Conrath measure, Lin measure. Gloss-based measures: gloss overlap and gloss vector
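The three information-content measures listed above differ only in how they combine the IC of the two concepts with the IC of their least common subsumer (LCS). A minimal sketch, using the standard definitions and hypothetical IC values rather than real WordNet data:

```python
def resnik(ic_lcs: float) -> float:
    """Resnik similarity: the IC of the least common subsumer."""
    return ic_lcs


def lin(ic1: float, ic2: float, ic_lcs: float) -> float:
    """Lin similarity: 2 * IC(lcs) / (IC(c1) + IC(c2))."""
    total = ic1 + ic2
    # Guard for two zero-IC concepts; returning 0.0 is a choice of this sketch.
    return 2 * ic_lcs / total if total > 0 else 0.0


def jiang_conrath_distance(ic1: float, ic2: float, ic_lcs: float) -> float:
    """Jiang and Conrath distance: IC(c1) + IC(c2) - 2 * IC(lcs)."""
    return ic1 + ic2 - 2 * ic_lcs


# Hypothetical IC values for two concepts and their LCS.
ic_dog, ic_cat, ic_animal = 1.2, 1.6, 0.5

print(resnik(ic_animal))
print(lin(ic_dog, ic_cat, ic_animal))
print(jiang_conrath_distance(ic_dog, ic_cat, ic_animal))
```

Note that Jiang and Conrath define a distance, so lower values mean more similar concepts; it has to be inverted or otherwise transformed before it can be used as a similarity score.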
20. SyMSS - Syntactic Information. Parse tree: phrases, heads of phrases. Head similarity: heads of phrases that have the same syntactic function. Penalization factor: non-shared phrases
21. SyMSS - Model. Sentence 1: “My brother has a dog with four legs”. Sentence 2: “My brother has four legs”. Sim(has, has) = 1; Sim(brother, brother) = 1; Sim(dog, leg) = 0.1414; PF = 0.03
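The slide gives the ingredients but not how they combine. The sketch below assumes the score is the mean of the head similarities over phrases with the same syntactic function, minus the penalization factor PF for each non-shared phrase. It also assumes that “with four legs” is the only non-shared phrase in this pair. Both assumptions and the function name are illustrative, not taken from the slide.

```python
def symss_score(head_sims, non_shared_phrases, penalization_factor):
    """Assumed combination: mean head similarity minus PF per non-shared phrase."""
    if not head_sims:
        return 0.0
    mean_sim = sum(head_sims) / len(head_sims)
    return mean_sim - non_shared_phrases * penalization_factor


# Values from the slide: verb, subject and object head similarities, and PF.
head_sims = [1.0, 1.0, 0.1414]  # Sim(has, has), Sim(brother, brother), Sim(dog, leg)
score = symss_score(head_sims, non_shared_phrases=1, penalization_factor=0.03)
print(round(score, 4))  # 0.6838 under the assumptions above
```

The low dog/leg similarity pulls the score down even though subject and verb match exactly, which is the effect a pure word-overlap measure would miss.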
22. SyMSS - Evaluation. DataSet: 30 pairs out of the 65 human-evaluated pairs. Correlation values:
23. SyMSS - Effect of Adverbs and Adjectives. Sentence 1: “I have a big dog”. Sentence 2: “I have a little dog”. 8.68% gain in SyMSS with HSO
24. SyMSS - Summary. Syntax-based similarity considers nouns and verbs, and the influence of adjectives and adverbs. Limitations: depends on the parsed structure (e.g. input that is not grammatically correct); depends on word similarity
25. Conclusion. No established method for short text. Parsing of phrases is difficult. Concept similarity depends on the model. Weak model, e.g. xebr: Extraordinary Income and xebr: Other Operating Income -> path length = 0.2 and expert = 0.8. Need a syntactic similarity for concept tags (word or phrase)