Extracting Key Terms From Noisy and Multi-theme Documents

  1. Extracting Key Terms From Noisy and Multi-theme Documents. Maria Grineva, Maxim Grinev and Dmitry Lizorkin, Institute for System Programming of RAS
  2. Outline <ul><li>Key terms extraction: traditional approaches and applications</li></ul><ul><li>Using Wikipedia as a knowledge base for Natural Language Processing</li></ul><ul><li>Main techniques of our approach:</li></ul><ul><ul><li>Wikipedia-based semantic relatedness</li></ul></ul><ul><ul><li>Network analysis algorithm to detect community structure in networks</li></ul></ul><ul><li>Our method</li></ul><ul><li>Experimental evaluation</li></ul>
  3. Key Terms Extraction <ul><li>Basic step for various NLP tasks:</li></ul><ul><ul><li>document classification</li></ul></ul><ul><ul><li>document clustering</li></ul></ul><ul><ul><li>text summarization</li></ul></ul><ul><ul><li>inferring a more general topic of a text document</li></ul></ul><ul><li>Core task of Internet content-based advertising systems, such as Google AdSense and Yahoo! Contextual Match</li></ul><ul><ul><li>Web pages are typically noisy (side bars/menus, comments, announcements of upcoming content, etc.)</li></ul></ul><ul><ul><li>Web pages are often multi-theme (portal home pages, etc.)</li></ul></ul>
  4. Approaches to Key Terms Extraction <ul><li>Based on statistical learning:</li></ul><ul><ul><li>use features such as a frequency criterion (TFxIDF model), keyphrase frequency, and the distance between terms normalized by the number of words in the document (KEA)</li></ul></ul><ul><ul><li>compute statistical features over the Wikipedia corpus (Wikify!)</li></ul></ul><ul><ul><li>require a training set</li></ul></ul><ul><li>Based on analyzing syntactic or semantic term relatedness within a document:</li></ul><ul><ul><li>compute semantic relatedness between terms (using, for example, Wikipedia)</li></ul></ul><ul><ul><li>model the document as a semantic graph of terms and apply graph analysis techniques to it (TextRank)</li></ul></ul><ul><ul><li>no training set required</li></ul></ul>
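To make the frequency criterion concrete, here is a minimal TFxIDF sketch. It assumes documents are already tokenized and that the scored document belongs to the corpus; the toy corpus and the function name `tf_idf` are illustrative and do not come from the slides.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each term of one document: term frequency times inverse document frequency."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        # Number of corpus documents containing the term (at least 1 by assumption).
        df = max(sum(1 for d in corpus if term in d), 1)
        scores[term] = (count / len(doc_tokens)) * math.log(n_docs / df)
    return scores

# Toy corpus, made up for illustration.
corpus = [
    ["apple", "itunes", "the"],
    ["the", "blind", "apple"],
    ["the", "weather"],
]
scores = tf_idf(corpus[0], corpus)
print(sorted(scores, key=scores.get, reverse=True))
```

A term that occurs in every document ("the") scores zero, while a term unique to the document ("itunes") ranks first.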
  5. Using Wikipedia as a Knowledge Base for Natural Language Processing <ul><li>Wikipedia (www.wikipedia.org): a free, open encyclopedia</li></ul><ul><ul><li>Today Wikipedia is the biggest encyclopedia (more than 2.7 million articles in the English Wikipedia)</li></ul></ul><ul><ul><li>It is kept up to date by millions of editors all over the world</li></ul></ul><ul><ul><li>It has a huge network of cross-references between articles and a large number of categories, redirect pages and disambiguation pages => a rich resource for bootstrapping NLP and IR tasks</li></ul></ul>
  6. Basic Techniques of Our Method: Semantic Relatedness of Terms <ul><li>Semantic relatedness assigns a score to a pair of terms that represents how strongly the two terms are related</li></ul><ul><li>We use Wikipedia to compute the semantic relatedness of terms</li></ul><ul><li>We use semantic relatedness to model a document as a graph of terms</li></ul>
  7. Basic Techniques of Our Method: Semantic Relatedness of Terms <ul><li>Wikipedia-based semantic relatedness of two terms can be computed using:</li></ul><ul><ul><li>the links found within their corresponding Wikipedia articles</li></ul></ul><ul><ul><li>the Wikipedia category structure</li></ul></ul><ul><ul><li>the articles' textual content</li></ul></ul><ul><li>We use the Dice measure for Wikipedia-based semantic relatedness</li></ul>
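As a rough illustration of the Dice measure over article links, the sketch below treats each Wikipedia article as a plain set of linked article titles. The link sets are invented for the example, and the plain unweighted form is an assumption; the slides do not say whether links are weighted.

```python
def dice_relatedness(links_a, links_b):
    """Dice coefficient of two link sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical link sets, not real Wikipedia data.
apple_inc_links = {"ITunes", "IPod", "Steve Jobs", "Macintosh"}
itunes_links = {"Apple Inc.", "IPod", "Steve Jobs", "Music"}

relatedness = dice_relatedness(apple_inc_links, itunes_links)
print(relatedness)  # 0.5: two shared links out of 4 + 4
```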
  8. Basic Techniques of Our Method: Detecting Community Structure in Networks <ul><li>We discover communities of terms in a document graph</li></ul><ul><li>Community: a densely interconnected group of nodes in a network</li></ul><ul><li>Girvan-Newman algorithm for detecting community structure in networks:</li></ul><ul><ul><li>betweenness: how much an edge lies "in between" different communities</li></ul></ul><ul><ul><li>modularity: a partition is a good one if there are many edges within communities and only a few between them</li></ul></ul>
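The two ingredients named above can be combined into a small Girvan-Newman sketch: repeatedly remove the edge with the highest betweenness and keep the partition with the best modularity. This is a plain-Python illustration for an unweighted, undirected graph without self-loops or duplicate edges, not the authors' implementation; the toy graph and all function names are assumptions.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes-style edge betweenness (each node pair is counted from both ends)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist = {source: 0}
        paths = {n: 0 for n in adj}  # number of shortest paths from source
        paths[source] = 1
        preds = {n: [] for n in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = {n: 0.0 for n in adj}
        for w in reversed(order):  # push path credit back towards the source
            for v in preds[w]:
                share = paths[v] / paths[w] * (1 + credit[w])
                bet[frozenset((v, w))] += share
                credit[v] += share
    return bet

def components(adj):
    """Connected components of the current graph."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in part:
                part.add(n)
                stack.extend(adj[n] - part)
        seen |= part
        parts.append(part)
    return parts

def modularity(edges, parts):
    """Q = sum over communities of (inner edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    owner = {n: i for i, part in enumerate(parts) for n in part}
    inner, degree = [0] * len(parts), [0] * len(parts)
    for u, v in edges:
        degree[owner[u]] += 1
        degree[owner[v]] += 1
        if owner[u] == owner[v]:
            inner[owner[u]] += 1
    return sum(inner[i] / m - (degree[i] / (2 * m)) ** 2 for i in range(len(parts)))

def girvan_newman(edges):
    """Remove highest-betweenness edges one by one; return the best-modularity partition."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best_parts = components(adj)
    best_q = modularity(edges, best_parts)
    while any(adj.values()):
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        q = modularity(edges, parts)
        if q > best_q:
            best_parts, best_q = parts, q
    return best_parts, best_q

# Toy graph: two triangles joined by a single bridge edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
parts, q = girvan_newman(edges)
print(sorted(sorted(p) for p in parts), round(q, 3))
```

On the toy graph the bridge edge has the highest betweenness, so it is removed first and the two triangles emerge as the communities.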
  9. Our Method <ul><li>Candidate terms extraction</li></ul><ul><li>Word sense disambiguation</li></ul><ul><li>Building the semantic graph</li></ul><ul><li>Discovering the community structure of the semantic graph</li></ul><ul><li>Selecting valuable communities</li></ul>
  10. Our Method: Candidate Terms Extraction <ul><li>Goal: extract all terms from the document and, for each term, prepare a set of Wikipedia articles that can describe its meaning</li></ul><ul><li>Parse the input document and extract all possible n-grams</li></ul><ul><li>For each n-gram (and its morphological variations), provide a set of Wikipedia article titles</li></ul><ul><ul><li>"drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking</li></ul></ul>
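The candidate extraction step could look roughly like the sketch below. The tiny `TITLE_INDEX` dictionary stands in for a real Wikipedia title index, and `normalize` is a deliberately crude suffix stripper standing in for proper morphological analysis; both are assumptions for illustration only.

```python
import re

# Hypothetical stand-in for a Wikipedia title index: normalized n-gram -> article titles.
TITLE_INDEX = {
    "drink": ["Drink", "Drinking"],
    "apple": ["Apple", "Apple Inc."],
    "screen reader": ["Screen reader"],
}

def normalize(word):
    """Very crude stemming; a real system would use a morphological analyzer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def candidate_terms(text, max_n=3):
    """Map every n-gram (up to max_n words) found in the index to its candidate articles."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    candidates = {}
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngram = words[i:i + n]
            key = " ".join(normalize(w) for w in ngram)
            if key in TITLE_INDEX:
                candidates[" ".join(ngram)] = TITLE_INDEX[key]
    return candidates

found = candidate_terms("Drinking apple juice with screen readers")
print(found)
```

Terms with more than one candidate title ("drinking", "apple") are the ones that the word sense disambiguation step has to resolve.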
  11. Our Method: Word Sense Disambiguation <ul><li>Goal: choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step</li></ul><ul><li>Use Wikipedia disambiguation and redirect pages to obtain candidate meanings of ambiguous terms</li></ul><ul><li>Denis Turdakov, Pavel Velikhov. "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation". SYRCoDIS, 2008</li></ul>
  12. Our Method: Building Semantic Graph <ul><li>Goal: build the document's semantic graph using semantic relatedness between terms</li></ul>Figure: semantic graph built from a news article "Apple to Make ITunes More Accessible For the Blind"
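One way to picture the graph-building step: every disambiguated term becomes a node, and two terms are connected by an edge weighted with their relatedness. The sketch assumes Dice relatedness over link sets and a cut-off `threshold` parameter; the threshold and the link sets below are illustrative assumptions, not values from the slides.

```python
from itertools import combinations

def dice(a, b):
    """Dice coefficient of two sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def build_semantic_graph(term_links, threshold=0.0):
    """Return {(term1, term2): weight} for every pair whose relatedness exceeds threshold."""
    graph = {}
    for (t1, links1), (t2, links2) in combinations(term_links.items(), 2):
        weight = dice(links1, links2)
        if weight > threshold:
            graph[(t1, t2)] = weight
    return graph

# Hypothetical link sets for four terms of a news article.
term_links = {
    "Apple Inc.": {"ITunes", "IPod", "Macintosh"},
    "ITunes": {"Apple Inc.", "IPod", "Music"},
    "Blindness": {"Braille", "Screen reader"},
    "Screen reader": {"Blindness", "Braille", "Software"},
}
graph = build_semantic_graph(term_links)
for edge, weight in graph.items():
    print(edge, round(weight, 2))
```

With these made-up links, only the two thematically related pairs end up connected, which is the structure the community detection step relies on.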
  13. Our Method: Detecting Community Structure of the Semantic Graph
  14. Our Method: Selecting Valuable Communities <ul><li>Goal: rank term communities so that:</li></ul><ul><ul><li>the highest-ranked communities contain key terms</li></ul></ul><ul><ul><li>the lowest-ranked communities contain unimportant terms and possible disambiguation mistakes</li></ul></ul><ul><li>Use:</li></ul><ul><ul><li>density of a community: the sum of the weights of the community's inner edges divided by the number of vertices in the community</li></ul></ul><ul><ul><li>informativeness: the sum of the keyphraseness measure (a Wikipedia-based TFxIDF analogue) over the community's terms</li></ul></ul><ul><li>Community rank: density * informativeness</li></ul>
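The ranking formula can be written out as a short sketch. It reads "sum of inner edges" as the sum of inner edge weights, and the keyphraseness values and edge weights are made-up numbers; only the formula density * informativeness comes from the slide.

```python
def community_rank(community, edge_weights, keyphraseness):
    """Rank = density * informativeness, following the definitions on the slide."""
    inner = sum(
        w for (u, v), w in edge_weights.items()
        if u in community and v in community
    )
    density = inner / len(community)
    informativeness = sum(keyphraseness.get(term, 0.0) for term in community)
    return density * informativeness

# Hypothetical weighted graph and keyphraseness scores.
edge_weights = {
    ("Apple Inc.", "ITunes"): 0.5,
    ("ITunes", "IPod"): 0.5,
    ("IPod", "Login"): 0.2,  # edge leaving the community, ignored for density
}
keyphraseness = {"Apple Inc.": 0.3, "ITunes": 0.3, "IPod": 0.3, "Login": 0.01}

rank = community_rank({"Apple Inc.", "ITunes", "IPod"}, edge_weights, keyphraseness)
print(round(rank, 3))
```

Here the density is 1.0 / 3 and the informativeness is 0.9, so the community rank comes out at roughly 0.3.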
  15. Our Method: Selecting Valuable Communities <ul><li>In 73% of Web pages, a decline in community scores separates key-term communities from unimportant ones</li></ul>
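The slide does not say how the decline is located. One simple possibility, shown purely as an assumption, is to sort communities by rank and cut at the largest drop between consecutive scores; the community names and scores below are invented.

```python
def split_at_largest_drop(scores):
    """Keep the communities ranked above the largest gap between consecutive scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return [name for name, _ in ranked]
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1
    return [name for name, _ in ranked[:cut]]

# Hypothetical community ranks for one Web page.
community_scores = {"itunes": 0.9, "accessibility": 0.7, "sidebar": 0.1, "footer": 0.05}
key_communities = split_at_largest_drop(community_scores)
print(key_communities)
```

In this example the largest drop lies between 0.7 and 0.1, so the two top communities are kept and the noise communities are discarded.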
  16. Advantages of the Method <ul><li>No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia</li></ul><ul><li>Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages</li></ul><ul><li>Thematically grouped key terms. Significantly improves subsequent inference of document topics using, for example, spreading activation over the Wikipedia category graph</li></ul><ul><li>High accuracy. Evaluated using human judgments (see later in this presentation)</li></ul>
  17. Experimental Evaluation on Noise-free Dataset <ul><li>Classical: TFxIDF, Yahoo! Terms Extractor</li></ul><ul><li>Wikipedia-based: Wikify!, TextRank</li></ul><ul><li>Evaluation on a noise-free dataset (blog posts) using human judgments</li></ul>
  18. Experimental Evaluation on Web Pages <ul><li>Comparison to other methods</li></ul><ul><li>Performance of our method on different kinds of Web pages</li></ul>
  19. Experimental Evaluation on Web Pages <ul><li>Multi-theme stability evaluated on compound Web pages (popular news sites, portal home pages, etc.)</li></ul>
  20. Thank You! Any Questions? Email: [email_address], [email_address], [email_address]