In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches.
Multilingual Text Classification using Ontologies
1. Introduction
Techniques
Overview and Summary
Multilingual Text Classification
Gerard de Melo, Stefan Siersdorfer
Max Planck Institute for Computer Science
Saarbrücken, Germany
2007-04-04
G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification
Text Classification
task: automatically assign text documents to classes (e.g. thematically, geographically)
machine learning algorithms, e.g. SVM, can learn from pre-classified training documents
multilingual case: documents in multiple languages
applications: news wire filtering, library management, e-mail, etc.
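A minimal sketch of learning from pre-classified documents follows. The slide names SVMs; to stay dependency-free this example swaps in a much simpler nearest-centroid classifier over term counts. The function names, the toy training set and the class labels are all illustrative assumptions, not taken from the talk.

```python
import math
from collections import Counter, defaultdict

def bag_of_words(text):
    """Lower-cased, whitespace-tokenised term counts (toy tokeniser)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labelled_docs):
    """Sum the term counts of all training documents per class."""
    centroids = defaultdict(Counter)
    for text, label in labelled_docs:
        centroids[label].update(bag_of_words(text))
    return dict(centroids)

def classify(centroids, text):
    """Return the class whose centroid is most similar to the document."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical pre-classified training documents (English only).
TRAINING = [
    ("the striker scored a late goal", "sports"),
    ("the team won the match", "sports"),
    ("parliament passed the budget law", "politics"),
    ("the minister announced an election", "politics"),
]
MODEL = train(TRAINING)

print(classify(MODEL, "a goal in the final match"))
# A Spanish document shares no terms with the English training data,
# so every class similarity is zero and the prediction carries no evidence.
print(cosine(bag_of_words("el equipo ganó el partido"), MODEL["sports"]))
```

The second call illustrates the multilingual case: with plain term features, a model trained in one language has nothing to match against in another.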
Machine Translation
Mapping to Semantic Concept
Weight Propagation
Machine Translation for Multilingual TC
idea: simply translate all documents into a single language L_I (prior work by Jalam 2002, Rigutini et al. 2005)
shortcomings of this approach
lexical variety in L_I (English: huge vocabulary, many synonyms)
variety of expression in source languages
lexical ambiguity in L_I (unnecessary introduction of additional ambiguity)
example: equivalent source words can end up as different terms in the target language
Spanish coche → car
French voiture → automobile
Semantic Concepts
Idea
map all words to semantic concepts (e.g. WordNet synsets), thus distinguishing different senses of a word while identifying synonyms
disambiguate using context information
construct feature vectors by counting occurrences of concepts rather than terms
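The idea above can be sketched as follows. The lexicon, the concept identifiers and the cue-based disambiguation rule are hypothetical stand-ins for a real aligned wordnet and a real word sense disambiguation method; they only illustrate how documents in different languages can land on the same concept features.

```python
from collections import Counter

# Hypothetical aligned lexicon: (language, lemma) -> candidate concept IDs.
LEXICON = {
    ("en", "car"): ["c:automobile", "c:railcar"],
    ("en", "automobile"): ["c:automobile"],
    ("en", "road"): ["c:road"],
    ("en", "train"): ["c:train"],
    ("es", "coche"): ["c:automobile"],
    ("fr", "voiture"): ["c:automobile", "c:railcar"],
    ("fr", "route"): ["c:road"],
}

# Hypothetical context cues: concepts that make a given sense more plausible.
CUES = {
    "c:automobile": {"c:road"},
    "c:railcar": {"c:train"},
}

def concept_vector(tokens, lang):
    """Count concept occurrences instead of term occurrences."""
    candidates = [LEXICON.get((lang, t), []) for t in tokens]
    counts = Counter()
    for i, senses in enumerate(candidates):
        if not senses:
            continue  # word not covered by the lexicon
        # Context = candidate concepts of all other words in the document.
        context = {c for j, other in enumerate(candidates) if j != i for c in other}
        # Pick the sense with the most cue overlap; ties go to the first listed sense.
        best = max(senses, key=lambda s: len(CUES.get(s, set()) & context))
        counts[best] += 1
    return counts

print(concept_vector(["car", "road"], "en"))
print(concept_vector(["voiture", "route"], "fr"))
print(concept_vector(["car", "train"], "en"))
```

In this toy setup the English and French documents yield identical feature vectors, while the same word "car" is mapped to a different concept when the context mentions a train.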
Problems
understemming
polysemy: highly related senses are treated as distinct
incongruent concepts between languages
variety of expression
lexical lacunae
Language  Sentence              Literal gloss
English   I have a headache     I have a headache
Spanish   Me duele la cabeza    *It hurts the head to me
French    J’ai mal à la tête    *I have pain at the head
Weight Propagation
propagate weight from original concepts to related concepts
for each concept c, choose the path to c that maximizes its weight
a Dijkstra-like algorithm assigns the maximal possible weight to each concept
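A minimal sketch of such a Dijkstra-like propagation is given below. It assumes that every relation carries a weight in (0, 1], that weights multiply along a path, and that propagation stops below a cut-off; these three choices, the function name and the toy graph are assumptions made for illustration and are not specified on the slide.

```python
import heapq

def propagate(graph, seeds, threshold=0.1):
    """Assign each reachable concept the maximal weight over all paths.

    graph: concept -> list of (neighbour, edge_weight), edge_weight in (0, 1]
    seeds: concept -> initial weight of the original concepts
    """
    best = {}
    # Max-heap via negated weights: the heaviest open concept is settled first.
    heap = [(-w, c) for c, w in seeds.items()]
    heapq.heapify(heap)
    while heap:
        neg_w, concept = heapq.heappop(heap)
        if concept in best:
            continue  # already settled with a weight at least as large
        weight = -neg_w
        best[concept] = weight
        for neighbour, edge_weight in graph.get(concept, []):
            candidate = weight * edge_weight  # weight decays along the path
            if neighbour not in best and candidate >= threshold:
                heapq.heappush(heap, (-candidate, neighbour))
    return best

# Hypothetical concept graph with relation weights.
GRAPH = {
    "car": [("motor vehicle", 0.8), ("wheel", 0.5)],
    "motor vehicle": [("vehicle", 0.8), ("truck", 0.7)],
    "wheel": [("vehicle", 0.3)],
}

weights = propagate(GRAPH, {"car": 1.0})
for concept, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{concept}: {weight:.2f}")
```

Because edge weights never exceed 1, a path can only lose weight as it grows, which is what lets the greedy heaviest-first order settle each concept with its maximal weight, in the same way Dijkstra's algorithm settles shortest distances. Here "vehicle" is reached both via "motor vehicle" and via "wheel", and keeps the heavier of the two path weights.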
Overview and Summary
Ontology Region Mapping
1. optionally translate the documents, or use a multilingual lexical resource (aligned wordnets)
2. map terms to concepts
3. search for highly related concepts
entire regions of concepts are relevant, so propagate a part of the concept’s weight to related concepts