The document proposes an approach to automatically extract conceptual taxonomies from text using multiple cooperating techniques. Key aspects of the approach include identifying relevant concepts, generalizing similar concepts, and performing reasoning by concept association. Preliminary experiments show promise, but extensions are needed to improve concept descriptions, representation of relations, and similarity measures. Future work is outlined to address limitations and refine the approach.
1. Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica
Cooperating Techniques for
Extracting Conceptual Taxonomies from Text
S. Ferilli, F. Leuzzi, F. Rotella
L.A.C.A.M.
http://lacam.di.uniba.it:8000
AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence
Workshop on Mining Complex Patterns (MCP 2011)
Palermo, Italy, September 17, 2011
2. Overview
1. Introduction & Objectives
2. Extraction of knowledge from text
3. Knowledge representation formalism
4. Identification of relevant concepts
5. Generalization of similar concepts
6. Reasoning ‘by association’
7. Conclusions & Future work
Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 2
3. Introduction
The spread of electronic documents and document
repositories has generated the need for automatic techniques
to understand and handle the documents content in order to
help users in satisfying their information needs.
Full Text Understanding is not trivial, due to:
1. the intrinsic ambiguity of natural language;
2. the huge amount of common sense and conceptual background
knowledge it requires.
Lexical and/or conceptual taxonomies are useful for facing
these problems, but building them manually is very costly
and error-prone.
4. Introduction
The lack of such taxonomies is a strong motivation towards
the automatic construction of conceptual
networks by mining large amounts of
documents in natural language.
However, even assuming a correct
knowledge representation, we are
still far from simulating human abilities.
5. Objectives
1. Definition of a representation formalism for knowledge
extracted from natural language texts
2. Extraction of concepts and relevance assessment
3. Generalization of concepts having similar descriptions
4. Definition of a kind of reasoning by concept association that
looks for possible indirect connections between two
identified concepts
6. Extraction of knowledge from text
Knowledge is extracted by processing each sentence separately.
[Figure: processing pipeline, Stanford Parser [1] → Stanford Dependencies [2]]
The final output of the Stanford Dependencies is a typed
syntactic structure of each sentence.
7. Knowledge representation formalism
Among all grammatical roles played by words in a sentence,
only subject, verb and complement have been considered.
In the final conceptual graph, subjects and complements will
represent concepts, while verbs will express relations between
them.
[Figure: <subject, verb, complement> triples, with subject and complement as concepts linked by the verb]
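The mapping from triples to a conceptual graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_concept_graph` and the sample triples are invented here, and the triples are assumed to have already been obtained from the dependency parse.

```python
from collections import defaultdict


def build_concept_graph(triples):
    """Map (subject, verb, complement) triples to a labelled graph.

    Nodes are concepts (subjects and complements); the edge between two
    concepts carries the set of verbs that relate them.
    """
    nodes = set()
    edges = defaultdict(set)
    for subject, verb, complement in triples:
        nodes.update((subject, complement))
        edges[(subject, complement)].add(verb)
    return nodes, dict(edges)


# Hypothetical triples, invented for illustration only.
triples = [
    ("user", "access", "network"),
    ("user", "use", "network"),
    ("network", "connect", "computer"),
]
nodes, edges = build_concept_graph(triples)
```

Keeping all verbs on an edge, rather than a single one, matches the slides' remark that choosing one preferred verb is left as future work.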
8. Identification of relevant concepts
A mix of several techniques is brought to cooperation for
identifying relevant concepts:
● Hub Words [3]: words having high frequency, whose relevance is
computed as:
$$ W(t) = \alpha w_0 + \beta n + \gamma \sum_{i=1}^{n} w(t_i) $$
where: w_0 is the initial weight; n is the number of relationships;
w(t_i) is the tf*idf weight of the i-th word related to t.
● Keyword extraction techniques from single documents.
● EM Clustering provided by Weka [4], based on Euclidean
distance.
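As a worked example of the Hub Words weight, the sketch below evaluates W(t) for one term. It assumes that n equals the number of related words whose tf*idf weights are summed; the function name and all numeric values are made up for illustration.

```python
def hub_word_weight(w0, related_weights, alpha, beta, gamma):
    """W(t) = alpha*w0 + beta*n + gamma*sum of the related words' weights.

    Assumption: n is the number of words related to t, i.e. the length
    of `related_weights`.
    """
    n = len(related_weights)
    return alpha * w0 + beta * n + gamma * sum(related_weights)


# Invented values: initial weight 0.5, two related words.
weight = hub_word_weight(
    w0=0.5, related_weights=[0.2, 0.3], alpha=0.5, beta=0.1, gamma=0.4
)
# 0.5*0.5 + 0.1*2 + 0.4*(0.2 + 0.3) = 0.65, up to floating-point rounding
```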
9. Identification of relevant concepts
Inspired by the Hub Words approach, we have defined a
Relevance Weight made up of five components (A to E):
$$
W(\bar{c}) =
\underbrace{\alpha \frac{w(\bar{c})}{\max_c w(c)}}_{A}
+ \underbrace{\beta \frac{e(\bar{c})}{\max_c e(c)}}_{B}
+ \underbrace{\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}}_{C}
+ \underbrace{\delta \frac{d_M - d(\bar{c})}{d_M}}_{D}
+ \underbrace{\varepsilon \frac{k(\bar{c})}{\max_c k(c)}}_{E}
$$
where: α + β + γ + δ + ε = 1
Nodes in the network are ranked by decreasing Relevance
Weight.
A suitable cut-point in the ranking is determined by choosing
the first item c_k such that:
$$ W(c_k) - W(c_{k+1}) \ge p \cdot \max_{i=0,\dots,n-1} \big( W(c_i) - W(c_{i+1}) \big) $$
where: p ∈ [0,1]
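The cut-point rule can be sketched as below. The function name and the sample weights are hypothetical; the sketch returns the index k of the first item whose gap to its successor reaches p times the largest gap, and leaves open how the items up to k are then used.

```python
def cut_point(weights, p):
    """Index k of the first ranked item whose gap to the next item is at
    least p times the largest gap between consecutive items.

    `weights` are Relevance Weights in any order; they are ranked in
    decreasing order first. Returns None if there are fewer than two items.
    """
    ranked = sorted(weights, reverse=True)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    if not gaps:
        return None
    threshold = p * max(gaps)
    for k, gap in enumerate(gaps):
        if gap >= threshold:
            return k
    return None


# Invented weights: the gaps are roughly 0.1, 0.4, 0.1 and 0.05.
sample = [0.9, 0.8, 0.4, 0.3, 0.25]
strict = cut_point(sample, p=1.0)   # only the largest gap qualifies
loose = cut_point(sample, p=0.2)    # the first gap already qualifies
```

With p = 1.0 the cut falls at the largest gap in the ranking, while smaller values of p allow an earlier, smaller gap to be chosen.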
10. Identification of relevant concepts
Relevance Weight in detail
Definition of the Initial Weight
The whole set of triples <subject,verb,complement> is
represented in a Concepts x Attributes matrix V, recalling the
classical Terms x Documents Vector Space Model.
Resembling tf*idf, each entry is weighted as:
$$ \frac{f_{i,j}}{\sum_k f_{k,j}} \cdot \log \frac{|A|}{|\{ j : c_i \in a_j \}|} $$
Therefore component A is:
$$ \alpha \frac{w(\bar{c})}{\max_c w(c)} $$
where w(c) is the initial weight assigned to node c, computed
according to the above tf*idf schema.
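The tf*idf-like weighting of the matrix can be sketched as follows. The reading of the symbols is an assumption: `freq[i][j]` stands for f_{i,j} (how often concept i occurs with attribute j), the sum over k runs over concepts, and |A| is the number of attributes. How the entries are then aggregated into the node weight w(c) is not stated in the slides and is not modelled here.

```python
import math


def tfidf_matrix(freq):
    """Weight each entry of a Concepts x Attributes frequency matrix.

    entry(i, j) = f[i][j] / sum_k f[k][j] * log(|A| / |{j : f[i][j] > 0}|)
    """
    n_concepts = len(freq)
    n_attributes = len(freq[0])
    column_totals = [
        sum(freq[i][j] for i in range(n_concepts)) for j in range(n_attributes)
    ]
    weighted = []
    for i in range(n_concepts):
        # Number of attributes in which concept i appears.
        support = sum(1 for j in range(n_attributes) if freq[i][j] > 0)
        idf = math.log(n_attributes / support) if support else 0.0
        weighted.append([
            (freq[i][j] / column_totals[j]) * idf if column_totals[j] else 0.0
            for j in range(n_attributes)
        ])
    return weighted


# Invented counts: concept 0 appears in one attribute, concept 1 in both.
V = tfidf_matrix([[2, 0], [1, 1]])
```

A concept that appears with every attribute gets a zero weight, as with terms occurring in every document in the classical scheme.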
11. Identification of relevant concepts
Relevance Weight in detail
Connections Number
Component B considers the number of connections (edges) in
which c is involved:
$$ \beta \frac{e(\bar{c})}{\max_c e(c)} $$
Neighborhood Weight Summary
Component C takes into account the average
initial weight of all neighbors of c:
$$ \gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})} $$
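Components B and C can be computed together from an adjacency structure, as in the sketch below. It assumes an undirected graph in which e(c) is the number of distinct neighbors of c; the function name, the tiny graph and the initial weights are invented for illustration.

```python
def connection_components(adjacency, initial_weight, beta, gamma):
    """Return {concept: (B, C)} for every concept in the graph.

    B = beta * e(c) / max_c e(c)
    C = gamma * (sum of the neighbors' initial weights) / e(c)
    """
    degree = {c: len(neighbors) for c, neighbors in adjacency.items()}
    max_degree = max(degree.values())
    result = {}
    for c, neighbors in adjacency.items():
        b = beta * degree[c] / max_degree if max_degree else 0.0
        if degree[c]:
            avg = sum(initial_weight[n] for n in neighbors) / degree[c]
        else:
            avg = 0.0  # isolated node: no neighbors to average over
        result[c] = (b, gamma * avg)
    return result


# Invented toy graph and initial weights.
adjacency = {
    "network": ["user", "access"],
    "user": ["network"],
    "access": ["network"],
}
initial_weight = {"network": 1.0, "user": 0.5, "access": 0.3}
components = connection_components(adjacency, initial_weight, beta=0.1, gamma=0.3)
```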
12. Identification of relevant concepts
Relevance Weight in detail
Inverse Distance from Center
Component D represents the closeness to the center of the cluster:
$$ \delta \frac{d_M - d(\bar{c})}{d_M} $$
KE Influence
Component E takes into account the outcome of three keyword
extraction (KE) techniques, suitably weighted:
$$ \varepsilon \frac{k(\bar{c})}{\max_c k(c)} $$
where:
$$ k(\bar{c}) = \varsigma\, k_{co\text{-}occurrences}(\bar{c}) + \eta\, k_{synset}(\bar{c}) + \theta\, k_{mvn}(\bar{c}) $$
13. Identification of relevant concepts
Relevance Weight in detail
● KE based on χ² co-occurrences:
$$ k_{co\text{-}occurrences} = \varsigma \frac{\chi^2}{\max_{cluster} \chi^2} $$
● KE based on WordNet Synsets:
$$ k_{synset} = \eta \frac{kw_{synset}}{\max(kw_{synset})} $$
● KE by means of Multivariate Normal Distribution (MVN):
$$ k_{mvn} = \theta \frac{kw_{mvn}}{\max(kw_{mvn})} $$
14. Identification of relevant concepts
Evaluations
Parameter settings:
Test #   α      β      γ      δ      ε      p
1        0.10   0.10   0.30   0.25   0.25   1.0
2        0.20   0.15   0.15   0.25   0.25   0.7
3        0.15   0.25   0.30   0.15   0.15   1.0
Components and Relevance Weight of the selected concepts:
Test #   Concept      A         B       C        D       E       W
1        network      0.100     0.100   0.021    0.178   0.250   0.649
1        access       0.001     0.001   0.154    0.239   0.250   0.646
1        subset       6.32E-4   0.001   0.150    0.239   0.250   0.641
2        network      0.200     0.150   0.0105   0.178   0.250   0.789
3        network      0.150     0.250   0.021    0.146   0.150   0.717
3        user         0.127     0.195   0.022    0.146   0.150   0.641
3        number       0.113     0.187   0.022    0.146   0.150   0.619
3        individual   0.103     0.174   0.020    0.146   0.150   0.594
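As a quick arithmetic check on the tables above, the sketch below verifies that the five weights of each test sum to 1 and that W is the sum of components A to E. Since the slides report rounded values, a small tolerance is assumed; the variable names are illustrative.

```python
# (alpha, beta, gamma, delta, epsilon) per test, copied from the first table.
weights = {
    1: (0.10, 0.10, 0.30, 0.25, 0.25),
    2: (0.20, 0.15, 0.15, 0.25, 0.25),
    3: (0.15, 0.25, 0.30, 0.15, 0.15),
}

# (test, concept, (A, B, C, D, E), reported W), copied from the second table.
rows = [
    (1, "network", (0.100, 0.100, 0.021, 0.178, 0.250), 0.649),
    (1, "access", (0.001, 0.001, 0.154, 0.239, 0.250), 0.646),
    (1, "subset", (6.32e-4, 0.001, 0.150, 0.239, 0.250), 0.641),
    (2, "network", (0.200, 0.150, 0.0105, 0.178, 0.250), 0.789),
    (3, "network", (0.150, 0.250, 0.021, 0.146, 0.150), 0.717),
    (3, "user", (0.127, 0.195, 0.022, 0.146, 0.150), 0.641),
    (3, "number", (0.113, 0.187, 0.022, 0.146, 0.150), 0.619),
    (3, "individual", (0.103, 0.174, 0.020, 0.146, 0.150), 0.594),
]

weight_sums = {test: sum(ws) for test, ws in weights.items()}
deviations = {
    (test, concept): abs(sum(parts) - reported)
    for test, concept, parts, reported in rows
}
max_deviation = max(deviations.values())
```

In test 1, "network" has A = α and B = β, meaning it has both the highest initial weight and the highest number of connections among the ranked concepts.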
15. Generalization of similar concepts
Pairwise clustering
The clustering takes into account the description of each concept,
consisting of a binary vector that represents the presence or absence
(1 or 0, respectively) of a <subject,complement> relation between
the involved concepts. The Hamming distance between two such vectors
provides a similarity evaluation between the corresponding concepts.
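A minimal sketch of the distance computation follows. The vectors are invented, and the normalized variant (distance divided by vector length) is an assumption added here; the slides do not say whether the raw or the normalized distance is compared against the threshold.

```python
def hamming_distance(u, v):
    """Number of positions in which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def normalized_hamming(u, v):
    """Hamming distance divided by the vector length, in [0, 1]."""
    return hamming_distance(u, v) / len(u)


# Invented descriptions: 1 marks a relation with the concept in that column.
concept_a = [1, 0, 1, 1, 0]
concept_b = [1, 1, 1, 0, 0]
raw = hamming_distance(concept_a, concept_b)
scaled = normalized_hamming(concept_a, concept_b)
```

Two concepts with identical descriptions have distance 0, so a low threshold only groups concepts that relate to almost exactly the same set of other concepts.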
16. Generalization of similar concepts
WordNet
WordNet (http://wordnet.princeton.edu/) is an external resource
that has some useful properties:
1. it is a lexical taxonomy;
2. each concept is described as a set of synonyms (synset);
3. synsets are interlinked by means of conceptual-semantic
and lexical relations.
We focus on hypernymy, a relation that links the
current synset to more general ones.
17. Generalization of similar concepts
Taxonomical similarity function
● More general: provides a similarity value on the basis of
common relations, without focusing on the specific path.
● More specific: provides a similarity value on the basis of
common relations, relying on the specific path.
18. Generalization of similar concepts
WSD Domain Driven
One Domain per Discourse assumption: many uses of a word
in a coherent portion of text tend to share the same domain.
● Prevalent domain identification
● Extraction of all synsets for each term
● Extraction of all domains for each synset
● Choice of the synset belonging to the prevalent domain
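The domain-driven disambiguation can be sketched as below. Everything in the data is hypothetical: the synset identifiers, the domain labels and the two lookup tables stand in for WordNet and a domain-labelling resource. Counting one vote per (synset, domain) pair and taking the first matching synset are also assumptions of this sketch.

```python
from collections import Counter


def prevalent_domain(terms, synsets_of, domains_of):
    """Most frequent domain over all synsets of all terms."""
    counts = Counter()
    for term in terms:
        for synset in synsets_of.get(term, []):
            counts.update(domains_of.get(synset, []))
    return counts.most_common(1)[0][0] if counts else None


def disambiguate(terms, synsets_of, domains_of):
    """For each term, pick the first synset labelled with the prevalent domain."""
    domain = prevalent_domain(terms, synsets_of, domains_of)
    chosen = {}
    for term in terms:
        matching = [
            s for s in synsets_of.get(term, [])
            if domain in domains_of.get(s, [])
        ]
        chosen[term] = matching[0] if matching else None
    return domain, chosen


# Invented lookup tables, for illustration only.
synsets_of = {
    "network": ["network#computing", "network#broadcast"],
    "server": ["server#computing", "server#waiter"],
    "user": ["user#person"],
}
domains_of = {
    "network#computing": ["computer_science"],
    "network#broadcast": ["telecommunication"],
    "server#computing": ["computer_science"],
    "server#waiter": ["gastronomy"],
    "user#person": ["computer_science", "person"],
}
domain, chosen = disambiguate(["network", "server", "user"], synsets_of, domains_of)
```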
19. Generalization of similar concepts
Evaluations
Two toy experiments have been performed, with the Hamming
distance threshold set to 0.001 and 0.0001 respectively,
while the taxonomical similarity function threshold has been kept
equal to 0.4.
20. Reasoning ‘by association’
Breadth-First Search
Given two nodes (concepts), a Breadth-First Search starts
from both nodes: each search looks for the other's frontier,
until the two frontiers meet on a common node.
Then the path is rebuilt by going backward from the meeting
node to the two starting nodes.
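The two-sided search described above can be sketched as follows. This is an illustrative version under stated assumptions: the graph is undirected and given as an adjacency dictionary, verb labels on edges are ignored, and the toy graph is invented (loosely following the example concepts of the evaluation slide). The sketch returns a connecting path, without claiming it is the shortest one.

```python
from collections import deque


def _walk_back(parents, node):
    """Follow parent links from `node` back to the root of one search."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def _expand(graph, frontier, parents, other_parents):
    """Expand one BFS level; return a node already reached by the other side."""
    for _ in range(len(frontier)):
        node = frontier.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor in other_parents:
                    return neighbor
                frontier.append(neighbor)
    return None


def bidirectional_bfs(graph, source, target):
    """Path between two concepts found by two BFS runs that meet, or None."""
    if source == target:
        return [source]
    parents_s, parents_t = {source: None}, {target: None}
    frontier_s, frontier_t = deque([source]), deque([target])
    while frontier_s and frontier_t:
        meeting = _expand(graph, frontier_s, parents_s, parents_t)
        if meeting is None:
            meeting = _expand(graph, frontier_t, parents_t, parents_s)
        if meeting is not None:
            # Source side is walked back and reversed; target side follows.
            return _walk_back(parents_s, meeting)[::-1] + _walk_back(parents_t, meeting)[1:]
    return None


# Invented toy graph (undirected: every edge is listed in both directions).
graph = {
    "adult": ["freedom", "platform"],
    "freedom": ["adult"],
    "platform": ["adult", "technology"],
    "technology": ["platform", "internet"],
    "internet": ["technology"],
    "island": [],
}
path = bidirectional_bfs(graph, "adult", "internet")
```

Searching from both ends keeps each frontier small, since each side only has to cover about half of the connecting path.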
21. Reasoning ‘by association’
Evaluations
The table below shows a sample of possible outcomes.
E.g., an interpretation of case 5 can be:
“the adults write about freedom and use platform, that is
recognized as a technology, as well as the internet”.
22. Conclusions
This work proposes an approach to automatically extract a conceptual
taxonomy from natural language texts.
It works by mixing different techniques in order to:
● identify relevant terms/concepts in text;
● generalize similar concepts;
● perform some kind of reasoning “by association”.
Preliminary experiments show that this approach can be viable,
although extensions and refinements are needed.
A reliable outcome might help users to understand the text
content, and machines to automatically perform some kind of
reasoning on the taxonomy.
23. Future work
1. Extending the knowledge representation formalism to
express negation.
2. Defining a strategy to make a better choice of weights in
the Relevance Weight computation.
3. Enriching the adjacency matrix to improve concept
descriptions.
4. Exploring alternatives to the One Domain per Discourse (ODD)
assumption, to overcome its limits.
5. Extending the taxonomical similarity measures, which currently
take into account only the hypernym relation, with other
relations to obtain a more accurate similarity.
6. Defining a strategy to prefer one verb rather than keeping all
of them in the reasoning ‘by association’ phase.
24. References
[1] Dan Klein and Christopher D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.
[2] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure trees. In LREC, 2006.
[3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing an ontology based on hub words. In ISMIS’03, pages 93–97, 2003.
[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.