Natural language descriptors used for categorization appear in conceptual models ranging from folksonomies to ontologies. While some descriptors are simple expressions, others have complex compositional patterns (e.g. ‘French Senators Of The Second Empire’, ‘Churches Destroyed In The Great Fire Of London And Not Rebuilt’). As conceptual models become more complex and decentralized, more content is transferred to unstructured natural language descriptors, which increases terminological variation and reduces the conceptual integration and degree of structure of the model. This work describes a formal representation for complex natural language category descriptors (NLCDs). In the representation, complex categories are decomposed into a graph of primitive concepts, supporting their interlinking and semantic interpretation. A category extractor is built, and the quality of its extraction under the proposed representation model is evaluated.
On the Semantic Representation and Extraction of Complex Category Descriptors
1. On the Semantic Representation and Extraction of Complex Category Descriptors
André Freitas, Rafael Vieira, Edward Curry, Danilo Carvalho, João C. Pereira da Silva
Insight Centre for Data Analytics
NLDB 2014
Montpellier, France
4. Big Data
Vision: More complete data-based picture of the world for
systems and users.
5. “Schema” Growth & Complexity
Fundamental shift in the database landscape
How to build large ‘schemas’?
10s-100s attributes
1,000s-1,000,000s attributes
6. Target Motivational Scenario: Wikipedia
Decentralized content generation
300,000 editors have edited Wikipedia more than 10 times
> 280,000 distinct Natural Language Category Descriptors
(NLCDs)
8. NLCDs
Natural Language Category Descriptors (NLCDs) are
natural language descriptors for sets
Simple NLCDs:
- ‘People’
- ‘Countries’
- ‘Films’
Complex NLCDs:
- ‘French Senators Of The Second Empire’
- ‘United Kingdom Parliamentary Constituencies Represented
By A Sitting Prime Minister’
Goal:
- Parse NLCDs into an integrated structured graph
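To make the goal concrete, the following is a minimal sketch of what decomposing a descriptor into a small graph could look like. It is not the extractor described in this talk: the `RELATION_WORDS` list, the `CategoryGraph` type, and the `parse_nlcd` function are all hypothetical, and the sketch simply splits on a few relation-bearing words and chains the resulting segments.

```python
from dataclasses import dataclass, field

# Hypothetical, deliberately small list of relation-bearing words (an assumption).
RELATION_WORDS = {"of", "in", "by", "from", "for"}


@dataclass
class CategoryGraph:
    core: str                                   # first segment, taken as the category core
    edges: list = field(default_factory=list)   # (source, relation, target) triples


def parse_nlcd(descriptor: str) -> CategoryGraph:
    """Split a descriptor on relation words and chain the segments into edges."""
    segments, relations, current = [], [], []
    for token in descriptor.split():
        # A relation word closes the current segment, if there is one.
        if token.lower() in RELATION_WORDS and current:
            segments.append(" ".join(current))
            relations.append(token.lower())
            current = []
        else:
            current.append(token)
    segments.append(" ".join(current))

    graph = CategoryGraph(core=segments[0])
    # Naive attachment: each segment is linked to the one that follows it.
    for i, relation in enumerate(relations):
        graph.edges.append((segments[i], relation, segments[i + 1]))
    return graph


graph = parse_nlcd("French Senators Of The Second Empire")
print(graph.core)   # French Senators
print(graph.edges)  # [('French Senators', 'of', 'The Second Empire')]
```

A simple NLCD such as ‘People’ yields a core with no edges. The sketch ignores attachment ambiguity, conjunctions and negation (e.g. ‘And Not Rebuilt’), and entity linking, all of which a real extractor would need to handle.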
9. Assumptions
NLCDs as a more syntactically tractable subset of natural
language
NLCDs as a low effort interface for structuring a domain of
discourse
12. Other Examples
IFRS and US GAAP
- ‘Partially owned properties’
- ‘Residential portfolio segment’
- ‘Assets arising from exploration for and evaluation of mineral
resources’
- ‘Key management personnel compensation’
- ‘Other long-term employee benefits’
34. Evaluation Setup
Total of 287,957 English Wikipedia categories (Open Domain
scenario)
Selected random sample of 2,696 categories
Manual evaluation of the core extraction features
- Entity segmentation
- Relation identification
- Unary operators
- Specialization relations
- Category core identification
- Entity core identification
- Word Sense Disambiguation (WordNet)
- Entity linking (DBpedia)
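As an illustration of the sampling step, the sketch below draws a random sample of 2,696 items from 287,957 and reports the fraction covered. The slides do not say how the sample was drawn, so the `draw_sample` helper, the fixed seed, and the use of integer ids in place of category names are assumptions.

```python
import random

TOTAL_CATEGORIES = 287_957  # English Wikipedia categories (from the slide)
SAMPLE_SIZE = 2_696         # manually evaluated categories (from the slide)


def draw_sample(category_ids, k=SAMPLE_SIZE, seed=0):
    """Draw k distinct items uniformly at random; the seed is an arbitrary choice."""
    rng = random.Random(seed)
    return rng.sample(category_ids, k)


# Integer ids stand in for the real category names.
sample = draw_sample(range(TOTAL_CATEGORIES))
sample_fraction = SAMPLE_SIZE / TOTAL_CATEGORIES

print(len(sample))
print(f"{sample_fraction:.2%} of all categories")
```

The sample covers a little under 1% of the category set, which is what makes manual evaluation of each extraction feature feasible.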
35. Results
Performance:
- (i) graph extraction time: 9.8 ms per graph
- (ii) word sense disambiguation: 121.0 ms per word
- (iii) entity linking: 530.0 ms per link
* Measured on an Intel Core i5-3317U (1.70 GHz) machine with 4 GB RAM (2 cores, 2 threads per core).
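The per-step timings above can be combined into a rough cost estimate. The sketch below assumes the three steps run sequentially and that costs add linearly; the function names and the example word and link counts are hypothetical.

```python
# Timings reported on the slide, in milliseconds.
GRAPH_MS = 9.8             # per graph
WSD_MS_PER_WORD = 121.0    # per disambiguated word
LINK_MS_PER_LINK = 530.0   # per entity link

TOTAL_CATEGORIES = 287_957  # size of the English Wikipedia category set


def category_cost_ms(words: int, links: int) -> float:
    """Estimated time for one category, assuming sequential, additive costs."""
    return GRAPH_MS + WSD_MS_PER_WORD * words + LINK_MS_PER_LINK * links


def corpus_graph_extraction_minutes() -> float:
    """Graph extraction only, over the full category set."""
    return TOTAL_CATEGORIES * GRAPH_MS / 1000 / 60


# Hypothetical category with 4 disambiguated words and 1 entity link.
print(round(category_cost_ms(words=4, links=1), 1))   # 1023.8
print(round(corpus_graph_extraction_minutes(), 2))    # 47.03
```

Under these assumptions, graph extraction over the whole category set takes about 47 minutes, while word sense disambiguation and entity linking dominate the per-category cost.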
36. Summary
NLCDs can provide a more tractable (from the IE perspective)
natural language interface for structuring large KBs
We developed an approach for the representation, extraction
and integration of NLCDs
- ~75% extraction accuracy
Limitations:
- Need for a more principled and formal definition of an NLCD
- Need for a better entity recognition and linking approach
Future Work: evaluation under a domain-specific scenario