HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization
1. HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization
Gong Cheng, Cheng Jin, Yuzhong Qu
National Key Laboratory for Novel Software Technology
Nanjing University, China
Websoft
2. Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
3. Scenario: browsing a dataset in an open data portal
https://data.europa.eu/euodp/en/data/dataset/dgt-translation-memory
"I need some insight into the contents, not just metadata."
4. Meeting the challenge with a dataset summary
i.e., an automatically generated, small, high-level abstraction of the data that summarizes the contents of a dataset for quick inspection.
5. Expected features of a dataset summary
• To provide a multigranular abstraction of the data that can be explored incrementally
• To preserve the structural nature of a dataset
• To be comprehensible
6. Constitution of a dataset summary
• An example
  ◦ A hierarchical grouping of entities
  ◦ Relations connecting sibling groups
  ◦ A property-value pair differentiates a group of entities from its sibling groups.
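The constitution described above can be sketched as a small data structure. This is only an illustration in Python, not the authors' implementation: the `Group` class, the `render` helper, the entity identifiers, and the `worksWith` relation are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Group:
    """One node of a hierarchical summary (hypothetical structure)."""
    prop: str                 # property of the distinguishing pair, e.g. "Type"
    value: str                # value of the distinguishing pair, e.g. "Actor"
    entities: Set[str]        # identifiers of the entities in this group
    subgroups: List["Group"] = field(default_factory=list)
    # Relations to sibling groups, stored as (relation name, sibling label).
    relations: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def label(self) -> str:
        return f"{self.prop}:{self.value}"


def render(group: Group, depth: int = 0) -> List[str]:
    """Return an indented text outline of the hierarchy."""
    indent = "  " * depth
    lines = [f"{indent}{group.label} ({len(group.entities)} entities)"]
    for relation, target in group.relations:
        lines.append(f"{indent}  --{relation}--> {target}")
    for sub in group.subgroups:
        lines.extend(render(sub, depth + 1))
    return lines


# Sample data (made up): two sibling groups under one parent group.
actor = Group("Type", "Actor", {"e1", "e2"},
              relations=[("worksWith", "Type:Director")])
director = Group("Type", "Director", {"e3"})
root = Group("Type", "Person", {"e1", "e2", "e3"}, subgroups=[actor, director])

summary_lines = render(root)
print("\n".join(summary_lines))
```

Each level of nesting corresponds to one step of incremental exploration: a reader sees the top-level groups first and expands a group only when more detail is wanted.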
7. Quality of a dataset summary
• Coverage of data
• Height of hierarchy
• Cohesion within groups
• Overlap between groups
• Homogeneity of groups
8. Quality of a dataset summary
• Coverage of data
  ◦ large subgroups, frequent relations
• Height of hierarchy
• Cohesion within groups
• Overlap between groups
• Homogeneity of groups
9. Quality of a dataset summary
• Coverage of data
• Height of hierarchy
  ◦ moderate-sized subgroups
• Cohesion within groups
• Overlap between groups
• Homogeneity of groups
10. Quality of a dataset summary
• Coverage of data
• Height of hierarchy
• Cohesion within groups
  ◦ informative (i.e., less frequent) property-value pairs
• Overlap between groups
• Homogeneity of groups
11. Quality of a dataset summary
• Coverage of data
• Height of hierarchy
• Cohesion within groups
• Overlap between groups
  ◦ controllable overlap
• Homogeneity of groups
12. Quality of a dataset summary
• Coverage of data
• Height of hierarchy
• Cohesion within groups
• Overlap between groups
• Homogeneity of groups
  ◦ different values of the same property
13. Problem formulation: multidimensional knapsack problem (MKP)
• maximizing moderateness of each subgroup
• maximizing cohesion within each subgroup
• disallowing large overlap between subgroups
• selecting ≤ k subgroups
• (optionally) disallowing different properties
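The constraints listed above can be illustrated with a small feasibility check. This is a sketch under stated assumptions, not the formulation from the slides: overlap is measured here with the Jaccard coefficient, the per-candidate `profit` values stand in for whatever combination of moderateness and cohesion the authors use, and all function and parameter names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# A candidate subgroup: (property, value, entities). Hypothetical layout.
Candidate = Tuple[str, str, FrozenSet[str]]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Overlap between two entity sets (assumed measure, not from the source)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_feasible(selection: List[Candidate], k: int, max_overlap: float,
                same_property: bool = False) -> bool:
    """Check the three constraints: at most k subgroups, bounded pairwise
    overlap, and (optionally) a single shared property."""
    if len(selection) > k:
        return False
    if same_property and len({prop for prop, _, _ in selection}) > 1:
        return False
    return all(jaccard(a[2], b[2]) <= max_overlap
               for a, b in combinations(selection, 2))


def objective(selection: List[Candidate],
              profit: Dict[Tuple[str, str], float]) -> float:
    """Total profit of the selected subgroups (profit values are assumed given)."""
    return sum(profit[(prop, value)] for prop, value, _ in selection)


# Made-up candidates for illustration.
actor = ("Type", "Actor", frozenset({"e1", "e2"}))
director = ("Type", "Director", frozenset({"e2", "e3"}))
nanjing = ("Birthplace", "Nanjing", frozenset({"e4"}))

print(is_feasible([actor, director], k=2, max_overlap=0.5))
print(is_feasible([actor, nanjing], k=2, max_overlap=0.5, same_property=True))
```

In knapsack terms, each candidate subgroup is an item, its profit is the quantity being maximized, and each constraint consumes part of a limited budget.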
14. Problem solution
• A greedy strategy is used (sorting candidates by […]), but its efficient implementation is non-trivial.
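A naive version of such a greedy strategy might look like the sketch below. The sort key is not preserved in the slides, so it is passed in as a hypothetical `score` function; the Jaccard-based overlap test and all names are likewise assumptions. The sketch makes no attempt at the efficient implementation the slide refers to: it compares every candidate against every subgroup chosen so far.

```python
from typing import Callable, FrozenSet, List, Tuple

# A candidate subgroup: (label, entities). Hypothetical layout.
Candidate = Tuple[str, FrozenSet[str]]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Overlap between two entity sets (assumed measure, not from the source)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_select(candidates: List[Candidate],
                  score: Callable[[Candidate], float],
                  k: int, max_overlap: float) -> List[Candidate]:
    """Take candidates in descending score order, skipping any that overlap
    too much with an already chosen subgroup, until k are chosen."""
    chosen: List[Candidate] = []
    for cand in sorted(candidates, key=score, reverse=True):
        if len(chosen) == k:
            break
        if all(jaccard(cand[1], c[1]) <= max_overlap for c in chosen):
            chosen.append(cand)
    return chosen


# Made-up candidates; the score prefers subgroups close to a target size of 3,
# which is only one possible reading of "moderate-sized".
candidates = [
    ("Type:Person", frozenset({"e1", "e2", "e3", "e4", "e5", "e6"})),
    ("Type:Actor", frozenset({"e1", "e2", "e3"})),
    ("Type:Director", frozenset({"e3", "e4", "e5"})),
    ("Type:Chair", frozenset({"e6"})),
]


def closeness_to_target(cand: Candidate) -> float:
    return -abs(len(cand[1]) - 3)


picked = greedy_select(candidates, closeness_to_target, k=2, max_overlap=0.3)
print([label for label, _ in picked])
```

Tightening `max_overlap` changes the outcome: with a threshold of 0.1 the second pick is no longer `Type:Director`, because it shares one of five entities with `Type:Actor`.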
15. Experiments
• Baseline: LODeX (ISWC’14)
  ◦ flat grouping
  ◦ biased towards coverage (e.g., Type:Person)
  ◦ redundant information (e.g., Type:Person and Type:Chair)
• Advantages of HIEDS
  ◦ hierarchical grouping
  ◦ trade-off between coverage and cohesion (e.g., Type:Actor)
  ◦ controllable overlap