TYPifier: Inferring the Type Semantics of Structured Data (icde2013)
1. KIT – University of the State of Baden-Württemberg and
National Large-scale Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
TYPifier: Inferring the Type Semantics of Structured Data
Yongtao Ma, Thanh Tran
29th IEEE International Conference on Data Engineering (ICDE2013)
2. Institute of Applied Informatics and Formal Description Methods (AIFB), April 8th, 2013
Contents
Introduction
TYPification Features
TYPification Algorithm
Evaluation
Conclusion
ICDE2013, Brisbane
3. Problem
Type information is missing
Dynamic Web Data
Heterogeneous Enterprise Data
4. Problem
ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format : A4, Letter, B5, A5...Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
p5 | MadMaps Pacific | 8 | Spotitout | Windows Vista / 7 / XP. Media: DVD. It’s a snap to load Pacific Coast GPS Travel Directory by MAD Maps into your GPS device.
p6 | Garmin Maps | 99 | Gamin | Windows Vista / 7 / XP. Media: DVD. Compatible with GPS Garmin Colorado, Dakota, eTrex...Coverage includes detailed maps for traveling in Australia.
p7 | Rosetta Spanish | 399 | Rosetta Stone | Windows Vista / 7 / XP. Media: DVD. Build your vocabulary and language abilities... Discover how to speak, read, write, and understand…
p8 | Learn German | 9 | Innovative | Windows Vista / 7 / XP. Media: DVD. Learn level 9 German vocabulary with the audio playback tool, Listen to the lesson dialog and master the language…
5. Problem
Typification: inferring the type semantics of structured data
6. Contributions
We formulate typification as a clustering problem, where the goal is to identify a particular kind of cluster that represents the types of entities
We propose a solution for automatically computing pseudo-schema features from data
We propose TYPifier, a novel clustering algorithm for the typification problem, which
- is a divisive hierarchical clustering algorithm
- is optimized for (pseudo-)schema-based features
- determines the number of types (clusters) automatically
We show that typification helps to improve data integration
7. FEATURES FOR TYPIFICATION
8. Schema Features
Features characterize a type well if they are:
- Shared by most entities of that type
- Absent from the feature sets of entities that belong to other types
Schema features: labels of attributes or relations
e.g. Resolution, but also HD and LED Tech, for type TV
Advantage: better type indicators
Problem: often missing or scarce
Solution: derive pseudo-schema features
9. Pseudo-schema Features
Words in attribute values that act as schema features
TF-IDF
- Measures the importance of a term for a document, relative to the other documents in the corpus
- Representative of instances rather than types
Learning words in attribute values that are representative of types
ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format : A4, Letter, B5, A5...Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
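The point that TF-IDF is representative of instances rather than types can be illustrated with a minimal sketch. It assumes a plain tf × log(N/df) weighting and hand-shortened token lists loosely based on rows p1 to p4; the `docs` data and the `tf_idf` helper are illustrative names introduced here, not part of TYPifier.

```python
import math
from collections import Counter

# Hypothetical, shortened token lists loosely based on rows p1-p4 of the table.
docs = {
    "p1": "epson e1700 dpi ppm a4 w".split(),
    "p2": "hp 55252 dpi ppm a4 w".split(),
    "p3": "lg 47lm7600 hd led smart tv w".split(),
    "p4": "panasonic l55dt50 hd led smart tv w".split(),
}

def tf_idf(docs):
    """Return {doc_id: {term: tf * log(N / df)}} (assumed weighting variant)."""
    n_docs = len(docs)
    # Document frequency: in how many descriptions does each term occur?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

scores = tf_idf(docs)
# Two highest-weighted terms of p1
top_p1 = sorted(scores["p1"], key=scores["p1"].get, reverse=True)[:2]
print(top_p1)
```

In this toy data the brand and model tokens of p1 receive the highest weights, dpi (shared by both printers) ranks below them, and W, which occurs in every description, gets weight 0. The type-level words are therefore not the ones TF-IDF favours.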
10. Pseudo-schema Features
Feature Co-occurrence Graph
The feature co-occurrence graph is a weighted directed graph G = (N, E, L) with:
- N: the set of words in the attribute values
- E: edges as ordered vertex pairs (n1, n2), indicating that n1 co-occurs with n2 in the description of some instance
- L: edge labels. Let N_n1 and N_n2 be the sets of instances that contain n1 and n2 in their descriptions; the label of edge (n1, n2) is the conditional co-occurrence probability p(n2|n1) = |N_n1 ∩ N_n2| / |N_n1|
11. Pseudo-schema Features
[Figure: feature co-occurrence graph over the words dpi, A4, ppm, W, Smart, TV, LED and HD, with the two edges between W and dpi labelled 0.5 and 1.0]
Instance | W | dpi
p1 | X | X
p2 | X | X
p3 | X |
p4 | X |
N_W = {p1, p2, p3, p4}
N_dpi = {p1, p2}
p(dpi|W) = |N_W ∩ N_dpi| / |N_W| = 2/4 = 0.5
p(W|dpi) = |N_W ∩ N_dpi| / |N_dpi| = 2/2 = 1.0
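The two edge labels in this example can be reproduced with a short sketch. The `instances` mapping and the `cooccurrence_graph` function are hypothetical names introduced here; only the formula p(n2|n1) = |N_n1 ∩ N_n2| / |N_n1| and the W/dpi assignments come from the slides.

```python
from itertools import permutations

# Word sets per instance, restricted to the two words of the W/dpi example.
instances = {
    "p1": {"W", "dpi"},
    "p2": {"W", "dpi"},
    "p3": {"W"},
    "p4": {"W"},
}

def cooccurrence_graph(instances):
    """Return {(n1, n2): p(n2|n1)} for every ordered pair of co-occurring words."""
    containing = {}  # word -> set of instance ids (N_n in the definition)
    for inst_id, words in instances.items():
        for word in words:
            containing.setdefault(word, set()).add(inst_id)
    edges = {}
    for n1, n2 in permutations(containing, 2):
        shared = containing[n1] & containing[n2]
        if shared:  # an edge exists only if the words co-occur somewhere
            edges[(n1, n2)] = len(shared) / len(containing[n1])
    return edges

graph = cooccurrence_graph(instances)
print(graph[("W", "dpi")])  # p(dpi|W)
print(graph[("dpi", "W")])  # p(W|dpi)
```

The asymmetry is the useful part: dpi always comes with W, but W comes with dpi only half of the time, which is why the graph is directed.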
12. Pseudo-schema Features
Two words v1 and v2 are considered co-occurring if p(v2|v1) > θ and p(v1|v2) > θ
[Figure: the co-occurrence graph over dpi, A4, ppm, W, Smart, TV, LED and HD, with edge labels 0.5 and 1.0, filtered with θ = 0.50]
13. Pseudo-schema Features
[Figure: "Maximum Clique" highlighted in the filtered graph over the words W, ppm, dpi, A4, HD, TV, Smart and LED]
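The filtering and clique steps can be sketched as follows. The slides do not say which clique algorithm is used, so this sketch swaps in a basic Bron-Kerbosch enumeration of maximal cliques as an assumption. The function names and most probabilities in `probs` are hypothetical; only the W/dpi values and θ = 0.5 come from the example.

```python
def mutual_edges(probs, theta):
    """Keep an undirected edge {v1, v2} only if p(v2|v1) > theta and p(v1|v2) > theta."""
    adj = {}
    for (v1, v2), prob in probs.items():
        if prob > theta and probs.get((v2, v1), 0.0) > theta:
            adj.setdefault(v1, set()).add(v2)
            adj.setdefault(v2, set()).add(v1)
    return adj

def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration (no pivoting); an assumed stand-in algorithm."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques

# Keys are (n1, n2) and values are p(n2|n1); printer-word values are hypothetical.
probs = {
    ("dpi", "ppm"): 1.0, ("ppm", "dpi"): 1.0,
    ("dpi", "A4"): 1.0, ("A4", "dpi"): 1.0,
    ("ppm", "A4"): 1.0, ("A4", "ppm"): 1.0,
    ("W", "dpi"): 0.5, ("dpi", "W"): 1.0,   # values from the W/dpi example
}
adj = mutual_edges(probs, theta=0.5)
cliques = maximal_cliques(adj)
print(cliques)
```

Because p(dpi|W) = 0.5 is not strictly greater than θ = 0.5, the edge between W and dpi is dropped. W then stays out of the printer clique even though every printer description mentions it.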
14. TYPIFICATION ALGORITHM
15. Clusters
A cluster is defined as a tuple C(F, N, S)
- F: the set of (pseudo-)schema features
- N: the set of all entities that have an element of F as a feature
- S: the set of clusters that are child or descendant nodes of C
Cluster Distance, defined over the following quantities:
- count(fi, fj): the co-occurrence count of features fi and fj
- |N_f^E|: the count of entities having f as a feature
- Ni (Nj): the entity set associated with Ci (Cj)
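The cluster tuple and the two counts used by the distance can be sketched as a small data structure. This is an illustration under stated assumptions: the class and function names and the `entity_features` data are hypothetical, and the distance formula itself is not reproduced because it does not appear in the slide text.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    features: set                                     # F: (pseudo-)schema features
    entities: set = field(default_factory=set)        # N: entities with a feature in F
    descendants: list = field(default_factory=list)   # S: child/descendant clusters

def build_cluster(features, entity_features):
    """Collect N: every entity that has at least one element of F as a feature."""
    features = set(features)
    entities = {e for e, fs in entity_features.items() if features & set(fs)}
    return Cluster(features=features, entities=entities)

def cooccurrence_count(fi, fj, entity_features):
    """count(fi, fj): number of entities that have both features."""
    return sum(1 for fs in entity_features.values() if fi in fs and fj in fs)

def entity_count(f, entity_features):
    """Number of entities having f as a feature."""
    return sum(1 for fs in entity_features.values() if f in fs)

# Hypothetical feature assignments for four products.
entity_features = {
    "p1": {"Power", "Resolution", "Print Speed"},
    "p2": {"Power", "Resolution"},
    "p3": {"Power", "HD", "LED"},
    "p4": {"Power", "HD"},
}
printer = build_cluster({"Resolution", "Print Speed"}, entity_features)
print(sorted(printer.entities))
print(cooccurrence_count("Resolution", "Power", entity_features))
```

Note that membership in N needs only one matching feature, so p2 belongs to the printer cluster even though it lacks Print Speed.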
16. Cluster Relation
Four cluster relations
- Ci > (>>) Cj: Ci is a parent (ancestor) of Cj
- Ci < (<<) Cj: Ci is a child (descendant) of Cj
- Ci = Cj: Ci and Cj represent the same cluster
- Ci ≠ Cj: there is no relation between Ci and Cj
[Table columns: Evidence, No counter-evidence]
17. Typification
[Figure: step-by-step run of the algorithm over the features Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level and Language, tracking the sets S* and C through states 0 to 4; figure labels: "Children or Descendants of the root", "Siblings of the root"]
Step 1: Power < Root, add and split clusters
Step 2: Resolution < Power, add and split clusters
Step 3: Print Speed = Resolution, merge
Step 4: Split entities
18. EVALUATION
19. Evaluation
Baselines
- Hierarchical: BIRCH
- Partitional: K-means++
- Kernel-based: SVC
- Density-based: OPTICS
Datasets
- BTC
- DBpedia (DBP)
- Product Data (P), in three variants:
- PPS: using pseudo-schema features
- PTFIDF: using TF-IDF features
- PD: using all words
Dataset | Entity | Triple | Schema Feature | Type | Hierarchy | PS Features
BTC | 334,661 | 2,991,411 | 537 | 163 | 0 | -
DBP | 3,600 | 49,751 | 146 | 16 | 5 | -
PPS | 22,331 | 111,647 | 5 | 6 | 0 | 136
PTFIDF | 22,331 | 111,647 | 5 | 6 | 0 | 7,211
PD | 22,331 | 111,647 | 5 | 6 | 0 | 18,917
20. Efficiency
[Figure: running time in ms (log scale, 1 to 10,000,000) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
TYPifier, K-means++ and BIRCH are similar in efficiency
Pseudo-schema features help to improve efficiency
21. Effectiveness
[Figure: F-measure (%) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
TYPifier outperforms the baselines
- +33.92% in F-measure (compared to the second best)
Pseudo-schema features outperform other types of features
- +86.15% in F-measure (compared to the second best)
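For readers unfamiliar with the metric, a minimal sketch of an F-measure computation follows. It assumes the standard balanced form (the harmonic mean of precision and recall); the slides do not state which variant is used or how precision and recall are derived from clusters, and the input values below are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values, not taken from the evaluation.
print(f_measure(0.8, 0.5))
```

The harmonic mean stays close to the smaller of the two inputs, so a method cannot score well by trading away recall for precision or the reverse.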
22. Hierarchies
TYPifier outperforms the baselines
[Figure: the original hierarchies next to the hierarchies generated by OPTICS, BIRCH and TYPifier]
Tree Edit Distance
TYPifier | OPTICS | BIRCH
12 | 14 | 24
23. Parameter Sensitivity
Precision improves with higher θ, because pseudo-schema features become more representative
Recall improves as θ increases at low values but drops at high values, because fewer and fewer pseudo-schema features can be generated
[Figure: precision (%) and recall (%) as functions of θ (0.1 to 0.5) for TYPifier, KMeans++ and BIRCH]
24. Parameter Sensitivity
The sensitivity to ε depends on feature correlations
Higher ε leads to better precision and recall
An extremely high ε may lead to poor-quality hierarchies
[Figure: precision (%) and recall (%) as functions of ε (0.1 to 0.9) for DBP, BTC, P_PS and P_TFIDF]
25. Conclusion
Introduced and formulated typification as a clustering problem
Learning of pseudo-schema features from data
A divisive hierarchical clustering solution for typification
TYPifier outperforms the baselines by +33.92% in F-measure
Pseudo-schema features are also essential for the baselines (they outperform other types of features by +86.15% in F-measure)
TYPifier generates not only clusters but also hierarchies that closely match the human conceptualization / ground-truth model
26. Thank you for your attention! Questions?
Thanh Tran, https://sites.google.com/site/kimducthanh/