KIT – University of the State of Baden-Württemberg and
National Large-scale Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
TYPifier: Inferring the Type Semantics of Structured Data
Yongtao Ma, Thanh Tran
29th IEEE International Conference on Data Engineering (ICDE2013)
Institute of Applied Informatics and Formal Description Methods (AIFB), April 8th, 2013
Contents
Introduction
TYPification Features
TYPification Algorithm
Evaluation
Conclusion
ICDE2013, Brisbane
Problem
Type information is missing
Dynamic Web Data
Heterogeneous Enterprise Data
ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format: A4, Letter, B5, A5... Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
p5 | MadMaps Pacific | 8 | Spotitout | Windows Vista / 7 / XP. Media: DVD. It’s a snap to load Pacific Coast GPS Travel Directory by MAD Maps into your GPS device.
p6 | Garmin Maps | 99 | Gamin | Windows Vista / 7 / XP. Media: DVD. Compatible with GPS Garmin Colorado, Dakota, eTrex... Coverage includes detailed maps for traveling in Australia.
p7 | Rosetta Spanish | 399 | Rosetta Stone | Windows Vista / 7 / XP. Media: DVD. Build your vocabulary and language abilities... Discover how to speak, read, write, and understand…
p8 | Learn German | 9 | Innovative | Windows Vista / 7 / XP. Media: DVD. Learn level 9 German vocabulary with the audio playback tool, Listen to the lesson dialog and master the language…
Typification: inferring the type semantics of structured data
Contributions
We formulate Typification as a clustering problem, where the goal is to identify a particular kind of cluster that represents the types of entities
We propose a solution for automatically computing pseudo-schema features from data
We propose TYPifier, a novel clustering algorithm for the typification problem, which
- is a divisive hierarchical clustering algorithm
- is optimized for (pseudo-)schema-based features
- determines the number of types (clusters) automatically
We show that typification helps to improve data integration!
FEATURES FOR TYPIFICATION
Schema Features
Features characterize a type well if they are:
- shared by most entities of that type
- not in the feature sets of entities that belong to other types
Schema features: labels of attributes or relations
e.g. Resolution, but also HD and LED Tech for type TV
Advantages: better type indicators
Problems: missing, scarce
Solution: derive pseudo-schema features
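The two criteria for a good type feature can be made concrete with two ratios. The following is a minimal sketch, not the authors' method: the function names `coverage` and `exclusivity`, the toy entities and the type labels are all hypothetical and serve only to illustrate the idea.

```python
def coverage(feature, entities, types, target):
    """Fraction of entities of the target type that carry the feature."""
    members = [e for e in entities if types[e] == target]
    return sum(feature in entities[e] for e in members) / len(members)


def exclusivity(feature, entities, types, target):
    """Fraction of entities carrying the feature that are of the target type."""
    carriers = [e for e in entities if feature in entities[e]]
    return sum(types[e] == target for e in carriers) / len(carriers)


# Hypothetical toy data: entity -> set of schema features
entities = {
    "p1": {"Power", "Print Speed"},
    "p2": {"Power", "Print Speed"},
    "p3": {"Power", "HD", "LED"},
    "p4": {"Power", "HD", "LED"},
}
types = {"p1": "Printer", "p2": "Printer", "p3": "TV", "p4": "TV"}

hd_coverage = coverage("HD", entities, types, "TV")
hd_exclusivity = exclusivity("HD", entities, types, "TV")
power_exclusivity = exclusivity("Power", entities, types, "TV")
```

In this toy data HD is both shared by all TVs and absent elsewhere, while Power is shared by every entity and therefore says little about the type.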
Pseudo-schema Features
Words in attribute values that act as schema features
TF-IDF
- measures the importance of a term for a document, relative to the others in the corpus
- is representative for instances rather than types
Goal: learn words in attribute values that are representative for types
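A small computation shows why TF-IDF tends to favour instance-specific words. This is a sketch under assumptions: it uses the common tf × log(N/df) weighting, and the token lists are abridged by hand from the product descriptions in the example table.

```python
import math
from collections import Counter

# Abridged, hand-tokenized descriptions (an assumption for illustration)
docs = {
    "p1": "epson e1700 600 dpi 10 ppm a4".split(),
    "p2": "hp 55252 3600 dpi 30 ppm a4".split(),
    "p3": "lg 47lm7600 full hd smart tv led".split(),
    "p4": "panasonic l55dt50 full hd smart tv led".split(),
}


def tf_idf(term, doc_id):
    """Term frequency in one document times log inverse document frequency."""
    tf = Counter(docs[doc_id])[term] / len(docs[doc_id])
    df = sum(term in words for words in docs.values())
    return tf * math.log(len(docs) / df)


score_model = tf_idf("e1700", "p1")  # occurs only in p1
score_dpi = tf_idf("dpi", "p1")      # shared by both printers
```

The model number `e1700` scores higher than `dpi`, although `dpi` is the word that indicates the type Printer.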
Pseudo-schema Features
Feature Co-occurrence Graph
The Feature Co-occurrence Graph is a weighted directed graph G = (N, E, L) with:
- N: the set of words in the attribute values
- E: edges as ordered vertex pairs (n1, n2), indicating that n1 co-occurs with n2 in the description of some instance
- L: edge labels. Let N_n1 and N_n2 be the sets of instances that contain n1 and n2 in their description; the edge label is the conditional co-occurrence probability p(n2|n1) = |N_n1 ∩ N_n2| / |N_n1|
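The graph definition translates directly into code. The sketch below is one possible reading, not the authors' implementation: the function name `cooccurrence_graph` and the dictionary representation of edges are assumptions, and the input mirrors the W/dpi example used in the slides.

```python
from itertools import permutations


def cooccurrence_graph(descriptions):
    """Return {(n1, n2): p(n2|n1)} for every ordered pair of co-occurring words."""
    # N_n: the set of instances whose description contains word n
    instances = {}
    for inst, words in descriptions.items():
        for word in set(words):
            instances.setdefault(word, set()).add(inst)

    edges = {}
    for n1, n2 in permutations(instances, 2):
        shared = instances[n1] & instances[n2]
        if shared:  # an edge exists only if the words co-occur somewhere
            edges[(n1, n2)] = len(shared) / len(instances[n1])
    return edges


# W occurs in p1..p4, dpi only in p1 and p2
descriptions = {
    "p1": ["W", "dpi"],
    "p2": ["W", "dpi"],
    "p3": ["W"],
    "p4": ["W"],
}
graph = cooccurrence_graph(descriptions)
```

The graph is directed because the two conditional probabilities of a pair generally differ, as they do for W and dpi here.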
Example: edge weights between W and dpi
[Figure: feature co-occurrence graph with the nodes dpi, A4, ppm, W, Smart, TV, LED and HD; the two edges between W and dpi are labelled 0.5 and 1.0]
Instance | W | dpi
p1 | X | X
p2 | X | X
p3 | X |
p4 | X |
N_W = {p1, p2, p3, p4}
N_dpi = {p1, p2}
p(dpi|W) = |N_W ∩ N_dpi| / |N_W| = 0.5
p(W|dpi) = |N_W ∩ N_dpi| / |N_dpi| = 1.0
Two words v1 and v2 are considered co-occurring if p(v2|v1) > θ and p(v1|v2) > θ
[Figure: the same feature co-occurrence graph (dpi, A4, ppm, W, Smart, TV, LED, HD) with edge weights 0.5 and 1.0 between W and dpi, filtered with θ = 0.50]
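The mutual threshold test can be sketched as a filter over the directed edge weights. The function name `mutual_edges` is hypothetical; the W/dpi weights are taken from the slides, while the ppm/dpi weights are made up for illustration.

```python
def mutual_edges(probs, theta):
    """Keep unordered pairs whose two conditional probabilities both exceed theta."""
    kept = set()
    for (v1, v2), p in probs.items():
        if p > theta and probs.get((v2, v1), 0.0) > theta:
            kept.add(frozenset((v1, v2)))
    return kept


probs = {
    ("W", "dpi"): 0.5, ("dpi", "W"): 1.0,      # values from the slides
    ("dpi", "ppm"): 1.0, ("ppm", "dpi"): 1.0,  # assumed values
}
pairs = mutual_edges(probs, theta=0.5)
```

With a strict comparison and θ = 0.5, the pair (W, dpi) is not kept, because p(dpi|W) = 0.5 does not exceed the threshold even though p(W|dpi) = 1.0 does.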
[Figure: maximum cliques in the feature co-occurrence graph; node labels shown: W, ppm, dpi, A4, HD, TV, Smart, LED]
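Grouping mutually co-occurring words into cliques can be sketched with the Bron-Kerbosch algorithm. Note the assumptions: the slide names maximum cliques, whereas this sketch enumerates all maximal cliques, and the edge list is illustrative rather than read off the figure.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration; adj maps each node to its set of neighbours."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed nodes
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical undirected edges that survived the threshold
edges = [("dpi", "ppm"), ("dpi", "A4"), ("ppm", "A4"),
         ("HD", "TV"), ("HD", "LED"), ("TV", "LED")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

features = sorted(sorted(c) for c in maximal_cliques(adj))
```

Each resulting clique is a group of words that all co-occur with one another, which is the kind of group a pseudo-schema feature could be derived from.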
TYPIFICATION ALGORITHM
Clusters
A cluster is defined as a tuple C(F, N, S)
F: the set of (pseudo-)schema features
N: the set of all entities that have an element in F as feature
S: the set of clusters that are either child or descendant nodes of C
Cluster Distance (the formula is shown as an image on the slide), defined in terms of:
count(fi, fj): co-occurrence count of features fi and fj
|N^E_f|: the count of entities having f as feature
Ni (Nj): the entity set associated with Ci (Cj)
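The cluster tuple and the two counts used by the distance can be sketched as follows. This is only an illustration of the definitions: the class and function names are hypothetical, reading the co-occurrence count as "number of entities carrying both features" is an assumption, and the distance formula itself is not reproduced because it is not part of the slide text.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    features: set                                     # F
    entities: set = field(default_factory=set)        # N
    descendants: list = field(default_factory=list)   # S


def entity_count(f, entity_features):
    """|N^E_f|: number of entities having f as a feature."""
    return sum(f in fs for fs in entity_features.values())


def cooccurrence_count(fi, fj, entity_features):
    """count(fi, fj): entities carrying both features (assumed reading)."""
    return sum(fi in fs and fj in fs for fs in entity_features.values())


def build_cluster(features, entity_features):
    """N is every entity that has at least one element of F as a feature."""
    members = {e for e, fs in entity_features.items() if fs & features}
    return Cluster(features=set(features), entities=members)


# Hypothetical toy data: entity -> set of features
entity_features = {
    "p1": {"Power", "Resolution", "Print Speed"},
    "p2": {"Power", "Resolution"},
    "p3": {"Power", "HD", "LED"},
}
tv = build_cluster({"HD", "LED"}, entity_features)
```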
Cluster Relation
Four cluster relations
Ci > (>>) Cj: Ci is a parent (ancestor) of Cj
Ci < (<<) Cj: Ci is a child (descendant) of Cj
Ci = Cj: Ci and Cj represent the same cluster
Ci ≠ Cj: there is no relation between Ci and Cj
Criteria labels on the slide: Evidence; No counter-evidence
Typification
[Figure: step-by-step example of the algorithm over the features Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level and Language, showing the sets S* and C in states 0 to 4, starting from the root; figure labels: Children or Descendants of the root, Siblings of the root]
Steps shown:
1. Power < Root: Add & Split Clusters
2. Resolution < Power: Add & Split Clusters
3. Print Speed = Resolution: Merge
4. Split Entities
EVALUATION
Evaluation
Baselines
Hierarchical: BIRCH
Partitional: K-means++
Kernel-based: SVC
Density-based: OPTICS
Datasets
BTC
DBpedia (DBP)
Product Data (P)
PPS: using pseudo-schema features
PTFIDF: using TF-IDF features
PD: using all words
Dataset | Entity | Triple | Schema Feature | Type | Hierarchy | PS Features
BTC | 334,661 | 2,991,411 | 537 | 163 | 0 | -
DBP | 3,600 | 49,751 | 146 | 16 | 5 | -
PPS | 22,331 | 111,647 | 5 | 6 | 0 | 136
PTFIDF | 22,331 | 111,647 | 5 | 6 | 0 | 7,211
PD | 22,331 | 111,647 | 5 | 6 | 0 | 18,917
Efficiency
[Figure: running time in ms (log scale, 1 to 10,000,000) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
TYPifier, K-means++ and BIRCH are similar in efficiency
Pseudo-schema features help to improve efficiency
Effectiveness
[Figure: F-measure (%) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
TYPifier outperforms the baselines
+33.92% in F-measure (compared to second best)
Pseudo-schema features outperform other types of features
+86.15% in F-measure (compared to second best)
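For reference, the F-measure combines precision and recall into a single score. The sketch below assumes the balanced F-measure (harmonic mean of precision and recall); the slides do not state which variant was used, and the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f_measure(0.5, 0.5)
skewed = f_measure(1.0, 0.25)
```

The harmonic mean penalizes a large gap between the two inputs, so a clustering with perfect precision but low recall still receives a low score.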
Hierarchies
TYPifier outperforms the baselines
[Figure: the original hierarchies next to the hierarchies generated by OPTICS, BIRCH and TYPifier]
Tree Edit Distance
TYPifier | OPTICS | BIRCH
12 | 14 | 24
Parameter Sensitivity
Precision improves with higher θ, because pseudo-schema features become more representative
Recall improves as θ increases at low levels but drops at high levels, because fewer and fewer pseudo-schema features can be generated
[Figure: precision (%) and recall (%) versus θ (0.1 to 0.5) for TYPifier, KMeans++ and BIRCH]
Parameter Sensitivity
The sensitivity to ε depends on feature correlations
Higher ε leads to better precision and recall
Extremely high ε may lead to poor quality of hierarchies
[Figure: precision (%) and recall (%) versus ε (0.1 to 0.9) for DBP, BTC, P_PS and P_TFIDF]
Conclusion
Introduce and formulate Typification as a clustering problem
Learning pseudo-schema features
A divisive hierarchical clustering solution for TYPification
TYPifier outperforms baselines by +33.92% in F-measure!
Pseudo-schema features are also essential for the baselines!
(they outperform other types of features by +86.15% in F-measure)
Generate not only clusters but also hierarchies that closely match human conceptualization / ground truth model!
Thank you for your attention! Questions?
Thanh Tran, https://sites.google.com/site/kimducthanh/
TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

  • 1. KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association Institute of Applied Informatics and Formal Description Methods (AIFB) www.kit.edu TYPifier: Inferring the Type Semantics of Structured Data Yongtao Ma, Thanh Tran 29th IEEE International Conference on Data Engineering (ICDE2013)
  • 2. Institute of Applied Informatics and Formal Description Methods (AIFB)2 April 8th, 2013 Contents Introduction TYPification Features TYPification Algorithm Evaluation Conclusion ICDE2013, Brisbane
  • 3. Institute of Applied Informatics and Formal Description Methods (AIFB)3 April 8th, 2013 Problem Type information is Missing Dynamic Web Data Heterogeneous Enterprise Data ICDE2013, Brisbane
  • 4. Institute of Applied Informatics and Formal Description Methods (AIFB)4 April 8th, 2013 Problem Type information is Missing Dynamic Web Data Heterogeneous Enterprise Data ICDE2013, Brisbane ID Title Price Brand Description p1 Epson E1700 260 Epson Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format : A4, Letter, B5, A5...Energy consumption in operation/stand-by: 285 W/5 W p2 HP 55252 2699 HP 620 W in printing, 3600 dpi , 30ppm A4 Print Speed, 30ppm Mono A4 Print p3 LG 47LM7600 1143 LG Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary... p4 Panasonic L55DT50 2399 Panasonic Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame. p5 MadMaps Pacific 8 Spotitout Windows Vista / 7 / XP. Media: DVD. It’s a snap to load Pacific Coast GPS Travel Directory by MAD Maps into your GPS device. p6 Garmin Maps 99 Gamin Windows Vista / 7 / XP. Media: DVD. Compatible with GPS Garmin Colorado, Dakota, eTrex...Coverage includes detailed maps for traveling in Australia. p7 Rosetta Spanish 399 Rosetta Stone Windows Vista / 7 / XP. Media: DVD. Build your vocabulary and language abilities... Discover how to speak, read, write, and understand… p8 Learn German 9 Innovative Windows Vista / 7 / XP. Media: DVD. Learn level 9 German vocabulary with the audio playback tool, Listen to the lesson dialog and master the language…
  • 5. Institute of Applied Informatics and Formal Description Methods (AIFB)5 April 8th, 2013 Problem Type information is Missing Dynamic Web Data Heterogeneous Enterprise Data Typification: inferring the type semantics of structured data ICDE2013, Brisbane
  • 6. Institute of Applied Informatics and Formal Description Methods (AIFB)6 April 8th, 2013 Contributions We formulate Typification as a clustering problem, where the goal is to identify a particular kind of clusters that represent the types of entities We propose a solution for automatically computing pseudo-schema features from data We propose TYPifier, a novel clustering algorithm for the typification problem, which is An divisive hierarchical clustering algorithm Optimized for (pseudo-)schema-based features Determine the number of types (clusters) automatically Show that typification helps to improve date integration! ICDE2013, Brisbane
  • 7. Institute of Applied Informatics and Formal Description Methods (AIFB)7 April 8th, 2013 FEATURES FOR TYPIFICATION ICDE2013, Brisbane
  • 8. Institute of Applied Informatics and Formal Description Methods (AIFB)8 April 8th, 2013 Schema Features Features characterize a type well if: Shared by most entities of that type Not in the feature sets of other entities that belong to other types Schema Features: labels of attributes or relations e.g. Resolution but also HD and LET Tech for type TV Advantages: Better type indicators Problems: missing, scarce Solutions: derive pseudo-schema features ICDE2013, Brisbane
  • 9. Pseudo-schema Features
      - Pseudo-schema features are words in attribute values that act as schema features.
      - TF-IDF measures the importance of a term for a document, relative to others in the corpus; it is representative for instances rather than types.
      - Goal: learn words in attribute values that are representative for types.
      ID | Title | Price | Brand | Description
      p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format : A4, Letter, B5, A5...Energy consumption in operation/stand-by: 285 W/5 W
      p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi , 30ppm A4 Print Speed, 30ppm Mono A4 Print
      p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
      p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
  • 10. Pseudo-schema Features: Feature Co-occurrence Graph
      The feature co-occurrence graph is a weighted directed graph G = (N, E, L) with:
      - N: the set of words in the attribute values
      - E: edges as ordered vertex pairs (n1, n2), indicating that n1 co-occurs with n2 in the description of some instances
      - L: edge labels. Let $N_{n_1}$ and $N_{n_2}$ be the sets of instances that contain n1 and n2 in their description; the edge labels are the conditional co-occurrence probabilities $p(n_2 \mid n_1) = |N_{n_1} \cap N_{n_2}| / |N_{n_1}|$
  • 11. Pseudo-schema Features: example edge weights
      Graph nodes: dpi, A4, ppm, W, Smart, TV, LED, HD
      Instance | W | dpi
      p1 | X | X
      p2 | X | X
      p3 | X |
      p4 | X |
      $N_W = \{p1, p2, p3, p4\}$, $N_{dpi} = \{p1, p2\}$
      $w(dpi \mid W) = |N_W \cap N_{dpi}| / |N_W| = 0.5$
      $w(W \mid dpi) = |N_W \cap N_{dpi}| / |N_{dpi}| = 1.0$
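The two edge weights on slide 11 can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary `instances_with` and the helper `cond_cooccurrence` are hypothetical names, and the data is limited to the two words shown in the slide's instance table.

```python
from itertools import permutations

# Which instances mention each word (taken from the instance table on slide 11).
instances_with = {
    "W": {"p1", "p2", "p3", "p4"},
    "dpi": {"p1", "p2"},
}

def cond_cooccurrence(n1, n2, index):
    """Return p(n2 | n1) = |N_n1 ∩ N_n2| / |N_n1| (hypothetical helper)."""
    return len(index[n1] & index[n2]) / len(index[n1])

# Directed edge (n1, n2) carries the weight p(n2 | n1).
weights = {
    (n1, n2): cond_cooccurrence(n1, n2, instances_with)
    for n1, n2 in permutations(instances_with, 2)
}

for (n1, n2), w in sorted(weights.items()):
    print(f"p({n2} | {n1}) = {w}")
```

The weights are asymmetric: every instance that mentions dpi also mentions W, but only half of the instances that mention W mention dpi.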
  • 12. Pseudo-schema Features: thresholding
      Two words v1 and v2 are considered co-occurring if $p(v_2 \mid v_1) > \theta$ and $p(v_1 \mid v_2) > \theta$. In the example, θ = 0.50.
      [Figure: co-occurrence graph over the nodes dpi, A4, ppm, W, Smart, TV, LED, HD, with edge weights 0.5 and 1.0 shown]
  • 13. Pseudo-schema Features: maximum clique
      [Figure: maximum cliques in the thresholded co-occurrence graph; node labels: W, ppm, dpi, A4, HD, TV, Smart, LED]
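The thresholding and clique steps on slides 12 and 13 can be sketched as follows. Several things here are assumptions rather than the authors' implementation: the word-to-instance index is read off the descriptions of p1 to p4 in the product table, the names `index`, `THETA`, `adj` and `maximal_cliques` are hypothetical, and the cliques are enumerated with a plain Bron–Kerbosch recursion because the slides do not say which clique algorithm is used.

```python
from itertools import combinations

# Assumed word -> instance index, read off the p1..p4 descriptions.
index = {
    "dpi": {"p1", "p2"}, "ppm": {"p1", "p2"}, "A4": {"p1", "p2"},
    "W": {"p1", "p2", "p3", "p4"},
    "HD": {"p3", "p4"}, "TV": {"p3", "p4"},
    "Smart": {"p3", "p4"}, "LED": {"p3", "p4"},
}
THETA = 0.5  # threshold from slide 12

def p(n2, n1):
    """Conditional co-occurrence probability p(n2 | n1)."""
    return len(index[n1] & index[n2]) / len(index[n1])

# Keep an undirected edge only if both directions exceed the threshold.
adj = {word: set() for word in index}
for a, b in combinations(index, 2):
    if p(b, a) > THETA and p(a, b) > THETA:
        adj[a].add(b)
        adj[b].add(a)

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return found

cliques = maximal_cliques(adj)
for clique in cliques:
    print(sorted(clique))
```

Under these assumptions W ends up isolated: p(dpi | W) is exactly 0.5, which does not exceed θ, so W is not tied to the printer words or to the TV words.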
  • 14. TYPIFICATION ALGORITHM
  • 15. Clusters
      A cluster is defined as a tuple C(F, N, S):
      - F: the set of (pseudo-)schema features
      - N: the set of all entities that have an element of F as a feature
      - S: the set of clusters that are either child or descendant nodes of C
      Cluster distance (the formula itself is not preserved in this transcript) is expressed in terms of:
      - $count(f_i, f_j)$: the co-occurrence count of features $f_i$ and $f_j$
      - $|NE_f|$: the count of entities having f as a feature
      - $N_i$ ($N_j$): the entity set associated with $C_i$ ($C_j$)
  • 16. Cluster Relation
      Four cluster relations:
      - $C_i > (\gg)\, C_j$: $C_i$ is a parent (ancestor) of $C_j$
      - $C_i < (\ll)\, C_j$: $C_i$ is a child (descendant) of $C_j$
      - $C_i = C_j$: $C_i$ and $C_j$ represent the same cluster
      - $C_i \neq C_j$: there is no relation between $C_i$ and $C_j$
      Diagram labels: Evidence; No counter-evidence.
  • 17. Typification (worked example)
      Features: Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level, Language
      Steps:
      - 1. Power < Root: add and split clusters
      - 2. Resolution < Power: add and split clusters
      - 3. Print Speed = Resolution: merge
      - 4. Split entities
      Snapshots of the sets S* and C, in the order shown on the slide:
      - 0 (Root): S* root, Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level, Language; C Empty
      - 1 (Power): S* Power, Resolution, Print Speed, LED, HD; C platform, Media, Coverage, Level, Language
      - 2 (Resolution): S* Resolution, Print Speed; C LED, HD
      - 3 (Print Speed): S* Resolution, Empty; C LED, HD
      - 4: S* Power, LED, HD; C platform, Media, Coverage, Level, Language
      Legend entries: children or descendants of the root; siblings of the root.
  • 18. EVALUATION
  • 19. Evaluation
      Baselines:
      - Hierarchical: BIRCH
      - Partitional: K-means++
      - Kernel-based: SVC
      - Density-based: OPTICS
      Datasets:
      - BTC
      - DBpedia (DBP)
      - Product Data (P), in three variants: PPS (using pseudo-schema features), PTFIDF (using TF-IDF features), PD (using all words)
      Dataset | Entity | Triple | Schema Feature | Type | Hierarchy | PS Features
      BTC | 334,661 | 2,991,411 | 537 | 163 | 0 | -
      DBP | 3,600 | 49,751 | 146 | 16 | 5 | -
      PPS | 22,331 | 111,647 | 5 | 6 | 0 | 136
      PTFIDF | 22,331 | 111,647 | 5 | 6 | 0 | 7,211
      PD | 22,331 | 111,647 | 5 | 6 | 0 | 18,917
  • 20. Efficiency
      [Figure: running time in ms (log scale, 1 to 10,000,000) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
      - TYPifier, K-means++ and BIRCH are similar in efficiency.
      - Pseudo-schema features help to improve efficiency.
  • 21. Effectiveness
      [Figure: F-measure (%) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
      - TYPifier outperforms the baselines: +33.92% in F-measure compared to the second best.
      - Pseudo-schema features outperform the other feature types: +86.15% in F-measure compared to the second best.
  • 22. Hierarchies
      TYPifier outperforms the baselines.
      [Figure: original hierarchies compared with the hierarchies generated by OPTICS, BIRCH and TYPifier]
      Tree edit distance:
      Algorithm | Tree Edit Distance
      TYPifier | 12
      OPTICS | 14
      BIRCH | 24
  • 23. Parameter Sensitivity (θ)
      - Precision improves with higher θ, because pseudo-schema features become more representative.
      - Recall improves as θ increases at low values and drops at high values, because fewer and fewer pseudo-schema features can be generated.
      [Figure: precision (%) and recall (%) against θ (0.1 to 0.5) for TYPifier, KMeans++ and BIRCH]
  • 24. Parameter Sensitivity (ε)
      - The sensitivity to ε depends on feature correlations.
      - Higher ε leads to better precision and recall.
      - Extremely high ε may lead to poor-quality hierarchies.
      [Figure: precision (%) and recall (%) against ε (0.1 to 0.9) for DBP, BTC, P_PS and P_TFIDF]
  • 25. Conclusion
      - Introduced and formulated typification as a clustering problem.
      - Proposed learning of pseudo-schema features.
      - Proposed a divisive hierarchical clustering solution for typification.
      - TYPifier outperforms the baselines by +33.92% in F-measure.
      - Pseudo-schema features are also essential for the baselines (they outperform the other feature types by +86.15% in F-measure).
      - TYPifier generates not only clusters but also hierarchies that closely match the human conceptualization / ground-truth model.
  • 26. Thank you for your attention! Questions? Thanh Tran, https://sites.google.com/site/kimducthanh/