An Overview of Data Completeness
Assessment Techniques
Simon Razniewski
Free University of Bozen-Bolzano, Italy
Background
• Diplom (~Master) from TU Dresden, Germany, 2010
• PhD from Free University of Bozen-Bolzano, Italy, 2014
– Spent some time at UCSD and AT&T Labs-Research
• Now Assistant Professor in Bozen-Bolzano
• Trilingual province
– (German, Italian, Ladin)
• Autonomous for 43 years
• University founded in 1997
• 3500 students
Bolzano
Background (2)
• PhD centered on formal approaches to data
completeness
• Other research interests:
– Data currency (see WebDB2015 paper)
– Process mining
– Data-driven (machine learning) approaches to data
completeness
– ….
• Presentation today: Joint work with Werner Nutt,
Divesh Srivastava and Flip Korn
Data Completeness
Continent
Name       Population (billion)
Africa     1
America    Null
Asia       5
Australia  0.03
Continent
Name       Population (billion)  Area (million km²)
Africa     1                     30
America    Null                  16
Asia       5                     43
Australia  0.03                  3
Europe     0.7                   4
• Data quality commonly distinguishes dimensions
– Correctness
– Timeliness
– Completeness
• (In-)completeness is an issue in many settings, e.g.
– Data from multiple sources
– Optional data
– Human-intensive workflows
• Aspects of incompleteness
– Schema
– Records
– Values
Focus today on records, for values see
[Razniewski&Nutt, CIKM 2012]
What can one research?
• How to avoid incompleteness
– Information systems design
– Process design
• How to deal with incompleteness
– Statistical procedures to predict missing data
– Missing value imputation
• How to understand incompleteness
– How to describe it
– How to reason about it
Motivation: Data warehouse of a
telecommunication company
Warnings
day week ID message
Mon 1 tw37 high voltage
Fri 1 tw37 high voltage
Wed 2 tw37 overheat
Tue 1 tw59 auto restart
Fri 1 tw59 overheat
Mon 2 tw83 high voltage
Tue 2 tw83 auto restart
Maintenance
ID resp reason
tw37 A disk failure
tw59 D software crash
tw83 B unknown
tw91 C update failure
tw91 C network error
Teams
name specialization
A hardware
B hardware
C network
C software
D network
Admin John knows
• Teams table is complete (HR says so)
• Maintenance is complete for teams A, B and C
– Their reporting systems export data automatically
• Warnings is complete for all of Week 1,
and Monday and Wednesday of Week 2
– Potential data loss due to a system failure on Tuesday
– Data from after Wednesday may not be fully loaded
John wants to know
“Give me all warnings in week 2 that are generated
by objects in maintenance with a hardware team.”
SELECT *
FROM Warnings W
JOIN Maintenance M ON W.ID = M.ID
JOIN Teams T ON M.resp = T.name
WHERE W.week = 2
AND T.specialization = 'hardware'
W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization
Wed 2 tw37 overheat tw37 A disk failure A hardware
Mon 2 tw83 high voltage tw83 B unknown B hardware
Tue 2 tw83 auto restart tw83 B unknown B hardware
Is this all that hardware teams have done?
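As a minimal sketch, the query can be replayed on the example tables with Python's built-in sqlite3 module. The in-memory database, the column types and the reduced column list in the SELECT are assumptions made for illustration; the join uses the Maintenance column resp as shown in the table.

```python
import sqlite3

# In-memory database holding the three example tables from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Warnings (day TEXT, week INTEGER, ID TEXT, message TEXT);
CREATE TABLE Maintenance (ID TEXT, resp TEXT, reason TEXT);
CREATE TABLE Teams (name TEXT, specialization TEXT);
""")
conn.executemany("INSERT INTO Warnings VALUES (?, ?, ?, ?)", [
    ("Mon", 1, "tw37", "high voltage"), ("Fri", 1, "tw37", "high voltage"),
    ("Wed", 2, "tw37", "overheat"), ("Tue", 1, "tw59", "auto restart"),
    ("Fri", 1, "tw59", "overheat"), ("Mon", 2, "tw83", "high voltage"),
    ("Tue", 2, "tw83", "auto restart"),
])
conn.executemany("INSERT INTO Maintenance VALUES (?, ?, ?)", [
    ("tw37", "A", "disk failure"), ("tw59", "D", "software crash"),
    ("tw83", "B", "unknown"), ("tw91", "C", "update failure"),
    ("tw91", "C", "network error"),
])
conn.executemany("INSERT INTO Teams VALUES (?, ?)", [
    ("A", "hardware"), ("B", "hardware"), ("C", "network"),
    ("C", "software"), ("D", "network"),
])

# John's query; ORDER BY is added only to make the output deterministic.
QUERY = """
SELECT W.day, W.ID, W.message, M.resp
FROM Warnings W
JOIN Maintenance M ON W.ID = M.ID
JOIN Teams T ON M.resp = T.name
WHERE W.week = 2 AND T.specialization = 'hardware'
ORDER BY W.ID, W.day
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The query returns the three rows shown on the slide, but nothing in the result itself says whether these are all the warnings that exist in reality. That gap is what the completeness statements address.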
John reasons
“Give me all warnings in week 2 that are generated
by objects in maintenance with a hardware team.”
• Warnings is complete for Week 1 and Monday and Wednesday of Week 2
• Maintenance is complete for teams A, B and C
• Teams is complete
⇒ The query result definitely contains all warnings from
– Monday for team A
– Monday for team B
– Monday for team C
– Wednesday for team A
– Wednesday for team B
– Wednesday for team C
John looks at the data
⇒ The query result definitely contains all warnings from
– Monday for team A
– Monday for team B
– Monday for team C
– Wednesday for team A
– Wednesday for team B
– Wednesday for team C
• There are no hardware teams other than A and B
⇒ The query result is fully complete for Monday and Wednesday
Teams
name specialization
A hardware
B hardware
C network
C software
D network
Questions
“Warnings are complete for Week 1”
1. How can we formally describe
complete parts of a database?
“The query result contains all warnings
from Monday of week 2 for team A”
2. How can we use database completeness
information to identify
complete parts of query answers?
Related work
Publication               Description language                  Focus of the work
Motro, TODS 1989          Views                                 Schema-level reasoning
Levy, VLDB 1996           LC statements, similar to views       Schema-level reasoning
Fan & Geerts, PODS 2009   Various query languages (CQ-Datalog)  Master data management, where an upper bound database exists
Lang et al., SIGMOD 2014  Columns/operators                     Distributed databases on the web, operational failures during query execution
Formalism: Patterns
We have all warnings from week 1
We have all warnings from
Monday of week 2
• Less expressive than previous formalisms
• Can be expressed in the same schema as the data
Warnings
day  week  ID    message
Mon  1     tw37  high voltage
Fri  1     tw37  high voltage
Wed  2     tw37  overheat
Tue  1     tw59  auto restart
Fri  1     tw59  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart
Completeness patterns (* is a wildcard):
*    1     *     *
Mon  2     *     *
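A pattern can be read as a record in which * matches any value. The sketch below illustrates that reading by checking which stored Warnings records fall into a part of the table that is declared complete. The function names matches and covered are hypothetical and not taken from the talk.

```python
WILDCARD = "*"

def matches(pattern, record):
    """True if every pattern field is a wildcard or equals the record field."""
    return len(pattern) == len(record) and all(
        p == WILDCARD or p == v for p, v in zip(pattern, record)
    )

def covered(records, patterns):
    """Records that lie in a part of the table declared complete."""
    return [r for r in records if any(matches(p, r) for p in patterns)]

warnings = [
    ("Mon", "1", "tw37", "high voltage"), ("Fri", "1", "tw37", "high voltage"),
    ("Wed", "2", "tw37", "overheat"), ("Tue", "1", "tw59", "auto restart"),
    ("Fri", "1", "tw59", "overheat"), ("Mon", "2", "tw83", "high voltage"),
    ("Tue", "2", "tw83", "auto restart"),
]
# "All warnings from week 1" and "all warnings from Monday of week 2".
patterns = [("*", "1", "*", "*"), ("Mon", "2", "*", "*")]

in_complete_part = covered(warnings, patterns)
outside = [r for r in warnings if r not in in_complete_part]
print(outside)
```

Note that a pattern is a statement about the real world (no matching record is missing), so the check above only shows which stored records belong to a complete part; it cannot detect missing records.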
John’s knowledge expressed by patterns
Warnings
day week ID message
Mon 1 tw37 high voltage
Fri 1 tw37 high voltage
Wed 2 tw37 overheat
Tue 1 tw59 auto restart
Fri 1 tw59 overheat
Mon 2 tw83 high voltage
Tue 2 tw83 auto restart
* 1 * *
Mon 2 * *
Wed 2 * *
Maintenance
ID resp reason
tw37 A disk failure
tw59 D software crash
tw83 B unknown
tw91 C update failure
tw91 C network error
* A *
* B *
* C *
Teams
name specialization
A hardware
B hardware
C network
C software
D network
* *
Teams table is complete
Maintenance is complete for teams A, B and C
Warnings is complete for all of Week 1,
and Monday and Wednesday of Week 2
John’s conclusions expressed by patterns
“Give me all warnings in week 2 that are generated
by objects in maintenance with a hardware team.”
⇒ The query result contains all warnings from
• Monday for team A
• …
W.Day  W.week  W.ID  W.message     M.ID  M.resp  M.reason      T.name  T.specialization
Wed    2       tw37  overheat      tw37  A       disk failure  A       hardware
Mon    2       tw83  high voltage  tw83  B       unknown       B       hardware
Tue    2       tw83  auto restart  tw83  B       unknown       B       hardware
Mon    *       *     *             *     A       *             A       *
Mon    *       *     *             *     B       *             B       *
Mon    *       *     *             *     C       *             C       *
Wed    *       *     *             *     A       *             A       *
Wed    *       *     *             *     B       *             B       *
Wed    *       *     *             *     C       *             C       *
How to compute the completeness patterns for queries?
Queries are computed by relational algebra
Here: Select, project, equijoin
Schema reasoning:
- Apply algebra operators to completeness patterns
(analogous to query result computation)
Query as an algebra tree:
σ_{week=2}(Warnings) ⋈_{W.ID=M.ID} (Maintenance ⋈_{M.resp=T.name} σ_{spec="hw"}(Teams))
Reasoning about selections
σ_{spec="hw"}(T)
Input: Teams
name  specialization
A     hardware
B     hardware
C     network
C     software
D     network
*     *
Result:
name  specialization
A     hardware
B     hardware
*     *
Rule 1: Statements with * survive
Reasoning about selections (2)
σ_{week=2}(W)
Input: Warnings
day  week  ID    message
Mon  1     tw37  high voltage
Fri  1     tw37  high voltage
Wed  2     tw37  overheat
Tue  1     tw59  auto restart
Fri  1     tw59  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart
*    1     *     *
Mon  2     *     *
Wed  2     *     *
Result:
day  week  ID    message
Wed  2     tw37  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart
Mon  2     *     *
Wed  2     *     *
Rule 2: Irrelevant constants are ignored
Rule 3: Selected constants survive and are promoted
(promotion replaces the selected constant 2 by *: Mon * * * and Wed * * *)
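The three selection rules can be sketched as a small function over pattern tuples. This is an illustration rather than the authors' implementation: the name select_patterns is hypothetical, and reading "promoted" as "the selected constant is replaced by a wildcard" is an assumption based on the result patterns shown in the talk.

```python
WILDCARD = "*"

def select_patterns(patterns, attr_index, constant):
    """Apply a selection attr = constant to a list of completeness patterns."""
    result = []
    for p in patterns:
        value = p[attr_index]
        if value == WILDCARD:
            # Rule 1: statements with * at the selected attribute survive.
            result.append(p)
        elif value == constant:
            # Rule 3: the selected constant survives and is promoted to *
            # (assumed meaning of "promoted").
            result.append(p[:attr_index] + (WILDCARD,) + p[attr_index + 1:])
        # Rule 2: patterns with a different constant are irrelevant and dropped.
    return result

warnings_patterns = [("*", "1", "*", "*"), ("Mon", "2", "*", "*"), ("Wed", "2", "*", "*")]
teams_patterns = [("*", "*")]

week2 = select_patterns(warnings_patterns, 1, "2")        # selection week = 2
hardware = select_patterns(teams_patterns, 1, "hardware")  # selection spec = "hw"
print(week2, hardware)
```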
Reasoning about joins
M ⋈_{M.resp=T.name} σ_{spec="hw"}(T)
Input: σ_{spec="hw"}(T)
name  specialization
A     hardware
B     hardware
*     *
Input: Maintenance
ID    resp  reason
tw37  A     disk failure
tw59  D     software crash
tw83  B     unknown
tw91  C     update failure
tw91  C     network error
*     A     *
*     B     *
*     C     *
Result:
M.ID  M.resp  M.reason      T.name  T.specialization
tw37  A       disk failure  A       hardware
tw83  B       unknown       B       hardware
*     A       *             A       *
*     B       *             B       *
*     C       *             C       *
Rule 1: Constants join with equal constants
Rule 2: Wildcards join with anything
Rule 3: Constants can be promoted
M.ID M.resp M.reason T.name T.specialization
tw37 A disk failure A hardware
tw83 B unknown B hardware
* A * * *
* B * * *
* C * * *
* * * A *
* * * B *
* * * C *
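Join Rules 1 and 2 can be sketched as a nested loop over the pattern sets of both inputs. The function name join_patterns is hypothetical, and copying a constant to the other side of the join condition is an assumption chosen to reproduce result patterns such as * A * A *. Promotion (Rule 3) is deliberately left out of this sketch.

```python
WILDCARD = "*"

def join_patterns(left, right, i, j):
    """Equijoin of two pattern lists on left[i] = right[j]."""
    result = []
    for p in left:
        for q in right:
            a, b = p[i], q[j]
            # Rule 1: two constants join only if they are equal.
            if a != WILDCARD and b != WILDCARD and a != b:
                continue
            # Rule 2: a wildcard joins with anything; a constant on one
            # side is copied to the other side of the join condition.
            value = a if a != WILDCARD else b
            result.append(p[:i] + (value,) + p[i + 1:] + q[:j] + (value,) + q[j + 1:])
    return result

maintenance_patterns = [("*", "A", "*"), ("*", "B", "*"), ("*", "C", "*")]
hardware_team_patterns = [("*", "*")]
# M joined with the hardware teams on M.resp = T.name
mt = join_patterns(maintenance_patterns, hardware_team_patterns, 1, 0)

# Patterns of the week-2 selection on Warnings, after promotion of the week
week2_patterns = [("Mon", "*", "*", "*"), ("Wed", "*", "*", "*")]
# Warnings joined with the previous result on W.ID = M.ID
full = join_patterns(week2_patterns, mt, 2, 0)
print(len(full), full[0])
```

Under these assumptions the second join yields six patterns, one per combination of Monday or Wednesday with team A, B or C, which corresponds to the list of conclusions John reaches by hand.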
Algorithmic completeness
Proven: Extended algebra gives all conclusions
that hold on the schema level
(reasoning only with the pattern metadata, highlighted in yellow on the slides)
• Independent of the algebra tree chosen
Looking at the data
M ⋈_{M.resp=T.name} σ_{spec="hw"}(T)
Schema-level result:
M.ID  M.resp  M.reason      T.name  T.specialization
tw37  A       disk failure  A       hardware
tw83  B       unknown       B       hardware
*     A       *             *       *
*     B       *             *       *
*     C       *             *       *
*     *       *             A       *
*     *       *             B       *
*     *       *             C       *
There cannot be hardware teams other than A and B
After looking at the data:
M.ID  M.resp  M.reason      T.name  T.specialization
tw37  A       disk failure  A       hardware
tw83  B       unknown       B       hardware
*     *       *             *       *
Database instance allows for more promotion!
(for details see paper)
So much for the theory, but…
1. How can we implement this?
2. How fast is this?
– In comparison with query evaluation
3. How can we manage large sets of
statements?
How can we implement this?
• Ideally, a plugin inside a DBMS
– Promotion procedure benefits from fast access to data
• So far: Separate Java program
• Schema-level algebra can also be encoded in SQL
⇒ Could compile normal queries into metadata queries
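One way such an encoding could look, sketched with Python's built-in sqlite3 module: the patterns are stored in a metadata table with the same columns as the data table, and the selection week = 2 becomes a query over that table. The table name Warnings_c, the use of the string '*' as wildcard and the exact shape of the metadata query are assumptions for illustration, not the encoding used by the authors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical metadata table: same schema as Warnings, '*' stored as text.
conn.execute("CREATE TABLE Warnings_c (day TEXT, week TEXT, ID TEXT, message TEXT)")
conn.executemany("INSERT INTO Warnings_c VALUES (?, ?, ?, ?)", [
    ("*", "1", "*", "*"),
    ("Mon", "2", "*", "*"),
    ("Wed", "2", "*", "*"),
])

# Data query:     SELECT * FROM Warnings WHERE week = 2
# Metadata query: keep patterns whose week is '*' or '2' and output '*'
# for week, since every record in the result has week = 2 anyway.
METADATA_QUERY = """
SELECT day, '*' AS week, ID, message
FROM Warnings_c
WHERE week = '*' OR week = '2'
ORDER BY day
"""
result_patterns = conn.execute(METADATA_QUERY).fetchall()
print(result_patterns)
```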
How fast is this? (1)
• Synthetic data
• Wikipedia has around 1000 lists declared as complete
(using a template or in natural language)
http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29
How fast is this? (2)
• Manually extracted some and grouped them by topic
– Recurrent topics: Sports teams, political assemblies, geographical features,
songs, operas and other pieces of art
• Generated one table each about cities, schools and countries
city
name  country   state             county
*     USA       Virginia          *
*     Germany   *                 *
*     Ukraine   *                 *
*     Bulgaria  *                 *
*     USA       New York          *
*     UK        Carmarthenshire   *
*     USA       West Virginia     Hampshire County
*     Czech     Moravian-Silesia  Nový Jičín
*     Slovenia  *                 *
How fast is this? (3)
SELECT *
FROM country, city, school
WHERE country.capital=city.name
AND city.state=school.state
SQL runtime: 2040 ms (25891 records)
Completeness pattern runtime: 900 ms (46 patterns)
Median over 7 join queries:
• SQL runtime: 2040 ms
• Completeness pattern runtime: 460 ms
How can we manage large sets of
patterns?
Redundancies in workflows may lead to redundant patterns
– Introduce overhead and reduce comprehensibility
⇒ Should be identified and removed
John reports first that all data for Monday of week 2 is complete,
later, that the data for the whole week 2 is complete
(Monday,2)
(*,2)
Trivial?
(Monday,*,hardware) (Wednesday,*,software)
(Tuesday,2,software) (*,*,hardware)
(Monday,2,*) (*,2,software)
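Redundancy between patterns can be checked field by field: a pattern is redundant if another pattern has, at every position, either a wildcard or the same value. The sketch below applies this pairwise check to the two examples above. The function names subsumes and minimize_pairwise are hypothetical.

```python
WILDCARD = "*"

def subsumes(general, specific):
    """True if `general` covers every record that `specific` covers."""
    return all(g == WILDCARD or g == s for g, s in zip(general, specific))

def minimize_pairwise(patterns):
    """Keep only patterns not subsumed by a different pattern (quadratic)."""
    unique = list(dict.fromkeys(patterns))  # drop exact duplicates, keep order
    return [
        p for p in unique
        if not any(q != p and subsumes(q, p) for q in unique)
    ]

simple = [("Monday", "2"), ("*", "2")]
harder = [
    ("Monday", "*", "hardware"), ("Wednesday", "*", "software"),
    ("Tuesday", "2", "software"), ("*", "*", "hardware"),
    ("Monday", "2", "*"), ("*", "2", "software"),
]
print(minimize_pairwise(simple))
print(minimize_pairwise(harder))
```

In the second example, (Monday,*,hardware) is covered by (*,*,hardware) and (Tuesday,2,software) by (*,2,software), so four patterns remain.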
Minimization of sets of patterns: Options
• Option 1: Pairwise comparison
• Option 2: Employment of index structures for quick entailment checking
(similar problem studied in theorem proving/AI)
– Path indexes
– Discrimination trees
• Option 3: Hashing
– Store all statements in a hashmap
– For each statement, all generalizations are generated (exponentially many!)
– A statement is most general if none of its generalizations exists in the hashmap
(Mon, 1, sw) ⇒ (*, 1, sw), (Mon, *, sw), (Mon, 1, *), (*, *, sw), (Mon, *, *), (*, 1, *), (*, *, *)
• Options can be combined with sorting by number of wildcards
(*, *, *), (Mon, *, *), (*, 2, sw), (Tue, 1, hw)
⇒ Later statements cannot entail earlier statements
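Option 3 and the sorting step can be sketched as follows, using a Python set as the hash structure. The function names are hypothetical, and the sketch makes no claim about how the evaluated implementation was organized.

```python
from itertools import combinations

WILDCARD = "*"

def generalizations(pattern):
    """All proper generalizations: replace any non-empty subset of the
    constants by wildcards (2^k - 1 patterns for k constants)."""
    constants = [i for i, v in enumerate(pattern) if v != WILDCARD]
    result = []
    for k in range(1, len(constants) + 1):
        for chosen in combinations(constants, k):
            g = list(pattern)
            for i in chosen:
                g[i] = WILDCARD
            result.append(tuple(g))
    return result

def minimize_hashing(patterns):
    """A pattern is most general if none of its generalizations is stored."""
    stored = set(patterns)
    return {p for p in stored if not any(g in stored for g in generalizations(p))}

def sort_by_wildcards(patterns):
    """Most general first, so later patterns cannot entail earlier ones."""
    return sorted(patterns, key=lambda p: -sum(v == WILDCARD for v in p))

gens = generalizations(("Mon", "1", "sw"))
minimal = minimize_hashing([
    ("Monday", "*", "hardware"), ("Wednesday", "*", "software"),
    ("Tuesday", "2", "software"), ("*", "*", "hardware"),
    ("Monday", "2", "*"), ("*", "2", "software"),
])
ordered = sort_by_wildcards([
    ("Tue", "1", "hw"), ("*", "2", "sw"), ("*", "*", "*"), ("Mon", "*", "*"),
])
print(len(gens), minimal, ordered)
```

Each lookup is a constant-time hash probe, but the number of probes per pattern grows exponentially with its number of constants, which is the cost flagged in the option list.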
Minimization of sets of patterns: Results
(Pairwise comparison and path indexes failed immediately)
Time/space tradeoff:
• Unsorted discrimination trees fastest
• Sorted hashing/discrimination trees most space efficient
Summary
• Completeness patterns are a natural way to describe
complete parts of databases and query answers
– Can be expressed in the same schema
• Modified the relational algebra
to manipulate completeness patterns
– Selection and projection easy
– Join may be expensive in theory; in practice it usually is not
• Current work
– Correctness and completeness patterns
– Column-level patterns
Open Questions
• Automated ways to get large sets of statements
– Sensor networks
– Web extraction (e.g. from Wikipedia)
– Streams (e.g. transit data)
• What can be said if an answer is not guaranteed to be
complete
– Probabilistic completeness assessment based on historical data
– Error bounds
• Algorithmic completeness of promotion
References
• Technical part today based on:
– Identifying the Extent of Completeness of Query Answers over
Partially Complete Databases, Simon Razniewski, Flip Korn,
Werner Nutt and Divesh Srivastava, SIGMOD 2015
• Other relevant papers:
– Spatial data completeness: Adding Completeness Information to
Query Answers over Spatial Data, Simon Razniewski and Werner
Nutt, SIGSPATIAL, 2014
– Completeness over processes: Verification of Query
Completeness over Processes, Simon Razniewski, Marco Montali
and Werner Nutt, BPM 2013
– Completeness of values: Completeness of Queries over SQL
Databases, Werner Nutt and Simon Razniewski, CIKM 2012
Acknowledgment
This research has been supported by the project “MAGIC”,
funded by the Province of Bozen-Bolzano, Italy

Mais conteúdo relacionado

Último

Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 

Último (20)

300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 

Destaque

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Destaque (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

An Overview of Data Completeness Assessment Techniques

  • 1. An Overview of Data Completeness Assessment Techniques Simon Razniewski Free University of Bozen-Bolzano, Italy
  • 2. Background • Diplom (~Master) from TU Dresden, Germany, 2010 • PhD from Free University of Bozen-Bolzano, Italy, 2014 – Spent some time at UCSD and AT&T Labs-Research • Now Assistant Professor in Bozen-Bolzano • Trilingual province – (German, Italian, Ladin) • Autonomous since 43 years • University founded in 1997 • 3500 students 2 Bolzano
  • 3. Background (2) • PhD centered on formal approaches to data completeness • Other research interests: – Data currency (see WebDB2015 paper) – Process mining – Data-driven (machine learning) approaches to data completeness – …. • Presentation today: Joint work with Werner Nutt, Divesh Srivastava and Flip Korn 3
  • 4. Continent Name Population (billion) Africa 1 America Null Asia 5 Australia 0.03 Continent Name Population (billion) Area (million km²) Africa 1 30 America Null 16 Asia 5 43 Australia 0.03 3 Europe 0.7 4 Data Completeness • Data quality commonly distinguishes dimensions – Correctness – Timeliness – Completeness • (In-)completeness is an issue in many settings, e.g. – Data from multiple sources – Optional data – Human-intensive workflows • Aspects of incompleteness – Schema – Records – Values Focus today on records, for values see [Razniewski&Nutt, CIKM 2012] 4
  • 5. What can one research? • How to avoid incompleteness – Information systems design – Process design • How to deal with incompleteness – Statistical procedures to predict missing data – Missing value imputation • How to understand incompleteness – How to describe it – How to reason about it 5
  • 6. Motivation: Data warehouse of a telecommunication company Warnings day week ID message Mon 1 tw37 high voltage Fri 1 tw37 high voltage Wed 2 tw37 overheat Tue 1 tw59 auto restart Fri 1 tw59 overheat Mon 2 tw83 high voltage Tue 2 tw83 auto restart Maintenance ID resp reason tw37 A disk failure tw59 D software crash tw83 B unknown tw91 C update failure tw91 C network error Teams name specialization A hardware B hardware C network C software D network Admin John knows • Team table is complete (HR says so) • Maintenance is complete for teams A, B and C • their reporting systems export data automatically • Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2 • Potential data loss due to a system failure on Tuesday • Data further than Wednesday maybe not fully loaded 6
  • 7. John wants to know “Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team.” SELECT * FROM Warnings W JOIN Maintenance M ON W.ID = M.ID JOIN Teams T ON M.responsible = T.name WHERE W.week = 2 AND T.specialization = 'hardware' W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization Wed 2 tw37 overheat tw37 A disk failure A hardware Mon 2 tw83 high voltage tw83 B unknown B hardware Tue 2 tw83 auto restart tw83 B unknown B hardware Is this all that hardware teams have done? 7
  • 8. John reasons “Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team.” • Warnings is complete for Week 1 and Monday and Wednesday of Week 2 • Maintenance is complete for teams A, B and C • Team is complete  The query result definitely contains all warnings from – Monday for team A – Monday for team B – Monday for team C – Wednesday for team A – Wednesday for team B – Wednesday for team C 8 Warnings day week ID message Maintenance ID resp reason Teams name specialization
  • 9. John looks at the data  The query result definitely contains all warnings from – Monday for team A – Monday for team B – Monday for team C – Wednesday for team A – Wednesday for team B – Wednesday for team C • There are no other hardware teams than A and B  The query result is fully complete for Monday and Wednesday 9 Teams name specialization A hardware B hardware C network C software D network
  • 10. Questions “Warnings are complete for Week 1” 1. How can we formally describe complete parts of a database? “The query result contains all warnings from Monday of week 2 for team A” 2. How can we use database completeness information to identify complete parts of query answers? 10
  • 11. Related work Publication Description Language Focus of the work Motro, TODS 1989 Views Schema-level reasoning Levy, VLDB 1996 LC statements, similar to views Schema-level reasoning Fan & Geerts, PODS 2009 Various query languages (CQ-Datalog) Master data management, where an upper bound database exists Lang et al., SIGMOD 2014 Columns/operators Distributed databases on the web, operational failures during query execution 11
  • 12. Formalism: Patterns We have all warnings from week 1 We have all warnings from Monday of week 2 • Less expressive than previous formalisms • Can be expressed in the same schema as the data Warnings day week ID message Mon 1 tw37 high voltage Fri 1 tw37 high voltage Wed 2 tw37 overheat Tue 1 tw59 auto restart Fri 1 tw59 overheat Mon 2 tw83 high voltage Tue 2 tw83 auto restart Warnings day week ID message Mon 1 tw37 high voltage Fri 1 tw37 high voltage Wed 2 tw37 overheat Tue 1 tw59 auto restart Fri 1 tw59 overheat Mon 2 tw83 high voltage Tue 2 tw83 auto restart * 1 * * Warnings day week ID message Mon 1 tw37 high voltage Fri 1 tw37 high voltage Wed 2 tw37 overheat Tue 1 tw59 auto restart Fri 1 tw59 overheat Mon 2 tw83 high voltage Tue 2 tw83 auto restart * 1 * * Mon 2 * * 12 Warnings day week ID message Mon 1 tw37 high voltage Fri 1 tw37 high voltage Wed 2 tw37 overheat Tue 1 tw59 auto restart Fri 1 tw59 overheat Mon 2 tw83 high voltage Tue 2 tw83 auto restart * 1 * * Mon 2 * * Warnings day week ID message Mon 1 tw37 high voltage Fri 1 tw37 high voltage Wed 2 tw37 overheat Tue 1 tw59 auto restart Fri 1 tw59 overheat Mon 2 tw83 high voltage Tue 2 tw83 auto restart * 1 * * Mon 2 * *
  • 13. John’s knowledge expressed by patterns Warnings day week ID message Mon 1 tw37 high voltage Fri 1 tw37 high voltage Wed 2 tw37 overheat Tue 1 tw59 auto restart Fri 1 tw59 overheat Mon 2 tw83 high voltage Tue 2 tw83 auto restart * 1 * * Mon 2 * * Wed 2 * * Maintenance ID resp reason tw37 A disk failure tw59 D software crash tw83 B unknown tw91 C update failure tw91 C network error * A * * B * * C * Teams name specialization A hardware B hardware C network C software D network * * 13 Team table is complete Maintenance is complete for teams A, B and C Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2
  • 14. John’s conclusions expressed by patterns “Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team.” W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization Wed 2 tw37 overheat tw37 A disk failure A hardware Mon 2 tw83 high voltage tw83 B unknown B hardware Tue 2 tw83 auto restart tw83 B unknown B hardware W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization Wed 2 tw37 overheat tw37 A disk failure A hardware Mon 2 tw83 high voltage tw83 B unknown B hardware Tue 2 tw83 auto restart tw83 B unknown B hardware Mon * * * * A * A * 14  The query result contains all warnings from • Monday for team A • … W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization Wed 2 tw37 overheat tw37 A disk failure A hardware Mon 2 tw83 high voltage tw83 B unknown B hardware Tue 2 tw83 auto restart tw83 B unknown B hardware Mon * * * * A * A * Mon * * * * B * B * Mon * * * * C * C * Wed * * * * A * A * Wed * * * * B * B * Wed * * * * C * C *
  • 15. How to compute the completeness patterns for queries?
    Queries are computed by relational algebra. Here: select, project, equijoin.
    Schema reasoning:
    - Apply algebra operators to completeness patterns (analogous to query result computation)
    Algebra tree of the example query:
    $\sigma_{week=2}(\text{Warnings}) \bowtie_{W.ID=M.ID} \big(\text{Maintenance} \bowtie_{M.resp=T.name} \sigma_{spec=\text{"hw"}}(\text{Teams})\big)$
  • 16. Reasoning about selections
    Operation: $\sigma_{spec=\text{"hw"}}(T)$
    Teams (input: data, followed by completeness pattern)
    name  specialization
    A     hardware
    B     hardware
    C     network
    C     software
    D     network
    *     *
    Result (data, followed by completeness pattern)
    name  specialization
    A     hardware
    B     hardware
    *     *
    Rule 1: Statements with * survive
  • 17. $\sigma_{week=2}(\text{Warnings}) \bowtie_{W.ID=M.ID} \big(\text{Maintenance} \bowtie_{M.resp=T.name} \sigma_{spec=\text{"hw"}}(\text{Teams})\big)$
  • 18. Reasoning about selections (2)
    Operation: $\sigma_{week=2}(W)$
    Warnings (input: data, followed by completeness patterns)
    day  week  ID    message
    Mon  1     tw37  high voltage
    Fri  1     tw37  high voltage
    Wed  2     tw37  overheat
    Tue  1     tw59  auto restart
    Fri  1     tw59  overheat
    Mon  2     tw83  high voltage
    Tue  2     tw83  auto restart
    *    1     *     *
    Mon  2     *     *
    Wed  2     *     *
    Result (data, followed by completeness patterns)
    day  week  ID    message
    Wed  2     tw37  overheat
    Mon  2     tw83  high voltage
    Tue  2     tw83  auto restart
    Mon  2     *     *          (week is then promoted to *)
    Wed  2     *     *          (week is then promoted to *)
    Rule 2: Irrelevant constants are ignored
    Rule 3: Selected constants survive and are promoted
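The three selection rules from slides 16 and 18 can be sketched in a few lines of Python. This is only an illustration under assumptions, not the authors' Java implementation: patterns are modelled as tuples of strings, the names `select_patterns` and `WILDCARD` are hypothetical, and "promoted" is read as "the selected constant is replaced by *", which matches the week column of the final patterns on slide 14.

```python
from typing import List, Tuple

Pattern = Tuple[str, ...]
WILDCARD = "*"


def select_patterns(patterns: List[Pattern], column: int, constant: str) -> List[Pattern]:
    """Apply the selection `column = constant` to a list of completeness patterns."""
    result = []
    for pattern in patterns:
        value = pattern[column]
        if value == WILDCARD:
            # Rule 1: a statement with * in the selected column survives unchanged.
            result.append(pattern)
        elif value == constant:
            # Rule 3: the selected constant survives and is promoted to *
            # (assumed reading: every row of the result has this value anyway).
            result.append(pattern[:column] + (WILDCARD,) + pattern[column + 1:])
        # Rule 2: a pattern with a different constant is irrelevant and is dropped.
    return result


# Patterns from the slides: Warnings(day, week, ID, message) and Teams(name, specialization).
warnings_patterns = [("*", "1", "*", "*"), ("Mon", "2", "*", "*"), ("Wed", "2", "*", "*")]
week2_patterns = select_patterns(warnings_patterns, column=1, constant="2")
teams_patterns = select_patterns([("*", "*")], column=1, constant="hardware")
print(week2_patterns)
print(teams_patterns)
```

The week-1 pattern disappears because it says nothing about week 2, while the Teams pattern passes through untouched.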
  • 19. $\sigma_{week=2}(\text{Warnings}) \bowtie_{W.ID=M.ID} \big(\text{Maintenance} \bowtie_{M.resp=T.name} \sigma_{spec=\text{"hw"}}(\text{Teams})\big)$
  • 20. Reasoning about joins
    Operation: $M \bowtie_{M.resp=T.name} \sigma_{spec=\text{"hw"}}(T)$
    Maintenance (left input: data, followed by completeness patterns)
    ID    resp  reason
    tw37  A     disk failure
    tw59  D     software crash
    tw83  B     unknown
    tw91  C     update failure
    tw91  C     network error
    *     A     *
    *     B     *
    *     C     *
    Selected Teams (right input: data, followed by completeness pattern)
    name  specialization
    A     hardware
    B     hardware
    *     *
    Result (data, followed by completeness patterns)
    M.ID  M.resp  M.reason      T.name  T.specialization
    tw37  A       disk failure  A       hardware
    tw83  B       unknown       B       hardware
    *     A       *             A       *
    *     B       *             B       *
    *     C       *             C       *
    Rule 1: Constants join with equal constants
    Rule 2: Wildcards join with anything
    Rule 3: Constants can be promoted
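The join rules from slide 20 can be illustrated with a small Python sketch. It rests on assumptions rather than on the authors' code: patterns are tuples of strings, `join_pair` and `join_patterns` are hypothetical names, and Rule 3 is read as "a constant on one side of the join condition is copied to a wildcard on the other side", which reproduces the result patterns shown on the slide.

```python
from typing import List, Optional, Tuple

Pattern = Tuple[str, ...]
WILDCARD = "*"


def join_pair(left: Pattern, right: Pattern, left_col: int, right_col: int) -> Optional[Pattern]:
    """Join two patterns on left[left_col] = right[right_col], or return None."""
    a, b = left[left_col], right[right_col]
    if a != WILDCARD and b != WILDCARD and a != b:
        # Rule 1: constants only join with equal constants.
        return None
    # Rule 2: a wildcard joins with anything.
    # Rule 3 (assumed reading): the constant, if any, is promoted to both join columns.
    shared = a if a != WILDCARD else b
    new_left = left[:left_col] + (shared,) + left[left_col + 1:]
    new_right = right[:right_col] + (shared,) + right[right_col + 1:]
    return new_left + new_right


def join_patterns(lefts: List[Pattern], rights: List[Pattern],
                  left_col: int, right_col: int) -> List[Pattern]:
    """Equijoin two lists of patterns, pairing every left with every right pattern."""
    joined = []
    for left in lefts:
        for right in rights:
            combined = join_pair(left, right, left_col, right_col)
            if combined is not None:
                joined.append(combined)
    return joined


# Maintenance(ID, resp, reason) joined with the selected Teams(name, specialization).
maintenance_patterns = [("*", "A", "*"), ("*", "B", "*"), ("*", "C", "*")]
selected_teams_patterns = [("*", "*")]
joined_patterns = join_patterns(maintenance_patterns, selected_teams_patterns, 1, 0)
print(joined_patterns)
```

The nested loop is quadratic in the number of patterns, which is in line with the remark in the summary that the join is the potentially expensive operator.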
  • 21. Algorithmic completeness
    Proven: The extended algebra gives all conclusions that hold on the schema level (reasoning only with the metadata highlighted in yellow on the slides, i.e. the completeness patterns)
    • Independent of the algebra tree chosen
  • 22. Looking at the data
    Operation: $M \bowtie_{M.resp=T.name} \sigma_{spec=\text{"hw"}}(T)$
    Maintenance (left input: data, followed by completeness patterns)
    ID    resp  reason
    tw37  A     disk failure
    tw59  D     software crash
    tw83  B     unknown
    tw91  C     update failure
    tw91  C     network error
    *     A     *
    *     B     *
    *     C     *
    Selected Teams (right input: data, followed by completeness pattern)
    name  specialization
    A     hardware
    B     hardware
    *     *
    Schema-level result (data, followed by completeness patterns)
    M.ID  M.resp  M.reason      T.name  T.specialization
    tw37  A       disk failure  A       hardware
    tw83  B       unknown       B       hardware
    *     A       *             A       *
    *     B       *             B       *
    *     C       *             C       *
    There cannot be other hardware teams than A and B
    Result after looking at the data (data, followed by completeness pattern)
    M.ID  M.resp  M.reason      T.name  T.specialization
    tw37  A       disk failure  A       hardware
    tw83  B       unknown       B       hardware
    *     *       *             *       *
    → Database instance allows for more promotion! (for details see paper)
  • 23. So much for the theory, but…
    1. How can we implement this?
    2. How fast is this?
       – In comparison with query evaluation
    3. How can we manage large sets of statements?
  • 24. How can we implement this?
    • Ideally, a plugin inside a DBMS
      – Promotion procedure benefits from fast access to data
    • So far: Separate Java program
    • Schema-level algebra can also be encoded in SQL
      → Could compile normal queries into metadata queries
  • 25. How fast is this? (1)
    • Synthetic data
    • Wikipedia has around 1000 lists declared as complete (using a template or in natural language)
    http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29
  • 26. How fast is this? (2)
    • Manually extracted some and grouped them by topic
      – Recurrent topics: Sports teams, political assemblies, geographical features, songs, operas and other pieces of art
    • Generated one table each about cities, schools and countries
    city (completeness patterns)
    name  country   state             county
    *     USA       Virginia          *
    *     Germany   *                 *
    *     Ukraine   *                 *
    *     Bulgaria  *                 *
    *     USA       New York          *
    *     UK        Carmarthenshire   *
    *     USA       West Virginia     Hampshire County
    *     Czech     Moravian-Silesia  Nový Jičín
    *     Slovenia  *                 *
  • 27. How fast is this? (3)
    Example query:
    SELECT *
    FROM country, city, school
    WHERE country.capital=city.name
    AND city.state=school.state
    • SQL runtime: 2040 ms (25891 records)
    • Completeness pattern runtime: 900 ms (46 patterns)
    Median over 7 join queries:
    • SQL runtime: 2040 ms
    • Completeness pattern runtime: 460 ms
  • 28. How can we manage large sets of patterns?
    Redundancies in workflows may lead to redundant patterns
    - Introduce overhead and restrict comprehensibility
    → Should be identified and removed
    Example: John reports first that all data for Monday of week 2 is complete, and later that the data for the whole of week 2 is complete
    (Monday,2)
    (*,2)
    Trivial?
    (Monday,*,hardware)
    (Wednesday,*,software)
    (Tuesday,2,software)
    (*,*,hardware)
    (Monday,2,*)
    (*,2,software)
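To make the notion of a redundant pattern concrete, here is a minimal Python sketch of pattern entailment and of minimization by pairwise comparison. It is an illustration under assumptions: patterns are tuples of strings, and `subsumes` and `minimize_pairwise` are hypothetical names. A pattern is treated as redundant when another pattern agrees with it on every constant and has a wildcard wherever they differ.

```python
from typing import List, Tuple

Pattern = Tuple[str, ...]
WILDCARD = "*"


def subsumes(general: Pattern, specific: Pattern) -> bool:
    """True if `general` covers everything that `specific` covers."""
    return len(general) == len(specific) and all(
        g == WILDCARD or g == s for g, s in zip(general, specific)
    )


def minimize_pairwise(patterns: List[Pattern]) -> List[Pattern]:
    """Keep only patterns that no other pattern subsumes (quadratic in the input)."""
    unique = list(dict.fromkeys(patterns))  # drop exact duplicates, keep order
    return [
        p for p in unique
        if not any(q != p and subsumes(q, p) for q in unique)
    ]


# John's two reports from the slide: the later one makes the earlier one redundant.
reports = [("Monday", "2"), ("*", "2")]
minimal_reports = minimize_pairwise(reports)

# The less obvious example from the slide.
statements = [
    ("Monday", "*", "hardware"),
    ("Wednesday", "*", "software"),
    ("Tuesday", "2", "software"),
    ("*", "*", "hardware"),
    ("Monday", "2", "*"),
    ("*", "2", "software"),
]
minimal_statements = minimize_pairwise(statements)
print(minimal_reports)
print(minimal_statements)
```

Comparing every pair is only workable for small sets; the results slide reports that pairwise comparison failed immediately on the evaluated workloads.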
  • 29. Minimization of sets of patterns: Options
    • Option 1: Pairwise comparison
    • Option 2: Employment of index structures for quick entailment checking (similar problem studied in theorem proving/AI)
      – Path indexes
      – Discrimination trees
    • Option 3: Hashing
      – Store all statements in a hashmap
      – For each statement, all generalizations are generated (exponentially many!)
      – A statement is most general if none of its generalizations exists in the hashmap
      (Mon, 1, sw) → (*, 1, sw), (Mon, *, sw), (Mon, 1, *), (*, *, sw), (Mon, *, *), (*, 1, *), (*, *, *)
    • Options can be combined with sorting by number of wildcards
      (*, *, *), (Mon, *, *), (*, 2, sw), (Tue, 1, hw)
      → Later statements cannot entail earlier statements
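Option 3 (hashing) can be sketched as follows in Python. This is an assumption-laden illustration, not the evaluated implementation: a Python `set` stands in for the hashmap, patterns are tuples of strings, and `generalizations` and `minimize_by_hashing` are hypothetical names. A pattern with k constants has 2^k - 1 proper generalizations, which is the exponential blow-up the slide warns about.

```python
from itertools import combinations
from typing import List, Tuple

Pattern = Tuple[str, ...]
WILDCARD = "*"


def generalizations(pattern: Pattern) -> List[Pattern]:
    """All patterns obtained by replacing one or more constants with *."""
    constant_positions = [i for i, value in enumerate(pattern) if value != WILDCARD]
    result = []
    for size in range(1, len(constant_positions) + 1):
        for chosen in combinations(constant_positions, size):
            generalized = list(pattern)
            for position in chosen:
                generalized[position] = WILDCARD
            result.append(tuple(generalized))
    return result


def minimize_by_hashing(patterns: List[Pattern]) -> List[Pattern]:
    """Keep a pattern only if none of its generalizations is stored in the hash set."""
    stored = set(patterns)
    return [
        p for p in dict.fromkeys(patterns)  # drop exact duplicates, keep order
        if not any(g in stored for g in generalizations(p))
    ]


example_generalizations = generalizations(("Mon", "1", "sw"))
minimal = minimize_by_hashing(
    [("*", "*", "*"), ("Mon", "*", "*"), ("*", "2", "sw"), ("Tue", "1", "hw")]
)
print(example_generalizations)
print(minimal)
```

Each lookup is a constant-time hash probe on average, so the cost per pattern depends on its number of constants rather than on the size of the pattern set.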
  • 30. Minimization of sets of patterns - Results
    (Pairwise comparison and path indexes failed immediately)
    Time/space tradeoff:
    • Unsorted discrimination trees fastest
    • Sorted hashing/discrimination trees most space efficient
  • 31. Summary
    • Completeness patterns are a natural way to describe complete parts of databases and query answers
      – Can be expressed in the same schema
    • Modified the relational algebra to manipulate completeness patterns
      – Selection and projection easy
      – Join may be expensive (in theory; in practice, usually not)
    • Current work
      – Correctness and completeness patterns
      – Column-level patterns
  • 32. Open Questions
    • Automated ways to get large sets of statements
      – Sensor networks
      – Web extraction (e.g. from Wikipedia)
      – Streams (e.g. transit data)
    • What can be said if an answer is not guaranteed to be complete
      – Probabilistic completeness assessment based on historical data
      – Error bounds
    • Algorithmic completeness of promotion
  • 33. References
    • Technical part today based on:
      – Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, Simon Razniewski, Flip Korn, Werner Nutt and Divesh Srivastava, SIGMOD 2015
    • Other relevant papers:
      – Spatial data completeness: Adding Completeness Information to Query Answers over Spatial Data, Simon Razniewski and Werner Nutt, SIGSPATIAL 2014
      – Completeness over processes: Verification of Query Completeness over Processes, Simon Razniewski, Marco Montali and Werner Nutt, BPM 2013
      – Completeness of values: Completeness of Queries over SQL Databases, Werner Nutt and Simon Razniewski, CIKM 2012
  • 34. Acknowledgment This research has been supported by the project “MAGIC”, funded by the Province of Bozen-Bolzano, Italy

Editor's Notes

  1. These are really all, while for other days, additional records might show up
  2. These are patterns for single tables.
  3. Say: An efficient implementation of this one is possible
  4. Global or not says whether a pattern is compared with all others, or only with the ones loaded so far