Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
1. Going Beyond Provenance: Explaining Query Answers
with Pattern-based Counterbalances
SIGMOD 2019
Zhengjie Miao, Qitian Zeng, Boris Glavic, Sudeepa Roy
Illinois Institute of Technology Duke University
SIGMOD Research Session 5 - July 3rd - 11:30am
Slide 1 of 16 Q. Zeng - CAPE:
7. Related Work
Provenance
Semiring model [Green et al., 2007]
Causality-based [Meliou et al., 2010]
Provenance systems [Arab et al., 2014]
"Why high/low" questions [Wu and Madden, 2013] [Roy and Suciu, 2014]
Intervention: a subset of the provenance whose removal would cause the result to move in the opposite direction
All based on provenance
Slide 3 of 16 Q. Zeng - CAPE: Introduction
13. Only provenance is useful?
Boris: Why did you work only 2 hours yesterday?
Qitian (provenance-based explanation): Yeah, I worked from 9-11 AM.
Boris: Okay, I'm cutting your stipend.
Slide 4 of 16 Q. Zeng - CAPE: Introduction
14. Only provenance is useful?
Boris: Why did you work only 2 hours yesterday?
Qitian: I was on a plane to SIGMOD for 8 hours.
Boris: Fair enough.
Slide 4 of 16 Q. Zeng - CAPE: Introduction
16. Example - Table
Pub
author  pubid  year  venue
AX      P1     2005  SIGKDD
AY      P2     2004  SIGKDD
AZ      P2     2004  SIGKDD
AZ      P3     2004  SIGMOD
Q =
SELECT author, year, venue, count(*) AS pubcnt
FROM Pub
GROUP BY author, year, venue
Result of Q (excerpt):
author  venue   year  pubcnt
AX      SIGKDD  2006  4
AX      SIGKDD  2007  1
AX      SIGKDD  2008  4
Slide 5 of 16 Q. Zeng - CAPE: Introduction
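As a minimal sketch, the query Q from the example slide can be run on the four sample rows of Pub with Python's built-in sqlite3 module. The in-memory database and the ORDER BY clause (added only to make the output deterministic) are assumptions of this sketch, not part of the talk.

```python
import sqlite3

# The four sample rows of the Pub table shown on the slide.
pub_rows = [
    ("AX", "P1", 2005, "SIGKDD"),
    ("AY", "P2", 2004, "SIGKDD"),
    ("AZ", "P2", 2004, "SIGKDD"),
    ("AZ", "P3", 2004, "SIGMOD"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pub (author TEXT, pubid TEXT, year INTEGER, venue TEXT)")
conn.executemany("INSERT INTO Pub VALUES (?, ?, ?, ?)", pub_rows)

# Query Q from the slide; ORDER BY is added here for a stable row order.
query = """
    SELECT author, year, venue, count(*) AS pubcnt
    FROM Pub
    GROUP BY author, year, venue
    ORDER BY author, year, venue
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On these four rows every group contains exactly one publication; the result excerpt on the slide comes from the full data set.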
21. Example - Query Result
φ = "Why is the number of AX's SIGKDD 2007 papers low?"
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
A subset of the provenance whose removal makes AX's SIGKDD 2007 paper count go up
Our approach: by counterbalance
AX's high publication count in another venue or another year
Slide 6 of 16 Q. Zeng - CAPE: Introduction
27. Our Approach
Assumptions about φ:
A pattern exists that describes the data (Aggregate Regression Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
CAPE: Mine ARPs → Look for counterbalances → Present top k
Mining ARPs is done offline; looking for counterbalances and presenting the top k is interactive, driven by the user question
Slide 7 of 16 Q. Zeng - CAPE: Introduction
32. Aggregate Regression Pattern
P = "For each author, the total publication count (count(*)) is linear over the years"
An ARP consists of:
A set of partition attributes (in P: author)
A set of predictor attributes (in P: year)
An aggregate function (in P: count(*))
A regression model type (in P: linear)
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
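The four components of an ARP can be captured in a small record type. The sketch below is only an illustration; the class name `ARP`, its field names, and the `describe` helper are hypothetical and not taken from the CAPE implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ARP:
    """Hypothetical container for the four parts of an aggregate regression pattern."""
    partition: FrozenSet[str]   # partition attributes, e.g. {"author"}
    predictor: FrozenSet[str]   # predictor attributes, e.g. {"year"}
    aggregate: str              # aggregate function, e.g. "count(*)"
    model: str                  # regression model type, e.g. "linear" or "constant"

    def describe(self) -> str:
        """Render the pattern in the sentence form used on the slides."""
        part = ", ".join(sorted(self.partition))
        pred = ", ".join(sorted(self.predictor))
        return f"For each {part}, {self.aggregate} is {self.model} over {pred}"


# Pattern P from the slide.
p = ARP(frozenset({"author"}), frozenset({"year"}), "count(*)", "linear")
print(p.describe())
```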
36. Aggregate Regression Pattern
P = "For each author, the total publication count (count(*)) is linear over the years"
A pattern can hold locally on a fixed value of the partition attributes (say, P holds on AX)
A pattern can also hold globally if it holds locally for sufficiently many values of the partition attributes (a good number of authors)
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
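One way to make "holds locally" and "holds globally" concrete is sketched below. It assumes a goodness-of-fit test (R² of a least-squares line against a threshold `theta`) for local holding and a minimum number of supporting partition values (`min_support`) for global holding. Both thresholds, their names, and the per-author counts are assumptions of this sketch; the slides do not state the exact criteria.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def r_squared(xs, ys):
    """Coefficient of determination of the fitted line."""
    slope, intercept = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot


def holds_locally(points, theta=0.8):
    """points: list of (year, count) pairs for one partition value, e.g. one author."""
    xs, ys = zip(*points)
    return r_squared(xs, ys) >= theta


def holds_globally(series_by_value, theta=0.8, min_support=2):
    """The pattern holds globally if it holds locally for enough partition values."""
    support = sum(holds_locally(pts, theta) for pts in series_by_value.values())
    return support >= min_support


# Made-up yearly publication counts per author.
counts = {
    "AX": [(2005, 2), (2006, 4), (2007, 6), (2008, 8)],
    "AY": [(2005, 1), (2006, 3), (2007, 5), (2008, 7)],
    "AZ": [(2005, 5), (2006, 1), (2007, 6), (2008, 2)],
}
print({author: holds_locally(pts) for author, pts in counts.items()})
print(holds_globally(counts))
```

In this toy data the linear pattern holds locally on AX and AY but not on AZ, which is enough support for it to hold globally under the assumed threshold of two authors.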
43. Mining ARP
Brute force: at least 3^|R| candidate patterns
Optimizations:
Restricting size: at most 4 attributes in a pattern. This alone reduces the number of candidate patterns to polynomial.
Reusing sort order:
Partition attributes  Predictor attributes
A,B,C                 D
A,B                   C,D
A                     B,C,D
Detecting and applying functional dependencies:
"For each A, agg(α) is linear over C"
A → B
⇒ "For each A and B, agg(α) is linear over C"
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
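The effect of the size restriction can be illustrated by counting candidates. In the sketch below, each attribute is assigned one of three roles (partition, predictor, unused), which gives the 3^|R| figure; the choice of aggregate function and model type would multiply this further. The function names, the requirement that both attribute sets be non-empty, and the sample rows are assumptions of this sketch. A simple functional dependency check is included as well, since a pattern on A carries over to (A, B) whenever A → B holds.

```python
from itertools import combinations, product

ROLES = ("partition", "predictor", "unused")


def count_role_assignments(n_attrs):
    """Brute force: every attribute independently takes one of three roles."""
    return sum(1 for _ in product(ROLES, repeat=n_attrs))


def candidates(attrs, max_size=4):
    """Enumerate (partition, predictor) splits using at most max_size attributes.

    Both sets are required to be non-empty (an assumption of this sketch).
    """
    out = []
    for size in range(2, max_size + 1):
        for used in combinations(attrs, size):
            for k in range(1, size):
                for part in combinations(used, k):
                    pred = tuple(a for a in used if a not in part)
                    out.append((part, pred))
    return out


def fd_holds(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs on a list of dict rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


attrs = list("ABCDEF")
print(count_role_assignments(len(attrs)))   # all role assignments for 6 attributes
print(len(candidates(attrs, max_size=4)))   # splits with at most 4 attributes

rows = [
    {"A": 1, "B": "x", "C": 5},
    {"A": 1, "B": "x", "C": 6},
    {"A": 2, "B": "y", "C": 5},
]
print(fd_holds(rows, ["A"], ["B"]), fd_holds(rows, ["A"], ["C"]))
```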
59. Steps of Counterbalancing
φ = "Why is the number of AX's SIGKDD 2007 papers low?"
1 Relevant pattern (not all patterns are useful)
Holds locally on φ
E.g., P1 = "For each author and venue, the total publication count is constant over the years" needs to hold on (AX, SIGKDD)
[Figure: AX's number of SIGKDD publications each year]
Generalizes φ
E.g., P = "For each author, the total publication count is linear over the years"
[Figure: AX's number of publications each year]
2 Refinement (there might not be a direct counterbalance on the relevant pattern)
P = "For author AX, the total publication count is linear over the years"
Refine to author AX and ICDE, with a constant model:
P1 = "For author AX and ICDE, the total publication count is constant over the years"
In this simple example we happen to refine back to the same attributes as the user question, but this does not have to be the case.
3 Counterbalance
t = (AX, ICDE, 2007, 6) ∈ Q_P1
t[pubcnt] = 6 is a high outlier
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
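The last step, finding tuples that deviate in the opposite direction of the question, can be sketched for the refined pattern P1 with a constant model. Only the tuple (AX, ICDE, 2007, 6) comes from the slides; the counts for the other years, the use of the mean as the constant model's prediction, and the function names are assumptions of this sketch.

```python
def constant_model(values):
    """Prediction of a constant regression model: the mean of the observed values."""
    return sum(values) / len(values)


def counterbalances(series, question_direction):
    """Return (year, count, deviation) for tuples deviating opposite to the question.

    series: {year: count} for one fixed partition value, e.g. (AX, ICDE).
    question_direction: "low" if the user asks why a value is low, else "high".
    """
    expected = constant_model(list(series.values()))
    sign = 1 if question_direction == "low" else -1
    found = []
    for year, count in sorted(series.items()):
        deviation = count - expected
        if sign * deviation > 0:   # a high outlier counterbalances a low question
            found.append((year, count, deviation))
    return found


# AX's ICDE publication counts per year; only the 2007 value is from the slides.
ax_icde = {2005: 2, 2006: 2, 2007: 6, 2008: 2}
print(counterbalances(ax_icde, "low"))
```

With these counts the constant model predicts 3 publications per year, so 2007 is the only year above expectation and is returned as the counterbalance candidate.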
61. Explanation
Explanations returned by CAPE for φ
contain AX's publication counts in other venues or other years
E.g. (AX, ICDE, 2006, 6), (AX, VLDB, 2007, 4)
do not need to have the same schema as φ
E.g. (AX, 2010, 63)
Not all counterbalances are good. We need to score them and return the top
ones.
Slide 11 of 16 Q. Zeng - CAPE: Counterbalance with ARP
66. Scoring Explanations
1 The distance between the user question tuple and the explanation tuple.
⇒ Tuples that are more similar to the question are more likely to have caused the unusual result.
For φ = (AX, SIGKDD, 2007, 1), 2007 is a better year for an answer than 2006, and ICDE is a better venue than a conference in another area, such as SIGCOMM.
2 The deviation of the explanation tuple from its expected value.
⇒ A higher deviation means the tuple is more unusual, and thus more likely to cause other unusual events.
[Figures: AX's SIGKDD publications per year; AX's ICDE publications per year]
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
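The two scoring criteria can be combined in many ways. The sketch below assumes one simple combination, deviation divided by one plus distance, purely for illustration; the slides do not give CAPE's actual scoring formula. The distance function, the `year_scale` parameter, and the deviation values are hypothetical too, and the distance treats every venue mismatch equally, whereas distinguishing ICDE from SIGCOMM would need domain knowledge about venues.

```python
def distance(question, explanation, year_scale=10.0):
    """Distance over the attributes that both tuples share.

    Categorical attributes contribute 1 when they differ; the year contributes
    the absolute difference divided by year_scale, capped at 1.
    """
    d = 0.0
    for attr in sorted(question.keys() & explanation.keys()):
        q, e = question[attr], explanation[attr]
        if attr == "year":
            d += min(abs(q - e) / year_scale, 1.0)
        elif q != e:
            d += 1.0
    return d


def score(question, explanation, deviation):
    """Higher deviation and smaller distance give a higher score."""
    return abs(deviation) / (1.0 + distance(question, explanation))


phi = {"author": "AX", "venue": "SIGKDD", "year": 2007}
same_year = {"author": "AX", "venue": "ICDE", "year": 2007}
other_year = {"author": "AX", "venue": "ICDE", "year": 2006}

# With equal deviation, the explanation from the same year ranks higher.
print(score(phi, same_year, 3.0), score(phi, other_year, 3.0))
```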
71. Qualitative Evaluation
Another example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ = "Why is the number of battery crimes in community area 26 in 2011 low (16)?"
Explanations
rank  type     community  year  count(*)  score
1     -        26         2012  117       63.9
2     Battery  25         2011  79        60.5
3     Battery  -          2010  1095      49.0
4     Assault  26         2011  10        40.1
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
73. Conclusion & Future Work
Conclusions
Provenance may be insufficient
Reasonable explanations can be given by counterbalance
Mine patterns offline
Look for counterbalance and rank online
Future Work
Extend to a larger class of queries,
e.g., joins
Slide 14 of 16 Q. Zeng - CAPE: Conclusion & Future Work
75. References I
[Arab et al., 2014] Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., and Glavic, B. (2014).
A generic provenance middleware for database queries, updates, and transactions.
In Proceedings of the 6th USENIX Workshop on the Theory and Practice of Provenance.
[Green et al., 2007] Green, T. J., Karvounarakis, G., and Tannen, V. (2007).
Provenance semirings.
In PODS, pages 31–40.
[Meliou et al., 2010] Meliou, A., Gatterbauer, W., Moore, K. F., and Suciu, D. (2010).
The complexity of causality and responsibility for query answers and non-answers.
PVLDB, 4(1):34–45.
[Roy and Suciu, 2014] Roy, S. and Suciu, D. (2014).
A formal approach to finding explanations for database queries.
In SIGMOD, pages 1579–1590.
[Wu and Madden, 2013] Wu, E. and Madden, S. (2013).
Scorpion: Explaining away outliers in aggregate queries.
PVLDB, 6(8):553–564.
Slide 16 of 16 Q. Zeng - CAPE: Bibliography