Rsqrd AI: Errudite- Scalable, Reproducible, and Testable Error Analysis

In this talk, Sherry Wu talks about Errudite and reproducible results.

Presented 09/25/2019

**These slides are from a talk given at Rsqrd AI. Learn more at rsqrdai.org**

  1. 1. 1 Errudite: Scalable, Reproducible, and Testable Error Analysis Tongshuang (Sherry) Wu @tongshuangwu University of Washington Marco Tulio Ribeiro Microsoft Research Jeffrey Heer @jeffrey_heer Daniel S. Weld @dsweld University of Washington
  2. 2. 2 Motivation & Contributions
  3. 3. 3 Error analysis is important for… Uncovering bugs Improving the state-of-art Safeguarding deployments
  4. 4. 4 Where We Are: “We performed an error analysis on a sample of 100 questions.” (Fader et al., ACL’13) “We sample 100 incorrect predictions and try to find common error categories.” (Chen et al., ACL’16) “We randomly select 50 incorrect questions and categorize them into 6 classes.” (Wadhwa et al., ACL’18)
  5. 5. 5 Where We Are: “We performed an error analysis on a sample of 100 questions.” (Fader et al., ACL’13) “We sample 100 incorrect predictions and try to find common error categories.” (Chen et al., ACL’16) “We randomly select 50 incorrect questions and categorize them into 6 classes.” (Wadhwa et al., ACL’18) In short: “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”
  6. 6. 6 Where We Are “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No Test on true cause
  7. 7. 7 Where We Are & Our Contribution “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No Test on true cause Principles & Errudite Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis
  8. 8. 8 [Figure with panels labeled A, B, C, D, E, F]
  9. 9. 9 [Figure with panels labeled A, B, C, D, E, F]
  10. 10. 10 Design & Use Scenario Examine the distractor hypothesis on BiDAF (Seo et al., 2016), with SQuAD (10570 instances; Rajpurkar et al., 2016) Independently tested by 4 (out of 10) participants in the user study
  11. 11. 11 The Scenario: Analyze BiDAF. In the context of Machine Comprehension (MC)… Context: “…John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.” + Question: “Who created the 2005 theme for Doctor Who?” ↓ Answer: “Murray Gold”
  12. 12. 12 Why the incorrect prediction? Who created the 2005 theme for Doctor Who? …John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.
  13. 13. 13 Why the mistake? Distractor hypothesis …John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original. Who created the 2005 theme for Doctor Who? BiDAF… Matches entity types Knows to find a PERSON Finds the exact answer spans Distracted by other PERSON spans
  14. 14. 14 Why the incorrect prediction? Who created the 2005 theme for Doctor Who? …John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original. Does BiDAF make similar distractor mistakes often?
  15. 15. 15 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Too ambiguous to reproduce. Biased conclusion due to… Subjectively defined hypotheses
  16. 16. 16 What is distractor hypothesis? The groundtruth and the prediction… (D1) Are roughly the same entity type; (D2) Have the same recognizable entity type. Example with the prediction and groundtruth spans marked: “…started his writing journey during his college year. But, he didn’t finish his first book until 1996.” Not a recognizable DATE! DATE (YEAR) matches “when”!
  17. 17. 17 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Quantify instances with a domain specific language. Biased conclusion due to… Subjectively defined hypotheses. Errudite: Precise & reproducible hypotheses
  18. 18. 18 Precise DSL (Domain Specific Language): DSL = Attribute Extractor + Target + Operators, e.g. length(q) > 20. Extract Instance Attribute (A B C D); Filter Instance Groups (E F)
  19. 19. C D 19 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 C D F
  20. 20. 20 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 is_entity “The groundtruth is an ENTity.” ENT(Murray Gold) == PERSON
  21. 21. 21 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 is_entity has_distractor “There are more tokens matching the ground truth entity type (ENT(g)) in the whole context than in the groundtruth.” count(PERSON : Murray Gold, John Debney, Ron Grainer) == 3 count(PERSON : Murray Gold) == 1
  22. 22. 22 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 is_entity has_distractor correct_type “The model prediction ENTity type matches the groundtruth ENTity type.” ENT(John Debney) == PERSON
  23. 23. 23 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 is_entity has_distractor correct_type is_distracted “The model prediction is incorrect.”
  24. 24. 24 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 is_entity has_distractor correct_type is_distracted CorrectIncorrect Have the same recognizable entity typeD2 D1 Are roughly the same entity type
  25. 25. 25 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… Subjectively defined hypotheses + Small samples. Errudite: Precise & reproducible hypotheses + Scale up to the entire dev set. 100 << 2000+ errors in total
  26. 26. 26 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 is_entity has_distractor correct_type is_distracted CorrectIncorrect 5.7% of all BiDAF errors: The distractor hypothesis seems correct!
  27. 27. 27 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… Focus exclusively on errors: Wrongly prioritize groups that are well-handled on average.
  28. 28. 28 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… Focus exclusively on errors: Wrongly prioritize groups that are well-handled on average. Errudite: Cover errors & correct instances
  29. 29. 29 Build distractor groups with DSL ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0 1 2 3 4 5 is_entity has_distractor correct_type is_distracted all_instance CorrectIncorrect 88% EM > 68% EM: BiDAF performs better when there are distractors and the entity type is matched than it does overall. Reject / revise the hypothesis!
  30. 30. 30 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… Small samples + Focus exclusively on errors. Errudite: Scale up to the entire dev set + Cover errors & correct instances
  31. 31. 31 …John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original. Who created the 2005 theme for Doctor Who? is_distracted: Distractor entity? Multi-sentence reasoning? HAS distractor prediction != IS WRONG due to distractor prediction
  32. 32. 32 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” HAS distractor prediction != IS WRONG due to distractor prediction. Biased conclusion due to… No Test on true cause
  33. 33. A B C D E F 33 Are the 192 instances really wrong because of the distractor? Would BiDAF work perfectly if we remove the distractors? Answer what-if questions with counterfactual analysis! F
  34. 34. 34 Counterfactual Analysis with Rewrite Rules: rewrite(target, from → to). Re-write the target part of an instance by replacing from with to
  35. 35. 35 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(target, from → to). Re-write the target part of an instance by replacing from with to
  36. 36. 36 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, from → to). Re-write the context part of an instance by replacing from with to
  37. 37. 37 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → to). Re-write the context part of an instance by replacing the model predicted distractor string with to
  38. 38. 38 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → "#"). Re-write the context part of an instance by replacing the model predicted distractor string with a placeholder token “#”
  39. 39. 39 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → "#"). Q: Who created the 2005 theme for Doctor Who? C: …[John Debney → #] created a new arrangement of Ron Grainer’s … Murray Gold provided a new arrangement… Incorrect → Incorrect
  40. 40. 40 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → "#"). Incorrect → Incorrect: Another distractor is still confusing the model!
  41. 41. 41 Counterfactual Analysis with Rewrite Rules: p(m) for the 192 rewritten is_distracted instances are… rewrite(c, string(p(m)) → "#"). Incorrect → Incorrect: Another distractor is still confusing the model!
  42. 42. 42 Counterfactual Analysis with Rewrite Rules: p(m) for the 192 rewritten is_distracted instances are… rewrite(c, string(p(m)) → "#"). 29% Incorrect → Incorrect: Another distractor is still confusing the model! 48% Incorrect → Correct: The distractor was fooling the model! 23% Unchanged: Other factors are at play! (e.g. “age of 18, 10.5% # from 18 to 24…”)
  43. 43. 43 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… No Test on true cause. Errudite: Test via counterfactual analysis
  44. 44. 44 Deliveries: Precise + Reproducible + Re-applicable. Attribute: ENT(g). Groups: is_entity, has_distractor, correct_type, is_distracted, all_instance. Rewrite rule: rewrite(c, string(p(m)) → "#")
  45. 45. 45 Deliveries: Precise + Reproducible + Re-applicable. Attribute + Groups + Rewrite rule, applied to… BiDAF is … not particularly bad at distractors. Seeming distractor errors can be due to other factors.
  46. 46. 46 Deliveries: Precise + Reproducible + Re-applicable. Attribute + Groups + Rewrite rule, re-applied to… Other datasets & Other models… ? at handling distractors.
  47. 47. 47 Biased conclusion due to… Subjectively defined hypotheses + Small samples + Focus exclusively on errors + No Test on true cause Principles & Errudite Precise & reproducible hypotheses + Scale up to the entire dev set + Cover errors & correct instances + Test via counterfactual analysis
  48. 48. 48 Better error analysis will then… Uncover bugs Improve the state-of-art Safeguard deployments Errudite improves error analysis… Precise Reproducible Scalable Testable
  49. 49. 49 Thank you! Video demo: https://tinyurl.com/errudite-video Opensource: https://github.com/uwdata/errudite
  50. 50. 50 Backup
  51. 51. 51 Domain Specific Language (DSL)
  52. 52. 52 Precise DSL (Domain Specific Language). Qualitative Description: “The model is bad on long questions”. Quantitative Description: questions with more than 20 tokens, length(q) > 20. Attribute Extractor: length, question_type, answer_type. Target: question, context, groundtruth, prediction (model), token, sentence. Operators: >, !=, in, has_any
  53. 53. 53 DSL (Domain Specific Language) Attribute Extractor. Basic Attributes: Length. General purpose linguistic features: LEMMA, POS, ENT. Standard prediction performance metrics: f1, accuracy. Between-target relations: overlap(t1, t2). Domain-specific attribute: answer_type, question_type
  54. 54. 54 DSL (Domain Specific Language) Attribute Extractor …John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original. Who created the 2005 theme for Doctor Who?
  55. 55. 55 User Interface Functionality
  56. 56. 56 Suggestions via programming-by-demonstration …John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original. Who created the 2005 theme for Doctor Who? Suggested conditions: starts_with(p(m), pattern="NNP"); starts_with(p(m), pattern="PERSON"); answer_type(g) == answer_type(p(m)); exact_match(m) == 0; is_correct_sent(m) == False; overlap(q, sentence(p(m))) > overlap(q, sentence(g))
  57. 57. 57 Suggestions via programming-by-demonstration: Who → What person: “What person created the 2005 theme for Doctor Who?”
  58. 58. 58 Check attribute distribution ENT(g) in groups is_entity is_distracted CorrectIncorrect
  59. 59. 59 Check attribute distribution ENT(g) in groups is_entity is_distracted CorrectIncorrect
  60. 60. 60 Check attribute distribution ENT(g) in groups is_entity is_distracted CorrectIncorrect
  61. 61. 61 Check attribute distribution ENT(g) in groups is_entity is_distracted CorrectIncorrect
  62. 62. 62 Additional Case on Preciseness
  63. 63. 63 User study: What is imprecise answer boundaries? “The model is making predictions with missing or additional words…?” D1: No exact match, but high overlap: exact_match(p(m)) == 0 and f1(p(m)) > 0.7. D2: Off by at most 2 tokens both on the left and right: exact_match(p(m)) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2
  64. 64. 64 User study: What is imprecise answer boundaries? “The model is making predictions with missing or additional words…?” D1: No exact match, but high overlap: exact_match(p(m)) == 0 and f1(p(m)) > 0.7. D2: Off by at most 2 tokens both on the left and right: exact_match(p(m)) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2
  65. 65. 65 User study: What is imprecise answer boundaries? D1: No exact match, but high overlap. D2: Off by at most 2 tokens both on the left and right. Example spans (prediction vs. groundtruth): “…the polynomial time hierarchy collapses.” / “…believed that the polynomial hierarchy does..”
  66. 66. 66 User Study: 10 participants = NLP graduate students + QA engineers from industry. Examine BiDAF (Seo et al., 2016) on SQuAD (Rajpurkar et al., 2016). One-hour session: Replicate prior error analysis + Freely explore the model
  67. 67. 67 Study #1 Replication: Errudite flexible enough? Read BiDAF error analysis: 50 errors, hand-labeled into 6 classes Rate closeness: Recreated groups == originals? semantic Recreate 4 classes with Errudite on the entire dataset
  68. 68. 68 Users were able to express their intended groups well with Errudite.
  69. 69. 69 Study #1 Replication: Easy group ≠ reproducible! How close does the approximation match the paper definition? Most confident, an easy group. How many errors are covered by user-built Imprecise Error Boundary? 13.8%, 45.8%: Groups with low inter-agreement! [Charts: Closeness (1 to 5) and Error Coverage (0% to 50%) for the Boundary group]
  70. 70. 70 Study #1 Replication: Easy group ≠ reproducible! Off by at most 2 tokens both on the left and right exact_match(m) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2 D1 Coverage = 22.1% D2 exact_match(m) == 0 and f1(m) > 0.7 No exact match, but high overlap Coverage = 13.8%
  71. 71. 71 Study #1 Replication: Easy group ≠ reproducible! Off by at most 2 tokens both on the left and right exact_match(m) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2 D2 exact_match(m) == 0 and f1(m) > 0.7 No exact match, but high overlapD1 Coverage = 22.1% Coverage = 13.8%
  72. 72. Study #1 Replication: Easy group ≠ reproducible! Coverage = 22.1% Coverage = 13.8% …commercial, scientific, and cultural growth…D1 D2 D1 D2 D1 D2 …from Karakorum in Mongolia to Khanbaliq… …the polynomial time hierarchy collapses. …believed that the polynomial hierarchy does.. Off by at most 2 tokens both on the left and right D2 No exact match, but high overlapD1
  73. 73. 73 Users were able to express their intended groups well with Errudite. Ambiguous manual labels prevent consistent replication, even when users thought they had replicated the groups!
  74. 74. 74 “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to… Subjectively defined group. Errudite: Precise & reproducible grouping
  75. 75. 75 Study #2 Exploration: Errudite Useful Enough? Freely explore BiDAF with Errudite, think aloud Rate insights on importance, confidence, relative easiness Describe their observations / insights on BiDAF
  76. 76. 76 Study #2 Exploration: Errudite Useful Enough? Users reported μ = 2.1, σ = 0.94 findings: Confirmed prior hypotheses, Extended previous knowledge, Rejected prior hypotheses. Users thought their insights are… [Chart: Score (1 to 5) for Importance, Fidelity, Easiness, Quality] Users learned more about the model (μ = 3.9, σ = 0.94).
  77. 77. 77 User Feedback: Did they like Errudite? Enhanced their error analysis experience. Systematically scaled up the analysis. Precise and inspiring more confidence. Much faster exploration.
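
The distractor group from slides 19 to 24 and the rewrite rule from slides 34 to 42 can be sketched in plain Python. This is a minimal illustration, not Errudite's actual API: `Instance`, `TOY_ENTITIES`, `toy_model`, and every function name below are hypothetical. A dictionary lookup stands in for a real named-entity recognizer, and a "pick the first PERSON" heuristic stands in for BiDAF.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical stand-in for an NER model: a fixed lookup of known spans.
TOY_ENTITIES = {
    "John Debney": "PERSON",
    "Ron Grainer": "PERSON",
    "Murray Gold": "PERSON",
}


@dataclass(frozen=True)
class Instance:
    question: str
    context: str
    groundtruth: str


def ent(span: str) -> str:
    """Entity type of a span, or "" when it is not a recognized entity."""
    return TOY_ENTITIES.get(span, "")


def count_ent(text: str, ent_type: str) -> int:
    """Count mentions of known entities of the given type inside text."""
    return sum(text.count(name) for name, label in TOY_ENTITIES.items()
               if label == ent_type)


def f1_is_zero(prediction: str, groundtruth: str) -> bool:
    """True when the two spans share no token (token-level F1 would be 0)."""
    return not (set(prediction.lower().split()) & set(groundtruth.lower().split()))


def is_distracted(inst: Instance, prediction: str) -> bool:
    """Mirror of the four conditions in the slide's DSL expression."""
    g_type = ent(inst.groundtruth)
    return (
        g_type != ""                                         # is_entity
        and count_ent(inst.context, g_type)
        > count_ent(inst.groundtruth, g_type)                # has_distractor
        and ent(prediction) == g_type                        # correct_type
        and f1_is_zero(prediction, inst.groundtruth)         # f1(m) == 0
    )


def rewrite_context(inst: Instance, old: str, new: str = "#") -> Instance:
    """rewrite(c, old -> new): replace a string in the context only."""
    if not old:
        return inst
    return replace(inst, context=inst.context.replace(old, new))


def toy_model(inst: Instance) -> str:
    """Hypothetical model: predicts the first known entity in the context."""
    hits = [(inst.context.find(name), name)
            for name in TOY_ENTITIES if name in inst.context]
    return min(hits)[1] if hits else ""


def counterfactual_outcome(inst: Instance, model: Callable[[Instance], str]) -> str:
    """Remove the predicted span from the context and re-run the model."""
    before = model(inst)
    after = model(rewrite_context(inst, before))
    if after == inst.groundtruth:
        return "corrected"
    if after == before:
        return "unchanged"
    return "another distractor"


example = Instance(
    question="Who created the 2005 theme for Doctor Who?",
    context=("John Debney created a new arrangement of Ron Grainer's original "
             "theme for Doctor Who in 1996. For the return of the series in "
             "2005, Murray Gold provided a new arrangement."),
    groundtruth="Murray Gold",
)

print(toy_model(example), is_distracted(example, toy_model(example)))
print(counterfactual_outcome(example, toy_model))
```

With this toy model the example lands in the "another distractor" bucket, the same outcome slide 40 reports for the Doctor Who instance. Running `counterfactual_outcome` over every instance in a group and tallying the three labels would give a breakdown of the kind shown on slide 42.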
