
Data Excellence: Better Data for Better AI

Invited talk at the Crowd Science Workshop @NeurIPS 2020 Conference: https://research.yandex.com/workshops/crowd/neurips-2020#schedule



  1. Data Excellence: Better Data for Better AI. Crowd Science Workshop @ NeurIPS 2020. Lora Aroyo, http://lora-aroyo.org @laroyo. Cover image scanned from The Magic of M. C. Escher (Harry N. Abrams, Inc., ISBN 0-8109-6720-0) by Justin Foote, fair use, https://en.wikipedia.org/w/index.php?curid=3955850
  2. TAKE HOME MESSAGE: data is the compass for AI (AI advances where there is data); data quality must be addressed in AI practices, especially in the way we evaluate AI; improving AI evaluation must consider ways to measure variance and capture bias, bringing us one step closer to data excellence; to address variance in AI evaluation we propose novel metrics for reliability (xRR), significance (metrology for AI), and disagreement (CrowdTruth); to address bias in AI evaluation we propose a novel method for crowdsourcing adverse test sets for ML models (CATS4ML).
  6. The Rise of the Machines: "AI Winter": lab experiments, expert systems, small-scale experiments.
  7. The Rise of the Machines: "AI Winter" → "AI Breakthroughs in Games": IBM Watson on Jeopardy!, DeepMind AlphaGo (AI beat the humans).
  8. The Rise of the Machines: "AI Winter" → "AI Breakthroughs in Games" → "Real World Tasks": health diagnostics, flu prediction, weather prediction, text/image/video classification, text generation, text translation, conversational AI (AI supports the humans).
  9. Mainstream Deployment of AI: "Real World Tasks" deployed in the wild → unintended behaviors: Microsoft Tay bot, IBM Watson Oncology, Amazon Rekognition, Google Photos, Apple Face ID, Facebook chat bots, various speech assistants.
  10. Data is the compass for AI: data quality is essential for guiding AI away from unintended behaviours and for getting computers to "see" the diversity of data.
  11. The Life of AI Data: "It exists!" (bootstrapping AI with data): Caltech101, LabelMe, Berkeley-3D. https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
  12. The Life of AI Data: "It exists!" → "It is bigger!" (data-hungry AI): ImageNet, SIFT10M, OpenImages, COCO, Web 1T 5-gram.
  13. The Life of AI Data: "It exists!" → "It is bigger!" → "It is better!" ... but before it got better ...
  14. ... it got worse ...
  15. Unintended Behaviors in AI. Adapted from "AI in the Open World: Discovering Blind Spots of AI", SafeAI 2020, Ece Kamar.
  16. The Life of AI Data: but before it got better ... reactive data improvement.
  17. The Life of AI Data: to reach "It is better!" we need proactive data improvement.
  18. The Life of AI Data. Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24, 2 (2009). In the decade since then, the research community has done a lot with quantity, but quality has been left behind.
  19. But ... data quality is not easy. Datasets are not easy to debug: low data quality is typically not the result of software bugs or simple human error. Achieving excellent data requires changing the way we gather data and use it to evaluate AI systems: ensure that datasets represent the problems they are intended for; ensure that the human annotation process captures the natural variance and bias in the problem; ensure that the metrics used to measure AI quality utilize the variance and bias in the human responses.
  20. Data quality is not only human error: it is not easy to give a yes/no answer for most of our AI tasks. Do these images depict a GUITAR?
  21. Data quality should consider context of use: the answer typically depends on the context, the task, the usage, etc. Do these images depict NEW ZEALAND?
  22. Data quality should include real-world diversity: disagreement is a signal of the natural diversity and variance in human annotations and should be included in AI training. Do these images depict a WEDDING?
  23. Data quality is difficult even with experts. Does the sentence express a TREATS relation between CHLOROQUINE and MALARIA? (a) "For prevention of malaria, use only in individuals traveling to malarious areas where CHLOROQUINE-resistant P. falciparum MALARIA has not been reported." (b) "Rheumatoid arthritis and MALARIA have been treated with CHLOROQUINE for decades." (c) "Among 56 subjects reporting to a clinic with symptoms of MALARIA, 53 (95%) had ordinarily effective levels of CHLOROQUINE in blood." "Crowdsourcing Ground Truth for Medical Relation Extraction", ACM TiiS Special Issue on Human-Centered Machine Learning, 2018. A. Dumitrache, L. Aroyo, C. Welty.
  24. DISAGREEMENT IS SIGNAL: there are different sources of disagreement.
  25. Three sides of human interpretation (CrowdTruth). Disagreement provides guidance in task analysis: items with poor semantics; items with salient terms; items difficult to classify; ambiguous items; subjective annotations; time-sensitive annotations; difficult annotation tasks; mis-translated annotations; workers with or without specific knowledge; communities of thought; spammers. "Three Sides of CrowdTruth", Human Computation Journal, 2014, L. Aroyo, C. Welty.
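The "disagreement is signal" idea can be made concrete with a simple per-item score. The sketch below is not the CrowdTruth library's actual metric; it is a minimal illustration, assuming each item carries a raw list of annotator labels, that uses label-distribution entropy as a disagreement proxy.

```python
from collections import Counter
from math import log2

def disagreement(annotations):
    """Shannon entropy of one item's label distribution:
    0.0 when all annotators agree, higher when they split."""
    counts = Counter(annotations)
    n = len(annotations)
    return sum((c / n) * log2(n / c) for c in counts.values())

# Full agreement: nothing to investigate.
print(round(disagreement(["yes"] * 7), 3))              # 0.0
# A 4/3 split (like the WEDDING images): genuine ambiguity,
# worth reviewing rather than dismissing as annotator error.
print(round(disagreement(["yes"] * 4 + ["no"] * 3), 3)) # 0.985
```

High-entropy items are exactly the ones the slide lists: ambiguous, subjective, or context-dependent examples where forcing a single "truth" discards information.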
  26. 7 Myths about Human Annotation: (1) One truth: knowledge acquisition typically assumes one correct interpretation for every example. (2) Experts rule: knowledge is captured from domain experts. (3) One is enough: a single expert's knowledge is sufficient. (4) Disagreement is bad: when people disagree, they must not understand the problem. (5) Detailed explanations help: if examples cause disagreement, adding instructions should resolve it. (6) Once done, forever valid: knowledge is never updated, and new data is not aligned with old. (7) All examples are created equal: triples are triples; one is not more important than another, and all are either true or false. ... but in current practice disagreement is continuously avoided and the world is forced into a binary setting. "Truth is a Lie: 7 Myths about Human Annotation", AI Magazine 2014, L. Aroyo, C. Welty.
  27. TAKE HOME MESSAGE (recap of slide 2): measure variance (xRR, metrology for AI, CrowdTruth) and capture bias (CATS4ML) in AI evaluation to move toward data excellence.
  28. Your AI model is only as good as your evaluation data ... but is your evaluation data missing relevant examples? cats4ml.humancomputation.com
  29. How can we find such missing examples, especially if they are AI blindspots (i.e. unknown unknowns)? cats4ml.humancomputation.com
  30. CATS4ML Challenge (Crowdsourcing Adverse Test Sets for ML) offers a crowdsourced solution for finding the blindspots of your AI models, inspired by prior research in human computation: "Beat the Machine: Challenging Humans to Find a Predictive Model's 'Unknown Unknowns'" (Attenberg, Ipeirotis, Provost, 2014) and Google's Vulnerability Reward Program.
  31. CATS4ML Challenge: a crowdsourced red team for finding the blindspots of your AI models. cats4ml.humancomputation.com
  32. In this first version of the CATS4ML challenge, participants will discover AI blindspots in the Open Images Dataset. Lipstick? Airplane? Car? Construction worker? Thanksgiving? These AI blindspots are real images with visual patterns that confuse AI models in ways humans might find meaningful. cats4ml.humancomputation.com
  33. The Challenge Process: participants submit (image, label) pairs drawn from the Open Images dataset (e.g. image 1: lipstick; image 2: thanksgiving); the submitted labels are compared against the dataset's labels, go through human-in-the-loop verification, and each submission is scored. cats4ml.humancomputation.com
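The scoring step in the process above could be sketched as follows. This is an illustrative assumption, not the challenge's actual implementation: all names (`score_submission`, `verified`, the dict shapes) are hypothetical, and the rule encoded is simply "a submitted pair counts when humans verify the label but the dataset's labels disagree".

```python
# Hypothetical sketch of CATS4ML-style submission scoring: a submitted
# (image, label) pair is a candidate blindspot when human-in-the-loop
# verification confirms the label but the dataset does not contain it.

def score_submission(submission, dataset_labels, verified):
    """Count submitted pairs that humans verify as correct but that
    are missing from the dataset's own labels for that image."""
    score = 0
    for image, label in submission:
        human_says_yes = verified.get((image, label), False)
        dataset_agrees = label in dataset_labels.get(image, set())
        if human_says_yes and not dataset_agrees:
            score += 1  # a potential AI blindspot
    return score

submission = [("img1", "lipstick"), ("img2", "thanksgiving")]
dataset_labels = {"img1": {"lipstick"}, "img2": set()}
verified = {("img1", "lipstick"): True, ("img2", "thanksgiving"): True}
print(score_submission(submission, dataset_labels, verified))  # 1
```

Only "img2"/"thanksgiving" counts: it is human-verified but absent from the dataset's labels, whereas "img1"/"lipstick" is already labeled and so reveals nothing new.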
  34. (Slides 34-36: screenshots of the challenge site, cats4ml.humancomputation.com)
  37. The challenge will run until 30 April 2021; a submission quota applies. Winners and all participants will be invited to the final (virtual) event in 2021. cats4ml.humancomputation.com
  38. Join us now! Build a team and join; contribute some creative adverse examples; spread the word to your research community; give us feedback. cats4ml.humancomputation.com
  39. The Team: Praveen Paritosh, Ka Wong, Lora Aroyo, Devi Krishna.
  40. TAKE HOME MESSAGE (closing recap of slide 2): data is the compass for AI; addressing variance (xRR, metrology for AI, CrowdTruth) and bias (CATS4ML) in AI evaluation brings us one step closer to data excellence.
  41. (Closing slide, repeating slide 1: Data Excellence: Better Data for Better AI. Lora Aroyo, http://lora-aroyo.org @laroyo)
