AI-driven product innovation: from Recommender Systems to COVID-19

AI/Machine Learning has become an integral part of many household tech products, from Netflix to our phones. In this talk I will draw from my experience driving AI teams at some of those companies to showcase how AI can positively impact products as different as Netflix and Curai, an online telehealth service.


  1. 1. AI-driven product innovation from Recommender Systems to COVID-19 Xavier Amatriain Co-founder/CTO Curai, former Quora, Netflix, researcher (>5k citations) May 7, 2021
  2. 2. About me... ● PhD on Audio and Music Signal Processing and Modeling ● Led research on Augmented/Immersive Reality at UCSB ● Researcher in Recommender Systems for several years ● Started and led ML Algorithms at Netflix ● Head of Engineering at Quora ● Currently co-founder/CTO at Curai 2
  3. 3. Outline 1. The recommender problem & the Netflix Prize 2. Recommendations beyond rating prediction 3. AI + Healthcare: A bit about Curai 4. Lessons Learned
  4. 4. 1. The Recommender Problem 4
  5. 5. The Age of Search has come to an end ● ... long live the Age of Recommendation! ● Chris Anderson in “The Long Tail” o “We are leaving the age of information and entering the age of recommendation” ● CNN Money, “The race to create a 'smart' Google”: o “The Web, they say, is leaving the era of search and entering one of discovery. What's the difference? Search is what you do when you're looking for something. Discovery is when something wonderful that you didn't know existed, or didn't know how to ask for, finds you.”
  6. 6. The “Recommender problem” ● “Traditional” definition: Estimate a utility function that automatically predicts how much a user will like an item. ● Based on: o Past behavior o Relations to other users o Item similarity o Context o …
  7. 7. Approaches to Recommendation ● Collaborative Filtering: Recommendations based only on user behavior (target user and similar users’ behavior) ○ Item-based: Find similar items to those that I have liked ○ User-based: Find similar users to me, and recommend what they liked ● Others ○ Content-based: Recommend items similar to the ones I liked based on their features ○ Demographic: Recommend items liked by users with similar features to me ○ Social: Recommend items liked by people socially connected to me ● Personalized learning-to-rank: Treat recommendations not as binary like/not like, but rather as a ranking/sorting problem ● Hybrid: Combine any of the above
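
As a quick illustration of the item-based collaborative filtering idea above, here is a minimal sketch that scores a user's unrated items by their cosine similarity to the items the user has already rated. The toy rating matrix and all numbers are made up, and a production system would of course work at a very different scale.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def item_cosine_similarity(R):
    """Cosine similarity between the item columns of the rating matrix."""
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    R_normed = R / norms
    return R_normed.T @ R_normed

def recommend_items(R, user, k=1):
    """Score unrated items by similarity to the items this user already rated."""
    sim = item_cosine_similarity(R)
    user_ratings = R[user]
    scores = sim @ user_ratings            # similarity-weighted sum of the user's ratings
    scores[user_ratings > 0] = -np.inf     # do not re-recommend already rated items
    return np.argsort(scores)[::-1][:k]

print(recommend_items(ratings, user=0))    # items most similar to what user 0 liked
```
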
  8. 8. ● Depends on the domain and particular problem ● However, in the general case it has been demonstrated that the best isolated approach is CF o Other approaches can be hybridized to improve results in specific cases (cold-start problem...) ● What matters: o Data preprocessing: outlier removal, denoising, removal of global effects (e.g. individual user's average) o “Smart” dimensionality reduction using MF o Combining methods through ensembles What works?
  9. 9. What we were interested in: ▪ High quality recommendations Proxy question: ▪ Accuracy in predicted rating ▪ Improve by 10% = $1million! ▪ Top 2 algorithms ▪ SVD - Prize RMSE: 0.8914 ▪ RBM - Prize RMSE: 0.8990 ▪ Linear blend Prize RMSE: 0.88 ▪ Limitations ▪ Designed for 100M ratings, not XB ratings ▪ Not adaptable as users add ratings ▪ Performance issues
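
The SVD approach mentioned above is, in essence, matrix factorization trained to minimize RMSE on observed ratings. A minimal FunkSVD-style sketch with plain SGD follows; the Prize-era models were far more elaborate (biases, implicit feedback, careful regularization), so this is illustrative only.

```python
import numpy as np

def factorize(ratings, n_factors=10, lr=0.01, reg=0.05, n_epochs=50, seed=0):
    """FunkSVD-style matrix factorization trained with SGD on observed ratings only.
    `ratings` is a list of (user, item, rating) triples with 0-based indices."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Toy data: (user, item, rating) triples; the Prize dataset had ~100M of these.
data = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = factorize(data, n_factors=2)
rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in data]))
print(f"training RMSE: {rmse:.3f}")
```
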
  10. 10. What about the final prize ensembles? ● Offline studies showed they were too computationally intensive to scale ● Expected improvement not worth engineering effort ● Plus…. Focus had already shifted to other issues that had more impact than rating prediction.
  11. 11. Outline 1. The recommender problem & the Netflix Prize 2. Recommendations beyond rating prediction 3. AI + Healthcare: A bit about Curai 4. Lessons Learned
  12. 12. 2. Beyond Rating Prediction 12
  13. 13. Everything is a recommendation
  14. 14. Evolution of the Recommender Problem Rating Ranking Page Optimization 4.7 Context-aware Recommendations Context
  15. 15. Ranking ● Most recommendations are presented in a sorted list ● Recommendation can be understood as a ranking problem ● Popularity is the obvious baseline, but it is not personalized ● Can we combine it with personalized rating predictions?
  16. 16. Ranking by ratings: 4.7, 4.6, 4.5, 4.5, … Niche titles: high average ratings… by those who would watch it (RMSE-optimized predictions).
  17. 17. Example: two features, linear model. Plot: Popularity vs. Predicted Rating (1-5). Linear model: f_rank(u,v) = w1·p(v) + w2·r(u,v) + b → Final Ranking.
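
A tiny sketch of the two-feature linear ranker on this slide. The weights here are made up; in practice w1, w2 and b would be learned, e.g. with a learning-to-rank objective.

```python
def f_rank(popularity, predicted_rating, w1=0.4, w2=0.6, b=0.0):
    """Linear combination of popularity and personalized predicted rating (both normalized)."""
    return w1 * popularity + w2 * predicted_rating + b

# Hypothetical candidates: (title, popularity in [0, 1], predicted rating in [0, 1]).
candidates = [("A", 0.9, 0.3), ("B", 0.4, 0.8), ("C", 0.7, 0.6)]
ranked = sorted(candidates, key=lambda c: f_rank(c[1], c[2]), reverse=True)
print([title for title, _, _ in ranked])
```
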
  18. 18. Example: two features, linear model (final ranking over the Popularity vs. Predicted Rating plane).
  19. 19. Ranking - Quora Feed Goal: Present most interesting stories for a user at a given time Interesting = topical relevance + social relevance + timeliness Stories = questions + answers ML: Personalized learning-to-rank approach Relevance-ordered vs time-ordered = big gains in engagement
  20. 20. Page Composition: 10,000s of possible rows → 10-40 rows per page; a variable number of possible videos per row (up to thousands); 1 personalized page per device.
  21. 21. User Attention Modeling From “Modeling User Attention and Interaction on the Web” 2014 - PhD Thesis by Dmitry Lagun (Emory U.)
  22. 22. N-dimensional model
  23. 23. Deep learning
  24. 24. However...
  25. 25. Outline 1. The recommender problem & the Netflix Prize 2. Recommendations beyond rating prediction 3. AI + Healthcare: A bit about Curai 4. Lessons Learned
  26. 26. 3. AI + Healthcare: A bit about Curai 27
  27. 27. Healthcare access, quality, and scalability ● >50% of the world with no access to essential health services ○ ~30% of US adults under-insured ● Projected shortage of 120,000 physicians by 2030 ● ~15 min. to capture information, diagnose, and recommend treatment ● 30% of the medical errors causing ~400k deaths a year are due to misdiagnosis
  28. 28. 29 We have an opportunity to reimagine healthcare
  29. 29. 30 We have an opportunity obligation to reimagine healthcare
  30. 30. Towards an AI powered learning health system ● Mobile-First Care: always on, accessible, affordable ● AI + human providers in the loop for quality care ● Always-Learning system ● AI to operate in-the-wild (EHR) ● AI-augmented medical conversations feeding a feedback → data → model loop
  31. 31. Towards an FDA-approved “AI Doctor”: clinical vignettes + automated offline evaluations, KB, analytics layer, security & privacy, dataset datasheets, additional analytics, A/B testing, model lineage.
  32. 32. Breakthroughs in AI & healthcare
  33. 33. Research areas at Curai ● Medical Reasoning and Diagnosis ○ Learning from the Experts: combining expert systems and deep learning ● NLP ○ Medical dialogue summarization ○ Transfer Learning for Similar Questions ○ Medical entity recognition: An active learning approach ● Multimodal healthcare AI ○ Few-shot dermatology image classification
  34. 34. General principles 1. Extensible a. Data feedback loops b. Incrementally/iteratively learn from “physician-in-the-loop” 2. Domain knowledge + Data 3. In the wild a. Uncertainty in prediction b. Fall-back to “physician-in-the-loop”
  35. 35. Research at Curai
  36. 36. SOTA Medical reasoning and diagnosis
  37. 37. Our approach to COVID-19
  38. 38. ML + Expert systems for Dx models. Pipeline: expert system → clinical case simulator → clinical cases with DDx → ML model, plus other data (e.g. EHR); simulated case: female, middle-aged, chronic cough, nasal congestion → common cold, UTI, acute bronchitis. Example inputs: female, middle aged, fever, cough. DDx with expert system: influenza 16.9, bacterial pneumonia 16.9, acute sinusitis 10.9, asthma 10.9, common cold 10.9. DDx with ML model: influenza 0.753, bacterial pneumonia 0.205, asthma 0.017, acute sinusitis 0.008, pulmonary tuberculosis 0.007.
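
A heavily simplified sketch of the pipeline above: an expert-system-driven case simulator emits (findings, diagnosis) pairs, and a discriminative model is trained on them to produce a ranked DDx. The symptom list, probability profiles, and choice of logistic regression are all illustrative stand-ins, not Curai's actual models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

SYMPTOMS = ["fever", "cough", "nasal_congestion", "headache", "dysuria"]
DIAGNOSES = ["influenza", "common_cold", "uti"]

def simulate_case(rng):
    """Stand-in for the clinical case simulator: sample a diagnosis, then typical findings."""
    dx = int(rng.integers(len(DIAGNOSES)))
    profiles = {0: [0.9, 0.8, 0.3, 0.5, 0.05],   # influenza
                1: [0.2, 0.6, 0.9, 0.3, 0.05],   # common cold
                2: [0.3, 0.05, 0.05, 0.2, 0.9]}  # UTI
    findings = (rng.random(len(SYMPTOMS)) < np.array(profiles[dx])).astype(float)
    return findings, dx

rng = np.random.default_rng(0)
X, y = zip(*(simulate_case(rng) for _ in range(2000)))
model = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))

# Ranked DDx for a new case, analogous to the slide's example output.
case = np.array([[1, 1, 0, 0, 0]], dtype=float)       # fever + cough
for p, dx in sorted(zip(model.predict_proba(case)[0], DIAGNOSES), reverse=True):
    print(f"{dx}: {p:.3f}")
```
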
  39. 39. COVID-aware modeling. Pipeline: expert system → clinical case simulator → clinical cases with DDx → ML model, augmented with COVID-19 assessment data (simulated case: female, middle-aged, chronic cough, nasal congestion → common cold, UTI, acute bronchitis). Example COVID-19 inputs: female, middle-age, cough, headache, nose discharge, cigarette smoking, hospital personnel → COVID-19.
  40. 40. Examples. Case 1: inputs = female, middle aged, fever, cough, healthcare worker; DDx before COVID = influenza, bacterial pneumonia, asthma; DDx after COVID = COVID-19, influenza, bacterial pneumonia. Case 2: inputs = female, middle aged, fever, cough, nasal congestion; DDx before COVID = influenza, adenovirus infection, bacterial pneumonia; DDx after COVID = influenza, COVID-19, adenovirus infection.
  41. 41. Question similarity for COVID FAQs ● Transfer learning ● Double-finetuned BERT model ● Handle data sparsity ● Medical domain knowledge through an intermediate QA binary task ● Stage 1: pretrained BERT fine-tuned on “Does A answer Q?” (Q, A) ● Stage 2: that BERT fine-tuned on “Q1 similar to Q2?” (Q1, Q2) ● Proceedings of the 2020 ACM SIGKDD
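
A minimal sketch of scoring question similarity with a BERT pair classifier via the Hugging Face transformers API. The checkpoint name and the helper function are assumptions for illustration; the two fine-tuning stages described above (first on the binary “Does A answer Q?” task, then on question-pair similarity labels) are omitted, so an off-the-shelf checkpoint would produce meaningless scores until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"   # assumed starting checkpoint; would be fine-tuned twice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def similarity_score(q1: str, q2: str) -> float:
    """Probability that q1 and q2 ask the same thing, under the pair classifier."""
    inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(similarity_score("Can ibuprofen help with a COVID-19 fever?",
                       "Is it safe to take ibuprofen if I have COVID symptoms?"))
```
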
  42. 42. Product in action 44
  43. 43. Medical summarization ● Best paper at ACL Workshop on medical conversations, 2021 ● Machine Learning for Healthcare, 2021 ● EMNLP Findings, 2020
  44. 44. Our Approach: GPT-3-ENS. Priming: 21 unique labeled examples per priming context; inference: 10 trials of GPT-3 are ensembled. Chat snippet: DR: Thanks for answering my questions. Apart from Benadryl and Hydrocortisone, have you used anything to help? PT: No that’s everything. Example summaries across trials: “Hasn't used any thing to help”; “Hasn’t used any thing to help other than hydrocortisone”; “Used nothing else to help other than Benadryl and hydrocortisone.” Confidentiality note: In accordance with our privacy policy, the illustrative examples included in this document do NOT correspond to real patients. They are either synthetic or fully anonymized.
  45. 45. GPT-3 vs GPT-3-ENS ● 12.5% of summaries produced by GPT-3-ENS are better than those of GPT-3
  46. 46. GPT-3-ENS as data synthesizer. Same priming/inference setup (21 labeled examples per priming context, 10 GPT-3 trials), but the GPT-3-ENS labeled dataset, together with a doctor labeled/corrected dataset, is used to train an in-house summarization model. ● Privacy concerns with invoking GPT-3 at inference time ● Doctor in the loop. Confidentiality note: In accordance with our privacy policy, the illustrative examples included in this document do NOT correspond to real patients. They are either synthetic or fully anonymized.
  47. 47. Results ● We are able to train a summarization model using GPT-3-ENS labeled data (which needed 210 doctor-labeled examples) comparable in performance to a model trained using 6400 doctor-labeled examples ● We find that a model that is trained on a mix of human and GPT-3-ENS labeled data does better than a model trained on either independently
  48. 48. Conclusions ● Healthcare needs to scale quickly, and this has become obvious in a global pandemic like the one we are facing ● The only way to scale healthcare while improving quality and accessibility is through technology and SOTA AI ● But, AI cannot be simply “dropped” in the middle of old workflows and processes ○ It needs to be integrated in end-to-end medical care benefitting both patients and providers https://curai.com/work
  49. 49. Outline 1. The recommender problem & the Netflix Prize 2. Recommendations beyond rating prediction 3. AI + Healthcare: A bit about Curai 4. Lessons Learned
  50. 50. 4. Lessons Learned 52
  51. 51. 1. Data or and Models?
  52. 52. More data or better models? Really?
  53. 53. More data or better models? Sometimes, it’s not about more data
  54. 54. More data or better models? Norvig: “Google does not have better algorithms, only more data.” Many features / low-bias models
  55. 55. More data or better models? Sometimes you might not need all your “Big Data” (learning curve: testing accuracy vs. number of training examples, in millions).
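
One practical way to check whether you need all your “Big Data” is a learning curve: train on growing subsets and see whether held-out accuracy has already plateaued. A minimal sketch on a synthetic dataset (the dataset and model are arbitrary stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for "Big Data".
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Accuracy vs. training-set size: if the curve is flat, more data will not help much.
for n in [500, 2_000, 8_000, 32_000]:
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(f"{n:>6} examples -> test accuracy {model.score(X_test, y_test):.3f}")
```
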
  56. 56. What about Deep Learning?
  Year | Breakthrough in AI | Datasets (First Available) | Algorithms (First Proposal)
  1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
  1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983)
  2005 | Google’s Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
  2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2005) | Mixture-of-Experts algorithm (1991)
  2014 | Google’s GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional neural network algorithm (1989)
  2015 | Google’s DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning algorithm (1992)
  The average elapsed time between key algorithm proposals and corresponding advances was about 18 years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than 3 years, or about 6 times faster.
  57. 57. What about Deep Learning? Models and recipes: pretrained models available, trained using OpenNMT → English→German, German→English, English summarization, multi-way FR,ES,PT,IT,RO ↔ FR,ES,PT,IT,RO. More models coming soon: Ubuntu Dialog Dataset, syntactic parsing, image-to-text.
  58. 58. More data and better data
  59. 59. 2. Implicit user signals are always sometimes better than explicit ones
  60. 60. Implicit vs. Explicit ● Many have acknowledged that implicit feedback is more useful ● Is implicit feedback really always more useful? ● If so, why?
  61. 61. ● Implicit data is (usually): ○ More dense, and available for all users ○ Better representative of user behavior vs. user reflection ○ More related to final objective function ○ Better correlated with AB test results ● E.g. Rating vs watching Implicit vs. Explicit
  62. 62. ● However ○ It is not always the case that direct implicit feedback correlates well with long-term retention ○ E.g. clickbait ● Solution: ○ Combine different forms of implicit + explicit to better represent long-term goal Implicit vs. Explicit
  63. 63. 3. MODEL ACCURACY IS NOT ALL THAT MATTERS
  64. 64. Explanation/Support for Recommendations Social Support
  65. 65. 4. Model training depends not only on input data
  66. 66. Training a model ● Model will learn according to: ○ Training data (e.g. implicit and explicit) ○ Target function (e.g. probability of user reading an answer) ○ Metric (e.g. precision vs. recall) ● Example 1 (made up): ○ Optimize probability of a user going to the cinema to watch a movie and rate it “highly” by using purchase history and previous ratings. Use NDCG of the ranking as final metric using only movies rated 4 or higher as positives.
  67. 67. Example 2 - Quora’s feed ● Training data = implicit + explicit ● Target function: value of showing a story to a user ≈ weighted sum of actions: v = Σ_a v_a · 1{y_a = 1} ○ Predict probabilities for each action, then compute the expected value: v_pred = E[V | x] = Σ_a v_a · p(a | x) ● Metric: any ranking metric
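
A small sketch of the expected-value scoring above: per-action values times predicted action probabilities, summed into v_pred and used to rank stories. The action names, weights, and probabilities are hypothetical.

```python
# Hypothetical per-action values v_a.
ACTION_VALUES = {"upvote": 2.0, "click": 1.0, "share": 5.0, "downvote": -3.0}

def expected_value(action_probs):
    """v_pred = E[V | x] = sum over actions a of v_a * p(a | x), the score for one story."""
    return sum(ACTION_VALUES[a] * p for a, p in action_probs.items())

# Hypothetical predicted action probabilities p(a | x) for two candidate stories.
stories = {
    "story_1": {"upvote": 0.10, "click": 0.60, "share": 0.01, "downvote": 0.02},
    "story_2": {"upvote": 0.30, "click": 0.40, "share": 0.05, "downvote": 0.01},
}
ranked = sorted(stories, key=lambda s: expected_value(stories[s]), reverse=True)
print(ranked)   # stories in decreasing order of expected value
```
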
  68. 68. 5. THERE IS SOMETHING EVEN MORE IMPORTANT THAN MODELS AND DATA: EXPERIMENTAL PROCESS
  69. 69. Offline/Online testing process
  70. 70. Executing A/B tests ● Measure differences in metrics across statistically identical populations that each experience a different algorithm. ● Decisions on the product always data-driven ● Overall Evaluation Criteria (OEC) = member retention ○ Use long-term metrics whenever possible ○ Short-term metrics can be informative and allow faster decisions ■ But, not always aligned with OEC
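
For a simple two-cell A/B comparison of a rate metric (e.g. a retention proxy), a standard significance check is the two-proportion z-test. A minimal sketch with made-up cell sizes and conversions:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two rates (control A vs. treatment B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical cells: control vs. new algorithm, measuring a retention-style rate.
z = two_proportion_z_test(conv_a=4_120, n_a=50_000, conv_b=4_310, n_b=50_000)
print(f"z = {z:.2f}")   # |z| > 1.96 is roughly significant at the 5% level (two-sided)
```
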
  71. 71. Offline testing ● Measure model performance, using (IR) metrics ● Offline performance = indication to make decisions on follow-up A/B tests ● A critical (and mostly unsolved) issue is how offline metrics correlate with A/B test results.
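
Offline evaluation of a ranker typically uses IR metrics such as NDCG. A minimal sketch computing NDCG for one ranked list of graded relevance labels:

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevance labels."""
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))
    return float(np.sum((2 ** relevances - 1) / discounts))

def ndcg(relevances):
    """DCG normalized by the ideal (sorted) ordering, so the result is in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance of the items a ranker placed at positions 1..5.
print(round(ndcg([3, 2, 0, 1, 0]), 3))
```
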
  72. 72. 5. Conclusions 74
  73. 73. 01. Choose the right metric 02. Be thoughtful about your data 03. Understand dependencies between data, models & systems 04. Optimize only what matters, beware of biases 05. Be thoughtful about: data infrastructure/tools, how to organize your teams, and above all, your product requirements
  74. 74. 2. Further “reading” 76
  75. 75. ● 4-hour lecture on recommendations, Carnegie Mellon (2014) ● 1-hour lecture on practical Deep Learning, UC Berkeley (2020) ● 10 minutes on AI for COVID, Stanford (2020) ● 1-hour podcast on AI for Healthcare, Gradient Dissent (2021)
