AI/Machine Learning has become an integral part of many household tech products, from Netflix to our phones. In this talk I will draw from my experience driving AI teams at some of those companies to showcase how AI can positively impact products as different as Netflix and Curai, an online telehealth service.
2. About me...
● PhD on Audio and Music Signal Processing and Modeling
● Led research on Augmented/Immersive Reality at UCSB
● Researcher in Recommender Systems for several years
● Started and led ML Algorithms at Netflix
● Head of Engineering at Quora
● Currently co-founder/CTO at Curai
3. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
5. The Age of Search has come to an end
● ... long live the Age of Recommendation!
● Chris Anderson in “The Long Tail”
o “We are leaving the age of information and entering the age of
recommendation”
● CNN Money, “The race to create a 'smart' Google”:
o “The Web, they say, is leaving the era of search and entering one of
discovery. What's the difference? Search is what you do when you're
looking for something. Discovery is when something wonderful that
you didn't know existed, or didn't know how to ask for, finds you.”
6. The “Recommender problem”
● “Traditional” definition: Estimate a utility function that
automatically predicts how much a user will like an
item.
● Based on:
o Past behavior
o Relations to other users
o Item similarity
o Context
o …
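As a minimal sketch of such a utility function, one can score a (user, item) pair as a weighted sum of the signals listed above. The signal names and weights below are purely hypothetical illustrations, not any real system:

```python
# Hypothetical signals for one (user, item) pair, each normalized to [0, 1].
signals = {
    "past_behavior":   0.8,  # how much past behavior predicts liking this item
    "similar_users":   0.6,  # how much similar users liked it
    "item_similarity": 0.7,  # similarity to items the user liked
    "context":         0.3,  # contextual fit (time of day, device, ...)
}

# Hypothetical importance weights (would be learned in practice).
weights = {
    "past_behavior":   0.4,
    "similar_users":   0.3,
    "item_similarity": 0.2,
    "context":         0.1,
}

def predict_utility(signals, weights):
    """Utility as a weighted sum of the available signals."""
    return sum(weights[k] * signals[k] for k in signals)

print(round(predict_utility(signals, weights), 2))  # 0.67
```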
7. Approaches to Recommendation
● Collaborative Filtering: Recommendations based only on user
behavior (target user and similar users’ behavior)
○ Item-based: Find similar items to those that I have liked
○ User-based: Find similar users to me, and recommend what they liked
● Others
○ Content-based: Recommend items similar to the ones I liked based on their
features
○ Demographic: Recommend items liked by users with similar features to me
○ Social: Recommend items liked by people socially connected to me
● Personalized learning-to-rank: Treat recommendations not as
binary like/not like, but rather as a ranking/sorting problem
● Hybrid: Combine any of the above
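The item-based variant above can be illustrated with a toy example: compute cosine similarity between item rating vectors and recommend items similar to those the user liked. The rating matrix is made up:

```python
import math

# Toy user-item rating matrix (rows: users, cols: items); 0 = unrated.
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def item_vector(j):
    """Column j of the rating matrix: how every user rated item j."""
    return [row[j] for row in ratings]

# Items 0 and 1 are rated alike by the same users, so they come out
# similar; items 0 and 2 are rated by disjoint audiences, so they don't.
sim_01 = cosine(item_vector(0), item_vector(1))
sim_02 = cosine(item_vector(0), item_vector(2))
print(sim_01 > sim_02)  # True
```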
8. What works?
● Depends on the domain and the particular problem
● In the general case, however, CF has repeatedly been shown to be the
best standalone approach
o Other approaches can be hybridized with it to improve results in specific
cases (e.g. the cold-start problem)
● What matters:
o Data preprocessing: outlier removal, denoising, removal of global effects
(e.g. an individual user's average)
o “Smart” dimensionality reduction using matrix factorization (MF)
o Combining methods through ensembles
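The MF point can be sketched as a tiny matrix-factorization model trained with SGD on squared error. Ratings, dimensions, and hyperparameters below are synthetic choices for illustration only:

```python
import random

# Tiny matrix factorization: learn k latent factors per user and item so
# that their dot product approximates observed ratings.
random.seed(0)
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.05, 0.02  # learning rate and L2 regularization (hypothetical)
for _ in range(500):
    for u, i, r in ratings:
        err = r - sum(P[u][f] * Q[i][f] for f in range(k))
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

pred = sum(P[0][f] * Q[0][f] for f in range(k))
print(round(pred, 1))  # close to the observed rating of 5.0
```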
9. What we were interested in:
▪ High quality recommendations
Proxy question:
▪ Accuracy in predicted rating
▪ Improve by 10% = $1 million!
▪ Top 2 algorithms
▪ SVD - Prize RMSE: 0.8914
▪ RBM - Prize RMSE: 0.8990
▪ Linear blend Prize RMSE: 0.88
▪ Limitations
▪ Designed for 100M ratings, not billions of ratings
▪ Not adaptable as users add ratings
▪ Performance issues
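A linear blend like the one above (combining two predictors such as SVD and RBM) can be sketched as follows; the numbers are synthetic, and the single blend weight has a closed-form least-squares solution:

```python
# Blend two rating predictors with a weight chosen to minimize RMSE on
# held-out data. All ratings and predictions below are made up.

def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

truth = [4.0, 3.0, 5.0, 2.0, 4.0]
svd   = [4.2, 2.7, 4.6, 2.4, 3.9]  # predictor A (e.g. an SVD model)
rbm   = [3.6, 3.2, 5.1, 1.8, 4.3]  # predictor B (e.g. an RBM model)

# Minimize sum((t - b) - w*(a - b))^2 over w: standard least squares.
num = sum((t - b) * (a - b) for t, a, b in zip(truth, svd, rbm))
den = sum((a - b) ** 2 for a, b in zip(svd, rbm))
w = num / den
blend = [w * a + (1 - w) * b for a, b in zip(svd, rbm)]

# The optimal blend can never be worse than either input predictor,
# since w = 0 and w = 1 recover them exactly.
print(rmse(blend, truth) <= min(rmse(svd, truth), rmse(rbm, truth)))  # True
```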
10. What about the final prize ensembles?
● Offline studies showed they were too computationally
intensive to scale
● Expected improvement not worth engineering effort
● Plus… focus had already shifted to other issues that had more
impact than rating prediction
11. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
14. Evolution of the Recommender Problem
Rating → Ranking → Context-aware Recommendations → Page Optimization
15. Ranking
● Most recommendations are presented in a sorted list
● Recommendation can be understood as a ranking problem
● Popularity is the obvious baseline, but it is not personalized
● Can we combine it with personalized rating predictions?
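One simple answer to the question above is to rank by a convex mix of popularity and the personalized predicted rating. The titles, numbers, and mixing weight below are hypothetical:

```python
# Rank titles by alpha * popularity + (1 - alpha) * predicted rating,
# both normalized to [0, 1]. All values are made up for illustration.

titles = {
    "blockbuster":  {"popularity": 0.95, "predicted_rating": 0.55},
    "niche_gem":    {"popularity": 0.20, "predicted_rating": 0.98},
    "average_fare": {"popularity": 0.50, "predicted_rating": 0.60},
}

def score(t, alpha=0.5):
    """alpha = 1.0 is pure popularity; alpha = 0.0 is pure personalization."""
    return alpha * t["popularity"] + (1 - alpha) * t["predicted_rating"]

ranking = sorted(titles, key=lambda name: score(titles[name]), reverse=True)
print(ranking)  # ['blockbuster', 'niche_gem', 'average_fare']
```

Tuning alpha trades off the safe popularity baseline against personalization; at alpha = 0.5 the niche title already outranks the merely average one.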
16. Ranking by ratings
[Figure: a row of niche titles whose average ratings are all between 4.5 and 4.7]
● Niche titles
● High average ratings… by those who would watch it
20. Ranking - Quora Feed
Goal: Present the most interesting stories for a user at a given time
Interesting = topical relevance + social relevance + timeliness
Stories = questions + answers
ML: Personalized learning-to-rank approach
Relevance-ordered vs. time-ordered = big gains in engagement
26. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
28. Healthcare access, quality, and scalability
● >50% of the world has no access to essential health services
○ ~30% of US adults are under-insured
● ~15 min. to capture information, diagnose, and recommend treatment
● 30% of the medical errors causing ~400k deaths a year are due to
misdiagnosis
● Shortage of 120,000 physicians by 2030
30. We have an opportunity, and an obligation, to reimagine healthcare
31. Towards an AI-powered learning health system
● Mobile-first care: always on, accessible, affordable
● AI + human providers in the loop for quality care
● Always-learning system
● AI that operates in-the-wild (EHR)
[Diagram: DATA → MODEL → FEEDBACK loop powering AI-augmented medical conversations]
33. Towards an FDA-approved “AI Doctor”
● Clinical vignettes + automated offline evaluations
● KB analytics layer
● Security & privacy
● Dataset datasheets
● Additional analytics
● A/B testing
● Model lineage
35. Research areas at Curai
● Medical Reasoning and Diagnosis
○ Learning from the Experts: combining expert systems and
deep learning
● NLP
○ Medical dialogue summarization
○ Transfer Learning for Similar Questions
○ Medical entity recognition: An active learning approach
● Multimodal healthcare AI
○ Few-shot dermatology image classification
36. General principles
1. Extensible
a. Data feedback loops
b. Incrementally/iteratively learn from
“physician-in-the-loop”
2. Domain knowledge + Data
3. In the wild
a. Uncertainty in prediction
b. Fall-back to “physician-in-the-loop”
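The “in the wild” principle above can be sketched as uncertainty-aware triage: when the model's predictive distribution over diagnoses is too uncertain, fall back to the physician-in-the-loop. The entropy threshold and the probabilities are hypothetical illustrations:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def triage(ddx_probs, threshold=1.0):
    """Return the top diagnosis, or escalate when the model is unsure."""
    if entropy(ddx_probs.values()) > threshold:
        return "escalate_to_physician"
    return max(ddx_probs, key=ddx_probs.get)

# A confident, peaked DDx vs. an uncertain, nearly-uniform one:
confident = {"influenza": 0.90, "common_cold": 0.07, "asthma": 0.03}
uncertain = {"influenza": 0.35, "common_cold": 0.33, "asthma": 0.32}

print(triage(confident))  # influenza
print(triage(uncertain))  # escalate_to_physician
```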
40. ML + Expert systems for Dx models
Inputs: female, middle-aged, fever, cough
DDx with expert system (scores): influenza 16.9, bacterial pneumonia 16.9,
acute sinusitis 10.9, asthma 10.9, common cold 10.9
DDx with ML model (probabilities): influenza 0.753, bacterial pneumonia 0.205,
asthma 0.017, acute sinusitis 0.008, pulmonary tuberculosis 0.007
[Diagram: Expert system → Clinical case simulator → Clinical cases with DDx
(e.g. female, middle-aged, chronic cough, nasal congestion → common cold, UTI,
acute bronchitis) + other data (e.g. EHR) → ML model]
41. COVID-aware modeling
[Diagram: Expert system (augmented with COVID-19) → Clinical case simulator →
Clinical cases with DDx + COVID-19 assessment data → ML model]
Example case: female, middle-aged, cough, headache, nose discharge, cigarette
smoking, hospital personnel → DDx now includes COVID-19 alongside common cold,
UTI, and acute bronchitis
43. Question similarity for COVID FAQs
● Transfer learning
● Double-finetuned BERT model
● Handles data sparsity
● Medical domain knowledge injected through an intermediate binary QA task
(“Does A answer Q?”)
[Diagram: Pretrained BERT → finetuned on “Does A answer Q?” (Q, A pairs) →
finetuned on “Is Q1 similar to Q2?” (Q1, Q2 pairs)]
Proceedings of the 2020 ACM SIGKDD
45. Medical summarization
Best paper at ACL Workshop on medical conversations 2021
Machine Learning for Healthcare, 2021
EMNLP Findings 2020
46. Our Approach: GPT-3-ENS
Chat snippet:
DR: Thanks for answering my questions. Apart from Benadryl and Hydrocortisone,
have you used anything to help?
PT: No that’s everything
Priming: 21 unique labeled examples per priming context, over 10 trials
Inference: each GPT-3 instance in the ensemble produces a candidate summary, e.g.:
● “Hasn't used any thing to help”
● “Hasn’t used any thing to help other than hydrocortisone”
● “Used nothing else to help other than Benadryl and hydrocortisone.”
Confidentiality note: In accordance with our privacy policy, the illustrative examples included in this document do NOT correspond to real patients. They are either synthetic or fully anonymized.
47. GPT-3 vs GPT-3-ENS
● 12.5% of summaries produced by
GPT-3-ENS are better than those of GPT-3
48. GPT-3-ENS as data synthesizer
Chat snippet:
DR: Thanks for ...
PT: No that’s everything
Priming: 21 labeled examples per priming context, over 10 trials
Inference: the GPT-3 ensemble produces candidate summaries (e.g. “Hasn't used
any thing to help”, “Hasn’t used any thing to help other than hydrocortisone”,
“Used nothing else to help other than Benadryl and hydrocortisone.”)
GPT-3-ENS labeled dataset + doctor labeled/corrected dataset → in-house
summarization model
● Privacy concerns with invoking GPT-3 at inference time
● Doctor in the loop
49. Results
● We can train a summarization model on GPT-3-ENS labeled data (which
required only 210 doctor-labeled examples) that is comparable in
performance to a model trained on 6400 doctor-labeled examples
● We find that a model that is trained on a mix of human and
GPT-3-ENS labeled data does better than a model trained on either
independently
50. Conclusions
● Healthcare needs to scale quickly, and this has become obvious in a
global pandemic like the one we are facing
● The only way to scale healthcare while improving quality and
accessibility is through technology and SOTA AI
● But AI cannot simply be “dropped” into the middle of old workflows
and processes
○ It needs to be integrated into end-to-end medical care, benefiting both patients and
providers
https://curai.com/work
51. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
55. More data or better models?
Sometimes,
it’s not about
more data
56. More data or better models?
Norvig: “Google does not have better algorithms, only more data”
Many features / low-bias models
57. More data or better models?
Sometimes
you might not
need all your
“Big Data”
[Plot: testing accuracy vs. number of training examples (in millions, 0-20)]
58. What about Deep Learning?

| Year | Breakthrough in AI | Datasets (first available) | Algorithms (first proposal) |
|------|--------------------|----------------------------|-----------------------------|
| 1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984) |
| 1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983) |
| 2005 | Google’s Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988) |
| 2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2005) | Mixture-of-Experts algorithm (1991) |
| 2014 | Google’s GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional neural network algorithm (1989) |
| 2015 | Google’s DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning algorithm (1992) |
| | Average no. of years to breakthrough | 3 years | 18 years |

The average elapsed time between key algorithm proposals and their corresponding advances was about 18 years, whereas the average elapsed time between key dataset availability and the corresponding advances was less than 3 years, i.e. about 6 times faster.
59. What about Deep Learning?
Pretrained models and recipes available, trained using OpenNMT:
→ English → German
→ German → English
→ English summarization
→ Multi-way: FR,ES,PT,IT,RO <> FR,ES,PT,IT,RO
More models coming soon:
→ Ubuntu Dialog Dataset
→ Syntactic parsing
→ Image-to-text
62. Implicit vs. Explicit
● Many have acknowledged
that implicit feedback is more
useful
● Is implicit feedback really always
more useful?
● If so, why?
63. Implicit vs. Explicit
● Implicit data is (usually):
○ Denser, and available for all users
○ More representative of user behavior (vs. user reflection)
○ More closely related to the final objective function
○ Better correlated with A/B test results
● E.g. rating vs. watching
64. Implicit vs. Explicit
● However:
○ Direct implicit feedback does not always correlate well with
long-term retention
○ E.g. clickbait
● Solution:
○ Combine different forms of implicit + explicit feedback to better
represent the long-term goal
68. Training a model
● Model will learn according to:
○ Training data (e.g. implicit and explicit)
○ Target function (e.g. probability of user reading an answer)
○ Metric (e.g. precision vs. recall)
● Example 1 (made up):
○ Optimize probability of a user going to the cinema to watch a movie and rate it
“highly” by using purchase history and previous ratings. Use NDCG of the ranking as
final metric using only movies rated 4 or higher as positives.
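The metric in Example 1 can be sketched concretely: NDCG over a ranked list where only items rated 4 or higher count as positives (binary relevance). The ratings below are made up:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: position i is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_ratings, threshold=4):
    """NDCG with binary relevance: rating >= threshold counts as positive."""
    rels = [1 if r >= threshold else 0 for r in ranked_ratings]
    ideal = sorted(rels, reverse=True)
    return dcg(rels) / dcg(ideal) if any(rels) else 0.0

# A ranking with all positives first is perfect; one that buries a
# 5-star title under a 3-star one is penalized.
print(ndcg([5, 4, 3, 2]))  # 1.0
print(round(ndcg([3, 5, 2, 4]), 3))
```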
69. Example 2 - Quora’s feed
● Training data = implicit + explicit
● Target function: value of showing a story to a user ≈ weighted sum of
actions: v = Σ_a v_a · 1{y_a = 1}
○ Predict probabilities for each action, then compute the expected
value: v_pred = E[v | x] = Σ_a v_a · p(a | x)
● Metric: any ranking metric
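The expected-value scoring above might be sketched as follows; the action weights and per-story probabilities are made up for illustration:

```python
# Value of showing a story = sum over actions of v_a * p(a | x), where
# v_a is the (hypothetical) business value of action a and p(a | x) is
# the model's predicted probability of the user taking it.

action_values = {"upvote": 1.0, "share": 2.0, "answer": 5.0, "hide": -3.0}

def expected_value(action_probs):
    return sum(action_values[a] * p for a, p in action_probs.items())

# Predicted action probabilities for two candidate stories:
story_a = {"upvote": 0.30, "share": 0.05, "answer": 0.01, "hide": 0.01}
story_b = {"upvote": 0.10, "share": 0.01, "answer": 0.00, "hide": 0.20}

stories = {"story_a": story_a, "story_b": story_b}
ranked = sorted(stories, key=lambda s: expected_value(stories[s]), reverse=True)
print(ranked)  # ['story_a', 'story_b']
```

Note that the negative weight on "hide" lets the same framework penalize clickbait-style engagement, tying back to the implicit-vs-explicit discussion.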
70. 5. THERE IS SOMETHING EVEN MORE IMPORTANT THAN
MODELS AND DATA: EXPERIMENTAL PROCESS
72. Executing A/B tests
● Measure differences in metrics across statistically identical populations that
each experience a different algorithm.
● Product decisions are always data-driven
● Overall Evaluation Criteria (OEC) = member retention
○ Use long-term metrics whenever possible
○ Short-term metrics can be informative and allow faster decisions
■ But, not always aligned with OEC
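As an illustration of the statistics behind such a test, a two-proportion z-test on a retention-style metric across two statistically identical populations; the counts are synthetic, and a real OEC would be long-term member retention:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the difference between two proportions (pooled SE)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control retains 5000/10000 members, treatment retains 5200/10000:
z = two_proportion_z(5000, 10000, 5200, 10000)
print(z > 1.96)  # True: significant at the 5% level (two-sided)
```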
73. Offline testing
● Measure model performance using (IR) metrics
● Offline performance serves as an indicator for deciding on follow-up
A/B tests
● A critical (and mostly unsolved) issue is how well offline metrics
correlate with A/B test results
75. 01. Choose the right metric
02. Be thoughtful about your data
03. Understand dependencies between data, models & systems
04. Optimize only what matters, beware of biases
05. Be thoughtful about: data infrastructure/tools, how to
organize your teams, and above all, your product
requirements
77. 4 hour lecture on recommendations
Carnegie Mellon (2014)
1 hour lecture on practical Deep Learning
UC Berkeley (2020)
10 minutes on AI for COVID
Stanford (2020)
1 hour podcast on AI for Healthcare
Gradient Dissent (2021)