AI/Machine Learning has become an integral part of many household tech products, from Netflix to our phones. In this talk I will draw from my experience driving AI teams at some of those companies to showcase how AI can positively impact products as different as Netflix and Curai, an online telehealth service.
2. About me...
● PhD on Audio and Music Signal Processing and Modeling
● Led research on Augmented/Immersive Reality at UCSB
● Researcher in Recommender Systems for several years
● Started and led ML Algorithms at Netflix
● Head of Engineering at Quora
● Currently co-founder/CTO at Curai
3. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
5. The Age of Search has come to an end
● ... long live the Age of Recommendation!
● Chris Anderson in “The Long Tail”
o “We are leaving the age of information and entering the age of
recommendation”
● CNN Money, “The race to create a 'smart' Google”:
o “The Web, they say, is leaving the era of search and entering one of
discovery. What's the difference? Search is what you do when you're
looking for something. Discovery is when something wonderful that
you didn't know existed, or didn't know how to ask for, finds you.”
6. The “Recommender problem”
● “Traditional” definition: Estimate a utility function that
automatically predicts how much a user will like an
item.
● Based on:
o Past behavior
o Relations to other users
o Item similarity
o Context
o …
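As a minimal sketch of such a utility function, one can score a (user, item) pair as a weighted sum of the signals listed above. The signal names and weights below are purely hypothetical illustrations, not any real system:

```python
# Hypothetical signals for one (user, item) pair, each normalized to [0, 1].
signals = {
    "past_behavior":   0.8,  # how much past behavior predicts liking this item
    "similar_users":   0.6,  # how much similar users liked it
    "item_similarity": 0.7,  # similarity to items the user liked
    "context":         0.3,  # contextual fit (time of day, device, ...)
}

# Hypothetical importance weights (would be learned in practice).
weights = {
    "past_behavior":   0.4,
    "similar_users":   0.3,
    "item_similarity": 0.2,
    "context":         0.1,
}

def predict_utility(signals, weights):
    """Utility as a weighted sum of the available signals."""
    return sum(weights[k] * signals[k] for k in signals)

print(round(predict_utility(signals, weights), 2))  # 0.67
```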
7. Approaches to Recommendation
● Collaborative Filtering: Recommendations based only on user
behavior (target user and similar users’ behavior)
○ Item-based: Find similar items to those that I have liked
○ User-based: Find similar users to me, and recommend what they liked
● Others
○ Content-based: Recommend items similar to the ones I liked based on their
features
○ Demographic: Recommend items liked by users with similar features to me
○ Social: Recommend items liked by people socially connected to me
● Personalized learning-to-rank: Treat recommendations not as
binary like/not like, but rather as a ranking/sorting problem
● Hybrid: Combine any of the above
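The item-based variant above can be illustrated with a toy example: compute cosine similarity between item rating vectors and recommend items similar to those the user liked. The rating matrix is made up:

```python
import math

# Toy user-item rating matrix (rows: users, cols: items); 0 = unrated.
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def item_vector(j):
    """Column j of the rating matrix: how every user rated item j."""
    return [row[j] for row in ratings]

# Items 0 and 1 are rated alike by the same users, so they come out
# similar; items 0 and 2 are rated by disjoint audiences, so they don't.
sim_01 = cosine(item_vector(0), item_vector(1))
sim_02 = cosine(item_vector(0), item_vector(2))
print(sim_01 > sim_02)  # True
```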
8. What works?
● Depends on the domain and the particular problem
● In the general case, however, CF has repeatedly been shown to be the
best standalone approach
o Other approaches can be hybridized with it to improve results in specific
cases (e.g. the cold-start problem)
● What matters:
o Data preprocessing: outlier removal, denoising, removal of global effects
(e.g. an individual user's average)
o “Smart” dimensionality reduction using matrix factorization (MF)
o Combining methods through ensembles
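The MF point can be sketched as a tiny matrix-factorization model trained with SGD on squared error. Ratings, dimensions, and hyperparameters below are synthetic choices for illustration only:

```python
import random

# Tiny matrix factorization: learn k latent factors per user and item so
# that their dot product approximates observed ratings.
random.seed(0)
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.05, 0.02  # learning rate and L2 regularization (hypothetical)
for _ in range(500):
    for u, i, r in ratings:
        err = r - sum(P[u][f] * Q[i][f] for f in range(k))
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

pred = sum(P[0][f] * Q[0][f] for f in range(k))
print(round(pred, 1))  # close to the observed rating of 5.0
```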
9. What we were interested in:
▪ High quality recommendations
Proxy question:
▪ Accuracy in predicted rating
▪ Improve by 10% = $1 million!
▪ Top 2 algorithms
▪ SVD - Prize RMSE: 0.8914
▪ RBM - Prize RMSE: 0.8990
▪ Linear blend Prize RMSE: 0.88
▪ Limitations
▪ Designed for 100M ratings, not billions of ratings
▪ Not adaptable as users add ratings
▪ Performance issues
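A linear blend like the one above (combining two predictors such as SVD and RBM) can be sketched as follows; the numbers are synthetic, and the single blend weight has a closed-form least-squares solution:

```python
# Blend two rating predictors with a weight chosen to minimize RMSE on
# held-out data. All ratings and predictions below are made up.

def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

truth = [4.0, 3.0, 5.0, 2.0, 4.0]
svd   = [4.2, 2.7, 4.6, 2.4, 3.9]  # predictor A (e.g. an SVD model)
rbm   = [3.6, 3.2, 5.1, 1.8, 4.3]  # predictor B (e.g. an RBM model)

# Minimize sum((t - b) - w*(a - b))^2 over w: standard least squares.
num = sum((t - b) * (a - b) for t, a, b in zip(truth, svd, rbm))
den = sum((a - b) ** 2 for a, b in zip(svd, rbm))
w = num / den
blend = [w * a + (1 - w) * b for a, b in zip(svd, rbm)]

# The optimal blend can never be worse than either input predictor,
# since w = 0 and w = 1 recover them exactly.
print(rmse(blend, truth) <= min(rmse(svd, truth), rmse(rbm, truth)))  # True
```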
10. What about the final prize ensembles?
● Offline studies showed they were too computationally
intensive to scale
● Expected improvement not worth engineering effort
● Plus… focus had already shifted to other issues that had more
impact than rating prediction
11. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
14. Evolution of the Recommender Problem
Rating → Ranking → Context-aware Recommendations → Page Optimization
15. Ranking
● Most recommendations are presented in a sorted list
● Recommendation can be understood as a ranking problem
● Popularity is the obvious baseline, but it is not personalized
● Can we combine it with personalized rating predictions?
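One simple answer to the question above is to rank by a convex mix of popularity and the personalized predicted rating. The titles, numbers, and mixing weight below are hypothetical:

```python
# Rank titles by alpha * popularity + (1 - alpha) * predicted rating,
# both normalized to [0, 1]. All values are made up for illustration.

titles = {
    "blockbuster":  {"popularity": 0.95, "predicted_rating": 0.55},
    "niche_gem":    {"popularity": 0.20, "predicted_rating": 0.98},
    "average_fare": {"popularity": 0.50, "predicted_rating": 0.60},
}

def score(t, alpha=0.5):
    """alpha = 1.0 is pure popularity; alpha = 0.0 is pure personalization."""
    return alpha * t["popularity"] + (1 - alpha) * t["predicted_rating"]

ranking = sorted(titles, key=lambda name: score(titles[name]), reverse=True)
print(ranking)  # ['blockbuster', 'niche_gem', 'average_fare']
```

Tuning alpha trades off the safe popularity baseline against personalization; at alpha = 0.5 the niche title already outranks the merely average one.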
16. Ranking by ratings
[Figure: a row of niche titles whose average ratings are all between 4.5 and 4.7]
● Niche titles
● High average ratings… by those who would watch it
20. Ranking - Quora Feed
Goal: Present the most interesting stories for a user at a given time
Interesting = topical relevance + social relevance + timeliness
Stories = questions + answers
ML: Personalized learning-to-rank approach
Relevance-ordered vs. time-ordered = big gains in engagement
26. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
28. Healthcare access, quality, and scalability
● >50% of the world has no access to essential health services
○ ~30% of US adults are under-insured
● ~15 min. to capture information, diagnose, and recommend treatment
● 30% of the medical errors causing ~400k deaths a year are due to
misdiagnosis
● Shortage of 120,000 physicians by 2030
30. We have an opportunity, and an obligation, to reimagine healthcare
31. Towards an AI-powered learning health system
● Mobile-first care: always on, accessible, affordable
● AI + human providers in the loop for quality care
● Always-learning system
● AI that operates in-the-wild (EHR)
[Diagram: DATA → MODEL → FEEDBACK loop powering AI-augmented medical conversations]
33. Towards an FDA-approved “AI Doctor”
● Clinical vignettes + automated offline evaluations
● KB analytics layer
● Security & privacy
● Dataset datasheets
● Additional analytics
● A/B testing
● Model lineage
35. Research areas at Curai
● Medical Reasoning and Diagnosis
○ Learning from the Experts: combining expert systems and
deep learning
● NLP
○ Medical dialogue summarization
○ Transfer Learning for Similar Questions
○ Medical entity recognition: An active learning approach
● Multimodal healthcare AI
○ Few-shot dermatology image classification
36. General principles
1. Extensible
a. Data feedback loops
b. Incrementally/iteratively learn from
“physician-in-the-loop”
2. Domain knowledge + Data
3. In the wild
a. Uncertainty in prediction
b. Fall-back to “physician-in-the-loop”
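The “in the wild” principle above can be sketched as uncertainty-aware triage: when the model's predictive distribution over diagnoses is too uncertain, fall back to the physician-in-the-loop. The entropy threshold and the probabilities are hypothetical illustrations:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def triage(ddx_probs, threshold=1.0):
    """Return the top diagnosis, or escalate when the model is unsure."""
    if entropy(ddx_probs.values()) > threshold:
        return "escalate_to_physician"
    return max(ddx_probs, key=ddx_probs.get)

# A confident, peaked DDx vs. an uncertain, nearly-uniform one:
confident = {"influenza": 0.90, "common_cold": 0.07, "asthma": 0.03}
uncertain = {"influenza": 0.35, "common_cold": 0.33, "asthma": 0.32}

print(triage(confident))  # influenza
print(triage(uncertain))  # escalate_to_physician
```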
40. ML + Expert systems for Dx models
Inputs: female, middle-aged, fever, cough
DDx with expert system (scores): influenza 16.9, bacterial pneumonia 16.9,
acute sinusitis 10.9, asthma 10.9, common cold 10.9
DDx with ML model (probabilities): influenza 0.753, bacterial pneumonia 0.205,
asthma 0.017, acute sinusitis 0.008, pulmonary tuberculosis 0.007
[Diagram: Expert system → Clinical case simulator → Clinical cases with DDx
(e.g. female, middle-aged, chronic cough, nasal congestion → common cold, UTI,
acute bronchitis) + other data (e.g. EHR) → ML model]
41. COVID-aware modeling
[Diagram: Expert system (augmented with COVID-19) → Clinical case simulator →
Clinical cases with DDx + COVID-19 assessment data → ML model]
Example case: female, middle-aged, cough, headache, nose discharge, cigarette
smoking, hospital personnel → DDx now includes COVID-19 alongside common cold,
UTI, and acute bronchitis
43. Question similarity for COVID FAQs
● Transfer learning
● Double-finetuned BERT model
● Handles data sparsity
● Medical domain knowledge injected through an intermediate binary QA task
(“Does A answer Q?”)
[Diagram: Pretrained BERT → finetuned on “Does A answer Q?” (Q, A pairs) →
finetuned on “Is Q1 similar to Q2?” (Q1, Q2 pairs)]
Proceedings of the 2020 ACM SIGKDD
45. Medical summarization
Best paper at ACL Workshop on medical conversations 2021
Machine Learning for Healthcare, 2021
EMNLP Findings 2020
46. Our Approach: GPT-3-ENS
Chat snippet:
DR: Thanks for answering my questions. Apart from Benadryl and Hydrocortisone,
have you used anything to help?
PT: No that’s everything
Priming: 21 unique labeled examples per priming context, over 10 trials
Inference: each GPT-3 instance in the ensemble produces a candidate summary, e.g.:
● “Hasn't used any thing to help”
● “Hasn’t used any thing to help other than hydrocortisone”
● “Used nothing else to help other than Benadryl and hydrocortisone.”
Confidentiality note: In accordance with our privacy policy, the illustrative examples included in this document do NOT correspond to real patients. They are either synthetic or fully anonymized.
47. GPT-3 vs GPT-3-ENS
● 12.5% of summaries produced by
GPT-3-ENS are better than those of GPT-3
48. GPT-3-ENS as data synthesizer
Chat snippet:
DR: Thanks for ...
PT: No that’s everything
Priming: 21 labeled examples per priming context, over 10 trials
Inference: the GPT-3 ensemble produces candidate summaries (e.g. “Hasn't used
any thing to help”, “Hasn’t used any thing to help other than hydrocortisone”,
“Used nothing else to help other than Benadryl and hydrocortisone.”)
GPT-3-ENS labeled dataset + doctor labeled/corrected dataset → in-house
summarization model
● Privacy concerns with invoking GPT-3 at inference time
● Doctor in the loop
49. Results
● We can train a summarization model on GPT-3-ENS labeled data (which
required only 210 doctor-labeled examples) that is comparable in
performance to a model trained on 6400 doctor-labeled examples
● We find that a model that is trained on a mix of human and
GPT-3-ENS labeled data does better than a model trained on either
independently
50. Conclusions
● Healthcare needs to scale quickly, and this has become obvious in a
global pandemic like the one we are facing
● The only way to scale healthcare while improving quality and
accessibility is through technology and SOTA AI
● But AI cannot simply be “dropped” into the middle of old workflows
and processes
○ It needs to be integrated into end-to-end medical care, benefiting both patients and
providers
https://curai.com/work
51. Outline
1. The recommender problem & the Netflix Prize
2. Recommendations beyond rating prediction
3. AI + Healthcare: A bit about Curai
4. Lessons Learned
55. More data or better models?
Sometimes,
it’s not about
more data
56. More data or better models?
Norvig: “Google does not have better algorithms, only more data”
Many features / low-bias models
57. More data or better models?
Sometimes
you might not
need all your
“Big Data”
[Plot: testing accuracy vs. number of training examples (in millions, 0-20)]
58. What about Deep Learning?

| Year | Breakthrough in AI | Datasets (first available) | Algorithms (first proposal) |
|------|--------------------|----------------------------|-----------------------------|
| 1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984) |
| 1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983) |
| 2005 | Google’s Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988) |
| 2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2005) | Mixture-of-Experts algorithm (1991) |
| 2014 | Google’s GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional neural network algorithm (1989) |
| 2015 | Google’s DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning algorithm (1992) |
| | Average no. of years to breakthrough | 3 years | 18 years |

The average elapsed time between key algorithm proposals and their corresponding advances was about 18 years, whereas the average elapsed time between key dataset availability and the corresponding advances was less than 3 years, i.e. about 6 times faster.
59. What about Deep Learning?
Pretrained models and recipes available, trained using OpenNMT:
→ English → German
→ German → English
→ English summarization
→ Multi-way: FR,ES,PT,IT,RO <> FR,ES,PT,IT,RO
More models coming soon:
→ Ubuntu Dialog Dataset
→ Syntactic parsing
→ Image-to-text
62. Implicit vs. Explicit
● Many have acknowledged
that implicit feedback is more
useful
● Is implicit feedback really always
more useful?
● If so, why?
63. Implicit vs. Explicit
● Implicit data is (usually):
○ Denser, and available for all users
○ More representative of user behavior (vs. user reflection)
○ More closely related to the final objective function
○ Better correlated with A/B test results
● E.g. rating vs. watching
64. Implicit vs. Explicit
● However:
○ Direct implicit feedback does not always correlate well with
long-term retention
○ E.g. clickbait
● Solution:
○ Combine different forms of implicit + explicit feedback to better
represent the long-term goal
68. Training a model
● Model will learn according to:
○ Training data (e.g. implicit and explicit)
○ Target function (e.g. probability of user reading an answer)
○ Metric (e.g. precision vs. recall)
● Example 1 (made up):
○ Optimize probability of a user going to the cinema to watch a movie and rate it
“highly” by using purchase history and previous ratings. Use NDCG of the ranking as
final metric using only movies rated 4 or higher as positives.
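The metric in Example 1 can be sketched concretely: NDCG over a ranked list where only items rated 4 or higher count as positives (binary relevance). The ratings below are made up:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: position i is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_ratings, threshold=4):
    """NDCG with binary relevance: rating >= threshold counts as positive."""
    rels = [1 if r >= threshold else 0 for r in ranked_ratings]
    ideal = sorted(rels, reverse=True)
    return dcg(rels) / dcg(ideal) if any(rels) else 0.0

# A ranking with all positives first is perfect; one that buries a
# 5-star title under a 3-star one is penalized.
print(ndcg([5, 4, 3, 2]))  # 1.0
print(round(ndcg([3, 5, 2, 4]), 3))
```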
69. Example 2 - Quora’s feed
● Training data = implicit + explicit
● Target function: value of showing a story to a user ≈ weighted sum of
actions: v = Σ_a v_a · 1{y_a = 1}
○ Predict probabilities for each action, then compute the expected
value: v_pred = E[v | x] = Σ_a v_a · p(a | x)
● Metric: any ranking metric
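The expected-value scoring above might be sketched as follows; the action weights and per-story probabilities are made up for illustration:

```python
# Value of showing a story = sum over actions of v_a * p(a | x), where
# v_a is the (hypothetical) business value of action a and p(a | x) is
# the model's predicted probability of the user taking it.

action_values = {"upvote": 1.0, "share": 2.0, "answer": 5.0, "hide": -3.0}

def expected_value(action_probs):
    return sum(action_values[a] * p for a, p in action_probs.items())

# Predicted action probabilities for two candidate stories:
story_a = {"upvote": 0.30, "share": 0.05, "answer": 0.01, "hide": 0.01}
story_b = {"upvote": 0.10, "share": 0.01, "answer": 0.00, "hide": 0.20}

stories = {"story_a": story_a, "story_b": story_b}
ranked = sorted(stories, key=lambda s: expected_value(stories[s]), reverse=True)
print(ranked)  # ['story_a', 'story_b']
```

Note that the negative weight on "hide" lets the same framework penalize clickbait-style engagement, tying back to the implicit-vs-explicit discussion.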
70. 5. THERE IS SOMETHING EVEN MORE IMPORTANT THAN
MODELS AND DATA: EXPERIMENTAL PROCESS
72. Executing A/B tests
● Measure differences in metrics across statistically identical populations that
each experience a different algorithm.
● Product decisions are always data-driven
● Overall Evaluation Criteria (OEC) = member retention
○ Use long-term metrics whenever possible
○ Short-term metrics can be informative and allow faster decisions
■ But, not always aligned with OEC
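As an illustration of the statistics behind such a test, a two-proportion z-test on a retention-style metric across two statistically identical populations; the counts are synthetic, and a real OEC would be long-term member retention:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the difference between two proportions (pooled SE)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control retains 5000/10000 members, treatment retains 5200/10000:
z = two_proportion_z(5000, 10000, 5200, 10000)
print(z > 1.96)  # True: significant at the 5% level (two-sided)
```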
73. Offline testing
● Measure model performance using (IR) metrics
● Offline performance serves as an indicator for deciding on follow-up
A/B tests
● A critical (and mostly unsolved) issue is how well offline metrics
correlate with A/B test results
75. 01. Choose the right metric
02. Be thoughtful about your data
03. Understand dependencies between data, models & systems
04. Optimize only what matters, beware of biases
05. Be thoughtful about: data infrastructure/tools, how to
organize your teams, and above all, your product
requirements
77. 4 hour lecture on recommendations
Carnegie Mellon (2014)
1 hour lecture on practical Deep Learning
UC Berkeley (2020)
10 minutes on AI for COVID
Stanford (2020)
1 hour podcast on AI for Healthcare
Gradient Dissent (2021)