Healthcare is different from any other application domain, or is it? While it is true that specific aspects, such as high-stakes decisions and a complex regulatory framework, make healthcare somewhat different, it is also the case that many of the lessons learned from building data-driven products in other domains translate remarkably well into healthcare. This is particularly so because healthcare is also a user-facing domain, where users can be patients as well as healthcare professionals. Given that data has been shown to improve user experience while ensuring quality and scalability, few would argue that healthcare cannot benefit from being much more data-driven than it has traditionally been.
In this talk, I describe how decades of experience building impactful data and AI solutions into user-facing products can be leveraged to revolutionize telehealth. At Curai, we combine approaches such as state-of-the-art large language models with expert systems in areas such as NLP, vision, and automated diagnosis to augment and scale doctors, and to improve user experience and healthcare outcomes. We will see some of those applications while analyzing the role of data and ML algorithms in making them possible.
1. Data/AI driven product
development
from video streaming to telehealth
Xavier Amatriain
Co-founder/CTO Curai
(with Anitha Kannan, Head of ML Research, Curai)
August 18, 2022
2. About me...
● Researcher in Recommender Systems
● Started and led ML Algorithms at Netflix
● Head of Engineering at Quora
● Currently co-founder/CTO at Curai
3. Outline
1. Data/AI driven product development: experiences in recommender
systems
2. Data/AI driven product development in healthcare: the Curai
experience
3. Principles for data/AI driven product development
4. Principles for data/AI driven product development (preview)
1. Make data trustworthy and accessible
2. Follow a hypothesis-driven offline/online
experimentation approach with clearly defined metrics
3. Start from the simplest approach, ensure AI improves
over time, with data/metrics driving improvement
4. More data only matters if it’s better data, and if the
model is complex enough to learn from it
5. AI affects UX and UX affects AI
6. What we were interested in:
▪ Improving the product with data + AI
▪ Hypothesis: higher-quality recommendations will lead to higher member retention
▪ Proxy (offline) question: accuracy of the predicted rating
▪ Metric: RMSE (Netflix Prize: improve it by 10% = $1 million!)
▪ Top 2 algorithms
▪ SVD - Prize RMSE: 0.8914
▪ RBM - Prize RMSE: 0.8990
▪ Linear blend - Prize RMSE: 0.88
▪ Limitations
▪ Designed for 100M ratings, not XB ratings
▪ Not adaptable as users add ratings
▪ Performance issues
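The offline metric and the two-model blend above can be sketched in a few lines. This is a minimal illustration: the ratings, predictions, and the 0.6 mixing weight are made-up numbers, not the actual Prize values.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error, the Netflix Prize offline metric."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def linear_blend(preds_a, preds_b, w=0.6):
    """Blend two models' rating predictions with a single mixing weight."""
    return [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]

# Hypothetical predictions from an SVD and an RBM model on four true ratings
actual = [4.0, 3.0, 5.0, 2.0]
svd = [3.8, 3.4, 4.6, 2.5]
rbm = [4.3, 2.6, 4.9, 1.6]
blend = linear_blend(svd, rbm)
print(rmse(svd, actual), rmse(rbm, actual), rmse(blend, actual))
```

Even this toy blend shows why the combination won: the two models make different errors, so mixing them yields a lower RMSE than either alone.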
7. What about the final prize ensembles?
● Offline studies showed they were
too computationally intensive to
scale
● Expected improvement not worth
engineering effort
● Plus… we uncovered that the
proxy question (offline
experiment) did not correlate with
online product gains
https://amatriain.net/blog/
8. Evolution of the Recommender Problem
Rating → Ranking → Page Optimization → Context-aware Recommendations
12. Ranking - Quora Feed
Goal: present the most interesting stories for
a user at a given time
Interesting = topical relevance +
social relevance + timeliness
Stories = questions + answers
Model: personalized learning-to-rank
approach
Relevance-ordered vs. time-ordered =
big gains in engagement
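A toy version of such a scoring function makes the "interesting = topical + social + timeliness" decomposition concrete. The weights, the exponential time decay, and the two stories are all made-up illustrative values, not Quora's actual model.

```python
def story_score(topical, social, age_hours,
                w_topical=0.5, w_social=0.3, w_time=0.2, half_life=24.0):
    """Hypothetical interestingness score: weighted sum of topical relevance,
    social relevance, and a timeliness term that decays with story age."""
    timeliness = 0.5 ** (age_hours / half_life)  # exponential time decay
    return w_topical * topical + w_social * social + w_time * timeliness

# Two made-up stories: an older, highly topical one vs. a fresh, social one
stories = [
    {"id": "q1", "topical": 0.9, "social": 0.2, "age_hours": 48},
    {"id": "q2", "topical": 0.6, "social": 0.8, "age_hours": 2},
]
ranked = sorted(stories, reverse=True,
                key=lambda s: story_score(s["topical"], s["social"], s["age_hours"]))
print([s["id"] for s in ranked])  # relevance-ordered, not time-ordered
```

In production this hand-tuned sum is replaced by a personalized learning-to-rank model, but the inputs play the same roles.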
13. From ranking to page composition and beyond
From “Modeling User Attention and
Interaction on the Web” 2014 - D. Lagun
15. Healthcare access, quality, and scalability
● >50% of the world has no access to essential health services
○ ~30% of US adults are under-insured
● ~15 min. to capture information, diagnose, and recommend treatment
● 30% of the medical errors causing ~400k deaths a year are due to misdiagnosis
● Projected shortage of 120,000 physicians by 2030
16. Towards an AI powered learning health system
● Mobile-First Care, always
on, accessible, affordable
● AI + human providers in
the loop for quality care
● Always-Learning system
● AI to operate in-the-wild
(EHR)
[Diagram: DATA → MODEL → FEEDBACK loop powering AI-augmented medical conversations]
19. Research areas at Curai
● Medical Reasoning and
Diagnosis
● NLP/Conversational AI
● Multimodal AI
20. Healthcare is knowledge intensive
● Medical terminologies/ontologies
○ SNOMED, UMLS, ICD 10
● Expert systems for clinical decision making
○ 1000s of diseases and 3500+ findings
○ 30+ years of expert curation
● Electronic access to medical research
● Online reputed websites
Adding domain knowledge to modern AI
approaches is an active area of research
23. ML + Expert systems for Dx models
[Diagram: an expert system drives a clinical case simulator that samples diseases (e.g. common cold, UTI, acute bronchitis) and findings (e.g. female, middle-aged, chronic cough, nasal congestion); the resulting clinical cases with DDx labels, together with other data (e.g. EHR), train the ML model.]
Example inputs: female, middle-aged, fever, cough
DDx with expert system (scores): influenza 16.9, bacterial pneumonia 16.9, acute sinusitis 10.9, asthma 10.9, common cold 10.9
DDx with ML model (probabilities): influenza 0.753, bacterial pneumonia 0.205, asthma 0.017, acute sinusitis 0.008, pulmonary tuberculosis 0.007
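The pipeline on this slide, an expert system plus a clinical case simulator producing labeled cases for an ML diagnosis model, can be sketched end-to-end. Everything here is a hypothetical stand-in: the tiny knowledge base, the dropout noise, and the overlap-based ranker standing in for a trained ML model.

```python
import random

# Toy knowledge base (hypothetical): disease -> typical findings
KB = {
    "common cold": {"cough", "nasal congestion", "sore throat"},
    "influenza": {"fever", "cough", "body aches"},
    "UTI": {"dysuria", "urinary frequency"},
}

def simulate_case(disease, dropout=0.3, rng=random):
    """Clinical case simulator: sample a noisy subset of the disease's findings."""
    findings = {f for f in KB[disease] if rng.random() > dropout}
    return findings or {rng.choice(sorted(KB[disease]))}  # keep at least one

def diagnose(findings):
    """Stand-in for the ML model: rank diseases by fractional finding overlap."""
    scores = {d: len(findings & fs) / len(fs) for d, fs in KB.items()}
    return sorted(scores, key=scores.get, reverse=True)

rng = random.Random(0)
case = simulate_case("influenza", rng=rng)
print(diagnose(case))  # ranked differential diagnosis (DDx)
```

The real system trains a probabilistic model on simulator output so it can outperform the expert system on cases the knowledge base covers only partially.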
24. COVID-aware modeling
[Diagram: the same expert system + clinical case simulator pipeline, now fed COVID-19 assessment data so that the simulated clinical cases with DDx also include COVID-19.]
Example inputs: female, middle-aged, cough, headache, nose discharge, cigarette smoking, hospital personnel → DDx includes COVID-19
25. Evaluation
Clinical cases from the Semigran dataset (no clinical case corresponding to COVID).

                        top-1    top-3    top-5
Practitioners           72.1%    84.3%    -
Razzaki et al.          -        46.6%    64.7%
Expert system           66%      75%      86%
Ours - Baseline         67.6%    85.8%    92.9%
Ours - COVID as label   61.8%    84.4%    93.3%

Adding COVID does not adversely affect performance. The previous best result was based on inference on a graphical model.
Semigran et al. Evaluation of symptom checkers for self diagnosis and triage: audit study, BMJ 2015
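The top-k accuracy metric used in this table can be computed with a short helper. The DDx lists and ground-truth diagnoses below are made-up examples, not the Semigran cases.

```python
def top_k_accuracy(ranked_ddx, true_dx, ks=(1, 3, 5)):
    """Fraction of cases whose true diagnosis appears in the top k entries
    of the model's ranked differential diagnosis (DDx)."""
    results = {}
    for k in ks:
        hits = sum(1 for ddx, truth in zip(ranked_ddx, true_dx) if truth in ddx[:k])
        results[k] = hits / len(true_dx)
    return results

# Three hypothetical cases with ranked DDx lists and ground-truth diagnoses
ddx_lists = [
    ["influenza", "common cold", "pneumonia", "asthma", "sinusitis"],
    ["UTI", "kidney stones", "cystitis", "appendicitis", "gastritis"],
    ["asthma", "bronchitis", "common cold", "influenza", "pneumonia"],
]
truths = ["influenza", "cystitis", "pneumonia"]
print(top_k_accuracy(ddx_lists, truths))
```

Reporting top-3 and top-5 alongside top-1 matters clinically: a correct diagnosis appearing anywhere in a short differential is still useful to a physician.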
27.
P: Right now my stomach hurts.
P: It feels like I do need to do a clean out. If you know what I
mean
D: Sorry for the abdominal pain. When did you have last
bowel movement?
P: It was yesterday
D: What was the consistency of stool. Was it soft
well-formed or was it hard?
P: Right now I just want and its watery and very loosely
P: That was was causing with my stomach hurts
D: Any blood or mucus with stools? Was it foul smelling?
P: Nope for all three
D: Any fever
P: Nope
D: I asked as blood or mucus in stool can be due an
underlying infection
D: Any nausea/vomiting?
P: Nope
P: Why does this happen to me?
P: Is it something I have ate?
D: Diarrhea can be often due to indigestion or infection. Did
you eat any outside food or packaged food?
P: yes
Patient-provider dialogue
*The conversation has been de-identified for privacy protection
28. Combining SOTA LLMs with knowledge
● LLMs are great at:
○ Adapting to a broad range of
tasks and situations
○ Engaging with the audience
○ Giving empathetic responses
○ Showing personality and
sounding natural
Thoppilan et al. LaMDA: Language Models for Dialog Applications, 2022
Roller et al. Recipes for building an open-domain chatbot, 2020
Adiwardana et al. Towards a human-like open-domain chatbot, 2020
● LLMs are not great at:
○ Staying truthful, i.e., they often
hallucinate knowledge
○ Dealing with long-range
dependencies and solving tasks
with a large output space
○ Reasoning: they can “retrieve”
knowledge without deeper
understanding or reasoning
29. 29
Conversational history taking
1. Natural language
understanding
a. What did the patient say?
2. Dialog management
a. What to ask when?
b. How to decide when to stop
3. Natural language generation
a. How to ask?
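The three components above can be sketched as a minimal loop. Each function is a deliberately trivial stand-in (substring matching for NLU, first-unanswered-question policy for dialog management, a template for NLG); the real system uses learned models for all three.

```python
def understand(utterance, vocabulary):
    """NLU stand-in: extract known findings mentioned by the patient."""
    text = utterance.lower()
    return {f for f in vocabulary if f in text}

def choose_question(known_findings, candidate_findings):
    """Dialog-management stand-in: ask about the first finding not yet
    covered; a real system would pick by expected information gain."""
    remaining = [f for f in candidate_findings if f not in known_findings]
    return remaining[0] if remaining else None  # None = stop asking

def phrase(finding):
    """NLG stand-in: template the chosen finding into a question."""
    return f"Do you have {finding}?"

vocab = {"fever", "cough", "headache"}
findings = understand("I've had a bad cough and some fever", vocab)
print(phrase(choose_question(findings, ["fever", "cough", "headache"])))
```

The hard open problems live exactly where the stand-ins are weakest: robust NLU over colloquial patient language, and deciding what to ask and when to stop.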
31. Medical summarization using LLMs
● Insight 1: LLMs (e.g. GPT-3) can be
prompted to produce good
summaries in a few-shot setting
● Insight 2: LLMs can be ensembled
and used as data generators to
improve quality of summarization
results
● Insight 3: Medical domain knowledge
can be injected into these models so
that they produce medically correct
and complete summaries
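In the spirit of insights 2 and 3, here is a minimal sketch of ensembling by selection: sample several candidate summaries, then keep the one that best preserves the medical concepts in the dialog. The substring-based scorer and all strings are hypothetical stand-ins for a real medical concept extractor and real model samples.

```python
def concept_coverage(summary, concepts):
    """Stand-in scorer: how many medical concepts from the dialog the
    candidate summary retains (a real system would use a concept extractor)."""
    text = summary.lower()
    return sum(1 for c in concepts if c in text)

def ensemble_select(candidates, concepts):
    """GPT-3-ENS-style selection: among several sampled summaries, keep the
    one with the best medical concept coverage."""
    return max(candidates, key=lambda s: concept_coverage(s, concepts))

concepts = {"hydrocortisone", "benadryl"}
candidates = [
    "Hasn't used anything to help",
    "Hasn't used anything to help other than hydrocortisone",
    "Used nothing else to help other than Benadryl and hydrocortisone.",
]
print(ensemble_select(candidates, concepts))
```

Selecting for concept coverage is what makes the generated labels "medically aware": fluent but incomplete summaries get filtered out before they ever reach the training set.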
32. [Diagram: 21 labeled examples per priming context prime GPT-3; at inference, 10 trials are sampled per chat snippet (e.g. “DR: Thanks for ... / PT: No that’s everything”) and ensembled into a GPT-3-ENS labeled dataset; combined with a doctor labeled/corrected dataset, this trains the in-house summarization model.]
Example candidate summaries: “Hasn’t used any thing to help” / “Hasn’t used any thing to help other than hydrocortisone” / “Used nothing else to help other than Benadryl and hydrocortisone.”
Confidentiality note: In accordance with our privacy policy, the illustrative examples included in this document do NOT correspond to real patients. They are either synthetic or fully anonymized.
33. Qualitative Results

Snippet 1:
DR: Have you ever been tested for any underlying health conditions such as diabetes, hypothyroidism or polycystic ovarian syndrome?
PT: No
PT: I have been told I have prediabetes.
● Model trained on 6400 doctor-labeled: “Has not been tested for any underlying health conditions.”
● Model trained on 6400 GPT-3 Ensembled: “Hasn’t tested for any underlying health conditions such as diabetes, hypothyroidism or polycystic ovarian syndrome”
● Model trained on doctor-labeled + GPT-3 Ensembled: “Has not been tested for any underlying health conditions. Has been told has prediabetes.”

Snippet 2:
DR: Do you have pus appearing discharge from the site?
PT: Yes. If the bubbles pop it leaks out a watery substance
● Model trained on 6400 doctor-labeled: “Has pus appearing from the site.”
● Model trained on 6400 GPT-3 Ensembled: “Pus appearing from the site”
● Model trained on doctor-labeled + GPT-3 Ensembled: “Pus discharge from the site. If bubbles pop it leaks out a substance.”
*The conversation has been de-identified for privacy protection
Chintagunta et al. Medically aware GPT-3 as a data generator for medical dialog summarization, MLHC 2021
39. The “Big data paradox” is not a paradox
● Not all data is good data (aka more data only matters if it is “better
data”)
● Only more complex models can benefit from more data -
bias/variance tradeoff
● We need to combine better data with better/more complex models
● And… all of this does not hold for highly parametrized deep learning
models where the bias/variance tradeoff breaks for still unknown
reasons (maybe related to double descent)
40. Better data leads to better models

Year | Breakthrough in AI | Dataset (first available) | Algorithm (first proposed)
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983)
2005 | Google’s Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2005) | Mixture-of-Experts algorithm (1991)
2014 | Google’s GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional neural network algorithm (1989)
2015 | Google’s DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning algorithm (1992)

The average elapsed time between key algorithm proposals and corresponding advances was about 18 years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than 3 years, or about 6 times faster.
42. Model learning depends on objective + metric
● Quora feed example:
○ Training data = implicit + explicit
○ Target function = Value of showing
a story to a user ~ weighted sum of
actions
■ Compute probability of each
action given a story, weight them
by their value to compute expected
value
○ Metric = Any ranking metric
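The "weighted sum of actions" target function can be written out directly: predict the probability of each action for a story, weight by the action's value, and rank by the expected value. The action values and probabilities below are hypothetical, not Quora's actual weights.

```python
def expected_value(action_probs, action_values):
    """Value of showing a story: probability of each action, weighted by
    that action's assumed value to the product."""
    return sum(action_probs[a] * action_values[a] for a in action_values)

VALUES = {"click": 1.0, "upvote": 3.0, "share": 5.0}  # hypothetical weights

# Predicted per-action probabilities for two candidate stories
stories = {
    "q1": {"click": 0.30, "upvote": 0.02, "share": 0.01},
    "q2": {"click": 0.10, "upvote": 0.08, "share": 0.04},
}
ranked = sorted(stories, key=lambda s: expected_value(stories[s], VALUES), reverse=True)
print(ranked)
```

Note how the weights encode product judgment: a story with fewer clicks but more high-value actions (upvotes, shares) can still rank first.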
44. The importance of the experimentation framework
● Offline
○ Measure model performance, using metrics
○ Offline performance = indication to make decisions
on follow-up A/B tests
○ A critical (and mostly unsolved) issue is how offline
metrics correlate with A/B test results.
● Online
○ Measure differences in metrics across statistically
identical populations that each experience a
different algorithm.
○ Overall Evaluation Criteria (OEC)
■ Use long-term metrics whenever possible
■ Short-term metrics can be informative and
allow faster decisions. But, not always
aligned with OEC
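Measuring "differences in metrics across statistically identical populations" typically comes down to a significance test on the two arms. Below is a minimal two-proportion z-test sketch; the retention counts are made-up illustrative numbers.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference in conversion rates between two
    statistically identical populations exposed to algorithms A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 30-day retention per arm
z = two_proportion_z(conv_a=4120, n_a=50000, conv_b=4380, n_b=50000)
print(round(z, 2), "significant at 5%" if abs(z) > 1.96 else "not significant")
```

Long-term OEC metrics like retention need large samples and long windows, which is exactly why short-term proxy metrics are tempting but must be validated against the OEC.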
46. Principles for data/AI driven product development
● Make Data Trustworthy
● Make Data Accessible
● Follow a Hypothesis-driven approach
● Define Clear Metrics
● Measure offline/online
47. Principles for data/AI driven product development (II)
● Data/metrics drive AI
● AI should improve over time
● More data only matters if it’s better data
● Start with simplest model
● Increase model complexity and data size in parallel
● Connect AI to UI
48. Principles for data/AI driven product development (summary)
1. Make data trustworthy and accessible
2. Follow a hypothesis-driven offline/online
experimentation approach with clearly defined metrics
3. Start from the simplest approach, ensure AI improves
over time, with data/metrics driving improvement
4. More data only matters if it’s better data, and if the
model is complex enough to learn from it
5. AI affects UX and UX affects AI
50. 4 hour lecture on recommendations
Carnegie Mellon (2014)
1 hour lecture on practical Deep Learning
UC Berkeley (2020)
10 minutes on AI for COVID
Stanford (2020)
1 hour podcast on AI for Healthcare
Gradient Dissent (2021)