Journal club and talk given to Health Data Analytics MSc, February 2023. Reflections on how to do good machine learning over biomedical data: the pitfalls and good practices.
ML, biomedical data & trust
1. gsk.com
AI & Big Data Expo, London
Machine learning, biomedical data & trust
Paul Agapow (Statistics & Data Science Innovation Hub)
2. Background & disclaimer
• Previously a health informatician, biomedical ML researcher, bioinformatician, “computer guy”, disease chaser, epi-informatician, phylogeneticist, evolutionary biologist, immunologist, biochemist …
• Now a director @GSK
• This presentation does not reflect thought, policy or projects in progress at GSK
• There are no conflicts of interest
3. 10 June 2021
“AI will not replace drug hunters, but drug hunters who don’t use AI will be replaced by those who do.”
- Andrew Hopkins, CEO, Exscientia
5. 07 February 2023
3 hurdles to using AI/ML in therapy development
Biological & physiological complexity
Insufficient & uneven data
A gap between AI/ML practice & medical needs
6. To make a new drug, you must first solve for everything
7. 12 July 2021
The complexity of biology:
About 50 trillion cells of 200 types
Each cell has 23 pairs of chromosomes
In total 6.4 billion basepairs (positions)
Organised into about 18,000 genes
(Or maybe more like 40,000 genes)
Genetic material elsewhere in the cell
Epigenetic modification
1 million different types of molecules
Lifestyle & history
Exposure & environment
Immune system repertoire & priming
…
Of which we know only a fraction
8. The data types and sources we need are myriad & varied
Hughes et al. (2010) “Principles of early drug discovery”
9. • There are many different modalities of intervention
• With different (data) considerations & different levels of ML experience
There are many different means to the same end
McKinsey, EvaluatePharma 2022
10. It’s often not the right data
• Difficult / expensive to generate
• Unstructured
• Unlabeled
• The wrong type
• Sparse, unevenly sampled
• WEIRD (drawn from Western, educated, industrialised, rich, democratic populations)
• In different formats and silos
11.
Melanie Mitchell via Dagmar Monett
A disconnect between AI/ML practice and medical needs
Academic focus on problems with low medical value
12. • There are many models that work perfectly … in the lab
• Why?
- Unrealistic or poor training data
- Emphasis on hitting metrics
A disconnect between AI/ML practice and medical needs
A tendency to treat biomedicine as simply a data / ML problem
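One way to see how an emphasis on hitting metrics can mislead: on an imbalanced cohort, a model that never predicts the rare outcome can still report high accuracy. The sketch below is illustrative only; the cohort size and the labels are hypothetical and not drawn from any study mentioned in this talk.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive cases that the model actually flags."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical cohort: 5 severe cases (1) among 100 patients.
y_true = [1] * 5 + [0] * 95
# A trivial "model" that predicts "not severe" for everyone.
y_pred = [0] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"recall   = {recall(y_true, y_pred):.2f}")
```

Here the trivial model scores 95% accuracy while catching none of the five severe cases, which is why the choice of metric needs to follow the clinical question.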
13. The classic analytical tension
What we tend to solve:
• Easy things
• Available, ideal data
• Ground truth
• Simplify
• “Interesting”
• “Table-land”
What we need to solve:
• Useful things
• Incomplete messy data
• Unclear biological reality
• Uncertain findings
• Needful
• “Network-land”
14.
Laure Wynants via Maarten van Smeden
A disconnect between AI/ML practice and medical needs
Many “good” models are not fit for production
15.
• The pandemic prompted a flood of publications & preprints
• Most plagued by the usual biomedical AI problems
• … and also produced by those outside the field
• As a general principle, any paper applying ML to COVID is terrible
• Bad models in a crisis situation are not neutral: they distract, expend effort and carry an opportunity cost
COVID was a lightning rod for bad biomedical ML
16.
• What does it purport to do? Find risk factors associated with deterioration of COVID patients
• Why? Better / faster assessment of incoming patients
• Who? Patients admitted to two hospitals with a positive PCR test for COVID and a CT scan showing lesions
• Data? Demographics, bloods, labs, breathing / oxygen scores, manually scored CT scans
“Interpretable Prediction of Severity & Crucial Factors of COVID Patients”
Zheng et al. BioMed Research International (2021), DOI: 10.1155/2021/8840835
17.
• Conflates diagnosis & prognosis
• The cohort:
- Suggests the model can replace PCR, but the cohort is selected by PCR result
- The act of taking a CT scan in some ways selects the cohort
- Unclear when some readings were taken, which matters when we are looking at deterioration
- Is the training set representative of the patients a model might be used on in the clinic?
- Few critical cases, so the model is actually testing for severe cases
- What’s the split between hospitals?
- Patients already differ at baseline: pre-existing conditions
- Association with age & general health
- Old patients running a temperature with lesioned lungs do poorly
• Clinical use:
- Will all this data be available in a timely fashion for a model in the clinic?
- If the severity is based on bloods & oxygenation readings, why not just use them?
- Information complexity?
• Validation:
- Would it work for another time period at the same hospitals? At other hospitals?
• Analytics:
- “The impenetrable wall of math”
- XGBoost is always a good place to start
- Ensemble methods usually are
- Feature interaction?
- Some features overlap (neutrophils, n. ratio, NLR)
- What features correlate?
- No attempt to simplify model
- Any model is interpretable with SHAP
• Still useful for intrinsic / research purposes
Thoughts and questions
Not necessarily faults, not all easily answerable
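The questions above about overlapping and correlated features can be checked before any modelling. The following is a minimal sketch in plain Python, not the method used by Zheng et al.: the feature values are invented for illustration, the `correlated_pairs` helper is hypothetical, and the 0.9 threshold is an arbitrary assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical toy values, invented for illustration only.
toy = {
    "neutrophils": [3.0, 5.0, 7.0, 9.0, 11.0],
    "lymphocytes": [2.0, 1.8, 1.5, 1.0, 0.8],
    "age": [34, 71, 52, 80, 45],
}
# NLR is derived from the two counts above, so overlap is expected.
toy["nlr"] = [n / l for n, l in zip(toy["neutrophils"], toy["lymphocytes"])]

for a, b, r in correlated_pairs(toy):
    print(f"{a} ~ {b}: r = {r:.2f}")
```

Pairs flagged this way are candidates for dropping or combining, which is one route to the simpler model the notes above ask for.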
18.
• Models will always tell you the truth
- But it’s the truth conditioned on the data they’ve seen
- It might not be the truth you think
• Biomedical data is complex: it always comes with a context
• Patients are complex: they always come with a medical history
• How were these patients selected?
• What is this model actually saying and why?
• Does this model replicate in other populations?
• But despite all this, we have to make and actionably interpret models
Some principles for better biomedical ML
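One simple way to probe the replication question is to hold out each site in turn, train on the rest, and evaluate on the held-out site. The sketch below shows only the splitting step in plain Python; the `leave_one_site_out` helper, the record fields and the hospital labels are all hypothetical, and any model could be fitted inside the loop.

```python
from collections import defaultdict


def leave_one_site_out(records, site_key="hospital"):
    """Yield (held_out_site, train, test) with each site held out in turn."""
    by_site = defaultdict(list)
    for rec in records:
        by_site[rec[site_key]].append(rec)
    for site in sorted(by_site):
        test = by_site[site]
        train = [r for s, rs in by_site.items() if s != site for r in rs]
        yield site, train, test


# Hypothetical records; field names and values are invented for illustration.
patients = [
    {"id": 1, "hospital": "A", "severe": 0},
    {"id": 2, "hospital": "A", "severe": 1},
    {"id": 3, "hospital": "B", "severe": 0},
    {"id": 4, "hospital": "B", "severe": 0},
    {"id": 5, "hospital": "B", "severe": 1},
]

for site, train, test in leave_one_site_out(patients):
    # A model would be fitted on `train` and scored on `test` here.
    print(f"held out {site}: train n={len(train)}, test n={len(test)}")
```

A large drop in performance on the held-out site, compared with a random split, suggests the model has learned something specific to where the data came from rather than something about the patients.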