We describe a precompetitive collaboration that makes public life science data FAIR and annotated with detailed, high quality metadata, at a shared cost. A data model based on public ontologies was defined to address the participants' business questions. This slide deck was presented at the Cambridge Cheminformatics meeting on June 2, 2021.
DataFAIRy bioassays pilot -- lessons learned and future outlook
1. DataFAIRy bioassays pilot project - lessons learned and future outlook
Isabella Feierberg, AstraZeneca
Samantha Jeschonek, Collaborative Drug Discovery
Nick Lynch, Curlew Research
2021-06-02
2. Why DataFAIRy?
Substantial investments are being made in AI, ML and FAIR data across life science industry and academia
Available metadata in (public domain) data repositories is often insufficient for answering current and future business questions
Pharma companies already pay for curation of partially overlapping public domain data (e.g., ChEMBL, papers, chemistry patents)
There is a need for FAIR public domain data with high quality annotations using public ontologies and a common data model
5. Siloed data is not helpful
My organization’s data
Public data
Partner’s data
6. The proposed DataFAIRy operational model (2018)
[Diagram: unstructured public data → curation and QC by independent domain experts → FAIR data. Labels: DataFAIRy, Partners]
Cost-shared annotation of public domain bioassay descriptions with high quality, using an agreed data model, making data FAIR
FAIR = Findable, Accessible, Interoperable, Reusable
7. Small molecule bioassays make up a good pilot case
Chemogenomic model building
Assay development, e.g., assay conditions and tool compounds
Enriching public chemogenomics data with FAIR metadata will show impact across the cheminformatics domain
Project planning – what is available in the public domain?
8. Project team
Roche: Rama Balakrishnan, Martin Romacker
Novartis: Anosha Siripala, Gabriel Backiananthan
BMS: Dana Vanderwall
AstraZeneca: Tim Ikeda, Isabella Feierberg
Collaborative Drug Discovery: Samantha Jeschonek, Jason Harris, Whitney Smith
Pistoia Alliance: Vladimir Makarov, Thomas Liener
9. Pilot project (2020) – Summary
Feasibility study, guidance for a larger initiative, example creation
496 public domain assay descriptions were curated and converted into FAIR information objects using an agreed data model, which was guided by jointly defined business questions. The metadata was uploaded to PubChem.
Learning points were captured along with recommendations for future endeavors.
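As an illustration only, a "FAIR information object" for one assay description could be sketched as free text plus a set of fields, each pointing at a term from a public ontology. The field names, the required-field list, and the `EXAMPLE:` identifiers below are all hypothetical placeholders; the pilot's actual data model is not reproduced here.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass(frozen=True)
class OntologyTerm:
    """A label paired with an identifier from a public ontology."""
    curie: str   # placeholder identifiers only, e.g. "EXAMPLE:0001"
    label: str


@dataclass
class AssayAnnotation:
    """Hypothetical FAIR information object for one assay description."""
    source: str        # where the unstructured protocol text came from
    description: str   # the original free-text assay description
    annotations: dict = field(default_factory=dict)  # field name -> OntologyTerm

    def annotate(self, field_name: str, term: OntologyTerm) -> None:
        """Attach an ontology term to a named field of the data model."""
        self.annotations[field_name] = term

    def missing(self, required: list) -> list:
        """Return the required fields that have not been annotated yet."""
        return [name for name in required if name not in self.annotations]


# Hypothetical required fields; a real list would follow the agreed data model.
REQUIRED_FIELDS = ["target", "assay_format", "detection_method"]

record = AssayAnnotation(
    source="publication",
    description="Free-text assay protocol goes here.",
)
record.annotate("target", OntologyTerm("EXAMPLE:0001", "EGFR"))
record.annotate("assay_format", OntologyTerm("EXAMPLE:0002", "biochemical"))

# Fields a curator still has to fill in before the record is complete.
print(record.missing(REQUIRED_FIELDS))

# Machine-readable form that could be exchanged or deposited.
print(json.dumps(asdict(record), indent=2))
```

Keeping the original text next to the ontology terms gives a reviewer what they need to check each annotation during QC.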
10. Pilot Project - Business questions
26 initial questions, pruned down to 15, across 5 main categories:
Biology oriented literature mining for discovery project planning
Assay technology oriented
Chemistry/tool compound oriented
Specific assay conditions
Computational chemogenomic modelling (e.g., target activity, "PAINS")
12. Pilot Project - Assay selection
245 commercial panel assays: ThermoFisher’s kinase selectivity Z’-LYTE panel
-Downloaded vendor’s PDF document with assay protocol
42 PubChem NCATS assays – qHTS, large datasets
-Assay Description and Assay Protocol sections in plain text on PubChem page
210 publication assays: ChEMBL assays where the target is EGFR and the reference is Open Access
-Paper/supplementary material, references
100 of these 496 annotated assays were subjected to manual QC by project team members
15. Pilot Project – Learnings
Review of supplements and citations → High cost. Choose assays wisely.
No persistent links exist for commercial assay panel protocols
Errors propagate between papers
Commercial assay panels were the easiest to annotate (low-hanging fruit)
Fully automated is not fully accurate:
Benefit from good work practices: audit trail, versioning, iterative QC by experts
Need for a common community data standard for future assay publications.
Hard and expensive to annotate old assay protocols from literature: a need for published assay protocols to be well-annotated in public databanks and linked to the publication
16. Value statement
“Richly annotated FAIR bioassay data has been very valuable for an internal data integration project, where it has provided additional terminology aiding the assimilation of the chemogenomics datasets used by the machine-learning models. The extra annotations better harmonise our dataset with those from external partners, enabling the federated platform to provide superior multi-task predictions across range of panels and safety screens in a privacy preserving way”
Lewis Mervin, Machine Learning and Cheminformatics Expert, Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca
17. Next steps
Optimize process, data sources, tools, and QC within quality constraints. Define quality metrics.
Define and promote a community standard for assay reporting and publishing; align with vendors, publishers, and government agencies.
Attract new project members and sufficient funding to start the next phase.
Scale up (10-100x) in the next steps. Having more partners lowers the cost per partner per assay and the overhead cost.
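The cost-sharing argument can be made concrete with a small calculation. This is a minimal sketch under a simple assumed cost model (a fixed overhead plus a flat curation cost per assay, split evenly between partners); all figures are invented for illustration and are not taken from the pilot.

```python
def cost_per_partner_per_assay(n_partners, n_assays, cost_per_assay, overhead):
    """Evenly shared cost of one annotated assay for one partner.

    Assumed model (hypothetical): total cost = n_assays * cost_per_assay + overhead,
    divided equally among partners and then spread over all assays.
    """
    if n_partners < 1 or n_assays < 1:
        raise ValueError("need at least one partner and one assay")
    total_cost = n_assays * cost_per_assay + overhead
    return total_cost / (n_partners * n_assays)


# Illustrative numbers only: 500 assays, 100 per assay, 10,000 fixed overhead.
for partners in (4, 8, 16):
    share = cost_per_partner_per_assay(partners, 500, 100.0, 10_000.0)
    print(f"{partners:>2} partners: {share:.2f} per partner per assay")
```

Under this model, doubling the number of partners halves each partner's share, and scaling up the number of assays spreads the fixed overhead more thinly.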
18. Thanks to
AstraZeneca: Nigel Green, David Hayes, Tom Plasterer
BMS: Rick Bishop
Janssen: Herman van Vlijmen
Novartis: Fabien Pernot
MMV: Jeremy Burrows
PubChem: Evan Bolton
ChEMBL: Anna Gaulton, Andrew Leach
Roche: Olivier Roche
Medicines Discovery Catapult: John Overington, Mark Davies
Pangeadata.ai: Vibhor Gupta
University of Miami: Stephan Schürer
BioSci Consulting: Scott Wagers
Collaborative Drug Discovery: Barry Bunin, Frank Cole, Alex Clark, Hande Kücük McGinty (now Univ. of Ohio)
Pistoia Alliance: Carmen Nitsche (now at CCDC), Nick Lynch (now at Curlew Research)