A talk given at the International Congress "Contrasts in Pharmacology 2.0" held in Turin, May 14-16 2015
It describes our work with bigger datasets, on tuberculosis as well as other areas.
1. Bigger Data to Increase Drug Discovery
Sean Ekins
Phoenix Nest, Inc., Brooklyn, NY.
Collaborations in Chemistry, Inc., Fuquay Varina, NC.
Collaborative Drug Discovery, Inc., Burlingame, CA.
Collaborations Pharmaceuticals, Inc., Fuquay Varina, NC.
2. In a Perfect World…
• All major diseases cured
• All > 7000 rare diseases have treatments available
• Neglected diseases are eradicated
• Antibiotics, antivirals, vaccines developed to anticipate all future mutations
• Drug resistance eradicated
• All research coordinated globally
• Government/individual collaboration discovers and funds all research
• Billions of molecules will be available with data for different targets
• All decisions will involve machine learning
• Life expectancy is infinite
8. Just a matter of scale?
Drug Discovery’s
definition of Big data
Everyone else’s definition of Big data
9. What about Chemistry and Biology -
Pharmacology X.0
• Data Sources
• PubChem
• ChEMBL
• ToxCast: over 1800 molecules tested against over 800 endpoints
12. But what about small data?
• In some cases it's all we have
• In vivo data is not high throughput
• Small data builds networks
http://smalldatagroup.com/
13. The past
• 1996
• Data from low-throughput drug-drug interaction studies
• E.g. Ki values with CYP 3A4
• A drug company might have 10s of values
• This data was used to build 3D QSAR, pharmacophores
JPET, 290: 429-438, 1999
14. How you dispense liquids may be important: insights from small data
Process                     HPF   HBA   HBD   Observed vs. predicted IC50 r
Acoustic mediated process   2     1     1     0.92
Tip-based process           0     2     1     0.80
HPF = hydrophobic features; HBA = hydrogen bond acceptor; HBD = hydrogen bond donor
Pharmacophores (acoustic and tip-based) generated with Discovery Studio (Accelrys)
Cyan = hydrophobic
Green = hydrogen bond acceptor
Purple = hydrogen bond donor
Each model shows most potent molecule mapping
PLoS ONE 8(5): e62325 (2013)
15. A common feature pharmacophore for FDA-approved drugs inhibiting the Ebola virus
Ebola inhibitor pharmacophore
Ekins S, Freundlich JS and Coffee M, F1000Research 2014, 3:277
Docking FDA-approved compounds in VP35 protein showing overlap with ligand (yellow)
Proposed that amodiaquine, chloroquine, clomiphene and toremifene, which are all active in vitro, may have common features and bind a common site / target
16. The last 5 years - Present
• 2010
• Data from high throughput screens at Pfizer
• E.g. metabolic stability data ~200K compounds
• This data was used to build machine learning models
• 2015
• Could easily be double this amount
Drug Metab Dispos, 38: 2083-2090, 2010
17. Ebola Machine Learning Models
Models (training set 868 compounds); all values are ROC
Model                                 Ebola replication (actives = 20)   Ebola pseudotype (actives = 41)
RP Forest (out of bag)                0.70                               0.85
RP Single Tree (5 fold cross val.)    0.78                               0.81
SVM (5 fold cross validation)         0.73                               0.76
Bayesian (5 fold cross validation)    0.86                               0.85
Bayesian (leave out 50% x 100)        0.86                               0.82
Open Bayesian (5 fold cross val.)     0.82                               0.82
Ekins, Freundlich, Madrid and Clark
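The ROC values above summarize how well each model ranks actives ahead of inactives. As a rough illustration of what the number means, here is a minimal sketch of the rank-based ROC AUC in plain Python. The scores and labels are made-up toy values, not the Ebola data, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC as the chance that a random active outscores a random inactive.

    Ties count as half a win. Labels are 1 (active) or 0 (inactive).
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Hypothetical model scores for six compounds (illustrative values only).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. In 5 fold cross validation the same calculation would be applied to predictions made on held-out folds.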
19. Tuberculosis – a big disease
Tuberculosis still kills 1.6-1.7m/yr (~1 every 8 seconds)
1/3 of the world's population infected
streptomycin (1943)
para-aminosalicylic acid (1949)
isoniazid (1952)
pyrazinamide (1954)
cycloserine (1955)
ethambutol (1962)
rifampicin (1967)
Multi drug resistance in 4.3% of cases
Extensively drug resistant increasing incidence
2 new drugs (bedaquiline, delamanid) in 40 yrs
21. Over 8000 molecules with dose
response data for Mtb in CDD Public
from NIAID/SRI
https://app.collaborativedrug.com/register
22. Over 6 years analyzed in vitro data and built models
Workflow:
• Mtb screening molecule database/s from high-throughput phenotypic Mtb screening
• Descriptors + Bioactivity (+ Cytotoxicity)
• Bayesian machine learning classification Mtb model
• Molecule database (e.g. GSK malaria actives) virtually scored using Bayesian models
• Top scoring molecules assayed for Mtb growth inhibition
• Identify in vitro hits and test models
• New bioactivity data may enhance models
Results:
• 3 x published prospective tests: ~750 molecules were tested in vitro, 198 actives were identified, >20% hit rate
• Multiple retrospective tests: 3-10 fold enrichment
Ekins et al., Pharm Res 31: 414-435, 2014
Ekins et al., Tuberculosis 94: 162-169, 2014
Ekins et al., PLoS ONE 8: e63240, 2013
Ekins et al., Chem Biol 20: 370-378, 2013
Ekins et al., JCIM 53: 3054-3063, 2013
Ekins and Freundlich, Pharm Res 28: 1859-1869, 2011
Ekins et al., Mol BioSyst 6: 840-851, 2010
Ekins et al., Mol BioSyst 6: 2316-2324, 2010
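The hit rate and fold enrichment quoted for this slide can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the ranked list is toy data, and since the slide gives the number tested as approximately 750, the computed rate is also approximate.

```python
def hit_rate(n_actives, n_tested):
    """Fraction of tested molecules that were active."""
    return n_actives / n_tested


def enrichment_factor(ranked_labels, top_fraction):
    """Hit rate in the top-ranked fraction divided by the overall hit rate.

    ranked_labels: 1/0 activity labels sorted by model score, best first.
    """
    n = len(ranked_labels)
    n_top = max(1, int(round(n * top_fraction)))
    overall = sum(ranked_labels) / n
    if overall == 0:
        raise ValueError("no actives in the list")
    top_rate = sum(ranked_labels[:n_top]) / n_top
    return top_rate / overall


# Prospective tests from the slide: 198 actives out of ~750 tested.
prospective = hit_rate(198, 750)

# Toy retrospective ranking: 20 molecules, 4 actives, 3 of them in the top 5.
toy_ranking = [1, 1, 0, 1, 0] + [0] * 14 + [1]
toy_ef = enrichment_factor(toy_ranking, 0.25)
```

With these numbers the prospective hit rate comes out at about 26%, consistent with the ">20%" on the slide, and the toy ranking gives roughly 3 fold enrichment in the top quarter.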
23. 5 active compounds vs Mtb in a few months
7 tested, 5 active (~71% hit rate)
Ekins et al., Chem Biol 20: 370-378, 2013
1. Virtually screen 13,533-member GSK antimalarial hit library
2. Bayesian Model = SRI TAACF-CB2 dose response + cytotoxicity model
3. Top 46 commercially available compounds visually inspected
4. 7 compounds chosen for Mtb testing based on
- drug-likeness
- chemotype diversity
GSK #          Bayesian score   Mtb H37Rv MIC (µg/mL)   GSK reported % inhibition HepG2 @ 10 µM cmpd
TCMDC-123868   5.73             >32                     40
TCMDC-125802   5.63             0.0625                  5
TCMDC-124192   5.27             2.0                     4
TCMDC-124334   5.20             2.0                     4
TCMDC-123856   5.09             1.0                     83
TCMDC-123640   4.66             >32                     10
TCMDC-124922   4.55             1.0                     9
(Chemical structure column not reproduced)
24. Filling out the triazine matrix using SARtable:
A new kind of map
Green = good activity, Red = bad; colored dots are predictions
25. No relationship between internal or external ROC and the
number of molecules in the training set?
PCA of combined
data and ARRA(red)
Ekins et al., J Chem Inf Model
54: 2157-2165 (2014)
Internal and leave-out 50% x 100 ROC track each other
External ROC is less correlated
Smaller models do just as well with external testing
~350,000
26. What matters most >70 years of TB mouse in vivo data – Mind
the gap - 770 molecules
MIND THE TB GAP
Ekins et al.,
J Chem Inf Model 54: 1070-82, 2014
Ekins, Nuermberger & Freundlich
DDT 19: 1279-1282, 2014
27. In vivo Machine Learning Models
Ekins et al., J Chem Inf Model 54: 1070-82, 2014
                              RP Forest      RP Single Tree   SVM            Bayesian
ROC 5 fold cross validation   0.75           0.71             0.77           0.73
External test set             3/11 (27.3%)   4/11 (36.4%)     7/11 (63.6%)   8/11 (72.7%)
28. How can we find the in vivo active compound?
We need a map.
29. >70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015 data
Uses Bayesian fingerprints
and clustering by similarity
Clark and Ekins - unpublished
Clustering in vivo mouse TB data: hex plot
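Clustering by similarity on fingerprints usually rests on a pairwise measure such as the Tanimoto coefficient. The sketch below is an assumption about the general approach, not the method behind the hex plot: fingerprints are represented as plain sets of integer bits, and the greedy "leader" clustering and its 0.5 threshold are chosen only for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints, threshold=0.5):
    """Greedy clustering: each molecule joins the first cluster whose
    leader (its first member) is at least `threshold` similar to it."""
    leaders, clusters = [], []
    for idx, fp in enumerate(fingerprints):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, leader) >= threshold:
                clusters[c].append(idx)
                break
        else:
            # No sufficiently similar leader found: start a new cluster.
            leaders.append(fp)
            clusters.append([idx])
    return clusters


# Toy fingerprints (sets of made-up bit indices), not real molecules.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20}]
toy_clusters = leader_cluster(toy_fps, threshold=0.5)
```

Here the first two and the next two toy fingerprints end up grouped together, while the last one forms a cluster of its own.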
30. >70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015
Clark and Ekins - unpublished
Clustering in vivo
mouse TB data
Triazine surrounded by
inactives
Issues
High Log P, poor solubility
31. How do we ‘increase drug discovery’?
• Make data and models more accessible
• Collaborate
• Share
– Create mobile apps
• Encourage engagement from non scientists
33. CDD Vision
Uses Bayesian algorithm and FCFP_6 fingerprints
Bayesian models
Clark et al., J Cheminform 6:38 2014
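To make the idea of a fingerprint-based Bayesian model concrete, here is a minimal sketch of a Laplacian-corrected naive Bayesian scorer. It is an assumption that this variant resembles the one used here. Fingerprints are stood in for by sets of integers rather than real FCFP_6 bits, and all names and data are hypothetical.

```python
import math
from collections import Counter


def train_laplacian_bayes(fingerprints, labels):
    """Per-feature contribution log((A_f + 1) / (T_f * P + 1)), where
    A_f = actives containing feature f, T_f = all molecules containing f,
    and P = overall fraction of actives in the training set."""
    base_rate = sum(labels) / len(fingerprints)
    feature_total = Counter()
    feature_active = Counter()
    for fp, y in zip(fingerprints, labels):
        for f in set(fp):
            feature_total[f] += 1
            if y:
                feature_active[f] += 1
    return {
        f: math.log((feature_active[f] + 1.0) / (t * base_rate + 1.0))
        for f, t in feature_total.items()
    }


def bayes_score(model, fp):
    """Sum the contributions of the features present; unseen features add 0."""
    return sum(model.get(f, 0.0) for f in set(fp))


# Toy training set: feature 1 occurs only in actives, feature 4 only in inactives.
toy_fps = [{1, 2}, {1, 3}, {2, 4}, {3, 4}]
toy_labels = [1, 1, 0, 0]
toy_model = train_laplacian_bayes(toy_fps, toy_labels)
```

Higher summed scores suggest a molecule shares more features with the actives. Because the model is just a table of per-feature weights, it stays small and easy to share.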
34. Building Bayesian models for each target in TB Mobile
Predictions for the InhA target: (a) the ROC curve with ECFP_6 and FCFP_6 fingerprints; (b) modified Bayesian estimators for active and inactive compounds; (c) structures of selected binders.
For each listed target with at least two binders, it is first assumed that all of the molecules in the collection that do not indicate this as one of their targets are inactive.
In the app we used ECFP_6 fingerprints
Clark et al., J Cheminform 6:38 2014
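The labeling rule described above (targets with at least two binders, everything else assumed inactive) can be sketched as follows. The data structure, function name and the placeholder targets `target_X` and `target_Y` are hypothetical. Only InhA is named on the slide.

```python
def build_target_training_sets(molecule_targets, min_binders=2):
    """Return {target: {molecule: 1 or 0}} for targets with enough binders.

    molecule_targets maps a molecule id to the set of targets it is
    annotated with. Molecules not annotated with a target are assumed
    inactive against it, as on the slide.
    """
    counts = {}
    for targets in molecule_targets.values():
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
    training = {}
    for target, n in counts.items():
        if n < min_binders:
            continue  # too few binders to build a model
        training[target] = {
            mol: int(target in targets)
            for mol, targets in molecule_targets.items()
        }
    return training


# Toy annotations (illustrative only).
toy_annotations = {
    "mol_a": {"InhA"},
    "mol_b": {"InhA", "target_X"},
    "mol_c": {"target_Y"},
    "mol_d": set(),
}
toy_sets = build_target_training_sets(toy_annotations)
```

In this toy case only InhA has two binders, so it is the only target that gets a training set. A model would then be trained for each such target using these labels.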
35. TB Mobile Vers.2
Ekins et al., J Cheminform 5:13, 2013
Clark et al., J Cheminform 6:38 2014
Predict targets
Cluster molecules
http://goo.gl/vPOKS
http://goo.gl/iDJFR
41. What do 2000 ChEMBL models look like?
[Plot: average ROC vs. folding bit size]
http://molsync.com/bayesian2
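Folding reduces a sparse set of hashed fingerprint features to a fixed number of bits, at the cost of collisions when the bit size is small. A minimal sketch, assuming simple modulo folding (the folding scheme actually used for these models is not stated on the slide) and made-up hash values:

```python
def fold_fingerprint(hashes, n_bits):
    """Fold hashed feature identifiers into bit positions 0..n_bits-1."""
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    return {h % n_bits for h in hashes}


# Made-up feature hashes; 5 and 1029 collide when folded to 1024 bits.
toy_hashes = [5, 1029, 70000, 123456789]
folded_1024 = fold_fingerprint(toy_hashes, 1024)
folded_4096 = fold_fingerprint(toy_hashes, 4096)
```

With 1024 bits two of the four toy features share a bit, while 4096 bits keeps them distinct. This is the kind of trade-off a plot of average ROC against folding bit size explores.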
42. Bigger datasets and model collections
• Profiling “big datasets” is going to be the norm.
• A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data
• This could be used in other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc.
• Kinase screening data (1000s mols x 100s assays)
• GPCR datasets etc (1000s mols x 100s assays)
Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal
Toxicants by Automatically Mining Public
Bioassay Data: A Big Data Approach for
Computational Toxicology. PLoS ONE 9(6):
e99863. doi:10.1371/journal.pone.0099863
43. How close are we?
• Data is at your fingertips instantly
• Labs add data to a massive corpus of knowledge
• Instantly available to all
• Algorithms for mining, prediction
• Millions of models accessible
• Making decisions on experiments needed and running them
• Data visualization, exploration is real-time, updated
• Data follows you
Sean Ekins, a computational drug discovery consultant at Collaborations in Chemistry in North Carolina, is much more skeptical. He notes pharma companies have found hundreds of antimalaria compounds more potent than TNP-470 and says that he is not convinced Eve can do QSAR. He wants to see Eve go head-to-head with a real computational chemist. “Eve should go back to the Garden of Eden and leave drug discovery to scientists who know what they are doing,” Ekins says.
44. Conclusions
• Computers and models do not replace scientists
• A tool to help us sift through ideas quickly
• Many examples have led to leads
• Bigger data not needed for good models
• More data becoming public
• Can model ADME, bioactivity and more
• Collaboration and software are important
• Mobile apps have useful cheminformatics features that help anyone do drug discovery
• Models are compact (< 1 MB) and portable
• The age of model sharing is here
45. Wanted
• “Bigger” small molecule screening datasets
• Preferably > 500,000 – 1,000,000 molecules with data
• To test how machine learning algorithms scale
• Contact ekinssean@yahoo.com
46. Nadia Litterman, Krishna Dole and all at CDD, Megan Coffee, SRI, MM4TB and many others …
Funding: Bill and Melinda Gates Foundation (Grant#49852), 1R41AI088893-01, 2R42AI088893-02, R43 LM011152-01, 9R44TR000942-02, 1R41AI108003-01, 1U19AI109713-01, MM4TB
Software: Biovia
Freundlich Lab
Editor's Notes
You do not need big data to show fundamental observations