What can your library do for you?

What can your library do for you?
Rajarshi Guha, Dac-Trung Nguyen,
Alexey Zhakarov, Ajit Jadhav
NIH NCATS
ACS Fall Meeting 2016, Philadelphia
August 21, 2016

Library Design
Historical collections and assay data provide information on
how a set of compounds has faired
Use (dis)similarity and machine learning to construct new
collections that show similar behavior
Plus various constraints
Libraries can be designed for certain target families or speciﬁc
screening paradigms
If suﬃciently annotated, compound behavior
may be correlated to assay and biology char-
acteristics

Two Questions
How likely are compounds, associated with a given annotation,
identiﬁed as active?
Given a new set of compounds, what sets of assay conditions (as
implied by the annotations) will they be active in?

Prior Work
BAO annotated datasets
de Souza et al, 2014; Vempati et al, 2012
Analyzing HTS datasets using BAO
Zander-Balderud et al, 2015; Sch¨urer et al, 2011
Semi-automated annotation of assay descriptions using the
BAO
Clark et al, 2014

Workﬂow
Extract unique BAO terms and for each term identify
annotated assays
Extract active compounds from this set of assays
Compute ﬁngerprint bit distribution
Use these conditional bit distributions to identify the BAO
terms that describe the assay that they are likely to be active
in

Dataset Overview
Extracted 4010 Pubchem AIDs from BARD
Primary, conﬁrmation, counterscreening assays
154M outcomes
740K compounds
Pubchem 881-bit keys using CDK and NCGC implementations
192 unique BAO terms

Dataset Overview
1e+02
1e+05
1e+08
Active Inactive Inconclusive Probe Unspecified
Numberofoutcomes
0
200
400
0 2 4 6
log Num. Compound
Density(assays)
Outcome
Active
Inactive
0
300
600
900
2 3 4 5 6 7 8 9 10 11 12 13
Term Depth
NumberofTerms
0
1000
2000
1 2 3 4 5 6 7 8 9 13
Number of BAO Terms
NumberofAssays

Class Imbalances
Imbalanced classes are problematic, and some of
the terms with near-balanced classes are not very
speciﬁc (e.g., imaging method)
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqq
qqq
qqqqqq
q
q
qqq
q
q
q
qq
q
q
qq
q
q
q
q
VICTOR X2 Multilabel Plate Reader
imaging method
radiometry method
SpectraMax 190 Microplate Reader
reporter protein
thermal shift
0
3
6
9
BAO ID
ClassRatio

Problem formulation
For a given library of compounds X, we would like to calculate a
ranked list relevant T of BAO terms that are most likely associated
with X. Let x ∈ X and t is a BAO term. The list T is an ordered
list based on the following:
argmax
i



j
p(ti |xj )



, (1)
where p(ti |xj ) is the probability that BAO term ti is associated
with compound xj . From Bayes’ rule, we have
p(ti |xj ) =
p(xj |ti )p(ti )
p(xj )
or p(ti |xj ) ∝ p(xj |ti )p(ti ).
Given that BAO terms are annotated at the assay level, we instead
have
p(ti |xj ) ∝ p(ti )
k
p(xj |ak)p(ak|ti ), (2)
where ak is a BAO annotated assay.

A Bayesian Approach for Ranking
Note that p(xj |ak) is the sampling function specified over only
active compounds in assay ak. In our model, xj is defined as
independent Bernoulli distribution with parameter θ, i.e.,
p(xj |ak) =
i
θ
xji
i (1 − θi )1−xji
,
where xji ∈ {0, 1} is the i-th bit of the PubChem substructural
fingerprint.
Learning BAO terms for a library of compounds amounts to
estimating θ, p(ti ), and p(ak|ti ).

Per-Term Activity Classiﬁer
For a given ontology term Ti , predict whether a compound
will be active or not
Model this using Na¨ıve Bayes, where we extract set of actives
and inactives from assays annotated with Ti
Results in a set of models {M1, M2, · · · , MN},
For a new compound in library, obtain probability of being
active for term Ti for all i and take top k terms
Aggregate top k terms from all compounds in library
Represents the set of ontology terms deﬁning an assay
in which these compounds would likely be active

Test Libraries
Considered several libraries to test out the approach
MIPE (1912 compounds) - Approved, investigational drugs,
constructed for functional diversity
LOPAC (1280 compounds) - Diverse library, designed for
enrichment of bioactivity
Natural Products (5000 compounds)
1000 member subset of ChEMBL GPCR collection
1000 member subset of ChEMBL Kinase collection

Test Libraries
In the Pubchem ﬁngerprint space, the libraries are not very diﬀerent
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
gpcr1gpcr2kinase1kinase2lopacmipenp
Bit Position
NormalizedFrequency

Test Libraries - Distance Matrix
lopac
mipe
np
kinase1
kinase2
gpcr1
gpcr2
lopac
m
ipe
np
kinase1
kinase2
gpcr1
gpcr2
0
1
2
3
4
Euc. Dist.

Prediction Workﬂow
Bayesian Ranking
Compute liklihood of all
terms for each compound
Aggregate across library
(mean liklihood) and take
top k
Activity Models
For molecules predicted
active, collect corresponding
terms
Retain the top k most
frequent terms across the
library
We take the top k terms as the set of anno-
tations describing an assay in which the library
will show activity in

Result - Bayesian Ranking
BAO_0000001
BAO_0000002
BAO_0000003
BAO_0000004
BAO_0000005
BAO_0000010
BAO_0000015
BAO_0000035
BAO_0000045
BAO_0000046
BAO_0000049
BAO_0000050
BAO_0000051
BAO_0000055
BAO_0000057
BAO_0000062
BAO_0000063
BAO_0000070
BAO_0000079
BAO_0000080
BAO_0000100
BAO_0000123
BAO_0000129
BAO_0000130
BAO_0000131
BAO_0000139
BAO_0000142
BAO_0000152
BAO_0000160
BAO_0000164
BAO_0000166
BAO_0000217
BAO_0000218
BAO_0000219
BAO_0000220
BAO_0000221
BAO_0000223
BAO_0000224
BAO_0000225
BAO_0000249
BAO_0000250
BAO_0000251
BAO_0000254
BAO_0000357
BAO_0000363
BAO_0000366
BAO_0000394
BAO_0000405
BAO_0000450
BAO_0000452
BAO_0000453
BAO_0000508
BAO_0000512
BAO_0000513
BAO_0000515
BAO_0000516
BAO_0000572
BAO_0000577
BAO_0000578
BAO_0000591
BAO_0000593
BAO_0000657
BAO_0000682
BAO_0000691
BAO_0000697
BAO_0000698
BAO_0000699
BAO_0000701
BAO_0000705
BAO_0000706
BAO_0000722
BAO_0000850
BAO_0000884
BAO_0000902
BAO_0000903
BAO_0000904
BAO_0000905
BAO_0000906
BAO_0000913
BAO_0000943
BAO_0000982
BAO_0001019
BAO_0001036
BAO_0001046
BAO_0001047
BAO_0001104
BAO_0002001
BAO_0002041
BAO_0002043
BAO_0002090
BAO_0002100
BAO_0002168
BAO_0002176
BAO_0002182
BAO_0002188
BAO_0002196
BAO_0002424
BAO_0002527
BAO_0002528
BAO_0002530
BAO_0002534
BAO_0002656
BAO_0002989
BAO_0002990
BAO_0002991
BAO_0002993
BAO_0002994
BAO_0002995
BAO_0002996
BAO_0002997
BAO_0002998
BAO_0003000
BAO_0003002
BAO_0003003
BAO_0003004
BAO_0003005
BAO_0003006
BAO_0003007
BAO_0003008
BAO_0003009
BAO_0003010
BAO_0003063
BAO_0003064
BAO_0003069
lopac
m
ipe
N
P
−5e+09
−4e+09
−3e+09
−2e+09
−1e+09
Avg Prob

Result - Per Term Activity Classiﬁer
bioluminescence
molecular redistribution determination method
direct enzyme activity measurement method
whole cell lysate format
BAO_0000722
lopac
m
ipe
npkinase1kinase2
gpcr1
gpcr2
Rank
1
2
3
4
5

What’s Diﬀerent Between Libraries?

Term Depth for the ’Diﬀerential’ Terms
q
q
q
q
q
q
q
q
qq
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
qqq
qqq
qq
qq
q
q
qqqq q
q
3
4
5
6
7
8
3 4 5 6 7 8
Term Depth (LOPAC)
TermDepth(MIPE)
q
q
q
q
q
q
q
q
q q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
qqq
qqq
qq
qq
q
q
q qqq q
q
3
4
5
6
7
8
3 4 5 6 7 8
Term Depth (NP)
TermDepth(MIPE)

Pitfalls
If suﬃciently annotated, compound behavior may
be correlated to assay and biology characteristics
A very abstract, possibly lossy, view of the eﬀect of
compounds on biology
Depends on correct and meaningful annotations
Annotations terms are context dependent, but this may not
be considered when annotating a dataset
BAO terms exhibit hierarchical relationships and ignoring
them is simplistic

Acknowledgements
Qiong Cheng (U. Miami)
Stephan Sch¨urer (U. Miami)
Source code and slides

What can your library do for you?

Recomendados

Recomendados

Mais conteúdo relacionado

Semelhante a What can your library do for you?

Semelhante a What can your library do for you? (20)

Mais de Rajarshi Guha

Mais de Rajarshi Guha (20)

Último

Último (20)

What can your library do for you?