2. Aims
• How does the dynamic range of the data being modeled impact the apparent performance of the model?
• How does experimental error impact the apparent predictivity of a model?
• How can we determine whether a model is applicable to a new dataset?
• How should we compare the performance of regression models?
http://media.johnwiley.com.au/product_data/excerpt/00/11181391/1118139100-4.pdf
3. Example
Examine a number of datasets containing measured values for aqueous solubility and use these datasets to build and evaluate predictive models.
4. Challenges in modeling solubility
Aqueous solubility of a compound can vary
depending on a number of factors:
• Temperature
• Purity
• Polymorph
5. Datasets under study
• The Huuskonen Dataset: 1274 experimental solubility values; one of the first large solubility datasets.
• The JCIM Dataset: 94 experimental solubility values, published in 2008.
• The PubChem Dataset (AID1996): a randomly selected subset of 1000 measured solubility values, drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND).
7. Solubility Comparison
A boxplot comparison of Log S for the three datasets
8. Requirements for Predictive model
• Reliable experimental data
• Sets of molecular descriptors
• Statistical or machine-learning methods
9. Types of Models
Classification Model:
• Choosing cutoff points for the classes introduces “edge effects”. Consider a case where we have a two-class system with a cutoff of 100 μM. A value of 99 μM will be considered insoluble while a value of 101 μM will be considered soluble.
• Another difficulty with classification models is that they provide limited direction for improving the properties of a compound.
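The edge effect described above can be illustrated with a minimal sketch. The function name `classify_solubility` is hypothetical, and treating a value of exactly 100 μM as soluble is an assumption, since the slide does not say which side of the boundary the cutoff itself falls on.

```python
CUTOFF_UM = 100.0  # class boundary from the slide, in micromolar


def classify_solubility(solubility_um: float, cutoff_um: float = CUTOFF_UM) -> str:
    """Assign a two-class label using a hard cutoff.

    Assumption: values equal to the cutoff are labeled soluble.
    """
    return "soluble" if solubility_um >= cutoff_um else "insoluble"


# Two measurements that differ by only 2 uM end up in different classes.
for value in (99.0, 101.0):
    print(value, classify_solubility(value))
```

The two values are practically indistinguishable as measurements, yet the model is trained to treat them as opposite outcomes.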
10. Types of Models
Regression Model:
• It is difficult to create a regression model given data with a limited dynamic range.
• A limited dynamic range leads to an unreliable model.
11. Evaluating a predictive model
• Pearson’s r: the correlation coefficient commonly referred to as Pearson’s r, or its square r^2. Values of r can vary between −1 and 1.
• Kendall’s Tau: a limitation of Pearson’s r is that it is sensitive to outliers and to the distribution of the underlying data. Kendall’s tau instead uses the rank order of the values.
• RMSD: for paired values X and Y, the RMSD is the square root of the mean of the squared differences between each pair.
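The three measures above can be sketched in plain Python. This is an illustrative implementation, not the code used for the slides: the Kendall function computes the simple tau-a variant (ties count as neither concordant nor discordant), and the example values are made up.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def rmsd(x, y):
    """Root-mean-square deviation between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up Log S values for illustration only.
experimental = [-6.0, -4.5, -3.0, -2.0, -1.0]
predicted = [-5.5, -4.0, -3.5, -2.5, -0.5]
print(pearson_r(experimental, predicted))
print(kendall_tau(experimental, predicted))
print(rmsd(experimental, predicted))
```

Pearson’s r and Kendall’s tau describe how well the predictions track the measurements, while RMSD reports the typical size of the error in the units of the data (here, log units).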
12. Steps involved in building a predictive model
• Integrate the experimental data and molecular descriptors
• Divide the data into training and test sets
• Build a model from the training set
• Use this model to predict the test set
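The four steps above can be sketched end to end. Everything in this sketch is an assumption made for illustration: the compounds and their single descriptor are synthetic, the 75/25 split is arbitrary, and an ordinary least-squares line stands in for whatever statistical or machine-learning method is actually used.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Step 1: integrate experimental data with descriptors (both tables are synthetic).
descriptors = {f"cmpd{i}": rng.uniform(-1.0, 6.0) for i in range(40)}
log_s = {name: -1.1 * d + 0.3 + rng.gauss(0.0, 0.5) for name, d in descriptors.items()}
records = [(name, descriptors[name], log_s[name]) for name in descriptors if name in log_s]

# Step 2: divide the data into training and test sets (75/25 is an assumption).
rng.shuffle(records)
split = int(0.75 * len(records))
train, test = records[:split], records[split:]


# Step 3: build a model from the training set (least-squares line, a stand-in model).
def fit_line(rows):
    xs = [d for _, d, _ in rows]
    ys = [s for _, _, s in rows]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# Step 4: use the model to predict the held-out test set.
predictions = {name: slope * d + intercept for name, d, _ in test}
```

Keeping the test compounds out of the fitting step is what makes the later comparison of predicted and experimental values a fair estimate of performance.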
13. Random forest model
The dynamic range in a dataset can have a large impact on the apparent correlation between experimental and predicted activity.
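The effect of dynamic range on apparent correlation can be shown with simulated data. No random forest is trained in this sketch: the "predictions" are simply synthetic experimental values plus Gaussian error with a fixed standard deviation of 0.7 log units, and all of these numbers are assumptions chosen for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)

# Synthetic Log S values spanning ten log units.
experimental = [rng.uniform(-10.0, 0.0) for _ in range(2000)]
# The prediction error is identical across the whole range.
predicted = [value + rng.gauss(0.0, 0.7) for value in experimental]

r_wide = pearson_r(experimental, predicted)

# Restrict the same data to a window one log unit wide.
narrow = [(e, p) for e, p in zip(experimental, predicted) if -5.0 <= e <= -4.0]
r_narrow = pearson_r([e for e, _ in narrow], [p for _, p in narrow])

print(f"wide range r = {r_wide:.2f}, narrow range r = {r_narrow:.2f}")
```

The model error is the same in both cases; only the spread of the experimental values changes, which is why a correlation should be read together with the range of the data it was computed on.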
14. Experimental Error and Model Performance
• Every experimental data point has an error associated with it. If we measure the Log S of a compound as −6 and that data point has an error of 0.3 log units, the actual value could be anywhere between −6.3 and −5.7.
• Brown examined the relationship between experimental error and model performance.
• Gaussian distributed random values were added to data to simulate experimental errors.
• The correlation between the measured values and the same values with simulated error is then calculated.
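A minimal sketch of the simulation described above, under stated assumptions: the "measured" values are synthetic, the helper name `correlation_with_simulated_error` is hypothetical, and a single noisy copy is generated per error level rather than averaging over many repeats.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_with_simulated_error(values, error_sd, rng):
    """Correlate values with a copy perturbed by Gaussian noise of the given SD."""
    noisy = [v + rng.gauss(0.0, error_sd) for v in values]
    return pearson_r(values, noisy)


rng = random.Random(2)
measured = [rng.gauss(-3.0, 2.0) for _ in range(1000)]  # synthetic Log S values

results = {sd: correlation_with_simulated_error(measured, sd, rng) for sd in (0.3, 0.5, 1.0)}
for sd, r in results.items():
    print(f"error SD {sd} log units: r = {r:.3f}")
```

The resulting correlation acts as a ceiling: a model cannot be expected to agree with the measurements better than the measurements agree with a noisy copy of themselves.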
15. Experimental Error and Model Performance
• The table shows the maximum possible correlation for each of the three solubility datasets we have been examining when experimental errors of 0.3, 0.5, and 1.0 log units are considered.
• The impact of experimental error is larger for a dataset like PubChem.
16. Model Applicability
• Models often perform poorly on molecules that bear little resemblance to those in the training set.
Similarity of Each Test Set
Dataset          Mean   Median
Huuskonen_Test   0.76   0.78
JCIM             0.74   0.62
Pubchem          0.56   0.56

Dataset          R2     Kendall   RMS Error
Huuskonen_Test   0.92   0.82      0.58
JCIM             0.58   0.59      0.83
Pubchem          0.11   0.22      1.12
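One way to quantify how closely a test set resembles the training set is sketched below. The slides do not state how similarity was computed, so this is an assumption: each molecule is represented as a set of fingerprint bits, similarity is the Tanimoto coefficient, and each test molecule is scored by its most similar training molecule. The fingerprints are toy values.

```python
from statistics import mean, median


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(test_fp: set, training_fps: list) -> float:
    """Similarity of a test molecule to its closest training-set molecule."""
    return max(tanimoto(test_fp, fp) for fp in training_fps)


# Toy fingerprints; a real workflow would generate them with a cheminformatics toolkit.
training_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}, {1, 4, 6, 7}]
test_fps = [{1, 2, 3, 4}, {2, 3, 5}, {9, 10, 11}]

scores = [nearest_training_similarity(fp, training_fps) for fp in test_fps]
print("mean:", mean(scores), "median:", median(scores))
```

A test set with a low mean or median score contains many molecules unlike anything the model was trained on, so weaker performance on that set should not be surprising.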
17. Comparing Predictive Models
• When comparing correlation coefficients, we must consider not only the value of each correlation coefficient but also the confidence interval around it.
• If the confidence intervals of two correlations overlap, we cannot claim that one predictive model is superior to another.
• For a subset of 25 compounds, the confidence intervals overlap, so we cannot say that one correlation is superior to the other.
• For a subset of 50 compounds, there is only a very small gap between the upper bound of one 95% confidence interval and the lower bound of the other.
• For a subset of 100 compounds, there is clear separation between the confidence intervals, which implies a clear difference between the correlation coefficients.
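The dependence of the confidence interval on the number of compounds can be sketched with the Fisher z-transformation, a standard way to put an approximate 95% interval around Pearson’s r. Whether the slides used this method is an assumption, and the correlations of 0.7 and 0.9 are hypothetical values chosen for illustration.

```python
import math


def pearson_confidence_interval(r: float, n: int, z_crit: float = 1.96):
    """Approximate confidence interval for Pearson's r via the Fisher z-transformation."""
    z = math.atanh(r)                 # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical correlations for two models evaluated on the same number of compounds.
for n in (25, 50, 100):
    ci_first = pearson_confidence_interval(0.7, n)
    ci_second = pearson_confidence_interval(0.9, n)
    print(n, ci_first, ci_second, intervals_overlap(ci_first, ci_second))
```

The intervals narrow as the number of compounds grows, so the same pair of correlation coefficients can be indistinguishable on a small test set and clearly different on a larger one.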