Automatic Generation of Negative Control Structures for Automated Structure Verification Systems
The generation of positive and negative controls is a fundamental part of good experimental design. A positive outcome on a test performed on a subject known to give a positive result reassures the scientist that the test is working properly. Just as important, if not more so, is testing subjects known to give negative results. Obtaining a negative outcome when one is expected validates the test and increases confidence in its results when it is applied to unknowns.
Automated Structure Verification (ASV) is no different from any other scientific test. Positive as well as negative controls should be tested frequently to optimize performance and to obtain a measure of robustness and confidence in the results.
In this poster I will show how to automatically generate relevant negative control structures for any type of NMR data. Furthermore, I will argue that ASV systems fall in the category of binary classifiers, and that their performance can be measured by a host of metrics, already in use in the fields of statistical classification and signal detection theory.
3. Goal
• To develop a method that, given a target chemical
structure, would rank other proposed structures
based on the expected similarity of their NMR data,
without a priori knowledge of that data.
[Figure: proposed structures arranged in order of increasing similarity]
4. How to Achieve Our Goal
• Calculate a molecular similarity coefficient predictive
of NMR data similarity.
• Develop an NMR-specific molecular fingerprint
5. Molecular Similarity vs. NMR Data Similarity
Molecular Fingerprints
• A molecular fingerprint is a collection of descriptors that is used to characterize a
molecule. For example, the number and type of functional groups, molecular formula,
etc.
• Different metrics can be calculated between fingerprints to find their similarity or
dissimilarity.
• The most common fingerprints include public MDL keys, FCFP4, and fragment-based fingerprints.
[Figure: example chemical structures for comparison]
NMR Data Similarity
• Which two molecules are structurally most similar?
• Which molecules would present the most similar NMR data?
• How can the previous question be answered without knowing the actual NMR data?
6. NMR-Specific Molecular Similarity Coefficient
Similarity based on Chemical Environments Around Carbon Atoms
• Define the most common chemical environments up to three shells emanating from a
carbon atom
• Assemble them as bits of a fingerprint
• Count how many times each fingerprint bit (environment) is present in each molecule
• Calculate similarity between two molecules as the Euclidean distance between two
fingerprints
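The counting and distance steps above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's implementation: the environment strings and their counts are hypothetical, and the functions `euclidean_distance` and `rank_candidates` are names introduced here for the example.

```python
from collections import Counter
from math import sqrt


def euclidean_distance(fp_a, fp_b):
    """Euclidean distance between two count fingerprints (environment -> count)."""
    keys = set(fp_a) | set(fp_b)
    return sqrt(sum((fp_a.get(k, 0) - fp_b.get(k, 0)) ** 2 for k in keys))


def rank_candidates(target_fp, candidates):
    """Return (name, distance) pairs sorted from most to least similar."""
    scored = [(name, euclidean_distance(target_fp, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical environment counts; real fingerprints would hold many more bits.
target = Counter({"[CH3]C": 2, "[CH2](C)O": 1, "[cH1](c)c": 4})
candidates = {
    "cand_1": Counter({"[CH3]C": 2, "[CH2](C)O": 1, "[cH1](c)c": 5}),
    "cand_2": Counter({"[CH3]C": 1, "[CH1](C)(C)N": 1}),
}

ranking = rank_candidates(target, candidates)
print(ranking)
```

Because the measure is a distance, the candidate listed first is the one expected to give the most similar NMR data.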
Example environment: [CH1]([CH3])(OC)[CH1](C)C
SMARTS
SMiles ARbitrary Target Specification (SMARTS) is a
language for specifying substructural patterns in
molecules.
[#6] any carbon atom
[CH3] methyl group
[n;!H0] pyrrole-type nitrogen
[#7,#8;!H0] hydrogen bond donor
Example environment: [cH1]([cH0](C)c)[cH1]c
7. Fingerprint Development
1. Generate all combinations of SMARTS code strings
B_i ( b_j ( R_k ) )_l
Where:
B_i = { [CH3], [CH2], [CH1], [cH1] }
b_j = { -, =, #, : }
R_k = { C, N, O, S, F, Cl, Br, I, c, n, o, s }
l = i - j + 1, l > 0
2. Extract all chemical environments up to three shells
from a large compound database
– The database contained about 4.6 million compounds
extracted from PubChem, yielding a total of 82 million
chemical environments
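Step 1 can be illustrated with a small enumeration. The sketch below is one possible reading of the template, limited to a single shell: the number of substituents is passed in explicitly as an assumption rather than derived from the index rule, and no chemical validity filtering is applied, so some generated strings will never match a real molecule. The function name `one_shell_smarts` is hypothetical.

```python
from itertools import combinations_with_replacement, product

CENTERS = ["[CH3]", "[CH2]", "[CH1]", "[cH1]"]
BONDS = ["-", "=", "#", ":"]
NEIGHBORS = ["C", "N", "O", "S", "F", "Cl", "Br", "I", "c", "n", "o", "s"]


def one_shell_smarts(center, n_substituents):
    """Enumerate one-shell SMARTS strings for a center atom.

    Each substituent is a bond symbol followed by a neighbor atom. Order of
    substituents does not matter, so combinations with replacement are used.
    """
    substituents = [bond + atom for bond, atom in product(BONDS, NEIGHBORS)]
    patterns = []
    for combo in combinations_with_replacement(substituents, n_substituents):
        *branches, last = combo
        # All but the last substituent go in parentheses as SMARTS branches.
        patterns.append(center + "".join(f"({s})" for s in branches) + last)
    return patterns


methyl_patterns = one_shell_smarts("[CH3]", 1)
methylene_patterns = one_shell_smarts("[CH2]", 2)
print(len(methyl_patterns), methyl_patterns[0])
print(len(methylene_patterns), methylene_patterns[0])
```

In practice the enumerated strings would be reduced to those actually observed in the compound database, as in step 2.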
8. Method Validation
Test set of 100 commercial compounds
Calculate pairwise Molecular Similarity between all
pairs (4950 pairs total)
Predict 1H, 13C, and construct 1H-13C HSQC data
Calculate Spectral Similarity (1D and 2D binning)
Compare Molecular Similarity vs Spectral Similarity
for all pairs
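The pair count and the 1D binning step can be sketched as follows. The peak lists, the 0 to 10 ppm range, and the 0.5 ppm bin width are assumptions chosen for illustration; the poster does not state the binning parameters it used.

```python
from itertools import combinations
from math import sqrt


def bin_1d(shifts_ppm, lo=0.0, hi=10.0, width=0.5):
    """Histogram of chemical shifts: one count per peak per bin."""
    n_bins = int(round((hi - lo) / width))
    hist = [0] * n_bins
    for shift in shifts_ppm:
        if lo <= shift < hi:
            index = min(int((shift - lo) / width), n_bins - 1)
            hist[index] += 1
    return hist


def euclidean(u, v):
    """Euclidean distance between two equal-length binned spectra."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# 100 compounds give 100 * 99 / 2 unordered pairs.
n_pairs = len(list(combinations(range(100), 2)))

# Hypothetical predicted 1H peak lists (ppm) for three compounds.
spectra = {
    "mol_a": [1.2, 3.6, 7.3],
    "mol_b": [1.3, 3.6, 7.3],
    "mol_c": [0.9, 2.1, 2.4],
}
binned = {name: bin_1d(peaks) for name, peaks in spectra.items()}
spectral_distance = {
    (a, b): euclidean(binned[a], binned[b]) for a, b in combinations(sorted(binned), 2)
}
print(n_pairs, spectral_distance)
```

The resulting spectral distances are what would be compared, pair by pair, against the molecular fingerprint distances.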
9. Molecular Similarity vs. Spectral Similarity
Similarity is measured as a distance. Smaller numbers mean greater similarity.
The molecular fingerprint contains 28,833 chemical environments (bits).
Spectral similarity was calculated using 2D binning and a Euclidean distance metric.
12. 1H-1D NMR Data
• Predicted similarity was calculated using a 1H-specific fingerprint containing 100,000 unique three-shell chemical environments (bits)
• Actual similarity was calculated from a 1D binning of the predicted 1H-1D spectra
• In both cases the metric used was the Euclidean distance between fingerprint bits
13. 13C-1D NMR Data
• Predicted similarity was calculated using a 13C-specific fingerprint containing 200,000 bits
• Actual similarity was calculated from a 1D binning of the predicted 13C-1D spectra
• In both cases the metric used was the Euclidean distance between fingerprint bits
14. 1H-13C HSQC 2D NMR Data
• Predicted similarity was calculated using an H-C correlation-specific fingerprint containing 50,000 bits
• Actual similarity was calculated from a 2D binning of the predicted 1H-13C HSQC spectra
• In both cases the metric used was the Euclidean distance between fingerprint bits
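For 2D data, spectral similarity is described elsewhere in this poster as using 2D binning with a Euclidean distance. A minimal sketch of that idea follows; the grid ranges, cell sizes, and peak lists are assumptions for illustration only, and `bin_2d` and `grid_distance` are hypothetical names.

```python
from math import sqrt


def bin_2d(peaks, h_range=(0.0, 10.0), c_range=(0.0, 160.0), h_width=0.5, c_width=10.0):
    """Count (1H ppm, 13C ppm) peaks per grid cell; returns a sparse dict."""
    grid = {}
    for h, c in peaks:
        if h_range[0] <= h < h_range[1] and c_range[0] <= c < c_range[1]:
            cell = (int((h - h_range[0]) / h_width), int((c - c_range[0]) / c_width))
            grid[cell] = grid.get(cell, 0) + 1
    return grid


def grid_distance(grid_a, grid_b):
    """Euclidean distance between two sparse 2D histograms."""
    cells = set(grid_a) | set(grid_b)
    return sqrt(sum((grid_a.get(k, 0) - grid_b.get(k, 0)) ** 2 for k in cells))


# Hypothetical predicted HSQC cross peaks as (1H ppm, 13C ppm).
peaks_a = [(1.2, 14.0), (3.6, 58.0), (7.3, 128.0)]
peaks_b = [(1.3, 15.0), (3.6, 58.0), (7.3, 128.5)]
peaks_c = [(1.2, 22.0), (3.6, 58.0)]

d_ab = grid_distance(bin_2d(peaks_a), bin_2d(peaks_b))
d_ac = grid_distance(bin_2d(peaks_a), bin_2d(peaks_c))
print(d_ab, d_ac)
```

Coarser cells make the comparison more tolerant of small prediction errors, at the cost of resolving fewer distinct environments.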
15. Test Set (Database Search)
(MW <= 250 Da, 1 CH3, 3 CH2, 1 CH, 4 Ar)
[Figure: structures and 2D NMR spectra (f2 and f1 axes in ppm) for ten test-set molecules, panels a to j, with a pairwise similarity map (Molecule A vs. Molecule B)]
16. Automated Structure Verification
Are Chemical Structure and NMR data consistent with each
other?
Procedure:
Predict NMR data from proposed structure
Compare to experimental data (1H, 1H-13C HSQC)
Calculate matching score
Not seeking full structure elucidation or accurate assignments
Why do this?
It is the best way to deal with large numbers of simple compounds (e.g.
libraries, reagents, etc.)
It leaves the interesting problems for manual analysis
17. ASV of Negative Control Structures
Test Set
10 Positive Control (PC) structures
5 Negative Control structures generated automatically
ASV run on all 6 structures against experimental NMR data (1H-1D and HSQC) [1]
[Figure: ASV Score (0.00 to 1.00) plotted against Molecular Similarity for positive controls PC-1 to PC-10]
[1] ASV was run by Phil Keyes at Lexicon Pharmaceuticals using the ACDLabs ASV system
20. ASV is a Binary Classifier
• The yellow band is a myth
• A Binary Classifier is a system that selects between
two options
• Binary classification is a well-understood, well-developed
area of statistical analysis with many metrics at our
disposal
• Used in many fields, including decision making,
machine learning, and signal detection theory
• Set your strategy (false positive/negative tolerant)
and live with it
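Treating ASV as a binary classifier means a score threshold turns each result into a pass or fail, which can then be tallied against the known controls. The sketch below uses made-up scores for one positive and five negative controls; the thresholds and the function names are illustrative assumptions, not values from this work.

```python
def confusion_counts(scores, is_correct, threshold):
    """Tally true/false positives and negatives at a given score threshold."""
    tp = fp = tn = fn = 0
    for score, correct in zip(scores, is_correct):
        passed = score >= threshold
        if passed and correct:
            tp += 1
        elif passed and not correct:
            fp += 1
        elif not passed and correct:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """True positive rate (sensitivity) and false positive rate."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical ASV scores: first entry is the positive control.
scores = [0.85, 0.70, 0.55, 0.40, 0.30, 0.10]
is_correct = [True, False, False, False, False, False]

lenient = rates(*confusion_counts(scores, is_correct, threshold=0.6))
strict = rates(*confusion_counts(scores, is_correct, threshold=0.8))
print(lenient, strict)
```

Sweeping the threshold and recording both rates gives the trade-off between tolerating false positives and tolerating false negatives.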
21. Summary
Developed a molecular similarity method predictive of
NMR data similarity for 1H-1D, 13C-1D and 1H-13C HSQC
data
Similarity calculation can be used for other purposes like
CASE studies if linked to a structure generator
The confidence level of an autoverification can be
calculated by challenging the system with negative
control structures of known similarity to the proposed
structure
22. Acknowledgments
Lexicon Pharmaceuticals
Giovanni Cianchetta
Phil Keyes
Modgraph
Jeff Seymour
Funding
MestreLab
Carlos Cobas
Chen Peng
Open Source Community
ACDLabs
Ryan Sasaki
Sergey Golotvin
OpenBabel