Benchmarking and Validation of JChem ECFP and FCFP Fingerprints
Roger Sayle, NextMove Software Ltd, Cambridge, UK
roger@nextmovesoftware.co.uk


   Abstract
1. Overview
The cornerstone of pharmaceutical chemistry is Crum Brown’s observation that similar compounds have similar therapeutic benefits. Cheminformatics tries to capture this insight by defining measures of similarity between the computer representations of two molecules, in the hope of capturing a medicinal chemist’s intuitive sense of “likeness” and thereby correlating with bioactivity.
This poster evaluates the chemical similarity measures offered by ChemAxon on a standard reference benchmark. Any such benchmark must by necessity be flawed; the similarity between two molecules is influenced by the framework by which they are compared [2]. However, a robust similarity measure should typically perform better on such benchmarks, whilst a weaker model of chemical similarity would be expected to perform worse (on average).

6. Fingerprint Saturation
A common failing of binary fingerprints is their inability to represent the number of times a feature (such as a path or substructure) occurs. The fingerprints for decane (C10), undecane (C11) and dodecane (C12) are typically identical, as are those for many protein and DNA sequences. A more powerful representation that solves these issues replaces occurrence bits with counts, turning binary fingerprints into occurrence histograms.
LINGOs similarity achieves better results on the Briem & Lessel benchmark by using counts instead of bits. However, as described in the Continuous Tanimoto section below, care has to be taken to use a suitable similarity measure for comparing histograms.
ChemAxon have announced that the upcoming release of JChem, version 5.5, will support ECFP and FCFP fingerprints with counts.
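
As a rough illustration of fingerprint saturation, the following Python sketch models a linear alkane as a bare chain of carbon atoms and counts its linear paths. It is a toy model only: hydrogens, hashing and bit folding are ignored, the 7-bond path limit is an assumption borrowed from the path-based defaults quoted in this poster, and the function name `chain_path_counts` is hypothetical rather than part of any ChemAxon API.

```python
from collections import Counter


def chain_path_counts(n_carbons, max_bonds=7):
    """Count the linear paths in an unbranched chain of n_carbons atoms.

    A path with k atoms (k - 1 bonds) occurs n_carbons - k + 1 times in
    the chain when each path is counted once regardless of direction.
    """
    counts = Counter()
    for k in range(1, max_bonds + 2):  # k atoms corresponds to k - 1 bonds
        occurrences = n_carbons - k + 1
        if occurrences > 0:
            counts["C" * k] = occurrences
    return counts


decane = chain_path_counts(10)
undecane = chain_path_counts(11)
dodecane = chain_path_counts(12)

# Binary view: only which paths occur. All three alkanes look the same.
same_bits = set(decane) == set(undecane) == set(dodecane)

# Histogram view: how often each path occurs. The alkanes now differ.
same_counts = decane == undecane == dodecane

print("identical binary features:", same_bits)
print("identical count histograms:", same_counts)
```

In this toy model the set of distinct paths stops growing once the chain is longer than the longest encoded path, which is why only the counts can distinguish the three alkanes.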
2. Briem & Lessel Benchmark
The benchmark employed in this evaluation is the commonly used Briem and Lessel benchmark [1]. This test assesses a method’s ability to identify near neighbours with the same biological activity from decoy molecules and molecules with different biological activities. Five classes of active compounds are used: 40 ACE inhibitors, 49 TXA antagonists, 110 HMG-CoA reductase inhibitors, 133 PAF antagonists and 48 5HT3 antagonists. In addition to these 380 active compounds, the data set contains 573 “random” MDDR compounds, for a total of 953 molecules. The benchmark proceeds by determining the 10 nearest neighbours for each of the 380 active compounds. The query is not considered a neighbour of itself. The score for each activity class is the fraction of these neighbours that have the same activity as the query. Finally, the overall score is the average of the scores of the five activity classes.

7. Continuous Tanimoto
Although there is universal agreement on how the Tanimoto coefficient should be interpreted for binary values, its application to continuous values, such as histogram counts, has been implemented differently by different authors [4,5]. Consider the two alternate definitions T0 and T1 given below.

$$T_0(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} \left( x_i^2 + y_i^2 - x_i y_i \right)} \qquad T_1(x, y) = \frac{\sum_{i=1}^{N} \min(x_i, y_i)}{\sum_{i=1}^{N} \max(x_i, y_i)}$$

Both definitions agree for binary valued vectors, and are guaranteed to return increasing fractional values between zero and one. Notice however that for x = {3} and y = {4}, T1 = 3/4 = 0.75 but T0 = 12/13 ≈ 0.923.
In experiments with LINGOs histograms, T1 was found to be superior (producing an improvement of ~0.9%) whereas T0 actually made the results worse (by ~3%).
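
The two definitions can be compared directly with a short Python sketch. The function names `tanimoto_t0` and `tanimoto_t1`, and the convention of returning 1.0 when both vectors are all zeros, are assumptions made for illustration and are not taken from any of the cited implementations.

```python
def tanimoto_t0(x, y):
    """Dot-product form: sum(x*y) / sum(x^2 + y^2 - x*y)."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a + b * b - a * b for a, b in zip(x, y))
    # Assumed convention: two all-zero vectors are treated as identical.
    return numerator / denominator if denominator else 1.0


def tanimoto_t1(x, y):
    """Min/max form: sum(min(x, y)) / sum(max(x, y))."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Assumed convention: two all-zero vectors are treated as identical.
    return numerator / denominator if denominator else 1.0


# Binary vectors: both definitions reduce to |X and Y| / |X or Y|.
bits_x = [1, 1, 0, 1]
bits_y = [1, 0, 1, 1]
binary_t0 = tanimoto_t0(bits_x, bits_y)  # 2 shared bits out of 4 set overall
binary_t1 = tanimoto_t1(bits_x, bits_y)

# Count vectors: the single-feature example from the text, x = {3}, y = {4}.
counts_t0 = tanimoto_t0([3], [4])  # 12/13, about 0.923
counts_t1 = tanimoto_t1([3], [4])  # 3/4 = 0.75

print(binary_t0, binary_t1)
print(counts_t0, counts_t1)
```

The sketch assumes non-negative counts and vectors of equal length; `zip` silently truncates to the shorter vector, so callers should check lengths if that matters.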
3. Fingerprint Methods
Historically, the similarity method underlying ChemAxon’s JChem search engine relied upon Chemical Fingerprints (“CF”). These are path-based fingerprints similar to Daylight fingerprints, which allow a number of variants depending upon parameters for the number of bits in the fingerprint, the longest bond path to encode and the number of bits set by each path. The “Marvin FP” below uses the generatemd defaults of 1024 bits, paths of up to 7 bonds, and 3 bits per path. The “JChem FP” below uses the JChemManager defaults of 512 bits, paths up to length 6 and 2 bits per path.
Recently, in v5.4, ChemAxon has added support for ECFP and FCFP fingerprints originally introduced by Scitegic, now Accelrys [8]. These are termed “ECFP_4” and “FCFP_8” below, indicating the ChemAxon implementation with diameter parameters of 4 bonds and 8 bonds respectively.
For reference comparison to other methods, also shown are LINGOs similarity [4], MACCS 166-bit keys [3], Daylight fingerprints and IBM’s patented InChI-based chemical similarity (US20080004810A1) as used in their SIMPLE product [7].

4. Tanimoto Coefficient
Many ways of comparing similarity between binary fingerprints have been discussed in the literature [9]; generally the best performing of these is the Tanimoto coefficient, $T = |X \cap Y| / |X \cup Y|$. This definition has almost magical properties, normalizing the differences between two feature sets by their sizes, intuitively “the fraction in common”. Experimentally this correlates well to the chemical and biological notion of what makes two molecules similar.

5. Evaluation Results
Bar chart of benchmark scores by method (vertical axis 0% to 90%):

Method       Score
ECFP_4       79.4%
FCFP_8       76.0%
LINGOs       75.6%
MACCS        66.2%
Marvin CF    65.2%
JChem CF     63.7%
Daylight     64.0%
IBM          42.0%

8. Conclusions
• ChemAxon’s Chemical Fingerprints perform comparably with other path and feature-based fingerprints (including MACCS 166-bit keys, Daylight fingerprints and PubChem/CACTVS fingerprints). All these methods perform equivalently.
• ECFP fingerprints, originally developed by Scitegic/Accelrys and as recently implemented by ChemAxon, perform exceptionally well on the standard Briem and Lessel benchmark.
• The announced ECFP histograms would be anticipated to set new records in 2D chemical similarity.

9. Acknowledgements
To Miklos Vargyas and Alex Allardyce for the invitation to present a poster at the ChemAxon UGM, to Peter Kovacs for JChem ECFP support and rapid bug fixing, and to AstraZeneca and Vertex Pharmaceuticals for their interest in 2D similarity.

10. Bibliography
1. Hans Briem and Uta F. Lessel, “In vitro and in silico Affinity Fingerprints: Finding Similarities beyond Structural Classes”, Perspectives in Drug Discovery and Design, Vol. 20, pp. 231-244, 2000.
2. Robert D. Brown and Yvonne C. Martin, “Use of Structure-Activity Data to Compare Structure-based Clustering Methods and Descriptors for Use in Compound Selection”, JCICS, Vol. 36, No. 3, pp. 572-584, 1996.
3. Joseph L. Durant, Burton A. Leland, Douglas R. Henry and James G. Nourse, “Reoptimization of MDL Keys for Use in Drug Discovery”, JCICS, Vol. 42, pp. 1273-1280, 2002.
4. J. Andrew Grant, James A. Haigh, Barry T. Pickup, Anthony Nicholls and Roger A. Sayle, “Lingos, Finite State Machines and Fast Similarity Searching”, JCIM, Vol. 46, No. 5, pp. 1912-1918, 2006.
5. Thierry Kogej, Ola Engkvist, Niklas Blomberg and Sorel Muresan, “Multifingerprint Based Similarity Searches for Targeted Class Compound Selection”, JCIM, Vol. 46, No. 3, pp. 1201-1213, 2006.
6. Steven W. Muchmore, Derek A. Debe, James T. Metz, Scott P. Brown, Yvonne C. Martin and Philip J. Hajduk, “Application of Belief Theory to Similarity Data Fusion for Use in Analog Searching and Lead Hopping”, JCIM, Vol. 48, No. 5, pp. 941-948, 2008.
7. James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen and Patricia Ordonez, “Mining Patents using Molecular Similarity Search”, Pacific Symposium on Biocomputing, Vol. 12, pp. 304-315, 2007.
8. David Rogers and Mathew Hahn, “Extended Connectivity Fingerprints”, JCIM, Vol. 50, No. 5, pp. 742-754, 2010.
9. Peter Willett, John M. Barnard and Geoffrey M. Downs, “Chemical Similarity Searching”, JCICS, Vol. 38, No. 6, pp. 983-996, 1998.



NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 

Recently uploaded (20)

Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 

Benchmarking and Validation of JChem ECFP and FCFP Fingerprints

Roger Sayle, NextMove Software Ltd, Cambridge, UK
roger@nextmovesoftware.co.uk

1. Overview
The cornerstone of pharmaceutical chemistry is Crum Brown's observation that similar compounds have similar therapeutic benefits. Cheminformatics tries to capture this insight by defining measures of similarity between the computer representations of two molecules, with the hope of capturing a medicinal chemist's intuitive sense of "likeness", and thereby correlate with bioactivity.
This poster evaluates the chemical similarity measures offered by ChemAxon on a standard reference benchmark. Any such benchmark must by necessity be flawed; the similarity between two molecules is influenced by the framework by which they are compared [2]. However, a robust similarity measure should typically perform better on such benchmarks, whilst a weaker model of chemical similarity would be expected to perform worse (on average).

2. Briem & Lessel Benchmark
The benchmark employed in this evaluation is the commonly used Briem and Lessel benchmark [1]. This test assesses a method's ability to identify near neighbours with the same biological activity from decoy molecules and molecules with different biological activities. Five classes of active compounds are used: 40 ACE inhibitors, 49 TXA antagonists, 110 HMG-CoA reductase inhibitors, 133 PAF antagonists and 48 5HT3 antagonists. In addition to these 380 active compounds, the data set contains 573 "random" MDDR compounds, for a total of 953 molecules. The benchmark proceeds by determining the 10 nearest neighbours for each of the 380 active compounds. The query is not considered a neighbour of itself. The score for each activity class is the fraction of these neighbours that have the same activity as the query. Finally, the overall score is the average of the scores of the five activity classes.

3. Fingerprint Methods
Historically, the similarity method underlying ChemAxon's JChem search engine relied upon Chemical Fingerprints ("CF"). These are path-based fingerprints similar to Daylight fingerprints, which allow a number of variants depending upon parameters for the number of bits in the fingerprint, the longest bond path to encode and the number of bits set by each path. The "Marvin FP" below uses the generatemd defaults of 1024 bits, paths of up to 7 bonds, and 3 bits per path. The "JChem FP" below uses the JChemManager defaults of 512 bits, paths up to length 6 and 2 bits per path.
Recently, in v5.4, ChemAxon has added support for ECFP and FCFP fingerprints originally introduced by Scitegic, now Accelrys [8]. These are termed "ECFP_4" and "FCFP_8" below, indicating the ChemAxon implementation with diameter parameters of 4 bonds and 8 bonds respectively.
For reference comparison to other methods, also shown are LINGOs similarity [4], MACCS 166-bit keys [3], Daylight fingerprints and IBM's patented InChI-based chemical similarity (US20080004810A1) as used in their SIMPLE product [7].

4. Tanimoto Coefficient
Many ways of comparing similarity between binary fingerprints have been discussed in the literature [9]; generally the best performing of these is the Tanimoto coefficient, $T = \frac{|X \cap Y|}{|X \cup Y|}$. This definition has almost magical properties, normalizing the differences between two feature sets by their sizes, intuitively "the fraction in common". Experimentally this correlates well to the chemical and biological notion of what makes two molecules similar.

5. Evaluation Results
[Figure: bar chart of benchmark score (axis 0% to 90%) for ECFP_4, FCFP_8, LINGOs, MACCS, Marvin CF, JChem CF, Daylight and IBM. Bar labels as extracted from the chart, in extraction order: 79.4%, 76.0%, 75.6%, 66.2%, 65.2%, 63.7%, 64.0%, 42.0%.]

6. Fingerprint Saturation
A common failing with binary fingerprints is caused by their inability to represent the number of times a feature (such as a path or substructure) occurs. The fingerprints for decane (C10), undecane (C11) and dodecane (C12) are typically identical, as are those for many protein and DNA sequences. A more powerful representation that solves these issues is to replace occurrence bits with counts, turning binary fingerprints into occurrence histograms.
LINGOs similarity achieves better results on the Briem & Lessel benchmark by using counts instead of bits. However, as described in the Continuous Tanimoto section below, care has to be taken to use a suitable similarity measure for comparing histograms.
ChemAxon have announced that the upcoming release of JChem, version 5.5, will support ECFP and FCFP fingerprints with counts.

7. Continuous Tanimoto
Although there is universal agreement on how the Tanimoto coefficient should be interpreted for binary values, its application to continuous values, such as histogram counts, has been implemented differently by different authors [4,5]. Consider the two alternate definitions T0 and T1 given below.

$$T_0(x, y) = \frac{\sum_{i}^{N} x_i y_i}{\sum_{i}^{N} \left( x_i^2 + y_i^2 - x_i y_i \right)} \qquad T_1(x, y) = \frac{\sum_{i}^{N} \min(x_i, y_i)}{\sum_{i}^{N} \max(x_i, y_i)}$$

Both definitions agree for binary valued vectors, and are guaranteed to return increasing fractional values between zero and one. Notice however that for x = { 3 } and y = { 4 }, T1 = 3/4 = 0.75 but T0 = 12/13 ≈ 0.923.
In experiments with LINGOs histograms, T1 was found to be superior (producing an improvement of ~0.9%) whereas T0 actually made the results worse (by ~3%).

8. Conclusions
• ChemAxon's Chemical Fingerprints perform comparably with other path and feature-based fingerprints (including MACCS 166-bit keys, Daylight fingerprints and PubChem/CACTVS fingerprints). All these methods perform equivalently.
• ECFP fingerprints, originally developed by Scitegic/Accelrys and as recently implemented by ChemAxon, perform exceptionally well on the standard Briem and Lessel benchmark.
• The announced ECFP histograms would be anticipated to set new records in 2D chemical similarity.

9. Acknowledgements
To Miklos Vargyas and Alex Allardyce for the invitation to present a poster at the ChemAxon UGM, to Peter Kovacs for JChem ECFP support and rapid bug fixing, and to AstraZeneca and Vertex Pharmaceuticals for their interest in 2D similarity.

10. Bibliography
1. Hans Briem and Uta F. Lessel, "In vitro and in silico Affinity Fingerprints: Finding Similarities beyond Structural Classes", Perspectives in Drug Discovery and Design, Vol. 20, pp. 231-244, 2000.
2. Robert D. Brown and Yvonne C. Martin, "Use of Structure-Activity Data to Compare Structure-based Clustering Methods and Descriptors for Use in Compound Selection", JCICS, Vol. 36, No. 3, pp. 572-582, 1996.
3. Joseph L. Durant, Burton A. Leland, Douglas R. Henry and James G. Nourse, "Reoptimization of MDL Keys for Use in Drug Discovery", JCIM, Vol. 42, pp. 1273-1280, 2002.
4. J. Andrew Grant, James A. Haigh, Barry T. Pickup, Anthony Nicholls and Roger A. Sayle, "Lingos, Finite State Machines and Fast Similarity Searching", JCIM, Vol. 46, No. 5, pp. 1912-1918, 2006.
5. Thierry Kogej, Ola Engkvist, Niklas Blomberg and Sorel Muresan, "Multifingerprint Based Similarity Searches for Targeted Class Compound Selection", JCIM, Vol. 46, No. 3, pp. 1201-1213, 2006.
6. Steven W. Muchmore, Derek A. Debe, James T. Metz, Scott P. Brown, Yvonne C. Martin and Philip J. Hajduk, "Application of Belief Theory to Similarity Data Fusion for Use in Analog Searching and Lead Hopping", JCIM, Vol. 48, No. 5, pp. 941-948, 2008.
7. James Rhodes, Stephen Boyer, Jeffrey Kreule, Ying Chen and Patricia Ordonez, "Mining Patents using Molecular Similarity Search", Pacific Symposium on Biocomputing, Vol. 12, pp. 304-315, 2007.
8. David Rogers and Mathew Hahn, "Extended Connectivity Fingerprints", JCIM, Vol. 50, No. 5, pp. 742-754, 2010.
9. Peter Willett, John M. Barnard and Geoffrey M. Downs, "Chemical Similarity Searching", JCICS, Vol. 38, No. 6, pp. 983-996, 1998.

NextMove Software Limited
Innovation Centre (Unit 23), Cambridge Science Park, Milton Road, Cambridge, England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com
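
As an illustration of the two continuous Tanimoto definitions T0 and T1 and of the Briem and Lessel scoring procedure described in the poster, the following is a minimal Python sketch. It is not the code used to produce the reported results. The function names, the representation of fingerprints as equal-length lists of integer counts, the use of `None` as the label for decoy molecules, and the convention that two all-zero vectors have similarity 1 are all assumptions made for this example.

```python
from fractions import Fraction


def tanimoto_t0(x, y):
    """T0: sum(x_i * y_i) / sum(x_i^2 + y_i^2 - x_i * y_i)."""
    num = sum(a * b for a, b in zip(x, y))
    den = sum(a * a + b * b - a * b for a, b in zip(x, y))
    # Assumption: two all-zero vectors are treated as identical.
    return Fraction(num, den) if den else Fraction(1)


def tanimoto_t1(x, y):
    """T1: sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    # Assumption: two all-zero vectors are treated as identical.
    return Fraction(num, den) if den else Fraction(1)


def benchmark_score(fingerprints, labels, similarity, k=10):
    """Score a similarity measure in the style of the benchmark.

    labels[i] is an activity class name, or None for a decoy.
    Each active compound is used as a query; its k most similar
    compounds (excluding itself) are retrieved. A class score is the
    fraction of retrieved neighbours sharing the query's class, and
    the overall score is the mean of the class scores.
    Ties are broken by input order (an arbitrary choice here).
    """
    tallies = {}  # class name -> (matching neighbours, neighbours retrieved)
    for q, q_label in enumerate(labels):
        if q_label is None:
            continue  # decoys are never used as queries
        others = [i for i in range(len(labels)) if i != q]
        others.sort(
            key=lambda i: similarity(fingerprints[q], fingerprints[i]),
            reverse=True,
        )
        nearest = others[:k]
        hits = sum(1 for i in nearest if labels[i] == q_label)
        found, total = tallies.get(q_label, (0, 0))
        tallies[q_label] = (found + hits, total + len(nearest))
    class_scores = {c: Fraction(f, t) for c, (f, t) in tallies.items()}
    overall = sum(class_scores.values()) / len(class_scores)
    return class_scores, overall


# The single-feature example from the poster: x = {3}, y = {4}.
print(tanimoto_t1([3], [4]))  # 3/4
print(tanimoto_t0([3], [4]))  # 12/13

# Both definitions agree on binary vectors.
print(tanimoto_t0([1, 1, 0, 1], [1, 0, 1, 1]))  # 1/2
print(tanimoto_t1([1, 1, 0, 1], [1, 0, 1, 1]))  # 1/2
```

Exact fractions are used so that the 3/4 and 12/13 values quoted in the poster can be checked without rounding. For the real data set, `benchmark_score` would be called with the 953 fingerprints, class labels for the 380 actives, and `k=10`.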