SlideShare a Scribd company logo
1 of 25
Download to read offline
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Recent improvements
to the rdkit
Roger Sayle
NextMove Software, Cambridge, UK
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Motivation: compound acquisition
• Given a existing screening collection of X
compounds, and with Y vendor compounds
available for purchase, how should I select
the next Z diverse compounds to buy.
• Typically, X is about 2M and Y is about 100M.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Rdkit’s maxminpicker
• Picking diverse compounds from large sets, 2014/08
• Optimizing Diversity Picking in the RDKit, 2014/08
• M. Ashton, J. Barnard, P. Willett et al., “Identification
of Diverse Database Subsets using Property-based
and Fragment-based Molecular Descriptors”, Quant.
Struct.-Act. Relat., Vol. 21, pp. 598-604, 2002.
• R. Kennard and L. Stone, “Computer aided design of
experiments”, Technometrics, Vol. 11, No. 1, pp. 137-
148, 1969.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Conceptual algorithm
• If no compounds have been picked so far, choose the
first picked compound at random.
• Repeatedly select the compound furthest from it’s
nearest picked compound [hence the name
maximum-minimum distance].
• Continue until the desired number of picked
compounds has been selected (or the pool of
available compounds has been exhausted).
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Selection visualization
Image Credits: Antoine Stevens, the ProspectR package on github
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Pros vs. cons
• Optimal picking is (NP-)Hard.
• Density vs. Diversity
– With biased data sets, random sampling follows density,
where MaxMin optimizes coverage.
– Picking != Clustering.
• Worst-case NN-1 bounds
– It’s possible to (estimate a) bound on the worst case
distance to nearest neighbor.
• Fraction of Data Set to be Sampled
• Scaling Performance of Algorithm
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Distance matrices are a bad idea
• Naïvely, one approach is to provide a picking
algorithm with a full distance matrix.
• If X compounds are picked already, and Y is the initial
pool to pick, from this requires (X+Y-1)*(X+Y)/2 time
and space.
• One goal is to keep memory requirements O(X+Y).
• An intelligent algorithm should be able to avoid
calculating any given distance more than once.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Taylor-butina vs. maxmin picking
• At a comparable distance threshold, MaxMin picks
are also a Leader (Tabu) clustering.
• As MaxMin picking doesn’t require a distance matrix
(NN list) it is significantly cheaper than Taylor-Butina.
• The distance bound for a given coverage is
discovered rather than specified (by trial and error).
– Taylor, JCICS, 1995, 35, pp. 59-67.
– Butina, JCICS, 1999, 39, pp. 747-750.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Artificial intelligence (MINI-MAX)
Los Alamos Chess (6x6 board)
White has 17 possible moves.
The 11 that don’t check, lose.
Five checks, lose the queen.
MAX
MIN
5 3 2 4 0 6 1 3
3
3
2 0 1
Alpha cut-offs allow us to prune the search tree.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Max-min picking
1
2 3 4 5 6 7 8 9
1
0
2
Candidate Pool
Picks
3
1
4 1 5 9 2 6 5 3
5 8 9 7 9 3 2 3 8 4
6
Minimums
Maximum
2 26 4 3 3 8 3 7
9 5 0 2 8 8 4 1 9 7
3 1 0 1 3 3 2 1 2 3
3
3
0
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Max-min picking
0 1
2 3 4 5 6 7 8 9
1
0
2
Candidate Pool
Picks
3
1
4 1 5 9 2 6 5 3
5 8 9 7 9 3 2 3 8 4
6
< Bounds
Maximum
2 26 4 3 3 8 3 7
9 5 0 2 8 8 4 1 9 7
3 1 0 1 3 3 2 3 2 3
3
3
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Rdkit implementation
• For each candidate we track its bound, the number
of picks used to calculate this bound, and a pointer
to the next pool candidate (a singly linked list).
• Memory usage is O(poolsize), less than INT_LIST.
• No need for a distance matrix or distmat cache.
• Linked list preserves tie splitting/skips picked items.
• The same data structure is (ab)used as a visit array in
an initial pass to remove firstPicks from the pool.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Performance improvement
• Using Andrew Dalke’s data set of 12386
benzodiazepines, using the first one thousand as the
existing screening library, select the next 18 most
diverse molecules from the remaining set (of 11386).
– Original RDKit Implementation:
• 224,688,273 FP comparisons 96.30s
– Pruning using alpha cut-off:
• 16,069,573 FP comparisons 6.79s (14x)
– Preserving bounds across picks:
• 1,047,982 FP comparisons 0.46s (209x)
– Timings on Dell laptop, using 2048 bit Morgan radius 2 FPs.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Significant similarity
• Different similarity measures require different
significance thresholds to distinguish random hits.
90%:0.528 95%:0.574 99%:0.656 90%:0.245 95%:0.271 99%:0.323
Thresholds for “random” in fingerprints the RDKit supports. 2013/10
In this talk I’ll ignore the influence of query and database composition on score significance.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Results: data set sampling
• ChEMBL 23 (1727053 compounds)
– 90%: 15107 compounds [5.45B FP cmps]
– 95%: 24430 compounds [9.82B FP cmps]
– 99%: 51463 compounds [23.9B FP cmps]
• eMolecules170601 (14328534 compounds)
– 90%: 19049 compounds [45.7B FP cmps]
– 95%: 32780 compounds [86.5B FP cmps]
– 99%: 80135 compounds [247B FP cmps]
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Astrazeneca vs. bayer
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Astrazeneca vs. bayer
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Screening library enhancement
• Selecting 1K compounds for purchase from
eMolecules (14M) to enhance ChEMBL (1.7M).
– Reading eMolecules: 4780s
– Reading ChEMBL: 821s
– Generating FPs: 1456s
– MaxMinPicker: 42773s[80B FP cmps]
• Fazit: Large scale diversity selection can be run
overnight on a single CPU core.
• Selecting the first 18 compounds takes only 399s
[715M FP cmps].
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Additional considerations #1
• Rule 1: It’s better to filter compounds for desirability,
physical properties, Lipinski, price before picking.
– Filtering is O(N), Picking isn’t.
• With library screening, it’s often preferable to have
hits with neighbors rather than singletons, to confirm
true positives, or provide initial QSAR; either
– Picking should be performed on the pool of compounds
that have neighbors
– Each pick confirmed to have a neighbour as it is chosen.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Additional considerations #2
• If required, it is possible to sample some areas of
chemical space (kinase inhibitors or fragments) with
higher density than others.
• For those that want to do fingerprint comparison
faster than RDKit, Andrew Dalke’s ChemFP is worth a
look.
• One possible refinement is to de-duplicate identical
fingerprints (efficiently done using a hash table).
• Finally, a better similarity measure should produce
better results (SmallWorld sales pitch goes here)
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Implementing full Kennard-stone
• The original Kennard-Stone algorithm requires the
first two picks to be the furthest apart in a data set.
• Traditionally, this has required O(N2) time.
• A break-through in theoretical computer science
since 2013 has changed this assumption:
– Pilu Crescenzi, Roberto Grossi, Michel Habib, Leonardo Lanzi, Andrea Marino, “On computing the
diameter of real-world undirected graphs”, Theoretical Computer Science, Vol. 514, pp. 84-95, 2013.
– Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter Kosters, Andrea Marino and Frank Takes,
“On the Solvability of the Six Degrees of Kevin Bacon Game: A Faster Graph Diameter and Radius
Computation Method”, 2014.
– Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter A. Kosters, Andrea Marino and Frank W.
Takes, “Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs
with an application to the six degrees of separation games”, Theoretical Computer Science, Vol. 586,
pp. 59-80, 2015.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Bounded vertex eccentricity
• The eccentricity of a vertex, v, is the greatest distance to any other vertex.
• The radius of a graph is the minimum eccentricity.
• The diameter of a graph is a maximum eccentricity.
Image Credit: Tapiocozzo, Wikipedia page on “Centrality”.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Sumsweep results
• Finding furthest two atoms of 327 heavy atoms in
the protein crambin (1CRN)
– Brute Force: 53301 comparisons
– SUMSWEEP: 6636 comparisons [k=21]
• Finding furthest apart (using Hamming distance on
2K MFP2 FPs) of 250251 compounds in NCI August
2000 data set.
– Brute Force: 31,312,656,375 comparisons
– SUMSWEEP: 43,278,372 comparisons [k=173]
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
conclusions
• A little computer science can make compound
purchasing decisions a whole lot faster.
6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
acknowledgements
• The RDKit crew
– Greg Landrum and Brian Kelley
• The NextMove Software crew
– John Mayfield and Noel O’Boyle
• Industrial Inspiration
– Darren Green, GSK
– Roman Affentranger, Roche
– Pat Walters, Relay Therapeutics
– Andrew Dalke, Dalke Scientific.

More Related Content

What's hot

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataGreg Landrum
 
A Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge SystemsA Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge Systemsaimsnist
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Designaimsnist
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Anubhav Jain
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsAnubhav Jain
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML modelaimsnist
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAnubhav Jain
 
A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...aimsnist
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsAnubhav Jain
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
 
Conducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectConducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectAnubhav Jain
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Anubhav Jain
 
The MGI and AI
The MGI and AIThe MGI and AI
The MGI and AIaimsnist
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Anubhav Jain
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Dataaimsnist
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applicationsaimsnist
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Anubhav Jain
 
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...aimsnist
 

What's hot (20)

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data
 
A Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge SystemsA Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge Systems
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 
A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
Conducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectConducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials Project
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...
 
The MGI and AI
The MGI and AIThe MGI and AI
The MGI and AI
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
 

Similar to Recent improvements to the RDKit

Study of relevancy, diversity, and novelty in recommender systems
Study of relevancy, diversity, and novelty in recommender systemsStudy of relevancy, diversity, and novelty in recommender systems
Study of relevancy, diversity, and novelty in recommender systemsChemseddine Berbague
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewVahid Mirjalili
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
multiarmed bandit.ppt
multiarmed bandit.pptmultiarmed bandit.ppt
multiarmed bandit.pptLPrashanthi
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptxRvishnupriya2
 
Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)Fatimakhan325
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptxNANDHINIS900805
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingLionel Briand
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 

Similar to Recent improvements to the RDKit (20)

Study of relevancy, diversity, and novelty in recommender systems
Study of relevancy, diversity, and novelty in recommender systemsStudy of relevancy, diversity, and novelty in recommender systems
Study of relevancy, diversity, and novelty in recommender systems
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
P1121133727
P1121133727P1121133727
P1121133727
 
multiarmed bandit.ppt
multiarmed bandit.pptmultiarmed bandit.ppt
multiarmed bandit.ppt
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptx
 
Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)
 
Clustering
ClusteringClustering
Clustering
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software Testing
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 

More from NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeNextMove Software
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsNextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patents
 

Recently uploaded

Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomyDrAnita Sharma
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...navyadasi1992
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 

Recently uploaded (20)

Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomy
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 

Recent improvements to the RDKit

  • 1. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Recent improvements to the rdkit Roger Sayle NextMove Software, Cambridge, UK
  • 2. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Motivation: compound acquisition • Given a existing screening collection of X compounds, and with Y vendor compounds available for purchase, how should I select the next Z diverse compounds to buy. • Typically, X is about 2M and Y is about 100M.
  • 3. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Rdkit’s maxminpicker • Picking diverse compounds from large sets, 2014/08 • Optimizing Diversity Picking in the RDKit, 2014/08 • M. Ashton, J. Barnard, P. Willett et al., “Identification of Diverse Database Subsets using Property-based and Fragment-based Molecular Descriptors”, Quant. Struct.-Act. Relat., Vol. 21, pp. 598-604, 2002. • R. Kennard and L. Stone, “Computer aided design of experiments”, Technometrics, Vol. 11, No. 1, pp. 137- 148, 1969.
  • 4. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Conceptual algorithm • If no compounds have been picked so far, choose the first picked compound at random. • Repeatedly select the compound furthest from it’s nearest picked compound [hence the name maximum-minimum distance]. • Continue until the desired number of picked compounds has been selected (or the pool of available compounds has been exhausted).
  • 5. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Selection visualization Image Credits: Antoine Stevens, the ProspectR package on github
  • 6. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Pros vs. cons • Optimal picking is (NP-)Hard. • Density vs. Diversity – With biased data sets, random sampling follows density, where MaxMin optimizes coverage. – Picking != Clustering. • Worst-case NN-1 bounds – It’s possible to (estimate a) bound on the worst case distance to nearest neighbor. • Fraction of Data Set to be Sampled • Scaling Performance of Algorithm
  • 7. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Distance matrices are a bad idea • Naïvely, one approach is to provide a picking algorithm with a full distance matrix. • If X compounds are picked already, and Y is the initial pool to pick, from this requires (X+Y-1)*(X+Y)/2 time and space. • One goal is to keep memory requirements O(X+Y). • An intelligent algorithm should be able to avoid calculating any given distance more than once.
  • 8. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Taylor-butina vs. maxmin picking • At a comparable distance threshold, MaxMin picks are also a Leader (Tabu) clustering. • As MaxMin picking doesn’t require a distance matrix (NN list) it is significantly cheaper than Taylor-Butina. • The distance bound for a given coverage is discovered rather than specified (by trial and error). – Taylor, JCICS, 1995, 35, pp. 59-67. – Butina, JCICS, 1999, 39, pp. 747-750.
  • 9. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Artificial intelligence (MINI-MAX) Los Alamos Chess (6x6 board) White has 17 possible moves. The 11 that don’t check, lose. Five checks, lose the queen. MAX MIN 5 3 2 4 0 6 1 3 3 3 2 0 1 Alpha cut-offs allow us to prune the search tree.
  • 10. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Max-min picking 1 2 3 4 5 6 7 8 9 1 0 2 Candidate Pool Picks 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 Minimums Maximum 2 26 4 3 3 8 3 7 9 5 0 2 8 8 4 1 9 7 3 1 0 1 3 3 2 1 2 3 3 3 0
  • 11. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Max-min picking 0 1 2 3 4 5 6 7 8 9 1 0 2 Candidate Pool Picks 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 < Bounds Maximum 2 26 4 3 3 8 3 7 9 5 0 2 8 8 4 1 9 7 3 1 0 1 3 3 2 3 2 3 3 3
  • 12. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Rdkit implementation • For each candidate we track its bound, the number of picks used to calculate this bound, and a pointer to the next pool candidate (a singly linked list). • Memory usage is O(poolsize), less than INT_LIST. • No need for a distance matrix or distmat cache. • Linked list preserves tie splitting/skips picked items. • The same data structure is (ab)used as a visit array in an initial pass to remove firstPicks from the pool.
  • 13. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Performance improvement • Using Andrew Dalke’s data set of 12386 benzodiazepines, using the first one thousand as the existing screening library, select the next 18 most diverse molecules from the remaining set (of 11386). – Original RDKit Implementation: • 224,688,273 FP comparisons 96.30s – Pruning using alpha cut-off: • 16,069,573 FP comparisons 6.79s (14x) – Preserving bounds across picks: • 1,047,982 FP comparisons 0.46s (209x) – Timings on Dell laptop, using 2048 bit Morgan radius 2 FPs.
  • 14. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Significant similarity • Different similarity measures require different significance thresholds to distinguish random hits. 90%:0.528 95%:0.574 99%:0.656 90%:0.245 95%:0.271 99%:0.323 Thresholds for “random” in fingerprints the RDKit supports. 2013/10 In this talk I’ll ignore the influence of query and database composition on score significance.
  • 15. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Results: data set sampling • ChEMBL 23 (1727053 compounds) – 90%: 15107 compounds [5.45B FP cmps] – 95%: 24430 compounds [9.82B FP cmps] – 99%: 51463 compounds [23.9B FP cmps] • eMolecules170601 (14328534 compounds) – 90%: 19049 compounds [45.7B FP cmps] – 95%: 32780 compounds [86.5B FP cmps] – 99%: 80135 compounds [247B FP cmps]
  • 16. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Astrazeneca vs. bayer
  • 17. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Astrazeneca vs. bayer
  • 18. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Screening library enhancement • Selecting 1K compounds for purchase from eMolecules (14M) to enhance ChEMBL (1.7M). – Reading eMolecules: 4780s – Reading ChEMBL: 821s – Generating FPs: 1456s – MaxMinPicker: 42773s[80B FP cmps] • Fazit: Large scale diversity selection can be run overnight on a single CPU core. • Selecting the first 18 compounds takes only 399s [715M FP cmps].
  • 19. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Additional considerations #1 • Rule 1: It’s better to filter compounds for desirability, physical properties, Lipinski, price before picking. – Filtering is O(N), Picking isn’t. • With library screening, it’s often preferable to have hits with neighbors rather than singletons, to confirm true positives, or provide initial QSAR; either – Picking should be performed on the pool of compounds that have neighbors – Each pick confirmed to have a neighbour as it is chosen.
  • 20. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Additional considerations #2 • If required, it is possible to sample some areas of chemical space (kinase inhibitors or fragments) with higher density than others. • For those that want to do fingerprint comparison faster than RDKit, Andrew Dalke’s ChemFP is worth a look. • One possible refinement is to de-duplicate identical fingerprints (efficiently done using a hash table). • Finally, a better similarity measure should produce better results (SmallWorld sales pitch goes here)
  • 21. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Implementing full Kennard-stone • The original Kennard-Stone algorithm requires the first two picks to be the furthest apart in a data set. • Traditionally, this has required O(N2) time. • A break-through in theoretical computer science since 2013 has changed this assumption: – Pilu Crescenzi, Roberto Grossi, Michel Habib, Leonardo Lanzi, Andrea Marino, “On computing the diameter of real-world undirected graphs”, Theoretical Computer Science, Vol. 514, pp. 84-95, 2013. – Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter Kosters, Andrea Marino and Frank Takes, “On the Solvability of the Six Degrees of Kevin Bacon Game: A Faster Graph Diameter and Radius Computation Method”, 2014. – Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter A. Kosters, Andrea Marino and Frank W. Takes, “Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs with an application to the six degrees of separation games”, Theoretical Computer Science, Vol. 586, pp. 59-80, 2015.
  • 22. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Bounded vertex eccentricity • The eccentricity of a vertex, v, is the greatest distance to any other vertex. • The radius of a graph is the minimum eccentricity. • The diameter of a graph is a maximum eccentricity. Image Credit: Tapiocozzo, Wikipedia page on “Centrality”.
  • 23. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 Sumsweep results • Finding furthest two atoms of 327 heavy atoms in the protein crambin (1CRN) – Brute Force: 53301 comparisons – SUMSWEEP: 6636 comparisons [k=21] • Finding furthest apart (using Hamming distance on 2K MFP2 FPs) of 250251 compounds in NCI August 2000 data set. – Brute Force: 31,312,656,375 comparisons – SUMSWEEP: 43,278,372 comparisons [k=173]
  • 24. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 conclusions • A little computer science can make compound purchasing decisions a whole lot faster.
  • 25. 6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017 acknowledgements • The RDKit crew – Greg Landrum and Brian Kelley • The NextMove Software crew – John Mayfield and Noel O’Boyle • Industrial Inspiration – Darren Green, GSK – Roman Affentranger, Roche – Pat Walters, Relay Therapeutics – Andrew Dalke, Dalke Scientific.