A benchmark of substructure searching tools given at the Cambridge Cheminformatics Network Meeting (May 27th). Slides have added annotated to aid description.
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
Substructure Search Face-off
1. Cambridge Cheminformatics Network Meeting, Cambridge, 27th May 2015
Substructure Searching
Face-off
John May and Roger Sayle
NextMove Software
Cambridge, UK
2. Molecular Identity
is foo the same as bar?
Substructure
is foo part of bar?
Chemical Similarity
how much is foo like bar?
“A grand unified theory: the three fundamental forces of Cheminformatics” - R Sayle et al.
SmallWorld: Efficient Maximum Common Subgraph Searching of Large Chemical Databases. ACS.
Aug 2012.
3. SUBSTRUCTURE SEARCH
Some of the many uses:
• Fragment-based lead discovery
• Murcko Scaffolds
• Matched pairs
• Functional groups, atom types
• Similarity Keys (MACCS/CACTVS)
• Toxicology - Tim Allen’s MIE models
Substructure searches are infrequent*…
…they can be slow - correlation/causation?
*~10x more unique similarity queries in Structure Query Collection
4. ANATOMY OF A SUBSEARCH
0008800040008020 CCO
1000000050000028 NCC
5008a04040008020 CCCCO
0000000040000024 CCl
0000080000040020 CBr
0048002000008020 C(=O)O
CCO query
0008000000008020 query-fp
1. Generate
fingerprint
0008800040008020 CCO
1000000050000028 NCC
5008a04040008020 CCCCO
0000000040000024 CCl
0000080000040020 CBr
0048002000008020 C(=O)O
2. Fingerprint
Scan
0008800040008020 CCO
1000000050000028 NCC
5008a04040008020 CCCCO
0000000040000024 CCl
0000080000040020 CBr
0048002000008020 C(=O)O
3. Atom-match
Variables
• fp type
• fp index type
• none
• linear
• sub-linear
• mol format
• line-notation
• binary
• atom-match
• VF2
• Ullmann
• storage
• flatfile
• RDMS
*CBr is actually filtered with this fingerprint but left to show the FP is generally not perfect (precision<1)
5. WHY CAN IT FEEL SLOW?
Identity and Similarity Search
different queries, similar amount of work
Substructure Search
different queries, potentially huge difference in work
Rough est. for ~10 mil compounds
• Identity, index canonical form – b-tree: ≤ 1 ms
• Similarity, index fingerprint – linear scan: ≤ 100 ms
• Substructure, index fingerprint + structure
• linear FP scan, atom-based match: ≥ 100 ms
• sub-linear FP scan, atom-based match: ≥ 5 ms
6. STRATEGIES?
Parallelisation
resource contention from multiple users
Limit query count (first 10)
must decide which ones are first
Limit time (60 s in eMolecules/SureChEMBL)
hits depend on hardware/workload
Cache or filter pathological queries
how to predict?
Limit screen count (fastsearch)
may get no hits?
7. EXISTING BENCHMARKS
Several tools publish isolated performance figures
and should be commended.
“what performance can you expect”
Alas different hardware, queries, and target
dataset limit meaningful conclusions and
comparisons.
Using the same hardware, queries, and target
dataset we present the execution efficiency (how
fast). Retrieval efficiency (hits) will be touched
upon.
9. TARGET DATABASE
Needed moderate size collection
ChEMBL ~1.5 mil, too easy, fits easily in a memory cache
PubChem Compound ~70 mil, too large for many tools
eMolecules 6,996,230 compounds
performed minimal cleanup
eMol IDs will be made available
10. QUERIES
Structure Query Collection (SQC)
3,488 BindingDB_substructure queries
3,342 connected, 121 fragments
User generated
Includes pathological queries
Excluded fragments – some consider benzene as bad:
c1ccccc1.c1ccccc1 slow
c1ccccc1.c1ccccc1.c1ccccc1 very slow
c1ccccc1.c1ccccc1.c1ccccc1.c1ccccc1.c1ccccc1 ouch
Andrew Dalke's BitBucket Project - https://bitbucket.org/dalke/sqc
11. QUERIES
Duplicates due to aromaticity or hydrogens
[H]C1=CN=CN=C1[H] pyrimidine
[H]C1=CN=CN=C1 pyrimidine
C1=CN=CN=C1 pyrimidine
C1=CC=CC=C1 benzene
c1ccccc1 benzene
3,140/3,342 unique - but used the non-unique set:
- Hydrogens change semantics of match
- Queries run unmodified but standardised* to queries-clean
when needed, tools that expect “SMARTS”
*aromatised, suppressed-H, atom-stereo removed
12. KEY Type Chemistry Storage FORMAT
arthor standalone NextMove Flatfile Serialised
bingo-pgsql cartridge Indigo PostgreSQL Serialised
bingo-orcl cartridge Indigo Oracle Serialised
bingo-nosql standalone Indigo Flatfile Serialised
fastsearch standalone Open Babel Flatfile SMILES1
jcart standalone/cartridge ChemAxon Oracle Serialised
jcart-st (1 thread) standalone/cartridge ChemAxon Oracle Serialised
mychem cartridge Open Babel MySQL
(MariaDB)
Serialised
pgchem cartridge Open Babel PostgreSQL Serialised
orchem cartridge CDK Oracle Serialised2
rdcart cartridge RDKit PostgreSQL Serialised
rdluc standalone RDKit Lucene SMILES
rdsmigrep standalone RDKit SMILES SMILES
tripod-ss standalone ChemAxon Oracle Cache SMILES3
1) FastSearch format depends on input. 2) OrChem serialises as a string rather than binary. 3) Tripod’s
Search Search uses molfiles by default. Underlined: queries-clean used
TOOLS EVALUATED
13. Name Type Chemistry Storage
Pinpoint Cartridge Dotmatics Oracle
Direct Cartridge BIOVIA/Accelrys/MDL Oracle
Accord Cartridge BIOVIA/Accelrys Oracle
ABCD Cartridge J&J Oracle
DayCart Cartridge Daylight Oracle/Postgres
Torus Cartridge Digital Chem Oracle
OpenEye Cartridge Cartridge OpenEye Oracle
MolCart Cartridge ICM MySQL
MolSQL Cartridge Scilligence SQL Server
Chord Cartridge OpenEye PostgreSQL
OpenChord Cartridge Open Babel / RDKit PostgreSQL
IC Cartridge Cartridge InfoChem Oracle
AUSPYX Cartridge Tripos Oracle
Oracle Cartridge Cartridge PerkinElemer Oracle
MORE CARTRIDGES
Some not externally available (ABCD) or being retired (Accord)
14. Name Type Chemistry Storage
ChemFP standalone Multiple FPS/BFPS in-memory
Similr standalone ? MongoDB
Data Warrior standalone Acetlion In-memory
Ambit - OpenTox standalone CDK/Ambit MySQL
MolecularLucene standalone CDK Lucene
Wikipedia CSE standalone Acetlion (JS) In-browser-memory
MOLDB6 standalone checkmol/matchmol MySQL
ChemSearch1
SQL SQL Oracle/PostgreSQL
SQLMOL SQL SQL MS SQL
MORE STANDALONES
ChemFP, Similr, MolecularLucene
- currently similarity only
MOLDB6
- index time for eMol est. at 22 days… we started last week
1: Golovin A and Henrick K. Chemical Substructure Search in SQL. JCIM. 2008. 49(1)
15. CHANGES FROM
“OUT OF THE BOX”
MyChem and RDKit Lucene
- fingerprint screen workaround
- OB FP2 and Avalon FP, no bits ≠ no hits
Tripod Search Server
- Load from SMILES instead of molfile
- Avoid aromaticity recalculation
- Used larger cache size
16. MEASUREMENT IS NOT PROPHECY1
Large number of variables
Time taken for this benchmark, with these tool versions,
on a machine, with this load. Tools were setup as per
user manuals (shared memory etc).
A “normal” desktop, £££ less than iPhone 6+.
CentOS 7
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
16 GB RAM
“Trust has no place in science” - NextMove Software will
make inputs available for others to test
1: http://benchmarksgame.alioth.debian.org/
18. EXCLUDED QueRIES
19 queries skipped in one or more tools
3,323/3,342 that were executed by all
2 Non-terminal hydrogens
C1N[H]OC[H]1
1 Invalid stereochemistry
CC[C@@H]
10 Atom classes
C1C[N:1]CNC1
6 Bond stereo not-implemented (should have cleaned)
N
1
HN
H2
C
NH
H
O
H2C
H
H3
C CH
36. OTHER CONSIDERATIONS
What are the hits? o-xylene example
How do they scale?
• Index, tree-based indexes will have many
more internal nodes
• Memory usage
• binary < SMILES < molfile
• fingerprint length
PubChem Compound currently ~10x eMolecules
37. BENCHMARK SUMMARY
Many tools have reasonable performance for most
queries but are hindered by their worst cases.
Minimal difference between Oracle vs PostgreSQL
OrChem vs JCart (slowest and fastest cartridges)
MyChem and pgchem will benefit from improved
query preparation (hydrogen handling)
Domain specific indexes
preferable to table selects.
Efficient IO
Image: http://timvdm.blogspot.co.uk/
38. FINGERPRINT PERFORMANCE
Much work has been done on optimising fingerprints
Screen performance measured for several of the tools
• only partially explains observed time differences
• occasional v. bad precision for some queries
• hydrogens are difficult to encode1
• Maybe False Negatives! – but they’re relative
Tool Type Screenout Precision
bingo Hashed Trees/Rings 98.63% 80.9%
rdcart Hashed Patterns 97.95% 59.9%
jcart,tripod-ss Hashed Paths 98.15% 59.7%
pgchem,mychem,fastsearch Hashed Paths/Rings 98.2% 57.3%
1. http://nextmovesoftware.com/blog/2015/02/16/
H
N
H
41. FINGERPRINTING MATERS
Arthor - 0% screen-out
23,290,449,670 atom-matches
Bingo - 98.63% screen-out
319,079,160 atom-matches
Can test how much different fingerprints mater…
Provide arthor a pre-computed list of screen hits
Times therefore presume instant bit screen
False negatives are relative to 0% screen-out.
44. CONCLUSIONS
Fast searching depends on many factors
• index data-structure
• fingerprint type
• molecule representation
• atom-match algorithm
Fingerprint optimisation does mater but is not the
performance bottleneck and can’t always help:
Generic patterns, a-a
Hydrogens, [H]N([H])C
45. ACKNOWLEDGEMENTS
Daniel Lowe and Noel O’Boyle, NextMove Software
Andrew Dalke
- assembling Structure Query Collection (SQC)
Michael Gilson and Tiqing Liu
- providing BindingDB (http://www.bindingdb.org/)
queries for SQC
Tool and cartridge developers
46. Further optimisation and keeping file open: 36 s including
inverted index FP screen, single threaded:
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
tripod-ss
orchem
fastsearch
mychem
pgchem
jcart-st
jcart
bingo-orcl
bingo-nosql
bingo-pgsql
arthor-fp
arthor
rdcart
rdluc
rdsmigrep
90%
99%
1s 10s 30s1m 5m
47. Some FALSE NEGATIVES
Most false negatives are due to Kekulisation/Aromaticity.
Typically when reading SMILES a tool will assign a Kekulé
form and then re-aromatise.
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
Kekulise
Aromatise
Input
Open Babel