SlideShare a Scribd company logo
1 of 48
Download to read offline
Cambridge Cheminformatics Network Meeting, Cambridge, 27th May 2015
Substructure Searching
Face-off
John May and Roger Sayle
NextMove Software
Cambridge, UK
Molecular Identity
is foo the same as bar?
Substructure
is foo part of bar?
Chemical Similarity
how much is foo like bar?
“A grand unified theory: the three fundamental forces of Cheminformatics” - R Sayle et al.
SmallWorld: Efficient Maximum Common Subgraph Searching of Large Chemical Databases. ACS.
Aug 2012.
SUBSTRUCTURE SEARCH
Some of the many uses:
• Fragment-based lead discovery
• Murcko Scaffolds
• Matched pairs
• Functional groups, atom types
• Similarity Keys (MACCS/CACTVS)
• Toxicology - Tim Allen’s MIE models
Substructure searches are infrequent*…
…they can be slow - correlation/causation?
*~10x more unique similarity queries in Structure Query Collection
ANATOMY OF A SUBSEARCH
0008800040008020 CCO
1000000050000028 NCC
5008a04040008020 CCCCO
0000000040000024 CCl
0000080000040020 CBr
0048002000008020 C(=O)O
CCO query
0008000000008020 query-fp
1. Generate
fingerprint
0008800040008020 CCO
1000000050000028 NCC
5008a04040008020 CCCCO
0000000040000024 CCl
0000080000040020 CBr
0048002000008020 C(=O)O
2. Fingerprint
Scan
0008800040008020 CCO
1000000050000028 NCC
5008a04040008020 CCCCO
0000000040000024 CCl
0000080000040020 CBr
0048002000008020 C(=O)O
3. Atom-match
Variables
• fp type
• fp index type
• none
• linear
• sub-linear
• mol format
• line-notation
• binary
• atom-match
• VF2
• Ullmann
• storage
• flatfile
• RDMS
*CBr is actually filtered with this fingerprint but left to show the FP is generally not perfect (precision<1)
WHY CAN IT FEEL SLOW?
Identity and Similarity Search
different queries, similar amount of work
Substructure Search
different queries, potentially huge difference in work
Rough est. for ~10 mil compounds
• Identity, index canonical form – b-tree: ≤ 1 ms
• Similarity, index fingerprint – linear scan: ≤ 100 ms
• Substructure, index fingerprint + structure
• linear FP scan, atom-based match: ≥ 100 ms
• sub-linear FP scan, atom-based match: ≥ 5 ms
STRATEGIES?
Parallelisation
resource contention from multiple users
Limit query count (first 10)
must decide which ones are first
Limit time (60 s in eMolecules/SureChEMBL)
hits depend on hardware/workload
Cache or filter pathological queries
how to predict?
Limit screen count (fastsearch)
may get no hits?
EXISTING BENCHMARKS
Several tools publish isolated performance figures
and should be commended.
“what performance can you expect”
Alas different hardware, queries, and target
dataset limit meaningful conclusions and
comparisons.
Using the same hardware, queries, and target
dataset we present the execution efficiency (how
fast). Retrieval efficiency (hits) will be touched
upon.
OUR BENCHMARK:
COUNT TOTAL
NUMBER OF HITS
TARGET DATABASE
Needed moderate size collection
ChEMBL ~1.5 mil, too easy, fits easily in a memory cache
PubChem Compound ~70 mil, too large for many tools
eMolecules 6,996,230 compounds
performed minimal cleanup
eMol IDs will be made available
QUERIES
Structure Query Collection (SQC)
3,488 BindingDB_substructure queries
3,342 connected, 121 fragments
User generated
Includes pathological queries
Excluded fragments – some consider benzene as bad:
c1ccccc1.c1ccccc1 slow
c1ccccc1.c1ccccc1.c1ccccc1 very slow
c1ccccc1.c1ccccc1.c1ccccc1.c1ccccc1.c1ccccc1 ouch
Andrew Dalke's BitBucket Project - https://bitbucket.org/dalke/sqc
QUERIES
Duplicates due to aromaticity or hydrogens
[H]C1=CN=CN=C1[H] pyrimidine
[H]C1=CN=CN=C1 pyrimidine
C1=CN=CN=C1 pyrimidine
C1=CC=CC=C1 benzene
c1ccccc1 benzene
3,140/3,342 unique - but used the non-unique set:
- Hydrogens change semantics of match
- Queries run unmodified but standardised* to queries-clean
when needed, tools that expect “SMARTS”
*aromatised, suppressed-H, atom-stereo removed
KEY Type Chemistry Storage FORMAT
arthor standalone NextMove Flatfile Serialised
bingo-pgsql cartridge Indigo PostgreSQL Serialised
bingo-orcl cartridge Indigo Oracle Serialised
bingo-nosql standalone Indigo Flatfile Serialised
fastsearch standalone Open Babel Flatfile SMILES1
jcart standalone/cartridge ChemAxon Oracle Serialised
jcart-st (1 thread) standalone/cartridge ChemAxon Oracle Serialised
mychem cartridge Open Babel MySQL
(MariaDB)
Serialised
pgchem cartridge Open Babel PostgreSQL Serialised
orchem cartridge CDK Oracle Serialised2
rdcart cartridge RDKit PostgreSQL Serialised
rdluc standalone RDKit Lucene SMILES
rdsmigrep standalone RDKit SMILES SMILES
tripod-ss standalone ChemAxon Oracle Cache SMILES3
1) FastSearch format depends on input. 2) OrChem serialises as a string rather than binary. 3) Tripod’s
Search Search uses molfiles by default. Underlined: queries-clean used
TOOLS EVALUATED
Name Type Chemistry Storage
Pinpoint Cartridge Dotmatics Oracle
Direct Cartridge BIOVIA/Accelrys/MDL Oracle
Accord Cartridge BIOVIA/Accelrys Oracle
ABCD Cartridge J&J Oracle
DayCart Cartridge Daylight Oracle/Postgres
Torus Cartridge Digital Chem Oracle
OpenEye Cartridge Cartridge OpenEye Oracle
MolCart Cartridge ICM MySQL
MolSQL Cartridge Scilligence SQL Server
Chord Cartridge OpenEye PostgreSQL
OpenChord Cartridge Open Babel / RDKit PostgreSQL
IC Cartridge Cartridge InfoChem Oracle
AUSPYX Cartridge Tripos Oracle
Oracle Cartridge Cartridge PerkinElemer Oracle
MORE CARTRIDGES
Some not externally available (ABCD) or being retired (Accord)
Name Type Chemistry Storage
ChemFP standalone Multiple FPS/BFPS in-memory
Similr standalone ? MongoDB
Data Warrior standalone Acetlion In-memory
Ambit - OpenTox standalone CDK/Ambit MySQL
MolecularLucene standalone CDK Lucene
Wikipedia CSE standalone Acetlion (JS) In-browser-memory
MOLDB6 standalone checkmol/matchmol MySQL
ChemSearch1
SQL SQL Oracle/PostgreSQL
SQLMOL SQL SQL MS SQL
MORE STANDALONES
ChemFP, Similr, MolecularLucene
- currently similarity only
MOLDB6
- index time for eMol est. at 22 days… we started last week
1: Golovin A and Henrick K. Chemical Substructure Search in SQL. JCIM. 2008. 49(1)
CHANGES FROM
“OUT OF THE BOX”
MyChem and RDKit Lucene
- fingerprint screen workaround
- OB FP2 and Avalon FP, no bits ≠ no hits
Tripod Search Server
- Load from SMILES instead of molfile
- Avoid aromaticity recalculation
- Used larger cache size
MEASUREMENT IS NOT PROPHECY1
Large number of variables
Time taken for this benchmark, with these tool versions,
on a machine, with this load. Tools were setup as per
user manuals (shared memory etc).
A “normal” desktop, £££ less than iPhone 6+.
CentOS 7
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
16 GB RAM
“Trust has no place in science” - NextMove Software will
make inputs available for others to test
1: http://benchmarksgame.alioth.debian.org/
3342 Queries
6996230 TARGETS
≤23,381,400,660
MATCHES LATER
EXCLUDED QueRIES
19 queries skipped in one or more tools
3,323/3,342 that were executed by all
2 Non-terminal hydrogens
C1N[H]OC[H]1
1 Invalid stereochemistry
CC[C@@H]
10 Atom classes
C1C[N:1]CNC1
6 Bond stereo not-implemented (should have cleaned)
N
1
HN
H2
C
NH
H
O
H2C
H
H3
C CH
RETRIEVAL DIFFERENCES
“We demand rigidly defined areas of doubt and
uncertainty!”
CH3
CH3
o-xylene
RETRIEVAL DIFFERENCES
“We demand rigidly defined areas of doubt and
uncertainty!”
Somewhere between 440,901 and 529,386
rdcart 500,126
rdluc 500,053
bingo (pgsql/orcl) 500,865
bingo (nosql) 498,427
tripod-ss 509,655
jcart 444,541
fastsearch 440,901
orchem 529,386
arthor 444,553
CH3
CH3
o-xylene
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
90%
99%
1s 10s 30s1m 5m
many queries,
running quick
<— all start here
some queries,
running slow
Where the line drops from
3323, that is the fastest
query. Where it falls bellow 0, that
was the slowest.
Area under curve ~ total elapsed
time for all queries. However
log,log plot so can be deceptive.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
rdcart
rdsmigrep
90%
99%
1s 10s 30s1m 5m
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
rdcart
rdsmigrep
90%
99%
1s 10s 30s1m 5m
5h9m
9d15h0m
90% < 10 s
75% < 1 s
No FP screen, all queries
take around 5 mins for rdsmigrep
(no sanitise). rdcart slowest query is
nearly as slow.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
rdcart
rdluc
rdsmigrep
90%
99%
1s 10s 30s1m 5m
5h9m
9d15h0m
1d18h11m FP scan degraded
perfomance for
these queries
rdcart worse case
is only better due to mol
serialisation. rdluc / rdsmigrep
both use SMILES.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
bingo-orcl
bingo-pgsql
90%
99%
1s 10s 30s1m 5m
1h54m
2h6m
Bingo Oracle and PostgreSQL
perform similarly. Small
difference is likely measurement
inaccuracies.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
bingo-orcl
bingo-nosql
bingo-pgsql
90%
99%
1s 10s 30s1m 5m
59m47s
1h54m
2h6m
The recent “NoSQL”
(flatfile) version is about twice
as fast – DB overhead?
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
jcart-st
jcart
90%
99%
1s 10s 30s1m 5m
2h41m
1h2m
jcart was the fastest
cartridge. By default it
uses all available processors…
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
jcart-st
jcart
bingo-orcl
bingo-pgsql
90%
99%
1s 10s 30s1m 5m
2h41m
1h2m
1h54m
2h6m
…limiting to a single thread (st)
we observed a similar performance
to the Bingo cartridges.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
fastsearch
mychem
pgchem
90%
99%
1s 10s 30s1m 5m
13h54m
1d1h52m
1d15h22m
pgchem and mychem have a
different shape curve. This
turns out to be hydrogen
sprouting for some queries.
fastsearch would also
show this but got queries-clean
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
fastsearch
mychem
pgchem
90%
99%
1s 10s 30s1m 5m
13h54m
1d1h52m
1d15h22m
linear scan diff
Both mychem and fastsearch do
a linear FP scan. mychem does this
via a SQL statement, fastsearch does
it sequentially from the file.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
fastsearch
mychem
pgchem
rdcart
90%
99%
1s 10s 30s1m 5m
5h9m
13h54m
pghcem and rdcart both use
a GIST, r-tree based index. This
provides sub-linear screening for
some queries.
The curves are similar and we
would therefore expect
pgchem to also be ~ 6 hrs with the
hydrogen issues resolved.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
tripod-ss
fastsearch
mychem
pgchem
rdcart
90%
99%
1s 10s 30s1m 5m
1d15h22m
1d5h48m
tripod-ss also uses a
tree-based index and we see a
similar performance curve.
The curve is shallower
because pgchem and mychem
use serialised molecules.
5h9m
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
tripod-ss
jcart-st
jcart
rdcart
90%
99%
1s 10s 30s1m 5m
1d5h48m
2h41m
1h2m
tripod-ss use the JChem atom-
match and path fingerprints. jcart
was fastest on overall time…
5h9m
…but sub-linear screens clearly win
for many easy queries. Notice
jcart ran ~1000 queries faster.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
tripod-ss
orchem
fastsearch
rdluc
90%
99%
1s 10s 30s1m 5m
1d15h22m
1d5h48m
3d1h13m
1d18h11m
Unfortunately orchem was the
slowest cartridge measured. The
slowest queries were similar to
others using line-notations.
Recent CDK improvements
could make it faster.
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
bingo-nosql
bingo-nosql-clean
90%
99%
1s 10s 30s1m 5m
59m47s
1h6m
Comparing queries-clean to
default query set.
OTHER CONSIDERATIONS
What are the hits? o-xylene example
How do they scale?
• Index, tree-based indexes will have many
more internal nodes
• Memory usage
• binary < SMILES < molfile
• fingerprint length
PubChem Compound currently ~10x eMolecules
BENCHMARK SUMMARY
Many tools have reasonable performance for most
queries but are hindered by their worst cases.
Minimal difference between Oracle vs PostgreSQL
OrChem vs JCart (slowest and fastest cartridges)
MyChem and pgchem will benefit from improved
query preparation (hydrogen handling)
Domain specific indexes
preferable to table selects.
Efficient IO
Image: http://timvdm.blogspot.co.uk/
FINGERPRINT PERFORMANCE
Much work has been done on optimising fingerprints
Screen performance measured for several of the tools
• only partially explains observed time differences
• occasional v. bad precision for some queries
• hydrogens are difficult to encode1
• Maybe False Negatives! – but they’re relative
Tool Type Screenout Precision
bingo Hashed Trees/Rings 98.63% 80.9%
rdcart Hashed Patterns 97.95% 59.9%
jcart,tripod-ss Hashed Paths 98.15% 59.7%
pgchem,mychem,fastsearch Hashed Paths/Rings 98.2% 57.3%
1. http://nextmovesoftware.com/blog/2015/02/16/
H
N
H
AN EfFICIENT
ATOM-MATCH
Arthor
(codename)
• Query optimisation
• Database preprocessing
• Efficient binary representation
• eMolecules, 1.2GiB
• Matching done on representation
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
orchem
fastsearch
jcart
bingo-nosql
bingo-pgsql
arthor
rdcart
90%
99%
1s 10s 30s1m 5m
27m31s
1h2m
59m47s
FINGERPRINTING MATERS
Arthor - 0% screen-out
23,290,449,670 atom-matches
Bingo - 98.63% screen-out
319,079,160 atom-matches
Can test how much different fingerprints mater…
Provide arthor a pre-computed list of screen hits
Times therefore presume instant bit screen
False negatives are relative to 0% screen-out.
FINGERPRINTING MATERS
Fingerprint Length Density Time False
NEGATIVES
arthor - - 27m35s -
+rdpath 2048 61.90% 2m52s
(x9.5)
61,915
(0.02%)
+maccs 166 29.08% 2m26s
(x11.2)
140,968,649
(57.63%)
+dalke353 353 6.59% 2m2s
(x13.5)
0
+obfp2 1021 12.55% 1m32s
(x17.9)
641,712
(0.26%)
+jpath 1024 23.36% 1m7s
(x24.5)
41,889
(0.01)
+indigo-sub 1600 23.09% 55s
(x29.6)
128,288
(0.05%)
+indigo-sub+ext 1624 22.48% 54s
(x30.1)
128,288
(0.05%)
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
90%
99%
1s
arthor
rdpath
maccs
dalke353
jpath
indigo-sub+ext
indigo-sub
obfp2
n.b. < 3323 because some
queries had no hits from the
fingerprint screen
CONCLUSIONS
Fast searching depends on many factors
• index data-structure
• fingerprint type
• molecule representation
• atom-match algorithm
Fingerprint optimisation does mater but is not the
performance bottleneck and can’t always help:
Generic patterns, a-a
Hydrogens, [H]N([H])C
ACKNOWLEDGEMENTS
Daniel Lowe and Noel O’Boyle, NextMove Software
Andrew Dalke
- assembling Structure Query Collection (SQC)
Michael Gilson and Tiqing Liu
- providing BindingDB (http://www.bindingdb.org/)
queries for SQC
Tool and cartridge developers
Further optimisation and keeping file open: 36 s including
inverted index FP screen, single threaded:
1
10
100
1000
100
101
102
103
104
105
106
107
QueriesNotFinished(n)
Time (ms)
3323
tripod-ss
orchem
fastsearch
mychem
pgchem
jcart-st
jcart
bingo-orcl
bingo-nosql
bingo-pgsql
arthor-fp
arthor
rdcart
rdluc
rdsmigrep
90%
99%
1s 10s 30s1m 5m
Some FALSE NEGATIVES
Most false negatives are due to Kekulisation/Aromaticity.
Typically when reading SMILES a tool will assign a Kekulé
form and then re-aromatise.
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
H3C
N
CH3
P
OO
Kekulise
Aromatise
Input
Open Babel
Some FALSE NEGATIVES
OO
CH3
OO
CH3
OO
CH3
OO
CH3
Kekulise Aromatise
JChem, RDKit
Open Babel, CDK, OEChem
Does have
o-xylene
substructure
Doesn’t have
o-xylene
substructure

More Related Content

What's hot

1 -val_gillet_-_ligand-based_and_structure-based_virtual_screening
1  -val_gillet_-_ligand-based_and_structure-based_virtual_screening1  -val_gillet_-_ligand-based_and_structure-based_virtual_screening
1 -val_gillet_-_ligand-based_and_structure-based_virtual_screening
Deependra Ban
 
Illumina GAIIx for high throughput sequencing
Illumina GAIIx for high throughput sequencingIllumina GAIIx for high throughput sequencing
Illumina GAIIx for high throughput sequencing
Cristian Cosentino, PhD
 
Drug Discovery and Development Using AI
Drug Discovery and Development Using AIDrug Discovery and Development Using AI
Drug Discovery and Development Using AI
Databricks
 
Blast bioinformatics
Blast bioinformaticsBlast bioinformatics
Blast bioinformatics
atmapandey
 
Universal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES stringUniversal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES string
baoilleach
 

What's hot (20)

Ncbi basic intro_v_pitt_kent_osu
Ncbi basic intro_v_pitt_kent_osuNcbi basic intro_v_pitt_kent_osu
Ncbi basic intro_v_pitt_kent_osu
 
1 -val_gillet_-_ligand-based_and_structure-based_virtual_screening
1  -val_gillet_-_ligand-based_and_structure-based_virtual_screening1  -val_gillet_-_ligand-based_and_structure-based_virtual_screening
1 -val_gillet_-_ligand-based_and_structure-based_virtual_screening
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
The Graph Database Universe: Neo4j Overview
The Graph Database Universe: Neo4j OverviewThe Graph Database Universe: Neo4j Overview
The Graph Database Universe: Neo4j Overview
 
The UCSC genome browser: A Neuroscience focused overview
The UCSC genome browser: A Neuroscience focused overviewThe UCSC genome browser: A Neuroscience focused overview
The UCSC genome browser: A Neuroscience focused overview
 
Matlab bioinformatics presentation
Matlab bioinformatics presentationMatlab bioinformatics presentation
Matlab bioinformatics presentation
 
Illumina GAIIx for high throughput sequencing
Illumina GAIIx for high throughput sequencingIllumina GAIIx for high throughput sequencing
Illumina GAIIx for high throughput sequencing
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGS
 
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine LearnGraphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
 
Drug Discovery and Development Using AI
Drug Discovery and Development Using AIDrug Discovery and Development Using AI
Drug Discovery and Development Using AI
 
03 data mining : data warehouse
03 data mining : data warehouse03 data mining : data warehouse
03 data mining : data warehouse
 
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
 
Blast bioinformatics
Blast bioinformaticsBlast bioinformatics
Blast bioinformatics
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
Universal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES stringUniversal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES string
 
Overview of cheminformatics
Overview of cheminformaticsOverview of cheminformatics
Overview of cheminformatics
 
Molecular Representation, Similarity and Search
Molecular Representation, Similarity and SearchMolecular Representation, Similarity and Search
Molecular Representation, Similarity and Search
 
NOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4jNOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4j
 
Cheminformatics, concept by kk sahu sir
Cheminformatics, concept by kk sahu sirCheminformatics, concept by kk sahu sir
Cheminformatics, concept by kk sahu sir
 
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
 

Viewers also liked

Sub floor code
Sub floor codeSub floor code
Sub floor code
jbusse
 
ソーシャルリクルーティングの必要性
ソーシャルリクルーティングの必要性ソーシャルリクルーティングの必要性
ソーシャルリクルーティングの必要性
Takeshi Sato
 
lecture 21
lecture 21lecture 21
lecture 21
sajinsc
 
Conformational analysis
Conformational analysisConformational analysis
Conformational analysis
Pinky Vincent
 

Viewers also liked (20)

Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?
 
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
 
Sub floor code
Sub floor codeSub floor code
Sub floor code
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
 
Is 20TB really Big Data?
Is 20TB really Big Data?Is 20TB really Big Data?
Is 20TB really Big Data?
 
20 Single Source Shorthest Path
20 Single Source Shorthest Path20 Single Source Shorthest Path
20 Single Source Shorthest Path
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
 
Mobile Monday Be Inspired Tour: New Era in Mobile and Finance
Mobile Monday Be Inspired Tour: New Era in Mobile and FinanceMobile Monday Be Inspired Tour: New Era in Mobile and Finance
Mobile Monday Be Inspired Tour: New Era in Mobile and Finance
 
ソーシャルリクルーティングの必要性
ソーシャルリクルーティングの必要性ソーシャルリクルーティングの必要性
ソーシャルリクルーティングの必要性
 
Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for Hadoop
 
Year 2 Organic Chemistry Mechanism and Stereochemistry Lecture 3 Conformation...
Year 2 Organic Chemistry Mechanism and Stereochemistry Lecture 3 Conformation...Year 2 Organic Chemistry Mechanism and Stereochemistry Lecture 3 Conformation...
Year 2 Organic Chemistry Mechanism and Stereochemistry Lecture 3 Conformation...
 
BEye product presentation
BEye product presentationBEye product presentation
BEye product presentation
 
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
 
lecture 21
lecture 21lecture 21
lecture 21
 
Strengthening Families 101
Strengthening Families 101Strengthening Families 101
Strengthening Families 101
 
Дмитрий Ветров. Математика больших данных: тензоры, нейросети, байесовский вы...
Дмитрий Ветров. Математика больших данных: тензоры, нейросети, байесовский вы...Дмитрий Ветров. Математика больших данных: тензоры, нейросети, байесовский вы...
Дмитрий Ветров. Математика больших данных: тензоры, нейросети, байесовский вы...
 
Underpinning Principles of Presentation
Underpinning Principles of PresentationUnderpinning Principles of Presentation
Underpinning Principles of Presentation
 
Conformational analysis
Conformational analysisConformational analysis
Conformational analysis
 
Repair of concrete members
Repair of concrete membersRepair of concrete members
Repair of concrete members
 
Rationally boost your symfony2 application with caching tips and monitoring
Rationally boost your symfony2 application with caching tips and monitoringRationally boost your symfony2 application with caching tips and monitoring
Rationally boost your symfony2 application with caching tips and monitoring
 

Similar to Substructure Search Face-off

OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
gwprice
 
survey_of_matrix_for_simulation
survey_of_matrix_for_simulationsurvey_of_matrix_for_simulation
survey_of_matrix_for_simulation
Jon Hand
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
J Singh
 

Similar to Substructure Search Face-off (20)

The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC Project
 
Efficient matching of multiple chemical subgraphs
Efficient matching of multiple chemical subgraphsEfficient matching of multiple chemical subgraphs
Efficient matching of multiple chemical subgraphs
 
Applyinga blockcentricapproach
Applyinga blockcentricapproachApplyinga blockcentricapproach
Applyinga blockcentricapproach
 
Oracle Database In-Memory Option in Action
Oracle Database In-Memory Option in ActionOracle Database In-Memory Option in Action
Oracle Database In-Memory Option in Action
 
In Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry Osborne
 
How to find what is making your Oracle database slow
How to find what is making your Oracle database slowHow to find what is making your Oracle database slow
How to find what is making your Oracle database slow
 
Oracle Database Cloud Performance Doag 2016
Oracle Database Cloud Performance Doag 2016Oracle Database Cloud Performance Doag 2016
Oracle Database Cloud Performance Doag 2016
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
 
Oracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethodsOracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethods
 
SQL for Elasticsearch
SQL for ElasticsearchSQL for Elasticsearch
SQL for Elasticsearch
 
Knetminer Backend Training, Nov 2018
Knetminer Backend Training, Nov 2018Knetminer Backend Training, Nov 2018
Knetminer Backend Training, Nov 2018
 
Sql server scalability fundamentals
Sql server scalability fundamentalsSql server scalability fundamentals
Sql server scalability fundamentals
 
XMLDB Building Blocks And Best Practices - Oracle Open World 2008 - Marco Gra...
XMLDB Building Blocks And Best Practices - Oracle Open World 2008 - Marco Gra...XMLDB Building Blocks And Best Practices - Oracle Open World 2008 - Marco Gra...
XMLDB Building Blocks And Best Practices - Oracle Open World 2008 - Marco Gra...
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
 
survey_of_matrix_for_simulation
survey_of_matrix_for_simulationsurvey_of_matrix_for_simulation
survey_of_matrix_for_simulation
 
Physical design-complete
Physical design-completePhysical design-complete
Physical design-complete
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
 
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
 
JEEConf 2016. Effectiveness and code optimization in Java applications
JEEConf 2016. Effectiveness and code optimization in  Java applicationsJEEConf 2016. Effectiveness and code optimization in  Java applications
JEEConf 2016. Effectiveness and code optimization in Java applications
 
Using AWR for SQL Analysis
Using AWR for SQL AnalysisUsing AWR for SQL Analysis
Using AWR for SQL Analysis
 

More from NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
NextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 

Recently uploaded

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 

Recently uploaded (20)

Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 

Substructure Search Face-off

  • 1. Cambridge Cheminformatics Network Meeting, Cambridge, 27th May 2015 Substructure Searching Face-off John May and Roger Sayle NextMove Software Cambridge, UK
  • 2. Molecular Identity is foo the same as bar? Substructure is foo part of bar? Chemical Similarity how much is foo like bar? “A grand unified theory: the three fundamental forces of Cheminformatics” - R Sayle et al. SmallWorld: Efficient Maximum Common Subgraph Searching of Large Chemical Databases. ACS. Aug 2012.
  • 3. SUBSTRUCTURE SEARCH Some of the many uses: • Fragment-based lead discovery • Murcko Scaffolds • Matched pairs • Functional groups, atom types • Similarity Keys (MACCS/CACTVS) • Toxicology - Tim Allen’s MIE models Substructure searches are infrequent*… …they can be slow - correlation/causation? *~10x more unique similarity queries in Structure Query Collection
  • 4. ANATOMY OF A SUBSEARCH 0008800040008020 CCO 1000000050000028 NCC 5008a04040008020 CCCCO 0000000040000024 CCl 0000080000040020 CBr 0048002000008020 C(=O)O CCO query 0008000000008020 query-fp 1. Generate fingerprint 0008800040008020 CCO 1000000050000028 NCC 5008a04040008020 CCCCO 0000000040000024 CCl 0000080000040020 CBr 0048002000008020 C(=O)O 2. Fingerprint Scan 0008800040008020 CCO 1000000050000028 NCC 5008a04040008020 CCCCO 0000000040000024 CCl 0000080000040020 CBr 0048002000008020 C(=O)O 3. Atom-match Variables • fp type • fp index type • none • linear • sub-linear • mol format • line-notation • binary • atom-match • VF2 • Ullmann • storage • flatfile • RDMS *CBr is actually filtered with this fingerprint but left to show the FP is generally not perfect (precision<1)
  • 5. WHY CAN IT FEEL SLOW? Identity and Similarity Search different queries, similar amount of work Substructure Search different queries, potentially huge difference in work Rough est. for ~10 mil compounds • Identity, index canonical form – b-tree: ≤ 1 ms • Similarity, index fingerprint – linear scan: ≤ 100 ms • Substructure, index fingerprint + structure • linear FP scan, atom-based match: ≥ 100 ms • sub-linear FP scan, atom-based match: ≥ 5 ms
  • 6. STRATEGIES? Parallelisation resource contention from multiple users Limit query count (first 10) must decide which ones are first Limit time (60 s in eMolecules/SureChEMBL) hits depend on hardware/workload Cache or filter pathological queries how to predict? Limit screen count (fastsearch) may get no hits?
  • 7. EXISTING BENCHMARKS Several tools publish isolated performance figures and should be commended. “what performance can you expect” Alas different hardware, queries, and target dataset limit meaningful conclusions and comparisons. Using the same hardware, queries, and target dataset we present the execution efficiency (how fast). Retrieval efficiency (hits) will be touched upon.
  • 9. TARGET DATABASE Needed moderate size collection ChEMBL ~1.5 mil, too easy, fits easily in a memory cache PubChem Compound ~70 mil, too large for many tools eMolecules 6,996,230 compounds performed minimal cleanup eMol IDs will be made available
  • 10. QUERIES Structure Query Collection (SQC) 3,488 BindingDB_substructure queries 3,342 connected, 121 fragments User generated Includes pathological queries Excluded fragments – some consider benzene as bad: c1ccccc1.c1ccccc1 slow c1ccccc1.c1ccccc1.c1ccccc1 very slow c1ccccc1.c1ccccc1.c1ccccc1.c1ccccc1.c1ccccc1 ouch Andrew Dalke's BitBucket Project - https://bitbucket.org/dalke/sqc
  • 11. QUERIES Duplicates due to aromaticity or hydrogens [H]C1=CN=CN=C1[H] pyrimidine [H]C1=CN=CN=C1 pyrimidine C1=CN=CN=C1 pyrimidine C1=CC=CC=C1 benzene c1ccccc1 benzene 3,140/3,342 unique - but used the non-unique set: - Hydrogens change semantics of match - Queries run unmodified but standardised* to queries-clean when needed, tools that expect “SMARTS” *aromatised, suppressed-H, atom-stereo removed
  • 12. KEY Type Chemistry Storage FORMAT arthor standalone NextMove Flatfile Serialised bingo-pgsql cartridge Indigo PostgreSQL Serialised bingo-orcl cartridge Indigo Oracle Serialised bingo-nosql standalone Indigo Flatfile Serialised fastsearch standalone Open Babel Flatfile SMILES1 jcart standalone/cartridge ChemAxon Oracle Serialised jcart-st (1 thread) standalone/cartridge ChemAxon Oracle Serialised mychem cartridge Open Babel MySQL (MariaDB) Serialised pgchem cartridge Open Babel PostgreSQL Serialised orchem cartridge CDK Oracle Serialised2 rdcart cartridge RDKit PostgreSQL Serialised rdluc standalone RDKit Lucene SMILES rdsmigrep standalone RDKit SMILES SMILES tripod-ss standalone ChemAxon Oracle Cache SMILES3 1) FastSearch format depends on input. 2) OrChem serialises as a string rather than binary. 3) Tripod’s Search Search uses molfiles by default. Underlined: queries-clean used TOOLS EVALUATED
  • 13. Name Type Chemistry Storage Pinpoint Cartridge Dotmatics Oracle Direct Cartridge BIOVIA/Accelrys/MDL Oracle Accord Cartridge BIOVIA/Accelrys Oracle ABCD Cartridge J&J Oracle DayCart Cartridge Daylight Oracle/Postgres Torus Cartridge Digital Chem Oracle OpenEye Cartridge Cartridge OpenEye Oracle MolCart Cartridge ICM MySQL MolSQL Cartridge Scilligence SQL Server Chord Cartridge OpenEye PostgreSQL OpenChord Cartridge Open Babel / RDKit PostgreSQL IC Cartridge Cartridge InfoChem Oracle AUSPYX Cartridge Tripos Oracle Oracle Cartridge Cartridge PerkinElemer Oracle MORE CARTRIDGES Some not externally available (ABCD) or being retired (Accord)
  • 14. Name Type Chemistry Storage ChemFP standalone Multiple FPS/BFPS in-memory Similr standalone ? MongoDB Data Warrior standalone Acetlion In-memory Ambit - OpenTox standalone CDK/Ambit MySQL MolecularLucene standalone CDK Lucene Wikipedia CSE standalone Acetlion (JS) In-browser-memory MOLDB6 standalone checkmol/matchmol MySQL ChemSearch1 SQL SQL Oracle/PostgreSQL SQLMOL SQL SQL MS SQL MORE STANDALONES ChemFP, Similr, MolecularLucene - currently similarity only MOLDB6 - index time for eMol est. at 22 days… we started last week 1: Golovin A and Henrick K. Chemical Substructure Search in SQL. JCIM. 2008. 49(1)
  • 15. CHANGES FROM “OUT OF THE BOX” MyChem and RDKit Lucene - fingerprint screen workaround - OB FP2 and Avalon FP, no bits ≠ no hits Tripod Search Server - Load from SMILES instead of molfile - Avoid aromaticity recalculation - Used larger cache size
  • 16. MEASUREMENT IS NOT PROPHECY1 Large number of variables Time taken for this benchmark, with these tool versions, on a machine, with this load. Tools were setup as per user manuals (shared memory etc). A “normal” desktop, £££ less than iPhone 6+. CentOS 7 Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz 16 GB RAM “Trust has no place in science” - NextMove Software will make inputs available for others to test 1: http://benchmarksgame.alioth.debian.org/
  • 18. EXCLUDED QueRIES 19 queries skipped in one or more tools 3,323/3,342 that were executed by all 2 Non-terminal hydrogens C1N[H]OC[H]1 1 Invalid stereochemistry CC[C@@H] 10 Atom classes C1C[N:1]CNC1 6 Bond stereo not-implemented (should have cleaned) N 1 HN H2 C NH H O H2C H H3 C CH
  • 19. RETRIEVAL DIFFERENCES “We demand rigidly defined areas of doubt and uncertainty!” CH3 CH3 o-xylene
  • 20. RETRIEVAL DIFFERENCES “We demand rigidly defined areas of doubt and uncertainty!” Somewhere between 440,901 and 529,386 rdcart 500,126 rdluc 500,053 bingo (pgsql/orcl) 500,865 bingo (nosql) 498,427 tripod-ss 509,655 jcart 444,541 fastsearch 440,901 orchem 529,386 arthor 444,553 CH3 CH3 o-xylene
  • 21. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 90% 99% 1s 10s 30s1m 5m many queries, running quick <— all start here some queries, running slow Where the line drops from 3323, that is the fastest query. Where it falls bellow 0, that was the slowest. Area under curve ~ total elapsed time for all queries. However log,log plot so can be deceptive.
  • 23. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 rdcart rdsmigrep 90% 99% 1s 10s 30s1m 5m 5h9m 9d15h0m 90% < 10 s 75% < 1 s No FP screen, all queries take around 5 mins for rdsmigrep (no sanitise). rdcart slowest query is nearly as slow.
  • 24. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 rdcart rdluc rdsmigrep 90% 99% 1s 10s 30s1m 5m 5h9m 9d15h0m 1d18h11m FP scan degraded perfomance for these queries rdcart worse case is only better due to mol serialisation. rdluc / rdsmigrep both use SMILES.
  • 25. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 bingo-orcl bingo-pgsql 90% 99% 1s 10s 30s1m 5m 1h54m 2h6m Bingo Oracle and PostgreSQL perform similarly. Small difference is likely measurement inaccuracies.
  • 26. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 bingo-orcl bingo-nosql bingo-pgsql 90% 99% 1s 10s 30s1m 5m 59m47s 1h54m 2h6m The recent “NoSQL” (flatfile) version is about twice as fast – DB overhead?
  • 27. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 jcart-st jcart 90% 99% 1s 10s 30s1m 5m 2h41m 1h2m jcart was the fastest cartridge. By default it uses all available processors…
  • 28. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 jcart-st jcart bingo-orcl bingo-pgsql 90% 99% 1s 10s 30s1m 5m 2h41m 1h2m 1h54m 2h6m …limiting to a single thread (st) we observed a similar performance to the Bingo cartridges.
  • 29. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 fastsearch mychem pgchem 90% 99% 1s 10s 30s1m 5m 13h54m 1d1h52m 1d15h22m pgchem and mychem have a different shape curve. This turns out to be hydrogen sprouting for some queries. fastsearch would also show this but got queries-clean
  • 30. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 fastsearch mychem pgchem 90% 99% 1s 10s 30s1m 5m 13h54m 1d1h52m 1d15h22m linear scan diff Both mychem and fastsearch do a linear FP scan. mychem does this via a SQL statement, fastsearch does it sequentially from the file.
  • 31. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 fastsearch mychem pgchem rdcart 90% 99% 1s 10s 30s1m 5m 5h9m 13h54m pghcem and rdcart both use a GIST, r-tree based index. This provides sub-linear screening for some queries. The curves are similar and we would therefore expect pgchem to also be ~ 6 hrs with the hydrogen issues resolved.
  • 32. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 tripod-ss fastsearch mychem pgchem rdcart 90% 99% 1s 10s 30s1m 5m 1d15h22m 1d5h48m tripod-ss also uses a tree-based index and we see a similar performance curve. The curve is shallower because pgchem and mychem use serialised molecules. 5h9m
  • 33. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 tripod-ss jcart-st jcart rdcart 90% 99% 1s 10s 30s1m 5m 1d5h48m 2h41m 1h2m tripod-ss use the JChem atom- match and path fingerprints. jcart was fastest on overall time… 5h9m …but sub-linear screens clearly win for many easy queries. Notice jcart ran ~1000 queries faster.
  • 34. 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 tripod-ss orchem fastsearch rdluc 90% 99% 1s 10s 30s1m 5m 1d15h22m 1d5h48m 3d1h13m 1d18h11m Unfortunately orchem was the slowest cartridge measured. The slowest queries were similar to others using line-notations. Recent CDK improvements could make it faster.
  • 36. OTHER CONSIDERATIONS What are the hits? o-xylene example How do they scale? • Index, tree-based indexes will have many more internal nodes • Memory usage • binary < SMILES < molfile • fingerprint length PubChem Compound currently ~10x eMolecules
  • 37. BENCHMARK SUMMARY Many tools have reasonable performance for most queries but are hindered by their worst cases. Minimal difference between Oracle vs PostgreSQL OrChem vs JCart (slowest and fastest cartridges) MyChem and pgchem will benefit from improved query preparation (hydrogen handling) Domain specific indexes preferable to table selects. Efficient IO Image: http://timvdm.blogspot.co.uk/
  • 38. FINGERPRINT PERFORMANCE Much work has been done on optimising fingerprints Screen performance measured for several of the tools • only partially explains observed time differences • occasional v. bad precision for some queries • hydrogens are difficult to encode1 • Maybe False Negatives! – but they’re relative Tool Type Screenout Precision bingo Hashed Trees/Rings 98.63% 80.9% rdcart Hashed Patterns 97.95% 59.9% jcart,tripod-ss Hashed Paths 98.15% 59.7% pgchem,mychem,fastsearch Hashed Paths/Rings 98.2% 57.3% 1. http://nextmovesoftware.com/blog/2015/02/16/ H N H
  • 39. AN EfFICIENT ATOM-MATCH Arthor (codename) • Query optimisation • Database preprocessing • Efficient binary representation • eMolecules, 1.2GiB • Matching done on representation
  • 41. FINGERPRINTING MATERS Arthor - 0% screen-out 23,290,449,670 atom-matches Bingo - 98.63% screen-out 319,079,160 atom-matches Can test how much different fingerprints mater… Provide arthor a pre-computed list of screen hits Times therefore presume instant bit screen False negatives are relative to 0% screen-out.
  • 42. FINGERPRINTING MATERS Fingerprint Length Density Time False NEGATIVES arthor - - 27m35s - +rdpath 2048 61.90% 2m52s (x9.5) 61,915 (0.02%) +maccs 166 29.08% 2m26s (x11.2) 140,968,649 (57.63%) +dalke353 353 6.59% 2m2s (x13.5) 0 +obfp2 1021 12.55% 1m32s (x17.9) 641,712 (0.26%) +jpath 1024 23.36% 1m7s (x24.5) 41,889 (0.01) +indigo-sub 1600 23.09% 55s (x29.6) 128,288 (0.05%) +indigo-sub+ext 1624 22.48% 54s (x30.1) 128,288 (0.05%)
  • 44. CONCLUSIONS Fast searching depends on many factors • index data-structure • fingerprint type • molecule representation • atom-match algorithm Fingerprint optimisation does mater but is not the performance bottleneck and can’t always help: Generic patterns, a-a Hydrogens, [H]N([H])C
  • 45. ACKNOWLEDGEMENTS Daniel Lowe and Noel O’Boyle, NextMove Software Andrew Dalke - assembling Structure Query Collection (SQC) Michael Gilson and Tiqing Liu - providing BindingDB (http://www.bindingdb.org/) queries for SQC Tool and cartridge developers
  • 46. Further optimisation and keeping file open: 36 s including inverted index FP screen, single threaded: 1 10 100 1000 100 101 102 103 104 105 106 107 QueriesNotFinished(n) Time (ms) 3323 tripod-ss orchem fastsearch mychem pgchem jcart-st jcart bingo-orcl bingo-nosql bingo-pgsql arthor-fp arthor rdcart rdluc rdsmigrep 90% 99% 1s 10s 30s1m 5m
  • 47. Some FALSE NEGATIVES Most false negatives are due to Kekulisation/Aromaticity. Typically when reading SMILES a tool will assign a Kekulé form and then re-aromatise. H3C N CH3 P OO H3C N CH3 P OO H3C N CH3 P OO H3C N CH3 P OO H3C N CH3 P OO Kekulise Aromatise Input Open Babel
  • 48. Some FALSE NEGATIVES OO CH3 OO CH3 OO CH3 OO CH3 Kekulise Aromatise JChem, RDKit Open Babel, CDK, OEChem Does have o-xylene substructure Doesn’t have o-xylene substructure