1. Chemical similarity using multi-terabyte graph databases: 68 billion nodes and counting
Roger Sayle, John Mayfield and Noel O’Boyle
NextMove Software, Cambridge, UK
SCI, What can Big Data do for Chemistry?, London, UK, Wednesday 11th October 2017
2. What can big data do for chemistry?
1. Text Mining and Reaction Analytics
2. Scalable (AI) Algorithms for Big Data
3. Graph Databases for 2D Chemical Similarity
3. Part 1: Text mining and reaction analytics
4. Automated chemical text mining
5. Extracting melting points (MPs) and reactions
6. Quantity has a quality of its own
• “This analysis demonstrates that models developed
using text-mined MP data from PATENTS provide an
excellent prediction performance, similar or even
significantly better than the results based on
manually curated data used in previous studies”.
7. Who let the dogs out?
• Hartenfeller and Schneider 2011-2012 describe 58 unique
reactions, 34 of which are ring forming.
• “A Collection of Robust Organic Synthesis Reactions for In Silico Molecule Design”, J. Chem. Inf. Model. 51(12), pp. 3093-3098, 2011.
• “DOGS: Reaction-Driven de novo Design of Bioactive Compounds”, PLOS Computational Biology, 8(2), February 2012.
[Pie chart of reaction classes. Legend, in extracted order: Heteroatom alkylation and arylation; Acylation and related processes; C-C bond formations; Heterocycle formation; Protections; Deprotections; Reductions; Oxidations; Functional group conversion; Functional group addition; Resolution. Slice labels, in extracted order: 34%, 17%, 5%, 2%, 3%, 6%, 10%, 1%, 15%, 2%, 5%.]
8. Who let the dogs out?
• Big Data can determine the utility and scope of reactions.
9. Part 2: Scalable artificial intelligence algorithms for big data
10. Motivation: compound acquisition
• Given an existing screening collection of X
compounds, and with Y vendor compounds
available for purchase, how should I select
the next Z diverse compounds to buy?
• Typically, X is about 2M and Y is about 170M.
• Previously ~O(N^3); now replaced with << O(N^2).
11. RDKit’s MaxMinPicker
• RDKit’s original MaxMinPicker is described in a pair of
blog posts by Greg Landrum:
– Picking diverse compounds from large sets, 2014/08
– Optimizing Diversity Picking in the RDKit, 2014/08
• M. Ashton, J. Barnard, P. Willett et al., “Identification of
Diverse Database Subsets using Property-based and
Fragment-based Molecular Descriptors”, Quant. Struct.-Act.
Relat., Vol. 21, pp. 598-604, 2002.
• R. Kennard and L. Stone, “Computer aided design of
experiments”, Technometrics, 11(1), pp. 137-148, 1969.
12. Selection visualization
Image Credits: Antoine Stevens, the ProspectR package on github
13. Conceptual algorithm
• If no compounds have been picked so far, choose the
first picked compound at random.
• Repeatedly select the compound furthest from its
nearest picked compound [hence the name
maximum-minimum distance].
• Continue until the desired number of picked
compounds has been selected (or the pool of
available compounds has been exhausted).
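The three steps above can be sketched directly in Python. This is a minimal, naive illustration and not RDKit's implementation: the function name `maxmin_pick`, its parameters, and the one-dimensional toy "compounds" are all assumptions made for the example. Any distance function (for instance one minus Tanimoto similarity) could be passed in.

```python
import random

def maxmin_pick(pool, distance, n_picks, picked=None, seed=0):
    """Naive MaxMin selection; returns the newly picked items in order."""
    rng = random.Random(seed)
    picked = list(picked or [])      # e.g. an existing screening collection
    candidates = list(pool)
    result = []
    # Step 1: with nothing picked so far, start from a random compound.
    if not picked and candidates and n_picks > 0:
        first = candidates.pop(rng.randrange(len(candidates)))
        picked.append(first)
        result.append(first)
    # Steps 2 and 3: repeatedly take the candidate whose nearest picked
    # compound is furthest away, until enough are picked or the pool is empty.
    while candidates and len(result) < n_picks:
        best = max(
            range(len(candidates)),
            key=lambda i: min(distance(candidates[i], p) for p in picked),
        )
        choice = candidates.pop(best)
        picked.append(choice)
        result.append(choice)
    return result

# Toy example: points on a line, with 0.0 already in the collection.
points = [0.0, 1.0, 2.0, 9.0, 10.0]
picks = maxmin_pick(points, lambda a, b: abs(a - b), 2, picked=[0.0])
print(picks)
```

Every round of this version recomputes the minimum for every candidate against every pick, which is what makes the naive form expensive on large pools.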
14. Artificial intelligence (MINI-MAX)
Los Alamos Chess (6x6 board)
White has 16 possible moves.
The 10 that don’t check, lose.
Five checks, lose the queen.
[Figure: two-level game tree. MAX root value: 3. MIN node values: 3, 2, 0, 1. Leaf values: 5 3 2 4 0 6 1 3.]
15. Artificial intelligence (MINI-MAX)
Alpha cut-offs allow us to prune the search tree.
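A small sketch of minimax with alpha-beta cut-offs may help here. It uses the leaf values that appear in the slide text (5 3 2 4 0 6 1 3); grouping them into four MIN nodes of two leaves each is an assumption about the figure, and the function name `alphabeta` is chosen for this example.

```python
def alphabeta(node, maximizing, alpha, beta, visited):
    """Minimax value of a nested-list game tree with alpha-beta pruning."""
    if not isinstance(node, list):       # leaf: a static evaluation score
        visited.append(node)
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:            # MIN would never allow this line
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if beta <= alpha:                # MAX already has a better option
            break
    return value

# MAX root over four MIN nodes (assumed grouping of the slide's leaves).
tree = [[5, 3], [2, 4], [0, 6], [1, 3]]
visited = []
root_value = alphabeta(tree, True, float("-inf"), float("inf"), visited)
print(root_value, visited)
```

Once the first MIN node has established a value of 3 for MAX, each later MIN node can be abandoned as soon as one of its leaves scores 3 or less, so only five of the eight leaves are evaluated.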
16. Classic max-min picking
[Figure: matrix of distances between the Candidate Pool and the Picks, with a row of per-candidate Minimums and the resulting Maximum.]
17. New max-min picking
[Figure: matrix of distances between the Candidate Pool and the Picks, with a row of per-candidate upper bounds ("< Bounds") and the current Maximum.]
18. New max-min picking
[Figure: next step of the same animation; the Candidate Pool vs. Picks distance matrix with updated "< Bounds" row and Maximum.]
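One way to read the "< Bounds" row is as lazy evaluation: each candidate carries an upper bound on its minimum distance to the picks, and a candidate is abandoned as soon as that bound can no longer beat the best candidate found so far (the same idea as an alpha cut-off). The sketch below is an illustration under that reading, not the RDKit source; `lazy_maxmin_pick` and its bookkeeping arrays are hypothetical names, and distances are assumed to be non-negative.

```python
def lazy_maxmin_pick(pool, distance, n_picks, picked):
    """MaxMin picking that skips comparisons using per-candidate bounds."""
    picked = list(picked)
    bound = [float("inf")] * len(pool)   # upper bound on min distance to picks
    seen = [0] * len(pool)               # how many picks each candidate has met
    alive = set(range(len(pool)))
    result, comparisons = [], 0
    while alive and len(result) < n_picks:
        best_i, best_d = None, -1.0
        for i in sorted(alive):
            # A bound only ever shrinks, so stop once it cannot beat best_d.
            while seen[i] < len(picked) and bound[i] > best_d:
                bound[i] = min(bound[i], distance(pool[i], picked[seen[i]]))
                seen[i] += 1
                comparisons += 1
            # Only a fully evaluated candidate has an exact minimum.
            if seen[i] == len(picked) and bound[i] > best_d:
                best_i, best_d = i, bound[i]
        alive.remove(best_i)
        picked.append(pool[best_i])
        result.append(pool[best_i])
    return result, comparisons

points = [0.0, 1.0, 2.0, 9.0, 10.0]
lazy_picks, n_cmps = lazy_maxmin_pick(points, lambda a, b: abs(a - b), 3, [0.0])
print(lazy_picks, n_cmps)
```

On this toy input the picks match the naive algorithm, while the comparison count drops from 22 (5×1 + 4×2 + 3×3) to 11; the saving grows as the number of picks increases.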
19. Screening library enhancement #1
• Selecting 1K compounds for purchase from
eMolecules (14M) to enhance ChEMBL 23 (1.7M).
– Reading eMolecules: 4780s
– Reading ChEMBL: 821s
– Generating FPs: 1456s
– MaxMinPicker: 42773s [80B FP cmps]
• Selecting the first 18 compounds takes only 399s
[715M FP cmps].
• Conclusion: Large-scale diversity selection can be run
overnight on a single CPU core.
20. Screening library enhancement #2
• Selecting 40 compounds for purchase from Enamine
REAL 2017 (171M) to enhance ChEMBL 23 (1.7M).
– Reading mols/FP gen: 77194s + 1204s
– 1st compound: 181.32B FP cmps (1/82750).
– 10th compound: 301.67B FP cmps.
– 40th compound: 438.71B FP cmps.
• A traditional distance matrix requires 60 petabytes of
storage, and 1.5E16 FP comparisons (15 quadrillion).
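The distance-matrix figures can be roughly reproduced with a back-of-the-envelope calculation. The sketch assumes the matrix covers only the 171M Enamine compounds, stores each unordered pair once, and uses a 4-byte value per distance; none of these details are stated on the slide.

```python
n = 171_000_000                  # Enamine REAL 2017 compounds (from the slide)
pairs = n * (n - 1) // 2         # unique unordered pairs = FP comparisons
bytes_per_distance = 4           # ASSUMPTION: one 32-bit value per pair
storage_pb = pairs * bytes_per_distance / 1e15

print(f"{pairs:.2e} comparisons, {storage_pb:.1f} PB")
```

Under these assumptions the result is about 1.46E16 comparisons and a little under 60 PB, in line with the rounded figures quoted above.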
21. Part 3: Large graph databases for 2D chemical similarity searching
22. Fighting big data with bigger data
• The real challenge of Big Data is scalability.
• Traditional chemical similarity searching using binary
fingerprints scales linearly, as O(N).
– If a search of 1M compounds takes 1 second, then…
– ChEMBL takes 2s, PubChem takes 90s, Enamine 171s.
• Here we describe the use of a sublinear-scaling
search method over a database that is larger by a roughly
constant factor (perhaps 1K-1M times).
• As data set sizes increase, these approaches make
traditional methods increasingly uncompetitive.
23. SmallWorld chemical space
Graph search (GED) of 68 billion subgraphs vs. 340 million molecules.
26. Graph edit distance
• Graph Edit Distance (GED) is the minimum number of
edit operations required to transform one graph into
another.
– Alberto Sanfeliu and K.S. Fu, “A Distance Measure between
Attributed Relational Graphs for Pattern Recognition”, IEEE
Transactions on Systems, Man, and Cybernetics (SMC), Vol.
13, No. 3, pp. 353-362, 1983.
• Edit operations consist of insertions, deletions and
substitutions of nodes and edges (atoms and bonds).
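To make the edit operations concrete, the sketch below counts unit-cost insertions, deletions and substitutions between two small labeled graphs whose atoms have already been aligned (they share node ids). This is an assumption made to keep the example short: with a fixed alignment the count is only an upper bound on the true GED, which minimises over all alignments. The helper names `edit_cost` and `ring` are hypothetical, and this is not SmallWorld's scoring scheme.

```python
def edit_cost(atoms_a, bonds_a, atoms_b, bonds_b):
    """Unit-cost edit count between two graphs that share node ids.

    atoms_*: dict of node id -> element symbol
    bonds_*: dict of frozenset({i, j}) -> bond label
    """
    cost = 0
    for node in atoms_a.keys() | atoms_b.keys():
        if node not in atoms_a or node not in atoms_b:
            cost += 1                    # atom insertion or deletion
        elif atoms_a[node] != atoms_b[node]:
            cost += 1                    # atom substitution
    for bond in bonds_a.keys() | bonds_b.keys():
        if bond not in bonds_a or bond not in bonds_b:
            cost += 1                    # bond insertion or deletion
        elif bonds_a[bond] != bonds_b[bond]:
            cost += 1                    # bond substitution
    return cost

def ring(symbols, label="aromatic"):
    """Build a single ring with the given atom symbols."""
    atoms = dict(enumerate(symbols))
    size = len(symbols)
    bonds = {frozenset({i, (i + 1) % size}): label for i in range(size)}
    return atoms, bonds

benzene = ring("CCCCCC")
pyridine = ring("NCCCCC")
chloro_atoms, chloro_bonds = ring("CCCCCC")
chloro_atoms[6] = "Cl"                           # extra substituent atom
chloro_bonds[frozenset({0, 6})] = "single"       # and the bond attaching it

print(edit_cost(*benzene, *pyridine))                  # one atom substitution
print(edit_cost(*benzene, chloro_atoms, chloro_bonds)) # one atom + one bond
```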
27. Example edit operations
Benzene Pyridine
Chlorobenzene Fluorobenzene
Benzoxazole Benzothiazole
Benzothiazole
28. Example edit operations
Benzene Cyclohexane
Thiazole Tetrahydrofuran
Histidine Histidine Zwitterion
29. Example edit operations
Ticlopidine Clopidogrel
Penicillin G Amoxicillin
30. Example edit operations
Sildenafil (Viagra) Vardenafil (Levitra)
Sumatriptan (Imitrex) Zolmitriptan (Zomig)
31. Advantages over fingerprints
• FP similarity based on “local” substructures.
• FP saturation of features/Chemical Space.
– Many peptides/proteins/nucleic acids have identical FPs.
– For alkanes, C16 should be more similar to C18 than C20.
– Identical FPs in Chemistry Toolkit Rosetta benchmark.
– PubChem “similar compounds” uses 90% threshold.
• FPs make no distinction between atom type changes.
– Chlorine to Bromine more conservative than HBD to HBA.
– Tautomers/protonation states often have low similarity.
– FPs are more sensitive to Normalization/Standardization.
• Stereochemistry is poorly handled by FPs.
– Either not represented or isomers have low similarity.
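The saturation point about alkanes can be illustrated with a toy binary path fingerprint. The sketch assumes features are linear heavy-atom paths of up to seven atoms and that only presence or absence is recorded; real fingerprints differ in detail, and count-based variants behave differently. The function names are chosen for this example.

```python
def path_features(n_carbons, max_len=7):
    """Binary path features of an unbranched alkane (heavy atoms only)."""
    # Every path in a chain is itself a chain of carbons joined by single
    # bonds, so a feature is fully described by its length.
    return {"C" * k for k in range(1, min(n_carbons, max_len) + 1)}

def tanimoto(a, b):
    """Tanimoto coefficient between two feature sets."""
    return len(a & b) / len(a | b)

c16, c18, c20 = (path_features(n) for n in (16, 18, 20))
print(tanimoto(c16, c18), tanimoto(c16, c20))
```

Both similarities come out as 1.0: once the chain is longer than the longest path considered, the fingerprint stops changing, so it cannot rank C18 as closer to C16 than C20 is.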
32. Recent RSC/GDB example
• Bizzini et al., “Synthesis of trinorbornane”, Chemical
Communications, 26 September 2017.
– “This new rigid structural type was found to be present in the computer
generated GDB and has until now no real-world counterpart”.
Tetracyclo[5.2.2.0^{1,6}.0^{4,9}]undecane
1-Azonia-tetracyclo[5.2.2.0^{1,6}.0^{4,9}]undecane
PubChem CID90865661 (not in GDB!)
Substructure of Daphniglaucin A and B
Org. Lett. 5(10):1733-1736, 2003.
34. Classic Ex Scientia example
Edit operations: MAJ, 2×LUP, LDN, 2×TUP, MAJ, LUP
Total Distance: 8 Total Topological Distance: 6
Besnard, Hopkins et al. Nature, 492:215-220, Dec 2012
35. SmallWorld search
SmallWorld lattice: Bold circles denote indexed molecules,
thin circles represent virtual subgraphs.
36. SmallWorld search
The solid circle denotes a query structure which may be
either an indexed molecule or a virtual subgraph.
37. SmallWorld search
The first iteration of the search adds the neighbors of the
query to the “search wavefront”.
38. SmallWorld search
Each subsequent iteration propagates the wavefront by
considering the unvisited neighbors of the wavefront.
39. SmallWorld search
At each iteration, “hits” are reported as the set of indexed
molecules that are members of the wavefront.
40. SmallWorld search
The search terminates once sufficient indexed neighbors
have been found (or a suitable iteration limit is reached).
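The wavefront procedure described on slides 35 to 40 is a breadth-first search over the lattice. The sketch below is a minimal in-memory illustration, not the SmallWorld implementation: the dictionary-based lattice, the node names, and the function `wavefront_search` are assumptions, and each edge is treated as one unit of distance.

```python
def wavefront_search(neighbors, indexed, query, max_hits, max_iterations):
    """Breadth-first search that reports indexed nodes by distance from query."""
    visited = {query}
    wavefront = {query}
    # The query itself may be an indexed molecule (distance 0).
    hits = [(0, query)] if query in indexed else []
    for step in range(1, max_iterations + 1):
        if len(hits) >= max_hits:        # enough indexed neighbors found
            break
        # Propagate: the unvisited neighbors of the current wavefront.
        wavefront = {
            nxt
            for node in wavefront
            for nxt in neighbors.get(node, ())
            if nxt not in visited
        }
        if not wavefront:                # lattice exhausted
            break
        visited |= wavefront
        # Hits are the indexed molecules that are members of the wavefront.
        hits.extend((step, n) for n in sorted(wavefront) if n in indexed)
    return hits

# Toy lattice: q, a, b, c are virtual subgraphs; m1 and m2 are indexed.
lattice = {
    "q": ["a", "b"], "a": ["q", "m1"], "b": ["q", "c"],
    "m1": ["a"], "c": ["b", "m2"], "m2": ["c"],
}
found = wavefront_search(lattice, {"m1", "m2"}, "q", max_hits=2, max_iterations=5)
print(found)
```

Because the search only touches nodes near the query, its cost depends on the size of the local neighborhood and the stopping criteria, not on the total number of indexed molecules.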
46. Current database statistics
• As of October 2017, the SmallWorld index has:
• 68,921,678,269 nodes (~69B or ~2^36 nodes)
• 258,787,077,793 edges (~259B or ~2^38 edges)
– 128,762,041,180 ring edges.
– 95,709,763,280 terminal edges
– 34,315,273,333 linker edges.
• Average degree (fan-out) of node: ~7.5
• 8.22B acyclic nodes, 7.12B have a single ring.
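These statistics can be cross-checked with a few lines of arithmetic: the three edge categories should sum to the edge total, the powers of two should match the node and edge counts, and the average degree should follow from counting each edge at both of its endpoints (an assumption about how the fan-out figure is defined).

```python
import math

nodes = 68_921_678_269
ring_edges = 128_762_041_180
terminal_edges = 95_709_763_280
linker_edges = 34_315_273_333
edges = ring_edges + terminal_edges + linker_edges

log2_nodes = math.log2(nodes)
log2_edges = math.log2(edges)
average_degree = 2 * edges / nodes       # each edge touches two nodes

print(edges, round(log2_nodes, 2), round(log2_edges, 2), round(average_degree, 2))
```

The sum reproduces the 258,787,077,793 edge total, the logarithms are close to 36 and 38, and the degree comes out at about 7.5.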
47. Graph database fabrication
• The “raw” source representation of SmallWorld is
28.7 TB of data, one ASCII line (of two SMILES) for
each edge, i.e. 259 billion text lines.
• Hypothetically, these 259B triples could be loaded
into a database such as Oracle, Virtuoso or Neo4j.
• Instead, we “compile” this graph database down to a
5TB form that is very efficiently searched at run-time.
• This 5TB can be delivered to customers on a £150
external USB disk (like a subscription service).
48. Implementation features 2017
• High-performance graph canonicalization/enumeration.
• “Umbrella” sampling of chemical space (<100 bonds).
• Database partitioning by Bond/Ring count.
• Integer node indices and User-Database mapping.
• Adjacency matrix in Compressed Sparse Row (CSR) format.
• Bloom filter hash join of mapped user-databases.
• Custom multigram SMILES compression (Sayle2001).
• Multiplicative Binary Search index lookup.
• Key-length hashing for index lookup (fixed field length).
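Of the features listed, the Compressed Sparse Row (CSR) layout is the easiest to show in a few lines. The sketch below builds the two standard CSR arrays for a tiny undirected graph with integer node indices; it illustrates the general technique only, and the function names and in-memory lists are assumptions, not SmallWorld's on-disk format.

```python
def build_csr(n_nodes, edges):
    """Build CSR arrays (offsets, targets) from an undirected edge list."""
    adjacency = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    offsets, targets = [0], []
    for nbrs in adjacency:
        targets.extend(sorted(nbrs))     # neighbors of one node, contiguous
        offsets.append(len(targets))     # where the next node's row starts
    return offsets, targets

def csr_neighbors(offsets, targets, node):
    """Neighbors of `node` are one contiguous slice of `targets`."""
    return targets[offsets[node]:offsets[node + 1]]

offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(offsets, targets, csr_neighbors(offsets, targets, 2))
```

With this layout a node's neighbors are found with two array reads and one contiguous scan, and storage is one integer per node plus one per edge endpoint.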
49. Database partitioning
• Instead of treating the database as a single
monolithic entity, the nodes are partitioned by their
atom, bond and ring counts.
• This results in 2406 partitions, named BxRy where x is
the number of bonds, y is the number of rings.
• Each edge links vertices in neighboring partitions.
– A tdn (terminal down) edge from BxRy leads to B(x-1)Ry; tup leads to B(x+1)Ry.
– An rdn (ring down) edge from BxRy leads to B(x-1)R(y-1); rup leads to B(x+1)R(y+1).
– An ldn (linker down) edge from BxRy leads to B(x-1)Ry; lup leads to B(x+1)Ry.
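The partition rules above amount to a small lookup table of bond and ring count changes per edge type. A minimal sketch, with the table `EDGE_DELTAS` and the function `neighbor_partition` as hypothetical names:

```python
import re

# (bond delta, ring delta) for each edge type, as listed on the slide.
EDGE_DELTAS = {
    "tdn": (-1, 0), "tup": (+1, 0),
    "rdn": (-1, -1), "rup": (+1, +1),
    "ldn": (-1, 0), "lup": (+1, 0),
}

def neighbor_partition(partition, edge_type):
    """Name of the partition reached from `partition` via `edge_type`."""
    match = re.fullmatch(r"B(\d+)R(\d+)", partition)
    if match is None:
        raise ValueError(f"not a BxRy partition name: {partition!r}")
    bonds, rings = int(match.group(1)), int(match.group(2))
    d_bonds, d_rings = EDGE_DELTAS[edge_type]
    return f"B{bonds + d_bonds}R{rings + d_rings}"

print(neighbor_partition("B6R1", "rdn"), neighbor_partition("B6R1", "tup"))
```

Since every edge changes the bond count by exactly one, a search step from any partition only ever needs data from the partitions one bond above and one bond below it.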
50. Heatmap of SmallWorld universe
[Figure: heatmap; axes: Bonds →, Rings →.]
51. Integer indices and DB mapping
• Inside each partition, each node is assigned a unique
sequential integer index.
– c1ccccc1 → *1*****1 ↔ B6R1.13
– n1cnccc1 → *1*****1 ↔ B6R1.13
• Each edge is then represented by two integers.
• SmallWorld is a “type domain index” over graphs.
• User databases are represented as “mappings”.
– B13R1.834 CC(=O)Oc1ccccc1C(=O)O CHEMBL1697753
– B14R4.107563 C1C[N+]23CCC4C2CC1C3C4 CID90865661
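A mapping of this kind can be read into a lookup keyed by partition and integer index. The sketch below assumes each mapping is a whitespace-separated line of node identifier, SMILES and external identifier, exactly as the two examples appear on the slide; the actual file format is not described in the source, and `parse_mapping` is a hypothetical helper.

```python
def parse_mapping(line):
    """Split one mapping line into its partition, index, SMILES and id."""
    node_id, smiles, external_id = line.split()
    partition, index = node_id.split(".")
    return {
        "partition": partition,
        "index": int(index),
        "smiles": smiles,
        "id": external_id,
    }

lines = [
    "B13R1.834 CC(=O)Oc1ccccc1C(=O)O CHEMBL1697753",
    "B14R4.107563 C1C[N+]23CCC4C2CC1C3C4 CID90865661",
]

# Several user molecules may share one node, so keep a list per node.
by_node = {}
for line in lines:
    record = parse_mapping(line)
    by_node.setdefault((record["partition"], record["index"]), []).append(record["id"])

print(by_node)
```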
53. Conclusions
• Larger data sets reveal patterns, trends and insights
previously invisible to chemists from small samples.
• As chemical data sets grow exponentially, they form
pain points for traditional processing: Big Data.
• Theoretical computer science and AI can provide new
algorithms to process more data more efficiently.
• The sub-linear behaviour of SmallWorld’s nearest
neighbor similarity makes it faster than fingerprint-
based methods on (sufficiently) large data sets.
54. Acknowledgements
• In memoriam Andy Grant, thank you for everything.
• AstraZeneca R&D, Alderley Park, U.K.
• GlaxoSmithKline, Stevenage, U.K.
• Relay Therapeutics, Boston, U.S.A.
• Eli Lilly, Indianapolis, U.S.A.
• Hoffmann-La Roche, Basel, Switzerland.
• Jose Batista, OpenEye Scientific Software, Germany.
• Jameed Hussain, Chemical Computing Group, U.K.
• Frey group, University of Southampton, U.K.
• Thank you for your time. Any questions?
55. J. Andrew Grant (1963-2012)
Andy and I at OpenEye EuroCUP 2008