Chemical Similarity using
multi-terabyte graph databases:
68 billion nodes and counting
Roger Sayle, John Mayfield and Noel O’Boyle
NextMove Software, Cambridge, UK
SCI, What can Big Data do for Chemistry?, London, UK, Wednesday 11th October 2017
What can big data do for chemistry?
1. Text Mining and Reaction Analytics
2. Scalable (AI) Algorithms for Big Data
3. Graph Databases for 2D Chemical Similarity
Part 1:
text mining and reaction
analytics
Automated chemical text mining
Extracting melting points (MPs) and reactions
Quantity has a quality of its own
• “This analysis demonstrates that models developed
using text-mined MP data from PATENTS provide an
excellent prediction performance, similar or even
significantly better than the results based on
manually curated data used in previous studies”.
Who let the dogs out?
• Hartenfeller and Schneider 2011-2012 describe 58 unique
reactions, 34 of which are ring forming.
• “A Collection of Robust Organic Synthesis Reactions for In Silico Molecule Design”, J. Chem. Inf. Model. 51(12), pp. 3093-3098, 2011.
• “DOGS: Reaction-Driven de novo Design of Bioactive Compounds”, PLOS Computational Biology, 8(2), February 2012.
[Pie chart of reaction classes. Legend: Heteroatom alkylation and arylation; Acylation and related processes; C-C bond formations; Heterocycle formation; Protections; Deprotections; Reductions; Oxidations; Functional group conversion; Functional group addition; Resolution. Slice labels as extracted: 34%, 17%, 5%, 2%, 3%, 6%, 10%, 1%, 15%, 2%, 5%.]
Who let the dogs out?
• Big Data can determine the utility and scope of reactions.
Part 2:
scalable artificial intelligence
Algorithms for big data
Motivation: compound acquisition
• Given an existing screening collection of X
compounds, and with Y vendor compounds
available for purchase, how should I select
the next Z diverse compounds to buy?
• Typically, X is about 2M and Y is about 170M.
• Previously ~O(N^3), replaced with << O(N^2).
RDKit’s MaxMinPicker
• RDKit’s original MaxMinPicker is described in a pair of
blog posts by Greg Landrum:
– Picking diverse compounds from large sets, 2014/08
– Optimizing Diversity Picking in the RDKit, 2014/08
• M. Ashton, J. Barnard, P. Willett et al., “Identification of
Diverse Database Subsets using Property-based and
Fragment-based Molecular Descriptors”, Quant. Struct.-Act.
Relat., Vol. 21, pp. 598-604, 2002.
• R. Kennard and L. Stone, “Computer aided design of
experiments”, Technometrics, 11(1), pp. 137-148, 1969.
Selection visualization
Image credits: Antoine Stevens, the prospectr package on GitHub
Conceptual algorithm
• If no compounds have been picked so far, choose the
first picked compound at random.
• Repeatedly select the compound furthest from its
nearest picked compound [hence the name
maximum-minimum distance].
• Continue until the desired number of picked
compounds has been selected (or the pool of
available compounds has been exhausted).
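The three steps above can be sketched in Python as follows. This is a minimal illustration of the conceptual algorithm, not RDKit's implementation; the function name `maxmin_pick`, its parameters and the toy 1-D data are assumptions made for the example.

```python
import random

def maxmin_pick(points, n_to_pick, dist, seed=0):
    """Greedy maximum-minimum distance picking; returns indices into points."""
    n = len(points)
    if n == 0 or n_to_pick <= 0:
        return []
    # Step 1: nothing has been picked yet, so choose the first one at random.
    first = random.Random(seed).randrange(n)
    picks = [first]
    picked = {first}
    # min_dist[i] is the distance from candidate i to its nearest pick.
    min_dist = [dist(points[i], points[first]) for i in range(n)]
    # Step 3: stop at the requested count or when the pool is exhausted.
    while len(picks) < min(n_to_pick, n):
        # Step 2: take the candidate furthest from its nearest pick.
        best = max((i for i in range(n) if i not in picked),
                   key=lambda i: min_dist[i])
        picks.append(best)
        picked.add(best)
        for i in range(n):
            min_dist[i] = min(min_dist[i], dist(points[i], points[best]))
    return picks

# Toy usage: 1-D "compounds" with absolute difference as the distance.
pool = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]
chosen = maxmin_pick(pool, 3, lambda a, b: abs(a - b), seed=1)
print([pool[i] for i in chosen])
```

Every round in this naive form touches every candidate, which is the cost that motivates the pruning discussed on the following slides.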
Artificial intelligence (MINI-MAX)
Los Alamos Chess (6x6 board)
White has 16 possible moves.
The 10 that don’t check, lose.
Five checks, lose the queen.
MAX
MIN
[Figure: minimax game tree. Leaf values (5, 3), (2, 4), (0, 6), (1, 3); MIN nodes 3, 2, 0, 1; MAX root 3.]
Alpha cut-offs allow us to prune the search tree.
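A small Python sketch of minimax with alpha-beta cut-offs shows the pruning at work. The leaf values are the eight numbers from the tree on the slide, grouped in pairs, which is an assumption about the figure's layout; the function name `alphabeta` and the `visited` list are illustrative only.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of a nested-list game tree, with alpha-beta pruning."""
    if not isinstance(node, list):
        # Leaf: record that it was evaluated, then return its score.
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # cut-off: MIN above will never allow this branch
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if beta <= alpha:
            break  # cut-off: MAX above already has a better option
    return best

tree = [[5, 3], [2, 4], [0, 6], [1, 3]]
visited = []
value = alphabeta(tree, True, visited=visited)
print(value, visited)
```

Once the first MIN node has established a value of 3, each later MIN node is abandoned as soon as one of its leaves scores 3 or less, so only five of the eight leaves are evaluated.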
classic Max-min picking
[Figure: classic max-min picking example, showing the Candidate Pool, the Picks, the per-candidate Minimums and the resulting Maximum.]
new Max-min picking
[Figure: new max-min picking example, showing the Candidate Pool, the Picks, the per-candidate upper Bounds (“<”) and the resulting Maximum.]
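One way to read the “< Bounds” row is that each candidate carries an upper bound on its distance to the nearest pick, and a candidate is abandoned as soon as that bound can no longer beat the best minimum found so far, in the same spirit as an alpha cut-off. The sketch below follows that reading; it is an assumption about the method rather than the authors' code, and `lazy_maxmin`, `seed_picks` and the toy data are hypothetical names.

```python
def lazy_maxmin(n_items, dist, seed_picks, n_to_pick):
    """MaxMin picking with bound-based pruning.

    dist(i, j) returns the distance between items i and j.
    Returns (picks, number_of_distance_evaluations).
    """
    picks = list(seed_picks)
    # bound[i] is an upper bound on candidate i's distance to its nearest pick;
    # seen[i] counts how many picks candidate i has been compared with so far.
    bound = {i: float("inf") for i in range(n_items) if i not in picks}
    seen = dict.fromkeys(bound, 0)
    calls = 0
    while len(picks) < n_to_pick and bound:
        best, best_min = None, -1.0
        for i in bound:
            # Comparing against more picks can only lower the bound, so stop
            # as soon as this candidate can no longer beat the current best.
            while seen[i] < len(picks) and bound[i] > best_min:
                bound[i] = min(bound[i], dist(i, picks[seen[i]]))
                seen[i] += 1
                calls += 1
            if seen[i] == len(picks) and bound[i] > best_min:
                best, best_min = i, bound[i]
        picks.append(best)
        del bound[best], seen[best]
    return picks, calls

# Toy usage: item 0 stands in for the existing collection.
values = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0, 21.0, 30.0]
picks, calls = lazy_maxmin(len(values), lambda i, j: abs(values[i] - values[j]), [0], 4)
print(picks, calls)
```

The picks are the same as those of the exhaustive procedure, because a candidate is skipped only when its upper bound already rules it out; the saving is in distance evaluations that are deferred or never made.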
Screening library enhancement #1
• Selecting 1K compounds for purchase from
eMolecules (14M) to enhance ChEMBL 23 (1.7M).
– Reading eMolecules: 4780s
– Reading ChEMBL: 821s
– Generating FPs: 1456s
– MaxMinPicker: 42773s [80B FP cmps]
• Selecting the first 18 compounds takes only 399s
[715M FP cmps].
• Conclusion: Large scale diversity selection can be run
overnight on a single CPU core.
Screening library enhancement #2
• Selecting 40 compounds for purchase from Enamine
REAL 2017 (171M) to enhance ChEMBL 23 (1.7M).
– Reading mols/FP gen: 77194s + 1204s
– 1st compound: 181.32B FP cmps (1/82750).
– 10th compound: 301.67B FP cmps.
– 40th compound: 438.71B FP cmps.
• A traditional distance matrix requires 60 petabytes of
storage, and 1.5E16 FP comparisons (15 quadrillion).
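Figures of this size can be reproduced with a few lines of arithmetic. The sketch below assumes a half (triangular) matrix over the 171M vendor compounds and 4 bytes per stored distance; both are assumptions, since the slide does not state how the numbers were derived.

```python
n_vendor = 171_000_000                    # Enamine REAL 2017, from the slide
pairs = n_vendor * (n_vendor - 1) // 2    # unique pairs in a half matrix
bytes_per_distance = 4                    # assumption: one 32-bit value each
storage_pb = pairs * bytes_per_distance / 1e15

print(f"{pairs:.2e} FP comparisons, about {storage_pb:.0f} PB")
```

Under these assumptions the pair count is roughly 1.5E16 and the storage comes out just under 60 petabytes.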
Part 3:
large graph databases for 2d
chemical similarity searching
Fighting big data with bigger data
• The real challenge of Big Data is scalability.
• Traditional chemical similarity searching using binary
fingerprints scales linearly, as O(N).
– If a search of 1M compounds takes 1 second, then…
– ChEMBL takes 2s, PubChem takes 90s, Enamine 171s.
• Here we describe the use of a sublinear-scaling
search method over a database that is larger by an
approximately constant factor (perhaps 1K-1M times).
• As data set sizes increase, these approaches make
traditional methods increasingly uncompetitive.
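For reference, the linear baseline can be sketched as a Tanimoto scan over binary fingerprints held as Python integers. The function names, the tiny 4-bit fingerprints and the zero-for-empty convention are assumptions chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union

def linear_search(query, database, threshold):
    """Compare the query with every fingerprint: N comparisons for N entries."""
    hits = []
    for index, fingerprint in enumerate(database):
        score = tanimoto(query, fingerprint)
        if score >= threshold:
            hits.append((index, score))
    return hits

database = [0b1111, 0b1100, 0b0001]
hits = linear_search(0b1110, database, threshold=0.7)
print(hits)
```

The loop visits every entry regardless of how many hits exist, which is why the search time grows in step with the database size.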
SmallWorld chemical space
Graph search (GED) of 68 billion subgraphs vs. 340 million molecules.
Counting molecular subgraphs
Name            Atoms    MW   Subgraphs
Benzene             6    78           7
Cubane              8   104          64
Ferrocene          11   186       3,154
Aspirin            13   180         127
Dodecahedrane      20   260     440,473
Ranitidine         21   314         436
Clopidogrel        21   322      10,071
Morphine           21   285     176,541
Amlodipine         28   409      58,139
Lisinopril         29   405      24,619
Gefitinib          31   447     190,901
Atorvastatin       41   559   3,638,523
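A brute-force count of connected subgraphs, taken up to isomorphism, reproduces the benzene entry. The sketch below ignores atom and bond types and counts the single atom as one subgraph; both choices are assumptions about the counting convention, checked here only against benzene, and the approach is only practical for very small graphs.

```python
from itertools import combinations, permutations

def count_subgraphs(bonds):
    """Count distinct connected subgraphs of a small graph given as bond pairs."""
    def connected(edges):
        nodes = {atom for edge in edges for atom in edge}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            current = stack.pop()
            for a, b in edges:
                for x, y in ((a, b), (b, a)):
                    if x == current and y not in seen:
                        seen.add(y)
                        stack.append(y)
        return seen == nodes

    def canonical(edges):
        # Smallest relabelled edge list over all atom orderings.
        nodes = sorted({atom for edge in edges for atom in edge})
        best = None
        for perm in permutations(range(len(nodes))):
            relabel = dict(zip(nodes, perm))
            key = tuple(sorted(tuple(sorted((relabel[a], relabel[b])))
                               for a, b in edges))
            if best is None or key < best:
                best = key
        return best

    unique = {()}  # the single-atom subgraph, which has no bonds
    for size in range(1, len(bonds) + 1):
        for edges in combinations(bonds, size):
            if connected(edges):
                unique.add(canonical(edges))
    return len(unique)

benzene = [(i, (i + 1) % 6) for i in range(6)]
print(count_subgraphs(benzene))
```

For a six-membered ring the distinct subgraphs are the single atom, the chains of one to five bonds, and the ring itself.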
Bond count    % of PubChem
≤ 20 bonds    14%
≤ 25 bonds    30%
≤ 30 bonds    55%
≤ 35 bonds    77%
≤ 40 bonds    89%
≤ 45 bonds    93%
≤ 50 bonds    95%
≤ 55 bonds    97%
≤ 60 bonds    98%
≤ 65 bonds    98%
≤ 70 bonds    99%
graph Edit distance
• Graph Edit Distance (GED) is the minimum number of
edit operations required to transform one graph into
another.
– Alberto Sanfeliu and K.S. Fu, “A Distance Measure between
Attributed Relational Graphs for Pattern Recognition”, IEEE
Transactions on Systems, Man and Cybernetics (SMC), Vol.
13, No. 3, pp. 353-362, 1983.
• Edit operations consist of insertions, deletions and
substitutions of nodes and edges (atoms and bonds).
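Computing the true graph edit distance means minimising over all possible atom mappings, which is expensive. The sketch below sidesteps that by assuming the mapping is already given (shared atoms use the same keys in both graphs), so it returns the cost of one particular edit script, an upper bound on the GED. The data layout and the name `edit_cost` are assumptions for illustration.

```python
def edit_cost(atoms_a, bonds_a, atoms_b, bonds_b):
    """Count edit operations between two graphs under a fixed atom mapping.

    atoms_*: dict of atom key -> element symbol.
    bonds_*: dict of frozenset({key1, key2}) -> bond label.
    """
    cost = 0
    for key in set(atoms_a) | set(atoms_b):
        if key not in atoms_a or key not in atoms_b:
            cost += 1  # atom insertion or deletion
        elif atoms_a[key] != atoms_b[key]:
            cost += 1  # atom substitution
    for key in set(bonds_a) | set(bonds_b):
        if key not in bonds_a or key not in bonds_b:
            cost += 1  # bond insertion or deletion
        elif bonds_a[key] != bonds_b[key]:
            cost += 1  # bond substitution
    return cost

ring_bonds = {frozenset((i, (i + 1) % 6)): "aromatic" for i in range(6)}
benzene_atoms = {i: "C" for i in range(6)}
pyridine_atoms = dict(benzene_atoms)
pyridine_atoms[0] = "N"
toluene_atoms = dict(benzene_atoms)
toluene_atoms[6] = "C"
toluene_bonds = dict(ring_bonds)
toluene_bonds[frozenset((0, 6))] = "single"

print(edit_cost(benzene_atoms, ring_bonds, pyridine_atoms, ring_bonds))
print(edit_cost(benzene_atoms, ring_bonds, toluene_atoms, toluene_bonds))
```

Benzene to pyridine is a single atom substitution, while benzene to toluene needs one atom insertion and one bond insertion.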
Example edit operations
Benzene Pyridine
Chlorobenzene Fluorobenzene
Benzoxazole Benzothiazole
Benzothiazole
Example edit operations
Benzene Cyclohexane
Thiazole Tetrahydrofuran
Histidine Histidine Zwitterion
Example edit operations
Ticlopidine Clopidogrel
Penicillin G Amoxicillin
Example edit operations
Sildenafil (Viagra) Vardenafil (Levitra)
Sumatriptan (Imitrex) Zolmitriptan (Zomig)
advantages over fingerprints
• FP similarity based on “local” substructures.
• FP saturation of features/Chemical Space.
– Many peptides/proteins/nucleic acids have identical FPs.
– For alkanes, C16 should be more similar to C18 than C20.
– Identical FPs in Chemistry Toolkit Rosetta benchmark.
– PubChem “similar compounds” uses 90% threshold.
• FPs make no distinction between atom type changes.
– Chlorine to Bromine more conservative than HBD to HBA.
– Tautomers/protonation states often have low similarity.
– FPs are more sensitive to Normalization/Standardization.
• Stereochemistry is poorly handled by FPs.
– Either not represented or isomers have low similarity.
Recent RSC/GDB example
• Bizzini et al., “Synthesis of trinorbornane”, Chemical
Communications, 26 September 2017.
– “This new rigid structural type was found to be present in the computer
generated GDB and has until now no real-world counterpart”.
Tetracyclo[5.2.2.0^{1,6}.0^{4,9}]undecane 1-Azonia-tetracyclo[5.2.2.0^{1,6}.0^{4,9}]undecane
PubChem CID90865661 (not in GDB!)
Substructure of Daphniglaucin A and B
Org. Lett. 5(10):1733-1736, 2003.
topological edit/edge types
Classic Ex Scientia example
Edit path: MAJ, 2×LUP, LDN, 2×TUP, MAJ, LUP
Total Distance: 8; Total Topological Distance: 6
Besnard, Hopkins et al. Nature, 492:215-220, Dec 2012
SmallWorld search
• SmallWorld lattice: bold circles denote indexed molecules,
thin circles represent virtual subgraphs.
• The solid circle denotes a query structure, which may be
either an indexed molecule or a virtual subgraph.
• The first iteration of the search adds the neighbors of the
query to the “search wavefront”.
• Each subsequent iteration propagates the wavefront by
considering the unvisited neighbors of the wavefront.
• At each iteration, “hits” are reported as the set of indexed
molecules that are members of the wavefront.
• The search terminates once sufficient indexed neighbors
have been found (or a suitable iteration limit is reached).
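The wavefront search described above is a breadth-first traversal, and can be sketched as follows. The adjacency dictionary, the node names and the function `wavefront_search` are hypothetical; the real index is far too large to hold in such a structure.

```python
def wavefront_search(neighbors, indexed, query, max_hits, max_iters):
    """Breadth-first wavefront search; returns (node, distance) hits.

    neighbors: dict mapping each node to its adjacent nodes.
    indexed: set of nodes that correspond to real (indexed) molecules.
    """
    visited = {query}
    wavefront = {query}
    hits = []
    for distance in range(1, max_iters + 1):
        if len(hits) >= max_hits or not wavefront:
            break
        # Propagate: collect the unvisited neighbors of the current wavefront.
        next_front = set()
        for node in wavefront:
            for neighbor in neighbors[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_front.add(neighbor)
        wavefront = next_front
        # Report indexed molecules that are members of the new wavefront.
        hits.extend((node, distance) for node in sorted(wavefront)
                    if node in indexed)
    return hits

lattice = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
hits = wavefront_search(lattice, indexed={"b", "d"}, query="q",
                        max_hits=2, max_iters=5)
print(hits)
```

The work done depends on how many nodes the wavefront touches before enough hits are found, not on the total number of molecules indexed.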
Current database statistics
• As of October 2017, the SmallWorld index has
• 68,921,678,269 nodes (~69B or ~2^36 nodes)
• 258,787,077,793 edges (~259B or ~2^38 edges)
– 128,762,041,180 ring edges.
– 95,709,763,280 terminal edges.
– 34,315,273,333 linker edges.
• Average degree (fan-out) of node: ~7.5
• 8.22B acyclic nodes, 7.12B have a single ring.
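The headline figures can be cross-checked with a few lines of arithmetic. The average degree calculation assumes that each edge is undirected and so contributes to the degree of two nodes.

```python
import math

nodes = 68_921_678_269
ring_edges = 128_762_041_180
terminal_edges = 95_709_763_280
linker_edges = 34_315_273_333

# The three edge types should add up to the quoted edge total.
edges = ring_edges + terminal_edges + linker_edges

log2_nodes = math.log2(nodes)
log2_edges = math.log2(edges)
# Assumption: undirected edges, so each one is counted at both endpoints.
average_degree = 2 * edges / nodes

print(edges, round(log2_nodes, 2), round(log2_edges, 2), round(average_degree, 2))
```

The edge types sum to the stated total, the node count sits just above 2^36, the edge count just below 2^38, and the average degree comes out close to 7.5.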
Graph database fabrication
• The “raw” source representation of SmallWorld is
28.7 TB of data, one ASCII line (of two SMILES) for
each edge, i.e. 259 billion text lines.
• Hypothetically, these 259B triples could be loaded
into a database such as Oracle, Virtuoso or Neo4j.
• Instead, we “compile” this graph database down to a
5TB form that is very efficiently searched at run-time.
• This 5TB can be delivered to customers on a £150
external USB disk (like a subscription service).
implementation features 2017
• High-performance graph canonicalization/enumeration.
• “Umbrella” sampling of chemical space (<100 bonds).
• Database partitioning by Bond/Ring count.
• Integer node indices and User-Database mapping.
• Adjacency matrix in Compressed Sparse Row (CSR) format.
• Bloom filter hash join of mapped user-databases.
• Custom multigram SMILES compression (Sayle 2001).
• Multiplicative Binary Search index lookup.
• Key-length hashing for index lookup (fixed field length).
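Of the features listed, the Compressed Sparse Row layout is the easiest to show in a few lines. The sketch below builds the two CSR arrays from an edge list with integer node indices; it is a generic illustration with hypothetical function names, not SmallWorld's on-disk format.

```python
def build_csr(n_nodes, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (src, dst) pairs."""
    counts = [0] * n_nodes
    for src, _ in edges:
        counts[src] += 1
    # row_ptr[i] is where node i's neighbor list starts in col_idx.
    row_ptr = [0] * (n_nodes + 1)
    for i in range(n_nodes):
        row_ptr[i + 1] = row_ptr[i] + counts[i]
    col_idx = [0] * row_ptr[-1]
    next_slot = row_ptr[:-1]  # slicing copies, so row_ptr is left untouched
    for src, dst in edges:
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx

def csr_neighbors(row_ptr, col_idx, node):
    """Neighbors of a node are one contiguous slice of col_idx."""
    return col_idx[row_ptr[node]:row_ptr[node + 1]]

# Undirected toy graph stored as directed pairs in both directions.
edges = [(0, 1), (0, 2), (1, 0), (2, 0), (2, 3), (3, 2)]
row_ptr, col_idx = build_csr(4, edges)
print(row_ptr, col_idx, csr_neighbors(row_ptr, col_idx, 2))
```

Storage is one integer per node plus one per edge, and a neighbor lookup is a single contiguous read, which suits a graph with a small average degree.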
database partitioning
• Instead of treating the database as a single
monolithic entity, the nodes are partitioned by their
atom, bond and ring counts.
• This results in 2406 partitions, named BxRy, where x is
the number of bonds and y is the number of rings.
• Each edge links vertices in neighboring partitions.
– A tdn edge from BxRy leads to B(x-1)Ry, tup to B(x+1)Ry.
– A rdn edge from BxRy leads to B(x-1)R(y-1), rup to B(x+1)R(y+1).
– A ldn edge from BxRy leads to B(x-1)Ry, lup to B(x+1)Ry.
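The partition arithmetic for the six edge types can be written as a small lookup. The function name `neighbor_partition` and the error handling are assumptions; the bond and ring deltas are those listed above.

```python
import re

# (bond delta, ring delta) for each edge type, as listed on the slide.
EDGE_DELTAS = {
    "tdn": (-1, 0), "tup": (+1, 0),
    "rdn": (-1, -1), "rup": (+1, +1),
    "ldn": (-1, 0), "lup": (+1, 0),
}

def neighbor_partition(partition, edge_type):
    """Return the partition reached from `partition` along one edge type."""
    match = re.fullmatch(r"B(\d+)R(\d+)", partition)
    if match is None:
        raise ValueError(f"not a BxRy partition name: {partition!r}")
    bonds, rings = int(match.group(1)), int(match.group(2))
    d_bonds, d_rings = EDGE_DELTAS[edge_type]
    bonds, rings = bonds + d_bonds, rings + d_rings
    if bonds < 0 or rings < 0:
        raise ValueError("edge leads outside the partition grid")
    return f"B{bonds}R{rings}"

print(neighbor_partition("B6R1", "rdn"), neighbor_partition("B6R1", "tup"))
```

Because every edge changes the bond count by exactly one, a search step only ever needs the partitions directly adjacent to the current one.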
Heatmap of SmallWorld universe
[Figure: heatmap with axes Bonds and Rings.]
integer indices and DB mapping
• Inside each partition, each node is assigned a unique
sequential integer index.
– c1ccccc1 → *1*****1 ↔ B6R1.13
– n1cnccc1 → *1*****1 ↔ B6R1.13
• Each edge is then represented by two integers.
• SmallWorld is a “type domain index” over graphs.
• User databases are represented as “mappings”.
– B13R1.834 CC(=O)Oc1ccccc1C(=O)O CHEMBL1697753
– B14R4.107563 C1C[N+]23CCC4C2CC1C3C4 CID90865661
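A mapping file of this shape can be read into a lookup keyed by partition and integer index. The three-column, whitespace-separated layout is inferred from the two example lines above, and `parse_mappings` is a hypothetical helper.

```python
from collections import defaultdict

def parse_mappings(lines):
    """Map (partition, index) -> list of (SMILES, external identifier)."""
    index = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        node, smiles, identifier = fields
        partition, _, ordinal = node.partition(".")
        index[(partition, int(ordinal))].append((smiles, identifier))
    return dict(index)

mapping_lines = [
    "B13R1.834 CC(=O)Oc1ccccc1C(=O)O CHEMBL1697753",
    "B14R4.107563 C1C[N+]23CCC4C2CC1C3C4 CID90865661",
]
mappings = parse_mappings(mapping_lines)
print(mappings[("B13R1", 834)])
```

Several user molecules can share one node, as the benzene and pyrimidine example shows, so each key holds a list rather than a single entry.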
SmallWorld Density heatmaps
conclusions
• Larger data sets reveal patterns, trends and insights
previously invisible to chemists from small samples.
• As chemical data sets grow exponentially, they form
pain points for traditional processing: Big Data.
• Theoretical computer science and AI can provide new
algorithms to process more data more efficiently.
• The sub-linear behaviour of SmallWorld’s nearest
neighbor similarity makes it faster than fingerprint-based
methods on (sufficiently) large data sets.
acknowledgements
• In memoriam Andy Grant, thank you for everything.
• AstraZeneca R&D, Alderley Park, U.K.
• GlaxoSmithKline, Stevenage, U.K.
• Relay Therapeutics, Boston, U.S.A.
• Eli Lilly, Indianapolis, U.S.A.
• Hoffmann-La Roche, Basel, Switzerland.
• Jose Batista, OpenEye Scientific Software, Germany.
• Jameed Hussain, Chemical Computing Group, U.K.
• Frey group, University of Southampton, U.K.
• Thank you for your time. Any questions?
J. Andrew Grant (1963-2012)
Andy and I at OpenEye EuroCUP 2008

Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeNextMove Software
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsNextMove Software
 

Mais de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patents
 

Último

Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 

Último (20)

The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 

Chemical similarity using multi-terabyte graph databases: 68 billion nodes and counting

  • 1. Chemical Similarity using multi-terabyte graph databases: 68 billion nodes and counting Roger Sayle, John Mayfield and Noel O’Boyle NextMove Software, Cambridge, UK SCI, What can Big Data do for Chemistry?, London, UK, Wednesday 11th October 2017
  • 2. What can big data do for chemistry? 1. Text Mining and Reaction Analytics 2. Scalable (AI) Algorithms for Big Data 3. Graph Databases for 2D Chemical Similarity
  • 3. Part 1: text mining and reaction analytics
  • 4. Automated chemical text mining
  • 5. Extracting melting points (MPs) and reactions
  • 6. Quantity has a quality of its own • “This analysis demonstrates that models developed using text-mined MP data from PATENTS provide an excellent prediction performance, similar or even significantly better than the results based on manually curated data used in previous studies”.
  • 7. Who let the dogs out? • Hartenfeller and Schneider (2011-2012) describe 58 unique reactions, 34 of which are ring forming. • “A Collection of Robust Organic Synthesis Reactions for In Silico Molecule Design”, J. Chem. Inf. Model. 51(12), pp. 3093-3098, 2011. • “DOGS: Reaction Driven de novo Design of Bioactive Compounds”, PLOS Computational Biology, 8(2), February 2012. [Pie chart of reaction classes. Legend: Heteroatom alkylation and arylation; Acylation and related processes; C-C bond formations; Heterocycle formation; Protections; Deprotections; Reductions; Oxidations; Functional group conversion; Functional group addition; Resolution. Segment labels: 34%, 17%, 5%, 2%, 3%, 6%, 10%, 1%, 15%, 2%, 5%.]
  • 8. Who let the dogs out? • Big Data can determine the utility and scope of reactions.
  • 9. Part 2: scalable artificial intelligence algorithms for big data
  • 10. Motivation: compound acquisition • Given an existing screening collection of X compounds, and with Y vendor compounds available for purchase, how should I select the next Z diverse compounds to buy? • Typically, X is about 2M and Y is about 170M. • Previously ~O(N^3), replaced with << O(N^2).
  • 11. RDKit’s MaxMinPicker • RDKit’s original MaxMinPicker is described in a pair of blog posts by Greg Landrum: – Picking diverse compounds from large sets, 2014/08 – Optimizing Diversity Picking in the RDKit, 2014/08 • M. Ashton, J. Barnard, P. Willett et al., “Identification of Diverse Database Subsets using Property-based and Fragment-based Molecular Descriptors”, Quant. Struct.-Act. Relat., Vol. 21, pp. 598-604, 2002. • R. Kennard and L. Stone, “Computer aided design of experiments”, Technometrics, 11(1), pp. 137-148, 1969.
  • 12. Selection visualization Image Credits: Antoine Stevens, the ProspectR package on GitHub
  • 13. Conceptual algorithm • If no compounds have been picked so far, choose the first picked compound at random. • Repeatedly select the compound furthest from its nearest picked compound [hence the name maximum-minimum distance]. • Continue until the desired number of picked compounds has been selected (or the pool of available compounds has been exhausted).
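The conceptual algorithm on slide 13 can be sketched in a few lines of plain Python. This is only an illustration of the idea, not RDKit's MaxMinPicker: the function name `maxmin_pick`, its parameters and the tie-breaking rule (lowest index wins) are assumptions made for this example, and `dist` stands for any distance function, such as one minus Tanimoto similarity.

```python
import random

def maxmin_pick(pool, n_picks, dist, first=None, rng=None):
    """Naive max-min diversity picking (hypothetical helper, not RDKit's API)."""
    if not pool or n_picks <= 0:
        return []
    rng = rng or random.Random()
    # Step 1: with nothing picked yet, start from a random compound.
    if first is None:
        first = rng.randrange(len(pool))
    picks = [first]
    remaining = set(range(len(pool))) - {first}
    # min_dist[i] = distance from candidate i to its nearest picked compound.
    min_dist = [dist(item, pool[first]) for item in pool]
    # Steps 2-3: repeat until enough picks are made or the pool is exhausted.
    while remaining and len(picks) < n_picks:
        # Choose the candidate furthest from its nearest pick
        # (ties broken in favour of the lowest index: an assumption).
        best = max(remaining, key=lambda i: (min_dist[i], -i))
        picks.append(best)
        remaining.discard(best)
        # Update each remaining candidate's nearest-pick distance.
        for i in remaining:
            d = dist(pool[i], pool[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return picks

if __name__ == "__main__":
    values = [0.0, 1.0, 2.0, 10.0, 11.0]
    print(maxmin_pick(values, 3, lambda a, b: abs(a - b), first=0))
```

Each new pick costs one distance evaluation per remaining candidate, so this naive form needs roughly (pool size × number of picks) comparisons.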
  • 14. Artificial intelligence (MINI-MAX) Los Alamos Chess (6x6 board): White has 16 possible moves. The 10 that don’t check, lose. Five checks, lose the queen. [Figure: game tree with alternating MAX and MIN levels.]
  • 15. Artificial intelligence (MINI-MAX) Los Alamos Chess (6x6 board): White has 16 possible moves. The 10 that don’t check, lose. Five checks, lose the queen. [Figure: game tree with alternating MAX and MIN levels.] Alpha cut-offs allow us to prune the search tree.
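The pruning idea on slide 15 is the standard alpha-beta refinement of minimax. The sketch below is a generic textbook version, not code from the talk; the nested-list tree encoding and the name `alphabeta` are assumptions for illustration. The optional `visited` list records which leaves were actually evaluated, which makes the effect of a cut-off visible.

```python
def alphabeta(node, maximizing=True, alpha=float("-inf"), beta=float("inf"), visited=None):
    """Minimax value of a nested-list game tree with alpha-beta pruning."""
    if not isinstance(node, list):
        # Leaf: a static evaluation score.
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the MIN player above would never allow this line
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the MAX player above already has a better option
    return best

if __name__ == "__main__":
    seen = []
    tree = [[3, 5], [2, 9]]  # MAX at the root, MIN below, then leaves
    print(alphabeta(tree, visited=seen), seen)
```

In this small tree the leaf 9 is never evaluated: once the second MIN node has seen 2, it cannot beat the 3 already guaranteed to MAX. The same "stop as soon as a bound rules the candidate out" reasoning is what carries over to max-min picking.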
  • 16. Classic max-min picking [Figure: table of distances between the candidate pool and the picks; labels: Candidate Pool, Picks, Minimums, Maximum.]
  • 17. New max-min picking [Figure: table of distances between the candidate pool and the picks; labels: Candidate Pool, Picks, < Bounds, Maximum.]
  • 18. New max-min picking [Figure: the same table after a further pick; labels: Candidate Pool, Picks, < Bounds, Maximum.]
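One way to read the "< Bounds" label on slides 17 and 18 is that each candidate carries an upper bound on its distance to the nearest pick, and a candidate is skipped as soon as that bound cannot beat the best value found so far. The sketch below follows that reading; it is an assumption about the idea, not the RDKit implementation, and the name `lazy_maxmin_pick` and the comparison counter are introduced only for illustration.

```python
def lazy_maxmin_pick(pool, n_picks, dist, first=0):
    """Max-min picking with lazily refined upper bounds (illustrative sketch).

    Returns (picks, number_of_distance_evaluations).
    """
    if not pool or n_picks <= 0:
        return [], 0
    picks = [first]
    remaining = [i for i in range(len(pool)) if i != first]
    bound = {i: float("inf") for i in remaining}  # upper bound on min distance to picks
    seen = {i: 0 for i in remaining}              # how many picks i was compared against
    n_cmp = 0
    while remaining and len(picks) < n_picks:
        best, best_i = -1.0, None
        for i in remaining:
            # Refine the bound only while the candidate could still win.
            while bound[i] > best and seen[i] < len(picks):
                d = dist(pool[i], pool[picks[seen[i]]])
                n_cmp += 1
                seen[i] += 1
                if d < bound[i]:
                    bound[i] = d
            # Here the bound is either exact or already <= best (pruned).
            if bound[i] > best:
                best, best_i = bound[i], i
        picks.append(best_i)
        remaining.remove(best_i)
    return picks, n_cmp

if __name__ == "__main__":
    values = [0, 5, 1, 10]
    print(lazy_maxmin_pick(values, 3, lambda a, b: abs(a - b)))
```

In the usage example the candidate with value 1 is never compared against the second pick, because its existing bound (1) already falls below the best candidate (5); the eager version would spend five comparisons where this one spends four.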
  • 19. Screening library enhancement #1 • Selecting 1K compounds for purchase from eMolecules (14M) to enhance ChEMBL 23 (1.7M). – Reading eMolecules: 4780s – Reading ChEMBL: 821s – Generating FPs: 1456s – MaxMinPicker: 42773s [80B FP cmps] • Selecting the first 18 compounds takes only 399s [715M FP cmps]. • Conclusion: Large-scale diversity selection can be run overnight on a single CPU core.
  • 20. Screening library enhancement #2 • Selecting 40 compounds for purchase from Enamine REAL 2017 (171M) to enhance ChEMBL 23 (1.7M). – Reading mols/FP gen: 77194s + 1204s – 1st compound: 181.32B FP cmps (1/82750). – 10th compound: 301.67B FP cmps. – 40th compound: 438.71B FP cmps. • A traditional distance matrix requires 60 petabytes of storage, and 1.5E16 FP comparisons (15 quadrillion).
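The distance-matrix figures on slide 20 can be checked with a short calculation. The slide does not say how each distance is stored, so the 4 bytes per entry below is an assumption; with it, the upper triangle of a 171M × 171M matrix comes out close to the quoted 60 petabytes.

```python
N_VENDOR = 171_000_000        # Enamine REAL 2017 size from the slide
BYTES_PER_DISTANCE = 4        # assumption: one 32-bit value per pairwise distance

# Number of unordered compound pairs (upper triangle of the matrix).
pairs = N_VENDOR * (N_VENDOR - 1) // 2

# Storage in decimal petabytes (1 PB = 1e15 bytes).
petabytes = pairs * BYTES_PER_DISTANCE / 1e15

print(f"pairwise comparisons: {pairs:.3e}")
print(f"storage: {petabytes:.1f} PB")
```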
  • 21. Part 3: large graph databases for 2D chemical similarity searching
  • 22. Fighting big data with bigger data • The real challenge of Big Data is scalability. • Traditional chemical similarity searching using binary fingerprints scales linearly, as O(N). – If a search of 1M compounds takes 1 second, then… – ChEMBL takes 2s, PubChem takes 90s, Enamine 171s. • Here we describe the use of a sublinear-scaling search method over a database that is larger by an approximately constant factor (perhaps 1K-1M times). • As data set sizes increase, these approaches make traditional methods increasingly uncompetitive.
  • 23. SmallWorld chemical space: graph search (GED) of 68 billion subgraphs vs. 340 million molecules.
  • 24. Counting molecular subgraphs
      Name            Atoms   MW    Subgraphs
      Benzene             6    78           7
      Cubane              8   104          64
      Ferrocene          11   186       3,154
      Aspirin            13   180         127
      Dodecahedrane      20   260     440,473
      Ranitidine         21   314         436
      Clopidogrel        21   322      10,071
      Morphine           21   285     176,541
      Amlodipine         28   409      58,139
      Lisinopril         29   405      24,619
      Gefitinib          31   447     190,901
      Atorvastatin       41   559   3,638,523

      Bond count    % of PubChem
      ≤ 20 bonds    14%
      ≤ 25 bonds    30%
      ≤ 30 bonds    55%
      ≤ 35 bonds    77%
      ≤ 40 bonds    89%
      ≤ 45 bonds    93%
      ≤ 50 bonds    95%
      ≤ 55 bonds    97%
      ≤ 60 bonds    98%
      ≤ 65 bonds    98%
      ≤ 70 bonds    99%
  • 25. SmallWorld chemical space: graph search (GED) of 68 billion subgraphs vs. 340 million molecules.
  • 26. Graph edit distance • Graph Edit Distance (GED) is the minimum number of edit operations required to transform one graph into another. – Alberto Sanfeliu and K.S. Fu, “A Distance Measure between Attributed Relational Graphs for Pattern Recognition”, IEEE Transactions on Systems, Man, and Cybernetics (SMC), Vol. 13, No. 3, pp. 353-362, 1983. • Edit operations consist of insertions, deletions and substitutions of nodes and edges (atoms and bonds).
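For very small graphs the definition above can be evaluated directly by trying every atom-to-atom assignment. The sketch below assumes unit cost for every insertion, deletion and substitution, which the slides do not state, and it is exponential in the number of atoms, so it is only a way to make the definition concrete. It is not how SmallWorld computes distances. The graph encoding (a list of atom labels plus a dict of bonds) and the helper names are assumptions for this example.

```python
from itertools import permutations

def graph_edit_distance(nodes1, edges1, nodes2, edges2):
    """Exact unit-cost GED by brute force (tiny graphs only).

    nodes: list of atom labels; edges: dict {(i, j): bond_label} with i < j.
    """
    n = max(len(nodes1), len(nodes2))
    # Pad the smaller graph with dummy atoms; mapping to a dummy is an insert/delete.
    a = list(nodes1) + [None] * (n - len(nodes1))
    b = list(nodes2) + [None] * (n - len(nodes2))
    best = None
    for perm in permutations(range(n)):
        # Node costs: substitution, insertion or deletion each cost 1.
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        # Re-express graph 1's bonds in graph 2's numbering.
        mapped = {tuple(sorted((perm[i], perm[j]))): lab for (i, j), lab in edges1.items()}
        # Edge costs: any bond missing on one side or with a different label costs 1.
        for key in set(mapped) | set(edges2):
            if mapped.get(key) != edges2.get(key):
                cost += 1
        best = cost if best is None else min(best, cost)
    return best

def ring(n, label="ar"):
    """Bonds of an n-membered ring, keyed by sorted atom index pairs."""
    return {tuple(sorted((i, (i + 1) % n))): label for i in range(n)}

if __name__ == "__main__":
    benzene = (["C"] * 6, ring(6))
    pyridine = (["N"] + ["C"] * 5, ring(6))
    chlorobenzene = (["C"] * 6 + ["Cl"], {**ring(6), (0, 6): "single"})
    print(graph_edit_distance(*benzene, *pyridine))
    print(graph_edit_distance(*benzene, *chlorobenzene))
```

Under these unit costs, benzene to pyridine is a single atom substitution, and benzene to chlorobenzene is one atom insertion plus one bond insertion.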
  • 27. Example edit operations Benzene Pyridine Chlorobenzene Fluorobenzene Benzoxazole Benzothiazole Benzothiazole
  • 28. Example edit operations Benzene Cyclohexane Thiazole Tetrahydrofuran Histidine Histidine Zwitterion
  • 29. Example edit operations Ticlopidine Clopidogrel Penicillin G Amoxicillin
  • 30. Example edit operations Sildenafil (Viagra) Vardenafil (Levitra) Sumatriptan (Imitrex) Zolmitriptan (Zomig)
  • 31. Advantages over fingerprints • FP similarity is based on “local” substructures. • FP saturation of features/chemical space. – Many peptides/proteins/nucleic acids have identical FPs. – For alkanes, C16 should be more similar to C18 than to C20. – Identical FPs in Chemistry Toolkit Rosetta benchmark. – PubChem “similar compounds” uses 90% threshold. • FPs make no distinction between atom type changes. – Chlorine to bromine is more conservative than HBD to HBA. – Tautomers/protonation states often have low similarity. – FPs are more sensitive to normalization/standardization. • Stereochemistry is poorly handled by FPs. – Either not represented or isomers have low similarity.
  • 32. Recent RSC/GDB example • Bizzini et al., “Synthesis of trinorbornane”, Chemical Communications, 26 September 2017. – “This new rigid structural type was found to be present in the computer generated GDB and has until now no real-world counterpart”. Tetracyclo[5.2.2.0^{1,6}.0^{4,9}]undecane. 1-Azonia-tetracyclo[5.2.2.0^{1,6}.0^{4,9}]undecane: PubChem CID 90865661 (not in GDB!), substructure of Daphniglaucin A and B, Org. Lett. 5(10):1733-1736, 2003.
  • 33. Topological edit/edge types
  • 34. Classic Ex Scientia example. Edit path: MAJ, 2×LUP, LDN, 2×TUP, MAJ, LUP. Total Distance: 8. Total Topological Distance: 6. Besnard, Hopkins et al., Nature, 492:215-220, Dec 2012.
  • 35. SmallWorld search. SmallWorld lattice: bold circles denote indexed molecules, thin circles represent virtual subgraphs.
  • 36. SmallWorld search. The solid circle denotes a query structure, which may be either an indexed molecule or a virtual subgraph.
  • 37. SmallWorld search. The first iteration of the search adds the neighbors of the query to the “search wavefront”.
  • 38. SmallWorld search. Each subsequent iteration propagates the wavefront by considering the unvisited neighbors of the wavefront.
  • 39. SmallWorld search. At each iteration, “hits” are reported as the set of indexed molecules that are members of the wavefront.
  • 40. SmallWorld search. The search terminates once sufficient indexed neighbors have been found (or a suitable iteration limit is reached).
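The wavefront search described on slides 35 to 40 is a breadth-first traversal that reports indexed nodes as it goes. The following sketch works on a small in-memory adjacency dictionary; the function name, the `max_hits` and `max_iters` parameters, and the choice to report an indexed query at distance 0 are assumptions for illustration, not the SmallWorld implementation.

```python
def wavefront_search(neighbors, indexed, query, max_hits=10, max_iters=4):
    """Breadth-first wavefront search over a lattice of subgraph nodes.

    neighbors: dict mapping each node to its adjacent nodes.
    indexed:   set of nodes that correspond to real (indexed) molecules.
    Returns a list of (node, distance) hits in order of discovery.
    """
    visited = {query}
    wavefront = {query}
    # Assumption: an indexed query counts as a hit at distance 0.
    hits = [(query, 0)] if query in indexed else []
    for depth in range(1, max_iters + 1):
        if len(hits) >= max_hits or not wavefront:
            break
        # Propagate: the new wavefront is every unvisited neighbor of the old one.
        next_front = set()
        for node in wavefront:
            for nb in neighbors.get(node, ()):
                if nb not in visited:
                    visited.add(nb)
                    next_front.add(nb)
        wavefront = next_front
        # Report indexed molecules that are members of the wavefront.
        hits.extend((n, depth) for n in sorted(wavefront) if n in indexed)
    return hits

if __name__ == "__main__":
    lattice = {
        "q": ["a", "c"], "a": ["q", "B"], "c": ["q", "D"],
        "B": ["a"], "D": ["c", "E"], "E": ["D"],
    }
    print(wavefront_search(lattice, {"B", "D", "E"}, "q", max_hits=2))
```

Because the wavefront expands one edit at a time, hits come out ordered by distance from the query, and the cost depends on the size of the neighborhood explored rather than on the size of the whole database.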
  • 41. SmallWorld search
  • 42. SmallWorld search
  • 46. Current database statistics • As of October 2017, the SmallWorld index has • 68,921,678,269 nodes (~69B or ~2^36 nodes) • 258,787,077,793 edges (~259B or ~2^38 edges) – 128,762,041,180 ring edges. – 95,709,763,280 terminal edges. – 34,315,273,333 linker edges. • Average degree (fan-out) of node: ~7.5 • 8.22B acyclic nodes, 7.12B have a single ring.
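The figures on slide 46 are mutually consistent, which a few lines of arithmetic confirm. The sketch assumes the quoted fan-out counts every edge at both of its endpoints (2 × edges / nodes); the slide does not spell this out.

```python
import math

NODES = 68_921_678_269
EDGES_BY_TYPE = {
    "ring": 128_762_041_180,
    "terminal": 95_709_763_280,
    "linker": 34_315_273_333,
}

total_edges = sum(EDGES_BY_TYPE.values())
avg_degree = 2 * total_edges / NODES  # assumption: each edge counted at both ends
log2_nodes = math.log2(NODES)
log2_edges = math.log2(total_edges)

print(f"total edges:    {total_edges:,}")
print(f"average degree: {avg_degree:.2f}")
print(f"log2(nodes) = {log2_nodes:.2f}, log2(edges) = {log2_edges:.2f}")
```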
  • 47. Graph database fabrication
      • The “raw” source representation of SmallWorld is 28.7 TB of data, one ASCII line (of two SMILES) for each edge, i.e. 259 billion text lines.
      • Hypothetically, these 259B triples could be loaded into a database such as Oracle, Virtuoso or Neo4j.
      • Instead, we “compile” this graph database down to a 5 TB form that is very efficiently searched at run-time.
      • This 5 TB can be delivered to customers on a £150 external USB disk (like a subscription service).
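As a rough sense of scale, the sizes on this slide can be turned into per-edge figures. The sketch assumes decimal terabytes (10^12 bytes), which the slide does not specify, and the compiled figure is an amortized one, since the 5 TB form presumably also holds node data and indices.

```python
TB = 10**12                      # assumption: decimal terabytes
edges = 258_787_077_793          # edge count from the statistics slide

raw_bytes = 28.7 * TB            # one ASCII line of two SMILES per edge
compiled_bytes = 5 * TB          # compiled run-time form

raw_bytes_per_line = raw_bytes / edges
compiled_bytes_per_edge = compiled_bytes / edges
size_reduction = raw_bytes / compiled_bytes

print(f"raw: {raw_bytes_per_line:.0f} bytes per line")
print(f"compiled: {compiled_bytes_per_edge:.1f} bytes per edge (amortized)")
print(f"reduction: {size_reduction:.2f}x")
```

On these assumptions the raw form averages a little over 110 bytes per edge line, and the compiled form a little under 20 bytes per edge.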
  • 48. implementation features 2017
      • High-performance graph canonicalization/enumeration.
      • “Umbrella” sampling of chemical space (<100 bonds).
      • Database partitioning by Bond/Ring count.
      • Integer node indices and User-Database mapping.
      • Adjacency matrix in Compressed Sparse Row (CSR) format.
      • Bloom filter hash join of mapped user-databases.
      • Custom multigram SMILES compression (Sayle 2001).
      • Multiplicative Binary Search index lookup.
      • Key-length hashing for index lookup (fixed field length).
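Of the features listed, the Compressed Sparse Row (CSR) adjacency layout is the easiest to illustrate. The sketch below is a generic textbook CSR built with plain Python lists, not SmallWorld's on-disk format; the function names `build_csr` and `csr_neighbors` are assumptions made for the example.

```python
from typing import List, Tuple


def build_csr(num_nodes: int, edges: List[Tuple[int, int]]):
    """Build CSR arrays (row_ptr, col_idx) for an undirected graph."""
    # Count the degree of every node.
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Prefix sums: row_ptr[i] is where node i's neighbor list starts.
    row_ptr = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        row_ptr[i + 1] = row_ptr[i] + degree[i]
    # Scatter each edge into both endpoint rows.
    col_idx = [0] * row_ptr[-1]
    fill = row_ptr[:-1].copy()   # next free slot in each row
    for u, v in edges:
        col_idx[fill[u]] = v
        fill[u] += 1
        col_idx[fill[v]] = u
        fill[v] += 1
    return row_ptr, col_idx


def csr_neighbors(row_ptr: List[int], col_idx: List[int], node: int) -> List[int]:
    """Neighbors of `node` are one contiguous slice of col_idx."""
    return col_idx[row_ptr[node]:row_ptr[node + 1]]


# Four nodes with integer indices, as in a single partition.
row_ptr, col_idx = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
nbrs_of_2 = csr_neighbors(row_ptr, col_idx, 2)
```

With integer node indices, the whole adjacency structure reduces to two flat integer arrays, and listing a node's neighbors is a single contiguous read.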
  • 49. database partitioning
      • Instead of treating the database as a single monolithic entity, the nodes are partitioned by their atom, bond and ring counts.
      • This results in 2406 partitions, named BxRy where x is the number of bonds, y is the number of rings.
      • Each edge links vertices in neighboring partitions.
          – A tdn edge from BxRy leads to B(x-1)Ry, tup to B(x+1)Ry.
          – A rdn edge from BxRy leads to B(x-1)R(y-1), rup to B(x+1)R(y+1).
          – A ldn edge from BxRy leads to B(x-1)Ry, lup to B(x+1)Ry.
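The partition rules on this slide amount to a small lookup table of bond and ring deltas. The sketch below encodes exactly the six edge types named on the slide; the function name `neighbor_partition` and the error handling are assumptions added for illustration.

```python
import re

# (bond delta, ring delta) for each edge type, as listed on the slide.
EDGE_DELTAS = {
    "tdn": (-1, 0), "tup": (+1, 0),     # terminal edges
    "rdn": (-1, -1), "rup": (+1, +1),   # ring edges
    "ldn": (-1, 0), "lup": (+1, 0),     # linker edges
}


def neighbor_partition(partition: str, edge_type: str) -> str:
    """Return the BxRy partition reached from `partition` via `edge_type`."""
    match = re.fullmatch(r"B(\d+)R(\d+)", partition)
    if match is None:
        raise ValueError(f"not a BxRy partition name: {partition!r}")
    bonds, rings = int(match[1]), int(match[2])
    d_bonds, d_rings = EDGE_DELTAS[edge_type]
    new_bonds, new_rings = bonds + d_bonds, rings + d_rings
    if new_bonds < 0 or new_rings < 0:
        raise ValueError(f"{edge_type} edge is not possible from {partition}")
    return f"B{new_bonds}R{new_rings}"


down_ring = neighbor_partition("B6R1", "rdn")
up_terminal = neighbor_partition("B13R1", "tup")
```

Every edge changes the bond count by exactly one, so a search only ever needs to consult the partitions adjacent to the one it is currently in.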
  • 50. Heatmap of smallworld universe (figure; axes: Bonds →, Rings →)
  • 51. integer indices and DB mapping
      • Inside each partition, each node is assigned a unique sequential integer index.
          – c1ccccc1 → *1*****1 ↔ B6R1.13
          – n1cnccc1 → *1*****1 ↔ B6R1.13
      • Each edge is then represented by two integers.
      • SmallWorld is a “type domain index” over graphs.
      • User databases are represented as “mappings”.
          – B13R1.834 CC(=O)Oc1ccccc1C(=O)O CHEMBL1697753
          – B14R4.107563 C1C[N+]23CCC4C2CC1C3C4 CID90865661
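The node identifiers and mapping lines shown on this slide can be read with a short parser. The sketch assumes a whitespace-separated three-column layout (node identifier, SMILES, database identifier), inferred from the two example lines; the names `NodeId`, `parse_node_id` and `load_mapping` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class NodeId(NamedTuple):
    bonds: int   # x in BxRy
    rings: int   # y in BxRy
    index: int   # sequential integer index within the partition


def parse_node_id(text: str) -> NodeId:
    """Parse an identifier such as 'B13R1.834'."""
    match = re.fullmatch(r"B(\d+)R(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"not a node identifier: {text!r}")
    return NodeId(int(match[1]), int(match[2]), int(match[3]))


def load_mapping(lines: Iterable[str]) -> Dict[NodeId, List[Tuple[str, str]]]:
    """Group user-database records by the SmallWorld node they map to."""
    mapping: Dict[NodeId, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        node, smiles, identifier = line.split()
        mapping[parse_node_id(node)].append((smiles, identifier))
    return mapping


# The two mapping lines quoted on the slide.
mapping = load_mapping([
    "B13R1.834 CC(=O)Oc1ccccc1C(=O)O CHEMBL1697753",
    "B14R4.107563 C1C[N+]23CCC4C2CC1C3C4 CID90865661",
])
aspirin_node = parse_node_id("B13R1.834")
```

Several molecules can share one node, as benzene and pyrimidine do on the slide, which is why each node maps to a list of records rather than a single one.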
  • 52. SmallWorld Density heatmaps
  • 53. conclusions
      • Larger data sets reveal patterns, trends and insights previously invisible to chemists from small samples.
      • As chemical data sets grow exponentially, they form pain points for traditional processing: Big Data.
      • Theoretical computer science and AI can provide new algorithms to process more data more efficiently.
      • The sub-linear behaviour of SmallWorld’s nearest neighbor similarity makes it faster than fingerprint-based methods on (sufficiently) large data sets.
  • 54. acknowledgements
      • In memoriam Andy Grant, thank you for everything.
      • AstraZeneca R&D, Alderley Park, U.K.
      • GlaxoSmithKline, Stevenage, U.K.
      • Relay Therapeutics, Boston, U.S.A.
      • Eli Lilly, Indianapolis, U.S.A.
      • Hoffmann-La Roche, Basel, Switzerland.
      • Jose Batista, OpenEye Scientific Software, Germany.
      • Jameed Hussain, Chemical Computing Group, U.K.
      • Frey group, University of Southampton, U.K.
      • Thank you for your time. Any questions?
  • 55. J. Andrew Grant (1963-2012) Andy and I at OpenEye EuroCUP 2008.