SlideShare uma empresa Scribd logo
1 de 1
Baixar para ler offline
Smallworld : Efficient maximum common
substructure searching of large databases
Roger Sayle, Jose Batista and Andrew Grant
NextMove Software Ltd, Cambridge, UK and AstraZeneca R&D, Alderley Park, UK
10. Bibliography
1. Alberto Sanfeliu and K.S. Fu, “A Distance Measure between Attributed Relational Graphs
for Pattern Recognition”, IEEE Transactions of Systems, Man and Cybernetics (SMC), Vol.
13, No. 3, pp. 353-363, 1983.
2. Horst Bunke, “On a Relation between Graph Edit Distance and Maximum Common
Subgraph”, Pattern Recognition Letters, Vol. 18, No. 8, pp. 689-694, 1997.
3. Horst Bunke and Kim Shearer, “A Graph Distance Metric Based on the Maximal Common
Subgraph”, Pattern Recognition Letters, Vol. 19, pp. 255-259, 1998.
4. Bruno T. Messmer and Horst Bunke, “Subgraph Isomorphism in Polynomial Time”,
Technical Report, University of Berlin, 1995.
5. Bruno T. Messmer and Horst Bunke, “Subgraph Isomorphism Detection in Polynomial
Time on Preprocessed Model Graphs”, In Proceedings of the Asian Conference on
Computer Vision (ACCV), pp. 151-155, 1995.
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com
1. Overview
We report a novel chemical database search method based upon explicit
representation of chemical space. A pre-computed index allows the exact size of
the maximum common edge subgraph (MCES) between a query molecule and
molecules in the index to be calculated rapidly. In practice, this allows the 100
nearest neighbours having the largest MCES to a query molecule to be
determined in a few seconds, even for target databases containing millions of
molecules. This work builds upon the previous efforts of Wipke and Rogers in the
late 1980s and of Messmer and Bunke in the 1990s, but takes advantage of the
dramatic advances in parallel processing and storage technology now available to
researchers.
7. SmallWorld Database Index Size
Here we describe some features of a small SmallWorld index used as a proof-of-
concept and to prototype alternative database construction, incremental update,
maintenance and search implementations.
This “anonymous” (element and bond order suppressed) database was seeded
with the contents of the NCBI PubChem and Accelrys MDDR databases. The
resulting (incompletely enumerated) index contained 69 million nodes and 557
million edges, requiring around 40 Gbytes of disk space. Projections based on
these experiments indicate a 95% indexed pharmaceutical registry database of a
few million compounds should fit in a few terabytes of storage [costing a few
hundred dollars at current prices].
The average degree of each node is between 16 and 17 edges. Of the 69 million
nodes, 27 million represent acyclic graphs, and 21 million represent graphs with a
single ring. Of the 557 million edges, 205 million are ring deletion edges and 352
are terminal bond deletion edges.
A Briem & Lessel benchmark run, searching for the (up to) 10 nearest neighbours
of each of 380 drug-sized query molecules takes 8 minutes 42 seconds (less than
1.4 seconds per query) on a single thread on an Intel i7-2600 with 16 Gbytes RAM.
The median distance for each search to search was 8 edits, with the shortest
search finishing after just two edits and the furthest search reaching 24 edits.
Run times can be reduced to 7 minutes, by limiting searches to 10 edits. The
worst-case “wave-front” size was 6,624,624 nodes which occurred at distance 9.
9. Conclusions
• The sub-linear behaviour of SmallWorld’s nearest neighbour search makes it
faster than fingerprint-based similarity methods for sufficiently large data sets.
• This is thanks to the “blessing of dimensionality”
• With continual advances in computer hardware, SmallWorld-like approaches
are likely to become the basis of most chemical similarity calculations within a
decade.
7. SmallWorld Database Index Statistics
1. Abstract
8. Conclusions
10. Bibliography
2. 2D Chemical Similarity
Most 2D chemical similarity methods, including fingerprints, may be considered
“local” methods, where a molecular graph is reduced to a vector of features, and
similarity between these vectors treated as a surrogate of graph similarity. This
reduction to 1D similarity results in a number of artefacts; binary fingerprints such
as ECFP, MACCS keys and Daylight fingers saturate and fail to distinguish alkanes,
peptides or nucleic acids. “All the right notes, not necessarily the right order”.
These artefacts mean existing method don’t always match chemist’s intuition.
2. 2D Chemical Similarity
3. Analogy to Evolution of Bioinformatics
The mathematic concept of string edit distance as a similarity measure between
strings was introduced by Vladimir Levenshtein in 1965, as the minimum number
of insertions, deletions and substitutions to transform one string into another.
Thanks to the efficient dynamic programming algorithms of Needleman and
Wunsh, and Smith and Waterman, edit distance now underpins modern biological
sequence comparison. Prior to this, methods were restricted to “local” heuristic
methods, comparing vectors of k-tuples (short runs of characters).
3. Analogy to Evolution of Bioinformatics
6. SmallWorld Chemical Universe
Daylight Tanimoto: 0.44 Daylight Tanimoto: 0.39
4. Graph Edit Distance (GED) and Maximum Common Subgraph (MCS)
Graph Edit Distance (GED) as an analogous concept for similarity between graphs
was introduced by Sanfeliu and Ku in 1983 [1], and was shown to be related to the
Maximum Common Subgraph (MCS) by Bunke in 1997 [2,3]. Given the size of the
MCS between two molecules one can determine the edit distance, and likewise
given the edit distance it is possible to calculate the size of the MCS. Although
intuitive to chemists, the NP-Hard computational complexity of calculating the
MCS of two molecules has traditionally prohibited its practical use in chemical
database searching.
4. Graph Edit Distance (GED) and Maximum Common Subgraph (MCS)
5. SmallWorld
NextMove Software’s SmallWorld finesses these issues by taking advantage of
recent advances in computer science and technology, using pre-computation to
make MCS/GED searching faster than fingerprint based searches. Conceptually,
SmallWorld explicitly encodes the relationship between molecular graphs and all
of their possible subgraphs. Edit Distance and MCS queries become “Big Data”
graph neighbour problems much like those encountered in social network sites
(like LinkedIn or Facebook), or Graph500 supercomputer benchmarks. Advanced
algorithms and current storage technology allow management of databases of
many billions of subgraphs, orders of magnitude larger than current chemical
databases.
Unique subgraph counts (SGs) of drug-like molecules, supressing bond orders and
either preserving (elem) or suppressing (anon) atomic numbers [i.e. topologies].
5. SmallWorld
Name Atoms MW Anon SGs Elem SGs
Aspirin 13 180 127 332
Ranitidine 21 314 436 1,207
Clopidrogel 21 322 10,071 22,170
Amlodipine 28 409 58,139 147,128
Lisinopril 29 405 24,619 34,496
Gefitinib 31 447 190,901 337,174
Atorvastatin 41 559 3,638,523 6,019,427
7 bonds
6 bonds
5 bonds
4 bonds
3 bonds
2 bonds
1 bond

Mais conteúdo relacionado

Destaque

Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?NextMove Software
 
Presentación1 (1)
Presentación1 (1)Presentación1 (1)
Presentación1 (1)Diana Andrea
 
last Weekend
last Weekendlast Weekend
last Weekendlizsol2
 
Unlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articlesUnlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articlesNextMove Software
 
CINF 51: Analyzing success rates of supposedly 'easy' reactions
CINF 51: Analyzing success rates of supposedly 'easy' reactionsCINF 51: Analyzing success rates of supposedly 'easy' reactions
CINF 51: Analyzing success rates of supposedly 'easy' reactionsNextMove Software
 
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...NextMove Software
 
CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
CINF 18: Wikipedia and Wiktionary as resources for chemical text miningCINF 18: Wikipedia and Wiktionary as resources for chemical text mining
CINF 18: Wikipedia and Wiktionary as resources for chemical text miningNextMove Software
 
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...NextMove Software
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...NextMove Software
 
Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...NextMove Software
 
CINF 4: Naming algorithms for derivatives of peptide-like natural products
CINF 4: Naming algorithms for derivatives of peptide-like natural productsCINF 4: Naming algorithms for derivatives of peptide-like natural products
CINF 4: Naming algorithms for derivatives of peptide-like natural productsNextMove Software
 
Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...NextMove Software
 
Chemistry and reactions from non-US patents
Chemistry and reactions from non-US patentsChemistry and reactions from non-US patents
Chemistry and reactions from non-US patentsNextMove Software
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...NextMove Software
 
Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...NextMove Software
 
Extraction, analysis, atom mapping, classification and naming of reactions fr...
Extraction, analysis, atom mapping, classification and naming of reactions fr...Extraction, analysis, atom mapping, classification and naming of reactions fr...
Extraction, analysis, atom mapping, classification and naming of reactions fr...NextMove Software
 
Reactive Chemical Hazard Alerting in Pharmaceutical Notebooks
Reactive Chemical Hazard Alerting in Pharmaceutical NotebooksReactive Chemical Hazard Alerting in Pharmaceutical Notebooks
Reactive Chemical Hazard Alerting in Pharmaceutical NotebooksNextMove Software
 

Destaque (20)

Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?
 
Presentación1 (1)
Presentación1 (1)Presentación1 (1)
Presentación1 (1)
 
last Weekend
last Weekendlast Weekend
last Weekend
 
Managing Your Reputation Online
Managing Your Reputation OnlineManaging Your Reputation Online
Managing Your Reputation Online
 
Tutorial para ingreso a plataforma educativa moodle
Tutorial para ingreso a plataforma educativa moodleTutorial para ingreso a plataforma educativa moodle
Tutorial para ingreso a plataforma educativa moodle
 
Unlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articlesUnlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articles
 
CINF 51: Analyzing success rates of supposedly 'easy' reactions
CINF 51: Analyzing success rates of supposedly 'easy' reactionsCINF 51: Analyzing success rates of supposedly 'easy' reactions
CINF 51: Analyzing success rates of supposedly 'easy' reactions
 
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
 
CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
CINF 18: Wikipedia and Wiktionary as resources for chemical text miningCINF 18: Wikipedia and Wiktionary as resources for chemical text mining
CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
 
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...
 
Is 20TB really Big Data?
Is 20TB really Big Data?Is 20TB really Big Data?
Is 20TB really Big Data?
 
Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...
 
CINF 4: Naming algorithms for derivatives of peptide-like natural products
CINF 4: Naming algorithms for derivatives of peptide-like natural productsCINF 4: Naming algorithms for derivatives of peptide-like natural products
CINF 4: Naming algorithms for derivatives of peptide-like natural products
 
Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...
 
Chemistry and reactions from non-US patents
Chemistry and reactions from non-US patentsChemistry and reactions from non-US patents
Chemistry and reactions from non-US patents
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
 
Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...
 
Extraction, analysis, atom mapping, classification and naming of reactions fr...
Extraction, analysis, atom mapping, classification and naming of reactions fr...Extraction, analysis, atom mapping, classification and naming of reactions fr...
Extraction, analysis, atom mapping, classification and naming of reactions fr...
 
Reactive Chemical Hazard Alerting in Pharmaceutical Notebooks
Reactive Chemical Hazard Alerting in Pharmaceutical NotebooksReactive Chemical Hazard Alerting in Pharmaceutical Notebooks
Reactive Chemical Hazard Alerting in Pharmaceutical Notebooks
 

Semelhante a Smallworld : Efficient maximum common substructure searching of large databases

Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach IJECEIAES
 
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...csandit
 
Vol 8 No 1 - December 2013
Vol 8 No 1 - December 2013Vol 8 No 1 - December 2013
Vol 8 No 1 - December 2013ijcsbi
 
Maps of sparse memory networks reveal overlapping communities in network flows
Maps of sparse memory networks reveal overlapping communities in network flowsMaps of sparse memory networks reveal overlapping communities in network flows
Maps of sparse memory networks reveal overlapping communities in network flowsUmeå University
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSijscmcj
 
Applying association rules and co location techniques on geospatial web services
Applying association rules and co location techniques on geospatial web servicesApplying association rules and co location techniques on geospatial web services
Applying association rules and co location techniques on geospatial web servicesAlexander Decker
 
Exploring Architected Materials Using Machine Learning
Exploring Architected Materials Using Machine LearningExploring Architected Materials Using Machine Learning
Exploring Architected Materials Using Machine LearningAdvanced-Concepts-Team
 
Chapter 5 applications of neural networks
Chapter 5           applications of neural networksChapter 5           applications of neural networks
Chapter 5 applications of neural networksPunit Saini
 
Extending network lifetime of wireless sensor
Extending network lifetime of wireless sensorExtending network lifetime of wireless sensor
Extending network lifetime of wireless sensorIJCNCJournal
 
SIMTECT-2013-Images-That-Change-KP
SIMTECT-2013-Images-That-Change-KPSIMTECT-2013-Images-That-Change-KP
SIMTECT-2013-Images-That-Change-KPKurt Pudniks
 
Energy Efficient Optimal Paths Using PDORP-LC
Energy Efficient Optimal Paths Using PDORP-LCEnergy Efficient Optimal Paths Using PDORP-LC
Energy Efficient Optimal Paths Using PDORP-LCpaperpublications3
 
Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...
Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...
Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...AIRCC Publishing Corporation
 
APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...
APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...
APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...ijcsit
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...CSCJournals
 
Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)Austin Benson
 
Activity Context Modeling in Context-Aware
Activity Context Modeling in Context-AwareActivity Context Modeling in Context-Aware
Activity Context Modeling in Context-AwareEditor IJCATR
 

Semelhante a Smallworld : Efficient maximum common substructure searching of large databases (20)

1104.0355
1104.03551104.0355
1104.0355
 
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach
 
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
 
E04423133
E04423133E04423133
E04423133
 
Vol 8 No 1 - December 2013
Vol 8 No 1 - December 2013Vol 8 No 1 - December 2013
Vol 8 No 1 - December 2013
 
Sensors1(1)
Sensors1(1)Sensors1(1)
Sensors1(1)
 
Maps of sparse memory networks reveal overlapping communities in network flows
Maps of sparse memory networks reveal overlapping communities in network flowsMaps of sparse memory networks reveal overlapping communities in network flows
Maps of sparse memory networks reveal overlapping communities in network flows
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
 
Applying association rules and co location techniques on geospatial web services
Applying association rules and co location techniques on geospatial web servicesApplying association rules and co location techniques on geospatial web services
Applying association rules and co location techniques on geospatial web services
 
Exploring Architected Materials Using Machine Learning
Exploring Architected Materials Using Machine LearningExploring Architected Materials Using Machine Learning
Exploring Architected Materials Using Machine Learning
 
Chapter 5 applications of neural networks
Chapter 5           applications of neural networksChapter 5           applications of neural networks
Chapter 5 applications of neural networks
 
Research Proposal
Research ProposalResearch Proposal
Research Proposal
 
Extending network lifetime of wireless sensor
Extending network lifetime of wireless sensorExtending network lifetime of wireless sensor
Extending network lifetime of wireless sensor
 
SIMTECT-2013-Images-That-Change-KP
SIMTECT-2013-Images-That-Change-KPSIMTECT-2013-Images-That-Change-KP
SIMTECT-2013-Images-That-Change-KP
 
Energy Efficient Optimal Paths Using PDORP-LC
Energy Efficient Optimal Paths Using PDORP-LCEnergy Efficient Optimal Paths Using PDORP-LC
Energy Efficient Optimal Paths Using PDORP-LC
 
Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...
Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...
Applying Genetic Algorithm to Solve Partitioning and Mapping Problem for Mesh...
 
APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...
APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...
APPLYING GENETIC ALGORITHM TO SOLVE PARTITIONING AND MAPPING PROBLEM FOR MESH...
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...
 
Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)
 
Activity Context Modeling in Context-Aware
Activity Context Modeling in Context-AwareActivity Context Modeling in Context-Aware
Activity Context Modeling in Context-Aware
 

Mais de NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 

Mais de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 

Último

Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfChristalin Nelson
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptxmary850239
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6Vanessa Camilleri
 
ARTERIAL BLOOD GAS ANALYSIS........pptx
ARTERIAL BLOOD  GAS ANALYSIS........pptxARTERIAL BLOOD  GAS ANALYSIS........pptx
ARTERIAL BLOOD GAS ANALYSIS........pptxAneriPatwari
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17Celine George
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 

Último (20)

Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdf
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6
 
ARTERIAL BLOOD GAS ANALYSIS........pptx
ARTERIAL BLOOD  GAS ANALYSIS........pptxARTERIAL BLOOD  GAS ANALYSIS........pptx
ARTERIAL BLOOD GAS ANALYSIS........pptx
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 

Smallworld : Efficient maximum common substructure searching of large databases

  • 1. Smallworld : Efficient maximum common substructure searching of large databases Roger Sayle, Jose Batista and Andrew Grant NextMove Software Ltd, Cambridge, UK and AstraZeneca R&D, Alderley Park, UK 10. Bibliography 1. Alberto Sanfeliu and K.S. Fu, “A Distance Measure between Attributed Relational Graphs for Pattern Recognition”, IEEE Transactions of Systems, Man and Cybernetics (SMC), Vol. 13, No. 3, pp. 353-363, 1983. 2. Horst Bunke, “On a Relation between Graph Edit Distance and Maximum Common Subgraph”, Pattern Recognition Letters, Vol. 18, No. 8, pp. 689-694, 1997. 3. Horst Bunke and Kim Shearer, “A Graph Distance Metric Based on the Maximal Common Subgraph”, Pattern Recognition Letters, Vol. 19, pp. 255-259, 1998. 4. Bruno T. Messmer and Horst Bunke, “Subgraph Isomorphism in Polynomial Time”, Technical Report, University of Berlin, 1995. 5. Bruno T. Messmer and Horst Bunke, “Subgraph Isomorphism Detection in Polynomial Time on Preprocessed Model Graphs”, In Proceedings of the Asian Conference on Computer Vision (ACCV), pp. 151-155, 1995. NextMove Software Limited Innovation Centre (Unit 23) Cambridge Science Park Milton Road, Cambridge England CB4 0EY www.nextmovesoftware.co.uk www.nextmovesoftware.com 1. Overview We report a novel chemical database search method based upon explicit representation of chemical space. A pre-computed index allows the exact size of the maximum common edge subgraph (MCES) between a query molecule and molecules in the index to be calculated rapidly. In practice, this allows the 100 nearest neighbours having the largest MCES to a query molecule to be determined in a few seconds, even for target databases containing millions of molecules. This work builds upon the previous efforts of Wipke and Rogers in the late 1980s and of Messmer and Bunke in the 1990s, but takes advantage of the dramatic advances in parallel processing and storage technology now available to researchers. 7. SmallWorld Database Index Size Here we describe some features of a small SmallWorld index used as a proof-of- concept and to prototype alternative database construction, incremental update, maintenance and search implementations. This “anonymous” (element and bond order suppressed) database was seeded with the contents of the NCBI PubChem and Accelrys MDDR databases. The resulting (incompletely enumerated) index contained 69 million nodes and 557 million edges, requiring around 40 Gbytes of disk space. Projections based on these experiments indicate a 95% indexed pharmaceutical registry database of a few million compounds should fit in a few terabytes of storage [costing a few hundred dollars at current prices]. The average degree of each node is between 16 and 17 edges. Of the 69 million nodes, 27 million represent acyclic graphs, and 21 million represent graphs with a single ring. Of the 557 million edges, 205 million are ring deletion edges and 352 are terminal bond deletion edges. A Briem & Lessel benchmark run, searching for the (up to) 10 nearest neighbours of each of 380 drug-sized query molecules takes 8 minutes 42 seconds (less than 1.4 seconds per query) on a single thread on an Intel i7-2600 with 16 Gbytes RAM. The median distance for each search to search was 8 edits, with the shortest search finishing after just two edits and the furthest search reaching 24 edits. Run times can be reduced to 7 minutes, by limiting searches to 10 edits. The worst-case “wave-front” size was 6,624,624 nodes which occurred at distance 9. 9. Conclusions • The sub-linear behaviour of SmallWorld’s nearest neighbour search makes it faster than fingerprint-based similarity methods for sufficiently large data sets. • This is thanks to the “blessing of dimensionality” • With continual advances in computer hardware, SmallWorld-like approaches are likely to become the basis of most chemical similarity calculations within a decade. 7. SmallWorld Database Index Statistics 1. Abstract 8. Conclusions 10. Bibliography 2. 2D Chemical Similarity Most 2D chemical similarity methods, including fingerprints, may be considered “local” methods, where a molecular graph is reduced to a vector of features, and similarity between these vectors treated as a surrogate of graph similarity. This reduction to 1D similarity results in a number of artefacts; binary fingerprints such as ECFP, MACCS keys and Daylight fingers saturate and fail to distinguish alkanes, peptides or nucleic acids. “All the right notes, not necessarily the right order”. These artefacts mean existing method don’t always match chemist’s intuition. 2. 2D Chemical Similarity 3. Analogy to Evolution of Bioinformatics The mathematic concept of string edit distance as a similarity measure between strings was introduced by Vladimir Levenshtein in 1965, as the minimum number of insertions, deletions and substitutions to transform one string into another. Thanks to the efficient dynamic programming algorithms of Needleman and Wunsh, and Smith and Waterman, edit distance now underpins modern biological sequence comparison. Prior to this, methods were restricted to “local” heuristic methods, comparing vectors of k-tuples (short runs of characters). 3. Analogy to Evolution of Bioinformatics 6. SmallWorld Chemical Universe Daylight Tanimoto: 0.44 Daylight Tanimoto: 0.39 4. Graph Edit Distance (GED) and Maximum Common Subgraph (MCS) Graph Edit Distance (GED) as an analogous concept for similarity between graphs was introduced by Sanfeliu and Ku in 1983 [1], and was shown to be related to the Maximum Common Subgraph (MCS) by Bunke in 1997 [2,3]. Given the size of the MCS between two molecules one can determine the edit distance, and likewise given the edit distance it is possible to calculate the size of the MCS. Although intuitive to chemists, the NP-Hard computational complexity of calculating the MCS of two molecules has traditionally prohibited its practical use in chemical database searching. 4. Graph Edit Distance (GED) and Maximum Common Subgraph (MCS) 5. SmallWorld NextMove Software’s SmallWorld finesses these issues by taking advantage of recent advances in computer science and technology, using pre-computation to make MCS/GED searching faster than fingerprint based searches. Conceptually, SmallWorld explicitly encodes the relationship between molecular graphs and all of their possible subgraphs. Edit Distance and MCS queries become “Big Data” graph neighbour problems much like those encountered in social network sites (like LinkedIn or Facebook), or Graph500 supercomputer benchmarks. Advanced algorithms and current storage technology allow management of databases of many billions of subgraphs, orders of magnitude larger than current chemical databases. Unique subgraph counts (SGs) of drug-like molecules, supressing bond orders and either preserving (elem) or suppressing (anon) atomic numbers [i.e. topologies]. 5. SmallWorld Name Atoms MW Anon SGs Elem SGs Aspirin 13 180 127 332 Ranitidine 21 314 436 1,207 Clopidrogel 21 322 10,071 22,170 Amlodipine 28 409 58,139 147,128 Lisinopril 29 405 24,619 34,496 Gefitinib 31 447 190,901 337,174 Atorvastatin 41 559 3,638,523 6,019,427 7 bonds 6 bonds 5 bonds 4 bonds 3 bonds 2 bonds 1 bond