Evaluating the quality and performance of automatic atom mapping algorithms
Daniel Lowe and Roger Sayle, NextMove Software Ltd, Cambridge, UK
daniel@nextmovesoftware.co.uk

1. Introduction

  Automatic atom mapping algorithms work on chemical reactions to produce
  mappings between the atoms in the reactants and the atoms in the product(s).

  [Figure: example reaction, labelled "Mapping algorithm"]

  In some cases, such as the above example, a reactant may appear multiple
  times in the product because the input lacks the exact stoichiometry.

  Chemically plausible mappings are potentially useful for:
  • Assigning roles to reagents, hence allowing determination of whether they
    are best placed above the reaction arrow
  • Reaction normalization for registration
  • Performing more precise database searches
  • Identifying suspect reactions, e.g. those where a reactant is missing

  In this work we investigated the strengths and weaknesses of currently
  available solutions to this problem.

2. Methodology

  We evaluated the following algorithms on four sets of reactions:

  Vendor:Program                           Version
  ChemAxon:Marvin[1]                       5.10.1
  GGA:Indigo[2]                            1.1
  InfoChem:ICMAP[3]                        5.10
  PerkinElmer:ChemDraw Ultra[4]            12.0
  Accelrys:Pipeline Pilot[5]               8.5.0.200
  Accelrys:Cheshire Advanced Edition[6]    5.0.0.0.1019

  Test set                                                          Reactions
  Pharmaceutical ELN subset                                            18,244
  ChemReact68 database                                                 67,926
  SPRESI database subset                                                5,230
  Reactions extracted from 2008-2011 USPTO patent applications[7]     562,872

  Two configurations were used with Indigo: one with the default mapping
  settings and one with more lenient settings for matching valences, charges
  and bond orders. In both cases a 60 second timeout was explicitly specified.
  Marvin was configured to use its best quality mapping strategy.
  Input and output were reaction SMILES, with two exceptions: ICMAP required
  conversion to and from RDF, and Cheshire and Pipeline Pilot required
  conversion of their RDF output to SMILES.

3. Results

  Where reactions are valid, an ideal algorithm will be able to find a mapping
  for every product atom:

  [Figure: percent of reactions with all product atoms mapped (axis 0 to 90),
  for each test set (PharmaELN, ChemReact68, SPRESI, USPTO). Series: Marvin 5.10,
  ChemDraw 12, Indigo 1.1, Indigo 1.1 (lenient), ICMAP 5.10, Pipeline Pilot,
  Cheshire]

  The mapping should also be chemically reasonable. Heuristically this may be
  evaluated by comparing the number of C-C bonds broken. A complete mapping
  with fewer C-C bond breakages is more likely to be correct.

  [Figure: average number of C-C bonds broken per mapping (axis 0.0 to 1.2),
  for each test set (PharmaELN, ChemReact68, SPRESI, USPTO). Series: Marvin 5.10,
  ChemDraw 12, Indigo 1.1, Indigo 1.1 (lenient), ICMAP 5.10, Pipeline Pilot,
  Cheshire]

4. Discussion

  ICMAP was found to produce the most chemically plausible atom mappings,
  whilst Pipeline Pilot was able to successfully map the most reactions. An
  alternative measure of plausibility based on the energy of the bonds broken
  was also tested and found to correlate strongly with the number of C-C bonds
  broken.

  Reuse of reactants was not supported by Marvin, ChemDraw, Pipeline Pilot or
  Cheshire, and was often not performed correctly by Indigo, leading to
  clearly incorrect mappings:

  [Figure: example of bad mapping. All algorithms other than ICMAP placed atom
  maps on the pyridine]

  A significant limitation found in ICMAP was the mapping of single atoms:

  [Figure: example of incomplete mapping from ICMAP. Note that in this case
  algorithms supporting single atom mapping incorrectly picked the methyl
  from the Et2Zn!]

  Although speed was not quantitatively evaluated, the algorithms differed
  significantly. The USPTO set could be processed in hours through ChemDraw
  or ICMAP and in a day through Marvin, but took weeks through Indigo.

5. Conclusions

  ICMAP and Pipeline Pilot produced the best results, with the trade-off
  between recall and precision determining which would be most appropriate
  to a given task. Of the solutions tested only Indigo is freely available; when
  configured appropriately it produced adequate results.

6. Acknowledgements

  We would like to thank InfoChem for providing an evaluation of ICMAP, and
  Hans Kraut and Nick Tomkinson for assistance and comments on this work.

7. Bibliography

  1.   http://www.chemaxon.com/products/marvin
  2.   http://ggasoftware.com/opensource/indigo
  3.   http://infochem.de/products/software/icmap.shtml
  4.   http://www.cambridgesoft.com/software/chemdraw
  5.   http://accelrys.com/products/pipeline-pilot/
  6.   http://accelrys.com/products/informatics/cheminformatics/accelrys-cheshire.html
  7.   Lowe, D. M. Automated Extraction of Reactions from the Patent Literature. 243rd
       ACS National Meeting & Exposition, San Diego, CA, March 27, 2012.

NextMove Software Limited
Innovation Centre (Unit 23), Cambridge Science Park
Milton Road, Cambridge, England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com
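
The two measures reported in the Results section (whether every product atom is mapped, and how many C-C bonds a mapping breaks) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the poster. The function names are hypothetical, the SMILES handling is a simple regular-expression tokenizer rather than a full parser, and bonds are supplied by the caller as pairs of atom-map numbers. The rule that a reactant C-C bond counts as broken only when at least one of its atoms appears in the product is also an assumption, since the poster does not say how unmapped by-products were treated.

```python
import re

# Bracket atoms, or organic-subset atoms written without brackets.
_ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# An atom-map number is the ":n" immediately before the closing bracket.
_MAP = re.compile(r":(\d+)\]$")


def product_fully_mapped(rxn_smiles):
    """Return True if every product atom carries a non-zero atom-map number."""
    product_side = rxn_smiles.split()[0].split(">")[2]
    atoms = _ATOM.findall(product_side)
    if not atoms:
        return False
    for atom in atoms:
        match = _MAP.search(atom)
        # Atoms outside brackets cannot carry a map number at all.
        if match is None or int(match.group(1)) == 0:
            return False
    return True


def percent_fully_mapped(reactions):
    """Percentage of reaction SMILES whose product atoms are all mapped."""
    return 100.0 * sum(product_fully_mapped(r) for r in reactions) / len(reactions)


def cc_bonds_broken(elements, reactant_bonds, product_bonds):
    """Count reactant C-C bonds that are absent from the product.

    elements maps atom-map number -> element symbol; bonds are pairs of
    atom-map numbers. Assumption: a bond is only counted when at least one
    of its atoms appears in the product, so bonds inside an unrecorded
    by-product are ignored.
    """
    product = {frozenset(bond) for bond in product_bonds}
    product_atoms = set().union(*product)
    broken = 0
    for a, b in reactant_bonds:
        if elements[a] != "C" or elements[b] != "C":
            continue
        touches_product = a in product_atoms or b in product_atoms
        if touches_product and frozenset((a, b)) not in product:
            broken += 1
    return broken


# Illustrative data: ethyl acetate (atoms 1-6) + methylamine (7-8)
# giving N-methylacetamide.
elements = {1: "C", 2: "C", 3: "O", 4: "O", 5: "C", 6: "C", 7: "C", 8: "N"}
reactant_bonds = [(1, 2), (2, 3), (2, 4), (4, 5), (5, 6), (7, 8)]
good_mapping = [(1, 2), (2, 3), (2, 8), (8, 7)]  # acetyl methyl retained
bad_mapping = [(6, 2), (2, 3), (2, 8), (8, 7)]   # ethyl CH3 mapped onto the acetyl methyl
```

With this made-up example, the chemically sensible mapping breaks no C-C bonds while the implausible one breaks two, which is the kind of difference the heuristic is meant to expose. A cheminformatics toolkit would normally supply the bond lists; they are written by hand here to keep the sketch self-contained.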
