Evaluating the quality and performance of automatic atom mapping algorithms
Daniel Lowe and Roger Sayle, NextMove Software Ltd, Cambridge, UK
daniel@nextmovesoftware.co.uk
1. Introduction
Automatic atom mapping algorithms work on chemical reactions to produce mappings between the atoms in the reactants and the atoms in the product(s).

[Figure: example reaction with atom mapping]

In some cases, such as the above example, a reactant may appear multiple times in the product because the input lacks the exact stoichiometry.

Chemically plausible mappings are potentially useful for:
• Assigning roles to reagents, hence allowing determination of whether they are best placed above the reaction arrow
• Reaction normalization for registration
• Performing more precise database searches
• Identifying suspect reactions, e.g. those where a reactant is missing

In this work we investigated the strengths and weaknesses of currently available solutions to this problem.

2. Methodology
We evaluated the following algorithms on four sets of reactions:

Vendor:Program                           Version
ChemAxon:Marvin[1]                       5.10.1
GGA:Indigo[2]                            1.1
InfoChem:ICMAP[3]                        5.10
PerkinElmer:ChemDraw Ultra[4]            12.0
Accelrys:Pipeline Pilot[5]               8.5.0.200
Accelrys:Cheshire Advanced Edition[6]    5.0.0.0.1019

Test set                                                           Reactions
Pharmaceutical ELN subset                                          18,244
ChemReact68 database                                               67,926
SPRESI database subset                                             5,230
Reactions extracted from 2008-2011 USPTO patent applications[7]    562,872

Two configurations were used with Indigo: one with the default mapping settings and one with more lenient settings for matching valences, charges and bond orders. In both cases a 60 second timeout was explicitly specified. Marvin was configured to use its best quality mapping strategy.

Input and output were reaction SMILES, with the exceptions of ICMAP, which required conversion to and from RDF, and Cheshire and Pipeline Pilot, which required conversion of their RDF output to SMILES.

3. Results
Where reactions are valid, an ideal algorithm will be able to find a mapping for every product atom:

[Figure: Percent of reactions with all product atoms mapped, per test set (PharmaELN, ChemReact68, SPRESI, USPTO), for each mapping algorithm: Marvin 5.10, ChemDraw 12, Indigo 1.1, Indigo 1.1 (lenient), ICMAP 5.10, Pipeline Pilot, Cheshire.]

The mapping should be chemically reasonable. Heuristically, this may be evaluated by comparing the number of C-C bonds broken. A complete mapping with fewer C-C bond breakages is more likely to be correct.

[Figure: Average number of C-C bonds broken per mapping, per test set (PharmaELN, ChemReact68, SPRESI, USPTO), for each mapping algorithm: Marvin 5.10, ChemDraw 12, Indigo 1.1, Indigo 1.1 (lenient), ICMAP 5.10, Pipeline Pilot, Cheshire.]

4. Discussion
ICMAP was found to produce the most chemically plausible atom mappings, whilst Pipeline Pilot was able to successfully map the most reactions. An alternative measure of plausibility based on the energy of bonds broken was also tested and found to correlate strongly with the number of C-C bonds broken.

Reuse of reactants was found not to be supported by Marvin, ChemDraw, Pipeline Pilot or Cheshire, and was often not performed correctly by Indigo, leading to clearly incorrect mappings:

[Figure: Example of bad mapping. All algorithms other than ICMAP placed atom maps on the pyridine.]

A significant limitation found in ICMAP was the mapping of single atoms:

[Figure: Example of incomplete mapping from ICMAP. Note that in this case algorithms supporting single atom mapping incorrectly picked the methyl from the Et2Zn!]

Although not quantitatively evaluated, the speed of the algorithms proved to be significantly different. The USPTO set could be processed in hours through ChemDraw or ICMAP, in a day through Marvin, but took weeks through Indigo.

8. Conclusions
ICMAP and Pipeline Pilot produced the best results, with the trade-off between recall and precision determining which would be most appropriate to a given task. Of the solutions tested, only Indigo is freely available; when configured appropriately it produced adequate results.

9. Acknowledgements
We would like to thank InfoChem for providing an evaluation of ICMAP, and Hans Kraut and Nick Tomkinson for assistance and comments on this work.

10. Bibliography
1. http://www.chemaxon.com/products/marvin
2. http://ggasoftware.com/opensource/indigo
3. http://infochem.de/products/software/icmap.shtml
4. http://www.cambridgesoft.com/software/chemdraw
5. http://accelrys.com/products/pipeline-pilot/
6. http://accelrys.com/products/informatics/cheminformatics/accelrys-cheshire.html
7. Lowe, D. M. Automated Extraction of Reactions from the Patent Literature. 243rd ACS National Meeting & Exposition, San Diego, CA, March 27, 2012.
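
Supplementary sketch. The two measures used in the Results section (the percent of reactions with all product atoms mapped, and the number of C-C bonds broken by a mapping) can be illustrated with a minimal Python sketch. This is not the code used for the evaluation. All function names are hypothetical. The regular-expression tokenizer is a simplification that only recognises bracket atoms and organic-subset atoms, and the bond lists are supplied by hand as pairs of atom-map numbers rather than parsed from SMILES. In practice a cheminformatics toolkit would do the parsing.

```python
import re

# One atom token in a SMILES string: a bracket atom, or an organic-subset
# atom written without brackets (two-letter symbols are tried first).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Atom-map number at the end of a bracket atom, e.g. the 3 in [CH3:3].
MAP_NUMBER = re.compile(r":(\d+)\]$")


def map_number(atom_token):
    """Return the atom-map number of a token, or 0 if it is unmapped."""
    match = MAP_NUMBER.search(atom_token)
    return int(match.group(1)) if match else 0


def is_fully_mapped(reaction_smiles):
    """True if every atom written on the product side carries a map number.

    Assumes the 'reactants>agents>products' layout of reaction SMILES.
    """
    products = reaction_smiles.split()[0].split(">")[2]
    atoms = ATOM_TOKEN.findall(products)
    return bool(atoms) and all(map_number(a) > 0 for a in atoms)


def percent_fully_mapped(reactions):
    """Percent of reactions in which all product atoms are mapped."""
    return 100.0 * sum(is_fully_mapped(r) for r in reactions) / len(reactions)


def cc_bonds_broken(elements, reactant_bonds, product_bonds):
    """Count reactant C-C bonds that are absent from the product.

    elements maps atom-map number -> element symbol. Bonds are pairs of
    atom-map numbers, so only bonds between mapped atoms are considered
    (an assumption of this sketch). Reactant reuse is not handled.
    """
    in_product = {frozenset(bond) for bond in product_bonds}
    in_reactant = {frozenset(bond) for bond in reactant_bonds}
    return sum(
        1
        for bond in in_reactant
        if all(elements[i] == "C" for i in bond) and bond not in in_product
    )


# Illustrative data: acetyl chloride + methylamine -> N-methylacetamide.
complete = "[CH3:1][C:2](=[O:3])Cl.[NH2:4][CH3:5]>>[CH3:1][C:2](=[O:3])[NH:4][CH3:5]"
partial = "[CH3:1][C:2](=[O:3])Cl.NC>>[CH3:1][C:2](=[O:3])NC"
elements = {1: "C", 2: "C", 3: "O", 4: "N", 5: "C"}
reactant_bonds = [(1, 2), (2, 3), (4, 5)]
plausible_product = [(1, 2), (2, 3), (2, 4), (4, 5)]
# Same product, but with the two methyl carbons swapped in the mapping.
implausible_product = [(5, 2), (2, 3), (2, 4), (4, 1)]

print(percent_fully_mapped([complete, partial]))
print(cc_bonds_broken(elements, reactant_bonds, plausible_product))
print(cc_bonds_broken(elements, reactant_bonds, implausible_product))
```

In this toy example both mappings are complete, but the swapped mapping breaks a C-C bond that the plausible mapping keeps, which is the kind of difference the C-C bond heuristic is meant to expose.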
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com