Automated Extraction of Reactions from the
            Patent Literature




                        Daniel Lowe
     Unilever Centre for Molecular Science Informatics
                 University of Cambridge




                                                         1
Chemistry patent applications
• Hundreds of thousands of applications each year
[Chart: chemistry patent applications per year, 2000 to 2009 (y-axis 0 to 400,000)]
Source: World Intellectual Property Indicators, 2011 edition

                                                                                                                                               2
3
The idea
XML patents → Reaction Extraction System → Extracted Reactions

                      4
Steps involved
•   Identifying experimental sections
•   Identifying chemical entities
•   Chemical name to structure conversion
•   Associating chemical entities with quantities
•   Assigning chemical roles
•   Atom-atom mapping


                                                    5
Building on existing projects




                                6
Archetypal experimental section
• Section heading
• Section target compound
• Step identifier
• Step target compound
• Paragraph number
• Synthesis
• Workup
• Characterisation




                                               7
Jessop, D. M.; Adams, S. E.; Murray-Rust, P.
Mining Chemical Information from Open
Patents. Journal of Cheminformatics 2011, 3, 40.




                                        8
ChemicalTagger
• Tags words of text

• Parses tags to identify phrases

• Generates an XML parse tree
   – http://chemicaltagger.ch.cam.ac.uk/
   – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for
     semantic text-mining in chemistry. J Cheminf 2011, 3, 17.




                                                                                        9
Tagging
•   Regex tagger: tags keywords e.g. “yield”, “mL”
•   OSCAR4 tagger: Finds names OSCAR4 believes to be chemical
    e.g. “2-methylpyridine”
•   OpenNLP: Tags parts of speech


Additional taggers:
• OPSIN tagger: Finds names OPSIN can parse
• Trivial chemical name tagger: Tags a few chemicals missed by
   the other taggers and cases that are only partially matched by
   the regex tagger e.g. Dess-Martin reagent
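
The regex tagger idea can be sketched as below. This is a minimal illustration, not ChemicalTagger's actual rule set: the pattern table is hypothetical, and of the tag names only CD, NN-MASS and NN-AMOUNT appear in the deck's sample output (NN-YIELD and NN-VOL are assumed here).

```python
import re

# Hypothetical keyword patterns; NN-YIELD and NN-VOL are assumed tag names.
KEYWORD_PATTERNS = [
    ("NN-YIELD", re.compile(r"^yields?$", re.IGNORECASE)),
    ("NN-VOL", re.compile(r"^(?:mL|ml|L|µL)$")),
    ("NN-MASS", re.compile(r"^(?:mg|g|kg)$")),
    ("NN-AMOUNT", re.compile(r"^(?:mmol|mol)$")),
    ("CD", re.compile(r"^\d+(?:\.\d+)?$")),
]


def regex_tag(tokens):
    """Return (token, tag) pairs; tokens matching no pattern get tag None."""
    tagged = []
    for token in tokens:
        # First matching pattern wins; unmatched tokens are left for other taggers.
        tag = next(
            (name for name, pattern in KEYWORD_PATTERNS if pattern.match(token)),
            None,
        )
        tagged.append((token, tag))
    return tagged


tagged = regex_tag(["DCM", "35", "ml", "yield", "606", "mg"])
```

Tokens left with no tag (here "DCM") are the ones a chemical-name tagger such as OSCAR4 or OPSIN would be expected to pick up.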


                                                            10
Sample ChemicalTagger Output
     <MOLECULE>
       <OSCARCM>
         <OSCAR-CM>methyl</OSCAR-CM>
         <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
       </OSCARCM>
       <QUANTITY>
         <_-LRB->(</_-LRB->
         <MASS>
           <CD>606</CD>
           <NN-MASS>mg</NN-MASS>
         </MASS>
         <COMMA>,</COMMA>
         <AMOUNT>
           <CD>2.1</CD>
           <NN-AMOUNT>mmol</NN-AMOUNT>
         </AMOUNT>
         <COMMA>,</COMMA>
         <EQUIVALENT>
           <CD>1</CD>
           <NN-EQ>eq</NN-EQ>
         </EQUIVALENT>
         <_-RRB->)</_-RRB->
       </QUANTITY>
     </MOLECULE>
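
A parse tree like the one above can be read back with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the sample (bracket tokens omitted); the function name read_molecule is hypothetical, and joining name tokens with a space is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample output; bracket tokens are left out.
SAMPLE = """<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
  </QUANTITY>
</MOLECULE>"""


def read_molecule(xml_text):
    """Return (name, {quantity type: (value, unit)}) for one MOLECULE element."""
    root = ET.fromstring(xml_text)
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    quantity = root.find("QUANTITY")
    if quantity is not None:
        for measure in quantity:
            value = measure.find("CD")
            if value is None:
                continue  # punctuation tokens such as COMMA carry no number
            unit = next(
                (child.text for child in measure if child.tag.startswith("NN-")),
                None,
            )
            quantities[measure.tag] = (float(value.text), unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
```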

                                                           11
Phrase Identification




                        12
Quantity Identification




                          13
Section/Step Parsing




                       14
Pyridine, pyridines and pyridine rings


 Entity                                  Type
 Pyridine                                Exact
 The pyridine / Pyridine from step 1     DefiniteReference
 Pyridines / A pyridine                  ChemicalClass
 Pyridine ring / Pyridyl                 Fragment
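
To make the four categories concrete, here is a toy classifier that reproduces the assignments in the table from surface cues alone. The rules are assumptions made for illustration and are not the rules the extraction system uses; they will misfire on many real names.

```python
import re


def classify_entity(mention):
    """Toy heuristic for the four entity types; illustrative only."""
    text = mention.strip().lower()
    # Substituent-style endings and explicit "ring" mentions.
    if text.endswith(" ring") or text.endswith("yl"):
        return "Fragment"
    # A definite article or a back-reference to an earlier step.
    if text.startswith("the ") or re.search(r"\bfrom step \d+$", text):
        return "DefiniteReference"
    # An indefinite article or a plural suggests a class of compounds.
    if text.startswith(("a ", "an ")) or text.endswith("s"):
        return "ChemicalClass"
    return "Exact"


labels = {
    mention: classify_entity(mention)
    for mention in [
        "Pyridine",
        "The pyridine",
        "Pyridine from step 1",
        "Pyridines",
        "A pyridine",
        "Pyridine ring",
        "Pyridyl",
    ]
}
```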




                                                                      15
Section/Step Parsing




Workup phrase types: Concentrate, Degass, Dry, Extract, Filter,
 Partition, Precipitate, Purify, Recover, Remove, Wash, Quench




                                                 16
Atom-mapping




               17
Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606
mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added
pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N
(540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred
at room temperature until all of the starting material was
consumed. The solvent was evaporated in vacuo and the
residue redissolved in ethyl acetate (10 ml), washed with
water (10 ml), saturated sodium hydrogen carbonate (10
ml), dried over sodium sulphate, filtered and evaporated to
yield the title compound as a white solid (690 mg, 1.8
mmol, 85%).
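
The quantities in this paragraph are internally consistent, which is what makes them useful once extracted. A small check, assuming 1:1 stoichiometry and that the 1 eq compound is the limiting reagent (both assumptions, not stated on the slide):

```python
def percent_yield(product_mmol, limiting_mmol):
    """Yield in percent, assuming one mole of product per mole of limiting reagent."""
    return 100.0 * product_mmol / limiting_mmol


def equivalents(reagent_mmol, limiting_mmol):
    """Molar equivalents of a reagent relative to the limiting reagent."""
    return reagent_mmol / limiting_mmol


limiting = 2.1  # mmol of methyl 4-(chlorosulfonyl)benzoate (1 eq)
yield_pct = percent_yield(1.8, limiting)  # title compound, reported as 85%
eq_phenol = equivalents(2.2, limiting)    # pentafluorophenol, reported as 1.1 eq
eq_et3n = equivalents(5.4, limiting)      # Et3N, reported as 2.5 eq
```

The computed values (about 85.7%, 1.05 eq and 2.57 eq) agree with the reported ones to within the rounding of the stated amounts.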

                                                         18
Graphical Output




                   19
CML output
Annotations:
• Reaction SMILES
• Quantities including yield are extracted
• SMILES and InChIs for every structure-resolvable reagent/product
• Entity is classified as an exact compound, definite reference, chemical class or polymer

<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
 <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
 <productList>
  <product role="product">
   <molecule id="m0">
    <name dictRef="nameDict:unknown">title compound</name>
   </molecule>
   <amount units="unit:mmol">1.8</amount>
   <amount units="unit:mg">690</amount>
   <amount units="unit:percentYield">85.0</amount>
   <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
   <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
   <dl:entityType>definiteReference</dl:entityType>
   <dl:state>solid</dl:state>
  </product>
 </productList>
 <reactantList>
  <reactant role="reactant" count="1">
   <molecule id="m1">
    <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
   </molecule>
   <amount units="unit:mmol">2.1</amount>
   <amount units="unit:mg">606</amount>
   <amount units="unit:eq">1.0</amount>
   <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
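
Output in this shape can be consumed with any namespace-aware XML parser. The sketch below reads product amounts from a cut-down, well-formed fragment modelled on the slide; the fragment keeps only the default CML namespace because the other namespace URIs are truncated on the slide, and the function name product_amounts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Default CML namespace as shown on the slide.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Cut-down fragment; elements with truncated or unknown namespaces are omitted.
FRAGMENT = """<reaction xmlns="http://www.xml-cml.org/schema">
 <productList>
  <product role="product">
   <molecule id="m0"><name>title compound</name></molecule>
   <amount units="unit:mmol">1.8</amount>
   <amount units="unit:mg">690</amount>
   <amount units="unit:percentYield">85.0</amount>
  </product>
 </productList>
</reaction>"""


def product_amounts(cml_text):
    """Return a list of (product name, {units: value}) tuples."""
    root = ET.fromstring(cml_text)
    results = []
    for product in root.iterfind(".//cml:product", CML_NS):
        name = product.findtext(
            "cml:molecule/cml:name", default="", namespaces=CML_NS
        )
        amounts = {
            amount.get("units"): float(amount.text)
            for amount in product.iterfind("cml:amount", CML_NS)
        }
        results.append((name, amounts))
    return results


products = product_amounts(FRAGMENT)
```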



                                                                                                                                                  20
Evaluation
•   2008-2011 USPTO patent applications classified as containing
    organic chemistry → 65,034 documents

•   484,259 atom-mapped reactions extracted

•   Adding the requirements that all identified product molecules
    were resolvable to structures and that all reagents were
    believed to describe exact compounds → 424,621 reactions

•   100 of these were selected for manual evaluation of quality
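
For scale, the counts above imply the following ratios. This is plain arithmetic on the figures given on the slide; the per-document figure is a mean over all 65,034 documents, including any that yielded no reactions.

```python
documents = 65_034
mapped_reactions = 484_259
strict_reactions = 424_621

# Share of atom-mapped reactions that survive the stricter filter.
retained_fraction = strict_reactions / mapped_reactions

# Mean number of atom-mapped reactions per document.
reactions_per_document = mapped_reactions / documents
```

Roughly 88% of the atom-mapped reactions meet the stricter requirements, and the corpus averages about 7.4 atom-mapped reactions per document.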

                                                                  21
Reactions found
[Histogram: patents with a given number of reactions (log scale, 1 to 100,000) against number of extracted reactions (0 to 1000)]




                                                                                                            22
Results
•   In 96% of the evaluated reactions the primary starting material and
    product were correctly identified, with no misidentification of
    reagents that could be confused with the starting material

•   Against the 495 expected chemical entities there were 61 false
    positives and 16 false negatives

•   Only 4 of the 321 reagents with quantities did not have those
    quantities recognised and associated with the reagent

•   Association of quantities/yields with products was less successful:
    48 of the 74 cases where such data was present were handled correctly
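
These counts can be turned into the usual precision, recall and F1 figures. The sketch assumes that the 495 expected entities are the true positives plus the false negatives, which the slide does not state explicitly.

```python
expected = 495
false_positives = 61
false_negatives = 16

# Assumption: expected entities = true positives + false negatives.
true_positives = expected - false_negatives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / expected
f1 = 2 * precision * recall / (precision + recall)

# Quantity association rates quoted on the slide.
reagent_quantity_rate = (321 - 4) / 321
product_quantity_rate = 48 / 74
```

Under that assumption entity recognition comes out at about 0.89 precision and 0.97 recall (F1 about 0.93), with quantities associated for about 99% of reagents but only about 65% of products.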

                                                                             23
Use Cases
• Reaction searching

• Analysing trends in reactions over time

• Reaction outcome prediction




                                            24
Example of reaction searching
C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)




     6 reactions found in 5 patents
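
The query above is a reaction SMILES: reactants and products are separated by ">>", components by ".", and the atom-map numbers (":1", ":2") tie the two alkene carbons to their positions in the cyclopropane product. The sketch below only takes the string apart; it does no substructure matching, which needs a cheminformatics toolkit. The function name and the returned dictionary layout are hypothetical.

```python
import re

# Matches a bracket atom carrying an atom-map number, e.g. [CH2:2].
ATOM_MAP = re.compile(r"\[[^\]]*:(\d+)\]")


def parse_reaction_smiles(rxn):
    """Split 'reactants>agents>products' and collect atom-map numbers per side."""
    reactants, agents, products = rxn.split(">")

    def components(text):
        return [part for part in text.split(".") if part]

    def map_numbers(text):
        return sorted({int(number) for number in ATOM_MAP.findall(text)})

    return {
        "reactants": components(reactants),
        "agents": components(agents),
        "products": components(products),
        "reactant_maps": map_numbers(reactants),
        "product_maps": map_numbers(products),
    }


query = parse_reaction_smiles("C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)")
```

Both mapped atoms appear on each side, so the query constrains where the alkene carbons end up; the unmapped ring CH2 in the product is free to match the carbon supplied by ICI (diiodomethane).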


                                               25
Name I20110224.tarUS20110046406A1-20110224.ZIP0066




Text from US 2011/0046406 A1




                                                        26
Most lexical variants

1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
EDCI hydrochloride
1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride
And 127 more!

675 chemicals had over 10 lexical variants!
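
Counting lexical variants amounts to grouping extracted names by the structure they resolve to. A minimal sketch, assuming each mention has already been resolved to some structure key (in practice an identifier such as an InChI); the keys and the function name below are placeholders invented for illustration.

```python
from collections import defaultdict


def variants_per_structure(mentions):
    """Map each structure key to the set of distinct names that resolved to it."""
    names = defaultdict(set)
    for name, structure_key in mentions:
        names[structure_key].add(name)
    return names


# (name, structure key) pairs; the keys are placeholders, not real identifiers.
mentions = [
    ("EDCI hydrochloride", "EDC.HCl"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "EDC.HCl"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl", "EDC.HCl"),
    ("DCM", "dichloromethane"),
    ("EDCI hydrochloride", "EDC.HCl"),  # repeat mention, counted once
]

counts = {
    key: len(names) for key, names in variants_per_structure(mentions).items()
}
```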



                                                                                                      27
Most common solvents




                       28
Known Limitations
•   The first workup reagent is often erroneously classified as a
    reactant

•   Atom mapping produces mappings that are not necessarily
    representative of reaction mechanism and occasionally
    involve clearly incorrect atoms

•   Conditions from analogous reactions are not resolved

•   Temperature/time for reactions to occur not captured



                                                                    29
Conclusions
• 424,621 exact atom-mapped reactions were
  extracted from 4 years of USPTO patent
  applications
• Evaluation indicates the reactions to be of
  generally good quality especially if the
  misidentification of workup reagents as
  reactants is not considered important
• All the code to extract reactions is open source:
  https://bitbucket.org/dan2097/patent-reaction-extraction

                                                        30
Acknowledgements
Unilever Centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson
Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov
SMARTS searching: Roger Sayle
Boehringer Ingelheim for funding



                                                        31
Any Questions?




Email: daniel@nextmovesoftware.com


                                     32

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Último (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Automated Extraction of Reactions from the Patent Literature

  • 1. Automated Extraction of Reactions from the Patent Literature. Daniel Lowe, Unilever Centre for Molecular Science Informatics, University of Cambridge
  • 2. Chemistry patent applications • 100,000s of applications each year [Chart: chemistry patent applications per year, 2000 to 2009; y-axis 0 to 400,000] Source: World Intellectual Property Indicators, 2011 edition
  • 4. The idea: XML patents → Reaction Extraction System → Extracted Reactions
  • 5. Steps involved • Identifying experimental sections • Identifying chemical entities • Chemical name to structure conversion • Associating chemical entities with quantities • Assigning chemical roles • Atom-atom mapping
  • 6. Building on existing projects
  • 7. Archetypal experimental section (annotated parts): section heading; section target compound; step identifier; step target compound; paragraph number; synthesis; workup; characterisation
  • 8. Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40.
  • 9. ChemicalTagger • Tags words of text • Parses tags to identify phrases • Generates an XML parse tree – http://chemicaltagger.ch.cam.ac.uk/ – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminf 2011, 3, 17.
  • 10. Tagging • Regex tagger: tags keywords, e.g. “yield”, “mL” • OSCAR4 tagger: finds names OSCAR4 believes to be chemical, e.g. “2-methylpyridine” • OpenNLP: tags parts of speech. Additional taggers: • OPSIN tagger: finds names OPSIN can parse • Trivial chemical name tagger: tags a few chemicals missed by the other taggers, and cases that are only partially matched by the regex tagger, e.g. Dess-Martin reagent
  • 11. Sample ChemicalTagger Output: <MOLECULE> <OSCARCM> <OSCAR-CM>methyl</OSCAR-CM> <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM> </OSCARCM> <QUANTITY> <_-LRB->(</_-LRB-> <MASS> <CD>606</CD> <NN-MASS>mg</NN-MASS> </MASS> <COMMA>,</COMMA> <AMOUNT> <CD>2.1</CD> <NN-AMOUNT>mmol</NN-AMOUNT> </AMOUNT> <COMMA>,</COMMA> <EQUIVALENT> <CD>1</CD> <NN-EQ>eq</NN-EQ> </EQUIVALENT> <_-RRB->)</_-RRB-> </QUANTITY> </MOLECULE>
  • 15. Pyridine, pyridines and pyridine rings. Entity → Type: Pyridine → Exact; The pyridine / Pyridine from step 1 → DefiniteReference; Pyridines / A pyridine → ChemicalClass; Pyridine ring / Pyridyl → Fragment
  • 16. Section/Step Parsing. Workup phrase types: Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench
  • 18. Example: Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate. To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
  • 20. CML output. Slide annotations: Reaction SMILES; Quantities including yield are extracted; SMILES and InChIs for every structure resolvable reagent/product; Entity is classified as an exact compound, definite reference, chemical class or polymer. XML shown (truncated on the slide): <reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-.. <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([.. <productList> <product role="product"> <molecule id="m0"> <name dictRef="nameDict:unknown">title compound</name> </molecule> <amount units="unit:mmol">1.8</amount> <amount units="unit:mg">690</amount> <amount units="unit:percentYield">85.0</amount> <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/> <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H.. <dl:entityType>definiteReference</dl:entityType> <dl:state>solid</dl:state> </product> </productList> <reactantList> <reactant role="reactant" count="1"> <molecule id="m1"> <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name> </molecule> <amount units="unit:mmol">2.1</amount> <amount units="unit:mg">606</amount> <amount units="unit:eq">1.0</amount> <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
  • 21. Evaluation • 2008-2011 USPTO patent applications classified as containing organic chemistry → 65,034 documents • 484,259 atom-mapped reactions extracted • Adding the requirements that all identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds → 424,621 reactions • 100 of these were selected for manual evaluation of quality
  • 22. Reactions found [Chart: patents with a given number of reactions (log scale, 1 to 100,000) against number of extracted reactions (0 to 1000)]
  • 23. Results • 96% correctly identified the primary starting material and product whilst not misidentifying reagents that could be confused with the starting material • Compared to the 495 expected chemical entities, there were 61 false positives and 16 false negatives • Only 4 of the 321 reagents (with quantities) did not have these quantities recognised and associated with the reagent • Association of quantities/yields with products was less successful: 48 of the 74 cases where such data was present were handled
  • 24. Use Cases • Reaction searching • Analysing trends in reactions over time • Reaction outcome prediction
  • 25. Example of reaction searching. Query: C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1). 6 reactions found in 5 patents
  • 27. Most lexical variants: 1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride; EDCI hydrochloride; 1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride; N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride; N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride; 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl; N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride; N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride; 1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride; 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl; 1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride; 1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride; N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride; 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride; 1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride. And 127 more! 675 chemicals had over 10 lexical variants!
  • 29. Known Limitations • The first workup reagent is often erroneously classified as a reactant • Atom mapping produces mappings that are not necessarily representative of reaction mechanism and occasionally involve clearly incorrect atoms • Conditions from analogous reactions are not resolved • Temperature/time for reactions to occur not captured
  • 30. Conclusions • 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications • Evaluation indicates the reactions to be of generally good quality, especially if the misidentification of workup reagents as reactants is not considered important • All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction
  • 31. Acknowledgements. Unilever centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson. Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov. SMARTS searching: Roger Sayle. Boehringer Ingelheim for funding.
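
The entity counts reported on slide 23 can be turned into the usual precision, recall and F1 figures. The slides do not state these values themselves; the sketch below derives them on the assumption that the 16 false negatives are counted against the 495 expected entities (giving 479 true positives) and that the 61 false positives are additional to those. The function name is illustrative only.

```python
def precision_recall_f1(expected, false_positives, false_negatives):
    """Derive precision, recall and F1 from entity counts.

    Assumption (not stated on the slides): true positives are the
    expected entities minus the false negatives.
    """
    true_positives = expected - false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / expected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from slide 23.
precision, recall, f1 = precision_recall_f1(495, 61, 16)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")

# Quantity association rates, also from slide 23.
print(f"reagent quantities associated: {(321 - 4) / 321:.1%}")
print(f"product quantities/yields handled: {48 / 74:.1%}")
```

Under that assumption, precision comes out lower than recall, which is consistent with the note that false positives mostly arise from workup reagents being classified as reactants.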

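
As an illustration of how the ChemicalTagger parse tree on slide 11 associates quantities with a molecule, the following minimal sketch reads that sample with Python's standard `xml.etree.ElementTree`. The tag names are copied from the slide; the function `molecule_summary` and the returned data layout are assumptions made for this example, not part of ChemicalTagger or the extraction system.

```python
import xml.etree.ElementTree as ET

# Sample output copied from slide 11.
SAMPLE = """<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>"""


def molecule_summary(xml_text):
    """Return (name, {quantity type: (value, unit)}) for one MOLECULE element."""
    root = ET.fromstring(xml_text)
    # The chemical name is split over several OSCAR-CM tokens.
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    for child in root.find("QUANTITY"):
        number = child.find("CD")
        if number is None:
            continue  # brackets and commas carry no quantity
        unit = next(el.text for el in child if el.tag.startswith("NN-"))
        quantities[child.tag] = (float(number.text), unit)
    return name, quantities


name, quantities = molecule_summary(SAMPLE)
print(name)
print(quantities)
```

Keeping the quantity type (MASS, AMOUNT, EQUIVALENT) as the dictionary key mirrors how the CML output on slide 20 records one `amount` element per unit.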
Editor's Notes

  1. Manual abstraction of the precise details of reactions from this many documents would be expensive.
  2. How can one get access to patents? Google Patents offers all USPTO patents from 2001 onwards as XML, including images and ChemDraw files. Older patents are available as text only back to 1976, as OCRed text back to 1920, and back to 1790 if one performs the OCR oneself.
  3. This problem can be broken down into several sub-problems.
  4. Fortunately we don’t have to start from scratch: many open-source toolkits exist to help with these tasks. These include OPSIN (name to structure), OSCAR4 (chemical entity recognition) and ChemicalTagger (tagging and parsing of experimental chemistry text).
  5. This is what a typical experimental section from a patent looks like. We need to identify such sections, link the heading with the paragraphs and preferably distinguish synthesis reagents from workup reagents.
  6. Headings and paragraphs can be extracted directly from the XML. The classifier uses the probabilities of words being present in an experimental chemistry section versus a standard paragraph. The language in experimental sections is quite repetitive, so this works well. In some cases a heading may not be annotated as such in the XML; this can be detected in many cases and processed as if the heading were a discrete element.
  7. This work relies heavily on ChemicalTagger, and significant improvements have been made to ChemicalTagger as part of this project to improve its performance and the range of concepts recognised. Hence a description of the system would not be complete without also explaining what ChemicalTagger does.
  8. For this project we also use the following taggers. These tags can then be parsed to yield….
  9. Quantities have been recognised, marked up and associated with a molecule. Where certain key words are identified, phrases can be identified….
  10. A few phrase types are identified directly by the grammar e.g. a chemical in a chemical is a dissolve phrase
  11. Will be associated with the identified compound. As you can see a compound doesn’t have to contain a chemical entity. (title compound as a white solid)
  12. Uses a combination of textual clues and OPSIN’s classification
  13. Phrases can be classified as workup by phrase type, e.g. extraction or purification. As the yielded compound and characterisation are often conjoined, rather than explicitly identifying the workup, compounds commonly associated with characterisation are marked up as false positives by regexes. A single paragraph may have multiple blocks of synthesis and workup. Structure-aware role assignment involves heuristics such as assigning known solvents and catalysts to those roles, using lists of known solvents/catalysts and structural properties, e.g. the presence of a transition metal.
  14. Perform a sanity check on the reaction, e.g. that it has a product and at least 2 reagents. Attempt to find a mapping in which all product atoms can be accounted for.
  15. Here is an example of an experimental section
  16. Occasionally the system identifies a compound as a reactant that was mentioned only in the context of the current reaction being performed in an analogous way to the reaction that produced it. False positives arise from workup reagents being classified as reactants and from clear errors. Product information is often not explicitly associated with the product.
  17. Simmons–Smith reaction for conversion of a terminal allyl group to a cyclopropane group found 6 hits in 5 patents.
  18. It should be noted that nowhere in this text, or indeed in the whole patent, is the name of the reaction mentioned. This is quite common.
  19. 675 chemical entities had over 10 lexical variants
  20. Top 10
  21. This is due to the text typically just saying that the substance is added, without further specifying its purpose.
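
Note 6 describes a classifier based on the probabilities of words appearing in experimental sections versus standard paragraphs. The notes give no further detail, so the following is only a toy sketch of that idea as a word-level log-odds score with add-one smoothing; the training sentences, function names and the smoothing choice are all assumptions made for illustration and are not taken from the actual system.

```python
import math
from collections import Counter


def count_words(paragraphs):
    """Count lower-cased, whitespace-separated words over a list of paragraphs."""
    counts = Counter()
    for text in paragraphs:
        counts.update(text.lower().split())
    return counts


def experimental_log_odds(text, exp_counts, std_counts):
    """Sum of per-word log(P(word | experimental) / P(word | standard)).

    Uses add-one smoothing over the joint vocabulary; words never seen in
    training are ignored. A positive score suggests experimental text.
    """
    vocab = set(exp_counts) | set(std_counts)
    exp_total = sum(exp_counts.values()) + len(vocab)
    std_total = sum(std_counts.values()) + len(vocab)
    score = 0.0
    for word in text.lower().split():
        if word not in vocab:
            continue
        p_exp = (exp_counts[word] + 1) / exp_total
        p_std = (std_counts[word] + 1) / std_total
        score += math.log(p_exp / p_std)
    return score


# Hypothetical training paragraphs, invented for this example.
exp_counts = count_words([
    "the mixture was stirred at room temperature",
    "the solvent was evaporated in vacuo",
])
std_counts = count_words([
    "the present invention relates to compounds",
    "the claims define the scope of the invention",
])

for paragraph in ("the residue was stirred and evaporated",
                  "the invention relates to the claims"):
    score = experimental_log_odds(paragraph, exp_counts, std_counts)
    print(f"{score:+.2f}  {paragraph}")
```

Because experimental text is repetitive, as the note points out, even a simple word-probability score of this kind separates the two toy paragraphs by sign.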