253rd ACS National Meeting, San Francisco CA, USA 4th April 2017
Automatic extraction of bioactivity
data from patents
Daniel Lowe*, Stefan Senger† and Roger Sayle*
*NextMove Software Cambridge, UK
†GlaxoSmithKline, Stevenage, UK
Example Use cases
• “A patent has recently come out on a topic of
interest, can the key compounds be extracted
with their activity data?”
• “Which compounds have been found to be
active against this target?”
US Patent data freely available
patents.reedtech.com
(Or from the USPTO: bulkdata.uspto.gov)
[Figure legend: highlighted = text-mined]
What are these compounds?
Understanding table semantics
SureChEMBL Google Patents
After text-mining for chemical entities:
Green = substituent
Purple = molecule
Source: US20170050925A9
Panels: SureChEMBL | Google Patents | Patent PDF | PatFetch (NextMove Software)
Source: US20010016661A1
Understanding table semantics
5 columns
6 columns
• Columns merged such that header and body
have same number of columns
Getting the compound
structures
• Chemical names
• Chemical sketches
• R-group tables
• Compound identifier associated with any of
the above
Chemical names
• OPSIN (Open Parser for Systematic IUPAC
Nomenclature)
• Dictionaries (ChEMBL/PubChem/NextMove)
• Chemical line formula parsing, especially
useful for peptide names and R-group
definitions
Chemical sketches
• Utilize the ChemDraw sketches provided by
the USPTO
• Detection and handling of repeat brackets and
positional variation
• Fixing obvious errors e.g. undervalent
nitrogen near to H atom with no bond
• Labels reinterpreted
Formula Interpretation
Columns: Input | ChemDraw 15 | This work
Inputs: HATU; C4F9; H3PO4; CON(cHex)2 (one "No result"); III-2 (one "No result")
[Structure depictions not recoverable from the text extraction]
R-group tables
Resolving Identifiers
• Need to “name space” identifiers
– “Compound 1”, “Reference compound 1”,
“Example 1”
– But “Compound 1” = “cmpd 1” = “cpd. #1”
• Where a column is just called “#”, is it a
compound number, an example number or just a
table row number?
• Identifier may be defined multiple times e.g.
as a sketch and chemical name
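The name-spacing idea above can be sketched as follows. This is a minimal illustration, not the method used in the talk: the synonym table, the regular expression and the function name `normalize_identifier` are all assumptions made for the example.

```python
import re

# Hypothetical synonym table: surface forms mapped to a namespace label.
_NAMESPACES = {
    "compound": "compound",
    "cmpd": "compound",
    "cpd": "compound",
    "example": "example",
    "ex": "example",
    "reference compound": "reference compound",
}

# A kind ("Compound", "cpd."), an optional "#" or "no.", then a label such as "1" or "12a".
_ID_PATTERN = re.compile(
    r"^\s*(?P<kind>[A-Za-z ]+?)\.?\s*(?:#|no\.?)?\s*(?P<label>\d+[A-Za-z]?)\s*$",
    re.IGNORECASE,
)


def normalize_identifier(text):
    """Return (namespace, label) for a compound identifier, or None if unrecognized."""
    match = _ID_PATTERN.match(text)
    if match is None:
        return None
    kind = " ".join(match.group("kind").lower().split())
    namespace = _NAMESPACES.get(kind)
    if namespace is None:
        return None
    return (namespace, match.group("label").lower())


for raw in ["Compound 1", "cmpd 1", "cpd. #1", "Reference compound 1", "Example 1", "#"]:
    print(repr(raw), "->", normalize_identifier(raw))
```

A bare “#” header returns `None` here on purpose: deciding whether it denotes a compound, an example or a row number needs context from the rest of the document.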
Resolving Identifiers
(text-mining)
Resolving Identifiers
(Sketches)
Resolving Identifiers
(Tables)
Extracting compound-activity
relationships
Excel table export
Extracting compound-activity
relationships
What is the target?
Assay identification
• Naïve Bayes classifier trained from assay
descriptions identified by BindingDB curators
• 10-fold cross validation: 98.9% recall, 94.7%
precision
• Paragraph associated with next table or table
mentioned in paragraph
• Target/organism detected
• Care taken to avoid common irrelevant
organisms/proteins e.g. bovine serum albumin
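To make the classification step concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing, written with the standard library only. The four training paragraphs are invented toy text, not BindingDB-curated descriptions, and the tokenizer and class names are assumptions; the features used by the actual classifier are not stated in the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training paragraphs
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for tok in tokenize(text):
                counts[tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy paragraphs, for illustration only.
TOY_EXAMPLES = [
    ("Inhibition of kinase activity was measured and IC50 values determined", "assay"),
    ("Compounds were incubated with the enzyme and binding affinity Ki was measured", "assay"),
    ("The mixture was stirred at room temperature and concentrated under reduced pressure", "other"),
    ("The residue was purified by column chromatography to give the title compound", "other"),
]

clf = NaiveBayes()
clf.train(TOY_EXAMPLES)
print(clf.predict("IC50 values were measured using the enzyme"))
print(clf.predict("The crude product was purified by chromatography"))
```

A paragraph labelled as an assay description would then be linked to the next table, or to a table it mentions, as described in the bullets above.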
Results
Results From US Patent
applications (2001-Mar 2017)
Red = Bioactivity
Activities with associated structures per year
[Chart: activity-structure relationships extracted (0 to 600,000) by publication year]
Comparison with BindingDB
• Activity data from ~1500 US patent grants (2013-
2016) manually extracted over the course of 3 years
• ~150,000 activities
• Comparison done on the subset that was made
available in ChEMBL 22_1 (98,898 activity values,
1012 patents)
• As some assay results are missed by the automatic
extraction, and some are considered out of scope by
BindingDB, difficult to distinguish differences in
coverage from genuine disagreements
Comparison with BindingDB
• Values normalized into nM
– 1000s of instances of measurements in nanometers!
• Midpoint of ranges taken
• Structures compared by StdInChI
• Target name normalized to ChEMBL target ID
(organism specific), using either:
– ChEMBL target synonyms
– Normalize to HGNC symbol and check if HGNC symbol is a
ChEMBL target synonym
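The value normalization steps (convert to nM, take the midpoint of a range) can be sketched as below. The accepted unit spellings, the range separators and the function name `to_nanomolar` are assumptions for the example; the slides do not specify the parsing rules actually used.

```python
import re

# Multipliers converting a molar concentration unit to nanomolar.
_TO_NM = {
    "pM": 1e-3,
    "nM": 1.0,
    "uM": 1e3,
    "µM": 1e3,  # micro sign
    "μM": 1e3,  # Greek small letter mu
    "mM": 1e6,
    "M": 1e9,
}

# A number, an optional second number after "-", "–" or "to", then a molar unit.
_VALUE = re.compile(
    r"^\s*(?P<low>\d+(?:\.\d+)?)\s*"
    r"(?:(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?\s*"
    r"(?P<unit>[pnuµμm]?M)\s*$"
)


def to_nanomolar(text):
    """Return the value in nM (midpoint for a range), or None if not parseable."""
    match = _VALUE.match(text)
    if match is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    return (low + high) / 2 * _TO_NM[match.group("unit")]


for raw in ["12 nM", "0.5 µM", "10-100 nM", "1 to 3 uM", "12 nm"]:
    print(repr(raw), "->", to_nanomolar(raw))
```

The match is case-sensitive, so "12 nm" is left unparsed in this sketch. Whether a lowercase "nm" should be read as nanomolar, given how often it appears in place of "nM", is a policy decision the sketch deliberately leaves open.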
Comparison
Expected values found: 75%
Expected structures found: 65%
Expected value + structure found: 53%
Expected value + structure + target: 18%
Unclear structure assignment
[Figure: structures marked “?” where the assignment is unclear]
Stereochemistry and salts
[Figure: structure depictions compared across three panels: Patent | BindingDB | This work]
Long tail of difficult cases
What does this superscript term mean?
What are the units?
Targets of patent data compared to journal data
[Two charts: ChEMBL 22_1 (excluding BindingDB) and US Patent Applications]
Common target classes, % per year (0% to 40%), 2002 to 2016
Series: Kinase, GPCR (Family A), Protease, Nuclear receptor, Voltage-gated ion channel, Electrochemical transporter, Oxidoreductase
Upcoming target classes
[Chart: percentage of documents with activity values against target class (0.0% to 3.0%), 2002 to 2016]
Series: Epigenetic writer (Patents), Epigenetic reader (Patents), Epigenetic writer (ChEMBL ex BindingDB), Epigenetic reader (ChEMBL ex BindingDB)
Future work
• Support for more complex R-group tables
• Improve recognition and resolution of protein
target names
• Support for activities specified in text e.g.
Example 1 has an IC50 of 12 nM measured at rat EP4
• Resolution of symbols for activity ranges e.g.
“A” indicates an IC50 value of less than 100 nM
• Improve assay metadata extraction
cf. BioAssay Express
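As a rough illustration of what the two text-based items above might involve, the sketch below parses the two example sentences quoted on the slide. The patterns, the field names and both function names are assumptions made for this example; they are not the authors' planned implementation and cover only these sentence shapes. The target mention ("rat EP4") is not extracted.

```python
import re

_ACTIVITY_TYPES = r"IC50|EC50|Ki|Kd"
_NUMBER = r"\d+(?:\.\d+)?"
_UNIT = r"[pnuµμm]?M"

# e.g. Example 1 has an IC50 of 12 nM ...
_TEXT_ACTIVITY = re.compile(
    r"(?P<id>(?:Example|Compound)\s+\w+)\s+has\s+an?\s+"
    rf"(?P<type>{_ACTIVITY_TYPES})\s+of\s+(?P<value>{_NUMBER})\s*(?P<unit>{_UNIT})"
)

# e.g. "A" indicates an IC50 value of less than 100 nM
_SYMBOL_KEY = re.compile(
    r'["“](?P<symbol>[A-Z+]+)["”]\s+indicates\s+an?\s+'
    rf"(?P<type>{_ACTIVITY_TYPES})\s+value\s+of\s+"
    rf"(?P<relation>less than|greater than)\s+(?P<value>{_NUMBER})\s*(?P<unit>{_UNIT})"
)

_RELATIONS = {"less than": "<", "greater than": ">"}


def parse_text_activity(sentence):
    """Extract an activity stated in running text, or return None."""
    m = _TEXT_ACTIVITY.search(sentence)
    if m is None:
        return None
    return {
        "identifier": m.group("id"),
        "type": m.group("type"),
        "value": float(m.group("value")),
        "unit": m.group("unit"),
    }


def parse_symbol_key(sentence):
    """Extract the meaning of an activity-range symbol, or return None."""
    m = _SYMBOL_KEY.search(sentence)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "type": m.group("type"),
        "relation": _RELATIONS[m.group("relation")],
        "value": float(m.group("value")),
        "unit": m.group("unit"),
    }


print(parse_text_activity("Example 1 has an IC50 of 12 nM measured at rat EP4"))
print(parse_symbol_key("“A” indicates an IC50 value of less than 100 nM"))
```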
Disambiguation of conflicting structure descriptions
[Figure panels: Image from original filing | Redrawn by US patent office in ChemDraw | Intended structure from chemical name]
Conclusions
• Processing all US patents from 2001 to present
can be done in less than a day on a desktop PC
• Technique applicable to chemical properties
other than activity values
• Compound number <-> structure relationships
useful for key compound identification
• For the majority of patents, extracting
structure-activity relationships can be
significantly expedited
Acknowledgements
• Noel O'Boyle
• John Mayfield
• Funding provided by:
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com

Mais conteúdo relacionado

Mais procurados

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
NextMove Software
 
Acs 2013 indianapolis_cvsp
Acs 2013 indianapolis_cvspAcs 2013 indianapolis_cvsp
Acs 2013 indianapolis_cvsp
Ken Karapetyan
 
Resolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experienceResolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experience
Chris Southan
 
Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 

Mais procurados (20)

CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
CINF 18: Wikipedia and Wiktionary as resources for chemical text miningCINF 18: Wikipedia and Wiktionary as resources for chemical text mining
CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Substructure Search Face-off
Substructure Search Face-offSubstructure Search Face-off
Substructure Search Face-off
 
ICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CASICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CAS
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChem
 
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
 
Acs 2013 indianapolis_cvsp
Acs 2013 indianapolis_cvspAcs 2013 indianapolis_cvsp
Acs 2013 indianapolis_cvsp
 
Standardization and Generation of Parents for Open PHACTS Chemical Registry S...
Standardization and Generation of Parents for Open PHACTS Chemical Registry S...Standardization and Generation of Parents for Open PHACTS Chemical Registry S...
Standardization and Generation of Parents for Open PHACTS Chemical Registry S...
 
Resolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experienceResolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experience
 
Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...
 
Data model
Data modelData model
Data model
 
Building support for the semantic web for chemistry at the Royal Society of C...
Building support for the semantic web for chemistry at the Royal Society of C...Building support for the semantic web for chemistry at the Royal Society of C...
Building support for the semantic web for chemistry at the Royal Society of C...
 
OpenPHACTS - Chemistry Platform Update and Learnings
OpenPHACTS - Chemistry Platform Update and LearningsOpenPHACTS - Chemistry Platform Update and Learnings
OpenPHACTS - Chemistry Platform Update and Learnings
 
Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings
 
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
 
2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english
 

Semelhante a Automatic extraction of bioactivity data from patents

The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...
Valery Tkachenko
 
Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Tackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionTackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extraction
NextMove Software
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
Kamel Mansouri
 

Semelhante a Automatic extraction of bioactivity data from patents (20)

Unlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articlesUnlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articles
 
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
 
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
 
Preserving the currency of analytics outcomes over time through selective re-...
Preserving the currency of analytics outcomes over time through selective re-...Preserving the currency of analytics outcomes over time through selective re-...
Preserving the currency of analytics outcomes over time through selective re-...
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...
 
Making the Old New Again - Modern Technical Provides Access to Historical Che...
Making the Old New Again - Modern Technical Provides Access to Historical Che...Making the Old New Again - Modern Technical Provides Access to Historical Che...
Making the Old New Again - Modern Technical Provides Access to Historical Che...
 
Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...
 
Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...
Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...
Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...
 
Supporting Dataset Descriptions in the Life Sciences
Supporting Dataset Descriptions in the Life SciencesSupporting Dataset Descriptions in the Life Sciences
Supporting Dataset Descriptions in the Life Sciences
 
Tackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionTackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extraction
 
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, JapanISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
 
How to Find Physical Properties of Chemical Substances
How to Find Physical Properties of Chemical SubstancesHow to Find Physical Properties of Chemical Substances
How to Find Physical Properties of Chemical Substances
 
Kk m5re9v2e3
Kk m5re9v2e3Kk m5re9v2e3
Kk m5re9v2e3
 
CAS: Transforming Discovery
CAS: Transforming DiscoveryCAS: Transforming Discovery
CAS: Transforming Discovery
 
Semantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsSemantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity Cards
 
The eCrystals Federation
The eCrystals FederationThe eCrystals Federation
The eCrystals Federation
 
Open innovation contributions from RSC resulting from the Open Phacts project
Open innovation contributions from RSC resulting from the Open Phacts projectOpen innovation contributions from RSC resulting from the Open Phacts project
Open innovation contributions from RSC resulting from the Open Phacts project
 
Open innovation contributions from RSC resulting from the Open Phacts project
Open innovation contributions from RSC resulting from the Open Phacts projectOpen innovation contributions from RSC resulting from the Open Phacts project
Open innovation contributions from RSC resulting from the Open Phacts project
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 

Mais de NextMove Software

Mais de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be useful
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)
 
Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
 

Último

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Último (20)

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 

Automatic extraction of bioactivity data from patents

  • 1. 253rd ACS National Meeting, San Francisco CA, USA 4th April 2017 Automatic extraction of bioactivity data from patents Daniel Lowe*, Stefan Senger† and Roger Sayle* *NextMove Software Cambridge, UK †GlaxoSmithKline, Stevenage, UK
  • 2. Example use cases • “A patent has recently come out on a topic of interest; can the key compounds be extracted with their activity data?” • “Which compounds have been found to be active against this target?”
  • 3. US patent data freely available: patents.reedtech.com (or from the USPTO: bulkdata.uspto.gov)
  • 4. [highlight] = text-mined. What are these compounds?
  • 5. Understanding table semantics. Panels: SureChEMBL, Google Patents. After text-mining for chemical entities: green = substituent, purple = molecule. Source: US20170050925A9
  • 6. Panels: SureChEMBL, Google Patents, Patent PDF, PatFetch (NextMove Software). Source: US20010016661A1
  • 7. Understanding table semantics (figure labels: 5 columns, 6 columns) • Columns merged such that header and body have the same number of columns
  • 8. Getting the compound structures • Chemical names • Chemical sketches • R-group tables • Compound identifier associated with any of the above
  • 9. Chemical names • OPSIN (Open Parser for Systematic IUPAC Nomenclature) • Dictionaries (ChEMBL/PubChem/NextMove) • Chemical line formula parsing, especially useful for peptide names and R-group definitions
  • 10. Chemical sketches • Utilize the ChemDraw sketches provided by the USPTO • Detection and handling of repeat brackets and positional variation • Fixing obvious errors, e.g. undervalent nitrogen near to H atom with no bond • Labels reinterpreted
  • 11. Formula Interpretation. Table columns: Input, ChemDraw 15, This work. Inputs: HATU, C4F9, H3PO4, CON(cHex)2, III-2. [Figure: interpreted structures for each input; “No result” appears for two entries]
  • 12. R-group tables
  • 13. Resolving Identifiers • Need to “name space” identifiers – “Compound 1”, “Reference compound 1”, “Example 1” – But “Compound 1” = “cmpd 1” = “cpd. #1” • Where a column is just called “#”, is it a compound number, an example number or just a table row number? • Identifier may be defined multiple times, e.g. as a sketch and chemical name
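The identifier name-spacing described on slide 13 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the synonym table `NAMESPACES`, the regular expression and the function `normalize_identifier` are all hypothetical names introduced here, and a real patent corpus would need a much larger set of label variants.

```python
import re

# Hypothetical synonym table (an assumption for illustration): maps the
# surface form of a label to a single canonical namespace.
NAMESPACES = {
    "compound": "compound",
    "cmpd": "compound",
    "cpd": "compound",
    "reference compound": "reference compound",
    "example": "example",
    "ex": "example",
}

# Label, optional trailing dot, optional "#" or "no.", then the number
# (optionally with a letter suffix such as "1a").
_ID_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z ]+?)\.?\s*(?:#|no\.?)?\s*(?P<number>\d+[A-Za-z]?)\s*$",
    re.IGNORECASE,
)


def normalize_identifier(text):
    """Return (namespace, number) for a recognised identifier, else None."""
    match = _ID_PATTERN.match(text)
    if match is None:
        return None  # e.g. a bare "#" header carries no namespace at all
    label = " ".join(match.group("label").lower().split())
    namespace = NAMESPACES.get(label)
    if namespace is None:
        return None
    return (namespace, match.group("number"))


if __name__ == "__main__":
    for raw in ["Compound 1", "cmpd 1", "cpd. #1", "Reference compound 1", "Example 1", "#"]:
        print(repr(raw), "->", normalize_identifier(raw))
```

Keeping the namespace in the key means “Example 1” and “Compound 1” never collide, while spelling variants of the same label do. A bare “#” column header returns `None` here, which reflects the ambiguity the slide points out: resolving it needs context from the rest of the table.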
  • 14. Resolving Identifiers (text-mining)
  • 15. Resolving Identifiers (Sketches)
  • 16. Resolving Identifiers (Tables)
  • 17. Extracting compound-activity relationships
  • 18. Excel table export
  • 19. Extracting compound-activity relationships What is the target?
  • 20. Assay identification • Naïve Bayes classifier trained from assay descriptions identified by BindingDB curators • 10-fold cross validation: 98.9% recall, 94.7% precision • Paragraph associated with next table or table mentioned in paragraph • Target/organism detected • Care taken to avoid common irrelevant organisms/proteins e.g. bovine serum albumin
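To make the assay-identification step on slide 20 concrete, here is a minimal bag-of-words Naïve Bayes sketch in plain Python. The slides do not say which features, smoothing or library were used, so the tokenizer, the add-one smoothing and the four toy training paragraphs below are assumptions made purely for illustration; the real classifier was trained on assay descriptions identified by BindingDB curators.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word/number tokens; an assumed, deliberately simple feature set."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training paragraphs
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training paragraphs (not from BindingDB or any patent).
TOY_TRAINING = [
    ("Inhibition of human EP4 receptor was measured and IC50 values were determined", "assay"),
    ("Compounds were tested in a kinase binding assay to determine Ki values", "assay"),
    ("The mixture was stirred at room temperature and concentrated under vacuum", "other"),
    ("The crude product was purified by column chromatography", "other"),
]

clf = NaiveBayes()
for paragraph, label in TOY_TRAINING:
    clf.train(paragraph, label)

if __name__ == "__main__":
    print(clf.predict("IC50 values were determined in a binding assay"))
    print(clf.predict("The residue was purified by chromatography"))
```

A paragraph classified as an assay description would then be linked to the next table, or to the table it mentions, as the slide describes; that linking step is not shown here.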
  • 21. Results
  • 22. Results • From US patent applications (2001-Mar 2017) • Red = bioactivity
  • 23. Activities with associated structures per year [Chart: activity-structure relationships extracted (y-axis, 0 to 600,000) vs. publication year (x-axis)]
  • 24. Comparison with BindingDB • Activity data from ~1500 US patent grants (2013-2016) manually extracted over the course of 3 years • ~150,000 activities • Comparison done on the subset that was made available in ChEMBL 22_1 (98,898 activity values, 1012 patents) • As some assay results are missed by the automatic extraction, and some are considered out of scope by BindingDB, it is difficult to distinguish differences in coverage from genuine disagreements
  • 25. Comparison with BindingDB • Values normalized into nM – 1000s of instances of measurements in nanometers! • Mid point of ranges taken • Structures compared by StdInChI • Target name normalized to ChEMBL target ID (organism specific), using either: – ChEMBL target synonyms – Normalize to HGNC symbol and check if HGNC symbol is a ChEMBL target synonym
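The value normalization on slide 25 (everything into nM, midpoint of ranges) might look something like the sketch below. The function name `to_nanomolar`, the accepted input syntax and the choice of an arithmetic midpoint are assumptions; the slides do not say which kind of midpoint was used. Because matching is case-insensitive, a value written as “nm” (nanometers) is read as nM, which is one plausible way of handling the mislabelled units the slide mentions.

```python
import re

# Conversion factors to nanomolar, keyed by lowercased unit. Lowercasing means
# "nm" (as mistakenly written in many patents) is treated the same as "nM".
TO_NM = {
    "pm": 1e-3,
    "nm": 1.0,
    "um": 1e3,
    "\u00b5m": 1e3,  # micro sign
    "\u03bcm": 1e3,  # Greek small letter mu
    "mm": 1e6,
    "m": 1e9,
}

# A single value ("12 nM") or a range ("10-100 nM", "10 to 100 nM").
_VALUE = re.compile(
    r"^\s*(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|\u2013|to)\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*(?P<unit>[pnu\u00b5\u03bcm]?m)\s*$",
    re.IGNORECASE,
)


def to_nanomolar(text):
    """Parse a concentration string and return its value in nM, or None."""
    match = _VALUE.match(text)
    if match is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    midpoint = (low + high) / 2  # assumed arithmetic midpoint of the range
    return midpoint * TO_NM[match.group("unit").lower()]


if __name__ == "__main__":
    for raw in ["12 nM", "12 nm", "0.5 \u00b5M", "10-100 nM", "2 mM", "about 5"]:
        print(repr(raw), "->", to_nanomolar(raw))
```

Lowercasing the unit is a blunt instrument: it also makes “mM” and “mm” indistinguishable, so a production system would want to check the column header or assay type before trusting the unit.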
  • 26. Comparison • Expected values found: 75% • Expected structures found: 65% • Expected value + structure found: 53% • Expected value + structure + target: 18%
  • 27. Unclear structure assignment [Figure with “?” annotations]
  • 28. Stereochemistry and salts [Figure: structure as given in the patent, in BindingDB, and in this work]
  • 29. Long tail of difficult cases • What does this superscript term mean? • What are the units?
  • 30. Targets of patent data compared to journal data [Two charts of common target classes, % per year (0-40%), 2002-2016: ChEMBL 22_1 (excluding BindingDB) and US Patent Applications. Classes: Kinase, GPCR (Family A), Protease, Nuclear receptor, Voltage-gated ion channel, Electrochemical transporter, Oxidoreductase]
  • 31. Upcoming target classes [Chart: percentage of documents with activity values against target class (0.0-3.0%), 2002-2016. Series: Epigenetic writer (Patents), Epigenetic reader (Patents), Epigenetic writer (ChEMBL ex BindingDB), Epigenetic reader (ChEMBL ex BindingDB)]
  • 32. Future work • Support for more complex R-group tables • Improve recognition and resolution of protein target names • Support for activities specified in text e.g. Example 1 has an IC50 of 12 nM measured at rat EP4 • Resolution of symbols for activity ranges e.g. “A” indicates an IC50 value of less than 100 nM • Improve assay metadata extraction cf. BioAssay Express
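Two of the future-work items on slide 32 quote example sentences, and a pattern-based approach to them could be sketched as below. This is purely illustrative: the slides list these as unimplemented, and the regular expressions, the set of activity types and the function names `extract_activity` and `extract_legend` are hypothetical. Real patent text is far more varied than these two patterns allow for.

```python
import re

_UNIT = r"(?P<unit>[pnu\u00b5\u03bcm]?M)"
_TYPE = r"(?P<type>IC50|EC50|Ki|Kd)"
_NUMBER = r"(?P<value>\d+(?:\.\d+)?)"

# e.g. "Example 1 has an IC50 of 12 nM measured at rat EP4"
ACTIVITY_IN_TEXT = re.compile(
    r"(?P<identifier>(?:Example|Compound)\s+\w+)\s+has\s+an?\s+"
    + _TYPE + r"\s+of\s+" + _NUMBER + r"\s*" + _UNIT
    + r"(?:\s+measured\s+(?:at|against)\s+(?P<target>[^.,;]+))?"
)

# e.g. '"A" indicates an IC50 value of less than 100 nM' (straight or curly quotes)
RANGE_LEGEND = re.compile(
    r"[\u201c\"](?P<symbol>[^\u201d\"]+)[\u201d\"]\s+indicates\s+an?\s+"
    + _TYPE + r"\s+value\s+of\s+less\s+than\s+" + _NUMBER + r"\s*" + _UNIT
)


def extract_activity(sentence):
    """Return a dict describing an activity stated in running text, or None."""
    match = ACTIVITY_IN_TEXT.search(sentence)
    if match is None:
        return None
    result = match.groupdict()
    result["value"] = float(result["value"])
    if result["target"] is not None:
        result["target"] = result["target"].strip()
    return result


def extract_legend(sentence):
    """Return a dict mapping a range symbol to an upper bound, or None."""
    match = RANGE_LEGEND.search(sentence)
    if match is None:
        return None
    result = match.groupdict()
    result["value"] = float(result["value"])
    result["relation"] = "<"  # only the "less than" form is handled in this sketch
    return result


if __name__ == "__main__":
    print(extract_activity("Example 1 has an IC50 of 12 nM measured at rat EP4"))
    print(extract_legend("\u201cA\u201d indicates an IC50 value of less than 100 nM"))
```

The legend result could then be used to expand a table cell containing “A” into a bounded activity value (here, IC50 < 100 nM) rather than discarding it.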
  • 33. Disambiguation of conflicting structure descriptions • Image from original filing • Redrawn by US patent office in ChemDraw • Intended structure from chemical name
  • 34. Conclusions • Processing all US patents from 2001 to present can be done in less than a day on a desktop PC • Technique applicable to chemical properties other than activity values • Compound number <-> structure relationships useful for key compound identification • For the majority of patents, extracting structure-activity relationships can be significantly expedited
  • 35. Acknowledgements • Noel O’Boyle • John Mayfield • Funding provided by:
  • 36. Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog daniel@nextmovesoftware.com