Mining Drug Targets, Structures and Activity Data
1. Mining Drug Targets, Structures and
Activity Data Using Open Full-Text
Patent Sources and Web Tools
Christopher Southan
ChrisDS Consulting, Göteborg, Sweden,
Prepared for BioIT, Boston, April 2012,
Track 11, Open Source Solutions, Wednesday, 13:45
[1]
3. Key Relationships
Extractable from Patents and Papers
MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA
PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY
YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ
RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP
NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD
SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV
GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ
DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA
ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM
GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS
TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT
AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL
FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK
Document > Assay > Result > Compound > Target
2011 PMID 21569515
2010 doi:10.1007/978-3-642-15120-0_9
Important "bag of targets" exceptions (e.g. bacterial/parasite whole cells)
[3]
4. The Good News: Patent Mining Utility
• Novel bioactive chemical structures related to drug discovery exceed those in
journals by at least five-fold.
• Encompass academic, as well as commercial, global med. chem. output.
• Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
• ~ 70% of data initially patent-only, some never disclosed elsewhere.
• Include synthetic descriptions and other useful enabling information.
• Precede journal or meeting reports by ~ 1.5 to 5 years.
• Can be complementary to papers (e.g. larger SAR matrix).
• Intersect with papers at chemistry, target, disease, author and citation levels.
• IP exploitable for Neglected Tropical Disease research is becoming "open".
[4]
5. The Bad News: Patent Mining Can be Tough
• High-specificity retrieval of relevant documents difficult
• Massive chaff-to-wheat ratio in 100s of pages
• Differences in layout, house style and data location
• Markush permutation
• Variability in IUPAC strings and image rendering
• Use of non-standard gene/protein names
• Obfuscation via:
– Qualitative or binned assay results
– Structure-to-data links non-obvious, patchy or absent
– Less than 50% of titles include target names
– The "hiding the lead and core structures" game
– Blunderbuss disease and use exemplifications
– Tense ambiguity (i.e. "could be" vs. "was" done)
• Quality judgments difficult
• Patents cite papers and patents but few papers cite patents
• Document redundancy of Kind codes, patent families and equivalents
• Finding drug candidate first-filings is difficult
• The PDF hamburger problem and OCR noise
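The kind-code part of the redundancy problem can be reduced with a simple normalisation step. The following is a minimal sketch, assuming publication numbers of the form office code + serial digits + optional kind code. The regular expression, the function names and the US number are illustrative assumptions, not part of any tool named in this presentation. It does not address patent families or equivalents filed at different offices.

```python
import re

# Assumed format: two-letter office code, serial digits, optional kind code (e.g. A1, B1).
PUB_NUMBER = re.compile(r"^([A-Z]{2})\s*(\d+)\s*([A-Z]\d?)?$")


def strip_kind_code(pub_number):
    """Return office code plus serial number, or None if the format is not recognised."""
    match = PUB_NUMBER.match(pub_number.strip().upper())
    if match is None:
        return None
    office, serial, _kind = match.groups()
    return office + serial


def group_by_publication(pub_numbers):
    """Group publication numbers that differ only by kind code."""
    groups = {}
    for number in pub_numbers:
        key = strip_kind_code(number)
        if key is None:
            continue  # skip anything that does not look like a publication number
        groups.setdefault(key, []).append(number)
    return groups


# Kind codes and the US number below are made up for illustration.
hits = ["EP2391601A1", "EP2391601B1", "US20100123456A1"]
groups = group_by_publication(hits)
```

Here `groups` maps `EP2391601` to both of its kind-code variants, so the document is triaged once rather than twice.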
[5]
6. Reasons for Rolling-your-own Patent
Chemistry and Data Extraction
• Limited budget
• You are likely to be a tacit super-curator by profession
• Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
• Combine automated outputs with manual triage
• Develop a technical understanding and comparison of vendor offerings
• Commercial dbs cap the number of manually-extracted examples
• Need SAR analogues for a few targets rather than many (e.g. mechanistic
enzymology or systems chemical biology)
• Only require data sampling across specific disease areas
• Not overly concerned about false-negatives (i.e. don’t need
comprehensive prior-art check or scoping of claims)
• Open tools operate on any text or web source, not just patents
• You may already have commercial text mining capability
• Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL,
journals you subscribe to, PubMed and PMC)
• You can slice-and-dice PubChem patent chemistry in ways
complementary to commercial databases
[6]
7. Open Sources and Tools Overview
• Searching metadata, abstracts and text
– Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
– Open full-text: FreePatentsOnline, Google patents, Google Scholar, et al.
• Metadata, full-text and chemical structure search - SureChemOpen
• Bulk name-to-structure conversion - ChemAxon Chemicalize
• Individual name-to-structure - OPSIN
• Conversion of images to structures - OSRA
• Sketcher inputs – many options
• Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
• EPO patent number searching in PubChem
• PDF24.org for cutting pages and OnlineOCR.net for sections or tables
• Utopia bioentity mark-up
(those below not included in this presentation but relevant)
• NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
• Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
• OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al,
SCRIPDB, Juristica group
(n.b. Google should give urls for all these source and tool names)
[7]
9. PubChem Patent-derived Content ~6 million
• ~ 2.8 million Discovery Gate/Thomson Pharma intersect mainly Derwent WPI
pharmaceutical patents plus some journal extractions
• ~ 5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
• ~ 3.5 million of these are Lipinski-ROF compliant
• ~ 40% journal-extracted structures in ChEMBL have a match in the 5.1 million
• ~ 70% of these are Lipinski-ROF compliant
• ~ 90% of these have assay data
• ~ 60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
• ~ 2.3 million SureChem pre-2007 structures also in there (but not selectable)
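The slide does not say how Lipinski rule-of-five (ROF) compliance was applied, so the following is only a sketch of the standard thresholds (molecular weight ≤ 500, logP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10). The `Properties` class, the function names and the example values are hypothetical. Whether one violation is tolerated is left as a parameter because conventions differ.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Pre-computed descriptors for one structure (values supplied by the caller)."""
    mol_weight: float        # in Da
    logp: float              # calculated octanol/water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(p):
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        p.mol_weight > 500,
        p.logp > 5,
        p.h_bond_donors > 5,
        p.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_rof_compliant(p, max_violations=0):
    """True if the structure exceeds no more than max_violations thresholds."""
    return lipinski_violations(p) <= max_violations


# Made-up descriptor values, for illustration only.
small = Properties(mol_weight=350.4, logp=3.2, h_bond_donors=2, h_bond_acceptors=5)
large = Properties(mol_weight=720.9, logp=6.1, h_bond_donors=6, h_bond_acceptors=12)
```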
[9]
15. Synonym Recall
• Title only BACE1 = 8
• Title + abstract BACE1 = 97
• Title + abstract BACE2 = 29
• Title + abstract BACE = 392
• Title + abstract "Beta secretase" = 1056
• Title + abstract memapsin = 87
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
memapsin = 1383
• Title + abstract (BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
memapsin) AND inhibitors = 841
• Same query to PubMed (this interface) = 1031
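The synonym expansion above can be assembled programmatically when there are many targets to query. This is a minimal sketch: the function name is hypothetical, and the quoting and parenthesis conventions are assumptions, since each portal has its own query syntax that should be checked before use.

```python
def build_query(synonyms, required=None):
    """OR together target synonyms, optionally AND-ing one required term."""
    terms = []
    for synonym in synonyms:
        # Quote multi-word names so they are searched as a phrase.
        terms.append(f'"{synonym}"' if " " in synonym else synonym)
    query = " OR ".join(terms)
    if required:
        # Parentheses make the AND apply to the whole synonym group.
        query = f"({query}) AND {required}"
    return query


bace_synonyms = ["BACE1", "Beta secretase", "BACE", "BACE2", "memapsin"]
broad = build_query(bace_synonyms)
narrow = build_query(bace_synonyms, required="inhibitors")
```

The same string can then be pasted into a patent portal and into PubMed to compare recall, as in the counts above.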
[15]
19. IUPAC-to-structure: OPSIN
• Installable application
• Also chemical dictionary conversions
Result: Example 31 structure is a 24 nM BACE1 inhibitor
[19]
20. Image-to-structure: OSRA
• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs
[20]
22. Structure Search in PubChem
SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)
Often see stereo differences to the Derwent entry in PubChem
[22]
23. PubChem Similarity "Walking"
• 2D and 3D searches give different results
• Can do multiple steps
• Can "read" CID history
• Possible to "walk" between patents
• Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
[23]
25. SureChemOpen: Patent Retrieval
• Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
• Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but no
bulk export)
[25]
26. SureChemOpen, WIPO, OPSIN and PubChem
Result: 1 nM (?) BACE2 inhibitor
with assay and synthesis details.
[26]
27. SureChemOpen: Structure > Patent
Direct answers to: "which patents contain compounds similar to my query"
and "show me all the compounds in these patents"
[27]
30. Espacenet EP2391601 > ChemAxon Chemicalize.org
• Description URL from Espacenet pasted into Chemicalize.org
• Most of 74 examples converted
• Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no
exact match
• Claims section was a Markush description, so no relevant structures converted
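For readers unfamiliar with the 95% Tanimoto threshold, the coefficient is the number of fingerprint bits two structures share divided by the number of bits set in either. The sketch below shows only that arithmetic on toy bit sets. PubChem computes it on its own substructure fingerprints, which are not reproduced here, and the function name is hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for two collections of 'on' fingerprint bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Toy fingerprints: the analogue lacks one of the query's 20 bits.
query_bits = set(range(20))
analogue_bits = set(range(19))
score = tanimoto(query_bits, analogue_bits)  # 19 shared / 20 in the union
```

A score of 1.0 on a fingerprint does not guarantee an identical structure, which is why an exact-match search is still worth running separately.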
[30]
31. EP2391601 > Chemicalize > PubChem
Chemicalize Similarity listing PubChem Tanimoto sub-cluster
• EP2391601 description text > Chemicalize SDF download > PubChem
Structure Search upload = 311 structures
• Of these 206 have PubChem exact matches
• Of these 176 have Thomson Pharma matches
• The example cluster from the Thomson/Derwent extraction is ~15
• The example cluster from Chemicalize is ~ 90
• Ipso facto Chemicalize extracted at least 70 novel structures
• But only 10 examples were in the highest-potency bin
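The counts above are set intersections, and the same bookkeeping can be scripted once the structures are reduced to comparable identifiers. This is a sketch under that assumption. The function name, the dictionary keys and the toy identifiers are hypothetical, and how identifiers are obtained from each source is outside its scope.

```python
def overlap_summary(extracted, pubchem, vendor):
    """Count extracted structures matched in PubChem and, of those, in a vendor source."""
    extracted = set(extracted)
    in_pubchem = extracted & set(pubchem)
    in_vendor = in_pubchem & set(vendor)  # "of these": a subset of the PubChem matches
    return {
        "extracted": len(extracted),
        "exact_match": len(in_pubchem),
        "vendor_match": len(in_vendor),
        "no_exact_match": len(extracted - in_pubchem),
    }


# Toy identifiers standing in for structure keys.
summary = overlap_summary(
    extracted={"S1", "S2", "S3", "S4", "S5"},
    pubchem={"S1", "S2", "S3", "X9"},
    vendor={"S1", "S2", "Y7"},
)
```

With the figures on this slide, 311 − 206 = 105 structures had no exact PubChem match. That number alone does not show they are all novel exemplified compounds, since a description text can also yield reagents and intermediates.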
[31]
33. Tables and Recalcitrant IUPACs
PDF > Find tables > Snip image > Online OCR > WordPad > Chemicalize, OPSIN or OSRA
• Iterative fixing of OCR errors (e.g. 1 vs l)
• Cross-check Mw in the document
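The iterative fixing of 1 vs l errors can be partly mechanised by generating candidate spellings of an OCR'd name and trying each in a name-to-structure tool. This is a minimal sketch. The confusable-character map contains only the 1/l pair mentioned above, the function name and the `limit` cap are assumptions, and the conversion and Mw check are left to the tools already listed.

```python
from itertools import product

# Characters that OCR commonly confuses; extend as needed.
CONFUSABLE = {"1": "1l", "l": "l1"}


def ocr_variants(name, limit=64):
    """Return candidate spellings of name, with the original spelling first."""
    choices = [CONFUSABLE.get(ch, ch) for ch in name]
    variants = []
    for combo in product(*choices):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break  # long names with many 1/l characters grow exponentially
    return variants


candidates = ocr_variants("1-methy1")
```

Each candidate that converts to a structure can then be kept or rejected by comparing its calculated molecular weight with the value printed in the patent table.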
[33]
34. Utopia Mark-up of Patent Introduction
Bioentity mark-up (green) via EMBL Reflect with rich call-out options
[34]
35. Tips for Joining Everything up
• SureChemOpen is continuing to back-fill and add features.
• Check the Chemicalize archive (~ 0.5 million) for unique content.
• Between Chemicalize, OSRA, OPSIN and sketching you can extract most things
(e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki
pages, blog posts and MeSH IUPACs).
• Check PubChem "same connectivity" for tautomer forms in different CIDs.
• Check PubChem "similar" compounds for analogues even if you cannot track
back to a patent number.
• Most PDB ligands published by companies have a patent analogue series.
• Espacenet text chemicalizes well but FreePatentsOnline can be better.
• Google Scholar tracks patent citations.
• Full-text is good but don’t forget to eyeball the original PDF.
• You can "walk" between patents by 2D/3D clusters, inventors or citations.
• Less-common author/inventor names may track a journal paper back to a patent.
• CiteExplore includes selectable ChEMBL structure links.
• Check ChEMBL structures for SureChem links via ChemSpider.
• On a good day you can paste OCR table data into Excel.
• You can set SciBitely patent keyword alerts and see posts on Twitter.
[35]
36. Conclusions
• Roll-your-own patent mining can take you a long way.
• Complementary to commercial databases.
• Target-centric recall and specificity are reasonable.
• Published patents are indexed and open text-extracted within weeks.
• You need perspicacity to dig out SAR details.
• You can cherry-pick examples by potency or collate whole series.
• Establishing intersects between journal articles and patents is valuable.
• Exemplified structures typically cover a broader range of analogue space
and SAR data than papers.
• You can "walk" between patents via citation and chemistry clustering.
• PubChem already contains over 6 million patent-derived structures with
more depositions and links expected.
• The increased public surfacing of chemical structures and bioactivity data
from patents will expedite medicinal chemistry, tropical disease research
and chemical biology.
[36]