Mining Drug Targets, Structures and Activity Data

Mining Drug Targets, Structures and
Activity Data Using Open Full-Text
Patent Sources and Web Tools

Christopher Southan

ChrisDS Consulting, Göteborg, Sweden

Prepared for BioIT, Boston, April 2012,
Track 11, Open Source Solutions, Wednesday, 13:45

[1]

Key Relationships
Extractable from Patents and Papers
MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA
PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY
YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ
RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP
NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD
SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV
GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ
DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA
ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM
GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS
TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT
AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL
FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK

Document > Assay > Result > Compound > Target

2011 PMID 21569515

2010 doi:10.1007/978-3-642-15120-0_9

Important "bag of targets" exceptions (e.g. bacterial/parasite whole cells)
[3]

The Good News: Patent Mining Utility
• Novel bioactive chemical structures related to drug discovery exceed those in
journals by at least five-fold.
• Encompass academic, as well as commercial, global med. chem. output.
• Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
• ~ 70% of data initially patent-only, some never disclosed elsewhere.
• Include synthetic descriptions and other useful enabling information.
• Precede journal or meeting reports by ~ 1.5 to 5 years.
• Can be complementary to papers (e.g. larger SAR matrix).
• Intersect with papers at chemistry, target, disease, author and citation levels.
• IP exploitable for Neglected Tropical Disease research becoming "open".

[4]

The Bad News: Patent Mining Can be Tough
• High-specificity retrieval of relevant documents difficult
• Massive chaff-to-wheat ratio in 100s of pages
• Differences in layout, house style and data location
• Markush permutation
• Variability in IUPAC strings and image rendering
• Use of non-standard gene/protein names
• Obfuscation via:
– Qualitative or binned assay results
– Structure-to-data links non-obvious, patchy or absent
– Less than 50% of titles include target names
– The "hiding the lead and core structures" game
– Blunderbuss disease and use exemplifications
– Tense ambiguity (i.e. "could be" vs. "was" done)
• Quality judgments difficult
• Patents cite papers and patents but few papers cite patents
• Document redundancy of Kind codes, patent families and equivalents
• Finding drug candidate first-filings is difficult
• The PDF hamburger problem and OCR noise
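
The kind-code redundancy noted above can be reduced with a simple normalisation step. The following is a minimal sketch only, assuming publication numbers of the form office code, digits, optional kind code; the function names and the kind codes in the usage note are illustrative and not taken from any tool in this talk. It does not resolve patent families or equivalents, which carry different numbers.

```python
import re

# Assumed pattern: two-letter office code, digits, optional kind code (e.g. A1, B2).
PUB_NUMBER = re.compile(r"^([A-Z]{2})(\d+)([A-Z]\d?)?$")


def strip_kind_code(pub_number):
    """Return the publication number without its trailing kind code."""
    cleaned = pub_number.replace(" ", "").replace("-", "").upper()
    match = PUB_NUMBER.match(cleaned)
    if match is None:
        return cleaned  # leave unrecognised formats untouched
    return match.group(1) + match.group(2)


def collapse_kind_codes(pub_numbers):
    """Keep one entry per publication number, preserving first-seen order."""
    seen = []
    for number in pub_numbers:
        base = strip_kind_code(number)
        if base not in seen:
            seen.append(base)
    return seen
```

For example, `collapse_kind_codes(["EP2391601A1", "EP 2391601 B1"])` collapses both entries to `EP2391601`.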
[5]

Reasons for Rolling-your-own Patent
Chemistry and Data Extraction
• Limited budget
• You are likely to be a tacit super-curator by profession
• Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
• Combine automated outputs with manual triage
• Develop a technical understanding and comparison of vendor offerings
• Commercial dbs cap the number of manually-extracted examples
• Need SAR analogues for a few targets rather than many (e.g. mechanistic
enzymology or systems chemical biology)
• Only require data sampling across specific disease areas
• Not overly concerned about false-negatives (i.e. don’t need
comprehensive prior-art check or scoping of claims)
• Open tools operate on any text or web source, not just patents
• You may already have commercial text mining capability
• Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL,
journals you subscribe to, PubMed and PMC)
• You can slice-and-dice PubChem patent chemistry in ways
complementary to commercial databases

[6]

Open Sources and Tools Overview
• Searching metadata, abstracts and text
– Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
– Open full-text: FreePatentsOnline, Google patents, Google Scholar, et al.
• Metadata, full-text and chemical structure search - SureChemOpen
• Bulk name-to-structure conversion - ChemAxon Chemicalize
• Individual name-to-structure - OPSIN
• Conversion of images to structures - OSRA
• Sketcher inputs – many options
• Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
• EPO patent number searching in PubChem
• PDF24.org for cutting pages and OnlineOCR.net for sections or tables
• Utopia bioentity mark-up
(those below not included in this presentation but relevant)
• NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
• Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
• OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al,
SCRIPDB, Juristica group
(n.b. Google should give URLs for all these source and tool names)
[7]

So What’s in PubChem?

[8]

PubChem Patent-derived Content ~6 million

• ~ 2.8 million Discovery Gate/Thomson Pharma intersect mainly Derwent WPI
pharmaceutical patents plus some journal extractions
• ~ 5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
• ~ 3.5 million of these are Lipinski-ROF compliant
• ~ 40% journal-extracted structures in ChEMBL have a match in the 5.1 million
• ~ 70% of these are Lipinski-ROF compliant
• ~ 90% of these have assay data
• ~ 60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
• ~ 2.3 million SureChem pre-2007 structures also in there (but not selectable)
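
The Lipinski rule-of-five (ROF) filter behind the counts above can be sketched as follows. The thresholds are the standard published ones; the slides do not state whether any violations were tolerated, so the `max_violations` parameter is an assumption, and the `Descriptors` class is a hypothetical container for values computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one structure (hypothetical container)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def rule_of_five_violations(d):
    """Count how many of the four Lipinski thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def is_rof_compliant(d, max_violations=0):
    """Assumption: compliant means no more than max_violations breaches."""
    return rule_of_five_violations(d) <= max_violations
```

Some practitioners allow one violation; passing `max_violations=1` covers that reading.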
[9]

Chemistry > Patents in PubChem

[10]

You found a CID, what are the Patent and
Journal links?
PubChem > BioAssay/ChEMBL > CiteExplore/PubMed/ > PDB

[11]

Patent Links from SLING and IBM

[12]

PubChem > SureChem > Patent > Structure >
Data > Target

[13]

Target-Centric Patent Searching

[14]

Synonym Recall

• Title only BACE1 = 8
• Title + abstract BACE1 = 97
• Title + abstract BACE2 = 29
• Title + abstract BACE = 392
• Title + abstract "Beta secretase" = 1056
• Title + abstract memapsin = 87
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
Memapsin = 1383
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
Memapsin AND inhibitors = 841
• Same query to PubMed (this interface) = 1031
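
The combined synonym query above can be assembled programmatically, which helps when repeating the exercise for other targets. This is a sketch under assumptions: the `build_query` helper is hypothetical, and the parentheses it adds are a choice made here (the slide query has none, and operator precedence may differ between search portals).

```python
def build_query(synonyms, required=None):
    """Join synonyms with OR, quoting multi-word terms; optionally AND a term."""
    terms = []
    for synonym in synonyms:
        # Multi-word names need quotes to be searched as a phrase.
        terms.append(f'"{synonym}"' if " " in synonym else synonym)
    query = " OR ".join(terms)
    if required:
        # Parentheses make the intended grouping explicit (an assumption).
        query = f"({query}) AND {required}"
    return query


bace_synonyms = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
print(build_query(bace_synonyms, required="inhibitors"))
```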
[15]

Target Query > Patent Retrieval from Espacenet

[16]

Linking Examples to Data in the Patent

[17]

Extracting Chemical Structures

[18]

IUPAC-to-structure: OPSIN

• Installable application
• Also chemical dictionary conversions

Result: the Example 31 structure is a 24 nM BACE1 inhibitor
[19]

Image-to-structure: OSRA

• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs

[20]

Follow-up Searching

[21]

Structure Search in PubChem
SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)

Often see stereo differences to the Derwent entry in PubChem
[22]

PubChem Similarity "Walking"

• 2D and 3D different results
• Can do multiple steps
• Can "read" CID history
• Possible to "walk" between patents
• Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
[23]

Direct Patent <> Chemistry

[24]

SureChemOpen: Patent Retrieval

• Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
• Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but no
bulk export)
[25]

SureChemOpen, WIPO, OPSIN and PubChem

Result: 1 nM (?) BACE2 inhibitor
with assay and synthesis details.
[26]

SureChemOpen: Structure > Patent

Direct answers to: "which patents contain compounds similar to my query"
and "show me all the compounds in these patents"
[27]

Non-target Activity Data and Bulk
Chemistry Extraction

[28]

Malaria Query: CiteExplore > WIPO

Example 60, sub-200 nM potency,
with solubility and clearance data
[29]

Espacenet EP2391601 > ChemAxon Chemicalize.org
• Description URL from Espacenet pasted into Chemicalize.org
• Most of 74 examples converted
• Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no
exact match
• Claims section was Markush description so no relevant structures converted
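
For readers unfamiliar with the 95% Tanimoto threshold used above, the coefficient is the number of shared fingerprint bits divided by the number of bits set in either structure. The sketch below uses toy bit sets as stand-ins; PubChem computes the score on its own fingerprints, which are not reproduced here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Toy fingerprints (not real PubChem bits): 38 shared bits out of 40 in total.
query_bits = range(0, 39)
hit_bits = range(1, 40)
print(round(tanimoto(query_bits, hit_bits), 2))
```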

[30]

EP2391601 > Chemicalize > PubChem

Chemicalize Similarity listing PubChem Tanimoto sub-cluster

• EP2391601 description text > Chemicalize SDF download > PubChem
Structure Search upload = 311 structures
• Of these 206 have PubChem exact matches
• Of these 176 have Thomson Pharma matches
• The example cluster from the Thomson/Derwent extraction is ~ 15
• The example cluster from Chemicalize is ~ 90
• Ipso facto Chemicalize extracted at least 70 novel structures
• But only 10 examples were in the highest-potency bin
[31]

Tips and Tricks

[32]

Tables and Recalcitrant IUPACs
PDF > Find tables > Snip image > Online OCR > WordPad > Chemicalize / OPSIN / OSRA

• Iterative fixing of OCR errors (e.g. 1 vs l)
• Cross-check Mw in the document
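
The iterative fixing of 1 vs l confusions can be partly automated by generating candidate spellings of an OCR'd name and trying each in a name-to-structure converter. This is a sketch only: the `CONFUSABLE` table and `ocr_variants` helper are hypothetical, and the conversion step itself (OPSIN, Chemicalize) is left out.

```python
from itertools import product

# Characters OCR commonly confuses; extend as needed (e.g. 0 vs O).
CONFUSABLE = {"1": "1l", "l": "1l"}


def ocr_variants(name, max_variants=64):
    """Return candidate spellings with each confusable character swapped."""
    options = [CONFUSABLE.get(ch, ch) for ch in name]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= max_variants:
            break  # cap the combinatorial growth for long names
    return variants
```

Each variant can then be submitted for conversion, keeping only those that parse and whose molecular weight agrees with the value given in the document.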

[33]

Utopia Mark-up of Patent Introduction

Bioentity mark-up (green) via EMBL Reflect with rich call-out options
[34]

Tips for Joining Everything up
• SureChemOpen is continuing to back-fill and add features.
• Check the Chemicalize archive (~ 0.5 million) for unique content.
• Between Chemicalize, OSRA, OPSIN and sketching you can extract most things
(e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki
pages, blog posts and MeSH IUPACs).
• Check PubChem "same connectivity" for tautomer forms in different CIDs.
• Check PubChem "similar" compounds for analogues even if you cannot track
back to a patent number.
• Most PDB ligands published by companies have a patent analogue series.
• Espacenet text chemicalizes well but FreePatentsOnline can be better.
• Google Scholar tracks patent citations.
• Full-text is good but don’t forget to eyeball the original PDF.
• You can "walk" between patents by 2D/3D clusters, inventors or citations.
• Less-common author/inventor names may track a journal paper back to a patent.
• CiteExplore includes selectable ChEMBL structure links.
• Check ChEMBL structures for SureChem links via ChemSpider.
• On a good day you can paste OCR table data into Excel.
• You can set SciBitely patent keyword alerts and see posts on Twitter.
[35]

Conclusions

• Roll-your-own patent mining can take you a long way.
• Complementary to commercial databases.
• Target-centric recall and specificity is reasonable.
• Published patents are indexed and open text-extracted within weeks.
• You need perspicacity to dig out SAR details.
• Can cherry-pick examples by potency or collate whole series.
• Establishing intersects between journal articles and patents is valuable.
• Exemplified structures typically cover a broader range of analogue space
and SAR data than papers.
• You can "walk" between patents via citation and chemistry clustering.
• PubChem already contains over 6 million patent-derived structures with
more depositions and links expected.
• The increased public surfacing of chemical structures and bioactivity data
from patents will expedite medicinal chemistry, tropical disease research
and chemical biology.

[36]

Questions Welcome

ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan – at - hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
LinkedIN: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan

[37]
