Mining Drug Targets, Structures and
 Activity Data Using Open Full-Text
   Patent Sources and Web Tools


                  Christopher Southan

          ChrisDS Consulting, Göteborg, Sweden

            Prepared for BioIT, Boston, April 2012
     Track 11, Open Source Solutions, Wednesday, 13:45


                                                         [1]
Introduction




               [2]
Key Relationships
          Extractable from Patents and Papers
                                                                MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA
                                                                PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY
                                                                YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ
                                                                RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP
                                                                NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD
                                                                SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV
                                                                GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ
                                                                DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA
                                                                ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM
                                                                GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS
                                                                TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT
                                                                AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL
                                                                FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK




 Document         Assay           Result           Compound               Target




                          2011 PMID 21569515




                    2010 doi:10.1007/978-3-642-15120-0_9

Important "bag of targets" exceptions (e.g. bacterial/parasite whole cells)
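The Document > Assay > Result > Compound > Target chain above can be held as a simple record. The following is a minimal sketch, not a schema from the presentation: the class name, field names and placeholder strings are assumptions made for illustration. The two records reuse values quoted on later slides (Example 31 and Example 60), and the target is left empty for the "bag of targets" case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActivityRecord:
    """One Document > Assay > Result > Compound > Target link (hypothetical layout)."""
    document: str                  # patent number, PMID or DOI
    assay: str                     # free-text assay description
    result: str                    # value as reported, e.g. "24 nM" or a potency bin
    compound: str                  # example number, IUPAC name or SMILES
    target: Optional[str] = None   # None when no single molecular target is defined

    @property
    def has_defined_target(self) -> bool:
        return self.target is not None


# Placeholder strings in parentheses mark details that are not given in the slides.
bace = ActivityRecord(
    document="(patent number not given here)",
    assay="(as described in the patent)",
    result="24 nM",
    compound="Example 31",
    target="BACE1",
)
malaria = ActivityRecord(
    document="EP2391601",
    assay="(as described in the patent)",
    result="sub-200 nM",
    compound="Example 60",
)  # no molecular target recorded

print(bace.has_defined_target, malaria.has_defined_target)
```

Keeping the result as the reported string, rather than a number, leaves room for the qualitative or binned values that patents often use.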
                                                                                                 [3]
The Good News: Patent Mining Utility
•   Novel bioactive chemical structures related to drug discovery exceed those in
    journals by at least five-fold.
•   Encompass academic, as well as commercial, global med. chem. output.
•   Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
•   ~ 70% of data initially patent-only, some never disclosed elsewhere.
•   Include synthetic descriptions and other useful enabling information.
•   Precede journal or meeting reports by ~ 1.5 to 5 years.
•   Can be complementary to papers (e.g. larger SAR matrix).
•   Intersect with papers at chemistry, target, disease, author and citation levels
•   IP exploitable for Neglected Tropical Disease research becoming "open".




                                                                                   [4]
The Bad News: Patent Mining Can be Tough
 •   High-specificity retrieval of relevant documents difficult
 •   Massive chaff-to-wheat ratio in 100s of pages
 •   Differences in layout, house style and data location
 •   Markush permutation
 •   Variability in IUPAC strings and image rendering
 •   Use of non-standard gene/protein names
 •   Obfuscation via:
      – Qualitative or binned assay results
      – Structure-to-data links non-obvious, patchy or absent
      – Less than 50% of titles include target names
      – The "hiding the lead and core structures" game
      – Blunderbuss disease and use exemplifications
      – Tense ambiguity (i.e. "could be" vs. "was" done)
 •   Quality judgments difficult
 •   Patents cite papers and patents but few papers cite patents
 •   Document redundancy of Kind codes, patent families and equivalents
 •   Finding drug candidate first-filings is difficult
 •   The PDF hamburger problem and OCR noise
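The kind-code redundancy listed above can be reduced with a small normalisation step. This is a rough sketch under stated assumptions: the regular expression (two-letter country code, digits, optional kind code such as A1 or B1) and both function names are illustrative, and the approach does not resolve patent families or equivalents.

```python
import re
from typing import Iterable, List

# Assumed shape: country code + digits + optional kind code (a letter and an optional digit).
_PUB_NUMBER = re.compile(r"^([A-Z]{2})(\d+)([A-Z]\d?)?$")


def strip_kind_code(pub_number: str) -> str:
    """Return the publication number without its kind code, if the pattern matches."""
    cleaned = re.sub(r"[\s/,-]", "", pub_number.upper())
    match = _PUB_NUMBER.match(cleaned)
    if match is None:
        return cleaned  # leave unrecognised formats untouched
    return match.group(1) + match.group(2)


def unique_publications(numbers: Iterable[str]) -> List[str]:
    """De-duplicate publication numbers that differ only by kind code, keeping order."""
    seen = set()
    result = []
    for number in numbers:
        key = strip_kind_code(number)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


print(unique_publications(["EP2391601A1", "EP 2391601 B1", "EP2391601"]))
```

Equivalent filings in other jurisdictions carry different numbers, so grouping them would still need family data from a patent portal.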
                                                                          [5]
Reasons for Rolling-your-own Patent
           Chemistry and Data Extraction
•   Limited budget
•   You are likely to be a tacit super-curator by profession
•   Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
•   Combine automated outputs with manual triage
•   Develop a technical understanding and comparison of vendor offerings
•   Commercial dbs cap the number of manually-extracted examples
•   Need SAR analogues for a few targets rather than many (e.g. mechanistic
    enzymology or systems chemical biology)
•   Only require data sampling across specific disease areas
•   Not overly concerned about false-negatives (i.e. don’t need
    comprehensive prior-art check or scoping of claims)
•   Open tools operate on any text or web source, not just patents
•   You may already have commercial text mining capability
•   Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL,
    journals you subscribe to, PubMed and PMC)
•   You can slice-and-dice PubChem patent chemistry in ways
    complementary to commercial databases

                                                                                 [6]
Open Sources and Tools Overview
•   Searching metadata, abstracts and text
     – Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
     – Open full-text: FreePatentsOnline, Google patents, Google Scholar, et al.
•   Metadata, full-text and chemical structure search - SureChemOpen
•   Bulk name-to-structure conversion - ChemAxon Chemicalize
•   Individual name-to-structure - OPSIN
•   Conversion of images to structures - OSRA
•   Sketcher inputs – many options
•   Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
•   EPO patent number searching in PubChem
•   PDF24.org for cutting pages and OnlineOCR.net for sections or tables
•   Utopia bioentity mark-up
               (those below not included in this presentation but relevant)
•   NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
•   Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
•   OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al,
    SCRIPDB, Juristica group
           (n.b. Google should give urls for all these source and tool names)
                                                                                   [7]
So What’s in PubChem?




                         [8]
PubChem Patent-derived Content ~6 million




•   ~ 2.8 million Discovery Gate/Thomson Pharma intersect mainly Derwent WPI
    pharmaceutical patents plus some journal extractions
•   ~ 5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
•   ~ 3.5 million of these are Lipinski-ROF compliant
•   ~ 40% journal-extracted structures in ChEMBL have a match in the 5.1 million
•   ~ 70% of these are Lipinski-ROF compliant
•   ~ 90% of these have assay data
•   ~ 60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
•   ~ 2.3 million SureChem pre-2007 structures also in there (but not selectable)
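The Lipinski rule-of-five (ROF) filter used in the counts above can be sketched from four descriptors. This is an illustrative check only: the slides do not say whether any violations were tolerated, so the `max_violations` parameter is an assumption, as are the class and function names. The descriptor values in the usage lines are invented.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one structure (field names are hypothetical)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # calculated octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donor count
    h_acceptors: int    # hydrogen-bond acceptor count


def rof_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def is_rof_compliant(d: Descriptors, max_violations: int = 0) -> bool:
    """True when the structure exceeds no more than max_violations limits."""
    return rof_violations(d) <= max_violations


# Invented descriptor values, for illustration only.
small = Descriptors(mol_weight=412.5, logp=3.1, h_donors=2, h_acceptors=6)
large = Descriptors(mol_weight=640.0, logp=6.2, h_donors=3, h_acceptors=9)
print(is_rof_compliant(small), rof_violations(large))
```

Some implementations allow one violation; passing `max_violations=1` reproduces that looser reading.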
                                                                                    [9]
Chemistry > Patents in PubChem




                                 [10]
You found a CID, what are the Patent and
              Journal links?
PubChem > BioAssay/ChEMBL > CiteExplore/PubMed > PDB




                                                        [11]
Patent Links from SLING and IBM




                                  [12]
PubChem > SureChem > Patent > Structure >
             Data > Target




                                           [13]
Target-Centric Patent Searching




                                  [14]
Synonym Recall




•   Title only BACE1 = 8
•   Title + abstract BACE1 = 97
•   Title + abstract BACE2 = 29
•   Title + abstract BACE = 392
•   Title + abstract "Beta secretase" = 1056
•   Title + abstract memapsin = 87
•   Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
    Memapsin = 1383
•   Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
    Memapsin AND inhibitors = 841
•   Same query to PubMed (this interface) = 1031
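The synonym expansion above can be scripted so the same term list is reused across portals. This is a minimal sketch: the function name is hypothetical, and the explicit parentheses are an assumption about the intended grouping, since the slide's query mixes OR and AND without brackets and each search interface may evaluate that differently.

```python
from typing import Iterable, Optional


def build_query(synonyms: Iterable[str], required: Optional[str] = None) -> str:
    """Join synonyms with OR, quote multi-word terms, and optionally AND a further term."""

    def quote(term: str) -> str:
        return f'"{term}"' if " " in term else term

    seen = set()
    terms = []
    for synonym in synonyms:
        key = synonym.lower()
        if key not in seen:          # drop case-insensitive duplicates
            seen.add(key)
            terms.append(quote(synonym))

    query = "(" + " OR ".join(terms) + ")"
    if required:
        query += " AND " + quote(required)
    return query


bace_terms = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
print(build_query(bace_terms, required="inhibitors"))
```

Field restrictions such as title-only or title plus abstract are portal-specific and are left out of the sketch.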
                                                                     [15]
Target Query > Patent Retrieval from Espacenet




                                                 [16]
Linking Examples to Data in the Patent




                                         [17]
Extracting Chemical Structures




                                  [18]
IUPAC-to-structure: OPSIN




                                                        Installable
                                                        application

                                                        Also chemical
                                                        dictionary
                                                        conversions




Result: the Example 31 structure is a 24 nM BACE1 inhibitor
                                                                      [19]
Image-to-structure: OSRA




• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs

                                                                              [20]
Follow-up Searching




                      [21]
Structure Search in PubChem
    SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)




Often see stereo differences from the Derwent entry in PubChem
                                                                  [22]
PubChem Similarity "Walking"




•   2D and 3D searches give different results
•   Can do multiple steps
•   Can "read" CID history
•   Possible to "walk" between patents
•   Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
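The multi-step "walking" idea can be pictured as a breadth-first traversal over similarity neighbours. This is a conceptual sketch only: `neighbours` stands in for whatever returns the similar compounds of a CID (for example a manual PubChem similarity search), and the toy CIDs and links are invented.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def similarity_walk(start: int,
                    neighbours: Callable[[int], Iterable[int]],
                    max_steps: int = 2) -> Dict[int, int]:
    """Map each reachable CID to the number of similarity steps needed to reach it."""
    reached = {start: 0}
    queue = deque([start])
    while queue:
        cid = queue.popleft()
        step = reached[cid]
        if step == max_steps:
            continue                      # do not expand beyond the step limit
        for other in neighbours(cid):
            if other not in reached:
                reached[other] = step + 1
                queue.append(other)
    return reached


# Invented toy neighbour lists keyed by made-up CIDs.
toy_links = {1: [2, 3], 2: [4], 3: [], 4: [5], 5: []}
walk = similarity_walk(1, lambda cid: toy_links.get(cid, []), max_steps=2)
print(walk)
```

Recording the step count keeps track of how far each compound sits from the starting structure, which helps when judging whether a hit still belongs to the same series.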
                                                                          [23]
Direct Patent <> Chemistry




                             [24]
SureChemOpen: Patent Retrieval




•   Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
•   Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but no
    bulk export)
                                                                                 [25]
SureChemOpen, WIPO, OPSIN and PubChem




                Result: 1 nM (?) BACE2 inhibitor
                with assay and synthesis details.
                                                    [26]
SureChemOpen: Structure > Patent




Direct answers to: "which patents contain compounds similar to my query?"
and "show me all the compounds in these patents"
                                                                          [27]
Non-target Activity Data and Bulk
      Chemistry Extraction




                                    [28]
Malaria Query: CiteExplore > WIPO




                   Example 60, sub-200 nM potency,
                   with solubility and clearance data
                                                       [29]
Espacenet EP2391601 > ChemAxon Chemicalize.org
                                • Description URL from
                                  Espacenet pasted into
                                  Chemicalize.org

                                • Most of 74 examples
                                  converted

                                • Example 60 had 4
                                  analogues in PubChem
                                  at 95% Tanimoto (e.g.
                                  CID 46852300) but no
                                  exact match

                                • Claims section was
                                  Markush description
                                  so no relevant
                                  structures converted




                                                         [30]
EP2391601 > Chemicalize > PubChem




       Chemicalize Similarity listing         PubChem Tanimoto sub-cluster


•   EP2391601 description text > Chemicalize SDF download > PubChem
    Structure Search upload = 311 structures
•   Of these 206 have PubChem exact matches
•   Of these 176 have Thomson Pharma matches
•   The example cluster from the Thomson/Derwent extraction is ~15
•   The example cluster from Chemicalize is ~ 90
•   Ipso facto Chemicalize extracted at least 70 novel structures
•   But only 10 examples were in the highest-potency bin
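The Tanimoto figures used for the analogue matches and clusters above come from a simple ratio of shared to total fingerprint features. The sketch below shows only that arithmetic on plain Python sets; it does not reproduce PubChem's fingerprints, and the function names, feature numbers and library entries are invented.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared features divided by all features present in either set."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def neighbours_above(query: Set[int],
                     library: Dict[str, Set[int]],
                     threshold: float = 0.95) -> List[str]:
    """Names of library entries whose similarity to the query meets the threshold."""
    return [name for name, features in library.items()
            if tanimoto(query, features) >= threshold]


# Invented feature sets standing in for structural fingerprints.
query = set(range(40))
library = {
    "close analogue": set(range(39)),      # 39 shared of 40 total = 0.975
    "distant compound": set(range(20)),    # 20 shared of 40 total = 0.5
}
print(neighbours_above(query, library))
```

A 95% cut-off of this kind returns near neighbours, which is how an example with no exact match can still be tied to analogues already in the database.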
                                                                         [31]
Tips and Tricks




                  [32]
Tables and Recalcitrant IUPACs
                                    PDF

                                Find tables

                                Snip image

                                Online OCR

                                 Word Pad

                                Chemicalize

                                  OPSIN

                                   OSRA

                         • iterative fixing of OCR
                           errors (e.g. 1 vs l)
                         • cross-check Mw in the
                           document
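The iterative fixing of OCR errors can be partly automated by generating candidate spellings to retry in a name-to-structure converter. This is a rough sketch: the confusable character groups and the function name are assumptions, and only one character is swapped at a time to keep the candidate list short.

```python
from typing import List

# Assumed groups of characters that OCR commonly confuses with one another.
CONFUSABLE_GROUPS = ["1lI", "0O"]


def single_swap_variants(name: str) -> List[str]:
    """Return every variant of name that differs by one confusable character."""
    variants = set()
    for index, char in enumerate(name):
        for group in CONFUSABLE_GROUPS:
            if char in group:
                for alternative in group:
                    if alternative != char:
                        variants.add(name[:index] + alternative + name[index + 1:])
    return sorted(variants)


# "l-methyl" is an invented fragment in which OCR has read the locant 1 as a letter l.
candidates = single_swap_variants("l-methyl")
print(candidates)
```

Each candidate can then be pasted into a converter, and the molecular weight of any structure that results can be compared with the value stated in the document.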

                                                [33]
Utopia Mark-up of Patent Introduction




Bioentity mark-up (green) via EMBL Reflect with rich call-out options
                                                                        [34]
Tips for Joining Everything up
•   SureChemOpen is continuing to back-fill and add features.
•   Check the Chemicalize archive (~ 0.5 million) for unique content.
•   Between Chemicalize, OSRA, OPSIN and sketching you can extract most things
    (e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki
    pages, blog posts and MeSH IUPACs).
•   Check PubChem "same connectivity" for tautomer forms in different CIDs.
•   Check PubChem "similar" compounds for analogues even if you cannot track
    back to a patent number.
•   Most PDB ligands published by companies have a patent analogue series.
•   Espacenet text chemicalizes well but FreePatentsOnline can be better.
•   Google Scholar tracks patent citations.
•   Full-text is good but don’t forget to eyeball the original PDF.
•   You can "walk" between patents by 2D/3D clusters, inventors or citations.
•   Less-common author/inventor names may track a journal paper back to a patent.
•   CiteExplore includes selectable ChEMBL structure links.
•   Check ChEMBL structures for SureChem links via ChemSpider.
•   On a good day you can paste OCR table data into Excel.
•   You can set SciBitely patent keyword alerts and see posts on Twitter.
                                                                                 [35]
Conclusions

•   Roll-your-own patent mining can take you a long way.
•   Complementary to commercial databases.
•   Target-centric recall and specificity are reasonable.
•   Published patents are indexed and open text-extracted within weeks.
•   You need perspicacity to dig out SAR details.
•   Can cherry-pick examples by potency or collate whole series.
•   Establishing intersects between journal articles and patents is valuable.
•   Exemplified structures typically cover a broader range of analogue space
    and SAR data than papers.
•   You can "walk" between patents via citation and chemistry clustering.
•   PubChem already contains over 6 million patent-derived structures with
    more depositions and links expected.
•   The increased public surfacing of chemical structures and bioactivity data
    from patents will expedite medicinal chemistry, tropical disease research
    and chemical biology.


                                                                                [36]
Questions Welcome


ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan – at - hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
LinkedIN: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan




                                                                                [37]

Mais conteúdo relacionado

Mais procurados

The STRING database - Quality scores for heterogeneous interaction data
The STRING database - Quality scores for heterogeneous interaction dataThe STRING database - Quality scores for heterogeneous interaction data
The STRING database - Quality scores for heterogeneous interaction dataLars Juhl Jensen
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesDan Sullivan, Ph.D.
 
Protein association networks with STRING
Protein association networks with STRINGProtein association networks with STRING
Protein association networks with STRINGLars Juhl Jensen
 
Text mining and data integration
Text mining and data integrationText mining and data integration
Text mining and data integrationLars Juhl Jensen
 
The Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in BiologyThe Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in Biologyrobertstevens65
 
Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data
Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open DataGraph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data
Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open DataMaulik Kamdar
 
The Language of the Gene Ontology
The Language of the Gene OntologyThe Language of the Gene Ontology
The Language of the Gene Ontologyrobertstevens65
 
Protein association networks: Large-scale integration of data and text
Protein association networks: Large-scale integration of data and textProtein association networks: Large-scale integration of data and text
Protein association networks: Large-scale integration of data and textLars Juhl Jensen
 
Systems biology: Bioinformatics on complete biological systems
Systems biology: Bioinformatics on complete biological systemsSystems biology: Bioinformatics on complete biological systems
Systems biology: Bioinformatics on complete biological systemsLars Juhl Jensen
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsmikaelhuss
 
Gene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textGene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textLars Juhl Jensen
 
Gene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textGene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textLars Juhl Jensen
 
STRING: Protein association networks
STRING: Protein association networksSTRING: Protein association networks
STRING: Protein association networksLars Juhl Jensen
 
STRING: protein association networks
STRING: protein association networksSTRING: protein association networks
STRING: protein association networksLars Juhl Jensen
 
Gene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textGene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textLars Juhl Jensen
 
From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...
From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...
From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...Neo4j
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesGreg Landrum
 

Mais procurados (20)

The STRING database - Quality scores for heterogeneous interaction data
The STRING database - Quality scores for heterogeneous interaction dataThe STRING database - Quality scores for heterogeneous interaction data
The STRING database - Quality scores for heterogeneous interaction data
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious Diseases
 
Protein association networks with STRING
Protein association networks with STRINGProtein association networks with STRING
Protein association networks with STRING
 
Text mining and data integration
Text mining and data integrationText mining and data integration
Text mining and data integration
 
The Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in BiologyThe Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in Biology
 
Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data
Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open DataGraph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data
Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data
 
The Language of the Gene Ontology
The Language of the Gene OntologyThe Language of the Gene Ontology
The Language of the Gene Ontology
 
Protein association networks: Large-scale integration of data and text
Protein association networks: Large-scale integration of data and textProtein association networks: Large-scale integration of data and text
Protein association networks: Large-scale integration of data and text
 
Systems biology: Bioinformatics on complete biological systems
Systems biology: Bioinformatics on complete biological systemsSystems biology: Bioinformatics on complete biological systems
Systems biology: Bioinformatics on complete biological systems
 
BLAST
BLASTBLAST
BLAST
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
David
DavidDavid
David
 
The STRING database
The STRING databaseThe STRING database
The STRING database
 
Gene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textGene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and text
 
Gene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textGene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and text
 
STRING: Protein association networks
STRING: Protein association networksSTRING: Protein association networks
STRING: Protein association networks
 
STRING: protein association networks
STRING: protein association networksSTRING: protein association networks
STRING: protein association networks
 
Gene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and textGene association networks: Large-scale integration of data and text
Gene association networks: Large-scale integration of data and text
 
From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...
From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...
From Advanced Queries to Algorithms and Graph-Based ML: Tackling Diabetes wit...
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
 

Semelhante a Mining Drug Targets, Structures and Activity Data

Exploring SAR between Patents and PubChem
Exploring SAR between Patents and PubChemExploring SAR between Patents and PubChem
Exploring SAR between Patents and PubChemChris Southan
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...Dr. Haxel Consult
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...Chris Southan
 
SureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSSureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSGeorge Papadatos
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistrySunghwan Kim
 
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013
Bioinformatics t9-t10-bio cheminformatics-wimvancriekinge_v2013Prof. Wim Van Criekinge
 
2016 bioinformatics i_bio_cheminformatics_wimvancriekinge
2016 bioinformatics i_bio_cheminformatics_wimvancriekinge2016 bioinformatics i_bio_cheminformatics_wimvancriekinge
2016 bioinformatics i_bio_cheminformatics_wimvancriekingeProf. Wim Van Criekinge
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityChris Southan
 
2015 bioinformatics bio_cheminformatics_wim_vancriekinge
2015 bioinformatics bio_cheminformatics_wim_vancriekinge2015 bioinformatics bio_cheminformatics_wim_vancriekinge
2015 bioinformatics bio_cheminformatics_wim_vancriekingeProf. Wim Van Criekinge
 
ICIC 2014 From SureChem to SureChEMBL
ICIC 2014 From SureChem to SureChEMBLICIC 2014 From SureChem to SureChEMBL
ICIC 2014 From SureChem to SureChEMBLDr. Haxel Consult
 
Antimalarial drug dscovery data disclosure
Antimalarial drug dscovery data disclosureAntimalarial drug dscovery data disclosure
Antimalarial drug dscovery data disclosureChris Southan
 
Comparison of Compounds-to-targets between Databases
Comparison of Compounds-to-targets between DatabasesComparison of Compounds-to-targets between Databases
Comparison of Compounds-to-targets between DatabasesChris Southan
 
Closing the gap between chemistry and biology: Joining between text tombs and...
Closing the gap between chemistry and biology: Joining between text tombs and...Closing the gap between chemistry and biology: Joining between text tombs and...
Closing the gap between chemistry and biology: Joining between text tombs and...Chris Southan
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsSunghwan Kim
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagensChris Southan
 
Bioinformatics t9-t10-biocheminformatics v2014
Bioinformatics t9-t10-biocheminformatics v2014Bioinformatics t9-t10-biocheminformatics v2014
Bioinformatics t9-t10-biocheminformatics v2014Prof. Wim Van Criekinge
 

Semelhante a Mining Drug Targets, Structures and Activity Data (20)

Exploring SAR between Patents and PubChem
Exploring SAR between Patents and PubChemExploring SAR between Patents and PubChem
Exploring SAR between Patents and PubChem
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
Overview of SureChEMBL
Overview of SureChEMBLOverview of SureChEMBL
Overview of SureChEMBL
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...
 
SureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSSureChEMBL and Open PHACTS
SureChEMBL and Open PHACTS
 
PubChem: a public chemical information resource for big data chemistry
Mining Drug Targets, Structures and Activity Data

  • 1. Mining Drug Targets, Structures and Activity Data Using Open Full-Text Patent Sources and Web Tools Christopher Southan ChrisDS Consulting, Göteborg, Sweden, Prepared for BioIT, Boston, April 2012, Track 11, Open Source Solutions, Wednesday, 13:45 [1]
  • 3. Key Relationships Extractable from Patents and Papers [3]
      • Diagram labels: Document, Assay, Result, Compound, Target
      • Target sequence shown in the figure: MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK
      • 2011 PMID 21569515
      • 2010 doi:10.1007/978-3-642-15120-0_9
      • Important "bag of targets" exceptions (e.g. bacterial/parasite whole cells)
  • 4. The Good News: Patent Mining Utility [4]
      • Novel bioactive chemical structures related to drug discovery exceeding those in journals by at least five-fold.
      • Encompass academic, as well as commercial, global med. chem. output.
      • Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
      • ~70% of data initially patent-only, some never disclosed elsewhere.
      • Include synthetic descriptions and other useful enabling information.
      • Precede journal or meeting reports by ~1.5 to 5 years.
      • Can be complementary to papers (e.g. larger SAR matrix).
      • Intersect with papers at chemistry, target, disease, author and citation levels.
      • IP exploitable for Neglected Tropical Disease research becoming "open".
  • 5. The Bad News: Patent Mining Can be Tough [5]
      • High-specificity retrieval of relevant documents is difficult
      • Massive chaff-to-wheat ratio in 100s of pages
      • Differences in layout, house style and data location
      • Markush permutation
      • Variability in IUPAC strings and image rendering
      • Use of non-standard gene/protein names
      • Obfuscation via:
          – Qualitative or binned assay results
          – Structure-to-data links non-obvious, patchy or absent
          – Less than 50% of titles include target names
          – The "hiding the lead and core structures" game
          – Blunderbuss disease and use exemplifications
          – Tense ambiguity (i.e. "could be" vs. "was" done)
      • Quality judgments are difficult
      • Patents cite papers and patents but few papers cite patents
      • Document redundancy of Kind codes, patent families and equivalents
      • Finding drug candidate first-filings is difficult
      • The PDF hamburger problem and OCR noise
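One part of the document-redundancy problem, the same publication appearing under several Kind codes, can be reduced with simple string handling. The sketch below is illustrative only: the regular expression, the function names and the decision to strip leading zeros are assumptions, and it does nothing about patent families or equivalents, which need a family lookup from a patent office source.

```python
import re
from collections import defaultdict

# Assumed layout: two-letter country code, serial number, optional kind code (A1, B2, ...).
PUB_RE = re.compile(r"^([A-Z]{2})\s*0*(\d+)\s*([A-Z]\d?)?$")


def split_publication(number):
    """Split a publication number such as 'EP2391601A1' into (country, serial, kind)."""
    cleaned = number.strip().upper().replace("-", "").replace("/", "")
    match = PUB_RE.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognised publication number: {number!r}")
    country, serial, kind = match.groups()
    return country, serial, kind or ""


def group_by_document(numbers):
    """Group publication numbers that differ only by kind code."""
    groups = defaultdict(list)
    for number in numbers:
        country, serial, kind = split_publication(number)
        groups[country + serial].append(kind)
    return dict(groups)


if __name__ == "__main__":
    # The WO number is a made-up placeholder, not a real family member.
    print(group_by_document(["EP2391601A1", "EP 2391601 B1", "WO2010/123456"]))
```

Grouping on country plus serial keeps one row per document while retaining the list of kind codes seen for it.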
  • 6. Reasons for Rolling-your-own Patent Chemistry and Data Extraction [6]
      • Limited budget
      • You are likely to be a tacit super-curator by profession
      • Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
      • Combine automated outputs with manual triage
      • Develop a technical understanding and comparison of vendor offerings
      • Commercial databases cap the number of manually-extracted examples
      • Need SAR analogues for a few targets rather than many (e.g. mechanistic enzymology or systems chemical biology)
      • Only require data sampling across specific disease areas
      • Not overly concerned about false-negatives (i.e. don’t need comprehensive prior-art check or scoping of claims)
      • Open tools operate on any text or web source, not just patents
      • You may already have commercial text mining capability
      • Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL, journals you subscribe to, PubMed and PMC)
      • You can slice-and-dice PubChem patent chemistry in ways complementary to commercial databases
  • 7. Open Sources and Tools Overview [7]
      • Searching metadata, abstracts and text
          – Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
          – Open full-text: FreePatentsOnline, Google Patents, Google Scholar, et al.
      • Metadata, full-text and chemical structure search - SureChemOpen
      • Bulk name-to-structure conversion - ChemAxon Chemicalize
      • Individual name-to-structure - OPSIN
      • Conversion of images to structures - OSRA
      • Sketcher inputs – many options
      • Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
      • EPO patent number searching in PubChem
      • PDF24.org for cutting pages and OnlineOCR.net for sections or tables
      • Utopia bioentity mark-up
      (those below not included in this presentation but relevant)
      • NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
      • Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
      • OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al, SCRIPDB, Juristica group
      (n.b. Google should give URLs for all these source and tool names)
  • 8. So What’s in PubChem? [8]
  • 9. PubChem Patent-derived Content ~6 million [9]
      • ~2.8 million Discovery Gate/Thomson Pharma intersect, mainly Derwent WPI pharmaceutical patents plus some journal extractions
      • ~5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
      • ~3.5 million of these are Lipinski-ROF compliant
      • ~40% of journal-extracted structures in ChEMBL have a match in the 5.1 million
      • ~70% of these are Lipinski-ROF compliant
      • ~90% of these have assay data
      • ~60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
      • ~2.3 million SureChem pre-2007 structures also in there (but not selectable)
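For readers unfamiliar with the "Lipinski-ROF compliant" filter, a minimal sketch of a rule-of-five check follows, assuming ROF means the usual rule of five. The thresholds are the commonly quoted ones (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 acceptors). The class and function names, the property values and the choice to tolerate one violation are assumptions for illustration; the slides do not say how the PubChem counts were filtered.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Pre-computed descriptors for one structure (values supplied by the caller)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def lipinski_violations(p):
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        p.mol_weight > 500,
        p.logp > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ]
    return sum(checks)


def is_ro5_compliant(p, max_violations=1):
    """Assumed convention: at most one violation still counts as compliant."""
    return lipinski_violations(p) <= max_violations


if __name__ == "__main__":
    # Hypothetical descriptor values, not taken from any patent.
    example = Properties(mol_weight=450.0, logp=3.2, h_donors=2, h_acceptors=6)
    print(lipinski_violations(example), is_ro5_compliant(example))
```

Setting max_violations=0 gives the strict reading of the rule.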
  • 10. Chemistry > Patents in PubChem [10]
  • 11. You found a CID, what are the Patent and Journal links? [11]
      • PubChem > BioAssay/ChEMBL > CiteExplore/PubMed > PDB
  • 12. Patent Links from SLING and IBM [12]
  • 13. PubChem > SureChem > Patent > Structure > Data > Target [13]
  • 15. Synonym Recall [15]
      • Title only BACE1 = 8
      • Title + abstract BACE1 = 97
      • Title + abstract BACE2 = 29
      • Title + abstract BACE = 392
      • Title + abstract "Beta secretase" = 1056
      • Title + abstract memapsin = 87
      • Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin = 1383
      • Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin AND inhibitors = 841
      • Same query to PubMed (this interface) = 1031
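The synonym-expanded query can be assembled programmatically rather than typed by hand. This is a minimal sketch with hypothetical function names. It wraps the OR group in parentheses before applying AND, which is an assumption about the intended grouping, since how an unparenthesised "OR ... AND" binds depends on the search interface. Check the operator syntax of the portal you paste the query into.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(synonyms, required=()):
    """Join synonyms with OR, then AND the group with any required terms."""
    unique = []
    seen = set()
    for term in synonyms:
        key = term.lower()
        if key not in seen:  # drop case-insensitive duplicates, keep first spelling
            seen.add(key)
            unique.append(term)
    group = "(" + " OR ".join(quote(t) for t in unique) + ")"
    return " AND ".join([group] + [quote(t) for t in required])


if __name__ == "__main__":
    names = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
    print(build_query(names, required=["inhibitors"]))
```

Keeping the synonym list in one place makes it easy to rerun the same query against several portals and compare hit counts.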
  • 16. Target Query > Patent Retrieval from Espacenet [16]
  • 17. Linking Examples to Data in the Patent [17]
  • 19. IUPAC-to-structure: OPSIN [19]
      • Installable application
      • Also chemical dictionary conversions
      • Result: Example 31 structure is 24 nM BACE1 inhibitor
  • 20. Image-to-structure: OSRA [20]
      • Patchy results but fixable by editing and similarity iteration in PubChem
      • Also an installable application
      • Useful to cross-check between images and IUPACs
  • 22. Structure Search in PubChem [22]
      • SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)
      • Often see stereo differences to the Derwent entry in PubChem
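The same SMILES lookup can be scripted through PubChem's PUG REST interface instead of the web form. The sketch below only builds the request URL and makes no network call. The path pattern is my understanding of the PUG REST identity lookup and should be checked against the current PubChem documentation; the function name is hypothetical. SMILES containing characters such as "/" may need to be sent by POST rather than in the URL path.

```python
from urllib.parse import quote

# Base address of PubChem PUG REST.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(smiles):
    """Build a URL asking PubChem for the CIDs matching a SMILES string."""
    # Percent-encode everything that is not URL-safe, e.g. '#' in triple bonds.
    encoded = quote(smiles, safe="")
    return f"{PUG_REST}/compound/smiles/{encoded}/cids/JSON"


if __name__ == "__main__":
    # Benzene is used purely as a placeholder query.
    print(cid_lookup_url("c1ccccc1"))
```

Fetching the URL with any HTTP client should return a JSON list of CIDs when the structure is present.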
  • 23. PubChem Similarity "Walking" [23]
      • 2D and 3D give different results
      • Can do multiple steps
      • Can "read" CID history
      • Possible to "walk" between patents
      • Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
  • 24. Direct Patent <> Chemistry [24]
  • 25. SureChemOpen: Patent Retrieval [25]
      • Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
      • Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but not bulk export)
  • 26. SureChemOpen, WIPO, OPSIN and PubChem [26]
      • Result: 1 nM (?) BACE2 inhibitor with assay and synthesis details.
  • 27. SureChemOpen: Structure > Patent [27]
      • Direct answers to: "which patents contain compounds similar to my query" and "show me all the compounds in these patents"
  • 28. Non-target Activity Data and Bulk Chemistry Extraction [28]
  • 29. Malaria Query: CiteExplore > WIPO [29]
      • Example 60, sub-200 nM potency, with solubility and clearance data
  • 30. Espacenet EP2391601 > ChemAxon Chemicalize.org [30]
      • Description URL from Espacenet pasted into Chemicalize.org
      • Most of 74 examples converted
      • Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no exact match
      • Claims section was Markush description so no relevant structures converted
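The "95% Tanimoto" cut-off refers to the Tanimoto coefficient between structure fingerprints: shared "on" bits divided by all "on" bits in either fingerprint. A minimal sketch follows, with fingerprints represented as plain sets of bit positions. The bit sets, compound labels and function names are invented for illustration; PubChem computes its own fingerprints server-side, so this shows the arithmetic only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


def near_analogues(query_bits, library, threshold=0.95):
    """Names of library entries at or above the threshold, excluding exact matches."""
    hits = []
    for name, bits in library.items():
        score = tanimoto(query_bits, bits)
        if threshold <= score < 1.0:
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    query = set(range(40))
    library = {
        "exact": set(range(40)),   # identical fingerprint, score 1.0
        "close": set(range(39)),   # 39 shared of 40 total bits
        "far": set(range(20)),     # 20 shared of 40 total bits
    }
    print(near_analogues(query, library))
```

Excluding a score of exactly 1.0 mirrors the situation on the slide, where analogues were found but no exact match.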
  • 31. EP2391601 > Chemicalize > PubChem [31]
      • Figure panels: Chemicalize similarity listing; PubChem Tanimoto sub-cluster
      • EP2391601 description text > Chemicalize SDF download > PubChem Structure Search upload = 311 structures
      • Of these 206 have PubChem exact matches
      • Of these 176 have Thomson Pharma matches
      • The example cluster from the Thomson/Derwent extraction is ~15
      • The example cluster from Chemicalize is ~90
      • Ipso facto Chemicalize extracted at least 70 novel structures
      • But only 10 examples were in the highest-potency bin
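The overlap counts are set arithmetic on structure identifiers, so they are easy to reproduce once each source is reduced to a set of keys (for example InChIKeys or CIDs). In the sketch below the identifiers are synthetic integers chosen only to mirror the slide's totals, the function name is hypothetical, and it assumes the 176 Thomson Pharma matches are a subset of the 206 PubChem matches, which is one reading of "of these".

```python
def overlap_report(extracted, pubchem, thomson):
    """Summarise how an extracted structure set overlaps two reference sets."""
    extracted = set(extracted)
    in_pubchem = extracted & set(pubchem)
    in_thomson = in_pubchem & set(thomson)
    return {
        "extracted": len(extracted),
        "pubchem_exact": len(in_pubchem),
        "thomson": len(in_thomson),
        "not_in_pubchem": len(extracted - in_pubchem),
        "pubchem_not_thomson": len(in_pubchem - in_thomson),
    }


if __name__ == "__main__":
    # Synthetic identifiers sized to match the counts quoted on the slide.
    extracted = range(311)
    pubchem = range(206)
    thomson = range(176)
    print(overlap_report(extracted, pubchem, thomson))
```

With these totals, 105 extracted structures have no exact PubChem match and 30 are in PubChem from a source other than Thomson Pharma. The cluster sizes of ~90 and ~15 likewise differ by about 75, consistent with "at least 70".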
  • 33. Tables and Recalcitrant IUPACs [33]
      • Workflow labels: PDF, find tables, snip image, online OCR, WordPad, then Chemicalize, OPSIN or OSRA
      • Iterative fixing of OCR errors (e.g. 1 vs l)
      • Cross-check Mw in the document
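The iterative fixing of OCR errors can be partly automated by enumerating the plausible readings of an ambiguous string and keeping the first one that passes a check. This is a sketch under stated assumptions: the confusable-character table (1/l from the slide, plus 0/O as an added guess), the function names and the validator are all hypothetical. In practice the validator would be a name-to-structure conversion followed by a comparison against the Mw printed in the document.

```python
from itertools import product

# Characters OCR commonly confuses; each maps to its candidate readings,
# with the character as read listed first.
CONFUSABLE = {"1": "1l", "l": "l1", "0": "0O", "O": "O0"}


def ocr_variants(text, limit=256):
    """Enumerate readings of text obtained by swapping confusable characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in text]
    total = 1
    for choices in options:
        total *= len(choices)
    if total > limit:
        raise ValueError(f"{total} variants exceeds limit of {limit}")
    return ["".join(combo) for combo in product(*options)]


def first_valid(text, is_valid):
    """Return the first variant accepted by the caller-supplied validator, or None."""
    for candidate in ocr_variants(text):
        if is_valid(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Toy validator standing in for name-to-structure conversion plus an Mw check.
    print(first_valid("2-ch1oro", lambda name: "chloro" in name))
```

The limit guards against long names with many ambiguous characters, where the number of variants grows exponentially.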
  • 34. Utopia Mark-up of Patent Introduction [34]
      • Bioentity mark-up (green) via EMBL Reflect with rich call-out options
  • 35. Tips for Joining Everything up [35]
      • SureChemOpen is continuing to back-fill and add features.
      • Check the Chemicalize archive (~0.5 million) for unique content.
      • Between Chemicalize, OSRA, OPSIN and sketching you can extract most things (e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki pages, blog posts and MeSH IUPACs).
      • Check PubChem "same connectivity" for tautomer forms in different CIDs.
      • Check PubChem "similar" compounds for analogues even if you cannot track back to a patent number.
      • Most PDB ligands published by companies have a patent analogue series.
      • Espacenet text chemicalizes well but FreePatentsOnline can be better.
      • Google Scholar tracks patent citations.
      • Full-text is good but don’t forget to eyeball the original PDF.
      • You can "walk" between patents by 2D/3D clusters, inventors or citations.
      • Less-common author/inventor names may track a journal paper back to a patent.
      • CiteExplore includes selectable ChEMBL structure links.
      • Check ChEMBL structures for SureChem links via ChemSpider.
      • On a good day you can paste OCR table data into Excel.
      • You can set SciBitely patent keyword alerts and see posts on Twitter.
  • 36. Conclusions [36]
      • Roll-your-own patent mining can take you a long way.
      • Complementary to commercial databases.
      • Target-centric recall and specificity is reasonable.
      • Published patents are indexed and open text-extracted within weeks.
      • You need perspicacity to dig out SAR details.
      • Can cherry pick examples by potency or collate whole series.
      • Establishing intersects between journal articles and patents is valuable.
      • Exemplified structures typically cover a broader range of analogue space and SAR data than papers.
      • You can "walk" between patents via citation and chemistry clustering.
      • PubChem already contains over 6 million patent-derived structures with more depositions and links expected.
      • The increased public surfacing of chemical structures and bioactivity data from patents will expedite medicinal chemistry, tropical disease research and chemical biology.
  • 37. Questions Welcome [37]
      • ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
      • Mobile: +46(0)702-530710
      • Skype: cdsouthan
      • Email: cdsouthan – at - hotmail.com
      • Twitter: http://twitter.com/#!/cdsouthan
      • Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
      • LinkedIn: http://www.linkedin.com/in/cdsouthan
      • Website: http://www.cdsouthan.info/CDS_prof.htm
      • Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
      • Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
      • Presentations: http://www.slideshare.net/cdsouthan