Getting the Big Picture by Joining up the SAR dots
This document discusses challenges in integrating structure and bioactivity data at large scales due to the volume and complexity of unstructured data from various sources. It describes efforts to extract chemical entities from text using natural language processing and to standardize structures. The Chemistry Connect knowledge base aims to enable searching across internal and external datasets by developing a chemical dictionary and common representation of concepts.
1. Getting the Big Picture by Joining
up the SAR dots
Large-scale integration of structure
and bioactivity data
The 9th Annual Pharmaceutical IT Congress 2011
Sorel Muresan
AstraZeneca R&D Mölndal
DECS Computational Sciences
2. WO patents with the classification code C07D
Query performed using the European Patent Office search interface
DECS | CompSci
3. Driver – explosion in SAR data
• Chemical information landscape changing fast
• Databases, journal articles, patents, internal docs
[Figure: panels labelled 2006 and 2008]
Southan, C.; Varkonyi, P.; Muresan, S., J. Cheminfo. 2009, 1:10
4. The Challenge – Information deluge
• Volume
• Complexity
• Unstructured content
5. Since 2006 >1M chemistry publications per year
Number of articles (diamonds) and patents (open boxes) abstracted
annually by Chemical Abstracts
Bachrach, J. Cheminformatics 2009, 1:2
7. SAR key entities and relationships
Unstructured data from documents → expert extraction or text mining → structured entries in relational databases
Southan, C.; Boppana, K.; Jagarlapudi, S.; Muresan, S., J. Cheminfo. 2011, 3:14
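As a rough illustration of what a "structured entry" might look like once a SAR point has been extracted from a document, the sketch below links a document, a compound, and a target in one record. The class name, field names, and sample records are all hypothetical and are not taken from any of the databases discussed in this talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SarPoint:
    """One extracted structure-activity record (all field names are hypothetical)."""
    document_id: str    # patent or article identifier
    compound: str       # structure, e.g. as a SMILES string
    target_id: str      # protein target or sequence identifier
    activity_type: str  # e.g. "IC50" or "Ki"
    value_nm: float     # measured value in nanomolar


def compounds_by_document(points):
    """Group the distinct compounds reported in each document."""
    grouped = defaultdict(set)
    for p in points:
        grouped[p.document_id].add(p.compound)
    return dict(grouped)


# Made-up records for illustration only.
points = [
    SarPoint("DOC-1", "CCO", "TARGET-A", "IC50", 120.0),
    SarPoint("DOC-1", "CCN", "TARGET-A", "IC50", 45.0),
    SarPoint("DOC-2", "CCO", "TARGET-B", "Ki", 900.0),
]
by_doc = compounds_by_document(points)
```

Keeping the document identifier on every record is one simple way to preserve provenance when entries from several sources are later merged.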
8. Manually extracted SAR data (commercial)
• GOSTAR (GVKBIO Online Structure Activity Relationship Database) is a
comprehensive database that captures explicit relationships between the three
entities of publications, compounds and sequences.
• It includes 2.6 million compounds linked to 3,500 sequences with 12.5M SAR
points extracted from 43,000 patents and 67,000 articles from 125 journals
9. SAR data (public)
• PubChem
• the NCBI public informatics backbone for the NIH Molecular Libraries
Initiative focused on small molecules as systems biology probes and
potential therapeutic agents. The statistics are 30.5 million
compounds with 85.6 million links. Of the compounds, 1654K have
been tested in 504K assays.
• ChEMBL
• includes drugs, small molecules from the medicinal chemistry or
biochemical literature and their targets. It contains 1,060,258 distinct
compounds extracted by expert manual curation from 42,516
publications with 5,479,146 activities, including SAR and ADMET
values. This data is mapped to 8,603 targets.
10. Extracting chemical entities from text
Collaboration with IBM Research Almaden to apply
text analytics technology to analyze intellectual
property and scientific literature
- 10 million full text patents
- 11 million structures
- 12% out of 46M parent structures in Chemistry Connect
11. Chemical Named Entity Recognition (NER)
7-CHLORO-1,3-DIHYDRO-1-METHYL-5-
PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
Name-to-Structure
software
CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl
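One cheap sanity check on a name-to-structure result is to count the heavy atoms in the returned SMILES and compare them with what the name implies. The sketch below is a toy tokenizer written for this illustration, not part of any name-to-structure product: it handles only simple, bracket-free SMILES such as the one on this slide, and a real pipeline would rely on a cheminformatics toolkit instead.

```python
import re
from collections import Counter

# Matches organic-subset atoms only; two-letter halogens are listed first so
# that "Cl" is not read as carbon. Bracket atoms, charges and isotopes are
# NOT handled by this toy tokenizer.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles):
    """Count heavy atoms per element in a simple, bracket-free SMILES string."""
    # Aromatic atoms are written in lower case, so normalise "c" to "C" etc.
    return Counter(tok.capitalize() for tok in ATOM.findall(smiles))


counts = heavy_atom_counts("CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl")
```

For the SMILES above this gives 16 carbons, 2 nitrogens, 1 oxygen and 1 chlorine, which agrees with the one chloro group, the two ring nitrogens and the single carbonyl oxygen named in the systematic name.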
12. Extracting chemical entities from text
The biggest cause of missing compounds when extracting
chemical entities from text is the presence of typographical
errors: human errors, OCR failures, hyphenation and
multiple line issues, etc.
• Automated spelling correction with CaffeineFix from
NextMove Software
• CaffeineFix significantly improves extraction rates (22%
increase from D=0 to D=1)
• Name-to-structure (n2s) tools are complementary (40% of the
structures come from a single n2s tool)
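The effect of moving from D=0 to D=1 can be pictured with a small edit-distance lookup, reading D as the number of single-character edits allowed (an assumption about the slide's notation). This is a minimal sketch of dictionary-based correction in general and makes no claim about how CaffeineFix works internally; the dictionary and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def correct(token, dictionary, max_distance=1):
    """Return the unique dictionary term within max_distance, else None."""
    token = token.lower()
    if token in dictionary:
        return token
    hits = [term for term in dictionary
            if edit_distance(token, term) <= max_distance]
    # Ambiguous or missing matches are left uncorrected.
    return hits[0] if len(hits) == 1 else None


# Tiny made-up dictionary; a real one would hold chemical name fragments.
DICTIONARY = {"benzodiazepin", "chloro", "methyl", "phenyl", "dihydro"}

fixed = correct("benzodiazepln", DICTIONARY)                    # OCR-style "i" read as "l"
missed = correct("benzodiazepln", DICTIONARY, max_distance=0)   # exact match only
```

With exact matching the misspelt fragment is lost, while allowing a single edit recovers it; refusing ambiguous matches is a conservative choice that trades recall for fewer wrong corrections.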
13. Structure standardisation
“The big merge” requires:
• A common set of chemistry and biology rules
applied carefully & consistently across databases
Muresan, S.; Sitzmann, M.; Southan, C., Biocomputing and Drug Discovery, 2011
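To make the idea of one rule applied consistently across databases concrete, the sketch below applies a single toy rule (keep the largest dot-separated fragment as the parent structure) to two made-up sources before merging them. The rule, the source names and the data are assumptions for illustration only. Comparing SMILES as plain strings works here only because the inputs are written identically; real standardisation needs canonical structures from a cheminformatics toolkit.

```python
import re

# Toy heavy-atom matcher for bracket-free SMILES (halogens first so "Cl" != "C").
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def parent_fragment(smiles):
    """Toy rule: keep the dot-separated fragment with the most heavy atoms."""
    fragments = smiles.strip().split(".")
    return max(fragments, key=lambda frag: len(ATOM.findall(frag)))


def merge_sources(sources):
    """Map each standardised parent to the set of sources that contain it."""
    merged = {}
    for source_name, structures in sources.items():
        for smiles in structures:
            merged.setdefault(parent_fragment(smiles), set()).add(source_name)
    return merged


# Made-up input: the same amine stored as a free base and as a hydrochloride,
# and the same phenol stored with stray whitespace.
sources = {
    "internal": ["CCN", "c1ccccc1O"],
    "external": ["CCN.Cl", " c1ccccc1O"],
}
merged = merge_sources(sources)
```

Because the same rule runs on every source, the hydrochloride and the free base collapse onto one parent, so the merged view reports two structures present in both sources rather than four unrelated entries.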
25. Different Questions, Common Language
Questions
• What compounds have been described in document D?
• What compounds bind target X with an affinity greater than A?
• What targets does compound C bind with an affinity greater than A?
• What compounds have AZ patented on target X?
• What is the structure for this development compound?
• How can I quickly get the SAR data from this patent?
Concepts
Target, Pathway, Institute, People, Disease, Compound, Bioprocess, MoA, Test, Study, Drug, Species, BMO (AE)
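Two of the questions on this slide (which compounds bind target X above an affinity threshold, and which targets compound C binds above that threshold) reduce to the same join over compound, target and activity records. The sketch below uses an in-memory SQLite database with a hypothetical three-table schema and made-up values; it does not reflect the actual Chemistry Connect data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE target   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE activity (
        compound_id INTEGER REFERENCES compound(id),
        target_id   INTEGER REFERENCES target(id),
        pki         REAL  -- affinity as -log10(Ki); higher means tighter binding
    );
    INSERT INTO compound VALUES (1, 'cmpd-1'), (2, 'cmpd-2'), (3, 'cmpd-3');
    INSERT INTO target   VALUES (1, 'X'), (2, 'Y');
    INSERT INTO activity VALUES (1, 1, 8.2), (2, 1, 5.9), (3, 2, 9.1), (1, 2, 6.5);
""")

# Shared join; each question only changes which side is fixed and which is returned.
JOIN = """
    FROM activity a
    JOIN compound c ON c.id = a.compound_id
    JOIN target   t ON t.id = a.target_id
"""


def compounds_binding(conn, target_name, min_pki):
    """Compounds whose affinity for the named target exceeds min_pki."""
    sql = "SELECT c.name" + JOIN + "WHERE t.name = ? AND a.pki > ? ORDER BY c.name"
    return [name for (name,) in conn.execute(sql, (target_name, min_pki))]


def targets_bound(conn, compound_name, min_pki):
    """Targets that the named compound binds with affinity above min_pki."""
    sql = "SELECT t.name" + JOIN + "WHERE c.name = ? AND a.pki > ? ORDER BY t.name"
    return [name for (name,) in conn.execute(sql, (compound_name, min_pki))]


hits_on_x = compounds_binding(conn, "X", 7.0)
targets_of_1 = targets_bound(conn, "cmpd-1", 6.0)
```

Expressing affinity on a single logarithmic scale is a design choice of this sketch: it lets "greater than A" mean the same thing for every record, which mixed units would not.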
26. Take-home messages
• Chemistry Connect is enabling AZ to intensify its exploitation of
synergies between internal and external SAR estate and to shorten
the time needed for hypothesis generation during DMTA cycles
• Our Chemical Dictionary of 120 million chemical terms has become a
crucial cross-mapping resource between chemistry and the scientific
literature
• We cannot wave a magic wand over data quality, provenance issues,
drug name space, and the inherent challenges of chemistry
representation, but Chemistry Connect gives us a unique overview and
amelioration options for each source
27. A Democracy of Ideas (Acknowledgements)
• Plamen Petrov
• Chris Southan
• Paul Xie
• Peter Varkonyi
• Thierry Kogej
• Christian Tyrchan
• Magnus Kjellberg
• Håkan Nilsson
• Mats Ericsson
• Jonas Ekengren
• Marcus Gelderman
• Ithipol Suriyawongkul
• Niklas Blomberg
• Kay Brickmann
• Ola Engkvist
• Yidong Yang
• Hongming Chen
• and many others…