SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword- and chemistry-based querying and export functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline make extensive use of ChemAxon technologies for name-to-structure conversion, as well as compound standardisation, registration and searching.
In addition to an overview of the system, recent developments and improvements will be described. These include the introduction of various data exchange and export options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, these plans include complementing the chemical annotations with biological ones covering genes, proteins, diseases and indications. We also plan to enrich the chemical annotations with a relevance score indicating their importance in the patent document.
2. EMBL-EBI Resources
• Genes, genomes & variation: Ensembl, Ensembl Genomes, European Genome-phenome Archive, Metagenomics portal, European Nucleotide Archive, 1000 Genomes
• Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
4. Why look at patent documents?
• Patent filing and searching
• Legal, financial and commercial incentives & interests
• Prior art, novelty, freedom to operate searches
• Competitive intelligence
• Unprecedented wealth of knowledge
• Most of this knowledge will never be disclosed anywhere else
• Average lag of 2-3 years between disclosure in a patent document and in a journal publication for chemistry
5. From SureChem to SureChEMBL
• Digital Science/Macmillan donated SureChem to EMBL-EBI
• SureChem: commercial patent chemistry mining product
• Wellcome Trust funds further development
• EMBL-EBI provides an ongoing, live service
• Full functionality freely available to everyone
• Query, view and export chemistry from patents
• Complemented with biological annotations
6. SureChEMBL data processing
[Diagram: SureChEMBL data processing pipeline, with the following labels]
• Patent offices: WO, EP (applications & granted), US (applications & granted), JP (abstracts)
• SureChem IP
• Processing steps: OCR, entity recognition, name to structure (five methods), image to structure (one method)
• Example chemical name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
• Processed patents (service)
• Patent PDFs (service)
• Chemistry database
• SureChEMBL system: application, API, database
• Users
8. Homepage
• Search by keyword and meta-data
• Search by chemical structure (sketch compound)
• Search by SMILES, MOL, SMARTS, name
• Search by patent number
• Chemical search type (substructure, similarity, identical)
• Filter by authority (US, EP, WO and JP)
• Filter by document section (title, claims, abstract, description and images)
• Filter by date
• Filter by MW
• Help
www.surechembl.org
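The chemical search types on the homepage (substructure, similarity, identical) differ in how strictly a hit must match the query. As a rough illustration of the similarity case only, the sketch below ranks compounds by Tanimoto coefficient over fingerprints represented as sets of bit positions. The fingerprints, compound labels and threshold are invented placeholders; the slides only say that searching is built on ChemAxon technology and do not describe the fingerprints or scoring actually used.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints match
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs scoring at least `threshold`, best first."""
    scored = [(label, tanimoto(query, fp)) for label, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints, not derived from real structures.
library = {
    "cmpd_A": frozenset({1, 2, 3, 4}),
    "cmpd_B": frozenset({1, 2, 3, 9}),
    "cmpd_C": frozenset({7, 8}),
}
query = frozenset({1, 2, 3, 4})
hits = similarity_search(query, library, threshold=0.5)
```

In this toy setting an "identical" search corresponds to a score of 1.0, while a substructure search needs graph matching and cannot be reduced to a fingerprint comparison alone.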
12. Data growth
• ~80K novel compounds every month
• ~800K novel compounds since EBI took over
• 2–7 days for a published patent to be chemically annotated and searchable in SureChEMBL
[Figure: cumulative growth of SureChEMBL compounds (compound count vs. time)]
13. EMBL-EBI chemistry resources
RDF and REST API interfaces
• Atlas: ligand-induced transcript response (750)
• PDBe: ligand structures from protein complexes (15K)
• ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24K)
• SureChEMBL: chemical structures from patent literature (16M)
• ChEMBL: bioactivity data from literature and depositions (1.5M)
• 3rd party data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, …. (~65M)
• UniChem: InChI-based chemical resolver (full + relaxed ‘lenses’) (>90M)
REST API Interface - https://www.ebi.ac.uk/unichem/
14. Data access & exports
• Full compound repository
• FTP download, SDF and CSV format
• Updates quarterly
• Full compound-patent map
• FTP download, flat file
• Updates quarterly
• Data feed client
• Creates a local replica database of SureChEMBL
• Updates daily
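For the SDF export, a minimal sketch of reading records with only the Python standard library is shown below. It relies on two general SDF conventions: each record ends with a `$$$$` line, and data items are introduced by header lines of the form `> <FIELD>`. The field names `ID` and `SMILES` and the sample record are hypothetical, since the slides do not list the fields present in the SureChEMBL files.

```python
import io
from typing import Dict, Iterator, List, TextIO


def _parse_record(lines: List[str]) -> Dict[str, str]:
    """Split one SDF record into its molblock and its data fields."""
    # The molblock runs up to and including the "M  END" line.
    end = next(
        (i for i, text in enumerate(lines) if text.strip() == "M  END"),
        len(lines) - 1,
    )
    record = {"molblock": "\n".join(lines[: end + 1])}
    field, values = None, []
    for text in lines[end + 1:]:
        if text.startswith(">"):
            if field is not None:
                record[field] = "\n".join(values).strip()
            start, stop = text.find("<"), text.rfind(">")
            field = text[start + 1:stop] if 0 <= start < stop else None
            values = []
        elif field is not None:
            values.append(text)
    if field is not None:
        record[field] = "\n".join(values).strip()
    return record


def iter_sdf_records(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per record; records are terminated by a '$$$$' line."""
    lines: List[str] = []
    for raw in handle:
        text = raw.rstrip("\r\n")
        if text.strip() == "$$$$":
            yield _parse_record(lines)
            lines = []
        else:
            lines.append(text)


# Hypothetical single-record file, for illustration only.
sample = "\n".join([
    "example-1",
    "  hypothetical record",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <ID>",
    "EXAMPLE-1",
    "",
    "> <SMILES>",
    "CCO",
    "",
    "$$$$",
    "",
])
records = list(iter_sdf_records(io.StringIO(sample)))
```

Streaming record by record keeps memory use flat, which matters for a repository of this size. A cheminformatics toolkit would be the usual choice once the structures themselves need to be interpreted.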
15. Compound-patent map
• Flat file with
• Compound, global frequency, document, section, section frequency, publication date
• Back file
• 187,958,584 unique patent-compound pairs
• 14,076,090 unique compound IDs
• 3,585,233 EP, JP, WO and US patent docs
• 1960-2014
• Quarterly incremental updates
• The Q1 2015 update is now also available on the FTP site
http://chembl.blogspot.co.uk/2015/03/the-surechembl-map-file-is-out.html
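As a sketch of how the map file could be consumed, the code below parses rows with the six fields listed on this slide and collects the set of documents in which each compound appears. The tab delimiter, the absence of a header row, the column order, the date format and all sample values are assumptions made for illustration; check them against the actual file before use.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Iterator, NamedTuple, Set, TextIO


class MapRow(NamedTuple):
    compound_id: str
    global_frequency: int
    document: str
    section: str
    section_frequency: int
    publication_date: str


def read_map(handle: TextIO) -> Iterator[MapRow]:
    """Parse tab-separated rows, skipping any that do not have six fields."""
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 6:
            continue
        yield MapRow(
            fields[0], int(fields[1]), fields[2],
            fields[3], int(fields[4]), fields[5],
        )


def documents_per_compound(rows: Iterable[MapRow]) -> Dict[str, Set[str]]:
    """Map each compound ID to the set of documents that mention it."""
    docs: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        docs[row.compound_id].add(row.document)
    return dict(docs)


# Hypothetical rows; identifiers and counts are placeholders.
sample = (
    "CMPD-1\t12\tDOC-A\tclaims\t2\t2014-01-02\n"
    "CMPD-1\t12\tDOC-A\tdescription\t5\t2014-01-02\n"
    "CMPD-1\t12\tDOC-B\tabstract\t1\t2014-03-04\n"
    "CMPD-2\t1\tDOC-B\timages\t1\t2014-03-04\n"
)
docs = documents_per_compound(read_map(io.StringIO(sample)))
```

Because a compound can occur in several sections of the same document, counting documents requires de-duplicating on the document field, which is why a set is used here.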
17. Use cases with SureChEMBL
• Chemoinformatics
• Chemistry landscape for a particular biological target/disease
• Novel chemistry & scaffolds
• MDS, MCS and R-group analysis of the chemistry claimed in a particular patent family
• (Negative) novelty checking with UniChem
• Competitive intelligence
• Reporting
• Patent alerts
• Per target/disease/company
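The "(negative) novelty checking" use case can be sketched as a set-membership test on InChIKeys: a compound found among already-indexed structures is known, whereas a compound that is not found is merely not contradicted. The sketch below assumes the known keys have already been gathered (for example from a UniChem lookup, which is not shown); the function name and the keys are hypothetical placeholders that only follow the standard 14-10-1 InChIKey layout.

```python
import re
from typing import Iterable, List, Set, Tuple

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def split_by_novelty(
    candidates: Iterable[str], known_keys: Set[str]
) -> Tuple[List[str], List[str]]:
    """Return (already_known, not_found) lists of candidate InChIKeys."""
    already_known: List[str] = []
    not_found: List[str] = []
    for key in candidates:
        if not INCHIKEY_RE.match(key):
            raise ValueError(f"not a well-formed InChIKey: {key!r}")
        if key in known_keys:
            already_known.append(key)
        else:
            not_found.append(key)
    return already_known, not_found


# Placeholder keys: syntactically valid, but not real compounds.
known_keys = {"AAAAAAAAAAAAAA-AAAAAAAAAA-A"}
candidates = ["AAAAAAAAAAAAAA-AAAAAAAAAA-A", "BBBBBBBBBBBBBB-BBBBBBBBBB-B"]
already_known, not_found = split_by_novelty(candidates, known_keys)
```

A key in `not_found` only means the compound was absent from the sources consulted, so it is a lead for further checking rather than proof of novelty.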
22. Future steps
• OpenPHACTS ENSO
• Biological tagging of targets, genes, indications and diseases
• Development of integrated use-cases
• Combine chemistry & biology from patents, literature, pathways, etc.
• OpenPHACTS API
• Accessible via KNIME nodes
• Further improvements/added value
• Data quality and accuracy
• Target and compound relevance score
23. Acknowledgements
ChEMBL team:
• John Overington
• Anne Hersey
• Anna Gaulton
• Mark Davies
• Nathan Dedman
• Michal Nowotka
Collaborators:
• James Siddle
• Richard Koks
• Lee Harland
• Kevin Clark
Support:
surechembl-help@ebi.ac.uk
Webinar:
http://www.ebi.ac.uk/training/online/course/surechembl-accessing-chemical-patent-data-webinar
32. Common sources of errors
• Small, poor quality images
• OCR errors in names (OCR done by IFI). There is an OCR correction step, but it cannot fix all errors
-> ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3-
vDbenzamide’
• Reliability is better for US patents because they include mol files
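To illustrate why a correction step cannot repair every OCR error, the sketch below applies a few substitution rules to the garbled name shown on this slide. The rules are hypothetical and written only for this example; the slides do not describe how the actual correction step works.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical rules, each aimed at one misreading visible in the example.
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"Λ/"), "N"),            # "N" misread as "Λ/"
    (re.compile(r"(\d) H-"), r"\1H-"),    # stray space inside "1H-"
    (re.compile(r"(\d) -"), r"\1-"),      # stray space before a hyphen
]


def correct_ocr(name: str) -> str:
    """Apply each substitution rule in order and return the cleaned name."""
    for pattern, replacement in RULES:
        name = pattern.sub(replacement, name)
    return name


raw = (
    "2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3-"
    "vDbenzamide"
)
fixed = correct_ocr(raw)
```

The rules repair the `Λ/` and spacing errors, but fragments such as `r(`, `methvn` and `vD` survive because they are ordinary characters in the wrong place. Catching those needs knowledge of chemical nomenclature rather than character-level patterns.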