In recent years, in parallel with the broad trend of information proliferation, many tens of public chemical databases have been created and made available using internet technologies. In many cases data have flowed freely between these databases as they source information from one another. While this has the advantage of linking multiple data sources together, it has also propagated errors across the databases. The lack of a public authority to resolve such errors significantly affects the quality of freely accessible chemical information. While ChemSpider has previously allowed a crowdsourcing approach to curation, efforts have now migrated to addressing this problem using a "federated resolver" approach. This presentation will report on our work in this area.
4. It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Competitors?
Working On Now?
Connections to disease?
Expressed in right cell type?
IP?
5. Open PHACTS Project
Develop a set of robust standards…
Implement the standards in a semantic integration hub
Deliver services to support drug discovery programs in pharma and public domain
22 partners, 8 pharmaceutical companies, 3 biotechs
36-month project
Guiding principle is open access, open usage, open source
- Key to standards adoption -
8. MeSH
A lipid cofactor that is required for normal blood clotting.
Several forms of vitamin K have been identified:
VITAMIN K 1 (phytomenadione) derived from plants,
VITAMIN K 2 (menaquinone) from bacteria, and
synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione).
12. Create an Online “Resolver” as a path to chemistry
Search all forms of structure IDs
Systematic name(s)
Trivial Name(s)
SMILES
InChI Strings
InChIKeys
Database IDs
Registry Number
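A resolver that accepts all of these identifier forms first has to guess which form it was given. The sketch below is one minimal way to do that, assuming simple shape checks are enough for routing: the function name `classify_identifier` and the returned labels are illustrative, not part of any ChemSpider interface. The SMILES test is only a token-level heuristic and will misread a few short names made entirely of element symbols.

```python
import re

# Shape checks only: none of these patterns proves chemical validity.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")
# Bracket atoms, organic-subset atoms, aromatic atoms, bonds/branches, ring digits.
SMILES_RE = re.compile(
    r"(?:\[[^\[\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#$:/\\.()%]|\d)+"
)


def cas_checksum_ok(value: str) -> bool:
    """Return True if value looks like a CAS Registry Number with a valid check digit."""
    match = CAS_RE.fullmatch(value)
    if not match:
        return False
    # Weight the digits 1, 2, 3, ... from right to left, excluding the check digit.
    digits = (match.group(1) + match.group(2))[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(match.group(3))


def classify_identifier(query: str) -> str:
    """Guess the identifier type of a query string (illustrative labels)."""
    text = query.strip()
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.fullmatch(text):
        return "inchikey"
    if cas_checksum_ok(text):
        return "registry_number"
    # Require at least one atom so that digit-only strings are not called SMILES.
    if SMILES_RE.fullmatch(text) and re.search(r"[A-Za-z\[]", text):
        return "smiles"
    # Systematic names, trivial names and database IDs all fall through to here.
    return "name_or_database_id"
```

Names and database IDs are left together on purpose: telling them apart needs a dictionary lookup, not a pattern.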
18. Resolving Names for QUALITY
A search on a chemical identifier should resolve to the correct chemical as often as possible
19. Validated Name-Structure Dictionaries
Chemical name dictionaries are used for:
Text-mining (publications, patents)
Used to index PubMed and link to Google Patents
Linking to other databases – think Biology!
When structures are not available, drug names provide the link
Searching the web
Names link to structures, and structures link to InChIs
24. Top 200 Drugs on Wikipedia
http://en.wikipedia.org/wiki/List_of_bestselling_drugs
25. The Project Challenge PART ONE
Agree on the set of chemical names to work with
Independently create an SDF file in each “lab”
Compare differences and agree on final structures
Issue “Gold Standard” SDF file to team
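The compare step can be sketched as follows, assuming each lab's SDF file has already been reduced to a mapping from chemical name to InChIKey with a cheminformatics toolkit (that conversion is not shown). The function name `find_disagreements`, the lab labels, and the keys in the usage example are placeholders, not data from the project.

```python
from typing import Dict, Optional


def find_disagreements(
    labs: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the names on which the labs do not all report the same InChIKey.

    labs maps a lab label to that lab's {chemical name: InChIKey} mapping.
    A name missing from one lab is reported with None for that lab.
    """
    all_names = set().union(*(records.keys() for records in labs.values()))
    disagreements = {}
    for name in sorted(all_names):
        answers = {lab: records.get(name) for lab, records in labs.items()}
        if len(set(answers.values())) > 1:
            disagreements[name] = answers
    return disagreements


if __name__ == "__main__":
    # Placeholder keys in InChIKey format, not real compounds.
    lab_a = {"drug one": "AAAAAAAAAAAAAA-AAAAAAAASA-N",
             "drug two": "BBBBBBBBBBBBBB-AAAAAAAASA-N"}
    lab_b = {"drug one": "AAAAAAAAAAAAAA-AAAAAAAASA-N",
             "drug two": "BBBBBBBBBBBBBB-CCCCCCCCSA-N"}
    for name, answers in find_disagreements({"lab A": lab_a, "lab B": lab_b}).items():
        print(name, answers)
```

Only the names returned here need manual discussion before the agreed structures go into the final file.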
28. The Project Challenge PART TWO
Use Gold Standard SDF File to investigate data quality on these compounds in Internet Databases
Two checks
Search chemical name – does it return the correct compound? If not correct, how is it different?
Search “structure” – SMILES, Molfile, InChIString or InChIKey
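One way to answer "how is it different?" automatically is to compare standard InChIKeys block by block: the first 14 letters hash the connectivity, the next 8 letters hash the remaining layers such as stereochemistry and isotopes, and the final letter flags protonation. The sketch below assumes both keys are well-formed InChIKeys; `describe_mismatch` and `audit` are hypothetical helper names, and the `lookup` argument stands in for whatever database search is being tested.

```python
from typing import Callable, Dict, Optional


def describe_mismatch(expected_key: str, found_key: Optional[str]) -> str:
    """Say how a returned InChIKey differs from the gold-standard one."""
    if found_key is None:
        return "not found"
    if found_key == expected_key:
        return "match"
    exp_skeleton, exp_rest, exp_prot = expected_key.split("-")
    got_skeleton, got_rest, got_prot = found_key.split("-")
    if exp_skeleton != got_skeleton:
        return "different connectivity"
    # First 8 letters of the second block hash stereo, isotope and similar layers.
    if exp_rest[:8] != got_rest[:8]:
        return "same connectivity, different stereo/isotope layers"
    if exp_prot != got_prot:
        return "same structure, different protonation flag"
    return "same structure, different InChI flavour or version"


def audit(gold: Dict[str, str],
          lookup: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Run the name-search check for every gold-standard name."""
    return {name: describe_mismatch(key, lookup(name)) for name, key in gold.items()}
```

A "different connectivity" result usually signals the wrong compound, while a difference only in the second block often points to missing or incorrect stereochemistry.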
36. One dictionary lookup is never enough…
ChemSpider does not contain all chemistry
We are not the only ones curating data
New chemistry expands daily and goes online
37. One dictionary lookup is never enough…
Federation is key…
Check ChemSpider first, if not found then
Check PubChem
Check NCI resolver
Check ChEBI
Check … the “network” of open interfaces
Each resolver will have its own “quantitative confidence”.
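The check-first, fall-back-next order described above can be sketched as a simple chain. Everything here is an assumption for illustration: the `Resolver` class, the in-memory stub lookups standing in for ChemSpider, PubChem and the other services, and the confidence numbers, which are made up rather than measured.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Resolver:
    name: str
    lookup: Callable[[str], Optional[str]]  # query -> structure string, or None
    confidence: float                       # hypothetical per-resolver score


def resolve_federated(query: str, resolvers: List[Resolver]) -> Optional[Dict]:
    """Ask each resolver in order and return the first hit with its provenance."""
    for resolver in resolvers:
        structure = resolver.lookup(query)
        if structure is not None:
            return {
                "query": query,
                "structure": structure,
                "source": resolver.name,
                "confidence": resolver.confidence,
            }
    return None


if __name__ == "__main__":
    # In-memory stubs; a real chain would wrap each service's web interface.
    first = Resolver("first-stub", {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}.get, 0.9)
    second = Resolver("second-stub", {"ethanol": "CCO"}.get, 0.7)
    print(resolve_federated("ethanol", [first, second]))
    print(resolve_federated("unobtainium", [first, second]))
```

Returning the source and its confidence alongside the structure lets a caller decide whether a hit from a lower-ranked resolver is good enough.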
38. Chemical Identifier Resolver (CIR)
Converts a given structure identifier into another representation or structure identifier.
Resolve names, identifiers etc
http://cactus.nci.nih.gov/chemical/structure
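CIR requests take the form of the base URL above followed by the identifier and the wanted representation. A minimal sketch of building such a request, assuming the path layout `/<identifier>/<representation>` and representation names such as `smiles` and `stdinchikey`; the helper names `cir_url` and `cir_fetch` are illustrative, and `cir_fetch` needs network access and has no error handling.

```python
from urllib.parse import quote
from urllib.request import urlopen

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier: str, representation: str) -> str:
    """Build a CIR request URL for one identifier and one output representation."""
    # Percent-encode everything, since SMILES can contain '#', '/' and '='.
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"


def cir_fetch(identifier: str, representation: str, timeout: float = 10.0) -> str:
    """Fetch the plain-text answer from CIR (requires network access)."""
    with urlopen(cir_url(identifier, representation), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    print(cir_url("aspirin", "smiles"))
    print(cir_url("C#N", "stdinchikey"))
```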
40. We are building….
A central federated resolver utilizing available services
Dictionary lookups, systematic name conversions (multiple tools – ACD/Labs, Lexichem, OPSIN)
“Consensus” decisions and guidance BUT
Chemicals have timelines!!!
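A “consensus” decision across several services or name-conversion tools could be sketched as a vote over the structures they return. This is only one possible rule, assuming each source's answer has been normalised to an InChIKey; the function name `consensus`, the source labels, and the tie handling are assumptions, and the vote ignores when each source last updated its record.

```python
from collections import Counter
from typing import Dict, Optional


def consensus(answers: Dict[str, Optional[str]]) -> Optional[Dict]:
    """Majority vote over the InChIKeys returned by several sources.

    answers maps a source label to its InChIKey, or None if it found nothing.
    Returns None when no source answered.
    """
    votes = Counter(key for key in answers.values() if key is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    top_key, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    return {
        "structure": None if tied else top_key,      # no winner on a tie
        "agreement": top_count / sum(votes.values()),
        "needs_review": len(ranked) > 1,             # any dissent goes to a curator
    }
```

Flagging every dissenting answer for review keeps the vote as guidance for a curator and not as a final ruling.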