1. CVSP
New Data Model
Standardization and Parents
2. Why is there a need for a new data model?
• Details of the proposed new data model
• Concept of Substance and Compound
• Standardization Workflow
• The benefit of having parents (we all know this)
3. Deposited Record – “Substance”
A unique Substance Identifier (SID) is assigned to each
deposited record (a record is identified by the combination of Data
Source and the depositor’s internal database registry identifier)
The benefits of keeping deposited record data (“Substances”) and
standardized record data (“Compounds”) in separate, independent
layers (the archive model) are:
• Depositors’ records (Substances) are preserved as they
are, with no alteration
• Depositors’ records are versioned when changes occur.
Only the latest version is used in links and
calculations
• The same chemical may be deposited by several depositors –
each deposit will have a different substance ID, but all of
them will be linked to the same standardized compound
• Any records can be accepted – even those not producing
InChI (e.g. plant extracts, blood samples, polymers, etc.)
• The Substance identifier (SID) is guaranteed not to change
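The archive model above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and field names (`Archive`, `Substance`, `SubstanceVersion`, `molfile`, `csid`) are hypothetical, and the CSID is passed in by the caller rather than computed by a real standardizer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SubstanceVersion:
    """One immutable snapshot of a deposited record (hypothetical layout)."""
    version: int
    molfile: str           # deposited structure, stored unaltered
    csid: Optional[int]    # linked compound; None if no InChI could be produced


@dataclass
class Substance:
    sid: int               # assigned once, never changes
    data_source: str
    registry_id: str       # depositor's internal registry identifier
    versions: List[SubstanceVersion] = field(default_factory=list)

    @property
    def latest(self) -> SubstanceVersion:
        # Only the latest version is used in links and calculations.
        return self.versions[-1]


class Archive:
    """Keeps one Substance per (data source, registry id) pair."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Substance] = {}
        self._next_sid = 1

    def deposit(self, data_source: str, registry_id: str,
                molfile: str, csid: Optional[int]) -> Substance:
        key = (data_source, registry_id)
        substance = self._by_key.get(key)
        if substance is None:
            substance = Substance(self._next_sid, data_source, registry_id)
            self._next_sid += 1
            self._by_key[key] = substance
        elif (substance.latest.molfile, substance.latest.csid) == (molfile, csid):
            return substance   # nothing changed, so no new version
        number = len(substance.versions) + 1
        substance.versions.append(SubstanceVersion(number, molfile, csid))
        return substance
```

Two depositors submitting the same chemical would get two different SIDs here, while both latest versions carry the same CSID; a record with `csid=None` is still accepted and archived.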
5. What happens when standardization rules are adjusted?
Would that affect Substance-Compound relationships?
Would the SID-CSID mapping change?
Yep, it is possible!!
• After an occasional total ChemSpider re-standardization we can’t
guarantee that the same standardized compound (CSID) will be
linked to a Substance – the mapping may change.
This, however, will not in any way affect depositors’ SIDs.
• Depositors should be encouraged to use their substance
identifiers (SIDs) when referring to ChemSpider
• We need to develop compound permalinks (URLs) that depositors
can always use to reach their up-to-date CSIDs via SIDs. In this
case, our re-standardization wouldn’t affect external references.
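A minimal sketch of how such a permalink could behave, assuming a simple in-memory SID-to-CSID table. The class name `SidResolver`, the `example.org` base URL and the `/substance/` and `/compound/` paths are placeholders, not actual ChemSpider URLs.

```python
from typing import Dict


class SidResolver:
    """Resolves a stable SID permalink to whatever CSID is current."""

    def __init__(self, base_url: str = "https://example.org") -> None:
        self.base_url = base_url.rstrip("/")
        self._sid_to_csid: Dict[int, int] = {}

    def restandardize(self, new_mapping: Dict[int, int]) -> None:
        # A total re-standardization replaces the whole SID -> CSID table.
        self._sid_to_csid = dict(new_mapping)

    def permalink(self, sid: int) -> str:
        # Depends only on the SID, so it survives re-standardization.
        return f"{self.base_url}/substance/{sid}"

    def resolve(self, sid: int) -> str:
        # Looked up at request time, so it always reflects the current mapping.
        return f"{self.base_url}/compound/{self._sid_to_csid[sid]}"
```

The link a depositor publishes is the output of `permalink`; only the target returned by `resolve` moves when the mapping changes.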
6. What happens when a depositor revokes a Substance (SID)?
• Revoking still versions the substance record. A new version
of the record is created with a “not alive” flag.
• Revoked substances are no longer indexed
• If no Substances point at the Compound any more, the
Compound is deleted. Otherwise, the data from the revoked
substances is removed from the compound
• If a revoked substance is re-deposited, a new version is created
with a “live” flag
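The revocation rules above could be sketched like this. It is a simplified illustration: the `Registry` class and its attributes are hypothetical, each version is reduced to its live flag, and a compound is reduced to the set of live SIDs pointing at it.

```python
from collections import defaultdict
from typing import Dict, List, Set


class Registry:
    def __init__(self) -> None:
        self.history: Dict[int, List[bool]] = defaultdict(list)  # sid -> live flag per version
        self.sid_to_csid: Dict[int, int] = {}
        self.compounds: Dict[int, Set[int]] = defaultdict(set)   # csid -> live sids

    def deposit(self, sid: int, csid: int) -> None:
        # A (re-)deposit always adds a new version with the "live" flag.
        self.history[sid].append(True)
        self.sid_to_csid[sid] = csid
        self.compounds[csid].add(sid)

    def revoke(self, sid: int) -> None:
        # Revoking is itself a new version, flagged "not alive".
        self.history[sid].append(False)
        csid = self.sid_to_csid[sid]
        self.compounds[csid].discard(sid)   # pull this substance's data off the compound
        if not self.compounds[csid]:
            del self.compounds[csid]        # no substances left: delete the compound

    def is_indexed(self, sid: int) -> bool:
        versions = self.history.get(sid, [])
        return bool(versions) and versions[-1]
```

Note that nothing is ever removed from `history`: revoking and re-depositing only append versions, which matches the archive idea of the substance layer.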
7. FDA Substance Registration System
Version 5c, 2007
• This guide is used to standardize the entry of
substances into the Food and Drug Administration
(FDA) Substance Registration System (SRS)
• The primary purpose of this guide is to prevent
duplicate entries of a single substance
• Conventions for drawing structures and for
organizing the characteristics of substances are
included
• The lack of a standardization system at the FDA gave
rise to the SRS SOP, which serves as guidelines for
curators to draw chemicals the same way and
avoid duplication in the database
8. Standardization – is it possible to please all
interested parties?
Depending on the area of specialization:
• Some folks may insist on neutralizing charges
while others may feel differently
• Some may think that the canonical tautomer should
always be in a specific form
• We believe that “mild” standardization
supplemented with parents may
be the right choice to please as many interested
parties as possible
9. Standardization – Step I - Organometallics
Always disconnect N, O, and F from metals:
Example: (Ph3)Sn+… HO-
Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
Ionize free metals with carboxylic acids
Metals: Group I and II
Whenever covalent bonds with metals are disconnected, charges are adjusted.
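One possible reading of the disconnection rules, as a toy sketch. The element sets are deliberately incomplete, the function names are hypothetical, and adjusting charges by the bond order is an assumption about how "charges are adjusted" is meant; the carboxylic acid rule is not covered.

```python
from dataclasses import dataclass

ALWAYS_DISCONNECT = {"N", "O", "F"}
# Illustrative subsets only; a real implementation would cover the periodic table.
TRANSITION_METALS = {"Fe", "Co", "Ni", "Cu", "Pd", "Pt", "Hg"}
METALS = TRANSITION_METALS | {"Li", "Na", "K", "Mg", "Ca", "Al", "Sn"}
NONMETALS = {"C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


@dataclass
class Atom:
    symbol: str
    charge: int = 0


def should_disconnect(metal: str, partner: str) -> bool:
    """Decide whether a covalent metal-nonmetal bond is broken."""
    if metal not in METALS or partner not in NONMETALS:
        return False
    if partner in ALWAYS_DISCONNECT:
        return True                      # N, O, F: always disconnected from metals
    # Other nonmetals: only from transition metals, and never from Hg.
    return metal in TRANSITION_METALS and metal != "Hg"


def disconnect(metal: Atom, partner: Atom, bond_order: int = 1) -> bool:
    """Break the bond if the rules say so and adjust both formal charges."""
    if not should_disconnect(metal.symbol, partner.symbol):
        return False
    metal.charge += bond_order           # assumption: one charge unit per bond order
    partner.charge -= bond_order
    return True
```

Applied to the Sn-O bond of the triphenyltin hydroxide example, this leaves Sn with charge +1 and O with charge -1, matching the (Ph3)Sn+ … HO- picture.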
10. Standardization – Step II (CVSP only)
Tautomer Canonicalization
In CVSP, tautomer canonicalization is part of standardization.
In the Open PHACTS model, tautomer canonicalization is not part of standardization.
Instead, a tautomer-unsensitive (canonicalized tautomer) parent is
generated. Why is the Open PHACTS approach different?
Mapping different tautomers of the same family to different
standardized compounds gives better tautomer-specific annotation
mapping (e.g. tautomer-specific NMR spectra, calculated properties, etc.)
Standardized compounds representing the same tautomeric family will have
the same tautomeric parent: the canonicalized tautomer
11. Standardization – Step III
Some basic InChI normalization rules were used (~30)
[*;H+:1]>>[*;H:1]
[O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
[N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Etc
FDA SRS rules added (~30)
[n:1]=[O:2]>>[n+:1][O-:2]
[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
Thiopurine
[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
etc
13. For each Compound (CSID), parent generation is
attempted
“Tautomerism in large databases”, Sitzmann et al.,
J. Comput. Aided Mol. Des. (2010)
Parent                 Description
Fragment-Unsensitive   The largest organic fragment is identified and set
                       as the fragment parent.
Charge-Unsensitive     An attempt is made to neutralize ionized acids
                       and bases. Envisioned as an ongoing
                       improvement as new cases appear.
Isotope-Unsensitive    Isotopes are replaced by the common weight.
Stereo-Unsensitive     Stereo is stripped.
Tautomer-Unsensitive   Tautomer canonicalization attempts to
                       generate a “reasonable” tautomer.
Super-Unsensitive      This parent is all of the above.
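To make the parent types a little more concrete, here is a toy sketch of three of them working directly on SMILES strings. It is not how a production pipeline would do it (that needs a cheminformatics toolkit and canonicalization): the function names are hypothetical, "largest" is approximated by a crude atom count, and the charge and tautomer parents are left out because they cannot be done with string edits.

```python
import re

# Bracket atoms, two-letter halogens, then one-letter organic-subset atoms.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def fragment_parent(smiles: str) -> str:
    """Keep the dot-separated fragment with the most atoms (first wins on ties)."""
    return max(smiles.split("."), key=lambda frag: len(_ATOM.findall(frag)))


def isotope_parent(smiles: str) -> str:
    """Drop isotope labels, e.g. the 13 in [13CH4]; ring-closure digits are untouched."""
    return re.sub(r"\[\d+", "[", smiles)


def stereo_parent(smiles: str) -> str:
    """Strip tetrahedral (@, @@) and double-bond (slash) stereo marks."""
    return re.sub(r"[@/\\]", "", smiles)


def partial_super_parent(smiles: str) -> str:
    """Chain the three toy parents; a real super parent would also
    neutralize charges and canonicalize tautomers."""
    return stereo_parent(isotope_parent(fragment_parent(smiles)))
```

For instance, `fragment_parent("CC(=O)[O-].[Na+]")` keeps the acetate fragment and drops the sodium counter-ion, while the remaining negative charge is exactly what the charge-unsensitive parent would still need to handle.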
15. Tricky cases of generating charge-unsensitive parents
DrugBank ID: DB00152
DrugBank ID: DB00209
Currently not dealt with
16. What do we use as chemical identity of the standardized records
(primary compound key)?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)
Drawbacks
• SMILES – can be too long; no accepted standard; needs to be hashed
• Standard InChI
  • does not distinguish between undefined and unknown stereo
  • by default performs some basic tautomer canonicalization
    (not needed in the new model)
  • by default assumes absolute stereo
Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• Much more sensitive to stereo description
• Fixes mobile hydrogens
• Pays attention to the chiral flag in the mol file (relative/absolute stereo)
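The "needs to be hashed" drawback applies to any long identity string, whether a canonical SMILES or a non-standard InChI. A minimal sketch of a fixed-length primary key follows; SHA-256 is used here purely as an illustration, it is not the InChIKey algorithm, and the function name is hypothetical.

```python
import hashlib


def structure_key(identity_string: str) -> str:
    """Return a fixed-length hex key for a structure identity string.

    The input is assumed to be already canonical (e.g. an isomeric canonical
    SMILES or a non-standard InChI); hashing does not canonicalize anything.
    """
    return hashlib.sha256(identity_string.encode("utf-8")).hexdigest()
```

Because the key is only as discriminating as the string fed into it, two stereoisomers get different keys only if the identity string itself encodes the stereo difference, which is the motivation for the stereo-sensitive InChI options listed above.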
17. Preliminary Data Flow
SDF file → Split into chunks → Parallel processing:
Standardize → Generate Parents → Upload to DB (optional)
Moving forward to Hadoop-based processing
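The flow above could be prototyped along these lines before moving to Hadoop. This is a sketch under assumptions: `standardize` and `generate_parents` are caller-supplied placeholders for the real steps, the database upload is omitted, and a thread pool stands in for real parallelism (for CPU-bound work a process pool or a cluster would be the natural substitute).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, List

SDF_DELIMITER = "$$$$"   # line that terminates each record in an SDF file


def split_sdf(text: str) -> List[str]:
    """Split SDF text into records, dropping the delimiter lines."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == SDF_DELIMITER:
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):   # trailing record without delimiter
        records.append("\n".join(current))
    return records


def chunked(records: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def run_pipeline(text: str,
                 standardize: Callable[[str], Any],
                 generate_parents: Callable[[Any], Any],
                 chunk_size: int = 1000,
                 workers: int = 4) -> List[Any]:
    """Standardize each record, then generate parents, one chunk per task."""
    def process(chunk: List[str]) -> List[Any]:
        return [generate_parents(standardize(record)) for record in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process, chunked(split_sdf(text), chunk_size))
        return [item for chunk in results for item in chunk]
```

`pool.map` returns results in input order, so the output list lines up with the order of records in the SDF file.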