Data model

CVSP

New Data Model

Standardization and Parents

• Why is there a need for a new data model?

• Details of the proposed new data model

• Concept of Substance and Compound

• Standardization Workflow

• The benefit of having parents (we all know this)

Deposited Record – “Substance”

A unique Substance Identifier (SID) is assigned to each
deposited record (a record is identified by the combination of Data
Source and the depositor’s internal database registry identifier)

Benefits of keeping deposited record data (“Substances”) and
standardized record data (“Compounds”) in separate, independent
layers (the archive model):
• Depositors’ records (Substances) are preserved as they
are, with no alteration
• Depositors’ records are versioned when changes occur.
Only the latest version is used in links and
calculations
• The same chemical may be deposited by several depositors:
each deposit gets a different Substance ID, but all of
them are linked to the same standardized Compound
• Any record can be accepted, even one that does not produce
an InChI (e.g. plant extracts, blood samples, polymers, etc.)
• The Substance identifier (SID) is guaranteed not to change
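
The archive layer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names (SubstanceRegistry, deposit, and so on) are hypothetical and are not taken from ChemSpider or CVSP. It shows a record keyed by (data source, registry identifier), a SID that never changes, and a new version appended whenever the deposited payload changes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SubstanceVersion:
    version: int
    payload: str        # deposited record kept exactly as received (e.g. an SDF block)
    alive: bool = True


@dataclass
class Substance:
    sid: int
    data_source: str
    registry_id: str
    versions: List[SubstanceVersion] = field(default_factory=list)

    @property
    def current(self) -> SubstanceVersion:
        # Only the latest version is used in links and calculations.
        return self.versions[-1]


class SubstanceRegistry:
    """Hypothetical in-memory registry; a real system would use a database."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Substance] = {}
        self._next_sid = 1

    def deposit(self, data_source: str, registry_id: str, payload: str) -> Substance:
        key = (data_source, registry_id)
        substance = self._by_key.get(key)
        if substance is None:
            # First deposit of this record: assign a new, permanent SID.
            substance = Substance(self._next_sid, data_source, registry_id)
            self._next_sid += 1
            self._by_key[key] = substance
        elif substance.current.alive and substance.current.payload == payload:
            return substance  # nothing changed, so no new version
        # Older versions are kept untouched; a new one is appended.
        substance.versions.append(
            SubstanceVersion(len(substance.versions) + 1, payload)
        )
        return substance
```

Depositing the same registry identifier from two data sources yields two SIDs, while re-depositing a changed record under the same key keeps the SID and adds a version.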

[Diagram: Deposited Substances → Compounds → Parent Compounds.
Deposited Substances: SID 1 (SDF1) and SID 2 (SDF2), each with its own data source, synonyms and XRefs.
Compounds: CSID 1 and CSID 2, each a standardized MOL with its data sources, synonyms and XRefs.
Parents: Fragment Parent (CSID2), Charge Parent (CSID3), Isotope Parent (CSID4), Stereo Parent (CSID5), Tautomer Parent (CSID6), Super Parent (CSID7).]

What happens when standardization rules adjust?
Would that affect Substance-Compound relationships?
Would SID-CSID change?

Yep, it is possible!!
• After an occasional full re-standardization of ChemSpider we can’t
guarantee that the same standardized compound (CSID) will still be
linked to a given Substance: the mapping may change.
This, however, will not in any way affect depositors’ SIDs.

• Depositors should be encouraged to use their substance
identifiers (SIDs) when referring to ChemSpider

• We need to develop compound permalinks (URLs) that depositors
can always use to reach their up-to-date CSIDs via SIDs. In this
case, our re-standardization wouldn’t affect external references.

What happens when a depositor revokes a Substance (SID)?

• Revoking is still versioning the substance record. A new version
of the record is created with a “not alive” flag.
• Revoked substances are no longer indexed
• If no more Substances point at a Compound, the
Compound gets deleted. Otherwise, the data from the revoked
substances is removed from the compound
• If a revoked substance is re-deposited, a new version is created
with a “live” flag
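
The revocation rules above can be sketched as follows. This is a simplified illustration, not the actual implementation: Store, SubstanceRecord and their fields are hypothetical, and a Compound is reduced to the set of synonyms aggregated from its live Substances.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class SubstanceRecord:
    sid: int
    csid: int
    synonyms: Set[str]
    # One flag per version: True = "live", False = "not alive".
    alive_history: List[bool] = field(default_factory=lambda: [True])

    @property
    def alive(self) -> bool:
        return self.alive_history[-1]


class Store:
    def __init__(self) -> None:
        self.substances: Dict[int, SubstanceRecord] = {}
        self.compounds: Dict[int, Set[str]] = {}  # CSID -> aggregated synonyms

    def _rebuild_compound(self, csid: int) -> None:
        live = [s for s in self.substances.values() if s.csid == csid and s.alive]
        if not live:
            # No live Substance points at the Compound any more: delete it.
            self.compounds.pop(csid, None)
        else:
            # Keep only data contributed by live Substances.
            self.compounds[csid] = set().union(*(s.synonyms for s in live))

    def deposit(self, sid: int, csid: int, synonyms: Iterable[str]) -> None:
        rec = self.substances.get(sid)
        old_csid = csid
        if rec is None:
            self.substances[sid] = SubstanceRecord(sid, csid, set(synonyms))
        else:
            old_csid = rec.csid
            rec.csid = csid
            rec.synonyms = set(synonyms)
            rec.alive_history.append(True)  # re-deposit creates a "live" version
        for affected in {old_csid, csid}:
            self._rebuild_compound(affected)

    def revoke(self, sid: int) -> None:
        rec = self.substances[sid]
        rec.alive_history.append(False)  # revoking is itself a new version
        self._rebuild_compound(rec.csid)
```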

FDA Substance Registration System

Version 5c, 2007

• This guide is used to standardize the entry of
substances into the Food and Drug Administration
(FDA) Substance Registration System (SRS)

• The primary purpose of this guide is to prevent
duplicate entries of a single substance

• Conventions for drawing structures and for
organizing the characteristics of substances are
included

• The lack of a standardization system at the FDA gave
rise to the SRS SOP, which serves as guidelines for
curators to draw chemicals the same way and
avoid duplication in the database

Standardization – is it possible to please all
interested parties?

Depending on the area of specialization:
• Some folks may insist on neutralizing charges
while others may feel differently
• Some may think that the canonical tautomer should
always be in a specific form

• We believe that combining “mild”
standardization with parents may
be the right choice to please as many interested
parties as possible

Standardization – Step I - Organometallics
Always disconnect N, O, and F from metals:
• Example: (Ph3)Sn+ … HO-

Disconnect nonmetals (except N, O, F) from transition metals (except Hg)

Ionize free metals with carboxylic acids
• Metals: Group I and II

Whenever covalent bonds with metals are disconnected, charges are adjusted.
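
The two disconnection rules can be illustrated on a toy atom/bond list. This is a sketch under assumptions, not the CVSP implementation: the element sets are deliberately incomplete, the function names are hypothetical, and the charge adjustment (metal gains the bond order, partner loses it) is one plausible reading of "charges are adjusted". The carboxylic acid ionization rule is not covered.

```python
from typing import List, Tuple

# Illustrative, incomplete element sets (assumptions for this sketch only).
ALWAYS_DISCONNECT = {"N", "O", "F"}
TRANSITION_METALS = {"Fe", "Co", "Ni", "Cu", "Pd", "Pt", "Hg"}
OTHER_METALS = {"Li", "Na", "K", "Mg", "Ca", "Al", "Sn", "Pb"}
METALS = TRANSITION_METALS | OTHER_METALS
NONMETALS = {"C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}

Atom = Tuple[str, int]       # (element symbol, formal charge)
Bond = Tuple[int, int, int]  # (atom index, atom index, bond order)


def should_disconnect(metal: str, partner: str) -> bool:
    if metal not in METALS or partner not in NONMETALS:
        return False
    if partner in ALWAYS_DISCONNECT:
        return True  # N, O and F are always disconnected from metals
    # Other nonmetals: only from transition metals, and never from Hg.
    return metal in TRANSITION_METALS and metal != "Hg"


def disconnect_metals(atoms: List[Atom], bonds: List[Bond]):
    symbols = [symbol for symbol, _ in atoms]
    charges = [charge for _, charge in atoms]
    kept: List[Bond] = []
    for a, b, order in bonds:
        metal, partner = (a, b) if symbols[a] in METALS else (b, a)
        if should_disconnect(symbols[metal], symbols[partner]):
            charges[metal] += order    # assumed charge adjustment
            charges[partner] -= order
        else:
            kept.append((a, b, order))
    return list(zip(symbols, charges)), kept
```

For a simplified triphenyltin hydroxide (Sn bonded to O and to three ipso carbons), only the Sn–O bond is cut, giving Sn+ and O-, which matches the example on the slide.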

Standardization – Step II (CVSP only)
Tautomer Canonicalization

• In CVSP, tautomer canonicalization is a part of standardization

• In the OpenPHACTS model, tautomer canonicalization is not part of standardization.
Instead, a tautomer-insensitive (canonicalized tautomer) parent is
generated. Why is the OpenPHACTS approach different?

• Mapping different tautomers of the same family to different
standardized compounds gives better tautomer-specific annotation
mapping (e.g. tautomer-specific NMR spectra, calculated properties, etc.)
• Standardized compounds representing the same tautomeric family will have
the same tautomeric parent: the canonicalized tautomer

Standardization – Step III
Some basic InChI normalization rules were reused (~30)
• [*;H+:1]>>[*;H:1]
• [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
• [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
• etc.

FDA SRS rules added (~30)
• [n:1]=[O:2]>>[n+:1][O-:2]
• [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
• [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
• Thiopurine: [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
• etc.

CVSP standardization vs OpenPHACTS standardization

Step   CVSP                                   OpenPHACTS
1      Disconnecting metals                   Disconnecting metals
2      Canonicalizing tautomer                Omitted
3      Applying SMIRKS rules (InChI + FDA)    Applying SMIRKS rules (InChI + FDA)

PARENTS

1. Tautomer-insensitive
2. Charge-insensitive
3. Isotope-insensitive
4. Stereo-insensitive
5. Super-insensitive

Parent generation is attempted for each Compound (CSID)

Sitzmann et al., “Tautomerism in large databases”,
J. Comput. Aided Mol. Des. (2010)

Parent                  Description
Fragment-insensitive    The largest organic fragment is identified and
                        set as the fragment parent.
Charge-insensitive      An attempt is made to neutralize ionized acids
                        and bases. Expected to improve continually
                        as new cases appear.
Isotope-insensitive     Isotopes are replaced by the common weight.
Stereo-insensitive      Stereo is stripped.
Tautomer-insensitive    Tautomer canonicalization attempts to
                        generate a “reasonable” tautomer.
Super-insensitive       All of the above are applied.
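
The parent definitions in the table can be read as small transformations, with the super parent as their composition. The sketch below uses a toy molecule model (a tuple of fragments, each a tuple of atoms, with no bonds), so it is an assumption-laden illustration only: real charge neutralization also adjusts hydrogens, tautomer canonicalization needs a cheminformatics toolkit and is left as a pass-through, "organic" is taken to mean "contains carbon", and the order of composition is a guess.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Atom:
    symbol: str
    charge: int = 0
    isotope: Optional[int] = None
    stereo: Optional[str] = None


Fragment = Tuple[Atom, ...]
Molecule = Tuple[Fragment, ...]


def _map_atoms(mol: Molecule, fn: Callable[[Atom], Atom]) -> Molecule:
    return tuple(tuple(fn(atom) for atom in frag) for frag in mol)


def fragment_parent(mol: Molecule) -> Molecule:
    # Largest organic fragment; falls back to the largest fragment overall.
    # Assumes the molecule has at least one fragment.
    organic = [f for f in mol if any(a.symbol == "C" for a in f)]
    return (max(organic or list(mol), key=len),)


def charge_parent(mol: Molecule) -> Molecule:
    return _map_atoms(mol, lambda a: replace(a, charge=0))  # toy neutralization


def isotope_parent(mol: Molecule) -> Molecule:
    return _map_atoms(mol, lambda a: replace(a, isotope=None))


def stereo_parent(mol: Molecule) -> Molecule:
    return _map_atoms(mol, lambda a: replace(a, stereo=None))


def tautomer_parent(mol: Molecule) -> Molecule:
    return mol  # placeholder: canonicalization is out of scope for this toy model


def super_parent(mol: Molecule) -> Molecule:
    steps = (fragment_parent, charge_parent, isotope_parent,
             stereo_parent, tautomer_parent)
    return reduce(lambda current, step: step(current), steps, mol)
```

Applied to a toy sodium acetate carrying a 13C label and a stereo label, super_parent returns a single unlabeled, uncharged C, C, O, O fragment.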


Tricky cases of generating charge-unsensitive parents

DrugBank ID: DB00152

DrugBank ID: DB00209

Currently not dealt with

What do we use as chemical identity of the standardized records
(primary compound key)?

• Standard InChI/InChIKey (currently used in ChemSpider)
• Absolute SMILES (isomeric canonical)

Drawbacks
• SMILES: can be too long; no accepted standard; needs to be hashed
• Standard InChI:
  • does not distinguish between undefined and unknown stereo
  • does some basic tautomer canonicalization by default
    (not needed in the new model)
  • assumes absolute stereo by default
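
To make the "needs to be hashed" drawback concrete: a SMILES string of arbitrary length can be reduced to a fixed-length key with a standard hash. The sketch below uses SHA-256 from the Python standard library; the function name smiles_key is hypothetical, and the choice of SHA-256 is an assumption, not something the slides specify. Hashing only addresses length: the key is still only as canonical as the SMILES that goes in, which depends on the toolkit that produced it.

```python
import hashlib


def smiles_key(canonical_smiles: str) -> str:
    """Return a fixed-length (64 hex characters) key for a SMILES string."""
    return hashlib.sha256(canonical_smiles.encode("utf-8")).hexdigest()
```

The key length does not depend on the input, so it can serve as an indexed column even for very large molecules.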

Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• Much more sensitive to stereo description
• Fixes mobile hydrogens
• Pays attention to the chiral flag in the mol file (relative/absolute stereo)

Preliminary Data Flow

1. SDF file
2. Split into chunks
3. Parallel processing of chunks: Standardize, then Generate Parents
4. Upload to DB (optional)

Moving forward to Hadoop-based processing
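
The flow above can be sketched with the standard library alone. This is an illustration under assumptions: the function names are hypothetical, process_record stands in for "Standardize, then Generate Parents", and a thread pool is used only to keep the example short (CPU-bound chemistry would more likely use processes or, as the slide says, Hadoop). The only format fact relied on is that SDF records end with a line containing "$$$$".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def split_sdf(text: str) -> List[str]:
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    if any(line.strip() for line in current):
        records.append("\n".join(current) + "\n")  # unterminated last record
    return records


def chunked(records: List[str], size: int) -> Iterator[List[str]]:
    for start in range(0, len(records), size):
        yield records[start:start + size]


def run_pipeline(sdf_text: str, chunk_size: int,
                 process_record: Callable[[str], object],
                 workers: int = 4) -> List[object]:
    chunks = list(chunked(split_sdf(sdf_text), chunk_size))

    def process_chunk(chunk: List[str]) -> List[object]:
        return [process_record(record) for record in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so results line up with the input.
        per_chunk = list(pool.map(process_chunk, chunks))
    return [result for chunk_result in per_chunk for result in chunk_result]
```

The optional "Upload to DB" step would consume the list returned by run_pipeline.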

Thanks

We would appreciate any comments.

For comments or questions email
karapetyank@rsc.org
