CVSP

   New Data Model

       Standardization and Parents
Why is there a need for a new data model?


• Details of the proposed new data model

• Concept of Substance and Compound

• Standardization Workflow

• The benefit of having parents (we all know this)
Deposited Record – “Substance”

A unique Substance Identifier (SID) is assigned to each
deposited record (a record is identified by the combination of Data
Source and the depositor’s internal database registry identifier).

Benefits of keeping deposited record data (“Substances”) and
standardized record data (“Compounds”) in separate, independent
layers (an archive model):
• Depositors’ records (Substances) are preserved as they
   are, with no alteration
• Depositors’ records are versioned when changes occur.
   Only the latest version is used in links and
   calculations
• The same chemical may be deposited by several depositors:
   each deposit gets a different SID, but all of
   them are linked to the same standardized Compound
• Any record can be accepted, even one that does not produce
   an InChI (e.g. plant extracts, blood samples, polymers, etc.)
• The Substance identifier (SID) is guaranteed not to change
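
The Substance layer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `SubstanceVersion`, `Substance` and `Archive` and the in-memory dictionary are hypothetical and are not taken from the CVSP or ChemSpider code base. The sketch shows a SID keyed by (Data Source, registry identifier), unaltered storage of the deposited structure, and versioning on re-deposit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SubstanceVersion:
    """One snapshot of a deposited record, stored exactly as deposited."""
    molfile: str
    live: bool = True


@dataclass
class Substance:
    sid: int                      # never changes once assigned
    data_source: str
    registry_id: str              # depositor's internal registry identifier
    versions: List[SubstanceVersion] = field(default_factory=list)

    @property
    def latest(self) -> SubstanceVersion:
        # Only the latest version is used in links and calculations.
        return self.versions[-1]


class Archive:
    """Hypothetical in-memory store for the Substance layer."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Substance] = {}
        self._next_sid = 1

    def deposit(self, data_source: str, registry_id: str, molfile: str) -> Substance:
        key = (data_source, registry_id)
        substance = self._by_key.get(key)
        if substance is None:
            # First deposit of this record: assign a new, permanent SID.
            substance = Substance(self._next_sid, data_source, registry_id)
            self._by_key[key] = substance
            self._next_sid += 1
        # Every deposit, new or repeated, adds a version.
        substance.versions.append(SubstanceVersion(molfile))
        return substance
```

Depositing the same (Data Source, registry identifier) pair twice yields one SID with two versions; a different depositor gets a new SID even for the same chemical.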
Deposited Substances    Compounds           Parents                     Compounds

SID 1                   CSID 1              Fragment Parent (CSID2)     CSID 2
SDF1                    Standardized MOL    Stereo Parent (CSID5)       Standardized MOL
DataSource1             DataSource1         Isotope Parent (CSID4)      DataSource3
Synonym1                DataSource2         Tautomer Parent (CSID6)     DataSource4
Synonym2                Synonym1            Charge Parent (CSID3)       Synonym4
XRef1                   Synonym2            Super Parent (CSID7)        Synonym5
                        Synonym3                                        Synonym6
SID 2                   XRef1                                           XRef3
SDF2                    XRef2                                           XRef4
DataSource2
Synonym1
Synonym3
XRef2
What happens when standardization rules change?
Would that affect Substance-Compound relationships?
Would the SID-CSID mapping change?


Yes, it is possible!
• After an occasional full re-standardization of ChemSpider we cannot
  guarantee that the same standardized compound (CSID) will be
  linked to a Substance: the mapping may change.
  This, however, will not affect depositors’ SIDs in any way.

• Depositors should be encouraged to use their substance
  identifiers (SIDs) when referring to ChemSpider

• We need to develop compound permalinks (URLs) that depositors
  can always use to reach their up-to-date CSIDs via SIDs. In that
  case, our re-standardization would not affect external references.
What happens when a depositor revokes a Substance (SID)?


• Revoking still versions the substance record: a new version
  of the record is created with a “not alive” flag.
• Revoked substances are no longer indexed
• If no Substances point to a Compound any more, the
  Compound is deleted. Otherwise, the data from the revoked
  substances is removed from the compound
• If a revoked substance is re-deposited, a new version is created
  with a “live” flag
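
The revocation rules above can be expressed as a small bookkeeping sketch. The `Registry` class and its dictionaries are hypothetical stand-ins for whatever storage the real system uses; the sketch only tracks live flags per version and which SIDs point at which CSID.

```python
from typing import Dict, List, Set


class Registry:
    """Hypothetical Substance/Compound bookkeeping for revocation."""

    def __init__(self) -> None:
        self.flags: Dict[int, List[bool]] = {}   # SID -> live flag of each version
        self.links: Dict[int, Set[int]] = {}     # CSID -> SIDs pointing at it

    def deposit(self, sid: int, csid: int) -> None:
        self.flags.setdefault(sid, []).append(True)    # new "live" version
        self.links.setdefault(csid, set()).add(sid)

    def revoke(self, sid: int) -> None:
        self.flags[sid].append(False)                  # new "not alive" version
        for csid in list(self.links):
            self.links[csid].discard(sid)              # pull the substance off the compound
            if not self.links[csid]:
                del self.links[csid]                   # no substances left: delete compound

    def is_live(self, sid: int) -> bool:
        return self.flags[sid][-1]
```

Note that revoking never deletes the substance history: the flag list only grows, which is what makes a later re-deposit a new "live" version of the same SID.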
FDA Substance Registration System

Version 5c, 2007

• This guide is used to standardize the entry of
  substances into the Food and Drug Administration
  (FDA) Substance Registration System (SRS)

• The primary purpose of this guide is to prevent
  duplicate entries of a single substance

• Conventions for drawing structures and for
  organizing the characteristics of substances are
  included

• The lack of a standardization system at FDA gave
  rise to the SRS SOP, which serves as a guideline for
  curators to draw chemicals the same way and
  avoid duplication in the database
Standardization – is it possible to please all
interested parties?

Depending on the area of specialization:
• Some folks may insist on neutralizing charges
  while others may feel differently
• Some may think that the canonical tautomer should
  always be in a specific form


• We believe that combining “mild”
  standardization with parents may
  be the right choice to please as many interested
  parties as possible
Standardization – Step I - Organometallics
• Always disconnect N, O, and F from metals
  Example: (Ph3)Sn+ … HO-

• Disconnect nonmetals (except N, O, F) from transition metals (except Hg)

• Ionize free metals with carboxylic acids
  Metals: Groups I and II

Whenever covalent bonds to metals are disconnected, charges are adjusted.
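
As an illustration of the first rule only (disconnect N, O and F from metals and adjust charges), here is a toy sketch on a hand-built atom/bond list. It assumes single bonds and heterolytic cleavage, with +1 on the metal and -1 on the heteroatom; the `Atom` class, the `METALS` subset and `disconnect_metals` are hypothetical and not the CVSP implementation, which works on real molecule objects.

```python
from dataclasses import dataclass
from typing import List, Tuple

METALS = {"Li", "Na", "K", "Mg", "Ca", "Sn", "Fe", "Cu", "Zn"}  # illustrative subset
ALWAYS_DISCONNECT = {"N", "O", "F"}


@dataclass
class Atom:
    symbol: str
    charge: int = 0


def disconnect_metals(atoms: List[Atom],
                      bonds: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Break metal-N/O/F single bonds and adjust formal charges in place."""
    kept = []
    for i, j in bonds:
        pair = (atoms[i], atoms[j])
        metal = next((a for a in pair if a.symbol in METALS), None)
        partner = next((a for a in pair if a.symbol in ALWAYS_DISCONNECT), None)
        if metal is not None and partner is not None:
            metal.charge += 1      # metal keeps no electrons from the bond
            partner.charge -= 1    # heteroatom takes the bonding pair
        else:
            kept.append((i, j))    # e.g. Sn-C bonds are left alone
    return kept


# Triphenyltin hydroxide, reduced to Sn, O and the three ipso carbons.
atoms = [Atom("Sn"), Atom("O"), Atom("C"), Atom("C"), Atom("C")]
kept = disconnect_metals(atoms, [(0, 1), (0, 2), (0, 3), (0, 4)])
```

After the call the Sn-O bond is gone, Sn carries +1 and O carries -1, matching the (Ph3)Sn+ … HO- example on the slide.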
Standardization – Step II (CVSP only)
Tautomer Canonicalization

• In CVSP, tautomer canonicalization is part of standardization

• In the OpenPHACTS model, tautomer canonicalization is not part of standardization.
  Instead, a tautomer-insensitive (canonicalized tautomer) parent is
  generated. Why is the OpenPHACTS approach different?

   • Mapping different tautomers of the same family to different
     standardized compounds gives better tautomer-specific annotation
     mapping (e.g. tautomer-specific NMR spectra, calculated properties, etc.)
   • Standardized compounds representing the same tautomeric family have
     the same tautomer parent: the canonicalized tautomer
Standardization – Step III
About 30 rules were derived from basic InChI normalization experience:
 [*;H+:1]>>[*;H:1]
 [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
 [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
 etc.

About 30 FDA SRS rules were added:
 [n:1]=[O:2]>>[n+:1][O-:2]
 [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
 [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
 Thiopurine:
 [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
 etc.
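
With roughly 60 hand-written SMIRKS rules, a quick consistency check on the atom-map numbers can catch transcription mistakes before a rule reaches a cheminformatics toolkit. The sketch below is an assumption about how such a check might look, not part of CVSP: it uses only a regular expression, does not parse SMIRKS properly, and the helper names are hypothetical.

```python
import re
from typing import Set, Tuple

# An atom-map number is the ":N" just before the closing bracket of an atom.
MAP_NUMBER = re.compile(r":(\d+)\]")


def atom_maps(smirks: str) -> Tuple[Set[int], Set[int]]:
    """Return the atom-map numbers on the reactant and product sides."""
    reactant, product = smirks.split(">>")
    left = {int(n) for n in MAP_NUMBER.findall(reactant)}
    right = {int(n) for n in MAP_NUMBER.findall(product)}
    return left, right


def dropped_atoms(smirks: str) -> Set[int]:
    """Map numbers matched on the left that no longer appear on the right."""
    left, right = atom_maps(smirks)
    return left - right


n_oxide = "[n:1]=[O:2]>>[n+:1][O-:2]"
ammonium_salt = "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]"
```

For the aromatic N-oxide rule nothing is dropped, while the ammonium carboxylate rule drops map 6, the acidic hydrogen that moves onto the nitrogen as an implicit H.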
CVSP standardization vs OpenPHACTS


CVSP                              OpenPHACTS

Standardization                   Standardization
1. Disconnecting metals           1. Disconnecting metals
2. Canonicalizing tautomer        2. (omitted)
3. Applying SMIRKS rules          3. Applying SMIRKS rules
   (InChI + FDA)                     (InChI + FDA)

                                  Parents
                                  1. Tautomer-insensitive
                                  2. Charge-insensitive
                                  3. Isotope-insensitive
                                  4. Stereo-insensitive
                                  5. Super-insensitive
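
The difference between the two standardization sequences amounts to one optional step. A minimal sketch, assuming each step is a function from structure to structure; `build_pipeline`, `run` and the flag name are hypothetical, and the tagging functions below only stand in for real chemistry so that the ordering is visible.

```python
from typing import Callable, List

Step = Callable[[str], str]


def build_pipeline(disconnect_metals: Step, canonicalize_tautomer: Step,
                   apply_smirks: Step, tautomer_in_standardization: bool) -> List[Step]:
    """Assemble the ordered standardization steps for one of the two models."""
    steps = [disconnect_metals]
    if tautomer_in_standardization:          # CVSP: yes, OpenPHACTS: no
        steps.append(canonicalize_tautomer)
    steps.append(apply_smirks)
    return steps


def run(steps: List[Step], structure: str) -> str:
    for step in steps:
        structure = step(structure)
    return structure


def tag(name: str) -> Step:
    """Stand-in step that just records that it ran."""
    return lambda s: s + "|" + name


cvsp = build_pipeline(tag("metals"), tag("tautomer"), tag("smirks"), True)
openphacts = build_pipeline(tag("metals"), tag("tautomer"), tag("smirks"), False)
```

Running both pipelines on the same input shows the tautomer step present in one and absent in the other, with metal disconnection first and the SMIRKS rules last in both.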
For each Compound (CSID), parent generation is
attempted

“Tautomerism in large databases”, Sitzmann et al.,
J. Comput. Aided Mol. Des. (2010)


Parent                 Description
Fragment-insensitive   The largest organic fragment is identified and set
                       as the fragment parent.
Charge-insensitive     An attempt is made to neutralize ionized acids
                       and bases. Expected to improve continually
                       as new cases appear.
Isotope-insensitive    Isotopes are replaced by the common-weight atoms

Stereo-insensitive     Stereo is stripped

Tautomer-insensitive   Tautomer canonicalization attempts to
                       generate a “reasonable” tautomer
Super-insensitive      This parent combines all of the above
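
Three of the parents in the table are simple enough to illustrate with string operations on SMILES. This is a deliberately crude sketch, assuming well-formed SMILES input: a real implementation would use a cheminformatics toolkit, would check that the chosen fragment is organic, and would re-canonicalize the result. The function names are hypothetical; charge and tautomer parents are not attempted here.

```python
import re

# Rough atom tokenizer: bracket atoms, two-letter halogens, organic-subset atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def fragment_parent(smiles: str) -> str:
    """Keep the dot-separated component with the most atom tokens."""
    return max(smiles.split("."), key=lambda part: len(ATOM.findall(part)))


def isotope_parent(smiles: str) -> str:
    """Drop isotope labels such as the 13 in [13CH4]."""
    return re.sub(r"\[\d+", "[", smiles)


def stereo_parent(smiles: str) -> str:
    """Strip tetrahedral (@, @@) and double-bond (slash, backslash) stereo marks."""
    return re.sub(r"[@/\\]", "", smiles)
```

For example, sodium acetate `CC(=O)[O-].[Na+]` reduces to the acetate fragment, `[13CH4]` to `[CH4]`, and L-alanine `N[C@@H](C)C(=O)O` to `N[CH](C)C(=O)O`.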
[Figure: structure diagrams labeled “standardization”]
Tricky cases of generating charge-insensitive parents

• DrugBank ID: DB00152
• DrugBank ID: DB00209

Currently not dealt with
What do we use as the chemical identity of the standardized records
(primary compound key)?

•    Standard InChI/InChIKey (currently used by ChemSpider)
•    Absolute SMILES (isomeric canonical)

Drawbacks
• SMILES: can be too long; no accepted standard; needs to be hashed
• Standard InChI
    • does not distinguish between undefined and unknown stereo
    • by default performs some basic tautomer canonicalization
      (not needed in the new model)
    • by default assumes absolute stereo

Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• much more sensitive to the stereo description
• fixes mobile hydrogens
• pays attention to the chiral flag in the mol file (relative/absolute stereo)
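
The "needs to be hashed" drawback refers to turning a variable-length identifier into a fixed-length database key. A minimal sketch of that idea, assuming SHA-256 purely for illustration; the function name `compound_key` is hypothetical and the slides do not say which hash, if any, would be used.

```python
import hashlib


def compound_key(identifier: str) -> str:
    """Fixed-length key for a SMILES or InChI string of any length."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


short_key = compound_key("C")
long_key = compound_key("C" * 5000)   # stand-in for a very long identifier
```

Both keys have the same length regardless of the size of the input, so they fit a fixed-width indexed column; the original string still has to be stored alongside, since the hash cannot be reversed.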
Preliminary Data Flow

SDF file -> Split into chunks -> Parallel processing of each chunk:
    1. Standardize
    2. Generate Parents
    3. Upload to DB (optional)

Moving forward to Hadoop-based processing
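
The preliminary flow can be sketched with the Python standard library. This assumes the usual SDF convention that each record ends with a `$$$$` line; the `standardize` and `generate_parents` callables are placeholders supplied by the caller, and a thread pool is used only to keep the example self-contained (a real deployment would more likely use separate processes or, as noted, Hadoop). The optional database upload is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def split_sdf(text: str) -> List[str]:
    """Split SDF text into records; text after the last '$$$$' is ignored."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_flow(text: str, size: int,
             standardize: Callable[[str], str],
             generate_parents: Callable[[str], str],
             workers: int = 4) -> List[str]:
    """Standardize each record, then generate parents, one chunk per task."""
    def process(chunk: List[str]) -> List[str]:
        return [generate_parents(standardize(record)) for record in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process, chunks(split_sdf(text), size)))
    # pool.map preserves chunk order, so output order matches input order.
    return [item for chunk_result in results for item in chunk_result]


demo = run_flow("a\n$$$$\nb\n$$$$\nc\n$$$$\n", 2, str.upper, lambda s: s + "!")
```

Here the stand-in steps just upper-case the record and append a marker, which is enough to show that every record passes through both stages and that order is preserved.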
Thanks


We would appreciate any comments.

For comments or questions email
karapetyank@rsc.org
