SlideShare a Scribd company logo
1 of 1
Chemicalize.org: A uniquely user-selected PubChem source of structures extracted from text
Christopher Southan, TW2Informatics, Göteborg, Sweden, Andras Stracz, Daniel Bonniot de Ruisselet and Ferenc Csizmadia,
ChemAxon Kft, 1031 Budapest, Hungary. cdsouthan@hotmail.com astracz@chemaxon.com

The chemicalize.org open web application recognizes different types of chemical names in any text source and
converts them into structures (the example of a new Wikipedia antimalarial lELQ-300 is shown on the right). It can
thus both disinter and connect between millions of structures from their “tombs” of patents, papers, abstracts or
web pages. The service has generated a searchable database of ~300,000 unique structures from user results since
2009. The Webpage Viewer and Document Viewer save visited URLs along with the extracted structures. Because
the philosophy of chemicalize.org is to make chemistry more accessible this archive is deposited into PubChem
and updated. It is unique in being the only source derived from user-selected content (i.e. documents and URLs
are actively chosen, typically via the protein target and/or disease indication). The accumulated structures are thus
”collectively crowd-sourced” and link back to chemicalize.org data pages. These include predicted properties such
as pKa, logP/D, all of which can be downloaded along with SDF, SMILES, IUPAC names, and InChI. The figures
below give an introduction to the functionality and exploitation synergies with PubChem.




Figure 1. The left hand picture shows the chemicalize.org result statistics on the ChemAxon
side. On the right you can see a breakdown of these user-selected sources by first-access to
structures. Notably this is ~ 70% patent full-text sites but also 8% from Wikipedia pages.
                                                                                                                                                    Figure 2 shows the connectivity utility via a
The overall statistics associated with the PubChem content are as follows. After                                                                    unique CID from PubChem (top left) and part
                                                                                                                                                    of the linked chemicalize.org record (top
filtration and submission processing, 300,987 chemicalize.org structure
                                                                                                                                                    right). In turn, this links directly to the open
records were merged into 297,083 CIDs, with a ranking of 27th by source size.                                                                       patent URL selected by the original user
Of these CIDs 80% are independently confirmed by at least one other                                                                                 (left). Using just one click any user can re-
                                                                                                                                                    run the complete document extraction in
submitting source in PubChem suggesting a high quality of extraction by
                                                                                                                                                    chemicalize.org and download over 1,900
chemicalize.org. The remaining 62,032 are unique CIDs. While 8,964 of these                                                                         structures.
are mixtures, the components of which may be pre-existing CIDs, it is clear that
chemicalize.org users are finding novel structures in their selected sources.


                                                                                                    Figure 3 is a comparative analysis of the chemicalize.org compounds in PubChem. The
                                                                                                    Lipinski rule-of-five content at 71% is high (c.f. 59% for ChEMBL as a comparison to active
                                                                                                    compounds from papers). This indicates the user-focus on drug-like space. Using a more
                                                                                                    lead-like filter by restricting to a Mw range of 250-800, drops this to 38% (c.f. 58% for
                                                                                                    ChEMBL). Unsurprisingly, since patents are the major user-selected source by document
                                                                                                    origin (as shown in fig.1), the structure overlap is 69% for patent sources (SureChem,
                                                                                                    IBM,SCRIPDB and Thomson Pharma) with 52% for SureChem being the largest patent-
                                                                                                    only source. Note that coverage of DrugBank is high (60%) but for ChEMBL is low(3%).
                                                                                                    The likely reason for the latter is that, unlike patents, there are few medicinal chemistry
                                                                                                    journals with open text URLs that users can run for extractions. Using in the PubChem
                                                                                                    query interface the chemicalize.org CIDs can in intersected with any sources or property
                                                                                                    filters. Note that even if the CID was pre-existing the URL and or document links (as shown
                                                                                                    in fig. 2) may link, via chemicalize.org, to information or data that is not in the other
                                                                                                    PubChem source links




  Figure 4 shows the utility both as an open sharing option and an indirect submission route to PubChem. Compounds tested in an Open Source Drug Discovery project (OSDD), for
  antimalarial activity in this case, have been initially extracted from the team’s open web pages that are linked to assay results. Of the 15 structures, 6 in PubChem, including a low-nM lead
  from the project (CID 57515644 = OSM-S-38) originate uniquely from chemicalize.org, thus becoming globally searchable within the PubChem CID space many months in advance of
  publication. Note also that two additional links are from personal websites as another global surfacing option. The InChIKey generated by chemicalize.org is effective highly-specific for
  Google searching and connecting to databases and websites of all types. Quote “In NIH’s view, all data should be considered for data sharing.”



 References and additional information
 http://www.chemicalize.org/
 http://www.chemaxon.com/blog/chemicalize-orgs-crowdsourced-database
 http://cdsouthan.blogspot.se/2013/01/chemicalizeorg-from-chemaxon-in-pubchem.html
 http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=57515644 the antimalarial lead from fig.4
 Southan C. & Stracz A., 2013, Extracting and Connecting Chemical Structures from text Sources using Chemicalize.org. J. Chem. Inform., in press (open access)
 Southan C. 2013 InChI in the wild: an assessment of InChIKey searching in Google. J Chem. Inform. Feb;5(1):10. doi: 10.1186/1758-2946-5-10 (open access)
 Swain M. 2012, chemicalize.org, J. Chem. Inf. Model. 52, No. 2.

More Related Content

What's hot

Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCPChris Southan
 
Patent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEsPatent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEsChris Southan
 
GtoPdb: A resource for cell-based perturbogens
GtoPdb:  A resource for cell-based perturbogensGtoPdb:  A resource for cell-based perturbogens
GtoPdb: A resource for cell-based perturbogensChris Southan
 
Opening up pharmacological space, the OPEN PHACTs api
Opening up pharmacological space, the OPEN PHACTs apiOpening up pharmacological space, the OPEN PHACTs api
Opening up pharmacological space, the OPEN PHACTs apiChris Evelo
 
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updatesThe IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updatesGuide to PHARMACOLOGY
 
Using ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyUsing ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyChris Evelo
 
Analysis with biological pathways:
Analysis with biological pathways: Analysis with biological pathways:
Analysis with biological pathways: Chris Evelo
 
Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Yasel Cruz
 
Pistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS Foundation
Pistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS FoundationPistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS Foundation
Pistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS FoundationPistoia Alliance
 
BigDataEurope - Big Data & Health
BigDataEurope - Big Data & HealthBigDataEurope - Big Data & Health
BigDataEurope - Big Data & HealthBigData_Europe
 
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...ChemAxon
 
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...open_phacts
 
2013 Open PHACTS Exemplars Poster
2013 Open PHACTS Exemplars Poster2013 Open PHACTS Exemplars Poster
2013 Open PHACTS Exemplars Posteropen_phacts
 

What's hot (17)

Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCP
 
Patent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEsPatent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEs
 
Patents in PubChem
Patents in PubChemPatents in PubChem
Patents in PubChem
 
GtoPdb: A resource for cell-based perturbogens
GtoPdb:  A resource for cell-based perturbogensGtoPdb:  A resource for cell-based perturbogens
GtoPdb: A resource for cell-based perturbogens
 
Opening up pharmacological space, the OPEN PHACTs api
Opening up pharmacological space, the OPEN PHACTs apiOpening up pharmacological space, the OPEN PHACTs api
Opening up pharmacological space, the OPEN PHACTs api
 
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updatesThe IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
 
Using ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyUsing ontologies to do integrative systems biology
Using ontologies to do integrative systems biology
 
Analysis with biological pathways:
Analysis with biological pathways: Analysis with biological pathways:
Analysis with biological pathways:
 
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...
 
Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider
Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpiderIdentification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider
Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider
 
Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244
 
Pistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS Foundation
Pistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS FoundationPistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS Foundation
Pistoia Alliance European Conference 2015 - Nick Lynch / Open PHACTS Foundation
 
BigDataEurope - Big Data & Health
BigDataEurope - Big Data & HealthBigDataEurope - Big Data & Health
BigDataEurope - Big Data & Health
 
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
 
US-EPA CompTox Chemicals Dashboard as a web-based data resource to help ident...
US-EPA CompTox Chemicals Dashboard as a web-based data resource to help ident...US-EPA CompTox Chemicals Dashboard as a web-based data resource to help ident...
US-EPA CompTox Chemicals Dashboard as a web-based data resource to help ident...
 
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
 
2013 Open PHACTS Exemplars Poster
2013 Open PHACTS Exemplars Poster2013 Open PHACTS Exemplars Poster
2013 Open PHACTS Exemplars Poster
 

Similar to Chemicalize.org: User-Selected PubChem Source of Structures from Text

The Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In PubchemThe Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In PubchemChris Southan
 
Protein database ..... of NCBI
Protein database ..... of NCBI Protein database ..... of NCBI
Protein database ..... of NCBI Alagppa University
 
PubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingPubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingSunghwan Kim
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsChris Southan
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyChris Southan
 
Chemistry Resources Science Teachers
Chemistry Resources Science TeachersChemistry Resources Science Teachers
Chemistry Resources Science TeachersMary Markland
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsSunghwan Kim
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
 
11-Big Data Application in Biomedical Research and Health Care.pptx
11-Big Data Application in Biomedical Research and Health Care.pptx11-Big Data Application in Biomedical Research and Health Care.pptx
11-Big Data Application in Biomedical Research and Health Care.pptxshikhamittal42
 
Towards semantic systems chemical biology
Towards semantic systems chemical biology Towards semantic systems chemical biology
Towards semantic systems chemical biology Bin Chen
 
Chemistry Reserach as a Social Machine
 Chemistry Reserach as a Social Machine Chemistry Reserach as a Social Machine
Chemistry Reserach as a Social MachineJeremy Frey
 
Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Chris Southan
 

Similar to Chemicalize.org: User-Selected PubChem Source of Structures from Text (20)

The Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In PubchemThe Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In Pubchem
 
Protein database ..... of NCBI
Protein database ..... of NCBI Protein database ..... of NCBI
Protein database ..... of NCBI
 
PubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingPubChem as a resource for chemical information training
PubChem as a resource for chemical information training
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
 
Chemistry Resources Science Teachers
Chemistry Resources Science TeachersChemistry Resources Science Teachers
Chemistry Resources Science Teachers
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
 
Presentation of ChemSPider at PubChem Public Meeting
Presentation of ChemSPider at PubChem Public MeetingPresentation of ChemSPider at PubChem Public Meeting
Presentation of ChemSPider at PubChem Public Meeting
 
ChemSpider as a hub for online chemical information resources
ChemSpider as a hub for online chemical information resources   ChemSpider as a hub for online chemical information resources
ChemSpider as a hub for online chemical information resources
 
Why open drug discovery needs four simple rules for licensing data and models
Why open drug discovery needs four simple rules for licensing data and modelsWhy open drug discovery needs four simple rules for licensing data and models
Why open drug discovery needs four simple rules for licensing data and models
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...
 
11-Big Data Application in Biomedical Research and Health Care.pptx
11-Big Data Application in Biomedical Research and Health Care.pptx11-Big Data Application in Biomedical Research and Health Care.pptx
11-Big Data Application in Biomedical Research and Health Care.pptx
 
Towards semantic systems chemical biology
Towards semantic systems chemical biology Towards semantic systems chemical biology
Towards semantic systems chemical biology
 
ChemSpider and How The Wisdom Of The Crowds Can Improve The Quality Of ...
ChemSpider  and How The Wisdom Of The  Crowds  Can  Improve The  Quality Of  ...ChemSpider  and How The Wisdom Of The  Crowds  Can  Improve The  Quality Of  ...
ChemSpider and How The Wisdom Of The Crowds Can Improve The Quality Of ...
 
Pallavi gupta
Pallavi guptaPallavi gupta
Pallavi gupta
 
ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...
ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...
ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...
 
Proof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptx
Proof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptxProof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptx
Proof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptx
 
Chemistry Reserach as a Social Machine
 Chemistry Reserach as a Social Machine Chemistry Reserach as a Social Machine
Chemistry Reserach as a Social Machine
 
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
 
Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases
 

More from Chris Southan

FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCPChris Southan
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityChris Southan
 
Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulationsChris Southan
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Chris Southan
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeChris Southan
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentChris Southan
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Chris Southan
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteinsChris Southan
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFERChris Southan
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databasesChris Southan
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology Chris Southan
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 posterChris Southan
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagensChris Southan
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand upChris Southan
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide TribulationsChris Southan
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRChris Southan
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology updateChris Southan
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProtChris Southan
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityChris Southan
 

More from Chris Southan (20)

FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivity
 
Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulations
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updae
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteins
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFER
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 poster
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagens
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand up
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide Tribulations
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIR
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology update
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProt
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem Connectivity
 

Chemicalize.org: User-Selected PubChem Source of Structures from Text

  • 1. Chemicalize.org: A uniquely user-selected PubChem source of structures extracted from text Christopher Southan, TW2Informatics, Göteborg, Sweden, Andras Stracz, Daniel Bonniot de Ruisselet and Ferenc Csizmadia, ChemAxon Kft, 1031 Budapest, Hungary. cdsouthan@hotmail.com astracz@chemaxon.com The chemicalize.org open web application recognizes different types of chemical names in any text source and converts them into structures (the example of a new Wikipedia antimalarial lELQ-300 is shown on the right). It can thus both disinter and connect between millions of structures from their “tombs” of patents, papers, abstracts or web pages. The service has generated a searchable database of ~300,000 unique structures from user results since 2009. The Webpage Viewer and Document Viewer save visited URLs along with the extracted structures. Because the philosophy of chemicalize.org is to make chemistry more accessible this archive is deposited into PubChem and updated. It is unique in being the only source derived from user-selected content (i.e. documents and URLs are actively chosen, typically via the protein target and/or disease indication). The accumulated structures are thus ”collectively crowd-sourced” and link back to chemicalize.org data pages. These include predicted properties such as pKa, logP/D, all of which can be downloaded along with SDF, SMILES, IUPAC names, and InChI. The figures below give an introduction to the functionality and exploitation synergies with PubChem. Figure 1. The left hand picture shows the chemicalize.org result statistics on the ChemAxon side. On the right you can see a breakdown of these user-selected sources by first-access to structures. Notably this is ~ 70% patent full-text sites but also 8% from Wikipedia pages. Figure 2 shows the connectivity utility via a The overall statistics associated with the PubChem content are as follows. After unique CID from PubChem (top left) and part of the linked chemicalize.org record (top filtration and submission processing, 300,987 chemicalize.org structure right). In turn, this links directly to the open records were merged into 297,083 CIDs, with a ranking of 27th by source size. patent URL selected by the original user Of these CIDs 80% are independently confirmed by at least one other (left). Using just one click any user can re- run the complete document extraction in submitting source in PubChem suggesting a high quality of extraction by chemicalize.org and download over 1,900 chemicalize.org. The remaining 62,032 are unique CIDs. While 8,964 of these structures. are mixtures, the components of which may be pre-existing CIDs, it is clear that chemicalize.org users are finding novel structures in their selected sources. Figure 3 is a comparative analysis of the chemicalize.org compounds in PubChem. The Lipinski rule-of-five content at 71% is high (c.f. 59% for ChEMBL as a comparison to active compounds from papers). This indicates the user-focus on drug-like space. Using a more lead-like filter by restricting to a Mw range of 250-800, drops this to 38% (c.f. 58% for ChEMBL). Unsurprisingly, since patents are the major user-selected source by document origin (as shown in fig.1), the structure overlap is 69% for patent sources (SureChem, IBM,SCRIPDB and Thomson Pharma) with 52% for SureChem being the largest patent- only source. Note that coverage of DrugBank is high (60%) but for ChEMBL is low(3%). The likely reason for the latter is that, unlike patents, there are few medicinal chemistry journals with open text URLs that users can run for extractions. Using in the PubChem query interface the chemicalize.org CIDs can in intersected with any sources or property filters. Note that even if the CID was pre-existing the URL and or document links (as shown in fig. 2) may link, via chemicalize.org, to information or data that is not in the other PubChem source links Figure 4 shows the utility both as an open sharing option and an indirect submission route to PubChem. Compounds tested in an Open Source Drug Discovery project (OSDD), for antimalarial activity in this case, have been initially extracted from the team’s open web pages that are linked to assay results. Of the 15 structures, 6 in PubChem, including a low-nM lead from the project (CID 57515644 = OSM-S-38) originate uniquely from chemicalize.org, thus becoming globally searchable within the PubChem CID space many months in advance of publication. Note also that two additional links are from personal websites as another global surfacing option. The InChIKey generated by chemicalize.org is effective highly-specific for Google searching and connecting to databases and websites of all types. Quote “In NIH’s view, all data should be considered for data sharing.” References and additional information http://www.chemicalize.org/ http://www.chemaxon.com/blog/chemicalize-orgs-crowdsourced-database http://cdsouthan.blogspot.se/2013/01/chemicalizeorg-from-chemaxon-in-pubchem.html http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=57515644 the antimalarial lead from fig.4 Southan C. & Stracz A., 2013, Extracting and Connecting Chemical Structures from text Sources using Chemicalize.org. J. Chem. Inform., in press (open access) Southan C. 2013 InChI in the wild: an assessment of InChIKey searching in Google. J Chem. Inform. Feb;5(1):10. doi: 10.1186/1758-2946-5-10 (open access) Swain M. 2012, chemicalize.org, J. Chem. Inf. Model. 52, No. 2.