Wikiconference 2016 talk Burgstaller

My talk on chemical compounds in Wikipedia and Wikidata

  1. 1. Drug and chemical compound items in Wikidata as a data source for Wikipedia infoboxes https://commons.wikimedia.org/wiki/File:Wikidata-logo-en.svg Sebastian Burgstaller-Muehlbacher, PhD User:Sebotic Twitter: @sebotic
  2. 2. Contents ● The Problem ● Introduction to Wikidata – Data model – References – Values/data types ● Gene Wiki Info Boxes - An Example solution ● Chemistry Data in Wikidata – Issues with the data – Community cleanup – Migration of Info Boxes to Wikidata
  3. 3. The Problem (with chemistry data) ● Wikipedia has ~300 different language projects ● Currently, chemistry data reside in info box parameters – Data are not reusable between language projects – Data are not machine readable – Data are hard to update automatically – Data cannot be reused for other purposes, e.g. science.
  4. 4. The solution
  5. 5. Wikidata items ● Two types of entities – Properties (Pxxxx): ● Describe the nature of a data value ● Different data types ● 2,900 different properties in Wikidata – Data items (Qxxxx): ● A set of claims or statements ● Consist of property value pairs ● 20 million items in Wikidata
  6. 6. A Wikidata Statement
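To make the statement structure concrete, here is a minimal sketch of reading the values of one property out of an item. It assumes the JSON layout the Wikidata API returns for entities (`claims`, `mainsnak`, `datavalue`). The item ID `Q0` is a placeholder, the fragment is hand-written for illustration, and `statement_values` is a hypothetical helper name, not part of any library.

```python
# Hand-written fragment shaped like the entity JSON returned by the Wikidata
# API. "Q0" is a placeholder ID; the content is illustrative, not a real lookup.
item = {
    "id": "Q0",
    "claims": {
        "P235": [  # P235 is the InChIKey property
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P235",
                    "datavalue": {
                        "value": "UEJJHQNACJXSKW-UHFFFAOYSA-N",
                        "type": "string",
                    },
                },
                "type": "statement",
                "rank": "normal",
                "references": [],  # each statement can carry its own references
            }
        ]
    },
}


def statement_values(entity, prop):
    """Return the plain values of all statements for one property ID."""
    values = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement["mainsnak"]
        # 'novalue' and 'somevalue' snaks carry no datavalue, so skip them.
        if snak["snaktype"] == "value":
            values.append(snak["datavalue"]["value"])
    return values
```

Calling `statement_values(item, "P235")` returns a list with the single InChI key stored on the item. A property with no statements yields an empty list.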
  7. 7. Wikidata Data types ● The current Wikidata data types: – String – WDItemID – External ID – MonolingualText – Property – Quantity – Time – Url – GlobeCoordinate – CommonsMedia – Mathematical formula
  8. 8. Unique Features of Wikidata ● Completely free, even for commercial usage (CC0). ● Granular: Single values with references. ● Anybody can contribute. ● Extensive item history. ● A repository for data on all domains of knowledge. ● Full integration with the semantic web. ● Essentially: A giant graph of knowledge.
  9. 9. Burgstaller-Muehlbacher, et al, Database, 2016
  10. 10. Data use case: Gene Wiki infoboxes
  11. 11. Issues with chemical data in the Wiki space ● Incorrect identifiers in info boxes or on Wikidata items ● Incorrect chemical properties ● Incorrect labels, aliases ● Incorrect isomeric forms of the compound ● Mixture of different isomeric forms
  12. 12. https://commons.wikimedia.org/wiki/File:Isomerism.svg
  13. 13. How to solve Isomerism issues? ● Make sure that the structures in Wikidata and Wikipedia are correct and consistent: – Use the InChI (International Chemical Identifier) or InChI key to determine which isomer a certain article or WD item is actually talking about.
  14. 14. What are InChIs? ● IUPAC InChI (International Chemical Identifier). ● Describes the structure of a chemical compound or substance. ● Freely usable. ● Can be computed from e.g. SMILES or MOL format. ● Does not need to be assigned by an organization.
  15. 15. What are InChI keys? ● A hashed version of an InChI (truncated SHA-256) ● Makes chemicals searchable on the Web ● Makes chemicals easily comparable ● Short, practically unique Example: UEJJHQNACJXSKW-UHFFFAOYSA-N First block (14 letters) encodes the skeleton (connectivity). Second block: the first 8 letters encode stereochemistry and isotopes, followed by a flag letter (S = standard InChI) and a version letter. Last letter encodes the protonation state (N = neutral).
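The block layout of an InChI key can be checked and split with a few lines of Python. This is a sketch that assumes the standard 27-character layout (14 letters, hyphen, 10 letters, hyphen, 1 letter). `parse_inchikey` and `same_skeleton` are hypothetical helper names introduced here for illustration.

```python
import re
from typing import NamedTuple

# 14 letters, then 8 letters + standard flag + version letter, then 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str        # hash of the connectivity
    stereo_isotope: str  # hash of stereochemistry, isotopes, etc.
    standard: bool       # True when the flag letter is 'S' (standard InChI)
    version: str         # InChI version letter
    protonation: str     # 'N' means neutral


def parse_inchikey(key):
    """Split an InChI key into its blocks; raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError("not a well-formed InChI key: %r" % key)
    skeleton, stereo, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, stereo, flag == "S", version, protonation)


def same_skeleton(key_a, key_b):
    """True when two keys share the first block, i.e. the same connectivity."""
    return parse_inchikey(key_a).skeleton == parse_inchikey(key_b).skeleton
```

Two keys for which `same_skeleton` is true but which differ in the second block are candidates for isomers of one another. That comparison is what makes the key useful for sorting out mixed-up isomeric forms.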
  16. 16. How to solve Isomerism issues? ● Make sure that the structures in Wikidata and Wikipedia are correct and consistent: – Use the InChI (International Chemical Identifier) or InChI key to determine which isomer a certain article or WD item is actually talking about. – Minimum requirement: Correct, unique InChI key on item. – Best case: Make sure all structural identifiers are correct (isomeric SMILES, canonical SMILES, InChI or InChI key). – A minimum of a correct InChI key allows for the rest of the chemical compound item to be populated by (our) bots.
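The "correct, unique InChI key on item" requirement can be checked mechanically once item IDs and their keys have been collected. The sketch below is not the bots' actual code. It assumes the input is a list of `(item_id, inchikey)` pairs gathered elsewhere, and `find_inchikey_conflicts` is a hypothetical name.

```python
from collections import defaultdict


def find_inchikey_conflicts(item_keys):
    """Find items with several InChI keys and keys shared by several items.

    item_keys: iterable of (item_id, inchikey) pairs.
    Returns (multi_key_items, shared_keys), both plain dicts with sorted lists.
    """
    keys_by_item = defaultdict(set)
    items_by_key = defaultdict(set)
    for item_id, key in item_keys:
        keys_by_item[item_id].add(key)
        items_by_key[key].add(item_id)

    # An item carrying more than one key mixes different structures.
    multi_key_items = {
        item: sorted(keys) for item, keys in keys_by_item.items() if len(keys) > 1
    }
    # A key found on more than one item points at duplicate items.
    shared_keys = {
        key: sorted(items) for key, items in items_by_key.items() if len(items) > 1
    }
    return multi_key_items, shared_keys
```

Both result dictionaries are empty when every item has exactly one key and no key is reused. Anything left in them needs human review.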
  17. 17. What has been accomplished so far? ● Discussion on WikiProject Chemistry: https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry#Wiki – General consensus that info boxes should use Wikidata – Wikidata needs to improve on data quality ● Of the 17,000 original chemical compound Wikidata items, 16,000 have been validated around an InChI key. ● More chemical data have been imported, so they are readily available for new Wikipedia articles or for correcting existing ones.
  18. 18. Things that need your attention ● I generated a list of items at Wikidata WikiProject Chemistry which need human intervention. https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Annota Please have a look at those and unify the stereochemistry and identifiers around one unique InChI key!
  19. 19. Data maintenance in Wikidata ● Our bots are written in Python (2.7 and 3.x compatible). ● Python bots keep Wikidata in sync with authoritative data sources (PubChem, ChemSpider, ChEBI, ChEMBL). ● Bots are run according to the data release cycles of the authoritative data sources. ● Mechanisms are in place for detecting inconsistencies. ● Contributions of other Wikidata users are accounted for, based on references.
  20. 20. Wikidata API and query endpoints ● Three ways to access data: – Wikidata API allows read, write and full text search. (www.wikidata.org/w/api.php) – REST endpoint for fast, direct data access. (queryr.wmflabs.org/) – Wikidata query service (WDQS) as a SPARQL endpoint for complex queries. (query.wikidata.org/)
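As an illustration of the SPARQL route, the sketch below builds a WDQS request URL that looks up items by InChI key. It only constructs the URL and sends nothing. The `/sparql` path and the `format=json` parameter are assumptions based on how the public query service is commonly called, and `inchikey_query` and `wdqs_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumed endpoint path of the Wikidata query service (the slide lists only
# the host, query.wikidata.org).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def inchikey_query(inchikey):
    """SPARQL query for all items whose InChIKey (P235) equals the given key."""
    # Only A-Z and hyphens are allowed, so the key is safe to embed in quotes.
    if not all(ch.isupper() or ch == "-" for ch in inchikey):
        raise ValueError("unexpected characters in InChI key: %r" % inchikey)
    return 'SELECT ?item WHERE { ?item wdt:P235 "%s" . }' % inchikey


def wdqs_url(query):
    """Return a GET URL that asks the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

The resulting URL can be fetched with any HTTP client. Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.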
  21. 21. Acknowledgments Andrew Su Benjamin Good Tim Putman Julia Turner Gregg Stupp (TSRI) Gang Fu Evan Bolton (NIH, PubChem) Andra Waagmeester (Micelio.be) Elvira Mitraka Lynn Schriml (Disease Ontology, U Baltimore)