The document discusses using automated workflow tools like KNIME and Semantic Web technologies to improve the accuracy of chemical and pharmaceutical data on Wikipedia. It describes generating a report of all chemicals and drugs on Wikipedia by extracting data from the Chembox and Drugbox templates, and then using SPARQL queries on Wikidata to compile the information more easily in a standard format. The goal is to allow subject matter experts to review and correct the data to link it unambiguously to the FDA's Substance Registration System.
Improving the chemistry content of Wikipedia using workflow tools
1. Using Automated Workflow Tools to Improve Wikipedia
MITCH MILLER
SCIENTIFIC THINKING
VERMONT CODE CAMP 2016
SEPTEMBER 17, 2016
2. Disclaimer
This talk represents my opinion and personal experience using software
systems developed by third parties
The software systems shown are very complex and have hundreds of
components. I have only worked with a small number.
Every task shown today can be accomplished in multiple ways. I’m only
showing some of those ways.
3. Overview
Introduction: how are we improving Wikipedia? Why are we doing this?
The list of information we need to compile
First method of generating the list
The second method of generating the list
The third method of generating the list
4. What chemistry does Wikipedia contain?
9,736 articles with the Chembox; 5,656 with the Drugbox (15,392 total)
[source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]
Chembox? Drugbox?
Templates of selected content within Wikipedia articles
Contents of Chembox:
Molecular structure image
Name (systematically assigned name + synonyms)
Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG,
PubChem, SMILES, UNII…
Key properties
5. Chemical identifiers
Different specific databases
Individual IDs have strengths and weaknesses
The UNII is a non-proprietary, free, unique, unambiguous, non-semantic,
alphanumeric identifier based on a substance’s composition and/or
descriptive information.
http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/
UNIIs contain 9 randomly generated alphanumeric characters plus a tenth
alphanumeric check character
When two samples have the same UNII, “they represent the same molecular
entity or elements upon which the definition is based.”
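The ten-character shape described above can be checked with a short sketch. It tests the format only. The rule for computing the check character is not given in this talk, so it is not verified here, and the restriction to uppercase letters and digits is an assumption. The function name `looks_like_unii` is made up for illustration.

```python
import re

# Format only: ten characters drawn from uppercase letters and digits.
# Assumption: UNIIs are written in uppercase. The check-character rule is
# not covered in the talk, so this sketch does not try to verify it.
UNII_PATTERN = re.compile(r"[A-Z0-9]{10}")


def looks_like_unii(value: str) -> bool:
    """Return True if the value has the ten-character shape of a UNII."""
    return UNII_PATTERN.fullmatch(value.strip().upper()) is not None


if __name__ == "__main__":
    # A UNII-shaped string, a CAS number, and a string that is one character short.
    for candidate in ["3K9958V90M", "64-17-5", "3K9958V90"]:
        print(candidate, looks_like_unii(candidate))
```

A check like this only catches obviously malformed entries, such as a CAS number pasted into the UNII field. It says nothing about whether the UNII belongs to the substance.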
6. SRS group goal
Manages Substance Registration System (SRS)
Assure uniformity of UNII assignments across internet resources that
reference UNIIs
7. The assignment
Generate a report of all chemicals and drugs in Wikipedia
Name, UNII (when present), CAS (when present), Wikipedia URL
Idea: subject matter experts will review list and correct assignments, add
new UNIIs to Wikipedia as needed
Result: more accurate Wikipedia that links to the FDA’s Substance
Registration System unambiguously
https://fdasis.nlm.nih.gov/srs/srs.jsp
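One possible layout for the report is a CSV file with the four columns listed above. This is a minimal sketch, not the format actually delivered. The column names and the `write_report` helper are assumptions, and the second sample record is hypothetical.

```python
import csv
import io

# Column order follows the report described in the assignment.
# The exact column names are an assumption made for this sketch.
FIELDS = ["name", "unii", "cas", "wikipedia_url"]


def write_report(records, out):
    """Write records (dicts) as CSV; a missing UNII or CAS becomes an empty cell."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


if __name__ == "__main__":
    sample = [
        {"name": "Ethanol", "unii": "3K9958V90M", "cas": "64-17-5",
         "wikipedia_url": "https://en.wikipedia.org/wiki/Ethanol"},
        # Hypothetical record with no identifiers assigned yet.
        {"name": "Example compound",
         "wikipedia_url": "https://en.wikipedia.org/wiki/Example"},
    ]
    buffer = io.StringIO()
    write_report(sample, buffer)
    print(buffer.getvalue())
```

Leaving absent identifiers as empty cells keeps the "when present" rule visible to reviewers, who can sort on the blank column to find articles that still need a UNII.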
8. Development tool: KNIME
Graphic, component based programming environment
Drag functional components from palette onto canvas to create program
Configure most components by setting parameters
Connect components to route data from one to another
Run and observe data traveling down the lines
KNIME stands for KoNstanz Information MinEr
Pronounced “Nighm”
Originally developed at the University of Konstanz, Germany, starting in 2004
Currently produced by KNIME.com AG, a company in Zurich, Switzerland
Free version available for download
Windows, Linux, Mac
9. First method of report generation
Read list of pages with each infobox
E.g.,
https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0
Retrieve each individual page mentioned in the list
Parse HTML
Use XPath to get Name, CAS, UNII
The Infobox templates lead to pages with defined structure – straightforward to
parse
Format data for output
Write to a file
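The parse step can be sketched outside KNIME as follows. This is not the workflow used in the talk. It uses Python's standard-library ElementTree, which supports a subset of XPath, on a hypothetical, simplified infobox snippet. Real Wikipedia HTML is larger, nests links inside cells, and is not guaranteed to be well-formed XML, which is part of why this step was cumbersome.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified infobox markup used only for illustration.
SAMPLE_INFOBOX = """
<table class="infobox">
  <caption>Ethanol</caption>
  <tr><th>CAS Number</th><td>64-17-5</td></tr>
  <tr><th>UNII</th><td>3K9958V90M</td></tr>
</table>
"""


def extract_fields(markup: str) -> dict:
    """Pull name, CAS and UNII out of an infobox-like table using XPath."""
    root = ET.fromstring(markup)

    def cell(label):
        # Find the row whose header text equals the label, then take its data cell.
        node = root.find(f".//tr[th='{label}']/td")
        return "".join(node.itertext()).strip() if node is not None else None

    return {
        "name": (root.findtext("caption") or "").strip(),
        "cas": cell("CAS Number"),
        "unii": cell("UNII"),
    }


if __name__ == "__main__":
    print(extract_fields(SAMPLE_INFOBOX))
```

Returning `None` for an absent row keeps "not present" distinct from an empty string, which matters when the report has to say whether an article has a UNII at all.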
10. First method: pluses/minuses
Plus: it works
Minus: had to run in batches to get all records
Minus: XPath parsing was more cumbersome than expected
Minus: misses some data
11. The Semantic Web
A connected set of data resources that can be understood by machines
Data encoded in a standard way that allows unattended processors to
traverse links from one entity to another across organizational and
geographic boundaries
[Standard WWW is a web of documents meant to be understood by
humans]
Tim Berners-Lee has a great TED talk on the semantic web
https://www.youtube.com/watch?v=OM6XIICm_qo
12. Understand Semantic Web in comparison to WWW
Compare pages on same subject:
Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol
Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153
13. Technological foundations of Semantic Web
RDF – Resource Description Framework – organizing facts as
Subject – Predicate – Object
Conceptual example:
[Ethanol] [has a boiling point] [173 degrees Fahrenheit]
Coded example:
wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .
Represented in Turtle - Terse RDF Triple Language
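The subject, predicate, object idea can be sketched in a few lines of Python. This is an illustration of the data model only, not an RDF library. The `Triple` type and the `to_turtle` and `match` helpers are hypothetical names, and the prefixes `wd:` and `wdt:` are assumed to be declared elsewhere.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def to_turtle(triple: Triple) -> str:
    """Render one triple as a Turtle-style statement (prefixes assumed declared)."""
    return f"{triple.subject} {triple.predicate} {triple.obj} ."


def match(triples, subject: Optional[str] = None, predicate: Optional[str] = None):
    """Return the triples matching the given parts; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
    ]


if __name__ == "__main__":
    facts = [
        Triple("wd:Q153", "wdt:P2102", '"173±1 degree Fahrenheit"'),  # boiling point
        Triple("wd:Q153", "wdt:P231", '"64-17-5"'),  # CAS registry number
    ]
    for fact in match(facts, predicate="wdt:P231"):
        print(to_turtle(fact))
```

Matching with wildcards is, in miniature, what a SPARQL triple pattern does: fixed parts are given, and unknown parts are left open to be filled in by the data.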
14. SPARQL
Query language for RDF data
SPARQL Protocol and RDF Query Language
Similar to SQL
Syntax based on the RDF triple
15. Wikidata
Conceptually: semantic web version of Wikipedia
Take that with a grain of salt
“Free and open knowledge base that can be read and edited by both
humans and machines.”
Designed as ‘central storage’ for Wikipedia and other Wikimedia projects
Approximately: programmatic interface to Wikipedia
See https://query.wikidata.org/
Run the example queries
16. Second method
Search Wikidata programmatically for chemical information
Wikidata SPARQL interface
Format list
Write file
17. SPARQL for chemical and pharmaceutical compounds
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
# All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
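A query like this can also be sent from a script rather than from the web form. The sketch below is a standard-library Python illustration, not the KNIME workflow from the talk. It assumes the public endpoint at https://query.wikidata.org/sparql accepts the query in a `query` parameter and returns the standard SPARQL JSON results layout when asked for `format=json`. The User-Agent string and the helper names are made up.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_url(query: str) -> str:
    """Encode the SPARQL text as a GET request that asks for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def bindings_to_rows(result: dict) -> list:
    """Flatten SPARQL JSON results into dicts; unbound OPTIONAL values become None."""
    variables = result["head"]["vars"]
    return [
        {var: binding[var]["value"] if var in binding else None for var in variables}
        for binding in result["results"]["bindings"]
    ]


def run_query(query: str) -> list:
    """Send the query over HTTP (needs network access; not exercised here)."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "chem-report-sketch/0.1 (example)"},  # hypothetical
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return bindings_to_rows(json.load(response))


if __name__ == "__main__":
    # Illustrative response fragment in the SPARQL JSON results layout.
    sample = {
        "head": {"vars": ["compound", "cas", "unii"]},
        "results": {"bindings": [
            {"compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"},
             "cas": {"type": "literal", "value": "64-17-5"}},
        ]},
    }
    print(bindings_to_rows(sample))
```

Because every clause except the class membership is OPTIONAL, many rows come back with some variables unbound. Mapping those to `None` keeps each row the same shape for the formatting step.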
18. Second method: pluses/minuses
Plus: fast and easy!
Plus: data arrives in a format we can use, with no parsing!
Minus: *some* Wikidata data does not match up with Wikipedia!
19. Third method
Hybrid approach
Use Wikidata SPARQL query to get list of chemicals
Query Wikipedia for individual items to compare values
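The comparison step can be sketched as a field-by-field check between the two sources. This is a minimal illustration, not the workflow used in the talk. The record layout (dicts keyed by `title`, `cas` and `unii`) and the function name `find_mismatches` are assumptions, and the sample values are invented to show a disagreement.

```python
def find_mismatches(wikidata_rows, wikipedia_rows, fields=("cas", "unii")):
    """Compare records that share an article title and list the fields that differ.

    Each result is (title, field, wikidata_value, wikipedia_value).
    """
    wikipedia_by_title = {row["title"]: row for row in wikipedia_rows}
    mismatches = []
    for wd_row in wikidata_rows:
        wp_row = wikipedia_by_title.get(wd_row["title"])
        if wp_row is None:
            # Wikidata lists the item, but no matching article record was retrieved.
            mismatches.append((wd_row["title"], "article", wd_row["title"], None))
            continue
        for field in fields:
            if wd_row.get(field) != wp_row.get(field):
                mismatches.append(
                    (wd_row["title"], field, wd_row.get(field), wp_row.get(field))
                )
    return mismatches


if __name__ == "__main__":
    # Invented sample: the Wikipedia record is missing the UNII that Wikidata has.
    from_wikidata = [{"title": "Ethanol", "cas": "64-17-5", "unii": "3K9958V90M"}]
    from_wikipedia = [{"title": "Ethanol", "cas": "64-17-5", "unii": None}]
    for item in find_mismatches(from_wikidata, from_wikipedia):
        print(item)
```

A list of disagreements, rather than a merged record, leaves the decision about which source is right to the subject matter experts.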
20. Conclusion
Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with
the required data
Subject matter experts are in the process of updating Wikipedia
Semantic web technology made the job easier!
Thank you!
21. References
Scholarly article on KNIME and Pipeline Pilot
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/
KNIME
www.knime.org
Wikipedia
https://en.wikipedia.org/wiki/Template:Chembox
https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox
Wikidata: https://query.wikidata.org
22. Who is your speaker?
Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience
Independent consultant: Scientific Thinking, LLC
mitch.miller@thinkscience.us
Some recent projects
Ongoing custodian of one chemical database implementation for ChemIDplus
project within the National Library of Medicine
Reporting systems
Web service to link collaborative object management system to reporting
system
Import wizard for chemical array designer
Merged a set of chemical databases and harmonized data