Using Automated Workflow Tools to Improve Wikipedia
MITCH MILLER
SCIENTIFIC THINKING
VERMONT CODE CAMP 2016
SEPTEMBER 17, 2016
Disclaimer
• This talk represents my opinion and personal experience using software systems developed by third parties.
• The software systems shown are very complex and have hundreds of components. I have only worked with a small number.
• Every task shown today can be accomplished in multiple ways. I’m only showing some of those ways.
Overview
• Introduction: how are we improving Wikipedia? Why are we doing this?
• The list of information we need to compile
• First method of generating the list
• Second method of generating the list
• Third method of generating the list
What chemistry does Wikipedia contain?
• 9,736 articles with the Chembox; 5,656 with the Drug box (15,392 total) [source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]
• Chembox? Drug box?
  • Templates of selected content within Wikipedia articles
• Contents of Chembox:
  • Molecular structure image
  • Name (systematically assigned name + synonyms)
  • Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG, PubChem, SMILES, UNII…
  • Key properties
Chemical identifiers
• Each identifier comes from a different, specific database
• Individual IDs have strengths and weaknesses
• The UNII is a non-proprietary, free, unique, unambiguous, non-semantic, alphanumeric identifier based on a substance’s composition and/or descriptive information.
  • http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/
• A UNII consists of 9 randomly generated alphanumeric characters followed by a tenth alphanumeric check character
• When two samples have the same UNII, “they represent the same molecular entity or elements upon which the definition is based.”
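The shape described above (ten alphanumeric characters, the last one a check character) can be screened with a simple pattern. This is a minimal sketch, assuming UNIIs use only digits and uppercase letters; the function name is hypothetical, and the check-character calculation is not reproduced here, so the test only rejects strings that cannot be UNIIs.

```python
import re

# Assumed shape: exactly 10 characters drawn from 0-9 and A-Z.
# The tenth (check) character is NOT verified in this sketch.
UNII_SHAPE = re.compile(r"[0-9A-Z]{10}")


def looks_like_unii(value: str) -> bool:
    """Return True when value has the outward shape of a UNII."""
    return UNII_SHAPE.fullmatch(value.strip().upper()) is not None


print(looks_like_unii("3K9958V90M"))  # the UNII listed for ethanol
print(looks_like_unii("64-17-5"))     # a CAS number, not a UNII
```

A screen like this is only useful for catching values pasted into the wrong field; it says nothing about whether the UNII is assigned to the right substance.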
SRS group goal
• The group manages the Substance Registration System (SRS)
• Goal: ensure uniformity of UNII assignments across internet resources that reference UNIIs
The assignment
• Generate a report of all chemicals and drugs in Wikipedia
  • Name, UNII (when present), CAS (when present), Wikipedia URL
• Idea: subject matter experts will review the list, correct assignments, and add new UNIIs to Wikipedia as needed
• Result: a more accurate Wikipedia that links unambiguously to the FDA’s Substance Registration System
  • https://fdasis.nlm.nih.gov/srs/srs.jsp
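The report layout described above (name, optional UNII, optional CAS, Wikipedia URL) maps naturally onto a CSV file. The following is a sketch only, in Python rather than the KNIME workflow used in the talk; the column names, the function name, and the second sample row are all hypothetical.

```python
import csv
import io

# Hypothetical column names for the four report fields.
FIELDS = ["name", "unii", "cas", "url"]


def write_report(records, stream):
    """Write one CSV row per record, leaving absent identifiers blank."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow({field: record.get(field) or "" for field in FIELDS})


rows = [
    {"name": "Ethanol", "unii": "3K9958V90M", "cas": "64-17-5",
     "url": "https://en.wikipedia.org/wiki/Ethanol"},
    # Placeholder row with no identifiers, to show the blank cells.
    {"name": "Example compound", "url": "https://en.wikipedia.org/wiki/Example"},
]
buffer = io.StringIO()
write_report(rows, buffer)
report_text = buffer.getvalue()
print(report_text)
```

Blank cells, rather than a marker such as "N/A", keep the file easy for reviewers to filter for entries that still need a UNII.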
Development tool: KNIME
• Graphical, component-based programming environment
  • Drag functional components from the palette onto the canvas to create a program
  • Configure most components by setting parameters
  • Connect components to route data from one to another
  • Run and observe data traveling down the lines
• KNIME stands for KoNstanz Information MinEr
  • Pronounced “Nighm”
• Originally developed at the University of Konstanz, Germany, in 2004
• Currently produced by KNIME.com AG, a company in Zurich, Switzerland
• Free version available for download
  • Windows, Linux, Mac
First method of report generation
• Read the list of pages that use each infobox
  • E.g., https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0
• Retrieve each individual page mentioned in the list
• Parse the HTML
  • Use XPath to get Name, CAS, UNII
  • The infobox templates lead to pages with a defined structure – straightforward to parse
• Format data for output
• Write to a file
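The parsing step above can be illustrated outside KNIME. The sketch below uses the limited XPath support in Python's standard `xml.etree.ElementTree` on a hand-written, well-formed snippet; the sample markup, the row labels, and the function name are assumptions for illustration and do not reproduce the actual structure of a Wikipedia page, which would need a lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a page with an infobox (not real Wikipedia HTML).
SAMPLE = """<html><body>
<h1 id="firstHeading">Ethanol</h1>
<table class="infobox">
  <tr><td>CAS Number</td><td>64-17-5</td></tr>
  <tr><td>UNII</td><td>3K9958V90M</td></tr>
</table>
</body></html>"""


def extract_identifiers(xhtml: str) -> dict:
    """Collect the page name and each two-cell infobox row as label -> value."""
    root = ET.fromstring(xhtml)
    record = {"name": root.findtext(".//h1[@id='firstHeading']")}
    for row in root.findall(".//table[@class='infobox']/tr"):
        cells = ["".join(cell.itertext()).strip() for cell in row.findall("td")]
        if len(cells) == 2:
            record[cells[0]] = cells[1]
    return record


record = extract_identifiers(SAMPLE)
print(record)
```

Rows that do not have exactly two cells are skipped silently, which is one way a scraper of this kind can miss data.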
First method: pluses/minuses
• Plus: it works
• Minus: had to run in batches to get all records
• Minus: XPath parsing was more cumbersome than expected
• Minus: misses some data
The Semantic Web
• A connected set of data resources that can be understood by machines
• Data encoded in a standard way that allows unattended processors to traverse links from one entity to another across organizational and geographic boundaries
• [Standard WWW is a web of documents meant to be understood by humans]
• Tim Berners-Lee has a great TED talk on the semantic web
  • https://www.youtube.com/watch?v=OM6XIICm_qo
Understand Semantic Web in comparison to WWW
• Compare pages on the same subject:
  • Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol
  • Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153
Technological foundations of Semantic Web
• RDF – Resource Description Framework – organizing facts as
  • Subject – Predicate – Object
• Conceptual example:
  • [Ethanol] [has a boiling point] [173 degrees Fahrenheit]
• Coded example:
  • wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .
• Represented in Turtle (Terse RDF Triple Language)
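The subject–predicate–object idea can be made concrete with a few lines of Python. This is a toy model, not an RDF library: the `Triple` type, the `match` helper, and the second fact are hypothetical, and the values are plain strings rather than typed RDF terms.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Two illustrative facts about ethanol (wd:Q153); plain strings only.
FACTS = [
    Triple("wd:Q153", "wdt:P2102", "173 degrees Fahrenheit"),
    Triple("wd:Q153", "rdfs:label", "ethanol"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return the facts that fit the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        fact for fact in facts
        if all(want is None or want == got for want, got in zip(pattern, fact))
    ]


boiling = match(FACTS, subject="wd:Q153", predicate="wdt:P2102")
print(boiling[0].obj)
```

Leaving a position as a wildcard is the same move a query language makes when it puts a variable in that position of a triple pattern.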
SPARQL
• Query language for RDF data
  • SPARQL Protocol and RDF Query Language
• Similar to SQL
• Syntax based on the RDF triple
Wikidata
• Conceptually: semantic web version of Wikipedia
  • Take this with a grain of salt
• “Free and open knowledge base that can be read and edited by both humans and machines.”
• Designed as ‘central storage’ for Wikipedia and other Wikimedia projects
• Approximately: programmatic interface to Wikipedia
• See https://query.wikidata.org/
  • Run the example queries
Second method
• Search Wikidata programmatically for chemical information
  • Wikidata SPARQL interface
• Format list
• Write file
SPARQL for chemical and pharmaceutical compounds
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
# All chemical compounds in Wikidata with, optionally, CAS registry number, formula, UNII and PubChem ID
SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
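A query like the one above can also be sent from a script rather than the web form. The sketch below, in Python rather than KNIME, builds a request URL for the Wikidata query endpoint and flattens a response in the standard SPARQL JSON results layout; the function names are hypothetical, the sample payload is hand-written rather than real query output, and the network call itself is left out so the example runs offline.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request_url(query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def flatten_bindings(payload: dict) -> list:
    """Turn SPARQL JSON results into plain dicts; unbound variables become None."""
    variables = payload["head"]["vars"]
    return [
        {name: binding.get(name, {}).get("value") for name in variables}
        for binding in payload["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results shape (not real output).
SAMPLE = {
    "head": {"vars": ["compound", "compoundLabel", "unii", "cas"]},
    "results": {"bindings": [
        {
            "compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"},
            "compoundLabel": {"type": "literal", "value": "ethanol"},
            "unii": {"type": "literal", "value": "3K9958V90M"},
            "cas": {"type": "literal", "value": "64-17-5"},
        },
        {
            # Placeholder row with no identifiers, as OPTIONAL clauses allow.
            "compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
            "compoundLabel": {"type": "literal", "value": "placeholder compound"},
        },
    ]},
}

rows = flatten_bindings(SAMPLE)
url = build_request_url("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
print(rows[0]["unii"], rows[1]["unii"])
```

Because every identifier in the query is wrapped in OPTIONAL, rows with missing values are expected, and mapping them to `None` keeps the later formatting step simple.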
Second method: pluses/minuses
• Plus: fast and easy!
• Plus: data arrives in a format we can use – no parsing!
• Minus: *some* Wikidata data does not match up with Wikipedia!
Third method
• Hybrid approach
  • Use a Wikidata SPARQL query to get the list of chemicals
  • Query Wikipedia for individual items to compare values
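The comparison step of the hybrid approach can be sketched as follows. This is an illustration in Python, not the KNIME workflow from the talk: the function name, the field names, and all sample rows are hypothetical, and matching records by name is a simplifying assumption (a real run would more likely join on the article URL).

```python
def find_mismatches(wikidata_rows, wikipedia_rows, fields=("unii", "cas")):
    """Return (name, field, wikidata_value, wikipedia_value) for each disagreement."""
    by_name = {row["name"]: row for row in wikipedia_rows}
    mismatches = []
    for wd_row in wikidata_rows:
        wp_row = by_name.get(wd_row["name"])
        if wp_row is None:
            # No Wikipedia record was retrieved for this Wikidata item.
            mismatches.append((wd_row["name"], "page", "listed", None))
            continue
        for field in fields:
            if wd_row.get(field) != wp_row.get(field):
                mismatches.append(
                    (wd_row["name"], field, wd_row.get(field), wp_row.get(field))
                )
    return mismatches


# Hypothetical sample rows; the second UNII pair is a deliberate mismatch.
wikidata = [
    {"name": "Ethanol", "unii": "3K9958V90M", "cas": "64-17-5"},
    {"name": "Compound A", "unii": "AAAAAAAAAA", "cas": "0-00-0"},
    {"name": "Compound B", "unii": None, "cas": None},
]
wikipedia = [
    {"name": "Ethanol", "unii": "3K9958V90M", "cas": "64-17-5"},
    {"name": "Compound A", "unii": "BBBBBBBBBB", "cas": "0-00-0"},
]

issues = find_mismatches(wikidata, wikipedia)
for issue in issues:
    print(issue)
```

The output is a worklist rather than a correction: deciding which source is right is left to the subject matter experts.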
Conclusion
• Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with the required data
• Subject matter experts are in the process of updating Wikipedia
• Semantic web technology made the job easier!
• Thank you!
References
• Scholarly article on KNIME and Pipeline Pilot
  • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/
• KNIME
  • www.knime.org
• Wikipedia
  • https://en.wikipedia.org/wiki/Template:Chembox
  • https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox
• Wikidata: https://query.wikidata.org
Who is your speaker?
• Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience
• Independent consultant: Scientific Thinking, LLC
• mitch.miller@thinkscience.us
• Some recent projects
  • Ongoing custodian of one chemical database implementation for the ChemIDplus project within the National Library of Medicine
  • Reporting systems
  • Web service to link a collaborative object management system to a reporting system
  • Import wizard for a chemical array designer
  • Merged a set of chemical databases and harmonized data

Improving the chemistry content of Wikipedia using workflow tools

  • 1. Using Automated Workflow Tools to Improve Wikipedia MITCH MILLER SCIENTIFIC THINKING VERMONT CODE CAMP 2016 SEPTEMBER 17, 2016
  • 2. Disclaimer  This talk represents my opinion and personal experience using software systems developed by third parties  The software systems shown are very complex and have hundreds of components. I have only worked with a small number.  Every task shown today can be accomplished in multiple ways. I’m only showing some of those ways.
  • 3. Overview  Introduction: how are we improving Wikipedia? Why are we doing this?  The list of information we need to compile  First method of generating the list  The second method of generating the list  The third method of generating the list
  • 4. What chemistry does Wikipedia contain?  9,736 articles with the Chembox; 5,656 with the Drug box (15, 392 total) [source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]  Chembox? Drug box?  Templates of selected content within Wikipedia articles  Contents of Chembox:  Molecular structure image  Name (systematically assigned name + synonyms)  Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG, PubChem, SMILES, UNII…  Key properties
  • 5. Chemical identifiers  Different specific databases  Individual IDs have strengths and weakness  The UNII is a non- proprietary, free, unique, unambiguous, non semantic, alphanumeric identifier based on a substance’s composition and/or descriptive information.  http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem- UniqueIngredientIdentifierUNII/  UNIIs contain 9 randomly generated alphanumeric characters with a tenth check alphanumeric character  When two samples have the same UNII, “they represent the same molecular entity or elements upon which the definition is based.”
  • 6. SRS group goal  Manages Substance Registration System (SRS)  Assure uniformity of UNII assignments across internet resources that reference UNIIs
  • 7. The assignment  Generate a report of all chemicals and drugs in Wikipedia  Name, UNII (when present), CAS (when present),Wikipedia URL  Idea: subject matter experts will review list and correct assignments, add new UNIIs to Wikipedia as needed  Result: more accurate Wikipedia that links to the FDA’s Substance Registration System unambiguously  https://fdasis.nlm.nih.gov/srs/srs.jsp
  • 8. Development tool: KNIME  Graphic, component based programming environment  Drag functional components from palette onto canvas to create program  Configure most components by setting parameters  Connect components to route data from one to another  Run and observe data traveling down the lines  KNIME stands for KoNstanz Information MinEr  Pronounced “Nighm”  Originally a production of the University of Konstanz, Germany 2004  Currently produced by KNIME.com AG, a company in Zurich, Switzerland  Free version available for download  Windows, Linux, Mac
  • 9. First method of report generation
      • Read list of pages with each infobox
          • E.g., https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0
      • Retrieve each individual page mentioned in the list
      • Parse HTML
      • Use XPath to get Name, CAS, UNII
      • The Infobox templates lead to pages with defined structure, so they are straightforward to parse
      • Format data for output
      • Write to a file
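The parse step above can be sketched in Python. The talk did this with KNIME nodes, so this is only an illustration of the XPath idea. The sample markup is a simplified, hypothetical stand-in for an infobox: real Wikipedia pages are not well-formed XML and use more elaborate markup, so a real run would need an HTML parser and different paths. `extract_identifiers` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page. Real article HTML is far more complex.
SAMPLE_PAGE = """
<html><body>
  <h1 id="firstHeading">Ethanol</h1>
  <table class="infobox">
    <tr><th>CAS Number</th><td>64-17-5</td></tr>
    <tr><th>UNII</th><td>3K9958V90M</td></tr>
  </table>
</body></html>
"""


def extract_identifiers(page_source):
    """Pull name, CAS and UNII out of the sample markup with XPath-style paths."""
    root = ET.fromstring(page_source.strip())
    record = {"name": root.findtext(".//h1[@id='firstHeading']")}
    # Each infobox row is assumed to hold a label cell and a value cell.
    for row in root.findall(".//table[@class='infobox']/tr"):
        label = (row.findtext("th") or "").strip()
        value = (row.findtext("td") or "").strip()
        if label == "CAS Number":
            record["cas"] = value
        elif label == "UNII":
            record["unii"] = value
    return record


print(extract_identifiers(SAMPLE_PAGE))
```

Matching on label text rather than row position tolerates infoboxes that omit or reorder fields, although it can still miss data when a label is spelled differently.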
  • 10. First method: pluses/minuses
      • Plus: it works
      • Minus: had to run in batches to get all records
      • Minus: XPath parsing was more cumbersome than expected
      • Minus: misses some data
  • 11. The Semantic Web
      • A connected set of data resources that can be understood by machines
      • Data encoded in a standard way that allows unattended processors to traverse links from one entity to another across organizational and geographic boundaries
      • [Standard WWW is a web of documents meant to be understood by humans]
      • Tim Berners-Lee has a great TED talk on the semantic web
          • https://www.youtube.com/watch?v=OM6XIICm_qo
  • 12. Understand Semantic Web in comparison to WWW
      • Compare pages on same subject:
          • Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol
          • Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153
  • 13. Technological foundations of Semantic Web
      • RDF (Resource Description Framework): organizing facts as
          • Subject - Predicate - Object
      • Conceptual example:
          • [Ethanol] [has a boiling point] [173 degrees Fahrenheit]
      • Coded example:
          • wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .
      • Represented in Turtle - Terse RDF Triple Language
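To make the subject, predicate, object idea concrete, the triple above can be held as a plain tuple and matched by pattern, which is loosely what a SPARQL query does. This is a toy sketch, not an RDF library. The second fact uses a made-up `ex:` subject and value purely for illustration, and `Triple` and `match` are hypothetical names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


FACTS = [
    # The triple from the slide: ethanol (Q153) has boiling point (P2102).
    Triple("wd:Q153", "wdt:P2102", "173±1 degree Fahrenheit"),
    # Made-up placeholder triple, only here to give the matcher a second row.
    Triple("ex:PlaceholderCompound", "wdt:P2102", "0 degree Fahrenheit"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return the facts that agree with every field that is not None."""
    return [
        t for t in facts
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "What do we know about ethanol?"
print(match(FACTS, subject="wd:Q153"))
# "Which subjects have a boiling point?"
print([t.subject for t in match(FACTS, predicate="wdt:P2102")])
```

Leaving a field as `None` plays the role of a variable such as `?compound` in a query.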
  • 14. SPARQL
      • Query language for RDF data
      • SPARQL Protocol and RDF Query Language
      • Similar to SQL
      • Syntax based on the RDF triple
  • 15. Wikidata
      • Conceptually: semantic web version of Wikipedia
          • (Take this with a grain of salt)
      • “Free and open knowledge base that can be read and edited by both humans and machines.”
      • Designed as ‘central storage’ for Wikipedia and other Wikimedia projects
      • Approximately: programmatic interface to Wikipedia
      • See https://query.wikidata.org/
          • Run the example queries
  • 16. Second method
      • Search Wikidata programmatically for chemical information
      • Wikidata SPARQL interface
      • Format list
      • Write file
  • 17. SPARQL for chemical and pharmaceutical compounds

      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      PREFIX wd: <http://www.wikidata.org/entity/>
      PREFIX wikibase: <http://wikiba.se/ontology#>
      PREFIX bd: <http://www.bigdata.com/rdf#>

      # All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
      SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas WHERE {
        ?compound wdt:P31 wd:Q11173 .
        OPTIONAL { ?compound wdt:P231 ?cas . }
        OPTIONAL { ?compound wdt:P274 ?formula . }
        OPTIONAL { ?compound wdt:P652 ?unii . }
        OPTIONAL { ?compound wdt:P662 ?pubchem . }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
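A query like the one above can also be sent from a script instead of the web form. The sketch below assumes the public endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout. The shortened query, the `User-Agent` string, the helper names, and the sample response are all illustrative, and the network call is defined but not run here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Shortened variant of the slide's query, limited to a few rows.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?compound ?cas ?unii WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
}
LIMIT 5
"""


def build_url(query):
    """Encode the query as GET parameters for the endpoint."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def flatten(results):
    """Turn SPARQL JSON bindings into plain dicts; unbound variables become None."""
    variables = results["head"]["vars"]
    return [
        {name: binding.get(name, {}).get("value") for name in variables}
        for binding in results["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network (not called in this sketch)."""
    request = urllib.request.Request(
        build_url(query), headers={"User-Agent": "example-report-script/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return flatten(json.load(response))


# Hand-written sample in the SPARQL JSON results layout, not real query output.
SAMPLE_RESPONSE = {
    "head": {"vars": ["compound", "cas", "unii"]},
    "results": {"bindings": [
        {"compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"},
         "cas": {"type": "literal", "value": "64-17-5"},
         "unii": {"type": "literal", "value": "3K9958V90M"}},
        {"compound": {"type": "uri", "value": "http://example.org/placeholder"}},
    ]},
}

print(flatten(SAMPLE_RESPONSE))
```

Because the identifiers are wrapped in `OPTIONAL`, a compound without a CAS number or UNII still appears in the results, with that variable simply absent from its binding.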
  • 18. Second method: pluses/minuses
      • Plus: fast and easy!
      • Plus: data arrives in a format we can use, with no parsing!
      • Minus: *some* Wikidata data does not match up with Wikipedia!
  • 19. Third method
      • Hybrid approach
      • Use Wikidata SPARQL query to get list of chemicals
      • Query Wikipedia for individual items to compare values
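The comparison step of the hybrid approach might look like the following. This is only a sketch of the idea, assuming each source has already been reduced to a dict of identifiers per compound. The field names, the normalization rule, and the sample values are illustrative, not the talk's actual workflow.

```python
def normalize(value):
    """Trim and uppercase an identifier; treat missing or empty values as None."""
    return (value or "").strip().upper() or None


def compare_records(wikidata, wikipedia, fields=("cas", "unii")):
    """Return {field: (wikidata_value, wikipedia_value)} for fields that differ."""
    differences = {}
    for field in fields:
        left = normalize(wikidata.get(field))
        right = normalize(wikipedia.get(field))
        if left != right:
            differences[field] = (left, right)
    return differences


# Illustrative records: the CAS numbers agree once whitespace is trimmed,
# but the Wikipedia side has no UNII.
from_wikidata = {"cas": "64-17-5", "unii": "3K9958V90M"}
from_wikipedia = {"cas": "64-17-5 ", "unii": None}

print(compare_records(from_wikidata, from_wikipedia))
```

Only the differing fields are returned, so reviewers can skip compounds where both sources already agree.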
  • 20. Conclusion
      • Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with the required data
      • Subject matter experts are in the process of updating Wikipedia
      • Semantic web technology made the job easier!
      • Thank you!
  • 21. References
      • Scholarly article on KNIME and Pipeline Pilot
          • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/
      • KNIME
          • www.knime.org
      • Wikipedia
          • https://en.wikipedia.org/wiki/Template:Chembox
          • https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox
      • Wikidata: https://query.wikidata.org
  • 22. Who is your speaker?
      • Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience
      • Independent consultant: Scientific Thinking, LLC
      • mitch.miller@thinkscience.us
      • Some recent projects
          • Ongoing custodian of one chemical database implementation for ChemIDplus project within the National Library of Medicine
          • Reporting systems
          • Web service to link collaborative object management system to reporting system
          • Import wizard for chemical array designer
          • Merged a set of chemical databases and harmonized data