Domains such as drug discovery, data science, and policy studies increasingly rely on combining complex analysis pipelines with integrated data sources to reach conclusions. A key question then arises: what are these conclusions based upon? There is thus a tension between integrating data for analysis and understanding where that data comes from (its provenance). In this talk, I describe recent work that attempts to facilitate transparency by combining provenance tracked within databases with the data integration and analytics pipelines that feed them. I discuss this with respect to use cases from public policy as well as drug discovery.
Given at: http://ccct.uva.nl/content/ccct-seminar-21-february-2014
Transparency in the Data Supply Chain
1. Transparency in the Data Supply Chain
Paul Groth (@pgroth)
Web & Media Group
Department of Computer Science
VU University Amsterdam
http://www.few.vu.nl/~pgroth
3. Outline
• Data integration for analysis
– i.e. remixing data
• The need for transparency
• Two solutions
• The future
5. Public Domain Drug Discovery Data:
Pharma are accessing, processing, storing & re-processing
[Diagram labels: Literature, Genbank, Patents, PubChem, Databases, Downloads; Data Integration; Data Analysis; Firewalled Databases; Repeat @ each company]
Why?
6. Prioritised Research Questions
Number | sum | Nr of 1 | Question
15 | 12 | 9 | All oxido-reductase inhibitors active <100nM in both human and mouse
18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
59 | 14 | 8 | Identify all known protein-protein interaction inhibitors
www.openphacts.org
7. Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse
From Mabel Loza - USC team
11. Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse
ChEMBL:
Search target Oxidoreductase: 481 targets from different species
Selection of all the oxidoreductases and filtering of bioactivities with the
criterion IC50 < 100 (no units could be selected): 11497 data obtained
Table exported to an Excel spreadsheet and manually filtered
From Mabel Loza - USC team
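The manual filtering step described on this slide could in principle be scripted. The following is a minimal sketch, assuming the bioactivities have already been exported as rows; the field names, the sample rows, and the unit conversion table are hypothetical illustrations, not the actual ChEMBL export schema. It normalises units to nM (the step the search interface could not do) and keeps only compounds below the threshold in every requested species.

```python
# Hypothetical exported rows; field names and values are illustrative only.
ROWS = [
    {"compound": "C1", "species": "human", "type": "IC50", "value": 50.0, "unit": "nM"},
    {"compound": "C1", "species": "mouse", "type": "IC50", "value": 0.08, "unit": "uM"},
    {"compound": "C2", "species": "human", "type": "IC50", "value": 20.0, "unit": "nM"},
    {"compound": "C2", "species": "mouse", "type": "IC50", "value": 5.0, "unit": "uM"},
    {"compound": "C3", "species": "human", "type": "IC50", "value": 90.0, "unit": None},
]

# Assumed conversion factors to nanomolar.
TO_NM = {"nM": 1.0, "uM": 1000.0, "mM": 1_000_000.0}


def to_nanomolar(row):
    """Return the value in nM, or None when the unit is missing or unknown."""
    factor = TO_NM.get(row["unit"])
    return None if factor is None else row["value"] * factor


def active_in_all(rows, species=("human", "mouse"), threshold_nm=100.0):
    """Compounds with an IC50 below the threshold in every listed species."""
    seen = {}
    for row in rows:
        if row["type"] != "IC50":
            continue
        nm = to_nanomolar(row)
        # Rows without a usable unit are dropped rather than guessed at.
        if nm is not None and nm < threshold_nm:
            seen.setdefault(row["compound"], set()).add(row["species"])
    return sorted(c for c, sp in seen.items() if set(species) <= sp)


print(active_in_all(ROWS))
```

Dropping rows with unknown units is a deliberate choice in this sketch: a value of 90 with no unit cannot be compared against a 100 nM threshold.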
14. Using the Power of Open PHACTS, London, 22-23 April 2013
[Architecture diagram labels: Applications; Core Platform; Linked Data API (RDF/XML, TTL, JSON); Identity Resolution Service; Identifier Management Service; "Adenosine receptor 2a"; P12374; EC2.43.4; CS4532; Semantic Workflow Engine; Domain Specific Services; Chemistry Registration, Normalisation & Q/C; Data Cache (Virtuoso Triple Store); index; Db sources with VoID and Nanopub descriptions: Public Content, Public Ontologies, Commercial, User Annotations]
17. Credits: Curt Tilmes, Peter Fox
Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.;
Zednik, S.; Zheng, J.G., "Provenance Representation for the National Climate
Assessment in the Global Change Information System," IEEE Transactions on
Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013
29. Goal
• the capability to trace back, for each query
result, the complete list of sources and how they
were combined to deliver a result.
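To make this goal concrete, here is a toy sketch of tracing lineage through a two-pattern query. It assumes each triple carries a source label (a quad); the data, predicate names, and helper functions are hypothetical. The ⊕ (alternative sources) and ⊗ (sources joined together) notation is borrowed from the provenance-polynomial literature; this is an illustration of the idea, not the implementation discussed in this talk.

```python
from collections import defaultdict

# Toy quads: (subject, predicate, object, source). All values are made up.
QUADS = [
    ("cmpd1", "inhibits", "targetA", "src:assays"),
    ("cmpd1", "inhibits", "targetA", "src:literature"),
    ("targetA", "type", "Oxidoreductase", "src:ontology"),
    ("cmpd2", "inhibits", "targetB", "src:assays"),
]


def match(quads, predicate, obj=None):
    """Return {(subject, object): set of sources} for one triple pattern."""
    hits = defaultdict(set)
    for s, p, o, src in quads:
        if p == predicate and (obj is None or o == obj):
            hits[(s, o)].add(src)
    return hits


def inhibitors_of_class(quads, cls):
    """Join 'x inhibits t' with 't type cls', keeping lineage per result."""
    inhibits = match(quads, "inhibits")
    typed = match(quads, "type", cls)
    results = {}
    for (cmpd, target), left in inhibits.items():
        right = typed.get((target, cls))
        if right:
            # One derivation: sources of the left pattern, sources of the right.
            results.setdefault(cmpd, []).append((sorted(left), sorted(right)))
    return results


def render(derivations):
    """Write derivations as a polynomial-style lineage expression."""
    parts = []
    for left, right in derivations:
        parts.append("(" + " ⊕ ".join(left) + ") ⊗ (" + " ⊕ ".join(right) + ")")
    return " ⊕ ".join(parts)


for cmpd, derivations in inhibitors_of_class(QUADS, "Oxidoreductase").items():
    print(cmpd, "<-", render(derivations))
```

Each result then carries both which sources contributed and how they were combined: alternatives within one pattern, and a join across patterns.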
30. Implement in a Graph Database at Scale
Marcin Wylot
Philippe Cudré-Mauroux
Exascale Lab
University of Fribourg
http://diuf.unifr.ch/main/xi/diplodocus
33. Test on large messy data
• Billion Triple Challenge
– Crawled from the linked open data cloud
• Web Data Commons
– RDFa, Microdata extracted from common crawl
• 115 million triples (25 GB)
• 8 Queries defined for BTC
– T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
34. External + Internal Provenance
• Unified queries over external and database
provenance
• Adapting query results based on provenance
• Performance improvements
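As an illustration of adapting query results based on provenance, the sketch below evaluates the same query with and without a restriction to trusted sources. The quads, the predicate name, and the source labels are hypothetical, and the approach shown (filtering triples by source before answering) is only one simple way to do this.

```python
# Toy quads: (subject, predicate, object, source). All values are made up.
QUADS = [
    ("cmpd1", "activeAgainst", "targetA", "src:curated"),
    ("cmpd2", "activeAgainst", "targetA", "src:crawled"),
    ("cmpd3", "activeAgainst", "targetA", "src:curated"),
    ("cmpd3", "activeAgainst", "targetA", "src:crawled"),
]


def actives(quads, target, trusted=None):
    """Compounds active against target.

    If trusted is given, only quads whose source is in that set count
    towards the answer; otherwise every source is accepted.
    """
    found = set()
    for s, p, o, src in quads:
        if p == "activeAgainst" and o == target:
            if trusted is None or src in trusted:
                found.add(s)
    return sorted(found)


print(actives(QUADS, "targetA"))
print(actives(QUADS, "targetA", trusted={"src:curated"}))
```

A compound supported by both a trusted and an untrusted source (cmpd3 here) stays in the restricted answer, because at least one acceptable derivation remains.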
37. Big Data is often lots of small data
http://www.data2semantics.org/prov-reconstruction-challenge/
38. Questions?
• More info:
– openphacts.org
– data2semantics.org
– provbook.org
– Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013
– Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013
– Marcin Wylot, Philippe Cudré-Mauroux and Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. WWW 2014