The Data Today
Alasdair Gray
Heriot-Watt University, Edinburgh, UK
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Big Data Integration
Dataset                    Downloaded   Version     Licence         Triples
-------------------------  -----------  ----------  --------------  -------------
Bio Assay Ontology         -            -           CC-By           10,360
CALOHA                     8 Apr 2015   2014-01-22  CC-By-ND        14,552
ChEBI                      4 Mar 2015   125         CC-By-SA        1,012,056
ChEMBL                     18 Feb 2015  20.0        CC-By-SA        445,732,880
ConceptWiki                12 Dec 2013  -           CC-By-SA        4,331,760
DisGeNET                   31 Mar 2015  2.1.0       ODbL            15,011,136
Disease Ontology           -            2015-05-21  CC-By           188,062
DrugBank                   19 Feb 2015  4.1         Non-commercial  4,028,767
ENZYME                     -            2015_11     CC-By-ND        61,467
FDA Adverse Events         9 Jul 2012   -           CC0             13,557,070
Gene Ontology              4 Mar 2015   -           CC-By           1,366,494
Gene Ontology Annotations  17 Feb 2015  -           CC-By           879,448,347
NCATS OPDDR                Nov 2015     Oct 2015    -               2,643
neXTProt (NP)              1 Feb 2014   1.0         CC-By-ND        215,006,108
OPS Chemical Registry      4 Nov 2014   -           CC-By-SA        241,986,722
HMDB                       -            3.6         HMDB            -
MeSH                       -            2015        MeSH            -
PDB Ligands                -            2           PDB             -
OPS Metadata               -            -           CC-By-SA        2,053
UniProt                    -            2015_11     CC-By-ND        1,131,186,434
WikiPathways               -            20151118    CC-By           11,781,627
Total: ~3 Billion triples
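As a quick sanity check on the quoted total, the per-dataset counts listed above can simply be summed. The sketch below copies the figures from the table; HMDB, MeSH and PDB Ligands show no triple count there, so they are left out and the result is best read as a lower bound. The dictionary name is just an illustrative choice.

```python
# Triple counts copied from the dataset table above.
# HMDB, MeSH and PDB Ligands list no count, so they are omitted.
TRIPLES = {
    "Bio Assay Ontology": 10_360,
    "CALOHA": 14_552,
    "ChEBI": 1_012_056,
    "ChEMBL": 445_732_880,
    "ConceptWiki": 4_331_760,
    "DisGeNET": 15_011_136,
    "Disease Ontology": 188_062,
    "DrugBank": 4_028_767,
    "ENZYME": 61_467,
    "FDA Adverse Events": 13_557_070,
    "Gene Ontology": 1_366_494,
    "Gene Ontology Annotations": 879_448_347,
    "NCATS OPDDR": 2_643,
    "neXTProt (NP)": 215_006_108,
    "OPS Chemical Registry": 241_986_722,
    "OPS Metadata": 2_053,
    "UniProt": 1_131_186_434,
    "WikiPathways": 11_781_627,
}

total = sum(TRIPLES.values())
print(f"{total:,} triples listed, about {total / 1e9:.2f} billion")
```

The listed rows come to just under 3 billion, which is consistent with the rounded total on the slide.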
John Wilbanks consulted for us
A framework built around STANDARD well-understood
Creative Commons licences – and how they interoperate
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won't fit all requirements
Data Licensing (Or Lack Of!)
Disease
Tissue
Target
Compound
Pathway
STANDARD_TYPE     UNIT_COUNT
----------------  ----------
AC50              7
Activity          421
EC50              39
IC50              46
ID50              42
Ki                23
Log IC50          4
Log Ki            7
Potency           11
log IC50          0
STANDARD_TYPE  STANDARD_UNITS  COUNT(*)
-------------  --------------  --------
IC50           nM              829448
IC50           ug.mL-1         41000
IC50                           38521
IC50           ug/ml           2038
IC50           ug ml-1         509
IC50           mg kg-1         295
IC50           molar ratio     178
IC50           ug              117
IC50           %               113
IC50           uM well-1       52
~ 100 units
>5000 types
Implemented using the Quantities, Units, Dimensions and Types (QUDT) ontology (http://www.qudt.org/)
Quantitative Data Challenges
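Several of the IC50 unit strings above are only different spellings of the same unit (ug.mL-1, ug/ml, ug ml-1). A minimal sketch of the kind of normalisation this calls for is shown below. It is not the Open PHACTS implementation, which the slide says is built on QUDT: the synonym table and function names here are hypothetical, and only the counts are taken from the table.

```python
from collections import Counter

# Unit spellings and row counts for IC50, copied from the table above.
IC50_UNIT_COUNTS = {
    "nM": 829448,
    "ug.mL-1": 41000,
    "ug/ml": 2038,
    "ug ml-1": 509,
}

# Hypothetical synonym table: each spelling variant maps to one canonical label.
CANONICAL_UNIT = {
    "nM": "nM",
    "ug.mL-1": "ug/mL",
    "ug/ml": "ug/mL",
    "ug ml-1": "ug/mL",
}


def normalise_unit(raw: str) -> str:
    """Return the canonical label, or the stripped input if the spelling is unknown."""
    cleaned = raw.strip()
    return CANONICAL_UNIT.get(cleaned, cleaned)


def merge_counts(counts: dict) -> dict:
    """Add up row counts for spellings that normalise to the same unit."""
    merged = Counter()
    for unit, n in counts.items():
        merged[normalise_unit(unit)] += n
    return dict(merged)


merged = merge_counts(IC50_UNIT_COUNTS)
```

Merging spellings only fixes naming. Converting between mass-based and molar units (for example ug/mL to nM) would additionally need each compound's molecular weight, which this sketch does not attempt.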
Quality Assurance
ops:OPS437281
✔
ops:OPS380297 ops:OPS380292
is_stereoisomer_of
[ci:CHEMINF_000461]
has_stereoundefined_parent
[ci:CHEMINF_000456]
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope
…
Chemical Registration Service Data
Mappings (raw): 25,087,328
Mappings (computed): 200,000,000+
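The slides do not say how the computed mappings are derived from the raw ones. One plausible reading, assumed here purely for illustration, is that raw mappings are treated as undirected links and every pair of identifiers in the same connected group becomes a computed mapping. The function names and sample identifiers below are hypothetical.

```python
from itertools import combinations


def connected_groups(raw_mappings):
    """Group identifiers that are linked, directly or indirectly (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in raw_mappings:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def computed_mappings(raw_mappings):
    """All unordered identifier pairs implied by the raw mappings."""
    pairs = set()
    for group in connected_groups(raw_mappings):
        pairs.update(combinations(sorted(group), 2))
    return pairs


# Hypothetical raw mappings between identifiers from different sources.
raw = [("a", "b"), ("b", "c"), ("d", "e")]
implied = computed_mappings(raw)
```

Under this assumption a group of n co-referent identifiers yields n(n-1)/2 pairs, which is one way a few tens of millions of raw mappings could grow into hundreds of millions.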
P12047
X31045
GB:29384
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is
never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
DrugBank ChemSpider PubChem
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Are these records the same?
It depends upon your task!
skos:exactMatch
(InChI)
Strict Relaxed
Analysing Browsing
I need to perform an analysis, give me
details of the active compound in Gleevec.
skos:closeMatch
(Drug Name)
skos:closeMatch
(Drug Name)
skos:exactMatch
(InChI)
Strict Relaxed
Analysing Browsing
Which targets are known to interact
with Gleevec?
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
Identifier (URI)
Title (dct:title)
Description (dct:description)
Documentation link (dcat:landingPage)
Creator (pav:createdBy)
Timestamp (pav:createdOn)
Equivalence rules (bdb:linksetJustification)
Scientific Lens
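One way to picture a lens is as a small record plus a filter over links. The sketch below is an illustration, not the Open PHACTS implementation: the class and field names, the example URIs, and the choice of which records are linked by which justification are all hypothetical. Only the predicates (skos:exactMatch, skos:closeMatch) and the justifications (InChI, Drug Name) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str  # justification for the link, e.g. "InChI" or "Drug Name"


@dataclass
class Lens:
    uri: str
    title: str
    description: str
    justifications: set = field(default_factory=set)  # equivalence rules

    def apply(self, links):
        """Keep only the links whose justification this lens accepts."""
        return [link for link in links if link.reason in self.justifications]


# Hypothetical links between three records for Gleevec.
GLEEVEC_LINKS = [
    Link("ex:chemspider/1", "ex:drugbank/1", "skos:exactMatch", "InChI"),
    Link("ex:chemspider/1", "ex:pubchem/1", "skos:closeMatch", "Drug Name"),
    Link("ex:drugbank/1", "ex:pubchem/1", "skos:closeMatch", "Drug Name"),
]

STRICT = Lens("ex:lens/strict", "Strict", "For analysis", {"InChI"})
RELAXED = Lens("ex:lens/relaxed", "Relaxed", "For browsing", {"InChI", "Drug Name"})

strict_links = STRICT.apply(GLEEVEC_LINKS)
relaxed_links = RELAXED.apply(GLEEVEC_LINKS)
```

With these inputs the strict lens keeps only the InChI-justified link, while the relaxed lens also follows the name-based links, mirroring the Analysing versus Browsing questions about Gleevec.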
Lenses
34 in total
7 Public
25 Chemistry
2 Gene
Data Governance
Contribution must not be underestimated!!!
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
www.macs.hw.ac.uk/~ajg33/
@gray_alasdair
Open PHACTS
contact@openphacts.org
openphacts.org
@open_phacts

Open PHACTS: The Data Today

Editor's Notes

  1. Data provided by many publishers; some cover other sets, e.g. ChemSpider. Originally in many formats: relational, SD files and RDF. Worked closely with publishers, getting them to publish: raw RDF; metadata descriptions of their data; links between their data and others.
  2. ~3 billion triples; 42 GB gzipped N-Quads; 400 GB uncompressed.
  3. Getting this information is still hard and manual! ~3 billion triples; 42 GB gzipped N-Quads; 400 GB uncompressed.
  4. ~3 billion triples; 42 GB gzipped N-Quads; 400 GB uncompressed.
  5. API: complex data interactions/relationships. Interactions needed to satisfy use cases. Gradually added additional types of data and interactions.
  6. Quantitative Data Challenges: no standard units, even in curated sources! Feed issues back to data providers.
  7. Quality Assurance: Validation & Standardization Platform, developed by the Royal Society of Chemistry. http://bit.ly/NZF5VB
  8. CRS Dataset Generation. Validate structure: source data is messy! Identify common problems: charge imbalance, stereochemistry. Compute physicochemical properties. Identify related properties based on structure. 17 relationship types.
  9. 230 MB gzipped N-Quads; 2 GB uncompressed. 238 mapping sets, 43 data sources, 11 predicates.
  10. Identity Mapping
  11. Example drug: Gleevec, a cancer drug for leukemia. Looking it up in three popular public chemical databases gives different results. Chemistry is complicated, and often simplified for convenience. Data is messy! Are these records the same? It depends on what you are doing with the data! Each captures a subtly different view of the world.
  12. Structure Lens: interested in physicochemical properties of Gleevec.
  13. Name Lens: interested in biomedical and pharmacological properties. sameAs != sameAs; it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, which have a VoID header providing provenance and justification for the link.
  14. A lens enables certain relationships and disables others. It alters the links between the data.
  15. Builds on OPS document: checklist and guidance notes! Covers a wider range of use cases. Large community buy-in, including EBI.
  16. Builds on OPS document: checklist and guidance notes! Covers a wider range of use cases. Large community buy-in, including EBI.
  17. Verifying data. Verifying linkages. Investigating unexpected answers. Not to be