Dataset Catalogs as a Foundation for FAIR* Data

  1. Dataset Catalogs as a Foundation for FAIR* Data
     Tom Plasterer, PhD; Research & Development Information (RDI); US Cross-Science Director
     16 May 2017
     * Findable, Accessible, Interoperable and Reusable
  2. AstraZeneca FAIR Data Enablers
     • AZ-Insight and Nanopublications
     • Integrative Informatics
     • Data Catalog
     • Differential Privacy
     • URI Policy
     clinical trials, competitive intelligence, translational science
  3. FAIR Data: Data Stewardship Survey
     • 13 questions, now managed by Cambridge Healthtech Institute
  4. What controlled vocabularies and/or ontologies do you use for structuring and annotating your data and models?
     [Bar chart; axis from 0 to 90]
     Options, in slide order: BAO - BioAssay Ontology; BioPax - Biological Pathways Ontology; CL - Cell Type ontology; CHEBI - Chemical Entities of Biological…; DO/HDO - Human Disease Ontology; GO - Gene Ontology; HPO - Human Phenotype Ontology; EFO - Experimental Factors Ontology; FMA - Foundational Model of Anatomy; ICD9/10 - International Classification…; MedDRA - Medical Dictionary for…; MESH - Medical Subject; OBI - Ontology for Biomedical…; ORDO - Orphanet Rare Disease…; RxNorm; SIO - Semantic Science; SnoMed - Systematized Nomenclature…; UBERON - Uber Anatomy Ontology; Other; All Others
     Values as extracted, in slide order: 31.6, 21.1, 26.3, 26.3, 36.8, 68.4, 42.1, 21.1, 15.8, 52.6, 78.9, 63.2, 15.8, 15.8, 15.8, 10.5, 52.6, 15.8, 31.6, 0
  5. Survey Insights
     • Data Stewardship is important for the business
     • Data Stewardship is challenging
     • Metadata may no longer be considered proprietary
     • Best practices for use (governance) are not well understood
     • There is a consensus emerging around vocabulary standards
  6. What do MedI Researchers want the ability to do?
     • Gain a greater understanding of the biology of the molecular mechanisms of diseases
     • Use the human as a model organism to a greater degree
     • Discover how the microbiome is involved with human pathogenesis
     • Understand the molecular mechanisms of drug failures
     • Use patient-level clinical data to identify subphenotypes of diseases
     Integrative Informatics: A hybrid approach to integrating data for Drug Discovery @Mathew Woodwark; Pharma 2020: March 28, 2018
  7. Can MedImmune researchers do these things today?
     • Currently, data exists in file shares, on laptops, in eLN, in silos of managed systems and in unknown places
     • The level of data integration is immature and fragmented
     • Using systems biology approaches requires considerable time and effort
     • Bioinformatics groups become a bottleneck to analyzing data
     • Research scientists are not empowered to use information and knowledge to answer complex questions
     Integrative Informatics: A hybrid approach to integrating data for Drug Discovery @Mathew Woodwark; Pharma 2020: March 28, 2018
  8. Data and Datasets…
     Dublin-Core-Type (DCT): Dataset
     • A dataset is information encoded in a defined structure (for example, lists, tables, and databases), intended to be useful for direct machine processing
     Data-Catalog (DCAT): Dataset
     • A collection of data, published or curated by a single source, and available for access or download in one or more formats
     Vocabulary of Interlinked Datasets (VoID): Dataset
     • A set of RDF triples that are published, maintained or aggregated by a single provider
     Diagram labels: dct:Dataset, dcat:Dataset, void:Dataset, rdfs:subClassOf, rdfs:subClassOf
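
The subclass links on this slide can be sketched in a few lines of Python. This is only an illustration: the flattened diagram does not show which way the two rdfs:subClassOf arrows point, so the sketch assumes that dcat:Dataset and void:Dataset are both subclasses of the Dublin Core dataset class (written dct:Dataset here, as on the slide). The helper names `superclasses` and `is_a` are hypothetical.

```python
# (child, parent) pairs standing in for rdfs:subClassOf triples.
# ASSUMPTION: arrow direction is inferred, not read from the slide.
SUBCLASS_OF = {
    ("dcat:Dataset", "dct:Dataset"),
    ("void:Dataset", "dct:Dataset"),
}


def superclasses(cls, links=SUBCLASS_OF):
    """Return every class reachable from cls by following subclass links."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child, parent in links:
            if child == current and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def is_a(declared_type, target):
    """True if something typed as declared_type also counts as target."""
    return declared_type == target or target in superclasses(declared_type)


# A VoID dataset also counts as a Dublin Core dataset, but not the reverse.
print(is_a("void:Dataset", "dct:Dataset"))
print(is_a("dct:Dataset", "void:Dataset"))
```

Under that assumption, a catalog query for the general dataset class would also return records typed with either of the two more specific classes.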
  9. Dataset Catalogs: Findability Starts Here
     A Dataset Catalog is a collection of Dataset Records
     • Catalogs are needed to support FAIR (Findable) data
     • Catalogs can and should support Enterprise MDM strategies
     • Consumers can be internal or external
     Dataset Catalogs are needed so data consumers can find Datasets
     • Dataset records need sufficient metadata to support discoverability
     • Dataset terms are NOT the data instance
     Dataset Catalogs surface dataset provenance and enable data access
     Dataset Catalogs can provide datasets for multiple consumption patterns
     • Analytics readiness and fit
     • ‘Walking’ across information models
  10. Dataset Catalogs: Find me Datasets about:
      Projects, Study, Indication/Disease, Technology, Targets, Cohort, Dates, Agent, Therapeutic Area, Drugs
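
To make the "find me datasets about" idea concrete, here is a minimal sketch of a catalog as a list of dataset records that can be filtered by facet. It is not the AZ catalog implementation: the `DatasetRecord` class, the `find_datasets` helper, and all record titles and facet values are hypothetical. Note that the records hold descriptive terms only, not the data itself.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A catalog record: metadata about a dataset, not the data instance."""
    title: str
    facets: dict = field(default_factory=dict)  # facet name -> set of values


def find_datasets(catalog, **criteria):
    """Return the records whose facets contain every requested value."""
    return [
        record
        for record in catalog
        if all(
            value in record.facets.get(facet, set())
            for facet, value in criteria.items()
        )
    ]


# Hypothetical records, for illustration only.
catalog = [
    DatasetRecord("Example study A", {"indication": {"asthma"}, "technology": {"RNA-seq"}}),
    DatasetRecord("Example study B", {"indication": {"asthma"}, "technology": {"proteomics"}}),
    DatasetRecord("Example study C", {"indication": {"COPD"}, "technology": {"RNA-seq"}}),
]

hits = find_datasets(catalog, indication="asthma", technology="RNA-seq")
print([record.title for record in hits])  # → ['Example study A']
```

Each keyword argument acts as one facet from the slide (indication, technology, and so on), and combining facets narrows the result.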
  11. Best Practices: Data on the Web, Vocabulary of Interlinked Datasets
      • Dataset Descriptions for the Open Pharmacological Space: http://www.openphacts.org/specs/2012/WD-datadesc-20121019/
      • Data on the Web Best Practices: https://www.w3.org/TR/dwbp/
  12. Findable: Metadata, documentation, identifiers
      FAIRness Metrics (early draft)
      • Addresses some but not all sub-principles.
      • Nothing about how you can actually find the resource.
      • Understanding the content is not specified in this principle.
      Data FAIRNESS metrics @MichelDumontier; Linked Data CoP: May 19, 2017
  13. The Backbone: A DCAT conformant Data Catalog
      https://www.w3.org/TR/hcls-dataset/
      https://www.w3.org/TR/vocab-dcat/#vocabulary-overview
      Semantic tagging of datasets with concepts from taxonomies:
      • provides context
      • multi-dimensional & flexible
      • effective for discoverability
      • light-weight semantics
      Diagram classes: dcat:Catalog, skos:ConceptScheme, skos:Concept, dctypes:Dataset (summary), dctypes:Dataset (version), dcat:Distribution (dctypes:Dataset), void:Dataset
      Diagram properties, in extraction order: dct:title, dct:publisher <foaf:Agent>, foaf:page, void:sparqlEndpoint, dct:accrualPeriodicity, dcat:keyword, dcat:dataset, dcat:theme, void:vocabulary, dct:conformsTo, void:exampleResource, …other void properties, dcat:distribution, dcat:themeTaxonomy, dct:isVersionOf, pav:previousVersion, dct:hasPart, pav:hasCurrentVersion, dct:hasPart, dct:title, dct:publisher <foaf:Agent>, pav:version, dct:creator <foaf:Agent>, dct:created, dct:source, dct:creator <foaf:Agent>, dct:license, dct:format, pav:retrievedFrom, dct:created, pav:createdWith, dcat:accessURL, dcat:downloadURL, dct:title, dct:description, dct:publisher <foaf:Agent>
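
The three description levels named on this slide (summary, version, distribution) can be sketched as plain triples. The property and class names below are taken from the slide; how they connect follows the general pattern of the linked W3C HCLS dataset description note as I understand it, and should be read as an assumption rather than the AZ model. The `describe_dataset` helper, the `ex:` identifiers and the example.org URL are hypothetical.

```python
def describe_dataset(catalog, dataset, title, publisher, version, fmt, download_url):
    """Build summary, version and distribution level triples for one dataset."""
    versioned = f"{dataset}_v{version}"       # hypothetical naming scheme
    distribution = f"{versioned}_{fmt}"       # hypothetical naming scheme
    return [
        # Catalog level: the catalog lists the dataset.
        (catalog, "rdf:type", "dcat:Catalog"),
        (catalog, "dcat:dataset", dataset),
        # Summary level: version-independent description.
        (dataset, "rdf:type", "dctypes:Dataset"),
        (dataset, "dct:title", f'"{title}"'),
        (dataset, "dct:publisher", publisher),
        (dataset, "pav:hasCurrentVersion", versioned),
        # Version level: one specific release.
        (versioned, "rdf:type", "dctypes:Dataset"),
        (versioned, "dct:isVersionOf", dataset),
        (versioned, "pav:version", f'"{version}"'),
        (versioned, "dcat:distribution", distribution),
        # Distribution level: one concrete file or endpoint.
        (distribution, "rdf:type", "dcat:Distribution"),
        (distribution, "dct:format", f'"{fmt}"'),
        (distribution, "dcat:downloadURL", f"<{download_url}>"),
    ]


triples = describe_dataset(
    catalog="ex:catalog",
    dataset="ex:demo",
    title="Demo dataset",
    publisher="ex:some_agent",
    version="2",
    fmt="csv",
    download_url="https://example.org/demo_v2.csv",
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

Keeping the levels separate means a title or publisher is stated once at the summary level, while each release and each file format gets its own record.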
  14. Metadata Model Stack for the AZ Data Catalog
      Diagram labels: DCAT; VoID; DCTerms; RDF/S, OWL, SKOS/SKOS-XL; AZ Taxonomies; PAV; AZ DataCatalog ontology and instances for catalogs, datasets, distributions (could be further modularized later); DCMI; bdm-tech; core; bdm; internal; external; uniprot, umls, sio, chembl…; W3C and Metadata Standards; Reference Master Data
  15. Creating a Dataset Record
  16. Flexible Vocabularies and Mapping Services
      Public (Extended):
      • Indication/Disease
      • Drugs
      • Targets
      • Technology
      Internal, Organizational:
      • Therapeutic Area (business unit)
      • Project
      • Cohort
      Mapping Services & APIs:
      • Clinical Study
      • Agent
      Other:
      • Dates
  17. Validation and SHACL
      For BDM dataset: at least one technology MUST be specified
      Class diagram labels: Az:ANYDataset, az:BDMDataset, dct:Dataset, rdfs:subClassOf
      az:BDMDataset (Node Shape): sh:targetClass az:BDMDataset; sh:and
      dctypes:dataset (Node Shape): sh:targetClass dctypes:Dataset; sh:property
      az:BDMDatasetExtension (Shape): sh:property
      dct:title (Property Shape): sh:path dct:title; sh:datatype xsd:string; sh:minCount: 1; sh:maxCount: 1; …
      az:theme (Shape): sh:path dcat:theme; sh:class bdm:Technology; sh:minCount: 1; …
      Use of SHACL for Data Catalogs & Dataset Types @Heiner Oberkampf; <internal talk> April 18, 2018
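
The two property shapes on this slide can be imitated in plain Python to show what the validation checks. This is not a SHACL engine and not the AZ shapes themselves. It implements the slide's stated intent (exactly one dct:title, at least one dcat:theme that is a bdm:Technology) by counting only theme values of the required class, which is closer to a qualified value shape than to strict sh:class semantics; the sh:datatype check is left out. The `PropertyShape` class, the `validate` function, the `ex:` identifiers and `bdm:Indication` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyShape:
    path: str                          # plays the role of sh:path
    min_count: int = 0                 # plays the role of sh:minCount
    max_count: Optional[int] = None    # plays the role of sh:maxCount
    value_class: Optional[str] = None  # only values of this class are counted


# Simplified stand-ins for the dct:title and az:theme shapes on the slide.
BDM_DATASET_SHAPES = [
    PropertyShape("dct:title", min_count=1, max_count=1),
    PropertyShape("dcat:theme", min_count=1, value_class="bdm:Technology"),
]


def validate(record, types, shapes=BDM_DATASET_SHAPES):
    """Return a list of violation messages; an empty list means conformant."""
    violations = []
    for shape in shapes:
        values = record.get(shape.path, [])
        if shape.value_class is not None:
            values = [v for v in values if shape.value_class in types.get(v, set())]
        if len(values) < shape.min_count:
            violations.append(
                f"{shape.path}: needs at least {shape.min_count}, found {len(values)}"
            )
        if shape.max_count is not None and len(values) > shape.max_count:
            violations.append(
                f"{shape.path}: allows at most {shape.max_count}, found {len(values)}"
            )
    return violations


# Hypothetical concept typing and dataset records.
types = {"ex:rnaseq": {"bdm:Technology"}, "ex:asthma": {"bdm:Indication"}}
good = {"dct:title": ["Example dataset"], "dcat:theme": ["ex:rnaseq", "ex:asthma"]}
bad = {"dct:title": ["Example dataset"], "dcat:theme": ["ex:asthma"]}

print(validate(good, types))  # no violations expected
print(validate(bad, types))   # one violation: no technology among the themes
```

In a real deployment the shapes would be written in SHACL and checked by a SHACL processor; the sketch only shows the kind of rule being enforced.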
  18. Data Discoverability: Multi-phase Filtering
      • Phase 1: Data Catalog Filter
      • Phase 2: Experiment Metadata Filter
      • Phase 3: Ad hoc Analyses Filtering
      Outbound to Data Analytics, Data Science Tools
      Statistical Filtering, e.g., clinical trial with > 50 participants
      Dataset Catalog Descriptions
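
The three phases can be sketched as a chain of progressively narrower filters. This is an illustration only: the flattened slide does not show which phase the "> 50 participants" example belongs to, so placing it in phase 2 is an assumption. The function names, field names and records are hypothetical.

```python
def phase1_catalog_filter(records, theme):
    """Phase 1: shallow filter on catalog-level descriptions."""
    return [r for r in records if theme in r["themes"]]


def phase2_experiment_filter(records, min_participants):
    """Phase 2: filter on experiment metadata (ASSUMED home of the slide's example)."""
    return [r for r in records if r["participants"] > min_participants]


def phase3_adhoc_filter(records, predicate):
    """Phase 3: any ad hoc test applied before handing off to analytics tools."""
    return [r for r in records if predicate(r)]


# Hypothetical dataset descriptions.
records = [
    {"id": "ds1", "themes": {"asthma"}, "participants": 120, "has_genotypes": True},
    {"id": "ds2", "themes": {"asthma"}, "participants": 30, "has_genotypes": True},
    {"id": "ds3", "themes": {"asthma"}, "participants": 80, "has_genotypes": False},
    {"id": "ds4", "themes": {"COPD"}, "participants": 200, "has_genotypes": True},
]

step1 = phase1_catalog_filter(records, theme="asthma")
step2 = phase2_experiment_filter(step1, min_participants=50)
step3 = phase3_adhoc_filter(step2, lambda r: r["has_genotypes"])
print([r["id"] for r in step3])  # → ['ds1']
```

The cheap catalog filter runs first, so the deeper and more expensive checks only touch the datasets that survive it.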
  19. Example: Graph Model
      Named Graphs legend: catalog, BDM, inferred
      Nodes: azds:cp1071, dctypes:Dataset, CHEMBL1743039, bdm:Project, core:Project, P15509, Project Instance, owl:NamedIndividual, RIA, v2, kqsp092
      Literal: “CP1071 RDF Dataset” (dcterms:title)
      Edge labels: rdf:type, dcat:theme, dcterms:title, core:hasDrug, core:hasProject, core:hasTarget, core:hasTherapeuticArea, bdm:createdBy?, pav:hasCurrentVersion, pav:createdBy
  20. DisQover Example
  21. Dataset Catalogs: Take-aways
      DCTERMS, DCAT, VoID are nearly sufficient
      • Extend for local needs
      Public Domain Ontologies should be reused
      • Consensus is emerging around best practices and cross-mapping
      Use Multi-Phase Filtering for Shallow & Deep Questions
      • Balance what belongs in a catalog record vs. instance data
      Lots of Activity to Learn and Shape Best Practices
      • Didn’t reinvent a wheel
      R&D | RDI
  22. Thanks
      Key Influencers: David Wood, Tim Berners-Lee, Lee Harland, Jane Lomax, James Malone, Dean Allemang, Barend Mons, Carole Goble, Bernadette Hyland, Bob Stanley, Eric Little, Juan Sequeda, Michel Dumontier, John Wilbanks, Hans Constandt, Filip Pattyn, Dan Crowther, Tim Hoctor, Ian Harrow
      AZ/MedImmune Linked Data Community: David Fenstermacher, Mathew Woodwark, Rajan Desai, Nic Sinibaldi, Chia-Chien Chiang, Kerstin Forsberg, Ola Engkvist, Ian Dix, Ted Slater, Martin Romacker, Eric Neumann, Jeff Saltzman, Kathy Reinold, Nirmal Keshava, Bryan Takasaki