Behind the Scenes of KnetMiner: Towards Standardised and Interoperable Knowledge Graphs

  1. Behind the Scenes of KnetMiner: Towards Standardised and Interoperable Knowledge Graphs
     Harpenden, 3/6/2018
     Marco Brandizi <marco.brandizi@rothamsted.ac.uk>
     Find these slides on SlideShare
     KnetMiner-inspired artwork by Hugo Dalton (hugodalton.com)
  2. Behind the scenes of KnetMiner
  3. Putting it on a Bigger Picture
  4. Putting it on a Bigger Picture
  5. Is XML/OXL Enough?

     <concept>
       <id>1</id>
       <pid>Q75WV3</pid>
       <description/>
       <elementOf> <idRef>UNIPROTKB-SwissProt</idRef> </elementOf>
       <ofType> <idRef>Protein</idRef> </ofType>
       <evidences>
         <evidence> <idRef>IMPD</idRef> </evidence>
       </evidences>
       <conames>
         <concept_name>
           <name>Probable trehalose-phosphate phosphatase 1</name>
           <isPreferred>true</isPreferred>
         </concept_name>
     …

     <cc>
       <id>Protein</id>
       <fullname>Protein</fullname>
       <description>
         A protein is comprised of one or more Polypeptides and potentially other molecules.
       </description>
       <specialisationOf> <idRef>MolCmplx</idRef> </specialisationOf>
     </cc>

     <relation>
       <fromConcept>1</fromConcept>
       <toConcept>3</toConcept>
       <ofType> <idRef>participates_in</idRef> </ofType>
       <evidences>
         <evidence> <idRef>ECO:0000316</idRef> </evidence>
       </evidences>
       <relgds/>
     </relation>

     <concept>
       <id>3</id>
       <pid>GO:0009651</pid>
       <description>response to salt stress</description>
       <ofType><idRef>BioProc</idRef></ofType>
       <coaccessions>
         <concept_accession>
           <accession>GO:0009651</accession>
           <elementOf><idRef>GO</idRef></elementOf>
           <ambiguous>false</ambiguous>
         </concept_accession>
       </coaccessions>
     </concept>
  6. A Brief History of Data Models/Formats
  7. The Semantic Web Approach: RDF
  8. The Semantic Web Approach: RDF
  9. URI Resolution

     @prefix bkr: <http://www.ondex.org/bioknet/resources/> .
     @prefix bk: <http://www.ondex.org/bioknet/terms/> .
     @prefix bka: <http://www.ondex.org/bioknet/terms/attributes/> .

     bkr:TOB1 a bk:Protein ;
       bk:participates_in <http://www.wikipathways.org/id1> ;
       bk:prefName "TOB1" ;
       bk:published_in bkr:23236473 .

     The Turtle Syntax: https://www.w3.org/TR/turtle/
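The prefix declarations on this slide are shorthand: a prefixed name such as bkr:TOB1 stands for the namespace URI with the local name appended. The following is a minimal Java sketch of that expansion, using only the standard library. The class and method names (PrefixExpander, expand) are illustrative assumptions and not part of any RDF framework; in practice a framework such as Jena performs this step when it parses Turtle.

```java
import java.util.Map;

public class PrefixExpander {
    // Prefix map copied from the Turtle header on the slide.
    public static final Map<String, String> PREFIXES = Map.of(
        "bkr", "http://www.ondex.org/bioknet/resources/",
        "bk", "http://www.ondex.org/bioknet/terms/",
        "bka", "http://www.ondex.org/bioknet/terms/attributes/");

    // Hypothetical helper: expands "prefix:local" into a full URI.
    public static String expand(String prefixedName) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No prefix in: " + prefixedName);
        }
        String namespace = PREFIXES.get(prefixedName.substring(0, colon));
        if (namespace == null) {
            throw new IllegalArgumentException("Unknown prefix in: " + prefixedName);
        }
        return namespace + prefixedName.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(expand("bkr:TOB1"));
        System.out.println(expand("bk:participates_in"));
    }
}
```

The expanded string is the identifier that gets shared across data sets, which is why the same URI can be resolved or reused by other resources.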
  10. Schema/Ontologies
  11. Schema/Ontologies Data store Schema store
  12. Schema/Ontologies Data store Schema store
  13. Sharing Identifiers via URIs Data store Schema store Wikipathways
  14. Mapping Data for Interoperability
  15. Our Data Model: The BioKNO Ontology
  16. BioKNO: Biological Entities

      wp:id1 a bk:Path ; # a subclass of bk:Concept
        bk:evidence bkev:IMPD ; # Imported from database, a predefined resource type.
        bk:prefName "Bone Morphogenic Protein (BMP) Signalling and Regulation" .

      bkr:TOB1 a bk:Protein ;
        dc:identifier bkr:TOB1_acc ;
        bk:prefName "TOB1 HUMAN" ;
        # A simplified link, hiding the BioPax chain:
        # pathwayComponent -> BioChemicalReaction|Complex -> Protein
        bk:participates_in wp:id1 ;
        bk:is_annotated_by obo:GO_0030014 . # Same URI as the OBO Gene Ontology Term.

      # Structured accession, allow for linking of identifier and context.
      bkr:TOB1_acc a bk:Accession ;
        dcterms:identifier "TOB1" ;
        # instance of bk:DataSource. Another predefined entity.
        bk:dataSource bkds:UNIPROTKB .
  17. Attributes in Reified Relations

      # For practical reasons, we expect that the straight triple is
      # always asserted, with the reified version optionally added to it.
      bkr:TOB1 bk:published_in bkr:20068231 .

      bkr:citation_TOB1_15489334 a bk:Relation ;
        # the same properties that are used for regular relations
        bk:relTypeRef bk:published_in ;
        bk:relFrom bkr:TOB1 ;
        bk:relTo bkr:15489334 ;
        # An attribute
        bka:score 0.95 ;
        # Both attributes and object properties can be linked to a
        # reified relation.
        bk:evidence bkev:TextMining .
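To make the convention on this slide concrete, here is a small Java sketch, using only the standard library, of a reified relation that carries attributes and can emit the straight triple that is expected to accompany it. The class name ReifiedRelation and its methods are hypothetical and exist only for illustration; they are not part of BioKNO, Ondex or any RDF framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReifiedRelation {
    public final String from; // value of bk:relFrom
    public final String type; // value of bk:relTypeRef
    public final String to;   // value of bk:relTo
    // Attributes and object properties attached to the relation itself.
    public final Map<String, Object> attributes = new LinkedHashMap<>();

    public ReifiedRelation(String from, String type, String to) {
        this.from = from;
        this.type = type;
        this.to = to;
    }

    // The plain subject-predicate-object statement that should always be asserted.
    public String straightTriple() {
        return from + " " + type + " " + to + " .";
    }

    public static void main(String[] args) {
        ReifiedRelation rel =
            new ReifiedRelation("bkr:TOB1", "bk:published_in", "bkr:15489334");
        rel.attributes.put("bka:score", 0.95);
        rel.attributes.put("bk:evidence", "bkev:TextMining");
        System.out.println(rel.straightTriple());
        System.out.println(rel.attributes);
    }
}
```

Keeping the straight triple derivable from the reified form means that simple queries can ignore reification, while queries that need the score or the evidence can still reach them.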
  18. Talking to the Rest of The World

      BioKNO | External Ontologies | Mapping Type
      bk:Concept | skos:Concept | Subclass
      bk:Relation | rdf:Statement | Subclass
      bk:relFrom, bk:relTypeRef, bk:relTo | rdf:subject, rdf:predicate, rdf:object | Subproperties (i.e., mapping to RDF reified statements)
      bk:Path, bk:Participant, bk:Interaction, bk:Transport, bk:Protein, bk:Gene | Classes with same names in BioPAX and SIO | Equivalent class
      bk:participates_in, bk:has_participant | Relation Ontology (RO) properties with same names; biopax:participant (as sub-property) | Equivalent property
      bk:produces, bk:produced_by, bk:consumes, bk:consumed_by | biopax:product (as sub-property); RO properties with same names | Equivalent property
      bk:regulates, bk:positively_regulates, bk:negatively_regulates | RO properties with same names | Equivalent property
      bk:is_a, bk:part_of, bk:has_part, bk:occurs_in, bk:co_occurs_with | skos:broader; Basic Formal Ontology (BFO)/RO properties with same names | Equivalent property
      bk:Publication | schema:CreativeWork | Subclass
      bka:abstract, bka:title (also known as AbstractHeader), bka:authors | dcterms:description, dcterms:title, dc:creator | Sub-property
  19. How to Serve and Query RDF?
  20. Typical RDF (and Data) Architecture
  21. How to Use it, Concretely? Playground: SPARQL Browsers
  22. How to Use it, Concretely? Playground: SPARQL Browsers
  23. How to Use it, Concretely? Playground: SPARQL Browsers
  24. How to Use it, Concretely? Programmatically: RDF Frameworks (Jena in this case)
  25. How to Use it, Concretely? Programmatically: RDF Frameworks (Jena in this case)
  26. How to Use it, Concretely? Programmatically: RDF Frameworks (Jena in this case)
  27. How to Use it, Concretely? Programmatically: RDF Frameworks (Jena in this case)

      // Jena 3.x imports
      import org.apache.jena.query.Query;
      import org.apache.jena.query.QueryExecutionFactory;
      import org.apache.jena.query.QueryFactory;
      import org.apache.jena.query.QuerySolution;
      import org.apache.jena.query.ResultSet;
      import org.apache.jena.rdf.model.Literal;
      import org.apache.jena.rdf.model.Resource;
      import org.apache.jena.sparql.engine.http.QueryEngineHTTP;

      // The SPARQL endpoint to query
      String service = "http://localhost:3030/ds/query";
      String sparql =
        "PREFIX bk: <http://www.ondex.org/bioknet/terms/>\n" +
        // … (further prefix declarations elided on the slide; the query also uses bka: and xsd:)
        "\n" +
        "\n" +
        "SELECT DISTINCT ?pmid ?title ?year ?pub \n" +
        "{\n" +
        "  ?prot a bk:Protein;\n" +
        "    bk:prefName 'TOB1'.\n" +
        "  \n" +
        "  ?pubRel a bk:Relation;\n" +
        "    bk:relFrom ?prot;\n" +
        "    bk:relTo ?pub;\n" +
        "    bka:Score ?score.\n" +
        "  \n" +
        "  FILTER ( ?score > 0.90 )\n" +
        "  \n" +
        "  ?pub \n" +
        "    bka:PMID ?pmid ;\n" +
        "    bka:YEAR ?dyear;\n" +
        "    bka:abstractHeader ?title\n" +
        "\n" +
        "  BIND ( xsd:int ( ?dyear ) AS ?year )\n" +
        "}\n" +
        "LIMIT 1000";

      // Send the query to the remote endpoint and iterate over the result rows
      Query query = QueryFactory.create ( sparql );
      QueryEngineHTTP qexec = QueryExecutionFactory.createServiceRequest ( service, query );
      ResultSet results = qexec.execSelect ();
      results.forEachRemaining ( (QuerySolution soln) -> {
        Resource pubNode = soln.getResource ( "pub" );
        String uri = pubNode.getURI ();

        Literal titleNode = soln.getLiteral ( "title" );
        String title = titleNode.getString ();
        String titleLang = titleNode.getLanguage ();

        Literal yearNode = soln.getLiteral ( "year" );
        int year = yearNode.getInt ();

        System.out.format (
          "Publication ID: <%s>, title: %s (in %s), year: %d\n",
          uri, title, titleLang, year
        );
      });
  28. SPARQL for Extraction, Loading, Transformation (The Simpler-than-Ondex Way)

      CONSTRUCT {
        ?path a bk:Path;
          bk:prefName ?pathName;
          bk:evidence bkev:IMPD.

        ?bkProt a bk:Protein;
          dc:identifier ?bkProtAccUri;
          bk:prefName ?protName;
          bk:participates_in ?path.

        ?bkProtAccUri a bk:Accession;
          dcterms:identifier ?protName;
          bk:dataSource bkds:UNIPROTKB.
      }
      WHERE {
        ?path a bp:Pathway;
          bp:displayName ?pathName;
          bp:pathwayComponent ?comp.

        { ?comp a bp:BiochemicalReaction;
            bp:left|bp:right ?protein. }
        UNION
        { ?comp a bp:Complex;
            bp:component ?protein. }

        ?protein a bp:Protein;
          bp:displayName ?protName.

        BIND ( IRI ( CONCAT ( STR ( bkr: ), STR ( ?protName ) ) ) AS ?bkProt )
        BIND ( IRI ( CONCAT ( STR ( ?bkProt ), "_acc" ) ) AS ?bkProtAccUri )
      }
  29. SPARQL for Extraction, Loading, Transformation (The Simpler-than-Ondex Way)
  30. SPARQL/RDF for ELT
      • TARQL: Using SPARQL to RDF-Convert Tabular CSV Files
      • RDF/XML can be transformed via XSL
        • We have done it for bio-specific ontology definitions in Ondex
      • Programmatic conversions
        • Using RDF frameworks, e.g., Jena, RDF4J (formerly Sesame), rdflib for Python
        • See also java2rdf (https://github.com/EBIBioSamples/java2rdf)
          • We have used it for the Ondex->RDF converter
  31. SPARQL/RDF for ELT
  32. The Bigger Picture
  33. The Bigger Picture https://www.economist.com/node/21521548
  34. The Bigger Picture: Artificial Intelligence (AI). Sources: https://goo.gl/n4m5xL, https://www.economist.com/node/21521548
  35. The Bigger Picture: Artificial Intelligence (AI). Sources: https://goo.gl/n4m5xL, https://www.economist.com/node/21521548
  36. The Bigger Picture: Linked Open Data, Artificial Intelligence (AI). Source: https://lod-cloud.net/
  37. In the Life Sciences
  38. Another Graph Database World
  39. Another Graph Database World
  40. The Cypher Query/DML Language

      Proteins->Reactions->Pathways:

      // chain of paths, node selection via property (exploits indices)
      MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] -> (pway:Path{ title: 'apoptosis' })
      // further conditions, not always so performant
      WHERE prot.name =~ '(?i)^DNA.+'
      // Usual projection and post-selection operators
      RETURN prot.name, pway
      // Relations can have properties
      ORDER BY csby.pvalue
      LIMIT 1000

      Proteins->Reactions->Pathways:

      // Single-path (or same-direction branching) easy to write
      MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
        - [:part_of*1..3] -> (pway:Path)
      RETURN ID(prot), ID(pway) LIMIT 1000

      // Very compact forms available, depending on the data
      MATCH (prot:Protein) -- (pway:Path) RETURN pway
  41. Cypher as Semantic Motif Language
  42. Cypher as Semantic Motif Language
  43. The rdf2neo Tool
  44. The rdf2neo Tool
  45. The rdf2neo Tool
  46. The rdf2neo Tool

      SELECT ?iri {
        ?label rdfs:subClassOf* bk:Concept.
        ?iri a ?label.
      }

      SELECT ?label {
        {
          ?iri a ?label.
          ?label rdfs:subClassOf* bk:Concept.
        }
        UNION {
          # it's always instance of concept
          BIND ( bk:Concept AS ?label )
          BIND ( ?iri AS ?iri )
        }
      }

      SELECT ?name ?value {
        {
          ?iri ?name ?value.
          VALUES ( ?name ) {
            (dcterms:identifier) (dcterms:description) (rdfs:comment)
            (bk:prefName) (bk:altName)
          }
        }
        UNION {
          ?iri ?name ?value.
          ?name rdfs:subPropertyOf* bk:attribute.
        }
      }
  47. The rdf2neo Tool https://github.com/Rothamsted/rdf2neo
  48. How to Use it, Concretely? Playground: The Neo4j Browser
  49. How to Use it, Concretely? Programmatically: The Neo4j Drivers (for Java in this case)
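  50. How to Use it, Concretely? Programmatically: The Neo4j Drivers (for Java in this case)

      // Neo4j Java driver 1.x imports
      import org.neo4j.driver.v1.AuthToken;
      import org.neo4j.driver.v1.AuthTokens;
      import org.neo4j.driver.v1.Driver;
      import org.neo4j.driver.v1.GraphDatabase;
      import org.neo4j.driver.v1.Session;
      import org.neo4j.driver.v1.Statement;
      import org.neo4j.driver.v1.StatementResult;

      AuthToken auth = AuthTokens.basic ( "neo4j", "test" );
      // Driver and session are closed automatically at the end of the block
      try (
        Driver neodb = GraphDatabase.driver ( "bolt://127.0.0.1:7687", auth );
        Session session = neodb.session ()
      ) {
        String cypher =
          "MATCH (prot:Protein{ prefName:'TOB1' }) - [r:published_in] -> (pub)\n" +
          "WHERE toFloat ( r.Score ) > 0.9\n" +
          "RETURN pub.PMID, pub.AbstractHeader, pub.YEAR\n" +
          "ORDER BY pub.YEAR DESC\n" +
          "LIMIT 30";

        Statement stmt = new Statement ( cypher );
        StatementResult rs = session.run ( stmt );
        // Each record exposes the returned columns by name
        rs.forEachRemaining ( rec -> {
          String pmid = rec.get ( "pub.PMID" ).asString ();
          String title = rec.get ( "pub.AbstractHeader" ).asString ();
          String year = rec.get ( "pub.YEAR" ).asString ();
          System.out.format ( "PMID: %s, Title: \"%s\", year: %s\n", pmid, title, year );
        });
      }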
  51. Triple Stores vs Prop Graphs

      Data exchange format
        Neo4j, Cypher DBs, Graph DBs:
          - No official one, just Cypher; support for GraphML, RDF
          +/- Focus on backing applications
        Semantic Web/Triple Stores:
          + Focus on data sharing standards

      Data model
        Neo4j, Cypher DBs, Graph DBs:
          + Relations with properties
          - Metadata/schemas/ontologies management
        Semantic Web/Triple Stores:
          - Relations cannot have properties (reification required)
          + Metadata/schemas/ontologies as first-class citizens and standardised (OWL)

      Performance
        Neo4j, Cypher DBs, Graph DBs:
          + Complex graph traversals
        Semantic Web/Triple Stores:
          + Comparable in most cases

      Query language
        Neo4j, Cypher DBs, Graph DBs:
          + Cypher is easier (e.g., compact, implicit elements)?
          - Expressivity issues (unions)
          - No standard QL (but efforts in progress, e.g., OpenCypher)
        Semantic Web/Triple Stores:
          - SPARQL is harder? (URIs, namespaces, verbosity)
          + SPARQL is more expressive

      Standardisation, openness
        Neo4j, Cypher DBs, Graph DBs:
          +/- (TinkerPop is open, Neo4j isn't)
          + Commercial support
          + More alive and up to date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
        Semantic Web/Triple Stores:
          + Natively open, many open implementations
          - Instability and many short-lived prototypes
          - Advancements seem to be slowing down
          + Some nice open and commercial browsers (LODEStar,

      Scalability, big data
        Neo4j, Cypher DBs, Graph DBs:
          +/- Commercial support for clustering/clouds for Neo4j
          + Open support in TinkerPop
        Semantic Web/Triple Stores:
          + Load balancing/cluster solutions, commercial cloud support (e.g., GraphDB)
          + SPARQL over TinkerPop (via SAIL interface)
  52. Supporting Web APIs via JSON

      {
        "type": "Protein",
        "id": "TOB1",
        "prefName": "TOB1 Human",
        "participates_in": {
          "type": "Pathway",
          "id": "id1",
          "evidence": "IMPD",
          "prefName": "Bone Morphogenic Protein (BMP) Signalling and Regulation"
        },
        "is_annotated_by": "GO_0030014"
      }

      • Designed to be compatible with browsers, i.e., JavaScript
      • Language of choice for web APIs, web browser consumption, dynamic web interfaces (i.e., AJAX)
      • Conceptually similar to XML (trees, nested structures)
      • Often used in a lightweight way, without many schema constraints
  53. Supporting Web APIs via JSON
  54. Bridging to RDF: JSON-LD

      {
        "@context": {
          "bk": "http://www.ondex.org/bioknet/terms/",
          "bka": "http://www.ondex.org/bioknet/terms/attributes/",
          "bkds": "http://www.ondex.org/bioknet/terms/dataSources/",
          "bkev": "http://www.ondex.org/bioknet/terms/evidences/",
          "bkr": "http://www.ondex.org/bioknet/resources/",
          "dcterms": "http://purl.org/dc/terms/",
          "obo": "http://purl.obolibrary.org/obo/",
          "xsd": "http://www.w3.org/2001/XMLSchema#",
          "@vocab": "http://www.ondex.org/bioknet/terms/",
          "dcterms:identifier": { "@type": "xsd:string" },
          "evidence": { "@type": "@id" }
        },
        …
        "@id": "bkr:TOB1",
        "@type": "bk:Protein",
        "prefName": "TOB1 Human",
        "dcterms:identifier": "TOB1",
        "is_annotated_by": "obo:GO_0030014",
        "participates_in": {
          "@id": "http://www.wikipathways.org/id1",
          "@type": "bk:Pathway",
          "evidence": "bkev:IMPD",
          "prefName": "Bone Morphogenic Protein (BMP) Signalling and Regulation"
        }
      }
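The @context is what turns the short JSON keys and identifiers into RDF terms. The sketch below shows, in plain Java, a deliberately simplified version of how a JSON-LD processor might expand property keys and @id/@type values: names with a known prefix use that prefix's namespace, names without a prefix fall back to @vocab, and anything else is returned unchanged. The class JsonLdTerms and its expand method are hypothetical, and the real JSON-LD expansion algorithm covers many more cases (typed values, nested contexts, relative IRIs).

```java
import java.util.Map;

public class JsonLdTerms {
    // Subset of the @context shown on the slide.
    public static final Map<String, String> CONTEXT = Map.of(
        "bk", "http://www.ondex.org/bioknet/terms/",
        "bkr", "http://www.ondex.org/bioknet/resources/",
        "dcterms", "http://purl.org/dc/terms/",
        "obo", "http://purl.obolibrary.org/obo/",
        "@vocab", "http://www.ondex.org/bioknet/terms/");

    // Hypothetical, simplified expansion of a key or an @id/@type value.
    public static String expand(String term) {
        if (term.startsWith("@")) {
            return term; // JSON-LD keywords such as @id are left alone
        }
        int colon = term.indexOf(':');
        if (colon > 0) {
            String namespace = CONTEXT.get(term.substring(0, colon));
            // Unknown prefix: assume the term is already an absolute IRI.
            return namespace != null ? namespace + term.substring(colon + 1) : term;
        }
        // No prefix: resolve against the default vocabulary.
        return CONTEXT.get("@vocab") + term;
    }

    public static void main(String[] args) {
        System.out.println(expand("prefName"));
        System.out.println(expand("bkr:TOB1"));
        System.out.println(expand("dcterms:identifier"));
        System.out.println(expand("http://www.wikipathways.org/id1"));
    }
}
```

With this mapping, the plain key prefName and the Turtle property bk:prefName resolve to the same URI, which is the point of bridging the JSON API to RDF.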
  55. JSON Schemas Babylon (and Our Focus)
  56. JSON Schemas Babylon (and Our Focus)
  57. JSON Schemas Babylon (and Our Focus)
  58. JSON Schemas Babylon (and Our Focus)
  59. JSON Schemas Babylon (and Our Focus)
  60. Take-Home Messages
      • From small data integration farm to sharing with the rest of the world => FAIR Principles
      • Semantic Web has pros and cons
        • Still useful for data model and schema governance, identifiers, complex models (namely, ontologies)
      • Alternative data sharing approaches, PG in particular
        • More alive area, can be simpler (blends into existing industrial software better)
        • LOD/FAIR principles not addressed much
        • Integrating the two is useful
      • APIs are a useful alternative/complementary approach
        • LOD/FAIR principles to be addressed as well
      • On our radar:
        • Complete the work, publishing SPARQL, Neo4j access, APIs
        • Integrating similar projects in the agrifood field (e.g., BrAPI, DFW)
        • Contribute to standardisation efforts like Bioschemas