Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMiner Use Case

  1. Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMiner Use Case Marco Brandizi marco.brandizi@rothamsted.ac.uk Find these slides at: https://www.slideshare.net/mbrandizi
  2. A short story about Gene Knowledge www.knetminer.org
  3. A short story about Gene Knowledge www.ondex.org
      <concept>
        <id>1</id>
        <pid>Q75WV3</pid>
        <description/>
        <elementOf> <idRef>UNIPROTKB-SwissProt</idRef> </elementOf>
        <ofType> <idRef>Protein</idRef> </ofType>
        <evidences> <evidence> <idRef>IMPD</idRef> </evidence> </evidences>
        <conames>
          <concept_name>
            <name>Probable trehalose-phosphate phosphatase 1</name>
            <isPreferred>true</isPreferred>
          </concept_name>
          …
      <cc>
        <id>Protein</id>
        <fullname>Protein</fullname>
        <description> A protein is comprised of one or more Polypeptides and potentially other molecules. </description>
        <specialisationOf> <idRef>MolCmplx</idRef> </specialisationOf>
      </cc>
      <relation>
        <fromConcept>1</fromConcept>
        <toConcept>3</toConcept>
        <ofType> <idRef>participates_in</idRef> </ofType>
        <evidences> <evidence> <idRef>ECO:0000316</idRef> </evidence> </evidences>
        <relgds/>
      </relation>
      <concept>
        <id>3</id>
        <pid>GO:0009651</pid>
        <description>response to salt stress</description>
        <ofType><idRef>BioProc</idRef></ofType>
        <coaccessions>
          <concept_accession>
            <accession>GO:0009651</accession>
            <elementOf><idRef>GO</idRef></elementOf>
            <ambiguous>false</ambiguous>
          </concept_accession>
        </coaccessions>
      </concept>
  4. A short story about Gene Knowledge
      "Can we improve? Graph DBs? Query Languages? Open Data? FAIR?"
      "Sure! RDF! OWL! Triple Store! SPARQL!"
      "Uhm, we've tried that, but…"
      "I can feel what you mean, but it's not so difficult, let me…"
      "Look! I've seen this Neo4j! It has relations with properties!"
      "Uhm… well… yeah, but no data format, bad with ontologies, no URIs/merging…"
      "And look how cool the browser is!"
      "Oh, yes, that's cool, but maybe not the most important thing…"
      "And Cypher is a breeze!"
      "Uhm… let me try. Oh, cool, but UNION sucks, and…"
      "And it has graph algorithms! And devs got the APIs in minutes!"
      "Uhm… Are Jena/RDF4J that much harder? … …"
      Source: https://digiday.com/uk/weve-created-monster-publishers-vent-ad-tech-frustration
  5. Why not Taking the Best of Both Worlds?
  6. Actually, Like This…
  7. Details
  8. The Bridge: rdf2neo
  9. The Bridge: rdf2neo
  10. The rdf2neo in Principle
  11. The rdf2neo in Principle
  12. The rdf2neo in Principle
  13. It works!
  14. Comparing Functionality
      • Data ELT and Integration
          • See our example: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human
          • Semantic Web is focused on standardised data sharing
          • Neo4j doesn't have a data format, focused on backing applications
          • URI-based merging in RDF
          • CONSTRUCT-based data transformations in Sem Web (including tools like TARQL)
          • MATCH/CREATE in Cypher, but not the same
      • Query languages
          • Cypher considered compact and simple to learn
          • SPARQL better at complex graph patterns with branches
          • Cypher very good at chain patterns
  15. Query Performance Details at: https://github.com/Rothamsted/graphdb-benchmarks
  16. Query Performance: Graph Traversal
  17. Query Performance: Branch Union
  18. Query Performance: Branch Union
  19. Conclusions
      • Hybrid architectures might be good at getting the best of both
      • They're feasible; performance is acceptable with both technologies
      • rdf2neo can help you keep everything aligned to a conceptual data model
      • Helps with Linked Data and FAIR Principles
      • Please check out GitHub and get in touch (especially if you're in agriculture/plant biology)
      • It comes with some overhead. You might need just one half
      • Whatever you do, follow LOD/FAIR
  20. Acknowledgements
      • Ajit Singh, Software Engineer
      • Monika Mistry, Master Student, Data Curator
      • Keywan Hassani-Pak, KnetMiner Team Leader
      • Chris Rawlings, Head of Computational & Analytical Sciences
      • William Brown, IT Admin
  21. And You All! Marco Brandizi marco.brandizi@rothamsted.ac.uk
  22. EXTRA
  23. rdf2neo Architecture
  24. Cypher vs SPARQL
      Proteins->Reactions->Pathways:
        // chain of paths, node selection via property (exploits indices)
        MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction)
          - [:part_of] -> (pway:Path{ title: 'apoptosis' })
        // further conditions, not always so performant
        WHERE prot.name =~ '(?i)^DNA.+'
        // Usual projection and post-selection operators
        RETURN prot.name, pway
        // Relations can have properties
        ORDER BY csby.pvalue
        LIMIT 1000
      Proteins->Reactions->Pathways:
        // Single-path (or same-direction branching) easy to write
        MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
          - [:part_of*1..3] -> (pway:Path)
        RETURN ID(prot), ID(pway) LIMIT 1000
        // Very compact forms available, depending on the data
        MATCH (prot:Protein) -- (pway:Path) RETURN pway
  25. Cypher vs SPARQL
      select distinct ?prot ?pway
      where {
        { # Branch 1
          ?prot kb:pd_by|kb:cs_by ?react.
          ?prot a kb:Protein.
          ?react a kb:Reaction.
          ?react kb:part_of ?pway.
          ?pway a kb:Path.
        }
        union
        { # Branch 2
          ?prot ^kb:ac_by|kb:is_a ?enz.
          ?prot a kb:Protein.
          ?enz a kb:Enzyme.
          { # Branch 2.1
            ?enz kb:ac_by|kb:in_by ?comp.
            ?comp a kb:Compound.
            ?comp kb:cs_by|kb:pd_by ?trns.
            ?trns a kb:Transport
          }
          union
          { # Branch 2.2
            ?enz ^kb:ca_by ?trns.
            ?comp a kb:Compound.
            ?trns a kb:Transport
          }
          ?trns kb:part_of ?pway.
          ?pway a kb:Path.
        }
      }
      LIMIT 1000
  26. Loading Performance Details at: https://github.com/Rothamsted/graphdb-benchmarks
  27. Conclusions: Neo4J, Cypher DBs, Graph DBs vs. Semantic Web/Triple Stores
      • Data exchange format
          Graph DBs: - No official one, just Cypher; support for GraphML, RDF. +/- Focus on backing applications
          Semantic Web: + Focus on data sharing standards
      • Data model
          Graph DBs: + Relations with properties. - Metadata/schemas/ontologies management
          Semantic Web: - Relations cannot have properties (reification required). + Metadata/schemas/ontologies as first-class citizens and standardised (OWL)
      • Performance
          Graph DBs: + Complex graph traversals
          Semantic Web: + Comparable in most cases
      • Query Language
          Graph DBs: + Cypher is easier (e.g., compact, implicit elements)? - Expressivity issues (unions). - No standard QL (but efforts in progress, e.g., OpenCypher)
          Semantic Web: - SPARQL is harder? (URIs, namespaces, verbosity). + SPARQL is more expressive
      • Standardisation, openness
          Graph DBs: +/- (TinkerPop is open, Neo4J isn't). + Commercial support. + More alive and up-to-date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
          Semantic Web: + Natively open, many open implementations. - Instability and many short-lived prototypes. - Advancements seem to be slowing down. + Some nice open and commercial browsers (LODEStar,
      • Scalability, big data
          Graph DBs: +/- Commercial support for clustering/clouds for Neo4J. + Open support in TinkerPop
          Semantic Web: + Load balancing/cluster solutions, commercial cloud support (e.g., GraphDB). + SPARQL over TinkerPop (via SAIL interface)
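
The data model comparison above says that RDF relations cannot carry properties without reification, while property-graph relations can. A minimal Python sketch of that difference follows. It is not rdf2neo's actual mapping code: plain tuples stand in for triples, and the `ex:` vocabulary, the `to_property_graph` helper and the p-value are all hypothetical, chosen only for illustration.

```python
# Illustrative only: tuples stand in for RDF triples. The "ex:" names and the
# p-value are hypothetical, not taken from KnetMiner data or from rdf2neo.
RDF_TYPE = "rdf:type"

triples = [
    ("ex:prot1", RDF_TYPE, "ex:Protein"),
    ("ex:react1", RDF_TYPE, "ex:Reaction"),
    # A reified relation: an extra resource describes the link and its properties.
    ("ex:rel1", RDF_TYPE, "ex:Relation"),
    ("ex:rel1", "ex:from", "ex:prot1"),
    ("ex:rel1", "ex:to", "ex:react1"),
    ("ex:rel1", "ex:relType", "ex:consumed_by"),
    ("ex:rel1", "ex:pvalue", 0.01),
]

def to_property_graph(triples):
    """Collapse reified relations into edges that carry their own properties."""
    by_subject = {}
    for s, p, o in triples:
        # Simplification: a repeated predicate overwrites the earlier value.
        by_subject.setdefault(s, {})[p] = o

    structural = {RDF_TYPE, "ex:from", "ex:to", "ex:relType"}
    nodes, edges = {}, []
    for uri, props in by_subject.items():
        if props.get(RDF_TYPE) == "ex:Relation":
            edges.append({
                "from": props["ex:from"],
                "to": props["ex:to"],
                "type": props["ex:relType"],
                # Everything that is not structural becomes an edge property.
                "props": {k: v for k, v in props.items() if k not in structural},
            })
        else:
            # The URI stays as the node key, so identifier-based merging still works.
            nodes[uri] = {"label": props.get(RDF_TYPE)}
    return nodes, edges

nodes, edges = to_property_graph(triples)
```

Here the five triples describing `ex:rel1` become a single edge whose `props` hold the p-value, which is the shape a Cypher query such as `ORDER BY csby.pvalue` relies on.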

Editor's Notes

  1. Let me start from a little story…
  2. In the end, it's not really a tug of war; we realised they're complementary.
  3. So, why not take the best of both worlds? Hopefully less controversial than that…
  4. We have done some formal modelling of the mapping procedure. The main results are: 1) it works as expected; 2) its computational complexity is no worse than that of SPARQL, which is known to be constrainable into LOGSPACE.
  5. Loading is scalable and performs reasonably well (skipped here). In querying, both have comparable performance. On individual use cases, they're complementary again.
  6. Virtuoso is better with queries involving distant subgraphs (traversal not possible) and with complex branching based on (nested) UNIONs. In the latter case, SPARQL looks easier to write (especially with nested patterns), though OpenCypher has promised subqueries.
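
The "branch union" shape discussed in these notes asks for the pathways a protein reaches through either of two alternative routes. A minimal Python sketch over an in-memory edge list shows the query shape only. The node names, relation names and the `follow`/`chain` helpers are hypothetical, and the code says nothing about how Virtuoso or Neo4j evaluate such queries.

```python
# Hypothetical toy graph: (source, relation, target) edges.
edges = [
    ("prot1", "consumed_by", "react1"),
    ("react1", "part_of", "pathA"),
    ("prot2", "is_a", "enz1"),
    ("enz1", "catalyses", "trns1"),
    ("trns1", "part_of", "pathB"),
]

def follow(frontier, relation):
    """Advance every (origin, current) pair one hop along `relation`."""
    return {
        (origin, dst)
        for origin, current in frontier
        for src, rel, dst in edges
        if src == current and rel == relation
    }

def chain(proteins, relations):
    """Follow a chain of relations, remembering the protein each path began at."""
    frontier = {(p, p) for p in proteins}
    for relation in relations:
        frontier = follow(frontier, relation)
    return frontier

proteins = {"prot1", "prot2"}
# Branch 1: protein -> reaction -> pathway
branch1 = chain(proteins, ["consumed_by", "part_of"])
# Branch 2: protein -> enzyme -> transport -> pathway
branch2 = chain(proteins, ["is_a", "catalyses", "part_of"])
# A UNION-based query returns the pairs found by either branch.
result = branch1 | branch2
```

Each branch is a simple chain, the kind of pattern the slides describe as easy in Cypher; the nesting of such branches is where the notes find SPARQL's UNION easier to write.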