
Knetminer Backend Training, Nov 2018



  1. 1. Behind the scenes of KnetMiner Marco Brandizi marco.brandizi@rothamsted.ac.uk Bioinformatics Group Training, 27/11/2018 Find these slides at: https://www.slideshare.net/mbrandizi
  2. 2. Behind the scenes of KnetMiner
  3. 3. Behind the scenes of KnetMiner
  4. 4. The OXL format
      <concept>
        <id>1</id>
        <pid>Q75WV3</pid>
        <description/>
        <elementOf> <idRef>UNIPROTKB-SwissProt</idRef> </elementOf>
        <ofType> <idRef>Protein</idRef> </ofType>
        <evidences>
          <evidence> <idRef>IMPD</idRef> </evidence>
        </evidences>
        <conames>
          <concept_name>
            <name>Probable trehalose-phosphate phosphatase 1</name>
            <isPreferred>true</isPreferred>
          </concept_name>
      …
      <cc>
        <id>Protein</id>
        <fullname>Protein</fullname>
        <description> A protein is comprised of one or more Polypeptides and potentially other molecules. </description>
        <specialisationOf> <idRef>MolCmplx</idRef> </specialisationOf>
      </cc>
      <relation>
        <fromConcept>1</fromConcept>
        <toConcept>3</toConcept>
        <ofType> <idRef>participates_in</idRef> </ofType>
        <evidences>
          <evidence> <idRef>ECO:0000316</idRef> </evidence>
        </evidences>
        <relgds/>
      </relation>
      <concept>
        <id>3</id>
        <pid>GO:0009651</pid>
        <description>response to salt stress</description>
        <ofType><idRef>BioProc</idRef></ofType>
        <coaccessions>
          <concept_accession>
            <accession>GO:0009651</accession>
            <elementOf><idRef>GO</idRef></elementOf>
            <ambiguous>false</ambiguous>
          </concept_accession>
        </coaccessions>
      </concept>
  5. 5. The Ondex Integrator
  6. 6. And the Command Line Version
  7. 7. But it Needs some Pre-Processing Too
  8. 8. Why Changing? https://funnyjunk.com/Reinvent+the+wheel/funny-pictures/5665443/
  9. 9. Why Changing?
      • Graph databases have emerged
        • Having expressive query languages (eg, SPARQL, Cypher)
        • Having low memory footprint (and possibly scalability over clusters/clouds)
        • More stable APIs and implementations
      • Data Standards, Machine-Readable Data, FAIR Principles, etc etc etc
        • Useful in Input: standardised data, less custom ELT to do, useful tools and techniques (e.g., SPARQL CONSTRUCT, scripting with JSON)
        • Useful in Output: applications based on APIs/micro-services, query languages, machine readable & standardised data.
          • New apps can be either ours or 3rd parties
      • Ondex issues
        • Getting old (and older with Java >8)
        • All data must be in memory
        • Not exactly high quality code
  10. 10. Property Graphs
  11. 11. The Cypher Query/DML Language
      Proteins->Reactions->Pathways:
        // chain of paths, node selection via property (exploits indices)
        MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] -> (pway:Path{ title: 'apoptosis' })
        // further conditions, not always so performant
        WHERE prot.name =~ '(?i)^DNA.+'
        // Usual projection and post-selection operators
        RETURN prot.name, pway
        // Relations can have properties
        ORDER BY csby.pvalue
        LIMIT 1000
      Proteins->Reactions->Pathways:
        // Single-path (or same-direction branching) easy to write
        MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction) - [:part_of*1..3] -> (pway:Path)
        RETURN ID(prot), ID(pway) LIMIT 1000
        // Very compact forms available, depending on the data
        MATCH (prot:Protein) -- (pway:Path) RETURN pway
  12. 12. Cypher as Semantic Motif Language
  13. 13. Cypher as Semantic Motif Language
  14. 14. Exercise 1: Try Cypher
      • Go to http://babvs48.rothamsted.ac.uk:7476/browser
      • Use neo4j/test as credentials
      • Try the query:
          MATCH (prot:Protein) - [prot2react:cs_by|pd_by] - (react:Reaction) - [react2path:part_of] -> (pway:Path)
          WHERE pway.prefName CONTAINS 'acyl carrier protein metabolism'
          RETURN * LIMIT 10
      • And explore the graphical result
      • What do you think you’ve found?
      • What do you have in () and in []?
      • What’s the meaning of the ‘|’ operator?
      • cs_by and pd_by are shortcuts for ‘consumed by’ and ‘produced by’
      • What’s the difference between -[]-> and -[]- ?
      • More help about Cypher at: https://neo4j.com/developer/cypher-query-language
  15. 15. Exercise 1: Solution
      • You should see something like the figure
      • It shows the ACP pathway at the centre, a member reaction and the proteins consumed/produced by the latter
      • (name:Label) matches nodes (label is a synonym of type), [name:Type] matches relations
      • [r:R1|R2] matches relations of either type R1 or R2
      • (src:Label1)-[r:R]->(dst:Label2) matches relations of type R going from nodes of type Label1 to nodes of type Label2
      • (n1)-[:R]-(n2) matches both directions, so both n1->r1->n2 and n2->r2->n1
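The matching rules in the Exercise 1 solution can be mimicked on a toy in-memory graph. The following is only a rough Python sketch of the idea, not how Neo4j evaluates Cypher; the node names, the edge list and the `match` helper are all made up for illustration.

```python
# Toy property graph: each edge is (source, relation_type, target).
# All node names and edges are hypothetical, made up for this illustration.
EDGES = [
    ("protA", "cs_by", "react1"),
    ("protB", "pd_by", "react1"),
    ("react1", "part_of", "pathway1"),
    ("geneX", "enc", "protA"),
]


def match(edges, types, directed=True):
    """Return (n1, type, n2) bindings for edges whose type is in `types`.

    directed=True mimics (n1)-[:T1|T2]->(n2).
    directed=False mimics (n1)-[:T1|T2]-(n2), which also binds every
    matching edge with n1 and n2 swapped.
    """
    results = []
    for src, rel, dst in edges:
        if rel not in types:
            continue
        results.append((src, rel, dst))
        if not directed:
            results.append((dst, rel, src))
    return results


# Passing two types plays the role of the '|' operator
directed_hits = match(EDGES, {"cs_by", "pd_by"}, directed=True)
undirected_hits = match(EDGES, {"cs_by", "pd_by"}, directed=False)

print(directed_hits)
print(undirected_hits)
```

The undirected call returns each matching edge twice, once per orientation, which is why a query without an arrow can bind the same pair of nodes in both roles.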
  16. 16. Exercise 2: Write Your Own Cypher
      • Using the same browser, find:
        • genes,
        • which encode proteins,
        • which are mentioned by articles that contain ‘ZmPEAMT1’ in the title
      • Hints
        • Use the node labels: Gene, Protein, Publication
        • Use the relation types: enc (meaning ‘encodes’), pub_in (meaning ‘published in’, or ‘mentioned in’)
        • Use the attribute AbstractHeader (meaning ‘publication title’)
        • Use the filter operator CONTAINS, as in the previous exercise
      • More info about the KnetMiner node/relation types is in the left column of the Neo4j browser, and on the following slides
  17. 17. Exercise 2: Solution
      • MATCH (gene:Gene)-[enc:enc]->(prot:Protein)-[xref:pub_in]->(article:Publication)
        WHERE article.AbstractHeader CONTAINS 'ZmPEAMT1'
        RETURN * LIMIT 10
      • Your solution might be a variant of this
  18. 18. But how to Encode Data? The Semantic Web Way
  19. 19. But how to Encode Data? The Semantic Web Way
      @prefix bkr: <http://www.ondex.org/bioknet/resources/> .
      @prefix bk: <http://www.ondex.org/bioknet/terms/> .
      @prefix bka: <http://www.ondex.org/bioknet/terms/attributes/> .

      bkr:TOB1 a bk:Protein ;
        bk:participates_in <http://www.wikipathways.org/id1> ;
        bk:prefName "TOB1" ;
        bk:published_in bkr:23236473 .
  20. 20. But how to Encode Data? The Semantic Web Way
  21. 21. But how to Encode Data? The Semantic Web Way
  22. 22. Querying KnetMiner with SPARQL
      select distinct ?prot ?comp
      where {
        ?prot a kb:Protein;
          rdfs:label ?protLabel.
        filter ( contains ( ?protLabel, 'TOB1' ) ).
        ?enz kb:activated_by ?prot.
        ?enz kb:activated_by ?comp.
        ?comp rdfs:label ?compLabel.
      }
      LIMIT 1000
  23. 23. Querying KnetMiner with SPARQL
      select distinct ?prot ?pway
      where {
        {
          # Branch 1
          ?prot kb:pd_by|kb:cs_by ?react.
          ?prot a kb:Protein.
          ?react a kb:Reaction.
          ?react kb:part_of ?pway.
          ?pway a kb:Path.
        }
        union
        {
          # Branch 2
          ?prot ^kb:ac_by|kb:is_a ?enz.
          ?prot a kb:Protein.
          ?enz a kb:Enzyme.
          {
            # Branch 2.1
            ?enz kb:ac_by|kb:in_by ?comp.
            ?comp a kb:Compound.
            ?comp kb:cs_by|kb:pd_by ?trns.
            ?trns a kb:Transport
          }
          union
          {
            # Branch 2.2
            ?enz ^kb:ca_by ?trns.
            ?comp a kb:Compound.
            ?trns a kb:Transport
          }
          ?trns kb:part_of ?pway.
          ?pway a kb:Path.
        }
      }
      LIMIT 1000
  24. 24. So, Why Both?
  25. 25. And more
      Comparison of Neo4J, Cypher DBs, Graph DBs (left) with Semantic Web/Triple Stores (right)
      • Data xchg format
        • Neo4J, Cypher DBs, Graph DBs: - No official one, just Cypher, Support for GraphML, RDF; +/- Focus on backing applications
        • Semantic Web/Triple Stores: + Focus on data sharing standards
      • Data model
        • Neo4J, Cypher DBs, Graph DBs: + Relations with properties; - Metadata/schemas/ontologies management
        • Semantic Web/Triple Stores: - Relations cannot have properties (reification required); + Metadata/schemas/ontologies as first citizen and standardised OWL
      • Performance
        • Neo4J, Cypher DBs, Graph DBs: + Complex graph traversals
        • Semantic Web/Triple Stores: + Comparable in most cases
      • Query Language
        • Neo4J, Cypher DBs, Graph DBs: + Cypher is easier (eg, compact, implicit elems)?; - Expressivity issues (unions); - No standard QL (but efforts in progress, eg, OpenCypher)
        • Semantic Web/Triple Stores: - SPARQL is harder? (URIs, namespaces, verbosity); + SPARQL more expressive
      • Standardisation, openness
        • Neo4J, Cypher DBs, Graph DBs: +/- (TinkerPop is open, Neo4J isn’t); + Commercial support; + More alive and up-to-date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
        • Semantic Web/Triple Stores: + Natively open, many open implementations; - Instability and many short-lived prototypes; - Advancements seem to be slowing down; + Some nice open and commercial browser (LODEStar,
      • Scalability, big data
        • Neo4J, Cypher DBs, Graph DBs: +/- Commercial support to clustering/clouds for Neo4J; + Open support in TinkerPop
        • Semantic Web/Triple Stores: + Load Balancing/Cluster solutions, Commercial Cloud support (eg GraphDB); + SPARQL Over TinkerPop (via SAIL interface)
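The "Data model" row above is perhaps easiest to see side by side. The sketch below is an illustration in plain Python, not KnetMiner code. It takes one property-graph edge that carries its own properties and rewrites it as triples about a separate relation resource, which is what reification amounts to. The property names (`bk:relTypeRef`, `bk:relFrom`, `bk:relTo`, `bka:Score`, `bka:EVIDENCE`) are the ones used in the Exercise 3 solution; the `reify` helper, the `bkr:rel_1` identifier and the score value are assumptions made up for the example.

```python
def reify(edge, relation_uri):
    """Turn one property-graph edge into RDF-style triples about a relation resource."""
    triples = [
        (relation_uri, "rdf:type", "bk:Relation"),
        (relation_uri, "bk:relTypeRef", edge["type"]),
        (relation_uri, "bk:relFrom", edge["source"]),
        (relation_uri, "bk:relTo", edge["target"]),
    ]
    # Each edge property becomes one more triple hanging off the relation resource
    for name, value in sorted(edge["properties"].items()):
        triples.append((relation_uri, name, value))
    return triples


# Property-graph view: the relation itself carries the properties.
# The score below is an arbitrary illustrative value.
edge = {
    "source": "bkr:TOB1",
    "type": "bk:is_annotated_by",
    "target": "obo:GO_0003714",
    "properties": {"bka:Score": 0.5, "bka:EVIDENCE": "text mining tool"},
}

# "bkr:rel_1" is a hypothetical identifier for the reified relation
triples = reify(edge, "bkr:rel_1")
for triple in triples:
    print(triple)
```

One edge with two properties becomes six triples, which gives a feel for the verbosity cost listed against triple stores in the comparison.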
  26. 26. So, the New Architecture
  27. 27. Why Should I Bother?
      • As data consumer
        • Querying data via Cypher (or SPARQL)
        • In particular, define new semantic motifs to find gene-related entities
        • Knowing our BioKNO ontology/schema (TODO)
        • In future, querying data via API/Cypher, getting back JSON/BioKNO
      • As data producer (for KnetMiner)
        • Scripting with RDF/SPARQL/etc to integrate data sources (and produce KnetMiner data sets)
        • Querying multiple SPARQL endpoints to produce data sets and/or integrate our KnetMiner data with other RDF/SPARQL sources
  28. 28. Exercise 3: Playing with RDF
      • Study Bio-KNO examples at https://github.com/Rothamsted/bioknet-onto
      • What is the meaning of ‘a’? What are the classes (ie, types) used in example 1?
      • Which property types (ie, relations) link proteins, pathways and protein accessions?
      • According to example 2, is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
      • In example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
      • How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
        • Hint, use bk:is_annotated_by
      • How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
        • Hint, use the attributes bka:EVIDENCE and bka:Score
      • Possibly use further documentation:
        • A quick tutorial about RDF and Turtle syntax: https://ai.ia.agh.edu.pl/wiki/_media/pl:dydaktyka:semweb:quick-tutorial-rdf-turtle.pdf
        • BioKNO Ontology Reference:
          • http://www.marcobrandizi.info/files/bkn-owldoc/bioknet/index.html (core)
          • http://www.marcobrandizi.info/files/bkn-owldoc/bk_ondex/index.html (entities used in KnetMiner/Ondex)
  29. 29. Exercise 3: Solution
      • ‘a’ is a shortcut for the URI rdf:type, which is the standard property to state that an entity is an instance of a class
        • So, you can find the classes used in the example by looking at the target of the ‘a’ predicate: bk:Path, bk:Protein, bk:Accession
      • Is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
        • The question aims at highlighting a feature of graph data, that is: automatic reasoning
        • ‘CCR4-NOT core complex’ is only explicitly stated as being part of ‘CCR4-NOT complex’ (follow the bk:part_of relation and the URIs it refers to)
        • So, using only the declared data in the example, a computer cannot ‘know’ that CCR4-NOT core complex is also part of ‘intracellular part’
        • However, graph systems are able to work with rules like: ?x bk:part_of ?y, ?y bk:is_a ?z => ?x part_of ?z
        • This rule can be applied to ?x := obo:GO_0030014, ?y := obo:GO_0030015, ?z := obo:GO_0044424
        • and logically infer that obo:GO_0030014 part_of obo:GO_0044424
        • These additional statements can be used in queries, eg: searching for all things that are part of intracellular part would return CCR4-NOT core complex in the results, even if this is not explicitly declared in the original data
        • The rationale for this conclusion is that anything that is part of a CCR4-NOT complex is also part of an intracellular part, because every CCR4-NOT complex is also an intracellular part (as per is_a)
      • In example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
        • Because you need to provide a context for the usually binary relation, ie, you need to tell what its confidence score is and the evidence to justify the statement
        • Compare this with the Neo4j equivalent
      • How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
        • bkr:TOB1 bk:is_annotated_by obo:GO_0003714.
      • How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
        • You need to add:
            bkr:citation_TOB1_15489334 a bk:Relation ;
              bk:relTypeRef bk:is_annotated_by ;
              bk:relFrom bkr:TOB1 ;
              bk:relTo obo:GO_0003714 ;
              bka:Score 0.05 ;
              bka:EVIDENCE "text mining tool" .
        • bka:EVIDENCE is an attribute, and it’s an alternative simplified form to represent evidence in KnetMiner (just a string, rather than a resource having multiple attributes).
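The inference rule from the solution (?x bk:part_of ?y, ?y bk:is_a ?z => ?x part_of ?z) can be tried out on a handful of triples. The following is only a minimal Python sketch of forward chaining with that single rule, not how a real reasoner or triple store is implemented. The two input triples use the variable bindings exactly as quoted on the slide; the `infer_part_of` helper is an assumption introduced for the example.

```python
# Input triples: the bindings quoted on the slide, with prefixed names kept as plain strings
TRIPLES = {
    ("obo:GO_0030014", "bk:part_of", "obo:GO_0030015"),
    ("obo:GO_0030015", "bk:is_a", "obo:GO_0044424"),
}


def infer_part_of(triples):
    """Apply '?x part_of ?y, ?y is_a ?z => ?x part_of ?z' until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current part_of and is_a pairs before adding anything
        part_of = [(s, o) for s, p, o in inferred if p == "bk:part_of"]
        is_a = [(s, o) for s, p, o in inferred if p == "bk:is_a"]
        for x, y in part_of:
            for y2, z in is_a:
                new_triple = (x, "bk:part_of", z)
                if y == y2 and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


closure = infer_part_of(TRIPLES)
new_facts = closure - TRIPLES
print(new_facts)
```

A query for everything that is part of obo:GO_0044424 finds nothing in `TRIPLES` but does find a match in `closure`, which is the point the exercise makes about inferred statements.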
  30. 30. Exercise 4: Data Integration based on RDF
      • Study the example at https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human, which builds a KnetMiner network in RDF format (following the BioKNO ontology) using two tools:
        • the SPARQL CONSTRUCT construct (https://www.futurelearn.com/courses/linked-data/0/steps/16104) to perform RDF-to-RDF transformations
        • and SPARQL CONSTRUCT coupled with the TARQL tool (http://tarql.github.io/) to transform CSV/table data into RDF
      • Look at the transformation https://github.com/Rothamsted/bioknet-onto/blob/master/examples/bmp_reg_human/cvt_bpax.sparql, which transforms the BioPAX RDF data into our BioKNO
        • What is happening? Look at it before the next question
        • Sketch a schema of the BioPAX graph that is matched by the WHERE clause and of the one built by the CONSTRUCT block. Is the new graph smaller or bigger?
        • How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
      • You can play with the data generated in this example at http://marcobrandizi.info:8890/sparql
        • See example queries at: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human/queries
  31. 31. Exercise 4: Solution
      • The CONSTRUCT statement (which is part of the SPARQL query language) takes chains of protein/reaction/pathway expressed in the BioPAX format (note the use of the bp: namespace) and builds chains of protein/pathway in BioKNO format.
        • So, it maps one format to another (an alternative would be to do so in data queries, see queries/pw_commons_fed.sparql)
        • and generates a simplified representation (many KnetMiner data sets do so, since the data explorations we aim at serving don’t need certain details)
      • How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
        • In the CONSTRUCT block you’d have: ?comp bk:participates_in ?path.
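The shape of this transformation (match a protein/reaction/pathway chain, emit a shorter protein/pathway link) can be imitated without a SPARQL engine. The sketch below is a loose Python analogy, not the actual cvt_bpax.sparql logic. The `ex:` predicates and resources are hypothetical placeholders and do not correspond to real BioPAX terms; only `bk:participates_in` comes from the slides.

```python
# Hypothetical source graph: 'ex:' names are placeholders, not real BioPAX terms
SOURCE = [
    ("ex:protein1", "ex:in_reaction", "ex:reaction1"),
    ("ex:protein2", "ex:in_reaction", "ex:reaction1"),
    ("ex:reaction1", "ex:in_pathway", "ex:pathway1"),
]


def construct_participation(triples):
    """Match protein -> reaction -> pathway chains and emit protein -> pathway triples."""
    # The 'WHERE' side: index which pathways each reaction belongs to
    reaction_to_pathways = {}
    for subject, predicate, obj in triples:
        if predicate == "ex:in_pathway":
            reaction_to_pathways.setdefault(subject, []).append(obj)

    # The 'CONSTRUCT' side: one output triple per matched chain
    output = []
    for subject, predicate, obj in triples:
        if predicate == "ex:in_reaction":
            for pathway in reaction_to_pathways.get(obj, []):
                output.append((subject, "bk:participates_in", pathway))
    return output


result = construct_participation(SOURCE)
print(result)
print(len(SOURCE), "source triples ->", len(result), "constructed triples")
```

In this toy case the constructed graph is smaller than the source because the reaction nodes are dropped, which mirrors the "simplified representation" remark in the solution.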
  32. 32. Thanks!
      • Even more material:
        • On graph databases, standards, KnetMiner new backend:
          • https://www.slideshare.net/mbrandizi/behind-the-scenes-of-knetminer-towards-standardised-and-interoperable-knowledge-graphs
          • https://doi.org/10.1515/jib-2018-0023
        • On Semantic Web, Linked Data, RDF, SPARQL, etc:
          • https://prezi.com/hbxhz0kesfnn/sod-2014-presentations-summary
          • https://goo.gl/bfF1hu
          • https://www.nature.com/articles/nbt1139
          • https://www.researchgate.net/publication/221024668_Ontologies_Come_of_Age
          • http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf