Knetminer Backend Training, Nov 2018

  1. 1. Behind the scenes of KnetMiner Marco Brandizi marco.brandizi@rothamsted.ac.uk Bioinformatics Group Training, 27/11/2018 Find these slides at: https://www.slideshare.net/mbrandizi
  2. 2. Behind the scenes of KnetMiner
  3. 3. Behind the scenes of KnetMiner
  4. 4. The OXL format
        <concept>
          <id>1</id>
          <pid>Q75WV3</pid>
          <description/>
          <elementOf> <idRef>UNIPROTKB-SwissProt</idRef> </elementOf>
          <ofType> <idRef>Protein</idRef> </ofType>
          <evidences> <evidence> <idRef>IMPD</idRef> </evidence> </evidences>
          <conames>
            <concept_name>
              <name>Probable trehalose-phosphate phosphatase 1</name>
              <isPreferred>true</isPreferred>
            </concept_name>
        …
        <cc>
          <id>Protein</id>
          <fullname>Protein</fullname>
          <description> A protein is comprised of one or more Polypeptides and potentially other molecules. </description>
          <specialisationOf> <idRef>MolCmplx</idRef> </specialisationOf>
        </cc>
        <relation>
          <fromConcept>1</fromConcept>
          <toConcept>3</toConcept>
          <ofType> <idRef>participates_in</idRef> </ofType>
          <evidences> <evidence> <idRef>ECO:0000316</idRef> </evidence> </evidences>
          <relgds/>
        </relation>
        <concept>
          <id>3</id>
          <pid>GO:0009651</pid>
          <description>response to salt stress</description>
          <ofType><idRef>BioProc</idRef></ofType>
          <coaccessions>
            <concept_accession>
              <accession>GO:0009651</accession>
              <elementOf><idRef>GO</idRef></elementOf>
              <ambiguous>false</ambiguous>
            </concept_accession>
          </coaccessions>
        </concept>
  5. 5. The Ondex Integrator
  6. 6. And the Command Line Version
  7. 7. But it Needs some Pre-Processing Too
  8. 8. Why Change? https://funnyjunk.com/Reinvent+the+wheel/funny-pictures/5665443/
  9. 9. Why Change?
      • Graph databases have emerged
        • They have expressive query languages (e.g., SPARQL, Cypher)
        • They have a low memory footprint (and possibly scale over clusters/clouds)
        • Their APIs and implementations are more stable
      • Data standards, machine-readable data, FAIR principles, etc.
        • Useful in input: standardised data, less custom ELT to do, useful tools and techniques (e.g., SPARQL CONSTRUCT, scripting with JSON)
        • Useful in output: applications based on APIs/micro-services, query languages, machine-readable and standardised data
        • New apps can be either ours or third parties'
      • Ondex issues
        • Getting old (and older with Java >8)
        • All data must be in memory
        • Not exactly high-quality code
  10. 10. Property Graphs
  11. 11. The Cypher Query/DML Language
      Proteins->Reactions->Pathways:
        // chain of paths, node selection via property (exploits indices)
        MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] -> (pway:Path{ title: 'apoptosis' })
        // further conditions, not always so performant
        WHERE prot.name =~ '(?i)^DNA.+'
        // Usual projection and post-selection operators
        RETURN prot.name, pway
        // Relations can have properties
        ORDER BY csby.pvalue
        LIMIT 1000
      Proteins->Reactions->Pathways:
        // Single-path (or same-direction branching) easy to write
        MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction) - [:part_of*1..3] -> (pway:Path)
        RETURN ID(prot), ID(pway)
        LIMIT 1000
        // Very compact forms available, depending on the data
        MATCH (prot:Protein) -- (pway:Path)
        RETURN pway
  12. 12. Cypher as Semantic Motif Language
  13. 13. Cypher as Semantic Motif Language
  14. 14. Exercise 1: Try Cypher
      • Go to http://babvs48.rothamsted.ac.uk:7476/browser
      • Use neo4j/test as credentials
      • Try the query:
          MATCH (prot:Protein) - [prot2react:cs_by|pd_by] - (react:Reaction) - [react2path:part_of] -> (pway:Path)
          WHERE pway.prefName CONTAINS 'acyl carrier protein metabolism'
          RETURN * LIMIT 10
      • And explore the graphical result
      • What do you think you’ve found?
      • What do you have in () and in []?
      • What’s the meaning of the ‘|’ operator?
        • cs_by and pd_by are shortcuts for ‘consumed by’ and ‘produced by’
      • What’s the difference between -[]-> and -[]- ?
      • More help about Cypher at: https://neo4j.com/developer/cypher-query-language
  15. 15. Exercise 1: Solution
      • You should see something like the figure
        • It shows the ACP pathway at the centre, a member reaction, and the proteins consumed/produced by that reaction
      • (name:Label) matches nodes (a label is a synonym of type), [name:Type] matches relations
      • [r:R1|R2] matches relations of either type R1 or R2
      • (src:Label1)-[r:R]->(dst:Label2) matches relations of type R going from nodes of type Label1 to nodes of type Label2
      • (n1)-[:R]-(n2) matches both directions, so both n1->r1->n2 and n2->r2->n1
  16. 16. Exercise 2: Write Your Own Cypher
      • Using the same browser, find:
        • genes,
        • which encode proteins,
        • which are mentioned by articles that contain ‘ZmPEAMT1’ in the title
      • Hints
        • Use the node labels: Gene, Protein, Publication
        • Use the relation types: enc (meaning ‘encodes’), pub_in (meaning ‘published in’, or ‘mentioned in’)
        • Use the attribute AbstractHeader (meaning ‘publication title’)
        • Use the filter operator CONTAINS, as in the previous exercise
        • More info about the KnetMiner node/relation types is in the left column of the Neo4j browser, and on the following slides
  17. 17. Exercise 2: Solution • MATCH (gene:Gene)-[enc:enc]->(prot:Protein)-[xref:pub_in]->(article:Publication) WHERE article.AbstractHeader CONTAINS 'ZmPEAMT1' RETURN * LIMIT 10 • Your solution might be a variant of this
  18. 18. But how to Encode Data? The Semantic Web Way
  19. 19. But how to Encode Data? The Semantic Web Way
        @prefix bkr: <http://www.ondex.org/bioknet/resources/> .
        @prefix bk: <http://www.ondex.org/bioknet/terms/> .
        @prefix bka: <http://www.ondex.org/bioknet/terms/attributes/> .
        bkr:TOB1 a bk:Protein ;
          bk:participates_in <http://www.wikipathways.org/id1> ;
          bk:prefName "TOB1" ;
          bk:published_in bkr:23236473 .
  20. 20. But how to Encode Data? The Semantic Web Way
  21. 21. But how to Encode Data? The Semantic Web Way
  22. 22. Querying KnetMiner with SPARQL
        select distinct ?prot ?comp
        where {
          ?prot a kb:Protein;
            rdfs:label ?protLabel.
          filter ( contains ( ?protLabel, 'TOB1' ) ).
          ?enz kb:activated_by ?prot.
          ?enz kb:activated_by ?comp.
          ?comp rdfs:label ?compLabel.
        }
        LIMIT 1000
  23. 23. Querying KnetMiner with SPARQL
        select distinct ?prot ?pway
        where {
          {
            # Branch 1
            ?prot kb:pd_by|kb:cs_by ?react.
            ?prot a kb:Protein.
            ?react a kb:Reaction.
            ?react kb:part_of ?pway.
            ?pway a kb:Path.
          }
          union {
            # Branch 2
            ?prot ^kb:ac_by|kb:is_a ?enz.
            ?prot a kb:Protein.
            ?enz a kb:Enzyme.
            {
              # Branch 2.1
              ?enz kb:ac_by|kb:in_by ?comp.
              ?comp a kb:Compound.
              ?comp kb:cs_by|kb:pd_by ?trns.
              ?trns a kb:Transport
            }
            union {
              # Branch 2.2
              ?enz ^kb:ca_by ?trns.
              ?comp a kb:Compound.
              ?trns a kb:Transport
            }
            ?trns kb:part_of ?pway.
            ?pway a kb:Path.
          }
        }
        LIMIT 1000
  24. 24. So, Why Both?
  25. 25. And more
      (Comparison of Neo4J, Cypher DBs, Graph DBs against Semantic Web/Triple Stores)
      • Data exchange format
        • Neo4J, Cypher DBs, Graph DBs: - No official one, just Cypher; support for GraphML, RDF. +/- Focus on backing applications
        • Semantic Web/Triple Stores: + Focus on data sharing standards
      • Data model
        • Neo4J, Cypher DBs, Graph DBs: + Relations with properties. - Metadata/schemas/ontologies management
        • Semantic Web/Triple Stores: - Relations cannot have properties (reification required). + Metadata/schemas/ontologies as first-class citizens and standardised (OWL)
      • Performance
        • Neo4J, Cypher DBs, Graph DBs: + Complex graph traversals
        • Semantic Web/Triple Stores: + Comparable in most cases
      • Query language
        • Neo4J, Cypher DBs, Graph DBs: + Cypher is easier (e.g., compact, implicit elements)? - Expressivity issues (unions). - No standard QL (but efforts in progress, e.g., OpenCypher)
        • Semantic Web/Triple Stores: - SPARQL is harder? (URIs, namespaces, verbosity). + SPARQL is more expressive
      • Standardisation, openness
        • Neo4J, Cypher DBs, Graph DBs: +/- (TinkerPop is open, Neo4J isn’t). + Commercial support. + More alive and up-to-date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
        • Semantic Web/Triple Stores: + Natively open, many open implementations. - Instability and many short-lived prototypes. - Advancements seem to be slowing down. + Some nice open and commercial browsers (LODEStar,
      • Scalability, big data
        • Neo4J, Cypher DBs, Graph DBs: +/- Commercial support for clustering/clouds for Neo4J. + Open support in TinkerPop
        • Semantic Web/Triple Stores: + Load balancing/cluster solutions, commercial cloud support (e.g., GraphDB). + SPARQL over TinkerPop (via SAIL interface)
  26. 26. So, the New Architecture
  27. 27. Why Should I Bother? • As data consumer • Querying data via Cypher (or SPARQL) • In particular, define new semantic motifs to find gene-related entities • Knowing our BioKNO ontology/schema (TODO) • In future, querying data via API/Cypher, getting back JSON/BioKNO • As data producer (for KnetMiner) • Scripting with RDF/SPARQL/etc to integrate data sources (and produce KnetMiner data sets) • Querying multiple SPARQL endpoints to produce data sets and/or integrate our KnetMiner data with other RDF/SPARQL sources
  28. 28. Exercise 3: Playing with RDF
      • Study the Bio-KNO examples at https://github.com/Rothamsted/bioknet-onto
      • What is the meaning of ‘a’? What are the classes (i.e., types) used in example 1?
      • Which property types (i.e., relations) link proteins, pathways and protein accessions?
      • According to example 2, is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
      • In example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
      • How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
        • Hint: use bk:is_annotated_by
      • How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
        • Hint: use the attributes bka:EVIDENCE and bka:Score
      • Possibly use further documentation:
        • A quick tutorial about RDF and Turtle syntax: https://ai.ia.agh.edu.pl/wiki/_media/pl:dydaktyka:semweb:quick-tutorial-rdf-turtle.pdf
        • BioKNO Ontology Reference:
          • http://www.marcobrandizi.info/files/bkn-owldoc/bioknet/index.html (core)
          • http://www.marcobrandizi.info/files/bkn-owldoc/bk_ondex/index.html (entities used in KnetMiner/Ondex)
  29. 29. Exercise 3: Solution
      • ‘a’ is a shortcut for the URI rdf:type, which is the standard property to state that an entity is an instance of a class
        • So, you can find the classes used in the example by looking at the targets of the ‘a’ predicate: bk:Path, bk:Protein, bk:Accession
      • Is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
        • The question aims at highlighting a feature of graph data, that is: automatic reasoning
        • ‘CCR4-NOT core complex’ is only explicitly stated as being part of ‘CCR4-NOT complex’ (follow the bk:part_of relation and the URIs it refers to)
        • So, using only the declared data in the example, a computer cannot ‘know’ that CCR4-NOT core complex is also part of ‘intracellular part’
        • However, graph systems are able to work with rules like: ?x bk:part_of ?y, ?y bk:is_a ?z => ?x bk:part_of ?z
        • This rule can be applied to ?x := obo:GO_0030015 (CCR4-NOT core complex), ?y := obo:GO_0030014 (CCR4-NOT complex), ?z := obo:GO_0044424 (intracellular part)
        • and logically infer that obo:GO_0030015 bk:part_of obo:GO_0044424
        • These additional statements can be used in queries, e.g., searching for all things that are part of intracellular part would return CCR4-NOT core complex in the results, even if this is not explicitly declared in the original data
        • The rationale for this conclusion is that anything that is part of a CCR4-NOT complex is also part of an intracellular part, because every CCR4-NOT complex is also an intracellular part (as per is_a)
      • In example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
        • Because you need to provide a context for the usually binary relation, i.e., you need to tell what its confidence score is and the evidence to justify the statement
        • Compare this with the Neo4j equivalent
      • How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
        • bkr:TOB1 bk:is_annotated_by obo:GO_0003714.
      • How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
        • You need to add:
            bkr:citation_TOB1_15489334 a bk:Relation ;
              bk:relTypeRef bk:is_annotated_by ;
              bk:relFrom bkr:TOB1 ;
              bk:relTo obo:GO_0003714 ;
              bka:Score 0.05 ;
              bka:EVIDENCE "text mining tool" .
        • bka:EVIDENCE is an attribute, and it’s an alternative simplified form to represent evidence in KnetMiner (just a string, rather than a resource having multiple attributes).
  30. 30. Exercise 4: Data Integration based on RDF
      • Study the example at https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human, which builds a KnetMiner network in RDF format (following the BioKNO ontology) using two tools:
        • the SPARQL CONSTRUCT construct (https://www.futurelearn.com/courses/linked-data/0/steps/16104) to perform RDF-to-RDF transformations
        • SPARQL CONSTRUCT coupled with the TARQL tool (http://tarql.github.io/) to transform CSV/table data into RDF
      • Look at the transformation https://github.com/Rothamsted/bioknet-onto/blob/master/examples/bmp_reg_human/cvt_bpax.sparql, which transforms the BioPAX RDF data into our BioKNO
        • What is happening? Look at it before the next question
        • Sketch a schema of the BioPAX graph that is matched by the WHERE clause and the one built by the CONSTRUCT block. Is the new graph smaller or bigger?
        • How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
      • You can play with the data generated in this example at http://marcobrandizi.info:8890/sparql
        • See example queries at: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human/queries
  31. 31. Exercise 4: Solution
      • The CONSTRUCT statement (which is part of the SPARQL query language) takes chains of protein/reaction/pathway expressed in the BioPAX format (note the use of the bp: namespace) and builds chains of protein/pathway in BioKNO format.
        • So, it maps one format to another (an alternative would be to do so in data queries, see queries/pw_commons_fed.sparql)
        • and generates a simplified representation (many KnetMiner data sets do so, since the data explorations we aim at serving don’t need certain details)
      • How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
        • In the CONSTRUCT block you’d have: ?comp bk:participates_in ?path.
  32. 32. Thanks! • Even more material: • On graph databases, standards, KnetMiner new backend: • https://www.slideshare.net/mbrandizi/behind-the-scenes-of-knetminer-towards-standardised-and-interoperable-knowledge-graphs • https://doi.org/10.1515/jib-2018-0023 • On Semantic Web, Linked Data, RDF, SPARQL, etc: • https://prezi.com/hbxhz0kesfnn/sod-2014-presentations-summary • https://goo.gl/bfF1hu • https://www.nature.com/articles/nbt1139 • https://www.researchgate.net/publication/221024668_Ontologies_Come_of_Age • http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf
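
A note on the Exercise 3 solution (slide 29): the inference rule "?x part_of ?y, ?y is_a ?z => ?x part_of ?z" can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how KnetMiner or any triple store implements reasoning. The function name `infer_part_of` and the use of human-readable labels instead of GO URIs are assumptions made for the example.

```python
def infer_part_of(part_of, is_a):
    """Apply the rule  ?x part_of ?y, ?y is_a ?z  =>  ?x part_of ?z
    repeatedly, until no new part_of statement can be derived.

    Both arguments are sets of (subject, object) pairs.
    (Hypothetical helper, written for illustration only.)
    """
    inferred = set(part_of)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(inferred):
            for (y2, z) in is_a:
                if y == y2 and (x, z) not in inferred:
                    inferred.add((x, z))
                    changed = True
    return inferred


# Declared statements, as described in the Exercise 3 solution
part_of = {("CCR4-NOT core complex", "CCR4-NOT complex")}
is_a = {("CCR4-NOT complex", "intracellular part")}

result = infer_part_of(part_of, is_a)

# "All things that are part of intracellular part"
things = sorted(x for (x, y) in result if y == "intracellular part")
print(things)
```

The query at the end returns the core complex even though that statement was never declared, which is the point the exercise makes about automatic reasoning.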
