Behind the Scenes of KnetMiner: Towards Standardised and Interoperable Knowledge Graphs
15 Jun 2018
Workshop within the Integrative Bioinformatics Conference (IB2018, Harpenden, 2018).
We describe how to use Semantic Web Technologies and graph databases like Neo4j to serve life science data and address the FAIR data principles.
Harpenden, 3/6/2018
Marco Brandizi <marco.brandizi@rothamsted.ac.uk>
KnetMiner-inspired Artwork
by Hugo Dalton (hugodalton.com)
BioKNO: Biological Entities
wp:id1
  a bk:Path ; # a subclass of bk:Concept
  bk:evidence bkev:IMPD ; # Imported from database, a predefined resource type.
  bk:prefName "Bone Morphogenic Protein (BMP) Signalling and Regulation".

bkr:TOB1 a bk:Protein ;
  dc:identifier bkr:TOB1_acc ;
  bk:prefName "TOB1 HUMAN";
  # A simplified link, hiding the BioPax chain:
  # pathwayComponent -> BioChemicalReaction|Complex -> Protein
  bk:participates_in wp:id1;
  bk:is_annotated_by obo:GO_0030014. # Same URI as the OBO Gene Ontology Term.

# Structured accession, allow for linking of identifier and context.
bkr:TOB1_acc a bk:Accession ;
  dcterms:identifier "TOB1";
  # instance of bk:DataSource. Another predefined entity.
  bk:dataSource bkds:UNIPROTKB.
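The entity-plus-structured-accession pattern in the BioKNO example can be sketched in plain Python as follows. This is only an illustration: the `Accession` class and the `entity_triples` helper are hypothetical names, the triples are kept as prefixed-name strings rather than full URIs, and no RDF library is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accession:
    """Hypothetical holder for an identifier and the data source it comes from."""
    identifier: str   # e.g. "TOB1"
    data_source: str  # e.g. "bkds:UNIPROTKB"


def entity_triples(entity, rdf_type, pref_name, acc):
    """Return (subject, predicate, object) tuples for an entity and its accession node."""
    acc_node = f"{entity}_acc"  # same naming convention as bkr:TOB1_acc in the example
    return [
        (entity, "a", rdf_type),
        (entity, "bk:prefName", '"' + pref_name + '"'),
        (entity, "dc:identifier", acc_node),
        (acc_node, "a", "bk:Accession"),
        (acc_node, "dcterms:identifier", '"' + acc.identifier + '"'),
        (acc_node, "bk:dataSource", acc.data_source),
    ]


triples = entity_triples(
    "bkr:TOB1", "bk:Protein", "TOB1 HUMAN", Accession("TOB1", "bkds:UNIPROTKB")
)
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Keeping the accession as its own node is what lets the identifier carry its context (the data source) instead of being a bare string on the protein.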
Attributes in Reified Relations
# For practical reasons, we expect the straight triple to always be
# asserted, with the reified version optionally added to it.
bkr:TOB1 bk:published_in bkr:15489334.

bkr:citation_TOB1_15489334 a bk:Relation ;
  # the same properties that are used for regular relations
  bk:relTypeRef bk:published_in;
  bk:relFrom bkr:TOB1 ;
  bk:relTo bkr:15489334 ;
  # An attribute
  bka:score 0.95 ;
  # Both attributes and object properties can be linked to a
  # reified relation.
  bk:evidence bkev:TextMining.
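The convention of asserting the straight triple and optionally adding a reified version with attributes can be sketched as a small helper. The function name `reify` and the tuple representation are assumptions made for illustration; only the `bk:` property names come from the slide.

```python
def reify(rel_id, subject, predicate, obj, attributes=None):
    """Return the straight triple followed by its reified bk:Relation form.

    `attributes` maps property names (e.g. "bka:score") to values and is
    attached to the reified relation node, not to the straight triple.
    """
    straight = (subject, predicate, obj)
    reified = [
        (rel_id, "a", "bk:Relation"),
        (rel_id, "bk:relTypeRef", predicate),
        (rel_id, "bk:relFrom", subject),
        (rel_id, "bk:relTo", obj),
    ]
    for prop, value in (attributes or {}).items():
        reified.append((rel_id, prop, value))
    return [straight] + reified


relation = reify(
    "bkr:citation_TOB1_15489334",
    "bkr:TOB1",
    "bk:published_in",
    "bkr:15489334",
    {"bka:score": 0.95, "bk:evidence": "bkev:TextMining"},
)
for triple in relation:
    print(triple)
```

Because the straight triple is always emitted first, consumers that ignore reification still see the plain `bk:published_in` link.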
Talking to the Rest of the World
BioKNO | External Ontologies | Mapping Type
bk:Concept | skos:Concept | Subclass
bk:Relation, bk:relFrom, bk:relTypeRef, bk:relTo | rdf:Statement, rdf:subject, rdf:predicate, rdf:object | Subclass, Subproperties (i.e., mapping to RDF reified statements)
bk:Path, bk:Participant, bk:Interaction, bk:Transport, bk:Protein, bk:Gene | Classes with same names in BioPAX and SIO | Equivalent Class
bk:participates_in, bk:has_participant | Relation Ontology (RO) properties with same names; biopax:participant (as sub-property) | Equivalent property
bk:produces, bk:produced_by, bk:consumes, bk:consumed_by | biopax:product (as sub-property); RO properties with same names | Equivalent property
bk:regulates, bk:positively_regulates, bk:negatively_regulates | RO properties with same names | Equivalent property
bk:is_a, bk:part_of, bk:has_part, bk:occurs_in, bk:co_occurs_with | skos:broader; Basic Formal Ontology (BFO)/RO properties with same names | Equivalent property
bk:Publication | schema:CreativeWork | Subclass
bka:abstract, bka:title (also known as AbstractHeader), bka:authors | dcterms:description, dcterms:title, dc:creator | Sub-property
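To illustrate what the subclass and sub-property mappings buy in practice, the sketch below applies a single step of RDFS-style inference to a BioKNO reified relation, yielding the standard RDF reification triples. The dictionaries copy the bk:Relation row of the mapping; the `infer` function and the `bkr:rel1` example node are hypothetical, and a real deployment would rely on a triple store's reasoner instead.

```python
# Mappings taken from the bk:Relation row of the table
SUBPROPERTY_OF = {
    "bk:relFrom": "rdf:subject",
    "bk:relTypeRef": "rdf:predicate",
    "bk:relTo": "rdf:object",
}
SUBCLASS_OF = {"bk:Relation": "rdf:Statement"}


def infer(triples):
    """Add triples entailed by the mappings (one step, no transitive closure)."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in SUBPROPERTY_OF:
            inferred.add((s, SUBPROPERTY_OF[p], o))
        if p == "rdf:type" and o in SUBCLASS_OF:
            inferred.add((s, "rdf:type", SUBCLASS_OF[o]))
    return inferred


# Hypothetical reified relation, written with rdf:type instead of Turtle's "a"
asserted = {
    ("bkr:rel1", "rdf:type", "bk:Relation"),
    ("bkr:rel1", "bk:relFrom", "bkr:TOB1"),
    ("bkr:rel1", "bk:relTypeRef", "bk:published_in"),
    ("bkr:rel1", "bk:relTo", "bkr:15489334"),
}
entailed = infer(asserted)
for triple in sorted(entailed - asserted):
    print(triple)
```

A generic RDF tool that only understands `rdf:Statement` can therefore still read BioKNO relations once the mapping is applied.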
SPARQL for Extraction, Loading, Transformation
(The Simpler-than-Ondex Way)
# Prefix declarations are omitted for brevity.
CONSTRUCT {
  ?path a bk:Path;
    bk:prefName ?pathName;
    bk:evidence bkev:IMPD.
  ?bkProt a bk:Protein;
    dc:identifier ?bkProtAccUri;
    bk:prefName ?protName;
    bk:participates_in ?path.
  ?bkProtAccUri a bk:Accession;
    dcterms:identifier ?protName;
    bk:dataSource bkds:UNIPROTKB.
}
WHERE
{
  ?path a bp:Pathway;
    bp:displayName ?pathName;
    bp:pathwayComponent ?comp.
  {
    ?comp a bp:BiochemicalReaction;
      bp:left|bp:right ?protein.
  }
  UNION {
    # Reuses ?comp, so that the complex is the pathway component matched above
    ?comp a bp:Complex;
      bp:component ?protein.
  }
  ?protein a bp:Protein;
    bp:displayName ?protName.
  # Mint the BioKNO protein URI and its accession URI from the display name
  BIND ( IRI ( CONCAT ( STR ( bkr: ), STR ( ?protName ) ) ) AS ?bkProt )
  BIND ( IRI ( CONCAT ( STR ( ?bkProt ), "_acc" ) ) AS ?bkProtAccUri )
}
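The two BIND clauses mint a protein URI and an accession URI by string concatenation. A rough Python equivalent is sketched below. Two things here are assumptions and not part of the query: the `BKR` namespace value is a placeholder, and the display name is percent-encoded, which the query as shown does not do (a name containing spaces would otherwise yield an invalid IRI).

```python
from urllib.parse import quote

# Placeholder namespace for the bkr: prefix (assumption, not the real one)
BKR = "http://example.org/resource/"


def mint_uris(prot_name, base=BKR):
    """Return (protein URI, accession URI) built from a protein display name."""
    prot_uri = base + quote(prot_name, safe="")  # percent-encode spaces etc.
    acc_uri = prot_uri + "_acc"                  # same suffix as in the BIND clause
    return prot_uri, acc_uri


prot_uri, acc_uri = mint_uris("TOB1 HUMAN")
print(prot_uri)  # http://example.org/resource/TOB1%20HUMAN
print(acc_uri)   # http://example.org/resource/TOB1%20HUMAN_acc
```

In SPARQL itself, the standard `ENCODE_FOR_URI` function plays the role of `quote` here.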
SPARQL/RDF for ELT
• TARQL: uses SPARQL to convert tabular CSV files to RDF
• RDF/XML can be transformed via XSL
  • We have done it for bio-specific ontology definitions in Ondex
• Programmatic conversions
  • Using RDF frameworks, e.g., Jena, RDF4J (formerly Sesame), rdflib for Python
  • See also java2rdf (https://github.com/EBIBioSamples/java2rdf)
  • We have used it for the Ondex->RDF converter
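As a minimal illustration of the "programmatic conversion" route, the sketch below turns CSV rows into BioKNO-style triples using only the Python standard library. It is not TARQL and does not use any of the RDF frameworks listed; the column names, the sample row and the `rows_to_triples` helper are all hypothetical.

```python
import csv
import io

# Hypothetical input: one protein per row, with the pathway it participates in
CSV_TEXT = "accession,name,pathway\nTOB1,TOB1 HUMAN,id1\n"


def rows_to_triples(csv_text):
    """Map each CSV row to (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        protein = "bkr:" + row["accession"]
        pathway = "wp:" + row["pathway"]
        label = '"' + row["name"] + '"'
        triples.append((protein, "a", "bk:Protein"))
        triples.append((protein, "bk:prefName", label))
        triples.append((protein, "bk:participates_in", pathway))
    return triples


csv_triples = rows_to_triples(CSV_TEXT)
for s, p, o in csv_triples:
    print(f"{s} {p} {o} .")
```

TARQL expresses the same row-to-triples mapping declaratively, as a SPARQL CONSTRUCT whose variables are bound to the CSV columns.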
The Cypher Query/DML Language
Proteins->Reactions->Pathways:
// chain of paths, node selection via property (exploits indices)
MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction)
  - [:part_of] -> (pway:Path{ title: 'apoptosis' })
// further conditions, not always so performant
WHERE prot.name =~ '(?i)^DNA.+'
// Usual projection and post-selection operators
RETURN prot.name, pway
// Relations can have properties
ORDER BY csby.pvalue
LIMIT 1000

Proteins->Reactions->Pathways:
// Single-path (or same-direction branching) easy to write
MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
  - [:part_of*1..3] -> (pway:Path)
RETURN ID(prot), ID(pway) LIMIT 1000
// Very compact forms available, depending on the data
MATCH (prot:Protein) -- (pway:Path) RETURN pway
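To make the variable-length pattern `[:part_of*1..3]` concrete, the following sketch walks a tiny in-memory edge list the way such a pattern would be expanded. It is an approximation for illustration only: the sample edges and function names are invented, node labels such as `:Path` are not checked, and Neo4j's actual matching (with its relationship-uniqueness rules and indices) is more involved.

```python
# Hypothetical toy graph as (source, relation type, target) edges
EDGES = [
    ("prot1", "consumed_by", "react1"),
    ("react1", "part_of", "pathA"),
    ("pathA", "part_of", "pathB"),
]


def reachable(start, rel_type, min_hops, max_hops, edges=EDGES):
    """Nodes reached from `start` via min_hops..max_hops edges of `rel_type`."""
    frontier, found = {start}, set()
    for hop in range(1, max_hops + 1):
        # Advance one hop along edges of the requested type
        frontier = {dst for src, rel, dst in edges
                    if src in frontier and rel == rel_type}
        if hop >= min_hops:
            found |= frontier
    return found


def proteins_to_pathways(protein, edges=EDGES):
    """Protein -[produced_by|consumed_by]-> reaction -[part_of*1..3]-> pathway."""
    reactions = {dst for src, rel, dst in edges
                 if src == protein and rel in ("produced_by", "consumed_by")}
    return {pway for react in reactions
            for pway in reachable(react, "part_of", 1, 3, edges)}


print(sorted(proteins_to_pathways("prot1")))  # ['pathA', 'pathB']
```

The Cypher version states the same traversal in one line, which is the compactness the slide refers to.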
Triple Stores vs Property Graphs

Data exchange format
  Neo4j, Cypher DBs, Graph DBs:
    - No official one, just Cypher; support for GraphML, RDF
    +/- Focus on backing applications
  Semantic Web/Triple Stores:
    + Focus on data sharing standards

Data model
  Neo4j, Cypher DBs, Graph DBs:
    + Relations with properties
    - Metadata/schemas/ontologies management
  Semantic Web/Triple Stores:
    - Relations cannot have properties (reification required)
    + Metadata/schemas/ontologies as first-class citizens, with OWL as a standard

Performance
  Neo4j, Cypher DBs, Graph DBs:
    + Complex graph traversals
  Semantic Web/Triple Stores:
    + Comparable in most cases

Query language
  Neo4j, Cypher DBs, Graph DBs:
    + Cypher is easier (e.g., compact, implicit elements)?
    - Expressivity issues (unions)
    - No standard query language (but efforts in progress, e.g., OpenCypher)
  Semantic Web/Triple Stores:
    - SPARQL is harder? (URIs, namespaces, verbosity)
    + SPARQL is more expressive

Standardisation, openness
  Neo4j, Cypher DBs, Graph DBs:
    +/- (TinkerPop is open, Neo4j isn't)
    + Commercial support
    + More alive and up to date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
  Semantic Web/Triple Stores:
    + Natively open, many open implementations
    - Instability and many short-lived prototypes
    - Advancements seem to be slowing down
    + Some nice open and commercial browsers (LODEStar, …)

Scalability, big data
  Neo4j, Cypher DBs, Graph DBs:
    +/- Commercial support for clustering/clouds for Neo4j
    + Open support in TinkerPop
  Semantic Web/Triple Stores:
    + Load balancing/cluster solutions, commercial cloud support (e.g., GraphDB)
    + SPARQL over TinkerPop (via SAIL interface)
Supporting Web APIs via JSON
{
"type": "Protein",
"id": "TOB1",
"prefName": "TOB1 Human",
"participates_in":
{
"type": "Pathway",
"id": "id1",
"evidence": "IMPD",
"prefName": "Bone Morphogenic Protein (BMP) Signalling and Regulation"
},
"is_annotated_by": "GO_0030014"
}
• Designed to be compatible with browsers, i.e., JavaScript
• Language of choice for web APIs, browser-side consumption and dynamic web interfaces (i.e., AJAX)
• Conceptually similar to XML (trees, nested structures)
• Often used in a lightweight way, without many schema constraints
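One way a web API could produce a nested document like the one above from flat, graph-style records is sketched below. The `ENTITIES` store, the `NESTED_LINKS` set and `to_document` are hypothetical names; the choice of which links are embedded and which stay as plain identifiers is an assumption that mirrors the example, and the recursion assumes embedded links contain no cycles.

```python
import json

# Hypothetical flat store: one record per entity, links given as target ids
ENTITIES = {
    "TOB1": {
        "type": "Protein",
        "prefName": "TOB1 Human",
        "participates_in": "id1",
        "is_annotated_by": "GO_0030014",
    },
    "id1": {
        "type": "Pathway",
        "evidence": "IMPD",
        "prefName": "Bone Morphogenic Protein (BMP) Signalling and Regulation",
    },
}

# Links whose target is embedded as a nested object instead of a bare id
NESTED_LINKS = {"participates_in"}


def to_document(entity_id, entities=ENTITIES):
    """Build a nested, JSON-serialisable dict for one entity."""
    record = entities[entity_id]
    doc = {"type": record["type"], "id": entity_id}
    for key, value in record.items():
        if key == "type":
            continue
        doc[key] = to_document(value, entities) if key in NESTED_LINKS else value
    return doc


document = to_document("TOB1")
print(json.dumps(document, indent=2))
```

Embedding only selected links keeps the tree finite, at the cost of the client having to resolve the remaining identifiers (such as the GO term) with further calls.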
Take-Home Messages
• From a small data integration farm to sharing with the rest of the world => FAIR Principles
• Semantic Web has pros and cons
  • Still useful for data model and schema governance, identifiers, complex models (namely, ontologies)
• Alternative data sharing approaches, property graphs (PG) in particular
  • More lively area, can be simpler (blends into existing industrial software better)
  • LOD/FAIR principles not addressed much
  • Integrating the two is useful
• APIs are a useful alternative/complementary approach
  • LOD/FAIR principles to be addressed as well
• On our radar:
  • Complete the work, publishing SPARQL, Neo4j access, APIs
  • Integrate similar projects in the agrifood field (e.g., BrAPI, DFW)
  • Contribute to standardisation efforts like Bioschemas