5. The Challenge
Variety of users / diversity of scientific questions
Scientists
Medical Doctors
Data Scientists
Graph database
6. Biological question:
Are human T2D genes enzymes acting on metabolites which in turn are regulated in a pig diabetes model?
The actual question (from a data point of view):
Is there a connection between A and R?
=> 3s to look into the Excel sheet
Why graph? Easy scientific question
7. The actual question (from a data point of view):
Is there a connection between A and R?
=> 3s to look into the graph
[Figure: graph with nodes A, B, C, D, E, F, G, K, Q, R, S, U, W, Z]
Why graph? Easy scientific question
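As a minimal sketch of why this question is cheap to answer on a graph: "is there a connection between A and R?" reduces to a breadth-first search. The node names come from the slide, but the edge list below is hypothetical, since the actual edges of the figure are not recoverable from this transcript.

```python
from collections import deque

# Hypothetical undirected edges between nodes named on the slide;
# the real connections in the figure are not known.
EDGES = [("A", "B"), ("B", "C"), ("C", "K"), ("K", "R"), ("D", "E"), ("F", "G")]


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal, or None if unconnected."""
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorted only to make the traversal order deterministic.
        for neighbour in sorted(adjacency[node]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


adjacency = build_adjacency(EDGES)
path_a_to_r = shortest_path(adjacency, "A", "R")
```

With these assumed edges, `path_a_to_r` holds the chain of nodes linking A to R, while a pair with no connecting chain yields `None`.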
8. Back to the question
Are human T2D genes enzymes acting on metabolites which in turn are regulated in a pig diabetes model?
Genomics
Human diabetic data
Genes
SNPs
Proteins
Enzymes
Pathways
Metabolites
Metabolomics
Pre-diabetic pig
Metabolites
List of SNPs
List of Genes of (species 1)
List of Proteins of (species 1)
List of loci
List of Enzymes of (species 1)
List of Pathways of (species 1)
List of Metabolites of (species 1)
List of Metabolites of (species 2)
graph
9. Why graph? -> why not relational
• biomedical / healthcare data is highly connected
• => variety of data
=> unstructured
=> heterogeneous
=> not connected
=> unFAIR
• easy to model
• extremely flexible / easily adaptable ("re-shaping the graph") vs. static SQL model
• scalable (billions of nodes + relationships on a single machine)
• easy to query (cyclic dependencies)
• Graph Data Science library + graph embeddings
12. DZDconnect: stats
• PROD server: 323m nodes, 1.1bn relationships => 480GB
• DEV server: 1.1bn nodes, 4.8bn relationships
• Single server (60 CPUs, 256GB memory, only SSDs)
• 4 developers
• Neo4j Enterprise (live backup, GDS)
• UI: Flask web server, SemSpect, Neo4j Browser
• Visualization for interactive browsing (SemSpect by derive GmbH)
• Bloom (semi-natural-language queries)
Strata Data Award finalist 2019
bytes4diabetes Award 2020
Graphie Award 2018
We have DB role model
13. DZDconnect: data integration + ML
[Figure: Gene, RNA and Protein nodes linked by CODES relationships, plus a CODES* relationship]
• Python
• Py2Neo, GraphIO
• Docker pipeline for orchestration (open source by DZD)
• Based on integrated data => annotate / enrich
• text matching + Natural Language Processing
• "shortcuts" for queries (reduce #hops)
• inferring knowledge
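The "shortcuts" idea can be sketched as follows: wherever a two-hop chain such as Gene -CODES-> RNA -CODES-> Protein exists, a direct relationship is materialised so that queries need one hop instead of two. This is an illustration under assumptions, not the DZDconnect pipeline: the identifiers are placeholders, the helper `infer_shortcuts` is hypothetical, and reading CODES* as a direct Gene-to-Protein shortcut is an interpretation of the slide.

```python
# Placeholder identifiers; real data would come from the integrated sources.
gene_codes_rna = [("gene1", "rna1"), ("gene1", "rna2"), ("gene2", "rna3")]
rna_codes_protein = [("rna1", "protA"), ("rna2", "protB"), ("rna3", "protB")]


def infer_shortcuts(first_hop, second_hop):
    """Compose two relationship lists into direct (start, end) shortcut pairs."""
    targets_by_middle = {}
    for middle, end in second_hop:
        targets_by_middle.setdefault(middle, set()).add(end)
    shortcuts = set()
    for start, middle in first_hop:
        for end in targets_by_middle.get(middle, ()):
            shortcuts.add((start, end))
    return sorted(shortcuts)


# Hypothetical shortcut relationships (the assumed meaning of CODES*).
gene_codes_protein = infer_shortcuts(gene_codes_rna, rna_codes_protein)
```

In a graph database the same composition would typically be written back as new relationships, trading some storage for shorter query paths.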
16. The Challenge
User with a specific input => specific output
Scientist
multi-omics experiment output
Flask app
17. The Challenge
User "start somewhere -> freely explore knowledge"
SemSpect interactive browsing
Start from any node
Scientist or Medical Doctor
18. The Challenge
User with data analysis skills / computer scientist
Scientist
Start from any node
Cypher query language
Graph Data Science
19. Use case 1
Handle mapping identifiers of molecular entities
Knowledge Graph
20. Query "friends of a friend" on a gene level
Example: diabetes-relevant gene "TCF7L2"
match path=(g:Gene{sid:'TCF7L2'})-[:MAPS|SYNONYM*0..2]-(g1:Gene) return path
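To make the `*0..2` part of the query concrete, here is a small in-memory sketch of the same idea: collect every gene node reachable from a start node within at most two MAPS or SYNONYM hops, where zero hops returns the start node itself. All identifiers other than TCF7L2 are hypothetical placeholders, and this simplified version returns the reachable nodes rather than full paths as the Cypher query does.

```python
# Hypothetical undirected MAPS / SYNONYM relationships between gene identifiers.
RELATIONSHIPS = [
    ("TCF7L2", "MAPS", "gene_id_1"),
    ("TCF7L2", "SYNONYM", "alias_1"),
    ("gene_id_1", "MAPS", "gene_id_2"),
    ("gene_id_2", "MAPS", "gene_id_3"),
]


def within_hops(relationships, start, max_hops, allowed_types):
    """Return the set of nodes reachable from start in 0..max_hops undirected hops."""
    neighbours = {}
    for source, rel_type, target in relationships:
        if rel_type in allowed_types:
            neighbours.setdefault(source, set()).add(target)
            neighbours.setdefault(target, set()).add(source)
    reached = {start}   # zero hops: the start node itself
    frontier = {start}
    for _ in range(max_hops):
        # Expand one hop and keep only nodes not seen before.
        frontier = {n for node in frontier for n in neighbours.get(node, ())} - reached
        reached |= frontier
    return reached


related = within_hops(RELATIONSHIPS, "TCF7L2", 2, {"MAPS", "SYNONYM"})
```

With these assumed relationships, `gene_id_3` sits three hops away and is therefore excluded, which is exactly the effect of the upper bound in `*0..2`.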
21. Use case 2
Find information that is NOW connected
Knowledge Graph
22. Query for SNPs (mutations) associated with diabetes
Output: relevant protein and its function (ontology terms)
match (tr:Trait)
where tr.name contains 'diabetes mellitus'
with tr as disease
match path=(disease)<-[:ASSOCIATED_WITH_TRAIT]-(asso:Association)<-[:SNP_HAS_ASSOCIATION]-(snp:SNP)-
[:SNP_HAS_GENE]-(gene:Gene)-[:MAPS]-(g1:Gene)-[x:CODES]->(transcript:Transcript)-[:CODES]->
(prot:Protein)-[:ASSOCIATION]->(term:Term)--(o:Ontology)
return path
23. Use case 3
Using graph algorithms to infer new insights
Natural Language Processing
Ontologies
Knowledge Graph
24. Google's PageRank algorithm - find the most relevant gene
Finding ACE2 - the receptor the SARS-CoV-2 virus uses to enter the cell
• 140,000 abstracts from COVID-19-related publications
• Named Entity Recognition of gene names
• PageRank identified "ACE2" as the most relevant gene
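A minimal power-iteration sketch of PageRank on a tiny gene graph. The slide does not say how the graph was built from the abstracts or which parameters were used, so the edges, the damping factor of 0.85, the iteration count, and every gene placeholder other than ACE2 are assumptions for illustration. This hand-written version only shows the idea and is not the Graph Data Science library implementation.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical edges: each placeholder gene points at genes it is mentioned with.
EDGES = [
    ("gene_a", "ACE2"), ("gene_b", "ACE2"), ("gene_c", "ACE2"),
    ("ACE2", "gene_a"), ("gene_c", "gene_b"),
]
scores = pagerank(EDGES)
top_gene = max(scores, key=scores.get)
```

In this toy graph the node with the most incoming links collects the highest score, which mirrors how a frequently co-mentioned gene would surface as the most relevant one.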
31. k-nearest neighbour clustering with k=5 representing the 5 diabetes subtypes
[Figure: graph algorithms applied to patient nodes (patient 01 to patient 05)]
subphenotyping of diabetic patients
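A sketch of the k-nearest-neighbour step: each patient is linked to the k most similar other patients, and the resulting similarity graph is what a grouping step would then work on. The feature vectors, the use of cosine similarity, and k=2 in this example are hypothetical choices for illustration; the slide itself only states k=5 and does not describe the features or the similarity measure.

```python
import math

# Hypothetical numeric feature vectors per patient (e.g. scaled lab values).
PATIENTS = {
    "patient 01": [1.0, 0.1, 0.0],
    "patient 02": [0.9, 0.2, 0.1],
    "patient 03": [0.0, 1.0, 0.9],
    "patient 04": [0.1, 0.9, 1.0],
    "patient 05": [0.8, 0.0, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_graph(vectors, k):
    """Map each item to its k most similar other items, most similar first."""
    graph = {}
    for name, vector in vectors.items():
        scored = [
            (cosine_similarity(vector, other), other_name)
            for other_name, other in vectors.items()
            if other_name != name
        ]
        # Highest similarity first; names break ties deterministically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[name] = [other_name for _, other_name in scored[:k]]
    return graph


neighbours = knn_graph(PATIENTS, k=2)
```

Patients with similar vectors end up pointing at each other, so densely linked groups in the resulting graph are candidates for subphenotypes.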
32. DZDconnect
connect patient data with knowledge graph
Transcript
Gene
Synonyms
Abstract
PubMed
Article
Keyword
MeSH-term
Ontology term
Hello role-model :-)
33. Take home message
• Knowledge graph
• as single point of truth
• connect in-house data
• scalability
• infer new insights
• Use cases:
• simple and advanced (Cypher) queries
• Graph Data Science library (PageRank, kNN)
• Node embeddings for complex data
• NLP
• Visualization of graph
• different users
• Flask app, browser, SemSpect, …