Presented by: John Barry, Senior Platform Developer, Apervita
Experience level: Deep dive
Graph databases are natural representation for many forms of data - from social network relationships, to music genres, to medical vocabulary data. Apervita’s challenge is importing and housing vast medical ontologies, and allowing efficient traversal of the graph retrieving both parent and child nodes across a variety of relationships.
In this session, we will discuss the characteristics of graph databases, an implementation strategy for MongoDB, and an approach for implementing transitive closure for efficient traversal of the graph with limited database trips.
e.g.
distance between two cities (edge)
population of city (node)
Steffi held the #1 rank for 377 weeks total, longer than any tennis player male or female
She’s won 22 Grand Slam singles titles, which is the most held by any player male or female since the Open Era started in 1968
She’s won a Golden Slam, winning all 4 grand slam tournaments AND the olympic gold medal in a single year, and is the only player in the history of tennis to do so
And her name is a homonym for graph, which is how I ended up learning all the previous facts
All tournaments per year should get their own node
Participants, winners, scores
Tracking tennis players and their results?
Tournaments may be independent nodes, one per tournament
Property of Years Won can reside on the edge (as this is per player
Tournaments can actually have the property of years won
All of these have one major issue: walking the graph is computing intensive and results in many queries to traverse the graph
One path list per relationship/unique path
Indexing the path array leads to speedy results
This will allow (with some post-processing):
Transitive Closure
Reachability becomes a trivial query
query example
As graphs get larger, performance becomes an issue
RXNORM 204,000
SNOMED 311,000
ICD10 PCS 76000
Sharding on edge id + graph id
Collection per graph id
Extremely wide child models can break the document size
Implementing pseudo subnodes can alleviate this pressure
Apervita allows medical authors to code to common datasets
Hospitals aren’t forced to conform their data, but can take advantage of our mapping tools to execute algorithms on their own data
Medical Vocabularies are by their nature graphs
SNOMED, ICD-9, ICD-10, RxNorm
The ability to code an algorithm to the “generic” and subsume for the specific is extremely powerful
e.g. Coding to Diabetes Mellitus (SNOMED ID 73211009) can capture all subnodes such as gestational diabetes, diabetes type II, diabetes type I
Think about your data model
What are your nodes?
What are your edges?
What should be a property vs a node?
Use MongoDb’s indexing power to have quick access to your paths for graph calculations
Think about your overall expected graph size
Does the width imply faux nodes?
Should you shard?
If multiple subgraphs, should you separate into individual collections?