Graph Databases
Karol Grzegorczyk
June 10, 2014
2/25
Graph Theory
Seven Bridges of Königsberg problem
defined by Leonhard Euler in 1735
How to find a walk through the city that would
cross each bridge once and only once?
[© Google]
Euler proved that it is impossible to solve
this problem!
G = (V, E)
E ⊆ V × V
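Euler's argument can be sketched in a few lines: a connected multigraph admits a walk that uses every edge exactly once only if it has zero or two vertices of odd degree. The vertex labels below (A for the island, B and C for the river banks, D for the eastern land mass) are an assumption made for illustration; the slide does not name them.

```python
from collections import Counter

# The seven bridges as an undirected multigraph (edge list).
# Vertex names are illustrative labels, not taken from the slide.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]

def degrees(edges):
    """Count how many bridge ends touch each land mass."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def has_euler_walk(edges):
    """Euler's condition for a connected multigraph: 0 or 2 odd-degree vertices."""
    odd = sum(1 for d in degrees(edges).values() if d % 2 == 1)
    return odd in (0, 2)

print(dict(degrees(BRIDGES)))   # every land mass has an odd degree
print(has_euler_walk(BRIDGES))  # False
```

All four land masses have an odd number of bridges, so the condition fails. The check assumes the graph is connected, which holds for Königsberg.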
3/25
Storing Connected Data in a Relational Database
● Relationships do exist in relational databases, but only implicitly, as join conditions and join tables
● Logically, a join is a Cartesian product of tables filtered by the join condition
● Operations in relational databases are index-intensive. Retrieval through an index is fast, but not
constant-time (most often O(log n))
● Traversal queries require hierarchical joins, which are costly. Deep traversal queries become
infeasible, because execution time can grow exponentially with the depth of the join.
● For a given SQL query, the RDBMS builds an in-memory graph data structure.
● Relational databases are often normalized in order to organize the data efficiently.
● Normalization increases the number of joins needed to query the database. Denormalization can be a
partial solution.
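The cost pattern described above can be illustrated with a small sketch. It assumes a hypothetical junction table `person_friend(person_id, friend_id)` and uses a sorted list with binary search as a stand-in for a B-tree index; it is not how any particular RDBMS is implemented.

```python
from bisect import bisect_left, bisect_right

# Hypothetical junction table person_friend(person_id, friend_id),
# kept sorted by person_id so a sorted list can stand in for an index.
PERSON_FRIEND = sorted([
    (1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 6),
])
KEYS = [row[0] for row in PERSON_FRIEND]  # the "index" on person_id

def friends_of(person_id):
    """Index lookup: binary search, O(log n) per call."""
    lo = bisect_left(KEYS, person_id)
    hi = bisect_right(KEYS, person_id)
    return [friend for _, friend in PERSON_FRIEND[lo:hi]]

def friends_at_depth(person_id, depth):
    """Emulates `depth` successive self-joins of the junction table."""
    frontier = {person_id}
    for _ in range(depth):
        # Every row of the intermediate result triggers another index lookup.
        frontier = {f for p in frontier for f in friends_of(p)}
    return sorted(frontier)

print(friends_at_depth(1, 1))  # [2, 3]
print(friends_at_depth(1, 2))  # [4, 5]
print(friends_at_depth(1, 3))  # [6]
```

Each additional level of depth repeats the index lookups for every row produced by the previous level, which is why deep traversals get expensive as the intermediate results grow.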
4/25
Database normalization
● Database normalization is the process of organizing the fields and tables of a relational database to
minimize redundancy.
– Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining
relationships between them.
● Normal forms
– The first normal form (each attribute contains only atomic values)
– The second normal form (each non-key attribute depends on the whole primary key)
– The third normal form (each non-key attribute depends on nothing but the primary key)
● A relational database table is often described as "normalized" if it is in 3NF
● When a database is intended for OLAP rather than OLTP, it is typically denormalized.
● Denormalization is the process of attempting to optimize the read performance of a database by
adding redundant data or by grouping data
● Examples of denormalization techniques:
– Materialised views
– Star schemas
– OLAP cubes
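The trade-off can be sketched with Python's built-in `sqlite3` module. The table and column names are hypothetical, and because SQLite has no materialized views, the denormalized table is built by hand as a rough stand-in for one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each city name is stored in exactly one place.
    CREATE TABLE city   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         city_id INTEGER NOT NULL REFERENCES city(id));
    INSERT INTO city   VALUES (1, 'Krakow'), (2, 'Warsaw');
    INSERT INTO person VALUES (1, 'Anna', 1), (2, 'Jan', 1), (3, 'Ewa', 2);

    -- Denormalized read model: the city name is copied into every row.
    CREATE TABLE person_flat AS
        SELECT p.id, p.name, c.name AS city_name
        FROM person p JOIN city c ON c.id = p.city_id;
""")

# Reading the normalized tables needs a join ...
normalized = conn.execute(
    "SELECT p.name, c.name FROM person p "
    "JOIN city c ON c.id = p.city_id ORDER BY p.id"
).fetchall()

# ... while the denormalized table answers the same question without one.
denormalized = conn.execute(
    "SELECT name, city_name FROM person_flat ORDER BY id"
).fetchall()

print(normalized == denormalized)  # True
```

The price of the join-free read is redundancy: renaming a city now means updating every matching row in `person_flat`.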
5/25
Graph Database Highlights
● Graph data stores provide index-free adjacency: each node references its neighbours directly, so
traversals can be much faster than the equivalent joins in a traditional RDBMS
● Designed predominantly for traversal performance and for executing graph algorithms
● A graph database is a more natural, direct representation of a domain than an RDBMS (no need for
junction tables)
● There is no need to join tables, because the data structure is already “joined” by the edges
that are defined.
● In graph databases denormalization is not needed!
● The interesting thing about graph diagrams is that they tend to contain specific instances of
nodes and relationships, rather than classes or archetypes.
● The main purpose of graph databases is the analysis and visualization of graph data.
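Index-free adjacency can be sketched as nodes that hold direct references to their neighbours, so a traversal only follows references and never consults a global index. This is an in-memory illustration with made-up names, not Neo4j's actual storage format.

```python
class Node:
    """A node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct references, no lookup table involved

    def connect(self, other):
        """Create an undirected edge between self and other."""
        self.neighbours.append(other)
        other.neighbours.append(self)

def reachable_within(start, max_hops):
    """Names of all nodes reachable from start in at most max_hops hops."""
    seen = {start}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbour in node.neighbours:  # one pointer hop per edge
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    seen.discard(start)
    return sorted(n.name for n in seen)

anna, jan, ewa, piotr = (Node(n) for n in ("Anna", "Jan", "Ewa", "Piotr"))
anna.connect(jan)
jan.connect(ewa)
ewa.connect(piotr)

print(reachable_within(anna, 1))  # ['Jan']
print(reachable_within(anna, 2))  # ['Ewa', 'Jan']
```

The work done is proportional to the part of the graph actually visited, not to the total size of the data set.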
6/25
Graph Database Models
● The Property Graph Model
– Model is built of nodes and relationships
– Nodes contain key-value properties. Relationships can carry properties as well.
– Relationships are named and directed, and always have a start and end node
● Hypergraphs
– Generalization of the graph model.
– A relationship (hyperedge) can have any number of nodes at either end (many-to-many relationships)
● Triple stores
– A triple expresses a relationship between two resources.
– The triple is a subject-predicate-object data structure, e.g. Fred likes ice cream
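The three models can be compared as plain data structures. The class names, the `since` property and the hyperedge example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    """Property graph: named, directed, exactly one start and one end node."""
    name: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)

@dataclass
class HyperEdge:
    """Hypergraph: any number of nodes at either end."""
    name: str
    sources: frozenset
    targets: frozenset

fred = Node(1, {"name": "Fred"})
ice_cream = Node(2, {"name": "ice cream"})

# Property graph: the relationship itself can carry properties.
likes = Relationship("likes", fred, ice_cream, {"since": 2014})

# Triple store: the same fact as a bare subject-predicate-object tuple.
triple = (fred.properties["name"], likes.name, ice_cream.properties["name"])

# Hypergraph: one edge connecting several owners to several things.
owns = HyperEdge("owns", frozenset({"Alice", "Bob"}), frozenset({"car", "house"}))

print(triple)  # ('Fred', 'likes', 'ice cream')
```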
7/25
Triple stores
● The Resource Description Framework (RDF) is a framework for expressing
information about resources.
● Resources can be anything, including documents, people, physical objects, and
abstract concepts.
● RDF is intended for situations in which information on the Web needs to be processed
by applications, rather than being only displayed to people.
● RDF is a building block of the Semantic Web movement.
● RDF is a set of W3C specifications
– SPARQL - SPARQL Protocol and RDF Query Language
● Disadvantages
– Lack of index-free adjacency. Data is stored in the form of triples, which are independent
artifacts. In order to traverse the graph one needs to join multiple triples.
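The need to join triples during traversal can be sketched as follows. The triples and helper names are made up for illustration, and a real triple store would use indexes rather than the linear scan shown here.

```python
# A tiny triple store: independent (subject, predicate, object) facts.
TRIPLES = [
    ("Fred", "knows", "Anna"),
    ("Anna", "knows", "Jan"),
    ("Jan", "likes", "ice cream"),
    ("Fred", "likes", "ice cream"),
]

def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def two_hop(subject, first, second):
    """A two-step path is a self-join: the object of the first triple
    must equal the subject of the second."""
    return [o2 for o1 in objects(subject, first) for o2 in objects(o1, second)]

print(two_hop("Fred", "knows", "knows"))  # ['Jan']
print(two_hop("Fred", "knows", "likes"))  # []
```

Every extra hop adds another lookup over the whole set of triples, in contrast to following a direct reference from one node to the next.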
8/25
RDF example
[G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014]
In RDF, resources are identified by IRIs (Internationalized Resource Identifiers).
RDF defines logical relationships. A number of different serialization formats
exist for writing down RDF graphs:
● Turtle
● JSON-LD
● RDFa
● RDF/XML
Popular RDF datasets:
● Wikidata
● DBpedia
● WordNet
● Europeana
● VIAF
9/25
Hypergraphs
[I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
HyperGraphDB
http://www.hypergraphdb.org
Using hypergraphs we lose the ability to add
properties to the individual relationships.
10/25
The Property Graph Model
● The most popular variant of the graph model
● Only binary relationships: each relationship connects exactly one start node to one end node
● Property graph databases are typically schema-less: there is
no notion of a fixed database schema.
● Querying is often done in a specification-by-example way, i.e. by finding
data (nodes and relationships) matching a specified pattern.
● Optimized for traversal
● Popular solutions:
– Neo4j (pure graph DBMS)
– OrientDB (hybrid document and graph DBMS)
11/25
Neo4j
● Written in Java, making use of some high-performance features of the JVM
● Concepts:
– Nodes (can have zero or more properties)
– Relationships (always have direction and a type; can have zero or more properties)
– Labels for grouping nodes together (a node can have zero or more labels; labels have colors assigned)
● Neo4j is a schema-optional graph database (since version 2.0). There are two schema elements:
– Indexes - you can create index on a set of properties of nodes with a specific label (Apache Lucene)
– Constraints - constraint (currently only unique) on a property of nodes of a given label (index will be added automatically)
● Two versions/modes:
– Web server with pure RESTful API and rich web GUI
– Embedded Java library
● The RESTful API was designed with discoverability in mind. Just start with a GET on the service root (e.g.
http://localhost:7474/db/data) and you will get a list of hyperlinks to the available resources.
12/25
Cypher Query Language basics
● Cypher is a declarative query language based on pattern matching
● Basic SQL syntax structure:
SELECT columns FROM table WHERE conditions
● Basic Cypher syntax structure:
MATCH pattern WHERE conditions RETURN nodes
● Patterns are written as ASCII-art graphs, e.g.:
MATCH (x)-->(y) RETURN x
● It is possible to create data with Cypher as well:
CREATE ({key:"value"})
13/25
Cypher basic examples
● Create a simple node
create ({name:"Anna"})
● Retrieve all the nodes
match (x) return x
● Create a labeled node with some properties
create (x:Person {name:"Jan", from: "Poland"})
● Retrieve all the nodes labeled Person whose property from is "Poland"
match (y:Person) where y.from = "Poland" return y
● Create a relationship
match (x) where x.name = "Anna"
match (y:Person)
create (x)-[:knows]->(y)
14/25
Traversal queries
● Find Jan's friends. Return him and his friends.
MATCH (x:Person)-[:knows]-(friends)
WHERE x.name = "Jan"
RETURN x, friends
● Find friends of Jan's friends who like surfing
MATCH (x:Person)-[:knows]-()-[:knows]-(surfer)
WHERE x.name = "Jan"
AND surfer.hobby = "surfing"
RETURN DISTINCT surfer
15/25
Starting points
● Patterns often have starting points, i.e. nodes or relationships that are
explicitly given.
● It is possible to specify the starting point using a WHERE clause (as on the
previous slide), but this can be inefficient when there are no indexes.
● A more explicit way of specifying the starting point (node or relationship) is
the START keyword.
● These starting points are obtained via index lookups or, more rarely,
accessed directly by node or relationship ID
– START n=node:index-name(key = "value")
– START n=node(id)
16/25
START clause example
Find the mutual friends of user named “Michael”
[I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
START a=node:user(name='Michael')
MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a), (c)-[:KNOWS]->(a)
RETURN b, c
17/25
D3.js based graph visualization of the example data set
18/25
Transaction management
● Neo4j provides full ACID support
● All relationships must have a valid start node and end node. In
effect, trying to delete a node that still has relationships
attached throws an exception upon commit.
● When updating or inserting massive amounts of data, the periodic
commit query hint (USING PERIODIC COMMIT) can be helpful.
● Currently only one isolation level (READ_COMMITTED) is supported.
● In order to execute a query inside a transaction, POST the query to
http://localhost:7474/db/data/transaction/{id}
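A request for this endpoint could be assembled roughly as below. The sketch assumes the `statements` / `statement` / `parameters` JSON shape of the Neo4j 2.x transactional HTTP endpoint and the `{name}` parameter syntax of that era; the transaction id 7 is a placeholder. The request is only built, not sent, so no server is needed to run it.

```python
import json
from urllib.request import Request

BASE = "http://localhost:7474/db/data/transaction"

def build_request(statement, parameters=None, tx_id=None):
    """Build (but do not send) a POST for the transactional endpoint.

    With tx_id=None the request targets the base URL, which is assumed
    to open a new transaction; otherwise it targets an existing one.
    """
    url = BASE if tx_id is None else f"{BASE}/{tx_id}"
    body = {"statements": [
        {"statement": statement, "parameters": parameters or {}}
    ]}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json; charset=UTF-8",
        },
        method="POST",
    )

req = build_request(
    "MATCH (p:Person) WHERE p.name = {name} RETURN p",
    {"name": "Jan"},
    tx_id=7,  # hypothetical id of an already open transaction
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the result to `urllib.request.urlopen` would send it to a running server.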
19/25
Native Graph Storage
There are separate stores for nodes, relationships and properties. In order to be able to compute a
record’s location at cost O(1), all stores are fixed-size record stores.
Nodes (9 bytes)
Relationships are stored in doubly linked lists, so firstPrevRelId, firstNextRelId, secondPrevRelId and
secondNextRelId are pointers for the next and previous relationship records for the start and end nodes
[I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
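Why fixed-size records give O(1) lookups can be shown with a small sketch: the byte offset of a record is just its id multiplied by the record size. The 9-byte field layout used here (an in-use flag, the id of the first relationship, the id of the first property) is an assumption for illustration, and the store is a plain in-memory byte array rather than Neo4j's store file.

```python
import struct

# Assumed 9-byte node record: in-use flag (1 byte), first relationship id
# (4 bytes), first property id (4 bytes). ">" disables padding.
NODE_RECORD = struct.Struct(">BII")

def offset_of(node_id):
    """O(1): no search, just arithmetic on the record id."""
    return node_id * NODE_RECORD.size

def write_node(store, node_id, in_use, first_rel, first_prop):
    NODE_RECORD.pack_into(store, offset_of(node_id), in_use, first_rel, first_prop)

def read_node(store, node_id):
    return NODE_RECORD.unpack_from(store, offset_of(node_id))

store = bytearray(NODE_RECORD.size * 100)  # room for 100 node records
write_node(store, 42, 1, 7, 13)

print(NODE_RECORD.size)      # 9
print(offset_of(42))         # 378
print(read_node(store, 42))  # (1, 7, 13)
```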
20/25
Scalability
● On a single server, Neo4j is capable of managing 34 × 10^9 nodes
● Currently, only full database replication for read-only purposes is available
– Master-slave architecture to support fault tolerance
– Horizontal scaling for read-mostly workloads
● Open transactions are not shared among members of an HA cluster. Therefore, if you use the
transactional HTTP endpoint in an HA cluster, you must ensure that all requests for a given transaction are sent to the
same Neo4j instance.
● As stated earlier, the data in a graph database are already “joined”, so it is hard to partition (shard) a
graph across multiple machines.
● The Neo4j team is working on this, but it is not ready yet. Ideally, tightly connected nodes
(or nodes belonging to a common domain) would be kept together on the same machine, and loosely
connected nodes (or nodes belonging to different domains) on separate machines.
● The problem is that a connection that is currently loose can become tight in the future,
and vice versa.
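The partitioning difficulty can be made concrete by counting the relationships that would cross machine boundaries under a given placement. The graph, the placements and the function name are illustrative assumptions, not a Neo4j feature.

```python
# Two tightly connected groups (a, b, c) and (d, e, f) joined by one edge.
EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"),
    ("c", "d"),
]

def cut_edges(edges, placement):
    """Edges whose endpoints live on different machines."""
    return [(u, v) for u, v in edges if placement[u] != placement[v]]

# Placement that keeps each group on one machine.
by_group = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Placement that ignores the structure (e.g. round-robin by name).
round_robin = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

print(len(cut_edges(EDGES, by_group)))     # 1
print(len(cut_edges(EDGES, round_robin)))  # 5

# New relationships between the groups erode a placement that used to be good.
grown = EDGES + [("a", "e"), ("b", "f")]
print(len(cut_edges(grown, by_group)))     # 3
```

Every cut edge turns a local pointer hop into a network round trip, so a good placement minimizes the cut, and it has to be revisited as the graph evolves.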
21/25
Graph algorithms
● Both graph theory and graph algorithms are mature and well-understood fields of
computer science, and both can be used to mine sophisticated information
from graph databases.
● Neo4j supports both depth-first and breadth-first search
– Search type can be specified using BranchSelector and BranchOrderingPolicy
● Graph algorithms available in Neo4j
– all paths (find all paths between two nodes)
– all simple paths (find paths with no repeated nodes)
– shortest paths (find paths with the fewest relationships)
● Can find all shortest paths (if there is more than one) or just the first one.
– Dijkstra (find paths with the lowest cost)
– A* (Dijkstra's algorithm guided by a heuristic estimate of the remaining cost)
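The difference between "fewest relationships" and "lowest cost" can be sketched with textbook breadth-first search and Dijkstra's algorithm. This is a generic illustration on a made-up weighted graph, not Neo4j's implementation or API.

```python
import heapq
from collections import deque

# Hypothetical directed graph: node -> {neighbour: cost}.
GRAPH = {
    "A": {"B": 1, "D": 10},
    "B": {"C": 1},
    "C": {"D": 1},
    "D": {},
}

def shortest_path(graph, start, goal):
    """Breadth-first search: a path with the fewest relationships."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, {}):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

def dijkstra(graph, start, goal):
    """Lowest total cost; returns (cost, path) or None if unreachable."""
    heap = [(0, start, [start])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in done:
                heapq.heappush(heap, (cost + weight, neighbour, path + [neighbour]))
    return None

print(shortest_path(GRAPH, "A", "D"))  # ['A', 'D']
print(dijkstra(GRAPH, "A", "D"))       # (3, ['A', 'B', 'C', 'D'])
```

The direct edge wins on hop count, while the three-hop route wins on total cost.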
22/25
Example of finding the shortest path using REST API
Example request
POST http://localhost:7474/db/data/node/35/path
Accept: application/json; charset=UTF-8
Content-Type: application/json
{
"to" : "http://localhost:7474/db/data/node/30",
"max_depth" : 3,
"relationships" : {
"type" : "to",
"direction" : "out"
},
"algorithm" : "shortestPath"
}
Example response
200: OK
Content-Type: application/json; charset=UTF-8
{
"start" : "http://localhost:7474/db/data/node/35",
"nodes" : [ "http://localhost:7474/db/data/node/35",
"http://localhost:7474/db/data/node/31","http://localhost:7474/db/data/node/30" ],
"length" : 2,
"relationships" : [ "http://localhost:7474/db/data/relationship/26", "http://localhost:7474/db/data/relationship/32" ],
"end" : "http://localhost:7474/db/data/node/30"
}
23/25
Spring Data Neo4J
Spring Data is an umbrella project that makes it easy to use new data access technologies,
such as non-relational databases, map-reduce frameworks, and cloud based data services.
Spring Data Neo4j is an integration library for Neo4j, and it was the first Spring Data project.
@NodeEntity
public class Movie {
    @GraphId Long id;

    @Indexed(type = FULLTEXT, indexName = "search")
    String title;

    Person director;

    @RelatedTo(type = "ACTS_IN", direction = INCOMING)
    Set<Person> actors;

    @Query("start movie=node({self}) match movie-->genre<--similar return similar")
    Iterable<Movie> similarMovies;
}
24/25
Bibliography
● I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013
● R. Angles, C. Gutierrez, Survey of graph database models, ACM Computing
Surveys (CSUR), 2008
● M. A. Rodriguez, P. Neubauer, The Graph Traversal Pattern, Graph Data
Management: Techniques and Applications, 2011
● J. Partner, A. Vukotic, N. Watt, Neo4j in Action, Manning,
2014
● E. Redmond, J. R. Wilson, Seven Databases in Seven Weeks, The
Pragmatic Bookshelf, 2012
● G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014
25/25
Thank you!

More Related Content

What's hot

Column oriented database
Column oriented databaseColumn oriented database
Column oriented database
Kanike Krishna
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
Derek Stainer
 
An Introduction to Graph Databases
An Introduction to Graph DatabasesAn Introduction to Graph Databases
An Introduction to Graph Databases
InfiniteGraph
 

What's hot (20)

Graphdatabases
GraphdatabasesGraphdatabases
Graphdatabases
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB Database
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph Databases
 
Column oriented database
Column oriented databaseColumn oriented database
Column oriented database
 
JanusGraph DataBase Concepts
JanusGraph DataBase ConceptsJanusGraph DataBase Concepts
JanusGraph DataBase Concepts
 
Intro to Neo4j
Intro to Neo4jIntro to Neo4j
Intro to Neo4j
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
An Introduction to Graph Databases
An Introduction to Graph DatabasesAn Introduction to Graph Databases
An Introduction to Graph Databases
 
NOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4jNOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4j
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Document Database
Document DatabaseDocument Database
Document Database
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4j
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
 
Mongodb - NoSql Database
Mongodb - NoSql DatabaseMongodb - NoSql Database
Mongodb - NoSql Database
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
The Basics of MongoDB
The Basics of MongoDBThe Basics of MongoDB
The Basics of MongoDB
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 

Similar to Graph databases

Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4jBrett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine
 
Find your way in Graph labyrinths
Find your way in Graph labyrinthsFind your way in Graph labyrinths
Find your way in Graph labyrinths
Daniel Camarda
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
National Institute of Informatics
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
Rathachai Chawuthai
 
Neo4j - Graph Database
Neo4j - Graph DatabaseNeo4j - Graph Database
Neo4j - Graph Database
Mubashar Iqbal
 
NOSQL Databases for the .NET Developer
NOSQL Databases for the .NET DeveloperNOSQL Databases for the .NET Developer
NOSQL Databases for the .NET Developer
Jesus Rodriguez
 

Similar to Graph databases (20)

Neo4j graph database
Neo4j graph databaseNeo4j graph database
Neo4j graph database
 
Neo4j: Graph-like power
Neo4j: Graph-like powerNeo4j: Graph-like power
Neo4j: Graph-like power
 
Change RelationalDB to GraphDB with OrientDB
Change RelationalDB to GraphDB with OrientDBChange RelationalDB to GraphDB with OrientDB
Change RelationalDB to GraphDB with OrientDB
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4jBrett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4j
 
Database
DatabaseDatabase
Database
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
 
Graph basedrdf storeforapachecassandra
Graph basedrdf storeforapachecassandraGraph basedrdf storeforapachecassandra
Graph basedrdf storeforapachecassandra
 
Find your way in Graph labyrinths
Find your way in Graph labyrinthsFind your way in Graph labyrinths
Find your way in Graph labyrinths
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2
 
Nosql
NosqlNosql
Nosql
 
Neo4jrb
Neo4jrbNeo4jrb
Neo4jrb
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational Database
 
Neo4j - Graph Database
Neo4j - Graph DatabaseNeo4j - Graph Database
Neo4j - Graph Database
 
NOSQL Databases for the .NET Developer
NOSQL Databases for the .NET DeveloperNOSQL Databases for the .NET Developer
NOSQL Databases for the .NET Developer
 
Mongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorialMongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorial
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
 
Spark
SparkSpark
Spark
 

Recently uploaded

Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
mbmh111980
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
Max Lee
 

Recently uploaded (20)

Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
iGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockiGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by Skilrock
 

Graph databases

  • 2. 2/25 Graph Theory Seven Bridges of Königsberg problem defined by Leonhard Euler in 1735 How to find a walk through the city that would cross each bridge once and only once? [© Google] Euler proved that it is impossible to solve this problem! G = (V, E) E {V × V}⊆
  • 3. 3/25 Storing Connected Data in a Relational Database ● Relationships do exist in the relational databases, but only as a means of joins and joining tables ● Logically, join crates a Cartesian product of tables ● Operations of relational databases are index-intensive. Retrieval based on an index is fast, but not with a constant time (most often O(log 2 n)) ● Traversal queries require hierarchical joins, which are costly. Deep traversal queries are infeasible. Execution time increases exponentially with a depth of a join. ● For a given SQL query, RDBMS creates an in-memory graph data structure. ● Often relational database are normalized in order to efficiently organize data in a database. ● Normalization increases number of joins needed to query the database. Denormalization can be a partial solution.
  • 4. 4/25 Database normalization ● Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy. – Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. ● Normal forms – The first normal form (each attribute contains only atomic values) – The second normal form (each non primary key attribute is dependent on the whole primary key) – The third normal form (each non primary key attribute is dependent on nothing but the primary key) ● A relational database table is often described as "normalized" if it is in the 3NF ● When a database is intended for OLAP rather than OLTP, it is topically denormalized. ● Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data ● Examples of denormalization techniques: – Materialised views – Star schemas – OLAP cubes
  • 5. 5/25 Graph Database Highlights ● Graph data stores provide index-free adjacency resulting in a much better performance, if compared to traditional RDBMS ● Designed predominantly for traversal performance and executing graph algorithms ● Graph database is more natural, direct representation of a domain than RDBMS (no need for junction tables) ● There is no need for joining tables because the data structure is already “joined” by the edges that are defined. ● In graph databases denormalization is not needed! ● The interesting thing about graph diagrams is that they tend to contain specific instances of nodes and relationships, rather than classes or archetypes. ● The main purpose of Graph Databases is analysis and visualization of graphical data.
  • 6. 6/25 Graph Database Models ● The Property Graph Model – Model is built of nodes and relationships – Nodes contain key-value properties. Sometimes relationships as well. – Relationships are named and directed, and always have a start and end node ● Hypergraphs – Generalization of a graph model. – A relationship can have any number of nodes at either end of a relationship (many-to- many relationships) ● Triple stores – A triple expresses a relationship between two resources. – The triple is a subject-predicate-object data structure, e.g. Fred likes ice cream
  • 7. 7/25 Triple stores ● The Resource Description Framework (RDF) is a framework for expressing information about resources. ● Resources can be anything, including documents, people, physical objects, and abstract concepts. ● RDF is intended for situations in which information on the Web needs to be processed by applications, rather than being only displayed to people. ● RDF is a building block of the Semantic Web movement. ● RDF is a set of W3C specifications – SPARQL - SPARQL Protocol and RDF Query Language ● Disadvantages – Lack of index-free adjacencies. Data is stored in form of triplets which are independent artifacts. In order to traverse the graph one need to join multiple triplets.
  • 8. 8/25 RDF example [G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014] In RDF, resources are described by IRI - International Resource Identifier RDF define logical relationships. A number of different serialization formats exist for writing down RDF graphs: ● Turtle ● JSON-LD ● RDFa ● RDF/XML Popular RDF datasets: ● Wikidata ● Dbpedia ● WordNet ● Europeana ● VIAF
  • 9. 9/25 Hypergraphs [I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013] HyperGraphDB http://www.hypergraphdb.org Using hypergraphs we lose the ability to add properties to the individual relationships.
  • 10. 10/25 The Property Graph Model ● The most popular variant of graph model ● Only one-to-one relationships ● The Property Graph Model databases are typically schema-less. There is no notion of database schema. ● Querying is often done in specification by example way, i.e. by finding data (nodes and relationships) matching the specified pattern. ● Optimization for traversal ● Popular solutions: – Neo4j (pure graph DBMS) – OrientDB (hybrid document and graph DBMS)
  • 11. 11/25 Neo4j ● Written in Java but uses some high-performance features of JVM ● Concepts: – Nodes (can have zero or more properties) – Relationships (always have direction and a type; can have zero or more properties) – Labels for grouping nodes together (a node can have zero or more labels; labels have colors assigned) ● Neo4j is a schema-optional graph database (since 2.0 version). There are two schema elements: – Indexes - you can create index on a set of properties of nodes with a specific label (Apache Lucene) – Constraints - constraint (currently only unique) on a property of nodes of a given label (index will be added automatically) ● Two versions/modes: – Web server with pure RESTful API and rich web GUI – Embedded Java library ● RESTful API was designed with discoverability in mind. Just start with a GET on the service root (e.g. http://localhost:7474/db/data) and you will a list of hyperlinks to available resources.
  • 12. 12/25 Cypher Query Language basics ● Cypher is declarative query language based on pattern matching ● Basic SQL syntax structure: SELECT columns FROM table WHERE conditions ● Basic Cypher syntax structure: MATCH pattern WHERE conditions RETURN nodes ● Patterns are defined in ASCII art graphs, e.g.: MATCH x-->y RETURN x ● It is possible to crate data with Cypher as well: CREATE ({key:"value"})
  • 13. 13/25 Cypher basic examples ● Create a simple node create ({name:"Anna"}) ● Retrieve all the nodes match x return x ● Create a labeled node with some properties create (x:Person {name:"Jan", from: "Poland"}) ● Retrieve all the nodes labeled as Person having parameter from: “Poland” match (y:Person) where y.from = "Poland" return y ● Create a relationship match x where x.name="Anna" match (y:Person) create x-[:knows]->y
  • 14. 14/25 Traversal queries ● Find Jan's friends. Return him and his friends. MATCH (x:Person)-[:knows]-(friends) WHERE x.name = "Jan" RETURN x, friends ● Find friends of Jan's friends who likes surfing MATCH (x:Person)-[:knows]-()-[:knows]-(surfer) WHERE x.name = "Jan" AND surfer.hobby = "surfing" RETURN DISTINCT surfer
  • 15. 15/25 Starting points ● Patterns often have starting points, i.e. nodes or relationships that are explicitly given. ● It is possible to specify the starting point using WHERE clause (as in the previous slide), but it can be inefficient (when there are no indices). ● More proper way of specifying the starting point (node or relationship) is by using the START keyword. ● These starting points are obtained via index lookups or, more rarely, accessed directly based on node or relationship IDs – START n=node:index-name(key = "value") – START n=node(id)
  • 16. 16/25 START clause example Find the mutual friends of the user named “Michael” [I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013] START a=node:user(name='Michael') MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a), (c)-[:KNOWS]->(a) RETURN b, c
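The MATCH clause above describes a directed triangle: c knows b, b knows a, and c also knows a. A rough Python equivalent over a set of directed pairs is sketched below; the edge data is invented (it is not the data set from the book) and the function name is hypothetical.

```python
# Hypothetical directed KNOWS edges (who -> whom); the names are made up.
knows = {
    ("Sarah", "Joe"), ("Joe", "Michael"), ("Sarah", "Michael"),
    ("Bob", "Michael"),
}


def triangles_ending_at(a):
    """Return (b, c) pairs where c->b, b->a and c->a all exist."""
    found = []
    for b, target in knows:
        if target != a:
            continue                            # the pattern needs b -> a
        for c, middle in knows:
            if middle == b and (c, a) in knows:  # c -> b and c -> a
                found.append((b, c))
    return sorted(found)
```

For this sample, `triangles_ending_at("Michael")` yields the single pair `("Joe", "Sarah")`; Bob knows Michael but closes no triangle.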
  • 17. 17/25 D3.js based graph visualization of the example data set
  • 18. 18/25 Transaction management ● Neo4j provides full ACID support ● All relationships must have a valid start node and end node. In effect this means that trying to delete a node that still has relationships attached will throw an exception upon commit. ● When updating or inserting massive amounts of data, the periodic commit query hint (USING PERIODIC COMMIT) can be helpful. ● Currently only one isolation level (READ_COMMITTED) is supported. ● In order to execute a query inside a transaction, POST the query to http://localhost:7474/db/data/transaction/{id}
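A request to the transactional endpoint might be prepared as in the sketch below. It assumes the JSON body shape of the Neo4j 2.x transactional HTTP endpoint (a "statements" list of objects with a "statement" string); the helper names are hypothetical, the transaction id `1` is a placeholder, and the requests are only built, not sent.

```python
import json
import urllib.request

BASE = "http://localhost:7474/db/data/transaction"


def statements_payload(*queries):
    """Wrap Cypher strings in the assumed {"statements": [...]} body."""
    return {"statements": [{"statement": q} for q in queries]}


def build_request(url, payload):
    """Prepare (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


payload = statements_payload('CREATE (n:Person {name:"Jan"}) RETURN id(n)')

# POSTing to the base URL would open a new transaction.
open_tx = build_request(BASE, payload)

# Later statements go to .../transaction/{id}; "1" is a placeholder here,
# the real id comes from the server's response to the first request.
in_tx = build_request(BASE + "/1", payload)
```

Passing either object to `urllib.request.urlopen` would send it to a running server. In an HA cluster every request for one transaction has to reach the same instance.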
  • 19. 19/25 Native Graph Storage There are separate stores for nodes, relationships and properties. In order to be able to compute a record’s location at cost O(1), all stores are fixed-size record stores. Node records take 9 bytes each. Relationships are stored in doubly linked lists, so firstPrevRelId, firstNextRelId, secondPrevRelId and secondNextRelId are pointers to the previous and next relationship records of the start and end nodes. [I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
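The O(1) claim follows from the fixed record size: the location of record n is simply n times the record length. The sketch below illustrates this with a 9-byte node record; the field layout (an in-use flag, the id of the first relationship, the id of the first property) is an assumption made for illustration, and `NodeStore` is a hypothetical class, not Neo4j's store implementation.

```python
import struct

# Assumed layout: 1-byte in-use flag, 4-byte id of the first relationship,
# 4-byte id of the first property = 9 bytes per node record.
NODE_RECORD = struct.Struct(">BII")
assert NODE_RECORD.size == 9

NO_ID = 0xFFFFFFFF  # hypothetical marker meaning "no record"


class NodeStore:
    """Hypothetical fixed-size record store backed by a byte array."""

    def __init__(self, capacity):
        self.data = bytearray(capacity * NODE_RECORD.size)

    @staticmethod
    def offset(node_id):
        """O(1): no search, just one multiplication."""
        return node_id * NODE_RECORD.size

    def write(self, node_id, first_rel=NO_ID, first_prop=NO_ID):
        NODE_RECORD.pack_into(self.data, self.offset(node_id),
                              1, first_rel, first_prop)

    def read(self, node_id):
        in_use, first_rel, first_prop = NODE_RECORD.unpack_from(
            self.data, self.offset(node_id))
        return bool(in_use), first_rel, first_prop


store = NodeStore(capacity=100)
store.write(35, first_rel=26, first_prop=7)
```

Reading node 35 jumps directly to byte 315 of the store. The relationship id found there is the entry point into that node's linked list of relationship records.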
  • 20. 20/25 Scalability ● On a single server, Neo4j is capable of managing 34*10^9 nodes ● Currently, only full DB replication for read-only purposes is available – Master-slave architecture to support fault tolerance – Horizontal scaling for read-mostly purposes ● Open transactions are not shared among members of an HA cluster. Therefore, if you use the transactional endpoint in an HA cluster, you must ensure that all requests for a given transaction are sent to the same Neo4j instance. ● As was stated, in a graph database the data are already “joined”, so it is hard to partition (to shard) a graph across multiple machines. ● The Neo4j team is working on this, but it is not ready yet. It would be desirable to keep tightly connected nodes (or nodes belonging to a common domain) together on the same machine and loosely connected nodes (or nodes belonging to different domains) on separate machines. ● The problem is that a connection that is currently loose can one day become tight, and vice versa.
  • 21. 21/25 Graph algorithms ● Both graph theory and graph algorithms are mature and well-understood fields of computer science, and both can be used to mine sophisticated information from graph databases. ● Neo4j supports both depth- and breadth-first search – Search type can be specified using BranchSelector and BranchOrderingPolicy ● Graph algorithms available in Neo4j – all paths (find all paths between two nodes) – all simple paths (find paths with no repeated nodes) – shortest paths (find paths with the fewest relationships) ● Can find all shortest paths (if there is more than one) or just the first one. – Dijkstra (find paths with the lowest cost) – A* (an improved version of Dijkstra's algorithm)
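The difference between "shortest paths" (fewest relationships) and Dijkstra (lowest cost) can be seen on a tiny example. The following is a textbook-style sketch in plain Python over an invented weighted graph; it is not Neo4j's implementation, and the function names are hypothetical.

```python
import heapq
from collections import deque

# Hypothetical weighted directed graph: node -> {neighbour: cost}.
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"D": 5},
    "C": {"D": 1},
    "D": {},
}


def all_shortest_paths(start, goal):
    """Breadth-first search; returns every path with the fewest relationships."""
    hops = {start: 0}            # fewest hops found so far per node
    parents = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif hops[nxt] == hops[node] + 1:
                parents[nxt].append(node)    # another equally short way in

    def unwind(node):
        if node == start:
            return [[start]]
        return [path + [node]
                for parent in parents.get(node, [])
                for path in unwind(parent)]

    return sorted(unwind(goal))


def dijkstra(start, goal):
    """Lowest total cost; returns (cost, path) or None if unreachable."""
    heap = [(0, start, [start])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, weight in graph[node].items():
            if nxt not in done:
                heapq.heappush(heap, (cost + weight, nxt, path + [nxt]))
    return None
```

From A to D there are two shortest paths by hop count (via B and via C), but only the one via C has the lowest cost.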
  • 22. 22/25 Example of finding the shortest path using REST API Example request POST http://localhost:7474/db/data/node/35/path Accept: application/json; charset=UTF-8 Content-Type: application/json { "to" : "http://localhost:7474/db/data/node/30", "max_depth" : 3, "relationships" : { "type" : "to", "direction" : "out" }, "algorithm" : "shortestPath" } Example response 200: OK Content-Type: application/json; charset=UTF-8 { "start" : "http://localhost:7474/db/data/node/35", "nodes" : [ "http://localhost:7474/db/data/node/35", "http://localhost:7474/db/data/node/31","http://localhost:7474/db/data/node/30" ], "length" : 2, "relationships" : [ "http://localhost:7474/db/data/relationship/26", "http://localhost:7474/db/data/relationship/32" ], "end" : "http://localhost:7474/db/data/node/30" }
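The request and response on this slide could be handled from a client roughly as follows. The sketch only builds the request and parses the response text copied from the slide, so it needs no running server; the helper names (`node_url`, `shortest_path_request`, `node_ids`) are hypothetical.

```python
import json
import urllib.request


def node_url(node_id, base="http://localhost:7474/db/data"):
    return "{}/node/{}".format(base, node_id)


def shortest_path_request(start_id, end_id, rel_type, max_depth=3):
    """Build (but do not send) the POST request shown on the slide."""
    body = {
        "to": node_url(end_id),
        "max_depth": max_depth,
        "relationships": {"type": rel_type, "direction": "out"},
        "algorithm": "shortestPath",
    }
    return urllib.request.Request(
        node_url(start_id) + "/path",
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": "application/json; charset=UTF-8",
                 "Content-Type": "application/json"},
        method="POST",
    )


def node_ids(response):
    """Extract the numeric ids from the hyperlinks in the "nodes" list."""
    return [int(url.rsplit("/", 1)[1]) for url in response["nodes"]]


request = shortest_path_request(35, 30, "to")

# The response from the slide, used here in place of a live server.
sample_response = json.loads("""
{
  "start": "http://localhost:7474/db/data/node/35",
  "nodes": ["http://localhost:7474/db/data/node/35",
            "http://localhost:7474/db/data/node/31",
            "http://localhost:7474/db/data/node/30"],
  "length": 2,
  "relationships": ["http://localhost:7474/db/data/relationship/26",
                    "http://localhost:7474/db/data/relationship/32"],
  "end": "http://localhost:7474/db/data/node/30"
}
""")
```

The path visits nodes 35, 31 and 30, and its "length" of 2 equals the number of relationship links returned.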
  • 23. 23/25 Spring Data Neo4j Spring Data is an umbrella project that makes it easy to use new data access technologies, such as non-relational databases, map-reduce frameworks, and cloud-based data services. Spring Data Neo4j is an integration library for Neo4j, and it was the first Spring Data project. @NodeEntity public class Movie { @GraphId Long id; @Indexed(type = FULLTEXT, indexName = "search") String title; Person director; @RelatedTo(type="ACTS_IN", direction = INCOMING) Set<Person> actors; @Query("start movie=node({self}) match movie-->genre<--similar return similar") Iterable<Movie> similarMovies; }
  • 24. 24/25 Bibliography ● I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013 ● R. Angles, C. Gutierrez, Survey of graph database models, ACM Computing Surveys (CSUR), 2008 ● M. A. Rodriguez, P. Neubauer, The Graph Traversal Pattern, Graph Data Management: Techniques and Applications, 2011 ● J. Partner, A. Vukotic, N. Watt, Neo4j in Action, Manning, 2014 ● E. Redmond, J. R. Wilson, Seven Databases in Seven Weeks, The Pragmatic Bookshelf, 2012 ● G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014

Editor's Notes

  1. G – graph, V – vertex, E – edge
  2. D3 - Data-Driven Documents