Graph databases

Karol Grzegorczyk
June 10, 2014

2/25
Graph Theory
Seven Bridges of Königsberg problem
defined by Leonhard Euler in 1735
How to find a walk through the city that would
cross each bridge once and only once?
Euler proved that it is impossible to solve
this problem!
G = (V, E), where E ⊆ V × V
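Euler's argument can be checked with a few lines of Python. The following is a minimal sketch, not part of the original slides: it assumes the classical layout of the seven bridges over four land masses (the labels A to D are illustrative) and uses the degree-parity criterion, which states that a connected multigraph has a walk using every edge exactly once only if it has zero or two vertices of odd degree.

```python
from collections import Counter

# The seven bridges as edges of a multigraph. The labels are illustrative:
# A = central island, B = north bank, C = south bank, D = eastern land mass.
BRIDGES = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]


def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_euler_walk(edges):
    """Degree-parity test; the graph is assumed to be connected."""
    return len(odd_degree_vertices(edges)) in (0, 2)


print(odd_degree_vertices(BRIDGES), has_euler_walk(BRIDGES))
```

All four land masses have odd degree, so the test fails, which is consistent with Euler's result.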

3/25
Storing Connected Data in a Relational Database
● Relationships do exist in relational databases, but only by means of joins and join tables
● Logically, a join creates a Cartesian product of tables
● Operations in relational databases are index-intensive. Retrieval based on an index is fast, but not
constant-time (most often O(log n))
● Traversal queries require hierarchical joins, which are costly. Deep traversal queries are
infeasible: execution time increases exponentially with the depth of the join.
● For a given SQL query, an RDBMS creates an in-memory graph data structure.
● Relational databases are often normalized in order to organize data efficiently.
● Normalization increases the number of joins needed to query the database. Denormalization can be a
partial solution.

4/25
Database normalization
● Database normalization is the process of organizing the fields and tables of a relational database to
minimize redundancy.
– Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining
relationships between them.
● Normal forms
– The first normal form (each attribute contains only atomic values)
– The second normal form (each non primary key attribute is dependent on the whole primary key)
– The third normal form (each non primary key attribute is dependent on nothing but the primary key)
● A relational database table is often described as "normalized" if it is in the 3NF
● When a database is intended for OLAP rather than OLTP, it is typically denormalized.
● Denormalization is the process of attempting to optimize the read performance of a database by
adding redundant data or by grouping data
● Examples of denormalization techniques:
– Materialised views
– Star schemas
– OLAP cubes
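The trade-off between the two layouts can be illustrated with Python's built-in sqlite3 module. This is only a sketch: the table names (person, city, lives_in, person_flat) and the sample rows are hypothetical and do not come from the slides. The normalized layout needs two joins to answer "who lives where", while the denormalized copy answers the same question with a plain scan at the price of repeating the city name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE city   (id INTEGER PRIMARY KEY, name TEXT);
    -- normalized: the relationship lives in a junction table
    CREATE TABLE lives_in (person_id INTEGER REFERENCES person(id),
                           city_id   INTEGER REFERENCES city(id));
    INSERT INTO person VALUES (1, 'Anna'), (2, 'Jan');
    INSERT INTO city   VALUES (1, 'Krakow');
    INSERT INTO lives_in VALUES (1, 1), (2, 1);
    -- denormalized copy: the city name is repeated for every person
    CREATE TABLE person_flat AS
        SELECT p.name AS person, c.name AS city
        FROM person p
        JOIN lives_in l ON l.person_id = p.id
        JOIN city c ON c.id = l.city_id;
""")

# Reading from the normalized tables requires two joins.
normalized = conn.execute("""
    SELECT p.name, c.name
    FROM person p
    JOIN lives_in l ON l.person_id = p.id
    JOIN city c ON c.id = l.city_id
    ORDER BY p.name
""").fetchall()

# Reading from the denormalized table requires none.
flat = conn.execute(
    "SELECT person, city FROM person_flat ORDER BY person"
).fetchall()

print(normalized == flat)
```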

5/25
Graph Database Highlights
● Graph data stores provide index-free adjacency, resulting in much better traversal performance
compared with a traditional RDBMS
● Designed predominantly for traversal performance and executing graph algorithms
● A graph database is a more natural, direct representation of a domain than an RDBMS (no need for
junction tables)
● There is no need for joining tables because the data structure is already “joined” by the edges
that are defined.
● In graph databases denormalization is not needed!
● The interesting thing about graph diagrams is that they tend to contain specific instances of
nodes and relationships, rather than classes or archetypes.
● The main purpose of graph databases is the analysis and visualization of graph-structured data.
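Index-free adjacency can be contrasted with index-based lookup in a small Python sketch. Everything here is illustrative: the sorted list stands in for an indexed join table, the Node class stands in for a graph record holding direct references, and the names beyond Anna and Jan are made up.

```python
from bisect import bisect_left

# A sorted edge list plays the role of an indexed join table.
EDGES = sorted([("Anna", "Jan"), ("Jan", "Ewa"), ("Ewa", "Piotr")])


def neighbours_via_index(node):
    """Relational style: binary search in the edge table, O(log n) per hop."""
    i = bisect_left(EDGES, (node, ""))
    found = []
    while i < len(EDGES) and EDGES[i][0] == node:
        found.append(EDGES[i][1])
        i += 1
    return found


class Node:
    """Graph style: each node stores direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.out = []


nodes = {name: Node(name) for edge in EDGES for name in edge}
for src, dst in EDGES:
    nodes[src].out.append(nodes[dst])


def walk(start, hops):
    """Follow the first outgoing reference `hops` times; no index is consulted."""
    current = start
    for _ in range(hops):
        current = current.out[0]
    return current.name


print(neighbours_via_index("Jan"), walk(nodes["Anna"], 3))
```

In the second variant the cost of each hop does not depend on the total number of edges, which is the property the first bullet refers to.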

6/25
Graph Database Models
● The Property Graph Model
– Model is built of nodes and relationships
– Nodes contain key-value properties. Sometimes relationships as well.
– Relationships are named and directed, and always have a start and end node
● Hypergraphs
– Generalization of a graph model.
– A relationship can have any number of nodes at either end (many-to-many relationships)
● Triple stores
– A triple expresses a relationship between two resources.
– The triple is a subject-predicate-object data structure, e.g. Fred likes ice cream

7/25
Triple stores
● The Resource Description Framework (RDF) is a framework for expressing
information about resources.
● Resources can be anything, including documents, people, physical objects, and
abstract concepts.
● RDF is intended for situations in which information on the Web needs to be processed
by applications, rather than being only displayed to people.
● RDF is a building block of the Semantic Web movement.
● RDF is a set of W3C specifications
– SPARQL - SPARQL Protocol and RDF Query Language
● Disadvantages
– Lack of index-free adjacency. Data is stored in the form of triples, which are independent
artifacts. In order to traverse the graph, one needs to join multiple triples.
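The join-based traversal mentioned in the last point can be sketched in Python. This is an illustration only: the triples are stored as plain tuples, "Fred likes ice cream" comes from the earlier slide, and the remaining triples are made up.

```python
TRIPLES = [
    ("Fred", "likes", "ice cream"),
    ("Fred", "knows", "Anna"),
    ("Anna", "likes", "surfing"),
]


def objects(subject, predicate):
    """Scan the triple set for every (subject, predicate, ?) match."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def likes_of_friends(person):
    """A two-hop traversal is a join of 'knows' triples with 'likes' triples."""
    return [(friend, thing)
            for friend in objects(person, "knows")
            for thing in objects(friend, "likes")]


print(likes_of_friends("Fred"))
```

Each additional hop adds another lookup over the whole triple set (or over an index on it), rather than following a stored reference.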

8/25
RDF example
[G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014]
In RDF, resources are identified by IRIs (Internationalized Resource Identifiers).
RDF defines logical relationships. A number of different serialization formats
exist for writing down RDF graphs:
● Turtle
● JSON-LD
● RDFa
● RDF/XML
Popular RDF datasets:
● Wikidata
● DBpedia
● WordNet
● Europeana
● VIAF

9/25
Hypergraphs
[I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013]
HyperGraphDB
http://www.hypergraphdb.org
Using hypergraphs we lose the ability to add
properties to the individual relationships.

10/25
The Property Graph Model
● The most popular variant of graph model
● Each relationship connects exactly two nodes (one start node and one end node)
● Property graph databases are typically schema-less. There is
no notion of a database schema.
● Querying is often done in a specification-by-example style, i.e. by finding
data (nodes and relationships) matching the specified pattern.
● Optimization for traversal
● Popular solutions:
– Neo4j (pure graph DBMS)
– OrientDB (hybrid document and graph DBMS)

11/25
Neo4j
● Written in Java but uses some high-performance features of JVM
● Concepts:
– Nodes (can have zero or more properties)
– Relationships (always have direction and a type; can have zero or more properties)
– Labels for grouping nodes together (a node can have zero or more labels; labels have colors assigned)
● Neo4j is a schema-optional graph database (since 2.0 version). There are two schema elements:
– Indexes - you can create an index on a set of properties of nodes with a specific label (Apache Lucene)
– Constraints - a constraint (currently only uniqueness) on a property of nodes with a given label (an index is added automatically)
● Two versions/modes:
– Web server with pure RESTful API and rich web GUI
– Embedded Java library
● The RESTful API was designed with discoverability in mind. Just start with a GET on the service root (e.g.
http://localhost:7474/db/data) and you will get a list of hyperlinks to the available resources.

12/25
Cypher Query Language basics
● Cypher is a declarative query language based on pattern matching
● Basic SQL syntax structure:
SELECT columns FROM table WHERE conditions
● Basic Cypher syntax structure:
MATCH pattern WHERE conditions RETURN nodes
● Patterns are defined in ASCII art graphs, e.g.:
MATCH (x)-->(y) RETURN x
● It is possible to create data with Cypher as well:
CREATE ({key:"value"})

13/25
Cypher basic examples
● Create a simple node
create ({name:"Anna"})
● Retrieve all the nodes
match (x) return x
● Create a labeled node with some properties
create (x:Person {name:"Jan", from: "Poland"})
● Retrieve all the nodes labeled Person that have the property from: "Poland"
match (y:Person) where y.from = "Poland" return y
● Create a relationship from Anna to every Person node
match (x) where x.name="Anna"
match (y:Person)
create (x)-[:knows]->(y)

14/25
Traversal queries
● Find Jan's friends. Return him and his friends.
MATCH (x:Person)-[:knows]-(friends)
WHERE x.name = "Jan"
RETURN x, friends
● Find friends of Jan's friends who like surfing
MATCH (x:Person)-[:knows]-()-[:knows]-(surfer)
WHERE x.name = "Jan"
AND surfer.hobby = "surfing"
RETURN DISTINCT surfer
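The second query can be mirrored by a two-level loop over an in-memory graph. The Python sketch below is only an analogy and does not reproduce Cypher's exact matching semantics: the sample data is made up, the relationship is treated as undirected (as in the pattern without an arrow), and the start person is excluded explicitly.

```python
# Hypothetical sample data: who knows whom, and each person's hobby.
KNOWS = [("Jan", "Anna"), ("Anna", "Ewa"), ("Anna", "Piotr"),
         ("Jan", "Ola"), ("Ola", "Ewa")]
HOBBY = {"Ewa": "surfing", "Piotr": "chess", "Ola": "surfing"}


def neighbours(person):
    """Undirected lookup, like -[:knows]- without an arrow."""
    return {b if a == person else a for a, b in KNOWS if person in (a, b)}


def surfing_friends_of_friends(person):
    """Two hops along 'knows', filtered by hobby, without duplicates."""
    found = set()  # the set plays the role of DISTINCT
    for friend in neighbours(person):
        for fof in neighbours(friend):
            if fof != person and HOBBY.get(fof) == "surfing":
                found.add(fof)
    return sorted(found)


print(surfing_friends_of_friends("Jan"))
```

Ewa is reachable through both Anna and Ola but is reported once. Ola also surfs, but she is a direct friend rather than a friend of a friend in this data set.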

15/25
Starting points
● Patterns often have starting points, i.e. nodes or relationships that are
explicitly given.
● It is possible to specify the starting point using a WHERE clause (as on the
previous slide), but this can be inefficient (when there are no indexes).
● A more appropriate way of specifying the starting point (node or relationship) is to
use the START keyword.
● These starting points are obtained via index lookups or, more rarely,
accessed directly by node or relationship ID
– START n=node:index-name(key = "value")
– START n=node(id)

16/25
START clause example
Find the mutual friends of user named “Michael”
START a=node:user(name='Michael')
MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a), (c)-[:KNOWS]->(a)
RETURN b, c

17/25
D3.js based graph visualization of the example data set

18/25
Transaction management
● Neo4j provides full ACID support
● All relationships must have a valid start node and end node. In
effect, this means that trying to delete a node that still has
relationships attached will throw an exception upon commit.
● When updating or inserting massive amounts of data, the periodic
commit query hint (USING PERIODIC COMMIT) can be helpful.
● Currently only one isolation level (READ_COMMITTED) is supported.
● In order to execute a query inside a transaction, POST the query to
http://localhost:7474/db/data/transaction/{id}

19/25
Native Graph Storage
There are separate stores for nodes, relationships and properties. In order to be able to compute a
record’s location at cost O(1), all stores are fixed-size record stores.
Nodes (9 bytes)
Relationships are stored in doubly linked lists, so firstPrevRelId, firstNextRelId, secondPrevRelId and
secondNextRelId are pointers for the next and previous relationship records for the start and end nodes
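The O(1) lookup follows directly from the fixed record size: the location of a record is its ID multiplied by the record length. The Python sketch below illustrates the idea with a 9-byte node record. The field layout (an in-use flag plus two 4-byte IDs) is a simplifying assumption for illustration, not Neo4j's exact on-disk format, and write_node / read_node are hypothetical helpers.

```python
import struct

# Assumed layout: 1-byte in-use flag, 4-byte first relationship ID,
# 4-byte first property ID. ">" disables padding, giving 9 bytes in total.
NODE_RECORD = struct.Struct(">BII")


def write_node(store, node_id, first_rel, first_prop):
    """Write a record at offset node_id * record size, growing the store if needed."""
    offset = node_id * NODE_RECORD.size
    missing = offset + NODE_RECORD.size - len(store)
    if missing > 0:
        store.extend(bytes(missing))
    NODE_RECORD.pack_into(store, offset, 1, first_rel, first_prop)


def read_node(store, node_id):
    """Compute the record's location arithmetically; no search, no index."""
    return NODE_RECORD.unpack_from(store, node_id * NODE_RECORD.size)


store = bytearray()
write_node(store, 100, first_rel=7, first_prop=3)
print(NODE_RECORD.size, read_node(store, 100))
```

The relationship ID read from a node record would then be used the same way to locate the first record of that node's relationship chain.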

20/25
Scalability
● On a single server, Neo4j is capable of managing 34 × 10^9 nodes
● Currently, only full database replication for read-only purposes is available
– Master-slave architecture to support fault tolerance
– Horizontal scaling for read-mostly workloads
● Open transactions are not shared among members of an HA cluster. Therefore, if you use the
transactional REST endpoint in an HA cluster, you must ensure that all requests for a given
transaction are sent to the same Neo4j instance.
● As stated earlier, in a graph database the data is already “joined”, so it is hard to partition (shard) a
graph across multiple machines.
● The Neo4j team is working on this, but it is not ready yet. Ideally, tightly connected nodes
(or nodes belonging to a common domain) would be kept together on the same machine, and loosely
connected nodes (or nodes belonging to different domains) on separate machines.
● The problem is that a connection that is currently loose can one day become tight,
and vice versa.
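One way to quantify how good a given partitioning is would be to count the relationships that cross machine boundaries (the edge cut). The Python sketch below uses made-up data: two triangles joined by a single relationship, with nodes assigned to two hypothetical machines numbered 0 and 1.

```python
# Two tightly connected triangles (a, b, c) and (d, e, f) joined by c-d.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "d")]


def edge_cut(edges, placement):
    """Count relationships whose two end nodes live on different machines."""
    return sum(1 for u, v in edges if placement[u] != placement[v])


# Placement that keeps each triangle together on one machine.
by_cluster = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Placement that ignores the structure of the graph.
alternating = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

print(edge_cut(EDGES, by_cluster), edge_cut(EDGES, alternating))
```

Every crossing relationship turns a local pointer hop into a network call during traversal. If new relationships later appear between the two triangles, the first placement stops being a good one, which is the difficulty the last bullet describes.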

21/25
Graph algorithms
● Both graph theory and graph algorithms are mature and well-understood fields of
computer science, and both can be used to mine sophisticated information
from graph databases.
● Neo4j supports both depth- and breadth-first search
– Search type can be specified using BranchSelector and BranchOrderingPolicy
● Graph algorithms available in Neo4j
– all paths (find all paths between two nodes)
– all simple paths (find paths with no repeated nodes)
– shortest paths (find paths with the fewest relationships)
● Can find all shortest paths (if there is more than one) or just the first one.
– Dijkstra (find paths with the lowest cost)
– A* (Dijkstra's algorithm guided by a heuristic estimate of the remaining cost)
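The difference between "fewest relationships" and "lowest cost" can be seen in a small Python sketch. This is a generic textbook illustration, not Neo4j's implementation: the graph, its weights, and the function names are made up.

```python
import heapq
from collections import deque

# Adjacency map: node -> {neighbour: cost of the relationship}.
GRAPH = {"a": {"b": 1, "c": 5}, "b": {"c": 1, "d": 7}, "c": {"d": 1}, "d": {}}


def shortest_path(graph, start, goal):
    """Breadth-first search: a path with the fewest relationships."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def dijkstra_cost(graph, start, goal):
    """Dijkstra's algorithm: the lowest total cost from start to goal."""
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return None


print(shortest_path(GRAPH, "a", "d"), dijkstra_cost(GRAPH, "a", "d"))
```

Here the path with the fewest relationships (a, b, d) costs 8, while the cheapest route (a, b, c, d) costs 3 but uses one more relationship.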

22/25
Example of finding the shortest path using REST API
Example request
POST http://localhost:7474/db/data/node/35/path
Accept: application/json; charset=UTF-8
Content-Type: application/json
{
  "to" : "http://localhost:7474/db/data/node/30",
  "max_depth" : 3,
  "relationships" : {
    "type" : "to",
    "direction" : "out"
  },
  "algorithm" : "shortestPath"
}
Example response
200: OK
Content-Type: application/json; charset=UTF-8
{
  "start" : "http://localhost:7474/db/data/node/35",
  "nodes" : [
    "http://localhost:7474/db/data/node/35",
    "http://localhost:7474/db/data/node/31",
    "http://localhost:7474/db/data/node/30"
  ],
  "length" : 2,
  "relationships" : [
    "http://localhost:7474/db/data/relationship/26",
    "http://localhost:7474/db/data/relationship/32"
  ],
  "end" : "http://localhost:7474/db/data/node/30"
}
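A client could assemble the request above with Python's standard library. The sketch below only builds the request object and does not send it. The helper name shortest_path_request is hypothetical, and the URL layout, headers and body fields are copied from the example request on this slide.

```python
import json
from urllib.request import Request


def shortest_path_request(base, from_id, to_id, rel_type, max_depth=3):
    """Build (but do not send) the POST request for a shortest-path query."""
    body = {
        "to": f"{base}/node/{to_id}",
        "max_depth": max_depth,
        "relationships": {"type": rel_type, "direction": "out"},
        "algorithm": "shortestPath",
    }
    return Request(
        f"{base}/node/{from_id}/path",
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": "application/json; charset=UTF-8",
                 "Content-Type": "application/json"},
        method="POST",
    )


req = shortest_path_request("http://localhost:7474/db/data", 35, 30, "to")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it, assuming a server exposing this API is running at the given address.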

23/25
Spring Data Neo4j
Spring Data is an umbrella project that makes it easy to use new data access technologies,
such as non-relational databases, map-reduce frameworks, and cloud-based data services.
Spring Data Neo4j is an integration library for Neo4j, and it was the first Spring Data project.
@NodeEntity
public class Movie {
    // Node ID assigned by the database
    @GraphId Long id;

    // Title is indexed in the full-text index named "search"
    @Indexed(type = FULLTEXT, indexName = "search")
    String title;

    Person director;

    // People with an incoming ACTS_IN relationship to this movie
    @RelatedTo(type="ACTS_IN", direction = INCOMING)
    Set<Person> actors;

    // Movies that share a genre with this one, computed by a Cypher query
    @Query("start movie=node({self}) " +
           "match movie-->genre<--similar " +
           "return similar")
    Iterable<Movie> similarMovies;
}

24/25
Bibliography
● I. Robinson, J. Webber, E. Eifrem, Graph Databases, O’Reilly Media, 2013
● R. Angles, C. Gutierrez, Survey of graph database models, ACM Computing
Surveys (CSUR), 2008
● M. A. Rodriguez, P. Neubauer, The Graph Traversal Pattern, Graph Data
Management: Techniques and Applications, 2011
● J. Partner, A. Vukotic, N. Watt, Neo4j in Action, Manning, 2014
● E. Redmond, J. R. Wilson, Seven Databases in Seven Weeks, The
Pragmatic Bookshelf, 2012
● G. Schreiber, Y. Raimond, RDF 1.1 Primer, W3C, 2014
