I give a talk through my Graph Database and Python learning journey at PyCon Australia 2015. It should be up on PyVideo soon enough.
Note: A great question was asked regarding why I didn't cover Postgres on the "what should I use" slide. That was a great question. Definitely consider Postgres, especially if you've got existing expertise in it. Rhys Elsemores talk (Just Use Postgres) at the same conference is excellent.
Streamlining Python Development: A Guide to a Modern Project Setup
A quick review of Python and Graph Databases
1. A quick review of
Python and Graph
Databases
NIC CROUCH
@FPHHOTCHIPS
2. Who am I?
◦ Consultant at Deloitte Melbourne
in Enterprise Information Management
◦ Recent graduate of Flinders University in Adelaide
◦ Casual/Enthusiast reviewer of Graph Databases
3. What is a graph?
“A set of objects connected by links” – Wikipedia
Objects: Vertices, nodes, points
Links: Edges, arcs, lines, relationships
4. Prior Work on Graphs in Python
Graph Database Patterns in Python – Elizabeth Ramirez, PyCon US 2015
Practical Graph/Network Analysis Made Simple – Eric Ma, PyCon US 2015
Graphs, Networks and Python: The Power of Interconnection – Lachlan Blackhall, PyCon AU
2014
An introduction to Python and graph databases with Neo4j - Holger Spill, PyCon NZ 2014
Mogwai: Graph Databases in your App – Cody Lee, PyTexas 2014
5. Today: Pythonic Graphs
An exploration of graph storage in Python:
◦ API must be Pythonic
◦ execute(“<Not Python>”) doesn’t count.
◦ As little configuration as possible
Caveats:
◦ No configuration means no tuning
◦ Can’t compare distributed performance on a single node
◦ Limited to rough comparisons of performance – not a lab environment!
6. The Simple
1) Set up a dictionary of nodes
2) Each node keeps a list of relationships (or two, if you want a directed graph)
3) Set up add and get convenience methods
Pros:
• Sometimes the simplest ways are the best
• Very quick
Cons:
• Not consistent
• Probably going to need to be
maintained
• Not persistent
7. The (slightly less) Simple
1) Set up a Shelf of nodes
2) Each node keeps a list of relationships (or two, if you want a directed graph)
3) Set up add and get convenience methods
Pros:
• Still reasonably quick
Cons:
• Not consistent
• Probably going to need to be
maintained
8. Off-topic: NetworkX
All the advantages of using a dictionary with none of the custom code.
◦ Comes with graph generators
◦ BSD Licenced
◦ Loads of standard analysis algorithms
◦ 90% test coverage
◦ … no persistence (except Pickle).
10. The Incumbent: Neo4j
Released in 2007
Written in Java
GPLv3/AGPLv3 or a commercial license
Runs as a server that exposes a REST Interface
Natively uses Cypher – an in-house developed graph query language
Best established, most popular graph-database
Easy to install – unzip and run a script
High Availability, but a little difficult to scale
11. Neo4j from Python
Py2Neo:
◦ Built by Nigel Small from Neo4j
◦ Actively maintained
Neo4j-rest-client
◦ Javier de la Rosa from University of Western Ontario
◦ Maintained through 9 months ago
neo4jdb-python
◦ Jacob Hansson of Neo4j
◦ Maintained through 8 months ago
◦ Mostly just wrappers around Cypher
Bulbflow:
◦ Built by James Thornton of Pipem/Espeed
◦ Maintained to 8 months ago
◦ Connects to multiple backends
12. Py2Neo: Syntax
Set up a connection:
◦ graph=Graph("http://neo4j:password@localhost:7474/db/data/")
Create a node:
◦ graph.create(Node("node_label", name=node_name))
◦ Node labels are like classes
Find a node:
◦ graph.find_one("node_label", property_key="name",property_value=node_name)
Create a relationship:
◦ graph.create(Relationship(node1, relationship, node2))
Find a relationship:
o graph.match_one(node1, relationship, node2, bidirectional=False)
13. Py2Neo: Good and Bad
The good:
Simple API
Well documented
Easy to connect and get started.
Cool (if preliminary) spatial support
Not so much:
◦ Skinny API
◦ No transaction support for Pythonic calls
◦ Performance struggles on large inputs
◦ No ORM (kinda)
14. neo4j-rest-client Syntax
Set up a connection:
◦ graph=GraphDatabase("http://localhost:7474/db/data/", username="username",
password="password")
Create a node:
◦ node=graph.nodes.create(name=node_name)
◦ Node labels are like classes
Find a node:
◦ graph.nodes.filter(Q("name", iexact=node)).elements[0]
Create a relationship:
◦ relationship=node1.is_related_to(node2)
15. neo4j-rest-client:
Good and Bad
Transaction support with a context manager*
Strong filtering syntax
Very strong labelling syntax – searchable tags for nodes
Lazy evaluation of queries
Still REST based – still difficult to make it perform
*Seemingly. Somewhat difficult to make it work.
16. Py2Neo vs Neo4j-Rest-Client:
Performance
100 nodes with 20% connection:
Loading:
Py2Neo: ~8 seconds
Neo4j-rest-client: ~5 seconds
Postgres: 4s
Retrieving:
Py2Neo: ~6 seconds
Neo4j-rest-client: ~5 seconds
Postgres: 4s
1000 nodes with 20% connection:
Loading:
Py2Neo: ~7 minutes
Neo4j-rest-client: ~50 minutes
Postgres: 6 minutes
Retrieving:
Py2Neo: ~7 minutes
Neo4j-rest-client: ~50 minutes
Postgres: 6 minutes
Machine:
AWS Memory Optimised
xLarge node (30GB RAM)
on Ubuntu Server using
iPython2 3.0.0
Important note
Completely unoptimised! No indexes, no attempt to chunk, only
a couple OS optimisations.
17. OrientDB
PyOrient:
◦ Official OrientDB Driver for Python
◦ Binary Driver
◦ Not Pythonic
Released in 2011
More NoSQL than Neo and Titan (Documents as well as graphs)
Scalable across multiple servers
Supports SQL
18. Titan
First released in 2012
Written in Java
Licenced under Apache Licence
Many storage backends, including Cassandra, HBase and BerkeleyDB
Hadoop integration
Large amount of search back-ends
Built for scalability
Commercially supported by DataStax (formerly Aurelius)
19. Titan and Python
Mogwai:
◦ Written by Cody Lee of wellaware
◦ Binary Driver for RexPro Server
◦ Very pythonic!
Bulbflow:
◦ Built by James Thornton of Pipem/Espeed
◦ REST-based interface
◦ Maintained to 8 months ago
◦ Connects to multiple backends
20. RexPro and the
Tinkerpop Stack
Apache Incubator Open Source Graph Framework
◦ Built around Gremlin
◦ Written in Java
◦ Extensively documented
22. So, what should I use?*
Neo4j:
◦ Good, relatively quick
bindings
◦ Well supported
◦ Could be expensive
◦ May not scale
*The full title of this slide is “What should I research further to ensure it meets my specific needs and then
consider using?” In any case, the answer is still “It depends”
It depends.
Titan:
◦ Good bindings
◦ Support in doubt
◦ Should be cheaper
◦ Proven scalability
Orient:
◦ Poor bindings
◦ Well supported
◦ Open pricing structure
◦ Should scale well
23. What about Python Graph Databases?
Not just Python bindings –pure(ish) Python.
GrapheekDB: https://bitbucket.org/nidusfr/grapheekdb
◦ Uses local memory, Kyoto Cabinet or Symas LMDB as backend
◦ Under active development
◦ Exposes client/server interface
◦ Code is Beta quality at best
◦ Documentation is very spotty
Ajgu: https://bitbucket.org/amirouche/ajgu-graphdb/
◦ Uses Berkeley Database backend
◦ Under active development
◦ “This program is alpha becarful”
◦ Python 3 only
24. Ajgu
Set up a connection:
◦ graph = GraphDatabase(Storage('./BSDDB/graph'))
Create a node:
◦ transaction = self.graph.transaction(sync=True)
◦ node = transaction.vertex.create(node)
Find a node:
◦ transaction.vertex.label(start)
Create a relationship:
◦ relationship=transaction.edge.create(node1,node2)
25. Take-aways
Graphs match plenty of data sets
The big three Graph Databases are Neo4j, Titan and Orient
All three have upsides and downsides – depending on the usecase.
If you want to have a bit more fun, try Ajgu or Grapheek!
27. Py2Neo: Performance and
Transactional Support
Large imports should be done in one transaction to decrease overhead:
Graph.create(long_list_of_nodes_and_relationships)
This kills the client (essentially hangs in string processing).
So:
for chunk in izip_longest(*[iter(iterator)]*size, fillvalue=''):
try:
chunk = chunk[0:chunk.index('')]
except ValueError:
pass
try:
self.graph.create(*chunk)
except Exception as ex:
pass #chunk dividing goes here
We lose ACID at this point.
What if this fails? Have to chunk it up again to find what
failed.