I've been looking into graph databases lately, mainly because they just seem interesting to me. My SQL skills are very basic. I've designed some rudimentary tables with a few basic constraints, and the foreign key/primary key relations between tables have usually spanned only 3-4 columns at most. I can do your basic queries, including simple (inner) joins between 2 or 3 tables, and I can do basic inserts and updates. But if you start asking me how transactions work, what the best connection pool policy is, or anything more advanced than that, I will just give you a blank stare. Because I use SQL so infrequently, even simple things like updating a record in a table require me to look up the command.
My team lead at work has said that the people he's trained either get relational DBs or they don't. Apparently, I "get" it, but it still seems a bit awkward to me, especially the syntax of SQL. I mean, a case-insensitive language (yeah yeah, LISP is case-insensitive too)? Also, SQL as I know it isn't a general-purpose language. Plain queries don't give you loops or branching statements, for example (CASE expressions and vendor-specific procedural extensions aside). I also had trouble modeling a tree-type structure in a classic SQL database.
So out of curiosity, I started looking at NoSQL. NoSQL is a family of non-RDBMS style database technologies, including key-value stores (Redis), document databases (CouchDB and MongoDB), column-family stores (HBase, BigTable), and graph databases (Neo4j, OrientDB), among others. The graph databases piqued my interest for some reason, so I decided to investigate them.
As a consequence, I've been looking at Neo4j and OrientDB. Essentially, graph databases are, well... graphs :) You have nodes (vertices), edges (relationships), and properties. As I understand it, every edge in these graphs is directed, but a query can traverse an edge in either direction. What interested me about this style of modeling data is that the relationships are inherent in the graph itself. In other words, the path you traverse to get from one node to another is your relationship. I immediately started seeing the advantages of this.
For example, imagine installing software. The dependencies form a graph structure, and in fact, a directed acyclic graph (DAG). Let's say you need to install a Python script that requires sqlalchemy. What would this look like?
The edge labels are just semantic notation, but notice how each one is a relation? By traversing the graph, we can see how to install script.py. If the relation is either REQUIRES or INSTALLS_BY, then the node it points to is a dependency. I could also set a property on either the edge or the node that holds an installation rule. For example, in the pymysql node, I could have a property like:
install_rule: 'pip install pymysql'
And for sqlalchemy, it would be the same:
install_rule: 'pip install sqlalchemy'
But what about Python itself? Or pip? Either the graph could be extended so that those nodes have their own INSTALLS_BY edges, or the traversal logic could say, "if this is an end node, look up the property 'install_rule' and use that for installation."
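To make the thought experiment a bit more concrete, here is a minimal sketch in plain Python (no graph database involved). The exact edges, the helper names install_order and install_plan, and the install rules for pip and python are my own assumptions for illustration. Only the node names, the REQUIRES and INSTALLS_BY labels, and the two 'pip install' rules come from the example above.

```python
# Nodes with optional properties. The pip and python rules are placeholders
# (assumptions), not real recommendations.
nodes = {
    "script.py": {},
    "sqlalchemy": {"install_rule": "pip install sqlalchemy"},
    "pymysql": {"install_rule": "pip install pymysql"},
    "pip": {"install_rule": "python -m ensurepip"},
    "python": {"install_rule": "<install python with your OS package manager>"},
}

# Directed, labeled edges as (source, label, destination). Assumed layout.
edges = [
    ("script.py", "REQUIRES", "python"),
    ("script.py", "REQUIRES", "sqlalchemy"),
    ("script.py", "REQUIRES", "pymysql"),
    ("sqlalchemy", "INSTALLS_BY", "pip"),
    ("pymysql", "INSTALLS_BY", "pip"),
    ("pip", "REQUIRES", "python"),
]

DEPENDENCY_LABELS = ("REQUIRES", "INSTALLS_BY")


def install_order(target, edges):
    """Return nodes reachable from target, dependencies first (DFS post-order)."""
    deps = {}
    for src, label, dst in edges:
        if label in DEPENDENCY_LABELS:
            deps.setdefault(src, []).append(dst)

    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in deps.get(node, []):
            visit(dep)
        order.append(node)  # appended only after all of its dependencies

    visit(target)
    return order


def install_plan(target, nodes, edges):
    """Collect the install_rule of every node that has one, in install order."""
    return [
        nodes[name]["install_rule"]
        for name in install_order(target, edges)
        if "install_rule" in nodes[name]
    ]


if __name__ == "__main__":
    for step in install_plan("script.py", nodes, edges):
        print(step)
```

The traversal itself is the dependency resolution: walking the REQUIRES and INSTALLS_BY edges from script.py yields everything that has to exist first. This only works because the graph is a DAG, since a cycle would have no valid install order.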
Well, that was just a thought experiment, but it looks way easier to do that than to implement the equivalent in a SQL table (or two). Not to mention that a graph database is supposed to be much faster for this kind of traversal query (from what I've read, the claims are on the order of 1000x to 1500x faster, though I haven't measured this myself).
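For comparison, here is a rough sketch of what the relational version might look like. It assumes SQLite through Python's built-in sqlite3 module, and the table names, column names, and edges are hypothetical ones I made up for the illustration. The recursive common table expression is what stands in for the graph traversal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two tables: one for nodes and their properties, one for labeled edges.
# The schema is an assumption made for this sketch.
conn.executescript("""
    CREATE TABLE node (
        name TEXT PRIMARY KEY,
        install_rule TEXT
    );
    CREATE TABLE edge (
        src   TEXT NOT NULL REFERENCES node(name),
        label TEXT NOT NULL,
        dst   TEXT NOT NULL REFERENCES node(name)
    );
    INSERT INTO node (name, install_rule) VALUES
        ('script.py', NULL),
        ('sqlalchemy', 'pip install sqlalchemy'),
        ('pymysql', 'pip install pymysql'),
        ('pip', NULL),
        ('python', NULL);
    INSERT INTO edge (src, label, dst) VALUES
        ('script.py', 'REQUIRES', 'python'),
        ('script.py', 'REQUIRES', 'sqlalchemy'),
        ('script.py', 'REQUIRES', 'pymysql'),
        ('sqlalchemy', 'INSTALLS_BY', 'pip'),
        ('pymysql', 'INSTALLS_BY', 'pip'),
        ('pip', 'REQUIRES', 'python');
""")

# Recursive CTE: start from the direct dependencies of the target, then keep
# following edges until no new nodes appear. UNION removes duplicates.
rows = conn.execute(
    """
    WITH RECURSIVE deps(name) AS (
        SELECT dst FROM edge WHERE src = ?
        UNION
        SELECT edge.dst FROM edge JOIN deps ON edge.src = deps.name
    )
    SELECT name FROM deps ORDER BY name
    """,
    ("script.py",),
).fetchall()

dependencies = [name for (name,) in rows]
```

It works, but the relationship lives in a join table and has to be reassembled by the query every time, whereas in the graph model the edges are the data. Note that this query only finds the set of dependencies; getting a valid install order out of it would take more work.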
Neo4j seems to have a cleaner architecture and is better documented, but it is GPLv3 only. OrientDB seems to have many more features (it is both a document DB and a graph DB, and it also has an extended subset of SQL for looking up relations), and it uses the more permissive Apache license. Another advantage I noticed in Neo4j is that it is thread-safe: multiple threads can access the same graphdb object (probably because Neo4j enforces that all mutation must occur inside a transaction).
Sadly, all the different open source licenses really have me confused. I'm not even sure whether it is possible for my project to use some parts that are under EPL, some parts that are under BSD, and some parts that are under GPLv3.