Graph databases explained

Graph Databases with Prototypes and Types

An overview of graph database theory. A description of a kind of graph database perfectly suited to enterprise systems.

Lance Gutteridge

Published in

codeburst

8 min readJun 7, 2019

This article is a description of a graph database that is useful for implementing enterprise systems. It is basic graph database theory with some common software patterns added. It is meant as added information to my article:

The Contextual Graph Database of Forms

Forms and the everyday business identifiers (e.g., employee number, project number, etc.) create a natural graph…

codeburst.io

In that article I give a formalization in 21st-century database terms of the paper database that was so effectively used before computers.

This article defines the particular style of graph database used in that formalization.

Graph Databases

A graph database is a set of nodes (sometimes they are called a vertex with the plural vertices). The graph is formed by connections between nodes called edges. This forms a mathematical graph.

A graph has no restrictions on the edges between nodes. This is unlike a specialized graph like a tree where the descendants of a node are never connected to a descendant of a sibling node. So a graph is the most general case of this kind of structure.

In our discussion of graphs we mean by a graph a general graph rather than a simple graph. This means that self-loops (where an edge connects a node to itself) and parallel edges (where two edges connect the same two nodes) are allowed.

A directed graph is when the edges have a direction, i.e., they can be represented by an arrow. There is a source (starting) node and a target (ending) node. We will be talking exclusively about directed graphs.

Nodes can have properties which are stored values that are associated with the node.

The most common form of a graph used for graph databases is a labelled property graph, i.e., a graph where all the node properties have names. This is the type of graph database we will be considering in this article. We can define

Labelled Property Graph: A directed graph where all the properties are name value pairs.

All our discussion in this article will be confined to labelled property graphs.

Sub-graph: A sub-graph is a subset of the nodes with a subset of the set of all edges that connect two nodes in that subset.

Path: A path from one node to another is a chain of edges that lead from the first node to the second following the direction of the edges that does not travel through any node more than once.

In the diagram above there is a path from Node1 to Node5 which goes through Node4.

Every node defines a sub-tree of all the nodes that are connected to it via a path.

A well known example of a labelled property graph is a JavaScript object model. When a JavaScript program is executed there exists at any instant in time a collections of objects. This collection forms a labelled property graph. Certainly every JavaScript object contains member variables which are labelled properties.

In JavaScript there is the concept of a reference. So one object can have a reference to another object. This is done by having a member variable as a var whose value is the other object. When that member variable is assigned to an object its value is a reference to the object. This is basically a pointer. Looking at this as a graph means that there is an edge from one object node to another. So in JavaScript the edge relationship is one of reference.

Prototypes

Notice that with a labelled property graph the nodes are essentially objects, in that they have named members. They can be restricted to something similar to a JavaScript prototype. This means that the properties of a node are forced to be a particular set of names and types dictated by the prototype. There seems to be no current formal term for this kind of graph database so we shall call it a Prototyped Graph Database. Just as in JavaScript, a prototype is a node that has properties. The properties in a prototype describe the properties in the nodes with that prototype. The prototype describes the label of these properties and their type. There are also parameters such as maximum field length, or the number of decimal points. To inherit from a prototype is to have the labelled properties described by the prototype.

Prototyped Graph: A labelled property graph where each node contains an edge that points to its prototype and every node has properties that are identical in name and type to the properties in the prototype. The prototype nodes are special in that they have no prototype themselves.

This is essentially exactly the same inheritance structure as JavaScript. In fact the object model of a JavaScript program as it executes is a Prototyped Graph . In JavaScript the reference to the prototype is not explicit as other vars that the programmer has specified are. Rather the prototype reference can be accessed via Object.getPrototypeOf() function. (The old way of using __prototype__ has been deprecated.)

I’m not trying to suggest here that all such databases have to be implemented in JavaScript. In fact, the implementation I have built of a forms database that models itself on paper systems is built in Java rather than JavaScript. I am using JavaScript here as an example of a prototyped graph as it is familiar to the community reading this article.

Types

Most databases have data types. In the case of object databases these types are extendable. However what is meant here by types in a graph database is that every property has one of a fixed set of data types. The data types themselves are objects that support serialization. This means that a type T supports the following method (using Java as the example language).

public byte[] serialize()

and have a constructor

public <T> (byte[] serialized_data)

Given that each node has a prototype then it is possible to serialize and de-serialize any node from a method in the prototype. The prototype has the information as to what properties are in the node, what their names are, and what type each property has. This is efficient as each node can be serialized as a concatenation of byte buffers as there is no need to carry any type information as that is stored in the prototype. All that is needed is to carry an identity of the prototype.

Prototyped Graph Database

Databases have two functions persistence and access. Persistence is the ability to preserve the data even if the power fails. This is a minor feature when combined with an in-memory database. This graph structure of data allows for an easy implementation of the database pattern I described in the following article:

Doing Without Databases in the 21st Century

I have received a lot of responses for my article “What I’m Telling Business People About Why Relational Databases Are…

codeburst.io

This is clearly not the only way persistence can be handled but it is particularly suitable for persisting a prototyped graph. As I point out in the above article, the size of the fast memory used in computers today allows most enterprise data to be stored directly in memory.

Data Access

The obvious memory model for a graph database is the nodes as a set of objects that are connected by unique identifiers. These are either memory pointers or something that can be looked up and resolved to a memory pointer.

Graph Databases vs. Relational Databases

Graph databases are different from relational databases in that the relationships in a graph database are explicit. In a relational database there are no specified relationships, rather the relationships are expressed via a separate language by joining tables via the values in table cells. This is a highly formalized and artificial approach to storing and retrieving data that was first proposed by theoreticians who had no real business experience. Hence they were not aware of the history of data processing, and unthinkingly threw away the combined experience of countless organizations from the past.

In a graph database information is organized into nodes that are connected by edges. Edges are known paths from one node to another. Because the paths are already determined, gathering related nodes is far, far more efficient than operations of a relational database such as inner or outer joins.

Graph databases are not as rigidly structured as relational databases are. A relational database used for business is always constrained to be in a third normal form (if not a normal form of higher order). This means that the data has been flattened into tables and all independent data has been extracted onto separate tables. This makes the data structures of relational databases quite similar between implementations. The data access language SQL has been standardized (with some dialectal differences). Graphical database are not so constrained. The concepts of node, edge, and properties are much more abstract and can be implemented in a variety of ways. Given this flexibility of implementations, graphical databases often have their own specialized access language.

Graph Databases vs. Semantic Data

There is another model for storing data which is semantic data. A semantic data model regards data as a collection of triples. These triples express a relationship between two objects. The triple is the first object, the name of the relationship, and the second object that is related to the first. This is very similar in concept to a graph database. In fact a graph database can be described by a set of triples where the objects are nodes and the relationship is the edge between the node.

In developing a system for representing enterprise systems I have chosen the graph database representation rather than a semantic data description. The model that drives this description is paper forms and their connections via the IDs that enterprises use for their various entities. This seems to be better represented by a graphical database that preserves a more natural-looking visual representation rather than the more formal semantic data description.

Graph Database Access Languages

There are some standard languages to access data in a graphical database such as Cypher used with Neo4j. Amazon’s Neptune supports Gremlin and the W3C standard SPARQL. These are interesting and useful implementations but they do not capture the essence of paper forms and the natural processing that humans apply to them.

All of these data access languages are designed for use by IT specialists. The difficult problem is designing a language that is usable by business people who are not trained in programming or other complicated IT technologies.

I have built Formever, a software environment that allows the quick building of enterprise systems by non-programmers. This software is based on a graph database of forms.

Summary

In this article I describe the concepts of a graph database. This forms the basis for my article on the structure of the underlying database technology that has powered organizations for centuries.

If you are interested in software that implements these theories and further information please go to www.formever.org.

Also see my book “Avoiding IT Disasters: Fallacies about enterprise systems and how you can rise above them.” (It reveals the real reasons why enterprise systems are failing — the ones that no one wants to talk about.)