Graph data is very common in modern applications. The property of this kind of
data is the high levels of connectivity between the entities. Although social
networks are the most common example for demonstrating graph data and the
importance of graph databases, graphs are not limited to social networks; many
other applications can be modeled much more clearly and intuitively with
graphs. Once we model our data as a graph, we usually want to explore certain
parts of the graph to find relations or interesting patterns. That includes
simple use cases such as finding a person's friends or friends-of-friends in a
social network, or complex patterns revealing frauds in a financial graph. All
these applications require fast traversal of graph nodes. Although we can
store a graph in a tabular relational database or a NoSQL database such as
key-value/document stores, they typically don't provide the fast traversal that
is critical for graph applications. In this post, we look at graph databases,
which are databases designed specifically for dealing with graph data.
Graph solutions are used in both online and offline processing. In this post,
however, we only focus on graph databases for online/near-real-time
transactional graph processing (OLTP). There are also different graph data
models. The two main ones are Labeled Property Graphs (LPGs) and Resource
Description Frameworks (RDFs). Here, we don't discuss different graph models.
Instead, we only focus on the fast traversal property of the graph databases
and see how they achieve it.
Storing Graph Data in Non-Graph Databases
Before talking about graph databases, let's see why we can't just use our
familiar relational or NoSQL databases to store graph data. Consider a social
network where users can be friends with each other. We can model the data of
this social network as a graph where nodes/vertices represent users and the
edges represent friendship relations between users. We could assume friendship
is mutual, i.e., when A is friends with B, B is also friends with A. Here,
however, we assume friendship is not mutual. Thus, when person A considers
person B a friend, we have a directed edge from the node representing A to the
node representing B.

Sketch 1. Friendship graph in a social network. Nodes represent
users. Edges represent one-directional friendships.

We can use the relational schema shown in Sketch 2 to store the friendship data
in a relational database. It is a nice schema, because we can easily add or
remove friendships, and data is normalized, so we don't have any redundancy.
Using foreign key constraints, we can make sure our database prevents us from
adding invalid edges. It also prevents us from deleting a node when we have an
edge to it, thereby preventing us from creating
dangling edges in our graph. Thus, the database helps us to keep our graph
"consistent" (the "C" in
ACID).

Sketch 2. A Schema for capturing friendship relations in a social network
using a relational database
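To make this concrete, here is a minimal sketch of such a schema using SQLite from Python. The table and column names (persons, friendships, P_1, P_2) are illustrative choices following the discussion below, not an exact reproduction of the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default

conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE friendships (
        p_1 INTEGER NOT NULL REFERENCES persons(id),
        p_2 INTEGER NOT NULL REFERENCES persons(id),
        PRIMARY KEY (p_1, p_2)
    )
""")

conn.execute("INSERT INTO persons VALUES (1, 'A'), (2, 'B')")
conn.execute("INSERT INTO friendships VALUES (1, 2)")  # A considers B a friend

# The database rejects a dangling edge: person 99 does not exist.
try:
    conn.execute("INSERT INTO friendships VALUES (1, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The foreign key constraints are what keep the graph "consistent": the invalid edge never makes it into the friendships table.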

Now, suppose we want to get the list of A's friends. To answer this query, we
have to search the friendship table to find rows having A's ID in column P_1.
Typical relational databases store data using a sorted data structure. The
most common data structures for this purpose are variants
of B-Trees. By storing data in a B-Tree, we have a sorted index on
the primary key of the table. Having this sorted index, we can find the rows
that we are interested in within O(log E) time, where E is the number of rows in
the table. Right here, you can see one problem with this design: the amount of
time we need to wait to find a user's friends grows as the total number of
friendship relations grows in our system even if the actual number of the
user's friends does not increase. Thus, our query time is a function of the overall size of the graph. That is a serious issue in a system with a huge number of users and
friendship relations.
To find A's friends' names, we need to pay O(log E) to find the IDs of the
friends, and then for each of them, we need to pay O(log N) to find the
corresponding rows in the persons table, where N is the number of rows in the
persons table (nodes). Thus, finding A's friends' names is of O(log E +
F × log N) where F is the average number of a person's friends. Now, suppose we
want to find A's second-degree friends (i.e., friends of A's friends).
The list of friends-of-friends is especially interesting for recommending
people who are not already friends with A, but are good candidates to be A's
new friends. To get A's second-degree friends' names, we have to first find
the list of A's friends, which requires O(log E); then for each of them, we need
to find friends again, which requires O(log E). Finally, for each of their
friends, we need to look up the persons table to find their names, which requires
O(log N). Thus, the total time complexity of finding A's second-degree
friends' names is of O(log E + F × (log E + F × log N)). In general, finding a user's mth-degree friends' names is of O(F^(m-1) × log E + F^m × log N). As you can see, the negative effect of the overall size of the
graph just gets worse as we query deeper relations. Answering simple questions
like finding neighbors or neighbors of neighbors in a graph using a typical
relational database is inefficient, let alone finding complicated relations
that require more explorations.
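As a rough illustration of why each hop pays for the whole edge table, here is a toy Python sketch where a binary search over a sorted list of edges stands in for a B-Tree index on column P_1 (all names and data are made up):

```python
from bisect import bisect_left

# Edges sorted by source ID -- a stand-in for a B-Tree index on P_1.
edges = sorted([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
names = {"A": "Alice", "B": "Bob", "C": "Carol", "D": "Dan"}  # persons "table"

def friends(person):
    """Find all edges starting at `person`: one O(log E) binary search
    over the ENTIRE edge list, then a scan of the matching rows."""
    i = bisect_left(edges, (person,))
    out = []
    while i < len(edges) and edges[i][0] == person:
        out.append(edges[i][1])
        i += 1
    return out

def friends_of_friends(person):
    """Each hop repeats the O(log E) search, so the cost of the query
    grows with the total number of edges in the graph."""
    return {fof for f in friends(person) for fof in friends(f)}

print(friends("A"))             # ['B', 'C']
print(friends_of_friends("A"))  # {'D'}
```

Even in this tiny example, every hop searches the full `edges` list; with billions of edges, each `bisect_left` call gets slower even if the user's own friend count stays constant.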
If relational databases are inefficient in capturing relations, why are
they called "relational"?!
Relational databases got their name from the
Relational Model defined by
Edgar Codd. This model is based on first-order predicate logic. In the relational
model, the database is considered as a set of propositions, each stating a
relation between a set of variables from certain domains. We insert our data
as propositions to the database, and then we can query it via relational
algebra/calculus formulas. Formally, a relation is a subset of the Cartesian
product of several sets. Tabular databases are an implementation of this model
where relations are captured by tables. To avoid data redundancy, however, we
have to normalize data by dividing our relations into multiple tables. To
reconstruct the relations and have the full picture, we have to join tables,
but to join we have to match rows, and this is where the performance issues
show up. Thus, in such a relational database, a relation is divided into two
parts; part of it is captured inside a table, and the other part is captured
in other tables. Relational databases are efficient for dealing with the first
part, but inefficient in dealing with the second part. For applications with
non-graph data that only occasionally need to join tables, this is fine, but for
applications dealing with highly connected data, relational databases just
don't work. The point of a graph database, as we will see, is to capture all
parts of relations equally and efficiently.
We saw that relational databases are inefficient for storing graph data, but
what about NoSQL data stores? Unfortunately, NoSQL data stores, such as
key-value/document/wide-column stores, are also inefficient for dealing with
graph data. In fact, they are even worse than relational databases in
capturing connected data, because while in relational databases we can capture
connections in the database using foreign keys, in nongraph NoSQL databases
capturing relations must be handled 100% by the applications, as these stores
provide no special facility for capturing and tracking relations.
For example, let's see how we could use a document store to hold the data for
our social network application explained above. In a document store, we can
store a document (usually in JSON format) for each user. We can capture the
list of the user's friends as an attribute in the JSON object. Now, we can
quickly find the IDs of the friends in O(1). However, to access the documents
storing the friends' information, e.g., to get friends' names, we still need
to do an index lookup in O(log N). Thus, finding A's friends' names is
still of O(F × log N). In general, finding mth-degree friends' names is of O(F^m × log N). Unlike relational databases that provide referential integrity, in
a NoSQL database with this schema, the application is responsible for keeping
the data consistent. For example, when removing a person, we must make sure we
remove it from other users' list of friends. Without this active housekeeping
by the application, the database will soon be full of inconsistent
information.

Sketch 3. A Schema for Capturing Friendship Relations in a
Social Network using a Document Store
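A minimal Python sketch of this document-store layout (a plain dict stands in for the store, and all names are made up) shows both the cheap friend-ID lookup and the housekeeping burden left to the application:

```python
# Each user is a document; friends are stored as a list of IDs inside it.
docs = {
    "A": {"name": "Alice", "friends": ["B", "C"]},
    "B": {"name": "Bob",   "friends": ["D"]},
    "C": {"name": "Carol", "friends": ["D"]},
    "D": {"name": "Dan",   "friends": []},
}

def friend_names(user_id):
    # The friend IDs come for free with the document (O(1) here), but each
    # friend's name still needs its own document lookup -- in a real store
    # backed by a sorted index, that is O(log N) per friend.
    return [docs[f]["name"] for f in docs[user_id]["friends"]]

def remove_user(user_id):
    # No referential integrity: the APPLICATION must clean up back-references,
    # or other documents end up pointing at a user that no longer exists.
    docs.pop(user_id)
    for doc in docs.values():
        if user_id in doc["friends"]:
            doc["friends"].remove(user_id)

print(friend_names("A"))  # ['Bob', 'Carol']
```

Forgetting the cleanup loop in `remove_user` is exactly how such a database "will soon be full of inconsistent information."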

Graph Databases
Graph databases are database management systems that are designed specifically
for dealing with graph data. The goal of a graph database is to provide fast
graph traversal. The key to fast graph traversal is fast adjacency, i.e.,
to find neighbors of a node as fast as possible. Ideally, we want our query
time to be independent of the overall size of the graph.
The Main Idea
In the previous section, we saw why typical tabular relational databases or
nongraph NoSQL databases perform poorly for dealing with graph data. As we
said, the main issue is index lookups with O(log N) time complexity. The main
idea of a graph database is to physically capture the relations between nodes
at write time, so we don't need to spend time figuring out the
relations at read time as we do in a relational database using join
operations. By capturing the relations at write time, graph databases can
quickly get the list of direct pointers to the edges of a node at read time.
The edges also directly point to the involved nodes. Thus, having these lists
of edges, we can find the neighbors of a node in O(K) where K is the number of
neighbors of the node. Below, we will see two designs implementing this idea
using different approaches. Of course, in a real system, we can have
optimizations to improve the performance of these approaches. Here, however,
we only focus on the overall ideas.
Fixed-size Record Storage
One way of implementing direct pointers on a disk is via fixed-size record store files, i.e., files that consist of fixed-size records. If our files contain only
fixed-size records, then by knowing the offset of a record, we can jump right
to its location on a disk without any index lookup. Thus, the offsets are used
as the ID of the records. With this approach, a pointer is basically a
tuple <filename, offset>. In addition to the
fast lookup, fixed-size record stores let us reuse the storage without having
too much fragmentation that causes storage waste. Specifically, when a record
is not needed anymore, we can easily mark it as deleted, and we can reuse it
for a future record. Since records are of the same size, we don't need to
worry about whether the new record fits in the space of the deleted record. Neo4j is a
graph database that uses this approach [1]. Table 1 shows the record sizes in
Neo4j.
Store File  Record size  Contents

neostore.nodestore.db  15 B  Nodes
neostore.relationshipstore.db  34 B  Relationships
neostore.propertystore.db  41 B  Properties for nodes and relationships
neostore.propertystore.db.strings  128 B  Values of string properties
neostore.propertystore.db.arrays  128 B  Values of array properties
Table 1. Record sizes in Neo4j [1]
In Neo4j, each node has the ID of the first edge in its list of edges. The
edges are stored as fixedsize records in the edges (relation) files. By
multiplying the ID of the edge by the size of the edge records, we know the
exact location of the edge. Each edge points to the next and previous
edges in the list of edges, thereby creating a doubly-linked list.
Sketch 4 shows an example. The pink arrows move forward in the list, and the
green arrows move backward. The edge record also includes the ID of the start
and end nodes. By multiplying a node ID by the size of node records, we
get the location of the node. Despite all these indirections, the time
complexity is still O(1) as we find out the disk locations in O(1). Compare
this to the case where we use a B-tree index, and we need O(log N) to find
these locations.
Note that, to avoid having two records per edge (one for the linked list of
the start node and one for the linked list of the end node), we can store the
linked list information for both nodes on the same record, with four pointers
in total: two pointers for moving forward and backward in the list of
edges for the start node, and two pointers for moving forward and backward in
the list of edges for the end node.

Sketch 4. Fixed-size record storage design

Now, let's calculate the time complexity of finding A's friends' names
with this design. Since the pointer to the beginning of the linked list of A's
friends is stored with A, we have the list of friends in O(1). And since these
are direct pointers, we don't need any index lookup. Thus, we can get the name
of each friend in O(1). Thus, the time complexity of finding A's friends'
names is O(F). In general, finding A's mth-degree friends' names is of O(F^m). The nice thing about this complexity is that there is no term N in it.
Thus, the latency of our query is independent of the overall size of our
graph; we can grow to billions of users and the query time does not
change (except for the initial index lookup to find A itself).
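The offset arithmetic described above can be sketched in a few lines of Python. The record layouts and sizes below are invented for illustration and are far simpler than Neo4j's actual formats (e.g., each edge here appears in only one node's list):

```python
import struct

# Toy fixed-size records:
#   node record = <first_edge_id>                    (4 bytes)
#   edge record = <start_node, end_node, next_edge>  (12 bytes)
NODE_SIZE, EDGE_SIZE = 4, 12
NO_MORE = 0xFFFFFFFF  # sentinel: end of the edge list

# Node 0 points to edge 0; nodes 1 and 2 have no outgoing edges.
node_store = struct.pack("<I", 0) + struct.pack("<I", NO_MORE) * 2
# Edge 0: 0 -> 1, next edge is 1. Edge 1: 0 -> 2, last in the list.
edge_store = struct.pack("<III", 0, 1, 1) + struct.pack("<III", 0, 2, NO_MORE)

def neighbors(node_id):
    """Follow direct pointers: no index lookup, every hop is an O(1) jump.
    A record's location is simply its ID multiplied by the record size."""
    out = []
    (edge_id,) = struct.unpack_from("<I", node_store, node_id * NODE_SIZE)
    while edge_id != NO_MORE:
        start, end, nxt = struct.unpack_from("<III", edge_store,
                                             edge_id * EDGE_SIZE)
        out.append(end)
        edge_id = nxt
    return out

print(neighbors(0))  # [1, 2]
```

Note that nothing here depends on how many nodes or edges exist in total; the in-memory byte strings stand in for the on-disk store files, where `unpack_from` at a computed offset would be a `seek` and a read.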
Log-structured Storage
In this section, we review the overall ideas of the design proposed in [2]. In
this design, instead of fixed-size record stores, we store data in an append-only log. As its name suggests, to add new data, we simply append it to the end of
the log file and we never delete it. [2] provides an RDF-like data model where
nodes are simply strings. Thus, to add a new node, we append a length-prefixed
string to the end of the log file. There won't be two log entries for the
same node. Thus, when the database is asked to add a new node, it first checks
the log (via a hash index) to see if it already has the node. If the node is
already added, it avoids adding it again. To add an edge, we also append a new
entry to the log. Unlike nodes, edges can be removed from the graph. The
deletion of an edge is also recorded on the log as a new entry. Thus, a log
entry is either a node, the addition of an edge, or the deletion of an edge.
Sketch 5 shows the graph of Sketch 1 in this design. Note that the record
sizes are given just as an example. Each log entry is identified by
its offset in the file. In this example, entries with ID (offset) 0, 8, 36,
and 84 are nodes. And entries with ID 16, 44, 64, 92, and 112 are edges. Note
that the edge from A to C which is added in offset 44 is later deleted in
offset 112.
An append-only file is great for writing the data (in fact, it is
theoretically the optimal persistent storage from the write-path perspective;
what could be faster than just appending to the end of a file?!), but very
inefficient for reading the data, because to answer a query, e.g., figuring
out the neighbors of a given node, we have to read the whole history,
every time! If we had only one node in the graph, that wouldn't be an issue,
because all log entries would be related to our query. The issue is that on
the log we have lots of entries that are unrelated to our target node. This is
where indexes come into the picture. Using different indexes, we
can filter the log and quickly read the entries related
to our query. For example, we can have an index that maps the ID of a node and
a certain edge type to the log offsets that record additions/deletions of the edges
of the given type to the given node. To store the log offsets, [2] uses the
VList data structure.
To avoid O(log E) time complexity in finding the related log entries, [2] uses
hash indexes. Using a hash index we can find the related log offsets in O(1)
and traverse them linearly in O(H) where H is the average size of the related
history (i.e., number of related log entries). To answer a query, e.g., to
find the neighbors of a node, we first find related log locations using the
index, and then we check them one by one and figure out the list of edges. The
process of figuring out the edges by reading the history is
called materialization in [2]. With this design, finding A's
mth-degree friends' names is of O(H^m).
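Here is a toy Python sketch of this log-plus-hash-index design. List positions stand in for file offsets, and the entry format is invented for illustration; it is not the actual format from [2]:

```python
# Append-only log of entries; each entry's ID is its offset in the log.
# Entry kinds: ("node", name), ("add_edge", src, dst), ("del_edge", src, dst).
log = []
index = {}  # hash index: source node -> offsets of its edge entries

def append(entry):
    offset = len(log)  # stand-in for a byte offset in the log file
    log.append(entry)
    if entry[0] in ("add_edge", "del_edge"):
        index.setdefault(entry[1], []).append(offset)
    return offset

for e in [("node", "A"), ("node", "B"), ("node", "C"),
          ("add_edge", "A", "B"), ("add_edge", "A", "C"),
          ("del_edge", "A", "C")]:
    append(e)

def neighbors(src):
    """Materialization: replay only the H log entries related to `src`.
    The hash index finds them in O(1) instead of O(log E)."""
    current = set()
    for offset in index.get(src, []):
        kind, _, dst = log[offset]
        if kind == "add_edge":
            current.add(dst)
        else:
            current.discard(dst)
    return current

print(neighbors("A"))  # {'B'}
```

The replay in `neighbors` is why query cost is O(H): the deleted A-to-C edge still occupies two log entries that must be read and cancelled out on every query.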
We can create various indexes to filter the log from the different
perspectives that we are interested in. For example, if we are interested in
traversing the relations backward, we can create an index that filters the
log based on <destination + edge type> instead of <source + edge
type>.

Sketch 5. Log-structured design

But isn't this O(H) an issue? To answer each query, we have to read the
related history, including all edge additions and deletions.
For a highly dynamic graph where relations change frequently, yes; that would
be a problem, because we will have long histories. However, for many
applications, such as social networks where relations change infrequently, that is
not an issue, especially when our data is in memory.
Discussion: what can be called a graph database?
We saw why we need a special type of database management system for dealing
with graph data and reviewed two designs, but what is the definition of a
graph database? There is a debate in the community of graph database
developers on what can be called a "graph database". Of course, we don't have
such a thing as a wrong or right definition; a definition is just a
definition. Thus, the debate should be on what is
a useful definition of a graph database. Some people believe
as long as a database provides a graph API, it can be considered a graph
database, no matter how that API is implemented. However, such a definition is
not useful, and in fact can be harmful: application developers using such a
database may store their connected data in it, wrongly assuming that the
database is designed for efficient handling of connected data, while the
database and its underlying storage might not be designed for that purpose at
all, which leads to poor performance.
The majority of developers believe the essence of a graph database is fast
adjacency/traversal. But now the question is: what counts as "fast"? Some vendors
restrict fast adjacency with implementation details. They use the term
"index-free adjacency" as the essence of a graph database, and refer to graph
databases with index-free adjacency as "native graph" databases. However, at
least I was not able to find any formal and clear definition of index-free
adjacency. Does it mean that the database does not use an index at all? On the
other hand, just using an index does not necessarily mean slow adjacency.
Sorted indexes like B-Tree indexes that need O(log N) lookups might cause slow
traversal, but what about hash indexes, as used in [2] and other databases?
Should those databases not be considered graph databases just because they use indexes?
Overall, defining requirements through implementation details does not seem to
be a good approach.
Some may define "fast" traversal with time complexity, as we did above. For
example, we can define a graph database as a database that provides O(K) time
complexity for finding neighbors of a node where K is the number of neighbors.
We may eagerly say "clearly O(K) is better than O(log N) where N is the
overall size of the graph". However, we know that for sublinear time
complexities constants are critical. What if a graph database theoretically
provides O(K) time complexity, but for most typical use cases it is in
practice slower than a relational database with O(log N)?
As you can see, providing a formal definition for graph databases is not
straightforward, and that is the source of debate in the community. From the
practical point of view, fast traversal is what we want, but as always there
is no onesizefitsall technology. Thus, for each use case, we have to
carefully see if the product we choose is capable of handling our graph data
efficiently.
Summary
In this post, we talked about graph databases, a kind of database management
system designed for dealing with highly connected data. The key
design point of all graph databases is fast traversal, which is
usually achieved by physically tracking relations at write time instead of
figuring them out at read time. We reviewed two different approaches for
implementing this idea using fixed-size record stores and append-only logs.
Both approaches make it possible to explore a subgraph in a time complexity
independent of the overall size of the graph.
References
[1] Understanding Neo4j’s data on disk. https://neo4j.com/developer/kb/understanding-data-on-disk/ [August 8, 2021].
[2] Andrew Carter, Andrew Rodriguez, Yiming Yang, and Scott Meyer. "Nanosecond
indexing of graph data with hash maps and VLists." In Proceedings of the 2019
International Conference on Management of Data, pp. 623-635. 2019.