Graph Databases 101

You may have heard the term ‘graph database’ and how graph databases provide the magic behind Facebook, Google, Twitter and seemingly everything else on the web today. What exactly is a graph database, and why should you be interested in one? Let me explain.

An Introduction

To begin, a quick introduction to basic database design. Most databases are relational: a collection of tables that contain specific information, plus fields describing the relationships to other tables. The most common relationship is called a one-to-many relationship. For example, an author has written many books and each book has one author. Can you see an immediate issue with this model? Some books have several authors.
We could add extra author fields to a book record (author1, author2, author3 and so on), but these are hardwired and need a lot of code to maintain the links. What happens if a book has no author? Or if an author also writes under a nom de plume?
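
To make the hardwiring concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names and the books_by helper are assumptions made up for this illustration, not a recommended design.

```python
import sqlite3

# Hypothetical schema for illustration: three author slots are
# hard-wired into the book table as separate columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book (title TEXT, author1 TEXT, author2 TEXT, author3 TEXT)"
)
conn.execute(
    "INSERT INTO book VALUES (?, ?, ?, ?)",
    ("Good Omens", "Terry Pratchett", "Neil Gaiman", None),
)


def books_by(author):
    """Find books by an author; the query must name every author column."""
    rows = conn.execute(
        "SELECT title FROM book WHERE ? IN (author1, author2, author3)",
        (author,),
    ).fetchall()
    return [title for (title,) in rows]


print(books_by("Neil Gaiman"))

# A fourth author has nowhere to go: the column does not exist,
# so the table structure itself would have to change first.
try:
    conn.execute("UPDATE book SET author4 = ?", ("Someone Else",))
except sqlite3.OperationalError as error:
    print("Schema change needed:", error)
```

Every extra author slot, and every new kind of contributor, means another column and another change to each query that touches the table.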

Adding flexibility

With a graph database you still have the same kinds of data, but the relationships between items are more flexible. Each connection is described on the connection itself, so a link between a book and a person can be labelled as author, alias, contributor, illustrator, editor or any other type of connection that may occur. Because these connections are dynamic, you don’t need to know all of the potential ways your data may connect when you are setting it up: you can define new connections between items if and when required.

Nodes and edges

In graph database terminology, the people and books are called nodes and the connections between them are called edges. Each node has its own properties (metadata), for example a date of birth for a person. Each edge has properties describing the join between items, for example whether the person is the author or the illustrator of our book. There is no limit to the number of edges (joins) you can make between nodes, and no limit on what kinds of connections they are.
With a standard database, each connection is essentially hard-coded when the database is created, and items that fall outside of the norm tend to be poorly managed and hard to discover. You cannot just add new connections to accommodate these items on the fly: you need to edit the database structure and the underlying code to make changes.
In a graph database the connections can be made up as you go along, so your data expands your system as it grows.
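
The nodes-and-edges idea can be sketched in a few lines of plain Python. This is an in-memory illustration only, not how any particular graph database is implemented; the node ids, the names (Author A, Illustrator B, Pen Name C) and the helpers add_edge and linked_to are all hypothetical.

```python
# Nodes: an id mapped to a dictionary of properties (metadata).
nodes = {
    "person:1": {"name": "Author A", "born": 1950},
    "person:2": {"name": "Illustrator B"},
    "book:1": {"name": "Example Book"},
}

# Edges: (source, target, properties). The meaning of a connection
# lives on the edge itself, not in a fixed table structure.
edges = []


def add_edge(source, target, **properties):
    """Connect two existing nodes; the edge carries its own properties."""
    if source not in nodes or target not in nodes:
        raise KeyError("both ends of an edge must be existing nodes")
    edges.append((source, target, properties))


def linked_to(target, edge_type=None):
    """Names of nodes with an edge pointing at target, optionally by type."""
    return [
        nodes[src].get("name", src)
        for src, dst, props in edges
        if dst == target and (edge_type is None or props.get("type") == edge_type)
    ]


add_edge("person:1", "book:1", type="author")
add_edge("person:2", "book:1", type="illustrator")

# A new kind of connection, made up as we go: no structure change needed.
nodes["person:3"] = {"name": "Pen Name C"}
add_edge("person:3", "person:1", type="alias")

print(linked_to("book:1"))
print(linked_to("book:1", "author"))
print(linked_to("person:1", "alias"))
```

The alias edge was not planned for when the first nodes were created, yet adding it required nothing more than a new label on a new edge.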

A data challenge

Here’s an example of how something fairly simple to say is nearly impossible to do in a standard database: Music.
Mapping songs to albums is easy enough, even with greatest hits and live recordings added to the mix. Add in the singer and songwriter on each song for each album and we get richer metadata, and it’s still possible. We now have several tables covering the bands, albums, songs and people, with a variety of connections between them, and we’re all good. The next stage, however, is where it falls apart.

Scenario time:

How many distinct connections can you think of in this situation?
In 1992, after the death of Freddie Mercury, Queen play a tribute concert to over 70,000 people. The remaining members of the band, along with other performers including Robert Plant, Elton John, David Bowie, George Michael and Annie Lennox, perform a variety of Queen songs. Guns N’ Roses, Def Leppard and Metallica also get on stage and play their tributes to Freddie Mercury.
Data-wise, we have:

  • (Most of) a band (Queen) playing a tribute to a band member who isn’t there.
  • Queen (thankfully) playing their own songs, with other musicians contributing the vocals.
  • Musicians from other bands joining Queen on stage, playing various instruments.
  • In some cases, a completely different band playing one of Queen’s songs.

In summary, we have a venue where a band (or most of one) played a gig where a number of their own songs were played, and most songs had contributing artists or bands on stage with them, and the occasional song played at the gig didn’t involve the original band at all. Get a pencil and work out the table structures you need to describe that.
It gets worse though… Both Guns N’ Roses and Metallica have changed their band members since, so the bands that performed that day were the 1992 line-ups, not the current ones. That’s where databases get hard!
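
As a rough sketch of how a graph copes with this, the snippet below stores a small slice of the concert as labelled edges in plain Python. The edge labels (PERFORMED_AT, MEMBER_OF and so on), the property names and the lineup helper are assumptions invented for this example, and only a handful of the performers and band members are included.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    kind: str
    target: str
    props: dict = field(default_factory=dict)


CONCERT = "Freddie Mercury Tribute Concert"

# A deliberately incomplete slice of the data, for illustration only.
edges = [
    Edge(CONCERT, "TRIBUTE_TO", "Freddie Mercury", {"year": 1992}),
    Edge("Queen", "PERFORMED_AT", CONCERT),
    Edge("George Michael", "PERFORMED_AT", CONCERT),
    Edge("Metallica", "PERFORMED_AT", CONCERT),
    # A guest singing a Queen song with Queen: context sits on the edge.
    Edge("George Michael", "SANG", "Somebody to Love",
         {"at": CONCERT, "backed_by": "Queen"}),
    # Band membership carries its own date range.
    Edge("Freddie Mercury", "MEMBER_OF", "Queen", {"since": 1970, "until": 1991}),
    Edge("Brian May", "MEMBER_OF", "Queen", {"since": 1970}),
    Edge("Jason Newsted", "MEMBER_OF", "Metallica", {"since": 1986, "until": 2001}),
    Edge("Robert Trujillo", "MEMBER_OF", "Metallica", {"since": 2003}),
]


def performers(event):
    """Everyone with a PERFORMED_AT edge pointing at the event."""
    return [e.source for e in edges if e.kind == "PERFORMED_AT" and e.target == event]


def lineup(band, year):
    """Members whose MEMBER_OF edge covers the given year."""
    members = []
    for e in edges:
        if e.kind != "MEMBER_OF" or e.target != band:
            continue
        since = e.props.get("since", float("-inf"))
        until = e.props.get("until", float("inf"))
        if since <= year <= until:
            members.append(e.source)
    return members


print(performers(CONCERT))
print(lineup("Metallica", 1992))
print(lineup("Metallica", 2010))
```

Because the date range is a property of the membership edge, asking for the 1992 line-up rather than the current one is just a filter on that edge, with no extra tables involved.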

Graph to the rescue!

With this quantity of data and the large number of connections between items it is almost impossible to map out using non-graph methods. It’s also very unlikely you would have pre-configured all of those connections before you started entering your data.
Thankfully, with a graph database you don’t have to! Remember: in a graph database the connections can be made up as you go along, so your data expands your system as it grows. A graph database gives you the freedom and flexibility to build these rich connections on the fly, adapts quickly to changes in your data, and keeps all of your data discoverable.

Watch the Video

I delivered a talk on graph databases at the National Digital Forum (NDF) in 2012 which gives a little more insight into graph databases and why they are so useful for managing data. You can watch a version of the presentation here: