Turning Your Relational Database into a Graph Database


Relational databases have been the backbone of data management for decades, handling structured data with their tabular format and strict schemas.

But what if your data's true potential lies in the relationships between data points? That's where graph databases come into play.

Unlike traditional relational databases, graph databases excel in scenarios where each data point can be connected to a multitude of other entities, forming relationships that can be leveraged to discover new insights.

If you need to navigate deep hierarchies or uncover hidden connections, it might be relevant to use a graph database.

For example, they are particularly useful when building recommendation systems, CRMs, or tools to analyze customer behavior.


In this tutorial, I'm going to guide you through transforming your relational database into a dynamic graph database in Python. We will use the Amazon Products Dataset as an example, and extract entities from the product titles to enrich the dataset and turn it into a graph.

To do this, we'll leverage OpenAI's GPT-3.5-turbo model, and we will then use Cypher queries to load the data into a Neo4j graph database.

Preparing the dataset

The first thing we are going to do is filter the dataset down – it is very large, and running the model on every row would take a long time.

The ideal thing would be to manually curate your data, but just for the sake of this tutorial, here is what we are going to do:

  • Filter out irrelevant categories
  • Filter out products rated below 3.8
  • Filter out non-bestsellers

We can use pandas to load our dataset into a dataframe and perform these operations:
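Here is a minimal sketch of that filtering step. The file name and the stars, isBestSeller and category_name columns are assumptions based on the Kaggle version of the dataset – rename them to match your copy:

```python
import pandas as pd

# Load the raw product data (file and column names may differ in your copy)
df = pd.read_csv("amazon_products.csv")

# Arbitrary example of categories we consider irrelevant for this tutorial
irrelevant_categories = ["Gift Cards", "Magazine Subscriptions"]

df = df[~df["category_name"].isin(irrelevant_categories)]  # drop irrelevant categories
df = df[df["stars"] >= 3.8]   # drop products rated below 3.8
df = df[df["isBestSeller"]]   # keep bestsellers only

print(len(df))
```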

After doing this, we are left with ~4k products. We can then extract entities from these products' titles to generate connections in the graph.

Extracting entities

At the heart of transforming our dataset into a graph database is the process of extracting entities. But what does this actually mean?

Entity: a thing with distinct and independent existence

Well, we want to create relationships between items in the database and things that they could have in common with other items.

These things, or entities, could be anything – in the case of products they could be a color, a size, or a characteristic.

So the first step is actually deciding which types of entities we want to look for, and what they mean.

This is arbitrary and completely up to you to decide; just bear in mind that each entity should be deducible from a value in the existing dataset – in our case, the product title.

I settled on these entity types:

  • description: detailed item description, for example ‘high waist pants', ‘outdoor plant pot', ‘chef kitchen knife';
  • type: item type, for example ‘women clothing', ‘plant pot', ‘kitchen knife';
  • characteristic: if present, item characteristics, for example ‘waterproof', ‘adhesive', ‘easy to use';
  • measurement: if present, dimensions of the item;
  • brand: if present, brand of the item;
  • color: if present, specific color of the item;
  • color_group: if the color is present, the broader color group. For example, ‘navy blue' is part of the color group ‘blue', ‘burgundy' is part of ‘red', and ‘lilac' is part of ‘purple';
  • age_group: target age group for the product, one of ‘babies', ‘children', ‘teenagers', ‘adults'. If it is not clear whether the product is aimed at a specific age group, it should be ‘adults';
  • gender_group: target gender for the product, one of ‘women', ‘men', ‘all'. If it is not clear whether the product is aimed at a specific gender, it should be ‘all'.

Now that we know what we are looking for, we are going to craft a prompt for an LLM to extract those entities from a given text. As mentioned, we will use OpenAI's GPT-3.5-turbo model for this.

This is a critical step because the quality of the prompt will directly influence the effectiveness of the entity extraction.

You can experiment with different prompts, but a good prompt is clear and concise, and specifies exactly the task at hand.
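Here is a minimal sketch of such a prompt, built from the entity definitions above – adapt the wording to your needs:

```python
# A minimal extraction prompt built from the entity definitions above
system_prompt = """You extract entities from Amazon product titles.
Given a product title, return a JSON object with the following keys,
each mapping to an array of strings (use an empty array when an entity
type is not present in the title):

- description: detailed item description, e.g. "high waist pants", "outdoor plant pot"
- type: item type, e.g. "women clothing", "plant pot", "kitchen knife"
- characteristic: item characteristics, e.g. "waterproof", "adhesive", "easy to use"
- measurement: dimensions of the item
- brand: brand of the item
- color: specific color of the item
- color_group: broader color group, e.g. "navy blue" belongs to "blue"
- age_group: one of "babies", "children", "teenagers", "adults"; default to "adults"
- gender_group: one of "women", "men", "all"; default to "all"

Return only the JSON object, nothing else."""
```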

You'll notice that I'm specifying that I'm looking for an array of values for each entity type. Indeed, there can be several colors, subcategories or characteristics for a given product.

That's the beauty of a graph database: one product could have multiple relationships with multiple entities of the same type – which would be harder to represent using a relational database.

An important thing to add for a task such as this one is examples: it is the best way to ensure the model will output something that looks like what we are looking for, and it also helps describe the "thought process" we want it to use.

We can add those examples directly into the prompt:
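For instance (the two examples below are made up for illustration – pick real titles from your own dataset):

```python
# Few-shot examples appended to the prompt (illustrative; use real titles)
few_shot_examples = """

Examples:

Title: "Large Terracotta Plant Pot with Drainage Hole, 10 Inch, Outdoor/Indoor"
Output: {"description": ["outdoor plant pot"], "type": ["plant pot"],
"characteristic": ["drainage hole"], "measurement": ["10 inch"], "brand": [],
"color": ["terracotta"], "color_group": ["orange"], "age_group": ["adults"],
"gender_group": ["all"]}

Title: "Hanes Women's High Waist Leggings, Navy Blue, Size M"
Output: {"description": ["high waist leggings"], "type": ["women clothing"],
"characteristic": [], "measurement": ["size m"], "brand": ["Hanes"],
"color": ["navy blue"], "color_group": ["blue"], "age_group": ["adults"],
"gender_group": ["women"]}
"""

prompt = system_prompt + few_shot_examples
```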

Another key aspect is to force the output to be in JSON format – the OpenAI Chat Completions API now supports this through its JSON mode (the response_format parameter), which guarantees that the output is valid JSON.

Creating the final dataset

Now that we have our prompt ready, we can use the Chat Completions API to run the entity extraction step on every line in our dataset, and add the result to an array of JSON objects.

We will start by defining a function to run the entity extraction:
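Below is a sketch using the openai Python client (v1+). It reuses the prompt variable built above and enables JSON mode through the response_format parameter:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def extract_entities(title: str) -> dict:
    """Run the entity extraction prompt on a single product title."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_format={"type": "json_object"},  # JSON mode: output is valid JSON
        temperature=0,  # keep the extraction as deterministic as possible
        messages=[
            {"role": "system", "content": prompt},  # the prompt built above
            {"role": "user", "content": title},
        ],
    )
    return json.loads(response.choices[0].message.content)
```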

You can try this out on a single title to check that you get the expected output (the title below is made up for illustration):
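```python
# Any title from your dataset works here
print(extract_entities("Stainless Steel Chef Kitchen Knife, 8 Inch, Ergonomic Handle"))
```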

This should return something like this:
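(Illustrative output – the exact values depend on the model run.)

```json
{
  "description": ["chef kitchen knife"],
  "type": ["kitchen knife"],
  "characteristic": ["ergonomic handle"],
  "measurement": ["8 inch"],
  "brand": [],
  "color": [],
  "color_group": [],
  "age_group": ["adults"],
  "gender_group": ["all"]
}
```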

We can now run this on every line of the dataframe that contains our prepared dataset.

Be aware that this can take a while, so I would process it in chunks and implement a retry mechanism.
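Here is a simple sketch with retries and exponential backoff; it assumes the title column from earlier and reuses extract_entities:

```python
import time

results = []
for i, row in enumerate(df.to_dict("records")):
    entities = {}
    for attempt in range(3):  # retry each row up to 3 times
        try:
            entities = extract_entities(row["title"])
            break
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    results.append({**row, **entities})  # original columns + extracted entities
    if i % 100 == 0:
        print(f"Processed {i} rows")
```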

Loading data in the database

Our dataset is now ready: we have added new dimensions to it by creating relationships between each item and different entities. This JSON dataset, which is not very readable on its own, can now be turned into a graph database!

For this step, we will use a Neo4j database.

You can read the guidelines on how to set up a local database on the Neo4j website.

Once you have your database credentials, you can connect to it like this:
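Here is a minimal sketch using the official neo4j Python driver – the URI and credentials below are placeholders:

```python
from neo4j import GraphDatabase

# Placeholders: replace with your own database URI, user and password
URI = "bolt://localhost:7687"
AUTH = ("neo4j", "your-password")

driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()  # fails fast if the connection details are wrong
```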

We will define an entity map to specify which names to use for entity nodes in the database as well as the name of the relationships we want to create:
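The labels and relationship types below are arbitrary naming choices – rename them as you see fit:

```python
# Maps each entity type to a (node label, relationship type) pair
entity_map = {
    "description": ("Description", "HAS_DESCRIPTION"),
    "type": ("Type", "HAS_TYPE"),
    "characteristic": ("Characteristic", "HAS_CHARACTERISTIC"),
    "measurement": ("Measurement", "HAS_MEASUREMENT"),
    "brand": ("Brand", "HAS_BRAND"),
    "color": ("Color", "HAS_COLOR"),
    "color_group": ("ColorGroup", "HAS_COLOR_GROUP"),
    "age_group": ("AgeGroup", "TARGETS_AGE_GROUP"),
    "gender_group": ("GenderGroup", "TARGETS_GENDER_GROUP"),
}
```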

We can now use this code to generate Cypher queries that will load each JSON object into the database:
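Here is a sketch of that loading step. For simplicity it keys products by title (a unique identifier such as the product's ASIN would be safer in practice), and the productURL and category_name columns are the same assumptions as earlier:

```python
def product_queries(item: dict):
    """Yield (cypher, params) pairs that load one product and its entities."""
    # Product node, Category node, and the relationship between them
    yield (
        """
        MERGE (p:Product {title: $title, url: $url})
        MERGE (c:Category {name: $category})
        MERGE (p)-[:IN_CATEGORY]->(c)
        """,
        {
            "title": item["title"],
            "url": item.get("productURL", ""),          # column name assumption
            "category": item.get("category_name", ""),  # column name assumption
        },
    )
    # One query per entity type, skipped when the extracted array is empty
    for key, (label, rel) in entity_map.items():
        values = item.get(key) or []
        if not values:
            continue
        yield (
            f"""
            MATCH (p:Product {{title: $title}})
            UNWIND $values AS value
            MERGE (e:{label} {{value: value}})
            MERGE (p)-[:{rel}]->(e)
            """,
            {"title": item["title"], "values": values},
        )

with driver.session() as session:
    for item in results:
        for query, params in product_queries(item):
            session.run(query, **params)
```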

If we take a closer look at the queries generated, here's what we are doing:

  1. We are defining a Product node p with values from the original dataset (title, URL, etc.)
  2. We are adding a Category node with the product category that is also in the original dataset, and creating a relationship between p and this category node
  3. For each entity type, we are iterating over the array of values if present with UNWIND and then creating a new node for each entity with the defined node name and the corresponding value
  4. We are then creating a new relationship between the product and the new entity node created

Et voilà, our data is loaded and ready to go!

What we've learned

In this tutorial, we've walked through transforming a relational dataset into a knowledge graph, using AI-driven entity extraction.

The transformation from a relational database to a graph database unlocks a new dimension of data analysis and management.

By leveraging OpenAI's GPT-3.5-turbo model, we can efficiently extract entities from product descriptions, enriching our dataset with multi-faceted relationships.

This enriched dataset, when loaded into a graph database like Neo4j, becomes a powerful tool for uncovering insights that were previously obscured in traditional tabular data structures.


We've used a dataset of products as an example, but of course this is applicable to any use case where using a graph database is relevant.

Feel free to tweak this and adapt it to your specific needs!
