Cypher Generation: The Good, The Bad and The Messy

Introduction
Cypher is Neo4j's graph query language. Inspired by SQL and bearing many similarities to it, Cypher enables data retrieval from knowledge graphs. Given the rise of generative AI and the widespread availability of large language models (LLMs), it is natural to ask which LLMs are capable of generating Cypher queries, or how we can fine-tune our own model to generate Cypher from text.
The task presents considerable challenges, primarily due to the scarcity of fine-tuning datasets and, in my opinion, because such a dataset would depend heavily on the specific graph schema.
In this blog post, I will discuss several approaches for creating a fine-tuning dataset aimed at generating Cypher queries from text. The first approach relies on an LLM and a predefined graph schema. The second, implemented entirely in Python, offers a versatile means of producing a vast array of questions and Cypher queries, adaptable to any graph schema. For experimentation, I created a knowledge graph based on a subset of the ArXiv dataset.
As I was finalizing this blogpost, Tomaz Bratanic launched an initiative to develop a comprehensive fine-tuning dataset that encompasses various graph schemas and integrates a human-in-the-loop approach to generate and validate Cypher statements. I hope the insights discussed here will also benefit that project.
Knowledge Graph Model
I like working with the ArXiv dataset of scientific articles because of its clean, easy-to-integrate format for a knowledge graph. Using techniques from my recent Medium blogpost, I enhanced this dataset with additional keywords and clusters. Since my primary focus is on building a fine-tuning dataset, I'll omit the specifics of constructing this graph. For those interested, details can be found in this GitHub repository.
The graph is of a reasonable size, featuring over 38K nodes and almost 96K relationships, with 9 node labels and 8 relationship types. Its schema is illustrated in the following image:

While this knowledge graph isn't fully optimized and could be improved, it serves the purposes of this blogpost quite effectively. If you prefer to test queries without building the graph yourself, I uploaded the dump file to this GitHub repository.
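Later steps feed the graph schema to an LLM as a plain string. A minimal sketch of one way to assemble such a string is shown below; the labels, properties, and relationship shown are hypothetical placeholders, not the full ArXiv graph schema (in practice the schema would be pulled from Neo4j, e.g. via `CALL db.schema.visualization()` or `apoc.meta.data()` with the Python driver):

```python
# Format a graph schema as a compact string suitable for an LLM prompt.
# The labels/properties/relationships below are illustrative placeholders;
# in practice they would be fetched from the live Neo4j database.

def format_schema(node_props: dict[str, list[str]],
                  relationships: list[tuple[str, str, str]]) -> str:
    lines = ["Node labels and properties:"]
    for label, props in node_props.items():
        lines.append(f"  (:{label} {{{', '.join(props)}}})")
    lines.append("Relationships:")
    for src, rel, dst in relationships:
        lines.append(f"  (:{src})-[:{rel}]->(:{dst})")
    return "\n".join(lines)

# Illustrative subset only -- not the actual 9-label / 8-relationship schema
schema = format_schema(
    {"Article": ["title", "abstract"], "Author": ["name"]},
    [("Author", "WROTE", "Article")],
)
print(schema)
```

A compact textual schema like this keeps the prompt short while still telling the model exactly which labels, properties, and relationship directions are valid.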
Generating Training Pairs with LLM
The first approach I implemented was inspired by Tomaz Bratanic's blogposts on building a knowledge graph chatbot and fine-tuning an LLM with H2O Studio. Initially, a selection of sample queries was provided in the prompt. However, some recent models have an enhanced capability to generate Cypher queries directly from the graph schema. Therefore, in addition to GPT-4 or GPT-4-turbo, there are now accessible open-source alternatives, such as Mixtral-8x7B, that I anticipate could generate decent-quality training data.
In this project, I experimented with two models. For convenience, I decided to use GPT-4-turbo in conjunction with ChatGPT; see this Colab Notebook. However, in this notebook I performed a few tests with Mixtral-8x7B-GPTQ, a quantized model that is small enough to run on Google Colab and which delivers satisfactory results.
To maintain data diversity and effectively monitor the generated question-Cypher statement pairs, I adopted a two-step approach:
- Step 1: provide the full schema to the LLM and request it to generate 10–15 different categories of potential questions related to the graph, along with their descriptions,
- Step 2: provide schema information and instruct the LLM to create a specific number N of training pairs for each identified category.
Extract the categories of samples:
For this step I used the ChatGPT Pro version, iterating through the prompt several times and combining and refining the outputs.
- Extract a schema of the graph as a string (more about this in the next section).
- Build a prompt to generate the categories:
chatgpt_categories_prompt = f"""
You are an experienced and useful Python and Neo4j/Cypher developer.
I have a knowledge graph for which I would like to generate
interesting questions which span 12 categories (or types) about the graph.
They should cover single nodes questions,
two or three more nodes, relationships and paths. Please suggest 12
categories together with their short descriptions.
Here is the graph schema:
{schema}
"""
- Ask the LLM to generate the categories.
- Review, make corrections and enhance the categories as needed. Here is a sample:
'''Authorship and Collaboration: Questions about co-authorship and collaboration patterns.
For example, "Which authors have co-authored articles the most?"''',
'''Article-Author Connections: Questions about the relationships between articles and authors,
such as finding articles written by a specific author or authors of a particular article.
For example, "Find all the authors of the article with title 'Explorations of manifolds'"''',
'''Pathfinding and Connectivity: Questions that involve paths between multiple nodes,
such as tracing the relationship path from an article to a topic through keywords,
or from an author to a journal through their articles.
For example, "How is the author 'John Doe' connected to the journal 'Nature'?"'''
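Once the LLM has produced question/Cypher pairs for each category, they can be collected into a fine-tuning file. Below is a minimal sketch that deduplicates questions and writes JSONL; the `{question, cypher}` record layout is an assumption for illustration, and should be adapted to whatever format your fine-tuning framework expects:

```python
import json

# Sketch: collect generated (question, cypher) pairs into a JSONL file,
# dropping duplicate questions. The record layout is an assumption;
# adapt it to your fine-tuning framework's expected format.

def save_pairs(pairs: list[dict], path: str) -> int:
    seen, kept = set(), []
    for p in pairs:
        key = p["question"].strip().lower()  # crude duplicate detection
        if key not in seen:
            seen.add(key)
            kept.append({"question": p["question"], "cypher": p["cypher"]})
    with open(path, "w", encoding="utf-8") as f:
        for rec in kept:
            f.write(json.dumps(rec) + "\n")
    return len(kept)  # number of unique pairs written
```

Deduplicating here matters because LLMs asked for many pairs per category tend to repeat near-identical questions, which would bias the fine-tuning data.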