How the LLM Got Lost in the Network and Discovered Graph Reasoning


image created by the author using AI

In a long story format, you have to set a graph for your role. – Sunil Grover

Large Language Models (LLMs) have shown incredible capabilities, and these capabilities have recently been extended beyond text. On the one hand, we have seen multimodal models (e.g., vision-language models); on the other, model capabilities have been extended to skills that require reasoning. For example, we now have models dedicated to solving math problems or writing code.

Recently, however, another type of data has captured the attention of researchers. A great deal of data in the real world can be represented in the form of graphs. For example, social networks are naturally structured as graphs precisely because what matters is the relationships between entities. This is not the only case: in the biomedical sciences it is common to represent molecules, and interactions between proteins, as graphs. The interaction between LLMs and graphs, however, is a much more recent development. A recent line of research has shown how knowledge graphs (or potentially other graphs) can be used in the Retrieval Augmented Generation (RAG) framework, where entities and relationships are retrieved and used as input to an LLM.

The Convergence of Graph and Vector RAGs: A New Era in Information Retrieval

GraphRAG: Combining Retrieval and Summarization

While graphs are increasingly important, research on how LLMs comprehend data in graph form has lagged behind. There has been more focus on the intersection of LLMs and knowledge graphs (KGs) than on LLM understanding of graph data.

image source: [6]

Previous studies have shown that LLMs struggle with structural understanding, so much so that they perform poorly even when they encounter tables. Graphs add a further dimension of complexity.

How do LLMs fare with graphs? Are they capable of understanding structural information?

For example, this study [1] finds that LLMs perform poorly on basic graph tasks (especially when the LLM has to identify whether a cycle or a particular edge exists), doing worse than the baselines the authors chose. One reason is that different graph encoding functions have a significant impact on LLM reasoning. LLMs do not natively take graphs as input, so the graph must be encoded as text in the prompt: encoding it as an adjacency matrix, for instance, favors the model's reasoning on some tasks but undermines it on others. Each encoding exposes different structural information, which in turn affects the model's reasoning ability.
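To make this concrete, here is a minimal sketch (my own illustration, not the exact encoding functions from [1]) of how the same graph can be serialized into text in different ways before being placed in a prompt, using the networkx library:

```python
# Different ways to serialize a graph into text for an LLM prompt.
# These encodings are illustrative choices, not the functions used in [1].
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0)])  # a simple 4-node cycle

def encode_edge_list(g: nx.Graph) -> str:
    """List every edge as a pair of node ids."""
    return "Edges: " + ", ".join(f"({u}, {v})" for u, v in g.edges())

def encode_adjacency(g: nx.Graph) -> str:
    """Describe each node's neighbors, one line per node."""
    return "\n".join(
        f"Node {n} is connected to {sorted(g.neighbors(n))}" for n in g.nodes()
    )

def encode_natural_language(g: nx.Graph) -> str:
    """A verbose, sentence-style description of the same structure."""
    return " ".join(f"Node {u} and node {v} share an edge." for u, v in g.edges())

prompt = (
    "You are given a graph.\n"
    f"{encode_edge_list(G)}\n"
    "Question: does the graph contain a cycle? Answer yes or no."
)
print(prompt)
```

Swapping `encode_edge_list` for one of the other functions changes which structural cues the model sees, which is exactly why the choice of encoding matters.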

image source: [2]

Graph ML: A Gentle Introduction to Graphs

On the other hand, different prompt engineering techniques can improve the LLM's ability to solve some graph tasks. Techniques such as chain-of-thought or few-shot prompting help improve performance, and one can design prompts specific to graph tasks for further improvement [1–2].
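As an illustration, here is a hedged sketch of what zero-shot, few-shot, and chain-of-thought prompt variants for a simple graph task might look like (the templates are my own, not the exact prompts from [1–2]):

```python
# Prompt variants for a toy graph task (node degree).
# Templates are illustrative only, not the prompts used in [1-2].

graph_text = "Edges: (0, 1), (1, 2), (2, 3), (1, 3)"
question = "What is the degree of node 1?"

# Zero-shot: just the graph and the question.
zero_shot = f"{graph_text}\nQ: {question}\nA:"

# Few-shot: prepend a solved example before the real question.
few_shot = (
    "Edges: (0, 1), (0, 2)\n"
    "Q: What is the degree of node 0?\nA: 2\n\n"
    f"{graph_text}\nQ: {question}\nA:"
)

# Chain-of-thought: explicitly ask for intermediate reasoning steps.
chain_of_thought = (
    f"{graph_text}\nQ: {question}\n"
    "A: Let's think step by step. List the edges that contain node 1, "
    "count them, and give the count as the final answer."
)

print(chain_of_thought)
```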

image source: [1]

These prompting techniques work well on simple problems, but their benefit shrinks significantly on complex ones. Therefore, several authors have tried fine-tuning models on graph data [7–8]. Although these approaches are promising, the results still leave considerable room for improvement.

Why do LLMs struggle with structural problems?

We don't really know. One hypothesis is that LLMs struggle with spatial concepts. For animals and humans, building mental maps is essential for interacting with the physical world: humans use these cognitive maps to plan routes, find shortcuts, or decide how to act on their surroundings. These maps also support abstract knowledge and reasoning. An LLM does not interact with the physical world, but according to one theory, humans learn such maps simply from a sequence of observations [3–5]. This study [3] evaluated the spatial understanding of LLMs by designing navigation tasks that require accurately representing the underlying spatial relations (square, hexagonal, and triangular grids, as well as rings and trees). LLMs show some implicit understanding of spatial maps but struggle with complex layouts. The model sometimes fails to grasp relative positions (how to interpret "left" or "right"). Moreover, LLMs are trained on large amounts of text in which spatial awareness is rarely emphasized.

image source: [3]

This lack of spatial understanding directly affects their ability to comprehend graphs, especially in tasks where grasping node arrangement or distance is crucial. In turn, it limits their ability to handle complex graph structures, so they underperform in tasks where graph topology or spatial positioning is essential for accurate analysis.

The question remains open. One of the problems is that we lack good benchmarks for graph reasoning with LLMs. A good benchmark dataset needs two main ingredients: a variety of different topological structures and a variety of different tasks. After all, we want to test models not only on solving tasks but also on their understanding of graph topology.

Recently, some benchmarks have been developed to evaluate the graph reasoning of LLMs. In this work [6], the authors propose a new dataset in which they try to diversify both the topology and the range of tasks. They use different methods to generate the graphs in the dataset (random networks, small-world networks, scale-free networks) and vary several properties, such as direction (undirected, directed), scale (small, medium, and large), and the textual description of the graph (edge list, adjacency table, and adjacency table in natural language).
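The sketch below (my own approximation, not the authors' generator) shows how such a collection of graphs with varied topology, direction, and scale could be produced with networkx:

```python
# A minimal sketch of benchmark graph generation, in the spirit of [6]
# but not the authors' actual code.
import networkx as nx

def sample_graph(kind: str, n: int, directed: bool) -> nx.Graph:
    if kind == "random":           # Erdos-Renyi random network
        g = nx.gnp_random_graph(n, p=0.2, directed=directed)
    elif kind == "small_world":    # Watts-Strogatz small-world network
        g = nx.watts_strogatz_graph(n, k=4, p=0.1)
    elif kind == "scale_free":     # Barabasi-Albert scale-free network
        g = nx.barabasi_albert_graph(n, m=2)
    else:
        raise ValueError(f"unknown graph kind: {kind}")
    # Convert to a directed graph when requested and the generator
    # only produced an undirected one.
    return g.to_directed() if directed and not g.is_directed() else g

graphs = [
    sample_graph(kind, n, directed)
    for kind in ("random", "small_world", "scale_free")
    for n in (8, 20, 50)              # small / medium / large scale
    for directed in (False, True)
]
print(len(graphs), "graphs generated")
```

Each generated graph can then be serialized with one of the textual encodings shown earlier before being turned into a prompt.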

image source: [6]

Countless graph reasoning tasks are possible. Some can be defined at the node level (neighbors, node importance, clustering coefficient, and so on), others at the edge and graph level, for a total of 21 tasks. In addition, reasoning intermediates were generated to help a model with CoT prompts.
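For a handful of these tasks, the ground-truth answers are straightforward to compute with networkx; the snippet below is only an illustration of the idea, not the benchmark's actual label generation, and covers far fewer than the 21 tasks in [6]:

```python
# Ground-truth answers for a few node-, edge- and graph-level tasks.
# Illustrative only; the benchmark in [6] defines 21 tasks in total.
import networkx as nx

g = nx.karate_club_graph()  # example graph, not taken from the benchmark

node_level = {
    "neighbors_of_0": sorted(g.neighbors(0)),
    "degree_of_0": g.degree(0),
    "clustering_of_0": nx.clustering(g, 0),
}
edge_level = {
    "edge_exists_0_1": g.has_edge(0, 1),
}
graph_level = {
    "num_nodes": g.number_of_nodes(),
    "num_edges": g.number_of_edges(),
    "has_cycle": not nx.is_forest(g),
    "shortest_path_0_33": nx.shortest_path_length(g, 0, 33),
}

print(node_level, edge_level, graph_level, sep="\n")
```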

image source: [6]

The authors then fine-tuned an LLM on this dataset. Interestingly, they divided the dataset into in-domain tasks and out-of-domain tasks: the model was trained on almost every task in the dataset except four (the out-of-domain tasks). These four tasks are challenging, require genuine graph comprehension and reasoning to solve, and were chosen to cover node-, edge-, and graph-level aspects. The model is thus trained on one set of tasks but also tested on tasks it has never seen, which it can solve only if it acquired real graph understanding during training. The fine-tuned model was compared with other models of the same size and with closed-source models.
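A minimal sketch of such a split might look as follows; the task names are placeholders, not the exact four tasks held out in [6]:

```python
# In-domain / out-of-domain task split (placeholder task names,
# not the actual held-out tasks from [6]).
all_tasks = [
    "neighbors", "degree", "clustering_coefficient", "edge_exists",
    "shortest_path", "cycle_detection", "connectivity", "diameter",
    # ... the full benchmark covers 21 tasks
]
out_of_domain = {"shortest_path", "cycle_detection", "connectivity", "diameter"}

# Fine-tune only on in-domain tasks; evaluate on both splits afterwards.
in_domain = [t for t in all_tasks if t not in out_of_domain]
```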

The experiments show some interesting results:

  • Smaller LLMs (around 7B parameters) perform poorly on the benchmark, suggesting a lack of capability with graph data.
  • After fine-tuning, the model improves substantially, performing much better than other models of its size and even surpassing larger ones.
  • GPT-4 performs well on some tasks but poorly on others, showing some understanding of graph data alongside severe difficulties.
image source: [6]

The authors also study the generalization capabilities of LLMs with respect to graph data. During training, the model saw only small graphs (few nodes and simple topology). As the model encounters more complex networks, performance decreases with graph size: more complex graphs are harder to reason about. Still, a model exposed to graph data during fine-tuning performs better than an unexposed one.

image source: [6]

Despite these encouraging results, the model fails to generalize to the out-of-domain tasks. It is thus unable to generalize beyond the data it has seen, revealing serious reasoning limitations.

image source: [6]

According to the authors, therefore, providing graph data allows the model to gain some graph understanding. Up to this point, the model had been trained only on the graph and the final answer. In a final experiment, they add the reasoning intermediates for each question and ask whether this improves the model's understanding. They also add a mask to make the information they want the model to learn from the intermediate steps more prominent. Adding these intermediates yields sensible improvements on tasks the model previously struggled with.
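The paper's exact masking scheme is not reproduced here, but the general idea can be sketched with a standard supervised fine-tuning pattern: the prompt tokens are masked out of the loss so that the gradient flows only through the reasoning intermediates and the final answer. The snippet below assumes a Hugging Face-style tokenizer and is purely illustrative:

```python
# Label masking for supervised fine-tuning of a causal LM.
# NOT the exact masking used in [6]; only an illustration of restricting
# the loss to the intermediate reasoning steps and the final answer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

prompt = "Edges: (0, 1), (1, 2). Q: What is the degree of node 1? A:"
reasoning_and_answer = (
    " Node 1 appears in edges (0, 1) and (1, 2), so its degree is 2."
)

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(reasoning_and_answer, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + target_ids
# -100 is ignored by the cross-entropy loss, so only the reasoning and
# answer tokens contribute to the gradient.
labels = [-100] * len(prompt_ids) + target_ids

assert len(input_ids) == len(labels)
```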

image source: [6]

In addition, when the model is trained with the intermediate steps, it can produce correct reasoning (not only the right answer but also correct intermediate steps). According to the authors, when these reasoning steps are not provided, the model acquires only a shallow understanding of graph data and cannot produce a correct reasoning chain or an explanation of the process.

image source: [6]

Graphs are everywhere, from biology to finance, from road networks to social networks. What's more, graphs and LLMs today have an increasingly close relationship: knowledge graphs are increasingly used as a source of context for LLMs. Despite this, we know little about how much LLMs understand graphs.

Recent studies show that LLMs have little understanding of graphs and do not shine at graph reasoning. We can highlight two main reasons for these limitations. The first is that models are trained in an autoregressive manner on large amounts of text, and spatial relationships are hard to learn from text corpora alone. Humans learn to navigate abstract concepts such as graphs through interaction with the world around them; this allows them to build and internalize mental maps that are later reused beyond the physical world. The second reason is that there is little graph data in training datasets. Including graph data in the training sets improves models' graph understanding, and providing them with reasoning intermediates enables LLMs to significantly improve at graph reasoning tasks.

The fact that LLMs fail on out-of-distribution tasks means there are still aspects we do not understand, and we still do not know how to overcome this limitation on their ability to generalize. As the synergy between knowledge graphs and LLMs grows ever closer, a larger proportion of graph data should be added to training datasets, fostering better graph reasoning capabilities. At the same time, it would be important to deepen our understanding of how LLMs comprehend graphs.

What are your thoughts on this? Let me know in the comments


If you have found this interesting:

You can look for my other articles, and you can also connect or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

AI Won't Steal Your Job – But Get Ready for the World's Most Annoying Coworker

DeepMind's AlphaProteo: Revolutionizing Protein Design with Machine Learning

Sometimes Noise is Music: How Beneficial Noise Can Improve Your RAG

Forever Learning: Why AI Struggles with Adapting to New Challenges

Reference

Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.

  1. Fatemi, 2024, Talk like a Graph: Encoding Graphs for Large Language Models, link
  2. Guo, 2023, GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking, link
  3. Yamada, 2023, Evaluating Spatial Understanding of Large Language Models, link
  4. Whittington, 2022, How to build a cognitive map, link
  5. Garvert, 2017, A map of abstract relational knowledge in the human hippocampal–entorhinal cortex, link
  6. Luo, 2024, GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability, link
  7. Chai, 2023, GraphLLM: Boosting Graph Reasoning Ability of Large Language Model, link
  8. Tang, 2024, GraphGPT: Graph Instruction Tuning for Large Language Models, link

Tags: Artificial Intelligence Graph Machine Learning Large Language Models Machine Learning multimodal-ai
