Graph Neural Networks Part 1. Graph Convolutional Networks Explained


Data doesn't always fit neatly into rows and columns. Instead, data often follows a graph structure: take for example social networks, protein structures, recommendation systems, or transportation systems. Leaving information about the graph topology out of a machine learning model can drastically decrease its performance. Luckily, there is a way to include this information.

Graph Neural Networks (GNNs) are designed to learn from data represented as nodes and edges. GNNs have evolved over the years, and in this post you will learn about Graph Convolutional Networks (GCNs). My next post will cover Graph Attention Networks (GATs). GCNs and GATs are two fundamental architectures on which many current state-of-the-art models are based, so if you want to learn about GNNs, this is a good place to start. Let's dive in!

New to graphs? The first part of this post (Why Graph Neural Networks?) explains the basics of graphs. You should also be familiar with neural networks (a short recap is provided in the Neural Network Recap section).

Undirected graph with nodes (or vertices) and edges (or links). The graph doesn't contain self loops. The adjacency matrix shows the connections (edges) between nodes. Image by author.

Why Graph Neural Networks?

A GNN is a type of neural network specifically designed to process and analyze data structured as graphs. Unlike traditional neural networks that operate on regular grid-like data (like sequences for natural language processing or grids for image data), GNNs are built to work with graph data. Each node in a graph is associated with a feature vector (or embedding) that captures its attributes or characteristics. The GNN updates these embeddings by aggregating information from neighboring nodes and edges. We will dive deeper into this in the following paragraphs of the post.

Why do we need GNNs? In essence, GNNs allow us to leverage the rich relational data present in graphs to make predictions, detect patterns, and gain insights from complex systems that are not easily represented by traditional data formats. GNNs excel at leveraging the connections between entities in a graph, capturing both local and global relationships. As we will see in this post, GNNs can outperform traditional models in tasks where relational data is crucial, offering more accurate predictions with less data.

There are three common types of prediction tasks in graphs:

  • You can predict at the graph level. The input of the model is many different graphs, and every graph gets one prediction. An example is classifying molecules: every molecule is represented by one graph, and every molecule needs a prediction. Another example is image classification. Yes, images can also be represented as graphs (as you can see in the image below)!
  • Another way to use GNNs is by predicting at the node level. The input of the GNN is one graph, and every node needs a prediction. This prediction is a characteristic of the node. The demo in this post will show node classification. Node regression is of course possible as well! Compared to classification, you only need to change the output layer activation function, the loss function, the evaluation metric, and obviously the target.
  • Finally, we can predict at the edge level. Either the value of an edge is predicted, or the likelihood that an edge will appear. An example is recommending friends on social media (a.k.a. link prediction).

All these prediction tasks can be solved with GNNs!

From image to graph. Neighbor pixels are connected with an edge. Image by author.
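To make the image-to-graph idea concrete, here is a minimal sketch (the tiny 2x2 grid and 4-connectivity are my own choices for illustration) that treats every pixel as a node and connects neighboring pixels with edges, in the [2, num_edges] edge_index format that PyTorch Geometric (used later in this post) expects:

import torch

height, width = 2, 2  # a tiny 2x2 "image"
edges = []
for r in range(height):
    for c in range(width):
        node = r * width + c
        if c + 1 < width:   # edge to the right neighbor
            edges.append((node, node + 1))
        if r + 1 < height:  # edge to the bottom neighbor
            edges.append((node, node + width))

# the graph is undirected, so add both directions of every edge
edge_index = torch.tensor(edges + [(b, a) for a, b in edges]).t()
print(edge_index.shape)  # torch.Size([2, 8])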

Dataset and Prerequisites

If you want to run the code yourself, you have to create a new environment and install the dependencies:

Step 1. Create a new environment with Poetry:

poetry init --no-interaction

Step 2. Add the dependencies:

poetry add torch torchvision torchaudio torch-geometric matplotlib scikit-learn

Step 3. Activate the environment:

poetry shell
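To check that the environment works, you can run a quick sanity check (the printed versions will of course depend on your installation):

poetry run python -c "import torch, torch_geometric; print(torch.__version__, torch_geometric.__version__)"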

A pyproject.toml file is created with all the dependencies for the project, making sure it's easy to reproduce. Mine looks like this:

[tool.poetry]
name = "gnnsdemo"
version = "0.1.0"
description = ""
authors = ["Hennie <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.9"
torch = "^2.4.1"
torchvision = "^0.19.1"
torchaudio = "^2.4.1"
torch-geometric = "^2.6.1"
matplotlib = "^3.9.2"
scikit-learn = "^1.5.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api" 

The Cora dataset is a benchmark dataset for graph neural networks. It contains data about 2708 scientific publications. These publications are the nodes of the graph. An edge between two nodes (publications) is created when one publication cites the other. The target is to predict the subject of each paper; there are seven classes in total. Here's how you can load the dataset:

from torch_geometric.datasets import Planetoid

dataset = Planetoid(root=".", name="Cora")
data = dataset[0]
num_labels = len(set(data.y.numpy()))  # used for output_dim
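To get a feel for the dataset, you can inspect the Data object (these attributes are standard in PyTorch Geometric):

print(data)
print(f"Nodes: {data.num_nodes}, Edges: {data.num_edges}")
print(f"Node features: {data.num_node_features}, Classes: {num_labels}")
print(f"Training nodes: {int(data.train_mask.sum())}")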

I created multiple graph datasets myself for testing, only to find out something surprising… I'll get back to that.

Neural Network Recap

A neural network is composed of layers: you have an input layer, then hidden layers, and finally the output layer. Each layer consists of neurons, and they are connected to the next layer via weights. The goal of training a neural network is to optimize the weights using the training data. At each neuron, the inputs are multiplied by weights, summed, and transformed by an activation function. The output layer produces the final prediction. Other important parts of a neural network are the loss function, which measures the difference between the predicted and actual outcome (the error), and backpropagation, which updates the weights of the network (with an optimizer like Adam).

Recap of a basic neural network. Image by author.

Let's train a normal neural network on the dataset. For this, we don't use any graph information; we just use the features of every node to predict the target. The following code shows the class for the neural network; it has two linear layers (one hidden layer).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lin1 = nn.Linear(input_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, data):
        x = data.x  # no graph structure, only node features
        x = F.relu(self.lin1(x))
        x = self.lin2(x)
        return F.log_softmax(x, dim=1)

We use the following code for training and evaluating the models (with a small helper to compute the accuracy in percent):

import torch.optim as optim

# helper: accuracy in percent
def accuracy(pred, target):
    return (pred == target).float().mean().item() * 100

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
results = {}

# iterate over the different model types
for model_class in [MLP]:  # later we also test GCN (this post) and GAT (next post)
    results[model_class.__name__] = []
    for i in range(10):
        print(f"Training {model_class.__name__} iteration {i+1}")

        # the output_dim is the number of unique classes in the set
        model = model_class(input_dim=data.x.shape[1], hidden_dim=32, output_dim=num_labels).to(device)
        optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

        # deal with the class imbalance: weight classes by inverse frequency
        class_weights = torch.bincount(data.y) / len(data.y)
        # the models return log-probabilities (log_softmax), so NLLLoss is the matching loss
        loss_fn = nn.NLLLoss(weight=1/class_weights).to(device)

        data = data.to(device)

        # training loop
        for epoch in range(100):
            model.train()
            optimizer.zero_grad()
            out = model(data)

            # calculate loss
            train_loss = loss_fn(out[data.train_mask], data.y[data.train_mask])
            acc = accuracy(out[data.train_mask].argmax(dim=1), data.y[data.train_mask])
            train_loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                model.eval()
                with torch.no_grad():
                    val_loss = loss_fn(out[data.val_mask], data.y[data.val_mask])
                    val_acc = accuracy(out[data.val_mask].argmax(dim=1), data.y[data.val_mask])
                    print(f'Epoch {epoch} | Training Loss: {train_loss.item():.2f} | Train Acc: {acc:>5.2f} | Validation Loss: {val_loss.item():.2f} | Validation Acc: {val_acc:>5.2f}')

        # final evaluation on the test set
        model.eval()
        with torch.no_grad():
            out = model(data)
            test_loss = loss_fn(out[data.test_mask], data.y[data.test_mask])
            test_acc = accuracy(out[data.test_mask].argmax(dim=1), data.y[data.test_mask])
            print(f'{model_class.__name__} Test Loss: {test_loss.item():.2f} | Test Acc: {test_acc:>5.2f}')
            results[model_class.__name__].append([acc, val_acc, test_acc])

# print average on test set and standard deviation
for model_name, model_results in results.items():
    model_results = torch.tensor(model_results)
    print(f'{model_name} Test Accuracy: {model_results[:, 2].mean():.2f} ± {model_results[:, 2].std():.2f}')

So we train 10 times and calculate the average accuracy and standard deviation. The output for the MLP class is:

MLP Test Accuracy: 54.35 ± 1.06

Let's see if we can improve this result with a Graph Convolutional Network!

Graph Convolutional Networks

How can we add graph information to the basic neural network? To understand one node, we need to look at its neighborhood and include that information in the model. Remember that a linear layer of a normal neural network is written as:

H = X W

X is the input matrix, W is the weight matrix, and H is the output. For simplicity we leave the biases out.

In graphs, there are different ways to define the edges. One way is with an adjacency matrix A. By multiplying the adjacency matrix with the input matrix X, the features of each node's neighbors are summed. In the following example calculations, we will look at the following graph (with corresponding adjacency matrix):

Graph with adjacency matrix. Image by author.
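As a quick sanity check, here is a small sketch (the 3-node graph below is my own toy example, not necessarily the one from the figure) showing that A @ X sums the features of each node's neighbors:

import torch

# 3 nodes: node 0 is connected to nodes 1 and 2
A = torch.tensor([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])
X = torch.tensor([[1., 0.],
                  [0., 2.],
                  [3., 1.]])  # 2 features per node

print(A @ X)
# row 0 = features of nodes 1 and 2 summed: [3., 3.]
# rows 1 and 2 = features of node 0: [1., 0.]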

But I can hear you thinking: what about the features of the node itself? Yes, we also need to include those, and we can do this by adding self-loops (edges from a node to itself). The adjacency matrix needs ones on the diagonal to represent these loops, so we add an identity matrix to the adjacency matrix:

A' = A + I

Adjacency matrix + identity matrix = adapted adjacency matrix.

Now we can plug this adapted matrix into the linear layer:

H = A' X W

And now we have a graph linear layer! We can replace the normal linear layer with this one, and we have our first graph neural network. But there is one important step we should take before actually testing it: normalization. Without normalization, nodes with more connections (e.g. one node having 10 neighbors vs. another with just 1) can dominate the learning process. The node with 10 neighbors aggregates far more information than the one with 1, leading to imbalance and unstable learning. Normalization ensures that each node's contribution is appropriately scaled, so the network learns from the graph structure rather than being skewed by uneven degree distributions.

How does the normalization step work? In GNNs it's common to use symmetric normalization. The idea is to normalize each node's aggregated features by the square root of its degree (the number of neighbors, including itself because of the self-loop). This ensures that nodes with different degrees contribute comparably during aggregation. First we need the degree matrix D:

D' = D + I

Degree matrix + identity matrix (for the self-loops) gives the adapted degree matrix we can use for normalization.

We can use the inverse of D' to normalize the adapted adjacency matrix, which averages the features over each node's neighborhood:

D'^(-1) A'

We have multiple options for normalization. The version above normalizes across rows; we could also normalize across columns:

A' D'^(-1)

But instead of normalizing across rows or columns, it's better to balance between the two (symmetric normalization). This ensures that both the source and destination node of an edge are treated equally, which prevents biases toward high-degree nodes and helps keep the learning process stable. The symmetrically normalized (adapted) adjacency matrix is computed like this:

D'^(-1/2) A' D'^(-1/2)

And if we combine everything, we get the linear layer of the GCN:

H = D'^(-1/2) A' D'^(-1/2) X W
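Here is a minimal from-scratch sketch of this layer in PyTorch (using the same toy 3-node graph as before; this is for intuition only, in practice you would use an optimized implementation like the one below):

import torch

A = torch.tensor([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])
X = torch.tensor([[1., 0.],
                  [0., 2.],
                  [3., 1.]])
W = torch.randn(2, 4)  # weight matrix: 2 input features, 4 output features

A_adapted = A + torch.eye(A.shape[0])          # A' = A + I (add self-loops)
deg = A_adapted.sum(dim=1)                     # node degrees, including self-loops
D_inv_sqrt = torch.diag(deg.pow(-0.5))         # D'^(-1/2)
A_norm = D_inv_sqrt @ A_adapted @ D_inv_sqrt   # symmetric normalization

H = A_norm @ X @ W                             # the GCN linear layer
print(H.shape)  # torch.Size([3, 4])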

Let's test this! We can train the GCN on the dataset and calculate the accuracy, just like we did before with the MLP model. In PyTorch Geometric, we can use the GCNConv layer for this:

from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

Now we train the model with the same code as before; we only need to add GCN to the list of model classes.

The output is:

GCN Test Accuracy: 78.76 ± 0.38

Wow, that looks amazing! The accuracy with the normal neural network was 54.35, and now it's 78.76! The only things we changed were adding the edge information and swapping the layers of the model.

Considerations

The example above works out nicely, and because of that you might think you should switch from a normal neural network to a GCN for every dataset you can put into a graph. But it's important to be aware that such impressive gains don't occur all the time. I tested with multiple graph datasets, some I created myself (based on a beer review dataset and an Amazon review dataset), and others benchmark datasets (CiteSeer, PubMed, OGB). For most of the benchmark datasets the results improved when switching from a normal neural net to a graph neural net, but for some of them, and for the datasets I created myself, the results didn't improve. Below are some considerations related to graph neural networks:

  • The graph structure should really make a difference for the problem you are trying to solve. The structure should be meaningful for the prediction task at hand. Testing is important here: you can try to formulate the graph in different ways to see if one formulation works better than another.
  • Training a graph neural network takes more time than training a normal neural network. So if the results improve only a little and training time is important, the normal neural network can be the better choice. Also, the effectiveness of different types of graph neural networks (GCN, GAT, GraphSAGE) can vary greatly depending on the problem.
  • Traditional neural networks can be efficiently batched during training. For graph neural networks, batching is harder because nodes have different numbers of neighbors, resulting in potentially uneven mini-batches. Efficient sampling techniques (like the neighbor sampling introduced by GraphSAGE) are necessary for scalability; see the sketch after this list.
  • Just like in standard neural networks, transfer learning (pre-training a GNN on a large dataset and fine-tuning on the target dataset) can be effective for GNNs. Checking for available pre-trained models for your task can be valuable.
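As an illustration of neighbor sampling, here is a sketch using PyTorch Geometric's NeighborLoader (the fan-out of 10 neighbors per hop and the batch size of 128 are arbitrary choices for this example):

from torch_geometric.loader import NeighborLoader

# sample up to 10 neighbors per node at each of 2 hops,
# building mini-batches around the training nodes
loader = NeighborLoader(
    data,
    num_neighbors=[10, 10],
    batch_size=128,
    input_nodes=data.train_mask,
)

for batch in loader:
    print(batch)  # each batch is a small subgraph around the seed nodes
    break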

Conclusion

As we've seen, simply adding graph information to a basic neural network can dramatically boost performance, as was the case when we moved from a normal neural network to a GCN for the Cora dataset. By aggregating information from neighboring nodes, GCNs can provide a richer representation of the data, leading to more accurate predictions. But, it's crucial to remember that GNNs aren't a magic bullet for every problem. The graph structure must be truly meaningful to the prediction task, and the increase in training complexity might not always justify the performance boost, especially when training time is critical.

Experimenting with different graph formulations and GNN architectures, like Graph Attention Networks (GATs), can lead to even better results depending on your dataset. In the next post, we'll dive deeper into GATs and see how attention mechanisms can enhance GNNs by learning to weigh the importance of neighboring nodes.

Related

Graph Neural Networks Part 2. Graph Attention Networks vs. GCNs

Optimizing Connections: Mathematical Optimization within Graphs

Mathematical Optimization Heuristics Every Data Scientist Should Know

Tags: Editors Pick, Graph Convolution Network, Graph Neural Networks, Graph Theory, node-classification
