Beyond Transformers with PyNeuraLogic


TOWARDS DEEP RELATIONAL LEARNING

Demonstrating the power of neuro-symbolic programming

Visualization of the attention computation graph from the perspective of one token, with visible relationships between tokens. Image by the author.

In the last few years, we have seen a rise of Transformer¹-based models with successful applications in many fields, such as Natural Language Processing or Computer Vision. In this article, we will explore a concise, explainable, and extendable way to express deep learning models, specifically Transformers, as a hybrid architecture, i.e., by marrying deep learning with symbolic artificial intelligence. To do so, we will implement the models in a Python neuro-symbolic framework called PyNeuraLogic (the author is a co-author of the framework).


"We cannot construct rich cognitive models in an adequate, automated way without the triumvirate of hybrid architecture, rich prior knowledge, and sophisticated techniques for reasoning."

  • Gary Marcus²

Combining symbolic representation with deep learning fills the gaps in current deep learning models, such as out-of-the-box explainability or missing techniques for reasoning. Perhaps raising the number of parameters is not the soundest approach to achieving these desired results, just like increasing the number of camera megapixels does not necessarily yield better photos.

High-level visualization of the neuro-symbolic concept Lifted Relational Neural Networks³ (LRNN), which (Py)NeuraLogic implements. Here we show a simple template (logic program) with one linear layer followed by a sum aggregation. For each (input) sample a unique neural network is constructed. Image by the author.

The PyNeuraLogic framework is based on logic programming with a twist – logic programs hold differentiable parameters. The framework is well-suited for smaller structured data, such as molecules, and complex models, such as Transformers and Graph Neural Networks. On the other hand, PyNeuraLogic is not the best choice for non-relational and large tensor data.

The key component of the framework is a differentiable logic program that we refer to as a template. A template consists of logic rules that define the structure of neural networks in an abstract way – we can think of a template as a blueprint of the model's architecture. The template is then applied to each input data instance to produce (via grounding and neuralization) a neural network unique to the input sample. This process is entirely different from other frameworks with predefined architectures that cannot adjust themselves to different input samples. For a closer introduction to the framework, see, e.g., a previous article on PyNeuraLogic from the perspective of Graph Neural Networks.
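To give a concrete flavor of this workflow, here is a minimal sketch of defining and building such a template in plain Python. The relation names, the weight shape, and the default Settings below are illustrative assumptions, not part of the models discussed in this article:

from neuralogic.core import Template, R, V, Settings

# A tiny template mirroring the LRNN example above: one weighted ("linear")
# rule followed by an aggregation of all h(X) into a single output.
# The relation names (h, feature) and the 1x3 weight shape are placeholders.
template = Template()
template += R.h(V.X) <= R.feature(V.X)[1, 3]   # apply a learnable 1x3 weight to every object X
template += R.output <= R.h(V.X)               # aggregate the results over all X

# Building grounds the template against each input sample, producing
# a sample-specific neural network (exact calls may differ across versions).
model = template.build(Settings())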


Symbolic Transformers

The Transformer architecture consists of two blocks – encoder (left) and decoder (right). Both blocks share similarities – the decoder is an extended encoder; therefore, we will focus only on the encoder, as the decoder implementation is analogous. Image by the author, inspired by [1].

We generally tend to implement deep learning models as tensor operations over input tokens batched into one large tensor. This makes sense because deep learning frameworks and hardware (e.g., GPUs) are typically optimized for processing larger tensors instead of multiple ones of diverse shapes and sizes. Transformers are no exception, and it is common to batch individual token vector representations into one large matrix and represent the model as operations over such matrices. Nevertheless, such implementations hide how individual input tokens relate to each other, as can be demonstrated in Transformer's attention mechanism.


The Attention Mechanism

The attention mechanism forms the very core of all the Transformer models. Specifically, its classic version makes use of a so-called multi-head scaled dot-product attention. Let us decompose the scaled dot-product attention with one head (for clarity) into a simple logic program.

The scaled dot-product attention equation: Attention(Q, K, V) = softmax(QK^T / √d_k) · V

The purpose of the attention is to decide which parts of the input the network should focus on. The attention does that by computing a weighted sum of the values V, where the weights represent the compatibility of the input keys K and queries Q. In this specific version, the weights are computed by the softmax function of the dot product of queries Q and keys K, divided by the square root of the input feature vector dimensionality d_k.
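Before decomposing it into logic rules, it may help to recall what this computation looks like in a plain dense (tensor) form – a minimal single-head sketch, with PyTorch assumed here purely for illustration:

import torch

def scaled_dot_product_attention(q, k, v):
    # Dense single-head attention: softmax(q k^T / sqrt(d_k)) v
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # compatibility of every query with every key
    return torch.softmax(scores, dim=-1) @ v        # weighted sum of the values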

(R.weights(V.I, V.J) <= (R.d_k, R.k(V.J).T, R.q(V.I))) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.attention(V.I) <= (R.weights(V.I, V.J), R.v(V.J))) | [F.product],

In PyNeuraLogic, we can fully capture the attention mechanism with the above logical rules. The first rule expresses the computation of the weights – it calculates the product of the inverse square root of the dimensionality with a transposed j-th key vector and an i-th query vector. Then we aggregate all the results for a given i and all possible j's with softmax.

The second rule then calculates a product between this weight vector and the corresponding j-th value vector and sums up the results across different j's for each respective i-th token.


Attention Masking

During training and evaluation, we usually restrict which tokens an input token can attend to. For example, we want to prevent tokens from looking ahead and attending to upcoming words. Popular frameworks, such as PyTorch, implement this via masking, that is, by setting a subset of the elements of the scaled dot-product result to some very low negative number. Those numbers force the softmax function to assign zero weight to the corresponding token pairs.
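For comparison, here is a minimal tensor-side sketch of such causal masking (PyTorch assumed, hypothetical helper name) – note that the full score matrix is computed before any of it is thrown away:

import torch

def causally_masked_attention(q, k, v):
    # Compute the full score matrix first, then override "future" positions
    # (j > i) with -inf so that softmax assigns them zero weight.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    mask = torch.tril(torch.ones(scores.shape[-2:])).bool()   # keep only pairs with j <= i
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v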

(R.weights(V.I, V.J) <= (
    R.d_k, R.k(V.J).T, R.q(V.I), R.special.leq(V.J, V.I)
)) | [F.product, F.softmax_agg(agg_terms=[V.J])],

With our symbolic representation, we can implement this by simply adding one body relation serving as a constraint. When calculating the weights, we restrict the j index to be less than or equal to the i index. In contrast to the masking, we compute only the needed scaled dot products.

Regular deep learning frameworks constrain the attention via masking (on the left). First, the whole QK^T matrix is calculated, then the values are masked by overriding with low values (white crossed cells) to simulate attending only to the relevant tokens (blue cells). In PyNeuraLogic, we compute only needed scalar values by applying a symbolic constraint (on the right) – hence there are no redundant calculations. This benefit is even more significant in the following attention versions. Image by the author.

Beyond standard Attention aggregation

Of course, the symbolic "masking" can be completely arbitrary. Most of us have heard of GPT-3⁴ (or its applications, such as ChatGPT), which is based on Sparse Transformers.⁵ The Sparse Transformer's attention (the strided version) has two types of attention heads:

  • One that attends only to the previous n tokens (0 ≤ i − j ≤ n)
  • One that attends only to every n-th previous token ((i − j) % n = 0)

The implementation of both types of heads again requires only minor changes (e.g., for n = 5).

(R.weights(V.I, V.J) <= (
    R.d_k, R.k(V.J).T, R.q(V.I),
    R.special.leq(V.D, 5), R.special.sub(V.I, V.J, V.D),
)) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.weights(V.I, V.J) <= (
    R.d_k, R.k(V.J).T, R.q(V.I),
    R.special.mod(V.D, 5, 0), R.special.sub(V.I, V.J, V.D),
)) | [F.product, F.softmax_agg(agg_terms=[V.J])],

The Relational Attention equations

We can go even further and generalize the attention for graph-like (relational) inputs, just like in Relational Attention.⁶ This type of attention operates on graphs, where nodes attend only to their neighbors (nodes connected by an edge). Queries Q, keys K, and values V are then edge embeddings summed with node vector embeddings.

(R.weights(V.I, V.J) <= (R.d_k, R.k(V.I, V.J).T, R.q(V.I, V.J))) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.attention(V.I) <= (R.weights(V.I, V.J), R.v(V.I, V.J))) | [F.product],

R.q(V.I, V.J) <= (R.n(V.I)[W_qn], R.e(V.I, V.J)[W_qe]),
R.k(V.I, V.J) <= (R.n(V.J)[W_kn], R.e(V.I, V.J)[W_ke]),
R.v(V.I, V.J) <= (R.n(V.J)[W_vn], R.e(V.I, V.J)[W_ve]),

This type of attention is, in our case, again almost the same as the previously shown scaled dot-product attention. The only difference is the addition of extra terms to capture the edges. Feeding a graph as input into the attention mechanism seems quite natural, which is not entirely surprising, considering that the Transformer is a type of Graph Neural Network, acting on fully-connected graphs (when no masking is applied). In the traditional tensor representation, this is not that obvious.
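For illustration, the relational input itself would then be nothing more than a set of node and edge facts written in the same notation – a hypothetical three-node graph with placeholder feature values (the exact way of passing such samples to the framework is not covered here):

from neuralogic.core import R

# A hypothetical input graph: node embeddings n(i) and edge embeddings e(i, j)
# are plain (ground) facts with attached placeholder values.
example = [
    R.n(0)[[0.2, 1.0]], R.n(1)[[0.5, 0.1]], R.n(2)[[0.9, 0.4]],
    R.e(0, 1)[[1.0, 0.0]], R.e(1, 2)[[0.0, 1.0]],
]

With these facts, the rules above ground only for the pairs (0, 1) and (1, 2) – each node attends solely to its neighbors, and no attention weights are ever computed for unconnected pairs.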


The Transformer Encoder

Now that we have showcased the implementation of the attention mechanism, the remaining pieces needed to construct an entire Transformer encoder block are relatively straightforward.

Embeddings

We have already seen how to implement embeddings in the Relational Attention example. For the traditional Transformer, the embeddings are quite similar – we project each input vector into three embedding vectors: keys, queries, and values.

R.q(V.I) <= R.input(V.I)[W_q],
R.k(V.I) <= R.input(V.I)[W_k],
R.v(V.I) <= R.input(V.I)[W_v],

Skip connections, Normalization, and Feed-forward Network

Query embeddings are summed with the attention's output via a skip connection. The resulting vector is then normalized and passed into a multilayer perceptron (MLP).

(R.norm1(V.I) <= (R.attention(V.I), R.q(V.I))) | [F.norm],

For the MLP, we will implement a fully connected neural network with two hidden layers, which can be elegantly expressed as one logic rule.

(R.mlp(V.I)[W_2] <= (R.norm1(V.I)[W_1])) | [F.relu],

The last skip connection with normalization is then identical to the previous one.

(R.norm2(V.I) <= (R.mlp(V.I), R.norm1(V.I))) | [F.norm],

Putting it all together

We have built all the necessary parts to construct a Transformer encoder. The decoder utilizes the same components; therefore, its implementation would be analogous. Let us combine all the blocks into one differentiable logic program that can be embedded into a Python script and compiled into Neural Networks with PyNeuraLogic.

# Projections of the input tokens into query, key, and value embeddings
R.q(V.I) <= R.input(V.I)[W_q],
R.k(V.I) <= R.input(V.I)[W_k],
R.v(V.I) <= R.input(V.I)[W_v],

# Scaled dot-product attention
R.d_k[1 / math.sqrt(embed_dim)],
(R.weights(V.I, V.J) <= (R.d_k, R.k(V.J).T, R.q(V.I))) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.attention(V.I) <= (R.weights(V.I, V.J), R.v(V.J))) | [F.product],

# Skip connections, normalization, and the feed-forward network
(R.norm1(V.I) <= (R.attention(V.I), R.q(V.I))) | [F.norm],
(R.mlp(V.I)[W_2] <= (R.norm1(V.I)[W_1])) | [F.relu],
(R.norm2(V.I) <= (R.mlp(V.I), R.norm1(V.I))) | [F.norm],
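A hedged sketch of how these rules might then be wrapped into a template and compiled into a model – the embedding dimensionality, the weight shapes, and the build call below are assumptions for illustration, not the article's exact setup:

import math
from neuralogic.core import Template, R, V, Settings

embed_dim = 128                                 # an assumed embedding dimensionality
W_q = W_k = W_v = (embed_dim, embed_dim)        # placeholder weight shapes
W_1 = W_2 = (embed_dim, embed_dim)

template = Template()
template += [
    R.q(V.I) <= R.input(V.I)[W_q],
    R.k(V.I) <= R.input(V.I)[W_k],
    R.v(V.I) <= R.input(V.I)[W_v],
    R.d_k[1 / math.sqrt(embed_dim)],
    # ... followed by the attention, normalization, and MLP rules listed above ...
]

# Building grounds the template against each input sequence, yielding one
# neural network per sample (the exact build/evaluator call may vary by version).
model = template.build(Settings())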

Conclusion

In this article, we analysed the Transformer architecture and demonstrated its implementation in a neuro-symbolic framework called PyNeuraLogic. With this approach, we were able to implement various types of Transformers with only minor changes in the code, illustrating how one can quickly pivot and develop novel Transformer architectures. The exercise also highlights the unmistakable resemblance between the various versions of Transformers, and between Transformers and GNNs.


[1]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.

[2]: Marcus, G. (2020). The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence.

[3]: Šourek, G., Železný, F., & Kuželka, O. (2021). Beyond Graph Neural Networks with Lifted Relational Neural Networks. Machine Learning, 110(7), 1695–1738.

[4]: Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners.

[5]: Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers.

[6]: Diao, C., & Loynd, R. (2022). Relational Attention: Generalizing Transformers for Graph-Structured Tasks.


The author would like to thank Gustav Šír for proofreading this article and giving valuable feedback. If you want to learn more about combining logic with deep learning, head to Gustav's article series.
