The Map Of Transformers



Fig. 1. Isometric map. Designed by vectorpocket / Freepik.

1. Introduction

The pace of research in deep learning has accelerated significantly in recent years, making it increasingly difficult to keep abreast of all the latest developments. Nevertheless, one line of investigation has garnered particular attention because of its demonstrated success across a diverse range of domains, including natural language processing, computer vision, and audio processing, due in large part to its highly adaptable architecture. This model is the Transformer, and it builds on an array of mechanisms and techniques from the field, most notably attention. You can read more about its building blocks and their implementation, along with multiple illustrations, in the following articles:

Transformers in Action: Attention Is All You Need

The article below goes into more detail about the attention mechanisms I will be referring to throughout this post:

Rethinking Thinking: How Do Attention Mechanisms Actually Work?


2. Taxonomy of the Transformers

A comprehensive range of models building on the vanilla Transformer has been explored to date; these can broadly be broken down into three categories:

  • Architectural modifications
  • Pretraining methods
  • Applications
Fig. 2. Categories of Transformer modifications. Image by author.

Each category above contains several sub-categories, which I will investigate thoroughly in the next sections. Fig. 2 illustrates the categories along which researchers have modified Transformers.


3. Attention

Self-attention plays a central role in the Transformer; however, it suffers from two main disadvantages in practice [1].

  1. Complexity: For long sequences, this module becomes a bottleneck, since its computational complexity is O(T²·D), where T is the sequence length and D is the representation dimension (a minimal sketch of this quadratic cost follows this list).
  2. Structural prior: Self-attention assumes no structural bias over its inputs; any structure, such as the order of tokens in a sequence, has to be injected via additional mechanisms before the model can learn it.
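
To make the quadratic cost concrete, here is a minimal PyTorch sketch of standard (full) scaled dot-product attention; the tensor sizes are illustrative, with T and D playing the roles of the sequence length and representation dimension from the complexity estimate above:

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    """Standard scaled dot-product attention.

    q, k, v: tensors of shape (T, D). The score matrix has shape
    (T, T), so time and memory grow quadratically with T.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (T, T) -- the O(T^2 * D) bottleneck
    weights = F.softmax(scores, dim=-1)          # row-wise normalization
    return weights @ v                           # (T, D)

# Illustrative sizes: a 4096-token sequence with 64-dimensional heads
T, D = 4096, 64
q, k, v = (torch.randn(T, D) for _ in range(3))
out = full_attention(q, k, v)  # the score matrix alone holds 4096 x 4096 entries
```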
Fig. 3. Categories of attention modifications and example papers. Image by author.

Therefore, researchers have explored various techniques to overcome these drawbacks.

  1. Sparse attention: This technique lowers the computation time and memory requirements of the attention mechanism by taking only a subset of the input positions into account instead of the entire sequence, producing a sparse attention matrix instead of a full one.
  2. Linearized attention: By disentangling the attention matrix with kernel feature maps, this method computes attention in reverse order (multiplying keys and values first), reducing the resource requirements to linear complexity (see the sketch after this list).
  3. Prototype and memory compression: This line of modification reduces the number of queries and key-value pairs to obtain a smaller attention matrix, which in turn reduces the time and memory complexity.
  4. Low-rank self-attention: This approach explicitly models the low-rank property of the self-attention matrix, either through parameterization or by replacing it with a low-rank approximation, to improve the efficiency and performance of the Transformer.
  5. Attention with prior: This approach combines the attention distribution computed from the inputs with prior attention distributions obtained from other sources.
  6. Modified multi-head mechanism: This research direction covers the various ways of modifying and improving the multi-head mechanism itself.
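
As a rough illustration of the linearized attention idea from item 2 above, the following is a minimal sketch (not the method of any specific paper) that uses the elu(x) + 1 feature map, one common choice in the literature, and multiplies keys and values first so that the T×T matrix is never formed; causal masking and numerical refinements are omitted:

```python
import torch
import torch.nn.functional as F

def linearized_attention(q, k, v, eps=1e-6):
    """Kernel-based linear attention (simplified, non-causal sketch).

    Instead of softmax(Q K^T) V, attention is approximated as
    phi(Q) (phi(K)^T V), so no (T, T) matrix is ever materialized.
    q, k, v: tensors of shape (T, D).
    """
    phi_q = F.elu(q) + 1                      # feature map; keeps entries positive
    phi_k = F.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v          # (D, D): keys and values are combined first
    normalizer = phi_q @ phi_k.sum(dim=0, keepdim=True).transpose(-2, -1)  # (T, 1)
    return (phi_q @ kv) / (normalizer + eps)  # (T, D), linear in the sequence length T
```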

3.1. Sparse attention

The standard self-attention mechanism in a Transformer requires every token to attend to all other tokens. However, it has been observed that the attention matrix is often very sparse, meaning that only a small number of tokens actually attend to each other [2]. This suggests that the computational complexity of self-attention can be reduced by limiting the number of query-key pairs that each query attends to. By computing the similarity scores only for pre-defined patterns of query-key pairs, it is possible to significantly reduce the amount of computation required without sacrificing performance.

Eq. 1: Â_ij = q_i·k_jᵀ if token i attends to token j, and Â_ij = −∞ otherwise.

In the un-normalized attention matrix Â, the −∞ entries do not need to be stored: softmax maps them to zero attention weights, so they can simply be left out of the computation, which reduces the memory footprint and improves the efficiency of the implementation.
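
A deliberately naive way to realize Eq. 1 is to compute the full score matrix and fill the non-attended positions with −∞ before the softmax. Efficient sparse-attention implementations never materialize the full matrix, but the sketch below (with a band-shaped pattern chosen purely as an example) shows the masking logic:

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, mask):
    """Attention restricted to a predefined sparsity pattern.

    mask: boolean (T, T) tensor; True where query i may attend to key j.
    Non-attended scores are set to -inf, so softmax gives them weight 0,
    mirroring Eq. 1. A real implementation would skip them entirely.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example pattern: each query attends to itself and 2 neighbors on each side
T, D, w = 8, 16, 2
idx = torch.arange(T)
band = (idx[:, None] - idx[None, :]).abs() <= w  # boolean (T, T) band pattern
q, k, v = (torch.randn(T, D) for _ in range(3))
out = sparse_attention(q, k, v, band)
```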

We can map the attention matrix to a bipartite graph: standard attention corresponds to a complete bipartite graph, in which every query receives information from all of the nodes in the memory and uses it to update its representation, allowing the model to capture arbitrary relationships and dependencies between nodes. Sparse attention, on the other hand, corresponds to a sparse graph, in which not every query-node pair is connected. By limiting the number of connections, the sparse attention mechanism can still capture the important relationships and dependencies, but with far less computational overhead.

There are two main classes of approaches to sparse attention, based on the metrics used to determine the sparse connections between nodes [1]. These are position-based and content-based sparse attention.

3.1.1. Position-based sparse attention

In this type of attention, the connections in the attention matrix are restricted to predetermined patterns. These patterns can be expressed as combinations of simpler atomic patterns, which is useful for understanding and analyzing the behavior of the attention mechanism.

Fig. 4. Main atomic sparse attention patterns. The colored squares indicate the attention scores that are actually computed. Image from [1].

3.1.1.1. Atomic sparse attention: As shown in Fig. 4, there are five basic atomic sparse attention patterns that can be combined to construct a variety of sparse attention mechanisms, each with a different trade-off between computational complexity and performance.

  1. Global attention: A small set of global nodes serves as an information hub: these nodes can attend to all other nodes in the sequence, and all other nodes can attend to them, as in Fig. 4 (a).
  2. Band attention (also sliding window attention or local attention): The relationships and dependencies between different parts of the data are often local rather than global. In band attention, the attention matrix is a band matrix, with each query attending only to a certain number of neighboring nodes on either side, as shown in Fig. 4 (b).
  3. Dilated attention: Similar to how dilated convolutional neural networks (CNNs) increase the receptive field without increasing computational cost, band attention can do the same by using a dilated window with gaps of dilation w_d ≥ 1, as shown in Fig. 4 (c). It can also be extended to strided attention, where the window size is not restricted but the dilation w_d is set to a large value. A sketch of how these patterns can be expressed as masks follows this list.
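
Below is a minimal sketch of how the first three atomic patterns can be written as boolean masks and composed, under the convention that entry (i, j) is True when query i attends to key j; the window sizes, dilation, and choice of global positions are illustrative only:

```python
import torch

def band_mask(T, w):
    """Band (sliding-window) pattern: attend to w neighbors on each side."""
    idx = torch.arange(T)
    return (idx[:, None] - idx[None, :]).abs() <= w

def dilated_mask(T, w, d):
    """Dilated band: the same window, but attending only to every d-th position."""
    idx = torch.arange(T)
    dist = (idx[:, None] - idx[None, :]).abs()
    return (dist <= w * d) & (dist % d == 0)

def global_mask(T, global_idx):
    """Global pattern: chosen nodes attend to everything, and everything attends to them."""
    mask = torch.zeros(T, T, dtype=torch.bool)
    mask[global_idx, :] = True   # global nodes attend to all positions
    mask[:, global_idx] = True   # all positions attend to the global nodes
    return mask

# Compound pattern: a local band plus one global token (in the spirit of Longformer)
T = 16
mask = band_mask(T, w=2) | global_mask(T, global_idx=torch.tensor([0]))
```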

    Tags: Computer Vision Deep Dives NLP Thoughts And Theory Transformers
