Why Do We Have Huge Language Models and Small Vision Transformers?



Image by Joshua Earle at unsplash.com

In recent years we have seen continual growth in the number of transformer parameters. Looking closely, though, the largest of these models are Language Models (LMs), reaching an incredible 540 B parameters. Why has the same not happened for vision models?

For text models, increases in dataset size, scalable architectures, and new training methods have enabled this growth in parameter count. Scaling has not only improved performance on particular tasks (classification and so on); as the number of parameters has grown, we have also seen emergent capabilities appear.

"Trend of sizes of state-of-the-art Natural Language Processing (NLP) models with time. The number of floating-point operations to train these models is increasing at an exponential rate". source: here

In addition, a large model can serve as a basis for transfer learning and fine-tuning, so there is strong interest in developing increasingly capable models. As successful as LMs have been across many tasks, there are plenty of other tasks that require a model capable of analyzing images.

Since 2017, the transformer has been the architecture of choice, and the use of self-attention has shown clear advantages. Therefore, several groups have trained transformer models capable of working with images (Vision Transformers, ViTs). Yet, until recently, the largest reported ViT had only 4 B parameters. Why?

In a new study, Google was able to train a 22-billion-parameter ViT and to understand why scaling ViTs is so difficult.

In summary:

  • They explain why the traditional method of training ViTs produces instability during scaling.
  • They show how to modify the architecture for scaling, and how the resulting model reaches the state of the art.
  • They show how scaling the model improves fairness.

What are vision transformers?

Image from Wikipedia (source)

Transformers are, of course, permutation invariant; however, they cannot process grid-structured data (only sequences). So, in order to use a transformer with images, we have to find a way to transform them into sequences. How?

The first step is to transform the image into a sequence of patches (the image is divided into a series of fragments called patches). These patches are basically our tokens (like words in the classic transformer). The patches are then flattened and projected into a lower-dimensional embedding (this preserves the information while reducing the number of dimensions). Also, as in the original transformer, we use positional encoding so the model knows where each patch sits in the image.
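As a rough illustration, here is a minimal PyTorch sketch of this patchify-and-embed step. The sizes are the classic ViT-Base values (16 x 16-pixel patches, 768-dimensional embeddings), not ViT-22B's, and the class name PatchEmbedding is my own:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution cuts the image into non-overlapping patches and
        # projects each flattened patch to the embedding dimension in one step.
        self.project = nn.Conv2d(in_channels, embed_dim,
                                 kernel_size=patch_size, stride=patch_size)
        # Learned positional embeddings tell the model where each patch sits in the image.
        self.positions = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, images):            # (batch, 3, 224, 224)
        x = self.project(images)          # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (batch, 196, embed_dim): one token per patch
        return x + self.positions

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```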

Original article describing ViT (arXiv preprint)

The model is then trained with supervised learning (using a huge dataset of labeled images) and can afterwards be used for downstream tasks.

A Visual Journey in What Vision-Transformers See

Why is it hard to scale them, and how can this be solved?

Before ViTs, convolutional networks were the standard for computer vision tasks. In "A ConvNet for the 2020s", the authors note that the competition between the two architectures is still open.

On the other hand, we have so far not been able to scale up ViTs. Since scaling a transformer also leads to the emergence of behaviors that cannot be anticipated in advance, this is a serious limitation.

The authors noted that beyond 8 B parameters, instability emerged in the form of a divergent training loss after a few thousand steps. This was caused "by extremely large values in attention logits, which lead to (almost one-hot) attention weights with near-zero entropy." To solve this, the authors added layer normalization to the queries and keys before the dot product.
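A minimal PyTorch sketch of this idea, normalizing queries and keys before the dot product (the dimensions and the class name QKNormAttention are illustrative choices, not the paper's code):

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)  # no bias terms, see below
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Normalizing q and k keeps the attention logits bounded, avoiding the
        # near one-hot, zero-entropy attention weights that destabilize training.
        q, k = self.q_norm(q), self.k_norm(k)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = (logits.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```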

The figure shows how this change stabilizes training.

(source: here)

The second change is architectural. In the classic transformer, the self-attention output is followed by a multi-layer perceptron (MLP). Here, instead, the self-attention block runs in parallel with the MLP. This does not degrade performance while speeding up training by 15% (as shown with PaLM, another large Google model; in practice, it allows matrix multiplications to be fused into single, larger operations).

In addition, the bias term is removed from the attention projections (this also reduces training time without reducing performance).
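Putting the two changes together, a rough PyTorch sketch of such a parallel block could look like the following (sizes and names are illustrative; ViT-22B's attention would also include the query/key normalization shown above):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # bias=False drops the bias terms of the attention projections.
        self.attention = nn.MultiheadAttention(dim, num_heads,
                                               bias=False, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                      # (batch, tokens, dim)
        h = self.norm(x)
        # Both branches read the same normalized input, so their first linear
        # layers can be fused into one larger matrix multiplication.
        attn_out, _ = self.attention(h, h, h)
        return x + attn_out + self.mlp(h)
```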

The figure shows the new block after these changes:

(source: here)

The table compares Google's model (ViT-22B) with the previously reported largest ViT models, ViT-G and ViT-e.

(source: here)

Training has also been optimized. Google used JAX (Google has been focusing more on JAX than on TensorFlow for some time) along with a number of tricks (asynchronous parallel linear operations, parameter sharding) to make sure the model made efficient use of TPUs.
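As a very rough, hypothetical illustration of what parameter sharding means in JAX (the actual ViT-22B training setup is far more involved), a large weight matrix can be split across the available devices so that each one holds only a slice of it:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over the available devices (e.g. TPU cores; a single CPU also works).
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

# Shard a (toy-sized) weight matrix column-wise across the "model" axis.
w = jax.device_put(jnp.zeros((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def linear(x, w):
    # XLA/GSPMD inserts the necessary cross-device communication automatically.
    return x @ w

y = linear(jnp.ones((8, 1024)), w)
print(y.shape)  # (8, 4096)
```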

The authors used a dataset of about 4 B images, which were semi-automatically annotated with 30,000 classes. As a gentle reminder, in a ViT the images are divided into several sections (called patches) which, together with their positions (positional encoding), are transformed into a sequence. Each 224 x 224 image is divided into patches of 14 x 14 pixels, i.e., 224 / 14 = 16 patches per side, so an image is eventually represented by 16 x 16 = 256 tokens.
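A quick back-of-the-envelope check of that token count:

```python
image_size, patch_size = 224, 14
patches_per_side = image_size // patch_size  # 16
num_tokens = patches_per_side ** 2           # 256 tokens per image
print(patches_per_side, num_tokens)
```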

Is scaling the ViT worth it?

Once the model was trained, they evaluated it on ImageNet (1 M images, 1,000 classes) to test its classification ability. The authors show that the frozen model (i.e., without any fine-tuning) has performance comparable to other models.

(source: here)

Moreover, the model was tested on another dataset at different image resolutions. ViT-22B leads to a significant accuracy improvement, especially when the input size is small.

(source: here)

On the other hand, one of the most frequent uses of large models is transfer learning. After all, practitioners often work with small datasets and fine-tune a large pre-trained model for a task different from the one it was originally trained on. In the authors' words:

Transfer learning for dense prediction is critical especially since obtaining pixel-level labels can be costly. In this section, we investigate the quality of captured geometric and spatial information by the ViT-22B model (trained using image-level classification objective) on semantic segmentation and monocular depth estimation tasks. (source: here)

The authors tested the model on three benchmark datasets for semantic segmentation (ADE20k, Pascal Context, Pascal VOC). Not only that, they also tested the model using only a limited amount of data for transfer.

"Fewshot semantic segmentation on ADE20k, when only a fraction of the training set is used. We report mean IoU for semantic segmentation on the validation set" (source: here)

ViT-22B has the best performance when little data is available, which is useful because obtaining images and their segmentation masks is often expensive; fewer labeled examples are needed than with the other models.

In addition, the model showed superior capabilities in monocular depth estimation on the Waymo Open dataset.

Monocular depth estimation from frozen ViT features using different decoders on the Waymo Open dataset. (source: here)

In addition, by repurposing the model (keeping the pre-trained ViT-22B as a component), it was possible to use it for video classification. This demonstrates the model's flexibility across a range of tasks.

The authors also showed that fine-tuning further improves performance:

Video classification results. We evaluate the ViT-22B representations by freezing the backbone, and training a small transformer to aggregate frozen, per-frame representations. ViT-22B outperforms the largest previous vision backbone, ViT-e which contains 4 billion parameters. (source: here)

How fair is this model?

Artificial intelligence models are susceptible to unintended biases. Many of these biases are present in the training dataset; the model can amplify them, and it can learn spurious correlations and disparities in error rates. Because pre-trained models are then reused for downstream tasks, these errors are carried along.

The authors argue that scaling the model can help mitigate these biases, and they decided to test this using demographic parity (DP) as a measure of fairness: roughly, the difference in the rate of positive predictions between groups defined by a sensitive attribute.
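As a minimal sketch (a hypothetical helper, not the paper's evaluation code), the DP gap between two groups can be measured like this:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

# Toy example: predictions for "smiling", grouped by a binary sensitive attribute.
print(demographic_parity_gap([1, 0, 1, 1], [0, 0, 1, 1]))  # 0.5
```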

The authors explain their approach:

We use CelebA (Liu et al., 2015) with binary gender as a sensitive attribute while the target is "attractive" or "smiling". We emphasize that such experiments are carried out only to verify technical claims and shall by no means be interpreted as an endorsement of such vision-related tasks. We choose the latter attributes because they exhibit gender related bias as shown in Figure 15. (source: here)

"DP in the model often reflects DP in the data in the absence of bias mitigation. In this figure, binary sex is the sensitive attribute and linear heads are trained to predict other attributes in CelebA using pretrained features." (source: here)

First, scaling the model offers a more favorable tradeoff, as described in the literature ("performance improves with scale subject to any prescribed level of bias constraint"). Second, all subgroups benefit from the improvement, and scaling reduces the disparity in performance across the subgroups.

"top: Accuracy (ACC) for ViT variants after debiasing for each DP level. middle: Accuracy for each subgroup in CelebA prior to debiasing. bottom: y-axis is absolute difference in performance across the two subgroups: females and males. ViT-22B provides a more equitable performance, compared to smaller ViT architectures." (source: here)

What does the model see?

As has been widely described, computer vision models focus primarily on texture, while humans rely more on shape.

Humans are at 96% shape / 4% texture bias and ViT-22B-384 achieves a previously unseen 87% shape bias / 13% texture bias. (source: here)

The result is very interesting because most models have a 20–30% shape bias and a 70–80% texture bias (this is also true for convolutional networks). This bias is also one of the reasons why, by altering an image's texture while keeping its shape recognizable, a model can be tricked into misinterpreting and mislabeling it.

"Shape bias: many vision models have a low shape / high texture bias, whereas ViT-22B fine-tuned on ImageNet (red, green, blue trained on 4B images as indicated by brackets after model names, unless trained on ImageNet only) have the highest shape bias recorded in a ML model to date, bringing them closer towards a human-like shape bias." (source: here)

Another way to understand what the model sees is to compute saliency maps (gradient-based feature attribution methods).
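A minimal sketch of such a gradient-based saliency map in PyTorch, assuming a generic image classifier called model (this is the plain vanilla-gradients approach, not necessarily the exact attribution method used in the paper):

```python
import torch

def saliency_map(model, image):
    """image: (1, 3, H, W) tensor; returns an (H, W) map of pixel importance."""
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)
    logits[0, logits.argmax()].backward()         # gradient of the top-class logit w.r.t. pixels
    return image.grad.abs().max(dim=1).values[0]  # max over the color channels
```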

Saliency before and after model cooldown (source: here)

Conclusions

Google unveiled a ViT that is more than five times larger than the previous largest ViT.

We presented ViT-22B, the currently largest vision transformer model at 22 billion parameters. We show that with small, but critical changes to the original architecture, we can achieve both excellent hardware utilization and training stability, yielding a model that advances the SOTA on several benchmarks. (source: here)

Beyond its size and benchmark results, this model is a starting point for much larger models. Until now, scaling a ViT was difficult because of instability during training; the authors showed that these problems can be solved by modifying the architecture.

Large models can be used as pre-trained scaffolds for different tasks (computer vision models are useful in many real-world applications). In addition, unexpected behaviors emerge at scale (behaviors that are not present in small models and cannot be predicted from scaling laws). Moreover, as shown, these models can be integrated into multi-modal models (and could influence the emergence of new behaviors there).

In addition, ViT-22B shows how scaling improves fairness. The model is also more robust and more aligned with human vision (less dependent on texture and more on shape).

Most likely, we will soon see even larger ViTs (alone or as components of multi-modal models). What do you think?

If you have found this interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn.

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to Machine Learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Microsoft BioGPT: Towards the ChatGPT of life science?

SparseGPT: fewer parameters is better?

Everything but everything you need to know about ChatGPT

RazzAIe awards 2022: what are the worst AI of the year?
