Platonic Representation: Are AI Deep Network Models Converging?
A recent MIT paper caught my attention with an impressive claim: AI models are converging, even across different modalities such as vision and language. "We argue that representations in AI models, particularly deep networks, are converging" is how The Platonic Representation Hypothesis paper begins.
But how can different models, trained on different datasets and for different use cases, converge? What has led to this convergence?

1. The Platonic Representation Hypothesis
"We argue that there is a growing similarity in how datapoints are represented in different neural network models. This similarity spans across different model architectures, training objectives, and even data modalities."
— from the paper's introduction
The paper's central argument is that models of various origins and modalities are converging to a representation of reality – the joint distribution over the events of the world that generate the data we observe and use to train the models.
The authors argue that this convergence towards a platonic representation is driven by the underlying structure and nature of the data that models are trained on, and by the growing complexity and capability of the models themselves. As models are trained on more varied datasets and applied to a wider range of tasks, they are pushed toward a representation that captures the fundamental properties shared by all of that data.
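Loosely formalizing the idea (this is my own paraphrase of the paper's setup, with simplified notation): the world produces events z from some unknown distribution, and each modality observes those events only through its own observation function.

```latex
z \sim P(Z), \qquad
x_{\text{img}} = \mathrm{obs}_{\text{vision}}(z), \qquad
x_{\text{txt}} = \mathrm{obs}_{\text{language}}(z)
```

The hypothesis is that representations trained separately on the images and on the text both converge toward statistics of the common source P(Z), which is why they can end up aligned even though they never saw each other's data.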

2. Are AI Models Converging?
AI models of various scales, even those built on diverse architectures and trained for different tasks, are showing signs of convergence in how they represent data. As these models grow in size and complexity, and as their training data becomes larger and more varied, the ways they represent that data begin to align.
Do models trained on different data modalities, such as vision and text, also converge? The answer could be yes!
2.1 Vision Models that Talk
This alignment even spans visual and textual data (the paper does note, as a limitation, that its analysis focuses on these two modalities and leaves out others such as audio or a robot's perception of the world). One case [1] supporting this is LLaVA, which projects visual features into the language feature space with a simple 2-layer MLP and achieves state-of-the-art results.
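To make that concrete, here is a minimal sketch of such a projector. This is not LLaVA's actual code; the class name and dimensions are placeholders (1024 for a CLIP-style vision encoder, 4096 for a 7B-class LLM), but the shape of the idea — two linear layers with a nonlinearity mapping patch features into the LLM's embedding space — is the same.

```python
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """2-layer MLP mapping vision-encoder features into the
    language model's token-embedding space (LLaVA-style)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen
        # vision encoder. The output lives in the LLM's embedding space
        # and is prepended to the text token embeddings as "visual tokens".
        return self.proj(patch_features)

# Hypothetical shapes: 576 image patches become 576 visual tokens.
vision_feats = torch.randn(1, 576, 1024)
visual_tokens = VisionToLanguageProjector()(vision_feats)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```

That such a small bridge suffices is itself evidence for the hypothesis: if the two representation spaces were not already similar, a 2-layer MLP would be far too weak to translate between them.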

2.2 Language Models that See
Another interesting example is A Vision Check-up for Language Models [2], which explores the extent to which large language models understand and process visual data. The study uses code as a bridge between images and text: a novel way to feed visual data to LLMs. It reveals that LLMs can generate images through code that, while not looking realistic, still contain enough visual information to train vision models.
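Here is a toy illustration of the "code as a bridge" idea (my own example, not the paper's pipeline): an LLM never emits pixels directly, but it can emit drawing code like the snippet below, which is then rendered into a pixel array that a vision model could consume.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# The kind of drawing code an LLM might generate for a prompt like
# "a red circle above a blue square" -- the image lives in the code.
fig, ax = plt.subplots(figsize=(2, 2), dpi=64)
ax.add_patch(patches.Circle((0.5, 0.75), 0.15, color="red"))
ax.add_patch(patches.Rectangle((0.35, 0.2), 0.3, 0.3, color="blue"))
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis("off")

# Rasterize to pixels -- this is what a vision model would train on.
fig.canvas.draw()
img = np.asarray(fig.canvas.buffer_rgba())[..., :3]
print(img.shape)  # (128, 128, 3)
```

The rendered scene is crude, but it carries real spatial and color information, which is exactly the point: the LLM's textual representation encodes enough about the visual world to produce usable training signal.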

2.3 Bigger Models, Bigger Alignment
The alignment between different models is correlated with their scale. For example, among models trained on CIFAR-10 classification, larger models show greater alignment with each other than smaller models do. With the current trend of building models with tens and now hundreds of billions of parameters, these giants should become even more aligned. As the paper puts it:
"all strong models are alike, each weak model is weak in its own way."
3. Why are AI Models Converging?
