LLaVA: An open-source alternative to GPT-4V(ision)
LLaVA (an acronym for Large Language and Vision Assistant) is a promising open-source generative AI model that replicates some of the capabilities of OpenAI's GPT-4 in conversing with images. Users can add images to LLaVA chat conversations, making it possible to discuss the content of these images, but also to use them to describe ideas, contexts or situations visually.
The most compelling feature of LLaVA is its ability to improve upon other open-source solutions while using a simpler model architecture and orders of magnitude less training data. These characteristics make LLaVA not only faster and cheaper to train, but also more suitable for inference on consumer hardware.
This post gives an overview of LLaVA, and more specifically aims to
- show how to experiment with it from a web interface, and how it can be installed on your computer or laptop
- explain its main technical characteristics
- illustrate how to program with it, using as an example a simple chatbot application built with HuggingFace libraries (Transformers and Gradio) on Google Colab.
Using LLaVA online
If you have not yet tried it, the simplest way to use LLaVA is through the web interface provided by its authors. The screenshot below illustrates how the interface operates: a user asks for ideas about what meals to prepare given a picture of the contents of their fridge. Images can be loaded using the widget on the left, and the chat interface lets you ask questions and obtain answers as text.

In this example, LLaVA correctly identifies ingredients present in the fridge, such as blueberries, strawberries, carrots, yoghurt or milk, and suggests relevant ideas such as fruit salads, smoothies or cakes.
Other examples of conversations with LLaVA are given on the project website, illustrating that LLaVA can not only describe images but also make inferences and reason from the elements within them (identifying a movie or a person using clues from a picture, coding a website from a drawing, explaining humorous situations, and so on).
Running LLaVA locally
LLaVA can also be installed on a local machine using Ollama or a Mozilla 'llamafile'. These tools run on most CPU-only consumer-grade machines, as the model only requires 8GB of RAM and 4GB of free disk space, and has even been shown to run successfully on a Raspberry Pi. Among the tools and interfaces developed around the Ollama project, a notable initiative is Ollama-WebUI (illustrated below), which reproduces the look and feel of the OpenAI ChatGPT user interface.

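Beyond the graphical front-ends, a locally installed LLaVA model can also be queried programmatically. The snippet below is a minimal sketch using the Ollama Python client; it assumes the `ollama` package is installed (`pip install ollama`), that the model has already been pulled with `ollama pull llava`, and that `fridge.jpg` is a placeholder path to a local image.

```python
# Minimal sketch: query a local LLaVA model through the Ollama Python client.
# Assumes `pip install ollama` and that `ollama pull llava` was run beforehand.
import ollama

response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "What ingredients can you see in this picture?",
            "images": ["fridge.jpg"],  # placeholder path to a local image
        }
    ],
)
print(response["message"]["content"])
```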
Brief overview of LLaVA's main features
LLaVA was designed by researchers from the University of Wisconsin-Madison, Microsoft Research and Columbia University, and was recently showcased at NeurIPS 2023. The project's code and technical specifications can be accessed on its GitHub repository, which also offers various interfaces for interacting with the assistant.
As the authors summarize in their paper's abstract:
[LLaVA] achieves state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
The benchmark results, reported in the paper as the radar chart below, illustrate the improvements compared to other state-of-the-art models.

Inner workings
LLaVA's data processing workflow is conceptually simple. The model essentially works as a standard causal language model, taking language instructions (a user text prompt) as input and returning a language response. The language model's ability to handle images comes from a separate vision encoder that converts images into tokens living in the language model's embedding space; these visual tokens are silently added to the user text prompt and act as a kind of soft prompt. The LLaVA process is illustrated below.

LLaVA's language model and vision encoder rely on two reference models called Vicuna and CLIP, respectively. Vicuna is a pretrained large language model based on Meta's LLaMA-2 that achieves performance competitive with medium-sized LLMs (see the model cards for the 7B and 13B versions on HuggingFace). CLIP is an image encoder designed by OpenAI, pretrained to encode images and text in a shared embedding space using contrastive language-image pretraining (hence 'CLIP'). The model used in LLaVA is the vision transformer variant CLIP-ViT-L/14 (see its model card on HuggingFace).
To match the dimensions of the vision encoder's outputs with those of the language model's token embeddings, a projection module (W in the image above) is applied. It is a simple linear projection in the original LLaVA, and a two-layer perceptron in LLaVA 1.5.
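To make the data flow concrete, here is a toy sketch with random tensors (not the released weights): image features are projected into the language model's embedding space and concatenated with the text embeddings before being fed to the causal LM. The dimensions and the GELU activation in the 1.5-style projector are illustrative assumptions.

```python
# Toy sketch of the LLaVA data flow (random tensors, illustrative dimensions).
import torch
import torch.nn as nn

d_vision, d_lm = 1024, 4096        # illustrative CLIP ViT-L/14 and Vicuna hidden sizes
num_patches, prompt_len = 576, 32  # e.g. 24x24 image patches and 32 text tokens

image_features = torch.randn(num_patches, d_vision)  # output of the vision encoder

# Projection module: a single linear layer W in the original LLaVA...
projection_v1 = nn.Linear(d_vision, d_lm)
# ...and a two-layer MLP in LLaVA 1.5 (GELU activation assumed here)
projection_v15 = nn.Sequential(
    nn.Linear(d_vision, d_lm), nn.GELU(), nn.Linear(d_lm, d_lm)
)

visual_tokens = projection_v1(image_features)    # (576, 4096) visual "soft prompt"
text_embeddings = torch.randn(prompt_len, d_lm)  # embedded user text prompt
inputs_embeds = torch.cat([visual_tokens, text_embeddings], dim=0)
print(inputs_embeds.shape)  # torch.Size([608, 4096]) -> fed to the causal LM
```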
Training process
The training process of LLaVA consists of two relatively simple stages.
The first stage aims solely at tuning the projection module W, while the weights of the vision encoder and the LLM are kept frozen. Training is performed on a subset of around 600k image/caption pairs from the CC3M conceptual captions dataset, which is available on HuggingFace in this repository.
In the second stage, the projection module weights W are fine-tuned together with the LLM weights (while keeping the vision encoder frozen), using a dataset of 158K language-image instruction-following samples. The data was generated using GPT-4 and features examples of conversations, detailed descriptions and complex reasoning; it is available on HuggingFace in this repository.
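For illustration, an instruction-following record in this kind of dataset looks roughly like the dictionary below. The content is made up and the exact field names may differ; only the overall conversation structure is meant to be indicative (refer to the dataset repository for the actual schema).

```python
# Hypothetical record illustrating the instruction-following conversation format
# (field names and content are indicative only; see the dataset repository).
example = {
    "id": "000000123456",
    "image": "000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat can you see in this fridge?"},
        {"from": "gpt", "value": "The fridge contains blueberries, strawberries, carrots and milk."},
    ],
}
```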
The whole training takes around a day using eight A100 GPUs.
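The sketch below illustrates this two-stage freezing scheme with stand-in modules (not the actual LLaVA training code): only the projection is trainable in stage 1, and the LLM is additionally unfrozen in stage 2.

```python
# Sketch of the two-stage freezing scheme with stand-in modules
# (not the actual LLaVA training code).
import torch.nn as nn

vision_encoder = nn.Linear(1024, 1024)  # stand-in for CLIP ViT-L/14
projection = nn.Linear(1024, 4096)      # stand-in for the projection module W
language_model = nn.Linear(4096, 4096)  # stand-in for Vicuna

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for param in module.parameters():
        param.requires_grad = trainable

# Stage 1: pretrain the projection only; vision encoder and LLM are frozen
set_trainable(vision_encoder, False)
set_trainable(language_model, False)
set_trainable(projection, True)

# Stage 2: fine-tune the projection together with the LLM (vision encoder stays frozen)
set_trainable(language_model, True)
```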
Programming with LLaVA: How to get started
The code is available in the related Colab notebook.
The LLaVA model is integrated into the Transformers library and can be loaded using the standard pipeline object. The 7B and 13B variants of the models are available on the LLaVA