Improving CLIP performance in a training-free manner with few-shot examples

Author: Murphy

This is the 3rd article on how to improve CLIP performance on classification. You can find the first [here](https://medium.com/towards-data-science/improving-performance-and-explainability-of-zero-shot-clip-33e579d3f4bb) and the second one here. In the first two articles, our focus was on zero-shot classification, where we discovered that leveraging a large language model (LLM) to tailor prompts could enhance CLIP's zero-shot classification performance. In this article, we will explore how CLIP's classification performance can be further enhanced when a few visual examples per class are available. Before proceeding, I recommend refreshing your understanding of CLIP with the first article of this series.

Introduction

The zero-shot classification abilities of CLIP are constrained by the knowledge it acquires during pre-training. Consequently, if we aim to classify data that is rare or absent in CLIP's pre-training data, the classification performance may not be satisfactory. While assembling an extensive dataset can be challenging, obtaining a few examples for each class is typically feasible. One approach to enhance CLIP's performance is to add small adapters on top and train them on the few-shot images while keeping CLIP's original weights frozen. However, there are instances where training even small adapters is not viable. Alternatively, we can leverage CLIP in a training-free manner while still benefiting from the knowledge in the few-shot examples. In this article, we will explore how to achieve this using a method called Tip-Adapter [1].

Cached model

The main idea behind training-free classification is to use a cached model. This entails encoding the available few-shot training images with CLIP's visual encoder and storing these encodings together with their labels. At test time, this cached model can be used to compute similarities between the test image and the cached images in the image embedding space. Since multiple images are available per class, we can aggregate the similarities per class and use them to determine which class the test image is closest to in the embedding space. This is illustrated in the figure below. Note how this process resembles a k-nearest neighbors model.

Cached model at test time (figure by author)

In formulas, denoting the test image embedding by f_test (shape 1 x C, with C the CLIP embedding dimension) and the matrix of cached few-shot embeddings by F_train (shape NK x C), the affinity matrix is A = exp(-beta * (1 - f_test @ F_train.T)).

Here A (the affinity matrix) is essentially what we showed in the diagram, slightly reformulated: a modulating hyperparameter beta controls the sharpness of the similarities, and the exponential keeps the affinities non-negative. At its core, however, it still boils down to the cosine similarities between the test image and the cached images. For example, if two cached images have cosine similarities of 0.1 and 0.8 to the test image, the transformed affinities become:

  • beta = 1: np.exp(-1 * (1 - 0.1)) ≈ 0.41 and np.exp(-1 * (1 - 0.8)) ≈ 0.82
  • beta = 2: np.exp(-2 * (1 - 0.1)) ≈ 0.17 and np.exp(-2 * (1 - 0.8)) ≈ 0.67
  • beta = 5: np.exp(-5 * (1 - 0.1)) ≈ 0.01 and np.exp(-5 * (1 - 0.8)) ≈ 0.37

Thus, increasing beta pushes the transformed affinities further apart in relative terms, so the most similar cached images dominate the aggregation, while a smaller beta smooths them out and lets less similar images contribute more.
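This effect is easy to verify with a few lines of NumPy (a standalone sanity check, not part of the pipeline below):

import numpy as np

sims = np.array([0.1, 0.8])            # cosine similarities of two cached images to the test image
for beta in (1, 2, 5):
    affinities = np.exp(-beta * (1 - sims))
    print(beta, affinities.round(2))   # higher beta makes the high-similarity image dominate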

The matrix A has shape (1 x NK), where N is the number of classes and K the number of examples (shots) per class. If we then sum the affinities by class, which can be done simply by multiplying the affinity matrix by the one-hot encoded label matrix L_train of shape (NK x N) (value 1 at the class index and 0 otherwise), we obtain a (1 x N) matrix where each entry is the total similarity of the test image to the cached images of that class.
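To make the shapes concrete, here is a minimal NumPy sketch with made-up numbers (N = 3 classes, K = 2 shots, and the cached images assumed to be stored grouped by class):

import numpy as np

N, K = 3, 2
rng = np.random.default_rng(0)

A = rng.random((1, N * K))              # affinities of one test image to the NK cached images
L_train = np.eye(N).repeat(K, axis=0)   # one-hot labels of the cached images, shape (NK x N)

class_scores = A @ L_train              # (1 x N): summed affinity per class
print(class_scores.shape, class_scores.argmax())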

The cached model by itself is, however, of limited use, as we still want to leverage CLIP's zero-shot multimodal capabilities. To achieve this, we merge the logits from the cached model in the image space with the logits from CLIP's zero-shot predictions in the image-text space. As discussed in the previous articles, zero-shot prediction involves embedding all textual prompts into a matrix W of shape (N x C), where C is the number of CLIP hidden dimensions. Multiplying this matrix by the test image embedding of shape (1 x C) yields the similarities between the image and each textual prompt, resulting in a matrix with the same shape as the per-class similarities above (1 x N). We therefore combine these two matrices in a residual manner as follows:

logits = f_test @ W.T + alpha * (A @ L_train), with alpha being a parameter that controls how much importance is given to the cached model.
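Putting everything together, the following toy sketch (random placeholder embeddings and hypothetical variable names; in the real pipeline below the CLIP logits are additionally scaled by 100) shows the full training-free prediction:

import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

N, K, C = 3, 2, 512                       # classes, shots per class, embedding size
rng = np.random.default_rng(0)

f_test  = l2norm(rng.random((1, C)))      # normalized test image embedding
F_train = l2norm(rng.random((N * K, C)))  # cached, normalized few-shot image embeddings
L_train = np.eye(N).repeat(K, axis=0)     # one-hot labels of the cached images (NK x N)
W       = l2norm(rng.random((N, C)))      # text embeddings of the N class prompts

alpha, beta = 1.0, 4.0
A = np.exp(-beta * (1 - f_test @ F_train.T))   # (1 x NK) affinity matrix
cache_logits = A @ L_train                     # (1 x N) few-shot logits
clip_logits = f_test @ W.T                     # (1 x N) zero-shot logits

logits = clip_logits + alpha * cache_logits    # residual combination
print(logits.argmax())                         # predicted class index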

Implementation

Dataset download steps

First of all, we need to download the data and set up a procedure to create the few-shot examples. We are not going to do this from scratch and will instead reuse available code from other work.

  • Start by downloading the Oxford Pets dataset, together with the train/test splits, from here. (License: the dataset is available to download for commercial/research purposes.)
  • Install the Dassl package to conveniently select few-shot examples from the Oxford Pets dataset. The installation is described here.
  • Create a folder called "datasets" in your working directory with an oxford_pets.py file inside, and copy the code from here.

By now you should have the following directory structure:

|-- Data/
|   |-- oxford_pets/
|-- Dassl.pytorch/
|-- Working_dir/
|   |-- datasets/
|   |   |-- oxford_pets.py
|   |-- main.py   (code below)

Method

First, we import the required packages and the CLIP model, and define the config with the name of the dataset, its path, the number of shots per class and the batch size.

from dassl.data.datasets.build import build_dataset
from dassl.config import get_cfg_default
from dassl.data.data_manager import build_data_loader
from dassl.data.transforms.transforms import build_transform

from tqdm import tqdm
import torch
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel
import numpy as np

# importing this module registers the OxfordPets dataset with Dassl
import datasets.oxford_pets

# import CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval().cuda()

# define the config 
cfg = get_cfg_default()

# root to the images
cfg.DATASET.ROOT = "/data/"
cfg.DATALOADER.NUM_WORKERS = 0
cfg.DATASET.NUM_SHOTS = 8  # number of few-shot examples per class

cfg.DATALOADER.TRAIN_X.BATCH_SIZE = 16
cfg.DATALOADER.TEST.BATCH_SIZE = 128

cfg.DATASET.NAME = 'OxfordPets' 
cfg.DATASET.SUBSAMPLE_CLASSES = "all" 
cfg.SEED = 1

dataset = build_dataset(cfg)

# Transformation for the data to resize and center
tfm_train = build_transform(cfg, is_train=True)
tfm_test = build_transform(cfg, is_train=False)

train_loader_x = build_data_loader(
            cfg,
            data_source=dataset.train_x,
            batch_size=cfg.DATALOADER.TRAIN_X.BATCH_SIZE,
            tfm=tfm_train,
        )

test_loader = build_data_loader(
                cfg,
                sampler_type=cfg.DATALOADER.TEST.SAMPLER,
                data_source=dataset.test,
                batch_size=cfg.DATALOADER.TEST.BATCH_SIZE,
                tfm=tfm_test,
                is_train=False,
            )

Now let's create the cached model and extract the CLIP textual features for each pet breed in the dataset, using the prompt "a photo of a {class}, a type of pet.".

def build_cache_model(cfg, clip_model, train_loader_cache):
    """ create the cached model with encoded images and one-hot labels """
    cache_keys = []
    cache_values = []

    with torch.no_grad():
        train_features = []

        for i, batch in enumerate(tqdm(train_loader_cache)):
            images = batch['img'].cuda()
            target = batch['label'].cuda()

            # project the pooled vision features into CLIP's shared embedding space
            image_features = clip_model.visual_projection(clip_model.vision_model(images).pooler_output)
            train_features.append(image_features)
            cache_values.append(target)
        cache_keys.append(torch.cat(train_features, dim=0).unsqueeze(0))

    # concatenate (the mean over dim 0 is a no-op here, as there is a single pass),
    # L2-normalize and transpose so that cache_keys has shape (C, NK)
    cache_keys = torch.cat(cache_keys, dim=0).mean(dim=0)
    cache_keys /= cache_keys.norm(dim=-1, keepdim=True)
    cache_keys = cache_keys.permute(1, 0)
    # one-hot encoded labels of the cached images, shape (NK, N)
    cache_values = F.one_hot(torch.cat(cache_values, dim=0)).half()

    return cache_keys, cache_values

def clip_classifier(classnames, template, clip_model):
    """ encode textual prompts """
    with torch.no_grad():
        clip_weights = []

        for classname in tqdm(classnames):
            # Tokenize the prompts
            classname = classname.replace('_', ' ')
            texts = [t.format(classname) for t in template]
            texts = processor(text=texts, return_tensors="pt", padding=True)
            class_embeddings = clip_model.text_model(texts['input_ids'].cuda()).pooler_output
            class_embeddings = clip_model.text_projection(class_embeddings)
            class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
            clip_weights.append(class_embeddings)

        # stack the per-class embeddings and transpose to shape (C, N)
        clip_weights = torch.stack(clip_weights, dim=1).cuda()[0].t()
    return clip_weights

# Textual features
print("nGetting textual features as CLIP's classifier.")
template = ["a photo of a {}, a type of pet."]
clip_weights = clip_classifier(dataset.classnames, template, model)

# cached model of the training data
print("nConstructing cache model by few-shot visual features and labels.")
cache_keys, cache_values = build_cache_model(cfg, model, train_loader_x)

# reuse the same function to extract the encoded test images and their one-hot labels
cache_keys_test, cache_values_test = build_cache_model(cfg, model, test_loader)

Finally, let's compute our results:

def cls_acc(output, target, topk=1):
    # top-1 predictions (the topk argument is kept for compatibility, but only top-1 is computed)
    pred = np.argmax(output, axis=1)
    # check whether the predictions match the targets
    correct = pred == target.reshape(1, -1)
    # accuracy in percent
    acc = correct[:topk].reshape(-1).sum(0)
    acc = 100 * acc / target.shape[0]

    return acc

def pred_cached(cliplogits, feats, c_keys, c_values, labels, alpha, beta):
    # affinities between the test features and the cached keys,
    # reweighted with the exponential and beta as described above
    affinity = feats @ c_keys
    logits_cache = ((-1) * (beta - beta * affinity)).exp() @ c_values
    # residual combination of the zero-shot CLIP logits and the cached-model logits
    tipa_logits = cliplogits + logits_cache * alpha
    acc = cls_acc(tipa_logits.detach().cpu().numpy(), labels.detach().cpu().numpy())
    return acc

# zero-shot CLIP: similarities between the test image embeddings and the textual class embeddings
clip_logits_test = 100. * cache_keys_test.t() @ clip_weights
acc_zeroshot = cls_acc(clip_logits_test.detach().cpu().numpy(), cache_values_test.argmax(1).detach().cpu().numpy())
print(acc_zeroshot) 

# training-free prediction with the cached model added (alpha=1, beta=4)
acc_cached = pred_cached(clip_logits_test, cache_keys_test.t(), cache_keys, 
                          cache_values.float(), cache_values_test.argmax(1), 1, 4)

print(acc_cached)

From the above we get 83.619% accuracy for zero-shot CLIP and 84.955% with the cached model built from 8 examples per class. This might not seem like a big improvement, but the gain can vary considerably depending on the dataset. In this case, the high zero-shot accuracy indicates that CLIP's pre-trained knowledge already distinguishes the various pet breeds quite well. On a dataset where CLIP's zero-shot accuracy is lower, leveraging the cached model could result in a much more significant performance gain.
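Note that alpha and beta were simply fixed to 1 and 4 here, and the best values are dataset-dependent. A small grid search with the helpers defined above can often squeeze out a bit more accuracy; the sketch below runs it on the test features for brevity, although ideally it would use a held-out validation split:

# hedged sketch: sweep alpha and beta with the functions defined above
best_acc, best_params = 0.0, None
for alpha in [0.5, 1.0, 2.0, 3.0]:
    for beta in [1.0, 2.0, 4.0, 6.0]:
        acc = pred_cached(clip_logits_test, cache_keys_test.t(), cache_keys,
                          cache_values.float(), cache_values_test.argmax(1), alpha, beta)
        if acc > best_acc:
            best_acc, best_params = acc, (alpha, beta)
print(best_params, best_acc)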

Conclusions

In this third article of the series on CLIP zero-shot prediction, we have seen how to improve CLIP's zero-shot capabilities in a training-free manner when a few examples per class are available, without training any additional parameters. On the Oxford Pets dataset this gives a boost of about 1.3 percentage points with 8 examples per class.

References

[1] Zhang et al., Tip-Adapter, arXiv:2111.03930, https://arxiv.org/abs/2111.03930
[2] Official Tip-Adapter code: https://github.com/gaopengcuhk/Tip-Adapter

