Image Search in 5 Minutes

"Weighing Vectors" by the author using MidJourney. All images by the author unless otherwise specified.

In this post we'll implement text-to-image search (allowing us to search for an image via text) and image-to-image search (allowing us to search for an image based on a reference image) using a lightweight pre-trained model. The model we'll be using to calculate image and text similarity is inspired by Contrastive Language-Image Pre-Training (CLIP), which I discuss in another article.

The results when searching for images with the text "a rainbow by the water"

Who is this useful for? Any developers who want to implement image search, data scientists interested in practical applications, or non-technical readers who want to learn about A.I. in practice.

How advanced is this post? This post will walk you through implementing image search as quickly and simply as possible.

Pre-requisites: Basic coding experience.

What We're Doing, and How We're Doing it

This article is a companion piece to my article on "Contrastive Language-Image Pre-Training". Feel free to check it out if you want a more thorough understanding of the theory:

CLIP, Intuitively and Exhaustively Explained

CLIP models are trained to predict if an arbitrary caption belongs with an arbitrary image. We'll be using this general functionality to create our image search system. Specifically, we'll be using the image and text encoders from CLIP to condense inputs into a vector, called an embedding, which can be thought of as a summary of the input.

The job of an encoder is to summarize an input into a meaningful representation, called an embedding. Image from my article on CLIP.

The whole idea behind CLIP is that similar text and images will have similar vector embeddings.

CLIP tries to get the embeddings for similar things to be close together. Image from my article on CLIP.

The specific model we'll be using is called uform: a permissively licensed, pre-trained, and resource-efficient model which promises superior performance to CLIP. uform comes in three flavors; we'll be using the "late fusion" variant, which is conceptually similar to CLIP in that it uses separate text and image encoders.

The three model flavors in the uform library. As you can see, the "Late Fusion Model" is very similar to CLIP in that it returns two separate vectors from two encoders. source

The actual similarity between embeddings will be calculated using cosine similarity. The essence of cosine similarity is that two things can be considered "similar" if the angle between their embeddings is small. Thus, we can calculate how similar text and images are to each other by first embedding both the text and the images, then calculating the cosine similarity between the embeddings.

Cosine similarity uses the angle between vectors to determine similarity. The angle between A and B is small, and thus A and B are similar. C would be considered very different from both A and B. Image from my article on CLIP.
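
To make the angle idea concrete, here's a minimal sketch of cosine similarity computed by hand in PyTorch. The vectors below are made up purely for illustration; the search code later in this post uses torch.nn.functional.cosine_similarity, which computes the same quantity for batched embeddings.

import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    #dot product of the vectors divided by the product of their lengths
    return (a @ b / (a.norm() * b.norm())).item()

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.1, 2.1, 2.9])   #points in nearly the same direction as a
c = torch.tensor([-3.0, 0.5, -1.0]) #points in a very different direction

print(cosine_sim(a, b)) #close to 1, so a and b are "similar"
print(cosine_sim(a, c)) #much lower, so a and c are "different"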

And that's the idea in a nutshell: we download the CLIP-inspired model (uform), use its encoders to embed images and text, then use cosine similarity to search. Feel free to refer to the companion article for a deeper dive into the theory. Now we just need to put it into practice.

Implementation

I'll be skipping through some of the unimportant stuff. The full code can be found here:

MLWritingAndResearch/ImageSearch.ipynb at main · DanielWarfield1/MLWritingAndResearch

Downloading the Model

This is super easy: just pip install the uform module, then use it to download the model from Hugging Face. We'll be using the English version, but versions in other languages are also available.

!pip install uform
import uform
model = uform.get_model('unum-cloud/uform-vl-english')
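
As a quick sanity check (and to see what an embedding actually looks like), we can run a throwaway phrase through the text pipeline and inspect the result. This is just a sketch using the same preprocess/encode calls as the search code below; the embedding size printed is whatever the checkpoint defines, so no particular value is assumed here.

#sketch: embed a throwaway phrase and inspect the resulting vector
text_data = model.preprocess_text("a quick sanity check")
text_embedding = model.encode_text(text_data)
print(text_embedding.shape) #a single fixed-length vector, e.g. (1, embedding_dim)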

Defining a Database of Images to Search

I downloaded a few images from a dataset for us to play with, which is a derivative of a dataset from the Harvard Dataverse (licensed under Creative Commons), and put them in a public GitHub repo. The following pseudocode downloads those images into the list images, which is what we'll ultimately be searching through. A rough expansion of this pseudocode is sketched after the example images below.

#List all files
urls = get_image_urls_from_github()

#Download each file
images = download_images(urls)

#Render out a few examples
render_examples(images)
A few examples from the dataset of images we'll be searching
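
For reference, here's roughly what that pseudocode expands to: fetch each raw file from GitHub and decode it into a PIL image. The URLs below are placeholders rather than the actual repository paths, which you can find in the linked notebook.

import io
import requests
from PIL import Image

#placeholder raw GitHub URLs; see the linked notebook for the real paths
urls = [
    "https://raw.githubusercontent.com/<user>/<repo>/main/images/img_000.jpg",
    "https://raw.githubusercontent.com/<user>/<repo>/main/images/img_001.jpg",
    #...
]

#download each file and decode it into a PIL image
images = []
for url in urls:
    resp = requests.get(url)
    resp.raise_for_status()
    images.append(Image.open(io.BytesIO(resp.content)).convert("RGB"))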

Implementing Text-to-Image Search

Here's where the rubber meets the road. First we'll define some search text, in this example "a rainbow by the water". Then we can embed that text and compare it to the embeddings for all images. We can then sort by the cosine similarity to display the top five images which are most similar to the search text. Keep in mind, a CLIP-style model has a separate image and text encoder, so text gets encoded with the text encoder, and images get encoded with the image encoder.

"""Implementing text to image search
using the uform model to encode text and all images. Then using cosine
similarity to find images which match the specified text. Rendering out the
top 5 results
"""

import torch.nn.functional as F
import matplotlib.pyplot as plt
from tqdm import tqdm

#defining search phrase
text = "a rainbow by the water"
print(f'search text: "{text}"')

#embedding text
text_data = model.preprocess_text(text)
text_embedding = model.encode_text(text_data)

#calculating cosine similarity
sort_ls = []
print('encoding and calculating similarity...')
for image in tqdm(images):
    #encoding image
    image_data = model.preprocess_image(image)
    image_embedding = model.encode_image(image_data)

    #calculating similarity
    sim = F.cosine_similarity(image_embedding, text_embedding)

    #appending to list for later sorting
    sort_ls.append((sim, image))

#sorting by similarity
sort_ls.sort(reverse=True, key = lambda t: t[0])

print('top 5 most similar results:')
_, axs = plt.subplots(1, 5, figsize=(12, 8))
axs = axs.flatten()
for img, ax in zip([im for sim, im in sort_ls][:5], axs):
    ax.imshow(img)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
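
Since the image-to-image search in the next section repeats most of this loop, it can be convenient to factor the ranking step into a small helper. This is an optional refactor rather than part of the original notebook; it assumes the same model, images, and F defined above.

#optional refactor (not in the original notebook): rank images against any
#query embedding, so the same helper serves both search modes
def rank_images(query_embedding, images, model, top_k=5):
    scored = []
    for image in tqdm(images):
        image_data = model.preprocess_image(image)
        image_embedding = model.encode_image(image_data)
        sim = F.cosine_similarity(image_embedding, query_embedding).item()
        scored.append((sim, image))
    scored.sort(reverse=True, key=lambda t: t[0])
    return scored[:top_k]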

Implementing Image-to-Image Search

Image-to-image search behaves similarly to the text-to-image search previously discussed: we embed the image we're searching with, as well as all other images. The embedding of our search image is then compared to the embedding of every other image (using cosine similarity), allowing us to find the images most similar to our search image. Naturally, the most similar image in this example is the search image itself.

"""Implementing image to image search
similar to previous approach, except all images are compared to an input image.
Rendering out the top 5 results
"""

#defining search image
input_image = images[15]

#rendering search image
print('input image:')
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111)
ax.imshow(input_image)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()

#embedding image
image_data = model.preprocess_image(input_image)
search_image_embedding = model.encode_image(image_data)

#calculating cosine similarity
sort_ls = []
print('encoding and calculating similarity...')
for image in tqdm(images):
    #encoding image
    image_data = model.preprocess_image(image)
    image_embedding = model.encode_image(image_data)

    #calculating similarity
    sim = F.cosine_similarity(image_embedding, search_image_embedding)

    #appending to list for later sorting
    sort_ls.append((sim, image))

#sorting by similarity
sort_ls.sort(reverse=True, key = lambda t: t[0])

print('top 5 most similar results:')
_, axs = plt.subplots(1, 5, figsize=(12, 8))
axs = axs.flatten()
for img, ax in zip([im for sim, im in sort_ls][:5], axs):
    ax.imshow(img)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()

Conclusion

And that's it! We successfully used a CLIP-style model's image and text encoders to implement two types of image search: one based on input text, and one based on an input image. We did this by using the text encoder to embed the search text, the image encoder to embed the images, and then ranking the images by the cosine similarity of their embeddings to the query.

Feel free to check out the companion article for a deeper dive on CLIP.

Follow For More!

I describe papers and concepts in the ML space, with an emphasis on practical and intuitive explanations.

Attribution: All of the resources in this document were created by Daniel Warfield, unless a source is otherwise provided. You can use any resource in this post for your own non-commercial purposes, so long as you reference this article, https://danielwarfield.dev, or both. An explicit commercial license may be granted upon request.
