SAM: Segment Anything Model
Introduction
Transformers have been widely applied to Natural Language Processing use cases, but they can also be applied to several other domains of Artificial Intelligence, such as time series forecasting or computer vision.
Great examples of Transformer models applied to computer vision are Stable Diffusion for image generation, Detection Transformer for object detection or, more recently, SAM for image segmentation. The great benefit these models bring is that we can use text prompts to manipulate images without much effort; all it takes is a good prompt.
The use cases for this type of model are endless, especially if you work at an e-commerce company. A simple but time-consuming and expensive process is going from photographing an item to posting it on the website for sale. Companies need to photograph the items, remove the props used and, finally, in-paint the hole left by the prop before posting the item on the website. What if this entire process could be automated by AI, with humans handling only the complex cases and reviewing what the AI has done?
In this article, I provide a detailed explanation of SAM, an image segmentation model, and its implementation in a hypothetical use case where we want to perform an A/B test to understand which type of background increases the conversion rate.

As always, the code is available on GitHub.
Segment Anything Model (SAM)
Segment Anything Model (SAM) [1] is a segmentation model developed by Meta that aims to create masks of the objects in an image guided by a prompt that can be text, a mask, a bounding box or just a point in an image.
The inspiration comes from the latest developments in Natural Language Processing and, particularly, from Large Language Models where, given an ambiguous prompt, the user still expects a coherent response. Following the same line of thought, the authors wanted to create a model that returns a valid segmentation mask even when the prompt is ambiguous and could refer to multiple objects in an image. This reasoning led to the development of a pre-trained algorithm and a general method for zero-shot transfer to downstream tasks.
SAM, as a promptable segmentation model, can solve segmentation tasks different from the ones it was trained for via prompt engineering, making it applicable to a wide variety of use cases with little or no fine-tuning on your own data.

How does it work?
As shown in Figure 2, SAM has three main components:
1. Image Encoder, which is an adaptation of the encoder from the Masked AutoEncoder (MAE) model [2]. MAE is pre-trained on images divided into regular, non-overlapping patches, of which 75% are masked.
After this transformation, the encoder receives only the unmasked patches and encodes them into embedding vectors. These vectors are then concatenated with mask tokens (which identify the missing patches that need to be predicted) and positional embeddings before going through the decoder to reconstruct the original image.

The most important part of this whole process is deciding which patches should be masked. It relies on random sampling without replacement which, together with the high masking ratio (75%), creates a complex reconstruction task that cannot be solved by simply extrapolating from the visible neighbouring patches. The encoder must therefore learn how to create a high-quality vector representation of the image, so that the decoder can correctly reconstruct the original image.
The authors adapted the image encoder to produce an embedding that is a 16× downscaling of the original image, with a 64×64 spatial dimension and 256 channels.
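To make this masking strategy more concrete, below is a minimal sketch of MAE-style random patch masking in PyTorch. It is an illustrative re-implementation, not Meta's code, and the `random_masking` helper and patch dimensions are assumptions for the example.

```python
# Minimal sketch of MAE-style random masking: an image is split into
# non-overlapping patches, 75% are dropped and only the remaining 25%
# are passed to the encoder.
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, patch_dim) tensor of flattened image patches."""
    batch, num_patches, patch_dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Random sampling without replacement: shuffle patch indices per image
    noise = torch.rand(batch, num_patches)     # uniform noise per patch
    shuffled = torch.argsort(noise, dim=1)     # random permutation of indices
    keep_idx = shuffled[:, :num_keep]          # indices of the visible patches

    # Gather only the visible patches for the encoder
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patch_dim)
    )
    return visible, keep_idx

# Example: a 14x14 grid of 16x16x3 patches -> 196 patches of dimension 768
patches = torch.randn(1, 196, 768)
visible, keep_idx = random_masking(patches)
print(visible.shape)  # torch.Size([1, 49, 768]) -> only 25% of the patches are encoded
```
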
2. Flexible Prompt Encoder, which has four different components that are triggered depending on the prompt provided.
- Masks are encoded by a Convolutional Neural Network that downscales the input mask by a factor of 4 using two 2×2, stride-2 convolutions with 4 and 16 output channels, respectively. After that, a 1×1 convolution maps the channel dimension to 256, and the result is added element-wise to the output of the Image Encoder. If no mask is provided, a learned "no mask" embedding replaces the mask embedding.

- Points are represented as a positional embedding plus one of two learned embeddings that indicate whether the point lies in the background or the foreground. The positional embedding follows the work developed by the authors of [3]: the coordinates of a point are mapped into Fourier features before feeding a Multi-Layer Perceptron (MLP), which significantly improves the output generated. As shown in Figure 5, for the image regression task the model is able to generate a non-blurred image when using Fourier features. For SAM, the authors apply the same logic and create a 256-dimensional vector to represent the point position (a minimal sketch of this mapping is shown after this list).

- Boxes follow the same principle as Points: there is a positional embedding for the top-left corner and another for the bottom-right corner but, instead of two learned embeddings to identify the foreground or the background, there are two learned embeddings to identify the top-left and the bottom-right corners.
- Text is encoded by the text encoder from CLIP [4]. CLIP is a model developed by OpenAI that was trained to predict which caption goes with which image, rather than following the traditional approach of predicting a fixed set of object classes. This approach aligns the embeddings created by the text encoder with those of the image encoder, which makes it possible to perform zero-shot classification based on the cosine similarity between both vector embeddings. The output of the text encoder in SAM is a 256-dimensional vector.
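To give an idea of how the point prompt can be encoded, below is a minimal sketch of a Fourier-feature positional embedding in the spirit of [3]. The projection matrix, its scale and the helper name are illustrative assumptions rather than SAM's exact implementation.

```python
# Minimal sketch of a Fourier-feature positional encoding for a point prompt.
import torch

torch.manual_seed(0)
embed_dim = 256
# Random Gaussian projection matrix: maps (x, y) -> embed_dim / 2 frequencies
B = torch.randn(2, embed_dim // 2)

def point_positional_embedding(x: float, y: float, width: int, height: int) -> torch.Tensor:
    """Encode pixel coordinates into a 256-dimensional Fourier-feature vector."""
    coords = torch.tensor([x / width, y / height])                   # normalise to [0, 1]
    projected = 2 * torch.pi * coords @ B                            # (embed_dim / 2,)
    return torch.cat([torch.sin(projected), torch.cos(projected)])   # (embed_dim,)

pos = point_positional_embedding(x=320, y=240, width=640, height=480)
print(pos.shape)  # torch.Size([256])
# SAM then adds a learned "foreground" or "background" embedding to this vector.
```
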

3. Finally, the Fast Mask Decoder has two decoder layers that map the image embedding and a set of prompt embeddings into an output mask.
- The decoder layer receives the prompt tokens as input, which are first transformed through a self-attention layer.
- Its output is combined with the image embedding in a cross-attention layer to update the prompt embedding with image information.
- Finally, the new prompt embedding goes through an MLP layer that feeds a new cross-attention layer, which is responsible for updating the image embedding with prompt information.
The output of the second decoder layer, which is an image embedding conditioned on the input prompts, goes through two transposed convolutional layers to upscale it.
At the same time, the MLP output is combined with the same image embedding in a new cross-attention layer and then fed to a 3-layer MLP that produces a vector; this vector is combined with the upscaled image embedding through a spatially point-wise product to generate the final masks.
Note that at every attention layer, positional encodings are added to the image embedding.
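The sketch below illustrates one such two-way decoder layer in PyTorch, following the four steps described above. Hidden sizes, head counts and module names are assumptions, and residual connections, layer normalisations and the upscaling/mask-prediction head are omitted for brevity, so this is not SAM's exact architecture.

```python
# Minimal sketch of a two-way decoder layer: prompt tokens and the image
# embedding attend to each other, with positional encodings re-added to the
# image embedding at every attention layer.
import torch
import torch.nn as nn

class TwoWayDecoderLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_tokens_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.cross_image_to_tokens = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, image, image_pos):
        # 1. Self-attention over the prompt tokens
        tokens = self.self_attn(tokens, tokens, tokens)[0]
        # 2. Cross-attention: tokens attend to the (position-aware) image embedding
        tokens = self.cross_tokens_to_image(tokens, image + image_pos, image)[0]
        # 3. MLP on the updated prompt tokens
        tokens = self.mlp(tokens)
        # 4. Cross-attention: image embedding attends to the prompt tokens
        image = self.cross_image_to_tokens(image + image_pos, tokens, tokens)[0]
        return tokens, image

# 64x64 image embedding flattened to 4096 positions, plus a handful of prompt tokens
image = torch.randn(1, 64 * 64, 256)
image_pos = torch.randn(1, 64 * 64, 256)   # positional encodings re-added at each layer
tokens = torch.randn(1, 6, 256)
tokens, image = TwoWayDecoderLayer()(tokens, image, image_pos)
print(tokens.shape, image.shape)  # torch.Size([1, 6, 256]) torch.Size([1, 4096, 256])
```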

Customize your product landing page using SAM
In this section, we will implement SAM in a hypothetical use case where we aim to create several versions of a product landing page in order to perform an A/B test and check which background leads to a higher conversion rate.
For that, we will use the model facebook/sam-vit-huge, available on Hugging Face.
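
As a starting point, the snippet below sketches how the model can be loaded and prompted with a single point using the transformers library; the image URL and point coordinates are placeholders that you should replace with your own product photo.

```python
# Minimal sketch: load facebook/sam-vit-huge from Hugging Face and segment the
# object under a single point prompt.
import requests
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)

url = "https://example.com/product.jpg"  # placeholder: use your own product image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# A single foreground point on the product we want to segment
input_points = [[[450, 600]]]  # (batch, number of points, x/y pixel coordinates)

inputs = processor(image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Post-process the low-resolution masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
scores = outputs.iou_scores
print(masks[0].shape, scores.shape)
```

The post-processed masks can then be used to cut the product out of the original photo and paste it onto the different backgrounds we want to compare in the A/B test.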