How to Estimate Depth from a Single Image

Monocular depth heat maps generated with Marigold on NYU Depth v2 images. Image courtesy of the author.

Humans view the world through two eyes. One of the primary benefits of this binocular vision is the ability to perceive depth – how near or far objects are. The human brain infers object depths by comparing the images captured simultaneously by the left and right eyes and interpreting the disparities between them. This process is known as stereopsis.

Just as depth perception plays a crucial role in human vision and navigation, the ability to estimate depth is critical for a wide range of computer vision applications, from autonomous driving to robotics and even augmented reality. Yet a slew of practical considerations, from spatial limitations to budgetary constraints, often limits these applications to a single camera.

Monocular depth estimation (MDE) is the task of predicting the depth of a scene from a single image. Depth computation from a single image is inherently ambiguous, as infinitely many distinct 3D scenes can project onto the same 2D image plane. As a result, MDE is a challenging task that requires (either explicitly or implicitly) factoring in many cues such as object size, occlusion, and perspective.

In this post, we will illustrate how to load and visualize depth map data, run monocular depth estimation models, and evaluate depth predictions. We will do so using data from the SUN RGB-D dataset.

In particular, we will cover the following:

- Loading the SUN RGB-D dataset and visualizing its ground truth depth maps
- Running monocular depth estimation models on the data and visualizing their predictions
- Evaluating the quality of the depth predictions

We will use the Hugging Face transformers and diffusers libraries for inference, FiftyOne for data management and visualization, and scikit-image for evaluation metrics. All of these libraries are open source and free to use. Disclaimer: I work at Voxel51, the lead maintainers of one of these libraries (FiftyOne).

Before we get started, make sure you have all of the necessary libraries installed:

pip install -U torch fiftyone diffusers transformers scikit-image

Then we'll import the modules we'll be using throughout the post:

from glob import glob
import numpy as np
from PIL import Image
import torch

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
from fiftyone import ViewField as F

Loading and Visualizing SUN RGB-D Depth Data

The SUN RGB-D dataset contains 10,335 RGB-D samples, each comprising an RGB image, a depth image, and camera intrinsics. It draws its images from the NYU Depth v2, Berkeley B3DO, and SUN3D datasets, and it is one of the most popular datasets for monocular depth estimation and semantic segmentation tasks!
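To see what this looks like in practice, here is a minimal sketch that loads a handful of scenes into FiftyOne and visualizes their ground truth depth maps, using the modules we imported above. The "SUNRGBD" root directory and the glob patterns are assumptions about how you unpacked the raw dataset download, so adjust them to match your local copy:

# Minimal sketch: load the first 20 scenes and visualize their depth maps.
# Assumption: the raw SUN RGB-D layout, with one folder per scene containing
# image/ and depth_bfx/ subfolders. Adjust the patterns to your download.
dataset = fo.Dataset(name="SUNRGBD-20", overwrite=True)
scene_dirs = glob("SUNRGBD/k*/*/*")[:20]

samples = []
for scene_dir in scene_dirs:
    image_path = glob(f"{scene_dir}/image/*")[0]
    depth_path = glob(f"{scene_dir}/depth_bfx/*")[0]

    # Depth is stored as a 16-bit PNG; normalize to [0, 255] for display
    depth_map = np.array(Image.open(depth_path)).astype("float64")
    depth_map = (depth_map * 255 / np.max(depth_map)).astype("uint8")

    sample = fo.Sample(filepath=image_path)
    sample["gt_depth"] = fo.Heatmap(map=depth_map)
    samples.append(sample)

dataset.add_samples(samples)
session = fo.launch_app(dataset)

Storing each depth map as a fo.Heatmap label lets the FiftyOne App render it as a color-coded overlay on top of the corresponding RGB image.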
