Monocular Depth Estimation to Predict Surface Reliefs of Mars


Several approaches to estimating surface elevation from a single image have been discussed in the literature. In a previous article, I discussed how it is possible to predict the depth of a single 2D image using a monocular estimation model. However, when the input to the model is an image of a particular surface, the prediction represents a Digital Elevation Model (DEM). In my first research paper, I showed how a DEM of the surface of Mars can be obtained from 2D greyscale images using deep learning approaches. To better understand the idea I'm going to propose, I suggest you first try the demo of the project here.

Introduction

As discussed in more detail in another story, the DEM of a surface is a grid of elevation values where each cell stores the elevation of a particular point on the surface:

Graphic visualization of a DEM. NSIDC, CC BY 2.0, via Wikimedia Commons

DEMs are usually represented graphically using color maps. In the image above, the highest points are red and the lowest points are purple.

On the other hand, monocular depth estimation models are used to estimate the distance of each pixel of an image from the camera (even the camera of a satellite) that took the image:

Depth prediction of a bedroom. Input image from NYU-Depth V2.

The idea is that a satellite image of a surface can be fed into a monocular depth estimation model. In this way, it's possible to predict the DEM of that surface, because each point of the output represents a distance (depth), and elevations of the surface can be derived by using depths (more on this later).

The method discussed in this article can be used for other surfaces as well, and not just for Mars.

UAHiRISE

The High-Resolution Imaging Science Experiment (HiRISE) is a camera on board the Mars Reconnaissance Orbiter (a spacecraft that provides support for missions to Mars). HiRISE took a large number of greyscale images of the Martian surface, and each image is associated with a DTM (Digital Terrain Model). UAHiRISE is the University of Arizona website where all these resources are available. These files are geographical rasters; if you want more information about working with this file format, see my other article.
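As a quick illustration, a raster like this can be opened with rasterio. This is a minimal sketch: the file name is a hypothetical example of a downloaded HiRISE DTM, not a file from the project.

```python
import rasterio
import numpy as np

# Hypothetical file name: any HiRISE DTM downloaded from UAHiRISE.
with rasterio.open("DTEEC_023957_1755_024023_1755_U01.IMG") as src:
    dtm = src.read(1).astype(np.float32)  # elevation grid, single band
    nodata = src.nodata                   # HiRISE DTMs mark missing cells
    print(src.crs, src.res, dtm.shape)    # georeferencing metadata

# Mask the no-data border before computing statistics.
valid = dtm[dtm != nodata] if nodata is not None else dtm.ravel()
print(valid.min(), valid.max())           # absolute elevations in meters
```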

A satellite image of the Martian surface with the associated DTM. Image from UAHiRISE.

Dataset

A web scraper (the code is in the project repository) was used to download all the greyscale images with their associated DTMs. However, these files are too large in resolution to be used as input to a neural network. Thus, they had to be split into smaller patches (I discussed this procedure in another article); a minimal sketch of the patching step follows. The final dataset consisted of 150,000 patches.
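A sketch of how a large raster can be split into patches. The patch size here is an illustrative assumption, not necessarily the exact value used in the project.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch_size: int = 512) -> list[np.ndarray]:
    """Split a large raster into non-overlapping square patches.

    Edge regions smaller than patch_size are discarded.
    """
    patches = []
    h, w = image.shape
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

# The same grid must be applied to the greyscale image and to its DTM,
# so that each image patch stays aligned with its ground truth.
```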

To train the model, a set of tiles of the Martian surface was used, each paired with its corresponding depth map. However, each tile of the original DTM had to be scaled: absolute elevations were converted to relative depths. This means that if a ground truth had values in the range [-3500, -2500], they were scaled to [0, 1000], where 0 is the closest point (the highest elevation) and 1000 is the farthest (the lowest). This had to be done because a monocular depth estimation model can't directly predict absolute elevation values.
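In code, this conversion amounts to subtracting each tile's elevations from the tile maximum. This is a sketch of my reading of the per-tile scaling described above:

```python
import numpy as np

def elevation_to_relative_depth(dem_tile: np.ndarray) -> np.ndarray:
    """Convert absolute elevations to relative depths.

    The highest point of the tile becomes depth 0 (closest to a camera
    looking straight down); lower points get larger depths.
    """
    return dem_tile.max() - dem_tile

tile = np.array([[-2500.0, -3000.0], [-2750.0, -3500.0]])
print(elevation_to_relative_depth(tile))
# [[   0.  500.]
#  [ 250. 1000.]]
```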

An example of a training sample is represented by the image below:

One of the training samples. Dataset made of images and DTMs from UAHiRISE.

The reddish colours are the closest points and the bluish colours are the farthest. Note that the depth of a point is inversely related to its elevation: reddish colours (small depths) correspond to higher elevations than bluish colours. After prediction, all depths are converted back to elevations, as in the sketch below.
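Since only relative values are available, the inverse step simply flips the predicted depths into relative elevations. Again, a minimal sketch of my understanding:

```python
import numpy as np

def depth_to_relative_elevation(depth: np.ndarray) -> np.ndarray:
    # The farthest point becomes elevation 0; the closest point
    # becomes the highest relative elevation of the tile.
    return depth.max() - depth
```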

The dataset is about 1TB in size and it's available on Kaggle.

Model

After an analysis of all the available architectures for monocular depth estimation, the GLPN¹ model was chosen:

GLPN architecture. Image from the official paper.

The architecture uses a hierarchical transformer encoder to extract global information at different resolutions from the H × W × 3 input (an RGB image). A decoder then restores the bottleneck features to H × W × 1, the output depth map.
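GLPN is available in the Hugging Face transformers library. As a sketch, a forward pass with the stock NYU-pretrained checkpoint (before any Mars-specific changes; the input file name is a placeholder) looks like this:

```python
import torch
from PIL import Image
from transformers import GLPNImageProcessor, GLPNForDepthEstimation

processor = GLPNImageProcessor.from_pretrained("vinvino02/glpn-nyu")
model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-nyu")

image = Image.open("bedroom.jpg").convert("RGB")  # placeholder H x W x 3 image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth  # shape (1, h, w)

# Upsample the prediction back to the input resolution (H x W x 1).
depth = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()
```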

The model was almost ready for use in the experiment, except for the input layer. Most of the images released by UAHiRISE are greyscale, so the input layer was modified to take 1-channel (greyscale) images instead of 3-channel (RGB) images. The ImageNet pre-trained weights were still used for the encoder, but the weights of the input layer were not loaded.
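With transformers, one way to get this behaviour is to override num_channels and let from_pretrained skip the weights whose shapes no longer match. This is a sketch of that approach, not necessarily how the project's training code does it:

```python
from transformers import GLPNForDepthEstimation

# ignore_mismatched_sizes drops the pre-trained weights of the first
# patch-embedding convolution (3-channel) and re-initializes it for
# 1-channel greyscale input; the other encoder weights are kept.
model = GLPNForDepthEstimation.from_pretrained(
    "vinvino02/glpn-nyu",
    num_channels=1,
    ignore_mismatched_sizes=True,
)
```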

Final result

After training the model, it's ready to predict DTMs. The following is an example of a prediction made by the final model:

Input grayscale image. Tile from a UAHiRISE image.
Predicted DTM with relative elevations.
3D model of the DTM. Image made by using this generator.

After calculating the metrics for all the test samples, the results showed a mean absolute error of 10.3 meters, which may or may not be acceptable depending on the application. The main obstacle to getting better results is that all the images used were greyscale, and thus carried less information than RGB or multispectral images. You can find all the metrics in the README of my repository.

Concluding remarks

While I was working on this research, someone suggested predicting heights directly rather than predicting depths. However, experiments showed better results by predicting depths and then converting them to heights.

A demo of the project is available here, where you can predict a DTM and also obtain a 3D model. All the code is available in my repository, along with all the resources for preparing the dataset and training the model. You can also use my pre-trained model directly.

The idea of this project can be applied to other scenarios where you want to predict a DTM from satellite images.

Thanks for reading, I hope you have found this useful.

All images, unless otherwise noted, are by the Author.


[1] D. Kim, W. Ka, P. Ahn, D. Joo, S. Chun, J. Kim, Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth (2022)
