How to Process 10k Images in Seconds

Manual, repetitive tasks. Ugh. One of the things I hate the most, especially when I know they can be automated. Imagine you need to edit a bunch of images with the same cropping and resizing operation. For a couple of images you might just open an image editor and do it by hand. But what about applying the same operation to thousands or tens of thousands of images? Let's see how we can automate such an image processing task with Python and OpenCV, and how we can optimize this data processing pipeline to run efficiently on a sizeable dataset.
Dataset
For this post, I created a toy example: I extracted 10,000 frames from a random video of a beach I recorded. The goal is to crop each image to a square aspect ratio around the center and then resize it to a fixed size of 224×224 pixels.
This roughly resembles part of a pre-processing step that might be required for a dataset when training a machine learning model.
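The extraction step itself is not the focus of this post, but if you are curious, a minimal sketch with OpenCV could look like the following (the input filename beach.mp4 and the frame naming scheme are my assumptions):

import cv2
from pathlib import Path

out_dir = Path("images")
out_dir.mkdir(exist_ok=True, parents=True)

cap = cv2.VideoCapture("beach.mp4")  # hypothetical input video
frame_idx = 0
while True:
    ok, frame = cap.read()  # ok becomes False once the video ends
    if not ok:
        break
    cv2.imwrite(str(out_dir / f"frame_{frame_idx:05d}.png"), frame)
    frame_idx += 1
cap.release()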


Prerequisites
If you want to follow along, make sure to install the following packages, for example with uv. You can also find the full source code on GitHub.
uv add opencv-python tqdm
Data Loading
Let's start by loading the images one by one using OpenCV. All the images are in a subfolder, and we will use the pathlib glob method to find all PNG files in this folder. To show the progress, I am using the tqdm library. By wrapping the glob call in sorted, I make sure the paths are in a deterministic order and convert the generator returned from the glob call into a list. This way tqdm knows the length of the iteration and can show a proper progress bar.
from pathlib import Path
from tqdm import tqdm
img_paths = sorted(Path("images").glob("*.png"))
for img_path in tqdm(img_paths):
    pass
Now we can also prepare our output directory, and make sure that it exists. This is where our processed images will be stored.
output_path = Path("output")
output_path.mkdir(exist_ok=True, parents=True)
Image Processing
To process a single image, let's define a function. It will take the input and output image paths as arguments.
def process_image(input_path: Path, output_path: Path) -> None:
    """
    Image processing pipeline:
    - Center crop to square aspect ratio
    - Resize to 224x224

    Args:
        input_path (Path): Path to input image
        output_path (Path): Path to save processed image
    """
To implement this function, we first need to load the image with OpenCV. Make sure to import the cv2 package at the beginning of the file.
...
import cv2

def process_image(input_path: Path, output_path: Path) -> None:
    ...
    # Read image
    img = cv2.imread(str(input_path))
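Note that cv2.imread does not raise an exception when a file cannot be read; it returns None instead. If you want the pipeline to fail loudly on a broken file, a small guard like this could help (my addition, not part of the original function):

    # imread returns None instead of raising for unreadable files
    if img is None:
        raise ValueError(f"Could not read image: {input_path}")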
To crop the image, we can directly slice the image array on the x-axis. Keep in mind that OpenCV stores images in YXC order, i.e. with shape (height, width, channels): the y-axis (rows) comes first and the x-axis (columns) second, starting in the top-left corner, with C being the color channels. So the x-axis is the second index of the array. For simplicity, I assume that the images are in landscape format, i.e. their width is greater than their height.
height, width, _ = img.shape
img = img[:, (width - height) // 2 : (width + height) // 2, :]
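If your dataset might also contain portrait images, a slightly more general center crop that takes the largest centered square regardless of orientation could look like this (a sketch, not needed for this dataset):

    side = min(height, width)
    y0 = (height - side) // 2
    x0 = (width - side) // 2
    img = img[y0 : y0 + side, x0 : x0 + side, :]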
To resize the image, we can simply use the resize function from OpenCV. If we don't specify an interpolation method, it uses bilinear interpolation, which is fine for this project.
target_size = 224
img = cv2.resize(img, (target_size, target_size))
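Since we are shrinking the images, you could also pass an explicit interpolation flag; cv2.INTER_AREA is generally the recommended choice for downscaling, although the bilinear default is perfectly fine here:

    img = cv2.resize(img, (target_size, target_size), interpolation=cv2.INTER_AREA)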
Finally, the image has to be saved to the output file, using the imwrite function.
cv2.imwrite(str(output_path), img)
Now we can simply call our process_image function in the loop over the image paths.
for img_path in tqdm(img_paths):
    process_image(input_path=img_path, output_path=output_path / img_path.name)
If I run this program on my machine, it takes a bit over a minute to process 10,000 images.
4%|█████▏ | 441/10000 [00:02<01:01, 154.34it/s]
Now, while waiting a minute is still feasible for this dataset size, for a 10x larger dataset you would already wait ten minutes. We can do much better by parallelizing this process. If you look at the resource usage while the current program runs, you will notice that only one core is at 100% utilization: the program is only using a single core!

Parallelize across Multiple Cores
To make our program use more of the available cores, we need to use a Python feature called multiprocessing. Due to the Global Interpreter Lock (GIL), a single Python process cannot really run CPU-bound tasks in parallel (unless the GIL is disabled, which is possible with the experimental free-threaded builds of Python ≥ 3.13). What we need to do instead is spawn multiple Python processes (hence the name multiprocessing) that are managed by our main Python program.
To implement this, we can make use of the built-in Python modules multiprocessing and concurrent.futures. We could theoretically spawn the Python processes manually, while making sure not to run more processes than we have cores. Since our task is CPU-bound, we will not see a speed improvement from more processes than cores, as they would just have to wait for each other. In fact, at some point the overhead of switching between the processes will outweigh the advantage of the parallelization.
To manage the Python processes, we can use a ProcessPoolExecutor. It keeps a pool of worker processes alive instead of fully destroying and restarting a process for each submitted task. By default, it uses as many workers as there are logical CPUs available, as reported by os.process_cpu_count(). So by default it will spawn a process for every core of my CPU, in my case 20. You can also supply a max_workers argument to specify the number of processes to spawn in the pool.
NOTE: Make sure to wrap your multiprocessing pool in a main guard. On Windows, the subprocesses are separate processes that import the module, which would keep spawning new processes recursively if the process spawning were not inside the if __name__ == "__main__": guard!
from concurrent.futures import ProcessPoolExecutor
...
if __name__ == "__main__":
    ...
    output_paths = [output_path / img_path.name for img_path in img_paths]
    with ProcessPoolExecutor() as executor:
        all_processes = executor.map(
            process_image,
            img_paths,
            output_paths,
        )
        for _ in tqdm(all_processes, total=len(img_paths)):
            pass
We use a context manager (the with statement) to create the process pool executor; this makes sure that the worker processes are cleaned up even if an exception occurs during the execution. Then we use the map function to submit one task to the pool for each pair of our input img_paths and output_paths. Finally, by wrapping the iteration over all_processes with tqdm, we get a progress bar for the tasks that have finished.
18%|█████ | 1760/10000 [00:00<00:04, 1857.23it/s]
Now if you run the program and check the CPU utilization again, you will see that all cores are used! The progress bar also shows how our iteration speed has increased.
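If you want to tune the parallelization yourself, ProcessPoolExecutor takes the max_workers argument mentioned above, and executor.map takes a chunksize argument that hands tasks to the workers in batches, which can reduce the inter-process overhead when there are many small tasks. A variant worth experimenting with (the values 8 and 32 are just placeholders to tune for your machine):

    with ProcessPoolExecutor(max_workers=8) as executor:
        all_processes = executor.map(
            process_image,
            img_paths,
            output_paths,
            chunksize=32,  # hand tasks to workers in batches of 32
        )
        for _ in tqdm(all_processes, total=len(img_paths)):
            pass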

Comparison
As a quick sanity check, I plotted the time for processing 1,000 images with different degrees of parallelization, starting with the single-worker scenario and increasing the number of workers up to twice the number of cores my machine has. The figure below indicates that the optimum lies close to the number of CPU cores: there is a sharp increase in performance going from one worker to several, and a slight decrease in performance with more workers than CPU cores.

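Such a benchmark can be scripted with a simple loop over worker counts; below is a minimal sketch (the worker counts, the 1,000-image subset, and the timing approach are my assumptions, not the exact script behind the figure):

import time

# reuses process_image, img_paths and output_paths from above;
# run this inside the __main__ guard
for n_workers in [1, 2, 4, 8, 16, 20, 40]:
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        # consume the iterator so we actually wait for all tasks to finish
        list(executor.map(process_image, img_paths[:1000], output_paths[:1000]))
    print(f"{n_workers} workers: {time.perf_counter() - start:.2f}s")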
Conclusion
In this post you learned how to efficiently process an image dataset by running the processing in parallel on all available cores. This way, the data processing pipeline was sped up by a significant factor. I hope you learned something today. Happy coding and take care!
All images and videos are created by the author.