Boosting PyTorch Inference on CPU: From Post-Training Quantization to Multithreading
The Kaggle Blueprints

Welcome to another edition of "The [Kaggle](https://www.kaggle.com/) Blueprints", where we will analyze Kaggle competitions' winning solutions for lessons we can apply to our own Data Science projects.
This edition will review the techniques and approaches from the "BirdCLEF 2023" competition, which ended in May 2023.
Problem Statement: Deep Learning Inference under Limited Time and Computation Constraints
The BirdCLEF competitions are a series of annually recurring competitions on Kaggle. The main objective of a BirdCLEF competition is usually to identify a specific bird species by sound. The competitors are given short audio files of single bird calls and then must predict whether a specific bird was present in a longer recording.
In an earlier edition of The Kaggle Blueprints, we have already reviewed the winning approaches to audio classification with Deep Learning from last year's "BirdCLEF 2022" competition.
One aspect that was novel in the "BirdCLEF 2023" competition was the limited time and computational constraints: Competitors were asked to predict roughly 200 10-minute-long recordings on a CPU Notebook within 2 hours.
Now, you might be asking why anyone would want to infer a Deep Learning model on a CPU instead of a GPU. This is a common practical problem statement [4] as oftentimes staff (especially in conservation but also in other industries) have budget constraints and thus only have access to limited computing resources. Additionally, being able to make predictions quickly is helpful.
Because covering how to approach audio classification with Deep Learning would repeat the previous The Kaggle Blueprints edition on the BirdCLEF 2022 competition, this article focuses on the novel aspect: how to speed up inference of Deep Learning models on a CPU.
If you are interested in winning approaches to audio classification with Deep Learning, check out the previous edition:
Approaching Deep Learning Inference on CPU
The main problem with having to run inference on a CPU within a limited time is that you cannot build large ensembles of powerful and diverse models to squeeze out the last few percent of performance. Depending on the model used, some competitors even struggled to meet the time limit with a single model.
However, an ensemble of weaker models usually performs better than a single powerful model. In the competition write-ups, successful competitors shared the tricks they used to speed up inference on CPU enough to ensemble multiple models.
This article covers the following tricks that were shared in the write-ups:
- Model Selection
- Post-Training Quantization (with ONNX Runtime or OpenVINO)
- Multithreading with ThreadPoolExecutor
Model Selection
The model size heavily impacts the inference time. As a rule of thumb: the bigger the model, the longer the inference time.
Thus, when selecting the backbones for the models in their ensembles, competitors had to evaluate which models resulted in the best trade-off between performance and inference time.
While NFNet (`eca_nfnet_l0`) [3, 5, 7, 9, 10, 11, 13, 14, 16] and EfficientNet remained the popular backbones in both last year's and this year's competitions, we could see that this year competitors preferred smaller versions of EfficientNet.
While in the BirdCLEF 2022 competition `tf_efficientnet_b0_ns` [8, 11], `tf_efficientnet_b3_ns` [8], `tf_efficientnetv2_s_in21k` [11, 16], and `tf_efficientnetv2_m_in21k` [13] were popular, this year the smaller versions `tf_efficientnet_b0_ns` [1, 5, 6, 7, 10] and `tf_efficientnetv2_s_in21k` [1, 6, 15] were preferred.
Below you can see a comparison of the model sizes in terms of the number of parameters for a selection of popular models in the BirdCLEF competition series.

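If you want to reproduce such a comparison yourself, counting parameters is a one-liner per model. Below is a minimal sketch, assuming the `timm` library and that the backbone names used during the competition are still available in your `timm` version:

```python
import timm

# Compare parameter counts of backbones popular in the BirdCLEF competitions
# (pretrained weights are not needed just to count parameters)
for name in ["eca_nfnet_l0", "tf_efficientnet_b0_ns", "tf_efficientnetv2_s_in21k"]:
    model = timm.create_model(name, pretrained=False)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```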
As a result, we could see that successful competitors leveraged a combination of a larger model (`eca_nfnet_l0`) with smaller models (e.g., `tf_efficientnet_b0_ns`).
Post-Training Quantization
Another trick to speed up inference on CPU is to apply quantization to the model after training: Post-training quantization lowers the precision of the model's weights and activations from floating-point precision (32 bits) to a lower bit width representation (e.g., 8 bits).
This technique transforms the model into a more hardware-friendly representation and thus improves latency. However, due to the loss in precision of the weight and activation representation, it can also lead to slight performance loss.
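To build some intuition for what lowering the precision means, here is a toy sketch (not competition code) that maps a float32 array to unsigned 8-bit integers with a scale and zero point, and shows the small round-trip error this introduces:

```python
import numpy as np

# Toy affine quantization of a float32 array to uint8
w = np.array([-1.2, -0.3, 0.0, 0.7, 2.5], dtype=np.float32)

scale = (w.max() - w.min()) / 255.0           # step size of the 8-bit grid
zero_point = round(float(-w.min() / scale))   # integer that represents 0.0

w_uint8 = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
w_dequant = (w_uint8.astype(np.float32) - zero_point) * scale

print(w_uint8)                  # the 8-bit representation
print(np.abs(w - w_dequant))    # small rounding error = the lost precision
```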
Quantization goes hand in hand with hardware. For example, a Kaggle Notebook has 4 CPUs (Intel(R) Xeon(R) CPU @ 2.20GHz with x86_64 architecture). These Intel CPUs with x86 architecture prefer the quantized data types to be `INT8`.
Hint: To display information about the CPU architecture, run the [lscpu](https://man7.org/linux/man-pages/man1/lscpu.1.html) command and then check the manufacturer's homepage to see which quantized input data types that specific CPU prefers.
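If you prefer to stay inside a notebook, the Python standard library can give you the basics (the exact output depends on the machine you run on):

```python
import os
import platform

print(platform.machine())   # e.g. "x86_64" on a Kaggle CPU Notebook
print(os.cpu_count())       # number of available logical CPUs, e.g. 4
```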
For an in-depth explanation of post-training quantization and a comparison of ONNX Runtime and OpenVINO, I recommend this article:
This section will specifically look at two popular frameworks for post-training quantization: ONNX Runtime and OpenVINO.
ONNX Runtime
One popular approach to speed up inference on CPU was to convert the final models to the ONNX (Open Neural Network Exchange) format [2, 7, 9, 10, 14, 15].
The relevant steps to quantize and accelerate inference on CPU with ONNX Runtime are shown below:
Preparation: Install ONNX Runtime
```bash
pip install onnxruntime
```
Step 1: Convert PyTorch Model to ONNX
```python
import torch

# Define your model here
model = ...

# Train your model here
...

# Define a dummy input with the same shape as the real model input
# (the model and the dummy input must be on the same device)
dummy_input = torch.randn(1, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT, device="cuda")

# Export the PyTorch model to ONNX format and name the input "input"
# so it can be referenced by name in the inference session later
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])
```
Step 2: Make predictions with an ONNX Runtime session
```python
import onnxruntime as rt

# Define X_test with shape (BATCH_SIZE, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
X_test = ...

# Define the ONNX Runtime session with the CPU execution provider
sess = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Make a prediction (None returns all model outputs)
y_pred = sess.run(None, {"input": X_test})[0]
```
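Note that converting a model to ONNX changes its format, not its numerical precision. If you also want the weights lowered to `INT8`, ONNX Runtime ships a post-training quantization helper. Below is a minimal sketch of dynamic quantization (the file names are placeholders, and whether this pays off depends on the model and CPU, so benchmark both variants):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights of the exported model from float32 to int8
quantize_dynamic(model_input="model.onnx",
                 model_output="model_quant.onnx",
                 weight_type=QuantType.QInt8)
```

The quantized file (`model_quant.onnx`) can then be loaded in the inference session above in place of `model.onnx`.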
OpenVINO
An equally popular approach to speed up inference on CPU was to use OpenVINO (Open Visual Inference and Neural Network Optimization) [5, 6, 12], as shown in this Kaggle Notebook:
The relevant steps to quantize and accelerate a Deep Learning model with OpenVINO are shown below:
Preparation: Install OpenVINO
```bash
pip install openvino-dev[onnx]
```
Step 1: Convert PyTorch Model to ONNX (see Step 1 of ONNX Runtime)
Step 2: Convert ONNX Model to OpenVINO
```bash
mo --input_model model.onnx
```
This will output an XML file and a BIN file, of which we will use the XML file in the next step.
Step 3: Load and compile the model with the OpenVINO Runtime
Strictly speaking, this step optimizes and compiles the converted model for the target CPU rather than quantizing it; an explicit `INT8` post-training quantization would additionally require OpenVINO's post-training optimization tooling.
```python
import openvino.runtime as ov

# Read the converted model (the XML file from Step 2) and compile it for the CPU
core = ov.Core()
openvino_model = core.read_model(model="model.xml")
compiled_model = core.compile_model(openvino_model, device_name="CPU")
```
Step 4: Make predictions with an OpenVINO inference request
```python
# Define X_test with shape (BATCH_SIZE, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
X_test = ...

# Create an inference request and run it on X_test
infer_request = compiled_model.create_infer_request()
results = infer_request.infer(inputs=[X_test])

# Retrieve the prediction from the model's first output
y_pred = results[compiled_model.output(0)]
```
Comparison: ONNX vs. OpenVINO vs. Alternatives
Both ONNX Runtime and OpenVINO are frameworks optimized for deploying models on CPUs. The inference times of a neural network quantized with ONNX Runtime and with OpenVINO are reported to be comparable [12].
Some competitors used PyTorch JIT [3] or TorchScript [1] as alternatives to speed up inference on CPU. However, other competitors shared that ONNX was considerably faster than TorchScript [10].
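If you want to verify this on your own model and hardware, a simple wall-clock comparison is enough. Below is a minimal sketch that assumes the ONNX Runtime session (`sess`), the OpenVINO inference request (`infer_request`), and a real `X_test` array from the previous snippets:

```python
import time

def time_it(fn, n_runs=20):
    # One warm-up run, then average the wall-clock time over n_runs
    fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs

onnx_ms = time_it(lambda: sess.run(None, {"input": X_test})) * 1000
openvino_ms = time_it(lambda: infer_request.infer(inputs=[X_test])) * 1000

print(f"ONNX Runtime: {onnx_ms:.1f} ms | OpenVINO: {openvino_ms:.1f} ms")
```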
Multithreading with ThreadPoolExecutor
Another popular approach to speed up inference on CPU was to use multithreading with `ThreadPoolExecutor` [2, 3, 9, 15] in addition to post-training quantization, as shown in this Kaggle Notebook:
This enabled competitors to run multiple inferences at the same time.
In the following example of `ThreadPoolExecutor` from the competition, we have a list of audio files to run inference on.
```python
audios = ['audio_1.ogg',
          'audio_2.ogg',
          # ...,
          'audio_n.ogg']
```
Next, you need to define an inference function that takes an audio file as input and returns the predictions.
```python
def predict(audio_path):
    # Define any preprocessing of the audio file here
    ...
    # Make predictions
    ...
    return predictions
```
With the list of audios (e.g., `audios`) and the inference function (e.g., `predict()`), you can now use `ThreadPoolExecutor` to run multiple inferences at the same time (in parallel) as opposed to sequentially, which will give you a nice boost in inference time.
```python
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    dicts = list(executor.map(predict, audios))
```
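To make this concrete, here is a minimal sketch (not any particular competitor's code) that combines the multithreading pattern with the ONNX Runtime session from earlier. `load_and_preprocess()` is a hypothetical helper that turns one audio file into a model-ready batch; everything else was shown above. Since `InferenceSession.run()` is thread-safe and releases Python's GIL while it executes, the threads can genuinely run in parallel:

```python
import concurrent.futures

import numpy as np
import onnxruntime as rt

# One session can be shared by all threads (InferenceSession.run() is thread-safe)
sess = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def predict(audio_path):
    # load_and_preprocess() is a hypothetical helper that returns an array of
    # shape (BATCH_SIZE, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT) for one audio file
    X = load_and_preprocess(audio_path).astype(np.float32)
    return sess.run(None, {"input": X})[0]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    predictions = list(executor.map(predict, audios))
```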
Summary
There are many more lessons to be learned from reviewing the learning resources Kagglers have created during the course of the "BirdCLEF 2023" competition. There are also many different solutions for this type of problem statement.
In this article, we focused on the general approach that was popular among many competitors:
- Model Selection: Select the model size according to the best trade-off between performance and inference time. Also, leverage bigger and smaller models in your ensemble.
- Post-Training Quantization: Post-training quantization can lead to faster inference times because the data types of the model's weights and activations are optimized for the hardware. However, this can also lead to a slight loss of model performance.
- Multithreading: Run multiple inferences in parallel instead of sequentially. This will give you a boost in inference time.
If you are interested in how to approach audio classification with Deep Learning, which was the main aspect of this competition, check out the write-up of the BirdCLEF 2022 competition:
Enjoyed This Story?
Subscribe for free to get notified when I publish a new story.
Find me on LinkedIn, Twitter, and Kaggle!
References
Image References
If not otherwise stated, all images are created by the author.
Web & Literature
[1] adsr (2023). 3rd place solution: SED with attention on Mel frequency bands in Kaggle Discussions (accessed June 1st, 2023)
[2] anonamename (2023). 6th place solution: BirdNET embedding + CNN in Kaggle Discussions (accessed June 1st, 2023)
[3] atfujita (2023). 4th Place Solution: Knowledge Distillation Is All You Need in Kaggle Discussions (accessed June 1st, 2023)
[4] beluga (2023). Inference constraints – CPU Notebook <= 120 minutes (accessed March 27th, 2023).
[5] Harshit Sheoran (2023). 9th Place Solution: 7 CNN Models Ensemble in Kaggle Discussions (accessed June 1st, 2023)
[6] HONG LIHANG (2023). 2nd place solution: SED + CNN with 7 models ensemble in Kaggle Discussions (accessed June 1st, 2023)
[7] HyeongChan Kim (2023). 24th place solution – pre-training & single model (5 folds ensemble with ONNX) in Kaggle Discussions (accessed June 1st, 2023)
[8] LeonShangguan (2022). [Public #1 Private #2] + [Private #7/8 (potential)] solutions. The host wins. in Kaggle Discussions (accessed March 13th, 2023)
[9] LeonShangguan (2023). 10th place solution in Kaggle Discussions (accessed June 1st, 2023)
[10] moritake04 (2023). 20th place solution: SED + CNN ensemble using onnx in Kaggle Discussions (accessed June 1st, 2023)
[11] slime (2022). 3rd place solution in Kaggle Discussions (accessed March 13th, 2023)
[12] storm (2023). top 7th solution – sumix augmentation did all the work in Kaggle Discussions (accessed June 1st, 2023)
[13] Volodymyr (2022). 1st place solution models (it's not all BirdNet) in Kaggle Discussions (accessed March 13th, 2023)
[14] Volodymyr (2023). 1st place solution: Correct Data is All You Need in Kaggle Discussions (accessed June 1st, 2023)
[15] Yevhenii Maslov (2023). 5th place solution in Kaggle Discussions (accessed June 1st, 2023)
[16] yokuyama (2022). 5th place solution in Kaggle Discussions (accessed March 13th, 2023)