Understanding Race Conditions In the Context of Python

Author: Murphy  |  Time: 2025-03-22 21:46:19
Image created in Canva by author

Whether you use threading in Python all the time or have never touched it but want to try it in the future, there is one concept you can't bypass: thread safety. Code is thread-safe when it behaves correctly even if several threads run it at the same time, and code that isn't is exposed to many kinds of subtle bugs that only multi-threading can cause. While we benefit from running tasks concurrently in multiple threads, thread safety must always be kept in mind to avoid those bugs.

Of all the typical multi-threading bugs, the race condition is a particularly subtle and annoying one. It may rarely show up when only a few threads are running concurrently, which makes it harder to troubleshoot. However, it is also a bug that can be avoided if we really understand what happens under the hood.

In this article, I'll introduce what a race condition is and why it can happen. In Python, the GIL makes code relatively more robust against it than in most other programming languages, but it can still happen. Under certain circumstances the race condition may not show up in Python at all, so I'll find a way to reproduce the bug reliably and explain the mechanisms in detail.

1. What Is a Race Condition?

When I was a newbie many years ago and wrote code with race condition bugs, I was fortunate that someone pointed out my mistake on Stack Overflow. However, the explanation was not 100% clear to me. Anyway, my initial understanding at that time was as follows.

We shouldn't let multiple threads access and modify a shared resource concurrently without proper coordination.

Otherwise, a race condition will happen and some of the results from those threads will never be applied to the shared resource.

For example, suppose we define a global variable counter and a function that simply adds 1 to this variable 100k times. Then we create 10 threads to run this function concurrently. The code below is the implementation.

import threading

# Shared variable
counter = 0

# The function
def increment():
    global counter
    for _ in range(100000):
        counter += 1

# Create multiple threads
threads = []
for _ in range(10):
    thread = threading.Thread(target=increment)
    threads.append(thread)
    thread.start()

# Wait for all the threads to finish before reading the result
for thread in threads:
    thread.join()

print(f"Final counter value: {counter}")

In the above code, we have used global counter to make sure the variable counter is accessible from inside the increment() function. Then, we simply use the threading module to create 10 threads and put them in a list. Finally, we use the thread.join() method to wait until all the threads are finished. The final value of the variable counter will be output.

OK, in this example, we execute the increment() function 10 times from different threads. If there is no bug, the value of the variable counter should be 100k x 10 = 1,000,000.

However, a race condition may occur in the above code, which means some of the increments are lost. So, the value we get from the variable counter might be less than 1 million.

Demonstration of the Race Condition Bug

At that time, I tried to find out why race condition bugs happen. There are many articles on the Internet, but most of them only repeat the concept I described above. To be honest, it took me several years to realise the real reasons behind race condition bugs. So, rather than telling you the concept only, I want to demonstrate a typical race condition case using diagrams.

Firstly, let's have a look at the scenario where there is only one thread.

"increment()" function with only one thread

Of course, there is no bug when there is only one thread. The steps are simple and clear.

  1. Load the global variable counter into a "local" copy inside the function.
  2. Load the constant 1, because we want to add 1 to the variable.
  3. Perform the calculation, counter: 0 + 1 = 1, so the local copy of the variable counter now equals 1.
  4. Store the local copy of the variable counter back to the global variable counter.

So, the global variable counter will be equal to 1.
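
The four steps above correspond roughly to the bytecode instructions that CPython generates for counter += 1. The snippet below is a minimal sketch that uses the standard dis module to list them; the helper name increment_once is made up for illustration, and the exact instruction names differ between Python versions, so no particular output is shown.

```python
import dis

counter = 0

# Hypothetical helper: one single increment, without the loop
def increment_once():
    global counter
    counter += 1

# Collect the names of the bytecode instructions, in execution order
opnames = [ins.opname for ins in dis.get_instructions(increment_once)]
print(opnames)

# The load of the global comes before the store back to the global
load_idx = opnames.index("LOAD_GLOBAL")
store_idx = opnames.index("STORE_GLOBAL")
print(load_idx < store_idx)
```

Whatever the version, the point is that the load and the store are separate instructions, so a single line of Python is not a single step for the interpreter.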

Now, let's have a look at the scenario with two threads, where the race condition bug happens. Both threads add 1 to the variable counter, so the resulting value should be 2, but the scenario below incorrectly produces 1.

When there are two threads, the Python GIL (Global Interpreter Lock) allows only one of them to execute Python bytecode at a time. At certain points the running thread gives up the GIL, so the other thread gets a chance to run rather than starving. One thread can only continue after the other has released the lock. If you are interested in how multi-threading works in Python in more detail, please check out the article below.

Probably the Easiest Tutorial for Python Threads, Processes and GIL

Now, let's go back to the diagram itself. The problem starts at the step that loads the global variable counter.

Long story short, Thread2 loaded the variable counter before Thread1 saved its result back to the global variable. Both "increment" calculations happened independently in their own threads, and both ended up with counter = 1 in their local scope, because 0 + 1 = 1. Therefore, both threads save counter = 1 from their local scope back to the global variable counter.

In the example from the diagram, Thread2 overwrites counter = 1 with counter = 1 again. So, the final value of the variable counter is 1, though it should be 2.
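
The interleaving from the diagram can also be replayed by hand in a single thread. The snippet below is only a simulation, not real multi-threading: the names t1_local and t2_local are hypothetical stand-ins for the "local" copies held by Thread1 and Thread2.

```python
counter = 0

# Thread1 loads the global variable
t1_local = counter
# Thread2 loads it too, before Thread1 has stored its result
t2_local = counter

# Both threads perform their own calculation: 0 + 1 = 1
t1_local = t1_local + 1
t2_local = t2_local + 1

# Thread1 stores its result back to the global variable
counter = t1_local
# Thread2 overwrites it with the same value
counter = t2_local

print(counter)  # prints 1, although two increments were performed
```

One of the two increments is lost because Thread2 worked on a stale copy of counter.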

OK. I hope the above explanation makes sense to you. Are you curious why I haven't run the code to reproduce the race condition bugs? Good question, I'll show you the reason in the next section.

2. Why Is Python "Robust" to Race Conditions?

OK. Let's run the code we have at the beginning.

Hold on, the result 1,000,000 is actually correct, which means that the race condition bug didn't happen. To be honest, when I saw this I was surprised, too. Initially, I suspected the bug didn't show up because a race condition only happens by chance. So, I ran the code about 10 more times, and the bug still didn't happen. That can't be explained by probability.
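
Repeating the experiment by hand gets tedious, so here is a minimal sketch that wraps the same logic in a function and runs it several times. The helper run_trial and its parameters are hypothetical names introduced for illustration. The values you see depend on the Python version you run it with, so no output is shown.

```python
import threading

counter = 0

def increment(times):
    global counter
    for _ in range(times):
        counter += 1

# Hypothetical helper: run the whole experiment once and return the result
def run_trial(num_threads=10, times=100_000):
    global counter
    counter = 0  # reset the shared variable before each trial
    threads = [
        threading.Thread(target=increment, args=(times,))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter

# Run the experiment a few times and collect the final counter values
results = [run_trial() for _ in range(5)]
print(results)
```

Any value below 1,000,000 in the list means that at least one increment was lost in that trial.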

Finally, I found this pull request from the CPython repo.

bpo-29988: Only check evalbreaker after calls and on backwards egdes. by markshannon · Pull Request…

Basically, this pull request optimised the behaviour of the Python interpreter. Long story short, this change in Python 3.10 reduced the number of points at which the interpreter checks whether the running thread should hand over the GIL. When the steps are simple, such as loading a constant and adding it to another number, the thread keeps the GIL and continues. So, each counter += 1 in the increment() function effectively runs as one uninterrupted step, and the race condition does not show up. Keep in mind that this is a side effect of how CPython is implemented, not a guarantee of the language.

Then, another question arises: when will the GIL pause one thread and let another one run? The answer is the "eval breaker", which will be explained in the next section.

3. When May a Race Condition Happen in Python?

OK, from the previous section, we know that the GIL will not be handed over in the middle of our increment because "loading a constant number" is not an "eval breaker". What is an eval breaker exactly?

In this article, it refers to the types of operations at which the running thread may release the GIL so that another thread can be executed. Typical eval breakers include

  • I/O operations such as writing data to a file, receiving data from the network, etc.
  • Function calls. For example, the operation calls another function and uses the result in the next step. Since executing another function is more likely to be time-consuming, this is an eval breaker, too.
  • Loop iterations. Each time a loop jumps back to start its next iteration is a possible switching point, since some loops are long-running.

OK. We now understand what the typical eval breakers are. So, we can now try to reproduce the race condition bug.

The easiest way of doing this is to turn the "loading constant number" step into a function call. Please see the code as follows.

import threading

# A shared variable
counter = 0

# A trivial function, so that each increment now involves a function call
def one():
    return 1

def increment():
    global counter
    for _ in range(100000):
        counter += one()

threads = []

# Create and start multiple threads
for _ in range(10):
    thread = threading.Thread(target=increment)
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

print(f"Final counter value: {counter}")

We defined a function one() which just returns the number 1. I know it doesn't make much sense, but it is the easiest way for demonstration purposes. Let's execute it.

See, the result is incorrect now, which shows that the race condition did happen. BTW, the final counter value will likely vary if you execute it multiple times, because whether and how often a race condition strikes depends on timing.
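
To see why the function call makes the difference, we can look at the bytecode again. The snippet below is a rough sketch using the standard dis module. The helper name increment_once is made up for illustration, and the check only looks for an instruction whose name starts with CALL, because the exact name differs between Python versions.

```python
import dis

counter = 0

def one():
    return 1

# Hypothetical helper: one single increment, without the loop
def increment_once():
    global counter
    counter += one()

opnames = [ins.opname for ins in dis.get_instructions(increment_once)]
print(opnames)

# Positions of: loading counter, calling one(), storing counter
load_idx = opnames.index("LOAD_GLOBAL")
call_idx = next(i for i, name in enumerate(opnames) if name.startswith("CALL"))
store_idx = opnames.index("STORE_GLOBAL")

print(load_idx < call_idx < store_idx)  # True: the call sits between load and store
```

The call to one() happens after counter has been loaded and before the result is stored back. That is exactly the window in which another thread can load the same stale value.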

Summary

I hope the content above helps you understand what a race condition bug is in Python and when it may happen. Also, an optimisation in Python 3.10 makes Python more robust against this bug in certain scenarios. However, we still need to bear in mind that the bug can happen in other scenarios.

The purpose of this article is not only to introduce the bug itself. Understanding it thoroughly will help us avoid this type of bug during development. Hope it helps!
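
As a minimal sketch of the "proper coordination" mentioned earlier, assuming the standard-library threading.Lock fits the use case, the reproduction code can be guarded as below. The name counter_lock is hypothetical and only used for illustration.

```python
import threading

counter = 0
counter_lock = threading.Lock()  # hypothetical name, guards the shared counter

def one():
    return 1

def increment():
    global counter
    for _ in range(100000):
        # Only one thread at a time can run the load, add and store steps
        with counter_lock:
            counter += one()

threads = []
for _ in range(10):
    thread = threading.Thread(target=increment)
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

print(f"Final counter value: {counter}")  # Final counter value: 1000000
```

Because the whole read-modify-write sequence now happens while the lock is held, no thread can work on a stale copy of counter, at the cost of some extra overhead per increment.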
