Python: what does Thread.is_alive *exactly* mean?

Time:02-12

In Python 3.9.10, I am stumbling on the following very unsettling behaviour:

import logging
import threading

class MyThread(threading.Thread):
    def run(self):
        liveness = self.is_alive()
        logging.debug(f"Am I alive? {liveness}")  # prints FALSE!!!
        ...  # do some work involving asyncio and networking
        ...  # (specifically, I'm using aiohttp) and I know this work is
        ...  # actually being done because I can see its side-effects
        ...  # from across the network.
        liveness = self.is_alive()
        logging.debug(f"Am I alive? {liveness}")  # prints False AGAIN!!!
        ...  # go on with that work (still detectable and detected)
        liveness = self.is_alive()
        logging.debug(f"Am I alive? {liveness}")  # still False...

Under some conditions, that call to is_alive() is returning False. Now, I'm not doing anything weird like redefining methods in MyThread that I'm not supposed to, or messing around with internals of anything.

My question is, under normal circumstances, is there any situation in which Thread.is_alive would return False after the thread has started, but while it is still doing work? (By the way, doing mostly Python work, not some C code running in the background.)

More Details

There is a main thread, and two worker threads. It goes on something like this. The following code runs in the main thread:

import logging
import threading
import time

exit_signal = threading.Event()
workers = {}
# pass them the exit_signal so they know when to stop:
workers['connect_to_server_1'] = MyThread("server1.com", exit_signal)
workers['connect_to_server_1'].start()
workers['connect_to_server_2'] = MyThread("server2.com", exit_signal)
workers['connect_to_server_2'].start()

# wait until the process gets a SIGINT (user hits ^C)
try:
    for w in workers.values():
        w.join()  # will never return
except KeyboardInterrupt:
    logging.info("ok, user wants to quit, let's quit")
else:
    logging.critical("threads have quit on their own")  # never happens

# list thread statuses
w = workers['connect_to_server_1']
logging.debug(f"Is {w} alive? {w.is_alive()}")  # prints FALSE
w = workers['connect_to_server_2']
logging.debug(f"Is {w} alive? {w.is_alive()}")  # prints TRUE

# For debugging purposes, give the workers some more time to keep doing
# their jobs. This here is an interesting time window: the main thread
# has already received ^C, but the workers are supposedly not aware of
# that.
time.sleep(10)
# Finally, tell workers to stop, and wait for them to go:
exit_signal.set()
workers['connect_to_server_1'].join()
workers['connect_to_server_2'].join()
logging.info("all good, bye!")

What happens is this

  • Before I hit ^C, I see in my log output messages from both workers telling me that they are healthy and successfully doing their jobs; more importantly, those "Am I alive?" messages (from the first code snippet at the top) always say True.

  • After I hit ^C, I see the log messages from the main thread checking the is_alive() status of both workers. I expected it to tell me that both workers are alive, since the interrupt signal always interrupts just the main thread. However, it tells me that the second worker is alive, but the first is not.

  • After that, while the main thread is blocked on that time.sleep(10) call, I still see messages from both workers in the log output. Both workers tell me that they are successfully doing their jobs (which can be verified by log messages from the other server they're talking to). However, every time the first worker logs the "Am I alive?" message, it says False. WTF?

  • Finally, I set the exit_signal.

    • I see a message from the second worker (the one that was saying "Am I alive? True"), telling me that it received the signal, and then it goes on doing its shutdown routine, closing files and sockets etc.
    • I don't see any message from the other worker, and I can't see anything anywhere that indicates that it has received that signal, except for the fact that the join method on its thread returned successfully!

Closing thoughts

This code has been running in Python 3.6 for months, usually with around 15 workers instead of 2, and this issue never happened. It only happens when I try to run it in Python 3.9. It's somewhat easily reproducible: when I run that service with Python 3.9, around half of the time everything works perfectly, but in the other half I get scared by this zombie thread telling me that it's dead, yet it's talking to me.

Also, the zombie thread is always the one talking to one specific server, which makes me think that this might be a problem with that one server's SSL certificate, or its implementation of the websocket protocol, but whatever, I don't control that one server. What I do control is this instance of threading.Thread, which should be either dead or walking upright, but not both.

What am I missing here?

CodePudding user response:

Turns out this is a recently introduced bug in the threading implementation. Thread.join calls the internal method Thread._wait_for_tstate_lock, and that method was recently changed to look like this:

try:
    if lock.acquire(block, timeout):
        lock.release()
        self._stop()
except:
    if lock.locked():
        # bpo-45274: lock.acquire() acquired the lock, but the function
        # was interrupted with an exception before reaching the
        # lock.release(). It can happen if a signal handler raises an
        # exception, like CTRL+C which raises KeyboardInterrupt.
        lock.release()
        self._stop()
    raise

The if lock.locked() check was an attempt to fix a hanging issue that occurred if this method was interrupted by Ctrl-C between the lock.acquire and the lock.release immediately afterward, but the check is wrong. It doesn't check if the previous lock.acquire call acquired the lock. It just checks if the lock is locked at all! The lock is almost always locked, and particularly, it's supposed to be locked the entire time a thread is alive.
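The point about locked() can be checked with a plain threading.Lock, independent of any Thread internals. The following is a minimal sketch; the variable and function names are made up for illustration. It shows that locked() stays True when someone else holds the lock, and that a plain Lock has no owner, so any thread may release it.

```python
import threading

lock = threading.Lock()
lock.acquire()  # stands in for the lock that is held while a thread is alive

# A non-blocking acquire by "someone else" fails, because the lock is taken...
got_it = lock.acquire(blocking=False)

# ...yet locked() is still True: it reports the state of the lock,
# not whether the caller is the one holding it.
locked_anyway = lock.locked()


def release_from_other_thread():
    # A plain Lock is not owned by anyone, so this release() succeeds
    # even though this thread never acquired the lock.
    lock.release()


helper = threading.Thread(target=release_from_other_thread)
helper.start()
helper.join()
released_by_other = not lock.locked()

print(got_it, locked_anyway, released_by_other)
```

So a True result from locked() inside the except block says nothing about whether the interrupted acquire call succeeded.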

This means that if you interrupt the lock.acquire call in this method with Ctrl-C, the code releases the lock (that someone else is holding) and calls self._stop to perform end-of-thread cleanup, including marking the thread as not alive any more. That's why your is_alive calls are returning False.
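The failure mode can be reproduced in a toy model without touching CPython internals. In the sketch below, ToyWorker, interrupted_acquire and both wait functions are hypothetical names invented for this illustration; interrupted_acquire simply raises KeyboardInterrupt to simulate Ctrl-C arriving while acquire is still blocked. The second wait function is only meant to show the difference between "the lock is locked" and "I acquired the lock". It is not CPython's fix, and it does not close the original race between acquire returning and the flag being set.

```python
import threading


class ToyWorker:
    """Hypothetical stand-in for a Thread: models only the 'alive' lock and flag."""

    def __init__(self):
        self.alive_lock = threading.Lock()
        self.alive_lock.acquire()  # held for as long as the worker is running
        self.stopped = False

    def is_alive(self):
        return not self.stopped


def interrupted_acquire(lock):
    """Simulate Ctrl-C arriving while acquire() is still blocked: nothing is acquired."""
    raise KeyboardInterrupt


def wait_with_locked_check(worker):
    """Same shape as the excerpt above: the handler trusts lock.locked()."""
    lock = worker.alive_lock
    try:
        if interrupted_acquire(lock):
            lock.release()
            worker.stopped = True
    except BaseException:
        if lock.locked():  # True because the worker holds the lock, not us
            lock.release()  # releases a lock this call never acquired
            worker.stopped = True  # marks a running worker as dead
        raise


def wait_with_own_flag(worker):
    """Illustrative variant: clean up only if this call acquired the lock."""
    lock = worker.alive_lock
    acquired = False
    try:
        acquired = interrupted_acquire(lock)
        if acquired:
            lock.release()
            acquired = False
            worker.stopped = True
    except BaseException:
        if acquired:
            lock.release()
            worker.stopped = True
        raise


def run_interrupted(wait, worker):
    """Call a wait function, swallow the simulated Ctrl-C, report liveness."""
    try:
        wait(worker)
    except KeyboardInterrupt:
        pass
    return worker.is_alive()


alive_after_locked_check = run_interrupted(wait_with_locked_check, ToyWorker())
alive_after_own_flag = run_interrupted(wait_with_own_flag, ToyWorker())
print(alive_after_locked_check, alive_after_own_flag)  # prints: False True
```

In the first case the worker is still "running" (nothing stopped it), yet is_alive() reports False, which matches the zombie behaviour described in the question.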
