Concurrency with Python: Threads and Locks
The Concurrency with Python Series:
- Concurrency with Python: Why?
- Concurrency with Python: Threads and Locks
- Concurrency with Python: Functional Programming
- Concurrency with Python: Separating Identity From State
- Concurrency with Python: Actor Models
- Concurrency with Python: CSP and Coroutines
- Concurrency with Python: Hardware-Based Parallelism
- Concurrency with Python: Data-Intensive Architectures
- Concurrency with Python: Conclusion
Overview
Threads and locks are a software-defined formalization of the hardware underneath, and as such comprise the simplest concurrency model. They form the basis of the other concurrency abstractions built on top of them, which makes them important to understand. However, it's difficult, and sometimes impossible, to build reliable, scalable systems directly on these primitives.
While nearly every language has support for threads and locks, CPython remains special in its use of a global interpreter lock (GIL), which prevents more than one thread from executing Python bytecode at a time, because CPython's memory management is not thread-safe. Blocking operations such as I/O release the GIL and can therefore still yield speedups, but the syscall overhead of thread switching may degrade performance for compute-heavy work. This means threading in Python is primarily useful for I/O-bound operations, not CPU-bound operations.
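A quick way to see the I/O-bound benefit is with a sketch like the one below (my own example, using `time.sleep` as a stand-in for blocking I/O such as a network call): because the sleep releases the GIL, four threaded waits overlap instead of running back to back.

```python
import threading
import time

def io_task():
    # time.sleep stands in for blocking I/O; it releases the GIL,
    # so the four waits below overlap rather than run serially.
    time.sleep(0.2)

threads = [threading.Thread(target=io_task) for _ in range(4)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Serially this would take ~0.8 seconds; threaded, closer to ~0.2.
print('elapsed: {:.2f}s'.format(elapsed))
```

Replacing `time.sleep` with a pure-Python computation would show the opposite: no speedup, because the GIL serializes the bytecode execution.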
(As an aside, I mention CPython because other, partial implementations of the Python specification, such as Jython, do not have a global interpreter lock. However, these implementations are not as widely used in practice because a) nobody wants to support multiple Python implementations unless they have to, b) they are less fully fleshed out, and c) due to the need to natively support the C/C++ extensions API, the Python language definition is tightly coupled to C/C++ and is less a technical specification than a reference implementation.)
Python directly supports threading through a high-level `threading` module and a lower-level `_thread` module. To get more information on how these modules work, the source code is available online; `threading.py` is linked here.
Getting Started
The quintessential single-threaded "Hello World" execution in Python is famously simple:
```python
print('Hello World!')
```
The multi-threaded analogue isn't too different:
```python
import threading


def hello_world():
    print(
        'Hello from {0}'.format(
            threading.get_ident()
        )
    )


t1 = threading.Thread(target=hello_world)
t2 = threading.Thread(target=hello_world)
t1.start()
t2.start()
```
During my limited amount of testing, the above script generated the result shown below:
```
(python3.7) host:~ username$ python test.py
Hello from 123145444872192
Hello from 123145450127360
(python3.7) host:~ username$ python test.py
Hello from 123145370476544
Hello from 123145370476544
(python3.7) host:~ username$ python test.py
Hello from 123145409552384
Hello from 123145414807552
(python3.7) host:~ username$ python test.py
Hello from 123145333612544
Hello from 123145333612544
```
I used `get_ident()` to print the "thread identifier" (an opaque value with no meaning except to disambiguate between different threads at runtime). You can see that in some runs the thread identifiers differ, while in others they are the same. Identical thread identifiers do not imply the work ran on the same thread; Python may reuse an identifier once a thread has exited and the identifier is no longer needed.
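As a small sketch of my own: if you record `get_ident()` from each worker and join each thread before starting the next, you can inspect the identifiers deterministically. Whether the second thread reuses the first thread's identifier is an implementation detail, so the code below only collects the values rather than comparing them.

```python
import threading

idents = []

def record_ident():
    # Each worker records its own thread identifier.
    idents.append(threading.get_ident())

t1 = threading.Thread(target=record_ident)
t1.start()
t1.join()  # t1 has exited; its identifier may now be reused

t2 = threading.Thread(target=record_ident)
t2.start()
t2.join()

# The two values may or may not match, depending on whether the
# interpreter reused the first thread's identifier.
print(idents)
```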
Pitfalls: Timing and Consistency
If you swapped the thread identifier for the thread's name using `threading.current_thread().getName()`, you might get the results in order, largely because each thread runs the same function and code path, so the difference in latency between threads is insignificant next to the latency of the interpreter. However, this does not mean in-order execution is guaranteed; here's an example from "Python Programming" on WikiBooks where the creation of each thread and the execution of each thread have explicitly different timings:
```python
import threading
import time


class MyThread(threading.Thread):
    def run(self):
        print(
            "Hello from {}!".format(
                self.getName()
            )
        )
        # Thread execution is spaced out by
        # at least 1.0 seconds.
        time.sleep(1)
        print(
            "{} finished!".format(
                self.getName()
            )
        )


def main():
    for x in range(4):
        mythread = MyThread(
            name="Thread-{}".format(x + 1)
        )
        mythread.start()
        # Thread creation is spaced out
        # by at least 0.9 seconds.
        time.sleep(.9)


if __name__ == '__main__':
    main()
```
This results in the following output on one sample run:
```
(python3.7) host:~ username$ python test.py
Hello from Thread-1!
Hello from Thread-2!
Thread-1 finished!
Hello from Thread-3!
Thread-2 finished!
Hello from Thread-4!
Thread-3 finished!
Thread-4 finished!
```
This log shows that thread creation and execution are interleaved. As the timings of thread creation and execution diverge further, with the added variability of extra functionality, the results become ever more unpredictable. The principle remains the same, though: there are no guarantees of consistent ordering when using multiple threads.
Pitfalls: Accessing Shared Memory
Inconsistent timing can result in incorrect behavior when different threads access shared memory. You can extend the previous example to demonstrate a race condition when counting from multiple threads:
```python
import threading
import time


class Counter():
    def __init__(self):
        self.count = 0

    def increment_until_100(self):
        while self.count != 100:
            print(
                '{0} incrementing.'.format(
                    threading.current_thread().getName()
                )
            )
            self.count += 1
            time.sleep(1)


def worker(counter):
    counter.increment_until_100()


def main():
    counter = Counter()
    for x in range(7):
        count_thread = threading.Thread(
            name="Thread-{}".format(x + 1),
            args=[counter],
            target=worker
        )
        count_thread.start()
        time.sleep(.9)
    print(
        'Counter final value is {0}'.format(
            counter.count
        )
    )


if __name__ == '__main__':
    main()
```
On one sample run, this generates the following output:

```
(python3.7) host:~ username$ python test.py
Thread-1 incrementing.
Thread-2 incrementing.
Thread-1 incrementing.
Thread-3 incrementing.
Thread-2 incrementing.
Thread-1 incrementing.
Thread-4 incrementing.
Thread-3 incrementing.
Thread-2 incrementing.
Thread-1 incrementing.
Thread-5 incrementing.
Thread-4 incrementing.
Thread-3 incrementing.
Thread-2 incrementing.
Thread-1 incrementing.
Thread-6 incrementing.
..
Thread-2 incrementing.
Thread-1 incrementing.
Counter final value is 28
Thread-7 incrementing.
Thread-6 incrementing.
...
Thread-7 incrementing.
Thread-6 incrementing.
(python3.7) host:~ username$
```
This result varies with the number of threads created, but you can see how the result of 28 is very different from the intended value of 100. ~~`Counter().count` is not thread-safe, as demonstrated here (on a different machine, you might get a different result than 28)~~ (Update: see correction at end of post). If you encounter a race condition, the relevant section of code may be difficult to find without sufficient logging.
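As a sketch of one possible fix (my own rewrite of the example above, not the post's original code): guard the check-and-increment with a `threading.Lock` so no two threads can act on the same value, and join every worker before reading the final count.

```python
import threading

class Counter:
    def __init__(self):
        self.count = 0
        self.lock = threading.Lock()

    def increment_until_100(self):
        while True:
            with self.lock:
                # Check and increment under the lock, so two threads
                # cannot both observe and act on the same value.
                if self.count >= 100:
                    return
                self.count += 1

counter = Counter()
threads = [
    threading.Thread(target=counter.increment_until_100)
    for _ in range(7)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker before reading the result

print('Counter final value is {0}'.format(counter.count))  # 100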
Pitfalls: Deadlocking
Deadlocking occurs when two agents block each other, each holding a lock the other needs and waiting forever for it to be released. When working with the low-level abstractions of threads and locks, the only solution is to ensure each agent manages its locks correctly, or to adopt an overall paradigm of lock coordination. The dining philosophers problem, for example, underlines the importance of process synchronization. Rosetta Code's Dining Philosophers solution in Python resolves this synchronization issue by ensuring that if you (an agent) cannot acquire both forks in good time, you release any forks you already hold so that another agent may acquire both:
```python
def dine(self):
    fork1, fork2 = self.forkOnLeft, self.forkOnRight

    while self.running:
        fork1.acquire(True)
        # NOTE: Do not block when attempting to acquire the
        # second fork, in order to avoid deadlock.
        locked = fork2.acquire(False)
        if locked:
            break
        # NOTE: If the lock acquisition is not successful, then
        # release the lock on the first fork.
        fork1.release()
        print('%s swaps forks' % self.name)
        fork1, fork2 = fork2, fork1
    else:
        return

    self.dining()
    fork2.release()
    fork1.release()
```
This approach does not exclude other locking strategies, such as lock ordering, or system designs built around process synchronization, like producer-consumer models using semaphores, though these may be less prevalent in Python than in other languages.
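Lock ordering can be sketched as follows (my own minimal example, not from Rosetta Code): if every thread acquires locks in the same global order, regardless of the order its caller names them, a circular wait can never form.

```python
import threading

# Two shared locks; two threads grabbing them in opposite orders
# without coordination can deadlock.
lock_a = threading.Lock()
lock_b = threading.Lock()

results = []

def transfer(first, second, label):
    # Lock ordering: always acquire in a fixed global order
    # (here, by object id), so circular waiting cannot occur.
    first, second = sorted((first, second), key=id)
    with first:
        with second:
            results.append(label)

t1 = threading.Thread(target=transfer, args=(lock_a, lock_b, 't1'))
t2 = threading.Thread(target=transfer, args=(lock_b, lock_a, 't2'))
t1.start()
t2.start()
t1.join()
t2.join()

print(sorted(results))  # ['t1', 't2']
```

Without the `sorted` line, `t1` acquiring `lock_a` while `t2` acquires `lock_b` could leave each thread waiting forever for the other's lock.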
Pitfalls: Alien methods and dependencies
If you are going to apply multithreading in your Python application, manually
validating and verifying thread safety and the threading model of your
dependencies is something you must do if you wish to guarantee correctness in
your entire stack. Some dependencies designed for enterprise-grade usage in a
multi-service environment, such as
redis
, may keep their concurrency
models first and foremost in mind during the design phase (see antirez
's
comments regarding a multithreaded version of redis
on Hacker
News). Some dependencies may
not; I may have run into a deadlock with
boto2
when downloading files
from S3 in parallel using
multiprocessing.pool.Pool
,
which necessitated a rewrite of a function. Hence, another difficulty with
dependencies arises; they cannot be commoditized, which means if you have not
validated all your dependencies you will use in your application before
implementing a threading model in your application, you may end up boxing
yourself into a dead end when attempting to add a dependency for a particular
use case to your project.
Multithreaded Logging
If you do choose to go with a native threading model in Python, you may be
pleasantly surprised to find that not only is the
logging
module
thread-safe by
default, but it
also supports logging from any particular thread or process (an example
demonstrated in the logging
cookbook).
The difficulty then becomes where exceptions will likely be triggered in your
application, how that affects your threading model, and ensuring robust logging
around those sections of code. Adding logs to your application may present
non-trivial latency penalties, as pylint
may inform you through the warning
logging-lazy-interpolation
,
which may also present difficulties in your threading model.
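A minimal sketch of thread-aware logging (the format string and thread names below are my own choices, not from the cookbook): the `threadName` attribute tags each record with its origin, and passing arguments to `logging.info` instead of pre-formatting the string uses the lazy %-interpolation that `logging-lazy-interpolation` asks for.

```python
import logging
import threading

# Include the thread name in each record; the logging module
# serializes handler calls internally, so records from different
# threads do not interleave mid-line.
logging.basicConfig(
    level=logging.INFO,
    format='%(threadName)s %(levelname)s %(message)s'
)

def worker(n):
    # Lazy %-interpolation: the string is only formatted if the
    # record is actually emitted.
    logging.info('processing item %d', n)

threads = [
    threading.Thread(target=worker, args=(n,), name='Worker-{}'.format(n))
    for n in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```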
concurrent.futures
I was rather unhappily surprised when writing this post to discover the Python
multiprocessing.pool.ThreadPool
implementation was never documented or tested
because it was never finished. It does
appear to remain that way even in Python 3.7, as it appears in the source code
on the GitHub
mirror.
Given the omnipresence of the global interpreter lock, and the nature of
concurrent applications primarily parallelizing I/O-related work, it may make
sense to leverage
concurrent.futures.Executor
or similar, that use the new asynchronous paradigms present in Python 3.x, as
they are more fully-featured. I have not used this module, but I would imagine
it would not incur a significant performance penalty in comparison to
multiprocessing
.
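As a sketch of the `concurrent.futures` interface (with a trivial function standing in for a real I/O-bound task such as an HTTP request): `ThreadPoolExecutor` is the thread-backed concrete `Executor`, and its `map` preserves input order while the work runs across the pool.

```python
import concurrent.futures

def fetch(n):
    # Stand-in for an I/O-bound task, such as an HTTP request.
    return n * n

# The context manager shuts the pool down and joins its worker
# threads on exit; map() yields results in input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch, range(5)))

print(results)  # [0, 1, 4, 9, 16]
```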
Conclusion
Python has rudimentary support for threads and locks, which may be less fully featured and useful than threading and locking in another language (e.g. Java), and threading and locking primitives are in any case best avoided when operating in a higher-level, interpreted language like Python. However, Python exposes enough of threading and locking to make for a good academic exercise in how threads and locks work, and an exciting introduction to the world of concurrency.
To learn more about threading and locking in production, check out "Seven Concurrency Models in Seven Weeks", by Paul Butcher.
(Correction on 2019/03/22): As aaron_m04 pointed out on the Hacker News submission of this post and Riccardo Campari pointed out below in Disqus, the issue with the race condition in the example "Pitfalls: Accessing Shared Memory" is not whether `Counter.count` is thread-safe, but rather that the child threads are never joined with `.join()`. Thanks very much to both for pointing this out to other readers and to me.
One new development in Python 3.8 is the `multiprocessing.shared_memory` module, which provides named blocks of shared memory (backed by POSIX shared memory on Unix platforms). I have not used this library, nor have I tried out Python 3.8 (currently in development at the time of writing this), but I would assume it will become the given way to share memory across Python processes when shared-nothing paradigms don't work for some reason.
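Based on the documented API (and requiring Python 3.8 or later), the basic lifecycle looks like this: create a named block, attach a second handle by name as another process would, then close every handle and unlink the block.

```python
from multiprocessing import shared_memory

# Create a named shared memory block; shm.name can be passed to
# another process so it can attach to the same block.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b'hello'

# A second handle attaches by name, as another process would.
other = shared_memory.SharedMemory(name=shm.name)
message = bytes(other.buf[:5])
print(message)  # b'hello'

other.close()
shm.close()
shm.unlink()  # free the block once all handles are closed
```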
I am honestly at a loss as to why the core Python development team added this library recently. If somebody could post a link to the design discussion or RFC for this library in the comments, or perhaps ask Davin Potts for more information, I personally would appreciate any clarification.