Why don't I get faster run-times with ThreadPoolExecutor?-CodePudding

In order to understand how threads work in Python, I wrote the following simple function:

def sum_list(thelist:list, start:int, end:int):
    s = 0
    for i in range(start,end):
        s  = thelist[i]**3//10
    return s

Then I created a list and tested how much time it takes to compute its sum:

LISTSIZE = 5000000
big_list = list(range(LISTSIZE))

start = time.perf_counter()
big_sum=sum_list(big_list, 0, LISTSIZE)
print(f"One thread: sum={big_sum}, time={time.perf_counter()-start} sec")

It took about 2 seconds.

Then I tried to partition the computation into threads, such that each thread computes the function on a subset of the list:

THREADCOUNT=4
SUBLISTSIZE = LISTSIZE//THREADCOUNT
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(THREADCOUNT) as executor:
    futures = [executor.submit(sum_list, big_list, i*SUBLISTSIZE, (i 1)*SUBLISTSIZE) for i in range(THREADCOUNT)] 
    big_sum = 0
    for res in concurrent.futures.as_completed(futures):   # return each result as soon as it is completed:
        big_sum  = res.result()
print(f"{THREADCOUNT} threads: sum={big_sum}, time={time.perf_counter()-start} sec")

Since I have a 4-cores CPU, I expected it to run 4 times faster. But it did not: it ran in about 1.8 seconds on my Ubuntu machine (on my Windows machine, with 8 cores, it ran even slower than the single-thread version: about 2.2 seconds).

Is there a way to use ThreadPoolExecutor (or another threads-based mechanism in Python) so that I can compute this function faster?

CodePudding user response：

Since I have a 4-cores CPU, I expected it to run 4 times faster.

ThreadPoolExecutor does not use multiple CPUs, so this isn't a sensible expectation. All worker threads are executing on the same CPU.

Threads generally only help when you're IO bound, large calculations are CPU bound.

To take advantage of multiple CPUs, you may want to look at a ProcessPoolExecutor instead. However, spawning/forking additional processes has a much higher overhead than threading, and any objects sent across the process boundary need to be picklable. Since the workers in your example code all reference the same big_list instance, it may not work well with multiprocessing either - it will be copying the entire list in each worker process, even though the worker only intends to use a small segment of the list.

You can refactor it to only send the data you need for the calculation (easy), or you can use shared memory (difficult).

CodePudding user response：

The problem is that the function you are trying to make faster is CPU-bound and the Python Global Interpreter Lock (GIL) prevents any performance gain from parallelisation of such code.

In Python, threads are wrapper around genuine OS thread. However, in order to avoid race conditions due to concurrent execution, only one thread can access the Python interpreter to execute bytecode at a time. This restriction is enforced by a lock called the GIL.

Thus in Python, true multithreading cannot be achieved and multiprocessing should be used instead. However, note that the GIL is not locked by IO operations (file reading, networking, etc.) and some library code (numpy, etc.) so these operations can still benefit from Python multithreading.

The function sum_list used neither of those operations so it will not benefit from Python multithreading.

You can use ProcessPoolExecutor to effectively get parallelism but this may copy the input list in your case. Multiprocessing is equivalent to launching multiple independent Python interpreters, thus the GIL's (one per intepreter) is not an issue anymore. However, multiprocessing incurs performance penalties during inter-process communication.