I'm curious if threading is the right approach for my use case. I'm working on a genetic algorithm, which needs to evaluate the fitness of genes 1...n. The evaluation of each gene is, for the most part, independent of the others, yet each gene will be passed through the same function, eval(gene).
My intention is that once all genes have been evaluated, I will sort by fitness and retain only the top x.
From this tutorial, it seems that I should be able to do the following, where the specifics of eval are out of scope for this question, but suppose each function call updates a common dictionary of the form {gene: fitness}:
import threading

threads = []
for i, gene in enumerate(gene_pool):
    # note: I call eval(gene) here, unlike the tutorial, which passes the bare name eval
    thread = threading.Thread(target=eval(gene), name=f"fitness gene{i}")
    thread.start()
    threads.append(thread)
for thread in threads:
    thread.join()
In the tutorial, I don't see the code actually invoking the function eval(), but rather just referencing its name, eval. I'm not sure if this will be problematic for my use case.
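For reference, here is the difference as I understand it (a sketch; eval here is my own fitness function, not Python's builtin eval):

# What I wrote: eval(gene) is called immediately, and its return value becomes target
threading.Thread(target=eval(gene), name=f"fitness gene{i}")
# What I think the tutorial intends: the thread itself calls eval(gene) when started
threading.Thread(target=eval, args=(gene,), name=f"fitness gene{i}")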
My first question is: Is this the right approach? Should I consider a different approach to threading?
I don't believe that I will need to account for race conditions or locks because, while every thread will update the same dictionary, the keys and values will be independent.
And my last question: Is multiprocessing generally a better bet? It seems that it's a bit higher-level, which might be ideal for someone new to parallelization.
CodePudding user response:
In Python, threading is constrained by the GIL, so it is very limited performance-wise. For IO-bound code (reading/writing files, network requests, ...), async is the way to go. But from what you describe, your code is CPU-bound (computing many things), so if you want it to run fast you need to circumvent the GIL. There are two main ways:
- multiprocessing (having multiple different Python processes in parallel)
- or calling code written in lower-level languages (Cython, C, ...), typically wrapped in a nice library
If you want something simple, stick to multiprocessing: at the start, create a pool whose size is the number of competing genes (N); then, at each iteration, submit N new tasks to the pool and wait for their results (the pool.map function); repeat as many times as you want.
I think it is the simplest way to get full parallelization, and it should give you a decent speedup.
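Here is a minimal sketch of that pattern, assuming a top-level fitness function (a pool can only run functions it can pickle, so it must be defined at module level) and hypothetical gene data:

import multiprocessing

def fitness(gene):
    # Placeholder CPU-bound evaluation; substitute your real eval logic.
    return sum(gene)

if __name__ == "__main__":
    gene_pool = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]  # hypothetical genes
    x = 2  # how many top genes to retain

    # One worker per gene; capping the size at os.cpu_count() is often more practical.
    with multiprocessing.Pool(processes=len(gene_pool)) as pool:
        # map blocks until every gene has been evaluated,
        # and returns the scores in the same order as gene_pool.
        scores = pool.map(fitness, gene_pool)

    # Sort genes by fitness and keep only the top x, as you described.
    ranked = sorted(zip(gene_pool, scores), key=lambda pair: pair[1], reverse=True)
    survivors = [gene for gene, score in ranked[:x]]
    print(survivors)

Note that with a pool, each worker returns its result instead of writing to a shared dictionary, since separate processes don't share memory the way threads do.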