How can I replace a loop with multiprocessing / multithreading in Python?


I am currently encountering the following problem. I have a method_A() that loops over a given set A of strings. For each of these strings I have to execute another method_B() that returns a set B* of strings. All the returned sets B* should then be merged into a new set called results, since the sets B* can contain duplicates of certain strings.

I now want to make method_A() faster by using multiprocessing instead of a loop, so that method_B() is executed for all strings of set A at the same time.

Here is an example of what my code currently looks like:

# Method A that takes in a set of strings and returns the merged set of all sets B*
def method_A(set_A):
    # Initialize empty set to store results
    results = set()
    
    # Loop over each string in set A
    for string in set_A:

        # Execute method B
        set_B = method_B(string)
        
        # Merge set B into results set
        results = results.union(set_B)
    
    # Return the final results set
    return results

# Method B that takes in a string and returns a set of strings
def method_B(string):
    # Perform some operations on the string to generate a set of strings
    set_B = ...  # Generated set of strings (implementation elided)
    
    # Return the generated set
    return set_B

I have never used multiprocessing, but by googling my problem I found it suggested as a possible way to make my script faster. I tried to implement it myself with the help of ChatGPT, but I keep running into the problem that my resulting set is either empty or the multiprocessing isn't working at all. Maybe multithreading suits this case better, but I'm not sure. In general, I want to make method_A() faster, and I'm open to any solution that will do that!

I'm glad if you can help!

CodePudding user response:

You can replace your for loop with something like this:

First, add import concurrent.futures at the top of your script. Then:

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for set_B in executor.map(method_B, set_A):
            results = results.union(set_B)

This will create a pool of subprocesses, each running its own Python interpreter.

executor.map(method_B, set_A) means: for every element in set_A, execute method_B on it.

method_B will be executed in a subprocess and several calls to method_B will be executed in parallel.

Passing values to the subprocesses and getting the return values back is transparently handled by the executor.

More details can be found in Python's documentation: concurrent.futures
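
For completeness, here is a minimal self-contained sketch of method_A rewritten this way (the body of method_B is just a placeholder standing in for your real logic). The if __name__ == "__main__" guard matters: on platforms that spawn worker processes (e.g. Windows and macOS), leaving it out can make the pool fail or silently do nothing, which matches the symptoms you describe.

    import concurrent.futures

    # Placeholder: replace the body with your real logic that derives
    # a set of strings from the input string
    def method_B(string):
        return {string.upper(), string.lower()}

    def method_A(set_A):
        results = set()
        # Distribute the calls to method_B over a pool of worker processes;
        # executor.map() returns the result sets in input order
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for set_B in executor.map(method_B, set_A):
                results.update(set_B)
        return results

    if __name__ == "__main__":
        print(method_A({"foo", "Bar", "baz"}))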

CodePudding user response:

Solving this with threads would look something like this:

from threading import Thread

# Method A that takes in a set of strings and returns the merged set of all sets B*
def method_A(set_A):
    # Initialize empty set to store results
    results = set()
    threads = []
    
    # Start new thread for each string in set A
    for string in set_A:
        t = Thread(target=method_B, args=(string, results))
        t.start()
        threads.append(t)

    # Wait for all threads to finish
    for t in threads:
        t.join()
    
    # Return the final results set
    return results

# Method B that takes in a string and returns a set of strings
def method_B(string, results):
    # Perform some operations on the string to generate a set of strings
    set_B = ...  # Generated set of strings (implementation elided)
    
    # Update the shared results set in place. Rebinding with
    # results = results.union(set_B) would only change the local name
    # and leave the caller's set unchanged (and therefore empty).
    results.update(set_B)

Note that a thread does not return the function's value, so instead you can pass your results set to the thread and update it in place there. Hope this helps!
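
A variant that avoids the shared results set altogether is concurrent.futures.ThreadPoolExecutor, which has the same map() interface as the process pool from the first answer, so method_B can keep returning its set. Below is a minimal sketch, again with a placeholder body for method_B. Keep in mind that because of CPython's global interpreter lock, threads only give a real speed-up here if method_B is I/O-bound (e.g. waiting on network or disk); for CPU-bound work, the process pool above is the better fit.

    from concurrent.futures import ThreadPoolExecutor

    # Placeholder: replace the body with your real logic that derives
    # a set of strings from the input string
    def method_B(string):
        return {string, string[::-1]}

    def method_A(set_A):
        results = set()
        # Run method_B on a pool of threads and merge each returned set
        with ThreadPoolExecutor() as executor:
            for set_B in executor.map(method_B, set_A):
                results.update(set_B)
        return results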
