Multiprocessing and Threading in Python

Time:01-15

I'm trying to use multiprocessing in Python, but I think I may not have understood it properly.

To start with, I have a DataFrame that contains texts as strings, on which I want to apply a regex. The code looks as follows:

import os
import re
import multiprocess
from threading import Thread

# `data` is the pandas DataFrame described above (loaded elsewhere)
def clean_qa():
    for index, row in data.iterrows():
        data["qa"].loc[index] = re.sub(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]", "",  str(data["qa"].loc[index]))

if __name__ == '__main__':
    threads = []
    
    for i in range(os.cpu_count()):
        threads.append(Thread(target=clean_qa))
        
    for thread in threads:
        thread.start()
        
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    processes = []

    for i in range(os.cpu_count()):
        processes.append(multiprocess.Process(target=clean_qa))
        
    for process in processes:
        process.start()
        
    for process in processes:
        process.join()
    

When I run the body of "clean_qa" not as a function but simply by executing the for loop, everything works fine and it takes about 3 minutes.

However, when I use multiprocessing or threading, the execution takes about 10 minutes and the text is not cleaned, so the DataFrame is the same as before.

Hence my question: what did I do wrong, why does it take longer, and why does nothing happen to the DataFrame?

Thank you very much!

CodePudding user response:

This is slightly beside the point (though my comments in the original post do address the actual points), but since you're working with a Pandas dataframe, you really never want to loop over it by hand.

Looks like all you actually want here is just:

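import re

r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_qa():
    # Pass regex=True explicitly: recent pandas versions default to
    # regex=False, which does not accept a compiled pattern.
    data["qa"] = data["qa"].str.replace(r, "", regex=True)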

to let pandas deal with the looping. (This does not run in parallel, but it avoids the per-row overhead of iterrows() and .loc assignment.)
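
To sanity-check what that pattern actually removes before running it over the whole column, here is a minimal sketch using only `re`. The sample strings are made up for illustration and are not from the asker's data.

```python
import re

# Same pattern as in the answer above, compiled once.
r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

# Hypothetical sample strings (assumption: not taken from the real DataFrame).
samples = [
    "Hello, world!",
    "-----Operator-----Thanks [laughter] all.",
]

# Apply the substitution to each string, as str.replace would do per cell.
cleaned = [r.sub("", text) for text in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that removing a bracketed span leaves the whitespace on both sides in place, so the second sample ends up with two consecutive spaces.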

CodePudding user response:

As for threading, an answer to a related question has this Python 3.9 example:

# Example by Xiddoc, from the question linked below
from threading import Thread
from time import sleep

# Here is a function that uses the sleep() function. If you called this directly, it would block the main thread until it returns
def my_independent_function():
    print("Starting to sleep...")
    sleep(10)
    print("Finished sleeping.")

# Make a new thread which will run this function
t = Thread(target=my_independent_function)
# Start it in parallel
t.start()

# You can see that we can still execute other code while the other function is running
for i in range(5):
    print(i)
    sleep(1)

(Taken from this question: Can I run a coroutine in python independently from all other code?)

Also, you probably shouldn't use threading and multiprocessing at the same time; pick one of the two.
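
On the multiprocessing half of the question: each process works on its own copy of `data`, so changes made in a child never reach the parent, and starting one process per core with the same target repeats the whole job instead of splitting it. Below is a minimal sketch of the usual alternative, assuming the cleaning is rewritten as a function that takes one string and returns the result. `clean_text` and `clean_parallel` are hypothetical names, and the standard-library `multiprocessing.Pool` is used here rather than the `multiprocess` package from the question.

```python
import re
from multiprocessing import Pool

# Pattern copied from the question, compiled once at import time.
PATTERN = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")


def clean_text(text):
    """Clean one value and RETURN the result instead of mutating shared state."""
    return PATTERN.sub("", str(text))


def clean_parallel(texts, workers=2):
    """Distribute the strings over worker processes and collect the results.

    pool.map splits `texts` into chunks, so each worker handles only part
    of the input, and the cleaned strings come back in the original order.
    """
    with Pool(processes=workers) as pool:
        return pool.map(clean_text, texts)
```

Called under the `if __name__ == '__main__':` guard, something like `data["qa"] = clean_parallel(data["qa"].tolist(), workers=os.cpu_count())` would write the results back in the parent process. Whether that beats the single-process `str.replace` depends on the data size, since every string has to be pickled to a worker and back.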

If you'd like to read more general information about multiprocessing/threading in Python, you can see this post: How can I use threading in Python?
