Multithreading not improving results in Python?

I am applying multithreading to a Python script to improve its performance, but I don't understand why the execution time does not improve. Here is a code snippet of my implementation.

from queue import Queue
from threading import Thread
from datetime import datetime
import time



class WP_TITLE_DOWNLOADER(Thread):
    def __init__(self, queue,name):
        Thread.__init__(self)
        self.queue = queue
        self.name = name
 
    
    def download_link(self, linkss):
        # Test function standing in for the real work.
        # Later, some processing will be done on this list, on the CPU.
        for idx, link in enumerate(linkss):
            # time.sleep(0.01)
            test.append(idx)

        for sub_list in testv:
            # list.append returns None, so its result is not assigned
            sub_list.append(2)

    def run(self):
        while True:
            # Get the work from the queue
            linkss = self.queue.get()
            try:
                 self.download_link(linkss)
            finally:
                 self.queue.task_done()                


       
######with threading

testv=[[i for i in range(5000)] for j in range(20)]
links_list=[[i for i in range(100000)] for j in range(20)]
test=[]
start_time =time.time()
queue = Queue()
thread_count=8
for x in range(thread_count):
    worker = WP_TITLE_DOWNLOADER(queue,str(x))
    # Setting daemon to True will let the main thread exit even though the workers are blocking
    worker.daemon = True
    worker.start()




# Fill up the queue for the threads
for links in links_list: 
        queue.put(links)
        
        
        
##print("queing time={}".format(time.time()-start_time))        
#print(test)
#wait for all to end
j_time =time.time()
queue.join()
t_time = time.time()-start_time
print("With threading time={}".format(t_time))
           
    



#############without threading,  
###following function is same as the one in threading. 
test=[]
def download_link(links1):
    for idx, link in enumerate(links1):
        # time.sleep(0.01)
        test.append(idx)

    for sub_list in testv:
        # list.append returns None, so its result is not assigned
        sub_list.append(2)



start_time =time.time()
for links in links_list: 
        download_link(links)
       
        
t_time = time.time()-start_time
print("without threading time={}".format(t_time))

With threading time=0.564049482345581
without threading time=0.13332700729370117

NOTE: When I uncomment time.sleep, the threaded version is faster than the non-threaded one. My test case is a list of lists, each with tens of thousands of elements. The idea of using multithreading is that, instead of processing one list at a time, multiple lists can be processed simultaneously, which should reduce the overall time. But the results are not as expected.

CodePudding user response：

Python (CPython) has a concept called the GIL (Global Interpreter Lock). This lock ensures that only one thread executes Python bytecode at a time. Therefore, even if you spawn multiple threads to process multiple lists, only one thread is processing at any given moment. You can try multiprocessing for parallel execution.
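
A minimal sketch of what that could look like with the standard multiprocessing module. The function count_even and the data sizes are hypothetical stand-ins for the real per-list work. Each worker returns its result instead of appending to a global list, because separate processes do not share the question's test list.

```python
from multiprocessing import Pool


def count_even(links):
    """Hypothetical CPU-bound stand-in: count the even values in one list."""
    return sum(1 for value in links if value % 2 == 0)


if __name__ == "__main__":
    # Same shape as the question's data: 20 lists of 100000 items each.
    links_list = [list(range(100_000)) for _ in range(20)]

    # Each list is pickled, sent to a worker process, and processed there.
    with Pool(processes=4) as pool:
        counts = pool.map(count_even, links_list)

    print(counts[:3])
```

For work this light, the cost of starting processes and pickling the lists may outweigh the gain, so it is worth timing both versions.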

CodePudding user response：

Threading is awkward in Python because of the GIL (Global Interpreter Lock). Threads have to compete for the lock before they can run Python code. Threading in Python is only beneficial when the code inside the thread does not need to hold the GIL, i.e. when offloading computations to a hardware accelerator, when doing I/O-bound work, or when calling a non-Python library that releases the GIL. For true parallelism in Python, use multiprocessing instead. It's a bit more cumbersome, as you have to explicitly share your variables or duplicate them, and often serialize your communications.
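
This would also explain why uncommenting time.sleep in the question makes the threaded version faster: sleeping releases the GIL, much like waiting on I/O does. A rough sketch to see the difference, assuming a standard CPython build with the GIL. The helper names fake_io, cpu_work and timed_threads are made up for illustration.

```python
import threading
import time


def fake_io(delay):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(delay)


def cpu_work(n):
    """Pure-Python loop; the running thread holds the GIL while it computes."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_threads(target, arg, count):
    """Run `count` threads calling target(arg) and return the elapsed seconds."""
    threads = [threading.Thread(target=target, args=(arg,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # The four sleeps overlap, so this takes roughly one sleep, not four.
    print("4 sleeping threads:", timed_threads(fake_io, 0.2, 4))

    # The four loops take turns holding the GIL instead of running in parallel.
    print("4 computing threads:", timed_threads(cpu_work, 2_000_000, 4))

    start = time.perf_counter()
    for _ in range(4):
        cpu_work(2_000_000)
    print("same work, one thread:", time.perf_counter() - start)
```

The exact numbers depend on the machine, but the computing threads should not come out meaningfully faster than the single-threaded loop.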

CodePudding user response：

As a general rule (there will always be exceptions) multithreading is best suited to IO-bound processing (this includes networking). Multiprocessing is well suited to CPU-intensive activities.

Your testing is therefore flawed.

Your intention is clearly to do some kind of web crawling, but that's not happening in your test code, which means that your test is CPU-intensive and therefore not suitable for multithreading. Once you've added your networking code, you may find that matters have improved, provided you've used suitable techniques.

Take a look at ThreadPoolExecutor in concurrent.futures. You may find it particularly useful because you can switch to multiprocessing by simply replacing ThreadPoolExecutor with ProcessPoolExecutor, which will make your experiments easier to compare.
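
A small sketch of that swap, assuming a hypothetical CPU-bound function sum_of_squares in place of the real per-list work. The helper run_with is also made up here; it takes the executor class as a parameter so the same code runs with either pool.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time


def sum_of_squares(links):
    """Hypothetical CPU-bound stand-in for processing one list."""
    total = 0
    for idx, _ in enumerate(links):
        total += idx * idx
    return total


def run_with(executor_cls, batches, workers=4):
    """Process every batch with the given executor class and time it."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(sum_of_squares, batches))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    batches = [list(range(100_000)) for _ in range(20)]
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, elapsed = run_with(cls, batches)
        print(f"{cls.__name__}: {elapsed:.3f}s")
```

Only the class passed to run_with changes between the two runs, so any timing difference comes from threads versus processes rather than from the test code.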