I have a list of search queries to build a dataset:
classes = [...]
There are 100 search queries in this list.
Basically, I divide the list into 4 chunks of 25 queries.
def divide_chunks(l, n):
    # Yield successive n-sized slices of l
    for i in range(0, len(l), n):
        yield l[i:i + n]

classes = list(divide_chunks(classes, 25))
And below, I've created a function that downloads queries from each chunk iteratively:
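def download_chunk(n):
    # Download every query in chunk n, one after another
    for label in classes[n]:
        try:
            downloader.download(label, limit=1000, output_dir='dataset', adult_filter_off=True, force_replace=False, verbose=True)
        except Exception:
            # Skip queries that fail so the rest of the chunk still runs
            pass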
However, I want to run all 4 chunks concurrently. In other words, I want to run 4 separate iterative operations at the same time. I tried both the threading and multiprocessing approaches, but neither works:
process_1 = Process(target=download_chunk(0))
process_1.start()
process_2 = Process(target=download_chunk(1))
process_2.start()
process_3 = Process(target=download_chunk(2))
process_3.start()
process_4 = Process(target=download_chunk(3))
process_4.start()
process_1.join()
process_2.join()
process_3.join()
process_4.join()
###########################################################
thread_1 = threading.Thread(target=download_chunk(0)).start()
thread_2 = threading.Thread(target=download_chunk(1)).start()
thread_3 = threading.Thread(target=download_chunk(2)).start()
thread_4 = threading.Thread(target=download_chunk(3)).start()
CodePudding user response:
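You're running download_chunk outside of the thread/process. Writing target=download_chunk(0) calls the function immediately in the main program and passes its return value (None) as the target, so the chunks run one after another before any thread or process starts. You need to provide the function and its arguments separately in order to delay execution. For example: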
Process(target=download_chunk, args=(0,))
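The same fix applies to the threading attempt. Below is a minimal sketch of it, assuming a stand-in download_chunk that only records its argument instead of downloading anything (the results list and the lock are illustrative names, not part of the original code). Note also that Thread(...).start() returns None, so the Thread object has to be kept in order to join it later.

```python
import threading

results = []             # illustrative: records which chunks ran
lock = threading.Lock()  # protects the shared list

def download_chunk(n):
    # Stand-in for the real download_chunk; it only records the chunk index
    with lock:
        results.append(n)

# Pass the function itself plus its arguments; do not call it here
threads = [threading.Thread(target=download_chunk, args=(n,)) for n in range(4)]

for t in threads:
    t.start()   # all four threads are started before any is waited on
for t in threads:
    t.join()    # wait for every chunk to finish
```

Swapping threading.Thread for multiprocessing.Process gives the process-based version, which should then be started under an if __name__ == '__main__': guard.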
Refer to the multiprocessing docs for more information about using the multiprocessing.Process class.
For this use case, I would suggest using multiprocessing.Pool:
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(download_chunk, range(4))
It handles the work of creating, starting, and later joining the 4 worker processes. The pool hands each argument from the iterable, which is range(4) in this case, to one call of download_chunk, so the 4 chunks are processed in parallel across the workers.
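Since downloading is mostly waiting on the network, a thread pool may work just as well here. The sketch below swaps in multiprocessing.pool.ThreadPool, which offers the same map interface as Pool, and uses a hypothetical stand-in download_chunk that returns a string instead of downloading. run_all is an illustrative helper name, not something from the original code.

```python
from multiprocessing.pool import ThreadPool

def download_chunk(n):
    # Hypothetical stand-in: the real function would download chunk n
    return f"chunk {n} done"

def run_all(num_chunks=4):
    # One worker thread per chunk; map blocks until every call has returned
    with ThreadPool(num_chunks) as pool:
        return pool.map(download_chunk, range(num_chunks))

if __name__ == '__main__':
    print(run_all())
```

map returns the results in the same order as the input arguments, regardless of which worker finishes first.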
More info about multiprocessing.Pool can be found in the docs.