I have a CSV file with 10,000 rows; each row contains a link, and I want to download some info for each link. As that's a time-consuming task, I manually split it into 4 Python scripts, each one working on 2,500 rows. After that I open 4 terminals and run each of the scripts.
However, I wonder if there's a more efficient way of doing that. Up to now I have 4 .py scripts that I launch by hand. What happens if I have to do the same but with 1,000,000 rows? Should I manually create, for example, 50 scripts, each downloading the info for its own slice of rows? I hope I managed to explain myself :)
Thanks!
CodePudding user response:
You don't need to do any manual splitting – set up a multiprocessing.Pool() with the number of workers you want processing your data, and have a function do your work for each item. A simplified example:
import multiprocessing

# This function is run in a separate process
def do_work(line):
    return f"{line} is {len(line)} characters long. This result brought to you by {multiprocessing.current_process().name}"

def main():
    work_items = [f"{2 ** i}" for i in range(1_000)]  # You'd read these from your file
    with multiprocessing.Pool(4) as pool:
        for result in pool.imap(do_work, work_items, chunksize=20):
            print(result)

if __name__ == "__main__":
    main()
This has (up to) 4 processes working on your data, and for efficiency each worker is handed 20 tasks at a time (chunksize=20). If you don't need the results to be in order, use the faster imap_unordered.
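As a rough sketch of how this might map onto the original problem – assuming the CSV is called links.csv, has a column named url, and that urllib is enough for the download (all of these are assumptions for illustration, not part of the answer above):

import csv
import multiprocessing
import urllib.request

# Hypothetical worker: download one URL and report how much data came back.
def do_work(url):
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()
    return f"{url}: {len(body)} bytes ({multiprocessing.current_process().name})"

def main():
    # Read all the links from the CSV instead of splitting the file by hand.
    with open("links.csv", newline="") as f:
        work_items = [row["url"] for row in csv.DictReader(f)]

    with multiprocessing.Pool(4) as pool:
        for result in pool.imap_unordered(do_work, work_items, chunksize=20):
            print(result)

if __name__ == "__main__":
    main()

Whether you have 10,000 or 1,000,000 rows, only the size of work_items changes; the pool keeps the same 4 worker processes busy until the list is exhausted.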
CodePudding user response:
You can take a look at https://docs.python.org/3/library/asyncio-task.html to make the download tasks asynchronous.
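A minimal sketch of that approach, assuming the third-party aiohttp library is installed and reusing the hypothetical links.csv file with a url column (the concurrency limit of 20 is also just an illustrative choice):

import asyncio
import csv

import aiohttp  # third-party: pip install aiohttp

# Limit how many downloads run at the same time.
CONCURRENCY = 20

async def fetch(session, semaphore, url):
    async with semaphore:
        async with session.get(url) as response:
            body = await response.read()
            return f"{url}: {len(body)} bytes"

async def main():
    with open("links.csv", newline="") as f:
        urls = [row["url"] for row in csv.DictReader(f)]

    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))
    for result in results:
        print(result)

if __name__ == "__main__":
    asyncio.run(main())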
CodePudding user response:
Use threads to run multiple downloads concurrently within a single interpreter – downloading is I/O-bound, so Python's GIL is not a bottleneck here (https://realpython.com/intro-to-python-threading).
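A minimal sketch with concurrent.futures.ThreadPoolExecutor; the file name, column name, worker count, and use of urllib are assumptions for illustration:

import csv
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical download function: fetch one URL and return its size.
def download(url):
    with urllib.request.urlopen(url, timeout=30) as response:
        return url, len(response.read())

def main():
    with open("links.csv", newline="") as f:
        urls = [row["url"] for row in csv.DictReader(f)]

    # Threads share one interpreter, but they can wait on many downloads at once.
    with ThreadPoolExecutor(max_workers=16) as executor:
        for url, size in executor.map(download, urls):
            print(f"{url}: {size} bytes")

if __name__ == "__main__":
    main()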