Why does subprocess.Popen run in parallel but subprocess.run does not


I would like to perform an expensive I/O operation in parallel across several subprocesses. However, when I execute the following code, it runs sequentially and takes forever to complete:

import subprocess
import csv

processes = []

with open('identifiers.csv', 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        id = row['id']
        # launch one external command per id
        p = subprocess.run(f'expensive_io_operation {id}', shell=True)
        processes.append(p)

for process in processes:
    process.wait()

This block of code runs sequentially.

However, if I replace subprocess.run with subprocess.Popen, then this block of code will execute in parallel.

I would like to understand why.

CodePudding user response:

This is clearly documented: subprocess.run blocks and waits for the subprocess to finish. If you want subprocesses to run in parallel, you need Popen; but then you also need to do the management of the subprocess objects which run otherwise takes care of for you behind the scenes. (And indeed, the CompletedProcess object returned by run doesn't have a wait method, because it has already completed.)

Tangentially, if speed is important, you really want to avoid the completely superfluous shell=True; on Unix-like systems, you then need to break up the command into a list of strings (or use shlex.split()).

        p = subprocess.Popen(['expensive_io_operation', id])

In this context, gluing the parts into a single string so that a shell is needed to take them apart again is obviously also unnecessary and slightly wasteful.
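
Putting those pieces together, a minimal sketch of the parallel version might look like this (it reuses expensive_io_operation and identifiers.csv from the question):

import csv
import subprocess

processes = []

with open('identifiers.csv', 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        # Popen returns immediately, so every command is started without waiting
        p = subprocess.Popen(['expensive_io_operation', row['id']])
        processes.append(p)

# Only now wait for all of them to finish
for p in processes:
    p.wait()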

If the expensive operation is I/O bound, you have to be careful about where you add parallel processing. Throwing more CPU at a problem which is already saturating your I/O bandwidth will only add congestion and make things slower.

Perhaps also look at multiprocessing, which can take care of running a controlled number of parallel tasks and collecting their results. With multiprocessing.Pool and a queue, you can run, say, 16 processes in parallel and divide the tasks from the queue between them according to their availability, without the nitty-gritty of managing the individual processes yourself.
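
A rough sketch of that approach, assuming a pool size of 16 and a hypothetical worker function run_one (both are illustrative, not prescribed):

import csv
import subprocess
from multiprocessing import Pool

def run_one(id):
    # Each worker blocks on its own command, so at most 16 commands run at once.
    return subprocess.run(['expensive_io_operation', id]).returncode

if __name__ == '__main__':
    with open('identifiers.csv', 'r', newline='') as infile:
        ids = [row['id'] for row in csv.DictReader(infile)]

    with Pool(16) as pool:
        # map hands out ids to whichever worker is free and collects the return codes
        return_codes = pool.map(run_one, ids)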

Finally, for I/O where many tasks involve a lot of waiting, perhaps instead look into async, which lets you implement (apparent) parallelism in a single process by dividing the work into non-blocking fragments; Python takes care of switching between them whenever they reach an await expression (this brief exposition is obviously abridged).
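
A rough sketch of that style, using asyncio's subprocess support with a semaphore to cap concurrency (the limit of 16 is again just an illustration):

import asyncio
import csv

async def run_one(sem, id):
    async with sem:  # limit how many commands are in flight at the same time
        proc = await asyncio.create_subprocess_exec('expensive_io_operation', id)
        await proc.wait()  # yields to other tasks while this command runs

async def main():
    with open('identifiers.csv', 'r', newline='') as infile:
        ids = [row['id'] for row in csv.DictReader(infile)]
    sem = asyncio.Semaphore(16)
    await asyncio.gather(*(run_one(sem, id) for id in ids))

asyncio.run(main())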
