How to run parallelization on multiple files with multiple arguments in python?


I am not very familiar with Python, but I would like to run a function that reads and writes multiple files in parallel. Here is a minimal example:

from multiprocessing import Pool
import pandas as pd

def multiple(input_path, output_path, n):
    df = pd.read_csv(input_path, index_col=0)
    new_df = df.multiply(n)
    new_df.to_csv(output_path)

workers = 6
input_filenames = [f'input_{i}.csv' for i in range(1, 11)]
output_filenames = [f'output_{i}.csv' for i in range(1, 11)]

with Pool(workers) as pool:
    pool.map(multiple, ...)

With a plain for loop, I can do it like this:

for input_path, output_path in zip(input_filenames, output_filenames):
    multiple(input_path, output_path, 2)

How can I convert this to pool.map so that each input filename is matched with its corresponding output filename, and so that all three arguments (input_path, output_path, n) are fed to the function?

Thank you!

CodePudding user response:

You can zip the three lists (input_filenames, output_filenames, and a list of n values) into one list of tuples, then pass the function multiple and that list to pool.map. pool.map takes each tuple from the list, and a worker executes multiple on it. Since pool.map passes each tuple as a single argument, the function has to unpack it itself:

from multiprocessing import Pool
import pandas as pd

def multiple(inp_out_n):
    # pool.map passes one argument per call, so unpack the tuple here
    input_path, output_path, n = inp_out_n
    df = pd.read_csv(input_path, index_col=0)
    new_df = df.multiply(n)
    new_df.to_csv(output_path)

workers = 6
input_filenames = [f'input_{i}.csv' for i in range(1, 11)]
output_filenames = [f'output_{i}.csv' for i in range(1, 11)]
n = 10 * [2]
in_out_n = zip(input_filenames, output_filenames, n)

with Pool(workers) as pool:
    pool.map(multiple, in_out_n)

CodePudding user response:

Instead of using a pool, you can create one Process per file, point each at the multiple function, and pass the filenames as arguments. Note that this starts all processes at once, without the pool's cap on concurrent workers:

from multiprocessing import Process

processes = []
for input_path, output_path in zip(input_filenames, output_filenames):
    process = Process(target=multiple, args=(input_path, output_path, 2))
    processes.append(process)
    process.start()

for process in processes:
    process.join()