parallel execution of script on cluster for multiple inputs

Time:04-27

I have a script pytonscript.py that I want to run on 500 samples. I have 50 CPUs available and want to run the script in parallel, one CPU per sample, so that 50 samples are always running at once. Any ideas on how to set this up without typing 500 command lines with the different inputs? I know how to loop over the samples, but not how to keep 50 of them running in parallel. I guess GNU parallel is a way?

Input samples in folder samples:

sample1 sample2 sample3 ... sample500

pytonscript.py -i samples/sample1.sam.bz2 -o output_folder

CodePudding user response:

What about GNU xargs?

printf '%s\0' samples/sample*.sam.bz2 |
xargs -0 -n 1 -P 50 pytonscript.py -o output_dir -i

This launches a new instance of the python script for each file, running up to 50 of them concurrently. Because -i is the last argument on the xargs command line, each file name is appended right after it.
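To see the pattern work without the real script, here is a self-contained demo that substitutes echo for pytonscript.py (the samples directory and dummy files are created just for illustration):

```shell
# Demo of the same xargs pattern with a stand-in command (echo instead of
# pytonscript.py). printf emits each file name NUL-terminated; xargs -0
# splits on NUL, -n 1 passes one file per invocation, and -P 50 caps the
# number of concurrent processes at 50.
mkdir -p samples
touch samples/sample{1..5}.sam.bz2   # a few dummy input files
printf '%s\0' samples/sample*.sam.bz2 |
    xargs -0 -n 1 -P 50 echo processing
```

Each line of output corresponds to one invocation; with the real script, each invocation would be one single-CPU run of pytonscript.py.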

If the wildcard glob expansion isn't specific enough, you can use bash's extglob: shopt -s extglob, then the pattern samples/sample+([0-9]).sam.bz2 matches only names whose suffix is all digits.
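Since the question mentions GNU parallel: assuming it is installed, the same 50-job cap can be expressed as a one-liner (a sketch, not the only way; {} is replaced by each input file):

```shell
# GNU parallel equivalent (assumes the 'parallel' program is installed):
# -j 50 runs at most 50 jobs at a time; ::: supplies the argument list,
# and {} expands to each matched file in turn.
parallel -j 50 pytonscript.py -i {} -o output_folder ::: samples/sample*.sam.bz2
```

Compared with xargs, GNU parallel adds conveniences such as --joblog for a per-sample run log and --resume for restarting after a crash.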
