I'm invoking a job with qsub myjob.pbs
. In there, I have some logic to run my experiments, which includes running torchrun
, a distributed utility for pytorch. In that command you can set the number of nodes and number of processes ( gpus) per node. Depending on the availability, I want to be able to invoke qsub with an arbitrary number of GPUs, so that both -l gpus=
and torchrun --nproc_per_node=
are set depending on the command line argument.
I tried, the following:
#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"
torchrun --standalone --nnodes=1 --nproc_per_node=$1 myscript.py
and invoked it like so:
qsub --pass "4" myjob.pbs
but I got the following error: ERROR: -l: gpus: expected valid integer, found '"$1"'
. Is there a way to pass the number of GPUs to the script so that the PBS directives can read them?
CodePudding user response:
The problem is that your shell sees PBS directives as comments, so it will not be able to expand arguments in this way. This means that the expansion of $1
will not be occur using:
#PBS -l "nodes=1:ppn=12:gpus=$1"
Instead, you can apply the -l gpus=
argument on the command line and remove the directive from your PBS script. For example:
#!/bin/sh
#PBS -l ncpus=12
set -eu
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node="${nproc_per_node}" \
myscript.py
Then just use a simple wrapper, e.g. run_myjob.sh
:
#!/bin/sh
set -eu
qsub \
-l gpus="$1" \
-v nproc_per_node="$1" \
myjob.pbs
Which should let you specify the number of gpus as a command-line argument:
sh run_myjob.sh 4