Set number of gpus in PBS script from command line

Time:07-30

I'm submitting a job with qsub myjob.pbs. The script contains some logic to run my experiments, including a call to torchrun, PyTorch's distributed launch utility. That command lets you set the number of nodes and the number of processes (GPUs) per node. Depending on availability, I want to be able to invoke qsub with an arbitrary number of GPUs, so that both -l gpus= and torchrun --nproc_per_node= are set from a command-line argument.

I tried the following:

#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"

torchrun --standalone --nnodes=1 --nproc_per_node=$1  myscript.py

and invoked it like so:

qsub --pass "4" myjob.pbs

but I got the following error: ERROR: -l: gpus: expected valid integer, found '"$1"'. Is there a way to pass the number of GPUs to the script so that the PBS directives can read them?

Answer:

The problem is that #PBS directives are comments as far as the shell is concerned. qsub reads them as plain text when the job is submitted, before the script ever runs, so no shell expansion happens on those lines. This means that $1 is never expanded in:

#PBS -l "nodes=1:ppn=12:gpus=$1"
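
To make the difference concrete, here is a minimal sketch that needs no PBS installation. The file name demo.pbs is hypothetical, and grep is only a stand-in for the way qsub scans a script for directive lines; it is not how qsub is implemented.

```shell
#!/bin/sh
# Write a tiny job script (hypothetical name: demo.pbs) into a temp directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.pbs" <<'EOF'
#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"
echo "shell expanded: gpus=$1"
EOF

# The shell skips the #PBS comment and expands $1 only on the echo line.
shell_view=$(sh "$tmpdir/demo.pbs" 4)

# Assumption: grep stands in for a directive scanner that reads the file as text.
directive_view=$(grep '^#PBS' "$tmpdir/demo.pbs")

echo "$shell_view"
echo "$directive_view"
rm -rf "$tmpdir"
```

The first line printed contains the expanded value, while the directive line still carries a literal $1, which matches the error reported in the question.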

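Instead, you can pass -l gpus= on the qsub command line and remove that directive from your PBS script. The script can then read the GPU count from an environment variable (nproc_per_node here), which qsub's -v option sets for the job. For example: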

#!/bin/sh
#PBS -l ncpus=12
set -eu

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node="${nproc_per_node}" \
    myscript.py

Then just use a simple wrapper, e.g. run_myjob.sh:

#!/bin/sh
set -eu

qsub \
    -l gpus="$1" \
    -v nproc_per_node="$1" \
    myjob.pbs

This should let you specify the number of GPUs as a command-line argument:

sh run_myjob.sh 4
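
If you want the wrapper to reject a missing or malformed argument before anything reaches the scheduler, a small check along these lines may help. This is only a sketch: the helper names is_positive_int and qsub_command are hypothetical, and qsub_command prints the command rather than running it so it can be tried on a machine without PBS.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if "$1" is a positive integer.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -gt 0 ]
}

# Hypothetical helper: print the qsub command line for a given GPU count.
# In a real wrapper, replace the echo with the actual qsub call.
qsub_command() {
    if ! is_positive_int "${1:-}"; then
        echo "usage: run_myjob.sh NUM_GPUS" >&2
        return 2
    fi
    echo "qsub -l gpus=$1 -v nproc_per_node=$1 myjob.pbs"
}
```

Calling qsub_command 4 prints the same command the wrapper above would run, while qsub_command abc prints the usage message to stderr and returns a non-zero status.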