I need to submit a job array of 100,000 tasks on Hoffman2, but I have a limit of 500 jobs per user. So, starting with job 500, I get the following error:
Unable to run job: job rejected: Only 500 jobs are allowed per user (current job count: 500). Job of user "XX" would exceed that limit. Exiting.
Right now my submission Bash script is:
#!/bin/bash
#$ -cwd
#$ -o test.joblog.LOOP.$JOB_ID
#$ -j y
#$ -l h_data=3G,h_rt=02:00:00
#$ -m n
#$ -t 1-100000
echo "STARTING TIME -- $(date) "
echo "$SGE_TASK_ID "
/u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh $SGE_TASK_ID
I tried to modify my script according to some Slurm documentation (in Slurm, appending % to the task range sets the number of simultaneously running tasks), but apparently that syntax does not work on Hoffman2:
#$ -cwd
#$ -o test.joblog.LOOP.$JOB_ID
#$ -j y
#$ -l h_data=3G,h_rt=02:00:00
#$ -m n
#$ -t 1-100000%500
echo "STARTING TIME -- $(date) "
echo "$SGE_TASK_ID "
/u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh $SGE_TASK_ID
Do you know how I can modify my submission script so that I always have 500 jobs running?
CodePudding user response:
Assuming that your submission command is visible in the ps output as /u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh, you could try something quick and dirty like:
#!/bin/bash
maxproc=490
while : ; do
    # ps -e shows only the program name, so list full command lines
    # with ps -eo args, and exclude the grep process from the count
    qproc=$(ps -eo args | grep '/u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh' | grep -cv grep)
    if [ "$qproc" -lt "$maxproc" ] ; then
        submission_code # with correct arguments
    fi
    sleep 10 # or any interval that you feel is appropriate
done
Of course, this only shows the principle; you may need to test whether other processes match the same pattern, and I also assumed that the submission code backgrounds itself. And many more details. But you get the idea.
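Alternatively (untested on Hoffman2, so treat it as a sketch), you could count your jobs by asking the scheduler itself rather than grepping ps, which would also count jobs whose qsub process has already exited:

#!/bin/bash
maxjobs=490
while : ; do
    # qstat -u prints a two-line header before the job list; skip it
    njobs=$(/u/systems/UGE8.6.4/bin/lx-amd64/qstat -u "$USER" | tail -n +3 | wc -l)
    if [ "$njobs" -lt "$maxjobs" ] ; then
        submission_code # with correct arguments
    fi
    sleep 10
done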
CodePudding user response:
A possible approach (free of busy waiting and ugliness of that kind) is to track the number of jobs on the client side, cap their total count at 500 and, each time any of them finishes, immediately start a new one to replace it. (This is, however, based on the assumption that the client script outlives the jobs.) Concrete steps:
1. Make the qsub tool block and (passively) wait for the completion of its remote job. Depending on the particular qsub implementation, it may have a -sync flag, or something more complex may be needed; a sketch of such a blocking call follows right after this list.
2. Keep exactly 500 (no more and, if possible, no fewer) waiting instances of qsub. This can be automated by using this answer or this answer and setting MAX_PARALLELISM to 500 there. qsub itself would be started from the do_something_and_maybe_fail() function.
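For SGE/UGE specifically, the blocking behavior is typically available as qsub -sync y (please verify against the man page on Hoffman2). A minimal sketch of the real do_something_and_maybe_fail(), assuming the same qsub path and job script as in the question:

do_something_and_maybe_fail() {
    # -sync y makes qsub wait until the job finishes and exit with the
    # job's exit status (standard Grid Engine behavior; verify locally)
    /u/systems/UGE8.6.4/bin/lx-amd64/qsub -sync y submit_job.sh "$@"
}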
Here’s a copy & paste of the Bash outline from the answers linked above, just to make this answer more self-contained, starting with a trivial and runnable harness / dummy example (with a sleep instead of a qsub -sync):
#!/bin/bash

set -euo pipefail

declare -ir MAX_PARALLELISM=500  # pick a limit
declare -i pid
declare -a pids=()

do_something_and_maybe_fail() {
  ### qsub -sync "$@" ... ###    # add the real thing here
  sleep $((RANDOM % 10))         # remove this :-)
  return $((RANDOM % 2 * 5))     # remove this :-)
}

for pname in some_program_{a..j}{00..59}; do  # 600 items
  if ((${#pids[@]} >= MAX_PARALLELISM)); then
    wait -p pid -n \
      && echo "${pids[pid]} succeeded" 1>&2 \
      || echo "${pids[pid]} failed with ${?}" 1>&2
    unset 'pids[pid]'
  fi

  do_something_and_maybe_fail &  # forking here
  pids[$!]="${pname}"
  echo "${#pids[@]} running" 1>&2
done

for pid in "${!pids[@]}"; do
  wait -n "$((pid))" \
    && echo "${pids[pid]} succeeded" 1>&2 \
    || echo "${pids[pid]} failed with ${?}" 1>&2
done
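Note that wait -p requires Bash 5.1 or newer; on an older Bash the PID-to-name bookkeeping would need a different mechanism, such as waiting on explicit PIDs in a loop.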
The first loop needs to be adjusted for the specific use case. An example follows, assuming that the right do_something_and_maybe_fail() implementation is in place and that one_command_per_line.txt is a list of arguments for qsub, one invocation per line, with an arbitrary number of lines. (The script could accept a file name as an argument or just read the commands from standard input, whatever works best.) The rest of the script would look exactly like the boilerplate above, keeping the number of parallel qsubs at MAX_PARALLELISM at most.
while read -ra job_args; do
  if ((${#pids[@]} >= MAX_PARALLELISM)); then
    wait -p pid -n \
      && echo "${pids[pid]} succeeded" 1>&2 \
      || echo "${pids[pid]} failed with ${?}" 1>&2
    unset 'pids[pid]'
  fi

  do_something_and_maybe_fail "${job_args[@]}" &  # forking here
  pids[$!]="${job_args[*]}"
  echo "${#pids[@]} running" 1>&2
done < /path/to/one_command_per_line.txt
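For the question's concrete case, one_command_per_line.txt could simply hold one submit_job.sh invocation per array task. A hypothetical way to generate it, assuming do_something_and_maybe_fail() passes its arguments through to qsub -sync y as sketched above:

# Hypothetical: one "submit_job.sh <task id>" line per array task
seq -f "submit_job.sh %g" 1 100000 > one_command_per_line.txt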