I need to submit a job array of 100,000 tasks on Hoffman2, but I have a limit of 500 jobs per user. So, starting with job 500, I get the following error:
Unable to run job: job rejected: Only 500 jobs are allowed per user (current job count: 500). Job of user "XX" would exceed that limit. Exiting.
Right now my submission Bash script is:
#!/bin/bash
#$ -cwd
#$ -o test.joblog.LOOP.$JOB_ID
#$ -j y
#$ -l h_data=3G,h_rt=02:00:00
#$ -m n
#$ -t 1-100000
echo "STARTING TIME -- $(date) "
echo "$SGE_TASK_ID "
/u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh $SGE_TASK_ID
I tried to modify my script according to some Slurm documentation (in Slurm, appending % to the task range sets the number of simultaneously running tasks), but apparently that syntax does not work on Hoffman2:
#$ -cwd
#$ -o test.joblog.LOOP.$JOB_ID
#$ -j y
#$ -l h_data=3G,h_rt=02:00:00
#$ -m n
#$ -t 1-100000%500
echo "STARTING TIME -- $(date) "
echo "$SGE_TASK_ID "
/u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh $SGE_TASK_ID
Do you know how I can modify my submission script so that I always have 500 jobs running?
CodePudding user response:
Assuming that your submission command is visible in the ps output as /u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh, you could try something quick and dirty like:
#!/bin/bash
maxproc=490
while : ; do
    # ps -e shows only the program name, so list full command lines
    # with ps -eo args, and exclude the grep process from the count
    qproc=$(ps -eo args | grep '/u/systems/UGE8.6.4/bin/lx-amd64/qsub submit_job.sh' | grep -cv grep)
    if [ "$qproc" -lt "$maxproc" ] ; then
        submission_code # with correct arguments
    fi
    sleep 10 # or any interval that you feel is appropriate
done
Of course, this only shows the principle; you may need to test whether other processes match the same pattern, and I also assumed that the submission code backgrounds itself. And many more details. But you get the idea.
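Alternatively (untested on Hoffman2, so treat it as a sketch), you could count your jobs by asking the scheduler itself rather than grepping ps, which would also count jobs whose qsub process has already exited:

#!/bin/bash
maxjobs=490
while : ; do
    # qstat -u prints a two-line header before the job list; skip it
    njobs=$(/u/systems/UGE8.6.4/bin/lx-amd64/qstat -u "$USER" | tail -n +3 | wc -l)
    if [ "$njobs" -lt "$maxjobs" ] ; then
        submission_code # with correct arguments
    fi
    sleep 10
done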
CodePudding user response:
A possible approach (free of busy waiting and ugliness of that kind) is to track the number of jobs on the client side, cap their total count at 500 and, each time any of them finishes, immediately start a new one to replace it. (This is, however, based on the assumption that the client script outlives the jobs.) Concrete steps:
1. Make the qsub tool block and (passively) wait for the completion of its remote job. Depending on the particular qsub implementation, it may have a -sync flag, or something more complex may be needed; a sketch of such a blocking call follows right after this list.
2. Keep exactly 500 (no more and, if possible, no fewer) waiting instances of qsub. This can be automated by using this answer or this answer and setting MAX_PARALLELISM to 500 there. qsub itself would be started from the do_something_and_maybe_fail() function.
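For SGE/UGE specifically, the blocking behavior is typically available as qsub -sync y (please verify against the man page on Hoffman2). A minimal sketch of the real do_something_and_maybe_fail(), assuming the same qsub path and job script as in the question:

do_something_and_maybe_fail() {
    # -sync y makes qsub wait until the job finishes and exit with the
    # job's exit status (standard Grid Engine behavior; verify locally)
    /u/systems/UGE8.6.4/bin/lx-amd64/qsub -sync y submit_job.sh "$@"
}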
Here’s a copy & paste of the Bash outline from the answers linked above, just to make this answer more self-contained, starting with a trivial and runnable harness / dummy example (with a sleep instead of a qsub -sync):
#!/bin/bash

set -euo pipefail

declare -ir MAX_PARALLELISM=500  # pick a limit
declare -i pid
declare -a pids=()

do_something_and_maybe_fail() {
  ### qsub -sync "$@" ... ###    # add the real thing here
  sleep $((RANDOM % 10))         # remove this :-)
  return $((RANDOM % 2 * 5))     # remove this :-)
}

for pname in some_program_{a..j}{00..59}; do  # 600 items
  if ((${#pids[@]} >= MAX_PARALLELISM)); then
    wait -p pid -n \
      && echo "${pids[pid]} succeeded" 1>&2 \
      || echo "${pids[pid]} failed with ${?}" 1>&2
    unset 'pids[pid]'
  fi

  do_something_and_maybe_fail &  # forking here
  pids[$!]="${pname}"
  echo "${#pids[@]} running" 1>&2
done

for pid in "${!pids[@]}"; do
  wait -n "$((pid))" \
    && echo "${pids[pid]} succeeded" 1>&2 \
    || echo "${pids[pid]} failed with ${?}" 1>&2
done
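Note that wait -p requires Bash 5.1 or newer; on an older Bash the PID-to-name bookkeeping would need a different mechanism, such as waiting on explicit PIDs in a loop.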
The first loop needs to be adjusted for the specific use case. An example follows, assuming that the right do_something_and_maybe_fail() implementation is in place and that one_command_per_line.txt is a list of arguments for qsub, one invocation per line, with an arbitrary number of lines. (The script could accept a file name as an argument or just read the commands from standard input, whatever works best.) The rest of the script would look exactly like the boilerplate above, keeping the number of parallel qsubs at MAX_PARALLELISM at most.
while read -ra job_args; do
  if ((${#pids[@]} >= MAX_PARALLELISM)); then
    wait -p pid -n \
      && echo "${pids[pid]} succeeded" 1>&2 \
      || echo "${pids[pid]} failed with ${?}" 1>&2
    unset 'pids[pid]'
  fi

  do_something_and_maybe_fail "${job_args[@]}" &  # forking here
  pids[$!]="${job_args[*]}"
  echo "${#pids[@]} running" 1>&2
done < /path/to/one_command_per_line.txt
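For the question's concrete case, one_command_per_line.txt could simply hold one submit_job.sh invocation per array task. A hypothetical way to generate it, assuming do_something_and_maybe_fail() passes its arguments through to qsub -sync y as sketched above:

# Hypothetical: one "submit_job.sh <task id>" line per array task
seq -f "submit_job.sh %g" 1 100000 > one_command_per_line.txt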