I'm trying to make GNU parallel abort all processing when one of the subprocesses fails. The option --halt now,fail=1
does this correctly: if a subprocess exits with a non-zero exit code or if it is sigkilled (by the OOM killer), parallel
will stop all jobs.
If parallel
detects a non-zero exit code, it will itself exit with the same non-zero exit code. This lets the parent script detect that something went wrong.
The problem is that in the case where a subprocess is sigkilled, the return code from parallel itself is zero and the parent script has no way to tell that the process failed.
Here's a demonstration of the problem (single job to keep it simple but the same issue exists for multiple jobs taking different amounts of time etc).
# Create script that will SIGKILL itself.
cat << 'EOF' > selfkill.sh
echo Process $BASHPID will now terminate itself
kill -9 $BASHPID
EOF
# And another that will exit with code 1.
cat << 'EOF' > exit_one.sh
echo Process $BASHPID will now exit with code 1
exit 1
EOF
echo Running exit_one.sh via parallel with default error handling
parallel --joblog job.log bash exit_one.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running selfkill.sh via parallel with default error handling
parallel --joblog job.log bash selfkill.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running exit_one.sh via parallel with abort on error
parallel --joblog job.log --halt now,fail=1 bash exit_one.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running selfkill.sh via parallel with abort on error
parallel --joblog job.log --halt now,fail=1 bash selfkill.sh ::: 1
echo Exit code is $?
cat job.log
Output:
Running exit_one.sh via parallel with default error handling
Process 1521789 will now exit with code 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.403 0.002 0 42 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with default error handling
Process 1521804 will now terminate itself
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.573 0.003 0 42 0 9 bash selfkill.sh 1
Running exit_one.sh via parallel with abort on error
Process 1521819 will now exit with code 1
parallel: This job failed:
bash exit_one.sh 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.730 0.003 0 42 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with abort on error
Process 1521834 will now terminate itself
parallel: This job failed:
bash selfkill.sh 1
Exit code is 0
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.880 0.001 0 42 0 9 bash selfkill.sh 1
All good except the last case where I need a non-zero exit code since there was a problem.
The test job appears to exit with code 137 when it is sigkilled, so I would expect it to be detected. However parallel
is somehow able to see that it was killed (good) but then exits with zero (not good). Maybe the --halt
option causes parallel
to consider itself successful because it's successfully halted all the other processes? Is there a workaround for this behaviour?
$ bash selfkill.sh
Process 1485899 will now terminate itself
Killed
$ echo $?
137
This is on an Ubuntu 20.04 with GNU parallel 20161222 and bash 5.0.17.
Further context in case this is an XY problem and there's a better approach than using GNU parallel. Initially we weren't using GNU parallel. We had a number of background jobs and pipelines running in parallel via extensive use of &
. But we couldn't find a way to detect when jobs had been sigkilled and the parent script would carry on much the same as it does in the example above.
CodePudding user response:
<sigh> I should have tried a more recent version before asking here.
Ubuntu 20.04 is stuck with an old version but I was able to experiment inside a Docker container with docker run --rm -it ubuntu:22.04
where I could install GNU parallel 20210822.
Running exit_one.sh via parallel with default error handling
Process 1179 will now exit with code 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765327.768 0.100 0 39 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with default error handling
Process 1186 will now terminate itself
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765328.292 0.092 0 39 0 9 bash selfkill.sh 1
Running exit_one.sh via parallel with abort on error
Process 1193 will now exit with code 1
parallel: This job failed:
bash exit_one.sh 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765328.818 0.111 0 39 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with abort on error
Process 1200 will now terminate itself
parallel: This job failed:
bash selfkill.sh 1
Exit code is 137
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765329.321 0.085 0 39 0 9 bash selfkill.sh 1
The last case which was problematic before now exits with 137 which is great and consistent with what bash sees.