Home > Blockchain >  How to make GNU parallel exit with non-zero code in case of subprocess being SIGKILLed?
How to make GNU parallel exit with non-zero code in case of subprocess being SIGKILLed?

Time:10-17

I'm trying to make GNU parallel abort all processing when one of the subprocesses fails. The option --halt now,fail=1 does this correctly: if a subprocess exits with a non-zero exit code or if it is sigkilled (by the OOM killer), parallel will stop all jobs.

If parallel detects a non-zero exit code, it will itself exit with the same non-zero exit code. This lets the parent script detect that something went wrong.

The problem is that in the case where a subprocess is sigkilled, the return code from parallel itself is zero and the parent script has no way to tell that the process failed.

Here's a demonstration of the problem (single job to keep it simple but the same issue exists for multiple jobs taking different amounts of time etc).

# Create script that will SIGKILL itself.
cat << 'EOF' > selfkill.sh
echo Process $BASHPID will now terminate itself
kill -9 $BASHPID
EOF

# And another that will exit with code 1.
cat << 'EOF' > exit_one.sh
echo Process $BASHPID will now exit with code 1
exit 1
EOF

echo Running exit_one.sh via parallel with default error handling
parallel --joblog job.log bash exit_one.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running selfkill.sh via parallel with default error handling
parallel --joblog job.log bash selfkill.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running exit_one.sh via parallel with abort on error
parallel --joblog job.log --halt now,fail=1 bash exit_one.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running selfkill.sh via parallel with abort on error
parallel --joblog job.log --halt now,fail=1 bash selfkill.sh ::: 1
echo Exit code is $?
cat job.log

Output:

Running exit_one.sh via parallel with default error handling
Process 1521789 will now exit with code 1
Exit code is 1
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665764801.403       0.002      0       42      1       0       bash exit_one.sh 1

Running selfkill.sh via parallel with default error handling
Process 1521804 will now terminate itself
Exit code is 1
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665764801.573       0.003      0       42      0       9       bash selfkill.sh 1

Running exit_one.sh via parallel with abort on error
Process 1521819 will now exit with code 1
parallel: This job failed:
bash exit_one.sh 1
Exit code is 1
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665764801.730       0.003      0       42      1       0       bash exit_one.sh 1

Running selfkill.sh via parallel with abort on error
Process 1521834 will now terminate itself
parallel: This job failed:
bash selfkill.sh 1
Exit code is 0
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665764801.880       0.001      0       42      0       9       bash selfkill.sh 1

All good except the last case where I need a non-zero exit code since there was a problem.

The test job appears to exit with code 137 when it is sigkilled, so I would expect it to be detected. However parallel is somehow able to see that it was killed (good) but then exits with zero (not good). Maybe the --halt option causes parallel to consider itself successful because it's successfully halted all the other processes? Is there a workaround for this behaviour?

$ bash selfkill.sh 
Process 1485899 will now terminate itself
Killed
$ echo $?
137

This is on an Ubuntu 20.04 with GNU parallel 20161222 and bash 5.0.17.


Further context in case this is an XY problem and there's a better approach than using GNU parallel. Initially we weren't using GNU parallel. We had a number of background jobs and pipelines running in parallel via extensive use of &. But we couldn't find a way to detect when jobs had been sigkilled and the parent script would carry on much the same as it does in the example above.

CodePudding user response:

<sigh> I should have tried a more recent version before asking here.

Ubuntu 20.04 is stuck with an old version but I was able to experiment inside a Docker container with docker run --rm -it ubuntu:22.04 where I could install GNU parallel 20210822.

Running exit_one.sh via parallel with default error handling
Process 1179 will now exit with code 1
Exit code is 1
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665765327.768       0.100      0       39      1       0       bash exit_one.sh 1

Running selfkill.sh via parallel with default error handling
Process 1186 will now terminate itself
Exit code is 1
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665765328.292       0.092      0       39      0       9       bash selfkill.sh 1

Running exit_one.sh via parallel with abort on error
Process 1193 will now exit with code 1
parallel: This job failed:
bash exit_one.sh 1
Exit code is 1
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665765328.818       0.111      0       39      1       0       bash exit_one.sh 1

Running selfkill.sh via parallel with abort on error
Process 1200 will now terminate itself
parallel: This job failed:
bash selfkill.sh 1
Exit code is 137
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1665765329.321       0.085      0       39      0       9       bash selfkill.sh 1

The last case which was problematic before now exits with 137 which is great and consistent with what bash sees.

  • Related