bash wait for all processes to finish (doesn't work)

I have a directory with several sub-directories with names

1
2
3
4
backup_1
backup_2

I wrote a parallelized bash script to process the files in these folders. A minimal working example is as follows:

#!/bin/bash
P=`pwd`
task(){
    dirname=$(basename $dir)
    echo $dirname running >> output.out
    if [[ $dirname != "backup"* ]]; then
        sed -i "s/$dirname running/$dirname is good/" $P/output.out
    else
        sed -i "s/$dirname running/$dirname ignored/" $P/output.out
    fi
}

for dir in */; do
    ((i=i%8)); ((i++==0)) && wait
    task "$dir" &
done
wait
echo all done

The "wait" at the end of the script is supposed to wait for all processes to finish before proceeding to echo "all done". After all processes have finished, the output.out file should show

1 is good
2 is good
3 is good
4 is good
backup_1 ignored
backup_2 ignored

I am able to get this output if I set the script to run in serial with ((i=i%1)); ((i++==0)) && wait. However, if I run it in parallel with ((i=i%2)); ((i++==0)) && wait, I get something like

2 is good
1 running
3 running
4 is good
backup_1 running
backup_2 ignored

Can anyone tell me why wait is not working in this case?

I also know that GNU parallel can parallelize tasks in the same way. However, I don't know how to tell parallel to run this task on all sub-directories of the parent directory. It would be great if someone could provide a sample script that I can follow.

Many thanks, Jacek

CodePudding user response：

A literal port to GNU Parallel looks like this:

task(){
    dir="$1"
    P=$(pwd)
    dirname=$(basename "$dir")
    echo $dirname running >> output.out
    if [[ $dirname != "backup"* ]]; then
        sed -i "s/$dirname running/$dirname is good/" $P/output.out
    else
        sed -i "s/$dirname running/$dirname ignored/" $P/output.out
    fi
}
export -f task

parallel -j8 task ::: */
echo all done

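As others point out, you have race conditions when you run sed on the same file in parallel. wait itself is working. The problem is that GNU sed -i does not modify the file where it sits: it writes a new copy and renames it over output.out, so a line appended or a substitution made by another job in the meantime can be lost.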

To avoid race conditions you could do:

task(){
    dir="$1"
    dirname=$(basename "$dir")
    echo $dirname running
    if [[ $dirname != "backup"* ]]; then
        echo "$dirname is good" >&2
    else
        echo "$dirname ignored" >&2
    fi
}
export -f task

parallel -j8 task ::: */ >running.out 2>done.out
echo all done

You will end up with two files running.out and done.out.
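
The same idea (no two jobs writing to the same file) can also be sketched in plain bash, without GNU Parallel. This is only an illustration under a few assumptions: the demo directory tree, the temporary status directory and the max_jobs variable are made up for the example and are not part of the original script. Each job writes its own small status file, and the files are merged once after the final wait.

```shell
#!/bin/bash
# Sketch only: the demo tree and the "status" directory are assumptions
# made for illustration, not part of the original script.
demo=$(mktemp -d)
mkdir "$demo"/{1,2,3,4,backup_1,backup_2}
cd "$demo" || exit 1

status=$(mktemp -d)    # one private status file per job lives here
max_jobs=8

task() {
    local name
    name=$(basename "$1")
    # Each job writes only its own file, so no two jobs share a file.
    if [[ $name != backup* ]]; then
        echo "$name is good" > "$status/$name"
    else
        echo "$name ignored" > "$status/$name"
    fi
}

i=0
for dir in */; do
    # Before starting a new batch of max_jobs jobs, wait for the previous batch.
    if (( i++ % max_jobs == 0 )); then
        wait
    fi
    task "$dir" &
done
wait    # wait for the last batch

# Merge the results in one step, after every job has finished.
cat "$status"/* > output.out
rm -r "$status"
echo all done
```

Because the merge happens after the last wait and the glob expands in sorted order, output.out comes out in directory-name order regardless of which job finished first.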

If you really just want to ignore the dirs called backup*:

task(){
    dir="$1"
    dirname=$(basename "$dir")
    echo $dirname running
    echo "$dirname is good" >&2
}
export -f task

parallel -j8 task '{=/backup/ and skip()=}' ::: */ >running.out 2>done.out
echo all done
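
If, on the other hand, the single output.out file from the question has to stay, one option is to serialize every access to it with a lock. The following is a rough sketch, not a tested drop-in replacement: it assumes the flock utility from util-linux and GNU sed are installed, and the demo tree and the output.lock file name are invented for the example. The lock is taken on a separate file because sed -i replaces output.out itself.

```shell
#!/bin/bash
# Sketch only: assumes flock(1) from util-linux and GNU sed; the demo tree
# and the lock-file name are made up for illustration.
demo=$(mktemp -d)
mkdir "$demo"/{1,2,3,4,backup_1,backup_2}
cd "$demo" || exit 1

out=$PWD/output.out
lock=$PWD/output.lock   # separate lock file: sed -i replaces output.out itself
: > "$out"

task() {
    local name
    name=$(basename "$1")
    {
        flock 9                           # exclusive lock, held until fd 9 closes
        echo "$name running" >> "$out"
    } 9>"$lock"

    # ... the real per-directory work would go here, outside the lock ...

    local verdict="is good"
    [[ $name == backup* ]] && verdict="ignored"
    {
        flock 9
        # ^ and $ keep "1 running" from also matching "backup_1 running".
        sed -i "s/^$name running\$/$name $verdict/" "$out"
    } 9>"$lock"
}

i=0
for dir in */; do
    if (( i++ % 8 == 0 )); then wait; fi
    task "$dir" &
done
wait
echo all done
```

The lines end up in whatever order the jobs acquired the lock, so the file may need to be sorted afterwards if the order matters.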

Consider spending 20 minutes reading chapters 1 and 2 of https://doi.org/10.5281/zenodo.1146014. Your command line will love you for it.