Aggregate files with common prefix but don't repeat header in Bash


I have a bunch of files with different common prefixes that I would like to aggregate. Files sharing a prefix all have the same header, and I don't want that header to end up in my aggregate file more than once. I've written a function that takes the common prefix as an argument, finds all the files matching that prefix, prints all but the first line of each to the aggregate output file, collects the header from one of them, and prepends it to the output file with cat.

aggr () {
        outfile=${1}_aggregate.txt
        find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
        fl=`find . -name "${1}_*.txt" -print -quit`
        header=`head -n1 $fl`
        cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}

This generally works well, but when the find command takes a long time to run, I sometimes don't get a header in my output files. From my logs, I can see the following error after echoing the correct header string:

mv: cannot stat ‘tmp.txt’: No such file or directory

I'm not entirely sure what is happening, but it seems like the cat command that adds the header is sometimes executed before the find command has finished. The command then fails to produce the tmp.txt file, so the mv command never runs. I modified my function by adding wait after the find commands, but that did not resolve the issue. Any suggestions? I'm at a loss as to why this happens only with some files.

aggr () {
        outfile=${1}_aggregate.txt
        find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
        wait
        fl=`find . -name "${1}_*.txt" -print -quit`
        wait
        header=`head -n1 $fl`
        cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}

CodePudding user response:

I cannot say why cat seemingly succeeds while tmp.txt doesn't exist; the && should keep the mv from running unless cat returned success, and cat should always write at least the contents of outfile, even if some race condition exists in the header handling.
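For what it's worth, the && semantics are easy to check directly:

```shell
#!/bin/sh
# The command after && runs only when the command before it exits 0,
# so a failing cat should have prevented the mv entirely.
false && echo "after false"   # prints nothing
true  && echo "after true"    # prints "after true"
```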

That said, I can propose a modification to your script that should make it more robust, and that saves you the second invocation of find, which should also make it faster on a larger dataset:

aggr () {
    header=0
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -print0 | 
        while IFS= read -r -d '' file; do
            if [ "$header" -eq 0 ]; then
                header=1
                cp "$file" "$outfile"
            else
                tail -n +2 "$file" >> "$outfile"
            fi
        done
}
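A quick, self-contained way to sanity-check this approach is the sketch below. It restates the function with the loop variable quoted and the aggregate file excluded from the find pattern (since `${1}_aggregate.txt` would otherwise match `${1}_*.txt` on a re-run), and runs it against two fabricated files; the names data_1.txt and data_2.txt are just placeholders:

```shell
#!/bin/bash
# Sanity check in a scratch directory, using two fabricated input files.
set -eu
tmpdir=$(mktemp -d)
cd "$tmpdir"

aggr () {
    header=0
    outfile=${1}_aggregate.txt
    # Exclude the output file itself: "${1}_aggregate.txt" would
    # otherwise match the "${1}_*.txt" pattern on a second run.
    find . -name "${1}_*.txt" ! -name "*_aggregate.txt" -print0 |
        while IFS= read -r -d '' file; do
            if [ "$header" -eq 0 ]; then
                header=1
                cp "$file" "$outfile"            # first file keeps its header
            else
                tail -n +2 "$file" >> "$outfile" # later files drop theirs
            fi
        done
}

printf 'col1,col2\na,1\n' > data_1.txt
printf 'col1,col2\nb,2\n' > data_2.txt
aggr data
cat data_aggregate.txt   # one header line, then a,1 and b,2 in some order
```

Note that find's traversal order is unspecified, so the data rows may come out in either order; only the header position is guaranteed.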

Hope this helps!

CodePudding user response:

You don't need to invoke find twice for this. You don't need a temporary file either.

outfile=${1}_aggregate.txt \
find . -name "${1}_*.txt" -exec sh -c '
if ! test -f "$outfile"; then
  cp "$1" "$outfile"
  shift
fi
# like tail -n +2 but works with multiple files
awk "FNR != 1" "$@" /dev/null >>"$outfile"' sh {} +
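The awk filter deserves a note: FNR is awk's per-file line counter (it resets at every file boundary, unlike NR), so `FNR != 1` skips exactly the first line of each input file, and the trailing `/dev/null` argument keeps awk from hanging on stdin when the list is empty after the shift. A throwaway illustration, using hypothetical files a.txt and b.txt:

```shell
#!/bin/bash
# "FNR != 1" drops the first line of every file passed to awk,
# a multi-file equivalent of tail -n +2.
tmpdir=$(mktemp -d)
printf 'header\nrow1\n' > "$tmpdir/a.txt"
printf 'header\nrow2\n' > "$tmpdir/b.txt"
awk 'FNR != 1' "$tmpdir/a.txt" "$tmpdir/b.txt"   # prints row1 then row2
```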