I have a bunch of files with different common prefixes that I would like to aggregate. Files with the same prefix all share the same header line, and I don't want that header to end up in my aggregate file more than once. I've written a function that takes the common prefix as an argument, finds all files matching that prefix, prints all but the first line of each to the aggregate output file, grabs the header from one of them, and prepends it to the output file with cat.
aggr () {
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
    fl=`find . -name "${1}_*.txt" -print -quit`
    header=`head -n1 $fl`
    cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}
This generally works well, but when the find command takes a long time to run, I sometimes don't get a header in my output files. From my logs, I can see the following error after echoing the correct header string:
mv: cannot stat ‘tmp.txt’: No such file or directory
I'm not entirely sure what is happening, but it seems like the cat command that adds the header is sometimes executed before find has finished. The command then fails to produce the tmp.txt file, so the subsequent mv never happens. I modified my function by adding wait after the find commands, but it did not resolve the issue. Any suggestions? I'm at a loss as to why this happens with only some files.
aggr () {
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
    wait
    fl=`find . -name "${1}_*.txt" -print -quit`
    wait
    header=`head -n1 $fl`
    cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}
CodePudding user response:
I cannot comment as to why cat seemingly succeeds while tmp.txt doesn't exist; the && should not execute the mv unless cat returned successfully, and cat should always write at least the contents of outfile, even if some kind of race condition exists with the header handling...
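As an aside, the && short-circuit is easy to watch in isolation (the file names below are just placeholders):

cat - no-such-file.txt <<< "header" > tmp.txt && mv tmp.txt out.txt
echo $?       # non-zero: cat failed on the missing file, so mv was skipped
ls tmp.txt    # tmp.txt exists anyway: the shell creates it via the > redirection before cat runs

Note that last point: the redirection creates tmp.txt before cat even starts, which makes the "cannot stat" error all the stranger.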
That said, I can propose a modification to your script that might make it more robust, and that saves you repeated invocations of the find command, making it faster if, as I suspect, you have a larger dataset:
aggr () {
    header=0
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -print0 |
    while IFS= read -r -d '' file; do
        if [ $header -eq 0 ]; then
            # First file: copy it whole so its header comes along
            header=1
            cp "$file" "$outfile"
        else
            # Every other file: append everything after the header line
            tail -n +2 "$file" >> "$outfile"
        fi
    done
}
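For what it's worth, here is a quick way to sanity-check it (file names and contents are made up):

printf 'id,value\n1,a\n' > data_2021.txt
printf 'id,value\n2,b\n' > data_2022.txt
aggr data
cat data_aggregate.txt
# id,value
# 1,a
# 2,b

(find's traversal order isn't guaranteed, so the data rows may come out in a different order.)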
Hope this helps!
CodePudding user response:
Extending @sea0003's answer a little bit:
The problem here is that you want to process the first file differently from the rest; otherwise you could have processed the result of find directly with -exec. A work-around is to get the first file from the output of find, process it, and then let xargs take over the rest:
#!/bin/bash

aggr() {
    local outfile="${1}_aggregate.txt"
    find . -name "${1}_*.txt" -print0 |
    if IFS='' read -r -d '' first_file
    then
        # The first file keeps its header...
        cp "$first_file" "$outfile" &&
        # ...and xargs appends the rest without theirs; -q stops tail
        # from printing "==> file <==" banners between multiple files
        xargs -0 tail -q -n +2 >> "$outfile"
    fi
}
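The subtlety that makes this work is that read and xargs consume the same pipe: bash's read pulls input one byte at a time from a pipe, so it takes exactly the first NUL-terminated name and leaves everything after it for xargs. A tiny illustration of the hand-off (the strings are arbitrary):

printf '%s\0' one two three |
if IFS='' read -r -d '' first; then
    echo "first: $first"
    xargs -0 printf 'rest: %s\n'
fi
# first: one
# rest: two
# rest: three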