I have a bunch of files with different common prefixes that I would like to aggregate. Files sharing a prefix all have the same header, and I don't want that header to end up in my aggregate file more than once. I've written a function which takes the common prefix as an argument, finds all the files matching that prefix, prints all but the first line of each to the aggregate output file, collects the header from one of them, and prepends it to the output file with cat.
aggr () {
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
    fl=`find . -name "${1}_*.txt" -print -quit`
    header=`head -n1 $fl`
    cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}
This generally works well, but when the find command takes a long time to run, I sometimes don't get a header in my output files. From my logs, I can see the following error after the correct header string is echoed:
mv: cannot stat ‘tmp.txt’: No such file or directory
I'm not entirely sure what is happening, but it seems that the cat command which adds the header sometimes executes before find has finished. The command then fails to produce the tmp.txt file, and so the mv never happens. I modified my function by adding wait after each find command, but it did not resolve the issue. Any suggestions? I'm at a loss as to why this happens only with some files.
aggr () {
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
    wait
    fl=`find . -name "${1}_*.txt" -print -quit`
    wait
    header=`head -n1 $fl`
    cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}
CodePudding user response:
I cannot say why cat seemingly succeeds while tmp.txt doesn't exist: the && should prevent the mv from running unless cat returned success, and cat should always write at least the contents of $outfile, even if some kind of race condition exists in the header handling.
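To illustrate that point, note that the shell creates the redirection target before the command even runs, so a failing cat still leaves tmp.txt behind. A minimal demonstration, run in a scratch directory:

```shell
cd "$(mktemp -d)"            # scratch directory for the demo
rm -f tmp.txt
# The shell opens (and creates) tmp.txt to set up the redirection
# *before* cat runs, so the file exists even when cat fails.
cat /nonexistent/file > tmp.txt || echo "cat failed as expected"
test -f tmp.txt && echo "tmp.txt still exists"
```

So if tmp.txt is genuinely missing by the time mv runs, something outside this pipeline is removing it; one possibility (not confirmed by the question) is several aggr invocations running concurrently in the same directory, all sharing the single fixed name tmp.txt.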
That said, here is a modification that should be more robust, and that saves you the second invocation of find, making it faster on a large dataset:
aggr () {
    header=0
    outfile=${1}_aggregate.txt
    # exclude the aggregate itself so a re-run doesn't feed it back in
    find . -name "${1}_*.txt" ! -name "*_aggregate.txt" -print0 |
    while IFS= read -r -d '' file; do
        if [ "$header" -eq 0 ]; then
            header=1
            cp "$file" "$outfile"
        else
            tail -n +2 "$file" >> "$outfile"
        fi
    done
}
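As a self-contained check of this approach (repeated here with variable quoting, `tail -n +2` so only the header line is skipped, and the aggregate file excluded from the search; the sample file names are hypothetical):

```shell
aggr () {
    header=0
    outfile=${1}_aggregate.txt
    # Exclude the aggregate itself so a re-run doesn't re-ingest it
    find . -name "${1}_*.txt" ! -name "*_aggregate.txt" -print0 |
    while IFS= read -r -d '' file; do
        if [ "$header" -eq 0 ]; then
            header=1
            cp "$file" "$outfile"            # first file: keep its header
        else
            tail -n +2 "$file" >> "$outfile" # later files: drop the header
        fi
    done
}

cd "$(mktemp -d)"
printf 'HEADER\na1\na2\n' > log_a.txt
printf 'HEADER\nb1\nb2\n' > log_b.txt
aggr log
cat log_aggregate.txt    # HEADER appears once, then the four data rows
```

Note that find's output order is unspecified, so which file contributes the header copy (and the order of the data rows) may vary between runs.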
Hope this helps!
CodePudding user response:
You don't need to invoke find twice for this, and you don't need a temporary file either.
outfile=${1}_aggregate.txt \
find . -name "${1}_*.txt" ! -name "*_aggregate.txt" -exec sh -c '
    if ! test -f "$outfile"; then
        cp "$1" "$outfile"
        shift
    fi
    # like tail -n +2 but works with multiple files
    awk "FNR != 1" "$@" /dev/null >>"$outfile"' sh {} +
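Wrapped in a function and run on hypothetical sample files, the single-find version behaves the same way. The env-prefix assignment before find passes outfile into the environment of each sh spawned by -exec, and the trailing `{} +` hands sh whole batches of files at once (the aggregate file is excluded from the search so a re-run stays safe):

```shell
aggr () {
    outfile=${1}_aggregate.txt \
    find . -name "${1}_*.txt" ! -name "*_aggregate.txt" -exec sh -c '
        if ! test -f "$outfile"; then
            cp "$1" "$outfile"   # first batch: keep the first file whole
            shift
        fi
        # like tail -n +2 but works with multiple files
        awk "FNR != 1" "$@" /dev/null >>"$outfile"' sh {} +
}

cd "$(mktemp -d)"
printf 'HEADER\nx1\n'     > rpt_a.txt
printf 'HEADER\ny1\ny2\n' > rpt_b.txt
aggr rpt
cat rpt_aggregate.txt    # HEADER once, then x1, y1, y2 in some order
```

The /dev/null argument keeps awk from reading stdin when a batch ends up with no files after the shift.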