I have a bunch of files with different common prefixes that I would like to aggregate. Files with the same prefix all share the same header line, and I don't want that header to end up in my aggregate file more than once. I've written a function that takes the common prefix as an argument, finds all files matching that prefix, prints all but the first line of each to the aggregate output file, grabs the header from one of them, and prepends it to the output file with cat.
aggr () {
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
    fl=`find . -name "${1}_*.txt" -print -quit`
    header=`head -n1 $fl`
    cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}
This generally works well, but when the find command takes a long time to run, I sometimes don't get a header in my output files. From my logs, I can see the following error after echoing the correct header string:
mv: cannot stat ‘tmp.txt’: No such file or directory
I'm not entirely sure what is happening, but it seems like the cat command that adds the header is sometimes executed before find has finished. The command then fails to produce the tmp.txt file, so the subsequent mv never happens. I modified my function by adding wait after the find commands, but it did not resolve the issue. Any suggestions? I'm at a loss as to why this happens with only some files.
aggr () {
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -exec tail -n +2 {} \; > $outfile
    wait
    fl=`find . -name "${1}_*.txt" -print -quit`
    wait
    header=`head -n1 $fl`
    cat - $outfile <<< "$header" > tmp.txt && mv tmp.txt $outfile
}
CodePudding user response:
I cannot comment as to why cat seemingly succeeds while tmp.txt doesn't exist; the && should not execute the mv unless cat returned successfully, and cat should always write at least the contents of outfile, even if some kind of race condition exists with the header handling...
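As an aside, the && short-circuit is easy to watch in isolation (the file names below are just placeholders):

cat - no-such-file.txt <<< "header" > tmp.txt && mv tmp.txt out.txt
echo $?       # non-zero: cat failed on the missing file, so mv was skipped
ls tmp.txt    # tmp.txt exists anyway: the shell creates it via the > redirection before cat runs

Note that last point: the redirection creates tmp.txt before cat even starts, which makes the "cannot stat" error all the stranger.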
That said, I can propose a modification to your script that might make it more robust, and that saves you repeated invocations of the find command, making it faster if, as I suspect, you have a larger dataset:
aggr () {
    header=0
    outfile=${1}_aggregate.txt
    find . -name "${1}_*.txt" -print0 |
    while IFS= read -r -d '' file; do
        if [ $header -eq 0 ]; then
            # First file: copy it whole so its header comes along
            header=1
            cp "$file" "$outfile"
        else
            # Every other file: append everything after the header line
            tail -n +2 "$file" >> "$outfile"
        fi
    done
}
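For what it's worth, here is a quick way to sanity-check it (file names and contents are made up):

printf 'id,value\n1,a\n' > data_2021.txt
printf 'id,value\n2,b\n' > data_2022.txt
aggr data
cat data_aggregate.txt
# id,value
# 1,a
# 2,b

(find's traversal order isn't guaranteed, so the data rows may come out in a different order.)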
Hope this helps!
CodePudding user response:
Extending @sea0003's answer a little bit:
The problem here is that you want to process the first file differently from the rest; otherwise you could have processed the result of find directly with -exec. A work-around is to get the first file from the output of find, process it, and then let xargs take over the rest:
#!/bin/bash

aggr() {
    local outfile="${1}_aggregate.txt"
    find . -name "${1}_*.txt" -print0 |
    if IFS='' read -r -d '' first_file
    then
        # The first file keeps its header...
        cp "$first_file" "$outfile" &&
        # ...and xargs appends the rest without theirs; -q stops tail
        # from printing "==> file <==" banners between multiple files
        xargs -0 tail -q -n +2 >> "$outfile"
    fi
}
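The subtlety that makes this work is that read and xargs consume the same pipe: bash's read pulls input one byte at a time from a pipe, so it takes exactly the first NUL-terminated name and leaves everything after it for xargs. A tiny illustration of the hand-off (the strings are arbitrary):

printf '%s\0' one two three |
if IFS='' read -r -d '' first; then
    echo "first: $first"
    xargs -0 printf 'rest: %s\n'
fi
# first: one
# rest: two
# rest: three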