I have a directory /user/test
with 2000 compressed files.
For each file, I want to check whether it has exactly 5 records; if it does, I have to store it in decompressed format.
I am able to do it serially, but it is taking a lot of time to finish this job.
Serially I am doing it as below:
for i in `find /user/test -iname "abc*.gz"`;
do
    lines=`zcat "$i" | wc -l`
    if [ "$lines" -eq 5 ]; then
        fname=`basename -s ".$file_ext" "$i"`
        echo "copying $fname to new path"
        zcat "$i" > new_path/"$fname"
        cnt=$((cnt + 1))
    else
        echo "Ignoring file $i. Expecting 5 records. It has more or fewer records."
    fi
done
I want to do the same in parallel.
I tried exploring GNU parallel, but I am seeing an error. I tried the command below:
find /user/test -iname "abc*.gz" |
parallel 'zcat {} | awk 'NR == 5 {print $0}' < {}.txt'
The above command is not working; it throws an error.
CodePudding user response:
Untested:
doit() {
    zcat "$@" | awk 'NR == 5 {print $0}'
}
export -f doit
find /user/test -iname "abc*.gz" |
parallel doit
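Since the command given to parallel has no {}, GNU parallel appends each input line as an argument, so this runs doit once per file name that find prints, one job per CPU by default. If you want to see what would be executed without running anything, the --dry-run flag prints the commands instead:

find /user/test -iname "abc*.gz" | parallel --dry-run doit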
Based on what you do serially:
doit() {
    i="$1"
    lines=`zcat "$i" | wc -l`
    if [ "$lines" -eq 5 ]; then
        fname=`basename -s ".$file_ext" "$i"`
        echo "copying $fname to new path"
        zcat "$i" > new_path/"$fname"
    else
        echo "Ignoring file $i. Expecting 5 records. It has more or fewer records."
    fi
}
export -f doit
export file_ext
find /user/test -iname "abc*.gz" | parallel doit
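Note that the cnt counter from the serial loop is dropped on purpose: each parallel job runs in its own shell, so a shared counter would not accumulate across jobs. If you need the count, one option (assuming new_path starts out empty) is to count the decompressed files afterwards:

ls new_path | wc -l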
The general idea is to build a bash function that works on a single input, export the function (and the variables needed by the function), and run the function in parallel.
The benefit is that it is pretty easy to test the function on a single input.
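For example (abc0001.gz is a hypothetical file name), you can exercise the function on one file before launching the whole batch:

doit /user/test/abc0001.gz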
When writing the function there is a small gotcha: the function cannot write to hardcoded files, because that creates a race condition (multiple instances writing to the same file at the same time). So you need to write the function in a way in which this does not happen, for example by deriving each output file name from the input, as new_path/$fname does above.
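A minimal sketch of the gotcha (the /tmp paths here are made up for illustration):

# Unsafe: every parallel job writes to the same hardcoded file,
# so jobs overwrite each other and only one result survives.
count_bad() {
    zcat "$1" | wc -l > /tmp/all_counts
}

# Safe: each job writes to a file derived from its own input,
# so no two jobs ever share an output file.
count_good() {
    zcat "$1" | wc -l > /tmp/count."$(basename "$1")"
}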