input.txt : My actual input file has 5000000 lines.
A B C D4.2 E 2022-05-31
A B C D4.2 E 2022-05-31
A B F D4.2 E 2022-05-07
A B C D4.2 E 2022-05-31
X B D E2.0 F 2022-05-30
X B Y D4.2 E 2022-05-06
data.txt : This is another file I need to refer to inside the while loop.
A B C D4.2 E 2022-06-31
X B D E2.0 F 2022-07-30
Here's what I need to do:
cat input.txt | while read -r foo bar tan ban can man
do
    KEYVALUE=$(echo "$ban" | awk -F. '{print $1}')
    END_DATE=$(egrep -w "$foo|${KEYVALUE}|$man" data.txt | awk '{print $6}')
    echo "$foo $bar $tan $ban $can $man ${END_DATE}"
done
Desired output:
A B C D4.2 E 2022-05-31 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
A B F D4.2 E 2022-05-07 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
X B D E2.0 F 2022-05-30 2022-07-30
X B Y D4.2 E 2022-05-06 2022-06-31
My major problem is that the while loop takes more than an hour to get through the 5000000 input lines. How can I process this in parallel, since each line is independent of the others and the order of lines in the output file doesn't matter? I've tried using GNU parallel based on a few discussions, but none of them helped, or maybe I'm just not sure how to implement it. I am using RHEL with bash or ksh.
CodePudding user response:
If you can wrap whatever you need to do for each iteration in a shell function, you could export that function and run it with GNU parallel.
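A minimal sketch of that idea (the function name process_line and the lookup inside it are illustrative, not from the original post; it assumes bash and GNU parallel):

process_line() {
    # split one input line into its six fields
    read -r foo bar tan ban can man <<< "$1"
    # part of field 4 before the dot, e.g. "D4" from "D4.2"
    keyvalue=${ban%%.*}
    # look up the end date in data.txt by matching the key against the start of field 4
    end_date=$(awk -v k="$keyvalue" 'index($4, k) == 1 {print $6; exit}' data.txt)
    echo "$foo $bar $tan $ban $can $man $end_date"
}
export -f process_line
parallel process_line {} < input.txt

Note that this keeps the per-line process startup cost of the original loop (one awk per line), so even run in parallel it is unlikely to match the awk-based answers below.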
CodePudding user response:
Here is one potential solution:
cat script.awk
#!/usr/bin/awk -f
NR==FNR {
    # first file (data.txt): key on field 4 with its ".x" suffix stripped, plus field 5
    k = $4; sub(/\..*/, "", k)
    a[k, $5] = $6; next
}
{
    # second input (the chunk of input.txt on stdin): rebuild the key and append the matching date
    k = $4; sub(/\..*/, "", k)
    if ((k, $5) in a) print $0, a[k, $5]
}
cat input.txt | parallel --pipe -q ./script.awk data.txt -
A B C D4.2 E 2022-05-31 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
A B F D4.2 E 2022-05-07 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
X B D E2.0 F 2022-05-30 2022-07-30
X B Y D4.2 E 2022-05-06 2022-06-31
It should be relatively fast. You can tweak the parallel command (e.g. use --pipepart instead of --pipe) to improve performance depending on your parameters (e.g. the size of each file, the number of available cores, etc.).
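For example, a --pipepart variant might look like this (it reads directly from a seekable file instead of through a pipe; the 10M block size is only an illustrative value to tune for your machine):

parallel --pipepart -a input.txt --block 10M -q ./script.awk data.txt - > output.txt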
Edit
Rough benchmarking suggests it will be significantly faster:
# Copy input.txt many times
for f in {1..100}; do cat input.txt >> input.txt_2; done
for f in {1..1000}; do cat input.txt_2 >> input.txt_3; done
for f in {1..10}; do cat input.txt_3 >> input.txt_4; done
du -h input.txt_4
137M input.txt_4
wc -l input.txt_4
6000000 input.txt_4
time cat input.txt_4 | parallel --pipe -q ./script.awk data.txt - > output.txt
real 0m7.533s
user 0m22.085s
sys 0m4.494s
Took <10 seconds to process the 6M row input file. Does this solve your problem?
CodePudding user response:
Without parallel, it took 8 seconds for 5068056 lines:
$ wc -l input.txt
5068056 input.txt
$ time awk 'NR==FNR{a[$4]=$6} NR!=FNR{print $0, a[$4]}' data.txt input.txt > output.txt
real 0m8.274s
user 0m5.397s
sys 0m2.869s
$ wc -l output.txt
5068056 output.txt
With parallel
time cat input.txt | parallel --pipe -q awk 'NR==FNR{a[$4]=$6; next} {print $0, a[$4]}' data.txt - > output.txt
real 0m3.319s
user 0m9.284s
sys 0m5.990s
Using split
#!/bin/bash
inputfile=input.txt
outputfile=output.txt
data=data.txt
count=10

# split the input into $count pieces without breaking lines
split -n l/$count $inputfile /tmp/input$$

# run one awk lookup per piece in the background
for file in /tmp/input$$*; do
    awk 'NR==FNR{a[$4]=$6; next} {print $0, a[$4]}' $data $file > ${file}.out &
done
wait

# stitch the pieces back together and clean up
cat /tmp/input$$*.out > $outputfile
rm /tmp/input$$*
$ time ./split.sh
real 0m1.781s
user 0m7.244s
sys 0m1.536s