I have a very large tab-separated text file that I am parsing to extract certain data. Because the input file is so large, the script is very slow. How can I speed it up?
I tried using & and wait, which turned out to be a bit slower, and I also tried nice (timings checked using time).
Update
A few lines of input.tsv:
Names Number Cylinder torque HP cc others
chevrolet 18 8 307 130 3504 SLR=0.1;MIN=5;MAX=19;PR=0.008;SUM=27;SD=0.5;IQR=9.5;RANG=7.5;MP_R=0.0177;MX_R=9.118
buick 15 8 350 165 3693 SLR=0.7;MIN=7;MAX=17;PR=0.07;SUM=30;SD=2.5;IQR=7.5;RANG=9.5;MP_R=0.0197;MX_R=9.1541
satellite 18 8 318 150 3436 SLR=0.12;MIN=2;MAX=11;PR=0.065;SUM=17;SD=5.5;IQR=11.5;RANG=6.5;MP_R=0.0377;MX_R=9.154
rebel 16 8 304 150 3433 SLR=0.61;MIN=8;MAX=15;PR=0.04148;SUM=24;SD=4.5;IQR=12.5;RANG=9.5;MP_R=0.018;MX_R=9.186
torino 17 8 302 140 3449 SLR=0.2;MIN=4;MAX=14;PR=0.018;SUM=22;SD=1.5;IQR=7.5;RANG=5.5;MP_R=0.0141;MX_R=9.115
Thank you
extract.sh
#!/bin/bash
zcat input.tsv.gz | while read a b c d e f g;
do
    m=$(echo $g | awk -v key="MAX" -v RS=';' -F'=' '$1==key{print$2}')
    n=$(echo $g | awk -v key="MIN" -v RS=';' -F'=' '$1==key{print$2}')
    o=$(echo $g | awk -v key="SUM" -v RS=';' -F'=' '$1==key{print$2}')
    p=$(echo $g | awk -v key="SD" -v RS=';' -F'=' '$1==key{print$2}')
    q=$(echo $g | awk -v key="IQR" -v RS=';' -F'=' '$1==key{print$2}')
    r=$(echo $g | awk -v key="RANG" -v RS=';' -F'=' '$1==key{print$2}')
    data=$(printf "$a\t$b\t$c\t$d\t$e\t$f\tMAX=$m\tMIN=$n\tSUM=$o\tSD=$p\tIQR=$q\tRANG=$r")
    echo $data
done
How do I modify this to run with xargs or parallel to speed up the process, or otherwise make it use more of the computer's resources?
CodePudding user response:
This will be faster than your shell loop:
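awk '
    BEGIN {
        # map each wanted key to its output column: MAX -> 7, MIN -> 8, ..., RANG -> 12
        split("MAX,MIN,SUM,SD,IQR,RANG", keys, /,/)
        for (i in keys)
            idx[keys[i]] = 6 + i
    }
    NR > 1 {
        # split the original column 7 before any field is overwritten
        split($7, fields, /;/)
        for (i in fields) {
            key = fields[i]
            sub(/=.*/, "", key)    # keep only the key name
            if (key in idx)
                $(idx[key]) = fields[i]
        }
    }
    1'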
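The awk program above runs as a single process, which already removes the cost of starting six awk processes per input line. If that still leaves cores idle, the decompressed stream could in principle be cut into chunks and handed to several awk workers. What follows is only a sketch under stated assumptions: GNU parallel is installed (the function falls back to one awk process otherwise), the names `extract` and `prog` are hypothetical, tab-separated output is wanted, and the 10M block size is an arbitrary starting point. Whether it helps at all depends on where the time goes, since zcat itself decompresses on a single core.

```shell
#!/bin/bash
# Sketch only: "extract" and "prog" are illustrative names, not from the question.

# Store the awk program in a temporary file so it can be passed to the workers
# without quoting problems.
prog=$(mktemp)
cat > "$prog" <<'EOF'
BEGIN {
    FS = OFS = "\t"    # assumption: tab-separated input and output
    n = split("MAX,MIN,SUM,SD,IQR,RANG", keys, ",")
    for (i = 1; i <= n; i++)
        idx[keys[i]] = 6 + i    # MAX -> column 7, ..., RANG -> column 12
}
{
    # No "NR > 1" test here: the header is removed before the stream is chunked.
    n = split($7, fields, ";")
    for (i = 1; i <= n; i++) {
        key = fields[i]
        sub(/=.*/, "", key)
        if (key in idx)
            $(idx[key]) = fields[i]
    }
    print
}
EOF

extract() {
    # Drop the header once, up front: every chunk starts its own awk process,
    # so a per-chunk "NR > 1" test would discard a real row from each chunk.
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # --pipe splits stdin on line boundaries; --keep-order preserves row order.
        tail -n +2 | parallel --pipe --keep-order --block 10M awk -f "$prog"
    else
        tail -n +2 | awk -f "$prog"
    fi
}

# Usage (file name taken from the question):
#   zcat input.tsv.gz | extract > output.tsv
```

Comparing both paths with time on a slice of the real file would show whether the extra processes pay for themselves.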
CodePudding user response:
Your semicolon-delimited field seems to contain the same keywords in the same order, so you should be able to use something like this:
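#!/bin/bash
zcat input.tsv.gz |
awk -F "\t" '
    NR > 1 {
        # columns 1-6 are copied through unchanged
        printf "%s"FS"%s"FS"%s"FS"%s"FS"%s"FS"%s"FS, $1,$2,$3,$4,$5,$6
        # after the split: a[2]=MIN, a[3]=MAX, a[5]=SUM, a[6]=SD, a[7]=IQR, a[8]=RANG
        split($7, a, ";")
        printf "%s"FS"%s"FS"%s"FS"%s"FS"%s"FS"%s"ORS, a[3],a[2],a[5],a[6],a[7],a[8]
    }
'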
Output for the sample rows:
chevrolet 18 8 307 130 3504 MAX=19 MIN=5 SUM=27 SD=0.5 IQR=9.5 RANG=7.5
buick 15 8 350 165 3693 MAX=17 MIN=7 SUM=30 SD=2.5 IQR=7.5 RANG=9.5
satellite 18 8 318 150 3436 MAX=11 MIN=2 SUM=17 SD=5.5 IQR=11.5 RANG=6.5
rebel 16 8 304 150 3433 MAX=15 MIN=8 SUM=24 SD=4.5 IQR=12.5 RANG=9.5
torino 17 8 302 140 3449 MAX=14 MIN=4 SUM=22 SD=1.5 IQR=7.5 RANG=5.5
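The positional lookup only works if every row lists its keys in the same order, so it may be worth checking that once before relying on it. Here is a minimal sketch of such a check, assuming tab-separated input with a header line; `key_orders` is a hypothetical helper name. It strips the values from column 7 and counts how many rows share each resulting key sequence.

```shell
#!/bin/bash
# Sketch only: "key_orders" is an illustrative name, not from the question.
# Prints one line per distinct key order found in column 7, prefixed by its row count.
key_orders() {
    awk -F '\t' '
    NR > 1 {
        sig = $7
        gsub(/=[^;]*/, "", sig)    # drop the values, leaving e.g. SLR;MIN;MAX;...
        count[sig]++
    }
    END {
        for (sig in count)
            print count[sig] "\t" sig
    }'
}

# Usage (file name taken from the question):
#   zcat input.tsv.gz | key_orders
```

A single output line means all rows agree and the positional indices are safe. More than one line suggests looking the keys up by name instead.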