How to speed up a bash script?

I have a very large tab-separated text file that I am parsing to extract certain fields. Because the input is so large, the script is very slow. How can I speed it up?

I tried backgrounding the work with & and wait, which turned out slightly slower, and I also tried nice (timings checked with time).

Update: the first few lines of input.tsv:

Names   Number  Cylinder    torque  HP  cc  others
chevrolet   18  8   307 130 3504    SLR=0.1;MIN=5;MAX=19;PR=0.008;SUM=27;SD=0.5;IQR=9.5;RANG=7.5;MP_R=0.0177;MX_R=9.118
buick   15  8   350 165 3693    SLR=0.7;MIN=7;MAX=17;PR=0.07;SUM=30;SD=2.5;IQR=7.5;RANG=9.5;MP_R=0.0197;MX_R=9.1541
satellite   18  8   318 150 3436    SLR=0.12;MIN=2;MAX=11;PR=0.065;SUM=17;SD=5.5;IQR=11.5;RANG=6.5;MP_R=0.0377;MX_R=9.154
rebel   16  8   304 150 3433    SLR=0.61;MIN=8;MAX=15;PR=0.04148;SUM=24;SD=4.5;IQR=12.5;RANG=9.5;MP_R=0.018;MX_R=9.186
torino  17  8   302 140 3449    SLR=0.2;MIN=4;MAX=14;PR=0.018;SUM=22;SD=1.5;IQR=7.5;RANG=5.5;MP_R=0.0141;MX_R=9.115

Thank you

extract.sh

#!/bin/bash
zcat input.tsv.gz | while read a b c d e f g;
        do
        m=$(echo  $g | awk -v key="MAX" -v RS=';' -F'=' '$1==key{print$2}')
        n=$(echo  $g | awk -v key="MIN" -v RS=';' -F'=' '$1==key{print$2}')
        o=$(echo  $g | awk -v key="SUM" -v RS=';' -F'=' '$1==key{print$2}')
        p=$(echo  $g | awk -v key="SD" -v RS=';' -F'=' '$1==key{print$2}')
        q=$(echo  $g | awk -v key="IQR" -v RS=';' -F'=' '$1==key{print$2}')
        r=$(echo  $g | awk -v key="RANG" -v RS=';' -F'=' '$1==key{print$2}')
        data=$(printf "$a\t$b\t$c\t$d\t$e\t$f\tMAX=$m\tMIN=$n\tSUM=$o\tSD=$p\tIQR=$q\tRANG=$r")
        echo $data
done

How can I modify this to run with xargs or parallel to speed it up, or otherwise make it use more of the computer's resources?

CodePudding user response：

This will be faster than your shell loop:

awk '
BEGIN {
  FS = OFS = "\t"   # split on tabs, and rejoin with tabs when a field is assigned
  split("MAX,MIN,SUM,SD,IQR,RANG", keys, /,/)
  for (i in keys)
    idx[keys[i]] = 6 + i   # MAX -> column 7, MIN -> 8, ... RANG -> 12
}

NR > 1 {
  split($7, fields, /;/)
  for (i in fields) {
    key = fields[i]
    sub(/=.*/, "", key)
    if (key in idx)
      $(idx[key]) = fields[i]
  }
}

1'
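
Most of the time in the original loop goes into the six echo | awk pipelines it starts for every input line, not into the parsing itself. If the logic has to stay in bash, a rough sketch along the following lines does the same lookup with builtins only. It assumes bash 4 or newer (for associative arrays), the function name extract_builtin is made up for illustration, and unlike the original loop it skips the header row. On very large files awk will most likely still be faster.

```shell
#!/bin/bash
# Sketch only: pull selected KEY=VALUE pairs out of column 7 using bash
# builtins, so no external process is started per line.
# extract_builtin is a hypothetical name; requires bash 4+ (associative arrays).
extract_builtin() {
  local a b c d e f g header pair
  local -a pairs
  local -A kv
  local IFS=$'\t'                 # columns are tab-separated

  read -r header                  # consume the header row without printing it
  while read -r a b c d e f g; do
    kv=()
    # Split column 7 on ';' into an array of KEY=VALUE strings.
    IFS=';' read -r -a pairs <<< "$g"
    for pair in "${pairs[@]}"; do
      [[ $pair == ?*=* ]] || continue     # ignore anything that is not KEY=VALUE
      kv[${pair%%=*}]=${pair#*=}          # text before the first '=' is the key
    done
    printf '%s\t%s\t%s\t%s\t%s\t%s\tMAX=%s\tMIN=%s\tSUM=%s\tSD=%s\tIQR=%s\tRANG=%s\n' \
      "$a" "$b" "$c" "$d" "$e" "$f" \
      "${kv[MAX]}" "${kv[MIN]}" "${kv[SUM]}" "${kv[SD]}" "${kv[IQR]}" "${kv[RANG]}"
  done
}
```

It reads from standard input, so it would be called as zcat input.tsv.gz | extract_builtin.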

CodePudding user response：

Your semicolon-delimited field seems to contain the same keywords in the same order, so you should be able to use something like this:

#!/bin/bash
zcat input.tsv.gz |
awk -F "\t" '
    NR > 1 {
        printf "%s"FS"%s"FS"%s"FS"%s"FS"%s"FS"%s"FS, $1,$2,$3,$4,$5,$6
        split($7, a, ";")
        printf "%s"FS"%s"FS"%s"FS"%s"FS"%s"FS"%s"ORS, a[3],a[2],a[5],a[6],a[7],a[8]
    }
'

Output for the sample rows:

chevrolet   18  8   307 130 3504    MAX=19  MIN=5   SUM=27  SD=0.5  IQR=9.5 RANG=7.5
buick   15  8   350 165 3693    MAX=17  MIN=7   SUM=30  SD=2.5  IQR=7.5 RANG=9.5
satellite   18  8   318 150 3436    MAX=11  MIN=2   SUM=17  SD=5.5  IQR=11.5    RANG=6.5
rebel   16  8   304 150 3433    MAX=15  MIN=8   SUM=24  SD=4.5  IQR=12.5    RANG=9.5
torino  17  8   302 140 3449    MAX=14  MIN=4   SUM=22  SD=1.5  IQR=7.5 RANG=5.5
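
The positional approach above only works if every row really lists its keys in the same order. One way to check that before trusting the output is sketched below. check_key_order is a hypothetical helper name, and it assumes the key=value list is always the 7th tab-separated column and that the first line is a header.

```shell
#!/bin/bash
# Sketch only: report rows whose key order in column 7 differs from the
# first data row. check_key_order is a hypothetical helper name.
check_key_order() {
  awk -F '\t' '
    NR > 1 {
      keys = $7
      gsub(/=[^;]*/, "", keys)        # drop the values, leaving e.g. "SLR;MIN;MAX"
      if (expected == "")
        expected = keys               # first data row defines the reference order
      else if (keys != expected) {
        printf "line %d: %s\n", NR, keys
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }        # non-zero exit status if any row differed
  '
}
```

Running zcat input.tsv.gz | check_key_order prints nothing and exits with status 0 when all rows agree. If it reports any lines, a lookup by key name is the safer choice.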