How to speed up a bash script?


I have a very large tab-separated text file that I am parsing to extract certain data. Because the input file is so large, the script is very slow. How can I speed it up?

I tried using & and wait, which turned out slightly slower, and I also tried nice (timings checked with time).

Update: a few lines of input.tsv:

Names   Number  Cylinder    torque  HP  cc  others
chevrolet   18  8   307 130 3504    SLR=0.1;MIN=5;MAX=19;PR=0.008;SUM=27;SD=0.5;IQR=9.5;RANG=7.5;MP_R=0.0177;MX_R=9.118
buick   15  8   350 165 3693    SLR=0.7;MIN=7;MAX=17;PR=0.07;SUM=30;SD=2.5;IQR=7.5;RANG=9.5;MP_R=0.0197;MX_R=9.1541
satellite   18  8   318 150 3436    SLR=0.12;MIN=2;MAX=11;PR=0.065;SUM=17;SD=5.5;IQR=11.5;RANG=6.5;MP_R=0.0377;MX_R=9.154
rebel   16  8   304 150 3433    SLR=0.61;MIN=8;MAX=15;PR=0.04148;SUM=24;SD=4.5;IQR=12.5;RANG=9.5;MP_R=0.018;MX_R=9.186
torino  17  8   302 140 3449    SLR=0.2;MIN=4;MAX=14;PR=0.018;SUM=22;SD=1.5;IQR=7.5;RANG=5.5;MP_R=0.0141;MX_R=9.115

Thank you

extract.sh

#!/bin/bash
zcat input.tsv.gz | while IFS=$'\t' read -r a b c d e f g
        do
        m=$(echo "$g" | awk -v key="MAX" -v RS=';' -F'=' '$1==key{print $2}')
        n=$(echo "$g" | awk -v key="MIN" -v RS=';' -F'=' '$1==key{print $2}')
        o=$(echo "$g" | awk -v key="SUM" -v RS=';' -F'=' '$1==key{print $2}')
        p=$(echo "$g" | awk -v key="SD" -v RS=';' -F'=' '$1==key{print $2}')
        q=$(echo "$g" | awk -v key="IQR" -v RS=';' -F'=' '$1==key{print $2}')
        r=$(echo "$g" | awk -v key="RANG" -v RS=';' -F'=' '$1==key{print $2}')
        data=$(printf '%s\t%s\t%s\t%s\t%s\t%s\tMAX=%s\tMIN=%s\tSUM=%s\tSD=%s\tIQR=%s\tRANG=%s' "$a" "$b" "$c" "$d" "$e" "$f" "$m" "$n" "$o" "$p" "$q" "$r")
        echo "$data"
done

How do I modify the script to run with xargs or GNU parallel, or otherwise make it use more of the machine's resources, to speed up the processing?

CodePudding user response:

This will be faster than your shell loop:

awk '
BEGIN {
  FS = OFS = "\t"                              # read and write tab-separated fields
  split("MAX,MIN,SUM,SD,IQR,RANG", keys, /,/)
  for (i in keys)
    idx[keys[i]] = 6 + i                       # MAX -> field 7, MIN -> field 8, ...
}

NR > 1 {
  split($7, fields, /;/)                       # break the key=value;key=value blob apart
  for (i in fields) {
    key = fields[i]
    sub(/=.*/, "", key)                        # strip "=value", leaving the bare key
    if (key in idx)
      $(idx[key]) = fields[i]                  # put key=value into its own column
  }
}

1'
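
For reference, the answer shows only the awk program. Assuming it is saved as extract.awk (a hypothetical file name, not given in the answer), it can be run over the decompressed stream in a single pass, with no per-line subprocesses:

# Assumed invocation, not part of the original answer:
zcat input.tsv.gz | awk -f extract.awk > output.tsv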

CodePudding user response:

Your semicolon-delimited field seems to contain the same keywords in the same order, so you should be able to use something like this:

#!/bin/bash
zcat input.tsv.gz |
awk -F "\t" '
    NR > 1 {
        printf "%s"FS"%s"FS"%s"FS"%s"FS"%s"FS"%s"FS, $1,$2,$3,$4,$5,$6
        split($7, a, ";")    # a[2]=MIN, a[3]=MAX, a[5]=SUM, a[6]=SD, a[7]=IQR, a[8]=RANG
        printf "%s"FS"%s"FS"%s"FS"%s"FS"%s"FS"%s"ORS, a[3],a[2],a[5],a[6],a[7],a[8]
    }
'
chevrolet   18  8   307 130 3504    MAX=19  MIN=5   SUM=27  SD=0.5  IQR=9.5 RANG=7.5
buick   15  8   350 165 3693    MAX=17  MIN=7   SUM=30  SD=2.5  IQR=7.5 RANG=9.5
satellite   18  8   318 150 3436    MAX=11  MIN=2   SUM=17  SD=5.5  IQR=11.5    RANG=6.5
rebel   16  8   304 150 3433    MAX=15  MIN=8   SUM=24  SD=4.5  IQR=12.5    RANG=9.5
torino  17  8   302 140 3449    MAX=14  MIN=4   SUM=22  SD=1.5  IQR=7.5 RANG=5.5
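
On the xargs/parallel part of the question: a single awk pass like the ones above is usually limited by zcat/gunzip rather than by awk, so extra cores rarely help much. If they do, GNU parallel can split the decompressed stream into blocks and run one awk per block. The sketch below is an assumption, not taken from either answer: extract_fields.awk is a hypothetical file holding one of the awk programs above with the NR > 1 test removed (each worker restarts its own line numbering), and the header is dropped up front instead:

# Sketch only: spread the work across CPU cores with GNU parallel.
# extract_fields.awk is assumed to hold the field-extraction awk without the NR > 1 test.
zcat input.tsv.gz | tail -n +2 |
  parallel --pipe --block 50M -k awk -f extract_fields.awk > output.tsv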