Home > other >  How to normalize the values of specific columns of a csv with awk?
How to normalize the values of specific columns of a csv with awk?

Time:05-24

I have a csv with several variables and I would like to normalize only some specific columns using the standard deviation. The value minus the mean of the variable divided by the standard deviation of the variable.

formula

The file is comma separated and the transformations needs to be done only with awk to the variables months_loan_duration and amount.

The input would look like this but with a thousand rows:

checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,6,critical,radio/tv,1169.53
1 - 200 DM,48,repaid,radio/tv,5951.78
,12,critical,education,2096.23

And the output would be like this:

checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,-1.236,critical,radio/tv,-0.745
1 - 200 DM,2.248,repaid,radio/tv,0.95
,-0.738,critical,education,-0.417

So far I have tried the following unsuccessfully:

#! /usr/bin/awk -f
BEGIN{FS=","; OFS=",";numberColumn=NF}
NR!=1
{
for(i=1;i <=  numberColumn;i  )
        {
        total[i] =$i;
        totalSquared[i] =$i^2;
        }

for (i=1;i <= numberColumn;i  )
        {
        avg[i]=total[i]/(NR-1);
        std[i]=sqrt((totalSquared[i]/(NR-1))-avg[i]^2);
        }

for (i=1;i <= numberColumn;i  )
        {
        norm[i]=(($i-avg[i])/std[i])
        }
}
{
print $1,$norm[2],3,4,$norm[5]
}

CodePudding user response:

It will be easier to read the file twice:

awk -F, -v OFS=, '
    NR==FNR {                   # 1st pass: accumulate values
        if (FNR > 1) {
            sx2  = $2           # sum of col2
            sxx2  = $2 * $2     # sum of col2^2
            sx5  = $5           # sum of col5
            sxx5  = $5 * $5     # sum of col5^2
            n                   # count of samples
        }
        next
    }
    FNR==1 {                    # 2nd pass, 1st line: calc means and stdevs
        ave2 = sx2 / n          # mean of col2
        var2 = sxx2 / (n - 1) - ave2 * ave2 * n / (n - 1)
        if (var2 < 0) var2 = 0  # avoid rounding error
        sd2 = sqrt(var2)        # stdev of col2
        ave5 = sx5 / n
        var5 = sxx5 / (n - 1) - ave5 * ave5 * n / (n - 1)
        if (var5 < 0) var5 = 0
        sd5 = sqrt(var5)
        print                   # print the header line
    }
    FNR>1 {
        if (sd2 > 0) $2 = ($2 - ave2) / sd2
        if (sd5 > 0) $5 = ($5 - ave5) / sd5
        print
    }
' input_file.csv input_file.csv

Output:

checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,-0.704361,critical,radio/tv,-0.750328
1 - 200 DM,1.14459,repaid,radio/tv,1.13527
,-0.440225,critical,education,-0.384939

Please note the calculated values differ from your expected result.

CodePudding user response:

thousands of rows isn't all that big a file for awk : might as well load it in all at once - here i created a 23.6 mn rows synthetic version of it (tested on both gawk and mawk) -

  • while overall performance is similar to other solutions, this code avoids having to explicitly list the input file twice to perform its equivalent of 2-pass processing

INPUT

     rows = 23,622,127. | UTF8 chars = 799192890. | bytes = 799192890.

     1  checking_balance,months_loan_duration,credit_history,purpose,amount
     2  < 0 DM,889,critical,luna,758.61
     3  ,150,critical,terra,1823.93
     4  1 - 200 DM,883,repaid,stablecoin,2525.55
     5  1 - 200 DM,65,repaid,terra,2405.67
     6  < 0 DM,9,critical,luna,4059.34
     7  < 0 DM,201,critical,stablecoin,5043
     8  1 - 200 DM,549,repaid,terra,471.92
     9  < 0 DM,853,critical,stablecoin,422.78
    10  < 0 DM,659,critical,luna,684.94
 

CODE

    # gawk profile, created Tue May 24 04:11:02 2022

    'function abs(_) { 
         return \
                  _<-_?-_:_ 
    } BEGIN {
        split(_____=(_=length(FS = RS = "^$")) _,____,"")
    }
    END {
     1      gsub("\n", ",&")
     1      FS = "["(OFS= ",")"]"
     1     $!_ = $!( __ = _)
     1      __ = --NF

23622126    while ((_____ _) < (__-=_)) {
23622126        ____[___=_____]  = ($__)^_
23622126          ____[ -—___ ]  = ($__)
23622126          ____[___ * _]  = -_^!_
23622126          ____[___-= _]  = ($(__-=_ _^!_))
23622126          ____[   ___ ]  = ($__)^_
        }
     1      ___ = (__=-____[_ _ _])-_^!_

     1       RS = -(abs((____[(_)]/___-(((NR=____[ _^! _]/__)^_)*__/___)))^_^(_/-_)

            ___ = -(abs((____[_ _]/___-(((RT=____[_ _^!_]/__)^_)*__/___)))^_^(_/-_)
     1      ORS = "\n"
     1      gsub(ORS, "")
     1      OFS = ","
     1      print $(_^=_<_), $(__=  _), $  _, $  _, $  _
     1      OFMT = "%."(__*__ !(__=NF-__-__))"f"
23622126    while (  _ <= __) {
23622126        print $_, (NR-$  _)/RS, $  _, $  _, (RT-$  _)/___
        }
    }'
    

OUTPUT

     out9:  837MiB 0:00:28 [29.2MiB/s] [29.2MiB/s] [ <=> ]
      in0:  762MiB 0:00:00 [2.95GiB/s] [2.95GiB/s] [======>] 100%            
    ( pvE 0.1 in0 < "${f}" | LC_ALL=C mawk2 ; )
    
     26.98s user 1.58s system 99% cpu 28.681 total

23622127 878032266 878032266 testfile_stdnorm_test_004.txt_out.txt

 1  checking_balance,months_loan_duration,credit_history,purpose,amount
 2  < 0 DM,1.2000,critical,luna,-1.2939
 3  ,-1.2949,critical,terra,-0.6788
 4  1 - 200 DM,1.1798,repaid,stablecoin,-0.2737
 5  1 - 200 DM,-1.5818,repaid,terra,-0.3429
 6  < 0 DM,-1.7709,critical,luna,0.6119
 7  < 0 DM,-1.1227,critical,stablecoin,1.1798
 8  1 - 200 DM,0.0522,repaid,terra,-1.4594
 9  < 0 DM,1.0785,critical,stablecoin,-1.4878

ALTERNATE SOLUTION OPTIMIZED FOR SMALLER INPUTS (e.g. up to 10^6 (1 mn) rows)

    # gawk profile, created Tue May 24 06:19:24 2022

    # BEGIN rule(s)

    BEGIN {
     1      __ = (FS = RS = "^$") * (ORS = "")
    }

    # END rule(s)

    END {
     1      _ = $__
     1      gsub("[\n][,]","\n_,",_)
     1      sub("^. amount\n","",_) gsub("[,][0-9. -] [,\n]", "\3&\1", _)
     1      _____ = "[^0-9. -] "
     1      gsub("^" (_____) "|\1[^\1\3] \3","",_)
     1      _____ = __ = split(_,___,_____)
1048575     while (-(--__) <  __) {
1048575         ___["_"]  = _=___[(__)]
1048575         ___["="]  = _*_
1048575         ___["~"]  = _=___[--__]
1048575         ___["^"]  = _*_
1048575         ___[":"]  
        }
     1      _ = (__=___[":"])-(____ ^= _<_)
     1        ____
     1      ___["}"] = -(abs((___["^"]/_)-(((___["{"] = ___["~"] / __)^____)*__/_)))^____^(-(_^(!_)))
     1      ___[")"] = -(abs((___["="]/_)-(((___["("] = ___["_"] / __)^____)*__/_)))^____^(-(_^(!_)))
     1      if (_ < _) {
            for (_ in ___) {
                print "debug", _, ___[_]
            }
        }
     1      ____ = split($(_ < _), ______, ORS = "\n")
     1      _ = index(FS = "[" (OFS = ",") "]", OFS)
     1      print ______[_ ^ (! _)]
1048574     for (__  = __ ^= _ < _; __ < ____; __  ) {
1048574         print sprintf("%.*s%s,% .*f,%s,%s,% .*f", ! __, $! _ = ______[__], $(_ ~ _), _   _, (___["{"] - $_) / ___["}"], $  _, $(--_   _), _   _, (___["("] - $NF) / ___[")"])
        }
    }


    # Functions, listed alphabetically

     2  function abs(_)
    {
     2      return ( _ < -_ ? -_ : _)
    }

PERFORMANCE OF SOLUTION # 2 : End-to-End 2.57 secs for 2^20 rows

 rows = 1048575. | UTF8 chars = 39912117. | bytes = 39912117.

 ( pvE 0.1 in0 < "${f}" | LC_ALL=C mawk2 ; )  
 2.46s user 0.13s system 100% cpu 2.573 total
  • Related