Home > Software design >  Replace the numeric value
Replace the numeric value

Time:03-27

How do we mask the numeric data if the file is in excel format/csv

Input.xlsx/Input.csv 
Col1,Col2,Col3
A,B,-98.02
M,N,1003
P,Q,-1025.345

Output.xlsx/output.csv 
Col1,Col2,Col3
A,B,-78.14
M,N,1143
P,Q,-1245.745

I want to mask each numeric value to different value dynamically. Instead of original data it has to mask with dummy number in shell

CodePudding user response:

This works with CSV files, not with xlsx files:

gawk -F, 'BEGIN{ OFS=","}
          { for(i=1;i<=NF;i  ){ 
               x=$i; 
               gsub(/[0-9.-]*/,"",x); 
               if(x==""){ $i=rand()*100 }
          } print $0 }'  Input.csv

With above script every numeric value is replaced with a numeric value between 0 and 100 (with decimals)

output (example):

Col1,Col2,Col3
A,B,92.4046
M,N,59.3909
P,Q,30.6394

You can replace the rand()*100 with (almost) anything to get another range.

  • for(i=1;i<=NF;i ) loop over all field in the current line
  • gsub(/[0-9.-]*/,"",x) Will replace every number (0-9), or . or - with an empty string (""), effectively leaving an empty string when it's a number.
  • if(x=="") When x is empty replace the field with a random value.

NOTE: The final piece } print $0 }' can also be written as }}1', but I think using the first option is more clear.

EDIT: Keeping negative numbers negative, with (about) the same number of decimals

Create an awk script (i.e. replace.awk):

function rep(s) {
        sgn = s<0?-1:1
        nbc = int(log(sgn*s)/log(10)) 1
        dec = (s-int(s)==0)?0:length(sgn*s-int(sgn*s))-2
        #print "DEBUG" sgn, nbc, dec, "%"nbc"."dec"f"
        return sprintf("%"nbc"."dec"f",sgn*rand()*(10**nbc))
        }

BEGIN{ OFS=","
        srand()
}
{ for(i=1;i<=NF;i  ){
               x=$i;
               gsub(/[0-9.-]*/,"",x);
              if(x==""){ $i=rep($i) }
  }
}1

and do: gawk -F, -f replace.awk Input.csv

output should look like this (example):

Col1,Col2,Col3
A,B,-49.73
M,N,5577
P,Q,-551.278

The introduced variables with this solution:

  • sgn The sign of the number ( 1 or -1)
  • nbc Numbers Before Comma (Comma ==> meant is Decimal separator)
  • dec Number of digits after Comma (Comma ==> meant is Decimal separator)
  • Related