How do we mask the numeric data if the file is in excel format/csv
Input.xlsx/Input.csv
Col1,Col2,Col3
A,B,-98.02
M,N,1003
P,Q,-1025.345
Output.xlsx/output.csv
Col1,Col2,Col3
A,B,-78.14
M,N,1143
P,Q,-1245.745
I want to mask each numeric value to different value dynamically. Instead of original data it has to mask with dummy number in shell
CodePudding user response:
This works with CSV files, not with xlsx files:
gawk -F, 'BEGIN{ OFS=","}
{ for(i=1;i<=NF;i ){
x=$i;
gsub(/[0-9.-]*/,"",x);
if(x==""){ $i=rand()*100 }
} print $0 }' Input.csv
With above script every numeric value is replaced with a numeric value between 0 and 100 (with decimals)
output (example):
Col1,Col2,Col3
A,B,92.4046
M,N,59.3909
P,Q,30.6394
You can replace the rand()*100
with (almost) anything to get another range.
for(i=1;i<=NF;i )
loop over all field in the current linegsub(/[0-9.-]*/,"",x)
Will replace every number (0-9), or.
or-
with an empty string (""), effectively leaving an empty string when it's a number.if(x=="")
Whenx
is empty replace the field with a random value.
NOTE: The final piece } print $0 }'
can also be written as }}1'
, but I think using the first option is more clear.
EDIT: Keeping negative numbers negative, with (about) the same number of decimals
Create an awk script (i.e. replace.awk
):
function rep(s) {
sgn = s<0?-1:1
nbc = int(log(sgn*s)/log(10)) 1
dec = (s-int(s)==0)?0:length(sgn*s-int(sgn*s))-2
#print "DEBUG" sgn, nbc, dec, "%"nbc"."dec"f"
return sprintf("%"nbc"."dec"f",sgn*rand()*(10**nbc))
}
BEGIN{ OFS=","
srand()
}
{ for(i=1;i<=NF;i ){
x=$i;
gsub(/[0-9.-]*/,"",x);
if(x==""){ $i=rep($i) }
}
}1
and do: gawk -F, -f replace.awk Input.csv
output should look like this (example):
Col1,Col2,Col3
A,B,-49.73
M,N,5577
P,Q,-551.278
The introduced variables with this solution:
sgn
The sign of the number ( 1 or -1)nbc
Numbers Before Comma (Comma ==> meant is Decimal separator)dec
Number of digits after Comma (Comma ==> meant is Decimal separator)