I want to create a script that masks part of the data in XML log files. My approach: a lookup file lists the tags to consider, each with a Yes or No flag, and my Unix script reads the lookup file and masks only the tags flagged Yes. I am using awk in the script to find each tag that has a Yes flag in the lookup file within the XML log file and substitute its value, with something like:
awk -v elem=$1 -v rep="*" 'BEGIN {FS="<|>"}
{ for (I=1;i<=NF;i++)
($I ~ elem)
{gsub (elem,var,$(i+1)) }
}1' file.log > output.log
But this is clearly not working. I also want to retain the '<>' XML tag delimiters, but they are being removed in the output file. Any suggestions for changing the code or my approach would be highly appreciated.
XML log
<name>sam</name>
<phone>98762123</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>
lookup file
tag, flag
name,Yes
phone,Yes
DOB,No
PS: I'm aware that using awk to parse XML is not the best option, but I have no other options here.
CodePudding user response:
If all your files follow the one-tag-per-line rule, like
<name>sam</name>
<phone>98762123</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>
then I suggest exploiting FPAT (a GNU awk feature) to tokenize each line into start tag, text, and end tag, i.e. in the first row <name> becomes $1, sam becomes $2, and </name> becomes $3. Consider the following solution for a simplified case: replace the content of every <phone> element with *. Let the file above be named file.xml, then
awk 'BEGIN{FPAT="<?[^<>]*>?";OFS=""}index($1,"phone"){$2="*"}{print}' file.xml
outputs
<name>sam</name>
<phone>*</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>
Explanation: I use FPAT to describe a field as < repeated 0 or 1 times, followed by any characters other than < and > repeated 0 or more times, followed by > repeated 0 or 1 times, which tokenizes each line into start tag, text, and end tag as described earlier. Note that this assumes the text never contains a literal < or >, so it might fail for more advanced legal XML files. I set OFS (the output field separator) to the empty string to avoid inserting whitespace between tokens in the output. Then, for every line, I use index to check whether phone is a substring of the start tag (index returns 0 if the substring is not found, otherwise its position in the string, so treated as a boolean it answers: was the substring found?) and, if so, replace the text with *. Finally, every line is printed.
What you need to do to make it suitable for your case: load the tags to be masked from your configuration file, and add a for loop so that more than a single tag can be replaced (my code as-is masks only phone).
(tested in gawk 4.2.1)
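A minimal sketch of those two steps, assuming gawk is installed (FPAT is a gawk extension) and that the lookup file has the tag,flag layout shown in the question. The file names lookup.txt, mask_fpat.awk and masked.xml are placeholders of mine, not something from the question. Instead of an explicit for loop, the sketch stores the flagged start tags in an array and tests membership, which has the same effect for exact tag names; tags carrying attributes would not match.

```shell
# Work in a scratch directory; the sample data is copied from the question.
cd "$(mktemp -d)" || exit 1

cat > lookup.txt <<'EOF'
tag, flag
name,Yes
phone,Yes
DOB,No
EOF

cat > file.xml <<'EOF'
<name>sam</name>
<phone>98762123</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>
EOF

cat > mask_fpat.awk <<'EOF'
# Same FPAT and OFS as in the one-liner above.
BEGIN { FPAT = "<?[^<>]*>?"; OFS = "" }

# First file: the lookup. FPAT is tuned for XML lines, so split the
# raw line on commas by hand and keep only tags flagged exactly "Yes".
NR == FNR {
    n = split($0, part, /[ \t]*,[ \t]*/)
    if (n >= 2 && part[2] == "Yes")
        mask["<" part[1] ">"]
    next
}

# Second file: $1 is the start tag, $2 the text, $3 the end tag.
# The NF check skips empty elements, where $2 would be the end tag.
($1 in mask) && NF >= 3 { $2 = "*" }
{ print }
EOF

if command -v gawk >/dev/null 2>&1; then
    gawk -f mask_fpat.awk lookup.txt file.xml > masked.xml
    cat masked.xml
else
    echo "gawk not found; FPAT is not available in other awks" >&2
fi
```

On the sample input this is intended to mask name and phone and leave DOB and bankaccount untouched. The header line of the lookup file is skipped implicitly, because its second column is flag rather than Yes.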
CodePudding user response:
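$ cat tst.awk
# Split on commas (lookup file) and on angle brackets (XML log).
BEGIN { FS="[,<>]" }
# First file (lookup): remember every tag flagged Yes.
NR==FNR {
    if ( $2 == "Yes" ) {
        mask[$1]
    }
    next
}
# XML log: $1 is the (empty) text before "<", so $2 is the tag name.
$2 in mask {
    sub(/>[^<]*/,">*")    # replace the text after the first ">" with "*"
}
{ print }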
$ awk -f tst.awk lookup.txt file.xml
<name>*</name>
<phone>*</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>
If that's not what you need then edit your question to clarify your requirements and provide the expected output.
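For example, if a log line can hold several elements, the script above only inspects the first tag on each line. A possible variation, sketched under that assumption, runs one gsub per flagged tag instead. It uses only POSIX awk features; the file names mask_multi.awk, file.log and masked.log and the sample <record> line are made up for illustration. Tag names are pasted into a dynamic regex, so this assumes they contain no regex metacharacters.

```shell
# Work in a scratch directory; the lookup data is copied from the question.
cd "$(mktemp -d)" || exit 1

cat > lookup.txt <<'EOF'
tag, flag
name,Yes
phone,Yes
DOB,No
EOF

# Hypothetical log line with several elements on one line.
cat > file.log <<'EOF'
<record><name>sam</name><phone>98762123</phone><DOB>12-09-77</DOB></record>
EOF

cat > mask_multi.awk <<'EOF'
# First file: the lookup. Keep only tags flagged exactly "Yes".
NR == FNR {
    n = split($0, part, /[ \t]*,[ \t]*/)
    if (n >= 2 && part[2] == "Yes")
        mask[part[1]]
    next
}

# Log file: for every flagged tag, replace the text that follows each
# matching start tag (everything up to the next "<") with "*".
{
    for (tag in mask)
        gsub("<" tag ">[^<]*", "<" tag ">*")
    print
}
EOF

awk -f mask_multi.awk lookup.txt file.log > masked.log
cat masked.log
# prints: <record><name>*</name><phone>*</phone><DOB>12-09-77</DOB></record>
```

Like the original script, this also turns an empty element such as <name></name> into <name>*</name>.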