Data masking in XML file using awk

I want to create a script that masks some of the data in XML log files. My approach is to have a lookup file in which I list the tags to be masked, each with a Yes or No flag. My Unix script will read the lookup file and mask only those tags that have a Yes flag. I am using awk in my script to find each tag (one that has a Yes flag in the lookup file) in the XML log file and substitute its value, doing something like:

   awk -v elem=$1 -v rep="*" 'BEGIN {FS="<|>"}
    { for (I=1;i<=NF;i++)
     ($I ~ elem)
    {gsub (elem,var,$(i+1)) }
    }1' file.log > output.log

But this is clearly not working. Also, I want to retain the '<>' XML tags, but they are getting removed in the output file. Any suggestions for altering the code or my approach are highly appreciated.

XML log

<name>sam</name>
<phone>98762123</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>

lookup file

tag, flag
name,Yes
phone,Yes
DOB,No

PS: I'm aware that using awk to parse XML is not the best option, but I have no other options here.

CodePudding user response：

If all your files follow the one-tag-per-line rule, like

<name>sam</name>
<phone>98762123</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>

then I suggest exploiting FPAT to tokenize each line into start tag, text, and end tag, i.e. in the first row <name> becomes $1, sam becomes $2 and </name> becomes $3. Consider the following solution for a simplified case: replace the content of every <phone> element with *. Let the above file be named file.xml, then

awk 'BEGIN{FPAT="<?[^<>]*>?";OFS=""}index($1,"phone"){$2="*"}{print}' file.xml

outputs

<name>sam</name>
<phone>*</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>

Explanation: I use FPAT to describe a field as < repeated 0 or 1 times, then any characters other than < and > repeated 0 or more times, then > repeated 0 or 1 times. This tokenizes each line into start tag, text, and end tag as described earlier (note that this assumes the text never contains a literal < or >, so it might fail for more advanced legal XML files). I set OFS (output field separator) to the empty string to avoid inserting whitespace between tokens in the output. Then for every line I use index to check whether phone is a substring of the start tag (index returns 0 if not found, otherwise the position in the string, so treated as a boolean it answers "was the substring found?") and, if so, replace the text with *. Every line is then printed.
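
To see the tokenization for yourself, here is a minimal sketch that prints each field of one sample line on its own row. It assumes gawk is installed under the name gawk, since FPAT is a gawk extension; the variable name fields is only illustrative.

```shell
# FPAT is a gawk extension, so this calls gawk explicitly.
fields=$(printf '%s\n' '<name>sam</name>' |
  gawk 'BEGIN { FPAT = "<?[^<>]*>?" }
        { for (i = 1; i <= NF; i++) print "$" i " = " $i }')
printf '%s\n' "$fields"
```

With the sample line this should list <name> as $1, sam as $2 and </name> as $3.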

What you need to do to make it suitable for your case: load the tags to be masked from your configuration file, and add a for loop to allow replacing more than a single tag (my code as-is masks only phone).

(tested in gawk 4.2.1)
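
The two changes suggested above could be sketched roughly as follows. This is only one possible way to do it, not the answerer's own code: instead of a for loop it keeps the flagged tags in an array and tests membership, and it compares the whole start tag (for example <name>) rather than using a substring check, so a start tag carrying attributes would not match. The file names lookup.txt and file.xml and the temporary directory are assumptions for the demo; the sample data is taken from the question.

```shell
# Sample inputs taken from the question; the file names are illustrative.
workdir=$(mktemp -d)
printf '%s\n' 'tag, flag' 'name,Yes' 'phone,Yes' 'DOB,No' > "$workdir/lookup.txt"
printf '%s\n' '<name>sam</name>' '<phone>98762123</phone>' \
  '<DOB>12-09-77</DOB>' '<bankaccount>4563728495847</bankaccount>' > "$workdir/file.xml"

# FPAT is a gawk extension, so this calls gawk explicitly.
masked=$(gawk '
  BEGIN { FPAT = "<?[^<>]*>?"; OFS = "" }
  # First file (lookup): split on commas and remember tags flagged Yes,
  # stored in the same form as the start tag, e.g. "<name>".
  NR == FNR {
    split($0, col, ",")
    if (col[2] == "Yes") mask["<" col[1] ">"]
    next
  }
  # Second file (XML): $1 is the start tag, $2 the text, $3 the end tag.
  # NF == 3 skips empty elements, where $2 would be the end tag.
  NF == 3 && ($1 in mask) { $2 = "*" }
  { print }
' "$workdir/lookup.txt" "$workdir/file.xml")
printf '%s\n' "$masked"
rm -rf "$workdir"
```

With the sample data, name and phone should come out masked while DOB (flagged No) and bankaccount (not listed) are left unchanged.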

CodePudding user response：

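$ cat tst.awk
# Split on commas (lookup file) and on angle brackets (XML lines).
BEGIN { FS="[,<>]" }
# First file: remember every tag whose flag is Yes.
NR==FNR {
    if ( $2 == "Yes" ) {
        mask[$1]
    }
    next
}
# Second file: $1 is the empty text before "<", so $2 is the tag name.
$2 in mask {
    sub(/>[^<]*/,">*")
}
{ print }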

$ awk -f tst.awk lookup.txt file.xml
<name>*</name>
<phone>*</phone>
<DOB>12-09-77</DOB>
<bankaccount>4563728495847</bankaccount>

If that's not what you need then edit your question to clarify your requirements and provide the expected output.