Sensitive data masking in XML log file in bash

Time:02-20

I want to create a script that masks part of the data in XML log files. My approach: a lookup file lists the tags that may need masking, each with a flag of Yes or No. My Unix script reads the lookup file and masks only the tags flagged Yes. I am using awk to find each such tag in the XML log file and substitute its value, something like this:

    mask()
{
  pat =[a-zA-Z0-9]
  elem=$1
  awk -v var="$pat" -v rep="*" 'BEGIN {FS=">"; OFS=">"; RS="<"; ORS="<"}
{ for (i=1;i<=NF;i++)
 if ($i ~ elem)
{gsub (var,rep,$(i+1)) }
}1' xml.log > output.log
}

 exec < $lookupfile.txt
 read header 
  while IFS="," read -r tag flag
   do 
     if [[ "$flag" == "Yes" ]];then
     for i in $tag
      do
        mask $tag
      done 
     fi
   done

But this is clearly not working. I also want to retain the '<>' XML tag delimiters, but they are being removed in the output file. Any suggestions for changing the code or my approach would be highly appreciated.

Please note: the XML in the log file has not been parsed. I just need to mask the values inside the XML tags.

**XML in log file**

2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>sam</name><phone>98762123</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>sam</name><phone>123456789</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

lookup file

tag, flag
name,Yes
phone,Yes
DOB,No

output file must look like this

2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

PS: the masking algorithm must mask the value according to some pattern. PS: I'm aware that using awk to parse XML is not the best option, but I have no other options here.
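
For reference, the lookup step described above can be sketched on its own. This is only a sketch under assumptions: the function name `tags_to_mask` is made up for illustration, the first line of the lookup file is taken to be a header, and spaces around the comma (as in the `tag, flag` header) are tolerated.

```shell
# Print every tag whose flag is "Yes", one per line.
# $1 is the path to the comma-separated lookup file.
tags_to_mask() {
    awk -F',' 'NR > 1 {
        # trim leading/trailing blanks so " Yes" and "Yes " still match
        gsub(/^[ \t]+|[ \t]+$/, "", $1)
        gsub(/^[ \t]+|[ \t]+$/, "", $2)
        if ($2 == "Yes") print $1
    }' "$1"
}
```

With the lookup file shown above, `tags_to_mask lookup.txt` prints `name` and `phone`, which could then be handed to a single awk pass over the log instead of calling awk once per tag.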

CodePudding user response:

Using GNU awk for the 3rd arg to match() and gensub():

$ cat tst.awk
BEGIN { FS="," }
NR==FNR {
    if ( $2 == "Yes" ) {
        mask[$1]
    }
    next
}
match($0,"(.*<sometag[^<>]+)(.*)(</sometag>.*)",rec) {
    bef = rec[1]
    xml = rec[2]
    aft = rec[3]

    $0 = bef
    while ( match(xml,"(<([^>]+)>)([^<]*)(<[^>]+>)",tagVals) ) {
        tag = tagVals[2]
        if ( tag in mask ) {
            val = tagVals[3]
            masked = gensub(/(.)../,"\\1**","g",val"  ")
            tagVals[3] = substr(masked,1,length(val))
        }
        $0 = $0 tagVals[1] tagVals[3] tagVals[4]
        xml = substr(xml,tagVals[0,"start"]+tagVals[0,"length"])
    }
    $0 = $0 aft
}
{ print }

$ awk -f tst.awk lookup.txt file
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>1**4**7**</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

The above is assuming your posted expected output is wrong and you don't really expect the DOB or bankaccount values to change.
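
The masking rule in the gensub() call above (keep one character, replace the next two with `*`, then trim back to the original length) can be tried in isolation. A minimal sketch, written with a plain loop in portable awk so it does not depend on gawk; the function name `mask_every_third` is made up for illustration and is not part of the answer's script.

```shell
# Keep characters 1, 4, 7, ... of the value and replace the rest with "*".
# Note: awk -v interprets backslash escapes, so this sketch assumes the
# value contains no backslashes.
mask_every_third() {
    awk -v val="$1" 'BEGIN {
        out = ""
        for (i = 1; i <= length(val); i++)
            out = out (((i - 1) % 3 == 0) ? substr(val, i, 1) : "*")
        print out
    }'
}
```

For example, `mask_every_third 98762123` prints `9**6**2*`, matching the phone value in the output above.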

CodePudding user response:

POSIX awk:

$ cat mask-list
name         1
phone        10010010
DOB          00100100
bankaccount  1001100100100

$ cat mask.sh
awk '
function mask(str, str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[i], j, 1)==1) {
            c = substr(str, j, 1)
        } else {
            c = "*"
        }

        str_masked = str_masked c
    }

    return str_masked
}

FNR == NR {
    tags[NR-1] = $1
    masks[NR-1] = $2
}

FNR != NR {
    line = $0

    for (i in tags) {
        regex = "<"tags[i]">[^<]+</"tags[i]">"
        masked_line = ""
        l = length(tags[i])

        while (match(line, regex) > 0) {

            # extract tag value and mask it
            fulltag = substr(line, RSTART, RLENGTH)
            tagval = substr(fulltag, l+3, RLENGTH-l-l-5)
            fulltag_masked = "<"tags[i]">" mask(tagval) "</"tags[i]">"

            # append the line portion before tag, and the masked tag
            masked_line = masked_line substr(line, 1, RSTART-1) fulltag_masked

            # truncate line start to end of matched tag, for next match
            line = substr(line, RSTART + RLENGTH)
        }

        line = masked_line line
    }

    print line
}' "$@"

Example:

$ sh mask.sh mask-list log-file
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>1**4**7**</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

Explanation:

  • Read the mask file for tags to mask, and their mask pattern.
  • For the log file, use match() and substr() to get each relevant <tag>val</tag> sub-string in a variable, and substr() again to get val, so it can be passed to a function that masks it, based on the pattern for the current tag.
  • Re-assemble and repeat, for each tag.
  • The mask list includes a corresponding mask pattern. 1 means show, any other character (or none, see name) means hide. You can add more entries.
  • In the mask() function, there is an unused parameter str_masked, to keep that variable's scope local.
  • substr() is used to compare the string and the mask pattern one character at a time.
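
The per-character comparison described in the explanation can also be tried on a single value. A minimal sketch, assuming a hypothetical wrapper named `mask_value` that takes the value and its mask pattern as arguments instead of reading them from the mask list:

```shell
# $1 = value to mask, $2 = mask pattern.
# A "1" in the pattern shows the character at that position; any other
# character, or a pattern shorter than the value, hides it with "*".
# Note: awk -v interprets backslash escapes, so this sketch assumes the
# value contains no backslashes.
mask_value() {
    awk -v str="$1" -v pat="$2" 'BEGIN {
        out = ""
        for (j = 1; j <= length(str); j++)
            out = out ((substr(pat, j, 1) == "1") ? substr(str, j, 1) : "*")
        print out
    }'
}
```

For example, `mask_value 12-09-77 00100100` prints `**-**-**`, and `mask_value sam 1` prints `s**` because the pattern runs out after the first character.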