I want to write a script that masks part of the data in XML log files. My approach: a lookup file lists the tags that may need masking, each with a Yes or No flag. My Unix script reads the lookup file and masks only those tags that have a Yes flag. I am using awk in the script to find each tag that has a Yes flag in the lookup file and substitute its value in the XML log file with a pattern. The XML in the log file is not parsed; I just need to mask the values inside the XML tags.
Any suggestions to alter the code or my approach are highly appreciated.
XML in log file
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>sam</name><phone>98762123</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>sam</name><phone>123456789</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
lookup file
tag, flag
name,Yes
phone,Yes
DOB,No
output file must look like this
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>12-09-77</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
PS: the masking algorithm should mask the value according to some pattern. PPS: I'm aware that using awk to parse XML is not the best option, but I have no other options here.
CodePudding user response:
POSIX awk:
$ cat mask-list
name 1
phone 10010010
DOB 00100100
bankaccount 1001100100100
$ cat mask.sh
awk '
# Mask str with the pattern of the current tag (global index i).
# str_masked is an extra parameter, used only as a local variable.
function mask(str,    str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[i], j, 1)==1) {
            c = substr(str, j, 1)
        } else {
            c = "*"
        }
        str_masked = str_masked c
    }
    return str_masked
}
# first file: the mask list
FNR == NR {
    tags[NR-1] = $1
    masks[NR-1] = $2
}
# remaining files: the logs
FNR != NR {
    line = $0
    for (i in tags) {
        regex = "<"tags[i]">[^<]+</"tags[i]">"
        masked_line = ""
        l = length(tags[i])
        while (match(line, regex) > 0) {
            # extract tag value and mask it
            fulltag = substr(line, RSTART, RLENGTH)
            tagval = substr(fulltag, l+3, RLENGTH-l-l-5)
            fulltag_masked = "<"tags[i]">" mask(tagval) "</"tags[i]">"
            # append the line portion before tag, and the masked tag
            masked_line = masked_line substr(line, 1, RSTART-1) fulltag_masked
            # truncate line start to end of matched tag, for next match
            line = substr(line, RSTART+RLENGTH)
        }
        line = masked_line line
    }
    print line
}' "$@"
Example:
$ sh mask.sh mask-list log-file
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>1**4**7**</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
Explanation:
- Read the mask file for tags to mask, and their mask pattern.
- For the log file, use match() and substr() to get each relevant <tag>val</tag> sub-string in a variable, and substr() again to get val, so it can be passed to a function that masks it, based on the pattern for the current tag.
- Re-assemble and repeat, for each tag.
- The mask list includes a corresponding mask pattern. 1 means show, any other character (or none, see name) means hide. You can add more entries.
- In the mask() function, there is an unused parameter str_masked, to keep that variable's scope local. substr() is used to compare the string and the mask pattern one character at a time.
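To try the character-by-character comparison on its own, the masking step can be pulled out into a small shell function. This is only a sketch: mask_value is a hypothetical helper name that is not part of mask.sh, and it takes the value and the pattern as arguments instead of reading them from the mask list.

```shell
#!/bin/sh
# mask_value VALUE PATTERN (hypothetical helper, not part of mask.sh)
# Prints VALUE with every character replaced by "*", except where
# PATTERN has a "1" at the same position.
mask_value() {
    awk -v str="$1" -v pat="$2" 'BEGIN {
        out = ""
        for (j = 1; j <= length(str); j++) {
            # positions past the end of the pattern yield "" and are hidden
            out = out (substr(pat, j, 1) == "1" ? substr(str, j, 1) : "*")
        }
        print out
    }'
}

mask_value sam 1
mask_value 98762123 10010010
mask_value 12-09-77 00100100
```

A pattern shorter than the value hides the remaining characters, which is why the single 1 for name is enough.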
CodePudding user response:
Using GNU awk for the 3rd arg to match() and gensub():
$ cat tst.awk
BEGIN { FS="," }
NR==FNR {
    if ( $2 == "Yes" ) {
        mask[$1]
    }
    next
}
match($0,"(.*<sometag[^<>]+)(.*)(</sometag>.*)",rec) {
    bef = rec[1]
    xml = rec[2]
    aft = rec[3]
    $0 = bef
    while ( match(xml,"(<([^>]+)>)([^<]*)(<[^>]+>)",tagVals) ) {
        tag = tagVals[2]
        if ( tag in mask ) {
            val = tagVals[3]
            masked = gensub(/(.)../,"\\1**","g",val" ")
            tagVals[3] = substr(masked,1,length(val))
        }
        $0 = $0 tagVals[1] tagVals[3] tagVals[4]
        # continue right after the tag pair that was just processed
        xml = substr(xml,tagVals[0,"start"]+tagVals[0,"length"])
    }
    $0 = $0 aft
}
{ print }
$ awk -f tst.awk lookup.txt file
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>1**4**7**</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
The above is assuming your posted expected output is wrong and you don't really expect the DOB or bankaccount values to change.
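If GNU awk is not available, the "keep the first character of every three" masking that gensub() does above can be approximated with a plain loop in POSIX awk. This is a swapped-in technique and not the code from the answer; mask3 is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# mask3 VALUE (hypothetical helper): keep characters 1, 4, 7, ...
# and replace every other character with "*".
mask3() {
    awk -v val="$1" 'BEGIN {
        out = ""
        for (j = 1; j <= length(val); j++) {
            # j % 3 == 1 selects the first character of each group of three
            out = out (j % 3 == 1 ? substr(val, j, 1) : "*")
        }
        print out
    }'
}

mask3 sam
mask3 98762123
mask3 123456789
```

The loop needs no padding and no final substr(), because it never writes more characters than the value contains.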