I have a function that masks real values using patterns read from a file. What I'm trying to achieve is to call a function that uses gsub to find and replace a string, where the replacement value comes from another function call.
$ cat pat-file
name 10101010
phone 10101010
code 10101010
bankaccount 1010101010101
$ cat data_sub.sh
abc()
{
awk '
function mask(str, str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[i], j, 1)==1) {
            c = substr(str, j, 1)
        } else {
            c = "*"
        }
        str_masked = str_masked c
    }
    return str_masked
}
FNR == NR {
    tags[NR-1] = $1
    masks[NR-1] = $2
}
FNR != NR {
    line = $0
    for (i in tags) {
        regex = "<"tags[i]">[^<]+</"tags[i]">"
        masked_line = ""
        l = length(tags[i])
        while (match(line, regex) > 0) {
            fulltag = substr(line, RSTART, RLENGTH)
            tagval = substr(fulltag, l+3, RLENGTH-l-l-5)
            fulltag_masked = "<"tags[i]">" mask(tagval) "</"tags[i]">"
            masked_line = masked_line substr(line, 1, RSTART-1) fulltag_masked
            line = substr(line, RSTART+RLENGTH)
        }
        line = masked_line line
    }
    print line
}' "$@" pat-file file-1 > output_file
}
abc
The tagval variable stores the value of each XML tag that gets masked inside the XML. Since the same values also appear outside the XML, I need to mask those too. See the input file file-1:
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
The logic is simple and straightforward: store every extracted tag value that gets masked, then apply the same masking algorithm to those values wherever they appear outside the XML. How can I achieve this?
Current output file
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
Expected Output file
This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
CodePudding user response:
Assumptions:
- if a string shows up under different tags (eg, name=ABCD and code=ABCD) then the 1st mask found by awk will be used to mask the string (ie, we won't prioritize the order in which tag/mask pairs are processed)
- strings (to be masked) could show up anywhere in a line
- when matching for non-tag substrings we'll use awk word boundaries (eg, when masking ABCD we'll also mask ABCD-XYZ but we won't mask ABCDABCD nor ABCD_XYZ)
- both files, along with an array of value/masked-value pairs, will fit in memory
- if OP provides a mask of 111111111... (all 1's) we'll go ahead and perform the (effective) no-op operation
General operation:
- process input file (eg, file-1) looking for 'tag' entries
- if we find any matching 'tag' entries we'll apply the proposed mask to the corresponding value
- for each value that is masked we'll keep a copy of said value, and its mask, in a new array
- for repeat values we'll apply the saved mask
- all lines, with or without tags/masked-data, are saved in an array
- END processing runs through our array of lines again, looking for any (word-boundaried) strings that were previously masked and if found, replace with the saved mask value
- in the case of a mask of 11111111... (all 1's) this END processing will re-mask the 'tag' entries, too (still, effectively, a no-op)
- all lines are then sent to stdout
Adding some lines to the sample input file:
$ cat file-1
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = ABCD and XYCD
This is a demo data = XYCD and MEH
This is ABCD and MEH demo data <tag changed="yes"<name>Winkelstein</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ
One idea building on OP's current awk code:
awk '
function mask(str, str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[tag], j, 1) == 1)
            c = substr(str, j, 1)
        else
            c = "*"
        str_masked = str_masked c
    }
    return str_masked
}
FNR == NR { masks[$1] = $2; next }
          { line = $0
            for (tag in masks) {
                regex = "<" tag ">[^<]+</" tag ">"
                masked_line = ""
                len = length(tag)
                while (match(line, regex) > 0) {
                    val = substr(line, RSTART+(len+2), RLENGTH-(len+2)-(len+3))
                    masked[val] = (val in masked) ? masked[val] : mask(val)
                    masked_line = masked_line substr(line, 1, RSTART-1) "<" tag ">" masked[val] "</" tag ">"
                    line = substr(line, RSTART+RLENGTH)
                }
                line = masked_line line
            }
            lines[FNR]=line
          }
END { for (i=1;i<=FNR;i++) {
          for (val in masked) {
              regex="\\<" val "\\>"
              gsub(regex,masked[val],lines[i])
          }
          print lines[i]
      }
    }
' pat-file file-1
This generates:
This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = A*C* and XYCD
This is a demo data = XYCD and M*H
This is A*C* and M*H demo data <tag changed="yes"<name>W*n*e*s****</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
One last line A*C* A*C*-XYZ ABCDABCD ABCD_XYZ
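One caveat: the \< and \> word-boundary operators are GNU awk extensions, so the END block may not behave the same under other awks. If gawk isn't available, a rough alternative is to search for the literal string with index() and check the neighboring characters by hand. The sketch below is a different technique from the gsub() call above. replace_word and is_wordchar are hypothetical names, and a "word character" is assumed to be a letter, digit or underscore.

```shell
# Hypothetical helper: replace whole-word occurrences of a literal string.
# $1 = word to find, $2 = replacement; lines are read from stdin.
# Assumes $1 and $2 contain no backslashes (awk -v interprets escapes).
replace_word() {
    awk -v word="$1" -v repl="$2" '
    function is_wordchar(c) { return c ~ /^[A-Za-z0-9_]$/ }
    {
        wlen = length(word)
        if (wlen == 0) { print; next }        # nothing to search for
        out = ""; rest = $0; prev = ""
        while ((pos = index(rest, word)) > 0) {
            # Characters on either side of the candidate match.
            before = (pos > 1) ? substr(rest, pos - 1, 1) : prev
            after  = substr(rest, pos + wlen, 1)
            if (is_wordchar(before) || is_wordchar(after)) {
                # Part of a longer word: copy through the first matched
                # character unchanged and keep searching after it.
                out  = out substr(rest, 1, pos)
                prev = substr(rest, pos, 1)
                rest = substr(rest, pos + 1)
            } else {
                # Whole word: emit the replacement instead.
                out  = out substr(rest, 1, pos - 1) repl
                prev = substr(word, wlen, 1)
                rest = substr(rest, pos + wlen)
            }
        }
        print out rest
    }'
}

echo 'One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ' | replace_word ABCD 'A*C*'
# prints: One last line A*C* A*C*-XYZ ABCDABCD ABCD_XYZ
```

Because the search is a literal string match rather than a regex, values containing regex metacharacters (eg, a dot) are matched as-is, which may or may not be what OP wants.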