Using gsub to substitute a variable with another variable which is a value from a function call

I have a function that masks values in a data file according to patterns read from another file. What I'm trying to achieve is to call a function that uses gsub to find and replace a string, where the replacement value comes from another function call.

$ cat pat-file
name         10101010
phone        10101010
code         10101010
bankaccount  1010101010101

$ cat data_sub.sh

abc()
{
awk '
function mask(str, str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[i], j, 1)==1) {
            c = substr(str, j, 1)
        } else {
            c = "*"
        }

        str_masked = str_masked c
    }

    return str_masked
}

FNR == NR {
    tags[NR-1] = $1
    masks[NR-1] = $2
}

FNR != NR {
    line = $0

    for (i in tags) {
        regex = "<"tags[i]">[^<]+</"tags[i]">"
        masked_line = ""
        l = length(tags[i])
        while (match(line, regex) > 0) {
            fulltag = substr(line, RSTART, RLENGTH)
            tagval = substr(fulltag, l+3, RLENGTH-l-l-5)
            fulltag_masked = "<"tags[i]">" mask(tagval) "</"tags[i]">"
            masked_line = masked_line substr(line, 1, RSTART-1) fulltag_masked

            line = substr(line, RSTART + RLENGTH)
        }

        line = masked_line line
    }

    print line
}' "$@" pat-file file-1 > output_file
}

abc

The tagval variable holds the value of an XML tag, which gets masked inside the XML. The same values also appear outside the XML, and I need to mask those too. See the input file:

file-1

This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>

The logic is simple and pretty straightforward, i.e., store every extracted tag value that gets masked, then apply the same masking algorithm to those values wherever they appear outside the XML. How can I achieve this?

Current output file

This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>

Expected Output file

This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>

Answer:

Assumptions:

if a string shows up under different tags (eg, name=ABCD and code=ABCD) then the 1st mask found by awk will be used to mask the string (ie, we won't prioritize the order in which tag/mask pairs are processed)
strings (to be masked) could show up anywhere in a line
when matching non-tag substrings we'll use GNU awk word boundaries (\< and \>), eg, when masking ABCD we'll also mask ABCD-XYZ but we won't mask ABCDABCD nor ABCD_XYZ
both files, along with an array of value/masked-value pairs, will fit in memory
if OP provides a mask of 111111111... (all 1's) we'll go ahead and perform the (effective) no-op operation
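
The word-boundary assumption above can be illustrated with a minimal sketch. It is not the code used in this answer: replace_word() and mask_word_demo are hypothetical names, and the helper uses index() plus a check on the neighbouring characters instead of regex word boundaries, so it should also run on awks that lack \< and \>. A "word character" is assumed here to be a letter, digit or underscore.

```shell
mask_word_demo() {
    # $1 = value to look for, $2 = its masked form
    awk -v word="$1" -v repl="$2" '
    # replace_word() is a hypothetical helper, not part of the answer code.
    # It replaces literal occurrences of w in s with r, but only when the
    # characters on both sides are not letters, digits or underscores.
    function replace_word(s, w, r,    out, pos, before, after, n, prev) {
        n = length(w)
        if (n == 0) return s
        out = ""; prev = ""
        while ((pos = index(s, w)) > 0) {
            # character just before the match; prev covers a match at position 1
            before = (pos > 1) ? substr(s, pos - 1, 1) : prev
            after  = substr(s, pos + n, 1)
            if (before ~ /[A-Za-z0-9_]/ || after ~ /[A-Za-z0-9_]/)
                out = out substr(s, 1, pos + n - 1)   # part of a longer word: keep
            else
                out = out substr(s, 1, pos - 1) r     # standalone word: mask
            prev = substr(w, n, 1)                    # last character consumed so far
            s = substr(s, pos + n)
        }
        return out s
    }
    { print replace_word($0, word, repl) }'
}

printf '%s\n' 'One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ' | mask_word_demo ABCD 'A*C*'
```

This should print "One last line A*C* A*C*-XYZ ABCDABCD ABCD_XYZ". Because index() matches literally, a value containing regex metacharacters would not need escaping in this variant.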

General operation:

process input file (eg, file-1) looking for 'tag' entries
if we find any matching 'tag' entries we'll apply the proposed mask to the corresponding value
for each value that is masked we'll keep a copy of said value, and its mask, in a new array
for repeat values we'll apply the saved mask
all lines, with or without tags/masked-data, are saved in an array
END processing runs through our array of lines again, looking for any (word-boundaried) strings that were previously masked and if found, replace with the saved mask value
in the case of a mask of 11111111... (all 1's) this END processing will re-mask the 'tag' entries, too (still, effectively, a no-op)
all lines are then sent to stdout
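
The masking and caching steps above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the full solution: apply_mask() and apply_mask_demo are hypothetical names, the mask is passed as a parameter rather than looked up by tag, and the input is assumed to be plain "value mask" pairs, one per line.

```shell
apply_mask_demo() {
    awk '
    # apply_mask() is a hypothetical helper: keep the characters of str where
    # the mask has a 1 and replace every other character with "*".
    # Positions beyond the end of the mask are replaced with "*" as well.
    function apply_mask(str, mask,    j, out) {
        out = ""
        for (j = 1; j <= length(str); j++)
            out = out (substr(mask, j, 1) == "1" ? substr(str, j, 1) : "*")
        return out
    }
    # Input: one "value mask" pair per line.
    {
        if (!($1 in masked))              # the first mask seen for a value wins
            masked[$1] = apply_mask($1, $2)
        print $1, "->", masked[$1]
    }'
}

printf '%s\n' 'ABCD 10101010' 'Winkelstein 10101010' 'ABCD 11111111' | apply_mask_demo
```

With this input the third line reuses the saved result for ABCD (A*C*) even though its own mask is all 1's, which mirrors the "for repeat values we'll apply the saved mask" step.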

Adding some lines to the sample input file:

$ cat file-1
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = ABCD and XYCD
This is a demo data = XYCD and MEH
This is ABCD and MEH demo data <tag changed="yes"<name>Winkelstein</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ

One idea building on OP's current awk code:

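awk '
# mask(): keep the characters of str where the mask of the current (global)
# tag has a 1 and replace every other character with "*";
# str_masked is a local variable
function mask(str, str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[tag], j, 1) == 1)
           c = substr(str, j, 1)
        else
           c = "*"
        str_masked = str_masked c
    }
    return str_masked
}

# 1st file (pat-file): save each tag and its mask
FNR == NR { masks[$1] = $2; next }

# 2nd file (file-1): mask the tag values, remember each value/masked-value pair
          { line = $0

            for (tag in masks) {
                regex = "<" tag ">[^<]+</" tag ">"
                masked_line = ""
                len = length(tag)

                while (match(line, regex) > 0) {
                      # strip "<tag>" (len+2 chars) and "</tag>" (len+3 chars)
                      val = substr(line, RSTART+(len+2), RLENGTH-(len+2)-(len+3))
                      masked[val]= (val in masked) ? masked[val] : mask(val)
                      masked_line = masked_line substr(line, 1, RSTART-1) "<" tag ">" masked[val] "</" tag ">"
                      line = substr(line, RSTART + RLENGTH)
                }
                line = masked_line line
            }
            lines[FNR]=line
        }

# re-scan the saved lines and mask standalone copies of the saved values;
# \< and \> are GNU awk word-boundary operators
END     { for (i=1;i<=FNR;i++) {
              for (val in masked) {
                  regex="\\<" val "\\>"
                  gsub(regex,masked[val],lines[i])
              }
              print lines[i]
          }
        }
' pat-file file-1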

This generates:

This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = A*C* and XYCD
This is a demo data = XYCD and M*H
This is A*C* and M*H demo data <tag changed="yes"<name>W*n*e*s****</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
One last line A*C* A*C*-XYZ ABCDABCD ABCD_XYZ