Compare a file with two separate lookup files using awk

Time:04-12

Basically, I want to check whether the strings listed in lookup_1 and lookup_2 exist in my xyz.txt file, then perform an action on them and redirect the output to an output file. My code currently substitutes every occurrence of the strings in lookup_1, even where they appear only as a substring of a longer word, but I need it to substitute only when there is a whole-word match. Can you please help me tweak the code to achieve this?

code

awk '
FNR==NR { if ($0 in lookups)    
             next                            
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {               
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)   
              }
              ndx=index(lookups[$0],$i)   
          lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }

        { for (i in lookups) { 
              ndx=index($0,i)                
              while (ndx > 0) {
                    $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                    ndx=index($0,i)                    
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt

lookup_1

ha
achine
skhatw
at
ree
ter
man
dun

lookup_2

United States
CDEXX123X
Institution

xyz.txt

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file 
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States

current output

[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States

desired output

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##


CodePudding user response:

We can make a couple of changes to the current code:

  • feed the results of cat lookup_1 lookup_2 into awk such that it looks like a single file to awk (see last line of new code)
  • use word boundary flags (\< and \>) to build regexes with which to perform the replacements (see 2nd half of new code)

The new code:

awk '
        # the FNR==NR block of code remains the same

FNR==NR { if ($0 in lookups)
             next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
          lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }

        # complete rewrite of the following block to perform replacements based on a regex using word boundaries

        { for (i in lookups) {
              regex= "\\<" i "\\>"            # build regex
              gsub(regex,lookups[i])          # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt            # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code

This generates:

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
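
To see where the masked forms in this output come from, the masking rule inside the FNR==NR block can be pulled out into a function of its own: each field is cut into 3-character groups, the first character of every group is kept, and the remaining characters are replaced with #. The following is only an illustrative sketch; the names mask_word and mask are hypothetical and do not appear in the code above.

```shell
# Hypothetical helper: print the masked form of a single word,
# using the same 3-character grouping as the FNR==NR block.
mask_word() {
  awk -v word="$1" '
    function mask(s,    out) {
        out = ""
        while (s != "") {
            # keep the first character of the group, pad with up to two "#"
            out = out substr(s, 1, 1) substr("##", 1, length(s) - 1)
            s = substr(s, 4)          # move on to the next 3-character group
        }
        return out
    }
    BEGIN { print mask(word) }'
}

mask_word Institution    # prints I##t##u##o#
mask_word ha             # prints h#
```

A multi-word lookup value such as United States is masked field by field, which is why the space survives in U##t## S##t##.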

NOTES:

  • the word-boundary operators (\< and \>) are GNU awk extensions that match the empty string at the start and at the end of a word, respectively; in awk a word is defined as a sequence of letters, digits and underscores; see GNU awk - regex operators for more details
  • all of the sample lookup values begin and end with a word character, so this new code works as desired
  • your previous question includes lookup values that cannot be considered as an awk 'word' (eg, @vanti Finserv Co., 11:11 - Capital, MS&CO(NY)) in which case this new code may fail to replace these new lookup values
  • for lookup values that contain non-word characters it's not clear how you would define 'whole word match' as you would also need to determine when a non-word character (eg, @) is to be treated as part of a lookup string vs being treated as a word boundary
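
One way to find out in advance which lookup values the \< ... \> regex can handle is to test whether each value begins and ends with a word character, since \< can only match immediately before a word character and \> immediately after one. A rough sketch in portable awk follows; the function name check_edges and the labels it prints are made up for this illustration, and it does not check for regex metacharacters inside the value.

```shell
# Hypothetical helper: label each argument by whether it starts and ends
# with a word character (letter, digit or underscore).
check_edges() {
  printf '%s\n' "$@" | awk '
    {
        ok    = ($0 ~ /^[A-Za-z0-9_]/ && $0 ~ /[A-Za-z0-9_]$/)
        label = ok ? "word-edged" : "other"
        print label ": " $0
    }'
}

check_edges 'United States' 'CDEXX123X' '@vanti Finserv Co.' 'MS&CO(NY)'
```

Values reported as "other" are the ones that need a different matching rule.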

If you need to replace lookup values that contain (awk) non-word characters you could try replacing the word-boundary characters with \W, though this then causes problems for the lookup values that are (awk) 'words'.

One possible workaround may be to run a dual set of regex matches for each lookup value, eg:

awk '
FNR==NR { ... no changes to this block of code ... }

        { for (i in lookups) {
              regex= "\\<" i "\\>"
              gsub(regex,lookups[i])
              regex= "\\W" i "\\W"
              gsub(regex,lookups[i])
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt

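You'll need to determine if the 2nd regex breaks your 'whole word match' requirement. Note that \W consumes the character on each side of the match, so gsub() replaces those two characters along with the lookup value, and a lookup value at the very start or end of a line is not matched by the 2nd regex at all.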
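
An alternative that avoids both regexes is to locate each lookup value with index() and accept a hit only when the characters on either side of it are not word characters. This is a different technique from the gsub() calls above, sketched under the assumption that 'whole word' means 'not directly adjacent to a letter, digit or underscore'; the names mask_whole and replace_whole are hypothetical. Because the lookup value is treated as a plain string, characters such as ( or . inside it are not interpreted as regex operators.

```shell
# Hypothetical helper: replace whole-word occurrences of $1 with $2 on stdin.
mask_whole() {
  awk -v key="$1" -v repl="$2" '
    function replace_whole(line, needle, subst,    out, pos, nlen, prev, before, after) {
        nlen = length(needle)
        if (nlen == 0) return line
        out = ""; prev = ""
        while ((pos = index(line, needle)) > 0) {
            # character before the hit (prev covers a hit at the start of the remaining text)
            before = (pos > 1) ? substr(line, pos - 1, 1) : prev
            after  = substr(line, pos + nlen, 1)
            if (before !~ /[A-Za-z0-9_]/ && after !~ /[A-Za-z0-9_]/) {
                out  = out substr(line, 1, pos - 1) subst   # whole-word hit: substitute
                prev = substr(needle, nlen, 1)
                line = substr(line, pos + nlen)
            } else {
                out  = out substr(line, 1, pos)             # substring hit: copy one char, keep searching
                prev = substr(line, pos, 1)
                line = substr(line, pos + 1)
            }
        }
        return out line
    }
    { print replace_whole($0, key, repl) }'
}

printf '%s\n' 'reviewed by user ter, not water' | mask_whole ter 't##'
# prints: reviewed by user t##, not water
```

In the full script the same function could be called once per entry of the lookups array in place of the two gsub() calls; whether this definition of a whole-word match is the right one for values like @vanti Finserv Co. is still a decision for you to make.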
