Basically, I want to check whether the strings in lookup_1 and lookup_2 exist in my xyz.txt file, then perform an action on them and redirect the output to an output file. Also, my code currently substitutes every occurrence of the strings in lookup_1, even when they appear as a substring, but I need it to substitute only when there is a whole-word match. Can you please help me tweak the code to achieve this?
code
awk '
FNR==NR { if ($0 in lookups)
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
        { for (i in lookups) {
              ndx=index($0,i)
              while (ndx > 0) {
                    $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                    ndx=index($0,i)
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt
lookup_1
ha
achine
skhatw
at
ree
ter
man
dun
lookup_2
United States
CDEXX123X
Institution
xyz.txt
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States
current output
[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States
desired output
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
User response:
We can make a couple of changes to the current code:
- feed the results of cat lookup_1 lookup_2 into awk such that it looks like a single file to awk (see last line of new code)
- use word boundary flags (\< and \>) to build regexes with which to perform the replacements (see 2nd half of new code)
The new code:
awk '
# the FNR==NR block of code remains the same
FNR==NR { if ($0 in lookups)
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
# complete rewrite of the following block to perform replacements based on a regex using word boundaries
        { for (i in lookups) {
              regex= "\\<" i "\\>"                # build regex
              gsub(regex,lookups[i])              # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt                # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code
This generates:
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
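The process substitution <(cat lookup_1 lookup_2) in the command above is a bash/ksh/zsh feature. If the script has to run under a plain POSIX sh, one possible alternative is to pipe the combined lookups into awk and name standard input with - as the first file operand; FNR==NR is then still true only for the lookup stream. The following is a minimal sketch of that idea; the temporary directory and the counters n and m exist only for the demonstration and are not part of the original code:

```shell
# Demo files in a throwaway directory (contents are illustrative only)
tmp=$(mktemp -d)
printf '%s\n' ha ter > "$tmp/lookup_1"
printf '%s\n' 'United States' Institution > "$tmp/lookup_2"
printf '%s\n' 'line one' 'line two' > "$tmp/xyz.txt"

# "-" tells awk to read standard input as its first input file
summary=$(cat "$tmp/lookup_1" "$tmp/lookup_2" |
awk '
FNR==NR { lookups[$0]; n++; next }   # first input: combined lookup stream on stdin
        { m++ }                      # second input: xyz.txt
END     { print n " lookups, " m " data lines" }
' - "$tmp/xyz.txt")

printf '%s\n' "$summary"             # prints: 4 lookups, 2 data lines
rm -rf "$tmp"
```

In the real command the same shape would be: cat lookup_1 lookup_2 | awk '...' - xyz.txt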
NOTES:
- the 'boundary' operators (\< and \>) are GNU awk extensions that match the empty string at the start and end of a word; in awk a word is defined as a sequence of numbers, letters and underscores; see GNU awk - regex operators for more details
- all of the sample lookup values begin and end with (awk) word characters so this new code works as desired
- your previous question includes lookup values that cannot be considered as an awk 'word' (eg, @vanti Finserv Co., 11:11 - Capital, MS&CO(NY)) in which case this new code may fail to replace these new lookup values
- for lookup values that contain non-word characters it's not clear how you would define 'whole word match' as you would also need to determine when a non-word character (eg, @) is to be treated as part of a lookup string vs being treated as a word boundary
If you need to replace lookup values that contain (awk) non-word characters you could try replacing the word-boundary characters with \W, though this then causes problems for the lookup values that are (awk) 'words'.
One possible workaround may be to run a dual set of regex matches for each lookup value, eg:
awk '
FNR==NR { ... no changes to this block of code ... }
        { for (i in lookups) {
              regex= "\\<" i "\\>"
              gsub(regex,lookups[i])
              regex= "\\W" i "\\W"
              gsub(regex,lookups[i])
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt
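You'll need to determine if the 2nd regex breaks your 'whole word match' requirement. Keep in mind that \W consumes a character, so the 2nd gsub() also replaces the non-word character on each side of the lookup value and cannot match a lookup value at the very start or end of a line. Lookup values are also interpreted as regexes here, so characters such as . ( ) keep their regex meaning.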
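Another option for lookup values that contain non-word characters is to skip regexes entirely: find each value with index() and accept the hit only when the characters immediately before and after it are not letters, digits or underscores. The sketch below assumes that this is an acceptable definition of 'whole word match', which may not hold for your data. replace_whole is a hypothetical helper name, and the sample line is made up for illustration (the masks are what the FNR==NR block would produce for these values). Because it uses only index() and substr(), it should not depend on GNU awk extensions, and regex metacharacters in the lookup values are treated literally.

```shell
result=$(printf '%s\n' 'user ter met hamilton at MS&CO(NY)' |
awk '
# Replace the literal string "old" with "new" in "s", but only where the
# neighbouring characters are not letters, digits or underscores.
function replace_whole(s, old, new,    out, ndx, prev, before, after) {
    if (old == "") return s
    out = ""; prev = ""
    while ((ndx = index(s, old)) > 0) {
        before = (ndx > 1) ? substr(s, ndx-1, 1) : prev   # character left of the hit
        after  = substr(s, ndx+length(old), 1)            # character right of the hit
        if (before ~ /[A-Za-z0-9_]/ || after ~ /[A-Za-z0-9_]/) {
            # part of a longer word: keep the text and step one character forward
            out  = out substr(s, 1, ndx)
            prev = substr(s, ndx, 1)
            s    = substr(s, ndx+1)
        } else {
            # whole-word hit: emit the mask instead of the original text
            out  = out substr(s, 1, ndx-1) new
            prev = substr(s, ndx+length(old)-1, 1)
            s    = substr(s, ndx+length(old))
        }
    }
    return out s
}
{
    $0 = replace_whole($0, "ter", "t##")
    $0 = replace_whole($0, "ha", "h#")
    $0 = replace_whole($0, "at", "a#")
    $0 = replace_whole($0, "MS&CO(NY)", "M##C##N##")
    print
}')

printf '%s\n' "$result"   # prints: user t## met hamilton a# M##C##N##
```

In the full script the second block would call replace_whole($0, i, lookups[i]) inside the for (i in lookups) loop instead of gsub().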