Awk iteratively replacing strings from array

I've recently been trying to do the following in awk. We have two files (F1.txt and F2.txt.gz). While streaming the second one, I want to replace every occurrence of an entry from F1.txt with a substring of that entry. I got this far:

zcat F2.txt.gz |
    awk 'NR==FNR {a[$1]; next}
    {for (i in a)
         $0=gsub(i, substr(i, 0, 2), $0) #this does not work of course
    }
    {print $0}
' F1.txt -

Was wondering how to do this properly in Awk. Thanks!

CodePudding user response：

Try changing

$0=gsub(i, substr(i, 0, 2), $0)

into

gsub(i, substr(i, 0, 2))

gsub() returns the number of substitutions it made, not the string after the replacement. It modifies its target in place, and when the third argument is omitted the target is $0.
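
For reference, here is a minimal sketch of the whole pipeline with that change applied. The sample data is made up, and printf stands in for zcat F2.txt.gz so the snippet runs without a compressed file. It also uses substr(i, 1, 2), on the assumption that the first two characters are what is wanted.

```shell
# Hypothetical sample data standing in for F1.txt and the zcat stream.
tmp=$(mktemp -d)
printf 'Azerbaijan\nBelarus\n' > "$tmp/F1.txt"

out=$(printf 'Baku is in Azerbaijan\nMinsk is in Belarus\n' |
    awk 'NR==FNR {a[$1]; next}          # first file: collect the entries
         {for (i in a)
              gsub(i, substr(i, 1, 2))  # edits $0 in place; return value ignored
          print}
    ' "$tmp/F1.txt" -)

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample data the two lines come out as "Baku is in Az" and "Minsk is in Be". Keep in mind that gsub() treats each entry as a regular expression, so entries containing characters such as . or * would need escaping.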

CodePudding user response：

$0=gsub(i, substr(i, 0, 2), $0) #this does not work of course

GNU AWK's gsub function modifies its third argument in place (so that argument must be assignable) and returns the number of substitutions made. If you only want the altered value, you can ignore the return value. Consider the following simple example. Let file1.txt contain

a x
b y
c z

and file2.txt content be

quick fox jumped over lazy dog

then

awk 'FNR==NR{arr[$1]=$2;next}{for(i in arr){gsub(i,arr[i],$0)};print}' file1.txt file2.txt

gives output

quizk fox jumped over lxzy dog

Be warned that if your replacements form a chain, such as

a b
b c

then the output depends on the array traversal order, because for (i in arr) visits the keys in an unspecified order.

(tested in gawk 4.2.1)
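
If a fixed order matters, one option (a sketch, not part of the tested command above) is to keep the pairs in arrays indexed by input line number and loop with a counter instead of for (i in arr). The names src, dst and n are made up for this illustration.

```shell
# Hypothetical chained replacement table and a one-line input.
tmp=$(mktemp -d)
printf 'a b\nb c\n' > "$tmp/file1.txt"
printf 'abc\n' > "$tmp/file2.txt"

out=$(awk 'FNR==NR {n++; src[n]=$1; dst[n]=$2; next}   # keep pairs in file order
           {for (j=1; j<=n; j++)                        # numeric loop: fixed order
                gsub(src[j], dst[j])
            print}' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$out"   # prints: ccc
rm -rf "$tmp"
```

The chain still applies here (a becomes b, then every b becomes c), but the result no longer depends on how a particular awk happens to traverse the array.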

CodePudding user response：

Please correct my assumptions if they are wrong.

You have two files; the first contains a set of entries. Wherever a word in the second file matches one of these entries, replace it with its first two characters.

Example:

==> file1 <==
Azerbaijan
Belarus
Canada

==> file2 <==
Caspian sea is in Azerbaijan
Belarus is in Europe
Canada is in metric system.


$ awk 'NR==FNR {a[$1]; next}
               {for(i=1;i<=NF;i++)
                   if($i in a) $i=substr($i,1,2)}1' file1 file2

Caspian sea is in Az
Be is in Europe
Ca is in metric system.

Note that substring indices start at 1 in awk.
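
A quick way to check the indexing on your own awk is shown below. The second call is left without a claimed result: a start of 0 is not an error, but in gawk the requested window then begins before the first character, so fewer characters come back than the length suggests, and other implementations may behave differently.

```shell
# substr(s, start, len) counts characters from 1.
first_two=$(awk 'BEGIN { print substr("Canada", 1, 2) }')
printf '%s\n' "$first_two"   # prints: Ca

# Same call with a start of 0, as in the question; compare the result on your awk.
awk 'BEGIN { print substr("Canada", 0, 2) }'
```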