Partial string search between two files using AWK

I have been trying to rewrite an egrep command in awk to improve performance, but without success. The egrep command does a simple case-insensitive search: it prints every line of file2 that contains (as a partial match) one of the records in file1. Below are the command and sample output.

file1 contains:

Abc
xyz
123
blah
hh
a,b

file2 contains:

abc de
xyz
123
456
blah
test1
abdc
abc,def,123
kite
a,b,c

Original command : egrep -i -f file1 file2

Original (egrep) command output :

$ egrep -i -f file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c

I would like to rewrite the command in AWK so that it does the same thing. I have tried the command below, but it matches whole records only, not partial ones as grep does.

Modified command in awk : awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2

Modified command (awk) output:

$ awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
xyz
123
blah

This leaves out the records that only partially match, such as those containing "abc" or "a,b". How can I fix the awk command? Thanks in advance.

CodePudding user response：

Use index() like this for a partial, literal (non-regex) match:

awk '
NR == FNR {
  needles[tolower($0)]
  next
}
{
  haystack = tolower($0)
  for (needle in needles) {
    if (index(haystack, needle)) {
      print
      break
    }
  }
}' file1 file2
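
To check this against the sample data from the question, here is a minimal sketch. The scratch directory and the variable names dir and out are just for illustration, not part of the answer above:

```shell
# Recreate the question's sample files in a scratch directory (illustrative paths).
dir=$(mktemp -d)
printf '%s\n' Abc xyz 123 blah hh a,b > "$dir/file1"
printf '%s\n' 'abc de' xyz 123 456 blah test1 abdc abc,def,123 kite a,b,c > "$dir/file2"

# Case-insensitive literal substring search: lowercase both sides,
# then test every needle from file1 against each line of file2.
out=$(awk '
NR == FNR { needles[tolower($0)]; next }
{
  haystack = tolower($0)
  for (needle in needles)
    if (index(haystack, needle)) { print; break }
}' "$dir/file1" "$dir/file2")

printf '%s\n' "$out"
rm -rf "$dir"
```

On the sample data this should print the same six lines as egrep -i -f file1 file2, including a,b,c.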

CodePudding user response：

I would be a bit surprised if this were significantly faster than egrep, but you can try it:

$ awk 'NR==FNR {r=r ((r=="")?"":"|") tolower($0);next} tolower($0)~r' file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c

Explanation: first build the r1|r2|...|rn regular expression from the content of file1 and store it in awk variable r. Then print all lines of file2 that match it, thanks to the ~ match operator.
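
To see what the pattern looks like, the build step can be run on its own. This is only a sketch on the question's file1. The scratch directory and the shell variable pattern are illustrative names:

```shell
# Scratch copy of the question's file1 (illustrative path).
dir=$(mktemp -d)
printf '%s\n' Abc xyz 123 blah hh a,b > "$dir/file1"

# Join the lowercased lines of file1 with "|"; the separator is skipped
# for the first line, while r is still empty.
pattern=$(awk '
{ r = r (r == "" ? "" : "|") tolower($0) }
END { print r }' "$dir/file1")

printf '%s\n' "$pattern"   # abc|xyz|123|blah|hh|a,b
rm -rf "$dir"
```

Each line of file2 is then lowercased and tested against this single alternation with ~.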

If you have GNU awk you can use its IGNORECASE variable instead of tolower:

$ awk -v IGNORECASE=1 'NR==FNR{r=r ((r=="")?"":"|") $0;next} $0~r' file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c

With GNU awk, forcing r to be of type regexp rather than string may also improve performance. The manual says:

Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is "regexp constants," for several reasons:
...
It is more efficient to use regexp constants. 'awk' can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, 'awk' must first convert the string into this internal form and then perform the pattern matching.

In order to do this you can try:

$ awk -v IGNORECASE=1 'NR==FNR {s=s ((s=="")?"":"|") $0;next}
    FNR==1 && NR!=FNR {r=@//;sub(//,s,r);print typeof(r),r} $0~r' file1 file2
regexp Abc|xyz|123|blah|hh|a,b
abc de
xyz
123
blah
abc,def,123
a,b,c

(r=@// forces the variable r to be of type regexp, and sub(//,s,r) does not change this. Strongly typed regexp constants and typeof() need GNU awk 4.2 or later.)

Note: just as with your egrep command, the lines of file1 are treated as regular expressions, not as plain strings to search for. So if one line of file1 is .*, every line of file2 will match, not just the lines containing the substring .*.
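
A small sketch of that difference, comparing the regex approach with the index() approach from the other answer. The two sample files here are made up for the illustration and are not the ones from the question:

```shell
dir=$(mktemp -d)
printf '%s\n' '.*' > "$dir/file1"                    # one "pattern": a dot followed by a star
printf '%s\n' 'abc de' 'a.*b' 'kite' > "$dir/file2"

# Regex approach: ".*" is interpreted as a regular expression, so every line matches.
regex_out=$(awk 'NR==FNR {r = r (r == "" ? "" : "|") tolower($0); next}
                 tolower($0) ~ r' "$dir/file1" "$dir/file2")

# Literal approach: index() looks for the two characters ".*" themselves.
literal_out=$(awk 'NR==FNR {needles[tolower($0)]; next}
                   {h = tolower($0); for (n in needles) if (index(h, n)) {print; break}}' \
                  "$dir/file1" "$dir/file2")

printf 'regex:\n%s\n\nliteral:\n%s\n' "$regex_out" "$literal_out"
rm -rf "$dir"
```

The regex version prints all three lines of file2, while the index() version prints only a.*b. If file1 holds plain strings rather than patterns, the literal version is probably the one you want (much like grep -F instead of egrep).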