What is the equivalent of grep -wf in awk?

Time:12-24

I have to search a list of items in file1 such as:

RYK
RELA
CCNB1
RXRG
CREB1

in a file2 with multiple columns, for example:

KIAA0196 FAM21C
BIRC2 UBE2D2
BIRC3 UBE2D2
BIRC7 UBE2D2
XIAP UBE2D2
BRCA1 UBE2D2
CDK5R1 HSP90AA1
ICAM1 ITGB2
RYK CDK1
CSNK2A1 CDK1
NFKB1 RELA
CREB1 JUN
PPME1 NFKB1
ARID4A CDK1
ICT1 TFAM
EZH2 CDK1
CDK1 EZH2
CDK1 EZH2
CDK1 HIST1H1D
CDK1 EZH2
CDK1 EZH2
CDK1 EZH2
BCL6 E2F3
CDK1 CCNB1
MME PDIA5
PPP2R1B CDK1
PPP2R1A CDK1
PPP2R1A CDK1
PPP2R1B CDK1
NCOR2 RXRG
THRB RXRG
RARA RXRG
RARB RXRG
RARG RXRG
PPARD RXRG
HDAC5 CDK1
CDK1 RUNX1
CREBBP CREB1

I usually use grep -wf file1 file2, which is simple and efficient, and I get the following results:

RYK CDK1
NFKB1 RELA
CREB1 JUN
CDK1 CCNB1
NCOR2 RXRG
THRB RXRG
RARA RXRG
RARB RXRG
RARG RXRG
PPARD RXRG
CREBBP CREB1

I would like to switch to awk. Is there a way to do this in awk with syntax as simple as the grep one?

CodePudding user response:

The awk command below corresponds to

$ grep -wFf file1 file2

assuming file1 has a single word per line:

$ awk 'NR==FNR{a[$1]; next} {for(i=1;i<=NF;i++) if($i in a) {print; next}}' file1 file2

I added -F so that the entries in file1 are matched as literal strings (not as regexes). That is probably an implied requirement in your case as well.
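
For readers who want to see the one-liner spelled out, here is a commented sketch run against a small subset of the question's data. The scratch directory and the array name `words` are only for illustration. Note that it compares whole whitespace-separated fields, so it behaves like grep -wFf only when the words in file2 are delimited by blanks.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
printf '%s\n' RYK RELA CCNB1 > "$tmp/file1"
printf '%s\n' 'BIRC2 UBE2D2' 'RYK CDK1' 'NFKB1 RELA' 'CDK1 EZH2' 'CDK1 CCNB1' > "$tmp/file2"

result=$(awk '
  NR == FNR { words[$1]; next }          # first file: store each word as an array key
  {
    for (i = 1; i <= NF; i++)            # second file: check every whitespace-separated field
      if ($i in words) { print; next }   # literal match: print the line once, move on
  }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this subset the lines RYK CDK1, NFKB1 RELA and CDK1 CCNB1 are printed, in the order they appear in file2.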

CodePudding user response:

One idea using GNU awk (or any awk that supports GNU Regex Operators):

awk '
FNR==NR  { regexes["\\y" $0 "\\y"]       # build and store regexes using "\y" to designate a word boundary
           next
         }
         { for (regex in regexes)        # loop through list of regexes
               if ( $0 ~ regex) {        # if regex matches anywhere in the line then ...
                  print $0               # print the line to stdout and ...
                  break                  # break out of loop (i.e., keep us from printing the line multiple times for multiple regex matches)
               }
         }
' file1 file2

Running this against OP's sample input files gives us:

RYK CDK1
NFKB1 RELA
CREB1 JUN
CDK1 CCNB1
NCOR2 RXRG
THRB RXRG
RARA RXRG
RARB RXRG
RARG RXRG
PPARD RXRG
CREBBP CREB1

Using a different set of inputs to demonstrate word boundaries:

$ cat f1
ABC
DEF

$ cat f2
ABC DEF
DEF ABC
DEF ABC XYZ
DEF|ABC XYZ
ABCDEF
DEFABC
XYZ{ABC/123:
XYZ{ABCDEF/123:
XYZ-DEF:123\

Comparing grep and awk:

$ grep -wf f1 f2
ABC DEF
DEF ABC
DEF ABC XYZ
DEF|ABC XYZ
XYZ{ABC/123:
XYZ-DEF:123\

$ awk 'FNR==NR {regexes["\\y" $0 "\\y"]; next} {for (regex in regexes) if ($0 ~ regex) {print $0; break}}' f1 f2
ABC DEF
DEF ABC
DEF ABC XYZ
DEF|ABC XYZ
XYZ{ABC/123:
XYZ-DEF:123\

While others may be able to reduce this a bit (other than by using shorter variable names), I don't think we're going to get anywhere near as 'simple' as these 8 characters: grep -wf
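
If GNU awk is not available (the "\y" operator is a GNU extension), a similar result can be sketched in plain awk by searching for each word as a literal string with index() and checking the characters on either side of the hit. This is only a sketch under some assumptions: the function name has_word and the array name words are made up for illustration, "word characters" are taken to be ASCII letters, digits and underscore, and empty lines in the word list are skipped. Because the words are treated literally, it is closer to grep -wFf than to grep -wf.

```shell
tmp=$(mktemp -d)
printf '%s\n' ABC DEF A.C > "$tmp/f1"
printf '%s\n' 'ABC DEF' 'DEF|ABC XYZ' 'ABCDEF' 'XYZ{ABC/123:' 'AXC' 'A.C' > "$tmp/f2"

result=$(awk '
  # Return 1 if w occurs in line with no word character directly before or after it.
  function has_word(line, w,    rest, off, pos, at, before, after) {
    rest = line; off = 0
    while ((pos = index(rest, w)) > 0) {
      at = off + pos                                   # position of this hit in the full line
      before = (at > 1) ? substr(line, at - 1, 1) : "" # empty at start of line
      after  = substr(line, at + length(w), 1)         # empty at end of line
      if (before !~ /[A-Za-z0-9_]/ && after !~ /[A-Za-z0-9_]/) return 1
      rest = substr(rest, pos + 1)                     # otherwise retry further right
      off = at
    }
    return 0
  }
  NR == FNR { if ($0 != "") words[$0]; next }          # first file: one literal word per line
  { for (w in words) if (has_word($0, w)) { print; next } }
' "$tmp/f1" "$tmp/f2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Here ABCDEF is rejected because each hit is glued to another word character, and AXC is rejected because the dot in A.C is matched literally rather than as a regex wildcard. The remaining four lines are printed.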
