Use egrep and sed with pattern list to return first instance of every pattern in a single target file

Time:12-21

I have a lengthy pattern list in a text file, one item per line. I'm on an older version of Solaris Unix and have very limited scripting experience, so I'm working with egrep at the command line. The file I am searching contains many instances of each pattern, and I want to return only the line containing the first instance of each pattern.

$ cat patterns.txt
p1
p2
p3

$ cat target.txt
p1
p3
p1
p1
p3
p2
p3
p2
p1

The command to get the whole list of matches is

egrep -f patterns.txt target.txt

I have found many examples of how to return only the first line, or the first and last lines, matching any pattern in the list. What I need is the first match in target.txt for each pattern in patterns.txt.

I have tried to adapt examples using awk and sed (below), but I am not very familiar with the commands or their usage, so I'm likely doing it wrong.

awk 'BEGIN { while(getline<"patterns.txt") M[$1]=1 }; { if(M[$1]==1) { print; M[$1]=2 } }' target.txt

egrep -f patterns.txt target.txt | sed -n '1p;$p'

The second command returned only the first and the last matching lines from target.txt. I think this is heading in the right direction, but I don't understand sed well enough to get the parameters right.

CodePudding user response:

Based solely on OP's provided data it looks like we can merely match on whole lines.

One awk idea:

awk '
FNR==NR   {ptn[$0];next}             # 1st file: store line in array ptn[]; skip to next input line
$0 in ptn {print; delete ptn[$0]}    # 2nd file: if line is an index for the array then print line and delete array entry (so it will not match next time we see it)
' patterns.txt target.txt

# or as a one-liner sans comments:

awk 'FNR==NR {ptn[$0];next} $0 in ptn {print; delete ptn[$0]}' patterns.txt target.txt

This generates:

p1
p3
p2

Granted, this output alone doesn't show which input line was matched, so for debugging purposes we can print the input line number along with each line:

$ awk 'FNR==NR {ptn[$0];next} $0 in ptn {print FNR,$0; delete ptn[$0]}' patterns.txt target.txt
1 p1
2 p3
6 p2

NOTE: while this seems to answer OP's question for the limited inputs provided, I'm guessing OP's real-world data may be more involved. For example, the patterns could appear as part of a longer line, and it's unclear whether we need to match on whole words or worry about case-sensitive matching. If OP's real requirement is more involved, I'd suggest trying to adapt the answers received here (for this question and data) and, if that is unsuccessful, asking a new question with a more realistic set of sample data.
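
If the patterns can appear anywhere within a line rather than as the whole line, the `$0 in ptn` lookup no longer applies, since it only matches complete lines. Below is a minimal sketch of one possible variation. It assumes the patterns are fixed strings (not regular expressions) and that a line should be printed only once even if it is the first hit for several patterns. The sample data and the names pat, seen and hit are made up for illustration:

```shell
# Illustrative sample data (hypothetical, not the OP's real files)
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' p1 p2 p3 > patterns.txt
printf '%s\n' 'x p1 y' 'p3 only' 'again p1' 'p2 and p3' 'no match' 'last p2' > target.txt

first_hits=$(awk '
FNR==NR { pat[++n] = $0; next }        # 1st file: store the patterns in order
{
    hit = 0
    for (i = 1; i <= n; i++)           # 2nd file: test each pattern not seen yet
        if (!seen[i] && index($0, pat[i])) { seen[i] = 1; hit = 1 }
    if (hit) print FNR, $0             # print the line once, with its line number
}' patterns.txt target.txt)

printf '%s\n' "$first_hits"
# 1 x p1 y
# 2 p3 only
# 4 p2 and p3
```

Replacing `index($0, pat[i])` with `$0 ~ pat[i]` would treat each pattern as a regular expression, which is closer to what egrep -f does. On an older Solaris system the default awk may lack some of these features; nawk or /usr/xpg4/bin/awk is the usual suggestion there, though that is an assumption about OP's setup.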

CodePudding user response:

This might work for you (GNU sed):

sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#' filePatterns | sed -f - fileTarget

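This generates a sed script from the patterns file (filePatterns, i.e. patterns.txt in the question) and pipes it to a second sed invocation, which reads the script from stdin (-f -) and applies it to the target file (fileTarget, i.e. target.txt). Each generated command block handles one pattern: when a line matches, the hold space is checked to see whether that pattern was already recorded; if so the line is deleted, otherwise the pattern is added to the hold space and the line is printed.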
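
To see what the first sed produces, it can help to run the two stages separately. The sketch below uses the question's sample data, stores the generated script in a file (the name first.sed is made up here), and then applies it. It still assumes GNU sed, since the generated script relies on `\n` in the replacement and on `;` and `}` directly following the `b` command:

```shell
# Sample data from the question, written to a scratch directory
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' p1 p2 p3 > patterns.txt
printf '%s\n' p1 p3 p1 p1 p3 p2 p3 p2 p1 > target.txt

# Stage 1: turn every pattern line into one sed command block
sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#' patterns.txt > first.sed
cat first.sed
# /p1/{x;/p1/{x;d};s/^/\np1/;x;b}
# /p2/{x;/p2/{x;d};s/^/\np2/;x;b}
# /p3/{x;/p3/{x;d};s/^/\np3/;x;b}

# Stage 2: run the generated script against the target (GNU sed assumed)
first_lines=$(sed -f first.sed target.txt)
printf '%s\n' "$first_lines"
# p1
# p3
# p2
```

One thing to watch for: because the second sed is not run with -n, a target line that matches none of the patterns would fall through every block and be printed as well. That never happens with the sample data, where every line matches some pattern.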
