Use egrep and sed with a pattern list to return the first instance of every pattern in a single target file

I have a lengthy pattern list in a text file, one item per line. I'm on an older version of Solaris Unix and have very limited scripting experience, so I'm using egrep at the command line. The file I am searching has many instances of each pattern, and I want to return only the line containing the first instance of each pattern.

$ cat patterns.txt
p1
p2
p3

$ cat target.txt
p1
p3
p1
p1
p3
p2
p3
p2
p1

The command to get the whole list of matches is

egrep -f patterns.txt target.txt

I have found many examples of how to return only the first line, or the first and the last line, for patterns in the list. What I need is the first match in target.txt for each pattern in patterns.txt.

I have tried to adapt examples using awk and sed (below), but I am not very familiar with the commands or their usage, so I'm likely doing it wrong.

awk 'BEGIN { while(getline<"patterns.txt") M[$1]=1 }; { if(M[$1]==1) { print; M[$1]=2 } }' target.txt

egrep -f patterns.txt target.txt | sed -n '1p;$p'

The last one yielded the first pattern matched and the last pattern matched in the target.txt file. I think this is heading in the right direction, but I don't understand sed well enough to get the parameters right.

CodePudding user response：

Based solely on OP's provided data it looks like we can merely match on whole lines.

One awk idea:

awk '
FNR==NR   {ptn[$0];next}             # 1st file: store line in array ptn[]; skip to next input line
$0 in ptn {print; delete ptn[$0]}    # 2nd file: if line is an index for the array then print line and delete array entry (so it will not match next time we see it)
' patterns.txt target.txt

# or as a one-liner sans comments:

awk 'FNR==NR {ptn[$0];next} $0 in ptn {print; delete ptn[$0]}' patterns.txt target.txt

This generates:

p1
p3
p2

Granted, we can't tell solely from this output which line we matched on so for debug purposes we'll add an explicit print to the mix to include the input line number:

$ awk 'FNR==NR {ptn[$0];next} $0 in ptn {print FNR,$0; delete ptn[$0]}' patterns.txt target.txt
1 p1
2 p3
6 p2

NOTE: while this seems to answer OP's question for the limited inputs provided, I'm guessing OP's real-world data may be more involved. For example, the patterns could occur as part of a longer line, we may or may not need to match on whole words, and we may or may not need case-insensitive matching. If OP's real requirement is more involved, I'd suggest trying to modify the answers received here (for this question and data) and, if unsuccessful, asking a new question with a more realistic set of sample data.
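
If the patterns can occur anywhere in a line rather than being the whole line, one possible variation is to treat each entry in patterns.txt as a regular expression and test every still-unmatched pattern against each target line. This is only a sketch under assumptions OP has not confirmed: the patterns are valid awk regular expressions, a line is printed once even if it is the first hit for several patterns, and the longer sample lines and the temporary directory below are made up for illustration. On Solaris the old /usr/bin/awk may not accept this syntax; nawk or /usr/xpg4/bin/awk should.

```shell
# Hypothetical sample data: the patterns now sit inside longer lines.
dir=$(mktemp -d)
printf '%s\n' p1 p2 p3 > "$dir/patterns.txt"
printf '%s\n' 'x p1 a' 'x p3 b' 'y p1 c' 'z p2 d' 'z p3 e' > "$dir/target.txt"

out=$(awk '
FNR==NR { ptn[$0]; next }      # 1st file: remember each pattern
{
    hit = 0
    for (p in ptn)             # 2nd file: test every pattern not yet matched
        if ($0 ~ p) { hit = 1; delete ptn[p] }
    if (hit) print             # print the line once, even if several patterns matched
}
' "$dir/patterns.txt" "$dir/target.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```

Because each pattern is used as a regex, p1 would also match a line containing p10, and an empty line in patterns.txt would match everything; whole-word or case-insensitive matching would need further changes.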

CodePudding user response：

This might work for you (GNU sed):

sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#' filePatterns | sed -f - fileTarget

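Generate a sed script from the patterns file (patterns.txt in the question) and pipe it to a second invocation of sed, which reads the script from standard input (-f -) and applies it to the target file (target.txt in the question).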
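
To see what the first sed produces, it can help to write the generated script to a file and look at it before running the second sed. The sketch below assumes GNU sed, as the answer does, and uses the question's sample data under the answer's file names; the temporary directory and the script.sed file name are only for illustration.

```shell
dir=$(mktemp -d)
printf '%s\n' p1 p2 p3 > "$dir/filePatterns"
printf '%s\n' p1 p3 p1 p1 p3 p2 p3 p2 p1 > "$dir/fileTarget"

# Stage 1: turn each pattern into one sed command block.
sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#' "$dir/filePatterns" > "$dir/script.sed"
gen=$(cat "$dir/script.sed")
printf '%s\n' "$gen"
# For the pattern p1 the generated line is:
#   /p1/{x;/p1/{x;d};s/^/\np1/;x;b}
# On a line matching p1: swap in the hold space (the patterns already seen).
# If p1 is already recorded there, swap back and delete the line.
# Otherwise record p1, swap back, and branch to the end so the line is printed.

# Stage 2: run the generated script against the target file.
out=$(sed -f "$dir/script.sed" "$dir/fileTarget")
printf '%s\n' "$out"
rm -rf "$dir"
```

Since the generated commands use / as the address delimiter, patterns containing / (or characters special to the outer substitution, such as # and &) would need escaping first.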