I have a lengthy pattern list in a text file, one item per line. I'm using an older version of Solaris Unix and have very limited scripting experience, so I have to use egrep at the command line. The file I am searching through has many instances of each pattern. For each pattern, I want to return only the first line that matches it.
$ cat patterns.txt
p1
p2
p3
$ cat target.txt
p1
p3
p1
p1
p3
p2
p3
p2
p1
The command to get the whole list of matches is
egrep -f patterns.txt target.txt
I have found many examples of how to return only the first line, or the first and the last line, for patterns in the list. What I need is to return the first match in target.txt for each pattern in patterns.txt.
I have tried to adapt examples using awk and sed (below), but I am not very familiar with the commands or their usage, so I'm likely doing it wrong.
awk 'BEGIN { while(getline<"patterns.txt") M[$1]=1 }; { if(M[$1]==1) { print; M[$1]=2 } }' target.txt
egrep -f patterns.txt target.txt | sed -n '1p;$p'
The last one yielded the first pattern matched and the last pattern matched in the target.txt file. I think this is heading in the right direction, but I don't understand sed well enough to get the parameters right.
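For reference, here is a small sketch of what that sed expression does on its own, using made-up input rather than the real files. With -n, sed prints nothing by default; 1p prints line 1 and $p prints the last line of whatever is piped in, which is why only the first and last matches overall come back, not the first match per pattern.

```shell
# Illustrative input only (four lines a..d); not the real target.txt.
# -n suppresses automatic printing, 1p prints line 1, $p prints the last line.
first_last=$(printf 'a\nb\nc\nd\n' | sed -n '1p;$p')
printf '%s\n' "$first_last"
```

This prints a and d, regardless of how many different patterns appear in between.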
CodePudding user response:
Based solely on OP's provided data it looks like we can merely match on whole lines.
One awk idea:
awk '
FNR==NR {ptn[$0];next} # 1st file: store line in array ptn[]; skip to next input line
$0 in ptn {print; delete ptn[$0]} # 2nd file: if line is an index for the array then print line and delete array entry (so it will not match next time we see it)
' patterns.txt target.txt
# or as a one-liner sans comments:
awk 'FNR==NR {ptn[$0];next} $0 in ptn {print; delete ptn[$0]}' patterns.txt target.txt
This generates:
p1
p3
p2
Granted, we can't tell solely from this output which line we matched on, so for debugging purposes we'll print the input line number (FNR) along with each line:
$ awk 'FNR==NR {ptn[$0];next} $0 in ptn {print FNR,$0; delete ptn[$0]}' patterns.txt target.txt
1 p1
2 p3
6 p2
NOTE: while this seems to answer OP's question for the (limited) provided inputs, I'm guessing OP's real-world data may be more involved. For example: the patterns could occur as a substring of a line; we may or may not need to match on whole words; we may or may not need case-insensitive matching. If OP's real requirement is more involved, I'd suggest trying to modify the answers received here (for this question and data) and, if unsuccessful, asking a new question with a more realistic set of sample data.
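As a rough sketch of the "pattern is only part of the line" case, assuming each line of patterns.txt is meant as a regular expression (as with egrep -f), the same idea can loop over the remaining patterns and test each one with ~. The sample files and the temp directory below are made up for illustration; they are not OP's data.

```shell
# Hypothetical sample files in a temp directory; names mirror the question.
dir=$(mktemp -d)
printf 'p1\np2\np3\n' > "$dir/patterns.txt"
printf 'xx p1 yy\nfoo p3\np1 again\nbar p2 baz\np3\n' > "$dir/target.txt"

# 1st file: store each pattern as an array index.
# 2nd file: on the first still-pending pattern that matches the line as a
# regex, print the line, retire that pattern, and stop checking this line.
first_hits=$(awk '
FNR==NR { ptn[$0]; next }
{
    for (p in ptn)
        if ($0 ~ p) { print; delete ptn[p]; break }
}' "$dir/patterns.txt" "$dir/target.txt")

printf '%s\n' "$first_hits"
rm -rf "$dir"
```

With this sample input it prints "xx p1 yy", "foo p3" and "bar p2 baz". Because of the break, a line that matches several pending patterns retires only one of them, and which one is unspecified since for-in order is not defined. On older Solaris the default awk is, as far as I know, the legacy version, so nawk or /usr/xpg4/bin/awk is probably needed for the in and delete syntax.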
CodePudding user response:
This might work for you (GNU sed):
sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#' filePatterns | sed -f - fileTarget
Generate a sed script from the patterns file and apply that script in a second invocation of sed on the target file. Here filePatterns and fileTarget stand for OP's patterns.txt and target.txt.
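To make the generated script less opaque, this sketch runs only the first stage on a single pattern (p1, fed in with printf instead of a real patterns file) and captures what it emits.

```shell
# Run only the generator stage on one pattern line to see the script it emits.
generated=$(printf 'p1\n' | sed 's#.*#/&/{x;/&/{x;d};s/^/\\n&/;x;b}#')
printf '%s\n' "$generated"
```

The emitted line is /p1/{x;/p1/{x;d};s/^/\np1/;x;b}. Read left to right: when a target line matches p1, swap in the hold space, which collects the patterns already seen. If p1 is in there, swap back and delete the line. Otherwise prepend p1 to the hold space, swap back, and branch to the end of the script so the line is printed. As far as I can tell, a target line that matches none of the patterns falls through every block and is printed unchanged, so this fits OP's sample (where every line matches) better than a file with unrelated lines.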