Find lines in a file that has only words in a list

Time:03-30

Here is file1.txt:

.apple .ball .cow
.apple .cow .tea .mine.nice
.mine.nice
.tea
.zebra

Here is file2.txt:

.apple
.mine.nice
.cow
.tea

Expected Result:

.apple .cow .tea .mine.nice
.mine.nice
.tea

However, the following command does not give the expected result:

grep -w -F -f file2.txt file1.txt 

gives

.apple .ball .cow
.apple .cow .tea .mine.nice
.mine.nice
.tea

How can I get the expected result?

CodePudding user response:

I would use GNU AWK for this task in the following way. Let the content of file1.txt be

.apple .ball .cow
.apple .cow .tea .mine.nice
.mine.nice
.tea
.zebra

and file2.txt content be

.apple
.mine.nice
.cow
.tea

then

awk 'NR==FNR{arr[$1];next}{for(i=1;i<=NF;i+=1){if(!($i in arr)){next}};print}' file2.txt file1.txt

gives

.apple .cow .tea .mine.nice
.mine.nice
.tea

Explanation: while processing the first file named on the command line (note that this is file2.txt), i.e. while the overall record number equals the record number within the current file (NR==FNR), I reference the first field as a key of the array arr. This creates the key in the array; I do not specify any value, as it is irrelevant later. After that, next moves on to the following line, i.e. nothing else is done while processing the first file. For every line of the second file, I iterate over the fields with a for loop; if a field is not one of the keys of arr, I go to the next line; once all fields have passed, I print the whole line as is. Note that this code short-circuits, i.e. it goes to the next line as soon as the first disallowed word is detected. Disclaimer: I assume that file2.txt holds exactly one word per line.

(tested in gawk 4.2.1)
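
The same idea can be spelled out with comments, as a sketch rather than a replacement for the one-liner above. The function name only_allowed_words and the array name allowed are made up for illustration, and this variant stores every field of the word list, so it no longer depends on the one-word-per-line assumption. The printf lines only recreate the sample files from the question.

```shell
# Recreate the sample files from the question.
printf '%s\n' '.apple .ball .cow' '.apple .cow .tea .mine.nice' \
    '.mine.nice' '.tea' '.zebra' > file1.txt
printf '%s\n' '.apple' '.mine.nice' '.cow' '.tea' > file2.txt

# Hypothetical helper: $1 is the word list, $2 is the file to filter.
only_allowed_words() {
    awk '
        # First file: remember every field as an allowed word.
        NR == FNR { for (i = 1; i <= NF; i++) allowed[$i]; next }
        # Second file: drop the line at the first word that is not allowed.
        { for (i = 1; i <= NF; i++) if (!($i in allowed)) next; print }
    ' "$1" "$2"
}

only_allowed_words file2.txt file1.txt
```

Two edge cases to be aware of: an empty line in file1.txt has no fields, so it passes the check and is printed, and if file2.txt is empty then NR==FNR stays true for file1.txt, so nothing is printed at all.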

CodePudding user response:

This might work for you (GNU sed):

sed -En '1{x;s/.*/cat file2/e;y/\n/ /;s/$/ /;x}
         s/.*/& \n&/;G
         :a;s/^(\S+ )(.*\n.*\n.*\1)/\2/;ta;s/^\n(.*)\n.*/\1/p' file1

The solution juggles three lines in the pattern space: two copies of the current line and the contents of file2. The first copy of the current line is matched against the strings in file2 and reduced in size until there are no more matches. If the matching leaves an empty line, every string matched and the line is printed; otherwise it is discarded. The flow of processing is as follows:

Prime the hold space with the contents of file2, replace newlines by spaces and append a space for pattern matching purposes.

Double the current line, again adding a space to the first copy, separate the copies by a newline and append the hold space.

Iterate through the strings at the front of the first copy of the current line, removing each one if it also appears in file2.

When there are no more matches, if all that is left is the newline separating the copies then print the unadulterated copy of the current line.

Otherwise the current line did not match the strings in file2 and no output is produced for that line.
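
To see what the pattern space looks like just before the loop starts, the first two steps can be run on a single line and dumped with GNU sed's l command, which prints embedded newlines as \n and marks the end of the pattern space with $. This is only a sketch: the word list ".apple .tea " is typed in directly instead of being read from file2, and the function name show_pattern_space is made up.

```shell
# Hypothetical helper: show the pattern space for one input line ($1).
# Assumes GNU sed; the hold space is primed with a fixed word list
# instead of the contents of file2.
show_pattern_space() {
    printf '%s\n' "$1" |
        sed -En '1{x;s/.*/.apple .tea /;x};s/.*/& \n&/;G;l'
}

show_pattern_space '.tea'
# .tea \n.tea\n.apple .tea $
```

The three parts are visible in the output: the first copy with its trailing space, the untouched copy, and the word list from the hold space.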

CodePudding user response:

I haven't come up with an alternative that is any simpler than the awk or sed answers here, but thought it worth explaining something. (Sorry if this is inappropriate as an 'answer'; I'm new here, and can't yet leave comments.)

What you are trying to do is fundamentally different to what grep is designed to do, which is to find all lines that contain any matching strings. You could chain grep commands together (e.g., with pipes, |) to get all lines that contain all of the strings, but finding lines that only consist of a specific set of substrings is probably always going to require a complex regular expression.
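
For what it's worth, such a regular expression can be generated from file2.txt rather than written by hand. A rough sketch, assuming file2.txt has one non-empty word per line and that words in file1.txt are separated by one or more spaces; the function name grep_only_listed is made up:

```shell
# Recreate the sample files from the question.
printf '%s\n' '.apple .ball .cow' '.apple .cow .tea .mine.nice' \
    '.mine.nice' '.tea' '.zebra' > file1.txt
printf '%s\n' '.apple' '.mine.nice' '.cow' '.tea' > file2.txt

# Hypothetical helper: $1 is the word list, $2 is the file to filter.
grep_only_listed() {
    # Escape ERE metacharacters in each word, then join the words with '|'.
    alt=$(sed 's/[][\.|$(){}?+*^]/\\&/g' "$1" | paste -sd'|' -)
    # -x: the whole line must match "word( word)*" built from the list.
    grep -xE "($alt)( +($alt))*" "$2"
}

grep_only_listed file2.txt file1.txt
```

The -x option is what makes the difference from the original attempt: the pattern has to cover the entire line, not just part of it.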
