Use grep or sed to keep only the words that are in another word list file

Time:09-24

I have a list of sentences (one sentence per line), and a dictionary (a list of words, one word per line). I want to use awk, grep or sed to edit the sentences file such that only the words that are in my dictionary file are kept. For example, dictionary:

hello
dog
lost
I
miss
computer
buy

input file:

I miss my dog
I want to buy a new computer

result:

I miss dog
I buy computer

I know this can be done easily with Python, but I'm trying to use terminal commands (awk, sed, grep, or any other terminal command).

Thank you.

CodePudding user response:

In Python I would just read the word list file, create a list of strings with the words, then read the input file and output the word if it exists in the array.

And that's how you'd do it in awk too:

$ awk 'FNR == NR { dict[$0] = 1; next } # Read the dictionary file
       { # And for each word of each line of the sentence file
         for (word = 1; word <= NF; word++) {
           if ($word in dict) # See if it's in the dictionary
             printf "%s ", $word
         }
         printf "\n"
       }' dict.txt input.txt
I miss dog
I buy computer

(This does leave a trailing space on each line, but that's easy to filter out if it matters)
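
If the trailing space does matter, one way to filter it out is to pipe the output through sed. The following is a minimal sketch: it recreates the question's sample files in a temporary directory (the `tmp` and `result` variable names are assumptions made so the snippet runs on its own), then strips one trailing space per line.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' hello dog lost I miss computer buy > "$tmp/dict.txt"
printf '%s\n' 'I miss my dog' 'I want to buy a new computer' > "$tmp/input.txt"

# Same awk program as above; sed then removes the single trailing space.
result=$(awk 'FNR == NR { dict[$0] = 1; next }
     {
       for (word = 1; word <= NF; word++)
         if ($word in dict)
           printf "%s ", $word
       printf "\n"
     }' "$tmp/dict.txt" "$tmp/input.txt" | sed 's/ $//')

printf '%s\n' "$result"
rm -rf "$tmp"
```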

CodePudding user response:

awk '
    NR==FNR { dict[$1]; next }
    {
        sent = ""
        for (i=1; i<=NF; i++) {
            if ($i in dict) {
                sent = (sent=="" ? "" : sent OFS) $i
            }
        }
        print sent
    }
' dict file
I miss dog
I buy computer

The ternary expression (sent=="" ? "" : sent OFS) ensures there's no spurious blank char at the start or end of the output sentence: a blank is only added before the current word if another word already precedes it.

The above assumes the matches should be case-sensitive. If not, then change dict[$1] to dict[tolower($1)] and $i in dict to tolower($i) in dict. It also assumes there's no punctuation to be accounted for, e.g. I miss my dog. or my dog's friendly. If that's wrong then edit your question to provide sample input/output that includes punctuation.
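
For illustration, here is a sketch of the case-insensitive variant. The sample input is the question's data with the case of a few words changed, and the temporary directory and the `result` variable are assumptions added so the snippet is self-contained.

```shell
# Sample files: the question's dictionary, and input with mixed case.
tmp=$(mktemp -d)
printf '%s\n' hello dog lost I miss computer buy > "$tmp/dict"
printf '%s\n' 'i Miss my DOG' 'I want to buy a new computer' > "$tmp/file"

result=$(awk '
    NR==FNR { dict[tolower($1)]; next }   # store dictionary words lowercased
    {
        sent = ""
        for (i=1; i<=NF; i++) {
            if (tolower($i) in dict) {    # compare lowercased, keep original spelling
                sent = (sent=="" ? "" : sent OFS) $i
            }
        }
        print sent
    }
' "$tmp/dict" "$tmp/file")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Words are compared in lowercase but printed as they appear in the input, so the first line comes out as "i Miss DOG".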

CodePudding user response:

These are the basic control-flow statements AWK offers, and the algorithm can be built from them. I would suggest trying to implement it in AWK:

if (condition) statement [ else statement ] 

while (condition) statement

do statement while (condition)

for (expr1; expr2; expr3) statement

for (var in array) statement

break

continue
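
As one possible sketch, the statements listed above (while, if/else and continue) can be combined as follows. It is only one of several ways to do this; the sample files come from the question, and the temporary directory and variable names are assumptions.

```shell
# Sample files from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' hello dog lost I miss computer buy > "$tmp/dict.txt"
printf '%s\n' 'I miss my dog' 'I want to buy a new computer' > "$tmp/input.txt"

result=$(awk '
    NR == FNR { dict[$1]; next }      # first file: remember each dictionary word
    {
        out = ""
        i = 0
        while (i < NF) {              # walk the fields of the current sentence
            i++
            if (!($i in dict))
                continue              # skip words that are not in the dictionary
            if (out == "")
                out = $i
            else
                out = out " " $i
        }
        print out
    }
' "$tmp/dict.txt" "$tmp/input.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```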