I have a list of sentences (one sentence per line), and a dictionary (a list of words, one word per line). I want to use awk, grep or sed to edit the sentences file such that only the words that are in my dictionary file are kept. For example, dictionary:
hello
dog
lost
I
miss
computer
buy
input file:
I miss my dog
I want to buy a new computer
result:
I miss dog
I buy computer
I know this can be done easily with Python, but I'm trying to use terminal commands (awk, sed, grep, or any other terminal command).
Thank you.
CodePudding user response:
In Python I would just read the word list file, create a list of strings with the words, then read the input file and output the word if it exists in the array.
And that's how you'd do it in awk too:
$ awk 'FNR == NR { dict[$0] = 1; next } # Read the dictionary file
       { # And for each word of each line of the sentence file
         for (word = 1; word <= NF; word++) {
           if ($word in dict) # See if it is in the dictionary
             printf "%s ", $word
         }
         printf "\n"
       }' dict.txt input.txt
I miss dog
I buy computer
(This does leave a trailing space on each line, but that's easy to filter out if it matters)
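If the trailing space does matter, one way to filter it out is to pipe the awk output through sed. This is only a sketch: the temporary directory, the sample files, and the filter_words function name are made up for illustration and simply reproduce the example from the question.

```shell
# Hypothetical sample files rebuilt from the example in the question.
tmpdir=$(mktemp -d)
printf '%s\n' hello dog lost I miss computer buy > "$tmpdir/dict.txt"
printf '%s\n' 'I miss my dog' 'I want to buy a new computer' > "$tmpdir/input.txt"

# filter_words DICT SENTENCES (illustrative name, not a standard command)
filter_words() {
    awk 'FNR == NR { dict[$0] = 1; next }   # first file: load the dictionary
    {
        for (word = 1; word <= NF; word++)  # second file: keep known words
            if ($word in dict)
                printf "%s ", $word
        printf "\n"
    }' "$1" "$2" | sed 's/ $//'             # drop the single trailing space
}

filter_words "$tmpdir/dict.txt" "$tmpdir/input.txt"
```

The sed expression only removes one space at the end of each line, which is all the printf "%s " loop can leave behind.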
CodePudding user response:
awk '
    NR==FNR { dict[$1]; next }
    {
        sent = ""
        for (i=1; i<=NF; i++) {
            if ($i in dict) {
                sent = (sent=="" ? "" : sent OFS) $i
            }
        }
        print sent
    }
' dict file
I miss dog
I buy computer
The ternary expression (sent=="" ? "" : sent OFS) ensures we don't get a spurious blank char at the start or end of the output sentence: a blank is added before the current word only if there's already a preceding word.
The above assumes the matches should be case-sensitive. If not, then change dict[$1] to dict[tolower($1)] and $i in dict to tolower($i) in dict. It also assumes there's no punctuation to be accounted for, e.g. "I miss my dog." or "my dog's friendly". If that's wrong, then edit your question to provide sample input/output that includes punctuation.
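For reference, here is roughly what the case-insensitive variant could look like with both tolower() changes applied. Treat it as a sketch: the sample files, the temporary directory, and the filter_words_nocase function name are invented for illustration, and the mixed-case input is not from the question.

```shell
# Hypothetical sample data; the mixed-case sentences are made up for the demo.
tmpdir=$(mktemp -d)
printf '%s\n' hello dog lost I miss computer buy > "$tmpdir/dict"
printf '%s\n' 'i MISS my Dog' 'I want to BUY a new computer' > "$tmpdir/file"

# filter_words_nocase DICT SENTENCES (illustrative name)
filter_words_nocase() {
    awk '
        NR==FNR { dict[tolower($1)]; next }   # store dictionary words lowercased
        {
            sent = ""
            for (i=1; i<=NF; i++) {
                if (tolower($i) in dict) {    # compare lowercased, print as written
                    sent = (sent=="" ? "" : sent OFS) $i
                }
            }
            print sent
        }
    ' "$1" "$2"
}

filter_words_nocase "$tmpdir/dict" "$tmpdir/file"
```

Words are printed with their original capitalization; only the lookup is lowercased.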
CodePudding user response:
This is the basic algorithm as pseudocode. I would suggest trying to implement it in AWK, using its control statements:
if (condition) statement [ else statement ]
while (condition) statement
do statement while (condition)
for (expr1; expr2; expr3) statement
for (var in array) statement
break
continue