I am trying to exclude certain words from a dictionary file.
# cat en.txt
test
testing
access/p
batch
batch/n
batches
cross
# cat exclude.txt
test
batch
# grep -vf exclude.txt en.txt
access/p
cross
Words like "testing" and "batches" should be included in the results.
Expected result:
testing
access/p
batches
cross
The word "batch" may or may not be followed by a slash "/", and there can be one or more tags after the slash ("n" in this case). But "batches" is a different word and should not match "batch".
CodePudding user response:
I would use GNU AWK for this task in the following way. Let the content of en.txt be
test
testing
access/p
batch
batch/n
batches
cross
and the content of exclude.txt be
test
batch
then
awk 'BEGIN{FS="/"}FNR==NR{arr[$1];next}!($1 in arr)' exclude.txt en.txt
gives output
testing
access/p
batches
cross
Explanation: I inform GNU AWK that / is the field separator (FS). While processing the first file (where the global row number equals the row number within the file, that is, FNR==NR), I simply use the 1st column value as a key in array arr and then go to the next line, so nothing else happens. For the 2nd file (and any following files, if present), I select lines whose 1st column is not (!) one of the keys of array arr.
(tested in GNU Awk 5.0.1)
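For readers who prefer the one-liner spelled out, here is a minimal sketch of the same idea wrapped in a shell function. The function name filter_dict, the array name excluded, and the temporary directory are only illustrative choices, not part of the original command. One caveat with the FNR==NR idiom: if the exclude file is empty, the condition also holds for the dictionary file, so nothing is printed.

```shell
# Hypothetical helper: filter_dict EXCLUDE_FILE DICT_FILE
filter_dict() {
  awk -F/ '
    # First file (exclude list): FNR equals NR only while reading it.
    FNR == NR { excluded[$1]; next }   # remember the word, print nothing
    # Remaining files: keep the line unless its part before "/" is excluded.
    !($1 in excluded)
  ' "$1" "$2"
}

# Sample data copied from the question.
tmp=$(mktemp -d)
printf '%s\n' test testing access/p batch batch/n batches cross > "$tmp/en.txt"
printf '%s\n' test batch > "$tmp/exclude.txt"

filter_dict "$tmp/exclude.txt" "$tmp/en.txt"
```

With the sample files this should print the four expected lines (testing, access/p, batches, cross).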
CodePudding user response:
Since many words in a dictionary may share a root with one of the words to exclude, we cannot use a look-up (on a hash built from the exclude list) but have to check all of them. One way to do that more efficiently is to use an alternation pattern built from the exclude list:
use warnings;
use strict;
use feature 'say';
use Path::Tiny; # for convenient read of the file
my $excl_file = 'exclude.txt';
my $re_excl = join '|', map { quotemeta } split /\n/, path($excl_file)->slurp; # escape regex metacharacters in each word
$re_excl = qr($re_excl);
while (<>) {
    if ( m{^ $re_excl (?:/.)? $}x ) {
        # say "Skip printing (so filter out): $_";
        next;
    }
    print;  # $_ still ends with its newline, so print rather than say
}
This is used as program.pl dictionary-filename and it prints the filtered list.
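The same alternation idea can also be tried straight from the shell. This is only a rough sketch, not the program above: the function name filter_alt is made up, it assumes a non-empty exclude list whose words contain no regex metacharacters, and it allows any tags after the slash with (/.*)? instead of a single character.

```shell
# Hypothetical helper: filter_alt EXCLUDE_FILE DICT_FILE
filter_alt() {
  # Join the exclude words with "|" into one alternation, e.g. test|batch.
  alt=$(paste -s -d '|' "$1")
  # -x matches whole lines only, -v inverts the match, -E enables alternation.
  # "(/.*)?" permits an optional slash followed by any tags.
  grep -vxE "($alt)(/.*)?" "$2"
}

# Sample data copied from the question.
tmp=$(mktemp -d)
printf '%s\n' test testing access/p batch batch/n batches cross > "$tmp/en.txt"
printf '%s\n' test batch > "$tmp/exclude.txt"

filter_alt "$tmp/exclude.txt" "$tmp/en.txt"
```

With the sample files this should print testing, access/p, batches and cross, one per line.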
Here I've assumed that what may follow the root word to exclude is / followed by one character, (?:/.)?, since the examples use that and there is no precise statement on it. The pattern also assumes no spaces around the word.
Please adjust as/if needed for what may actually follow /. For example, it'd be (?:/.+)? for at least one character, (?:/[np])? for any character from a specific list (n or p), (?:/[^xy]+)? for any characters not in the given list, etc.
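To see how the choice of suffix pattern changes what gets excluded, here is a small experiment with grep -E. It is only a sketch: the word list and the helper name excluded_by are invented for illustration, and since POSIX extended regular expressions have no (?:...) syntax, plain groups are used instead.

```shell
tmp=$(mktemp -d)
# Hypothetical word list covering the interesting cases for the root "batch".
printf '%s\n' batch batch/ batch/n batch/np batches > "$tmp/words.txt"

# Hypothetical helper: show, on one line, the words a suffix pattern would exclude.
excluded_by() {
  grep -xE "batch$1" "$tmp/words.txt" | paste -s -d ' ' -
}

excluded_by '(/.)?'       # exactly one character after the slash: batch batch/n
excluded_by '(/.+)?'      # one or more characters: batch batch/n batch/np
excluded_by '(/[np])?'    # a single n or p: batch batch/n
excluded_by '(/[^xy]+)?'  # one or more characters other than x or y: batch batch/n batch/np
```

In every variant "batches" and the bare "batch/" are left alone, because the whole line has to match.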