Exclude words that may or may not end with a slash

Time:12-03

I am trying to exclude certain words from a dictionary file.

# cat en.txt
test
testing
access/p
batch
batch/n
batches
cross

# cat exclude.txt
test
batch

# grep -vf exclude.txt en.txt
access/p
cross

However, words like "testing" and "batches" should be kept in the results.

Expected result:
testing
access/p
batches
cross

This is because the word "batch" may or may not be followed by a slash "/". There can be one or more tags after the slash (n in this case). The word "batches", however, is a different word and should not match "batch".

CodePudding user response:

I would use GNU AWK for this task in the following way. Let the content of en.txt be

test
testing
access/p
batch
batch/n
batches
cross

and the content of exclude.txt be

test
batch

then

awk 'BEGIN{FS="/"}FNR==NR{arr[$1];next}!($1 in arr)' exclude.txt en.txt

gives the output

testing
access/p
batches
cross

Explanation: I tell GNU AWK that / is the field separator (FS). While processing the first file (where the global record number equals the record number within the file, that is FNR==NR), I simply use the value of the 1st field as a key in the array arr and move on to the next line, so nothing else happens. For the 2nd file (and any following files, if present), I select the lines whose 1st field is not (!) one of the keys of arr.

(tested in GNU Awk 5.0.1)
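
To make the above easy to reproduce, here is a minimal sketch that recreates the two sample files in a scratch directory and runs the same awk command on them. The names workdir and awk_result are only illustrative and are not part of the question.

```shell
# Recreate the sample files from the question in a scratch directory
workdir=$(mktemp -d)
printf '%s\n' test testing access/p batch batch/n batches cross > "$workdir/en.txt"
printf '%s\n' test batch > "$workdir/exclude.txt"

# Pass 1 (exclude.txt): remember each word as an array key.
# Pass 2 (en.txt): print only lines whose part before the first "/" is not a key.
awk_result=$(awk 'BEGIN{FS="/"} FNR==NR{arr[$1]; next} !($1 in arr)' \
    "$workdir/exclude.txt" "$workdir/en.txt")

printf '%s\n' "$awk_result"
rm -rf "$workdir"
```

One caveat: if exclude.txt is empty, FNR==NR is also true for every line of en.txt, so nothing is printed at all.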

CodePudding user response:

Since many words in a dictionary may share a root with one of the words to exclude, a plain look-up of the whole line (in a hash built from the exclude list) will not work; each line has to be checked against all of the exclude words. One way to do that more efficiently is to use an alternation pattern built from the exclude list

use warnings;
use strict;
use feature 'say';
use Path::Tiny;  # for convenient read of the file

my $excl_file = 'exclude.txt';

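# Build one alternation from the exclude list; quotemeta keeps any regex
# metacharacters in the words literal, and grep drops empty lines
my $re_excl = join '|', map { quotemeta } grep { length } split /\n/, path($excl_file)->slurp;
$re_excl = qr($re_excl);

while (<>) {
    chomp;  # strip the newline, otherwise say would print a second one
    if ( m{^ $re_excl (?:/.)? $}x )  {
        # say "Skip printing (so filter out): $_";
        next;
    }
    say;
}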

Run it as program.pl dictionary-filename; it prints the filtered list.

Here I've assumed that what may follow the root-word to exclude is / followed by one character, (?:/.)?, since examples use that and there is no precise statement on it. The pattern also assumes no spaces around the word.

Please adjust as needed for what may actually follow the /. For example, it would be (?:/.+)? for at least one character, (?:/[np])? for a single character from a specific list (n or p), (?:/[^xy]+)? for any characters not in the given list, etc.
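
For comparison, the same alternation idea can be sketched in shell with grep -E instead of Perl. This is only a sketch under two assumptions that the question does not confirm: the exclude words contain no regex metacharacters, and anything (one or more characters) may follow the slash. The names workdir, alternation and grep_result are illustrative.

```shell
# Recreate the sample files from the question in a scratch directory
workdir=$(mktemp -d)
printf '%s\n' test testing access/p batch batch/n batches cross > "$workdir/en.txt"
printf '%s\n' test batch > "$workdir/exclude.txt"

# Join the exclude words into a single alternation: test|batch
alternation=$(paste -sd'|' "$workdir/exclude.txt")

# Drop lines that are exactly an excluded word, optionally followed by
# "/" and at least one more character; everything else is kept.
grep_result=$(grep -Ev "^(${alternation})(/.+)?\$" "$workdir/en.txt")

printf '%s\n' "$grep_result"
rm -rf "$workdir"
```

Anchoring with ^ and $ is what keeps "testing" and "batches" in the output, since they are no longer matched as substrings.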
