Negating bracketed character classes in Perl regular expressions and grep

Time:11-02

I'm attempting to solve a very simple problem - find strings in an array which only contain certain letters. However, I've run up against something in the behavior of regular expressions and/or grep that I don't get.

#!/usr/bin/perl

use warnings;
use strict;

my @test_data = qw(ant bee cat dodo elephant giraffe horse);

# Words wanted include these letters only. Hardcoded for demonstration purposes
my @wanted_letters = qw/a c d i n o t/;

# Subtract those letters from the alphabet to find the letters to eliminate.
# Interpolate array into a negated bracketed character class, positive grep
# against a list of the lowercase alphabet: fine, gets befghjklmpqrsuvwxyz.
my @unwanted_letters = grep(/[^@wanted_letters]/, ('a' .. 'z'));

# The desired result can be simulated by hardcoding the unwanted letters into a
# bracketed character class then doing a negative grep: matches ant, cat, and dodo.
my @works = grep(!/[befghjklmpqrsuvwxyz]/, @test_data);

# Doing something similar but moving the negation into the bracketed character
# class fails: it matches every word except bee.
my @fails1 = grep(/[^befghjklmpqrsuvwxyz]/, @test_data);

# Doing the same thing that produced the array of unwanted letters also fails.
my @fails2 = grep(/[^@unwanted_letters]/, @test_data);

print join ' ', @works; print "\n";
print join ' ', @fails1; print "\n";
print join ' ', @fails2; print "\n";

Questions:

  • Why does @works get the correct result but not @fails1? The grep docs suggest the former, and the negation section of perlrecharclass suggests the latter, although it uses =~ in its example. Is this something specifically to do with using grep?
  • Why does @fails2 not work? Is it something to do with array vs list context? It otherwise looks the same as the subtraction step.
  • Besides that, is there a pure regex way to achieve this that avoids the subtraction step?

CodePudding user response:

Your pattern matches as soon as any one character in the string is outside the character set. The string can still have characters from the set elsewhere. For instance, if the test word is elephant, the negated character class matches the a.

If you want to test the whole string, you need to quantify it and anchor to the ends.

grep(/^[^befghjklmpqrsuvwxyz]*$/, @test_data);

Translated into English, it's the difference between "word contains no characters in the set" and "word contains a character not in the set".
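The distinction can be checked with a minimal sketch that runs all three forms against the question's test data. The array names (@none_in_set, @some_outside, @all_outside) are illustrative only and do not come from the original post.

```perl
use strict;
use warnings;

my @test_data = qw(ant bee cat dodo elephant giraffe horse);

# "Word contains no character in the set": negate the whole match.
my @none_in_set = grep { !/[befghjklmpqrsuvwxyz]/ } @test_data;

# "Word contains a character not in the set": one such character
# anywhere in the word is enough for a match.
my @some_outside = grep { /[^befghjklmpqrsuvwxyz]/ } @test_data;

# Anchored and quantified: every character must be outside the set.
my @all_outside = grep { /^[^befghjklmpqrsuvwxyz]*$/ } @test_data;

print "none in set:  @none_in_set\n";
print "some outside: @some_outside\n";
print "all outside:  @all_outside\n";
```

Note that with * an empty string would also pass the anchored test, since zero characters trivially satisfy it.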

CodePudding user response:

Both are fixed by adding the anchors ^ and $ and the quantifier +.

These both work:

my @fails1 = grep(/^[^befghjklmpqrsuvwxyz]+$/, @test_data);
my @fails2 = grep(/^[^@unwanted_letters]+$/, @test_data);

Keep in mind that /[^befghjklmpqrsuvwxyz]/ or /[^@unwanted_letters]/ matches only ONE character. Adding + means one or more, as many as possible. Adding ^ and $ means the match must run from the start to the end of the string.

With /[@wanted_letters]/ you get a match if the string has a single wanted character (even alongside unwanted characters): the logical equivalent of any. Compare /^[@wanted_letters]+$/, where every letter must be in the set of @wanted_letters: the equivalent of all.

Demo1: the class matches only ONE character, so grep fails.

Demo2: the quantifier allows more than one character, but there is no anchor, so grep still fails.

Demo3: anchors and quantifier together give the expected result.
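The three stages can be reproduced locally with a small sketch that captures what each pattern actually matches in the word elephant. The variable names ($one, $run, $whole) are illustrative and are not taken from the linked demos.

```perl
use strict;
use warnings;

my $word = 'elephant';

# One character: the first letter outside the set is enough for a match.
my ($one) = $word =~ /([^befghjklmpqrsuvwxyz])/;

# Quantifier, no anchors: a run of such letters anywhere in the word.
my ($run) = $word =~ /([^befghjklmpqrsuvwxyz]+)/;

# Anchors and quantifier: the run must span the whole word.
my $whole = $word =~ /^[^befghjklmpqrsuvwxyz]+$/ ? 'match' : 'no match';

print "one character: $one\n";      # a
print "unanchored +:  $run\n";      # ant
print "anchored +:    $whole\n";    # no match
```

The first two patterns succeed on a substring, which is why grep keeps elephant; only the anchored form rejects it.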

Once you understand that a character class matches only ONE character, that the anchors tie the match to the WHOLE string, and that the quantifier extends the match until it reaches the anchors, you can grep directly with just the wanted letters:

my @wanted = grep(/^[@wanted_letters]+$/, @test_data);
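On the question of array versus list context: context is not the issue. An array interpolated into a pattern is joined with the list separator $" (a single space by default), so the class also contains a space. That is harmless for this data, but joining the letters explicitly avoids it. The following is a sketch only; $class and $only_wanted are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

my @test_data      = qw(ant bee cat dodo elephant giraffe horse);
my @wanted_letters = qw/a c d i n o t/;

# Interpolated arrays are joined with $" (a single space by default),
# so [@wanted_letters] is really [a c d i n o t], space included.
print "interpolated: '@wanted_letters'\n";

# Join the letters explicitly; quotemeta guards against characters
# such as ] or ^ that are special inside a bracketed class.
my $class       = join '', map { quotemeta($_) } @wanted_letters;
my $only_wanted = qr/^[$class]+$/;

my @wanted = grep { $_ =~ $only_wanted } @test_data;
print "wanted: @wanted\n";
```

Precompiling the pattern with qr// keeps the grep readable and means the class is built once rather than inside the loop.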