Home > Net >  Regex to match substrings containing n non repeted characters
Regex to match substrings containing n non repeted characters

Time:12-06

I am facing a (naive) problem with regular expression. I need to find any substrings composed of a fixed number (n) of different characters.

So, for "aaabcddd", if n=3 the substrings that I expect to find are: "abc" and "bcd".

My idea is to use n-1 capture groups and '[^' to exclude characters already matched. Thus, I wrote the following Perl regex (in Julia):

r"(([[:alpha:]])[^\2])[^\1]"

But is not working.

Do you have any tips?

CodePudding user response:

You can not use a backreference to a capture group using a negated character class [^\1]

What you can do is use a negative lookahead to assert what is directly to the right of the current position is not what you have already captured in a previous group.

If that is the case, capture a single alpha in a new group.

The matches abc and bcd are in capture group 1

(?=(([[:alpha:]])(?!\2)([[:alpha:]])(?!\3|\2)[[:alpha:]]))
  • (?= Positive lookahead
    • ( Capture group 1
      • ([[:alpha:]]) Capture the first char in group 2
      • (?!\1)([[:alpha:]]) If not looking at what is captured by group 2 to the right, capture the second char in group 3
      • (?!\2|\1) If not looking to the right at what is captured by group 2 or 3
      • [[:alpha:]] Mach the 3rd char
    • ) Close group 1
  • ) Close the lookahead

Regex demo

Or a bit shorter using a case insensitive match:

(?=(([a-z])(?!\2)([a-z])(?!\3|\2)[a-z]))

CodePudding user response:

Here is a solution to an arbitrary value of n characters:

#!/usr/local/bin/perl

use strict; use warnings; use feature ':5.10';

my $s="aaabcded";
my $n=3;

while ($s=~/(?=([[:alpha:]]{$n}))/g){
    my $hit=$1;
    my @chars = split //, $hit;  
    my %uniq; 
    @uniq{@chars} = ();
    say "$hit" if (scalar keys %uniq) == $n;
}

Running with $n=3 prints:

abc
bcd
cde

Running with $n=4 prints:

abcd
bcde

And $n=5:

abcde
  • Related