I am facing a (naive) problem with regular expression. I need to find any substrings composed of a fixed number (n) of different characters.
So, for "aaabcddd", if n=3 the substrings that I expect to find are: "abc" and "bcd".
My idea is to use n-1 capture groups and '[^' to exclude characters already matched. Thus, I wrote the following Perl regex (in Julia):
r"(([[:alpha:]])[^\2])[^\1]"
But, it is not working.
Do you have any tips?
CodePudding user response:
You can not use a backreference to a capture group using a negated character class [^\1]
What you can do is use a negative lookahead to assert what is directly to the right of the current position is not what you have already captured in a previous group.
If that is the case, capture a single alpha in a new group.
The matches abc
and bcd
are in capture group 1
(?=(([[:alpha:]])(?!\2)([[:alpha:]])(?!\3|\2)[[:alpha:]]))
(?=
Positive lookahead(
Capture group 1([[:alpha:]])
Capture the first char in group 2(?!\1)([[:alpha:]])
If not looking at what is captured by group 2 to the right, capture the second char in group 3(?!\2|\1)
If not looking to the right at what is captured by group 2 or 3[[:alpha:]]
Mach the 3rd char
)
Close group 1
)
Close the lookahead
Or a bit shorter using a case insensitive match:
(?=(([a-z])(?!\2)([a-z])(?!\3|\2)[a-z]))
CodePudding user response:
Here is a solution to an arbitrary value of n
characters:
#!/usr/local/bin/perl
use strict; use warnings; use feature ':5.10';
my $s="aaabcded";
my $n=3;
while ($s=~/(?=([[:alpha:]]{$n}))/g){
my $hit=$1;
my @chars = split //, $hit;
my %uniq;
@uniq{@chars} = ();
say "$hit" if (scalar keys %uniq) == $n;
}
Running with $n=3
prints:
abc
bcd
cde
Running with $n=4
prints:
abcd
bcde
And $n=5
:
abcde