I'm wondering if there's any way in Python or Perl to build a regex where you can define a set of options that can each appear at most once, in any order. So for example I would like a derivative of foo(?: [abc])*, where a, b, c could each appear only once. So:
foo a b c
foo b c a
foo a b
foo b
would all be valid, but
foo b b
would not be.
CodePudding user response:
You may use this regex with a capture group and a negative lookahead:
^foo((?!.*\1) [abc])+$
RegEx Details:
^: Start
foo: Match foo
(: Start capture group #1
(?!.*\1): Negative lookahead to assert that we don't match what we have in capture group #1 anywhere in the input
 [abc]: Match a space followed by a or b or c
)+: End capture group #1. Repeat this group 1+ times
$: End
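One caveat if you are on Python rather than Perl: as far as I know, the standard re module refuses a backreference to a group that is still open, which is what \1 inside group 1 is here. A small check, assuming the pattern above with the + quantifier after the group (the variable names are just for illustration):

```python
import re

# Pattern from the answer above; \1 is reached while group 1 is still open.
PATTERN_TEXT = r"^foo((?!.*\1) [abc])+$"

try:
    re.compile(PATTERN_TEXT)
    compiles_in_re = True
except re.error as exc:
    # The standard library parser rejects references to an open group.
    compiles_in_re = False
    print(f"re rejected the pattern: {exc}")

print("compiles with re:", compiles_in_re)
```

If that is a problem, the other answers in this thread use patterns that do not refer to an open group.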
CodePudding user response:
You can assert that no space-and-letter pair occurs a second time further to the right:
foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*
foo  Match literally
(?!  Negative lookahead
(?: [abc])*  Match optional repetitions of a space and a, b or c
( [abc])  Capture group 1, used to compare with a backreference for the same
(?: [abc])*  Match again optional repetitions of a space and either a, b or c
\1  Backreference to group 1
)  Close lookahead
(?: [abc])*  Match optional repetitions of a space and either a, b or c
If you don't want to match only foo, you can change the quantifier to 1 or more: (?: [abc])+
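A quick way to try this in Python is to run the pattern against the examples from the question. This is only a sketch: is_valid is a made-up helper name, and I am assuming you want the whole string to match, hence fullmatch.

```python
import re

# Pattern copied from this answer; fullmatch anchors it at both ends.
PATTERN = re.compile(r"foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*")


def is_valid(text: str) -> bool:
    """Return True if text is 'foo' followed by distinct options a, b, c."""
    return PATTERN.fullmatch(text) is not None


for sample in ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]:
    print(f"{sample!r}: {is_valid(sample)}")
```

Only the last sample, with the repeated b, should come back as False.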
CodePudding user response:
You can do it using references to previously captured groups.
foo(?: ([abc]))?(?: (?!\1)([abc]))?(?: (?!\1|\2)([abc]))?$
This gets quite long with many options. Such a regex can be generated dynamically, if necessary.
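def match_sequence_without_repeats(options, separator):
    """Build a regex fragment matching each option at most once, in any order."""

    def prevent_previous(n):
        # Negative lookahead that rejects anything captured by groups 1..n.
        if n == 0:
            return ""
        groups = "|".join(rf"\{i}" for i in range(1, n + 1))
        return f"(?!{groups})"

    # One optional group per option; each group excludes all earlier captures.
    return "".join(
        f"(?:{separator}{prevent_previous(i)}([{options}]))?"
        for i in range(len(options))
    )


print(f"foo{match_sequence_without_repeats('abc', ' ')}$")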
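To see how the hand-written three-option pattern behaves, here is a small check against the question's examples. It is a sketch only: accepts is a hypothetical helper, and re.match is used so the pattern is anchored at the start while the trailing $ anchors the end.

```python
import re

# The hand-written pattern from this answer, for the options a, b and c.
PATTERN = re.compile(
    r"foo(?: ([abc]))?(?: (?!\1)([abc]))?(?: (?!\1|\2)([abc]))?$"
)


def accepts(text: str) -> bool:
    """Return True if the pattern matches text from its first character."""
    return PATTERN.match(text) is not None


for sample in ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]:
    print(f"{sample!r}: {accepts(sample)}")
```

Because there are exactly three optional groups, anything with a fourth option is rejected as well.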
CodePudding user response:
I have assumed that the elements of the string can be in any order and that each string of interest may appear at most once. For example, 'a foo' should match and 'a foo b foo' should not.
You can do that with a series of alternations employing lookaheads, one for each substring of interest, but it becomes a bit of a dog's breakfast when there are many strings to consider. Let's suppose you wanted to match zero or one "foo"'s and/or zero or one "a"'s. You could use the following regular expression:
^(?:(?!.*\bfoo\b)|(?=(?:(?!\bfoo\b).)*\bfoo\b(?!(.*\bfoo\b))))(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))
This matches, for example, 'foofoo', 'aa' and 'afooa'. If they are not to be matched, remove the word breaks (\b).
Notice that this expression begins by asserting the start of the string (^) followed by two positive lookaheads, one for 'foo' and one for 'a'. To also check for, say, 'c' one would tack on
(?:(?!.*\bc\b)|(?=(?:(?!\bc\b).)*\bc\b(?!(.*\bc\b))))
which is the same as
(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))
with \ba\b changed to \bc\b.
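Since every clause has the same shape, the expression can be assembled from a list of words instead of being typed out. A rough sketch in Python, assuming the words are plain text (they are passed through re.escape); at_most_once_clause and build_pattern are names I made up for illustration:

```python
import re


def at_most_once_clause(word: str) -> str:
    """Clause that succeeds if word occurs zero times or exactly once."""
    w = rf"\b{re.escape(word)}\b"
    # Either the word never occurs, or it occurs and never occurs again.
    return rf"(?:(?!.*{w})|(?=(?:(?!{w}).)*{w}(?!(.*{w}))))"


def build_pattern(words) -> str:
    """Anchor at the start of the string and chain one clause per word."""
    return "^" + "".join(at_most_once_clause(word) for word in words)


pattern = re.compile(build_pattern(["foo", "a", "c"]))

for sample in ["a foo", "a foo b foo", "a foo c", "c a c", "foofoo"]:
    print(f"{sample!r}: {pattern.match(sample) is not None}")
```

The whole expression is zero-width, so a successful match is an empty match at the start of the string; only its presence or absence matters.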
It would be nice to be able to use back-references but I don't see how that could be done.
By hovering over the regular expression in the link an explanation is provided for each element of the expression. (If this is not clear I am referring to the cursor.)
Note that
(?!\bfoo\b).
matches a character provided it does not begin the word 'foo'. Therefore
(?:(?!\bfoo\b).)*
matches a substring that does not contain 'foo' and does not end with 'f' followed by 'oo'.
Would I advocate this approach in practice, as opposed to using simple string methods? Let me ponder that.
CodePudding user response:
If the order of the strings doesn't matter, and you want to make sure every string occurs only once, you can turn the list into a set in Python:
my_lst = ['a', 'a', 'b', 'c']
my_set = set(my_lst)
print(my_set)
# {'a', 'c', 'b'}  (the order of a set may vary)
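Building on the set idea, you can validate a line rather than just deduplicate it by comparing the number of options with the number of distinct options. A minimal sketch, assuming the input looks like the question's examples (a foo prefix and single-space separators); has_unique_options and its parameters are hypothetical names:

```python
def has_unique_options(line, prefix="foo", allowed=frozenset("abc")):
    """Return True if line is the prefix followed by distinct allowed options."""
    head, *options = line.split(" ")
    return (
        head == prefix
        and all(option in allowed for option in options)
        # A set drops duplicates, so equal lengths means no repeats.
        and len(set(options)) == len(options)
    )


for sample in ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]:
    print(f"{sample!r}: {has_unique_options(sample)}")
```

This avoids regex entirely, which may be easier to maintain when the list of options grows.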