python regex where a set of options can occur at most once in a list, in any order

I'm wondering if there's any way in Python or Perl to build a regex where you can define a set of options that can each appear at most once, in any order. So for example I would like a derivative of foo(?: [abc])*, where a, b, c could only appear once. So:

foo a b c
foo b c a
foo a b
foo b

would all be valid, but

foo b b

would not be

CodePudding user response：

You may use this regex with a capture group and a negative lookahead:

^foo((?!.*\1) [abc])+$

RegEx Details:

^: Start
foo: Match foo
(: Start a capture group #1
- (?!.*\1): Negative lookahead to assert that we don't match what we have in capture group #1 anywhere in input
- [abc]: Match a space followed by a or b or c
)+: End capture group #1. Repeat this group 1 or more times
$: End
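
A note for Python users, offered as a hedged sketch rather than a correction: as far as I know, Python's built-in re module refuses a backreference to a group that is still open, so the pattern above may only compile in PCRE-style engines such as Perl's. The snippet below tries the pattern as given and, if re rejects it, falls back to a variant where the capture sits inside the lookahead. FALLBACK is my own assumption, not part of the answer.

```python
import re

# Pattern exactly as given in the answer.
ORIGINAL = r"^foo((?!.*\1) [abc])+$"
# Hypothetical fallback: capture one " x" token inside the lookahead and
# reject the input if the same token shows up again later.
FALLBACK = r"^foo(?!.*( [abc]).*\1)(?: [abc])+$"

try:
    pattern = re.compile(ORIGINAL)
except re.error as exc:
    print(f"re rejected the original pattern: {exc}")
    pattern = re.compile(FALLBACK)

for line in ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]:
    print(f"{line!r}: {bool(pattern.match(line))}")
```

Either way, the four valid inputs from the question should match and foo b b should not.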

CodePudding user response：

You can assert that no space-and-letter pair occurs a second time further to the right:

foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*

foo Match literally
(?! Negative lookahead
- (?: [abc])* Match optional repetitions of a space and a, b or c
- ( [abc]) Capture group 1, compared against the backreference below
- (?: [abc])* Match again optional repetitions of a space and a, b or c
- \1 Backreference to group 1
) Close lookahead
(?: [abc])* Match optional repetitions of a space and a, b or c

If you don't want to match foo on its own, you can change the final quantifier to 1 or more: (?: [abc])+
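
A minimal sketch of how this pattern could be exercised in Python. The pattern carries no anchors, so re.fullmatch is used here to require that the whole line is consumed; that choice, and the names PATTERN and is_valid, are assumptions for illustration.

```python
import re

# The lookahead fails as soon as some " x" token (group 1) is seen twice.
PATTERN = re.compile(r"foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*")

def is_valid(line):
    """Return True when the whole line matches with no repeated option."""
    return PATTERN.fullmatch(line) is not None

for line in ["foo a b c", "foo b c a", "foo a b", "foo b", "foo", "foo b b"]:
    print(f"{line!r}: {is_valid(line)}")
```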

CodePudding user response：

You can do it using references to previously captured groups.

foo(?: ([abc]))?(?: (?!\1)([abc]))?(?: (?!\1|\2)([abc]))?$

This gets quite long with many options. Such a regex can be generated dynamically, if necessary.

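def match_sequence_without_repeats(options, separator):
    """Build one optional group per option; each forbids earlier captures."""
    def prevent_previous(n):
        # No lookahead is needed before the first group.
        if n == 0:
            return ""
        # Alternate the backreferences (\1|\2|...), as in the regex above.
        groups = "|".join(rf"\{i}" for i in range(1, n + 1))
        return f"(?!{groups})"

    return "".join(
        f"(?:{separator}{prevent_previous(i)}([{options}]))?"
        for i in range(len(options))
    )


print(f"foo{match_sequence_without_repeats('abc', ' ')}$")
# foo(?: ([abc]))?(?: (?!\1)([abc]))?(?: (?!\1|\2)([abc]))?$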
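
To check that a generated pattern behaves as intended, something along these lines could be used. It is a sketch under stated assumptions: the options are plain single letters (nothing is escaped), the prefix contains no capture groups of its own, and the backreferences are joined with | as in the hand-written regex. The helper name build_pattern is hypothetical.

```python
import re

def build_pattern(prefix, options, separator=" "):
    """Compile prefix + one optional group per option, each barring repeats."""
    parts = []
    for i in range(len(options)):
        # Backreferences to every group that precedes this one: \1|\2|...
        earlier = "|".join(rf"\{k}" for k in range(1, i + 1))
        guard = f"(?!{earlier})" if earlier else ""
        parts.append(f"(?:{separator}{guard}([{options}]))?")
    return re.compile(prefix + "".join(parts) + "$")

pattern = build_pattern("foo", "abc")
for line in ["foo a b c", "foo b c a", "foo b", "foo b b", "foo a b a"]:
    print(f"{line!r}: {bool(pattern.match(line))}")
```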

CodePudding user response：

I have assumed that the elements of the string can be in any order and that each element of interest may appear at most once. For example, 'a foo' should match and 'a foo b foo' should not.

You can do that with a series of alternations employing lookaheads, one for each substring of interest, but it becomes a bit of a dog's breakfast when there are many strings to consider. Let's suppose you wanted to match zero or one "foo"'s and/or zero or one "a"'s. You could use the following regular expression:

^(?:(?!.*\bfoo\b)|(?=(?:(?!\bfoo\b).)*\bfoo\b(?!(.*\bfoo\b))))(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))

Start your engine!

This matches, for example, 'foofoo', 'aa' and 'afooa'. If they are not to be matched, remove the word boundaries (\b).

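Notice that this expression begins by asserting the start of the string (^), followed by two non-capturing groups, one for 'foo' and one for 'a'. Each group is an alternation of a negative lookahead (the word is absent) and a positive lookahead (the word occurs exactly once). To also check for, say, 'c' one would tack on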

(?:(?!.*\bc\b)|(?=(?:(?!\bc\b).)*\bc\b(?!(.*\bc\b))))

which is the same as

(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))

with \ba\b changed to \bc\b.
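
Since each word contributes the same fragment, the expression can be assembled in a loop. The following is only a sketch: the helper names at_most_once and build are made up for illustration, and re.escape is added on the assumption that the words might contain regex metacharacters.

```python
import re

def at_most_once(word):
    """Fragment asserting that `word` occurs zero times or exactly once."""
    w = re.escape(word)
    return (rf"(?:(?!.*\b{w}\b)"
            rf"|(?=(?:(?!\b{w}\b).)*\b{w}\b(?!(.*\b{w}\b))))")

def build(words):
    # The fragments only look ahead, so they all start from position 0.
    return re.compile("^" + "".join(at_most_once(w) for w in words))

pattern = build(["foo", "a"])
for text in ["a foo", "a foo b foo", "foofoo", "aa", "afooa", "a a"]:
    print(f"{text!r}: {bool(pattern.match(text))}")
```

Adding 'c' to the list passed to build gives the three-word version described above.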

It would be nice to be able to use back-references but I don't see how that could be done.

By hovering over the regular expression in the link an explanation is provided for each element of the expression. (If this is not clear I am referring to the cursor.)

Note that

(?!\bfoo\b).

matches a character provided it does not begin the word 'foo'. Therefore

(?:(?!\bfoo\b).)*

matches a substring that does not contain 'foo' and does not end with 'f' followed by 'oo'.

Would I advocate this approach in practice, as opposed to using simple string methods? Let me ponder that.
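
For comparison, here is roughly what the simple string-method route might look like under the same assumption (each word of interest at most once, anywhere in the string). It is a sketch, not a drop-in equivalent: it splits on whitespace, so punctuation is handled differently from \b, and the function name and default words are hypothetical.

```python
from collections import Counter

def each_at_most_once(text, words=("foo", "a")):
    """True when none of `words` appears more than once as a whole token."""
    counts = Counter(text.split())
    # Counter returns 0 for tokens it has never seen.
    return all(counts[w] <= 1 for w in words)

for text in ["a foo", "a foo b foo", "foofoo", "a a"]:
    print(f"{text!r}: {each_at_most_once(text)}")
```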

CodePudding user response：

If the order of the strings doesn't matter, and you want to make sure every string occurs only once, you can turn the list into a set in Python:

my_lst = ['a', 'a', 'b', 'c']
my_set = set(my_lst)

print(my_set)
# {'a', 'c', 'b'} (element order may vary)
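
Converting to a set drops duplicates rather than rejecting them. To validate a line like the ones in the question, the set can instead be compared against the original list. A minimal sketch, assuming a single space as separator; is_valid and its default arguments are illustrative names only.

```python
def is_valid(line, prefix="foo", allowed=frozenset("abc")):
    """True when the line is the prefix followed by distinct allowed options."""
    head, *opts = line.split(" ")
    unique = set(opts)
    # No unknown options, and no option lost when duplicates are removed.
    return head == prefix and unique <= allowed and len(unique) == len(opts)

for line in ["foo a b c", "foo b c a", "foo b", "foo b b", "foo d"]:
    print(f"{line!r}: {is_valid(line)}")
```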