Repeat entire group 0 or more times (one or more words separated by +'s)

Time:01-04

I am trying to match words separated by the + character, entered by a user in Python, and check whether each of the words is in a predetermined list. I am having trouble creating a regular expression to match these words (a word consists of one or more A-z characters). For example, the input string foo should match, as should foo+bar and foo+bar+baz, with each of the words (not the +'s) being captured.

So far, I have tried a few regular expressions but the closest I have got is this:

/^([A-z]+)\+([A-z]+)$/

However, this only matches the case in which there are exactly two words separated by a +, whereas I need it to accept one or more words. My approach would work if I could somehow repeat the second group (\+([A-z]+)) zero or more times. Hence my question: how can I repeat a capturing group zero or more times?
If there is a better way to do what I am doing, please let me know.

CodePudding user response:

You could write the pattern as:

(?i)[A-Z]+(?:\+[A-Z]+)*$

Explanation

  • (?i) Inline modifier for case-insensitive matching
  • [A-Z]+ Match 1+ chars A-Z
  • (?:\+[A-Z]+)* Optionally repeat matching a + and again 1+ chars A-Z
  • $ End of string
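
As a side note on the original question, a repeated capturing group in Python's re keeps only its last repetition, which is one reason to validate with a non-capturing group and then split. A minimal sketch, assuming the + separator (the helper name split_words is hypothetical, and [A-Za-z] is used here instead of the (?i) flag):

```python
import re

# A repeated capturing group only remembers its final repetition,
# so "bar" is lost here even though the whole string matches.
m = re.fullmatch(r"([A-Za-z]+)(?:\+([A-Za-z]+))*", "foo+bar+baz")
print(m.groups())  # ('foo', 'baz')

def split_words(text):
    """Return the list of words if text is word(+word)*, else None."""
    # Validate the overall shape first, then split on the separator.
    if re.fullmatch(r"[A-Za-z]+(?:\+[A-Za-z]+)*", text):
        return text.split("+")
    return None

print(split_words("foo+bar+baz"))  # ['foo', 'bar', 'baz']
print(split_words("foo+"))         # None
```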

See a regex101 demo for the matches:

For example

import re

predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
# re.match already anchors at the start, so only $ is needed
pattern = r"(?i)[A-Z]+(?:\+[A-Z]+)*$"

for s in strings:
    m = re.match(pattern, s)
    if m:
        # The string is well formed, so split it into its words
        words = m.group().split("+")
        # True if at least one word is in the predetermined list
        intersect = bool(set(words) & set(predeterminedList))
        fmt = ','.join(predeterminedList)
        if intersect:
            print(f"'{s}' contains at least one of '{fmt}'")
        else:
            print(f"'{s}' contains none of '{fmt}'")

Another option could be to create a dynamic pattern listing the alternatives:

(?i)^(?:[A-Z]+\+)*(?:foo|bar)(?:\+[A-Z]+)*$

Example

import re

predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
# Build the alternation (foo|bar) from the list and allow other words around it
pattern = rf"(?i)^(?:[A-Z]+\+)*(?:{'|'.join(predeterminedList)})(?:\+[A-Z]+)*$"

for s in strings:
    m = re.match(pattern, s)
    fmt = ','.join(predeterminedList)
    if m:
        print(f"'{s}' contains at least one of '{fmt}'")
    else:
        print(f"'{s}' contains none of '{fmt}'")

Both will output:

'foo' contains at least one of 'foo,bar'
'foo+bar' contains at least one of 'foo,bar'
'foo+bar+baz' contains at least one of 'foo,bar'
'test+abc' contains none of 'foo,bar'
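
The examples above report whether at least one word is in the list. If every word has to be in the list, as the question asks, the dynamic pattern could be tightened so that only listed words are allowed between the separators. A rough sketch, assuming + as the separator (the names allowed, strict and every_word_allowed are made up for illustration):

```python
import re

allowed = ["foo", "bar"]  # hypothetical predetermined list

# re.escape keeps any regex metacharacters in a list entry literal
word = "|".join(re.escape(w) for w in allowed)

# One allowed word, then zero or more "+allowed word" repetitions
strict = re.compile(rf"(?i)(?:{word})(?:\+(?:{word}))*")

def every_word_allowed(text):
    # fullmatch requires the whole string to fit the pattern
    return strict.fullmatch(text) is not None

for s in ["foo", "foo+bar", "foo+bar+baz", "test+abc"]:
    print(s, every_word_allowed(s))
```

Here only foo and foo+bar pass, because baz, test and abc are not in the list.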

CodePudding user response:

NOTE: A-z in your [A-z]+ does not only mean any capital letter from A to Z or any lowercase letter from a to z; it also includes the other characters that fall in that range, namely [ \ ] ^ _ and `. See the ASCII table. I think you mean [A-Za-z]+.
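
A quick way to see the difference is to list the non-letter characters that the range A-z accepts. This is only an illustrative check (the variable name extra is arbitrary):

```python
import re

# Collect every character between "A" and "z" that [A-z] matches
# but that is not actually a letter.
extra = [
    chr(c)
    for c in range(ord("A"), ord("z") + 1)
    if re.fullmatch(r"[A-z]", chr(c)) and not chr(c).isalpha()
]
print(extra)  # ['[', '\\', ']', '^', '_', '`']

# The wider range lets an underscore through; [A-Za-z] does not.
print(bool(re.fullmatch(r"[A-z]+", "foo_bar")))     # True
print(bool(re.fullmatch(r"[A-Za-z]+", "foo_bar")))  # False
```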


Try this regex pattern:

^(?![\s\S]*\+$)(?:[A-Za-z]+\+?)+$
  • ^ start of the string.

  • (?![\s\S]*\+$) ensures the end of the string is not a literal +.

  • (?:[A-Za-z]+\+?)+ non-capturing group: [A-Za-z]+\+? matches one or more letters followed by an optional literal +; this group is repeated at least once.

  • $ end of the string.

See regex demo

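import re
txt = 'foo+bar+baz'
# With no capturing groups, findall returns the whole match (or an empty list)
arr = re.findall(r'^(?![\s\S]*\+$)(?:[A-Za-z]+\+?)+$', txt)

if arr:
    # Split the validated string into its words
    arr = arr[0].split('+')

print(arr)

# Output: ['foo', 'bar', 'baz']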

CodePudding user response:

I would recommend a slightly different approach using lookarounds:

Pattern: (?<=^|\+)(?=foo|baz)[^+]+

Pattern explanation:

(?<=^|\+) - positive lookbehind - assert that the preceding text is either ^ (beginning of string) or + (our 'word delimiter').

(?=foo|baz) - positive lookahead - assert that the following text matches one of the words (from the predefined list)

[^+]+ - match one or more characters other than +
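
Note that Python's built-in re module requires a lookbehind to have a fixed width, and ^ (zero width) and \+ (one character) differ, so this pattern is rejected there as written. One possible adaptation, offered as a sketch and not as the answerer's own pattern, replaces the lookbehind with the negative form (?<![^+]), meaning "not preceded by a non-+ character":

```python
import re

# The variable-width lookbehind is not accepted by the built-in re module.
try:
    re.compile(r"(?<=^|\+)(?=foo|baz)[^+]+")
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("rejected by re:", exc)

# Fixed-width alternative: start of string or a preceding + both satisfy
# "the previous character is not a non-+ character".
pattern = re.compile(r"(?<![^+])(?=foo|baz)[^+]+")

print(pattern.findall("foo+bar+baz"))  # ['foo', 'baz']

# The lookahead only checks how a word starts, so "foobar" is returned too.
print(pattern.findall("foobar+baz"))   # ['foobar', 'baz']
```

If only exact words should be returned, the lookahead would need to check the end of the word as well.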

Regex demo
