I am trying to match words separated by the + character in user input in Python, and check whether each of the words is in a predetermined list. I am having trouble creating a regular expression to match these words (a word is made up of one or more A-z characters). For example, the input string foo should match, as should foo+bar and foo+bar+baz, with each of the words (not the +'s) being captured.
So far, I have tried a few regular expressions, but the closest I have got is this:
/^([A-z+]+)\+([A-z+]+)$/
However, this only matches the case in which there are exactly two words separated by a +; I need it to match one or more words. My approach would work if I could somehow repeat the second group, (\+([A-z+]+)), zero or more times. So my question is: how can I repeat a capturing group zero or more times?
If there is a better way to do what I am doing, please let me know.
CodePudding user response:
You could write the pattern as:
(?i)[A-Z]+(?:\+[A-Z]+)*$
Explanation
(?i) : inline modifier for case-insensitive matching
[A-Z]+ : match 1 or more chars A-Z
(?:\+[A-Z]+)* : optionally repeat matching a literal + followed by 1 or more chars A-Z
$ : end of string
See a regex101 demo for the matches.
For example:
import re

predeterminedList = ["foo", "bar"]
strings = [
    "foo",
    "foo+bar",
    "foo+bar+baz",
    "test+abc"
]
pattern = r"(?i)[A-Z]+(?:\+[A-Z]+)*$"

for s in strings:
    # re.match anchors at the start; the trailing $ anchors the end
    m = re.match(pattern, s)
    if m:
        words = m.group().split("+")
        # True if at least one word is in the predetermined list
        intersect = bool(set(words) & set(predeterminedList))
        fmt = ','.join(predeterminedList)
        if intersect:
            print(f"'{s}' contains at least one of '{fmt}'")
        else:
            print(f"'{s}' contains none of '{fmt}'")
Output
'foo' contains at least one of 'foo,bar'
'foo+bar' contains at least one of 'foo,bar'
'foo+bar+baz' contains at least one of 'foo,bar'
'test+abc' contains none of 'foo,bar'
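The question asks whether each word is in the list, whereas the loop above reports whether at least one is. A minimal sketch of the stricter check, assuming the separator is + and using a hypothetical helper name all_words_allowed that is not part of the original answer:

```python
import re

# Same shape as the pattern above: letters, then zero or more "+letters" groups.
PATTERN = re.compile(r"[A-Z]+(?:\+[A-Z]+)*", re.IGNORECASE)

def all_words_allowed(text, allowed):
    """Return True only if text is well-formed and every word is in allowed."""
    # fullmatch anchors at both ends, so no ^ or $ is needed in the pattern.
    if PATTERN.fullmatch(text) is None:
        return False
    return all(word in allowed for word in text.split("+"))

allowed = {"foo", "bar"}
print(all_words_allowed("foo", allowed))          # True
print(all_words_allowed("foo+bar", allowed))      # True
print(all_words_allowed("foo+bar+baz", allowed))  # False: baz is not in the list
print(all_words_allowed("foo+", allowed))         # False: trailing +
```

Swapping all(...) for any(...) would give the "at least one" behavior of the original loop.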
CodePudding user response:
NOTE:
The A-z in your [A-z+] does not only mean any capital letter from A to Z or any small letter from a to z; it also includes the other characters that fall in that range, namely [ \ ] ^ _ and `. See an ASCII table. I think you mean [A-Za-z+].
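A quick way to check the note above is to list the characters that sit between Z and a and test them against both character classes. This is only an illustrative sketch; the variable name between is made up for the example:

```python
import re

# Characters that sit between 'Z' (code 90) and 'a' (code 97) in ASCII.
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
print(between)  # ['[', '\\', ']', '^', '_', '`']

# [A-z] accepts every one of them; [A-Za-z] accepts none.
print(all(re.fullmatch(r"[A-z]", ch) for ch in between))     # True
print(any(re.fullmatch(r"[A-Za-z]", ch) for ch in between))  # False
```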
Try this regex pattern:
^(?![\s\S]*\+$)(?:[A-Za-z]+\+?)+$
^ : start of the string.
(?![\s\S]*\+$) : negative lookahead; ensures the string does not end with a literal +.
(?:[A-Za-z]+\+?)+ : non-capturing group, repeated one or more times:
[A-Za-z]+\+? : one or more letters followed by an optional literal +.
$ : end of the string.
See regex demo
import re

txt = 'foo+bar+baz'
# With no capturing groups, findall returns the whole match as a single item
arr = re.findall(r'^(?![\s\S]*\+$)(?:[A-Za-z]+\+?)+$', txt)
if arr:
    arr = arr[0].split('+')
print(arr)
# Output: ['foo', 'bar', 'baz']
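Both answers split the matched string instead of reading the words out of capture groups. The likely reason is that, in Python's re module, a capturing group inside a repetition keeps only its last iteration. A small sketch of that behavior, with illustrative variable names:

```python
import re

text = "foo+bar+baz"

# The second group is repeated, but only its final iteration is retained.
m = re.fullmatch(r"([A-Za-z]+)(?:\+([A-Za-z]+))*", text)
print(m.groups())  # ('foo', 'baz')

# Validating first and splitting afterwards recovers every word.
words = text.split("+") if m else []
print(words)  # ['foo', 'bar', 'baz']
```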