I am trying to find a way to detect "," and "or" in a string even if they are repeated. So even a string such as "one , , or or, two" with re.split() should return "one" and "two".
So far this is what I have (Using Python 3.10):
import re
pattern = re.compile(r"(?:\s*,\s*or\s*|\s*,\s*|\s+or\s+)+", flags=re.I)
string = "one,two or three , four or five or , or six , oR , seven, ,,or, ,, eight or qwertyor orqwerty,"
result = re.split(pattern, string)
print(result)
which returns:
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'qwertyor orqwerty', '']
My issue so far is that if I have consecutive "or"s, my pattern will only recognize every other "or". For example:
string = "one or or two"
>>> ['one', 'or two']
string = "one or or or two"
>>> ['one', 'or', 'two']
Notice that in the first example the second element contains "or", and in the second example "or" is an element by itself.
Is there a way to get around this? Also, if there is a better way of separating these strings, suggestions would be greatly appreciated.
CodePudding user response:
You can use
import re
text = "one,two or three , four or five or , or six , oR , seven, ,,or, ,, eight or qwertyor orqwerty,"
print( re.split(r'(?:\s*(?:,|\bor\b))+\s*', text.rstrip().rstrip(',')) )
# => ['one', 'two', 'three', 'four', 'five', 'six', 'oR', 'seven', 'eight', 'qwertyor orqwerty']
Details:
(?:\s*(?:,|\bor\b))+ - one or more repetitions of
    \s* - zero or more whitespaces
    (?:,|\bor\b) - either a comma or a whole word "or"
\s* - zero or more whitespaces.
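To illustrate why this handles consecutive "or"s, here is a small comparison sketch. The names whitespace_or and boundary_or are made up for this illustration, and whitespace_or is a reduced form of the \s+or\s+ alternative from the question, not the full original pattern. Because \s+or\s+ consumes the space after one "or", the next "or" has no leading whitespace left to match, whereas \bor\b needs no surrounding whitespace at all.

```python
import re

text = "one or or or two"

# Reduced form of the question's approach: each "or" must consume
# whitespace on both sides, so adjacent "or"s compete for the same space.
whitespace_or = re.compile(r"(?:\s+or\s+)+", flags=re.I)

# Word-boundary approach: whitespace is optional and \b keeps "or" a whole word.
boundary_or = re.compile(r"(?:\s*(?:,|\bor\b))+\s*", flags=re.I)

# Two separate matches; the middle "or" (text[7:9]) is left over as an item.
print([m.span() for m in whitespace_or.finditer(text)])  # [(3, 7), (9, 13)]

# One match covering the whole run of separators.
print([m.span() for m in boundary_or.finditer(text)])    # [(3, 13)]
```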
Note the use of non-capturing groups. This is crucial since you are using the pattern in re.split, which would otherwise include the text matched by each capturing group in the result.
Also, note the text.rstrip().rstrip(',') so that there is no trailing empty item in the result.
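As a possible variation, the same pattern can be wrapped in a small helper. This is only a sketch: the names SEPARATOR and split_items are hypothetical, it adds re.IGNORECASE (as in the question) so that "oR" is also treated as a separator, and it filters out empty items instead of stripping a trailing comma, which is a different design choice from the rstrip approach above.

```python
import re

# One or more commas or whole-word "or", with optional surrounding whitespace.
# All groups are non-capturing so re.split() does not return the separators.
SEPARATOR = re.compile(r"(?:\s*(?:,|\bor\b))+\s*", flags=re.IGNORECASE)


def split_items(text):
    """Split on runs of ',' / 'or' and drop empty items at either end."""
    return [item for item in SEPARATOR.split(text.strip()) if item]


print(split_items("one or or two"))       # ['one', 'two']
print(split_items("one , , or or, two"))  # ['one', 'two']
print(split_items("six , oR , seven,"))   # ['six', 'seven']
```

Filtering empty items also covers input that starts with a separator, which rstrip alone would not.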
CodePudding user response:
Does Python support the word boundary assertion \b? If so, you could probably simplify the regular expression to something along the following lines:
\s*((,|\bor\b)\s*)+
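Python's re module does support \b. One caveat when using this form with re.split: the parentheses here are capturing groups, and re.split returns the text of capturing groups along with the pieces. The sketch below compares it with a non-capturing variant; the variable names are made up for the example.

```python
import re

text = "one or or two"

# Capturing groups: the last text matched by each group leaks into the result.
capturing = r"\s*((,|\bor\b)\s*)+"
print(re.split(capturing, text))      # ['one', 'or ', 'or', 'two']

# Same pattern with (?:...) groups: only the items are returned.
non_capturing = r"\s*(?:(?:,|\bor\b)\s*)+"
print(re.split(non_capturing, text))  # ['one', 'two']
```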