Python Regex: Find pattern, even if repeated, with re.split()

Time:05-13

I am trying to find a way to detect "," and "or" as separators in a string, even if they are repeated. So even a string such as "one , , or or, two", when passed to re.split(), should return "one" and "two".

So far this is what I have (Using Python 3.10):

import re

pattern = re.compile(r"(?:\s*,\s*or\s*|\s*,\s*|\s+or\s+)+", flags=re.I)
string = "one,two or three   ,   four   or   five  or , or six , oR   ,  seven, ,,or,   ,, eight or qwertyor orqwerty,"
result = re.split(pattern, string)
print(result)

which returns:

['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'qwertyor orqwerty', '']

My issue so far is that if the string contains consecutive occurrences of "or", my pattern only recognizes every other one. For example:

string = "one or or two"
>>> ['one', 'or two']

string = "one or or or two"
>>> ['one', 'or', 'two']

Notice that in the first example the second element still contains "or", and in the second example "or" is an element by itself.

Is there a way to get around this? Also if there is a better way of separating these strings that would be greatly appreciated as well.
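
A likely explanation for the "every other or" behaviour is that the \s+or\s+ alternative consumes the whitespace on both sides of each "or", so the next "or" has no leading whitespace left to match. The following is a minimal sketch, assuming the pattern above; the helper name show_matches is hypothetical and only there for illustration. It uses re.finditer to list what the pattern actually matches:

```python
import re

# The pattern from the question, with the same IGNORECASE flag.
pattern = re.compile(r"(?:\s*,\s*or\s*|\s*,\s*|\s+or\s+)+", flags=re.I)

def show_matches(text):
    """Return (start index, matched text) for every separator match."""
    return [(m.start(), m.group()) for m in pattern.finditer(text)]

two = show_matches("one or or two")
three = show_matches("one or or or two")

print(two)    # only the first " or " is matched
print(three)  # the first and third " or " are matched, the middle "or" is not
```

Because each match already swallowed the space that the following "or" would need in front of it, that following "or" is left in the output.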

CodePudding user response:

You can use

import re
text = "one,two or three   ,   four   or   five  or , or six , oR   ,  seven, ,,or,   ,, eight or qwertyor orqwerty,"
print( re.split(r'(?:\s*(?:,|\bor\b))+\s*', text.rstrip().rstrip(',')) )
# => ['one', 'two', 'three', 'four', 'five', 'six', 'oR', 'seven', 'eight', 'qwertyor orqwerty']

See the Python demo and the regex demo.

Details:

  • (?:\s*(?:,|\bor\b))+ - one or more repetitions of
    • \s* - zero or more whitespaces
    • (?:,|\bor\b) - either a comma or the whole word "or"
  • \s* - zero or more whitespaces.

Note the use of non-capturing groups. This is crucial because re.split also returns the text matched by any capturing groups in the pattern.
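
To illustrate why the group type matters, here is a small sketch comparing the two. The sample string and variable names are made up for this example:

```python
import re

text = "one, or two"

# Non-capturing group: only the pieces between separators are returned.
non_capturing = re.split(r"(?:\s*(?:,|\bor\b))+\s*", text)

# Capturing group: re.split also inserts the text captured by the group
# (the last repetition of it) into the result list.
capturing = re.split(r"(\s*(?:,|\bor\b))+\s*", text)

print(non_capturing)  # ['one', 'two']
print(capturing)      # ['one', ' or', 'two']
```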

Also, note the text.rstrip().rstrip(',') call, which strips trailing whitespace and trailing commas so that there is no trailing empty item in the result.
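
As an alternative to stripping the input, empty items can be filtered out after the split. The sketch below is one possible variant, not the only way to do it: the names SEPARATOR and split_items are hypothetical, and flags=re.I is an assumption taken from the question so that "oR" is treated as a separator rather than returned as an item.

```python
import re

# Same separator idea as above: one or more of (optional whitespace + "," or
# the whole word "or"), followed by optional whitespace. IGNORECASE is assumed.
SEPARATOR = re.compile(r"(?:\s*(?:,|\bor\b))+\s*", flags=re.I)

def split_items(text):
    """Split on commas and the word 'or', dropping empty items."""
    return [item for item in SEPARATOR.split(text.strip()) if item]

text = "one,two or three   ,   four   or   five  or , or six , oR   ,  seven, ,,or,   ,, eight or qwertyor orqwerty,"
print(split_items(text))
# ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'qwertyor orqwerty']
print(split_items("one or or two"))
# ['one', 'two']
```

Filtering also covers a separator at the start of the string, which would otherwise produce a leading empty item.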

CodePudding user response:

Does Python support the word boundary flag \b? If so, you could probably simplify the regular expression to something along the following lines:

\s*((,|\bor\b)\s*)+
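
Python's re module does support \b as a word boundary, provided the pattern is written as a raw string; in a normal string literal "\b" is a backspace character. A short sketch of both points follows. The last pattern is a variant of the expression above with the groups made non-capturing, which is an adjustment for re.split and not part of the original suggestion:

```python
import re

# Raw string: \b reaches the regex engine and means "word boundary".
raw_match = re.search(r"\bor\b", "one or two")

# Normal string: "\b" is a backspace character, so nothing matches here.
plain_match = re.search("\bor\b", "one or two")

print(raw_match.span())  # (4, 6)
print(plain_match)       # None

# Non-capturing variant of the suggested pattern, used with re.split.
parts = re.split(r"\s*(?:(?:,|\bor\b)\s*)+", "one or or two")
print(parts)             # ['one', 'two']
```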