Not able to split with regular expression

I was trying to split a text with a regex but could not get it to work properly. I want to split only where a specific pattern occurs.

pattern = . ("whatever text")

import re

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"
x = re.split(r"\.\s\(.+\)", txt)
print(x)

Expected output = ['primer dia', 'otro dia mas', 'tercer parte.']

But the code is not working: it returns only the first section. Am I missing something?

Thanks.

CodePudding user response：

Another approach would be to actually correct your regex. The problem is that regex quantifiers are greedy by default, so the .+ inside the parentheses matches everything from the first opening parenthesis to the last closing one. There are two solutions to that.
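To see the greedy behaviour concretely, here is a minimal sketch that compares what the original pattern and a lazy variant actually match on the sample text. It uses re.search only to inspect the first match; the variable names greedy and lazy are just for illustration.

```python
import re

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST ")".
greedy = re.search(r"\.\s\(.+\)", txt)

# Lazy: .+? grabs as little as possible, so the match stops at the FIRST ")".
lazy = re.search(r"\.\s\(.+?\)", txt)

print(greedy.group(0))  # everything after "primer dia", up to the end of txt
print(lazy.group(0))    # ". (pag. 4 - pag. 5)"
```

Because the greedy match swallows the rest of the string, re.split finds a single separator and returns only the first section plus an empty string.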

First solution: make it non-greedy

import re

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"
x = re.split(r"\.\s\(.+?\)", txt)[:-1]
print(x)

There are two differences from your code. The first and most important one is that I added a ? after the +, making it non-greedy. The second is that I drop the last element of the result: the text ends with a match, so re.split leaves an empty string at the end that you probably don't care about.
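One detail worth checking with this pattern: the space that follows each closing parenthesis is not part of the match, so it stays at the start of the next piece. A possible way to handle that, sketched here as a suggestion rather than part of the original answer, is to strip each piece and drop the empty ones instead of slicing with [:-1].

```python
import re

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"

parts = re.split(r"\.\s\(.+?\)", txt)
# parts keeps the leading spaces and the trailing empty string:
# ['primer dia', ' otro dia mas', ' tercer parte', '']

# Strip whitespace and discard pieces that end up empty.
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)  # ['primer dia', 'otro dia mas', 'tercer parte']
```

Filtering empty pieces also keeps working when the text does not end with a parenthesized reference, where [:-1] would throw away a real section.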

Second solution: exclude closing parentheses from the match

import re

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"
x = re.split(r"\.\s\([^)]+\)", txt)[:-1]
print(x)

As before, I dropped the last element because it's the empty string. This time, though, I replaced . with [^)], meaning "match any character that is not a closing parenthesis".
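The negated-class version can be wrapped in a small reusable function. This is only a sketch: the names REF and split_on_refs are hypothetical, and the trailing \s* is an addition of mine so that the whitespace after each reference is consumed along with it.

```python
import re

# Hypothetical names, for illustration only.
# ". (anything without a closing parenthesis)" plus any whitespace after it.
REF = re.compile(r"\.\s\([^)]+\)\s*")

def split_on_refs(text):
    """Split text on '. (...)' references and drop empty pieces."""
    return [part for part in REF.split(text) if part]

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"
result = split_on_refs(txt)
print(result)  # ['primer dia', 'otro dia mas', 'tercer parte']
```

One difference from the non-greedy version: [^)] also matches newlines, whereas . does not unless re.DOTALL is set, so this pattern tolerates a reference that wraps across lines.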

CodePudding user response：

One approach would be to use a regex alternation to match the terms in parentheses first, which you don't want, falling back to matching the sentences you do want.

import re

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"
matches = [x for x in re.findall(r'\(.*?\)|(\w+(?: \w+)*\.)', txt) if x]
print(matches)  # ['primer dia.', 'otro dia mas.', 'tercer parte.']

Note that we only put the desired content in a capture group. This means that each parenthesized (...) term shows up in the output of re.findall as an empty string, which we remove with a simple list comprehension.
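To make the role of the empty strings visible, the following sketch prints the raw re.findall result before filtering. The final rstrip(".") step is an optional addition of mine, in case the trailing periods are not wanted.

```python
import re

txt = "primer dia. (pag. 4 - pag. 5) otro dia mas. (pag. 4 - pag. 5) tercer parte. (pag. 9)"

# With one capture group, findall returns that group for every match;
# matches of the first alternative leave the group empty, hence ''.
raw = re.findall(r"\(.*?\)|(\w+(?: \w+)*\.)", txt)
print(raw)  # ['primer dia.', '', 'otro dia mas.', '', 'tercer parte.', '']

# Drop the empty entries and, optionally, the trailing period.
sentences = [s.rstrip(".") for s in raw if s]
print(sentences)  # ['primer dia', 'otro dia mas', 'tercer parte']
```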