I'm trying to split some paragraphs into sentences using Python 3 and the re.split function. That's easy to do and is working. However, if a sentence is followed by another sentence enclosed in parentheses, I want to split that sentence out but include the text in the parentheses as well.
I've tried and tried to get this to work and am currently at this point in my trial.
For further specification, here is the result I want:
If I start with this:
This is a sentence that I can split out. This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)
I want to end up with this:
This is a sentence that I can split out.
This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)
CodePudding user response:
Here is one way to do it using re.findall():
import re
data = "This is a sentence that I can split out. This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)"
sentences = re.findall(r'.*?\.\s*\)?(?!\s*\()', data)
print(sentences)
# [
# 'This is a sentence that I can split out. ',
# "This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)"
# ]
.*? : Matches any character, between zero and unlimited times, as few as possible.
\. : Matches a dot.
\s* : Matches whitespace between zero and unlimited times.
\)? : Matches ) between zero and one time.
(?!...) : Negative lookahead.
\s* : Matches whitespace between zero and unlimited times.
\( : Matches an opening parenthesis.
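The pattern above leaves a trailing space on the first sentence, because \s* is part of the match. A minimal sketch of one way to tidy that up, assuming a hypothetical helper named split_sentences (not part of the original answer) that strips each match:

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
SENTENCE = re.compile(r'.*?\.\s*\)?(?!\s*\()')

def split_sentences(text):
    """Hypothetical helper: return the matches with surrounding whitespace removed."""
    # strip() drops the trailing space captured by the \s* part of the pattern.
    return [s.strip() for s in SENTENCE.findall(text)]

data = ("This is a sentence that I can split out. "
        "This sentence shouldn't be split out by itself. "
        "(I want to split that second sentence out but by the ending parenthesis instead.)")

sentences = split_sentences(data)
for s in sentences:
    print(s)
```

Note that this pattern only treats a dot as a sentence terminator, so sentences ending in ! or ? would need the \. part widened to a character class.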
CodePudding user response:
For a given string str you can use
re.sub(rgx, "\n", str)
where
rgx = r'(?:(?<=[.!?])|(?<=[.!?]\))) +(?!\()'
The regular expression, which matches one or more spaces, can be broken down as follows.
(?: # begin a non-capture group
(?<= # begin positive lookbehind
[.!?] # match a char in the char class
) # end positive lookbehind
| # or
(?<= # begin positive lookbehind
[.!?]\) # match a char in the char class, then ')'
) # end positive lookbehind
) # end non-capture group
 + # match one or more spaces (a literal space followed by +)
(?!\() # negative lookahead asserts next char is not '('
A non-capture group containing an alternation of two positive lookbehinds is needed because Python's re does not support variable-length lookbehinds.
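To illustrate the lookbehind limitation, here is a small sketch. The helper name compiles and the single-lookbehind pattern are made up for this example. It checks whether each pattern is accepted by re.compile:

```python
import re

def compiles(pattern):
    """Hypothetical helper: report whether re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# One lookbehind that is 1 or 2 characters wide: rejected by Python's re.
variable = r'(?<=[.!?]\)?) +(?!\()'
# Two fixed-width lookbehinds joined by alternation: accepted.
fixed = r'(?:(?<=[.!?])|(?<=[.!?]\))) +(?!\()'

print(compiles(variable))  # False
print(compiles(fixed))     # True
```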
Note that in the result no line ends with trailing spaces and no following line begins with spaces, because the matched spaces themselves are replaced by the newline.
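As a runnable sketch of this approach on the sample text from the question, assuming the pattern is meant to match one or more spaces (written here as a space followed by +, as the breakdown describes) and using the name text rather than str to avoid shadowing the built-in:

```python
import re

# Assumption: " +" matches one or more spaces, per the breakdown above.
rgx = r'(?:(?<=[.!?])|(?<=[.!?]\))) +(?!\()'

text = ("This is a sentence that I can split out. "
        "This sentence shouldn't be split out by itself. "
        "(I want to split that second sentence out but by the ending parenthesis instead.)")

# Replace each qualifying run of spaces with a newline, then split on it.
result = re.sub(rgx, "\n", text)
lines = result.split("\n")

for line in lines:
    print(repr(line))
```

The same pattern can also be passed to re.split(rgx, text) to get the list directly, since it contains no capturing groups.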