How can I split a paragraph by sentences, but sometimes include two sentences together if the first

I'm trying to split some paragraphs into sentences using Python 3 and the re.split function. That part is easy and already works. However, if a sentence is followed by another sentence wrapped in parentheses, I want the split to happen after the closing parenthesis, so that the parenthesized text stays with the sentence before it.

I've tried and tried to get this to work and am currently at this point in my trial.

regex101.com example image

For further specification, here is the result I want:

If I start with this:

This is a sentence that I can split out. This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)

I want to end up with this:

This is a sentence that I can split out.
This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)

CodePudding user response：

Here is one way to do it using re.findall():

import re

data = "This is a sentence that I can split out. This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)"

sentences = re.findall(r'.*?\.\s*\)?(?!\s*\()', data)
print(sentences)
# [
#     'This is a sentence that I can split out. ',
#     "This sentence shouldn't be split out by itself. (I want to split that second sentence out but by the ending parenthesis instead.)"
# ]

.*?: Matches any character (except a newline), between zero and unlimited times, as few as possible.
\.: Matches a literal dot.
\s*: Matches whitespace, between zero and unlimited times.
\)?: Matches ) between zero and one time.
(?!...): Negative lookahead; the match is rejected if what follows is:
\s*: whitespace, between zero and unlimited times, then
\(: an opening parenthesis.
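As a small sketch of how the pattern above might be wrapped for reuse, the helper below (split_sentences is a hypothetical name, not part of the answer) runs re.findall and strips the whitespace left at the end of each chunk. It assumes sentences end with a period only; "!" and "?" are not handled, and any trailing text without a final period is dropped.

```python
import re

# Pattern from the answer above: lazily consume up to a period, take an
# optional closing parenthesis, and refuse to stop if "(" comes next.
SENTENCE_RE = re.compile(r'.*?\.\s*\)?(?!\s*\()')


def split_sentences(text):
    """Return the matched chunks with surrounding whitespace removed."""
    return [chunk.strip() for chunk in SENTENCE_RE.findall(text)]


data = (
    "This is a sentence that I can split out. "
    "This sentence shouldn't be split out by itself. "
    "(I want to split that second sentence out but by the ending parenthesis instead.)"
)

for sentence in split_sentences(data):
    print(sentence)
```

On the sample paragraph this prints two lines, with the parenthesized sentence kept on the second one.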

CodePudding user response：

For a given string str you can use

re.sub(rgx, "\n", str)

where

rgx = r'(?:(?<=[.!?])|(?<=[.!?]\))) +(?!\()'
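For a quick check against the paragraph from the question, here is a minimal sketch. It assumes the gap between the lookbehinds and the lookahead is meant to be " +" (one or more spaces), as the description of the expression says, and it uses the variable name text rather than str to avoid shadowing the built-in. The re.split call is an addition for illustration; it cuts at the same places but returns a list.

```python
import re

# Assumption: " +" (one or more spaces) sits between the lookbehinds
# and the negative lookahead.
rgx = r'(?:(?<=[.!?])|(?<=[.!?]\))) +(?!\()'

text = (
    "This is a sentence that I can split out. "
    "This sentence shouldn't be split out by itself. "
    "(I want to split that second sentence out but by the ending parenthesis instead.)"
)

# Replace each qualifying run of spaces with a newline.
print(re.sub(rgx, "\n", text))

# The same boundaries, returned as a list of pieces.
parts = re.split(rgx, text)
```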

Python demo <-\(ツ)/-> Regex demo

The regular expression, which matches one or more spaces, can be broken down as follows.

(?:          # begin a non-capture group
  (?<=       # begin positive lookbehind 
    [.!?]    # match a char in the char class
  )          # end positive lookbehind
|            # or
  (?<=       # begin positive lookbehind
    [.!?]\)  # match a char in the char class then ')'
  )          # end positive lookbehind 
)            # end non-capture group
 +           # match one or more spaces
(?!\()       # negative lookahead asserts next char is not '('
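The breakdown can also be kept in the code itself with re.VERBOSE, sketched below. One caveat: in verbose mode unescaped whitespace in the pattern is ignored, so the space has to be written as [ ] (or escaped). SPLIT_RE is a name chosen here for illustration.

```python
import re

SPLIT_RE = re.compile(r"""
    (?:
        (?<=[.!?])      # preceded by sentence-ending punctuation
      |
        (?<=[.!?]\))    # or by that punctuation followed by ')'
    )
    [ ]+                # one or more spaces (a bare space would be ignored here)
    (?!\()              # not followed by '('
""", re.VERBOSE)

text = (
    "This is a sentence that I can split out. "
    "This sentence shouldn't be split out by itself. "
    "(I want to split that second sentence out but by the ending parenthesis instead.)"
)

pieces = SPLIT_RE.split(text)
print(pieces)
```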

A non-capture group containing an alternation of two positive lookbehinds is needed because Python's re module does not support variable-length lookbehinds.
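To see the restriction in practice, the sketch below tries to compile a single lookbehind with an optional ")" (which can be one or two characters wide) next to the two fixed-width lookbehinds. The helper name compiles is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print("rejected:", exc)
        return False


# One lookbehind whose width is 1 or 2 characters: not fixed-width.
variable_width = r'(?<=[.!?]\)?) +(?!\()'

# Two lookbehinds, each with a fixed width (1 and 2), joined by alternation.
fixed_width = r'(?:(?<=[.!?])|(?<=[.!?]\))) +(?!\()'

print(compiles(variable_width))
print(compiles(fixed_width))
```

The first pattern is rejected with a re.error at compile time, while the second compiles.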

Note in the linked demos that no trailing spaces are left where paragraphs end, and no leading spaces appear at the beginning of the following paragraph.