Regex in Python: splitting on whitespace character in between two words that start with a capital letter

In my NLP pipeline, I need to split titles from body text. Titles always consist of a sequence of capitalized words without any punctuation. The titles are separated from the body text by two newline characters (\n\n).

For example:

This Is A Title

This is where the body starts.

I want to split the title and body text on the whitespace using Regex in Python, such that the result is: This Is A Title, This is where the body starts.

Can anybody help me to write the right Regex? I tried the following:

r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])'

but then I got an error saying that lookbehinds only work with fixed-width patterns (whereas mine needs to be variable-width).
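For reference, a minimal sketch that reproduces the error with the standard-library re module. The second pattern is only an assumed workaround, not something from the question: it replaces the variable-width lookbehind with one that checks a single letter, so it is looser than the original intent.

```python
import re

text = "This Is A Title\n\nThis is where the body starts."

# The + inside the lookbehind makes its width variable, which re rejects
# when the pattern is compiled.
try:
    re.compile(r"(?<=[A-Z][a-z]+)\n\n(?=[A-Z])")
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# Assumed workaround: a fixed-width lookbehind that checks exactly one letter.
# maxsplit=1 keeps later blank lines inside the body.
parts = re.split(r"(?<=[A-Za-z])\n\n(?=[A-Z])", text, maxsplit=1)
print(parts)
```

Because the lookbehind only inspects the last character before the newlines, this variant does not verify that the whole preceding line is a title.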

Many thanks for helping me out!

CodePudding user response：

You can match the title followed by 2 newlines, and for the body match all lines that are not a title pattern using 2 capture groups instead of splitting.

^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)

^ Start of string
( Capture group 1
- [A-Z][a-z]* Match an uppercase char and optional lower case chars to also match for example just A
- (?:[^\S\n]+[A-Z][a-z]*)* Optionally repeat 1 or more whitespace characters (excluding newlines) and the same pattern as before
) Close group
\n\n Match 2 newlines
( Capture group 2
- (?: Non capture group
  - (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$) Negative lookahead, assert that the line is not a title pattern
  - .* If the previous assertion is true, match the whole line
  - (?:\n|$) Match either a newline or the end of the string
- )+ Close the non capture group and repeat 1 or more times
) Close group 2
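The breakdown above can also be sketched as a commented pattern using re.VERBOSE, which ignores whitespace and allows comments inside the pattern. The name TITLE_BODY is only illustrative.

```python
import re

# Same pattern as in the breakdown, spelled out with re.VERBOSE.
TITLE_BODY = re.compile(
    r"""
    ^                                 # start of string
    (                                 # group 1: the title
        [A-Z][a-z]*                   #   one capitalized word (a lone capital also counts)
        (?:[^\S\n]+[A-Z][a-z]*)*      #   further capitalized words after non-newline whitespace
    )
    \n\n                              # the two newlines between title and body
    (                                 # group 2: the body
        (?:
            (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$)  # the line is not a title pattern
            .*                        #   match the whole line
            (?:\n|$)                  #   a newline or the end of the string
        )+
    )
    """,
    re.VERBOSE,
)

match = TITLE_BODY.match("This Is A Title\n\nThis is where the body starts.")
title, body = match.groups()
print(title)
print(body)
```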

See a regex demo and a Python demo.

import re

pattern = r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"

s = ("This Is A Title\n\n"
    "This is where the body starts.\n\n"
    "And this is more body.")
    
print(re.findall(pattern, s))

Output

[('This Is A Title', 'This is where the body starts.\n\nAnd this is more body.')]

CodePudding user response：

Suppose you have this text:

txt='''\
This Is A Title

This is where the body starts.
more body

Not a title -- body!

This Is Another Title

This is where the body starts.

The End
'''

You can use this regex to separate the titles (as you have defined them) from the body:

import re
pat=r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

>>> re.findall(pat, txt, flags=re.M)
[('This Is A Title\n\n', ''), ('', 'This is where the body starts.\nmore body\n\nNot a title -- body!\n\n'), ('This Is Another Title\n\n', ''), ('', 'This is where the body starts.\n\n'), ('The End\n', '')]

As The fourth bird helpfully states in comments, the first lookahead can be eliminated:

(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))

Demo
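As a rough sketch of how the matches could be consumed, the simplified pattern can be used with re.finditer to pair each title with the body that follows it. The helper name sections and the choice to strip surrounding whitespace from bodies are assumptions made for illustration.

```python
import re

PAT = re.compile(
    r"(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))",
    re.M,
)

def sections(text):
    """Pair each title with the body that follows it (empty string if none)."""
    result = []
    for m in PAT.finditer(text):
        title, body = m.group(1), m.group(2)
        if title is not None:
            # A title match starts a new section.
            result.append([title, ""])
        elif body and result:
            # A non-empty body match belongs to the most recent title.
            result[-1][1] = body.strip()
    return [tuple(pair) for pair in result]

txt = (
    "This Is A Title\n\n"
    "This is where the body starts.\nmore body\n\n"
    "Not a title -- body!\n\n"
    "This Is Another Title\n\n"
    "This is where the body starts.\n\n"
    "The End\n"
)

for title, body in sections(txt):
    print(repr(title), repr(body))
```

Note that the body alternative only matches text that is followed by another title line, so a body after the last title would not be captured; the sample text ends with a title for that reason.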