In my NLP pipeline, I need to split titles from body text. Titles always consist of a sequence of capitalized words without any punctuation. The titles are separated from the body text by two newline characters (\n\n).
For example:
This Is A Title

This is where the body starts.
I want to split the title and body text on the whitespace using Regex in Python, such that the result is: This Is A Title, This is where the body starts.
Can anybody help me to write the right Regex? I tried the following:
r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])'
but then I got the error that lookbehinds only work with strings of fixed length (but they should be allowed to be variable).
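For reference, the error can be reproduced with a small sketch like the one below. The helper name `compiles` is only illustrative. It assumes the standard-library `re` module, which rejects look-behinds whose width is not fixed:

```python
import re

def compiles(pattern):
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for, among other things, a variable-width look-behind
        return False

# [a-z]+ makes the look-behind variable-width, which re rejects
variable_width = r"(?<=[A-Z][a-z]+)\n\n(?=[A-Z])"
# Exactly two characters in the look-behind: fixed width, so it compiles
fixed_width = r"(?<=[A-Z][a-z])\n\n(?=[A-Z])"

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```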
Many thanks for helping me out!
CodePudding user response:
You can match the title followed by 2 newlines, and for the body match all lines that are not a title pattern using 2 capture groups instead of splitting.
^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)
^  Start of string
(  Capture group 1
[A-Z][a-z]*  Match an uppercase char and optional lowercase chars, to also match for example just A
(?:[^\S\n]+[A-Z][a-z]*)*  Optionally repeat 1+ spaces and the same pattern as before
)  Close group 1
\n\n  Match 2 newlines
(  Capture group 2
(?:  Non capture group
(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$)  Negative lookahead, assert that the line is not a title pattern
.*  If the previous assertion is true, match the whole line
(?:\n|$)  Match either a newline or the end of the string
)+  Close the non capture group and repeat 1 or more times
)  Close group 2
Example in Python:
import re
pattern = r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"
s = ("This Is A Title\n\n"
"This is where the body starts.\n\n"
"And this is more body.")
print(re.findall(pattern, s))
Output
[('This Is A Title', 'This is where the body starts.\n\nAnd this is more body.')]
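If only a single title/body pair is expected, the two capture groups can be unpacked directly from a match object. The following is a minimal sketch under that assumption. The function name `split_title_body` is hypothetical, and the pattern is the one above with its `+` quantifiers written out:

```python
import re

# Same pattern as above, split over two lines for readability
PATTERN = re.compile(
    r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n"
    r"((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"
)

def split_title_body(text):
    """Return (title, body), or None if the text does not start with a title."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_title_body("This Is A Title\n\nThis is where the body starts."))
# ('This Is A Title', 'This is where the body starts.')
print(split_title_body("no title here\n\nbody"))
# None
```

Returning `None` for input that does not begin with a title lets the caller decide how to handle it, instead of silently receiving an empty list as with `re.findall`.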
CodePudding user response:
Suppose you have this text:
txt='''\
This Is A Title

This is where the body starts.
more body

Not a title -- body!

This Is Another Title

This is where the body starts.

The End
'''
You can use this regex to separate titles (as you have defined them) from the body:
import re
pat=r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"
>>> re.findall(pat, txt, flags=re.M)
[('This Is A Title\n\n', ''), ('', 'This is where the body starts.\nmore body\n\nNot a title -- body!\n\n'), ('This Is Another Title\n\n', ''), ('', 'This is where the body starts.\n\n'), ('The End\n', '')]
As The fourth bird helpfully states in comments, the first lookahead can be eliminated:
(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))
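As a rough sketch of how the simplified pattern might be used, the matches can be folded into (title, body) pairs with `re.finditer`. The names `TITLE_OR_BODY` and `split_sections` are made up for illustration, and the pattern is assumed to have its `+` quantifiers written out. Note that body text after the last title is never matched, because the second alternative requires a following title line:

```python
import re

# Simplified pattern: group 1 is a title line, group 2 is body text up to the next title
TITLE_OR_BODY = re.compile(
    r"(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)"
    r"|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))",
    flags=re.M,
)

def split_sections(text):
    """Pair each title with the body that follows it (empty string if none)."""
    sections = []
    for m in TITLE_OR_BODY.finditer(text):
        title, body = m.group(1), m.group(2)
        if title is not None:
            sections.append([title, ""])
        elif body and sections:
            # Body text that appears before the first title is ignored
            sections[-1][1] = body.strip()
    return [tuple(s) for s in sections]

txt = (
    "This Is A Title\n\n"
    "This is where the body starts.\nmore body\n\n"
    "Not a title -- body!\n\n"
    "This Is Another Title\n\n"
    "This is where the body starts.\n\n"
    "The End\n"
)

for title, body in split_sections(txt):
    print(repr(title), repr(body))
# 'This Is A Title' 'This is where the body starts.\nmore body\n\nNot a title -- body!'
# 'This Is Another Title' 'This is where the body starts.'
# 'The End' ''
```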