Regex in Python: splitting on whitespace character in between two words that start with a capital letter

Time:12-11

In my NLP pipeline, I need to split titles from body text. Titles always consist of a sequence of capitalized words without any punctuation. The titles are separated from the body text by two newline characters (\n\n).

For example:

This Is A Title

This is where the body starts.

I want to split the title and body text on the whitespace using Regex in Python, such that the result is: This Is A Title, This is where the body starts.

Can anybody help me to write the right Regex? I tried the following:

r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])'

but then I got an error saying that lookbehinds only work with fixed-width patterns, whereas mine needs to be variable-length.

Many thanks for helping me out!

CodePudding user response:

You can match the title followed by 2 newlines, and for the body match all lines that are not a title pattern using 2 capture groups instead of splitting.

^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)
  • ^ Start of string
  • ( Capture group 1
    • [A-Z][a-z]* Match an uppercase char and optional lowercase chars, so that a single letter such as A also matches
    • (?:[^\S\n]+[A-Z][a-z]*)* Optionally repeat 1+ whitespace chars other than a newline, followed by the same pattern as before
  • ) Close group 1
  • \n\n Match 2 newlines
  • ( Capture group 2
    • (?: Non capture group
      • (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$) Negative lookahead, assert that the line is not a title pattern
      • .* If the previous assertion is true, match the whole line
      • (?:\n|$) Match either a newline or the end of the string
    • )+ Close the non capture group and repeat 1 or more times
  • ) Close group 2

See a regex demo and a Python demo.

import re

pattern = r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"

s = ("This Is A Title\n\n"
    "This is where the body starts.\n\n"
    "And this is more body.")
    
print(re.findall(pattern, s))

Output

[('This Is A Title', 'This is where the body starts.\n\nAnd this is more body.')]
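
If only one title/body pair per string is expected, the same pattern could be wrapped in a small helper that returns a tuple instead of a list of matches. This is only a sketch: the name split_title_body is hypothetical, and it assumes the repetition quantifiers in the pattern are `+` (one or more).

```python
import re

# Pattern from this answer, assuming `+` quantifiers (one or more).
PATTERN = re.compile(
    r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n"
    r"((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"
)


def split_title_body(text):
    """Hypothetical helper: return (title, body), or None if text has no leading title."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2)


# A string that starts with a title yields a (title, body) tuple.
print(split_title_body("This Is A Title\n\nThis is where the body starts."))
# A string that does not start with a capitalized title yields None.
print(split_title_body("no title here\n\nJust body."))
```

Returning None for the no-match case lets the caller decide whether to treat the whole string as body text.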

CodePudding user response:

Suppose you have this text:

txt='''\
This Is A Title

This is where the body starts.
more body

Not a title -- body!

This Is Another Title

This is where the body starts.

The End
'''

You can use this regex to separate titles (as you have defined them) from body text:

import re
pat=r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

>>> re.findall(pat, txt, flags=re.M)
[('This Is A Title\n\n', ''), ('', 'This is where the body starts.\nmore body\n\nNot a title -- body!\n\n'), ('This Is Another Title\n\n', ''), ('', 'This is where the body starts.\n\n'), ('The End\n', '')]

As The fourth bird helpfully states in comments, the first lookahead can be eliminated:

(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))
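
As a rough sketch of how the simplified pattern might be used, the two capture groups can be turned into labeled segments with re.finditer. The names label_segments and sample are hypothetical, the `+` quantifiers are assumed, and body text is stripped of surrounding whitespace for readability. Note that body text after the last title would not be matched, because the second alternative needs a title line ahead of it.

```python
import re

# Simplified pattern: group 1 is a title line, group 2 is body text
# up to (but not including) the next title line.
PAT = re.compile(
    r"(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)"
    r"|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))",
    flags=re.M,
)


def label_segments(text):
    """Hypothetical helper: return a list of ('title' | 'body', text) pairs."""
    segments = []
    for m in PAT.finditer(text):
        if m.group(1) is not None:
            segments.append(("title", m.group(1)))
        elif m.group(2):  # skip empty body matches
            segments.append(("body", m.group(2).strip()))
    return segments


sample = (
    "First Title\n\n"
    "Some body text.\nmore body\n\n"
    "Second Title\n\n"
    "Closing body.\n\n"
    "The End\n"
)

for kind, segment in label_segments(sample):
    print(kind, repr(segment))
```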
