Regex for matching uppercase headers in any position in a string

I have some relatively long strings (about 7000K tokens) with this format:

LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.\n PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.\n PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros.

I want to use a regex (or any other method) to split the text, using the uppercase expressions (LOREM, PHASELLUS, PELLENTESQUE NEC CONSEQUAT) as headers. I assume the headers:

Are all uppercase
Can be made up of one or more words
Are always followed by a newline

What I am trying now is:

groups = re.finditer("([A-Z]+ *)+\\n+", text)

However, in some cases the operation takes forever. Why is that? Does catastrophic backtracking have anything to do with it? What can I do to make it work?

CodePudding user response：

Here is one way to do so:

[A-Z][A-Z ]+(?=\n)

import re

data = """LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.\n PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.\n PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros."""

print(re.findall(r"[A-Z][A-Z ]+(?=\n)", data))
# ['LOREM', 'PHASELLUS', 'PELLENTESQUE NEC CONSEQUAT']

[A-Z]: Matches a capital letter.
[A-Z ]+: Matches a capital letter or a space, between one and unlimited times, as many times as possible.
(?=): Positive lookahead.
- \n: Matches a newline.
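
Since the goal is to split the text rather than only list the headers, the match positions can be reused to cut out each body. This is a minimal sketch under the assumptions stated in the question; `split_sections` is a hypothetical helper name and the shortened `sample` string is made up for illustration. Note that this pattern needs at least two characters, so a one-letter header would be missed.

```python
import re

# Header pattern from the answer above: one capital letter, then capital
# letters or spaces, ending right before a newline.
HEADER = re.compile(r"[A-Z][A-Z ]+(?=\n)")


def split_sections(text):
    """Return (header, body) pairs. Hypothetical helper, not from the answer."""
    matches = list(HEADER.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # The body runs from the end of this header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(), text[m.end():end].strip()))
    return sections


sample = (
    "LOREM\n ipsum dolor.\n PHASELLUS\n metus est.\n Aliquam turpis.\n"
    " PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada."
)

for header, body in split_sections(sample):
    print(repr(header), "->", repr(body))
```

Text that appears before the first header is dropped by this sketch; keep `text[:matches[0].start()]` separately if it matters.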

Your regex also works, but one of the features of re.findall is to only return capture groups when there are any.

That explains why re.findall(r"([A-Z]+ *)+\n+", data) returns only the last word of each header: a repeated capture group keeps only its last iteration.

You should instead use a non-capturing group:

(?:[A-Z]+ *)+\n+

import re

data = """LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.\n PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.\n PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros."""

print(re.findall(r"(?:[A-Z]+ *)+\n+", data))
# ['LOREM\n', 'PHASELLUS\n', 'PELLENTESQUE NEC CONSEQUAT\n\n']
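
On the "takes forever" part of the question: a likely cause, offered here as an assumption rather than a measured diagnosis of the original data, is the nested quantifier in `(?:[A-Z]+ *)+`. A run of capitals can be divided among the inner `+` and the outer `+` in many ways, and when the run is not followed by a newline the engine may try all of them before giving up. The sketch below compares it with a flattened pattern that accepts the same strings through a single character class; `NESTED`, `FLAT` and `time_search` are illustrative names, and the timings depend on the machine.

```python
import re
import time

NESTED = re.compile(r"(?:[A-Z]+ *)+\n+")  # pattern from the answer above
FLAT = re.compile(r"[A-Z][A-Z ]*\n+")     # hypothetical rewrite, no nested quantifier


def time_search(pattern, text):
    """Return (match_or_None, elapsed_seconds) for a single search."""
    start = time.perf_counter()
    found = pattern.search(text)
    return found, time.perf_counter() - start


# Both patterns find the same headers on well-formed input.
data = (
    "LOREM\n ipsum dolor.\n PHASELLUS\n metus est. Integer in sem.\n"
    " PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada."
)
print(NESTED.findall(data) == FLAT.findall(data))

# A run of capitals that is never followed by a newline: no match is possible,
# so the engine has to exhaust every way of splitting the run.
for n in (14, 16, 18):
    run = "A" * n + "b"
    _, nested_time = time_search(NESTED, run)
    _, flat_time = time_search(FLAT, run)
    print(f"n={n}: nested {nested_time:.4f}s, flat {flat_time:.6f}s")
```

If the nested timing grows sharply as `n` increases while the flat one stays small, that is consistent with catastrophic backtracking on long uppercase runs in the real text.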

CodePudding user response：

You could write

rgx = r'\n+ +(?=[A-Z]+(?: +[A-Z]+)*\n)'
re.split(rgx, text)
  #=> [
  #     'LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.',
  #     'PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.',
  #     'PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros.'
  #  ]


The regular expression can be broken down as follows.

\n+ +           # match >= 1 newlines followed by >= 1 spaces
(?=             # begin positive lookahead
  [A-Z]+        # match >= 1 capital letters
  (?: +[A-Z]+)  # match >= 1 spaces followed by >= 1 capital letters
                # in a non-capture group
  *             # execute the non-capture group >= 0 times
  \n            # match a newline
)               # end positive lookahead
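
The same breakdown can be kept inside the code with `re.VERBOSE`. This is a sketch rather than part of the original answer: `HEADER_SPLIT` and `sample` are illustrative names, and literal spaces are written as `[ ]` because verbose mode ignores unescaped whitespace in the pattern.

```python
import re

HEADER_SPLIT = re.compile(
    r"""
    \n+ [ ]+                # one or more newlines, then one or more spaces
    (?=                     # begin positive lookahead
      [A-Z]+                # one or more capital letters
      (?: [ ]+ [A-Z]+ )*    # zero or more groups of spaces plus capital letters
      \n                    # a newline
    )                       # end positive lookahead
    """,
    re.VERBOSE,
)

sample = (
    "LOREM\n ipsum dolor.\n PHASELLUS\n metus est.\n"
    " PELLENTESQUE NEC CONSEQUAT\n\n Interdum."
)

for part in HEADER_SPLIT.split(sample):
    print(repr(part))
```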

If the files may be produced using Windows (which uses '\r\n' as a line terminator) the regular expression should be modified as follows:

r'(?:\r?\n)+ +(?=[A-Z]+(?: +[A-Z]+)*\r?\n)'
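
A quick way to check the line-ending variant is to run it on the same text with both kinds of terminator. This is a small sketch; `CRLF_AWARE` and the two sample strings are made up for illustration.

```python
import re

# Variant that accepts either "\n" or "\r\n" as the line terminator.
CRLF_AWARE = re.compile(r"(?:\r?\n)+ +(?=[A-Z]+(?: +[A-Z]+)*\r?\n)")

unix_text = "LOREM\n ipsum.\n PHASELLUS\n metus."
windows_text = unix_text.replace("\n", "\r\n")

print(CRLF_AWARE.split(unix_text))
print(CRLF_AWARE.split(windows_text))
```

Each piece keeps the line terminators it contained, so the Windows pieces still include `\r\n` after the header.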