I have some relatively long strings (about 7000K tokens) with this format:
LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.\n PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.\n PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros.
I want to use a regex (or any other method) to split the text, using the uppercase expressions (LOREM, PHASELLUS and PELLENTESQUE NEC CONSEQUAT) as headers. I suppose the headers:
- Are all uppercase
- Can be made up of one or more words
- Are always followed by a newline
What I am trying now is:
groups = re.finditer("([A-Z]+ *)+\\n+", text)
However, in some cases, the operation takes forever. Why is that? Has catastrophic backtracking anything to do with it? What can I do to make it work?
CodePudding user response:
Here is one way to do so:
[A-Z][A-Z ]+(?=\n)
import re
data = """LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.\n PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.\n PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros."""
print(re.findall(r"[A-Z][A-Z ]+(?=\n)", data))
# ['LOREM', 'PHASELLUS', 'PELLENTESQUE NEC CONSEQUAT']
- [A-Z]: Matches a capital letter.
- [A-Z ]+: Matches a capital letter or a space, between one and unlimited times, as many times as possible.
- (?=): Positive lookahead.
- \n: Matches a newline.
Your regex also works, but re.findall returns only the capture groups when the pattern contains any.
That explains why re.findall(r"([A-Z]+ *)+\n+", data) returns only the last word of each header.
You should instead use a non-capturing group:
(?:[A-Z]+ *)+\n+
import re
data = """LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.\n PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.\n PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros."""
print(re.findall(r"(?:[A-Z]+ *)+\n+", data))
# ['LOREM\n', 'PHASELLUS\n', 'PELLENTESQUE NEC CONSEQUAT\n\n']
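On the "takes forever" part of the question: the nested quantifier in ([A-Z]+ *)+ is a likely cause. Because the space is optional, a run of capitals can be divided among repetitions of the group in exponentially many ways, and the engine tries all of them when no newline follows. Here is a minimal sketch, using a made-up probe string (capitals followed by '!') rather than the asker's data. Timings are only printed, since they depend on the machine:

```python
import re
import time

# Pattern from the question: a quantified group whose body is itself quantified.
nested = re.compile(r"([A-Z]+ *)+\n+")
# Pattern from this answer: no nested quantifier.
flat = re.compile(r"[A-Z][A-Z ]+(?=\n)")

for n in (12, 14, 16, 18):
    # Hypothetical worst case: n capitals, then a character that is not a newline.
    probe = "A" * n + "!"

    start = time.perf_counter()
    nested_hit = nested.search(probe)
    nested_time = time.perf_counter() - start

    start = time.perf_counter()
    flat_hit = flat.search(probe)
    flat_time = time.perf_counter() - start

    # Neither pattern matches; only the time needed to give up differs.
    print(n, nested_hit, flat_hit, f"{nested_time:.4f}s", f"{flat_time:.6f}s")
```

The time for the nested pattern should grow steeply with each extra letter, while the flat pattern stays fast. In real text, the trigger would be a long uppercase run followed by anything other than a newline, for example a period or a '\r'.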
CodePudding user response:
You could write
import re
rgx = r'\n+ +(?=[A-Z]+(?: +[A-Z]+)*\n)'
re.split(rgx, text)
#=> [
# 'LOREM\n ipsum dolor sit amet, consectetur adipiscing elit.',
# 'PHASELLUS\n metus est, fringilla sit amet convallis nec, dictum nec nunc. Integer in scelerisque sem, sed suscipit ligula.\n Aliquam turpis orci, pellentesque a vulputate ut, interdum nec lacus.',
# 'PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec volutpat vulputate leo, in consectetur eros.'
# ]
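If the goal is header/body pairs rather than raw chunks, each chunk can be cut at its first newline. This is a minimal sketch on a shortened, made-up version of the sample text. It assumes the text starts with a header and that the pattern reads \n+ +(?=[A-Z]+(?: +[A-Z]+)*\n); the name sections is only illustrative:

```python
import re

# Shortened stand-in for the sample text in the question.
text = ("LOREM\n ipsum dolor sit amet.\n PHASELLUS\n metus est.\n"
        " Aliquam turpis orci.\n PELLENTESQUE NEC CONSEQUAT\n\n Interdum et malesuada.")

rgx = r'\n+ +(?=[A-Z]+(?: +[A-Z]+)*\n)'

sections = []
for chunk in re.split(rgx, text):
    # Everything before the first newline is the header; the rest is the body.
    header, _, body = chunk.partition("\n")
    sections.append((header, body.strip()))

for header, body in sections:
    print(repr(header), "->", repr(body))
```

A list of tuples is used instead of a dict so that repeated headers, if there are any, are not silently overwritten.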
The regular expression can be broken down as follows.
\n+ +          # match >= 1 newlines followed by >= 1 spaces
(?=            # begin positive lookahead
  [A-Z]+       # match >= 1 capital letters
  (?: +[A-Z]+) # match >= 1 spaces followed by >= 1 capital letters
               # in a non-capture group
  *            # execute the non-capture group >= 0 times
  \n           # match a newline
)              # end positive lookahead
If the files may be produced using Windows (which uses '\r\n'
as a line terminator) the regular expression should be modified as follows:
r'(?:\r?\n)+ +(?=[A-Z]+(?: +[A-Z]+)*\r?\n)'
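A quick way to check the line-ending handling is to run the same split on a '\n' version and a '\r\n' version of one string. This is a minimal sketch with a made-up sample. It assumes the pattern reads (?:\r?\n)+ +(?=[A-Z]+(?: +[A-Z]+)*\r?\n), and the variable names are illustrative:

```python
import re

rgx = r'(?:\r?\n)+ +(?=[A-Z]+(?: +[A-Z]+)*\r?\n)'

# Made-up sample in the same layout as the question's text.
unix_text = "LOREM\n ipsum.\n PHASELLUS\n metus.\n PELLENTESQUE NEC\n\n Interdum."
# Same content with Windows line terminators.
windows_text = unix_text.replace("\n", "\r\n")

unix_parts = re.split(rgx, unix_text)
windows_parts = re.split(rgx, windows_text)

print(unix_parts)
print(windows_parts)
```

Both calls should yield three chunks, each starting with its header. The Windows chunks keep their '\r\n' terminators inside the body.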