How can I match everything after a double hash "##" up to the next double hash "##", including any longer run of the "#" character (such as "###") that is not exactly "##"? For instance, the example below should return two matches: one for chapter 1 together with subchapter 1.1, and a second for chapter 2.
## chapter 1
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Suspendisse mollis magna nec felis gravida, id posuere libero molestie.
### subchapter 1.1
Sed vel ipsum eget tortor maximus ultrices vitae eget dolor.
## chapter 2
Aenean pellentesque lectus quis ex tristique ultrices. Vestibulum eget purus eu ipsum vestibulum pulvinar
So far, the best I have found is the following regex:
((?!#){2}[\s\S])
However, it gets confused when it meets a ### or ####, which it counts as the start of a new chapter.
Link to regex example: https://regex101.com/r/gydtq1/1
CodePudding user response:
You can use either of the following equivalent calls:
re.findall(r'(?ms)^##(?!#).*?(?=\n##(?!#)|\Z)', text)
re.findall(r'^##(?!#).*?(?=\n##(?!#)|\Z)', text, re.M | re.S)
See the regex demo. Details:
- (?ms) - the re.DOTALL (re.S) and re.MULTILINE (re.M) flags
- ^ - start of a line
- ##(?!#) - a ## string not immediately followed by a #
- .*? - zero or more chars, as few as possible
- (?=\n##(?!#)|\Z) - a location immediately followed by a newline and a ## that is not immediately followed by a #, or by the end of the string.
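As a quick check, here is a minimal sketch that runs the pattern above against the sample text from the question. The names text and CHAPTER_RE are only illustrative and are not part of the original answer.

```python
import re

# Sample text copied from the question.
text = (
    "## chapter 1\n"
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    "Suspendisse mollis magna nec felis gravida, id posuere libero molestie.\n"
    "### subchapter 1.1\n"
    "Sed vel ipsum eget tortor maximus ultrices vitae eget dolor.\n"
    "## chapter 2\n"
    "Aenean pellentesque lectus quis ex tristique ultrices. "
    "Vestibulum eget purus eu ipsum vestibulum pulvinar"
)

# re.M lets ^ match at every line start; re.S lets . match newlines.
CHAPTER_RE = re.compile(r'^##(?!#).*?(?=\n##(?!#)|\Z)', re.M | re.S)

chapters = CHAPTER_RE.findall(text)

print(len(chapters))  # 2
for chapter in chapters:
    # Print only the heading line of each matched chapter.
    print(chapter.splitlines()[0])
# ## chapter 1
# ## chapter 2
```

The "### subchapter 1.1" line stays inside the first match, because the (?!#) lookahead rejects any ## that is followed by a third #.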
CodePudding user response:
Matching sometimes feels "overrated"; an alternative could be
re.split(r'(?m)^##(?!#)', text)
The parts are very similar to the ones in the accepted answer:
- (?m) - the flag for multiline mode (it could be passed separately as flags=re.M)
- ^##(?!#) - a ## string at the start of a line (^), not immediately followed by another #
Caveat: the resulting list will have an entry for everything that precedes the first ##, which is an empty string for the example.
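To illustrate the caveat, here is a small sketch of the split approach on the sample text from the question. The variable names and the final filtering step are my own additions, not part of the answer. Note that re.split drops the separator, so the leading ## is not included in the returned pieces.

```python
import re

# Sample text copied from the question.
text = (
    "## chapter 1\n"
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    "Suspendisse mollis magna nec felis gravida, id posuere libero molestie.\n"
    "### subchapter 1.1\n"
    "Sed vel ipsum eget tortor maximus ultrices vitae eget dolor.\n"
    "## chapter 2\n"
    "Aenean pellentesque lectus quis ex tristique ultrices. "
    "Vestibulum eget purus eu ipsum vestibulum pulvinar"
)

# Split at every line that starts with exactly two hashes.
parts = re.split(r'(?m)^##(?!#)', text)

print(len(parts))      # 3
print(repr(parts[0]))  # '' (nothing precedes the first ##)

# Illustrative cleanup: drop empty pieces and trim surrounding whitespace.
chapters = [part.strip() for part in parts if part.strip()]

print(len(chapters))   # 2
for chapter in chapters:
    print(chapter.splitlines()[0])
# chapter 1
# chapter 2
```

If the text had an introduction before the first ##, that introduction would show up as the first list entry instead of the empty string, so the filtering step would keep it.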