Regex match double ## including any repetition of # which is not double

Time:03-02

How can I match everything after a double hash "##" up to the next double hash "##", including any longer run of the "#" character (such as "###") that is not exactly "##"? For instance, the example below should return two matches: one for chapter 1 together with subchapter 1.1, and a second for chapter 2.

## chapter 1

Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Suspendisse mollis magna nec felis gravida, id posuere libero molestie.

### subchapter 1.1

Sed vel ipsum eget tortor maximus ultrices vitae eget dolor.

## chapter 2

Aenean pellentesque lectus quis ex tristique ultrices. Vestibulum eget purus eu ipsum vestibulum pulvinar

At the moment, the best I have found is the following regex:

((?!#){2}[\s\S]) 

However, it gets confused when a ### or #### is found and counts it as a new chapter.

Link to regex example: https://regex101.com/r/gydtq1/1

CodePudding user response:

You can use

re.findall(r'(?ms)^##(?!#).*?(?=\n##(?!#)|\Z)', text)
re.findall(r'^##(?!#).*?(?=\n##(?!#)|\Z)', text, re.M | re.S)

See the regex demo. Details:

  • (?ms) - inline form of the re.MULTILINE (re.M) and re.DOTALL (re.S) flags
  • ^ - start of a line
  • ##(?!#) - a ## string not immediately followed with a #
  • .*? - zero or more chars as few as possible
  • (?=\n##(?!#)|\Z) - a position immediately followed by either a newline and a ## that is not itself followed by a #, or the end of the string.
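
As a minimal sketch of how the pieces above fit together, the snippet below runs the pattern on a shortened copy of the question's sample. The names text, pattern and chapters are only illustrative, and the sample text is an abbreviated assumption, not the asker's exact input.

```python
import re

# Shortened version of the sample from the question (illustrative only).
text = """## chapter 1

Lorem ipsum dolor sit amet.

### subchapter 1.1

Sed vel ipsum eget tortor.

## chapter 2

Aenean pellentesque lectus."""

# ^##(?!#) lets a match start only at a line beginning with exactly two hashes.
# The lazy .*? (with re.S so it crosses newlines) stops right before the newline
# that precedes the next such line, or at the end of the string (\Z).
pattern = re.compile(r'^##(?!#).*?(?=\n##(?!#)|\Z)', re.M | re.S)

chapters = pattern.findall(text)

for chapter in chapters:
    # Show only the heading line of each match.
    print(chapter.splitlines()[0])
```

Here the "### subchapter 1.1" block stays inside the first match, because the lookahead rejects any "##" that is followed by a third "#".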

CodePudding user response:

Matching feels "overrated" sometimes, an alternative could be

re.split(r'(?m)^##(?!#)', text)

Parts are very similar to the ones in the accepted answer:

  • (?m) - inline flag for multiline mode (could instead be passed separately through the flags argument, e.g. flags=re.M)
  • ^##(?!#) - a ## string at the start of a line (^), not immediately followed with a subsequent #

Caveat: the resulting list will have an entry for everything that precedes the first ##, which is an empty string for the example. Also, the ## marker itself is consumed by the split, so it is not part of the returned pieces.
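
A small sketch of the split approach, assuming the same kind of input as in the question. The names text, parts and chapters are illustrative, and re-attaching the "##" prefix is just one possible way to handle the pieces.

```python
import re

# Shortened version of the sample from the question (illustrative only).
text = """## chapter 1

Lorem ipsum dolor sit amet.

### subchapter 1.1

Sed vel ipsum eget tortor.

## chapter 2

Aenean pellentesque lectus."""

# Split at every line that starts with exactly two hashes.
# The flag is passed by keyword instead of the inline (?m) form.
parts = re.split(r'^##(?!#)', text, flags=re.M)

# parts[0] holds whatever precedes the first "##" (an empty string here),
# so skip it and put the consumed "##" marker back on each chapter.
chapters = ['##' + part for part in parts[1:]]

print(len(parts), len(chapters))
```

Unlike the findall version, each piece here keeps the trailing blank lines before the next heading, so a strip() may be wanted depending on the use case.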
