Find the change level and the change text in .rst news files using regex

Time:09-09

What I want

I'm trying to work out how to use a regex to find two groups in RST news files. I want to get the change level as well as the change text:

  • hence I want a regex of the form (changelevel): (change text)
  • I was thinking about something like (changelevel): (anything until the next change level)

For instance, given the following .rst file:

* Major: This is a **Major** change
* Minnor: This is is a minor change with a typo
* Patch: This
is a multiline
  patch

It should return the full match, group 1 and group 2 as follows:

Match 1:

"* Major: This is a **Major** change"
"* Major: "
"This is a **Major** change"

Match 2:

"* Patch: This\nis a multiline\n  patch"
"* Patch: "
"This\nis a multiline\n  patch"

What I need help with

I cannot write a regex that handles multiple lines and the asterisks present in the change text. I tried the following logic:

  1. Match the change level: ^(\*\s+(\w+):\s)
  2. Match anything, with the "dot matches newline" option turned on: .*
  3. Negative lookahead until the next change level: (?!^(\*\s+(\w+):\s))
  • I ended up with ^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s)) but .* seems to just match everything


What works

I managed to get the first group working with the following regex:

  • beginning of the line
  • star in front
  • then whitespace
  • a word
  • colon
  • white space

^(\*\s+(\w+):\s)


CodePudding user response:

You are almost there. You can keep the negative lookahead, but apply it right after matching each newline: if the assertion succeeds, match the whole next line.

^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)

Explanation

  • ^ Start of the line (with the multiline flag enabled)
  • ( Capture group 1
    • \*\s+\w+:\s Match *, 1+ whitespace chars, 1+ word chars, : and a whitespace char
  • ) Close group 1
  • ( Capture group 2
    • .* Match the whole line
    • (?: Non capture group to repeat as a whole
    • \n Match a newline
      • (?!\*\s+\w+:\s) The negative lookahead, asserting that the starting pattern does not occur here
      • .* Match the whole line
    • )* Close the non capture group and optionally repeat it to match all lines
  • ) Close group 2
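
For contrast, here is a minimal sketch of why the attempt from the question captures everything. It assumes the question's pattern is run with both re.MULTILINE and re.DOTALL ("dot matches newline"); the variable names are only illustrative. The greedy .* runs to the end of the input, and the negative lookahead is then checked only at that position, where no change level can start.

```python
import re

# The attempt from the question: greedy ".*" followed by a negative lookahead.
attempt = r"^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s))"

s = ("* Major: This is a **Major** change\n"
     "* Minnor: This is is a minor change with a typo\n"
     "* Patch: This\n"
     "is a multiline\n"
     "  patch")

# DOTALL lets "." cross newlines, so ".*" consumes the rest of the input.
m = re.search(attempt, s, re.MULTILINE | re.DOTALL)

# The lookahead is evaluated once, where ".*" stopped (the end of the string),
# so the single match spans the whole input.
print(m.group(0) == s)                                        # True
print(len(re.findall(attempt, s, re.MULTILINE | re.DOTALL)))  # 1
```

Checking the lookahead after every newline, as in the pattern above, is what lets the match stop at the next change level.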


Example code:

import re

# Group 1: the "* Level: " prefix; group 2: the text up to the next prefix
pattern = r"^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)"

s = ("* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch")

# re.MULTILINE makes ^ match at the start of every line
result = re.findall(pattern, s, re.MULTILINE)
print(result)

Output

[('* Major: ', 'This is a **Major** change'), ('* Minnor: ', 'This is is a minor change with a typo'), ('* Patch: ', 'This\nis a multiline\n  patch')]
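
If only the bare level name is wanted, without the leading "* " and the trailing ": ", one possible variation uses named groups. This is a sketch rather than part of the answer above: the group names level and text, the constant CHANGE_RE and the helper parse_changes are all hypothetical names chosen for illustration.

```python
import re

# Same structure as the pattern above, but only the word itself is captured
# (as "level") instead of the whole "* Level: " prefix.
CHANGE_RE = re.compile(
    r"^\*\s+(?P<level>\w+):\s(?P<text>.*(?:\n(?!\*\s+\w+:\s).*)*)",
    re.MULTILINE,
)

def parse_changes(news):
    """Return a list of (level, text) tuples found in the news text."""
    return [(m.group("level"), m.group("text")) for m in CHANGE_RE.finditer(news)]

s = ("* Major: This is a **Major** change\n"
     "* Minnor: This is is a minor change with a typo\n"
     "* Patch: This\n"
     "is a multiline\n"
     "  patch")

for level, text in parse_changes(s):
    print(level, repr(text))
```

With the sample input this yields the levels Major, Minnor and Patch, each paired with the same text as group 2 in the output above.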

CodePudding user response:

re.findall(r'(\*\s*\w+:\s+)([\s\S]*?(?=\n\*|$))', text)

  • Use a newline followed by * (\n\*), or the end of the string ($), as the anchor

  • Group 1: A literal * followed by zero or more whitespace characters (\s*), one or more word characters (\w+), a literal : and one or more whitespace characters (\s+)

  • Group 2: Match everything non-greedily (*?) up to \n\* or $
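
One difference between the two answers shows up when a continuation line itself begins with an asterisk, for example RST bold markup. The sketch below compares them on a made-up input that is not from the question; the variable names anchored and lazy are only illustrative. The lookahead \n\* stops at any newline followed by *, while the first answer's pattern stops only at a full change-level prefix.

```python
import re

anchored = r"^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)"  # first answer
lazy = r"(\*\s*\w+:\s+)([\s\S]*?(?=\n\*|$))"               # this answer

# Hypothetical input: the second line is a continuation that starts with "**".
text = ("* Patch: This\n"
        "**really** matters\n"
        "* Minor: ok")

print(re.findall(anchored, text, re.MULTILINE))
# [('* Patch: ', 'This\n**really** matters'), ('* Minor: ', 'ok')]

print(re.findall(lazy, text))
# [('* Patch: ', 'This'), ('* Minor: ', 'ok')]
```

On the sample from the question both patterns return the same three tuples, so the choice only matters if change text can contain lines that start with *.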
