What I want
I'm trying to work out how to use a regex to find two groups in RST news files. I want to get the change level as well as the change text:
- hence I want a regex of the form (changelevel): (change text)
- I was thinking about something like (changelevel): (anything until the next change level)
For instance, given the following .rst file:
* Major: This is a **Major** change
* Minnor: This is is a minor change with a typo
* Patch: This
is a multiline
patch
This should return the full match, group 1 and group 2 as follows:
Match 1:
"* Major: This is a **Major** change"
"* Major: "
"This is a **Major** change"
Match 2:
"* Patch: This\nis a multiline\n patch"
"* Patch: "
"This\nis a multiline\n patch"
What I need help with
I cannot write a regex that handles both multiline text and asterisks inside the "change text". I tried the following logic:
- Match the change level
^(\*\s+(\w+):\s)
- Match anything, with the "dot matches newline" option turned on
.*
- Negative lookahead until I match the next change level
(?!^(\*\s+(\w+):\s))
- I ended up with
^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s))
but .* seems to just match everything into group 2
What works
I managed to get the first group working with the following regex:
- beginning of the line
- star in front
- then whitespace
- a word
- colon
- whitespace
^(\*\s+(\w+):\s)
CodePudding user response:
You are almost there. You can keep the negative lookahead, but apply it per line: match a newline, and only if the assertion succeeds (the next line does not start a new change entry), match that whole line.
^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)
Explanation
- ^ matches the start of a line (re.MULTILINE is used in the example code)
- ( opens capture group 1
- \*\s+\w+:\s matches a literal *, 1+ whitespace chars, 1+ word chars, a colon and a whitespace char
- ) closes group 1
- ( opens capture group 2
- .* matches the rest of the line
- (?: opens a non-capture group that is repeated as a whole
- \n matches a newline
- (?!\*\s+\w+:\s) is the negative lookahead, asserting that the starting pattern does not occur here
- .* matches the whole line
- )* closes the non-capture group and optionally repeats it to match all following lines
- ) closes group 2
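As a side note on why the attempt from the question over-matches, here is a small sketch. It assumes the question's pattern is compiled with both re.MULTILINE and re.DOTALL (the "dot matches newline" option), and the variable names are only for illustration. The greedy .* runs to the end of the input, and the negative lookahead is checked only once, at that final position, where it succeeds trivially.

```python
import re

text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    " patch"
)

# The attempt from the question: greedy .* (dot matches newline), then a lookahead.
attempt = re.compile(
    r"^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s))",
    re.MULTILINE | re.DOTALL,
)

matches = list(attempt.finditer(text))

# A single match spans the whole input, so nothing is left for later entries.
swallowed_everything = matches[0].group(0) == text
print(len(matches), swallowed_everything)
```

Putting the lookahead inside the repeated group, right after each newline, is what makes it act as a stop condition for every line instead of a single check at the end.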
Example code:
import re

# ^ anchors at every line start because re.MULTILINE is passed below
pattern = r"^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)"

s = ("* Major: This is a **Major** change\n"
     "* Minnor: This is is a minor change with a typo\n"
     "* Patch: This\n"
     "is a multiline\n"
     " patch")

# findall returns one (group 1, group 2) tuple per change entry
result = re.findall(pattern, s, re.MULTILINE)
print(result)
Output
[('* Major: ', 'This is a **Major** change'), ('* Minnor: ', 'This is is a minor change with a typo'), ('* Patch: ', 'This\nis a multiline\n patch')]
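If you only need the bare level word rather than the "* Major: " prefix, one possible variation is to narrow the first group and use named groups. This is just a sketch: the group names level and text, the function parse_changes and the shortened sample are my own choices for illustration, not something from the question.

```python
import re

# Same idea as the pattern above, with hypothetical group names "level" and "text".
# The level group captures only the word, so no stripping is needed afterwards.
CHANGE_RE = re.compile(
    r"^\*\s+(?P<level>\w+):\s(?P<text>.*(?:\n(?!\*\s+\w+:\s).*)*)",
    re.MULTILINE,
)


def parse_changes(rst):
    """Return a list of (level, text) tuples found in an RST news string."""
    return [(m.group("level"), m.group("text")) for m in CHANGE_RE.finditer(rst)]


sample = "* Major: This is a **Major** change\n* Patch: This\nis a multiline\n patch"
changes = parse_changes(sample)
print(changes)
# [('Major', 'This is a **Major** change'), ('Patch', 'This\nis a multiline\n patch')]
```

Using finditer with named groups keeps the calling code readable if the pattern gains more groups later.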
CodePudding user response:
re.findall(r'(\*\s*\w+:\s+)([\s\S]*?(?=\n\*|$))',text)
Use a newline \n followed by *, or the end of the string $, as the anchor.
Group 1: a literal * followed by zero or more whitespace characters (\s*), one or more word characters (\w+), a literal : and one or more whitespace characters (\s+).
Group 2: match everything non-greedily (*?) up to \n\* or $.
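One difference between the two answers that may matter for RST: the \n\* anchor stops at any line that begins with an asterisk, including a continuation line that starts with bold markup. The sketch below compares both patterns on a made-up input (the text and variable names are assumptions for illustration).

```python
import re

lookahead_pattern = r"^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)"
lazy_pattern = r"(\*\s*\w+:\s+)([\s\S]*?(?=\n\*|$))"

# Hypothetical input: the second line continues the Patch entry
# but happens to start with RST bold markup.
text = "* Patch: This\n**is** bold\n* Major: Done"

lazy_result = re.findall(lazy_pattern, text)
lookahead_result = re.findall(lookahead_pattern, text, re.MULTILINE)

print(lazy_result)
# [('* Patch: ', 'This'), ('* Major: ', 'Done')]
print(lookahead_result)
# [('* Patch: ', 'This\n**is** bold'), ('* Major: ', 'Done')]
```

If continuation lines in your news files never start with an asterisk, both patterns should give the same change text.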