I'm writing a Python regex that parses the content of a heading. However, the greedy quantifier is not working well, and the non-greedy quantifier is not working at all.
My string is
Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:
What I'm trying to do is extract the step number and the heading, excluding the trailing colon.
I've tried multiple regex strings and came up with these two:
r1 = r"Step ?([0-9]+) ?(.*) ?:?"
r2 = r"Step ?([0-9]+) ?(.*?) ?:?"
r1 captures the step number, but it also captures the : at the end of the heading.
r2 captures the step number and '' (an empty string). I'm not sure how to handle the case where a .* is followed by a string.
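A minimal reproduction of the behaviour described above, assuming each heading is matched on its own with re.match (the variable names headings, greedy and lazy are illustrative, not from the original post):

```python
import re

headings = [
    "Step 1 Introduce The Assets:",
    "Step2 Verifying the Assets",
    "Step 3Making sure all the data is in the right place:",
]

r1 = r"Step ?([0-9]+) ?(.*) ?:?"
r2 = r"Step ?([0-9]+) ?(.*?) ?:?"

# Greedy: (.*) runs to the end of the line, so the optional " ?:?" that
# follows matches nothing and the trailing colon ends up inside group 2.
greedy = [re.match(r1, h).groups() for h in headings]

# Lazy: (.*?) takes as little as possible, and since everything after it
# is optional, the overall match succeeds with group 2 still empty.
lazy = [re.match(r2, h).groups() for h in headings]

print(greedy[0])  # ('1', 'Introduce The Assets:')
print(lazy[0])    # ('1', '')
```

In both cases nothing after the second group is required to match, so the regex engine has no reason to hand the colon over to ":?" (greedy) or to grow the group at all (lazy).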
Necessary Edit:
The heading might contain a : inside the string; I just want to ignore the trailing one. I know I can use strip(':'), but I want to understand what I'm doing wrong.
CodePudding user response:
You can write the pattern using a negated character class, without the non-greedy and optional parts:
\bStep ?(\d+) ?([^:\n]+)
\bStep ? matches the word Step and an optional space
(\d+) ? captures 1+ digits in group 1, followed by an optional space
([^:\n]+) captures 1+ chars other than : or a newline in group 2
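As a rough sketch of how this pattern behaves on the sample headings from the question, assuming they are held in one multi-line string and scanned with findall (the names text, pattern and steps are illustrative):

```python
import re

text = """Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:"""

pattern = re.compile(r"\bStep ?(\d+) ?([^:\n]+)")

# findall returns one (step number, heading) tuple per match.
# Group 2 stops at the first colon or newline, so the trailing
# colon is never captured.
steps = pattern.findall(text)

for number, heading in steps:
    print(number, heading)
```

Note that because group 2 stops at the first colon, a heading that contains a colon in the middle would be cut off at that point.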
If the colon has to be at the end of the string:
\bStep ?(\d+) ?([^:\n]+):?$
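A short sketch of the anchored pattern, assuming re.MULTILINE is used so that $ also matches at the end of each line. The second pattern is not part of the answer above: it is one possible variant of the question's r2 with an added $ anchor, shown on a hypothetical heading that contains an inner colon.

```python
import re

anchored = re.compile(r"\bStep ?(\d+) ?([^:\n]+):?$", re.MULTILINE)

# Assumption: the question's lazy pattern with a $ anchor added, so that
# (.*?) is forced to expand until only an optional " :" remains.
lazy_anchored = re.compile(r"\bStep ?(\d+) ?(.*?) ?:?$", re.MULTILINE)

plain = "Step 1 Introduce The Assets:"
inner = "Step 4 Assets: the final check:"  # hypothetical heading with an inner colon

plain_result = anchored.search(plain).groups()
inner_anchored = anchored.search(inner)            # no match: [^:\n]+ cannot cross the inner colon
inner_lazy = lazy_anchored.search(inner).groups()

print(plain_result)    # ('1', 'Introduce The Assets')
print(inner_anchored)  # None
print(inner_lazy)      # ('4', 'Assets: the final check')
```

The anchor is what makes the lazy quantifier useful here: without something mandatory after (.*?), the shortest possible match is always the empty string.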