I have the following regex problem:
import re
a1 = '<href="§ 5"> asd'
a2 = 'asdas § 5 asdas '
a1 = re.sub(r'[^"](§ \d+)', r"§\1", a1)
a2 = re.sub(r'[^"](§ \d+)', r"§\1", a2)
For string a1, nothing should be substituted, because a " comes directly before the § and its number. That works fine. For the second string, a2, the substitution should take place and is supposed to result in asdas §§ 5 asdas. But the regex also consumes the space before the §, resulting in asdas§§ 5 asdas. How can I change the regex so that it doesn't include the space before the §?
CodePudding user response:
Use
re.sub(r'(^|[^"])(§\s*\d+)', r'\1§\2', a1)
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[^"] any character except: '"'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
§ '§'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
) end of \2
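As a quick check, the pattern above can be run against both strings from the question. This is only a minimal sketch: the names PATTERN and double_section are made up here for illustration and do not come from the answer.

```python
import re

# Group 1 keeps whatever precedes the sign (or the start of the string);
# group 2 is the section sign together with its number.
PATTERN = re.compile(r'(^|[^"])(§\s*\d+)')

def double_section(text):
    # Hypothetical helper: write group 1 back unchanged, then add one extra '§'.
    return PATTERN.sub(r'\1§\2', text)

print(repr(double_section('<href="§ 5"> asd')))   # unchanged, '"' precedes '§'
print(repr(double_section('asdas § 5 asdas ')))   # space before '§' is kept
```

Because the preceding character is captured and re-emitted as \1, the space is no longer lost in the replacement.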
CodePudding user response:
My preference would be to use no capture groups at all, and instead replace a zero-width match of the following regular expression with '§':
rgx = r'(?<!")(?=§ \d)'
re.sub(rgx, '§', '<href="§ 5"> asd') #=> '<href="§ 5"> asd'
re.sub(rgx, '§', 'asdas § 5 asdas ') #=> 'asdas §§ 5 asdas '
re.sub(rgx, '§', '§ 5 the cat came in') #=> '§§ 5 the cat came in'
(?<!") is a negative lookbehind which asserts that the current string location is not preceded by a double-quote. (?=§ \d) is a positive lookahead which asserts that the current string location is followed by '§ ' and then a digit. When both lookarounds are satisfied, the match is the zero-width location immediately before '§', and re.sub replaces that location with the string '§'.
Notice that (?=§ \d+) is satisfied whenever (?=§ \d) is satisfied, so matching a single digit after the space is sufficient.
I also like the way this reads: "insert '§' before '§' when the latter is followed by a space and then a digit, and is not preceded by a double-quote".
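To make the zero-width behavior concrete, the sketch below shows where the lookaround pattern matches. The last input, with two references back to back, is a made-up string and not taken from the question.

```python
import re

rgx = re.compile(r'(?<!")(?=§ \d)')

# The match is empty: start and end are the same index, just before '§'.
m = rgx.search('asdas § 5 asdas ')
print(m.span())    # (6, 6)
print(m.group())   # empty string, nothing is consumed

# No match when a double-quote sits directly before '§'.
print(rgx.search('<href="§ 5"> asd'))   # None

# Since no characters are consumed, adjacent references are each matched
# (made-up input for illustration).
doubled = rgx.sub('§', '§ 5§ 6')
print(doubled)     # §§ 5§§ 6
```

Since the match has zero width, the replacement string is purely an insertion, which is why no backreferences are needed.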