Regex should not substitute character before matching string

I have the following regex problem:

import re
a1 = '<href="§ 5"> asd'
a2 = 'asdas § 5 asdas '

a1 = re.sub(r'[^"](§ \d+)', r"§\1", a1)
a2 = re.sub(r'[^"](§ \d+)', r"§\1", a2)

For string a1, nothing should be substituted, because a " comes directly before the § and its number. That works fine. For the second string a2, the substitution should take place and is supposed to result in asdas §§ 5 asdas. But the regex also replaces the space before the §, resulting in asdas§§ 5 asdas. How can I change the regex so that it doesn't include the space before the §?
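
To see where the space goes, it can help to look at what the pattern actually matches. The following is a minimal sketch, assuming the pattern in question is r'[^"](§ \d+)'; the names whole_match, captured and result are only illustrative.

```python
import re

a2 = 'asdas § 5 asdas '
pattern = r'[^"](§ \d+)'  # assumed form of the pattern from the question

m = re.search(pattern, a2)
whole_match = m.group(0)   # includes the character matched by [^"]
captured = m.group(1)      # only the part inside the parentheses

# re.sub replaces the whole match, so the leading space is dropped
# unless the replacement string puts it back.
result = re.sub(pattern, r'§\1', a2)

print(repr(whole_match))  # ' § 5'
print(repr(captured))     # '§ 5'
print(repr(result))       # 'asdas§§ 5 asdas '
```

The character class [^"] consumes one character, and that character is part of the text re.sub replaces.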

CodePudding user response：

Use

re.sub(r'(^|[^"])(§\s*\d+)', r'\1§\2', a1)


EXPLANATION

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    ^                        the beginning of the string
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [^"]                     any character except: '"'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    §                        '§'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
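
As a quick check, the two-group pattern explained above can be run against both strings from the question, plus a string that starts with '§'. This is only a sketch: the helper name double_section_sign and the third test string are made up for illustration.

```python
import re

# Group 1 keeps whatever precedes '§' (or nothing at the start of the string);
# group 2 is the '§', optional whitespace, and the digits.
PATTERN = re.compile(r'(^|[^"])(§\s*\d+)')

def double_section_sign(text):
    """Hypothetical helper: write group 1 back, then '§', then group 2."""
    return PATTERN.sub(r'\1§\2', text)

print(double_section_sign('<href="§ 5"> asd'))  # <href="§ 5"> asd
print(double_section_sign('asdas § 5 asdas '))  # asdas §§ 5 asdas
print(double_section_sign('§ 5 at the start'))  # §§ 5 at the start
```

Because the replacement begins with \1, the character consumed before '§' is written back, which is what keeps the space in place.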

CodePudding user response：

My preference would be to use no capture groups at all, by replacing a zero-width match of the following regular expression with '§':

rgx = r'(?<!")(?=§ \d)'

re.sub(rgx, '§', '<href="§ 5"> asd')    #=> '<href="§ 5"> asd'
re.sub(rgx, '§', 'asdas § 5 asdas ')    #=> 'asdas §§ 5 asdas '
re.sub(rgx, '§', '§ 5 the cat came in') #=> '§§ 5 the cat came in'


(?<!") is a negative lookbehind which asserts that the current string location is not preceded by a double-quote. (?=§ \d) is a positive lookahead which asserts that the current string location is followed by '§ ' and then a digit. When both lookarounds are satisfied, the match is the zero-width location immediately before '§', and re.sub replaces that location with the string '§', which in effect inserts it.
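
The zero-width nature of these matches can be made visible with re.finditer, which reports where each match starts and ends. This is a small sketch; match_spans is a hypothetical helper and the string '§ 1§ 2' is an invented test case.

```python
import re

rgx = r'(?<!")(?=§ \d)'

def match_spans(text):
    """Hypothetical helper: (start, end) of every match of rgx in text."""
    return [m.span() for m in re.finditer(rgx, text)]

# Each span has start == end, so no characters are consumed.
print(match_spans('asdas § 5 asdas '))  # [(6, 6)]
print(match_spans('<href="§ 5"> asd'))  # []
print(match_spans('§ 1§ 2'))            # [(0, 0), (3, 3)]

# Replacing an empty match is the same as inserting at that position.
print(re.sub(rgx, '§', '§ 1§ 2'))       # §§ 1§§ 2
```

Since nothing is consumed, occurrences that sit directly next to each other, as in '§ 1§ 2', are each matched.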

Notice that (?=§ \d+) is satisfied if (?=§ \d) is satisfied, so matching a single digit after the space is sufficient.
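
The claim about the single digit can be checked by running both lookahead variants over the same inputs. The sample strings below are invented for illustration and are not from the question.

```python
import re

one_digit = r'(?<!")(?=§ \d)'
many_digits = r'(?<!")(?=§ \d+)'

# Invented sample inputs, including multi-digit and non-matching cases.
samples = ['see § 42 and "§ 7"', '§ 123', 'no section sign here', '§ x']

# A lookahead only tests what follows, so one digit decides the outcome.
for s in samples:
    assert re.sub(one_digit, '§', s) == re.sub(many_digits, '§', s)

results = [re.sub(one_digit, '§', s) for s in samples]
print(results)
```

Both patterns insert '§' at the same positions for these inputs, because the lookahead never consumes the digits it inspects.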

I also like the way this reads: "insert '§' before '§' when the latter is followed by a space and then a digit, and is not preceded by a double-quote".