re.sub not acting as I would expect, please explain what is happening here?

Time:02-17

For simplification, say I have the following code:

import re

line = '(5) 3:16 The footnote explaination for footnote number one here.'

# trying to match a literal open parenthesis, followed by a number,
# followed by closing parenthesis - with match.group(1) being the number.
match = re.match(r'\((\d+)\)', line)


reordered_num = 1

renumbered_line_1 = re.sub(match.group(0), '{}'.format(reordered_num), line )
renumbered_line_2 = re.sub(match.group(1), '{}'.format(reordered_num), line )

I expected renumbered_line_1 to have substituted "1" in place of "(5)" in the text.

I expected renumbered_line_2 to have substituted "1" in place of "5" in the text.

Question: Why do both renumbered_line_1 and renumbered_line_2 have exactly the same contents, namely:

(1) 3:16 The footnote explaination for footnote number one here.

Is this a bug in Python 3.9.7 running on Mac, or is there something that I am not understanding here?

Python 3.9.7 (default, Sep  3 2021, 12:45:31)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>>
>>> line = '(5) 3:16 The footnote explaination for footnote number one here.'
>>>
>>> # trying to match a literal open parenthesis, followed by a number,
>>> # followed by closing parenthesis - with match.group(1) being the number.
>>> match = re.match(r'\((\d+)\)', line)
>>>
>>>
>>> reordered_num = 1
>>>
>>> renumbered_line_1 = re.sub(match.group(0), '{}'.format(reordered_num), line )
>>> renumbered_line_2 = re.sub(match.group(1), '{}'.format(reordered_num), line )
>>>
>>> renumbered_line_1
'(1) 3:16 The footnote explaination for footnote number one here.'
>>> renumbered_line_2
'(1) 3:16 The footnote explaination for footnote number one here.'
>>>

CodePudding user response:

In your code, match.group(0) and match.group(1) are (5) and 5 respectively, and re.sub treats its first argument as a regex pattern, not as a literal string. So this is what you are doing:

>>> re.sub('(5)', '1', '(5) 3:16 the footnote')
'(1) 3:16 the footnote'
>>> re.sub('5', '1', '(5) 3:16 the footnote')
'(1) 3:16 the footnote'

In both cases only the 5 is replaced, because unescaped parentheses are grouping metacharacters in a regex: the pattern (5) is the single character 5 inside a capture group. It matches (and captures) the single character 5 in your string, so that is what you replace.
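To see this directly, here is a small sketch that inspects what the unescaped pattern actually matches. The variable names (text, m, matched_text, captured, span) are only illustrative and not from your code.

```python
import re

text = '(5) 3:16 the footnote'

# Unescaped parentheses form a capture group, so '(5)' looks for the digit 5 only.
m = re.search('(5)', text)

matched_text = m.group(0)  # the whole match
captured = m.group(1)      # what the group captured
span = m.span()            # where the match sits inside text

print(matched_text, captured, span)
```

The match covers only the digit at index 1, so the literal parentheses around it in the string are never part of what re.sub replaces.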

If you wanted to replace the string (5) including parentheses, you could do any of the following:

  • manually escape the parentheses:
    re.sub(r'\(5\)', '1', '(5) 3:16 the footnote')
    
  • use re.escape to escape the parentheses:
    re.sub(re.escape('(5)'), '1', '(5) 3:16 the footnote')
    
  • use non-regex replace:
    '(5) 3:16 the footnote'.replace('(5)', '1')
    

I recommend the third option, since you do not appear to be trying to use regex functionality at that point in your code.
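If you want to check that the options agree, a minimal sketch like the following may help. It also prints what re.escape produces; the names literal, escaped, via_regex and via_replace are made up for the illustration.

```python
import re

text = '(5) 3:16 the footnote'
literal = '(5)'

# re.escape backslash-escapes the regex metacharacters in a literal string.
escaped = re.escape(literal)
print(escaped)

# The escaped pattern and the plain string replacement give the same result.
via_regex = re.sub(escaped, '1', text)
via_replace = text.replace(literal, '1')

print(via_regex)
print(via_replace)
```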

So it would look like this in your code:

renumbered_line_1 = line.replace(match.group(0), str(reordered_num))
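One thing to watch for: str.replace rewrites every occurrence by default. If a line could mention the same marker twice, passing a count of 1 limits the change to the first one. The sketch below uses a hypothetical line that I made up to show the difference; it is not from your data.

```python
import re

# Hypothetical line in which the marker "(5)" appears twice.
line = '(5) 3:16 Compare note (5) in the previous chapter.'
reordered_num = 1

match = re.match(r'\((\d+)\)', line)

# Without a count, every "(5)" in the line is replaced.
all_replaced = line.replace(match.group(0), str(reordered_num))

# With count=1, only the first occurrence (the leading marker) is replaced.
first_only = line.replace(match.group(0), str(reordered_num), 1)

print(all_replaced)
print(first_only)
```

Since re.match only matches at the start of the line, the first occurrence is always the leading marker here.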