Generalized replacement by matching group id

Time:11-06

Given a string of the form <digit>-<non-digit> or <non-digit>-<digit>, I need to remove the hyphen (in Python). I.e. 2-f becomes 2f, f-2 becomes f2.

So far I have (?:\d-\D)|(?:\D-\d), which finds the patterns, but I can't figure out how to remove just the hyphen. In particular:

  • if I sub the regex above, it will replace the surrounding characters (because they are the ones matched);
  • I can use (?:(\d)-(\D))|(?:(\D)-(\d)) to capture the characters explicitly; substituting with \1\2 then correctly turns 2-f into 2f, but it fails for f-2, because there the characters land in the 3rd and 4th groups and the replacement would have to be \3\4. I tried naming the groups instead, but that failed because all group names need to be unique.
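The group-numbering problem from the second bullet can be reproduced with a short sketch. The two inputs are simply the strings from the question, and the empty result for the second one assumes Python 3.5 or later, where unmatched groups expand to an empty string:

```python
import re

# Pattern from the question: the alternatives fill groups 1-2 and 3-4 respectively.
pattern = re.compile(r"(?:(\d)-(\D))|(?:(\D)-(\d))")

# The first alternative fills groups 1 and 2, so \1\2 works here ...
digit_first = pattern.sub(r"\1\2", "2-f")
# ... but the second alternative fills groups 3 and 4, so \1\2 expands to nothing.
letter_first = pattern.sub(r"\1\2", "f-2")

print(repr(digit_first))   # '2f'
print(repr(letter_first))  # ''
```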

I know I can just run it through 2 sub statements, but is there a more elegant solution? I know regex is super-powerful if you know what you're doing... Thank you!
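For reference, the two-pass fallback mentioned above might look like the sketch below. The helper name remove_hyphen_two_pass is hypothetical and only serves as a baseline to compare the single-pattern answers against:

```python
import re

def remove_hyphen_two_pass(s):
    """Hypothetical helper: remove the hyphen with two separate substitutions."""
    # Pass 1: digit, hyphen, non-digit (2-f -> 2f)
    s = re.sub(r"(\d)-(\D)", r"\1\2", s)
    # Pass 2: non-digit, hyphen, digit (f-2 -> f2)
    return re.sub(r"(\D)-(\d)", r"\1\2", s)

print(remove_hyphen_two_pass("2-f or f-2"))  # 2f or f2
```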

CodePudding user response:

As an alternative, you can keep \1\2 as the replacement if you use the regex module from PyPI together with a branch reset group, (?|...). Inside a branch reset group, every alternative reuses the same group numbers.

(?|(\d)-(\D)|(\D)-(\d))

Note that \D can also match a space or a newline. If you want to match a non whitespace char other than a digit, you could also use [^\s\d] instead of \D.
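To illustrate the difference between \D and [^\s\d], here is a small sketch using only the standard re module. The sample string is made up for the demonstration:

```python
import re

# Made-up sample: the first hyphen is followed by a space, the second by a letter.
sample = "2- f and 2-f"

# \D accepts any non-digit, including whitespace.
loose = re.findall(r"\d-\D", sample)
# [^\s\d] accepts only characters that are neither whitespace nor digits.
strict = re.findall(r"\d-[^\s\d]", sample)

print(loose)   # ['2- ', '2-f']
print(strict)  # ['2-f']
```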

See a Python demo and regex demo.

For example:

import regex

pattern = r"(?|(\d)-(\D)|(\D)-(\d))"
s = "2-f or f-2"

print(regex.sub(pattern, r"\1\2", s))

Output

2f or f2

CodePudding user response:

There is nothing that stops you from replacing with \1\2\3\4:

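import re

text = "2-f becomes 2f, f-2 becomes f2"
# Groups that did not take part in the match expand to an empty string (Python 3.5+)
print(re.sub(r"(\d)-(\D)|(\D)-(\d)", r"\1\2\3\4", text))
# 2f becomes 2f, f2 becomes f2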

See the regex demo and the Python demo.

This works because, starting with Python 3.5, a backreference to a group that did not participate in the match is replaced with an empty string. In earlier versions it raised an error instead (see Empty string instead of unmatched group error), and you had to pass a callable as the replacement argument.
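A callable replacement of the kind mentioned above could be sketched as follows. The function name join_groups is hypothetical; it relies on match.groups() returning None for groups that did not participate in the match:

```python
import re

def join_groups(match):
    """Hypothetical replacement callable: concatenate only the groups that matched."""
    # Unmatched groups are reported as None, so they are skipped here.
    return "".join(g for g in match.groups() if g is not None)

result = re.sub(r"(\d)-(\D)|(\D)-(\d)", join_groups, "2-f or f-2")
print(result)  # 2f or f2
```

This form does not depend on how the replacement template treats unmatched groups, so it also covers the older Python versions.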

Certainly, the regex (?<=\d)-(?=\D)|(?<=\D)-(?=\d), which uses positive lookarounds instead of capturing groups, looks much cleaner in the current scenario. However, it will not work with re if the pattern inside the lookbehind is of variable length, because re requires lookbehinds to have a fixed width.
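A minimal sketch of the lookaround variant with the standard re module is shown below. The variable names and the chained input f-2-g are made up for illustration. Because the lookarounds do not consume the neighbouring characters, the two approaches can differ on such chained input:

```python
import re

lookaround = r"(?<=\d)-(?=\D)|(?<=\D)-(?=\d)"
capturing = r"(\d)-(\D)|(\D)-(\d)"

# Only the hyphen itself is matched, so the replacement is an empty string.
simple = re.sub(lookaround, "", "2-f or f-2")
print(simple)  # 2f or f2

# Chained case: the capturing pattern consumes the "2" in "f-2",
# so the following "-g" no longer has a digit to its left to match.
chained_lookaround = re.sub(lookaround, "", "f-2-g")
chained_capturing = re.sub(capturing, r"\1\2\3\4", "f-2-g")
print(chained_lookaround)  # f2g
print(chained_capturing)   # f2-g
```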
