Replacing sub-string occurrences with elements of a given list

Time:01-23

Suppose I have a string that has the same sub-string repeated multiple times and I want to replace each occurrence with a different element from a list.

For example, consider this scenario:

pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert

The goal is to obtain a string of the form:

s = "a(_000_), b(_001_), c(_002_)"

The number of occurrences is known, and the list r has the same length as the number of occurrences (3 in the previous example) and contains increasing integers starting from 0.

I came up with this solution:

import re

pattern = "_____"
s = "a(_____), b(_____), c(_____)"

l = [m.start() for m in re.finditer(pattern, s)]
i = 0
for el in l:
    s = s[:el] + f"_{str(i).zfill(5 - 2)}_" + s[el + 5:]
    i += 1

print(s)

Output: a(_000_), b(_001_), c(_002_)

This solves my problem, but it seems a bit cumbersome to me, especially the for-loop. Is there a better, maybe more "pythonic" way (meaning concise, possibly elegant, whatever that means) to solve the task?

CodePudding user response:

You can simply use the re.sub() method with count=1 in a loop. Each call replaces only the first remaining occurrence of the pattern, so the elements of the list are inserted in order.

import re

pattern = "_____"
s = "a(_____), b(_____), c(_____)"
r = [0,1,2]

for val in r:
    # count=1 replaces only the leftmost remaining occurrence
    s = re.sub(pattern, f"_{val:03d}_", s, count=1)
print(s)

Output: a(_000_), b(_001_), c(_002_)
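
To see why the loop works, it may help to look at single steps in isolation. The following is only a minimal sketch (the names first_step and second_step are made up for illustration): with count=1, re.sub replaces just the leftmost match and leaves the rest untouched, so the next call picks up the following occurrence. Note that every call rescans the string from the beginning.

```python
import re

pattern = "_____"
s = "a(_____), b(_____), c(_____)"

# count=1 limits the substitution to the leftmost match only.
first_step = re.sub(pattern, "_000_", s, count=1)
print(first_step)  # a(_000_), b(_____), c(_____)

# "_000_" no longer contains five underscores in a row, so the next
# call skips it and moves on to the second placeholder.
second_step = re.sub(pattern, "_001_", first_step, count=1)
print(second_step)  # a(_000_), b(_001_), c(_____)
```

This relies on the replacement text not matching the pattern itself, which holds for the zero-padded numbers used here.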

CodePudding user response:

TL;DR

Use re.sub with a replacement callable and an iterator:

import re

p = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0, 1, 2]

it = iter(r)

print(re.sub(p, lambda _: f"_{next(it):03d}_", s))

Long version

Generally speaking, it is a good idea to re.compile your pattern once ahead of time if you are going to use it repeatedly. The re module caches recently used patterns internally, so the speed-up is usually modest, but a compiled pattern skips that cache lookup on every call and gives the regex a name. There is basically no downside to compiling the pattern, so I would just make it a habit.
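
As a small illustration (a sketch with a made-up replacement string "X"), a compiled pattern can be passed to the module-level re.sub or used through its own sub method; both give the same result:

```python
import re

p = re.compile("_____")
s = "a(_____), b(_____)"

# Module-level function, accepting the compiled pattern as first argument.
via_module = re.sub(p, "X", s)

# Method on the compiled pattern object itself.
via_method = p.sub("X", s)

print(via_module)  # a(X), b(X)
print(via_method)  # a(X), b(X)
```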

As for avoiding the for-loop altogether, the re.sub function allows us to pass a callable as the repl argument, which takes a re.Match object as its only argument and returns a string. Wouldn't it be nice if we could have a replacement function that takes the next element from our replacements list every time it is called?

Well, since you have an iterable of replacement elements, we can leverage the iterator protocol to avoid explicit looping over the elements. All we need to do is give our replacement function access to an iterator over those elements, so that it can grab a new one via the next function every time it is called.
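
The mechanism can be sketched in isolation. In the toy example below (the pattern "x", the values 10, 20, 30 and the positions list are all invented for illustration), re.sub calls the replacement function once per match, from left to right, and each call pulls one more element from the iterator:

```python
import re

it = iter([10, 20, 30])
positions = []  # records where each match started, in call order


def repl(match):
    positions.append(match.start())
    # Each call consumes exactly one element from the iterator.
    return str(next(it))


result = re.sub("x", repl, "x-x-x")
print(result)     # 10-20-30
print(positions)  # [0, 2, 4]
```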

The string format specification that Jamiu used in his answer is great if you know that the sub-string to be replaced will always be exactly five underscores (_____) and that your replacement numbers will never exceed 999.
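
The reason for the 999 limit is that the width in 03d is only a minimum. A quick sketch (the widths dictionary is just a helper for this demonstration) shows that larger numbers are not truncated, so the result grows beyond the five characters it replaces:

```python
# 03d: zero-pad to a minimum width of three digits.
print(f"_{7:03d}_")     # _007_
print(f"_{999:03d}_")   # _999_

# Larger numbers are kept whole, so the result is six characters long.
print(f"_{1000:03d}_")  # _1000_

# Length of the formatted replacement for each sample number.
widths = {n: len(f"_{n:03d}_") for n in (7, 999, 1000)}
print(widths)  # {7: 5, 999: 5, 1000: 6}
```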

So in its simplest form, a function doing what you described could look like this:

import re
from collections.abc import Iterable


def multi_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)

    def repl(_match: re.Match[str]) -> str:
        return f"_{next(iterator):03d}_"

    return re.sub(pattern, repl, string)

Trying it out with your example data:

if __name__ == "__main__":
    p = re.compile("_____")
    s = "a(_____), b(_____), c(_____)"
    r = [0, 1, 2]
    print(multi_replace(p, r, s))

Output: a(_000_), b(_001_), c(_002_)

In this simple application, we aren't doing anything with the Match object in our replacement function.
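
One place where the Match object does come in handy is when the counts do not line up. In your case the number of occurrences is known, but if a string had more matches than there are replacements, next(iterator) would raise StopIteration inside the callback, and I would expect that to surface from re.sub. The variant below is a hypothetical sketch, not part of the original solution: the name multi_replace_lenient and the _MISSING sentinel are my own, and it simply leaves surplus matches untouched.

```python
import re
from collections.abc import Iterable

_MISSING = object()  # sentinel meaning "no replacement left"


def multi_replace_lenient(
    pattern: re.Pattern,
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)

    def repl(match: re.Match) -> str:
        # The second argument to next() is returned instead of raising
        # StopIteration once the iterator is exhausted.
        value = next(iterator, _MISSING)
        if value is _MISSING:
            return match.group()  # keep the matched text as it is
        return f"_{value:03d}_"

    return pattern.sub(repl, string)


p = re.compile("_____")
lenient = multi_replace_lenient(p, [0, 1], "a(_____), b(_____), c(_____)")
print(lenient)  # a(_000_), b(_001_), c(_____)
```

Whether silently skipping is better than failing loudly depends on the use case; raising an error may be preferable if a mismatch indicates a bug.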


If you want to make it a bit more flexible, there are a few avenues possible. Let's say the sub-strings to replace might (perhaps unexpectedly) be a different number of underscores. Let's further assume that the numbers might get bigger than 999.

First of all, the pattern would need to change a bit. And if we still want to center the replacement in an arbitrary number of underscores, we'll actually need to access the match object in our replacement function to check the number of underscores.

The format specifiers are still useful because they allow centering the inserted object with the ^ align code.

import re
from collections.abc import Iterable


def dynamic_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)

    def repl(match: re.Match[str]) -> str:
        replacement = f"{next(iterator):03d}"
        length = len(match.group())
        return f"{replacement:_^{length}}"

    return re.sub(pattern, repl, string)

Trying it out with underscore runs of different lengths:

if __name__ == "__main__":
    p = re.compile("(_+)")
    s = "a(_____), b(_____), c(_____), d(_______), e(___)"
    r = [0, 1, 2, 30, 4000]

    print(dynamic_replace(p, r, s))

Output: a(_000_), b(_001_), c(_002_), d(__030__), e(4000)

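Here we are building the replacement string based on the length of the match group (i.e. the number of underscores) to ensure the number is always centered. If the number has more digits than there are underscores, as in the last case, it is inserted whole and the result is longer than the original match.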
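
The nested format spec is the least obvious part, so here is a minimal sketch of how it behaves on its own (the variable names centered, overflow and uneven are only for this demonstration):

```python
replacement = "030"
length = 7

# The inner {length} is substituted first, giving the spec "_^7":
# fill with "_", center-align, minimum width 7.
centered = f"{replacement:_^{length}}"
print(centered)  # __030__

# The width is a minimum, so a value longer than the match is kept whole.
overflow = f"{'4000':_^3}"
print(overflow)  # 4000

# With an odd amount of padding, the extra fill character goes on the right.
uneven = f"{'000':_^6}"
print(uneven)  # _000__
```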

I think you get the idea. As always, separation of concerns is a good idea: you can put the replacement logic in its own function and refer to it whenever you need to adjust it.
