Why is str.replace so much slower with a single outlier?-CodePudding

I tested s.replace('a', '') where s is a string of two million 'a' and potentially a single outlier 'b' at start, middle or end. That single outlier made it much slower:

At TIO with their Python 3.8 pre-release version:

a = 'a' * 10**6; s = f'{a}{a}'      5 ms    5 ms    5 ms 
a = 'a' * 10**6; s = f'b{a}{a}'    24 ms   24 ms   24 ms 
a = 'a' * 10**6; s = f'{a}b{a}'    25 ms   25 ms   25 ms 
a = 'a' * 10**6; s = f'{a}{a}b'    25 ms   25 ms   25 ms 
a = 'a' * 10**6; s = f'b{a}b{a}b'  25 ms   25 ms   25 ms

On my old laptop with Python 3.10:

a = 'a' * 10**6; s = f'{a}{a}'      4 ms    4 ms    4 ms 
a = 'a' * 10**6; s = f'b{a}{a}'    93 ms   94 ms   95 ms 
a = 'a' * 10**6; s = f'{a}b{a}'    94 ms   95 ms   95 ms 
a = 'a' * 10**6; s = f'{a}{a}b'    94 ms   94 ms   95 ms 
a = 'a' * 10**6; s = f'b{a}b{a}b'  95 ms   95 ms   96 ms

How does that single outlier 'b' make it so much slower?

Full benchmark code:

from timeit import repeat

setups = [
    "a = 'a' * 10**6; s = f'{a}{a}'",
    "a = 'a' * 10**6; s = f'b{a}{a}'",
    "a = 'a' * 10**6; s = f'{a}b{a}'",
    "a = 'a' * 10**6; s = f'{a}{a}b'",
    "a = 'a' * 10**6; s = f'b{a}b{a}b'",
]

for _ in range(3):
    for setup in setups:
        times = repeat("s.replace('a', '')", setup, number=1)
        print(f'{setup:{max(map(len, setups))}}',
              *('= ms ' % (t * 1e3) for t in sorted(times)[:3]))
    print()

CodePudding user response：

str.replace() is implemented at https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L10604 in the C function replace(). Here's an excerpt from that function. It shows that if the size of the returned string (new_size) will be 0, we stop early and return an empty string. Otherwise, we perform the long task of looping through the input string and performing the replacements one-by-one.

new_size = slen   n * (len2 - len1);
if (new_size == 0) {
    u = unicode_new_empty();
    goto done;
}
if (new_size > (PY_SSIZE_T_MAX / rkind)) {
    PyErr_SetString(PyExc_OverflowError,
                    "replace string is too long");
    goto error;
}
u = PyUnicode_New(new_size, maxchar);
if (!u)
    goto error;
assert(PyUnicode_KIND(u) == rkind);
res = PyUnicode_DATA(u);
ires = i = 0;
if (len1 > 0) {
    while (n-- > 0) {
        /* look for next match */
        j = anylib_find(rkind, self,
                        sbuf   rkind * i, slen-i,
                        str1, buf1, len1, i);
        if (j == -1)
            break;
        else if (j > i) {
            /* copy unchanged part [i:j] */
            memcpy(res   rkind * ires,
                   sbuf   rkind * i,
                   rkind * (j-i));
            ires  = j - i;
        }
        /* copy substitution string */
        if (len2 > 0) {
            memcpy(res   rkind * ires,
                   buf2,
                   rkind * len2);
            ires  = len2;
        }
        i = j   len1;
    }
    if (i < slen)
        /* copy tail [i:] */
        memcpy(res   rkind * ires,
               sbuf   rkind * i,
               rkind * (slen-i));
}