I tested s.replace('a', '')
where s
is a string of two million 'a'
and potentially a single outlier 'b'
at start, middle or end. That single outlier made it much slower:
At TIO with their Python 3.8 pre-release version:
a = 'a' * 10**6; s = f'{a}{a}' 5 ms 5 ms 5 ms
a = 'a' * 10**6; s = f'b{a}{a}' 24 ms 24 ms 24 ms
a = 'a' * 10**6; s = f'{a}b{a}' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'{a}{a}b' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 25 ms 25 ms 25 ms
On my old laptop with Python 3.10:
a = 'a' * 10**6; s = f'{a}{a}' 4 ms 4 ms 4 ms
a = 'a' * 10**6; s = f'b{a}{a}' 93 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'{a}b{a}' 94 ms 95 ms 95 ms
a = 'a' * 10**6; s = f'{a}{a}b' 94 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 95 ms 95 ms 96 ms
How does that single outlier 'b
' make it so much slower?
Full benchmark code:
from timeit import repeat
setups = [
"a = 'a' * 10**6; s = f'{a}{a}'",
"a = 'a' * 10**6; s = f'b{a}{a}'",
"a = 'a' * 10**6; s = f'{a}b{a}'",
"a = 'a' * 10**6; s = f'{a}{a}b'",
"a = 'a' * 10**6; s = f'b{a}b{a}b'",
]
for _ in range(3):
for setup in setups:
times = repeat("s.replace('a', '')", setup, number=1)
print(f'{setup:{max(map(len, setups))}}',
*('= ms ' % (t * 1e3) for t in sorted(times)[:3]))
print()
CodePudding user response:
str.replace()
is implemented at https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L10604 in the C function replace()
. Here's an excerpt from that function. It shows that if the size of the returned string (new_size
) will be 0, we stop early and return an empty string. Otherwise, we perform the long task of looping through the input string and performing the replacements one-by-one.
new_size = slen n * (len2 - len1);
if (new_size == 0) {
u = unicode_new_empty();
goto done;
}
if (new_size > (PY_SSIZE_T_MAX / rkind)) {
PyErr_SetString(PyExc_OverflowError,
"replace string is too long");
goto error;
}
u = PyUnicode_New(new_size, maxchar);
if (!u)
goto error;
assert(PyUnicode_KIND(u) == rkind);
res = PyUnicode_DATA(u);
ires = i = 0;
if (len1 > 0) {
while (n-- > 0) {
/* look for next match */
j = anylib_find(rkind, self,
sbuf rkind * i, slen-i,
str1, buf1, len1, i);
if (j == -1)
break;
else if (j > i) {
/* copy unchanged part [i:j] */
memcpy(res rkind * ires,
sbuf rkind * i,
rkind * (j-i));
ires = j - i;
}
/* copy substitution string */
if (len2 > 0) {
memcpy(res rkind * ires,
buf2,
rkind * len2);
ires = len2;
}
i = j len1;
}
if (i < slen)
/* copy tail [i:] */
memcpy(res rkind * ires,
sbuf rkind * i,
rkind * (slen-i));
}