Here are two measurements:
timeit.timeit('"toto"=="1234"', number=100000000)
1.8320042459999968
timeit.timeit('"toto"=="toto"', number=100000000)
1.4517491540000265
As you can see, comparing two strings that match is faster than comparing two strings of the same size that do not match.
This is quite disturbing: I believed that during a string comparison Python tested strings character by character, so "toto"=="toto" should take longer to test than "toto"=="1234", as it requires four character tests against one for the non-matching comparison. Maybe the comparison is hash-based, but in that case the timings should be the same for both comparisons.
Do you have an idea why?
CodePudding user response:
Combining my comment and the comment by @khelwood:
TL;DR:
Analysing the bytecode for the two comparisons reveals that the two 'time' literals refer to the same object. An up-front identity check (at the C level) is therefore the reason for the faster comparison.
Bytecode:
import dis
In [24]: dis.dis("'time'=='time'")
  1           0 LOAD_CONST               0 ('time')   # <-- same object (0)
              2 LOAD_CONST               0 ('time')   # <-- same object (0)
              4 COMPARE_OP               2 (==)
              6 RETURN_VALUE
In [25]: dis.dis("'time'=='1234'")
  1           0 LOAD_CONST               0 ('time')   # <-- different object (0)
              2 LOAD_CONST               1 ('1234')   # <-- different object (1)
              4 COMPARE_OP               2 (==)
              6 RETURN_VALUE
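The sharing of the constant can also be checked without reading the disassembly. The following is a minimal sketch, assuming CPython: merging equal literals inside one code object is an implementation detail, not a language guarantee. The names source, code and namespace are only illustrative.

```python
# Assumption: CPython. Merging equal literals inside one code object is an
# implementation detail, not something the language guarantees.
source = "x = 'time'; y = 'time'; z = '1234'"
code = compile(source, "<example>", "exec")

# How many times the literal 'time' is stored in the constants table.
time_constants = code.co_consts.count("time")
print(time_constants)

namespace = {}
exec(code, namespace)

# x and y were loaded from the same constant slot, so they are one object;
# z was loaded from a different slot.
same_object = namespace["x"] is namespace["y"]
different_object = namespace["x"] is namespace["z"]
print(same_object, different_object)
```

If the two 'time' literals are merged, the count is 1 and x is y holds, which is exactly the situation where the identity check can short-circuit the comparison.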
Assignment Timing:
The 'speed-up' can also be seen when using assignment in the timing tests. Assigning two variables to the same string and comparing them is faster than assigning two variables to different strings and comparing them. This further supports the hypothesis that the underlying logic performs an object identity comparison first, which is confirmed in the next section.
In [26]: timeit.timeit("x='time'; y='time'; x==y", number=1000000)
Out[26]: 0.0745926329982467
In [27]: timeit.timeit("x='time'; y='1234'; x==y", number=1000000)
Out[27]: 0.10328884399496019
Python source code:
As helpfully provided by @mkrieger1 and @Masklinn in their comments, the source code in unicodeobject.c performs a pointer comparison first and, if it is true, returns immediately.
int
_PyUnicode_Equal(PyObject *str1, PyObject *str2)
{
    assert(PyUnicode_CheckExact(str1));
    assert(PyUnicode_CheckExact(str2));
    if (str1 == str2) {  // <-- Here
        return 1;
    }
    if (PyUnicode_READY(str1) || PyUnicode_READY(str2)) {
        return -1;
    }
    return unicode_compare_eq(str1, str2);
}
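To make the control flow easier to follow, here is a rough Python model of an equality check with an identity fast path. It is only a sketch: str_equal_model is a hypothetical helper that mirrors the shape of the C function above, not CPython's actual implementation (which compares the underlying buffers in C).

```python
def str_equal_model(a: str, b: str) -> bool:
    """Illustrative model of an equality check with an identity fast path.

    Hypothetical helper: it mirrors the shape of the C function above,
    it is not how CPython actually compares string contents.
    """
    # Fast path: an object is trivially equal to itself.
    if a is b:
        return True
    # Strings of different lengths cannot be equal.
    if len(a) != len(b):
        return False
    # Slow path: compare contents, stopping at the first mismatch.
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            return False
    return True


x = "toto"
z = "".join(["to", "to"])  # equal to x, but built at runtime

print(str_equal_model(x, x))       # identity fast path
print(str_equal_model(x, z))       # full content comparison
print(str_equal_model(x, "1234"))  # mismatch on the first character
```

Only the first call can return without looking at any characters; the other two have to go through the length check and the content loop.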
Appendix:
- Reference answer nicely illustrating how to read the disassembled bytecode output.
CodePudding user response:
A demonstration that identity is indeed the reason for this behavior (as @S3DEV has brilliantly explained) is this one:
>>> x = 'toto'
>>> y = 'toto'
>>> z = 'totoo'[:-1]
>>> w = 'abcd'
>>> x == y
True
>>> x == z
True
>>> x == w
False
>>> id(x) == id(y)
True
>>> id(x) == id(z)
False
>>> id(x) == id(w)
False
>>> timeit.timeit('x==y', number=100000000, globals={'x': x, 'y': y})
3.893762200000083
>>> timeit.timeit('x==z', number=100000000, globals={'x': x, 'z': z})
4.205321462000029
>>> timeit.timeit('x==w', number=100000000, globals={'x': x, 'w': w})
4.15288594499998
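Comparing two strings that are the same object (same id) is the fastest case: the equality check can return as soon as it sees that both operands are identical, without looking at their contents.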
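With four-character strings most of the measured time is interpreter overhead, so the gap is small. A sketch with much longer strings should make the identity shortcut more visible. The size n and the variable names are arbitrary choices for illustration, and the claim that the second join produces a distinct object assumes CPython.

```python
import timeit

n = 1_000_000  # assumed size, large enough for the content scan to dominate

base = "".join(["a"] * n)
alias = base                 # same object as base
copy = "".join(["a"] * n)    # equal contents, separate object (in CPython)

t_alias = timeit.timeit(
    "base == alias", number=1000, globals={"base": base, "alias": alias}
)
t_copy = timeit.timeit(
    "base == copy", number=1000, globals={"base": base, "copy": copy}
)

print(f"identical object: {t_alias:.6f} s")
print(f"equal copy:       {t_copy:.6f} s")
```

The identical-object comparison does not depend on n, while the equal-copy comparison has to scan the whole string before it can answer True.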