I have the following code in a Python file called benchmark.py:
source = """
for i in range(1000):
a = len(str(i))
"""
import timeit
print(timeit.timeit(stmt=source, number=100000))
When I run this with multiple Python versions, I see a drastic performance difference.
C:\Users\Username\Desktop>py -3.10 benchmark.py
16.79652149998583
C:\Users\Username\Desktop>py -3.11 benchmark.py
10.92280820000451
As you can see, this code runs faster with Python 3.11 than with previous Python versions. I tried to disassemble the bytecode to understand the reason for this behaviour, but I could only see a difference in opcode names (CALL_FUNCTION is replaced by the PRECALL and CALL opcodes).
I am not quite sure whether that's the reason for this performance change, so I am looking for an answer that justifies it with reference to the CPython source code.
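For reference, the bytecode dumps below can be reproduced with a snippet along these lines:

import dis

source = """
for i in range(1000):
    a = len(str(i))
"""

# Compile the snippet and dump its bytecode.
dis.dis(compile(source, "<string>", "exec"))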
Python 3.11 bytecode:

  0           0 RESUME                   0

  2           2 PUSH_NULL
              4 LOAD_NAME                0 (range)
              6 LOAD_CONST               0 (1000)
              8 PRECALL                  1
             12 CALL                     1
             22 GET_ITER
        >>   24 FOR_ITER                22 (to 70)
             26 STORE_NAME               1 (i)

  3          28 PUSH_NULL
             30 LOAD_NAME                2 (len)
             32 PUSH_NULL
             34 LOAD_NAME                3 (str)
             36 LOAD_NAME                1 (i)
             38 PRECALL                  1
             42 CALL                     1
             52 PRECALL                  1
             56 CALL                     1
             66 STORE_NAME               4 (a)
             68 JUMP_BACKWARD           23 (to 24)

  2     >>   70 LOAD_CONST               1 (None)
             72 RETURN_VALUE
Python 3.10 bytecode:

  2           0 LOAD_NAME                0 (range)
              2 LOAD_CONST               0 (1000)
              4 CALL_FUNCTION            1
              6 GET_ITER
        >>    8 FOR_ITER                 8 (to 26)
             10 STORE_NAME               1 (i)

  3          12 LOAD_NAME                2 (len)
             14 LOAD_NAME                3 (str)
             16 LOAD_NAME                1 (i)
             18 CALL_FUNCTION            1
             20 CALL_FUNCTION            1
             22 STORE_NAME               4 (a)
             24 JUMP_ABSOLUTE            4 (to 8)

  2     >>   26 LOAD_CONST               1 (None)
             28 RETURN_VALUE
PS: I understand that Python 3.11 introduced a bunch of performance improvements, but I am curious to understand which optimization makes this code run faster in Python 3.11.
CodePudding user response:
There's a big section in the "what's new" page labeled "faster runtime". It looks like the most likely cause of the speedup here is PEP 659, which is a first start towards JIT optimization (perhaps not quite JIT compilation, but definitely JIT optimization).
Particularly, the lookup and call for len and str now bypass a lot of dynamic machinery in the overwhelmingly common case where the built-ins aren't shadowed or overridden. The global and builtin dict lookups to resolve the names get skipped on a fast path, and the underlying C routines for len and str are called directly, instead of going through the general-purpose function call handling.
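You can actually watch the specializing interpreter do this. Here's an illustrative sketch (the exact opcode names are 3.11 implementation details and may vary between releases): run the function enough times for its code object to get quickened, then disassemble it with adaptive=True:

import dis

def f():
    for i in range(1000):
        a = len(str(i))

# CPython 3.11 only quickens and specializes a code object after it
# has been executed a number of times, so warm it up first.
for _ in range(16):
    f()

# adaptive=True shows the specialized instructions that replaced the
# generic ones, e.g. PRECALL_NO_KW_STR_1 for the str(i) call.
dis.dis(f, adaptive=True)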
You wanted source references, so here's one. The str call will get specialized in specialize_class_call:
if (tp->tp_flags & Py_TPFLAGS_IMMUTABLETYPE) {
    if (nargs == 1 && kwnames == NULL && oparg == 1) {
        if (tp == &PyUnicode_Type) {
            _Py_SET_OPCODE(*instr, PRECALL_NO_KW_STR_1);
            return 0;
        }
where it detects that the call is a call to the str builtin with 1 positional argument and no keywords, and replaces the corresponding PRECALL opcode with PRECALL_NO_KW_STR_1. The handling for the PRECALL_NO_KW_STR_1 opcode in the bytecode evaluation loop looks like this:
TARGET(PRECALL_NO_KW_STR_1) {
    assert(call_shape.kwnames == NULL);
    assert(cframe.use_tracing == 0);
    assert(oparg == 1);
    DEOPT_IF(is_method(stack_pointer, 1), PRECALL);
    PyObject *callable = PEEK(2);
    DEOPT_IF(callable != (PyObject *)&PyUnicode_Type, PRECALL);
    STAT_INC(PRECALL, hit);
    SKIP_CALL();
    PyObject *arg = TOP();
    PyObject *res = PyObject_Str(arg);
    Py_DECREF(arg);
    Py_DECREF(&PyUnicode_Type);
    STACK_SHRINK(2);
    SET_TOP(res);
    if (res == NULL) {
        goto error;
    }
    CHECK_EVAL_BREAKER();
    DISPATCH();
}
which consists mostly of a bunch of safety prechecks and reference fiddling wrapped around a call to PyObject_Str, the C routine for calling str on an object.
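One rough way to see those DEOPT_IF guards matter (an illustrative benchmark of my own, not from the CPython sources): shadow str with a plain Python function, so the callable-identity guard fails and the interpreter falls back to the generic call path. Note that the wrapper itself also adds call overhead, so this only approximates the cost of losing the specialization:

import timeit

stmt = """
for i in range(1000):
    a = len(str(i))
"""

# Baseline: "str" is the real builtin type, so the fast path applies.
fast = timeit.timeit(stmt=stmt, number=10_000)

# Shadowed: "str" is now an ordinary Python function, so the
# str-specific specialization cannot be used at this call site.
slow = timeit.timeit(stmt=stmt, setup="def str(x): return repr(x)",
                     number=10_000)

print(f"builtin str:  {fast:.2f}s")
print(f"shadowed str: {slow:.2f}s")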
Python 3.11 includes many other performance enhancements besides the above, including optimizations to stack frame creation, method lookup, common arithmetic operations, interpreter startup, and more. Most code should run much faster now, barring things like I/O-bound workloads and code that spends most of its time in C library code (like NumPy).