I saw a video about the speed of loops in Python, where it was explained that sum(range(N)) is much faster than manually looping through the range and adding the values together, since the former runs in C thanks to the built-in functions, while in the latter the summation is done in (slow) Python. I was curious what happens when adding numpy to the mix. As I expected, np.sum(np.arange(N)) is the fastest, but sum(np.arange(N)) and np.sum(range(N)) are even slower than the naive for loop.
Why is this?
Here's the script I used to test, with some comments about the supposed cause of the slowdown where I know it (taken mostly from the video), and the results I got on my machine (python 3.8.10, numpy 1.19.5):
updated script:
import numpy as np
from timeit import timeit

N = 10_000_000
repetition = 10

def sum0(N = N):
    s = 0
    i = 0
    while i < N: # condition is checked in python
        s += i
        i += 1 # both additions are done in python
    return s

def sum1(N = N):
    s = 0
    for i in range(N): # increment in C
        s += i # addition in python
    return s

def sum2(N = N):
    return sum(range(N)) # everything in C

def sum3(N = N):
    return sum(list(range(N)))

def sum4(N = N):
    return np.sum(range(N)) # very slow np.array conversion

def sum5(N = N):
    # much faster np.array conversion
    # (dtype = int is what the np.int alias meant; the alias was removed in NumPy 1.24)
    return np.sum(np.fromiter(range(N),dtype = int))

def sum6(N = N):
    # possibly slow conversion to Py_long from NumPy integer scalars
    return sum(np.arange(N))

def sum7(N = N):
    # list returns a list of NumPy integer scalars
    return sum(list(np.arange(N)))

def sum7v2(N = N):
    # tolist conversion to python int seems faster than the implicit conversion
    # in sum(list()) (tolist returns a list of python int-s)
    return sum(np.arange(N).tolist())

def sum8(N = N):
    return np.sum(np.arange(N)) # everything in numpy (fortran libblas?)

def array_basic(N = N):
    return np.array(range(N))

def array_dtype(N = N):
    return np.array(range(N),dtype = int)

def array_iter(N = N):
    # np.sum's source code mentions to use fromiter to convert from generators
    return np.fromiter(range(N),dtype = int)

print(f"while loop: {timeit(sum0, number = repetition)}")
print(f"for loop: {timeit(sum1, number = repetition)}")
print(f"sum_range: {timeit(sum2, number = repetition)}")
print(f"sum_rangelist: {timeit(sum3, number = repetition)}")
print(f"npsum_range: {timeit(sum4, number = repetition)}")
print(f"npsum_fromiterrange:{timeit(sum5, number = repetition)}")
print(f"sum_arange: {timeit(sum6, number = repetition)}")
print(f"sum_list_arange: {timeit(sum7, number = repetition)}")
print(f"sum_arange_tolist: {timeit(sum7v2, number = repetition)}")
print(f"npsum_arange: {timeit(sum8, number = repetition)}")
print(f"array_basic: {timeit(array_basic, number = repetition)}")
print(f"array_dtype: {timeit(array_dtype, number = repetition)}")
print(f"array_iter: {timeit(array_iter, number = repetition)}")
# Example output:
#
# while loop: 9.249794696999743
# for loop: 6.026467555000636
# sum_range: 1.4830789409988938
# sum_rangelist: 3.6745876889999636
# npsum_range: 16.216972655000063
# npsum_fromiterrange:3.47655400199983
# sum_arange: 16.656015603000924
# sum_list_arange: 19.500842117000502
# sum_arange_tolist: 4.004777374000696
# npsum_arange: 0.2332638230000157
# array_basic: 16.1631146109994
# array_dtype: 16.550737804000164
# array_iter: 3.9803170430004684
CodePudding user response:
Let's see if I can summarize the results.
sum can work with any iterable, repeatedly asking for the next value and adding it. range is a lazy sequence (not strictly a generator, though it iterates like one) that's happy to supply the next value:
# sum_range: 1.4830789409988938
Making a list from a range takes time:
# sum_rangelist: 3.6745876889999636
Summing a pregenerated list is actually faster than summing the range:
%%timeit x = list(range(N))
sum(x)
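Outside IPython, the same comparison can be sketched with the standard timeit module. The setup string builds the list once, so only the summation is timed. The size N and the repetition count below are illustrative choices, much smaller than in the question, and the timings will differ from machine to machine.

```python
from timeit import timeit

N = 100_000  # illustrative size, smaller than the question's 10_000_000

# Only sum(x) is timed; the list is built once in the setup statement.
t_list = timeit("sum(x)", setup=f"x = list(range({N}))", number=10)

# Here the range object is created and iterated inside the timed statement.
t_range = timeit(f"sum(range({N}))", number=10)

print(f"sum(pregenerated list): {t_list:.4f} s")
print(f"sum(range):             {t_range:.4f} s")
```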
np.sum is designed to sum arrays. It's a wrapper to np.add.reduce.
np.sum has a deprecation warning for np.sum(generator), recommending the use of fromiter or Python sum:
# npsum_range: 16.216972655000063
fromiter is the best way of making an array from a generator. Using np.array on range is legacy code and may go away in the future. I think range is the only generator-like object that np.array will accept.
np.array is a general purpose function that can handle many cases, including nested arrays, and conversion to various dtypes. As such it has to process the whole input argument, deducing both shape and dtype.
# npsum_fromiterrange:3.47655400199983
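As a minimal sketch of the fromiter route, assuming the values fit in a 64-bit integer: dtype must be given explicitly, and the optional count argument tells numpy how many items to expect. The size N here is only illustrative.

```python
import numpy as np

N = 1_000  # illustrative size

# dtype is mandatory for fromiter; count is optional and lets NumPy
# allocate the output once instead of growing it while iterating.
arr = np.fromiter(range(N), dtype=np.int64, count=N)

total = np.sum(arr)
print(arr.dtype, total)
```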
Iteration on a numpy array is slower than on a list, since it has to "unbox" each element, wrapping the raw value in a new numpy scalar object every time.
# sum_arange: 16.656015603000924
Similarly making a list from an array is slow; same sort of python level iteration.
# sum_list_arange: 19.500842117000502
arr.tolist() is relatively fast, creating a pure python list in compiled code. So speed is similar to making a list from range.
# sum_arange_tolist: 4.004777374000696
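The difference between the last three cases can be seen by looking at what each route actually hands to sum. This is a small sketch; the exact scalar type (np.int64 or np.int32) depends on the platform.

```python
import numpy as np

arr = np.arange(5)

# Iterating over the array (what sum() and list() do) wraps every element
# in a NumPy scalar object.
from_iteration = [type(x) for x in arr]

# tolist() converts to built-in Python ints inside compiled code.
from_tolist = [type(x) for x in arr.tolist()]

print(from_iteration[0], from_tolist[0])
```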
np.sum of an array is pure numpy and quite fast. np.sum(x), where x=np.arange(N) has been created beforehand, is even faster (by about 4x):
# npsum_arange: 0.2332638230000157
np.sum from range or list is dominated by the cost of creating the array first:
# array_basic: 16.1631146109994
# array_dtype: 16.550737804000164
# array_iter: 3.9803170430004684
CodePudding user response:
From the CPython source code for sum: sum initially seems to attempt a fast path that assumes all inputs are the same type. If that fails it will just iterate:
/* Fast addition by keeping temporary sums in C instead of new Python objects.
Assumes all inputs are the same type. If the assumption fails, default
to the more general routine.
*/
I'm not entirely certain what is happening under the hood, but it is likely the repeated creation/conversion of C types to Python objects that is causing these slow-downs. It's worth noting that both sum and range are implemented in C.
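To make the quoted comment concrete, here is a simplified pure-Python model of those two code paths. It is only a sketch: model_sum is a hypothetical name, the real implementation is in C, and it also has a float fast path and overflow handling that are left out here. The second return value counts how many items were consumed on the fast path.

```python
import numpy as np

def model_sum(iterable, start=0):
    """Simplified Python model of the two code paths in the built-in sum."""
    total = start
    fast_items = 0
    it = iter(iterable)
    # "Fast path": only taken while both the running total and the next
    # item are exact built-in ints.
    if type(total) is int:
        for item in it:
            if type(item) is not int:
                total = total + item  # first mismatch: leave the fast path
                break
            total += item
            fast_items += 1
    # General path: generic addition on arbitrary Python objects.
    for item in it:
        total = total + item
    return total, fast_items

total_a, fast_a = model_sum(range(10))      # every item is a built-in int
total_b, fast_b = model_sum(np.arange(10))  # items are NumPy scalars
print(total_a, fast_a, total_b, fast_b)
```

In this model the range stays on the fast path for every item, while the numpy array leaves it on the very first element, which matches the timings in the question.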
This next bit is not really an answer to the question, but I wondered if we could speed up sum for python ranges, as range is quite a smart object.
To do this I've used functools.singledispatch to override the built-in sum function specifically for the range type; then implemented a small function to calculate the sum of an arithmetic progression.
from functools import singledispatch

def sum_range(range_, /, start=0):
    """Overloaded `sum` for range, compute arithmetic sum"""
    n = len(range_)
    if not n:
        return start
    # n * (first + last) is always even, so integer division is exact
    return start + n * (range_[0] + range_[-1]) // 2

# shadow the built-in sum with a dispatching wrapper around it
sum = singledispatch(sum)
sum.register(range, sum_range)

def test():
    """
    >>> sum(range(0, 100))
    4950
    >>> sum(range(0, 10, 2))
    20
    >>> sum(range(0, 9, 2))
    20
    >>> sum(range(0, -10, -1))
    -45
    >>> sum(range(-10, 10))
    -10
    >>> sum(range(-1, -100, -2))
    -2500
    >>> sum(range(0, 10, 100))
    0
    >>> sum(range(0, 0))
    0
    >>> sum(range(0, 100), 50)
    5000
    >>> sum(range(0, 0), 10)
    10
    """

if __name__ == "__main__":
    import doctest
    doctest.testmod()
I'm not sure if this is complete, but it's definitely faster than looping.
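One way to gain some confidence in the closed form is to compare it against the built-in sum on a handful of ranges. The sketch below restates the formula as a standalone helper (arithmetic_sum is a hypothetical name) so that it runs on its own, and it uses integer arithmetic only.

```python
import builtins

def arithmetic_sum(r, start=0):
    """Closed-form sum of a range (hypothetical helper, integers only)."""
    n = len(r)
    if n == 0:
        return start
    # n * (first + last) is always even, so floor division is exact.
    return start + n * (r[0] + r[-1]) // 2

cases = [
    range(0, 100),
    range(0, 9, 2),
    range(-1, -100, -2),
    range(7, 1000, 13),
    range(5, 5),
]

# Compare against the loop-based built-in on ranges small enough to iterate.
mismatches = [r for r in cases if arithmetic_sum(r) != builtins.sum(r)]
print("mismatches:", mismatches)

# The closed form does not iterate, so a very long range costs the same.
big = arithmetic_sum(range(10**9))
print(big)
```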
CodePudding user response:
np.sum(range(N)) is slow mostly because the current Numpy implementation does not use enough information about the exact type and content of the values provided by the iterable range(N). The heart of the general problem is inherent to the dynamic typing of Python and its arbitrary-size integers, although Numpy could optimize this specific case.
First of all, range(N) returns a dynamically-typed Python object which behaves like a (special kind of) Python generator. The objects it provides are also dynamically typed; in practice they are pure-Python integers.
The thing is that Numpy is written in the statically-typed language C, so it cannot work efficiently on dynamically-typed pure-Python objects. The strategy of Numpy is to convert such objects into C types when it can. One big problem in this case is that the integers provided by the generator can theoretically be huge: Numpy does not know whether the values can overflow a np.int32 or even a np.int64 type. Thus, Numpy first detects a suitable type and then computes the result using this type.
This translation process can be quite expensive and appears not to be needed here, since all the values provided by range(10_000_000) fit in a np.int32. However, range(5_000_000_000) returns the same object type with pure-Python integers overflowing np.int32, and Numpy needs to detect this case automatically so as not to return wrong results. Moreover, even when the input type can be correctly identified (np.int32 on my machine), this does not mean that the output will be correct, because overflows can occur during the computation of the sum. This is sadly the case on my machine.
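The overflow during the summation can be reproduced on any platform by forcing a 32-bit accumulator explicitly. This is a sketch with an illustrative N, chosen so that every element fits in np.int32 while the total does not; passing dtype to np.sum is only a way to imitate a machine whose default integer is 32 bits wide.

```python
import numpy as np

N = 100_000
exact = N * (N - 1) // 2  # 4_999_950_000, larger than the int32 maximum

values = np.arange(N, dtype=np.int32)  # every element fits in int32

# Forcing a 32-bit accumulator reproduces the silent wrap-around.
wrapped = np.sum(values, dtype=np.int32)

# A 64-bit accumulator gives the exact result.
wide = np.sum(values, dtype=np.int64)

print(exact, int(wrapped), int(wide))
```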
Numpy developers decided to deprecate such a use and state in the documentation that np.fromiter should be used instead. np.fromiter has a required dtype parameter that lets the user specify which type to use.
One way to check this behaviour in practice is to simply create a temporary list:
tmp = list(range(10_000_000))
# Numpy implicitly converts the list into a Numpy array but
# still automatically detects the input type to use
np.sum(tmp)
A faster implementation is the following:
tmp = list(range(10_000_000))
# The array is explicitly converted using a well-defined type and
# thus there is no need to perform an automatic detection
# (note that the result is still wrong since it does not fit in a np.int32)
tmp2 = np.array(tmp, dtype=np.int32)
result = np.sum(tmp2)
The first case takes 476 ms on my machine while the second takes 289 ms. Note that np.sum itself takes only 4 ms. Thus, a large part of the time is spent converting pure-Python integer objects to the internal int32 type (more specifically, managing pure-Python integers). list(range(10_000_000)) is expensive too, as it takes 205 ms. This is again due to the overhead of pure-Python integers (i.e. allocations, deallocations, reference counting, increments of variable-sized integers, memory indirections and conditions due to the dynamic typing) as well as the overhead of the generator.
sum(np.arange(N)) is slow because sum is a generic built-in of the CPython interpreter that only knows how to add Python objects, and here it works on a Numpy-defined object. The CPython interpreter needs to call Numpy functions to perform basic additions. Moreover, Numpy-defined integer objects are still Python objects, so they are subject to reference counting, allocation, deallocation, etc. Not to mention that Numpy and CPython add many checks in the functions that ultimately just add two native numbers together. A Numpy-aware just-in-time compiler such as Numba can solve this issue. Indeed, Numba takes 23 ms on my machine to compute the sum of np.arange(10_000_000) (with code still written in Python) while the CPython interpreter takes 556 ms.