Difference between sum, 'sum' and np.sum *under the hood* (Python / Pandas / Numpy)

Time:07-20

How do sum, 'sum' and np.sum differ, under the bonnet, in this call:

df.agg(x=('A', sum), y=('B', 'sum'), z=('C', np.sum))

given that the output would, arguably, be identical?

The call is adapted from the example here:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.aggregate.html

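import numpy as np
import pandas as pd

# Example frame from the linked documentation page
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [np.nan, np.nan, np.nan]], columns=['A', 'B', 'C'])

df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))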

     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

My guess is that the last of the three is linked to NumPy and the first two may be linked to Python (and/or pandas), but that's just a rough, uneducated first guess. It would also be interesting to know what the single quotes around 'sum' signify in this context.

CodePudding user response:

For 'sum': if you know how the numbers are laid out in memory, you can simply loop over them in memory and add them up. The string 'sum' resolves to the pandas sum method (Series.sum / DataFrame.sum), which is part of the pandas library and therefore knows how the data is laid out in memory.

For np.sum: this calls into NumPy, which is a separate library with its own containers and which expects data laid out in a certain way. If the data is not already in that layout, it has to be copied into a container with the correct shape in memory before being passed to the external function.

Lastly, sum is the Python builtin sum. It works on any container without copying any data, but it is much slower, because it knows nothing about how the data is laid out in memory and gets no 'hardware acceleration'. It is only useful in a project that doesn't include numpy or pandas for some reason (running on microcontrollers, maybe, or doing only basic maths). Otherwise just use 'sum', as it should be the fastest.
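As a rough illustration of the three options described above, the sketch below calls each one directly on a small integer Series and times it with timeit. The Series contents, the variable names, and the repeat count are made up for the example, and the timings will vary by machine and by pandas/NumPy version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: integers keep all three results exactly equal
s = pd.Series(range(1000))

builtin_total = sum(s)     # Python builtin: iterates over the Series element by element
method_total = s.sum()     # pandas method, which is what the string 'sum' refers to
numpy_total = np.sum(s)    # NumPy function called on the Series

# Rough timings in seconds for 20 calls each (machine dependent)
timings = {
    "builtin sum": timeit.timeit(lambda: sum(s), number=20),
    "Series.sum": timeit.timeit(lambda: s.sum(), number=20),
    "np.sum": timeit.timeit(lambda: np.sum(s), number=20),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

All three totals agree on this data; only the route taken to compute them differs.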

this answer shows the performance difference between different functions.

CodePudding user response:

When you call df.agg('sum') it invokes df.sum() (see this answer for an explanation).
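A minimal check of that equivalence, assuming the example frame from the pandas documentation page linked in the question (the variable names by_name and by_method are made up here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [np.nan, np.nan, np.nan]],
    columns=["A", "B", "C"],
)

by_name = df.agg("sum")   # string alias, looked up as a DataFrame method
by_method = df.sum()      # the method called directly

# Both give one total per column, with the NaN row skipped
print(by_name.equals(by_method))
```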

df.sum and np.sum(df) will have very similar performance, as pandas Series objects implement numpy's array protocols and calls to np.sum(df) will actually invoke something similar to df.apply(pd.Series.sum) under the hood. Both of these will be faster than the builtin sum for any meaningfully sized DataFrame, as the data is already stored as an array.
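One way to see that np.sum on a pandas object ends up in the pandas reduction rather than in a plain NumPy loop is to include a missing value. The sketch below uses a made-up Series and assumes the default skipna=True behaviour of Series.sum; it is an illustration, not a description of the exact internal call chain.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 7.0, np.nan])

via_method = s.sum()                       # pandas reduction, skips NaN by default
via_numpy_on_series = np.sum(s)            # NumPy defers to the Series' own sum
via_numpy_on_array = np.sum(s.to_numpy())  # plain ndarray: NaN propagates
via_builtin = sum(s)                       # element-by-element Python loop: NaN propagates

print(via_method, via_numpy_on_series, via_numpy_on_array, via_builtin)
```

The first two skip the missing value, while the last two return NaN, which is a practical difference to keep in mind when data contains missing values.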

See the pandas guide to enhancing performance for more tips.
