I have the following dataframe:
>>> df
a b
0 aaa 22.0
1 bb 33.0
2 4 44.0
3 6 11.0
I want to sum the column b. I know that I can do np.sum(df['b']), but I want to understand, syntax-wise, why I cannot use either of the following to get the sum:
>>> df['b'].apply(np.sum, axis=0)
0 22.0
1 33.0
2 44.0
3 11.0
Name: b, dtype: float64
>>> df['b'].apply(np.sum)
0 22.0
1 33.0
2 44.0
3 11.0
Name: b, dtype: float64
Why is the apply function with np.sum not applied to the whole series?
In https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html it says
Invoke function on values of Series.
Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
np.sum is for sure a "NumPy function". I think I may be misunderstanding "Can be ufunc (a NumPy function that applies to the entire Series)": does this mean that if it's a NumPy function, the function is applied to the entire Series at each cell value, without aggregation?
CodePudding user response:
You're applying np.sum to each element of the series df['b']. That's why you're not getting a scalar.
The DataFrame method apply() takes a function as a parameter and applies it to the DataFrame column by column (each column's values are the input). If the function is aggregating (e.g. np.sum, like yours), that is, it takes a list-like input and returns a single value, the result is a Series with each element corresponding to a column.
With that being said, the equivalent of np.sum(df['b']) will be df.apply(np.sum)['b'], which gives 110.0 as well (provided np.sum can also reduce the other columns, such as a).
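A minimal sketch of the difference between the two apply methods, using the question's data. The `traced_sum` wrapper is a hypothetical helper added only to record what each `apply` passes in, column `a` is assumed to hold strings, and `df[["b"]]` is used rather than `df` so that only the numeric column is summed.

```python
import numpy as np
import pandas as pd

# Data from the question (column a is assumed to hold strings).
df = pd.DataFrame({"a": ["aaa", "bb", "4", "6"],
                   "b": [22.0, 33.0, 44.0, 11.0]})

received = []  # records every argument handed to the function


def traced_sum(x):
    """Hypothetical wrapper around np.sum that logs its input."""
    received.append(x)
    return np.sum(x)


# Series.apply: the function is called once per element.
per_element = df["b"].apply(traced_sum)
series_inputs = list(received)

received.clear()

# DataFrame.apply: the function receives each column as a Series.
per_column = df[["b"]].apply(traced_sum)
frame_inputs = list(received)

print(per_element.tolist())  # the original values, unchanged
print(per_column["b"])       # the column total
```

Inspecting `series_inputs` and `frame_inputs` shows single values in the first case and a whole Series in the second, which is why only the DataFrame version aggregates.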
CodePudding user response:
Simple: according to the documentation, np.sum computes the "Sum of array elements over a given axis".
On the other hand, Series.apply takes a function and applies it to every single value of the pandas Series.
Thus, when you combine them as you have, the summation runs on one value at a time, and the sum of a single value is just that value.
So you are better off just using np.sum(df['b']), which sums along axis 0, the only axis a pandas Series has.
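As a rough illustration of the point above, the sketch below calls np.sum on one value, then on each value in turn (an approximation of what Series.apply does with it, not pandas' actual implementation), and finally on the whole Series. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([22.0, 33.0, 44.0, 11.0], name="b")

# Summing a single value just gives that value back.
single = np.sum(22.0)

# Roughly what Series.apply(np.sum) amounts to: one call per element.
one_by_one = [float(np.sum(v)) for v in s]

# Summing the Series as a whole aggregates along its only axis.
total = np.sum(s)

print(single, one_by_one, total)
```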
CodePudding user response:
The problem is that np.sum is not a universal function (ufunc):
> isinstance(np.sum, np.ufunc)
False
Series.apply handles ufuncs in SeriesApply.apply_standard internally:
class SeriesApply(NDFrameApply):
    obj: Series
    axis = 0
    def apply_standard(self) -> DataFrame | Series:
        # caller is responsible for ensuring that f is Callable
        f = cast(Callable, self.f)
        obj = self.obj
        with np.errstate(all="ignore"):
            if isinstance(f, np.ufunc):  # <------
                return f(obj)
If f is a NumPy ufunc, it just passes obj, which is the Series itself, to f.
np.sum is not a ufunc, so Series.apply treats it as a function that only works on single values and calls it once per element.
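To see the ufunc branch in action, here is a small sketch that contrasts np.sum with np.sqrt, chosen here only as an example of a genuine ufunc. Since a ufunc is handed the whole Series, the result of apply should match calling the ufunc on the Series directly.

```python
import numpy as np
import pandas as pd

s = pd.Series([22.0, 33.0, 44.0, 11.0], name="b")

# np.sqrt is a ufunc; np.sum is an ordinary NumPy function.
is_sqrt_ufunc = isinstance(np.sqrt, np.ufunc)
is_sum_ufunc = isinstance(np.sum, np.ufunc)

# With a ufunc, Series.apply passes the Series itself to the function.
via_apply = s.apply(np.sqrt)
direct = np.sqrt(s)

print(is_sqrt_ufunc, is_sum_ufunc)
print(via_apply.tolist() == direct.tolist())
```

Note that ufuncs are element-wise by definition, so even this "entire Series" path returns a Series of the same length rather than a scalar.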
If you want to sum the Series, you can use Series.sum() or np.sum(Series).
> df['b'].sum()
110.0
> np.sum(df['b'])
110.0