Why is the pandas.Series.apply function with np.sum not applied to the entire Series?

Time:10-06

I have the following dataframe:

>>> df
     a     b
0  aaa  22.0
1   bb  33.0
2    4  44.0
3    6  11.0

I want to sum column b. I know that I can do np.sum(df['b']), but I want to understand, syntax-wise, why I cannot use either of the following to get the sum:

>>> df['b'].apply(np.sum, axis=0)
0    22.0
1    33.0
2    44.0
3    11.0
Name: b, dtype: float64
>>> df['b'].apply(np.sum)
0    22.0
1    33.0
2    44.0
3    11.0
Name: b, dtype: float64

Why is the apply function with np.sum not applied to the whole series? In https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html it says

Invoke function on values of Series.

Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.

np.sum is certainly a "NumPy function". I think I may be misunderstanding "Can be ufunc (a NumPy function that applies to the entire Series)". Does this mean that if it's a NumPy function, the function is applied to the entire Series at each cell value, without aggregation?

CodePudding user response:

You're applying np.sum to each element of the series df['b']. That's why you're not getting a scalar.

Series.apply() calls the function once per element, whereas DataFrame.apply() takes a function and applies it to the DataFrame column by column, passing each column's values in as the input. If the function aggregates (e.g. np.sum, like yours), that is, it takes a list of values and returns a single value, then DataFrame.apply() returns a Series with one element per column.

With that being said, the equivalent of np.sum(df['b']) is df.apply(np.sum)['b'], which gives 110.0 as well (provided np.sum can also be applied to the other columns, since df.apply runs it on every column).
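
As a minimal sketch of the difference described above, the snippet below rebuilds only column b from the question (column a is left out on purpose, so that np.sum never has to deal with non-numeric values) and compares the two apply methods. The variable names per_element and per_column are illustrative only.

```python
import numpy as np
import pandas as pd

# Only the numeric column from the question; column a is omitted on purpose.
df = pd.DataFrame({"b": [22.0, 33.0, 44.0, 11.0]})

# Series.apply: np.sum is called once per element, so nothing is aggregated.
per_element = df["b"].apply(np.sum)

# DataFrame.apply: np.sum is called once per column and receives the whole column.
per_column = df[["b"]].apply(np.sum)

print(per_element.tolist())  # the four original values, unchanged
print(per_column["b"])       # 110.0
```

Selecting df[["b"]] (double brackets) keeps a one-column DataFrame, which is what makes DataFrame.apply, rather than Series.apply, the method being called.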

CodePudding user response:

Simple:

From the documentation, np.sum computes the "sum of array elements over a given axis".

On the other hand, Series.apply takes a function and applies it to every single value of the pandas Series.

Thus, when you combine them as you have, the summation runs over a single value each time, which just returns that value.

So you are better off just using np.sum(df['b']), which sums along axis 0, the only axis a pandas Series has.
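
To illustrate the point about summing a single value, here is a small sketch using plain NumPy with the numbers from the question. It only shows that np.sum of one number is that same number, while np.sum of the whole sequence is the total.

```python
import numpy as np

values = [22.0, 33.0, 44.0, 11.0]

# Summing a single number returns that number, which is what happens per element.
single = np.sum(values[0])

# Summing the whole sequence aggregates it into one value.
total = np.sum(values)

print(single == values[0])  # True
print(total)                # 110.0
```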

CodePudding user response:

The problem is that np.sum is not a universal function (ufunc):

> isinstance(np.sum, np.ufunc)
False

Internally, Series.apply handles ufuncs in SeriesApply.apply_standard:

class SeriesApply(NDFrameApply):
    obj: Series
    axis = 0

    def apply_standard(self) -> DataFrame | Series:
        # caller is responsible for ensuring that f is Callable
        f = cast(Callable, self.f)
        obj = self.obj

        with np.errstate(all="ignore"):
            if isinstance(f, np.ufunc):  # <------
                return f(obj)

If f is a NumPy ufunc, it just passes obj, which is the Series itself, to f.

Since np.sum is not a ufunc, apply treats it as a function that only works on single values and calls it once per element.
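
The ufunc check can be observed from the outside with a short sketch. np.sqrt is used here only as an example of a real ufunc, and probe is a hypothetical helper that records what Series.apply hands to a non-ufunc callable; neither appears in the original question.

```python
import numpy as np
import pandas as pd

s = pd.Series([22.0, 33.0, 44.0, 11.0], name="b")

# np.sqrt is a ufunc, np.sum is an ordinary function.
sqrt_is_ufunc = isinstance(np.sqrt, np.ufunc)
sum_is_ufunc = isinstance(np.sum, np.ufunc)
print(sqrt_is_ufunc, sum_is_ufunc)  # True False

# A ufunc passed to apply gives the same result as calling it on the Series.
via_apply = s.apply(np.sqrt)
direct = np.sqrt(s)

# Hypothetical helper: record the type of every argument apply passes in.
seen = []

def probe(x):
    seen.append(type(x).__name__)
    return x

s.apply(probe)
print(len(seen))         # one call per element
print("Series" in seen)  # False: the Series itself is never passed
```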

If you want to sum on the Series, you can use Series.sum() or np.sum(Series).

> df['b'].sum()
110.0

> np.sum(df['b'])
110.0