np.cumsum(df['column']) treatment of NaNs

np.cumsum([1, 2, 3, np.nan, 4, 5, 6]) returns nan for every value from the first np.nan onward. The same happens for other plain Python sequences, such as tuples. However, np.cumsum(df['column']) does not behave this way. What does np.cumsum(...) do, such that dataframes are treated specially?

In [2]: df = pd.DataFrame({'column': [1, 2, 3, np.nan, 4, 5, 6]})

In [3]: np.cumsum(df['column'])
Out[3]: 
0     1.0
1     3.0
2     6.0
3     NaN
4    10.0
5    15.0
6    21.0
Name: column, dtype: float64

CodePudding user response：

When you call np.cumsum(object) with an object that is not a NumPy array, it first tries calling object.cumsum(). See this thread for details. You can also see it in the NumPy source.
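
The delegation can be observed without pandas at all. The following is a minimal sketch, assuming a hypothetical class named Recorder (not part of NumPy or pandas) whose only job is to expose a cumsum method and record that it was called:

```python
import numpy as np


class Recorder:
    """Hypothetical non-array object that exposes its own cumsum method."""

    def __init__(self):
        self.calls = []

    def cumsum(self, *args, **kwargs):
        # Record whatever arguments NumPy forwards, then return a marker.
        self.calls.append((args, kwargs))
        return "Recorder.cumsum was called"


obj = Recorder()
result = np.cumsum(obj)

print(result)     # the marker string, not an array
print(obj.calls)  # the arguments np.cumsum passed along
```

Because np.cumsum returns whatever the object's own method returns, the result here is the marker string rather than an array.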

The pandas method defaults to skipna=True, so np.cumsum(df) becomes the equivalent of df.cumsum(axis=None, skipna=True, *args, **kwargs), which skips the NaN values: the NaN position stays NaN in the output, but the running total carries on past it. The NumPy method has no skipna option.
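
As a rough illustration of the difference, the sketch below compares four ways of computing the cumulative sum of the same data from the question. The variable names are only for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 4, 5, 6], name="column")

via_numpy = np.cumsum(s)         # delegates to the pandas method
via_pandas = s.cumsum()          # skipna=True by default
no_skip = s.cumsum(skipna=False) # NaN propagates to the end
raw = np.cumsum(s.to_numpy())    # plain ndarray, no pandas involved

print(via_numpy.equals(via_pandas))
print(no_skip.to_numpy())
print(raw)
```

Passing skipna=False to the pandas method, or converting to a plain array first with to_numpy(), should reproduce the NaN-propagating behaviour seen with a list.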

You can also verify this yourself by overriding the pandas method with your own:

import numpy as np
import pandas as pd

class DF(pd.DataFrame):
    def cumsum(self, axis=None, skipna=True, *args, **kwargs):
        print('calling pandas cumsum')
        # forward the caller's arguments instead of hard-coding them
        return super().cumsum(axis, skipna, *args, **kwargs)

df = DF({'column': [1, 2, 3, np.nan, 4, 5, 6]})

# does calling the numpy function call your pandas method?   
np.cumsum(df)

This will print

calling pandas cumsum

and return the expected result:

    column
0   1.0
1   3.0
2   6.0
3   NaN
4   10.0
5   15.0
6   21.0

You can then experiment with changing the skipna=True default in the override to skipna=False and see how the result changes.

CodePudding user response：

NumPy has a function that treats NaNs as zeros, np.nancumsum (https://numpy.org/doc/stable/reference/generated/numpy.nancumsum.html), which gives the same final total.

It works like so:

import numpy as np
np.nancumsum([1, 2, 3, np.nan, 4, 5, 6])

>>> array([ 1.,  3.,  6.,  6., 10., 15., 21.])
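
Note that np.nancumsum fills the NaN position with the running total (6 above), whereas pandas leaves it as NaN. If the pandas-style output is wanted from plain NumPy, one possible approach (a sketch, not something either library prescribes) is to put the NaNs back afterwards with np.where:

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, 4, 5, 6])

# Running total with NaNs treated as zero.
running = np.nancumsum(values)

# Restore NaN at the positions where the input was NaN.
masked = np.where(np.isnan(values), np.nan, running)

print(running)
print(masked)
```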

The implementation questions are more for the maintainers and the curious; Stack Overflow is more of a "fixing the problem" site.