np.cumsum([1, 2, 3, np.nan, 4, 5, 6]) will return nan for every value from the first np.nan onward. Moreover, it will do the same for any generator. However, np.cumsum(df['column']) will not. What does np.cumsum(...) do, such that dataframes are treated specially?
In [2]: df = pd.DataFrame({'column': [1, 2, 3, np.nan, 4, 5, 6]})
In [3]: np.cumsum(df['column'])
Out[3]:
0     1.0
1     3.0
2     6.0
3     NaN
4    10.0
5    15.0
6    21.0
Name: column, dtype: float64
CodePudding user response:
When you call np.cumsum(object) with an object that is not a NumPy array, it will try calling object.cumsum(). See this thread for details. You can also see it in the NumPy source.
The pandas method has a default of skipna=True. So np.cumsum(df) gets turned into the equivalent of df.cumsum(axis=None, skipna=True, *args, **kwargs), which skips over the NaN values: each NaN stays in its own position, but the running total carries on past it. The NumPy method does not have a skipna option.
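To see the delegation without pandas in the picture, here is a minimal sketch. Recorder is a hypothetical class made up for this illustration (it is not part of NumPy or pandas); it only records the keyword arguments it receives and then sums a plain array.

```python
import numpy as np


class Recorder:
    """Hypothetical stand-in for any non-ndarray object with a cumsum method."""

    def __init__(self, values):
        self.values = list(values)
        self.calls = []  # keyword arguments received on each call

    def cumsum(self, *args, **kwargs):
        # Record what the caller passed in, then sum a plain array.
        self.calls.append(kwargs)
        return np.asarray(self.values).cumsum()


rec = Recorder([1, 2, 3])
delegated = np.cumsum(rec)       # NumPy finds rec.cumsum and calls it
fallback = np.cumsum([1, 2, 3])  # a list has no cumsum, so NumPy converts it

print(rec.calls)  # the keyword arguments NumPy forwarded
print(delegated, fallback)
```

Both calls give the same numbers here; the difference is only in whose cumsum ran, which is exactly what lets pandas apply its own NaN handling.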
You can also verify this yourself by overriding the pandas method with your own:
import numpy as np
import pandas as pd

class DF(pd.DataFrame):
    def cumsum(self, axis=None, skipna=True, *args, **kwargs):
        print('calling pandas cumsum')
        # forward the arguments actually received instead of hard-coding them
        return super().cumsum(axis=axis, skipna=skipna, *args, **kwargs)

df = DF({'column': [1, 2, 3, np.nan, 4, 5, 6]})
# does calling the numpy function call your pandas method?
np.cumsum(df)
This will print
calling pandas cumsum
and return the expected result:
   column
0     1.0
1     3.0
2     6.0
3     NaN
4    10.0
5    15.0
6    21.0
You can then experiment with the effect of changing skipna=True to skipna=False in the override.
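For a quick comparison without subclassing, the sketch below calls the pandas method directly with both settings and then runs NumPy on the underlying array. The variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 4, 5, 6], name="column")

skipped = s.cumsum(skipna=True)      # pandas default: NaN stays put, total continues
propagated = s.cumsum(skipna=False)  # NaN poisons every later value
raw = np.cumsum(s.to_numpy())        # plain ndarray, so no pandas method is involved

print(skipped.tolist())
print(propagated.tolist())
print(raw.tolist())
```

With skipna=False the pandas result lines up with what NumPy gives for the bare array: everything from the first NaN onward is NaN.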
CodePudding user response:
NumPy also has a function that treats NaNs as zeros, np.nancumsum (https://numpy.org/doc/stable/reference/generated/numpy.nancumsum.html), which gives the same final total.
It works like so:
>>> import numpy as np
>>> np.nancumsum([1, 2, 3, np.nan, 4, 5, 6])
array([ 1.,  3.,  6.,  6., 10., 15., 21.])
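Note that this differs from the pandas result in one spot: np.nancumsum repeats the running total (6.0) where the NaN was, while pandas leaves NaN there. If the pandas-style output is wanted from plain NumPy, one possible sketch is to mask the NaN positions back in afterwards; the names below are illustrative only.

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, 4, 5, 6])

running = np.nancumsum(values)  # NaN counted as 0, so the total carries through
# Put NaN back wherever the input had one, mimicking pandas' skipna=True output.
pandas_like = np.where(np.isnan(values), np.nan, running)

print(running)
print(pandas_like)
```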
The implementation questions are more for the maintainers and the curious. Stack Overflow is more of a "fixing the problem" site.