I can't understand the behaviour of pandas.rolling.apply with np.prod and NaNs. E.g.
import pandas as pd
import numpy as np
df = pd.DataFrame({'B': [1, 1, 2, np.nan, 4], 'C': [1, 2, 3, 4, 5]}, index=pd.date_range('2013-01-01', '2013-01-05'))
Gives this dataframe:
              B  C
2013-01-01  1.0  1
2013-01-02  1.0  2
2013-01-03  2.0  3
2013-01-04  NaN  4
2013-01-05  4.0  5
If I apply NumPy's np.prod function to a 3-day rolling window with raw=False and min_periods=1, it works as expected, ignoring the NaNs:
df.rolling('3D', min_periods=1).apply(np.prod, raw=False)
              B     C
2013-01-01  1.0   1.0
2013-01-02  1.0   2.0
2013-01-03  2.0   6.0
2013-01-04  2.0  24.0
2013-01-05  8.0  60.0
However, with raw=True I get NaNs in column B:
df.rolling('3D', min_periods=1).apply(np.prod, raw=True)
              B     C
2013-01-01  1.0   1.0
2013-01-02  1.0   2.0
2013-01-03  2.0   6.0
2013-01-04  NaN  24.0
2013-01-05  NaN  60.0
I'd like to use raw=True for speed, but I don't understand this behavior. Can someone explain what's going on?
CodePudding user response:
It's very simple: skip the NaNs explicitly with the where argument of np.prod. You can try this code:
import pandas as pd
import numpy as np
def foo(x):
    # Multiply only the non-NaN entries of the window
    return np.prod(x, where=~np.isnan(x))

if __name__ == '__main__':
    df = pd.DataFrame({'B': [1, 1, 2, np.nan, 4], 'C': [1, 2, 3, 4, 5]},
                      index=pd.date_range('2013-01-01', '2013-01-05'))
    res = df.rolling('3D', min_periods=1).apply(foo, raw=True)
    print(res)
              B     C
2013-01-01  1.0   1.0
2013-01-02  1.0   2.0
2013-01-03  2.0   6.0
2013-01-04  2.0  24.0
2013-01-05  8.0  60.0
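As an alternative sketch (not part of the answer above), NumPy also ships np.nanprod, which treats NaNs as 1, so it can be passed directly with raw=True without a helper function. The variable name res_nanprod is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'B': [1, 1, 2, np.nan, 4], 'C': [1, 2, 3, 4, 5]},
    index=pd.date_range('2013-01-01', '2013-01-05'),
)

# np.nanprod skips NaNs inside each raw ndarray window
res_nanprod = df.rolling('3D', min_periods=1).apply(np.nanprod, raw=True)
print(res_nanprod)
```

On this data it should give the same numbers as the where-based helper.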
CodePudding user response:
Thanks to @padu and @bui, whose comments and answers led me to what I was looking for, namely an explanation of the different behaviors.
As the documentation points out, when calling rolling apply with raw=False, each window is converted to a pandas.Series before being passed to np.prod. With raw=True, each window is converted to a numpy array.
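A quick way to check this yourself is to record the type of object the function receives. This is only a small sketch; the helper make_recorder and the toy Series s are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

def make_recorder(seen):
    # Hypothetical helper: returns a function that notes the type of each window
    def record(window):
        seen.add(type(window).__name__)
        return 0.0  # the return value is irrelevant here
    return record

types_raw_false, types_raw_true = set(), set()
s.rolling(2, min_periods=1).apply(make_recorder(types_raw_false), raw=False)
s.rolling(2, min_periods=1).apply(make_recorder(types_raw_true), raw=True)

print(types_raw_false)  # type names seen with raw=False
print(types_raw_true)   # type names seen with raw=True
```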
The key observation then is that np.prod behaves differently on a Series compared to an ndarray, ignoring the NaN in the Series case, and this is why we get different behaviors:
np.prod(np.array([1, 2, np.nan, 3]))
gives nan
np.prod(pd.Series([1, 2, np.nan, 3]))
gives 6.0
It's not clear to me why the NaN is ignored for the Series, but as @bui points out, you can ignore the NaNs for the ndarray case by passing the where keyword to np.prod.
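A likely explanation for the Series case, as far as I understand it: when np.prod is given an object that is not a plain ndarray but has its own prod method, NumPy hands the work to that method, and Series.prod skips NaNs by default (skipna=True). The sketch below compares the calls side by side; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan, 3]
arr = np.array(values)
ser = pd.Series(values)

prod_arr = np.prod(arr)                  # plain ndarray: NaN propagates
prod_ser = np.prod(ser)                  # Series: NaN is skipped
prod_method = ser.prod()                 # same result as np.prod(ser)
prod_no_skip = ser.prod(skipna=False)    # turning skipna off brings the NaN back

print(prod_arr, prod_ser, prod_method, prod_no_skip)
```

So raw=False gets the NaN-skipping pandas reduction, while raw=True gets the plain NumPy one.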