pandas rolling apply with NaNs

Time:12-01

I can't understand the behaviour of pandas.rolling.apply with np.prod and NaNs. E.g.

import pandas as pd
import numpy as np
df = pd.DataFrame({'B': [1, 1, 2, np.nan, 4], 'C': [1, 2, 3, 4, 5]}, index=pd.date_range('2013-01-01', '2013-01-05'))

Gives this dataframe:

            B   C
2013-01-01  1.0 1
2013-01-02  1.0 2
2013-01-03  2.0 3
2013-01-04  NaN 4
2013-01-05  4.0 5

If I apply NumPy's np.prod function to a 3-day rolling window with raw=False and min_periods=1, it works as I expected, ignoring the NaNs.

df.rolling('3D', min_periods=1).apply(np.prod, raw=False)

            B   C
2013-01-01  1.0 1.0
2013-01-02  1.0 2.0
2013-01-03  2.0 6.0
2013-01-04  2.0 24.0
2013-01-05  8.0 60.0

However with raw=True I get NaNs in column B:

df.rolling('3D', min_periods=1).apply(np.prod, raw=True)

            B   C
2013-01-01  1.0 1.0
2013-01-02  1.0 2.0
2013-01-03  2.0 6.0
2013-01-04  NaN 24.0
2013-01-05  NaN 60.0

I'd like to use raw=True for speed, but I don't understand this behavior. Can someone explain what's going on?

CodePudding user response:

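You can mask out the NaNs yourself by passing the where keyword to np.prod, so that only the non-NaN elements of each window are multiplied. Try this code: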

import pandas as pd
import numpy as np


def foo(x):
    return np.prod(x, where=~np.isnan(x))


if __name__ == '__main__':
    df = pd.DataFrame({'B': [1, 1, 2, np.nan, 4], 'C': [1, 2, 3, 4, 5]},
                      index=pd.date_range('2013-01-01', '2013-01-05'))
    res = df.rolling('3D', min_periods=1).apply(foo, raw=True)
    
    print(res)

             B     C
2013-01-01  1.0   1.0
2013-01-02  1.0   2.0
2013-01-03  2.0   6.0
2013-01-04  2.0  24.0
2013-01-05  8.0  60.0
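
As an alternative sketch, assuming it is acceptable to treat NaNs as 1 in the product, NumPy's np.nanprod can be passed directly instead of a wrapper function. The DataFrame is the one from the question; the variable name res is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [1, 1, 2, np.nan, 4], 'C': [1, 2, 3, 4, 5]},
                  index=pd.date_range('2013-01-01', '2013-01-05'))

# np.nanprod treats NaN as 1, so each raw ndarray window is reduced
# over its non-NaN values only.
res = df.rolling('3D', min_periods=1).apply(np.nanprod, raw=True)
print(res)
```

For this data it should give the same values as the where-based version above.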

CodePudding user response:

Thanks to @padu and @bui for contributing comments/answers to lead me to the answer I was looking for, namely explaining the different behaviors.

As the documentation points out, when calling rolling apply with raw=False, each window is converted to a pandas.Series before being passed to np.prod. With raw=True each window is converted to a numpy array.
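
A small sketch to check this for yourself: the helper make_recorder below is hypothetical (not part of pandas) and simply records the type of each window it receives.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

def make_recorder(seen):
    """Return a function that notes the type name of each window it is given."""
    def record(window):
        seen.add(type(window).__name__)
        return 0.0  # rolling.apply requires a scalar return value
    return record

seen_false, seen_true = set(), set()
s.rolling(2).apply(make_recorder(seen_false), raw=False)
s.rolling(2).apply(make_recorder(seen_true), raw=True)

print(seen_false)  # {'Series'}
print(seen_true)   # {'ndarray'}
```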

The key observation then is that np.prod behaves differently on a Series compared to an ndarray, ignoring the NaN in the Series case, and this is why we get different behaviors:

np.prod(np.array([1, 2, np.nan, 3])) gives nan

np.prod(pd.Series([1, 2, np.nan, 3])) gives 6.0

It's not clear to me why the NaN is ignored for the Series, but as @bui points out, you can ignore the NaNs in the ndarray case by passing the where keyword to np.prod.
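
A likely explanation for the Series case, offered as a sketch rather than a definitive account: when np.prod is given an object that is not a plain ndarray, it defers to that object's own prod method, and Series.prod skips NaNs by default (skipna=True). The variable names below are only illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan, 3]
arr = np.array(values)
ser = pd.Series(values)

arr_result = np.prod(arr)            # ndarray reduction: the NaN propagates
ser_result = np.prod(ser)            # defers to ser.prod(), which skips NaN
ser_method = ser.prod()              # same result as np.prod(ser)
ser_strict = ser.prod(skipna=False)  # opting out of skipping: NaN propagates

print(arr_result, ser_result, ser_method, ser_strict)
```

So with raw=False the NaN handling comes from pandas, not from NumPy, which is why the two modes disagree.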
