Pandas returns an incorrect groupby rolling sum of zeros for float64 when there are many groups


When doing groupby rolling in pandas with dtype float64, the sum of a window of zeros becomes an arbitrarily small float when the number of groups is large. For example,

import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'a': (np.random.random(800)*1e5 + 1e5).tolist() + [0.0]*800, 'b': list(range(80))*20})
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})

The first line generates a dataframe with 2 columns a and b.

  • Column a has 800 random numbers between 1e5 and 2e5 and 800 zeros.
  • Column b assigns these to 80 groups.
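
Each group's raw rows can be inspected with a plain boolean filter (a quick check, not part of the repro):

print(df[df['b'] == 79])  # the slice shown below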

For example, the df for group 79 looks like this:

                  a   b
79    158742.001924  79
159   115045.502837  79
239   171582.695286  79
319   181072.123361  79
399   194672.826961  79
479   130100.794308  79
559   169784.165605  79
639   132752.405585  79
719   162355.180105  79
799   148140.045915  79
879        0.000000  79
959        0.000000  79
1039       0.000000  79
1119       0.000000  79
1199       0.000000  79
1279       0.000000  79
1359       0.000000  79
1439       0.000000  79
1519       0.000000  79
1599       0.000000  79

The second line calculates the rolling sum over a window of 5 for column a within each group.

One would expect the rolling sum to be zero for the last few entries in each group, e.g. group 79. However, arbitrarily small floats are returned, e.g. -5.820766e-11 for group 79 below:

                 a
79    1.587420e+05
159   2.737875e+05
239   4.453702e+05
319   6.264423e+05
399   8.211152e+05
479   7.924739e+05
559   8.472126e+05
639   8.083823e+05
719   7.896654e+05
799   7.431326e+05
879   6.130318e+05
959   4.432476e+05
1039  3.104952e+05
1119  1.481400e+05
1199 -5.820766e-11
1279 -5.820766e-11
1359 -5.820766e-11
1439 -5.820766e-11
1519 -5.820766e-11
1599 -5.820766e-11

If we decrease the number of groups to 20, the issue disappears. E.g.

df['b'] = list(range(20))*80
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})

This yields (for group 19, since the 20 groups now run from 0 to 19):

                  a
19    165083.125668
39    359750.793592
59    485563.758520
79    644305.760443
99    837370.199660
             ...
1519       0.000000
1539       0.000000
1559       0.000000
1579       0.000000
1599       0.000000
[80 rows x 1 columns]

This was only tested on pandas 1.2.5 / Python 3.7.9 / Windows 10. You might have to increase the number of groups for the issue to show up, depending on your machine's memory.

In my application, I can't really control the number of groups. I can change the dtype to float32 and the issue goes away, but this causes me to lose precision for large numbers.

Any idea what's causing this and how to resolve it besides using float32?

CodePudding user response:

TLDR: this is a side effect of an optimization; the workaround is to use a non-pandas sum.

The reason is that pandas tries to optimize. A naive rolling-window function takes O(n*w) time, where n is the series length and w the window size. However, if we know the function is a sum, we can keep a running total: subtract the element leaving the window and add the one entering it. This approach no longer depends on the window size and is always O(n).
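
A minimal sketch of that running-sum idea (a hypothetical standalone function, not pandas' actual Cython code), assuming a fixed window:

import numpy as np

def rolling_sum_running(values, window):
    # One accumulator for the whole series: add the value entering the
    # window, subtract the value leaving it -- O(n) regardless of window size.
    out = np.empty(len(values))
    acc = 0.0
    for i, x in enumerate(values):
        acc += x
        if i >= window:
            acc -= values[i - window]
        out[i] = acc
    return out

vals = np.array([158742.001924, 115045.502837, 171582.695286,
                 181072.123361, 194672.826961] + [0.0] * 5)
# The trailing windows contain only zeros, but the accumulator can retain
# a tiny float64 round-off residual instead of returning exactly 0.0.
print(rolling_sum_running(vals, 5))

The residual appears because (a + b) - a is not exactly b in floating point; whatever error the large values leave in the accumulator is never cleared.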

The caveat is that we now get floating-point round-off as a side effect, manifesting as the small residuals you've described: the error left behind while the large values passed through the window is never cleared, so windows containing only zeros report a tiny leftover instead of exactly 0.0.
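
A sketch of the workaround from the TLDR, assuming the O(n*w) cost is acceptable: route the reduction through Rolling.apply with a non-pandas function, so every window is summed from scratch and a window of zeros yields exactly 0.0.

import numpy as np

# np.sum reduces each window independently, so no running accumulator
# carries round-off from the large values into the all-zero windows.
a = (
    df.groupby('b')['a']
      .rolling(5, min_periods=1)
      .apply(np.sum, raw=True)  # raw=True passes plain ndarrays, which is faster
)

This is slower than the built-in 'sum' since each window is recomputed, but the result for a window of zeros is exact; math.fsum can be substituted for np.sum if you need exactly rounded sums.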

Sources: Python code calling window aggregation, Cython implementation of the rolling sum
