When doing a groupby rolling sum in pandas with dtype float64, sums of zeros become small arbitrary floats when the number of groups is large. For example,
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame({'a': (np.random.random(800)*1e5 + 1e5).tolist() + [0.0]*800, 'b': list(range(80))*20})
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})
The first line generates a dataframe with 2 columns a and b.
- Column a has 800 random numbers between 1e5 and 2e5 and 800 zeros.
- Column b assigns these to 80 groups.
For example, for group 79, the df looks like the following:
a b
79 158742.001924 79
159 115045.502837 79
239 171582.695286 79
319 181072.123361 79
399 194672.826961 79
479 130100.794308 79
559 169784.165605 79
639 132752.405585 79
719 162355.180105 79
799 148140.045915 79
879 0.000000 79
959 0.000000 79
1039 0.000000 79
1119 0.000000 79
1199 0.000000 79
1279 0.000000 79
1359 0.000000 79
1439 0.000000 79
1519 0.000000 79
1599 0.000000 79
The second line calculates the rolling sum with a window of 5 for column a within each group.
One would expect the rolling sum to be zero for the last few entries in each group, e.g. group 79. However, small arbitrary floats are returned, e.g. -5.820766e-11 for group 79 below:
a
79 1.587420e+05
159 2.737875e+05
239 4.453702e+05
319 6.264423e+05
399 8.211152e+05
479 7.924739e+05
559 8.472126e+05
639 8.083823e+05
719 7.896654e+05
799 7.431326e+05
879 6.130318e+05
959 4.432476e+05
1039 3.104952e+05
1119 1.481400e+05
1199 -5.820766e-11
1279 -5.820766e-11
1359 -5.820766e-11
1439 -5.820766e-11
1519 -5.820766e-11
1599 -5.820766e-11
If we decrease the number of groups to 20, the issue disappears. E.g.
df['b'] = list(range(20))*80
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})
This yields (for group 19, since there are now only 20 groups, 0-19):
a
19 165083.125668
39 359750.793592
59 485563.758520
79 644305.760443
99 837370.199660
...
1519 0.000000
1539 0.000000
1559 0.000000
1579 0.000000
1599 0.000000
[80 rows x 1 columns]
This was only tested on pandas 1.2.5 / Python 3.7.9 / Windows 10. You might have to increase the number of groups for this to show up, depending on your machine's memory.
In my application, I can't really control the number of groups. I can change the dtype to float32 and the issue goes away, but this causes me to lose precision for large numbers.
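The cast I mean is roughly the following (illustrative only; variable names are placeholders):
# cast column a to float32 before the groupby rolling sum
df32 = df.astype({'a': 'float32'})
a32 = df32.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})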
Any idea what's causing this, and how to resolve it besides using float32?
CodePudding user response:
TLDR: this is a side effect of optimization; the workaround is to use a non-pandas sum.
The reason is that pandas tries to optimize. Naive rolling window functions take O(n*w) time. However, if we know the function is a sum, we can subtract the element going out of the window and add the one coming into it. This approach no longer depends on window size and is always O(n).
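Conceptually, the optimized path looks something like this simplified Python sketch (not the actual Cython code; min_periods handling omitted):
def rolling_sum(values, w):
    out, acc = [], 0.0
    for i, x in enumerate(values):
        acc += x                    # element entering the window
        if i >= w:
            acc -= values[i - w]    # element leaving the window
        out.append(acc)
    return out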
The caveat is that we now get floating point precision side effects, which manifest exactly as you've described.
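You can see the same drift with the sketch above: once the large values have left the window, the accumulator rarely lands back on exactly zero.
import numpy as np

vals = (np.random.random(10) * 1e5 + 1e5).tolist() + [0.0] * 10
print(rolling_sum(vals, 5)[-3:])  # typically tiny residuals (~1e-10), not exact zeros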
Sources: Python code calling window aggregation, Cython implementation of the rolling sum
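As a concrete sketch of the workaround from the TLDR (one option among several): pass a NumPy callable to .apply() instead of the string 'sum', which should route around the running-sum optimization and re-sum each window from scratch, at the cost of O(n*w) time.
# per-window summation: all-zero windows should come out as exactly 0.0
exact = df.groupby('b')['a'].rolling(5, min_periods=1).apply(np.sum, raw=True)
print(exact.loc[79].tail())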