I have the following example dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> d = {'date': pd.date_range(start='2022-12-09 00:00:00',
end='2022-12-09 02:50:00',
freq='10min'),
'amount': [np.nan]*6 [1]*5 [np.nan] [2]*6}
>>> df = pd.DataFrame(d)
>>> df
date amount
0 2022-12-09 00:00:00 NaN
1 2022-12-09 00:10:00 NaN
2 2022-12-09 00:20:00 NaN
3 2022-12-09 00:30:00 NaN
4 2022-12-09 00:40:00 NaN
5 2022-12-09 00:50:00 NaN
6 2022-12-09 01:00:00 1.0
7 2022-12-09 01:10:00 1.0
8 2022-12-09 01:20:00 1.0
9 2022-12-09 01:30:00 1.0
10 2022-12-09 01:40:00 1.0
11 2022-12-09 01:50:00 NaN
12 2022-12-09 02:00:00 2.0
13 2022-12-09 02:10:00 2.0
14 2022-12-09 02:20:00 2.0
15 2022-12-09 02:30:00 2.0
16 2022-12-09 02:40:00 2.0
17 2022-12-09 02:50:00 2.0
I am trying to use df.resample
on this dataframe to aggregate the columns by hour as follows:
>>> df.resample(rule='H', on='date').agg({'amount': sum})
amount
date
2022-12-09 00:00:00 0.0
2022-12-09 01:00:00 5.0
2022-12-09 02:00:00 12.0
However, I would like to have hours which contain just NaN
values to aggregate to NaN
rather than 0
. Hours which contain a mix of NaN
and numerical numbers should treat NaN
as 0
as currently. My desired output is as follows:
amount
date
2022-12-09 00:00:00 NaN
2022-12-09 01:00:00 5.0
2022-12-09 02:00:00 12.0
Is there any way to achieve this - ideally using df.resample
- or otherwise?
CodePudding user response:
Use Series.sum
with min_count=1
parameter:
df = df.resample(rule='H', on='date').agg({'amount': lambda x: x.sum(min_count=1)})
print (df)
amount
date
2022-12-09 00:00:00 NaN
2022-12-09 01:00:00 5.0
2022-12-09 02:00:00 12.0
Or if need processing column(s) by resample with sum
:
s = df.resample(rule='H', on='date')['amount'].sum(min_count=1)
print (s)
date
2022-12-09 00:00:00 NaN
2022-12-09 01:00:00 5.0
2022-12-09 02:00:00 12.0
Freq: H, Name: amount, dtype: float64