import pandas as pd
data = {'time': ['09:30:00',
'09:31:00',
'09:37:00',
'09:38:00',
'09:39:00',
'09:40:00',
'09:46:00',
'09:47:00',
'09:48:00'],
'sum': [5, 8, 5, 10, 15, 2, 0, 0, 0]}
my_index = pd.MultiIndex.from_arrays([["A"]*6 ["B"]*3, [1, 1, 1, 2, 2, 2, 1, 1, 1]], names=["ID-A", "ID-B"])
df = pd.DataFrame(data, index=my_index)
If I do resampling for every 3 minutes with sum(),i.e.->
data=data.set_index('time').groupby(['ID-A','ID-B']).resample('3min').sum()
There is a window({9:33 - 9:36},{9:42 - 9:45}) for which the sum() comes out to be 0. There are some values in my dataframe which actually evaluate to 0 even after the time windows are available for it(9:45-9:48). I do not want resampling for time windows where there is no data available.
I want to find out windows where my sum is 0 but due to resampling I am getting fake 0s in my data as there are no data available for those time.
CodePudding user response:
Use min_count=1
parameter in sum
and then remove missing values:
df['time'] = pd.to_timedelta(df['time'])
df = df.resample('3Min', on='time').sum(min_count=1).dropna()
print (df)
sum
time
0 days 09:30:00 13.0
0 days 09:36:00 15.0
0 days 09:39:00 17.0
0 days 09:45:00 0.0
0 days 09:48:00 0.0
Details:
print (df.resample('3Min', on='time').sum(min_count=1))
sum
time
0 days 09:30:00 13.0
0 days 09:33:00 NaN
0 days 09:36:00 15.0
0 days 09:39:00 17.0
0 days 09:42:00 NaN
0 days 09:45:00 0.0
0 days 09:48:00 0.0
EDIT: Solution per groups:
df['time'] = pd.to_timedelta(df['time'])
data = (df.reset_index()
.set_index('time')
.groupby(['ID-A','ID-B'])['sum']
.resample('3min')
.sum(min_count=1)
.dropna())
print (data)
ID-A ID-B time
A 1 0 days 09:30:00 13.0
0 days 09:36:00 5.0
2 0 days 09:38:00 27.0
B 1 0 days 09:46:00 0.0
Name: sum, dtype: float64