I have a dataframe with a timestamp column. I can group the rows of this dataframe by timestamps into 10-minute buckets, as you can see from the code below:
minutes = '10T'
grouped_df=df.loc[df['id_area'] == 3].groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))["x"].count()
When I print the dataframe I get this
timestamp
2022-11-09 14:10:00 2
2022-11-09 14:20:00 1
2022-11-09 15:10:00 1
2022-11-09 15:30:00 1
2022-11-09 16:10:00 2
Name: x, dtype: int64
So as you can see, for example between 14:20 and 15:10 there are no values. I need to fill these gaps with 0. How can I do it?
CodePudding user response:
Data sample:
import numpy as np
import pandas as pd

np.random.seed(2022)
N = 20
df = pd.DataFrame({'id_area': np.random.choice([1, 2, 3], size=N),
                   'x': np.random.choice([1, np.nan], size=N),
                   'timestamp': pd.date_range('2022-11-11', freq='7Min', periods=N)})
If you only need to add the missing datetimes to the DatetimeIndex, append Series.asfreq:
minutes = '10T'
grouped_df1 = (df.loc[df['id_area'] == 3]
                 .groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))["x"]
                 .count()
                 .asfreq(minutes, fill_value=0))
print(grouped_df1)
timestamp
2022-11-11 00:50:00 1
2022-11-11 01:00:00 0
2022-11-11 01:10:00 0
2022-11-11 01:20:00 0
2022-11-11 01:30:00 0
2022-11-11 01:40:00 0
2022-11-11 01:50:00 0
2022-11-11 02:00:00 1
Freq: 10T, Name: x, dtype: int64
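A side note: recent pandas (2.2+) deprecates the 'T' offset alias in favour of 'min', so '10T' can also be written as '10min'. A minimal self-contained sketch of Series.asfreq with the newer spelling, using made-up counts:

```python
import pandas as pd

# Two counts 30 minutes apart; asfreq inserts the missing 10-minute
# steps between them and fills the new rows with 0
minutes = '10min'
s = pd.Series([2, 1],
              index=pd.to_datetime(['2022-11-09 14:10', '2022-11-09 14:40']))
filled = s.asfreq(minutes, fill_value=0)
print(filled)
```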
Or use pd.Grouper:
minutes = '10T'
grouped_df1 = (df.assign(timestamp=pd.to_datetime(df["timestamp"]))
                 .loc[df['id_area'] == 3]
                 .groupby(pd.Grouper(freq=minutes, key='timestamp'))["x"]
                 .count())
print(grouped_df1)
timestamp
2022-11-11 00:50:00 1
2022-11-11 01:00:00 0
2022-11-11 01:10:00 0
2022-11-11 01:20:00 0
2022-11-11 01:30:00 0
2022-11-11 01:40:00 0
2022-11-11 01:50:00 0
2022-11-11 02:00:00 1
Freq: 10T, Name: x, dtype: int64
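An equivalent way to get the same zero-filled bins (a sketch, not from the original answer) is to set the timestamp as the index and use resample, which creates every 10-minute bin between the first and last row; with the default origin the bins align to midnight, just like dt.floor:

```python
import numpy as np
import pandas as pd

# Same data sample as above
np.random.seed(2022)
N = 20
df = pd.DataFrame({'id_area': np.random.choice([1, 2, 3], size=N),
                   'x': np.random.choice([1, np.nan], size=N),
                   'timestamp': pd.date_range('2022-11-11', freq='7min', periods=N)})

minutes = '10min'
# resample fills in every 10-minute bin between the first and last
# filtered timestamp, so empty bins appear with a count of 0
grouped_resample = (df.loc[df['id_area'] == 3]
                      .set_index('timestamp')
                      .resample(minutes)['x']
                      .count())
print(grouped_resample)
```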
If you need the count of non-matching values to be 0 as well, replace x with NaN via Series.where:
grouped_df2 = (df['x'].where(df['id_area'] == 3)
                 .groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))
                 .count())
print(grouped_df2)
timestamp
2022-11-11 00:00:00 0
2022-11-11 00:10:00 0
2022-11-11 00:20:00 0
2022-11-11 00:30:00 0
2022-11-11 00:40:00 0
2022-11-11 00:50:00 1
2022-11-11 01:00:00 0
2022-11-11 01:10:00 0
2022-11-11 01:20:00 0
2022-11-11 01:30:00 0
2022-11-11 01:40:00 0
2022-11-11 01:50:00 0
2022-11-11 02:00:00 1
2022-11-11 02:10:00 0
Name: x, dtype: int64
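The reason this works: Series.where sets rows that fail the condition to NaN, and count() skips NaN, so those rows contribute 0 to their group instead of being dropped. A tiny self-contained illustration with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
mask = pd.Series([True, False, True])

# where() keeps values where the mask is True and puts NaN elsewhere;
# count() then only tallies the non-NaN values
masked = s.where(mask)
print(masked.count())
```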
CodePudding user response:
For clarity, you can always create a parallel dataframe that contains every date you need (in this case, at 10-minute intervals):
grouped_df = grouped_df.reset_index()
grouped_df.columns = ['time', 'values']  # rename so the columns below exist
times = pd.date_range(start=grouped_df['time'].min(), end=grouped_df['time'].max(), freq='10min')
Now, all the dates you need should be in the times object:
times:
DatetimeIndex(['2022-11-09 14:10:00', '2022-11-09 14:20:00',
'2022-11-09 14:30:00', '2022-11-09 14:40:00',
'2022-11-09 14:50:00', '2022-11-09 15:00:00',
'2022-11-09 15:10:00', '2022-11-09 15:20:00',
'2022-11-09 15:30:00', '2022-11-09 15:40:00',
'2022-11-09 15:50:00', '2022-11-09 16:00:00',
'2022-11-09 16:10:00'],
dtype='datetime64[ns]', freq='10T')
We can then join the previous dataframe grouped_df with these times and fill the blank values with zeroes.
final_df = (pd.merge(grouped_df, pd.DataFrame(times, columns=['time']),
                     how='outer', on='time')
              .sort_values('time')
              .fillna(0))
Your end result should look a lot like this (please keep in mind I made up some values to reproduce your original result):
time values
0 2022-11-09 14:10:00 10.0
1 2022-11-09 14:20:00 5.0
2 2022-11-09 14:30:00 0.0
3 2022-11-09 14:40:00 0.0
4 2022-11-09 14:50:00 0.0
5 2022-11-09 15:00:00 0.0
6 2022-11-09 15:10:00 20.0
7 2022-11-09 15:20:00 0.0
8 2022-11-09 15:30:00 15.0
9 2022-11-09 15:40:00 0.0
10 2022-11-09 15:50:00 0.0
11 2022-11-09 16:00:00 0.0
12 2022-11-09 16:10:00 30.0
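The merge/sort/fillna chain above can also be collapsed into a single reindex against the times index; a minimal sketch using made-up counts:

```python
import pandas as pd

# Made-up counts at two timestamps; reindex inserts the missing
# 10-minute steps and fills them with 0 in one call
counts = pd.Series([2, 1],
                   index=pd.to_datetime(['2022-11-09 14:10', '2022-11-09 14:40']))
times = pd.date_range(counts.index.min(), counts.index.max(), freq='10min')
filled = counts.reindex(times, fill_value=0)
print(filled)
```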