I have a dataframe with a timestamp column. I can group the rows of this dataframe by timestamps into 10-minute buckets, as you can see from the code below:
minutes = '10T'
grouped_df=df.loc[df['id_area'] == 3].groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))["x"].count()
When I print the dataframe I get this
timestamp
2022-11-09 14:10:00 2
2022-11-09 14:20:00 1
2022-11-09 15:10:00 1
2022-11-09 15:30:00 1
2022-11-09 16:10:00 2
Name: x, dtype: int64
So as you can see, for example between 14:20 and 15:10 there are no values. I need to fill these gaps with 0. How can I do it?
CodePudding user response:
Data sample:
import numpy as np
import pandas as pd

np.random.seed(2022)
N = 20
df = pd.DataFrame({'id_area': np.random.choice([1, 2, 3], size=N),
                   'x': np.random.choice([1, np.nan], size=N),
                   'timestamp': pd.date_range('2022-11-11', freq='7Min', periods=N)})
If you only need to add the missing datetimes to the DatetimeIndex, append Series.asfreq:
minutes = '10T'
grouped_df1 = (df.loc[df['id_area'] == 3]
                 .groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))["x"]
                 .count()
                 .asfreq(minutes, fill_value=0))
print(grouped_df1)
timestamp
2022-11-11 00:50:00 1
2022-11-11 01:00:00 0
2022-11-11 01:10:00 0
2022-11-11 01:20:00 0
2022-11-11 01:30:00 0
2022-11-11 01:40:00 0
2022-11-11 01:50:00 0
2022-11-11 02:00:00 1
Freq: 10T, Name: x, dtype: int64
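A side note: recent pandas (2.2+) deprecates the 'T' offset alias in favour of 'min', so '10T' can also be written as '10min'. A minimal self-contained sketch of Series.asfreq with the newer spelling, using made-up counts:

```python
import pandas as pd

# Two counts 30 minutes apart; asfreq inserts the missing 10-minute
# steps between them and fills the new rows with 0
minutes = '10min'
s = pd.Series([2, 1],
              index=pd.to_datetime(['2022-11-09 14:10', '2022-11-09 14:40']))
filled = s.asfreq(minutes, fill_value=0)
print(filled)
```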
Or use pd.Grouper:
minutes = '10T'
grouped_df1 = (df.assign(timestamp=pd.to_datetime(df["timestamp"]))
                 .loc[df['id_area'] == 3]
                 .groupby(pd.Grouper(freq=minutes, key='timestamp'))["x"]
                 .count())
print(grouped_df1)
timestamp
2022-11-11 00:50:00 1
2022-11-11 01:00:00 0
2022-11-11 01:10:00 0
2022-11-11 01:20:00 0
2022-11-11 01:30:00 0
2022-11-11 01:40:00 0
2022-11-11 01:50:00 0
2022-11-11 02:00:00 1
Freq: 10T, Name: x, dtype: int64
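An equivalent way to get the same zero-filled bins (a sketch, not from the original answer) is to set the timestamp as the index and use resample, which creates every 10-minute bin between the first and last row; with the default origin the bins align to midnight, just like dt.floor:

```python
import numpy as np
import pandas as pd

# Same data sample as above
np.random.seed(2022)
N = 20
df = pd.DataFrame({'id_area': np.random.choice([1, 2, 3], size=N),
                   'x': np.random.choice([1, np.nan], size=N),
                   'timestamp': pd.date_range('2022-11-11', freq='7min', periods=N)})

minutes = '10min'
# resample fills in every 10-minute bin between the first and last
# filtered timestamp, so empty bins appear with a count of 0
grouped_resample = (df.loc[df['id_area'] == 3]
                      .set_index('timestamp')
                      .resample(minutes)['x']
                      .count())
print(grouped_resample)
```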
If you need the count of non-matching values to be 0 as well, replace x with NaN via Series.where:
grouped_df2 = (df['x'].where(df['id_area'] == 3)
                 .groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))
                 .count())
print(grouped_df2)
timestamp
2022-11-11 00:00:00 0
2022-11-11 00:10:00 0
2022-11-11 00:20:00 0
2022-11-11 00:30:00 0
2022-11-11 00:40:00 0
2022-11-11 00:50:00 1
2022-11-11 01:00:00 0
2022-11-11 01:10:00 0
2022-11-11 01:20:00 0
2022-11-11 01:30:00 0
2022-11-11 01:40:00 0
2022-11-11 01:50:00 0
2022-11-11 02:00:00 1
2022-11-11 02:10:00 0
Name: x, dtype: int64
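The reason this works: Series.where sets rows that fail the condition to NaN, and count() skips NaN, so those rows contribute 0 to their group instead of being dropped. A tiny self-contained illustration with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
mask = pd.Series([True, False, True])

# where() keeps values where the mask is True and puts NaN elsewhere;
# count() then only tallies the non-NaN values
masked = s.where(mask)
print(masked.count())
```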
CodePudding user response:
For clarity, you can always create a parallel dataframe that contains every date you need (in this case, at 10-minute intervals):
grouped_df = grouped_df.reset_index()
grouped_df.columns = ['time', 'values']  # rename so the columns below exist
times = pd.date_range(start=grouped_df['time'].min(), end=grouped_df['time'].max(), freq='10min')
Now, all the dates you need should be in the times object:
times:
DatetimeIndex(['2022-11-09 14:10:00', '2022-11-09 14:20:00',
'2022-11-09 14:30:00', '2022-11-09 14:40:00',
'2022-11-09 14:50:00', '2022-11-09 15:00:00',
'2022-11-09 15:10:00', '2022-11-09 15:20:00',
'2022-11-09 15:30:00', '2022-11-09 15:40:00',
'2022-11-09 15:50:00', '2022-11-09 16:00:00',
'2022-11-09 16:10:00'],
dtype='datetime64[ns]', freq='10T')
We can then join the previous dataframe grouped_df with these times and fill the blank values with zeroes.
final_df = (pd.merge(grouped_df, pd.DataFrame(times, columns=['time']),
                     how='outer', on='time')
              .sort_values('time')
              .fillna(0))
Your end result should look a lot like this (please keep in mind I made up some values to reproduce your original result):
time values
0 2022-11-09 14:10:00 10.0
1 2022-11-09 14:20:00 5.0
2 2022-11-09 14:30:00 0.0
3 2022-11-09 14:40:00 0.0
4 2022-11-09 14:50:00 0.0
5 2022-11-09 15:00:00 0.0
6 2022-11-09 15:10:00 20.0
7 2022-11-09 15:20:00 0.0
8 2022-11-09 15:30:00 15.0
9 2022-11-09 15:40:00 0.0
10 2022-11-09 15:50:00 0.0
11 2022-11-09 16:00:00 0.0
12 2022-11-09 16:10:00 30.0
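The merge/sort/fillna chain above can also be collapsed into a single reindex against the times index; a minimal sketch using made-up counts:

```python
import pandas as pd

# Made-up counts at two timestamps; reindex inserts the missing
# 10-minute steps and fills them with 0 in one call
counts = pd.Series([2, 1],
                   index=pd.to_datetime(['2022-11-09 14:10', '2022-11-09 14:40']))
times = pd.date_range(counts.index.min(), counts.index.max(), freq='10min')
filled = counts.reindex(times, fill_value=0)
print(filled)
```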