Fill missing dates hourly per group with previous value in certain column using Pandas-CodePudding

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 14:00:00', False, 3],
        ['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

  group                 date  indicator  value
0     A  2022-09-01 10:00:00      False      2
1     A  2022-09-01 14:00:00      False      3
2     B  2022-09-01 13:00:00      False      1
3     B  2022-09-01 16:00:00       True      4

I would like to fill in the missing dates between dates hourly. So each hour that is missing between dates should be filled and the values should be the same as the previous data. Here is the desired output:

data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 11:00:00', False, 2], 
        ['A', '2022-09-01 12:00:00', False, 2], ['A', '2022-09-01 13:00:00', False, 2], 
        ['A', '2022-09-01 14:00:00', False, 3],
        ['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 14:00:00', False, 1],
        ['B', '2022-09-01 15:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

  group                 date  indicator  value
0     A  2022-09-01 10:00:00      False      2
1     A  2022-09-01 11:00:00      False      2
2     A  2022-09-01 12:00:00      False      2
3     A  2022-09-01 13:00:00      False      2
4     A  2022-09-01 14:00:00      False      3
5     B  2022-09-01 13:00:00      False      1
6     B  2022-09-01 14:00:00      False      1
7     B  2022-09-01 15:00:00      False      1
8     B  2022-09-01 16:00:00       True      4

So I was wondering if it is possible to fill the missing dates hourly per group with the previous value in column value using Pandas?

CodePudding user response：

You can use:

df['date'] = pd.to_datetime(df['date'])

out = (df
   .groupby('group', as_index=False, group_keys=False)
   .apply(lambda g: g.set_index('date')
                     .reindex(pd.date_range(g['date'].min(),
                                            g['date'].max(),
                                            freq='H'))
                     .ffill(downcast='infer').reset_index()
         
         )
   .reset_index(drop=True)
)

output:

                index group  indicator  value
0 2022-09-01 10:00:00     A      False      2
1 2022-09-01 11:00:00     A      False      2
2 2022-09-01 12:00:00     A      False      2
3 2022-09-01 13:00:00     A      False      2
4 2022-09-01 14:00:00     A      False      3
5 2022-09-01 13:00:00     B      False      1
6 2022-09-01 14:00:00     B      False      1
7 2022-09-01 15:00:00     B      False      1
8 2022-09-01 16:00:00     B       True      4

CodePudding user response：

here is one other way about it

df['date']=pd.to_datetime(df['date'])

df2=(df.set_index('date' )
     .groupby('group',  group_keys=False)
     .apply(lambda x: x.resample('1H').ffill()) 
     .reset_index() )
df2

                   date     group   indicator   value
0   2022-09-01 10:00:00        A    False       2
1   2022-09-01 11:00:00        A    False       2
2   2022-09-01 12:00:00        A    False       2
3   2022-09-01 13:00:00        A    False       2
4   2022-09-01 14:00:00        A    False       3
5   2022-09-01 13:00:00        B    False       1
6   2022-09-01 14:00:00        B    False       1
7   2022-09-01 15:00:00        B    False       1
8   2022-09-01 16:00:00        B    True        4