I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 12:00:00', True, 3], ['A', '2022-09-01 14:00:00', False, 1],
['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4], ['B', '2022-09-01 18:00:00', False, 3]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 12:00:00 True 3
2 A 2022-09-01 14:00:00 False 1
3 B 2022-09-01 13:00:00 False 1
4 B 2022-09-01 16:00:00 True 4
5 B 2022-09-01 18:00:00 False 3
I would like to fill in the missing dates hourly per group where the value is the same as the previous existing row. When the previous existing row has an indicator of True, I would like to fill these missing rows with an indicator of False instead of True but the value is although the same. Here is the desired output:
data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 11:00:00', False, 2], ['A', '2022-09-01 12:00:00', True, 3], ['A', '2022-09-01 13:00:00', False, 3], ['A', '2022-09-01 14:00:00', False, 1],
['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 14:00:00', False, 1], ['B', '2022-09-01 15:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4], ['B', '2022-09-01 17:00:00', False, 4], ['B', '2022-09-01 18:00:00', False, 3]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
group date indicator value
0 A 2022-09-01 10:00:00 False 2
1 A 2022-09-01 11:00:00 False 2
2 A 2022-09-01 12:00:00 True 3
3 A 2022-09-01 13:00:00 False 3
4 A 2022-09-01 14:00:00 False 1
5 B 2022-09-01 13:00:00 False 1
6 B 2022-09-01 14:00:00 False 1
7 B 2022-09-01 15:00:00 False 1
8 B 2022-09-01 16:00:00 True 4
9 B 2022-09-01 17:00:00 False 4
10 B 2022-09-01 18:00:00 False 3
As you can see the dates are filled hourly per group and the indicators became False when the previous indicator was True.
So I was wondering if anyone knows how to fill these missing dates hourly per group with keeping care of when the indicator is True using pandas
?
CodePudding user response:
First create DatetimeIndex
with DataFrame.set_index
and then in lambda function add missing hours by DataFrame.asfreq
, last replace missing values by Series.fillna
and forward filling missing values:
df['date'] = pd.to_datetime(df['date'])
df = (df.set_index('date')
.groupby('group')[['indicator', 'value']]
.apply(lambda x: x.asfreq('H'))
.assign(indicator = lambda x: x['indicator'].fillna(False),
value = lambda x: x['value'].ffill())
.reset_index())
print (df)
group date indicator value
0 A 2022-09-01 10:00:00 False 2.0
1 A 2022-09-01 11:00:00 False 2.0
2 A 2022-09-01 12:00:00 True 3.0
3 A 2022-09-01 13:00:00 False 3.0
4 A 2022-09-01 14:00:00 False 1.0
5 B 2022-09-01 13:00:00 False 1.0
6 B 2022-09-01 14:00:00 False 1.0
7 B 2022-09-01 15:00:00 False 1.0
8 B 2022-09-01 16:00:00 True 4.0
9 B 2022-09-01 17:00:00 False 4.0
10 B 2022-09-01 18:00:00 False 3.0