pandas - dynamically fill the missing row of group by(create a duplicate row if required based on pr-CodePudding

Need to fill the data accrding to the stage and last stage is the maximum date

Input:

RecordID	ChangeDate	Stage
17764	31-08-2021	New
17764	02-09-2021	inprogress
17764	05-09-2021	won
70382	04-01-2022	new
70382	06-01-2022	hold
70382	07-01-2022	lost

Expceted output:

RecordID	ChangeDate	Stage
17764	31-08-2021	New
17764	01-09-2021	New
17764	02-09-2021	inprogress
17764	03-09-2021	inprogress
17764	04-09-2021	inprogress
17764	05-09-2021	won
70382	04-01-2022	new
70382	05-01-2022	new
70382	06-01-2022	hold
70382	07-01-2022	lost

CodePudding user response：

You can use a groupby.resample:

df['ChangeDate'] = pd.to_datetime(df['ChangeDate'], dayfirst=True)

(df.set_index('ChangeDate')
   .groupby('RecordID', as_index=False)
   .resample('1d').ffill()
   .reset_index('ChangeDate')
)

Output:

  ChangeDate  RecordID       Stage
0 2021-08-31     17764         New
0 2021-09-01     17764         New
0 2021-09-02     17764  inprogress
0 2021-09-03     17764  inprogress
0 2021-09-04     17764  inprogress
0 2021-09-05     17764         won
1 2022-01-04     70382         new
1 2022-01-05     70382         new
1 2022-01-06     70382        hold
1 2022-01-07     70382        lost

CodePudding user response：

One option is with complete from pyjanitor, to expose missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

# build a list of new dates
new_dates = {'ChangeDate': lambda df: pd.date_range(df.min(), df.max(), freq='1D')}

df.complete(new_dates, by = 'RecordID').ffill()
Out[70]: 
   RecordID ChangeDate       Stage
0     17764 2021-08-31         New
1     17764 2021-09-01         New
2     17764 2021-09-02  inprogress
3     17764 2021-09-03  inprogress
4     17764 2021-09-04  inprogress
5     17764 2021-09-05         won
6     70382 2022-01-04         new
7     70382 2022-01-05         new
8     70382 2022-01-06        hold
9     70382 2022-01-07        lost

Another option is to build a dataframe, and do a merge with the original dataframe - this is useful for non-unique values:

index = (df
.set_index('ChangeDate')
.drop(columns='Stage')
.groupby('RecordID')
.apply(lambda df: df.asfreq(freq='1D'))
.index)
new_df = pd.DataFrame([], index = index)
(new_df
.merge(df, how = 'outer', left_index = True, right_on = ['RecordID', 'ChangeDate'])
.ffill())

   RecordID ChangeDate       Stage
0     17764 2021-08-31         New
5     17764 2021-09-01         New
1     17764 2021-09-02  inprogress
5     17764 2021-09-03  inprogress
5     17764 2021-09-04  inprogress
2     17764 2021-09-05         won
3     70382 2022-01-04         new
5     70382 2022-01-05         new
4     70382 2022-01-06        hold
5     70382 2022-01-07        lost