Need to fill the data accrding to the stage and last stage is the maximum date
Input:
RecordID | ChangeDate | Stage |
---|---|---|
17764 | 31-08-2021 | New |
17764 | 02-09-2021 | inprogress |
17764 | 05-09-2021 | won |
70382 | 04-01-2022 | new |
70382 | 06-01-2022 | hold |
70382 | 07-01-2022 | lost |
Expceted output:
RecordID | ChangeDate | Stage |
---|---|---|
17764 | 31-08-2021 | New |
17764 | 01-09-2021 | New |
17764 | 02-09-2021 | inprogress |
17764 | 03-09-2021 | inprogress |
17764 | 04-09-2021 | inprogress |
17764 | 05-09-2021 | won |
70382 | 04-01-2022 | new |
70382 | 05-01-2022 | new |
70382 | 06-01-2022 | hold |
70382 | 07-01-2022 | lost |
CodePudding user response:
You can use a groupby.resample
:
df['ChangeDate'] = pd.to_datetime(df['ChangeDate'], dayfirst=True)
(df.set_index('ChangeDate')
.groupby('RecordID', as_index=False)
.resample('1d').ffill()
.reset_index('ChangeDate')
)
Output:
ChangeDate RecordID Stage
0 2021-08-31 17764 New
0 2021-09-01 17764 New
0 2021-09-02 17764 inprogress
0 2021-09-03 17764 inprogress
0 2021-09-04 17764 inprogress
0 2021-09-05 17764 won
1 2022-01-04 70382 new
1 2022-01-05 70382 new
1 2022-01-06 70382 hold
1 2022-01-07 70382 lost
CodePudding user response:
One option is with complete from pyjanitor, to expose missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# build a list of new dates
new_dates = {'ChangeDate': lambda df: pd.date_range(df.min(), df.max(), freq='1D')}
df.complete(new_dates, by = 'RecordID').ffill()
Out[70]:
RecordID ChangeDate Stage
0 17764 2021-08-31 New
1 17764 2021-09-01 New
2 17764 2021-09-02 inprogress
3 17764 2021-09-03 inprogress
4 17764 2021-09-04 inprogress
5 17764 2021-09-05 won
6 70382 2022-01-04 new
7 70382 2022-01-05 new
8 70382 2022-01-06 hold
9 70382 2022-01-07 lost
Another option is to build a dataframe, and do a merge with the original dataframe - this is useful for non-unique values:
index = (df
.set_index('ChangeDate')
.drop(columns='Stage')
.groupby('RecordID')
.apply(lambda df: df.asfreq(freq='1D'))
.index)
new_df = pd.DataFrame([], index = index)
(new_df
.merge(df, how = 'outer', left_index = True, right_on = ['RecordID', 'ChangeDate'])
.ffill())
RecordID ChangeDate Stage
0 17764 2021-08-31 New
5 17764 2021-09-01 New
1 17764 2021-09-02 inprogress
5 17764 2021-09-03 inprogress
5 17764 2021-09-04 inprogress
2 17764 2021-09-05 won
3 70382 2022-01-04 new
5 70382 2022-01-05 new
4 70382 2022-01-06 hold
5 70382 2022-01-07 lost