The following list
[[22415, 7, Timestamp('2022-02-15 00:00:00'), 'KEY', nan], [22415, 7, Timestamp('2022-02-24 00:00:00'), 'MELOXICA', nan], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'CEPFR', 12.0], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'MELOXICA', nan], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'TOX FL', 11.0], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-06-07 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'MAS', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FR', 12.0], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'SFFR', 12.0], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'MELOXICA', nan]]
Translates to the following dataframe example
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA NaN
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX NaN
2117 25302 8 2022-06-07 FLUNIX NaN
10 25302 8 2022-07-07 MAS NaN
2118 25302 8 2022-07-07 FLUNIX NaN
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX NaN
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA NaN
I would like to forward and backward fill "QUARTER" when it is missing and there is a value for the same ID and Lact and the interval between the "Date" is < 7 days
I have used bfill and ffill with groupby with other data when there has not been the requirement for the date constraint .
The output I am looking for in this example is.
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA 12.0
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX 11.0
2117 25302 8 2022-06-07 FLUNIX 11.0
10 25302 8 2022-07-07 MAS 11.0
2118 25302 8 2022-07-07 FLUNIX 11.0
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX 12.0
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA 12.0
The source dataset is large with varied intervals between dates for the same ID and lact
Appreciate any ideas how to fill na values based on the id lact grouping within the 6 day date constraint.
Thanks
CodePudding user response:
Let's group the dataframe by ID
and LACT
and apply a custom function to backfill and forward fill the values
def pad(grp):
g = pd.Grouper(key='Date', freq='7D', origin='start')
return grp.groupby(g)['QUARTER'].apply(lambda s: s.ffill().bfill())
df['QUARTER'] = df.groupby(['ID', 'LACT'], group_keys=False).apply(pad)
Result
ID LACT Date Remark QUARTER
0 22415 7 2022-02-15 KEY NaN
1 22415 7 2022-02-24 MELOXICA NaN
2 22415 7 2022-10-11 CEPFR 12.0
3 22415 7 2022-10-11 MELOXICA 12.0
4 25302 8 2022-06-05 TOX FL 11.0
5 25302 8 2022-06-05 FLUNIX 11.0
6 25302 8 2022-06-07 FLUNIX 11.0
7 25302 8 2022-07-07 MAS NaN
8 25302 8 2022-07-07 FLUNIX NaN
9 26662 8 2022-07-08 FR 12.0
10 26662 8 2022-07-08 FLUNIX 12.0
11 26662 8 2022-07-17 SFFR 12.0
12 26662 8 2022-07-17 MELOXICA 12.0
CodePudding user response:
Use DataFrameGroupBy.diff
for test intervals between dates and for less like 6
days create helper groups for use last forward filling missing values per ID, LACT
and groups by GroupBy.ffill
and bfill
:
#if necesary convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#if necessary sorting dates per ID LACT
df = df.sort_values(['ID','LACT','Date'])
groups = df.groupby(['ID','LACT'])['Date'].diff().dt.days.fillna(0).gt(6).cumsum()
f = lambda x: x.ffill().bfill()
df['QUARTER'] = df.groupby(['ID','LACT', groups])['QUARTER'].transform(f)
print (df)
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA 12.0
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX 11.0
2117 25302 8 2022-06-07 FLUNIX 11.0
10 25302 8 2022-07-07 MAS NaN
2118 25302 8 2022-07-07 FLUNIX NaN
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX 12.0
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA 12.0