Home > Mobile >  Filling na values conditionally by group and date
Filling na values conditionally by group and date

Time:01-23

The following list

[[22415, 7, Timestamp('2022-02-15 00:00:00'), 'KEY', nan], [22415, 7, Timestamp('2022-02-24 00:00:00'), 'MELOXICA', nan], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'CEPFR', 12.0], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'MELOXICA', nan], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'TOX FL', 11.0], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-06-07 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'MAS', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FR', 12.0], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'SFFR', 12.0], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'MELOXICA', nan]]

Translates to the following dataframe example

         ID   LACT  Date      Remark    QUARTER
2105    22415   7   2022-02-15  KEY         NaN
2106    22415   7   2022-02-24  MELOXICA    NaN
4       22415   7   2022-10-11  CEPFR       12.0
2107    22415   7   2022-10-11  MELOXICA    NaN
9       25302   8   2022-06-05  TOX FL      11.0
2116    25302   8   2022-06-05  FLUNIX      NaN
2117    25302   8   2022-06-07  FLUNIX      NaN
10      25302   8   2022-07-07  MAS         NaN
2118    25302   8   2022-07-07  FLUNIX      NaN
14      26662   8   2022-07-08  FR          12.0
2125    26662   8   2022-07-08  FLUNIX      NaN
15      26662   8   2022-07-17  SFFR        12.0
2126    26662   8   2022-07-17  MELOXICA    NaN

I would like to forward and backward fill "QUARTER" when it is missing and there is a value for the same ID and Lact and the interval between the "Date" is < 7 days

I have used bfill and ffill with groupby with other data when there has not been the requirement for the date constraint .

The output I am looking for in this example is.

         ID   LACT  Date      Remark    QUARTER
2105    22415   7   2022-02-15  KEY         NaN
2106    22415   7   2022-02-24  MELOXICA    NaN
4       22415   7   2022-10-11  CEPFR       12.0
2107    22415   7   2022-10-11  MELOXICA    12.0
9       25302   8   2022-06-05  TOX FL      11.0
2116    25302   8   2022-06-05  FLUNIX      11.0
2117    25302   8   2022-06-07  FLUNIX      11.0
10      25302   8   2022-07-07  MAS         11.0
2118    25302   8   2022-07-07  FLUNIX      11.0
14      26662   8   2022-07-08  FR          12.0
2125    26662   8   2022-07-08  FLUNIX      12.0
15      26662   8   2022-07-17  SFFR        12.0
2126    26662   8   2022-07-17  MELOXICA    12.0

The source dataset is large with varied intervals between dates for the same ID and lact

Appreciate any ideas how to fill na values based on the id lact grouping within the 6 day date constraint.

Thanks

CodePudding user response:

Let's group the dataframe by ID and LACT and apply a custom function to backfill and forward fill the values

def pad(grp):
    g = pd.Grouper(key='Date', freq='7D', origin='start')
    return grp.groupby(g)['QUARTER'].apply(lambda s: s.ffill().bfill())

df['QUARTER'] = df.groupby(['ID', 'LACT'], group_keys=False).apply(pad)

Result

       ID  LACT       Date    Remark  QUARTER
0   22415     7 2022-02-15       KEY      NaN
1   22415     7 2022-02-24  MELOXICA      NaN
2   22415     7 2022-10-11     CEPFR     12.0
3   22415     7 2022-10-11  MELOXICA     12.0
4   25302     8 2022-06-05    TOX FL     11.0
5   25302     8 2022-06-05    FLUNIX     11.0
6   25302     8 2022-06-07    FLUNIX     11.0
7   25302     8 2022-07-07       MAS      NaN
8   25302     8 2022-07-07    FLUNIX      NaN
9   26662     8 2022-07-08        FR     12.0
10  26662     8 2022-07-08    FLUNIX     12.0
11  26662     8 2022-07-17      SFFR     12.0
12  26662     8 2022-07-17  MELOXICA     12.0

CodePudding user response:

Use DataFrameGroupBy.diff for test intervals between dates and for less like 6 days create helper groups for use last forward filling missing values per ID, LACT and groups by GroupBy.ffill and bfill:

#if necesary convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])

#if necessary sorting dates per ID LACT
df = df.sort_values(['ID','LACT','Date'])

groups = df.groupby(['ID','LACT'])['Date'].diff().dt.days.fillna(0).gt(6).cumsum()
f = lambda x: x.ffill().bfill()
df['QUARTER'] = df.groupby(['ID','LACT', groups])['QUARTER'].transform(f)
print (df)
         ID  LACT       Date    Remark  QUARTER
2105  22415     7 2022-02-15       KEY      NaN
2106  22415     7 2022-02-24  MELOXICA      NaN
4     22415     7 2022-10-11     CEPFR     12.0
2107  22415     7 2022-10-11  MELOXICA     12.0
9     25302     8 2022-06-05    TOX FL     11.0
2116  25302     8 2022-06-05    FLUNIX     11.0
2117  25302     8 2022-06-07    FLUNIX     11.0
10    25302     8 2022-07-07       MAS      NaN
2118  25302     8 2022-07-07    FLUNIX      NaN
14    26662     8 2022-07-08        FR     12.0
2125  26662     8 2022-07-08    FLUNIX     12.0
15    26662     8 2022-07-17      SFFR     12.0
2126  26662     8 2022-07-17  MELOXICA     12.0
  • Related