Create column that flags where another column has condition days ago-CodePudding

I'm working with a dataframe where I need to create a column ('Newcol') which will group by 'ColA', then for each row - relative to the date column ('ColB').... see if a particular string exists 2 or 3 days ago before that date value . Example below:

Original DF:

 ColA     ColB         ColC
  B      2021-10-24    dog
  B      2021-10-25    cat
  B      2021-10-26    bird
  B      2021-10-27    dog
  B      2021-10-28    bird
  H      2021-10-24    cat
  H      2021-10-25    dog
  H      2021-10-26    dog
  H      2021-10-27    bird
  H      2021-10-28    cat
  X      2021-10-24    bird
  X      2021-10-25    dog
  X      2021-10-26    dog
  X      2021-10-27    dog
  X      2021-10-28    bird

So, say the string to flag is 'dog'. Grouping by ColA, I need 'NewCol' to use the ColB date column, look back 2-3 days and see if ColC says 'dog'... if it does then flag it. So the values in 'NewCol' is binary: 0 for no, 1 for yes. See below for example:

Desired DF:

 ColA     ColB         ColC     NewCol
  B      2021-10-24    dog        NA
  B      2021-10-25    cat        NA
  B      2021-10-26    bird       1
  B      2021-10-27    dog        1
  B      2021-10-28    bird       0
  H      2021-10-24    cat        NA
  H      2021-10-25    dog        NA
  H      2021-10-26    dog        0
  H      2021-10-27    bird       1
  H      2021-10-28    cat        1
  X      2021-10-24    bird       NA
  X      2021-10-25    dog        NA
  X      2021-10-26    dog        0
  X      2021-10-27    dog        1
  X      2021-10-28    bird       1

Any suggestions is appreciated. Thanks.

CodePudding user response：

Try groupby.rolling:

df['NewCol'] = (df['ColC'].eq('dog')
                 .groupby(df['ColA']).rolling(3).max()
                 .reset_index('ColA', drop=True)
               )

Note that this works only ColB is sorted and consecutive. Otherwise, you need to set ColB as index and rolling with '3D'.

CodePudding user response：

Self join to obtain a new column with dates

extended_df = DF.merge( DF.rename(columns={'ColB':'ColDate'}), on=['ColA', 'ColC'], how='left']

and define the new column based on the condition

extended_df['NewCol'] = np.nan 

extended_df.loc[ (extended_df.colB - extended_df.ColDate)>=2 & \
                 (extended_df.colB - extended_df.ColDate)<=3 , \
                 'NewCol' ] = 1

extended_df.loc[ (extended_df.colB - extended_df.ColDate)<2 | \
                 (extended_df.colB - extended_df.ColDate)>3 , \
                 'NewCol' ] = 0

Then remove the added column to obtain the final output

df = extended_df.drop( 'ColDate', axis=1 )
del extended_df