I'm working with a dataframe where I need to create a column ('Newcol') which will group by 'ColA', then for each row - relative to the date column ('ColB').... see if a particular string exists 2 or 3 days ago before that date value . Example below:
Original DF:
ColA ColB ColC
B 2021-10-24 dog
B 2021-10-25 cat
B 2021-10-26 bird
B 2021-10-27 dog
B 2021-10-28 bird
H 2021-10-24 cat
H 2021-10-25 dog
H 2021-10-26 dog
H 2021-10-27 bird
H 2021-10-28 cat
X 2021-10-24 bird
X 2021-10-25 dog
X 2021-10-26 dog
X 2021-10-27 dog
X 2021-10-28 bird
So, say the string to flag is 'dog'. Grouping by ColA, I need 'NewCol' to use the ColB date column, look back 2-3 days and see if ColC says 'dog'... if it does then flag it. So the values in 'NewCol' is binary: 0 for no, 1 for yes. See below for example:
Desired DF:
ColA ColB ColC NewCol
B 2021-10-24 dog NA
B 2021-10-25 cat NA
B 2021-10-26 bird 1
B 2021-10-27 dog 1
B 2021-10-28 bird 0
H 2021-10-24 cat NA
H 2021-10-25 dog NA
H 2021-10-26 dog 0
H 2021-10-27 bird 1
H 2021-10-28 cat 1
X 2021-10-24 bird NA
X 2021-10-25 dog NA
X 2021-10-26 dog 0
X 2021-10-27 dog 1
X 2021-10-28 bird 1
Any suggestions is appreciated. Thanks.
CodePudding user response:
Try groupby.rolling:
df['NewCol'] = (df['ColC'].eq('dog')
.groupby(df['ColA']).rolling(3).max()
.reset_index('ColA', drop=True)
)
Note that this works only ColB
is sorted and consecutive. Otherwise, you need to set ColB
as index and rolling with '3D'
.
CodePudding user response:
Self join to obtain a new column with dates
extended_df = DF.merge( DF.rename(columns={'ColB':'ColDate'}), on=['ColA', 'ColC'], how='left']
and define the new column based on the condition
extended_df['NewCol'] = np.nan
extended_df.loc[ (extended_df.colB - extended_df.ColDate)>=2 & \
(extended_df.colB - extended_df.ColDate)<=3 , \
'NewCol' ] = 1
extended_df.loc[ (extended_df.colB - extended_df.ColDate)<2 | \
(extended_df.colB - extended_df.ColDate)>3 , \
'NewCol' ] = 0
Then remove the added column to obtain the final output
df = extended_df.drop( 'ColDate', axis=1 )
del extended_df