Mask column based on condition in another column-CodePudding

class	subclass	date	value
1	A	02-10-22	.5
1	A	02-21-22	.6
1	A	02-28-22	.8
1	B	02-09-22	.3
1	B	02-14-22	.4
1	B	02-28-22	.5
2	C	02-15-22	.9
2	C	02-28-22	.8

I have a dataframe like above. Several (class, subclass) pairs have values ordered by dates. The bottom date for each (class, subclass) is guaranteed to be the maximum date, for example 02-28-22.

I would like to transform to the dataset below. For the date right before the maximum date, if it is not exactly 7 days before the maximum date - we change the corresponding value to NaN. Otherwise we leave it alone, as well the other dates. FE, the row with date 02-21-22 is left alone; while the row with 02-14-22 is now NaN.

Dates are stored as strings: '02-15-22'.

class	subclass	date	value
1	A	02-10-22	.5
1	A	02-21-22	.6
1	A	02-28-22	.8
1	B	02-09-22	.3
1	B	02-14-22	NaN
1	B	02-28-22	.5
2	C	02-15-22	NaN
2	C	02-28-22	.8

CodePudding user response：

Find the max date and the second max date using groupby. Then use where to mask the relevant values:

maxdate = df.groupby(["class", "subclass"])["date"].transform('max')
nextmaxdate = df.groupby(["class","subclass"])["date"].transform(lambda x: x.nlargest(2).min())

df["value"] = df["value"].where(df["date"].ne(nextmaxdate) | maxdate.sub(nextmaxdate).dt.days.eq(7))

>>> df
   class subclass       date  value
0      1        A 2022-02-10    0.5
1      1        A 2022-02-21    0.6
2      1        A 2022-02-28    0.8
3      1        B 2022-02-09    0.3
4      1        B 2022-02-14    NaN
5      1        B 2022-02-28    0.5
6      2        C 2022-02-15    NaN
7      2        C 2022-02-28    0.8

CodePudding user response：

Calculate reverse cumcount to identify the row preceding the last row, then group and shift the date column and subtract a offset of 7 days, then mask the values in value column preceding to last row where the required condition is not met

c = df[::-1].groupby(['class', 'subclass']).cumcount()
d = df.groupby(['class', 'subclass'])['date'].shift(-1) - pd.DateOffset(days=7)
df['value'] = df['value'].mask(df['date'].ne(d) & c.eq(1))

   class subclass       date  value
0      1        A 2022-02-10    0.5
1      1        A 2022-02-21    0.6
2      1        A 2022-02-28    0.8
3      1        B 2022-02-09    0.3
4      1        B 2022-02-14    NaN
5      1        B 2022-02-28    0.5
6      2        C 2022-02-15    NaN
7      2        C 2022-02-28    0.8