I would like to select all rows with the 2 most recent dates by each ID. Max and Max-1 dates and number of rows for each ID may differ for ID's.
Example Data:
data = {'id': np.repeat((['a','b','c']), 6),
'date': ['2020-12-07', '2020-12-07','2020-12-05','2020-12-05','2020-12-04','2020-12-04',
'2021-12-07', '2021-12-07','2021-09-05','2021-09-05','2021-05-04','2021-05-04',
'2021-09-05', '2021-09-05','2021-02-05','2021-02-05','2020-12-04','2020-12-04'],
'value1': np.repeat(([10,20,30]), 6),
'value2': np.repeat(([1000,2000,3000]), 6)
}
df = pd.DataFrame(data)
Desired output:
id date value1 value2
0 a 2020-12-07 10 1000
1 a 2020-12-07 10 1000
2 a 2020-12-05 10 1000
3 a 2020-12-05 10 1000
4 b 2021-12-07 20 2000
5 b 2021-12-07 20 2000
6 b 2021-09-05 20 2000
7 b 2021-09-05 20 2000
8 c 2021-09-05 30 3000
9 c 2021-09-05 30 3000
10 c 2021-02-05 30 3000
11 c 2021-02-05 30 3000
I've read that .nlargest()
can pull the the last two dates, but I'm having trouble finding a way to apply it to my use case and maintain the other values in my df.
CodePudding user response:
You can try groupby().nth
:
df[df['date']>=df.groupby("id")["date"].transform('nth', n=2)]
Output:
id date value1 value2
0 a 2020-12-07 10 1000
1 a 2020-12-07 10 1000
2 a 2020-12-05 10 1000
3 a 2020-12-05 10 1000
6 b 2021-12-07 20 2000
7 b 2021-12-07 20 2000
8 b 2021-09-05 20 2000
9 b 2021-09-05 20 2000
12 c 2021-09-05 30 3000
13 c 2021-09-05 30 3000
14 c 2021-02-05 30 3000
15 c 2021-02-05 30 3000
CodePudding user response:
You can try with a "dense" ranking:
>>> df[df.groupby("id")["date"].transform(pd.Series.rank, ascending=False, method="dense")<=2]
id date value1 value2
0 a 2020-12-07 10 1000
1 a 2020-12-07 10 1000
2 a 2020-12-05 10 1000
3 a 2020-12-05 10 1000
6 b 2021-12-07 20 2000
7 b 2021-12-07 20 2000
8 b 2021-09-05 20 2000
9 b 2021-09-05 20 2000
12 c 2021-09-05 30 3000
13 c 2021-09-05 30 3000
14 c 2021-02-05 30 3000
15 c 2021-02-05 30 3000