Home > Software engineering >  Groupby function throwing error when attempting to get the second smallest date Python
Groupby function throwing error when attempting to get the second smallest date Python

Time:11-04

I am trying to pull the second closest date for each as of date in a data set. Below is some example data

asOfDate maturityDate   value
0  2022-09-01   2022-10-01  57.273
1  2022-09-01   2022-11-01  55.861
2  2022-09-01   2022-12-01  59.231
3  2022-09-01   2023-01-01  59.305
4  2022-09-01   2023-02-01  58.081
5  2022-09-01   2023-03-01  51.198
6  2022-09-01   2023-04-01  44.532
7  2022-09-01   2023-05-01  38.955
8  2022-09-01   2023-06-01  37.901
9  2022-09-01   2023-07-01  40.461
10 2022-09-01   2023-08-01  40.126
11 2022-09-01   2023-09-01  40.221
12 2022-09-01   2023-10-01  40.455
13 2022-09-01   2023-11-01  41.652
14 2022-09-01   2023-12-01  42.450
15 2022-09-01   2024-01-01  45.057
16 2022-09-01   2024-02-01  42.459
17 2022-09-01   2024-03-01  37.544
18 2022-09-01   2024-04-01  29.901
19 2022-09-01   2024-05-01  25.300
20 2022-09-01   2024-06-01  22.360
21 2022-09-01   2024-07-01  22.544
22 2022-09-01   2024-08-01  23.931

I have a list of asOfDates with several maturity dates going out. I currently am using this code to pull the maturityDate and value that is the smallest (or rolling front month) for each asOfDate.

df_final[df_final['maturityDate']==df_final.groupby(['asOfDate'])['maturityDate'].transform(min)]

The above works perfectly for my needs, but now I would like to, instead of getting the smallest date, receive the second smallest for each as of date. I have attempted to get this second smallest, but I received an error that states, "error: Can only compare identically-labeled Series objects". Below is what I have tried that does output the dates I want, but when I try to reindex the data frame I get the error.

df_final.groupby(['asOfDate'])['maturityDate'].nsmallest(2).groupby(['asOfDate']).last().reset_index()

df_final[df_final['maturityDate']==df_final.groupby(['asOfDate'])['maturityDate'].nsmallest(2).groupby(['asOfDate']).last().reset_index()]

The first one above outputs the below, which is the desired results along with the values any ideas here?

asOfDate maturityDate
0  2022-09-01   2022-11-01
1  2022-09-02   2022-11-01
2  2022-09-05   2022-11-01
3  2022-09-06   2022-11-01
4  2022-09-07   2022-11-01
5  2022-09-08   2022-11-01
6  2022-09-09   2022-11-01
7  2022-09-12   2022-11-01
8  2022-09-13   2022-11-01
9  2022-09-14   2022-11-01
10 2022-09-15   2022-11-01
11 2022-09-16   2022-12-01
12 2022-09-19   2022-12-01
13 2022-09-20   2022-12-01
14 2022-09-21   2022-12-01
15 2022-09-22   2022-12-01
16 2022-09-23   2022-12-01
17 2022-09-26   2022-12-01
18 2022-09-27   2022-12-01
19 2022-09-28   2022-12-01
20 2022-09-29   2022-12-01
21 2022-09-30   2022-12-01
22 2022-10-03   2022-12-01
23 2022-10-04   2022-12-01
24 2022-10-05   2022-12-01
25 2022-10-06   2022-12-01
26 2022-10-07   2022-12-01
27 2022-10-10   2022-12-01
28 2022-10-11   2022-12-01
29 2022-10-12   2022-12-01

CodePudding user response:

import pandas as pd
columns = ["asOfDate", "maturityDate", "value"]
data = [
["2022-09-01",   "2022-10-01",  57.273]
,["2022-09-01",   "2022-11-01",  55.861]
,["2022-09-01",   "2022-12-01",  59.231]
,["2022-09-01",   "2023-01-01",  59.305]
,["2022-09-01",   "2023-02-01",  58.081]
,["2022-09-01",   "2023-03-01",  51.198]
,["2022-09-01",   "2023-04-01",  44.532]
,["2022-09-01",   "2023-05-01",  38.955]
,["2022-09-01",   "2023-06-01",  37.901]
,["2022-09-01",   "2023-07-01",  40.461]
,["2022-09-01",   "2023-08-01",  40.126]
,["2022-09-01",   "2023-09-01",  40.221]
,["2022-09-01",   "2023-10-01",  40.455]
,["2022-09-01",   "2023-11-01",  41.652]
,["2022-09-01",   "2023-12-01",  42.450]
,["2022-09-01",   "2024-01-01",  45.057]
,["2022-09-01",   "2024-02-01",  42.459]
,["2022-09-01",   "2024-03-01",  37.544]
,["2022-09-01",   "2024-04-01",  29.901]
,["2022-09-01",   "2024-05-01",  25.300]
,["2022-09-01",   "2024-06-01",  22.360]
,["2022-09-01",   "2024-07-01",  22.544]
,["2022-09-01",   "2024-08-01",  23.931]
]

df = pd.DataFrame.from_records(data=data, columns=columns)

for col in ["asOfDate", "maturityDate"]:
    df[col] = pd.to_datetime(df[col])

df.sort_values(["asOfDate","maturityDate"]).groupby("asOfDate").agg({"maturityDate":lambda x: x.shift(-1).values[0]})
  • Related