I have the below dataframe called df:
Id | Stage1 | Stage2 | Stage3 |
---|---|---|---|
1 | 2022-02-01 | 2020-04-03 | 2022-06-07 |
--- | ------------ | ------------ | ----------- |
2 | 2023-06-07 | 2020-03-01 | 2020-09-03 |
--- | ------------ | ------------ | ----------- |
3 | 2023-02-04 | 2023-06-07 | 2022-06-07 |
I need to calculate the max date for each ID and its respective Stage. So for Order 1,2,3 the Stages I need are Stage 3, Stage 1 and Stage 2 respectively. I started this process by calculating the max date in each row first with the below code:
df2 = df[['Stage1', 'Stage2', 'Stage3', 'Stage4', 'Stage5']]
lis = list(df2.max(axis=1))
lis variable has the max dates stored for each row. Now, with each max date, I need to get the Stage Name of that row.
The below code calculates the max Stage for the whole df and not row.
new_lis = []
for i in lis:
new_lis.append(df.columns[df.isin([i]).any()])
How do I fix this? Output I need is "Stage 3", "Stage 1" and "Stage 2" for Order 1,2,3 respectively.
CodePudding user response:
Let's try idxmax(axis=1)
out = (df.filter(like='Stage')
.apply(pd.to_datetime)
.idxmax(axis=1))
print(out)
0 Stage3
1 Stage1
2 Stage2
dtype: object
If your stage columns contain NaT for the whole row, you can drop this row
out = (df.filter(like='Stage')
.apply(pd.to_datetime)
.dropna(how='all')
.idxmax(axis=1))
input dataframe
Id Stage1 Stage2 Stage3
0 1 2022-02-01 2020-04-03 2022-06-07
1 2 2023-06-07 2020-03-01 2020-09-03
2 3 2023-02-04 2023-06-07 2022-06-07
3 4 NaN NaN NaN
4 5 NaT 2023-06-07 2022-06-07
output dataframe, note the index 3 is dropped
0 Stage3
1 Stage1
2 Stage2
4 Stage2
dtype: object