I have a date 2020-05-31
and the following dataframe, where the column names are statuses:
rejected revocation decision rfe interview premium received rfe_response biometrics withdrawal appeal
196 None None 2020-01-28 None None None 2020-01-16 None None None None
203 None None 2020-06-20 2020-04-01 None None 2020-01-03 2020-08-08 None None None
209 None None 2020-12-03 2020-06-03 None None 2020-01-03 None None None None
213 None None 2020-06-23 None None None 2020-01-27 None 2020-02-19 None None
1449 None None 2020-05-12 None None None 2020-01-06 None None None None
1660 None None 2021-09-23 2021-05-27 None None 2020-01-21 2021-08-17 None None None
I want to get the latest step each row is in, such that the latest step is on or before the date mentioned above (2020-05-31).
So the output for this would be:
196: decision
203: rfe
209: received
213: biometrics
1449: decision
1660: received
or even a count works:
{
rejected = 0,
revocation = 0,
decision = 2,
rfe = 1,
interview = 0,
premium = 0,
received = 2,
rfe_response = 0,
biometrics = 1,
withdrawal = 0,
appeal = 0
}
Currently I am looping through each row, creating a dict of {status: date}, sorting by date, and taking the key of the last value (which is a status).
This is very slow and takes forever.
Is there a simpler or cleaner way of doing it?
NOTE: Each row will have at least one date, in the decision column.
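For reference, a minimal reconstruction of the sample frame, assuming the columns should be real datetime64 values with NaT where the table shows None:

```python
import pandas as pd

# Reconstruction of the sample frame from the question (dtypes assumed:
# datetime64 columns, NaT for the None entries).
cols = ["rejected", "revocation", "decision", "rfe", "interview", "premium",
        "received", "rfe_response", "biometrics", "withdrawal", "appeal"]
data = {
    196:  {"decision": "2020-01-28", "received": "2020-01-16"},
    203:  {"decision": "2020-06-20", "rfe": "2020-04-01",
           "received": "2020-01-03", "rfe_response": "2020-08-08"},
    209:  {"decision": "2020-12-03", "rfe": "2020-06-03",
           "received": "2020-01-03"},
    213:  {"decision": "2020-06-23", "received": "2020-01-27",
           "biometrics": "2020-02-19"},
    1449: {"decision": "2020-05-12", "received": "2020-01-06"},
    1660: {"decision": "2021-09-23", "rfe": "2021-05-27",
           "received": "2020-01-21", "rfe_response": "2021-08-17"},
}
df = (pd.DataFrame.from_dict(data, orient="index")
        .reindex(columns=cols)      # add the all-empty status columns
        .apply(pd.to_datetime))     # object -> datetime64, NaN -> NaT
```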
CodePudding user response:
You can mask out the dates that fall after the chosen date, then use idxmax
along the columns.
dt_max = '2020-05-31'
res = df.where(df.le(dt_max)).astype('datetime64[ns]').idxmax(axis=1)
print(res)
# 196 decision
# 203 rfe
# 209 received
# 213 biometrics
# 1449 decision
# 1660 received
# dtype: object
And for the count per status, you can use value_counts:
dict_res = res.value_counts().reindex(df.columns, fill_value=0).to_dict()
print(dict_res)
#{'rejected': 0, 'revocation': 0, 'decision': 2, 'rfe': 1, 'interview': 0, 'premium': 0,
# 'received': 2, 'rfe_response': 0, 'biometrics': 1, 'withdrawal': 0, 'appeal': 0}
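Put together, a minimal end-to-end sketch of this approach on a two-row, three-column subset of the question's data:

```python
import pandas as pd

# Two rows and three status columns taken from the question's frame.
df = pd.DataFrame(
    {"decision": ["2020-01-28", "2020-06-20"],
     "received": ["2020-01-16", "2020-01-03"],
     "rfe": [None, "2020-04-01"]},
    index=[196, 203],
).apply(pd.to_datetime)

dt_max = pd.Timestamp("2020-05-31")
# Drop dates after dt_max (-> NaT), then take the column of the row max.
latest = df.where(df.le(dt_max)).idxmax(axis=1)
counts = latest.value_counts().reindex(df.columns, fill_value=0).to_dict()
```

For row 203, decision (2020-06-20) is masked out, so rfe (2020-04-01) wins over received (2020-01-03).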
CodePudding user response:
You can use boolean masking and idxmax:
# ensure datetime
df = df.replace('None', pd.NA).apply(pd.to_datetime)
# mask values above date
mask = df.gt('2020-05-31')
# get column name of max timestamp, ensure max is not NaT
df.mask(mask).iloc[:, ::-1].idxmax(1).mask(mask.all(1))
output:
196 decision
203 rfe
209 received
213 biometrics
1449 decision
1660 received
dtype: object
For the counts of every status reached on or before the date (not only the latest step per row, which is why received is 6 here):
df.le('2020-05-31').sum()
output:
rejected 0
revocation 0
decision 2
rfe 1
interview 0
premium 0
received 6
rfe_response 0
biometrics 1
withdrawal 0
appeal 0
dtype: int64
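Note the difference between the two counting semantics: le(...).sum() counts every status reached on or before the cutoff, while counting the idxmax result gives only the latest step per row. A sketch of the difference, using a hypothetical two-row frame:

```python
import pandas as pd

# Both rows have 'received' on/before the cutoff, but only for row 500 is
# it the *latest* step (row 196 moved on to 'decision').
df = pd.DataFrame(
    {"received": ["2020-01-16", "2020-01-03"],
     "decision": ["2020-01-28", None]},
    index=[196, 500],
).apply(pd.to_datetime)

cutoff = "2020-05-31"
reached = df.le(cutoff).sum()                    # every status reached
latest = df.where(df.le(cutoff)).idxmax(axis=1)  # latest step per row
latest_counts = latest.value_counts().reindex(df.columns, fill_value=0)
```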
CodePudding user response:
You can pivot to long format and take the last row per group on or before your max date:
(
    df.melt(ignore_index=False)
    .query("value<='2020-05-31'")
    .sort_values("value")
    .groupby("N")  # "N" is the index name
    .last()["variable"]
)
Output:
N
196 decision
203 rfe
209 received
213 biometrics
1449 decision
1660 received
Name: variable, dtype: object
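A self-contained sketch of this long-format approach; the groupby key assumes the index carries the name "N" (as in the output above), otherwise use groupby(level=0):

```python
import pandas as pd

# Two rows from the question's frame, with a named index.
df = pd.DataFrame(
    {"decision": ["2020-01-28", "2020-06-23"],
     "received": ["2020-01-16", "2020-01-27"],
     "biometrics": [None, "2020-02-19"]},
    index=pd.Index([196, 213], name="N"),
).apply(pd.to_datetime)

latest = (
    df.melt(ignore_index=False)        # long format: (index, variable, value)
    .query("value<='2020-05-31'")      # keep dates on/before the cutoff
    .sort_values("value")
    .groupby("N")                      # group by the named index level
    .last()["variable"]                # last (= latest) status per row
)
```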