get a count of values that are on or before a certain date from a pandas dataframe


I have a date 2020-05-31 and the following dataframe, where the column names are statuses:

     rejected revocation    decision         rfe   interview premium    received rfe_response  biometrics withdrawal appeal
196      None       None  2020-01-28        None        None    None  2020-01-16         None        None       None   None
203      None       None  2020-06-20  2020-04-01        None    None  2020-01-03   2020-08-08        None       None   None
209      None       None  2020-12-03  2020-06-03        None    None  2020-01-03         None        None       None   None
213      None       None  2020-06-23        None        None    None  2020-01-27         None  2020-02-19       None   None
1449     None       None  2020-05-12        None        None    None  2020-01-06         None        None       None   None
1660     None       None  2021-09-23  2021-05-27        None    None  2020-01-21   2021-08-17        None       None   None

I want to get the latest step each row is in, such that the latest step is on or before the date mentioned above, 2020-05-31.

So the output for this would be:

196: decision
203: rfe
209: received
213: biometrics
1449: decision
1660: received

or even a count works:

{
rejected = 0,
revocation = 0,
decision = 2,
rfe = 1,
interview = 0,
premium = 0,
received = 2,
rfe_response = 0,
biometrics = 1,
withdrawal = 0,
appeal = 0 
}

Currently I am looping through each row, building a dict of {status: date}, sorting it by date, and taking the key of the last value (which is a status).

This is very slow and takes forever

Is there a simpler or cleaner way of doing it?

NOTE: Each row will have at least one date, in the decision column
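For reference, a minimal runnable sketch of the per-row loop described above, built on the question's sample data (assuming the dates are ISO strings and missing steps are None):

```python
import pandas as pd

# sample frame from the question: string dates, None for missing steps
df = pd.DataFrame(
    {
        "rejected": [None] * 6,
        "revocation": [None] * 6,
        "decision": ["2020-01-28", "2020-06-20", "2020-12-03",
                     "2020-06-23", "2020-05-12", "2021-09-23"],
        "rfe": [None, "2020-04-01", "2020-06-03", None, None, "2021-05-27"],
        "interview": [None] * 6,
        "premium": [None] * 6,
        "received": ["2020-01-16", "2020-01-03", "2020-01-03",
                     "2020-01-27", "2020-01-06", "2020-01-21"],
        "rfe_response": [None, "2020-08-08", None, None, None, "2021-08-17"],
        "biometrics": [None, None, None, "2020-02-19", None, None],
        "withdrawal": [None] * 6,
        "appeal": [None] * 6,
    },
    index=[196, 203, 209, 213, 1449, 1660],
)

cutoff = "2020-05-31"
result = {}
for idx, row in df.iterrows():
    # keep only statuses with a date on or before the cutoff
    dated = {status: d for status, d in row.items()
             if d is not None and d <= cutoff}
    # ISO date strings sort lexicographically, so max() by value is the latest step
    result[idx] = max(dated, key=dated.get)
# result: {196: 'decision', 203: 'rfe', 209: 'received',
#          213: 'biometrics', 1449: 'decision', 1660: 'received'}
```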

CodePudding user response:

You can mask out the dates that are after the chosen date, then use idxmax along the columns:

dt_max = '2020-05-31'
res = df.where(df.le(dt_max)).astype('datetime64[ns]').idxmax(axis=1)
print(res)
# 196       decision
# 203            rfe
# 209       received
# 213     biometrics
# 1449      decision
# 1660      received
# dtype: object

And for the count per status, you can use value_counts:

dict_res = res.value_counts().reindex(df.columns, fill_value=0).to_dict()
print(dict_res)
#{'rejected': 0, 'revocation': 0, 'decision': 2, 'rfe': 1, 'interview': 0, 'premium': 0, 
# 'received': 2, 'rfe_response': 0, 'biometrics': 1, 'withdrawal': 0, 'appeal': 0}
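A trimmed, runnable version of the two steps above. One assumption differs from the answer's one-liner: the columns are converted to datetime up front with apply(pd.to_datetime) (so None becomes NaT and comparisons just work), instead of calling astype('datetime64[ns]') after the where:

```python
import pandas as pd

# trimmed version of the question's frame (string dates, None for missing)
df = pd.DataFrame(
    {"decision": ["2020-01-28", "2020-06-20", "2020-06-23"],
     "rfe": [None, "2020-04-01", None],
     "received": ["2020-01-16", "2020-01-03", "2020-01-27"],
     "biometrics": [None, None, "2020-02-19"]},
    index=[196, 203, 213],
)
df = df.apply(pd.to_datetime)  # strings -> Timestamps, None -> NaT

dt_max = "2020-05-31"
# hide dates after the cutoff, then take the column of the latest remaining date
res = df.where(df.le(dt_max)).idxmax(axis=1)
# 196 -> decision, 203 -> rfe, 213 -> biometrics

counts = res.value_counts().reindex(df.columns, fill_value=0).to_dict()
# {'decision': 1, 'rfe': 1, 'received': 0, 'biometrics': 1}
```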

CodePudding user response:

You can use boolean masking and idxmax:

# ensure datetime
df = df.replace('None', pd.NA).apply(pd.to_datetime)

# mask values above date
mask = df.gt('2020-05-31')

# get column name of max timestamp, ensure max is not NaT
df.mask(mask).iloc[:, ::-1].idxmax(1).mask(mask.all(1))

output:

196       decision
203            rfe
209       received
213     biometrics
1449      decision
1660      received
dtype: object
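A note on the iloc[:, ::-1] above: idxmax returns the first column that attains the maximum, so reversing the column order makes the later pipeline stage win whenever two steps share the same date. A tiny illustration with a hypothetical one-row frame:

```python
import pandas as pd

# two steps on the same date: plain idxmax picks the first column,
# while reversing the columns makes the later stage win the tie
tie = pd.DataFrame(
    {"received": [pd.Timestamp("2020-01-03")],
     "biometrics": [pd.Timestamp("2020-01-03")]}
)
first = tie.idxmax(axis=1)               # "received"
last = tie.iloc[:, ::-1].idxmax(axis=1)  # "biometrics"
```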

For the counts (note this counts, per column, every status on or before the date, not just each row's latest step, which is why received comes out as 6 here):

df.le('2020-05-31').sum()

output:

rejected        0
revocation      0
decision        2
rfe             1
interview       0
premium         0
received        6
rfe_response    0
biometrics      1
withdrawal      0
appeal          0
dtype: int64

CodePudding user response:

You can pivot long, and take the last row (by group) on or before your max date:

(
    df.melt(ignore_index=False)
    .query("value<='2020-05-31'")
    .sort_values("value")
    .groupby("N")
    .last()["variable"]
)

Output:

N
196       decision
203            rfe
209       received
213     biometrics
1449      decision
1660      received
Name: variable, dtype: object
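A runnable sketch of the same melt idea on a trimmed frame. Assumptions: the index is named "N" (matching the answer's output, so groupby("N") groups by row id), dates are ISO strings that sort chronologically, and a dropna is added so None cells never reach the string comparison:

```python
import pandas as pd

# trimmed sample; the index is named "N" so groupby("N") groups by row id
df = pd.DataFrame(
    {"decision": ["2020-01-28", "2020-06-20"],
     "rfe": [None, "2020-04-01"],
     "received": ["2020-01-16", "2020-01-03"]},
    index=pd.Index([196, 203], name="N"),
)

latest = (
    df.melt(ignore_index=False)          # one (variable, value) row per cell
    .dropna(subset=["value"])            # drop missing dates before comparing
    .query("value <= '2020-05-31'")      # keep dates on or before the cutoff
    .sort_values("value")                # ISO strings sort chronologically
    .groupby("N")
    .last()["variable"]                  # last date per row -> its status
)
# 196 -> decision, 203 -> rfe
```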