get a count of values that are on or before a certain date from a pandas dataframe


I have a date 2020-05-31 and the following dataframe, where the column names are statuses:

     rejected revocation    decision         rfe   interview premium    received rfe_response  biometrics withdrawal appeal
196      None       None  2020-01-28        None        None    None  2020-01-16         None        None       None   None
203      None       None  2020-06-20  2020-04-01        None    None  2020-01-03   2020-08-08        None       None   None
209      None       None  2020-12-03  2020-06-03        None    None  2020-01-03         None        None       None   None
213      None       None  2020-06-23        None        None    None  2020-01-27         None  2020-02-19       None   None
1449     None       None  2020-05-12        None        None    None  2020-01-06         None        None       None   None
1660     None       None  2021-09-23  2021-05-27        None    None  2020-01-21   2021-08-17        None       None   None

I want to get the latest step each row is in, such that the latest step is on or before the date mentioned above, 2020-05-31.

So the output for this would be:

196: decision
203: rfe
209: received
213: biometrics
1449: decision
1660: received

or even a count works:

{
rejected = 0,
revocation = 0,
decision = 2,
rfe = 1,
interview = 0,
premium = 0,
received = 2,
rfe_response = 0,
biometrics = 1,
withdrawal = 0,
appeal = 0 
}

Currently I am looping through each row, building a dict of {status: date}, sorting it by date, and taking the key of the last value (which is a status).

This is very slow and takes forever

Is there a simpler or cleaner way of doing it?

NOTE: Each row will have at least one date, in the decision column
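For reference, a minimal runnable sketch of the per-row loop described above, built on the question's sample data (assuming the dates are ISO strings and missing steps are None):

```python
import pandas as pd

# sample frame from the question: string dates, None for missing steps
df = pd.DataFrame(
    {
        "rejected": [None] * 6,
        "revocation": [None] * 6,
        "decision": ["2020-01-28", "2020-06-20", "2020-12-03",
                     "2020-06-23", "2020-05-12", "2021-09-23"],
        "rfe": [None, "2020-04-01", "2020-06-03", None, None, "2021-05-27"],
        "interview": [None] * 6,
        "premium": [None] * 6,
        "received": ["2020-01-16", "2020-01-03", "2020-01-03",
                     "2020-01-27", "2020-01-06", "2020-01-21"],
        "rfe_response": [None, "2020-08-08", None, None, None, "2021-08-17"],
        "biometrics": [None, None, None, "2020-02-19", None, None],
        "withdrawal": [None] * 6,
        "appeal": [None] * 6,
    },
    index=[196, 203, 209, 213, 1449, 1660],
)

cutoff = "2020-05-31"
result = {}
for idx, row in df.iterrows():
    # keep only statuses with a date on or before the cutoff
    dated = {status: d for status, d in row.items()
             if d is not None and d <= cutoff}
    # ISO date strings sort lexicographically, so max() by value is the latest step
    result[idx] = max(dated, key=dated.get)
# result: {196: 'decision', 203: 'rfe', 209: 'received',
#          213: 'biometrics', 1449: 'decision', 1660: 'received'}
```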

CodePudding user response:

You can mask out the dates that are after the chosen date, then use idxmax along the columns:

dt_max = '2020-05-31'
res = df.where(df.le(dt_max)).astype('datetime64[ns]').idxmax(axis=1)
print(res)
# 196       decision
# 203            rfe
# 209       received
# 213     biometrics
# 1449      decision
# 1660      received
# dtype: object

And for the count per status, you can use value_counts:

dict_res = res.value_counts().reindex(df.columns, fill_value=0).to_dict()
print(dict_res)
#{'rejected': 0, 'revocation': 0, 'decision': 2, 'rfe': 1, 'interview': 0, 'premium': 0, 
# 'received': 2, 'rfe_response': 0, 'biometrics': 1, 'withdrawal': 0, 'appeal': 0}
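A trimmed, runnable version of the two steps above. One assumption differs from the answer's one-liner: the columns are converted to datetime up front with apply(pd.to_datetime) (so None becomes NaT and comparisons just work), instead of calling astype('datetime64[ns]') after the where:

```python
import pandas as pd

# trimmed version of the question's frame (string dates, None for missing)
df = pd.DataFrame(
    {"decision": ["2020-01-28", "2020-06-20", "2020-06-23"],
     "rfe": [None, "2020-04-01", None],
     "received": ["2020-01-16", "2020-01-03", "2020-01-27"],
     "biometrics": [None, None, "2020-02-19"]},
    index=[196, 203, 213],
)
df = df.apply(pd.to_datetime)  # strings -> Timestamps, None -> NaT

dt_max = "2020-05-31"
# hide dates after the cutoff, then take the column of the latest remaining date
res = df.where(df.le(dt_max)).idxmax(axis=1)
# 196 -> decision, 203 -> rfe, 213 -> biometrics

counts = res.value_counts().reindex(df.columns, fill_value=0).to_dict()
# {'decision': 1, 'rfe': 1, 'received': 0, 'biometrics': 1}
```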

CodePudding user response:

You can use boolean masking and idxmax:

# ensure datetime
df = df.replace('None', pd.NA).apply(pd.to_datetime)

# mask values above date
mask = df.gt('2020-05-31')

# get column name of max timestamp, ensure max is not NaT
df.mask(mask).iloc[:, ::-1].idxmax(1).mask(mask.all(1))

output:

196       decision
203            rfe
209       received
213     biometrics
1449      decision
1660      received
dtype: object
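A note on the iloc[:, ::-1] above: idxmax returns the first column that attains the maximum, so reversing the column order makes the later pipeline stage win whenever two steps share the same date. A tiny illustration with a hypothetical one-row frame:

```python
import pandas as pd

# two steps on the same date: plain idxmax picks the first column,
# while reversing the columns makes the later stage win the tie
tie = pd.DataFrame(
    {"received": [pd.Timestamp("2020-01-03")],
     "biometrics": [pd.Timestamp("2020-01-03")]}
)
first = tie.idxmax(axis=1)               # "received"
last = tie.iloc[:, ::-1].idxmax(axis=1)  # "biometrics"
```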

For the counts (note this counts, per column, every status on or before the date, not just each row's latest step, which is why received comes out as 6 here):

df.le('2020-05-31').sum()

output:

rejected        0
revocation      0
decision        2
rfe             1
interview       0
premium         0
received        6
rfe_response    0
biometrics      1
withdrawal      0
appeal          0
dtype: int64

CodePudding user response:

You can pivot long, and take the last row (by group) on or before your max date:

(
    df.melt(ignore_index=False)
    .query("value<='2020-05-31'")
    .sort_values("value")
    .groupby("N")
    .last()["variable"]
)

Output:

N
196       decision
203            rfe
209       received
213     biometrics
1449      decision
1660      received
Name: variable, dtype: object
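A runnable sketch of the same melt idea on a trimmed frame. Assumptions: the index is named "N" (matching the answer's output, so groupby("N") groups by row id), dates are ISO strings that sort chronologically, and a dropna is added so None cells never reach the string comparison:

```python
import pandas as pd

# trimmed sample; the index is named "N" so groupby("N") groups by row id
df = pd.DataFrame(
    {"decision": ["2020-01-28", "2020-06-20"],
     "rfe": [None, "2020-04-01"],
     "received": ["2020-01-16", "2020-01-03"]},
    index=pd.Index([196, 203], name="N"),
)

latest = (
    df.melt(ignore_index=False)          # one (variable, value) row per cell
    .dropna(subset=["value"])            # drop missing dates before comparing
    .query("value <= '2020-05-31'")      # keep dates on or before the cutoff
    .sort_values("value")                # ISO strings sort chronologically
    .groupby("N")
    .last()["variable"]                  # last date per row -> its status
)
# 196 -> decision, 203 -> rfe
```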