How to get cumulative counts for each ID at that point in time based on entries in another column in-CodePudding

I have a dataframe as shown below:

CLIENT_ID	ENCOUNTER_DATE	CONDITION
8222	2020-01-01	Positive
8222	2020-03-02	Treated
8222	2020-04-18	Treated
8222	2020-07-31	Negative
8300	2017-06-10	Negative
8300	2017-09-11	Treated
8300	2018-02-01	Future Treatment
8300	2018-04-01	Treated
8300	2018-05-31	Negative
8400	2020-12-31	Future Treatment
8401	2017-08-29	Negative
8401	2017-09-15	Positive
8500	2018-10-10	Positive

Here is the code to create df:

df = pd.DataFrame({"CLIENT_ID": [8222, 8222, 8222, 8222, 8300, 8300, 8300, 8300, 8300, 8400, 8401, 8401, 8500],
                   "ENCOUNTER_DATE": ['2020-01-01', '2020-03-02', '2020-04-18', '2020-07-31', '2017-06-10', '2017-09-11', '2018-02-01', '2018-04-01', '2018-05-31', '2020-12-31', '2017-08-29', '2017-09-15', '2018-10-10'],
                   "CONDITION": ["positive", "treated", "treated", "negative", "negative", "treated", "future treatment", "treated", "negative", "future treatment", "negative", "positive", "positive"]})

manage_condition_list = ['positive','treated','future treatment']

The table is sorted by the CLIENT_ID and DATE_ENCOUNTER.

I want to get the cumulative count (number of times) that client, CLIENT_ID had a CONDITION in the list manage_condition_list at that point in time. So that final dataframe or output will look like below:

CLIENT_ID	ENCOUNTER_DATE	CONDITION	CONDITION_COUNTS
8222	2020-01-01	Positive	1
8222	2020-03-02	Treated	2
8222	2020-04-18	Treated	3
8222	2020-07-31	Negative	3
8300	2017-06-10	Negative	0
8300	2017-09-11	Treated	1
8300	2018-02-01	Future Treatment	2
8300	2018-04-01	Treated	3
8300	2018-05-31	Negative	3
8400	2020-12-31	Future Treatment	1
8401	2017-08-29	Negative	0
8401	2017-09-15	Positive	1
8500	2018-10-10	Positive	1

Note the real data has significantly more entries NOT in the manage_condition_list. I am thinking a combination of df.where and cumcount() 1 but am not too sure.

CodePudding user response：

use isin to get True if the value is in the list manage_condition_list on the column CONDITION, then groupby.cumsum by the CLIENT_ID column

df['CONDITION_COUNTS'] = (
    df['CONDITION'].isin(manage_condition_list)
      .groupby(df['CLIENT_ID']).cumsum()
)
print(df)
    CLIENT_ID ENCOUNTER_DATE         CONDITION  CONDITION_COUNTS
0        8222     2020-01-01          positive                 1
1        8222     2020-03-02           treated                 2
2        8222     2020-04-18           treated                 3
3        8222     2020-07-31          negative                 3
4        8300     2017-06-10          negative                 0
5        8300     2017-09-11           treated                 1
6        8300     2018-02-01  future treatment                 2
7        8300     2018-04-01           treated                 3
8        8300     2018-05-31          negative                 3
9        8400     2020-12-31  future treatment                 1
10       8401     2017-08-29          negative                 0
11       8401     2017-09-15          positive                 1
12       8500     2018-10-10          positive                 1

CodePudding user response：

Not sure whether I understood the logic behind cum_counts, but hope this helps

df['Cum_Count']= df.groupby('CLIENT_ID').cumcount('Condition')
df

df['Cum_Count']= df.groupby('CLIENT_ID')['CONDITION'].cumcount()

df['CONDITION_COUNTS'] = (df['CONDITION'].isin(manage_condition_list).groupby(df['CLIENT_ID']).cumcount())