Home > front end >  How to get cumulative counts for each ID at that point in time based on entries in another column in
How to get cumulative counts for each ID at that point in time based on entries in another column in

Time:09-17

I have a dataframe as shown below:

CLIENT_ID ENCOUNTER_DATE CONDITION
8222 2020-01-01 Positive
8222 2020-03-02 Treated
8222 2020-04-18 Treated
8222 2020-07-31 Negative
8300 2017-06-10 Negative
8300 2017-09-11 Treated
8300 2018-02-01 Future Treatment
8300 2018-04-01 Treated
8300 2018-05-31 Negative
8400 2020-12-31 Future Treatment
8401 2017-08-29 Negative
8401 2017-09-15 Positive
8500 2018-10-10 Positive

Here is the code to create df:

df = pd.DataFrame({"CLIENT_ID": [8222, 8222, 8222, 8222, 8300, 8300, 8300, 8300, 8300, 8400, 8401, 8401, 8500],
                   "ENCOUNTER_DATE": ['2020-01-01', '2020-03-02', '2020-04-18', '2020-07-31', '2017-06-10', '2017-09-11', '2018-02-01', '2018-04-01', '2018-05-31', '2020-12-31', '2017-08-29', '2017-09-15', '2018-10-10'],
                   "CONDITION": ["positive", "treated", "treated", "negative", "negative", "treated", "future treatment", "treated", "negative", "future treatment", "negative", "positive", "positive"]})

manage_condition_list = ['positive','treated','future treatment']

The table is sorted by the CLIENT_ID and DATE_ENCOUNTER.

I want to get the cumulative count (number of times) that client, CLIENT_ID had a CONDITION in the list manage_condition_list at that point in time. So that final dataframe or output will look like below:

CLIENT_ID ENCOUNTER_DATE CONDITION CONDITION_COUNTS
8222 2020-01-01 Positive 1
8222 2020-03-02 Treated 2
8222 2020-04-18 Treated 3
8222 2020-07-31 Negative 3
8300 2017-06-10 Negative 0
8300 2017-09-11 Treated 1
8300 2018-02-01 Future Treatment 2
8300 2018-04-01 Treated 3
8300 2018-05-31 Negative 3
8400 2020-12-31 Future Treatment 1
8401 2017-08-29 Negative 0
8401 2017-09-15 Positive 1
8500 2018-10-10 Positive 1

Note the real data has significantly more entries NOT in the manage_condition_list. I am thinking a combination of df.where and cumcount() 1 but am not too sure.

CodePudding user response:

use isin to get True if the value is in the list manage_condition_list on the column CONDITION, then groupby.cumsum by the CLIENT_ID column

df['CONDITION_COUNTS'] = (
    df['CONDITION'].isin(manage_condition_list)
      .groupby(df['CLIENT_ID']).cumsum()
)
print(df)
    CLIENT_ID ENCOUNTER_DATE         CONDITION  CONDITION_COUNTS
0        8222     2020-01-01          positive                 1
1        8222     2020-03-02           treated                 2
2        8222     2020-04-18           treated                 3
3        8222     2020-07-31          negative                 3
4        8300     2017-06-10          negative                 0
5        8300     2017-09-11           treated                 1
6        8300     2018-02-01  future treatment                 2
7        8300     2018-04-01           treated                 3
8        8300     2018-05-31          negative                 3
9        8400     2020-12-31  future treatment                 1
10       8401     2017-08-29          negative                 0
11       8401     2017-09-15          positive                 1
12       8500     2018-10-10          positive                 1

CodePudding user response:

Not sure whether I understood the logic behind cum_counts, but hope this helps

df['Cum_Count']= df.groupby('CLIENT_ID').cumcount('Condition')
df

or

df['Cum_Count']= df.groupby('CLIENT_ID')['CONDITION'].cumcount()

or

df['CONDITION_COUNTS'] = (df['CONDITION'].isin(manage_condition_list).groupby(df['CLIENT_ID']).cumcount())
  • Related