I have a dataframe as shown below:
CLIENT_ID | ENCOUNTER_DATE | CONDITION |
---|---|---|
8222 | 2020-01-01 | Positive |
8222 | 2020-03-02 | Treated |
8222 | 2020-04-18 | Treated |
8222 | 2020-07-31 | Negative |
8300 | 2017-06-10 | Negative |
8300 | 2017-09-11 | Treated |
8300 | 2018-02-01 | Future Treatment |
8300 | 2018-04-01 | Treated |
8300 | 2018-05-31 | Negative |
8400 | 2020-12-31 | Future Treatment |
8401 | 2017-08-29 | Negative |
8401 | 2017-09-15 | Positive |
8500 | 2018-10-10 | Positive |
Here is the code to create df:
df = pd.DataFrame({"CLIENT_ID": [8222, 8222, 8222, 8222, 8300, 8300, 8300, 8300, 8300, 8400, 8401, 8401, 8500],
"ENCOUNTER_DATE": ['2020-01-01', '2020-03-02', '2020-04-18', '2020-07-31', '2017-06-10', '2017-09-11', '2018-02-01', '2018-04-01', '2018-05-31', '2020-12-31', '2017-08-29', '2017-09-15', '2018-10-10'],
"CONDITION": ["positive", "treated", "treated", "negative", "negative", "treated", "future treatment", "treated", "negative", "future treatment", "negative", "positive", "positive"]})
manage_condition_list = ['positive','treated','future treatment']
The table is sorted by the CLIENT_ID
and DATE_ENCOUNTER
.
I want to get the cumulative count (number of times) that client, CLIENT_ID
had a CONDITION
in the list manage_condition_list
at that point in time. So that final dataframe or output will look like below:
CLIENT_ID | ENCOUNTER_DATE | CONDITION | CONDITION_COUNTS |
---|---|---|---|
8222 | 2020-01-01 | Positive | 1 |
8222 | 2020-03-02 | Treated | 2 |
8222 | 2020-04-18 | Treated | 3 |
8222 | 2020-07-31 | Negative | 3 |
8300 | 2017-06-10 | Negative | 0 |
8300 | 2017-09-11 | Treated | 1 |
8300 | 2018-02-01 | Future Treatment | 2 |
8300 | 2018-04-01 | Treated | 3 |
8300 | 2018-05-31 | Negative | 3 |
8400 | 2020-12-31 | Future Treatment | 1 |
8401 | 2017-08-29 | Negative | 0 |
8401 | 2017-09-15 | Positive | 1 |
8500 | 2018-10-10 | Positive | 1 |
Note the real data has significantly more entries NOT in the manage_condition_list
. I am thinking a combination of df.where
and cumcount() 1
but am not too sure.
CodePudding user response:
use isin
to get True if the value is in the list manage_condition_list
on the column CONDITION, then groupby.cumsum
by the CLIENT_ID column
df['CONDITION_COUNTS'] = (
df['CONDITION'].isin(manage_condition_list)
.groupby(df['CLIENT_ID']).cumsum()
)
print(df)
CLIENT_ID ENCOUNTER_DATE CONDITION CONDITION_COUNTS
0 8222 2020-01-01 positive 1
1 8222 2020-03-02 treated 2
2 8222 2020-04-18 treated 3
3 8222 2020-07-31 negative 3
4 8300 2017-06-10 negative 0
5 8300 2017-09-11 treated 1
6 8300 2018-02-01 future treatment 2
7 8300 2018-04-01 treated 3
8 8300 2018-05-31 negative 3
9 8400 2020-12-31 future treatment 1
10 8401 2017-08-29 negative 0
11 8401 2017-09-15 positive 1
12 8500 2018-10-10 positive 1
CodePudding user response:
Not sure whether I understood the logic behind cum_counts, but hope this helps
df['Cum_Count']= df.groupby('CLIENT_ID').cumcount('Condition')
df
or
df['Cum_Count']= df.groupby('CLIENT_ID')['CONDITION'].cumcount()
or
df['CONDITION_COUNTS'] = (df['CONDITION'].isin(manage_condition_list).groupby(df['CLIENT_ID']).cumcount())