I have some hospital visit healthcare data in a dataframe of the form:
CLIENT_ID | DATE_ENCOUNTER | DATE_COUNSELLING | COUNSELLING_COUNT |
---|---|---|---|
54950 | 2017-11-24 | NaN | 0 |
54950 | 2018-01-19 | NaN | 0 |
54950 | 2018-03-13 | NaN | 0 |
54950 | 2018-05-11 | 2018-04-30 | 1 |
54950 | 2018-12-17 | 2018-06-25 | 3 |
67777 | 2015-09-01 | NaN | 0 |
67777 | 2015-12-01 | NaN | 0 |
67777 | 2016-02-28 | 2016-02-28 | 1 |
70000 | 2019-06-07 | 2019-06-07 | 1 |
70000 | 2019-08-09 | 2019-06-07 | 1 |
I want to create a column COUNSELLING_STARTED
which indicates whether a client CLIENT_ID
has started counselling, but only the first time. i.e. The first occurence when COUNSELLING_COUNT == 1
for each CLIENT_ID
which should result in the dataframe below:
CLIENT_ID | DATE_ENCOUNTER | DATE_COUNSELLING | COUNSELLING_COUNT | COUNSELLING_STARTED |
---|---|---|---|---|
54950 | 2017-11-24 | NaN | 0 | 0 |
54950 | 2018-01-19 | NaN | 0 | 0 |
54950 | 2018-03-13 | NaN | 0 | 0 |
54950 | 2018-05-11 | 2018-04-30 | 1 | 1 |
54950 | 2018-12-17 | 2018-06-25 | 3 | 0 |
67777 | 2015-09-01 | NaN | 0 | 0 |
67777 | 2015-12-01 | NaN | 0 | 0 |
67777 | 2016-02-28 | 2016-02-28 | 1 | 1 |
70000 | 2019-06-07 | 2019-06-07 | 1 | 1 |
70000 | 2019-08-09 | 2019-06-07 | 1 | 0 |
below is the code to generate the dataframe:
data = {'CLIENT_ID':[54950,54950,54950,54950,54950,67777,67777,67777,70000,70000],
'DATE_ENCOUNTER':['2017-11-24','2018-01-19','2018-03-13','2018-05-11','2018-12-17','2015-09-01','2015-12-01','2016-02-28','2019-06-07','2019-08-09'],
'DATE_COUNSELLING':[np.nan,np.nan,np.nan,'2018-04-30','2018-06-25',np.nan,np.nan,'2016-02-28','2019-06-07','2019-06-07'],
'COUNSELLING_COUNT':[0,0,0,1,3,0,0,1,1,1]}
df = pd.DataFrame(data)
CodePudding user response:
df.loc[:, 'COUNSELLING_STARTED'] = 0
df.loc[df['DATE_COUNSELLING'].isna().groupby(df['CLIENT_ID']).idxmin(), 'COUNSELLING_STARTED'] = 1
Explanation:
Find where the counseling dates are nan
; then groupby the client IDs and find the index of the minimum (which will be the first entry):
>>> df['DATE_COUNSELLING'].isna().groupby(df['CLIENT_ID']).idxmin()
CLIENT_ID
54950 3
67777 7
70000 8
Name: DATE_COUNSELLING, dtype: int64
You are using these indices to choose where to write 1
in the new column.
The final df
is then:
CLIENT_ID DATE_ENCOUNTER ... COUNSELLING_COUNT COUNSELLING_STARTED
0 54950 2017-11-24 ... 0 0
1 54950 2018-01-19 ... 0 0
2 54950 2018-03-13 ... 0 0
3 54950 2018-05-11 ... 1 1
4 54950 2018-12-17 ... 3 0
5 67777 2015-09-01 ... 0 0
6 67777 2015-12-01 ... 0 0
7 67777 2016-02-28 ... 1 1
8 70000 2019-06-07 ... 1 1
9 70000 2019-08-09 ... 1 0
[10 rows x 5 columns]
If you wanted to instead explicitly select the earliest counseling date (rather than the first non-NA value), you could instead use this as your indexer:
>>> pd.to_datetime(df['DATE_COUNSELLING']).groupby(df['CLIENT_ID']).idxmin()
CLIENT_ID
54950 3
67777 7
70000 8
Name: DATE_COUNSELLING, dtype: int64
Which gives the same result here since the dates are sorted for each client (i.e. the earliest observed date is the first non-NA value).