Home > other >  How to create an indicator column of the first occurrence of a variable of groupby ID sorted by date
How to create an indicator column of the first occurrence of a variable of groupby ID sorted by date

Time:10-22

I have some hospital visit healthcare data in a dataframe of the form:

CLIENT_ID DATE_ENCOUNTER DATE_COUNSELLING COUNSELLING_COUNT
54950 2017-11-24 NaN 0
54950 2018-01-19 NaN 0
54950 2018-03-13 NaN 0
54950 2018-05-11 2018-04-30 1
54950 2018-12-17 2018-06-25 3
67777 2015-09-01 NaN 0
67777 2015-12-01 NaN 0
67777 2016-02-28 2016-02-28 1
70000 2019-06-07 2019-06-07 1
70000 2019-08-09 2019-06-07 1

I want to create a column COUNSELLING_STARTED which indicates whether a client CLIENT_ID has started counselling, but only the first time. i.e. The first occurence when COUNSELLING_COUNT == 1 for each CLIENT_ID which should result in the dataframe below:

CLIENT_ID DATE_ENCOUNTER DATE_COUNSELLING COUNSELLING_COUNT COUNSELLING_STARTED
54950 2017-11-24 NaN 0 0
54950 2018-01-19 NaN 0 0
54950 2018-03-13 NaN 0 0
54950 2018-05-11 2018-04-30 1 1
54950 2018-12-17 2018-06-25 3 0
67777 2015-09-01 NaN 0 0
67777 2015-12-01 NaN 0 0
67777 2016-02-28 2016-02-28 1 1
70000 2019-06-07 2019-06-07 1 1
70000 2019-08-09 2019-06-07 1 0

below is the code to generate the dataframe:

data = {'CLIENT_ID':[54950,54950,54950,54950,54950,67777,67777,67777,70000,70000],
'DATE_ENCOUNTER':['2017-11-24','2018-01-19','2018-03-13','2018-05-11','2018-12-17','2015-09-01','2015-12-01','2016-02-28','2019-06-07','2019-08-09'],
'DATE_COUNSELLING':[np.nan,np.nan,np.nan,'2018-04-30','2018-06-25',np.nan,np.nan,'2016-02-28','2019-06-07','2019-06-07'],
'COUNSELLING_COUNT':[0,0,0,1,3,0,0,1,1,1]}

df = pd.DataFrame(data)

CodePudding user response:

df.loc[:, 'COUNSELLING_STARTED'] = 0
df.loc[df['DATE_COUNSELLING'].isna().groupby(df['CLIENT_ID']).idxmin(), 'COUNSELLING_STARTED'] = 1

Explanation:

Find where the counseling dates are nan; then groupby the client IDs and find the index of the minimum (which will be the first entry):

>>> df['DATE_COUNSELLING'].isna().groupby(df['CLIENT_ID']).idxmin()
CLIENT_ID
54950    3
67777    7
70000    8
Name: DATE_COUNSELLING, dtype: int64

You are using these indices to choose where to write 1 in the new column.

The final df is then:

   CLIENT_ID DATE_ENCOUNTER  ... COUNSELLING_COUNT  COUNSELLING_STARTED
0      54950     2017-11-24  ...                 0                    0
1      54950     2018-01-19  ...                 0                    0
2      54950     2018-03-13  ...                 0                    0
3      54950     2018-05-11  ...                 1                    1
4      54950     2018-12-17  ...                 3                    0
5      67777     2015-09-01  ...                 0                    0
6      67777     2015-12-01  ...                 0                    0
7      67777     2016-02-28  ...                 1                    1
8      70000     2019-06-07  ...                 1                    1
9      70000     2019-08-09  ...                 1                    0

[10 rows x 5 columns]

If you wanted to instead explicitly select the earliest counseling date (rather than the first non-NA value), you could instead use this as your indexer:

>>> pd.to_datetime(df['DATE_COUNSELLING']).groupby(df['CLIENT_ID']).idxmin()
CLIENT_ID
54950    3
67777    7
70000    8
Name: DATE_COUNSELLING, dtype: int64

Which gives the same result here since the dates are sorted for each client (i.e. the earliest observed date is the first non-NA value).

  • Related