How do we use groupby and rolling fuctions and maintain structure of the original dataframe?-CodePudding

I am trying to search through the following "patients" by their ID # ['PAT_MRN_ID'] to see whether they have either two consecutive positive results for either NG or CT in a 90 day period. If they do, I'd like a sum of the positive results per each ID # with the specific dates of those results.

My intention is to later remove patients that have >=2 positive results in a 90 day period from my data set. I am able to run the data through the groupby and then through rolling. However, doing so results in truncation of the dataframe to only 3 colums.

An example of the df is as follows:

PAT_NAME	PAT_MRN_ID	AGE	GENDER	ETHNICITY	RACE	ORDER_INST	TEST_TYPE	RESULTS	ALL_RESULTS
Bee, Max	1234	24	Male	Hispanic or Latinx	Black	1/20/2023	NG	negative	False
Jones, Jim	12345	35	Male	Hispanic or Latinx	White	1/01/2022	NG	negative	False
Jones, Jim	12345	35	Male	Hispanic or Latinx	White	3/16/2022	CT	positive	True
Jones, Jim	12345	35	Male	Hispanic or Latinx	White	3/17/2022	CT	positive	True
Rell, Hell	123456	30	Male	Hispanic or Latinx	Asian	3/03/2022	NG	positive	True
Rell, Hell	123456	30	Male	Hispanic or Latinx	Asian	3/03/2022	CT	positive	True
Writer, JR	1234567	88	Female	Hispanic or Latinx	Asian	1/03/2022	CT	positive	True
Writer, JR	1234567	88	Female	Hispanic or Latinx	Asian	9/03/2022	CT	positive	True
Ron, Cam	12345678	55	Male	Black	Black	9/03/2022	CT	positive	True
Ron, Cam	12345678	55	Male	Black	Black	9/03/2022	CT	positive	True

Tried the following:

df['SUM'] = df.groupby(["PAT_MRN_ID","TEST_TYPE"],as_index=False)['ORDER_INST', 'TEST_TYPE', 'ALL_RESULTS'].rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS'].sum()

ValueError: cannot handle a non-unique multi-index!

Expected:

PAT_NAME	PAT_MRN_ID	AGE	GENDER	ETHNICITY	RACE	ORDER_INST	TEST_TYPE	RESULTS	ALL_RESULTS	SUM
Bee, Max	1234	24	Male	Hispanic or Latinx	Black	1/20/2023	NG	negative	False	NaN
Jones, Jim	12345	35	Male	Hispanic or Latinx	White	1/01/2022	NG	negative	False	NaN
Jones, Jim	12345	35	Male	Hispanic or Latinx	White	3/16/2022	CT	positive	True	NaN
Jones, Jim	12345	35	Male	Hispanic or Latinx	White	3/17/2022	CT	positive	True	2
Rell, Hell	123456	30	Male	Hispanic or Latinx	Asian	3/03/2022	NG	positive	True	NaN
Rell, Hell	123456	30	Male	Hispanic or Latinx	Asian	3/03/2022	CT	positive	True	NaN
Writer, JR	1234567	88	Female	Hispanic or Latinx	Asian	1/03/2022	CT	positive	True	NaN
Writer, JR	1234567	88	Female	Hispanic or Latinx	Asian	9/03/2022	CT	positive	True	NaN
Ron, Cam	12345678	55	Male	Black	Black	9/03/2022	CT	positive	True
Ron, Cam	12345678	55	Male	Black	Black	9/03/2022	CT	positive	True	2

If I attempt the following:

df.groupby(["PAT_MRN_ID","TEST_TYPE"],as_index=False)['ORDER_INST', 'TEST_TYPE', 'ALL_RESULTS'].rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS'].sum()

I get this:

PAT_MRN_ID	ORDER_INST	SUM
1234	1/20/2023	NaN
12345	1/01/2022	Nan
12345	3/16/2022	NaN
12345	3/17/2022	2
123456	3/03/2022	NaN
123456	3/03/2022	NaN
1234567	9/03/2022	NaN
1234567	9/03/2022	NaN
12345678	9/03/2022	NaN
12345678	9/03/2022	2

Furthermore, I have tried passing this through transform(lambda x: x.rolling()) and get:

ValueError: invalid on specified as ORDER_INST, must be a column (of DataFrame), an Index or None The above exception was the direct cause of the following exception: ValueError: transform must return a scalar value for each group

I also attempted turing ['ALL_RESULTS'] from 'True'/'False' into '1'/'0' respectively to give transform a "scalar;" however, that also failed to work.

In an ideal world I want to keep the original dataframe format and add a row for the rolling sum OR simply just remove the instance of duplicate positive test results in a 90 day period from the dataframe.

CodePudding user response：

Use DataFrame.join by MultiIndex, also was removed TEST_TYPE column and as_index=False parameter:

df['ORDER_INST'] = pd.to_datetime(df['ORDER_INST'])

df = (df.join(df.groupby(["PAT_MRN_ID","TEST_TYPE"])[['ORDER_INST', 'ALL_RESULTS']]
                .rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS']
                .sum().rename('SUM'), on=['PAT_MRN_ID','TEST_TYPE','ORDER_INST']))
print (df)
      PAT_NAME  PAT_MRN_ID  AGE   GENDER            ETHNICITY    RACE  \
0    Bee, Max         1234   24    Male   Hispanic or Latinx   Black    
1  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
2  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
3  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
4  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
5  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
6  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    
7  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    

  ORDER_INST TEST_TYPE    RESULTS  ALL_RESULTS  SUM  
0 2023-01-20       NG   negative         False  NaN  
1 2022-01-01       NG   negative         False  NaN  
2 2022-03-16       CT   positive          True  NaN  
3 2022-03-17       CT   positive          True  2.0  
4 2022-03-03       NG   positive          True  NaN  
5 2022-03-03       CT   positive          True  NaN  
6 2022-01-03       CT   positive          True  NaN  
7 2022-09-03       CT   positive          True  NaN

EDIT: I add GroupBy.cumcount for counter of duplicated values:

df['ORDER_INST'] = pd.to_datetime(df['ORDER_INST'])

df1 = (df.groupby(["PAT_MRN_ID","TEST_TYPE"])[['ORDER_INST', 'ALL_RESULTS']]
         .rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS']
         .sum()
         .to_frame('SUM')
         .assign(g = lambda x:x.groupby(['PAT_MRN_ID','TEST_TYPE','ORDER_INST']).cumcount())
         .set_index('g', append=True))


df['g'] = df.groupby(['PAT_MRN_ID','TEST_TYPE','ORDER_INST']).cumcount()
df = df.join(df1, on=['PAT_MRN_ID','TEST_TYPE','ORDER_INST','g']).drop('g', axis=1)

print (df)
      PAT_NAME  PAT_MRN_ID  AGE   GENDER            ETHNICITY    RACE  \
0    Bee, Max         1234   24    Male   Hispanic or Latinx   Black    
1  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
2  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
3  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
4  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
5  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
6  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    
7  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    
8    Ron, Cam     12345678   55    Male                Black   Black    
9    Ron, Cam     12345678   55    Male                Black   Black    

  ORDER_INST TEST_TYPE    RESULTS  ALL_RESULTS  SUM  
0 2023-01-20       NG   negative         False  NaN  
1 2022-01-01       NG   negative         False  NaN  
2 2022-03-16       CT   positive          True  NaN  
3 2022-03-17       CT   positive          True  2.0  
4 2022-03-03       NG   positive          True  NaN  
5 2022-03-03       CT   positive          True  NaN  
6 2022-01-03       CT   positive          True  NaN  
7 2022-09-03       CT   positive          True  NaN  
8 2022-09-03       CT   positive          True  NaN  
9 2022-09-03       CT   positive          True  2.0