Home > Blockchain >  How do we use groupby and rolling fuctions and maintain structure of the original dataframe?
How do we use groupby and rolling fuctions and maintain structure of the original dataframe?

Time:02-03

I am trying to search through the following "patients" by their ID # ['PAT_MRN_ID'] to see whether they have either two consecutive positive results for either NG or CT in a 90 day period. If they do, I'd like a sum of the positive results per each ID # with the specific dates of those results.

My intention is to later remove patients that have >=2 positive results in a 90 day period from my data set. I am able to run the data through the groupby and then through rolling. However, doing so results in truncation of the dataframe to only 3 colums.

An example of the df is as follows:

PAT_NAME PAT_MRN_ID AGE GENDER ETHNICITY RACE ORDER_INST TEST_TYPE RESULTS ALL_RESULTS
Bee, Max 1234 24 Male Hispanic or Latinx Black 1/20/2023 NG negative False
Jones, Jim 12345 35 Male Hispanic or Latinx White 1/01/2022 NG negative False
Jones, Jim 12345 35 Male Hispanic or Latinx White 3/16/2022 CT positive True
Jones, Jim 12345 35 Male Hispanic or Latinx White 3/17/2022 CT positive True
Rell, Hell 123456 30 Male Hispanic or Latinx Asian 3/03/2022 NG positive True
Rell, Hell 123456 30 Male Hispanic or Latinx Asian 3/03/2022 CT positive True
Writer, JR 1234567 88 Female Hispanic or Latinx Asian 1/03/2022 CT positive True
Writer, JR 1234567 88 Female Hispanic or Latinx Asian 9/03/2022 CT positive True
Ron, Cam 12345678 55 Male Black Black 9/03/2022 CT positive True
Ron, Cam 12345678 55 Male Black Black 9/03/2022 CT positive True

Tried the following:

df['SUM'] = df.groupby(["PAT_MRN_ID","TEST_TYPE"],as_index=False)['ORDER_INST', 'TEST_TYPE', 'ALL_RESULTS'].rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS'].sum()

ValueError: cannot handle a non-unique multi-index!

Expected:

PAT_NAME PAT_MRN_ID AGE GENDER ETHNICITY RACE ORDER_INST TEST_TYPE RESULTS ALL_RESULTS SUM
Bee, Max 1234 24 Male Hispanic or Latinx Black 1/20/2023 NG negative False NaN
Jones, Jim 12345 35 Male Hispanic or Latinx White 1/01/2022 NG negative False NaN
Jones, Jim 12345 35 Male Hispanic or Latinx White 3/16/2022 CT positive True NaN
Jones, Jim 12345 35 Male Hispanic or Latinx White 3/17/2022 CT positive True 2
Rell, Hell 123456 30 Male Hispanic or Latinx Asian 3/03/2022 NG positive True NaN
Rell, Hell 123456 30 Male Hispanic or Latinx Asian 3/03/2022 CT positive True NaN
Writer, JR 1234567 88 Female Hispanic or Latinx Asian 1/03/2022 CT positive True NaN
Writer, JR 1234567 88 Female Hispanic or Latinx Asian 9/03/2022 CT positive True NaN
Ron, Cam 12345678 55 Male Black Black 9/03/2022 CT positive True
Ron, Cam 12345678 55 Male Black Black 9/03/2022 CT positive True 2

If I attempt the following:

df.groupby(["PAT_MRN_ID","TEST_TYPE"],as_index=False)['ORDER_INST', 'TEST_TYPE', 'ALL_RESULTS'].rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS'].sum()

I get this:

PAT_MRN_ID ORDER_INST SUM
1234 1/20/2023 NaN
12345 1/01/2022 Nan
12345 3/16/2022 NaN
12345 3/17/2022 2
123456 3/03/2022 NaN
123456 3/03/2022 NaN
1234567 9/03/2022 NaN
1234567 9/03/2022 NaN
12345678 9/03/2022 NaN
12345678 9/03/2022 2

Furthermore, I have tried passing this through transform(lambda x: x.rolling()) and get:

ValueError: invalid on specified as ORDER_INST, must be a column (of DataFrame), an Index or None The above exception was the direct cause of the following exception: ValueError: transform must return a scalar value for each group

I also attempted turing ['ALL_RESULTS'] from 'True'/'False' into '1'/'0' respectively to give transform a "scalar;" however, that also failed to work.

In an ideal world I want to keep the original dataframe format and add a row for the rolling sum OR simply just remove the instance of duplicate positive test results in a 90 day period from the dataframe.

CodePudding user response:

Use DataFrame.join by MultiIndex, also was removed TEST_TYPE column and as_index=False parameter:

df['ORDER_INST'] = pd.to_datetime(df['ORDER_INST'])

df = (df.join(df.groupby(["PAT_MRN_ID","TEST_TYPE"])[['ORDER_INST', 'ALL_RESULTS']]
                .rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS']
                .sum().rename('SUM'), on=['PAT_MRN_ID','TEST_TYPE','ORDER_INST']))
print (df)
      PAT_NAME  PAT_MRN_ID  AGE   GENDER            ETHNICITY    RACE  \
0    Bee, Max         1234   24    Male   Hispanic or Latinx   Black    
1  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
2  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
3  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
4  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
5  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
6  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    
7  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    

  ORDER_INST TEST_TYPE    RESULTS  ALL_RESULTS  SUM  
0 2023-01-20       NG   negative         False  NaN  
1 2022-01-01       NG   negative         False  NaN  
2 2022-03-16       CT   positive          True  NaN  
3 2022-03-17       CT   positive          True  2.0  
4 2022-03-03       NG   positive          True  NaN  
5 2022-03-03       CT   positive          True  NaN  
6 2022-01-03       CT   positive          True  NaN  
7 2022-09-03       CT   positive          True  NaN  

EDIT: I add GroupBy.cumcount for counter of duplicated values:

df['ORDER_INST'] = pd.to_datetime(df['ORDER_INST'])

df1 = (df.groupby(["PAT_MRN_ID","TEST_TYPE"])[['ORDER_INST', 'ALL_RESULTS']]
         .rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS']
         .sum()
         .to_frame('SUM')
         .assign(g = lambda x:x.groupby(['PAT_MRN_ID','TEST_TYPE','ORDER_INST']).cumcount())
         .set_index('g', append=True))


df['g'] = df.groupby(['PAT_MRN_ID','TEST_TYPE','ORDER_INST']).cumcount()
df = df.join(df1, on=['PAT_MRN_ID','TEST_TYPE','ORDER_INST','g']).drop('g', axis=1)

print (df)
      PAT_NAME  PAT_MRN_ID  AGE   GENDER            ETHNICITY    RACE  \
0    Bee, Max         1234   24    Male   Hispanic or Latinx   Black    
1  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
2  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
3  Jones, Jim        12345   35    Male   Hispanic or Latinx   White    
4  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
5  Rell, Hell       123456   30    Male   Hispanic or Latinx   Asian    
6  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    
7  Writer, JR      1234567   88  Female   Hispanic or Latinx   Asian    
8    Ron, Cam     12345678   55    Male                Black   Black    
9    Ron, Cam     12345678   55    Male                Black   Black    

  ORDER_INST TEST_TYPE    RESULTS  ALL_RESULTS  SUM  
0 2023-01-20       NG   negative         False  NaN  
1 2022-01-01       NG   negative         False  NaN  
2 2022-03-16       CT   positive          True  NaN  
3 2022-03-17       CT   positive          True  2.0  
4 2022-03-03       NG   positive          True  NaN  
5 2022-03-03       CT   positive          True  NaN  
6 2022-01-03       CT   positive          True  NaN  
7 2022-09-03       CT   positive          True  NaN  
8 2022-09-03       CT   positive          True  NaN  
9 2022-09-03       CT   positive          True  2.0  
  • Related