I am trying to search through the following "patients" by their ID # ['PAT_MRN_ID'] to see whether they have either two consecutive positive results for either NG or CT in a 90 day period. If they do, I'd like a sum of the positive results per each ID # with the specific dates of those results.
My intention is to later remove patients that have >=2 positive results in a 90 day period from my data set. I am able to run the data through the groupby and then through rolling. However, doing so results in truncation of the dataframe to only 3 colums.
An example of the df is as follows:
PAT_NAME | PAT_MRN_ID | AGE | GENDER | ETHNICITY | RACE | ORDER_INST | TEST_TYPE | RESULTS | ALL_RESULTS |
---|---|---|---|---|---|---|---|---|---|
Bee, Max | 1234 | 24 | Male | Hispanic or Latinx | Black | 1/20/2023 | NG | negative | False |
Jones, Jim | 12345 | 35 | Male | Hispanic or Latinx | White | 1/01/2022 | NG | negative | False |
Jones, Jim | 12345 | 35 | Male | Hispanic or Latinx | White | 3/16/2022 | CT | positive | True |
Jones, Jim | 12345 | 35 | Male | Hispanic or Latinx | White | 3/17/2022 | CT | positive | True |
Rell, Hell | 123456 | 30 | Male | Hispanic or Latinx | Asian | 3/03/2022 | NG | positive | True |
Rell, Hell | 123456 | 30 | Male | Hispanic or Latinx | Asian | 3/03/2022 | CT | positive | True |
Writer, JR | 1234567 | 88 | Female | Hispanic or Latinx | Asian | 1/03/2022 | CT | positive | True |
Writer, JR | 1234567 | 88 | Female | Hispanic or Latinx | Asian | 9/03/2022 | CT | positive | True |
Ron, Cam | 12345678 | 55 | Male | Black | Black | 9/03/2022 | CT | positive | True |
Ron, Cam | 12345678 | 55 | Male | Black | Black | 9/03/2022 | CT | positive | True |
Tried the following:
df['SUM'] = df.groupby(["PAT_MRN_ID","TEST_TYPE"],as_index=False)['ORDER_INST', 'TEST_TYPE', 'ALL_RESULTS'].rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS'].sum()
ValueError: cannot handle a non-unique multi-index!
Expected:
PAT_NAME | PAT_MRN_ID | AGE | GENDER | ETHNICITY | RACE | ORDER_INST | TEST_TYPE | RESULTS | ALL_RESULTS | SUM |
---|---|---|---|---|---|---|---|---|---|---|
Bee, Max | 1234 | 24 | Male | Hispanic or Latinx | Black | 1/20/2023 | NG | negative | False | NaN |
Jones, Jim | 12345 | 35 | Male | Hispanic or Latinx | White | 1/01/2022 | NG | negative | False | NaN |
Jones, Jim | 12345 | 35 | Male | Hispanic or Latinx | White | 3/16/2022 | CT | positive | True | NaN |
Jones, Jim | 12345 | 35 | Male | Hispanic or Latinx | White | 3/17/2022 | CT | positive | True | 2 |
Rell, Hell | 123456 | 30 | Male | Hispanic or Latinx | Asian | 3/03/2022 | NG | positive | True | NaN |
Rell, Hell | 123456 | 30 | Male | Hispanic or Latinx | Asian | 3/03/2022 | CT | positive | True | NaN |
Writer, JR | 1234567 | 88 | Female | Hispanic or Latinx | Asian | 1/03/2022 | CT | positive | True | NaN |
Writer, JR | 1234567 | 88 | Female | Hispanic or Latinx | Asian | 9/03/2022 | CT | positive | True | NaN |
Ron, Cam | 12345678 | 55 | Male | Black | Black | 9/03/2022 | CT | positive | True | |
Ron, Cam | 12345678 | 55 | Male | Black | Black | 9/03/2022 | CT | positive | True | 2 |
If I attempt the following:
df.groupby(["PAT_MRN_ID","TEST_TYPE"],as_index=False)['ORDER_INST', 'TEST_TYPE', 'ALL_RESULTS'].rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS'].sum()
I get this:
PAT_MRN_ID | ORDER_INST | SUM |
---|---|---|
1234 | 1/20/2023 | NaN |
12345 | 1/01/2022 | Nan |
12345 | 3/16/2022 | NaN |
12345 | 3/17/2022 | 2 |
123456 | 3/03/2022 | NaN |
123456 | 3/03/2022 | NaN |
1234567 | 9/03/2022 | NaN |
1234567 | 9/03/2022 | NaN |
12345678 | 9/03/2022 | NaN |
12345678 | 9/03/2022 | 2 |
Furthermore, I have tried passing this through transform(lambda x: x.rolling()) and get:
ValueError: invalid on specified as ORDER_INST, must be a column (of DataFrame), an Index or None The above exception was the direct cause of the following exception: ValueError: transform must return a scalar value for each group
I also attempted turing ['ALL_RESULTS'] from 'True'/'False' into '1'/'0' respectively to give transform a "scalar;" however, that also failed to work.
In an ideal world I want to keep the original dataframe format and add a row for the rolling sum OR simply just remove the instance of duplicate positive test results in a 90 day period from the dataframe.
CodePudding user response:
Use DataFrame.join
by MultiIndex, also was removed TEST_TYPE
column and as_index=False
parameter:
df['ORDER_INST'] = pd.to_datetime(df['ORDER_INST'])
df = (df.join(df.groupby(["PAT_MRN_ID","TEST_TYPE"])[['ORDER_INST', 'ALL_RESULTS']]
.rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS']
.sum().rename('SUM'), on=['PAT_MRN_ID','TEST_TYPE','ORDER_INST']))
print (df)
PAT_NAME PAT_MRN_ID AGE GENDER ETHNICITY RACE \
0 Bee, Max 1234 24 Male Hispanic or Latinx Black
1 Jones, Jim 12345 35 Male Hispanic or Latinx White
2 Jones, Jim 12345 35 Male Hispanic or Latinx White
3 Jones, Jim 12345 35 Male Hispanic or Latinx White
4 Rell, Hell 123456 30 Male Hispanic or Latinx Asian
5 Rell, Hell 123456 30 Male Hispanic or Latinx Asian
6 Writer, JR 1234567 88 Female Hispanic or Latinx Asian
7 Writer, JR 1234567 88 Female Hispanic or Latinx Asian
ORDER_INST TEST_TYPE RESULTS ALL_RESULTS SUM
0 2023-01-20 NG negative False NaN
1 2022-01-01 NG negative False NaN
2 2022-03-16 CT positive True NaN
3 2022-03-17 CT positive True 2.0
4 2022-03-03 NG positive True NaN
5 2022-03-03 CT positive True NaN
6 2022-01-03 CT positive True NaN
7 2022-09-03 CT positive True NaN
EDIT: I add GroupBy.cumcount
for counter of duplicated values:
df['ORDER_INST'] = pd.to_datetime(df['ORDER_INST'])
df1 = (df.groupby(["PAT_MRN_ID","TEST_TYPE"])[['ORDER_INST', 'ALL_RESULTS']]
.rolling('90d', min_periods=2, on='ORDER_INST')['ALL_RESULTS']
.sum()
.to_frame('SUM')
.assign(g = lambda x:x.groupby(['PAT_MRN_ID','TEST_TYPE','ORDER_INST']).cumcount())
.set_index('g', append=True))
df['g'] = df.groupby(['PAT_MRN_ID','TEST_TYPE','ORDER_INST']).cumcount()
df = df.join(df1, on=['PAT_MRN_ID','TEST_TYPE','ORDER_INST','g']).drop('g', axis=1)
print (df)
PAT_NAME PAT_MRN_ID AGE GENDER ETHNICITY RACE \
0 Bee, Max 1234 24 Male Hispanic or Latinx Black
1 Jones, Jim 12345 35 Male Hispanic or Latinx White
2 Jones, Jim 12345 35 Male Hispanic or Latinx White
3 Jones, Jim 12345 35 Male Hispanic or Latinx White
4 Rell, Hell 123456 30 Male Hispanic or Latinx Asian
5 Rell, Hell 123456 30 Male Hispanic or Latinx Asian
6 Writer, JR 1234567 88 Female Hispanic or Latinx Asian
7 Writer, JR 1234567 88 Female Hispanic or Latinx Asian
8 Ron, Cam 12345678 55 Male Black Black
9 Ron, Cam 12345678 55 Male Black Black
ORDER_INST TEST_TYPE RESULTS ALL_RESULTS SUM
0 2023-01-20 NG negative False NaN
1 2022-01-01 NG negative False NaN
2 2022-03-16 CT positive True NaN
3 2022-03-17 CT positive True 2.0
4 2022-03-03 NG positive True NaN
5 2022-03-03 CT positive True NaN
6 2022-01-03 CT positive True NaN
7 2022-09-03 CT positive True NaN
8 2022-09-03 CT positive True NaN
9 2022-09-03 CT positive True 2.0