I have a dataset, with patient records that have either had consults or not indicated by 7 variables (FITM1-7, I will only showcase 2, FITM1, FITM2) having a particular code (I provide a list of codes, msp_codes).
My conditions are to check each record for the occurrence of the codes and then check whether the associated date of which consult took place (SERVDT1-7) was between the admission date (ADMIT_DATE) and surgery date (SURG_DATE), if so I want to create an additional variable med_consult
= 1 or med_consult
= 0
My issue is how to check for these pair of variables for the conditions (FITM1-7, SERVDT1-7) and ignore the irrelevant variable (CLMSPEC1-7) that comes between events, I have again used np.select
to do something similar but found that I was writing a lot of repetitive code to account for all conditions but would rather a better way of maybe melting melt()
the dataset and doing it this way.
msp_codes = [310, 311, 0305, 308, 910]
# Input
patients = [
("1", '2022-11-13', '2022-11-15', "310", "2022-11-14", "M3200", "00456", "2022-11-14", "G456"),
("2", '2022-11-10', '2022-11-22', "007", "2022-11-14", "M05", "0033", "2022-11-20", "G456"),
("3", '2022-11-09', '2022-11-13', "309", "2022-10-14", "N058", "308", "2021-10-02", "F70"),
]
labels = ["ID", 'ADMIT_DATE', 'SURG_DATE', "FITM1", "SERVDT1", "CLMSPEC1", "FITM2", "SERVDT2", "CLMSPEC2"]
df_patients = pd.DataFrame.from_records(patients, columns=labels)
date_cols = ['ADMIT_DATE', 'SURG_DATE', 'SERVDT1', 'SERVDT2']
df_patients[date_cols] = df_patients[date_cols].apply(pd.to_datetime, errors='coerce')
df_patients
ID ADMIT_DATE SURG_DATE FITM1 SERVDT1 CLMSPEC1 FITM2 SERVDT2 CLMSPEC2
1 2022-11-13 2022-11-15 310 2022-11-14 M3200 00456 2022-11-14 G456
2 2022-11-10 2022-11-22 007 2022-11-14 M05 0033 2022-11-20 G456
3 2022-11-09 2022-11-13 309 2022-10-14 N058 308 2021-10-02 F70
# Output
ID ADMIT_DATE SURG_DATE FITM1 SERVDT1 CLMSPEC1 FITM2 SERVDT2 CLMSPEC2 med_consult
1 2022-11-13 2022-11-15 310 2022-11-14 M3200 00456 2022-11-14 G456 1
2 2022-11-10 2022-11-22 007 2022-11-14 M05 0033 2022-11-20 G456 0
3 2022-11-09 2022-11-13 309 2022-10-14 N058 308 2021-10-02 F70 0
Row 1 meets both the condition of having a MSP code in the list 310 and a associated date within the admission date and surgery date.
Row 2 doesn't meet the criteria of having an associated MSP code, so dates aren't checked.
Row 3 has an associated MSP code but the date associated doesn't fall within admission date and surgery date.
CodePudding user response:
Setup, note how I've properly formatted msp_codes
and modified how the date_cols
are made:
msp_codes = ['310', '311', '0305', '308', '910']
patients = [
("1", '2022-11-13', '2022-11-15', "310", "2022-11-14", "M3200", "00456", "2022-11-14", "G456"),
("2", '2022-11-10', '2022-11-22', "007", "2022-11-14", "M05", "0033", "2022-11-20", "G456"),
("3", '2022-11-09', '2022-11-13', "309", "2022-10-14", "N058", "308", "2021-10-02", "F70"),
]
labels = ["ID", 'ADMIT_DATE', 'SURG_DATE', "FITM1", "SERVDT1", "CLMSPEC1", "FITM2", "SERVDT2", "CLMSPEC2"]
df_patients = pd.DataFrame.from_records(patients, columns=labels)
date_cols = ['ADMIT_DATE', 'SURG_DATE', 'SERVDT1', 'SERVDT2']
for col in date_cols:
df_patients[col] = pd.to_datetime(df_patients[col])
Doing:
mask = pd.DataFrame()
for servdt, fitm in zip(df_patients.filter(like='SERVDT'), df_patients.filter(like='FITM')):
sub_mask = (df_patients[servdt].between(df_patients.ADMIT_DATE, df_patients.SURG_DATE)
& df_patients[fitm].isin(msp_codes))
mask = pd.concat([mask, sub_mask], axis=1)
df_patients['med_consult'] = 0
df_patients.loc[mask.any(axis=1), 'med_consult'] = 1
Output:
ID ADMIT_DATE SURG_DATE FITM1 SERVDT1 CLMSPEC1 FITM2 SERVDT2 CLMSPEC2 med_consult
0 1 2022-11-13 2022-11-15 310 2022-11-14 M3200 00456 2022-11-14 G456 1
1 2 2022-11-10 2022-11-22 007 2022-11-14 M05 0033 2022-11-20 G456 0
2 3 2022-11-09 2022-11-13 309 2022-10-14 N058 308 2021-10-02 F70 0
Alternative method using pd.wide_to_long
:
df = pd.wide_to_long(df_patients, stubnames=['FITM', 'SERVDT', 'CLMSPEC'], i=['ID', 'ADMIT_DATE', 'SURG_DATE'], j='med_consult').reset_index()
df = df[df.FITM.isin(msp_codes) & df.SERVDT.between(df.ADMIT_DATE, df.SURG_DATE)]
df = df.groupby(['ID'],as_index=False)['med_consult'].first()
df.med_consult = 1
df_patients = df_patients.merge(df, 'left').fillna(0)
Output:
ID ADMIT_DATE SURG_DATE FITM1 SERVDT1 CLMSPEC1 FITM2 SERVDT2 CLMSPEC2 med_consult
0 1 2022-11-13 2022-11-15 310 2022-11-14 M3200 00456 2022-11-14 G456 1.0
1 2 2022-11-10 2022-11-22 007 2022-11-14 M05 0033 2022-11-20 G456 0.0
2 3 2022-11-09 2022-11-13 309 2022-10-14 N058 308 2021-10-02 F70 0.0