Home > Mobile >  How to keep rows based on conditions of a data frame
How to keep rows based on conditions of a data frame

Time:09-06

I have a df like below

VisitDate  FirstDate          ID       Dx
0  2018-04-16        NaT  7410013407  OtherPA
1  2018-05-15 2018-05-15  7410013407       PA
2  2018-05-15        NaT  7410013407  OtherPA
3  2018-05-24        NaT  7410013407  OtherPA
4  2019-11-02        NaT  7410022611  OtherPA
5  2019-11-05 2019-11-05  7410022611       OM
6  2018-05-15        NaT     8723458  OtherPA
7  2018-08-25        NaT     8723458  OtherPA
8  2018-09-10 2018-09-10     8723458       PA
9  2018-09-14 2018-09-10     8723458       PA
10 2018-12-07 2018-09-10     8723458       OM
11 2019-05-01 2018-09-10     8723458       OM
12 2020-08-06        NaT     8723458  OtherOM
13 2018-08-20        NaT     8723458  OtherOM

    df.to_dict(){
    "VisitDate": {
        0: Timestamp("2018-04-16 00:00:00"),
        1: Timestamp("2018-05-15 00:00:00"),
        2: Timestamp("2018-05-15 00:00:00"),
        3: Timestamp("2018-05-24 00:00:00"),
        4: Timestamp("2019-11-02 00:00:00"),
        5: Timestamp("2019-11-05 00:00:00"),
        6: Timestamp("2018-05-15 00:00:00"),
        7: Timestamp("2018-08-25 00:00:00"),
        8: Timestamp("2018-09-10 00:00:00"),
        9: Timestamp("2018-09-14 00:00:00"),
        10: Timestamp("2018-12-07 00:00:00"),
        11: Timestamp("2019-05-01 00:00:00"),
        12: Timestamp("2020-08-06 00:00:00"),
        13: Timestamp("2018-08-20 00:00:00"),
    },
    "FirstDate": {
        0: NaT,
        1: Timestamp("2018-05-15 00:00:00"),
        2: NaT,
        3: NaT,
        4: NaT,
        5: Timestamp("2019-11-05 00:00:00"),
        6: NaT,
        7: NaT,
        8: Timestamp("2018-09-10 00:00:00"),
        9: Timestamp("2018-09-10 00:00:00"),
        10: Timestamp("2018-09-10 00:00:00"),
        11: Timestamp("2018-09-10 00:00:00"),
        12: NaT,
        13: NaT,
    },
    "ID": {
        0: 7410013407,
        1: 7410013407,
        2: 7410013407,
        3: 7410013407,
        4: 7410022611,
        5: 7410022611,
        6: 8723458,
        7: 8723458,
        8: 8723458,
        9: 8723458,
        10: 8723458,
        11: 8723458,
        12: 8723458,
        13: 8723458,
    },
    "Dx": {
        0: "OtherPA",
        1: "PA",
        2: "OtherPA",
        3: "OtherPA",
        4: "OtherPA",
        5: "OM",
        6: "OtherPA",
        7: "OtherPA",
        8: "PA",
        9: "PA",
        10: "OM",
        11: "OM",
        12: "OtherOM",
        13: "OtherOM",
    },
}

df.dtypes
VisitDate    datetime64[ns]
FirstDate    datetime64[ns]
ID                    int64
Dx                   object

Each ID has one FirstDate.
For each ID, I want to keep:

  • All rows having PA or OM
  • Rows having OtherPA, if Dx on FirstDate is PA AND VisitDate is within 30 days pre-FirstDate
    For example: ID 8723458, Dx on FirstDate 10/09/2018 is PA
    _row having OtherPA and VisitDate 25/08/2018 would be kept
    _row having OtherPA and VisitDate 15/05/2018 would be removed
  • Rows having OtherOM, if Dx on FirstDate is OM AND VisitDate is within 30 days pre-FirstDate

My expected result

VisitDate  FirstDate          ID       Dx
0 2018-04-16        NaT  7410013407  OtherPA
1 2018-05-15 2018-05-15  7410013407       PA
2 2018-05-15        NaT  7410013407  OtherPA
3 2019-11-05 2019-11-05  7410022611       OM
4 2018-08-25        NaT     8723458  OtherPA
5 2018-09-10 2018-09-10     8723458       PA
6 2018-09-14 2018-09-10     8723458       PA
7 2018-12-07 2018-09-10     8723458       OM
8 2019-05-01 2018-09-10     8723458       OM
9 2018-08-20        NaT     8723458  OtherOM

Any suggestions would be greatly appreciated!!

CodePudding user response:

Try as follows:

cond1 = df['VisitDate'].sub(df.groupby('ID')['FirstDate']\
                            .transform('first')).dt.days.between(-30,0)
cond2 = df['Dx'].isin(['OM','PA'])

output = df[cond1 | cond2].reset_index(drop=True)

print(output)

    VisitDate  FirstDate          ID       Dx
0  2018-04-16        NaT  7410013407  OtherPA
1  2018-05-15 2018-05-15  7410013407       PA
2  2018-05-15        NaT  7410013407  OtherPA
3  2019-11-02        NaT  7410022611  OtherPA
4  2019-11-05 2019-11-05  7410022611       OM
5  2018-08-25        NaT     8723458  OtherPA
6  2018-09-10 2018-09-10     8723458       PA
7  2018-09-14 2018-09-10     8723458       PA
8  2018-12-07 2018-09-10     8723458       OM
9  2019-05-01 2018-09-10     8723458       OM
10 2018-08-20        NaT     8723458  OtherOM

CodePudding user response:

Because need distinguisg froup by PA and OM is extracted last 2 character in column Dx and get first non missing value in FirstDate column which is subtracted by VisitDate and filter values between 0, 30 days, last chain condition for all rows with OM,PA values in boolean indexing:

m1 = (df.groupby(['ID', df['Dx'].str[-2:]])['FirstDate']
        .transform('first').sub(df['VisitDate']).dt.days.between(0,30))
m2 = df['Dx'].isin(['PA','OM'])

df = df[m1 | m2]
print (df)
    VisitDate  FirstDate          ID       Dx
0  2018-04-16        NaT  7410013407  OtherPA
1  2018-05-15 2018-05-15  7410013407       PA
2  2018-05-15        NaT  7410013407  OtherPA
5  2019-11-05 2019-11-05  7410022611       OM
7  2018-08-25        NaT     8723458  OtherPA
8  2018-09-10 2018-09-10     8723458       PA
9  2018-09-14 2018-09-10     8723458       PA
10 2018-12-07 2018-09-10     8723458       OM
11 2019-05-01 2018-09-10     8723458       OM
13 2018-08-20        NaT     8723458  OtherOM

Details:

print (df['Dx'].str[-2:])
0     PA
1     PA
2     PA
3     PA
4     PA
5     OM
6     PA
7     PA
8     PA
9     PA
10    OM
11    OM
12    OM
13    OM
Name: Dx, dtype: object
  • Related