I have a df like below
VisitDate FirstDate ID Dx
0 2018-04-16 NaT 7410013407 OtherPA
1 2018-05-15 2018-05-15 7410013407 PA
2 2018-05-15 NaT 7410013407 OtherPA
3 2018-05-24 NaT 7410013407 OtherPA
4 2019-11-02 NaT 7410022611 OtherPA
5 2019-11-05 2019-11-05 7410022611 OM
6 2018-05-15 NaT 8723458 OtherPA
7 2018-08-25 NaT 8723458 OtherPA
8 2018-09-10 2018-09-10 8723458 PA
9 2018-09-14 2018-09-10 8723458 PA
10 2018-12-07 2018-09-10 8723458 OM
11 2019-05-01 2018-09-10 8723458 OM
12 2020-08-06 NaT 8723458 OtherOM
13 2018-08-20 NaT 8723458 OtherOM
df.to_dict(){
"VisitDate": {
0: Timestamp("2018-04-16 00:00:00"),
1: Timestamp("2018-05-15 00:00:00"),
2: Timestamp("2018-05-15 00:00:00"),
3: Timestamp("2018-05-24 00:00:00"),
4: Timestamp("2019-11-02 00:00:00"),
5: Timestamp("2019-11-05 00:00:00"),
6: Timestamp("2018-05-15 00:00:00"),
7: Timestamp("2018-08-25 00:00:00"),
8: Timestamp("2018-09-10 00:00:00"),
9: Timestamp("2018-09-14 00:00:00"),
10: Timestamp("2018-12-07 00:00:00"),
11: Timestamp("2019-05-01 00:00:00"),
12: Timestamp("2020-08-06 00:00:00"),
13: Timestamp("2018-08-20 00:00:00"),
},
"FirstDate": {
0: NaT,
1: Timestamp("2018-05-15 00:00:00"),
2: NaT,
3: NaT,
4: NaT,
5: Timestamp("2019-11-05 00:00:00"),
6: NaT,
7: NaT,
8: Timestamp("2018-09-10 00:00:00"),
9: Timestamp("2018-09-10 00:00:00"),
10: Timestamp("2018-09-10 00:00:00"),
11: Timestamp("2018-09-10 00:00:00"),
12: NaT,
13: NaT,
},
"ID": {
0: 7410013407,
1: 7410013407,
2: 7410013407,
3: 7410013407,
4: 7410022611,
5: 7410022611,
6: 8723458,
7: 8723458,
8: 8723458,
9: 8723458,
10: 8723458,
11: 8723458,
12: 8723458,
13: 8723458,
},
"Dx": {
0: "OtherPA",
1: "PA",
2: "OtherPA",
3: "OtherPA",
4: "OtherPA",
5: "OM",
6: "OtherPA",
7: "OtherPA",
8: "PA",
9: "PA",
10: "OM",
11: "OM",
12: "OtherOM",
13: "OtherOM",
},
}
df.dtypes
VisitDate datetime64[ns]
FirstDate datetime64[ns]
ID int64
Dx object
Each ID has one FirstDate.
For each ID, I want to keep:
- All rows having PA or OM
- Rows having OtherPA, if Dx on FirstDate is PA AND VisitDate is within 30 days pre-FirstDate
For example: ID 8723458, Dx on FirstDate 10/09/2018 is PA
_row having OtherPA and VisitDate 25/08/2018 would be kept
_row having OtherPA and VisitDate 15/05/2018 would be removed - Rows having OtherOM, if Dx on FirstDate is OM AND VisitDate is within 30 days pre-FirstDate
My expected result
VisitDate FirstDate ID Dx
0 2018-04-16 NaT 7410013407 OtherPA
1 2018-05-15 2018-05-15 7410013407 PA
2 2018-05-15 NaT 7410013407 OtherPA
3 2019-11-05 2019-11-05 7410022611 OM
4 2018-08-25 NaT 8723458 OtherPA
5 2018-09-10 2018-09-10 8723458 PA
6 2018-09-14 2018-09-10 8723458 PA
7 2018-12-07 2018-09-10 8723458 OM
8 2019-05-01 2018-09-10 8723458 OM
9 2018-08-20 NaT 8723458 OtherOM
Any suggestions would be greatly appreciated!!
CodePudding user response:
Try as follows:
cond1 = df['VisitDate'].sub(df.groupby('ID')['FirstDate']\
.transform('first')).dt.days.between(-30,0)
cond2 = df['Dx'].isin(['OM','PA'])
output = df[cond1 | cond2].reset_index(drop=True)
print(output)
VisitDate FirstDate ID Dx
0 2018-04-16 NaT 7410013407 OtherPA
1 2018-05-15 2018-05-15 7410013407 PA
2 2018-05-15 NaT 7410013407 OtherPA
3 2019-11-02 NaT 7410022611 OtherPA
4 2019-11-05 2019-11-05 7410022611 OM
5 2018-08-25 NaT 8723458 OtherPA
6 2018-09-10 2018-09-10 8723458 PA
7 2018-09-14 2018-09-10 8723458 PA
8 2018-12-07 2018-09-10 8723458 OM
9 2019-05-01 2018-09-10 8723458 OM
10 2018-08-20 NaT 8723458 OtherOM
CodePudding user response:
Because need distinguisg froup by PA
and OM
is extracted last 2 character in column Dx
and get first non missing value in FirstDate
column which is subtracted by VisitDate
and filter values between 0, 30
days, last chain condition for all rows with OM,PA
values in boolean indexing
:
m1 = (df.groupby(['ID', df['Dx'].str[-2:]])['FirstDate']
.transform('first').sub(df['VisitDate']).dt.days.between(0,30))
m2 = df['Dx'].isin(['PA','OM'])
df = df[m1 | m2]
print (df)
VisitDate FirstDate ID Dx
0 2018-04-16 NaT 7410013407 OtherPA
1 2018-05-15 2018-05-15 7410013407 PA
2 2018-05-15 NaT 7410013407 OtherPA
5 2019-11-05 2019-11-05 7410022611 OM
7 2018-08-25 NaT 8723458 OtherPA
8 2018-09-10 2018-09-10 8723458 PA
9 2018-09-14 2018-09-10 8723458 PA
10 2018-12-07 2018-09-10 8723458 OM
11 2019-05-01 2018-09-10 8723458 OM
13 2018-08-20 NaT 8723458 OtherOM
Details:
print (df['Dx'].str[-2:])
0 PA
1 PA
2 PA
3 PA
4 PA
5 OM
6 PA
7 PA
8 PA
9 PA
10 OM
11 OM
12 OM
13 OM
Name: Dx, dtype: object