I have a dataset as follows with 4 variables:
ID DX1 DX2 DX3
1 Cold Covid Sickle
2 Rash Burn
3 Resp Sore throat
I want to write a code that will only include certain patients with certain diagnoses in a dataframe. For example, I only want patients who have either of the following diagnoses (cold and/or resp) as their DX in either DX1, DX2, or DX3. E.g. the output would be the same as above but without patient 2.
I know I can do a contains statement below but how can I change my code so the search variables are DX1 DX2 and DX3 instead of just DX1. I also know I can concatenate each of the DX variables into one variable but am curious if there is a cleaner way to code this.
DF_v1 = DF[DF.DX1.str.contains("cold|resp", na=False, case=False)]
CodePudding user response:
filter
to select the 'DX' columns then because the .str
operations act on Series apply that over each Series separately. Then check if any
of the columns satisfy you condition for that row to form the Boolean mask that you use to slice the original DataFrame.
m = (df.filter(like='DX')
.apply(lambda col: col.str.contains("cold|resp", na=False, case=False))
.any(axis=1))
df[m]
# ID DX1 DX2 DX3
#0 1 Cold Covid Sickle
#2 3 Resp Sore throat