Home > Enterprise >  Filter if either variable contains a character
Filter if either variable contains a character

Time:10-23

I have a dataset as follows with 4 variables:

ID    DX1    DX2            DX3
1   Cold     Covid         Sickle
2   Rash     Burn
3   Resp     Sore throat

I want to write a code that will only include certain patients with certain diagnoses in a dataframe. For example, I only want patients who have either of the following diagnoses (cold and/or resp) as their DX in either DX1, DX2, or DX3. E.g. the output would be the same as above but without patient 2.

I know I can do a contains statement below but how can I change my code so the search variables are DX1 DX2 and DX3 instead of just DX1. I also know I can concatenate each of the DX variables into one variable but am curious if there is a cleaner way to code this.

DF_v1 = DF[DF.DX1.str.contains("cold|resp", na=False, case=False)]

CodePudding user response:

filter to select the 'DX' columns then because the .str operations act on Series apply that over each Series separately. Then check if any of the columns satisfy you condition for that row to form the Boolean mask that you use to slice the original DataFrame.

m = (df.filter(like='DX')
       .apply(lambda col: col.str.contains("cold|resp", na=False, case=False))
       .any(axis=1))

df[m]
#   ID   DX1          DX2     DX3
#0   1  Cold        Covid  Sickle
#2   3  Resp  Sore throat        
  • Related