Check if any column in a subset of columns contains any string in a list of strings pandas row-wise?

Time:09-29

I am looking for a way to check whether any column in a subset of DataFrame columns contains any string from a list of strings. Is there a better way to do this than using apply?

import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

def check_data(row, cols, strings):
    # With axis=1, apply passes one row (a Series) at a time.
    for j in cols:
        if row[j] in strings:
            return 1
    return 0

df['answer'] = df.apply(check_data, cols=['col1'], strings=['dog', 'cat'], axis=1)

This gives the desired output, but is there a better, more Pythonic way to do this without applying the function to each row of the data? Thanks!

CodePudding user response:

I would suggest the following solution.

  • Use [] notation with a list of column names to select a subset of a row's values.

  • Use set operations to check whether two sets share at least one value (not isdisjoint()).

  • Use a lambda to keep everything neatly on one line.

    df['answer'] = df.apply(lambda row: not {'dog', 'cat'}.isdisjoint(row[['col1']].values), axis=1)
    

CodePudding user response:

I had to add a few helper columns to avoid using the apply function.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})
df['col3'] = df.values.tolist()  # each row's values as a list (all columns; use df[cols] for a subset)
df['strings'] = [['dog', 'cat'] for _ in df.index]  # the list of strings, repeated for every row
df['common'] = [list(set(a).intersection(b)) for a, b in zip(df['col3'], df['strings'])]  # shared elements
df['answer'] = np.where(df['common'].str.len() > 0, 1, 0)  # 1 if any string matched, else 0
df.drop(['col3', 'strings', 'common'], axis=1, inplace=True)  # drop the helper columns

I guess this code can be cleaned up further.
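
One possible cleanup, as a minimal sketch: the same set-based check can be done in a single list comprehension, without creating and dropping helper columns. The names cols and wanted are placeholders introduced here for illustration, and the example assumes exact string matches, as in the question.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

cols = ['col1', 'col2']   # hypothetical subset of columns to check
wanted = {'dog', 'cat'}   # strings to look for, stored as a set

# Iterate over the selected columns row by row; a row matches when it
# shares at least one value with the set of wanted strings.
df['answer'] = [int(not wanted.isdisjoint(row)) for row in df[cols].to_numpy()]
print(df)
```

This still loops over rows in Python, so it is tidier than the helper-column version rather than fully vectorized.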

CodePudding user response:

Your question mentions a list of columns, but the expected result covers only one column.

Would you want a separate answer column for each column when evaluating multiple columns?

If you only need to check one column, here is one way to do it without apply:

strings = ['dog', 'cat']
df['answer'] = df['col1'].isin(strings).astype(int)
df
    col1    col2    answer
0   cat     car     1
1   dog     bike    1
2   mouse           0
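
If a single answer column should cover several columns at once, one way to extend the isin idea is to call it on the column subset and reduce across each row with any(axis=1). This is a sketch under assumptions: the lists cols and strings below are chosen for illustration (so that a match in col2 also counts), and matching is exact rather than substring-based.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

cols = ['col1', 'col2']     # assumed subset of columns to check
strings = ['cat', 'bike']   # assumed list of strings to look for

# DataFrame.isin returns a boolean frame of the same shape;
# any(axis=1) is True for a row when at least one selected column matched.
df['answer'] = df[cols].isin(strings).any(axis=1).astype(int)
print(df)
```

Both isin and any operate on whole columns, so no per-row apply is needed.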