I am looking for a way to check whether any column in a subset of DataFrame columns contains any string from a list of strings. Is there a better way to do this than using apply?
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})
def check_data(row, cols, strings):
    # apply(..., axis=1) passes each row as a Series, so index by column name
    for j in cols:
        if row[j] in strings:
            return 1
    return 0
df['answer'] = df.apply(check_data, cols=['col1'], strings=['dog', 'cat'], axis=1)
This gives the desired output, but is there a better, more Pythonic way to do this without applying the function to each row of the data? Thanks!
CodePudding user response:
I would suggest the following solution:
- use [] notation with a list of column names to access a subset of a row's values
- use set operations to check whether two sets share at least one value (not isdisjoint())
- use a lambda to keep everything neatly in one line
df['answer'] = df.apply(lambda row: not {'dog', 'cat'}.isdisjoint(row[['col1']].values), axis=1)
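The same pattern extends to several columns by listing them all. A minimal sketch, assuming you want to search both col1 and col2 and want 1/0 rather than True/False:

cols = ['col1', 'col2']   # columns to search (assumed)
strings = {'dog', 'cat'}  # values to look for

# not isdisjoint() is True as soon as the row shares one value with strings
df['answer'] = df.apply(lambda row: int(not strings.isdisjoint(row[cols])), axis=1)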
CodePudding user response:
I had to add a few columns so as to not use the apply function.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})
df['col3'] = df.values.tolist()  # new column holding each row's values as a list
df['strings'] = [['dog', 'cat'] for i in df.index]  # new column repeating the list of strings
df['common'] = [list(set(a).intersection(set(b))) for a, b in zip(df['col3'], df['strings'])]  # common elements per row
df['answer'] = np.where(df['common'].str.len() > 0, 1, 0)
df.drop(['col3', 'strings', 'common'], axis=1, inplace=True)  # drop the helper columns
I guess this code can be cleaned up further.
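For reference, the same row-wise intersection fits in one line without the helper columns. A minimal sketch of one possible cleanup (it still checks every column, as the code above does):

strings = {'dog', 'cat'}

# set(row) & strings is non-empty exactly when the row shares a value with strings
df['answer'] = [int(bool(set(row) & strings)) for row in df[['col1', 'col2']].values.tolist()]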
CodePudding user response:
Your question mentions a list of columns, but the expected result covers only one column. Would you have a separate answer column for each column when evaluating multiple columns?
In case you only need to check one column, here is one way to do it without apply:
strings = ['dog', 'cat']
df['answer'] = df['col1'].isin(strings).astype(int)
df
    col1  col2  answer
0    cat   car       1
1    dog  bike       1
2  mouse             0
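If you do need to check several columns at once, the same idea extends with any(axis=1). A sketch, assuming both col1 and col2 should be searched:

# isin() tests each cell; any(axis=1) flags rows with at least one match
df['answer'] = df[['col1', 'col2']].isin(strings).any(axis=1).astype(int)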