I have a dataframe with a column of products that are extremely inconsistent with extra words than are necessary. I would like to check each cell and see if there are words present based on two lists that contain the necessary keywords. I would like to check first if any of the words from the first list are present, if true, I would like to check if any of the words from the second list are present. If both end up being true, I would like to assign a value for that row in column product_clean
. If false, the value remains nan
.
list1: ['fruit', 'FRUIT', 'Fruit', 'FRUit']
list2: ['banana', 'strawberry', 'cherry']
df:
product product_clean
fruit 10% banana SPECIAL nan
FRUit strawberry 99OFF nan
milk nan
jam nan
cherry FRUIT Virginia nan
df_DESIRED:
product product_clean
fruit 10% banana SPECIAL Fruit
FRUit strawberry 99OFF Fruit
milk nan
jam nan
cherry FRUIT Virginia Fruit
CodePudding user response:
I think simpliest is check all values by both conditions and chain them by &
for bitwise AND
:
m = df['product'].str.contains("|".join(list1)) &
df['product'].str.contains("|".join(list2))
df['product_clean'] = np.where(m, 'Fruit', np.nan)
Or:
df.loc[m, 'product_clean'] = 'Fruit'
print (df)
product product_clean
0 fruit 10% banana SPECIAL Fruit
1 FRUit strawberry 99OFF Fruit
2 milk nan
3 jam nan
4 cherry FRUIT Virginia Fruit
If need first test by first condition aand then by second one it is possible by:
m1 = df['product'].str.contains("|".join(list1))
m = df.loc[m1, 'product'].str.contains("|".join(list2)).reindex(df.index, fill_value=False)
df['product_clean'] = np.where(m, 'Fruit', np.nan)