How to check if multiple words are in a string of a dataframe value based on 2 lists, then assign a-CodePudding

I have a dataframe with a column of products that are extremely inconsistent with extra words than are necessary. I would like to check each cell and see if there are words present based on two lists that contain the necessary keywords. I would like to check first if any of the words from the first list are present, if true, I would like to check if any of the words from the second list are present. If both end up being true, I would like to assign a value for that row in column product_clean. If false, the value remains nan.

list1: ['fruit', 'FRUIT', 'Fruit', 'FRUit']
list2: ['banana', 'strawberry', 'cherry']

df:

     product                    product_clean
fruit 10% banana SPECIAL           nan
FRUit strawberry 99OFF             nan
milk                               nan
jam                                nan
cherry FRUIT Virginia              nan



df_DESIRED:

     product                    product_clean
fruit 10% banana SPECIAL           Fruit
FRUit strawberry 99OFF             Fruit
milk                               nan
jam                                nan
cherry FRUIT Virginia              Fruit

CodePudding user response：

I think simpliest is check all values by both conditions and chain them by & for bitwise AND:

m = df['product'].str.contains("|".join(list1)) & 
    df['product'].str.contains("|".join(list2))

 df['product_clean'] = np.where(m, 'Fruit', np.nan)

Or:

df.loc[m, 'product_clean'] = 'Fruit'

print (df)
                    product product_clean
0  fruit 10% banana SPECIAL         Fruit
1    FRUit strawberry 99OFF         Fruit
2                      milk           nan
3                       jam           nan
4     cherry FRUIT Virginia         Fruit

If need first test by first condition aand then by second one it is possible by:

m1 = df['product'].str.contains("|".join(list1)) 
m = df.loc[m1, 'product'].str.contains("|".join(list2)).reindex(df.index, fill_value=False)

df['product_clean'] = np.where(m, 'Fruit', np.nan)